Reproducible Execution of Data Collection/Processing
Yaroslav O. Halchenko
@yarikoptic
Center for Open Neuroscience
Department of Psychological and Brain Sciences
Center for Cognitive Neuroscience
Dartmouth College
Live slides:
http://datasets.datalad.org/repronim/artwork/talks/webinar-2020-reprocomp/
Sources:
https://github.com/ReproNim/webinar-2020-reprocomp
Acknowledgements
- Reproducible Execution ⊃ Re-Execution
  - need *Exact Same* or *Nominally Similar* Data
  - need *Exact Same* or *Nominally Similar* Analysis
- Reproducible Execution is useful to the *Original* researchers
  - rarely is a research project "linear"

----

# Ultimate Goal/Approach

Reproducibility should become merely a *feature*
(if not a side-effect) of the *results*
AKA "Reproducible by Design"
# But HOW?

----

## HOWTO: Guiding principles

- Be greedy - get as much as possible (even if you think you don't need it ATM)
- Be lazy - manually do as little as necessary

----

## HOWTO: Guiding principles

- Be ~~greedy~~ thorough
  - get as much as possible (even if you think you don't need it)
  - know what you are going to do and what you have done
- Be ~~lazy~~ efficient
  - manually do as little as necessary
  - achieve more than originally planned

----

## One more: Pareto Principle



more: https://en.wikipedia.org/wiki/Pareto_principle
# What about a Recipe?

I was told that my Lasagna would be as great if I just followed the recipe... liars

----

## Recipe of a Study
- **Ingredients**:
  - humans (100% of *effort*)
  - computers (0% of *effort*)
  - language(s) (100% of the *result*)
    - English/... (human -to- human)
    - programming/scripting languages (human -to- computer)
    - standards (human/computer -to- computer/human)

note:
- automate: use language(s) to make computers (not humans) do boring stuff
- human-to-human languages aren't good for automation

----

# Languages

----

## Quick takeaways

- Human languages ... suck (unless you are into Poetry or Literature)
  - https://www.cognitiveatlas.org by Russ Poldrack (CP5*)
  - https://neuinfo.org/interlex by TR&D1 group
  - ... but are the ones for README
- Methods & Results *should be* **low %effort** collation of automatically
  generated materials in both human- AND computer- readable forms
  - not unprecedented: FSL, fMRIPrep, ...
- The *best* Programming language is the one which **minimizes %effort**
- Corollary:
  %effort is minimized by using no Programming language

----

## ?

----

## Standards

*The problem with Programming:*

- It takes **huge %effort** to write/test/maintain a program
- A program typically creates a new "language" for interaction with it
  - it takes **%effort** to learn a new program
  - a program cannot learn to *talk* to another program,
    so it takes **%effort** to script "adapter" or "pipeline" communication between programs
    (thank you NiPype and TR&D2 for helping out!)

----

## Standards

*To **minimize %effort** for human/computer -to- computer/human interactions*:

- reuse and contribute to existing programs
- make programs operate on and produce data in standard form(s):
  - BIDS (Derivatives, Models, ...), NIDM, HED, ...
  - DICOM, NIfTI, JSON, YAML, ...
- make programs interface via standard(ized) interface(s):
  - BIDS-Apps, ABC apps, Boutiques, Flywheel Gears, ...
- contribute to development of standards to
  - become part of their 80% coverage
- **minimize the %effort to produce results in a standard form so
  they could be re-used with minimal %effort to produce new results**

----

# Main takeaway:

## Embrace standards

----

## Recipe of a Study
- **Ingredients**:
  - humans (100% of *effort*)
  - computers (0% of *effort*)
  - language(s) (100% of the *result*)
    - English/... (human -to- human)
    - programming/scripting languages (human -to- computer)
    - standards (human/computer -to- computer/human)
- **Steps**:
  - **humans**: plan the study ahead
  - **humans and computers**: do data collection/processing

note:
- automate: use language(s) to make computers (not humans) do boring stuff
- human-to-human languages aren't good for automation

----

### Currently Dominant Recipe Effort Proportions

- **Steps**:
  - **humans**: plan the study ahead (<20% effort)
  - **(a good number of) humans and (some) computers**:
    do data collection/processing (>80% effort)

----

### Target Recipe Effort Proportions

**Prove Pareto to be *wrong***
**and that we can avoid wasting our effort**

- **Steps**:
  - **humans**: plan the study ahead (>80% effort)
  - **(some) humans and (many) computers**:
    automated data collection/processing (<20% effort)

----

### Recipe Steps&Ingredients for Planning Ahead (>80% effort)



----

### Humans: Plan Ahead (>80% effort)

- Plan to be ~~greedy~~ thorough
  - plan for **all** [5 ReproNim steps](http://5steps.repronim.org) (including *do-ing* analyses etc.)
  - prepare to be (ab)used ([Halchenko & Hanke, 2015](http://dx.doi.org/10.1186/s13742-015-0072-7))
- Be ~~lazy~~ efficient and (re)use work of others
  - choose what to (ab)use and possibly contribute to:
    - language(s), analysis&execution platforms, ...
  - **choose an RDM (Research Data Management) platform/approach**
    - decide how to *log* what you will have done
  - **aim to collect rich(er) datasets**
- Pre-register
  - treat it as a checklist (now) and a "regression-test" (later)
- Prepare/train humans to "talk" to computers
  - [ReproNim Training](https://www.repronim.org/teach.html)
  - Listen to and/or participate in [BrainHacks](https://brainhack.org/tutorials.html)
  - [DataLad Handbook](http://handbook.datalad.org)

notes:
- nothing is "final" until ... virtually never
# Plan Ahead: Choose an RDM

----



----

### Plan Ahead: YODA



https://github.com/myyoda/poster/
by Michael Hanke (CP7, DataLad) et al.

----

### Plan Ahead: YODA's Layout



----

### Plan Ahead: YODA's Hierarchy



https://github.com/ReproNim/containers/

----

### Example: YODA's DataLad Reproducible Paper



http://handbook.datalad.org/en/latest/usecases/reproducible-paper.html
by ReproNim YODA master Adina Wagner, Michael Hanke, et al.

----

### Fact: ~~No~~Everybody should care about YODA



https://fmriprep.org by Russ Poldrack (CP5*, OpenNeuro), et al.

----

### Plan Ahead: More on YODA via DataLad



http://handbook.datalad.org/en/latest/basics/basics-yoda.html

----

### Plan Ahead: DataLad or not but ...

- become friends with YODA and its principles
- choose an RDM platform/approach which
  - tells you what you have done to obtain result X
  - knows where you have or can get data Y (of exact version Z)
  - **minimizes %effort** to re-run desired steps as-is or modified
  - works well with your *analytics* platforms
  - extra: allows you to search for data, results, etc
    - to find other *nominally similar*
# Plan Ahead:

## aim to collect rich(er) datasets

----

## Plan Ahead: automate collection of any relevant (meta)data in standard form

Re *automate*:
- manual data entry/wrangling = hard to trace/fix data bugs
- we must be efficient:
  - facilitate human/computer -to- computer/human interactions
  - we are not unique: (ab)use existing solutions
  - seek longer-term, low %effort solutions

Re *any relevant*:
- of course there is a trade-off
- prepare to be (ab)used
  - ~~others~~ you could find (meta)data relevant to your study missing
- more explained variance = higher power
  - new explanations of "noise" regularly emerge
- have data ≠ have to analyze all data

Re *standard form*:
- without it - **high %effort** for a human/computer to understand it

----

## Plan Ahead: Prior Webinars on "Languages" and tools to master them

- *Harmonizing clinical and behavioral data collection through ReproSchema* by Satra Ghosh
- *Tools and Techniques for BIDS Semantic Annotation and Query Across Datasets with NIDM* by David Keator
- *Damn it Jim, I am a researcher not an ontologist: Exploring and using terminologies* by Jeff Grethe
- ... and more at https://www.repronim.org/webinar-series.html
## Plan Ahead: Collect DICOMs (not NIfTI, PAR/REC, ...)

- DICOMs contain lots of relevant metadata
  - most of the contained metadata is not relevant to your study
- Conversion from DICOMs to BIDS could be automated
- You should be able to
  - extract additional metadata should you need it
  - reconvert if the conversion tool was buggy

----

### Plan Ahead: HeuDiConv/ReproIn

- **HeuDiConv** (https://github.com/nipy/heudiconv)
  - A flexible scriptable (Python) framework for conversion from DICOMs into an arbitrary layout
  - Uses [dcm2niix](https://github.com/rordenlab/dcm2niix/) by ReproNim Guru Chris Rorden for basic DICOM -> NIfTI conversion
  - BIDS-aware and comes with a collection of conversion heuristics
- **ReproIn** (https://github.com/repronim/reproin)
  - A convention for organizing and naming sequences on the scanner console
    - BIDS-like - **very low %effort to "adopt"**
  - HeuDiConv heuristic to convert from such convention to BIDS

----

### Plan Ahead: use ReproIn Convention



----

### Plan Ahead: ReproIn or not but automate conversion to BIDS!

- *Taking Control of your DICOM Data: ReproIn/Heudiconv Tools* webinar
  https://www.repronim.org/webinar-series.html
- WiP: reproin helper to streamline handling of multiple studies
- If not ReproIn, consider
  - https://github.com/psychoinformatics-de/datalad-hirni (Hanke, CP7)
  - https://github.com/brainlife/ezbids (Pestilli, Brainlife, CP6*)
  - https://bids.neuroimaging.io/benefits.html#converters ...
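note: a HeuDiConv heuristic is a plain Python module. Here is a minimal sketch of one, assuming the standard `create_key`/`infotodict` interface; the matching rules and BIDS templates below are illustrative only, not the actual ReproIn heuristic:

```python
# Sketch of a HeuDiConv heuristic module (illustrative, not ReproIn).

def create_key(template, outtype=('nii.gz',), annotation_classes=None):
    """Pair a BIDS path template with desired output type(s)."""
    if not template:
        raise ValueError('Template must be a valid format string')
    return template, outtype, annotation_classes


# BIDS path templates; {subject} and {item} are filled in by HeuDiConv
t1w = create_key('sub-{subject}/anat/sub-{subject}_run-{item:02d}_T1w')
rest = create_key('sub-{subject}/func/sub-{subject}_task-rest_run-{item:02d}_bold')


def infotodict(seqinfo):
    """Map each DICOM series (a seqinfo entry) to a BIDS key.

    The protocol-name matching here is made up for illustration.
    """
    info = {t1w: [], rest: []}
    for s in seqinfo:
        if 'MPRAGE' in s.protocol_name:
            info[t1w].append(s.series_id)
        elif 'rest' in s.protocol_name.lower():
            info[rest].append(s.series_id)
    return info
```

HeuDiConv calls `infotodict` with the series discovered in the DICOMs and converts each matched series into the corresponding BIDS path.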
## Plan Ahead: Get your ~~greedy~~ thorough hands on Phantom QA Data

### Surprise: Phantom QA data can explain (some) variance in your data!

(Operation code-name: ReproPhantom (?))

----

### FYI: Study "Nuisance"



Cheng CP and Halchenko YO. A new virtue of phantom MRI data: explaining variance in human participant data (v1; under peer review). F1000Research 2020, 9:1131 https://doi.org/10.12688/f1000research.24544.1

Full slide stack: http://datasets.datalad.org/centerforopenneuroscience/nuisance/presentations/2020-NNL/

GitHub: https://github.com/proj-nuisance/nuisance

----

### Planned Ahead: (somewhat)

- Phantom Data: DBIC QA ([///dbic/QA](http://datasets.datalad.org/?dir=/dbic/QA))
- Human Data: 206 participants from studies of 3 PIs at DBIC
- DICOM-to-BIDS: [HeuDiConv](https://github.com/nipy/heudiconv/) with [ReproIn](https://github.com/repronim/reproin/) heuristic
- Base OS: [Debian GNU/Linux](http://debian.org) + [NeuroDebian](http://neuro.debian.net)
- QA: [MRIQC](https://github.com/poldracklab/mriqc) (BIDS-App)
- Morphometrics: ReproNim's ["Simple Workflow"](https://github.com/ReproNim/simple_workflow)
  - [FSL](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki): BET, FAST, FIRST
  - code/data/details: [10.12688/f1000research.10783.2](http://dx.doi.org/10.12688/f1000research.10783.2)
- Data wrangling and analyses:
  - [Python](http://python.org/), [pandas](https://pandas.pydata.org/), [statsmodels](https://www.statsmodels.org/stable/index.html), [Jupyter notebooks](https://jupyter.org/)
- Containerization: [Singularity](https://singularity.lbl.gov)
- Version control/distribution:
  - [DataLad](http://datalad.org), [datalad-container](http://handbook.datalad.org/en/latest/basics/101-133-containersrun.html?), [///ReproNim/containers](https://github.com/ReproNim/containers)
- Organization: follows [YODA principles](https://github.com/myyoda/poster/blob/master/ohbm2018.pdf)
- https://github.com/proj-nuisance/nuisance

----

### What we know: Phantoms are good for scanner QA



https://www.dartmouth.edu/dbic/research_infrastructure/qualityassurance.html

and the variance is largely "noise", right?



----

### Model: Phantom SNR "explained"



----

### Model: Phantom SNR (variables)



----

### Model: Gray brain matter



----

### Model: Gray brain matter (variables)



----

### Plan Ahead: Talk to your MR physicist/technician

- they better be doing QA
- do not discard phantom data
  - can come in handy
    - improve your power
    - possibly help to harmonize across sites
  - **ultra low %effort** to keep phantom data around
  - **∞ %effort** to recover when deleted
  - **small %effort** to make use of it
- concern: dates are "sensitive data"
## Plan Ahead: Physiological data

### Let's streamline acquisition of physiological data

(Operation code-name: ReproPhys)

----

### Others Planned it already: [phys2bids](https://github.com/physiopy/phys2bids/)



A nice overview: [OHBM 2020 poster](https://cdn-akamai.6connex.com/645/1827//phys2bids_OHBM_15922384856589877.pdf)

----

### Others Planned it already: [bidsphysio](https://github.com/cbinyu/bidsphysio)



HeuDiConv support PR: https://github.com/nipy/heudiconv/pull/446

----

### Plan Ahead: Consider collecting physiological data

- benefits are known
  - improve power of your studies
- tools for conversion/slicing are available
- relatively high **%effort** at the moment to setup/do
  - pales in comparison to other **%effort**s spent
## Plan Ahead: "Raw" audio/video stimuli

### Let's collect all video and audio stimuli as presented to the participants

(Operation code-name: ReproStim)

----

### But WHY/What For?

- QA (was there a jitter/dropped stimuli/randomization...)?
- make it possible to forward model **any** collected dataset
  - *resting state folks - see previous sections*
- explain low level signal features (/confounds?)
- post-hoc salience features analysis
- **100% reproduce experiment stimulation at ≈0% effort**

----

### Plowing Ahead: https://github.com/ReproNim/reprostim

- Goal: **0% effort** for "clients"
  - Minimal **%effort** to setup
  - Fully seamless and automated after that
- HOW:
  - [Video](https://www.amazon.com/gp/product/B00BLZDY6A/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&psc=1)/Audio splitters
  - Video/Audio grabber, e.g. [Magewell USB Capture DVI Plus](https://www.amazon.com/gp/product/B01MSDFAO5/)
  - lossless video codec
  - A new video file upon connect/change of resolution
    (e.g., `2020.11.24.12.57.08_2020.11.24.15.51.23.mkv`)
  - Synchronization:
    - NTP where possible (stimuli delivery computer, video grabber, ...)
    - video stream QR time-stamping and detection/decoding
  - Automated "slicing" into BIDS datasets (WiP)
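note: the start/end timestamps embedded in such file names can be recovered with a few lines of Python. A sketch, where `parse_reprostim_name` is a hypothetical helper and the format string simply mirrors the example file name above:

```python
# Parse '<start>_<end>.mkv' video file names into datetimes (sketch;
# the naming convention is inferred from the example on this slide).
from datetime import datetime

def parse_reprostim_name(name):
    """Split a '<start>_<end>.mkv' name into (start, end) datetimes."""
    stem = name.rsplit('.', 1)[0]            # drop the .mkv extension
    start_s, end_s = stem.split('_')
    fmt = '%Y.%m.%d.%H.%M.%S'
    return datetime.strptime(start_s, fmt), datetime.strptime(end_s, fmt)

start, end = parse_reprostim_name('2020.11.24.12.57.08_2020.11.24.15.51.23.mkv')
print(end - start)   # duration covered by this recording
```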
## Plan Ahead: Events description

### Automate collection of all events information in consistent machine-readable form

(Operation code-name: ReproEvents)

----

### Still Planning Ahead: Join us

- **There is no generic library/helper yet**
- Target: helper+converters for PsychoPy/PTB-3 to produce rich
  BIDS [`_events.tsv` + `_events.json`](https://bids-specification.readthedocs.io/en/stable/04-modality-specific-files/07-behavioral-experiments.html)
  with [Hierarchical Event Descriptors (HED)](https://github.com/hed-standard/hed-specification)
- But with **little %effort** you should already make your stimuli scripts/logs ready
  (see e.g., https://github.com/mvdoc/pitcher_localizer)
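note: until such a generic helper exists, even stdlib Python suffices to emit these files from a presentation log. A minimal sketch, with made-up events and illustrative HED strings; the column names follow the BIDS spec:

```python
# Write a minimal BIDS _events.tsv + _events.json pair (sketch;
# the event list and HED annotations are illustrative).
import csv
import json

events = [
    {'onset': 0.0, 'duration': 2.0, 'trial_type': 'face'},
    {'onset': 4.0, 'duration': 2.0, 'trial_type': 'house'},
]

# _events.tsv: one row per event, tab-separated, onset/duration first
with open('sub-01_task-demo_events.tsv', 'w', newline='') as f:
    writer = csv.DictWriter(
        f, fieldnames=['onset', 'duration', 'trial_type'], delimiter='\t')
    writer.writeheader()
    writer.writerows(events)

# _events.json sidecar: describes the columns; could also carry HED tags
sidecar = {
    'trial_type': {
        'Description': 'Category of the visual stimulus',
        'HED': {'face': 'Sensory-event, Visual-presentation',
                'house': 'Sensory-event, Visual-presentation'},
    }
}
with open('sub-01_task-demo_events.json', 'w') as f:
    json.dump(sidecar, f, indent=2)
```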
## Plan Ahead (the 80%): Summary

- Plan to be ~~greedy~~ thorough
  - plan for **all** [5 ReproNim steps](http://5steps.repronim.org) (including *do-ing* analyses etc.)
  - prepare to be (ab)used ([Halchenko & Hanke, 2015](http://dx.doi.org/10.1186/s13742-015-0072-7))
- Be ~~lazy~~ efficient and (re)use work of others
  - choose what to (ab)use and possibly contribute to:
    - language(s), analysis&execution platforms, ...
  - **choose an RDM (Research Data Management) platform/approach**
    - decide how to *log* what you will have done
  - **aim to collect rich(er) datasets**
- Pre-register
  - treat it as a checklist (now) and a "regression-test" (later)
- Prepare/train humans to "talk" to computers
  - [ReproNim Training](https://www.repronim.org/teach.html)
  - Listen to and/or participate in [BrainHacks](https://brainhack.org/tutorials.html)
  - [DataLad Handbook](http://handbook.datalad.org)
# Reproducible Execution of Data Processing (the 20%)

Becomes more feasible with increased automation,
good RDM, and good (human- AND machine-) readable *logging*.

----

## The ultimate *Do It* (the 20%)

- **(some) humans and (many) computers**:
  automated data collection/processing (<20% effort)
  - data collection
  - QA
  - pre-processing
  - analysis

----

## Random Example: [srndna-master](https://github.com/victoriaakelly/srndna-master)



----

## The ultimate *Do It* (the 20%)

- **(some) humans and (many) computers**:
  automated data collection/processing (<20% effort)
  - data collection
  - QA
  - pre-processing
  - analysis
  - publication composition
- also automate as much as possible by tuning prior stages
- keep in mind/plan for:
  - everything might need to be 'repeated'
  - pipelines are great, but could be tricky. Use pre-crafted:
    - MRIQC, fMRIPrep, C-PAC, ...
  - having a small number of modular steps allows for "ad-hoc" pipelining (minimal script etc)
  - containers can help to collaborate/scale/debug/...

----

### ReproYODA approach with [ReproNim/containers](https://github.com/ReproNim/containers/#a-typical-workflow)



----

### ReproYODA Log



----

### Ultimately - many computers



*Version control your data and computation using containers, DataLad and ReproMan, and reproducible they be!*
https://www.repronim.org/webinar-series.html
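note: the ReproYODA approach above boils down to a handful of DataLad commands. A sketch under assumptions: the dataset URLs, container name, and MRIQC arguments here are illustrative -- see the ReproNim/containers README for the authoritative recipe:

```shell
# Illustrative ReproYODA workflow sketch (names/URLs are examples).

# create an analysis dataset following the YODA procedure
datalad create -c yoda my-analysis
cd my-analysis

# register input data and the containers collection as subdatasets
datalad clone -d . https://github.com/ReproNim/containers code/containers
datalad clone -d . https://github.com/OpenNeuroDatasets/ds000003 sourcedata

# run a containerized pipeline; DataLad records the inputs, outputs,
# container version, and exact command in the dataset history
datalad containers-run \
    -n code/containers/bids-mriqc \
    --input sourcedata --output derivatives \
    '{inputs}' '{outputs}' participant

# any recorded step can later be re-executed verbatim
datalad rerun
```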
# Thank you
for hanging till the End!

## Let the reproducibility be with you
Slides: https://github.com/ReproNim/webinar-2020-reprocomp