Abstract
Large datasets that enable researchers to perform investigations with unprecedented rigor are growing increasingly common in neuroimaging. Because open science has grown in popularity at the same time, these state-of-the-art datasets are more accessible than ever to researchers around the world. While analysis of these samples has pushed the field forward, the datasets pose a new set of challenges that can be difficult for novice users. Here we offer practical tips for working with large datasets from the end-user’s perspective. We cover all aspects of the data lifecycle: from what to consider when downloading and storing the data, to how to become acquainted with a dataset one did not collect, to what to share when communicating results. This manuscript serves as a practical guide one can use when working with large neuroimaging datasets, thus helping to dissolve barriers to scientific discovery.
Acknowledgements
The authors acknowledge funding from the following NIH grants: C.H. and A.S.G., T32GM007205; S.N., K00MH122372; K.L., R01MH111424 and P50MH115716; D.S.B., T32 MH019961 and R25 MH071584; and D.S., R24 MH114805. The funders had no role in the conception or writing of this manuscript.
Author information
Contributions
C.H. wrote the first draft of the manuscript. C.H., S.N., A.S.G., K.L., D.S.B., S.G., D.O’C., M.S., J.D., X.S., E.M.R.L., R.T.C. and D.S. contributed to the conceptualization, writing and editing of the manuscript. C.H., S.N., A.S.G., K.L., D.S.B., S.G., D.O’C., M.S., J.D., X.S., E.M.R.L., R.T.C. and D.S. read and approved the final draft.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Primary Handling Editor: Marike Schiffer
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Table 1.
About this article
Cite this article
Horien, C., Noble, S., Greene, A.S. et al. A hitchhiker’s guide to working with large, open-source neuroimaging datasets. Nat Hum Behav 5, 185–193 (2021). https://doi.org/10.1038/s41562-020-01005-4
DOI: https://doi.org/10.1038/s41562-020-01005-4