Public sharing of research datasets: A pilot study of associations
Introduction
Sharing and reusing primary research datasets has the potential to increase research efficiency and quality. Raw data can be used to explore related or new hypotheses, particularly when combined with other available datasets. Real data is indispensable for developing and validating study methods, analysis techniques, and software implementations. The larger scientific community also benefits: sharing data encourages multiple perspectives, helps to identify errors, discourages fraud, is useful for training new researchers, and increases efficient use of funding and population resources by avoiding duplicate data collection.
Eager to realize these benefits, funders, publishers, societies, and individual research groups have developed tools, resources, and policies to encourage investigators to make their data publicly available. For example, some journals require the submission of detailed biomedical datasets to publicly available databases as a condition of publication (McCain, 1995, Piwowar and Chapman, 2008). Many funders require data sharing plans as a condition of funding: since 2003, the National Institutes of Health (NIH) in the USA has required a data sharing plan for all large funding grants (NIH, 2003) and has more recently introduced stronger requirements for genome-wide association studies (NIH, 2007). Several government whitepapers (Cech, 2003, Fienberg et al., 1985) and high-profile editorials (Data's Shameful Neglect, 2009, Got Data?, 2007, Time for Leadership, 2007) call for responsible data sharing and reuse. Large-scale collaborative science is increasing the need to share datasets (Kakazu et al., 2004, The GAIN Collaborative Research Group, 2007), and many guidelines, tools, standards, and databases are being developed and maintained to facilitate data sharing and reuse (Schofield et al., 2009, Barrett et al., 2007, Brazma et al., 2001).
Despite these investments of time and money, we do not yet understand the impact of these initiatives. There is a well-known adage: you cannot manage what you do not measure. For those with a goal of promoting responsible data sharing, it would be helpful to evaluate the effectiveness of requirements, recommendations, and tools. When data sharing is voluntary, insights could be gained by learning which datasets are shared, on what topics, by whom, and in what locations. When policies make data sharing mandatory, monitoring is useful to understand compliance and unexpected consequences.
Dimensions of data sharing action and intention have been investigated by a variety of studies. Manual annotations and systematic data requests have been used to estimate the frequency of data sharing within biomedicine (Kyzas et al., 2005, Noor et al., 2006, Ochsner et al., 2008, Reidpath and Allotey, 2001), though few attempts were made to determine patterns of sharing and withholding within these samples. Blumenthal (2006), Campbell et al. (2002), Hedstrom (2006) and others have used survey results to correlate self-reported instances of data sharing and withholding with self-reported attributes like industry involvement, perceived competitiveness, career productivity, and anticipated data sharing costs. Others have used surveys and interviews to analyze opinions about the effectiveness of mandates (Ventura, 2005) and the value of various incentives (Giordano, 2007, Hedstrom, 2006, Hedstrom and Niu, 2008, Niu, 2006). A few inventories list the data sharing policies of funders (Lowrance, 2006, University of Nottingham, 2009) and journals (Brown, 2003, McCain, 1995), and some work has been done to correlate policy strength with outcome (McCullough et al., 2008, Piwowar and Chapman, 2008). Surveys and case studies have been used to develop models of information behavior in related domains, including knowledge sharing within an organization (Constant et al., 1994, Matzler et al., 2008), physician knowledge sharing in hospitals (Ryu, Ho, & Han, 2003), participation in open source projects (Bitzer, Schrettl, & Schröder, 2007), academic contributions to institutional archives (Kim, 2007, Seonghee and Boryung, 2008), the choice to publish in open access journals (Warlick & Vaughan, 2007), sharing social science datasets (Hedstrom, 2006), and participation in large-scale biomedical research collaborations (Lee et al., 2006).
Although these studies provide valuable insights and their methods facilitate investigation into an author's intentions and opinions, they have several limitations. First, factors associated with an investigator's intention to share data are not necessarily associated with actually sharing data (Kuo & Young, 2008). Second, associations that rely on self-reported data sharing and withholding likely suffer from underreporting and confounding, since people admit to withholding data much less frequently than they report having experienced data withholding by others (Blumenthal et al., 2006).
We suggest a supplemental approach for investigating research data sharing behavior. As part of an ongoing doctoral dissertation project, we are collecting and analyzing a large set of observed data sharing actions and associated policy, investigator, and environmental variables. In this report we provide preliminary findings on a small collection of studies and a few key questions: Are studies led by experienced and prolific primary investigators more likely to share their data than those led by junior investigators? Do funder and publisher requirements for data sharing increase the frequency with which data is shared? Are other funder and publisher characteristics associated with data sharing frequency?
We choose to study data sharing for one particular type of data: biological gene expression microarray intensity values. Microarray studies provide a useful environment for exploring data sharing policies and behaviors. Despite being a rich resource valuable for reuse (Rhodes et al., 2004), microarray data are often but not yet universally shared. Best-practice guidelines for sharing microarray data are fairly mature (Brazma et al., 2001, Hrynaszkiewicz and Altman, 2009). Two centralized databases have emerged as best-practice repositories: the Gene Expression Omnibus (GEO) (Barrett et al., 2007) and ArrayExpress (Parkinson et al., 2007). Finally, high-profile letters have called for strong journal data sharing policies (Ball et al., 2004), resulting in unusually strong data sharing requirements in some journals (Microarray Standards at Last, 2002).
Methods
We identified a set of studies in which the investigators had generated gene expression microarray datasets, and determined which of these studies had made their datasets publicly available on the Internet. We analyzed variables related to the investigators, journals, and funding of these studies to determine which attributes were associated with an increased frequency of data sharing.
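One common way to quantify an association between a study attribute and data sharing is an odds ratio computed from a 2x2 table of shared versus withheld datasets. The sketch below is illustrative only and is not the authors' analysis: the function name `odds_ratio`, the choice of a Wald interval on the log scale, and the example counts are all assumptions made for the example.

```python
import math


def odds_ratio(shared_a, withheld_a, shared_b, withheld_b, z=1.96):
    """Odds ratio of sharing for group A versus group B.

    Returns (odds_ratio, lower, upper), where the bounds form a Wald
    interval computed on the log-odds scale (z=1.96 gives roughly 95%).
    """
    cells = (shared_a, withheld_a, shared_b, withheld_b)
    if any(c <= 0 for c in cells):
        # The plain formula is undefined when a cell is empty.
        raise ValueError("all four cell counts must be positive")

    ratio = (shared_a * withheld_b) / (withheld_a * shared_b)
    # Standard error of log(odds ratio) for a 2x2 table.
    se = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper


# Hypothetical counts, NOT taken from the study: 30 of 40 studies in
# group A shared data, compared with 20 of 40 studies in group B.
example = odds_ratio(30, 10, 20, 20)
print("odds ratio and interval:", example)
```

An interval that excludes 1 suggests an association in the sample, although a single 2x2 table does not adjust for other attributes the way a multivariate model would.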
Results
We studied the data sharing patterns of 397 gene expression microarray studies published in 2007 within 20 journals, as identified in a systematic review by Ochsner et al. (2008). Almost half of the studies made their raw datasets available (47%).
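To give a sense of the sampling uncertainty around a proportion such as 47% of 397 studies, a confidence interval can be computed. The sketch below uses the Wilson score interval, which is our choice for illustration and not a method named in the text. The count of sharing studies is reconstructed from the rounded percentage, so it is an approximation.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximately 95% interval.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")

    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


n_studies = 397
# Assumption: the exact count is not given above, so it is estimated
# from the reported 47% and rounded to the nearest whole study.
n_shared = round(0.47 * n_studies)

low, high = wilson_interval(n_shared, n_studies)
print(f"approximate share of studies with public data: {low:.3f} to {high:.3f}")
```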
We found that 41 of the articles acknowledged NIH funding but did not reveal specific grant numbers; these studies appear to be randomly distributed throughout the sample, so we estimated their levels of NIH funding from other attributes, as described
Discussion
This study explored the association between policy variables, author experience, selected article attributes, and frequency of data sharing within 397 recent gene expression microarray studies. We found that data sharing was more prevalent for studies published in journals with a higher impact factor and by authors with more experience. Whether or not the study was funded by the NIH had little impact on data sharing rates. We estimate the NIH data sharing policy applied to only 19% of the
Conclusions
We believe our emphasis on observed variables facilitates measurement of important quantitative associations. Data sharing policies are controversial (Campbell, 1999, Cecil and Boruch, 1988, King, 1995), and thus deserve to be thoughtfully considered and evaluated. We hope the results from our analyses will contribute to a deeper understanding of information behavior for research data sharing, and eventually more effective data sharing initiatives so that the value of research related output
Acknowledgements
HAP was supported by NLM training grant 5T15-LM007059-19 and the Department of Biomedical Informatics at the University of Pittsburgh. WWC is funded through NLM grant 5R01-LM009427-0.
References (55)
- et al. Intrinsic motivation in open source software development. Journal of Comparative Economics (2007)
- Personality traits and knowledge sharing. Journal of Economic Psychology (2008)
- et al. Knowledge sharing behavior of physicians in hospitals. Expert Systems With Applications (2003)
- et al. An analysis of faculty perceptions: Attitudes toward knowledge sharing and collaboration in an academic institution. Library (2008)
- Anonymous. (2002). Microarray standards at last. Nature, 419(6905),...
- Anonymous. (2007). Time for leadership. Nature Biotechnology, 25(8),...
- Anonymous. (2007). Got data? Nature Neuroscience, 10(8),...
- Anonymous. (2009). Data's shameful neglect. Nature, 461(7261),...
- Submission of microarray data to public repositories. PLoS Biology (2004)
- NCBI GEO: Mining tens of millions of expression profiles—Database and tools update. Nucleic Acids Research (2007)
- Withholding research results in academic life science. Evidence from a national survey of faculty. Journal of the American Medical Association
- Data withholding in genetics and the other life sciences: prevalences and predictors. Academic Medicine
- Are there better indices for evaluation purposes than the h index? A comparison of nine different variants of the h index using data from biomedicine. Journal of the American Society for Information Science and Technology
- Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature Genetics
- The changing face of scientific discourse: Analysis of genomic and proteomic database usage and acceptance. Journal of the American Society for Information Science and Technology
- Data withholding in academic genetics: Evidence from a national survey. Journal of the American Medical Association
- Controversial Proposal on Public Access to Research Data Draws 10,000 Comments. The Chronicle of Higher Education
- Sharing publication-related data and materials: Responsibilities of authorship in the life sciences
- Compelled Disclosure of Research Data: An early warning and suggestions for psychologists. Law and Human Behavior
- What's mine is ours, or is it? A study of attitudes about information sharing. Information Systems Research
- Sharing Research Data
- Regression Modeling Strategies: With applications to linear models, logistic regression and survival analysis
- Producing archive-ready datasets: Compliance incentives, and motivation
- Research Forum Presentation: Incentives to Create “Archive-Ready” Data: Implications for Archives and Records Management
- An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences