Elsevier

Journal of Informetrics

Volume 4, Issue 2, April 2010, Pages 148-156

Public sharing of research datasets: A pilot study of associations

https://doi.org/10.1016/j.joi.2009.11.010

Abstract

The public sharing of primary research datasets potentially benefits the research community but is not yet common practice. In this pilot study, we analyzed whether data sharing frequency was associated with funder and publisher requirements, journal impact factor, or investigator experience and impact. Across 397 recent biomedical microarray studies, we found investigators were more likely to publicly share their raw dataset when their study was published in a high-impact journal and when the first or last authors had high levels of career experience and impact. We estimate the USA's National Institutes of Health (NIH) data sharing policy applied to 19% of the studies in our cohort; being subject to the NIH data sharing plan requirement was not found to correlate with increased data sharing behavior in multivariate logistic regression analysis. Studies published in journals that required a database submission accession number as a condition of publication were more likely to share their data, but this trend was not statistically significant. These early results will inform our ongoing larger analysis, and hopefully contribute to the development of more effective data sharing initiatives.

Introduction

Sharing and reusing primary research datasets has the potential to increase research efficiency and quality. Raw data can be used to explore related or new hypotheses, particularly when combined with other available datasets. Real data is indispensable for developing and validating study methods, analysis techniques, and software implementations. The larger scientific community also benefits: sharing data encourages multiple perspectives, helps to identify errors, discourages fraud, is useful for training new researchers, and increases efficient use of funding and population resources by avoiding duplicate data collection.

Eager to realize these benefits, funders, publishers, societies, and individual research groups have developed tools, resources, and policies to encourage investigators to make their data publicly available. For example, some journals require the submission of detailed biomedical datasets to publicly available databases as a condition of publication (McCain, 1995, Piwowar and Chapman, 2008). Many funders require data sharing plans as a condition of funding: since 2003, the National Institutes of Health (NIH) in the USA has required a data sharing plan for all large funding grants (NIH, 2003) and has more recently introduced stronger requirements for genome-wide association studies (NIH, 2007). Several government whitepapers (Cech, 2003, Fienberg et al., 1985) and high-profile editorials (Data's Shameful Neglect, 2009, Got Data?, 2007, Time for Leadership, 2007) call for responsible data sharing and reuse. Large-scale collaborative science is increasing the need to share datasets (Kakazu et al., 2004, The GAIN Collaborative Research Group, 2007), and many guidelines, tools, standards, and databases are being developed and maintained to facilitate data sharing and reuse (Schofield et al., 2009, Barrett et al., 2007, Brazma et al., 2001).

Despite these investments of time and money, we do not yet understand the impact of these initiatives. There is a well-known adage: you cannot manage what you do not measure. For those with a goal of promoting responsible data sharing, it would be helpful to evaluate the effectiveness of requirements, recommendations, and tools. When data sharing is voluntary, insights could be gained by learning which datasets are shared, on what topics, by whom, and in what locations. When policies make data sharing mandatory, monitoring is useful to understand compliance and unexpected consequences.

Dimensions of data sharing action and intention have been investigated by a variety of studies. Manual annotations and systematic data requests have been used to estimate the frequency of data sharing within biomedicine (Kyzas et al., 2005, Noor et al., 2006, Ochsner et al., 2008, Reidpath and Allotey, 2001), though few attempts were made to determine patterns of sharing and withholding within these samples. Blumenthal (2006), Campbell et al. (2002), Hedstrom (2006) and others have used survey results to correlate self-reported instances of data sharing and withholding with self-reported attributes like industry involvement, perceived competitiveness, career productivity, and anticipated data sharing costs. Others have used surveys and interviews to analyze opinions about the effectiveness of mandates (Ventura, 2005) and the value of various incentives (Giordano, 2007, Hedstrom, 2006, Hedstrom and Niu, 2008, Niu, 2006). A few inventories list the data sharing policies of funders (Lowrance, 2006, University of Nottingham, 2009) and journals (Brown, 2003, McCain, 1995), and some work has been done to correlate policy strength with outcome (McCullough et al., 2008, Piwowar and Chapman, 2008). Surveys and case studies have been used to develop models of information behavior in related domains, including knowledge sharing within an organization (Constant et al., 1994, Matzler et al., 2008), physician knowledge sharing in hospitals (Ryu, Ho, & Han, 2003), participation in open source projects (Bitzer, Schrettl, & Schröder, 2007), academic contributions to institutional archives (Kim, 2007, Seonghee and Boryung, 2008), the choice to publish in open access journals (Warlick & Vaughan, 2007), sharing social science datasets (Hedstrom, 2006), and participation in large-scale biomedical research collaborations (Lee et al., 2006).

Although these studies provide valuable insights and their methods facilitate investigation into an author's intentions and opinions, they have several limitations. First, associations with an investigator's intention to share data do not directly translate to associations with actually sharing data (Kuo & Young, 2008). Second, associations that rely on self-reported data sharing and withholding likely suffer from underreporting and confounding, since people admit to withholding data much less frequently than they report having experienced the data withholding of others (Blumenthal et al., 2006).

We suggest a supplemental approach for investigating research data sharing behavior. As part of an ongoing doctoral dissertation project, we are collecting and analyzing a large set of observed data sharing actions and associated policy, investigator, and environmental variables. In this report we provide preliminary findings on a small collection of studies and a few key questions: Are studies led by experienced and prolific primary investigators more likely to share their data than those led by junior investigators? Do funder and publisher requirements for data sharing increase the frequency with which data is shared? Are other funder and publisher characteristics associated with data sharing frequency?

We chose to study data sharing for one particular type of data: biological gene expression microarray intensity values. Microarray studies provide a useful environment for exploring data sharing policies and behaviors. Despite being a rich resource valuable for reuse (Rhodes et al., 2004), microarray data are often but not yet universally shared. Best-practice guidelines for sharing microarray data are fairly mature (Brazma et al., 2001, Hrynaszkiewicz and Altman, 2009). Two centralized databases have emerged as best-practice repositories: the Gene Expression Omnibus (GEO) (Barrett et al., 2007) and ArrayExpress (Parkinson et al., 2007). Finally, high-profile letters have called for strong journal data sharing policies (Ball et al., 2004), resulting in unusually strong data sharing requirements in some journals (Microarray Standards at Last, 2002).

Methods

We identified a set of studies in which the investigators had generated gene expression microarray datasets, and determined which of these studies had made their datasets publicly available on the Internet. We analyzed variables related to the investigators, journals, and funding of these studies to determine which attributes were associated with an increased frequency of data sharing.
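
As a rough illustration of this kind of association analysis, the sketch below fits a logistic regression of a binary "shared data" outcome on two binary study attributes. It is not the authors' code or data: the counts are hypothetical toy values, the predictor names (`high_impact`, `nih_policy`) are assumptions chosen only to echo the variables named in the abstract, and the plain gradient-ascent fitter stands in for whatever statistical software was actually used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(rows, labels, lr=0.5, epochs=5000):
    """Fit an intercept plus one weight per feature by batch gradient
    ascent on the mean log-likelihood. Returns [intercept, w1, w2, ...]."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)  # weights[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            linear = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
            err = y - sigmoid(linear)  # gradient contribution of this study
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w + lr * g / n for w, g in zip(weights, grad)]
    return weights


# HYPOTHETICAL toy counts, not taken from the paper:
# (high_impact, nih_policy, n_shared, n_not_shared)
cells = [
    (1, 0, 6, 2),
    (1, 1, 6, 2),
    (0, 0, 2, 6),
    (0, 1, 2, 6),
]

rows, labels = [], []
for high_impact, nih_policy, n_shared, n_not_shared in cells:
    rows += [(high_impact, nih_policy)] * (n_shared + n_not_shared)
    labels += [1] * n_shared + [0] * n_not_shared

weights = fit_logistic(rows, labels)
# exp(coefficient) is the adjusted odds ratio for each predictor.
odds_ratios = [math.exp(w) for w in weights[1:]]
print("odds ratios (high_impact, nih_policy):", odds_ratios)
```

The toy counts are constructed so that sharing depends only on the first attribute. The fitted odds ratio for `high_impact` is therefore close to 9 and the one for `nih_policy` is close to 1, which is the pattern a multivariate model reports when a variable has no association with the outcome after adjusting for the other.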

Results

We studied the data sharing patterns of 397 gene expression microarray studies published in 2007 within 20 journals, as identified in a systematic review by Ochsner et al. (2008). Almost half of the studies made their raw datasets available (47%).

We found that 41 of the articles acknowledged NIH funding but did not reveal specific grant numbers; these studies appear to be randomly distributed throughout the sample, so we estimated their levels of NIH funding from other attributes, as described

Discussion

This study explored the association between policy variables, author experience, selected article attributes, and frequency of data sharing within 397 recent gene expression microarray studies. We found that data sharing was more prevalent for studies published in journals with a higher impact factor and by authors with more experience. Whether or not the study was funded by the NIH had little impact on data sharing rates. We estimate the NIH data sharing policy applied to only 19% of the

Conclusions

We believe our emphasis on observed variables facilitates measurement of important quantitative associations. Data sharing policies are controversial (Campbell, 1999, Cecil and Boruch, 1988, King, 1995), and thus deserve to be thoughtfully considered and evaluated. We hope the results from our analyses will contribute to a deeper understanding of information behavior for research data sharing, and eventually more effective data sharing initiatives so that the value of research related output

Acknowledgements

HAP was supported by NLM training grant 5T15-LM007059-19 and the Department of Biomedical Informatics at the University of Pittsburgh. WWC is funded through NLM grant 5R01-LM009427-0.

References (55)

  • D. Blumenthal. Withholding research results in academic life science. Evidence from a national survey of faculty. Journal of the American Medical Association (1997)
  • D. Blumenthal. Data withholding in genetics and the other life sciences: prevalences and predictors. Academic Medicine (2006)
  • L. Bornmann et al. Are there better indices for evaluation purposes than the h index? A comparison of nine different variants of the h index using data from biomedicine. Journal of the American Society for Information Science and Technology (2008)
  • A. Brazma. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature Genetics (2001)
  • C. Brown. The changing face of scientific discourse: Analysis of genomic and proteomic database usage and acceptance. Journal of the American Society for Information Science and Technology (2003)
  • E.G. Campbell. Data withholding in academic genetics: Evidence from a national survey. Journal of the American Medical Association (2002)
  • P. Campbell. Controversial Proposal on Public Access to Research Data Draws 10,000 Comments. The Chronicle of Higher Education (1999)
  • T. Cech. Sharing publication-related data and materials: Responsibilities of authorship in the life sciences (2003)
  • J.S. Cecil et al. Compelled Disclosure of Research Data: An early warning and suggestions for psychologists. Law and Human Behavior (1988)
  • D. Constant et al. What's mine is ours, or is it? A study of attitudes about information sharing. Information Systems Research (1994)
  • S.E. Fienberg et al. Sharing Research Data (1985)
  • Giordano, R. (2007). The Scientist: Secretive, Selfish, or Reticent? A Social Network Analysis. E-Social Science 2007,...
  • F.E. Harrell. Regression Modeling Strategies: With applications to linear models, logistic regression and survival analysis (2001)
  • M. Hedstrom. Producing archive-ready datasets: Compliance incentives, and motivation
  • M. Hedstrom et al. Research Forum Presentation: Incentives to Create “Archive-Ready” Data: Implications for Archives and Records Management
  • J.E. Hirsch. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences (2005)
  • Hosek, S. D., et al. (2005). Gender Differences in Major Federal External Grant Programs, from...