Outlier detection methods to improve the quality of citizen science data

  • Original Paper
  • International Journal of Biometeorology

Abstract

Citizen science involves public participation in research, usually through volunteer observation and reporting. Data collected by citizen scientists are a valuable resource in many fields of research that require long-term observations at large geographic scales. However, such data may be perceived as less accurate than those collected by trained professionals. Here, we analyze the quality of data from a plant phenology network, which tracks biological response to climate change. We apply five algorithms designed to detect outlier observations or inconsistent observers. These methods rely on different quantitative approaches, including residuals of linear models, correlations among observers, deviations from multivariate clusters, and percentile-based outlier removal. We evaluate the methods by comparing the resulting cleaned datasets in terms of time series means, spatial data coverage, and spatial autocorrelations after outlier removal. Spatial autocorrelation serves as the measure of efficacy, because it is expected to increase if outliers and inconsistent observations are successfully removed. All data cleaning methods improved Moran’s I autocorrelation statistics, with percentile-based outlier removal and the clustering method showing the greatest improvement. Methods based on residual analysis of linear models had the strongest impact on the final bloom time mean estimates, but were among the weakest based on autocorrelation analysis. Removing entire sets of observations from potentially unreliable observers proved least effective. In conclusion, percentile-based outlier removal emerges as a simple and effective method to improve the reliability of citizen science phenology observations.
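To make the evaluation logic concrete, the following is a minimal sketch of percentile-based outlier removal followed by a before-and-after comparison of Moran's I. It is not the authors' implementation: the 5th and 95th percentile cutoffs, the inverse-distance spatial weights, the function names, and the synthetic bloom dates on a regular grid are all assumptions chosen for illustration.

```python
import numpy as np

def percentile_mask(values, lower=5.0, upper=95.0):
    """Boolean mask that keeps values inside the [lower, upper] percentile band.
    The 5/95 cutoffs are illustrative assumptions, not values from the paper."""
    lo, hi = np.percentile(values, [lower, upper])
    return (values >= lo) & (values <= hi)

def morans_i(values, coords):
    """Moran's I using inverse-distance weights (an assumed weighting scheme)."""
    x = np.asarray(values, dtype=float)
    xy = np.asarray(coords, dtype=float)
    n = x.size
    # Pairwise Euclidean distances between observation sites.
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    w = np.zeros_like(d)
    off_diag = d > 0
    w[off_diag] = 1.0 / d[off_diag]   # weight 0 on the diagonal
    z = x - x.mean()
    return (n / w.sum()) * (z @ w @ z) / (z @ z)

# Synthetic example: bloom day-of-year follows a smooth west-east gradient
# plus small observer noise, on a hypothetical 14 x 14 grid of sites.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.linspace(0, 100, 14), np.linspace(0, 100, 14))
coords = np.column_stack([gx.ravel(), gy.ravel()])
bloom = 120.0 + 0.3 * coords[:, 0] + rng.normal(0.0, 2.0, size=len(coords))

# Inject ten gross errors (alternating +60 and -60 days) at scattered sites.
outlier_idx = np.arange(5, len(coords), 20)
shifts = np.where(np.arange(outlier_idx.size) % 2 == 0, 60.0, -60.0)
bloom[outlier_idx] += shifts

keep = percentile_mask(bloom)
i_before = morans_i(bloom, coords)
i_after = morans_i(bloom[keep], coords[keep])
print(f"Moran's I before cleaning: {i_before:.3f}")
print(f"Moran's I after cleaning:  {i_after:.3f}")
```

In this toy setting the injected errors fall outside the percentile band and are dropped, and the cleaned data show stronger spatial autocorrelation, which is the pattern the abstract uses as its success criterion. The filter also discards some valid extreme observations, so the choice of cutoffs trades data coverage against cleanliness.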

Acknowledgments

We thank all citizen scientists who contributed to the Alberta PlantWatch program led by EB, and we appreciate their enthusiasm and continued support of this program.

Funding

Funding to carry out the analysis presented in this paper was provided by the NSERC Discovery Grant RGPIN-330527 to AH.

Author information

Corresponding author

Correspondence to Jennifer S. Li.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Li, J.S., Hamann, A. & Beaubien, E. Outlier detection methods to improve the quality of citizen science data. Int J Biometeorol 64, 1825–1833 (2020). https://doi.org/10.1007/s00484-020-01968-z

  • DOI: https://doi.org/10.1007/s00484-020-01968-z