Abstract
Data provenance concerns the origin of data: the identification of data sources and of the transformations the data undergo over time. This paper proposes a generic method for collecting provenance data and follows up on a study carried out by the same authors in a Brazilian hemotherapy center. The method is based on the W3C Provenance Data Model (PROV-DM) and proposes a way to capture, store and analyze anemia-index provenance data by applying a scientific workflow together with knowledge provenance management. This is an exploratory, practical and deductive study carried out with real data from 197,551 blood donor candidates, extracted from reports covering 2000 to 2018 provided by a Brazilian hemotherapy center. Candidates identified with high anemia rates were quantified and tagged as unsuitable for blood donation. Among the unsuitable candidates with the highest anemia rates, out of 1011 male and 4039 female candidates, women had the highest levels of unsuitable blood donations. The study concludes that the proposed generic method for collecting provenance data can be applied in several areas of knowledge.
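As a rough illustration of the capture, store and analyze steps named in the abstract, the following Python sketch records PROV-DM style entities, activities and agents while tagging candidates. It is not the authors' implementation. The class `ProvDocument`, the function `tag_candidates`, the field `anemia_index`, the threshold of 0.5 and the two sample records are all hypothetical and chosen only for illustration. Only the relation names (`used`, `wasGeneratedBy`, `wasDerivedFrom`, `wasAssociatedWith`) come from PROV-DM.

```python
from dataclasses import dataclass, field


@dataclass
class ProvDocument:
    """Tiny in-memory store for PROV-DM style records (illustrative only)."""
    entities: dict = field(default_factory=dict)
    activities: dict = field(default_factory=dict)
    agents: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (relation, subject, object)

    def entity(self, eid, **attrs):
        self.entities[eid] = attrs
        return eid

    def activity(self, aid, **attrs):
        self.activities[aid] = attrs
        return aid

    def agent(self, gid, **attrs):
        self.agents[gid] = attrs
        return gid

    def relate(self, relation, subject, obj):
        self.relations.append((relation, subject, obj))

    def lineage(self, eid):
        """Follow wasDerivedFrom edges back from an entity to its sources."""
        sources, stack = [], [eid]
        while stack:
            current = stack.pop()
            for relation, subject, obj in self.relations:
                if relation == "wasDerivedFrom" and subject == current:
                    sources.append(obj)
                    stack.append(obj)
        return sources


def tag_candidates(doc, report_id, records, threshold):
    """Tag each candidate and record how the tagged data set was produced.

    The threshold is a hypothetical parameter, not a clinical cut-off.
    """
    tagged = [{**r, "suitable": r["anemia_index"] < threshold} for r in records]
    out_id = doc.entity(f"{report_id}-tagged", rows=len(tagged))
    act_id = doc.activity(f"tag-{report_id}", threshold=threshold)
    doc.relate("used", act_id, report_id)             # activity used the report
    doc.relate("wasGeneratedBy", out_id, act_id)      # output came from the activity
    doc.relate("wasDerivedFrom", out_id, report_id)   # output derives from the report
    doc.relate("wasAssociatedWith", act_id, "workflow")
    return tagged, out_id


doc = ProvDocument()
doc.agent("workflow", kind="software")
doc.entity("report-2018", source="hemotherapy center report")

# Made-up sample records; real data are not reproduced here.
records = [
    {"id": 1, "sex": "F", "anemia_index": 0.9},
    {"id": 2, "sex": "M", "anemia_index": 0.2},
]
tagged, out_id = tag_candidates(doc, "report-2018", records, threshold=0.5)
unsuitable = [r["id"] for r in tagged if not r["suitable"]]
print(unsuitable)           # ids tagged as unsuitable
print(doc.lineage(out_id))  # source entities of the tagged data set
```

Keeping the relations as plain triples makes the analysis step a simple traversal. A production system would more likely serialize them to one of the standard PROV formats.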
Data Availability
The data come from previously published research and were made available with the consent of the authors of that study.
Author information
Contributions
Sembay MJ contributed to the conception of the study, to data collection and analysis, and to the writing of the article. Macedo DDJ and Dutra ML contributed to data analysis and to the writing and review of the article.
Ethics declarations
Conflicts of interest/Competing interests
The authors declare that they have no conflicts of interest.
Code availability
Not applicable.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Sembay, M.J., de Macedo, D.D.J. & Dutra, M.L. A Proposed Approach for Provenance Data Gathering. Mobile Netw Appl 26, 304–318 (2021). https://doi.org/10.1007/s11036-020-01648-7
DOI: https://doi.org/10.1007/s11036-020-01648-7