A Proposed Approach for Provenance Data Gathering

Mobile Networks and Applications

Abstract

Data provenance concerns the origin of data: the identification of data sources and of the transformations the data undergo over time. This paper proposes a generic method for collecting provenance data, following up on a study carried out by the same authors in a Brazilian hemotherapy center. The method is based on the W3C Provenance Data Model (PROV-DM) and proposes a way to capture, store and analyze anemia-index provenance data by applying a scientific workflow together with knowledge provenance management. This is an exploratory, practical and deductive study carried out with real data from 197,551 blood donor candidates, extracted from reports covering 2000 to 2018 provided by a Brazilian hemotherapy center. Candidates identified with high anemia rates were quantified and tagged as unsuitable for blood donation. Among the unsuitable candidates with the highest anemia rates, out of 1011 male candidates and 4039 female candidates, women had the highest levels of unsuitability for donation. The study concludes that the proposed generic method for collecting provenance data can be applied in several areas of knowledge.
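
As an illustration only, the capture, store and analyze steps can be sketched with the core PROV-DM concepts (entities, activities, agents) and the relations `used`, `wasGeneratedBy`, `wasDerivedFrom` and `wasAssociatedWith`. This is a minimal sketch and not the authors' implementation: the `ProvDocument` class, the function names, the identifier scheme and the hemoglobin cutoffs are all assumptions introduced for the example, and the three candidates are made-up values rather than data from the study.

```python
from dataclasses import dataclass, field

# Hypothetical hemoglobin cutoffs in g/dL, chosen only for illustration.
# The study's actual screening criteria are not stated in the abstract.
ASSUMED_HB_CUTOFF = {"M": 13.0, "F": 12.0}


@dataclass
class ProvDocument:
    """In-memory store for PROV-DM style records (illustrative only)."""
    entities: dict = field(default_factory=dict)
    activities: dict = field(default_factory=dict)
    agents: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def relate(self, kind, subject, obj):
        # Each relation is stored as a (kind, subject, object) triple.
        self.relations.append((kind, subject, obj))


def screen_candidate(doc, candidate_id, sex, hemoglobin):
    """Tag one candidate and record how the tag was produced."""
    record_id = f"record:{candidate_id}"
    tag_id = f"tag:{candidate_id}"
    activity_id = f"screening:{candidate_id}"

    suitable = hemoglobin >= ASSUMED_HB_CUTOFF[sex]

    # Capture: the input record and the derived tag are both entities.
    doc.entities[record_id] = {"sex": sex, "hemoglobin": hemoglobin}
    doc.entities[tag_id] = {"suitable": suitable}
    doc.activities[activity_id] = {"type": "anemia-screening"}
    doc.agents.setdefault("agent:workflow", {"type": "SoftwareAgent"})

    # Store: link entities, activity and agent with PROV-DM relations.
    doc.relate("used", activity_id, record_id)
    doc.relate("wasGeneratedBy", tag_id, activity_id)
    doc.relate("wasDerivedFrom", tag_id, record_id)
    doc.relate("wasAssociatedWith", activity_id, "agent:workflow")
    return suitable


def unsuitable_by_sex(doc):
    """Analyze: count unsuitable tags by following wasDerivedFrom links."""
    counts = {"M": 0, "F": 0}
    for kind, subject, obj in doc.relations:
        if kind == "wasDerivedFrom" and not doc.entities[subject]["suitable"]:
            counts[doc.entities[obj]["sex"]] += 1
    return counts


doc = ProvDocument()
screen_candidate(doc, "c1", "F", 11.2)  # made-up values
screen_candidate(doc, "c2", "M", 14.1)
screen_candidate(doc, "c3", "F", 12.8)
print(unsuitable_by_sex(doc))
```

Because every tag is linked back to the record it was derived from, the count of unsuitable candidates can be traced to the individual inputs rather than taken on trust, which is the property a provenance-based method is meant to provide.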

Data Availability

The data were made available from previously published research, with the consent of the authors of that study.

Author information

Contributions

M. J. Sembay contributed to the conception of the study, the collection and analysis of data, and the writing of the article. D. D. J. de Macedo and M. L. Dutra contributed to the data analysis and to the writing and review of the article.

Corresponding author

Correspondence to Márcio José Sembay.

Ethics declarations

Conflicts of interest/Competing interests

The authors declare that they have no conflicts of interest.

Code availability

Not applicable.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sembay, M.J., de Macedo, D.D.J. & Dutra, M.L. A Proposed Approach for Provenance Data Gathering. Mobile Netw Appl 26, 304–318 (2021). https://doi.org/10.1007/s11036-020-01648-7

  • DOI: https://doi.org/10.1007/s11036-020-01648-7
