Abstract
A deluge of dark data is imminent. Insufficient data management capabilities, especially in supercomputing, and missing data documentation (i.e., missing metadata annotation) are a major source of dark data. The present work addresses this challenge by presenting ExtractIng, a generic automated metadata extraction toolkit. With ExtractIng, existing metadata of simulation output files scattered across the file system can be aggregated, parsed and converted to the EngMeta metadata model. Use cases from computational engineering demonstrate the viability of ExtractIng. The evaluation results show that the metadata extraction is simulation-code independent, in the sense that it can handle data outputs from various fields of science, that it is easy to integrate into simulation workflows, and that it is compatible with a multitude of computational environments.
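The aggregate, parse and convert steps named in the abstract can be illustrated with a minimal sketch. It is not the ExtractIng implementation: the `key = value` log format, the `*.log` file pattern, the function names and the XML element names are all assumptions made for illustration, and the output is a flat placeholder document, not the actual EngMeta schema.

```python
import re
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical "key = value" / "key: value" line format (an assumption);
# real simulation codes each write their own output format.
KEY_VALUE = re.compile(r"^\s*([A-Za-z_][\w\-]*)\s*[=:]\s*(.+?)\s*$")


def parse_file(path, wanted_keys):
    """Return the wanted key/value pairs found in one output file."""
    found = {}
    for line in path.read_text(errors="replace").splitlines():
        match = KEY_VALUE.match(line)
        if match and match.group(1) in wanted_keys:
            found[match.group(1)] = match.group(2)
    return found


def aggregate(root_dir, wanted_keys, pattern="*.log"):
    """Walk root_dir recursively and merge metadata from all matching files."""
    merged = {}
    for path in sorted(Path(root_dir).rglob(pattern)):
        merged.update(parse_file(path, wanted_keys))
    return merged


def to_xml(metadata):
    """Serialize to flat XML; element names are placeholders,
    not the actual EngMeta schema."""
    dataset = ET.Element("dataset")
    for key in sorted(metadata):
        ET.SubElement(dataset, key).text = metadata[key]
    return ET.tostring(dataset, encoding="unicode")


def demo():
    """Build a small scattered directory tree and extract metadata from it."""
    with tempfile.TemporaryDirectory() as tmp:
        out_dir = Path(tmp) / "run1" / "out"
        out_dir.mkdir(parents=True)
        (Path(tmp) / "run1" / "setup.log").write_text("solver = ExampleCode\nnoise line\n")
        (out_dir / "step.log").write_text("timestep: 0.002\nignored = yes\n")
        return to_xml(aggregate(tmp, {"solver", "timestep"}))
```

Restricting extraction to a caller-supplied set of keys keeps the parser independent of any particular simulation code in this sketch: supporting another code means supplying a different key set and file pattern, not changing the parsing logic.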
Notes
https://codemeta.github.io/, last accessed 25 Nov 2020.
http://exptml.sourceforge.net/, last accessed 25 Nov 2020.
https://docs.oracle.com/javase/8/docs/api/java/util/Scanner.html, last accessed 14 Feb 2020.
https://spark.apache.org/, last accessed 14 Feb 2020.
https://www.oracle.com/technical-resources/articles/javase/jaxb.html, last accessed 25 Nov 2020.
https://wiki.iag.uni-stuttgart.de/eas3wiki/index.php/Main_Page, last accessed 26 Feb 2020.
https://www.unidata.ucar.edu/software/netcdf/, last accessed 26 Feb 2020.
http://cfconventions.org/, last accessed 26 Feb 2020.
https://www.unidata.ucar.edu/software/netcdf/examples/files.html, last accessed 26 Feb 2020.
https://dataverse.org/, last accessed 2 Mar 2020.
Interestingly, the authors describe dark data as a “data swamp”, in contrast to a “data lake” of well-annotated data.
Acknowledgements
The author would like to thank the Federal Ministry of Education and Research for funding the Dipling project under Grant No. FDM-008. The author would also like to thank Dr. Martin Thomas Horsch for comments on the script and for proofreading.
Cite this article
Schembera, B. Like a rainbow in the dark: metadata annotation for HPC applications in the age of dark data. J Supercomput 77, 8946–8966 (2021). https://doi.org/10.1007/s11227-020-03602-6
DOI: https://doi.org/10.1007/s11227-020-03602-6