Skip to main content
Log in

On-demand big data integration

A hybrid ETL approach for reproducible scientific research

  • Published:
Distributed and Parallel Databases Aims and scope Submit manuscript

Abstract

Scientific research requires access, analysis, and sharing of data that is distributed across various heterogeneous data sources at the scale of the Internet. An eager extract, transform, and load (ETL) process constructs an integrated data repository as its first step, integrating and loading data in its entirety from the data sources. The bootstrapping of this process is not efficient for scientific research that requires access to data from very large and typically numerous distributed data sources. A lazy ETL process loads only the metadata, but still eagerly. Lazy ETL is faster in bootstrapping. However, queries on the integrated data repository of eager ETL perform faster, due to the availability of the entire data beforehand. In this paper, we propose a novel ETL approach for scientific data integration, as a hybrid of eager and lazy ETL approaches, and applied both to data as well as metadata. This way, hybrid ETL supports incremental integration and loading of metadata and data from the data sources. We incorporate a human-in-the-loop approach, to enhance the hybrid ETL, with selective data integration driven by the user queries and sharing of integrated data between users. We implement our hybrid ETL approach in a prototype platform, Óbidos, and evaluate it in the context of data sharing for medical research. Óbidos outperforms both the eager ETL and lazy ETL approaches, for scientific research data integration and sharing, through its selective loading of data and metadata, while storing the integrated data in a scalable integrated data repository.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Similar content being viewed by others

Notes

  1. Óbidos is a medieval fortified town that has been patronized by various Portuguese queens. It is known for its sweet wine, served in a chocolate cup.

References

  1. Ahern, T., Casey, R., Barnes, D., Benson, R., Knight, T.: SEED Standard for the Exchange of Earthquake Data Reference Manual Format Version 2.4. Incorporated Research Institutions for Seismology (IRIS), Seattle (2007)

    Google Scholar 

  2. Antonioletti, M., Atkinson, M., Baxter, R., Borley, A., Chue Hong, N.P., Collins, B., Hardman, N., Hume, A.C., Knox, A., Jackson, M.: The design and implementation of Grid database services in OGSA-DAI. Concurr. Comput. Pract. Exp. 17(2–4), 357–376 (2005)

    Article  Google Scholar 

  3. Ardestani, S.B., Håkansson, C.J., Laure, E., Livenson, I., Stranák, P., Dima, E., Blommesteijn, D., van de Sanden, M.: B2SHARE: an open e-Science data sharing platform. In: 2015 IEEE 11th International Conference on e-Science (e-Science), pp. 448–453. IEEE (2015)

  4. Borckholder, C., Heinzel, A., Kaniovskyi, Y., Benkner, S., Lukas, A., Mayer, B.: A generic, service-based data integration framework applied to linking drugs and clinical trials. Procedia Comput. Sci. 23, 24–35 (2013)

    Article  Google Scholar 

  5. caMicroscope: caMicroscope (2018). http://camicroscope.org

  6. Çaparlar, C.Ö., Dönmez, A.: What is scientific research and how can it be done? Turk. J. Anaesthesiol. Reanim. 44(4), 212 (2016)

    Article  Google Scholar 

  7. Chaudhuri, S., Dayal, U.: An overview of data warehousing and OLAP technology. ACM SIGMOD Rec. 26(1), 65–74 (1997)

    Article  Google Scholar 

  8. Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M.: The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J. Digit. Imaging 26(6), 1045–1057 (2013)

    Article  Google Scholar 

  9. Dong, X.L., Srivastava, D.: Big data integration. In: 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 1245–1248. IEEE (2013)

  10. Gradecki, J.D., Cole, J.: Mastering Apache Velocity. Wiley (2003)

  11. Hausenblas, M., Nadeau, J.: Apache Drill: interactive ad-hoc analysis at scale. Big Data 1(2), 100–104 (2013)

    Article  Google Scholar 

  12. Heinzlreiter, P., Perkins, J.R., Tirado, O.T., Karlsson, T.J.M., Ranea, J.A., Mitterecker, A., Blanca, M., Trelles, O.: A cloud-based GWAS analysis pipeline for clinical researchers. In: CLOSER, pp. 387–394 (2014)

  13. Hey, T., Trefethen, A.E.: Cyberinfrastructure for e-Science. Science 308(5723), 817–821 (2005)

    Article  Google Scholar 

  14. HL7: FHIR (2018). https://www.hl7.org/fhir/

  15. Huang, Z.: Data integration for urban transport planning. Citeseer (2003)

  16. Kadadi, A., Agrawal, R., Nyamful, C., Atiq, R.: Challenges of data integration and interoperability in big data. In: 2014 IEEE International Conference on Big Data (Big Data), pp. 38–40. IEEE (2014)

  17. Kargín, Y., Ivanova, M., Zhang, Y., Manegold, S., Kersten, M.: Lazy ETL in action: ETL technology dates scientific data. Proc. VLDB Endow. 6(12), 1286–1289 (2013)

    Article  Google Scholar 

  18. Kathiravelu, P., Chen, Y., Sharma, A., Galhardas, H., Van Roy, P., Veiga, L.: On-demand service-based big data integration: optimized for research collaboration. In: VLDB Workshop on Data Management and Analytics for Medicine and Healthcare, pp. 9–28. Springer (2017)

  19. Krishnan, S., Haas, D., Franklin, M.J., Wu, E.: Towards reliable interactive data cleaning: a user survey and recommendations. In: Proceedings of the Workshop on Human-in-the-Loop Data Analytics, p. 9. ACM (2016)

  20. Langegger, A., Wöß, W., Blöchl, M.: A semantic web middleware for virtual data integration on the web. In: European Semantic Web Conference, pp. 493–507. Springer (2008)

  21. Lecarpentier, D., Wittenburg, P., Elbers, W., Michelini, A., Kanso, R., Coveney, P., Baxter, R.: EUDAT: a new cross-disciplinary data infrastructure for science. Int. J. Digit. Curation 8(1), 279–287 (2013)

    Article  Google Scholar 

  22. Lee, G., Doyle, S., Monaco, J., Madabhushi, A., Feldman, M.D., Master, S.R., Tomaszewski, J.E.: A knowledge representation framework for integration, classification of multi-scale imaging and non-imaging data: preliminary results in predicting prostate cancer recurrence by fusing mass spectrometry and histology. In: 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 77–80. IEEE (2009)

  23. Li, G.: Human-in-the-loop data integration. Proc. VLDB Endow. 10(12), 2006–2017 (2017)

    Article  Google Scholar 

  24. Lyu, D.M., Tian, Y., Wang, Y., Tong, D.Y., Yin, W.W., Li, J.S.: Design and implementation of clinical data integration and management system based on Hadoop platform. In: 2015 7th International Conference on Information Technology in Medicine and Education (ITME), pp. 76–79. IEEE (2015)

  25. Marchioni, F., Surtani, M.: Infinispan Data Grid Platform. Packt Publishing Ltd., Birmingham (2012)

    Google Scholar 

  26. Milchevski, E., Michel, S.: LigDB—online query processing without (almost) any storage. In: EDBT, pp. 683–688 (2015)

  27. Mildenberger, P., Eichelberg, M., Martin, E.: Introduction to the DICOM standard. Eur. Radiol. 12(4), 920–927 (2002)

    Article  Google Scholar 

  28. Reichman, O.J., Jones, M.B., Schildhauer, M.P.: Challenges and opportunities of open data in ecology. Science 331(6018), 703–705 (2011)

    Article  Google Scholar 

  29. Scality: Scality RING (2018). http://storage.scality.com/rs/963-KAI-434/images/Scality%20Technical%20Whitepaper.pdf

  30. Spark: Spark Framework: An Expressive Web Framework for Kotlin and Java (2018). http://sparkjava.com/

  31. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2(2), 1626–1629 (2009)

    Article  Google Scholar 

  32. Vassiliadis, P.: A survey of Extract-transform-Load technology. Int. J. Data Warehous. Min. 5(3), 1–27 (2009)

    Article  Google Scholar 

  33. White, T.: Hadoop: The Definitive Guide. O’Reilly Media Inc, Sebastopol (2012)

    Google Scholar 

  34. Widmann, H., Thiemann, H.: EUDAT B2FIND: a cross-discipline metadata service and discovery portal. In: EGU General Assembly Conference Abstracts, vol. 18, p. 8562 (2016)

  35. Zhang, Q., Zhang, X., Zhang, Q., Shi, W., Zhong, H.: Firework: big data sharing and processing in collaborative edge environment. In: 2016 Fourth IEEE Workshop on Hot Topics in Web Systems and Technologies (HotWeb), pp. 20–25. IEEE (2016)

Download references

Acknowledgements

This work was supported by NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy, Fred Prior, Ashish Sharma (UAMS, Emory), National funds through Fundação para a Ciência e a Tecnologia with reference UID/CEC/50021/2013, PTDC/EEI-SCR/6945/2014, a Google Summer of Code project, and a PhD grant offered by the Erasmus Mundus Joint Doctorate in Distributed Computing (EMJD-DC) under grant agreement 2012-0030.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Pradeeban Kathiravelu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kathiravelu, P., Sharma, A., Galhardas, H. et al. On-demand big data integration. Distrib Parallel Databases 37, 273–295 (2019). https://doi.org/10.1007/s10619-018-7248-y

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10619-018-7248-y

Keywords

Navigation