A Cautionary Tale for Machine Learning Design: why we Still Need Human-Assisted Big Data Analysis

Roccetti, Marco; Delnevo, Giovanni; Casini, Luca; Salomoni, Paola

doi:10.1007/s11036-020-01530-6

Published: 19 February 2020

Volume 25, pages 1075–1083, (2020)

Mobile Networks and Applications

Abstract

Supervised Machine Learning (ML) requires that smart algorithms scrutinize a very large number of labeled samples before they can make correct predictions. Yet even a very large number of samples does not always guarantee this. In our experience, in fact, a neural network trained on a huge database of over fifteen million water meter readings essentially failed to predict when a meter would malfunction or need disassembly based on a history of water consumption measurements. In a second step, we developed a methodology, based on the enforcement of a specialized data semantics, that allowed us to extract for training only those samples that were not corrupted by data impurities. With this methodology, we re-trained the neural network up to a prediction accuracy of over 80%. Yet we simultaneously realized that the new training dataset was significantly different from the initial one in statistical terms, and much smaller as well. We had reached a sort of paradox: we had alleviated the initial problem with a more interpretable model, but we had changed the replicated form of the initial data. To reconcile that paradox, we further enhanced our data semantics with the contribution of field experts. This finally led to the extraction of a training dataset truly representative of regular/defective water meters and able to describe the underlying statistical phenomenon, while still providing an excellent prediction accuracy for the resulting classifier. At the end of this path, the lesson we have learnt is that a human-in-the-loop approach may significantly help to clean and re-organize noisy datasets for an empowered ML design experience.

Acknowledgements

We are indebted to the company that provided us with the data of interest. Our thanks also go to Giuseppe Cappiello and Nicolò Zagni (University of Bologna) for their participation in a previous phase of this research activity.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
Marco Roccetti, Giovanni Delnevo, Luca Casini & Paola Salomoni

Corresponding author

Correspondence to Marco Roccetti.

About this article

Cite this article

Roccetti, M., Delnevo, G., Casini, L. et al. A Cautionary Tale for Machine Learning Design: why we Still Need Human-Assisted Big Data Analysis. Mobile Netw Appl 25, 1075–1083 (2020). https://doi.org/10.1007/s11036-020-01530-6

Issue Date: June 2020
