Abstract
Supervised Machine Learning (ML) requires that smart algorithms scrutinize a very large number of labeled samples before they can make right predictions. And this is not always true either. In our experience, in fact, a neural network trained with a huge database comprised of over fifteen million water meter readings had essentially failed to predict when a meter would malfunction/need disassembly based on a history of water consumption measurements. With a second step, we developed a methodology, based on the enforcement of a specialized data semantics, that allowed us to extract only those samples for training that were not noised by data impurities. With this methodology, we re-trained the neural network up to a prediction accuracy of over 80%. Yet, we simultaneously realized that the new training dataset was significantly different from the initial one in statistical terms, and much smaller, as well. We had reached a sort of paradox: We had alleviated the initial problem with a better interpretable model, but we had changed the replicated form of the initial data. To reconcile that paradox, we further enhanced our data semantics with the contribution of field experts. This has finally led to the extrapolation of a training dataset truly representative of regular/defective water meters and able to describe the underlying statistical phenomenon, while still providing an excellent prediction accuracy of the resulting classifier. At the end of this path, the lesson we have learnt is that a human-in-the-loop approach may significantly help to clean and re-organize noised datasets for an empowered ML design experience.
Similar content being viewed by others
References
Pettersen L (2018) Why artificial intelligence will not outsmart complex knowledge work. Work, Employment and Society. Sage. To appear
Jordan MI, Mitchell TM (2015) Machine learning: trends, perspectives, and prospects. Science 349(6245):255–260
Delnevo G, Roccetti M, Mirri S (2019) Intelligent and good machines? The role of domain and context codification, Mobile networks and applications, Elsevier. To appear
Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann
Alkowaileet W, Alsubaiee S, Carey M, Li C, Ramampiaro H, Sinthong P, Wang X (2018) Enhancing big data with semantics: the AsterixDB approach. In Proc. of 12th IEEE international conference on semantic computing, 314-315. IEEE
Emani CK, Cullot N, Nicolle C (2015) Understandable big data: a survey. Comput Sci Rev 17:70–81
Casini L, Delnevo G, Roccetti M, Zagni N, Cappiello G (2019, August) Deep water: predicting water meter failures through a human-machine intelligence collaboration. In international conference on human interaction and emerging technologies (pp. 688-694). Springer, Cham
Roccetti M, Delnevo G, Casini L, Zagni N, Cappiello G (2019, September). A paradox in ML design: less data for a smarter water metering cognification experience. In proceedings of the 5th EAI international conference on smart objects and Technologies for Social Good (pp. 201-206). ACM
Roccetti M, Delnevo G, Casini L, Cappiello G (2019) Is bigger always better? A controversial journey to the center of machine learning design, with uses and misuses of big data for predicting water meter failures. J Big Data 6(1):70
Wang RY, Storey VC, Firth CP (1995) A framework for analysis of data quality research. IEEE Trans Knowl Data Eng 4:623–640
ISO 8000-8:2015, https://www.iso.org/obp/ui/#iso:std:iso:8000:-8:ed-1:v1:en
Juran J, Godfrey AB (1999) Quality handbook. Republished McGraw-Hill, 173-178
Kodra Y, De La Paz MP, Coi A, Santoro M, Bianchi F, Ahmed F, ... Taruscio D (2017) Data quality in rare diseases registries. In rare diseases epidemiology: update and overview (pp. 149–164). Springer, Cham
Scannapieco M, Missier P, Batini C (2005) Data quality at a glance. Datenbank-Spektrum, 14(January), 6–14
Sidi F, Panahy PHS, Affendey LS, Jabar MA, Ibrahim H, Mustapha A (2012, March). Data quality: a survey of data quality dimensions. In 2012 international conference on Information Retrieval & Knowledge Management (pp. 300-304). IEEE
Pipino LL, Lee YW, Wang RY (2002) Data quality assessment. Commun ACM 45(4):211–218
Cai L, Zhu Y (2015) The challenges of data quality and data quality assessment in the big data era. Data Sci J 14
Chen H, Hailey D, Wang N, Yu P (2014) A review of data quality assessment methods for public health information systems. Int J Environ Res Public Health 14;11(5):5170–5207. https://doi.org/10.3390/ijerph110505170
Chen JV, Su BC, Widjaja AE (2016) Facebook C2C social commerce: a study of online impulse buying. Decis Support Syst 83:57–69
Von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417(6887):399
Burggräf P, Dannapfel M, Förstmann R, Adlon T, Fölling C (2018, January). Data quality-based process enabling: application to logistics supply processes in low-volume ramp-up context. In 2018 international conference on information management and processing (ICIMP) (pp. 36-41). IEEE
Breck E, Polyzotis N, Roy S, Whang SE, Zinkevich M (2018, January). Data Infrastructure for Machine Learning. In SysML Conference
Sessions V, Valtorta M (2006) The effects of data quality on machine learning algorithms. ICIQ
Foidl H, Felderer M (2019, August). Risk-based data validation in machine learning-based software systems. In proceedings of the 3rd ACM SIGSOFT international workshop on machine learning techniques for software quality evaluation (pp. 13-18). ACM
Wang RY, Strong DM (1996) Beyond accuracy: what data quality means to data consumers. J Manag Inf Syst 12(4):5–33
Acknowledgements
We are indebted towards the company that has provided us with the data of interest. Our thanks also go to Giuseppe Cappiello and Nicolò Zagni (University of Bologna) for their participation to a previous phase of this research activity.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Roccetti, M., Delnevo, G., Casini, L. et al. A Cautionary Tale for Machine Learning Design: why we Still Need Human-Assisted Big Data Analysis. Mobile Netw Appl 25, 1075–1083 (2020). https://doi.org/10.1007/s11036-020-01530-6
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11036-020-01530-6