A Cautionary Tale for Machine Learning Design: why we Still Need Human-Assisted Big Data Analysis
Mobile Networks and Applications (IF 2.3), Pub Date: 2020-02-19, DOI: 10.1007/s11036-020-01530-6
Marco Roccetti, Giovanni Delnevo, Luca Casini, Paola Salomoni

Supervised Machine Learning (ML) requires that smart algorithms scrutinize a very large number of labeled samples before they can make accurate predictions. Yet even this is not always sufficient. In our experience, a neural network trained on a huge database of over fifteen million water meter readings essentially failed to predict when a meter would malfunction or need disassembly based on its history of water consumption measurements. As a second step, we developed a methodology, based on enforcing a specialized data semantics, that allowed us to extract for training only those samples that were not corrupted by data impurities. With this methodology, we re-trained the neural network to a prediction accuracy of over 80%. Yet we simultaneously realized that the new training dataset was significantly different from the initial one in statistical terms, and much smaller as well. We had reached a sort of paradox: we had alleviated the initial problem with a more interpretable model, but we had changed the replicated form of the initial data. To reconcile that paradox, we further enhanced our data semantics with the contribution of field experts. This finally yielded a training dataset that is truly representative of regular and defective water meters and able to describe the underlying statistical phenomenon, while the resulting classifier still provides excellent prediction accuracy. At the end of this path, the lesson we have learnt is that a human-in-the-loop approach may significantly help to clean and re-organize noisy datasets for an improved ML design experience.

Updated: 2020-02-19