An alternative approach to dimension reduction for Pareto distributed data: a case study
Journal of Big Data (IF 8.1) Pub Date: 2021-02-25, DOI: 10.1186/s40537-021-00428-8
Marco Roccetti, Giovanni Delnevo, Luca Casini, Silvia Mirri

Deep learning models are data analysis tools suited to approximating (non-linear) relationships among variables in order to predict an outcome as accurately as possible. While these models can be used to answer many important questions, their utility is still harshly criticized, because it is extremely challenging to identify which data descriptors best represent a given phenomenon of interest. From our recent experience developing a deep learning model designed to detect failures in mechanical water meter devices, we have learnt that prediction accuracy can deteriorate appreciably if one tries to train the model with additional device-specific descriptors based on categorical data. This can happen because of an excessive increase in the dimensionality of the data, with a corresponding loss of statistical significance. After several unsuccessful experiments with alternative methodologies that either reduce the dimensionality of the data space or employ more traditional machine learning algorithms, we changed the training strategy and reconsidered those categorical data in the light of a Pareto analysis. In essence, we used those categorical descriptors not as an input on which to train our deep learning model, but as a tool to give a new shape to the dataset, based on the Pareto rule. With this data adjustment, we trained a better-performing deep learning model able to detect defective water meter devices with a prediction accuracy in the range 87–90%, even in the presence of categorical descriptors.
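
The abstract does not spell out how the dataset was reshaped, so the following is only one possible reading of the idea, sketched under stated assumptions: keep the records that belong to the most frequent categories covering a chosen share of the data, then drop the categorical column so that it is not used as a model input. The function names, the `brand` field, the toy records, and the 0.8 coverage threshold are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def pareto_categories(values, coverage=0.8):
    """Return the most frequent categories, in descending order of count,
    until their cumulative share of the records reaches `coverage`."""
    counts = Counter(values)
    total = sum(counts.values())
    kept, cumulative = [], 0
    for category, count in counts.most_common():
        if cumulative >= coverage * total:
            break
        kept.append(category)
        cumulative += count
    return kept


def reshape_by_pareto(records, key, coverage=0.8):
    """Keep only records whose `key` value lies in the Pareto head,
    and remove `key` so it is not fed to the model as a feature."""
    head = set(pareto_categories([r[key] for r in records], coverage))
    return [
        {k: v for k, v in r.items() if k != key}
        for r in records
        if r[key] in head
    ]


# Hypothetical toy data: one categorical descriptor and one numeric feature.
records = (
    [{"brand": "A", "reading": 10.0}] * 6
    + [{"brand": "B", "reading": 12.0}] * 2
    + [{"brand": "C", "reading": 9.0}]
    + [{"brand": "D", "reading": 11.0}]
)

reduced = reshape_by_pareto(records, key="brand")
```

In this sketch the categorical descriptor only decides which rows survive; the model would then be trained on the remaining numeric features, which avoids the dimensionality growth that encoding the category as an input would cause.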




Updated: 2021-02-25