On missing random effects in machine learning, Communications in Statistics - Simulation and Computation



On missing random effects in machine learning
Communications in Statistics - Simulation and Computation (IF 0.8) Pub Date: 2020-08-11, DOI: 10.1080/03610918.2020.1801729
Fabio D’Ottaviano, Wenzhao Yang


Abstract

The wide availability of undesigned data, a by-product of chemical industrial research and manufacturing, makes the venturesome use of machine learning attractive: its plug-and-play appeal promises to extract value from this data. Such data often reflects not only the response to controlled variation but also the variation caused by random effects. Machine learning based models in this industry may therefore easily miss active random effects. This study shows by simulation the effect of missing a random effect via machine learning, with mixed models that include it properly serving as the benchmark. The setting is one commonly encountered in the chemical industry, mixture experiments with process variables, and the effect is studied as a function of relative cluster size, total variance, proportion of variance attributed to the random effect, and data size. Simulation was employed because it allows the comparison (missing vs. not missing random effects) to be made clearly and simply while avoiding the unwanted confounders found in real-world data. Besides the long-established fact that machine learning performs better the larger the data, it was also observed that data lacking due specificity, i.e. without clustering information, causes critical prediction biases regardless of the data size.
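The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation: the linear response `1 + 2*x`, the function names `simulate` and `compare`, the default parameter values, and the two-step shrinkage estimator (pooled least squares followed by a shrunken cluster intercept with a known variance ratio `rho`) are all assumptions chosen for brevity, standing in for the machine learning model and the mixed model respectively.

```python
import numpy as np

def simulate(n_clusters, cluster_size, total_var, rho, rng):
    """Hypothetical data: y = 1 + 2*x + u[cluster] + eps.
    rho is the share of total_var attributed to the random effect u."""
    cluster = np.repeat(np.arange(n_clusters), cluster_size)
    x = rng.uniform(0.0, 1.0, size=cluster.size)
    u = rng.normal(0.0, np.sqrt(rho * total_var), size=n_clusters)
    eps = rng.normal(0.0, np.sqrt((1.0 - rho) * total_var), size=cluster.size)
    y = 1.0 + 2.0 * x + u[cluster] + eps
    return x, cluster, y

def compare(n_clusters=30, cluster_size=20, total_var=1.0, rho=0.5, seed=0):
    """Return the mean absolute per-cluster prediction bias on held-out rows
    for a cluster-blind fit and a cluster-aware fit.
    Assumes an even cluster_size >= 2 so each cluster appears in both halves."""
    rng = np.random.default_rng(seed)
    x, cluster, y = simulate(n_clusters, cluster_size, total_var, rho, rng)
    train = np.arange(y.size) % 2 == 0  # alternate rows within each cluster
    test = ~train

    # Cluster-blind model: pooled least squares on an intercept and x.
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    blind_pred = X @ beta

    # Cluster-aware model: add a shrunken mean of each cluster's training
    # residuals (random-intercept predictor with the variance ratio assumed known).
    resid = y - blind_pred
    n_train = np.bincount(cluster[train], minlength=n_clusters)
    sum_resid = np.bincount(cluster[train], weights=resid[train],
                            minlength=n_clusters)
    shrink = rho * n_train / (rho * n_train + (1.0 - rho))
    u_hat = shrink * sum_resid / n_train
    aware_pred = blind_pred + u_hat[cluster]

    def cluster_bias(pred):
        # Average held-out error within each cluster, then average magnitudes.
        err = (y - pred)[test]
        n_test = np.bincount(cluster[test], minlength=n_clusters)
        mean_err = np.bincount(cluster[test], weights=err,
                               minlength=n_clusters) / n_test
        return float(np.mean(np.abs(mean_err)))

    return cluster_bias(blind_pred), cluster_bias(aware_pred)

if __name__ == "__main__":
    for k in (30, 300):
        blind, aware = compare(n_clusters=k)
        print(f"clusters={k}: blind bias={blind:.3f}, aware bias={aware:.3f}")
```

Under these assumptions, increasing `n_clusters` gives the pooled fit more data but no cluster information, so its per-cluster bias would be expected to stay roughly at the scale of the random effect, which is the qualitative point the abstract makes about data size.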


Updated: 2020-08-11
