On a Scalable Entropic Breaching of the Overfitting Barrier for Small Data Problems in Machine Learning
Neural Computation ( IF 2.7 ) Pub Date : 2020-08-01 , DOI: 10.1162/neco_a_01296
Illia Horenko
Overfitting and treatment of small data are among the most challenging problems in machine learning (ML), when a relatively small data statistics size T is not enough to provide a robust ML fit for a relatively large data feature dimension D. Deploying a massively parallel ML analysis of generic classification problems for different D and T, we demonstrate the existence of statistically significant linear overfitting barriers for common ML methods. The results reveal that for a robust classification of bioinformatics-motivated generic problems with the long short-term memory deep learning classifier (LSTM), one needs in the best case a statistics T that is at least 13.8 times larger than the feature dimension D. We show that this overfitting barrier can be breached at a 10⁻¹² fraction of the computational cost by means of the entropy-optimal scalable probabilistic approximations algorithm (eSPA), performing a joint solution of the entropy-optimal Bayesian network inference and feature space segmentation problems. Application of eSPA to experimental single cell RNA sequencing data exhibits a 30-fold classification performance boost when compared to standard bioinformatics tools and a 7-fold boost when compared to the deep learning LSTM classifier.
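To illustrate the kind of scheme the abstract describes, the following is a minimal toy sketch of the core idea: alternate between segmenting the feature space into K boxes and estimating per-box conditional class probabilities. This is a heavily simplified illustration under assumed design choices (hard assignments, plain k-means-style centroid updates, Laplace smoothing), not Horenko's eSPA algorithm; all function names are hypothetical.

```python
import numpy as np

def espa_like_fit(X, y, K=4, n_iter=20, seed=0):
    """Toy sketch of an eSPA-like scheme (NOT the published algorithm):
    (a) segment the feature space into K boxes, then
    (b) estimate conditional class probabilities per box."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    # Initialize box centroids from random data points
    centers = X[rng.choice(T, K, replace=False)].copy()
    labels = np.zeros(T, dtype=int)
    for _ in range(n_iter):
        # Hard affiliation: assign each point to its nearest box
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # Update centroids of non-empty boxes
        for k in range(K):
            mask = labels == k
            if mask.any():
                centers[k] = X[mask].mean(axis=0)
    classes = np.unique(y)
    # Per-box class probabilities with Laplace smoothing
    Lam = np.ones((K, classes.size))
    for k in range(K):
        mask = labels == k
        for j, c in enumerate(classes):
            Lam[k, j] += np.sum(y[mask] == c)
    Lam /= Lam.sum(axis=1, keepdims=True)
    return centers, Lam, classes

def espa_like_predict(X, centers, Lam, classes):
    """Classify points via their nearest box's class probabilities."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = np.argmin(dists, axis=1)
    return classes[np.argmax(Lam[labels], axis=1)]
```

In the actual eSPA formulation, segmentation and classification are solved jointly under an entropy regularization on the box probabilities, which is what drives the reported robustness in the small-T regime; the sketch above decouples the two steps for readability.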

Updated: 2020-08-01