Fitting elephants in modern machine learning by statistically consistent interpolation
Nature Machine Intelligence ( IF 18.8 ) Pub Date : 2021-05-19 , DOI: 10.1038/s42256-021-00345-8
Partha P. Mitra

Textbook wisdom advocates for smooth function fits and implies that interpolation of noisy data should lead to poor generalization. A related heuristic is that fitting parameters should be fewer than measurements (Occam’s razor). Surprisingly, contemporary machine learning approaches, such as deep nets, generalize well, despite interpolating noisy data. This may be understood via statistically consistent interpolation (SCI), that is, data interpolation techniques that generalize optimally for big data. Here, we elucidate SCI using the weighted interpolating nearest neighbours algorithm, which adds singular weight functions to k nearest neighbours. This shows that data interpolation can be a valid machine learning strategy for big data. SCI clarifies the relation between two ways of modelling natural phenomena: the rationalist approach (strong priors) of theoretical physics with few parameters, and the empiricist (weak priors) approach of modern machine learning with more parameters than data. SCI shows that the purely empirical approach can successfully predict. However, data interpolation does not provide theoretical insights, and the training data requirements may be prohibitive. Complex animal brains are between these extremes, with many parameters, but modest training data, and with prior structure encoded in species-specific mesoscale circuitry. Thus, modern machine learning provides a distinct epistemological approach that is different both from physical theories and animal brains.
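To make the idea concrete, here is a minimal sketch of a weighted interpolating nearest-neighbours predictor of the kind the abstract describes: ordinary k-nearest-neighbour averaging, augmented with a weight function that is singular at the data points so the fit passes exactly through every training label (it interpolates) while still averaging locally elsewhere. The specific power-law weight `||x - x_i||^(-alpha)` and the parameter names here are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def winn_predict(X_train, y_train, x, k=5, alpha=2.0, eps=1e-12):
    """Weighted interpolating nearest-neighbour prediction (illustrative sketch).

    Averages the k nearest training labels with singular weights
    ||x - x_i||^(-alpha), which diverge as x approaches a training point,
    so the predictor interpolates the training data exactly.
    """
    d = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
    idx = np.argsort(d)[:k]                   # indices of the k nearest neighbours
    di = d[idx]
    if di[0] < eps:
        # Query coincides with a training point: the singular weight
        # dominates, so return that point's label (exact interpolation).
        return float(y_train[idx[0]])
    w = di ** (-alpha)                        # singular power-law weights
    return float(np.dot(w, y_train[idx]) / w.sum())
```

Evaluated at a training input, the prediction reproduces that point's (possibly noisy) label; evaluated elsewhere, it is a locally weighted average dominated by the closest neighbours.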

A preprint version of the article is available at arXiv.


Updated: 2021-05-19