A strategy to apply machine learning to small datasets in materials science, npj Computational Materials


npj Computational Materials (IF 9.4), Pub Date: 2018-05-14, DOI: 10.1038/s41524-018-0081-z
Ying Zhang, Chen Ling

There is growing interest in applying machine learning techniques to materials science research. However, although materials datasets are recognized as typically smaller and sometimes more diverse than those in other fields, the influence of materials data availability on the training of machine learning models has not yet been studied, which hinders the establishment of accurate predictive rules from small materials datasets. Here we analyzed the fundamental interplay between the availability of materials data and the predictive capability of machine learning models. Instead of affecting the model precision directly, the effect of data size is mediated by the degree of freedom (DoF) of the model, resulting in an association between precision and DoF. The appearance of the precision–DoF association signals underfitting and is characterized by a large prediction bias, which consequently restricts accurate prediction in unknown domains. We proposed incorporating a crude estimate of the property into the feature space to establish ML models from small materials datasets, which increases prediction accuracy without the cost of a higher DoF. In three case studies, predicting the band gap of binary semiconductors, lattice thermal conductivity, and the elastic properties of zeolites, the integration of the crude estimate effectively boosted the predictive capability of machine learning models to state-of-the-art levels, demonstrating the generality of the proposed strategy for constructing accurate machine learning models from small materials datasets.
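
The idea of adding a crude property estimate as one extra feature can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the data are synthetic, the "crude estimate" is simulated as a biased, noisy copy of the target, and the model is a simple closed-form ridge regression rather than whatever models the authors used. The helper names `fit_ridge`, `predict`, and `loo_rmse` are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, alpha=1e-3):
    """Closed-form ridge regression with an unpenalized intercept column."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    penalty = alpha * np.eye(A.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)

def predict(w, X):
    """Apply fitted ridge weights (intercept first) to new rows."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    return A @ w

def loo_rmse(X, y, alpha=1e-3):
    """Leave-one-out RMSE: refit on n-1 samples, score the held-out one."""
    errors = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        w = fit_ridge(X[mask], y[mask], alpha)
        errors.append(predict(w, X[~mask])[0] - y[i])
    return float(np.sqrt(np.mean(np.square(errors))))

# Synthetic small dataset (assumption: 30 samples, 2 descriptors).
rng = np.random.default_rng(0)
n = 30
X = rng.uniform(-1.0, 1.0, size=(n, 2))

# Synthetic nonlinear "property" that a linear model underfits.
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2

# Simulated crude estimate: systematically biased and noisy, but correlated
# with the target (standing in for a cheap, low-accuracy calculation).
crude = 0.8 * y + 0.3 + rng.normal(0.0, 0.1, size=n)

# Baseline: descriptors only. Augmented: descriptors plus the crude estimate.
X_aug = np.column_stack([X, crude])
rmse_base = loo_rmse(X, y)
rmse_aug = loo_rmse(X_aug, y)

print(f"LOO RMSE, descriptors only:       {rmse_base:.3f}")
print(f"LOO RMSE, with crude estimate:    {rmse_aug:.3f}")
```

In this toy setting the augmented model adds a single input column instead of a more flexible model form, which is one way to read "without the cost of a higher DoF"; how the paper itself quantifies DoF is not stated in the abstract.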


Updated: 2018-05-14
