

An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models
Integrating Materials and Manufacturing Innovation (IF 2.4), Pub Date: 2021-06-09, DOI: 10.1007/s40192-021-00213-8
Jason R. Hattrick-Simpers, Brian DeCost, A. Gilad Kusne, Howie Joress, Winnie Wong-Ng, Debra L. Kaiser, Andriy Zakutayev, Caleb Phillips, Shijing Sun, Janak Thapa, Heshan Yu, Ichiro Takeuchi, Tonio Buonassisi

Modern machine learning and autonomous experimentation schemes in materials science rely on accurate analysis of the data ingested by these models. Unfortunately, accurate analysis of the underlying data can be difficult, even for domain experts, complicating the training of the models intended to drive experiments. This is especially true when the goal is to identify the presence of weak signatures in diffraction or spectroscopic datasets. In this work, we examine a set of as-obtained diffraction data that track the phase transition from monoclinic to tetragonal in a Nb-doped VO₂ film as a function of temperature and dopant concentration. We then task a set of domain experts and a set of machine learning experts with identifying which phase is present in each diffraction pattern, manually and algorithmically, respectively; in both cases, the labels can vary dramatically, especially at the phase boundaries. We use the mode of the labels and the Shannon entropy as a method to capture, preserve, and propagate consensus labels and their variance. Further, we use the expert labels as a benchmark and demonstrate the use of Shannon entropy weighted scoring to test the performance of machine learning generated labels. Finally, we propose a material data challenge centered around generating improved labeling algorithms. This real-world dataset, curated with expert labels, can act as a test bed for new algorithms. The raw data, annotations, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models.
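
The consensus-and-uncertainty idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the phase label strings ("M" for monoclinic, "T" for tetragonal), and in particular the weighting rule `1 - H / log2(n_classes)` are assumptions chosen for illustration, since the abstract does not specify how the entropy enters the score.

```python
import math
from collections import Counter


def consensus_label(labels):
    """Return the mode of the labels; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_weighted_accuracy(expert_labels, predicted, n_classes):
    """Score predictions against the expert consensus.

    Assumed weighting (hypothetical, not taken from the paper):
    each pattern contributes with weight 1 - H / log2(n_classes),
    so patterns on which the experts disagree count for less.
    """
    h_max = math.log2(n_classes)
    weighted_hits = 0.0
    total_weight = 0.0
    for labels, pred in zip(expert_labels, predicted):
        weight = 1.0 - shannon_entropy(labels) / h_max
        weighted_hits += weight * (pred == consensus_label(labels))
        total_weight += weight
    return weighted_hits / total_weight if total_weight else 0.0


# Hypothetical expert votes for three diffraction patterns.
expert_votes = [
    ["M", "M", "M", "M"],  # full agreement, entropy 0
    ["M", "T", "M", "T"],  # even split, maximum entropy for two classes
    ["T", "T", "T", "M"],  # mostly tetragonal
]
model_labels = ["M", "T", "M"]
score = entropy_weighted_accuracy(expert_votes, model_labels, n_classes=2)
print(f"entropy-weighted accuracy: {score:.3f}")
```

Under this assumed weighting, a pattern with an even expert split carries zero weight, so a model is neither rewarded nor penalized for its label there, while errors on patterns with strong expert agreement cost the most.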


Updated: 2021-06-09
