Resampling Techniques for Materials Informatics: Limitations in Crystal Point Groups Classification
Journal of Chemical Information and Modeling (IF 5.6), Pub Date: 2022-07-19, DOI: 10.1021/acs.jcim.2c00666
Abdulmohsen A Alsaui, Yousef A Alghofaili, Mohammed Alghadeer, Fahhad H Alharbi

Imbalanced data sets are pervasive in materials informatics and pose a challenge to the development of classification models. This work investigates crystal point group prediction as an example of an imbalanced classification problem in materials informatics. Multiple resampling and classification techniques were considered. The findings suggest that, as expected, the most influential parameter of the resampling algorithms is the one controlling the number of samples to omit (undersampling) or to generate synthetically (oversampling). Balancing enhances the classification performance on the minority class at the cost of fewer correct predictions for the majority class. Moreover, ideal balancing, in which the classes are made exactly equal in size, is not optimal; partial balancing should be performed instead. In this study, the best ratio of the minority to the majority class was found to be around two-thirds. The largest improvement in classification was obtained with random undersampling combined with the k-nearest neighbors and random forest classifiers.
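
As an illustration of partial balancing by random undersampling, the following is a minimal sketch, not the authors' implementation. It assumes a two-class problem and uses only NumPy; the function name `random_undersample` and its `ratio` and `seed` parameters are hypothetical names chosen for this example. The `ratio` argument is the target minority-to-majority class ratio, with a default of two-thirds to match the value reported above.

```python
import numpy as np


def random_undersample(X, y, ratio=2 / 3, seed=0):
    """Randomly drop majority-class rows until
    n_minority / n_majority is approximately `ratio`.

    Hypothetical helper for illustration; handles two classes only.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")

    labels, counts = np.unique(y, return_counts=True)
    if labels.size != 2:
        raise ValueError("this sketch handles exactly two classes")

    # Order the two labels by class size: smaller first, larger second.
    order = np.argsort(counts, kind="stable")
    minority, majority = labels[order]
    n_minority, n_majority = counts[order]

    # Number of majority samples to keep; never more than are available.
    n_keep = min(int(n_majority), int(round(n_minority / ratio)))

    rng = np.random.default_rng(seed)
    majority_idx = np.flatnonzero(y == majority)
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)

    # Keep every minority sample plus the sampled majority subset,
    # preserving the original row order.
    keep = np.sort(np.concatenate([np.flatnonzero(y == minority), kept_majority]))
    return X[keep], y[keep]
```

With 900 majority and 100 minority samples, a call with `ratio=2/3` keeps all 100 minority samples and 150 randomly chosen majority samples, whereas `ratio=1` would correspond to the exact balancing that the study found not to be optimal. Only the training split should be resampled, so that the classifier is still evaluated on data with the original class distribution.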

Updated: 2022-07-19