Abstract
Materials informatics is increasingly finding ways to exploit machine learning algorithms. Techniques such as decision trees, ensemble methods, support vector machines, and a variety of neural network architectures are used to predict likely material characteristics and property values. Supplemented with laboratory synthesis, applications of machine learning to compound discovery and characterization represent one of the most promising research directions in materials informatics. A shortcoming of this trend, in its current form, is a lack of standardized materials data sets on which to train, validate, and test model effectiveness. Applied machine learning research depends on benchmark data to make sense of its results. Fixed, predetermined data sets allow for rigorous model assessment and comparison. Machine learning publications that do not refer to benchmarks are often hard to contextualize and reproduce. In this data descriptor article, we present a collection of data sets of different material properties taken from the AFLOW database. We describe them, the procedures that generated them, and their use as potential benchmarks. We provide a compressed ZIP file containing the data sets and a GitHub repository of associated Python code. Finally, we discuss opportunities for future work incorporating the data sets and creating similar benchmark collections.
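The role of a fixed, predetermined split can be illustrated with a minimal sketch. This is not the procedure used to build the data sets described here. It assumes a hypothetical 70/15/15 train/validation/test partition and hypothetical record keys, and assigns each record to a split by hashing its key, so that every user who runs it recovers the same partition without sharing a random seed.

```python
import hashlib


def assign_split(key: str, train: float = 0.7, val: float = 0.15) -> str:
    """Deterministically map a record key to 'train', 'val', or 'test'.

    The 70/15/15 fractions are illustrative assumptions.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    u = int(digest[:8], 16) / 16**8
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"


def split_records(records):
    """Partition (key, target) pairs into fixed splits keyed by the record key."""
    splits = {"train": [], "val": [], "test": []}
    for key, target in records:
        splits[assign_split(key)].append((key, target))
    return splits


# Hypothetical toy records; the keys and values are placeholders, not AFLOW data.
records = [(f"compound_{i}", float(i)) for i in range(1000)]
splits = split_records(records)
```

Because the assignment depends only on the record key, adding new records later does not move existing records between splits, which keeps earlier reported results comparable.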
Acknowledgements
The authors gratefully acknowledge support from NSF CAREER Award DMR 1651668. The authors thank the creators of AFLOW for building the database and making its contents available for this article. In addition, the authors express their gratitude to the open-source software community for developing the excellent tools used in this research, including Python and the pandas, numpy, matplotlib, and sklearn Python libraries, among others.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Clement, C.L., Kauwe, S.K. & Sparks, T.D. Benchmark AFLOW Data Sets for Machine Learning. Integr Mater Manuf Innov 9, 153–156 (2020). https://doi.org/10.1007/s40192-020-00174-4
DOI: https://doi.org/10.1007/s40192-020-00174-4