Benchmark AFLOW Data Sets for Machine Learning

  • Technical Article
  • Integrating Materials and Manufacturing Innovation

Abstract

Materials informatics increasingly exploits machine learning algorithms. Techniques such as decision trees, ensemble methods, support vector machines, and a variety of neural network architectures are used to predict likely material characteristics and property values. Supplemented with laboratory synthesis, applications of machine learning to compound discovery and characterization represent one of the most promising research directions in materials informatics. A shortcoming of this trend, in its current form, is the lack of standardized materials data sets on which to train, validate, and test models. Applied machine learning research depends on benchmark data to make sense of its results. Fixed, predetermined data sets allow for rigorous model assessment and comparison. Machine learning publications that do not refer to benchmarks are often hard to contextualize and reproduce. In this data descriptor article, we present a collection of data sets of different material properties taken from the AFLOW database. We describe the data sets, the procedures that generated them, and their use as potential benchmarks. We provide a compressed ZIP file containing the data sets and a GitHub repository of associated Python code. Finally, we discuss opportunities for future work incorporating the data sets and creating similar benchmark collections.
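As a minimal illustration of why fixed, predetermined splits make results comparable, the sketch below assigns each record to a train, validation, or test partition by hashing its identifier, so every user who runs it obtains the same partition regardless of row order or random seed. This is not the procedure used to build the AFLOW benchmark files. The function `assign_split`, the split fractions, and the `compound_i` records with their property values are all hypothetical and chosen only for illustration.

```python
import hashlib
from statistics import mean


def assign_split(key: str, val_frac: float = 0.15, test_frac: float = 0.15) -> str:
    """Deterministically map a record key to 'train', 'val', or 'test'."""
    # Hash the key so the assignment depends only on the key itself,
    # not on row order or on a random seed.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical records: (identifier, property value). The values are made up.
records = [(f"compound_{i}", float(i % 7)) for i in range(200)]

splits = {"train": [], "val": [], "test": []}
for formula, value in records:
    splits[assign_split(formula)].append((formula, value))

# Baseline "model": always predict the mean of the training targets.
train_mean = mean(v for _, v in splits["train"])
test_targets = [v for _, v in splits["test"]]
baseline_mae = mean_absolute_error(test_targets, [train_mean] * len(test_targets))

print({name: len(rows) for name, rows in splits.items()})
print(f"baseline test MAE: {baseline_mae:.3f}")
```

Because the partition is a pure function of the identifier, any model evaluated on the same `test` records can be compared directly against this baseline score, which is the role a published benchmark split plays.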

Acknowledgements

The authors gratefully acknowledge support from NSF CAREER Award DMR 1651668. The authors thank the creators of AFLOW for building the database and for making its contents available for this article. In addition, the authors express their gratitude to the open-source software community for developing the excellent tools used in this research, including Python and the pandas, numpy, matplotlib, and sklearn Python libraries, among others.

Author information

Corresponding author

Correspondence to Taylor D. Sparks.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Clement, C.L., Kauwe, S.K. & Sparks, T.D. Benchmark AFLOW Data Sets for Machine Learning. Integr Mater Manuf Innov 9, 153–156 (2020). https://doi.org/10.1007/s40192-020-00174-4

  • DOI: https://doi.org/10.1007/s40192-020-00174-4
