ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science, Journal of Chemical Information and Modeling

ChemDataExtractor 2.0: Autopopulated Ontologies for Materials Science
Journal of Chemical Information and Modeling (IF 5.6), Pub Date: 2021-09-16, DOI: 10.1021/acs.jcim.1c00446
Juraj Mavračić (1,2), Callum J Court (1), Taketomo Isazawa (1), Stephen R Elliott (2), Jacqueline M Cole (1,3,4)

The ever-growing abundance of data found in heterogeneous sources, such as scientific publications, has forced the development of automated techniques for data extraction. While in the past, in the physical sciences domain, the focus has been on the precise extraction of individual properties, attention has recently been devoted to the extraction of higher-level relationships. Here, we present a framework for an automated population of ontologies. That is, the direct extraction of a larger group of properties linked by a semantic network. We exploit data-rich sources, such as tables within documents, and present a new model concept that enables data extraction for chemical and physical properties with the ability to organize hierarchical data as nested information. Combining these capabilities with automatically generated parsers for data extraction and forward-looking interdependency resolution, we illustrate the power of our approach via the automatic extraction of a crystallographic hierarchy of information. This includes 18 interrelated submodels of nested data, extracted from an evaluation set of scientific articles, yielding an overall precision of 92.2%, across 26 different journals. Our method and associated toolkit, ChemDataExtractor 2.0, offers a key step toward the seamless integration of primary literature sources into a data-driven scientific framework.
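To make the idea of nested information concrete, the sketch below models a small crystallographic record as one model containing another, and computes precision as the fraction of extracted records that are correct. It is an illustration only: it uses plain Python dataclasses, not the ChemDataExtractor API, and the class names, field names, and example values are assumptions chosen for this sketch rather than anything taken from the paper.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class LatticeParameters:
    # Hypothetical submodel: unit-cell edge lengths in angstroms.
    a: Optional[float] = None
    b: Optional[float] = None
    c: Optional[float] = None


@dataclass
class CrystalStructure:
    # Hypothetical parent model that nests the submodel above.
    compound: str
    space_group: Optional[str] = None
    lattice: LatticeParameters = field(default_factory=LatticeParameters)


def precision(true_positives: int, false_positives: int) -> float:
    """Return the fraction of extracted records that are correct."""
    extracted = true_positives + false_positives
    if extracted == 0:
        raise ValueError("no extracted records to score")
    return true_positives / extracted


# Illustrative record, not data from the paper's evaluation set.
record = CrystalStructure(
    compound="NaCl",
    space_group="Fm-3m",
    lattice=LatticeParameters(a=5.64, b=5.64, c=5.64),
)

# asdict() recurses into nested dataclasses, giving a nested dictionary.
nested = asdict(record)
```

Here `nested["lattice"]` is itself a dictionary, which mirrors how a parent property can carry a group of related child properties instead of a flat list of unrelated values.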

Updated: 2021-09-27
