A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing
arXiv - PHYS - Soft Condensed Matter. Pub Date: 2022-09-27, DOI: arxiv-2209.13136
Pranav Shetty, Arunkumar Chitteth Rajan, Christopher Kuenneth, Sonakshi Gupta, Lakshmi Prerana Panchumarti, Lauren Holm, Chao Zhang, Rampi Ramprasad

The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from published literature. We used natural language processing (NLP) methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, on 2.4 million materials science abstracts; when used as the text encoder, it outperforms other baseline models on three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells to recover non-trivial insights. The data extracted through our pipeline is made available through a web platform at https://polymerscholar.org, which can be used to conveniently locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with a complete set of extracted material property information.

Updated: 2022-09-28