Classification of Journal Articles in a Search for New Experimental Thermophysical Property Data: A Case Study.
Integrating Materials and Manufacturing Innovation (IF 2.4), Pub Date: 2017-06-07, DOI: 10.1007/s40192-017-0096-1
Adele P. Peskin, Alden A. Dima

We present a case study in which we use natural language processing and machine learning techniques to automatically select candidate scientific articles that may contain new experimental thermophysical property data from thousands of articles available in five relevant journals. The National Institute of Standards and Technology (NIST) Thermodynamics Research Center (TRC) maintains a large database of thermophysical property data extracted from articles that are manually selected for content. Over time, the number of articles requiring manual inspection has grown, and assistance from machine-based methods is needed. Previous work used topic modeling along with classification techniques to classify these journal articles into those with data for the TRC database and those without. These techniques have produced classifications with accuracy between 85% and 90%. However, the TRC does not want to lose data from misclassified articles that contain relevant information. In this study, we start with these topic modeling and classification techniques, and then enhance the model using information relevant to the TRC’s selection process. Our goal is to minimize the number of articles that require manual selection without missing articles of importance. Through a series of selection methods, we eliminate those articles for which we can determine a rejection criterion. We can reduce the number of articles that are not of interest by 70.8% while retaining 98.7% of the articles of interest. We have also found that topic model classification improves when the corpus of words is derived from specific sections of the articles rather than the entire articles, and we improve on our classification by using a combination of topic models from different sections of the article. Our best classification used only the Experimental and Literature Cited sections.
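
The two figures quoted in the abstract (share of uninteresting articles eliminated, share of interesting articles retained) can be illustrated with a minimal sketch of a rejection cascade. This is not the authors' implementation: the `Article` fields, the two rejection rules, and the toy data below are all hypothetical and exist only to show how the two percentages would be computed from labeled articles.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Article:
    title: str
    experimental_text: str  # hypothetical: text of the Experimental section
    of_interest: bool       # hypothetical ground-truth label from manual selection


# A rejection rule returns True when an article can be discarded.
RejectionRule = Callable[[Article], bool]


def apply_cascade(articles: List[Article], rules: List[RejectionRule]) -> List[Article]:
    """Keep only the articles that no rejection rule eliminates."""
    return [a for a in articles if not any(rule(a) for rule in rules)]


def evaluate(articles: List[Article], kept: List[Article]) -> Tuple[float, float]:
    """Return (share of uninteresting articles removed, share of interesting articles retained)."""
    total_pos = sum(a.of_interest for a in articles)
    total_neg = len(articles) - total_pos
    kept_pos = sum(a.of_interest for a in kept)
    kept_neg = len(kept) - kept_pos
    reduction = (total_neg - kept_neg) / total_neg if total_neg else 0.0
    retention = kept_pos / total_pos if total_pos else 1.0
    return reduction, retention


# Hypothetical rejection criteria, invented for illustration only.
def no_experimental_section(a: Article) -> bool:
    return not a.experimental_text.strip()


def simulation_only(a: Article) -> bool:
    text = a.experimental_text.lower()
    return "simulation" in text and "measured" not in text


# Toy data, not taken from the study.
articles = [
    Article("Density of ionic liquids", "Densities were measured with a densimeter.", True),
    Article("Viscosity of binary mixtures", "Viscosity was measured at 298 K.", True),
    Article("Molecular dynamics of water", "Simulation with a classical force field.", False),
    Article("Review of equations of state", "", False),
    Article("Catalyst synthesis", "Samples were measured by XRD.", False),
]

kept = apply_cascade(articles, [no_experimental_section, simulation_only])
reduction, retention = evaluate(articles, kept)
print(f"uninteresting removed: {reduction:.1%}, interesting retained: {retention:.1%}")
```

In this framing, the goal stated in the abstract corresponds to pushing `reduction` as high as possible while keeping `retention` close to 100%, since a discarded article of interest is data lost to the database.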

Updated: 2017-06-07