Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents
Information Processing & Management (IF 8.6), Pub Date: 2020-05-30, DOI: 10.1016/j.ipm.2020.102269
Iqra Safder, Saeed-Ul Hassan, Anna Visvizi, Thanapon Noraset, Raheel Nawaz, Suppawong Tuarob

Advances in search engines for traditional text documents have enabled effective, resource-efficient retrieval of massive amounts of textual information. However, such conventional search methodologies often suffer from poor retrieval accuracy, especially when documents exhibit unique properties that call for specialized and deeper semantic extraction. Recently, AlgorithmSeer, a search engine for algorithms, has been proposed; it extracts pseudo-codes and shallow textual metadata from scientific publications and treats them as traditional documents so that conventional search engine methodology can be applied. However, such a system fails to support user queries that seek algorithm-specific information, such as the datasets on which algorithms operate, the performance of algorithms, and their runtime complexity. In this paper, a set of enhancements to the previously proposed algorithm search engine is presented. Specifically, we propose a set of methods to automatically identify and extract algorithmic pseudo-codes and the sentences that convey related algorithmic metadata using a set of machine-learning techniques. In an experiment with over 93,000 text lines, we introduce 60 novel features, comprising content-based, font-style-based, and structure-based feature groups, to extract algorithmic pseudo-codes. Our proposed pseudo-code extraction method achieves a 93.32% F1-score, outperforming the state-of-the-art techniques by 28%. Additionally, we propose a method to extract algorithm-related sentences using deep neural networks and achieve an accuracy of 78.5%, outperforming a rule-based model and a support vector machine model by 28% and 16%, respectively.
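
For readers unfamiliar with the metric reported in the abstract, the F1-score is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation for a line-level classifier that labels each text line as pseudo-code or not. The function name and the counts are hypothetical, chosen only for illustration; they are not taken from the paper and do not reproduce its results.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1-score, the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives  # lines labeled as pseudo-code
    actual = true_positives + false_negatives     # lines that really are pseudo-code
    if predicted == 0 or actual == 0:
        return 0.0  # precision or recall is undefined; report 0 by convention
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not from the paper).
example = f1_score(true_positives=90, false_positives=10, false_negatives=5)
```

Because F1 combines both error types, a classifier cannot score well by labeling nearly every line as pseudo-code (high recall, low precision) or by labeling almost none (the reverse).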




Updated: 2020-05-30