Locating and parsing bibliographic references in HTML medical articles.
International Journal on Document Analysis and Recognition (IF 2.3), Pub Date: 2010-01-16, DOI: 10.1007/s10032-009-0105-9
Jie Zou, Daniel Le, George R. Thoma

The set of references that typically appear toward the end of journal articles is sometimes, though not always, a field in bibliographic (citation) databases. But even if references do not constitute such a field, they can be useful as a preprocessing step in the automated extraction of other bibliographic data from articles, as well as in computer-assisted indexing of articles. Automation in data extraction and indexing to minimize human labor is key to the affordable creation and maintenance of large bibliographic databases. Extracting the components of references, such as author names, article title, journal name, publication date and other entities, is therefore a valuable and sometimes necessary task. This paper describes a two-step process that uses statistical machine learning algorithms first to locate the references in HTML medical articles and then to parse them. Reference locating identifies the reference section in an article and then decomposes it into individual references. We formulate this step as a two-class classification problem based on text and geometric features. An evaluation conducted on 500 articles drawn from 100 medical journals achieves near-perfect precision and recall rates for locating references. Reference parsing identifies the components of each reference. For this second step, we implement and compare two algorithms. One relies on sequence statistics and trains a Conditional Random Field. The other focuses on local feature statistics and trains a Support Vector Machine to classify each individual word, followed by a search algorithm that systematically corrects low-confidence labels if the label sequence violates a set of predefined rules. The overall performance of these two reference-parsing algorithms is about the same: above 99% accuracy at the word level, and over 97% accuracy at the chunk level.
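The second parsing approach (per-word classification followed by a rule-driven correction search) can be illustrated with a small sketch. This is not the authors' implementation: the confidence scores are hand-written stand-ins for classifier output, the single rule (each field forms one contiguous chunk) is a hypothetical example of a "predefined rule", and the names `chunks`, `is_valid`, `correct`, and `max_flips` are invented for illustration.

```python
from itertools import groupby, product

def chunks(labels):
    """Collapse a word-level label sequence into (label, length) chunks."""
    return [(lab, len(list(run))) for lab, run in groupby(labels)]

def is_valid(labels):
    """Hypothetical rule: every field label forms a single contiguous chunk."""
    seen = [lab for lab, _ in chunks(labels)]
    return len(seen) == len(set(seen))

def correct(scores, max_flips=3):
    """Relabel the least confident words until the sequence obeys the rule.

    scores: one dict per word, mapping each label to a confidence value.
    Only the `max_flips` least confident positions may change; among the
    valid candidates, the one with the highest summed confidence wins.
    If no valid candidate exists, the original labels are returned.
    """
    labels = [max(s, key=s.get) for s in scores]  # per-word best guess
    if is_valid(labels):
        return labels
    # Positions ordered from least to most confident.
    order = sorted(range(len(labels)), key=lambda i: scores[i][labels[i]])
    weakest = order[:max_flips]
    best, best_score = labels, float("-inf")
    for combo in product(*(scores[i].keys() for i in weakest)):
        cand = list(labels)
        for i, lab in zip(weakest, combo):
            cand[i] = lab
        if is_valid(cand):
            total = sum(scores[i][cand[i]] for i in weakest)
            if total > best_score:
                best, best_score = cand, total
    return best

# Made-up confidences for five words of one reference.
scores = [
    {"AUTHOR": 0.90, "TITLE": 0.05, "JOURNAL": 0.05},
    {"AUTHOR": 0.80, "TITLE": 0.10, "JOURNAL": 0.10},
    {"AUTHOR": 0.05, "TITLE": 0.90, "JOURNAL": 0.05},
    {"AUTHOR": 0.45, "TITLE": 0.40, "JOURNAL": 0.15},  # weak, wrong guess
    {"AUTHOR": 0.05, "TITLE": 0.05, "JOURNAL": 0.90},
]
fixed = correct(scores, max_flips=2)
print(fixed)          # ['AUTHOR', 'AUTHOR', 'TITLE', 'TITLE', 'JOURNAL']
print(chunks(fixed))  # [('AUTHOR', 2), ('TITLE', 2), ('JOURNAL', 1)]
```

The `chunks` helper also hints at why chunk-level accuracy is lower than word-level accuracy: a chunk counts as correct only if every word in it is labeled correctly, so a single word error can invalidate a whole field.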

Updated: 2010-01-16