Learning keyphrases from corpora and knowledge models
Natural Language Engineering (IF 2.3), Pub Date: 2019-09-10, DOI: 10.1017/s1351324919000342
R. Silveira, V. Furtado, V. Pinheiro

Keyphrase extraction systems traditionally use classification algorithms and do not account for the fact that some keyphrases may not appear in the text, which limits the accuracy of such algorithms a priori. In this work, we propose to improve the accuracy of these systems with inferential mechanisms that use a knowledge representation model, including symbolic models of knowledge bases and distributional semantics, to expand the set of keyphrase candidates submitted to the classification algorithm with terms that are not in the text (not-in-text terms). Our basic assumption is that not-in-text terms have a semantic relationship with terms that are in the text. To represent this relationship, we define two new features that serve as input to the classification algorithms. The first feature captures the discriminative power of the inferred not-in-text terms. The intuition is that good keyphrase candidates are those deduced from several textual terms in a specific document and not often deduced in other documents. The second feature represents the descriptive strength of a not-in-text candidate. We argue that not-in-text keyphrases must have a strong semantic relationship with the text and that the strength of this relationship can be measured in a way similar to popular metrics such as TFxIDF. The proposed method was compared with state-of-the-art systems on five corpora, and the results show that it significantly improves automatic keyphrase extraction by addressing the inability of earlier systems to extract keyphrases absent from the text.
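
The abstract does not give formulas for the two features, so the following is only a minimal sketch of the TFxIDF-like intuition it describes, not the authors' method. Everything in it is an assumption made for illustration: the toy `KNOWLEDGE` map standing in for a knowledge model, the function names `infer_candidates` and `score_candidates`, and the scoring rule itself (number of in-text terms that infer a candidate, multiplied by the log of the inverse fraction of documents in which that candidate is inferred).

```python
import math
from collections import Counter

# Hypothetical toy knowledge model: in-text term -> related concepts.
# A real system would query a knowledge base or a distributional model.
KNOWLEDGE = {
    "neuron": ["neural network", "biology"],
    "backpropagation": ["neural network", "optimization"],
    "gradient": ["optimization", "calculus"],
    "cell": ["biology"],
    "protein": ["biology"],
}


def infer_candidates(terms):
    """Count how many distinct in-text terms infer each not-in-text concept."""
    present = set(terms)
    counts = Counter()
    for term in present:
        for concept in KNOWLEDGE.get(term, []):
            if concept not in present:  # keep only not-in-text candidates
                counts[concept] += 1
    return counts


def score_candidates(corpus):
    """Return, per document, a dict mapping candidate -> TFxIDF-like score."""
    inferred = [infer_candidates(doc) for doc in corpus]
    n_docs = len(corpus)

    # Number of documents in which each candidate is inferred at least once.
    doc_freq = Counter()
    for counts in inferred:
        doc_freq.update(counts.keys())

    scores = []
    for counts in inferred:
        scores.append({
            # Many supporting terms in this document raise the score;
            # being inferred in many other documents lowers it.
            cand: support * math.log(n_docs / doc_freq[cand])
            for cand, support in counts.items()
        })
    return scores


corpus = [
    ["neuron", "backpropagation", "gradient"],
    ["cell", "protein"],
    ["gradient"],
]
scores = score_candidates(corpus)
for doc, doc_scores in zip(corpus, scores):
    ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
    print(doc, "->", ranked)
```

In this toy corpus, "neural network" ranks first for the first document because two of its terms infer it and no other document does, which mirrors the stated intuition. In the described approach such values would be fed to a classifier as features alongside others, rather than used directly as a ranking.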

Updated: 2019-09-10