TripleRank: An unsupervised keyphrase extraction algorithm, Knowledge-Based Systems



Knowledge-Based Systems (IF 7.2), Pub Date: 2021-02-19, DOI: 10.1016/j.knosys.2021.106846
Tuohang Li, Liang Hu, Hongtu Li, Chengyu Sun, Shuai Li, Ling Chi

Automatic keyphrase extraction algorithms aim to identify the words and phrases that carry the core information in a document. As online scholarly resources have become widespread in recent years, better keyphrase extraction techniques are required to improve search efficiency. We present two features, keyphrase semantic diversity and keyphrase coverage, to overcome limitations of existing methods for unsupervised keyphrase extraction. Keyphrase semantic diversity is the degree of semantic variety in the extraction result; it is introduced to avoid extracting synonymous phrases that contain the same high-scoring candidate. Keyphrase coverage refers to how well a candidate represents the other words in the document. We propose an unsupervised keyphrase extraction method called TripleRank, which evaluates three features: word position (a sensitive feature for academic documents) and the two new features introduced above. The architecture of TripleRank comprises three sub-models, one scoring each feature, and a summing model. Although it involves multiple models, TripleRank has no typical iteration process; hence, its computational cost is relatively low. In experiments on four academic datasets, TripleRank achieved the best results when compared with four state-of-the-art baseline models, which confirms the influence of keyphrase semantic diversity and keyphrase coverage and demonstrates the efficiency of our method.
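The abstract describes the architecture only at a high level (three feature scorers plus a summing model) and gives no formulas. The following Python sketch illustrates that general shape under stated assumptions; it is not the authors' implementation. The reciprocal-position score, the window-based coverage proxy, the Jaccard-overlap diversity penalty, the equal weights, and the greedy selection loop are all hypothetical choices made for illustration, as are the function names.

```python
def position_score(phrase, tokens):
    """Assumed form: reciprocal of the 1-based index of the first occurrence."""
    words = phrase.split()
    n = len(words)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == words:
            return 1.0 / (i + 1)
    return 0.0  # phrase does not occur in the document


def coverage_score(phrase, tokens, window=3):
    """Assumed proxy: share of distinct document words that appear within
    `window` tokens of any word of the phrase."""
    words = set(phrase.split())
    vocab = set(tokens)
    covered = set()
    for i, tok in enumerate(tokens):
        if tok in words:
            covered.update(tokens[max(0, i - window): i + window + 1])
    return len(covered) / len(vocab) if vocab else 0.0


def diversity_score(phrase, selected):
    """Assumed proxy: 1 minus the largest Jaccard word overlap with any
    phrase that has already been selected."""
    words = set(phrase.split())
    overlaps = [
        len(words & set(s.split())) / len(words | set(s.split()))
        for s in selected
    ]
    return 1.0 - max(overlaps, default=0.0)


def rank_keyphrases(candidates, tokens, top_k=3, weights=(1.0, 1.0, 1.0)):
    """Greedy selection by the weighted sum of the three scores.
    The greedy loop is an illustrative assumption, not taken from the paper."""
    w_pos, w_cov, w_div = weights
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_k:
        best = max(
            remaining,
            key=lambda p: (
                w_pos * position_score(p, tokens)
                + w_cov * coverage_score(p, tokens)
                + w_div * diversity_score(p, selected)
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy document and candidate phrases, invented for this example.
tokens = (
    "keyphrase extraction finds core phrases in documents "
    "unsupervised keyphrase extraction needs no labels"
).split()
candidates = [
    "keyphrase extraction",
    "unsupervised keyphrase extraction",
    "core phrases",
]
print(rank_keyphrases(candidates, tokens, top_k=2))
```

In this toy run, the diversity term lowers the score of "unsupervised keyphrase extraction" once "keyphrase extraction" has been chosen, which mirrors the stated goal of avoiding near-synonymous phrases built around the same high-scoring candidate.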


Updated: 2021-03-04
