Inside Importance Factors of Graph-Based Keyword Extraction on Chinese Short Text
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 1.8), Pub Date: 2020-06-22, DOI: 10.1145/3388971
Junjie Chen, Hongxu Hou, Jing Gao

Keywords are considered to be the important words in a text and can provide a concise representation of it. With the surge of unlabeled short text on the Internet, the automatic keyword extraction task has proven useful in other information processing applications. Graph-based approaches are the prevalent unsupervised models for this task. However, most of these methods emphasize the relations between words without considering other importance factors. Furthermore, when measuring the importance of a word in a text, the damping factor is conventionally set to 0.85, following PageRank. To the best of our knowledge, no existing work investigates the impact of the damping factor on the keyword extraction task. In addition, there are few publicly available labeled Chinese short-text datasets for this task. In this article, we investigate the factors that make up a word's importance in a given document and propose an improved graph-based method for keyword extraction from short documents. Moreover, we analyze the impact of these importance factors on performance. We also provide annotated long and short Chinese datasets for this task. The model is evaluated on Chinese and English datasets, and the results show that it improves on previous unsupervised models on short documents. Comparative experiments show that the damping factor is related to text length, a relationship that traditional methods neglect.
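To make the role of the damping factor concrete, here is a minimal sketch of a generic TextRank-style baseline, not the improved method proposed in the article. It assumes the text has already been split into tokens (Chinese text would first need a word segmenter), and the names `cooccurrence_graph`, `textrank`, `window`, and `damping` are illustrative choices rather than anything taken from the paper.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected weighted graph linking words that
    co-occur within `window` positions of each other."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            v = tokens[j]
            if v != w:  # skip self-loops
                graph[w][v] += 1.0
                graph[v][w] += 1.0
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """Score words by weighted PageRank power iteration.

    `damping` is the parameter usually fixed at 0.85; it is exposed
    here so that its effect can be compared across values.
    """
    nodes = list(graph)
    if not nodes:
        return {}
    n = len(nodes)
    score = {w: 1.0 / n for w in nodes}
    # Total edge weight leaving each node (always > 0 for graph nodes).
    out_weight = {w: sum(graph[w].values()) for w in nodes}
    for _ in range(iterations):
        new_score = {}
        for w in nodes:
            incoming = sum(
                score[v] * graph[v][w] / out_weight[v] for v in graph[w]
            )
            new_score[w] = (1.0 - damping) / n + damping * incoming
        score = new_score
    return score


# Hypothetical pre-tokenized short text, for illustration only.
tokens = ["graph", "keyword", "extraction", "graph", "ranking", "keyword"]
graph = cooccurrence_graph(tokens)
for d in (0.5, 0.85):
    scores = textrank(graph, damping=d)
    ranked = sorted(scores, key=scores.get, reverse=True)
    print(d, ranked[:3])
```

Running the loop with several `damping` values on texts of different lengths is one simple way to probe the kind of sensitivity the abstract describes; the sketch itself makes no claim about which value works best.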

Updated: 2020-06-22