Unsupervised Derivation of Keyword Summary for Short Texts, ACM Transactions on Internet Technology

Unsupervised Derivation of Keyword Summary for Short Texts
ACM Transactions on Internet Technology (IF 3.9) Pub Date: 2020-07-07, DOI: 10.1145/3397162
Bin Cao₁, Jiawei Wu₁, Sichao Wang₂, Jing Fan₁, Honghao Gao₃, Shuiguang Deng₄, Jianwei Yin₄, Xuan Liu₅

Automatically summarizing a group of short texts that mainly share one topic is a fundamental task in many applications, e.g., summarizing the main symptoms of a disease from a group of medical texts, which are usually short. Conventional unsupervised short text summarization techniques tend to select the most representative short text document. However, this may cause privacy issues, e.g., personal information in the medical texts may be exposed. Moreover, compared with a complete short text, which may contain unimportant words, a summary consisting of only a few keywords is preferable to users because of its clear and concise form. For these reasons, this paper addresses the problem of unsupervised derivation of a keyword summary for short texts. Existing keyword extraction methods such as LDA cannot be applied to this problem because (1) they ignore the ordering relations among the extracted keywords, which makes it hard for readers to capture the main idea of the event; and (2) short texts contain limited context, which makes it hard to find the optimal words for semantic coverage. Hence, we propose a simple yet effective method named Frequent Closed Wordsets Ranking (FCWRank) to derive the keyword summary from a short text cluster. FCWRank is an unsupervised method that builds on the idea of frequent closed itemset mining in transaction databases. FCWRank first mines all frequent closed wordsets from a cluster of short texts, and then selects the most important wordset based on an importance model that simultaneously considers the similarity between closed wordsets and the relation between each closed wordset and the short text documents. Experiments on real-world short text collections show that FCWRank outperforms the state-of-the-art baselines in terms of ROUGE-L F1, precision, and recall scores.
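To make the first step concrete, the following is a minimal sketch of frequent closed wordset mining over a cluster of short texts. It is not the authors' implementation: the abstract does not describe the mining algorithm, so a simple level-wise (Apriori-style) enumeration is assumed here, and the function name `frequent_closed_wordsets`, the whitespace tokenization, the `min_support` threshold, and the example texts are all hypothetical. The ranking step is omitted because the abstract does not specify the importance model.

```python
from itertools import combinations


def frequent_closed_wordsets(docs, min_support):
    """Return {wordset: support} for every frequent closed wordset.

    Hypothetical helper for illustration only. Each short text is treated
    as a transaction (a set of words). The support of a wordset is the
    number of texts that contain all of its words.
    """
    # Assumption: plain lowercase whitespace tokenization.
    transactions = [frozenset(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*transactions))

    frequent = {}
    size = 1
    candidates = [frozenset([w]) for w in vocab]
    while candidates:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                level[cand] = support
        frequent.update(level)

        # Level-wise candidate generation: join frequent sets of the
        # current size whose union is exactly one word larger.
        size += 1
        candidates = list(
            {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        )

    # A wordset is closed if no proper superset has the same support.
    # Any such superset would itself be frequent, so checking against
    # the frequent sets is sufficient.
    return {
        ws: s
        for ws, s in frequent.items()
        if not any(ws < other and s == frequent[other] for other in frequent)
    }


# Made-up example cluster, not data from the paper.
example_docs = [
    "fever cough headache",
    "fever cough",
    "fever cough fatigue",
    "headache fatigue",
]
closed = frequent_closed_wordsets(example_docs, min_support=2)
for wordset, support in closed.items():
    print(sorted(wordset), support)
```

In this toy cluster, "fever" and "cough" always occur together, so neither single word is closed and only the pair is kept. This is the property that makes closed wordsets a compact set of candidates for a keyword summary. The enumeration is exponential in the worst case, so a dedicated closed-itemset miner would be a better fit for real collections.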

Updated: 2020-07-07
