Jointly Learning Topics in Sentence Embedding for Document Summarization
IEEE Transactions on Knowledge and Data Engineering (IF 8.9), Pub Date: 2020-04-01, DOI: 10.1109/tkde.2019.2892430
Yang Gao, Yue Xu, Heyan Huang, Qian Liu, Linjing Wei, Luyang Liu

Summarization systems for applications such as opinion mining, online news services, and question answering have attracted increasing attention in recent years. These tasks are complex, and the classic bag-of-words representation does not adequately meet the needs of applications that rely on sentence extraction. In this paper, we focus on representing sentences as continuous vectors as a basis for measuring the relevance between user needs and candidate sentences in source documents. Embedding models based on distributed vector representations are widely used in the summarization community because, through cosine similarity, they simplify the computation of sentence relevance when comparing two sentences, or a sentence/query and a document. However, vector-based embedding models typically do not account for the salience of a sentence, which is essential for document summarization. To incorporate sentence salience, we developed a model, called CCTSenEmb, that learns latent discriminative Gaussian topics in the embedding space, and we extended the framework by seamlessly incorporating both topic and sentence embeddings into one summarization system. To promote semantic coherence between sentences in the prediction-based sentence-embedding task, CCTSenEmb further considers the associations between neighboring sentences. As a result, this novel sentence embedding framework combines sentence representations, word-based content, and topic assignments to predict the representation of the next sentence. A series of experiments on the DUC datasets validates the efficacy of CCTSenEmb for document summarization in both a query-focused extraction-based setting and an unsupervised ILP-based setting.
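For concreteness, the sketch below illustrates the cosine-similarity relevance scoring the abstract refers to, assuming query and candidate sentence embeddings are already available as dense vectors. The random vectors, the 100-dimensional size, and the cosine_similarity helper are illustrative placeholders, not outputs or components of CCTSenEmb.

```python
# Minimal sketch: ranking candidate sentences by cosine similarity to a query
# embedding. The embeddings below are random placeholders, not CCTSenEmb output.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
dim = 100                                   # embedding dimensionality (assumed)
query_vec = rng.normal(size=dim)            # embedding of the user query
sentence_vecs = rng.normal(size=(5, dim))   # embeddings of candidate sentences

# Rank candidate sentences by relevance to the query.
scores = [cosine_similarity(query_vec, s) for s in sentence_vecs]
ranking = np.argsort(scores)[::-1]
print("sentence ranking (most relevant first):", ranking.tolist())
```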

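The abstract also describes learning latent Gaussian topics in the embedding space. The following sketch shows the general idea of Gaussian topic assignment over sentence vectors, with invented means, covariances, and topic weights; it is not the paper's inference procedure, only an illustration of modeling topics as Gaussians over embeddings.

```python
# Hedged illustration of Gaussian topics over sentence embeddings: each topic
# is a Gaussian in the embedding space, and a sentence is soft-assigned to
# topics by its (weighted) likelihood. All parameters are invented.
import numpy as np
from scipy.stats import multivariate_normal

dim, n_topics = 4, 3
rng = np.random.default_rng(1)

topic_means = rng.normal(size=(n_topics, dim))               # topic centers (assumed)
topic_covs = [np.eye(dim) * 0.5 for _ in range(n_topics)]    # spherical covariances (assumed)
topic_weights = np.full(n_topics, 1.0 / n_topics)            # uniform topic prior (assumed)

sentence_vec = rng.normal(size=dim)                          # a sentence embedding

# Responsibility of each topic for this sentence.
likelihoods = np.array([
    w * multivariate_normal.pdf(sentence_vec, mean=m, cov=c)
    for w, m, c in zip(topic_weights, topic_means, topic_covs)
])
responsibilities = likelihoods / likelihoods.sum()
print("topic responsibilities:", np.round(responsibilities, 3))
print("assigned topic:", int(responsibilities.argmax()))
```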
Updated: 2020-04-01