A semantic approach to extractive multi-document summarization: Applying sentence expansion for tuning of conceptual densities
Information Processing & Management (IF 8.6), Pub Date: 2020-06-25, DOI: 10.1016/j.ipm.2020.102341
Mohammad Bidoki, Mohammad R. Moosavi, Mostafa Fakhrahmad

Today, given the vast amount of textual data, automated extractive text summarization is one of the most common and practical techniques for organizing information. Extractive summarization selects the most appropriate sentences from the text and provides a representative summary. However, sentences, as individual textual units, are usually too short for major text processing techniques to perform well. Hence, it seems vital to bridge the gap between short text units and conventional text processing methods.

In this study, we propose a semantic method for implementing an extractive multi-document summarization system that combines statistical, machine-learning-based, and graph-based methods. The system is language-independent and unsupervised. The proposed framework learns the semantic representation of words from a set of given documents via the word2vec method. It then expands each sentence, through an innovative method, with the most informative and least redundant words related to the main topic of the sentence. Sentence expansion implicitly performs word sense disambiguation and tunes the conceptual densities towards the central topic of each sentence. Next, the system estimates the importance of sentences by using the graph representation of the documents. To identify the most important topics of the documents, we propose an inventive clustering approach that autonomously determines the number of clusters and their initial centroids, and clusters the sentences accordingly. Finally, the system selects the best sentences from the appropriate clusters for the final summary with respect to information salience, minimum redundancy, and adequate coverage.
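The abstract does not spell out the expansion or ranking procedures, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes each sentence is represented by the centroid of its word vectors, that expansion appends the vocabulary words closest to that centroid while skipping near-duplicates of words already added, and that sentence importance is approximated by weighted degree in a cosine-similarity graph. The names `expand_sentence`, `degree_scores`, `max_redundancy`, and the hand-made `TOY_VECTORS` (standing in for trained word2vec embeddings) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(words, vectors):
    """Mean vector of the words that have an embedding; None if none do."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def expand_sentence(words, vectors, k=1, max_redundancy=0.95):
    """Append up to k vocabulary words closest to the sentence centroid.

    Assumption: a candidate is rejected as redundant when its cosine
    similarity to an already added word reaches max_redundancy.
    """
    center = centroid(words, vectors)
    if center is None:
        return list(words)
    candidates = sorted(
        (w for w in vectors if w not in words),
        key=lambda w: cosine(vectors[w], center),
        reverse=True,
    )
    added = []
    for w in candidates:
        if len(added) >= k:
            break
        if all(cosine(vectors[w], vectors[a]) < max_redundancy for a in added):
            added.append(w)
    return list(words) + added


def degree_scores(sentences, vectors):
    """Score each sentence by the sum of its similarities to all other sentences.

    Assumption: weighted degree centrality stands in for the unspecified
    graph-based importance estimate.
    """
    centers = [centroid(s, vectors) for s in sentences]
    scores = []
    for i, ci in enumerate(centers):
        total = 0.0
        for j, cj in enumerate(centers):
            if i != j and ci is not None and cj is not None:
                total += cosine(ci, cj)
        scores.append(total)
    return scores


# Hand-made 2-D vectors for illustration only (not trained embeddings).
TOY_VECTORS = {
    "bank": [0.9, 0.1], "money": [0.8, 0.2], "loan": [0.85, 0.15],
    "river": [0.1, 0.9], "water": [0.2, 0.8], "shore": [0.15, 0.85],
}

sentences = [["bank", "money"], ["loan", "money"], ["river", "water"]]
expanded = [expand_sentence(s, TOY_VECTORS) for s in sentences]
scores = degree_scores(expanded, TOY_VECTORS)
```

In this toy setting the two finance sentences reinforce each other in the graph, so the river sentence receives the lowest score. A fuller pipeline would replace `TOY_VECTORS` with embeddings trained on the input documents and follow the ranking with clustering and redundancy-aware selection.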

A set of extensive experiments on the DUC2002 and DUC2006 datasets was conducted to investigate the proposed scheme. Experimental results showed that the proposed sentence expansion algorithm and clustering approach could considerably enhance the performance of the summarization system. Comparative experiments also demonstrated that the proposed framework outperforms most state-of-the-art summarization systems and can substantially assist the task of extractive text summarization.




Updated: 2020-06-25