An Arabic Multi-source News Corpus: Experimenting on Single-document Extractive Summarization, Arabian Journal for Science and Engineering



Arabian Journal for Science and Engineering ( IF 2.6 ) Pub Date : 2021-02-04 , DOI: 10.1007/s13369-020-05258-z
Amina Chouigui , Oussama Ben Khiroun , Bilel Elayeb

Automatic text summarization is an important task in several areas of natural language processing, such as information retrieval. It is the process of automatically generating a condensed representation of a text, and it can help address the problem of information overload. Given the large amount of information available on the Internet, presenting a document by its summary helps users find the most relevant results of a search. In this paper, we propose a new, free, structured Arabic corpus in the standard XML TREC format. ANT corpus v2.1 is collected using RSS feeds from different news sources. The corpus is useful for multiple text mining purposes, such as generic text summarization, clustering, or classification. We test it on unsupervised single-document extractive summarization using statistical and graph-based language-independent summarizers such as LexRank, TextRank, Luhn, and LSA. We investigate the sensitivity of the summarization process to the stemming and stop-word removal steps. We evaluate the summarizers' performance by comparing the extracted text fragments to the abstracts existing in ANT corpus v2.1 using the ROUGE and BLEU metrics. Experimental results show that the LexRank summarizer achieved the best ROUGE scores in the stop-word removal scenario.
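As an illustration of the kind of comparison the abstract describes, the following is a minimal sketch of a ROUGE-1 style unigram overlap score between an extracted summary and a reference abstract, with optional stop-word removal. It is not the authors' evaluation code: the function name `rouge_1`, the whitespace tokenization, and the English toy sentences are assumptions made here for illustration, and the published ROUGE toolkit offers further options (for example stemming and longer n-grams) that this sketch omits.

```python
from collections import Counter


def rouge_1(candidate, reference, stop_words=frozenset()):
    """Unigram-overlap precision, recall and F1 (a simplified ROUGE-1).

    Tokens are split on whitespace (an assumption of this sketch);
    tokens found in `stop_words` are dropped before counting.
    """
    cand = [t for t in candidate.split() if t not in stop_words]
    ref = [t for t in reference.split() if t not in stop_words]
    if not cand or not ref:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

    # Clipped overlap: each token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    extracted = "the cat sat on the mat"
    abstract = "the cat lay on the mat"
    print(rouge_1(extracted, abstract))
    print(rouge_1(extracted, abstract, stop_words={"the", "on"}))
```

Comparing the two calls shows how removing stop words can change the score: frequent function words that match trivially no longer inflate the overlap, so the metric depends more on content words. The same whitespace tokenization would also run on Arabic text, although a real evaluation would likely need proper Arabic tokenization and stemming.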

Updated: 2021-02-04
