Extractive Arabic Text Summarization Using Modified PageRank Algorithm
Egyptian Informatics Journal ( IF 5.2 ) Pub Date : 2019-12-05 , DOI: 10.1016/j.eij.2019.11.001
Reda Elbarougy , Gamal Behery , Akram El Khatib

This paper proposes an approach to Arabic text summarization. Text summarization is a natural language processing application that reduces the length of the original text while retaining only its important information. Arabic has a complex morphological structure, which makes it very difficult to extract nouns for use as a summarization feature. The Al-Khalil morphological analyzer is therefore used to solve the noun extraction problem. The proposed approach is a graph-based system that represents the document as a graph whose vertices are the sentences. A Modified PageRank algorithm is applied, with the initial score of each node set to the number of nouns in the corresponding sentence. More nouns in a sentence are taken to mean more information, so the noun count is used as the sentence's initial rank. Edges between sentences are weighted by the cosine similarity between them, so that the final summary contains sentences that are both informative and well connected to each other. The summarization process consists of three major stages: pre-processing; feature extraction and graph construction; and applying the Modified PageRank algorithm and extracting the summary. The Modified PageRank algorithm was run with different numbers of iterations to find the number that returns the best summary results. The extracted summary depends on the compression ratio, and redundancy is removed based on the overlap between sentences. The EASC Corpus is used as the standard for evaluating the approach. LexRank and TextRank were run under the same conditions, and the proposed approach provides better results than other Arabic text summarization techniques. The proposed approach performs efficiently when the number of iterations is 10,000.
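The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact update rule, so a TextRank-style weighted PageRank update is assumed here, seeded with the noun counts. The function names (`cosine_similarity`, `rank_sentences`, `summarize`), the `damping` and `max_overlap` parameters, and the use of cosine similarity as the overlap measure for redundancy removal are all assumptions made for illustration. Sentences are assumed to be already pre-processed into token lists, and the noun counts are assumed to come from a morphological analyzer such as Al-Khalil; they are passed in as plain integers.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_sentences(sentences, noun_counts, damping=0.85, iterations=100):
    """Score sentences with a weighted PageRank seeded by noun counts.

    Assumption: TextRank-style update; the paper's exact formula may differ.
    """
    n = len(sentences)
    # Edge weights: cosine similarity between distinct sentences.
    w = [[cosine_similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    # Initial rank of each node is the number of nouns in the sentence.
    scores = [float(c) for c in noun_counts]
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, noun_counts, compression_ratio=0.5,
              max_overlap=0.8, **kwargs):
    """Return indices of selected sentences, in document order."""
    scores = rank_sentences(sentences, noun_counts, **kwargs)
    budget = max(1, int(len(sentences) * compression_ratio))
    chosen = []
    for i in sorted(range(len(sentences)), key=lambda k: scores[k], reverse=True):
        # Skip a sentence that overlaps too much with one already selected.
        if all(cosine_similarity(sentences[i], sentences[j]) < max_overlap
               for j in chosen):
            chosen.append(i)
        if len(chosen) == budget:
            break
    return sorted(chosen)


# Toy input: token lists stand in for pre-processed Arabic sentences.
docs = [["a", "b", "c"], ["a", "b", "c"], ["c", "d"], ["x", "y"]]
nouns = [3, 3, 2, 2]  # hypothetical noun counts from a morphological analyzer
selected = summarize(docs, nouns, compression_ratio=0.5)
```

With this particular update rule and a damping factor below 1, the scores converge toward the same values regardless of the seed, so the noun-count initialization mainly matters when the number of iterations is small. The abstract reports the best results at 10,000 iterations; the default of 100 here is only to keep the sketch fast.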




Updated: 2019-12-05