Extractive Text Summarization Models for Urdu Language
Information Processing & Management ( IF 7.4 ) Pub Date : 2020-09-25 , DOI: 10.1016/j.ipm.2020.102383
Ali Nawaz , Maheen Bakhtyar , Junaid Baber , Ihsan Ullah , Waheed Noor , Abdul Basit

Considerable progress has been made in Urdu linguistics in recent years, and many portals and news websites generate a huge amount of Urdu text every day. However, there is still no publicly available dataset or framework for automatic Urdu extractive summary generation. In automatic extractive summary generation, the sentences with the highest weights are selected for inclusion in the summary, and the weight of a sentence is computed as the sum of the weights of its words. Two well-known approaches for computing word weights in English are the local weights (LW) approach and the global weights (GW) approach. In the LW approach, the weights depend on the content of the text, so the same word may have different weights in different articles. In the GW approach, the word weights are computed from an independent dataset, so the weight of each word remains the same across articles. In the proposed framework, LW and GW based approaches are modeled for the Urdu language. The sentence weight method and the weighted term-frequency method are LW based approaches that compute the weight of a sentence as the sum of its important words and the sum of the frequencies of its important words, respectively. The vector space model (VSM), in contrast, is a GW based approach: it computes the word weights from an independent dataset, and these weights then remain the same for all types of text. GW is widely used in English for applications such as information retrieval and text classification. Extractive summaries are generated by the LW and GW based approaches and evaluated against ground-truth summaries prepared by experts. The VSM is used as a baseline framework for sentence weighting. Experiments show that LW based approaches are better for extractive summary generation: the F-scores of the sentence weight method and the weighted term-frequency method are 80% and 76%, respectively, whereas the VSM achieved only 62% accuracy on the same dataset. Both the datasets with ground truth and the code are made publicly available to researchers.
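
The three weighting schemes described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' released code: the function names, the whitespace tokenizer, the treatment of "important words" as non-stop-words, and the precomputed `word_weights` dictionary standing in for corpus-level VSM weights are all assumptions made for illustration, and the toy sentences are English rather than Urdu.

```python
from collections import Counter

def tokenize(sentence):
    # Assumption: plain whitespace tokenization; real Urdu text needs a proper tokenizer.
    return sentence.split()

def sentence_weight(sentences, stop_words):
    """LW sketch: weight = number of important (non-stop) words in the sentence."""
    return [sum(1 for w in tokenize(s) if w not in stop_words) for s in sentences]

def weighted_term_frequency(sentences, stop_words):
    """LW sketch: weight = sum of within-article frequencies of the important words."""
    tf = Counter(w for s in sentences for w in tokenize(s) if w not in stop_words)
    return [sum(tf[w] for w in tokenize(s) if w not in stop_words) for s in sentences]

def global_weight(sentences, word_weights):
    """GW sketch: weight = sum of fixed word weights taken from an independent corpus."""
    return [sum(word_weights.get(w, 0.0) for w in tokenize(s)) for s in sentences]

def summarize(sentences, weights, k):
    """Keep the k highest-weighted sentences, returned in their original order."""
    top = sorted(range(len(sentences)), key=lambda i: -weights[i])[:k]
    return [sentences[i] for i in sorted(top)]

# Hypothetical toy article and stop-word list.
article = ["the cat sat on the mat", "the cat chased the cat", "it rained"]
stops = {"the", "on", "it"}

lw_count = sentence_weight(article, stops)
lw_tf = weighted_term_frequency(article, stops)
# Hypothetical corpus-level weights (for example IDF values computed elsewhere).
gw = global_weight(article, {"cat": 0.5, "mat": 1.0, "rained": 2.0})

print(lw_count, lw_tf, gw)
print(summarize(article, lw_tf, 1))
```

The two LW functions recompute their statistics from the article being summarized, so the same word can contribute differently in different articles. `global_weight` looks each word up in a fixed table, which is what keeps GW weights identical across articles.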




Updated: 2020-09-25