A simple but effective method for Indonesian automatic text summarisation
Connection Science ( IF 3.2 ) Pub Date : 2021-06-10 , DOI: 10.1080/09540091.2021.1937942
Nankai Lin 1 , Jinxian Li 1 , Shengyi Jiang 1, 2

Automatic text summarisation (ATS) is an automatic procedure for extracting critical information from a text using a specific algorithm or method; its two main approaches are abstractive and extractive summarisation. Because corpora are scarce, abstractive summarisation performs poorly on ATS tasks for low-resource languages, which is why researchers commonly apply extractive summarisation to such languages instead. As an emerging branch of extraction-based summarisation, methods based on feature analysis quantify the significance of information by calculating a utility score for each sentence in the article. In this study, we propose a simple but effective extractive method for Indonesian documents based on the Light Gradient Boosting Machine regression model. Four features are extracted: PositionScore, TitleScore, the semantic representation similarity between the sentence and the document title, and the semantic representation similarity between the sentence and the centre of the sentence's cluster. We define a formula for calculating the sentence score as the objective function of the linear regression. Considering the characteristics of Indonesian, we use Indonesian lemmatisation technology to improve the calculation of the sentence score. The results show that our method is more applicable.
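As a rough illustration of feature-based sentence scoring, the Python sketch below computes four per-sentence features of the kind listed in the abstract. The abstract does not give the feature formulas, so every definition here is an assumption: the position feature decays linearly with sentence index, the title feature is word overlap with the title, bag-of-words count vectors stand in for the semantic representations, and a single cluster covering all sentences stands in for the clustering step. The names `tokenize`, `cosine`, and `sentence_features` are hypothetical. In the approach the abstract describes, rows like these would be passed to a regression model that predicts a sentence score; that step is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs. A real system would lemmatise
    Indonesian words here; that step is omitted in this sketch."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sentence_features(title, sentences):
    """Return one [position, title_overlap, title_sim, centre_sim] row per sentence.

    All four definitions are assumptions made for illustration only.
    """
    title_types = set(tokenize(title))
    title_vec = Counter(tokenize(title))
    vectors = [Counter(tokenize(s)) for s in sentences]

    # Assumed: one cluster containing every sentence. The summed vector points
    # in the same direction as the mean, and cosine similarity ignores scale.
    centre = Counter()
    for vec in vectors:
        centre.update(vec)

    n = len(sentences)
    rows = []
    for i, vec in enumerate(vectors):
        position = 1.0 - i / n  # assumed: earlier sentences score higher
        overlap = len(set(vec) & title_types) / len(title_types) if title_types else 0.0
        rows.append([position, overlap, cosine(vec, title_vec), cosine(vec, centre)])
    return rows


if __name__ == "__main__":
    title = "Banjir di Jakarta"
    sentences = [
        "Banjir melanda Jakarta pada hari Senin.",
        "Warga mengungsi ke tempat aman.",
        "Pemerintah mengirim bantuan.",
    ]
    for sentence, row in zip(sentences, sentence_features(title, sentences)):
        print([round(x, 3) for x in row], sentence)
```

Count vectors are used only to keep the sketch dependency-free; swapping in learned sentence embeddings would change `cosine`'s inputs but not the overall shape of the feature table.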




Updated: 2021-06-10