Efficient Measuring of Readability to Improve Documents Accessibility for Arabic Language Learners
arXiv - CS - Computation and Language Pub Date : 2021-09-09 , DOI: arxiv-2109.08648
Sadik Bessou, Ghozlane Chenni

This paper presents a supervised machine learning approach to building a classifier that identifies text complexity, so that Arabic language learners can be given texts suited to their level. Identifying the reading level is formulated as a classification task that discriminates between the different levels of difficulty in reading and understanding a text. Several models were trained on a large, manually annotated corpus mined from online Arabic websites. The models use both Count and TF-IDF representations with unigram and bigram features, and apply five machine learning algorithms: Multinomial Naive Bayes, Bernoulli Naive Bayes, Logistic Regression, Support Vector Machine (SVM), and Random Forest. Experimental results showed that n-gram features could be indicative of the reading level of a text and could substantially improve performance, and that SVM and Multinomial Naive Bayes were the most accurate in predicting the complexity level. The best results were achieved using TF-IDF vectors built from a combination of word-based unigrams and bigrams, with an overall accuracy of 87.14% over four classes of complexity.
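
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of one of its configurations: word unigram and bigram counts fed to a Multinomial Naive Bayes classifier. It is not the authors' implementation. The function `ngrams`, the class `MultinomialNB`, the Laplace smoothing parameter `alpha`, whitespace tokenization, and the two-class English toy corpus are all assumptions made for illustration; the paper itself works with Arabic text, four complexity classes, and reports its best results with TF-IDF rather than raw counts.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Return word n-grams for n = 1..n_max (whitespace tokenization, an assumption)."""
    tokens = text.split()
    feats = []
    for n in range(1, n_max + 1):
        feats.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats


class MultinomialNB:
    """Hypothetical minimal Multinomial Naive Bayes over n-gram counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.feature_counts[label].update(ngrams(doc))
        # Vocabulary is the set of all n-grams seen in training.
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in ngrams(doc):
                if feat in self.vocab:  # n-grams unseen in training are ignored
                    score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy corpus, invented for illustration only (not data from the paper).
train_docs = [
    "the cat sat",
    "the dog ran",
    "epistemological frameworks presuppose ontological commitments",
    "jurisprudential doctrines presuppose normative frameworks",
]
train_labels = ["easy", "easy", "hard", "hard"]

model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict("the cat ran"))
print(model.predict("normative frameworks presuppose doctrines"))
```

Swapping the raw counts for TF-IDF weights, or the classifier for an SVM, would give other configurations the abstract mentions; the sketch keeps to counts and Naive Bayes only because they fit in a few lines without external libraries.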


