Arabic machine reading comprehension on the Holy Qur’an using CL-AraBERT
Information Processing & Management (IF 8.6), Pub Date: 2022-09-09, DOI: 10.1016/j.ipm.2022.103068
Rana Malhas, Tamer Elsayed

In this work, we tackle the problem of machine reading comprehension (MRC) on the Holy Qur’an to address the lack of Arabic datasets and systems for this important task. We construct QRCD as the first Qur’anic Reading Comprehension Dataset, composed of 1,337 question-passage-answer triplets for 1,093 question-passage pairs, of which 14% are multi-answer questions. We then introduce CLassical-AraBERT (CL-AraBERT for short), a new AraBERT-based pre-trained model that is further pre-trained on a Classical Arabic (CA) dataset of about 1.0B words to complement the Modern Standard Arabic (MSA) resources used in pre-training the initial model and make it a better fit for the task. Finally, we leverage cross-lingual transfer learning from MSA to CA and fine-tune CL-AraBERT as a reader using two MSA-based MRC datasets followed by our QRCD dataset, constituting the first (to the best of our knowledge) MRC system on the Holy Qur’an. To evaluate our system, we introduce Partial Average Precision (pAP) as an adapted version of the traditional rank-based Average Precision measure that integrates partial matching in the evaluation over multi-answer and single-answer MSA questions. Adopting two experimental evaluation setups (hold-out and cross-validation (CV)), we empirically show that the fine-tuned CL-AraBERT reader model significantly outperforms the baseline fine-tuned AraBERT reader model by 6.12 and 3.75 points in pAP scores in the hold-out and CV setups, respectively. To promote further research on this task and other related tasks on Qur’anic and Classical Arabic text, we make both the QRCD dataset and the pre-trained CL-AraBERT model publicly available.
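The exact formulation of pAP is given in the paper itself; purely as an illustration of the idea described above, the following minimal Python sketch assumes that the binary relevance judgement of classic rank-based Average Precision is replaced by a token-overlap F1 score between each ranked predicted answer and its best-matching gold answer, normalised by the number of gold answers so that multi-answer questions get credit for each answer they recover. The function names, the F1-based partial matching, and the normalisation choice are illustrative assumptions, not the authors’ definition.

```python
# Hypothetical sketch of a partial Average Precision (pAP)-style measure.
# Assumption: the 0/1 relevance of standard rank-based AP is replaced by a
# token-overlap F1 score against the best-matching gold answer at each rank.

from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer span."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def partial_average_precision(ranked_predictions: list[str],
                              gold_answers: list[str]) -> float:
    """Rank-based AP with partial matching: at each rank, credit the best
    token-F1 against any gold answer instead of a binary relevance label."""
    if not gold_answers:
        return 0.0
    cumulative, score_sum = 0.0, 0.0
    for k, pred in enumerate(ranked_predictions, start=1):
        match = max(token_f1(pred, gold) for gold in gold_answers)
        cumulative += match
        score_sum += match * (cumulative / k)  # partial "precision@k"
    # Normalise by the number of gold answers (multi-answer questions).
    return score_sum / len(gold_answers)


# Toy usage: a multi-answer question with two gold spans and two predictions.
golds = ["الصبر والصلاة", "التوكل على الله"]
preds = ["الصبر والصلاة", "الدعاء"]
print(round(partial_average_precision(preds, golds), 3))
```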


