Studying the history of the Arabic language: language technology and a large-scale historical corpus, Language Resources and Evaluation

Language Resources and Evaluation (IF 2.7), Pub Date: 2019-04-12, DOI: 10.1007/s10579-019-09460-w
Yonatan Belinkov, Alexander Magidow, Alberto Barrón-Cedeño, Avi Shmidman, Maxim Romanov

Arabic is a widely spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. As a result, the study of the language's history has so far been mostly limited to small-scale manual analyses. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, as well as other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.

Updated: 2019-04-12
