DZDC12: a new multipurpose parallel Algerian Arabizi–French code-switched corpus
Language Resources and Evaluation (IF 1.7), Pub Date: 2019-04-01, DOI: 10.1007/s10579-019-09454-8
Kheireddine Abainia

Algeria’s socio-linguistic situation is complex, shaped by several historical, cultural and technological factors. Three languages are mainly spoken in Algeria (Arabic, Tamazight and French), and they can be mixed within the same sentence (code-switching). Moreover, there are several dialect varieties that differ from one region to another and sometimes within the same region. This paper provides a new multi-purpose parallel corpus (the DZDC12 corpus), which will serve as a testbed for various natural language processing and information retrieval applications. In particular, it can be a useful resource for studying the Arabic–French code-switching phenomenon, Algerian Romanized Arabic (Arabizi), different Algerian sub-dialects, sentiment analysis, gender writing style, machine translation, abuse detection, etc. To the best of our knowledge, the proposed corpus is the first of its kind: its texts are written in Latin script and crawled from Facebook. More specifically, the corpus is organised by gender, region and city, and is transliterated into Arabic script and translated into Modern Standard Arabic. In addition, it is annotated for emotion detection and abuse detection, and annotated at the word level. This article focuses in particular on Algeria’s socio-linguistic situation and the effect of social media networks. Furthermore, the general guidelines for the design of the DZDC12 corpus are described, as well as the clustering of dialects over the map.

Updated: 2019-04-01