TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish Corpus
arXiv - CS - Computation and Language Pub Date : 2020-03-20 , DOI: arxiv-2003.09520
Elisa Gugliotta, Marco Dinarelli

This article describes the construction of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC). Arabish, also known as Arabizi, is a spontaneous encoding of Arabic dialects in Latin characters and arithmographs (numbers used as letters). This code system was developed by Arabic-speaking social media users to make writing easier in the informal settings of Computer-Mediated Communication (CMC) and text messaging. Arabish is realized differently across dialects, and each Arabish code system is under-resourced, as are most Arabic dialects. In the last few years, the focus on Arabic dialects in NLP has increased considerably. TArC is therefore intended to support different types of analysis, both computational and linguistic, as well as the training of NLP tools. In this article we describe preliminary work on the semi-automatic construction of TArC and some of the first analyses we carried out on it. In addition, to give a complete overview of the challenges faced while building the corpus, we present the main characteristics of the Tunisian dialect and their encoding in Tunisian Arabish.
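
To make the notion of arithmographs concrete, here is a minimal sketch that flags tokens in which a digit stands in for a letter. It is not taken from TArC or its tooling: the three-entry mapping lists only a few digit conventions commonly seen in Arabish writing, the function name `find_arithmographs` is hypothetical, and the sample words are illustrative. Actual usage varies by dialect and by writer.

```python
import re

# Illustrative digit-to-Arabic-letter conventions often seen in Arabish.
# Assumption: this small table is NOT from TArC; conventions vary by dialect and writer.
ARITHMOGRAPHS = {
    "3": "ع",  # ayn
    "5": "خ",  # kha
    "7": "ح",  # emphatic h
}

# Runs of ASCII letters and digits are treated as candidate tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def find_arithmographs(text):
    """Return (token, digits) pairs for tokens that mix Latin letters with mapped digits."""
    hits = []
    for token in TOKEN_RE.findall(text):
        has_letter = any(ch.isalpha() for ch in token)
        digits = [ch for ch in token if ch in ARITHMOGRAPHS]
        # Pure numbers such as "2020" are skipped: a digit only counts as an
        # arithmograph here when it appears alongside letters.
        if has_letter and digits:
            hits.append((token, digits))
    return hits


if __name__ == "__main__":
    for token, digits in find_arithmographs("3andi 7aja 2020"):
        print(token, [ARITHMOGRAPHS[d] for d in digits])
```

A letter-plus-digit test like this is only a rough heuristic. It would also flag ordinary alphanumeric strings such as product codes, so a real pipeline would need dialect-aware resources rather than a fixed table.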

Updated: 2020-11-11