Mapping languages: the Corpus of Global Language Use, Language Resources and Evaluation

Language Resources and Evaluation (IF 1.7), Pub Date: 2020-04-08, DOI: 10.1007/s10579-020-09489-2
Jonathan Dunn

This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports more local languages with smaller sample sizes than alternative off-the-shelf models. Improved language identification is essential for moving beyond majority languages. Given the focus on language mapping, the paper analyzes how well this digital language data represents actual populations by (i) systematically comparing the corpus with demographic ground-truth data and (ii) triangulating the corpus with an alternate Twitter-based dataset. In total, the corpus contains 423 billion words representing 148 languages (with over 1 million words from each language) and 158 countries (again with over 1 million words from each country), all distilled from Common Crawl web data. The main contribution of this paper, in addition to describing this publicly-available corpus, is to provide a comprehensive analysis of the relationship between two sources of digital data (the web and Twitter) as well as their connection to underlying populations.
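The abstract does not spell out how the corpus is compared with demographic ground-truth data. As a rough illustration only, one simple way to make such a comparison is to convert per-country word counts and per-country population figures into shares and compute a Pearson correlation between them. The sketch below assumes that approach; the function names (`shares`, `pearson`, `representation_correlation`) and the country codes and numbers are hypothetical placeholders, not values or methods taken from the paper.

```python
from math import sqrt


def shares(counts):
    """Convert raw per-country counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {country: value / total for country, value in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def representation_correlation(corpus_words, population):
    """Correlate each country's share of corpus words with its share of
    population, using only countries present in both inputs."""
    countries = sorted(set(corpus_words) & set(population))
    corpus_share = shares({c: corpus_words[c] for c in countries})
    population_share = shares({c: population[c] for c in countries})
    return pearson(
        [corpus_share[c] for c in countries],
        [population_share[c] for c in countries],
    )


# Placeholder inputs (hypothetical country codes and counts, NOT from the paper).
corpus_words = {"AA": 40, "BB": 30, "CC": 20, "DD": 10}
population = {"AA": 400, "BB": 300, "CC": 200, "DD": 100}
inverse_population = {"AA": 10, "BB": 20, "CC": 30, "DD": 40}

proportional = representation_correlation(corpus_words, population)
inverse = representation_correlation(corpus_words, inverse_population)
```

A value near 1 would suggest that the corpus represents countries roughly in proportion to their populations, while a value near 0 or below would suggest a mismatch. The same function could in principle be applied to web-based and Twitter-based counts to compare the two sources, again under the assumption that a simple correlation is an adequate measure.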

Updated: 2020-04-08
