Mapping languages: the Corpus of Global Language Use

  • Original Paper
  • Language Resources and Evaluation

Abstract

This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages (e.g., English, Arabic, Russian) are used, together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports more local languages with smaller sample sizes than alternative off-the-shelf models. Improved language identification is essential for moving beyond majority languages. Given the focus on language mapping, the paper analyzes how well this digital language data represents actual populations by (i) systematically comparing the corpus with demographic ground-truth data and (ii) triangulating the corpus with an alternative Twitter-based dataset. In total, the corpus contains 423 billion words representing 148 languages (with over 1 million words from each language) and 158 countries (again with over 1 million words from each country), all distilled from Common Crawl web data. The main contribution of this paper, in addition to describing this publicly available corpus, is to provide a comprehensive analysis of the relationship between two sources of digital data (the web and Twitter) as well as their connection to underlying populations.
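The comparison between corpus data and demographic ground truth can be illustrated with a minimal sketch. It assumes that representativeness is checked by converting per-country word counts and population figures into shares and correlating them; the use of Pearson correlation, the function names, and all numbers below are assumptions made for illustration, not the measure or the data reported in the paper.

```python
from math import sqrt


def shares(counts):
    """Convert raw per-country counts into proportions of the total."""
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder figures for four unnamed countries.
# These are NOT taken from the corpus or from any census source.
corpus_words = {"A": 500, "B": 300, "C": 150, "D": 50}
population = {"A": 40, "B": 35, "C": 15, "D": 10}

countries = sorted(corpus_words)
corpus_share = shares(corpus_words)
population_share = shares(population)

# Agreement between digital data and population across countries.
r = pearson(
    [corpus_share[c] for c in countries],
    [population_share[c] for c in countries],
)

# Positive values: the country contributes more text than its
# population share would predict; negative values: less.
over_representation = {
    c: corpus_share[c] - population_share[c] for c in countries
}
```

A high correlation under this kind of check would suggest that the amount of text per country tracks population, while the per-country differences show where digital data over- or under-represents people on the ground. The same comparison could be repeated with a second source, such as Twitter counts, to triangulate the two datasets.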

Notes

  1. The dataset is visualized at http://www.earthlings.io.

  2. https://github.com/jonathandunn/common_crawl_corpus.

  3. https://www.github.com/fxsjy/jieba.

  4. https://pypi.org/project/tinysegmenter.

  5. https://github.com/jonathandunn/idNet.

  6. https://publicdata.canterbury.ac.nz/Research/NZILBB/jonathandunn/idNet_models/.

  7. https://www.geonames.org.

  8. The analysis presented in this section is also visualized in an open-source manner at https://www.earthlings.io.

  9. https://www.earthlings.io/download_ngrams.html.

  10. https://www.earthlings.io/download_cglu.html.

  11. https://www.earthlings.io/download_ngrams.html.

  12. https://github.com/jonathandunn/common_crawl_corpus.

  13. https://github.com/jonathandunn/idNet.

  14. https://www.earthlings.io and https://github.com/jonathandunn/earthlings.

Author information

Corresponding author

Correspondence to Jonathan Dunn.

About this article

Cite this article

Dunn, J. Mapping languages: the Corpus of Global Language Use. Lang Resources & Evaluation 54, 999–1018 (2020). https://doi.org/10.1007/s10579-020-09489-2

  • DOI: https://doi.org/10.1007/s10579-020-09489-2
