Natural language processing for similar languages, varieties, and dialects: A survey

Marcos Zampieri; Preslav Nakov; Yves Scherrer

doi:10.1017/S1351324920000492

Published online by Cambridge University Press: 20 November 2020

Marcos Zampieri: Rochester Institute of Technology, USA
Preslav Nakov: Qatar Computing Research Institute, HBKU, Qatar
Yves Scherrer: University of Helsinki, Finland

Abstract

The natural language processing (NLP) community has recently shown considerable interest in the computational processing of language varieties and dialects, with the aim of improving the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we survey this growing field of research, with a focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges in dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common strategies used to compile such datasets. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing of similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.

Type: Survey Paper
Information: Natural Language Engineering, Volume 26, Issue 6: Natural Language Processing for Similar Languages, Varieties, and Dialects, November 2020, pp. 595–612

DOI: https://doi.org/10.1017/S1351324920000492
Copyright: © Cambridge University Press 2020

References

Aepli, N., von Waldenfels, R. and Samardžić, T. (2014). Part-of-speech tag disambiguation by cross-linguistic majority vote. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial), Dublin, Ireland, pp. 76–84.CrossRef Google Scholar

Agić, ž., Hovy, D. and Søgaard, A. (2015). If all you have is a bit of the bible: Learning pos taggers for truly low-resource languages. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, pp. 268–272.CrossRef Google Scholar

Aharoni, R., Johnson, M. and Firat, O. (2019). Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT’19, Minneapolis, Minnesota, pp. 3874–3884.CrossRef Google Scholar

AlGhamdi, F. and Diab, M. (2019). Leveraging pretrained word embeddings for part-of-speech tagging of code switching data. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, Ann Arbor, Michigan, June. Association for Computational Linguistics, pp. 99–109.10.18653/v1/W19-1410CrossRef Google Scholar

Ali, A., Dehak, N., Cardinal, P., Khurana, S., Yella, S.H., Glass, J., Bell, P. and Renals, S. (2016). Automatic dialect detection in Arabic broadcast speech. In Proceedings of INTERSPEECH, San Francisco, USA, pp. 2934–2938.10.21437/Interspeech.2016-1297CrossRef Google Scholar

Ali, A., Vogel, S. and Renals, S. (2017). Speech recognition challenge in the wild: Arabic MGB-3. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, pp. 316–322.CrossRef Google Scholar

Alshutayri, A. and Atwell, E. (2017). Exploring twitter as a source of an Arabic dialect corpus. International Journal of Computational Linguistics (IJCL) 8(2), 37–44.Google Scholar

Altintas, K. and Cicekli, I. (2002). A machine translation system between a pair of closely related languages. In Proceedings of the 17th International Symposium on Computer and Information Sciences, ISCIS’02, Orlando, Florida, USA, pp. 192–196.Google Scholar

Armentano-Oller, C., Carrasco, R.C., Corbí-Bellot, A.M., Forcada, M.L., Ginestí-Rosell, M., Ortiz-Rojas, S., Pérez-Ortiz, J.A., Ramírez-Sánchez, G., Sánchez-Martínez, F. and Scalco, M.A. (2006). Open-source Portuguese-Spanish machine translation. In Proceedings of the 7th International Workshop on Computational Processing of the Portuguese Language, PROPOR ’06, Itatiaia, Brazil, pp. 50–59.10.1007/11751984_6CrossRef Google Scholar

Artetxe, M. and Schwenk, H. (2019). Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics (TACL) 7, 597–610.CrossRef Google Scholar

Aw, A., Zhang, M., Xiao, J. and Su, J. (2006). A phrase-based statistical model for SMS text normalization. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, ACL-COLING’06, Sydney, Australia, pp. 33–40.CrossRef Google Scholar

Bahdanau, D., Cho, K. and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR’15, San Diego, California, USA.Google Scholar

Bakr, H.A., Shaalan, K. and Ziedan, I. (2008). A hybrid approach for converting written Egyptian colloquial dialect into diacritized Arabic. In Proceedings of the 6th International Conference on Informatics and Systems, INFOS’08, Egypt, pp. 27–33.Google Scholar

Bemova, A., Oliva, K. and Panevova, J. (1988). Some problems of machine translation between closely related languages. In Proceedings of the International Conference on Computational Linguistics, COLING’88, Budapest, Hungary.10.3115/991635.991645CrossRef Google Scholar

Bergsma, S., McNamee, P., Bagdouri, M., Fink, C. and Wilson, T. (2012). Language identification for creating language-specific Twitter collections. In Proceedings of the Second Workshop on Language in Social Media, pp. 65–74.Google Scholar

Bernier-Colborne, G., Goutte, C. and Léger, S. (2019). Improving cuneiform language identification with BERT. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Minneapolis, USA, pp. 17–25.Google Scholar

Bestgen, Y. (2017). Improving the character ngram model for the DSL task with BM25 weighting and less frequently used feature sets. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain, pp. 115–123.CrossRef Google Scholar

Bick, E. and Nygaard, L. (2007). Using Danish as a CG interlingua: A wide-coverage Norwegian-English machine translation system. In Proceedings of the 16th Nordic Conference of Computational Linguistics, NODALIDA’07, Tartu, Estonia, pp. 21–28.Google Scholar

Biemann, C., Heyer, G., Quasthoff, U. and Richter, M. (2007). The Leipzig corpora collection-monolingual corpora of standard size. In Proceedings of Corpus Linguistics.Google Scholar

Bojja, N., Nedunchezhian, A. and Wang, P. (2015). Machine translation in mobile games: Augmenting social media text normalization with incentivized feedback. In Proceedings of the 15th Machine Translation Summit (MT Users’ Track), vol. 2, Miami, Florida, USA, pp. 11–16.Google Scholar

Bouamor, H., Habash, N. and Oflazer, K. (2014). A multidialectal parallel corpus of Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 1240–1245.Google Scholar

Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., Abdulrahim, D., Obeid, O., Khalifa, S., Eryani, F., Erdmann, A., Oflazer, K. (2018). The MADAR arabic dialect corpus and lexicon. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), pp. 3387–3396.Google Scholar

Bouamor, H., Hassan, S. and Habash, N. (2019). The MADAR shared task on arabic fine-grained dialect identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pp. 199–207.CrossRef Google Scholar

Cao, S., Kitaev, N. and Klein, D. (2020). Multilingual alignment of contextual word representations. In Proceedings of the 8th International Conference on Learning Representations, ICLR’20, Addis Ababa, Ethiopia.Google Scholar

Çöltekin, Ç. and Rama, T. (2017). Tübingen system in VarDial 2017 shared task: Experiments with language identification and cross-lingual parsing. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial).CrossRef Google Scholar

Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP’14, Doha, Qatar, pp. 1724–1734.CrossRef Google Scholar

Christensen, H. (2014). Hc corpora. http://www.corpora.heliohost.org/.Google Scholar

Ciobanu, A.M. and Dinu, L.P. (2016). A computational perspective on the Romanian dialects. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, May, pp. 3281–3285.Google Scholar

Clyne, M. (1992). Pluricentric Languages: Different Norms in Different Nations, Amsterdam: De Gruyter Mouton.Google Scholar

Conneau, A. and Lample, G. (2019). Cross-lingual language model pretraining. In Wallach H., Larochelle H., Beygelzimer A., dAlché-Buc F., Fox E. and Garnett R. (eds), Advances in Neural Information Processing Systems 32, Vancouver, Canada, pp. 7059–7069.Google Scholar

Corbí-Bellot, A.M., Forcada, M.L., Ortiz-Rojas, S., Pérez-Ortiz, J.A., Ramírez-Sánchez, G., Sánchez-Martínez, F., Alegria, I., Mayor, A. and Sarasola, K. (2005). An open-source shallow-transfer machine translation engine for the romance languages of Spain. In Proceedings of the Tenth Conference of the European Association for Machine Translation, EAMT’05, Budapest, Hungary, pp. 79–86.Google Scholar

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning 20(3), 273–297.CrossRef Google Scholar

Costa-jussà M.R., Zampieri M. and Pal S. (2018). A neural approach to language variety translation. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial’18, Santa Fe, New Mexico, USA, pp. 275–282.Google Scholar

Cotterell, R. and Callison-Burch, C. (2014). A multi-dialect, multi-genre corpus of informal written Arabic. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland, pp. 241–245.Google Scholar

Cotterell, R. and Heigold, G. (2017). Cross-lingual character-level neural morphological tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 748–759.CrossRef Google Scholar

Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 4171–4186.Google Scholar

Diwersy, S., Evert, S. and Neumann, S. (2014). A weakly supervised multivariate approach to the study of language variation. In Szmrecsanyi B. and Wälchli B. (eds), Aggregating Dialectology, Typology, and Register Analysis. Linguistic Variation in Text and Speech. Berlin: De Gruyter.Google Scholar

Elfardy, H. and Diab, M. (2013). Sentence level dialect identification in Arabic. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia, Bulgaria, pp. 456–461.Google Scholar

Elgabou, H.A. and Kazakov, D. (2017). Building dialectal Arabic corpora. In The Proceedings of the First Workshop on Human-Informed Translation and Interpreting Technology (HiT-IT), Varna, Bulgaria, pp. 52–57.CrossRef Google Scholar

Feldman, A., Hana, J. and Brew, C. (2006). A cross-language approach to rapid creation of new morphosyntactically annotated resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006). European Language Resources Association (ELRA), pp. 549–554.Google Scholar

Forcada, M.L. (2006). Open-source machine translation: An opportunity for minor languages. In Proceedings of the LREC’06 Workshop on Strategies for Developing Machine Translation for Minority Languages, Genoa, Italy.Google Scholar

Francis, W.N. and Kucera, H. (1979). Brown Corpus Manual.Google Scholar

Găman, M., Hovy, D., Ionescu, R.T., Jauhiainen, H., Jauhiainen, T., Lindén, K., Ljubešić, N., Partanen, N., Purschke, C., Scherrer, Y. and Zampieri, M. (2020). A report on the VarDial evaluation campaign 2020. In Proceedings of the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial).Google Scholar

Gebre, B.G., Wittenburg, P. and Heskes, T. (2013). Automatic sign language identification. In Proceedings of the IEEE International Conference on Image Processing. IEEE, pp. 2626–2630.CrossRef Google Scholar

Goutte, C., Léger, S., Malmasi, S. and Zampieri, M. (2016). Discriminating similar languages: Evaluations and explorations. In Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, pp. 1800–1807.Google Scholar

Greenbaum, S. (1991). Ice: The international corpus of english. English Today 7(4), 3–7.CrossRef Google Scholar

Guzmán, F., Chen, P.-J., Ott, M., Pino, J., Lample, G., Koehn, P., Chaudhary, V. and Ranzato, M. (2019). The FLORES evaluation datasets for low-resource machine translation: Nepali–English and Sinhala–English. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP’19, Hong Kong, China, pp. 6098–6111.CrossRef Google Scholar

Hajič, J., Hric, J. and Kuboň, V. (2000). Machine translation of very close languages. In Proceedings of the Sixth Conference on Applied Natural Language Processing, ANLP’00, Seattle, Washington, USA, pp. 7–12.CrossRef Google Scholar

Han, B. and Baldwin, T. (2011). Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT’11, Portland, Oregon, USA, pp. 368–378.Google Scholar

Han, B., Cook, P. and Baldwin, T. (2012). Geolocation prediction in social media data by finding location indicative words. In Proceedings of the International Conference in Computational Linguistics (COLING), pp. 1045–1062.Google Scholar

Hollenstein, N. and Aepli, N. (2015). A resource for natural language processing of Swiss German dialects. In Proceedings of GSCL, pp. 108–109.Google Scholar

Huang, C.-R. and Lee, L.-H. (2008). Contrastive approach towards text source classification based on top-bag-of-word similarity. In Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation, Cebu City, Philippines, November, pp. 404–410.Google Scholar

Huck, M., Dutka, D. and Fraser, A. (2019). Cross-lingual annotation projection is effective for neural part-of-speech tagging. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, Ann Arbor, Michigan, June. Association for Computational Linguistics, pp. 223–233.CrossRef Google Scholar

Jauhiainen, T., Lindén, K. and Jauhiainen, H. (2019). Language model adaptation for language and dialect identification of text. Natural Language Engineering 25(5), 561–583.CrossRef Google Scholar

Jauhiainen, T., Jauhiainen, H., Alstola, T. and Lindén, K. (2019a). Language and dialect identification of Cuneiform texts. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Ann Arbor, Michigan. Association for Computational Linguistics, pp. 89–98.Google Scholar

Jauhiainen, T., Lui, M., Zampieri, M., Baldwin, T. and Lindén, K. (2019b). Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research, pp. 675–782.CrossRef Google Scholar

Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M. and Dean, J. (2017). Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, 339–351.CrossRef Google Scholar

Jørgensen, A., Hovy, D. and Søgaard, A. (2016). Learning a POS tagger for AAVE-like language. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 1115–1120.CrossRef Google Scholar

Josan, G.S. and Lehal, G.S. (2008). A Punjabi to Hindi machine translation system. In Proceedings of the 22nd International Conference on on Computational Linguistics, COLING’08, Manchester, UK, pp. 157–160.Google Scholar

Joty, S., Nakov, P., Màrquez, L. and Jaradat, I. (2017). Cross-language learning with adversarial neural networks. In Proceedings of the 21st Conference on Computational Natural Language Learning, CoNLL’17, Vancouver, Canada, pp. 226–237.CrossRef Google Scholar

Kudo, T. and Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, November. Association for Computational Linguistics, pp. 66–71.CrossRef Google Scholar

Lakew, S.M., Cettolo, M. and Federico, M. (2018). A comparison of transformer and recurrent neural networks on multilingual neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, COLING’18, Santa Fe, New Mexico, USA, pp. 641–652.Google Scholar

Lample, G., Conneau, A., Ranzato, M., Denoyer, L. and Jégou, H. (2018). Word translation without parallel data. In Proceedings of the 6th International Conference on Learning Representations, ICLR’18, Vancouver, BC, Canada.Google Scholar

Ljubešić, N., Mikelić, N. and Boras, D. (2007). Language identification: How to distinguish similar languages? In Proceedings of the 29th International Conference on Information Technology Interfaces (ITI 2007), Cavtat/Dubrovnik, Croatia, pp. 541–546.Google Scholar

Lui, M. (2014). Generalized Language Identification. PhD Thesis, University of Melbourne.Google Scholar

Lui, M. and Cook, P. (2013). Classifying English documents by national dialect. In Proceedings of Australasian Language Technology Association Workshop 2013 (ALTA 2013), Brisbane, Australia, December, pp. 5–15.Google Scholar

Lui, M., Lau, J.H. and Baldwin, T. (2014). Automatic detection and language identification of multilingual documents. Transactions of the Association for Computational Linguistics 2, 27–40.CrossRef Google Scholar

Lui, M., Letcher, N., Adams, O., Duong, L., Cook, P. and Baldwin, T. (2014). Exploring methods and resources for discriminating similar languages. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial), Dublin, Ireland, August, pp. 129–138.Google Scholar

Lusetti, M., Ruzsics, T., Göhring, A., Samardžić, T. and Stark, E. 2018. Encoder-decoder methods for text normalization. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, New Mexico, USA, August. Association for Computational Linguistics, pp. 18–28.Google Scholar

Magistry, P., Ligozat, A.-L. and Rosset, S. (2019). Exploiting languages proximity for part-of-speech tagging of three French regional languages. Language Resources and Evaluation 53, 865–888.CrossRef Google Scholar

Malmasi, S., Zampieri, M., Ljubešić, N., Nakov, P., Ali, A. and Tiedemann, J. (2016). Discriminating between similar languages and Arabic dialect identification: A report on the third DSL shared task. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Osaka, Japan, pp. 1–14.Google Scholar

Martinc, M. and Pollak, S. (2019). Combining N-grams and deep convolutional features for language variety classification. Natural Language Engineering 25(5), 607–632.CrossRef Google Scholar

Marujo, L., Grazina, N., Luís, T., Ling, W., Coheur, L. and Trancoso, I. (2011). BP2EP - adaptation of Brazilian Portuguese texts to European Portuguese. In Proceedings of the 15th Conference of the European Association for Machine Translation, EAMT’11, Leuven, Belgium, pp. 129–136.Google Scholar

McDonald, R., Nivre, J., Quirmbach-Brundage, Y., Goldberg, Y., Das, D., Ganchev, K., Hall, K., Petrov, S., Zhang, H., Täckström, O., Bedini, C., Castelló, N.B. and Lee, J. (2013). Universal dependency annotation for multilingual parsing. In Proceedings of ACL.Google Scholar

McDonald, R., Petrov, S. and Hall, K. (2011). Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, pp. 62–72.Google Scholar

McNamee, P. (2005). Language identification: A solved problem suitable for undergraduate instruction. Journal of Computing Sciences in Colleges 20(3), 94–101.Google Scholar

Medvedeva, M., Kroon, M. and Plank, B. (2017) When sparse traditional models outperform dense neural networks: The curious case of discriminating between similar languages. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pp. 156–163.Google Scholar

Mikolov, T., Le, Q.V. and Sutskever, I. (2013). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.Google Scholar

Mokhov, S.A. (2010). A MARF approach to DEFT 2010. In Proceedings of the 6th DEFT Workshop (DEFT’10), pp. 35–49.Google Scholar

Myint Oo, T., Kyaw Thu, Y. and Mar Soe, K. (2019) Neural machine translation between Myanmar (Burmese) and rakhine (arakanese). In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Minneapolis, USA, pp. 80—88.Google Scholar

Nakov, P. and Ng, H.T. (2009). Improved statistical machine translation for resource-poor languages using related resource-rich languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP’09, Singapore, pp. 1358–1367.CrossRef Google Scholar

Nakov, P. and Ng, H.T. (2012). Improving statistical machine translation for a resource-poor language using related resource-rich languages. Journal of Artificial Intelligence Research, 44, 179–222.CrossRef Google Scholar

Nakov, P. and Tiedemann, J. (2012). Combining word-level and character-level models for machine translation between closely-related languages. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL’12, Jeju Island, Korea, pp. 301–305.Google Scholar

Nguyen, D. and Dogruoz, A.S. (2014). Word level language identification in online multilingual communication. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pp. 18–21.Google Scholar

Nguyen, T.Q. and Chiang, D. (2017). Transfer learning across low-resource, related languages for neural machine translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, IJCNLP’17, Taipei, Taiwan, pp. 296–301.Google Scholar

Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C.D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Silveira, R., Zeman, D. (2016). Universal dependencies v1: A multilingual treebank collection. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Portoroz, Slovenia, pp. 1659–1666.Google Scholar

Petrov, S., Das, D. and McDonald, R. (2012). A universal part-of-speech tagset. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).Google Scholar

Popović, M., Poncelas, A., Brkic, M. and Way, A. (2020). Neural machine translation for translating into Croatian and Serbian. In Proceedings of the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial).Google Scholar

Ranaivo-Malançon, B. (2006). Automatic identification of close languages – case study: Malay and Indonesian. ECTI Transactions on Computer and Information Technology 2(2), 126–134.CrossRef Google Scholar

Rosa, R., Zeman, D., Mareček, D. and Žabokrtský, Z. (2017). Slavic forest, Norwegian wood. In Proceedings of the VarDial Workshop (VarDial).CrossRef Google Scholar

Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv e-prints, page arXiv:1706.05098.Google Scholar

Sadat, F., Kazemi, F. and Farzindar, A. (2014). Automatic identification of Arabic dialects in social media. In Proceedings of the First International Workshop on Social Media Retrieval and Analysis (SoMeRA 2014), Gold Coast, Australia. ACM, pp. 35–40.CrossRef Google Scholar

Sajjad, H., Darwish, K. and Belinkov, Y. (2013). Translating dialectal Arabic to English. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL’13, Sofia, Bulgaria, pp. 1–6.Google Scholar

Salloum, W., Elfardy, H., Alamir-Salloum, L., Habash, N. and Diab, M. (2014). Sentence level dialect identification for machine translation system selection. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, USA, pp. 772–778.CrossRef Google Scholar

Salloum, W. and Habash, N. (2011). Dialectal to Standard Arabic paraphrasing to improve Arabic-English statistical machine translation. In Proceedings of the Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties, Stroudsburg, Pennsylvania, USA, pp. 10–21.Google Scholar

Salloum, W. and Habash, N. (2012). Elissa: A dialectal to standard Arabic machine translation system. In Proceedings of COLING 2012: Demonstration Papers, COLING’12, Mumbai, India, pp. 385–392.Google Scholar

Samardžić, T., Scherrer, Y. and Glaser, E. (2016). ArchiMob – a corpus of spoken Swiss German. In Proceedings of LREC.Google Scholar

Sawaf, H. (2010). Arabic dialect handling in hybrid machine translation. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas, AMTA’10, Denver, Colorado, USA.Google Scholar

Scannell, K.P. (2006). Machine translation for closely related language pairs. In Proceedings of the LREC 2006 Workshop on Strategies for Developing Machine Translation for Minority Languages, Genoa, Italy, pp. 103–109.Google Scholar

Scherrer, Y. (2014). Unsupervised adaptation of supervised part-of-speech taggers for closely related languages. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, Dublin, Ireland, pp. 30–38.CrossRef Google Scholar

Scherrer, Y. and Rabus, A. (2017). Multi-source morphosyntactic tagging for spoken rusyn. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain, pp. 84–92.CrossRef Google Scholar

Scherrer, Y. and Rabus, A. (2019). Neural morphosyntactic tagging for Rusyn. Natural Language Engineering 25(5), 633–650.CrossRef Google Scholar

Scherrer, Y., Rabus, A. and Mocken, S. (2018). New developments in tagging pre-modern orthodox slavic texts. Scripta & e-Scripta 18, 9–33.Google Scholar

Scherrer, Y., Samardžić, T. and Glaser, E. (2019). Digitising Swiss German – how to process and study a polycentric spoken language. Language Resources and Evaluation 53(4), 735–769.CrossRef Google Scholar

Sennrich, R., Haddow, B. and Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, August. Association for Computational Linguistics, pp. 1715–1725.CrossRef Google Scholar

Shapiro, P. and Duh, K. (2019). Comparing pipelined and integrated approaches to dialectal Arabic neural machine translation. In Proceedings of the Workshop on NLP for Similar Languages Varieties and Dialects (VarDial), Minneapolis, USA, pp. 214—222.Google Scholar

Simaki, V., Simakis, P., Paradis, C. and Kerren, A. (2017). Identifying the authors’ national variety of English in social media texts. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2017), Varna, Bulgaria, September. INCOMA Ltd., pp. 671–678.CrossRef Google Scholar

Søgaard, A., Vulic, I., Ruder, S. and Faruqui, M. (2019). Cross-Lingual Word Embeddings. Synthesis Lectures on Human Language Technologies. San Rafael: Morgan & Claypool Publishers.Google Scholar

Solorio, T., Blair, E., Maharjan, S., Bethard, S., Diab, M., Ghoneim, M., Hawwari, A., AlGhamdi, F., Hirschberg, J., Chang, A. and Fung, P. (2014). Overview for the first shared task on language identification in code-switched data. In Proceedings of the Workshop on Computational Approaches to Code Switching, Doha, Qatar, pp. 62–72.CrossRef Google Scholar

Sutskever, I., Vinyals, O. and Le, Q.V. (2014). Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing System, NIPS’14, Montreal, Canada, pp. 3104–3112.Google Scholar

Suzuki, I., Mikami, Y., Ohsato, A. and Chubachi, Y. (2002). A language and character set determination method based on N-gram statistics. ACM Transactions on Asian Language Information Processing (TALIP) 1(3), 269–278.Google Scholar

Täckström, O., McDonald, R. and Uszkoreit, J. (2012). Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Montréal, Canada, pp. 477–487.Google Scholar

Tan, L., Zampieri, M., Ljubešić, N. and Tiedemann, J. (2014). Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the Workshop on Building and Using Comparable Corpora (BUCC), Reykjavik, Iceland, pp. 6–10.Google Scholar

Tiedemann, J. (2012). Character-based pivot translation for under-resourced languages and domains. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL’12, Avignon, France, pp. 141–151.Google Scholar

Tiedemann, J. and Agić, Ž. (2016). Synthetic treebanking for cross-lingual dependency parsing. Journal of Artificial Intelligence Research 55, 209–248.Google Scholar

Tiedemann, J. and Ljubešić, N. (2012). Efficient discrimination between closely related languages. In Proceedings of the International Conference in Computational Linguistics (COLING), Mumbai, India, pp. 2619–2634.Google Scholar

Tiedemann, J. and Nakov, P. (2013). Analyzing the use of character-level translation with sparse and noisy datasets. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP’13, Hissar, Bulgaria, pp. 676–684.Google Scholar

Tillmann, C., Al-Onaizan, Y. and Mansour, S. (2014). Improved sentence-level Arabic dialect classification. In Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects (VarDial), Dublin, Ireland, pp. 110–119.

Tjong Kim Sang, E., Bollmann, M., Boschker, R., Casacuberta, F., Dietz, F., Dipper, S., Domingo, M., van der Goot, R., van Koppen, M., Ljubešić, N., Östling, R., Petran, F., Pettersson, E., Scherrer, Y., Schraagen, M., Sevens, L., Tiedemann, J., Vanallemeersch, T. and Zervanou, K. (2017). The CLIN27 shared task: Translating historical text to contemporary language for improving automatic linguistic annotation. Computational Linguistics in the Netherlands Journal 7, 53–64.

Tyers, F. and Alperen, M.S. (2010). South-East European Times: A parallel corpus of Balkan languages. In Proceedings of the LREC workshop on Exploitation of multilingual resources and tools for Central and (South) Eastern European Languages.

van der Lee, C. and van den Bosch, A. (2017). Exploring lexical and syntactic features for language variety identification. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain, pp. 190–199.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L. and Polosukhin, I. (2017). Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems, NIPS’17, Long Beach, California, USA, pp. 5998–6008.

Vilar, D., Peter, J.-T. and Ney, H. (2007). Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT’07, Prague, Czech Republic, pp. 33–39.

Vogel, J. and Tresner-Kirsch, D. (2012). Robust language identification in short, noisy texts: Improvements to LIGA. In Third International Workshop on Mining Ubiquitous and Social Environments (MUSE 2012).

Wang, P., Nakov, P. and Ng, H.T. (2012). Source language adaptation for resource-poor machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL’12, Jeju Island, Korea, pp. 286–296.

Wang, P., Nakov, P. and Ng, H.T. (2016). Source language adaptation approaches for resource-poor machine translation. Computational Linguistics 42(2), 277–306.

Wang, P. and Ng, H.T. (2013). A beam-search decoder for normalization of social media text with application to machine translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT’13, Atlanta, Georgia, USA, pp. 471–481.

Wray, S. (2018). Classification of closely related sub-dialects of Arabic using support-vector machines. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan, pp. 3671–3674.

Yarowsky, D. and Ngai, G. (2001). Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Pittsburgh, USA, pp. 200–207.

Zaidan, O.F. and Callison-Burch, C. (2011). The Arabic online commentary dataset: An annotated dataset of informal Arabic with high dialectal content. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers-Volume 2, Portland, Oregon, USA, June, pp. 37–41.

Zaidan, O.F. and Callison-Burch, C. (2014). Arabic dialect identification. Computational Linguistics 40(1), 171–202.

Zampieri, M. and Gebre, B.G. (2012). Automatic identification of language varieties: The case of Portuguese. In Proceedings of The 11th Conference on Natural Language Processing (KONVENS 2012), Vienna, Austria, pp. 233–237.

Zampieri, M., Gebre, B.G. and Diwersy, S. (2013). N-gram language models and POS distribution for the identification of Spanish varieties. In Proceedings of la 20ème conférence du Traitement Automatique du Langage Naturel (TALN), Sables d’Olonne, France, pp. 580–587.

Zampieri, M., Malmasi, S., Ljubešić, N., Nakov, P., Ali, A., Tiedemann, J., Scherrer, Y. and Aepli, N. (2017). Findings of the VarDial evaluation campaign 2017. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain. Association for Computational Linguistics, pp. 1–15.

Zampieri, M., Malmasi, S., Nakov, P., Ali, A., Shon, S., Glass, J., Scherrer, Y., Samardžić, T., Ljubešić, N., Tiedemann, J., van der Lee, C., Grondelaers, S., Oostdijk, N., Speelman, D., van den Bosch, A., Kumar, R., Lahiri, B. and Jain, M. (2018). Language identification and morphosyntactic tagging: The second VarDial evaluation campaign. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018). Association for Computational Linguistics, pp. 1–17.

Zampieri, M., Malmasi, S., Scherrer, Y., Samardžić, T., Tyers, F., Silfverberg, M., Klyueva, N., Pan, T.-L., Huang, C.-R., Ionescu, R.T., Butnaru, A. and Jauhiainen, T. (2019). A report on the third VarDial evaluation campaign. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial). Association for Computational Linguistics, pp. 1–16.

Zampieri, M., Malmasi, S., Sulea, O.-M. and Dinu, L.P. (2016). A computational approach to the study of Portuguese newspapers published in Macau. In Proceedings of the Workshop on Natural Language Processing meets Journalism (NLPMJ 2016), New York City, NY, USA, pp. 47–51.

Zampieri, M., Tan, L., Ljubešić, N. and Tiedemann, J. (2014). A report on the DSL shared task 2014. In Proceedings of the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Dublin, Ireland, pp. 58–67.

Zampieri, M., Tan, L., Ljubešić, N., Tiedemann, J. and Nakov, P. (2015). Overview of the DSL shared task 2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (LTVarDial), Hissar, Bulgaria, pp. 1–9.

Zbib, R., Malchiodi, E., Devlin, J., Stallard, D., Matsoukas, S., Schwartz, R., Makhoul, J., Zaidan, O.F. and Callison-Burch, C. (2012). Machine translation of Arabic dialects. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT), Montreal, Canada, pp. 49–59.

Zeman, D. (2008). Reusable tagset conversion using tagset drivers. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco, pp. 213–218.

Zeman, D. and Resnik, P. (2008). Cross-language parser adaptation between related languages. In Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, Hyderabad, India, pp. 35–42.

Zhang, X. (1998). Dialect MT: A case study between Cantonese and Mandarin. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2, ACL-COLING’98, Quebec, Canada, pp. 1460–1464.

Zhao, L., Kipper, K., Schuler, W., Vogler, C., Badler, N.I. and Palmer, M. (2000). A machine translation system from English to American Sign Language. In Proceedings of the 4th Conference of the Association for Machine Translation in the Americas on Envisioning Machine Translation in the Information Future, AMTA’00, London, UK, pp. 54–67.

Zissman, M.A. and Berkling, K.M. (2001). Automatic language identification. Speech Communication 35(1–2), 115–124.

Zoph, B., Yuret, D., May, J. and Knight, K. (2016). Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP’16, Austin, Texas, USA, pp. 1568–1575.

Zubiaga, A., Vicente, I.S., Gamallo, P., Pichel, J.R., Alegria, I., Aranberri, N., Ezeiza, A. and Fresno, V. (2014). Overview of TweetLID: Tweet language identification at SEPLN 2014. In Proceedings of the Tweet Language Identification Workshop 2014 co-located with 30th Conference of the Spanish Society for Natural Language Processing (SEPLN 2014), Girona, Spain, pp. 1–11.

Zupan, K., Ljubešić, N. and Erjavec, T. (2019). How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts. Natural Language Engineering 25(5), 651–674.
