research-article

Machine Normalization: Bringing Social Media Text from Non-Standard to Standard Form

Published: 11 April 2020

Abstract

User-generated text in social media communication (SMC) is mostly written in non-standard form. It may contain code-switching (CS), a widespread phenomenon in SMC, as well as noisy elements typical of written conversations (abbreviations, symbols, emoticons) and misspelled words. All of these factors are an obstacle to text mining applications: common text mining tools are designed for the standard use of standard languages and cannot handle other forms, especially the written text found in social media. To overcome these problems, we present a solution for normalizing the non-standard use of standard and non-standard languages (dialects) in SMC text using existing resources and tools. The main processing step is CS normalization from multiple languages to one language by means of a machine translation-like approach. This step relies on a linguistic approach to CS that automatically identifies the translation source and target languages, without human intervention. The remaining processing steps normalize SMC special expressions and correct the spelling of out-of-vocabulary words. To preserve the meaning of the code-switched sentence across translation, we adopt a knowledge-based approach to word sense translation disambiguation, reinforced with a multilingual vertical context. All of these processes are embedded in what we refer to as the machine normalization system. Our solution can be used as a front end to text mining pipelines, enabling the analysis of noisy SMC text. The experiments we conducted show that our system performs better than the baselines considered.
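
To make the two simpler steps named in the abstract more concrete (normalizing SMC special expressions and correcting out-of-vocabulary words), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' system. The lookup table `SMC_EXPRESSIONS`, the toy `VOCABULARY`, the function names, and the choice of an edit distance with adjacent transpositions as the correction criterion are all hypothetical. The code-switching translation and word sense disambiguation steps are not covered.

```python
from typing import Dict, Iterable, List

# Hypothetical toy resources; a real system would load far larger ones.
SMC_EXPRESSIONS: Dict[str, str] = {"u": "you", "gr8": "great", "2moro": "tomorrow"}
VOCABULARY = {"you", "are", "great", "see", "tomorrow", "friend"}


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    prev2: List[int] = []               # row i-2 (unused until i > 1)
    prev = list(range(len(b) + 1))      # row i-1
    for i, ca in enumerate(a, start=1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition, e.g. "ei" <-> "ie".
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def correct_spelling(token: str, vocabulary: Iterable[str], max_distance: int = 1) -> str:
    """Return the closest vocabulary word, or the token itself if none is near enough."""
    best = min(sorted(vocabulary), key=lambda w: osa_distance(token, w))
    return best if osa_distance(token, best) <= max_distance else token


def normalize(text: str) -> str:
    """Expand known SMC expressions, then try to correct remaining unknown tokens."""
    out = []
    for token in text.lower().split():
        if token in SMC_EXPRESSIONS:
            out.append(SMC_EXPRESSIONS[token])
        elif token in VOCABULARY:
            out.append(token)
        else:
            out.append(correct_spelling(token, VOCABULARY))
    return " ".join(out)


print(normalize("u are gr8 freind"))  # you are great friend
```

Tokens that are neither known expressions nor within one edit of a vocabulary word are left unchanged, which keeps the sketch conservative on words it cannot resolve.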


          • Published in

            ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 4
            July 2020
            291 pages
            ISSN: 2375-4699
            EISSN: 2375-4702
            DOI: 10.1145/3391538

            Copyright © 2020 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 11 April 2020
            • Accepted: 1 December 2019
            • Revised: 1 October 2019
            • Received: 1 March 2019
            Published in tallip Volume 19, Issue 4

            Qualifiers

            • research-article
            • Research
            • Refereed
