research-article

Machine Normalization: Bringing Social Media Text from Non-Standard to Standard Form

Authors:
Randa Zarnoufi (Mohammed V University, ORCID 0000-0002-3353-6704),
Hamid Jaafar (Cadi Ayyad University),
Mounia Abik (Mohammed V University)
ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 19, Issue 4, Article No. 49, pp. 1–30. https://doi.org/10.1145/3378414

Published: 11 April 2020

Abstract

User-generated text in social media communication (SMC) is mostly written in non-standard form. It may contain code-switched (CS) text, a widespread phenomenon in SMC, as well as noisy elements typical of written conversations (abbreviations, symbols, emoticons) and misspelled words. All of these factors are an obstacle to text mining applications: common text mining tools are designed for standard use of standard languages and cannot handle other forms, especially text written in social media. To overcome these problems, we present a solution for normalizing non-standard use of standard and non-standard languages (dialects) in SMC text using existing resources and tools. The main processing step is CS normalization from multiple languages to one language by means of a machine translation-like approach. This step relies on a linguistic approach to CS that automatically identifies the translation source and target languages, without human intervention. The remaining processing steps normalize SMC special expressions and correct the spelling of out-of-vocabulary words. To preserve the meaning of the code-switched sentence across translation, we adopt a knowledge-based approach to word sense translation disambiguation, reinforced with a multilingual vertical context. All of these processes are embedded in what we refer to as the machine normalization system. Our solution can be used as a front end to text mining pipelines, enabling the analysis of noisy SMC text. The experiments we conducted show that our system performs better than the baselines considered.
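The processing order described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the lexicons, the bilingual dictionary, and every function name below are hypothetical toy stand-ins, and the target language is chosen by a simple majority count, which only approximates the linguistic approach the abstract mentions. Spelling correction and word sense disambiguation are left out.

```python
from collections import Counter

# Toy resources invented for illustration; they are not from the paper.
LEXICONS = {
    "en": {"i", "love", "this", "song", "is", "great", "you", "thanks"},
    "fr": {"chanson", "est", "super", "merci"},
}
# Hypothetical bilingual dictionary keyed by (source, target) language pair.
TRANSLATIONS = {
    ("fr", "en"): {"chanson": "song", "est": "is", "super": "great", "merci": "thanks"},
}
# Hypothetical table of SMC special expressions and their standard forms.
SMC_EXPRESSIONS = {"u": "you", "gr8": "great", "thx": "thanks"}


def expand_smc(tokens):
    """Replace known SMC abbreviations with their standard forms."""
    return [SMC_EXPRESSIONS.get(tok, tok) for tok in tokens]


def tag_languages(tokens):
    """Tag each token with the first lexicon that contains it, else 'unk'."""
    return [
        next((lang for lang, words in LEXICONS.items() if tok in words), "unk")
        for tok in tokens
    ]


def pick_target(tags):
    """Assumption: use the most frequent known language as the target."""
    counts = Counter(tag for tag in tags if tag != "unk")
    return counts.most_common(1)[0][0] if counts else None


def normalize(text):
    """Expand SMC expressions, then translate embedded tokens to the target."""
    tokens = expand_smc(text.lower().split())
    tags = tag_languages(tokens)
    target = pick_target(tags)
    output = []
    for tok, tag in zip(tokens, tags):
        if tag not in (target, "unk"):
            # Fall back to the original token when no translation is known.
            tok = TRANSLATIONS.get((tag, target), {}).get(tok, tok)
        output.append(tok)
    return " ".join(output)


print(normalize("i love this chanson thx"))
```

In this sketch the abbreviation table is applied before language tagging so that expanded forms can be matched against the lexicons; a real system would need proper tokenization, language identification, and sense-aware translation.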


Index Terms

Machine Normalization: Bringing Social Media Text from Non-Standard to Standard Form
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction techniques
      1. Text input

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 19, Issue 4
July 2020
291 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3391538
Editor:
Imed Zitouni
Microsoft, USA
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 11 April 2020
- Accepted: 1 December 2019
- Revised: 1 October 2019
- Received: 1 March 2019
Author Tags
Text normalization
automatic language identification
code switching normalization
dialects
matrix language
multilingual vertical context
social media
standard languages
word sense disambiguation
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 2
- Total Downloads: 400
- Downloads (last 12 months): 15
- Downloads (last 6 weeks): 1