Abstract
The first step in any NLP pipeline is splitting the text into individual tokens. The most obvious approach is to use words as tokens; however, for a large text corpus, representing every word is inefficient in terms of vocabulary size. Many tokenization algorithms have therefore been proposed that tackle this problem by creating subwords, which limits the vocabulary size for a given corpus. Most tokenization techniques are language-agnostic, i.e., they do not incorporate the linguistic features of a given language, and evaluating such techniques in practice is difficult. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three other popular tokenizers using unsupervised evaluations. In addition, we compare all six tokenizers on three supervised classification tasks: sentiment analysis, news classification, and poem-meter classification, using six publicly available datasets. Our experiments show that no single tokenization technique is the best choice overall and that the performance of a given tokenization algorithm depends on many factors, including the size of the dataset, the nature of the task, and the morphological richness of the dataset. Nevertheless, some tokenization techniques perform better than others across the various text classification tasks.
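The subword idea behind these tokenizers can be illustrated with a minimal byte-pair-encoding (BPE) sketch, one family of subword tokenizers discussed in the paper. This is an illustrative toy implementation, not the paper's method: the `learn_bpe` helper and the tiny corpus below are assumptions for demonstration. BPE starts from characters and repeatedly merges the most frequent adjacent symbol pair, so frequent strings become single vocabulary entries while rare words remain decomposable.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a word-frequency dict.

    `words` maps space-separated symbol sequences to counts, e.g.
    {"l o w </w>": 5}. Returns the ordered list of merge operations.
    (Sketch only: the string replace assumes merges stay symbol-aligned,
    which holds for this toy corpus.)
    """
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {w.replace(" ".join(best), "".join(best)): f
                 for w, f in vocab.items()}
    return merges

# Toy corpus: character-split words with end-of-word marker </w>.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges = learn_bpe(corpus, 4)
# The first merges build the frequent suffix: ('e','s'), ('es','t'), ...
```

With a larger merge budget, frequent whole words end up as single tokens while rare or unseen words fall back to shorter subwords, which is how BPE bounds the vocabulary size that word-level tokenization cannot.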
Data Availability
All datasets used in this study are publicly available.
Acknowledgements
The authors would like to thank King Fahd University of Petroleum & Minerals (KFUPM) for supporting this work. Irfan Ahmad is supported through grant number JRCAI-RFP-06.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Alyafeai, Z., Al-shaibani, M.S., Ghaleb, M. et al. Evaluating Various Tokenizers for Arabic Text Classification. Neural Process Lett 55, 2911–2933 (2023). https://doi.org/10.1007/s11063-022-10990-8