
Evaluating Various Tokenizers for Arabic Text Classification

Published in: Neural Processing Letters 55, 2911–2933 (2023)

Abstract

The first step in any NLP pipeline is to split the text into individual tokens. The most obvious and straightforward approach is to use words as tokens. However, given a large text corpus, representing all the words is not efficient in terms of vocabulary size. Many tokenization algorithms have emerged in the literature to tackle this problem by creating subwords, which in turn limit the vocabulary size of a given text corpus. Most tokenization techniques are language-agnostic, i.e., they do not incorporate the linguistic features of a given language, and evaluating such techniques in practice is difficult. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three other popular tokenizers using unsupervised evaluations. In addition, we compare all six tokenizers by evaluating them on three supervised classification tasks: sentiment analysis, news classification, and poem-meter classification, using six publicly available datasets. Our experiments show that no single tokenization technique is the best choice overall and that the performance of a given tokenization algorithm depends on many factors, including the size of the dataset, the nature of the task, and the morphological richness of the dataset. Nevertheless, some tokenization techniques perform better than others overall across the various text classification tasks.
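
To make the vocabulary-size trade-off concrete, a language-agnostic subword tokenizer such as SentencePiece (see the notes below) can be trained against a fixed vocabulary budget. The following minimal sketch is illustrative only, not the paper's configuration; the corpus file name, vocabulary size, and model type are placeholder assumptions:

    import sentencepiece as spm

    # Train a subword (BPE) model on a raw Arabic corpus.
    # 'corpus.txt' and vocab_size=8000 are illustrative placeholders,
    # not the paper's actual settings.
    spm.SentencePieceTrainer.train(
        input='corpus.txt',
        model_prefix='arabic_bpe',
        vocab_size=8000,
        model_type='bpe',
    )

    # Load the trained model and segment text into subword pieces.
    sp = spm.SentencePieceProcessor(model_file='arabic_bpe.model')
    print(sp.encode('تصنيف النصوص العربية', out_type=str))
    # A fixed vocabulary of 8000 pieces can cover arbitrary text,
    # unlike a word-level vocabulary that grows with the corpus.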


Data Availability

All datasets used in this study are publicly available.

Notes

  1. https://github.com/google/sentencepiece

  2. https://github.com/ARBML/tkseem
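
For Arabic-specific tokenization, the tkseem library above bundles several trainable tokenizer classes. The sketch below follows the usage pattern shown in the project's README; the class name, file path, and method names are assumptions and may differ across versions:

    import tkseem as tk

    # Train a word-level tokenizer on a plain-text Arabic file.
    # 'data.txt' is a placeholder; this WordTokenizer/train/tokenize
    # usage is assumed from the project's README.
    tokenizer = tk.WordTokenizer()
    tokenizer.train('data.txt')
    print(tokenizer.tokenize('السلام عليكم'))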


Acknowledgements

The authors would like to thank King Fahd University of Petroleum & Minerals (KFUPM) for supporting this work. Irfan Ahmad is supported through grant number JRCAI-RFP-06.

Author information


Corresponding author

Correspondence to Irfan Ahmad.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Alyafeai, Z., Al-shaibani, M.S., Ghaleb, M. et al. Evaluating Various Tokenizers for Arabic Text Classification. Neural Process Lett 55, 2911–2933 (2023). https://doi.org/10.1007/s11063-022-10990-8

