Skip to main content
Log in

BLSTM-API: Bi-LSTM Recurrent Neural Network-Based Approach for Arabic Paraphrase Identification

  • Research Article-Computer Engineering and Computer Science
  • Published:
Arabian Journal for Science and Engineering Aims and scope Submit manuscript

Abstract

Advances in communication technologies have enabled peoples to deliver more. Due to this phenomenon, an increasing amount of data are easily disseminated and published on the internet, which encouraged the practice of paraphrasing. It allows the original sentence to be concealed by alternative expressions of the same meaning. Its detection consists in identifying the degree of semantic similarity between them. It is one of the complex tasks of automatic natural language processing and artificial intelligence. Despite the fact that Arabic language is spoken by a large population around the world, it is rich of grammars and semantics that made hard its sentences modeling and similarity computing. In this paper, an Arabic extrinsic paraphrase identification method is proposed. It is based on a Siamese recurrent neural networks architecture seeing its performance in processing variable size of textual sequences. Indeed, pertinent features are firstly extracted using global word vector that used a global co-occurrence matrix based on a local context window. Then, bidirectional long short-term memory is introduced that incorporated efficiently long-term dependent relationships and captured meaningful contextual semantics between words. For paraphrase identification, cosine measure is used as a merge function. It was useful for identifying semantic similarity between the obtained source and suspect vectors. To address the lack of free and publicly Arabic paraphrased datasets, word2vec algorithm and part-of-speech tagging are combined to generate suspect sentences. For its validation, its quality is compared to the SemEval benchmark. Experiments demonstrated the effectiveness of our proposal’s methods.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

References

  1. Altheneyan, A.; Menai, M.E.B.: Evaluation of state-of-the-art paraphrase identification and its application to automatic plagiarism detection. Int. J. Pattern Recognit Artif Intell. 34(4), 1–31 (2020)

    Article  Google Scholar 

  2. Abdellaoui, H.; Zrigui, M.: Using tweets and emojis to build TEAD: an arabic dataset for sentiment analysis. Computación y Sistemas 22(3), 777–786 (2018)

    Article  Google Scholar 

  3. Mahmoud, A.; Zrigui, M.: Semantic similarity analysis for paraphrase identification in Arabic texts. In: 31st Pacific Asia Conference on Language, Information and Computation PACLIC, Philippine, pp. 274–281 (2017)

  4. Hkiri, E.; Mallat, S.; Zrigui, M.: Integrating bilingual named entities lexicon with conditional random fields model for Arabic named entities recognition. In: 14th IAPR International Conference on Document Analysis and Recognition, Kyoto, Japan, pp. 609–614 (2017)

  5. Hkiri, E.; Mallat, S.; Zrigui, M.; Mars, M.: Constructing a lexicon of Arabic-English named entity using SMT and semantic linked data. Int. Arab J. Inf. Technol. 14, 820–825 (2017)

    Google Scholar 

  6. Boudhief, A.; Maraoui, M.; Zrigui, M: Elaboration of a model for an indexed base for teaching Arabic language to disabled people. In: 6th International Conference on Computer Science and Information Technology CSIT, Amman, Jordan (2016)

  7. Maraoui, M.; Terbeh, N.; Zrigui, M.: Arabic discourse analysis based on acoustic, prosodic and phonetic modeling: elocution evaluation, speech classification and pathological speech correction. Int. J. Speech Technol. 21(14), 1071–1090 (2018)

    Article  Google Scholar 

  8. Batita, M.A.; Zrigui, M.: Derivational relations in arabic wordnet. In: 9th Global WordNet Conference GWC, Singapore (2018)

  9. Mohamed, M.A.B.; Mallat, S.; Nahdi, M.A.; Zrigui, M.: Exploring the potential of schemes in building NLP tools for Arabic language. Int. Arab J. Inf. Technol. (IAJIT) 12(16), 566–573 (2015)

    Google Scholar 

  10. Abualigah, L.M.Q.: Feature selection and enhanced krill herd algorithm for text clustering. Stud. Comput. Intell. (2018). https://doi.org/10.1007/978-3-030-10674-4

    Article  Google Scholar 

  11. Diana, N.E.; Ulfa, I.H.: Measuring performance of n-gram and Jaccard-similarity metrics in document plagiarism application. J. Phys. 1196, 1–8 (2019)

    Google Scholar 

  12. Ilham, A.A.; Bustamin, A.; Aswad, I.; Armin F.: Implementation of clustering and similarity analysis for detecting content similarity in student final projects. In: 3rd EPI International Conference on Science and Engineering, India (2020)

  13. Abualigaha, L.M.; Khader, A.T.; Hanandeh, E.S.: A new feature selection method to improve the document clustering using particle swarm optimization algorithm. J. Comput. Sci. 25, 456–466 (2018)

    Article  Google Scholar 

  14. Abualigah, L.M.; Khader, A.T.; Hanandeh, E.S.: Hybrid clustering analysis using improved krill herd algorithm. Appl. Intell. 48(5), 4047–4071 (2018)

    Article  Google Scholar 

  15. Sahu, M.: Plagiarism detection using artificial intelligence technique in multiple files. Int. J. Sci. Technol. Res. 5(14), 111–114 (2016)

    Google Scholar 

  16. Ali, W.; Ahmed, T.; Rehman, Z.; Anwar, U.R.; Slaman, L.: Detection of plagiarism in Urdu text documents. In: 14th International Conference on Emerging Technologies ICET, Islamabad (2018)

  17. Ullah, F.; Wang, J.; Farhan, M.; Jabbar, S.; Naseer, M.K.; Asif, M.: LSA based smart assessment methodology for SDN infrastructure in IoT environment. Int. J. Parallel Prog. 48, 162–177 (2020)

    Article  Google Scholar 

  18. Ratna, A.A.P.; Wulandari, N.A.; Kaltsum, A.; Ibrahim, I.; Purnamasari, P.D.: Answer categorization method using K-Means for Indonesian language automatic short answer grading system based on Latent Semantic Analysis. In: International Conference on Quality in Research (QIR): International Symposium on Electrical and Computer Engineering, Indonesia (2019)

  19. Daud, A.; Khan, J.A.; Nasir, J.A.; Abbasi, R.: Latent dirichlet allocation and POS tags based method for external plagiarism detection: LDA and POS tags based plagiarism detection. Int. J. Semant. Web Inf. Syst. (IJSWIS) 14(13), 53–69 (2018)

    Article  Google Scholar 

  20. Xue, M.: A text retrieval algorithm based on the hybrid LDA and Word2Vec model. In: International Conference on Intelligent Transportation, Big Data & Smart City ICITBS, China (2019)

  21. Yazid, B.; Mourad, O.; Abdelmalik, T.: Semantic similarity approach between two sentences. In: 5th International Conference on the Image and Signal Processing and their Applications, Algeria (2019)

  22. Farouk, M.: Measuring text similarity based on structure and word embedding. Cogn. Syst. Res. 63(11), 1–10 (2020)

    Article  Google Scholar 

  23. Suleiman, D.; Awajan, A.; Al-Madi, N.: Deep learning based technique for plagiarism detection in Arabic texts. In: International Conference on New Trends in Computing Sciences ICTCS, Jordan (2017)

  24. Nagoudi, E.M.B.; Ferrero, J.; Schwab, D.: LIM-LIG at SemEval-2017 Task1: enhancing the semantic similarity for arabic sentences with vectors weighting. in: 11th International Workshop on Semantic Evaluation SemEval-2017, Canada (2017)

  25. Florou, E.; Perifanos, K.; Goutos, D.: Neural embeddings for metaphor detection in a corpus of Greek texts. In: International Conference on Information, Intelligence, Systems and Applications IISA, Greece (2018)

  26. Mahmoud, A.; Zrigui, M.: Machine learning based method for detecting Arabic paraphrases. In: 33rd International Business Information Management Association IBIMA, Granada, Spain, pp. 5035–5048 (2019)

  27. Mahmoud, A.; Zrigui, M.: Similar meaning analysis for original documents identification in Arabic language. In: International Conference on Computational Collective Intelligence ICCCI), Hendaye, France, pp. 193–206 (2019)

  28. Mahmoud, A.; Zrigui, M.: Deep neural network models for paraphrased text classification in the Arabic language. In: 24th International Conference on Applications of Natural Language to Information Systems NLDB, Salford, UK, pp. 3–16 (2019)

  29. Kim, Y.: Convolutional neural networks for sentence classification. In: Conference on Empirical Methods in Natural Language Processing EMNLP, Doha, Qatar, pp. 1746–1751 (2014)

  30. He, H.; Gimpel, K.; Lin, J.: Multi-perspective sentence similarity modelling with convolutional neural networks. In: Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1576–1586 (2015)

  31. Song, Y.; Hu, Q.V.; He, L.: P-CNN: enhancing text matching with positional convolutional neural network. Knowl. Based Syst. 169, 67–79 (2019)

    Article  Google Scholar 

  32. Bsir, B.; Zrigui, M.: Gender identification: a comparative study of deep learning architectures. In: International Conference on Intelligent Systems Design and Applications ISDA, Advances in Intelligent Systems and Computing, Springer, vol 94, pp. 792–800 (2020)

  33. Liu, G., Guoa, J.: Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 337, 1–51 (2019)

    Article  Google Scholar 

  34. Hunt, E.; Janamsetty, R.; Kinares, C.; Koh, C.; Sanchez, A.; Zhan, F.; Ozdemir, M.; Waseem, S.; Yolcu, O.; Dahal, B.; Zhan, J.; Gewali, L.; Oh, P.: Machine learning models for paraphrase identification and its applications on plagiarism detection. In: IEEE International Conference on Big Knowledge ICBK, Beijing China, pp. 97–104 (2019)

  35. Duong, P.H.; Nguyen, H.T.; Duong, H.N.; Ngo, K.; Ngo, D.: A hybrid approach to paraphrase detection. In: 5th NAFOSTED Conference on Information and Computer Science, pp. 366–371 (2018)

  36. Wang, X.; Li, C.; Zheng, Z.; Xu, B.: Paraphrase recognition via combination of neural classifier and keywords. In: International Joint Conference on Neural Networks IJCNN, Rio, Brazil, pp. 1–8 (2018)

  37. Einea, O.; Elnagar, A.: Predicting semantic textual similarity of Arabic question pairs using deep learning. In: 16th International Conference on Computer Systems and Applications AICCSA, Abu Dhabi, United Arab Emirates, pp. 1–5 (2020)

  38. Wang, S.; Zhou, W.; Jiang, C.: A survey of word embeddings based on deep learning. Computing 102, 717–740 (2020)

    Article  MathSciNet  Google Scholar 

  39. Pennington, J.; Socher, R.; Manning, C.: GloVe: Global vectors for word representation. In: Conference on Empirical Methods in Natural Language Processing EMNLP, Qatar, pp. 1532–1543 (2014)

  40. Alrabiah, M.; Al-Salman, A.; Atwell, E.; Alhelewh, N.: KSUCCA: a key to exploring Arabic historical linguistics. Int. J. Comput. Linguist. (IJCL) 5, 27–36 (2014)

    Google Scholar 

  41. Saad, M.K.; Ashour, W.: OSAC: Open Source Arabic Corpora. In: 6th International Conference on Electrical and Computer Systems EECS’10, North Cyprus (2010)

  42. Chicco, D.; Jurman, G.: The advantages of the matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 21(6), 1–13 (2020)

    Google Scholar 

  43. Kong, L., Han, Z., Han, Y., Qi, H.: A deep paraphrase identification model interacting semantics with syntax. Hindawi Complex 2020, 1–14 (2020)

    Google Scholar 

  44. Othman, N.; Faiz, R.; Smaili, K.: Manhattan siamese LSTM for question retrieval in community question answering. In: 18th International Conference on Ontologies, DataBases, and Applications of Semantics ODBASE, Greece (2019)

  45. Yao, L.; Pan, Z.; Ning, H.: Unlabeled short text similarity with LSTM encoder. IEEE Access 7(11), 3430–3437 (2019)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Adnen Mahmoud.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Mahmoud, A., Zrigui, M. BLSTM-API: Bi-LSTM Recurrent Neural Network-Based Approach for Arabic Paraphrase Identification. Arab J Sci Eng 46, 4163–4174 (2021). https://doi.org/10.1007/s13369-020-05320-w

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s13369-020-05320-w

Keywords

Navigation