Abstract
Advances in communication technologies have enabled people to share more information. As a result, an increasing amount of data is easily disseminated and published on the internet, which has encouraged the practice of paraphrasing. Paraphrasing conceals an original sentence behind alternative expressions with the same meaning. Detecting it consists of identifying the degree of semantic similarity between the original and the rewritten sentences, which is one of the complex tasks in natural language processing and artificial intelligence. Although Arabic is spoken by a large population around the world, its rich grammar and semantics make sentence modeling and similarity computation difficult. In this paper, an Arabic extrinsic paraphrase identification method is proposed. It is based on a Siamese recurrent neural network architecture, chosen for its performance in processing textual sequences of variable length. Relevant features are first extracted using global word vectors (GloVe), which rely on a global co-occurrence matrix built from a local context window. Then, a bidirectional long short-term memory (Bi-LSTM) network is introduced to efficiently incorporate long-term dependencies and capture meaningful contextual semantics between words. For paraphrase identification, the cosine measure is used as a merge function to identify the semantic similarity between the resulting source and suspect vectors. To address the lack of free, publicly available Arabic paraphrase datasets, the word2vec algorithm and part-of-speech tagging are combined to generate suspect sentences. To validate the generated dataset, its quality is compared with the SemEval benchmark. Experiments demonstrate the effectiveness of the proposed methods.
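The merge step named in the abstract can be illustrated with a minimal sketch. It assumes the two branches of the Siamese network have already produced fixed-length sentence vectors for the source and suspect sentences. The function names, the zero-vector handling, and the 0.5 decision threshold are illustrative assumptions, not values taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an all-zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(source_vec: Sequence[float],
                  suspect_vec: Sequence[float],
                  threshold: float = 0.5) -> bool:
    """Flag a pair as a paraphrase when the similarity reaches the threshold.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return cosine_similarity(source_vec, suspect_vec) >= threshold
```

In practice the threshold would be tuned on held-out data. Because cosine similarity ignores vector magnitude, two sentence vectors pointing in the same direction score 1 regardless of their norms.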
Cite this article
Mahmoud, A., Zrigui, M. BLSTM-API: Bi-LSTM Recurrent Neural Network-Based Approach for Arabic Paraphrase Identification. Arab J Sci Eng 46, 4163–4174 (2021). https://doi.org/10.1007/s13369-020-05320-w
DOI: https://doi.org/10.1007/s13369-020-05320-w