
SimiT: A Text Similarity Method Using Lexicon and Dependency Representations

Published in New Generation Computing.

Abstract

Semantic textual similarity methods are becoming increasingly important in text mining research areas such as text retrieval and summarization. Existing text similarity methods often rely on shallow or syntactic representations rather than on semantic content and meaning. This paper focuses on computing the similarity between sentences without supervised learning, considering only their word-level coherence, which is calculated by a hybrid method combining dependency-parser and lexicon embeddings. The dependency-parser embeddings capture the structural similarity between text pairs, while the lexicon embeddings capture the semantic information of the words in the sentences. In the evaluation, we compare our method with state-of-the-art semantic similarity measures on a well-known dataset. Our method outperforms most studies in the literature, and overall performance improves further when the similarity scores of both embedding models are combined.
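The abstract describes two ingredients: a word-level coherence score computed under an embedding model, and a final score that combines the results of two such models (lexicon and dependency based). The sketch below illustrates that general shape only; the toy vectors, the greedy word-alignment scheme, and the averaging weight `alpha` are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_score(s1, s2, emb):
    """Word-level coherence: match each word with its most similar
    counterpart in the other sentence, then average both directions
    (a common greedy-alignment scheme, assumed here for illustration)."""
    a = [w for w in s1 if w in emb]
    b = [w for w in s2 if w in emb]
    if not a or not b:
        return 0.0
    def directed(xs, ys):
        sims = [max(cosine(emb[w], emb[y]) for y in ys) for w in xs]
        return sum(sims) / len(sims)
    return 0.5 * (directed(a, b) + directed(b, a))

def hybrid_score(s1, s2, lexicon_emb, dep_emb, alpha=0.5):
    """Combine the scores of the two embedding models; a simple
    weighted average is assumed."""
    return (alpha * sentence_score(s1, s2, lexicon_emb)
            + (1 - alpha) * sentence_score(s1, s2, dep_emb))

# Toy 2-d vectors standing in for lexicon embeddings (e.g. ConceptNet
# Numberbatch) and dependency-based word embeddings.
lex = {"dog": [1.0, 0.1], "cat": [0.9, 0.2], "runs": [0.1, 1.0], "sleeps": [0.2, 0.9]}
dep = {"dog": [0.8, 0.3], "cat": [0.7, 0.4], "runs": [0.3, 0.8], "sleeps": [0.2, 0.9]}

score = hybrid_score(["dog", "runs"], ["cat", "sleeps"], lex, dep)
```

In this sketch a pair of identical sentences scores 1.0 under either model, and the hybrid score interpolates between the two models' judgments.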




Author information

Correspondence to Emrah Inan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Inan, E. SimiT: A Text Similarity Method Using Lexicon and Dependency Representations. New Gener. Comput. 38, 509–530 (2020). https://doi.org/10.1007/s00354-020-00099-8

