
SimiT: A Text Similarity Method Using Lexicon and Dependency Representations

Published in New Generation Computing.

Abstract

Semantic textual similarity methods are becoming increasingly important in text mining research areas such as text retrieval and summarization. Existing text similarity methods often rely on shallow or syntactic representations rather than on semantic content and meaning. This paper focuses on computing the similarity between sentences without supervised learning, considering only their word-level coherence, which is calculated by a hybrid method combining dependency-parser and lexicon embeddings. The dependency-parser embeddings capture the structural similarity between text pairs, while the lexicon embeddings capture the semantic information of the words in the sentences. In the evaluation, we compare our method with state-of-the-art semantic similarity measures on a well-known dataset. Our method outperforms most studies in the literature, and overall performance improves further when the similarity scores of both embedding models are combined.
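The abstract describes two ingredients: a word-level coherence score computed under an embedding model, and a final score that combines the results of two such models (lexicon and dependency based). The sketch below illustrates that general shape only; the toy vectors, the greedy word-alignment scheme, and the averaging weight `alpha` are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sentence_score(s1, s2, emb):
    """Word-level coherence: match each word with its most similar
    counterpart in the other sentence, then average both directions
    (a common greedy-alignment scheme, assumed here for illustration)."""
    a = [w for w in s1 if w in emb]
    b = [w for w in s2 if w in emb]
    if not a or not b:
        return 0.0
    def directed(xs, ys):
        sims = [max(cosine(emb[w], emb[y]) for y in ys) for w in xs]
        return sum(sims) / len(sims)
    return 0.5 * (directed(a, b) + directed(b, a))

def hybrid_score(s1, s2, lexicon_emb, dep_emb, alpha=0.5):
    """Combine the scores of the two embedding models; a simple
    weighted average is assumed."""
    return (alpha * sentence_score(s1, s2, lexicon_emb)
            + (1 - alpha) * sentence_score(s1, s2, dep_emb))

# Toy 2-d vectors standing in for lexicon embeddings (e.g. ConceptNet
# Numberbatch) and dependency-based word embeddings.
lex = {"dog": [1.0, 0.1], "cat": [0.9, 0.2], "runs": [0.1, 1.0], "sleeps": [0.2, 0.9]}
dep = {"dog": [0.8, 0.3], "cat": [0.7, 0.4], "runs": [0.3, 0.8], "sleeps": [0.2, 0.9]}

score = hybrid_score(["dog", "runs"], ["cat", "sleeps"], lex, dep)
```

In this sketch a pair of identical sentences scores 1.0 under either model, and the hybrid score interpolates between the two models' judgments.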




Author information

Correspondence to Emrah Inan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Inan, E. SimiT: A Text Similarity Method Using Lexicon and Dependency Representations. New Gener. Comput. 38, 509–530 (2020). https://doi.org/10.1007/s00354-020-00099-8

