Semantic morphological variant selection and translation disambiguation for cross-lingual information retrieval

  • 1207: Innovations in Multimedia Information Processing & Retrieval
Multimedia Tools and Applications

Abstract

Cross-Lingual Information Retrieval (CLIR) enables a user to query in a language different from that of the target documents. CLIR relies on a translation technique based on either a manual dictionary or a probabilistic dictionary generated from a parallel corpus. Translation techniques for Hindi suffer from a translation mis-mapping issue caused by the morphological richness of the language. In addition, a word may have multiple translations in a dictionary, which leads to a word translation disambiguation issue. This paper makes two key contributions: Semantic Morphological Variant Selection (SMVS) and Hybrid Word Translation Disambiguation (HWTD). The former resolves the translation mis-mapping issue, and the latter disambiguates queries more effectively. The proposed techniques are evaluated on FIRE ad-hoc datasets, where SMVS and HWTD at the word level achieve better evaluation measures than the baseline Statistical Machine Translation.
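
The two problems named in the abstract can be illustrated with a small sketch. It is not the paper's SMVS or HWTD algorithm, which the abstract does not specify. It assumes that mis-mapping shows up as an inflected query word missing from the dictionary, and that disambiguation can lean on target-language co-occurrence. The dictionary entries, word vectors, co-occurrence counts, and function names below are all hypothetical toy values chosen for illustration.

```python
import math

# Toy probabilistic dictionary: source word -> {translation: probability}.
# All entries are made-up illustrations, not data from the paper.
DICTIONARY = {
    "ladka": {"boy": 0.9, "lad": 0.1},
    "sona": {"gold": 0.5, "sleep": 0.5},
    "daam": {"price": 0.8, "cost": 0.2},
}

# Toy source-language word vectors (hypothetical values).
SOURCE_VECTORS = {
    "ladka": (1.0, 0.1, 0.0),
    "ladkon": (0.9, 0.2, 0.0),  # inflected form, absent from DICTIONARY
    "sona": (0.0, 1.0, 0.2),
    "daam": (0.0, 0.2, 1.0),
}

# Toy target-language co-occurrence counts (unordered word pairs).
COOCCURRENCE = {
    frozenset(("gold", "price")): 40,
    frozenset(("gold", "cost")): 10,
    frozenset(("sleep", "price")): 1,
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_variant(word):
    """Return a dictionary headword for `word`, or None if none can be found.

    An exact match wins; otherwise fall back to the headword whose vector
    is closest to the query word's vector.
    """
    if word in DICTIONARY:
        return word
    vec = SOURCE_VECTORS.get(word)
    candidates = [h for h in DICTIONARY if h in SOURCE_VECTORS]
    if vec is None or not candidates:
        return None
    return max(candidates, key=lambda h: cosine(vec, SOURCE_VECTORS[h]))


def disambiguate(headword, context_translations):
    """Pick the translation maximising probability * (1 + co-occurrence support)."""
    def score(item):
        translation, prob = item
        support = sum(
            COOCCURRENCE.get(frozenset((translation, c)), 0)
            for c in context_translations
        )
        return prob * (1 + support)

    return max(DICTIONARY[headword].items(), key=score)[0]


def translate_query(words):
    """Translate a query word by word, using the other words as context."""
    headwords = [h for h in map(select_variant, words) if h is not None]
    result = []
    for h in headwords:
        context = [t for other in headwords if other != h for t in DICTIONARY[other]]
        result.append(disambiguate(h, context))
    return result


print(translate_query(["ladkon"]))        # ['boy']
print(translate_query(["sona", "daam"]))  # ['gold', 'price']
```

In the second query the two candidate translations of "sona" have equal dictionary probability, so the choice is made entirely by how often each candidate co-occurs with the translations of the other query word.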

Notes

  1. Internet World Stats: http://www.internetworldstats.com

  2. http://www.statmt.org/moses/

  3. https://github.com/tensorflow/nmt

  4. http://www.statmt.org/wmt14/

  5. http://fire.irsi.res.in/fire/home

  6. https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-625F-0

  7. https://dumps.wikimedia.org/backup-index.html

  8. https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-6260-A

  9. https://dumps.wikimedia.org/backup-index.html

  10. github.com/rsennrich/subword-nmt

  11. https://github.com/tensorflow/nmt

Author information

Corresponding author

Correspondence to Ankit Vidyarthi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sharma, V.K., Mittal, N. & Vidyarthi, A. Semantic morphological variant selection and translation disambiguation for cross-lingual information retrieval. Multimed Tools Appl 82, 8197–8212 (2023). https://doi.org/10.1007/s11042-021-11074-w

  • DOI: https://doi.org/10.1007/s11042-021-11074-w
