Abstract
Cross-Lingual Information Retrieval (CLIR) enables a user to query in a language different from the language of the target documents. CLIR incorporates a translation technique based on either a manual dictionary or a probabilistic dictionary generated from a parallel corpus. Translation techniques for Hindi suffer from a translation mis-mapping issue, which arises from the morphological richness of the language. In addition, a word may have multiple translations in a dictionary, which leads to a word translation disambiguation issue. This paper presents two key contributions, Semantic Morphological Variant Selection (SMVS) and Hybrid Word Translation Disambiguation (HWTD): the former resolves the translation mis-mapping issue and the latter disambiguates queries more effectively. The proposed techniques are evaluated on FIRE ad-hoc datasets, where SMVS and HWTD at the word level achieve better evaluation measures than the baseline Statistical Machine Translation.
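The two issues named in the abstract can be illustrated with a small toy sketch. It is not the SMVS or HWTD method, whose details the abstract does not give: string similarity from `difflib` stands in for semantic variant selection, and invented co-occurrence counts stand in for the disambiguation evidence. The dictionary entries, probabilities, counts, and the function names `lookup`, `cooccurrence_score`, and `translate` are all assumptions made for illustration.

```python
from difflib import get_close_matches
from itertools import combinations, product

# Toy probabilistic dictionary (romanized Hindi -> English). All entries and
# probabilities are invented for illustration only.
DICTIONARY = {
    "ladka": {"boy": 0.9, "lad": 0.1},
    "nadi": {"river": 1.0},
    "kinara": {"edge": 0.5, "bank": 0.3, "shore": 0.2},
}

# Hypothetical co-occurrence counts between target-language words.
COOCCURRENCE = {
    frozenset({"river", "bank"}): 12,
    frozenset({"river", "shore"}): 5,
    frozenset({"river", "edge"}): 1,
}


def lookup(word):
    """Return candidate translations for a query word.

    If the surface form is missing (e.g. an inflected variant), fall back to
    the closest dictionary entry by string similarity. This is a stand-in for
    semantic variant selection, not the paper's technique.
    """
    if word in DICTIONARY:
        return DICTIONARY[word]
    nearest = get_close_matches(word, list(DICTIONARY), n=1, cutoff=0.6)
    # Unknown words with no close entry are passed through untranslated.
    return DICTIONARY[nearest[0]] if nearest else {word: 1.0}


def cooccurrence_score(combo):
    """Sum the pairwise co-occurrence counts of the chosen translations."""
    return sum(COOCCURRENCE.get(frozenset(pair), 0)
               for pair in combinations(combo, 2))


def translate(query):
    """Pick one translation per word, maximizing co-occurrence evidence.

    Ties are broken by the product of dictionary probabilities.
    """
    candidates = [lookup(word) for word in query]

    def key(combo):
        prob = 1.0
        for options, choice in zip(candidates, combo):
            prob *= options[choice]
        return (cooccurrence_score(combo), prob)

    return list(max(product(*candidates), key=key))


if __name__ == "__main__":
    print(translate(["ladke", "nadi", "kinara"]))
```

In this toy setup the inflected form "ladke" is absent from the dictionary and is mapped to the entry for "ladka", while "kinara" is translated as "bank" rather than its most probable entry "edge" because "bank" co-occurs more often with "river". Exhaustive enumeration of translation combinations is only practical for short queries.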
Notes
Internet World Stats: http://www.internetworldstats.com
About this article
Cite this article
Sharma, V.K., Mittal, N. & Vidyarthi, A. Semantic morphological variant selection and translation disambiguation for cross-lingual information retrieval. Multimed Tools Appl 82, 8197–8212 (2023). https://doi.org/10.1007/s11042-021-11074-w
DOI: https://doi.org/10.1007/s11042-021-11074-w