Building a text collection for Urdu information retrieval
ETRI Journal (IF 1.4), Pub Date: 2021-07-26, DOI: 10.4218/etrij.2019-0458
Imran Rasheed 1, Haider Banka 1, Hamaid M. Khan 2

Urdu is a widely spoken language in the Indian subcontinent, with over 300 million speakers worldwide. However, linguistic advancements in Urdu are rare compared with those in European and other Asian languages. Therefore, following Text Retrieval Conference (TREC) standards, we constructed an extensive text collection of 85,304 documents from diverse categories, covering over 52 topics, with relevance judgment sets built at a pool depth of 100. We also present several applications to demonstrate the effectiveness of the collection. Although the collection is primarily intended for text retrieval, with suitable modifications it can also be used for named entity recognition, text summarization, and other linguistic applications. Ours is the most extensive existing collection for the Urdu language, and it will be freely available for future research and academic education.

Updated: 2021-07-26