Experiments with Different Indexing Techniques for Text Retrieval tasks on Gujarati Language using Bag of Words Approach,arXiv - CS - Digital Libraries

arXiv - CS - Digital Libraries, Pub Date: 2020-02-05, DOI: arxiv-2002.01792
Dr. Jyoti Pareek, Hardik Joshi, Krunal Chauhan, Rushikesh Patel

This paper presents the results of experiments carried out to improve text retrieval for Gujarati text documents. Text retrieval involves searching and ranking text documents for a given set of query terms. We tested several retrieval models that use the bag-of-words approach, a traditional approach still in use today in which a text document is represented as a collection of words. Measures such as term frequency counts and inverse document frequency are used to identify and rank documents relevant to user queries. Different ranking models were used, and ranking performance was quantified with mean average precision (MAP). Because Gujarati is a morphologically rich language, we compared techniques such as stop word removal, stemming, and frequent case generation against a baseline to measure the improvement in information retrieval tasks. Most of these techniques are language dependent and require the development of language-specific tools. Using a plain unprocessed word index as the baseline, we observed significant improvements in MAP values after applying the different indexing techniques.
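The pipeline the abstract describes (a bag-of-words index, frequency and inverse-document-frequency weighting, and evaluation by mean average precision) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not name the retrieval models or tools used, so the simple TF-IDF scoring, the whitespace tokenizer, the optional stop word set, and all function names below are assumptions chosen for illustration. Stemming and frequent case generation are omitted because they need language-specific resources.

```python
import math
from collections import Counter


def tokenize(text, stop_words=frozenset()):
    """Split on whitespace and drop stop words (assumed, simplistic tokenizer)."""
    return [t for t in text.lower().split() if t not in stop_words]


def rank(query, docs, stop_words=frozenset()):
    """Return document ids sorted by a simple TF-IDF score, best first.

    docs maps a document id to its raw text. Ties are broken by id so the
    ordering is deterministic.
    """
    # Bag-of-words representation: each document becomes a term -> count map.
    bags = {d: Counter(tokenize(text, stop_words)) for d, text in docs.items()}
    n_docs = len(bags)
    scores = {}
    for doc_id, bag in bags.items():
        score = 0.0
        for term in tokenize(query, stop_words):
            # Document frequency: number of documents containing the term.
            df = sum(1 for b in bags.values() if term in b)
            if df:
                score += bag[term] * math.log(n_docs / df)
        scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))


def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Toy English data, purely illustrative; not from the paper's collection.
    docs = {
        "d1": "apple banana apple",
        "d2": "banana cherry",
        "d3": "cherry date",
    }
    runs = [
        (rank("apple", docs), {"d1"}),
        (rank("cherry", docs), {"d3"}),
    ]
    print(mean_average_precision(runs))
```

Comparing indexing techniques in this setup would amount to changing what `tokenize` emits (for example, passing a stop word set) and checking whether the MAP over the same queries and relevance judgments goes up relative to the unprocessed baseline.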


Updated: 2020-02-06
