Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes
arXiv - CS - Information Retrieval. Pub Date: 2021-01-09, DOI: arxiv-2101.03327
Alexander B. Veretennikov

Proximity full-text search is commonly implemented in contemporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document relevant if the queried words occur near each other in the document. The proximity factor is even more significant when the query consists of frequently occurring words. Proximity full-text search requires storing information about every occurrence in the documents of every word that the user can search for. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at a distance from the given word of less than or equal to the MaxDistance parameter. We showed in previous works that these indexes can reduce the average query execution time by a factor of up to 130 for queries that consist of high-frequency words. In this paper, we consider how both the search performance and the search quality depend on the value of MaxDistance and other parameters. The well-known GOV2 text collection is used in the experiments so that the results are reproducible. After analyzing the results of the experiments, we propose a new index schema. This is a pre-print of a contribution published in Supplementary Proceedings of the XXII International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2020), Voronezh, Russia, October 13-16, 2020, P. 336-350, published by CEUR Workshop Proceedings. The final authenticated version is available online at: http://ceur-ws.org/Vol-2790/
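The idea of storing nearby words for every occurrence can be illustrated with a small sketch. The code below is a toy illustration only, not the multi-component key index schema from the paper: the function name `build_pair_index`, the whitespace tokenization, and the `(doc_id, position, offset)` posting layout are all assumptions made for this example. It keys each posting list by a pair of words that occur within `max_distance` positions of each other, so a two-word proximity query becomes a single lookup.

```python
from collections import defaultdict


def build_pair_index(docs, max_distance):
    """Map (word, nearby_word) -> list of (doc_id, position, offset).

    `offset` is the signed distance from `word` to `nearby_word`;
    only pairs with |offset| <= max_distance are stored.
    Hypothetical layout, chosen for illustration.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()  # naive tokenization (assumption)
        for pos, word in enumerate(words):
            lo = max(0, pos - max_distance)
            hi = min(len(words), pos + max_distance + 1)
            for other in range(lo, hi):
                if other != pos:
                    index[(word, words[other])].append(
                        (doc_id, pos, other - pos)
                    )
    return index


docs = ["to be or not to be", "be quick to act"]
index = build_pair_index(docs, max_distance=2)

# Every place where "be" occurs within 2 positions of "not".
print(index[("not", "be")])  # [(0, 3, -2), (0, 3, 2)]
```

In this sketch a larger `max_distance` makes more word pairs answerable by direct lookup but stores more postings per occurrence, which is the kind of trade-off between index size, performance, and search quality that the parameter selection concerns.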

Updated: 2021-01-12