Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes
arXiv - CS - Information Retrieval. Pub Date: 2021-01-09. DOI: arxiv-2101.03327. Author: Alexander B. Veretennikov
Proximity full-text search is commonly implemented in contemporary full-text
search systems. Let us assume that the search query is a list of words. It is
natural to consider a document as relevant if the queried words are near each
other in the document. The proximity factor is even more significant for the
case where the query consists of frequently occurring words. Proximity
full-text search requires storing information about every occurrence in the
documents of every word that the user can search for. For every occurrence of every
word in a document, we employ additional indexes to store information about
nearby words, that is, the words that occur in the document at distances from
the given word of less than or equal to the MaxDistance parameter. We showed in
previous works that these indexes can reduce the average query
execution time by a factor of up to 130 for queries that consist of words occurring
with high frequency. In this paper, we consider how both the search performance
and the search quality depend on the value of MaxDistance and other parameters.
The well-known GOV2 text collection is used in the experiments for reproducibility
of the results. Based on an analysis of the experimental results, we propose a
new index schema.

This is a pre-print of a contribution published in Supplementary Proceedings
of the XXII International Conference on Data Analytics and Management in Data
Intensive Domains (DAMDID/RCDL 2020), Voronezh, Russia, October 13-16, 2020, pp.
336-350, published by CEUR Workshop Proceedings. The final authenticated
version is available online at: http://ceur-ws.org/Vol-2790/
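
The idea of an additional index that records, for every occurrence of a word, the words found within MaxDistance of it can be illustrated with a minimal sketch. This is not the author's index schema: it assumes a simplified key made of just two words, whitespace tokenization, and in-memory storage, and the names `build_pair_index` and `search_pair` are hypothetical.

```python
from collections import defaultdict


def build_pair_index(docs, max_distance):
    """Map (word, nearby_word) -> list of (doc_id, position, offset).

    `offset` is the signed distance from `word` to `nearby_word`,
    with 0 < |offset| <= max_distance. Simplified assumption: the key
    has two components; the paper's keys may be built differently.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for pos, word in enumerate(words):
            lo = max(0, pos - max_distance)
            hi = min(len(words), pos + max_distance + 1)
            for other_pos in range(lo, hi):
                if other_pos != pos:
                    index[(word, words[other_pos])].append(
                        (doc_id, pos, other_pos - pos)
                    )
    return index


def search_pair(index, first, second):
    """Return sorted ids of documents where both words occur near each other."""
    return sorted({doc_id for doc_id, _, _ in index.get((first, second), [])})


docs = [
    "to be or not to be that is the question",
    "the question of who you are is not to be settled far away from here today",
]
index = build_pair_index(docs, max_distance=3)
print(search_pair(index, "who", "are"))      # words 2 positions apart in doc 1
print(search_pair(index, "be", "question"))  # closest pair is 4 apart, so no match
```

A single lookup by the pair key replaces a scan of the full posting lists of both words, which is where the gain for frequently occurring words comes from. The sketch also shows the trade-off the abstract refers to: raising `max_distance` from 3 to 4 makes the first document match the query "be question", but it also adds entries to the index.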
Updated: 2021-01-12