Embedding based learning for collection selection in federated search,Data Technologies and Applications

Embedding based learning for collection selection in federated search
Data Technologies and Applications (IF 1.7), Pub Date: 2020-10-28, DOI: 10.1108/dta-01-2019-0005
Adamu Garba, Shah Khalid, Irfan Ullah, Shah Khusro, Diyawu Mumin

Purpose

Search engines face many challenges in crawling the deep web because its databases are proprietary or serve dynamic content. Distributed Information Retrieval (DIR) tries to solve these problems by providing a unified searchable interface to these databases. Since a DIR system must search across many databases, selecting the specific databases to search for a given user query is challenging. This challenge can be addressed by considering users' past queries, in combination with word embedding techniques, when selecting the collections to search. Combining the two would support a better-performing collection selection method and thereby speed up the retrieval performance of DIR solutions.

Design/methodology/approach

The authors propose a collection selection model based on word embeddings, using the Word2Vec approach, that learns the similarity between the current query and past queries. They use cosine and transformed cosine similarity models to compute the similarities among queries. The experiments are conducted on three standard TREC testbeds created for federated search.
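
The general idea of scoring collections by embedding similarity to past queries can be illustrated with a minimal sketch. It is not the authors' implementation: the toy word vectors stand in for trained Word2Vec embeddings, averaging word vectors to embed a query is an assumption, and the names `TOY_VECTORS`, `PAST_QUERIES`, `embed_query` and `rank_collections` are hypothetical. Only plain cosine similarity is shown, since the abstract does not define the transformed variant.

```python
import math
from collections import defaultdict

# Toy vectors standing in for trained Word2Vec embeddings (assumption:
# real vectors would come from a model trained on a corpus or query log).
TOY_VECTORS = {
    "cheap": [0.9, 0.1, 0.0],
    "flights": [0.8, 0.2, 0.1],
    "airline": [0.7, 0.3, 0.0],
    "tickets": [0.8, 0.1, 0.2],
    "protein": [0.0, 0.9, 0.2],
    "folding": [0.1, 0.8, 0.3],
}

# Hypothetical query log: each past query is paired with the collections
# assumed to have served it well.
PAST_QUERIES = [
    ("cheap flights", ["travel_db"]),
    ("protein folding", ["bio_db"]),
]


def embed_query(query, vectors):
    """Average the vectors of known words; return None if no word is known."""
    known = [vectors[w] for w in query.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_collections(query, past_queries, vectors):
    """Score each collection by summed similarity to the past queries it served."""
    q_vec = embed_query(query, vectors)
    if q_vec is None:
        return []
    scores = defaultdict(float)
    for past_text, collections in past_queries:
        p_vec = embed_query(past_text, vectors)
        if p_vec is None:
            continue
        sim = cosine(q_vec, p_vec)
        for collection in collections:
            scores[collection] += sim
    # Highest-scoring collections first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_collections("airline tickets", PAST_QUERIES, TOY_VECTORS)
```

In this toy setup the query "airline tickets" shares no words with "cheap flights", so a purely lexical match would find no overlap, yet the embedding similarity still ranks `travel_db` above `bio_db`.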

Findings

The results show significant improvements over the baseline models.

Originality/value

Although lexical matching models for collection selection that use similarity to past queries exist, to the best of our knowledge, the proposed work is the first to use word embeddings for collection selection by learning from past queries.
