ForestDSH: a universal hash design for discrete probability distributions
Data Mining and Knowledge Discovery (IF 4.8), Pub Date: 2021-02-11, DOI: 10.1007/s10618-020-00732-6
Arash Gholami Davoodi, Sean Chang, Hyun Gon Yoo, Anubhav Baweja, Mihir Mongia, Hosein Mohimani

In this paper, we consider the problem of classifying high-dimensional queries into high-dimensional classes over discrete alphabets, where the probabilistic model that relates data to the classes is known. This problem has applications in various fields, including the database search problem in mass spectrometry. The problem is analogous to the nearest neighbor search problem, where the goal is to find the data point in a database that is most similar to a query point. The state-of-the-art method for solving an approximate version of the nearest neighbor search problem in high dimensions is locality sensitive hashing (LSH). LSH is based on designing hash functions that map near points to the same bucket with a higher probability than random (far) points. To solve our high-dimensional classification problem, we introduce distribution sensitive hashes that map jointly generated pairs to the same bucket with a higher probability than random pairs. We design distribution sensitive hashes using a forest of decision trees, and we analytically derive the complexity of search. We further show that the proposed hashes are faster than state-of-the-art approximate nearest neighbor search methods for a range of probability distributions, in both theory and simulations. Finally, we apply our method to the spectral library search problem in mass spectrometry and show that it is an order of magnitude faster than state-of-the-art methods.
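
The LSH idea summarized in the abstract can be illustrated with a minimal sketch. Note that this is classic coordinate-sampling LSH under Hamming distance for strings over a discrete alphabet, not the ForestDSH construction itself, which the abstract does not describe in enough detail to reproduce. The function names (`make_hash`, `build_tables`, `query`) and the parameters `k` and `num_tables` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def make_hash(dim, k, rng):
    """Sample k coordinates; the hash of x is the tuple of its symbols there.

    Points that agree on most coordinates collide with higher probability
    than points that differ on most coordinates.
    """
    positions = rng.sample(range(dim), k)
    return lambda x: tuple(x[i] for i in positions)


def build_tables(database, dim, k, num_tables, seed=0):
    """Build num_tables independent hash tables over the database."""
    rng = random.Random(seed)
    hashes = [make_hash(dim, k, rng) for _ in range(num_tables)]
    tables = []
    for h in hashes:
        buckets = defaultdict(list)
        for idx, x in enumerate(database):
            buckets[h(x)].append(idx)
        tables.append(buckets)
    return hashes, tables


def query(q, database, hashes, tables):
    """Return the index of the closest candidate sharing a bucket with q.

    Only points that collide with q in at least one table are compared,
    so the result is approximate; None means no collision was found.
    """
    candidates = set()
    for h, buckets in zip(hashes, tables):
        candidates.update(buckets.get(h(q), []))
    if not candidates:
        return None

    def hamming(i):
        return sum(a != b for a, b in zip(database[i], q))

    return min(candidates, key=hamming)


if __name__ == "__main__":
    db = ["ACGTACGT", "TTTTTTTT", "GGGGCCCC"]
    hashes, tables = build_tables(db, dim=8, k=3, num_tables=4, seed=1)
    print(query("ACGTACGT", db, hashes, tables))
```

In this sketch, increasing `k` makes each bucket more selective, while increasing `num_tables` raises the chance that a true near neighbor collides with the query in at least one table. The distribution sensitive hashes described in the abstract replace the distance-based collision criterion with one derived from the known joint probability model.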




Updated: 2021-02-11