Eliminating Negative Word Similarities for Measuring Document Distances: A Thoroughly Empirical Study on Word Mover’s Distance
IEEE Transactions on Neural Networks and Learning Systems (IF 10.2), Pub Date: 2022-11-24, DOI: 10.1109/tnnls.2022.3222336
Bo Cheng 1 , Ximing Li 2 , Yi Chang 1

Document distance is a fundamental and significant research topic in the information retrieval community, and its accuracy largely determines the performance of many text retrieval applications. Beyond the Bag-of-Words (BoW) model, the Word Mover’s Distance (WMD) defines the distance between documents semantically, as the minimum cost (measured by word similarities of embeddings) required to transport the words of one document to another, and it has been shown to be superior in k-nearest neighbor classification. In this article, we thoroughly study the characteristics of WMD and its relaxed versions, e.g., Relaxed WMD (RWMD) and Iterative Constrained Transfers (ICT), in various scenarios. Specifically, we concentrate on the problem of negative word similarity: the WMD family leverages all word similarities, yet most of them are meaningless and degrade the measurement of document distances. To remedy this problem, we propose the Informative Similarity Filter (ISF), which retains a very small percentage of the top word similarities and sets all the others to the same lower similarity. Building on it, we propose a greedy optimization method (GOM) for WMD, an accurate approximation to WMD. Our theoretical analysis indicates that ISF-GOM is better suited to relatively longer documents. Extensive experiments have been conducted to validate: 1) the problem of RWMD; 2) the effectiveness of ISF-GOM; and 3) the consistency of our analysis of ISF-GOM. Our code and datasets are available at https://github.com/BoCheng-96/ISF-GOM.
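
The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that), and it does not reproduce GOM. The function names, the `keep_fraction` and `floor` parameters, the default floor value, and the choice of cost as one minus cosine similarity are all assumptions made for illustration. The relaxed distance shown is the standard RWMD-style bound in which each word moves entirely to its cheapest counterpart.

```python
import numpy as np


def cosine_similarity_matrix(X, Y):
    """Pairwise cosine similarities between the rows of X and the rows of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T


def filter_similarities(S, keep_fraction=0.1, floor=None):
    """Keep the largest `keep_fraction` of entries in S and set the rest
    to one shared lower value (hypothetical helper, not the paper's code)."""
    S = np.asarray(S, dtype=float)
    k = min(S.size, max(1, int(np.ceil(keep_fraction * S.size))))
    idx = S.size - k
    # Value of the k-th largest entry; ties at the threshold are all kept.
    threshold = np.partition(S.ravel(), idx)[idx]
    if floor is None:
        floor = S.min()  # assumption: the abstract does not state the value
    return np.where(S >= threshold, S, floor)


def relaxed_distance(S, w_x, w_y):
    """RWMD-style relaxation with cost = 1 - similarity (an assumption):
    each word sends all of its weight to its cheapest counterpart."""
    C = 1.0 - np.asarray(S, dtype=float)
    w_x = np.asarray(w_x, dtype=float)
    w_y = np.asarray(w_y, dtype=float)
    d_xy = float(w_x @ C.min(axis=1))  # document x -> document y
    d_yx = float(w_y @ C.min(axis=0))  # document y -> document x
    return max(d_xy, d_yx)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    doc_x = rng.normal(size=(5, 16))  # 5 words, toy 16-dim embeddings
    doc_y = rng.normal(size=(7, 16))  # 7 words
    S = cosine_similarity_matrix(doc_x, doc_y)
    S_filtered = filter_similarities(S, keep_fraction=0.1)
    w_x = np.full(5, 1 / 5)  # uniform word weights
    w_y = np.full(7, 1 / 7)
    print(relaxed_distance(S, w_x, w_y))
    print(relaxed_distance(S_filtered, w_x, w_y))
```

Setting all discarded entries to one constant means only the retained similarities can distinguish one transport plan from another, which is the intuition the abstract gives for filtering.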

Updated: 2024-08-26