A General Coreset-Based Approach to Diversity Maximization under Matroid Constraints
ACM Transactions on Knowledge Discovery from Data (IF 3.6), Pub Date: 2020-08-05, DOI: 10.1145/3402448
Matteo Ceccarello, Andrea Pietracaprina, Geppino Pucci

Diversity maximization is a fundamental problem in web search and data mining. Given a dataset S of n elements, the problem asks for a subset of S containing k ≪ n “representatives” that maximize some diversity function expressed in terms of pairwise distances, where distance models dissimilarity. An important variant of the problem prescribes that the solution satisfy an additional orthogonal requirement, which can be specified as a matroid constraint (i.e., a feasible solution must be an independent set of size k of a given matroid). While unconstrained diversity maximization admits efficient coreset-based strategies for several diversity functions, known approaches dealing with the additional matroid constraint apply to only one diversity function (sum of distances) and are based on an expensive, inherently sequential local search over the entire input dataset. We devise the first coreset-based algorithms for diversity maximization under matroid constraints for various diversity functions, together with efficient sequential, MapReduce, and Streaming implementations. Technically, our algorithms rely on the construction of a small coreset, that is, a subset of S containing a feasible solution which is no more than a factor 1−ɛ away from the optimal solution for S. While our algorithms are fully general, for the partition and transversal matroids, if ɛ is a constant in (0,1) and S has bounded doubling dimension, the coreset size is independent of n and is small enough to afford running a slow sequential algorithm on it to extract a final, accurate solution in reasonable time. Extensive experiments show that our algorithms are accurate, fast, and scalable, and are therefore capable of dealing with the large input instances typical of the big data scenario.
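
To make the setting in the abstract concrete, here is a minimal Python sketch of sum-of-distances diversity maximization under a partition matroid, solved by exhaustive search on a small clustering-based candidate set. It is an illustration only, not the authors' algorithm: the function names (`is_independent`, `build_coreset`, `best_feasible`), the parameter `tau` (number of clusters), and the toy data are all assumptions made for this example, and the simplified candidate set carries no 1−ɛ guarantee. The farthest-first clustering step is a standard greedy heuristic chosen here for brevity.

```python
from itertools import combinations
from math import dist


def is_independent(subset, category, caps):
    """Partition matroid test: at most caps[c] points from each category c."""
    counts = {}
    for i in subset:
        c = category[i]
        counts[c] = counts.get(c, 0) + 1
        if counts[c] > caps.get(c, 0):
            return False
    return True


def sum_diversity(points, subset):
    """Sum of pairwise Euclidean distances among the chosen points."""
    return sum(dist(points[i], points[j]) for i, j in combinations(subset, 2))


def farthest_first_centers(points, tau):
    """Greedy farthest-first traversal; returns indices of up to tau centers."""
    centers = [0]
    d = [dist(p, points[0]) for p in points]
    while len(centers) < min(tau, len(points)):
        nxt = max(range(len(points)), key=d.__getitem__)
        centers.append(nxt)
        d = [min(d[i], dist(points[i], points[nxt])) for i in range(len(points))]
    return centers


def build_coreset(points, category, caps, k, tau):
    """Assumed, simplified construction: cluster the points, then keep from
    each cluster a greedily grown independent set of at most k points."""
    centers = farthest_first_centers(points, tau)
    clusters = {c: [] for c in centers}
    for i, p in enumerate(points):
        nearest = min(centers, key=lambda c: dist(p, points[c]))
        clusters[nearest].append(i)
    coreset = []
    for members in clusters.values():
        kept = []
        for i in members:
            if len(kept) == k:
                break
            if is_independent(kept + [i], category, caps):
                kept.append(i)
        coreset.extend(kept)
    return sorted(coreset)


def best_feasible(points, candidates, category, caps, k):
    """Slow exhaustive search, affordable only on a small candidate set."""
    best, best_val = None, -1.0
    for subset in combinations(candidates, k):
        if is_independent(subset, category, caps):
            val = sum_diversity(points, subset)
            if val > best_val:
                best, best_val = subset, val
    return best, best_val


# Toy instance (hypothetical data): six points on a line, two categories,
# at most one representative per category, k = 2.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (30, 0), (31, 0)]
category = ["a", "a", "b", "b", "a", "b"]
caps = {"a": 1, "b": 1}

coreset = build_coreset(points, category, caps, k=2, tau=3)
solution, value = best_feasible(points, coreset, category, caps, k=2)
print(coreset, solution, value)
```

On this toy instance the exhaustive search over the candidate set finds the same solution as exhaustive search over all six points; in general, that agreement depends on how the candidate set is built, which is where the paper's coreset analysis comes in.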

Last updated: 2020-08-05