Abstract
Diversity maximization is a fundamental problem in web search and data mining. Given a dataset S of n elements, the problem asks for a subset of S containing k≪n “representatives” that maximize some diversity function expressed in terms of pairwise distances, where distance models dissimilarity. An important variant of the problem prescribes that the solution satisfy an additional orthogonal requirement, which can be specified as a matroid constraint (i.e., a feasible solution must be an independent set of size k of a given matroid). While unconstrained diversity maximization admits efficient coreset-based strategies for several diversity functions, known approaches that handle the additional matroid constraint apply to only one diversity function (sum of distances) and are based on an expensive, inherently sequential local search over the entire input dataset. We devise the first coreset-based algorithms for diversity maximization under matroid constraints for various diversity functions, together with efficient sequential, MapReduce, and Streaming implementations. Technically, our algorithms rely on the construction of a small coreset, that is, a subset of S containing a feasible solution whose value is within a factor 1−ɛ of that of the optimal solution for S. While our algorithms are fully general, for the partition and transversal matroids, if ɛ is a constant in (0,1) and S has bounded doubling dimension, the coreset size is independent of n and is small enough to afford running a slow sequential algorithm that extracts a final, accurate solution in reasonable time. Extensive experiments show that our algorithms are accurate, fast, and scalable, and are therefore capable of dealing with the large input instances typical of the big data scenario.
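To make the problem statement concrete, the following minimal Python sketch evaluates the sum-of-distances diversity function under a partition matroid on a toy dataset, first over the whole input and then over a reduced candidate set. Everything in it is an illustrative assumption rather than the paper's method: the function names, the data, the category caps, and in particular the reduction step, which keeps a few farthest-first points per category as a simple stand-in for a coreset and carries no 1−ɛ guarantee.

```python
from itertools import combinations
from math import dist


def sum_diversity(points):
    """Sum of pairwise Euclidean distances among the given points."""
    return sum(dist(p, q) for p, q in combinations(points, 2))


def is_independent(subset, category, caps):
    """Partition matroid test: at most caps[c] chosen elements per category c."""
    counts = {}
    for i in subset:
        c = category[i]
        counts[c] = counts.get(c, 0) + 1
        if counts[c] > caps[c]:
            return False
    return True


def best_feasible(points, category, caps, k, candidates=None):
    """Exhaustive search over feasible k-subsets of the candidate indices."""
    if candidates is None:
        candidates = range(len(points))
    best, best_val = None, -1.0
    for subset in combinations(candidates, k):
        if is_independent(subset, category, caps):
            val = sum_diversity([points[i] for i in subset])
            if val > best_val:
                best, best_val = subset, val
    return best, best_val


def farthest_first(points, indices, m):
    """Greedy farthest-first traversal selecting up to m of the given indices."""
    chosen = [indices[0]]
    while len(chosen) < min(m, len(indices)):
        nxt = max(
            (i for i in indices if i not in chosen),
            key=lambda i: min(dist(points[i], points[j]) for j in chosen),
        )
        chosen.append(nxt)
    return chosen


# Hypothetical toy instance: 8 points in the plane, two categories.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (11, 10), (5, 5), (0, 10)]
category = ["a", "a", "a", "b", "b", "b", "a", "b"]
caps = {"a": 2, "b": 2}
k = 3

# Optimum over the full input (only viable for tiny n).
opt_set, opt_val = best_feasible(points, category, caps, k)

# Heuristic reduction (an assumption, not the paper's construction):
# keep 2 farthest-first points from each category, then search only those.
reduced = []
for c in caps:
    members = [i for i in range(len(points)) if category[i] == c]
    reduced.extend(farthest_first(points, members, 2))
red_set, red_val = best_feasible(points, category, caps, k, candidates=reduced)

print("full input:", opt_set, round(opt_val, 3))
print("reduced set:", red_set, round(red_val, 3))
```

Because the reduced candidate set is a subset of the input, its best feasible value can never exceed the full optimum; how close it gets depends on how the candidates are chosen, which is exactly what a coreset construction with a provable guarantee has to control.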
Index Terms
- A General Coreset-Based Approach to Diversity Maximization under Matroid Constraints