research-article

A General Coreset-Based Approach to Diversity Maximization under Matroid Constraints

Published: 05 August 2020

Abstract

Diversity maximization is a fundamental problem in web search and data mining. For a given dataset S of n elements, the problem asks for a subset of S containing k ≪ n “representatives” that maximize some diversity function expressed in terms of pairwise distances, where distance models dissimilarity. An important variant of the problem prescribes that the solution satisfy an additional orthogonal requirement, which can be specified as a matroid constraint (i.e., a feasible solution must be an independent set of size k of a given matroid). While unconstrained diversity maximization admits efficient coreset-based strategies for several diversity functions, known approaches dealing with the additional matroid constraint apply only to one diversity function (sum of distances), and are based on an expensive, inherently sequential, local search over the entire input dataset. We devise the first coreset-based algorithms for diversity maximization under matroid constraints for various diversity functions, together with efficient sequential, MapReduce, and Streaming implementations. Technically, our algorithms rely on the construction of a small coreset, that is, a subset of S containing a feasible solution which is no more than a factor 1−ɛ away from the optimal solution for S. While our algorithms are fully general, for the partition and transversal matroids, if ɛ is a constant in (0,1) and S has bounded doubling dimension, the coreset size is independent of n and is small enough to afford the execution of a slow sequential algorithm to extract a final, accurate solution in reasonable time. Extensive experiments show that our algorithms are accurate, fast, and scalable, and are therefore capable of dealing with the large input instances typical of the big data scenario.
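To make the problem statement concrete, the following is a minimal Python sketch of sum-of-distances diversity maximization under a partition matroid. It is an illustration only, not the coreset-based algorithms of the article: it uses exhaustive search, which is feasible only on tiny inputs. The function names, the Euclidean metric, the toy points, and the category caps are all assumptions introduced for this example.

```python
from itertools import combinations
from math import dist


def sum_diversity(points):
    """Sum of pairwise Euclidean distances among the given points."""
    return sum(dist(p, q) for p, q in combinations(points, 2))


def is_independent(subset, category, caps):
    """Partition-matroid test: at most caps[c] chosen elements per category c."""
    counts = {}
    for i in subset:
        c = category[i]
        counts[c] = counts.get(c, 0) + 1
        if counts[c] > caps.get(c, 0):
            return False
    return True


def best_diverse_subset(points, category, caps, k):
    """Exhaustive search over all independent sets of size k (toy inputs only)."""
    best, best_val = None, float("-inf")
    for subset in combinations(range(len(points)), k):
        if not is_independent(subset, category, caps):
            continue
        val = sum_diversity([points[i] for i in subset])
        if val > best_val:
            best, best_val = subset, val
    return best, best_val


# Hypothetical toy data: five points on a line, each tagged with a category.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
category = ["a", "b", "b", "a", "a"]
caps = {"a": 1, "b": 1}  # at most one representative per category

solution, value = best_diverse_subset(points, category, caps, k=2)
print(solution, value)
```

In this toy instance the two farthest points (indices 0 and 4) share category "a", so the matroid constraint rules them out and the search returns the best feasible pair instead. A coreset-based approach would run a search of this kind on a small, carefully chosen subset of the input rather than on all of S.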

        • Published in

          ACM Transactions on Knowledge Discovery from Data, Volume 14, Issue 5
          Special Issue on KDD 2018, Regular Papers and Survey Paper
          October 2020
          376 pages
          ISSN: 1556-4681
          EISSN: 1556-472X
          DOI: 10.1145/3407672

          Copyright © 2020 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 5 August 2020
          • Accepted: 1 May 2020
          • Received: 1 December 2019


          Qualifiers

          • research-article
          • Research
          • Refereed
