ProSecCo: progressive sequence mining with convergence guarantees

Servan-Schreiber, Sacha; Riondato, Matteo; Zgraggen, Emanuel

doi:10.1007/s10115-019-01393-8


Regular Paper
Published: 20 August 2019

Volume 62, pages 1313–1340 (2020)

Knowledge and Information Systems

Sacha Servan-Schreiber¹,
Matteo Riondato² (ORCID: orcid.org/0000-0003-2523-4420) &
Emanuel Zgraggen¹


“Here growes the wine Pucinum, now called Prosecho, much celebrated by Pliny.”

–Fynes Moryson, An Itinerary, 1617

Abstract

We present ProSecCo, an algorithm for the progressive mining of frequent sequences from large transactional datasets. It processes the dataset in blocks and, after analyzing each block, outputs a high-quality approximation of the collection of frequent sequences. ProSecCo can be used for interactive data exploration, because the intermediate results let the user make informed decisions as the computation proceeds. These intermediate results have strong probabilistic approximation guarantees, and the final output is the exact collection of frequent sequences. Our correctness analysis uses the Vapnik–Chervonenkis (VC) dimension, a key concept from statistical learning theory. Our experimental evaluation on real and artificial datasets shows that ProSecCo produces high-quality, fast-converging results almost immediately. Its practical performance is even better than the theoretical analysis guarantees, and ProSecCo can even be faster than existing state-of-the-art non-progressive algorithms. The experiments also show that ProSecCo uses a constant amount of memory, orders of magnitude less than other standard, non-progressive sequential pattern mining algorithms.
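To make the block-by-block idea concrete, the following is a minimal sketch of a progressive miner. It is not the authors' implementation. It assumes sequences of single items rather than itemsets, enumerates subsequences by brute force up to a hypothetical `max_len`, and uses a placeholder Hoeffding-style slack `epsilon` where the paper uses a bound based on the VC dimension. The names `subsequences`, `progressive_frequent_sequences`, `theta`, `block_size` and `delta` are illustrative assumptions. After each block the sketch reports every sequence whose frequency in the data seen so far is at least `theta - epsilon`. Once all transactions have been read the slack is set to zero, so the last result is the exact collection of frequent sequences (up to `max_len`).

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def subsequences(transaction, max_len):
    """Return the distinct order-preserving subsequences of length 1..max_len."""
    found = set()
    for k in range(1, max_len + 1):
        # Index tuples from combinations() are increasing, so order is preserved.
        for idx in combinations(range(len(transaction)), k):
            found.add(tuple(transaction[i] for i in idx))
    return found


def progressive_frequent_sequences(dataset, theta, block_size, max_len=3, delta=0.1):
    """Yield (transactions_seen, epsilon, result) after each block.

    result maps each reported sequence to its frequency in the data seen so far.
    """
    counts = Counter()
    total = len(dataset)
    seen = 0
    for start in range(0, total, block_size):
        block = dataset[start:start + block_size]  # the last block may be shorter
        for transaction in block:
            # Each transaction contributes at most 1 to the count of a sequence.
            counts.update(subsequences(transaction, max_len))
        seen += len(block)
        # ASSUMPTION: placeholder slack, not the VC-dimension bound of the paper.
        epsilon = 0.0 if seen == total else sqrt(log(2 / delta) / (2 * seen))
        result = {s: c / seen for s, c in counts.items() if c / seen >= theta - epsilon}
        yield seen, epsilon, result


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"], ["a", "b"], ["c", "a"]]
    for seen, eps, result in progressive_frequent_sequences(data, theta=0.6, block_size=2):
        print(seen, round(eps, 3), sorted(result))
```

Writing the miner as a generator lets a caller inspect each intermediate result and stop early, which matches the interactive use described in the abstract. Unlike ProSecCo, this sketch keeps a count for every subsequence it has seen, so its memory use is not constant.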


Notes

The last block may have fewer than \(b\) transactions.
Some additional care is needed when handling the initial block. See Sect. 4.4.
The last block may contain fewer than \(b\) transactions. For ease of presentation, we assume that all blocks have size \(b\).
I.e., the \(i\)th intermediate result.

References

Acharya S, Gibbons PB, Poosala V, Ramaswamy S (1999) The AQUA approximate query answering system. In: Proceedings of the 1999 ACM SIGMOD international conference on management of data, ACM, New York, SIGMOD ’99, pp 574–576. https://doi.org/10.1145/304182.304581
Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In: Proceedings of the 8th ACM European conference on computer systems, ACM, pp 29–42
Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the eleventh international conference on data engineering, IEEE, ICDE’95, pp 3–14
Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. SIGMOD Rec 22:207–216. https://doi.org/10.1145/170036.170072
Ayres J, Flannick J, Gehrke J, Yiu T (2002) Sequential PAttern mining using a bitmap representation. In: Proceedings of 8th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, KDD’02. https://doi.org/10.1145/775047.775109
Condie T, Conway N, Alvaro P, Hellerstein JM, Elmeleegy K, Sears R (2010) MapReduce online. In: NSDI, pp 313–328
Crotty A, Galakatos A, Zgraggen E, Binnig C, Kraska T (2015) Vizdom: interactive analytics through pen and touch. Proc VLDB Endow 8(12):2024–2027
Egho E, Raïssi C, Calders T, Jay N, Napoli A (2015) On measuring similarity for sequences of itemsets. Data Mining Knowl Discov 29(3):732–764. https://doi.org/10.1007/s10618-014-0362-1
Fournier-Viger P, Lin C, Gomariz A, Gueniche T, Soltani A, Deng Z, Lam HT (2016) The SPMF open-source data mining library version 2. In: Proceedings of 19th European conference on machine learning and principles and practice of knowledge discovery and data mining (Part III), ECML PKDD’16. http://www.philippe-fournier-viger.com/spmf/
Hellerstein JM, Haas PJ, Wang HJ (1997) Online aggregation. In: Proceedings of the 1997 ACM SIGMOD international conference on management of data. ACM, New York, SIGMOD’97, pp 171–182. https://doi.org/10.1145/253260.253291
Hellerstein JM, Avnur R, Chou A, Hidber C, Olston C, Raman V, Roth T, Haas PJ (1999) Interactive data analysis: the control project. Computer 32(8):51–59
Jermaine C, Arumugam S, Pol A, Dobra A (2008) Scalable approximate query processing with the DBO engine. ACM Trans Database Syst 33:23:1–23:54. https://doi.org/10.1145/1412331.1412335
Kamat N, Jayachandran P, Tunga K, Nandi A (2014) Distributed and interactive cube exploration. In: 30th IEEE international conference on data engineering, IEEE, ICDE’14, pp 472–483
Klösgen W (1992) Problems for knowledge discovery in databases and their treatment in the statistics interpreter explora. Int J Intell Syst 7:649–673
Li Y, Long PM, Srinivasan A (2001) Improved bounds on the sample complexity of learning. J Comput Syst Sci 62(3):516–527
Liu Z, Heer J (2014) The effects of interactive latency on exploratory visual analysis. IEEE Trans Vis Comput Graph 20(12):2122–2131
Mendes LF, Ding B, Han J (2008) Stream sequential pattern mining with precise error bounds. In: Eighth IEEE international conference on data mining, IEEE, ICDM’08, pp 941–946
Mitzenmacher M, Upfal E (2005) Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, Cambridge
Olken F (1993) Random sampling from databases. Ph.D. thesis, University of California, Berkeley
Pansare N, Borkar VR, Jermaine C, Condie T (2011) Online aggregation for large MapReduce jobs. Proc VLDB Endow 4(11):1135–1145
Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu MC (2004) Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans Knowl Data Eng 16(11):1424–1440
Pollard D (1984) Convergence of Stochastic Processes. Springer, Berlin
Raïssi C, Poncelet P (2007) Sampling for sequential pattern mining: from static databases to data streams. In: Seventh IEEE international conference on data mining, IEEE, ICDM’07, pp 631–636
Riondato M, Upfal E (2014) Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. ACM Trans Knowl Discov Data 8(4):20. https://doi.org/10.1145/2629586
Riondato M, Upfal E (2015) Mining frequent itemsets through progressive sampling with Rademacher averages. In: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining, ACM, KDD ’15, pp 1005–1014
Riondato M, Vandin F (2014) Finding the true frequent itemsets. In: Zaki MJ, Obradovic Z, Tan P, Banerjee A, Kamath C, Parthasarathy S (eds) Proceedings of the 2014 SIAM international conference on data mining, Philadelphia, Pennsylvania, USA, April 24–26, 2014, SIAM, pp 497–505. https://doi.org/10.1137/1.9781611973440.57
Riondato M, Vandin F (2018) MiSoSouP: mining interesting subgroups with sampling and pseudodimension. In: Proceedings of 24th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, KDD’18, pp 2130–2139
Servan-Schreiber S, Riondato M, Zgraggen E (2018) ProSecCo: progressive sequence mining with convergence guarantees. In: Proceedings of the 18th IEEE international conference on data mining, pp 417–426
Shalev-Shwartz S, Ben-David S (2014) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge
Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: International conference on extending database technology. Springer, ICDT’96, pp 1–17
Toivonen H (1996) Sampling large databases for association rules. In: Proceedings of 22nd international conference very large data bases. Morgan Kaufmann Publishers Inc., San Francisco, VLDB’96, pp 134–145
Vapnik VN (1998) Statistical Learning Theory. Wiley, New York
Vapnik VN, Chervonenkis AJ (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob Appl 16(2):264–280. https://doi.org/10.1137/1116025
Wang J, Han J, Li C (2007) Frequent closed sequence mining without candidate maintenance. IEEE Trans Knowl Data Eng 19(8):1042–1056
Zeng K, Agarwal S, Dave A, Armbrust M, Stoica I (2015) G-OLA: generalized on-line aggregation for interactive analysis on big data. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, ACM, pp 913–918
Zeng K, Agarwal S, Stoica I (2016) IOLAP: managing uncertainty for efficient incremental OLAP. In: Proceedings of the 2016 international conference on management of data. ACM, SIGMOD’16, pp 1347–1361
Zgraggen E, Galakatos A, Crotty A, Fekete JD, Kraska T (2017) How progressive visualizations affect exploratory analysis. IEEE Trans Vis Comput Graph 23(8):1977–1987


Author information

Authors and Affiliations

MIT CSAIL, Cambridge, MA, USA
Sacha Servan-Schreiber & Emanuel Zgraggen
Department of Computer Science, Amherst College, Amherst, MA, USA
Matteo Riondato

Corresponding author

Correspondence to Matteo Riondato.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this work appeared in the proceedings of IEEE ICDM’18 [28], where it was deemed the runner-up for the Best Student Paper Award.

About this article

Cite this article

Servan-Schreiber, S., Riondato, M. & Zgraggen, E. ProSecCo: progressive sequence mining with convergence guarantees. Knowl Inf Syst 62, 1313–1340 (2020). https://doi.org/10.1007/s10115-019-01393-8

Received: 03 January 2019
Revised: 23 July 2019
Accepted: 05 August 2019
Published: 20 August 2019
Issue Date: April 2020
DOI: https://doi.org/10.1007/s10115-019-01393-8
