ProSecCo: progressive sequence mining with convergence guarantees

  • Regular Paper
  • Knowledge and Information Systems

“Here growes the wine Pucinum, now called Prosecho, much celebrated by Pliny.”

–Fynes Moryson, An Itinerary, 1617

Abstract

We present ProSecCo, an algorithm for the progressive mining of frequent sequences from large transactional datasets: it processes the dataset in blocks and, after analyzing each block, outputs a high-quality approximation of the collection of frequent sequences. ProSecCo can be used for interactive data exploration, as the intermediate results enable the user to make informed decisions while the computation proceeds. These intermediate results have strong probabilistic approximation guarantees, and the final output is the exact collection of frequent sequences. Our correctness analysis uses the Vapnik–Chervonenkis (VC) dimension, a key concept from statistical learning theory. Our experimental evaluation of ProSecCo on real and artificial datasets shows that it produces high-quality results almost immediately and that these results converge quickly. Its practical performance is even better than what the theoretical analysis guarantees, and ProSecCo can even be faster than existing state-of-the-art non-progressive algorithms. Additionally, our experimental results show that ProSecCo uses a constant amount of memory, orders of magnitude less than other standard, non-progressive sequential pattern mining algorithms.
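The block-by-block behaviour described in the abstract can be illustrated with a toy sketch. It is not the ProSecCo algorithm: it assumes sequences of single items rather than itemsets, it keeps a count for every subsequence up to a small length (so it does not use constant memory), and its error allowance `eps` is a hypothetical placeholder instead of the VC-dimension bound from the paper. The names `progressive_mine`, `subsequences`, `blocks`, `max_len`, and `slack` are all illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def subsequences(seq, max_len):
    """Distinct order-preserving subsequences of `seq` with 1..max_len items."""
    out = set()
    for k in range(1, max_len + 1):
        for idx in combinations(range(len(seq)), k):
            out.add(tuple(seq[i] for i in idx))
    return out


def blocks(dataset, b):
    """Yield consecutive blocks of b transactions; the last may be shorter."""
    for start in range(0, len(dataset), b):
        yield dataset[start:start + b]


def progressive_mine(dataset, theta, b, max_len=3, slack=0.5):
    """Yield (transactions_seen, eps, result) after each block.

    `slack` and the eps schedule are placeholders (an assumption of this
    sketch), NOT the probabilistic bound derived in the paper.
    """
    counts = Counter()
    seen = 0
    n = len(dataset)
    for block in blocks(dataset, b):
        for seq in block:
            # Each transaction contributes at most 1 to a subsequence's count.
            counts.update(subsequences(seq, max_len))
        seen += len(block)
        # Allowance shrinks as more data is seen and is 0 on the last block,
        # so the final result uses the exact threshold theta.
        eps = 0.0 if seen == n else slack / math.sqrt(seen)
        result = {s: c / seen for s, c in counts.items()
                  if c / seen >= theta - eps}
        yield seen, eps, result


if __name__ == "__main__":
    data = [("a", "b", "c"), ("a", "c"), ("b", "c"), ("a", "b"), ("c",)]
    for seen, eps, result in progressive_mine(data, theta=0.6, b=2):
        print(seen, round(eps, 3), sorted(result))
```

Each intermediate result uses a lowered threshold so that sequences near the frequency threshold are not dropped too early; the final iteration, having seen the whole dataset, reports exact frequencies against `theta` itself.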

Notes

  1. The last block may have fewer than \(b\) transactions.

  2. Some additional care is needed when handling the initial block. See Sect. 4.4.

  3. The last block may contain fewer than \(b\) transactions. For ease of presentation, we assume that all blocks have size \(b\).

  4. I.e., the \(i\)th intermediate result.

References

  1. Acharya S, Gibbons PB, Poosala V, Ramaswamy S (1999) The AQUA approximate query answering system. In: Proceedings of the 1999 ACM SIGMOD international conference on management of data, ACM, New York, SIGMOD ’99, pp 574–576. https://doi.org/10.1145/304182.304581

  2. Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In: Proceedings of the 8th ACM European conference on computer systems, ACM, pp 29–42

  3. Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the eleventh international conference on data engineering, IEEE, ICDE’95, pp 3–14

  4. Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. SIGMOD Rec 22:207–216. https://doi.org/10.1145/170036.170072

  5. Ayres J, Flannick J, Gehrke J, Yiu T (2002) Sequential PAttern mining using a bitmap representation. In: Proceedings of 8th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, KDD’02. https://doi.org/10.1145/775047.775109

  6. Condie T, Conway N, Alvaro P, Hellerstein JM, Elmeleegy K, Sears R (2010) MapReduce online. In: NSDI, pp 313–328

  7. Crotty A, Galakatos A, Zgraggen E, Binnig C, Kraska T (2015) Vizdom: interactive analytics through pen and touch. Proc VLDB Endow 8(12):2024–2027

  8. Egho E, Raïssi C, Calders T, Jay N, Napoli A (2015) On measuring similarity for sequences of itemsets. Data Mining Knowl Discov 29(3):732–764. https://doi.org/10.1007/s10618-014-0362-1

  9. Fournier-Viger P, Lin C, Gomariz A, Gueniche T, Soltani A, Deng Z, Lam HT (2016) The SPMF open-source data mining library version 2. In: Proceedings of 19th European conference on machine learning and principles and practice of knowledge discovery and data mining (Part III), ECML PKDD’16. http://www.philippe-fournier-viger.com/spmf/

  10. Hellerstein JM, Haas PJ, Wang HJ (1997) Online aggregation. In: Proceedings of the 1997 ACM SIGMOD international conference on management of data. ACM, New York, SIGMOD’97, pp 171–182. https://doi.org/10.1145/253260.253291

  11. Hellerstein JM, Avnur R, Chou A, Hidber C, Olston C, Raman V, Roth T, Haas PJ (1999) Interactive data analysis: the control project. Computer 32(8):51–59

  12. Jermaine C, Arumugam S, Pol A, Dobra A (2008) Scalable approximate query processing with the DBO engine. ACM Trans Database Syst 33:23:1–23:54. https://doi.org/10.1145/1412331.1412335

  13. Kamat N, Jayachandran P, Tunga K, Nandi A (2014) Distributed and interactive cube exploration. In: 30th IEEE international conference on data engineering, IEEE, ICDE’14, pp 472–483

  14. Klösgen W (1992) Problems for knowledge discovery in databases and their treatment in the statistics interpreter explora. Int J Intell Syst 7:649–673

  15. Li Y, Long PM, Srinivasan A (2001) Improved bounds on the sample complexity of learning. J Comput Syst Sci 62(3):516–527

  16. Liu Z, Heer J (2014) The effects of interactive latency on exploratory visual analysis. IEEE Trans Vis Comput Graph 20(12):2122–2131

  17. Mendes LF, Ding B, Han J (2008) Stream sequential pattern mining with precise error bounds. In: Eighth IEEE international conference on data mining, IEEE, ICDM’08, pp 941–946

  18. Mitzenmacher M, Upfal E (2005) Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, Cambridge

  19. Olken F (1993) Random sampling from databases. Ph.D. thesis, University of California, Berkeley

  20. Pansare N, Borkar VR, Jermaine C, Condie T (2011) Online aggregation for large MapReduce jobs. Proc VLDB Endow 4(11):1135–1145

  21. Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu MC (2004) Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans Knowl Data Eng 16(11):1424–1440

  22. Pollard D (1984) Convergence of Stochastic Processes. Springer, Berlin

  23. Raïssi C, Poncelet P (2007) Sampling for sequential pattern mining: from static databases to data streams. In: Seventh IEEE international conference on data mining, IEEE, ICDM’07, pp 631–636

  24. Riondato M, Upfal E (2014) Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. ACM Trans Knowl Discov Data 8(4):20. https://doi.org/10.1145/2629586

  25. Riondato M, Upfal E (2015) Mining frequent itemsets through progressive sampling with Rademacher averages. In: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining, ACM, KDD ’15, pp 1005–1014

  26. Riondato M, Vandin F (2014) Finding the true frequent itemsets. In: Zaki MJ, Obradovic Z, Tan P, Banerjee A, Kamath C, Parthasarathy S (eds) Proceedings of the 2014 SIAM international conference on data mining, Philadelphia, Pennsylvania, USA, April 24–26, 2014, SIAM, pp 497–505. https://doi.org/10.1137/1.9781611973440.57

  27. Riondato M, Vandin F (2018) MiSoSouP: mining interesting subgroups with sampling and pseudodimension. In: Proceedings of 24th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, KDD’18, pp 2130–2139

  28. Servan-Schreiber S, Riondato M, Zgraggen E (2018) ProSecCo: progressive sequence mining with convergence guarantees. In: Proceedings of the 18th IEEE international conference on data mining, pp 417–426

  29. Shalev-Shwartz S, Ben-David S (2014) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge

  30. Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: International conference on extending database technology. Springer, ICDT’96, pp 1–17

  31. Toivonen H (1996) Sampling large databases for association rules. In: Proceedings of 22nd international conference very large data bases. Morgan Kaufmann Publishers Inc., San Francisco, VLDB’96, pp 134–145

  32. Vapnik VN (1998) Statistical Learning Theory. Wiley, New York

  33. Vapnik VN, Chervonenkis AJ (1971) On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob Appl 16(2):264–280. https://doi.org/10.1137/1116025

  34. Wang J, Han J, Li C (2007) Frequent closed sequence mining without candidate maintenance. IEEE Trans Knowl Data Eng 19(8):1042–1056

  35. Zeng K, Agarwal S, Dave A, Armbrust M, Stoica I (2015) G-OLA: generalized on-line aggregation for interactive analysis on big data. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data, ACM, pp 913–918

  36. Zeng K, Agarwal S, Stoica I (2016) IOLAP: managing uncertainty for efficient incremental OLAP. In: Proceedings of the 2016 international conference on management of data. ACM, SIGMOD’16, pp 1347–1361

  37. Zgraggen E, Galakatos A, Crotty A, Fekete JD, Kraska T (2017) How progressive visualizations affect exploratory analysis. IEEE Trans Vis Comput Graph 23(8):1977–1987

Author information

Corresponding author

Correspondence to Matteo Riondato.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this work appeared in the proceedings of IEEE ICDM’18 [28], where it was deemed the runner-up for the Best Student Paper Award.

About this article

Cite this article

Servan-Schreiber, S., Riondato, M. & Zgraggen, E. ProSecCo: progressive sequence mining with convergence guarantees. Knowl Inf Syst 62, 1313–1340 (2020). https://doi.org/10.1007/s10115-019-01393-8

  • DOI: https://doi.org/10.1007/s10115-019-01393-8
