
Optimizing Data Stream Representation: An Extensive Survey on Stream Clustering Algorithms

  • State of the Art
Business & Information Systems Engineering

Abstract

Analyzing data streams has received considerable attention over the past decades due to the widespread usage of sensors, social media and other streaming data sources. A core research area in this field is stream clustering, which aims to recognize patterns in an unordered, infinite and evolving stream of observations. Clustering can be a crucial support in decision making, since it aims for an optimized aggregated representation of a continuous data stream over time and makes it possible to identify patterns in large and high-dimensional data. A multitude of algorithms and approaches have been developed that are able to find and maintain clusters over time in the challenging streaming scenario. This survey explores, summarizes and categorizes a total of 51 stream clustering algorithms and identifies core research threads over the past decades. In particular, it identifies categories of algorithms based on distance thresholds, density grids and statistical models as well as algorithms for high-dimensional data. Furthermore, it discusses application scenarios, available software and how to configure stream clustering algorithms. This survey is considerably more extensive and more up-to-date than comparable studies, and it highlights how concepts are interrelated and have developed over time.
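
To make the distance-threshold category named in the abstract more concrete, the following is a minimal single-pass sketch in Python. It is not one of the surveyed algorithms: the class names, the `radius`, `decay` and `min_weight` parameters, and the fading rule are all hypothetical choices made for illustration. Each observation is absorbed by the nearest micro-cluster if it lies within the threshold; otherwise it starts a new micro-cluster.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MicroCluster:
    """Running summary of nearby observations (hypothetical helper)."""
    center: list
    weight: float = 1.0

    def absorb(self, point):
        # Incremental mean update: shift the center toward the new point.
        self.weight += 1.0
        self.center = [c + (p - c) / self.weight
                       for c, p in zip(self.center, point)]


@dataclass
class ThresholdStreamClusterer:
    """Illustrative single-pass clusterer based on a distance threshold."""
    radius: float              # distance threshold (assumed parameter)
    decay: float = 1.0         # weight multiplier per observation; 1.0 = no fading
    min_weight: float = 0.1    # micro-clusters below this weight are dropped
    clusters: list = field(default_factory=list)

    def update(self, point):
        # Fade old micro-clusters so outdated structure can disappear.
        for mc in self.clusters:
            mc.weight *= self.decay
        self.clusters = [mc for mc in self.clusters
                         if mc.weight >= self.min_weight]

        # Find the nearest micro-cluster, if any exists yet.
        nearest = min(self.clusters,
                      key=lambda mc: math.dist(mc.center, point),
                      default=None)
        if nearest is not None and math.dist(nearest.center, point) <= self.radius:
            nearest.absorb(point)
        else:
            self.clusters.append(MicroCluster(center=list(point)))


# Usage: two well-separated groups arriving in interleaved order.
clusterer = ThresholdStreamClusterer(radius=1.0)
for obs in [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.0)]:
    clusterer.update(obs)
print(len(clusterer.clusters))
```

Setting `decay` below 1.0 lets micro-clusters that stop receiving observations fade out, which is one simple way to follow an evolving stream. The threshold and decay values here are arbitrary and would need tuning for real data.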


Notes

  1. http://www.matthias-carnein.de/streamClustering.

  2. http://www.matthias-carnein.de/streamClustering.


Acknowledgements

The authors would like to thank Karsten Kraume and the ERCIS Omni-Channel Lab – powered by Arvato (https://omni-channel.ercis.org/) for their support.

Author information

Corresponding author

Correspondence to Matthias Carnein.

Additional information

Accepted after one revision by the editors of the special issue.

About this article

Cite this article

Carnein, M., Trautmann, H. Optimizing Data Stream Representation: An Extensive Survey on Stream Clustering Algorithms. Bus Inf Syst Eng 61, 277–297 (2019). https://doi.org/10.1007/s12599-019-00576-5

  • DOI: https://doi.org/10.1007/s12599-019-00576-5
