Abstract
In this article, we focus on semi-supervised classification in non-stationary environments. Semi-supervised learning is the task of learning from both labeled and unlabeled data points. Several approaches to semi-supervised learning exist for stationary environments, but they are not directly applicable to data streams. We propose a novel semi-supervised learning algorithm, named STDS. The proposed approach uses labeled and unlabeled data and includes a mechanism to handle concept drift in data streams. The main challenge in semi-supervised self-training for data streams is to find a proper selection metric for identifying a set of high-confidence predictions, together with a proper underlying base learner. We therefore propose an ensemble approach that identifies high-confidence predictions based on clustering algorithms and classifier predictions. We then use the Kullback-Leibler (KL) divergence to measure the distribution difference between sequential chunks in order to detect concept drift. When drift is detected, a new classifier is built from the new set of labeled data in the current chunk; otherwise, a percentage of the high-confidence newly labeled data in the current chunk, chosen by the proposed selection metric, is added to the labeled data in the next chunk to update the incremental classifier. The results of our experiments on a number of classification benchmark datasets show that STDS outperforms supervised learning and most of the other semi-supervised learning methods.
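The chunk-to-chunk drift check described above can be illustrated with a minimal sketch. This is not the STDS implementation: it assumes a single numeric feature, equal-width histograms over a shared range, Laplace smoothing, the direction KL(current || previous), and a fixed threshold. The function names, `n_bins`, and `threshold` are hypothetical choices made only for illustration.

```python
import math


def histogram(values, lo, width, n_bins):
    """Return smoothed bin probabilities for values on a shared equal-width grid."""
    counts = [0] * n_bins
    for v in values:
        # Map the value to a bin; the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    # Laplace smoothing (assumption): keeps every probability strictly positive.
    total = len(values) + n_bins
    return [(c + 1) / total for c in counts]


def kl_divergence(p, q):
    """Discrete KL divergence KL(p || q) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def drift_detected(prev_chunk, curr_chunk, n_bins=10, threshold=0.1):
    """Flag drift when KL(current || previous) exceeds a hypothetical threshold."""
    lo = min(min(prev_chunk), min(curr_chunk))
    hi = max(max(prev_chunk), max(curr_chunk))
    # Fall back to a unit width when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    p = histogram(curr_chunk, lo, width, n_bins)
    q = histogram(prev_chunk, lo, width, n_bins)
    return kl_divergence(p, q) > threshold


if __name__ == "__main__":
    previous = [i / 100 for i in range(100)]   # values in [0, 1)
    shifted = [v + 5.0 for v in previous]      # same shape, shifted mean
    print(drift_detected(previous, previous))  # identical chunks
    print(drift_detected(previous, shifted))   # disjoint chunks
```

In a chunk-based loop, a positive result would trigger building a new classifier from the current chunk's labeled data, while a negative result would let the high-confidence pseudo-labeled points carry over to the next chunk. Multi-dimensional data would need a different density estimate, which this sketch does not attempt.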
Cite this article
Khezri, S., Tanha, J., Ahmadi, A. et al. STDS: self-training data streams for mining limited labeled data in non-stationary environment. Appl Intell 50, 1448–1467 (2020). https://doi.org/10.1007/s10489-019-01585-3
DOI: https://doi.org/10.1007/s10489-019-01585-3