
Important sampling based active learning for imbalance classification

  • Research Paper
  • Published in Science China Information Sciences

Abstract

Imbalance in data distribution hinders the learning performance of classifiers. A popular class of methods tackles this problem by sampling (oversampling the minority class or undersampling the majority class) so that the imbalanced data becomes relatively balanced. However, these methods usually rely on a single sampling technique, either oversampling or undersampling, and consequently suffer when the imbalance ratio (the number of majority instances over the number of minority instances) is large. In this paper, an active learning framework is proposed to deal with imbalanced data by alternately performing important sampling (ALIS), which consists of selecting important majority-class instances and generating informative minority-class instances. In ALIS, the two sampling strategies reinforce each other: the selected majority-class instances provide clearer information for the next oversampling step, while the generated minority-class instances provide more sufficient information for the next undersampling step. Extensive experiments have been conducted on real-world datasets covering a wide range of imbalance ratios. The results demonstrate the superiority of ALIS over state-of-the-art methods in terms of several well-known evaluation metrics.
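The paper defines its own criteria for selecting important majority-class instances and generating informative minority-class ones. As a rough illustration only, the alternating scheme described above can be sketched as follows; the distance-based importance score, the SMOTE-style interpolation, and every function name and parameter here are assumptions of this sketch, not the authors' actual method.

```python
import numpy as np

def pairwise_dist(A, B):
    # Euclidean distances between rows of A and rows of B.
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def alis_like_resample(X_maj, X_min, rounds=5, drop_frac=0.1,
                       gen_frac=0.1, k=3, seed=0):
    """Hypothetical alternating under-/oversampling loop (not the paper's
    exact algorithm): each round drops the least informative majority
    instances, then generates synthetic minority instances."""
    rng = np.random.default_rng(seed)
    X_maj = np.asarray(X_maj, float).copy()
    X_min = np.asarray(X_min, float).copy()
    for _ in range(rounds):
        if len(X_maj) <= len(X_min):
            break  # roughly balanced, stop alternating
        # Undersampling: drop the majority instances farthest from the
        # minority class (assumed least informative for the boundary).
        d = pairwise_dist(X_maj, X_min).min(axis=1)
        n_drop = max(1, int(drop_frac * len(X_maj)))
        X_maj = X_maj[np.argsort(d)[: len(X_maj) - n_drop]]
        # Oversampling: SMOTE-style interpolation between a minority
        # instance and one of its k nearest minority neighbors.
        n_gen = max(1, int(gen_frac * len(X_min)))
        base = rng.integers(0, len(X_min), n_gen)
        dmin = pairwise_dist(X_min, X_min)
        np.fill_diagonal(dmin, np.inf)
        nbrs = np.argsort(dmin, axis=1)[:, :k]
        picked = nbrs[base, rng.integers(0, k, n_gen)]
        lam = rng.random((n_gen, 1))
        synth = X_min[base] + lam * (X_min[picked] - X_min[base])
        X_min = np.vstack([X_min, synth])
    return X_maj, X_min
```

The point of the alternation is visible even in this toy version: the surviving majority instances shape where the next synthetic minority instances fall, and the enlarged minority set changes which majority instances look important in the next round.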



Acknowledgements

This work was supported in part by National Natural Science Foundation of China (Grant Nos. 61822601, 61773050, 61632004, 61972132), Beijing Natural Science Foundation (Grant No. Z180006), National Key Research and Development Program (Grant No. 2017YFC1703506), Fundamental Research Funds for the Central Universities (Grant Nos. 2019JBZ110, 2019YJS040), Youth Foundation of Hebei Education Department (Grant No. QN2018084), Science and Technology Foundation of Hebei Agricultural University (Grant No. LG201804), and Research Project for Self-cultivating Talents of Hebei Agricultural University (Grant No. PY201810).

Author information


Corresponding author

Correspondence to Liping Jing.

Additional information

Supporting information

Appendixes A-C. The supporting information is available online at info.scichina.com and link.springer.com. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.



About this article


Cite this article

Wang, X., Liu, B., Cao, S. et al. Important sampling based active learning for imbalance classification. Sci. China Inf. Sci. 63, 182104 (2020). https://doi.org/10.1007/s11432-019-2771-0

