
Selective ensemble of uncertain extreme learning machine for pattern classification with missing features


Abstract

Ensemble learning is an effective technique for improving performance and stability over single classifiers. This work proposes a selective ensemble classification strategy for missing-data classification, in which an uncertain extreme learning machine with probability constraints serves as the individual (base) classifier. Three selective ensemble frameworks are then developed to optimize ensemble margin distributions and aggregate the individual classifiers. The first two are robust ensemble frameworks built on the proposed loss functions; the third is a sparse ensemble classification framework with zero-norm regularization that automatically selects the required individual classifiers. Majority voting is applied to produce the final ensemble classifier for missing-data classification. We establish several important properties of the proposed loss functions, including robustness, convexity and Fisher consistency. To verify the validity of the proposed methods, numerical experiments are conducted on benchmark datasets with missing feature values: missing features are first imputed with the expectation-maximization (EM) algorithm, and the experiments are then run on the imputed datasets. Under different probability lower bounds on classification accuracy and different proportions of missing values, the results show that the proposed ensemble methods achieve better or comparable generalization than traditional methods when classifying data with missing values.
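To make the overall pipeline concrete, here is a minimal sketch (not the authors' implementation): missing features are filled in first (the paper uses EM imputation; plain mean imputation stands in as a placeholder here), a pool of basic ELM base classifiers is trained, and majority voting produces the ensemble prediction. The uncertain ELM with probability constraints and the selective ensemble optimization are not reproduced in this sketch.

```python
import numpy as np

def impute_mean(X):
    """Fill NaNs column-wise with the column mean (placeholder for EM imputation)."""
    X = X.copy()
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]
    return X

class ELM:
    """Basic extreme learning machine for binary labels in {-1, +1}."""
    def __init__(self, n_hidden=50, C=1.0, rng=None):
        self.n_hidden = n_hidden
        self.C = C
        self.rng = rng if rng is not None else np.random.default_rng()

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)        # random feature map

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Regularized least-squares solution for the output weights
        A = H.T @ H + np.eye(self.n_hidden) / self.C
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def predict(self, X):
        return np.sign(self._hidden(X) @ self.beta)

def majority_vote(classifiers, X):
    """Combine base classifiers by (unweighted) majority voting."""
    votes = np.array([clf.predict(X) for clf in classifiers])
    return np.sign(votes.sum(axis=0))

# Toy usage: inject missing values, impute, train an ensemble, and vote.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
X[rng.random(X.shape) < 0.1] = np.nan
X_filled = impute_mean(X)
ensemble = [ELM(rng=np.random.default_rng(s)).fit(X_filled, y) for s in range(7)]
print((majority_vote(ensemble, X_filled) == y).mean())   # training accuracy
```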



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 11471010 and 11271367). The authors also thank the referees and the editor for their constructive comments, which significantly improved the paper.

Author information

Corresponding author

Correspondence to Liming Yang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

In this appendix, we prove that the following triangle inequality holds for \(\eta (u)=1-\epsilon ^{-\alpha |u|}\) with \(\alpha >0\) and \(u\in R\):

$$\begin{aligned}&\eta (u_1+u_2)\le \eta (u_1)+\eta (u_2),\forall u_1,u_2\in R \end{aligned}$$
(39)
Indeed,

$$\begin{aligned}&\eta (u_1)+\eta (u_2)-\eta (u_1+u_2) \\&\quad =(1-\epsilon ^{-\alpha |u_1|})+(1-\epsilon ^{-\alpha |u_2|})- (1-\epsilon ^{-\alpha |u_{1}+u_{2}|}) \\&\quad =1-\epsilon ^{-\alpha |u_1|}-\epsilon ^{-\alpha |u_2|}+ \epsilon ^{-\alpha |u_{1}+u_{2}|} \\&\quad \ge 1-\epsilon ^{-\alpha |u_1|}-\epsilon ^{-\alpha |u_2|}+ \epsilon ^{-\alpha (|u_{1}|+|u_{2}|)} \\&\quad =1-\epsilon ^{-\alpha |u_1|}-\epsilon ^{-\alpha |u_2|}+ \epsilon ^{-\alpha |u_1|}\cdot \epsilon ^{-\alpha |u_2|} \\&\quad =1-\epsilon ^{-\alpha |u_2|}+\epsilon ^{-\alpha |u_1|} (\epsilon ^{-\alpha |u_2|}-1) \\&\quad =(1-\epsilon ^{-\alpha |u_2|})\cdot (1-\epsilon ^{-\alpha |u_1|}) \\&\quad \ge 0. \end{aligned}$$
(40)

The first inequality follows from \(|u_{1}+u_{2}|\le |u_{1}|+|u_{2}|\) together with the monotone decrease of \(f(x)=\epsilon ^{-x}\). The last inequality holds because both factors are nonnegative when \(\alpha >0\).
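As a quick numerical sanity check of inequality (39), assuming the base written as \(\epsilon\) denotes the natural exponential base:

```python
import numpy as np

# Numerical check of inequality (39), assuming the base epsilon is e.
def eta(u, alpha=0.5):
    return 1.0 - np.exp(-alpha * np.abs(u))

rng = np.random.default_rng(0)
u1 = rng.normal(scale=5.0, size=100_000)
u2 = rng.normal(scale=5.0, size=100_000)
gap = eta(u1) + eta(u2) - eta(u1 + u2)   # should be nonnegative everywhere
print(gap.min() >= -1e-12)               # True
```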

Appendix 2

Generally speaking, a DC program takes the form

$$\begin{aligned} \inf \{f(x)=g(x)-h(x),\,\, x\in R^{n}\} ~~~~~(P_{dc}) \end{aligned}$$
(41)

where g and h are lower semicontinuous proper convex functions on \(R^{n}\). Such a function f is called a DC function, and g and h are its DC components. A function \(\pi (x)\) is said to be polyhedral convex if

$$\begin{aligned} \pi (x)=\max \{\varpi _{i}^{T}x-\sigma _{i},\,i=1,2,\ldots ,m\}+\chi _{{\varOmega }}(x),\quad \forall x\in R^{n} \end{aligned}$$
(42)

where \(\varpi _{i}\in R^{n}\) and \(\sigma _{i}\in R\) for \(i=1,2,\ldots ,m\), and \(\chi _{{\varOmega }}(x)\) is the indicator function of the non-empty convex set \({\varOmega }\), defined by \(\chi _{{\varOmega }}(x)=0\) if \(x\in {\varOmega }\) and \(+\infty\) otherwise. A DC program is called a polyhedral DC program when either g or h is a polyhedral convex function.
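For illustration only (not an example from the paper), a polyhedral convex function of form (42) with \({\varOmega }=R^{n}\), so that the indicator term vanishes, is a pointwise maximum of finitely many affine functions; the absolute value \(|x|=\max \{x,-x\}\) is a one-dimensional instance:

```python
import numpy as np

# Evaluate pi(x) = max_i (w_i^T x - sigma_i), a polyhedral convex function
# of form (42) with Omega = R^n (indicator term omitted).
def polyhedral(x, W, sigma):
    return np.max(W @ x - sigma)

# One-dimensional instance: |x| = max{x, -x}
W = np.array([[1.0], [-1.0]])
sigma = np.zeros(2)
print(polyhedral(np.array([-3.0]), W, sigma))   # 3.0 = |-3|
```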

A point \(x^{*}\) that satisfies the following generalized Kuhn–Tucker condition is called a critical point of \((P_{dc})\)

$$\begin{aligned} \partial h(x^{*})\cap \partial g(x^{*})\ne \emptyset \end{aligned}$$
(43)

where \(\partial h\) denotes the subdifferential of the convex function h. In particular, if h is polyhedral convex, then such a critical point is almost always a local solution of \((P_{dc})\).

The necessary local optimality condition for \((P_{dc})\) is

$$\begin{aligned} \partial h(x^{*})\subset \partial g(x^{*})\ne \emptyset \end{aligned}$$
(44)

which is also sufficient for many important classes of DC programs, for example, polyhedral DC programs or when f is locally convex at \(x^{*}\). We use \(g^{*}(y)=\sup \{x^{T}y-g(x),x\in R^{n}\}\) to denote the conjugate function of g. The Fenchel–Rockafellar dual of \((P_{dc})\) is defined as

$$\begin{aligned} \inf \{h^{*}(y)-g^{*}(y),y\in R^{n}\} ~~~~~~~~~~~~~~~~(D_{dc}) \end{aligned}$$
(45)
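For instance, with the illustrative quadratic component \(g(x)=\frac{1}{2}\Vert x\Vert ^{2}\) (a choice made here for illustration, not taken from the paper), the conjugate is available in closed form,

$$\begin{aligned} g^{*}(y)=\sup _{x\in R^{n}}\left\{ x^{T}y-\frac{1}{2}\Vert x\Vert ^{2}\right\} =\frac{1}{2}\Vert y\Vert ^{2}, \end{aligned}$$

with the supremum attained at \(x=y\).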

DCA is an iterative algorithm based on local optimality conditions and duality. The idea of DCA is simple: at each iteration, one replaces the second component h in the primal DC problem \((P_{dc})\) by its affine minorization, \(h(x^{k})+(x-x^{k})^{T}y^{k}\), to generate the convex program

$$\begin{aligned} \min \{g(x)-h(x^{k})-(x-x^{k})^{T}y^{k},x\in R^{n},y^{k}\in \partial h(x^{k}) \} \end{aligned}$$
(46)

which is equivalent to determining \(x^{k+1} \in \partial g^{*}(y^{k})\). Likewise, the second DC component \(g^{*}\) of the dual DC program \((D_{dc})\) is replaced by its affine minorization, \(g^{*}(y^{k})+(y-y^{k})^{T}x^{k+1}\), to obtain a convex program that is equivalent to determining \(y^{k+1}\in \partial h(x^{k+1})\).

In practice, a simplified form of the DCA is used. Two sequences \(\{x^{k}\}\) and \(\{y^{k}\}\) satisfying \(y^{k}\in \partial h(x^{k})\) are constructed, and \(x^{k+1}\) is a solution to the convex program (46). The simplified DCA scheme is described as follows.

Initialization: Choose an initial point \(x^{0}\in R^{n}\) and set \(k=0\)

Repeat

Calculate \(y^{k}\in \partial h(x^k)\)

Solve the convex program (46) to obtain \(x^{k+1}\)

Set \(k:=k+1\)

Until some stopping criterion is satisfied.
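As a minimal sketch of this simplified scheme (an illustrative choice, not an example from the paper), take \(g(x)=\frac{1}{2}\Vert x\Vert ^{2}\), so that subproblem (46) has the closed-form solution \(x^{k+1}=y^{k}\), together with a differentiable toy component \(h(x)=\sum _{i}\sqrt{1+x_{i}^{2}}\), so that \(y^{k}=\nabla h(x^{k})\):

```python
import numpy as np

def grad_h(x):
    # Gradient of the toy second DC component h(x) = sum_i sqrt(1 + x_i^2)
    return x / np.sqrt(1.0 + x ** 2)

def dca(x0, max_iter=100, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        y = grad_h(x)                        # y^k in the subdifferential of h
        x_new = y                            # argmin_x 0.5*||x||^2 - <y, x>
        if np.linalg.norm(x_new - x) < tol:  # stopping criterion
            return x_new
        x = x_new
    return x

print(dca(np.array([3.0, -2.0])))            # slowly approaches the critical point 0
```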

DCA is a descent algorithm without line search. The following properties are used in subsequent sections (for simplicity, we omit the dual parts of these properties):

  1. (1)

If \(g(x^{k+1})-h(x^{k+1}) = g(x^{k})-h(x^{k})\), then \(x^{k}\) is a critical point of \((P_{dc})\). In this case, DCA terminates at the k-th iteration.

  2. (2)

    Let \(y^{*}\) be a local solution to the dual of \((P_{dc})\) and \(x^{*}\in \partial g^{*}(y^{*})\). If h is differentiable at \(x^{*}\), then \(x^{*}\) is a local solution to \((P_{dc})\).

  3. (3)

    If the optimal value of problem \((P_{dc})\) is finite and the infinite sequence \(\{x^{k}\}\) is bounded, then every limit point \(x^{*}\) of the sequence \(\{x^{k}\}\) is a critical point of \((P_{dc})\).

  4. (4)

DCA converges linearly for general DC programs. In particular, for polyhedral DC programs the sequence \(\{x^{k}\}\) contains finitely many elements, and the algorithm converges to a critical point satisfying the necessary local optimality condition after finitely many iterations.

Moreover, if the second DC component h in \((P_{dc})\) is differentiable, then the subdifferential of h at \(x^{k}\) reduces to the singleton \(\partial h(x^{k})=\{\nabla h(x^{k})\}\). In this case, \(x^{k+1}\) is a solution to the following convex program:

$$\begin{aligned} \min \{g(x)-(h(x^{k})+\nabla h(x^{k})^{T}(x-x^{k})),x\in R^{n}\} \end{aligned}$$
(47)


About this article


Cite this article

Jing, S., Wang, Y. & Yang, L. Selective ensemble of uncertain extreme learning machine for pattern classification with missing features. Artif Intell Rev 53, 5881–5905 (2020). https://doi.org/10.1007/s10462-020-09836-3
