
Selective ensemble of uncertain extreme learning machine for pattern classification with missing features


Abstract

Ensemble learning is an effective technique for improving performance and stability over single classifiers. This work proposes a selective ensemble classification strategy for missing-data classification, in which an uncertain extreme learning machine with probability constraints serves as the individual (base) classifier. Three selective ensemble frameworks are then developed to optimize ensemble margin distributions and aggregate the individual classifiers. The first two are robust ensemble frameworks built on the proposed loss functions; the third is a sparse ensemble classification framework with zero-norm regularization that automatically selects the required individual classifiers. Majority voting is applied to produce the final ensemble classifier for missing-data classification. We establish several important properties of the proposed loss functions, including robustness, convexity and Fisher consistency. To verify the validity of the proposed methods, numerical experiments are conducted on benchmark datasets with missing feature values: missing features are first imputed with the expectation-maximization (EM) algorithm, and the experiments are then run on the imputed datasets. Under different probability lower bounds on classification accuracy and different proportions of missing values, the results show that the proposed ensemble methods achieve better or comparable generalization than traditional methods when classifying data with missing values.
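To make the overall pipeline concrete, here is a minimal sketch (not the authors' implementation): missing features are filled in first (the paper uses EM imputation; plain mean imputation stands in as a placeholder here), a pool of basic ELM base classifiers is trained, and majority voting produces the ensemble prediction. The uncertain ELM with probability constraints and the selective ensemble optimization are not reproduced in this sketch.

```python
import numpy as np

def impute_mean(X):
    """Fill NaNs column-wise with the column mean (placeholder for EM imputation)."""
    X = X.copy()
    col_mean = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_mean[cols]
    return X

class ELM:
    """Basic extreme learning machine for binary labels in {-1, +1}."""
    def __init__(self, n_hidden=50, C=1.0, rng=None):
        self.n_hidden = n_hidden
        self.C = C
        self.rng = rng if rng is not None else np.random.default_rng()

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)        # random feature map

    def fit(self, X, y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Regularized least-squares solution for the output weights
        A = H.T @ H + np.eye(self.n_hidden) / self.C
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def predict(self, X):
        return np.sign(self._hidden(X) @ self.beta)

def majority_vote(classifiers, X):
    """Combine base classifiers by (unweighted) majority voting."""
    votes = np.array([clf.predict(X) for clf in classifiers])
    return np.sign(votes.sum(axis=0))

# Toy usage: inject missing values, impute, train an ensemble, and vote.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
X[rng.random(X.shape) < 0.1] = np.nan
X_filled = impute_mean(X)
ensemble = [ELM(rng=np.random.default_rng(s)).fit(X_filled, y) for s in range(7)]
print((majority_vote(ensemble, X_filled) == y).mean())   # training accuracy
```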



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 11471010 and 11271367). The authors also thank the referees and the editor for their constructive comments, which significantly improved the paper.

Author information

Corresponding author

Correspondence to Liming Yang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

In this appendix, we prove that the following triangle inequality holds for \(\eta (u)=1-\epsilon ^{-\alpha |u|}\) with \(\alpha >0\) and \(u\in R\):

$$\begin{aligned}&\eta (u_1+u_2)\le \eta (u_1)+\eta (u_2),\forall u_1,u_2\in R \end{aligned}$$
(39)
Indeed,

$$\begin{aligned}&\eta (u_1)+\eta (u_2)-\eta (u_1+u_2) \\&\quad =(1-\epsilon ^{-\alpha |u_1|})+(1-\epsilon ^{-\alpha |u_2|})- (1-\epsilon ^{-\alpha |u_{1}+u_{2}|}) \\&\quad =1-\epsilon ^{-\alpha |u_1|}-\epsilon ^{-\alpha |u_2|}+ \epsilon ^{-\alpha |u_{1}+u_{2}|} \\&\quad \ge 1-\epsilon ^{-\alpha |u_1|}-\epsilon ^{-\alpha |u_2|}+ \epsilon ^{-\alpha (|u_{1}|+|u_{2}|)} \\&\quad =1-\epsilon ^{-\alpha |u_1|}-\epsilon ^{-\alpha |u_2|}+ \epsilon ^{-\alpha |u_1|}\cdot \epsilon ^{-\alpha |u_2|} \\&\quad =1-\epsilon ^{-\alpha |u_2|}+\epsilon ^{-\alpha |u_1|} (\epsilon ^{-\alpha |u_2|}-1) \\&\quad =(1-\epsilon ^{-\alpha |u_2|})\cdot (1-\epsilon ^{-\alpha |u_1|}) \\&\quad \ge 0. \end{aligned}$$
(40)

The first inequality follows from \(|u_{1}+u_{2}|\le |u_{1}|+|u_{2}|\) together with the monotone decrease of \(f(x)=\epsilon ^{-x}\). The last inequality holds because both factors are nonnegative when \(\alpha >0\).
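As a quick numerical sanity check of inequality (39), assuming the base written as \(\epsilon\) denotes the natural exponential base:

```python
import numpy as np

# Numerical check of inequality (39), assuming the base epsilon is e.
def eta(u, alpha=0.5):
    return 1.0 - np.exp(-alpha * np.abs(u))

rng = np.random.default_rng(0)
u1 = rng.normal(scale=5.0, size=100_000)
u2 = rng.normal(scale=5.0, size=100_000)
gap = eta(u1) + eta(u2) - eta(u1 + u2)   # should be nonnegative everywhere
print(gap.min() >= -1e-12)               # True
```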

Appendix 2

Generally speaking, a DC program takes the form

$$\begin{aligned} \inf \{f(x)=g(x)-h(x),\,\, x\in R^{n}\} ~~~~~(P_{dc}) \end{aligned}$$
(41)

where g and h are lower semicontinuous proper convex functions on \(R^{n}\). Such a function f is called a DC function, and g and h are its DC components. A function \(\pi (x)\) is said to be polyhedral convex if

$$\begin{aligned} \pi (x)=\max \{\varpi _{i}^{T}x-\sigma _{i},\,i=1,2,\ldots ,m\}+\chi _{{\varOmega }}(x),\quad \forall x\in R^{n} \end{aligned}$$
(42)

where \(\varpi _{i}\in R^{n}\) and \(\sigma _{i}\in R\) for \(i=1,2,\ldots ,m\), and \(\chi _{{\varOmega }}(x)\) is the indicator function of the non-empty convex set \({\varOmega }\), defined by \(\chi _{{\varOmega }}(x)=0\) if \(x\in {\varOmega }\) and \(+\infty\) otherwise. A DC program is called a polyhedral DC program when either g or h is a polyhedral convex function.
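For illustration only (not an example from the paper), a polyhedral convex function of form (42) with \({\varOmega }=R^{n}\), so that the indicator term vanishes, is a pointwise maximum of finitely many affine functions; the absolute value \(|x|=\max \{x,-x\}\) is a one-dimensional instance:

```python
import numpy as np

# Evaluate pi(x) = max_i (w_i^T x - sigma_i), a polyhedral convex function
# of form (42) with Omega = R^n (indicator term omitted).
def polyhedral(x, W, sigma):
    return np.max(W @ x - sigma)

# One-dimensional instance: |x| = max{x, -x}
W = np.array([[1.0], [-1.0]])
sigma = np.zeros(2)
print(polyhedral(np.array([-3.0]), W, sigma))   # 3.0 = |-3|
```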

A point \(x^{*}\) that satisfies the following generalized Kuhn–Tucker condition is called a critical point of \((P_{dc})\)

$$\begin{aligned} \partial h(x^{*})\cap \partial g(x^{*})\ne \emptyset \end{aligned}$$
(43)

where \(\partial h\) denotes the subdifferential of the convex function h. In particular, if h is polyhedral convex, then such a critical point is almost always a local solution of \((P_{dc})\).

The necessary local optimality condition for \((P_{dc})\) is

$$\begin{aligned} \partial h(x^{*})\subset \partial g(x^{*})\ne \emptyset \end{aligned}$$
(44)

which is also sufficient for many important classes of DC programs, for example, polyhedral DC programs or when f is locally convex at \(x^{*}\). We use \(g^{*}(y)=\sup \{x^{T}y-g(x),x\in R^{n}\}\) to denote the conjugate function of g. The Fenchel–Rockafellar dual of \((P_{dc})\) is defined as

$$\begin{aligned} \inf \{h^{*}(y)-g^{*}(y),y\in R^{n}\} ~~~~~~~~~~~~~~~~(D_{dc}) \end{aligned}$$
(45)
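For instance, with the illustrative quadratic component \(g(x)=\frac{1}{2}\Vert x\Vert ^{2}\) (a choice made here for illustration, not taken from the paper), the conjugate is available in closed form,

$$\begin{aligned} g^{*}(y)=\sup _{x\in R^{n}}\left\{ x^{T}y-\frac{1}{2}\Vert x\Vert ^{2}\right\} =\frac{1}{2}\Vert y\Vert ^{2}, \end{aligned}$$

with the supremum attained at \(x=y\).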

DCA is an iterative algorithm based on local optimality conditions and duality. The idea of DCA is simple: at each iteration, one replaces the second component h in the primal DC problem \((P_{dc})\) by its affine minorization, \(h(x^{k})+(x-x^{k})^{T}y^{k}\), to generate the convex program

$$\begin{aligned} \min \{g(x)-h(x^{k})-(x-x^{k})^{T}y^{k},x\in R^{n},y^{k}\in \partial h(x^{k}) \} \end{aligned}$$
(46)

which is equivalent to determining \(x^{k+1} \in \partial g^{*}(y^{k})\). Likewise, the second DC component \(g^{*}\) of the dual DC program \((D_{dc})\) is replaced by its affine minorization, \(g^{*}(y^{k})+(y-y^{k})^{T}x^{k+1}\), to obtain a convex program that is equivalent to determining \(y^{k+1}\in \partial h(x^{k+1})\).

In practice, a simplified form of the DCA is used. Two sequences \(\{x^{k}\}\) and \(\{y^{k}\}\) satisfying \(y^{k}\in \partial h(x^{k})\) are constructed, and \(x^{k+1}\) is a solution to the convex program (46). The simplified DCA scheme is described as follows.

Initialization: Choose an initial point \(x^{0}\in R^{n}\) and set \(k=0\)

Repeat

Calculate \(y^{k}\in \partial h(x^k)\)

Solve the convex program (46) to obtain \(x^{k+1}\)

Set \(k:=k+1\)

Until some stopping criterion is satisfied.
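As a minimal sketch of this simplified scheme (an illustrative choice, not an example from the paper), take \(g(x)=\frac{1}{2}\Vert x\Vert ^{2}\), so that subproblem (46) has the closed-form solution \(x^{k+1}=y^{k}\), together with a differentiable toy component \(h(x)=\sum _{i}\sqrt{1+x_{i}^{2}}\), so that \(y^{k}=\nabla h(x^{k})\):

```python
import numpy as np

def grad_h(x):
    # Gradient of the toy second DC component h(x) = sum_i sqrt(1 + x_i^2)
    return x / np.sqrt(1.0 + x ** 2)

def dca(x0, max_iter=100, tol=1e-8):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        y = grad_h(x)                        # y^k in the subdifferential of h
        x_new = y                            # argmin_x 0.5*||x||^2 - <y, x>
        if np.linalg.norm(x_new - x) < tol:  # stopping criterion
            return x_new
        x = x_new
    return x

print(dca(np.array([3.0, -2.0])))            # slowly approaches the critical point 0
```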

DCA is a descent algorithm without line search. The following properties are used in subsequent sections (for simplicity, we omit the dual parts of these properties):

  1. (1)

If \(g(x^{k+1})-h(x^{k+1}) = g(x^{k})-h(x^{k})\), then \(x^{k}\) is a critical point of \((P_{dc})\). In this case, DCA terminates at the k-th iteration.

  2. (2)

    Let \(y^{*}\) be a local solution to the dual of \((P_{dc})\) and \(x^{*}\in \partial g^{*}(y^{*})\). If h is differentiable at \(x^{*}\), then \(x^{*}\) is a local solution to \((P_{dc})\).

  3. (3)

    If the optimal value of problem \((P_{dc})\) is finite and the infinite sequence \(\{x^{k}\}\) is bounded, then every limit point \(x^{*}\) of the sequence \(\{x^{k}\}\) is a critical point of \((P_{dc})\).

  4. (4)

DCA converges linearly for general DC programs. In particular, for polyhedral DC programs the sequence \(\{x^{k}\}\) contains finitely many elements, and the algorithm converges to a critical point satisfying the necessary local optimality condition after finitely many iterations.

Moreover, if the second DC component h in \((P_{dc})\) is differentiable, then the subdifferential of h at \(x^{k}\) reduces to the singleton \(\partial h(x^{k})=\{\nabla h(x^{k})\}\). In this case, \(x^{k+1}\) is a solution to the following convex program:

$$\begin{aligned} \min \{g(x)-(h(x^{k})+\nabla h(x^{k})^{T}(x-x^{k})),x\in R^{n}\} \end{aligned}$$
(47)


About this article


Cite this article

Jing, S., Wang, Y. & Yang, L. Selective ensemble of uncertain extreme learning machine for pattern classification with missing features. Artif Intell Rev 53, 5881–5905 (2020). https://doi.org/10.1007/s10462-020-09836-3
