Huber-type principal expectile component analysis

https://doi.org/10.1016/j.csda.2020.106992

Abstract

In principal component analysis (PCA), principal components are identified by maximizing the component score variance around the mean. However, a practitioner might be interested in capturing the variation in the tail rather than the center of a distribution to, for example, identify the major pollutants from air pollution data. To address this problem, we introduce a new method called Huber-type principal expectile component (HPEC) analysis that uses an asymmetric Huber norm to provide a kind of robust-tail PCA. The statistical properties of HPECs are derived, and a derivative-free optimization approach called particle swarm optimization (PSO) is used to identify HPECs numerically. As a demonstration, HPEC analysis is applied to real and simulated data with encouraging results.

Introduction

In principal component analysis (PCA), proposed by Pearson (1901), observations are represented by uncorrelated variables obtained through an orthogonal transformation. In practice, PCA can be implemented by sequentially maximizing the component score variance (L2 norm) for observations centered around the sample mean. Because data structures differ across real applications, however, researchers and users may be interested in different targets. Sometimes the problem is to identify the components that cause variation in a region other than the central part. For example, when analyzing air pollution data, major pollutants usually exhibit large variations in their high indices. Hence, developing techniques to identify the components that are responsible for high-index variation will facilitate the detection of the major air pollution sources. However, PCA can fail in such a scenario due to its symmetric L2 criterion. A tail indicator known as an expectile, proposed by Newey and Powell (1987), is a least-squares analogue of the quantile: it generalizes the mean in the way that a quantile generalizes the median. Like a quantile, an expectile can be characterized as the minimizer of an asymmetric loss, in this case an asymmetric L2 norm. Taylor (2008) demonstrated that quantiles and expected shortfalls can be assessed via the use of expectiles. To properly capture the tail characteristics, principal expectile component (PEC) analysis, a PCA analogue based on an asymmetric L2 norm proposed by Tran et al. (2019), can be used.

In real applications, we found that in some cases, the PEC fails to capture the tail variation due to the quadratic loss in the alternative part. One motivating example is the analysis of Olympic decathlon data, as shown in Section 5.1. Suppose we aim to identify the top-ranking athletes. After projecting the data points into the space spanned by the first two PECs, the top-ranking cluster cannot be clearly separated from the other groups. See Fig. 4 for details. It is possible that the PEC fails to capture the right-tail variation because four variables (shot put, high jump, 110-meter hurdle, and javelin throw) have heavier left tails. To address heavy-tailed observations and outliers, we propose a new method by replacing the L2 norm in PEC with the Huber norm proposed by Huber (1973). The Huber norm is designed to be quadratic for small deviations but grows linearly for large deviations. Thus, it is less sensitive to heavy-tailed observations. In this study, to address the top (or low) tails of data points, we integrated the Huber norm with the asymmetric loss function to obtain an asymmetric Huber norm. Analogous to the PEC analysis of Tran et al. (2019), we propose Huber-type principal expectile component (HPEC) analysis based on the Huber-type expectile. This new tail indicator is evaluated by minimizing the distance between observations based on the expected asymmetric Huber norm. Given its asymmetric Huber norm, HPEC analysis can capture the tail characteristics and is also less sensitive to outliers. To identify the HPECs, we convert the problem into an optimization problem whose objective function has no closed form and adopt a derivative-free optimization approach, particle swarm optimization (PSO).
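As a rough illustration of the idea described above, the following sketch combines the classical Huber function (quadratic for small deviations, linear for large ones) with asymmetric weights τ and 1 − τ. The exact definition used for HPEC analysis is given in Section 2 of the paper and is not reproduced here, so the function names `huber` and `asymmetric_huber`, the default threshold `c = 1.345`, and the way the two ingredients are combined are all assumptions made for illustration only.

```python
def huber(u, c=1.345):
    """Classical Huber function: quadratic for |u| <= c, linear beyond c."""
    a = abs(u)
    if a <= c:
        return 0.5 * u * u
    return c * (a - 0.5 * c)


def asymmetric_huber(u, tau=0.9, c=1.345):
    """Hypothetical asymmetric Huber loss for a scalar deviation u.

    Positive deviations are weighted by tau and negative deviations by
    (1 - tau); this combination is an assumption, not the paper's formula.
    """
    weight = tau if u >= 0 else 1.0 - tau
    return weight * huber(u, c)


if __name__ == "__main__":
    for u in (-3.0, -0.5, 0.5, 3.0):
        print(u, asymmetric_huber(u, tau=0.9, c=1.0))
```

With τ close to 1, large positive deviations dominate the loss, while the linear growth beyond `c` limits the influence of any single extreme observation compared with a purely quadratic loss.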

Simulation studies show that HPEC analysis can capture tail variations, i.e., variation around the Huber-type expectile, in a mixture of normal distributions. However, based on the same data set, PCA and PECs fail to catch the corresponding tail variations. This indicates that HPEC analysis can be used to efficiently reduce the dimensions while simultaneously detecting anomalous variations in the tail. Furthermore, the probabilistic PCA (PPCA) proposed by Tipping and Bishop (1999a) can be used to model complex data structures in combination with local PCA. Therefore, we also compared the proposed method with the mechanism of the PPCA mixture. The results indicate that HPEC analysis outperforms PPCA when capturing the principal components of both the tail and center.

In addition to simulation studies, we also applied HPECs to two real data sets: the 2016 Olympic decathlon and the Taiwan air quality indices. For the 2016 Olympic decathlon data, we evaluated the scores of the first two components obtained by PCA, PECs, and HPECs. Using HPEC analysis, we could better extract the top-ranking group of athletes compared with using PCA and PECs. For the air quality indices, when using the component loadings of HPEC analysis, we were able to identify compounds with large variations around the mean or the right tail with different tuning parameters. Our analyses of these two data sets demonstrated the advantages of using HPECs, especially when one is interested in factors with larger variations on the tail.

This paper is organized as follows: In Section 2, the asymmetric Huber norm and the Huber-type principal expectile component are proposed. Certain statistical properties of the Huber-type τ-expectile are provided, and the convergence of the estimate of the first HPEC is derived. In Section 3, PSO is adopted to efficiently identify the components of HPEC analysis. In Section 4, we demonstrate the advantages of HPECs via simulation studies. In Section 5, empirical studies on the two data sets are conducted. Concluding remarks are made in Section 6.

Section snippets

Huber-type PEC

Let $y$ be a vector in $\mathbb{R}^p$ and define $y_+ = \max(0, y)$ and $y_- = \max(0, -y)$ coordinatewise. The asymmetric L2 norm of $y$ for a given level $\tau \in [0,1]$ is defined as $\|y\|_{\tau}^2 = \tau \|y_+\|^2 + (1-\tau) \|y_-\|^2$. Based on this asymmetric L2 norm, the τ-expectile of a random vector $Y \in \mathbb{R}^p$ is defined as $e_\tau = \arg\min_{\vartheta \in \mathbb{R}^p} E\|Y - \vartheta\|_{\tau}^2$. The choice of τ is customized with respect to the variation region of interest. For example, τ can be set to either 0.95 to 0.99 or 0.05 to 0.01 to capture variations in the upper-right or lower-left regions, respectively.
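The definitions above can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: `asym_sq_norm` evaluates the squared asymmetric L2 norm of a vector, and `sample_expectile` computes a univariate sample τ-expectile by iteratively reweighted averaging (for a vector, the same routine can be applied to each coordinate, because the norm is a sum over coordinates). The function names, the tolerance, and the restriction to 0 < τ < 1 are assumptions introduced here.

```python
def asym_sq_norm(y, tau):
    """Squared asymmetric L2 norm: tau * ||y_+||^2 + (1 - tau) * ||y_-||^2."""
    pos = sum(max(0.0, v) ** 2 for v in y)
    neg = sum(max(0.0, -v) ** 2 for v in y)
    return tau * pos + (1.0 - tau) * neg


def sample_expectile(values, tau, tol=1e-12, max_iter=1000):
    """Univariate sample tau-expectile via iteratively reweighted means.

    Observations above the current estimate get weight tau, the others get
    weight (1 - tau); the fixed point minimizes the mean asymmetric
    squared deviation.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    e = sum(values) / len(values)  # tau = 0.5 gives the sample mean
    for _ in range(max_iter):
        weights = [tau if v > e else 1.0 - tau for v in values]
        new_e = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_e - e) < tol:
            return new_e
        e = new_e
    return e


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(asym_sq_norm([3.0, -4.0], 0.5))
    print(sample_expectile(data, 0.5), sample_expectile(data, 0.9))
```

For the data `[1, 2, 3, 4]`, τ = 0.5 returns the sample mean 2.5, while τ = 0.9 moves the estimate toward the upper tail (approximately 3.5), which is the behavior the choice of τ is meant to control.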

Numerical approach to HPEC

In this section, to identify the Huber-type τ-expectile, we convert it to the optimization problem shown in Eq. (2.7). However, this objective function does not have a closed form, and its derivative is also complicated. Thus, gradient-based methods are unsuitable. Instead, we consider derivative-free approaches, such as stochastic optimization approaches and metaheuristic algorithms. A commonly used stochastic optimization approach is simulated annealing (SA). However, this approach can take a
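To make the derivative-free idea concrete, here is a minimal, generic particle swarm optimization sketch. It is not the algorithm configuration used in the paper: the function name `pso_minimize`, the inertia and acceleration coefficients, the box bounds, and the swarm size are all assumptions, and the constraint handling needed for component directions (for example, unit-norm loadings) is omitted.

```python
import random


def pso_minimize(objective, dim, bounds=(-5.0, 5.0), n_particles=30,
                 n_iter=200, inertia=0.7, c_personal=1.5, c_global=1.5,
                 seed=0):
    """Minimize objective over a box using a basic global-best PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # personal best positions
    best_val = [objective(p) for p in pos]    # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # global best so far

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity mixes momentum, pull to personal best, pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Move and clip to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


if __name__ == "__main__":
    # Toy objective with a known minimum at (1, -2); no gradients are used.
    def shifted_sphere(x):
        return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

    print(pso_minimize(shifted_sphere, dim=2))
```

Only objective evaluations are required, which is why this family of methods suits objectives without a closed form or with awkward derivatives.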

Simulation study

In this section, we present simulations with a mixture of normally distributed samples with different dimensions. First, in Section 4.1, we describe our simulation experiments on two-dimensional cases for the sake of visualization. Then, in Section 4.2, we present our experiments on a 10-dimensional mixture of normal distributions to confirm the validity of our proposed method at a higher dimensionality. In addition, to illustrate the performance of the proposed method, we compare its
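As a sketch of how data of this kind could be generated, the snippet below draws samples from a mixture of normal distributions. The mixing weights, means, and standard deviations are hypothetical placeholders rather than the settings used in the paper, and each component is given independent coordinates (a diagonal covariance) purely for simplicity.

```python
import random


def sample_mixture(n, weights, means, sds, seed=0):
    """Draw n points from a mixture of normals with independent coordinates.

    weights: mixing proportions, one per component.
    means, sds: per-component lists of coordinate means and standard deviations.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Pick a component according to the mixing weights.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append([rng.gauss(m, s) for m, s in zip(means[k], sds[k])])
    return samples


if __name__ == "__main__":
    # Hypothetical two-dimensional example: a central bulk plus a small
    # component shifted toward the upper-right tail.
    data = sample_mixture(
        n=500,
        weights=[0.9, 0.1],
        means=[[0.0, 0.0], [4.0, 4.0]],
        sds=[[1.0, 1.0], [0.5, 2.0]],
    )
    print(len(data), data[0])
```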

Real applications

To illustrate the proposed approach using real data, we considered two data sets: the 2016 Olympic decathlon data and the Taiwan air quality indices (AQI) of photochemical compounds. First, we analyzed the 2016 Olympic decathlon data using HPECs to demonstrate the methodology. Then, for the air quality indices, we used the component loadings of PCA and HPECs to determine the sources of air pollution that contribute most to the mean and right tail of the distribution. For this experiment, the

Concluding remarks

In this study, we proposed the use of the asymmetric Huber norm to obtain the Huber-type principal expectile component. The proposed method captures the tail characteristics while maintaining robustness for i.i.d. observations. The results of our simulation studies reveal that HPEC analysis outperforms PCA, PEC analysis, and PPCA in capturing tail variation, particularly in cases of mixtures of normal distributions. As demonstrated in real applications, the choices of the tuning parameters τ

Acknowledgments

This research was supported in part by the Mathematics Division of the National Center for Theoretical Sciences, Taiwan, and the Ministry of Science and Technology in Taiwan, under grants MOST 105-2628-M-006-001-MY3, MOST 106-2118-M-110-003-MY2 and MOST 105-2118-M-110-002-MY2.

References (18)

  • Tran, N.M. et al. Principal component analysis in an asymmetric norm. J. Multivariate Anal. (2019)
  • Xu, G. et al. On convergence analysis of particle swarm optimization algorithm. J. Comput. Appl. Math. (2018)
  • Eberhart, R.C. et al. A new optimizer using particle swarm theory.
  • Fox, J. et al. Robust Regression in R: Appendix to an R Companion to Applied Regression (2018)
  • Hjort, N.L. et al. Asymptotics for minimisers of convex processes (2011)
  • Holland, P.W. et al. Robust regression using iteratively reweighted least-squares. Comm. Statist. Theory Methods (1977)
  • Huber, P.J. Robust regression: asymptotics, conjectures and Monte Carlo. Ann. Statist. (1973)
  • Huber, P.J. Robust Statistics (1981)
  • Newey, W.K. et al. Asymmetric least squares estimation and testing. Econometrica (1987)
There are more references available in the full text version of this article.