
PCA-based drift and shift quantification framework for multidimensional data


Abstract

Concept drift is a serious problem confronting machine learning systems in a dynamic and ever-changing world. To manage concept drift, it may be useful to first quantify it by measuring the distance between the distributions that generate data before and after a drift. There is a paucity of methods for doing so in the case of multidimensional numeric data. This paper provides an in-depth analysis of the PCA-based change detection approach, identifies shortcomings of existing methods, and shows how this approach can be used to measure drift, not merely detect it.






Author information

Correspondence to Igor Goldenberg.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Experiments

Some of the experiments involved generating data with a known theoretical, or "true", Hellinger distance (HD). We describe here the process used to generate these data. Datasets were generated from the multivariate normal distribution, using an "inverted" PCA approach: first generate independent univariate normal variables, then apply a rotation to introduce dependency between them.

The HD could be attributed to a difference in mean, variance, or correlation. Data were generated for each value of HD between 0 and 1 in steps of 0.01, and for various sample sizes between 100 and 10,000.
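As a concrete illustration, here is a minimal sketch of this inverted-PCA sampling scheme in Python with NumPy (the function names and the QR-based random rotation are our choices, not the paper's):

```python
import numpy as np

def random_rotation(n, rng):
    """Random orthogonal matrix: QR decomposition of a Gaussian matrix,
    with column signs fixed so the result is uniformly (Haar) distributed."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def sample_inverted_pca(mean, eigvals, P, size, rng):
    """Draw `size` points with mean `mean` and covariance P diag(eigvals) P':
    independent univariate normals are generated first, then rotated by P."""
    z = rng.standard_normal((size, len(eigvals))) * np.sqrt(eigvals)
    return z @ P.T + mean
```

The sketches in the subsections below reuse these two helpers.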

1.1 Difference is due to a difference in mean

The samples are drawn from distributions that have the same rotation and equal covariance matrices \((V_1=V_2=V)\), but different means \((\mu _1\ne \mu _2)\). To generate distributions that differ in mean while retaining identical covariance, we use the following equality.

$$\begin{aligned} H^2 &= 1-\frac{\root 4 \of {\det V_1}\,\root 4 \of {\det V_2}}{\sqrt{\det \frac{V_1+V_2}{2}}}\exp \Bigg [-\frac{1}{8}(\mu _1-\mu _2)'\Big (\frac{V_1+V_2}{2}\Big )^{-1}(\mu _1-\mu _2)\Bigg ]\\ &= 1-\exp \Bigg [-\frac{1}{8}(\mu _1-\mu _2)'V^{-1}(\mu _1-\mu _2)\Bigg ] \end{aligned}$$
(13)

Let \(\varDelta =(\mu _1-\mu _2)'V^{-1}(\mu _1-\mu _2)\). If \(V\) is diagonal with entries \(\sigma ^2_i\), then \(\varDelta =\sum _{i=1}^{n}\frac{(\mu _{1i}-\mu _{2i})^2}{\sigma ^2_i}\). We split \(\varDelta \) into \(n\) randomly selected non-negative addends that sum to \(\varDelta \) and assign them to the corresponding PCA components. The procedure to generate the samples is described by Algorithm 7.

(Algorithm 7 is rendered as a figure in the original article.)
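The sketch below, reusing the helpers from the introduction above, illustrates the step just described: solve Eq. (13) for \(\varDelta \), split it across components, and convert each addend into a per-component mean shift in the diagonal PCA space. Splitting \(\varDelta \) via a Dirichlet draw is our assumption; the paper says only that the addends are randomly selected.

```python
import numpy as np

def samples_with_mean_drift(hd, eigvals, size, rng):
    """Two samples whose generating distributions differ only in mean,
    with theoretical squared Hellinger distance hd**2 (Eq. 13)."""
    n = len(eigvals)
    delta = -8.0 * np.log(1.0 - hd**2)          # Eq. (13) solved for Delta
    parts = rng.dirichlet(np.ones(n)) * delta   # random addends summing to Delta
    shift = np.sqrt(parts * eigvals)            # per-component shift: sigma_i * sqrt(part_i)
    P = random_rotation(n, rng)
    s1 = sample_inverted_pca(np.zeros(n), eigvals, P, size, rng)
    s2 = sample_inverted_pca(P @ shift, eigvals, P, size, rng)  # rotate the shift into data space
    return s1, s2
```

Rotating the shift by \(P\) keeps \(\varDelta \) invariant, since \((P s)'(P\varLambda P')^{-1}(P s)=s'\varLambda ^{-1}s\).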

1.2 Difference is due to a difference in variance, with the same mean and rotation (eigenvectors)

We use the following equality to generate distributions that differ in variance without any change in mean or rotation; because \(\mu _1=\mu _2\), the exponential factor equals one.

$$\begin{aligned} V &= P'\varLambda P\\ H^2 &= 1-\frac{\root 4 \of {\det V_1}\,\root 4 \of {\det V_2}}{\sqrt{\det \frac{V_1+V_2}{2}}}\exp \Bigg [-\frac{1}{8}(\mu _1-\mu _2)'\Big (\frac{V_1+V_2}{2}\Big )^{-1}(\mu _1-\mu _2)\Bigg ]\\ &= 1-\frac{\root 4 \of {\det (P'\varLambda _1 P)}\,\root 4 \of {\det (P'\varLambda _2 P)}}{\sqrt{\det \frac{P'\varLambda _1 P+P'\varLambda _2 P}{2}}}=1-\frac{\root 4 \of {\prod _i \varLambda _{1i}\varLambda _{2i}}}{\sqrt{\prod _i \frac{\varLambda _{1i}+\varLambda _{2i}}{2}}} \end{aligned}$$
(14)

If we set \(\varLambda _2=(1+\alpha )\varLambda _1\), then

$$\begin{aligned} H^2=1-\frac{(1+\alpha )^{\frac{n}{4}}}{\big (1+\frac{\alpha }{2}\big )^{\frac{n}{2}}} \end{aligned}$$

Raising \(1-H^2\) to the power \(\frac{4}{n}\) gives \(1+\alpha =b\big (1+\frac{\alpha }{2}\big )^2\), where \(b=(1-H^2)^{\frac{4}{n}}\). Solving this quadratic in \(\alpha \) and taking the positive root yields

$$\begin{aligned} \alpha =\frac{2\big [(1-b)+\sqrt{1-b}\big ]}{b} \end{aligned}$$

We then use Algorithm 8 to generate the two samples.

(Algorithm 8 is rendered as a figure in the original article.)
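A sketch of this step under the same assumptions, again reusing the helpers from the introduction:

```python
import numpy as np

def samples_with_variance_drift(hd, eigvals, size, rng):
    """Two samples differing only in variance: the second has all eigenvalues
    scaled by (1 + alpha), with alpha chosen so the Hellinger distance is hd."""
    n = len(eigvals)
    b = (1.0 - hd**2) ** (4.0 / n)
    alpha = 2.0 * ((1.0 - b) + np.sqrt(1.0 - b)) / b  # positive root derived above
    P = random_rotation(n, rng)
    s1 = sample_inverted_pca(np.zeros(n), eigvals, P, size, rng)
    s2 = sample_inverted_pca(np.zeros(n), (1.0 + alpha) * eigvals, P, size, rng)
    return s1, s2
```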

1.3 Difference is due to different correlations, with the same mean and variance

We use the following equality to generate distributions that differ in correlation matrices without any change in mean or variance. The covariance matrix factors as \(V=DRD\), where \(D\) is the diagonal matrix of standard deviations and \(R\) the corresponding correlation matrix; the \(\det D\) factors cancel in the ratio below.

$$\begin{aligned} H^2 &= 1-\frac{\root 4 \of {\det V_1}\,\root 4 \of {\det V_2}}{\sqrt{\det \frac{V_1+V_2}{2}}}\exp \Bigg [-\frac{1}{8}(\mu _1-\mu _2)'\Big (\frac{V_1+V_2}{2}\Big )^{-1}(\mu _1-\mu _2)\Bigg ]\\ &= 1-\frac{\root 4 \of {\det (DR_1D)}\,\root 4 \of {\det (DR_2D)}}{\sqrt{\det \big (D\frac{R_1+R_2}{2}D\big )}}=1-\frac{\root 4 \of {\det R_1}\,\root 4 \of {\det R_2}}{\sqrt{\det \frac{R_1+R_2}{2}}} \end{aligned}$$

Numerical approximation was used to generate a correlation matrix that yields the desired HD. In the first experiment, \(R_1\) was set to the identity matrix; in the second, \(R_1\) had all off-diagonal elements equal to \(-0.1\). The diagonal elements of the second matrix \(R_2\) were set to one and the off-diagonal elements to \(\alpha \), where \(\alpha \) was numerically approximated for each value of HD.
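As a sketch of that numerical approximation for the first experiment (where \(R_1=I\)), one can bracket \(\alpha \) and apply a standard root finder; using SciPy's `brentq` is our choice, as the paper does not name its method:

```python
import numpy as np
from scipy.optimize import brentq

def hellinger_sq(R1, R2):
    """Squared Hellinger distance between zero-mean Gaussians with
    correlation matrices R1 and R2 (the det(D) factors cancel)."""
    num = (np.linalg.det(R1) * np.linalg.det(R2)) ** 0.25
    return 1.0 - num / np.sqrt(np.linalg.det((R1 + R2) / 2.0))

def solve_offdiag(hd, n):
    """Off-diagonal element alpha of an equicorrelation matrix R2 such that
    the HD between N(0, I) and N(0, R2) equals hd, for 0 < hd < 1."""
    R2 = lambda a: np.full((n, n), a) + (1.0 - a) * np.eye(n)
    f = lambda a: hellinger_sq(np.eye(n), R2(a)) - hd**2
    return brentq(f, 1e-9, 1.0 - 1e-9)  # R2 stays positive definite on (0, 1)
```

The second experiment would instead pass the \(-0.1\) off-diagonal matrix as \(R_1\), with the bracket adjusted so that \(f\) changes sign over it.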


Cite this article

Goldenberg, I., Webb, G.I. PCA-based drift and shift quantification framework for multidimensional data. Knowl Inf Syst 62, 2835–2854 (2020). https://doi.org/10.1007/s10115-020-01438-3

