Using tours to visually investigate properties of new projection pursuit indexes with application to problems in physics

Laa, Ursula; Cook, Dianne

doi:10.1007/s00180-020-00954-8

Using tours to visually investigate properties of new projection pursuit indexes with application to problems in physics

Original paper
Published: 24 January 2020

Volume 35, pages 1171–1205, (2020)
Cite this article

Computational Statistics Aims and scope Submit manuscript

309 Accesses
5 Citations
Explore all metrics

Abstract

Projection pursuit is used to find interesting low-dimensional projections of high-dimensional data by optimizing an index over all possible projections. Most indexes have been developed to detect departure from known distributions, such as normality, or to find separations between known groups. Here, we are interested in finding projections revealing potentially complex bivariate patterns, using new indexes constructed from scagnostics and a maximum information coefficient, with a purpose to detect unusual relationships between model parameters describing physics phenomena. The performance of these indexes is examined with respect to ideal behaviour, using simulated data, and then applied to problems from gravitational wave astronomy. The implementation builds upon the projection pursuit tools available in the R package, tourr, with indexes constructed from code in the R packages, binostics, minerva and mbgraphic.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Reproduction of secondary data in projection pursuit transformation

Article 01 December 2016

John G. Manchuk, Ryan M. Barnett & Clayton V. Deutsch

An Algorithm for Finding Projections with Extreme Kurtosis

Refining Invariant Coordinate Selection via Local Projection Pursuit

References

Abbott BP et al (2017) GW170817: observation of gravitational waves from a binary neutron star inspiral. Phys Rev Lett 119(16):161101
Google Scholar
Abbott BP (2018) GW170817: measurements of neutron star radii and equation of state. Phys Rev Lett 121(16):161101
Google Scholar
Ahn JS, Hofmann H, Cook D (2003) A projection pursuit method on the multidimensional squared contingency table. Comput Stat 18(3):605–26
MathSciNet MATH Google Scholar
Albanese D, Filosi M, Visintainer R, Riccadonna S, Jurman G, Furlanello C (2012) minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics 29(3):407–8
Google Scholar
Asimov D (1985) The grand tour: a tool for viewing multidimensional data. SIAM J Sci Stat Comput 6(1):128–43
MathSciNet MATH Google Scholar
Buja A, Cook D, Asimov D, Hurley C (2005) Computational methods for high-dimensional rotations in data visualization. In: Rao CR, Wegman EJ, Solka JL (eds) Handbook of statistics: data mining and visualization. Elsevier, New York, pp 391–413
Google Scholar
Cook D, Swayne DF (2007) Interactive and dynamic graphics for data analysis with R and G gobi, 1st edn. Springer, Berlin
MATH Google Scholar
Cook D, Buja A, Cabrera J (1992) An analysis of polynomial-based projection pursuit. Comput Sci Stat 24:478–82
Google Scholar
Cook D, Buja A, Cabrera J (1993) Projection pursuit indexes based on orthonormal function expansions. J Comput Gr Stat 2(3):225–50
MathSciNet Google Scholar
Cook D, Buja A, Cabrera J, Hurley C (1995) Grand tour and projection pursuit. J Comput Gr Stat 4(3):155–72
Google Scholar
Cook D, Laa U, Valencia G (2018) Dynamical projections for the visualization of PDFSense data. Eur Phys J C 78(9):742
Google Scholar
Diaconis P, Freedman D (1984) Asymptotics of graphical projection pursuit. Ann Statist 12(3):793–815
MathSciNet MATH Google Scholar
Eddy WF (1977) A new convex hull algorithm for planar sets. ACM Trans Math Softw 3(4):398–403
MATH Google Scholar
Edelsbrunner H, Kirkpatrick D, Seidel R (1983) On the shape of a set of points in the plane. IEEE Trans Inf Theory 29(4):551–59
MathSciNet MATH Google Scholar
Ferraty F, Goia A, Salinelli E, Vieu P (2013) Functional projection pursuit regression. Test 22(2):293–320
MathSciNet MATH Google Scholar
Foreman-Mackey D (2016) Corner.py: scatterplot matrices in python. J Open Sour Softw 24
Friedman JH (1987) Exploratory projection pursuit. J Am Stat Assoc 82(1):249–66
MathSciNet MATH Google Scholar
Friedman JH, Tukey JW (1974) A projection pursuit algorithm for exploratory data analysis. IEEE Trans Comput 23:881–89
MATH Google Scholar
Grimm K (2016) Kennzahlenbasierte Grafikauswahl. Doctoral thesis, Universitat Augsburg
Grimm K (2017) Mbgraphic: measure based graphic selection. https://CRAN.R-project.org/package=mbgraphic
Hall P (1989) On polynomial-based projection indices for exploratory projection pursuit. Ann Stat 17(2):589–605
MathSciNet MATH Google Scholar
Hofmann H, Wilkinson L, Wickham H, Lang DT, Anand A (2019) Binostics: compute scagnostics. https://cran.r-project.org/package=binostics
Hou S, Wentzell PD (2014) Re-centered kurtosis as a projection pursuit index for multivariate data analysis. J Chemom 28(5):370–84
Google Scholar
Huang B, Cook D, Wickham H (2012) TourrGui: a gWidgets gui for the tour to explore high-dimensional data using low-dimensional projections. J Stat Softw 49(6):1–12
Google Scholar
Huber PJ (1985) Projection pursuit. Ann Stat 13(2):435–75
MathSciNet MATH Google Scholar
Jones MC, Sibson R (1987) What is projection pursuit? J R Stat Soc Ser A 150:1–36
MathSciNet MATH Google Scholar
Kruskal JB (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proc Am Math Soc 7(1):48–50
MathSciNet MATH Google Scholar
Kruskal JB (1969) Toward a practical method which helps uncover the structure of a set of observations by finding the line transformation which optimizes a new ‘index of condensation’. In: Milton RC, Nelder JA (eds) Statistical computation. Academic Press, New York, pp 427–40
Google Scholar
Laa U, Cook D (2019a) https://github.com/uschiLaa/paper-ppi
Laa U, Cook D (2019b) https://github.com/uschiLaa/spinebil
Laa U, Cook D (2019c) https://github.com/uschiLaa/galahr
Lee E-K, Cook D, Klinke S, Lumley T (2005) Projection pursuit for exploratory supervised classification. J Comput Gr Stat 14(4):831–46
MathSciNet Google Scholar
LIGO (2018) https://dcc.ligo.org/public/0152/P1800115/005
Loperfido N (2018) Skewness-based projection pursuit: a computational approach. Comput Stat Data Anal 120:42–57
MathSciNet MATH Google Scholar
Marsaglia G (1968) Random numbers fall mainly in the planes. Proc Natl Acad Sci 61:25
MathSciNet MATH Google Scholar
Naito K (1997) A generalized projection pursuit procedure and its significance level. Hiroshima Math J 27(3):513–54
MathSciNet MATH Google Scholar
Pan J-X, Fung W-K, Fang K-T (2000) Multiple outlier detection in multivariate data using projection pursuit techniques. J Stat Plann Inference 83(1):153–67
MathSciNet MATH Google Scholar
Pilhöfer A, Unwin A (2013) New approaches in visualization of categorical data: R package extracat. J Stat Softw 53(7):1–25
Google Scholar
Posse C (1995a) Projection pursuit exploratory data analysis. Comput Stat Data Anal 20(6):669–87
MathSciNet MATH Google Scholar
Posse C (1995b) Tools for two-dimensional exploratory projection pursuit. J Comput Gr Stat 4(2):83–100
Google Scholar
R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–24
MATH Google Scholar
Reshef YA, Reshef DN, Finucane HK, Sabeti PC, Mitzenmacher M (2016) Measuring dependence powerfully and equitably. J Mach Learn Res 17(212):1–63
MathSciNet MATH Google Scholar
Rodriguez-Martinez E, Goulermas JY, Mu T, Ralph JF (2010) Automatic induction of projection pursuit indices. IEEE Trans Neural Netw 21(8):1281–95
Google Scholar
Simon N, Tibshirani R (2014) Comment on ‘detecting novel associations in large data sets’ by Reshef Et Al, Science Dec 16, 2011. arXiv E-Prints arXiv:1401.7645. http://arxiv.org/abs/1401.7645
Smith R, Field SE, Blackburn K, Haster C-J, Pürrer M, Raymond V, Schmidt P (2016) Fast and accurate inference on gravitational waves from precessing compact binaries. Phys Rev D 94(4):044031
Google Scholar
Székely GJ, Rizzo ML, Bakirov NK (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35(6):2769–94
MathSciNet MATH Google Scholar
Tukey PA, Tukey JW (1981) Graphical display of data in three and higher dimensions. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, New York
Wahba G (1990) Spline models for observational data. In: CBMS-Nsf regional conference series in applied mathematics. Society for Industrial Applied Mathematics, Philadelphia
Wickham H, Cook D, Hofmann H, Buja A (2011) Tourr: an R package for exploring multivariate data with projections. J Stat Softw 40(2):1–18
Google Scholar
Wilkinson L, Wills G (2008) Scagnostics distributions. J Comput Gr Stat 17(2):473–91
MathSciNet Google Scholar
Wilkinson L, Anand A, Grossman R (2005) Graph-theoretic scagnostics. In: IEEE symposium on information visualization, 2005. INFOVIS 2005, pp 157–64
Wood SN, Pya N, Säfken B (2016) Smoothing parameter and model selection for general smooth models (with discussion). J Am Stat Assoc 111:1548–75
Google Scholar
Xie Y (2015) Dynamic documents with R and knitr. 2nd ed. Chapman, Boca Raton, FL. https://yihui.name/knitr/
Xie Y (2016) Bookdown: authoring books and technical documents with R markdown. Chapman, Boca Raton, FL. https://github.com/rstudio/bookdown
Xie Y, Allaire JJ, Grolemund G (2018) R markdown: the definitive guide. Chapman, Boca Raton, FL. https://bookdown.org/yihui/rmarkdown

Download references

Acknowledgements

The authors gratefully acknowledge the support of the Australian Research Council. We thank Rory Smith for help with the gravitational wave examples, and German Valencia for useful comments. This article was created with knitr (Xie 2015), R Markdown (Xie et al. 2018) and bookdown (Xie 2016) with embedded code.

Author information

Authors and Affiliations

School of Physics and Astronomy, Monash University, Clayton, Australia
Ursula Laa
Department of Econometrics and Business Statistics, Monash University, Clayton, Australia
Ursula Laa & Dianne Cook

Authors

Ursula Laa
View author publications
You can also search for this author in PubMed Google Scholar
Dianne Cook
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ursula Laa.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

Re-scaling of holes index

Holes and cmass indexes are derived from \(I_0^N\) of Cook et al. (1993). As noted in proposition 1 of that paper the index takes local maxima for the minimum and maximum of \(a_0\) which are achieved by central hole and central mass distributions respectively. Cook and Swayne (2007) then gives explicit index functions defined for sphered data (zero mean, identity variance-covariance matrix). They are defined such that each one is maximized for central holes or central mass type distributions, with maximum=1, and cmass=1-holes. It follows that for either index both large and small values signal deviation from the normal distribution, and given a normal distribution we expect to find “average” index values rather than values close to zero.

We can estimate the values found for normal distribution by comparing values of \(a_{00}\) of Sec 7.1 in Cook et al. (1993). The maximum value is \(1/(2\pi )\) found for cmass type distributions, the minimum is \(1/(2\pi e)\) found for hole type distributions. Evaluating for normal distributions gives \(1/(4\pi )\), rescaling such that the index values range from 0 to 1 then puts the value for normal distributions at approximately \(0.2\). This is consistent with the results we found, i.e. for normal distributions the cmass index is about \(0.2\), and the holes index (=1-cmass) is about \(0.8\).

We therefore rescale as follows: first have cut-off at the respective value for the normal distribution, i.e. any value below 0.2 (0.8) is set to zero for cmass (holes) index, and we rescale the remaining range to be between zero and one.

Estimating the squint angle of the spiral

We estimate the squint angle of the spiral (for p = 4, 5, 6) as follows. First pick a random starting plane and generate a tour path from the starting plane to the ideal plane containing the spiral. Using skinny as the reference index we fix a lower index value that is attributed to indicate squintability at \(0.6\), and we move along the tour path towards the ideal plane until this value is reached. The distance between the thus identified plane and the ideal plane is used as an estimate of the squint angle in this direction. Since this will strongly depend on the random starting plane, i.e. the considered direction, we repeat the estimation 100 times and present the results in the form of a box plot in Fig. 14. The result shows a large drop in squint angle when going from p = 4 to higher dimensions, and generally a large spread of squint angles depending on the direction.

Computational performance

Computational time is important for using the PPIs with the guided tour, online. Figure 15 summarizes performance for each PPI. For simplicity, data with sample sizes ranging from 100 to 10,000 are drawn from a 6-d solid sphere, using the geozoo package (Schloerke 2016). The time to compute the PPIs over 100 interpolated grand tour projections is recorded. The scagnostics PPI are computed as a bundle, since this is the code base, and that major computational constraint is common to all the scagnostics. There are two versions of the MIC and TIC algorithm, labelled MINE and MINE E, the second being a newer algorithm which improves their computational performance.

The results are interesting. The scagnostic indexes and splines2d are very fast regardless of sample size. MIC, TIC (both versions) and dcor2d slow rapidly as sample size increases.

Effect of parameter choice in index value

Some parameters must be provided for some PPIs. This can be advantageous, allowing the index to more flexibly work for different types of structure, controlling trade-offs between noise and fine structure detection, and affecting computing time and precision.

Binning:
- Scagnostics: the number of bins can be controlled by the user, note however that internally the implementation will reduce the number of bins if too many non-empty bins are found (more than 250).
- MINE: the maximum number of bins considered is fixed by the user as a function of the number of data points. The default is chosen as a trade-off between resolution and noise dependence, but it may be tuned based on requirements dictated by specific datasets. Apart from sensitivity to noise computing time may also be a consideration here.
Spline knots: for the splines2d measure we need to fix the number of knots. By default it is fixed to be 10 (or lower if appropriate based on the data values). In our examples we find the number to be appropriate to identify functional dependence while rejecting noise, but some distributions may require tuning of this parameter.

The bins argument for the scagnostics might be reasonably expected to affect the smoothness of the index: a small number of bins should provide a smooth index function, but may affect its ability to detect fine structure. Figure 16 examines this. Scagnostics PPIs are computed for the spiral1000 data on a tour path between \(x_1\)–\(x_2\) to \(x_5\)–\(x_6\), for number of bins equal to 10, 20, 50. The interesting observation to make is that even with small bin size the functions are all relatively noisy. The problem with the small bin size is that the spiral becomes invisible to the PPI.

1.1 Binning sensitivity of MIC index

To examine the sensitivity of binning in the MIC PPI, the classic RANDU data (Marsaglia 1968), available in R, is used. Binning is defined by \(\delta \), where \(B(n) = n^{\delta }\). Figure 17 shows the best projection, index value and computing time obtained when optimizing the MIC index with values \(\delta = 0.6, 0.7, 0.8\). With small \(\delta \), less bins, the structure isn’t visible, and with larger \(\delta \) the structure is confounded with noise. It does appear that this parameter affects the performance of the MIC index.

Ways to refine the PPIs

The biggest issues revealed by the investigation into the new PPIs are a lack of smoothness, particularly for the scagnostics indexes, and the rotation invariance of Grimm’s indexes. To fix the smoothness of an index function, it is possible to calculate the PPI for a small neighborhood of projections and average the value, or alternatively average the PPI for several jittered projections. This is investigated in Fig. 18. Rotation invariance is more difficult to fix, but an alternative tour interpolation method could be useful. The geodesic interpolation transitions between planes, and it ignores the basis defining the plane, creating a problem with rotation invariant indices. Alternative interpolations based on Givens or Householder rotations could be implemented to transition between bases, which should alleviate the need for rotationally invariant indices.

Two different methods are considered for smoothing the index values:

Jittering points: using the jitter function we move each point by a random amount drawn from a uniform distribution between \(\pm \,\beta \).
Jittering angles: using the tourr implementation we can draw a random plane and move some small amount \(\epsilon \) in that direction.

The mean value from a sample of projections is recorded as the PPI value. This could be robsitufied by dropping the most extreme values.

This is particularly interesting for the scagnostics indexes skinny and stringy which we found to be most noisy among the indexes considered. Figure 18 studies the potential of these two smoothing approaches, using the tour path between noise variables of the spiral1000 dataset and different \(\epsilon \) and \(\beta \) values. Both methods appear to be promising in smoothing the function. Because the scagnostics are fast to compute, either of these methods is feasible. For this example we have smoothed over 10 randomly selected jittered views, computing time increases linearly with the number of randomly jittered views, as this is mostly determined by the time needed to evaluate the scagnostics indexes which is done separately for each view.

Additional figures

1.1 Final projection of pipe guided tour

Figure 19 shows the projection returned during the scouting phase (left) and the refinement phase (right). It was important to start the method 3 optimizer at the best projection returned during the scouting phase, to smoothly converge more closely to the ideal projection.

1.2 Dataset overview scatter plot matrices

Figures 20 and 21 show the scatter plot matrices of the gravitational wave datasets considered.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Laa, U., Cook, D. Using tours to visually investigate properties of new projection pursuit indexes with application to problems in physics. Comput Stat 35, 1171–1205 (2020). https://doi.org/10.1007/s00180-020-00954-8

Download citation

Received: 01 February 2019
Accepted: 09 January 2020
Published: 24 January 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s00180-020-00954-8

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Using tours to visually investigate properties of new projection pursuit indexes with application to problems in physics

Abstract

Access this article

Similar content being viewed by others

Reproduction of secondary data in projection pursuit transformation

An Algorithm for Finding Projections with Extreme Kurtosis

Refining Invariant Coordinate Selection via Local Projection Pursuit

References

Acknowledgements