Skip to main content
Log in

Using tours to visually investigate properties of new projection pursuit indexes with application to problems in physics

  • Original paper
  • Published:
Computational Statistics Aims and scope Submit manuscript

Abstract

Projection pursuit is used to find interesting low-dimensional projections of high-dimensional data by optimizing an index over all possible projections. Most indexes have been developed to detect departure from known distributions, such as normality, or to find separations between known groups. Here, we are interested in finding projections revealing potentially complex bivariate patterns, using new indexes constructed from scagnostics and a maximum information coefficient, with a purpose to detect unusual relationships between model parameters describing physics phenomena. The performance of these indexes is examined with respect to ideal behaviour, using simulated data, and then applied to problems from gravitational wave astronomy. The implementation builds upon the projection pursuit tools available in the R package, tourr, with indexes constructed from code in the R packages, binostics, minerva and mbgraphic.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13

Similar content being viewed by others

References

  • Abbott BP et al (2017) GW170817: observation of gravitational waves from a binary neutron star inspiral. Phys Rev Lett 119(16):161101

    Google Scholar 

  • Abbott BP (2018) GW170817: measurements of neutron star radii and equation of state. Phys Rev Lett 121(16):161101

    Google Scholar 

  • Ahn JS, Hofmann H, Cook D (2003) A projection pursuit method on the multidimensional squared contingency table. Comput Stat 18(3):605–26

    MathSciNet  MATH  Google Scholar 

  • Albanese D, Filosi M, Visintainer R, Riccadonna S, Jurman G, Furlanello C (2012) minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics 29(3):407–8

    Google Scholar 

  • Asimov D (1985) The grand tour: a tool for viewing multidimensional data. SIAM J Sci Stat Comput 6(1):128–43

    MathSciNet  MATH  Google Scholar 

  • Buja A, Cook D, Asimov D, Hurley C (2005) Computational methods for high-dimensional rotations in data visualization. In: Rao CR, Wegman EJ, Solka JL (eds) Handbook of statistics: data mining and visualization. Elsevier, New York, pp 391–413

    Google Scholar 

  • Cook D, Swayne DF (2007) Interactive and dynamic graphics for data analysis with R and G gobi, 1st edn. Springer, Berlin

    MATH  Google Scholar 

  • Cook D, Buja A, Cabrera J (1992) An analysis of polynomial-based projection pursuit. Comput Sci Stat 24:478–82

    Google Scholar 

  • Cook D, Buja A, Cabrera J (1993) Projection pursuit indexes based on orthonormal function expansions. J Comput Gr Stat 2(3):225–50

    MathSciNet  Google Scholar 

  • Cook D, Buja A, Cabrera J, Hurley C (1995) Grand tour and projection pursuit. J Comput Gr Stat 4(3):155–72

    Google Scholar 

  • Cook D, Laa U, Valencia G (2018) Dynamical projections for the visualization of PDFSense data. Eur Phys J C 78(9):742

    Google Scholar 

  • Diaconis P, Freedman D (1984) Asymptotics of graphical projection pursuit. Ann Statist 12(3):793–815

    MathSciNet  MATH  Google Scholar 

  • Eddy WF (1977) A new convex hull algorithm for planar sets. ACM Trans Math Softw 3(4):398–403

    MATH  Google Scholar 

  • Edelsbrunner H, Kirkpatrick D, Seidel R (1983) On the shape of a set of points in the plane. IEEE Trans Inf Theory 29(4):551–59

    MathSciNet  MATH  Google Scholar 

  • Ferraty F, Goia A, Salinelli E, Vieu P (2013) Functional projection pursuit regression. Test 22(2):293–320

    MathSciNet  MATH  Google Scholar 

  • Foreman-Mackey D (2016) Corner.py: scatterplot matrices in python. J Open Sour Softw 24

  • Friedman JH (1987) Exploratory projection pursuit. J Am Stat Assoc 82(1):249–66

    MathSciNet  MATH  Google Scholar 

  • Friedman JH, Tukey JW (1974) A projection pursuit algorithm for exploratory data analysis. IEEE Trans Comput 23:881–89

    MATH  Google Scholar 

  • Grimm K (2016) Kennzahlenbasierte Grafikauswahl. Doctoral thesis, Universitat Augsburg

  • Grimm K (2017) Mbgraphic: measure based graphic selection. https://CRAN.R-project.org/package=mbgraphic

  • Hall P (1989) On polynomial-based projection indices for exploratory projection pursuit. Ann Stat 17(2):589–605

    MathSciNet  MATH  Google Scholar 

  • Hofmann H, Wilkinson L, Wickham H, Lang DT, Anand A (2019) Binostics: compute scagnostics. https://cran.r-project.org/package=binostics

  • Hou S, Wentzell PD (2014) Re-centered kurtosis as a projection pursuit index for multivariate data analysis. J Chemom 28(5):370–84

    Google Scholar 

  • Huang B, Cook D, Wickham H (2012) TourrGui: a gWidgets gui for the tour to explore high-dimensional data using low-dimensional projections. J Stat Softw 49(6):1–12

    Google Scholar 

  • Huber PJ (1985) Projection pursuit. Ann Stat 13(2):435–75

    MathSciNet  MATH  Google Scholar 

  • Jones MC, Sibson R (1987) What is projection pursuit? J R Stat Soc Ser A 150:1–36

    MathSciNet  MATH  Google Scholar 

  • Kruskal JB (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proc Am Math Soc 7(1):48–50

    MathSciNet  MATH  Google Scholar 

  • Kruskal JB (1969) Toward a practical method which helps uncover the structure of a set of observations by finding the line transformation which optimizes a new ‘index of condensation’. In: Milton RC, Nelder JA (eds) Statistical computation. Academic Press, New York, pp 427–40

    Google Scholar 

  • Laa U, Cook D (2019a) https://github.com/uschiLaa/paper-ppi

  • Laa U, Cook D (2019b) https://github.com/uschiLaa/spinebil

  • Laa U, Cook D (2019c) https://github.com/uschiLaa/galahr

  • Lee E-K, Cook D, Klinke S, Lumley T (2005) Projection pursuit for exploratory supervised classification. J Comput Gr Stat 14(4):831–46

    MathSciNet  Google Scholar 

  • LIGO (2018) https://dcc.ligo.org/public/0152/P1800115/005

  • Loperfido N (2018) Skewness-based projection pursuit: a computational approach. Comput Stat Data Anal 120:42–57

    MathSciNet  MATH  Google Scholar 

  • Marsaglia G (1968) Random numbers fall mainly in the planes. Proc Natl Acad Sci 61:25

    MathSciNet  MATH  Google Scholar 

  • Naito K (1997) A generalized projection pursuit procedure and its significance level. Hiroshima Math J 27(3):513–54

    MathSciNet  MATH  Google Scholar 

  • Pan J-X, Fung W-K, Fang K-T (2000) Multiple outlier detection in multivariate data using projection pursuit techniques. J Stat Plann Inference 83(1):153–67

    MathSciNet  MATH  Google Scholar 

  • Pilhöfer A, Unwin A (2013) New approaches in visualization of categorical data: R package extracat. J Stat Softw 53(7):1–25

    Google Scholar 

  • Posse C (1995a) Projection pursuit exploratory data analysis. Comput Stat Data Anal 20(6):669–87

    MathSciNet  MATH  Google Scholar 

  • Posse C (1995b) Tools for two-dimensional exploratory projection pursuit. J Comput Gr Stat 4(2):83–100

    Google Scholar 

  • R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/

  • Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–24

    MATH  Google Scholar 

  • Reshef YA, Reshef DN, Finucane HK, Sabeti PC, Mitzenmacher M (2016) Measuring dependence powerfully and equitably. J Mach Learn Res 17(212):1–63

    MathSciNet  MATH  Google Scholar 

  • Rodriguez-Martinez E, Goulermas JY, Mu T, Ralph JF (2010) Automatic induction of projection pursuit indices. IEEE Trans Neural Netw 21(8):1281–95

    Google Scholar 

  • Simon N, Tibshirani R (2014) Comment on ‘detecting novel associations in large data sets’ by Reshef Et Al, Science Dec 16, 2011. arXiv E-Prints arXiv:1401.7645. http://arxiv.org/abs/1401.7645

  • Smith R, Field SE, Blackburn K, Haster C-J, Pürrer M, Raymond V, Schmidt P (2016) Fast and accurate inference on gravitational waves from precessing compact binaries. Phys Rev D 94(4):044031

    Google Scholar 

  • Székely GJ, Rizzo ML, Bakirov NK (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35(6):2769–94

    MathSciNet  MATH  Google Scholar 

  • Tukey PA, Tukey JW (1981) Graphical display of data in three and higher dimensions. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, New York

  • Wahba G (1990) Spline models for observational data. In: CBMS-Nsf regional conference series in applied mathematics. Society for Industrial Applied Mathematics, Philadelphia

  • Wickham H, Cook D, Hofmann H, Buja A (2011) Tourr: an R package for exploring multivariate data with projections. J Stat Softw 40(2):1–18

    Google Scholar 

  • Wilkinson L, Wills G (2008) Scagnostics distributions. J Comput Gr Stat 17(2):473–91

    MathSciNet  Google Scholar 

  • Wilkinson L, Anand A, Grossman R (2005) Graph-theoretic scagnostics. In: IEEE symposium on information visualization, 2005. INFOVIS 2005, pp 157–64

  • Wood SN, Pya N, Säfken B (2016) Smoothing parameter and model selection for general smooth models (with discussion). J Am Stat Assoc 111:1548–75

    Google Scholar 

  • Xie Y (2015) Dynamic documents with R and knitr. 2nd ed. Chapman, Boca Raton, FL. https://yihui.name/knitr/

  • Xie Y (2016) Bookdown: authoring books and technical documents with R markdown. Chapman, Boca Raton, FL. https://github.com/rstudio/bookdown

  • Xie Y, Allaire JJ, Grolemund G (2018) R markdown: the definitive guide. Chapman, Boca Raton, FL. https://bookdown.org/yihui/rmarkdown

Download references

Acknowledgements

The authors gratefully acknowledge the support of the Australian Research Council. We thank Rory Smith for help with the gravitational wave examples, and German Valencia for useful comments. This article was created with knitr (Xie 2015), R Markdown (Xie et al. 2018) and bookdown (Xie 2016) with embedded code.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ursula Laa.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

Re-scaling of holes index

Holes and cmass indexes are derived from \(I_0^N\) of Cook et al. (1993). As noted in proposition 1 of that paper the index takes local maxima for the minimum and maximum of \(a_0\) which are achieved by central hole and central mass distributions respectively. Cook and Swayne (2007) then gives explicit index functions defined for sphered data (zero mean, identity variance-covariance matrix). They are defined such that each one is maximized for central holes or central mass type distributions, with maximum=1, and cmass=1-holes. It follows that for either index both large and small values signal deviation from the normal distribution, and given a normal distribution we expect to find “average” index values rather than values close to zero.

We can estimate the values found for normal distribution by comparing values of \(a_{00}\) of Sec 7.1 in Cook et al. (1993). The maximum value is \(1/(2\pi )\) found for cmass type distributions, the minimum is \(1/(2\pi e)\) found for hole type distributions. Evaluating for normal distributions gives \(1/(4\pi )\), rescaling such that the index values range from 0 to 1 then puts the value for normal distributions at approximately \(0.2\). This is consistent with the results we found, i.e. for normal distributions the cmass index is about \(0.2\), and the holes index (=1-cmass) is about \(0.8\).

We therefore rescale as follows: first have cut-off at the respective value for the normal distribution, i.e. any value below 0.2 (0.8) is set to zero for cmass (holes) index, and we rescale the remaining range to be between zero and one.

Estimating the squint angle of the spiral

We estimate the squint angle of the spiral (for p = 4, 5, 6) as follows. First pick a random starting plane and generate a tour path from the starting plane to the ideal plane containing the spiral. Using skinny as the reference index we fix a lower index value that is attributed to indicate squintability at \(0.6\), and we move along the tour path towards the ideal plane until this value is reached. The distance between the thus identified plane and the ideal plane is used as an estimate of the squint angle in this direction. Since this will strongly depend on the random starting plane, i.e. the considered direction, we repeat the estimation 100 times and present the results in the form of a box plot in Fig. 14. The result shows a large drop in squint angle when going from p = 4 to higher dimensions, and generally a large spread of squint angles depending on the direction.

Fig. 14
figure 14

Estimated squint angles for the Spiral dataset with 1000 datapoints, with p = 4, 5, 6, containing estimates evaluated for 100 randomly selected directions each

Computational performance

Computational time is important for using the PPIs with the guided tour, online. Figure 15 summarizes performance for each PPI. For simplicity, data with sample sizes ranging from 100 to 10,000 are drawn from a 6-d solid sphere, using the geozoo package (Schloerke 2016). The time to compute the PPIs over 100 interpolated grand tour projections is recorded. The scagnostics PPI are computed as a bundle, since this is the code base, and that major computational constraint is common to all the scagnostics. There are two versions of the MIC and TIC algorithm, labelled MINE and MINE E, the second being a newer algorithm which improves their computational performance.

The results are interesting. The scagnostic indexes and splines2d are very fast regardless of sample size. MIC, TIC (both versions) and dcor2d slow rapidly as sample size increases.

Fig. 15
figure 15

Computational performance for PPIs, using sample sizes 100–10,000. Colour indicates the PPI. Because the scagnostics calculation is bundled together, the values are the same for all these indexes, and they are really fast to compute. MINE includes the MIC and TIC indexes, and MINE E are computationally more efficient algorithms for these. These, along with dcor2D, are slower with larger sample sizes

Effect of parameter choice in index value

Some parameters must be provided for some PPIs. This can be advantageous, allowing the index to more flexibly work for different types of structure, controlling trade-offs between noise and fine structure detection, and affecting computing time and precision.

  • Binning:

    • Scagnostics: the number of bins can be controlled by the user, note however that internally the implementation will reduce the number of bins if too many non-empty bins are found (more than 250).

    • MINE: the maximum number of bins considered is fixed by the user as a function of the number of data points. The default is chosen as a trade-off between resolution and noise dependence, but it may be tuned based on requirements dictated by specific datasets. Apart from sensitivity to noise computing time may also be a consideration here.

  • Spline knots: for the splines2d measure we need to fix the number of knots. By default it is fixed to be 10 (or lower if appropriate based on the data values). In our examples we find the number to be appropriate to identify functional dependence while rejecting noise, but some distributions may require tuning of this parameter.

The bins argument for the scagnostics might be reasonably expected to affect the smoothness of the index: a small number of bins should provide a smooth index function, but may affect its ability to detect fine structure. Figure 16 examines this. Scagnostics PPIs are computed for the spiral1000 data on a tour path between \(x_1\)\(x_2\) to \(x_5\)\(x_6\), for number of bins equal to 10, 20, 50. The interesting observation to make is that even with small bin size the functions are all relatively noisy. The problem with the small bin size is that the spiral becomes invisible to the PPI.

Fig. 16
figure 16

Comparing the traces of the three scagnostics indices when changing the binning via the bins parameter set to 10, 20 and 50 in this example

1.1 Binning sensitivity of MIC index

To examine the sensitivity of binning in the MIC PPI, the classic RANDU data (Marsaglia 1968), available in R, is used. Binning is defined by \(\delta \), where \(B(n) = n^{\delta }\). Figure 17 shows the best projection, index value and computing time obtained when optimizing the MIC index with values \(\delta = 0.6, 0.7, 0.8\). With small \(\delta \), less bins, the structure isn’t visible, and with larger \(\delta \) the structure is confounded with noise. It does appear that this parameter affects the performance of the MIC index.

Fig. 17
figure 17

Best projection obtained by optimizing the MIC index on the RANDU data, using different number of bins, defined by \(\delta \). The smaller the value the fewer bins. Above each plot is written the value of \(\delta \), time required to optimize (s) and the MIC index value. The best \(\delta = 0.7\), and the result indicates that this parameter does affect MIC performance

Ways to refine the PPIs

The biggest issues revealed by the investigation into the new PPIs are a lack of smoothness, particularly for the scagnostics indexes, and the rotation invariance of Grimm’s indexes. To fix the smoothness of an index function, it is possible to calculate the PPI for a small neighborhood of projections and average the value, or alternatively average the PPI for several jittered projections. This is investigated in Fig. 18. Rotation invariance is more difficult to fix, but an alternative tour interpolation method could be useful. The geodesic interpolation transitions between planes, and it ignores the basis defining the plane, creating a problem with rotation invariant indices. Alternative interpolations based on Givens or Householder rotations could be implemented to transition between bases, which should alleviate the need for rotationally invariant indices.

Fig. 18
figure 18

Comparing the traces of the scagnostics indexes skinny and stringy when smoothing the index values, either by averaging over the index value after jittering the projection by some angle \(\epsilon \) (full line) or after jittering the projected datapoints with some amount \(\beta \) (dashed line). For comparison the red line in the background shows the trace without any smoothing applied

Two different methods are considered for smoothing the index values:

  • Jittering points: using the jitter function we move each point by a random amount drawn from a uniform distribution between \(\pm \,\beta \).

  • Jittering angles: using the tourr implementation we can draw a random plane and move some small amount \(\epsilon \) in that direction.

The mean value from a sample of projections is recorded as the PPI value. This could be robsitufied by dropping the most extreme values.

This is particularly interesting for the scagnostics indexes skinny and stringy which we found to be most noisy among the indexes considered. Figure 18 studies the potential of these two smoothing approaches, using the tour path between noise variables of the spiral1000 dataset and different \(\epsilon \) and \(\beta \) values. Both methods appear to be promising in smoothing the function. Because the scagnostics are fast to compute, either of these methods is feasible. For this example we have smoothed over 10 randomly selected jittered views, computing time increases linearly with the number of randomly jittered views, as this is mostly determined by the time needed to evaluate the scagnostics indexes which is done separately for each view.

Additional figures

1.1 Final projection of pipe guided tour

Figure 19 shows the projection returned during the scouting phase (left) and the refinement phase (right). It was important to start the method 3 optimizer at the best projection returned during the scouting phase, to smoothly converge more closely to the ideal projection.

Fig. 19
figure 19

Projections returned by TIC optimization: by the scouting phase (left) and refined by optimization method 3 (right), starting from the scouting phase projection

1.2 Dataset overview scatter plot matrices

Figures 20 and 21 show the scatter plot matrices of the gravitational wave datasets considered.

Fig. 20
figure 20

Scatter plot matrix of the neutron star dataset, darker regions represent higher marginalised posterior densities

Fig. 21
figure 21

Scatter plot matrix showing most of the variables included in the BBH dataset. Strong correlation between the parameters time, dec and ra can be observed

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Laa, U., Cook, D. Using tours to visually investigate properties of new projection pursuit indexes with application to problems in physics. Comput Stat 35, 1171–1205 (2020). https://doi.org/10.1007/s00180-020-00954-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00180-020-00954-8

Keywords

Navigation