Abstract
Projection pursuit is used to find interesting low-dimensional projections of high-dimensional data by optimizing an index over all possible projections. Most indexes have been developed to detect departure from known distributions, such as normality, or to find separations between known groups. Here, we are interested in finding projections revealing potentially complex bivariate patterns, using new indexes constructed from scagnostics and a maximum information coefficient, with a purpose to detect unusual relationships between model parameters describing physics phenomena. The performance of these indexes is examined with respect to ideal behaviour, using simulated data, and then applied to problems from gravitational wave astronomy. The implementation builds upon the projection pursuit tools available in the R package, tourr, with indexes constructed from code in the R packages, binostics, minerva and mbgraphic.
Similar content being viewed by others
References
Abbott BP et al (2017) GW170817: observation of gravitational waves from a binary neutron star inspiral. Phys Rev Lett 119(16):161101
Abbott BP (2018) GW170817: measurements of neutron star radii and equation of state. Phys Rev Lett 121(16):161101
Ahn JS, Hofmann H, Cook D (2003) A projection pursuit method on the multidimensional squared contingency table. Comput Stat 18(3):605–26
Albanese D, Filosi M, Visintainer R, Riccadonna S, Jurman G, Furlanello C (2012) minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics 29(3):407–8
Asimov D (1985) The grand tour: a tool for viewing multidimensional data. SIAM J Sci Stat Comput 6(1):128–43
Buja A, Cook D, Asimov D, Hurley C (2005) Computational methods for high-dimensional rotations in data visualization. In: Rao CR, Wegman EJ, Solka JL (eds) Handbook of statistics: data mining and visualization. Elsevier, New York, pp 391–413
Cook D, Swayne DF (2007) Interactive and dynamic graphics for data analysis with R and G gobi, 1st edn. Springer, Berlin
Cook D, Buja A, Cabrera J (1992) An analysis of polynomial-based projection pursuit. Comput Sci Stat 24:478–82
Cook D, Buja A, Cabrera J (1993) Projection pursuit indexes based on orthonormal function expansions. J Comput Gr Stat 2(3):225–50
Cook D, Buja A, Cabrera J, Hurley C (1995) Grand tour and projection pursuit. J Comput Gr Stat 4(3):155–72
Cook D, Laa U, Valencia G (2018) Dynamical projections for the visualization of PDFSense data. Eur Phys J C 78(9):742
Diaconis P, Freedman D (1984) Asymptotics of graphical projection pursuit. Ann Statist 12(3):793–815
Eddy WF (1977) A new convex hull algorithm for planar sets. ACM Trans Math Softw 3(4):398–403
Edelsbrunner H, Kirkpatrick D, Seidel R (1983) On the shape of a set of points in the plane. IEEE Trans Inf Theory 29(4):551–59
Ferraty F, Goia A, Salinelli E, Vieu P (2013) Functional projection pursuit regression. Test 22(2):293–320
Foreman-Mackey D (2016) Corner.py: scatterplot matrices in python. J Open Sour Softw 24
Friedman JH (1987) Exploratory projection pursuit. J Am Stat Assoc 82(1):249–66
Friedman JH, Tukey JW (1974) A projection pursuit algorithm for exploratory data analysis. IEEE Trans Comput 23:881–89
Grimm K (2016) Kennzahlenbasierte Grafikauswahl. Doctoral thesis, Universitat Augsburg
Grimm K (2017) Mbgraphic: measure based graphic selection. https://CRAN.R-project.org/package=mbgraphic
Hall P (1989) On polynomial-based projection indices for exploratory projection pursuit. Ann Stat 17(2):589–605
Hofmann H, Wilkinson L, Wickham H, Lang DT, Anand A (2019) Binostics: compute scagnostics. https://cran.r-project.org/package=binostics
Hou S, Wentzell PD (2014) Re-centered kurtosis as a projection pursuit index for multivariate data analysis. J Chemom 28(5):370–84
Huang B, Cook D, Wickham H (2012) TourrGui: a gWidgets gui for the tour to explore high-dimensional data using low-dimensional projections. J Stat Softw 49(6):1–12
Huber PJ (1985) Projection pursuit. Ann Stat 13(2):435–75
Jones MC, Sibson R (1987) What is projection pursuit? J R Stat Soc Ser A 150:1–36
Kruskal JB (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proc Am Math Soc 7(1):48–50
Kruskal JB (1969) Toward a practical method which helps uncover the structure of a set of observations by finding the line transformation which optimizes a new ‘index of condensation’. In: Milton RC, Nelder JA (eds) Statistical computation. Academic Press, New York, pp 427–40
Laa U, Cook D (2019a) https://github.com/uschiLaa/paper-ppi
Laa U, Cook D (2019b) https://github.com/uschiLaa/spinebil
Laa U, Cook D (2019c) https://github.com/uschiLaa/galahr
Lee E-K, Cook D, Klinke S, Lumley T (2005) Projection pursuit for exploratory supervised classification. J Comput Gr Stat 14(4):831–46
LIGO (2018) https://dcc.ligo.org/public/0152/P1800115/005
Loperfido N (2018) Skewness-based projection pursuit: a computational approach. Comput Stat Data Anal 120:42–57
Marsaglia G (1968) Random numbers fall mainly in the planes. Proc Natl Acad Sci 61:25
Naito K (1997) A generalized projection pursuit procedure and its significance level. Hiroshima Math J 27(3):513–54
Pan J-X, Fung W-K, Fang K-T (2000) Multiple outlier detection in multivariate data using projection pursuit techniques. J Stat Plann Inference 83(1):153–67
Pilhöfer A, Unwin A (2013) New approaches in visualization of categorical data: R package extracat. J Stat Softw 53(7):1–25
Posse C (1995a) Projection pursuit exploratory data analysis. Comput Stat Data Anal 20(6):669–87
Posse C (1995b) Tools for two-dimensional exploratory projection pursuit. J Comput Gr Stat 4(2):83–100
R Core Team (2018) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–24
Reshef YA, Reshef DN, Finucane HK, Sabeti PC, Mitzenmacher M (2016) Measuring dependence powerfully and equitably. J Mach Learn Res 17(212):1–63
Rodriguez-Martinez E, Goulermas JY, Mu T, Ralph JF (2010) Automatic induction of projection pursuit indices. IEEE Trans Neural Netw 21(8):1281–95
Simon N, Tibshirani R (2014) Comment on ‘detecting novel associations in large data sets’ by Reshef Et Al, Science Dec 16, 2011. arXiv E-Prints arXiv:1401.7645. http://arxiv.org/abs/1401.7645
Smith R, Field SE, Blackburn K, Haster C-J, Pürrer M, Raymond V, Schmidt P (2016) Fast and accurate inference on gravitational waves from precessing compact binaries. Phys Rev D 94(4):044031
Székely GJ, Rizzo ML, Bakirov NK (2007) Measuring and testing dependence by correlation of distances. Ann Stat 35(6):2769–94
Tukey PA, Tukey JW (1981) Graphical display of data in three and higher dimensions. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, New York
Wahba G (1990) Spline models for observational data. In: CBMS-Nsf regional conference series in applied mathematics. Society for Industrial Applied Mathematics, Philadelphia
Wickham H, Cook D, Hofmann H, Buja A (2011) Tourr: an R package for exploring multivariate data with projections. J Stat Softw 40(2):1–18
Wilkinson L, Wills G (2008) Scagnostics distributions. J Comput Gr Stat 17(2):473–91
Wilkinson L, Anand A, Grossman R (2005) Graph-theoretic scagnostics. In: IEEE symposium on information visualization, 2005. INFOVIS 2005, pp 157–64
Wood SN, Pya N, Säfken B (2016) Smoothing parameter and model selection for general smooth models (with discussion). J Am Stat Assoc 111:1548–75
Xie Y (2015) Dynamic documents with R and knitr. 2nd ed. Chapman, Boca Raton, FL. https://yihui.name/knitr/
Xie Y (2016) Bookdown: authoring books and technical documents with R markdown. Chapman, Boca Raton, FL. https://github.com/rstudio/bookdown
Xie Y, Allaire JJ, Grolemund G (2018) R markdown: the definitive guide. Chapman, Boca Raton, FL. https://bookdown.org/yihui/rmarkdown
Acknowledgements
The authors gratefully acknowledge the support of the Australian Research Council. We thank Rory Smith for help with the gravitational wave examples, and German Valencia for useful comments. This article was created with knitr (Xie 2015), R Markdown (Xie et al. 2018) and bookdown (Xie 2016) with embedded code.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
Re-scaling of holes index
Holes and cmass indexes are derived from \(I_0^N\) of Cook et al. (1993). As noted in proposition 1 of that paper the index takes local maxima for the minimum and maximum of \(a_0\) which are achieved by central hole and central mass distributions respectively. Cook and Swayne (2007) then gives explicit index functions defined for sphered data (zero mean, identity variance-covariance matrix). They are defined such that each one is maximized for central holes or central mass type distributions, with maximum=1, and cmass=1-holes. It follows that for either index both large and small values signal deviation from the normal distribution, and given a normal distribution we expect to find “average” index values rather than values close to zero.
We can estimate the values found for normal distribution by comparing values of \(a_{00}\) of Sec 7.1 in Cook et al. (1993). The maximum value is \(1/(2\pi )\) found for cmass type distributions, the minimum is \(1/(2\pi e)\) found for hole type distributions. Evaluating for normal distributions gives \(1/(4\pi )\), rescaling such that the index values range from 0 to 1 then puts the value for normal distributions at approximately \(0.2\). This is consistent with the results we found, i.e. for normal distributions the cmass index is about \(0.2\), and the holes index (=1-cmass) is about \(0.8\).
We therefore rescale as follows: first have cut-off at the respective value for the normal distribution, i.e. any value below 0.2 (0.8) is set to zero for cmass (holes) index, and we rescale the remaining range to be between zero and one.
Estimating the squint angle of the spiral
We estimate the squint angle of the spiral (for p = 4, 5, 6) as follows. First pick a random starting plane and generate a tour path from the starting plane to the ideal plane containing the spiral. Using skinny as the reference index we fix a lower index value that is attributed to indicate squintability at \(0.6\), and we move along the tour path towards the ideal plane until this value is reached. The distance between the thus identified plane and the ideal plane is used as an estimate of the squint angle in this direction. Since this will strongly depend on the random starting plane, i.e. the considered direction, we repeat the estimation 100 times and present the results in the form of a box plot in Fig. 14. The result shows a large drop in squint angle when going from p = 4 to higher dimensions, and generally a large spread of squint angles depending on the direction.
Computational performance
Computational time is important for using the PPIs with the guided tour, online. Figure 15 summarizes performance for each PPI. For simplicity, data with sample sizes ranging from 100 to 10,000 are drawn from a 6-d solid sphere, using the geozoo package (Schloerke 2016). The time to compute the PPIs over 100 interpolated grand tour projections is recorded. The scagnostics PPI are computed as a bundle, since this is the code base, and that major computational constraint is common to all the scagnostics. There are two versions of the MIC and TIC algorithm, labelled MINE and MINE E, the second being a newer algorithm which improves their computational performance.
The results are interesting. The scagnostic indexes and splines2d are very fast regardless of sample size. MIC, TIC (both versions) and dcor2d slow rapidly as sample size increases.
Effect of parameter choice in index value
Some parameters must be provided for some PPIs. This can be advantageous, allowing the index to more flexibly work for different types of structure, controlling trade-offs between noise and fine structure detection, and affecting computing time and precision.
-
Binning:
-
Scagnostics: the number of bins can be controlled by the user, note however that internally the implementation will reduce the number of bins if too many non-empty bins are found (more than 250).
-
MINE: the maximum number of bins considered is fixed by the user as a function of the number of data points. The default is chosen as a trade-off between resolution and noise dependence, but it may be tuned based on requirements dictated by specific datasets. Apart from sensitivity to noise computing time may also be a consideration here.
-
-
Spline knots: for the splines2d measure we need to fix the number of knots. By default it is fixed to be 10 (or lower if appropriate based on the data values). In our examples we find the number to be appropriate to identify functional dependence while rejecting noise, but some distributions may require tuning of this parameter.
The bins argument for the scagnostics might be reasonably expected to affect the smoothness of the index: a small number of bins should provide a smooth index function, but may affect its ability to detect fine structure. Figure 16 examines this. Scagnostics PPIs are computed for the spiral1000 data on a tour path between \(x_1\)–\(x_2\) to \(x_5\)–\(x_6\), for number of bins equal to 10, 20, 50. The interesting observation to make is that even with small bin size the functions are all relatively noisy. The problem with the small bin size is that the spiral becomes invisible to the PPI.
1.1 Binning sensitivity of MIC index
To examine the sensitivity of binning in the MIC PPI, the classic RANDU data (Marsaglia 1968), available in R, is used. Binning is defined by \(\delta \), where \(B(n) = n^{\delta }\). Figure 17 shows the best projection, index value and computing time obtained when optimizing the MIC index with values \(\delta = 0.6, 0.7, 0.8\). With small \(\delta \), less bins, the structure isn’t visible, and with larger \(\delta \) the structure is confounded with noise. It does appear that this parameter affects the performance of the MIC index.
Ways to refine the PPIs
The biggest issues revealed by the investigation into the new PPIs are a lack of smoothness, particularly for the scagnostics indexes, and the rotation invariance of Grimm’s indexes. To fix the smoothness of an index function, it is possible to calculate the PPI for a small neighborhood of projections and average the value, or alternatively average the PPI for several jittered projections. This is investigated in Fig. 18. Rotation invariance is more difficult to fix, but an alternative tour interpolation method could be useful. The geodesic interpolation transitions between planes, and it ignores the basis defining the plane, creating a problem with rotation invariant indices. Alternative interpolations based on Givens or Householder rotations could be implemented to transition between bases, which should alleviate the need for rotationally invariant indices.
Two different methods are considered for smoothing the index values:
-
Jittering points: using the jitter function we move each point by a random amount drawn from a uniform distribution between \(\pm \,\beta \).
-
Jittering angles: using the tourr implementation we can draw a random plane and move some small amount \(\epsilon \) in that direction.
The mean value from a sample of projections is recorded as the PPI value. This could be robsitufied by dropping the most extreme values.
This is particularly interesting for the scagnostics indexes skinny and stringy which we found to be most noisy among the indexes considered. Figure 18 studies the potential of these two smoothing approaches, using the tour path between noise variables of the spiral1000 dataset and different \(\epsilon \) and \(\beta \) values. Both methods appear to be promising in smoothing the function. Because the scagnostics are fast to compute, either of these methods is feasible. For this example we have smoothed over 10 randomly selected jittered views, computing time increases linearly with the number of randomly jittered views, as this is mostly determined by the time needed to evaluate the scagnostics indexes which is done separately for each view.
Additional figures
1.1 Final projection of pipe guided tour
Figure 19 shows the projection returned during the scouting phase (left) and the refinement phase (right). It was important to start the method 3 optimizer at the best projection returned during the scouting phase, to smoothly converge more closely to the ideal projection.
1.2 Dataset overview scatter plot matrices
Figures 20 and 21 show the scatter plot matrices of the gravitational wave datasets considered.
Rights and permissions
About this article
Cite this article
Laa, U., Cook, D. Using tours to visually investigate properties of new projection pursuit indexes with application to problems in physics. Comput Stat 35, 1171–1205 (2020). https://doi.org/10.1007/s00180-020-00954-8
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00180-020-00954-8