Abstract
The presence of outliers in geochemical data can impact the accuracy of grade models and influence the interpretation of mine assay data. Removal of outliers is therefore an important consideration in grade estimation work. This paper presents two sample truncation strategies which have been devised to reject outliers in multivariate geochemical data. In essence, a data-dependent threshold is applied to the robust distances of sorted samples to discard outliers within a given class. For robust distances based on the minimum covariance determinant (MCD) where sample deviations from the cluster centre are computed using robust estimates, the inverse chi-square cumulative distribution function is often used to compute the cutoff point, \(\chi _{1-\alpha ,\nu }\), under the assumption of multivariate normality. In this work, it has been observed that this approach consistently underestimates the true extent of outliers. The proposed alternatives consist of a geometric and an analytic approach. The former defines the sample truncation point as the knee of the robust distance curve in an approximately chi-square-distributed quantile–quantile plot. The latter uses the silhouette and likelihood functions to consider the degree of cohesion in the resultant inlier/outlier clusters. Both techniques significantly reduce the scatter amongst the samples retained in each domain/class. For validation, ensemble clustering based on t-distributed stochastic neighbour embedding (t-SNE) is used to study the outlier recall rate, the effects of feature selection, and spatial correlation with MCD-based outlier rejection. Visual and quantitative analyses show that the proposed methods are superior to the baseline method which rejects samples using chi-square critical values.
Notes
The location where a blast cone sample is taken (McArthur et al. 2010) cannot be mapped precisely to a specific depth. Furthermore, the local geology may change through the sampling interval; this can make a sample unrepresentative if mixing occurs. These factors can lead to variable results between assay samples taken from the same hole and from adjacent holes.
A geological domain is defined by both its stratigraphic location within the deposit and the mineralisation present. For an iron ore deposit, the stratigraphy is defined on the basis of its gamma response and various geochemical and material attributes. A typical mineralised domain consists predominantly of hematite and goethite, and to a lesser extent shale and BIF (banded iron formation). In a non-mineralised domain, these proportions are inverted so the composition of Fe is generally much lower. From a geostatistical perspective, a domain may be alternatively defined (in a data-driven way) as a 3D region with autocorrelated properties, such as chemistry in the case of grade estimation.
Throughout this paper, the terms ‘observations’ and ‘samples’ are used interchangeably. In essence, the MCD technique proposed by Rousseeuw finds the h samples [where \((n+k+1)/2 \le h < n\), n = number of observations, k = number of variables] for which the confidence ellipsoid has minimal volume. Improved estimates are obtained once the \((n-h)\) outliers are removed; \(h=0.75n\) is often chosen by default. The MCD method is intended for continuous variables, and assumes that the number of observations, n, is at least five times the number of variables, k. Further details can be found in Rousseeuw and Driessen (1999).
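As a rough illustration of the constraints in this note, the sketch below computes a subset size h for given n and k. The function name `mcd_subset_size`, its `fraction` parameter, and the choice to round \(0.75n\) down and clamp it to the admissible range are assumptions made for illustration; they are not taken from the paper or from any particular MCD implementation.

```python
import math


def mcd_subset_size(n, k, fraction=0.75):
    """Pick an MCD subset size h for n observations and k variables.

    Hypothetical helper: returns (h, h_min), where h_min is the smallest
    integer satisfying (n + k + 1) / 2 <= h, and h is fraction * n
    rounded down, then clamped so that h_min <= h < n.
    """
    # Rule of thumb stated in the note: at least five observations per variable.
    if n < 5 * k:
        raise ValueError("MCD assumes n >= 5k; got n=%d, k=%d" % (n, k))

    # Smallest admissible integer h from (n + k + 1) / 2 <= h.
    h_min = math.ceil((n + k + 1) / 2)

    # Default choice h = 0.75 n, rounded down (an assumption of this sketch).
    h = int(fraction * n)

    # Enforce h_min <= h < n.
    h = min(max(h, h_min), n - 1)
    return h, h_min


if __name__ == "__main__":
    h, h_min = mcd_subset_size(n=200, k=4)
    print("h =", h, "| smallest admissible h =", h_min, "| n - h =", 200 - h)
```

With n = 200 and k = 4, this gives h = 150, so the remaining \(n-h = 50\) samples are the ones left out of the initial robust fit.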
For instance, the assumption of multivariate normality in the source data, which leads to chi-square distributed squared Mahalanobis distances.
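To make the baseline concrete, the following sketch flags samples whose squared robust distance exceeds a chi-square critical value. It assumes the squared distances have already been computed, takes \(\alpha = 0.025\) as an illustrative default, and replaces the exact inverse chi-square CDF with the Wilson-Hilferty approximation so that only the Python standard library is needed. The function names are hypothetical, and an exact quantile (for example from `scipy.stats.chi2.ppf`) would normally be preferred.

```python
import math
from statistics import NormalDist


def chi2_quantile_wh(p, dof):
    """Approximate the chi-square quantile (inverse CDF) at probability p.

    Uses the Wilson-Hilferty cube-root normal approximation; this is an
    approximation, not the exact inverse CDF.
    """
    z = NormalDist().inv_cdf(p)      # standard normal quantile
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * math.sqrt(c)) ** 3


def flag_outliers(squared_distances, dof, alpha=0.025):
    """Return (flags, cutoff); a flag is True where d^2 exceeds the cutoff.

    dof is the number of variables; alpha = 0.025 is an assumed default.
    """
    cutoff = chi2_quantile_wh(1.0 - alpha, dof)
    flags = [d2 > cutoff for d2 in squared_distances]
    return flags, cutoff


if __name__ == "__main__":
    # Made-up squared distances for four samples described by four variables.
    flags, cutoff = flag_outliers([1.0, 5.0, 12.0, 30.0], dof=4)
    print("approximate cutoff:", round(cutoff, 2))
    print("outlier flags:", flags)
```

The cutoff here depends only on \(\alpha\) and the number of variables, not on the observed distances, which is the main contrast with a data-dependent threshold.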
For t-SNE, the panels labelled ‘cluster 0’ show the composition after outlier rejection. For MCD, the compositions before and after outlier removal are shown in the panels ‘inliers’ and ‘outliers’, respectively.
References
Balamurali M, Melkumyan A (2015) Multivariate outlier detection in geochemical data. In: Proceedings of the IAMG conference, international association for mathematical geosciences, Freiberg, Germany, vol 17, pp 602–610
Balamurali M, Melkumyan A (2016) t-SNE based visualisation and clustering of geological domain. In: International conference on neural information processing. Springer, Berlin, pp 565–572
Balamurali M, Melkumyan A (2019) Computer aided sub-domain detection using t-SNE incorporating cluster ensemble for improved mine modelling. (submitted for review)
Balamurali M, Silversides KL, Melkumyan A (2019) A comparison of t-SNE, SOM and SPADE for identifying material type domains in geological data. Comput Geosci 125:78–89
Devlin S, Gnanadesikan R, Kettenring J (1981) Robust estimation of dispersion matrices and principal components. J Am Stat Assoc 76(374):354–362
Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barcelo-Vidal C (2003) Isometric logratio transformations for compositional data analysis. Math Geol 35(3):279–300
Engström K, Esbensen KH (2017) Optimal grade control sampling practice in open-pit mining-a full-scale blast hole versus reverse circulation variographic experiment. Appl Earth Sci Trans Inst Min Metall Sect B 126(4):176–187
Filzmoser P, Hron K (2008) Outlier detection for compositional data using robust methods. Math Geosci 40(3):233–248
Filzmoser P, Garrett RG, Reimann C (2005) Multivariate outlier detection in exploration geochemistry. Comput Geosci 31(5):579–587
Garrett RG (1989) The chi-square plot: a tool for multivariate outlier recognition. J Geochem Explor 32(1–3):319–341
Garrett RG, Reimann C, Hron K, Kynčlová P, Filzmoser P (2017) Finally, a correlation coefficient that tells the geochemical truth. Newsl Assoc Appl Geochem 176:1–10
Gnanadesikan R (1977) Methods for statistical data analysis of multivariate observations, vol 321. Wiley, New York
Horrocks T, Holden EJ, Wedge D, Wijns C, Fiorentini M (2019) Geochemical characterisation of rock hydration processes using t-SNE. Comput Geosci 124:46–57
Hubert M, Debruyne M (2010) Minimum covariance determinant. Wiley Interdiscip Rev: Comput Stat 2(1):36–43
McArthur G, Jones C, Murphy M (2010) Blasthole cone sampling experiments for iron ore flitch mining. In: Proceedings sampling 2010
Mežibricky R, Frohlichova M, Legemza J (2019) High-SiO\(_{2}\) iron ore sintering: Central & Eastern Europe scenario. Metalurgija 58(1–2):161–164
Monti S, Tamayo P, Mesirov J, Golub T (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52(1):91–118
Rousseeuw PJ (1985) Multivariate estimation with high breakdown point. Math Stat Appl 8:283–297
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Rousseeuw PJ, Driessen KV (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223
Rousseeuw PJ, Van Zomeren BC (1990) Unmasking multivariate outliers and leverage points. J Am Stat Assoc 85(411):633–639
Tolosana-Delgado R, Mueller U, van den Boogaart KG (2019) Geostatistics for compositional data: an overview. Math Geosci 51:485–526. https://doi.org/10.1007/s11004-018-9769-3
Tsagris MT, Preston S, Wood AT (2011) A data-based power transformation for compositional data. In: Egozcue J, Tolosana-Delgado R, Ortego M (eds) 4th international workshop on Compositional Data Analysis. Springer, New York, pp 565–572
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416
Acknowledgements
This work was supported by the Australian Centre for Field Robotics and the Rio Tinto Centre for Mine Automation. The authors would like to thank the reviewers for their comments and the publishing staff who copy-edited this manuscript.
Cite this article
Leung, R., Balamurali, M. & Melkumyan, A. Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering. Math Geosci 53, 105–130 (2021). https://doi.org/10.1007/s11004-019-09839-z
DOI: https://doi.org/10.1007/s11004-019-09839-z