Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering

Leung, Raymond; Balamurali, Mehala; Melkumyan, Arman

doi:10.1007/s11004-019-09839-z

Published: 27 November 2019

Mathematical Geosciences, volume 53, pages 105–130 (2021)

Abstract

The presence of outliers in geochemical data can impact the accuracy of grade models and influence the interpretation of mine assay data. Removal of outliers is therefore an important consideration in grade estimation work. This paper presents two sample truncation strategies which have been devised to reject outliers in multivariate geochemical data. In essence, a data-dependent threshold is applied to the robust distances of sorted samples to discard outliers within a given class. For robust distances based on the minimum covariance determinant (MCD) where sample deviations from the cluster centre are computed using robust estimates, the inverse chi-square cumulative distribution function is often used to compute the cutoff point, \(\chi _{1-\alpha ,\nu }\), under the assumption of multivariate normality. In this work, it has been observed that this approach consistently underestimates the true extent of outliers. The proposed alternatives consist of a geometric and an analytic approach. The former defines the sample truncation point as the knee of the robust distance curve in an approximately chi-square-distributed quantile–quantile plot. The latter uses the silhouette and likelihood functions to consider the degree of cohesion in the resultant inlier/outlier clusters. Both techniques significantly reduce the scatter amongst the samples retained in each domain/class. For validation, ensemble clustering based on t-distributed stochastic neighbour embedding (t-SNE) is used to study the outlier recall rate, the effects of feature selection, and spatial correlation with MCD-based outlier rejection. Visual and quantitative analyses show that the proposed methods are superior to the baseline method which rejects samples using chi-square critical values.
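
The contrast between the baseline cutoff and the geometric alternative can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the robust distances are taken as already computed, the number of variables is fixed at \(\nu = 2\) so that the inverse chi-square CDF has the closed form \(-2\ln(1-p)\), the level \(\alpha = 0.025\) is an assumed default, and the knee is located with a common heuristic (the point farthest from the chord joining the ends of the curve), which may differ from the construction used in the paper. All function names and the sample values are hypothetical.

```python
import math


def chi2_quantile_2dof(p):
    """Inverse chi-square CDF for nu = 2, which has the closed form -2 ln(1 - p)."""
    return -2.0 * math.log(1.0 - p)


def baseline_cutoff(alpha=0.025):
    """Baseline cutoff on the (unsquared) robust distance for nu = 2 (assumed alpha)."""
    return math.sqrt(chi2_quantile_2dof(1.0 - alpha))


def knee_index(xs, ys):
    """Index of the point farthest from the chord joining the two end points."""
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    chord = math.hypot(x1 - x0, y1 - y0)
    gaps = [abs((y1 - y0) * (x - x0) - (x1 - x0) * (y - y0)) / chord
            for x, y in zip(xs, ys)]
    return max(range(len(gaps)), key=gaps.__getitem__)


def knee_truncation(distances):
    """Number of samples retained when the sorted Q-Q curve is cut at its knee."""
    d = sorted(distances)
    n = len(d)
    # Theoretical quantiles for the Q-Q plot, using plotting positions (i - 0.5) / n.
    q = [math.sqrt(chi2_quantile_2dof((i - 0.5) / n)) for i in range(1, n + 1)]
    return knee_index(q, d) + 1


# Hypothetical robust distances for one class: a tight core followed by a tail.
robust_distances = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.6, 2.0, 2.4, 2.6]
cutoff = baseline_cutoff(alpha=0.025)
kept_baseline = sum(d <= cutoff for d in robust_distances)
kept_knee = knee_truncation(robust_distances)
print(f"cutoff={cutoff:.3f}, baseline keeps {kept_baseline}, knee keeps {kept_knee}")
```

On this toy input the chi-square cutoff (about 2.72) retains all 12 samples, whereas the knee heuristic retains only the first 8, which mirrors the qualitative behaviour described above. For a general number of variables, the quantile would normally come from a library routine such as `scipy.stats.chi2.ppf`.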

Notes

The location where a blast cone sample is taken (McArthur et al. 2010) cannot be mapped precisely to a specific depth. Furthermore, the local geology may change through the sampling interval; this can make a sample unrepresentative if mixing occurs. These factors can lead to variable results between assay samples taken from the same hole and from adjacent holes.
A geological domain is defined by both its stratigraphic location within the deposit and the mineralisation present. For an iron ore deposit, the stratigraphy is defined on the basis of its gamma response and various geochemical and material attributes. A typical mineralised domain consists predominantly of hematite and goethite, and to a lesser extent shale and BIF (banded iron formation). In a non-mineralised domain, these proportions are inverted so the composition of Fe is generally much lower. From a geostatistical perspective, a domain may be alternatively defined (in a data-driven way) as a 3D region with autocorrelated properties, such as chemistry in the case of grade estimation.
Throughout this paper, the terms ‘observations’ and ‘samples’ are used interchangeably. In essence, the MCD technique proposed by Rousseeuw finds the h samples [where \((n+k+1)/2 \le h < n\), n = number of observations, k = number of variables] for which the confidence ellipsoid has minimal volume. Improved estimates are obtained once the \((n-h)\) outliers are removed; \(h=0.75n\) is often chosen by default. The MCD method is intended for continuous variables, and assumes that the number of observations, n, is at least five times the number of variables, k. Further details can be found in Rousseeuw and Driessen (1999).
One example is the assumption of multivariate normality in the source data, which leads to chi-square-distributed Mahalanobis distances.
For t-SNE, the panels labelled ‘cluster 0’ show the composition after outlier rejection. For MCD, the compositions before and after outlier removal are shown in the panels ‘inliers’ and ‘outliers’, respectively.
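
The MCD note above can be made concrete with a small sketch. It is an illustration only, assuming two variables and a tiny data set: the subset size h is taken as 0.75n and clamped to the stated bounds, and the h-subset is found by exhaustive search over all combinations, choosing the one whose sample covariance matrix has the smallest determinant (equivalently, the smallest ellipsoid volume). Exhaustive search is impractical beyond toy sizes; the fast algorithm of Rousseeuw and Driessen (1999), available for example as `sklearn.covariance.MinCovDet`, is what would be used in practice. The function names and the sample points are hypothetical.

```python
import math
from itertools import combinations


def mcd_subset_size(n, k, fraction=0.75):
    """Pick h = fraction * n, clamped so that (n + k + 1) / 2 <= h < n."""
    if n < 5 * k:
        raise ValueError("MCD assumes at least five observations per variable")
    lower = math.ceil((n + k + 1) / 2)
    return min(max(int(fraction * n), lower), n - 1)


def cov_det_2d(points):
    """Determinant of the 2x2 sample covariance matrix of (x, y) points."""
    m = len(points)
    mx = sum(p[0] for p in points) / m
    my = sum(p[1] for p in points) / m
    sxx = sum((p[0] - mx) ** 2 for p in points) / (m - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (m - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (m - 1)
    return sxx * syy - sxy ** 2


def brute_force_mcd_2d(points, h):
    """Indices of the h-subset with the smallest covariance determinant."""
    return min(combinations(range(len(points)), h),
               key=lambda idx: cov_det_2d([points[i] for i in idx]))


# Hypothetical data: eight clustered samples followed by two gross outliers.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5),
          (0.2, 0.8), (0.8, 0.2), (0.5, 0.1), (9.0, 12.0), (12.0, -3.0)]
h = mcd_subset_size(n=len(points), k=2)
subset = brute_force_mcd_2d(points, h)
print(f"h={h}, selected indices={subset}")
```

With n = 10 and k = 2 the clamped subset size is h = 7, and the selected subset leaves out both outliers (indices 8 and 9). Robust location and covariance estimates would then be computed from the selected samples only.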

References

Balamurali M, Melkumyan A (2015) Multivariate outlier detection in geochemical data. In: Proceedings of the IAMG conference, international association for mathematical geosciences, Freiberg, Germany, vol 17, pp 602–610
Balamurali M, Melkumyan A (2016) t-SNE based visualisation and clustering of geological domain. In: International conference on neural information processing. Springer, Berlin, pp 565–572
Balamurali M, Melkumyan A (2019) Computer aided sub-domain detection using t-SNE incorporating cluster ensemble for improved mine modelling. (submitted for review)
Balamurali M, Silversides KL, Melkumyan A (2019) A comparison of t-SNE, SOM and SPADE for identifying material type domains in geological data. Comput Geosci 125:78–89
Devlin S, Gnanadesikan R, Kettenring J (1981) Robust estimation of dispersion matrices and principal components. J Am Stat Assoc 76(374):354–362
Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barcelo-Vidal C (2003) Isometric logratio transformations for compositional data analysis. Math Geol 35(3):279–300
Engström K, Esbensen KH (2017) Optimal grade control sampling practice in open-pit mining-a full-scale blast hole versus reverse circulation variographic experiment. Appl Earth Sci Trans Inst Min Metall Sect B 126(4):176–187
Filzmoser P, Hron K (2008) Outlier detection for compositional data using robust methods. Math Geosci 40(3):233–248
Filzmoser P, Garrett RG, Reimann C (2005) Multivariate outlier detection in exploration geochemistry. Comput Geosci 31(5):579–587
Garrett RG (1989) The chi-square plot: a tool for multivariate outlier recognition. J Geochem Explor 32(1–3):319–341
Garrett RG, Reimann C, Hron K, Kynčlová P, Filzmoser P (2017) Finally, a correlation coefficient that tells the geochemical truth. Newsl Assoc Appl Geochem 176:1–10
Gnanadesikan R (1977) Methods for statistical data analysis of multivariate observations, vol 321. Wiley, New York
Horrocks T, Holden EJ, Wedge D, Wijns C, Fiorentini M (2019) Geochemical characterisation of rock hydration processes using t-SNE. Comput Geosci 124:46–57
Hubert M, Debruyne M (2010) Minimum covariance determinant. Wiley Interdiscip Rev: Comput Stat 2(1):36–43
McArthur G, Jones C, Murphy M (2010) Blasthole cone sampling experiments for iron ore flitch mining. In: Proceedings sampling 2010
Mežibricky R, Frohlichova M, Legemza J (2019) High-SiO\(_{2}\) iron ore sintering: Central & Eastern Europe scenario. Metalurgija 58(1–2):161–164
Monti S, Tamayo P, Mesirov J, Golub T (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52(1):91–118
Rousseeuw PJ (1985) Multivariate estimation with high breakdown point. Math Stat Appl 8:283–297
Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Rousseeuw PJ, Driessen KV (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223
Rousseeuw PJ, Van Zomeren BC (1990) Unmasking multivariate outliers and leverage points. J Am Stat Assoc 85(411):633–639
Tolosana-Delgado R, Mueller U, van den Boogaart KG (2019) Geostatistics for compositional data: an overview. Math Geosci 51:485–526. https://doi.org/10.1007/s11004-018-9769-3
Tsagris MT, Preston S, Wood AT (2011) A data-based power transformation for compositional data. In: Egozcue J, Tolosana-Delgado R, Ortego M (eds) 4th international workshop on Compositional Data Analysis. Springer, New York, pp 565–572
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416

Acknowledgements

This work was supported by the Australian Centre for Field Robotics and the Rio Tinto Centre for Mine Automation. The authors would like to thank the reviewers for their comments and the publishing staff who copy-edited this manuscript.

Author information

Authors and Affiliations

Australian Centre for Field Robotics, The University of Sydney, Sydney Robotics Hub J18, Sydney, NSW, 2006, Australia
Raymond Leung, Mehala Balamurali & Arman Melkumyan

Corresponding author

Correspondence to Raymond Leung.

About this article

Cite this article

Leung, R., Balamurali, M. & Melkumyan, A. Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering. Math Geosci 53, 105–130 (2021). https://doi.org/10.1007/s11004-019-09839-z

Received: 06 May 2019
Accepted: 01 November 2019
Published: 27 November 2019
Issue Date: January 2021
DOI: https://doi.org/10.1007/s11004-019-09839-z
