Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering

Mathematical Geosciences

Abstract

The presence of outliers in geochemical data can impact the accuracy of grade models and influence the interpretation of mine assay data. Removal of outliers is therefore an important consideration in grade estimation work. This paper presents two sample truncation strategies which have been devised to reject outliers in multivariate geochemical data. In essence, a data-dependent threshold is applied to the robust distances of sorted samples to discard outliers within a given class. For robust distances based on the minimum covariance determinant (MCD) where sample deviations from the cluster centre are computed using robust estimates, the inverse chi-square cumulative distribution function is often used to compute the cutoff point, \(\chi _{1-\alpha ,\nu }\), under the assumption of multivariate normality. In this work, it has been observed that this approach consistently underestimates the true extent of outliers. The proposed alternatives consist of a geometric and an analytic approach. The former defines the sample truncation point as the knee of the robust distance curve in an approximately chi-square-distributed quantile–quantile plot. The latter uses the silhouette and likelihood functions to consider the degree of cohesion in the resultant inlier/outlier clusters. Both techniques significantly reduce the scatter amongst the samples retained in each domain/class. For validation, ensemble clustering based on t-distributed stochastic neighbour embedding (t-SNE) is used to study the outlier recall rate, the effects of feature selection, and spatial correlation with MCD-based outlier rejection. Visual and quantitative analyses show that the proposed methods are superior to the baseline method which rejects samples using chi-square critical values.
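As a rough illustration of the baseline rule and the geometric alternative described above, the following sketch computes a chi-square cutoff and a knee on the quantile–quantile curve. It is not the authors' implementation. The abstract does not say how the knee is located, so the sketch assumes a maximum-distance-to-chord heuristic. The function names, the default `alpha=0.025`, the plotting positions `(i + 0.5) / n`, and the choice to apply the cutoff to the unsquared distance are likewise assumptions. Chi-square quantiles are approximated with the Wilson–Hilferty formula so that only the Python standard library is needed.

```python
import math
from statistics import NormalDist


def chi2_quantile(p, dof):
    """Approximate chi-square quantile (Wilson-Hilferty); not exact."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    base = max(0.0, 1.0 - c + z * math.sqrt(c))  # clamp for the far lower tail
    return dof * base ** 3


def baseline_cutoff(dof, alpha=0.025):
    """Baseline rule: cutoff on the robust distance at significance alpha."""
    return math.sqrt(chi2_quantile(1.0 - alpha, dof))


def knee_index(distances, dof):
    """Index (in sorted order) of the knee of the Q-Q curve.

    Assumed heuristic: rescale both axes to [0, 1] and pick the point
    lying furthest below the chord joining the first and last points.
    """
    d = sorted(distances)
    n = len(d)
    if n < 3:
        return n - 1
    q = [math.sqrt(chi2_quantile((i + 0.5) / n, dof)) for i in range(n)]
    q_span, d_span = q[-1] - q[0], d[-1] - d[0]
    if q_span <= 0 or d_span <= 0:
        return n - 1
    gaps = [(q[i] - q[0]) / q_span - (d[i] - d[0]) / d_span for i in range(n)]
    return max(range(n), key=gaps.__getitem__)


# Synthetic robust distances (illustrative only): 45 samples that follow
# the theoretical quantiles and 5 samples inflated well beyond them.
dof, n = 3, 50
q = [math.sqrt(chi2_quantile((i + 0.5) / n, dof)) for i in range(n)]
robust_d = [q[i] if i < 45 else q[i] + 5 + 2 * (i - 45) for i in range(n)]

knee = knee_index(robust_d, dof)
kept_by_knee = knee + 1
kept_by_baseline = sum(d <= baseline_cutoff(dof) for d in robust_d)
print(kept_by_knee, kept_by_baseline)
```

Samples ranked after the knee would be truncated. On real assay data the two rules need not agree, and the synthetic distances here are built only to exercise the code.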

Notes

  1. The location where a blast cone sample is taken (McArthur et al. 2010) cannot be mapped precisely to a specific depth. Furthermore, the local geology may change through the sampling interval; this can make a sample unrepresentative if mixing occurs. These factors can lead to variable results between assay samples taken from the same hole and from adjacent holes.

  2. A geological domain is defined by both its stratigraphic location within the deposit and the mineralisation present. For an iron ore deposit, the stratigraphy is defined on the basis of its gamma response and various geochemical and material attributes. A typical mineralised domain consists predominantly of hematite and goethite, and to a lesser extent shale and BIF (banded iron formation). In a non-mineralised domain, these proportions are inverted so the composition of Fe is generally much lower. From a geostatistical perspective, a domain may be alternatively defined (in a data-driven way) as a 3D region with autocorrelated properties, such as chemistry in the case of grade estimation.

  3. Throughout this paper, the terms ‘observations’ and ‘samples’ are used interchangeably. In essence, the MCD technique proposed by Rousseeuw finds the h samples [where \((n+k+1)/2 \le h < n\), n = number of observations, k = number of variables] for which the confidence ellipsoid has minimal volume. Improved estimates are obtained once the \((n-h)\) outliers are removed; \(h=0.75n\) is often chosen by default. The MCD method is intended for continuous variables, and assumes that the number of observations, n, is at least five times the number of variables, k. Further details can be found in Rousseeuw and Driessen (1999).

  4. For instance, the assumption of multivariate normality in the source data which leads to chi-square distributed Mahalanobis distances.

  5. For t-SNE, the panels labelled ‘cluster 0’ show the composition after outlier rejection. For MCD, the compositions before and after outlier removal are shown in the panels ‘inliers’ and ‘outliers’, respectively.
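The definition of the MCD subset in Note 3 can be made concrete with a brute-force sketch. It is an illustration under stated assumptions, not the fast algorithm of Rousseeuw and Driessen (1999): it enumerates every h-subset, so it is only feasible for very small n, and it is restricted to two variables so that the determinant can be written out by hand. The function names are hypothetical, and rounding \(0.75n\) up to an integer is an assumption.

```python
import math
from itertools import combinations


def subset_size_bounds(n, k):
    """Return (smallest admissible h, default h) for n observations, k variables."""
    if n < 5 * k:
        raise ValueError("MCD assumes n is at least five times k")
    h_min = math.ceil((n + k + 1) / 2)           # from (n + k + 1) / 2 <= h
    h_default = max(h_min, math.ceil(0.75 * n))  # rounding up is an assumption
    return h_min, h_default


def cov_det_2d(points):
    """Determinant of the 2 x 2 sample covariance matrix of the points."""
    m = len(points)
    mx = sum(x for x, _ in points) / m
    my = sum(y for _, y in points) / m
    sxx = sum((x - mx) ** 2 for x, _ in points) / (m - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (m - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (m - 1)
    return sxx * syy - sxy ** 2


def exact_mcd_support(points, h):
    """Indices of the h points whose covariance determinant is smallest."""
    return min(
        combinations(range(len(points)), h),
        key=lambda idx: cov_det_2d([points[i] for i in idx]),
    )


# Eight points inside the unit square plus two distant points (made-up data).
points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.2), (0.3, 0.8),
          (0.7, 0.5), (0.2, 0.4), (10, 10), (12, -9)]
h_min, h = subset_size_bounds(len(points), 2)
support = exact_mcd_support(points, h)
outliers = [i for i in range(len(points)) if i not in support]
print(h_min, h, outliers)  # prints: 7 8 [8, 9]
```

The mean and covariance of the selected subset would then serve as the robust estimates from which robust distances are computed.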

References

  • Balamurali M, Melkumyan A (2015) Multivariate outlier detection in geochemical data. In: Proceedings of the IAMG conference, international association for mathematical geosciences, Freiberg, Germany vol 17, pp 602–610

  • Balamurali M, Melkumyan A (2016) t-SNE based visualisation and clustering of geological domain. In: International conference on neural information processing. Springer, Berlin, pp 565–572

  • Balamurali M, Melkumyan A (2019) Computer aided sub-domain detection using t-SNE incorporating cluster ensemble for improved mine modelling. (submitted for review)

  • Balamurali M, Silversides KL, Melkumyan A (2019) A comparison of t-SNE, SOM and SPADE for identifying material type domains in geological data. Comput Geosci 125:78–89

  • Devlin S, Gnanadesikan R, Kettenring J (1981) Robust estimation of dispersion matrices and principal components. J Am Stat Assoc 76(374):354–362

  • Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barcelo-Vidal C (2003) Isometric logratio transformations for compositional data analysis. Math Geol 35(3):279–300

  • Engström K, Esbensen KH (2017) Optimal grade control sampling practice in open-pit mining-a full-scale blast hole versus reverse circulation variographic experiment. Appl Earth Sci Trans Inst Min Metall Sect B 126(4):176–187

  • Filzmoser P, Hron K (2008) Outlier detection for compositional data using robust methods. Math Geosci 40(3):233–248

  • Filzmoser P, Garrett RG, Reimann C (2005) Multivariate outlier detection in exploration geochemistry. Comput Geosci 31(5):579–587

  • Garrett RG (1989) The chi-square plot: a tool for multivariate outlier recognition. J Geochem Explor 32(1–3):319–341

  • Garrett RG, Reimann C, Hron K, Kynčlová P, Filzmoser P (2017) Finally, a correlation coefficient that tells the geochemical truth. Newsl Assoc Appl Geochem 176:1–10

  • Gnanadesikan R (1977) Methods for statistical data analysis of multivariate observations, vol 321. Wiley, New York

  • Horrocks T, Holden EJ, Wedge D, Wijns C, Fiorentini M (2019) Geochemical characterisation of rock hydration processes using t-SNE. Comput Geosci 124:46–57

  • Hubert M, Debruyne M (2010) Minimum covariance determinant. Wiley Interdiscip Rev: Comput Stat 2(1):36–43

  • McArthur G, Jones C, Murphy M (2010) Blasthole cone sampling experiments for iron ore flitch mining. In: Proceedings sampling 2010

  • Mežibricky R, Frohlichova M, Legemza J (2019) High-SiO\(_{2}\) iron ore sintering: Central & Eastern Europe scenario. Metalurgija 58(1–2):161–164

  • Monti S, Tamayo P, Mesirov J, Golub T (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52(1):91–118

  • Rousseeuw PJ (1985) Multivariate estimation with high breakdown point. Math Stat Appl 8:283–297

  • Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

  • Rousseeuw PJ, Driessen KV (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223

  • Rousseeuw PJ, Van Zomeren BC (1990) Unmasking multivariate outliers and leverage points. J Am Stat Assoc 85(411):633–639

  • Tolosana-Delgado R, Mueller U, van den Boogaart KG (2019) Geostatistics for compositional data: an overview. Math Geosci 51:485–526. https://doi.org/10.1007/s11004-018-9769-3

  • Tsagris MT, Preston S, Wood AT (2011) A data-based power transformation for compositional data. In: Egozcue J, Tolosana-Delgado R, Ortego M (eds) 4th international workshop on Compositional Data Analysis. Springer, New York, pp 565–572

  • van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605

  • Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416

Acknowledgements

This work was supported by the Australian Centre for Field Robotics and the Rio Tinto Centre for Mine Automation. The authors would like to thank the reviewers for their comments and the publishing staff who copy-edited this manuscript.

Author information

Correspondence to Raymond Leung.

About this article

Cite this article

Leung, R., Balamurali, M. & Melkumyan, A. Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering. Math Geosci 53, 105–130 (2021). https://doi.org/10.1007/s11004-019-09839-z

  • DOI: https://doi.org/10.1007/s11004-019-09839-z
