Sample Truncation Strategies for Outlier Removal in Geochemical Data: The MCD Robust Distance Approach Versus t-SNE Ensemble Clustering
Mathematical Geosciences (IF 2.8), Pub Date: 2019-11-27, DOI: 10.1007/s11004-019-09839-z
Raymond Leung, Mehala Balamurali, Arman Melkumyan

Abstract

The presence of outliers in geochemical data can impact the accuracy of grade models and influence the interpretation of mine assay data. Removal of outliers is therefore an important consideration in grade estimation work. This paper presents two sample truncation strategies which have been devised to reject outliers in multivariate geochemical data. In essence, a data-dependent threshold is applied to the robust distances of sorted samples to discard outliers within a given class. For robust distances based on the minimum covariance determinant (MCD) where sample deviations from the cluster centre are computed using robust estimates, the inverse chi-square cumulative distribution function is often used to compute the cutoff point, \(\chi _{1-\alpha ,\nu }\), under the assumption of multivariate normality. In this work, it has been observed that this approach consistently underestimates the true extent of outliers. The proposed alternatives consist of a geometric and an analytic approach. The former defines the sample truncation point as the knee of the robust distance curve in an approximately chi-square-distributed quantile–quantile plot. The latter uses the silhouette and likelihood functions to consider the degree of cohesion in the resultant inlier/outlier clusters. Both techniques significantly reduce the scatter amongst the samples retained in each domain/class. For validation, ensemble clustering based on t-distributed stochastic neighbour embedding (t-SNE) is used to study the outlier recall rate, the effects of feature selection, and spatial correlation with MCD-based outlier rejection. Visual and quantitative analyses show that the proposed methods are superior to the baseline method which rejects samples using chi-square critical values.
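
As a rough illustration of the two cutoff ideas contrasted in the abstract, the Python sketch below takes robust distances that are assumed to have been computed already (for example by an MCD estimator, which is not implemented here). It compares the baseline chi-square cutoff with a simple knee-based truncation on the quantile-quantile curve. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names, the default `alpha = 0.025`, the Wilson-Hilferty approximation used in place of an exact inverse chi-square CDF, and the chord-distance rule used to locate the knee.

```python
from math import sqrt
from statistics import NormalDist


def chi2_quantile_wh(p, nu):
    """Approximate chi-square quantile (Wilson-Hilferty), not an exact inverse CDF."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * nu)
    base = max(1.0 - c + z * sqrt(c), 0.0)  # clamp: the approximation can dip below 0
    return nu * base ** 3


def chi2_cutoff(nu, alpha=0.025):
    """Baseline cutoff for robust distances; alpha = 0.025 is an assumed default."""
    return sqrt(chi2_quantile_wh(1.0 - alpha, nu))


def knee_truncation_count(robust_distances, nu):
    """Number of sorted samples to keep, using a chord-distance knee heuristic."""
    d = sorted(robust_distances)
    n = len(d)
    if n < 3:
        return n
    # Theoretical quantiles of the distance at plotting positions (i + 0.5) / n.
    q = [sqrt(chi2_quantile_wh((i + 0.5) / n, nu)) for i in range(n)]
    q_span, d_span = q[-1] - q[0], d[-1] - d[0]
    if q_span <= 0.0 or d_span <= 0.0:
        return n
    # Rescale both axes to [0, 1]; the chord from first to last point is then y = x.
    gaps = [(q[i] - q[0]) / q_span - (d[i] - d[0]) / d_span for i in range(n)]
    # The knee is taken as the point lying furthest below the chord.
    knee = max(range(n), key=gaps.__getitem__)
    return knee + 1
```

Here `nu` is the number of variables (the chi-square degrees of freedom). Samples whose robust distance exceeds `chi2_cutoff(nu)` would be rejected under the baseline rule, whereas the knee rule keeps only the first `knee_truncation_count(...)` samples in sorted order, so its threshold adapts to the shape of the observed distance curve.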

Updated: 2019-11-27