Probabilistic outlier detection for sparse multivariate geotechnical site investigation data using Bayesian learning
Geoscience Frontiers ( IF 8.9 ) Pub Date : 2020-04-18 , DOI: 10.1016/j.gsf.2020.03.017
Shuo Zheng , Yu-Xin Zhu , Dian-Qing Li , Zi-Jun Cao , Qin-Xuan Deng , Kok-Kwang Phoon

Various uncertainties arising during the acquisition of geoscience data may result in anomalous data instances (i.e., outliers) that do not conform with the expected pattern of regular data instances. With sparse multivariate data obtained from geotechnical site investigation, it is impossible to identify outliers with certainty, because the outliers distort the statistics of geotechnical parameters and data sparsity adds statistical uncertainty to those statistics. This paper develops a probabilistic outlier detection method for sparse multivariate data obtained from geotechnical site investigation. The proposed approach quantifies the outlying probability of each data instance based on the Mahalanobis distance and identifies as outliers those data instances with outlying probabilities greater than 0.5. It tackles the distortion of statistics estimated from a dataset containing outliers by a re-sampling technique and accounts rationally for the statistical uncertainty by Bayesian machine learning. Moreover, the proposed approach also suggests an exclusive method to determine the outlying components of each outlier. The proposed approach is illustrated and verified using simulated and real-life datasets. The results show that the proposed approach properly identifies outliers among sparse multivariate data, and their corresponding outlying components, in a probabilistic manner. It can significantly reduce the masking effect (i.e., missing some actual outliers because of the distortion of statistics by the outliers and statistical uncertainty). The results also show that outliers among sparse multivariate data instances significantly affect the construction of the multivariate distribution of geotechnical parameters for uncertainty quantification. This emphasizes the necessity of a data cleaning process (e.g., outlier detection) for uncertainty quantification based on geoscience data.
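The abstract gives the decision rule (Mahalanobis distance, outlying probability greater than 0.5) but not the computation, so the following is only a rough sketch of the general idea and not the authors' algorithm. It assumes a plain bootstrap in place of the paper's re-sampling and Bayesian learning steps, and it assumes that a point counts as "outlying" in one resample when its squared Mahalanobis distance exceeds a fixed chi-square quantile. The function name `outlying_probabilities`, the `threshold` parameter, and the toy data are all hypothetical.

```python
import math

import numpy as np


def outlying_probabilities(X, threshold, n_resamples=500, seed=0):
    """Fraction of bootstrap resamples in which each row of X is flagged.

    Assumption: a row is flagged in a resample when its squared Mahalanobis
    distance, computed from that resample's mean and covariance, exceeds
    `threshold`. This is a stand-in for the paper's Bayesian treatment.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    flagged = np.zeros(n)
    for _ in range(n_resamples):
        # Resample rows with replacement, so some resamples omit the outliers
        # and yield statistics that are not distorted by them.
        sample = X[rng.integers(0, n, size=n)]
        mean = sample.mean(axis=0)
        cov = np.atleast_2d(np.cov(sample, rowvar=False))
        # The pseudo-inverse guards against a singular covariance estimate.
        cov_inv = np.linalg.pinv(cov)
        diff = X - mean
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        flagged += d2 > threshold
    return flagged / n_resamples


# Toy data (hypothetical): 20 bivariate inliers plus one gross outlier.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(size=(20, 2)), [8.0, -8.0]])

# 97.5% chi-square quantile for 2 degrees of freedom, in closed form.
threshold = -2.0 * math.log(1.0 - 0.975)

probs = outlying_probabilities(X, threshold)
outlier_rows = np.flatnonzero(probs > 0.5)  # the 0.5 rule from the abstract
print(outlier_rows)
```

The closed-form threshold is valid only for two variables; for more dimensions the chi-square quantile would have to come from a statistics library. The 0.975 level is an illustrative choice, not a value taken from the paper.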




Updated: 2020-04-21