Outlier Detection Algorithms Over Fuzzy Data with Weighted Least Squares
International Journal of Fuzzy Systems (IF 4.3), Pub Date: 2021-04-30, DOI: 10.1007/s40815-020-01049-8
Natalia Nikolova , Rosa M. Rodríguez , Mark Symes , Daniela Toneva , Krasimir Kolev , Kiril Tenekedjiev

In the classical leave-one-out procedure for outlier detection in regression analysis, we exclude an observation and then construct a model on the remaining data. If the difference between the predicted and observed values is high, we declare the observation an outlier. As a rule, such procedures rely on single-comparison testing. The problem becomes much harder when each observation carries a degree of membership to an underlying population, so that outlier detection must be generalized to operate over fuzzy data. We present a new approach for outlier detection that operates over fuzzy data using two inter-related algorithms. Because of the way outliers enter the observation sample, they may differ by orders of magnitude. To account for this, we divide the outlier detection procedure into cycles, each consisting of two phases. In Phase 1, we apply a leave-one-out procedure to each non-outlier in the dataset. In Phase 2, all previously declared outliers are subjected to the Benjamini–Hochberg step-up multiple-testing procedure, which controls the false-discovery rate, and non-confirmed outliers return to the dataset. Finally, we construct a regression model over the resulting set of non-outliers. In this way, Phase 1 yields a reliable, high-quality regression model, because the single-comparison leave-one-out tests purge dubious observations comparatively easily. At the same time, confirming outlier status against the newly obtained high-quality model is much harder under the multiple-testing procedure, so only the true outliers remain outside the data sample. The two phases of each cycle strike a good trade-off between the desire to construct a high-quality model (i.e., over informative data points) and the desire to use as many data points as possible (thus leaving as many observations as possible in the data sample).
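The confirmation step of Phase 2 relies on the Benjamini–Hochberg step-up rule. A generic sketch of that rule, independent of the paper's specific fuzzy test statistics (which are not given here), can be written as:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false-discovery rate.

    Returns a boolean array: True where the corresponding null hypothesis is
    rejected (here: the flagged observation is confirmed as an outlier).
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                      # ranks p-values in ascending order
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds             # step-up bound: p_(k) <= alpha*k/m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))  # largest rank meeting the bound
        reject[order[:k + 1]] = True           # reject every hypothesis up to rank k
    return reject
```

Note the step-up character of the rule: once the largest qualifying rank is found, all hypotheses with smaller p-values are rejected as well, even if some of them individually miss their own threshold.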
The number of cycles is user-defined, but the procedure finalizes the analysis early when a cycle detects no new outliers. We offer one illustrative example and two practical case studies (from real-life thrombosis research) that demonstrate the application and strengths of our algorithms. In the concluding section, we discuss several limitations of our approach and offer directions for future research.
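One cycle of the two-phase procedure, with the early-stopping rule, can be sketched as follows. This is a minimal illustration under assumptions of our own, not the authors' implementation: fuzzy membership degrees are collapsed into scalar weights `w`, the outlier statistic is a standardized weighted residual with a normal p-value, and the fit is a plain weighted least squares via `numpy`.

```python
import math
import numpy as np

def wls_fit(X, y, w):
    """Weighted least squares: solve min_b sum_i w_i * (y_i - x_i . b)^2."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

def two_sided_p(z):
    """Two-sided p-value of a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def bh_confirm(p, alpha):
    """Benjamini-Hochberg step-up rule: True where outlier status is confirmed."""
    p, m = np.asarray(p, dtype=float), len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        reject[order[:np.max(np.nonzero(below)[0]) + 1]] = True
    return reject

def detect_outliers(X, y, w, alpha=0.05, max_cycles=10):
    """Cyclic two-phase outlier detection over weighted (defuzzified) data."""
    n = len(y)
    is_out = np.zeros(n, dtype=bool)
    for _ in range(max_cycles):
        flagged = is_out.copy()
        # Phase 1: leave-one-out over every current non-outlier, single test each.
        for i in np.flatnonzero(~is_out):
            keep = ~is_out
            keep[i] = False
            beta = wls_fit(X[keep], y[keep], w[keep])
            resid = y[keep] - X[keep] @ beta
            sigma = math.sqrt(np.average(resid ** 2, weights=w[keep]))
            if two_sided_p((y[i] - X[i] @ beta) / sigma) < alpha:
                flagged[i] = True
        # Phase 2: all flagged points (including previously declared outliers)
        # face the multiple test against a model on the remaining data;
        # points that fail to confirm return to the dataset.
        cand = np.flatnonzero(flagged)
        keep = ~flagged
        beta = wls_fit(X[keep], y[keep], w[keep])
        resid = y[keep] - X[keep] @ beta
        sigma = math.sqrt(np.average(resid ** 2, weights=w[keep]))
        pvals = [two_sided_p((y[i] - X[i] @ beta) / sigma) for i in cand]
        new_out = np.zeros(n, dtype=bool)
        new_out[cand[bh_confirm(pvals, alpha)]] = True
        if np.array_equal(new_out, is_out):
            break  # a cycle with no new outliers finalizes the analysis
        is_out = new_out
    return is_out
```

For example, on 30 near-linear points with one gross outlier injected, the sketch flags and confirms exactly that point and leaves the rest in the sample; the second cycle produces no change and terminates the run.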



