SOAP: Semantic outliers automatic preprocessing
Information Sciences Pub Date : 2020-04-03 , DOI: 10.1016/j.ins.2020.03.071
Leonardo Trujillo , Uriel López , Pierrick Legrand

Genetic Programming (GP) is an evolutionary algorithm for the automatic generation of symbolic models expressed as syntax trees. GP has been successfully applied in many domains, but most research in this area has not considered the presence of outliers in the training set. Outliers make supervised learning problems difficult, and sometimes impossible, to solve. For instance, robust regression methods cannot handle more than 50% outlier contamination, a limit referred to as their breakdown point. This paper studies problems with high outlier contamination, reaching levels of up to 90%, an extreme case that can arise in some domains. This work shows, for the first time, that a random population of GP individuals can detect outliers in the output variable. Building on this property, a new filtering algorithm called Semantic Outlier Automatic Preprocessing (SOAP) is proposed, which can be used with any learning algorithm to differentiate between inliers and outliers. Since the method uses a GP population, the algorithm can be carried out for free in a GP symbolic regression system. The approach is the only method that can perform such an automatic cleaning of a dataset without incurring an exponential cost as the percentage of outliers in the dataset increases.
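
The 50% breakdown point mentioned in the abstract can be illustrated with a simpler robust estimator than regression. The sketch below is not taken from the paper and does not implement SOAP. It uses the sample median, whose breakdown point is also 50%, on synthetic data. The helper names `contaminated_sample` and `median_estimate`, and the inlier and outlier values, are assumptions chosen only for illustration.

```python
import statistics


def contaminated_sample(n_total, outlier_fraction,
                        inlier_value=1.0, outlier_value=100.0):
    """Build a synthetic sample: inliers at one value, outliers at another.

    Hypothetical helper for illustration; the values are arbitrary.
    """
    n_outliers = round(n_total * outlier_fraction)
    n_inliers = n_total - n_outliers
    return [inlier_value] * n_inliers + [outlier_value] * n_outliers


def median_estimate(outlier_fraction, n_total=100):
    """Return the sample median at a given contamination level."""
    return statistics.median(contaminated_sample(n_total, outlier_fraction))


if __name__ == "__main__":
    # Below 50% contamination the median stays at the inlier value;
    # above 50% it is captured by the outliers.
    for fraction in (0.1, 0.4, 0.6, 0.9):
        print(f"{fraction:.0%} outliers -> median = {median_estimate(fraction)}")
```

In this toy setting the estimate stays at the inlier value while outliers are the minority and moves to the outlier value once they become the majority. That is the regime (up to 90% contamination) the abstract targets.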




Updated: 2020-04-03