A Tandem Evolutionary Algorithm for Identifying Causal Rules from Complex Data


Evolutionary Computation (IF 6.8), Pub Date: 2020-03-01, DOI: 10.1162/evco_a_00252
John P Hanley, Donna M Rizzo, Jeffrey S Buzas, Margaret J Eppstein

We propose a new evolutionary approach for discovering causal rules in complex classification problems from batch data. Key aspects include (a) the use of a hypergeometric probability mass function as a principled fitness statistic that quantifies the probability that the observed association between a given clause and target class is due to chance, taking into account the size of the dataset, the amount of missing data, and the distribution of outcome categories; (b) tandem age-layered evolutionary algorithms for evolving parsimonious archives of conjunctive clauses, and disjunctions of these conjunctions, each of which has a probabilistically significant association with outcome classes; and (c) separate archive bins for clauses of different orders, with dynamically adjusted order-specific thresholds. The method is validated on majority-on and multiplexer benchmark problems exhibiting various combinations of heterogeneity, epistasis, overlap, noise in class associations, missing data, extraneous features, and imbalanced classes. We also validate on a more realistic synthetic genome dataset with heterogeneity, epistasis, extraneous features, and noise. In all synthetic epistatic benchmarks, we consistently recover the true causal rule sets used to generate the data. Finally, we discuss an application to a complex real-world survey dataset designed to inform possible ecohealth interventions for Chagas disease.
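
To make aspect (a) concrete, the following is a minimal sketch of a hypergeometric fitness score for a single conjunctive clause. It is an illustration under stated assumptions, not the authors' implementation: the function names (`hypergeom_pmf`, `clause_fitness`), the record layout, and the choice to exclude records that are missing a value for any feature used by the clause are all hypothetical. The only part taken as given is the standard hypergeometric probability mass function.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability that exactly k of n drawn samples belong to a class
    that has K members in a population of N (standard hypergeometric PMF)."""
    # Outside the feasible range of k the probability is zero.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def clause_fitness(records, clause, target_class):
    """Score a conjunctive clause; lower values suggest the association
    with target_class is less likely to be due to chance.

    records: list of (features, label) pairs, where features is a dict and
             None marks a missing value (hypothetical layout).
    clause:  dict mapping feature name -> required value.
    """
    # Assumption: records missing any feature used by the clause are ignored,
    # so the effective dataset size depends on the amount of missing data.
    usable = [
        (features, label)
        for features, label in records
        if all(features.get(name) is not None for name in clause)
    ]
    N = len(usable)                                            # usable samples
    K = sum(1 for _, label in usable if label == target_class)  # in target class
    matched = [
        label
        for features, label in usable
        if all(features[name] == value for name, value in clause.items())
    ]
    n = len(matched)                                           # samples matching clause
    k = sum(1 for label in matched if label == target_class)   # matching and in class
    return hypergeom_pmf(k, N, K, n)


# Toy data: the last record is skipped because feature "a" is missing.
records = [
    ({"a": 1, "b": 1}, 1),
    ({"a": 1, "b": 1}, 1),
    ({"a": 1, "b": 0}, 0),
    ({"a": 0, "b": 1}, 0),
    ({"a": None, "b": 1}, 1),
]
score = clause_fitness(records, {"a": 1, "b": 1}, target_class=1)
```

In this toy example the four usable records contain two target-class samples, and both samples matching the clause fall in the target class, so the score is C(2,2)·C(2,0)/C(4,2) = 1/6. The score depends on the dataset size, the number of excluded records, and the class balance, which are the three factors the abstract names.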


Updated: 2020-03-01
