A greedy feature selection algorithm for Big Data of high dimensionality

Machine Learning (IF 7.5), Pub Date: 2018-08-07, DOI: 10.1007/s10994-018-5748-7
Ioannis Tsamardinos (1, 2), Giorgos Borboudakis (1), Pavlos Katsogridakis (1, 3), Polyvios Pratikakis (1, 3), Vassilis Christophides (1, 4)

We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data of high dimensionality. PFBP partitions the data matrix by both rows and columns. By employing p-values of conditional independence tests together with meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy-type FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
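To make the idea of combining partition-local p-values more concrete, the following is a minimal sketch, not the authors' implementation. It assumes Fisher's combined probability test as the meta-analysis rule (a standard choice; the abstract does not say which rule PFBP uses) and a fixed significance threshold `alpha`. The function names `fisher_combine` and `early_drop_and_pick` and the input layout are hypothetical, chosen only for illustration.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (an assumed choice).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null, where k is the number of p-values.
    For even degrees of freedom the survival function has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!, so no external library is needed.
    Numerical robustness for a very large number of partitions is not
    addressed in this sketch.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    tiny = 1e-300  # clamp to avoid log(0)
    half_stat = -sum(math.log(max(p, tiny)) for p in p_values)  # X / 2
    term = 1.0
    total = 1.0
    for i in range(1, len(p_values)):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


def early_drop_and_pick(local_p_values, alpha=0.05):
    """One illustrative forward step over hypothetical per-partition results.

    local_p_values maps a feature name to the list of p-values that each row
    partition computed locally for that feature. Features whose combined
    p-value exceeds alpha are dropped from later iterations; the feature with
    the smallest combined p-value among the rest is returned as the winner.
    """
    combined = {f: fisher_combine(ps) for f, ps in local_p_values.items()}
    remaining = {f: p for f, p in combined.items() if p <= alpha}
    winner = min(remaining, key=remaining.get) if remaining else None
    return winner, remaining


if __name__ == "__main__":
    # Made-up p-values from three row partitions for three features.
    example = {
        "x1": [0.01, 0.03, 0.02],
        "x2": [0.40, 0.75, 0.60],
        "x3": [0.04, 0.20, 0.10],
    }
    winner, remaining = early_drop_and_pick(example)
    print("winner:", winner)
    print("still under consideration:", sorted(remaining))
```

Only the per-partition p-values have to be communicated in this sketch, which matches the abstract's point that computations stay local to a partition. The actual PFBP heuristics (Early Stopping and Early Return in particular) involve more than this single thresholding step.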

Updated: 2018-08-07
