A Parallel Multilevel Feature Selection algorithm for improved cancer classification
Journal of Parallel and Distributed Computing (IF 3.8), Pub Date: 2019-12-28, DOI: 10.1016/j.jpdc.2019.12.015
Lokeswari Venkataramana, Shomona Gracia Jacob, Rajavel Ramadoss

Biological data tends to grow exponentially, and processing it consumes ever more resources, time and manpower. Parallelizing algorithms can reduce overall execution time, but parallelizing computational methods poses two main challenges: (1) biological data is multi-dimensional in nature, and (2) parallel algorithms reduce execution time at the cost of reduced prediction accuracy. This paper targets these two issues and proposes the following approaches: (1) vertical partitioning of the data along the feature space and horizontal partitioning along the samples, to ease data parallelism; and (2) a parallel Multilevel Feature Selection (M-FS) algorithm that selects optimal and important features for improved classification of cancer sub-types. The selected features are evaluated using a parallel Random Forest on Spark and compared both with previously reported results and with the results of sequential execution of the same algorithms. The proposed parallel M-FS algorithm was also compared with existing parallel feature selection algorithms in terms of accuracy and execution time. The results show that the parallel multilevel feature selection algorithm improved cancer classification, yielding prediction accuracy ranging from 85% to 99% with very high speed-up and execution times measured in seconds. By contrast, existing sequential algorithms yielded prediction accuracy of 65% to 99% with execution times of more than 24 h.
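
The abstract names the two partitioning schemes and a multilevel selection step but does not give the scoring criteria, the number of levels, or the Spark code. The Python sketch below is therefore only an illustration of the general idea, not the authors' M-FS algorithm. It assumes a two-class toy dataset, a hypothetical relevance score (squared gap between class means divided by the summed within-class variance), and a thread pool standing in for Spark workers. All function names are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pvariance


def vertical_partitions(n_features, n_parts):
    """Split feature (column) indices into n_parts contiguous blocks."""
    size, extra = divmod(n_features, n_parts)
    blocks, start = [], 0
    for i in range(n_parts):
        stop = start + size + (1 if i < extra else 0)
        blocks.append(list(range(start, stop)))
        start = stop
    return blocks


def horizontal_partitions(rows, n_parts):
    """Split samples (rows) round-robin into n_parts subsets."""
    return [rows[i::n_parts] for i in range(n_parts)]


def score_feature(rows, labels, j):
    """Hypothetical score: squared class-mean gap over summed class variance."""
    a = [r[j] for r, y in zip(rows, labels) if y == 0]
    b = [r[j] for r, y in zip(rows, labels) if y == 1]
    spread = pvariance(a) + pvariance(b) + 1e-12  # avoid division by zero
    return (mean(a) - mean(b)) ** 2 / spread


def select_in_block(rows, labels, block, k):
    """Keep the k highest-scoring feature indices from one block."""
    ranked = sorted(block, key=lambda j: score_feature(rows, labels, j),
                    reverse=True)
    return ranked[:k]


def two_level_select(rows, labels, n_parts, k_local, k_final):
    """Level 1: select within each vertical partition in parallel.
    Level 2: re-rank the merged survivors and keep k_final of them."""
    blocks = vertical_partitions(len(rows[0]), n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        local = list(pool.map(
            lambda b: select_in_block(rows, labels, b, k_local), blocks))
    candidates = [j for kept in local for j in kept]
    return select_in_block(rows, labels, candidates, k_final)


# Toy data (made up): 4 samples, 6 features; features 0 and 3 separate the classes.
rows = [
    [0.0, 5.0, 1.0, 0.1, 3.0, 2.0],
    [0.1, 4.0, 1.1, 0.2, 3.1, 9.0],
    [1.0, 5.0, 1.0, 0.9, 3.0, 9.1],
    [1.1, 4.0, 1.1, 1.0, 3.1, 2.1],
]
labels = [0, 0, 1, 1]

selected = two_level_select(rows, labels, n_parts=3, k_local=1, k_final=2)
sample_shards = horizontal_partitions(rows, 2)
print(selected, [len(s) for s in sample_shards])
```

Each vertical block can be scored independently, which is what makes the feature-space split convenient for data parallelism. The second level only has to rank the few features that survive the first. In a Spark setting the selected columns would then be passed to a classifier such as a Random Forest, which this sketch omits.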




Updated: 2020-01-04