Identifying the best data-driven feature selection method for boosting reproducibility in classification tasks
Pattern Recognition ( IF 7.5 ) Pub Date : 2020-05-01 , DOI: 10.1016/j.patcog.2019.107183
Nicolas Georges , Islem Mhiri , Islem Rekik

Abstract Extremely high-dimensional data are proliferating in many domains, including computer vision and healthcare applications such as computer-aided diagnosis (CAD). This calls for advanced techniques that reduce data dimensionality and identify the most relevant features for a given classification task, such as distinguishing between healthy and disordered brain states. Although many works boost classification accuracy using a particular feature selection (FS) method, choosing the best one from a large pool of existing FS techniques for boosting feature reproducibility within a dataset of interest remains a formidable challenge. Notably, good performance of a particular FS method does not necessarily imply that the experiment is reproducible or that the identified features are optimal for the entirety of the samples. This paper presents the first attempt to address the following challenge: “Given a set of different feature selection methods $\{FS_1, \dots, FS_K\}$ and a dataset of interest, how to identify the most reproducible and ‘trustworthy’ connectomic features that would produce reliable biomarkers capable of accurately differentiating between two specific conditions?” To this end, we propose the FS-Select framework, which explores the relationships among the different FS methods using a multi-graph architecture based on the feature reproducibility power, average accuracy, and feature stability of each FS method. By extracting the ‘central’ graph node, we identify the most reliable and reproducible FS method for the target brain state classification task, along with the most discriminative features fingerprinting these brain states.
To evaluate the reproducibility power of FS-Select, we perturbed the training set by using different cross-validation strategies on a multi-view small-scale connectomic dataset (late mild cognitive impairment vs Alzheimer’s disease) and a large-scale dataset including autistic vs healthy subjects. Our experiments revealed reproducible connectional features fingerprinting disordered brain states.
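
The abstract gives only a high-level description of the multi-graph idea, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes each FS method is a graph node, that one graph is built per criterion (reproducibility, accuracy, stability), and that the 'central' node is the one with the highest average node strength across graphs. The edge weights (Jaccard overlap of selected features, and one minus the absolute score difference), all function names, and the toy numbers are hypothetical choices made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two collections of feature indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_graph(top_features):
    """Assumed reproducibility graph: edge weight is the overlap between
    the feature sets selected by two FS methods."""
    graph = {name: {} for name in top_features}
    for i, j in combinations(top_features, 2):
        graph[i][j] = graph[j][i] = jaccard(top_features[i], top_features[j])
    return graph


def score_graph(scores):
    """Assumed score graph: edge weight is 1 - |s_i - s_j| for scores in [0, 1]."""
    graph = {name: {} for name in scores}
    for i, j in combinations(scores, 2):
        graph[i][j] = graph[j][i] = 1.0 - abs(scores[i] - scores[j])
    return graph


def central_method(graphs):
    """Return the node with the largest strength averaged over all graphs,
    together with the per-node strengths."""
    names = list(graphs[0])
    strength = {
        n: sum(sum(g[n].values()) for g in graphs) / len(graphs) for n in names
    }
    return max(strength, key=strength.get), strength


# Toy inputs (made up for illustration, not results from the paper).
top_features = {"FS1": {1, 2, 3, 4}, "FS2": {2, 3, 4, 5}, "FS3": {4, 5, 6, 7}}
accuracy = {"FS1": 0.80, "FS2": 0.78, "FS3": 0.60}
stability = {"FS1": 0.70, "FS2": 0.70, "FS3": 0.40}

graphs = [overlap_graph(top_features), score_graph(accuracy), score_graph(stability)]
best, strength = central_method(graphs)
print(best, strength)
```

In this toy setting the method whose selected features and scores agree most with the others ends up as the central node. Other centrality measures or edge definitions could be substituted without changing the overall structure.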

Updated: 2020-05-01