Dealing with confounders and outliers in classification medical studies: The Autism Spectrum Disorders case study, Artificial Intelligence in Medicine


Artificial Intelligence in Medicine ( IF 6.1 ) Pub Date : 2020-07-06 , DOI: 10.1016/j.artmed.2020.101926
Elisa Ferrari ₁ , Paolo Bosco ₂ , Sara Calderoni ₃ , Piernicola Oliva ₄ , Letizia Palumbo ₅ , Giovanna Spera ₅ , Maria Evelina Fantacci ₆ , Alessandra Retico ₅


Machine learning (ML) approaches have been widely applied to medical data to find reliable classifiers that improve diagnosis and detect candidate biomarkers of a disease. However, as a powerful, multivariate, data-driven approach, ML can be misled by biases and outliers in the training set and learn sample-dependent classification patterns. This often occurs in biomedical applications, where the scarcity of data, combined with their heterogeneous nature and complex acquisition process, makes outliers and biases very common. In this work we present a new workflow for ML-based biomedical research that maximizes the generalizability of the classification. The workflow relies on two data selection tools: an autoencoder to identify outliers, and the Confounding Index to determine which characteristics of the sample can mislead classification. As a case study we adopt the controversial research on extracting structural brain biomarkers of Autism Spectrum Disorders (ASD) from magnetic resonance images. A classifier trained on a dataset of 86 subjects, selected using this framework, obtained an area under the receiver operating characteristic curve of 0.79. The feature pattern identified by this classifier still captures the mean differences between the ASD and Typically Developing Control classes on 1460 new subjects in the same age range as the training set, thus providing new insights into the brain characteristics of ASD. We show that the proposed workflow makes it possible to find generalizable patterns even when the dataset is limited, whereas skipping the two steps above and using a larger but poorly designed training set would have produced a sample-dependent classifier.
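The outlier-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the autoencoder architecture, so a linear autoencoder stands in for it here (its optimal weights coincide with the top principal components, so an SVD gives them in closed form). The synthetic data, the function names, the latent size of 2 and the threshold of 10 median absolute deviations are all assumptions chosen for illustration. The idea shown is only the general one: subjects that the autoencoder reconstructs poorly are flagged as candidate outliers.

```python
import numpy as np


def fit_linear_autoencoder(X, n_latent):
    """Fit a linear autoencoder in closed form (hypothetical helper).

    The best linear encoder/decoder pair spans the top principal
    components, so the SVD of the centered data yields the weights.
    """
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_latent]


def reconstruction_error(X, mean, components):
    """Per-subject Euclidean distance between input and reconstruction."""
    centered = X - mean
    latent = centered @ components.T      # encode
    reconstructed = latent @ components   # decode
    return np.sqrt(((centered - reconstructed) ** 2).sum(axis=1))


def flag_outliers(errors, n_mads=10.0):
    """Flag errors far above the median (threshold is an assumption)."""
    median = np.median(errors)
    mad = np.median(np.abs(errors - median))
    return errors > median + n_mads * mad


# Synthetic stand-in for a feature table: 200 subjects, 10 features
# that lie close to a 2-dimensional subspace.
rng = np.random.default_rng(0)
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent_true @ mixing + 0.05 * rng.normal(size=(200, 10))

# Plant three subjects that do not follow the common structure.
outlier_idx = [5, 50, 150]
X[outlier_idx] += rng.normal(scale=1.0, size=(3, 10))

mean, components = fit_linear_autoencoder(X, n_latent=2)
errors = reconstruction_error(X, mean, components)
flags = flag_outliers(errors)

print("flagged subjects:", np.flatnonzero(flags))
```

In a workflow of this kind, the flagged subjects would be excluded (or inspected) before the classifier is trained. A nonlinear autoencoder could replace the linear one without changing the rest of the sketch, since only the per-subject reconstruction error is used downstream.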


Updated: 2020-07-06
