Overly optimistic prediction results on imbalanced data: a case study of flaws and benefits when applying over-sampling, Artificial Intelligence in Medicine


Artificial Intelligence in Medicine (IF 6.1), Pub Date: 2020-11-20, DOI: 10.1016/j.artmed.2020.101987
Gilles Vandewiele₁, Isabelle Dehaene₂, György Kovács₃, Lucas Sterckx₁, Olivier Janssens₁, Femke Ongenae₁, Femke De Backere₁, Filip De Turck₁, Kristien Roelens₂, Johan Decruyenaere₄, Sofie Van Hoecke₁, Thomas Demeester₁


Information extracted from electrohysterography recordings could prove to be a useful additional source of information for estimating the risk of preterm birth. Recently, a large number of studies have reported near-perfect results in distinguishing between recordings of patients who will deliver at term and those who will deliver preterm, using a public resource called the Term/Preterm Electrohysterogram database. However, we argue that these results are overly optimistic due to a methodological flaw. In this work, we focus on one specific type of methodological flaw: applying over-sampling before partitioning the data into mutually exclusive training and testing sets. We show how this biases the results using two artificial datasets, and we reproduce the results of studies in which this flaw was identified. Moreover, we evaluate the actual impact of over-sampling on predictive performance, when applied prior to data partitioning, using the same methodologies of related studies, to provide a realistic view of these methodologies’ generalization capabilities. We make our research reproducible by providing all the code under an open license.
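
The flaw named in the abstract can be illustrated with a minimal sketch. This is not the authors' code or either of their artificial datasets. It assumes a synthetic dataset whose features are pure Gaussian noise, random duplication of minority samples as the over-sampler (a stand-in, not SMOTE or any method used in the studies discussed), and a 1-nearest-neighbour classifier. All function names are hypothetical.

```python
import random

def make_noise_data(n_major, n_minor, n_features, rng):
    """Features are pure noise, so no classifier can genuinely beat chance."""
    X = [[rng.gauss(0, 1) for _ in range(n_features)]
         for _ in range(n_major + n_minor)]
    y = [0] * n_major + [1] * n_minor
    return X, y

def random_oversample(X, y, rng):
    """Duplicate randomly chosen minority samples until both classes are equal in size."""
    minority = [i for i, label in enumerate(y) if label == 1]
    n_extra = (len(y) - len(minority)) - len(minority)
    picks = [rng.choice(minority) for _ in range(n_extra)]
    return X + [X[i] for i in picks], y + [1] * n_extra

def split(X, y, test_fraction, rng):
    """Shuffle indices and return X_train, y_train, X_test, y_test."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

def predict_1nn(X_train, y_train, x):
    """Return the label of the closest training sample (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X_train]
    return y_train[dists.index(min(dists))]

def minority_recall(X_train, y_train, X_test, y_test):
    """Fraction of minority-class test samples that are predicted as minority."""
    hits = [predict_1nn(X_train, y_train, x) == 1
            for x, t in zip(X_test, y_test) if t == 1]
    return sum(hits) / len(hits)

rng = random.Random(0)
X, y = make_noise_data(n_major=270, n_minor=30, n_features=5, rng=rng)

# Flawed: over-sample first, then split. Copies of the same minority
# sample land on both sides of the split, so the test set leaks into training.
X_os, y_os = random_oversample(X, y, rng)
flawed = minority_recall(*split(X_os, y_os, 0.3, rng))

# Correct: split first, then over-sample the training set only.
X_tr, y_tr, X_te, y_te = split(X, y, 0.3, rng)
X_tr_os, y_tr_os = random_oversample(X_tr, y_tr, rng)
correct = minority_recall(X_tr_os, y_tr_os, X_te, y_te)

print(f"minority recall, over-sampling before split: {flawed:.2f}")
print(f"minority recall, over-sampling after split:  {correct:.2f}")
```

Because the features carry no signal, any high minority recall in the first variant comes from test samples having exact duplicates in the training set, not from generalization. The 1-nearest-neighbour classifier is chosen here only because it makes that leak easy to see.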


Updated: 2020-12-04
