Abstract
Motivation: Imputation is a prominent strategy for handling missing values (MVs) in proteomics data analysis pipelines. However, the performance of imputation methods is difficult to assess and varies strongly with the characteristics of the data. To overcome this issue, we present the concept of a data-driven selection of a suitable imputation algorithm (DIMA).
Results: The performance and broad applicability of DIMA are demonstrated on 121 quantitative proteomics data sets from the PRIDE database and on simulated data containing 5–50% MVs with different proportions of missing not at random and missing completely at random values. DIMA reliably suggests a high-performing imputation algorithm that is always among the three best algorithms and results in a root mean square error difference (ΔRMSE) ≤ 10% in 84% of cases.
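The general idea of selecting an imputation method from the data can be illustrated with a minimal Python sketch. It is not the DIMA implementation: the function names, the three simple candidate imputers, the intensity-weighted rule used to place missing not at random values, and the simulated complete matrix are all assumptions made for illustration. Known values are hidden, each candidate fills them in, and the candidates are compared by their RMSE on the hidden entries.

```python
import numpy as np

def simulate_missing(X, frac, mnar_share, rng):
    """Boolean mask hiding `frac` of the entries of X (hypothetical helper).

    A share `mnar_share` of the hidden entries is drawn with a bias towards
    low intensities (an assumed MNAR mechanism); the rest is drawn uniformly (MCAR).
    """
    n_total = int(round(frac * X.size))
    n_mnar = int(round(mnar_share * n_total))
    ranks = np.empty(X.size)
    ranks[np.argsort(X.ravel())] = np.arange(X.size)  # rank 0 = lowest intensity
    weights = (X.size - ranks) ** 2                   # low intensities weigh more
    weights /= weights.sum()
    mask = np.zeros(X.size, dtype=bool)
    mask[rng.choice(X.size, size=n_mnar, replace=False, p=weights)] = True
    remaining = np.flatnonzero(~mask)
    mask[rng.choice(remaining, size=n_total - n_mnar, replace=False)] = True
    return mask.reshape(X.shape)

def fill_rowwise(X, stat):
    """Replace NaNs in each row (protein) by `stat` of that row's observed values."""
    out = X.copy()
    fallback = stat(out[~np.isnan(out)])  # used when a row has no observed value
    for row in out:                       # each row is a view into `out`
        observed = row[~np.isnan(row)]
        row[np.isnan(row)] = stat(observed) if observed.size else fallback
    return out

def fill_global_min(X):
    """Replace NaNs by the smallest observed value in the whole matrix."""
    out = X.copy()
    out[np.isnan(out)] = np.nanmin(out)
    return out

def rank_imputers(X_complete, candidates, frac, mnar_share, seed=0):
    """RMSE of every candidate on the hidden entries; also returns the mask."""
    rng = np.random.default_rng(seed)
    mask = simulate_missing(X_complete, frac, mnar_share, rng)
    X_masked = X_complete.copy()
    X_masked[mask] = np.nan
    scores = {}
    for name, impute in candidates.items():
        err = impute(X_masked)[mask] - X_complete[mask]
        scores[name] = float(np.sqrt(np.mean(err ** 2)))
    return scores, mask

# Assumed toy data: 200 proteins x 6 samples of log-intensities.
rng = np.random.default_rng(1)
protein_level = rng.normal(25.0, 2.0, size=(200, 1))
X = protein_level + rng.normal(0.0, 0.3, size=(200, 6))

candidates = {
    "row_mean": lambda M: fill_rowwise(M, np.mean),
    "row_median": lambda M: fill_rowwise(M, np.median),
    "global_min": fill_global_min,
}
scores, mask = rank_imputers(X, candidates, frac=0.2, mnar_share=0.5)
best = min(scores, key=scores.get)  # candidate with the lowest RMSE
```

In this sketch `best` is simply the candidate with the lowest RMSE for one MV fraction and one MNAR share; repeating the call over several fractions and shares would show how the ranking depends on the missingness pattern.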
Availability and Implementation: Source code is freely available for download at github.com/clemenskreutz/OmicsData.
Competing Interest Statement
The authors have declared no competing interest.