Missing Data Preprocessing in Credit Classification: One-Hot Encoding or Imputation?
Emerging Markets Finance and Trade (IF 4.859), Pub Date: 2020-10-08, DOI: 10.1080/1540496x.2020.1825935
Lean Yu, Rongtian Zhou, Rongda Chen, Kin Keung Lai

ABSTRACT

Missing data has become an increasingly serious problem in credit risk classification. A one-hot encoding-based data preprocessing method is proposed to address the missing data problem in credit classification. In this paradigm, the proposed preprocessing method is first applied to the missing values to complete the dataset. A classification and regression tree (CART) model is then applied to the completed dataset to compare the performance of different preprocessing methods. The experimental results indicate that the proposed one-hot encoding method performs best when the missing rate is high. When the missing rate is low, the random sample (RS) imputation method performs better, although it entails a greater computational cost than the other imputation methods considered in the study. In particular, when a high missing rate is coupled with data imbalance, the proposed one-hot encoding-based method shows not only high accuracy but also strong robustness, and it requires less computational time.
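The abstract does not spell out the encoding itself, so the following is only a minimal sketch of the two preprocessing ideas it contrasts, under stated assumptions: missing entries are marked with `None`, the one-hot variant gives missing values their own indicator column, and random sample imputation draws replacements from the observed values of the same column. The function names and the `housing` example column are hypothetical and are not taken from the paper.

```python
import random

MISSING = None  # assumed marker for a missing value (not from the paper)


def one_hot_with_missing(column):
    """One-hot encode a categorical column, with an extra indicator for missing."""
    categories = sorted({v for v in column if v is not MISSING})
    names = [f"is_{c}" for c in categories] + ["is_missing"]
    rows = []
    for v in column:
        row = [1 if v == c else 0 for c in categories]
        row.append(1 if v is MISSING else 0)  # missing becomes its own category
        rows.append(row)
    return names, rows


def random_sample_impute(column, rng):
    """Replace each missing value with a random draw from the observed values."""
    observed = [v for v in column if v is not MISSING]
    if not observed:
        raise ValueError("column has no observed values to sample from")
    return [rng.choice(observed) if v is MISSING else v for v in column]


# Hypothetical categorical feature with two missing entries.
housing = ["own", None, "rent", "own", None]

names, encoded = one_hot_with_missing(housing)
imputed = random_sample_impute(housing, random.Random(0))
```

In this sketch the one-hot variant never guesses a value: a missing entry simply becomes one more category that a downstream classifier such as CART can split on, whereas random sample imputation fills each gap with a value drawn from the observed entries.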




Updated: 2020-10-08