A new approach in reject inference of using ensemble learning based on global semi-supervised framework

doi:10.1016/j.future.2020.03.047

Future Generation Computer Systems

Volume 109, August 2020, Pages 382-391

Highlights

• A novel global semi-supervised framework for reject inference is proposed.
• A novel algorithm that combines multiple classifiers and clustering algorithms is introduced.
• The framework is shown to outperform several conventional techniques.
• The framework is validated on a real data set.

Abstract

Credit scoring in online Peer-to-Peer (P2P) lending faces a major challenge: credit scoring models discard rejected applicants. This selective discarding biases the model parameters and ultimately degrades the performance of credit evaluation. One approach to this problem is reject inference, a technique that infers the status of rejected samples and incorporates the results into credit scoring models. The most popular practice of reject inference is to use a credit scoring model built only on accepted samples to directly predict the status of rejected samples. However, in online P2P lending the distribution of accepted samples differs from that of rejected samples. We propose SSL-EC3, a global semi-supervised framework that merges multiple classifiers and clustering algorithms to make better use of the information in rejected samples. It uses multiple unsupervised models (clustering algorithms) to explore the internal relationships among all samples, and then incorporates this information into the ensemble of supervised models (classifiers) to help correct the initial classification results of rejected samples. In addition, we explore dynamic ensemble selection (DES) to select an appropriate ensemble of classifiers for each sample to be classified. Experimental results on real data sets demonstrate the benefits of the proposed methods over conventional reject inference methods.

Introduction

Credit scoring is an effective tool for assessing the potential default risks of borrowers, which protects the interests of platforms and investors [1]. Based on borrowers’ history records, including personal information and payment records, credit scoring roughly divides borrowers into two classes, good or bad. Traditional credit scoring models use only accepted applicants and ignore rejected ones, since rejected applicants have no class labels. This leads to a sample selection bias problem and can degrade the performance of the models. It is therefore unreasonable to use these models to predict the status of all unknown borrowers.

Online Peer-to-Peer (P2P) lending provides a convenient service that allows users to trade directly. One drawback of this convenience is that investors cannot accurately assess the creditworthiness of borrowers, which puts the interests of platforms and investors at risk. To protect these interests, platforms and investors usually set high thresholds for borrowers, which leads to a large number of applicants being rejected. In such cases, traditional credit scoring models give biased predictions of borrowers’ default risks. How to incorporate rejected applicants into credit scoring models has become a major challenge, especially in online P2P lending.

Reject inference refers to inferring the status (good or bad) of rejected applicants and adding the results to the process of building credit scoring models [2]. Sohn et al. [3], [4] argue that reject inference is essentially a missing data problem. They divide missing data mechanisms into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Each mechanism calls for a different treatment. Furthermore, reject inference avoids wasting resources and improves the stability of credit scoring models [5], [6], [7].
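To make the three mechanisms concrete, the following toy simulation shows how the acceptance rule changes the bad rate observed among accepted applicants. It is a minimal sketch under stated assumptions: the function names, the single `score` feature, the thresholds, and the default model (probability of default equal to one minus the score) are all hypothetical and are not taken from the paper.

```python
import random

def is_accepted(score, will_default, mechanism, rng):
    """Toy acceptance rule (hypothetical) for one applicant.

    score        -- observed credit score in [0, 1]
    will_default -- true outcome, never seen by the lender (True = bad)
    """
    if mechanism == "MCAR":
        # Acceptance ignores everything about the applicant (coin toss).
        return rng.random() < 0.5
    if mechanism == "MAR":
        # Acceptance depends only on observed information.
        return score >= 0.6
    if mechanism == "MNAR":
        # Acceptance also depends on information the model never observes,
        # proxied here by the true outcome itself.
        threshold = 0.8 if will_default else 0.4
        return score >= threshold
    raise ValueError(f"unknown mechanism: {mechanism}")

def bad_rate_among_accepted(mechanism, n=2000, seed=0):
    """Simulate n applicants and return the share of bad ones among accepts."""
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n):
        score = rng.random()
        will_default = rng.random() > score  # assumed: P(default) = 1 - score
        if is_accepted(score, will_default, mechanism, rng):
            outcomes.append(will_default)
    return sum(outcomes) / len(outcomes)

rates = {m: bad_rate_among_accepted(m) for m in ("MCAR", "MAR", "MNAR")}
for name, rate in rates.items():
    print(name, round(rate, 3))
```

In this toy setting, only the coin-toss rule leaves the accepted sample representative of the whole applicant population; the other two rules lower the observed bad rate, which is the selection effect a model trained on accepted applicants alone would inherit.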

In recent years, many reject inference methods, such as extrapolation and augmentation [8], have emerged and been successfully applied to credit scoring. Machine learning models in particular have attracted researchers’ attention.
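As an illustration of the extrapolation idea, the sketch below follows one common reading of it: fit a model on accepted samples, use it to label rejected samples, then refit on the combined set. The nearest-centroid classifier, the function names, and the toy data are assumptions chosen to keep the example self-contained; they are not the models used in the paper.

```python
def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_nearest_centroid(X, y):
    """Toy classifier: return {label: centroid of samples with that label}."""
    return {label: centroid([x for x, t in zip(X, y) if t == label])
            for label in set(y)}

def predict(model, x):
    """Assign x to the label whose centroid is closest (squared distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(model, key=lambda label: sq_dist(model[label], x))

def extrapolate(X_acc, y_acc, X_rej):
    """Label rejects with a model fit on accepts, then refit on everything."""
    base = fit_nearest_centroid(X_acc, y_acc)
    y_rej = [predict(base, x) for x in X_rej]           # inferred statuses
    final = fit_nearest_centroid(X_acc + X_rej, y_acc + y_rej)
    return y_rej, final

# Hypothetical data: 0 = good, 1 = bad.
X_acc = [[0.9, 0.8], [0.8, 0.9], [0.3, 0.2], [0.2, 0.3]]
y_acc = [0, 0, 1, 1]
X_rej = [[0.1, 0.1], [0.7, 0.6]]

y_rej, final_model = extrapolate(X_acc, y_acc, X_rej)
print(y_rej)  # → [1, 0]
```

Because the inferred labels come from a model fit on accepted samples only, any bias in that model is carried into the refit model, which is the weakness the abstract points out for this practice.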

Machine learning models for reject inference are roughly divided into two types: supervised models and semi-supervised models. Supervised models have strong predictive power, but cannot exploit the potential information in unlabeled samples [4], [9]. Recently, research has focused on semi-supervised models [10], [11], [12]. Semi-supervised models learn from labeled and unlabeled samples simultaneously, so they seem naturally suited to reject inference [13]. However, semi-supervised learning is subject to many restrictions in practical applications. For example, the labels of labeled samples should be correct, and the distribution of unlabeled samples should be the same as or similar to that of labeled samples. Real situations are often not ideal, especially in online P2P lending. As shown in Table 2, there is an obvious difference between accepted (labeled) and rejected (unlabeled) samples in Lending Club.

We focus on how to fully exploit the predictive power of classifiers and the internal relationships between rejected and accepted samples. Many works have shown that combining multiple classifiers and clustering algorithms yields more stable classification results and brings rejected samples into play [14], [15], [16], [17]. We introduce an ensemble learning approach that maximizes the consensus among the outputs of multiple classifiers and clustering algorithms.
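The sketch below gives a rough feel for how clustering output can settle a disagreement between classifiers. It is a simplified stand-in, not the EC3 objective: every classifier contributes one group per predicted class, every clustering contributes one group per cluster, and class probabilities are averaged back and forth between samples and groups. The function name, the `alpha` weight on classifier votes, and the toy inputs are assumptions.

```python
def consensus_labels(classifier_preds, cluster_assignments, alpha=2.0, n_iter=20):
    """Combine 0/1 classifier predictions with 0/1 cluster ids by averaging."""
    n = len(classifier_preds[0])
    groups = []  # (member indices, prior class distribution or None)
    for preds in classifier_preds:
        for label in (0, 1):
            members = [i for i in range(n) if preds[i] == label]
            prior = [1.0, 0.0] if label == 0 else [0.0, 1.0]
            groups.append((members, prior))
    for clusters in cluster_assignments:
        for cid in (0, 1):
            members = [i for i in range(n) if clusters[i] == cid]
            groups.append((members, None))  # clusters carry no class label

    u = [[0.5, 0.5] for _ in range(n)]  # per-sample class probabilities
    for _ in range(n_iter):
        q = []  # per-group class probabilities
        for members, prior in groups:
            weight = alpha if prior is not None else 0.0
            total = len(members) + weight
            if not members:
                q.append([0.5, 0.5])  # empty group: never read by any sample
                continue
            q.append([
                (sum(u[i][c] for i in members)
                 + (weight * prior[c] if prior is not None else 0.0)) / total
                for c in (0, 1)
            ])
        for i in range(n):
            mine = [q[j] for j, (members, _) in enumerate(groups) if i in members]
            u[i] = [sum(d[c] for d in mine) / len(mine) for c in (0, 1)]
    return [0 if p[0] >= p[1] else 1 for p in u]

# Two classifiers disagree on sample 2; one clustering groups it with 0 and 1.
clf_a = [0, 0, 0, 1, 1, 1]
clf_b = [0, 0, 1, 1, 1, 1]
clustering = [0, 0, 0, 1, 1, 1]
labels = consensus_labels([clf_a, clf_b], [clustering])
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

In this toy input the two classifiers split on the third sample, and its cluster membership tips the result toward the label shared by its cluster mates.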

In this work, a particular version of the combining clustering and classification for ensemble learning (EC3) framework [18] is integrated into a global semi-supervised learning (SSL) framework to perform reject inference, namely SSL-EC3. We use classifiers combined with clustering algorithms to obtain a better credit scoring model. SSL-EC3 is built on two fundamental hypotheses: (i) the ensemble of classifiers has powerful classification capabilities, which ensures the accuracy of credit scoring models; (ii) the integration of clustering methods can explore the inherent relationships between accepted and rejected samples, which ensures the generalization ability of credit scoring models. Furthermore, we explore dynamic ensemble selection (DES) to automatically select the appropriate classifiers for each sample [19]. DES is provided by a Python library that implements advanced dynamic classifier and ensemble selection techniques. Our experimental results show that SSL-EC3 is helpful for reject inference in online P2P lending.
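To illustrate what selecting classifiers per sample can look like, here is a minimal sketch of one simple DES rule: for each query, keep the classifiers that are most accurate on its nearest labelled validation samples, then take a majority vote among them. It does not reproduce the API of the library cited in the paper; the function name, the neighbourhood size `k`, and the toy classifiers are assumptions.

```python
def select_and_predict(classifiers, X_val, y_val, x, k=3):
    """Vote among the classifiers that do best on x's k nearest neighbours."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Region of competence: indices of the k validation samples closest to x.
    nearest = sorted(range(len(X_val)), key=lambda i: sq_dist(X_val[i], x))[:k]

    def local_accuracy(clf):
        return sum(clf(X_val[i]) == y_val[i] for i in nearest) / len(nearest)

    scores = [local_accuracy(clf) for clf in classifiers]
    best = max(scores)
    selected = [clf for clf, s in zip(classifiers, scores) if s == best]

    votes = [clf(x) for clf in selected]
    return 1 if 2 * sum(votes) > len(votes) else 0  # ties go to class 0

# Hypothetical classifiers on one feature (0 = good, 1 = bad).
def threshold_rule(x):
    return 1 if x[0] < 0.5 else 0

def always_bad(x):
    return 1

def always_good(x):
    return 0

X_val = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y_val = [1, 1, 1, 0, 0, 0]
pool = [threshold_rule, always_bad, always_good]

print(select_and_predict(pool, X_val, y_val, [0.85]))  # → 0
print(select_and_predict(pool, X_val, y_val, [0.15]))  # → 1
```

The classifier `always_bad` is dropped for the first query and `always_good` for the second, so the selected ensemble differs from sample to sample.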

The structure of this paper is as follows. Section 2 gives an overview of various methods based on reject inference in credit scoring models. Based on the existing methods, we describe the proposed SSL-EC3 framework in Section 3. Section 4 describes the data needed for our experiment and necessary preparations. Section 5 describes and discusses the experimental results in detail. Finally, we conclude the paper.

Related work

In this section, we introduce three different types of data missing mechanisms in online P2P lending and the corresponding solutions.

$MCAR$ means that whether an applicant is accepted or rejected does not depend on his/her history records or personal information, but is purely random. That is, platforms or investors adopt a method similar to tossing a coin to decide whether to accept applicants [4]. Obviously, platforms or investors will not expose themselves to such risks. Therefore, this situation is

Methodology

Table 1 shows some notations used in this paper. Suppose we have $N$ samples $X = \{x_{1}, \dots, x_{N}\}$. For accepted samples, we know that they belong to 2 different classes $C = \{0, 1\}$, where 0 means the sample is labeled as good and 1 means bad. We have $b_{1}$ base classifiers and $b_{2}$ base clustering algorithms. To simplify the experiment, each classifier assigns only one class label to a sample, and each clustering method produces only 2 clusters. Therefore, these classifiers and clustering algorithms generate $g_{1} = b_{1} \times 2$

Experimental setup

In this section, we introduce data sets, experimental steps and the performance indicators for measuring the credit scoring models used in our experiment.
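The results discussion refers to accuracy, precision, and recall. As a reminder of how these indicators are computed for a binary credit scoring task, here is a minimal sketch. Treating the bad class (label 1) as the positive class is an assumption, as are the function name and the toy labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical labels: 0 = good, 1 = bad.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # → 0.667 0.667 0.667
```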

Results discussion

First, we check the performance of the EC3 algorithm on an artificial data set at three levels, as shown in Fig. 3. From the perspective of attribute dimensionality, we progressively reduce the number of attributes and observe the changes in accuracy, precision, and recall of each model. We can see that as the number of attributes decreases, model performance gradually deteriorates. An attribute count of 16 is an important point. When the number of attributes is less than 16, the

Conclusion

In online P2P lending, many borrowers’ applications are rejected. When building a credit scoring model, we need to incorporate these data to fully assess the potential risks of loans. This paper proposes a framework that combines multiple classifiers with clustering approaches. The ensemble of classifiers can improve the accuracy of credit scoring, and the integration of clustering methods can improve its generalization ability.

Experimental results on real data set show that SSL-EC3

CRediT authorship contribution statement

Yan Liu: Data curation, Formal analysis. Xiner Li: Software, Conceptualization, Methodology. Zaimei Zhang: Supervision, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to acknowledge the editor and anonymous reviewers for their comments, which have helped to improve the paper. This work was supported by the National Natural Science Foundation of China (Grant 61702053, 61872135), the Natural Science Foundation of Hunan Province (Grant 2018JJ2066), and the open fund project for innovation platform of universities in Hunan (Grant 11K002).

Yan Liu received the PhD degree in computer science and technology from Hunan University, China, in 2010. He is an Associate Professor at the College of Computer Science and Electronic Engineering of Hunan University, China. His research areas include big data, artificial intelligence, and parallel and distributed systems.

References (37)

Saberi, Morteza et al. A granular computing-based approach to credit scoring modeling. Neurocomputing (2013).

Crook, Jonathan et al. Does reject inference really improve the performance of application scoring models? J. Bank. Financ. (2004).

Sohn, So Young et al. Reject inference in credit operations based on survival analysis. Expert Syst. Appl. (2006).

Banasik, John et al. Reject inference, augmentation, and sample selection. European J. Oper. Res. (2007).

Tian, Ye et al. A new approach for reject inference in credit scoring using kernel-free fuzzy quadratic surface support vector machines. Appl. Soft Comput. (2018).

Li, Zhiyong et al. Reject inference in credit scoring using semi-supervised support vector machines. Expert Syst. Appl. (2017).

Xia, Yufei et al. A rejection inference technique based on contrastive pessimistic likelihood estimation for p2p lending. Electron. Commer. Res. Appl. (2018).

Tsai, Chih-Fong et al. Credit rating by hybrid machine learning techniques. Appl. Soft Comput. (2010).

Hsieh, Nan-Chen et al. A data driven ensemble classifier for credit scoring analysis. Expert Syst. Appl. (2010).

Bücker, Michael et al. Reject inference in consumer credit scoring with nonignorable missing data. J. Bank. Financ. (2013).

Lee, Eunkyoung et al. Herding behavior in online p2p lending: An empirical investigation. Electron. Commer. Res. Appl. (2012).

Nanni, Loris et al. An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Syst. Appl. (2009).

Crook, Jonathan N. et al. Recent developments in consumer credit risk assessment. European J. Oper. Res. (2007).

Feelders, A.J. Credit scoring and reject inference with mixture models. Int. J. Intell. Syst. Account. Finance Manag. (2000).

Smith, Andrew et al. A Bayesian network framework for reject inference.

Kim, Y. et al. Technology scoring model considering rejected applicants and effect of reject inference. J. Oper. Res. Soc. (2007).

Chen, G. Gary et al. The economic value of reject inference in credit scoring.

Maldonado, Sebastián et al. A semi-supervised approach for reject inference in credit scoring using SVMs.
Xiner Li is a master’s student at Hunan University. Her research interests include data mining and big data.

Zaimei Zhang received the Ph.D. degree in management science and engineering from Hunan University, China, in 2011. She is an Assistant Professor at the School of Economics and Management of Changsha University of Science and Technology, China. Her research interests include financial engineering, big data and artificial intelligence.
