A new approach in reject inference of using ensemble learning based on global semi-supervised framework
Introduction
Credit scoring is an effective tool for assessing the potential default risk of borrowers, which protects the interests of platforms and investors [1]. Based on borrowers' historical records, including personal information and payment records, credit scoring roughly divides borrowers into two classes, good or bad. Traditional credit scoring models use only accepted applicants and ignore rejected ones, since rejected applicants have no class labels. This leads to a sample selection bias problem and can degrade the performance of the models. It is therefore unreasonable to use these models to predict the status of all unknown borrowers.
Online Peer-to-Peer (P2P) lending provides a convenient service that allows users to trade directly. One drawback of this convenience is that investors cannot accurately assess the credit of borrowers, so the interests of platforms and investors face enormous challenges. To protect these interests, platforms and investors usually set high thresholds for borrowers, which leads to a large number of applicants being rejected. As a result, traditional credit scoring models give biased predictions of borrowers’ default risk in this setting. How to incorporate rejected applicants into credit scoring models has become a major challenge, especially in online P2P lending.
Reject inference refers to the use of an approach to infer the status (good or bad) of rejected applicants and to incorporate the results into the construction of credit scoring models [2]. Sohn et al. [3], [4] argue that reject inference is essentially a missing data problem. They divide the missing data mechanisms into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Each mechanism calls for a different approach. Furthermore, reject inference avoids wasting resources and improves the stability of credit scoring models [5], [6], [7].
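To make the three mechanisms concrete, the following is a minimal simulation sketch, not taken from the paper. The default model, the observed feature `x`, the unobserved feature `u`, and the acceptance rules are all illustrative assumptions. Under MCAR acceptance is a coin toss, under MAR it depends only on the observed feature, and under MNAR it also depends on information the scoring model never sees.

```python
import math
import random


def bad_rates(mechanism, n=20000, seed=0):
    """Return (population bad rate, bad rate among accepted applicants)."""
    rng = random.Random(seed)
    total_bad = accepted = accepted_bad = 0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)  # observed feature (hypothetical credit signal)
        u = rng.gauss(0.0, 1.0)  # feature the scoring model never observes
        coin = rng.random()      # drawn every time so all mechanisms see the same applicants
        # Assumed default model: a higher x or u means a lower risk of being bad.
        p_bad = 1.0 / (1.0 + math.exp(1.0 + x + u))
        bad = rng.random() < p_bad
        if mechanism == "MCAR":
            accept = coin < 0.5   # acceptance ignores the applicant entirely
        elif mechanism == "MAR":
            accept = x > 0.0      # acceptance depends only on observed data
        elif mechanism == "MNAR":
            accept = x + u > 0.0  # acceptance also depends on unobserved data
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        total_bad += bad
        if accept:
            accepted += 1
            accepted_bad += bad
    return total_bad / n, accepted_bad / accepted


for name in ("MCAR", "MAR", "MNAR"):
    population_rate, accepted_rate = bad_rates(name)
    print(f"{name}: population bad rate {population_rate:.3f}, "
          f"accepted bad rate {accepted_rate:.3f}")
```

In this toy setting the accepted sample mirrors the population only under MCAR. Under MAR and MNAR the accepted applicants look safer than the population as a whole, which is the selection bias that reject inference tries to correct.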
In recent years, many reject inference methods, such as extrapolation and augmentation [8], have emerged and been successfully applied to credit scoring. Machine learning models have also attracted researchers’ attention.
Machine learning models for reject inference are roughly divided into two types, supervised models and semi-supervised models. Supervised models have strong predictive power, but cannot exploit the potential information in unlabeled samples [4], [9]. Recently, research has focused on semi-supervised models [10], [11], [12]. Semi-supervised models simultaneously model both labeled and unlabeled samples, so they seem naturally suited to reject inference [13]. However, semi-supervised learning is subject to many restrictions in practical applications. For example, the classes of labeled samples should be correct, and the distribution of unlabeled samples should be the same as or similar to that of labeled samples. Real situations are often not ideal, especially in online P2P lending. As shown in Table 2, there is an obvious difference between accepted (labeled) and rejected (unlabeled) samples in Lending Club.
We focus on how to fully exploit the predictive power of classifiers and the internal relationships between rejected and accepted samples. Many works have shown that combining multiple classifiers and clustering algorithms yields more stable classification results and brings rejected samples into play [14], [15], [16], [17]. We introduce an ensemble learning approach that maximizes the consensus among the outputs of multiple classifiers and clustering algorithms.
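The idea of reconciling classifier outputs with cluster structure can be illustrated with a deliberately simplified sketch. It is not the EC3 optimization used in this work: it performs a single smoothing step in which each sample's classifier vote share is averaged with the mean vote share of the clusters it belongs to. The function name `consensus_scores`, the mixing weight `alpha`, and the toy votes are hypothetical.

```python
def consensus_scores(class_votes, cluster_ids, alpha=0.5):
    """One-step consensus between classifier votes and cluster structure.

    class_votes[i]: 0/1 votes from the base classifiers for sample i (1 = bad).
    cluster_ids[i]: cluster id of sample i under each base clustering.
    Returns one score in [0, 1] per sample; higher means more likely bad.
    """
    n = len(class_votes)
    # Share of classifiers voting "bad" for each sample.
    own = [sum(votes) / len(votes) for votes in class_votes]
    n_clusterings = len(cluster_ids[0])
    scores = []
    for i in range(n):
        cluster_means = []
        for c in range(n_clusterings):
            # Vote shares of all samples that share sample i's cluster in clustering c.
            members = [own[j] for j in range(n)
                       if cluster_ids[j][c] == cluster_ids[i][c]]
            cluster_means.append(sum(members) / len(members))
        neighbourhood = sum(cluster_means) / n_clusterings
        scores.append(alpha * own[i] + (1 - alpha) * neighbourhood)
    return scores


# Toy example: 3 classifiers, 2 clusterings, 6 samples.
votes = [[0, 0, 0], [0, 0, 0], [1, 1, 0], [1, 1, 1], [1, 1, 1], [1, 0, 1]]
clusters = [["A", "A"], ["A", "A"], ["A", "A"], ["B", "B"], ["B", "B"], ["B", "B"]]
scores = consensus_scores(votes, clusters)
labels = [1 if s > 0.5 else 0 for s in scores]
print(labels)
```

In the toy data the third sample is voted bad by two of three classifiers, but it sits in a cluster whose other members are unanimously good, so the smoothed score pulls it back toward the good class. This is the kind of correction that cluster structure can contribute when the clusterings are fit on accepted and rejected samples together.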
In this work, a particular version of the combining clustering and classification for ensemble learning (EC3) framework [18] is integrated into a global semi-supervised learning (SSL) framework to perform reject inference, namely SSL-EC3. We use classifiers combined with clustering algorithms to obtain a better credit scoring model. SSL-EC3 is built on two fundamental hypotheses: (i) the ensemble of classifiers has powerful classification capabilities, which ensures the accuracy of credit scoring models; (ii) the integration of clustering methods can explore the inherent relationships between accepted and rejected samples, which ensures the generalization ability of credit scoring models. Furthermore, we adopt dynamic ensemble selection (DES) to automatically select the appropriate classifier for each sample [19]. For DES we rely on a Python library that implements advanced dynamic classifier and ensemble selection techniques. Our experimental results show that SSL-EC3 is helpful for reject inference in online P2P lending.
The structure of this paper is as follows. Section 2 gives an overview of various methods based on reject inference in credit scoring models. Based on the existing methods, we describe the proposed SSL-EC3 framework in Section 3. Section 4 describes the data needed for our experiment and necessary preparations. Section 5 describes and discusses the experimental results in detail. Finally, we conclude the paper.
Related work
In this section, we introduce three different types of missing data mechanisms in online P2P lending and the corresponding solutions.
MCAR indicates that whether an applicant is accepted or rejected does not depend on his/her history records or personal information, but is purely random. That is, platforms or investors use a method similar to tossing a coin to decide whether to accept applicants [4]. Obviously, platforms or investors will not expose themselves to such risks. Therefore, this situation is
Methodology
Table 1 shows some notations used in this paper. Suppose we have N samples. For accepted samples, we know that they belong to 2 different classes: 0 means the sample is labeled as good, and 1 means bad. We have b1 base classifiers and b2 base clustering algorithms. To simplify the experiment, each classifier assigns only one class label to a sample, and each clustering method produces only 2 clusters. Therefore, these classifiers and clustering algorithms generate g1 = b1 * 2
Experimental setup
In this section, we introduce data sets, experimental steps and the performance indicators for measuring the credit scoring models used in our experiment.
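The results section reports accuracy, precision, and recall. As a reference for how indicators of this kind are computed from predicted and true labels, here is a minimal sketch. It assumes that class 1 (bad) is treated as the positive class, which is an illustrative choice and not stated in the text. The function name `classification_metrics` and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for binary labels.

    `positive` is the class treated as positive (assumed here: 1 = bad).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Guard against division by zero when no positives are predicted or present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


y_true = [0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 0, 1]
print(classification_metrics(y_true, y_pred))
```

Precision here answers how many of the applicants flagged as bad really are bad, while recall answers how many of the truly bad applicants were flagged.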
Results discussion
First, we check the performance of the EC3 algorithm on an artificial data set at three levels, as shown in Fig. 3. From the perspective of attribute dimensionality, we continuously reduce the number of attributes and observe the changes in accuracy, precision, and recall of each model. We can see that as the number of attributes decreases, model performance gradually deteriorates. Sixteen attributes is an important point. When the number of attributes is less than 16, the
Conclusion
In online P2P lending, many borrowers’ applications are rejected. When building a credit scoring model, we need to incorporate these data to fully assess the potential risks of loans. This paper proposes a framework that combines multiple classifiers with clustering approaches. The ensemble of classifiers can improve the accuracy of credit scoring, and the integration of clustering methods can improve its generalization ability.
Experimental results on a real data set show that SSL-EC3
CRediT authorship contribution statement
Yan Liu: Data curation, Formal analysis. Xiner Li: Software, Conceptualization, Methodology. Zaimei Zhang: Supervision, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to acknowledge the editor and anonymous reviewers for their comments, which have helped to improve the paper. This work was supported by the National Natural Science Foundation of China (Grant 61702053, 61872135), the Natural Science Foundation of Hunan Province (Grant 2018JJ2066), and the open fund project for innovation platform of universities in Hunan (Grant 11K002).
Yan Liu received the PhD degree in computer science and technology from Hunan University, China, in 2010. He is an Associate Professor at the College of Computer Science and Electronic Engineering of Hunan University, China. His research areas include big data, artificial intelligence, and parallel and distributed systems.
References (37)
- et al. A granular computing-based approach to credit scoring modeling. Neurocomputing (2013).
- et al. Does reject inference really improve the performance of application scoring models? J. Bank. Financ. (2004).
- et al. Reject inference in credit operations based on survival analysis. Expert Syst. Appl. (2006).
- et al. Reject inference, augmentation, and sample selection. European J. Oper. Res. (2007).
- et al. A new approach for reject inference in credit scoring using kernel-free fuzzy quadratic surface support vector machines. Appl. Soft Comput. (2018).
- et al. Reject inference in credit scoring using semi-supervised support vector machines. Expert Syst. Appl. (2017).
- et al. A rejection inference technique based on contrastive pessimistic likelihood estimation for p2p lending. Electron. Commer. Res. Appl. (2018).
- et al. Credit rating by hybrid machine learning techniques. Appl. Soft Comput. (2010).
- et al. A data driven ensemble classifier for credit scoring analysis. Expert Syst. Appl. (2010).
- et al. Reject inference in consumer credit scoring with nonignorable missing data. J. Bank. Financ. (2013).
- Herding behavior in online p2p lending: An empirical investigation. Electron. Commer. Res. Appl.
- An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Syst. Appl.
- Recent developments in consumer credit risk assessment. European J. Oper. Res.
- Credit scoring and reject inference with mixture models. Int. J. Intell. Syst. Account. Finance Manag.
- A Bayesian network framework for reject inference.
- Technology scoring model considering rejected applicants and effect of reject inference. J. Oper. Res. Soc.
- The economic value of reject inference in credit scoring.
- A semi-supervised approach for reject inference in credit scoring using SVMs.
Xiner Li is a master's student at Hunan University. Her research interests include data mining and big data.
Zaimei Zhang received the Ph.D. degree in management science and engineering from Hunan University, China, in 2011. She is an Assistant Professor at the School of Economics and Management of Changsha University of Science and Technology, China. Her research interests include financial engineering, big data, and artificial intelligence.