Abstract

This paper discusses the combined application of ensemble learning, classification, and feature selection (FS) algorithms, together with training data balancing, to make the proposed credit scoring model perform more effectively. The model comprises four major stages. First, the collected credit data are preprocessed. Second, an efficient feature selection algorithm based on the adaptive elastic net is employed to remove weakly related or uncorrelated variables and obtain high-quality training data. Third, a novel ensemble strategy is proposed to balance the imbalanced training data set for each extreme learning machine (ELM) classifier. Finally, a new weighting method for the single ELM classifiers in the ensemble model is established with respect to their classification accuracy, based on generalized fuzzy soft set (GFSS) theory; a novel cosine-based similarity measure of GFSS is also proposed to calculate the weight of each ELM classifier. To confirm the efficiency of the proposed ensemble credit scoring model, we conducted comparative experiments on real-world credit data sets. The analysis, outcomes, and statistical tests show that the proposed model improves classification effectiveness in terms of average accuracy, area under the curve (AUC), H-measure, and Brier's score compared with the benchmark single classifiers and ensemble approaches.

1. Introduction

Nowadays, financial institutions adopt various risk assessment and credit scoring models to reduce potential risk [1]. By analyzing customer credit data to estimate the probability that potential borrowers will default on their loans, these evaluation approaches turn customer data into decision-relevant knowledge that supports credit decisions [2]. An effective credit scoring model can thus serve as a reliable support system that helps managers make financial decisions.

To handle the potential risk of financial services, financial institutions have in the past few years increasingly moved from traditional manual methods to advanced approaches that require building various types of evaluation models. For credit evaluation, three main families of methods are widely utilized: statistical approaches, nonparametric approaches, and AI methods [3–7]. These families work efficiently under different circumstances. Statistical methods consist of different models, including discriminant analysis models, linear probability models, and probit and logit models. Nonparametric approaches utilize decision trees, the K-nearest neighbor algorithm, fuzzy logic, Naïve Bayes, and so on. AI methods are more advanced and technology-dependent; examples are artificial neural networks, support vector machines (SVM), particle swarm optimization (PSO), and genetic algorithms (GA).

Much research has also indicated that ensemble approaches perform more effectively in credit evaluation than single classifiers. To avoid the downsides of single classifiers, an increasing number of researchers have turned to customized and combined methods instead of individual classification models. The principle of the hybrid approach is to preprocess the data input to the classifiers. The focus is to gather information from a group of classifiers trained on the same problem and then combine their strengths to reach valid credit scoring decisions [8, 9]. In recent years, research on fuzzy soft set theory has made great progress, especially in the field of multiattribute decision making [10, 11]. The development of fuzzy soft set theory provides a new perspective for building more advanced ensemble data classification and credit evaluation models [12].

The motivation of our study is to construct a more reliable credit scoring model that generates accurate outcomes on imbalanced data. Three main approaches are adopted to achieve this goal: (1) improved elastic net-based feature selection, (2) a novel ensemble strategy and learning algorithm for imbalanced credit data, and (3) a dynamic weighting method for single ELM classifiers based on a newly proposed similarity measure of GFSS. These are needed because, in real applications of credit risk evaluation, especially in peer-to-peer lending, credit data may be gathered from many different channels, including social networking and judicial administration platforms. The data collected from these channels are usually very sparse, redundant, rough, and imbalanced (good customers generally outnumber bad customers) and often contain various weakly related or even uncorrelated features [13, 14]. These characteristics make the commonly used credit scoring models unstable, so the credit evaluation results become unreliable and inaccurate. Through the three approaches proposed in this article, the problems arising in credit scoring on imbalanced data can be handled effectively.

Section 2 presents the construction of the new ensemble credit scoring model. Section 3 discusses the experimental results. Finally, Section 4 concludes the paper.

2. New Ensemble Credit Scoring Model

This section describes the construction of the ensemble classification model for credit scoring.

2.1. Adaptive Elastic Net-Based Feature Selection

A large number of researchers have studied appropriate feature selection approaches for credit scoring, such as cost-sensitive methods [15], the information gain ratio [16], and genetic algorithms [17]. The Lasso estimator shrinks the regression coefficients toward zero under an $L_1$-norm penalty. It can therefore drop features (variables) and select the most important ones, building simple but effective models while remaining computationally efficient. Denote the historical credit scoring data as $\{(x_i, y_i)\}_{i=1}^{N}$, $x_i = (x_{i1}, \dots, x_{ip})^{T} \in \mathbb{R}^{p}$, where the $x_i$ are the variables of the customers and the $y_i \in \{0, 1\}$ are category tags (binary responses; 0 denotes default and 1 denotes nondefault). The regression model could be constructed as follows:

$$y_i = \beta_0 + x_i^{T}\beta + \varepsilon_i, \quad i = 1, \dots, N, \tag{1}$$

where $\beta_0$ and $\beta = (\beta_1, \dots, \beta_p)^{T}$ are the intercept and the regression coefficients, respectively. Suppose that the observations are uncorrelated and that all the variables are normalized. The Lasso estimate of $\beta$ could then be constructed as

$$\hat{\beta}^{\mathrm{Lasso}} = \arg\min_{\beta}\left\{\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{T}\beta\right)^{2} + \lambda\sum_{j=1}^{p}|\beta_j|\right\}. \tag{2}$$

Based on the information above, a large $\lambda$ reduces some coefficients in $\hat{\beta}$ to zero; that is, Lasso forces more coefficients to zero as $\lambda$ gradually increases. In addition, the Lasso model is able to handle any number of variables. Therefore, coefficient shrinkage and feature (variable) selection are carried out at the same time.

Although Lasso has proved to be easily interpretable and effective under various circumstances, it still has some shortcomings [18]. Zou and Hastie [19] put forward an extended approach called the elastic net. Like Lasso, the elastic net performs automatic variable selection and coefficient shrinkage at the same time, and it can also select groups of correlated variables. For fixed nonnegative $\lambda_1$ and $\lambda_2$, the estimation of $\beta$ by the elastic net could be carried out as follows:

$$\hat{\beta}^{\mathrm{Enet}} = \arg\min_{\beta}\left\{\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{T}\beta\right)^{2} + \lambda_1\sum_{j=1}^{p}|\beta_j| + \lambda_2\sum_{j=1}^{p}\beta_j^{2}\right\}, \tag{3}$$

where $\lambda_1\sum_{j=1}^{p}|\beta_j|$ is the $L_1$-norm element and $\lambda_2\sum_{j=1}^{p}\beta_j^{2}$ is the $L_2$-norm element.

In addition, Zou and Zhang [20] pointed out that the elastic net does not possess the oracle property. They then proposed the adaptive elastic net, which combines the $L_2$ penalty with a weighted $L_1$ penalty to penalize the squared error loss. The adaptive elastic net can therefore be treated as a combination of the adaptive Lasso and the elastic net. The estimate of $\beta$ by the adaptive elastic net could be calculated as

$$\hat{\beta}^{\mathrm{AEnet}} = (1+\lambda_2)\left\{\arg\min_{\beta}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{T}\beta\right)^{2} + \lambda_1\sum_{j=1}^{p}\hat{w}_j|\beta_j| + \lambda_2\sum_{j=1}^{p}\beta_j^{2}\right\}, \tag{4}$$

where $\hat{w}_j = \bigl(|\hat{\beta}_j^{\mathrm{Enet}}|\bigr)^{-\gamma}$, $\gamma$ is positive, while $\lambda_2$ is fixed and nonnegative.

Using formula (4), we can obtain the most significant attributes (the “big fish”) from the variable pool. We can then plug them into credit scoring models to get more precise results at minimal computational and operational cost.
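As an illustration, the two-stage selection in formula (4) can be approximated with scikit-learn by the standard column-rescaling trick for weighted $L_1$ penalties: fit an ordinary elastic net, derive the adaptive weights $\hat{w}_j$, rescale the columns, and refit. This is a minimal sketch, not the authors' implementation; the hyperparameters `lam1`, `lam2`, and `gamma` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

def aenet_select(X, y, lam1=0.01, lam2=0.01, gamma=1.0, eps=1e-6):
    """Approximate adaptive elastic net feature selection, cf. formula (4)."""
    X = StandardScaler().fit_transform(X)            # normalize the variables
    alpha, l1_ratio = lam1 + lam2, lam1 / (lam1 + lam2)
    # Stage 1: ordinary elastic net, cf. formula (3).
    init = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X, y)
    # Adaptive weights w_j = |beta_j^Enet|^(-gamma); eps avoids division by zero.
    w = (np.abs(init.coef_) + eps) ** (-gamma)
    # Stage 2: the weighted L1 penalty is emulated by rescaling each column
    # by 1/w_j before the fit and rescaling the coefficients back afterwards.
    refit = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X / w, y)
    beta = refit.coef_ / w
    return np.flatnonzero(np.abs(beta) > eps)        # indices of kept variables
```

The returned column indices define the reduced variable pool that is fed to the ELM classifiers described next.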

2.2. ELM-Based Classifier

The ELM, a single-hidden-layer feedforward neural network (SLFN), selects its input weights and hidden biases randomly, without any adjustment during the training process. The Moore–Penrose generalized inverse of the hidden-layer output matrix is then used to determine the output weights analytically. ELM exhibits excellent generalization performance and reduces the iterative training time, which makes it more efficient than many other ANN-type machine learning algorithms [21].

For the historical credit training data set mentioned above, the input vector $x_i = (x_{i1}, \dots, x_{ip})^{T}$ is the $i$th sample with $p$-dimensional features, and $y_i \in \{0, 1\}$. Then $p$ is the number of input neurons, which is also the number of input features. Let $L$ be the number of hidden neurons, and denote by $C$ the number of output neurons, which equals the number of categories. Denote the input weight matrix as $W = [w_1, w_2, \dots, w_L]$, where $w_j = (w_{j1}, \dots, w_{jp})^{T}$ is the vector connecting the $p$ input neurons with the $j$th hidden neuron, and let $b = (b_1, \dots, b_L)$ be the bias vector of the hidden neurons, where $b_j$ is the bias of the $j$th hidden neuron. These parameters do not change during the whole process. The output of the $j$th hidden neuron for the input $x_i$ could be computed as follows:

$$h_j(x_i) = G\left(w_j \cdot x_i + b_j\right), \tag{5}$$

where $G(x)$ is the activation function. Let $H$ be the hidden-layer output matrix of all the samples. It can be calculated by using the following equation:

$$H = \begin{bmatrix} G(w_1 \cdot x_1 + b_1) & \cdots & G(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ G(w_1 \cdot x_N + b_1) & \cdots & G(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L}. \tag{6}$$

The $i$th column of $H$ represents the output vector of the $i$th hidden node relative to the inputs $x_1, \dots, x_N$, and the $j$th row represents the output vector of the hidden layer relative to the input $x_j$. The output of the ELM can be calculated by

$$f_L(x_i) = \sum_{j=1}^{L}\beta_j\,G\left(w_j \cdot x_i + b_j\right) = o_i, \quad i = 1, \dots, N, \tag{7}$$

where $\beta_j$ is the weight vector that connects the $j$th hidden node with the output nodes. ELM is able to fit those $N$ samples without any error; in other words, $\sum_{i=1}^{N}\lVert o_i - y_i \rVert = 0$. Then, the following equation can be obtained:

$$\sum_{j=1}^{L}\beta_j\,G\left(w_j \cdot x_i + b_j\right) = y_i, \quad i = 1, \dots, N. \tag{8}$$

Equation (8) can also be rewritten compactly as follows:

$$H\beta = T, \tag{9}$$

where $\beta = [\beta_1, \dots, \beta_L]^{T}$ and $T = [y_1, \dots, y_N]^{T}$.

Based on (9), the value of the output weight $\beta$ could be estimated using a least-squares solution as follows:

$$\hat{\beta} = H^{\dagger}T, \tag{10}$$

where $H^{\dagger}$ stands for the Moore–Penrose generalized inverse of $H$. For credit scoring classification, the outcome of the ELM is as follows:

$$\hat{y}(x) = h(x)\hat{\beta} = h(x)H^{\dagger}T, \tag{11}$$

where $h(x) = [G(w_1 \cdot x + b_1), \dots, G(w_L \cdot x + b_L)]$ is the hidden-layer output vector for a sample $x$.
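The whole training procedure in (5)–(11) reduces to a few matrix operations. The following minimal sketch uses a sigmoid activation and numpy's pseudoinverse; the hidden-layer size and the clipping of scores to $[0, 1]$ (convenient for the GFSS construction in Section 2.4) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

class ELM:
    """Single-hidden-layer ELM trained via the Moore-Penrose pseudoinverse."""

    def __init__(self, L=50, seed=0):
        self.L = L                                  # number of hidden neurons
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Equations (5)-(6): sigmoid activation G(w_j . x_i + b_j).
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        p = X.shape[1]
        # Random input weights and biases, fixed thereafter (no tuning).
        self.W = self.rng.standard_normal((p, self.L))
        self.b = self.rng.standard_normal(self.L)
        H = self._hidden(X)                         # hidden-layer output matrix
        self.beta = np.linalg.pinv(H) @ y           # equation (10): beta = H^+ T
        return self

    def predict(self, X):
        # Equation (11), clipped to [0, 1] so the scores can be read as degrees.
        return np.clip(self._hidden(X) @ self.beta, 0.0, 1.0)
```

Because training is a single pseudoinverse rather than an iterative optimization, spawning the $M$ classifiers required by the ensemble strategy below remains cheap.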

2.3. Ensemble Strategy for Imbalanced Data

To better address the classification of imbalanced data, a considerable number of approaches have been developed. They can be categorized into three types: preprocessing, cost-sensitive learning, and ensemble methodology. Preprocessing can decrease the classification bias, in the sense of the bias-variance decomposition, and thereby enhance the single classifier. Undersampling [22, 23], oversampling [24, 25], and strategic sampling are extensively utilized to offset imbalanced data.

Ensemble methodology can be viewed as a decision-making process that runs individual learning algorithms in parallel and combines their outcomes to obtain the final result. The basic idea behind the ensemble methodology is to train a number of single classifiers on the training set and then integrate them with some ensemble strategy to raise the accuracy and reliability of classification. Bagging [26], boosting [27], and stacking [28] are the most common ensemble approaches in credit scoring.

We propose a novel ensemble strategy for imbalanced data based on its imbalance ratio, which determines both the number of ELM classifiers that act as single classifiers to predict the credit scoring data and the number of samples that are fed into each ELM as training data.

For any given historical credit training data set containing $N$ samples, there are $N_g$ “good applicants” and $N_b$ “bad applicants,” such that $N_g + N_b = N$ (good applicants generally outnumber bad ones, so $N_g \ge N_b$). The imbalance ratio, called IR, can be calculated as follows:

$$\mathrm{IR} = \frac{N_g}{N_b}. \tag{12}$$

After obtaining the IR, the number of single ELM classifiers $M$ in the ensemble model can be calculated as follows:

$$M = \lceil \mathrm{IR} \rceil, \tag{13}$$

where the symbol $\lceil \cdot \rceil$ represents the “ceiling” operation.

Equation (13) not only determines the number of ELM classifiers needed in the ensemble credit scoring model but also guides how the imbalanced data are made balanced for each classifier. The proposed ensemble strategy for imbalanced data is built around this value of $M$. In the remainder of this subsection, we elaborate the strategy in detail.

First, calculate the imbalance ratio IR of the given historical credit training data set from $N_g$ and $N_b$. Second, determine the number of ELM classifiers $M$ using (13).

Third, each of the first $M-1$ ELM classifiers is fed $N_b$ “good applicants” samples and all $N_b$ “bad applicants” samples, so that the training sets of the first $M-1$ classifiers are balanced; random sampling without replacement is employed to extract the $N_b$ samples from the $N_g$ “good applicants” for each classifier. After the first $M-1$ training sets have been extracted, $N_g - (M-1)N_b$ “good applicants” samples remain unextracted.

Finally, for the last ELM classifier, we put the remaining $N_g - (M-1)N_b$ “good applicants” samples into its training set. Considering that $N_g - (M-1)N_b \le N_b$, we employ the SMOTE algorithm to create $N_b - \left(N_g - (M-1)N_b\right)$ synthetic “good applicants” samples from the remaining $N_g - (M-1)N_b$ “good applicants” samples. Thus, the last ELM classifier is likewise fed $N_b$ “good applicants” samples and $N_b$ “bad applicants” samples as its training set.
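A minimal sketch of this partitioning is given below, assuming labels 1 = good and 0 = bad and using imblearn's SMOTE for the last subset. The SMOTE settings are library defaults, not the paper's; edge cases (e.g., too few leftover good samples for SMOTE's default neighborhood) are ignored for brevity.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

def balanced_subsets(X, y, seed=0):
    """Split an imbalanced set into M balanced training sets, cf. (12)-(13)."""
    rng = np.random.default_rng(seed)
    good, bad = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    Ng, Nb = len(good), len(bad)
    M = int(np.ceil(Ng / Nb))                # IR = Ng / Nb, M = ceil(IR)
    good = rng.permutation(good)             # sampling without replacement
    subsets = []
    for m in range(M - 1):                   # first M-1 balanced subsets
        idx = np.concatenate([good[m * Nb:(m + 1) * Nb], bad])
        subsets.append((X[idx], y[idx]))
    # Last subset: the leftover goods are oversampled with SMOTE up to Nb.
    idx = np.concatenate([good[(M - 1) * Nb:], bad])
    Xm, ym = SMOTE(sampling_strategy=1.0, random_state=seed).fit_resample(X[idx], y[idx])
    subsets.append((Xm, ym))
    return subsets
```

Each element of `subsets` then trains one ELM classifier of the ensemble.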

Through the processes described above, the proposed ensemble strategy for imbalanced data is realized, and the training data for each classifier are balanced. In the next subsection, we introduce the GFSS theory-based ensemble credit scoring approach, which combines the results of the single ELM classifiers.

2.4. GFSS Theory-Based Ensemble Credit Scoring Model

Given the results of each ELM classifier, we still need to determine their weights with respect to their performance so that the classification accuracy can be greatly improved. The theory of soft sets, first put forward by Molodtsov [29], can be regarded as a way of handling the uncertainties of imprecise environments (e.g., the credit scoring area). Maji et al. [30] studied the combination of fuzzy sets and soft sets. We first introduce the principle of generalized fuzzy soft sets and then put forward a similarity measure of generalized fuzzy soft sets based on the angular cosine. From this similarity measure and the classification accuracy, we obtain the weight of each credit scoring model. Finally, we build the generalized fuzzy soft set theory-based ensemble credit scoring model.

2.4.1. GFSS Theory

Based on the theories proposed by Molodtsov [29] and Maji et al. [30], we state the following definitions for fuzzy soft sets.

Definition 1. Denote $U$ as the initial universal set and $P$ as a set of parameters, and let $\mathcal{P}(U)$ be the power set of $U$. A pair $(F, P)$ is a soft set over $U$ if $F$ is a mapping given by $F : P \to \mathcal{P}(U)$.

Definition 2. Denote $U$ as the initial universal set and $P$ as a set of parameters. Let $I^{U}$ be the set of all fuzzy subsets of $U$, and let $A \subseteq P$. A pair $(F, A)$ is a fuzzy soft set over $U$ if $F$ is a mapping given by $F : A \to I^{U}$.
Then, Majumdar and Samanta's [31] definition of GFSS is as follows:

Definition 3. Denote $U$ as the initial universal set and $P$ as a set of parameters, and let $(U, P)$ be the soft universe. Denote $F : P \to I^{U}$, and let $\mu$ be a fuzzy subset of $P$, i.e., $\mu : P \to I = [0, 1]$, where $I^{U}$ is the set of all fuzzy subsets of $U$. Let $F_\mu$ be the mapping $F_\mu : P \to I^{U} \times I$ defined by $F_\mu(e) = (F(e), \mu(e))$, where $F(e) \in I^{U}$. Then $F_\mu$ is a GFSS over the soft universe $(U, P)$.
For every $e_i$, $F_\mu(e_i) = (F(e_i), \mu(e_i))$ indicates both the degree to which the elements of $U$ belong to $F(e_i)$ and the possibility of such belonging.
In this paper, $U$ denotes the historical customer credit data, $U = \{x_1, x_2, \dots, x_N\}$; the parameters correspond to the single classifiers; $F(e_i)$ represents the classification performance of a single classifier for the individual customers; and $\mu(e_i)$ denotes the overall classification degree of that single classifier.

2.4.2. Similarity Measure of GFSS

The similarity measurement of GFSS plays a central role in applying GFSS to the establishment of our model. Thus, we establish a new similarity measure of GFSS for credit scoring.

Definition 4. For $M$ single classifiers, denote $F(e_m)$ and $\mu(e_m)$, $m = 1, \dots, M$, as the elements of $F_\mu$. They are defined as follows:

$$F(e_m)(x_t) = 1 - \left| y_t - \hat{y}_t^{(m)} \right|, \quad t = 1, \dots, N, \tag{14}$$

where $y_t$ is the category tag of the $t$th customer (binary response; 0 denotes default and 1 denotes nondefault) and $\hat{y}_t^{(m)}$ is the forecasting result for the $t$th customer predicted by the $m$th classifier, which lies between 0 and 1. $F(e_m)(x_t)$ is the accuracy degree of classification of a single classifier for a single customer; it also ranges between 0 and 1, which is in line with our initial intuition. In addition, $\mu(e_m)$ is the $m$th classifier's overall classification performance, which can be calculated as follows:

$$\mu(e_m) = \frac{TP_m + TN_m}{TP_m + FN_m + TN_m + FP_m}, \tag{15}$$

where $TP_m$, $FN_m$, $TN_m$, and $FP_m$ are the elements of the confusion matrix (Table 1). $TP_m$ is the number of good customers accurately labeled as good, $TN_m$ is the number of bad customers accurately labeled as bad, $FN_m$ is the number of good customers falsely labeled as bad, and $FP_m$ is the number of bad customers falsely labeled as good. The greater the value of $\mu(e_m)$, the more accurate the results calculated by the $m$th classifier.
Based on the above discussion, it can be noted that $F(e_m)$ and $\mu(e_m)$ from Definition 4 together evaluate the classification performance of every single model. Therefore, we can build the GFSS of the $m$th classifier as follows:

$$F_\mu^{m}(e_m) = \left(\left\{\left(x_t, F(e_m)(x_t)\right) : t = 1, \dots, N\right\},\ \mu(e_m)\right). \tag{16}$$
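Under the formulation of (14)–(16), the two components of a classifier's GFSS can be computed directly from its scores and the actual tags. A minimal sketch follows; the decision threshold of 0.5 is an assumption.

```python
import numpy as np

def gfss_components(y_true, y_score, threshold=0.5):
    """Membership degrees F(e_m)(x_t) and overall degree mu(e_m), cf. (14)-(15)."""
    # Equation (14): F = 1 - |y_t - yhat_t|, elementwise in [0, 1].
    F = 1.0 - np.abs(y_true - y_score)
    # Equation (15): overall accuracy (TP + TN) / (TP + FN + TN + FP).
    y_pred = (y_score >= threshold).astype(int)
    mu = float(np.mean(y_pred == y_true))
    return F, mu                             # the pair forms the GFSS of (16)
```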

Definition 5. For two GFSS $F_\mu^{a}$ and $F_\mu^{b}$ over the universal set $U$, where $a, b \in \{1, \dots, M\}$, the similarity measure of GFSS based on the angular cosine can be calculated as follows:

$$S\left(F_\mu^{a}, F_\mu^{b}\right) = \mu(e_a)\,\mu(e_b)\cdot\frac{\sum_{t=1}^{N} F(e_a)(x_t)\,F(e_b)(x_t)}{\sqrt{\sum_{t=1}^{N} F(e_a)(x_t)^{2}}\,\sqrt{\sum_{t=1}^{N} F(e_b)(x_t)^{2}}}. \tag{17}$$
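A direct transcription of the measure as reconstructed in (17): the angular cosine of the two membership vectors, scaled by the product of the overall degrees. The function below assumes the components produced by `gfss_components` above.

```python
import numpy as np

def gfss_similarity(F_a, mu_a, F_b, mu_b):
    """Cosine-based similarity of two GFSS, cf. (17)."""
    cos = (F_a @ F_b) / (np.linalg.norm(F_a) * np.linalg.norm(F_b))
    return mu_a * mu_b * cos
```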

2.4.3. Ensemble Credit Scoring Modeling

Determining the weight of every single model is the most important step in establishing the ensemble credit scoring model. The weights are obtained by comparing the calculated credit scores with the actual records.

The similarity measure of GFSS is utilized for this determination, i.e., the value $S\left(F_\mu^{m}, F_\mu^{*}\right)$ is calculated, where $F_\mu^{*}$ is the GFSS of the actual scores of the customers. In this way, the weight of the $m$th classifier can be calculated by

$$w_m = \frac{S\left(F_\mu^{m}, F_\mu^{*}\right)}{\sum_{k=1}^{M} S\left(F_\mu^{k}, F_\mu^{*}\right)}, \tag{18}$$

where $0 \le w_m \le 1$ and $\sum_{m=1}^{M} w_m = 1$. Thus, the final score could be calculated as

$$\hat{y}_t = \sum_{m=1}^{M} w_m\,\hat{y}_t^{(m)}, \quad t = 1, \dots, N. \tag{19}$$
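Putting (14)–(19) together: since the actual record classifies every customer perfectly, its GFSS has membership 1 for each customer and overall degree 1, which simplifies the similarity computation. A self-contained sketch follows (threshold 0.5 again assumed; in practice the weights would be computed on held-out data rather than the evaluation set).

```python
import numpy as np

def ensemble_score(scores, y_true):
    """GFSS-weighted ensemble, cf. (18)-(19): scores is (M, N), y_true is (N,)."""
    F = 1.0 - np.abs(y_true - scores)                  # (14), one row per classifier
    mu = np.mean((scores >= 0.5).astype(int) == y_true, axis=1)   # (15)
    F_star = np.ones(scores.shape[1])                  # GFSS of the actual tags
    cos = (F @ F_star) / (np.linalg.norm(F, axis=1) * np.linalg.norm(F_star))
    S = mu * cos                                       # (17) with mu* = 1
    w = S / S.sum()                                    # (18): normalized weights
    return w @ scores                                  # (19): final credit scores
```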

Figure 1 presents the flow-process diagram of the ensemble model. The algorithm of the “ELM and GFSS Theory-Based Hybrid Ensemble” model, called EGHE in the following parts, is described in Algorithm 1.

Input: Historical credit scoring data $(x_i, y_i)$.
Output: Credit score $\hat{y}_t$ of every single customer.
Step 1. Preprocessing of the data.
Step 2. Variable selection using AEnet.
Step 3. Rebalancing of the imbalanced data by the proposed ensemble strategy.
Step 4. Credit scoring by every single ELM classifier.
Step 5. Compute $F(e_m)$ and $\mu(e_m)$ using (14) and (15).
Step 6. Calculate the similarity $S\left(F_\mu^{m}, F_\mu^{*}\right)$ of each single classifier using (17).
Step 7. Calculate the weight $w_m$ of the $m$th classifier using (18).
Step 8. Get the final credit score $\hat{y}_t$ of every customer using (19).

3. Results and Discussion

3.1. Preparation of Dataset

During the evaluation, we collected various private and public data sets: six credit data sets (three public and three private) and four additional imbalanced data sets with different IR. The public sets can be obtained from the UCI Machine Learning Repository; they are real-world credit scoring data sets that are widely used by researchers. The German, Australian, and Japanese data sets are used for extra verification. The private data sets consist of the Iranian data set, which has also been widely used in many studies, and the Bene 1 and Bene 2 data sets, which were obtained from two major financial institutions in the Benelux [32]. The Iranian set contains customer data of small private Iranian banks [33, 34]. Of the four additional imbalanced data sets, Shuttle, Skin_segment, and MiniBooNE are also from the UCI Machine Learning Repository, while LC2017Q1 contains loan data of the first quarter of 2017 from Lending Club. The characteristics of all experimental data sets can be found in Table 2.

In this paper, we compared our proposed model with four state-of-the-art models, namely, the C5.0 decision tree, SVM with a radial basis function kernel (SVM-R), deep belief networks (DBN), and Naïve Bayes, to validate the performance of our approaches. All continuous attributes were discretized into intervals. Every single data set was randomly divided into a two-thirds training set and a one-third testing set. We used the open-source platform R (version 3.2.2) to conduct our experiments.

3.2. Experimental Results

Different methods are utilized as comparison models to test the validity of the EGHE credit scoring model.

First, the AEnet-based FS algorithm is used to obtain the highly correlated variables after initial data gathering and preprocessing. After selection, the numbers of variables in the ten data sets all decrease to various degrees (Table 3). In view of computational complexity, deleting irrelevant or weakly correlated variables is becoming increasingly important for big-data-oriented credit assessment.

After feature selection, two experiments further examine its effect on the classification of each model: (1) all single classifiers with feature selection and (2) all single classifiers without feature selection. Tables 4 and 5 report in detail the AUC, H-measure (HM), Brier's score (BS), and accuracy (ACC) for all single classifiers with and without feature selection.

From the results of every single classifier in Tables 4 and 5, we can see that feature selection helps enhance the classification effectiveness of most single classifiers. After feature selection, C5.0 shows accuracy values that are 0.8%, 1.76%, 1.63%, 4.49%, 2.85%, 0.82%, 3.47%, 4.15%, 7.94%, and 4.12% higher than without feature selection for the German, Australian, Japanese, Iranian, Bene 1, Bene 2, Shuttle, Skin_segment, MiniBooNE, and LC2017Q1 data sets, respectively. In the same way, SVM-R changed its accuracy values by 0.93%, 1.08%, 0.67%, 4.53%, −0.4%, 1.94%, 0.87%, 1.02%, 2.03%, and 2.54%, respectively, on the ten experimental data sets. DBN and Naïve Bayes also improve their effectiveness after feature selection. Only SVM-R on Bene 1 decreased, by 0.4%, but this does not contradict the improvement in classification that our AEnet-based feature selection brings. Outliers and redundant, weakly related, or even unrelated variables not only fail to improve effectiveness but also disturb model establishment and cause heavy computation.

It is noteworthy that, compared with C5.0, SVM-R, DBN, and Naïve Bayes, ELM shows superior accuracy on the vast majority of data sets. Table 6 reports the average running time (total time for training and testing) for all models. For these experiments, we used an Intel i5-8500 CPU at 3.0 GHz with 16 GB of RAM.

From Table 6, we can see that ELM costs less time than the other single models to carry out credit scoring. This computational efficiency also makes ELM a great match for ensemble learning and modeling.

After feature selection, the ensemble strategy, and individual model classification have been completed, the EGHE model is obtained. Based on (14), (15), and (17)–(19), the weights of the single ELM classifiers are calculated according to their respective performance. To validate the effectiveness of EGHE, we compared it with several ensemble models, split into two groups. The first group contains four FS algorithms with GFSS-based combination: cost-sensitive, GA, information gain ratio (IGR), and elastic net (Enet); cost-sensitive, GA, and IGR are popular feature selection approaches in the credit scoring area [16, 35, 36]. The second group applies four other combination approaches with AEnet-based feature selection: weighted average (WAVG) [37], majority voting (MajVot) [38], weighted voting (WVOT) [39], and fuzzy soft set (FSS). These methods are frequently adopted in the establishment and utilization of combination models. They also employ ELM as the base classifier but do not use the ensemble strategy proposed above, relying only on random sampling to balance the training data sets. Table 7 displays the AUC, H-measure, and Brier's score of all ensemble models.

From Tables 7 and 5, we can see that, compared with the single classifiers, the ensemble methods show significant advantages with regard to classification accuracy. Compared with the other single classifiers and the combined approaches in both groups, EGHE leads on all metrics across all data sets. Experiments on several state-of-the-art ensemble models were also performed to verify the effectiveness of the EGHE model: the EMPNGA-based multistage hybrid model put forward by Zhang and Xia [37]; the heterogeneous ensemble credit model put forward by Xia et al. [40]; the EBCA-RF&XGB-PSO model put forward by He et al. [41]; the heterogeneous ensemble learning-based two-stage credit risk model (TSHE) proposed by Papouskova and Hajek [42]; twin neural networks (TNN) proposed by Jayadeva et al. [43]; and the rule-based knowledge extraction (RKE) method recently proposed by Mahani and Baba [44]. Table 8 gives the results of these ensemble models on the different data sets.

From Table 8, we can tell that the results of these models are very close. The accuracy of the EGHE model is better than that of the other models on all data sets except the Iranian one. The EBCA-RF&XGB-PSO model achieved a high accuracy of 0.921 on the Iranian data set because it uses the Extended Balance Cascade method, which effectively addresses class imbalance. However, the ensemble strategy and GFSS theory-based EGHE model handles the thorny problem of imbalanced data classification better on most experimental data sets; even on some severely skewed data sets, such as Shuttle, Skin_segment, MiniBooNE, and LC2017Q1, ideal outcomes are achieved.

4. Conclusion

In this paper, we proposed a novel ensemble credit scoring model called EGHE, which integrates an efficient feature selection algorithm, a novel ensemble strategy, and a GFSS-based weighting method for single ELM classifiers. In the proposed model, the adaptive elastic net-based feature selection algorithm is first utilized to obtain high-quality training data, improving evaluation efficiency without reducing predictive precision. The ELM model is employed as the basic classifier, and a novel ensemble strategy makes the imbalanced training data sets balanced for each ELM classifier. Additionally, we proposed a new weighting method to build the GFSS theory-based ensemble credit scoring model: a dual-scale classification accuracy metric based on the new similarity measure of GFSS is constructed to compute the final weight of every single classifier. The main contribution of this paper is that the proposed EGHE is able to predict credit risk reliably and accurately, especially for imbalanced credit data. Comparisons between EGHE and other credit scoring models were implemented on ten real-world data sets with four metrics (average accuracy, AUC, H-measure, and Brier's score), and a variety of state-of-the-art ensemble models were employed for comparison to prove its validity. The experimental results demonstrated that the proposed EGHE model is robust and represents a positive development in credit scoring.

Data Availability

(1) The “Germany” data set used to support the findings of this study is available at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29. (2) The “Australia” data set is available at http://archive.ics.uci.edu/ml/datasets/Statlog+%28Australian+Credit+Approval%29. (3) The “Japan” data set is available at https://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening. (4) The “Iran” data set is included in [34, 45]. (5) The “Bene 1” and “Bene 2” data sets are included within the following article: [32]. (6) The “Shuttle” data set is available at http://archive.ics.uci.edu/ml/datasets/statlog+(shuttle). (7) The “Skin_segment” data set is available at http://archive.ics.uci.edu/ml/datasets/Skin+Segmentation. (8) The “MiniBooNE” data set is available at http://academictorrents.com/details/7fafb101f9c7961f9b840daeb4af43039107ddef. (9) The “LC2017Q1” data set is available at http://www.lendingclub.com and described in [41].

Disclosure

Dayu Xu and Xuyao Zhang are co-first authors.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

Dayu Xu and Xuyao Zhang contributed equally to this work.

Acknowledgments

The authors acknowledge the support of the Ministry of Education Humanities and Social Science Project (no. 20YJC630173), the National Natural Science Foundation of China (no. 31971493), the National Key Research and Development Program of China (no. 2018YFD0401403), the Zhejiang Province Key Science and Technology Project (no. 2018C02050), the Hangzhou Agricultural and Social Development Project (no. 20190101A07), and the Zhejiang Education and Teaching Reform Project (no. jg20180175), supported by the Department of Education of Zhejiang Province.