Abstract

Accurate evaluation of the risk level and operating performance of P2P online lending platforms not only helps information intermediaries function better but also protects investors’ interests effectively. This paper proposes a hybrid kernel support vector machine (SVM) improved by a genetic algorithm (GA), together with an index system, to construct such an evaluation model. The hybrid kernel combines a polynomial function and a radial basis function; its kernel parameters and the weight of the two kernels are tuned by the GA method, which offers excellent global optimization and rapid convergence. Empirical testing on cross-sectional data from the Chinese P2P lending market demonstrates the superiority of the improved hybrid kernel SVM model: its classification accuracy for credit risk level and operation quality is higher than that of the single kernel SVM model and of the hybrid kernel model with empirical parameter values.

1. Introduction

The Chinese P2P online lending industry operated without supervision or regulation for more than five years, during which most platforms acted as credit intermediaries, providing credit enhancement measures such as principal guarantees and third-party guarantees [1, 2]. With the growing number of platform bankruptcies and disappearances, investors have become increasingly sensitive to platform characteristics in their decision-making. Risk management focusing on platforms is therefore a new trend in the regulation of the P2P online lending industry [3, 4]. The Interim Measures for the Administration of the Business Activities of Online Lending Information Intermediary Institutions, issued jointly by four ministries and commissions of the Chinese government in August 2016, clarified the contents of P2P lending, the regulatory system, and business rules; subsequently, a series of detailed rules and regulations on third-party depository, filing and registration, and information disclosure were promulgated to standardize the development of the P2P online lending industry [5, 6]. Accurate evaluation of the risk level and operating performance of platforms not only gives regulatory authorities a solid basis for adopting practical measures but also serves as an important reference for investors’ decision-making. Therefore, constructing an advanced evaluation model for P2P online lending platforms is of great practical significance [7].

Given the unstable market environment, the evaluation of risk level and operating performance has been an active topic in recent research. Tsolas applied a new series two-stage DEA method to evaluate the credit risk of enterprises [8]. Luo Sirong et al. introduced a regression spline-based discrete time survival model to assess the comprehensive performance of credit card applicants [9]. Dahira et al. presented a feature selection-based hybrid-bagging algorithm (FS-HB) for improved credit risk evaluation [10]. With respect to Chinese P2P platforms, existing studies usually adopt statistical methods such as factor analysis, principal component clustering, and the analytic hierarchy process. Zhu Zongyuan and Wang Jingyu applied the analytic hierarchy process and data envelopment analysis to measure the technical, scale, and overall efficiencies of 22 P2P online lending platforms and found those efficiencies to be generally low [11]. Shan Peng et al. applied the factor analysis method to score and rank the comprehensive strength and risk levels of the sample platforms [12]. Yan Xin et al. constructed a complex evaluation index system for P2P online loan platforms and used the two-step and Kohonen models to cluster 516 platforms, classifying them to provide references for investors' decision-making [13]. Liu Ao et al. determined optimal weights by means of the teaching and learning optimization algorithm and ranked the efficiencies of 100 P2P online loan platforms [14].

There are two main defects in the existing research. Firstly, most studies rank platforms according to a certain criterion. The boundary of platforms suitable for investment remains ambiguous, so intuitive support for investors’ decision-making is missing. Secondly, studies that adopt statistical models overemphasize data modeling, so the accuracy of model-based prediction suffers as the data dimension grows. Therefore, a machine learning algorithm integrating GA and hybrid kernel SVM is proposed in this study. By classifying risk level and operation quality, the improved algorithm sets a clear boundary that tells investors whether a platform is credible enough to trade on. Moreover, applying the GA method and hybrid kernel SVM not only achieves higher classification accuracy than statistical and traditional machine learning models but is also suitable for the analysis of large data volumes.

The rest of this paper is organized as follows. Section 2 presents the design of the evaluation model based on the GA-optimized hybrid kernel SVM. Section 3 reports the simulation experiment results, including the labeling process by the principal component method and the platform evaluation process by the optimized hybrid kernel SVM method. Section 4 concludes the paper with a summary and future research directions.

2. Principle of GA and Hybrid Kernel SVM Integrating Model

2.1. Establishment of SVM Hybrid Kernel

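The principle of SVM as a classification algorithm is to find the separating hyperplane with the maximum margin, that is, to maximize the distance between point x and the hyperplane ($w^{\top}x + b = 0$). A slack variable, namely, a nonnegative parameter ξ, and a penalty factor C are introduced to describe the inseparability losses and the penalty for sample misclassification. With the training samples denoted as $(x_i, y_i)$, $i = 1, \dots, n$ ($x_i$: input index and $y_i$: classification tag value), the basic model can be described as

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad y_i\left(w^{\top}x_i + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \dots, n. \tag{1}$$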
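
To make the roles of the slack variable ξ and the penalty factor C concrete, the following sketch evaluates the soft-margin objective for a fixed linear hyperplane. It is a minimal illustration rather than the code used in this study; the function name, the hyperplane (w, b), and the toy samples are assumptions.

```python
def soft_margin_objective(w, b, samples, labels, C):
    """Return (objective, slacks) of a linear soft-margin SVM for a fixed (w, b)."""
    slacks = []
    for x, y in zip(samples, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        # The slack is the smallest xi >= 0 satisfying y * (w.x + b) >= 1 - xi.
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks


# Toy data (hypothetical): the second sample lies inside the margin.
samples = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
labels = [1, 1, -1]
objective, slacks = soft_margin_objective(w=(1.0, 0.0), b=0.0,
                                          samples=samples, labels=labels, C=1.0)
print(objective, slacks)
```

A larger C makes each unit of slack more expensive, so the optimizer tolerates fewer margin violations at the cost of a narrower margin.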

The kernel function maps the data implicitly into a high-dimensional feature space so that problems that are linearly inseparable in the original low-dimensional space can be solved; its form and parameter values significantly influence the classification accuracy of the SVM algorithm. Kernel functions are generally divided into two types, global and local kernels: the former have strong generalization capacity but weak learning ability, while the latter are the opposite. Among common kernel functions, the polynomial and sigmoid types are global kernels, and the RBF type is a local kernel. In this study, the polynomial and RBF kernel functions are linearly combined to obtain a hybrid kernel function that has both learning and generalization capacities and thus overcomes the limitations of the single kernel functions. Three kernels are therefore involved: the polynomial kernel function, the RBF kernel function, and the polynomial-RBF hybrid kernel function, the last being a weighted sum of the first two.
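
For reference, the three kernels can be written in one commonly used form. The parameter names a, c, d, and λ follow the text; the symbol σ for the RBF kernel parameter, the 2σ² scaling, and the choice to let λ weight the polynomial term are assumptions made here for illustration and may differ from the original notation.

```latex
\begin{align*}
K_{\mathrm{poly}}(x, x_i)   &= \left(a\, x^{\top} x_i + c\right)^{d},\\
K_{\mathrm{RBF}}(x, x_i)    &= \exp\!\left(-\frac{\|x - x_i\|^{2}}{2\sigma^{2}}\right),\\
K_{\mathrm{hybrid}}(x, x_i) &= \lambda\, K_{\mathrm{poly}}(x, x_i) + (1 - \lambda)\, K_{\mathrm{RBF}}(x, x_i),
\qquad 0 \le \lambda \le 1.
\end{align*}
```

With a > 0, c ≥ 0, and a positive integer d, both component kernels are valid kernels, and a nonnegative weighted sum of valid kernels is again a valid kernel, so the hybrid form remains admissible for SVM training.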

2.2. Optimization of SVM Parameters

When the hybrid kernel function is applied for classification, the parameters that must be determined are λ (hybrid kernel weight coefficient), a, c, and d (polynomial kernel parameters), the RBF kernel parameter, and C (penalty factor).
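
The sketch below shows how these parameters could enter a kernel implementation. It is a hedged illustration only: the function names, the name sigma and the Gaussian-width parameterization of the RBF kernel, and the convention that lam weights the polynomial term are all assumptions, not the notation of this study.

```python
import math


def polynomial_kernel(x, y, a=1.0, c=1.0, d=3):
    """Global kernel: (a * <x, y> + c) ** d."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (a * dot + c) ** d


def rbf_kernel(x, y, sigma=1.0):
    """Local kernel: exp(-||x - y||^2 / (2 * sigma^2)); sigma is an assumed name."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def hybrid_kernel(x, y, lam=0.5, a=1.0, c=1.0, d=3, sigma=1.0):
    """Weighted sum of both kernels; lam weights the polynomial term (assumption)."""
    return (lam * polynomial_kernel(x, y, a, c, d)
            + (1.0 - lam) * rbf_kernel(x, y, sigma))


def gram_matrix(samples, kernel, **params):
    """Precompute K[i][j] = kernel(samples[i], samples[j])."""
    return [[kernel(xi, xj, **params) for xj in samples] for xi in samples]


# Toy samples (hypothetical).
samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
K = gram_matrix(samples, hybrid_kernel, lam=0.5)
```

A Gram matrix of this kind can be handed to SVM libraries that accept precomputed kernels; LIBSVM, for instance, offers a precomputed kernel type.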

Firstly, the hybrid kernel weight coefficient is determined by the principle of minimizing the feature distances between similar samples and maximizing the feature distances between dissimilar samples, which was put forward by Wang Xingfu and Yu Lu [2]. The evaluation function L(λ) is defined as the difference between the spacing of dissimilar sample pairs and the spacing of similar sample pairs; ϕ1 and ϕ2 represent the mappings corresponding to the RBF and polynomial kernel functions, respectively. The distance between samples i and j may be expressed as

Then,

where $x_i$ stands for the sample value and $y_i$ stands for the sample type.

Substituting equation (2) into equation (3),
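
Distances in the feature space can be evaluated from kernel values alone, because ‖ϕ(x_i) − ϕ(x_j)‖² = K(x_i, x_i) − 2K(x_i, x_j) + K(x_j, x_j). The sketch below uses this identity and then compares the mean distance of dissimilar pairs with that of similar pairs. This comparison is only one possible reading of the evaluation function, whose exact form is not reproduced here; the function names and toy data are assumptions.

```python
import math


def kernel_distance(x_i, x_j, kernel):
    """Feature-space distance ||phi(x_i) - phi(x_j)|| computed via the kernel trick."""
    sq = kernel(x_i, x_i) - 2.0 * kernel(x_i, x_j) + kernel(x_j, x_j)
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative round-off


def class_distance_gap(samples, labels, kernel):
    """Mean distance of dissimilar pairs minus mean distance of similar pairs.

    Assumes at least one pair of each kind exists.
    """
    between, within = [], []
    n = len(samples)
    for i in range(n):
        for j in range(i + 1, n):
            dist = kernel_distance(samples[i], samples[j], kernel)
            (within if labels[i] == labels[j] else between).append(dist)
    return sum(between) / len(between) - sum(within) / len(within)


def linear_kernel(x, y):
    """Plain dot product, used here only so the result is easy to check by hand."""
    return sum(a * b for a, b in zip(x, y))


# Toy data (hypothetical): two well-separated classes.
samples = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labels = [-1, -1, 1, 1]
gap = class_distance_gap(samples, labels, linear_kernel)
```

Any kernel with the same two-argument signature, including a hybrid kernel with a fixed weight, can be passed in place of `linear_kernel`.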

Secondly, GA with global optimization ability is used to optimize the kernel parameters, and its basic principles are as follows:
(1) Initialization of the SVM parameters, setting the search space for the kernel parameters and the penalty factor, and initialization of the GA parameters: population size, encoding length, crossover and mutation probabilities, and maximum number of iterations.
(2) Random generation of the individuals of the initial population, which are coded on the basis of the following equation:

$$x = a + \frac{M}{2^{l} - 1}\,(b - a),$$

where M represents a binary code string (read as an integer); x represents the independent variable, whose value range is [a, b]; and l represents the encoding length.
(3) Calculation of f (individual fitness) and marking of the individual with the highest fitness.
(4) Selection, crossing, and mutation: selection refers to selecting two parent individuals from the population in accordance with the principle that the greater the fitness, the higher the probability of being selected; crossing refers to forming offspring through random code exchanges between two parent individuals; mutation refers to flipping each bit of the parent's individual codes with a certain probability.
(5) Calculation of the fitness value of each individual according to the fitness function. If the termination condition (the maximum evolutional generation is reached or the individual fitness f converges to a certain value) is not satisfied, the procedure returns to Step (3); otherwise, the individual with the highest fitness is decoded and the optimal SVM parameters are output.
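
The steps above can be sketched as a small binary-coded GA that maximizes a one-dimensional toy function. This is an illustrative sketch under stated assumptions, not the implementation used in this study: the function names, the default probabilities, the single-point code exchange, the elitism step, and the toy fitness function are all hypothetical choices.

```python
import random


def decode(bits, low, high):
    """Map a binary code string to a real value in [low, high]."""
    m = int("".join(map(str, bits)), 2)
    return low + m * (high - low) / (2 ** len(bits) - 1)


def run_ga(fitness, low, high, length=10, pop_size=20, generations=30,
           p_cross=0.7, p_mut=0.02, seed=0):
    """Maximize fitness(x) on [low, high]; return (best_x, best fitness per generation)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scores = [fitness(decode(ind, low, high)) for ind in pop]
        best = max(range(pop_size), key=scores.__getitem__)
        elite = pop[best][:]
        history.append(scores[best])
        # Fitness-proportional selection (shifted so that all weights are positive).
        floor = min(scores)
        weights = [s - floor + 1e-9 for s in scores]
        children = [elite]  # carry the marked best individual over unchanged
        while len(children) < pop_size:
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            child = p1[:]
            if rng.random() < p_cross:  # exchange the code after a random point
                cut = rng.randint(1, length - 1)
                child = p1[:cut] + p2[cut:]
            # Flip each bit independently with probability p_mut.
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            children.append(child)
        pop = children
    scores = [fitness(decode(ind, low, high)) for ind in pop]
    best = max(range(pop_size), key=scores.__getitem__)
    history.append(scores[best])
    return decode(pop[best], low, high), history


# Toy fitness (hypothetical) with its maximum at x = 2 on the range [0, 5].
best_x, history = run_ga(lambda x: -(x - 2.0) ** 2, low=0.0, high=5.0)
```

Because the best individual is copied into the next generation unchanged, the recorded best fitness never decreases from one generation to the next.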

3. Simulation and Tests

3.1. Construction of the Comprehensive Evaluation System and Index Preprocess

This study focused on evaluating the monthly operation level of a P2P platform with reference to the industry average level. Taking data availability and index stability into account, the evaluation indexes were selected in the following four dimensions:
(1) Transaction level: it was decomposed into two subdimensions, trading scale and cost of capital, in which 3 indexes (namely, turnover, average reference rate of return, and net capital inflow) were examined.
(2) Platform popularity: it primarily examines the platform's attractiveness to investors and borrowers through brand effects, public opinion communication, and other channels, and it is directly reflected by the numbers of investors and borrowers and by the investment and loan amounts per capita.
(3) Loan decentralization: an explosive increase in trading volume combined with a high concentration of borrowing transactions puts platforms under heavy payment pressure. This study focused on the degree of decentralization of borrowers; thus, two indexes (the per capita amount to be paid and the percentage of the amount to be paid by the top ten borrowers) were selected to represent it.
(4) Liquidity level: it refers to the ability to liquidate assets at a reasonable price. For any asset, the worse its liquidity is, the less active its transactions are. The average loan term is generally used to reflect the liquidity level: the shorter the term is, the stronger the fund liquidity is.

The platform and industry data are taken from the October 2017 statistics published on the website http://www.wangdaizhijia.com, and 463 valid samples were obtained after deleting samples with incomplete data. Software environment: Windows 7, SPSS 19.0, and MATLAB R2016b. The statistical description of the original indexes is shown in Table 1.

The original indexes are preprocessed in two steps: relativization and reversal of negative indexes. Because the supervision system of the Chinese P2P industry is still imperfect, regulatory authorities have set neither a cap nor a floor for platform operation indexes. In this paper, the ratio of each index's absolute value to the industry average serves as the input index for the kernel principal component analysis; in an economic sense, it represents the platform's level relative to the industry. Because industry statistics are lacking for index X9, a proportion of 50% is used here instead, which is the cap proportion set by the authorities for commercial banks in China.

The ten original indexes consist of positive and negative ones. The negative indexes are the per capita amount to be paid, the percentage of the amount to be paid by the top ten borrowers, and the average loan term, whose absolute values are negatively correlated with the operation level of a platform. Thus, the reciprocal of each original negative index is adopted so that a larger index value always corresponds to a better platform operation level.
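
The two preprocessing steps can be illustrated with a short sketch. It assumes that, for a negative index, the reciprocal is applied to both the platform value and the industry average, so that the relative value becomes average divided by value. This reading, the function name, and the numbers are assumptions for illustration only.

```python
def relativize(value, industry_average, negative=False):
    """Express an index relative to the industry average (1.0 means average).

    For a negative index, the reciprocals of the platform value and of the
    industry average are compared, which gives industry_average / value.
    """
    if negative:
        return industry_average / value
    return value / industry_average


# Hypothetical numbers for one platform (not taken from the data of this study).
turnover = relativize(3000.0, industry_average=1500.0)            # positive index
loan_term = relativize(2.0, industry_average=6.0, negative=True)  # negative index
print(turnover, loan_term)
```

After this step, a value above 1 means better than the industry average for every index, whether it was originally positive or negative.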

3.2. Classification Evaluation Mechanism Based on Principal Component Analysis

To begin with, the sample data are scored and labeled using the principal component analysis method to generate the target outputs for the supervised learning of the SVM algorithm. The corresponding results are shown in Table 2.

The top six components, whose accumulated variance contribution rate reaches 85%, are extracted as the principal components and denoted F1, F2, ..., and F6 in sequence. The score matrix is shown in Table 3.
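
The selection rule based on the accumulated variance contribution rate can be sketched as follows. The eigenvalues are hypothetical and chosen only so that six components are needed; they are not the values behind Table 2, and the function name is an assumption.

```python
def components_for_threshold(eigenvalues, threshold=0.85):
    """Return (k, cumulative_rates), where k is the smallest number of leading
    components whose accumulated variance contribution reaches the threshold."""
    total = sum(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    cumulative, running = [], 0.0
    for value in ordered:
        running += value / total
        cumulative.append(running)
    for k, rate in enumerate(cumulative, start=1):
        if rate >= threshold:
            return k, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues of a 10-index correlation matrix (illustrative only).
eigenvalues = [3.0, 2.0, 1.5, 1.0, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
k, cumulative = components_for_threshold(eigenvalues)
print(k, [round(rate, 2) for rate in cumulative])
```

Each eigenvalue divided by the total gives the variance contribution rate of its component, which is also the weight used later in the comprehensive score.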

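Each component is expressed as a linear combination of the indexes X according to the following equation, whose coefficient matrix is the score matrix of principal components in Table 3:

$$F_k = \sum_{i=1}^{10} a_{ki}\, X(i), \qquad k = 1, \dots, 6,$$

where $a_{ki}$ is the score coefficient of index X(i) on component $F_k$.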

The comprehensive score function is a weighted sum of the scores of all principal components, in which each weight is the variance contribution rate of the corresponding principal component:

$$F = \sum_{k=1}^{6} w_k F_k,$$

where $w_k$ is the variance contribution rate of $F_k$.

Taking X(i) = 1 for every i gives the industry average score. A platform whose comprehensive score is below this value is below the industry average level; it belongs to the “ALERT” type and is labeled “−1”. In contrast, taking X(i) = 10 for every i gives the “EXCELLENT” type score, and a platform whose comprehensive score is above this value is labeled “1”. A platform whose comprehensive score lies between the two values belongs to the “GENERAL” type and is labeled “0”. Applying this rule to the principal component scores yields 107 “EXCELLENT” type platforms, 334 “GENERAL” type platforms, and 22 “ALERT” type platforms.

In addition, to assess the early warning ability of the optimized evaluation model, a second classification standard is constructed. A binary classifier gives a definite answer to whether investors could trade on a platform based on its risk level, whereas the ternary classifier built above aims at choosing the most outstanding platforms. “EXCELLENT” and “GENERAL” platforms are collectively called “NONALERT” platforms and are labeled “1”, while “ALERT” platforms are labeled “0”. Accordingly, there are 22 “ALERT” platforms and 441 “NONALERT” platforms.
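
The scoring and labeling rules can be sketched as follows. The coefficient matrix and weights are small hypothetical stand-ins (two components, three indexes) rather than the values of Table 3. The handling of scores that fall exactly on a threshold is also an assumption, since the exact inequalities are not reproduced here.

```python
def comprehensive_score(x, coefficients, weights):
    """F = sum_k w_k * F_k, with F_k = sum_i a_ki * x_i."""
    components = [sum(a * xi for a, xi in zip(row, x)) for row in coefficients]
    return sum(w * f for w, f in zip(weights, components))


def ternary_label(score, average_score, excellent_score):
    """-1 = ALERT, 0 = GENERAL, 1 = EXCELLENT (boundary handling is assumed)."""
    if score < average_score:
        return -1
    if score > excellent_score:
        return 1
    return 0


def binary_label(ternary):
    """1 = NONALERT (EXCELLENT or GENERAL), 0 = ALERT."""
    return 0 if ternary == -1 else 1


# Hypothetical 2-component, 3-index example (not the coefficients of Table 3).
coefficients = [[0.5, 0.25, 0.25], [0.0, 0.5, -0.5]]
weights = [0.75, 0.25]
average = comprehensive_score([1, 1, 1], coefficients, weights)       # every index at 1
excellent = comprehensive_score([10, 10, 10], coefficients, weights)  # every index at 10
score = comprehensive_score([2, 4, 2], coefficients, weights)
label = ternary_label(score, average, excellent)
print(average, excellent, score, label, binary_label(label))
```

The binary labels are obtained from the ternary ones by merging the two upper classes, which matches the reported counts: 107 + 334 = 441 “NONALERT” platforms and 22 “ALERT” platforms.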

3.3. Evaluation Model for Optimization of Hybrid Kernel SVM by GA
3.3.1. Classification Evaluation Results for Determining SVM Parameters Based on Empirical Values

The empirical parameter values were first used to test the accuracies of the single and hybrid kernel SVM models. With λ = 0.5, a = c = 1, d = 3, the RBF kernel parameter set to 10, and C = 1, the binary and ternary classification results under 5-fold cross validation are shown in Table 4.

As shown in Table 4, the classification accuracy of the polynomial-RBF hybrid kernel SVM evaluation model with empirical parameters is slightly better than that of the four common single kernel models in both binary and ternary classification. However, the results are still not satisfactory, especially for ternary classification. GA is therefore introduced to optimize the hybrid kernel weight coefficient and kernel parameters to achieve higher classification accuracy.
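
For readers unfamiliar with the procedure, 5-fold cross validation can be sketched as below. The classifier passed in is a placeholder that predicts the majority label and merely stands in for an SVM trainer; the function names, the shuffling seed, and the way folds are formed are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(samples, labels, train_and_predict, k=5, seed=0):
    """Average accuracy over k held-out folds.

    train_and_predict(train_x, train_y, test_x) must return predicted labels.
    """
    folds = k_fold_indices(len(samples), k, seed)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(samples)) if i not in held]
        predictions = train_and_predict([samples[i] for i in train],
                                        [labels[i] for i in train],
                                        [samples[i] for i in held_out])
        hits = sum(p == labels[i] for p, i in zip(predictions, held_out))
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / k


def majority_class(train_x, train_y, test_x):
    """Placeholder classifier standing in for the SVM: predicts the most common label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_x)


folds = k_fold_indices(463)
labels = [1] * 441 + [0] * 22  # binary label counts reported in Section 3.2
accuracy = cross_val_accuracy(list(range(463)), labels, majority_class)
```

Each sample is held out exactly once, so the averaged accuracy reflects predictions on data that were not used for training in that fold.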

3.3.2. Optimization of SVM Parameters Based on GA

Parameters are optimized by LIBSVM toolkit, and is taken when applying the hybrid kernel function. SVM parameters are optimized by the GA algorithm in accordance with the specific steps as follows:

Input: the 463 samples after feature extraction.
Step 1: parameters are encoded in binary mode to construct a population (pop size: 50; individual chromosome length: 10). The ranges of polynomial kernel parameters are ,, and . A 50 × 40 matrix is generated randomly as the initial population.
Step 2: λ is solved based on the characteristic distance method.
Step 3: the SVM classification accuracy under the 5-fold test method is calculated and defined as the fitness function of GA.
Step 4: selection is performed by the roulette wheel selection method so that the greater the fitness of an individual, the higher its probability of being selected. The generation gap is set as 0.9, which means that 90% of individuals are copied to the next generation. The probability of individual i being selected is

$$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j},$$

where $f_i$ is the fitness of individual i and N is the population size.
Step 5: crossing is performed by the two-point crossover method. Two crossover points are set randomly in the two paired individual encoded strings, and the genes between them are exchanged. The crossover probability is .
Step 6: mutation is performed by the discrete mutation method, where the mutation probability is taken as .
Step 7: the current optimal solution is kept, and the filial generation is inserted back into the parent population to generate a new population. If the number of iterations has not reached the maximum of 100, the procedure is repeated from Step 2; otherwise, Step 8 is performed.
Step 8: the decoded parameters and the classification accuracy are output.
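
The selection and crossover operators named in Steps 4 and 5 can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the fitness values are stand-ins for the 5-fold accuracy, and reading the 50 × 40 population as four parameters of 10 bits each is an inference from the stated sizes rather than something the text spells out.

```python
import random


def selection_probabilities(fitness_values):
    """Roulette wheel: p_i = f_i / sum_j f_j (fitness values must be positive)."""
    total = sum(fitness_values)
    return [f / total for f in fitness_values]


def roulette_select(population, fitness_values, n_selected, rng):
    """Draw n_selected individuals with probability proportional to fitness."""
    probs = selection_probabilities(fitness_values)
    chosen = rng.choices(population, weights=probs, k=n_selected)
    return [individual[:] for individual in chosen]


def two_point_crossover(parent1, parent2, rng):
    """Exchange the genes lying between two random crossover points."""
    i, j = sorted(rng.sample(range(1, len(parent1)), 2))
    child1 = parent1[:i] + parent2[i:j] + parent1[j:]
    child2 = parent2[:i] + parent1[i:j] + parent2[j:]
    return child1, child2


rng = random.Random(0)
# 50 individuals x 40 bits, e.g. four parameters encoded with 10 bits each (assumed layout).
population = [[rng.randint(0, 1) for _ in range(40)] for _ in range(50)]
fitness_values = [0.5 + 0.01 * k for k in range(50)]  # stand-in for 5-fold accuracy
# A generation gap of 0.9 means 45 of the 50 individuals are selected.
selected = roulette_select(population, fitness_values, n_selected=45, rng=rng)
child1, child2 = two_point_crossover(selected[0], selected[1], rng)
```

Because accuracy is always positive, it can be used directly as a roulette wheel weight without shifting or rescaling.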

The warning capability of the optimized model for “ALERT” platforms is investigated first. The best binary classification accuracy (fitness) during the evolution process is shown in Figure 1. By the 50th generation, the accuracy reaches 98.9201% and converges to this value, which is significantly higher than that obtained with the empirical parameters in Table 4. The ROC curve of the binary classifier is shown in Figure 2, from which we can see that the AUC value reaches 0.9817. This shows that the evaluation model based on the hybrid kernel SVM optimized by the genetic algorithm has outstanding warning ability for “ALERT” platforms. The optimal parameter values of the binary classifier are shown in Table 5.
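
As a reminder of what the AUC value measures, the sketch below computes it as the probability that a randomly chosen positive sample receives a higher decision value than a randomly chosen negative one. The decision values are toy numbers, not outputs of the classifier in this study, and the function names are assumptions.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random positive outranks a random negative
    (ties count one half); labels are 1 for positive and 0 for negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels, predictions):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


# Toy decision values (hypothetical).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))  # 0.75
```

If the accuracy is pooled over all 463 samples, which is an assumption here, the reported 98.9201% corresponds to 458 correctly classified platforms, since 458 / 463 ≈ 0.989201.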

The fitness curve of the optimized ternary classifier is shown in Figure 3. By the 26th generation, the ternary classification accuracy reaches 96.7603% and converges to this value. This accuracy is significantly higher than that of the single kernel SVMs (72.14%–75.59%) and of the hybrid kernel SVM with empirical parameters (76.89%) presented in Table 4. This shows that the GA-optimized hybrid kernel SVM algorithm is effective in accurately classifying the risk level and operation quality of Chinese P2P online lending platforms. The parameter values of the ternary classifier during evolution are shown in Table 6.

4. Conclusions

How P2P platforms operate is closely related to the safety of investors’ funds and to their investment decisions, which creates a need for rating and classifying platforms. An improved hybrid kernel SVM evaluation model is put forward to increase the accuracy of the traditional SVM algorithm. A hybrid kernel function is introduced in which the weight is solved by the characteristic distance method and the parameter values are determined by the GA. The test on transaction data indicates that the improved model has strong learning and generalization abilities and that its prediction accuracy is significantly higher than that of either the single kernel SVM models or the hybrid kernel model with empirical parameter values, which makes the evaluation and classification of Chinese P2P online lending platforms more accurate and more objective.

Nonetheless, the premature convergence defect of the GA is not solved in this study. The improved hybrid kernel model has limited ability to explore an unknown space and tends to converge to a locally optimal solution. Further optimization could address these aspects.

Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by the National Social Science Foundation of China (Grant no. 14BGL185).