Deciphering big data in consumer credit evaluation

https://doi.org/10.1016/j.jempfin.2021.01.009

Highlights

  • Big data credit scores outperform internal rating in predicting loan delinquencies.

  • The outperformance is more prominent among borrowers without public credit records.

  • Credit modeling with big data has the potential to correct financial misreporting.

Abstract

This paper examines the impact of large-scale alternative data on predicting consumer delinquency. Using a proprietary double-blinded test from a traditional lender, we find that the big data credit score predicts an individual’s likelihood of defaulting on a loan with 18.4% greater accuracy than the lender’s internal score. Moreover, the impact of the big data credit score is more significant when evaluating borrowers without public credit records. We also provide evidence that big data have the potential to correct financial misreporting.

Introduction

Large volumes of alternative data, or big data, have recently become available, leading to a profound transformation of both economic research and practice (Einav and Levin, 2014). In the context of consumer lending, financial institutions have begun using large-scale external data to evaluate the creditworthiness of potential borrowers. The increased prevalence of online loan applications during the pandemic highlights the importance of this trend. Compared to simple digital footprints, using big data for lending decisions makes it more costly for people to change their behaviors. These data include behavioral loan tracking, location-based information, mobile app data, and much more, and they differ from traditional sources of information (e.g., financial information) in that they are granular, real data that are not self-reported. Despite the increased use of big data in practice, there is limited research on the performance of big data in a real business context. It remains an open question whether the new wave of big data provides new information value in financial services and improves traditional business practices.

In this paper, we examine whether the availability of large-scale alternative data improves personal credit assessment for a traditional financial institution. Our study is based on a double-blinded test conducted by an anonymous traditional lender, in which we compare the lender’s own internal rating with the big data score constructed by BaiRong, a big data service company in China. The internal score is based on credit records from the public credit reporting system, account-level data, and self-reported demographic and income information, while the big data credit score incorporates multiple dimensions that are unavailable to the lender. We are, therefore, able to assess the predictive power of the big data credit score both separately, compared to the traditional internal score, and jointly with the internal score.

Financial institutions may choose to use alternative big data to complement China’s underdeveloped credit reporting system. The financial credit reporting system, run by the Credit Reference Center of the People’s Bank of China (PBCCRC), only covers one-quarter of the Chinese population, meaning that around one billion Chinese individuals lack a financial credit profile. The PBCCRC system provides credit records for licensed financial institutions. For financial institutions, no common credit score, such as the FICO score, is available. In practice, financial institutions develop their own internal scoring systems based on public credit records and other information available in-house. Severe information asymmetry is observed between financial institutions and individuals without public credit records. External information from big data firms might help refine their credit evaluation.

In theory, however, it is unclear whether the big data credit score has the potential to outperform traditional lenders’ internal score. The latter is based on core financial information, such as historical credit records at licensed financial institutions and financial account activities, which are essential for predicting default (e.g., Mester et al., 2007; Norden and Weber, 2010). Despite its low dimensionality, this financial information directly captures a borrower’s ability to repay in the future. Thus, the traditional internal score could have greater information value in predicting defaults. However, the big data credit score could also outperform the internal rating because it has greater coverage and uses thousands of variables that cannot be easily manipulated. Traditional ratings rely on official credit records that are not available for most individuals in China, which limits their coverage. Self-reported financial information, which is widely used in traditional credit risk assessment, is likely to be manipulated and subject to falsification.

We empirically examine the efficiency of the two scores using a real business sample, and we compare their respective predictive power. We find that the big data credit score substantially outperforms the lender’s internal score. The AUC (area under the ROC curve) obtained by only using the lender’s internal score is 0.761, while the AUC obtained using the big data credit score is 0.809. The comparison indicates that the big data credit score predicts a borrower’s likelihood of delinquency with 18.4% greater accuracy than the internal score. Combining both scores predicts a borrower’s delinquency likelihood with 22.6% greater accuracy than the model using only the lender’s internal score. In terms of economic value, the big data credit score raises the expected profit per applicant by CNY 1500–2500 (or USD 220–360). The magnitude of the expected profit is of great economic significance, as the querying fee for the big data credit score is usually less than CNY 50 (or USD 7) per applicant.
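To make the AUC comparison concrete, the following minimal Python sketch shows one reading of the arithmetic. It rests on an assumption that this section does not state explicitly: that "greater accuracy" is measured as the relative improvement in AUC above the 0.5 level of a random classifier. Under that assumption, the two AUCs quoted above give a gain of about 18.4%. The `auc` helper, the `relative_gain` name, and the toy scores are illustrative only and are not taken from the paper.

```python
from typing import Sequence


def auc(risk_scores: Sequence[float], delinquent: Sequence[int]) -> float:
    """Rank-based AUC: the share of (delinquent, non-delinquent) borrower pairs
    in which the delinquent borrower has the higher risk score; ties count 0.5."""
    pos = [s for s, y in zip(risk_scores, delinquent) if y == 1]
    neg = [s for s, y in zip(risk_scores, delinquent) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one delinquent and one non-delinquent borrower")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_gain(auc_new: float, auc_base: float) -> float:
    """ASSUMED metric: relative improvement in AUC above the 0.5 chance level."""
    return (auc_new - 0.5) / (auc_base - 0.5) - 1.0


# AUCs quoted in the text above.
AUC_INTERNAL = 0.761
AUC_BIG_DATA = 0.809
gain = relative_gain(AUC_BIG_DATA, AUC_INTERNAL)
print(f"relative gain over internal score: {gain:.1%}")  # prints 18.4%

# Toy, made-up risk scores (higher = riskier), only to exercise auc().
toy_risk = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
toy_delinquent = [1, 1, 0, 1, 0, 0]
toy_auc = auc(toy_risk, toy_delinquent)
print(f"toy AUC: {toy_auc:.3f}")
```

The pairwise loop is quadratic in sample size, which is fine for illustration; a production setting would more likely use a sorted-rank computation or an established library routine.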

Further, we investigate who is rescued by big data credit scores. Among borrowers with below-median traditional ratings, those who submit more consistent identity information across multiple loan applications, file fewer loan applications with online non-bank lenders, and allow a longer time interval between two consecutive loan applications are more likely to be granted higher big data credit scores.

One possible explanation for these findings is that the big data credit score incorporates an individual’s very frequent and real-time behavioral information, such as online cash loan applications, online shopping, and internet surfing. First, these elements generate information that can better reflect an individual’s general profile. While public credit records are the main source of an individual’s financial credit profile, they only aggregate loan records at formal financial institutions. Besides, public credit records have limited coverage, which often prevents those without public credit records from accessing money from formal financial institutions. The results in this study are consistent with this interpretation. We find that, for borrowers without public credit records, using the big data credit score alone can achieve 99.6% of the predictive power of the combined model, which highlights big data’s potential for those who lack a credit history.

Second, big data may provide an opportunity to correct misreporting. Potential borrowers find it more costly to manipulate big data credit scores than simple variables since the former incorporate a large set of variables. In contrast, self-reported financial information, such as income (Jiang et al., 2014), is more vulnerable to manipulation. We focus on borrowers’ income misreporting and argue that big-data-based income information may be a better proxy for real income than self-reported income. We provide supportive evidence for this argument: big-data-based income is significantly and positively associated with delinquencies, while self-reported income is not. Also, borrowers with self-reported income above the estimated income based on big data are more likely to become delinquent.
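The last comparison can be illustrated with a small Python sketch. It flags borrowers whose self-reported income exceeds the big-data-based estimate and compares raw delinquency rates across the two groups. The `Borrower` fields, the helper names, and every record below are hypothetical and made up for illustration; the sketch does not reproduce the paper's data, estimates, or econometric specification, which this section does not describe.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Borrower:
    self_reported_income: float  # income stated on the application (hypothetical field)
    estimated_income: float      # income inferred from big data (hypothetical field)
    delinquent: bool


def overstates_income(b: Borrower) -> bool:
    """True when the self-reported income is above the big-data-based estimate."""
    return b.self_reported_income > b.estimated_income


def delinquency_rate(group: List[Borrower]) -> float:
    """Share of delinquent borrowers in the group (NaN for an empty group)."""
    if not group:
        return float("nan")
    return sum(b.delinquent for b in group) / len(group)


# Made-up records, for illustration only.
sample = [
    Borrower(12000, 8000, True),
    Borrower(9000, 9500, False),
    Borrower(15000, 7000, True),
    Borrower(6000, 6000, False),
    Borrower(10000, 6500, False),
    Borrower(7000, 8000, True),
    Borrower(5000, 5200, False),
    Borrower(11000, 11500, False),
]

over = [b for b in sample if overstates_income(b)]
rest = [b for b in sample if not overstates_income(b)]
rate_over = delinquency_rate(over)
rate_rest = delinquency_rate(rest)
print(f"overstating group: {rate_over:.2f}, others: {rate_rest:.2f}")
```

A raw group comparison like this ignores confounders; it only shows how the misreporting indicator is defined, not how its effect would be estimated.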

Our research’s main contribution is to shed light on the value of large-scale alternative data, or “big data”, in the context of personal credit assessment. To the best of our knowledge, this paper is the first to investigate how large-scale data improve credit evaluation in a real-world business scenario. Thus, this paper contributes to the literature by using such a setting to understand the information value of big data and its potential impact on participants in a real-world market.

The closest research to this study is Berg et al. (2020), who demonstrate that digital footprints perform better in predicting consumer loan delinquency than credit bureau scores. Our paper differs from and complements their findings along several dimensions. First, our paper investigates a machine-learning-based big data score that aggregates various types of information, including credit history with both non-bank and bank lenders, online shopping, and web browsing records. The score is constructed in the way that many FinTech lenders and credit service providers typically construct scores in practice. Using a big data score, rather than digital footprints alone, deepens the current understanding of the information value of big data and its impact on consumer credit evaluation. Besides, the proposed big data score is constructed from 3312 variables. This approach overcomes the scope and frequency limitations of data sources built on single digital footprint variables (which can be easily manipulated) and makes this study’s conclusions less subject to the Lucas critique (Lucas, 1976). Second, given that banks may use other information (e.g., income, bank account, and credit usage information) besides credit bureau scores for evaluating default risk, credit bureau scores likely underestimate a bank’s credit assessment ability. Our paper relies on the lender’s real internal rating, which summarizes the information that the lender uses in risk assessment. The internal rating thus serves as an appropriate benchmark for evaluating how lenders’ screening power is improved by big data. Third, this study’s sample individuals are minimally screened before being granted loans. Thus, the sample is more representative of the consumer loan applicant population.

This study is also related to several other strands of the literature. First, our research is related to the literature on the role of big data in alleviating information asymmetries in the consumer loan market, which exist in both developed and emerging economies (Adams et al., 2009; Stiglitz and Weiss, 1981). Oliver Wyman (Carroll and Rehmani, 2017) estimates that around 50 million people in the U.S. lack an informative credit score, which may lead to denials when applying for mainstream credit. The situation might be even worse in emerging markets, where the credit scoring system is often underdeveloped. This study contributes to this stream of the literature by providing evidence that large-volume information from alternative sources has the potential to fulfill this demand for credit information, thus expanding access to credit for those without credit records.

Second, our research relates to the literature on financial intermediaries in the consumer lending market. The existing research has attached great importance to the ability of intermediaries’ internal data, such as credit history and account data, to assess individual borrowers’ risk (Mester et al., 2007; Norden and Weber, 2010). Marshall et al. (2010) include customer loan approval process information in predicting loan performance. Khandani et al. (2010) construct a consumer credit score model via machine learning, but their model only incorporates bank account transactions and credit bureau information. Our model differs from theirs by including large-scale alternative information. Our results show that, while internal information performs well in predicting delinquency, the use of external alternative data significantly improves credit evaluation. Our findings suggest that alternative data may provide meaningful insight for intermediaries making business decisions in the consumer lending market.

Third, this study contributes to the growing body of research on the role of big data and data analytics in economics (see Einav and Levin (2014) for an overview). The financial industry is heavily dependent on data. The advent of big data and analytics represents a major advance, with tremendous potential for real-world business. Recent studies have documented the impact of big data’s application in capital markets, including its role as a corporate governance mechanism (Zhu, 2019) and the measurement of the value of FinTech innovation based on capital market reactions (Chen et al., 2019). This study contributes to this stream of research by deciphering big data’s value in the consumer lending market.

The remainder of the paper is organized as follows. Section 2 provides an overview of the institutional background. Section 3 details the data and information underlying the two scores. Section 4 reports the main results and discusses their economic implications. Section 5 describes the possible mechanisms. Section 6 compares our paper to Berg et al. (2020), and Section 7 concludes.

Section snippets

China’s financial credit information system

China’s public credit bureau is run by an arm of the central bank, the Credit Reference Center at the People’s Bank of China (PBCCRC), which maintains a credit reporting system on Chinese individuals. As stipulated by the Regulation on Credit Reporting Industry, enacted on March 15, 2013, this system acts as the Financial Credit Information Basic Database established by the State. According to the PBCCRC’s release, the credit reporting system’s coverage of individual borrowers was 361 million as of

Sample

Our sample is the testing sample that an anonymous traditional lender shared with BaiRong. The lender is a typical lender in China’s consumer loan market. As discussed in Section 2.3, before entering formal cooperation, the lender uses a double-blinded test to decide whether to introduce BaiRong’s big data credit score into its credit evaluation model for future business. Specifically, the lender randomly selected 7838 auto loan applicants from its applicant pool as the testing sample. Only

Summary statistics

Table 2 Panel A presents the summary statistics. The overdue rate of the sample borrowers is 8.7%.

How does the big data credit score improve prediction?

In this section, we examine how big data enhances model predictability. We provide suggestive evidence for the following two possible channels. First, incorporating alternative data may mitigate the information asymmetry between the lender and borrowers, especially those lacking information. Second, high-frequency online behavior data may help better assess those variables that are typically subject to misreporting.

Discussion: Comparison to Berg et al. (2020)

In this section, we briefly discuss three differences between our paper and Berg et al. (2020), who compare digital footprints and a credit bureau score in terms of their performance in predicting loan delinquency. First, in contrast with Berg et al. (2020), who focus on eight digital footprint variables, this study investigates a machine-learning-based big data score that aggregates various types of information, including credit history with both non-bank and bank lenders, online shopping, and web browsing

Conclusion

We use a proprietary double-blinded sample from a traditional lender to evaluate the potential impact of big data on consumer credit assessment. The dataset provides an ideal context for comparing the big data credit score and the lender’s internal rating because the sample is minimally screened and the two scores are constructed independently. In line with previous studies, we find that alternative data have information content in predicting consumer default. In particular, the big

CRediT authorship contribution statement

Jinglin Jiang: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing. Li Liao: Conceptualization, Methodology, Validation, Investigation, Writing - review & editing. Xi Lu: Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - review & editing. Zhengwei Wang: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation,

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.jempfin.2021.01.009.

Acknowledgment

The authors acknowledge the funding support from the National Natural Science Foundation of China (71790605).

References (23)

  • Einav, L., et al. (2014). Economics in the age of big data. Sci.