1 Introduction

There are many ways to construct classifiers, such as Bayesian methods, decision trees, case-based learning, artificial neural networks, support vector machines, genetic algorithms, rough sets, and fuzzy sets. Among them, the Bayesian method has become one of the most attractive because of its natural expression of uncertain knowledge, rich probabilistic expressiveness, and ability to incorporate prior knowledge through incremental learning. The naive Bayesian classification algorithm (NBC) is one of the classic Bayesian classification algorithms; it has a simple structure and high computational efficiency. One advantage of a naive Bayes classifier is that it only needs to estimate the necessary parameters (the mean and variance of each variable) from a small amount of training data. Because the variables are assumed independent, each variable can be estimated separately, and the full covariance matrix is not needed.
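To make this property concrete, here is a minimal sketch (Python with NumPy; the data are hypothetical) of how the independence assumption reduces parameter estimation to per-variable means and variances within each class, with no covariance matrix required:

```python
import numpy as np

# Hypothetical training data: six samples, two variables, two classes.
X = np.array([[1.0, 2.1], [0.9, 1.9], [1.2, 2.0],
              [3.1, 0.9], [2.9, 1.1], [3.0, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Under the independence assumption, each variable is summarized by its own
# per-class mean and variance; the full covariance matrix is never estimated.
for c in np.unique(y):
    Xc = X[y == c]
    print(f"class {c}: means = {Xc.mean(axis=0)}, variances = {Xc.var(axis=0)}")
```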

Thanks to these properties, the naive Bayesian classification algorithm has a wide range of applications, including clinical medicine [1,2,3], telecommunications [4, 5], artificial intelligence [6], linguistics [7, 8], gene technology [9], precision instruments [10], and other fields. At the same time, the naive Bayes classification algorithm is highly compatible with other methods and can be combined with them to form more powerful algorithms, such as double-weighted fuzzy gamma naive Bayes classification [11], fuzzy association naive Bayes classification [12], complex network naive Bayes classification [13], feature selection naive Bayes classification [14], and tree-augmented naive Bayes classification [15].

Meanwhile, with advancing urbanization, improved transportation facilities, and the growing popularity of family cars, “road killers” are becoming more common, and the problem of traffic risk is increasingly prominent. How to manage this risk before drivers act, issue classified early warnings in advance, and thereby achieve management at the source has become a hot topic in both industry and academia. In terms of research fields, studies of traffic risk management cover many kinds of traffic risk, including traffic accidents [16], water safety [17], and extreme weather [18]. In terms of research methods, scholars have applied a large number of different methods to classify, manage, analyze, and predict traffic risk, including signal control [19] and spatiotemporal analysis [20]. In particular, with the maturing of big data technology and the growth of databases, AI-related methods are increasingly used in traffic risk management, including support vector machines [21], RBF neural networks [22], deep learning [23], and fuzzy rule bases [24].

From the above analysis, it is found that the existing research has the following shortcomings:

First, naive Bayes classification has an obvious defect: it rests on the assumption of attribute independence, which in most cases does not hold in reality [25]. This assumption also gives redundant, irrelevant, interacting, and noise-contaminated features the same status as the truly important features, which ultimately reduces classification accuracy.

Second, there is little research on driver risk. The existing literature more commonly addresses the risk of traffic scenes than the risk of drivers. Yet the driver is the most important factor in traffic accidents: more than 90% of traffic accidents are related to driver behavior. Establishing risk management models for drivers, especially for driver characteristics (such as gender, driving age, and personality), therefore has great research potential. The purpose of this study is to investigate risk based on drivers’ personal characteristics and realize management at the source.

Third, machine learning algorithms are rarely used in the field of traffic risk management. With the rapid growth of traffic data and of computing power, machine learning has become a potentially important means of handling traffic risk management [26].

To address these shortcomings, this paper improves the naive Bayes classification algorithm by combining feature weighting with Laplace calibration. The improved algorithm overcomes the shortcomings above and makes full use of the information in the training set, greatly improving the accuracy of the original naive Bayes classification algorithm. The improved algorithm is then applied to traffic risk management to predict and classify drivers’ driving risk and ultimately implement effective risk management.

The rest of the paper is organized as follows. The improved naive Bayes classification algorithm is established in Section 2. In Section 3, numerical simulation is used to verify the accuracy of the improved algorithm, the method is applied to traffic risk big data for robustness analysis, and a discussion follows. Conclusions are given in Section 4.

2 Model

2.1 Bayes theory

Bayesian theory is an important part of subjective Bayesian inductive theory. Bayesian decision-making estimates the subjective probabilities of unknown states under incomplete information, revises those probabilities with the Bayes formula, and finally makes the optimal decision using the expected values computed from the revised probabilities.

Let Ω be the sample space, and let C1, C2, ⋯, Cn ⊆ Ω, where Ci denotes the ith category and P(Ci) > 0, i = 1, 2, ⋯, n. Any two categories are mutually exclusive, and \( \underset{i=1}{\overset{n}{\cup }}{C}_i=\varOmega \). For any event X with P(X) > 0,

$$ P\left({C}_i|X\right)=\frac{P\left(X|{C}_i\right)P\left({C}_i\right)}{\sum \limits_{l=1}^nP\left(X|{C}_l\right)P\left({C}_l\right)} $$
(1)
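As a quick numerical illustration of formula (1), the fragment below (a hypothetical two-class example) normalizes the products P(X | Ci) P(Ci) to obtain the posterior probabilities:

```python
# Hypothetical two-class example of formula (1):
priors = [0.6, 0.4]        # P(C_1), P(C_2)
likelihoods = [0.2, 0.5]   # P(X | C_1), P(X | C_2)

joint = [p * l for p, l in zip(priors, likelihoods)]
posteriors = [j / sum(joint) for j in joint]
print(posteriors)  # [0.375, 0.625] -- the posteriors sum to 1
```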

2.2 Naive Bayesian classification

Naive Bayes classification assigns a sample to the category with the largest posterior probability [27], that is:

$$ P\left({C}_i|X\right)=\max \left\{P\left({C}_1|X\right),P\left({C}_2|X\right),\cdots, P\left({C}_n|X\right)\right\} $$
(2)

Suppose the sample X = (A1, A2, ⋯, Ak) is an attribute vector, where Aj is the jth attribute, which takes the value xj in this sample (among several possible values).

Naive Bayes classification assumes that the attributes are mutually independent given the class, so

$$ P\left(X|{C}_i\right)=\prod \limits_{j=1}^kP\left({A}_j={x}_j|{C}_i\right) $$
(3)

Substituting formula (3) into formula (1) gives:

$$ P\left({C}_i|X\right)=\frac{\prod \limits_{j=1}^kP\left({A}_j={x}_j|{C}_i\right)P\left({C}_i\right)}{P\left(X\right)} $$
(4)

Let \( \frac{1}{P\left(X\right)}=\alpha\ \left(>0\right) \); then

$$ P\left({C}_i|X\right)=\alpha \prod \limits_{j=1}^kP\left({A}_j={x}_j|{C}_i\right)P\left({C}_i\right) $$
(5)

In the sample set D, let N(D) be the total number of samples, N(Ci) the number of samples in category Ci, and N(C = Ci, Aj = xj) the number of samples in Ci whose attribute Aj equals xj. Then

$$ P\left({C}_i\right)=\frac{N\left({C}_i\right)}{N(D)} $$
(6)
$$ P\left({A}_j={x}_j|C={C}_i\right)=\frac{N\left(C={C}_i,{A}_j={x}_j\right)}{N\left({C}_i\right)} $$
(7)

Substituting formula (6) and formula (7) into formula (5), we obtain

$$ P\left({C}_i|\mathrm{X}\right)=\alpha \prod \limits_{j=1}^k\frac{N\left(C={C}_i,{A}_j={x}_j\right)}{N\left({C}_i\right)}\cdot \frac{N\left({C}_i\right)}{N(D)} $$
(8)
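The counting form of formula (8) translates directly into code. The following is a minimal sketch (plain Python over hypothetical discrete samples) that computes the unnormalized score for one class; the constant α is omitted because it is identical for every class and does not affect the argmax:

```python
def nb_score(D, labels, x, c):
    """Unnormalized P(c | x) per formula (8): the class prior times the
    product of per-attribute conditional frequencies (no smoothing yet)."""
    n = len(D)
    rows_c = [row for row, lab in zip(D, labels) if lab == c]
    n_c = len(rows_c)
    score = n_c / n                               # P(C_i) = N(C_i) / N(D)
    for j, xj in enumerate(x):
        n_cj = sum(1 for row in rows_c if row[j] == xj)
        score *= n_cj / n_c                       # P(A_j = x_j | C_i)
    return score

# Hypothetical data: two attributes, two classes.
D = [(0, 1), (0, 0), (1, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1, 1]
print(nb_score(D, labels, x=(1, 1), c=1))  # compare across classes, take the max
```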

2.3 Feature-weighted naive Bayes classification algorithm

It is generally believed that the more frequently an attribute value appears, the more important it is, and the greater its weight in the model should be [28, 29]. Therefore, the weight coefficient of the feature is set as

$$ {w}_j=\frac{N\left({A}_j={x}_j\right)}{N(D)} $$

where wj is the proportion of samples with Aj = xj in the total number of samples. Formula (8) can then be improved to:

$$ P\left({C}_i|X\right)=\alpha \prod \limits_{j=1}^k{w}_j\frac{N\left(C={C}_i,{A}_j={x}_j\right)}{N\left({C}_i\right)}\cdot \frac{N\left({C}_i\right)}{N(D)}=\alpha \prod \limits_{j=1}^k\frac{N\left({A}_j={x}_j\right)}{N(D)}\cdot \frac{N\left(C={C}_i,{A}_j={x}_j\right)}{N\left({C}_i\right)}\cdot \frac{N\left({C}_i\right)}{N(D)} $$
(9)

2.4 Laplace calibration

Formula (9) has a potential problem: when the number of training samples is small and the number of attributes is large, the training samples may not cover all attribute values, so the count for Aj = xj may be 0, making the entire posterior probability P(Ci | X) equal to 0 [30, 31]. If this happens frequently, accurate classification becomes impossible, so estimating the class-conditional probabilities directly from raw proportions is fragile. The remedy is Laplacian calibration (Laplacian estimation), which eliminates zero class-conditional probabilities, and this slight change does not alter how samples are classified.
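The failure mode is easy to see numerically; in the hypothetical fragment below, a single attribute value that never co-occurs with the class zeroes out the entire product:

```python
# Without smoothing, one unseen attribute value zeroes out the whole product.
counts = [3, 0, 2]      # hypothetical N(C=c, A_j=x_j) for three attributes
n_c = 5                 # hypothetical N(C=c)
prob = 1.0
for n_cj in counts:
    prob *= n_cj / n_c  # the middle factor is 0/5 = 0
print(prob)             # 0.0 -- the posterior collapses to zero
```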

The specific method is to improve formula (7) as follows:

$$ P\left({A}_j={x}_j|C={C}_i\right)=\frac{N\left(C={C}_i,{A}_j={x}_j\right)+1}{N\left({C}_i\right)+{q}_j} $$
(10)
$$ {w}_j=\frac{N\left({A}_j={x}_j\right)+1}{N(D)+{q}_j} $$
(11)

qj represents the number of possible values of attribute Aj.

Substituting formula (10) and formula (11) into formula (9), we obtain

$$ P\left({C}_i|X\right)=\alpha \frac{N\left({C}_i\right)}{N(D)}\prod \limits_{j=1}^k\frac{N\left({A}_j={x}_j\right)+1}{N(D)+{q}_j}\cdot \frac{N\left(C={C}_i,{A}_j={x}_j\right)+1}{N\left({C}_i\right)+{q}_j}\;i=1,2,\cdots, n $$
(12)
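Putting the pieces together, here is a minimal sketch (plain Python, hypothetical data) of formula (12): the Laplace-calibrated conditional frequencies of formula (10) multiplied by the Laplace-calibrated feature weights of formula (11); α is dropped since it does not affect which class attains the maximum:

```python
def improved_nb_predict(D, labels, q, x):
    """Classify x by formula (12): feature-weighted naive Bayes with
    Laplace calibration. q[j] is the number of possible values of A_j."""
    n = len(D)
    best_class, best_score = None, -1.0
    for c in sorted(set(labels)):
        rows_c = [row for row, lab in zip(D, labels) if lab == c]
        n_c = len(rows_c)
        score = n_c / n                                     # P(C_i)
        for j, xj in enumerate(x):
            n_j = sum(1 for row in D if row[j] == xj)       # N(A_j = x_j)
            n_cj = sum(1 for row in rows_c if row[j] == xj) # N(C=C_i, A_j=x_j)
            w_j = (n_j + 1) / (n + q[j])                    # formula (11)
            p_cond = (n_cj + 1) / (n_c + q[j])              # formula (10)
            score *= w_j * p_cond
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical data: attributes with 2 and 3 possible values, two classes.
D = [(0, 1), (0, 0), (1, 2), (1, 0), (1, 1)]
labels = [0, 0, 1, 1, 1]
print(improved_nb_predict(D, labels, q=[2, 3], x=(1, 1)))
```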

3 Results and discussion

3.1 Numerical simulation

3.1.1 Impact of sample size

Suppose the number of attributes is k = 5, each attribute takes q = 5 values, and the number of categories is C = 2. Ten thousand samples are randomly drawn from the standard normal distribution N(0, 1), and the accuracy of the model is tested as the sample size is gradually increased.
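The paper does not specify how the N(0, 1) draws are converted into discrete attributes or how the class labels are generated. The sketch below is therefore one plausible reconstruction of the accuracy-versus-sample-size experiment, assuming class-shifted Gaussian features discretized into q bins and using scikit-learn's CategoricalNB as a stand-in classifier:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB  # stand-in for the improved algorithm

rng = np.random.default_rng(0)
k, q, n_classes = 5, 5, 2  # attributes, values per attribute, categories

def make_data(n):
    # Assumption: class-dependent Gaussian features discretized into q bins.
    y = rng.integers(n_classes, size=n)
    X = rng.normal(size=(n, k)) + y[:, None]  # shift the mean by class
    edges = np.linspace(-2.0, 3.0, q - 1)     # fixed bin edges -> q codes (0..q-1)
    return np.digitize(X, edges), y

for n in [100, 500, 1000, 5000, 10000]:
    X_train, y_train = make_data(n)
    X_test, y_test = make_data(2000)
    acc = CategoricalNB(min_categories=q).fit(X_train, y_train).score(X_test, y_test)
    print(f"n = {n}: accuracy = {acc:.3f}")
```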

It can be seen from Fig. 1 that when the sample size is small, the accuracy of the discriminant analysis fluctuates greatly; as the sample size grows, the fluctuation gradually diminishes and the overall trend stabilizes, with accuracy exceeding 99%.

Fig. 1 The impact of sample size on the accuracy of the model

3.1.2 Impact of sample attributes

One thousand samples are randomly drawn from the standard normal distribution N(0, 1), assuming the number of categories is C = 2 and each attribute takes q = 5 values, while the number of attributes is gradually increased.

As can be seen from Fig. 2, when the number of attributes is less than 400, accuracy stays above 95% at a high and stable level; between 400 and 600 attributes, accuracy drops precipitously; beyond 600 attributes, accuracy falls to about 50% and the overall trend is again stable.

Fig. 2 The impact of sample attributes on the accuracy of the model

3.1.3 Impact of category

One thousand samples are randomly drawn from the standard normal distribution N(0, 1), assuming the number of attributes is k = 5 and each attribute takes q = 5 values, while the number of categories is gradually increased.

As can be seen from Fig. 3, when the number of categories is small (< 24), accuracy remains above 95% and the trend is stable; when the number of categories is large (24–60), accuracy fluctuates greatly and stability is poor; when the number of categories increases further (> 60), accuracy quickly drops to zero.

Fig. 3 The impact of category on the accuracy of the model

3.2 Improved Bayesian classification algorithm for traffic risk management

3.2.1 Data collection and processing

A total of 115,482 traffic violation cases were randomly sampled in a city from January 2019 to December 2019, of which 30,340 samples had complete data. Two kinds of traffic violations are considered: speeding and running red lights. Speeding without running a red light is the first category, running a red light without speeding is the second category, and speeding combined with running a red light is the third category, coded 0, 1, and 2, respectively. Five factors for traffic violations are recorded: licensed driving, gender, vehicle type, driving age, and weather. Unlicensed driving is coded 0 and licensed driving 1; female drivers are coded 0 and male drivers 1; small cars are coded 0, medium buses 1, and large trucks 2; driving experience of less than 1 year is coded 0, 1 to 3 years 1, and more than 3 years 2; weather is coded 0 for sunny, 1 for rainy, 2 for foggy, and 3 for snowy days.
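For clarity, the coding scheme above can be written out explicitly; the sketch below (field names are illustrative, not taken from the original dataset) encodes one violation record:

```python
# Coding scheme of Section 3.2.1 (field names are hypothetical).
CODES = {
    "violation":   {"speeding only": 0, "red light only": 1, "both": 2},
    "licensed":    {"no": 0, "yes": 1},
    "gender":      {"female": 0, "male": 1},
    "vehicle":     {"small car": 0, "medium bus": 1, "large truck": 2},
    "driving_age": {"<1 year": 0, "1-3 years": 1, ">3 years": 2},
    "weather":     {"sunny": 0, "rainy": 1, "foggy": 2, "snowy": 3},
}

record = {"licensed": "yes", "gender": "male", "vehicle": "small car",
          "driving_age": "1-3 years", "weather": "rainy"}
features = ("licensed", "gender", "vehicle", "driving_age", "weather")
encoded = [CODES[f][record[f]] for f in features]
print(encoded)  # [1, 1, 0, 1, 1]
```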

According to the descriptive statistics (Table 1), red light running accounts for nearly 60% of violations, and 75% of speeding drivers also run red lights. Twenty percent of violations involve unlicensed drivers, showing that unlicensed driving is a very dangerous behavior. Men account for more than 60% of violations, suggesting there is no basis for prejudice against female drivers. In terms of driving experience, violations and experience are inversely related: the less driving experience, the more violations. In terms of weather, nearly 60% of violations occurred on sunny days, so bad weather is not the main cause of violations.

Table 1 Descriptive statistics of data

3.2.2 Improved naive Bayes classification algorithm

Analyzing the data with the improved naive Bayes classification algorithm (Table 2) yields the following results: in the first, second, and third classes of traffic violations, 5097, 17,311, and 5501 samples are correctly classified, for accuracies of 69.8%, 98.8%, and 99.7%, respectively, and an overall accuracy of 92.0%. The improved naive Bayes classification algorithm thus achieves a very high accuracy, especially for the second and third categories.

Table 2 Discriminant analysis of the improved naive Bayes classification algorithm

3.2.3 Naive Bayes classification algorithm

To compare with the improved naive Bayesian classification algorithm, this paper applies the original naive Bayesian classification algorithm to the same data, with the following results (Table 3):

Table 3 Discriminant analysis of naive Bayes classification algorithm

From these results, the accuracies of the first, second, and third classes are 52.8%, 41.5%, and 69.7%, respectively, and the overall accuracy of the discriminant analysis is 49.4%. All of these figures are far below those of the improved naive Bayesian classification algorithm, so the improvement yields a large gain in accuracy.

3.2.4 Robustness test

To further assess the improved naive Bayesian classification algorithm, this paper compares it with logistic regression. Because all variables are categorical and the dependent variable takes three values, multinomial logistic regression is adopted [32, 33].
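For reference, the comparison model can be fit with standard tooling. The sketch below (hypothetical encoded data in the layout of Section 3.2.1) fits a multinomial logistic regression on main effects only with scikit-learn; interaction terms would be added for the total factor model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical encoded data: columns = licensed, gender, vehicle,
# driving age, weather; y = violation class (0, 1, 2).
X = np.array([[1, 1, 0, 1, 0], [0, 1, 2, 0, 1],
              [1, 0, 0, 2, 0], [1, 1, 1, 0, 3]])
y = np.array([0, 2, 1, 1])

# One-hot encode the categorical predictors (main effects only).
X_dummies = OneHotEncoder(handle_unknown="ignore").fit_transform(X)

# The default lbfgs solver fits a multinomial model for multiclass targets.
clf = LogisticRegression(max_iter=1000).fit(X_dummies, y)
print(clf.predict(X_dummies))
```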

a. Multiple logistic main effect regression

In this section, a multiple logistic main effect model is used for regression analysis [34], with the following results (Table 4):

Table 4 Discriminant analysis of multiple logistic main effect regression

According to Table 4, the accuracies of the first, second, and third classes are 37.7%, 90.0%, and 93.5%, respectively, and the overall accuracy is 78.1%. The accuracy of the multiple logistic main effect regression is thus much lower than that of the improved naive Bayes classification algorithm.

b. Multiple logistic total factor regression

The multiple logistic main effect regression considers only the main effects and ignores the interaction effects among factors. Therefore, this section continues with a multiple logistic total factor regression [35], with the following results (Table 5):

Table 5 Discriminant analysis of multiple logistic total factor regression

As Table 5 shows, in the multiple logistic total factor regression, the accuracies of the first, second, and third classes are 45.9%, 91.9%, and 94.5%, respectively, and the overall accuracy is 81.3%. The total factor regression is therefore more accurate than the main effect regression but still far below the improved naive Bayes classification algorithm.

3.3 Discussion

The numerical simulations show the following. When the sample size is small, the accuracy of the improved naive Bayesian classification algorithm fluctuates greatly, but as the sample size increases, the fluctuation gradually diminishes and the trend stabilizes, with accuracy exceeding 99%. When the number of attributes is below 400, accuracy stays above 95% at a high, stable level; between 400 and 600 attributes, it drops precipitously; above 600 attributes, it falls to about 50% and stabilizes. When the number of categories is small (< 24), accuracy remains above 95% and stable; for 24–60 categories, it fluctuates greatly with poor stability; beyond 60 categories, it quickly drops to zero.

The empirical analysis shows that, with the improved naive Bayes classification algorithm, 5097, 17,311, and 5501 samples in the first, second, and third classes of traffic violations are correctly classified, for accuracies of 69.8%, 98.8%, and 99.7% and an overall accuracy of 92.0%. With the original naive Bayes classification algorithm, the accuracies of the three classes are 52.8%, 41.5%, and 69.7%, with an overall accuracy of only 49.4%, all far below the improved algorithm.

The robustness analysis shows that multiple logistic main effect regression achieves class accuracies of 37.7%, 90.0%, and 93.5% and an overall accuracy of 78.1%, while multiple logistic total factor regression achieves 45.9%, 91.9%, and 94.5% with an overall accuracy of 81.3%. The total factor regression is thus more accurate than the main effect regression but still far below the improved naive Bayes classification algorithm.

This research shows that the improved naive Bayes algorithm greatly improves on the original algorithm. Unfortunately, this paper also has some limitations, such as its inability to account for interactions among features and its sensitivity to sample size, the number of categories, and other factors.

4 Main conclusions

To address the shortcomings of the naive Bayesian classification algorithm, this paper improves it with feature weighting and Laplace calibration, obtaining the improved naive Bayesian classification algorithm. The results show that when the sample size is large, the improved algorithm is highly accurate (99%) and very stable. When the number of attributes is below 400, accuracy exceeds 95%; above 600 attributes, accuracy decreases to about 50% with a stable trend. When the number of categories is below 24, accuracy stays at or above 95% with a stable trend; beyond 60 categories, accuracy rapidly drops to zero. The empirical study shows that, compared with the original naive Bayesian classification algorithm, the improved algorithm greatly increases the accuracy of the discriminant analysis, from 49.4% to 92.0%. Compared with the multinomial logistic main effect regression and multinomial logistic total factor regression, the improved naive Bayesian classification algorithm also achieves higher accuracy.