Abstract
There is a long history of interest in modeling Poisson regression in different fields of study. The focus of this work is on handling the issues that arise after modeling count data. For the prediction and analysis of count data, it is valuable to study the factors that influence the performance of the model and the decisions based on the analysis of that model. In regression analysis, multicollinearity and influential observations, separately and jointly, affect the model estimation and inferences. In this article, we focus on multicollinearity and influential observations simultaneously. To evaluate the reliability and quality of regression estimates and to overcome problems in model fitting, we propose new diagnostic methods based on the Sherman–Morrison–Woodbury (SMW) theorem to detect influential observations using approximate deletion formulas for the Poisson regression model with the Liu estimator. A Monte Carlo simulation study is conducted to assess the proposed diagnostic methods. Real data are also considered for the evaluation of the proposed methods. Results show the superiority of the proposed diagnostic methods in detecting unusual observations in the presence of multicollinearity compared to the traditional maximum likelihood estimation method.
1. Introduction
Nowadays, several distributions are available in the literature that can be used to remove noise and then predict data. Similarly, there is long-standing interest in modeling count data, which has many applications in the biosciences and other disciplines [1–4]. The focus of this work is on dealing with the issues that occur after modeling the count data. For the prediction and analysis of count data, it is valuable to study the factors that influence the performance of the model and the decisions based on its analysis. Regarding suitable statistical modeling, when the dependent variable is count data, one of the most widely used statistical models is the Poisson regression model (PRM). For accurate statistical inferences, standard ordinary least squares (OLS) regression imposes some important assumptions on the model's errors [5]. Generally, numerous problems may arise when a count variable model is estimated by the OLS method, because these assumptions are violated by the nature of count data. For the analysis of count data, the PRM provides the most relevant results. According to McCullagh and Nelder [6], the PRM belongs to the family of generalized linear models (GLMs). The maximum likelihood (ML) estimation method is used to estimate the PRM coefficients instead of the OLS method.
In the PRM, the ML method is very sensitive when the explanatory variables are linearly correlated [7]. Several biased estimators have been introduced in the literature to handle multicollinearity, e.g., the Stein, ridge, Lasso, regularization, and Liu estimators; see [1, 3] and [8–10] for more details. The most popular is the ridge estimator, but it has some limitations, the chief one being the selection of the ridge parameter. A better choice is the Liu estimator, which avoids this hindrance of the ridge estimator [10] while remaining easy to use, since it can be written in an explicit closed form. In the literature, various studies are available for the PRM that address the presence of collinearity [7, 11–16].
To evaluate the reliability and quality of regression estimates and to overcome the problems in model fitting, diagnostic techniques have been developed. Although regression diagnostics have been developed methodologically and theoretically for linear regression models together with multicollinearity (see [17–24]), some studies about the influence diagnostics in the GLM with uncorrelated explanatory variables are available in the literature. Pregibon [25] proposed the influence diagnostics for logistic regression using the one-step methods. For further discussion on influential diagnostics about GLM, see [26–32].
Influence diagnostics in the GLM with correlated explanatory variables is very limited. Özkale et al. [33] proposed the first study on influence diagnostics for logistic ridge regression. Amin et al. [34] worked on the influence diagnostics for the gamma ridge regression model. Khan et al. [35] assessed the performance of influence diagnostics in the PRM with a ridge estimator. Recently, Khan et al. [36] examined the superiority of influence diagnostics in the PRM with two-parameter estimator and, further, Amin et al. [37] discussed the influence diagnostics for the inverse Gaussian ridge regression model.
The available literature shows that no study of influence diagnostics with the Liu estimator is available for the GLM. In particular, Poisson Liu regression (PLR) diagnostics have received no serious attention up to now. Thus, the present work is an effort to fill this gap: we propose diagnostic methods for the PRM under the Liu estimator, which prove to be competitive.
The remainder of the study is organized as follows. Sections 2 and 3 focus on the formulation of influence diagnostic measures for the PRM under the Liu estimator (LE). Next, in Sections 4 and 5, we conduct a Monte Carlo study using two, four, and six independent variables to examine the detection percentage of the newly developed diagnostic measures and, finally, we demonstrate the efficacy of the proposed measures with the help of a real-world application.
1.1. Model Specification and Estimator
Suppose the linear model can be written as $y = X\beta + \varepsilon$, where $y$ is the $n \times 1$ vector of observations, $X$ is an $n \times p$ matrix of explanatory variables, $\beta$ is the $p \times 1$ vector of unknown parameters, and $\varepsilon$ is the error vector with $E(\varepsilon) = 0$ and the errors being independent. The PRM is applicable for real data, especially when the response variable comes in the form of counts. Let $y_i$ follow a Poisson distribution with $\mu_i$ as its parameter. The probability mass function for the PRM, used to describe the relationship when the response variable occurs as count data, is $f(y_i) = e^{-\mu_i}\mu_i^{y_i}/y_i!$, $y_i = 0, 1, 2, \ldots$
The PRM belongs to the GLM class with log link function $\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$, where $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_p$ are the set of coefficients. The estimated mean function is defined by $\hat\mu_i = \exp(x_i^T \hat\beta)$.
Here $x_i^T$ is the $i$th row of the matrix of independent variables and $\beta$ is the vector of coefficients, where $p$ represents the number of explanatory variables.
Assume that all $y_i$ are independent; then, the joint log-likelihood is defined as $\ell(\beta) = \sum_{i=1}^{n}\left[y_i x_i^T\beta - \exp(x_i^T\beta) - \log(y_i!)\right]$.
To find the ML estimate of $\beta$, we have to solve the score equations $\partial \ell(\beta)/\partial \beta = \sum_{i=1}^{n}\left(y_i - \exp(x_i^T\beta)\right)x_i = 0$.
Since this system of equations is nonlinear, the ML estimate is obtained by the iteratively reweighted least squares (IRLS) algorithm, whose converged solution can be written in explicit form as $\hat\beta_{ML} = (X^T\hat WX)^{-1}X^T\hat W\hat z$, where $\hat W = \mathrm{diag}(\hat\mu_i)$ and $\hat z_i = \log(\hat\mu_i) + (y_i - \hat\mu_i)/\hat\mu_i$.
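The IRLS algorithm can be sketched as follows. This is an illustrative implementation in Python (the paper's own analyses were carried out in R); the function name and the convergence settings are ours, not the authors'.

```python
import numpy as np

def poisson_irls(X, y, tol=1e-8, max_iter=100):
    """Fit a Poisson regression by iteratively reweighted least squares.

    X is the n x p design matrix (include a column of ones for an intercept),
    y the vector of counts.  Each step solves the weighted normal equations
    X' W X beta = X' W z  with W = diag(mu) and working response
    z = X beta + (y - mu) / mu.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)
        W = mu                      # Poisson IRLS weights w_i = mu_i
        z = eta + (y - mu) / mu     # adjusted (working) response
        XtW = X.T * W               # X' W without forming diag(W)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        done = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if done:
            break
    mu = np.exp(X @ beta)           # fitted means at the converged solution
    return beta, mu
```

At convergence the score equations of the previous display are satisfied, i.e., $X^T(y - \hat\mu) \approx 0$.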
In the presence of multicollinearity, the matrix $X^T\hat WX$ becomes ill conditioned, and because of this problem, it gets complicated to draw effective inferences. To overcome these effects of multicollinearity, we use the generalization of the Liu estimator [10] to define the Poisson Liu regression estimator (PLRE), $\hat\beta_{PLRE} = (X^T\hat WX + I_p)^{-1}(X^T\hat WX + dI_p)\hat\beta_{ML}$, where $0 < d < 1$. Here, the important step is selecting the shrinkage parameter $d$, as its optimal value affects the performance of the PLRE. Furthermore, if $d = 1$, then $\hat\beta_{PLRE} = \hat\beta_{ML}$. Recently, Qasim et al. [38] recommended the optimum Liu parameter for the Liu estimator in the PRM as $\hat d = \max\left(0, \min_j \frac{\hat\alpha_j^2 - 1}{1/\hat\lambda_j + \hat\alpha_j^2}\right)$, where $\hat\alpha = Q^T\hat\beta_{ML}$, $\hat\lambda_j$ is the $j$th eigenvalue of the $X^T\hat WX$ matrix, and the columns of the orthogonal matrix $Q$ represent the eigenvectors of the $X^T\hat WX$ matrix, such that $Q^T X^T\hat WX\,Q = \Lambda$, where $\Lambda = \mathrm{diag}(\hat\lambda_1, \ldots, \hat\lambda_p)$.
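The PLRE and the optimum Liu parameter can likewise be sketched in Python. The helper names are ours, and the formula for $\hat d$ follows Qasim et al. [38] as we understand it; this is a sketch, not the authors' code.

```python
import numpy as np

def liu_estimator(X, W, beta_ml, d):
    """Poisson Liu estimator:
       beta_PLRE = (X'WX + I)^{-1} (X'WX + d I) beta_ML,  0 < d < 1.
    W is the vector of IRLS weights mu_i at convergence."""
    A = X.T @ (X * W[:, None])          # X' W X
    I = np.eye(X.shape[1])
    return np.linalg.solve(A + I, (A + d * I) @ beta_ml)

def optimal_d(X, W, beta_ml):
    """Optimum Liu parameter as reconstructed from Qasim et al. [38]:
       d = max(0, min_j (alpha_j^2 - 1) / (1/lambda_j + alpha_j^2)),
    where lambda_j, q_j are the eigenvalues/eigenvectors of X'WX and
    alpha = Q' beta_ML."""
    A = X.T @ (X * W[:, None])
    lam, Q = np.linalg.eigh(A)          # eigendecomposition of X'WX
    alpha = Q.T @ beta_ml
    return max(0.0, float(np.min((alpha**2 - 1.0) / (1.0 / lam + alpha**2))))
```

Note that `liu_estimator(X, W, beta_ml, 1.0)` returns `beta_ml` exactly, matching the property $\hat\beta_{PLRE} = \hat\beta_{ML}$ at $d = 1$.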
2. The PRM Diagnostics
2.1. Hat Matrix, Leverage, and Residuals for the PRM
The hat matrix is a common measure used to compute leverages. According to Davison and Tsai [39], the hat matrix in the PRM is $H = \hat W^{1/2}X(X^T\hat WX)^{-1}X^T\hat W^{1/2}$.
The diagonal elements $h_{ii}$ of $H$ are interpreted as leverages, i.e., $h_{ii} = \hat\mu_i x_i^T(X^T\hat WX)^{-1}x_i$. For the computation of regression diagnostic measures, residuals play the most important role (Belsley et al. [18]). Let $r_i$ symbolize the Pearson residual; for the case of the PRM, we define it as $r_i = (y_i - \hat\mu_i)/\sqrt{\hat\mu_i}$.
Similarly, we find the standardized Pearson residual as $r_{si} = r_i/\sqrt{1 - h_{ii}}$.
Another useful residual that proves to be of great help in detecting unusual observations is the deviance residual. The deviance residual for the PRM is defined by $d_i = \mathrm{sign}(y_i - \hat\mu_i)\sqrt{2\left[y_i\log(y_i/\hat\mu_i) - (y_i - \hat\mu_i)\right]}$, where $\mathrm{sign}(\cdot)$ is the sign function [31].
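Under these definitions, the leverages and the residual types can be computed together. The following Python sketch (our own helper; the fitted means are supplied by the user) illustrates the formulas, taking $y_i \log(y_i/\hat\mu_i) = 0$ when $y_i = 0$ by convention.

```python
import numpy as np

def prm_diag_quantities(X, y, mu):
    """Leverages and residuals for a fitted PRM (mu = fitted means).

    H = W^{1/2} X (X'WX)^{-1} X' W^{1/2},  W = diag(mu);
    Pearson residual   r_i = (y_i - mu_i) / sqrt(mu_i);
    standardized form  r_i / sqrt(1 - h_ii);
    deviance residual  sign(y_i - mu_i) sqrt(2[y_i log(y_i/mu_i) - (y_i - mu_i)]).
    """
    Xw = X * np.sqrt(mu)[:, None]                 # W^{1/2} X
    H = Xw @ np.linalg.solve(Xw.T @ Xw, Xw.T)     # projection onto col(W^{1/2}X)
    h = np.diag(H)
    r = (y - mu) / np.sqrt(mu)
    r_std = r / np.sqrt(1.0 - h)
    # y log(y/mu) is taken as 0 when y = 0
    ylog = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    dev = np.sign(y - mu) * np.sqrt(2.0 * (ylog - (y - mu)))
    return h, r, r_std, dev
```

Since $H$ is a projection matrix, the leverages satisfy $0 \le h_{ii} \le 1$ and $\sum_i h_{ii} = p$, which provides a useful sanity check.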
2.2. Influence Diagnostic Methods
Pregibon [25] was the first to work on logistic regression diagnostic tools and proposed influence diagnostic measures using one-step approximations. The proposed influence diagnostics comprise Cook's distance, the change in deviance, and the change in Pearson $\chi^2$. For the PRM, Cook's distance is suggested as $CD_i = \frac{1}{p}(\hat\beta - \hat\beta_{(i)})^T X^T\hat WX(\hat\beta - \hat\beta_{(i)})$.
$CD_i$ measures the overall change in the fitted model when the $i$th observation is deleted from the model. The one-step approximation for this expression uses $\hat W_{(i)}$, the weight matrix after the removal of the $i$th observation, and can be further approximated as $CD_i \approx \frac{h_{ii}}{p(1 - h_{ii})}r_{si}^2$.
Hardin and Hilbe [40] suggested a cut point for detecting the unusual observations in the GLM; this cut point specifies the detection window in the GLM.
Pregibon [25] suggested the change in Pearson $\chi^2$ as another influence diagnostic measure to detect the influential observations. Applying the one-step approximation, we define the change in Pearson $\chi^2$ as $\Delta\chi_i^2 = \chi^2 - \chi_{(i)}^2 \approx \frac{r_i^2}{1 - h_{ii}}$, where $\chi^2$ represents the sum of squared Pearson residuals of the complete data set and $\chi_{(i)}^2$ signifies the sum of squared Pearson residuals of the data set without the $i$th observation. This statistic is employed to study the effect of the $i$th observation on the goodness of fit of the model and the estimates. On similar grounds, Pregibon [25] suggested the change in deviance statistic for measuring the impact of the $i$th observation on the goodness of fit of a model. The one-step linear approximation for the change in deviance statistic is $\Delta D_i = D - D_{(i)} \approx d_i^2 + \frac{h_{ii}r_i^2}{1 - h_{ii}}$, where $D$ represents the sum of squared deviance residuals for the complete data set and $D_{(i)}$ the sum of squared deviance residuals after the removal of the $i$th observation. We also suggested a simplified form of this statistic.
A cut-off value for the change in deviance statistic is used to detect the unusual observations [25].
The difference of fits (DFFITS) suggested by Belsley et al. [18] is another common influence measure. After deleting the $i$th observation, $\mathrm{DFFITS}_i$ assesses the change in the fit of the model. For the GLM, it compares $\hat\eta_i = x_i^T\hat\beta$, the predicted regressand of the complete data set, with $\hat\eta_{i(i)} = x_i^T\hat\beta_{(i)}$, the predicted regressand after deleting the $i$th case.
By using the SMW theorem, this quantity is retransformed as $\mathrm{DFFITS}_i = r_{ji}\sqrt{h_{ii}/(1 - h_{ii})}$, where $r_{ji} = r_i/\sqrt{1 - h_{ii}}$ is termed the jackknife Pearson residual, and a large value of $|\mathrm{DFFITS}_i|$ shows that the $i$th observation is influential. The corresponding measure with the Liu estimator will be introduced in the next section.
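The four one-step diagnostics of this section reduce to simple functions of $y_i$, $\hat\mu_i$, and $h_{ii}$. The following Python sketch (our own, under the approximations stated above) collects them in one helper.

```python
import numpy as np

def prm_influence(y, mu, h, p):
    """One-step influence diagnostics for the PRM:
    Cook's distance        CD_i    ~ h_ii r_ji^2 / (p (1 - h_ii)),
    change in Pearson chi2 dChi2_i ~ r_i^2 / (1 - h_ii),
    change in deviance     dDev_i  ~ d_i^2 + h_ii r_i^2 / (1 - h_ii),
    DFFITS_i = r_ji sqrt(h_ii / (1 - h_ii)), r_ji the jackknife residual."""
    r = (y - mu) / np.sqrt(mu)                   # Pearson residuals
    r_j = r / np.sqrt(1.0 - h)                   # jackknife residuals
    ylog = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    d = np.sign(y - mu) * np.sqrt(2.0 * (ylog - (y - mu)))  # deviance residuals
    cook = h * r_j**2 / (p * (1.0 - h))
    dchi2 = r**2 / (1.0 - h)
    ddev = d**2 + h * r**2 / (1.0 - h)
    dffits = r_j * np.sqrt(h / (1.0 - h))
    return cook, dchi2, ddev, dffits
```

All three goodness-of-fit-type statistics are nonnegative by construction, so observations are flagged by comparing them against the relevant cut-off values.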
3. Influence Measures in Poisson Liu Regression Model (PLRM)
3.1. Hat Matrix, Leverages, and Residual in PLRM
The hat matrix for the PLRM is defined as $H_d = \hat W^{1/2}X(X^T\hat WX + I_p)^{-1}(X^T\hat WX + dI_p)(X^T\hat WX)^{-1}X^T\hat W^{1/2}$.
The leverages $h_{d,ii}$ are the Liu hat diagonals, which prove helpful in detecting influential cases with some modifications. Here $h_{d,ii} \le h_{ii}$ for $0 < d < 1$, and as $d$ decreases, $h_{d,ii}$ decreases monotonically.
Using the Liu estimator, the Pearson residuals for the PLRM are defined as $r_{d,i} = (y_i - \hat\mu_{d,i})/\sqrt{\hat\mu_{d,i}}$, where $\hat\mu_{d,i} = \exp(x_i^T\hat\beta_{PLRE})$.
The standardized form of the Pearson residuals with multicollinear independent variables is given as $r_{sd,i} = r_{d,i}/\sqrt{1 - h_{d,ii}}$.
3.2. Influence Diagnostics for PLRM
Approximate case deletion formulas based on the SMW theorem [41] are derived for the identification of influential observations.
Theorem 1. After the deletion of the $i$th row from $X$, we write $X_{(i)}^T\hat W_{(i)}X_{(i)} = X^T\hat WX - \hat\mu_i x_i x_i^T$, where $X_{(i)}$ represents the $X$ matrix without the $i$th row. Using the SMW theorem, we approximate $(X_{(i)}^T\hat W_{(i)}X_{(i)})^{-1} = (X^T\hat WX)^{-1} + \frac{\hat\mu_i(X^T\hat WX)^{-1}x_i x_i^T(X^T\hat WX)^{-1}}{1 - h_{ii}}$.
Proof. Letting $A = X^T\hat WX$, the estimators $\hat\beta_{ML}$ and $\hat\beta_{PLRE}$ stated by (6) and (7) become $\hat\beta_{ML} = A^{-1}X^T\hat W\hat z$ and $\hat\beta_{PLRE} = (A + I_p)^{-1}(A + dI_p)\hat\beta_{ML}$. Let $\hat\beta_{ML(i)}$ and $\hat\beta_{PLRE(i)}$ represent the ML estimator and the PLRE of $\beta$ after deleting the $i$th observation, respectively. Thus, we have $\hat\beta_{PLRE(i)} = (A_{(i)} + I_p)^{-1}(A_{(i)} + dI_p)\hat\beta_{ML(i)}$, with $A_{(i)} = X_{(i)}^T\hat W_{(i)}X_{(i)}$. With the help of the SMW theorem, $A_{(i)}^{-1}$ can be expanded as $A_{(i)}^{-1} = A^{-1} + \frac{\hat\mu_i A^{-1}x_i x_i^TA^{-1}}{1 - h_{ii}}$, where $x_i^T$ is the $i$th row vector of the $X$ matrix. Substituting this expansion into the first part of the right-hand side and collecting terms yields the stated approximation. Hence, the theorem is completed.
Following [42] for the PLRE, the Cook's distance is redefined as $CD_{d,i} = \frac{1}{p}(\hat\beta_{PLRE} - \hat\beta_{PLRE(i)})^T X^T\hat WX(\hat\beta_{PLRE} - \hat\beta_{PLRE(i)})$. The $i$th observation is considered influential if the distance between $\hat\beta_{PLRE}$ and $\hat\beta_{PLRE(i)}$ is large. Another version can be expressed through the one-step approximation as $CD_{d,i} \approx \frac{h_{d,ii}}{p(1 - h_{d,ii})}r_{sd,i}^2$. Using the Liu estimator, we define the change in Pearson chi-square as $\Delta\chi_{d,i}^2 \approx \frac{r_{d,i}^2}{1 - h_{d,ii}}$, comparing the sums of squared Liu Pearson residuals for the complete data set and computed without the $i$th observation. Correspondingly, with the Liu estimator, we formulate the change in deviance statistic as $\Delta D_{d,i} \approx d_{d,i}^2 + \frac{h_{d,ii}r_{d,i}^2}{1 - h_{d,ii}}$, comparing the sums of squared Liu deviance residuals with the complete data and computed without the $i$th observation, where the Liu deviance residual is $d_{d,i} = \mathrm{sign}(y_i - \hat\mu_{d,i})\sqrt{2\left[y_i\log(y_i/\hat\mu_{d,i}) - (y_i - \hat\mu_{d,i})\right]}$ and $\mathrm{sign}(\cdot)$ is the sign function.
Following (19), we give the DFFITS for the PLRM as the comparison of $\hat\eta_{d,i} = x_i^T\hat\beta_{PLRE}$, the predicted regressand of the complete data set, with $\hat\eta_{d,i(i)} = x_i^T\hat\beta_{PLRE(i)}$, the predicted regressand after deleting the $i$th case.
Using the SMW theorem, we simplify this expression as $\mathrm{DFFITS}_{d,i} = r_{jd,i}\sqrt{\frac{h_{d,ii}}{1 - h_{d,ii}}}$, where $r_{jd,i} = r_{d,i}/\sqrt{1 - h_{d,ii}}$ is the Pearson jackknife residual with the Liu estimator.
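The Liu hat diagonals $h_{d,ii}$, on which the PLRM diagnostics depend, can be computed as in the following Python sketch (our own helper, assuming the converged IRLS weights are available; the Liu hat matrix is taken in the form given above).

```python
import numpy as np

def liu_hat_diagonals(X, W, d):
    """Diagonals of the Liu hat matrix
       H_d = W^{1/2} X (X'WX + I)^{-1} (X'WX + d I) (X'WX)^{-1} X' W^{1/2},
    where W is the vector of IRLS weights mu_i."""
    A = X.T @ (X * W[:, None])                    # X' W X
    I = np.eye(X.shape[1])
    M = np.linalg.solve(A + I, (A + d * I) @ np.linalg.inv(A))
    Xw = X * np.sqrt(W)[:, None]                  # W^{1/2} X
    Hd = Xw @ M @ Xw.T
    return np.diag(Hd)
```

At $d = 1$, $H_d$ reduces to the ordinary PRM hat matrix, and for $0 < d < 1$ the Liu leverages never exceed the ML leverages, consistent with the monotonicity property noted in Section 3.1.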
4. Simulation Study
In this section, we summarize the results of the PRM and the PLRM influence diagnostics using a Monte Carlo simulation scheme. We follow the simulation scheme used by many researchers; see [43, 44]. The response variable is generated from the Poisson distribution with mean function $\mu_i = \exp(x_i^T\beta)$.
We set up the simulation with $p = 2, 4, 6$ explanatory variables, various sample sizes, and mild to severe levels of collinearity. Moreover, we generated the regressors using the following formula: $x_{ij} = (1 - \rho^2)^{1/2}z_{ij} + \rho z_{i,p+1}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$, where the $z_{ij}$ are independent standard normal pseudo-random numbers and $\rho^2$ is the correlation between any two regressors.
We consider different collinearity levels $\rho^2$, and we assume arbitrary values of the regression coefficients in such a way that $\sum_{j=1}^{p}\beta_j^2 = 1$. We then contaminated the regressors with a few influential observations. All the analyses are performed using the R software with 1000 replications.
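The regressor-generation device and the Poisson response draw can be sketched in Python as follows (the study itself used R; the function name and the example coefficients are ours, and the collinearity device is the standard one in this literature).

```python
import numpy as np

def simulate_poisson_collinear(n, p, rho, beta, rng):
    """Generate collinear regressors via the common device
       x_ij = sqrt(1 - rho^2) z_ij + rho z_{i,p+1},
    with z iid N(0, 1), so every pair of regressors has correlation
    approximately rho^2, then draw y_i ~ Poisson(exp(x_i' beta))."""
    Z = rng.normal(size=(n, p + 1))
    X = np.sqrt(1.0 - rho**2) * Z[:, :p] + rho * Z[:, [p]]
    y = rng.poisson(np.exp(X @ beta))
    return X, y
```

For example, with $\rho = 0.95$ the pairwise correlation of the regressors is roughly $0.95^2 \approx 0.90$, a severe collinearity level.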
4.1. Results and Discussion
The results of detecting the unusual observations with the LE in the presence of mild to severe multicollinearity are provided in Tables 1–6, using the defined optimum $d$. From Tables 1–3 with p = 2, it is clear that the proposed Liu-based measures perform well compared with the corresponding ML-based methods for different sample sizes under multicollinearity. Some of the proposed measures give nearly identical detection percentages and perform significantly better than their ML counterparts; however, their performance is not better than every competing method for all combinations of sample size and collinearity. Comparable effects are observed for the DFFITS-type measures, where the detection percentage of influential observations with the Liu estimator is better than that of the ML version. Furthermore, as the sample size increases, the percentage of influential observations detected by the developed measures increases accordingly. Moreover, from Tables 4–6, we observe that the newly developed diagnostic measures also perform efficiently with p = 4 and p = 6, giving better detection percentages than the ML-based measures, and that varying the number of regressors affects the functioning of each method. The changing patterns of sample size and multicollinearity, together with their effects on the performance of the newly developed measures, are demonstrated explicitly through graphs; see Figures 1–3 for the defined settings. Considering Figures 1–3, with the defined combinations of sample sizes, regressors, and collinearity levels, we clearly observe an increase in the performance of the newly developed measures.
5. Application: English League Football Data
For the illustration of the proposed diagnostic methods, we analyze the English League football data set, which is also available in Table 7. The data comprise observations on one response variable, i.e., the number of won matches, and p = 5 explanatory variables, i.e., the number of yellow cards, the number of red cards, goals won, goals conceded, and the number of points earned. Algamal and Alanaz [11] also used this data set. A goodness-of-fit test found that the response variable is well fitted by the Poisson distribution. The data are multicollinear, as the condition index is CI = 31.274.
From Table 8, it is found that all methods commonly identify the 1st observation as influential. The change in chi-square statistic and the change in deviance statistic with the ML estimator do not detect any observation as influential. Furthermore, the 19th observation was detected as influential by DFFITS without the Liu estimator and by all of the proposed diagnostics.
The effect of deleting the highlighted observations on the estimates of the PRM and the PLRM is presented in Table 9. We found the maximum change in the PRM and PLRM estimates after the removal of the 1st observation, which was detected by all selected and proposed measures. The second observation, identified by DFFITS and all proposed measures, is the 19th. After the deletion of the detected observations, we found the maximum change in the estimated coefficients. After examining these results, it is noted that in the presence of multicollinearity, the PLRM diagnostic measures efficiently detect the influential observations. Furthermore, we incorporate index plots to summarize the efficacy of the proposed measures in Figure 4.
6. Conclusion
This study introduced diagnostic measures for Poisson Liu regression, using a biased estimator to handle influential observations and multicollinearity simultaneously in the PRM. As discussed earlier, multicollinearity affects the performance of the traditional ML estimator in the PRM. Therefore, we adopted the Liu estimator, owing to its efficient statistical properties, to address multicollinearity and influential observations in the PRM. The simulation results support the performance of the new diagnostic measures: the detection percentage of the existing ML-based measures turns out to be the worst as the sample size, the number of regressors, and the level of multicollinearity increase. The results show that the suggested measures are more beneficial for the identification of influential observations in the presence of multicollinearity. Hence, these proposed measures can efficiently guide the user in handling the issue of multicollinearity with the support of a robust estimator.
Data Availability
All data are included in the paper with their links.
Conflicts of Interest
The authors declare there are no conflicts of interest.
Acknowledgments
The authors are thankful for the Taif University researchers supporting project number TURSP-2020/160, Taif University, Taif, Saudi Arabia.