Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function
Journal of Applied Statistics ( IF 1.2 ) Pub Date : 2021-06-16 , DOI: 10.1080/02664763.2021.1939662
Lili Zhang 1 , Trent Geisler 1 , Herman Ray 2 , Ying Xie 3

Logistic regression is estimated by maximizing a log-likelihood objective function formulated under the assumption that overall accuracy is the quantity to be maximized. This assumption does not hold for imbalanced data: the resulting models tend to be biased towards the majority class (i.e. non-event), which can incur great losses in practice. One strategy for mitigating this bias is to penalize the misclassification costs of observations differently in the log-likelihood function. Existing solutions, however, require either difficult hyperparameter estimation or high computational complexity. We propose a novel penalized log-likelihood function that includes penalty weights as decision variables for observations in the minority class (i.e. event) and learns them from the data along with the model coefficients. In the experiments, the proposed logistic regression model is compared with existing ones on the area under the receiver operating characteristic (ROC) curve across 10 public datasets and 16 simulated datasets, as well as on training time. A detailed analysis is conducted on an imbalanced credit dataset to examine the estimated probability distributions, additional performance measures (i.e. type I and type II errors) and model coefficients. The results demonstrate that both the discrimination ability and the computational efficiency of logistic regression models are improved by using the proposed log-likelihood function as the learning objective.
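The weighting strategy described above can be illustrated with a minimal sketch of logistic regression fitted by gradient ascent on a class-weighted log-likelihood, where minority-class (event) observations are up-weighted. Note this uses a fixed inverse-class-frequency weight as a simple stand-in; the paper's contribution is instead to treat per-observation penalty weights as decision variables learned jointly with the coefficients. All function names and the heuristic weight choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_weighted_logreg(X, y, w_minority=None, lr=0.1, n_iter=2000):
    """Gradient ascent on a weighted log-likelihood
    sum_i w_i * [y_i log p_i + (1 - y_i) log(1 - p_i)],
    where minority-class (y=1) observations get weight w_minority."""
    n, d = X.shape
    if w_minority is None:
        # Heuristic: inverse class-frequency ratio. This is an assumption
        # for illustration; the paper learns the weights from data instead.
        w_minority = (y == 0).sum() / max((y == 1).sum(), 1)
    w = np.where(y == 1, w_minority, 1.0)   # per-observation penalty weights
    Xb = np.hstack([np.ones((n, 1)), X])    # prepend intercept column
    beta = np.zeros(d + 1)
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)
        grad = Xb.T @ (w * (y - p)) / n     # gradient of the weighted log-likelihood
        beta += lr * grad
    return beta

def predict_proba(X, beta):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(Xb @ beta)
```

Relative to the unweighted fit (w_minority=1), the up-weighted objective shifts the estimated event probabilities upward for minority-class observations, which is the bias-mitigation effect the abstract refers to.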




Updated: 2021-06-16