Towards Improved Classification Accuracy on Highly Imbalanced Text Dataset Using Deep Neural Language Models
Applied Sciences (IF 2.5), Pub Date: 2021-01-19, DOI: 10.3390/app11020869
Sarang Shaikh , Sher Muhammad Daudpota , Ali Shariq Imran , Zenun Kastrati

Data imbalance is a frequently occurring problem in classification tasks, arising when the number of samples in one category far exceeds the number in others. Quite often, the minority-class data represents the concepts of greatest interest yet is challenging to obtain in real-life scenarios and applications. Consider a bank-loan customer dataset: the majority of instances belong to the non-defaulter class and only a small number of customers are labeled as defaulters, yet in such highly imbalanced datasets, classification accuracy on the defaulter label matters more than on the non-defaulter label. A lack of sufficient data samples across all class labels results in data imbalance, causing poor classification performance when training the model. Synthetic data generation and oversampling techniques such as SMOTE and ADASYN can address this issue for statistical data, yet such methods suffer from overfitting and substantial noise. While GAN-based techniques have proved useful for generating synthetic numerical and image data, the effectiveness of approaches proposed for textual data, which must retain grammatical structure, context, and semantic information, has yet to be evaluated. In this paper, we address this issue by assessing text sequence generation algorithms coupled with grammatical validation on domain-specific, highly imbalanced datasets for text classification. We exploit the recently proposed GPT-2 and LSTM-based text generation models to introduce balance into highly imbalanced text datasets. The experiments presented in this paper on three highly imbalanced datasets from different domains show that the performance of the same deep neural network models improves by up to 17% when the datasets are balanced using generated text.
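The balancing idea described above can be sketched as a simple oversampling loop: for each minority class, append generated samples until its count matches the majority class. The sketch below is illustrative only; `generate_fn` is a hypothetical stand-in for a real generator (e.g., a GPT-2 or LSTM language model fine-tuned on minority-class text, with grammatical validation applied to its outputs), which the paper's actual pipeline would supply.

```python
from collections import Counter

def balance_with_generation(texts, labels, generate_fn):
    """Append generated samples so every class reaches the
    majority-class count. `generate_fn(label)` is a placeholder
    for a text generator such as a fine-tuned GPT-2 or LSTM model."""
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    out_texts, out_labels = list(texts), list(labels)
    for label, n in counts.items():
        for _ in range(target - n):  # fill the deficit per class
            out_texts.append(generate_fn(label))
            out_labels.append(label)
    return out_texts, out_labels

# Toy imbalanced dataset: 4 non-defaulters vs. 1 defaulter.
texts = ["paid on time"] * 4 + ["missed payments"]
labels = ["non-defaulter"] * 4 + ["defaulter"]

# Stand-in generator; a real system would sample from a language model.
fake_generate = lambda label: f"synthetic {label} sample"

bal_texts, bal_labels = balance_with_generation(texts, labels, fake_generate)
print(Counter(bal_labels))  # both classes now have 4 samples
```

In practice the quality of the generated text, not just the class counts, determines whether the downstream classifier improves, which is why the paper pairs generation with grammatical validation.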
