Conversion of adverse data corpus to shrewd output using sampling metrics.
Visual Computing for Industry, Biomedicine, and Art (IF 3.2). Pub Date: 2020-08-11. DOI: 10.1186/s42492-020-00055-9
Shahzad Ashraf, Sehrish Saleem, Tauqeer Ahmed, Zeeshan Aslam, Durr Muhammad

In an imbalanced dataset, at least one class is outnumbered by the others. A machine learning algorithm (classifier) trained on such a dataset predicts the frequently occurring majority class more often than the rarely occurring minority classes. Training with an imbalanced dataset poses challenges for classifiers; however, applying suitable techniques to reduce class imbalance can enhance classifier performance. In this study, we consider an imbalanced dataset from an educational context. First, we examine the shortcomings of classifying an imbalanced dataset. Then, we apply data-level algorithms for class balancing and compare the performance of classifiers. Classifier performance is measured using metrics derived from the confusion matrices: accuracy, precision, recall, and F-measure. The results show that classification with an imbalanced dataset may produce high accuracy but low precision and recall for the minority class. The analysis confirms that both undersampling and oversampling are effective for balancing datasets, with oversampling performing better.
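The two ideas in the abstract, confusion-matrix metrics and data-level balancing, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the balancing algorithms used, so plain random oversampling and undersampling are assumed here, and the function names, the confusion-matrix counts, and the `pass`/`fail` labels are all hypothetical.

```python
import random
from collections import Counter


def minority_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-measure, with the minority class as positive."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return accuracy, precision, recall, f_measure


def random_resample(rows, labels, mode, seed=0):
    """Balance classes by random oversampling ("over") or undersampling ("under").

    Assumption: simple random resampling stands in for the unspecified
    data-level algorithms mentioned in the abstract.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)

    # Group rows by class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    sizes = [len(members) for members in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target:
            # Enough rows: draw without replacement (discards extras when undersampling).
            chosen = rng.sample(members, target)
        else:
            # Too few rows: keep all and duplicate random ones up to the target size.
            chosen = members + rng.choices(members, k=target - len(members))
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical confusion matrix: 950 majority rows, 50 minority rows.
accuracy, precision, recall, f_measure = minority_metrics(tp=5, fp=5, fn=45, tn=945)
print(accuracy, precision, recall, f_measure)

# Hypothetical labels from an educational dataset: 90 "pass" rows, 10 "fail" rows.
rows = list(range(100))
labels = ["pass"] * 90 + ["fail"] * 10
print(Counter(random_resample(rows, labels, "over")[1]))
print(Counter(random_resample(rows, labels, "under")[1]))
```

With the hypothetical counts above, accuracy is 0.95 while minority-class precision is 0.5 and recall is 0.1, which is the pattern the abstract describes. Oversampling keeps every original row and duplicates minority rows, whereas undersampling discards majority rows; the information lost by discarding is one plausible reason oversampling might come out ahead, though the abstract itself gives no explanation.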

Updated: 2020-08-11
