Weighted k‐nearest neighbor based data complexity metrics for imbalanced datasets, Statistical Analysis and Data Mining



Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2020-06-02, DOI: 10.1002/sam.11463
Deepika Singh, Anjana Gosain, Anju Saha


The empirical behavior of a classifier depends strongly on the characteristics of the underlying imbalanced dataset; an analysis of intrinsic data complexity is therefore important for choosing classifiers suited to a particular problem. Data complexity metrics (CMs), a fairly recent proposal, identify dataset features that make the classification task difficult and relate those features to classifier accuracy. In this paper, we introduce two CMs for imbalanced datasets, which help explain the factors responsible for the deterioration in classifier performance. These metrics are based on the weighted k‐nearest neighbors approach. The experiments are performed in MATLAB using 48 simulated datasets and 22 real‐world datasets, with neighborhood sizes k = 3, 5, 7, 9, and 11. The results illustrate the usefulness of the proposed metrics.
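The abstract does not define the two proposed metrics, so the following is only an illustrative sketch of what a weighted k‐nearest‐neighbor complexity score for an imbalanced dataset could look like, not the authors' method. The function name `weighted_knn_overlap`, the inverse‐distance weighting, and the choice to average over minority samples are all assumptions made for this example. For each minority sample, the score is the distance‐weighted share of its k nearest neighbors that belong to another class; values near 0 suggest a well‐separated minority class, values near 1 suggest heavy overlap.

```python
import numpy as np


def weighted_knn_overlap(X, y, minority_label, k=5, eps=1e-12):
    """Hypothetical complexity score (an assumption, not the paper's metric).

    Returns the mean, over minority samples, of the inverse-distance-weighted
    fraction of the k nearest neighbors that are not in the minority class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbor

    scores = []
    for i in np.flatnonzero(y == minority_label):
        nn = np.argsort(dist[i])[:k]           # indices of the k nearest neighbors
        w = 1.0 / (dist[i, nn] + eps)          # closer neighbors weigh more
        other = (y[nn] != minority_label).astype(float)
        scores.append((w * other).sum() / w.sum())
    return float(np.mean(scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy imbalanced data: 60 majority points, 12 minority points.
    X_major = rng.normal(loc=0.0, scale=1.0, size=(60, 2))
    X_minor = rng.normal(loc=1.5, scale=1.0, size=(12, 2))
    X = np.vstack([X_major, X_minor])
    y = np.array([0] * 60 + [1] * 12)

    # Neighborhood sizes taken from the abstract.
    for k in (3, 5, 7, 9, 11):
        print(k, weighted_knn_overlap(X, y, minority_label=1, k=k))
```

Sweeping k as above mirrors the neighborhood sizes reported in the abstract; how the score varies with k on a given dataset depends on how far the minority class extends into majority regions.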


Updated: 2020-06-02
