Evidential instance selection for K-nearest neighbor classification of big data
International Journal of Approximate Reasoning ( IF 3.2 ) Pub Date : 2021-08-31 , DOI: 10.1016/j.ijar.2021.08.006
Chaoyu Gong 1 , Zhi-gang Su 1 , Pei-hong Wang 1 , Qian Wang 2 , Yang You 3
Many instance selection algorithms have been introduced to reduce the high storage requirements and computational complexity of K-nearest neighbor (K-NN) classification rules. However, in many studies the information provided by the neighbors of an instance is still not fully utilized. This information usually takes the form of a quantitative metric for deciding whether an instance should be selected, so many instances may receive the same score, which makes the selection ambiguous. In addition, existing metrics are simply summed without deeper fusion, and the resulting information loss has further negative effects. To address these issues, we propose a new instance selection algorithm for K-NN rules within the evidence theory framework, called evidential instance selection (EIS). The basic idea is that all neighbors of every instance first provide distinct items of evidence regarding the estimated value of that instance's label (called the estimation label). After fusing the items of evidence and computing the conflict among them, instances with higher conflict are considered more likely to lie near the class boundaries. Finally, the selection of boundary instances is formalized as an optimization problem whose objective function considers both the reduction rate and the classification accuracy. For big data sets, EIS is extended to a distributed and parallel version called EIS-AS, which uses Apache Spark to alleviate the computational bottleneck. We tested EIS and EIS-AS on 30 small data sets and six big data sets, respectively, the latter containing up to 11 million instances. The experimental results show that EIS performs well at simplifying the raw training data and that EIS-AS scales appropriately to big data sets.
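The core mechanism described above can be sketched in code. The following is a minimal illustration, not the paper's exact formulation: each neighbor contributes a Denœux-style simple mass function that supports its own class with a weight decaying in distance, the masses are combined with the unnormalized conjunctive rule, and the mass left on the empty set serves as the conflict score used to flag boundary instances. The mass construction and the parameters `alpha` and `gamma` are illustrative assumptions.

```python
import numpy as np

def neighbor_evidence(label, dist, k, alpha=0.95, gamma=1.0):
    """One neighbor's simple mass function over k classes plus Theta.

    Indices 0..k-1 are the singleton classes; index k is the frame of
    discernment Theta. Support for the neighbor's class decays with
    distance; the remainder expresses ignorance (mass on Theta).
    """
    m = np.zeros(k + 1)
    m[label] = alpha * np.exp(-gamma * dist ** 2)
    m[k] = 1.0 - m[label]
    return m

def conjunctive(m1, m2, k):
    """Unnormalized conjunctive combination of two mass functions.

    Focal elements: singletons 0..k-1, Theta at index k, and the empty
    set at index k+1, which accumulates the conflict between the two
    pieces of evidence.
    """
    out = np.zeros(k + 2)
    for i in range(k + 2):
        for j in range(k + 2):
            p = m1[i] * m2[j]
            if p == 0.0:
                continue
            if i == k + 1 or j == k + 1:
                out[k + 1] += p      # anything with empty stays empty
            elif i == k:
                out[j] += p          # Theta ∩ B = B
            elif j == k:
                out[i] += p          # A ∩ Theta = A
            elif i == j:
                out[i] += p          # same singleton class
            else:
                out[k + 1] += p      # disjoint singletons -> conflict
    return out

def conflict_score(neighbor_labels, neighbor_dists, k):
    """Fuse all neighbor evidence; return the mass left on the empty set.

    Higher conflict suggests the instance lies near a class boundary and
    is therefore a candidate for selection.
    """
    masses = [np.append(neighbor_evidence(l, d, k), 0.0)
              for l, d in zip(neighbor_labels, neighbor_dists)]
    fused = masses[0]
    for m in masses[1:]:
        fused = conjunctive(fused, m, k)
    return fused[k + 1]
```

On this sketch, an instance whose neighbors all agree on one class yields zero conflict, while mixed neighbor labels at comparable distances yield high conflict, matching the intuition that boundary instances receive conflicting evidence.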




Updated: 2021-09-03