SI(FS)2: Fast simultaneous instance and feature selection for datasets with many features
Pattern Recognition (IF 7.5), Pub Date: 2021-03-01, DOI: 10.1016/j.patcog.2020.107723
Nicolás García-Pedrajas , Juan A. Romero del Castillo , Gonzalo Cerruela-García

Abstract: Data reduction is becoming increasingly relevant due to the enormous amounts of data constantly being produced in many fields of research. Instance selection is one of the most widely used methods for this task. At the same time, most recent pattern recognition problems involve highly complex datasets with a large number of possible explanatory variables. This abundance of variables significantly hinders classification and recognition tasks for many reasons. There are efficiency issues as well, because the speed of many classification algorithms improves greatly when the complexity of the data is reduced. Thus, feature selection is also a widely used method for data reduction and for gaining insight into which features carry useful information. Although most methods address instance and feature selection separately, the two problems are interwoven, and benefits are expected from performing the two tasks jointly. However, few algorithms have been proposed for simultaneously addressing instance and feature selection, and most of those are based on complex heuristics that are very difficult to scale up even to moderately large datasets. This paper proposes a new algorithm for handling many instances and many features simultaneously by performing joint instance and feature selection using a simple heuristic search and several scaling-up mechanisms, which can be successfully applied to datasets with millions of features and instances. In the proposed method, a forward selection search is performed in the feature space jointly with the application of standard instance selection in a constructive subspace built stepwise. Several simplifications are adopted in the search to obtain a scalable method. An extensive comparison using 95 large datasets shows the usefulness of our method and its ability to deal with millions of instances and features simultaneously. The method obtains better classification performance than state-of-the-art approaches while achieving considerable data reduction.
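The abstract describes the core idea as a forward selection search over features, with standard instance selection re-applied in the feature subspace built so far at each step. The sketch below is only an illustration of that general scheme, not the authors' SI(FS)2 implementation: the function names, the simple Wilson-editing-style instance rule, the k-NN wrapper evaluator, and all parameters are assumptions chosen for a minimal runnable example, and none of the paper's scaling-up mechanisms are included.

```python
# Illustrative sketch of joint forward feature selection + instance selection.
# NOT the SI(FS)2 algorithm from the paper; a toy version of the general idea only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score


def edit_instances(X, y, k=3):
    """Toy instance selection: keep instances whose class agrees with the
    prediction of a k-NN classifier trained on the current subspace
    (a Wilson-editing-style rule; assumed here, not taken from the paper)."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    keep = knn.predict(X) == y
    return np.flatnonzero(keep)


def joint_forward_selection(X, y, max_features=10, k=3):
    """Greedy forward search over features; after each feature is added,
    instances are re-selected in the subspace of the features chosen so far."""
    X, y = np.asarray(X), np.asarray(y)
    selected, remaining = [], list(range(X.shape[1]))
    kept_idx = np.arange(len(y))          # start with all instances
    best_score = -np.inf

    while remaining and len(selected) < max_features:
        # evaluate each candidate feature on the currently selected instances
        scores = []
        for f in remaining:
            cols = selected + [f]
            s = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                X[np.ix_(kept_idx, cols)], y[kept_idx],
                                cv=3).mean()
            scores.append((s, f))
        s_best, f_best = max(scores)
        if s_best <= best_score:          # stop when no candidate improves the score
            break
        best_score = s_best
        selected.append(f_best)
        remaining.remove(f_best)
        # re-run instance selection in the enlarged feature subspace
        kept_idx = edit_instances(X[:, selected], y, k=k)

    return selected, kept_idx
```

Calling `joint_forward_selection(X, y)` on a numeric dataset returns the chosen feature indices and the retained instance indices; the actual method in the paper adds the simplifications and scaling mechanisms that make this kind of search feasible for millions of instances and features.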

Updated: 2021-03-01