Classification Performance Improvement Using Random Subset Feature Selection Algorithm for Data Mining
Big Data Research ( IF 3.3 ) Pub Date : 2018-04-25 , DOI: 10.1016/j.bdr.2018.02.007
Lakshmipadmaja D , B. Vishnuvardhan

This study focuses on feature subset selection from high-dimensional databases and presents a modification to the existing Random Subset Feature Selection (RSFS) algorithm that changes how feature subsets are randomly selected and improves stability. A standard k-nearest-neighbor (kNN) classifier is used for classification. The RSFS algorithm reduces the dimensionality of a data set by selecting useful novel features, and it is based on the random forest algorithm. The current implementation suffers from poor dimensionality reduction and low stability when the database is very large. This study attempts to improve the existing algorithm's dimensionality reduction and increase its stability. The proposed algorithm was applied to scientific data to test its performance. With 10-fold cross-validation and the modified algorithm, classification accuracy improves. The applications of the improved algorithm are presented and discussed in detail. The data sets are selected from a public repository; they are scientific in nature and mostly used in cancer detection. The results show that the improved algorithm is superior in reducing dimensionality and improving classification accuracy when used with a simple kNN classifier, and the authors recommend it for dimensionality reduction when extracting relevant data from scientific data sets.
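
The general idea described in the abstract (score features by how well random subsets of them classify with kNN under 10-fold cross-validation) can be sketched as follows. This is a simplified illustration, not the authors' modified RSFS algorithm: the function names `knn_predict`, `cv_accuracy`, and `random_subset_relevance`, the subset size, the iteration count, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Plain kNN: majority vote among the k closest training points."""
    # Squared Euclidean distance between every test and training point.
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nearest]  # shape: (n_test, k)
    return np.array([np.bincount(row).argmax() for row in votes])

def cv_accuracy(X, y, k=3, folds=10, seed=0):
    """Accuracy of kNN estimated with `folds`-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    correct = 0
    for part in np.array_split(idx, folds):
        train = np.ones(len(y), dtype=bool)
        train[part] = False  # hold out this fold
        pred = knn_predict(X[train], y[train], X[part], k)
        correct += int((pred == y[part]).sum())
    return correct / len(y)

def random_subset_relevance(X, y, n_iter=200, subset_size=None, seed=0):
    """Assumed simplification of random subset feature scoring.

    Each iteration draws a random feature subset, measures kNN accuracy
    on it, and credits every feature in the subset with the difference
    between that accuracy and the running mean accuracy.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    size = subset_size or max(2, int(np.sqrt(n_features)))
    relevance = np.zeros(n_features)
    mean_acc = 0.0
    for i in range(1, n_iter + 1):
        subset = rng.choice(n_features, size=size, replace=False)
        acc = cv_accuracy(X[:, subset], y, seed=seed)
        mean_acc += (acc - mean_acc) / i  # running mean of accuracies
        relevance[subset] += acc - mean_acc
    return relevance

# Synthetic stand-in data (hypothetical): 100 samples, 20 features,
# of which only features 0 and 1 carry class information.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=100)
X = rng.normal(size=(100, 20))
X[:, 0] += 4.0 * y
X[:, 1] += 4.0 * y

relevance = random_subset_relevance(X, y)
selected = np.argsort(relevance)[::-1][:2]  # two highest-scoring features
print("selected features:", sorted(selected.tolist()))
print("accuracy, all features:", cv_accuracy(X, y))
print("accuracy, selected only:", cv_accuracy(X[:, selected], y))
```

Keeping the cross-validation folds fixed across iterations (same `seed`) means that differences in accuracy between subsets come from the features rather than from the fold assignment, which is one simple way to reduce score variance in a sketch like this.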




Updated: 2018-04-25