Determining threshold value on information gain feature selection to increase speed and prediction accuracy of random forest, Journal of Big Data



Journal of Big Data (IF 8.1), Pub Date: 2021-06-05, DOI: 10.1186/s40537-021-00472-4
Maria Irmina Prasetiyowati, Nur Ulfa Maulidevi, Kridanto Surendro

Feature selection is a pre-processing technique that removes unnecessary features and speeds up the algorithm. One approach calculates the information gain of each feature in the dataset and then applies a threshold on that value to select features. However, the threshold is usually chosen arbitrarily or set to 0.05. This study therefore proposes determining the threshold from the standard deviation of the information gain values of the features in the dataset. The proposed threshold was tested on 10 original datasets, which were also transformed by FFT and IFFT, and classified using Random Forest. On the transformed datasets, the proposed threshold gave lower accuracy and longer execution time than Correlation-Based Feature Selection (CBF) and the standard 0.05 threshold. Accuracy was also lower when the transformed features were used. On the original datasets, the standard deviation threshold gave better Random Forest classification accuracy after feature selection. Furthermore, using the transformed features without the imaginary parts together with the proposed threshold led to a faster average time than the three methods compared.
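
The idea of taking the threshold from the standard deviation of the information gain values can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say whether the population or sample standard deviation is used, nor whether a feature must meet or exceed the threshold, so the use of `statistics.pstdev`, the `>=` comparison, the helper names, and the toy data below are all assumptions made for illustration. Features are assumed to be discrete.

```python
import math
import statistics
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their weighted entropy within each feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_by_std_threshold(features, labels):
    """Keep features whose information gain is at least the std of all gains."""
    gains = {name: information_gain(col, labels) for name, col in features.items()}
    # Assumption: population standard deviation; the abstract does not specify.
    threshold = statistics.pstdev(list(gains.values()))
    # Assumption: a feature is kept when its gain meets or exceeds the threshold.
    kept = [name for name, gain in gains.items() if gain >= threshold]
    return gains, threshold, kept


# Hypothetical toy data: one informative feature and two uninformative ones.
labels = [0, 0, 1, 1]
features = {
    "perfect": [0, 0, 1, 1],   # determines the label completely
    "noise": [0, 1, 0, 1],     # independent of the label
    "constant": [1, 1, 1, 1],  # carries no information
}
gains, threshold, kept = select_by_std_threshold(features, labels)
print(gains, threshold, kept)
```

In this toy case only the informative feature survives, because the two uninformative features have zero gain and the threshold is positive. The selected columns would then be passed to a Random Forest classifier, which the sketch leaves out.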


Updated: 2021-06-07
