The minimum ratio of preserving the dataset similarity in resampling: (1 − 1/e)
International Journal of Information Technology, Pub Date: 2019-05-17, DOI: 10.1007/s41870-019-00316-8
Faruk Bulut

Pattern recognition, data mining, and machine learning all work with a predefined dataset to build a hypothesis for an artificial decision support system. A dataset may occasionally be damaged for various reasons. It may be subdivided for cross-validation to test the performance of an expert system. Some samples may be deleted because they have lost their importance, and noisy or outlying data may need to be removed because they distort the overall layout. In such cases, it is important to know what percentage of the samples in a set must remain original in order to avoid corruption and preserve the overall originality. The ratio of missing, deleted, and removed samples is therefore crucial for maintaining the integrity of the whole dataset. This study proposes a theoretical approach in which the integrity and originality of a dataset can be preserved at a certain ratio derived from the non-selection probability. The minimum ratio of original samples that must remain is (1 − 1/e), approximately 63.21%, where e is the base of the natural logarithm. In other words, at most a fraction 1/e (approximately 36.79%) of the data may be removed from the set while preserving its originality. The remaining data points in the set are used for resampling. A variety of parametric and nonparametric statistical criteria and tests, such as Kolmogorov–Smirnov, t-tests, Kruskal–Wallis ANOVA, and Ansari–Bradley, have been used to verify the proposed theory. In the experiments, a synthetic dataset was damaged many times and compared with its original form in order to observe whether its originality and homogeneity changed. The experiments indicate that (1 − 1/e) is the fundamental lower bound for the authenticity and actuality of a dataset.
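The abstract does not spell out where the constant comes from. One plausible reading, offered here as an assumption rather than as the author's derivation, is the standard resampling-with-replacement argument: when n items are drawn with replacement from a set of n samples, a given sample is never selected with probability (1 − 1/n)^n, which tends to 1/e as n grows, so about (1 − 1/e) of the original samples appear in the resample. The minimal Python sketch below checks this numerically; the function names are hypothetical and the code is not taken from the paper.

```python
import math
import random


def non_selection_probability(n: int) -> float:
    """Probability that one given sample is never drawn in n draws
    with replacement from a set of n samples (assumed setting)."""
    return (1.0 - 1.0 / n) ** n


def unique_fraction(n: int, seed: int = 0) -> float:
    """Fraction of distinct original samples that appear in one
    simulated resample of size n drawn with replacement."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


# Limiting values quoted in the abstract.
LIMIT_KEPT = 1.0 - 1.0 / math.e   # about 0.6321
LIMIT_REMOVED = 1.0 / math.e      # about 0.3679

if __name__ == "__main__":
    for n in (10, 100, 10_000):
        analytic = 1.0 - non_selection_probability(n)
        simulated = unique_fraction(n)
        print(f"n={n}: analytic={analytic:.4f}, simulated={simulated:.4f}")
```

For small n the analytic fraction sits slightly above the limit and approaches it as n increases. This sketch only illustrates the constant; it does not reproduce the statistical tests reported in the study.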

Updated: 2019-05-17
