Unsupervised dimensionality reduction for very large datasets: Are we going to the right direction?
Knowledge-Based Systems (IF 8.8) Pub Date: 2020-03-24, DOI: 10.1016/j.knosys.2020.105777
Jadson Jose Monteiro Oliveira, Robson Leonardo Ferreira Cordeiro

Given a set of millions or even billions of complex objects for descriptive data mining, how can the data dimensionality be reduced effectively? It must be done in an unsupervised way. Unsupervised dimensionality reduction is essential for analytical tasks like clustering and outlier detection because it helps to overcome the drawbacks of the “curse of high dimensionality”. The state-of-the-art approach is to preserve the data variance by means of well-known techniques, such as PCA, KPCA and SVD, and of techniques built upon them, such as PUFS. But is this always the best strategy to follow? This paper presents an exploratory study that compares two distinct approaches: (a) standard variance preservation, and (b) a rarely used, Fractal-based alternative, for which we propose a fast and scalable Spark-based algorithm with a novel feature-partitioning scheme that allows it to tackle data of high dimensionality. Both strategies were evaluated by inserting into 11 real-world datasets, with up to 123.5 million elements and 518 attributes, at most 500 additional attributes formed by correlations of many kinds, such as linear, quadratic, logarithmic and exponential, and verifying their ability to remove this redundancy. The results indicate that, at least for large datasets with up to 1,000 attributes, our proposed Fractal-based algorithm is the best option. It removed the redundant attributes accurately and efficiently in nearly all cases, whereas the standard variance-preservation strategy produced considerably worse results, even when applying KPCA, which is designed for non-linear correlations.
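To make the comparison concrete, here is a minimal single-machine NumPy sketch of the Fractal-based idea that the paper's Spark algorithm scales up: estimate the correlation fractal dimension D2 by box counting, then greedily discard attributes whose removal leaves D2 essentially unchanged. This is only an illustrative approximation, not the authors' distributed, feature-partitioned implementation; the function names, the tolerance `tol`, and the synthetic dataset (three independent attributes plus linear, quadratic, logarithmic and exponential redundant copies, echoing the paper's experimental setup) are assumptions made for the example.

```python
import numpy as np

def correlation_fractal_dimension(X, n_scales=5):
    """Estimate the correlation fractal dimension D2 of a point set by
    box counting: S(r) = sum_i C_i^2 over grid cells of side r, and D2
    is the slope of log S(r) versus log r."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    Xn = (X - mins) / span                      # normalise each attribute to [0, 1]
    log_r, log_s = [], []
    for level in range(1, n_scales + 1):
        cells = 2 ** level                      # cells per axis, cell side r = 1 / cells
        grid = np.minimum((Xn * cells).astype(np.int64), cells - 1)
        _, counts = np.unique(grid, axis=0, return_counts=True)
        log_r.append(np.log(1.0 / cells))
        log_s.append(np.log(np.sum(counts.astype(float) ** 2)))
    slope, _ = np.polyfit(log_r, log_s, 1)      # slope of the log-log plot ~ D2
    return slope

def fractal_attribute_selection(X, tol=0.25):
    """Greedy backward elimination: repeatedly drop the attribute whose
    removal changes D2 the least, as long as the change stays below tol."""
    keep = list(range(X.shape[1]))
    d_full = correlation_fractal_dimension(X)
    while len(keep) > 1:
        best_attr, best_delta = None, None
        for a in keep:
            reduced = [c for c in keep if c != a]
            delta = abs(d_full - correlation_fractal_dimension(X[:, reduced]))
            if best_delta is None or delta < best_delta:
                best_attr, best_delta = a, delta
        if best_delta > tol:                    # every remaining attribute is informative
            break
        keep.remove(best_attr)
    return keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.random((20_000, 3))              # three genuinely independent attributes
    redundant = np.column_stack([
        2.0 * base[:, 0] + 1.0,                 # linear correlation
        base[:, 1] ** 2,                        # quadratic
        np.log1p(base[:, 2]),                   # logarithmic
        np.exp(base[:, 0]),                     # exponential
    ])
    X = np.hstack([base, redundant])
    kept = fractal_attribute_selection(X)
    print("attributes kept:", kept)             # ideally three mutually non-redundant columns
```

The sketch only conveys why redundant attributes, whatever the form of the correlation, do not raise the intrinsic (fractal) dimension and can therefore be detected without labels; the paper's contribution is making this kind of computation scale to hundreds of millions of elements on Spark through feature partitioning.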



Updated: 2020-03-24