Geometric component analysis and its applications to data analysis
Applied and Computational Harmonic Analysis (IF 2.5), Pub Date: 2021-03-09, DOI: 10.1016/j.acha.2021.02.005
Amit Bermanis, Moshe Salhov, Amir Averbuch

Dimensionality reduction methods are designed to overcome the ‘curse of dimensionality’ phenomenon that makes the analysis of high-dimensional big data difficult. Many of these methods are based on principal component analysis (PCA), which is statistically driven and does not directly address the geometry of the data. Thus, machine learning tasks, such as classification and anomaly detection, may not benefit from a PCA-based methodology.

This work provides a dictionary-based framework for geometrically driven data analysis, for both linear and non-linear (diffusion) geometries, that includes dimensionality reduction, out-of-sample extension and anomaly detection. The paper proposes the Geometric Component Analysis (GCA) methodology for dimensionality reduction of linear and non-linear data. The main algorithm greedily picks multidimensional data points that span linear subspaces of the ambient space containing as much information as possible from the original data. For non-linear data, the greedy approach is applied to the “diffusion kernel” commonly used in diffusion geometry, so the GCA-based diffusion maps can be viewed as a direct application of the greedy algorithm to the kernel matrix constructed in diffusion maps. The algorithm selects data points according to their distances from the subspace spanned by the previously selected points, and it stops when the distances of all remaining data points from that subspace are smaller than a prespecified threshold.
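To make the greedy step concrete, the following Python snippet is a minimal sketch reconstructed from the abstract alone, not the authors' implementation; the function name gca_greedy_dictionary, the Gram-Schmidt basis update and the use of plain Euclidean distances are assumptions.

import numpy as np

def gca_greedy_dictionary(X, threshold):
    # X: (n_samples, n_features) data matrix; threshold: user-defined distortion level.
    # Greedily pick the point farthest from the span of the points selected so far,
    # and stop once every remaining point lies within `threshold` of that subspace.
    n_samples, n_features = X.shape
    dictionary = []                      # indices of the selected landmark points
    Q = np.zeros((n_features, 0))        # orthonormal basis of the spanned subspace
    while len(dictionary) < min(n_samples, n_features):
        residuals = X - (X @ Q) @ Q.T            # components of X outside span(Q)
        dists = np.linalg.norm(residuals, axis=1)
        i = int(np.argmax(dists))
        if dists[i] <= threshold:                # all remaining points are close enough
            break
        dictionary.append(i)
        Q = np.column_stack([Q, residuals[i] / dists[i]])  # Gram-Schmidt extension
    return dictionary, Q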

The extracted geometry of the data is preserved up to a user-defined distortion rate. In addition, the algorithm identifies a subset of landmark data points, known as a dictionary, which underlies the geometry-based dimensionality reduction. The performance of the method is demonstrated and evaluated on both synthetic and real-world data sets, where it achieves good results for unsupervised learning tasks. The proposed algorithm is attractive for its simplicity, low computational complexity and tractability.
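A possible way to use such a dictionary for dimensionality reduction, out-of-sample extension and anomaly flagging, sketched under the same assumptions as above (this is not the paper's procedure), is to express points in the dictionary's orthonormal basis and to treat a large residual distance from that subspace as an anomaly:

# Hypothetical data concentrated near a 3-dimensional subspace of a 20-dimensional space.
rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 20))
X = rng.normal(size=(500, 3)) @ basis + 0.05 * rng.normal(size=(500, 20))
dictionary, Q = gca_greedy_dictionary(X, threshold=1.0)   # Q typically ends up ~3-dimensional here

X_reduced = X @ Q                                    # reduced coordinates of the training set
X_new = rng.normal(size=(10, 20))                    # hypothetical out-of-sample points off the subspace
coords_new = X_new @ Q                               # out-of-sample extension by projection
residual_norms = np.linalg.norm(X_new - coords_new @ Q.T, axis=1)
anomalies = np.where(residual_norms > 1.0)[0]        # flag points far from the learned geometry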



Updated: 2021-03-16