Reduced multidimensional scaling
Computational Statistics (IF 1.0), Pub Date: 2021-06-05, DOI: 10.1007/s00180-021-01116-0
Emmanuel Paradis

Dimension reduction is a common problem when analysing large data sets. The present paper proposes a method called reduced multidimensional scaling, which performs an initial standard multidimensional scaling on a reduced data set. This method faces the problem of finding a representative reduced sample. An algorithm is presented to perform this selection by alternating between sampling in outlier areas and sampling observations in high-density areas. A space is then constructed from the selected reduced sample by standard multidimensional scaling using pairwise distances. The observations not included in the reduced sample are then projected onto the constructed space using Gower’s formula in order to obtain a final representation of the whole data set. The only requirement is the ability to compute distances among observations. A simulation study showed that the proposed algorithm performs well at detecting outliers. Evaluation of running times suggests that the proposed method could run in a few hours with data sets that would take more than one year to analyse with standard multidimensional scaling. An application is presented with a data set of 9547 DNA sequences of human immunodeficiency viruses.
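The two-stage idea in the abstract (standard multidimensional scaling on a reduced sample, then projection of the remaining observations with Gower's formula) can be illustrated with a minimal Python sketch. This is not the author's implementation: the function names (`pairwise_dist`, `classical_mds`, `gower_project`, `reduced_mds`) are hypothetical, Euclidean distances are assumed purely for the demonstration, and the reduced sample is passed in as a list of indices because the paper's selection algorithm is not detailed in the abstract. The projection step uses the usual statement of Gower's add-a-point result, x = ½ Λ⁻¹ Xᵀ (b − d²), where X and Λ are the coordinates and eigenvalues from the reduced MDS, b holds the squared distances of the reduced observations to their centroid, and d² holds the squared distances from the new observation to the reduced observations.

```python
import numpy as np


def pairwise_dist(A, B):
    """Euclidean distances between rows of A (m x p) and rows of B (n x p)."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(D, k):
    """Classical (Torgerson) MDS on an n x n distance matrix D.

    Returns the n x k coordinates, the k leading eigenvalues, and the
    diagonal of the doubly centred matrix B (squared distances to the centroid).
    Assumes the k leading eigenvalues are strictly positive.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # doubly centred squared distances
    evals, evecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]      # indices of the k largest
    lam = evals[order]
    X = evecs[:, order] * np.sqrt(lam)
    return X, lam, np.diag(B)


def gower_project(d_new, X, lam, b):
    """Project m new observations given their distances d_new (m x n)
    to the n reduced observations: x = 1/2 * Lambda^-1 * X^T (b - d^2)."""
    return 0.5 * ((b - d_new ** 2) @ X) / lam


def reduced_mds(data, reduced_idx, k=2):
    """MDS on data[reduced_idx], then projection of all other rows.

    reduced_idx is supplied by the caller; choosing it well is the
    selection problem the paper addresses and is NOT implemented here.
    """
    reduced_idx = np.asarray(reduced_idx)
    rest_idx = np.setdiff1d(np.arange(len(data)), reduced_idx)
    red = data[reduced_idx]
    X_red, lam, b = classical_mds(pairwise_dist(red, red), k)
    coords = np.empty((len(data), k))
    coords[reduced_idx] = X_red
    coords[rest_idx] = gower_project(
        pairwise_dist(data[rest_idx], red), X_red, lam, b
    )
    return coords
```

Only the n × n distances within the reduced sample and the m × n distances from the remaining observations to it are computed, which is why the approach scales better than standard multidimensional scaling on the full distance matrix. Any distance function could replace `pairwise_dist`, since the abstract states that the ability to compute distances is the only requirement.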




Updated: 2021-06-05