A comparative user study of visualization techniques for cluster analysis of multidimensional data sets
Information Visualization (IF 1.8), Pub Date: 2020-07-04, DOI: 10.1177/1473871620922166
Elio Ventocilla, Maria Riveiro

This article presents an empirical user study that compares eight multidimensional projection techniques for supporting the estimation of the number of clusters, k, embedded in six multidimensional data sets. The selection of the techniques was based on their intended design, or use, for visually encoding data structures, that is, neighborhood relations between data points or groups of data points in a data set. Concretely, we study: the difference between the estimates of k as given by participants when using different multidimensional projections; the accuracy of user estimations with respect to the number of labels in the data sets; the perceived usability of each multidimensional projection; whether user estimates disagree with k values given by a set of cluster quality measures; and whether there is a difference between experienced and novice users in terms of estimates and perceived usability. The results show that: dendrograms (from Ward’s hierarchical clustering) are likely to lead to estimates of k that are different from those given with other multidimensional projections, while Star Coordinates and Radial Visualizations are likely to lead to similar estimates; t-Stochastic Neighbor Embedding is likely to lead to estimates which are closer to the number of labels in a data set; cluster quality measures are likely to produce estimates which are different from those given by users using Ward and t-Stochastic Neighbor Embedding; U-Matrices and reachability plots will likely have a low perceived usability; and there is no statistically significant difference between the answers of experienced and novice users. Moreover, as data dimensionality increases, cluster quality measures are likely to produce estimates which are different from those perceived by users using any of the assessed multidimensional projections. It is also apparent that the inherent complexity of a data set, as well as the capability of each visual technique to disclose such complexity, has an influence on the perceived usability.
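To make concrete what a k value "given by a cluster quality measure" can look like, the following is a minimal sketch and not the study's actual procedure. It assumes the silhouette coefficient as the quality measure (the abstract does not name the measures used), a naive Ward agglomerative clustering written out by hand, and a small synthetic data set with three well-separated groups. The names `ward_labels`, `silhouette`, `points`, and `best_k` are illustrative only.

```python
import math
import random


def ward_labels(points, k):
    """Naive Ward agglomerative clustering: merge until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(members):
        return [sum(points[i][d] for i in members) / len(members) for d in range(dim)]

    def merge_cost(a, b):
        # Ward criterion: increase in within-cluster sum of squares after merging.
        d2 = sum((x - y) ** 2 for x, y in zip(centroid(a), centroid(b)))
        return len(a) * len(b) / (len(a) + len(b)) * d2

    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (requires at least 2 clusters)."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, []).append(i)
    total = 0.0
    for i, label in enumerate(labels):
        own = groups[label]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in groups.items()
            if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Synthetic stand-in data (an assumption): three 4-dimensional Gaussian blobs.
rng = random.Random(0)
centers = [(0, 0, 0, 0), (10, 10, 10, 10), (-10, 10, -10, 10)]
points = [[c + rng.gauss(0, 1) for c in center] for center in centers for _ in range(12)]

# Score each candidate k and keep the one with the highest mean silhouette.
scores = {k: silhouette(points, ward_labels(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On data this cleanly separated, the measure and a human reading a dendrogram would be expected to agree on k; the study's finding is that such agreement becomes less likely as dimensionality increases.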


