


Clustering longitudinal life-course sequences using mixtures of exponential-distance models
The Journal of the Royal Statistical Society, Series A (Statistics in Society) (IF 1.5), Pub Date: 2021-07-08, DOI: 10.1111/rssa.12712
Keefe Murphy (1), T. Brendan Murphy (2, 3), Raffaella Piccarreta (4), I. Claire Gormley (2, 3)


Sequence analysis is an increasingly popular approach for analysing life courses represented by ordered collections of activities experienced by subjects over time. Here, we analyse a survey data set containing information on the career trajectories of a cohort of Northern Irish youths tracked between the ages of 16 and 22. We propose a novel, model-based clustering approach suited to the analysis of such data from a holistic perspective, with the aims of estimating the number of typical career trajectories, identifying the relevant features of these patterns, and assessing the extent to which such patterns are shaped by background characteristics. Several criteria exist for measuring pairwise dissimilarities among categorical sequences. Typically, dissimilarity matrices are employed as input to heuristic clustering algorithms. The family of methods we develop instead clusters sequences directly using mixtures of exponential-distance models. Basing the models on weighted variants of the Hamming distance metric permits closed-form expressions for parameter estimation. Simultaneously allowing the component membership probabilities to depend on fixed covariates and accommodating sampling weights in the clustering process yields new insights on the Northern Irish data. In particular, we find that school examination performance is the single most important predictor of cluster membership.
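As a rough illustration of why a Hamming-based exponential-distance model is tractable, the following minimal sketch evaluates a single component with density proportional to exp(-lambda * d_H(s, theta)). It is not the authors' implementation. It assumes the plain, unweighted Hamming distance and one precision parameter; the function names, the three state labels, and the value of lambda are hypothetical. Under these assumptions the normalising constant has the closed form (1 + (v - 1) * exp(-lambda))^T for v states and sequences of length T, which the sketch checks by brute-force enumeration.

```python
from itertools import product
from math import exp, log


def hamming(seq_a, seq_b):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def log_density(seq, central, lam, n_states):
    """Log-probability of `seq` under exp(-lam * d_H(seq, central)) / Z.

    Assumes the unweighted Hamming distance, for which
    Z = (1 + (n_states - 1) * exp(-lam)) ** len(central).
    """
    log_z = len(central) * log(1.0 + (n_states - 1) * exp(-lam))
    return -lam * hamming(seq, central) - log_z


# Hypothetical toy setting: three states, sequences of length four.
states = ("S", "E", "U")
central = ("S", "S", "E", "E")
lam = 1.2

# Summing the density over every possible sequence should give 1,
# confirming the closed-form normalising constant.
total = sum(
    exp(log_density(seq, central, lam, len(states)))
    for seq in product(states, repeat=len(central))
)
```

Because the normalising constant depends only on lambda, the number of states, and the sequence length, it never requires enumerating the sequence space in practice; the enumeration here serves only as a check on a toy example.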


Updated: 2021-07-08
