Numero: a statistical framework to define multivariable subgroups in complex population-based datasets
International Journal of Epidemiology (IF 7.7), Pub Date: 2018-06-26, DOI: 10.1093/ije/dyy113
Song Gao 1 , Stefan Mutter 1 , Aaron Casey 1 , Ville-Petteri Mäkinen 1, 2, 3

Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.
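The first two stages described in the abstract (a self-organizing map followed by permutation analysis) can be illustrated with a minimal sketch. This is not the Numero R library or its API: the function names (`train_som`, `assign_nodes`, `node_means_spread`, `permutation_p`), the synthetic data, the map size and the choice of test statistic (variance of node-wise means) are all assumptions made for illustration. The final expert-driven subgrouping step is manual and is not modelled here.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=20, seed=0):
    """Train a small self-organizing map with online updates (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Grid coordinates of each map node, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise node prototypes from randomly chosen samples.
    protos = data[rng.choice(len(data), size=rows * cols, replace=False)].copy()
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            frac = step / n_steps
            lr = 0.5 * (1.0 - frac)  # learning rate decays linearly
            sigma = max(0.5, (max(rows, cols) / 2) * (1.0 - frac))  # neighbourhood shrinks
            # Best-matching unit: the prototype closest to the current sample.
            bmu = np.argmin(((protos - data[i]) ** 2).sum(axis=1))
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))
            # Pull the BMU and its grid neighbours towards the sample.
            protos += lr * h[:, None] * (data[i] - protos)
            step += 1
    return protos

def assign_nodes(data, protos):
    """Return the index of the best-matching node for every sample."""
    d = ((data[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def node_means_spread(values, nodes, n_nodes):
    """Assumed test statistic: variance of the node-wise means of a variable."""
    sums = np.bincount(nodes, weights=values, minlength=n_nodes)
    counts = np.bincount(nodes, minlength=n_nodes)
    means = sums[counts > 0] / counts[counts > 0]
    return means.var()

def permutation_p(values, nodes, n_nodes, n_perm=500, seed=0):
    """Permutation P-value: shuffle the variable across samples, keep the map fixed."""
    rng = np.random.default_rng(seed)
    observed = node_means_spread(values, nodes, n_nodes)
    null = np.array([
        node_means_spread(rng.permutation(values), nodes, n_nodes)
        for _ in range(n_perm)
    ])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

# Synthetic data without an obvious cluster structure (hypothetical example).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))                    # training variables
outcome = X[:, 0] + 0.1 * rng.normal(size=300)   # variable related to the map inputs
unrelated = rng.normal(size=300)                 # variable independent of the map inputs

protos = train_som(X)
nodes = assign_nodes(X, protos)
p_outcome = permutation_p(outcome, nodes, len(protos))
p_unrelated = permutation_p(unrelated, nodes, len(protos))
print("P (related variable):  ", p_outcome)
print("P (unrelated variable):", p_unrelated)
```

In this sketch, a small P-value suggests that a variable shows more regional variation across the map than expected by chance. Under the framework described above, an analyst would then inspect the map colourings and draw the subgroup boundaries by hand.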

Updated: 2018-06-28