Causal discoveries for high dimensional mixed data
Statistics in Medicine ( IF 2 ) Pub Date : 2022-08-15 , DOI: 10.1002/sim.9544
Zhanrui Cai 1 , Dong Xi 2 , Xuan Zhu 3 , Runze Li 4

Causal relationships are of crucial importance for biological and medical research. Algorithms have been proposed for causal structure learning with graphical visualizations. While much of the literature focuses on biological studies where data often follow the same distribution, for example, the normal distribution for all variables, challenges emerge from epidemiological and clinical studies where data are often mixed with continuous, binary, and ordinal variables. We propose to use a mixed latent Gaussian copula model to estimate the underlying correlation structure via the rank correlation for mixed data. This correlation structure is then incorporated into a popular causal discovery algorithm, the PC algorithm, to identify causal structures. The proposed algorithm, called the latent-PC algorithm, is able to discover the true causal structure consistently under mild conditions in high dimensional settings. From simulation studies, the latent-PC algorithm delivers a competitive performance in terms of a similar or higher true positive rate and a similar or lower false positive rate, compared with other variants of the PC algorithm. In the high dimensional settings where the number of variables is more than the number of observations, the causal graphs identified by the latent-PC algorithm are closer to the true causal structures, compared to other competing algorithms. Further, we demonstrate the utility of the latent-PC algorithm in a real dataset for hepatocellular carcinoma. Causal structures for patient survival are visualized and connected with clinical interpretations in the literature.
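The two-stage idea in the abstract (a rank-based estimate of the latent correlation, then conditional independence tests inside the PC algorithm) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the continuous-continuous case only, where a Gaussian copula gives the latent correlation as sin(π/2 · τ) with τ being Kendall's tau; binary and ordinal variables need different bridge functions, which are not shown. It also assumes the Fisher z test that the standard Gaussian PC algorithm uses. All function names are hypothetical.

```python
import math
import random


def kendall_tau(x, y):
    """Kendall's tau-a by direct pair counting (O(n^2), no tie correction)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return 2.0 * s / (n * (n - 1))


def latent_correlation(x, y):
    """Sine bridge for a continuous pair under a Gaussian copula (assumed case)."""
    return math.sin(math.pi / 2.0 * kendall_tau(x, y))


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y given a single variable Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def fisher_z_independent(r, n, cond_size, z_crit=1.96):
    """Return True if zero (partial) correlation is not rejected at about 5%."""
    r = max(min(r, 0.999999), -0.999999)  # keep the log finite
    z = 0.5 * math.log((1 + r) / (1 - r))
    return math.sqrt(n - cond_size - 3) * abs(z) <= z_crit


if __name__ == "__main__":
    # Hypothetical chain A -> B -> C on the latent scale, observed only
    # through monotone transformations (which leave ranks unchanged).
    random.seed(0)
    n = 300
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [0.8 * ai + 0.6 * random.gauss(0, 1) for ai in a]
    c = [0.8 * bi + 0.6 * random.gauss(0, 1) for bi in b]
    xa = [math.exp(v) for v in a]
    xb = [v ** 3 for v in b]
    xc = c

    r_ab = latent_correlation(xa, xb)
    r_ac = latent_correlation(xa, xc)
    r_bc = latent_correlation(xb, xc)
    r_ac_given_b = partial_correlation(r_ac, r_ab, r_bc)

    print("marginal A-C independent?", fisher_z_independent(r_ac, n, 0))
    print("A-C independent given B? ", fisher_z_independent(r_ac_given_b, n, 1))
```

Because Kendall's tau depends only on ranks, the estimate is unaffected by the monotone transformations applied to the observed variables, which is what lets the correlation be recovered on the latent Gaussian scale before any independence test is run.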

Updated: 2022-08-15