A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices
bioRxiv - Genetics Pub Date : 2020-07-30 , DOI: 10.1101/2020.03.23.004598
James A Watson , Aimee R Taylor , Elizabeth A Ashley , Arjen Dondorp , Caroline O Buckee , Nicholas J White , Chris C Holmes

Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. how antimalarial resistance has spread) and employ statistical methods that characterise parasite population structure. Many of the methods used to characterise structure are unsupervised machine learning algorithms which depend on a genetic distance matrix, notably principal coordinates analysis (PCoA) and hierarchical agglomerative clustering (HAC). PCoA and HAC are sensitive to both the definition of genetic distance and algorithmic specification. Importantly, neither algorithm infers malaria parasite ancestry. As such, PCoA and HAC can inform (e.g. via exploratory data visualisation and hypothesis generation), but not answer comprehensively, key questions about malaria parasite ancestry. We illustrate the sensitivity of PCoA and HAC using 393 Plasmodium falciparum whole genome sequences collected from Cambodia and neighbouring regions (where antimalarial resistance has emerged and spread recently) and we provide tentative guidance for the use and interpretation of PCoA and HAC in malaria parasite genetic epidemiology. This guidance includes a call for fully transparent and reproducible analysis pipelines that feature (i) a clearly outlined scientific question; (ii) a clear justification of analytical methods used to answer the scientific question along with discussion of any inferential limitations; (iii) publicly available genetic distance matrices when downstream analyses depend on them; and (iv) sensitivity analyses. To bridge the inferential disconnect between the output of non-inferential unsupervised learning algorithms and the scientific questions of interest, tailor-made statistical models are needed to infer malaria parasite ancestry. In the absence of such models speculative reasoning should feature only as discussion but not as results.
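
The sensitivity described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the genotypes are synthetic random data, the choice of Hamming and Jaccard distances, the three linkage methods, the cut at three clusters, and the helper names `pcoa`, `hac_labels` and `pair_agreement` are all assumptions made for illustration. It runs PCoA and HAC on the same samples under different distance definitions and algorithmic settings, then reports how often pairs of samples are grouped the same way.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def pcoa(dist, n_axes=2):
    """Classical PCoA: double-centre the squared distances, then eigendecompose."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (dist ** 2) @ centring
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]
    # Non-Euclidean distances can give negative eigenvalues; clip them to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


def hac_labels(dist, method, n_clusters):
    """Agglomerative clustering on a square distance matrix, cut into at most n_clusters."""
    tree = linkage(squareform(dist, checks=False), method=method)
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def pair_agreement(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree (same/different cluster)."""
    same_a = labels_a[:, None] == labels_a[None, :]
    same_b = labels_b[:, None] == labels_b[None, :]
    upper = np.triu_indices(len(labels_a), k=1)
    return float(np.mean(same_a[upper] == same_b[upper]))


# Hypothetical data: 30 samples x 200 biallelic sites, drawn at random.
rng = np.random.default_rng(0)
allele_freqs = rng.uniform(0.1, 0.9, size=200)
genotypes = rng.random((30, 200)) < allele_freqs

# Two possible definitions of "genetic distance" on the same genotypes.
distances = {
    "hamming": squareform(pdist(genotypes, metric="hamming")),
    "jaccard": squareform(pdist(genotypes, metric="jaccard")),
}

coords_by_dist = {}
results = {}
for dist_name, dist in distances.items():
    coords_by_dist[dist_name] = pcoa(dist)
    for method in ("single", "average", "complete"):
        results[(dist_name, method)] = hac_labels(dist, method, n_clusters=3)

# Compare every configuration against one arbitrary reference configuration.
reference = results[("hamming", "average")]
for key, labels in results.items():
    print(key, round(pair_agreement(reference, labels), 3))
```

Agreement values below 1 indicate that the grouping of samples changes with the distance definition or linkage method alone. This is the kind of sensitivity analysis the guidance calls for, and none of these outputs is an inference about ancestry.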

Updated: 2020-07-31