Using simulated microhaplotype genotyping data to evaluate the value of machine learning algorithms for inferring DNA mixture contributor numbers
Forensic Science International: Genetics (IF 3.1), Pub Date: 2024-01-09, DOI: 10.1016/j.fsigen.2024.103008
Haoyu Wang, Qiang Zhu, Yuguo Huang, Yueyan Cao, Yuhan Hu, Yifan Wei, Yuting Wang, Tingyun Hou, Tiantian Shan, Xuan Dai, Xiaokang Zhang, Yufang Wang, Ji Zhang

Inferring the number of contributors (NoC) is a crucial step in interpreting DNA mixtures, as it directly affects the accuracy of the likelihood ratio calculation and the assessment of evidence strength. However, obtaining the correct NoC in complex DNA mixtures remains challenging due to the high degree of allele sharing and dropout. This study aimed to analyze the impact of allele sharing and dropout on NoC inference in complex DNA mixtures when using microhaplotypes (MH). The effectiveness and value of highly polymorphic MH for NoC inference in complex DNA mixtures were evaluated by comparing the performance of three NoC inference methods: the maximum allele count (MAC) method, the maximum likelihood estimation (MLE) method, and the random forest classification (RFC) algorithm. In this study, we selected the 100 most polymorphic MH from the Southern Han Chinese (CHS) population and simulated over 40 million complex DNA mixture profiles with the NoC ranging from 2 to 8. These profiles involve unrelated individuals (RM type) and related pairs of individuals, including parent-offspring pairs (PO type), full-sibling pairs (FS type), and second-degree kinship pairs (SE type). Our results showed how the number of detected alleles in DNA mixture profiles varied with the markers' polymorphism, the involvement of kinship, the NoC, and the dropout settings. Across the different types of DNA mixtures, the MAC and MLE methods performed best in the RM type, followed by the SE, FS, and PO types, while the RFC models performed best in the PO type, followed by the RM, SE, and FS types. The recall of all three methods for NoC inference decreased as the NoC and dropout levels increased. Furthermore, the MLE method performed better at low NoC, whereas the RFC models excelled at high NoC and/or high dropout levels, regardless of whether a priori information about related pairs of individuals in the DNA mixtures was available. However, the RFC models that considered this a priori information and were trained specifically on each type of DNA mixture profile outperformed the RFC_ALL model, which did not consider such information. Finally, we provided recommendations for model building when applying machine learning algorithms to NoC inference.
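
For readers unfamiliar with the MAC method named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It uses the standard rule that a diploid contributor carries at most two alleles per locus, so the minimum NoC is the largest per-locus allele count divided by two, rounded up. The function names `mac_noc` and `simulate_mixture`, the toy allele frequencies, and the independent per-copy dropout model are all illustrative assumptions. They are not taken from the paper and do not use CHS population data.

```python
import math
import random


def mac_noc(profile):
    """Minimum NoC by maximum allele count.

    `profile` maps a locus name to the set of alleles detected there.
    Returns ceil(max alleles at any locus / 2), or 0 if nothing was detected.
    """
    max_alleles = max((len(alleles) for alleles in profile.values()), default=0)
    return math.ceil(max_alleles / 2)


def simulate_mixture(allele_freqs, noc, dropout=0.0, rng=None):
    """Toy simulation of a mixture of `noc` unrelated diploid contributors.

    Assumption (hypothetical model): each allele copy is drawn independently
    from `allele_freqs[locus]` and is lost with probability `dropout`.
    """
    rng = rng or random.Random()
    profile = {}
    for locus, freqs in allele_freqs.items():
        alleles = list(freqs)
        weights = list(freqs.values())
        detected = set()
        # Two allele copies per contributor at each locus.
        for copy in rng.choices(alleles, weights=weights, k=2 * noc):
            if rng.random() >= dropout:
                detected.add(copy)
        profile[locus] = detected
    return profile


# Made-up frequencies: 5 loci, 10 equally frequent alleles each.
TOY_FREQS = {
    f"MH{i}": {f"a{j}": 0.1 for j in range(10)} for i in range(1, 6)
}

# Deterministic example: 5 alleles at MH2 imply at least 3 contributors.
example = {"MH1": {"A", "B", "C"}, "MH2": {"A", "B", "C", "D", "E"}}
print(mac_noc(example))  # prints 3
```

Because allele sharing and dropout can only reduce the number of distinct alleles observed, this estimate is a lower bound that never exceeds the true NoC in the toy model. This is consistent with the abstract's observation that performance degrades as the NoC and dropout levels increase.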




Updated: 2024-01-09