Entropy predicts sensitivity of pseudorandom seeds
Genome Research (IF 7), Pub Date: 2023-07-01, DOI: 10.1101/gr.277645.123
Benjamin Dominik Maier, Kristoffer Sahlin

Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the best-known and most widely used seeds, their sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to retain high sensitivity even at high indel rates. However, that study did not explain why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. The seed randomness–sensitivity relationship we discovered explains why some seeds perform better than others, and it provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity relative to other strobemers. We show that the three new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers in minimap2 and observe 30% faster alignment and 0.2% higher accuracy than with k-mers when mapping reads at high error rates. For ANI estimation, we find that higher-entropy seeds yield a higher rank correlation between estimated and true ANI.
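
The contrast between contiguous k-mers and pseudorandomly linked seeds can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the CRC32 hash, the window parameters `w_min` and `w_max`, and the rule of picking the second strobe by the smallest hash of the concatenation are all assumptions chosen for illustration. The new constructs named in the abstract (mixedstrobes, altstrobes, multistrobes) are not modeled here.

```python
import zlib


def kmers(seq, k):
    """Return all contiguous length-k substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def stable_hash(s):
    """Deterministic stand-in hash (CRC32); an assumption, chosen only for illustration."""
    return zlib.crc32(s.encode("ascii"))


def randstrobes2(seq, k, w_min, w_max):
    """Hypothetical order-2 randstrobe-like seeds.

    For each start i, the first strobe is seq[i:i+k]. The second strobe starts
    at the offset in [i + w_min, i + w_max] whose concatenation with the first
    strobe has the smallest hash. Returns (i, j, seed) tuples.
    """
    seeds = []
    last_start = len(seq) - k
    for i in range(last_start + 1):
        lo, hi = i + w_min, min(i + w_max, last_start)
        if lo > hi:
            break  # no room left for a second strobe
        first = seq[i:i + k]
        # Pseudorandom choice: the candidate minimizing the hash of the pair.
        j = min(range(lo, hi + 1),
                key=lambda p: stable_hash(first + seq[p:p + k]))
        seeds.append((i, j, first + seq[j:j + k]))
    return seeds


seq = "ACGTTGCAAGCTTAGGCTA"
contiguous = kmers(seq, 6)                          # 6-mers
linked = randstrobes2(seq, k=3, w_min=3, w_max=6)   # two linked 3-mers
print(len(contiguous), len(linked))
```

Both seed types here cover six nucleotides, but the gap between the two strobes varies from position to position. In principle, a linked seed of this kind can still match across a small indel that falls between its strobes, whereas a contiguous k-mer spanning the same indel cannot.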

Updated: 2023-07-01