Current cancer driver variant predictors learn to recognize driver genes instead of functional variants
BMC Biology (IF 4.4), Pub Date: 2021-01-13, DOI: 10.1186/s12915-020-00930-0
Daniele Raimondi 1, Antoine Passemiers 1, Piero Fariselli 2, Yves Moreau 1

Identifying variants that drive tumor progression (driver variants) and distinguishing them from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and for precision oncology. Various bioinformatics methods have attempted to solve this complex task. In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias that prevents machine learning (ML) methods from actually learning variant-level functional effects, despite their excellent performance. This effect arises because, in these data sets, the driver variants map to a few driver genes while the passenger variants spread across thousands of genes, so that merely learning to recognize driver genes provides almost perfect predictions. To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. On this data set, the tested predictors experience a significant drop in performance, which should not be read as poorer modeling but rather as a correction of unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that this task is indeed still open.
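
The construction bias described in the abstract can be illustrated with a toy sketch. This is not the authors' code or data: the gene names, variant identifiers, labels, and the helper functions `gene_only_predictor`, `accuracy`, and `genes_with_both_labels` are all hypothetical, and the score is computed in-sample purely for illustration. The sketch shows a predictor that looks only at the gene name, and how its score changes once the evaluation is restricted to genes that contain both driver and passenger variants.

```python
from collections import defaultdict

# Toy, hand-made records: (gene, variant_id, label), label 1 = driver, 0 = passenger.
# Everything here is illustrative and NOT taken from the paper's data sets.
variants = [
    ("GENE_A", "v1", 1), ("GENE_A", "v2", 1), ("GENE_A", "v3", 0), ("GENE_A", "v4", 0),
    ("GENE_B", "v5", 1), ("GENE_B", "v6", 1), ("GENE_B", "v7", 1),
    ("GENE_C", "v8", 0), ("GENE_D", "v9", 0), ("GENE_E", "v10", 0), ("GENE_F", "v11", 0),
]


def gene_only_predictor(train):
    """Build a predictor that ignores the variant and uses only the gene name."""
    counts = defaultdict(lambda: [0, 0])  # gene -> [n_passenger, n_driver]
    for gene, _, label in train:
        counts[gene][label] += 1

    def predict(gene):
        n_passenger, n_driver = counts.get(gene, (0, 0))
        # Majority vote per gene; ties and unseen genes default to passenger.
        return 1 if n_driver > n_passenger else 0

    return predict


def accuracy(predict, data):
    """Fraction of variants whose label matches the gene-level prediction."""
    return sum(predict(gene) == label for gene, _, label in data) / len(data)


def genes_with_both_labels(data):
    """Keep only variants from genes that have both driver and passenger variants."""
    labels = defaultdict(set)
    for gene, _, label in data:
        labels[gene].add(label)
    return [(g, v, y) for g, v, y in data if labels[g] == {0, 1}]


predict = gene_only_predictor(variants)
balanced = genes_with_both_labels(variants)
print("all genes:", accuracy(predict, variants))
print("genes with both labels:", accuracy(predict, balanced))
```

In this toy setting the gene-only predictor scores well on the full set because most genes carry a single label, and its score falls once only mixed-label genes remain. That is the kind of gap the abstract attributes to gene-level shortcuts rather than variant-level modeling.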

Updated: 2021-01-13