A pitfall for machine learning methods aiming to predict across cell types
Genome Biology (IF 12.3), Pub Date: 2020-11-19, DOI: 10.1186/s13059-020-02177-y
Jacob Schreiber 1, Ritambhara Singh 2,3, Jeffrey Bilmes 1,4, William Stafford Noble 1,2

Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.
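The memorization effect described in the abstract can be illustrated with a small synthetic sketch. It is not the authors' code or data. It assumes a toy model in which each locus has a shared effect plus weaker cell-type-specific noise, and the names `average_activity_baseline`, `locus_effect`, and `cell_specific` are hypothetical. The baseline simply averages each locus over the training cell types and is then scored on a held-out cell type.

```python
import numpy as np


def average_activity_baseline(train_activity):
    """Per-locus mean over training cell types (rows: cell types, columns: loci)."""
    return train_activity.mean(axis=0)


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic data: an assumption for illustration only, not real measurements.
rng = np.random.default_rng(0)
n_cell_types, n_loci = 10, 2000
locus_effect = rng.normal(0.0, 1.0, size=n_loci)                    # shared by all cell types
cell_specific = rng.normal(0.0, 0.3, size=(n_cell_types, n_loci))   # differs per cell type
activity = locus_effect + cell_specific

# Hold out the last cell type; the loci are the same in train and test.
train, held_out = activity[:-1], activity[-1]

baseline = average_activity_baseline(train)

# Score across loci on the held-out cell type.
baseline_r = pearson(baseline, held_out)

# How much of the held-out cell type's specific signal does the baseline capture?
specific_r = pearson(baseline, cell_specific[-1])

print(f"baseline vs. held-out activity:      r = {baseline_r:.3f}")
print(f"baseline vs. cell-type-specific part: r = {specific_r:.3f}")
```

Under these assumptions the baseline correlates strongly with the held-out cell type even though it knows nothing about that cell type, while its correlation with the cell-type-specific component stays near zero. One possible diagnostic that follows is to report this baseline next to any learned model evaluated on the same loci; a model that does not clearly beat it may only be reproducing per-locus averages.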

Updated: 2020-11-19