Genes sharing the protein family domain decrease the performance of classification with RNA-seq genomic signatures.
Biology Direct ( IF 5.7 ) Pub Date : 2018-02-21 , DOI: 10.1186/s13062-018-0205-x
Anna Leśniewska 1 , Joanna Zyprych-Walczak 2 , Alicja Szabelska-Beręsewicz 2 , Michal J Okoniewski 3

BACKGROUND Experience with running various types of classification on the CAMDA neuroblastoma dataset has led us to the conclusion that the results are not always obvious and may differ depending on the type of analysis and the selection of genes used for classification. This paper aims to point out several factors that may influence downstream machine learning analysis. In particular, those factors are: the type of primary analysis, the type of classifier, and the increased correlation between genes sharing a protein domain. They influence the analysis directly, but the interplay between them may also be important. We have compiled a gene-domain database and used it to examine the differences between genes that share a domain and the rest of the genes in the datasets.

RESULTS The major findings are: pairs of genes that share a domain have increased Spearman's correlation coefficients of counts; genes sharing a domain are expected to have lower predictive power due to increased correlation, which in most cases shows up as a higher number of misclassified samples; classifier performance may vary depending on the method, but in most cases using genes sharing a domain in the training set results in a higher misclassification rate; increased correlation in genes sharing a domain most often results in worse classifier performance regardless of the primary analysis tools used, even if the primary analysis alignment yield varies.

CONCLUSIONS The effect of sharing a domain is likely more a result of real biological co-expression than just sequence similarity and artifacts of mapping and counting. Still, this is more difficult to conclude and needs further research. The effect is interesting in itself, but we also point out some practical aspects in which it may influence RNA sequencing analysis and RNA biomarker use. In particular, it means that a gene signature biomarker set built from RNA-sequencing results should be depleted of genes sharing common domains. Doing so may make it perform better when applied to classification.

REVIEWERS This article was reviewed by Dimitar Vassiliev and Susmita Datta.
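The recommendation to deplete a signature of genes sharing common domains can be illustrated with a minimal sketch. This is not the authors' procedure: the greedy rule, the function name `deplete_shared_domains`, the gene names, and the domain identifiers below are all hypothetical placeholders, and the gene-to-domain mapping is assumed to be available as a plain dictionary.

```python
from typing import Dict, Iterable, List, Set


def deplete_shared_domains(
    ranked_genes: Iterable[str],
    gene_domains: Dict[str, Set[str]],
) -> List[str]:
    """Greedy filter (an assumed rule, not taken from the paper).

    Walk the genes in their given ranking order and keep a gene only
    if it shares no protein domain with any gene already kept.
    Genes with no known domain are always kept.
    """
    kept: List[str] = []
    seen_domains: Set[str] = set()
    for gene in ranked_genes:
        domains = gene_domains.get(gene, set())
        if domains & seen_domains:
            continue  # shares a domain with a gene that was already kept
        kept.append(gene)
        seen_domains |= domains
    return kept


# Toy input: placeholder gene names and domain identifiers.
toy_domains = {
    "GENE_A": {"DOM_1"},
    "GENE_B": {"DOM_1", "DOM_2"},  # shares DOM_1 with GENE_A
    "GENE_C": {"DOM_3"},
    "GENE_D": set(),               # no annotated domain
}
signature = deplete_shared_domains(
    ["GENE_A", "GENE_B", "GENE_C", "GENE_D"], toy_domains
)
print(signature)  # ['GENE_A', 'GENE_C', 'GENE_D']
```

The result depends on the input ranking: a higher-ranked gene always wins over a lower-ranked gene with which it shares a domain, so the choice of ranking criterion (for example, differential-expression significance) is a design decision left open here.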

Updated: 2019-11-01