Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening
BMC Medical Genomics ( IF 2.7 ) Pub Date : 2020-10-22 , DOI: 10.1186/s12920-020-00778-x
Laura Moody 1 , Hong Chen 1, 2 , Yuan-Xiang Pan 1, 2, 3

Advancements in transcriptomic profiling have led to the emergence of new challenges regarding data integration and interpretability. Variability between measurement platforms makes it difficult to compare cohorts, and large numbers of gene features have encouraged the use of black-box methods that are not easily translated into biologically and clinically meaningful findings. We propose that gene rankings and algorithms that rely on relative expression within gene pairs can address such obstacles. We implemented an innovative process to evaluate the performance of five feature selection methods on simulated gene-pair data. Along with top-scoring pairs (TSP), we consider other methods that retain more information in their score calculations, including the magnitude of gene expression change as well as within-class variation. Tree-based rule extraction was also applied to serum microRNA (miRNA) pairs in order to devise a noninvasive screening tool for pancreatic and ovarian cancer. Gene pair data were simulated using different types of signal and noise. Pairs were filtered using feature selection approaches, including TSP, absolute differences between gene ranks, and Fisher scores. Methods that retain more information, such as the magnitude of expression change and within-class variance, yielded higher classification accuracy using a random forest model. We then demonstrate two powerful applications of gene pairs, first by performing large-scale integration of 52 breast cancer datasets consisting of 10,350 patients. Not only did we confirm known oncogenes, but we also propose novel tumorigenic genes, such as BSDC1 and U2AF1, that could distinguish between tumor subtypes. Finally, circulating miRNA pairs were filtered and salient rules were extracted to build simplified tree ensemble learners (STELs) for four types of cancer. These accessible clinical frameworks detected pancreatic and ovarian cancer with 84.8% and 93.6% accuracy, respectively.
Rank-based gene pair classification benefits from careful feature selection methods that preserve maximal information. Gene pairs enable dataset integration for greater statistical power and discovery of robust biomarkers as well as facilitate construction of user-friendly clinical screening tools.
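
The pair-filtering scores named in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, so everything below is an assumption: the TSP score is taken as the absolute difference between classes in the frequency with which gene i ranks below gene j, the rank-difference score as the absolute difference in mean within-sample rank difference, and the Fisher score as the squared difference in means divided by the sum of within-class variances. The function names `rank_within_sample` and `pair_scores` are hypothetical and are not taken from the paper.

```python
from itertools import combinations

import numpy as np


def rank_within_sample(expr):
    """Rank genes inside each sample (row); 0 is the lowest expression.

    Ties are broken by column order, which is a simplification.
    """
    return np.argsort(np.argsort(expr, axis=1), axis=1)


def pair_scores(expr, labels):
    """Score every gene pair (i, j) for a two-class problem (labels 0 and 1).

    Assumed definitions, not confirmed by the source:
      tsp       = |P(rank_i < rank_j | class 0) - P(rank_i < rank_j | class 1)|
      rank_diff = |mean(d | class 0) - mean(d | class 1)|, d = rank_i - rank_j
      fisher    = (mean difference)^2 / (var(d | class 0) + var(d | class 1))
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    ranks = rank_within_sample(expr)
    a, b = labels == 0, labels == 1
    scores = {}
    for i, j in combinations(range(expr.shape[1]), 2):
        d = ranks[:, i] - ranks[:, j]
        below = d < 0  # True where gene i ranks below gene j
        mean_gap = d[a].mean() - d[b].mean()
        within_var = d[a].var() + d[b].var()
        if within_var > 0:
            fisher = mean_gap ** 2 / within_var
        else:
            # No within-class spread: perfectly separated or uninformative.
            fisher = float("inf") if mean_gap != 0 else 0.0
        scores[(i, j)] = {
            "tsp": abs(below[a].mean() - below[b].mean()),
            "rank_diff": abs(mean_gap),
            "fisher": fisher,
        }
    return scores


# Toy data: 4 samples x 3 genes. Genes 0 and 1 swap order between classes,
# gene 2 is always the highest.
toy_expr = [[1, 2, 9], [1, 3, 9], [3, 1, 9], [4, 2, 9]]
toy_labels = [0, 0, 1, 1]
toy_scores = pair_scores(toy_expr, toy_labels)
```

In this toy example the pair (0, 1) flips order between classes, while pairs involving gene 2 never flip, so TSP cannot separate them from uninformative pairs even though their rank gaps differ between classes. That is the kind of information the abstract says magnitude-aware scores retain.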

Updated: 2020-10-26