High dimensional model representation of log likelihood ratio: binary classification with SNP data.
BMC Medical Genomics (IF 2.7), Pub Date: 2020-09-21, DOI: 10.1186/s12920-020-00774-1
Ali Foroughi Pour 1,2, Maciej Pietrzak 3,4, Lara E Sucheston-Campbell 5, Ezgi Karaesmen 5, Lori A Dalton 1, Grzegorz A Rempała 3,6

Developing binary classification rules based on SNP observations has been a major challenge for many modern bioinformatics applications, e.g., predicting the risk of future disease events in complex conditions such as cancer. The small-sample, high-dimensional nature of SNP data, the weak effect of each SNP on the outcome, and highly non-linear SNP interactions are key factors complicating the analysis. Additionally, SNPs take a finite number of values and may be best understood as ordinal or categorical variables, yet many algorithms treat them as continuous.
We use the theory of high dimensional model representation (HDMR) to build appropriate low dimensional glass-box models that account for the effects of feature interactions. We compute the second order HDMR expansion of the log-likelihood ratio to capture the effects of single SNPs and their pairwise interactions. We propose a regression based approach, called linear approximation for block second order HDMR expansion of categorical observations (LABS-HDMR-CO), to approximate the HDMR coefficients. We show how HDMR can be used to detect pairwise SNP interactions, and propose the fixed pattern test (FPT) to identify statistically significant pairwise interactions.
We apply LABS-HDMR-CO and FPT to synthetically generated HAPGEN2 data as well as to two GWAS cancer datasets. In these examples LABS-HDMR-CO achieves higher accuracy than several algorithms used for SNP classification, while also taking pairwise interactions into account. Because of the large number of tests performed, FPT declares very few significant interactions in the small-sample GWAS datasets when the false discovery rate (FDR) is bounded by 5%. LABS-HDMR-CO, on the other hand, uses a large number of SNP pairs to improve its prediction accuracy. In the larger HAPGEN2 dataset, FPT declares a larger portion of the SNP pairs used by LABS-HDMR-CO as significant.
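For orientation, the second order HDMR expansion of the log-likelihood ratio referred to above can be written in the generic HDMR form shown here. This is a sketch only: the symbols (a vector x of d SNP values, a binary label y, a constant term f_0, single-SNP terms f_i, and pairwise terms f_{ij}) are notation assumed for illustration and do not come from the abstract, so the paper's own definitions and class labeling may differ.

```latex
\log \frac{p(x \mid y = 1)}{p(x \mid y = 0)}
  \approx f_0
  + \sum_{i=1}^{d} f_i(x_i)
  + \sum_{1 \le i < j \le d} f_{ij}(x_i, x_j)
```

Under this reading, the single-SNP terms carry the individual effects and the pairwise terms carry the interactions, and a classification rule follows from the sign of the approximated ratio relative to a threshold.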
LABS-HDMR-CO and FPT are interesting methods to design prediction rules and detect pairwise feature interactions for SNP data. Reliably detecting pairwise SNP interactions and taking advantage of potential interactions to improve prediction accuracy are two different objectives addressed by these methods. While the large number of potential SNP interactions may result in low power of detection, potentially interacting SNP pairs, of which many might be false alarms, can still be used to improve prediction accuracy.
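The loss of detection power under a large number of tests can be illustrated with a generic FDR-controlling procedure. The abstract does not say which procedure is used to bound the FDR; the sketch below assumes the Benjamini-Hochberg step-up rule purely for illustration, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices rejected by the Benjamini-Hochberg step-up rule.

    Assumption: BH is used here only as a stand-in; the source does not
    name the FDR procedure applied with the fixed pattern test.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its BH threshold.
        if p_values[idx] <= alpha * rank / m:
            k_max = rank
    return sorted(order[:k_max])


# Hypothetical p-values: two small ones among 10 tests...
few_tests = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
# ...and the same two small ones among 1000 tests.
many_tests = [0.001, 0.008] + [0.5] * 998

print(benjamini_hochberg(few_tests))   # [0, 1]
print(benjamini_hochberg(many_tests))  # []
```

The same two p-values that pass at 10 tests are no longer rejected at 1000 tests, since each threshold scales with 1/m. This mirrors the point made above about pairwise SNP interactions, where the number of candidate pairs grows quadratically with the number of SNPs.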

Updated: 2020-09-21