A novel feature extraction method based on highly expressed SNPs for tissue-specific gene prediction
Journal of Big Data (IF 8.6), Pub Date: 2021-08-17, DOI: 10.1186/s40537-021-00497-9
Jasbir Dhaliwal, John Wagner

Background

Gene expression is the means by which an organism produces the gene products it needs to live. Variation in significant gene expression levels can distinguish both a gene and the tissue in which it is expressed. Tissue-specific gene expression, often determined by single nucleotide polymorphisms (SNPs), provides potential molecular markers or therapeutic targets for disease progression. SNPs are therefore good candidates for identifying disease progression. The current bioinformatics literature uses gene network modeling to summarize complex interactions between transcription factors, genes, and gene products. Here, our focus is on the impact of SNPs on tissue-specific gene expression levels. To the best of our knowledge, no prior study distinguishes tissue-specific genes using SNP expression levels.

Method

We propose a novel feature extraction method based on highly expressed SNPs, using k-mers as features. We also propose optimal k-mer and feature sizes for the approach. Determining the optimal sizes remains an open research question because they depend on the dataset and the purpose of the analysis. We therefore evaluate the algorithm's performance over a range of k-mer and feature sizes, using a multinomial naive Bayes (MNB) classifier on genes from the 49 human tissues in the Genotype-Tissue Expression (GTEx) portal.
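
The general idea of classifying sequences from k-mer counts with a multinomial naive Bayes model can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the toy sequences, the labels `tissue_A` and `tissue_B`, and the helper names `kmer_counts` and `MultinomialNB` are all hypothetical, no GTEx data is used, and the paper's notion of feature size is not modeled here.

```python
import math
from collections import Counter, defaultdict


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a nucleotide sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class MultinomialNB:
    """Tiny multinomial naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, count_vectors, labels):
        # Vocabulary = every k-mer seen in training.
        self.vocab = {kmer for counts in count_vectors for kmer in counts}
        self.class_counts = Counter(labels)
        self.n_samples = len(labels)
        # Per-class totals of each k-mer.
        self.feature_counts = defaultdict(Counter)
        for counts, label in zip(count_vectors, labels):
            self.feature_counts[label].update(counts)
        return self

    def log_scores(self, counts):
        scores = {}
        for label, n in self.class_counts.items():
            per_class = self.feature_counts[label]
            denom = sum(per_class.values()) + self.alpha * len(self.vocab)
            score = math.log(n / self.n_samples)  # log prior
            for kmer, c in counts.items():
                if kmer not in self.vocab:
                    continue  # k-mers never seen in training are ignored
                score += c * math.log((per_class[kmer] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, counts):
        scores = self.log_scores(counts)
        return max(scores, key=scores.get)


# Hypothetical toy training set (illustrative sequences and labels only).
train = [
    ("ATATATATAT", "tissue_A"),
    ("TATATATATA", "tissue_A"),
    ("GCGCGCGCGC", "tissue_B"),
    ("CGCGCGCGCG", "tissue_B"),
]
model = MultinomialNB().fit(
    [kmer_counts(seq, k=3) for seq, _ in train],
    [label for _, label in train],
)
print(model.predict(kmer_counts("ATATAT", k=3)))
```

With k = 3 over the four-letter DNA alphabet there are at most 4^3 = 64 distinct k-mers, so the count vectors stay small; larger k grows the vocabulary exponentially.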

Conclusions

Our approach achieves practical performance with k-mers of size 3. Depending on the purpose of the analysis and the number of tissue-specific genes under study, feature sizes [7, 8, 9] and [8, 9, 10] are typically optimal for the machine learning model.




Updated: 2021-08-19