TNFPred: identifying tumor necrosis factors using hybrid features based on word embeddings
BMC Medical Genomics (IF 2.7), Pub Date: 2020-10-22, DOI: 10.1186/s12920-020-00779-w
Trinh-Trung-Duong Nguyen, Nguyen-Quoc-Khanh Le, Quang-Thai Ho, Dinh-Van Phan, Yu-Yen Ou

Cytokines are a class of small proteins that act as chemical messengers and play a significant role in essential cellular processes, including immune regulation, hematopoiesis, and inflammation. As one important family of cytokines, tumor necrosis factors are associated with the regulation of various biological processes such as cell proliferation and differentiation, apoptosis, lipid metabolism, and coagulation. These cytokines are also implicated in conditions such as insulin resistance, autoimmune diseases, and cancer. Because this kind of cytokine is interdependent with others, distinguishing tumor necrosis factors from other cytokines is a challenge for biological scientists. In this research, we employed a word embedding technique to create hybrid features, which proved effective at identifying tumor necrosis factors among cytokine sequences. We segmented each protein sequence into protein words and created a corresponding word embedding for each word. A word embedding-based vector was then created for each sequence and fed into machine learning classification models. When extracting feature sets, we not only varied the segmentation size of the protein sequences but also tried different combinations of the resulting n-grams to find the features that gave the best prediction. Furthermore, our methodology follows a well-defined procedure to build a reliable classification tool. With our proposed hybrid features, prediction models achieve better performance than with seven prominent sequence-based feature types. Results from 10 independent runs on the surveyed dataset show that, on average, our optimal models obtain an area under the curve of 0.984 on 5-fold cross-validation and 0.998 on the independent test. These results show that biologists can use our model to identify tumor necrosis factors among other cytokines efficiently. Moreover, this study demonstrates that natural language processing techniques can reasonably be applied to help biologists solve bioinformatics problems efficiently.
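As a rough illustration of the pipeline described in the abstract (segment a sequence into protein words, embed each word, build one vector per sequence, combine several gram sizes), here is a minimal Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: the words are overlapping n-grams, the per-sequence vector is the mean of its word embeddings, the "hybrid" feature is a plain concatenation across gram sizes, and `toy_embedding` is a hypothetical hash-based stand-in for a trained embedding model. All function names are invented for this example.

```python
import hashlib
from typing import List, Sequence


def split_into_words(sequence: str, n: int) -> List[str]:
    """Split a protein sequence into overlapping n-grams ("protein words").

    Overlapping windows are an assumption; the abstract does not say
    whether the segments overlap.
    """
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def toy_embedding(word: str, dim: int = 4) -> List[float]:
    """Hypothetical stand-in for a trained word embedding.

    Derives `dim` deterministic values in [0, 1] from a SHA-256 hash of
    the word (so dim must be at most 32). A real model would learn these
    vectors from a corpus of protein sequences.
    """
    digest = hashlib.sha256(word.encode("ascii")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def sequence_vector(sequence: str, n: int, dim: int = 4) -> List[float]:
    """Average the embeddings of all n-gram words in one sequence.

    Mean pooling is an assumption, not a detail taken from the abstract.
    """
    words = split_into_words(sequence, n)
    if not words:
        # Sequence shorter than n: fall back to a zero vector.
        return [0.0] * dim
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(toy_embedding(word, dim)):
            total[k] += value
    return [t / len(words) for t in total]


def hybrid_features(sequence: str,
                    gram_sizes: Sequence[int] = (1, 2, 3),
                    dim: int = 4) -> List[float]:
    """Concatenate the per-sequence vectors for several gram sizes."""
    features: List[float] = []
    for n in gram_sizes:
        features.extend(sequence_vector(sequence, n, dim))
    return features


if __name__ == "__main__":
    example = "MSTESMIRDV"  # made-up short peptide, for illustration only
    print(split_into_words(example, 3))
    print(len(hybrid_features(example)))
```

The resulting fixed-length feature list could then be passed to any standard classifier. Which gram sizes to combine is the kind of choice the abstract describes tuning.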

Updated: 2020-10-26