Deep learning embedder method and tool for mass spectra similarity search
Journal of Proteomics ( IF 3.3 ) Pub Date : 2020-12-08 , DOI: 10.1016/j.jprot.2020.104070
Chunyuan Qin 1 , Xiyang Luo 1 , Chuan Deng 1 , Kunxian Shu 1 , Weimin Zhu 2 , Johannes Griss 3 , Henning Hermjakob 4 , Mingze Bai 5 , Yasset Perez-Riverol 6

Spectral similarity calculation is widely used in protein identification tools and mass spectra clustering algorithms to compare theoretical or experimental spectra. The performance of the spectral similarity calculation plays an important role in these tools and algorithms, especially in the analysis of large-scale datasets. Recently, deep learning methods have been proposed to improve the performance of clustering algorithms and protein identification by training on existing data and using multiple spectra and identified peptide features. While the efficiency of these algorithms relative to traditional approaches is still under study, their application in proteomics data analysis is becoming more common. Here, we propose the use of deep learning to improve spectral similarity comparison. We assessed the performance of deep learning for spectral similarity with GLEAMS and a newly trained embedder model (DLEAMSE), which uses high-quality spectra from PRIDE Cluster. We also developed a new bioinformatics tool (mslookup - https://github.com/bigbio/DLEAMSE/) that allows users to quickly search for spectra among previously identified mass spectra published in public repositories and spectral libraries. Finally, we released a human database to enable bioinformaticians and biologists to search for identified spectra on their own machines.
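
The search idea described in the abstract can be illustrated with a minimal sketch: once spectra are represented as fixed-length embedding vectors, looking up a query spectrum amounts to finding stored vectors within some distance of it. This is not the mslookup implementation or its API. The function names, the toy library, the peptide labels, the 4-D vectors, and the distance cutoff are all hypothetical and chosen only for illustration.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def search_library(query, library, max_distance):
    """Return (spectrum_id, peptide, distance) hits within max_distance, closest first.

    `library` maps a spectrum id to (peptide, embedding). This brute-force scan is
    an illustrative assumption, not how any particular tool indexes its data.
    """
    hits = []
    for spectrum_id, (peptide, embedding) in library.items():
        distance = euclidean_distance(query, embedding)
        if distance <= max_distance:
            hits.append((spectrum_id, peptide, distance))
    hits.sort(key=lambda hit: hit[2])
    return hits


# Toy library: ids, peptide labels and short vectors are made up for this sketch.
library = {
    "lib_1": ("PEPTIDEK", [0.0, 0.0, 0.0, 0.0]),
    "lib_2": ("SAMPLER", [1.0, 1.0, 1.0, 1.0]),
    "lib_3": ("ELVISK", [0.1, 0.0, 0.0, 0.0]),
}
query = [0.0, 0.0, 0.0, 0.0]
hits = search_library(query, library, max_distance=0.5)
for spectrum_id, peptide, distance in hits:
    print(spectrum_id, peptide, round(distance, 3))
```

A linear scan like this is enough to show the principle. A library with millions of spectra would more plausibly rely on an index structure, which is outside the scope of this sketch.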

Significance statement

Spectral similarity calculation plays an important role in proteomics data analysis. With deep learning's ability to learn implicit and effective features from large-scale training datasets, deep learning-based MS/MS spectra embedding models have emerged as a solution to improve the similarity calculation in mass spectral clustering algorithms. We compare multiple similarity scoring and deep learning methods in terms of accuracy (computing the similarity for a pair of mass spectra) and computing-time performance. The benchmark results showed no major differences in accuracy between DLEAMSE and the normalized dot product (NDP) for spectrum similarity calculations. The DLEAMSE GPU implementation is faster than NDP in preprocessing on the GPU server, and the DLEAMSE similarity calculation (Euclidean distance on 32-D vectors) takes about 1/3 of the time of dot product calculations. The deep learning model (DLEAMSE) encoding and embedding steps need to run only once for each spectrum, and the embedded 32-D points can be persisted in the repository, which speeds up future comparisons and the analysis of large-scale data. Based on these results, we propose a new tool, mslookup, that enables researchers to find spectra previously identified in public data. The tool can also be used to generate in-house databases of previously identified spectra to share with other laboratories and consortia.
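
To make the two similarity measures compared above concrete, the following sketch computes a normalized dot product on binned peak lists and a Euclidean distance on 32-D vectors. It is only an illustration under stated assumptions: the bin width, the m/z range, the peak lists, and the 32-D vectors are invented placeholders, the vectors are not real DLEAMSE embeddings, and the preprocessing used in the paper may differ.

```python
import math


def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Sum peak intensities into fixed-width m/z bins.

    bin_width and max_mz are assumptions made for this sketch.
    """
    n_bins = int(max_mz / bin_width)
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        if 0 <= index < n_bins:
            vector[index] += intensity
    return vector


def normalized_dot_product(a, b):
    """Dot product of two binned spectra divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Made-up peak lists as (m/z, intensity) pairs.
spectrum_a = [(175.1, 30.0), (300.2, 100.0), (500.3, 50.0)]
spectrum_b = [(175.1, 15.0), (300.2, 50.0), (500.3, 25.0)]  # same peaks, half intensity
spectrum_c = [(120.4, 80.0), (410.7, 60.0)]                 # no shared bins with a

ndp_ab = normalized_dot_product(bin_spectrum(spectrum_a), bin_spectrum(spectrum_b))
ndp_ac = normalized_dot_product(bin_spectrum(spectrum_a), bin_spectrum(spectrum_c))

# Placeholder 32-D vectors standing in for embeddings computed once per spectrum.
embedding_a = [0.0] * 32
embedding_b = [0.5] * 32
distance_ab = euclidean_distance(embedding_a, embedding_b)

print(ndp_ab, ndp_ac, distance_ab)
```

Note the difference in input size: the binned vectors here have 2000 entries per spectrum, whereas the distance is computed on 32 values, which is consistent with the statement that stored embeddings make repeated comparisons cheaper.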




Updated: 2020-12-14