Machine learning for identification of silylated derivatives from mass spectra
Journal of Cheminformatics ( IF 7.1 ) Pub Date : 2022-09-15 , DOI: 10.1186/s13321-022-00636-1
Milka Ljoncheva 1, 2 , Tomaž Stepišnik 2, 3 , Tina Kosjek 1, 2 , Sašo Džeroski 2, 3

Compound structure identification increasingly relies on sophisticated computational tools, among which machine learning tools are a recent addition that is quickly gaining importance. These tools, of which the method titled Compound Structure Identification:Input Output Kernel Regression (CSI:IOKR) is an excellent example, have been used to elucidate compound structures from mass spectral (MS) data with significant accuracy, confidence and speed. They have, however, largely focused on data from liquid chromatography coupled to tandem mass spectrometry (LC–MS). Gas chromatography coupled to mass spectrometry (GC–MS) is an alternative that offers several advantages over LC–MS, including higher data reproducibility. Of special importance is the substantial compound coverage offered by GC–MS, further expanded by derivatization procedures such as silylation, which can improve the volatility, thermal stability and chromatographic peak shape of semi-volatile analytes. Despite these advantages and the increasing size of compound databases and MS libraries, GC–MS data have not yet been used by machine learning approaches to compound structure identification. This study presents a successful application of the CSI:IOKR machine learning method to the identification of environmental contaminants from GC–MS spectra. We use CSI:IOKR as an alternative to an exhaustive search of MS libraries, independent of the instrumental platform and data processing software. To train a model that maps between spectra and molecular structures, we use a comprehensive dataset of GC–MS spectra of trimethylsilyl derivatives and their molecular structures, derived from a large commercially available MS library. We test the learned model on a different dataset of GC–MS spectra of trimethylsilyl derivatives of environmental contaminants, generated in-house and made publicly available. The results show that 37% and 50% of the tested compounds are correctly ranked among the top 10 and top 20 candidate compounds suggested by the model, respectively. Even though spectral comparisons with reference standards or de novo structural elucidations are necessary to validate the predictions, machine learning provides efficient candidate prioritization and reduces the time spent on compound annotation.
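
The top-10 and top-20 figures quoted in the abstract are top-k rates: the fraction of test compounds whose correct structure appears among the first k candidates in the model's ranked list. A minimal Python sketch of that calculation follows, assuming 1-based ranks; the function name `top_k_rate` and the example ranks are hypothetical illustrations, not code or data from the study.

```python
from typing import Sequence


def top_k_rate(ranks: Sequence[int], k: int) -> float:
    """Return the fraction of compounds whose true structure is ranked within the top k.

    `ranks` holds, for each test compound, the 1-based position of the correct
    structure in the candidate list ordered by the model.
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for eight test compounds (made up for illustration only).
example_ranks = [1, 4, 12, 18, 25, 7, 60, 15]

for k in (10, 20):
    print(f"top-{k} rate: {top_k_rate(example_ranks, k):.1%}")
```

Reporting the rate at several values of k, as the abstract does for 10 and 20, shows how quickly the correct structure is recovered as an analyst inspects more candidates.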

Updated: 2022-09-15