Predicting the absence of an unknown compound in a mass spectral database
European Journal of Mass Spectrometry (IF 1.1), Pub Date: 2019-06-10, DOI: 10.1177/1469066719855503
Andrey Samokhin, Ksenia Sotnezova, Igor Revelsky

Only a small subset of known organic compounds amenable to gas chromatography/mass spectrometry is present in the largest mass spectral databases (such as NIST or Wiley). Nevertheless, commercially available library search algorithms cannot predict that a compound is absent from the database. In the present work, we attempted to implement such a prediction by means of supervised classification. The training and validation sets contained 1500 and 750 compounds, respectively. Two prediction sets (containing 750 and about 3000 mass spectra) were considered. The easiest-to-use models were built with only one input variable: the match factor of the best candidate or the InLib factor (both parameters were calculated within the MS Search (NIST) software). Multivariate classification models were built by partial least squares discriminant analysis (PLS-DA), with the match factors of the top n candidates used as input variables. PLS-DA was found to be the most effective approach. The prediction efficiency strongly depended on the ‘uniqueness’ of the mass spectra present in the test set. The PLS-DA model correctly predicted the absence of a compound from the database in 29.9% of cases for prediction set #1 and in 74.4% of cases for prediction set #2 (only 1.3% and 2.5%, respectively, of the compounds actually present in the database were wrongly classified).
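
The single-variable model mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pick_threshold`, `predict_absent`, `evaluate`), the rule for choosing the threshold (capping the share of in-library compounds that are wrongly flagged), and all numbers in the usage note are assumptions made for illustration only. The idea is simply that a low best-candidate match factor suggests the compound is absent from the database.

```python
from typing import List, Tuple


def pick_threshold(in_library_scores: List[float], max_false_alarm: float) -> float:
    """Choose a match-factor threshold from training spectra whose compounds
    ARE in the database (assumed selection rule, not taken from the paper).

    The returned threshold is the largest training score such that the share
    of in-library scores lying strictly below it is at most max_false_alarm.
    """
    ranked = sorted(in_library_scores)
    # Number of in-library spectra we tolerate being wrongly flagged as absent.
    allowed = int(max_false_alarm * len(ranked))
    return ranked[allowed]


def predict_absent(best_match_factor: float, threshold: float) -> bool:
    """Flag a spectrum as 'compound absent' when its best match factor is low."""
    return best_match_factor < threshold


def evaluate(absent_scores: List[float],
             present_scores: List[float],
             threshold: float) -> Tuple[float, float]:
    """Return (share of absent compounds correctly flagged,
               share of present compounds wrongly flagged)."""
    detected = sum(predict_absent(s, threshold) for s in absent_scores) / len(absent_scores)
    false_alarm = sum(predict_absent(s, threshold) for s in present_scores) / len(present_scores)
    return detected, false_alarm
```

As a usage example with made-up match factors, calling `pick_threshold([950, 930, 910, 900, 880, 870, 860, 840, 820, 700], 0.1)` tolerates one wrongly flagged training spectrum out of ten, and `evaluate` then reports the two percentages of the kind quoted in the abstract for a held-out set. The multivariate PLS-DA model described above replaces the single score with the match factors of the top n candidates and is not covered by this sketch.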

Updated: 2019-06-10