Various aspects of retention index usage for GC-MS library search: A statistical investigation using a diverse data set, Chemometrics and Intelligent Laboratory Systems



Chemometrics and Intelligent Laboratory Systems (IF 3.7), Pub Date: 2020-07-01, DOI: 10.1016/j.chemolab.2020.104042
Dmitriy D. Matyushin, Anastasia Yu. Sholokhova, Anastasia E. Karnaeva, Aleksey K. Buryak

Abstract: This work presents a large-scale statistical evaluation of several aspects of using the retention index in GC-MS library search, based on a diverse data set. A search in a large library often fails to return the correct compound even when the library contains it. One way to improve a spectral library search is to use retention index information. The aim of this study is to explore statistical features that can help in developing automated software that searches a large database for diverse, completely unknown compounds. The data set used as the source of queries contains ~11 thousand spectra of compounds that belong to diverse chemical classes. Six equations for matching reference and experimental “retention index – spectrum” pairs were compared. Good results were obtained with a linear equation for the similarity of pairs: the similarity is the sum of the spectral similarity and the product of a negative adjustable weight parameter and the absolute difference between the reference and query retention indices. This equation performs as well as or better than much more complex equations that contain two adjustable parameters instead of one. The widely used threshold-based approach, in which candidates with a high retention index deviation are rejected, performs worse than the other equations. The use of retention indices predicted with neural networks as reference values was also considered. Modern universal retention prediction models, which are applicable to a wide variety of compounds, are still quite inaccurate compared with values from databases, but these predicted values also improve the library search. When predicted retention indices are used as reference, the linear equation for matching “retention index – spectrum” pairs again performs as well as or better than the other equations.
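The linear equation and the threshold-based approach described above can be sketched as follows. This is only an illustration of the two ideas, not the authors' implementation: the `Candidate` class, the function names, the 0..1 similarity scale, the weight of -0.001 and the 50-unit threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A library hit (hypothetical structure, not taken from the paper)."""
    name: str
    spectral_similarity: float  # assumed to be on a 0..1 scale
    reference_ri: float         # reference retention index from the library


def linear_score(cand, query_ri, weight=-0.001):
    """Spectral similarity plus a negative weight times |RI difference|.

    The default weight is an arbitrary illustrative value; in practice it
    is the single adjustable parameter that would have to be tuned.
    """
    return cand.spectral_similarity + weight * abs(cand.reference_ri - query_ri)


def rank_linear(candidates, query_ri, weight=-0.001):
    """Rank candidates by the linear combined score, best first."""
    return sorted(candidates,
                  key=lambda c: linear_score(c, query_ri, weight),
                  reverse=True)


def rank_threshold(candidates, query_ri, max_deviation=50.0):
    """Reject candidates whose RI deviates too much, then rank by spectrum."""
    kept = [c for c in candidates
            if abs(c.reference_ri - query_ri) <= max_deviation]
    return sorted(kept, key=lambda c: c.spectral_similarity, reverse=True)


if __name__ == "__main__":
    # Made-up hits for a query with retention index 1200.
    hits = [
        Candidate("A", 0.90, 1400.0),
        Candidate("B", 0.88, 1205.0),
        Candidate("C", 0.95, 1260.0),
    ]
    print([c.name for c in rank_linear(hits, 1200.0)])
    print([c.name for c in rank_threshold(hits, 1200.0)])
```

With these made-up values the linear score keeps candidate C in the list and ranks it first, because its small retention index penalty does not outweigh its higher spectral similarity, while the hard threshold discards C outright. The numbers are chosen only to show how the two schemes can disagree.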
The distribution of differences between query indices and reference indices (both calculated and experimental) was found to be close to an exponential distribution near zero. The dependence of the fraction of correct identifications on the accuracy of the reference retention indices was studied. Random noise with a double exponential distribution was added to exact values to create “reference” retention indices with a predefined accuracy. The use of the molecular mass and molecular formula as additional constraints during a library search was also considered.
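The noise-addition step can be illustrated with a short sketch. It assumes that "predefined accuracy" means the mean absolute error of the noisy values, which equals the scale parameter of a double exponential (Laplace) distribution; the abstract does not state how accuracy was parameterized, and the function names below are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-centered Laplace distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale parameter.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_reference_indices(exact_indices, mean_abs_error, seed=0):
    """Add Laplace noise to exact retention indices.

    `mean_abs_error` is used as the Laplace scale, so the expected
    absolute deviation of each noisy value equals it (an assumption
    about what "accuracy" means here).
    """
    rng = random.Random(seed)
    return [ri + laplace_noise(mean_abs_error, rng) for ri in exact_indices]


if __name__ == "__main__":
    exact = [1200.0] * 10000
    noisy = noisy_reference_indices(exact, mean_abs_error=30.0)
    observed = sum(abs(n - e) for n, e in zip(noisy, exact)) / len(exact)
    print(f"observed mean absolute error: {observed:.1f}")
```

Repeating a library search with such noisy "reference" values at several scales would trace out how the fraction of correct identifications depends on reference accuracy.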


Updated: 2020-07-01
