Elsevier

Analytica Chimica Acta

Volume 1238, 15 January 2023, 340189
Image classification combined with faster R–CNN for the peak detection of complex components and their metabolites in untargeted LC-HRMS data

https://doi.org/10.1016/j.aca.2022.340189

Highlights

  • A new method based on CNN and faster R–CNN is proposed for peak detection.

  • Peak_CF, Peakonly and ADAP are compared through an evaluation data-set.

  • Peak_CF shows low rates of both false positives and false negatives.

  • A Peak_CF training model with strong generalization ability can be achieved.

  • Peak_CF was applied to real metabolic fingerprints of Glycyrrhiza flavonoids.

Abstract

Peak detection in untargeted liquid chromatography-high resolution mass spectrometry (LC-HRMS) data is a key step in identifying the metabolic status of druggable chemicals and of extracts from functional foods or herbs. However, existing approaches struggle to deliver results with both few false positives and few false negatives. In this paper, we propose an automatic method, named Peak_CF, that combines a convolutional neural network (CNN) for image classification with Faster R–CNN for peak location and classification in untargeted LC-HRMS data. On an evaluation data-set, it detected target peaks with high accuracy and high recall (both >90%). For the m/z peaks of known compounds, Peak_CF performs better than Peakonly, and it can effectively judge the overall peak shape of split peaks. On the same evaluation data, the recall of MZmine2 (ADAP) is slightly higher than that of Peak_CF, but the F1 score of Peak_CF is higher, indicating higher accuracy. In addition, a Peak_CF training model with strong generalization ability was obtained and verified. Finally, Peak_CF was applied to real metabolic fingerprints of total flavonoids from Glycyrrhiza uralensis Fisch, and a comparison was made using 40 m/z peaks of 40 prototypes in the serum data-set. The recall of both Peak_CF and Peakonly reached 95%, compared with 70% for MZmine2 (ADAP), and Peak_CF was more accurate when detecting EICs with serious drift. In conclusion, Peak_CF provides a new route for data mining of LC-HRMS datasets of metabolites of drugs, herbs, or functional foods.

Introduction

The metabolic status of druggable chemicals [[1], [2], [3]] and of extracts from functional foods or herbs [[4], [5]] is one of the most important factors influencing their pharmacological activity. Therefore, separating and identifying their prototypes and metabolites in complex biological matrices will provide valuable information for product research, clinical applications and mechanism studies. Currently, liquid chromatography-high resolution mass spectrometry (LC-HRMS) is widely used to analyze metabolites in complex samples because of its high selectivity and high sensitivity [[6], [7]].

LC-HRMS analysis of a complex sample produces raw data sets containing a huge amount of information (m/z, intensity, etc.) on a very large number of chemical components. Faced with such complexity, Tauler et al. developed the Regions of Interest (ROI) procedure for compressing and processing untargeted LC-MS data in the Matlab environment [8]. Usually, each responding compound is represented by one or more m/z peaks in the LC-HRMS fingerprint, including isotopic ions, adduct ions, fragment ions, and multi-charged ions [9], among others. It is worth noting that both prototypes and metabolites tend to be present at low concentrations in vivo and are subject to signal interference from the abundant biological matrix. Manual interpretation of m/z peaks in LC-HRMS fingerprints is therefore not only cumbersome and labor-intensive, but also likely to miss unknown, low-intensity metabolites. Hence, there is an urgent need for an automatic, accurate and sensitive peak detection method. The classical peak detection algorithms include continuous wavelet transform (CWT) [10], the centWave algorithm [11], ADAP peak detection [12], etc. These methods have been integrated into open source software such as XCMS [13] and MZmine2 [14] to implement a complete data processing workflow. In this process, the raw data are converted into a table of peaks, which is currently the most commonly used approach to m/z peak detection. However, conventional algorithms do not perform well enough on weak and overlapped peaks, and have difficulty distinguishing peak-like noise signals from real peaks. When the parameters of the peak detection program are not set properly, a large number of false-positive or false-negative results can arise. For example, if the smoothing level is too large, weak peaks are over-smoothed and lose their peak characteristics, resulting in missed detections; if the peak width setting is too large, some unrealistically broad peaks can be detected, which are likely to be generated by changes in factors such as the mobile phase.
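The over-smoothing effect described above can be illustrated with a minimal sketch. It assumes a simple centered moving average as the smoother and a synthetic Gaussian peak on a flat baseline; neither is taken from XCMS, MZmine2 or Peak_CF, and the function names `moving_average` and `gaussian_peak` are hypothetical.

```python
import math

def moving_average(signal, window):
    """Centered moving average; edges use a truncated window (an assumption)."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed

def gaussian_peak(n_points, center, sigma, height):
    """Synthetic Gaussian peak sampled at integer scan indices."""
    return [height * math.exp(-0.5 * ((i - center) / sigma) ** 2)
            for i in range(n_points)]

# Hypothetical weak, narrow peak (height 100, sigma 2 scans) on a baseline of 20.
BASELINE = 20.0
eic = [BASELINE + v for v in gaussian_peak(101, 50, 2.0, 100.0)]

light = moving_average(eic, 3)    # mild smoothing
heavy = moving_average(eic, 31)   # window far wider than the peak

apex_raw = max(eic) - BASELINE
apex_light = max(light) - BASELINE
apex_heavy = max(heavy) - BASELINE

print(f"apex above baseline: raw={apex_raw:.1f}, "
      f"window 3={apex_light:.1f}, window 31={apex_heavy:.1f}")
```

With the wide window, the apex falls to a small fraction of its original height above the baseline, so an intensity or signal-to-noise threshold that the raw peak would pass may no longer be met.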

Deep learning, which uses multi-layer neural networks loosely modeled on the human brain for analytic learning, is recognized as a powerful tool in research areas such as speech recognition and computer vision. It is also applied in chemometrics for data processing, for example to spectral images [15] and two-dimensional Raman spectrograms [16]. Advanced techniques for target detection have since been developed, including region-convolutional neural networks (R–CNN) [17], Fast R–CNN [18], and Faster R–CNN [19]. Among them, Faster R–CNN performs excellently in locating and classifying multiple targets in an image. Given the power of deep learning, many researchers have tried to apply it to peak detection: Woldegebriel et al. applied neural networks to predict whether a specific coordinate is the center of a peak [20]; Kant et al. established a deep neural network for peak classification, which allows data filtering [21]; Risum et al. used deep learning to automatically evaluate whether chromatographic components reflect chemical information or baseline [22]; Melnikov et al. developed the Peakonly algorithm for peak detection using two CNNs [23]; and Guo et al. developed the EVA programme to evaluate the fidelity of metabolic features [24]. These methods focused on the performance of deep learning in image classification but ignored its powerful ability in target detection. Moreover, most of them cannot directly obtain the specific location of peaks, including peak endpoints and peak vertices. Therefore, our team applied Faster R–CNN to the peak detection of LC-HRMS fingerprints of complex samples to see whether peaks could be classified and localized more accurately.
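To make concrete what obtaining "peak endpoints and peak vertices" from a detector could involve, the following minimal sketch maps one bounding box, given in image pixels, back to retention-time bounds and an apex. It is an illustration under stated assumptions, not the Peak_CF implementation: the function `box_to_peak`, the linear pixel-to-retention-time mapping, and the toy data are all hypothetical.

```python
def box_to_peak(box, rts, intensities, image_width):
    """Map a box (x_min, y_min, x_max, y_max) in pixels to peak bounds.

    Assumption: the EIC spans the full image width, with rts[0] drawn at
    x = 0 and rts[-1] drawn at x = image_width, on a linear axis.
    """
    x_min, _, x_max, _ = box
    rt_span = rts[-1] - rts[0]
    rt_start = rts[0] + rt_span * x_min / image_width
    rt_end = rts[0] + rt_span * x_max / image_width
    # Collect the data points that fall inside the box horizontally.
    inside = [(inten, rt) for rt, inten in zip(rts, intensities)
              if rt_start <= rt <= rt_end]
    if not inside:
        return None
    apex_intensity, apex_rt = max(inside)  # highest point is taken as the vertex
    return {"start": rt_start, "end": rt_end,
            "apex_rt": apex_rt, "apex_intensity": apex_intensity}

# Toy EIC: 11 scans from 0 to 10 (arbitrary time units) with one peak.
rts = [float(i) for i in range(11)]
intensities = [0.0, 0.0, 5.0, 30.0, 80.0, 40.0, 10.0, 0.0, 0.0, 0.0, 0.0]

# Hypothetical detection box on a 100-pixel-wide image.
peak = box_to_peak((20, 10, 60, 90), rts, intensities, image_width=100)
print(peak)
```

A classifier that only labels a whole image as "peak" or "noise" yields no such box, which is why a further localization step is needed to recover endpoints and apex.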

In this paper, with the aim of improving the detection rate for weak peaks and reducing the false positive rate, the authors propose an automatic peak detection method based on EIC image classification and Faster R–CNN (Peak_CF). It comprises construction of the EIC image, CNN-based noise recognition, and peak detection based on Faster R–CNN. Peak_CF, Peakonly and ADAP were evaluated on a real data set containing 857 manually labeled real peaks. Finally, LC-qTOF MS peaks in the metabolic fingerprints of total flavonoids from Glycyrrhiza uralensis Fisch in mice were detected using Peak_CF.
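Comparisons against manually labeled peaks are usually summarized by precision, recall and the F1 score. The sketch below uses the standard definitions and shows how a detector with slightly higher recall can still have a lower F1 score. The counts are invented for illustration and are not results from this paper; `detection_scores` is a hypothetical helper.

```python
def detection_scores(true_positives, false_positives, false_negatives):
    """Return precision, recall and F1 from detection counts."""
    detected = true_positives + false_positives
    labeled = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / labeled if labeled else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only (100 labeled peaks in each case).
detector_a = detection_scores(true_positives=90, false_positives=10, false_negatives=10)
detector_b = detection_scores(true_positives=95, false_positives=60, false_negatives=5)

for name, scores in (("A", detector_a), ("B", detector_b)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Detector B finds more of the labeled peaks, but its many false positives pull precision down, so its F1 score is lower than that of detector A.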

Theory and method

Peak_CF consists of EIC image construction, CNN-based noise identification, and Faster R–CNN–based peak detection. The code was written in Python 3.7, using PyTorch 1.9.0 to build and train the neural networks; the other requirements are provided in Section S1 (Supplemental Material I). Peak_CF is freely available online at https://github.com/JunZeng1999/Peak_CF.
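As a rough illustration of the first step, the sketch below builds an extracted ion chromatogram (EIC) for one target m/z from a list of scans. It is not the Peak_CF code: the function `build_eic`, the ppm tolerance, the choice of the most intense ion inside the window, and the toy scans are all assumptions made for the example. Scans without a matching ion are kept with zero intensity rather than dropped.

```python
def build_eic(scans, target_mz, ppm_tol=10.0):
    """Return (retention_times, intensities) for one m/z trace.

    scans: list of (retention_time, [(mz, intensity), ...]) tuples.
    A scan with no ion inside the tolerance window contributes 0.0.
    """
    delta = target_mz * ppm_tol * 1e-6  # absolute m/z half-window
    rts, intensities = [], []
    for rt, ions in scans:
        in_window = [inten for mz, inten in ions if abs(mz - target_mz) <= delta]
        rts.append(rt)
        # Assumption: keep the most intense ion inside the window.
        intensities.append(max(in_window) if in_window else 0.0)
    return rts, intensities

# Toy scans with hypothetical m/z values and intensities.
scans = [
    (0.10, [(255.0652, 120.0), (300.1000, 50.0)]),
    (0.20, [(255.0655, 480.0)]),
    (0.30, [(300.1001, 70.0)]),                      # no matching ion
    (0.40, [(255.0650, 200.0), (255.0900, 900.0)]),  # second ion is out of window
]

rts, eic = build_eic(scans, target_mz=255.0652, ppm_tol=10.0)
print(list(zip(rts, eic)))
```

The resulting trace could then be rendered as an image for the classification and detection steps; how that image is drawn is left out of this sketch.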

Preparation of serum and bile samples

Extraction of Glycyrrhiza flavonoids and drug administration are described in Section M1 (Supplemental Material II). Blood and bile samples were then collected 1 h after the last administration. Blood was centrifuged at 5000 rpm for 20 min to separate the serum. An aliquot of 0.5 mL serum (or bile) was treated with 3 volumes of ethyl acetate to extract the target analytes. The mixture was vortexed for 5 min and centrifuged at 5000 rpm for 10 min. The supernatant was separated, dried under

Acquisition of evaluation dataset (EVDS)

MM14 [11], a dataset obtained from 14 standard compounds analyzed by UPLC/ESI-qTOF-MS, was used to evaluate the performance of Peak_CF; it provides 296 m/z peaks for all possible compounds. Manual verification showed that some low-intensity peaks were difficult to observe. Thus, only the 115 m/z peaks with intensity above 300 were considered and manually re-annotated for MM14 (Table S3, Supplemental Material I). However, there are also many m/z peaks in this data-set brought about by

Conclusion

In this paper, we proposed an automatic peak detection method based on noise identification and Faster R–CNN classification, named Peak_CF. It comprises three main steps: EIC image construction, noise signal identification, and peak detection. In the construction of the EIC image, zero intensity values were allowed to appear, and the EIC was saved and analyzed as a picture; then a CNN and a Faster R–CNN were constructed and trained for noise identification and weak peak detection,

Funding statement

This work was supported by Hunan 2011 Collaborative Innovation Center of Chemical Engineering & Technology with Environmental Benignity and Effective Resource Utilization, Hunan Province Natural Science Fund (no. 2020JJ4569), the key project of Hunan Provincial Education Department (no. 18A055), the Open Research Funding of Chongqing Key Laboratory of Traditional Chinese Medicine for Prevention and Cure of Metabolic Diseases (no. 2021-1-4) and Hunan Province College Students' innovation and

Data availability statement

Data included in article/supplementary material/referenced in article.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (31)

  • A.B. Risum et al., Using deep learning to evaluate peaks in chromatographic data, Talanta (2019)
  • M. He et al., Accurate recognition and feature qualify for flavonoid extracts from Liang-wai Gan Cao by liquid chromatography-high resolution-mass spectrometry and computational MS/MS fragmentation, J. Pharmaceut. Biomed. Anal. (2017)
  • Q. Wang et al., Metabolites identification of bioactive licorice compounds in rats, J. Pharmaceut. Biomed. Anal. (2015)
  • M. Barranco-Altirriba et al., mWISE: an algorithm for context-based annotation of liquid chromatography-mass spectrometry features through diffusion in graphs, Anal. Chem. (2021)
  • P. Du et al., Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching, Bioinformatics (2006)