An Approach of Spectra Standardization and Qualitative Identification for Biomedical Materials Based on Terahertz Spectroscopy

Shi, Chenjun; Zhu, Ji; Xu, Mingqian; Wu, Xu; Peng, Yan

doi:https://doi.org/10.1155/2020/8841565

Scientific Programming

On this page

Abstract Introduction Methods Results Discussion Conclusion Data Availability Conflicts of Interest Authors’ Contributions Acknowledgments References Copyright Related Articles

Research Article | Open Access

Volume 2020 | Article ID 8841565 | https://doi.org/10.1155/2020/8841565

An Approach of Spectra Standardization and Qualitative Identification for Biomedical Materials Based on Terahertz Spectroscopy

Chenjun Shi,¹Ji Zhu,²Mingqian Xu,¹Xu Wu,¹and Yan Peng¹

Academic Editor: Wenzheng Bao

Received06 Sept 2020

Revised25 Sept 2020

Accepted05 Oct 2020

Published21 Oct 2020

Abstract

Terahertz time-domain spectroscopy (THz-TDS) systems are widely used to obtain fingerprint spectra of many different biomedical substances, and thus the identification of different biological materials, medicines, or dangerous chemicals can be realized. However, the spectral data for the same substance obtained from different THz-TDS systems may have distinct differences because of differences in system errors and data processing methods, which leads to misclassification and errors in identification. To realize the exact and fast identification of substances, spectral standardization is the key issue. In this paper, we present detailed disposal methods and execution processes for the spectral standardization and substance identification, including feature extracting, database searching, and fingerprint spectrum matching of unknown substances. Here, we take twelve biomedical compounds including different biological materials, medicines, or dangerous chemicals as examples. These compounds were analyzed by two different THz-TDS systems, one of which is a commercial product and the other is our verification platform. The original spectra from two systems showed obvious differences in their curve shapes and amplitudes. After wavelet transform, cubic spline interpolation, and support vector machine (SVM) classification with an appropriate kernel function, the spectra from two systems can be standardized, and the recognition rate of qualitative identification can be up to 99.17%.

1. Introduction

Terahertz (THz) spectroscopy has a wide range of applications and has become an important research topic in recent years. THz radiation is an electromagnetic wave with a frequency between 0.1 and 10 THz. The photon energy of THz radiation is about 4 meV, which is one million times weaker than an X-ray photon; therefore, it will not cause harmful photoionization in biological tissues [1–3]. Additionally, the vibrational and rotational frequencies of most polar molecules are located in the THz range, which can provide specific absorption responses (fingerprint spectra) for the identification of compounds [4, 5]. Compared with other electromagnetic spectra (Raman, infrared, or ultraviolet-visible light spectroscopy, i.e., which detects specific molecules or molecular bonds [Reference]), THz spectra reflect the collective behavior of molecules and can be used to detect different substances with different molecular structures as well as polymorph and chiral substances, even those that have the same elements and molecular bonds [6].

To date, THz technology has been widely applied in the identification of many compounds and has been proven to have high recognition accuracy [7–9]. Especially in biomedical fields, different medicines, biomarkers, dangerous chemicals, and biological materials have been identified by THz technology [10–14]. Currently, various algorithms based on machine learning have be used for classification different kinds of data [15, 16], which can also be a potential tool for classifying terahertz spectra. However, for different research groups, the presented spectrum data for the same substance may not be same. For example, THz spectra of dinitrotoluene show large differences in the amplitudes and frequency domain, which are caused by variations in system errors and data processing methods for different measuring systems [17, 18]. This may cause big errors in the database built and the substance identification. Therefore, a standardization method for spectral acquisition and data processing is urgently needed.

In this paper, we propose a method of THz spectrum standardization, which can improve the recognition rate affected by spectral differences among different systems. The diagram of the spectral standardization is shown in Figure 1. After spectra are obtained from different systems, a series of calibration processes are followed to obtain standard spectra. These standard spectra can be used directly for data comparison and then substance identification. Based on this technological process, we successfully standardized the spectra of 12 biomedical substances including different biological materials, medicines, or dangerous chemicals and then utilized support vector machine (SVM) for qualitative identification, which presents a higher accuracy than that of other common methods.

2. Methods

2.1. Sample Preparation

To ensure the wide applicability of our method, we randomly selected the following compounds as representatives, including biological materials, medicines, and dangerous chemicals: 4-aminobenzoic acid, amoxicillin, phenylalanine, 4-phenyl-2-pyrrolidone-1-acetamide (A2), (R)-2-amino-1-phenylethanol (A3), a mixture of explosives containing cyclonite (C5), d-lactose monohydrate, trinitrotoluene, benzoic acid, p-toluylic acid, glutamic acid, and riboflavin. These pure samples were purchased from Sigma Aldrich and stored properly according to manufacturer’s recommendations. Each compound was taken for a certain mass and then mixed with 120 mg polyethylene (PE) powder and then pressed into 2 mm-thick tablets (ø13 mm) with 3 tons of force. The pressing time was set at 1 min, and the mass loss was controlled less than 1%. A pure polyethylene tablet was also pressed with the same parameters as a reference. During the test, the temperature was at room temperature (∼22°C) and the humidity was controlled below 1%.

2.2. THz-TDS Measurement

THz-TDS systems from the University of Shanghai for Science and Technology (USST system) and the Shanghai Gaojing Image Technology Company (Gaojing system) were used to measure the spectra of these 12 compounds, respectively. The USST system has a spectral resolution of 0.001 THz, signal-to-noise ratio (SNR) of 40000 : 1, and effective bandwidth of 0.1–4.0 THz [19, 20]. The Gaojing system has a spectral resolution of 0.001 THz, SNR of 50000 : 1, and effective bandwidth of 0.2–1.5 THz. Due to the different frequency ranges of these two systems, only data recorded between 0.2 THz and 1.5 THz can be compared. First, samples's time-domain spectra can be obtained by THz-TDS systems. Then the time-domain spectra are converted using fast Fourier transform to frequency-domain spectra. Finally, the relative absorbance of a sample α () can be calculated by using the following equation [21]:where l_sam () is the frequency-domain spectrum of the sample tablet, l_ref () is the frequency-domain spectrum of the pure polyethylene tablet, and d is the thickness of the sample.

2.3. Baseline Correction and Noise Removal

Baseline drift is generally caused by scattering from solid state samples in the THz region. The testing environment and different system errors can also lead to baseline drift and noise to different degrees. Therefore, baseline correction and noise removal is necessary for standardization of spectra. Wavelet transform has good time-frequency localization character and decorrelation, which can remove the baseline and noise with different scales and can retain the effective information in the different frequency regions of the terahertz spectrum at the same time. Here, we used the wavelet transform to correct the baseline at low frequency and recognize the signal and noise at high frequency according to multiresolution analysis [22]. The Mallat algorithm [23] was used for wavelet decomposition as follows:

The mother wavelet φ_J (u) is orthogonal to the scaling function ψ_j (u), C_J is a coefficient in the J + 1th level of the low frequency component, and d_j is a coefficient in the jth level (1 ≤ j ≤ J) of the high frequency component. With this equation, different parts and frequencies of the signal can be analyzed, respectively. Based on the mother wavelet Daubechies 9, the spectra of the 12 compounds were decomposed at level 6. The coefficient in level 6 is a low-frequency component, which represents a baseline of zero, and the coefficients in levels 1–5 are high-frequency components that contain both noise and useful information. We dropped the high-frequency components from levels 1–3, which represented noise, and we retained the low-frequency components from levels 4-5, which represented useful information.

2.4. Sampling Interval

Due to the different sampling intervals between the USST system (0.009 THz) and the Gaojing system (0.01 THz), the data points’ number recorded between 0.2 and 1.5 THz by these two systems was 143 (the USST system) and 130 (the Gaojing system). Different data points’ number could lead to errors in the recognition program as characteristic dimensions of input data are different. Cubic spline curve interpolation [24] was applied here to unify the data points’ number. The final data points’ number was adjusted to the larger one of these two THz-TDS systems (143 data points of the USST system).

2.5. SVM Classification

An SVM classifier was constructed by using a nonlinear mapping function to map the training data set onto a higher-dimensional space and then solve the linearly inseparable problem in a linear kernel space. In this study, we used a “one-against-one” approach for multiclass classification [25]. For t classes, we constructed t (t − 1)/2 binary classifiers to take into account all combinations of pairs of classes. A given training set of data was used to establish the SVM classifier, which can be described by (X_i, Y_i), i = 1, 2, 3, … , N, X_i ∈ R_n, where Y_i is an output label of the input absorption spectral data X_i. The objective function of the SVM is defined as follows [26]:where c is the penalty parameter, φ (·) is a kernel function, and and b are the SVM weight and bias parameters, respectively. The penalty parameter c controls the trade-off between the margin maximization and error minimization, and the kernel function is used in establishing the SVM classification model. Three common kernel functions are described by the following equations [27]:

In SVM classification, it is important to select an appropriate kernel function according to the characteristics of the sample. For example, if the spectral data of the samples are independent with sufficient unique characteristics, in other words, if they are linearly separable, then the linear kernel will be the most suitable kernel. In addition, it is the simplest and fastest. Otherwise, a nonlinear kernel function, such as the RBF kernel or polynomial kernel, will be needed to map the spectral data to higher dimensions. A process with one of these kernel functions will be longer and will require more optimization parameters, and overfitting may reduce the identification accuracy.

3. Results

3.1. Original Spectral Data Analysis

The spectra of the samples measured by the USST system and the Gaojing system are shown in Figure 2. It can be seen clearly that the spectra obtained from the two systems showed significant differences in their curve shapes and amplitudes, especially for amoxicillin, phenylalanine, p-toluylic acid, trinitrotoluene, and riboflavin. The main reason is that the USST system just gives out the original time-domain waveform, without any data processing, while the Gaojing system provides the smoothed data after a series of data preprocessing, whose details are unknown (commercial protection). Additionally, the system parameter settings, environmental conditions, and data calculation methods will cause spectral differences. Therefore, it was hard to compare spectra recorded by one system with those recorded by the other systems directly.

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)

(l)

Figure 2

THz spectra of 12 compounds measured by two THz-TDS systems. Black solid lines show spectra obtained from the USST system (left y-axis)—original time-domain spectral data without any data processing; red dotted lines show spectra obtained from the Gaojing system (right y-axis)—smoothed data after information privacy processing. (a) 4-Aminobenzoic acid. (b) A2. (c) A3. (d) Amoxicillin. (e) Phenylalanine. (f) Benzoic acid. (g) C5. (h) d-Lactose monohydrate. (i) p-Toluylic acid. (j) Glutamic acid. (k) Trinitrotoluene. (l) Riboflavin.

3.2. Data Standardization

Upon the obvious differences in the spectral shape and amplitude of two THz-TDS systems, data standardization is necessary to eliminate noise, baselines, and instability of different systems. The results of data standardization between the two systems are shown in Figure 3. The data processing procedure had two steps:(1)The elimination of baseline drift and noise by using the wavelet transform. It can be seen that, for the USST system, the peaks from the original spectra (Figure 2, black solid lines) became sharper and their baselines were eliminated after this step (Figure 3, blue solid lines), while for the Gaojing system, as the spectra had already been preprocessed by the system, it is normal that the spectra nearly had no difference (red dotted lines in Figures 2 and 3). These processing steps made the absorption peaks, curve shapes, and amplitudes in the spectra from the two systems very similar to each other.(2)The standardization of the data points’ number. Cubic spline interpolation was used to add the number of the Gaojing spectral data points to 143 with a sampling interval of 0.009 THz without changing the shapes of the spectra. Otherwise, it will cause errors in the algorithm for the mismatch of spectral data dimensions.

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)

(l)

3.3. Compound Identification

All standardized spectral data from the 12 compounds were divided into two sets based on their sources. The processed dataset from the USST system was used to optimize the parameters and build the SVM classifier model, while the processed data from the Gaojing system was used for classification of each compound. Our model, which includes wavelet transform, cubic spline interpolation, and SVM, has the advantage of low demand for sample numbers and easier to obtain globally optimal solutions. Additionally, several kernel functions, such as the linear kernel function, polynomial kernel function, and radial basis function (RBF) kernel, are commonly used to construct the SVM classifier. The choice of kernel function should be based on the characteristics of the sample data. In our study, the absorption spectra from the 0.2–1.5 THz range were selected, and these spectra contain sufficient characteristic information for linear separation of the 12 compounds. Consequently, we first evaluated the linear kernel function for compound identification. The open source software LIBSVM [28] was used in this study, and the identification accuracy reached 99.17%. Furthermore, it only took a short time without searching for optimum kernel parameters, which proved the validity of the linear kernel function for these spectral data.

4. Discussion

Among the kernel functions, the RBF kernel is the most commonly used in SVMs. Therefore, the RBF kernel was also evaluated in this study, and the corresponding penalty parameter c and kernel function parameter were optimized by the grid search method [29]. The final identification accuracy was 95.00% as shown in Table 1. The only misclassification was for phenylalanine. This could be attributed to the overfitting from the RBF kernel function and the proximity of the absorption peak of phenylalanine (1.27 THz) to that of glutamic acid (1.25 THz). The proximity of these peaks was more problematic as these compounds only have one absorption peak in the 0.2–1.5 THz range. As a result, six phenylalanine samples were misclassified as glutamic acid (Figure 4), which reduced the overall accuracy.

Considering that the absorption peaks in the spectra reflect the vibration/rotation of molecules' functional groups, therefore, some same functional groups may cause the similar absorption peaks in the spectra which will cause error in identification. This can be reduced by searching for more absorption peaks over a wider THz range and improving the SNR of the spectra. In the present study, the effective bandwidth of the Gaojing system was 0.2–1.5 THz, and both phenylalanine and glutamic acid have only one absorption peak in this range. Therefore, we could not improve the accuracy by extending the bandwidth of the system, so the selection of an appropriate kernel function for the SVM classifier was crucial for optimization of our method.

For comparison, correlation coefficients and BP neural networks, which are widely used as spectroscopy identification methods, were also used here [30, 31]. The performances of these models were evaluated based on their accuracy for classification of the test dataset (Table 2). It is proved that our model was much more accurate than the correlation coefficient and BP neural network models for compound identification. This could be attributed to the high sensitivity of our model for characteristics of the sample, small number of samples required, and self-regulation in finding minima.

Based on all our results, we can conclude that the errors between different THz-TDS systems are mainly affected by four factors:(1)The SNR of the system: since the noise in the system interferes with the absorption peaks, improving the SNR can effectively reduce the identification error.(2)Data standardization: because different system errors and data preprocessing can create large differences in curve shapes and amplitudes of spectra, standardization of data from different THz-TDS systems is required for accurate identification. In our study, this was achieved by using the wavelet transform and cubic spline interpolation. But other methods such as digital filtering and the window function could also be applied to the standardization.(3)Kernel functions: in SVM classification, the appropriate kernel function should be selected based on the characteristics of sample data. A suitable kernel function can greatly reduce the computational complexity and improve the classification accuracy.(4)Identification methods: SVM can effectively achieve high generalization performance with a small number of samples, which improves the classification accuracy compared with other identification algorithms. However, the kernel function cannot be selected automatically, and then the user’s experience in this area determines the final outcome.

5. Conclusion

In our paper, we proposed an approach of spectra standardization for different THz-TDS systems and then realized the qualitative identification using these standard spectra. The model contains wavelet transform, cubic spline interpolation, and SVM. With this model, we realized the qualitative identification of 12 biomedical compounds with 100% recognition rate. These results show the importance of spectra standardization for different THz-TDS systems and the accuracy of these standard spectra. Based on these results, our method is demonstrated to be a potential tool for the identification of different substances in the biomedical field. However, for actual usage, the sample will contain far more compositions. In this case, the accuracy will be dropped. Therefore, further optimization is needed to improve the model’s performance. For example, the input spectra can be preprocessed to extract the information related to target substances, which can reduce the influence of other substances.

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

CJS and JZ contributed equally to this study and should be considered as co-first authors. YP planned and designed the experiments. CJS and MQX performed the experiments and analyzed the data. CJS and JZ wrote the manuscript. YP and XW supervised the project. All authors reviewed and edited the manuscript.

Acknowledgments

This study was financially supported by the Major National Development Project of Scientific Instrument and Equipment (2017YFF0106300), NSFC (61771314 and 61722111), Shanghai Rising-Star Program (17QA1402500), Shuguang Program (17SG45), and Shanghai Youth Talent Support Program and Young Yangtse River Scholar (Q2016212).

References

P. H. Siegel, “Terahertz technology,” IEEE Transactions on Microwave Theory and Techniques, vol. 50, no. 3, pp. 910–928, 2002.
View at: Publisher Site | Google Scholar
Y. Peng, C. Shi, Y. Zhu, and M. Gu, “Terahertz spectroscopy in biomedical field: a review on signal-to-noise ratio improvement,” PhotoniX, vol. 1, pp. 1–18, 2020.
View at: Publisher Site | Google Scholar
P. U. Jepsen, D. G. Cooke, and M. Koch, “Terahertz spectroscopy and imaging - modern techniques and applications,” Laser & Photonics Reviews, vol. 6, no. 3, p. 418, 2012.
View at: Publisher Site | Google Scholar
H. Ge, Y. Jiang, Z. Xu, F. Lian, Y. Zhang, and S. Xia, “Identification of wheat quality using THz spectrum,” Optics Express, vol. 22, no. 10, pp. 12533–12544, 2014.
View at: Publisher Site | Google Scholar
D. Brigada and X.-C. Zhang, “Chemical identification with information-weighted terahertz spectrometry,” IEEE Transactions on Terahertz Science and Technology, vol. 2, no. 1, pp. 107–112, 2012.
View at: Publisher Site | Google Scholar
J. El Haddad, B. Bousquet, L. Canioni, and P. Mounaix, “Review in terahertz spectral analysis,” TrAC Trends in Analytical Chemistry, vol. 44, pp. 98–105, 2013.
View at: Publisher Site | Google Scholar
T. Kurabayashi, H. Shuhama, S. Yodokawa, and S. Kosaka, “Identification of cellulosic fibers and determination of their blend ratio using terahertz spectroscopy,” Infrared Physics & Technology, vol. 80, pp. 153–157, 2017.
View at: Publisher Site | Google Scholar
J. Shen, “Identification of THz absorption spectra of chemicals using neural networks,” Proceedings of SPIE-The International Society for Optical Engineering, vol. 6695, pp. 47–50, 2007.
View at: Google Scholar
W. Xie, J. Li, and J. Pei, “THz-TDS Signal analysis and substance identification,” Journal of Physics Conference Series, vol. 276, no. 1, 2011.
View at: Publisher Site | Google Scholar
X. Wu, Y. Dai, L. Wang et al., “Diagnosis of methylglyoxal in blood by using far-infrared spectroscopy and o-phenylenediamine derivation,” Biomedical Optics Express, vol. 11, no. 2, pp. 963–970, 2020.
View at: Publisher Site | Google Scholar
K. Choi, T. Hong, K. I. Sim et al., “Reflection terahertz time-domain spectroscopy of RDX and HMX explosives,” Journal of Applied Physics, vol. 115, no. 2, Article ID 023105, 2014.
View at: Publisher Site | Google Scholar
Z. Geng, X. Zhang, and J. Liu, “A route to terahertz metamaterial biosensor integrated with microfluidics for liver cancer biomarker testing in early stage,” Scientific Reports, vol. 7, no. 1, pp. 1–11, 2017.
View at: Publisher Site | Google Scholar
Y. Ueno, R. Rungsawang, I. Tomita, and K. Ajito, “Quantitative measurements of amino acids by terahertz time-domain transmission spectroscopy,” Analytical Chemistry, vol. 78, no. 15, pp. 5424–5428, 2006.
View at: Publisher Site | Google Scholar
L. Wang, X. Wu, Y. Peng et al., “Quantitative analysis of homocysteine in liquid by terahertz spectroscopy,” Biomedical Optics Express, vol. 11, no. 5, pp. 2570–2577, 2020.
View at: Publisher Site | Google Scholar
W. Bao, Z. Huang, C. A. Yuan, and D. S. Huang, “Pupylation sites prediction with ensemble classification model,” International Journal of Data Mining and Bioinformatics, vol. 18, no. 2, pp. 91–104, 2017.
View at: Publisher Site | Google Scholar
W. Bao, D. Wang, and Y. Chen, “Classification of protein structure classes on flexible neutral tree,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 14, no. 5, pp. 1122–1133, 2016.
View at: Google Scholar
G.-F. Liu, X.-J. Ma, S.-H. Ma et al., “Identification of explosives 2, 4-DNT and 2, 6-DNT using terahertz time-domain spectroscopy,” Chinese Journal of Chemistry, vol. 26, no. 7, pp. 1257–1261, 2008.
View at: Publisher Site | Google Scholar
J. Chen, Y. Chen, H. Zhao, G. J. Bastiaans, and X.-C. Zhang, “Absorption coefficients of selected explosives and related compounds in the range of 0.1–2.8 THz,” Optics Express, vol. 15, no. 19, pp. 12060–12067, 2007.
View at: Publisher Site | Google Scholar
W. Chen et al., “Isomers Identification of 2-hydroxyglutarate acid disodium salt (2HG) by terahertz time-domain spectroscopy,” Scientific Reports, vol. 7, Article ID 12166, 2017.
View at: Publisher Site | Google Scholar
Y. Peng, X. Yuan, X. Zou et al., “Terahertz identification and quantification of neurotransmitter and neurotrophy mixture,” Biomedical Optics Express, vol. 7, no. 11, pp. 4472–4479, 2016.
View at: Publisher Site | Google Scholar
A. G. Markelz, A. Roitberg, and E. J. Heilweil, “Pulsed terahertz spectroscopy of DNA, bovine serum albumin and collagen between 0.1 and 2.0 THz,” Chemical Physics Letters, vol. 320, no. 1-2, pp. 42–48, 2000.
View at: Publisher Site | Google Scholar
L. Xu, D. Zhang, and K. Wang, “Wavelet-based cascaded adaptive filter for removing baseline drift in pulse waveforms,” IEEE Transactions on Biomedical Engineering, vol. 52, no. 11, pp. 1973–1975, 2005.
View at: Publisher Site | Google Scholar
E. Bayro-Corrochano, “Multi-resolution image analysis using the quaternion wavelet transform,” Numerical Algorithms, vol. 39, no. 1–3, pp. 35–55, 2005.
View at: Publisher Site | Google Scholar
J. A. Gregory and M. Sarfraz, “A rational cubic spline with tension,” Computer Aided Geometric Design, vol. 7, no. 1–4, pp. 1–13, 1990.
View at: Publisher Site | Google Scholar
S. Knerr, L. Personnaz, and G. Dreyfus, Single-layer Learning Revisited: A Stepwise Procedure for Building and Training a Neural Network, Neurocomputing, Springer, Berlin, Germany, 1990.
R. E. Fan, P. H. Chen, and C. J. Lin, “Working set selection using second order information for training support vector machines,” Journal of Machine Learning Research, vol. 16, pp. 1889–1918, 2005.
View at: Google Scholar
L. Devroye, L. Gyorf, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, Berlin, Germany, 1997.
C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
View at: Publisher Site | Google Scholar
G. H. Feng, “Parameter optimizing for support vector machines classification,” Computer Applications in Engineering Education, vol. 47, pp. 123–126, 2006.
View at: Google Scholar
P. Sidauruk, A. H.-D. Cheng, and D. Ouazar, “Ground water contaminant source and transport parameter identification by correlation coefficient optimization,” Ground Water, vol. 36, no. 2, pp. 208–214, 1998.
View at: Publisher Site | Google Scholar
E. Marengo, M. Bobba, E. Robotti, and M. Lenti, “Hydroxyl and acid number prediction in polyester resins by near infrared spectroscopy and artificial neural networks,” Analytica Chimica Acta, vol. 511, no. 2, pp. 313–322, 2004.
View at: Publisher Site | Google Scholar

Copyright

Copyright © 2020 Chenjun Shi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

PDF Download Citation

Download other formats

Order printed copies

Views

856

Downloads

827

Citations