Abstract

Terahertz time-domain spectroscopy (THz-TDS) systems are widely used to obtain fingerprint spectra of many different biomedical substances, and thus the identification of different biological materials, medicines, or dangerous chemicals can be realized. However, the spectral data for the same substance obtained from different THz-TDS systems may have distinct differences because of differences in system errors and data processing methods, which leads to misclassification and errors in identification. To realize the exact and fast identification of substances, spectral standardization is the key issue. In this paper, we present detailed disposal methods and execution processes for the spectral standardization and substance identification, including feature extracting, database searching, and fingerprint spectrum matching of unknown substances. Here, we take twelve biomedical compounds including different biological materials, medicines, or dangerous chemicals as examples. These compounds were analyzed by two different THz-TDS systems, one of which is a commercial product and the other is our verification platform. The original spectra from two systems showed obvious differences in their curve shapes and amplitudes. After wavelet transform, cubic spline interpolation, and support vector machine (SVM) classification with an appropriate kernel function, the spectra from two systems can be standardized, and the recognition rate of qualitative identification can be up to 99.17%.

1. Introduction

Terahertz (THz) spectroscopy has a wide range of applications and has become an important research topic in recent years. THz radiation is an electromagnetic wave with a frequency between 0.1 and 10 THz. The photon energy of THz radiation is about 4 meV, which is one million times weaker than an X-ray photon; therefore, it will not cause harmful photoionization in biological tissues [13]. Additionally, the vibrational and rotational frequencies of most polar molecules are located in the THz range, which can provide specific absorption responses (fingerprint spectra) for the identification of compounds [4, 5]. Compared with other electromagnetic spectra (Raman, infrared, or ultraviolet-visible light spectroscopy, i.e., which detects specific molecules or molecular bonds [Reference]), THz spectra reflect the collective behavior of molecules and can be used to detect different substances with different molecular structures as well as polymorph and chiral substances, even those that have the same elements and molecular bonds [6].

To date, THz technology has been widely applied in the identification of many compounds and has been proven to have high recognition accuracy [79]. Especially in biomedical fields, different medicines, biomarkers, dangerous chemicals, and biological materials have been identified by THz technology [1014]. Currently, various algorithms based on machine learning have be used for classification different kinds of data [15, 16], which can also be a potential tool for classifying terahertz spectra. However, for different research groups, the presented spectrum data for the same substance may not be same. For example, THz spectra of dinitrotoluene show large differences in the amplitudes and frequency domain, which are caused by variations in system errors and data processing methods for different measuring systems [17, 18]. This may cause big errors in the database built and the substance identification. Therefore, a standardization method for spectral acquisition and data processing is urgently needed.

In this paper, we propose a method of THz spectrum standardization, which can improve the recognition rate affected by spectral differences among different systems. The diagram of the spectral standardization is shown in Figure 1. After spectra are obtained from different systems, a series of calibration processes are followed to obtain standard spectra. These standard spectra can be used directly for data comparison and then substance identification. Based on this technological process, we successfully standardized the spectra of 12 biomedical substances including different biological materials, medicines, or dangerous chemicals and then utilized support vector machine (SVM) for qualitative identification, which presents a higher accuracy than that of other common methods.

2. Methods

2.1. Sample Preparation

To ensure the wide applicability of our method, we randomly selected the following compounds as representatives, including biological materials, medicines, and dangerous chemicals: 4-aminobenzoic acid, amoxicillin, phenylalanine, 4-phenyl-2-pyrrolidone-1-acetamide (A2), (R)-2-amino-1-phenylethanol (A3), a mixture of explosives containing cyclonite (C5), d-lactose monohydrate, trinitrotoluene, benzoic acid, p-toluylic acid, glutamic acid, and riboflavin. These pure samples were purchased from Sigma Aldrich and stored properly according to manufacturer’s recommendations. Each compound was taken for a certain mass and then mixed with 120 mg polyethylene (PE) powder and then pressed into 2 mm-thick tablets (ø13 mm) with 3 tons of force. The pressing time was set at 1 min, and the mass loss was controlled less than 1%. A pure polyethylene tablet was also pressed with the same parameters as a reference. During the test, the temperature was at room temperature (∼22°C) and the humidity was controlled below 1%.

2.2. THz-TDS Measurement

THz-TDS systems from the University of Shanghai for Science and Technology (USST system) and the Shanghai Gaojing Image Technology Company (Gaojing system) were used to measure the spectra of these 12 compounds, respectively. The USST system has a spectral resolution of 0.001 THz, signal-to-noise ratio (SNR) of 40000 : 1, and effective bandwidth of 0.1–4.0 THz [19, 20]. The Gaojing system has a spectral resolution of 0.001 THz, SNR of 50000 : 1, and effective bandwidth of 0.2–1.5 THz. Due to the different frequency ranges of these two systems, only data recorded between 0.2 THz and 1.5 THz can be compared. First, samples's time-domain spectra can be obtained by THz-TDS systems. Then the time-domain spectra are converted using fast Fourier transform to frequency-domain spectra. Finally, the relative absorbance of a sample α () can be calculated by using the following equation [21]:where lsam () is the frequency-domain spectrum of the sample tablet, lref () is the frequency-domain spectrum of the pure polyethylene tablet, and d is the thickness of the sample.

2.3. Baseline Correction and Noise Removal

Baseline drift is generally caused by scattering from solid state samples in the THz region. The testing environment and different system errors can also lead to baseline drift and noise to different degrees. Therefore, baseline correction and noise removal is necessary for standardization of spectra. Wavelet transform has good time-frequency localization character and decorrelation, which can remove the baseline and noise with different scales and can retain the effective information in the different frequency regions of the terahertz spectrum at the same time. Here, we used the wavelet transform to correct the baseline at low frequency and recognize the signal and noise at high frequency according to multiresolution analysis [22]. The Mallat algorithm [23] was used for wavelet decomposition as follows:

The mother wavelet φJ (u) is orthogonal to the scaling function ψj (u), CJ is a coefficient in the J + 1th level of the low frequency component, and dj is a coefficient in the jth level (1 ≤ j ≤ J) of the high frequency component. With this equation, different parts and frequencies of the signal can be analyzed, respectively. Based on the mother wavelet Daubechies 9, the spectra of the 12 compounds were decomposed at level 6. The coefficient in level 6 is a low-frequency component, which represents a baseline of zero, and the coefficients in levels 1–5 are high-frequency components that contain both noise and useful information. We dropped the high-frequency components from levels 1–3, which represented noise, and we retained the low-frequency components from levels 4-5, which represented useful information.

2.4. Sampling Interval

Due to the different sampling intervals between the USST system (0.009 THz) and the Gaojing system (0.01 THz), the data points’ number recorded between 0.2 and 1.5 THz by these two systems was 143 (the USST system) and 130 (the Gaojing system). Different data points’ number could lead to errors in the recognition program as characteristic dimensions of input data are different. Cubic spline curve interpolation [24] was applied here to unify the data points’ number. The final data points’ number was adjusted to the larger one of these two THz-TDS systems (143 data points of the USST system).

2.5. SVM Classification

An SVM classifier was constructed by using a nonlinear mapping function to map the training data set onto a higher-dimensional space and then solve the linearly inseparable problem in a linear kernel space. In this study, we used a “one-against-one” approach for multiclass classification [25]. For t classes, we constructed t (t − 1)/2 binary classifiers to take into account all combinations of pairs of classes. A given training set of data was used to establish the SVM classifier, which can be described by (Xi, Yi), i = 1, 2, 3, … , N, Xi ∈ Rn, where Yi is an output label of the input absorption spectral data Xi. The objective function of the SVM is defined as follows [26]:where c is the penalty parameter, φ (·) is a kernel function, and and b are the SVM weight and bias parameters, respectively. The penalty parameter c controls the trade-off between the margin maximization and error minimization, and the kernel function is used in establishing the SVM classification model. Three common kernel functions are described by the following equations [27]:

In SVM classification, it is important to select an appropriate kernel function according to the characteristics of the sample. For example, if the spectral data of the samples are independent with sufficient unique characteristics, in other words, if they are linearly separable, then the linear kernel will be the most suitable kernel. In addition, it is the simplest and fastest. Otherwise, a nonlinear kernel function, such as the RBF kernel or polynomial kernel, will be needed to map the spectral data to higher dimensions. A process with one of these kernel functions will be longer and will require more optimization parameters, and overfitting may reduce the identification accuracy.

3. Results

3.1. Original Spectral Data Analysis

The spectra of the samples measured by the USST system and the Gaojing system are shown in Figure 2. It can be seen clearly that the spectra obtained from the two systems showed significant differences in their curve shapes and amplitudes, especially for amoxicillin, phenylalanine, p-toluylic acid, trinitrotoluene, and riboflavin. The main reason is that the USST system just gives out the original time-domain waveform, without any data processing, while the Gaojing system provides the smoothed data after a series of data preprocessing, whose details are unknown (commercial protection). Additionally, the system parameter settings, environmental conditions, and data calculation methods will cause spectral differences. Therefore, it was hard to compare spectra recorded by one system with those recorded by the other systems directly.

3.2. Data Standardization

Upon the obvious differences in the spectral shape and amplitude of two THz-TDS systems, data standardization is necessary to eliminate noise, baselines, and instability of different systems. The results of data standardization between the two systems are shown in Figure 3. The data processing procedure had two steps:(1)The elimination of baseline drift and noise by using the wavelet transform. It can be seen that, for the USST system, the peaks from the original spectra (Figure 2, black solid lines) became sharper and their baselines were eliminated after this step (Figure 3, blue solid lines), while for the Gaojing system, as the spectra had already been preprocessed by the system, it is normal that the spectra nearly had no difference (red dotted lines in Figures 2 and 3). These processing steps made the absorption peaks, curve shapes, and amplitudes in the spectra from the two systems very similar to each other.(2)The standardization of the data points’ number. Cubic spline interpolation was used to add the number of the Gaojing spectral data points to 143 with a sampling interval of 0.009 THz without changing the shapes of the spectra. Otherwise, it will cause errors in the algorithm for the mismatch of spectral data dimensions.

3.3. Compound Identification

All standardized spectral data from the 12 compounds were divided into two sets based on their sources. The processed dataset from the USST system was used to optimize the parameters and build the SVM classifier model, while the processed data from the Gaojing system was used for classification of each compound. Our model, which includes wavelet transform, cubic spline interpolation, and SVM, has the advantage of low demand for sample numbers and easier to obtain globally optimal solutions. Additionally, several kernel functions, such as the linear kernel function, polynomial kernel function, and radial basis function (RBF) kernel, are commonly used to construct the SVM classifier. The choice of kernel function should be based on the characteristics of the sample data. In our study, the absorption spectra from the 0.2–1.5 THz range were selected, and these spectra contain sufficient characteristic information for linear separation of the 12 compounds. Consequently, we first evaluated the linear kernel function for compound identification. The open source software LIBSVM [28] was used in this study, and the identification accuracy reached 99.17%. Furthermore, it only took a short time without searching for optimum kernel parameters, which proved the validity of the linear kernel function for these spectral data.

4. Discussion

Among the kernel functions, the RBF kernel is the most commonly used in SVMs. Therefore, the RBF kernel was also evaluated in this study, and the corresponding penalty parameter c and kernel function parameter were optimized by the grid search method [29]. The final identification accuracy was 95.00% as shown in Table 1. The only misclassification was for phenylalanine. This could be attributed to the overfitting from the RBF kernel function and the proximity of the absorption peak of phenylalanine (1.27 THz) to that of glutamic acid (1.25 THz). The proximity of these peaks was more problematic as these compounds only have one absorption peak in the 0.2–1.5 THz range. As a result, six phenylalanine samples were misclassified as glutamic acid (Figure 4), which reduced the overall accuracy.

Considering that the absorption peaks in the spectra reflect the vibration/rotation of molecules' functional groups, therefore, some same functional groups may cause the similar absorption peaks in the spectra which will cause error in identification. This can be reduced by searching for more absorption peaks over a wider THz range and improving the SNR of the spectra. In the present study, the effective bandwidth of the Gaojing system was 0.2–1.5 THz, and both phenylalanine and glutamic acid have only one absorption peak in this range. Therefore, we could not improve the accuracy by extending the bandwidth of the system, so the selection of an appropriate kernel function for the SVM classifier was crucial for optimization of our method.

For comparison, correlation coefficients and BP neural networks, which are widely used as spectroscopy identification methods, were also used here [30, 31]. The performances of these models were evaluated based on their accuracy for classification of the test dataset (Table 2). It is proved that our model was much more accurate than the correlation coefficient and BP neural network models for compound identification. This could be attributed to the high sensitivity of our model for characteristics of the sample, small number of samples required, and self-regulation in finding minima.

Based on all our results, we can conclude that the errors between different THz-TDS systems are mainly affected by four factors:(1)The SNR of the system: since the noise in the system interferes with the absorption peaks, improving the SNR can effectively reduce the identification error.(2)Data standardization: because different system errors and data preprocessing can create large differences in curve shapes and amplitudes of spectra, standardization of data from different THz-TDS systems is required for accurate identification. In our study, this was achieved by using the wavelet transform and cubic spline interpolation. But other methods such as digital filtering and the window function could also be applied to the standardization.(3)Kernel functions: in SVM classification, the appropriate kernel function should be selected based on the characteristics of sample data. A suitable kernel function can greatly reduce the computational complexity and improve the classification accuracy.(4)Identification methods: SVM can effectively achieve high generalization performance with a small number of samples, which improves the classification accuracy compared with other identification algorithms. However, the kernel function cannot be selected automatically, and then the user’s experience in this area determines the final outcome.

5. Conclusion

In our paper, we proposed an approach of spectra standardization for different THz-TDS systems and then realized the qualitative identification using these standard spectra. The model contains wavelet transform, cubic spline interpolation, and SVM. With this model, we realized the qualitative identification of 12 biomedical compounds with 100% recognition rate. These results show the importance of spectra standardization for different THz-TDS systems and the accuracy of these standard spectra. Based on these results, our method is demonstrated to be a potential tool for the identification of different substances in the biomedical field. However, for actual usage, the sample will contain far more compositions. In this case, the accuracy will be dropped. Therefore, further optimization is needed to improve the model’s performance. For example, the input spectra can be preprocessed to extract the information related to target substances, which can reduce the influence of other substances.

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

CJS and JZ contributed equally to this study and should be considered as co-first authors. YP planned and designed the experiments. CJS and MQX performed the experiments and analyzed the data. CJS and JZ wrote the manuscript. YP and XW supervised the project. All authors reviewed and edited the manuscript.

Acknowledgments

This study was financially supported by the Major National Development Project of Scientific Instrument and Equipment (2017YFF0106300), NSFC (61771314 and 61722111), Shanghai Rising-Star Program (17QA1402500), Shuguang Program (17SG45), and Shanghai Youth Talent Support Program and Young Yangtse River Scholar (Q2016212).