Abstract

The speech signal is rich in features that can be used for biometric recognition and for other applications such as gender and emotion recognition. Channel conditions, manifested as background noise and reverberation, are the main challenges, causing feature shifts between the test and training data. In this paper, a hybrid speaker identification model is developed for consistent speech features and high recognition accuracy. Features based on Mel frequency spectrum coefficients (MFCC) are improved by incorporating a pitch frequency coefficient obtained from time domain analysis of the speech. In order to enhance noise immunity, we propose a single hidden layer feed-forward neural network (FFNN) tuned by an optimized particle swarm optimization (OPSO) algorithm. The proposed model is tested using 10-fold cross-validation over different levels of additive white Gaussian noise (AWGN) (0-50 dB). A recognition accuracy of 97.83% is obtained from the proposed model in clean voice environments. Moreover, the noisy channel has a smaller impact on the proposed model than on baseline classifiers such as the plain FFNN, random forest (RF), k-nearest neighbour (KNN), and support vector machine (SVM).

1. Introduction

Voice is the oldest method of communication reported in human history. It was driven by the fact that humans continuously need to share their feelings and requirements in order to survive. Voice communication is favoured because an enormous amount of information can be exchanged, which makes it the optimal medium of communication over possible alternatives such as writing and even modern-day facilities such as electronic texting [1]. The concept of language arose from the diversity of human descent across various geographical areas; languages vary with the surroundings and the nature of the places humans inhabit. On this basis, different speaking tongues and accents are recognized in today's world [2].

Speaker recognition systems are implemented as an electronic solution for supporting security and privacy enforcement systems. They are preferred by service providers for securing personal data and for preventing autonomous attacks. A voice recognition system is based on the fact that the voice generation system is uniquely structured in every human [3]. The vocal tract is the main contributor in the voice creation process: air blows over a set of vocal cords, making them vibrate, and this vibration produces the tone of the voice. The voice then flows through the vocal tract, propagating through the throat and mouth. The speech tone is directly affected by the shape of the vocal tract and by the structures inside the mouth, such as the teeth [4].

A speech system faces various challenges that are vital to the process of modelling the speaker. The main obstacle of a speaker identification system is the random nature of voice signals. These signals are characterized by their randomness, which can be observed in the fluctuation of their electrical properties over time. The spectral information of a speech signal varies over time, so it is difficult to rely on frequency information alone for modelling the vocal tract [5, 6]. The speaker recognition process involves two different modes of processing. The first is text-dependent speaker identification, which depends on the exact (spoken) voice imprint in both testing and training. Text-dependent speaker recognition can be implemented using time domain analysis; its drawback is that a complete match between the test and training data is required, which is practically impossible [7, 8]. In text-independent speaker identification, on the other hand, speakers are recognized based on frequency analysis of their signals, commonly using frequency domain tools such as the Fourier transform. The main drawback of this method is its inconsistency with the practical nature of voice: since the voice signal is time variant, its frequency (spectrum) information changes over time [9]. Given the above scenarios, traditional models seem unable to accommodate the varying nature of voice signals. Nevertheless, traditional models can use the Fourier transform as an essential method to analyse the frequency content, while other methods such as zero-crossing, convolution, and correlation are commonly used in the time domain to analyse the speech signal.
Conventional voice recognition models that depend on the aforementioned approaches are not consistent enough to overcome the time-varying nature of voice signals [10]. Among the popular acoustic feature extraction methods, LPCC is used in [11] for modelling the speech production process; LPCC produces linear prediction coefficients that are sensitive to spectral smoothness and spectral bias. In [12, 13], LPCC features are fused with MFCC features in an effort to improve the latter while implementing a Gaussian mixture universal background model (GMM-UBM). To improve the MFCC features, the Hamming window has been replaced by multiple windows in order to achieve smoother spectrum results. In [14], bottleneck features are extracted from speech signals using a deep neural network and concatenated with MFCC features to improve speaker identification. Feature selection from the fused feature sets is performed using kernel-based learning (e.g., the support vector machine (SVM)) and the reduced features of the SR model [15]. Another approach is illustrated in [16] for SR performance enhancement, using speech data from different channels to construct the acoustic features. Dimensionality reduction techniques are proposed in [17] by adopting a PCA algorithm. Computational reduction is a key route to performance enhancement; one efficient approach is frame rate reduction of the speech signal [18]. Dimensionality reduction is performed in [19] by manipulating the speaker or utterance layer, reducing the channel noise impact (channel scoring) [20]. Machine learning algorithms, namely, naïve Bayes (NB), the support vector machine (SVM), and k-nearest neighbours (KNN), are commonly presented as classification algorithms for prediction [21, 22].

In this paper, we develop a smart voice recognition system using a deep learning approach for predicting speakers. The deep learning model performs the recognition tasks after being trained with the features of the speech signal. A feed-forward neural network is used in an optimized version to serve the required recognition purpose.

2. Voice Processor

Preprocessing of a voice signal refers to all the changes applied to the signal before it is actually passed to the analyser. Preprocessing is preceded by sampling, where the signal is converted to a set of samples for efficient analysis. Since the speaker recognition system may deal with a large number of speakers, data set preparation is an important step in preprocessing [23]. The following points are noteworthy and are covered during preprocessing. The data set involves 250 voice clips recorded from many speakers, and these clips need to be ordered and named in numerical or alphabetical form so that they can be fed into the processing system easily. An index that enlists all the speech signals' names is associated with the data set; if it is not available, it needs to be created. Such an index can be formed as a list of character strings. If the index is available by default along with the data set, index verification should be performed to match the index against the voice clips in the database, since in many cases the index may miss some voice clips, which will create an error in further processing. Figure 1 depicts the process of data set preprocessing.
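To make this concrete, the following is a minimal sketch of the index creation and verification step, assuming the clips are stored as WAV files in a single directory; the names `clips/` and `index.txt` are illustrative, not from the paper.

```python
import os

CLIP_DIR = "clips"        # hypothetical directory holding the 250 voice clips
INDEX_FILE = "index.txt"  # hypothetical index file, one clip name per line

def build_index(clip_dir):
    """Collect clip file names in sorted (alphabetical) order."""
    return sorted(f for f in os.listdir(clip_dir) if f.endswith(".wav"))

def verify_index(index_file, clip_dir):
    """Report index entries whose audio file is missing from the data set."""
    with open(index_file) as fh:
        listed = [line.strip() for line in fh if line.strip()]
    present = set(os.listdir(clip_dir))
    return [name for name in listed if name not in present]

if __name__ == "__main__":
    if not os.path.exists(INDEX_FILE):
        with open(INDEX_FILE, "w") as fh:
            fh.write("\n".join(build_index(CLIP_DIR)))
    print("missing clips:", verify_index(INDEX_FILE, CLIP_DIR))
```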

3. Hybrid Speech Features

3.1. Time Domain

The fundamental frequency is one of the most interesting features of a speech signal; it can be extracted in the time domain using the cross-correlation approach. The aim of this feature is to identify the fundamental frequency of the speech signal [24]. The fundamental frequency is also called the pitch frequency and is calculated from the pitch period. This period is read off the cross-correlation signal and represents the time between the minimum local maxima and the maximum local maxima on the signal corpus. Assuming that the sampled speech signal is represented by $x(n)$, let $x(n+\tau)$ be a time-shifted copy of the same signal. The cross-correlation $R_{xx}(\tau)$ can be given in

$$R_{xx}(\tau) = \sum_{n} x(n)\, x(n+\tau). \quad (1)$$

Figures 2 and 3 depict the result of cross-correlating a speech signal with a copy of itself shifted in samples (phase). The next step is to evaluate the peaks of the resulting signal; these peaks are termed the maximum local maxima, as in Figure 4, and the minimum local maxima, as in Figure 2.
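As an illustration of this procedure, the following is a minimal sketch of pitch estimation from the correlation peaks, assuming a mono signal in a NumPy array; the 50-400 Hz search range is a common default for speech, not a value from the paper.

```python
import numpy as np

def pitch_autocorr(x, fs, fmin=50.0, fmax=400.0):
    """Estimate the pitch (fundamental frequency) from the autocorrelation peak.

    The lag of the highest autocorrelation peak inside the plausible pitch
    range gives the pitch period; its inverse is the pitch in Hz.
    """
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags only
    lag_min = int(fs / fmax)                           # shortest period allowed
    lag_max = min(int(fs / fmin), len(r) - 1)          # longest period allowed
    peak_lag = lag_min + np.argmax(r[lag_min:lag_max])
    return fs / peak_lag

# Example: a synthetic 120 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(0, 0.5, 1 / fs)
tone = np.sin(2 * np.pi * 120 * t)
print(pitch_autocorr(tone, fs))  # ~120 Hz
```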

3.2. Mel Domain

The Mel scale is a popular concept in the speech context; it simulates the human ear's sensation of the speech signal. The Mel frequency is different from the physical frequency of the signal, and the Mel spectrum coefficients form the Mel set, which represents the sensitivity of the human ear to a particular voice signal. Each voice signal affects the ear differently, and the Mel frequency spectrum coefficients represent the ear response in the form of a vector of eight values [25].

To derive the Mel frequency spectrum coefficients, which represent the ear response to the voice, the voice signal is first passed through a preemphasis filter in an attempt to amplify the low power samples. This process is important because the voice may include low frequency segments, producing low power waveforms (samples) due to whispering or a soft voice [26]. The preemphasis filter takes the voice signal and attempts to equalize the power so that it is distributed uniformly among the frequencies, as demonstrated in Figure 5. Passing the signal through the preemphasis filter results in a new version of the signal with an enhanced signal to noise ratio (SNR): since low power frequencies are more susceptible to noise, the preemphasis filter produces a signal with higher power in those slots, and hence the ratio of signal power to noise power becomes larger [27].

Once a signal with a good SNR is obtained from the preemphasis filter, signal framing is the next process in the Mel frequency spectrum coefficient algorithm. The speech signal is time variant, meaning that its frequency keeps changing with time and a fixed frequency response cannot be assumed; however, it is generally accepted that the speech signal remains stationary over a very short time frame, around 25 milliseconds. For this reason, and in order to treat the signal as time invariant, framing of the signal is a must. Each frame is multiplied by a window called the Hamming window, given in equation (2); the samples and the Fourier transform of the Hamming window are depicted in Figure 6.

$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1, \quad (2)$$

where $w(n)$ represents the Hamming window and $N$ is the total number of samples in the frame.
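The preemphasis and framing steps described above can be sketched as follows; the preemphasis coefficient of 0.97 and the 10 ms hop are common defaults assumed here, while the 25 ms frame length matches the text.

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """First-order preemphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, fs, frame_ms=25, hop_ms=10):
    """Split the signal into overlapping 25 ms frames and apply the Hamming
    window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) to each frame."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    assert len(x) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * window
```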

In further steps, each Hamming-windowed frame is converted from samples into a spectrum using the fast Fourier transform (FFT) as given in the following equation:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, w(n)\, e^{-j 2\pi k n / N},$$

where $X(k)$ is the fast Fourier transform of the sampled signal $x(n)$; furthermore, the Mel frequency is derived from the above components using the Mel conversion equation as in (3).

$$f_{\text{Mel}} = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right), \quad (3)$$

where $f_{\text{Mel}}$ is the frequency of the speech signal in the Mel scale and $f$ is the speech frequency in the Hertz scale.

The last step in the Mel frequency spectrum algorithm is to simulate the human ear's perception of the voice signal. A filter bank is used to perform this task: a bank of filters with the transfer functions given below is implemented to produce the human ear's voice perception. The filter bank response to the input is demonstrated in Figure 7, which represents the Mel scale of the spectrum according to the ear. As shown in the figure, the ear responds with a narrow response to low frequencies and a wide response to high frequencies; accordingly, each voice signal produces a different response.
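A minimal sketch of the standard triangular Mel filter bank is given below, with 8 filters to match the eight-value vector described earlier; the 512-point FFT size and 16 kHz sampling rate are assumptions for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced uniformly on the Mel scale: narrow at low
    frequencies, wide at high frequencies, mimicking the ear's resolution."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):            # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# Log Mel energies of one windowed frame (8 filters, matching the paper):
# power = |rFFT(frame)|^2 over the first n_fft//2 + 1 bins, then
# log_mel = np.log(mel_filterbank(8, 512, 16000) @ power + 1e-10)
```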

4. Feature Mapping

Features of the speech signal are generated from both the Mel frequency spectrum coefficient (MFCC) method and the fundamental frequency (pitch frequency) method. The Mel scale of the speech signal is obtained from the MFCC method, which represents the human ear response to speech; it divides the speech signal into a set of windows using a triangular filter bank and returns different window sizes for the higher and lower frequencies. In other words, MFCC segregates the speech signal according to the frequency range present in it, depending on the Mel scale (human ear perception). The pitch frequency is also obtained from the speech signal; it produces a single value in Hertz. The fundamental frequency is vital in speaker recognition because it represents the minimum frequency of vocal cord vibration [16, 17]. The two aforementioned methods are combined because the pitch frequency may be affected by associated noise and hence might not return the exact character of the speech signal. Accordingly, features from both the pitch frequency method and the Mel frequency spectrum coefficient method are obtained and used for the recognition work; a sketch of this combination follows below. For 250 speech signals and nine features per signal, a total of 2250 feature elements are generated for the speaker model. Figure 8 depicts the process of feature combination.
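The sketch below combines the two feature streams into one vector, reusing the hypothetical helpers from the Section 3 sketches (`preemphasis`, `frame_and_window`, `pitch_autocorr`, `mel_filterbank`); averaging the log Mel energies over frames into a fixed eight-value vector is an assumption about the paper's exact pooling.

```python
import numpy as np

def speaker_features(signal, fs):
    """Hypothetical 9-element feature vector: 8 Mel values plus 1 pitch value.
    Relies on the helper functions defined in the earlier sketches."""
    frames = frame_and_window(preemphasis(signal), fs)
    power = np.abs(np.fft.rfft(frames, n=512, axis=1)) ** 2
    log_mel = np.log(power @ mel_filterbank(8, 512, fs).T + 1e-10)
    mel_vec = log_mel.mean(axis=0)        # pool over frames -> 8 values
    pitch = pitch_autocorr(signal, fs)    # 1 value, in Hz
    return np.concatenate([mel_vec, [pitch]])

# 250 clips x 9 features -> a (250, 9) matrix, i.e. 2250 elements in total:
# X = np.stack([speaker_features(sig, fs) for sig in clips])
```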

5. Model Optimizer

5.1. Plain FFNN

The feed-forward neural network model is used in this work to predict the speaker characters. The features of each speaker are applied to the model for training, and the model is established according to the parameters in Table 1. A three-layer FFNN with 30, 20, and 1 nodes is built; the reason for selecting this number of nodes is to reduce the delay incurred by the model in the training and testing stages. According to Figure 9, the Levenberg-Marquardt (LM) algorithm is used to train the model, and the target performance (mean square error) is set to $10^{-29}$. Three experiments were conducted as a timeline of enhancements to the model; in each experiment, the model was upgraded for the sake of performance enhancement. The first experiment relied on the parameters given in Table 1. During the training stage of this experiment, it was noticed that the results varied every time the model was restarted, since the LM algorithm assigns the initial weight values randomly and does so anew whenever the model is run. In order to monitor the model performance and to tackle this random nature of the results, the experiment was repeated 100 times, the results were recorded, and the average results were used to examine the model performance [28, 29].
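As a rough sketch of this first experiment, the snippet below trains the 30-20 hidden layer network repeatedly and averages the scores. The paper trains with Levenberg-Marquardt (e.g., MATLAB's trainlm), which scikit-learn does not provide, so the default 'adam' solver stands in here; the placeholder data, the 10-speaker label set, and the output layer handling (scikit-learn sizes the output layer automatically) are likewise assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for the 250 x 9 feature matrix and labels.
X = np.random.randn(250, 9)
y = np.random.randint(0, 10, size=250)  # hypothetical 10 speakers
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = []
for seed in range(100):  # 100 repetitions, as in the paper
    # Two hidden layers (30, 20); no LM solver exists in scikit-learn,
    # so the default 'adam' solver is used as a stand-in for trainlm.
    net = MLPClassifier(hidden_layer_sizes=(30, 20), max_iter=2000,
                        random_state=seed)
    net.fit(X_tr, y_tr)
    scores.append(net.score(X_te, y_te))
print("mean accuracy over 100 runs:", np.mean(scores))
```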

5.2. Model Freezing

A second experiment was conducted based on the results of the first experiment: the performance of the neural network was recorded for all 100 repetitions, and the weights of every repetition were stored. The weight freezing technique involves presetting the weight values of the FFNN model to those weight values that return the best cost. Freezing dispenses with the need for a training algorithm, as ready-made weights with predetermined performance can be fed into the model. The selection of proper weight values depends entirely on the previous experiment, which recorded the weights and their cost values. Figure 9 demonstrates the process of model freezing: the program is set up to test all the recorded weights and to select the weights that yield the best cost.
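The freezing step can be sketched as follows; `train_once` is a hypothetical helper that runs one training session from a fresh random initialization and returns the resulting weights with their cost.

```python
import numpy as np

def freeze_best_weights(train_once, n_runs=100):
    """Record the weights of every training run together with its cost,
    then keep the lowest-cost weights for reuse without retraining."""
    best_weights, best_cost = None, np.inf
    for _ in range(n_runs):
        weights, cost = train_once()   # hypothetical: (weights, cost) per run
        if cost < best_cost:
            best_weights, best_cost = weights, cost
    return best_weights, best_cost
```

The frozen model then simply reloads `best_weights` at inference time, with no further training required.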

The third experiment was another attempt to enhance the prediction performance, so a new training algorithm was used. The particle swarm optimization (PSO) algorithm has proven to deliver noticeable performance in optimizing the feed-forward neural network. Figure 10 shows the flow diagram of the PSO-FFNN algorithm. The PSO algorithm is made to produce the weight values that yield an enhanced performance; the following steps, sketched in code below, are taken to execute the algorithm.
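A minimal sketch of such a PSO-FFNN training loop follows: each particle is a flattened weight vector for the 9-30-20-1 network, and the swarm minimizes the mean square error. The inertia and acceleration constants (0.72, 1.49) are common PSO defaults, not values from the paper; the tanh activations and the numeric encoding of the speaker identity (matching the single output node) are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(w, X, sizes=(9, 30, 20, 1)):
    """Forward pass of the 9-30-20-1 FFNN, unpacking weights and biases
    from the flat particle vector w (tanh hidden layers, linear output)."""
    a, idx = X, 0
    for i in range(len(sizes) - 1):
        n_in, n_out = sizes[i], sizes[i + 1]
        W = w[idx:idx + n_in * n_out].reshape(n_in, n_out); idx += n_in * n_out
        b = w[idx:idx + n_out]; idx += n_out
        a = a @ W + b
        if i < len(sizes) - 2:
            a = np.tanh(a)
    return a

def pso_train(X, y, n_particles=30, n_iter=200, sizes=(9, 30, 20, 1)):
    """Optimize the flattened FFNN weights with a standard PSO loop."""
    dim = sum(sizes[i] * sizes[i + 1] + sizes[i + 1]
              for i in range(len(sizes) - 1))
    cost = lambda w: np.mean((forward(w, X, sizes).ravel() - y) ** 2)  # MSE
    pos = rng.normal(0.0, 0.5, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_cost = np.array([cost(p) for p in pos])
    gbest = pbest[np.argmin(pbest_cost)].copy()
    w_in, c1, c2 = 0.72, 1.49, 1.49          # common PSO constants (assumed)
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w_in * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos += vel
        costs = np.array([cost(p) for p in pos])
        improved = costs < pbest_cost
        pbest[improved], pbest_cost[improved] = pos[improved], costs[improved]
        gbest = pbest[np.argmin(pbest_cost)].copy()
    return gbest, pbest_cost.min()

# Usage sketch, with X a (n, 9) feature matrix and y numeric speaker targets:
# best_w, best_mse = pso_train(X, y)
```

The returned weight vector can then be frozen into the network exactly as in the previous section, so inference requires only the forward pass.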

6. Results and Discussion

As discussed in the previous sections, the feed-forward neural network was examined under several performance metrics in order to identify the best model for predicting the speaker identity. Three models were used, namely, the plain feed-forward neural network, the weight freezing-based feed-forward neural network, and ultimately the particle swarm optimization-based feed-forward neural network. The performance results of these models are listed in Table 2. It is observed that the accuracy of speaker prediction is optimal for the FFNN when PSO is used for performance optimization; an accuracy of 97.83% is recorded for this model. For the other models, namely, the plain FFNN and the modified FFNN (MFFNN), the recorded speaker recognition accuracies are 78.59% and 89.25%, respectively. The optimal accuracies were obtained with the PSO-FFNN model: the novel method adopted in the PSO algorithm for tuning the FFNN weight coefficients produced the weight coefficients yielding the least prediction error. Using the plain FFNN and the weight freezing method in the MFFNN, the prediction results of both models were analysed and used by the PSO swarm (weight) generator to seed the generated random swarms. It was also observed that the time taken to predict the speaker identity using the PSO-FFNN model was 0.97, meaning that the proposed model performed the required tasks in the minimum time compared with the other models. The rapid operation of the PSO-FFNN model follows from the fact that the FFNN model relies entirely on the PSO algorithm to produce the weight coefficients, without the need for standalone (internal) weight generation. Eventually, the mean square error (MSE) and root mean square error (RMSE) metrics, as defined below, are also found to be minimal [30-32]; lower MSE and RMSE imply less error in the proposed model's predictions. The results are graphically demonstrated in Figure 11.
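For reference, the two error metrics have their standard definitions, where $y_i$ is the true target, $\hat{y}_i$ the model prediction, and $N$ the number of test samples:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}.$$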

Furthermore, different classifiers were used for predicting the speaker identity, such as the random forest (RF) algorithm, k-nearest neighbour (KNN), and support vector machine (SVM). In order to evaluate the performance of all proposed tools, k-fold cross-validation was used, where various input styles were tested. The accuracy scores of our proposed model and of the other algorithms over the 10 folds in various noise conditions are demonstrated in Figures 12-16: Figure 12 depicts the accuracy measures in clean voice environments, while Figures 13, 14, 15, and 16 depict the accuracy measures under 10 dB, 15 dB, 20 dB, and 50 dB AWGN, respectively. A sketch of the noise injection and cross-validation procedure is given below.
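The sketch injects white Gaussian noise at a target SNR and scores a baseline classifier with 10-fold cross-validation; the SVC baseline and the reuse of the hypothetical `speaker_features` helper from Section 4 are assumptions, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def add_awgn(signal, snr_db):
    """Add white Gaussian noise at a target signal-to-noise ratio (dB)."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return signal + np.random.normal(0.0, np.sqrt(p_noise), len(signal))

# Hypothetical evaluation loop: recompute features from the noisy clips and
# score a baseline classifier with 10-fold cross-validation.
# for snr in (10, 15, 20, 50):
#     noisy = [add_awgn(sig, snr) for sig in clips]
#     X = np.stack([speaker_features(s, fs) for s in noisy])
#     print(snr, "dB:", cross_val_score(SVC(), X, y, cv=10).mean())
```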

The proposed classifier, i.e., PSO-FFNN, yielded the best accuracy score under all noise conditions, as demonstrated in Table 3.

7. Conclusions

Speaker recognition is a vital stage in many personal authentication and security systems; it builds the logic of person verification using biometric features, more specifically, voice features. A speaker recognition system involves two major stages, feature extraction and speaker classification. These processes may begin with voice preprocessing, which involves preparing the voice signals and assembling them into the data set. Speech features include time domain and frequency domain processing; each is an integral part of speech processing and can be used to form the final recognition system. Speech signal preprocessing is about signal enhancement: reducing the noise level and removing unnecessary information such as background noise and other artefacts. It may involve silence removal, which deletes the low power samples that represent silence in the uttered sentence (breaks while speaking). These processes are important for enhancing the signal quality, making the signal more readable for the subsequent stages. Preprocessing is also important for reducing the extra computational load that would otherwise consume processor capacity and degrade the performance of the entire system. Several approaches are employed to perform feature extraction of the speech signal; the fundamental frequency and the Mel frequency cepstrum coefficients are the main approaches employed in this system, whereas deep learning approaches are employed for the speaker classification tasks (mapping the features to particular speakers). An FFNN is used for mapping the features to their respective speakers, and the results have shown that the PSO-FFNN outperformed the other techniques used in this paper.

Data Availability

The data used in this study are publicly and freely available online.

Conflicts of Interest

The authors declare that they have no conflicts of interest.