Article

Combined Generative Adversarial Network and Fuzzy C-Means Clustering for Multi-Class Voice Disorder Detection with an Imbalanced Dataset

1 School of Science and Technology, The Open University of Hong Kong, Hong Kong, China
2 King Abdulaziz University, Jeddah, P.O. Box 34689, Saudi Arabia
3 Effat College of Engineering, Effat University, Jeddah, P.O. Box 34689, Saudi Arabia
4 Fundamental and Applied Sciences Department, Universiti Teknologi PETRONAS, Seri Iskandar 32610, Perak Darul Ridzuan, Malaysia
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(13), 4571; https://doi.org/10.3390/app10134571
Submission received: 2 June 2020 / Revised: 29 June 2020 / Accepted: 29 June 2020 / Published: 1 July 2020
(This article belongs to the Special Issue Signal Processing and Machine Learning for Biomedical Data)

Abstract
The world has witnessed the success of artificial intelligence deployment for smart healthcare applications. Various studies have suggested that the prevalence of voice disorders in the general population is greater than 10%. An automatic diagnosis of voice disorders via machine learning algorithms is desirable to reduce the cost and time needed for examination by doctors and speech-language pathologists. In this paper, a conditional generative adversarial network (CGAN) and improved fuzzy c-means clustering (IFCM) algorithm called CGAN-IFCM is proposed for the multi-class detection of three common types of voice disorders. The existing benchmark datasets for voice disorders, the Saarbruecken Voice Database (SVD) and the Voice ICar fEDerico II Database (VOICED), contain imbalanced classes. A generative adversarial network offers synthetic data to reduce bias in the detection model. Improved fuzzy c-means clustering considers the relationship between adjacent data points in the fuzzy membership function. To demonstrate the necessity of CGAN and IFCM, a comparison is made between the algorithm with CGAN and that without CGAN, and the performance is compared between IFCM and traditional fuzzy c-means clustering. Lastly, the proposed CGAN-IFCM outperforms existing models in its true negative rate and true positive rate by 9.9–43.5% and 9.1–44.8%, respectively.

1. Introduction and Literature Review

Voices are crucial for human beings: we communicate with others and express our emotions and ideas through them. Throughout our lives, we may encounter people with voice disorders, which involve abnormalities in quality, tone, volume, and pitch. Voice professionals, whose jobs depend strongly on vocal load and/or voice quality, are particularly affected by voice disorders [1]. Examples include (i) teachers, who require a high vocal load and a medium quality; (ii) television presenters, who require a medium vocal load and a high quality; and (iii) actors, who require a high standard in both vocal load and quality. A healthy voice that fulfills one’s professional needs is important to ensure the best performance.
Various epidemiological studies have suggested a high prevalence of voice disorders. In [2], a prevalence rate of 21% was observed in a sample of about 4800 people. Another study [3] was conducted in Sweden, in which 16.9% of 114,538 people had voice disorders. Attention has likewise been drawn to primary and secondary education teachers in Finland, 54% of 1198 of whom were found to suffer from voice disorders [4]. Voice disorder sufferers are expected to consult otorhinolaryngologists and speech therapists for diagnosis and medical treatment to support their rapid recovery. However, research has shown that only a small percentage of sufferers seek professional advice. In [5], only 5.9% (78 participants) sought professional advice when they suffered from voice disorders. Another survey [6] showed that 22.5% (56 teachers) sought medical advice to address their voice disorders. Many of the subjects claimed that they did not recognize they were suffering from voice disorders. Indeed, confirming a voice disorder requires medical knowledge.
Due to the high prevalence and low consultation rate of the population with voice disorders, it is desirable to have an automatic voice disorder detection algorithm with voice inputs that can give an instant diagnosis of a voice disorder, as well as its type. Such algorithms could be integrated into mobile health applications, as demonstrated in related works [7,8].
In Section 1.1, the performance of existing works on voice disorder detection is summarized. The limitations of existing works are explained in Section 1.2, which forms the rationale of the proposed work in addressing these limitations. This is followed by the key contributions of this paper in Section 1.3.

1.1. Literature Review

In recent years, various machine learning algorithms have been proposed and evaluated for the detection of voice disorders. It is worth noting that some previous works [9,10,11] tested their algorithms using the Massachusetts Eye and Ear Infirmary voice and speech lab (MEEI) database, which is not discussed in this paper because the MEEI database is commercial and not publicly available. Further, it is inconsistent and features varying recording conditions for pathological and healthy subjects.
Instead, this paper focuses on two publicly available databases: the Saarbruecken Voice Database (SVD) [12,13,14,15,16] and Voice ICar fEDerico II (VOICED) [16,17,18]. The following is a summary of existing approaches applied to the SVD. A hybrid support vector machine (SVM) and Gaussian mixture model (GMM) approach was proposed and evaluated in [12]. This method achieved an accuracy, sensitivity, and specificity of 0.965, 0.94, and 0.99, respectively. Six methods, including a sequential minimal optimization (SMO)-based SVM, decision tree, Bayesian classification, logistic model tree, k-nearest neighbor, and an entropy-based method, were evaluated in [13]. The SMO-based SVM yielded the best performance in accuracy (0.858), sensitivity (0.876), and specificity (0.839). Guedes et al. [14] proposed two approaches, long short-term memory (LSTM) and convolutional neural network (CNN), for differentiation between healthy and dysphonic candidates, healthy and laryngitic candidates, and healthy and paralyzed candidates. The achieved precision values were 0.66, 0.67, and 0.78, respectively. SVM was applied with zero frequency filtering and quasi-closed phase glottal inverse filtering methods for the detection of voice disorders in [15]. Based on the performance evaluation, the applied method achieved an accuracy, sensitivity, and specificity of 0.76, 0.72, and 0.78, respectively. A threshold-based detection method with a newly defined dysphonia detection index was proposed in [16]. The accuracy, sensitivity, and specificity were 0.798, 0.706, and 0.902, respectively.
The work in [16] also used VOICED, showing degraded performance in accuracy, sensitivity, and specificity (0.5, 0.458, and 0.643, respectively). Researchers adopted the k-nearest neighbor (KNN) algorithm as a model for the detection of voice disorders [17]. This method achieved an accuracy of 0.933 and outperformed the other algorithms (random forest (0.874) and extra trees (0.863)). In [18], boosted tree, SVM, decision tree, naïve Bayes classifier, and KNN models were evaluated; the pairs of sensitivity and specificity were (0.829, 0.862), (0.79, 0.279), (0.779, 0.844), (0.857, 0.644), and (0.774, 0.334), respectively. As the VOICED database was published in 2018 (compared with the SVD, which was published in 1997), the research publications that consider the VOICED dataset are fewer than those using the SVD.

1.2. Research Gaps and Motivation

Existing works have applied various algorithms for voice disorder detection [12,13,14,15,16,17,18]. However, further research and exploration are necessary to address the following limitations in existing works.
  • Some existing works [12,16,17] did not apply cross-validation during their performance evaluation of the algorithms. The first concern is that one may pick a biased training dataset to train the detection model, which can take advantage of this bias by yielding high accuracy. Secondly, not all the data were evaluated, which may affect the fine-tuning of the model and the robustness of its applicability in real-world scenarios.
  • There is room for improvement in the accuracy, sensitivity, and specificity of voice disorder detection models [13,14,15,16,18]. Particularly in smart healthcare applications, expectations for the performance of machine learning models are high, as such applications are related to the health status of humans.
  • Current works [12,13,14,15,16,17,18] have formulated the voice disorder detection problem as binary detection that only outputs a healthy or pathological result. It is desirable for machine learning algorithms to suggest the actual type of voice disorder to minimize screening time by medical professionals.
  • Current works [12,13,14,15,16,17,18] have not considered the issue of imbalanced datasets in either the SVD or VOICED. This may widen the gap between the sensitivity and specificity of the detection model.
To solve these limitations, this paper adopts the following measures.
  • A 10-fold cross validation is adopted for the performance evaluation of the voice disorder detection algorithm.
  • The proposed algorithm incorporates a generative adversarial network and fuzzy c-means clustering to improve the performance of the detection model.
  • Voice disorder classification is formulated as a multi-class detection problem, thereby allowing the actual type of voice disorder to be suggested.
  • A generative adversarial network is proposed to generate new training data, which will reduce the influence of imbalanced datasets and improve the performance of the voice disorder detection model.

1.3. Research Contributions

The contributions of this paper are as follows.
  • A conditional generative adversarial network (CGAN) and improved fuzzy c-means clustering (IFCM) algorithm named CGAN-IFCM is proposed to enable the multi-class detection of voice disorders.
  • CGAN offers dual benefits, not only reducing the influence of imbalanced datasets but also generating new training data to improve the performance of the voice disorder detection model. In this way, the gap between sensitivity and specificity in the detection model can be reduced. The results indicate that the proposed CGAN-IFCM outperforms stand-alone IFCM by 10–12.6% and 5.8–16.2% for the true negative rate (TNR) and true positive rate (TPR), respectively.
  • IFCM addresses the limitations of existing fuzzy c-means clustering (FCM). IFCM increases performance by introducing interactions between adjacent data points into the fuzzy membership function. A data point and its neighboring data points in feature space have a high probability of being grouped into the same cluster. The results reveal that the proposed CGAN-IFCM improves the TNR and TPR by 7.3–9% and 3.1–12%, respectively.
  • Compared with existing works, the proposed CGAN-IFCM improves the TNR and TPR by 9.9–43.5% and 9.1–44.8%, respectively.

2. Materials and Methods

In Section 2, the two publicly available databases (SVD and VOICED) for voice disorder detection are outlined, followed by the proposed CGAN-IFCM algorithm that will be applied to each of the databases.

2.1. Voice Disorders Databases

2.1.1. Saarbruecken Voice Database (SVD)

The SVD was collected in collaboration with the Department of Phoniatrics and Ear, Nose, and Throat (ENT) at the Caritas clinic of St. Theresia in Saarbrücken [19,20]. The data contain recordings of sustained vowel phonations, as well as the sentence “Good morning, how are you?” spoken in German. The database includes 869 healthy candidates and 1356 candidates with voice disorders, from which a binary detection problem can be formulated.
For the multi-class detection problem, there are 71 types of pathologies among the 1356 candidates. Since most pathology types have small sample sizes, and to align with the pathology types in VOICED, the multi-class detection problem is formulated with 213 candidates with hyperkinetic dysphonia, 16 with hypokinetic dysphonia, 140 with reflux laryngitis, and 869 healthy candidates.

2.1.2. Voice ICar fEDerico II (VOICED)

VOICED was collected at the faculty of Phoniatrics and Videolaryngoscopy at the Hospital University of Naples Federico II and at the medical room at the Institute of High Performance Computing and Networking [21]. The recordings contain voice signals of the vowel /a/ sustained for five seconds in a quiet room to minimize background noise. Further, VOICED contains information on other attributes, such as smoking status, alcohol consumption, hydration, eating habits, voice handicap index, and reflux symptom index. The database contains 58 healthy candidates and 150 candidates with voice disorders. Thus, the binary detection problem can be formulated.
On the other hand, the multi-class detection problem can be formulated by dividing the voice disorder group into 70 with hyperkinetic dysphonia, 41 with hypokinetic dysphonia, and 39 with reflux laryngitis.
Table 1 summarizes the key information on SVD and VOICED, including the signal characteristics, the information included, and the classes of the binary detection model and multi-class detection model. Both have imbalanced datasets, so CGAN will be applied in this study to generate more training samples for classes with smaller sample sizes.

2.2. Design of the Voice Disorder Detection Model using CGAN-IFCM

In this section, the methodology of CGAN-IFCM is presented. Firstly, this section explains the reasons for using CGAN instead of other types of GAN, along with the details of CGAN. This is followed by the rationale behind the selection of IFCM instead of a traditional FCM as a voice disorder detection model.

2.2.1. Generation of Additional Training Data Using CGAN

Imbalanced classes are observed in both the SVD and VOICED databases. This imbalance causes a significant deviation between the sensitivity and specificity of the voice disorder detection model, particularly in VOICED [16,17,18]. The goal is to increase the number of training samples to balance the classes. A recent state-of-the-art article presented the development and progress of the evolution of GAN [22]. The basic GAN has a key limitation: its noise vector is unrestricted, which may lead to serious theoretical issues. As a result, different kinds of solutions have been proposed; the following relevant articles have received a large number of citations: auxiliary classifier GAN (ACGAN) [23], CGAN [24], and information maximizing GAN (InfoGAN) [25]. In this paper, CGAN is selected to increase the number of training samples for classes with fewer samples. The first reason is that various types of information (the information listed in Table 1) are included in the SVD and VOICED. This information is conditional information that fits well with the theory of CGAN. Further, the conditional information is introduced in both the generator and discriminator to mitigate the aforesaid theoretical issue. Conditional generation is then introduced, in which CGAN is restricted from generating new data for the class with the largest sample size. Therefore, in the binary detection model, the class of candidates with voice disorders is restricted in both the SVD and VOICED. For the multi-class detection model, the class of healthy candidates and the class of candidates with hyperkinetic dysphonia are restricted in the SVD and VOICED, respectively. To avoid the excessive generation of training data, a second restriction limits the number of generated samples for a class to, at most, the original sample size of that class. A minimal sketch of this generation budget is given below.
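To make these two restrictions concrete, the following minimal Python sketch (our own illustration, not code from the original work) computes how many synthetic samples to generate per class; the function name synthetic_budget and its dictionary interface are hypothetical.

```python
def synthetic_budget(class_counts):
    """Number of CGAN samples to generate per class: top up each minority class
    toward the largest class, capped at the class's own original size."""
    largest = max(class_counts.values())
    budget = {}
    for label, n in class_counts.items():
        budget[label] = 0 if n == largest else min(largest - n, n)
    return budget

# Example with the SVD multi-class counts from Table 1:
print(synthetic_budget({"healthy": 869, "hyperkinetic": 213,
                        "hypokinetic": 16, "reflux": 140}))
# -> {'healthy': 0, 'hyperkinetic': 213, 'hypokinetic': 16, 'reflux': 140}
```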
Figure 1 shows the system workflow of CGAN. Given noise vector n and conditional variable c, generator G learns the distribution of the data x, and discriminator D determines whether a sample came from the original dataset or the generated dataset. Both G and D are conditioned on c: n and c serve as the inputs of G, which produces the synthetic sample G(n|c), while x (or G(n|c)) together with c serves as the input of D. For the discriminator, D(x; θd) outputs a single scalar representing the probability that x came from the training data rather than from the generator. To learn the generator distribution over the data x, the generator constructs a mapping function from the Gaussian noise distribution pn(n) to the data space as G(n; θg). D and G are trained simultaneously: parameters θd of D are adjusted to maximize log D(x|c), and parameters θg of G are adjusted to minimize log(1 − D(G(n|c))) for the conditional variable c. Mathematically, the objective function is expressed as follows:
$$\min_{G} \max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x|c) \right] + \mathbb{E}_{n \sim p_n(n)} \left[ \log \left( 1 - D(G(n|c)) \right) \right]$$
where V(D,G) is the value function.
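For illustration, a minimal CGAN sketch in PyTorch is given below. This is an assumed implementation, not the authors' code: the network sizes, learning rates, and the one-hot encoding of the conditional variable c are our own choices; only the structure (a conditioned generator and discriminator trained adversarially with the objective in Equation (1)) follows the text.

```python
import torch
import torch.nn as nn

NOISE_DIM, FEAT_DIM, N_CLASSES = 16, 4, 4   # assumed sizes: 4 acoustic features, 4 classes

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + N_CLASSES, 32), nn.ReLU(),
            nn.Linear(32, FEAT_DIM))

    def forward(self, n, c):
        # G(n|c): noise and one-hot condition are concatenated at the input
        return self.net(torch.cat([n, c], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + N_CLASSES, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, x, c):
        # D(x|c): probability that x is a real sample given condition c
        return self.net(torch.cat([x, c], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_x, real_c):
    batch = real_x.size(0)
    fake_x = G(torch.randn(batch, NOISE_DIM), real_c)
    # Discriminator step: push D(x|c) toward 1 and D(G(n|c)|c) toward 0
    opt_d.zero_grad()
    loss_d = bce(D(real_x, real_c), torch.ones(batch, 1)) + \
             bce(D(fake_x.detach(), real_c), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()
    # Generator step: minimize log(1 - D(G(n|c))) via the non-saturating form
    opt_g.zero_grad()
    loss_g = bce(D(fake_x, real_c), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```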

2.2.2. Voice Disorder Detection Model Using IFCM

As shown in Table 1, the voice disorder detection model can be formulated as a binary detection problem and a multi-class (4-class) detection problem. We selected a multi-class detection model for the illustration in this subsection because it represents the expected application through which the voice disorder detection model could diagnose actual types of voice disorders.
To design the feature vector, there are two typical approaches: (i) follow clinical and expert guidelines; or (ii) if the first approach is not available, carry out a thorough investigation of suitable features. For the voice disorder detection problem, we follow the first approach because a clinical acoustic analysis technique is available for the formal assessment of voice disorders [26,27]. Four voice quality-based parameters are chosen: the harmonics-to-noise ratio (HNR), shimmer, jitter, and fundamental frequency (f0). Since the estimation of these parameters is a standard approach, and feature extraction is not the focus of this paper, we follow de Krom’s algorithm to measure the HNR [28], jitter (as the cycle-to-cycle variation of the fundamental frequency) [29], shimmer (as the peak-to-peak amplitude variation in decibels) [29], and f0 (as the maximum of the autocorrelation function) [30].
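A minimal sketch of the four features is given below, under the assumption that glottal cycle periods and per-cycle peak amplitudes have already been extracted from the recording. The jitter and shimmer definitions follow [29] and f0 follows the autocorrelation maximum of [30]; the HNR function is a crude autocorrelation-based stand-in rather than de Krom's full spectral method [28].

```python
import numpy as np

def jitter(periods):
    """Relative cycle-to-cycle variation of the fundamental period [29]."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def shimmer_db(amplitudes):
    """Mean peak-to-peak amplitude variation in decibels [29]."""
    a = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(20.0 * np.log10(a[1:] / a[:-1])))

def f0_autocorr(signal, fs, f0_min=75.0, f0_max=500.0):
    """f0 taken as the lag of the maximum of the autocorrelation function [30]."""
    sig = np.asarray(signal, dtype=float) - np.mean(signal)
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

def hnr_db(signal, fs, f0_min=75.0, f0_max=500.0):
    """Crude HNR from the normalized autocorrelation peak, a simplified
    stand-in for de Krom's method [28]: 10*log10(r / (1 - r))."""
    sig = np.asarray(signal, dtype=float) - np.mean(signal)
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    ac = ac / ac[0]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    r = float(np.clip(np.max(ac[lo:hi]), 1e-6, 1 - 1e-6))
    return 10.0 * np.log10(r / (1.0 - r))
```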
In the implementation and performance evaluation, we adopt k-fold cross-validation with k = 10, the typical choice [31,32]. We define Ni as the total number of samples in class i, where N1, N2, N3, and N4 correspond to healthy candidates, those with hyperkinetic dysphonia, those with hypokinetic dysphonia, and those with reflux laryngitis, respectively. The feature vector is Xp = {HNRp, jitterp, shimmerp, f0,p} for p ∈ [1, 0.9(N1 + N2 + N3 + N4)], where 0.9(N1 + N2 + N3 + N4) corresponds to the 90% of the data used for training in each fold.
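A sketch of this 10-fold protocol is shown below with placeholder data; the use of stratified folds is our assumption, as the text only specifies k = 10.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # placeholder feature vectors (HNR, jitter, shimmer, f0)
y = rng.integers(0, 4, size=200)     # placeholder labels for the four classes

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]   # 90% of the data per fold
    X_test, y_test = X[test_idx], y[test_idx]       # held-out 10%
    # ... train CGAN-IFCM on the training fold, evaluate TNR/TPR on the test fold
```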
In general, the FCM problem is defined as the minimization of an objective function [33]. Intuitively, in each cluster, the data points should be close to the cluster center. Therefore, the objective is defined as the minimization of the intra-cluster variance, which is equivalent to the maximization of intra-cluster similarity: the smaller the variance, the greater the similarity among the data points (a variance of zero yields maximal similarity):
$$F_{FCM} = \sum_{p=1}^{0.9(N_1+N_2+N_3+N_4)} \sum_{c=1}^{N_c} u_{pc}^{m} \left\| X_p - v_c \right\|^2, \quad 1 \le m < \infty$$
$$\text{s.t.} \quad \sum_{c=1}^{N_c} u_{pc} = 1$$
where Nc is the optimally designed total number of clusters using the multiobjective genetic algorithm (MOGA), upc is the degree of membership of Xp in the cth cluster, vc is the cluster center of the cth cluster, and m controls the fuzziness of the resulting partition. The value m = 2 is common and is selected according to [33].
Taking partial derivatives, we obtain two iterative update rules for vc and upc, given by
$$v_c = \frac{\sum_{p=1}^{0.9(N_1+N_2+N_3+N_4)} u_{pc}^{m} \cdot X_p}{\sum_{p=1}^{0.9(N_1+N_2+N_3+N_4)} u_{pc}^{m}}$$
$$u_{pc} = \frac{\left\| X_p - v_c \right\|^{-\frac{2}{m-1}}}{\sum_{q=1}^{N_c} \left\| X_p - v_q \right\|^{-\frac{2}{m-1}}}.$$
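For reference, a compact NumPy implementation of the standard FCM updates in Equations (4) and (5) (with m = 2) is sketched below; the random initialization and fixed iteration count are our own choices.

```python
import numpy as np

def fcm(X, n_clusters, m=2.0, n_iter=100, eps=1e-9, seed=0):
    """Standard FCM: alternate the updates of Equations (4) and (5)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.dirichlet(np.ones(n_clusters), size=n)       # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]         # Eq. (4): cluster centers
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1) + eps
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)         # Eq. (5): memberships
    return U, V
```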
Nonetheless, the existing FCM algorithm [33] does not include the relationship between Xp and its neighbors Xp′, which limits its performance in clustering problems. Feature vector Xp is indeed strongly correlated with its neighbors Xp′ (sharing similar characteristics) due to the nature of the feature space, and these feature vectors have a high probability of being grouped into the same cluster. As a result, the rationale of the proposed work is to introduce the interaction between Xp and its neighbors Xp′ into the fuzzy membership function.
A new degree of membership is proposed as follows:
$$u_{neighbor,pc} = C_{pc} \cdot \frac{\left\| X_p - v_c \right\|^{-\frac{2}{m-1}}}{\sum_{q=1}^{N_c} \left\| X_p - v_q \right\|^{-\frac{2}{m-1}}}$$
$$C_{pc} = \frac{\sum_{j \in SN(X_p)} u_{jc}}{N_n}$$
where Cpc is a conditional variable related to the level of participation of Xp in the cth cluster, SN(Xp) is the square neighborhood centered at Xp, and Nn is the number of feature vectors in the neighborhood.
The joined degree of membership between upc and uneighbor,pc is defined as follows:
$$u_{VDC,pc} = \frac{\left( u_{neighbor,pc} \right)^{\omega_n} \left( u_{pc} \right)^{\omega_0}}{\sum_{q=1}^{N_c} \left( u_{neighbor,pq} \right)^{\omega_n} \left( u_{pq} \right)^{\omega_0}}$$
$$v_{VDC,c} = \frac{\sum_{p=1}^{0.9(N_1+N_2+N_3+N_4)} u_{VDC,pc}^{m} \cdot X_p}{\sum_{p=1}^{0.9(N_1+N_2+N_3+N_4)} u_{VDC,pc}^{m}}$$
where ωn and ω0 are control weighting factors determining the relative importance of uneighbor,pc and upc. These are optimally designed via MOGA, which benefits the convergence of the model training. Numerous combinations of ωn and ω0 exist, which may yield different uVDC,pc and vVDC,c values, and searching for the optimal ωn and ω0 requires considerable computing power. A tradeoff is therefore sought between the convergence of model training and computing power while maintaining favorable performance. A sketch of the IFCM membership computation is given after this paragraph.
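The following sketch illustrates the IFCM membership of Equations (6)-(8). The square neighborhood SN(Xp) is approximated here by the Nn nearest neighbors in feature space, which is our simplification of the stated definition; U is the FCM membership matrix from Equation (5).

```python
import numpy as np

def ifcm_membership(U, X, w_n=2.1, w_0=2.0, n_neighbors=8):
    """Joined membership of Eq. (8), built from Eqs. (6) and (7).
    U: FCM membership matrix from Eq. (5), shape (n_points, n_clusters)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    nbrs = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]  # k-NN stand-in for SN(X_p)
    C = U[nbrs].mean(axis=1)                             # Eq. (7): C_pc
    u_neighbor = C * U                                   # Eq. (6)
    joined = (u_neighbor ** w_n) * (U ** w_0)
    return joined / joined.sum(axis=1, keepdims=True)    # Eq. (8), normalized per point
```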
The multiobjective optimization problem consists of four objectives, given by
$$\begin{cases} \text{MAX} & F_1 = CH_{index} \\ \text{MIN} & F_2 = \left\| U_{VDC}(b+1) - U_{VDC}(b) \right\| \\ \text{MAX} & F_3 = \text{Specificity} = TNR \\ \text{MAX} & F_4 = \text{Sensitivity} = TPR \end{cases}$$
where CHindex is the Calinski–Harabasz (CH) index, which is important for determining the number of clusters [34]. CHindex evaluates cluster validity based on the sum of squares between clusters and the sum of squares within clusters. Varying the number of clusters Nc yields distinct values of CHindex: the higher the CHindex, the better the solution. In addition, UVDC = (uVDC,pc) is the (N1 + N2 + N3 + N4) × Nc fuzzy membership matrix, b is the iteration index, TNR is the true negative rate (equivalent to specificity), and TPR is the true positive rate (equivalent to sensitivity) of the voice disorder detection model. CHindex, TNR, and TPR are given by
$$CH_{index} = \frac{SS_{BC} / (N_c - 1)}{SS_{WC} / \left( 0.9(N_1+N_2+N_3+N_4) - N_c \right)}$$
$$\text{Specificity} = TNR = \frac{TN}{TN + FP}$$
$$\text{Sensitivity} = TPR = \frac{TP}{TP + FN}$$
where SSBC is the sum of squares between clusters, Nc is the number of clusters, and SSWC is the sum of squares within clusters. The larger the SSBC, the higher the degree of dispersion between clusters; the smaller the SSWC, the closer the relationship within a cluster. In addition, TN, FP, TP, and FN are the true negatives, false positives, true positives, and false negatives of the testing samples, respectively. These quantities can be computed as sketched below.
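The following convenience sketch computes these quantities with scikit-learn for crisp assignments; it matches Equation (11) when applied to the 90% training partition, and Equations (12) and (13) for the binary case.

```python
from sklearn.metrics import calinski_harabasz_score, confusion_matrix

def ch_index(X_train, cluster_labels):
    """Eq. (11), applied to the training partition with crisp cluster labels."""
    return calinski_harabasz_score(X_train, cluster_labels)

def tnr_tpr(y_true, y_pred):
    """Eqs. (12) and (13) for the binary case."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tn / (tn + fp), tp / (tp + fn)
```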
The proposed MOGA-IFCM algorithm is then applied to solve the multiobjective optimization problem in Equation (10) [35,36]. Generally, solving a multiobjective optimization problem has an overall run-time complexity of O(GMN²), where G, M, and N are the number of generations, the number of objectives, and the population size, respectively. We adopted a hyper-grid scheme [37,38] to reduce the time complexity: the calculation of individual sharing functions is restricted to individuals in neighboring cells, and each cell contains only a few individuals. The overall time complexity is thus reduced to O(GMN).
Evolutionary algorithms (for instance, the genetic algorithm) tend to converge to a single solution as the diversity of the population diminishes [39]. This phenomenon is called genetic drift. The technique for maintaining a stable sub-population of diverse individuals according to the distance between individuals is called the niching technique. An individual's niche count is defined as the sum of the sharing function values sf(d(i,j)) between itself and every individual j in the population. We can define sf as a function of the distance d(i,j) between two population elements, as follows [40,41]:
$$sf(d) = \begin{cases} 1 - \left( \dfrac{d(i,j)}{\sigma_{dis}} \right)^{\alpha} & \text{if } d < \sigma_{dis} \\ 0 & \text{otherwise} \end{cases}$$
where σdis is the threshold for dissimilarity, and α = 1 is a constant regulating the shape of the sharing function. A direct sketch of this sharing function and the niche count is given below.
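The sketch transcribes Equation (14) and the niche count defined above (α = 1); the population is assumed to be an array of real-valued parameter vectors.

```python
import numpy as np

def sharing(d, sigma_dis, alpha=1.0):
    """Eq. (14): sf(d) = 1 - (d / sigma_dis)^alpha if d < sigma_dis, else 0."""
    return 1.0 - (d / sigma_dis) ** alpha if d < sigma_dis else 0.0

def niche_counts(population, sigma_dis):
    """Niche count of each individual: sum of sf(d(i, j)) over the population."""
    n = len(population)
    counts = np.zeros(n)
    for i in range(n):
        for j in range(n):
            counts[i] += sharing(np.linalg.norm(population[i] - population[j]), sigma_dis)
    return counts
```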
The optimal tradeoff solutions are thus found. Pseudo code of the algorithm can be found in Algorithm 1.
Algorithm 1 TrainSVC({Xp}train, b)
Input: {Xp}train, m = 2
Output: Model
1. Generation count g = 1;
2. Initialize the population;
3. Evaluate the individuals with the fitness functions (F1, F2, F3, and F4);
4. Rank the individuals by the fitness values from step 3;
5. Calculate the niche count;
while g <= max_generation do
6. Select two parents from the population;
7. Create offspring using roulette wheel selection (RWS), crossover, and mutation;
8. Train a MOGA-based IFCM model for each individual;
9. Evaluate the offspring with the fitness functions (F1, F2, F3, and F4);
10. Rank the offspring by their fitness values from step 9;
11. Calculate the niche count;
12. Form the new population from the offspring;
13. g = g + 1;
end while
Model ← Pareto solutions

3. Analysis and Results of CGAN-IFCM

The effectiveness of the proposed CGAN-IFCM is analyzed in five parts: (i) the performance of the proposed CGAN-IFCM; (ii) the necessity of CGAN, assessed by comparing the detection models using CGAN-IFCM and IFCM; (iii) the necessity of IFCM, assessed by comparing the detection models using CGAN-IFCM and CGAN-FCM; (iv) a comparison of CGAN-IFCM with two typical data generation methods (the synthetic minority oversampling technique (SMOTE) and cost-sensitive learning (CSL)); and (v) a comparison between CGAN-IFCM and existing methods. As previously noted, 10-fold cross-validation was adopted for performance evaluation, for which the value of 10 has been widely adopted in the literature [31,32].

3.1. Performance Evaluation of CGAN-IFCM

As shown in Table 1, there are four cases of formulations: (i) the binary detection model using the SVD database; (ii) the binary detection model using the VOICED database; (iii) the multi-class detection model using the SVD database; (iv) the multi-class detection model using the VOICED database. Section 3.1 will present and analyze the results of CGAN-IFCM in all these cases.

3.1.1. Binary Detection Model using CGAN-IFCM

In the proposed CGAN-IFCM, the number of clusters Nc and the control weighting factors ωn and ω0 strongly influence the joined degree of membership uVDC,pc and the cluster centers vVDC,c, and thus the performance (TNR and TPR) of the detection model.
The evaluation is first performed on the binary detection model using CGAN-IFCM with the SVD database. Table 2 summarizes the TNR and TPR of the selected scenarios with Nc = [3, 10] and ωn, ω0 = [2, 2.4], with step sizes of 1 and 0.1, respectively. Key observations can be drawn from this process:
  • (ωn = 2): the membership uVDC,pc is dominated by the upc term. Therefore, as ω0 increases, the TNR and TPR decrease.
  • (ωn = [2.1, 2.4]): the highest TNR and TPR are obtained at ω0 = 2, i.e., when ωn > ω0. Recall that Xp and its neighbors Xp′ share similar characteristics and tend to group under the same vVDC,c. However, when ωn ≤ ω0, both the TNR and TPR decrease.
  • The performance of CGAN-IFCM is better when Nc = [3, 5] compared to Nc = [6, 10] and deteriorates significantly when Nc > 5, as some clusters could be redundant and lead to errors in voice disorder detection.
As a result, the voice disorder detection model achieves higher accuracy when ωn > ω0, which reveals the effectiveness and necessity of the membership function uneighbor,pc and weighting factor ωn. On the other hand, when ωn ≤ ω0, uVDC,pc tends to be dominated by upc, and the performance is close to that of the existing FCM algorithm.
Next, we focus on the binary detection model with the VOICED database. Similarly, Table 3 summarizes the TNR and TPR of the selected scenarios with Nc = [3, 10] and ωn, ω0 = [2, 2.4], with step sizes of 1 and 0.1, respectively. Besides following the same observations shown in Table 2, there are two extra observations.
  • The detection model is dominated by the class of candidates with voice disorders because the number of samples with voice disorders remains larger than the number of healthy samples after the adoption of CGAN. In other words, as shown in Table 3, there is a notable gap between the TNR and TPR (on average, 3.94% versus the 0.79% shown in Table 2), where the TPR is higher than the TNR.
  • The overall accuracy of the binary detection model with the VOICED database is less than that with the SVD database. This could be explained in two ways: the class imbalance is more severe in VOICED, and the number of samples in the SVD is about 10 times that in VOICED.

3.1.2. Multi-Class Detection Model using CGAN-IFCM

For the construction of the multi-class detection model, the problem is extended to determining the exact type of voice disorder from among four possibilities: healthy, hyperkinetic dysphonia, hypokinetic dysphonia, and reflux laryngitis. First, the SVD database is considered. Table 4 presents the TNR and TPR of selected scenarios with Nc = [3, 10] and ωn, ω0 = [2, 2.4], with step sizes of 1 and 0.1, respectively. In addition to the first two observations noted for Table 2, there are three others.
  • The performance of CGAN-IFCM is better when Nc = [6, 8] compared to Nc = [3, 5] and Nc = [9, 10]. It achieves lower performance when Nc < 6 or Nc > 8, as either an insufficient number of clusters or redundant clusters are considered.
  • The detection model is dominated by the class of healthy candidates because the number of samples of healthy candidates is far greater than the number of samples of candidates with hyperkinetic dysphonia, hypokinetic dysphonia, or reflux laryngitis, even after the adoption of CGAN. In Table 4, the gap between the TNR and TPR, where the TNR is higher than the TPR, is further increased (on average, 4.56% versus the 3.94% shown in Table 3).
  • The overall accuracy of the multi-class detection model is lower than that of the binary-class detection model. Basically, the multi-class detection model is a more complicated problem compared to the binary classifier. Further, the issue of an imbalanced dataset is especially significant given the small sample size of patients with hypokinetic dysphonia (2% of the samples of healthy candidates).
Next, the multi-class detection model is implemented using the VOICED database. Table 5 presents the TNR and TPR of the selected scenarios with Nc = [3, 10] and ωn, ω0 = [2, 2.4], with step sizes of 1 and 0.1, respectively. In addition to the first two observations noted for Table 2, there are two others.
  • The performance of CGAN-IFCM is better when Nc = [6, 8] compared to Nc = [3, 5] and Nc = [9, 10]. It achieves lower performance when Nc < 6 or Nc > 8, as either an insufficient number of clusters or redundant clusters are considered.
  • The multi-class detection model is formulated with an equal sample size in each class after the adoption of CGAN. The difference between the TNR and TPR is not significant, and neither the TNR nor the TPR is consistently higher than the other.

3.2. Comparison Between CGAN-IFCM and IFCM

IFCM includes the relationship between Xp and its neighbors Xp′, which enhances the performance of the traditional FCM. As noted, CGAN generates new training data that not only lower the effect of imbalanced classes in the SVD and VOICED but also enhance the performance of the detection model. Table 6 summarizes the performance of the four cases of detection models using CGAN-IFCM and IFCM. The results reveal that CGAN-IFCM lowers the difference between the TNR and TPR by 58.4% (averaged over the four cases, 2.09% versus 5.02%). Further, the proposed CGAN-IFCM outperforms stand-alone IFCM by 10–12.6% and 5.8–16.2% in the TNR and TPR, respectively. This demonstrates the dual benefits of CGAN, which (i) reduces the effect of imbalanced classes (and thus lowers the difference between the TNR and TPR) and (ii) generates more data to enhance detection performance.

3.3. Comparison Between CGAN-IFCM and CGAN-FCM

To demonstrate the effectiveness of IFCM, the performance of the detection models using CGAN-IFCM and CGAN-FCM is compared and summarized in Table 7. The proposed CGAN-IFCM improves the TNR and TPR by 7.3–9% and 3.1–12%, respectively. Therefore, combining the results in Table 6 and Table 7 shows that both CGAN and IFCM help improve the detection accuracy of voice disorder detection models.

3.4. Comparison Between CGAN, SMOTE, and CSL

To study the effectiveness of CGAN in addressing the issue of an imbalanced dataset, a comparison is made with two typical approaches: the synthetic minority oversampling technique (SMOTE) [42,43] and cost-sensitive learning (CSL) [44,45]. Table 8 presents the performance of CGAN-IFCM, SMOTE-IFCM, and CSL-IFCM in binary and multi-class voice disorder detection. The results show that the proposed CGAN-IFCM model outperformed SMOTE-IFCM and CSL-IFCM by 4–6% in terms of the TNR and TPR. This result can be explained by the following factors: (i) the SVD and VOICED datasets contain conditional information that fits well with the theory of CGAN; (ii) SMOTE does not consider the possibility that neighboring points belong to other classes, which could increase class overlap and introduce additional noise; and (iii) CSL is prone to overfitting.

3.5. Comparison Between CGAN-IFCM and Existing Works

This subsection compares the performance of the proposed CGAN-IFCM with that of existing works [12,13,14,15,16,17,18]. Table 9 presents the databases, methodologies, types of cross-validation, and performance under the binary detection scenario for voice disorder detection. Existing works [12,16,17] did not apply cross-validation during their performance evaluations, so one could pick a favorable split of training and testing data to obtain excellent performance; therefore, the results in [12,16,17] are not directly comparable to those of the proposed CGAN-IFCM. In comparison with [13,14,15] using the SVD database, the proposed CGAN-IFCM improves the TNR and TPR by 12.9–43.5% and 9.1–44.8%, respectively. Further, compared to [18] using VOICED, CGAN-IFCM increases the TNR and TPR by 9.9% and 15.3%, respectively. Moreover, existing works [12,13,15,16,18] experienced the issue of imbalanced classes, featuring notable deviations between the TNR and TPR of 3.8% to 28.7%.
Notably, existing works [12,13,14,15,16,17,18] did not formulate the voice disorder detection problem as one of multi-class detection, thereby precluding a direct comparison. Thus, in this paper, we extend the formulation to the multi-class detection problem for detecting three common types of voice disorders: hyperkinetic dysphonia, hypokinetic dysphonia, and reflux laryngitis. Table 10 summarizes the TNR and TPR of the proposed CGAN-IFCM and two typical algorithms: random forest (RF) and SVM (with a radial basis kernel function). The same feature vector is utilized to ensure a fair comparison. CGAN-IFCM achieved a higher TNR and TPR compared to RF and SVM for three key reasons: (i) CGAN reduces the issue of imbalanced datasets; (ii) CGAN increases the amount of training data; and (iii) IFCM includes the relationship between Xp and its neighbors Xp′. Satisfactory performance (only a small deterioration of the TNR and TPR compared with the binary detection model) was achieved by CGAN-IFCM.

3.6. Comparison Between CGAN-IFCM and Other Approaches Using the Wilcoxon Signed-Rank Test

We analyzed the performance of the proposed CGAN-IFCM and various other approaches in Section 3.1, Section 3.2, Section 3.3, Section 3.4 and Section 3.5. To determine whether CGAN-IFCM statistically outperformed the other approaches, a t-test is not suitable because both the TNR and TPR are bounded by 0% and 100% and therefore do not follow a normal distribution [46,47]. The non-parametric Wilcoxon signed-rank test [48,49] was chosen to confirm whether the improvement of CGAN-IFCM over the other approaches is statistically significant. We assumed a significance level of 0.05. H0 and Ha denote the null hypothesis and alternative hypothesis, respectively. Equivalent accuracy (EQA) is defined as the weighted sum of the TNR and TPR of the model. Table 11 summarizes the results of the Wilcoxon signed-rank test, where the subscript of EQA denotes the method used in the voice disorder detection model. The results reveal that the p-values in all cases are less than 0.05, so the proposed CGAN-IFCM statistically outperforms the aforesaid approaches presented in Section 3.1, Section 3.2, Section 3.3, Section 3.4 and Section 3.5. A sketch of the test is given below.
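The sketch below runs the test with SciPy; the EQA values are hypothetical placeholders, and the fold-level pairing of scores is our assumption about how the paired samples were formed.

```python
from scipy.stats import wilcoxon

# Hypothetical fold-level EQA scores (placeholders, not results from this paper)
eqa_cgan_ifcm = [0.93, 0.94, 0.92, 0.95, 0.93, 0.94, 0.92, 0.93, 0.95, 0.94]
eqa_baseline  = [0.88, 0.90, 0.87, 0.91, 0.89, 0.88, 0.90, 0.87, 0.89, 0.90]

# Two-sided Wilcoxon signed-rank test on the paired differences
stat, p = wilcoxon(eqa_cgan_ifcm, eqa_baseline)
print(f"W = {stat:.1f}, p = {p:.4f}")   # here p < 0.05, rejecting H0 (no difference)
```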

4. Conclusions

In this paper, a conditional generative adversarial network (CGAN) and improved fuzzy c-means clustering (IFCM) algorithm named CGAN-IFCM was proposed. CGAN demonstrated its effectiveness in generating new training data, reducing the effect of imbalanced classes and thus the deviation between the true positive rate and true negative rate of the detection model. CGAN improved the TNR and TPR by 10–12.6% and 5.8–16.2%, respectively. IFCM addresses a limitation of the existing FCM and enables multi-class voice disorder detection for three common types of voice disorders. IFCM introduces the interaction between a data point and its neighboring data points in feature space into the fuzzy membership function. The results show that IFCM improves the TNR and TPR by 7.3–9% and 3.1–12%, respectively. We also discussed the advantages of CGAN for managing an imbalanced dataset with conditional information; this model increased performance by 4–6% compared to the traditional SMOTE and CSL methods. In addition, the performance of the proposed CGAN-IFCM was compared with the results of existing works, demonstrating TNR and TPR enhancements of 9.9–43.5% and 9.1–44.8%, respectively.
Future research directions are suggested as follows: (i) merging the SVD and VOICED databases to enlarge the number of samples in each class, which will require data heterogeneity to be properly handled; and (ii) extending the multi-class detection model to incorporate more types of voice disorders. The performance of the detection model is expected to decrease as the number of classes (and thus the complexity of the model) increases; therefore, further studies on feature extraction and model construction should be carried out.

Author Contributions

Formal analysis, K.T.C., M.D.L. and P.V.; investigation, K.T.C., M.D.L. and P.V.; methodology, K.T.C.; validation, K.T.C., M.D.L. and P.V.; visualization, K.T.C. and M.D.L.; writing—original draft, K.T.C., M.D.L. and P.V.; writing—review and editing, K.T.C., M.D.L. and P.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Vilkman, E. Voice problems at work: A challenge for occupational safety and health arrangement. Folia Phoniatrica et Logopaedica 2000, 52, 120–125. [Google Scholar] [CrossRef] [PubMed]
  2. Dodderi, T.; Philip, N.E.; Mutum, K. Prevalence of voice disorders in the Department of Speech Language Pathology of a tertiary care hospital of Mangaluru: A retrospective study of 11 years. Nitte Univ. J. Health Sci. 2018, 8, 12–16. [Google Scholar]
  3. Lyberg-Åhlander, V.; Rydell, R.; Fredlund, P.; Magnusson, C.; Wilén, S. Prevalence of voice disorders in the general population, based on the Stockholm public health cohort. J. Voice 2019, 33, 900–905. [Google Scholar] [CrossRef]
  4. Vertanen-Greis, H.; Löyttyniemi, E.; Uitti, J. Voice disorders are associated with stress among teachers: A cross-sectional study in Finland. J. Voice 2018, 34, 488.e1–488.e8. [Google Scholar] [CrossRef]
  5. Roy, N.; Merrill, R.M.; Gray, S.D.; Smith, E.M. Voice disorders in the general population: Prevalence, risk factors, and occupational impact. Laryngoscope 2005, 115, 1988–1995. [Google Scholar] [CrossRef] [PubMed]
  6. Leão, S.H.D.S.; Oates, J.M.; Purdy, S.C.; Scott, D.; Morton, R.P. Voice problems in New Zealand teachers: A national survey. J. Voice 2015, 29, 645-e1. [Google Scholar] [CrossRef]
  7. Muhammad, G.; Alhamid, M.F.; Alsulaiman, M.; Gupta, B. Edge computing with cloud for voice disorder assessment and treatment. IEEE Commun. Mag. 2018, 56, 60–65. [Google Scholar] [CrossRef]
  8. Alhussein, M.; Muhammad, G. Voice pathology detection using deep learning on mobile healthcare framework. IEEE Access 2018, 6, 41034–41041. [Google Scholar] [CrossRef]
  9. Amami, R.; Amami, R.; Eleraky, H.A. An Incremental System for Voice Pathology Detection Combining Possibilistic SVM and HMM. In Proceedings of the International Conference on Statistical Language and Speech Processing, Ljubljana, Slovenia, 14–16 October 2019; pp. 127–138. [Google Scholar]
  10. Fang, S.H.; Tsao, Y.; Hsiao, M.J.; Chen, J.Y.; Lai, Y.H.; Lin, F.C.; Wang, C.T. Detection of pathological voice using cepstrum vectors: A deep learning approach. J. Voice 2019, 33, 634–641. [Google Scholar] [CrossRef] [PubMed]
  11. Ali, Z.; Imran, M.; Alsulaiman, M.; Zia, T.; Shoaib, M. A zero-watermarking algorithm for privacy protection in biomedical signals. Future Gener. Comput. Syst 2018, 82, 290–303. [Google Scholar] [CrossRef] [Green Version]
  12. Amara, F.; Fezari, M.; Bourouba, H. An improved GMM-SVM system based on distance metric for voice pathology detection. Appl. Math 2016, 10, 1061–1070. [Google Scholar] [CrossRef]
  13. Verde, L.; De Pietro, G.; Sannino, G. Voice disorder identification by using machine learning techniques. IEEE Access 2018, 6, 16246–16255. [Google Scholar] [CrossRef]
  14. Guedes, V.; Teixeira, F.; Oliveira, A.; Fernandes, J.; Silva, L.; Junior, A.; Teixeira, J.P. Transfer Learning with AudioSet to Voice Pathologies Identification in Continuous Speech. Procedia Comput. Sci. 2019, 164, 662–669. [Google Scholar] [CrossRef]
  15. Kadiri, S.R.; Alku, P. Analysis and Detection of Pathological Voice using Glottal Source Features. IEEE J. Sel. Top. Signal Process. 2020, 14, 367–379. [Google Scholar] [CrossRef] [Green Version]
  16. Verde, L.; De Pietro, G.; Alrashoud, M.; Ghoneim, A.; Al-Mutib, K.N.; Sannino, G. Dysphonia Detection Index (DDI): A New Multi-Parametric Marker to Evaluate Voice Quality. IEEE Access 2019, 7, 55689–55697. [Google Scholar] [CrossRef]
  17. Chen, L.; Wang, C.; Chen, J.; Xiang, Z.; Hu, X. Voice Disorder Identification by using Hilbert-Huang Transform (HHT) and K Nearest Neighbor (KNN). J. Voice 2020. [Google Scholar] [CrossRef]
  18. Verde, L.; De Pietro, G.; Alrashoud, M.; Ghoneim, A.; Al-Mutib, K.N.; Sannino, G. Leveraging Artificial Intelligence to Improve Voice Disorder Identification Through the Use of a Reliable Mobile App. IEEE Access 2019, 7, 124048–124054. [Google Scholar] [CrossRef]
  19. Pützer, M.; Koreman, J. A German database of patterns of pathological vocal fold vibration. Phonus 1997, 3, 143–153. [Google Scholar]
  20. Saarbruecken Voice Database: Handbook. Available online: http://www.stimmdatenbank.coli.uni-saarland.de/help_en.php4 (accessed on 20 February 2020).
  21. Cesari, U.; De Pietro, G.; Marciano, E.; Niri, C.; Sannino, G.; Verde, L. A new database of healthy and pathological voices. Comput. Elect. Eng. 2018, 68, 310–321. [Google Scholar] [CrossRef]
  22. Pan, Z.; Yu, W.; Yi, X.; Khan, A.; Yuan, F.; Zheng, Y. Recent progress on generative adversarial networks (GANs): A survey. IEEE Access 2019, 7, 36322–36333. [Google Scholar] [CrossRef]
  23. Odena, A.; Olah, C.; Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2642–2651. [Google Scholar]
  24. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. Available online: https://arxiv.org/abs/1411.1784 (accessed on 10 April 2020).
  25. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2172–2180. [Google Scholar]
  26. Brockmann, M.; Drinnan, M.J.; Storck, C.; Carding, P.N. Reliable jitter and shimmer measurements in voice clinics: The relevance of vowel, gender, vocal intensity, and fundamental frequency effects in a typical clinical task. J. Voice 2011, 25, 44–53. [Google Scholar] [CrossRef] [PubMed]
  27. Lopes, L.W.; da Silva, J.D.; Simões, L.B.; da Silva Evangelista, D.; Silva, P.O.C.; Almeida, A.A.; de Lima-Silva, M.F.B. Relationship between acoustic measurements and self-evaluation in patients with voice disorders. J. Voice 2017, 31, 119.e1–119.e10. [Google Scholar] [CrossRef] [PubMed]
  28. Severin, F.; Bozkurt, B.; Dutoit, T. HNR extraction in voiced speech, oriented towards voice quality analysis. In Proceedings of the 2005 13th European Signal Processing Conference, Antalya, Turkey, 4–8 September 2005; pp. 1–4. [Google Scholar]
  29. Farrús, M.; Hernando, J.; Ejarque, P. Jitter and shimmer measurements for speaker recognition. In Proceedings of the Eighth Annual Conference of the International Speech Communication Association, Antwerp, Belgium, 27–31 August 2007; pp. 778–781. [Google Scholar]
  30. Verde, L.; De Pietro, G.; Sannino, G. A methodology for voice classification based on the personalized fundamental frequency estimation. Biomed. Signal Process. Control 2018, 42, 134–144. [Google Scholar] [CrossRef]
  31. Grimm, K.J.; Mazza, G.L.; Davoudzadeh, P. Model selection in finite mixture models: A k-fold cross-validation approach. Struct. Equ. Model. 2017, 24, 246–256. [Google Scholar] [CrossRef]
  32. Varoquaux, G.; Raamana, P.R.; Engemann, D.A.; Hoyos-Idrobo, A.; Schwartz, Y.; Thirion, B. Assessing and tuning brain decoders: Cross-validation, caveats, and guidelines. NeuroImage 2017, 145, 166–179. [Google Scholar] [CrossRef] [Green Version]
  33. Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms; Kluwer Academic Publishers: Norwell, MA, USA, 1981. [Google Scholar]
  34. Maulik, U.; Bandyopadhyay, S. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1650–1654. [Google Scholar] [CrossRef] [Green Version]
  35. Fonseca, C.M.; Fleming, P. Genetic algorithms for multi-objective optimization: Formulation, discussion, and generalization. In Proceedings of the 5th International Conference on Genetic Algorithms, Urbana-Champaign, IL, USA, 17–21 July 1993; Morgan Kaufmann: San Francisco, CA, USA, 1993; pp. 416–423. [Google Scholar]
  36. Deb, K. Multi-Objective Optimization Using Evolutionary Algorithms; John Wiley & Sons, Inc.: New York, NY, USA, 2001. [Google Scholar]
  37. Jensen, M.T. Reducing the run-time complexity of multiobjective EAs: The NSGA-II and other algorithms. IEEE Trans. Evol. Comput. 2003, 7, 503–515. [Google Scholar] [CrossRef]
  38. Dutta, S.; Das, K.N. A survey on pareto-based eas to solve multi-objective optimization problems. In Soft Computing for Problem Solving; Bansal, J., Das, K., Nagar, A., Deep, K., Ojha, A., Eds.; Advances in Intelligent Systems and Computing; Springer: Singapore, 2019. [Google Scholar]
  39. Goldberg, D.; Richardson, J. Genetic Algorithms with Sharing for Multi-modal Function Optimization. In Proceedings of the International Conference on Genetic Algorithms, Cambridge, MA, USA, 28–31 July 1987; pp. 41–49. [Google Scholar]
  40. Mahfoud, S.W. Niching Methods for Genetic Algorithms. Ph.D. Thesis, University of Illinois at Urbana-Champaign, Urbana Champaign, IL, USA, 1995. [Google Scholar]
  41. Ji, J.Y.; Yu, W.J.; Zhong, J.; Zhang, J. Density-Enhanced Multiobjective Evolutionary Approach for Power Economic Dispatch Problems. IEEE Trans. Syst. Man Cybern. Syst. 2019. [Google Scholar] [CrossRef]
  42. Maldonado, S.; López, J.; Vairetti, C. An alternative SMOTE oversampling strategy for high-dimensional datasets. Appl. Soft Comput. 2019, 76, 380–389. [Google Scholar] [CrossRef]
  43. Sun, J.; Li, H.; Fujita, H.; Fu, B.; Ai, W. Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting. Inf. Fusion 2020, 54, 128–144. [Google Scholar] [CrossRef]
  44. Jia, X.; Li, W.; Shang, L. A multiphase cost-sensitive learning method based on the multiclass three-way decision-theoretic rough set model. Inf. Sci. 2019, 485, 248–262. [Google Scholar] [CrossRef]
  45. Feng, F.; Li, K.C.; Shen, J.; Zhou, Q.; Yang, X. Using cost-sensitive learning and feature selection algorithms to improve the performance of imbalanced classification. IEEE Access 2020, 8, 69979–69996. [Google Scholar] [CrossRef]
  46. Limpert, E.; Stahel, W.A. Problems with using the normal distribution–and ways to improve quality and efficiency of data analysis. PLoS ONE 2011, 6, e21403. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  47. De Winter, J.C. Using the Student’s t-test with extremely small sample sizes. Pract. Assess. Res. Eval. 2013, 18, 10. [Google Scholar]
  48. Nguyen, K.A.; Chen, W.; Lin, B.S.; Seeboonruang, U. Using Machine Learning-Based Algorithms to Analyze Erosion Rates of a Watershed in Northern Taiwan. Sustainability 2020, 12, 2022. [Google Scholar] [CrossRef] [Green Version]
  49. Meek, G.E.; Ozgur, C.; Dunning, K. Comparison of the t vs. Wilcoxon signed-rank test for Likert scale data and small samples. J. Mod. Appl. Stat. Methods 2007, 6, 10. [Google Scholar] [CrossRef]
Figure 1. System workflow of the conditional generative adversarial network (CGAN).
Table 1. Key information on the Saarbruecken Voice Database (SVD) and the Voice ICar fEDerico II Database (VOICED).

| Database | Signal Characteristics | Information Included | Binary Detection Model | Multi-Class Detection Model |
|---|---|---|---|---|
| SVD [19,20] | 16-bit resolution at 50 kHz | Gender, age, and clinical diagnosis | 869 healthy candidates and 1356 candidates with voice disorders | 869 healthy candidates, 213 with hyperkinetic dysphonia, 16 with hypokinetic dysphonia, and 140 with reflux laryngitis |
| VOICED [21] | 32-bit resolution at 8 kHz | Gender, age, clinical diagnosis, smoking status, alcohol consumption, hydration, eating habits, voice handicap index, and reflux symptom index | 58 healthy candidates and 150 candidates with voice disorders | 58 healthy candidates, 70 with hyperkinetic dysphonia, 41 with hypokinetic dysphonia, and 39 with reflux laryngitis |
Table 2. Performance evaluation of the binary detection model using CGAN-improved fuzzy c-means clustering (IFCM) with varying Nc, ωn, and ω0 values using the SVD database. Each cell shows TNR (%)/TPR (%); rows are grouped by fixed ωn = 2.0, 2.1, 2.2, 2.3, and 2.4.

| ωn | ω0 | Nc = 3 | Nc = 4 | Nc = 5 | Nc = 6 | Nc = 7 | Nc = 8 | Nc = 9 | Nc = 10 |
|---|---|---|---|---|---|---|---|---|---|
| 2.0 | 2.0 | 90.9/90.2 | 93.6/95.2 | 91.3/93.6 | 88.2/89.3 | 85.3/86.3 | 82.6/83.3 | 77.7/78.8 | 74.3/73.8 |
| 2.0 | 2.1 | 87.8/88.5 | 89.7/90.2 | 87.3/88.2 | 85.4/87.0 | 82.9/82.1 | 80.1/80.8 | 75.6/74.5 | 72.3/71.7 |
| 2.0 | 2.2 | 86.1/85.8 | 87.7/86.9 | 86.8/85.0 | 84.2/84.5 | 81.3/81.5 | 78.8/78.0 | 74.1/73.2 | 70.8/71.8 |
| 2.0 | 2.3 | 84.9/84.6 | 85.8/86.1 | 84.2/83.5 | 83.3/82.7 | 80.8/81.2 | 77.1/77.3 | 73.8/72.5 | 69.8/70.2 |
| 2.0 | 2.4 | 82.5/82.1 | 83.9/84.1 | 83.1/82.7 | 82.4/81.8 | 80.4/79.7 | 75.5/76.3 | 71.8/73.1 | 68.6/69.1 |
| 2.1 | 2.0 | 91.6/91.8 | 94.3/95.1 | 92.8/92.1 | 89.7/90.6 | 86.8/85.6 | 84.2/84.8 | 78.3/79.6 | 74.3/73.8 |
| 2.1 | 2.1 | 90.3/91.2 | 93.7/92.8 | 92.0/91.7 | 90.1/89.2 | 86.3/85.7 | 83.5/82.9 | 77.3/77.8 | 73.6/73.1 |
| 2.1 | 2.2 | 89.1/89.4 | 91.6/92.0 | 90.7/91.1 | 89.5/88.8 | 85.3/84.6 | 82.4/82.1 | 76.5/76.9 | 72.1/72.9 |
| 2.1 | 2.3 | 87.6/87.0 | 90.2/90.5 | 88.9/87.7 | 87.4/85.8 | 84.7/84.0 | 81.8/81.1 | 75.9/75.2 | 71.4/72.0 |
| 2.1 | 2.4 | 85.8/85.1 | 88.4/88.9 | 86.5/86.1 | 85.4/85.1 | 84.1/84.5 | 80.6/80.0 | 73.6/73.5 | 72.1/71.2 |
| 2.2 | 2.0 | 91.9/92.3 | 94.6/95.4 | 93.2/92.3 | 89.9/91.0 | 87.5/85.9 | 84.6/85.2 | 78.7/78.2 | 74.8/74.2 |
| 2.2 | 2.1 | 90.6/91.4 | 94.1/93.2 | 92.4/92.2 | 90.5/89.6 | 86.7/86.2 | 84.0/83.3 | 77.7/78.1 | 73.9/73.3 |
| 2.2 | 2.2 | 89.6/89.9 | 91.8/92.3 | 91.0/91.3 | 89.8/88.9 | 85.7/84.9 | 82.8/82.4 | 76.9/77.2 | 72.4/73.5 |
| 2.2 | 2.3 | 87.8/87.3 | 90.4/90.6 | 89.2/87.9 | 87.8/87.5 | 85.2/84.5 | 82.4/81.3 | 76.3/75.5 | 71.7/72.3 |
| 2.2 | 2.4 | 86.0/85.5 | 88.8/89.2 | 86.8/86.3 | 85.8/86.3 | 84.6/84.8 | 80.9/80.3 | 74.0/73.9 | 72.5/71.4 |
| 2.3 | 2.0 | 91.2/92.3 | 94.3/95.3 | 93.1/91.9 | 89.8/90.6 | 86.9/85.7 | 84.6/84.6 | 77.6/77.2 | 73.0/73.5 |
| 2.3 | 2.1 | 90.1/90.4 | 93.8/92.9 | 91.3/91.6 | 90.0/89.1 | 85.8/85.3 | 83.3/82.9 | 77.3/77.1 | 72.8/72.3 |
| 2.3 | 2.2 | 87.8/88.2 | 90.2/91.1 | 90.3/89.9 | 88.7/88.1 | 84.9/84.1 | 82.2/81.8 | 76.4/76.8 | 71.1/70.9 |
| 2.3 | 2.3 | 86.5/87.1 | 88.9/89.5 | 89.6/88.8 | 87.4/87.0 | 84.2/83.6 | 81.2/80.5 | 75.6/76.1 | 70.9/70.3 |
| 2.3 | 2.4 | 85.4/86.3 | 87.5/88.2 | 88.2/87.3 | 86.8/86.2 | 83.6/82.8 | 80.3/79.7 | 74.2/74.5 | 69.5/69.8 |
| 2.4 | 2.0 | 90.4/91.8 | 93.1/94.2 | 92.5/91.3 | 88.6/89.5 | 86.1/84.9 | 83.5/83.9 | 76.9/76.3 | 72.1/72.3 |
| 2.4 | 2.1 | 89.2/89.1 | 92.8/92.4 | 90.7/91.1 | 89.2/88.4 | 85.2/84.7 | 82.7/82.2 | 76.7/76.2 | 72.0/71.5 |
| 2.4 | 2.2 | 87.5/87.9 | 89.7/90.8 | 89.6/89.2 | 88.2/87.4 | 84.4/82.6 | 81.7/81.1 | 75.8/76.0 | 70.4/70.1 |
| 2.4 | 2.3 | 85.8/86.3 | 88.1/88.9 | 89.0/88.1 | 86.9/86.3 | 83.8/83.2 | 80.4/79.8 | 75.1/75.7 | 70.2/69.9 |
| 2.4 | 2.4 | 84.9/85.8 | 87.1/87.8 | 87.5/86.8 | 86.2/85.8 | 83.1/82.5 | 80.0/79.5 | 73.7/74.1 | 69.2/69.5 |
Table 3. Performance evaluation of the binary detection model using CGAN-IFCM with varying Nc, ωn, and ω0 values with the VOICED database. Each cell shows TNR (%)/TPR (%); rows are grouped by fixed ωn = 2.0, 2.1, 2.2, 2.3, and 2.4.

| ωn | ω0 | Nc = 3 | Nc = 4 | Nc = 5 | Nc = 6 | Nc = 7 | Nc = 8 | Nc = 9 | Nc = 10 |
|---|---|---|---|---|---|---|---|---|---|
| 2.0 | 2.0 | 83.7/87.9 | 88.3/92.1 | 87.5/90.8 | 84.6/87.9 | 81.6/84.7 | 79.8/82.3 | 73.7/77.3 | 70.2/73.0 |
| 2.0 | 2.1 | 82.1/86.0 | 86.8/89.6 | 85.9/88.8 | 83.7/86.9 | 79.7/82.4 | 77.5/80.0 | 73.5/75.9 | 69.8/72.1 |
| 2.0 | 2.2 | 81.3/85.2 | 85.1/87.5 | 83.1/85.5 | 82.1/85.6 | 78.9/81.9 | 76.4/78.9 | 71.5/74.2 | 68.6/71.5 |
| 2.0 | 2.3 | 80.2/84.8 | 83.3/86.6 | 81.8/84.2 | 81.0/83.9 | 77.5/81.4 | 74.9/77.8 | 70.6/73.2 | 67.3/70.8 |
| 2.0 | 2.4 | 79.7/83.5 | 82.1/85.3 | 80.2/83.5 | 80.3/82.6 | 76.9/80.3 | 73.8/77.2 | 69.7/72.4 | 66.8/70.1 |
| 2.1 | 2.0 | 87.0/90.4 | 90.9/93.6 | 89.7/92.3 | 86.3/90.8 | 83.2/86.5 | 81.4/84.5 | 76.6/80.3 | 71.5/74.9 |
| 2.1 | 2.1 | 86.2/89.7 | 89.6/92.2 | 88.3/91.5 | 85.6/89.3 | 81.8/85.5 | 80.1/83.5 | 75.8/78.5 | 70.1/73.8 |
| 2.1 | 2.2 | 85.5/88.8 | 88.7/91.5 | 87.4/90.9 | 84.9/88.5 | 80.4/84.8 | 80.0/83.1 | 74.2/77.4 | 68.8/72.8 |
| 2.1 | 2.3 | 84.7/87.3 | 87.4/90.9 | 85.3/88.6 | 83.8/87.4 | 79.6/83.8 | 78.7/82.0 | 73.3/76.8 | 67.9/71.7 |
| 2.1 | 2.4 | 82.8/85.7 | 85.3/89.2 | 84.2/87.2 | 82.7/86.3 | 78.6/82.9 | 77.3/80.5 | 71.0/74.2 | 67.3/71.1 |
| 2.2 | 2.0 | 88.2/91.4 | 91.1/93.8 | 89.9/92.5 | 86.7/91.6 | 84.0/87.4 | 82.1/86.1 | 77.6/80.3 | 72.6/75.6 |
| 2.2 | 2.1 | 87.2/90.6 | 90.2/93.1 | 89.3/92.1 | 86.9/90.3 | 83.0/86.6 | 81.7/84.0 | 75.1/79.2 | 71.5/74.2 |
| 2.2 | 2.2 | 86.1/89.4 | 89.2/92.0 | 88.1/91.5 | 85.6/89.4 | 82.4/85.3 | 81.4/83.7 | 74.5/78.6 | 69.3/73.3 |
| 2.2 | 2.3 | 85.2/88.1 | 88.2/91.3 | 85.7/88.7 | 84.9/88.1 | 81.8/84.7 | 80.1/83.2 | 73.6/76.3 | 68.9/72.1 |
| 2.2 | 2.4 | 83.1/86.7 | 86.4/90.6 | 83.8/87.9 | 82.8/87.2 | 80.6/83.5 | 78.7/81.9 | 72.3/75.6 | 68.6/71.8 |
| 2.3 | 2.0 | 89.2/92.3 | 91.5/94.3 | 90.6/93.4 | 87.8/92.0 | 85.3/88.6 | 82.5/86.8 | 78.0/81.6 | 72.9/76.3 |
| 2.3 | 2.1 | 87.8/90.9 | 88.6/93.6 | 88.5/92.5 | 87.4/90.5 | 83.6/87.7 | 81.9/84.6 | 75.5/79.8 | 72.1/74.7 |
| 2.3 | 2.2 | 86.6/90.1 | 88.4/91.5 | 88.2/90.3 | 86.5/89.8 | 83.4/86.3 | 81.6/82.5 | 74.9/79.3 | 70.2/72.5 |
| 2.3 | 2.3 | 85.5/88.5 | 88.4/91.1 | 86.3/89.4 | 85.7/88.4 | 82.5/85.8 | 80.2/83.1 | 74.2/76.8 | 69.6/72.0 |
| 2.3 | 2.4 | 84.1/87.3 | 87.1/90.9 | 85.5/88.8 | 84.5/87.8 | 81.1/84.3 | 79.8/80.5 | 73.2/76.2 | 69.1/71.2 |
| 2.4 | 2.0 | 87.1/90.4 | 88.5/92.3 | 86.7/89.9 | 85.2/88.0 | 83.8/86.6 | 81.8/85.3 | 76.5/79.8 | 72.5/75.8 |
| 2.4 | 2.1 | 85.8/88.5 | 87.4/90.7 | 86.2/89.3 | 84.6/87.1 | 83.0/85.9 | 79.7/83.8 | 75.3/78.9 | 71.7/74.9 |
| 2.4 | 2.2 | 85.0/87.5 | 86.5/89.2 | 85.3/88.5 | 84.1/87.0 | 82.2/85.3 | 78.5/82.9 | 74.3/77.0 | 70.4/73.6 |
| 2.4 | 2.3 | 84.2/87.2 | 85.5/88.6 | 84.7/87.5 | 83.6/86.6 | 81.7/84.3 | 79.8/80.3 | 73.7/76.3 | 69.7/72.4 |
| 2.4 | 2.4 | 83.4/86.5 | 84.7/87.3 | 83.7/86.6 | 82.7/85.5 | 82.0/83.7 | 76.3/79.3 | 72.4/75.6 | 68.5/70.7 |
Table 4. Performance evaluation of the multi-class detection model using CGAN-IFCM while varying Nc, ωn, and ω0 with the SVD database. Entries are TNR (%)/TPR (%); ωn and ω0 are the control weighting factors.

| ωn | ω0 | Nc = 3 | Nc = 4 | Nc = 5 | Nc = 6 | Nc = 7 | Nc = 8 | Nc = 9 | Nc = 10 |
|----|----|--------|--------|--------|--------|--------|--------|--------|---------|
| 2.0 | 2.0 | 83.2/79.6 | 84.3/80.0 | 85.6/80.8 | 86.1/81.9 | 90.3/86.3 | 89.2/85.1 | 86.0/83.6 | 83.8/79.5 |
| 2.0 | 2.1 | 82.2/78.7 | 83.0/79.1 | 83.5/79.6 | 84.4/80.3 | 87.8/84.6 | 87.0/83.8 | 85.4/82.5 | 81.2/77.7 |
| 2.0 | 2.2 | 81.2/77.0 | 82.0/77.9 | 82.6/78.7 | 83.4/79.8 | 85.7/83.6 | 83.6/81.4 | 83.8/80.2 | 80.1/76.8 |
| 2.0 | 2.3 | 79.8/75.9 | 80.6/76.8 | 81.6/77.9 | 82.2/78.4 | 84.8/81.8 | 82.1/80.1 | 82.3/79.4 | 79.7/75.3 |
| 2.0 | 2.4 | 78.9/75.5 | 80.0/75.7 | 80.3/76.5 | 81.3/77.5 | 83.7/80.6 | 81.7/78.6 | 81.1/78.5 | 78.3/74.3 |
| 2.1 | 2.0 | 85.4/80.3 | 86.2/81.2 | 87.0/82.3 | 88.2/84.6 | 91.2/86.8 | 89.9/85.5 | 88.3/84.9 | 86.1/82.6 |
| 2.1 | 2.1 | 84.2/79.5 | 85.4/80.6 | 86.3/81.4 | 87.4/83.2 | 90.3/85.6 | 89.5/84.7 | 87.6/83.1 | 85.3/81.3 |
| 2.1 | 2.2 | 83.6/78.7 | 84.3/80.0 | 85.2/80.3 | 86.3/82.3 | 89.5/84.2 | 88.6/83.6 | 87.1/82.3 | 84.4/80.5 |
| 2.1 | 2.3 | 82.5/77.7 | 83.2/79.1 | 84.1/79.7 | 85.3/81.6 | 88.8/83.1 | 87.5/82.6 | 86.3/81.2 | 83.2/79.8 |
| 2.1 | 2.4 | 81.3/76.9 | 82.4/78.1 | 83.4/78.7 | 84.5/80.9 | 86.3/82.0 | 84.9/81.3 | 84.1/80.5 | 82.3/79.1 |
| 2.2 | 2.0 | 86.1/81.8 | 87.2/82.5 | 88.2/84.0 | 89.3/85.6 | 91.6/87.6 | 90.3/86.4 | 88.9/85.3 | 87.6/84.1 |
| 2.2 | 2.1 | 85.2/80.5 | 86.0/81.3 | 87.1/83.5 | 88.0/84.2 | 90.3/86.3 | 89.6/85.0 | 87.8/84.1 | 86.4/83.7 |
| 2.2 | 2.2 | 84.3/79.6 | 85.3/80.7 | 86.3/82.0 | 87.2/83.6 | 89.6/85.5 | 88.8/84.3 | 86.3/83.2 | 85.6/82.7 |
| 2.2 | 2.3 | 83.4/78.7 | 84.6/79.9 | 85.6/80.9 | 86.4/82.8 | 89.1/84.6 | 88.5/83.8 | 85.5/82.6 | 84.8/81.2 |
| 2.2 | 2.4 | 81.5/77.9 | 82.6/79.2 | 83.3/80.0 | 83.8/81.8 | 87.5/83.8 | 86.8/82.6 | 85.0/81.3 | 83.6/80.8 |
| 2.3 | 2.0 | 87.6/82.1 | 88.5/83.4 | 89.4/85.3 | 90.6/85.9 | 92.1/88.5 | 90.4/86.9 | 89.7/86.2 | 88.8/85.3 |
| 2.3 | 2.1 | 86.3/81.3 | 87.6/82.8 | 88.5/84.5 | 88.8/84.8 | 91.2/87.3 | 90.0/85.8 | 88.7/85.1 | 87.2/84.6 |
| 2.3 | 2.2 | 85.4/80.1 | 86.2/81.3 | 87.3/83.6 | 86.3/84.1 | 90.2/86.6 | 89.3/85.3 | 87.2/84.1 | 86.6/83.5 |
| 2.3 | 2.3 | 84.6/79.6 | 85.8/80.5 | 86.2/81.3 | 85.3/83.6 | 89.7/85.3 | 89.0/84.2 | 86.0/83.5 | 85.2/82.6 |
| 2.3 | 2.4 | 82.5/79.3 | 83.1/80.5 | 83.8/81.5 | 84.3/82.5 | 88.8/84.3 | 87.8/83.6 | 86.3/82.6 | 83.9/81.2 |
| 2.4 | 2.0 | 84.5/80.0 | 85.8/81.3 | 87.1/82.6 | 88.4/83.5 | 90.2/86.1 | 87.3/84.8 | 85.1/83.7 | 83.6/82.4 |
| 2.4 | 2.1 | 83.8/79.1 | 85.0/81.2 | 86.0/81.4 | 87.1/82.3 | 89.6/85.2 | 86.6/83.1 | 84.0/82.5 | 82.3/81.5 |
| 2.4 | 2.2 | 82.4/78.0 | 83.5/79.5 | 84.2/80.3 | 85.1/81.6 | 88.2/84.5 | 85.8/82.2 | 82.6/81.7 | 81.0/80.4 |
| 2.4 | 2.3 | 81.3/77.2 | 82.4/78.7 | 83.5/79.6 | 84.7/80.8 | 86.8/83.2 | 84.9/81.6 | 81.1/80.8 | 80.1/79.8 |
| 2.4 | 2.4 | 80.4/76.3 | 81.5/77.5 | 82.0/78.5 | 83.2/79.4 | 85.6/81.3 | 83.6/80.3 | 80.2/78.1 | 79.6/76.7 |
Table 5. Performance evaluation of the multi-class detection model using CGAN-IFCM under varying Nc, ωn, and ω0 values with the VOICED database. Entries are TNR (%)/TPR (%); ωn and ω0 are the control weighting factors.

| ωn | ω0 | Nc = 3 | Nc = 4 | Nc = 5 | Nc = 6 | Nc = 7 | Nc = 8 | Nc = 9 | Nc = 10 |
|----|----|--------|--------|--------|--------|--------|--------|--------|---------|
| 2.0 | 2.0 | 81.5/82.1 | 83.0/82.6 | 83.4/82.7 | 84.9/85.3 | 87.6/87.0 | 86.2/86.7 | 84.3/83.1 | 82.6/81.8 |
| 2.0 | 2.1 | 80.3/81.4 | 82.3/81.8 | 83.0/82.3 | 83.5/84.1 | 85.9/86.2 | 84.5/85.9 | 83.6/82.1 | 82.0/80.4 |
| 2.0 | 2.2 | 79.5/78.7 | 80.7/80.3 | 81.5/82.1 | 82.7/83.5 | 84.6/84.0 | 83.2/82.7 | 83.1/82.6 | 81.3/79.3 |
| 2.0 | 2.3 | 77.6/78.0 | 78.6/78.5 | 80.2/80.7 | 81.6/82.3 | 83.3/82.4 | 82.5/81.8 | 81.7/80.7 | 80.5/78.6 |
| 2.0 | 2.4 | 77.0/76.8 | 78.1/77.3 | 79.6/78.5 | 80.7/81.2 | 82.4/81.8 | 81.2/80.7 | 80.3/79.6 | 79.2/77.7 |
| 2.1 | 2.0 | 82.5/82.8 | 84.0/83.1 | 85.1/84.0 | 86.3/85.8 | 88.5/88.0 | 87.2/86.8 | 85.8/85.2 | 83.1/83.3 |
| 2.1 | 2.1 | 81.4/81.6 | 83.2/82.2 | 84.0/83.5 | 85.7/84.8 | 86.9/87.1 | 85.9/86.3 | 84.5/84.8 | 82.6/82.2 |
| 2.1 | 2.2 | 80.8/80.3 | 82.1/81.6 | 83.2/82.6 | 84.0/83.9 | 85.3/86.0 | 84.1/85.0 | 83.6/83.9 | 82.1/81.5 |
| 2.1 | 2.3 | 79.3/78.9 | 81.0/80.4 | 82.3/81.5 | 83.6/82.9 | 84.6/83.9 | 83.9/83.3 | 82.1/82.6 | 81.1/80.5 |
| 2.1 | 2.4 | 78.1/77.5 | 79.6/79.0 | 81.1/80.8 | 82.3/81.6 | 83.5/82.6 | 82.5/82.4 | 81.0/81.4 | 80.6/79.6 |
| 2.2 | 2.0 | 84.2/83.7 | 85.3/84.4 | 86.7/85.9 | 87.5/86.7 | 88.7/88.8 | 87.5/87.1 | 86.4/86.0 | 84.9/85.1 |
| 2.2 | 2.1 | 82.8/82.2 | 84.5/83.7 | 85.8/84.9 | 86.8/85.3 | 87.6/88.0 | 87.2/87.2 | 85.6/85.4 | 83.8/84.5 |
| 2.2 | 2.2 | 81.1/80.9 | 83.2/81.8 | 84.8/83.5 | 85.7/84.7 | 86.3/87.2 | 86.3/85.9 | 84.5/84.7 | 83.1/84.0 |
| 2.2 | 2.3 | 80.3/80.1 | 82.5/81.1 | 83.6/82.8 | 84.4/83.5 | 85.7/86.1 | 84.6/84.3 | 83.6/83.3 | 82.3/82.6 |
| 2.2 | 2.4 | 79.3/78.7 | 81.2/80.6 | 82.6/82.1 | 83.4/82.6 | 84.8/85.3 | 83.9/83.5 | 83.1/82.7 | 81.6/81.5 |
| 2.3 | 2.0 | 85.4/84.9 | 86.8/85.3 | 87.8/86.5 | 88.2/87.8 | 89.2/89.9 | 88.1/88.4 | 87.3/87.6 | 86.5/85.8 |
| 2.3 | 2.1 | 84.1/83.7 | 85.2/84.6 | 87.0/86.2 | 87.5/87.2 | 88.5/88.8 | 87.4/87.0 | 86.2/86.3 | 85.0/84.5 |
| 2.3 | 2.2 | 83.2/82.6 | 84.5/83.4 | 86.1/85.7 | 87.1/86.5 | 87.9/88.2 | 86.1/86.5 | 85.5/85.2 | 84.1/83.8 |
| 2.3 | 2.3 | 82.5/81.4 | 83.7/82.8 | 84.6/85.0 | 85.6/85.2 | 86.5/87.1 | 85.3/86.1 | 84.8/84.1 | 83.6/82.0 |
| 2.3 | 2.4 | 81.3/80.5 | 82.6/81.8 | 83.5/83.6 | 84.9/84.4 | 85.6/86.0 | 84.5/85.0 | 83.6/83.9 | 82.5/81.1 |
| 2.4 | 2.0 | 83.2/82.6 | 84.3/83.7 | 85.8/84.3 | 86.5/85.4 | 87.8/87.2 | 86.4/85.3 | 84.3/83.2 | 83.1/82.5 |
| 2.4 | 2.1 | 82.5/82.0 | 83.6/82.8 | 84.5/83.7 | 85.8/85.0 | 86.5/86.1 | 85.3/84.4 | 83.1/82.9 | 82.0/81.3 |
| 2.4 | 2.2 | 81.2/81.5 | 82.7/82.3 | 83.6/82.9 | 84.6/84.1 | 85.8/84.5 | 84.5/83.2 | 83.7/82.6 | 81.4/80.8 |
| 2.4 | 2.3 | 80.4/81.0 | 81.8/81.5 | 82.7/82.3 | 83.9/83.4 | 85.2/83.6 | 83.5/82.7 | 82.5/81.9 | 80.5/79.3 |
| 2.4 | 2.4 | 79.6/80.1 | 80.3/80.8 | 81.5/81.3 | 82.6/82.8 | 84.9/83.3 | 82.8/81.9 | 81.7/80.5 | 79.3/78.5 |
Table 6. Performance comparison between CGAN-IFCM and IFCM.

| Cases | IFCM TNR (%)/TPR (%) | Proposed CGAN-IFCM TNR (%)/TPR (%) | Percentage Improvement by Proposed Work |
|-------|----------------------|------------------------------------|-----------------------------------------|
| Binary detection model with SVD | 85.9/89.8 | 94.7/95.6 | 10.2/6.5 |
| Binary detection model with VOICED | 83.3/89.2 | 91.6/94.4 | 10.0/5.8 |
| Multi-class detection model with SVD | 81.9/76.5 | 92.2/88.9 | 12.6/16.2 |
| Multi-class detection model with VOICED | 80.7/79.2 | 89.4/90.1 | 10.8/13.8 |
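The percentage improvements in Tables 6 and 7 are relative gains, i.e., (new − old)/old × 100. A quick arithmetic check against the first row of Table 6:

```python
# Relative improvement, reproducing the 10.2/6.5 entry in Table 6
# (binary detection model with SVD).
tnr_ifcm, tpr_ifcm = 85.9, 89.8   # IFCM alone
tnr_cgan, tpr_cgan = 94.7, 95.6   # proposed CGAN-IFCM

tnr_gain = (tnr_cgan - tnr_ifcm) / tnr_ifcm * 100  # (94.7 - 85.9) / 85.9
tpr_gain = (tpr_cgan - tpr_ifcm) / tpr_ifcm * 100  # (95.6 - 89.8) / 89.8
print(f"{tnr_gain:.1f}/{tpr_gain:.1f}")  # -> 10.2/6.5
```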
Table 7. Performance comparison between CGAN-IFCM and CGAN-FCM.

| Cases | CGAN-FCM TNR (%)/TPR (%) | Proposed CGAN-IFCM TNR (%)/TPR (%) | Percentage Improvement by Proposed Work |
|-------|--------------------------|------------------------------------|-----------------------------------------|
| Binary detection model with SVD | 87.8/91.3 | 94.7/95.6 | 7.9/4.7 |
| Binary detection model with VOICED | 85.3/91.6 | 91.6/94.4 | 7.4/3.1 |
| Multi-class detection model with SVD | 84.6/79.4 | 92.2/88.9 | 9.0/12.0 |
| Multi-class detection model with VOICED | 83.3/81.6 | 89.4/90.1 | 7.3/10.4 |
Table 8. Performance comparison between CGAN-IFCM, synthetic minority oversampling technique (SMOTE)-IFCM, and cost-sensitive learning (CSL)-IFCM.

| Cases | CGAN-IFCM TNR (%)/TPR (%) | SMOTE-IFCM TNR (%)/TPR (%) | CSL-IFCM TNR (%)/TPR (%) | Percentage Improvement by Proposed Work |
|-------|---------------------------|----------------------------|--------------------------|-----------------------------------------|
| Binary detection model with SVD | 94.7/95.6 | 90.1/91.6 | 89.2/90.3 | (4.4–5.1)/(6.0–6.2) |
| Binary detection model with VOICED | 91.6/94.4 | 87.2/90.1 | 86.5/89.2 | (4.8–5.0)/(5.8–5.9) |
| Multi-class detection model with SVD | 92.2/88.9 | 86.9/85.4 | 87.5/84.0 | (4.1–6.1)/(5.4–5.8) |
| Multi-class detection model with VOICED | 89.4/90.1 | 86.2/86.5 | 85.0/85.6 | (3.7–4.2)/(5.2–5.3) |
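The SMOTE-IFCM baseline in Table 8 replaces the CGAN with classical synthetic minority oversampling before clustering. A minimal sketch using the third-party imbalanced-learn package follows (an assumed implementation; the paper does not name one), with toy data standing in for the extracted voice features:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced three-class data; the real inputs would be the
# SVD/VOICED feature vectors.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)
print(Counter(y))  # imbalanced class counts

# Oversample every minority class up to the majority class size;
# the detector (here, IFCM) would then be trained on the balanced set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # balanced class counts
```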
Table 9. Performance comparison between CGAN-IFCM and existing works on the binary detection model for voice disorder detection.

| Work | Database | Methodology | Type of Cross-Validation | Performance |
|------|----------|-------------|--------------------------|-------------|
| [12] | SVD (binary class) | SVM and GMM | No | TNR: 99%; TPR: 94% |
| [13] | SVD (binary class) | SMO and SVM | 10-fold | TNR: 83.9%; TPR: 87.6% |
| [14] | SVD (binary class) | LSTM and CNN | 10-fold | Precision of 66–78% |
| [15] | SVD (binary class) | SVM | 20-fold | TNR: 78%; TPR: 72% |
| [16] | SVD (binary class) | Threshold-based detection | No | TNR: 90.2%; TPR: 70.6% |
| [16] | VOICED (binary class) | Threshold-based detection | No | TNR: 64.3%; TPR: 45.8% |
| [17] | VOICED (binary class) | KNN | No | Accuracy of 93.3% |
| [17] | VOICED (binary class) | RF | No | Accuracy of 87.4% |
| [18] | VOICED (binary class) | Boosted tree | 5-fold | TNR: 86.2%; TPR: 82.9% |
| Proposed CGAN-IFCM | SVD (binary class) | CGAN and IFCM | 10-fold | TNR: 94.7%; TPR: 95.6% |
| Proposed CGAN-IFCM | VOICED (binary class) | CGAN and IFCM | 10-fold | TNR: 91.6%; TPR: 94.4% |
Conditional generative adversarial network (CGAN); convolutional neural network (CNN); random forest (RF); Gaussian mixture model (GMM); improved fuzzy c-means clustering (IFCM); k-nearest neighbor (KNN); long short-term memory (LSTM); Saarbruecken Voice Database (SVD); sequential minimal optimization (SMO); support vector machine (SVM); true negative rate (TNR); true positive rate (TPR); Voice ICar fEDerico II (VOICED).
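For the binary models, the reported TNR and TPR follow directly from the confusion matrix: TNR = TN/(TN + FP) and TPR = TP/(TP + FN). A small scikit-learn sketch with illustrative labels (1 = pathological, 0 = healthy); for the multi-class models these rates would additionally be averaged across classes:

```python
from sklearn.metrics import confusion_matrix

# Illustrative ground truth and predictions, not dataset values.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tnr = tn / (tn + fp) * 100  # specificity: healthy voices correctly rejected
tpr = tp / (tp + fn) * 100  # sensitivity: disordered voices correctly detected
print(f"TNR: {tnr:.1f}%; TPR: {tpr:.1f}%")  # -> TNR: 75.0%; TPR: 83.3%
```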
Table 10. Performance comparison between CGAN-IFCM and two typical algorithms (random forest and support vector machine) for a multi-class detection model of voice disorder detection.

| Database | Methodology | Type of Cross-Validation | Performance |
|----------|-------------|--------------------------|-------------|
| SVD | RF | 10-fold | TNR: 80.2%; TPR: 74.8% |
| SVD | SVM (radial basis kernel function) | 10-fold | TNR: 79.1%; TPR: 73.4% |
| SVD | Proposed CGAN-IFCM | 10-fold | TNR: 92.2%; TPR: 88.9% |
| VOICED | RF | 10-fold | TNR: 78.5%; TPR: 77.1% |
| VOICED | SVM (radial basis kernel function) | 10-fold | TNR: 76.9%; TPR: 75.4% |
| VOICED | Proposed CGAN-IFCM | 10-fold | TNR: 89.4%; TPR: 90.1% |
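The RF and SVM-RBF baselines in Table 10 are standard classifiers evaluated under 10-fold cross-validation. A minimal scikit-learn sketch is given below; the random feature matrix is only a placeholder for the extracted voice features, and mean accuracy is printed for brevity (per-class TNR/TPR would be derived from confusion matrices as sketched above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Placeholder features/labels: healthy plus three disorder classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = rng.integers(0, 4, size=300)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in [("RF", RandomForestClassifier(random_state=0)),
                  ("SVM-RBF", SVC(kernel="rbf"))]:
    acc = cross_val_score(clf, X, y, cv=cv).mean()
    print(f"{name}: mean 10-fold accuracy = {acc:.3f}")
```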
Table 11. Results of Wilcoxon signed-rank test on proposed CGAN-IFCM and various approaches.

| Classification | Hypotheses | Result |
|----------------|------------|--------|
| Binary detection model | H0: EQA_proposed = EQA_IFCM; Ha: EQA_proposed > EQA_IFCM | Reject H0 |
| Multi-class detection model | H0: EQA_proposed = EQA_IFCM; Ha: EQA_proposed > EQA_IFCM | Reject H0 |
| Binary detection model | H0: EQA_proposed = EQA_CGAN-FCM; Ha: EQA_proposed > EQA_CGAN-FCM | Reject H0 |
| Multi-class detection model | H0: EQA_proposed = EQA_CGAN-FCM; Ha: EQA_proposed > EQA_CGAN-FCM | Reject H0 |
| Binary detection model | H0: EQA_proposed = EQA_SMOTE-IFCM; Ha: EQA_proposed > EQA_SMOTE-IFCM | Reject H0 |
| Multi-class detection model | H0: EQA_proposed = EQA_SMOTE-IFCM; Ha: EQA_proposed > EQA_SMOTE-IFCM | Reject H0 |
| Binary detection model | H0: EQA_proposed = EQA_CSL-IFCM; Ha: EQA_proposed > EQA_CSL-IFCM | Reject H0 |
| Multi-class detection model | H0: EQA_proposed = EQA_CSL-IFCM; Ha: EQA_proposed > EQA_CSL-IFCM | Reject H0 |
| Binary detection model | H0: EQA_proposed = EQA_[13]; Ha: EQA_proposed > EQA_[13] | Reject H0 |
| Binary detection model | H0: EQA_proposed = EQA_[14]; Ha: EQA_proposed > EQA_[14] | Reject H0 |
| Binary detection model | H0: EQA_proposed = EQA_[15]; Ha: EQA_proposed > EQA_[15] | Reject H0 |
| Binary detection model | H0: EQA_proposed = EQA_[18]; Ha: EQA_proposed > EQA_[18] | Reject H0 |
| Multi-class detection model | H0: EQA_proposed = EQA_RF; Ha: EQA_proposed > EQA_RF | Reject H0 |
| Multi-class detection model | H0: EQA_proposed = EQA_SVM-RBF; Ha: EQA_proposed > EQA_SVM-RBF | Reject H0 |
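The one-sided Wilcoxon signed-rank tests in Table 11 can be reproduced on paired per-fold scores. A sketch using SciPy follows; the EQA values below are made-up placeholders, not the paper's measurements:

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-fold EQA scores from the same 10 folds.
eqa_proposed = [94.1, 95.0, 94.6, 93.8, 95.2, 94.4, 94.9, 95.5, 94.0, 94.7]
eqa_baseline = [86.2, 85.5, 86.9, 85.1, 87.0, 86.4, 85.8, 86.6, 85.3, 86.0]

# One-sided test: H0 equal vs. Ha proposed > baseline.
stat, p = wilcoxon(eqa_proposed, eqa_baseline, alternative="greater")
print(f"W = {stat}, p = {p:.4f}")  # p < 0.05 rejects H0
```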
