A speech enhancement algorithm based on a non-negative hidden Markov model and Kullback-Leibler divergence

Abstract

In this paper, we propose a supervised single-channel speech enhancement method that combines Kullback-Leibler (KL) divergence-based non-negative matrix factorization (NMF) and a hidden Markov model (NMF-HMM). With the integration of the HMM, the temporal dynamics information of speech signals can be taken into account. This method includes a training stage and an enhancement stage. In the training stage, the sum of the Poisson distribution, leading to the KL divergence measure, is used as the observation model for each state of the HMM. This ensures that a computationally efficient multiplicative update can be used for the parameter update of this model. In the online enhancement stage, a novel minimum mean square error estimator is proposed for the NMF-HMM. This estimator can be implemented using parallel computing, reducing the time complexity. Moreover, compared to the traditional NMF-based speech enhancement methods, the experimental results show that our proposed algorithm improved the short-time objective intelligibility and perceptual evaluation of speech quality by 5% and 0.18, respectively.

1 Introduction

Single-channel speech enhancement technology is being widely used in our daily lives, such as in speech coding, teleconferencing, hearing aids, mobile communication, and automated robust speech recognition (ASR) [1, 2]. In general, the purpose of speech enhancement is to remove background noise from an audio source while preserving clean speech. It aims to improve the quality and intelligibility of noisy speech [3]. Currently, single-channel speech enhancement is an active topic of research.

During the past decades, many different monaural speech enhancement approaches have been proposed [2, 4]. In an environment with additive noise, the simplest approach to speech enhancement is the spectral subtraction algorithm [5], which subtracts the estimated noise spectrum from the observed signal to acquire the desired clean speech. Other unsupervised methods, such as the signal subspace algorithm [6,7,8,9], Wiener filtering [10], minimum mean square error (MMSE) spectral amplitude estimator [11], and log-MMSE spectral amplitude estimator [12], are effective strategies for speech enhancement when the noise is stationary. These methods have low computational complexity and have been widely applied in various areas. However, these approaches cannot always achieve satisfactory performance for non-stationary noise and usually introduce musical noise because they do not make the best use of the prior information of the speech and noise [13]. Moreover, most unsupervised methods are based on the statistical properties of the speech and noise signals. However, it is difficult to meet these properties in actual noisy scenarios [14].

Therefore, supervised speech enhancement approaches have been developed. For instance, Kavalekalam et al. [15] proposed a codebook-based Kalman filter speech enhancement method, which showed a significant improvement in speech intelligibility in a listening test. In addition, Srinivasan et al. [16] proposed a codebook-driven speech enhancement algorithm for non-stationary noise. In this work, the auto-regressive (AR) spectrum shape codebooks of speech and noise were pre-trained. In the enhancement stage, the codebooks were used to build a Wiener filter to conduct speech enhancement. Inspired by this research, many other codebook-based speech enhancement approaches have been developed [17, 18]. Furthermore, an auto-regressive hidden Markov model (ARHMM) [19, 20] has also been shown to be an effective supervised speech enhancement method because it considers the temporal information of the speech signal.

In recent years, advances in deep learning techniques [21, 22], specifically, deep neural networks (DNNs), have significantly promoted the development of speech enhancement [23]. These methods usually rely on fewer assumptions [3, 14, 23] about the relationship between the noise and clean speech, so they have great potential to achieve better speech enhancement performance. Xu et al. [3, 14] applied a feedforward multilayer perceptron (MLP) to map the log-power spectrum (LPS) features of noisy speech to those of clean speech; the enhanced speech could be obtained directly by waveform reconstruction. Compared to the MMSE estimator [12], this method achieved better performance in various noisy environments. Wang et al. [24, 25] also utilized an MLP to estimate the ideal ratio mask (IRM) and ideal binary mask (IBM) for speech enhancement and also achieved satisfactory performance. Motivated by this work, researchers have used different DNN structures to conduct speech enhancement, such as a fully convolutional neural network (FCN) [26], deep recurrent neural networks (DRNN) [27, 28], and generative adversarial networks (GANs) [29, 30]. These methods could help ASR systems achieve higher recognition accuracy in noisy environments. However, generalization is always a problem that needs to be considered for these DNN-based algorithms [31, 32].

A non-negative matrix factorization (NMF)-based speech enhancement algorithm [33,34,35] can also be viewed as a kind of supervised speech enhancement method. NMF-based methods usually include a training and enhancement stage. In [36], a mask-based NMF speech enhancement method was proposed. In the training stage, the basis matrix of clean speech and noise was trained. In the enhancement stage, the activation matrix could be acquired by combining the trained basis matrix and noisy signal. The mask was then estimated to conduct the speech enhancement. Additionally, an NMF-based denoising scheme was described in [37, 38], which added a heuristic term to the cost function, so the NMF coefficients could be adjusted according to the long-term levels of the signals. A parametric NMF method for speech enhancement was proposed in [17]. This method applied the AR coefficient and codebook to build the basis matrix. This strategy effectively improved the speech intelligibility. Moreover, some DNN-based NMF methods represent an effective strategy for conducting speech enhancement [39, 40]. In general, the basis matrix could be acquired using the traditional NMF method, and the activation matrix could be estimated by applying a DNN, which improved the accuracy of the estimated activation matrix. Thus, it could achieve a higher perceptual evaluation of speech quality (PESQ) [41] and short-time objective intelligibility (STOI) [42] scores than traditional NMF-based speech enhancement methods. The combination of DNN and NMF could also help the ASR system achieve a lower word error rate (WER) in noisy environments. In [43], a DNN-NMF-based method achieved excellent performance in the Computational Hearing in Multisource Environments (CHiME)-3 challenge. To capture temporal information, some HMM-based NMF speech enhancement methods have been proposed. Mohammadiha et al.  [44] proposed a supervised and unsupervised NMF speech enhancement method. 
In [44], an HMM was used for modeling the temporal change of different noise types. In [45], a non-negative factorial HMM was used to model sound mixtures and showed superior performance in source separation tasks. In [46], an HMM-DNN NMF speech enhancement algorithm was proposed, which applied a clustering method to acquire the HMM-based basis matrix and used the Viterbi algorithm to obtain the ideal state label for the DNN training. In the enhancement stage, the DNN was used to find the corresponding state to conduct speech enhancement.

In this paper, we propose a novel NMF-HMM speech enhancement method based on the Kullback-Leibler (KL) divergence, expanding on our preliminary work [47]. Our preliminary work briefly verified the effectiveness of an NMF-HMM for speech enhancement [47, 48], but it did not consider the effect of the model parameters, which is important for optimizing the algorithm's performance, nor did it investigate the performance in various noisy environments. In this paper, we extend our preliminary research in these two respects. Compared to other HMM-based methods [44, 45, 49], our method uses the HMM to capture the temporal dynamics of both the speech and noise signals. Moreover, we use the sum of the Poisson distribution as the state-conditioned likelihood for the HMM, rather than the general Gaussian mixture model (GMM), because the sum of the Poisson distribution leads to the KL divergence measure. KL divergence is a mainstream measure in NMF, and the parameter update under this model is identical to the multiplicative update rule. This ensures that the parameter update is computationally efficient during the training stage. In the enhancement stage, in contrast with previous works [44, 45], we propose a novel NMF-HMM-based MMSE estimator to perform the online enhancement. A major benefit of the proposed algorithm is that the activation matrix can be updated by parallel computing in the online stage, which can effectively reduce the computational time. In this paper, we also give a more detailed derivation of the preliminary NMF-HMM-based algorithm [47]. Moreover, the proposed method is compared with other state-of-the-art speech enhancement algorithms, which further indicates the advantages of the proposed algorithm.

The rest of this paper is organized as follows. First, we will briefly review the general NMF-based speech enhancement method with KL divergence in Section 2. The proposed HMM-based signal model will be introduced in Section 3. The detailed offline parameter learning, together with the proposed MMSE estimator and the online speech enhancement process, will be described in Section 4. The experimental comparison and analysis of results will be presented in Section 5, and we will draw conclusions in Section 6.

2 NMF-based speech enhancement method with KL divergence

In this section, we will briefly review the NMF-based speech enhancement with KL divergence. Under the additive noise assumption, the noisy signal model can be expressed as:

$$\begin{aligned} y(t) = s(t)+m(t), \end{aligned}$$
(1)

where \(y(t)\), \(s(t)\), and \(m(t)\) denote the noisy signal, clean speech, and noise, respectively, and t is the time index. With (1), the short-time Fourier transform (STFT) of \(y(t)\) can be written as:

$$\begin{aligned} Y(f,n) = S(f,n)+M(f,n), \end{aligned}$$
(2)

where \(Y(f,n)\), \(S(f,n)\), and \(M(f,n)\) denote the frequency spectra of \(y(t)\), \(s(t)\), and \(m(t)\), respectively. Here, \(f \in [1, F]\) and \(n \in [1, N]\) denote the frequency bin and time frame indices, respectively. Collecting the F frequency bins and N time frames, we define the magnitude spectrum matrices \(\mathbf {Y}_{N}\), \(\mathbf {S}_{N}\), and \(\mathbf {M}_{N}\), where \({\mathbf {Y}}_{N}=[\mathbf {y}_1,\cdots , \mathbf {y}_{n}, \cdots , \mathbf {y}_{N}]\) and \(\mathbf {y}_{n}=[|Y(1,n)|, \cdots , |Y(f,n)|,\cdots , |Y(F,n)|]^T\). The vectors \(\mathbf {s}_{n}\) and \(\mathbf {m}_{n}\) are defined similarly to \(\mathbf {y}_{n}\), and \(\mathbf {S}_{N}\) and \(\mathbf {M}_{N}\) are defined similarly to \(\mathbf {Y}_{N}\). We assume that \(\mathbf {Y}_{N}=\mathbf {S}_{N}+\mathbf {M}_{N}\). The classical NMF-based speech enhancement has two stages: training and enhancement. In the training stage, the clean speech basis matrix \(\overline{\mathbf {W}}\) and noise basis matrix \(\ddot{\mathbf {W}}\) are trained using clean speech and noise databases, respectively. Many cost functions have been proposed for NMF, such as KL divergence [34], Itakura-Saito (IS) divergence [50], \(\beta\) divergence, and Euclidean distance [51]. In this paper, we focus on using the KL divergence measure. There are two reasons for this choice. First, compared with other cost functions, the best speech enhancement performance can be achieved using the KL divergence-based NMF with the magnitude spectrum [52]. Second, the efficient multiplicative update (MU) rule of the KL divergence-based NMF can also be derived statistically using the expectation maximization (EM) algorithm [53]. For the two matrices \(\mathbf {B}\) and \(\hat{\mathbf {B}}\), the KL divergence measure is defined as:

$$\begin{aligned} {\mathrm {KL}(\mathbf {B}|\hat{\mathbf {B}})} = \sum \limits _{i,j} (b_{i,j} \mathrm {log}(b_{i,j}/ \hat{b}_{i,j})-b_{i,j}+\hat{b}_{i,j}), \end{aligned}$$
(3)

where \(b_{i,j}\) and \(\hat{b}_{i,j}\) denote the elements from the \({i}{\mathrm {th}}\) row and \({j}{\mathrm {th}}\) column of the matrices \(\mathbf {B}\) and \(\hat{\mathbf {B}}\), respectively. Using speech basis matrix training as an example, the cost function of the KL divergence-based NMF for training \(\overline{\mathbf {W}}\) can be written as:

$$\begin{aligned} (\overline{\mathbf {W}},\overline{\mathbf {H}}) = \underset{\overline{\mathbf {W}},\overline{\mathbf {H}}}{\arg \min }\ \mathrm {KL}(\mathbf {S}_{N}|\overline{\mathbf {W}}\times \overline{\mathbf {H}}). \end{aligned}$$
(4)

Noise basis matrix training is similar to speech basis matrix training. In [34], it is derived that \(\overline{\mathbf {W}}\) and \(\overline{\mathbf {H}}\) can be obtained iteratively using the following multiplicative update rules:

$$\begin{aligned} {\overline{\mathbf {W}}} \leftarrow {\overline{\mathbf {W}}} \odot {\frac{\frac{\mathbf {S}_{N}}{\overline{\mathbf {W}}\times \overline{\mathbf {H}}}\overline{\mathbf {H}}^{T}}{\mathbf {1}{\overline{\mathbf {H}}}^{T}}} , \end{aligned}$$
(5)
$$\begin{aligned} {\overline{\mathbf {H}}} \leftarrow {\overline{\mathbf {H}}} \odot {\frac{\overline{\mathbf {W}}^{T}\frac{\mathbf {S}_{N}}{\overline{\mathbf {W}}\times \overline{\mathbf {H}}}}{\overline{\mathbf {W}}^{T}\mathbf {1}}}, \end{aligned}$$
(6)

where \(\odot\) and all divisions are element-wise multiplication and division operations, respectively, and \(\mathbf {1}\) is a matrix of ones with the same size as \(\mathbf {S}_{N}\). In the enhancement stage, the noisy speech basis matrix \(\mathbf {W}\) can be constructed by concatenating the speech and noise basis matrices, \({{\mathbf {W}}=[\overline{\mathbf {W}},\ddot{\mathbf {W}} ]}\). The activation matrix \({\mathbf {H}}\) of the noisy speech can be estimated iteratively by replacing \(\mathbf {S}_{N}\), \(\overline{\mathbf {W}}\), and \(\overline{\mathbf {H}}\) in (6) with \(\mathbf {Y}_{N}\), \(\mathbf {W}\), and \(\mathbf {H}\), respectively. The enhanced signal can be obtained using various algorithms [36, 37, 44, 45]. One popular approach is to use the following Wiener filter-like spectral gain \(\mathbf {g}_{n}^{\mathrm {NMF}}\) function:

$$\begin{aligned} \mathbf {g}_{n}^{\mathrm {NMF}}= \frac{\overline{\mathbf {W}}\,\overline{\mathbf {h}}_{n}}{{\overline{\mathbf {W}}\,\overline{\mathbf {h}}_{n}}+{\ddot{\mathbf {W}}\,\ddot{\mathbf {h}}_{n}}}, \end{aligned}$$
(7)
$$\begin{aligned} \mathbf {h}_{n}=\left[ \overline{\mathbf {h}}_{n}^{T}, \ddot{\mathbf {h}}_{n}^{T}\right] ^{T} \nonumber \\ =\arg \min _{\mathbf {h}_{n}} \mathrm {KL}(\mathbf {y}_{n}|\mathbf {W}\mathbf {h}_{n}), \end{aligned}$$
(8)

where (8) can be solved iteratively using (6). Apart from the gradient descent derivation of the MU update rules (5) and (6) presented in [34], it is further shown in [53] that the MU update rules can be derived from a statistical perspective. More specifically, the KL divergence-based NMF can be motivated from the following hierarchical statistical model:

$$\begin{aligned} {\mathbf {S}_{N} = \sum \limits _{k=1}^K {\mathbf {C}}(k),} \end{aligned}$$
(9)
$$\begin{aligned} {c_{f,n}(k)} \sim \mathcal{PO} (c_{f,n}(k);\overline{W}_{f,k}\overline{H}_{k,n}), \end{aligned}$$
(10)

where \({\mathcal{PO}(x;\lambda )=\frac{\lambda ^{x}e^{-\lambda }}{\Gamma (x+1)}}\) is the Poisson distribution, \(\Gamma (x+1)=x!\) denotes the gamma function for positive integer x, K denotes the number of basis vectors, \(\mathbf {C}(k)\) is the latent matrix, and \(c_{f,n}(k)\) denotes the element of \(\mathbf {C}(k)\) in the \(f\mathrm {th}\) row and \(n\mathrm {th}\) column. Note that \({c_{f,n}(k)}\) is assumed to have a Poisson distribution, which is defined only for discrete variables. However, in practice, this hierarchical statistical model is not limited to discrete variables because the gamma function for continuous variables can be used to replace the factorial calculation [53]. It has been shown in [53] that the iterative update of the parameters \(\overline{\mathbf {H}}\) and \(\overline{\mathbf {W}}\) using the EM algorithm is identical to the multiplicative update rules shown in (5) and (6).
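As an illustration of the training stage reviewed above, the following minimal NumPy sketch implements the KL divergence of (3) and the multiplicative updates (5) and (6). The function names, the random initialization, the small constant `eps` that guards against division by zero, and the toy data are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def kl_divergence(B, B_hat, eps=1e-12):
    """Generalized KL divergence of Eq. (3), summed over all entries."""
    B = np.asarray(B, dtype=float)
    B_hat = np.asarray(B_hat, dtype=float)
    return float(np.sum(B * np.log((B + eps) / (B_hat + eps)) - B + B_hat))

def train_kl_nmf(S, K, n_iter=100, seed=0, eps=1e-12):
    """Learn a basis W (F x K) and activations H (K x N) with Eqs. (5)-(6)."""
    rng = np.random.default_rng(seed)
    F, N = S.shape
    W = rng.random((F, K)) + eps   # random non-negative start (assumption)
    H = rng.random((K, N)) + eps
    ones = np.ones_like(S)         # the all-ones matrix "1" in Eqs. (5)-(6)
    costs = []
    for _ in range(n_iter):
        W *= ((S / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)   # Eq. (5)
        H *= (W.T @ (S / (W @ H + eps))) / (W.T @ ones + eps)   # Eq. (6)
        costs.append(kl_divergence(S, W @ H))
    return W, H, costs

# Toy non-negative "magnitude spectrogram" with F = 64 bins and N = 200 frames.
rng = np.random.default_rng(1)
S = rng.random((64, 6)) @ rng.random((6, 200))
W, H, costs = train_kl_nmf(S, K=6)
```

Because both updates only multiply by non-negative ratios, `W` and `H` stay non-negative without any explicit projection, which is what makes the MU rule convenient in practice.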

One of the advantages of the classical NMF-based method for speech enhancement is that the computationally efficient MU rules can be applied. However, the temporal dynamics of speech and noise are not taken into account. To incorporate the temporal dynamics of audio signals, an HMM is used in [45] for source separation. However, its parameter update rules are computationally complex, and the method in [45] can only perform offline enhancement. In this paper, we propose an NMF-based speech enhancement algorithm that uses the HMM to take the temporal aspects of both the speech and noise into account. The proposed approach can achieve efficient parameter updates. Moreover, an online MMSE estimator for speech enhancement is derived. Other methods also consider temporal dynamics for speech enhancement, for example by stacking multiple frames into a vector [14, 54], using the DRNN [28], or using non-negative matrix deconvolution [55], but they suffer from high computational complexity, and their large model size leads to a high storage complexity. In this paper, the proposed method can achieve a higher PESQ score than the referenced DNN-based method for unseen noise while having a lower complexity.
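The classical enhancement stage reviewed in this section, (7) and (8), can be sketched for a single frame as follows. This is only an illustration of the baseline, not of the proposed NMF-HMM estimator; the function names, the iteration count, the guard constant `eps`, and the randomly generated bases are assumptions of the example.

```python
import numpy as np

def estimate_activations(y, W, n_iter=50, seed=0, eps=1e-12):
    """Approximately solve Eq. (8) for one frame with the H-update of Eq. (6),
    keeping the concatenated basis W fixed."""
    rng = np.random.default_rng(seed)
    h = rng.random(W.shape[1]) + eps
    col_sums = W.sum(axis=0) + eps            # W^T 1 for a single frame
    for _ in range(n_iter):
        h *= (W.T @ (y / (W @ h + eps))) / col_sums
    return h

def wiener_like_gain(y, W_speech, W_noise, eps=1e-12):
    """Spectral gain of Eq. (7) built from pre-trained speech and noise bases."""
    W = np.hstack([W_speech, W_noise])        # W = [W_speech, W_noise]
    h = estimate_activations(y, W)
    K_speech = W_speech.shape[1]
    speech = W_speech @ h[:K_speech]          # speech part of the model
    noise = W_noise @ h[K_speech:]            # noise part of the model
    return speech / (speech + noise + eps)

# Toy frame: F = 32 bins, 4 speech and 3 noise basis vectors (all random).
rng = np.random.default_rng(0)
W_speech = rng.random((32, 4))
W_noise = rng.random((32, 3))
y = W_speech @ rng.random(4) + W_noise @ rng.random(3)
gain = wiener_like_gain(y, W_speech, W_noise)
s_hat = gain * y                              # enhanced magnitude spectrum
```

In a full system the gain would be applied to the noisy STFT frame and the waveform reconstructed with the noisy phase; that step is omitted here.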

3 HMM-based signal models with the KL divergence

In this section, we present the details of the proposed signal models, including the speech and noise signal models and the noisy signal model.

3.1 Speech and noise signal models

In this work, the same signal model is used for both the clean speech and noise signals, so we will derive the equations using only the clean speech signal. Additionally, we use the overbar (\(\overline{\cdot }\)) and double dots (\(\ddot{\cdot }\)) to represent the clean speech and noise, respectively. To consider the temporal dynamic information of the speech and noise, we use the HMM. Following the conditional independence property of the standard HMM [56], the likelihood function can be expressed as follows:

$$\begin{aligned} p({\mathbf {S}}_{N};{\varvec{\Phi }}) = \sum \limits _{\mathbf {\overline{x}}_{N}}\prod\limits_{n=1}^N p(\mathbf {s}_{n}|\overline{x}_{n})p(\overline{x}_{n}|\overline{x}_{n-1}), \end{aligned}$$
(11)

where \({\mathbf {\overline{x}}_{N}}=[\overline{x}_1,\cdots , \overline{x}_{n},\cdots ,\overline{x}_{N}]^T\) is a collection of states, \(\overline{x}_{n} \in \{1,2,\cdots ,\overline{J}\}\) denotes the state at the \(n\mathrm {th}\) frame, and \(\overline{J}\) denotes the total number of states. The function \(p(\overline{x}_{n}|\overline{x}_{n-1})\) denotes the state transition probability from state \(\overline{x}_{n-1}\) to \(\overline{x}_{n}\) with \(p(\overline{x}_1|\overline{x}_{0})\) being the initial state probability. \(p(\mathbf {s}_{n}|\overline{x}_{n})\) is the state-conditioned likelihood function, and \(\varvec{\Phi }\) is a collection of modeling parameters. Next, we describe the state transition probability and the state-conditioned likelihood function, respectively, for the proposed signal model.

The state transition probability \(p(\overline{x}_{n}|\overline{x}_{n-1})\): Following the standard HMM, we use a first-order Markov chain to model the state transition, that is:

$$\begin{aligned} p(\overline{x}_{n}|\overline{x}_{n-1})=\prod\limits_{i=1}^{\overline{J}}\prod\limits_{j=1}^{\overline{J}} {\overline{A}_{i,j}^{l(\overline{x}_{n}=j,\overline{x}_{n-1}=i)}}, \end{aligned}$$
(12)
$$\begin{aligned} p(\overline{x}_1|\overline{x}_0)=p(\overline{x}_1)=\prod\limits_{j=1}^{\overline{J}}{\overline{\pi }_{j}^{l(\overline{x}_{1}=j)}}, \end{aligned}$$
(13)

where \(l(\cdot )\) denotes an indicator function, which is one when the logic expression in the parentheses is true and zero otherwise. In addition, \(\overline{A}_{i,j}\) and \(\overline{\pi }_{j}\) denote the transition probability from state i to state j and the initial probability for the first frame’s state \(\overline{x}_1\) being state j, respectively. Collecting all the initial and transition probabilities, we can write them into matrix forms, \(\overline{\varvec{\pi }}=[\overline{\pi }_1, \cdots ,\overline{\pi }_j, \cdots , \overline{\pi }_{\overline{J}}] ^T\) and \(\overline{\mathbf {A}}\) with \(\overline{A}_{i,j}\) being the element at the \(i\mathrm {th}\) row and \(j\mathrm {th}\) column. Therefore, the modeling parameters of the HMM can be expressed as \(\varvec{\Phi }_{\mathrm {hmm}}=\{\overline{\mathbf {A}}, \overline{\varvec{\pi }}, \overline{J}\}\). The modeling parameters \(\overline{\mathbf {A}}\) and \(\overline{\varvec{\pi }}\) with a predefined \(\overline{J}\) can be trained through the EM algorithm shown in the next section. In the experiments, we investigate the impact of the total number of states \(\overline{J}\).

The state-conditioned likelihood function: Next, we present the proposed state-conditioned likelihood function. Motivated by the good speech enhancement performance, the computationally efficient MU rule, and the equivalence between the gradient descent derivation and the EM algorithm for the KL divergence-based NMF, we propose to use the statistical model in (9) and (10) to build the state-conditioned likelihood function, that is:

$$\begin{aligned} \mathbf {s}_{n} = \sum\limits_{k=1}^{\overline{K}} {{\overline{\mathbf {c}}_{n}}}(k), \end{aligned}$$
(14)
$$\begin{aligned} {p(\overline{\mathbf {c}}_{n}(k)|\overline{x}_{n})} = \prod\limits_{f=1}^F \mathcal{P}\mathcal{O} ({\overline{c}}_{f,n}(k);\overline{W}_{f,k}^{\overline{x}_{n}}\overline{H}_{k,n}^{\overline{x}_{n}}), \end{aligned}$$
(15)

where \(\overline{K}\) is the number of basis vectors, \(\overline{\mathbf {c}}_{n}(k)\) contains the hidden variables, and \(\overline{W}_{f,k}^{\overline{x}_{n}}\) and \(\overline{H}_{k,n}^{\overline{x}_{n}}\) correspond to the elements of the basis and activation matrices, respectively. By writing \(\overline{\mathbf {c}}_{n}=[\overline{\mathbf {c}}_{n}(1)^T,\overline{\mathbf {c}}_{n}(2)^T, \cdots , \overline{\mathbf {c}}_{n}(\overline{K})^T]^T\) and integrating out \(\overline{\mathbf {c}}_{n}\), the state-conditioned likelihood function can be written as:

$$\begin{aligned} &{p(\mathbf {s}_{n}| \overline{x}_{n})} = \int {p(\mathbf {s}_{n}|{\overline{\mathbf {c}}_{n}})}{p(\overline{\mathbf {c}}_{n}|{\overline{{x}}_{n}})}\, d{\overline{\mathbf {c}}_{n}} \\&= \prod\limits_{f=1}^F \mathcal{P}\mathcal{O} (|S(f,n)|;\sum\limits_{k=1}^{\overline{K}} \overline{W}_{f,k}^{\overline{x}_{n}}\overline{H}_{k,n}^{\overline{x}_{n}}), \end{aligned}$$
(16)

where we use the superposition property of the Poisson random variable [53]. Collecting the unknown parameters \(\{\overline{W}_{f,k}^{\overline{x}_{n}}\}\) and \(\{\overline{H}_{k,n}^{\overline{x}_{n}}\}\), we can write them into matrix forms, \(\{\overline{\mathbf {W}}^j\}\) and \(\{\overline{\mathbf {H}}^j\}\). Therefore, unlike the traditional NMF using only one basis matrix, the proposed model has \(\overline{J}\) basis matrices to be trained. Each basis matrix is intended to capture a specific feature (e.g., a phoneme) of the speech signal. The modeling parameters of the proposed state-conditioned likelihood function can be expressed as \(\varvec{\Phi }_{\mathrm {like}}=\{ \{\overline{\mathbf {W}}^j\}, \{\overline{\mathbf {H}}^j\}, \overline{K}, \overline{J} \}\). The modeling parameters \(\{\overline{\mathbf {W}}^j\}\) and \(\{\overline{\mathbf {H}}^j\}\) with predefined \(\overline{J}\) and \(\overline{K}\) can be trained through the EM algorithm shown in the next section. In the experiments, we investigate the impact of the number of basis vectors \(\overline{K}\) and the number of states \(\overline{J}\). It will also be shown that a multiplicative update rule can be derived for the basis and activation matrices update of the proposed state-conditioned likelihood function.
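To make the model concrete, the sketch below evaluates the state-conditioned log-likelihood of (16) for every state and frame, and then the log of the HMM likelihood (11) with a log-domain forward recursion. The forward recursion is the standard way to evaluate such a sum over state sequences; it is used here as an illustration and is not quoted from the paper. All function names and the toy parameters are assumptions, and the factorial is replaced by the gamma function as discussed in Section 2.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)   # element-wise log Gamma(x)

def poisson_loglik(S, W_states, H_states, eps=1e-12):
    """log p(s_n | x_n = j) of Eq. (16) for all states j and frames n: (J, N)."""
    J = len(W_states)
    F, N = S.shape
    log_fact = _lgamma(S + 1.0)       # log Gamma(|S(f,n)| + 1)
    out = np.empty((J, N))
    for j in range(J):
        lam = W_states[j] @ H_states[j] + eps   # Poisson rate, shape (F, N)
        out[j] = np.sum(S * np.log(lam) - lam - log_fact, axis=0)
    return out

def log_sum_exp(a, axis):
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def hmm_loglik(log_b, pi, A):
    """log of Eq. (11); A[i, j] is the transition probability from i to j."""
    J, N = log_b.shape
    log_A = np.log(A)
    log_alpha = np.log(pi) + log_b[:, 0]
    for n in range(1, N):
        log_alpha = log_sum_exp(log_alpha[:, None] + log_A, axis=0) + log_b[:, n]
    return float(log_sum_exp(log_alpha, axis=0))

# Toy model: F = 5 bins, N = 3 frames, J = 2 states, K = 3 basis vectors.
rng = np.random.default_rng(0)
F, N, J, K = 5, 3, 2, 3
W_states = [rng.random((F, K)) for _ in range(J)]
H_states = [rng.random((K, N)) for _ in range(J)]
S = 3.0 * rng.random((F, N))
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
log_b = poisson_loglik(S, W_states, H_states)
total = hmm_loglik(log_b, pi, A)
```

Working in the log domain avoids underflow, since the product over F frequency bins in (16) becomes very small for realistic spectrogram sizes.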

To summarize, five types of parameters in the parameter set \(\varvec{\Phi }\)=\(\varvec{\Phi }_{\mathrm {hmm}}\cup \varvec{\Phi }_{\mathrm {like}}\) can be identified. They are the transition matrix \(\overline{\mathbf {A}}\), initial state probabilities in \(\overline{\varvec{\pi }}\), basis matrices of different states \(\{\overline{\mathbf {W}}^j\}\), activation matrices of different states \(\{{\overline{\mathbf {H}}}^j\}\), and modeling parameters \(\overline{K}\) and \(\overline{J}\). In this paper, the modeling parameters \(\overline{K}\) and \(\overline{J}\) are predefined, the activation matrices \(\{\overline{\mathbf {H}}^j\}\) are estimated by online speech enhancement, and the other three types of parameters are obtained using offline learning.

3.2 Noisy speech model

Based on the proposed clean speech and noise signal models (1) and (2), the noisy speech model can be defined. We assume that there are a total of \(\ddot{J}\) hidden states for the noise, and the hidden state of the noise is \(\ddot{x}_{n} (\ddot{x}_{n}\in {\{1,2,\cdots ,\ddot{J}\}})\). The notations \(\ddot{\varvec{\pi }}\) and \(\ddot{\mathbf {A}}\) correspond to the initial state probability and transition probability matrix of the noise. Thus, there are a total of \(\overline{J} \times \ddot{J}\) hidden states for the noisy speech. Each composite state consists of a pair of states of clean speech \({\overline{x}_{n}}\) and noise \(\ddot{x}_{n}\). Thus, if we list the state space for a noisy signal, we have \((\overline{x}_{n} = 1, \ddot{x}_{n} = 1),(\overline{x}_{n} = 1, \ddot{x}_{n} = 2),\cdots ,(\overline{x}_{n} = 1, \ddot{x}_{n} = \ddot{J});(\overline{x}_{n} = 2, \ddot{x}_{n} = 1),(\overline{x}_{n} = 2, \ddot{x}_{n} = 2),\cdots ,(\overline{x}_{n} = 2, \ddot{x}_{n} = \ddot{J});\cdots ; (\overline{x}_{n} = \overline{J}, \ddot{x}_{n} =1),(\overline{x}_{n} = \overline{J}, \ddot{x}_{n} = 2),\cdots ,(\overline{x}_{n} = \overline{J}, \ddot{x}_{n} = \ddot{J})\). Moreover, the initial state and transition probability matrices of the noisy speech can be expressed as \(\overline{\varvec{\pi }} \otimes \ddot{\varvec{\pi }}\) and \(\overline{\mathbf {A}} \otimes \ddot{\mathbf {A}}\), where \({\otimes }\) denotes the Kronecker product. Finally, the state conditioned likelihood function of the noisy speech can be written as follows:

$$\begin{aligned}&{p(\mathbf {y}_{n} | \overline{x}_{n}, \ddot{x}_{n}) } = \\&\prod\limits_{f=1}^F \mathcal{P}\mathcal{O}\Big( |Y(f,n)|;\sum \limits _{k=1}^{\overline{K}} \overline{W}_{f,k}^{\overline{x}_{n}}\overline{H}_{k,n}^{\overline{x}_{n}}+\sum \limits _{k=1}^{\ddot{K}} \ddot{W}_{f,k}^{\ddot{x}_{n}}\ddot{H}_{k,n}^{\ddot{x}_{n}}\Big), \end{aligned}$$
(17)

where \(\ddot{K}\), \(\{\ddot{W}_{f,k}^{\ddot{x}_{n}}\}\), and \(\{\ddot{H}_{k,n}^{\ddot{x}_{n}}\}\) represent the number of basis vectors, the elements of the basis matrices, and the elements of the activation matrices for the noise, respectively. We can write \(\{\ddot{W}_{f,k}^{\ddot{x}_{n}}\}\) and \(\{\ddot{H}_{k,n}^{\ddot{x}_{n}}\}\) into matrix forms as \(\{{\ddot{\mathbf {W}}}^j\}\) and \(\{{\ddot{\mathbf {H}}}^j\}\), respectively. Note that we also used the superposition property of Poisson random variables to obtain (17).
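The construction of the composite noisy-speech HMM from the Kronecker products described above can be illustrated with a small numerical example. The probability values and the helper `composite_index` are hypothetical and chosen only for this sketch; indices are 0-based, whereas the text uses 1-based states.

```python
import numpy as np

# Hypothetical speech HMM with 2 states and noise HMM with 3 states.
pi_speech = np.array([0.7, 0.3])
A_speech = np.array([[0.9, 0.1],
                     [0.4, 0.6]])
pi_noise = np.array([0.5, 0.3, 0.2])
A_noise = np.array([[0.8, 0.1, 0.1],
                    [0.2, 0.7, 0.1],
                    [0.3, 0.3, 0.4]])

# Composite initial and transition probabilities of the noisy speech.
pi_noisy = np.kron(pi_speech, pi_noise)   # length 2 * 3 = 6
A_noisy = np.kron(A_speech, A_noise)      # shape (6, 6)

def composite_index(x_speech, x_noise, J_noise):
    """0-based position of the pair (x_speech, x_noise) in the ordering
    listed in the text: the noise state varies fastest."""
    return x_speech * J_noise + x_noise

# Transition from pair (speech 0, noise 2) to pair (speech 1, noise 1).
src = composite_index(0, 2, 3)
dst = composite_index(1, 1, 3)
p = A_noisy[src, dst]                     # equals A_speech[0, 1] * A_noise[2, 1]
```

The factorization of each composite transition into a speech term and a noise term reflects that the speech and noise state sequences evolve independently in this model.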

4 Methods

4.1 Offline NMF-HMM-based parameter learning

In the offline training stage, the objective is to find the parameter set \(\varvec{\Phi }\) that maximizes the likelihood function (11). In general, the EM algorithm [56] can be used to address this problem. Because we use the same model for the speech and noise, here, we use the clean speech as an example to illustrate the offline parameter learning process. First, we define the complete data set \((\mathbf {S}_{N}, {\mathbf {\overline{x}}}_{N}, {\mathbf {\overline{C}}}_{N})\), where \({\mathbf {\overline{C}}}_{N}=[\overline{\mathbf {c}}_1,\overline{\mathbf {c}}_2,\cdots ,\overline{\mathbf {c}}_{N}]\). Thus, using the conditional independence property, the complete data likelihood function can be written as:

$$\begin{aligned}&{p({\mathbf {S}}_{N}, {\mathbf {\overline{x}}}_{N}, {\mathbf {\overline{C}}}_{N})}=\prod\limits_{n=1}^N p(\mathbf {s}_{n}|\overline{\mathbf {c}}_{n})p(\overline{\mathbf {c}}_{n}|\overline{x}_{n})p(\overline{x}_{n}|\overline{x}_{n-1}). \end{aligned}$$
(18)

Next, we show how the parameter set can be obtained iteratively using the EM algorithm. Moreover, we propose an acceleration strategy to lower the computational and memory complexities. The traditional MU update algorithm for the KL divergence-based NMF can be seen as a special case of the proposed algorithm.

Expectation step: We first calculate the posterior state probability and the joint posterior probability, which can be written as:

$$\begin{aligned} q(\overline{x}_{n})=p(\overline{x}_{n}|{\mathbf {S}}_{N};{\varvec{\Phi }}^{i-1}), \end{aligned}$$
(19)
$$\begin{aligned} q(\overline{x}_{n},\overline{x}_{n-1})=p(\overline{x}_{n},\overline{x}_{n-1}|{\mathbf {S}}_{N};{\varvec{\Phi }}^{i-1}), \end{aligned}$$
(20)

where i is the iteration number. The calculation of (19) and (20) can be performed using the forward-backward algorithm [56]. Apart from this, we also need to evaluate the posterior expectation \({\mathbb E}_{\overline{\mathbf {c}}_{n}|{\mathbf {S}}_{N},{\overline{x}_{n}};{{\varvec{\Phi }}}^{i-1}}(\overline{\mathbf {c}}_{n})\), which will be used in the maximization step. By using the Bayes rule and the conditional independence property of the proposed model, we have:

$$\begin{aligned} q(\overline{\mathbf {c}}_{n}|\overline{x}_{n})= p(\overline{\mathbf {c}}_{n}|{\mathbf {S}}_{N},\overline{x}_{n};{\varvec{\Phi }}^{i-1}) = \frac{p(\mathbf {s}_{n}|\overline{\mathbf {c}}_{n})\,p(\overline{\mathbf {c}}_{n}|\overline{x}_{n})}{p({\mathbf {S}}_{N},\overline{x}_{n})}. \end{aligned}$$
(21)

Combining (14) and (15) and following the derivation in [53], we have:

$$\begin{aligned} &q(\overline{\mathbf{c}}_{n}|\overline{x}_{n})= \\ &\prod\limits_{f=1}^{F} {\mathcal{M}} ({\overline{c}}_{f,n}(1),\cdots ,{\overline{c}}_{f,n}({\overline{K}});|S(f,n)|,\\ &p_{f,n}^{{\overline{x}}_{n}}(1), \cdots ,p_{f,n}^{{\overline{x}}_{n}}{(\overline{K})}), \end{aligned}$$
(22)

where \({\mathcal {M}} (\cdot )\) denotes the multinomial distribution and

$$\begin{aligned} p_{f,n}^{{\overline{x}}_{n}}(k)=\frac{\overline{W}_{f,k}^{\overline{x}_{n}}\overline{H}_{k,n}^{\overline{x}_{n}}}{\sum _{l=1}^{\overline{K}}\overline{W}_{f,l}^{\overline{x}_{n}}\overline{H}_{l,n}^{\overline{x}_{n}}}. \end{aligned}$$
(23)

Using the properties of the multinomial distribution, the mean can be written as:

$$\begin{aligned} {\mathbb E}({\overline{c}}_{f,n}({k})|{\mathbf {S}}_{N},{\overline{x}_{n}})=|S(f,n)|\frac{\overline{W}_{f,k}^{\overline{x}_{n}}\overline{H}_{k,n}^{\overline{x}_{n}}}{\sum _{l=1}^{\overline{K}}\overline{W}_{f,l}^{\overline{x}_{n}}\overline{H}_{l,n}^{\overline{x}_{n}}}. \end{aligned}$$
(24)
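The expectation step can be sketched as follows. The posteriors (19) and (20) are computed with a scaled forward-backward recursion, which is a standard textbook formulation and an assumption of this example, since the paper only cites [56] for it. The posterior mean (24) is computed for one frame and one state. Function names and the toy inputs are hypothetical.

```python
import numpy as np

def forward_backward(b, pi, A):
    """b[j, n] is the state-conditioned likelihood p(s_n | x_n = j).
    Returns gamma[j, n] = q(x_n = j) as in Eq. (19) and
    xi[n - 1, i, j] = q(x_{n-1} = i, x_n = j) as in Eq. (20)."""
    J, N = b.shape
    alpha = np.zeros((J, N))
    beta = np.ones((J, N))
    scale = np.zeros(N)
    alpha[:, 0] = pi * b[:, 0]
    scale[0] = alpha[:, 0].sum()
    alpha[:, 0] /= scale[0]
    for n in range(1, N):                       # forward pass
        alpha[:, n] = (alpha[:, n - 1] @ A) * b[:, n]
        scale[n] = alpha[:, n].sum()
        alpha[:, n] /= scale[n]
    for n in range(N - 2, -1, -1):              # backward pass
        beta[:, n] = (A @ (b[:, n + 1] * beta[:, n + 1])) / scale[n + 1]
    gamma = alpha * beta
    xi = np.empty((N - 1, J, J))
    for n in range(1, N):
        xi[n - 1] = (alpha[:, n - 1][:, None] * A
                     * (b[:, n] * beta[:, n])[None, :]) / scale[n]
    return gamma, xi

def expected_components(s_n, W_j, h_jn, eps=1e-12):
    """Eq. (24) for one frame and state j: array of shape (F, K) whose
    entry (f, k) is |S(f,n)| W_fk H_kn / sum_l W_fl H_ln."""
    num = W_j * h_jn[None, :]
    return s_n[:, None] * num / (num.sum(axis=1, keepdims=True) + eps)

# Toy inputs: 3 states, 6 frames, random likelihood values.
rng = np.random.default_rng(0)
J, N = 3, 6
b = rng.random((J, N)) + 0.1
pi = np.full(J, 1.0 / J)
A = rng.random((J, J))
A /= A.sum(axis=1, keepdims=True)
gamma, xi = forward_backward(b, pi, A)

s_n = rng.random(8)
c_mean = expected_components(s_n, rng.random((8, 4)), rng.random(4))
```

For real spectrograms the likelihoods in `b` underflow quickly, so one would typically subtract the per-frame maximum of the log-likelihoods before exponentiating; the posteriors are unchanged by such a per-frame rescaling.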

Maximization step: In this step, our objective is to find parameters to maximize the expectation of the logarithm of the complete data likelihood, that is,

$$\begin{aligned} {{\varvec{\Phi }}}^{i}= \underset{\varvec{\Phi }}{\arg \max } { \mathbb E}_{{\mathbf {\overline{x}}}_{N}, {\mathbf {\overline{C}}}_{N}|{\mathbf {S}}_{N};{{\varvec{\Phi }}}^{i-1}}[\log {p({\mathbf {S}}_{N}, {\mathbf {\overline{x}}}_{N}, {\mathbf {\overline{C}}}_{N})}]. \end{aligned}$$
(25)

The estimators for \(\overline{\mathbf {A}}\) and \(\overline{\varvec{\pi }}\) are the same as the traditional HMM [56]. For completeness, the results are shown below:

$$\begin{aligned} {\overline{\pi }}_j=\frac{q(\overline{x}_{1}=j)}{\sum _{o=1}^{\overline{J}} q(\overline{x}_{1}=o)}, \end{aligned}$$
(26)
$$\begin{aligned} {\overline{A}}_{o,j}=\frac{\sum _{n=2}^{\overline{N}}q(\overline{x}_{n}=j,\overline{x}_{n-1}=o)}{\sum _{j=1}^{\overline{J}}\sum _{n=2}^{\overline{N}} q(\overline{x}_{n}=j,\overline{x}_{n-1}=o)}, \end{aligned}$$
(27)

where \(1\le o,j\le \overline{J}\). The estimated basis and activation matrices can be derived by setting the derivatives of (25) to zero, which gives:

$$\begin{aligned} W_{f,k}^{j}=\frac{\sum _{n=1}^{{N}}q(\overline{x}_{n}=j){\mathbb E}({\overline{c}}_{f,n}(k)|{\mathbf {S}}_{N},{\overline{x}_{n}}=j)}{\sum _{n=1}^{{N}}q(\overline{x}_{n}=j)H_{k,n}^{j}}, \end{aligned}$$
(28)
$$\begin{aligned} H_{k,n}^{j}=\frac{\sum _{f=1}^{{F}}{\mathbb E}({\overline{c}}_{f,n}(k)|{\mathbf {S}}_{N},{\overline{x}_{n}}=j)}{\sum _{f=1}^{{F}}W_{f,k}^{j}}. \end{aligned}$$
(29)
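
The HMM re-estimation formulas (26) and (27) can be sketched as below, assuming NumPy and that the state posteriors have already been computed by the forward-backward algorithm. The names `gamma`, `xi`, and `update_hmm` are hypothetical and chosen only for illustration.

```python
import numpy as np

def update_hmm(gamma, xi):
    """Re-estimate the initial-state and transition probabilities, as in (26)-(27).

    gamma: (N, J) array, gamma[n, j] = q(x_n = j)
    xi:    (N-1, J, J) array of pairwise posteriors for consecutive frames,
           xi[t, o, j] = q(current state = j, previous state = o)
    """
    pi = gamma[0] / gamma[0].sum()                  # (26)
    counts = xi.sum(axis=0)                         # expected transitions o -> j
    A = counts / counts.sum(axis=1, keepdims=True)  # (27): normalize over j
    return pi, A

# Illustrative random posteriors (hypothetical sizes N=6, J=3)
rng = np.random.default_rng(0)
N, J = 6, 3
gamma = rng.random((N, J))
gamma /= gamma.sum(axis=1, keepdims=True)
xi = rng.random((N - 1, J, J))
xi /= xi.sum(axis=(1, 2), keepdims=True)
pi, A = update_hmm(gamma, xi)
```

Each row of `A` sums to one, matching the normalization over j in the denominator of (27).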

Acceleration strategy: Although we can directly use the above EM algorithm to update the parameter set, storing the conditional expectation of \({\overline{c}}_{f,n}({k})\) in (24) requires a great deal of memory. As in [53], we substitute (24) into (28) and (29) to obtain:

$$\begin{aligned} \overline{W}_{f,k}^{j} \leftarrow \overline{W}_{f,k}^{j}\, \frac{\displaystyle {\sum\limits _{n=1}^{{N}}q(\overline{x}_{n}=j)\frac{|S(f,n)|\overline{H}_{k,n}^{j}}{\sum _{l=1}^{\overline{K}}\overline{W}_{f,l}^{j}\overline{H}_{l,n}^{j}}}}{\sum _{n=1}^{{N}}q(\overline{x}_{n}=j)\overline{H}_{k,n}^{j}}, \end{aligned}$$
(30)
$$\begin{aligned} \overline{H}_{k,n}^{j} \leftarrow \overline{H}_{k,n}^{j}\, \frac{\displaystyle {\sum \limits _{f=1}^{{F}}\frac{\overline{W}_{f,k}^{j}|S(f,n)|}{\sum _{l=1}^{\overline{K}}\overline{W}_{f,l}^{j}\overline{H}_{l,n}^{j}}}}{\sum _{f=1}^{{F}}\overline{W}_{f,k}^{j}}. \end{aligned}$$
(31)

We can further write (30) and (31) in matrix forms:

$$\begin{aligned} {\overline{\mathbf {W}}}^{j} \leftarrow {\overline{\mathbf {W}}}^{j} \odot {\frac{\frac{{\mathbf {S}}_{N}}{{\overline{\mathbf {W}}}^{j}{\overline{\mathbf {H}}}^{j}}{\varvec{\Lambda } (j)}({\overline{\mathbf {H}}}^{j})^{T}}{{\mathbf {1}}{\varvec{\Lambda } (j)}({\overline{\mathbf {H}}}^{j})^{T}}}, \end{aligned}$$
(32)
$$\begin{aligned} {\overline{\mathbf {H}}}^{j} \leftarrow {\overline{\mathbf {H}}}^{j} \odot {\frac{({{\overline{\mathbf {W}}}^{j}})^{T}\frac{{\mathbf {S}}_{N}}{{\overline{\mathbf {W}}}^{j}{\overline{\mathbf {H}}}^{j}}}{({\overline{\mathbf {W}}}^{j})^{T}\mathbf {1}}}, \end{aligned}$$
(33)

where \(\varvec{\Lambda } (j)=\mathrm {diag}(q(\overline{x}_1=j),q(\overline{x}_2=j),\cdots ,q(\overline{x}_{N}=j))\). With this acceleration strategy, the conditional expectation of \({\overline{c}}_{f,n}({k})\) in (24) does not need to be computed or stored. Moreover, the basis and activation matrices are updated with multiplicative rules, which are fast to compute. The proposed algorithm has to estimate one basis matrix and one activation matrix for every state; with the acceleration strategy, the matrices of different states can be estimated simultaneously rather than one by one, which reduces the time complexity. Comparing the update rules of the proposed method (32), (33) with those of the traditional NMF-based method (5), (6), the difference is that the basis update rule (32) takes the posterior state information \(\varvec{\Lambda } (j)\) into account. In fact, if the number of states is set to one (i.e., \(\overline{J} = 1\)), the proposed training method is identical to the traditional KL divergence-based NMF approach. Thus, the traditional NMF can be seen as a special case of the proposed algorithm. The entire flow of the offline parameter learning is shown in Algorithm 1. Note that, for stability reasons, each column of \({\overline{\mathbf {W}}}^{j}\) is normalized to have a unit norm during training.
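
The matrix updates (32) and (33) can be sketched for a single state as follows. This is a minimal illustration assuming NumPy; the function names, the small constant `eps` that guards against division by zero, and the random test data are assumptions, and the column normalization of the basis matrix mentioned above is omitted for brevity. With all posteriors set to one, the sketch reduces to the standard KL-divergence NMF updates, so the divergence should not increase.

```python
import numpy as np

def kl_div(S, V):
    """Generalized KL divergence between positive arrays S and V."""
    return np.sum(S * np.log(S / V) - S + V)

def update_state(S, W, H, q, eps=1e-12):
    """One pass of (32) followed by (33) for a single state j.

    S: (F, N) magnitude spectrogram, W: (F, K), H: (K, N)
    q: (N,) state posteriors q(x_n = j), the diagonal of Lambda(j)
    """
    R = S / (W @ H + eps)                         # element-wise S / (W H)
    # (32): numerator (R Lambda) H^T, denominator 1 Lambda H^T (same for every f)
    W = W * ((R * q) @ H.T) / (H @ q + eps)
    R = S / (W @ H + eps)
    # (33): numerator W^T R, denominator W^T 1 (column sums of W)
    H = H * (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
    return W, H

# Illustrative run with a single state, so Lambda is the identity matrix
rng = np.random.default_rng(0)
S = rng.random((6, 8)) + 0.1
W0 = rng.random((6, 3)) + 0.1
H0 = rng.random((3, 8)) + 0.1
q = np.ones(8)
kl_before = kl_div(S, W0 @ H0)
W, H = W0, H0
for _ in range(20):
    W, H = update_state(S, W, H, q)
kl_after = kl_div(S, W @ H)
```

Because each state only needs its own `W`, `H`, and `q`, the calls for different states are independent of one another, which is what makes simultaneous estimation possible.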

figure a

4.2 Online speech enhancement using the MMSE estimator

4.2.1 MMSE estimator for the NMF-HMM

In this section, we provide a detailed derivation of the proposed MMSE-based online speech enhancement algorithm for the NMF-HMM model. Our objective is to obtain the MMSE estimate of the desired clean speech signal from the noisy observation:

$$\begin{aligned} \hat{\mathbf {s}}_{n}={\mathbb E}_{\mathbf {s}_{n}|{\mathbf {Y}}_{n}}(\mathbf {s}_{n})= \int {\mathbf {s}_{n}}p({\mathbf {s}_{n}|{\mathbf {Y}}_{n}})\, d{\mathbf {s}_{n}}. \end{aligned}$$
(34)

In (34), the posterior probability \(p({\mathbf {s}_{n}|{\mathbf {Y}}_{n}})\) can be derived as:

$$\begin{aligned}&p({\mathbf {s}_{n}|{\mathbf {Y}}_{n}}) = \frac{p({\mathbf {s}}_{n},{\mathbf {y}}_{n}|{\mathbf {Y}}_{n-1})}{p({\mathbf {y}}_{n}|{\mathbf {Y}}_{n-1})} \\&= \frac{\sum _{\overline{x}_{n}, \ddot{x}_{n}} p({\mathbf {s}}_{n},{\mathbf {y}}_{n}|{\overline{x}_{n}, \ddot{x}_{n}})p({\overline{x}_{n}, \ddot{x}_{n}}|{\mathbf {Y}}_{n-1})}{p({\mathbf {y}}_{n}|{\mathbf {Y}}_{n-1})}, \end{aligned}$$
(35)

where we use the conditional independence property of the HMM. The term \(p({\overline{x}_{n}, \ddot{x}_{n}}|{\mathbf {Y}}_{n-1})\) in (35) can be expressed as:

$$\begin{aligned} &p({\overline{x}_{n}, \ddot{x}_{n}}|{\mathbf {Y}}_{n-1}) \\&= \sum \limits _{\overline{x}_{n-1}, \ddot{x}_{n-1}} p({\overline{x}_{n}, \ddot{x}_{n}}|{\overline{x}_{n-1}, \ddot{x}_{n-1}})p({\overline{x}_{n-1}, \ddot{x}_{n-1}}|{\mathbf {Y}}_{n-1}), \end{aligned}$$
(36)

where the first term after the summation is the state transition probability for a noisy signal, and the second term is the forward probability that can be acquired using the well-known forward algorithm [56]. By applying the Bayes rule, the term \(p({\mathbf {s}}_{n},{\mathbf {y}}_{n}|{\overline{x}_{n}, \ddot{x}_{n}})\) in (35) can be further written as:

$$\begin{aligned} p({\mathbf {s}}_{n},{\mathbf {y}}_{n}|{\overline{x}_{n}, \ddot{x}_{n}}) = p({\mathbf {s}}_{n}|{\mathbf {y}}_{n},{\overline{x}_{n}, \ddot{x}_{n}})p({\mathbf {y}}_{n}|{\overline{x}_{n}, \ddot{x}_{n}}). \end{aligned}$$
(37)

Substituting (37) into (35), the posterior probability can be rewritten as:

$$\begin{aligned} p({\mathbf {s}_{n}|{\mathbf {Y}}_{n}}) = \sum \limits _{\overline{x}_{n},\ddot{x}_{n}} \omega _{\overline{x}_{n}, \ddot{x}_{n}}p({\mathbf {s}}_{n}|{\mathbf {y}}_{n},{\overline{x}_{n}, \ddot{x}_{n}}), \end{aligned}$$
(38)

where the weight \(0 \le \omega _{\overline{x}_{n}, \ddot{x}_{n}} \le 1\) is defined as:

$$\begin{aligned} {\omega _{\overline{x}_{n}, \ddot{x}_{n}}}=\frac{{p(\mathbf {y}_{n} | \overline{x}_{n}, \ddot{x}_{n}) }{p(\overline{x}_{n}, \ddot{x}_{n} | \mathbf {Y}_{n-1}})}{\sum _{\overline{x}_{n}, \ddot{x}_{n}}{p(\mathbf {y}_{n} | \overline{x}_{n}, \ddot{x}_{n}) }{p(\overline{x}_{n}, \ddot{x}_{n} | \mathbf {Y}_{n-1}}) }. \end{aligned}$$
(39)
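
The prediction step (36) and the weights (39) can be sketched as below, assuming NumPy and treating each pair of speech and noise states as one composite state. The function name `state_weights` and the numbers are illustrative. Building the composite transition matrix as a Kronecker product assumes that the speech and noise chains evolve independently, which is an assumption of this sketch.

```python
import numpy as np

def state_weights(alpha_prev, A, lik):
    """Composite-state weights for the current frame.

    alpha_prev: (J,) filtered posterior p(x_{n-1} | Y_{n-1})
    A:          (J, J) transition matrix, A[o, j] = p(x_n = j | x_{n-1} = o)
    lik:        (J,) observation likelihoods p(y_n | x_n = j)
    """
    prior = alpha_prev @ A   # (36): one-step state prediction
    w = lik * prior          # numerator of (39)
    return w / w.sum()       # normalize over all composite states

# Illustrative example: two speech states, one noise state
A_speech = np.array([[0.9, 0.1], [0.2, 0.8]])
A_noise = np.array([[1.0]])
A = np.kron(A_speech, A_noise)   # assumes independent speech and noise chains
alpha_prev = np.array([0.5, 0.5])
lik = np.array([0.3, 0.1])
omega = state_weights(alpha_prev, A, lik)
```

By Bayes' rule, the normalized weights are also the filtered posterior of the current frame, so they can be passed on as `alpha_prev` when the next frame is processed.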

Thus, by combining (34) and (38), the proposed HMM-based MMSE estimator can be expressed as:

$$\begin{aligned} \hat{\mathbf {s}}_{n} = \sum \limits _{\overline{x}_{n},\ddot{x}_{n}} \omega _{\overline{x}_{n}, \ddot{x}_{n}} \int {\mathbf {s}_{n}}p({\mathbf {s}}_{n}|{\mathbf {y}}_{n},{\overline{x}_{n}, \ddot{x}_{n}})\, d{\mathbf {s}_{n}}. \end{aligned}$$
(40)

Instead of obtaining the posterior probability density function (PDF) \(p({\mathbf {s}}_{n}|{\mathbf {y}}_{n},{\overline{x}_{n}, \ddot{x}_{n}})\) directly, we derive the formula for the joint posterior PDF of the clean speech and noise first, that is:

$$\begin{aligned} &p({\mathbf {s}_{n},\mathbf {m}_{n}|\mathbf {y}_{n}},{\overline{x}_{n}, \ddot{x}_{n}}) \\&= \frac{p({\mathbf {y}_{n}|\mathbf {s}_{n},\mathbf {m}_{n}})p({\mathbf {s}_{n},\mathbf {m}_{n}|}{\overline{x}_{n}, \ddot{x}_{n}})}{p({\mathbf {y}_{n}}|{\overline{x}_{n}, \ddot{x}_{n}})} \\&= \frac{p({\mathbf {y}_{n}|\mathbf {s}_{n},\mathbf {m}_{n}})p({\mathbf {s}_{n}|}{\overline{x}_{n}})p({\mathbf {m}_{n}|}{\ddot{x}_{n}})}{p({\mathbf {y}_{n}}|{\overline{x}_{n}, \ddot{x}_{n}})}. \end{aligned}$$
(41)

By using (1), we can express the likelihood function \(p({\mathbf {y}_{n}|\mathbf {s}_{n},\mathbf {m}_{n}})\) as \(p({\mathbf {y}_{n}|\mathbf {s}_{n},\mathbf {m}_{n}})= \delta ({\mathbf {y}_{n}-\mathbf {s}_{n}-\mathbf {m}_{n}})\), where \(\delta (\cdot )\) denotes the Dirac delta function, which is defined by \(\delta (0)=+\infty\), and \(\delta (x)=0\) when \(x\ne 0\). Furthermore, \(\int _{-\infty }^{+\infty }\delta (x)\, dx=1\). The prior probabilities \(p({\mathbf {s}_{n}|}{\overline{x}_{n}})\) and \(p({\mathbf {m}_{n}|}{\ddot{x}_{n}})\) can be estimated by using (16). Following the derivation in [53], we can verify that the joint posterior PDF can be expressed in terms of the multinomial distribution as:

$$\begin{aligned} &p({\mathbf {s}_{n},\mathbf {m}_{n}|\mathbf {y}_{n}},{\overline{x}_{n}, \ddot{x}_{n}})= \\&\prod _{f=1}^F {\mathcal {M}} (|S(f,n)|,|M(f,n)|;\\&|Y(f,n)|,p_{f,n}(\overline{x}_{n}, \ddot{x}_{n}),q_{f,n}(\overline{x}_{n}, \ddot{x}_{n})), \end{aligned}$$
(42)

where \(p_{f,n}(\overline{x}_{n}, \ddot{x}_{n})\) and \(q_{f,n}(\overline{x}_{n}, \ddot{x}_{n})\) are defined as:

$$\begin{aligned} &p_{f,n}(\overline{x}_{n}, \ddot{x}_{n})= \\&\frac{\sum _{k=1}^{\overline{K}}\overline{W}_{f,k}^{\overline{x}_{n}}\overline{H}_{k,n}^{\overline{x}_{n}}}{\sum _{k=1}^{\overline{K}} \overline{W}_{f,k}^{\overline{x}_{n}}\overline{H}_{k,n}^{\overline{x}_{n}}+\sum _{k=1}^{\ddot{K}} \ddot{W}_{f,k}^{\ddot{x}_{n}}\ddot{H}_{k,n}^{\ddot{x}_{n}}}, \end{aligned}$$
(43)

and \(q_{f,n}(\overline{x}_{n}, \ddot{x}_{n})=1-p_{f,n}(\overline{x}_{n}, \ddot{x}_{n})\). Therefore, the integral term in (40) can be expressed as:

$$\begin{aligned}&\int {\mathbf {s}_{n}}p({\mathbf {s}}_{n}|{\mathbf {y}}_{n},{\overline{x}_{n}, \ddot{x}_{n}})\, d{\mathbf {s}_{n}} \\&=\int {\mathbf {s}_{n}} \int {p({\mathbf {s}}_{n},{\mathbf {m}}_{n}|{\mathbf {y}}_{n},{\overline{x}_{n}, \ddot{x}_{n}})}\, d{ \mathbf {m}_{n}}\, d{ \mathbf {s}_{n}} \\&={{\mathbf {y}}_{n}} \odot {\mathbf {p}_{n}}({\overline{x}_{n}, \ddot{x}_{n}}), \end{aligned}$$
(44)

where \({\mathbf {p}_{n}}({\overline{x}_{n}, \ddot{x}_{n}}) = [p_{1,n}(\overline{x}_{n}, \ddot{x}_{n}),\cdots ,p_{F,n}(\overline{x}_{n}, \ddot{x}_{n})]^T\), and we used the marginal mean property of the multinomial distribution. Combining (40) and (44), the MMSE estimator can be expressed as:

$$\begin{aligned} \hat{\mathbf {s}}_{n} = {\mathbf {y}_{n}} \odot {\mathbf {g}_{n}}, \end{aligned}$$
(45)
$$\begin{aligned} {\mathbf {g}_{n}} = {\sum \limits _{\overline{x}_{n}, \ddot{x}_{n}} \omega _{\overline{x}_{n}, \ddot{x}_{n}}{\mathbf {p}_{n}}({\overline{x}_{n}, \ddot{x}_{n}}}), \end{aligned}$$
(46)

where \(\mathbf {g}_{n}\) can be viewed as the spectral gain vector for the proposed model. Comparing the proposed gain vector \(\mathbf {g}_{n}\) with the traditional NMF-based gain vector [36], we find that the proposed gain vector is a weighted sum of the per-state gains, each of which has the same Wiener filtering form as the traditional NMF gain (7).
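
The gain computation in (43), (45), and (46) can be sketched as follows, assuming NumPy. The arrays `speech_hat` and `noise_hat` stand for the modeled speech and noise spectra of each composite state, that is, the two sums in the denominator of (43); these names, the function name `mmse_gain`, and the random data are hypothetical.

```python
import numpy as np

def mmse_gain(omega, speech_hat, noise_hat):
    """Spectral gain vector as a weighted sum of per-state Wiener-like gains.

    omega:      (Js, Jn) state weights from (39), summing to one
    speech_hat: (Js, Jn, F) modeled speech spectrum for each composite state
    noise_hat:  (Js, Jn, F) modeled noise spectrum for each composite state
    """
    p = speech_hat / (speech_hat + noise_hat)   # (43) for every composite state
    return np.einsum("ij,ijf->f", omega, p)     # (46)

# Illustrative example: two speech states, one noise state, F = 5 bins
rng = np.random.default_rng(1)
speech_hat = rng.random((2, 1, 5)) + 0.1
noise_hat = rng.random((2, 1, 5)) + 0.1
omega = np.array([[0.7], [0.3]])
g = mmse_gain(omega, speech_hat, noise_hat)
y = rng.random(5)   # noisy magnitude spectrum of one frame
s_est = y * g       # (45)
```

Since every per-state gain lies between zero and one and the weights sum to one, the combined gain also lies between zero and one.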

4.2.2 Online estimation of activation matrices

After obtaining the trained basis matrices \(\overline{W}_{f,k}^{\overline{x}_{n}}\) and \(\ddot{W}_{f,k}^{\ddot{x}_{n}}\) for both the clean speech and noise in the training stage, we need to obtain the online estimates of the activation parameters \(\overline{H}_{k,n}^{\overline{x}_{n}}\) and \(\ddot{H}_{k,n}^{\ddot{x}_{n}}\) to acquire the gain in (45) and (46). The activations are estimated by maximizing the logarithm of the state-conditioned likelihood function (17), which is equivalent to:

$$\begin{aligned} \mathbf {h}_{n} ({\overline{x}_{n}, \ddot{x}_{n}}) = \underset{{\mathbf {h}}_{n}}{\arg \min } \ \ \mathrm {KL}({\mathbf {y}}_{n}|[{\overline{\mathbf {W}}}^{\overline{x}_{n}},{\ddot{\mathbf {W}}}^{\ddot{x}_{n}}]{\mathbf {h}}_{n}), \end{aligned}$$
(47)
$$\begin{aligned} {\mathbf {h}}_{n} ({\overline{x}_{n}, \ddot{x}_{n}}) = [{\overline{\mathbf {h}}_{n}}({\overline{x}_{n}, \ddot{x}_{n}})^T,{\ddot{\mathbf {h}}_{n}}({\overline{x}_{n}, \ddot{x}_{n}})^T]^T, \end{aligned}$$
(48)

where the clean speech and noise activation vectors for the state \(({\overline{x}_{n}, \ddot{x}_{n}})\) are defined as \({\overline{\mathbf {h}}_{n}} ({\overline{x}_{n}, \ddot{x}_{n}})= [\overline{H}_{1,n}^{\overline{x}_{n}},\overline{H}_{2,n}^{\overline{x}_{n}},\cdots ,\overline{H}_{\overline{K},n}^{\overline{x}_{n}}]^T\) and \({\ddot{\mathbf {h}}_{n}} ({\overline{x}_{n}, \ddot{x}_{n}})= [\ddot{H}_{1,n}^{\ddot{x}_{n}},\ddot{H}_{2,n}^{\ddot{x}_{n}},\cdots ,\ddot{H}_{\ddot{K},n}^{\ddot{x}_{n}}]^T\). The activation vector (48) can be obtained iteratively by using the multiplicative update rule in Eq. (6). Note that parallel computing can be used to reduce the time complexity when obtaining the activations for different states. It can be readily shown that when \(\overline{J} = \ddot{J}=1\), the gain vectors for the proposed algorithm (46) and the standard NMF (7) are identical, that is, \(\mathbf {g}_{n}=\mathbf {g}_{n}^{\mathrm {NMF}}\). The entire flow of the proposed MMSE-based online speech enhancement algorithm is illustrated by Algorithm 2.
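
A minimal sketch of the per-frame minimization in (47) is given below, assuming NumPy and the standard KL multiplicative update for the activations, which has the same form as (33), with the trained bases held fixed. The function names, the random initialization, the constant `eps`, and the test data are assumptions; the default of 15 iterations mirrors the online setting reported in Section 5.1.

```python
import numpy as np

def kl(y, v):
    """Generalized KL divergence between positive vectors y and v."""
    return np.sum(y * np.log(y / v) - y + v)

def estimate_activation(y, W_speech, W_noise, n_iter=15, eps=1e-12, seed=0):
    """Estimate the activations of one frame for one composite state.

    y: (F,) noisy magnitude spectrum; W_speech: (F, Ks); W_noise: (F, Kn)
    Returns the speech and noise parts of the stacked vector in (48).
    """
    W = np.hstack([W_speech, W_noise])               # concatenated dictionary
    h = np.random.default_rng(seed).random(W.shape[1]) + 0.1
    col_sums = W.sum(axis=0)                         # W^T 1
    for _ in range(n_iter):
        h = h * (W.T @ (y / (W @ h + eps))) / (col_sums + eps)
    ks = W_speech.shape[1]
    return h[:ks], h[ks:]

# Illustrative data (hypothetical sizes F=8, Ks=3, Kn=2)
rng = np.random.default_rng(2)
Ws = rng.random((8, 3)) + 0.1
Wn = rng.random((8, 2)) + 0.1
y = rng.random(8) + 0.1
h_s, h_n = estimate_activation(y, Ws, Wn)
h_s0, h_n0 = estimate_activation(y, Ws, Wn, n_iter=0)   # initial values only
W_all = np.hstack([Ws, Wn])
kl_final = kl(y, W_all @ np.concatenate([h_s, h_n]))
kl_init = kl(y, W_all @ np.concatenate([h_s0, h_n0]))
```

Each composite state uses its own pair of dictionaries and does not depend on the others, so these calls could be run in parallel or batched.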

figure b

5 Experimental results and discussion

In this section, we evaluate the proposed algorithm in several experiments. First, we investigated the effect of different parameter settings for the proposed model, that is, the numbers of states and basis vectors of clean speech and noise. Second, we compared the proposed NMF-HMM with other state-of-the-art speech enhancement methods to demonstrate the effectiveness of the proposed algorithm. In this work, the PESQ score [41], ranging from −0.5 to 4.5, was used to quantify the enhanced speech quality. The version of the PESQ model used was the International Telecommunication Union (ITU) standard P.862 [57]. The implementation code was provided by [2]. The STOI score [42], ranging from 0 to 1, was used to measure speech intelligibility.

5.1 Experimental data preparation

In this study, the proposed algorithm was evaluated using the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database [58], 100 environmental noises [59], office noise (see Note 1), and the NoiseX-92 database [60]. During the training stage, all 4620 utterances from the TIMIT training database were used to train the proposed NMF-HMM model for clean speech.

For the experiments in Section 5.2, the Babble, F16, Factory, and White noises from the NoiseX-92 database were used to train the NMF-HMM noise model. To build the test database, 200 utterances were randomly chosen from the TIMIT test set, which contains 1680 utterances. Four types of noise were then added at four different SNR levels (−5, 0, 5, and 10 dB). The noise types of the testing set were the same as those of the training set, but there was no overlap between the signals in the two sets. In total, \(200 \times 4 \times 4 = 3200\) utterances were used for the evaluation.

For the experiments in Section 5.3, we conducted more extensive experiments; the Babble and F16 noises from the NoiseX-92 database and 90 environmental noises (N1–N90 in [59]) were used to train the NMF-HMM model for the noise dictionary. In the test stage, 200 utterances were randomly chosen from the TIMIT test set, which contains 1680 utterances, to build three test databases. The first test database included 10 unseen environmental noises from [59] (N91–N100). The second included unseen office noise, and the third test database was built from 25 seen environmental noises in [59] (N18–N43). In all three test databases, the noise was added at four different SNR levels (−5, 0, 5, and 10 dB). All the algorithms were evaluated using the same test dataset.

In all experiments, the sound signals were down-sampled to 16 kHz. The frame length was set to 1024 samples (64 ms) with a frame shift of 512 samples (32 ms). The STFT size was 1024 points with a Hanning window. Furthermore, the maximum number of iterations was set to 30 in the training stage and 15 in the online speech enhancement stage for the proposed NMF-HMM algorithm.
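
As an illustration of the test set construction, the sketch below mixes a speech signal with noise at a target SNR and checks the frame timing and the size of the Section 5.2 test set. It assumes NumPy and a power-based SNR computed over the whole utterance; the paper does not state how the SNR was measured, so `mix_at_snr` and the synthetic signals are assumptions made only for this example.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise, scale

# Synthetic stand-ins for a speech utterance and a noise recording
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(20000)
noisy, scale = mix_at_snr(speech, noise, snr_db=-5.0)
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((scale * noise[:16000]) ** 2))

# Frame timing at 16 kHz and the size of the Section 5.2 test set
fs = 16000
frame_ms = 1000 * 1024 / fs   # frame length in milliseconds
hop_ms = 1000 * 512 / fs      # frame shift in milliseconds
n_test = 200 * 4 * 4          # utterances x noise types x SNR levels
```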

5.2 Analyses of the number of states and basis vectors

As explained in Sections 3 and 4, four parameters need to be pre-defined in our proposed NMF-HMM-based speech enhancement algorithm. These parameters are the numbers of states (\(\overline{J}\) and \(\ddot{J}\)) and basis vectors (\(\overline{K}\) and \(\ddot{K}\)) for the clean speech and noise. In this section, we investigate the effects of these parameters on our proposed method and choose suitable values for the later experiments.

5.2.1 HMM states analysis

Fig. 1 Performance of the NMF-HMM and T-NMF using different numbers of clean speech basis vectors

First, before the state analysis, we show that using temporal dynamics can effectively help NMF obtain a better SE performance. To verify this point, we use the traditional NMF-based speech enhancement (T-NMF) [36] as the reference method. T-NMF is a special case of NMF-HMM with \(\overline{J} = 1\) and \(\ddot{J} = 1\); it does not include temporal dynamics information, and its transition matrix A is non-informative. For a fair comparison, we keep the total number of clean speech basis vectors (\(\overline{K}\times \overline{J}\)) the same for the NMF-HMM and the T-NMF method [36]. For the T-NMF, the number of clean speech basis vectors \({\overline{K}}\) is varied as 25, 125, 250, 500, and 1000. For the NMF-HMM, \(\overline{K}\) is fixed to 25 and \(\overline{J}\) is varied as 1, 5, 10, 20, and 40. The number of noise basis vectors for both the proposed NMF-HMM and T-NMF is fixed to 70, and the number of noise states for the NMF-HMM is fixed to 1. In this experiment, we use the average STOI and PESQ scores of the 3200 test utterances as the performance metrics. The experimental results are shown in Fig. 1. As can be seen, the T-NMF achieves its best performance when \(\overline{K} = 25\). However, its performance degrades as the number of basis vectors increases, due to overfitting. By contrast, NMF-HMM achieves higher PESQ and STOI scores as the total number of clean speech basis vectors increases, because the HMM takes the temporal dynamics into account. This indicates that temporal dynamics can improve the SE performance of NMF.
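
The matching of model sizes in this comparison can be checked with a few lines of Python. The variable names are illustrative; the numbers are the settings listed above.

```python
# Total number of clean speech basis vectors, K x J, for the NMF-HMM settings
K_nmf_hmm = 25
J_values = [1, 5, 10, 20, 40]
totals_nmf_hmm = [K_nmf_hmm * J for J in J_values]

# T-NMF uses a single state, so its total equals its number of basis vectors
K_values_tnmf = [25, 125, 250, 500, 1000]
sizes_match = totals_nmf_hmm == K_values_tnmf
```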

Table 1 Average STOI scores (%) comparisons of different clean speech states and basis vectors (\(\ddot{J} = 1,{\ddot{K}}=70\))
Table 2 Average PESQ scores (%) comparisons of different clean speech states and basis vectors (\(\ddot{J} = 1,{\ddot{K}}=70\))
Table 3 Average STOI scores (%) comparisons of different noise states and basis vectors (\(\overline{J} = 40 ,{\overline{K}}= 25\))
Table 4 Average PESQ scores (%) comparisons of different noise states and basis vectors (\(\overline{J} = 40 ,{\overline{K}}= 25\))

5.2.2 States and basis vector analysis for clean speech

Next, we investigated the effect of the number of clean speech states \(\overline{J}\) and basis vectors \(\overline{K}\) on the proposed model. The number of noise states was set to 1 (i.e., \(\ddot{J} = 1\)) for the proposed NMF-HMM, and the number of basis vectors for the noise was fixed to \({\ddot{K}}=70\). The number of clean speech states was chosen as 1, 5, 10, 20, and 40. Additionally, the number of clean speech basis vectors was chosen as 5, 10, 25, and 50. The enhancement performance was evaluated by the PESQ and STOI scores.

Tables 1 and 2 show the average STOI and PESQ scores at different SNRs. It can be seen that if the number of basis vectors \(\overline{K}\) is fixed, the PESQ and STOI scores increase with the number of clean speech states \(\overline{J}\). This indicates the benefit of using temporal dynamics in the NMF model. Additionally, if the number of clean speech states \(\overline{J}\) is fixed, the NMF-HMM achieves the best speech enhancement performance when \(\overline{K} = 25\). A higher \(\overline{K}\) can lead to worse speech enhancement performance due to overfitting. Therefore, based on these experimental results, we chose \(\overline{J} = 40\) and \(\overline{K} = 25\) for the following experiments.

5.2.3 States and basis vector analysis for noise

In this part, we evaluated the effect of the number of noise states \(\ddot{J}\) and basis vectors \(\ddot{K}\) on the proposed model. Here, the numbers of clean speech states and basis vectors were set to 40 and 25 (\(\overline{J} = 40\), \({\overline{K}}=25\)), respectively, based on the previous experimental results. The number of noise states was chosen as 1, 2, 5, and 10. In addition, the number of noise basis vectors was chosen as 10, 20, 40, and 70.

Tables 3 and 4 show the average STOI and PESQ scores at different SNRs. The PESQ and STOI scores tend to increase with the number of noise states \(\ddot{J}\) when the number of noise basis vectors \(\ddot{K}\) is fixed. Moreover, if \(\ddot{J}\) is fixed, \(\ddot{K}=70\) achieves the highest PESQ score, but its STOI score is slightly lower than that of \(\ddot{K}=40\). Based on the experimental results, we selected \({\overline{J}=40,\ddot{J}=10, \overline{K}=25, \ddot{K}=40}\) for the rest of the experiments because the model has fewer parameters when \(\ddot{K}=40\). Furthermore, the STOI score is higher when \(\ddot{K}=40\), and the PESQ difference between \(\ddot{K}=40\) and \(\ddot{K}=70\) is small.

5.3 Overall evaluation

In this section, we compare the proposed NMF-HMM speech enhancement method with state-of-the-art speech enhancement methods. As reference methods, we chose the optimally modified log-spectral amplitude (OM-LSA) method [61] with the improved minima controlled recursive averaging (IMCRA) noise estimator [62]; the variable span linear filters method [7] (SLF-NMF), which uses the parametric NMF [17] for estimating the statistics; temporal-NMF [49]; convolutive NMF (CNMF) [55, 63]; a DNN [64]; and the log-MMSE algorithm [65]. For the SLF-NMF, the maximum SNR filter was applied, and the number of eigenvectors was set to one. The variable span linear filters reference code can be found in [7]. The codebook sizes of clean speech and noise were set to 64 and 8, respectively. The other SLF-NMF parameter settings were the same as for NMF-HMM. For the temporal-NMF, all the parameter settings were the same as in [49], which ensured that the temporal-NMF could achieve its best speech enhancement performance. For the CNMF, the settings were similar to those of the CNMF in [40]. For the DNN, we used the DNS baseline [64] as the reference method, which is one of the state-of-the-art speech enhancement algorithms. The OM-LSA and log-MMSE are state-of-the-art unsupervised speech enhancement methods, while the SLF-NMF and temporal-NMF are state-of-the-art NMF-based speech enhancement methods. Like our method, the temporal-NMF also considers temporal information.

Fig. 2 Average PESQ scores of different methods for 25 types of seen noise

Fig. 3 Average PESQ scores of different methods for 10 types of unseen noise

Fig. 4 Average PESQ scores of different methods for unseen office noise

The performance of the NMF-HMM, DNN, temporal-NMF, CNMF, SLF-NMF, log-MMSE, and OM-LSA was evaluated using the test set. Figure 2 shows the average PESQ scores with 95% confidence intervals of these algorithms for 25 types of seen noise. As can be seen, the SLF-NMF had the worst performance among these algorithms. Temporal-NMF and CNMF achieved a higher score than SLF-NMF, which indicates the benefit of temporal information for speech enhancement. Moreover, except for the DNS baseline, the proposed NMF-HMM outperformed the other enhancement algorithms in all the SNR scenarios. Furthermore, in low SNR scenarios (e.g., −5 to 5 dB), the average PESQ score improvement of the proposed NMF-HMM over the other algorithms was larger than 0.5.
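
The paper does not state how the 95% confidence intervals were computed. One common choice, shown here only as a sketch under a normal approximation, is the mean plus or minus 1.96 standard errors; the function name `mean_ci95` and the scores are made up for illustration.

```python
import math
import statistics

def mean_ci95(scores):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    mean = statistics.fmean(scores)
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width

# Made-up per-utterance scores, used only to exercise the function
mean, half_width = mean_ci95([2.0, 2.5, 3.0])
```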

Figures 3 and 4 show the PESQ results in unseen noise environments. They indicate that, except for the DNS baseline, NMF-HMM achieved a higher PESQ score than the reference methods at all four SNRs.

The STOI scores with 95% confidence intervals for the various algorithms are provided in Table 5. As can be seen, the temporal-NMF, CNMF, and NMF-HMM had higher STOI scores than SLF-NMF on the three test datasets, which illustrates the benefit of considering speech temporal information. In general, NMF-HMM achieved a higher STOI score than the reference NMF-based methods (temporal-NMF, CNMF, and SLF-NMF) for both seen and unseen noise. However, the DNS baseline achieved a better STOI score than NMF-HMM.

In general, among the non-DNN-based speech enhancement algorithms, the proposed method achieved the best speech enhancement performance, while the DNS baseline achieved the highest scores overall. In future work, a DNN-based strategy could be combined with the proposed algorithm to improve the accuracy of basis vector estimation, which may further improve its speech enhancement performance.

Table 5 Comparison of STOI scores (%) for various algorithms under different SNRs using different types of noise

6 Conclusion

In this work, we proposed and analyzed an NMF-HMM-based speech enhancement algorithm that applies the sum of the Poisson distribution, leading to the KL divergence measure, as the observation model for each state of the HMM. A computationally efficient multiplicative update rule is used for the parameter updates during the training stage. Moreover, by using the HMM, the method captures the temporal dynamics of speech signals. Furthermore, we detailed the derivation of the proposed NMF-HMM-based MMSE estimator for online speech enhancement. Parallel computation can be applied to the proposed estimator, which reduces the time complexity of the online speech enhancement stage. Through experiments, suitable numbers of states and basis vectors for the proposed NMF-HMM were found. Our experimental results also indicated that the proposed algorithm could outperform state-of-the-art NMF-based and unsupervised speech enhancement methods. In future work, a DNN-based strategy can be considered to improve the accuracy of basis vector estimation, which may further improve the speech enhancement performance of our algorithm.

Availability of data and materials

Not applicable.

Notes

  1. https://www.youtube.com/watch?v=D7ZZp8XuUTE

References

  1. J. Li, L. Deng, Y. Gong, R. Haeb-Umbach, An overview of noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 22(4), 745–777 (2014)

  2. P.C. Loizou, Speech Enhancement: Theory and Practice (CRC Press, Boca Raton, 2013)

  3. Y. Xu, J. Du, L.-R. Dai, C.-H. Lee, An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process. Lett. 21(1), 65–68 (2013)

  4. I. Cohen, S. Gannot, in Springer Handbook of Speech Processing. Spectral enhancement methods (Springer, Berlin, Heidelberg, 2008) p. 873–902

  5. S. Boll, Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27(2), 113–120 (1979)

  6. K.B. Christensen, M.G. Christensen, J.B. Boldt, F. Gran, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Experimental study of generalized subspace filters for the cocktail party situation (IEEE, Shanghai, 2016), p. 420–424

  7. J.R. Jensen, J. Benesty, M.G. Christensen, Noise reduction with optimal variable span linear filters. IEEE/ACM Trans. Audio Speech Lang. Process. 24(4), 631–644 (2015)

  8. Y. Ephraim, H.L. Van Trees, A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 3(4), 251–266 (1995)

  9. F. Jabloun, B. Champagne, Incorporating the human hearing properties in the signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 11(6), 700–708 (2003)

  10. J. Lim, A. Oppenheim, All-pole modeling of degraded speech. IEEE Trans. Acoust. Speech Signal Process. 26(3), 197–210 (1978)

  11. Y. Ephraim, D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32(6), 1109–1121 (1984)

  12. Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 33(2), 443–445 (1985)

  13. A. Hussain, M. Chetouani, S. Squartini, A. Bastari, F. Piazza, in Progress in nonlinear speech processing. An overview, Nonlinear speech enhancement (Springer, Berlin, Heidelberg, 2007), p. 217–248

  14. Y. Xu, J. Du, L.-R. Dai, C.-H. Lee, A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 23(1), 7–19 (2014)

  15. M.S. Kavalekalam, J.K. Nielsen, J.B. Boldt, M.G. Christensen, Model-based speech enhancement for intelligibility improvement in binaural hearing aids. IEEE/ACM Trans. Audio Speech Lang. Process. 27(1), 99–113 (2018)

  16. S. Srinivasan, J. Samuelsson, W.B. Kleijn, Codebook-based bayesian speech enhancement for nonstationary environments. IEEE Trans. Audio Speech Lang. Process. 15(2), 441–452 (2007)

  17. M.S. Kavalekalam, J.K. Nielsen, L. Shi, M.G. Christensen, J. Boldt, in Proc. European Signal Processing Conf. Online parametric NMF for speech enhancement (IEEE, Rome, 2018), p. 2320–2324

  18. Q. He, F. Bao, C. Bao, Multiplicative update of auto-regressive gains for codebook-based speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 25(3), 457–468 (2016)

  19. D.Y. Zhao, W.B. Kleijn, HMM-based gain modeling for enhancement of speech in noise. IEEE Trans. Audio Speech Lang. Process. 15(3), 882–892 (2007)

  20. F. Deng, C. Bao, W.B. Kleijn, Sparse hidden Markov models for speech enhancement in non-stationary noise environments. IEEE/ACM Trans. Audio Speech Lang. Process. 23(11), 1973–1987 (2015)

  21. Y. Bengio et al., Learning deep architectures for AI. Found. Trends® Mach. Learn. 2(1), 1–127 (2009)

  22. G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)

  23. D. Wang, J. Chen, Supervised speech separation based on deep learning: an overview. IEEE/ACM Trans. Audio Speech Lang. Process. 26(10), 1702–1726 (2018)

  24. Y. Wang, A. Narayanan, D. Wang, On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 1849–1858 (2014)

  25. A. Narayanan, D. Wang, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Ideal ratio mask estimation using deep neural networks for robust speech recognition (IEEE, Vancouver, 2013), p. 7092–7096

  26. S.R. Park, J. Lee, A fully convolutional neural network for speech enhancement. arXiv preprint arXiv:1609.07132. (2016)

  27. H. Jacobsson, Rule extraction from recurrent neural networks: A taxonomy and review. Neural Comput. 17(6), 1223–1263 (2005)

  28. P.-S. Huang, M. Kim, M. Hasegawa-Johnson, P. Smaragdis, Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 23(12), 2136–2147 (2015)

  29. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al., in Proc. Advances in Neural Inform. Process. Syst. Generative adversarial nets (Communications of the ACM, US, 2014), p. 2672–2680

  30. S. Pascual, A. Bonafonte, J. Serra, SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452. (2017)

  31. M. Kolbæk, Z.-H. Tan, J. Jensen, Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems. IEEE/ACM Trans. Audio Speech Lang. Process. 25(1), 153–167 (2016)

  32. Y. Xiang, C. Bao, A parallel-data-free speech enhancement method using multi-objective learning cycle-consistent generative adversarial network. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1826–1838 (2020)

  33. D.D. Lee, H.S. Seung, Learning the parts of objects by non-negative matrix factorization. Nature. 401(6755), 788–791 (1999)

  34. D.D. Lee, H.S. Seung, in Proc. Advances in Neural Inform. Process. Syst. Algorithms for non-negative matrix factorization (Communications of the ACM, US, 2001), p. 556–562

  35. K. Shimada, Y. Bando, M. Mimura, K. Itoyama, K. Yoshii, T. Kawahara, Unsupervised speech enhancement based on multichannel NMF-informed beamforming for noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 27(5), 960–971 (2019)

  36. E.M. Grais, H. Erdogan, in Int. Conf. Digital Signal Process. Single channel speech music separation using nonnegative matrix factorization and spectral masks (IEEE, Corfu, 2011), p. 1–6

  37. K.W. Wilson, B. Raj, P. Smaragdis, in Proc. Interspeech. Regularized non-negative matrix factorization with temporal dependencies for speech denoising (ISCA, Brisbane, 2008)

  38. S. Nie, S. Liang, H. Li, X. Zhang, Z. Yang, W.J. Liu, L.K. Dong, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Exploiting spectro-temporal structures using NMF for DNN-based supervised speech separation (IEEE, Shanghai, 2016), p. 469–473

  39. T.G. Kang, K. Kwon, J.W. Shin, N.S. Kim, NMF-based target source separation using deep neural network. IEEE Signal Process. Lett. 22(2), 229–233 (2014)

  40. S. Nie, S. Liang, W. Liu, X. Zhang, J. Tao, Deep learning based speech separation via NMF-style reconstructions. IEEE/ACM Trans. Audio Speech Lang. Process. 26(11), 2043–2055 (2018)

  41. A.W. Rix, J.G. Beerends, M.P. Hollier, A.P. Hekstra, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, vol. 2 (IEEE, Salt Lake City, 2001), p. 749–752

  42. C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19(7), 2125–2136 (2011)

  43. T.T. Vu, B. Bigot, E.S. Chng, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Combining non-negative matrix factorization and deep neural networks for speech enhancement and automatic speech recognition (IEEE, Shanghai, 2016), p. 499–503

  44. N. Mohammadiha, P. Smaragdis, A. Leijon, Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Trans. Audio Speech Lang. Process. 21(10), 2140–2151 (2013)

  45. G.J. Mysore, P. Smaragdis, B. Raj, in International conference on latent variable analysis and signal separation. Non-negative hidden Markov modeling of audio with application to source separation (Springer, St. Malo, 2010), p. 140–148

  46. Z. Wang, X. Li, X. Wang, Q. Fu, Y. Yan, in Proc. Interspeech. A DNN-HMM approach to non-negative matrix factorization based speech enhancement (ISCA, Pittsburgh, 2016), p. 3763–3767

  47. Y. Xiang, L. Shi, J.L. Højvang, M.H. Rasmussen, M.G. Christensen, in Proc. Interspeech. An NMF-HMM speech enhancement method based on Kullback-Leibler divergence (ISCA, Shanghai, 2020), p. 2667–2671

  48. Y. Xiang, L. Shi, J.L. Højvang, M.H. Rasmussen, M.G. Christensen, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. A novel NMF-HMM speech enhancement algorithm based on Poisson mixture model (IEEE, Toronto, 2021), p. 721–725

  49. C. Févotte, J. Le Roux, J.R. Hershey, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Non-negative dynamical system with application to speech and audio (IEEE, Vancouver, 2013), p. 3158–3162

  50. C. Févotte, N. Bertin, J.-L. Durrieu, Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Comput. 21(3), 793–830 (2009)

  51. C. Févotte, J. Idier, Algorithms for nonnegative matrix factorization with the β-divergence. Neural Comput. 23(9), 2421–2456 (2011)

  52. D. FitzGerald, M. Cranitch, E. Coyle, On the use of the beta divergence for musical source separation (IET digital library, Dublin, 2009)

  53. A.T. Cemgil, Bayesian inference for nonnegative matrix factorisation models. Comput. Intell. Neurosci. 2009, 1–17 (2009)

  54. D. Baby, J.F. Gemmeke, T. Virtanen, et al., in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Exemplar-based speech enhancement for deep neural network based automatic speech recognition (IEEE, South Brisbane, 2015), p. 4485–4489

  55. P. Smaragdis, Convolutive speech bases and their application to supervised speech separation. IEEE Trans. Audio Speech Lang. Process. 15(1), 1–12 (2006)

  56. L.E. Baum, An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities. 3(1), 1–8 (1972)

  57. ITU-T Recommendation P.862, Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs (ITU-T, 2001)

  58. J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n. 93, (1993)

  59. G. Hu, D. Wang, A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans. Audio Speech Lang. Process. 18(8), 2067–2079 (2010)

  60. A. Varga, H.J. Steeneken, Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12(3), 247–251 (1993)

  61. I. Cohen, B. Berdugo, Speech enhancement for non-stationary noise environments. Signal Process. 81(11), 2403–2418 (2001)

  62. I. Cohen, Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Trans. Speech Audio Process. 11(5), 466–475 (2003)

  63. P.D. O’Grady, B.A. Pearlmutter, Discovering speech phones using convolutive non-negative matrix factorisation with a sparseness constraint. Neurocomputing 72(1–3), 88–101 (2008)

  64. S. Braun, I. Tashev, in International Conference on Speech and Computer. Data augmentation and loss normalization for deep noise suppression (Springer, St. Petersburg, 2020), p. 79–86

  65. T. Gerkmann, R.C. Hendriks, Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Trans. Audio Speech Lang. Process. 20(4), 1383–1393 (2011)

Acknowledgements

This work was supported by Innovation Fund Denmark (Grant No. 9065-00046).

Funding

Innovation Fund Denmark (Grant No. 9065-00046).

Author information

Contributions

All authors participated in the methodology discussion, experimental design, and paper writing. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yang Xiang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Xiang, Y., Shi, L., Højvang, J.L. et al. A speech enhancement algorithm based on a non-negative hidden Markov model and Kullback-Leibler divergence. J AUDIO SPEECH MUSIC PROC. 2022, 22 (2022). https://doi.org/10.1186/s13636-022-00256-5

  • DOI: https://doi.org/10.1186/s13636-022-00256-5

Keywords