A Laplacian-based MMSE estimator for speech enhancement
Introduction
Single-channel speech enhancement algorithms based on minimum mean-square error (MMSE) estimation of the short-time spectral magnitude have received considerable attention in the past two decades (Ephraim and Malah, 1984, Ephraim and Malah, 1985, Cohen and Berdugo, 2001). A key assumption made in the MMSE algorithms is that the real and imaginary parts of the clean Discrete Fourier Transform (DFT) coefficients can be modeled by a Gaussian distribution. This Gaussian assumption, however, holds only asymptotically for long-duration analysis frames, for which the span of the correlation of the signal is much shorter than the DFT size. While this assumption might hold for the noise DFT coefficients, it does not hold for the speech DFT coefficients, which are typically estimated using relatively short (20–30 ms) duration windows. For that reason, several researchers (Martin, 2002, Martin and Breithaupt, 2003, Lotter and Vary, 2003, Breithaupt and Martin, 2003, Porter and Boll, 1984, Chen and Loizou, 2005) have proposed the use of non-Gaussian distributions, in particular the Gamma and Laplacian distributions, for modeling the real and imaginary parts of the speech DFT coefficients. Several researchers have computed histograms of the real and imaginary parts of the DFT coefficients from a large corpus of speech and confirmed that the Gamma and Laplacian distributions provide a better fit to the experimental data than the Gaussian distribution (Lotter and Vary, 2003, Martin, 2002). This was also confirmed quantitatively in Breithaupt and Martin (2003) by using the Kullback divergence to measure how well the Gamma probability density function (pdf) fits the experimental data. A smaller Kullback divergence was found for the Gamma pdf than for the Gaussian pdf, suggesting that the Gamma pdf provides the better fit.
The use of Gamma or Laplacian distributions, however, complicates the derivation of the MMSE estimate of the magnitude spectrum. This is partly because the magnitudes and phases of the DFT coefficients are no longer independent when the real and imaginary parts of the DFT coefficients are modeled by a Laplacian (or Gamma) distribution. For that reason, alternative solutions were explored in (Martin, 2002, Martin and Breithaupt, 2003, Lotter and Vary, 2003, Breithaupt and Martin, 2003, Porter and Boll, 1984). For instance, in Lotter and Vary (2003) the authors approximated the pdf of the magnitude of the DFT coefficients with a parametric function, and used that to derive a MAP estimator of the magnitude spectrum. The MAP estimator was pursued over the MMSE estimator since the resulting integrals were too difficult to evaluate in closed form. In Martin and Breithaupt (2003), the estimators of the real and imaginary parts of the DFT coefficients were derived separately assuming Gamma and Laplacian distributions for the speech DFT coefficients. Combined, the two estimators yielded a complex-valued estimator for the signal DFT coefficients. Experimental results showed that those estimators provided consistently better results than the Wiener estimator.
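The loss of independence between magnitude and phase can be illustrated with a small Monte Carlo sketch. It is not taken from the paper: it assumes the real and imaginary parts are independent, zero-mean and identically distributed, and the names `SCALE`, `COUNT` and `axis_statistics` are hypothetical. For each model it reports the fraction of phases lying within π/8 of a coordinate axis (0.5 for uniformly distributed phases) and the ratio of the mean magnitude near the axes to the mean magnitude near the diagonals (1 when magnitude and phase are independent).

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
SCALE = 1.0             # hypothetical Laplacian scale parameter b
COUNT = 100_000         # number of synthetic complex coefficients per model


def draw_laplacian():
    """One Laplacian sample: a random sign times an exponential magnitude."""
    return rng.choice((-1.0, 1.0)) * rng.expovariate(1.0 / SCALE)


def draw_gaussian():
    """One zero-mean, unit-variance Gaussian sample."""
    return rng.gauss(0.0, 1.0)


def axis_statistics(draw):
    """Return (fraction of phases near an axis, near-axis / near-diagonal mean magnitude)."""
    near_axis, near_diagonal = [], []
    for _ in range(COUNT):
        re, im = draw(), draw()  # assumed independent real and imaginary parts
        # Fold the phase into one quadrant: phi lies in [0, pi/2).
        phi = abs(math.atan2(im, re)) % (math.pi / 2)
        magnitude = math.hypot(re, im)
        if phi < math.pi / 8 or phi > 3 * math.pi / 8:
            near_axis.append(magnitude)
        else:
            near_diagonal.append(magnitude)
    fraction = len(near_axis) / COUNT
    mean_axis = sum(near_axis) / len(near_axis)
    mean_diagonal = sum(near_diagonal) / len(near_diagonal)
    return fraction, mean_axis / mean_diagonal


laplace_fraction, laplace_ratio = axis_statistics(draw_laplacian)
gauss_fraction, gauss_ratio = axis_statistics(draw_gaussian)

print(f"Laplacian: near-axis fraction {laplace_fraction:.3f}, magnitude ratio {laplace_ratio:.3f}")
print(f"Gaussian:  near-axis fraction {gauss_fraction:.3f}, magnitude ratio {gauss_ratio:.3f}")
```

Under these assumptions the Gaussian figures should sit close to 0.5 and 1, while the Laplacian phases cluster around the axes (the fraction works out to 2 − √2 ≈ 0.586 analytically) and the mean magnitude changes with phase.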
In Chen and Loizou (2005), we derived an approximate MMSE estimator of the speech magnitude spectrum based on a Laplacian model for the speech DFT coefficients and a Gaussian model for the noise DFT coefficients. This estimator was derived under the assumption that the magnitude and phases of the complex DFT coefficients were independent. Acknowledging that this assumption does not necessarily hold, we derive in this paper the true MMSE estimator of the speech magnitude spectrum based on Laplacian modeling. The derived estimator is implemented using numerical integration techniques, and compared to the approximate MMSE estimator (Chen and Loizou, 2005). To further improve the amplitude estimation, we also incorporate speech presence uncertainty into the Laplacian-based estimator. The performance of the proposed estimator is compared to the conventional MMSE estimator (Ephraim and Malah, 1984) as well as the Laplacian estimator proposed in Martin and Breithaupt (2003).
The paper is organized as follows. In Sections 2 and 3, we derive the Laplacian-based MMSE estimators, and in Section 4 we derive the MMSE estimator under signal presence uncertainty. In Section 5, we evaluate the performance of the proposed estimators, and in Section 6 we present the conclusions.
Section snippets
Laplacian-based short-time spectral amplitude estimator
Let y(n) = x(n) + d(n) be the sampled noisy speech signal consisting of the clean signal x(n) and the noise signal d(n). Taking the short-time Fourier transform of y(n), we get:
$$Y(\omega_k) = X(\omega_k) + D(\omega_k)$$
for ωk = 2πk/N where k = 0, 1, 2, … , N − 1, and N is the frame length. The above equation can also be expressed in polar form as
$$Y_k e^{j\theta_y(k)} = X_k e^{j\theta_x(k)} + D_k e^{j\theta_d(k)}$$
where {Yk, Xk, Dk} denote the corresponding magnitude spectra and {θy(k), θx(k), θd(k)} denote the corresponding phase spectra of the noisy, clean and noise signals
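The additive model and its polar form can be checked numerically on a single frame. The following is a minimal sketch rather than the paper's implementation: the sinusoid standing in for clean speech, the white Gaussian noise, the frame length of 64 and the helper `dft` are all illustrative assumptions, and no analysis window is applied.

```python
import cmath
import math
import random


def dft(frame):
    """Naive O(N^2) DFT evaluated at omega_k = 2*pi*k/N for k = 0..N-1."""
    n_len = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_len) for n in range(n_len))
        for k in range(n_len)
    ]


random.seed(0)
N = 64  # hypothetical frame length
x = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]  # stand-in for clean speech
d = [random.gauss(0.0, 0.3) for _ in range(N)]             # stand-in for the noise
y = [xn + dn for xn, dn in zip(x, d)]                      # noisy signal y(n) = x(n) + d(n)

X, D, Y = dft(x), dft(d), dft(y)

# Linearity of the DFT: Y(omega_k) = X(omega_k) + D(omega_k) in every bin.
max_additivity_error = max(abs(Y[k] - (X[k] + D[k])) for k in range(N))

# Polar form: magnitude Y_k and phase theta_y(k) rebuild the complex coefficient.
Y_mag = [abs(v) for v in Y]
theta_y = [cmath.phase(v) for v in Y]
max_polar_error = max(abs(Y[k] - cmath.rect(Y_mag[k], theta_y[k])) for k in range(N))

print(f"max |Y - (X + D)|           = {max_additivity_error:.2e}")
print(f"max |Y - Y_k exp(j theta_y)| = {max_polar_error:.2e}")
```

Both errors should be at the level of floating-point round-off. Note that the magnitudes themselves are not additive: in general Yk differs from Xk + Dk because the clean and noise phases differ.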
Derivation of approximate Laplacian MMSE estimator
It is known that complex zero mean Gaussian random variables have magnitudes and phases which are statistically independent (Papoulis and Pillai, 2001). Furthermore, the phases have a uniform distribution. This is not the case, however, with the complex Laplacian distributions that are used in this paper for modeling the speech DFT coefficients. Further analysis of the joint pdf of the magnitudes and phases, p(Xk, θk), however, revealed that the pdfs of the magnitudes and phases are nearly
Derivation of amplitude estimator under speech presence uncertainty
In this section, we derive the MMSE magnitude estimator under the assumed Laplacian model and uncertainty of speech presence. This is motivated by the fact that speech might not be present at all times and at all frequencies. We could therefore consider a two-state model for speech events that assumes that either speech is present at a particular frequency bin (hypothesis H1) or that it is not (hypothesis H0). Intuitively, this amounts to multiplying the estimator by a term that provides an
Implementation
Evaluation of p(Xk) in (9) involves an infinite number of terms. Computer simulations indicated, however, that retaining only the first 40 terms in (9) gave a good approximation of p(Xk). This is demonstrated in Fig. 5, which shows p(Xk) estimated using numerical integration techniques and also approximated by truncating the summation in (9) to its first 40 terms.
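The numerical-integration side of this comparison can be sketched as follows. Equation (9) is not reproduced in this excerpt, so the truncated series itself is not shown. The sketch assumes independent real and imaginary parts, each Laplacian with pdf exp(−|x|/b)/(2b), which may be parameterized differently from the paper; the function name `laplacian_magnitude_pdf` and the grid settings are hypothetical. Changing to polar coordinates gives p(Xk) as the integral over the phase of Xk/(4b²)·exp(−Xk(|cos θ| + |sin θ|)/b), which is evaluated here with the midpoint rule.

```python
import math


def laplacian_magnitude_pdf(magnitude, scale, n_theta=200):
    """Magnitude pdf p(X) obtained by midpoint-rule integration over the phase.

    Assumes independent Laplacian real and imaginary parts with scale `scale`.
    The integrand is identical in all four quadrants, so one quadrant is
    integrated and the result is multiplied by 4.
    """
    if magnitude <= 0.0:
        return 0.0
    step = (math.pi / 2) / n_theta
    total = 0.0
    for i in range(n_theta):
        theta = (i + 0.5) * step
        total += math.exp(-magnitude * (math.cos(theta) + math.sin(theta)) / scale)
    # magnitude / (4 * scale**2): polar Jacobian times the two Laplacian normalizers.
    return 4.0 * step * total * magnitude / (4.0 * scale ** 2)


scale = 1.0       # hypothetical scale parameter b
grid_step = 0.05  # spacing of the magnitude grid
grid = [i * grid_step for i in range(601)]  # magnitudes from 0 to 30
pdf = [laplacian_magnitude_pdf(value, scale) for value in grid]

# Trapezoidal check that the numerically evaluated pdf integrates to about 1.
area = sum(0.5 * (pdf[i] + pdf[i + 1]) * grid_step for i in range(len(grid) - 1))
print(f"integral of p(X) over [0, 30] = {area:.4f}")
```

A curve computed this way could serve as the reference against which a truncated series is compared, with the number of retained terms increased until the two agree to the desired tolerance.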
As shown in (10a), the derived ApLapMMSE estimator is highly nonlinear and computationally complex. The implementation of (10a)
Summary and conclusions
An MMSE estimator was derived for the speech magnitude spectrum based on a Laplacian model for the speech DFT coefficients and a Gaussian model for the noise DFT coefficients. An estimator was also derived under speech presence uncertainty and a Laplacian model assumption. Results, in terms of objective measures, indicated that the proposed Laplacian MMSE estimators yielded better performance than the traditional MMSE estimator, which is based on a Gaussian model (Ephraim and Malah, 1984).
Acknowledgements
This research was supported in part by Grant No. R01 DC007527 from NIDCD/NIH. The authors would like to thank Prof. Ali Hooshyar for all his help and suggestions regarding numerical integration. Thanks also go to the reviewers for providing valuable suggestions that helped improve the manuscript.
References (17)
- et al. Speech enhancement for non-stationary noise environments. Signal Process. (2001)
- Breithaupt, C., Martin, R., 2003. MMSE estimation of magnitude-squared DFT coefficients with SuperGaussian priors. In:...
- Chen, B., 2005. Speech Enhancement Using a MMSE Short Time Spectral Amplitude Estimator with Laplacian Speech Modeling....
- Chen, B., Loizou, P., 2005. Speech enhancement using a MMSE short time spectral amplitude estimator with Laplacian...
- et al. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process. (1984)
- et al. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process. (1985)
- et al. Table of Integrals, Series and Products. (2000)
- Hansen, J., Pellom, B., 1998. An effective quality evaluation protocol for speech enhancement algorithms. In: Proc....