Elsevier

Speech Communication

Volume 49, Issue 2, February 2007, Pages 134-143

A Laplacian-based MMSE estimator for speech enhancement

https://doi.org/10.1016/j.specom.2006.12.005

Abstract

This paper focuses on optimal estimators of the magnitude spectrum for speech enhancement. We present an analytical solution for estimating in the MMSE sense the magnitude spectrum when the clean speech DFT coefficients are modeled by a Laplacian distribution and the noise DFT coefficients are modeled by a Gaussian distribution. Furthermore, we derive the MMSE estimator under speech presence uncertainty and a Laplacian statistical model. Results indicated that the Laplacian-based MMSE estimator yielded less residual noise in the enhanced speech than the traditional Gaussian-based MMSE estimator. Overall, the present study demonstrates that the assumed distribution of the DFT coefficients can have a significant effect on the quality of the enhanced speech.

Introduction

Single-channel speech enhancement algorithms based on minimum mean-square error (MMSE) estimation of the short-time spectral magnitude have received considerable attention over the past two decades (Ephraim and Malah, 1984; Ephraim and Malah, 1985; Cohen and Berdugo, 2001). A key assumption made in the MMSE algorithms is that the real and imaginary parts of the clean Discrete Fourier Transform (DFT) coefficients can be modeled by a Gaussian distribution. This Gaussian assumption, however, holds only asymptotically, for long-duration analysis frames in which the span of the correlation of the signal is much shorter than the DFT size. While this assumption might hold for the noise DFT coefficients, it does not hold for the speech DFT coefficients, which are typically estimated using relatively short (20–30 ms) windows. For that reason, several researchers (Martin, 2002; Martin and Breithaupt, 2003; Lotter and Vary, 2003; Breithaupt and Martin, 2003; Porter and Boll, 1984; Chen and Loizou, 2005) have proposed the use of non-Gaussian distributions for modeling the real and imaginary parts of the speech DFT coefficients. In particular, the Gamma or the Laplacian probability distributions can be used to model the distributions of the real and imaginary parts of the DFT coefficients. Several researchers have computed histograms of the real and imaginary parts of the DFT coefficients from a large corpus of speech and confirmed that the Gamma and Laplacian distributions provide a better fit to the experimental data than the Gaussian distribution (Lotter and Vary, 2003; Martin, 2002). This was also confirmed quantitatively in Breithaupt and Martin (2003) by using the Kullback divergence to measure how well the Gamma probability density function (pdf) fits the experimental data. A smaller Kullback divergence was found for the Gamma pdf than for the Gaussian pdf, suggesting that the Gamma pdf provides the better fit to the experimental data.
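The histogram-and-divergence comparison described above can be sketched in a few lines. This is only an illustration of the measurement procedure, not a reproduction of the cited results: the samples are synthetic heavy-tailed data (a Gaussian with exponentially distributed variance) standing in for real speech DFT coefficients, the candidate models are zero-mean Gaussian and Laplacian pdfs rather than the Gamma pdf, and the bin layout and the helper name `kl_divergence` are assumptions.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete Kullback divergence sum(p * log(p / q)) over bins where p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
n = 100_000
# ASSUMPTION: synthetic heavy-tailed samples, not measured speech DFT coefficients.
samples = np.sqrt(rng.exponential(size=n)) * rng.standard_normal(n)

# Empirical histogram, normalized to bin probabilities.
edges = np.linspace(-6.0, 6.0, 121)
centers = 0.5 * (edges[:-1] + edges[1:])
counts, _ = np.histogram(samples, bins=edges)
p = counts / counts.sum()

# Fit each zero-mean model from the samples, evaluate it at the bin centers,
# and normalize over the same bins so both are comparable to p.
sigma = samples.std()            # Gaussian standard deviation
b = np.mean(np.abs(samples))     # Laplacian scale (mean absolute value)
gauss = np.exp(-centers**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
lap = np.exp(-np.abs(centers) / b) / (2 * b)
q_gauss = gauss / gauss.sum()
q_lap = lap / lap.sum()

kl_gauss = kl_divergence(p, q_gauss)
kl_lap = kl_divergence(p, q_lap)
print(f"KL(data || Gaussian)  = {kl_gauss:.4f}")
print(f"KL(data || Laplacian) = {kl_lap:.4f}")
```

A smaller divergence indicates a closer fit. With real speech, `samples` would be replaced by the real (or imaginary) parts of DFT coefficients collected over a corpus.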

The use of Gamma or Laplacian distributions, however, complicates the derivation of the MMSE estimate of the magnitude spectrum. This is partly because the magnitudes and phases of the DFT coefficients are no longer independent when the real and imaginary parts of the DFT coefficients are modeled by a Laplacian (or Gamma) distribution. For that reason, alternative solutions were explored (Martin, 2002; Martin and Breithaupt, 2003; Lotter and Vary, 2003; Breithaupt and Martin, 2003; Porter and Boll, 1984). For instance, Lotter and Vary (2003) approximated the pdf of the magnitude of the DFT coefficients with a parametric function, and used that to derive a MAP estimator of the magnitude spectrum. The MAP estimator was pursued over the MMSE estimator since the resulting integrals were too difficult to evaluate in closed form. In Martin and Breithaupt (2003), the estimators of the real and imaginary parts of the DFT coefficients were derived separately assuming Gamma and Laplacian distributions for the speech DFT coefficients. The two estimators combined yielded a complex-valued estimator for the signal DFT coefficients. Experimental results showed that those estimators provided consistently better results than the Wiener estimator.
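The loss of independence between magnitude and phase can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the derivation: it draws the real and imaginary parts as independent unit-scale Gaussian or Laplacian variables, and compares the mean magnitude of samples whose phase lies near the real/imaginary axes with that of samples near the diagonals. The thresholds 1.1 and 1.35 and the helper name `axis_to_diagonal_ratio` are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def axis_to_diagonal_ratio(re, im):
    """Mean magnitude near the axes divided by mean magnitude near the diagonals."""
    mag = np.hypot(re, im)
    theta = np.arctan2(im, re)
    # t equals 1 on the real/imaginary axes and sqrt(2) on the diagonals.
    t = np.abs(np.cos(theta)) + np.abs(np.sin(theta))
    return mag[t < 1.1].mean() / mag[t > 1.35].mean()

# Gaussian real/imaginary parts: the magnitude does not depend on the phase,
# so the ratio should be close to 1.
ratio_gauss = axis_to_diagonal_ratio(rng.standard_normal(n), rng.standard_normal(n))

# Laplacian real/imaginary parts: the joint pdf depends on |re| + |im|,
# so the mean magnitude varies with the phase and the ratio exceeds 1.
ratio_lap = axis_to_diagonal_ratio(rng.laplace(size=n), rng.laplace(size=n))

print(f"Gaussian  ratio: {ratio_gauss:.3f}")
print(f"Laplacian ratio: {ratio_lap:.3f}")
```

A ratio near 1 is consistent with independence of magnitude and phase; a ratio clearly above 1 shows that the magnitude statistics change with the phase.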

In Chen and Loizou (2005), we derived an approximate MMSE estimator of the speech magnitude spectrum based on a Laplacian model for the speech DFT coefficients and a Gaussian model for the noise DFT coefficients. This estimator was derived under the assumption that the magnitude and phases of the complex DFT coefficients were independent. Acknowledging that this assumption does not necessarily hold, we derive in this paper the true MMSE estimator of the speech magnitude spectrum based on Laplacian modeling. The derived estimator is implemented using numerical integration techniques, and compared to the approximate MMSE estimator (Chen and Loizou, 2005). To further improve the amplitude estimation, we also incorporate speech presence uncertainty into the Laplacian-based estimator. The performance of the proposed estimator is compared to the conventional MMSE estimator (Ephraim and Malah, 1984) as well as the Laplacian estimator proposed in Martin and Breithaupt (2003).

The paper is organized as follows. In Sections 2 and 3, we derive the Laplacian-based MMSE estimators, and in Section 4 we derive the MMSE estimator under signal presence uncertainty. In Section 5, we evaluate the performance of the proposed estimators, and in Section 6 we present the conclusions.

Laplacian-based short-time spectral amplitude estimator

Let y(n) = x(n) + d(n) be the sampled noisy speech signal consisting of the clean signal x(n) and the noise signal d(n). Taking the short-time Fourier transform of y(n), we get:

$$Y(\omega_k) = X(\omega_k) + D(\omega_k)$$

for $\omega_k = 2\pi k/N$, where $k = 0, 1, 2, \ldots, N-1$ and $N$ is the frame length. The above equation can also be expressed in polar form as

$$Y_k e^{j\theta_y(k)} = X_k e^{j\theta_x(k)} + D_k e^{j\theta_d(k)}$$

where $\{Y_k, X_k, D_k\}$ denote the corresponding magnitude spectra and $\{\theta_y(k), \theta_x(k), \theta_d(k)\}$ denote the corresponding phase spectra of the noisy, clean and noise signals
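The additive model and its polar form can be verified numerically. The following is a minimal sketch, assuming a Hann window, 50% frame overlap, an 8 kHz sampling rate, and a sinusoid as a stand-in for the clean signal; none of these choices comes from the paper, and `frame_dft` is a hypothetical helper.

```python
import numpy as np

def frame_dft(signal, frame_len, hop):
    """Split the signal into overlapping windowed frames and take the DFT of each."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.fft(frames, axis=1)  # bin k corresponds to omega_k = 2*pi*k/N

rng = np.random.default_rng(0)
fs = 8000                               # ASSUMPTION: sampling rate in Hz
n = np.arange(fs)                       # one second of samples
x = np.sin(2 * np.pi * 440 * n / fs)    # stand-in "clean" signal, not real speech
d = 0.3 * rng.standard_normal(fs)       # white Gaussian noise
y = x + d                               # noisy signal y(n) = x(n) + d(n)

N = 256                                 # frame length
X = frame_dft(x, N, N // 2)
D = frame_dft(d, N, N // 2)
Y = frame_dft(y, N, N // 2)

# The DFT is linear, so the complex spectra add.
additive = np.allclose(Y, X + D)

# Polar form: magnitude Y_k and phase theta_y(k) reconstruct the complex spectrum.
Yk, theta_y = np.abs(Y), np.angle(Y)
polar_ok = np.allclose(Yk * np.exp(1j * theta_y), Y)

print(additive, polar_ok)
```

Note that only the complex spectra add; the magnitudes generally do not, which is why the clean magnitude has to be estimated rather than obtained by subtraction.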

Derivation of approximate Laplacian MMSE estimator

It is known that complex zero mean Gaussian random variables have magnitudes and phases which are statistically independent (Papoulis and Pillai, 2001). Furthermore, the phases have a uniform distribution. This is not the case, however, with the complex Laplacian distributions that are used in this paper for modeling the speech DFT coefficients. Further analysis of the joint pdf of the magnitudes and phases, p(Xk, θk), however, revealed that the pdfs of the magnitudes and phases are nearly

Derivation of amplitude estimator under speech presence uncertainty

In this section, we derive the MMSE magnitude estimator under the assumed Laplacian model and uncertainty of speech presence. This is motivated by the fact that speech might not be present at all times and at all frequencies. We could therefore consider a two-state model for speech events that assumes that either speech is present at a particular frequency bin (hypothesis H1) or that it is not (hypothesis H0). Intuitively, this amounts to multiplying the estimator by a term that provides an

Implementation

Evaluation of p(Xk) in (9) involves an infinite number of terms. Computer simulations indicated, however, that retaining only the first 40 terms in (9) gave a good approximation of p(Xk). This is demonstrated in Fig. 5, which shows p(Xk) estimated using numerical integration techniques and also approximated by truncating the summation in (9) to the first 40 terms.

As shown in (10a), the derived ApLapMMSE estimator is highly nonlinear and computationally complex. The implementation of (10a)

Summary and conclusions

An MMSE estimator was derived for the speech magnitude spectrum based on a Laplacian model for the speech DFT coefficients and a Gaussian model for the noise DFT coefficients. An estimator was also derived under speech presence uncertainty and a Laplacian model assumption. Results, in terms of objective measures, indicated that the proposed Laplacian MMSE estimators yielded better performance than the traditional MMSE estimator, which is based on a Gaussian model (Ephraim and Malah, 1984).

Acknowledgements

This research was supported in part by Grant No. R01 DC007527 from NIDCD/NIH. The authors would like to thank Prof. Ali Hooshyar for all his help and suggestions regarding numerical integration. Thanks also go to the reviewers for providing valuable suggestions that helped improve the manuscript.

References (17)

  • Cohen, I., et al., 2001. Speech enhancement for non-stationary noise environments. Signal Process.
  • Breithaupt, C., Martin, R., 2003. MMSE estimation of magnitude-squared DFT coefficients with SuperGaussian priors. In:...
  • Chen, B., 2005. Speech Enhancement Using a MMSE Short Time Spectral Amplitude Estimator with Laplacian Speech Modeling....
  • Chen, B., Loizou, P., 2005. Speech enhancement using a MMSE short time spectral amplitude estimator with Laplacian...
  • Ephraim, Y., et al., 1984. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process.
  • Ephraim, Y., et al., 1985. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust., Speech, Signal Process.
  • Gradshteyn, I.S., et al., 2000. Table of Integrals, Series and Products.
  • Hansen, J., Pellom, B., 1998. An effective quality evaluation protocol for speech enhancement algorithms. In: Proc....