Applied Acoustics

Volume 175, April 2021, 107813

Improved microphone array design with statistical speaker verification

https://doi.org/10.1016/j.apacoust.2020.107813

Abstract

Conventional microphone array implementations aim to lock onto a source at a given location and, if required, to track it. Identifying the intended source is challenging when its location is unknown and interference is present in the same environment. In this study we combine speaker verification and microphone array processing techniques to localize the intended speaker and maximize the gain on that speaker, under the assumption of an open acoustic field. We exploit the steering capability of the microphone array for more accurate speaker verification. Our first contribution is a new N-Gram based, computationally efficient feature for detecting an intended speaker. Once the source and interference are localized, the microphone array can be tuned further to reduce noise and increase the gain. Our second contribution is this integrated algorithm for speaker verification and localization. In the course of this study we developed SharpEar, an open-source environment that simulates the propagation of sound emanating from multiple sources. Our third and last contribution is this simulation environment, which is open source and available to researchers in the field.

Introduction

Microphone arrays are composed of multiple sensors. By combining the individual sensors, signals can be separated based on their spatial locations. In other words, aperture signals are combined in a phased array in such a way that signals at a particular spatial position experience constructive interference, allowing spatial filtering of the signal. This procedure is known as beamforming or spatial filtering [1].
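The delay-and-sum beamformer is the simplest instance of this spatial filtering. The sketch below is an illustrative implementation only (not code from this study): each channel is delayed so that a wavefront from a chosen steering angle adds constructively across the array; the array geometry and sampling rate are hypothetical.

```python
import numpy as np

def delay_and_sum(signals, fs, mic_positions, direction, c=343.0):
    """Steer a linear array toward `direction` (radians from broadside)
    by delaying each channel in the frequency domain and summing.

    signals       : (n_mics, n_samples) array of channel waveforms
    mic_positions : (n_mics,) element positions along the array axis [m]
    """
    n_mics, n_samples = signals.shape
    # Far-field time delay of each element relative to the array origin.
    delays = mic_positions * np.sin(direction) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        spectrum = np.fft.rfft(signals[m])
        # Multiplying by e^{j 2 pi f tau} advances channel m by delays[m],
        # aligning a wavefront arriving from `direction` across channels.
        out += np.fft.irfft(spectrum * np.exp(2j * np.pi * freqs * delays[m]),
                            n_samples)
    return out / n_mics
```

Steering toward the true source direction sums the channels coherently, while signals from other directions are attenuated by phase mismatch.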

Microphone array processing is a mature field with many uses. Arrays are widely used in corporate conferencing systems. An array on a conference room table or mounted on the ceiling, along with state-of-the-art steering algorithms, digital signal processing, and echo cancellation, can “zoom in” acoustically on individual conference participants and deliver sound quality that is superior to traditional conference room sound-gathering methods.

Microphone array processing also has medical applications such as hearing aids, where it provides a significant improvement in speech perception over existing hearing aid designs, particularly in the presence of background noise, reverberation, and feedback [2].

Earlier implementations of microphone arrays assumed that the locations of the source and interference are known. However, in most real-world problems this information is unknown and dynamically changing. This issue was addressed by more recent studies [19], [20]. In this study we automatically identify and localize the speaker and interference(s) in an open acoustic field without any prior assumptions on the locations of the sound sources. We exploit the spatial filtering capability of microphone arrays to improve speaker verification and localization performance. Once the speaker is localized, we use this information to distinguish between the speaker and interference, which in turn is exploited to further tune the microphone array gain.

We start our report with theoretical aspects of microphone array processing (Section 2) and speaker verification (Section 3). We then introduce the proposed approach (Section 4) and present our experimental results (Section 5). We conclude the report with findings, contributions and future work (Section 6).

Section snippets

Conventional microphone array processing

In sensor arrays, the impulse response of each aperture is given by:

h_n(t, r) = δ(t − τ_n, r)    (1)

where δ(t) is the Dirac delta function, τ_n is the time delay and r is the spatial position of the aperture. Each element in a microphone array may introduce an amplitude and time shift, denoted by w(t, r):

h_n(t, r) = w(t, r) δ(t − τ_n, r)    (2)

The frequency response of an element can be derived by applying the Fourier transform [3]:

H_n(f, α) = W(f, r) e^{j2π α_n r}    (3)

The α_n term in Eq. 3 is determined by the wavenumber in the sound wave
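As a concrete illustration of Eq. 3, the sketch below evaluates the far-field response of a uniform linear array by summing the per-element phase terms W(f, r) e^{j2π α_n r}. It assumes the standard far-field relation α = (f/c) sin θ between the wavenumber and the arrival angle θ (an assumption, since the excerpt is truncated before the definition of α_n); the element positions and uniform weights are hypothetical.

```python
import numpy as np

def array_response(f, theta, mic_positions, weights=None, c=343.0):
    """Far-field response H(f, theta) of a linear array: the weighted sum
    of per-element phase terms e^{j 2 pi alpha r_n}, alpha = (f/c) sin(theta).

    f             : frequency [Hz]
    theta         : arrival angle from broadside [rad]
    mic_positions : (n_mics,) element positions along the array axis [m]
    """
    if weights is None:
        # Uniform weighting W(f, r) = 1 / n_mics for every element.
        weights = np.ones(len(mic_positions)) / len(mic_positions)
    alpha = f / c * np.sin(theta)
    return np.sum(weights * np.exp(2j * np.pi * alpha * mic_positions))
```

At broadside (θ = 0) all phase terms are unity and the magnitude response is maximal; sweeping θ over [−π/2, π/2] traces out the familiar beam pattern of the array.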

Speaker verification

The task of speaker verification is to decide whether the speaker is the claimed (in this case, intended) speaker using a segment of speech. Often the speech sample contains only one speaker, and the task is then better termed single-speaker verification [7]. The individual’s anatomy, together with his or her manner of speaking, accent, rhythm and pronunciation pattern, influences the speech characteristics. Kinnunen and Li [8] provide a discussion of features that are used for the purpose of speaker verification.

Proposed approach

In this study we propose an N-Gram based speaker verification model and an approach for steering the microphone array that combines speaker verification and localization. We start this discussion with speaker verification, which is an integral part of localization.
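The excerpt does not detail the paper's N-Gram feature, so the sketch below is a generic illustration only, not the authors' method: it builds a normalized N-gram profile over a token sequence (for example, quantized spectral frames of a speech segment, which is an assumption here) and scores two segments with the cosine similarity of their profiles.

```python
from collections import Counter

def ngram_profile(tokens, n=2):
    """Count overlapping n-grams in a token sequence and normalize the
    counts into a relative-frequency profile."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def profile_similarity(p, q):
    """Cosine similarity between two n-gram profiles; a higher score
    suggests the two segments come from the same speaker."""
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in set(p) | set(q))
    norm_p = sum(v * v for v in p.values()) ** 0.5
    norm_q = sum(v * v for v in q.values()) ** 0.5
    return dot / (norm_p * norm_q)
```

A verification decision would then threshold this score against a profile enrolled for the intended speaker; the threshold and tokenization scheme are outside the scope of this sketch.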

Simulation environment

We developed a program that simulates an environment to test our speaker verification and localization algorithms. Our sources are simple .wav files that can be placed anywhere in the simulation room. The number of microphones and the distance between microphones are also adjustable. Our simulation program is open source and is available on GitHub [16].

We first experimented with the effects of frequency, number of sources, number of sensors and the distance between sensors to tune the microphone array.

Conclusion

Early implementations of microphone array processing algorithms focused on cross-correlation between array elements to find the direction of a source relative to the array. A classic location estimation problem is the cocktail party problem, where a number of people are talking simultaneously in a room. When only one specific person is to be picked out at a cocktail party, cross-correlation methods are not selective.
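A minimal version of the cross-correlation approach described above can be sketched as follows (an illustrative example, not the authors' implementation): the lag of the cross-correlation peak between two channels gives the time-difference-of-arrival, from which a direction follows given the array geometry. Note that the peak simply reflects the strongest coherent source, which is why the method alone cannot select one particular talker.

```python
import numpy as np

def tdoa_cross_correlation(x, y, fs):
    """Estimate the time-difference-of-arrival between channels x and y
    from the peak of their cross-correlation.

    Returns the lag in seconds; a positive value means x is delayed
    relative to y (i.e. the wavefront reached y first)."""
    corr = np.correlate(x, y, mode="full")
    # Index len(y)-1 of the 'full' output corresponds to zero lag.
    lag = np.argmax(corr) - (len(y) - 1)
    return lag / fs
```

With the delay in hand, the arrival angle for a two-element array of spacing d follows from θ = arcsin(c·τ / d) under the far-field assumption.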

We take a different approach by combining the microphone array steering and speech

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (21)

  • Douglas A. Reynolds et al.

Speaker verification using adapted Gaussian mixture models

    Digital Signal Process

    (2000)
  • Tomi Kinnunen et al.

    An overview of text-independent speaker recognition: From features to supervectors

    Speech Commun

    (2010)
  • Jacob Benesty, Yiteng Huang, et al.

    Springer handbook of speech processing

    (2008)
  • B. Widrow

    A microphone array for hearing aids

    J Acoust Soc Am

    (2001)
  • John McDonough et al.

    Distant speech recognition

    (2009)
  • Iain McCowan

    Microphone arrays: A tutorial

    (2001)
  • Walter Kellermann. A self-steering digital microphone array. In: Acoustics, Speech, and Signal Processing, 1991....
  • Sverre Holm et al.

    Analysis of worst-case phase quantization sidelobes in focused beamforming

    IEEE Trans Ultrason, Ferroelectr, Freq Control

    (1992)
  • Jacqueline Vaissière

    Language-independent prosodic features

There are more references available in the full text version of this article.
