Improved microphone array design with statistical speaker verification
Introduction
Microphone arrays are composed of multiple sensors. By combining individual sensors, signals can be separated based on their spatial locations. In other words, aperture signals are combined in a phased array in such a way that signals at a particular spatial position experience constructive interference, allowing spatial filtering of a signal. This procedure is known as beamforming or spatial filtering [1].
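The idea can be illustrated with a minimal delay-and-sum beamformer. The sketch below is our simplified illustration, not the steering algorithm used in this study: it compensates each channel's steering delay with a linear phase shift in the frequency domain and averages the channels, so a signal from the look direction adds coherently.

```python
import numpy as np

def delay_and_sum(signals, delays, fs):
    """Align each channel by its steering delay (seconds) and average.

    signals: (n_mics, n_samples) array of microphone signals
    delays:  per-microphone arrival delays for the look direction,
             so the target signal adds coherently after compensation
    fs:      sampling rate in Hz
    """
    n_mics, n_samples = signals.shape
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(n_samples)
    for m in range(n_mics):
        spectrum = np.fft.rfft(signals[m])
        # A time delay corresponds to a linear phase shift in frequency;
        # multiplying by exp(+j*2*pi*f*delay) advances the channel.
        spectrum *= np.exp(2j * np.pi * freqs * delays[m])
        out += np.fft.irfft(spectrum, n_samples)
    return out / n_mics
```

Signals arriving from other directions add with mismatched phases and are attenuated, which is the spatial filtering effect described above.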
Microphone array processing is a mature field with many uses. Microphone arrays are widely used in corporate conferencing systems. An array on a conference room table or mounted on the ceiling, together with state-of-the-art steering algorithms, digital signal processing, and echo cancellation, can “zoom in” acoustically on individual conference participants and deliver sound quality superior to traditional conference room sound-gathering methods.
Microphone array processing also has medical applications such as hearing aids. It provides significant improvement in speech perception over existing hearing aid designs, particularly in the presence of background noise, reverberation, and feedback [2].
Earlier implementations of microphone arrays assumed that the locations of the source and interference are known. However, in most real-world problems this information is unknown and dynamically changing. This issue was addressed by more recent studies [19], [20]. In this study we automatically identify and localize the speaker and interference(s) in an open acoustic field without any prior assumptions on the locations of the sound sources. We exploit the spatial filtering capability of microphone arrays to improve speaker verification and localization performance. Once the speaker is localized, we use this information to distinguish between the speaker and the interference, which in turn is exploited to further tune the microphone array gain.
We start our report with theoretical aspects of microphone array processing (Section 2) and speaker verification (Section 3). We then introduce the proposed approach (Section 4) and present our experimental results (Section 5). We conclude the report with findings, contributions and future work (Section 6).
Conventional microphone array processing
In sensor arrays, the impulse response of each aperture is given by

h(t, r) = a(r) δ(t − τ(r))        (1)

where δ(t) is the Dirac delta function, τ(r) is the time delay, and r is the spatial position of the aperture. Each element n in a microphone array may introduce an amplitude and time shift, which is denoted by

h_n(t) = a_n δ(t − τ_n)        (2)

The frequency response of an element can be derived by applying the Fourier transform [3]:

H_n(f) = a_n e^(−j2πf τ_n)        (3)

The exponential term in Eq. 3 is determined by the wavenumber k = 2πf/c = 2π/λ of the sound wave.
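As a concrete illustration of how the wavenumber enters the element response, the sketch below evaluates the far-field response of a uniform linear array. The geometry (8 elements, 5 cm spacing) and the speed of sound are illustrative assumptions, not values from this study.

```python
import numpy as np

def array_response(f, theta, n_mics=8, spacing=0.05, c=343.0, weights=None):
    """Far-field response of a uniform linear array at frequency f (Hz)
    and arrival angle theta (radians from broadside).

    Each element contributes a phase term exp(-j * k * n * d * sin(theta)),
    where k = 2*pi*f / c is the wavenumber of the sound wave.
    """
    k = 2 * np.pi * f / c
    n = np.arange(n_mics)
    if weights is None:
        # Uniform delay-and-sum weights, normalized so the broadside
        # response has unit magnitude.
        weights = np.ones(n_mics) / n_mics
    return np.sum(weights * np.exp(-1j * k * n * spacing * np.sin(theta)))
```

At broadside (theta = 0) all phase terms vanish and the elements add coherently; off-axis arrivals are attenuated by the mismatched phases.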
Speaker verification
The task of speaker verification is to decide if the speaker is the claimed (in this case intended) speaker using a segment of speech. Often the speech sample contains only one speaker, and the task is better termed single-speaker verification [7]. The individual’s anatomy, together with his/her manner of speaking, accent, rhythm and pronunciation pattern, influences the speech characteristics. Kinnunen and Li [8] provide a discussion of features that are used for the purpose of speaker verification.
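As background, the classical formulation (e.g. the adapted Gaussian mixture model approach cited in the references) frames the accept/reject decision as a likelihood-ratio test between a claimed-speaker model and a background model. The sketch below uses single diagonal Gaussians in place of full mixture models purely for brevity; the models and threshold are illustrative assumptions, not this paper's method.

```python
import numpy as np

def log_likelihood(frames, mean, var):
    """Average log-likelihood of feature frames under a diagonal Gaussian."""
    ll = -0.5 * (np.log(2 * np.pi * var) + (frames - mean) ** 2 / var)
    return ll.sum(axis=1).mean()

def verify(frames, speaker_model, background_model, threshold=0.0):
    """Accept the identity claim when the average log-likelihood ratio
    between the claimed speaker's model and a background model exceeds
    a decision threshold."""
    score = (log_likelihood(frames, *speaker_model)
             - log_likelihood(frames, *background_model))
    return score > threshold, score
```

In practice the threshold trades off false acceptances against false rejections and is tuned on held-out data.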
Proposed approach
In this study we propose an N-gram based speaker verification model and an approach for steering a microphone array that combines speaker verification and localization. We will start this discussion with speaker verification, which will be an integral part of localization.
Simulation environment
We developed a program that simulates an environment to test our speaker verification and localization algorithms. Our sources are simple .wav files that can be placed anywhere in the simulation room. The number of microphones and the distance between microphones are also adjustable. Our simulation program is open source and is available on GitHub [16].
We first experimented with the effects of frequency, number of sources, number of sensors, and inter-sensor distance to tune the microphone array.
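The simulator's core operation can be sketched as free-field propagation: each microphone receives the source delayed by distance over the speed of sound and attenuated by the inverse distance. The snippet below is a simplified reimplementation of that idea (function names and parameters are ours, not the actual code in [16], and reverberation is ignored).

```python
import numpy as np

def simulate_mics(source, src_pos, mic_positions, fs, c=343.0):
    """Simulate free-field propagation: each microphone receives the
    source delayed by distance/c and attenuated by 1/distance.

    The fractional delay is applied as a linear phase shift in the
    frequency domain; FFT wrap-around is ignored for this sketch.
    """
    n = len(source)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum = np.fft.rfft(source)
    out = []
    for mic in mic_positions:
        dist = np.linalg.norm(np.asarray(mic) - np.asarray(src_pos))
        delayed = np.fft.irfft(
            spectrum * np.exp(-2j * np.pi * freqs * dist / c), n)
        out.append(delayed / dist)
    return np.array(out)
```

A source placed anywhere in the simulated room thus yields one delayed, attenuated copy per microphone, which the steering algorithms then process.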
Conclusion
Early implementations of microphone array processing algorithms relied on cross-correlation between array elements to find the direction of a source relative to the array. One location estimation problem is the cocktail party problem, where a number of people talk simultaneously in a room. When only one specific person at the cocktail party should be listened to, cross-correlation methods are not selective enough.
We take a different approach by combining microphone array steering and speaker verification.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (21)
- et al. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing (2000).
- et al. An overview of text-independent speaker recognition: From features to supervectors. Speech Communication (2010).
- et al. Springer handbook of speech processing (2008).
- A microphone array for hearing aids. Journal of the Acoustical Society of America (2001).
- et al. Distant speech recognition (2009).
- Microphone arrays: A tutorial (2001).
- Kellermann, W. A self-steering digital microphone array. In: Proc. Acoustics, Speech, and Signal Processing (1991).
- et al. Analysis of worst-case phase quantization sidelobes in focused beamforming. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control (1992).
- Language-independent prosodic features.
- et al. Springer handbook of speech processing (2008).