Data-based spatial audio processing
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7). Pub Date: 2022-06-08, DOI: 10.1186/s13636-022-00248-5
Maximo Cobos, Jens Ahrens, Konrad Kowalczyk, Archontis Politis

Spatial audio today is a mature field with an active research community spanning both industry and academia. Spatial audio processing is a key technology for many sound-related applications, ranging from personalized audio experiences to the creation of the sound metaverse. The present special issue concerns data-based spatial audio processing. The term "data" is interpreted very broadly here: any processing method that involves data of any sort, be it measured data or data captured in real time, qualifies. The topics covered by the articles in this special issue are just as broad.

The past decade has shown a growing interest in such data-based methods. This may be attributed to the increased availability of advanced multichannel measurement and capture setups and, of course, of processing resources. Tasks such as measuring head-related transfer functions (HRTFs) or sound source directivities have become considerably easier to accomplish, and the resulting data continue to be made publicly available on a much larger scale than before. This also allows researchers who do not have access to the corresponding measurement resources to be active in the field in a comprehensive manner. Along with this availability of data, the exponential growth of deep learning and its application to most research areas has also contributed to this interest, with data-based methods in the classical sense evolving into data-driven approaches fully based on machine learning (ML) techniques.

The article co-authored by the guest editors of this special issue provides an extensive overview of data-based methods in spatial audio capture, processing, and reproduction. The article's quintessence is the identification of the role that ML-based methods play in this regard, whereby the authors use the term data-based in a narrower sense than this Editorial. They assume a so-called data-based representation of the audio scene being processed, in which spatial information such as the positions of the sound sources and the acoustic response of the virtual environment is encoded into the audio signals themselves. This is opposed to model-based representations, in which audio scenes are described by a set of individual source signals together with metadata such as the sources' positions, and sound radiation and propagation are described by physical models. The authors categorize the available methods by the task that they implement and by their location in the processing pipeline, which is generally composed of a capture stage, a processing stage, and a reproduction stage. The capture stage is dominated by model-based methods, although recent works have successfully used ML for tasks such as identifying auditory localization cues in binaural signals or decomposing a scene into foreground and background. The situation is similar for the reproduction stage, where ML is popular primarily for the individualization of HRTFs. In the processing stage, a number of tasks such as source localization and the extraction of acoustic metrics have been accomplished with ML, whereas methods that produce a high-resolution representation of the spatial audio scene are still dominated by classical signal processing. The authors identify two aspects that hinder a breakthrough in this regard: (1) the lack of suitable psychoacoustic models that are applicable to a broad range of audio scenes, and (2) the very limited availability of content combined with the immense data volumes involved, which require computation resources beyond what is typically available. The breakthrough of ML in the creation and processing of high-fidelity spatial audio scenes is therefore yet to occur.

A good example of the growing availability of spatial audio data is the article by Di Carlo and co-authors, which presents "dEchorate," a dataset of multichannel room impulse responses (RIRs) measured in a cuboid room. This is the first publicly available RIR dataset that includes annotations of the timings of early echoes as well as the positions of the microphones and of the real and image sources for different configurations. The dataset features 1800 RIRs obtained from 6 arrays of 5 microphones each, 6 sound sources, and 11 different acoustic conditions. Such data, along with the accompanying information and software utilities, can be an excellent research resource for echo-aware audio processing, room-geometry estimation, learning-based RIR estimation, and algorithm validation under a variety of realistic conditions.
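As a small illustration of what such annotations enable, the following sketch picks candidate early-echo timings from a measured RIR by simple thresholded peak picking within its first few tens of milliseconds. The function, parameter values, and the peak-picking heuristic are illustrative assumptions and are not part of the dEchorate tooling.

```python
import numpy as np
from scipy.signal import find_peaks

def early_echo_candidates(rir, fs, window_ms=50.0, rel_height=0.1):
    """Sample indices of prominent peaks within the first `window_ms` of the RIR."""
    n = int(fs * window_ms / 1000.0)
    segment = np.abs(rir[:n])                 # early part of the impulse response
    peaks, _ = find_peaks(segment, height=rel_height * segment.max())
    return peaks
```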

Binaural technology has experienced a surge in popularity in recent years. In particular, the perception of spatial location is an important aspect of binaural rendering in a breadth of audio and multimedia applications, including gaming, virtual reality, and augmented reality. For natural spatial hearing, personal HRTFs should be used, as they depend on the anthropometric parameters of the individual. The most common approach to obtaining personal HRTFs is to directly measure the binaural responses of a listener. Alternative approaches include synthesizing HRTFs from an individual's morphological features and an assumed model, or adapting the relevant spatial hearing cues to a particular listener. The article by Gutierrez-Parera et al. addresses the latter approach of adjusting generic HRTFs to an individual, which the authors achieve by scaling interaural time differences (ITDs) based on anthropometric parameters. The authors perform a listening test to infer the relationship between measured anthropometric parameters and listeners' spatial perception when using scaled versions of the ITDs of generic HRTFs. Based on the results of this exploratory perceptual test, they propose a method to predict an individual's scaling factor, which is applied to adapt the ITDs for localization in the horizontal plane, and validate their approach with another perceptual test and objective measures. One outcome of this study is a practical method to fit the ITDs of the widely used binaural dummy heads from Brüel & Kjær and Neumann.
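As a rough illustration of the kind of ITD adaptation described above, the following sketch estimates the ITD of a generic HRIR pair from the cross-correlation peak and stretches it by a listener-specific factor through a simple integer-sample shift. The function names, the peak-picking ITD estimate, and the integer shift are illustrative assumptions; the article derives the scaling factor from anthropometric measurements and perceptual data.

```python
import numpy as np

def estimate_itd_samples(hrir_left, hrir_right):
    """ITD in samples from the cross-correlation peak (positive: left channel lags)."""
    corr = np.correlate(hrir_left, hrir_right, mode="full")
    return int(np.argmax(corr)) - (len(hrir_right) - 1)

def scale_itd(hrir_left, hrir_right, factor):
    """Return an HRIR pair whose ITD is approximately factor times the original ITD."""
    itd = estimate_itd_samples(hrir_left, hrir_right)
    extra = int(round((factor - 1.0) * itd))  # additional delay for the lagging ear
    if extra >= 0:
        hrir_left = np.concatenate([np.zeros(extra), hrir_left])[: len(hrir_left)]
    else:
        hrir_right = np.concatenate([np.zeros(-extra), hrir_right])[: len(hrir_right)]
    return hrir_left, hrir_right
```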

While the localization of sources from binaural signals has attracted, and continues to attract, great interest, few works have focused on the mechanisms involved in localizing "source ensembles" in binaural music recordings. Indeed, front-back confusions are known to be an important issue in this regard. The article by Zieliński et al. addresses the automatic disambiguation between front and back audio sources in binaural music recordings. The work builds on a large dataset of 22,496 binaural excerpts generated by convolving multi-track recordings with 74 sets of HRTFs. A traditional ML method and a deep learning method based on convolutional neural networks (CNNs) are proposed and compared on the discrimination task. The article analyses the design choices of both systems and compares their performance in both HRTF-dependent and HRTF-independent scenarios, also identifying the features that provide high discrimination power within the considered framework. The results of this work may lead to a better understanding of the front-back confusion problem, paving the way towards future location-aware binaural music search and retrieval systems.
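For illustration only, the following sketch renders a binaural excerpt by convolving multi-track stems with per-source HRIR pairs and summing the results, which is the basic operation behind the dataset generation described above. The array shapes and the absence of level alignment or truncation are simplifying assumptions, not the authors' pipeline.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(stems, hrirs):
    """stems: list of mono signals; hrirs: list of (2, L) HRIR pairs, one per stem."""
    n_out = max(len(s) for s in stems) + max(h.shape[1] for h in hrirs) - 1
    mix = np.zeros((2, n_out))
    for stem, hrir in zip(stems, hrirs):
        for ear in (0, 1):
            y = fftconvolve(stem, hrir[ear])  # spatialize this stem for one ear
            mix[ear, : len(y)] += y
    return mix  # (2, n_out) binaural signal
```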

Binaural rendering of real spatial sound scenes is a core component of immersive audio and of audio for virtual and augmented reality. It also enables the auralization of concert recordings and sports events with a high level of perceptual immersion. Traditionally, this sound reproduction task has been dominated by recording devices equipped with spherical microphone arrays and the associated spatial transformation of their recordings to integrate HRTF information, as done, e.g., in Ambisonics. The basic approach consists of capturing the spatial sound field, transforming the recorded multichannel signals into the spherical harmonic domain (referred to as encoding), and reproducing the encoded signals over headphones by an inverse spherical harmonic transform followed by weighting with the HRTFs of the corresponding directions. This processing pipeline has been studied extensively in the literature, with performance bounds that depend on the size of the array and the number of microphones, and with the spatial transform stage optimized based on perceptual spatial cues. The article by Arend et al. presents a computationally efficient method for real-time binaural rendering in which, instead of the encoding and decoding steps, the signals of a spherical microphone array are directly convolved with a set of precomputed FIR filters that model a linear time-invariant system. Real-time operation is possible with this approach even for spherical arrays of high order; however, it comes at the cost of lower flexibility in sound field manipulations such as rotations. As a consequence, a set of linear filters needs to be precomputed and stored for every possible head orientation of the listener. The method has been validated on two working examples.
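The core of the direct-convolution idea can be sketched as follows, under simplifying assumptions: a precomputed filter set (one FIR filter per microphone and ear, for a given head orientation) is applied directly to the array signals, replacing the explicit encode/decode chain. The filter design itself, where the HRTF information and the spherical harmonic processing are folded in, is omitted here and the filters are assumed given.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_direct(array_signals, filters):
    """array_signals: (Q, N) microphone signals; filters: (Q, 2, L) precomputed FIRs
    for one head orientation. Returns a (2, N + L - 1) binaural signal."""
    Q, N = array_signals.shape
    L = filters.shape[2]
    out = np.zeros((2, N + L - 1))
    for q in range(Q):
        for ear in (0, 1):
            out[ear] += fftconvolve(array_signals[q], filters[q, ear])
    return out
```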

Traditional methods for the binaural reproduction of recorded sound scenes have been developed and formulated in the spherical harmonic domain, tailored to spherical microphone arrays. As more general array configurations, such as wearable or head-mounted ones, become increasingly important for immersive media and immersive communications, binaural rendering methods for such array recordings also need to be developed and studied. The article by Ifergan and Rafaely is a comprehensive step in that direction. The authors formulate binaural rendering for arbitrary arrays as a beamforming problem, in which a discrete set of beamformer signals is rendered with the HRTFs of the respective directions. Parallels with spherical microphone arrays and spherical harmonic processing for binaural rendering are drawn, and useful performance metrics are transferred from the spherical microphone array case to the general one.
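A minimal time-domain sketch of this formulation, under the assumption of fixed broadband beamforming weights (the article treats general arrays and more careful designs), steers the array toward a discrete set of directions and convolves each beam signal with the HRIR of its direction before summing per ear.

```python
import numpy as np
from scipy.signal import fftconvolve

def beamform_binaural(array_signals, weights, hrirs):
    """array_signals: (Q, N); weights: (D, Q) broadband beamforming weights, one row
    per look direction; hrirs: (D, 2, L) HRIR pairs for the same directions."""
    beams = weights @ array_signals                       # (D, N) beam signals
    D, N = beams.shape
    L = hrirs.shape[2]
    out = np.zeros((2, N + L - 1))
    for d in range(D):
        for ear in (0, 1):
            out[ear] += fftconvolve(beams[d], hrirs[d, ear])
    return out
```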

Sound source directivities are usually determined from recordings of a given source, such as a musical instrument, captured with an open spherical microphone array under free-field conditions. The main difficulties with such measurements are that the source signal is not known and that the angular sampling of the sound field has a very low resolution. As a consequence, usually only the frequency-dependent magnitude of the directivity is determined, and interpolation over direction is performed. Ackermann et al. compare the accuracy of three different methods: spherical harmonic interpolation, thin-plate pseudo-spline interpolation, and piecewise-linear spherical triangular interpolation. The test data are obtained from four different musical instruments, whereby an automatic excitation was used to achieve measurements with high angular resolution using only a moderate number of microphones and a rotating sound source. The test data are downsampled to 32 directions, and the accuracy of the three methods is determined by comparing the interpolation results to the ground truth measured in the respective directions. The accuracy turns out to depend strongly on the source type and on the sampling grid. It is of the same order of magnitude for all methods under consideration, with the smallest average global error occurring for thin-plate pseudo-spline interpolation. The authors also provide a number of guidelines for further processing of the interpolated directivities.
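As an illustration of one of the three methods, a least-squares spherical harmonic fit to a magnitude directivity sampled at a sparse set of directions could look roughly as follows. The complex scipy basis, the fixed order, and the plain least-squares solution are simplifying assumptions rather than the authors' exact procedure.

```python
import numpy as np
from scipy.special import sph_harm

def sh_matrix(order, azi, pol):
    """Spherical harmonic basis: rows are directions, columns the terms (n, m) up to
    `order`. azi in [0, 2*pi) is the azimuth, pol in [0, pi] the polar angle."""
    cols = [sph_harm(m, n, azi, pol)
            for n in range(order + 1) for m in range(-n, n + 1)]
    return np.stack(cols, axis=1)

def interpolate_directivity(mag, azi, pol, azi_dense, pol_dense, order=4):
    """Fit SH coefficients to the sparse magnitude samples and evaluate on a dense grid."""
    Y = sh_matrix(order, azi, pol)                      # e.g., (32, (order + 1)**2)
    coeffs, *_ = np.linalg.lstsq(Y, mag, rcond=None)    # least-squares SH fit
    return np.real(sh_matrix(order, azi_dense, pol_dense) @ coeffs)
```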

Multichannel source separation methods have strong potential in spatial audio applications such as spatial re-mixing, enhancement, or modification of spatial content. In recent years, large performance gains have been achieved with deep neural network (DNN) separation models such as Conv-TasNet and related multichannel extensions. Similar gains have been reported for learning-based dereverberation, where DNN models are trained to perform a complex spectral mapping from reverberant source spectrograms to dry or anechoic ones. The article by Chen et al. skillfully combines such advances for the task of simultaneous multichannel source separation and dereverberation. A Conv-TasNet-inspired DNN is constructed to operate as a trainable beamformer, producing estimates of the separated source signals. These separated signals are then post-processed by a dereverberation network, producing the final enhanced source signals. The work presents a comprehensive comparison against traditional dereverberation and spatial filtering combinations as well as against traditional source separation approaches. Furthermore, it compares against a recent multichannel Conv-TasNet extension and, tested on reverberant speech mixtures, shows solid improvements in separation performance, speech intelligibility, and enhanced speech quality. The performance is additionally analyzed with respect to the number of microphones, varying reverberation times, and the spatial separation of the speakers.
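The two-stage idea, a separation front-end followed by a dereverberation network, can be sketched purely structurally as follows. Both modules are trivial convolutional placeholders and do not reflect the authors' Conv-TasNet-based architecture or training objectives.

```python
import torch
import torch.nn as nn

class SeparateThenDereverb(nn.Module):
    """Placeholder cascade: multichannel mixture -> separated sources -> dereverberated sources."""

    def __init__(self, n_mics=4, n_sources=2, hidden=64):
        super().__init__()
        # stand-in for the trainable beamformer / separator (maps mics to source estimates)
        self.separator = nn.Conv1d(n_mics, n_sources, kernel_size=15, padding=7)
        # stand-in for the dereverberation network applied to each separated signal
        self.dereverb = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=15, padding=7), nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=15, padding=7),
        )

    def forward(self, x):                                    # x: (batch, n_mics, time)
        sep = self.separator(x)                              # (batch, n_sources, time)
        out = [self.dereverb(sep[:, s : s + 1]) for s in range(sep.shape[1])]
        return torch.cat(out, dim=1)                         # (batch, n_sources, time)
```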

The above contributions are significant examples of ongoing work in data-based spatial audio and reflect its current status very well. The availability of more datasets (RIRs, HRTFs, Ambisonics recordings, etc.) is facilitating open research that efficiently exploits the acoustic knowledge provided by such data, resulting in a very interesting situation where traditional signal processing and emerging ML-based methods coexist and interact across a wide range of applications. Echo-aware signal processing, binaural perception analysis, HRTF individualization, efficient and flexible binaural rendering, interpolation methods for source directivities, and multichannel source separation, all covered in this special issue, are only some of these applications.

Maximo Cobos

Jens Ahrens

Konrad Kowalczyk

Archontis Politis

This work received funding from Grant RTI2018-097045-B-C21 funded by MCIN/AEI/10.13039/501100011033 and "ERDF A way of making Europe," as well as from the National Science Centre of Poland under grant number DEC-2017/25/B/ST7/01792 and from Generalitat Valenciana under grants AICO/2020/154 and AEST/2020/012.

Author notes
  1. Maximo Cobos, Jens Ahrens, Konrad Kowalczyk and Archontis Politis contributed equally to this work.

Authors and Affiliations

  1. Computer Science Department, Universitat de València, 46100, Burjassot, Valencia, Spain

    Maximo Cobos

  2. Division of Applied Acoustics, Chalmers University of Technology, 412 96, Gothenburg, Sweden

    Jens Ahrens

  3. Institute of Electronics, AGH University of Science and Technology, 30-059, Krakow, Poland

    Konrad Kowalczyk

  4. Department of Information Technology and Communication Sciences, Tampere University, FI-33720, Tampere, Finland

    Archontis Politis


Contributions

All authors contributed equally to this work. The authors read and approved the final manuscript.

Corresponding authors

Correspondence to Maximo Cobos, Jens Ahrens, Konrad Kowalczyk or Archontis Politis.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


Cite this article

Cobos, M., Ahrens, J., Kowalczyk, K. et al. Data-based spatial audio processing. J AUDIO SPEECH MUSIC PROC. 2022, 13 (2022). https://doi.org/10.1186/s13636-022-00248-5



