Pre-Trained-Based Individualization Model for Real-Time Spatial Audio Rendering System
IEEE Access (IF 3.4) Pub Date: 2021-09-15, DOI: 10.1109/access.2021.3113133
Jinyan Lu , Xiaoke Qi

Spatial audio has attracted increasing attention in fields such as virtual reality (VR) and navigation aids for the blind. Individualized head-related transfer functions (HRTFs) play an important role in generating spatial audio with accurate localization perception. Existing methods focus on a single database and do not fully exploit the information available across multiple databases. In light of this, this paper proposes a pre-training-based individualization model that predicts HRTFs for any target user, and implements a real-time spatial audio rendering system built on a wearable device to produce an immersive virtual auditory display. The proposed method first builds a pre-trained model from multiple databases using a DNN combined with an autoencoder-based dimensionality reduction method. This model captures the nonlinear relationship between user-independent HRTFs and position-dependent features. Then, fine-tuning is performed on a limited number of layers of the pre-trained model using a transfer learning technique. The key idea behind fine-tuning is to transfer the pre-trained user-independent model to a user-dependent one based on anthropometric features. Finally, real-time issues are discussed to guarantee a fluent auditory experience during dynamic scene updates, including fine-grained head-related impulse response (HRIR) acquisition, efficient spatial audio reproduction, and parallel synthesis and playback. These techniques ensure that the system runs with little computational cost, thus minimizing processing delay. The experimental results show that the proposed model outperforms other methods in terms of both subjective and objective metrics. Additionally, our rendering system runs on the HTC Vive with almost unnoticeable delay.
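The abstract gives no implementation details, but the core modeling idea it describes, an autoencoder that compresses HRTF spectra into a low-dimensional code, a DNN that maps position features to those codes, and fine-tuning restricted to the last few layers, can be sketched. The sketch below assumes PyTorch, 128 frequency bins, a 16-dimensional latent code, and a sin/cos position encoding; the layer sizes, the names HRTFAutoencoder, PositionToLatent, and fine_tune, and the plain MSE loss are all illustrative assumptions, not the paper's exact design (in particular, the paper also conditions the user-dependent stage on anthropometric features, which this sketch omits).

```python
# A minimal sketch of the pre-training / fine-tuning idea under the
# assumptions stated above; not the paper's actual architecture.
import torch
import torch.nn as nn

N_FREQ = 128   # HRTF magnitude bins (assumed)
LATENT = 16    # autoencoder bottleneck size (assumed)

class HRTFAutoencoder(nn.Module):
    """Dimensionality reduction: HRTF spectrum -> latent code -> spectrum."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(N_FREQ, 64), nn.ReLU(),
                                     nn.Linear(64, LATENT))
        self.decoder = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(),
                                     nn.Linear(64, N_FREQ))
    def forward(self, x):
        return self.decoder(self.encoder(x))

class PositionToLatent(nn.Module):
    """DNN mapping position features (e.g. sin/cos of azimuth and
    elevation) to latent HRTF codes; pre-trained on pooled databases."""
    def __init__(self, n_pos_feat=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_pos_feat, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, LATENT))
    def forward(self, p):
        return self.net(p)

def fine_tune(model, loader, n_trainable=1, lr=1e-4, epochs=20):
    """Transfer-learning step: freeze all but the last `n_trainable`
    Linear layers, then adapt to one target user's data."""
    linear_layers = [m for m in model.net if isinstance(m, nn.Linear)]
    for layer in linear_layers[:-n_trainable]:
        for p in layer.parameters():
            p.requires_grad = False
    opt = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for pos, latent_target in loader:
            opt.zero_grad()
            loss = loss_fn(model(pos), latent_target)
            loss.backward()
            opt.step()
```

Freezing the early layers preserves the position-to-HRTF structure learned across databases, while the last layers absorb the per-user differences, which is one common way to realize the user-independent-to-user-dependent transfer the abstract describes.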
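Likewise, the real-time reproduction step, convolving source audio with the HRIR for the current head orientation block by block while carrying the convolution tail across blocks, can be illustrated. The sketch below is a minimal overlap-add renderer under assumed block and filter sizes, not the paper's implementation; hrir_lookup is a hypothetical stand-in for its fine-grained HRIR acquisition.

```python
# A minimal sketch of block-wise binaural rendering by HRIR convolution
# with overlap-add; sizes and hrir_lookup are assumptions for illustration.
import numpy as np

FS = 44100       # sample rate (assumed)
BLOCK = 256      # audio block size in samples (assumed)
HRIR_LEN = 512   # HRIR length in taps (assumed)

def hrir_lookup(azimuth_deg, elevation_deg):
    """Hypothetical stand-in: return (left, right) HRIRs for a direction,
    e.g. selected/interpolated from the individualized HRIR set."""
    rng = np.random.default_rng(0)  # placeholder data for the sketch
    return rng.standard_normal(HRIR_LEN), rng.standard_normal(HRIR_LEN)

class BinauralRenderer:
    """Overlap-add convolution; the tail is carried between blocks so
    playback stays continuous while the scene (head pose) updates."""
    def __init__(self):
        self.tail = np.zeros((2, HRIR_LEN - 1))

    def render_block(self, mono, az, el):
        hl, hr = hrir_lookup(az, el)
        out = np.empty((2, BLOCK))
        for ch, h in enumerate((hl, hr)):
            y = np.convolve(mono, h)            # length BLOCK + HRIR_LEN - 1
            y[:HRIR_LEN - 1] += self.tail[ch]   # add previous block's tail
            out[ch] = y[:BLOCK]
            self.tail[ch] = y[BLOCK:]           # save tail for next block
        return out
```

In a running system, render_block and audio playback would sit on separate threads connected by a ring buffer, which is one plausible way to realize the parallel synthesis and playback the abstract mentions.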

Updated: 2021-09-15