Time-Domain Multi-modal Bone/air Conducted Speech Enhancement, IEEE Signal Processing Letters

Time-Domain Multi-modal Bone/air Conducted Speech Enhancement
IEEE Signal Processing Letters (IF 3.2) Pub Date: 2020-01-01, DOI: 10.1109/lsp.2020.3000968
Cheng Yu, Kuo-Hsuan Hung, Syu-Siang Wang, Yu Tsao, Jeih-weih Hung

Previous studies have shown that integrating video signals as a complementary modality can improve speech enhancement (SE) performance. However, video clips usually contain large amounts of data and incur a high computational cost, which may complicate the SE system. As an alternative source, a bone-conducted speech signal has a moderate data size while still exhibiting speech-phoneme structures, and thus complements its air-conducted counterpart. In this study, we propose a novel multi-modal SE structure in the time domain that leverages bone- and air-conducted signals. In addition, we examine two ensemble-learning-based strategies, early fusion (EF) and late fusion (LF), to integrate the two types of speech signals, and adopt a deep learning-based fully convolutional network to conduct the enhancement. Experimental results on a Mandarin corpus indicate that the proposed multi-modal SE structure (integrating bone- and air-conducted signals) significantly outperforms its single-source counterparts (using a bone- or air-conducted signal only) on various speech evaluation metrics. Moreover, adopting the LF strategy rather than EF in this multi-modal SE structure achieves better results.
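
The difference between the two fusion strategies can be illustrated with a toy data-flow sketch. This is not the architecture from the paper, whose layer details are not given in the abstract. The function names, kernel sizes, and weights below are all hypothetical, and the "network" is reduced to one or two hand-written 1-D convolutions with untrained weights. The sketch assumes that early fusion stacks the two waveforms as input channels of a single model, while late fusion runs a separate branch per signal and merges the branch outputs afterwards.

```python
# Toy illustration of early vs. late fusion on time-domain signals.
# All names, kernel sizes, and weights are hypothetical and are NOT
# taken from the paper; nothing here is trained.

def conv1d(channels, kernels):
    """Zero-padded, length-preserving 1-D convolution summed over channels.

    channels: list of equal-length waveforms (lists of floats)
    kernels:  one odd-length kernel per input channel
    """
    n = len(channels[0])
    out = [0.0] * n
    for x, k in zip(channels, kernels):
        half = len(k) // 2
        for t in range(n):
            for j, w in enumerate(k):
                i = t + j - half
                if 0 <= i < n:  # samples outside the signal count as zero
                    out[t] += w * x[i]
    return out


def relu(x):
    return [max(0.0, v) for v in x]


def early_fusion(air, bone, k_in, k_out):
    # Stack both waveforms as a two-channel input to ONE model.
    hidden = relu(conv1d([air, bone], k_in))
    return conv1d([hidden], k_out)


def late_fusion(air, bone, k_air, k_bone, k_merge):
    # Process each signal in its own branch, then merge the branch outputs.
    h_air = relu(conv1d([air], k_air))
    h_bone = relu(conv1d([bone], k_bone))
    return conv1d([h_air, h_bone], k_merge)


if __name__ == "__main__":
    air = [1.0, 2.0, 3.0, 4.0]   # stand-in for a noisy air-conducted waveform
    bone = [0.5, 0.5, 0.5, 0.5]  # stand-in for a bone-conducted waveform
    identity = [0.0, 1.0, 0.0]
    half = [0.0, 0.5, 0.0]
    print(early_fusion(air, bone, [identity, identity], [identity]))
    print(late_fusion(air, bone, [identity], [identity], [half, half]))
```

In both functions the output has the same length as the input waveforms, which matches the time-domain, waveform-in and waveform-out setting described in the abstract. In a real system each convolution would be replaced by a trained fully convolutional network.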

Updated: 2020-01-01
