Privacy-Preserving Deep Speaker Separation for Smartphone-Based Passive Speech Assessment
IEEE Open Journal of Engineering in Medicine and Biology Pub Date: 2021-03-04, DOI: 10.1109/ojemb.2021.3063994
Apiwat Ditthapron, Emmanuel O. Agu, Adam C. Lammert
Goal: Smartphones can be used to passively assess and monitor patients' speech impairments caused by ailments such as Parkinson's disease, Traumatic Brain Injury (TBI), Post-Traumatic Stress Disorder (PTSD), and neurodegenerative diseases such as Alzheimer's disease and dementia. However, passive audio recordings in natural settings often capture the speech of non-target speakers (cross-talk). Consequently, speaker separation, which identifies the target speaker's speech in audio recordings containing two or more speakers' voices, is a crucial pre-processing step in such scenarios. Prior speech separation methods analyzed raw audio. However, in order to preserve speaker privacy, passively recorded smartphone audio and machine learning-based speech assessment are often performed on derived speech features such as Mel-Frequency Cepstral Coefficients (MFCCs). In this paper, we propose a novel Deep MFCC bAsed SpeaKer Separation (Deep-MASKS) method. Methods: Deep-MASKS uses an autoencoder to reconstruct the MFCC components of an individual's speech from an i-vector, x-vector, or d-vector representation of their speech learned during the enrollment period. Deep-MASKS utilizes a Deep Neural Network (DNN) for MFCC signal reconstruction, which yields a more accurate, higher-order function compared to prior work that utilized a mask. Unlike prior work that operates on utterances, Deep-MASKS operates on continuous audio recordings. Results: Deep-MASKS outperforms baselines, reducing the Mean Squared Error (MSE) of MFCC reconstruction by up to 44% and the number of additional bits required to represent clean speech entropy by 36%.
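The core idea described in the abstract — reconstructing a target speaker's MFCC frames from a mixed recording, conditioned on a speaker embedding (e.g. a d-vector) learned at enrollment, and scoring the result with MSE — can be sketched as follows. This is a minimal NumPy illustration with untrained random weights standing in for the trained DNN; the dimensions, names, and single-hidden-layer structure are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

N_MFCC = 13      # MFCC coefficients per frame (assumed)
EMB_DIM = 128    # speaker-embedding (e.g. d-vector) dimension (assumed)
HIDDEN = 64      # hidden-layer width (assumed)

# Hypothetical random weights standing in for a trained reconstruction network.
W1 = rng.standard_normal((N_MFCC + EMB_DIM, HIDDEN)) * 0.05
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, N_MFCC)) * 0.05
b2 = np.zeros(N_MFCC)

def reconstruct_target_mfcc(mixture_mfcc, speaker_emb):
    """Map each mixed-speech MFCC frame, conditioned on the enrolled
    speaker's embedding, to an estimate of that speaker's clean MFCCs."""
    n_frames = mixture_mfcc.shape[0]
    # Concatenate the speaker embedding onto every frame of the mixture.
    x = np.hstack([mixture_mfcc, np.tile(speaker_emb, (n_frames, 1))])
    h = np.tanh(x @ W1 + b1)   # hidden layer
    return h @ W2 + b2         # reconstructed target-speaker MFCC frames

# Toy data: 100 frames of "mixed" MFCCs and one enrollment embedding.
mixture = rng.standard_normal((100, N_MFCC))
d_vector = rng.standard_normal(EMB_DIM)

est = reconstruct_target_mfcc(mixture, d_vector)
clean = rng.standard_normal((100, N_MFCC))   # placeholder ground-truth clean MFCCs
mse = float(np.mean((est - clean) ** 2))     # the evaluation metric used in the paper
print(est.shape, mse)
```

Because the whole pipeline operates on MFCC frames rather than raw waveforms, the original audio never needs to leave the device, which is the privacy argument the abstract makes.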

Updated: 2021-03-04