Engineering Applications of Artificial Intelligence (IF 8) Pub Date: 2020-09-23, DOI: 10.1016/j.engappai.2020.103976 Miguel Fernández-Díaz, Ascensión Gallardo-Antolín
Speech intelligibility can be degraded by multiple factors, such as noisy environments, technical difficulties, or biological conditions. This work focuses on developing an automatic, non-intrusive system for predicting the speech intelligibility level in the latter case. The main contribution of our research is the use of Long Short-Term Memory (LSTM) networks with log-mel spectrograms as input features for this purpose. In addition, this LSTM-based system is further enhanced with a simple attention mechanism that determines which frames are most relevant to the task. The proposed models are evaluated on the UA-Speech database, which contains dysarthric speech with different degrees of severity. Results show that the attention LSTM architecture outperforms both a reference Support Vector Machine (SVM)-based system with hand-crafted features and an LSTM-based system with mean pooling.
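The difference between mean pooling and attention pooling over per-frame LSTM outputs can be illustrated with a minimal sketch. The abstract does not give the exact form of the attention mechanism, so the version below assumes a common variant: each frame vector is scored by a dot product with a weight vector, the scores are normalized with a softmax, and the utterance-level vector is the weighted sum of frames. The names `mean_pooling`, `attention_pooling`, and the vector `w` are hypothetical, and the toy `frames` stand in for LSTM outputs; none of them come from the paper.

```python
import math


def mean_pooling(frames):
    """Average the per-frame vectors into one utterance-level vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def attention_pooling(frames, w):
    """Weighted average of frames, with softmax weights over scores w . h_t.

    `w` is a hypothetical attention vector (learned in a real system).
    Returns the pooled vector and the per-frame attention weights.
    """
    scores = [sum(wi * hi for wi, hi in zip(w, f)) for f in frames]
    # Subtract the maximum score for numerical stability of the softmax.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas


# Toy stand-ins for three LSTM output frames (assumed values, not real data).
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
w = [0.0, 1.0]  # hypothetical attention vector favoring the second dimension

mean_vec = mean_pooling(frames)
att_vec, alphas = attention_pooling(frames, w)
print("mean pooling:     ", mean_vec)
print("attention pooling:", att_vec)
print("frame weights:    ", alphas)
```

In this sketch, mean pooling treats every frame equally, while attention pooling gives the largest weight to the frame with the highest score. With an all-zero `w` the two coincide, which is one way to see mean pooling as a special case of this kind of attention.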
Title: An attention Long Short-Term Memory based system for automatic classification of speech intelligibility