Detection and classification of human-produced nonverbal audio events, Applied Acoustics


Applied Acoustics (IF 3.4), Pub Date: 2021-01-01, DOI: 10.1016/j.apacoust.2020.107643
Philippe Chabot, Rachel E. Bouserhal, Patrick Cardinal, Jérémie Voix

Abstract: Audio wearable devices, or hearables, are becoming an increasingly popular consumer product. Some of these hearables contain an in-ear microphone to capture audio signals inside the user's occluded ear canal. The microphone is used mainly to pick up speech in noisy environments, but it can also capture other signals, such as nonverbal events that could be used to interact with the device or a computer. Teeth or tongue clicking could be used to interact with a device in a discreet manner, and coughing or throat-clearing sounds could be used to monitor the health of a user. In this paper, 10 human-produced nonverbal audio events are detected and classified in real time with a classifier using the Bag-of-Audio-Words algorithm. To build this algorithm, different clustering and classification methods are compared. Mel-Frequency Cepstral Coefficient features are used alongside Auditory-inspired Amplitude Modulation features and Per-Channel Energy Normalization features. To combine the different features, concatenation at the input level is compared with concatenation at the histogram level. The real-time detector is built using the detection-by-classification technique, classifying on a 400 ms window with 75% overlap. The detector is tested in a controlled noisy environment on 10 subjects. The classifier had a sensitivity of 81.5%, while the detector using the same classifier had a sensitivity of 69.9% in a quiet environment.
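As a rough illustration of two ideas named in the abstract, the sketch below shows how a 400 ms window with 75% overlap implies a 100 ms hop, and how a Bag-of-Audio-Words histogram can be formed by assigning each feature frame to its nearest codeword. The function names, the toy two-dimensional codebook, and the use of Euclidean distance are assumptions made for this example; the abstract does not specify the authors' codebook size, feature dimensions, or distance measure.

```python
from math import dist

WINDOW_MS = 400   # window length stated in the abstract
OVERLAP = 0.75    # overlap stated in the abstract


def window_starts(total_ms, window_ms=WINDOW_MS, overlap=OVERLAP):
    """Start times (ms) of every full analysis window that fits in total_ms."""
    hop_ms = round(window_ms * (1 - overlap))  # 400 ms * 0.25 = 100 ms
    return list(range(0, total_ms - window_ms + 1, hop_ms))


def boaw_histogram(frames, codebook):
    """Normalized count of nearest codewords for one window's feature frames.

    `frames` and `codebook` are sequences of equal-length numeric vectors.
    Euclidean distance is an assumption of this sketch.
    """
    counts = [0] * len(codebook)
    for frame in frames:
        nearest = min(range(len(codebook)),
                      key=lambda k: dist(frame, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


if __name__ == "__main__":
    # One second of audio yields seven full 400 ms windows, 100 ms apart.
    print(window_starts(1000))

    # Hypothetical 2-D codebook and feature frames, for illustration only.
    toy_codebook = [(0.0, 0.0), (1.0, 1.0)]
    toy_frames = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.8), (0.2, 0.1)]
    print(boaw_histogram(toy_frames, toy_codebook))
```

In a detection-by-classification setup, a histogram like this would be computed for each window and passed to a trained classifier, so a new decision becomes available every hop (here, every 100 ms).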


Updated: 2021-01-01
