The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms
arXiv - CS - Sound. Pub Date: 2020-09-24, DOI: arxiv-2009.11644
Lara Orlandic, Tomas Teijeiro, David Atienza

Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. However, there is currently no validated database of cough sounds with which to train such ML models. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. First, we filtered the dataset using our open-sourced cough detection algorithm. Second, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. Finally, we ensured that coughs labeled as symptomatic and COVID-19 originate from countries with high infection rates, and that their expert labels are consistent. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world's most urgent health crises.

Updated: 2020-09-25