A survey of Polish ASR speech datasets
Poznan Studies in Contemporary Linguistics (IF 0.400), Pub Date: 2024-03-01, DOI: 10.1515/psicl-2023-0019
Michał Junczyk

Access to speech datasets is essential for the effective use of modern ASR systems in low-resource languages like Polish. However, the lack of centralized information and metadata describing available datasets poses a significant challenge to researchers and practitioners. In this paper, we address this issue by presenting the most comprehensive survey of Polish ASR speech datasets to date. We manually curated information on 53 publicly available datasets and annotated them with 61 attributes, providing a comprehensive catalog of these resources. The catalog facilitates the discovery and evaluation of available datasets, enabling researchers to identify datasets that suit their specific needs. It also enables the identification of gaps in the existing datasets, which may inform future research directions. The catalog is open and community-driven, which means that new datasets can be added and issues can be reported, ensuring its continued relevance and usefulness to the ASR community. Our work contributes to improving the accessibility and usability of ASR systems in low-resource languages such as Polish.

Updated: 2024-03-01