Published by De Gruyter Mouton, March 4, 2024

A survey of Polish ASR speech datasets

  • Michał Junczyk

Abstract

Access to speech datasets is essential for the effective use of modern ASR systems in low-resource languages like Polish. However, the lack of centralized information and metadata describing available datasets poses a significant challenge to researchers and practitioners. In this paper, we address this issue by presenting the most comprehensive survey of Polish ASR speech datasets to date. We manually curated information on 53 publicly available datasets and annotated them with 61 attributes, providing a comprehensive catalog of these resources. The catalog facilitates the discovery and evaluation of available datasets, enabling researchers to identify datasets that suit their specific needs. It also enables the identification of gaps in the existing datasets, which may inform future research directions. The catalog is open and community-driven, which means that new datasets can be added and issues can be reported, ensuring its continued relevance and usefulness to the ASR community. Our work contributes to improving the accessibility and usability of ASR systems in low-resource languages such as Polish.
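As a concrete illustration of the gap analysis described above, the following minimal sketch assumes the catalog has been exported to a CSV file with one row per dataset and one column per attribute; the file name pl_asr_catalog.csv and the export layout are assumptions made for the example, not part of the published resource.

    import pandas as pd

    # Hypothetical CSV export of the catalog: one row per dataset,
    # one column per catalog attribute (the file name is an assumption).
    df = pd.read_csv("pl_asr_catalog.csv")

    # Fraction of datasets that document each attribute; attributes with
    # low coverage point to metadata gaps across the surveyed resources.
    documented = df.notna().mean().sort_values()
    print(documented.head(10))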


Corresponding author: Michał Junczyk, Adam Mickiewicz University, Poznań, Poland

Appendix

PL ASR speech datasets catalog attributes

  1. Dataset name: Full name of a speech dataset, consisting of alphanumeric characters, underscores, and hyphens.

  2. Dataset ID: Dataset’s unique identifier for reporting, composed of lowercase letters and hyphens.

  3. Access type: Dataset access type from the cost perspective, with possible values including free, paid, and no-info.

  4. Access link: Web reference for accessing or purchasing a dataset, provided in URL format.

  5. Available online: Validated access status as of March 2023, with possible values of yes and no.

  6. License: Dataset license type, which can be Apache, CC-0, CC-BY, CC-BY-SA, ELRA, HZSK-PUB, LDC, or Proprietary.

  7. Publisher: Creator or publisher of a dataset, composed of alphabetical characters and hyphens.

  8. Repository: Main repository hosting a dataset, consisting of alphabetical characters and hyphens.

  9. Languages: Language and country code of the recorded speakers, represented as language code (ISO-639-1) and country code (ISO-3166-2), possibly including multiple languages.

  10. Creation year: Year a dataset was created or published, represented as a four-digit number.

  11. ISLRN: International Standard Language Resource Number, provided in ISLRN format.

  12. ISBN: International Standard Book Number, provided in ISBN format.

  13. LR catalog ID: Language data repository ID, represented as a combination of a URL or a string containing alphanumeric characters, hyphens, and underscores.

  14. Reference publication: Link to a relevant publication describing a dataset, provided in URL format.

  15. Contact point: Contact point referenced in the documentation, composed of alphanumeric characters, hyphens, underscores, and the ‘@’ symbol.

  16. Latest version: The latest version of the dataset released, expressed as a decimal number.

  17. Last update year: Date (year) of the last update, represented as a four-digit number.

  18. Sponsor: Institution that funded the creation of the dataset, consisting of alphanumeric characters, hyphens, and underscores.

  19. Price – non-commercial usage: Price for non-commercial usage, with possible values including free or a numerical value.

  20. Price – commercial usage: Price for commercial usage, with possible values including free or a numerical value.

  21. Purpose and split: Target usage and available data splits, with possible values being train, valid, test, or none.

  22. Size audio total [hours]: Total amount of audio data in hours, represented as a decimal number.

  23. Size of audio transcribed [hours]: Total amount of transcribed audio in hours, expressed as a decimal number.

  24. Size [GB]: Size of a dataset in gigabytes, represented as a decimal number.

  25. Speakers: Number of unique speakers who contribute speech recordings, expressed as an integer.

  26. Audio recordings: Number of voice recordings in the corpus, represented as an integer.

  27. Audio segmentation: Indicates whether audio recordings are segmented, with possible values of yes or no.

  28. Tokens: Number of tokens in the corpus, represented as an integer.

  29. Unique tokens: Number of unique tokens, expressed as an integer.

  30. Automatic QA: Indicates whether an automatic quality assurance process was applied, with possible values of yes or no.

  31. Manual QA: Indicates whether a manual quality assurance process was applied, with possible values of yes or no.

  32. Manual QA scope: Scope of the manual QA process, described as free text consisting of alphanumeric characters and spaces.

  33. Transcription coverage: Proportion of transcribed recordings, expressed as a percentage.

  34. Transcription protocol: Specifies whether a transcription protocol is documented, with possible values of yes, no, or a description of the protocol.

  35. Denormalized transcriptions: Indicates whether the available transcriptions are free of abbreviations, numerals, punctuation, etc., with possible values of yes or no.

  36. Transcription and annotation format: Format of transcription files, consisting of alphanumeric characters and periods.

  37. Domain: Domain of utterances, which can include academic lecture, books, broadcast, conversations, customer service, digits, general, interview, multi-domain, news, numbers, parliament speech, or public transport.

  38. Speech type: Type of speech, with possible values including conversational, read, public speech, or various.

  39. Audio collection process: Process by which audio was collected, with potential values of controlled, corpus, or various.

  40. Speech recordings source: Source of the speech recordings, which can include volunteers, university employees, crowd, public speakers, or paid contributors.

  41. Acoustic environment: Acoustic conditions under which audio was collected, with possible values of broadcast, car, home, mixed, quiet space, office, public space, or various.

  42. Audio device: Audio devices used for speech collection, such as a condenser mic, headset, mobile phone, landline phone, or various.

  43. Device model: Recording device(s) and model(s), represented by a combination of alphanumeric characters and hyphens.

  44. Audio format: Audio storage format, with potential values including flac, mp3, raw, riff, or wav.

  45. Audio codec: Audio encoding format, with possible values being mp3, ogg, opus, or vorbis.

  46. Audio channels: Number of audio recording channels, represented as an integer ranging from 1 to 16.

  47. Sampling rate [Hz]: Sampling rate of recorded audio, expressed as a four- or five-digit number.

  48. Bits per sample: Number of bits used to encode each audio sample, with possible values of 8, 16, 24, or 32.

  49. Age info: Annotation of the age of the speakers, with potential values of yes or no.

  50. Age balance: Indicates whether the age distribution of the speakers is balanced between demographic groups, represented as free text.

  51. Gender info: Annotation of the gender of the speakers, with potential values of yes or no.

  52. Gender balance: Indicates whether the gender distribution of the speakers is balanced between demographic groups, represented as free text.

  53. Nativity info: Annotation of the native-speaker status of the speakers, with potential values of yes or no.

  54. Accent info: Annotation of the accent of the speakers, with potential values of yes or no.

  55. Accent representative: Indicates whether the accent distribution of the speakers is representative of the target population groups, with potential values of yes or no.

  56. Occupation info: Annotation of the occupation of the speakers, with potential values of yes or no.

  57. Health info: Annotation of the health condition of the speakers, with potential values of yes or no.

  58. Speech signal time-alignment annotation: Annotation of the duration of speech segments, with potential values of yes or no.

  59. Speaker diarization annotation: For recordings containing speech from multiple speakers, annotation of the time segments attributed to each speaker, with potential values of yes or no.

  60. Named entities annotation: Annotation of named entities in utterances, with potential values of yes or no.

  61. Part of speech annotation: Annotation of part of speech information in utterances, with potential values of yes or no.
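Taken together, the attributes above define a schema for catalog entries. The sketch below, a minimal illustration rather than the catalog's actual data model, encodes a handful of these attributes as a Python dataclass and runs a typical discovery query; the class, field names, and both sample records are assumptions made for the example.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class CatalogEntry:
        """One catalog row, restricted to a few of the 61 attributes."""
        dataset_id: str                      # attribute 2: lowercase letters and hyphens
        access_type: str                     # attribute 3: "free", "paid", or "no-info"
        license: str                         # attribute 6: e.g. "CC-BY", "ELRA"
        creation_year: int                   # attribute 10: four-digit year
        size_audio_total_h: Optional[float]  # attribute 22: total audio in hours
        speakers: Optional[int]              # attribute 25: unique speakers
        gender_info: bool                    # attribute 51: gender annotated?

    def discover(entries: List[CatalogEntry], min_hours: float = 10.0) -> List[str]:
        """Example query: free datasets with gender annotation
        and at least min_hours of audio."""
        return [
            e.dataset_id
            for e in entries
            if e.access_type == "free"
            and e.gender_info
            and (e.size_audio_total_h or 0.0) >= min_hours
        ]

    # Hypothetical records; the values do not describe real catalog entries.
    catalog = [
        CatalogEntry("example-read-speech", "free", "CC-BY", 2019, 108.0, 560, True),
        CatalogEntry("example-call-center", "paid", "Proprietary", 2022, 4000.0, None, False),
    ]
    print(discover(catalog))  # -> ['example-read-speech']

A full implementation would load all 61 attributes from the published catalog rather than hard-coding records, but the query pattern remains the same.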


Received: 2023-03-23
Accepted: 2023-09-04
Published Online: 2024-03-04
Published in Print: 2024-03-25

© 2023 Walter de Gruyter GmbH, Berlin/Boston
