The South African directory enquiries (SADE) name corpus

Thirion, Jan W. F.; van Heerden, Charl; Giwa, Oluwapelumi; Davel, Marelie H.

doi:10.1007/s10579-019-09448-6

Original Paper
Published: 06 February 2019

Language Resources and Evaluation, Volume 54, pages 155–184 (2020)

Jan W. F. Thirion¹, Charl van Heerden¹, Oluwapelumi Giwa¹ & Marelie H. Davel¹ (ORCID: orcid.org/0000-0003-3103-5858)

Abstract

We present the design and development of a South African directory enquiries corpus. It contains audio and orthographic transcriptions of a wide range of South African names produced by first-language speakers of four languages, namely Afrikaans, English, isiZulu and Sesotho. Useful as a resource to understand the effect of name language and speaker language on pronunciation, this is the first corpus to also aim to identify the “intended language”: an implicit assumption with regard to word origin made by the speaker of the name. We describe the design, collection, annotation, and verification of the corpus. This includes an analysis of the algorithms used to tag the corpus with meta information that may be beneficial to pronunciation modelling tasks.

Notes

https://cloud.google.com/translate/docs/reference/rest.
Google Translate API was developed by Google as a proprietary application based on statistical machine translation.
ISLRN 510-842-952-534-8, available from http://hdl.handle.net/20.500.12185/378 under a Creative Commons Attribution License (3.0 Unported).

Acknowledgements

This work is based on research supported by the Department of Arts and Culture (DAC) of the government of South Africa, through their Human Language Technologies (HLT) unit, and the National Research Foundation (NRF). Any opinion, finding and conclusion or recommendation expressed in this material is that of the authors and the NRF does not accept any liability in this regard. The support by both institutions is gratefully acknowledged.

Author information

Authors and Affiliations

Multilingual Speech Technologies (MuST), Faculty of Engineering, North-West University, Potchefstroom, South Africa
Jan W. F. Thirion, Charl van Heerden, Oluwapelumi Giwa & Marelie H. Davel

Corresponding author

Correspondence to Marelie H. Davel.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

1.1 Appendix 1: The SADE phone set

Table 13 provides a description of the SADE phoneme set as used to annotate the SADE corpus. For each phoneme, the corresponding IPA and X-SAMPA symbols are also provided.

Table 13 SADE phone set

1.2 Appendix 2: Consent form

The following consent form was used during data collection:

About this article

Cite this article

Thirion, J.W.F., van Heerden, C., Giwa, O. et al. The South African directory enquiries (SADE) name corpus. Lang Resources & Evaluation 54, 155–184 (2020). https://doi.org/10.1007/s10579-019-09448-6

Published: 06 February 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s10579-019-09448-6
