Title | CSIR SAMA Speech Corpus Manual Datasets |
Description | The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho and Tshivenḓa, all of which are among South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts, then automatically segmented and transcribed. Segment transcriptions are provided in XML format.
Afrikaans [News-20:04:15 (hh:mm:ss)-Bulletins: 377; Drama-12:35:47-Episodes: 351],
Sepedi [Drama-10:25:16-Episodes: 321],
Sesotho [News-14:49:11-Bulletins: 326; Drama-10:01:17-Episodes: 200],
isiXhosa [News-14:57:20-Bulletins: 325; Drama-09:41:42-Episodes: 190],
isiZulu [News-15:55:14-Bulletins: 349; Drama-07:58:10-Episodes: 124],
Tshivenḓa [Drama-12:32:29-Episodes: 275] |
Contact name | Jaco Badenhorst |
Contact email | jbadenhorst@csir.co.za |
Publisher(s) | Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI); SADiLaR |
License | Creative Commons Attribution 4.0 International (CC-BY 4.0): https://www.creativecommons.org/licenses/by/4.0/ |
Language(s) | Afrikaans; isiXhosa; isiZulu; Sepedi; Sesotho; Tshivenḓa |
Author(s) | Badenhorst, Jaco; Mak, Franco |
Subject | Speech corpora; Data harvesting; Transcription; Segmentation |
URI | https://hdl.handle.net/20.500.12185/689 |
Media type | Speech |
Media category | Annotated Monolingual Speech Corpus |
Submit date | 2025-02-13T13:50:52Z |
Date available | 2025-02-13T13:50:52Z |
Date created | 2023-12 |
Verification status | Level 0 |