
CSIR SAMA Speech Corpus Manual Datasets
The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho and Tshivenḓa, six of South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts, then automatically segmented and transcribed. Segment transcriptions are provided in XML format. Duration (hh:mm:ss) and number of bulletins or episodes per language: Afrikaans [News: 20:04:15, 377 bulletins; Drama: 12:35:47, 351 episodes]; Sepedi [Drama: 10:25:16, 321 episodes]; Sesotho [News: 14:49:11, 326 bulletins; Drama: 10:01:17, 200 episodes]; isiXhosa [News: 14:57:20, 325 bulletins; Drama: 09:41:42, 190 episodes]; isiZulu [News: 15:55:14, 349 bulletins; Drama: 07:58:10, 124 episodes]; Tshivenḓa [Drama: 12:32:29, 275 episodes].
Jaco Badenhorst
jbadenhorst@csir.co.za
Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI); SADiLaR
Creative Commons Attribution 4.0 International (CC-BY 4.0): https://www.creativecommons.org/licenses/by/4.0/
Afrikaans; isiXhosa; isiZulu; Sepedi; Sesotho; Tshivenḓa
Badenhorst, Jaco; Mak, Franco
Speech corpora; Data harvesting; Transcription; Segmentation
https://hdl.handle.net/20.500.12185/689
Speech
Annotated Monolingual Speech Corpus
2025-02-13T13:50:52Z
2025-02-13T13:50:52Z
2023-12
Level 0


Files in this item


There are no files associated with this item.

This item appears in the following Collection(s)

  • Resource Index [414]
    A collection of language resource metadata, mostly collected during the NHN-funded technology audit of 2009 and the SADiLaR technology audit of 2018. Not all resources in this collection are available for download.
