CSIR SAMA Speech Corpus Manual Datasets

License agreement
By downloading this resource I accept and agree to the terms of use and the associated license conditions under which the resource is distributed.
Collections
- Resource Index [414]
Author(s)
Badenhorst, Jaco
Mak, Franco
Metadata
Description
The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho and Tshivenḓa, six of South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts, then automatically segmented and transcribed. Segment transcriptions are provided in XML format. The total duration and number of items per language and genre are:
Afrikaans [News-20:04:15 (hh:mm:ss)-Bulletins: 377; Drama-12:35:47-Episodes: 351],
Sepedi [Drama-10:25:16-Episodes: 321],
Sesotho [News-14:49:11-Bulletins: 326; Drama-10:01:17-Episodes: 200],
isiXhosa [News-14:57:20-Bulletins: 325; Drama-09:41:42-Episodes: 190],
isiZulu [News-15:55:14-Bulletins: 349; Drama-07:58:10-Episodes: 124],
Tshivenḓa [Drama-12:32:29-Episodes: 275]
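
The per-language figures above can be added up to get totals per language, per genre, and for the whole corpus. The following is a minimal sketch of that arithmetic in Python. The durations and counts are copied from the listing above; the names `CORPUS`, `to_seconds`, `to_hms` and `totals_by` are illustrative only and are not part of the resource.

```python
# Durations (hh:mm:ss) and item counts copied from the listing above.
# Tuple layout (an assumption of this sketch): language, genre, duration, items.
CORPUS = [
    ("Afrikaans", "News", "20:04:15", 377),
    ("Afrikaans", "Drama", "12:35:47", 351),
    ("Sepedi", "Drama", "10:25:16", 321),
    ("Sesotho", "News", "14:49:11", 326),
    ("Sesotho", "Drama", "10:01:17", 200),
    ("isiXhosa", "News", "14:57:20", 325),
    ("isiXhosa", "Drama", "09:41:42", 190),
    ("isiZulu", "News", "15:55:14", 349),
    ("isiZulu", "Drama", "07:58:10", 124),
    ("Tshivenḓa", "Drama", "12:32:29", 275),
]


def to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert a number of seconds back to 'hh:mm:ss' (hours may exceed 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def totals_by(field: int) -> dict[str, int]:
    """Sum durations in seconds, grouped by tuple field (0 = language, 1 = genre)."""
    totals: dict[str, int] = {}
    for entry in CORPUS:
        key = entry[field]
        totals[key] = totals.get(key, 0) + to_seconds(entry[2])
    return totals


for language, seconds in totals_by(0).items():
    print(f"{language}: {to_hms(seconds)}")

for genre, seconds in totals_by(1).items():
    print(f"{genre}: {to_hms(seconds)}")

overall = sum(to_seconds(duration) for _, _, duration, _ in CORPUS)
print(f"Whole corpus: {to_hms(overall)}")
```

Working in whole seconds and converting back only for display avoids carry errors when adding hh:mm:ss values by hand.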
Contact person
Jaco Badenhorst
Contact person's e-mail address
jbadenhorst@csir.co.za
Publisher(s)
Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI)
SADiLaR
Verification status
Level 0