CSIR SAMA Speech Corpus Manual Datasets

Authors: Bandehorst, Jaco; Mak, Franco
Dates: 2025-02-13, 2025-02-13, 2023-12
URI: https://hdl.handle.net/20.500.12185/689
License: Creative Commons Attribution 4.0 International (CC-BY 4.0): https://www.creativecommons.org/licenses/by/4.0/
Format: WAV
Keywords: Speech corpora, Data harvesting, Transcription, Segmentation

The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho and Tshivenḓa, all of which are among South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts and automatically segmented and transcribed. Segment transcriptions are provided in XML format.

Afrikaans
• News: 20:04:15, Bulletins: 377
• Drama: 12:35:47, Episodes: 351

Sepedi
• Drama: 10:25:16, Episodes: 321

Sesotho
• News: 14:49:11, Bulletins: 326
• Drama: 10:01:17, Episodes: 200

isiXhosa
• News: 14:57:20, Bulletins: 325
• Drama: 09:41:42, Episodes: 190

isiZulu
• News: 15:55:14, Bulletins: 349
• Drama: 07:58:10, Episodes: 124

Tshivenḓa
• Drama: 12:32:29, Episodes: 275
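
The record gives a figure per language and genre but no totals. The sketch below adds them up. It assumes the colon-separated figures (for example 20:04:15) are audio durations in hours:minutes:seconds, which the record does not state explicitly. The names `DURATIONS`, `to_seconds` and `to_hms` are illustrative and are not part of the corpus distribution.

```python
# Figures copied from the record above.
# Assumption: each value is a duration in hh:mm:ss.
DURATIONS = {
    "Afrikaans": {"News": "20:04:15", "Drama": "12:35:47"},
    "Sepedi": {"Drama": "10:25:16"},
    "Sesotho": {"News": "14:49:11", "Drama": "10:01:17"},
    "isiXhosa": {"News": "14:57:20", "Drama": "09:41:42"},
    "isiZulu": {"News": "15:55:14", "Drama": "07:58:10"},
    "Tshivenḓa": {"Drama": "12:32:29"},
}


def to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as 'h:mm:ss' (hours may exceed 24)."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Sum the genres within each language, then sum across languages.
per_language = {
    language: sum(to_seconds(value) for value in genres.values())
    for language, genres in DURATIONS.items()
}
total = sum(per_language.values())

for language, seconds in per_language.items():
    print(f"{language}: {to_hms(seconds)}")
print(f"Total: {to_hms(total)}")
```

Under that reading, the listed figures sum to 129:00:41 across the six languages.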