CSIR SAMA Speech Corpus Manual Datasets

Creative Commons Attribution 4.0 International (CC-BY 4.0): https://www.creativecommons.org/licenses/by/4.0/

Bandehorst, Jaco
Mak, Franco

2025-02-13
2025-02-13
2023-12

https://hdl.handle.net/20.500.12185/689

The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho, and Tshivenḓa, all of which are among South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts, then automatically segmented and transcribed. Segment transcriptions are provided in XML format.

Afrikaans
• News: 20:04:15 (Bulletins: 377)
• Drama: 12:35:47 (Episodes: 351)

Sepedi
• Drama: 10:25:16 (Episodes: 321)

Sesotho
• News: 14:49:11 (Bulletins: 326)
• Drama: 10:01:17 (Episodes: 200)

isiXhosa
• News: 14:57:20 (Bulletins: 325)
• Drama: 09:41:42 (Episodes: 190)

isiZulu
• News: 15:55:14 (Bulletins: 349)
• Drama: 07:58:10 (Episodes: 124)

Tshivenḓa
• Drama: 12:32:29 (Episodes: 275)

WAV

Speech corpora
Data harvesting
Transcription
Segmentation
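
The per-language figures listed above can be totalled with a short script. The following is a minimal sketch, not part of the corpus distribution: it assumes the time values are durations in hh:mm:ss, and the names `DURATIONS`, `to_seconds`, and `format_hms` are hypothetical helpers introduced here for illustration.

```python
# Durations copied from the record above.
# Assumption: each value is a duration written as hh:mm:ss.
DURATIONS = {
    "Afrikaans": {"News": "20:04:15", "Drama": "12:35:47"},
    "Sepedi": {"Drama": "10:25:16"},
    "Sesotho": {"News": "14:49:11", "Drama": "10:01:17"},
    "isiXhosa": {"News": "14:57:20", "Drama": "09:41:42"},
    "isiZulu": {"News": "15:55:14", "Drama": "07:58:10"},
    "Tshivenḓa": {"Drama": "12:32:29"},
}


def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_hms(seconds: int) -> str:
    """Format a number of seconds as h:mm:ss (hours may exceed 24)."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# Sum the genre durations for each language, then across the whole corpus.
per_language = {
    language: sum(to_seconds(duration) for duration in genres.values())
    for language, genres in DURATIONS.items()
}
total_seconds = sum(per_language.values())

for language, seconds in per_language.items():
    print(f"{language}: {format_hms(seconds)}")
print(f"Total: {format_hms(total_seconds)}")
```

Working in integer seconds avoids rounding issues and lets the totals exceed 24 hours, which a time-of-day type would not allow.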