Department of Science, Technology and Innovation
CLARIN in South Africa
 

CSIR SAMA Speech Corpus Manual Datasets

Abstract

Description

The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho and Tshivenḓa, all of which are among South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts, then automatically segmented and transcribed. Segment transcriptions are provided in XML format.
Afrikaans
• News - 20:04:15 - Bulletins: 377
• Drama - 12:35:47 - Episodes: 351
Sepedi
• Drama - 10:25:16 - Episodes: 321
Sesotho
• News - 14:49:11 - Bulletins: 326
• Drama - 10:01:17 - Episodes: 200
isiXhosa
• News - 14:57:20 - Bulletins: 325
• Drama - 09:41:42 - Episodes: 190
isiZulu
• News - 15:55:14 - Bulletins: 349
• Drama - 07:58:10 - Episodes: 124
Tshivenḓa
• Drama - 12:32:29 - Episodes: 275
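
The per-language figures above can be added up to get a rough sense of the corpus size. The following is a minimal Python sketch, not part of the corpus distribution. It assumes the durations are given as hours:minutes:seconds, which the listing does not state explicitly. The names LISTING, to_seconds and to_hms are illustrative only.

```python
from collections import defaultdict

# (language, genre, duration, item count), copied from the listing above.
# Assumption: each duration is hours:minutes:seconds.
LISTING = [
    ("Afrikaans", "News", "20:04:15", 377),
    ("Afrikaans", "Drama", "12:35:47", 351),
    ("Sepedi", "Drama", "10:25:16", 321),
    ("Sesotho", "News", "14:49:11", 326),
    ("Sesotho", "Drama", "10:01:17", 200),
    ("isiXhosa", "News", "14:57:20", 325),
    ("isiXhosa", "Drama", "09:41:42", 190),
    ("isiZulu", "News", "15:55:14", 349),
    ("isiZulu", "Drama", "07:58:10", 124),
    ("Tshivenḓa", "Drama", "12:32:29", 275),
]


def to_seconds(duration: str) -> int:
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as 'H:MM:SS' (hours may exceed 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Accumulate duration and item counts per genre (News or Drama).
seconds_by_genre = defaultdict(int)
items_by_genre = defaultdict(int)
for language, genre, duration, count in LISTING:
    seconds_by_genre[genre] += to_seconds(duration)
    items_by_genre[genre] += count

total_seconds = sum(seconds_by_genre.values())

for genre in sorted(seconds_by_genre):
    print(f"{genre}: {to_hms(seconds_by_genre[genre])}, {items_by_genre[genre]} items")
print(f"All genres: {to_hms(total_seconds)}")
```

Grouping by language instead of genre only requires changing the dictionary key in the loop.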

Citation

Verification status

Level 0