CSIR SAMA Speech Corpus Manual Datasets

Bandehorst, Jaco; Mak, Franco

For citation, please do not copy the URL from the browser address bar. Use the persistent URL 'https://hdl.handle.net/20.500.12185/689' instead.

Date

2023-12

Authors

Bandehorst, Jaco

Mak, Franco

Publisher

Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI)
SADiLaR

Description

The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho, and Tshivenḓa, all of which are among South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts, then automatically segmented and transcribed. Segment transcriptions are provided in XML format.
Afrikaans
• News - 20:04:15 - Bulletins: 377
• Drama - 12:35:47 - Episodes: 351
Sepedi
• Drama - 10:25:16 - Episodes: 321
Sesotho
• News - 14:49:11 - Bulletins: 326
• Drama - 10:01:17 - Episodes: 200
isiXhosa
• News - 14:57:20 - Bulletins: 325
• Drama - 09:41:42 - Episodes: 190
isiZulu
• News - 15:55:14 - Bulletins: 349
• Drama - 07:58:10 - Episodes: 124
Tshivenḓa
• Drama - 12:32:29 - Episodes: 275
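
The per-subset figures in the listing above can be totalled with a short script. The following is a minimal sketch, not part of the dataset: it assumes the middle field of each entry is a duration in hours:minutes:seconds, and the names `SUBSETS`, `to_seconds`, `to_hms`, and `totals_by` are hypothetical helpers introduced only for illustration. The values are copied by hand from the listing.

```python
from collections import defaultdict

# (language, genre, duration, item count), copied from the listing above.
# Assumption: each duration is hours:minutes:seconds.
SUBSETS = [
    ("Afrikaans", "News", "20:04:15", 377),
    ("Afrikaans", "Drama", "12:35:47", 351),
    ("Sepedi", "Drama", "10:25:16", 321),
    ("Sesotho", "News", "14:49:11", 326),
    ("Sesotho", "Drama", "10:01:17", 200),
    ("isiXhosa", "News", "14:57:20", 325),
    ("isiXhosa", "Drama", "09:41:42", 190),
    ("isiZulu", "News", "15:55:14", 349),
    ("isiZulu", "Drama", "07:58:10", 124),
    ("Tshivenḓa", "Drama", "12:32:29", 275),
]


def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as 'H:MM:SS' (hours may exceed 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def totals_by(column: int) -> dict:
    """Sum durations grouped by a column: 0 = language, 1 = genre."""
    sums = defaultdict(int)
    for row in SUBSETS:
        sums[row[column]] += to_seconds(row[2])
    return {key: to_hms(value) for key, value in sums.items()}


if __name__ == "__main__":
    print("By genre:", totals_by(1))
    print("By language:", totals_by(0))
    print("Overall:", to_hms(sum(to_seconds(row[2]) for row in SUBSETS)))
```

Grouping by column index keeps the sketch short. If the listing is ever published in a machine-readable form, reading the rows from that file would be preferable to the hand-copied table.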