Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/689'
CSIR SAMA Speech Corpus Manual Datasets
Deposit Licenses
Date
2023-12
Authors
Bandehorst, Jaco
Mak, Franco
Publisher
Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI)
SADiLaR
Description
The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho, and Tshivenḓa, all of which are among South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts, then automatically segmented and transcribed. Segment transcriptions are provided in XML format.
Durations are given as hh:mm:ss.
Afrikaans [News: 20:04:15, Bulletins: 377; Drama: 12:35:47, Episodes: 351]
Sepedi [Drama: 10:25:16, Episodes: 321]
Sesotho [News: 14:49:11, Bulletins: 326; Drama: 10:01:17, Episodes: 200]
isiXhosa [News: 14:57:20, Bulletins: 325; Drama: 09:41:42, Episodes: 190]
isiZulu [News: 15:55:14, Bulletins: 349; Drama: 07:58:10, Episodes: 124]
Tshivenḓa [Drama: 12:32:29, Episodes: 275]
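
The per-language durations listed above can be added up to get a rough sense of the corpus size. The following is a minimal sketch, not part of the corpus distribution: the dictionary `DURATIONS` and the helpers `to_seconds` and `format_hms` are hypothetical names introduced here for illustration, and the values are copied by hand from the listing.

```python
# Durations (hh:mm:ss) copied from the listing above, keyed by language and genre.
# The structure and names here are illustrative assumptions, not a corpus API.
DURATIONS = {
    "Afrikaans": {"News": "20:04:15", "Drama": "12:35:47"},
    "Sepedi": {"Drama": "10:25:16"},
    "Sesotho": {"News": "14:49:11", "Drama": "10:01:17"},
    "isiXhosa": {"News": "14:57:20", "Drama": "09:41:42"},
    "isiZulu": {"News": "15:55:14", "Drama": "07:58:10"},
    "Tshivenḓa": {"Drama": "12:32:29"},
}


def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total_seconds: int) -> str:
    """Format a number of seconds as hh:mm:ss (hours may exceed 99)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Total audio per language, in seconds.
per_language = {
    language: sum(to_seconds(duration) for duration in genres.values())
    for language, genres in DURATIONS.items()
}

# Total audio across all languages and genres, in seconds.
total_seconds = sum(per_language.values())

if __name__ == "__main__":
    for language, seconds in per_language.items():
        print(f"{language}: {format_hms(seconds)}")
    print(f"Total: {format_hms(total_seconds)}")
```

Under these figures the corpus comes to about 129 hours of audio in total, with Afrikaans the largest single language.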
Verification status
Level 0