CSIR SAMA Speech Corpus Manual Datasets

Deposit Licenses

Date

2023-12

Authors

Bandehorst, Jaco
Mak, Franco

Publisher

Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI)
SADiLaR

Abstract

Description

The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho and Tshivenḓa, six of South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts, then automatically segmented and transcribed. Segment transcriptions are provided in XML format.

Contents per language (durations in hh:mm:ss):

Afrikaans: News 20:04:15 (377 bulletins); Drama 12:35:47 (351 episodes)
Sepedi: Drama 10:25:16 (321 episodes)
Sesotho: News 14:49:11 (326 bulletins); Drama 10:01:17 (200 episodes)
isiXhosa: News 14:57:20 (325 bulletins); Drama 09:41:42 (190 episodes)
isiZulu: News 15:55:14 (349 bulletins); Drama 07:58:10 (124 episodes)
Tshivenḓa: Drama 12:32:29 (275 episodes)
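
The durations listed above can be totalled per language with a short script. The following is a minimal Python sketch, not part of the dataset: the durations are copied by hand from this description, and the names `DURATIONS`, `to_seconds`, `to_hms` and `totals_by_language` are hypothetical helpers introduced only for illustration.

```python
# Durations copied by hand from the description: (language, genre, hh:mm:ss).
DURATIONS = [
    ("Afrikaans", "News", "20:04:15"),
    ("Afrikaans", "Drama", "12:35:47"),
    ("Sepedi", "Drama", "10:25:16"),
    ("Sesotho", "News", "14:49:11"),
    ("Sesotho", "Drama", "10:01:17"),
    ("isiXhosa", "News", "14:57:20"),
    ("isiXhosa", "Drama", "09:41:42"),
    ("isiZulu", "News", "15:55:14"),
    ("isiZulu", "Drama", "07:58:10"),
    ("Tshivenḓa", "Drama", "12:32:29"),
]


def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as hh:mm:ss (hours may exceed 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def totals_by_language(rows):
    """Sum the durations of all genres for each language, in seconds."""
    totals = {}
    for language, _genre, hms in rows:
        totals[language] = totals.get(language, 0) + to_seconds(hms)
    return totals


if __name__ == "__main__":
    per_language = totals_by_language(DURATIONS)
    for language, seconds in per_language.items():
        print(f"{language}: {to_hms(seconds)}")
    print(f"All languages: {to_hms(sum(per_language.values()))}")
```

Keeping the arithmetic in whole seconds avoids rounding errors; the figures are converted back to hh:mm:ss only for display.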

Citation

Collections

Verification status

Level 0