
CSIR SAMA Speech Corpus Manual Datasets


Deposit Licenses

Date

2023-12

Authors

Bandehorst, Jaco
Mak, Franco

Publisher

Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI)
SADiLaR

Abstract

Description

The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho, and Tshivenḓa, six of South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts, then automatically segmented and transcribed. Segment transcriptions are provided in XML format.
Afrikaans
• News-20:04:15-Bulletins: 377
• Drama-12:35:47-Episodes: 351
Sepedi
• Drama-10:25:16-Episodes: 321
Sesotho
• News-14:49:11-Bulletins: 326
• Drama-10:01:17-Episodes: 200
isiXhosa
• News-14:57:20-Bulletins: 325
• Drama-09:41:42-Episodes: 190
isiZulu
• News-15:55:14-Bulletins: 349
• Drama-07:58:10-Episodes: 124
Tshivenḓa
• Drama-12:32:29-Episodes: 275
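The per-language listing above can be totalled with a short script. The following is a minimal sketch, not part of the corpus distribution. It assumes that the middle field of each entry is a duration in hh:mm:ss and that the final number is a count of bulletins or episodes; the record does not state either explicitly. The names `parse_listing` and `format_hms` are illustrative only.

```python
import re

# Listing copied from the description above.
LISTING = """\
Afrikaans
• News-20:04:15-Bulletins: 377
• Drama-12:35:47-Episodes: 351
Sepedi
• Drama-10:25:16-Episodes: 321
Sesotho
• News-14:49:11-Bulletins: 326
• Drama-10:01:17-Episodes: 200
isiXhosa
• News-14:57:20-Bulletins: 325
• Drama-09:41:42-Episodes: 190
isiZulu
• News-15:55:14-Bulletins: 349
• Drama-07:58:10-Episodes: 124
Tshivenḓa
• Drama-12:32:29-Episodes: 275
"""

# Assumed entry shape: "• <genre>-<hh:mm:ss>-<unit>: <count>"
ENTRY = re.compile(r"^•\s*(\w+)-(\d+):(\d{2}):(\d{2})-(\w+):\s*(\d+)$")


def parse_listing(text):
    """Return (language, genre, seconds, count) tuples from the listing."""
    rows = []
    language = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match is None:
            # Any non-entry line is treated as a language heading.
            language = line
            continue
        genre, hours, minutes, seconds, _unit, count = match.groups()
        duration = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        rows.append((language, genre, duration, int(count)))
    return rows


def format_hms(total_seconds):
    """Format a number of seconds as h:mm:ss."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:d}:{minutes:02d}:{seconds:02d}"


rows = parse_listing(LISTING)
total_seconds = sum(row[2] for row in rows)
total_items = sum(row[3] for row in rows)
print("Total duration:", format_hms(total_seconds))
print("Total recordings:", total_items)
```

Filtering `rows` by language or genre before summing gives the corresponding subtotals.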

Citation

Collections

Verification status

Level 0