Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/689'

CSIR SAMA Speech Corpus Manual Datasets

Files

afr_SABC_CSIR_speech_corpus_SAMA_manual_data_sets_Dec_2023.zip (3.12 GB)

sot_SABC_CSIR_speech_corpus_SAMA_manual_data_sets_Dec_2023.zip (2.28 GB)

nso_SABC_CSIR_speech_corpus_SAMA_manual_data_sets_Dec_2023.zip (1010.61 MB)

ven_SABC_CSIR_speech_corpus_SAMA_manual_data_sets_Dec_2023.zip (1.19 GB)

xho_SABC_CSIR_speech_corpus_SAMA_manual_data_sets_Dec_2023.zip (2.34 GB)

Date

2023-12

Authors

Badenhorst, Jaco

Mak, Franco

Publisher

Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI)
SADiLaR

Description

The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho, and Tshivenḓa, all of which are among South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts and automatically segmented and transcribed. Segment transcriptions are provided in XML format.
Afrikaans
• News-20:04:15-Bulletins: 377
• Drama-12:35:47-Episodes: 351
Sepedi
• Drama-10:25:16-Episodes: 321
Sesotho
• News-14:49:11-Bulletins: 326
• Drama-10:01:17-Episodes: 200
isiXhosa
• News-14:57:20-Bulletins: 325
• Drama-09:41:42-Episodes: 190
isiZulu
• News-15:55:14-Bulletins: 349
• Drama-07:58:10-Episodes: 124
Tshivenḓa
• Drama-12:32:29-Episodes: 275
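
The per-language and overall totals implied by the listing above can be checked with a short script. This is only a sketch: it assumes the middle field of each entry is an audio duration in hh:mm:ss, which the record does not state explicitly, and the names DURATIONS, to_seconds and to_hms are illustrative rather than part of the corpus.

```python
# Figures copied from the listing above. Reading them as hh:mm:ss audio
# durations is an assumption; the record does not label the field.
DURATIONS = {
    "Afrikaans": {"News": "20:04:15", "Drama": "12:35:47"},
    "Sepedi": {"Drama": "10:25:16"},
    "Sesotho": {"News": "14:49:11", "Drama": "10:01:17"},
    "isiXhosa": {"News": "14:57:20", "Drama": "09:41:42"},
    "isiZulu": {"News": "15:55:14", "Drama": "07:58:10"},
    "Tshivenḓa": {"Drama": "12:32:29"},
}


def to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as 'hh:mm:ss' (hours may exceed 24)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Sum the genre durations for each language, then across all languages.
per_language = {
    language: sum(to_seconds(value) for value in genres.values())
    for language, genres in DURATIONS.items()
}
grand_total = sum(per_language.values())

for language, seconds in per_language.items():
    print(f"{language}: {to_hms(seconds)}")
print(f"Total: {to_hms(grand_total)}")
```

Under that assumption, the listed figures add up to 129:00:41 across the six languages, with Afrikaans the largest at 32:40:02.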

Keywords

Speech corpora, Data harvesting, Transcription, Segmentation

License

Creative Commons Attribution 4.0 International (CC-BY 4.0)

URI

https://hdl.handle.net/20.500.12185/689

Collections

Resource Catalogue

Verification status

Level 0