
CSIR SAMA Speech Corpus Manual Datasets

dc.contact.email [en_ZA]: jbadenhorst@csir.co.za
dc.contact.name [en_ZA]: Jaco Badenhorst
dc.contributor.author: Badenhorst, Jaco
dc.contributor.author: Mak, Franco
dc.date.accessioned: 2025-02-13T13:50:52Z
dc.date.accessioned: 2025-08-21T14:20:43Z
dc.date.available: 2025-02-13T13:50:52Z
dc.date.available: 2025-08-21T14:20:43Z
dc.date.issued: 2023-12
dc.description [en_ZA]: The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho and Tshivenḓa, all of which are among South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts and automatically segmented and transcribed. Segment transcriptions are provided in XML format.
Afrikaans
• News: 20:04:15, Bulletins: 377
• Drama: 12:35:47, Episodes: 351
Sepedi
• Drama: 10:25:16, Episodes: 321
Sesotho
• News: 14:49:11, Bulletins: 326
• Drama: 10:01:17, Episodes: 200
isiXhosa
• News: 14:57:20, Bulletins: 325
• Drama: 09:41:42, Episodes: 190
isiZulu
• News: 15:55:14, Bulletins: 349
• Drama: 07:58:10, Episodes: 124
Tshivenḓa
• Drama: 12:32:29, Episodes: 275
dc.format [en_ZA]: WAV
dc.identifier.uri: https://hdl.handle.net/20.500.12185/689
dc.languages [en_ZA]: Afrikaans
dc.languages [en_ZA]: isiXhosa
dc.languages [en_ZA]: isiZulu
dc.languages [en_ZA]: Sepedi
dc.languages [en_ZA]: Sesotho
dc.languages [en_ZA]: Tshivenda
dc.media.category [en_ZA]: Annotated Monolingual Speech Corpus
dc.media.type [en_ZA]: Speech
dc.publisher [en_ZA]: Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI)
dc.publisher [en_ZA]: SADiLaR
dc.rights.license [en_ZA]: Creative Commons Attribution 4.0 International (CC-BY 4.0): https://www.creativecommons.org/licenses/by/4.0/
dc.subject [en_ZA]: Speech corpora
dc.subject [en_ZA]: Data harvesting
dc.subject [en_ZA]: Transcription
dc.subject [en_ZA]: Segmentation
dc.title [en_ZA]: CSIR SAMA Speech Corpus Manual Datasets
local.url [en_ZA]: https://repo.sadilar.org/handle/20.500.12185/9/submit/6e66107b60294b642f8b2b1e317c71205c3e3034.continue
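
The per-language figures in the description can be totalled with a few lines of Python. This is a minimal sketch, assuming the figures are durations in hh:mm:ss (the record does not state the unit); the names DURATIONS, to_seconds and to_stamp are illustrative and not part of the dataset.

```python
# Durations copied from the dc.description field of this record.
# Assumption: each value is a duration written as hh:mm:ss.
DURATIONS = {
    "Afrikaans": {"News": "20:04:15", "Drama": "12:35:47"},
    "Sepedi": {"Drama": "10:25:16"},
    "Sesotho": {"News": "14:49:11", "Drama": "10:01:17"},
    "isiXhosa": {"News": "14:57:20", "Drama": "09:41:42"},
    "isiZulu": {"News": "15:55:14", "Drama": "07:58:10"},
    "Tshivenḓa": {"Drama": "12:32:29"},
}


def to_seconds(stamp: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_stamp(total: int) -> str:
    """Convert a number of seconds back to an hh:mm:ss string."""
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Sum the genres within each language, then sum across languages.
per_language = {
    language: sum(to_seconds(stamp) for stamp in genres.values())
    for language, genres in DURATIONS.items()
}
grand_total = sum(per_language.values())

for language, seconds in per_language.items():
    print(f"{language}: {to_stamp(seconds)}")
print(f"All languages: {to_stamp(grand_total)}")
```

Under that assumption the six languages together come to 129:00:41 of audio, with Afrikaans the largest share at 32:40:02.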

Files

Original bundle

Now showing 1 - 5 of 6
Name: afr_SABC_CSIR_speech_corpus_SAMA_manual_data_sets_Dec_2023.zip
Size: 3.12 GB
Format: ZIP (an archive file format that supports lossless data compression; a ZIP file may contain one or more files or directories that may have been compressed)

Name: sot_SABC_CSIR_speech_corpus_SAMA_manual_data_sets_Dec_2023.zip
Size: 2.28 GB
Format: ZIP

Name: nso_SABC_CSIR_speech_corpus_SAMA_manual_data_sets_Dec_2023.zip
Size: 1010.61 MB
Format: ZIP

Name: ven_SABC_CSIR_speech_corpus_SAMA_manual_data_sets_Dec_2023.zip
Size: 1.19 GB
Format: ZIP

Name: xho_SABC_CSIR_speech_corpus_SAMA_manual_data_sets_Dec_2023.zip
Size: 2.34 GB
Format: ZIP
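
Because the archives are large, it can be worth checking a download before unpacking it. The following is a minimal sketch using Python's standard zipfile module; the helper name summarize_archive is hypothetical, and the member names in the demo are placeholders, since this record does not describe the internal layout of the archives.

```python
import io
import zipfile


def summarize_archive(source):
    """Return (member_count, uncompressed_bytes, first_bad_member).

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        infos = archive.infolist()
        total_bytes = sum(info.file_size for info in infos)
        # testzip() reads every member and checks its CRC;
        # it returns None when no corrupt member is found.
        bad_member = archive.testzip()
    return len(infos), total_bytes, bad_member


# Small in-memory stand-in for a downloaded archive.
# The member names below are placeholders, not the corpus layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as demo:
    demo.writestr("segment_0001.xml", "<segment/>")
    demo.writestr("segment_0001.wav", b"\x00" * 1024)
buffer.seek(0)

print(summarize_archive(buffer))
```

For a real download, pass the path of the ZIP file instead of the in-memory buffer; a third element other than None names the first member that failed its integrity check.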

License bundle

Now showing 1 - 1 of 1
Name: license.txt
Size: 3.22 KB
Format: Plain Text