CSIR SAMA Speech Corpus Manual Datasets
Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/689'
dc.contact.email | jbadenhorst@csir.co.za | en_ZA |
dc.contact.name | Jaco Badenhorst | en_ZA |
dc.contributor.author | Badenhorst, Jaco | |
dc.contributor.author | Mak, Franco | |
dc.date.accessioned | 2025-02-13T13:50:52Z | |
dc.date.available | 2025-02-13T13:50:52Z | |
dc.date.issued | 2023-12 | |
dc.description | The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho and Tshivenḓa, all of which are among South Africa’s eleven official written languages. The audio was harvested as MP3 podcasts, then automatically segmented and transcribed. Segment transcriptions are provided in XML format. <br>Afrikaans <br>• News-20:04:15-Bulletins: 377 <br>• Drama-12:35:47-Episodes: 351 <br>Sepedi <br>• Drama-10:25:16-Episodes: 321 <br>Sesotho <br>• News-14:49:11-Bulletins: 326 <br>• Drama-10:01:17-Episodes: 200 <br>isiXhosa <br>• News-14:57:20-Bulletins: 325 <br>• Drama-09:41:42-Episodes: 190 <br>isiZulu <br>• News-15:55:14-Bulletins: 349 <br>• Drama-07:58:10-Episodes: 124 <br>Tshivenḓa <br>• Drama-12:32:29-Episodes: 275 | en_ZA |
dc.format | WAV | en_ZA |
dc.identifier.uri | https://hdl.handle.net/20.500.12185/689 | |
dc.languages | Afrikaans | en_ZA |
dc.languages | isiXhosa | en_ZA |
dc.languages | isiZulu | en_ZA |
dc.languages | Sepedi | en_ZA |
dc.languages | Sesotho | en_ZA |
dc.languages | Tshivenḓa | en_ZA |
dc.media.category | Annotated Monolingual Speech Corpus | en_ZA |
dc.media.type | Speech | en_ZA |
dc.publisher | Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI) | en_ZA |
dc.publisher | SADiLaR | en_ZA |
dc.rights.license | Creative Commons Attribution 4.0 International (CC-BY 4.0): https://www.creativecommons.org/licenses/by/4.0/ | en_ZA |
dc.subject | Speech corpora | en_ZA |
dc.subject | Data harvesting | en_ZA |
dc.subject | Transcription | en_ZA |
dc.subject | Segmentation | en_ZA |
dc.title | CSIR SAMA Speech Corpus Manual Datasets | en_ZA |
local.url | https://repo.sadilar.org/handle/20.500.12185/9/submit/6e66107b60294b642f8b2b1e317c71205c3e3034.continue | en_ZA |
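The per-genre figures in dc.description can be totalled per language and for the corpus as a whole. The following is a minimal sketch, assuming the durations are written as hh:mm:ss (the record does not state the unit explicitly). The values are copied from the description; the names `ENTRIES`, `to_seconds`, `to_hms` and `totals_by_language` are illustrative and not part of the dataset.

```python
from collections import defaultdict

# (language, genre, duration, item count), copied from dc.description.
# Assumption: durations are hh:mm:ss; the record does not say so explicitly.
ENTRIES = [
    ("Afrikaans", "News", "20:04:15", 377),
    ("Afrikaans", "Drama", "12:35:47", 351),
    ("Sepedi", "Drama", "10:25:16", 321),
    ("Sesotho", "News", "14:49:11", 326),
    ("Sesotho", "Drama", "10:01:17", 200),
    ("isiXhosa", "News", "14:57:20", 325),
    ("isiXhosa", "Drama", "09:41:42", 190),
    ("isiZulu", "News", "15:55:14", 349),
    ("isiZulu", "Drama", "07:58:10", 124),
    ("Tshivenḓa", "Drama", "12:32:29", 275),
]


def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format a number of seconds as hh:mm:ss (hours may exceed 99)."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def totals_by_language(entries):
    """Sum the durations (in seconds) of all genres for each language."""
    totals = defaultdict(int)
    for language, _genre, duration, _count in entries:
        totals[language] += to_seconds(duration)
    return dict(totals)


if __name__ == "__main__":
    per_language = totals_by_language(ENTRIES)
    for language, seconds in per_language.items():
        print(f"{language}: {to_hms(seconds)}")
    print("All languages:", to_hms(sum(per_language.values())))
    print("Bulletins and episodes:", sum(count for *_, count in ENTRIES))
```

Under the hh:mm:ss assumption, the listed durations add up to roughly 129 hours of audio across 2838 bulletins and episodes.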
Files
License bundle
- Name: license.txt
- Size: 3.22 KB
- Format: Item-specific license agreed upon to submission