SADiLaR Language Resource Repository: Recent submissions
Now showing items 1-10 of 531
CSIR SAMA Speech Corpus Manual Datasets
(Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI); SADiLaR, 2023-12) The evaluation corpus contains orthographically transcribed broadband speech in Afrikaans, isiXhosa, isiZulu, Sepedi, Sesotho, Tshivenḓa all part of ...
AwezaMed automatic speech recognition (ASR) test data
(Voice Computing (VC) Research Group at the CSIR Nextgen Enterprises and Institutions (NGEI), 2020-12) The corpus contains orthographically transcribed broadband speech in four official languages of South Africa: Afrikaans, English, isiXhosa and isiZulu. ...
Autshumato Monolingual English Corpus
(CTexT® (Centre for Text Technology, North-West University), 2023-10-30) Monolingual corpus for South African English. The data is given as a single UTF-8 text file, with each segment on a newline. The data was specifically ...
IsiZulu Second Language Learner Speech Corpus
(Indiana University, 2024) This corpus is specifically designed to assist in evaluating the performance of pronunciation feedback tools for second language learning. The corpus ...
African Wordnet version 1.0
(UNISA, 2022-09-20) Developed using the expand model with Princeton WordNet 3.1 as basis. Please see for all details on the project. ...
Ex Machina: Using NLP and statistical learning models to model eyewitness statements and choosing behaviour
(Sadilar, 2019-05-07) This curated database includes data from various empirical studies where eyewitness statements and descriptions were collected. The original studies, ...
Autshumato English-Tshivenḓa Parallel Corpora
(North-West University; Centre for Text Technology (CTexT), 2023-12-12) Aligned parallel corpora for the following language pair: English-Tshivenḓa. Data was crawled from various multilingual government websites, sourced ...
Autshumato Monolingual Tshivenḓa Corpus
(North-West University; Centre for Text Technology (CTexT), 2023-12-12) Monolingual corpus for Tshivenḓa. The data is given as a single UTF-8 text file, with each segment on a newline.
Morphologically annotated corpus for isiNdebele
(Centre for Text Technology (CTexT), 2024-01-31) NCHLT corpus of morphologically annotated tokens in isiNdebele converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data ...
Morphologically annotated corpus for isiXhosa
(Centre for Text Technology (CTexT), 2024-01-31) NCHLT corpus of morphologically annotated tokens in isiXhosa converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is ...