In 2009, the South African National HLT Network (NHN) funded a technology audit to form a clear profile of research and development activities in the human language technology (HLT) field in South Africa. The audit served as the basis for the RMA Index, a list of South African resources with their relevant metadata (such as developer details and specifications). Some of these resources are included in the RMA Catalogue and are therefore available for download.

Collections in this community

  • Resource Catalogue [251]

    A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.
  • Resource Index [386]

    A collection of language resource metadata, mostly collected during the NHN-funded technology audit of 2009 and the SADiLaR technology audit of 2018. Not all resources in this collection are available for download.
  • Student Data Repository [6]

    A collection of language resources made available as part of the output of post-graduate study programs.

Recent Submissions

  • Proof of concept: Afrikaans English Venda E-dictionary 

    Bosch, Sonja, et al. (Published as a Lexonomy dictionary (https://www.lexonomy.eu/POCVenEngAfr/), 2022-03-04)
    This proof of concept is a result of an experiment to compile a trilingual e-dictionary for Afrikaans, Venda and English. It includes 613 items and is ...
  • Bilingual English-Siswati Corpus 

    McKellar, Cindy (North-West University - Centre for Text Technology (CTexT), 2022-03-31)
    Aligned parallel corpora for the following language pair: English-SiSwati. The data is given as four separate UTF-8 text files, with each segment on a ...
  • Monolingual Siswati Corpus 

    McKellar, Cindy (North-West University - Centre for Text Technology (CTexT), 2022-03-31)
    Monolingual corpus for SiSwati. The data is given as a single UTF-8 text file, with each segment on a newline. The dataset contains existing data sourced ...
  • South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT) 

    Van Dyk, Tobie (ICELDA; SADiLaR, 2021)
    The South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT) is a multi-genre, multi-level learner corpus developed by the Inter-institutional ...
  • Sesotho syllabification systems 

    Sibeko, Johannes, et al. (South African Centre for Digital Language Resources, 2022-02-03)
    This package contains two syllabification systems for Sesotho (rule-based and TeX-based).
  • Sesotho syllable wordlist 

    Sibeko, Johannes, et al. (South African Centre for Digital Language Resources, 2022-02-03)
    This package contains a wordlist containing Sesotho words and their syllable information.
  • CTexT fastText Skipgram String Embeddings 

    Eiselen, Roald (Centre for Text Technology (CTexT), 2022-01-10)
    The CTexT Afrikaans fastText Skipgram String Embeddings is a 300-dimensional Afrikaans embedding model based on the Skipgram fastText architecture that ...
  • CTexT Afrikaans GloVe Word Embeddings 

    Eiselen, Roald (Centre for Text Technology (CTexT), 2022-01-10)
    The CTexT Afrikaans GloVe Word Embeddings is a 300-dimensional Afrikaans embedding model based on the Global Vectors architecture (Pennington, 2014) ...
  • CTexT Afrikaans FLAIR String Embeddings 

    Eiselen, Roald (Centre for Text Technology (CTexT), 2022-01-10)
    The CTexT Afrikaans FLAIR String Embeddings are two Afrikaans embedding models based on the FLAIR architecture (Akbik et al. 2018, 2019) that provides ...
  • CTexT Afrikaans FLAIR Named Entity Recognition model 

    Eiselen, Roald (Centre for Text Technology (CTexT), 2022-01-10)
    The CTexT Afrikaans FLAIR Named Entity Recognition model is a neural NER model based on the FLAIR framework (Akbik et al. 2019), and includes Afrikaans ...
