A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.

Recent Submissions

  • South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT) 

    Van Dyk, Tobie (ICELDA; SADiLaR, 2021)
    The South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT) is a multi-genre, multi-level learner corpus developed by the Inter-institutional ...
  • CTexT fastText Skipgram String Embeddings 

    Eiselen, Roald (Centre for Text Technology (CTexT), 2022-01-10)
    The CTexT Afrikaans fastText Skipgram String Embeddings is a 300-dimensional Afrikaans embedding model based on the Skipgram fastText architecture that ...
  • CTexT Afrikaans GloVe Word Embeddings 

    Eiselen, Roald (Centre for Text Technology (CTexT), 2022-01-10)
    The CTexT Afrikaans GloVe Word Embeddings is a 300-dimensional Afrikaans embedding model based on the Global Vectors architecture (Pennington et al., 2014) ...
  • CTexT Afrikaans FLAIR String Embeddings 

    Eiselen, Roald (Centre for Text Technology (CTexT), 2022-01-10)
    The CTexT Afrikaans FLAIR String Embeddings are two Afrikaans embedding models based on the FLAIR architecture (Akbik et al. 2018, 2019) that provides ...
  • CTexT Afrikaans FLAIR Named Entity Recognition model 

    Eiselen, Roald (Centre for Text Technology (CTexT), 2022-01-10)
    The CTexT Afrikaans FLAIR Named Entity Recognition model is a neural NER model based on the FLAIR framework (Akbik et al. 2019), and includes Afrikaans ...
  • CTexT Afrikaans fastText CBoW String Embeddings 

    Eiselen, Roald (Centre for Text Technology (CTexT), 2022-01-10)
    The CTexT Afrikaans fastText CBoW String Embeddings is a 300-dimensional Afrikaans embedding model based on the Continuous Bag of Words fastText ...
  • CTexT Afrikaans FLAIR Part of Speech tagger model 

    Eiselen, Roald (Centre for Text Technology (CTexT), 2022-01-10)
    The CTexT Afrikaans FLAIR Part of Speech tagger model is a neural part of speech tagger model based on the FLAIR framework (Akbik et al. 2019), and ...
  • Core technologies for conjunctively written South African languages 

    Du Toit, Jaco, et al. (North-West University, Centre for Language Technology (CTexT), 2021-03-31)
    During this SADiLaR-funded project, enriched corpora for the four official South African languages with a conjunctive orthography, i.e. isiNdebele ...
  • Corpus of multilingual code-switched soap opera speech 

    van der Westhuizen, Ewald, et al. (Stellenbosch University, 2020-02-28)
    The corpus comprises 26.9 hours of annotated multilingual speech that contains examples of code-switching in isiZulu, isiXhosa, Setswana, Sesotho and ...
  • COVID-19 Multilingual Terminology 

    City of Tshwane, et al. (City of Tshwane; South African Centre for Digital Language Resources (SADiLaR); Department of Science and Innovation; Pan South African Language Board (PanSALB), 2021-07)
    COVID-19 multilingual terminology list document in all the South African languages. The development of this terminology list was initiated by City of ...
