A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.

Recent Submissions

  • Autshumato Monolingual English Corpus 

    McKellar, Cindy (CTexT® (Centre for Text Technology, North-West University), 2023-10-30)
    Monolingual corpus for South African English. The data is given as a single UTF-8 text file, with each segment on a newline. The data was specifically ...
  • African Wordnet version 1.0 

    Griesel, Marissa (UNISA, 2022-09-20)
    Developed using the expand model with Princeton WordNet 3.1 as basis. Please see https://africanwordnet.wordpress.com/ for all details on the project. ...
  • Ex Machina: Using NLP and statistical learning models to model eyewitness statements and choosing behaviour 

    Nortje, Alicia, et al. (SADiLaR, 2019-05-07)
    This curated database includes data from various empirical studies where eyewitness statements and descriptions were collected. The original studies, ...
  • Autshumato English-Tshivenḓa Parallel Corpora 

    McKellar, Cindy (North-West University; Centre for Text Technology (CTexT), 2023-12-12)
    Aligned parallel corpora for the following language pair: English-Tshivenḓa. Data was crawled from various multilingual government websites, sourced ...
  • Autshumato Monolingual Tshivenḓa Corpus 

    McKellar, Cindy (North-West University; Centre for Text Technology (CTexT), 2023-12-12)
    Monolingual corpus for Tshivenḓa. The data is given as a single UTF-8 text file, with each segment on a newline.
  • Morphologically annotated corpus for isiNdebele 

    Gaustad, Tanja (Centre for Text Technology (CTexT), 2024-01-31)
    NCHLT corpus of morphologically annotated tokens in isiNdebele converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data ...
  • Morphologically annotated corpus for isiXhosa 

    Gaustad, Tanja (Centre for Text Technology (CTexT), 2024-01-31)
    NCHLT corpus of morphologically annotated tokens in isiXhosa converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is ...
  • Morphologically annotated corpus for isiZulu 

    Gaustad, Tanja (Centre for Text Technology (CTexT), 2024-01-31)
    NCHLT corpus of morphologically annotated tokens in isiZulu converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is ...
  • Morphologically annotated corpus for Siswati 

    Gaustad, Tanja (Centre for Text Technology (CTexT), 2024-01-31)
    NCHLT corpus of morphologically annotated tokens in Siswati converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is ...
  • Morphologically annotated corpus for Sesotho 

    Gaustad, Tanja (Centre for Text Technology (CTexT), 2024-01-31)
    NCHLT corpus of morphologically annotated tokens in Sesotho converted to the tags used during phases 1 and 2 of the SADiLaR-II project. The data is ...
