A collection of language resource metadata, gathered mostly during the NHN-funded technology audit of 2009 and the SADiLaR technology audit of 2018. Not all resources in this collection are available for download.

Recent Submissions

  • Speech transcription platform user interface 

    Kleynhans, Neil, et al. (Multilingual Speech Technologies, North-West University, 2017) ~ Resource Index
    This is the user interface component of the Speech Transcription Platform developed by the Multilingual Speech Technologies group at North-West University ...
  • Speech transcription platform speech services 

    Van Niekerk, Daniel, et al. (Multilingual Speech Technologies, North-West University, 2017) ~ Resource Index
    This is the Language Technology Services component implemented for the Speech Transcription Platform project by the Multilingual Speech Technologies ...
  • High quality TTS data for four South African languages (af, st, tn, xh) 

    Unknown author (Google; North-West University, 2017) ~ Resource Catalogue
    This data set contains high-quality, multi-speaker, transcribed TTS audio data for four languages of South Africa: Afrikaans, Sesotho, Setswana and isiXhosa. ...
  • Speech transcription server 

    Kleynhans, Neil, et al. (Multilingual Speech Technologies, North-West University, 2017) ~ Resource Index
    This is the "Parliament-specific" application server component implemented as a proof-of-concept during the Speech Transcription Platform project by the ...
  • Bilingual English-isiXhosa corpus 

    McKellar, Cindy (North-West University - Centre for Text Technology (CTexT), 2019-11-30) ~ Resource Catalogue
    Aligned parallel corpora for the following language pair: English-isiXhosa. The data is given as two separate UTF-8 text files, with each segment on a ...
  • Monolingual isiXhosa corpus 

    McKellar, Cindy (North-West University - Centre for Text Technology (CTexT), 2019-11-30) ~ Resource Catalogue
    Monolingual corpus for isiXhosa. The data is given as a single UTF-8 text file, with each segment on a newline. The dataset contains existing data ...
  • NCHLT English Auxiliary Speech Corpus 

    Febe de Wet, et al. (CSIR Meraka Institute; North-West University, 2019-06-01) ~ Resource Catalogue
    The corpus contains orthographically transcribed broadband speech in each of South Africa's eleven official languages. Transcriptions are provided in ...
  • NCHLT Afrikaans Auxiliary Speech Corpus 

    Febe de Wet, et al. (CSIR Meraka Institute; North-West University, 2019-06-01) ~ Resource Catalogue
    The corpus contains orthographically transcribed broadband speech in each of South Africa's eleven official languages. Transcriptions are provided in ...
  • NCHLT Xitsonga Auxiliary Speech Corpus 

    Febe de Wet, et al. (CSIR Meraka Institute; North-West University, 2019-06-01) ~ Resource Catalogue
    The corpus contains orthographically transcribed broadband speech in each of South Africa's eleven official languages. Transcriptions are provided in ...
  • NCHLT Setswana Auxiliary Speech Corpus 

    Febe de Wet, et al. (CSIR Meraka Institute; North-West University, 2019-06-01) ~ Resource Catalogue
    The corpus contains orthographically transcribed broadband speech in each of South Africa's eleven official languages. Transcriptions are provided in ...
