Search
Now showing items 11-20 of 22
NCHLT Sepedi RoBERTa language model
(North-West University; Centre for Text Technology (CTexT), 2023-05-01)
Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and not fine-tuned ...
NCHLT Sepedi GloVe embeddings
(North-West University; Centre for Text Technology (CTexT), 2023-05-01)
Static word embedding model based on the Global Vectors architecture (Pennington et al., 2014). The embeddings provide real-valued vector representations ...
NCHLT Sepedi word2vec-Skipgram embeddings
(North-West University; Centre for Text Technology (CTexT), 2023-05-01)
Static word embeddings for the Skipgram flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embedding provides real-valued vector ...
Autshumato Monolingual Sepedi Corpus
(CTexT® (Centre for Text Technology, North-West University), 2022-09-30)
Monolingual corpus for Sepedi. The data is given as a single UTF-8 text file, with each segment on a newline. The data was specifically selected and ...
NCHLT Sepedi word2vec-CBOW embeddings
(North-West University; Centre for Text Technology (CTexT), 2023-05-01)
Static word embeddings for the continuous bag of words (CBoW) flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embedding provides ...
Final year high school examination texts of South African home and first additional language subjects
(South African Centre for Digital Language Resources, 2022-11-16)
This data collection consists of reading comprehension and summary
writing texts. The texts comprise of the final year high school exam
texts for ...
USAf National Language Resources Audit 2023
(South African Centre for Digital Language Resources, 2023-10)
This report documents the findings of a comprehensive language resources audit conducted by the South African Centre for Digital Language Resources ...
Generic Multilingual Academic Wordlists with Definitions
(SADiLaR; ICELDA, 2022)
This multilingual generic academic wordlist has been developed to serve as a resource to students to assist with building a vocabulary and decoding ...
POS annotated corpus in 5 different genres for Sepedi
(Centre for Text Technology (CTexT), 2024-01-31)
This corpus contains POS annotated data in 5 different genres for Sepedi.
The text types included are:
- CAPS gr12 (Academic) - https://www.educ ...
Multilingual Linguistic Terminology
(UNISA, 2022-09-20)
Multilingual Linguistic Terminology Project
Termbanks of Linguistic terminology for South African languages
Version 1.0
https://linguistictermino ...