Title: NCHLT isiNdebele RoBERTa language model
Authors: Roald Eiselen; Rico Koen; Albertus Kruger; Jacques van Heerden
License: Creative Commons Attribution 4.0 International (CC-BY 4.0)
Dates: 2023-05-01; 2023-07-28
Handle: https://hdl.handle.net/20.500.12185/637
Language: nr (isiNdebele)
Type: Modules
Size: 236.06 MB (Zipped)

Description: Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and is not fine-tuned for any downstream task. It can be used either as a masked LM or as an embedding model that provides real-valued vector representations of words or string sequences for isiNdebele text.

Training data: Paragraphs: 247,926; Token count: 3,633,845; Vocab size: 30,000; Embedding dimensions: 768
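The sketch below illustrates the two usage modes mentioned in the description (masked-token prediction and sentence/word embeddings), assuming the downloaded archive has been unzipped into a local directory in the standard Hugging Face transformers layout. The directory name "nchlt-isindebele-roberta" and the placeholder input sentence are hypothetical; substitute the actual unzipped path and real isiNdebele text.

import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM

# Hypothetical path to the unzipped model directory (not a published Hub ID).
model_dir = "nchlt-isindebele-roberta"

tokenizer = AutoTokenizer.from_pretrained(model_dir)

# 1. Masked language modelling: predict the most likely token for <mask>.
mlm = AutoModelForMaskedLM.from_pretrained(model_dir)
text = "Placeholder isiNdebele sentence with a <mask> token."  # replace with real isiNdebele text
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))

# 2. Embeddings: mean-pool the final hidden states (768 dimensions)
# to obtain a fixed-size vector for a word or string sequence.
encoder = AutoModel.from_pretrained(model_dir)
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
sentence_vector = hidden.mean(dim=1).squeeze(0)   # shape: (768,)
print(sentence_vector.shape)

Mean-pooling is only one common choice for deriving a sequence vector; using the first-token (<s>) hidden state is an equally valid alternative, and nothing in the record prescribes either.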