NCHLT Sesotho RoBERTa language model

Title	NCHLT Sesotho RoBERTa language model
Description	Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained with a masked language modelling objective and is not fine-tuned for any downstream task. It can be used either as a masked LM or as an embedding model that provides real-valued vector representations of words or string sequences in Sesotho text.
Contact name	Roald Eiselen
Contact email	Roald.Eiselen@nwu.ac.za
Publisher(s)	North-West University; Centre for Text Technology (CTexT)
License	Creative Commons Attribution 4.0 International (CC-BY 4.0)
Language(s)	Sesotho
Author(s)	Roald Eiselen
Contributor	Rico Koen; Albertus Kruger; Jacques van Heerden
URI	https://hdl.handle.net/20.500.12185/640
Media type	Text
Type	Modules
Media category	Language model
Format extent	Training data: Paragraphs: 535,853; Token count: 17,425,650; Vocab size: 30,000; Embedding dimensions: 768;
Format size	235.78MB (Zipped)
Project	NCHLT Text IV
Software requirements	Python
Source	Web; Government Documents
ISO639 code	st
Submit date	2023-07-28T08:11:43Z; 2023-05-01
Date available	2023-07-28T08:11:43Z; 2023-05-01
Date created	2023-05-01
Verification status	Level 0
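
The Description field notes that the model can serve as an embedding model for string sequences. The record does not say how per-token vectors should be combined into one sequence vector, so the following is only a minimal sketch of one common option, mask-aware mean pooling. The function name `mean_pool` and the toy 4-dimensional vectors are hypothetical stand-ins; real outputs from this model would have 768 dimensions per token, as listed under Format extent.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the vectors of real tokens, skipping padded positions (mask == 0).

    Assumption: mean pooling is one common choice, not a method stated in the record.
    """
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    # Keep only the vectors whose mask entry is non-zero.
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # Average each dimension over the kept tokens.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy data: three token positions, the last one is padding.
toy_vectors = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [9.0, 9.0, 9.0, 9.0],  # padded position, ignored by the mask
]
toy_mask = [1, 1, 0]

sequence_vector = mean_pool(toy_vectors, toy_mask)
print(sequence_vector)  # [2.0, 3.0, 4.0, 5.0]
```

With a real model, `token_vectors` would be the final hidden states for one input and `attention_mask` the tokenizer's padding mask; other pooling choices, such as taking the first token's vector, are equally plausible.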

Resource Catalogue [350]
A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.