NCHLT isiXhosa RoBERTa language model

Title	NCHLT isiXhosa RoBERTa language model
Description	Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and is not fine-tuned for any downstream task. It can be used either as a masked LM or as an embedding model that provides real-valued vector representations of words or string sequences in isiXhosa text.
Contact name	Roald Eiselen
Contact email	Roald.Eiselen@nwu.ac.za
Publisher(s)	North-West University; Centre for Text Technology (CTexT)
License	Creative Commons Attribution 4.0 International (CC-BY 4.0)
Language(s)	isiXhosa
Author(s)	Roald Eiselen
Contributor	Rico Koen; Albertus Kruger; Jacques van Heerden
URI	https://hdl.handle.net/20.500.12185/644
Media type	Text
Type	Modules
Media category	Language model
Format extent	Training data: Paragraphs: 718,751; Token count: 13,190,962; Vocab size: 30,000; Embedding dimensions: 768;
Format size	235.81MB (Zipped)
Project	NCHLT Text IV
Software requirements	Python
Source	Web; Government Documents
ISO639 code	xh
Submit date	2023-07-28T08:11:54Z; 2023-05-01
Date available	2023-07-28T08:11:54Z; 2023-05-01
Date created	2023-05-01
Verification status	Level 0
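
Usage sketch (not part of the original record): the description above names two uses, masked-token prediction and embedding extraction. The Python sketch below illustrates both under stated assumptions. It assumes the zipped download unpacks into a directory that the Hugging Face transformers library can load, which this record does not confirm. MODEL_DIR is a hypothetical path, and mean pooling is only one common way to turn per-token vectors into a single sequence vector, not necessarily the authors' method.

```python
from typing import List, Sequence

# Hypothetical location of the unzipped model files (an assumption, not from the record).
MODEL_DIR = "path/to/unzipped/model"


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed(text: str, model_dir: str = MODEL_DIR) -> List[float]:
    """Return one vector for `text` (768 values if the hidden size is 768)."""
    # Imported lazily so mean_pool stays usable without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Assumes model_dir holds a transformers-compatible tokenizer and weights.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (tokens, hidden size)
    return mean_pool(hidden.tolist(), inputs["attention_mask"][0].tolist())


def fill_mask(text_with_mask: str, model_dir: str = MODEL_DIR, top_k: int = 5):
    """Return the top_k candidate fillers for the mask token in the text."""
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_dir)
    return unmasker(text_with_mask, top_k=top_k)
```

With the model files in place, embed("...") would return a single vector for an isiXhosa sentence, and fill_mask would rank candidates for a sentence containing the tokenizer's mask token (commonly <mask> for RoBERTa tokenizers, though that should be checked against the shipped tokenizer).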