NCHLT Xitsonga RoBERTa language model

dc.contact.email: Roald.Eiselen@nwu.ac.za
dc.contact.name: Roald Eiselen
dc.contributor.author: Roald Eiselen
dc.contributor.other: Rico Koen
dc.contributor.other: Albertus Kruger
dc.contributor.other: Jacques van Heerden
dc.date.accessioned: 2023-07-28T08:11:48Z
dc.date.accessioned: 2023-05-01
dc.date.available: 2023-07-28T08:11:48Z
dc.date.available: 2023-05-01
dc.date.issued: 2023-05-01
dc.description: Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and is not fine-tuned for any downstream task. It can be used either as a masked LM or as an embedding model that provides real-valued vector representations of words or string sequences in Xitsonga text (see the usage sketch after the metadata fields below).
dc.format.extent: Training data: paragraphs: 360,698; token count: 7,357,764; vocab size: 30,000; embedding dimensions: 768
dc.format.size: 235.98 MB (zipped)
dc.identifier.uri: https://hdl.handle.net/20.500.12185/642
dc.language.iso: ts
dc.languages: Xitsonga
dc.media.category: Language model
dc.media.type: Text
dc.project: NCHLT Text IV
dc.publisher: North-West University; Centre for Text Technology (CTexT)
dc.rights.license: Creative Commons Attribution 4.0 International (CC-BY 4.0)
dc.software.requirements: Python
dc.source: Web
dc.source: Government Documents
dc.title: NCHLT Xitsonga RoBERTa language model
dc.type: Modules
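
Usage sketch (illustrative only): the sketch below shows the two uses named in the description, assuming the archive ts.RoBERTa.tar.gz extracts to a Hugging Face transformers-compatible checkpoint directory. The directory name "ts.RoBERTa", the Xitsonga example strings, and the mean-pooling step are assumptions made for illustration; they are not part of this record.

import torch
from transformers import AutoTokenizer, AutoModel, pipeline

MODEL_DIR = "ts.RoBERTa"  # assumed path to the extracted model directory

# 1. Masked LM: predict a masked token ("<mask>" is RoBERTa's default mask token).
fill = pipeline("fill-mask", model=MODEL_DIR)
for pred in fill("Avuxeni, <mask>!"):  # placeholder Xitsonga input
    print(pred["token_str"], pred["score"])

# 2. Embeddings: mean-pool the final hidden states into one 768-dimensional
#    vector per sequence (768 matches dc.format.extent above).
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModel.from_pretrained(MODEL_DIR)
inputs = tokenizer(["Avuxeni!"], return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state    # (batch, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1)     # mask out padding tokens
embedding = (hidden * mask).sum(1) / mask.sum(1)  # (batch, 768)
print(embedding.shape)

Mean-pooling over the attention mask is one common way to turn per-token states into a single sequence embedding; taking the vector at the initial <s> position is an alternative.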

Files

Original bundle

Name: ts.RoBERTa.tar.gz
Size: 235.98 MB
Format: Unknown data format