NCHLT Xitsonga RoBERTa language model

dc.contact.email: Roald.Eiselen@nwu.ac.za
dc.contact.name: Roald Eiselen
dc.contributor.author: Roald Eiselen
dc.contributor.other: Rico Koen
dc.contributor.other: Albertus Kruger
dc.contributor.other: Jacques van Heerden
dc.date.accessioned: 2023-07-28T08:11:48Z
dc.date.accessioned: 2023-05-01
dc.date.available: 2023-07-28T08:11:48Z
dc.date.available: 2023-05-01
dc.date.issued: 2023-05-01
dc.description: Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and is not fine-tuned for any downstream task. It can be used either as a masked LM or as an embedding model that provides real-valued vector representations of words or string sequences in Xitsonga text (see the usage sketch after the metadata fields below).
dc.format.extent: Training data: paragraphs: 360,698; token count: 7,357,764; vocab size: 30,000; embedding dimensions: 768
dc.format.size: 235.98 MB (zipped)
dc.identifier.uri: https://hdl.handle.net/20.500.12185/642
dc.language.iso: ts
dc.languages: Xitsonga
dc.media.category: Language model
dc.media.type: Text
dc.project: NCHLT Text IV
dc.publisher: North-West University; Centre for Text Technology (CTexT)
dc.rights.license: Creative Commons Attribution 4.0 International (CC-BY 4.0)
dc.software.requirements: Python
dc.source: Web
dc.source: Government Documents
dc.title: NCHLT Xitsonga RoBERTa language model
dc.type: Modules
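
Usage sketch (illustrative only): the sketch below shows the two uses named in the description, assuming the archive ts.RoBERTa.tar.gz extracts to a Hugging Face transformers-compatible checkpoint directory. The directory name "ts.RoBERTa", the Xitsonga example strings, and the mean-pooling step are assumptions made for illustration; they are not part of this record.

import torch
from transformers import AutoTokenizer, AutoModel, pipeline

MODEL_DIR = "ts.RoBERTa"  # assumed path to the extracted model directory

# 1. Masked LM: predict a masked token ("<mask>" is RoBERTa's default mask token).
fill = pipeline("fill-mask", model=MODEL_DIR)
for pred in fill("Avuxeni, <mask>!"):  # placeholder Xitsonga input
    print(pred["token_str"], pred["score"])

# 2. Embeddings: mean-pool the final hidden states into one 768-dimensional
#    vector per sequence (768 matches dc.format.extent above).
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModel.from_pretrained(MODEL_DIR)
inputs = tokenizer(["Avuxeni!"], return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state    # (batch, seq_len, 768)
mask = inputs["attention_mask"].unsqueeze(-1)     # mask out padding tokens
embedding = (hidden * mask).sum(1) / mask.sum(1)  # (batch, 768)
print(embedding.shape)

Mean-pooling over the attention mask is one common way to turn per-token states into a single sequence embedding; taking the vector at the initial <s> position is an alternative.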

Files

Original bundle

Name: ts.RoBERTa.tar.gz
Size: 235.98 MB
Format: Unknown data format