NCHLT Tshivenḓa RoBERTa language model
Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and has not been fine-tuned for any downstream task. It can be used either as a masked LM or as an embedding model that provides real-valued vector representations of words or string sequences for Tshivenḓa text.
Roald Eiselen
Roald.Eiselen@nwu.ac.za
North-West University; Centre for Text Technology (CTexT)
Creative Commons Attribution 4.0 International (CC-BY 4.0)
Tshivenḓa
Roald Eiselen
Rico Koen; Albertus Kruger; Jacques van Heerden
https://hdl.handle.net/20.500.12185/643
Text
Modules
Language model
Training data: Paragraphs: 304,248; Token count: 7,363,713. Vocab size: 30,000. Embedding dimensions: 768.
236.05MB (Zipped)
NCHLT Text IV
Python
Web; Government Documents
ve
2023-07-28T08:11:51Z; 2023-05-01
2023-07-28T08:11:51Z; 2023-05-01
2023-05-01
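
The description above names two uses: masked-token prediction and embedding extraction. The following is a minimal sketch of both, not the distributors' documented usage. It assumes that the unzipped archive is a checkpoint directory loadable by the Hugging Face `transformers` library (with `torch` installed), which this record does not confirm. The `model_dir` argument, the `{mask}` placeholder convention, and the choice of mean pooling to turn token vectors into one sequence vector are all illustrative assumptions.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the vectors of non-padding tokens into one sequence vector."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_text(model_dir: str, text: str) -> List[float]:
    """Return one vector for `text`; model_dir is a hypothetical local path."""
    # Imported lazily so mean_pool stays usable without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Assumption: the directory holds a transformers-compatible checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Shape (tokens, hidden_size); the record lists 768 dimensions.
        hidden = model(**inputs).last_hidden_state[0]
    return mean_pool(hidden.tolist(), inputs["attention_mask"][0].tolist())


def fill_mask(model_dir: str, template: str, top_k: int = 5):
    """Predict the token at `{mask}` in template (hypothetical convention)."""
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_dir)
    # Substitute the tokenizer's own mask token rather than hard-coding it.
    text = template.format(mask=unmasker.tokenizer.mask_token)
    return unmasker(text, top_k=top_k)
```

Under these assumptions, `embed_text("path/to/unzipped/model", some_text)` would return a single list of floats per input string, and `fill_mask` would return the pipeline's ranked candidate tokens with their scores. Mean pooling is only one common way to derive a sequence vector; taking the vector of the first token is another.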


This item appears in the following Collection(s)

  • Resource Catalogue [349]
    A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.