NCHLT Tshivenda POS and Lemma annotated corpus

Gaustad, Tanja

Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/706'

dc.contact.email	tanja.gaustad@nwu.ac.za
dc.contact.name	Tanja Gaustad
dc.contributor.author	Gaustad, Tanja
dc.contributor.other	McKellar, Cindy
dc.contributor.other	Gent, Sunny
dc.date.accessioned	2026-03-26T12:59:11Z
dc.date.available	2026-03-26T12:59:11Z
dc.date.issued	2026-03-31
dc.description	NCHLT corpora for Tshivenda with tokens lemmatised and POS tags converted to the tags used during the SADiLaR-II project. The POS tag conversion results have been thoroughly quality controlled by linguistic experts. Part of the multilingual NCHLT data set, where each text type data file contains approximately 45,000 tokens for the conjunctive languages and 75,000 tokens for the disjunctive languages. Tshivenda dataset (lemma and POS annotated): TRAIN: 59,814 tokens; TEST: 6,676 tokens; Total: 66,490 tokens. Please see the included protocols for more details on the POS tags used and on the lemmatisation process. The data is provided as txt files in which each line contains a token, the corresponding lemma, the morphological analysis and the POS tag, all tab-separated. The morphological information originates from a previous project, "SADiLaR II (Extension): Linguistic corpus enrichment for South African languages" (see the handles below for the data with only the morphological analysis included). The data has been split into Train and Test sets according to the original NCHLT data (see the handles below for the original NCHLT data). NB: Because of corrections, there can be tokenisation differences between this release of the NCHLT data, the original data and the morphologically annotated data. Morphological information included from: Morphologically annotated corpus for Tshivenḓa (https://hdl.handle.net/20.500.12185/673). Original NCHLT data: NCHLT Tshivenda Annotated Text Corpora (https://hdl.handle.net/20.500.12185/353).
dc.format	text
dc.format.extent	66490 tokens
dc.format.medium	N/A
dc.format.size	2.5 MB
dc.identifier.uri	https://hdl.handle.net/20.500.12185/706
dc.languages	Tshivenda
dc.media.category	annotated multilingual corpus
dc.media.type	Text
dc.project	Update and extension of linguistic resources and core technologies for South African languages
dc.publisher	North-West University - Centre for Text Technology (CTexT)
dc.rights.license	Creative Commons Attribution 4.0 International
dc.subject	Tshivenda, POS annotated, Lemma annotated, NCHLT, annotated corpus
dc.title	NCHLT Tshivenda POS and Lemma annotated corpus
dc.version	1.0
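
The description above states that each line of the txt files holds a token, its lemma, the morphological analysis and the POS tag, separated by tabs. A minimal Python sketch for reading such lines follows. It assumes that column order, UTF-8 encoding and that blank lines (if any) can be skipped; none of this is confirmed beyond the description, so check the included protocols first. The names AnnotatedToken, parse_lines and read_corpus are hypothetical, and the sample values are placeholders, not real corpus data.

```python
from typing import Iterable, Iterator, List, NamedTuple


class AnnotatedToken(NamedTuple):
    """One corpus line: token, lemma, morphological analysis, POS tag."""
    token: str
    lemma: str
    morph: str
    pos: str


def parse_lines(lines: Iterable[str]) -> Iterator[AnnotatedToken]:
    """Yield one AnnotatedToken per non-blank, tab-separated line.

    Assumes exactly four tab-separated fields per line, in the order
    given in the corpus description. Raises ValueError otherwise.
    """
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Assumption: blank lines carry no annotation and can be skipped.
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"line {number}: expected 4 tab-separated fields, "
                f"found {len(fields)}"
            )
        yield AnnotatedToken(*fields)


def read_corpus(path: str) -> List[AnnotatedToken]:
    """Read a whole file; the path is supplied by the caller."""
    with open(path, encoding="utf-8") as handle:
        return list(parse_lines(handle))


# Placeholder lines for illustration only (not taken from the corpus).
sample = [
    "TOKEN_A\tLEMMA_A\tMORPH_A\tTAG_A\n",
    "\n",
    "TOKEN_B\tLEMMA_B\tMORPH_B\tTAG_B\n",
]
rows = list(parse_lines(sample))
```

Raising an error on a line with the wrong number of fields, rather than skipping it silently, makes tokenisation or formatting differences between releases visible early.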