Project: SADiLaR IV (Extension): Update and extension of linguistic resources and core technologies for South African languages

Type: Converted NCHLT part-of-speech (POS) and lemma annotations for nine languages (Final versions). Split into Train and Test sets according to original NCHLT releases.

Languages: isiNdebele (nr_ZA), isiXhosa (xh_ZA), isiZulu (zu_ZA), Siswati (ss_ZA), Sesotho (st_ZA), Sesotho sa Leboa/Sepedi (nso_ZA), Setswana (tn_ZA), Tshivenḓa (ve_ZA), Xitsonga (ts_ZA)

Date: 2026-03-31

Version: 1.0 (Final)

Description:
NCHLT corpora with tokens lemmatised and converted to the POS tags used during the SADiLaR-II project, for nine languages: isiNdebele, isiXhosa, isiZulu, Siswati, Sesotho, Sepedi, Setswana, Tshivenḓa and Xitsonga. The POS tag conversion results have been thoroughly quality controlled by linguistic experts. Each text type data file contains approximately 45,000 tokens for the conjunctive languages and 75,000 tokens for the disjunctive languages. Please see the included protocols for more details on the POS tags used and on the lemmatisation process.

The data is given as txt files in which each line contains a token, the corresponding lemma, the morphological analysis and the POS tag, all tab separated. The morphological information originates from a previous project, "SADiLaR II (Extension): Linguistic corpus enrichment for South African languages" (see the handles below for the data with only morphological analysis included).

The data has been split into Train and Test sets according to the original NCHLT data (see the handles below for the original NCHLT data).

NB: There can be tokenisation differences between this release of the NCHLT data, the original data and the morphologically annotated data, due to corrections.
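
The tab-separated layout described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: it assumes four columns per line in the order token, lemma, morphological analysis, POS tag, and UTF-8 encoding. The names Row and read_rows, and the placeholder sample values, are invented for illustration and do not come from the release.

```python
from typing import Iterable, Iterator, NamedTuple


class Row(NamedTuple):
    """One annotated token (column order assumed from the description)."""
    token: str
    lemma: str
    morph: str
    pos: str


def read_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Yield one Row per annotated line, skipping blank or malformed lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines carry no annotation
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # assumption: anything other than 4 columns is skipped
        yield Row(*fields)


# Placeholder lines, not real corpus data.
sample = [
    "tok1\tlem1\tmorph1\tPOS1\n",
    "\n",
    "tok2\tlem2\tmorph2\tPOS2\n",
]
rows = list(read_rows(sample))
print(len(rows), rows[0].lemma, rows[1].pos)
```

To read a real file, pass an open file handle instead of the sample list, for example open(path, encoding="utf-8"), where path is whichever Train or Test file is being inspected.
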
Morphological information included from:
Morphologically annotated corpus for isiNdebele    https://hdl.handle.net/20.500.12185/680
Morphologically annotated corpus for isiXhosa      https://hdl.handle.net/20.500.12185/679
Morphologically annotated corpus for isiZulu       https://hdl.handle.net/20.500.12185/678
Morphologically annotated corpus for Sepedi        https://hdl.handle.net/20.500.12185/675
Morphologically annotated corpus for Sesotho       https://hdl.handle.net/20.500.12185/676
Morphologically annotated corpus for Setswana      https://hdl.handle.net/20.500.12185/674
Morphologically annotated corpus for Siswati       https://hdl.handle.net/20.500.12185/677
Morphologically annotated corpus for Tshivenḓa     https://hdl.handle.net/20.500.12185/673
Morphologically annotated corpus for Xitsonga      https://hdl.handle.net/20.500.12185/672

Original NCHLT data:
NCHLT isiNdebele Annotated Text Corpora    https://hdl.handle.net/20.500.12185/302
NCHLT isiXhosa Annotated Text Corpora      https://hdl.handle.net/20.500.12185/309
NCHLT isiZulu Annotated Text Corpora       https://hdl.handle.net/20.500.12185/315
NCHLT Sepedi Annotated Text Corpora        https://hdl.handle.net/20.500.12185/325
NCHLT Sesotho Annotated Text Corpora       https://hdl.handle.net/20.500.12185/332
NCHLT Setswana Annotated Text Corpora      https://hdl.handle.net/20.500.12185/337
NCHLT Siswati Annotated Text Corpora       https://hdl.handle.net/20.500.12185/344
NCHLT Tshivenda Annotated Text Corpora     https://hdl.handle.net/20.500.12185/353
NCHLT Xitsonga Annotated Text Corpora      https://hdl.handle.net/20.500.12185/359

Contents

Language                                  | TRAIN  | TEST   | Total  |
                                          | Tokens | Tokens |        |
----------------------------------------------------------------------
isiNdebele dataset (Lemma, POS annotated) | 38,427 |  3,911 | 42,338 |
isiXhosa dataset (Lemma, POS annotated)   | 42,049 |  4,408 | 46,457 |
isiZulu dataset (Lemma, POS annotated)    | 41,580 |  4,341 | 45,921 |
Siswati dataset (Lemma, POS annotated)    | 39,486 |  4,077 | 43,563 |
Sesotho dataset (Lemma, POS annotated)    | 66,881 |  6,849 | 73,730 |
Sepedi dataset (Lemma, POS annotated)     | 65,920 |  7,157 | 73,077 |
Setswana dataset (Lemma, POS annotated)   | 65,802 |  6,808 | 72,610 |
Tshivenda dataset (Lemma, POS annotated)  | 59,814 |  6,676 | 66,490 |
Xitsonga dataset (Lemma, POS annotated)   | 63,091 |  6,494 | 69,585 |
----------------------------------------------------------------------

Source(s):
NCHLT - Documents from various South African domains (mainly government, municipalities, and publications).

SADiLaR website: https://sadilar.org
_________________________________________________________________________________
Licence: This initial version is not intended for distribution
_________________________________________________________________________________
Licence for final (v1.0) distribution: Creative Commons Attribution 4.0 International
URL: http://creativecommons.org/licenses/by/4.0/

Attribute work to:
CTexT® (Centre for Text Technology, North-West University), South Africa;
SADiLaR (South African Centre for Digital Language Resources), South Africa.

Attribute work to URL:
http://humanities.nwu.ac.za/ctext
https://sadilar.org