NCHLT Siswati POS and Lemma annotated corpus

Gaustad, Tanja

Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/710'

Files

Protocol.SADiLaR.LemmatizationSiswati.Final.2026-03-31.doc (118.5 KB)

Protocol.SADiLaR.PartOfSpeechTaggingSiswati.Final.2026-03-31.docx (54.2 KB)

README.SAD-IV.NCHLT_LEMMA-POS_converted.Final.2026-03-31.txt (4.94 KB)

SAD-IV.NCHLT_Lemma-Morph-POS.Final_TEST.2026-03-31.ss.txt (214.9 KB)

SAD-IV.NCHLT_Lemma-Morph-POS.Final_TRAIN.2026-03-31.ss.txt (2.01 MB)

Date

2026-03-31

Authors

Gaustad, Tanja

Publisher

North-West University - Centre for Text Technology (CTexT)

Description

NCHLT corpora with tokens lemmatised and converted to POS tags used during the SADiLaR-II project for Siswati. The POS tag conversion results have been thoroughly quality controlled by linguistic experts. Part of the multilingual NCHLT data set where each text type data file contains approximately 45,000 tokens for the conjunctive languages and 75,000 tokens for the disjunctive languages.

Siswati dataset (Lemma, POS annotated)
TRAIN: 39,486 tokens
TEST: 4,077 tokens
Total: 43,563 tokens

Please see the included protocols for more details on the POS tags used and on the lemmatisation process.

The data is given as txt files where each line contains a token, the corresponding lemma, morphological analysis and POS tag, all tab separated. The morphological information originates from a previous project "SADiLaR II (Extension): Linguistic corpus enrichment for South African languages" (see handles below for the data with only morphological analysis included). The data has been split into Train and Test sets according to the original NCHLT data (see handles below for the original NCHLT data).

NB: There can be tokenisation differences between this release of the NCHLT data, the original data and the morphologically annotated data due to corrections.

Morphological information included from:
Morphologically annotated corpus for Siswati
https://hdl.handle.net/20.500.12185/677

Original NCHLT data:
NCHLT Siswati Annotated Text Corpora
https://hdl.handle.net/20.500.12185/344
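
The tab-separated layout described above (token, lemma, morphological analysis, POS tag on each line) could be read with a short script along the following lines. This is a minimal sketch, not part of the release: the names `Row`, `parse_rows` and `read_corpus` are hypothetical, the files are assumed to be UTF-8 encoded, and the handling of blank or malformed lines (skipping them) is an assumption that should be checked against the included README. The sample lines are placeholders, not real corpus entries.

```python
from typing import Iterable, Iterator, NamedTuple


class Row(NamedTuple):
    """One corpus line: column order follows the record's description."""
    token: str
    lemma: str
    morph: str
    pos: str


def parse_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Yield one Row per well-formed line of tab-separated text."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # Assumption: blank lines carry no annotation and can be skipped.
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            # Assumption: lines without exactly four columns are skipped
            # rather than repaired; inspect them manually if this matters.
            continue
        yield Row(*fields)


def read_corpus(path: str) -> list[Row]:
    """Read a whole file into memory; UTF-8 encoding is an assumption."""
    with open(path, encoding="utf-8") as handle:
        return list(parse_rows(handle))


# Placeholder input only; these strings are not taken from the corpus.
sample = [
    "tok1\tlem1\tmorph1\tTAG1\n",
    "\n",
    "tok2\tlem2\tmorph2\tTAG2\n",
]
rows = list(parse_rows(sample))
```

Calling `read_corpus` on the TRAIN and TEST files and comparing `len(...)` of each result with the token counts quoted in the description would be one way to sanity-check a download, bearing in mind that skipped lines would lower the count.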

Keywords

Siswati, POS annotated, Lemma annotated, NCHLT, annotated corpus

License

Creative Commons Attribution 4.0 International

URI

https://hdl.handle.net/20.500.12185/710

Collections

Resource Catalogue

Verification status

Level 0
