
NCHLT Xitsonga POS and Lemma annotated corpus

dc.contact.email: tanja.gaustad@nwu.ac.za
dc.contact.name: Tanja Gaustad
dc.contributor.author: Gaustad, Tanja
dc.contributor.other: McKellar, Cindy
dc.contributor.other: Gent, Sunny
dc.date.accessioned: 2026-03-26T12:58:53Z
dc.date.available: 2026-03-26T12:58:53Z
dc.date.issued: 2026-03-31
dc.description: NCHLT corpora for Xitsonga, with tokens lemmatised and POS tags converted to the tag set used during the SADiLaR-II project. The POS tag conversion results have been thoroughly quality controlled by linguistic experts. Part of the multilingual NCHLT data set, where each text type data file contains approximately 45,000 tokens for the conjunctive languages and 75,000 tokens for the disjunctive languages.
Xitsonga dataset (Lemma, POS annotated): TRAIN: 63,091 tokens; TEST: 6,494 tokens; Total: 69,585 tokens.
Please see the included protocols for more details on the POS tags used and on the lemmatisation process.
The data is given as txt files in which each line contains a token, the corresponding lemma, the morphological analysis and the POS tag, all tab separated. The morphological information originates from a previous project, "SADiLaR II (Extension): Linguistic corpus enrichment for South African languages" (see the handles below for the data with only morphological analysis included). The data has been split into Train and Test sets according to the original NCHLT data (see the handles below for the original NCHLT data).
NB: There can be tokenisation differences between this release of the NCHLT data, the original data and the morphologically annotated data due to corrections.
Morphological information included from: Morphologically annotated corpus for Xitsonga, https://hdl.handle.net/20.500.12185/672
Original NCHLT data: NCHLT Xitsonga Annotated Text Corpora, https://hdl.handle.net/20.500.12185/359
dc.format: text
dc.format.extent: 69585 tokens
dc.format.medium: N/A
dc.format.size: 2.5 MB
dc.identifier.uri: https://hdl.handle.net/20.500.12185/705
dc.languages: Xitsonga
dc.media.category: annotated multilingual corpus
dc.media.type: Text
dc.project: Update and extension of linguistic resources and core technologies for South African languages
dc.publisher: North-West University - Centre for Text Technology (CTexT)
dc.rights.license: Creative Commons Attribution 4.0 International
dc.subject: Xitsonga, POS annotated, Lemma annotated, NCHLT, annotated corpus
dc.title: NCHLT Xitsonga POS and Lemma annotated corpus
dc.version: 1.0
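
The description above says that each line of the data files holds a token, its lemma, the morphological analysis and the POS tag, separated by tabs. A minimal sketch of reading that layout in Python follows. The names `Row`, `parse_line` and `read_rows` are hypothetical, and the handling of blank or malformed lines is an assumption, since the record does not say whether the files contain sentence separators, headers or comment lines. The README and protocols in the bundle should be treated as the authority on the actual format.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Row(NamedTuple):
    """One annotated token (hypothetical container, field order per the description)."""
    token: str
    lemma: str
    morph: str  # morphological analysis
    pos: str    # POS tag


def parse_line(line: str) -> Optional[Row]:
    """Parse one tab-separated line.

    Returns None for blank lines and for lines that do not have exactly
    four fields. Skipping such lines is an assumption, not documented
    behaviour of the corpus files.
    """
    line = line.rstrip("\r\n")
    if not line.strip():
        return None
    fields = line.split("\t")
    if len(fields) != 4:
        return None
    return Row(*fields)


def read_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Yield a Row for every well-formed line in an iterable of lines."""
    for line in lines:
        row = parse_line(line)
        if row is not None:
            yield row


if __name__ == "__main__":
    # Placeholder strings only; these are NOT real corpus entries.
    sample = [
        "tokA\tlemA\tmorphA\tTAG1\n",
        "\n",
        "tokB\tlemB\tmorphB\tTAG2\n",
    ]
    for row in read_rows(sample):
        print(row.token, row.lemma, row.morph, row.pos)
```

To read a real file, an open file handle can be passed to `read_rows`, for example `open("SAD-IV.NCHLT_Lemma-Morph-POS.Final_TRAIN.2026-03-31.ts.txt", encoding="utf-8")`; the UTF-8 encoding is an assumption. Counting the yielded rows gives a rough check against the token totals stated in the description, although the numbers may differ if the files contain lines other than token lines.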

Files

Original bundle

Name: Protocol.SADiLaR.LemmatizationXitsonga.Final.2026-03-31.doc
Size: 113 KB
Format: Microsoft Word

Name: Protocol.SADiLaR.PartOfSpeechTaggingXitsonga.Final.2026-03-31.docx
Size: 57.3 KB
Format: Microsoft Word XML

Name: README.SAD-IV.NCHLT_LEMMA-POS_converted.Final.2026-03-31.txt
Size: 4.94 KB
Format: Plain Text

Name: SAD-IV.NCHLT_Lemma-Morph-POS.Final_TEST.2026-03-31.ts.txt
Size: 218.86 KB
Format: Plain Text

Name: SAD-IV.NCHLT_Lemma-Morph-POS.Final_TRAIN.2026-03-31.ts.txt
Size: 2.09 MB
Format: Plain Text

License bundle

Name: license.txt
Size: 3.22 KB
Format: Item-specific license agreed upon to submission