Project: SADiLaR II (Extension): Linguistic corpus enrichment for South African languages
Type: Converted NCHLT morphology annotations for nine languages (Final versions).
Languages: isiNdebele (nr), isiXhosa (xh), isiZulu (zu), Siswati (ss), Sesotho (st), Sepedi (nso), Setswana (tn), Tshivenḓa (ve), Xitsonga (ts)
Date: 2024-01-31
Version: 1.0 (Final)

Description:
NCHLT corpus of morphologically annotated tokens in isiNdebele, isiXhosa, isiZulu, Siswati, Sesotho, Sepedi, Setswana, Tshivenḓa and Xitsonga converted to the tags used during phases 1 and 2 of the SADiLaR-II project.

The data is given as txt files. Each line consists of a token and the corresponding morphological analysis, tab separated. Each data file contains a total of approximately 45,000 tokens for conjunctive languages and 70,000 tokens for disjunctive languages annotated for morphology.

All the data has been automatically converted, checked and re-annotated where necessary by linguistic experts as well as quality controlled. Please see the protocols for more details on the morphological tags used.

Contents

Language                                        | Tokens |
----------------------------------------------------------
isiNdebele dataset (morphologically annotated)  | 42,335 |
isiXhosa dataset (morphologically annotated)    | 46,465 |
isiZulu dataset (morphologically annotated)     | 45,933 |
Siswati dataset (morphologically annotated)     | 43,568 |
Sesotho dataset (morphologically annotated)     | 73,727 |
Sepedi dataset (morphologically annotated)      | 73,031 |
Setswana dataset (morphologically annotated)    | 72,609 |
Tshivenḓa dataset (morphologically annotated)   | 66,487 |
Xitsonga dataset (morphologically annotated)    | 69,584 |
----------------------------------------------------------

Source(s): Documents from various South African domains (mainly government, municipalities, and publications).
SADiLaR website: https://sadilar.org
_________________________________________________________________________________

Licence for final (v1.0) distribution:
Creative Commons Attribution 4.0 International
URL: http://creativecommons.org/licenses/by/4.0/

Attribute work to:
CTexT® (Centre for Text Technology, North-West University), South Africa;
SADiLaR (South African Centre for Digital Language Resources), South Africa.

Attribute work to URL:
http://humanities.nwu.ac.za/ctext
https://sadilar.org
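
Example (illustrative only, not part of the distribution):
The file format described above (one token and its morphological analysis per line, tab separated) could be read with a short Python sketch such as the one below. The function names, the UTF-8 encoding, the file name in the usage comment, and the handling of blank or tab-less lines are all assumptions made for illustration; check them against the actual files and the protocols before relying on them.

```python
from typing import Iterator, Tuple


def read_annotations(path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, str]]:
    """Yield (token, analysis) pairs from a tab-separated annotation file.

    Assumptions (not confirmed by the README): the file is UTF-8 encoded,
    and lines that are blank or contain no tab carry no annotation.
    """
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # assumption: blank lines are separators, skip them
            # Split on the first tab only; the analysis is kept as an opaque string.
            token, sep, analysis = line.partition("\t")
            if not sep:
                continue  # assumption: a line without a tab is not an annotation
            yield token, analysis


def count_tokens(path: str, encoding: str = "utf-8") -> int:
    """Count the annotated lines in a file, as read by read_annotations."""
    return sum(1 for _ in read_annotations(path, encoding))


# Usage (hypothetical file name):
# for token, analysis in read_annotations("isiZulu_morphology.txt"):
#     print(token, analysis)
```

The result of count_tokens can be compared with the Tokens column in the Contents table as a rough sanity check; small differences may occur depending on how blank or separator lines are treated in the files.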