Project: SADiLaR II (Extension): Linguistic corpus enrichment for South African languages

Type: Converted NCHLT morphology annotations for nine languages (Final versions).
Languages: 	isiNdebele (nr), isiXhosa (xh), isiZulu (zu), Siswati (ss), Sesotho (st), Sepedi (nso), Setswana (tn), Tshivenḓa (ve), Xitsonga (ts)
Date: 2024-01-31
Version: 1.0 (Final)

Description: 
The NCHLT corpus of morphologically annotated tokens in isiNdebele, isiXhosa, isiZulu, Siswati, Sesotho, Sepedi, Setswana, Tshivenḓa and Xitsonga, converted to the tags used during phases 1 and 2 of the SADiLaR II project.

The data is provided as plain-text (.txt) files. Each line consists of a token and its corresponding morphological analysis, separated by a tab.
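
The line format described above can be read with a few lines of Python. This is a minimal sketch, not part of the distribution: the function name read_annotations is hypothetical, UTF-8 encoding is an assumption, and blank lines (if any occur) are simply skipped.

```python
from typing import Iterable, Iterator, Tuple


def read_annotations(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, analysis) pairs from tab-separated lines.

    Assumption: each non-blank line is "<token>\t<analysis>".
    """
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines carry no annotation
        token, sep, analysis = line.partition("\t")
        if not sep:
            raise ValueError(f"line {number}: no tab separator found")
        yield token, analysis
```

A file could then be read with something like the following, where the path is a placeholder and not an actual file name from the distribution:

    with open("path/to/datafile.txt", encoding="utf-8") as handle:
        pairs = list(read_annotations(handle))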

Each data file contains approximately 45,000 morphologically annotated tokens for the conjunctive languages and approximately 70,000 for the disjunctive languages. All data was converted automatically, then checked and, where necessary, re-annotated by linguistic experts, and finally quality controlled. Please see the protocols for more details on the morphological tags used.

Contents

Language                                       | Tokens |
---------------------------------------------------------
isiNdebele dataset (morphologically annotated) | 42,335 |
isiXhosa dataset (morphologically annotated)   | 46,465 |
isiZulu dataset (morphologically annotated)    | 45,933 |
Siswati dataset (morphologically annotated)    | 43,568 |
Sesotho dataset (morphologically annotated)    | 73,727 |
Sepedi dataset (morphologically annotated)     | 73,031 |
Setswana dataset (morphologically annotated)   | 72,609 |
Tshivenḓa dataset (morphologically annotated)  | 66,487 |
Xitsonga dataset (morphologically annotated)   | 69,584 |
---------------------------------------------------------
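
The token counts in the table can be checked against the approximate per-file sizes given in the description. The sketch below is illustrative only: the grouping of language codes into conjunctive and disjunctive sets is an assumption based on the counts and on the usual orthographic classification, and the names TOKENS, CONJUNCTIVE, DISJUNCTIVE and mean_tokens are hypothetical.

```python
# Token counts copied from the Contents table, keyed by language code.
TOKENS = {
    "nr": 42335, "xh": 46465, "zu": 45933, "ss": 43568,
    "st": 73727, "nso": 73031, "tn": 72609, "ve": 66487, "ts": 69584,
}

# Assumption: grouping by writing system, not stated per language in this file.
CONJUNCTIVE = ("nr", "xh", "zu", "ss")
DISJUNCTIVE = ("st", "nso", "tn", "ve", "ts")


def mean_tokens(codes):
    """Average number of tokens per data file for the given language codes."""
    return sum(TOKENS[code] for code in codes) / len(codes)


total_tokens = sum(TOKENS.values())
```

With these figures the group means come to roughly 44,600 and 71,100 tokens, in line with the approximate sizes given in the description.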

Source(s):
Documents from various South African domains (mainly government, municipalities, and publications).

SADiLaR website: https://sadilar.org

_________________________________________________________________________________
Licence for final (v1.0) distribution: Creative Commons Attribution 4.0 International
 
URL: http://creativecommons.org/licenses/by/4.0/
 
Attribute work to: 
	CTexT® (Centre for Text Technology, North-West University), South Africa; 
	SADiLaR (South African Centre for Digital Language Resources), South Africa.
Attribute work to URL:
	http://humanities.nwu.ac.za/ctext
	https://sadilar.org