Project: SADiLaR-II: Linguistic corpus enrichment for conjunctively written South African languages.

Type: Parallel data annotated for morphology, lemmatisation and part-of-speech
Languages: 	isiNdebele (NR), isiXhosa (XH), isiZulu (ZU) & Siswati (SS). English (EN) is included as tokens only.
Date: 2021-09-30
Version: 1.0 (final)

Description: 
The dataset contains a total of 1431 paragraphs and approx. 50,000 tokens for each of the four annotated languages. As this data was used for the DHASA 2021 shared task (https://dh2021.digitalhumanities.org.za/shared-task/), all data has been split randomly by paragraph into a 90% training part and a 10% testing part. The data is provided as one training and one test text file per language. The start of each original paragraph is marked by a marker line with a counter. Each token and its annotations appear on a separate line. The token and all annotation types (morphology, lemma, part-of-speech) are separated by a single tab character in the .txt file. The English file contains tokens only.
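
The file layout described above can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions, not the official loader: the column order (token, morphology, lemma, part-of-speech) and the shape of the paragraph marker (a counter inside angle brackets, e.g. "<LINE#12>") are assumptions and should be checked against the actual files. The sample lines use placeholder values, not real data.

```python
import re
from typing import Dict, Iterable, List

# ASSUMPTION: column order; the README only lists the annotation types.
COLUMNS = ("token", "morphology", "lemma", "pos")

# ASSUMPTION: hypothetical marker shape, a counter in angle brackets
# such as "<LINE#12>". Adjust the pattern to the real marker lines.
MARKER = re.compile(r"^<\D*(\d+)>$")


def read_paragraphs(lines: Iterable[str], marker=MARKER, columns=COLUMNS) -> List[Dict]:
    """Group tab-separated token lines into paragraphs."""
    paragraphs: List[Dict] = []
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip empty lines
        match = marker.match(line.strip())
        if match:
            # A marker line starts a new paragraph with its counter.
            current = {"id": int(match.group(1)), "tokens": []}
            paragraphs.append(current)
            continue
        if current is None:
            # Token lines before any marker go into an unnumbered paragraph.
            current = {"id": None, "tokens": []}
            paragraphs.append(current)
        # Lines with fewer fields (e.g. English, tokens only) yield fewer keys.
        fields = line.split("\t")
        current["tokens"].append(dict(zip(columns, fields)))
    return paragraphs


# Placeholder lines for illustration only.
sample = [
    "<LINE#1>\n",
    "tok1\tmorph1\tlemma1\tPOS1\n",
    "tok2\tmorph2\tlemma2\tPOS2\n",
    "<LINE#2>\n",
    "word\n",
]
paras = read_paragraphs(sample)
print(len(paras), paras[0]["tokens"][1]["lemma"])
```

To read a real file, pass an open file handle instead of the sample list, for instance one opened with encoding="utf-8-sig" so that a leading byte-order mark, if present, is dropped.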

A more thorough description can be found in the Data in Brief article "Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati" (https://doi.org/10.1016/j.dib.2022.107994). Please cite this article when referring to the dataset.

Content:
Language    |  Total tokens  |
------------------------------
isiNdebele  |     51,120     |
Siswati     |     48,816     |
isiXhosa    |     50,166     |
isiZulu     |     50,528     |
English     |     68,431     |

_________________________________________________________________________________
Licence: This pre-final version is not intended for distribution
_____________________________________________________________________________________
Licence for final (v1.0) distribution: Creative Commons Attribution 4.0 International
 
URL: http://creativecommons.org/licenses/by/4.0/

Citation: https://doi.org/10.1016/j.dib.2022.107994
Attribute work to: 
	CTexT® (Centre for Text Technology, North-West University), South Africa; 
	South African Centre for Digital Language Resources (SADiLaR), South Africa.
Attribute work to URL:	
	http://humanities.nwu.ac.za/ctext and 
	http://www.sadilar.org