Project: SADiLaR: Parallel corpora for English into Siswati

Type: Monolingual text corpora
Languages: Siswati (ss_ZA)
Date: 2022-03-31
Version: 1.0 (Final)

Description: 
Monolingual corpus for Siswati. The data is provided as a single UTF-8 text file with one segment per line.
The dataset contains the Siswati text collected during data acquisition for which no aligned English sentences are available.
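
Example (illustrative only, not part of the distribution):
The one-segment-per-line layout described above could be read with a short Python sketch like the one below. The function names are hypothetical. Skipping blank lines and counting words by splitting on whitespace are assumptions, so the totals it produces may not match the figures reported under Content exactly.

```python
from typing import Iterator, Tuple


def iter_segments(path: str) -> Iterator[str]:
    """Yield non-empty segments, one per line, from a UTF-8 text file."""
    # "utf-8-sig" reads plain UTF-8 and also drops a leading byte-order mark if one is present.
    with open(path, encoding="utf-8-sig") as handle:
        for line in handle:
            segment = line.strip()
            # Assumption: blank lines are not segments and are skipped.
            if segment:
                yield segment


def corpus_counts(path: str) -> Tuple[int, int]:
    """Return (segment count, word count) for the corpus file at path."""
    segments = 0
    words = 0
    for segment in iter_segments(path):
        segments += 1
        # Assumption: a word is any whitespace-delimited token.
        words += len(segment.split())
    return segments, words
```

Calling corpus_counts() with the path to the corpus text file returns the two totals as a tuple. The file is streamed line by line, so the whole corpus never has to be held in memory.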

Content:
                        |   Segments    |   Siswati words   |
-------------------------------------------------------------
Monolingual dataset     |   138,651     |   1,536,356       |


Source(s):
Documents from various South African domains (mainly government and publications).

SADiLaR website: https://sadilar.org/index.php/en/

_________________________________________________________________________________
Licence for final (v1.0) distribution: Creative Commons Attribution 4.0 International
 
URL: http://creativecommons.org/licenses/by/4.0/
 
Attribute work to: 
	CTexT® (Centre for Text Technology, North-West University), South Africa; 
	SADiLaR (South African Centre for Digital Language Resources), South Africa;
	Department of Sport, Arts and Culture, South Africa.
Attribute work to URL:
	http://humanities.nwu.ac.za/ctext
	https://sadilar.org
	http://www.dac.gov.za