Project: SADiLaR: Parallel corpora for English into Siswati

Type: Multilingual text corpora: aligned
Languages: 	English (en_ZA) & Siswati (ss_ZA).
Date: 2022-03-31
Version: 1.0 (Final)

Description: 
Aligned parallel corpora for the following language pair: English-Siswati.
The data is provided as separate UTF-8 text files, one per language, with one segment per line; corresponding lines in the two files form an aligned pair.
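
As an illustration of the file layout described above, the following minimal Python sketch reads two line-aligned files and yields segment pairs. The function name read_aligned and the file names in the usage comment are hypothetical and not part of this distribution; the sketch assumes that line N of one file corresponds to line N of the other.

```python
from itertools import zip_longest


def read_aligned(en_path, ss_path):
    """Yield (english, siswati) segment pairs from two line-aligned UTF-8 files."""
    # "utf-8-sig" reads plain UTF-8 and also skips a leading byte order mark, if any.
    with open(en_path, encoding="utf-8-sig") as en_file, \
         open(ss_path, encoding="utf-8-sig") as ss_file:
        pairs = zip_longest(en_file, ss_file)
        for line_no, (en, ss) in enumerate(pairs, start=1):
            # zip_longest pads the shorter file with None, which signals misalignment.
            if en is None or ss is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield en.rstrip("\n"), ss.rstrip("\n")


# Usage (hypothetical file names, substitute the files shipped with the corpus):
# for en, ss in read_aligned("corpus.en.txt", "corpus.ss.txt"):
#     print(en, "|||", ss)
```

Raising an error on a length mismatch, rather than silently truncating, is a design choice of this sketch: a missing line in either file would shift every later pair out of alignment.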

The dataset contains the following types of bilingual data:

	* Translations from English to Siswati;
	* Crawled parallel data English-Siswati.

All word counts are measured on the English portion.

Content:
						|	Segments	|	EN Words	|	SS Words	|
-------------------------------------------------------------------------
Translated dataset		|	 53,382		|	1,000,531	|	  705,060	|
Crawled dataset			|	 61,457		|	1,001,762	|	  718,354	|
-------------------------------------------------------------------------
Entire dataset			|	114,839		|	2,002,293	|	1,423,414	|
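
Segment and word counts of the kind shown in the table can be approximated with the short Python sketch below. The function name corpus_stats is hypothetical, and counting whitespace-separated tokens is an assumption: this README does not state how words were counted, so the results may differ from the figures above.

```python
def corpus_stats(en_segments, ss_segments):
    """Return (segments, English words, Siswati words) for aligned segment lists."""
    en_segments = list(en_segments)
    ss_segments = list(ss_segments)
    if len(en_segments) != len(ss_segments):
        raise ValueError("segment lists are not aligned")
    # Assumption: a "word" is any whitespace-separated token.
    en_words = sum(len(segment.split()) for segment in en_segments)
    ss_words = sum(len(segment.split()) for segment in ss_segments)
    return len(en_segments), en_words, ss_words
```

The function takes any two equally long sequences of segments, for example the lines read from the English and Siswati files.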


Source(s):
Parallel documents from various South African domains (mainly government and publications).

SADiLaR website: https://sadilar.org/index.php/en/

_________________________________________________________________________________
Licence for final (v1.0) distribution: Creative Commons Attribution 4.0 International
 
URL: http://creativecommons.org/licenses/by/4.0/
 
Attribute work to: 
	CTexT® (Centre for Text Technology, North-West University), South Africa; 
	SADiLaR (South African Centre for Digital Language Resources), South Africa;
	Department of Sport, Arts and Culture, South Africa.
Attribute work to URL:	
	http://humanities.nwu.ac.za/ctext
	https://sadilar.org
	http://www.dac.gov.za