Project: Autshumato II

Type: Aligned parallel corpus
Languages: English (en_GB, eng_GB) & Xitsonga (ts_ZA, tso_ZA)
Date: 2014-11-18
Version: 1.0.8

Description: 
Aligned English-Xitsonga parallel corpus.
The data is provided as two separate UTF-8 text files, with each segment on its own line.
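
As an illustration of the layout described above, the following Python sketch reads the two files in parallel and yields one segment pair per line. It assumes that line N of the English file corresponds to line N of the Xitsonga file; the function name read_aligned and the file names in the usage comment are hypothetical and are not part of the distribution.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_aligned(en_path: str, ts_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (english, xitsonga) segment pairs, pairing lines by position.

    Assumption: both files are UTF-8 and have the same number of lines.
    """
    with open(en_path, encoding="utf-8") as en_file, \
         open(ts_path, encoding="utf-8") as ts_file:
        pairs = zip_longest(en_file, ts_file)
        for line_no, (en, ts) in enumerate(pairs, start=1):
            # zip_longest pads the shorter file with None, which signals
            # that the two files are no longer aligned.
            if en is None or ts is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield en.rstrip("\n"), ts.rstrip("\n")


# Usage (hypothetical file names):
# for en, ts in read_aligned("corpus.en.txt", "corpus.ts.txt"):
#     print(en, "|||", ts)
```

Raising an error on a length mismatch, rather than silently truncating, is a deliberate choice here: a missing line would shift every later pair out of alignment.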

Content:
Bilingual Segments: 450,000.
English Words (excluding punctuation and numbers): 3,461,089.
Xitsonga Words (excluding punctuation and numbers): 4,328,407.
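
The word counts above exclude punctuation and numbers. The exact tokenisation used to produce them is not documented here, so the following Python sketch shows only one possible approximation: it splits on whitespace and counts a token as a word if it contains at least one letter. The function name count_words is hypothetical, and totals computed this way may differ from the figures listed.

```python
def count_words(segment: str) -> int:
    """Count whitespace-separated tokens that contain at least one letter.

    Assumption: tokens made up only of digits and/or punctuation
    (e.g. "1996", "-", "2014-11-18") are not words.
    """
    return sum(
        1
        for token in segment.split()
        if any(ch.isalpha() for ch in token)
    )
```

Under this rule, a segment such as "Act 108 of 1996" would contribute two words ("Act" and "of").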

Source(s):
Sourced translations of government-domain data.
Data crawled from government websites (*.gov.za).
_________________________________________________________________________________
Licence: Creative Commons Attribution Non-Commercial ShareAlike 2.5 South Africa
 
URL: http://creativecommons.org/licenses/by-nc-sa/2.5/za/
 
Attribute work to: 
	CTexT (Centre for Text Technology, North-West University), South Africa; 
	Department of Arts and Culture, South Africa.
Attribute work to URL:	
	http://autshumato.sourceforge.net/ and 
	http://www.nwu.ac.za/ctext
_________________________________________________________________________________