Project: Autshumato II

Type: Monolingual corpus
Language: Xitsonga (ts_ZA, tso_ZA)
Date: 2014-09-04
Version: 1.0.3

Description: 
Xitsonga monolingual corpus, a deliverable of the Autshumato project.
The data is provided as a UTF-8 text file with one sentence per line.
All exact duplicates have been removed.
The data is tokenised (spaces have been inserted between punctuation and words).
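
The format described above can be read with a few lines of Python. This is a minimal sketch, not part of the distribution: the function names are hypothetical, and the caller must supply the actual file path. It assumes only what the description states (UTF-8, one sentence per line, tokens separated by spaces).

```python
from pathlib import Path
from typing import Iterable, List


def clean_segments(lines: Iterable[str]) -> List[str]:
    """Strip line endings and drop empty lines (one segment per line)."""
    segments = []
    for line in lines:
        segment = line.rstrip("\r\n")
        if segment:
            segments.append(segment)
    return segments


def read_segments(path: str) -> List[str]:
    """Read a UTF-8 corpus file; 'path' is supplied by the caller."""
    with Path(path).open(encoding="utf-8") as handle:
        return clean_segments(handle)


def tokens(segment: str) -> List[str]:
    """Split a segment into tokens; the corpus is already tokenised,
    so splitting on whitespace is assumed to be sufficient."""
    return segment.split()


def count_exact_duplicates(segments: List[str]) -> int:
    """Number of segments that repeat an earlier segment exactly."""
    return len(segments) - len(set(segments))
```

On the distributed file, count_exact_duplicates(read_segments(path)) would be expected to return 0 if the file matches the description, which makes it a quick sanity check after download.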

Content:
58,398 Xitsonga segments.

Monolingual Lines: 58,398
Monolingual Words (excludes punctuation and numbers): 537,552
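
Counts of this kind can be approximated as follows. This is a sketch under a stated assumption: the exact rule used to exclude punctuation and numbers is not documented here, so the code treats a token as a word if it contains at least one alphabetic character. The function names are hypothetical, and results on the real file may differ from the figures above if the original rule was different.

```python
from typing import Iterable, Tuple


def is_word(token: str) -> bool:
    """Assumed rule: a word is any token with at least one letter."""
    return any(ch.isalpha() for ch in token)


def corpus_counts(segments: Iterable[str]) -> Tuple[int, int]:
    """Return (number of lines, number of word tokens)."""
    lines = 0
    words = 0
    for segment in segments:
        lines += 1
        # Tokens are space-separated because the corpus is tokenised.
        words += sum(1 for token in segment.split() if is_word(token))
    return lines, words
```

For a toy input such as ["Hi , world 2014 .", "a1 3.5 !"], this rule counts "Hi", "world" and "a1" as words and skips the punctuation and the purely numeric tokens.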


Source(s):
Sourced translations of government-domain data.

_________________________________________________________________________________
Licence: Creative Commons Attribution Non-Commercial ShareAlike 2.5 South Africa
 
URL: http://creativecommons.org/licenses/by-nc-sa/2.5/za/
 
Attribute work to: 
	CTexT (Centre for Text Technology, North-West University), South Africa; 
	Department of Arts and Culture, South Africa.
Attribute work to URL:	
	http://autshumato.sourceforge.net/ and 
	http://www.nwu.ac.za/ctext
_________________________________________________________________________________