Project: Autshumato V

Type: Multilingual text corpora: aligned
Languages: 	English (en_ZA) & Tshivenḓa (ve_ZA).
Date: 2023-12-12
Version: 3.0 (Final)

Description: 
Aligned parallel corpus for the English-Tshivenḓa language pair. The data was crawled from various multilingual government websites, sourced from translated material, and created by translating English sentences into Tshivenḓa. It is provided as two separate UTF-8 text files, with each aligned segment on its own line, so that a given line in one file corresponds to the same line in the other.

Content: There are 110,367 English-Tshivenḓa segments, consisting of 2,000,657 English words and 2,527,789 Tshivenḓa words.
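
Example: The file layout described above (two UTF-8 text files, aligned line by line) can be read with a short script. The following is a minimal Python sketch, not part of the official distribution. The function names read_aligned and corpus_stats are illustrative, and the file paths passed to them are assumptions, since this README does not name the data files. Words are counted by splitting on whitespace, which may not match the tokenisation used to produce the figures above.

```python
from itertools import zip_longest


def read_aligned(en_path, ve_path):
    """Yield (english, tshivenda) segment pairs from two line-aligned files.

    Both paths are supplied by the caller; the README does not name the files.
    """
    # "utf-8-sig" reads plain UTF-8 and also drops a leading byte-order mark.
    with open(en_path, encoding="utf-8-sig") as en_file, \
         open(ve_path, encoding="utf-8-sig") as ve_file:
        pairs = zip_longest(en_file, ve_file)
        for line_no, (en, ve) in enumerate(pairs, start=1):
            # zip_longest pads the shorter file with None, which exposes
            # a length mismatch instead of silently truncating.
            if en is None or ve is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield en.rstrip("\r\n"), ve.rstrip("\r\n")


def corpus_stats(pairs):
    """Return (segments, english_words, tshivenda_words) for the given pairs.

    Words are whitespace-delimited tokens (an assumption of this sketch).
    """
    segments = en_words = ve_words = 0
    for en, ve in pairs:
        segments += 1
        en_words += len(en.split())
        ve_words += len(ve.split())
    return segments, en_words, ve_words
```

Calling corpus_stats(read_aligned(english_file, tshivenda_file)) with the paths of the two downloaded files returns the segment count and both word counts in a single pass. A ValueError indicates that the two files do not have the same number of lines.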
					
Source(s):
Various documents from the government domain.

Project website: http://autshumato.sourceforge.net/
_________________________________________________________________________________
Licence for final (v1.0) distribution: Creative Commons Attribution 4.0 International
 
URL: http://creativecommons.org/licenses/by/4.0/
 
Attribute work to: 
	CTexT® (Centre for Text Technology, North-West University), South Africa; 
	Department of Sport, Arts and Culture, South Africa.
Attribute work to URL:	
	http://humanities.nwu.ac.za/ctext and 
	http://www.dac.gov.za/