Project: Autshumato 6

Type: Multilingual text corpora - aligned
Languages: English (en_ZA) & Sesotho (st_ZA)
Date: 2022-09-30
Version: 2.0 (Final)

Description: 
Aligned parallel corpora for the English-Sesotho language pair. The data is provided as two separate UTF-8 text files, one per language, with each aligned segment on its own line, so that a given line in one file corresponds to the same line number in the other. The data was selected and formatted specifically for training machine translation systems. Further clean-up and processing may be required, depending on the task for which the data is reused.
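
Because the two files are aligned line by line, they can be read in lockstep. The following is a minimal Python sketch, not part of the original distribution. The function name read_aligned and the file names in the usage comment are hypothetical; substitute the actual file names from the download.

```python
from itertools import zip_longest


def read_aligned(src_path, tgt_path):
    """Yield (source, target) segment pairs from two line-aligned UTF-8 files."""
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, so a length
        # mismatch is detected instead of being silently truncated.
        for number, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {number}")
            yield s.rstrip("\n"), t.rstrip("\n")


# Usage (hypothetical file names):
#   for english, sesotho in read_aligned("english.txt", "sesotho.txt"):
#       ...
```

Raising an error on a length mismatch is a deliberate choice in this sketch: if the two files have different line counts, the alignment can no longer be trusted.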

Sources:
The data combines sources for which the Centre for Text Technology holds distribution rights (magazines, policies, newsletters, translation works) with documents crawled from the government domain (*.gov.za).

Processing and clean-up:
All data was extracted from the original formats (doc(x), html or pdf) and converted to UTF-8 text. Lines that were split during pdf extraction were re-joined where possible.
The two languages were aligned using HunAlign with a wordlist, and only documents that aligned at 80% or more were kept. The parallel data was then sorted uniquely (discarding lines where both languages are identical) and language-identified at a 50% confidence level.
Further clean-up entailed deleting lines with broken diacritics, lines with excessive punctuation and lines that were unusually long (more than 100 words). Where any of these issues was found in either language, both aligned lines were deleted.
The data was spellchecked, and only lines spelled correctly at 80% or higher were kept. As a final step, the data was randomized to comply with usage restrictions.
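
Some of the clean-up steps above can be approximated when further filtering is needed for reuse. The following Python sketch is illustrative only and is not the pipeline that produced this release. It assumes that "sorted uniquely" means dropping repeated segment pairs, and the punctuation threshold MAX_PUNCT_RATIO is an invented value, since the actual threshold is not documented here. Only the 100-word limit is taken from the description. All function names are hypothetical.

```python
import random
import string

MAX_WORDS = 100        # limit stated in the description above
MAX_PUNCT_RATIO = 0.3  # assumed value; the real threshold is not documented


def too_long(line, max_words=MAX_WORDS):
    """True if the line has more than max_words whitespace-separated words."""
    return len(line.split()) > max_words


def too_much_punctuation(line, max_ratio=MAX_PUNCT_RATIO):
    """True if ASCII punctuation exceeds max_ratio of the non-space characters.

    Lines with no non-space characters are also rejected as unusable.
    """
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return True
    punct = sum(c in string.punctuation for c in chars)
    return punct / len(chars) > max_ratio


def clean_pairs(pairs, seed=None):
    """Drop repeated pairs and pairs where either side fails a check, then shuffle."""
    seen = set()
    kept = []
    for pair in pairs:
        if pair in seen:
            continue
        seen.add(pair)
        # If either language fails a check, the whole pair is dropped.
        if any(too_long(side) or too_much_punctuation(side) for side in pair):
            continue
        kept.append(pair)
    random.Random(seed).shuffle(kept)
    return kept
```

Checking both sides of a pair and dropping the pair as a whole keeps the two files aligned, which mirrors the rule that both aligned lines were deleted together.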

Word Counts:
------------
Aligned Segments: 171 292
English Words: 2 848 205
Sesotho Words: 3 465 480
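
Comparable counts can be recomputed from the aligned files. The Python sketch below assumes simple whitespace tokenization; the tokenization used for the figures above is not stated, so recomputed word counts may differ. The function name corpus_counts is hypothetical.

```python
def corpus_counts(pairs):
    """Return (segments, source_words, target_words) for an iterable of pairs.

    Words are counted by splitting on whitespace, which is an assumption.
    """
    segments = source_words = target_words = 0
    for source, target in pairs:
        segments += 1
        source_words += len(source.split())
        target_words += len(target.split())
    return segments, source_words, target_words
```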


Project website: http://autshumato.sourceforge.net/

_________________________________________________________________________________
Licence for final (v2.0) distribution: Creative Commons Attribution 4.0 International
 
URL: http://creativecommons.org/licenses/by/4.0/
 
Attribute work to: 
	CTexT® (Centre for Text Technology, North-West University), South Africa; 
	SADiLaR (South African Centre for Digital Language Resources), South Africa;
	Department of Sport, Arts and Culture, South Africa.
Attribute work to URL:
	http://humanities.nwu.ac.za/ctext
	https://sadilar.org
	http://www.dac.gov.za
