Project: Autshumato 6

Type: Monolingual text corpora
Languages: Sepedi (nso_ZA)
Date: 2022-09-30
Version: 2.1 (Final)

Description: 
Monolingual corpus for Sepedi. The data is provided as a single UTF-8 text file with one segment per line. The data was specifically selected and formatted for training machine translation systems. Further clean-up and processing might be required, depending on the task for which the data is reused.
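
Example (reading the corpus):
The one-segment-per-line layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the official distribution: the function names are illustrative, and words are counted by splitting on whitespace, which is an assumption. The word count reported in this file may have been produced with a different tokenizer, so the totals need not match exactly.

```python
from typing import Iterator, Tuple


def iter_segments(path: str) -> Iterator[str]:
    """Yield non-empty segments, one per line, from a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            segment = line.strip()
            if segment:
                yield segment


def count_segments_and_words(path: str) -> Tuple[int, int]:
    """Return (segment count, whitespace-delimited word count)."""
    segments = 0
    words = 0
    for segment in iter_segments(path):
        segments += 1
        # Assumption: a "word" is any whitespace-delimited token.
        words += len(segment.split())
    return segments, words
```

Calling count_segments_and_words() with the path to the corpus file returns the two totals as a tuple. Streaming the file line by line keeps memory use low.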

Sources:
The data combines sources for which the Centre for Text Technology holds distribution rights (magazines, policies, newsletters, translation works) with documents crawled from the government domain (*.gov.za).

Processing and clean-up:
All data was extracted from the original formats (doc(x), html or pdf) and converted to UTF-8 text. Lines that had been split during PDF extraction were re-joined where possible. The text was then sorted and deduplicated, and language identification was applied at a 50% confidence level.
Further clean-up entailed deleting lines with broken diacritics, lines with excessive punctuation, and unusually long lines.
The data was spellchecked, and only lines spelled correctly at a rate of 80% or higher were kept. As a final step, the data was randomized to comply with usage restrictions.
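
Example (clean-up sketch):
The filtering steps above can be approximated as follows. This is a hedged illustration only, not the tooling actually used for this corpus. Only the 80% spelling threshold comes from this file. The length and punctuation cut-offs, the use of U+FFFD as a sign of broken characters, the reading of the 80% rule as "share of tokens found in a lexicon", and all function names are assumptions. Language identification is omitted because it requires an external tool.

```python
import random
import unicodedata
from typing import Iterable, List, Set

MAX_CHARS = 1000        # assumed cut-off for "unusually long" lines
MAX_PUNCT_RATIO = 0.3   # assumed cut-off for "excessive punctuation"
MIN_SPELL_RATIO = 0.8   # the 80% spelling threshold stated above


def punctuation_ratio(line: str) -> float:
    """Share of characters in a line that are Unicode punctuation."""
    if not line:
        return 0.0
    punct = sum(1 for ch in line if unicodedata.category(ch).startswith("P"))
    return punct / len(line)


def spelling_ratio(line: str, lexicon: Set[str]) -> float:
    """Share of tokens found in the lexicon (a stand-in for a spellchecker)."""
    tokens = [tok.strip(".,;:!?\"'()").lower() for tok in line.split()]
    tokens = [tok for tok in tokens if tok]
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in lexicon)
    return known / len(tokens)


def clean(lines: Iterable[str], lexicon: Set[str], seed: int = 0) -> List[str]:
    """Deduplicate, filter, and shuffle lines."""
    # Sort uniquely: a set removes duplicates, sorted() fixes the order.
    unique = sorted({line.strip() for line in lines if line.strip()})
    kept = [
        line for line in unique
        if "\ufffd" not in line  # assumed proxy for broken characters
        and len(line) <= MAX_CHARS
        and punctuation_ratio(line) <= MAX_PUNCT_RATIO
        and spelling_ratio(line, lexicon) >= MIN_SPELL_RATIO
    ]
    # Randomize the order; a fixed seed keeps the result reproducible.
    random.Random(seed).shuffle(kept)
    return kept
```

The lexicon argument is a plain set of lower-cased word forms. A real spellchecker would also handle morphology, which this sketch does not.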

Word Count:
-----------
Sepedi Segments: 171 774
Sepedi Words: 3 448 592

Project website: http://autshumato.sourceforge.net/

_________________________________________________________________________________
Licence for final (v1.0) distribution: Creative Commons Attribution 4.0 International
 
URL: http://creativecommons.org/licenses/by/4.0/
 
Attribute work to: 
	CTexT® (Centre for Text Technology, North-West University), South Africa; 
	SADiLaR (South African Centre for Digital Language Resources), South Africa;
	Department of Sport, Arts and Culture, South Africa.
Attribute work to URL:
	http://humanities.nwu.ac.za/ctext
	https://sadilar.org
	http://www.dac.gov.za
