Project: CTexT Spelling Checkers
Type: Multilingual spelling checker lexicon source files
Languages: Afrikaans (af), isiNdebele (nr), isiXhosa (xh), isiZulu (zu), Sepedi (nso), Sesotho (st), Setswana (tn), Siswati (ss), Tshivenḓa (ve), Xitsonga (ts).
Date: 2022-06-30
Version: 1.0

Description: 
Spelling checker lexicons for 10 South African languages. The lexicons were created by collecting data from various sources and were manually reviewed by language experts according to the standard written orthography.
For each language there are four different lexicon files: 
	* abbreviations.<lang>.txt	abbreviations and abbreviation compounds.
	* lowercase.<lang>.txt	words that are correct when written in lower case.
	* offensive.<lang>.txt	words that are potentially offensive, obscene, racist, or should not be suggested by a spelling checker for some other reason.
	* uppercase.<lang>.txt	words that should only be written with one or more capitalised characters, such as person and place names.
All files are in UTF-8 format, with each word on a separate line.
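The file layout described above could be read along these lines. This is a minimal sketch, not part of the distribution: the function names (load_lexicon, is_known, suggestion_pool) and the directory argument are hypothetical, and the lookup rules are assumptions drawn from the file descriptions (a lower-case entry is also accepted with a sentence-initial capital; offensive entries are accepted as correctly spelled but left out of suggestions).

```python
from pathlib import Path

# The four file-name prefixes listed above.
CATEGORIES = ("abbreviations", "lowercase", "offensive", "uppercase")


def load_lexicon(directory, lang):
    """Read the four word lists for one language code into sets of strings."""
    lexicon = {}
    for category in CATEGORIES:
        path = Path(directory) / f"{category}.{lang}.txt"
        # "utf-8-sig" reads plain UTF-8 and also tolerates a leading BOM.
        with open(path, encoding="utf-8-sig") as handle:
            # One entry per line; blank lines are skipped.
            lexicon[category] = {line.strip() for line in handle if line.strip()}
    return lexicon


def is_known(word, lexicon):
    """Assumed rule: exact match in any list, or a lower-case entry
    written with a capitalised first letter."""
    if any(word in lexicon[category] for category in CATEGORIES):
        return True
    return word[:1].lower() + word[1:] in lexicon["lowercase"]


def suggestion_pool(lexicon):
    """Assumed rule: every entry may be suggested except the offensive ones."""
    allowed = lexicon["abbreviations"] | lexicon["lowercase"] | lexicon["uppercase"]
    return allowed - lexicon["offensive"]
```

For example, load_lexicon("lexicons", "af") would read abbreviations.af.txt, lowercase.af.txt, offensive.af.txt and uppercase.af.txt from a directory named "lexicons" (a placeholder path). Sets are used because the main operation of a spelling checker is a membership test.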
The following table provides a summary of the number of entries for each language (Words is the total across the four files):
	Lang	|Words	|Abbr	|Lower	|Offen	|Upper
	af	|437562	|4454	|387186	|1086	|44836
	nr	|146801	| 149	|144932	|  42	| 1678
	nso	| 58697	|  14	| 56859	|  34	| 1790
	ss	|235822	|  46	|226733	|  60	| 8983
	st	| 44033	|  86	| 42302	|  50	| 1595
	tn	| 72029	|   4	| 69783	| 127	| 2115
	ts	| 28530	|  55	| 27828	|   9	|  638
	ve	| 25808	|   0	| 24766	|  12	| 1030
	xh	|203604	|   9	|192773	| 121	|10701
	zu	|243966	|  32	|226795	|  47	|17092
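In every row of the table, the Words column equals the sum of the other four columns. The short sketch below checks this; the rows are copied from the table, and the helper name check_totals is hypothetical.

```python
# Rows copied from the summary table: Lang, Words, Abbr, Lower, Offen, Upper.
ROWS = """
af|437562|4454|387186|1086|44836
nr|146801|149|144932|42|1678
nso|58697|14|56859|34|1790
ss|235822|46|226733|60|8983
st|44033|86|42302|50|1595
tn|72029|4|69783|127|2115
ts|28530|55|27828|9|638
ve|25808|0|24766|12|1030
xh|203604|9|192773|121|10701
zu|243966|32|226795|47|17092
"""


def check_totals(rows):
    """Return the language codes whose Words value is not the sum of the
    four per-file counts."""
    mismatches = []
    for line in rows.strip().splitlines():
        lang, *numbers = [cell.strip() for cell in line.split("|")]
        words, abbr, lower, offen, upper = map(int, numbers)
        if words != abbr + lower + offen + upper:
            mismatches.append(lang)
    return mismatches


print(check_totals(ROWS))  # prints [] (every row adds up)
```

The same comparison can be run against the line counts of the downloaded files to confirm that a copy of the distribution is complete.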




_________________________________________________________________________________
Licence for v1.0 distribution: Creative Commons Attribution 4.0 International
URL: https://creativecommons.org/licenses/by/4.0/
 
Attribute work to: 
	CTexT® (Centre for Text Technology, North-West University), South Africa
Attribute work to URL: 
	https://humanities.nwu.ac.za/ctext 
