______________________________________________
README: NCHLT Text Corpora
______________________________________________
1. About NCHLT Text Corpora
2. Development process
3. Directory structure
4. License
______________________________________________

1. About NCHLT Text Corpora

This directory contains corpora developed during project NCHLT: Text. Languages included are: Afrikaans, English*, isiXhosa, isiNdebele, isiZulu, Sesotho, Setswana, Sepedi, Siswati, Tshivenda, Xitsonga.

These corpora were developed by the Centre for Text Technology (CTexT, North-West University, South Africa).

See: http://www.nwu.ac.za/ctext for more information.

______________________________________________

2. Development process 

These corpora are based on documents from the South African government domain, mainly crawled from gov.za websites and collected from various language units.

Raw formats (PDF, HTML, DOC) were converted to plain text format (TXT).
Duplicate documents were removed so that the same data does not appear more than once.
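
The duplicate removal described above could be sketched as follows. This is a minimal illustration, not the tooling used by CTexT: the function name, the use of SHA-256 and the whitespace normalisation are all assumptions, and only exact duplicates (after whitespace normalisation) are detected.

```python
import hashlib


def remove_duplicates(documents):
    """Return the documents with duplicates dropped, keeping the first copy."""
    seen = set()
    unique = []
    for text in documents:
        # Assumption: copies that differ only in whitespace count as duplicates.
        normalised = " ".join(text.split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing the normalised text keeps memory use low for large collections, because only the digests are stored rather than the full documents.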

Raw text documents go through various cleanup procedures to ensure the best quality for corpus data.
Examples:
Remove invalid characters at the beginning and end of sentences (,*,-, [,] ,%,#,@, etc.).
Remove digits and codes at the beginning and end of sentences (1, 1.2.3, 20390sdsd9023, A.1.2, a., B., etc.).
Remove continuous characters (....... -------- ????? ,,,,,, etc.).
Remove empty lines.
Remove non-closed brackets ([...text).
Diacritic restoration (Language specific).
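
Some of the cleanup steps listed above could be sketched as follows. This is a hedged illustration only: the character set, the regular expressions, the run-length threshold and all function names are assumptions, since the lists above end in "etc." and the actual rules are not given here. Only codes at the beginning of a line are handled, and diacritic restoration is language specific and is not covered.

```python
import re

# Assumed set of characters to strip from line edges (the README list is open-ended).
EDGE_CHARS = ",*-[]%#@ \t"

# Assumed pattern for leading numbering such as "1", "1.2.3", "20390sdsd9023", "A.1.2", "a.".
# Heuristic: it also removes a leading year such as "2010".
LEADING_CODE = re.compile(r"^(?:\d[\w.]*|[A-Za-z]\.(?:\d+\.?)*)\s+")

# Assumed threshold: four or more repeats of the same punctuation mark.
RUNS = re.compile(r"([.\-?,])\1{3,}")


def drop_unclosed_brackets(line):
    """Remove '(' or '[' characters that have no matching closing bracket."""
    openers = {"(": ")", "[": "]"}
    closers = {")": "(", "]": "["}
    stack = []  # (index, opener) pairs still waiting for a closer
    for i, ch in enumerate(line):
        if ch in openers:
            stack.append((i, ch))
        elif ch in closers and stack and stack[-1][1] == closers[ch]:
            stack.pop()
    unmatched = {i for i, _ in stack}
    return "".join(ch for i, ch in enumerate(line) if i not in unmatched)


def clean_line(line):
    """Apply the sketched cleanup steps to a single line."""
    line = RUNS.sub(" ", line)            # remove continuous characters
    line = drop_unclosed_brackets(line)   # remove non-closed brackets
    line = " ".join(line.split())         # normalise whitespace
    line = line.strip(EDGE_CHARS)         # remove invalid edge characters
    line = LEADING_CODE.sub("", line)     # remove leading digits and codes
    return line.strip(EDGE_CHARS)


def clean_lines(lines):
    """Clean every line and drop the ones that end up empty."""
    cleaned = (clean_line(line) for line in lines)
    return [line for line in cleaned if line]
```

For example, clean_lines(["1.2.3 Hello world", "", "-----", "* A.1 Item [open"]) returns ["Hello world", "Item open"].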

Once the cleanup process is completed, the text files are concatenated (combined) into one file.
Final encoding and character checks are performed on this file to ensure quality.
For each language there is a CLEAN and a RAW corpus. 
RAW corpus versions are created before the cleanup process begins.
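
The concatenation and final character check could be sketched as follows. This is an illustration under stated assumptions, not the process used by CTexT: UTF-8 encoding, the "*.txt" file pattern, the function names and the definition of a suspicious character (control characters other than newline and tab, plus the Unicode replacement character) are all assumed.

```python
import unicodedata
from pathlib import Path


def concatenate(source_dir, corpus_path, encoding="utf-8"):
    """Combine all .txt files in source_dir into one file; return the file count.

    corpus_path should lie outside source_dir so the output is not re-read as input.
    """
    files = sorted(Path(source_dir).glob("*.txt"))
    with open(corpus_path, "w", encoding=encoding, newline="\n") as out:
        for path in files:
            # read_text raises UnicodeDecodeError if a file is not valid in this encoding.
            text = path.read_text(encoding=encoding)
            if text.strip():
                out.write(text.rstrip("\n") + "\n")
    return len(files)


def suspicious_characters(corpus_path, encoding="utf-8"):
    """Return the sorted list of distinct suspicious characters in the corpus."""
    text = Path(corpus_path).read_text(encoding=encoding)
    bad = {
        ch for ch in text
        if ch == "\ufffd"
        or (unicodedata.category(ch) == "Cc" and ch not in "\n\t")
    }
    return sorted(bad)
```

An empty list from suspicious_characters suggests that the combined file contains no control characters or replacement characters; it does not check language-specific character sets.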

______________________________________________

3. Directory structure

The following sub-directories are found in each language directory:

1.Source
Contains the txt versions of the individual files.
2.Corpora
Contains the raw and clean corpus.
3.Lexica
Contains a lexicon and named-entity list, derived from the clean corpus.
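
A quick check that a language directory has the layout described above could look like this. It is a sketch only: it assumes the sub-directories are literally named "1.Source", "2.Corpora" and "3.Lexica" as written above, and the function name is hypothetical.

```python
from pathlib import Path

# Assumption: sub-directory names exactly as listed in this README.
EXPECTED_SUBDIRS = ("1.Source", "2.Corpora", "3.Lexica")


def missing_subdirs(language_dir):
    """Return the expected sub-directories that are absent from language_dir."""
    root = Path(language_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

An empty list means all three sub-directories are present.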

*Note on English: An English corpus was not part of this project, but one was developed in order to derive the lexicon and named-entity list, so it is also supplied. This corpus is supplied as is: CTexT gives no guarantee, accepts no responsibility whatsoever for any errors in the corpus, and accepts no liability whatsoever for damage, loss or inconvenience resulting from use of the corpus.

______________________________________________

4. License

These files are distributed under the Creative Commons Attribution 2.5 South Africa license. 

A header containing the license is included only in the raw and clean corpora, but all files are distributed under the same conditions.
_______________________________________________
License: Creative Commons Attribution 2.5 South Africa
URL: http://creativecommons.org/licenses/by/2.5/za/

Attribute work to: South African Department of Arts and Culture & Centre for Text Technology (CTexT, North-West University, South Africa)

Attribute work to URL: http://www.nwu.ac.za/ctext 
______________________________________________

