POS annotated corpus with 5 different text types for isiZulu
License agreement
By downloading this resource I accept and agree to the terms of use and the associated license conditions under which the resource is distributed.
Download
MD5: a7ffd461cdcfa17f42c452cb8f3049aa
MD5: 7fd4d5590d2dc4b3d799e64bf00d8ee0
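A download can be checked against the MD5 values listed above. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_published_md5` and the file name in the usage comment are hypothetical, and the page does not state which checksum belongs to which download.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large archives do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, expected):
    """Compare a file's MD5 digest with a published value, ignoring case."""
    return md5_of_file(path) == expected.strip().lower()


# Usage (hypothetical file name, replace with the actual download):
# matches_published_md5("downloaded_resource.zip",
#                       "a7ffd461cdcfa17f42c452cb8f3049aa")
```

A mismatch usually indicates an incomplete or corrupted download, or that the file is being compared against the checksum of the other download.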
Author(s)
Gaustad, Tanja
Description
This is a POS-annotated corpus covering five different text types for isiZulu.
The text types included are:
- CAPS gr12 (Academic) - https://www.education.gov.za/Curriculum/NationalSeniorCertificate(NSC)Examinations.aspx;
- PhD Theses (Academic) - for isiZulu https://researchspace.ukzn.ac.za/, for Sepedi https://repository.up.ac.za/;
- Magazines (Non-Academic) - CTexT acquired data from Pula Imvula;
- News (Non-Academic) - for isiZulu Isolezwe content sourced from Leipzig corpus, for Sepedi CTexT acquired data;
- Novels (Fiction) - SADiLaR acquired data from OUP and Shuter and Shooter.
For isiZulu, the data was annotated with the Core Tech POS tagger developed during SADiLaR II.
The data is provided as plain-text (.txt) files in which each line contains a token and its corresponding POS tag, separated by a tab.
Each text-type data file contains approximately 5,000 tokens, amounting to a total of about 25,000 tokens per language. Please see the protocol for more details on the POS tags used.
Contents:
- isiZulu CAPS gr12 - 3,634 tokens;
- isiZulu PhD Theses - 5,716 tokens;
- isiZulu Magazines - 3,658 tokens;
- isiZulu News - 5,974 tokens;
- isiZulu Novels - 5,909 tokens.
Total 21,233 tokens.
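The tab-separated layout described above (one token and its POS tag per line) can be read with a few lines of Python. This is only a sketch under stated assumptions: UTF-8 encoding, blank lines treated as separators to skip, and the hypothetical helper names `read_tagged_tokens` and `summarise`. None of these details are confirmed by the resource description.

```python
from collections import Counter


def read_tagged_tokens(path, encoding="utf-8"):
    """Yield (token, tag) pairs from a file with one 'token<TAB>tag' per line.

    Assumptions: the file is UTF-8 encoded and blank lines carry no data.
    """
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines (assumed to be separators)
            # Split on the first tab only; the tag is everything after it.
            token, _, tag = line.partition("\t")
            yield token, tag


def summarise(path):
    """Return the token count and a tag frequency table for one file."""
    pairs = list(read_tagged_tokens(path))
    return len(pairs), Counter(tag for _, tag in pairs)


# Usage (hypothetical file name, replace with an actual corpus file):
# total, tag_counts = summarise("isizulu_news.txt")
# print(total, tag_counts.most_common(5))
```

Running `summarise` on each of the five files gives per-file token counts that can be compared with the figures listed under Contents.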
Contact person
T. Gaustad
Contact person's e-mail address
tanja.gaustad@nwu.ac.za
Publisher(s)
Centre for Text Technology (CTexT)