Project: SADiLaR II (Extension): Linguistic corpus enrichment for South African languages
Type: 25,000 tokens in two languages annotated for part-of-speech for five different text types (Final versions).
Languages: isiZulu (zu), Sepedi (nso)
Date: 2024-01-31
Version: 1.0 (Final)

Description:
This deliverable contains part-of-speech tagged data from five different text types for isiZulu and Sepedi.

The text types included are:
- CAPS gr12 (Academic) - https://www.education.gov.za/Curriculum/NationalSeniorCertificate(NSC)Examinations.aspx;
- PhD Theses (Academic) - for isiZulu https://researchspace.ukzn.ac.za/, for Sepedi https://repository.up.ac.za/;
- Magazines (Non-Academic) - CTexT acquired data from Pula Imvula;
- News (Non-Academic) - for isiZulu Isolezwe content sourced from Leipzig corpus, for Sepedi CTexT acquired data;
- Novels (Fiction) - SADiLaR acquired data from OUP and Shuter and Shooter.

For Sepedi, the data was tagged using the NCHLT webservices. The POS tags were then converted to the latest POS tag set (see protocol).
For isiZulu, the data was annotated with the Core Tech POS tagger developed during SADiLaR II.

The data is given as txt files in which each line contains a token and its corresponding POS tag, separated by a tab.
Each text type data file contains approximately 5,000 tokens, amounting to a total of approximately 25,000 tokens per language.
Please see the protocols for more details on the POS tags used.
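The file layout described above (one token and its POS tag per line, tab separated) can be read with a few lines of Python. The following is a minimal sketch, not part of the deliverable: the function name read_pos_file is hypothetical, the sample tokens and tags are placeholders rather than real data or tags from the protocol, and skipping blank lines is an assumption about how the files may be laid out.

```python
import io
from collections import Counter


def read_pos_file(lines):
    """Yield (token, tag) pairs from tab-separated lines.

    Blank lines are skipped (assumption: they may occur as separators).
    A line without a tab yields the whole line as token and an empty tag.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        token, _, tag = line.partition("\t")
        yield token, tag


# Placeholder content standing in for one of the txt files.
sample = io.StringIO("TOKEN_A\tTAG_X\nTOKEN_B\tTAG_Y\n\nTOKEN_C\tTAG_X\n")

pairs = list(read_pos_file(sample))
tag_counts = Counter(tag for _, tag in pairs)

print(len(pairs))   # number of annotated tokens read
print(tag_counts)   # frequency of each tag
```

To process an actual data file, pass an open file handle instead of the sample, for example open(path, encoding="utf-8"), where path points to one of the txt files; UTF-8 encoding is assumed here.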

Contents

Language and text type                | Tokens |
-----------------------------------------------------------------------------
isiZulu CAPS gr12 (POS annotated)     |  3,634 |
isiZulu PhD Theses (POS annotated)    |  5,716 |
isiZulu Magazines (POS annotated)     |  3,658 |
isiZulu News (POS annotated)          |  5,974 |
isiZulu Novels (POS annotated)        |  5,909 |
Total isiZulu: 21,233

Sepedi CAPS gr12 (POS annotated)      |  6,634 |
Sepedi PhD Theses (POS annotated)     |  7,395 |
Sepedi Magazines (POS annotated)      |  5,547 |
Sepedi News (POS annotated)           |  8,782 |
Sepedi Novels (POS annotated)         |  6,924 |
Total Sepedi: 30,158
-----------------------------------------------------------------------------

SADiLaR website: https://sadilar.org
_________________________________________________________________________________

Licence for final (v1.0) distribution:
Creative Commons Attribution 4.0 International
URL: http://creativecommons.org/licenses/by/4.0/

Attribute work to:
CTexT® (Centre for Text Technology, North-West University), South Africa;
SADiLaR (South African Centre for Digital Language Resources), South Africa.

Attribute work to URL:
http://humanities.nwu.ac.za/ctext
https://sadilar.org