
POS annotated corpus with 5 different text types for isiZulu

dc.contact.email: tanja.gaustad@nwu.ac.za [en_ZA]
dc.contact.name: T. Gaustad [en_ZA]
dc.contributor.author: Gaustad, Tanja
dc.date.accessioned: 2024-03-27T08:14:48Z
dc.date.available: 2024-03-27T08:14:48Z
dc.date.issued: 2024-01-31
dc.description: This is a POS annotated corpus with 5 different text types for isiZulu. The text types included are:
- CAPS gr12 (Academic): https://www.education.gov.za/Curriculum/NationalSeniorCertificate(NSC)Examinations.aspx
- PhD Theses (Academic): for isiZulu https://researchspace.ukzn.ac.za/, for Sepedi https://repository.up.ac.za/
- Magazines (Non-Academic): CTexT acquired data from Pula Imvula
- News (Non-Academic): for isiZulu, Isolezwe content sourced from the Leipzig corpus; for Sepedi, CTexT acquired data
- Novels (Fiction): SADiLaR acquired data from OUP and Shuter and Shooter
For isiZulu, the data was annotated with the Core Tech POS tagger developed during SADiLaR II. The data is given as txt files in which each line contains a token and its corresponding POS tag, separated by a tab. Each text type data file contains approximately 5,000 tokens, amounting to a total of 25,000 tokens per language. Please see the protocol for more details on the POS tags used.
Contents:
- isiZulu CAPS gr12: 3,634 tokens
- isiZulu PhD Theses: 5,716 tokens
- isiZulu Magazines: 3,658 tokens
- isiZulu News: 5,974 tokens
- isiZulu Novels: 5,909 tokens
Total 21,233 tokens. [en_ZA]
dc.format: text [en_ZA]
dc.format.extent: 21,233 tokens [en_ZA]
dc.format.medium: N/A [en_ZA]
dc.format.size: 200 kb [en_ZA]
dc.identifier.uri: https://hdl.handle.net/20.500.12185/671
dc.languages: isiZulu [en_ZA]
dc.media.category: annotated text corpus [en_ZA]
dc.media.type: Text [en_ZA]
dc.project: Linguistic corpus enrichment for South African languages [en_ZA]
dc.publisher: Centre for Text Technology (CTexT) [en_ZA]
dc.rights.license: CC BY 4.0 [en_ZA]
dc.subject: annotated [en_ZA]
dc.subject: part of speech [en_ZA]
dc.subject: domains [en_ZA]
dc.title: POS annotated corpus with 5 different text types for isiZulu [en_ZA]
dc.version: 1.0 [en_ZA]
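
The description states that each data file holds one token and its POS tag per line, separated by a tab. A minimal Python sketch of reading that layout follows. The function names, the sample tokens, and the tag labels (TAG_A and so on) are illustrative assumptions and are not taken from the corpus or its protocol; the handling of blank lines is likewise an assumption, since the record does not say whether the files contain any.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_tagged_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'token<TAB>tag' lines into (token, tag) pairs."""
    pairs = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Assumption: blank lines, if present, are separators and not tokens.
            continue
        token, sep, tag = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_number}: no tab separator in {line!r}")
        pairs.append((token, tag.strip()))
    return pairs


def tag_frequencies(pairs: List[Tuple[str, str]]) -> Counter:
    """Count how often each tag occurs."""
    return Counter(tag for _, tag in pairs)


# Made-up sample lines; the tags are placeholders, not the real tag set.
sample = ["Ngiyabonga\tTAG_A\n", "kakhulu\tTAG_B\n", "\n", ".\tTAG_C\n"]
pairs = parse_tagged_lines(sample)
print(len(pairs), "tokens")
print(tag_frequencies(pairs).most_common())
```

To apply this to a real file, pass an open file handle (for example `open(path, encoding="utf-8")`, where UTF-8 is an assumption about the encoding) to `parse_tagged_lines`. The length of the returned list can then be compared with the per-file token counts listed in the description.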

Files

Original bundle

Name: POS_genre_ZU.zip
Size: 177.02 KB
Format: ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed.
Description: POS annotated corpora for 5 text types and POS annotation protocol

Name: README.POS.Final.2024-01-31.txt
Size: 2.64 KB
Format: Plain Text
Description: Read Me
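
Since the corpora are distributed as a ZIP archive, per-file token counts can be checked without unpacking it. The sketch below uses Python's standard `zipfile` module. The member names inside POS_genre_ZU.zip are not listed in this record, so the code simply visits every `.txt` member; the UTF-8 encoding and the demo archive contents are assumptions made for illustration. If the annotation protocol is itself a `.txt` file containing tabs, it would be counted too and should be excluded by name.

```python
import io
import zipfile
from typing import Dict


def count_tokens_per_file(archive) -> Dict[str, int]:
    """Count lines containing a tab in each .txt member of a ZIP archive.

    `archive` may be a file path or a binary file-like object.
    """
    counts = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(".txt"):
                continue  # skip non-text members
            with zf.open(name) as member:
                # Assumption: the text files are UTF-8 encoded.
                text = io.TextIOWrapper(member, encoding="utf-8")
                counts[name] = sum(1 for line in text if "\t" in line)
    return counts


# Small in-memory archive with hypothetical member names, for demonstration only.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("example_genre.txt", "token1\tTAG_A\ntoken2\tTAG_B\n")
    zf.writestr("protocol.pdf", b"not a corpus file")
demo.seek(0)
print(count_tokens_per_file(demo))
```

With the downloaded archive, the call would be `count_tokens_per_file("POS_genre_ZU.zip")`.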

License bundle

Name: license.txt
Size: 3.22 KB
Format: Item-specific license agreed upon to submission
Description: