isiZulu Domain corpus POS annotated (5 domains)
Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/701'
| dc.contact.email | tanja.gaustad@nwu.ac.za | |
| dc.contact.name | Tanja Gaustad | |
| dc.contributor.author | Gaustad, Tanja | |
| dc.contributor.other | McKellar, Cindy | |
| dc.contributor.other | Gent, Sunny | |
| dc.date.accessioned | 2026-03-26T12:56:39Z | |
| dc.date.available | 2026-03-26T12:56:39Z | |
| dc.date.issued | 2026-03-31 | |
| dc.description | This deliverable contains part-of-speech (POS) tagged data for isiZulu from five text types: CAPS gr12 (Academic); MA/PhD Theses (Academic); Magazines (Non-Academic); News (Non-Academic); Novels (Fiction). The data are provided as txt files in which each line contains a token and its POS tag, separated by a tab. Each text type data file contains 11,000+ tokens, for a total of 67,875 tokens for the language. See the included protocol for details on the POS tags used. This data set combines new data with the previously published smaller data set "POS annotated corpus with 5 different text types for isiZulu" (https://hdl.handle.net/20.500.12185/671). For more detailed information, see Tanja Gaustad, Roald Eiselen, Cindy McKellar (2026). Extension of Linguistic Resources for South African Languages: Part-of-Speech Annotated Domain-Specific Data. Proceedings of the Seventh Workshop on Resources for African Indigenous Languages (RAIL) (collocated with LREC 2026). | |
| dc.format | text | |
| dc.format.extent | 67875 tokens | |
| dc.format.medium | N/A | |
| dc.format.size | 1 Mb | |
| dc.identifier.uri | https://hdl.handle.net/20.500.12185/701 | |
| dc.languages | isiZulu | |
| dc.media.category | annotated domain-specific corpus | |
| dc.media.type | Text | |
| dc.project | Update and extension of linguistic resources and core technologies for South African languages | |
| dc.publisher | North-West University - Centre for Text Technology (CTexT) | |
| dc.rights.license | Creative Commons Attribution 4.0 International | |
| dc.subject | isiZulu, POS annotated, domain-specific, annotated corpus | |
| dc.title | isiZulu Domain corpus POS annotated (5 domains) | |
| dc.version | 1.0 | |
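
The description above states that each line of a data file holds a token and its POS tag, separated by a tab. A minimal Python sketch of reading that layout follows. It is an illustration only, not part of the deliverable: the function names are hypothetical, UTF-8 encoding is assumed, and skipping blank lines is an assumption, since the record does not say whether the files contain any.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def parse_pos_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, tag) pairs from lines of the form 'token<TAB>tag'."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # Assumption: blank lines (if any) carry no token and are skipped.
            continue
        token, sep, tag = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated line, got: {line!r}")
        yield token, tag.strip()


def read_pos_file(path: Path, encoding: str = "utf-8") -> List[Tuple[str, str]]:
    """Read one data file into a list of (token, tag) pairs (UTF-8 assumed)."""
    with path.open(encoding=encoding) as handle:
        return list(parse_pos_lines(handle))


def tag_counts(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each POS tag occurs."""
    return Counter(tag for _, tag in pairs)
```

Calling `read_pos_file` on one of the txt files and passing the result to `tag_counts` would give a per-tag frequency table; the tag inventory itself is defined in the included protocol document.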
Files
Original bundle
1 - 5 of 7
- Name: Protocol.SADiLaR.PartOfSpeechTaggingIsiZulu.Final.2026-03-31.docx
- Size: 53.44 KB
- Format: Microsoft Word XML
License bundle
1 - 1 of 1
- Name: license.txt
- Size: 3.22 KB
- Format: Item-specific license agreed to upon submission