Project:    SADiLaR IV (Extension): Update and extension of linguistic resources and core technologies for South African languages
Type:       60,809 tokens in Afrikaans (af_ZA) annotated for part-of-speech for five different text types (Final version).
Languages:  Afrikaans (af_ZA)
Date:       2026-05-13
Version:    1.0 (Final)

Description:
This deliverable contains part-of-speech tagged data from five different text types for Afrikaans.
The text types included are:
- CAPS gr12 (Academic)
- MA/PhD Theses (Academic)
- Magazines (Non-Academic)
- News (Non-Academic)
- Novels (Fiction)

The data is given as txt files where each line contains a token and the corresponding POS tag, tab-separated.
Each text type data file contains 11,000+ tokens, amounting to a total of 60,809 tokens for the language.
Please see the included protocol for more details on the POS tags used.

Contents

Language and text type                      | Tokens  |
--------------------------------------------------------------------------------
Afrikaans CAPS gr12 (POS annotated)         | 11,698  |
Afrikaans MA/PhD Theses (POS annotated)     | 12,251  |
Afrikaans Magazines (POS annotated)         | 12,570  |
Afrikaans News (POS annotated)              | 11,955  |
Afrikaans Novels (POS annotated)            | 12,335  |
Total Afrikaans:                              60,809
--------------------------------------------------------------------------------

SADiLaR website: https://sadilar.org
_________________________________________________________________________________

Licence for final (v1.0) distribution:
Creative Commons Attribution 4.0 International
URL: http://creativecommons.org/licenses/by/4.0/

Attribute work to:
CTexT® (Centre for Text Technology, North-West University), South Africa;
SADiLaR (South African Centre for Digital Language Resources), South Africa.

Attribute work to URL:
http://humanities.nwu.ac.za/ctext
https://sadilar.org
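
_________________________________________________________________________________

Reading the data (illustrative sketch):
The file format described above (one token and its POS tag per line, tab-separated) can be read with a few lines of Python. The sketch below is not part of the deliverable. It assumes UTF-8 encoding and that blank lines, if present, can be skipped; neither is stated in this README. The function names read_pos_file and count_tags are hypothetical, and the tags in the sample are placeholders rather than tags from the included protocol.

```python
import io
from collections import Counter


def read_pos_file(handle):
    """Yield (token, tag) pairs from an open text handle.

    Assumes one tab-separated token/tag pair per line. Blank lines
    are skipped (an assumption; the README does not mention them).
    """
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        token, separator, tag = line.partition("\t")
        if not separator or not token or not tag:
            raise ValueError(f"line {line_number}: expected 'token<TAB>tag'")
        yield token, tag


def count_tags(pairs):
    """Return a Counter mapping each POS tag to its frequency."""
    return Counter(tag for _, tag in pairs)


# Placeholder sample; the real tag set is defined in the included protocol.
sample = io.StringIO("Die\tTAG_A\nkat\tTAG_B\nslaap\tTAG_C\n.\tTAG_D\n")
pairs = list(read_pos_file(sample))
tag_counts = count_tags(pairs)
```

To read one of the delivered files, pass an opened file instead of the sample, for example open(path, encoding="utf-8"), where path points at one of the text type data files. Summing the number of pairs over all five files should reproduce the token counts listed under Contents, provided the files contain no lines other than token/tag pairs.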