Project: SADiLaR IV (Extension): Update and extension of linguistic resources and core technologies for South African languages
Type: 75,017 tokens in Sesotho sa Leboa/Sepedi (nso_ZA) annotated for part-of-speech for five different text types (Final version).
Languages: Sesotho sa Leboa/Sepedi (nso_ZA)
Date: 2026-03-31
Version: 1.0 (Final)

Description:
This deliverable contains part-of-speech tagged data from five different text types for Sepedi. The text types included are:
- CAPS gr12 (Academic)
- MA/PhD Theses (Academic)
- Magazines (Non-Academic)
- News (Non-Academic)
- Novels (Fiction)

The data is given as txt files where each line contains a token and the corresponding POS tag, tab-separated. Each text type data file contains 13,000+ tokens, amounting to a total of 75,017 tokens for the language. Please see the included protocol for more details on the POS tags used.

This data is a combination of new data with the previously published smaller data set "POS annotated corpus in 5 different genres for Sepedi" https://hdl.handle.net/20.500.12185/670.

For more detailed information, please see:
Tanja Gaustad, Roald Eiselen, Cindy McKellar (2026). Extension of Linguistic Resources for South African Languages: Part-of-Speech Annotated Domain-Specific Data. Proceedings of the Seventh Workshop on Resources for African Indigenous Languages (RAIL) (collocated with LREC 2026).
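Reading the data (illustrative sketch):
The file layout described above (one token and its POS tag per line, separated by a tab) can be read with a few lines of Python. The sketch below is not part of the deliverable. It assumes UTF-8 encoding, exactly two columns per line, and that any blank lines carry no token; the function names parse_pos_lines, read_pos_file and tag_counts are hypothetical helpers introduced only for this example.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def parse_pos_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, tag) pairs from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # Assumption: blank lines (if present) carry no token.
            continue
        token, sep, tag = line.partition("\t")
        if not sep:
            raise ValueError(f"expected 'token<TAB>tag', got: {line!r}")
        yield token, tag


def read_pos_file(path: str, encoding: str = "utf-8") -> List[Tuple[str, str]]:
    """Read one text-type file into a list of (token, tag) pairs.

    Assumption: the files are UTF-8 encoded.
    """
    with Path(path).open(encoding=encoding) as handle:
        return list(parse_pos_lines(handle))


def tag_counts(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each POS tag occurs."""
    return Counter(tag for _, tag in pairs)
```

For example, read_pos_file("some_text_type_file.txt") (a placeholder file name, not one taken from the deliverable) would return the token/tag pairs for one text type. The length of that list can then be compared with the token counts listed under Contents.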
Contents

Language and text type                    | Tokens  |
--------------------------------------------------------------------------------
Sepedi CAPS gr12 (POS annotated)          | 15,510  |
Sepedi MA/PhD Theses (POS annotated)      | 14,721  |
Sepedi Magazines (POS annotated)          | 13,320  |
Sepedi News (POS annotated)               | 16,475  |
Sepedi Novels (POS annotated)             | 14,991  |
Total Sepedi:                               75,017
--------------------------------------------------------------------------------

SADiLaR website: https://sadilar.org
_________________________________________________________________________________

Licence for final (v1.0) distribution: Creative Commons Attribution 4.0 International
URL: http://creativecommons.org/licenses/by/4.0/
Attribute work to: CTexT® (Centre for Text Technology, North-West University), South Africa; SADiLaR (South African Centre for Digital Language Resources), South Africa.
Attribute work to URL: http://humanities.nwu.ac.za/ctext
                       https://sadilar.org