Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/700'
Sepedi Domain corpus POS annotated (5 domains)
Date
2026-03-31
Authors
Gaustad, Tanja
Publisher
North-West University - Centre for Text Technology (CTexT)
Description
This deliverable contains part-of-speech tagged data from five different text types for Sepedi.
The text types included are:
- CAPS gr12 (Academic)
- MA/PhD Theses (Academic)
- Magazines (Non-Academic)
- News (Non-Academic)
- Novels (Fiction)
The data is provided as plain-text (.txt) files in which each line contains one token and its corresponding POS tag, separated by a tab.
Each text type data file contains 13,000+ tokens, amounting to a total of 75,017 tokens for the language. Please see the included protocol for more details on the POS tags used.
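The file layout described above (one token and its POS tag per line, tab separated) can be read with a few lines of Python. This is a minimal sketch, not part of the deliverable: the function names `read_tagged_file` and `tag_counts` are hypothetical, and UTF-8 encoding and the handling of blank lines are assumptions that should be checked against the included protocol.

```python
from collections import Counter


def read_tagged_file(path):
    """Return a list of (token, tag) pairs from a tab-separated text file."""
    pairs = []
    # Assumption: the files are UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line.strip():
                # Assumption: blank lines, if present, carry no token.
                continue
            # Split on the first tab only: token on the left, tag on the right.
            token, _, tag = line.partition("\t")
            pairs.append((token, tag))
    return pairs


def tag_counts(pairs):
    """Count how often each POS tag occurs in a list of (token, tag) pairs."""
    return Counter(tag for _, tag in pairs)
```

Calling `read_tagged_file` on one of the text type files and taking `len()` of the result gives that file's token count, which can be compared with the figures stated above; `tag_counts` gives the tag distribution for the same file.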
This data set combines new data with the previously published, smaller data set "POS annotated corpus in 5 different genres for Sepedi" (https://hdl.handle.net/20.500.12185/670).
Please see Tanja Gaustad, Roald Eiselen, Cindy McKellar (2026). Extension of Linguistic Resources for South African Languages: Part-of-Speech Annotated Domain-Specific Data. Proceedings of the Seventh Workshop on Resources for African Indigenous Languages (RAIL) (collocated with LREC 2026) for more detailed information.
License
Creative Commons Attribution 4.0 International
Verification status
Level 0


