Department of Science, Technology and Innovation
CLARIN in South Africa

Sepedi Domain corpus POS annotated (5 domains)

dc.contact.email: tanja.gaustad@nwu.ac.za
dc.contact.name: Tanja Gaustad
dc.contributor.author: Gaustad, Tanja
dc.contributor.other: McKellar, Cindy
dc.contributor.other: Gent, Sunny
dc.date.accessioned: 2026-03-26T12:55:06Z
dc.date.available: 2026-03-26T12:55:06Z
dc.date.issued: 2026-03-31
dc.description: This deliverable contains part-of-speech tagged data from five different text types for Sepedi. The text types included are:
- CAPS gr12 (Academic)
- MA/PhD Theses (Academic)
- Magazines (Non-Academic)
- News (Non-Academic)
- Novels (Fiction)
The data is given as txt files where each line contains a token and the corresponding POS tag, tab separated. Each text type data file contains 13,000+ tokens, amounting to a total of 75,017 tokens for the language. Please see the included protocol for more details on the POS tags used.
This data is a combination of new data with the previously published smaller data set "POS annotated corpus in 5 different genres for Sepedi" https://hdl.handle.net/20.500.12185/670.
Please see Tanja Gaustad, Roald Eiselen, Cindy McKellar (2026). Extension of Linguistic Resources for South African Languages: Part-of-Speech Annotated Domain-Specific Data. Proceedings of the Seventh Workshop on Resources for African Indigenous Languages (RAIL) (collocated with LREC 2026) for more detailed information.
dc.format: text
dc.format.extent: 75017 tokens
dc.format.medium: N/A
dc.format.size: 1 Mb
dc.identifier.uri: https://hdl.handle.net/20.500.12185/700
dc.languages: Sepedi
dc.media.category: annotated domain-specific corpus
dc.media.type: Text
dc.project: Update and extension of linguistic resources and core technologies for South African languages
dc.publisher: North-West University - Centre for Text Technology (CTexT)
dc.rights.license: Creative Commons Attribution 4.0 International
dc.subject: Sepedi, POS annotated, domain-specific, annotated corpus
dc.title: Sepedi Domain corpus POS annotated (5 domains)
dc.version: 1.0
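
The description states that each data file holds one token and its POS tag per line, separated by a tab. A minimal sketch of reading that layout is given below. It is not part of the deliverable: the function names `read_tagged_lines` and `tag_counts` are hypothetical, the sample tokens and tags are placeholders rather than real Sepedi data or tags from the protocol, and skipping blank lines is an assumption about how the files may be laid out.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_tagged_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, tag) pairs from lines of the form token<TAB>tag."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # Assumption: blank lines (if any) carry no token and are skipped.
            continue
        token, sep, tag = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated line, got: {line!r}")
        yield token, tag


def tag_counts(lines: Iterable[str]) -> Counter:
    """Count how often each tag occurs in the given lines."""
    return Counter(tag for _, tag in read_tagged_lines(lines))


if __name__ == "__main__":
    # Placeholder data only; real tokens and tags come from the corpus files.
    sample = ["tokenA\tTAG1\n", "tokenB\tTAG2\n", "\n", "tokenC\tTAG1\n"]
    print(len(list(read_tagged_lines(sample))))
    print(tag_counts(sample).most_common())
```

To apply this to one of the listed files, the lines could be taken from an open file handle, for example `open("SAD-IV.News.POS.2026-03-23.nso.txt", encoding="utf-8")`. The UTF-8 encoding is an assumption here; the included README may state otherwise.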

Files

Original bundle

Now showing 1 - 5 of 7

Name: Protocol.SADiLaR.PartOfSpeechTaggingSesothosaLeboa.Final.2026-03-31.docx
Size: 56.89 KB
Format: Microsoft Word XML

Name: README.POS.GenreData.Final.nso.2026-03-31.txt
Size: 2.37 KB
Format: Plain Text

Name: SAD-IV.Caps.POS.2026-03-23.nso.txt
Size: 178.2 KB
Format: Plain Text

Name: SAD-IV.Magazines.POS.2026-03-23.nso.txt
Size: 150.89 KB
Format: Plain Text

Name: SAD-IV.News.POS.2026-03-23.nso.txt
Size: 186.29 KB
Format: Plain Text

License bundle

Now showing 1 - 1 of 1

Name: license.txt
Size: 3.22 KB
Format: Item-specific license agreed upon to submission