POS annotated corpus in 5 different genres for Sepedi

Gaustad, Tanja

Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/670'

Files

README.POS.Final.2024-01-31.txt (2.64 KB)

POS_genre_NSO.zip (148.27 KB)

Date

2024-01-31

Authors

Gaustad, Tanja

Publisher

Centre for Text Technology (CTexT)

Description

This corpus contains POS annotated data in 5 different genres for Sepedi. The text types included are:
- CAPS gr12 (Academic) - https://www.education.gov.za/Curriculum/NationalSeniorCertificate(NSC)Examinations.aspx
- PhD Theses (Academic) - for Sepedi, https://repository.up.ac.za/
- Magazines (Non-Academic) - CTexT acquired data from Pula Imvula
- News (Non-Academic) - for Sepedi, CTexT acquired data
- Novels (Fiction) - SADiLaR acquired data from OUP and Shuter and Shooter

For Sepedi, the data was tagged using the NCHLT web services. The POS tags were then converted to the latest POS tag set (see protocol). The data is given as txt files in which each line contains a token and its corresponding POS tag, separated by a tab. Each text type data file contains approximately 5,000 tokens, amounting to a total of 25,000 tokens per language. Please see the protocols for more details on the POS tags used.

Contents:
- Sepedi CAPS gr12 - 6,634 tokens
- Sepedi PhD Theses - 7,395 tokens
- Sepedi Magazines - 5,547 tokens
- Sepedi News - 8,782 tokens
- Sepedi Novels - 6,924 tokens

Total Sepedi: 30,158 tokens.
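The one-token-per-line, tab-separated layout described above could be read along the following lines. This is a minimal sketch, not part of the corpus distribution: the function names `parse_pos_lines` and `tag_counts` are hypothetical, the sample tokens and tags are placeholders that do not come from the actual tag set, and skipping blank lines is an assumption about the files.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_pos_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'token<TAB>tag' lines into (token, tag) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # Assumption: blank lines, if present, carry no token.
            continue
        token, sep, tag = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated line, got: {line!r}")
        pairs.append((token, tag))
    return pairs


def tag_counts(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often each POS tag occurs."""
    return Counter(tag for _, tag in pairs)


# Placeholder data standing in for a few lines of one genre file.
sample = ["token1\tTAG_A\n", "token2\tTAG_B\n", "\n", "token3\tTAG_A\n"]
pairs = parse_pos_lines(sample)
counts = tag_counts(pairs)
print(len(pairs), dict(counts))
```

To read a real file, an open file handle can be passed in place of `sample`, for example `open(path, encoding="utf-8")`, where `path` points at one of the extracted txt files and UTF-8 is an assumed encoding. Comparing `len(pairs)` with the per-genre token counts listed in the description gives a quick sanity check on the download.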

Keywords

annotated, part of speech, domains

License

CC BY 4.0

URI

https://hdl.handle.net/20.500.12185/670

Collections

Resource Catalogue

Verification status

Level 0
