Linguistically enriched corpora for conjunctively written South African languages
Title | Linguistically enriched corpora for conjunctively written South African languages |
Description | This resource contains linguistically annotated data for four official South African languages with a conjunctive orthography from the Nguni family (isiNdebele, isiXhosa, isiZulu and Siswati) as well as English. The data set is parallel for all five languages and the Nguni languages have been annotated for three different types of linguistic information: morphology, part-of-speech and lemmas. We have also included the protocols and tagsets used during annotation. |
Contact name | Tanja Gaustad |
Contact email | tanja.gaustad@nwu.ac.za |
Publisher(s) | North-West University, Centre for Language Technology (CTexT) |
License | CC BY 4.0: https://creativecommons.org/licenses/by/4.0/ |
Language(s) | English; isiNdebele; isiXhosa; isiZulu; Siswati |
Author(s) | Puttkammer, Martin; Gaustad, Tanja |
Contributor | Pienaar, Wikus; du Toit, Jaco; Gent, Sunny |
Subject | Nguni languages; POS; Morphology; Lemma; Parallel data |
Citation | https://doi.org/10.1016/j.dib.2022.107994 |
URI | https://hdl.handle.net/20.500.12185/546 |
Media type | Text |
Media category | Parallel multilingual annotated text corpus |
Format extent | min. 50,000 tokens per language |
Version | 1.0 |
Format size | 10 MB |
Format medium | N/A |
Project | Linguistic corpus enrichment for conjunctively written South African languages |
Submit date | 2021-09-30T12:41:11Z |
Date available | 2021-09-30T12:41:11Z |
Date created | 2021-09 |
This item appears in the following Collection(s)
Resource Index [412]
A collection of language resource metadata, mostly collected during the NHN-funded technology audit of 2009 and the SADiLaR technology audit of 2018. Not all resources in this collection are available for download.