Linguistically enriched corpora for conjunctively written South African languages
Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/546'
dc.contact.email | tanja.gaustad@nwu.ac.za | en_ZA |
dc.contact.name | Tanja Gaustad | en_ZA |
dc.contributor.author | Puttkammer, Martin | |
dc.contributor.author | Gaustad, Tanja | |
dc.contributor.other | Pienaar, Wikus | |
dc.contributor.other | du Toit, Jaco | |
dc.contributor.other | Gent, Sunny | |
dc.date.accessioned | 2021-09-30T12:41:11Z | |
dc.date.available | 2021-09-30T12:41:11Z | |
dc.date.issued | 2021-09 | |
dc.description | This resource contains linguistically annotated data for four official South African languages with a conjunctive orthography from the Nguni family (isiNdebele, isiXhosa, isiZulu and Siswati) as well as English. The data set is parallel for all five languages and the Nguni languages have been annotated for three different types of linguistic information: morphology, part-of-speech and lemmas. We have also included the protocols and tagsets used during annotation. | en_ZA |
dc.format | text | en_ZA |
dc.format.extent | min. 50,000 tokens per language | en_ZA |
dc.format.medium | N/A | en_ZA |
dc.format.size | 10 MB | en_ZA |
dc.identifier.citation | https://doi.org/10.1016/j.dib.2022.107994 | |
dc.identifier.uri | https://hdl.handle.net/20.500.12185/546 | |
dc.languages | English | en_ZA |
dc.languages | isiNdebele | en_ZA |
dc.languages | isiXhosa | en_ZA |
dc.languages | isiZulu | en_ZA |
dc.languages | Siswati | en_ZA |
dc.media.category | Parallel multilingual annotated text corpus | en_ZA |
dc.media.type | Text | en_ZA |
dc.project | Linguistic corpus enrichment for conjunctively written South African languages | en_ZA |
dc.publisher | North-West University, Centre for Language Technology (CTexT) | en_ZA |
dc.rights.license | CC BY 4.0: https://creativecommons.org/licenses/by/4.0/ | en_ZA |
dc.subject | Nguni languages | en_ZA |
dc.subject | POS | en_ZA |
dc.subject | Morphology | en_ZA |
dc.subject | Lemma | en_ZA |
dc.subject | Parallel data | en_ZA |
dc.title | Linguistically enriched corpora for conjunctively written South African languages | en_ZA |
dc.version | 1.0 | en_ZA |
Files
Original bundle
- Name: SADII_CTexT_2022-03-14.zip
- Size: 3.45 MB
- Format: ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed.
- Description: Archive containing a Readme, a train folder with 5 text files, a test folder with 5 text files, and a folder with all annotation protocols
License bundle
- Name: license.txt
- Size: 3.23 KB
- Format: Item-specific license agreed to upon submission