Linguistically enriched corpora for conjunctively written South African languages

dc.contact.email: tanja.gaustad@nwu.ac.za [en_ZA]
dc.contact.name: Tanja Gaustad [en_ZA]
dc.contributor.author: Puttkammer, Martin
dc.contributor.author: Gaustad, Tanja
dc.contributor.other: Pienaar, Wikus
dc.contributor.other: du Toit, Jaco
dc.contributor.other: Gent, Sunny
dc.date.accessioned: 2021-09-30T12:41:11Z
dc.date.available: 2021-09-30T12:41:11Z
dc.date.issued: 2021-09
dc.description: This resource contains linguistically annotated data for four official South African languages with a conjunctive orthography from the Nguni family (isiNdebele, isiXhosa, isiZulu and Siswati) as well as English. The data set is parallel for all five languages and the Nguni languages have been annotated for three different types of linguistic information: morphology, part-of-speech and lemmas. We have also included the protocols and tagsets used during annotation. [en_ZA]
dc.format: text [en_ZA]
dc.format.extent: min. 50'000 tokens per language [en_ZA]
dc.format.medium: N/A [en_ZA]
dc.format.size: 10Mb [en_ZA]
dc.identifier.citation: https://doi.org/10.1016/j.dib.2022.107994
dc.identifier.uri: https://hdl.handle.net/20.500.12185/546
dc.languages: English [en_ZA]
dc.languages: isiNdebele [en_ZA]
dc.languages: isiXhosa [en_ZA]
dc.languages: isiZulu [en_ZA]
dc.languages: Siswati [en_ZA]
dc.media.category: Parallel multilingual annotated text corpus [en_ZA]
dc.media.type: Text [en_ZA]
dc.project: Linguistic corpus enrichment for conjunctively written South African languages [en_ZA]
dc.publisher: North-West University, Centre for Language Technology (CTexT) [en_ZA]
dc.rights.license: CC BY 4.0: https://creativecommons.org/licenses/by/4.0/ [en_ZA]
dc.subject: Nguni languages [en_ZA]
dc.subject: POS [en_ZA]
dc.subject: Morphology [en_ZA]
dc.subject: Lemma [en_ZA]
dc.subject: Parallel data [en_ZA]
dc.title: Linguistically enriched corpora for conjunctively written South African languages [en_ZA]
dc.version: 1.0 [en_ZA]

Files

Original bundle

Name:
SADII_CTexT_2022-03-14.zip
Size:
3.45 MB
Format:
ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed.
Description:
Archive containing a Readme, a train folder with 5 text files, a test folder with 5 text files, and a folder with all annotation protocols.

License bundle

Name:
license.txt
Size:
3.23 KB
Format:
Item-specific license agreed to upon submission
Description:
