Core technologies for conjunctively written South African languages
During this SADiLaR-funded project, enriched corpora were developed for the four official South African languages with a conjunctive orthography: isiNdebele (NR), isiXhosa (XH), isiZulu (ZU), and Siswati (SS). Each language’s corpus consists of approximately 50,000 tokens and is parallel at sentence level, with English as the source language. Each corpus was annotated on three levels, namely morphological analysis, part of speech, and lemmatisation (see: https://repo.sadilar.org/handle/20.500.12185/546). Using the annotated data, 12 core technologies were developed, i.e. a morphological analyser, a POS tagger, and a lemmatiser for each of the four languages, and packaged in a single graphical user interface (UI).
Martin Puttkammer
Martin.Puttkammer@nwu.ac.za
North-West University, Centre for Language Technology (CTexT)
isiNdebele; isiXhosa; isiZulu; Siswati
Du Toit, Jaco; Puttkammer, Martin
Gent, Sunny; Gaustad, Tanja
part of speech; part of speech tagging; part-of-speech tagging; part-of-speech; lemma; lemmatisation; lemmatization; morphology; morphological analysis; conjunctive languages; Nguni languages
https://hdl.handle.net/20.500.12185/548
Modules
Linguistic corpus enrichment for conjunctively written South African languages
2021-12-06T10:57:03Z
2021-12-06T10:57:03Z
2021-03-31


This item appears in the following Collection(s)

  • Resource Catalogue [350]
    A collection of language resources available for download from the RMA of SADiLaR. The collection mostly consists of resources developed with funding from the Department of Arts and Culture.
