Core technologies for conjunctively written South African languages

Date

2021-03-31

Authors

Du Toit, Jaco
Puttkammer, Martin

Publisher

North-West University, Centre for Language Technology (CTexT)

Description

During this SADiLaR-funded project, enriched corpora were developed for the four official South African languages with a conjunctive orthography: isiNdebele (NR), isiXhosa (XH), isiZulu (ZU), and Siswati (SS). Each language’s corpus consists of approximately 50,000 tokens and is parallel at sentence level, with English as the source language. Each corpus was annotated on three levels: morphological analysis, part of speech (POS), and lemmatisation (see: https://repo.sadilar.org/handle/20.500.12185/546). Using the annotated data, 12 core technologies (a morphological analyser, a POS tagger, and a lemmatiser for each of the four languages) were developed and packaged in a single graphical user interface (UI).

Verification status

Level 0