Core technologies for conjunctively written South African languages
License agreement
By downloading this resource I accept and agree to the terms of use and the associated license conditions under which the resource is distributed.
Download
MD5: 6b4cd480fb015e51082e12a765b1ba38
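The MD5 value above can be used to check that a download completed without corruption. The following is a minimal sketch in Python, not part of the resource itself; the file name `downloaded_resource.zip` and the helper names `md5_of_file` and `matches_checksum` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Checksum published on this page.
EXPECTED_MD5 = "6b4cd480fb015e51082e12a765b1ba38"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.strip().lower()


if __name__ == "__main__":
    # Hypothetical file name; replace with the name of the downloaded file.
    archive = Path("downloaded_resource.zip")
    if archive.exists():
        print("checksum OK" if matches_checksum(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found")
```

A mismatch usually indicates an incomplete or altered download, in which case downloading the file again is the simplest remedy.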
Collections
- Resource Catalogue
Author(s)
Du Toit, Jaco
Puttkammer, Martin
Description
During this SADiLaR-funded project, enriched corpora were developed for the four official South African languages with a conjunctive orthography:
isiNdebele (NR), isiXhosa (XH), isiZulu (ZU), and Siswati (SS). The corpus for each language consists of approximately 50,000 tokens
and is parallel at sentence level, with English as the source language. Each corpus was annotated on three levels: morphological analysis, part of speech, and lemmatisation (see: https://repo.sadilar.org/handle/20.500.12185/546).
Using the annotated data, 12 core technologies (a morphological analyser, a POS tagger, and a lemmatiser for each of the four languages) were developed and packaged in a single graphical user interface (UI).
Contact person
Martin Puttkammer
Contact person's e-mail address
Martin.Puttkammer@nwu.ac.za
Publisher(s)
North-West University, Centre for Language Technology (CTexT)