Afrikaans text unit identification data

dc.contact.email: martin.puttkammer@nwu.ac.za [en_ZA]
dc.contact.name: Martin Puttkammer [en_ZA]
dc.contributor.author: Puttkammer, Martin
dc.date.accessioned: 2019-04-15T14:05:43Z
dc.date.available: 2019-04-15T14:05:43Z
dc.date.issued: 2006
dc.description: This dataset was developed during a master's degree and used in the development of a text unit identifier capable of tagging sentences, named entities, words, abbreviations and punctuation in Afrikaans text. The dataset consists of 39,762 tokens, containing 3,294 named entities in 1,581 sentences. The data was manually annotated by the author and verified by an independent linguist according to the tagset developed during the same study. Details on the annotation and tagset used are available in the publication mentioned above in (2). The data is also presented in CoNLL-2002 format (Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Available at: https://www.aclweb.org/anthology/W02-2024). [en_ZA]
dc.format: txt [en_ZA]
dc.format.extent: 39,762 tokens [en_ZA]
dc.format.medium: N/A [en_ZA]
dc.format.size: 195,324 bytes (zipped) [en_ZA]
dc.identifier.citation: Puttkammer, M.J. 2006. Outomatiese Afrikaanse tekseenheididentifisering. Potchefstroom: North-West University. (Dissertation - MA). [en_ZA]
dc.identifier.uri: https://hdl.handle.net/20.500.12185/507
dc.language.iso: afr
dc.languages: Afrikaans [en_ZA]
dc.media.category: Monolingual text corpus: annotated [en_ZA]
dc.media.type: Text [en_ZA]
dc.publisher: Centre for Text Technology, North-West University [en_ZA]
dc.rights.license: Creative Commons Attribution 4.0 International: https://creativecommons.org/licenses/by/4.0/ [en_ZA]
dc.subject: Afrikaans, Tokenisation, Sentence recognition, Named-entity recognition, sentence, named-entity, word, token [en_ZA]
dc.title: Afrikaans text unit identification data [en_ZA]
dc.version: 1.0 [en_ZA]
local.collection.primary: Resource Catalogue
local.collection.secondary: Resource Index
local.url: http://humanities.nwu.ac.za/ctext [en_ZA]
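
The description states that the data is also provided in CoNLL-2002 format. As a rough illustration of how such a file might be read, the sketch below assumes the usual CoNLL-2002 layout: one token per line with whitespace-separated columns, the tag in the last column, and a blank line between sentences. The function name `read_conll` and the tags in the sample are hypothetical; the actual file names, column layout and tagset of this dataset are not given in this record and should be checked against the files in the archive.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences.

    Assumes one token per line, whitespace-separated columns with the
    token first and the tag last, and blank lines between sentences.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append((fields[0], fields[-1]))
    # Flush the last sentence when the input does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample; the tags are placeholders, not the dataset's tagset.
sample = [
    "Jan B-PER",
    "Smit I-PER",
    "woon O",
    "in O",
    "Pretoria B-LOC",
    ". O",
    "",
    "Hy O",
    "werk O",
    ". O",
]

sentences = read_conll(sample)
token_count = sum(len(sentence) for sentence in sentences)
print(len(sentences), token_count)
```

Applied to the real data (for example by passing an open file object to `read_conll`), the sentence and token counts could be compared with the 1,581 sentences and 39,762 tokens reported in the description, provided the layout assumptions above hold.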

Files

Original bundle

Name: TEI_v1.0.rar
Size: 190.75 KB
Format: RAR (Roshal Archive), a compressed container that holds one or more files and folders and requires extraction software to open.
Description: All files zipped

License bundle

Name: license.txt
Size: 3.23 KB
Format: Item-specific license agreed upon to submission