Afrikaans text unit identification data

Puttkammer, Martin

Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/507'

Files

TEI_v1.0.rar (190.75 KB)

Deposit Licenses

license.txt (3.23 KB)

Date

2006

Authors

Puttkammer, Martin

Publisher

Centre for Text Technology, North-West University

Description

This dataset was developed during a master's degree and used in the development of a text unit identifier capable of tagging sentences, named entities, words, abbreviations and punctuation in Afrikaans text. The dataset comprises 39,762 tokens in 1,581 sentences, including 3,294 named entities. The data was manually annotated by the author and verified by an independent linguist according to the tagset developed during the same study. Details on the annotation and the tagset are available in the associated publication (see Citation). The data is also presented in CoNLL-2002 format (Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Available at: https://www.aclweb.org/anthology/W02-2024).
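
As a rough illustration of how data in the CoNLL format mentioned above might be read, the following Python sketch groups lines into sentences. It rests on several assumptions not confirmed by this record: one token per line, whitespace-separated columns with the token first and the tag last, and a blank line between sentences. The function name read_conll, the sample text and the tags (B-PER, B-LOC, O) are hypothetical and are not taken from the dataset or its tagset.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence (assumed convention).
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: first column is the token, last column is the tag.
        current.append((columns[0], columns[-1]))
    if current:
        # Keep the final sentence when the input lacks a trailing blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample; the tags are illustrative, not the dataset's tagset.
sample = """Jan B-PER
woon O
in O
Potchefstroom B-LOC
. O

Hy O
werk O
. O
"""

sentences = read_conll(sample.splitlines())
token_count = sum(len(sentence) for sentence in sentences)
```

If the files follow these conventions, the resulting sentence and token counts could be compared with the figures reported in the description as a basic sanity check.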

Keywords

Afrikaans, Tokenisation, Sentence recognition, Named-entity recognition, sentence, named-entity, word, token

Citation

Puttkammer, M.J. 2006. Outomatiese Afrikaanse tekseenheididentifisering. Potchefstroom: North-West University. (Dissertation - MA).

License

Creative Commons Attribution 4.0 International

URI

https://hdl.handle.net/20.500.12185/507

Collections

Resource Catalogue
Resource Index

Verification status

Level 0
