AFRIBOOMS: NCHLT AFRIKAANS CORPUS WITH DEPENDENCY RELATIONS 
============================================================

BACKGROUND
------------------------------

This is the annotated Afrikaans corpus developed for the Afribooms
project.

The corpus includes annotations for lemma, part-of-speech (POS) and
dependency relations. Related corpora can be found at:
http://rma.nwu.ac.za

Please refer to the associated technical report for more details
regarding the development process and evaluation.

The corpus is in FoLiA XML (http://ilk.uvt.nl/folia) format using the
UTF-8 character encoding. It is divided into two sets, a train set
("nchlt_af_trn.xml") and a test set ("nchlt_afr_tst.xml").
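
As a rough illustration of how the files might be read, the Python
sketch below streams FoLiA-style XML and yields the text, lemma and
POS tag of each token. It assumes that tokens are <w> elements with
<t>, <lemma class="..."> and <pos class="..."> children, as in common
FoLiA usage; the exact layout of the corpus files has not been checked
here. SAMPLE is a hypothetical fragment, and local() and iter_tokens()
are illustrative helper names, not part of any FoLiA tooling.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical two-token fragment in FoLiA style (not taken from the corpus).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <text>
    <s xml:id="s.1">
      <w xml:id="s.1.w.1"><t>Die</t><pos class="DET"/><lemma class="die"/></w>
      <w xml:id="s.1.w.2"><t>boeke</t><pos class="NOUN"/><lemma class="boek"/></w>
    </s>
  </text>
</FoLiA>"""


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_tokens(source):
    """Yield (text, lemma, pos) for every <w> element in `source`.

    `source` is a file name or a binary file object. Assumes each <w>
    holds <t>, <lemma> and <pos> children (an assumption, see above).
    """
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "w":
            continue
        text = lemma = pos = None
        for child in elem:
            name = local(child.tag)
            if name == "t":
                text = child.text
            elif name == "lemma":
                lemma = child.get("class")
            elif name == "pos":
                pos = child.get("class")
        yield text, lemma, pos
        elem.clear()  # release the token's children to keep memory use low


tokens = list(iter_tokens(io.BytesIO(SAMPLE)))
pos_counts = Counter(pos for _, _, pos in tokens)
for tag, n in sorted(pos_counts.items()):
    print(f"{n:7d} {tag}")
```

Passing the path of the train or test file to iter_tokens() instead of
the in-memory sample should give a per-tag tally comparable to the
frequency table in the POS TAGSET section, provided the assumed layout
holds.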


POS TAGSET
------------------------------

The following POS tags are used in the corpus (given here with their
frequency of occurrence):

   3107 ADJ
   6241 ADP
   1679 ADV
   2872 CONJ
   4541 DET
  11145 NOUN
    714 NUM
   4089 PRON
   1927 PRT
   4516 PUNCT
   7581 VERB
    864 X

These tags correspond to the "Universal" POS tag set defined in [1].


DEPENDENCY TAGSET
------------------------------

The following dependency relations are used (with frequencies):

     63 abbrev
   2447 amod
   1130 arg
   2534 aux
   1886 cc
      3 comp
   2359 conj
   1009 dep
   5101 det
    111 dobj
      5 mark
  11120 mod
    462 num
   3763 obj
   6106 pobj
    830 poss
   1375 prt
   4497 punct
   1870 root
   2605 subj

These relations correspond to the Stanford tagset described in [2, 3].
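
A frequency list like the one above could be recomputed with a short
Python sketch such as the following. It assumes that each relation is
stored as a <dependency class="..."> element, as in common FoLiA
usage; how the "root" relation is encoded in the corpus files has not
been checked here. SAMPLE is a hypothetical fragment, and local() and
count_relations() are illustrative names.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical three-token sentence with two relations (not from the corpus).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <text>
    <s xml:id="s.1">
      <w xml:id="s.1.w.1"><t>Die</t><pos class="DET"/></w>
      <w xml:id="s.1.w.2"><t>kat</t><pos class="NOUN"/></w>
      <w xml:id="s.1.w.3"><t>slaap</t><pos class="VERB"/></w>
      <dependencies>
        <dependency class="det">
          <hd><wref id="s.1.w.2"/></hd>
          <dep><wref id="s.1.w.1"/></dep>
        </dependency>
        <dependency class="subj">
          <hd><wref id="s.1.w.3"/></hd>
          <dep><wref id="s.1.w.2"/></dep>
        </dependency>
      </dependencies>
    </s>
  </text>
</FoLiA>"""


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def count_relations(source):
    """Count <dependency> elements in `source` by their class attribute."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) == "dependency":
            counts[elem.get("class")] += 1
            elem.clear()  # the head and dependent references are not needed here
    return counts


relation_counts = count_relations(io.BytesIO(SAMPLE))
for rel, n in sorted(relation_counts.items()):
    print(f"{n:7d} {rel}")
```

The loop prints one right-aligned count per relation, in the same
layout as the list above.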


KNOWN ISSUES
------------------------------

As mentioned in the technical report, the annotation of this corpus
primarily focussed on identifying and correcting the structure of
dependencies in each sentence. The result is that "generic" tags are
often used where more specific ones exist in the adopted
tagset. Improving the accuracy of tags in this regard is left as
future work. Examples of more specific tags that should be considered
for systematic improvement are "rel", "neg", "prep", "advcl", "appos"
and "iobj".

Also, the "poss" tag is currently used in two different ways, one of
which is to indicate the indirect object (instead of linking it to the
verb with "iobj").
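
One possible way to triage the two uses of "poss" is to tabulate the
POS tags of the head and the dependent of every "poss" relation and
then inspect each pattern by hand. The Python sketch below does this
under the assumptions that relations are <dependency> elements whose
<hd> and <dep> children hold <wref id="..."> references to <w>
elements, and that each <w> has an xml:id and a <pos class="...">
child. This README does not say which POS pattern belongs to which
use, so the table is only a starting point for manual review. SAMPLE
is a hypothetical sentence, and local() and poss_profile() are
illustrative names.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Attribute key that ElementTree uses for xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sentence "Hy lees sy boek" (not taken from the corpus).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <text>
    <s xml:id="s.1">
      <w xml:id="s.1.w.1"><t>Hy</t><pos class="PRON"/></w>
      <w xml:id="s.1.w.2"><t>lees</t><pos class="VERB"/></w>
      <w xml:id="s.1.w.3"><t>sy</t><pos class="PRON"/></w>
      <w xml:id="s.1.w.4"><t>boek</t><pos class="NOUN"/></w>
      <dependencies>
        <dependency class="subj">
          <hd><wref id="s.1.w.2"/></hd><dep><wref id="s.1.w.1"/></dep>
        </dependency>
        <dependency class="obj">
          <hd><wref id="s.1.w.2"/></hd><dep><wref id="s.1.w.4"/></dep>
        </dependency>
        <dependency class="poss">
          <hd><wref id="s.1.w.4"/></hd><dep><wref id="s.1.w.3"/></dep>
        </dependency>
      </dependencies>
    </s>
  </text>
</FoLiA>"""


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def poss_profile(source):
    """Count "poss" relations by (head POS, dependent POS)."""
    root = ET.parse(source).getroot()

    # Map each token id to its POS tag.
    pos_of = {}
    for elem in root.iter():
        if local(elem.tag) == "w":
            for child in elem:
                if local(child.tag) == "pos":
                    pos_of[elem.get(XML_ID)] = child.get("class")

    profile = Counter()
    for elem in root.iter():
        if local(elem.tag) != "dependency" or elem.get("class") != "poss":
            continue
        ends = {}
        for part in elem:  # expected to be <hd> and <dep>
            role = local(part.tag)
            for ref in part:
                if local(ref.tag) == "wref":
                    ends[role] = pos_of.get(ref.get("id"))
        profile[(ends.get("hd"), ends.get("dep"))] += 1
    return profile


profile = poss_profile(io.BytesIO(SAMPLE))
for (head_pos, dep_pos), n in sorted(profile.items()):
    print(f"{n:7d} head={head_pos} dep={dep_pos}")
```

The whole file is parsed into memory here because the relations refer
back to tokens by id; for files of the size implied by the frequency
tables this should not be a problem.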


REFERENCES
------------------------------

[1] S. Petrov, D. Das, and R. McDonald, "A universal part-of-speech
    tagset," in Proceedings of the 8th Conference on Language
    Resources and Evaluation, Istanbul, Turkey, 2012.

[2] M.-C. De Marneffe and C. D. Manning, "The Stanford typed
    dependencies representation," in Coling 2008: Proceedings of the
    workshop on Cross-Framework and Cross-Domain Parser Evaluation,
    2008, pp. 1-8.

[3] M.-C. De Marneffe, B. MacCartney, and C. D. Manning, "Generating
    typed dependency parses from phrase structure parses," in
    Proceedings of LREC, 2006, vol. 6, pp. 449-454.
