------------------------------------------------------------------------------------------
AuCoPro - Semantics
------------------------------------------------------------------------------------------

Creator(s):
	Ben Verhoeven (1), Gerhard B. van Huyssteen (2) & Walter Daelemans (1)

	(1) CLiPS Research Center, University of Antwerp, Belgium
		www.clips.uantwerpen.be
	
	(2) Centre for Text Technology (CTexT), North-West University, South Africa
		www.nwu.ac.za/ctext

Version: 2014-01-31
Language: Dutch & Afrikaans

This dataset is available at http://www.clips.uantwerpen.be/datasets

License:
	The dataset is licensed under a Creative Commons Attribution 3.0 Unported License. 
	Please read the terms of use carefully.
	http://creativecommons.org/licenses/by/3.0/
	(Full legal code enclosed: License.txt)

Description:
	The AuCoPro-Semantics dataset is intended for the automatic semantic analysis of compounds.
	It contains semantically annotated noun-noun compounds (NN) from Dutch and Afrikaans, split
	into two annotation rounds per language. The semantic annotation followed annotation
	guidelines based on those of Ó Séaghdha (2008).
	Another part of the dataset contains other nominal compounds (XN) in Dutch, which were
	annotated using a newly developed annotation scheme.
	
If you use this dataset in your research, make sure to cite one of the following papers:

	Verhoeven, B., Daelemans, W., & van Huyssteen, G. B. (2012). Classification of Noun-Noun
	Compound Semantics in Dutch and Afrikaans. In: Proceedings of the Twenty-Third Annual 
	Symposium of the Pattern Recognition Association of South Africa (PRASA). Pretoria, 
	South Africa. 29-30 November. pp. 121-125. ISBN: 978-0-620-54601-0.  
	(Available online at http://www.prasa.org/proceedings/2012/PRASA2012.pdf)
	
	Verhoeven, B., & van Huyssteen, G. B. (2013). More Than Only Noun-Noun Compounds: 
	Towards an Annotation Scheme for the Semantic Modelling of Other Noun Compound Types. 
	In: Proceedings of the 9th Joint ISO - ACL SIGSEM Workshop on Interoperable Semantic 
	Annotation. Potsdam, Germany.
	(Available online at http://www.clips.uantwerpen.be/bibliography/more-than-only-noun-noun-compounds-towards-an-annotation-scheme-for-the-semantic-modell)

Acknowledgement:
	This dataset was created within the 'Automatic Compound Processing (AuCoPro)' project
	that was funded by the Dutch Language Union (Nederlandse Taalunie), the Department of 
	Arts and Culture (DAC) of South Africa and the National Research Foundation (NRF) of 
	South Africa.

------------------------------------------------------------------------------------------
Specifics
------------------------------------------------------------------------------------------

Reference for the annotation schemes:

	Verhoeven, B., van Zaanen, M., van Huyssteen, G. B., & Daelemans, W. (2014).
	Annotation Guidelines for Compound Analysis. CLiPS Technical Report Series, 5.
	
Dataset name encoding:

	Afr: language of dataset is Afrikaans
	Ned: language of dataset is Dutch
	
	Alfa: first annotation round for this language
	Beta: second annotation round for this language
	
	NN: annotation of noun-noun compounds
	XN: annotation of other nominal compounds

Files included:

	Afr_Alfa_NN		List.AUCOPRO.AfrikaansAlfaAnnotationNN_Combined.txt
	Afr_Beta_NN		List.AUCOPRO.AfrikaansBetaAnnotationNN_Combined.txt
	Afr_XN			List.AUCOPRO.AfrikaansXN_Cleaned.txt
	Ned_Alfa_NN		List.AUCOPRO.DutchAlfaAnnotationNN_Combined.txt
	Ned_Beta_NN		List.AUCOPRO.DutchBetaAnnotationNN_Combined.txt
	Ned_XN			List.AUCOPRO.DutchAnnotationXN_Combined.txt
	
File format:

	Each file contains a list of items, one item per row.
	Each item is a list of tab-separated values (TSV), so the file is column-formatted
	and can be opened in or imported into MS Excel.
	The first three columns contain the compound and its first and second constituent.
	The remaining columns contain annotations. Each column represents one annotator.
	
	Note: not every item was annotated by every annotator.
	
	Example:
	Compound	Constituent1	Constituent2	Annotation1	Annotation2	Annotation3
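
	The column layout above can be read with a few lines of Python. This is a minimal
	sketch, not part of the dataset: the function name read_items, the treatment of
	empty cells as "not annotated", and the sample rows with placeholder labels are all
	assumptions made for illustration. Whether the files carry a header row and which
	text encoding they use should be checked against the actual files.

```python
import csv
import io

def read_items(handle, n_fixed=3):
    """Yield (fixed_columns, annotations) for each tab-separated row.

    n_fixed is the number of leading non-annotation columns
    (compound, constituent 1, constituent 2 in the layout above).
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip completely empty lines
        fixed = row[:n_fixed]
        # Assumption: an empty cell means this annotator skipped the item.
        annotations = [cell.strip() or None for cell in row[n_fixed:]]
        yield fixed, annotations

# Invented rows with placeholder labels, NOT taken from the dataset.
sample = (
    "fietspad\tfiets\tpad\tLABEL_A\tLABEL_A\t\n"
    "boekenkast\tboeken\tkast\tLABEL_B\t\tLABEL_C\n"
)
items = list(read_items(io.StringIO(sample)))
```

	For a real file, the same function could be called on a handle opened with
	open(path, encoding="utf-8", newline=""), where the encoding is again an assumption.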

Terms explained:

	IAA, or inter-annotator agreement, is the percentage of items to which two annotators
	assign the same annotation.
	
	Kappa is another measure of agreement between two annotators. It is considered a
	better indication of annotation quality because it takes the class distribution of
	the annotations into account.
	When calculating IAA or Kappa for more than two annotators, we compute the pair-wise
	average (i.e. the average of the scores over all pairs of annotators).
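
	The pair-wise averaging described above can be sketched as follows. The text does
	not say which Kappa variant was used; this sketch assumes Cohen's kappa, and it
	assumes that items lacking a label from either annotator in a pair are skipped for
	that pair. The function names and the toy labels are hypothetical. Scores are
	returned as fractions; multiply by 100 to compare with the percentages in this file.

```python
from collections import Counter
from itertools import combinations

def pair_scores(a, b):
    """Return (observed agreement, Cohen's kappa) for two label lists."""
    # Keep only items that both annotators labelled (assumption).
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no items annotated by both annotators")
    p_o = sum(x == y for x, y in pairs) / n
    count_a = Counter(x for x, _ in pairs)
    count_b = Counter(y for _, y in pairs)
    # Agreement expected by chance, from each annotator's class distribution.
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return p_o, float("nan")  # kappa is undefined in this case
    return p_o, (p_o - p_e) / (1 - p_e)

def pairwise_average(columns):
    """Average agreement and kappa over all pairs of annotator columns."""
    scores = [pair_scores(a, b) for a, b in combinations(columns, 2)]
    iaa = sum(s[0] for s in scores) / len(scores)
    kappa = sum(s[1] for s in scores) / len(scores)
    return iaa, kappa

# Toy labels for three annotators, invented for illustration.
ann1 = ["A", "A", "B", "B"]
ann2 = ["A", "B", "B", "B"]
ann3 = ["A", "A", "B", "A"]
avg_iaa, avg_kappa = pairwise_average([ann1, ann2, ann3])
```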
	
	The number of triples is the absolute number of items on which three annotators agree.
	
	The number of doubles is the absolute number of items on which exactly two annotators
	agree; it does not overlap with the number of triples.
	
	The gold list ratio is the percentage of items agreed upon by at least two annotators.
	These items could be included in a 'gold standard' list that would be more reliable
	than the separate annotations.
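
	Counting triples and doubles and deriving the gold list ratio might look like the
	sketch below. The function name agreement_counts and the toy rows are hypothetical,
	and None stands for a missing annotation (an assumption about how gaps are
	represented). The last lines redo the arithmetic for Afr_Alfa_NN with the figures
	reported in the statistics section of this file.

```python
from collections import Counter

def agreement_counts(rows):
    """Return (triples, doubles, gold list ratio in percent).

    Each row is a sequence of up to three labels, one per annotator.
    """
    triples = doubles = 0
    for labels in rows:
        given = [label for label in labels if label is not None]
        if not given:
            continue
        # Size of the largest group of annotators sharing one label.
        top = Counter(given).most_common(1)[0][1]
        if top >= 3:
            triples += 1
        elif top == 2:
            doubles += 1
    gold_ratio = 100 * (triples + doubles) / len(rows)
    return triples, doubles, gold_ratio

# Toy rows, invented for illustration.
toy_rows = [
    ("A", "A", "A"),   # triple
    ("A", "A", "B"),   # double
    ("A", "B", "C"),   # no agreement
    ("B", None, "B"),  # double, one annotator missing
]
toy_result = agreement_counts(toy_rows)

# Reported Afr_Alfa_NN figures: 575 triples, 679 doubles, 1499 items.
readme_check = round(100 * (575 + 679) / 1499, 1)  # 83.7, as reported
```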

------------------------------------------------------------------------------------------
Data statistics
------------------------------------------------------------------------------------------

---------------
  Afr_Alfa_NN
---------------

Number of items: 		1499
Number of annotators: 	3

Average IAA: 			53.5
Average Kappa:			53.4

N° triples:				575
N° doubles:				679
Gold list ratio:		83.7

---------------
  Afr_Beta_NN
---------------

Number of items: 		2328
Number of annotators: 	3

Average IAA: 			37.7
Average Kappa:			37.6

N° triples:				502
N° doubles:				1126
Gold list ratio:		69.9

---------------
  Afr_XN
---------------

Number of items: 		4553
Number of annotators: 	3

Average IAA:			33.5
Average Kappa:			33.5
		
N° triples:				733
N° doubles:				2380
Gold list ratio:		68.4	

Note: This list has a slightly different file format:
Split_Compound	Constituent1	Pos1	Constituent2	Pos2	Annotation1	Annotation2	Annotation3
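
A row in this eight-column layout could be parsed as sketched below. The class and
function names are hypothetical, the sample line uses placeholder values rather than
real data, and treating an empty cell as a missing annotation is an assumption.

```python
from typing import List, NamedTuple, Optional

class XNItem(NamedTuple):
    split_compound: str
    constituent1: str
    pos1: str
    constituent2: str
    pos2: str
    annotations: List[Optional[str]]

def parse_xn_line(line: str) -> XNItem:
    """Parse one tab-separated row of the Afr_XN layout."""
    cells = line.rstrip("\r\n").split("\t")
    if len(cells) < 5:
        raise ValueError("expected at least 5 tab-separated columns")
    head, rest = cells[:5], cells[5:]
    # Assumption: an empty cell means this annotator skipped the item.
    return XNItem(*head, annotations=[cell.strip() or None for cell in rest])

# Placeholder row, NOT taken from the dataset.
item = parse_xn_line("COMPOUND\tC1\tPOS1\tC2\tPOS2\tLABEL_A\t\tLABEL_A\n")
```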

---------------
  Ned_Alfa_NN
---------------

Number of items:		1766
Number of annotators: 	2*
* second annotator only annotated a subset

Subset for Inter-AA: 	500
	Inter-AA:			60.2	
	Kappa:				60.0

Subset for Intra-AA: 	245		# First annotator, two months later (third annotation column)
	Intra-AA:			68.2	
	Kappa:				67.9

---------------
  Ned_Beta_NN
---------------

Number of items:		2000
Number of annotators:	3*
* one annotator did the entire set, two other annotators each did half

Average IAA:			51.1
Average Kappa:			51.0

N° doubles:				1022
Gold list ratio:		51.1	


---------------
  Ned_XN
---------------

Number of items: 		600
Number of annotators: 	2

IAA:					48.7
Kappa:					48.6
		
N° doubles:				292
Gold list ratio:		48.7	
