------------------------------------------------------------------------------------------
Afrikaanse valensiemorfeemdatastel | 
Afrikaans linking element dataset
------------------------------------------------------------------------------------------

Author:
	Benito Trollip (1)
Language: Afrikaans

License:
	The dataset is licensed under a Creative Commons Attribution 4.0 International Public License.
	Please read the terms of use carefully.
	https://creativecommons.org/licenses/by/4.0/
	(Full legal code enclosed: License.txt)

Description:
	This dataset is a subset of the AuCoPro (noun-noun) compound dataset and the NCHLT corpus.
	
	The first part of the subset was extracted by combining the split and semantically annotated data to include only items that:
		
		contain "_" in the splitting dataset, which indicates the use of a linking element;
		are both split and semantically annotated;
		have been annotated with one of the semantic categories indicating a semantic relationship
		(thus excluding the MISTAG and NON-COMPOUND tags).
	  
		The semantic annotation was performed with annotation guidelines based on those of Verhoeven et al. (2012), which in turn were based on Ó Séaghdha (2008).
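
	The three selection criteria above can be sketched as a small filter. This is only an illustration, not the procedure
	actually used to build the dataset: the record layout (dictionaries with "split" and "category" keys) and the placeholder
	records are hypothetical. Only the "_" marker and the MISTAG and NON-COMPOUND tags come from the description above.

```python
# Tags that do not indicate a semantic relationship (named in the description).
EXCLUDED_TAGS = {"MISTAG", "NON-COMPOUND"}


def keep_item(item):
    """Return True if a record meets all three selection criteria.

    The record layout (keys "split" and "category") is a hypothetical one
    chosen for this sketch; it is not the layout of the source datasets.
    """
    split = item.get("split")
    category = item.get("category")
    # The item must be both split and semantically annotated.
    if not split or not category:
        return False
    # The split form must contain "_", the linking element marker.
    if "_" not in split:
        return False
    # The category must indicate a semantic relationship.
    return category not in EXCLUDED_TAGS


# Placeholder records; only the first one is taken from the examples in this README.
records = [
    {"split": "dier _ e + kennis", "category": "2.1.6.1"},
    {"split": "aaa + bbb", "category": "2.1.1.1"},      # no linking element
    {"split": "ccc _ s + ddd", "category": "MISTAG"},   # excluded tag
    {"split": "eee _ s + fff", "category": None},       # not semantically annotated
]

selected = [record["split"] for record in records if keep_item(record)]
print(selected)
```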
	
	The second part of the subset was extracted by combining the morphologically tagged items that:
	
		are non-compounds (not containing more than one stem);
		have been tagged with MLG, which indicates the presence of Germanic linking elements in the tokens.
		
If you decide to use this dataset, please cite this source:

	Trollip, E.B. (2016). 'n Beskrywing van die valensiemorfeem in Afrikaans vanuit 'n kognitiewe gebruiksgebaseerde beskrywingsraamwerk.
	[A description of the linking element in Afrikaans from a cognitive grammar, usage-based perspective]
	North-West University, Potchefstroom Campus. Magister Artium.
	
	(Available online at https://repository.nwu.ac.za/handle/10394/25653)
	
Acknowledgements:
	
	This dataset is an extract from the full AuCoPro dataset (see https://www.clips.uantwerpen.be/projects/aucopro for information on the AuCoPro project) and
	the NCHLT corpus (see https://repo.sadilar.org/handle/20.500.12185/293?show=full for more information on the NCHLT project).
		

------------------------------------------------------------------------------------------
Specifics
------------------------------------------------------------------------------------------

Reference for the semantic annotation schemes:

	Verhoeven, B., van Zaanen, M., van Huyssteen, G.B., & Daelemans, W. (2014). Annotation Guidelines for Compound Analysis. CLiPS Technical Report Series, 5.
	
Files included:

	Afrikaanse.valensiemorfeem.datastel.1.1.BT.2019-06-05.xlsx
	
File format:

	The file contains two sheets, each with a list of items (each item is a row).
	The file is an xlsx file (an Excel file) and the two sheets can be converted into separate csv files by using the "Save as" function in Excel.
	
	In the first sheet (labelled "Komposita") the first column contains the split compound (an "_" before a letter or group of letters indicates a
	linking element). The second column gives the numeric code of the semantic category to which the compound has been
	assigned. The categories are explained in Addendum B of Verhoeven et al. (2014), but they are summarised
	here for ease of reference:
	
	2.1.1.1	BE		X is N1 and X is N2									
	2.1.1.2	BE		N2 is a form/shape taken by the 
					substance N1										
	2.1.1.3	BE		N2 is ascribed significant properties 
					of N1 without the ascription of 
					identity. The compound roughly denotes 
					“an N2 like N1”								
	2.1.2.1	HAVE	N1/N2 owns N2/N1 or has exclusive rights 
					or the exclusive ability to access or to 
					use N2/N1 or has a one-to-one possessive 
					association with N2/N1						
	2.1.2.2	HAVE	N1/N2 is a physical condition, a mental 
					state or a mentally salient entity 
					experienced by N2/N1
	2.1.2.3	HAVE	N1/N2 has the property denoted by N2/N1
	2.1.2.4	HAVE	N1/N2 has N2/N1 as a part or constituent
	2.1.2.5	HAVE	N1/N2 is a group/society/set/collection 
					of entities N2/N1
	2.1.3.1	IN		N1/N2 is an object spatially located in 
					or near N2/N1
	2.1.3.2	IN		N1/N2 is an event or activity spatially 
					located in N2/N1
	2.1.3.3	IN		N1/N2 is an object temporally located in 
					or near N2/N1, or is a participant in an 
					event/activity located there
	2.1.3.4	IN		N1/N2 is an event/activity temporally 
					located in or near N2/N1
	2.1.4.1	ACTOR	N1/N2 is a sentient participant in 
					the event N2/N1
	2.1.4.2	ACTOR	N1/N2 is a sentient participant in an 
					event in which N2/N1 is also a participant, 
					and N1/N2 is more agentive than N2/N1
	2.1.5.1	INST	N1/N2 is a participant in an activity or 
					event N2/N1, and N1/N2 is not an ACTOR
	2.1.5.2	INST	The compound is associated with a 
					characteristic event in which N1/N2 and 
					N2/N1 are participants, N1/N2 is more 
					agentive than N2/N1, and N1/N2 is not an 
					ACTOR
	2.1.6.1	ABOUT	N1/N2’s descriptive, significant or 
					propositional content relates to N2/N1
	2.1.6.2	ABOUT	N1/N2 is a collection of items whose 
					descriptive, significant or propositional 
					content relates to N2/N1 or an event that 
					describes or conveys information about N2/N1
	2.1.6.3	ABOUT	N1/N2 is a mental process or mental activity 
					focused on N2/N1, or an activity resulting from 
					such
	2.1.6.4	ABOUT	N1/N2 is an amount of money or some other 
					commodity given in exchange for N2/N1 or to 
					satisfy a debt arising from N2/N1

		Note: each item was re-annotated by the creator and checked by his study leader.
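
	In the summary above, the first three parts of each numeric code determine the main category label (for example, every
	code starting with 2.1.6 is labelled ABOUT). A minimal sketch of a lookup based on that pattern follows. The names
	CATEGORY_BY_PREFIX and category_label are introduced here for illustration only and are not part of the dataset.

```python
# Main category labels keyed by code prefix, copied from the summary table.
CATEGORY_BY_PREFIX = {
    "2.1.1": "BE",
    "2.1.2": "HAVE",
    "2.1.3": "IN",
    "2.1.4": "ACTOR",
    "2.1.5": "INST",
    "2.1.6": "ABOUT",
}


def category_label(code):
    """Return the main category label for a code such as "2.1.6.1".

    Raises KeyError if the prefix is not one of the six listed above.
    """
    # Drop the final, subcategory part of the code: "2.1.6.1" -> "2.1.6".
    prefix = code.strip().rsplit(".", 1)[0]
    return CATEGORY_BY_PREFIX[prefix]


print(category_label("2.1.6.1"))  # ABOUT
print(category_label("2.1.2.4"))  # HAVE
```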
	
	Examples of rows in Sheet 1 of the dataset:
	
	Compound			Split compound				Semantic category
	
	dierekennis			dier _ e + kennis			2.1.6.1
	gesagsverhouding	gesag _ s + verhouding		2.1.2.4
	offisier-veearts	offisier _ - + veearts		2.1.1.1
	wyfie-eend			wyfie _ - + eend			2.1.1.1
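
	The split notation in the rows above can be parsed with a few lines of Python. This is a sketch under one assumption:
	that "+" separates the constituents of the compound, which is inferred from the example rows and not stated explicitly
	in this README. The function names are hypothetical.

```python
def parse_split_compound(split):
    """Parse a split compound into (stem, linking_element) pairs.

    "_" introduces a linking element (as described in this README);
    "+" is assumed to separate constituents (inferred from the examples).
    The linking element is None when a constituent has none.
    """
    parts = []
    for constituent in split.split("+"):
        stem, marker, linker = constituent.partition("_")
        parts.append((stem.strip(), linker.strip() if marker else None))
    return parts


def surface_form(split):
    """Rebuild the unsplit compound by joining stems and linking elements."""
    return "".join(
        stem + (linker or "") for stem, linker in parse_split_compound(split)
    )


print(parse_split_compound("dier _ e + kennis"))
# [('dier', 'e'), ('kennis', None)]
print(surface_form("offisier _ - + veearts"))
# offisier-veearts
```

	Rebuilding the surface form in this way gives a simple consistency check against the "Compound" column.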
	
	In the second sheet of the file (labelled "Nonkomposita") the first column contains the relevant tokens/words extracted and the second column
	contains the morphological annotation, including splitting and tagging.
	
	Examples of rows in Sheet 2 of the dataset:
	
	Token				Morph
	eiendom				eie[RFA]n[MLG]dom[SDN]
	huurder				huur[RFN]d[MLG]er[SDN]
	redelikerwys		rede[RFN]lik[SDA]er[MLG]wys[SDB]
	stedelike			sted[RFNA]e[MLG]lik[SDA]e[SFA]
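
	The morphological annotation in the rows above follows the pattern morph[TAG], repeated for each morph. A minimal
	sketch for reading it is given below. The function names are hypothetical, and the only tag interpreted is MLG, which
	the description above identifies as the linking element tag; the other tags are passed through without interpretation.

```python
import re

# One morph followed by its tag in square brackets, e.g. "n[MLG]".
MORPH_PATTERN = re.compile(r"([^\[\]]+)\[([^\[\]]+)\]")


def parse_morph(annotation):
    """Return a list of (morph, tag) pairs from a string such as "eie[RFA]n[MLG]dom[SDN]"."""
    return MORPH_PATTERN.findall(annotation)


def linking_elements(annotation):
    """Return the morphs tagged MLG (linking elements) in the annotation."""
    return [morph for morph, tag in parse_morph(annotation) if tag == "MLG"]


print(parse_morph("eie[RFA]n[MLG]dom[SDN]"))
# [('eie', 'RFA'), ('n', 'MLG'), ('dom', 'SDN')]
print(linking_elements("rede[RFN]lik[SDA]er[MLG]wys[SDB]"))
# ['er']
```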