South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT)

Van Dyk, Tobie

Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/557'

dc.contact.email	Tobie.vanDyk@nwu.ac.za	en_ZA
dc.contact.name	Tobie van Dyk	en_ZA
dc.contributor.author	Van Dyk, Tobie
dc.date.accessioned	2022-04-06T16:21:57Z
dc.date.available	2022-04-06T16:21:57Z
dc.date.issued	2021
dc.description	NOTE: THIS HAS BEEN SUPERSEDED. See https://hdl.handle.net/20.500.12185/585 for the record that replaces it. The South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT) is a multi-genre, multi-level learner corpus developed by the Inter-institutional Centre for Language Development and Assessment (ICELDA) in collaboration with the South African Centre for Digital Language Resources (SADiLaR). The corpus includes shorter and longer texts from an array of genres, from different fields of study, and from all levels of study. Several institutions of higher education that are part of the ICELDA network have contributed to the corpus and continue to do so. Ethical clearance to collect data has been granted at all partnering institutions; this includes informed consent by all students who contributed to SAMuLCAT. The corpus is augmented by two sets of metadata. The first set consists mainly of biographical detail about students (completed by the students themselves); the second set gives more information on the task types and texts included in the corpus (completed by, e.g., lecturers and writing centre staff). Data can be filtered through the metadata filters available in the search functionality of the corpus. The corpus is available under the Creative Commons Attribution 4.0 license and is open source. Use of the corpus for research purposes requires permission from SADiLaR, and applications should include evidence of ethical clearance from the research institutions to which staff and students are affiliated. More information about the design of the corpus and the metadata available in it can be found in the following article: Carstens, A. and Eiselen, R., 2019. Designing a South African multilingual learner corpus of academic texts (SAMuLCAT). Language Matters, 50(1), pp.64-83.
Annotation: Corpora for the indigenous South African languages are automatically annotated for lemmas and parts of speech using the available NCHLT Text lemmatisers and part-of-speech taggers. Information on the accuracy and tag sets for these languages is available from the NCHLT Web Service. No quality control of the automatic annotations was performed. The English data is annotated using the open-source NLP4J library, available here: https://emorynlp.github.io/nlp4j/	en_ZA
dc.format	XML, text	en_ZA
dc.identifier.uri	https://hdl.handle.net/20.500.12185/557
dc.languages	Afrikaans	en_ZA
dc.languages	English	en_ZA
dc.media.type	Text	en_ZA
dc.publisher	ICELDA	en_ZA
dc.publisher	SADiLaR	en_ZA
dc.relation.isreplacedby	https://hdl.handle.net/20.500.12185/585
dc.rights.license	Creative Commons Attribution 4.0 International Public License	en_ZA
dc.subject	Learner Corpus	en_ZA
dc.subject	L2
dc.subject	multi-genre
dc.title	South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT)	en_ZA
dc.version	1.0	en_ZA

Files

License bundle

Name: license.txt
Size: 3.23 KB
Format: Item-specific license agreed upon to submission

Collections

Resource Catalogue