South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT) version 2023-03

Van Dyk, Tobie

Title	South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT) version 2023-03
Description	The South African Multilingual Learner Corpus of Academic Texts (SAMuLCAT) is a multi-genre, multi-level learner corpus developed by the Inter-institutional Centre for Language Development and Assessment (ICELDA) in collaboration with the South African Centre for Digital Language Resources (SADiLaR). The corpus includes shorter and longer pieces of text from an array of genres, different fields of study, and all levels of study. Several institutions of higher education in the ICELDA network have contributed, and continue to contribute, to the corpus. Ethical clearance to collect data has been granted at all partnering institutions; this includes informed consent by all students who contributed to SAMuLCAT. The corpus is augmented by two sets of metadata. The first set consists mainly of biographical detail about students (completed by the students themselves); the second set gives more information on the task types and texts included in the corpus (completed by, e.g., lecturers and writing centre staff). Data can be filtered through the metadata filters available in the search functionality of the corpus. The corpus is available under the Attribution 4.0 International (CC BY 4.0) license and is open source. More information about the design of the corpus and the metadata available in it can be found in the following article: Carstens, A. and Eiselen, R., 2019. Designing a South African multilingual learner corpus of academic texts (SAMuLCAT). Language Matters, 50(1), pp.64-83. The Afrikaans part of the corpus is automatically annotated for lemmas and part of speech using the available NCHLT Text lemmatisers and part-of-speech taggers. Additional information is available here: https://hlt.nwu.ac.za/about. No quality control of the automatic annotations was performed.
The English data is annotated using the open-source NLP4J library, available here: https://emorynlp.github.io/nlp4j/. DISCLAIMER: For a description of SADiLaR's privacy stance and practices, please see the privacy statement: https://sadilar.org/index.php/en/394-privacy-statement
Contact name	Tobie van Dyk
Contact email	Tobie.vanDyk@nwu.ac.za
Publisher(s)	ICELDA; SADiLaR
License	Research only
Language(s)	Afrikaans; English
Author(s)	Van Dyk, Tobie
Subject	Learner Corpus; L2; multi-genre
URI	https://hdl.handle.net/20.500.12185/585
Media type	Text
Version	2023-03
Format size	11Mb
Submit date	2023-05-04T12:59:24Z
Date available	2023-05-04T12:59:24Z
Date created	2023-03
Verification status	Level 0

Files in this item

Name: SAMuLCAT-2023-03.xlsx
Size: 10.63Mb
Format: Microsoft Excel 2007
MD5: 5c7684582c87336d02df9add34b6bb7b
Description: SAMuLCAT Corpus version 2023-03 ...
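
The MD5 value listed for the file can be used to check that a downloaded copy is intact. The following is a minimal Python sketch, not part of the record itself; the helper names (md5_of_file, matches_record) and the local file location are assumptions made for illustration, and only the standard hashlib module is used.

```python
import hashlib
from pathlib import Path

# MD5 value copied from the file listing in this record.
EXPECTED_MD5 = "5c7684582c87336d02df9add34b6bb7b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Hypothetical helper: compare a local file against the listed MD5."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed location: the download saved in the current directory.
    local_copy = Path("SAMuLCAT-2023-03.xlsx")
    if local_copy.exists():
        print("checksum OK" if matches_record(local_copy) else "checksum MISMATCH")
    else:
        print(f"{local_copy} not found; download it first")
```

A mismatch usually points to an incomplete or corrupted download; MD5 is adequate for this kind of integrity check but is not a safeguard against deliberate tampering.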

This item appears in the following Collection(s)

Resource Index [414]
A collection of language resource metadata, mostly collected during the NHN-funded technology audit of 2009 and the SADiLaR technology audit of 2018. Not all resources in this collection are available for download.