
Final year high school examination texts of South African home and first additional language subjects

Date

2022-11-16

Authors

Sibeko, Johannes
van Zaanen, Menno

Publisher

South African Centre for Digital Language Resources

Description

This data collection consists of reading comprehension and summary writing texts. The texts comprise the final year high school exam texts for Home Language (HL) and First Additional Language (FAL) subjects written in South Africa between 2008 and 2020. The collection contains texts from all eleven official South African language subjects: Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sesotho, Setswana, Sepedi, Siswati, Tshivenda, and Xitsonga. PDF versions of the texts were downloaded from the online public access repository of South Africa's Department of Basic Education. Plain text was extracted using pdftotext (version 22.02.0), and the texts were then tokenized using Ucto (version 0.21.1). The data collection contains a total of 429 exam text files comprising 1,314,551 tokens with 131,650 types (i.e., unique tokens). Of these, 223 are HL texts with 689,730 tokens and 88,009 types, whereas the 206 FAL texts contain 624,821 tokens with 73,451 types. In addition to the full exam texts, the reading comprehension and summary writing texts were extracted manually. The data is useful for studies investigating, e.g., linguistic properties, text readability, text properties, text difficulty, and linguistic complexity in any of the eleven languages. Furthermore, both intra-language and inter-language comparisons can be made.
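
The token and type counts reported above can be illustrated with a minimal sketch. It assumes the tokenized output is plain text with tokens separated by whitespace, and it treats types as case-sensitive unique tokens. Both are assumptions for illustration, and the function name count_tokens_and_types is hypothetical; the record does not state how the published counts were computed.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_tokens_and_types(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of tokens, number of types) for tokenized text.

    Assumption: tokens are whitespace-separated, and a type is a
    case-sensitive unique token.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sum(counts.values()), len(counts)


# Two short, made-up tokenized lines (not taken from the data collection).
sample = ["The exam text .", "The summary text ."]
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)
```

Because the same word form can occur in several files, type counts are not additive across subsets. This is consistent with the figures above, where the HL and FAL type counts sum to more than the 131,650 types reported for the whole collection.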

Verification status

Level 0