
Final year high school examination texts of South African home and first additional language subjects
This data collection consists of reading comprehension and summary writing texts. The texts comprise the final-year high school exam texts for Home Language (HL) and First Additional Language (FAL) subjects written in South Africa between 2008 and 2020. The collection contains texts from all eleven official South African language subjects: Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sesotho, Setswana, Sepedi, Siswati, Tshivenda, and Xitsonga. PDF versions of the texts were downloaded from the online public access repository of South Africa's Department of Basic Education. Plain text was extracted using pdftotext (version 22.02.0), and the texts were then tokenized using Ucto (version 0.21.1). The data collection contains 429 exam text files comprising 1,314,551 tokens and 131,650 types (i.e., unique tokens). Of these, 223 are HL texts with 689,730 tokens and 88,009 types, whereas the 206 FAL texts contain 624,821 tokens and 73,451 types. In addition to the full exam texts, the reading comprehension and summary writing texts were extracted manually. The data is useful for studies investigating, e.g., linguistic properties, text readability, text difficulty, and linguistic complexity in any of the eleven languages. Furthermore, both intra-language and inter-language comparisons can be made.
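To make the token and type figures above concrete, the following is a minimal Python sketch of how such counts could be computed from tokenized text. It assumes that tokens are separated by whitespace and that counting is case-sensitive; neither assumption is confirmed by this record. The function name `token_type_counts` and the sample sentence are hypothetical. The final lines check that the HL and FAL subtotals reported above add up to the stated totals.

```python
from collections import Counter


def token_type_counts(tokenized_text: str) -> tuple[int, int]:
    """Return (tokens, types) for whitespace-separated tokenized text.

    Assumption: counting is case-sensitive, so "The" and "the"
    are two distinct types.
    """
    counts = Counter(tokenized_text.split())
    return sum(counts.values()), len(counts)


# Hypothetical sample, not taken from the data collection.
sample = "The cat sat on the mat . The end ."
tokens, types = token_type_counts(sample)
print(tokens, types)

# Consistency check on the figures reported in this record:
# HL and FAL file and token counts should sum to the totals.
files_ok = 223 + 206 == 429
tokens_ok = 689_730 + 624_821 == 1_314_551
print(files_ok, tokens_ok)
```

Type counts are not expected to add up in the same way, because a word form that occurs in both HL and FAL texts is counted once in the overall total.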
Menno van Zaanen
menno.vanzaanen@nwu.ac.za
South African Centre for Digital Language Resources
Creative Commons License Attribution-ShareAlike 4.0 International
Afrikaans; English; isiNdebele; isiXhosa; isiZulu; Sepedi; Setswana; Sesotho; Siswati; Tshivenda; Xitsonga
Sibeko, Johannes; van Zaanen, Menno
exam texts; high school
https://hdl.handle.net/20.500.12185/568
Text
2022-11-23T09:20:22Z
2022-11-23T09:20:22Z
2022-11-16



This item appears in the following Collection(s)

  • Resource Index [409]
    A collection of language resource metadata, mostly collected during the NHN-funded technology audit of 2009 and the SADiLaR technology audit of 2018. Not all resources in this collection are available for download.
