    Final year high school examination texts of South African home and first additional language subjects

    Download
    Data collection (356.8 MB)
    MD5: 3e61cdf929aa60489079db106b7a17c5
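    The MD5 value above can be used to check that a download arrived intact. A minimal Python sketch follows, assuming the archive has been saved locally; the function names `md5_of_file` and `verify_download` are illustrative and not part of any SADiLaR tooling, and the archive file name is not stated on this page, so it is left to the caller.

```python
import hashlib

# Checksum as listed on this record.
EXPECTED_MD5 = "3e61cdf929aa60489079db106b7a17c5"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

    Calling `verify_download` with the path of the downloaded archive should return `True` for an intact copy. MD5 is adequate for detecting transfer corruption, but it is not a defence against deliberate tampering.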

    License agreement

    By downloading this resource I accept and agree to the terms of use and the associated license conditions under which the resource is distributed.

    URI
    https://hdl.handle.net/20.500.12185/568
    Collections
    • Resource Index [409]
    Author(s)
    Sibeko, Johannes
    van Zaanen, Menno
    Description
    This data collection consists of reading comprehension and summary writing texts. The texts are taken from the final year high school exams for Home Language (HL) and First Additional Language (FAL) subjects written in South Africa between 2008 and 2020. The collection covers all eleven official South African language subjects: Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sesotho, Setswana, Sepedi, Siswati, Tshivenda, and Xitsonga. PDF versions of the exams were downloaded from the online public access repository of South Africa's Department of Basic Education. Plain text was extracted using pdftotext (version 22.02.0) and then tokenized using Ucto (version 0.21.1). The collection contains 429 exam text files with a total of 1,314,551 tokens and 131,650 types (i.e., unique tokens). Of these files, 223 are HL texts with 689,730 tokens and 88,009 types, and 206 are FAL texts with 624,821 tokens and 73,451 types. In addition to the full exam texts, the collection includes the reading comprehension and summary writing texts, which were extracted manually. The data is useful for studies of, e.g., linguistic properties, text readability, text properties, text difficulty, and linguistic complexity in any of the eleven languages. Both intra-language and inter-language comparisons can be made.
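    The token and type figures in the description can be illustrated with a short Python sketch. It assumes tokenized text in which tokens are separated by whitespace, and it counts types case-sensitively by default; neither the file layout nor the case handling behind the published figures is stated here, so both are assumptions. The names `count_tokens_and_types` and `subsets_are_consistent` are illustrative.

```python
from collections import Counter


def count_tokens_and_types(lines, lowercase=False):
    """Return (token_count, type_count) for whitespace-separated tokens."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            counts[token.lower() if lowercase else token] += 1
    return sum(counts.values()), len(counts)


# Figures reported in the description: (files, tokens, types).
REPORTED_HL = (223, 689_730, 88_009)
REPORTED_FAL = (206, 624_821, 73_451)
REPORTED_TOTAL = (429, 1_314_551, 131_650)


def subsets_are_consistent(hl, fal, total):
    """Check that subset figures are compatible with the overall totals."""
    files_ok = hl[0] + fal[0] == total[0]
    tokens_ok = hl[1] + fal[1] == total[1]
    # A type shared by both subsets is counted once overall, so the total
    # lies between the larger subset and the sum of the two subsets.
    types_ok = max(hl[2], fal[2]) <= total[2] <= hl[2] + fal[2]
    return files_ok and tokens_ok and types_ok
```

    File and token counts add up across the HL and FAL subsets, whereas type counts do not, because a type that occurs in both subsets is counted only once in the overall total. This is why 88,009 and 73,451 types sum to more than the reported 131,650.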
    Contact person
    Menno van Zaanen
    Contact person's e-mail address
    menno.vanzaanen@nwu.ac.za
    Publisher(s)
    South African Centre for Digital Language Resources

    Copyright © 2018  SADiLaR. All Rights Reserved.