Project: SADiLaR: Monolingual corpora for isiXhosa
Type: Monolingual text corpora
Languages: IsiXhosa (xh_ZA).
Date: 2025-06-10
Version: 2.0 (Final)

Description:
Monolingual corpus for isiXhosa. The data is given as a single UTF-8 text file, with each segment on a newline. The dataset contains existing data sourced for the DAC funded Autshumato project as well as new data sourced for the SADiLaR: Parallel corpora for English into isiXhosa project.
NOTE: Version 2.0 has been processed in the same way as the other Autshumato resources.

Content:
                        | Segments | XH Words  |
-------------------------------------------------------
Entire isiXhosa dataset | 341,330  | 4,328,245 |

Source(s):
Documents from various South African domains (mainly government and publications).

Autshumato Project website: http://autshumato.sourceforge.net/
SADiLaR website: https://sadilar.org/index.php/en/
_________________________________________________________________________________

Licence for final (v1.0) distribution:
Creative Commons Attribution 4.0 International
URL: http://creativecommons.org/licenses/by/4.0/

Attribute work to:
CTexT® (Centre for Text Technology, North-West University), South Africa; SADiLaR (South African Centre for Digital Language Resources), South Africa; Department of Arts and Culture, South Africa.

Attribute work to URL:
http://humanities.nwu.ac.za/ctext
https://sadilar.org
http://www.dac.gov.za
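
Usage sketch:
As the Description notes, the data is a single UTF-8 text file with one segment per line. The following is a minimal sketch of how such a file could be read and its segments and words counted. The function name corpus_stats and the two-line sample are hypothetical and are not part of the distribution; the sample text is illustrative and not taken from the corpus. Words are counted here by splitting on whitespace, which is an assumption: the tokenisation behind the figures in the Content table is not stated, so totals computed this way may differ.

```python
import os
import tempfile


def corpus_stats(path):
    """Return (segments, words) for a one-segment-per-line UTF-8 text file.

    Blank lines are skipped; words are whitespace-separated tokens
    (an assumption, not necessarily the corpus's own tokenisation).
    """
    segments = 0
    words = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            text = line.strip()
            if not text:
                continue  # ignore empty lines
            segments += 1
            words += len(text.split())
    return segments, words


# Illustrative two-segment sample (hypothetical, not from the corpus).
sample = "Molo, unjani?\nNdiphilile, enkosi.\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

segments, words = corpus_stats(sample_path)
print(segments, words)  # prints: 2 4
os.remove(sample_path)
```

To run this against the real data, pass the path of the distributed text file to corpus_stats in place of the temporary sample file.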