Project: SADiLaR: Parallel corpora for English into isiXhosa
Type: Multilingual text corpora: aligned
Languages: English (en_ZA) & isiXhosa (xh_ZA)
Date: 2025-06-10
Version: 2.0 (Final)

Description:
Aligned parallel corpora for the following language pair: English-isiXhosa.
The data is given as two separate UTF-8 text files, with each segment on a new line.
The dataset contains existing data sourced for the DAC-funded Autshumato project as well as new data sourced for the SADiLaR: Parallel corpora for English into isiXhosa project.

NOTE: Version 2.0 has been processed in the same way as the other Autshumato resources.

Content:
                                | Segments | EN Words  | XH Words  |
-----------------------------------------------------------------------------
Entire English-isiXhosa dataset | 109,940  | 1,745,236 | 1,264,390 |

Source(s):
Parallel documents from various South African domains (mainly government and publications).

Autshumato Project website: http://autshumato.sourceforge.net/
SADiLaR website: https://sadilar.org/index.php/en/
_________________________________________________________________________________

Licence for final (v1.0) distribution: Creative Commons Attribution 4.0 International
URL: http://creativecommons.org/licenses/by/4.0/

Attribute work to:
CTexT® (Centre for Text Technology, North-West University), South Africa;
SADiLaR (South African Centre for Digital Language Resources), South Africa;
Department of Arts and Culture, South Africa.

Attribute work to URL:
http://humanities.nwu.ac.za/ctext
https://sadilar.org
http://www.dac.gov.za
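
Example: loading the aligned files

The description above states that the corpus is distributed as two UTF-8 text files with one segment per line, so line N of the English file corresponds to line N of the isiXhosa file. The following is a minimal sketch of how the two files might be read and paired in Python. It is not part of the distribution: the function names (read_segments, load_parallel, count_words) are illustrative, and the file names in the usage note are hypothetical because this README does not list the actual file names.

```python
def read_segments(path):
    """Return the segments of one corpus file, one per line.

    Lines are split on "\n" only, so unusual characters inside a segment
    cannot shift the alignment. Empty lines are kept for the same reason.
    """
    with open(path, encoding="utf-8", newline="\n") as handle:
        return [line.rstrip("\r\n") for line in handle]


def load_parallel(en_path, xh_path):
    """Pair line N of the English file with line N of the isiXhosa file."""
    en_segments = read_segments(en_path)
    xh_segments = read_segments(xh_path)
    # Aligned files must have the same number of segments.
    if len(en_segments) != len(xh_segments):
        raise ValueError(
            f"Segment count mismatch: {len(en_segments)} English "
            f"vs {len(xh_segments)} isiXhosa"
        )
    return list(zip(en_segments, xh_segments))


def count_words(segments):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return sum(len(segment.split()) for segment in segments)
```

With hypothetical file names, a call such as load_parallel("english.txt", "isixhosa.txt") returns a list of (English, isiXhosa) pairs. For the full dataset the list length should match the segment count in the Content table (109,940). The word counts in that table may have been produced with a different tokenisation, so count_words is not guaranteed to reproduce them exactly.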