Autshumato English-Siswati Parallel Corpora

License: Creative Commons Attribution 4.0 International
Author: McKellar, Cindy
Dates: 2025-09-16, 2022-06-01, 2025-09-16, 2022-03-31
Identifier: https://hdl.handle.net/20.500.12185/560.2

Description: Aligned parallel corpora for the English-Siswati language pair. The data is provided as four separate UTF-8 text files, with each segment on its own line. The dataset contains existing data sourced for the DSAC-funded Autshumato project as well as new data sourced for the SADiLaR: Parallel corpora for English into SiSwati project. It contains two types of bilingual data: translations from English to Siswati, and crawled parallel data for English-Siswati. The dataset comprises a total of 114,839 segments with 2,002,293 English words and 1,423,414 Siswati words. (A new version was issued because the title was changed.)

Type: Text
Extent: 114,839 segments with 2,002,293 English words and 1,423,414 Siswati words
Language: Siswati
Keywords: machine translation training data, crawled, translations, multilingual, aligned data
Size: 9.54 Mb
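
Because the data is aligned and stored with one segment per line in UTF-8 text files, segment pairs can be recovered by reading an English file and its Siswati counterpart line by line. The following is a minimal sketch in Python, not an official loader: the function name read_aligned and any file names passed to it are hypothetical, and it assumes that line N of one file corresponds to line N of the other.

```python
def read_segments(path):
    """Read one UTF-8 text file and return its segments, one per line."""
    with open(path, encoding="utf-8") as handle:
        # Strip only the line terminator so segment-internal spacing is kept.
        return [line.rstrip("\n") for line in handle]


def read_aligned(english_path, siswati_path):
    """Return a list of (english, siswati) segment pairs.

    Assumption: the two files are aligned line by line, so they must
    contain the same number of lines.
    """
    english = read_segments(english_path)
    siswati = read_segments(siswati_path)
    if len(english) != len(siswati):
        raise ValueError(
            f"Line counts differ: {len(english)} English vs "
            f"{len(siswati)} Siswati segments"
        )
    return list(zip(english, siswati))
```

Checking the line counts before pairing is a deliberate choice: a silent mismatch would shift every later pair and corrupt training data for machine translation.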