Autshumato English-Sepedi Parallel Corpora

License: Creative Commons Attribution 4.0 International
Authors: McKellar, Cindy; Gaustad Van Zaanen, Tanja; Puttkammer, Martin; Gent, Sunny; van Heerden, Jacques
Dates: 2022-12-15; 2022-12-15; 2022-09-30
Identifier: https://hdl.handle.net/20.500.12185/576

Description: Aligned parallel corpora for the language pair English-Sepedi. The data is given as two separate UTF-8 text files, with each aligned segment on a newline. The data was specifically selected and formatted for use in the training of machine translation systems. Further clean-up and processing might be required depending on the task the data is reused for.

Format: Txt
Aligned Segments: 131 535
English Words: 2 214 453
Sepedi Words: 2 822 916
Media type: Text; UTF8
Keywords: Autshumato; English; Sepedi
Size: 10.8 Mb (zipped)
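
The description says the corpus is delivered as two UTF-8 text files with one aligned segment per line, so segment N of the English file corresponds to segment N of the Sepedi file. A minimal Python sketch of reading such a pair is given here for illustration. The function names and the file names in the usage note are hypothetical, since the record does not state what the files in the archive are called, and the line-count check is an assumption about how one might guard against misaligned input rather than part of the dataset's documentation.

```python
from itertools import zip_longest


def pair_segments(english_lines, sepedi_lines):
    """Pair two line sequences by position and return (english, sepedi) tuples.

    Raises ValueError if one sequence runs out before the other, because a
    length mismatch means the line-by-line alignment can no longer be trusted.
    """
    pairs = []
    merged = zip_longest(english_lines, sepedi_lines)
    for line_no, (english, sepedi) in enumerate(merged, start=1):
        if english is None or sepedi is None:
            raise ValueError(f"inputs differ in length at line {line_no}")
        # Strip only the line terminator; leading or trailing spaces are kept.
        pairs.append((english.rstrip("\r\n"), sepedi.rstrip("\r\n")))
    return pairs


def read_parallel(english_path, sepedi_path):
    """Read two UTF-8 files (hypothetical paths) into aligned segment pairs."""
    with open(english_path, encoding="utf-8") as english_file, \
         open(sepedi_path, encoding="utf-8") as sepedi_file:
        # Iterating a text-mode file yields one line per newline-terminated
        # segment, which matches the one-segment-per-line layout described.
        return pair_segments(english_file, sepedi_file)
```

A call such as `read_parallel("english.txt", "sepedi.txt")`, with the file names replaced by whatever the unzipped archive actually contains, would return a list of pairs. If the files are unmodified, the length of that list can be compared against the 131 535 aligned segments quoted in the record as a quick sanity check.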