Bilingual English-Siswati Corpus

dc.contact.email: tanja.gaustad@nwu.ac.za [en_ZA]
dc.contact.name: Tanja Gaustad [en_ZA]
dc.contributor.author: McKellar, Cindy
dc.date.accessioned: 2022-06-01T08:24:01Z
dc.date.available: 2022-06-01T08:24:01Z
dc.date.issued: 2022-03-31
dc.description: Aligned parallel corpora for the following language pair: English-SiSwati. The data is given as four separate UTF-8 text files, with each segment on a new line. The dataset contains existing data sourced for the DSAC-funded Autshumato project as well as new data sourced for the SADiLaR: Parallel corpora for English into SiSwati project. It contains the following types of bilingual data: translations from English to Siswati and crawled parallel data for English-Siswati. The dataset comprises a total of 114,839 segments with 2,002,293 English words and 1,423,414 SiSwati words. [en_ZA]
dc.format: Text [en_ZA]
dc.format.extent: 114,839 segments with 2,002,293 English words and 1,423,414 Siswati words [en_ZA]
dc.format.medium: N/A [en_ZA]
dc.format.size: 9.54 MB [en_ZA]
dc.identifier.uri: https://hdl.handle.net/20.500.12185/560
dc.languages: English [en_ZA]
dc.languages: Siswati [en_ZA]
dc.media.category: Aligned parallel corpora [en_ZA]
dc.media.type: Text [en_ZA]
dc.project: SADiLaR: Parallel corpora for English into Siswati [en_ZA]
dc.publisher: North-West University - Centre for Text Technology (CTexT) [en_ZA]
dc.rights.license: Creative Commons Attribution 4.0 International [en_ZA]
dc.subject: Siswati, aligned data, multilingual, translations, crawled, machine translation training data [en_ZA]
dc.title: Bilingual English-Siswati Corpus [en_ZA]
dc.version: Version: 1.0 (Final) [en_ZA]

Files

Original bundle

Name: lcontent.SADILAR.BilingualCorpus(EN-SS).1.0.3.CAM.2022-03-08.en.zip
Size: 9.54 MB
Format: ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed.
Description: Language pair: English-SiSwati - four separate UTF-8 text files
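
Because the corpus is described as aligned UTF-8 text files with one segment per line, a pair of files could be read side by side roughly as sketched below. This is only an illustration under stated assumptions: the record does not give the names of the four files or say how they pair up, so the two path arguments are hypothetical, and the function names `read_segments` and `load_parallel` are invented for this example.

```python
def read_segments(path):
    """Return the segments of one corpus file, one per line.

    "utf-8-sig" reads plain UTF-8 and also skips a byte-order mark
    if the file happens to start with one.
    """
    with open(path, encoding="utf-8-sig") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel(english_path, siswati_path):
    """Pair the i-th English segment with the i-th Siswati segment.

    Both paths are hypothetical: the record does not name the files
    inside the archive. Alignment by line number is assumed from the
    "each segment on a newline" description.
    """
    english = read_segments(english_path)
    siswati = read_segments(siswati_path)
    if len(english) != len(siswati):
        raise ValueError(
            f"segment counts differ: {len(english)} vs {len(siswati)}"
        )
    return list(zip(english, siswati))
```

Calling `load_parallel` with the paths of two extracted files would return a list of (English, Siswati) tuples. The length check guards against pairing files that are not actually aligned with each other.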

License bundle

Name: license.txt
Size: 3.23 KB
Format: Item-specific license agreed upon to submission
Description:
