
Corpus of multilingual code-switched soap opera speech

dc.contact.email: trn@sun.ac.za [en_ZA]
dc.contact.name: Thomas Niesler [en_ZA]
dc.contributor.author: van der Westhuizen, Ewald
dc.contributor.author: Niesler, Thomas
dc.date.accessioned: 2021-08-20T14:54:15Z
dc.date.available: 2021-08-20T14:54:15Z
dc.date.issued: 2020-02-28
dc.description: The corpus comprises 26.9 hours of annotated multilingual speech that contains examples of code-switching in isiZulu, isiXhosa, Setswana, Sesotho and English. The speech was obtained from South African soap operas. Code-switching between English and one of the Bantu languages is by far the most prevalent in the data. Although not very common, switches between the Bantu languages themselves also occur. An initial attempt to align the audio extracted from soap opera episodes with the corresponding scripts revealed that actors very often perform ad lib. The speech and the examples of code-switching it contains can therefore be considered to be spontaneous. [en_ZA]
dc.format: wav (audio), XML (transcriptions) [en_ZA]
dc.format.extent: 26.9 hours of annotated multilingual code-switched soap opera speech [en_ZA]
dc.format.medium: N/A [en_ZA]
dc.format.size: 4.25 Gb [en_ZA]
dc.identifier.citation: E. van der Westhuizen and T.R. Niesler, "A first South African corpus of multilingual code-switched soap opera speech," in Proc. LREC, 2018, pp. 2854–2859. [en_ZA]
dc.identifier.citation: A. Biswas, E. Yılmaz, F. de Wet, E. van der Westhuizen, and T.R. Niesler, "Semi-supervised Development of ASR Systems for Multilingual Code-switched Speech in Under-resourced Languages," in Proc. LREC, 2020, pp. 3468–3474. [en_ZA]
dc.identifier.uri: https://hdl.handle.net/20.500.12185/545
dc.languages: English [en_ZA]
dc.languages: isiXhosa [en_ZA]
dc.languages: isiZulu [en_ZA]
dc.languages: Setswana [en_ZA]
dc.languages: Sesotho [en_ZA]
dc.media.category: Annotated multilingual speech corpus [en_ZA]
dc.media.type: Speech [en_ZA]
dc.project: "A multilingual corpus of code-switched South African speech", carried out on behalf of the Department of Arts and Culture of the Government of South Africa [en_ZA]
dc.publisher: Stellenbosch University [en_ZA]
dc.rights.license: Research only. [en_ZA]
dc.subject: code-switching, spontaneous speech, South African languages, isiZulu, isiXhosa, Setswana, Sesotho [en_ZA]
dc.title: Corpus of multilingual code-switched soap opera speech [en_ZA]
dc.version: 1.0 [en_ZA]

Files

Original bundle

Now showing 1 - 5 of 13
Name: 5lang.zip
Size: 2.75 GB
Format: ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed.
Description: Full archive of the speech and the examples of code-switching (5lang)

Name: 5lang.xml
Size: 31.73 MB
Format: Extensible Markup Language
Description: Meta-data and transcriptions of mixed language (5lang) data

Name: balanced_engsot.xml
Size: 3.69 MB
Format: Extensible Markup Language
Description: Meta-data and transcriptions of English/Sesotho code-switch data

Name: balanced_engtsn.xml
Size: 3.76 MB
Format: Extensible Markup Language
Description: Meta-data and transcriptions of English/Setswana code-switch data

Name: balanced_engxho.xml
Size: 3.86 MB
Format: Extensible Markup Language
Description: Meta-data and transcriptions of English/isiXhosa code-switch data
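
The transcription files are listed only as XML, and this record does not describe their schema. As a first look at one of them, a minimal Python sketch such as the following could tally which element tags occur, without assuming any particular structure. The helper name `count_tags` and the tiny sample document are hypothetical and serve only as an illustration; the sample's element names are not the corpus schema.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Stream through an XML document and count how often each element tag occurs.

    `source` may be a file path or a binary file-like object.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children and text to limit memory use
    return counts


# Stand-in document: these element names are invented for illustration
# and are NOT taken from the corpus files.
sample = io.BytesIO(b"<root><item lang='a'>x</item><item lang='b'>y</item></root>")
print(count_tags(sample))
```

To inspect a downloaded file, the same function could be called with its path, for example `count_tags("5lang.xml")`. Streaming with `iterparse` avoids loading the full 31.73 MB document into memory at once.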

License bundle

Now showing 1 - 1 of 1
Name: license.txt
Size: 3.23 KB
Format: Item-specific license agreed upon to submission
Description: