Corpus of multilingual code-switched soap opera speech

Date

2020-02-28

Authors

van der Westhuizen, Ewald
Niesler, Thomas

Publisher

Stellenbosch University

Abstract

Description

The corpus comprises 26.9 hours of annotated multilingual speech containing examples of code-switching among isiZulu, isiXhosa, Setswana, Sesotho and English. The speech was obtained from South African soap operas. Code-switching between English and one of the Bantu languages is by far the most prevalent in the data. Switches between the Bantu languages themselves also occur, although they are uncommon. An initial attempt to align the audio extracted from soap opera episodes with the corresponding scripts revealed that the actors very often ad-lib. The speech, and the code-switching it contains, can therefore be considered spontaneous.

Citation

E. van der Westhuizen and T.R. Niesler, “A first South African corpus of multilingual code-switched soap opera speech,” in Proc. LREC, 2018, pp. 2854–2859.
A. Biswas, E. Yılmaz, F. de Wet, E. van der Westhuizen, and T.R. Niesler, “Semi-supervised Development of ASR Systems for Multilingual Code-switched Speech in Under-resourced Languages,” in Proc. LREC, 2020, pp. 3468–3474.

License

Research only.

Verification status

Level 0