Corpus of multilingual code-switched soap opera speech
Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/545'
dc.contact.email | trn@sun.ac.za | en_ZA |
dc.contact.name | Thomas Niesler | en_ZA |
dc.contributor.author | van der Westhuizen, Ewald | |
dc.contributor.author | Niesler, Thomas | |
dc.date.accessioned | 2021-08-20T14:54:15Z | |
dc.date.available | 2021-08-20T14:54:15Z | |
dc.date.issued | 2020-02-28 | |
dc.description | The corpus comprises 26.9 hours of annotated multilingual speech that contains examples of code-switching in isiZulu, isiXhosa, Setswana, Sesotho and English. The speech was obtained from South African soap operas. Code-switching between English and one of the Bantu languages is by far the most prevalent in the data. Although not very common, switches between the Bantu languages themselves also occur. An initial attempt to align the audio extracted from soap opera episodes with the corresponding scripts revealed that actors very often ad-lib. The speech and the examples of code-switching it contains can therefore be considered spontaneous. | en_ZA |
dc.format | wav (audio), XML (transcriptions) | en_ZA |
dc.format.extent | 26.9 hours of annotated multilingual code-switched soap opera speech | en_ZA |
dc.format.medium | N/A | en_ZA |
dc.format.size | 4.25 GB | en_ZA |
dc.identifier.citation | E. van der Westhuizen and T.R. Niesler, “A first South African corpus of multilingual code-switched soap opera speech,” in Proc. LREC, 2018, pp. 2854–2859. | en_ZA |
dc.identifier.citation | A. Biswas, E. Yılmaz, F. de Wet, E. van der Westhuizen, and T.R. Niesler, “Semi-supervised Development of ASR Systems for Multilingual Code-switched Speech in Under-resourced Languages,” in Proc. LREC, 2020, pp. 3468–3474. | en_ZA |
dc.identifier.uri | https://hdl.handle.net/20.500.12185/545 | |
dc.languages | English | en_ZA |
dc.languages | isiXhosa | en_ZA |
dc.languages | isiZulu | en_ZA |
dc.languages | Setswana | en_ZA |
dc.languages | Sesotho | en_ZA |
dc.media.category | Annotated multilingual speech corpus | en_ZA |
dc.media.type | Speech | en_ZA |
dc.project | "A multilingual corpus of code-switched South African speech", carried out on behalf of the Department of Arts and Culture of the Government of South Africa | en_ZA |
dc.publisher | Stellenbosch University | en_ZA |
dc.rights.license | Research only. | en_ZA |
dc.subject | code-switching, spontaneous speech, South African languages, isiZulu, isiXhosa, Setswana, Sesotho | en_ZA |
dc.title | Corpus of multilingual code-switched soap opera speech | en_ZA |
dc.version | 1.0 | en_ZA |
Files
Original bundle
1 - 5 of 13
- Name:
- 5lang.zip
- Size:
- 2.75 GB
- Format:
- ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed.
- Description:
- Full archive of the speech and the examples of code-switching (5lang)
- Name:
- 5lang.xml
- Size:
- 31.73 MB
- Format:
- Extensible Markup Language
- Description:
- Meta-data and transcriptions of mixed language (5lang) data
- Name:
- balanced_engsot.xml
- Size:
- 3.69 MB
- Format:
- Extensible Markup Language
- Description:
- Meta-data and transcriptions of English/Sesotho code-switch data
- Name:
- balanced_engtsn.xml
- Size:
- 3.76 MB
- Format:
- Extensible Markup Language
- Description:
- Meta-data and transcriptions of English/Setswana code-switch data
- Name:
- balanced_engxho.xml
- Size:
- 3.86 MB
- Format:
- Extensible Markup Language
- Description:
- Meta-data and transcriptions of English/isiXhosa code-switch data
License bundle
1 - 1 of 1
- Name:
- license.txt
- Size:
- 3.23 KB
- Format:
- Item-specific license agreed to upon submission
- Description: