-------------------------------------------------------------------------------
README: Code-Switch project deliverable (28/02/2020)
-------------------------------------------------------------------------------

Description: Audio segments obtained from episodes of Rhythm City &
             Generations. Meta-data and annotations in XML & text format.

Language:    The data includes examples of 5 of South Africa's official
             languages as well as switches between the languages.

-------------------------------------------------------------------------------
Information
-------------------------------------------------------------------------------

Directory structure:

DAC_Deliverables
|-- 28Feb2020
    |-- data
    |   |-- audio
    |   |   |-- balanced_engzul.zip
    |   |   |-- balanced_engxho.zip
    |   |   |-- balanced_engtsn.zip
    |   |   |-- balanced_engsot.zip
    |   |   |-- 5lang.zip
    |   |-- XML
    |       |-- balanced_engzul.xml
    |       |-- balanced_engxho.xml
    |       |-- balanced_engtsn.xml
    |       |-- balanced_engsot.xml
    |       |-- 5lang.xml
    |-- README

The data directory has two sub-directories: "audio" & "XML".

The "audio" directory contains FIVE zipped directories.

    - Each directory contains speech segments extracted from Rhythm City and
      Generations.
    - The audio data is packaged in two formats:
        * Four balanced bilingual sets that each contain an equal amount of
          English and one Bantu language.
        * One multilingual data set that contains examples of all the
          languages represented in the data (5lang).
    - For the balanced data sets the audio data is arranged according to the
      languages between which code-switching occurs:
          engzul: English - isiZulu
          engxho: English - isiXhosa
          engtsn: English - Setswana
          engsot: English - Sesotho
    - All the audio data of the big data set is in one zipped directory:
      5lang.zip
    - All audio files are .wav format (32kHz sampling rate, 16-bit mono PCM).
    - Audio filenames follow one of the following conventions, depending on
      the information that is available for each file:
          (1) _--_.wav, e.g.
                  ALLAN_12-12-05_276.wav
          (2) __.wav, e.g.
                  JASON_150_530.wav
          (3) _19__.wav, e.g.
                  KHAPHELA_19_103_12.wav
    - Time information in the file names corresponds to the date of the
      original transmission.

The "XML" directory contains FIVE XML files, one for each code-switch language
pair and one for the entire data set.

    - The XML files include verbatim transcriptions and meta-data for each
      audio segment.
    - Transcriptions are encoded in UTF-8.

-------------------------------------------------------------------------------
Data statistics
-------------------------------------------------------------------------------

Balanced corpora:

    English - isiZulu   :  5.45 h
    English - isiXhosa  :  3.14 h
    English - Setswana  :  2.86 h
    English - Sesotho   :  2.83 h
    TOTAL               : 14.28 h

5lang corpus            : 26.86 h

The balanced sets are sub-sets of the 5lang corpus.

-------------------------------------------------------------------------------
Citation
-------------------------------------------------------------------------------

If you use this data, please cite:

van der Westhuizen, Ewald, and Thomas Niesler. "A First South African Corpus
of Multilingual Code-switched Soap Opera Speech." Proceedings of LREC,
Miyazaki, Japan, May 2018, pp. 2854-2859.

bibtex:

@inproceedings{van2018first,
  title={A First {S}outh {A}frican Corpus of Multilingual Code-switched Soap Opera Speech.},
  author={van der Westhuizen, Ewald and Niesler, Thomas},
  booktitle={Proceedings of LREC},
  address = {Miyazaki, Japan},
  month = {May},
  year={2018},
  pages={2854--2859}
}

-------------------------------------------------------------------------------
ASR systems
-------------------------------------------------------------------------------

The development of the acoustic models is described in the following
publications:

A. Biswas, E. Yılmaz, F. de Wet, E. van der Westhuizen and T.R. Niesler,
Semi-supervised development of ASR systems for multilingual code-switched
speech in under-resourced languages, Proceedings of LREC 2020, Marseille,
France.

A. Biswas, E. Yılmaz, F. de Wet, E. van der Westhuizen and T.R. Niesler,
Semi-supervised acoustic model training for five-lingual code-switched ASR,
Proceedings of Interspeech 2019, Graz, Austria.

A. Biswas, F. de Wet, E. van der Westhuizen, E. Yılmaz and T.R. Niesler,
Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced
English-isiZulu Code-Switched Speech, Proceedings of Interspeech 2018,
Hyderabad, India.

E. Yılmaz, A. Biswas, E. van der Westhuizen, F. de Wet and T.R. Niesler,
Building a Unified Code-Switching ASR System for South African Languages,
Proceedings of Interspeech 2018, Hyderabad, India.

A. Biswas, E. van der Westhuizen, T.R. Niesler and F. de Wet, Improving ASR
for Code-Switched Speech in Under-Resourced Languages Using Out-of-Domain
Data, Proceedings of SLTU 2018, Gurugram, India.

-------------------------------------------------------------------------------
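Example: telling the filename conventions apart (illustrative sketch)
-------------------------------------------------------------------------------

The three audio filename conventions listed under "Information" can be
distinguished mechanically. The Python sketch below is not part of the
deliverable. Its patterns are inferred only from the three example filenames
given in this README, and the field names (label, date, index, a, b) are
hypothetical, since the README does not name the individual fields. Filenames
that deviate from the three examples may need adjusted patterns.

```python
import re

# Patterns inferred from the three example filenames in this README.
# All group names are hypothetical; the README does not name the fields.
PATTERNS = {
    # (1) e.g. ALLAN_12-12-05_276.wav
    1: re.compile(r"(?P<label>[^_]+)_(?P<date>\d+-\d+-\d+)_(?P<index>\d+)\.wav"),
    # (2) e.g. JASON_150_530.wav
    2: re.compile(r"(?P<label>[^_]+)_(?P<a>\d+)_(?P<b>\d+)\.wav"),
    # (3) e.g. KHAPHELA_19_103_12.wav (literal "19" as the second field)
    3: re.compile(r"(?P<label>[^_]+)_19_(?P<a>\d+)_(?P<b>\d+)\.wav"),
}


def classify(filename):
    """Return (convention_number, fields), or (None, {}) if nothing matches."""
    for convention, pattern in PATTERNS.items():
        match = pattern.fullmatch(filename)
        if match:
            return convention, match.groupdict()
    return None, {}


if __name__ == "__main__":
    for name in ("ALLAN_12-12-05_276.wav",
                 "JASON_150_530.wav",
                 "KHAPHELA_19_103_12.wav"):
        print(name, classify(name))
```

Because each pattern must match the whole filename, the three conventions
cannot be confused with one another: (1) requires hyphens, (2) has exactly
three underscore-separated fields and (3) has four.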