--------------------------------------------------------------------------------
README: Lwazi Primary Dictionaries v1.2
--------------------------------------------------------------------------------


1. About Lwazi Primary Dictionaries v1.2
2. Using and citing
3. Development process
4. Directory structure
5. Downloading DictionaryMaker

--------------------------------------------------------------------------------

1. About Lwazi Primary Dictionaries v1.2

This directory contains v1.2 of a language-specific pronunciation dictionary 
developed during project Lwazi. Such a dictionary exists for each of the following 
languages: Afrikaans, English, isiXhosa, isiNdebele, isiZulu, Sesotho, Setswana, 
Sepedi, Siswati, Tshivenda, Xitsonga.

For more info on project Lwazi, see: http://www.meraka.org.za/lwazi

The dictionary was developed by the CSIR Meraka Institute, South Africa, with 
collaborators from North-West University, South Africa.

--------------------------------------------------------------------------------

2. Using and citing

This dictionary is licensed under the conditions specified in LICENSE.txt.

When using the dictionary, please cite: 

M. Davel and O. Martirosian, "Pronunciation dictionary development in resource-
  scarce environments", in Proceedings of the 10th Annual Conference of the 
  International Speech Communication Association (Interspeech 2009), Brighton, UK, 
  Sept 2009, pp. 2851-2854.

--------------------------------------------------------------------------------

3. Development process 

This dictionary was developed as a practical resource for speech technology 
development. Phoneme and pronunciation choices were made in order to facilitate 
the development of text-to-speech (TTS) and automatic speech recognition (ASR) 
systems, rather than to provide an authoritative opinion on how a certain 
language should be described.

The development process is described in:
  M. Davel and O. Martirosian, "Pronunciation dictionary development in resource-
  scarce environments", in Proceedings of the 10th Annual Conference of the 
  International Speech Communication Association (Interspeech 2009), Brighton, UK, 
  Sept 2009, pp. 2851-2854.

Additional information:
- The phoneme set used: 
  http://www.meraka.org.za/lwazi/download/Lwazi.Phoneset.1.2.pdf 

- Language profiles (of dictionary developers): 
  http://www.meraka.org.za/lwazi/download/dictionary_developer_profiles.pdf

- 'Pronunciation dictionary development in resource-scarce environments':
  http://researchspace.csir.co.za/dspace/bitstream/10204/3620/1/Davel_d1_2009.pdf

- Linguistic resources related to the Lwazi project are also distributed via 
  the South African Language Resource Management Agency (RMA) at 
  http://rma.nwu.ac.za/.

--------------------------------------------------------------------------------

4. Directory structure

4.1 dict

The dictionary, packaged in DictionaryMaker format.
For more information on DictionaryMaker and its file formats, see section 5 below.

The following files are all in DictionaryMaker format:
- <lang>.dict		
  dictionary
- <lang>.gra		
  list of graphemes
- <lang>.pho		
  list of phonemes, each linked to an audio sample
- <lang>.wdl		
  list of words in dictionary
- <lang>-no-special.dict 
  version of dictionary with all diacritics and special characters 
  mapped to standard alphanumeric symbols
- sounds		
  directory containing audio samples of all phonemes

4.2 g2p

Grapheme-to-phoneme rules, derived from the dictionary using the Default&Refine 
algorithm. 

- <lang>.rules and <lang>.gnulls
  set of context-sensitive rules in Default&Refine format
  rules can be applied using the DictionaryMaker software

- <lang>.map.graphs
  mapping of special characters to single alphanumeric characters
  used during rule extraction

- <lang>.map.phones
  mapping of phonemes consisting of more than one character
  to single alphanumeric characters
  used during rule extraction

For more info on Default&Refine, see:
  M. Davel and E. Barnard, "Pronunciation Prediction with Default&Refine", 
  Computer Speech and Language, Vol 22, pp. 374-393, Oct 2008.
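
The phoneme mapping described above (multi-character phonemes replaced by 
single characters) can be illustrated with a minimal Python sketch. This is 
not the DictionaryMaker implementation and does not read the actual map file 
format; the function names and the phoneme labels in the example are 
hypothetical and chosen only for illustration.

```python
import string


def build_phone_map(phones):
    """Map every phoneme label to a single character.

    Single-character labels are kept as-is; longer labels receive the
    next unused ASCII letter or digit. (Assumed scheme, for illustration.)
    """
    mapping = {p: p for p in phones if len(p) == 1}
    used = set(mapping.values())
    spare = (c for c in string.ascii_letters + string.digits if c not in used)
    for p in phones:
        if len(p) > 1:
            mapping[p] = next(spare)  # raises StopIteration if symbols run out
    return mapping


def encode(pron, mapping):
    """Turn a list of phoneme labels into a string with one character each."""
    return "".join(mapping[p] for p in pron)


def decode(encoded, mapping):
    """Recover the original phoneme labels from an encoded string."""
    reverse = {single: label for label, single in mapping.items()}
    return [reverse[c] for c in encoded]


# Hypothetical phoneme labels, not taken from the Lwazi phoneme set.
phone_map = build_phone_map(["a", "b", "tS", "a:"])
print(encode(["tS", "a:", "b"], phone_map))   # cdb
print(decode("cdb", phone_map))               # ['tS', 'a:', 'b']
```

With one character per phoneme, a pronunciation can be handled as an 
ordinary string, which is convenient when aligning it against the graphemes 
of a word.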

4.3 stats

Statistics extracted from the dictionary

- <lang>.stats.graphs
  grapheme counts
- <lang>.stats.phones
  phoneme counts
- <lang>.stats.rules
  basic statistics related to number of rules and rule size per grapheme
- <lang>.stats.exceptions
  words with exceptional pronunciations
  used to identify possible errors
  (depending on the regularity of the language, 
  either the full or strict list may be more appropriate.)
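
As a rough illustration of how grapheme and phoneme counts like those above 
could be computed, here is a minimal Python sketch. It assumes the dictionary 
has already been loaded into (word, phoneme list) pairs and that every 
character of a word counts as one grapheme; it does not parse the 
DictionaryMaker file format, and the entries shown are hypothetical.

```python
from collections import Counter


def count_symbols(entries):
    """Count graphemes and phonemes over (word, phoneme_list) pairs."""
    graphs, phones = Counter(), Counter()
    for word, pron in entries:
        graphs.update(word)   # each character of the word is one grapheme
        phones.update(pron)   # each list item is one phoneme label
    return graphs, phones


# Hypothetical entries, not taken from any Lwazi dictionary.
entries = [
    ("aba", ["a", "b", "a"]),
    ("ba", ["b", "a:"]),
]
graphs, phones = count_symbols(entries)
print(graphs["a"], graphs["b"])   # 3 2
print(phones["a"], phones["a:"])  # 2 1
```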

--------------------------------------------------------------------------------

5. Downloading DictionaryMaker

DictionaryMaker is free software and can be downloaded from:
https://code.google.com/p/dictionarymaker/source/checkout

--------------------------------------------------------------------------------

