===============================================
South African Language Identifier
===============================================

Version:	1.2.8
Date:		2017/02/15

The South African Language Identifier (SALId) was 
developed during the NCHLT-Text phase II project.
The primary focus of SALId is language 
identification at document and line level.
_______________________________________________
i. License
===============================================

These files are distributed under the 
Creative Commons Attribution 2.5 South Africa license. 

All files are distributed under the same conditions.
_______________________________________________
License: Creative Commons Attribution 2.5 South Africa
URL: http://creativecommons.org/licenses/by/2.5/za/

Attribute work to: South African Department of 
Arts and Culture & Centre for Text Technology 
(CTexT, North-West University, South Africa)

Attribute work to URL: http://www.nwu.ac.za/ctext 
_______________________________________________
ii. Contents
===============================================

	1. Required files and tree structure
	2. Usage
	3. Troubleshooting
	4. Change log
_______________________________________________
1. Required files and tree structure:
===============================================
	./
	|_ salid(.exe)
	|_ settings.ini
	|_ ./models
		|_ af-6.dat
		|_ en-6.dat
		|_ nr-6.dat
		|_ nso-6.dat
		|_ ss-6.dat
		|_ st-6.dat
		|_ tn-6.dat
		|_ ts-6.dat
		|_ ve-6.dat
		|_ xh-6.dat
		|_ zu-6.dat
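
	The layout above can be checked before running SALId. The following is a
	minimal Python sketch, not part of SALId: the function name missing_files
	is hypothetical, and the file names are taken from the tree above.

```python
from pathlib import Path

# Language codes and the "-6.dat" suffix are copied from the tree above.
LANGUAGE_CODES = ["af", "en", "nr", "nso", "ss", "st", "tn", "ts", "ve", "xh", "zu"]


def missing_files(root):
    """Return the entries of the tree above that are absent under root."""
    root = Path(root)
    expected = [Path("settings.ini")]
    expected += [Path("models") / f"{code}-6.dat" for code in LANGUAGE_CODES]
    missing = [p.as_posix() for p in expected if not (root / p).is_file()]
    # The executable is named salid or salid.exe, depending on the platform.
    if not any((root / name).is_file() for name in ("salid", "salid.exe")):
        missing.insert(0, "salid(.exe)")
    return missing


if __name__ == "__main__":
    for name in missing_files("."):
        print("missing:", name)
```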

_______________________________________________
2. Usage:
===============================================
	2.1 Open a command prompt and navigate to the directory containing salid(.exe)
	
	2.2 Quick look:
		salid <id|train|server> -h for additional options		
		salid -v <id|train|server> -h for verbose output (optional)
		salid -s <id|train|server> -h to keep output to a minimum (optional)
		salid <id|train> -ver for version number
		
	2.3 To run the id option:
		salid id -i "input file | input directory | input phrase" 
			to identify a sample file, directory or phrase. The input must be enclosed in double quotes (") when it contains spaces.
		salid id -i "input file | input directory | input phrase" -o "output filename | directory"
			to identify a sample file, directory or phrase and write the result to "output filename" or "output directory"
		salid id -t
			to enter the input tool
		salid id -i "input file | input directory | input phrase" -o "output filename | directory" -l
			to run the identifier at line level. Omitting the -l flag identifies the input at document level.
		salid id -i "input file | input directory | input phrase" -o "output filename | directory" -b 80
			to identify the input at a benchmark (confidence) percentage of 80. The value must be between 0 and 100.
		Examples:
			salid id -i "hello world"
			salid id -i sample.txt
			salid id -i C:\samples
			salid id -i sample.txt -o result.txt
			salid id -i C:\samples -o C:\results
			salid id -t
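
	The id option can also be called from a script. The sketch below is one
	possible Python wrapper, assuming salid is on the PATH or in the working
	directory. The names build_id_command and run_id are hypothetical. The
	format of the text that salid prints is not described in this README, so
	the output is returned unparsed. Whether -l and -b may be used without -o
	is not stated above; the sketch does not enforce it.

```python
import subprocess


def build_id_command(source, output=None, line_level=False, benchmark=None, exe="salid"):
    """Assemble the argument list for `salid id` from the flags listed above."""
    cmd = [exe, "id", "-i", source]
    if output is not None:
        cmd += ["-o", output]
    if line_level:
        cmd.append("-l")  # line level; document level when omitted
    if benchmark is not None:
        if not 0 <= benchmark <= 100:
            raise ValueError("benchmark must be between 0 and 100")
        cmd += ["-b", str(benchmark)]
    return cmd


def run_id(source, **options):
    """Run salid and return its standard output as an unparsed string."""
    result = subprocess.run(
        build_id_command(source, **options),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

	Passing the arguments as a list avoids shell quoting, so a phrase such as
	"hello world" is handed to salid as a single argument.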
			
	2.4 To run the train option:
		salid train -f "filename" -n 6 -l "Language name" -o "data filename" -q 5 -c
			to train a new language model
			required:
				-f training corpus filename
				-n NGram weight
				-l language name
				-o .dat filename
				-q frequency cutoff X; entries with a frequency lower than X are removed
			optional:
				-c clean punctuation flag
		Examples:
			salid train -f CORPUS.txt -n 6 -l isiNdebele -o nr-6 -q 5 -c
			salid train -f FILE.txt -n 3 -l Afrikaans -o afDAT -q 10
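
	When several models are trained in a batch, the train command line can be
	assembled programmatically. This is a minimal Python sketch; the function
	name build_train_command and its parameter names are hypothetical, and
	only the flags listed above are used.

```python
def build_train_command(corpus, ngram, language, model_name, cutoff, clean=False, exe="salid"):
    """Assemble the argument list for `salid train` from the flags listed above."""
    cmd = [
        exe, "train",
        "-f", corpus,        # training corpus filename
        "-n", str(ngram),    # value passed to -n
        "-l", language,      # language name
        "-o", model_name,    # .dat filename
        "-q", str(cutoff),   # frequency cutoff
    ]
    if clean:
        cmd.append("-c")     # optional clean punctuation flag
    return cmd
```

	The resulting list can be passed to subprocess.run(cmd, check=True).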
	
	2.5 To run the server (network) option (*):
		salid server
			starts the app, loads the models and listens for clients on the default port (7770)
			and the default IPv4 address (127.0.0.1)
		salid server -i 143.160.26.26
			starts the app on the specified IPv4 address
		salid server -p 81
			starts the app on the specified port
		
		Examples:
			salid server -i 192.168.0.10 -p 99
		
		* Server listens for connections through INET simple sockets.
		* Data is expected in JSON style format with two properties (text, benchmark)
			Example: { text:"hello world", benchmark:80 }
			Example: { benchmark:0, text:"Ke a leboga rra" }
		* The JSON string must be encoded as UTF-8 and sent as a byte stream.
		* The reply is a byte stream that must be decoded as UTF-8. The decoded string is in JSON format.
			Example: { "language":"Afrikaans", "confidence":0.183551226 } 
			Example: { "language":"Xitsonga", "confidence":0.983545621 } 
		* See "client.py" for an example
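
	The exchange described in the notes above might look as follows in Python.
	This is a sketch under assumptions, not a copy of "client.py": the function
	names are hypothetical, the request is sent as standard JSON with quoted
	property names, and the reply is assumed to arrive in a single read,
	because this README does not describe any message framing. The default
	address and port are the ones listed above.

```python
import json
import socket


def encode_request(text, benchmark=0):
    """Serialise the two request properties (text, benchmark) to UTF-8 bytes."""
    payload = {"text": text, "benchmark": benchmark}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def decode_response(data):
    """Decode the reply bytes as UTF-8 and parse the JSON result."""
    return json.loads(data.decode("utf-8"))


def identify(text, benchmark=0, host="127.0.0.1", port=7770, timeout=5.0):
    """Send one request to a running `salid server` and return the parsed reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_request(text, benchmark))
        # Assumption: the whole reply fits in one read of 4096 bytes.
        data = conn.recv(4096)
    return decode_response(data)


if __name__ == "__main__":
    reply = identify("Ke a leboga rra")
    print(reply["language"], reply["confidence"])
```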
_______________________________________________
3. Troubleshooting:
===============================================
	Most issues are caused by incorrectly encoded text.
		* Make sure your input files are in UTF-8 without BOM.
		* Make sure filenames do not contain non-ANSI characters (known issue)
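
	Both checks above can be automated before a file is passed to salid. The
	following Python sketch uses hypothetical function names. It treats plain
	ASCII filenames as safe, which is an assumption: ASCII is a conservative
	subset of the ANSI code pages, and the exact set of characters SALId
	accepts is not listed here.

```python
import codecs
from pathlib import Path


def prepare_input(path):
    """Check that the file is valid UTF-8 and strip a leading BOM in place.

    Returns True if a BOM was removed. Raises UnicodeDecodeError if the
    content is not valid UTF-8.
    """
    path = Path(path)
    data = path.read_bytes()
    had_bom = data.startswith(codecs.BOM_UTF8)
    if had_bom:
        data = data[len(codecs.BOM_UTF8):]
    data.decode("utf-8")  # validation only; the decoded text is discarded
    if had_bom:
        path.write_bytes(data)
    return had_bom


def has_safe_name(path):
    """Return True if the filename consists of ASCII characters only."""
    return Path(path).name.isascii()
```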

_______________________________________________
4. Change log:
===============================================
	v1.2.8 (2017-02-15)
		* Exception handling of file reading
		* Server uses IP address instead of host name
		* Proper use of JSON libraries for (de)serialising strings
		* Fixed issue of white space NGRAMS.
		* Fixed issue of "Unsure" text (all languages have the same Euclidean distance)
			- Such text is no longer identified as isiZulu, but as "Unsure"
			- If two or more languages have the same probability, the input is identified as UNSURE
		
	