NCHLT Optical Character Recognition for South African Languages

Martin Puttkammer; Justin Hocking; Roald Eiselen

Please do not copy the URL from the browser for citation. The correct URL is 'https://hdl.handle.net/20.500.12185/322'

dc.contact.email	Martin.Puttkammer@nwu.ac.za
dc.contact.name	Martin Puttkammer
dc.contributor.author	Martin Puttkammer
dc.contributor.author	Justin Hocking
dc.contributor.author	Roald Eiselen
dc.date.accessioned	2018-02-05T20:22:45Z
dc.date.accessioned	2018-03-05T17:46:33Z
dc.date.available	2018-02-05T20:22:45Z
dc.date.available	2018-03-05T17:46:33Z
dc.date.issued	2017-02-23
dc.description	An OCR system is an application that converts scanned paper documents into editable and searchable text. The engine analyses the structure of the document image and divides the page into elements such as blocks of text, tables and images. Within these blocks it identifies character image patterns, from which it advances several hypotheses about the possible characters. These hypotheses are used to produce different character-, word- and line-level variations with associated probabilities. The set of hypotheses is then searched to find the most likely combination of characters, words and lines, which yields a textual representation of the image.
dc.format.medium	UTF8
dc.identifier.citation	Hocking, J. and Puttkammer, M., 2016, November. Optical character recognition for South African languages. In Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), 2016 (pp. 1-5). IEEE.
dc.identifier.uri	https://hdl.handle.net/20.500.12185/322
dc.language.iso	afr
dc.language.iso	eng
dc.language.iso	nbl
dc.language.iso	xho
dc.language.iso	zul
dc.language.iso	sot
dc.language.iso	nso
dc.language.iso	tsn
dc.language.iso	ssw
dc.language.iso	ven
dc.language.iso	tso
dc.languages	Afrikaans
dc.languages	English
dc.languages	isiNdebele
dc.languages	isiXhosa
dc.languages	isiZulu
dc.languages	Sesotho sa Leboa (Sepedi)
dc.languages	Setswana
dc.languages	Sesotho
dc.languages	Siswati
dc.languages	Tshivenda
dc.languages	Xitsonga
dc.media.type	Text
dc.project	NCHLT Text III
dc.publisher	North-West University
dc.publisher	Centre for Text Technology (CTexT)
dc.rights.license	Creative Commons Attribution 3.0 Unported License (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/za/
dc.software.requirements	Tesseract-OCR
dc.title	NCHLT Optical Character Recognition for South African Languages
dc.type	Tools
dc.version	1.0
local.collection.primary	Resource Catalogue
local.collection.secondary	Resource Index
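
The hypothesis search mentioned in dc.description can be illustrated with a toy sketch. It is not taken from the NCHLT bundle and is not how Tesseract implements its search: the function name `best_reading`, the `lexicon_bonus` weight and the example probabilities are all hypothetical. Each character position holds a few candidate characters with probabilities, and the sketch scores every combination and keeps the most likely one, optionally favouring words found in a lexicon.

```python
from itertools import product
from math import prod


def best_reading(hypotheses, lexicon=None, lexicon_bonus=2.0):
    """Return the most likely string and its score.

    hypotheses: one list per character position, each holding
    (character, probability) candidates. Illustrative only; the
    lexicon bonus is an assumed weighting, not a documented one.
    """
    best_text, best_score = "", 0.0
    # Enumerate every combination of one candidate per position.
    for combo in product(*hypotheses):
        text = "".join(ch for ch, _ in combo)
        # Score a combination by the product of its probabilities.
        score = prod(p for _, p in combo)
        # Favour readings that form a known word.
        if lexicon and text in lexicon:
            score *= lexicon_bonus
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score


# Made-up hypotheses for a four-character word image.
word_hypotheses = [
    [("h", 0.45), ("b", 0.55)],
    [("u", 0.9), ("n", 0.1)],
    [("i", 0.8), ("l", 0.2)],
    [("s", 1.0)],
]

print(best_reading(word_hypotheses)[0])
print(best_reading(word_hypotheses, lexicon={"huis"})[0])
```

Exhaustive enumeration is practical only for very short inputs, since the number of combinations grows exponentially with the number of positions. A real engine would prune the search space instead of scoring every combination.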

Files

Original bundle

Name: nchlt_optical_character_recognition.zip
Size: 103.81 MB
Format: ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed.
