Browsing by Title

NCHLT Afrikaans Text Corpora

Martin Puttkammer, et al. (North-West University; Centre for Text Technology (CTexT), 2014-05-30) ~ Resource Catalogue

Collection of source text documents, genre classified text documents, raw corpus, clean corpus, lexicon, frequency list and named-entity lists developed ...

NCHLT Siswati Morphological Decomposer

Martin Puttkammer, et al. (North-West University; Centre for Text Technology (CTexT), 2014-05-30) ~ Resource Catalogue

Morphological decomposer developed during the NCHLT Text project.

NCHLT Afrikaans Annotated Text Corpora

Martin Puttkammer, et al. (North-West University; Centre for Text Technology (CTexT), 2014-05-30) ~ Resource Catalogue

Lemmatised, part-of-speech tagged and morphologically analysed corpora developed during the NCHLT Text project.

NCHLT Afrikaans Auxiliary Speech Corpus

Febe de Wet, et al. (CSIR Meraka Institute; North-West University, 2019-06-01) ~ Resource Catalogue

The corpus contains orthographically transcribed broadband speech in each of South Africa's eleven official languages. Transcriptions are provided in ...

NCHLT Afrikaans fastText-CBoW embeddings

Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)

Static word and subword embeddings for the continuous bag of words (CBoW) flavour of the fastText architecture (Bojanowski et al., 2017). The embedding ...

NCHLT Afrikaans fastText-Skipgram embeddings

Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)

Static word and subword embeddings for the Skipgram flavour of the fastText architecture (Bojanowski et al., 2017). The embedding provides real-valued ...

NCHLT Afrikaans FLAIR-backward embeddings

Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)

Contextual word/string embeddings for the backward flavour of the FLAIR architecture (Akbik et al., 2018). The embedding provides real-valued vector ...

NCHLT Afrikaans FLAIR-forward embeddings

Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)

Contextual word/string embeddings for the forward flavour of the FLAIR architecture (Akbik et al., 2018). The embedding provides real-valued vector ...

NCHLT Afrikaans GloVe embeddings

Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)

Static word embedding model based on the Global Vectors architecture (Pennington et al., 2014). The embeddings provide real-valued vector representations ...

NCHLT Afrikaans Lemmatiser

Martin Puttkammer, et al. (North-West University; Centre for Text Technology (CTexT), 2014-05-30) ~ Resource Catalogue

Lemmatiser developed during the NCHLT Text project. Available in the Readme.txt - Input format: Text data (encoding: UTF8 without BOM), one lowercase ...

NCHLT Afrikaans Morphological Decomposer

Martin Puttkammer, et al. (North-West University; Centre for Text Technology (CTexT), 2014-05-30) ~ Resource Catalogue

Morphological decomposer developed during the NCHLT Text project.

NCHLT Afrikaans Named Entity Annotated Corpus

Gerhard van Huyssteen, et al. (North-West University; Centre for Text Technology (CTexT), 2016-04-29) ~ Resource Catalogue

Named entity annotated data from the NCHLT Text Resource Development: Phase II Project, annotated with PERSON, LOCATION, ORGANISATION and MISCELLANEOUS tags.

NCHLT Afrikaans Phrase Chunk Annotated Corpus

Gerhard van Huyssteen, et al. (North-West University; Centre for Text Technology (CTexT), 2016-04-29) ~ Resource Catalogue

Phrase chunk annotated data for the NCHLT Text Resource Development: Phase II Project. The phrase chunk annotated data is a subset of the 50,000 tokens ...

NCHLT Afrikaans RoBERTa language model

Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)

Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and not fine-tuned ...

NCHLT Afrikaans Speech Corpus

Charl van Heerden, et al. (Meraka Institute, CSIR; North-West University, 2014-07-08) ~ Resource Catalogue

Orthographically transcribed broadband speech corpus of approximately 56 hours, including a test suite of 8 speakers.

NCHLT Afrikaans word2vec-CBOW embeddings

Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)

Static word embeddings for the continuous bag of words (CBoW) flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embedding provides ...

NCHLT Afrikaans word2vec-Skipgram embeddings

Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)

Static word embeddings for the Skipgram flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embedding provides real-valued vector ...

NCHLT English Auxiliary Speech Corpus

Febe de Wet, et al. (CSIR Meraka Institute; North-West University, 2019-06-01) ~ Resource Catalogue

The corpus contains orthographically transcribed broadband speech in each of South Africa's eleven official languages. Transcriptions are provided in ...

NCHLT English Speech Corpus

Charl van Heerden, et al. (Meraka Institute, CSIR; North-West University, 2014-07-08) ~ Resource Catalogue

Orthographically transcribed broadband speech corpus of approximately 56 hours, including a test suite of 8 speakers.

NCHLT English Text Corpora

Martin Puttkammer, et al. (North-West University; Centre for Text Technology (CTexT), 2016-09-09) ~ Resource Catalogue

Collection consisting of a clean corpus, lexicon, frequency list and named-entity lists developed during the NCHLT Text project.