Now showing items 324-343 of 526

    • NCHLT Sepedi Phrase Chunk Annotated Corpus 

      D.J. Prinsloo, et al. (North-West University; Centre for Text Technology (CTexT), 2016-04-29) ~ Resource Catalogue
      Phrase chunk annotated data for the NCHLT Text Resource Development: Phase II Project. The phrase chunk annotated data is a subset of the 50,000 tokens ...
    • NCHLT Sepedi RoBERTa language model 

      Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)
      Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and not fine-tuned ...
    • NCHLT Sepedi Speech Corpus 

      Charl van Heerden, et al. (Meraka Institute, CSIR; North-West University, 2014-07-08) ~ Resource Catalogue
      Orthographically transcribed broadband speech corpus of approximately 56 hours, including a test suite of 8 speakers.
    • NCHLT Sepedi Text Corpora 

      Martin Puttkammer, et al. (North-West University; Centre for Text Technology (CTexT), 2014-05-30) ~ Resource Catalogue
      Collection of source text documents, genre classified text documents, raw corpus, clean corpus, lexicon, frequency list and named-entity lists developed ...
    • NCHLT Sepedi word2vec-CBOW embeddings 

      Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)
      Static word embeddings for the continuous bag of words (CBoW) flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embedding provides ...
    • NCHLT Sepedi word2vec-Skipgram embeddings 

      Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)
      Static word embeddings for the Skipgram flavour of the word2vec (w2v) architecture (Mikolov et al., 2013). The embedding provides real-valued vector ...
    • NCHLT Sesotho Morphological Decomposer 

      Martin Puttkammer, et al. (North-West University; Centre for Text Technology (CTexT), 2014-05-30) ~ Resource Catalogue
      Morphological decomposer developed during the NCHLT Text project.
    • NCHLT Sesotho Annotated Text Corpora 

      Martin Puttkammer, et al. (North-West University; Centre for Text Technology (CTexT), 2014-05-30) ~ Resource Catalogue
      Lemmatised, part of speech tagged and morphologically analysed corpora developed during the NCHLT Text project.
    • NCHLT Sesotho Auxiliary Speech Corpus 

      Febe de Wet, et al. (CSIR Meraka Institute; North-West University, 2019-06-01) ~ Resource Catalogue
      The corpus contains orthographically transcribed broadband speech in each of South Africa's eleven official languages. Transcriptions are provided in ...
    • NCHLT Sesotho fastText-CBoW embeddings 

      Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)
      Static word and subword embeddings for the continuous bag of words (CBoW) flavour of the fastText architecture (Bojanowski et al., 2017). The embedding ...
    • NCHLT Sesotho fastText-Skipgram embeddings 

      Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)
      Static word and subword embeddings for the Skipgram flavour of the fastText architecture (Bojanowski et al., 2017). The embedding provides real-valued ...
    • NCHLT Sesotho FLAIR-backward embeddings 

      Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)
      Contextual word/string embeddings for the backward flavour of the FLAIR architecture (Akbik et al., 2018). The embedding provides real-valued vector ...
    • NCHLT Sesotho FLAIR-forward embeddings 

      Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)
      Contextual word/string embeddings for the forward flavour of the FLAIR architecture (Akbik et al., 2018). The embedding provides real-valued vector ...
    • NCHLT Sesotho GloVe embeddings 

      Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)
      Static word embedding model based on the Global Vectors architecture (Pennington et al., 2014). The embeddings provide real-valued vector representations ...
    • NCHLT Sesotho Lemmatiser 

      Martin Puttkammer, et al. (North-West University; Centre for Text Technology (CTexT), 2014-05-30) ~ Resource Catalogue
      Lemmatiser developed during the NCHLT Text project. Available in the Readme.txt - Input format: Text data (encoding: UTF8 without BOM), one ...
    • NCHLT Sesotho Named Entity Annotated Corpus 

      M. Setaka, et al. (North-West University; Centre for Text Technology (CTexT), 2016-04-29) ~ Resource Catalogue
      Named entity annotated data from the NCHLT Text Resource Development: Phase II Project, annotated with PERSON, LOCATION, ORGANISATION and MISCELLANEOUS tags.
    • NCHLT Sesotho Phrase Chunk Annotated Corpus 

      M. Setaka, et al. (North-West University; Centre for Text Technology (CTexT), 2016-04-29) ~ Resource Catalogue
      Phrase chunk annotated data for the NCHLT Text Resource Development: Phase II Project. The phrase chunk annotated data is a subset of the 50,000 tokens ...
    • NCHLT Sesotho RoBERTa language model 

      Roald Eiselen (North-West University; Centre for Text Technology (CTexT), 2023-05-01)
      Contextual masked language model based on the RoBERTa architecture (Liu et al., 2019). The model is trained as a masked language model and not fine-tuned ...
    • NCHLT Sesotho Speech Corpus 

      Charl van Heerden, et al. (Meraka Institute, CSIR; North-West University, 2014-07-08) ~ Resource Catalogue
      Orthographically transcribed broadband speech corpus of approximately 56 hours, including a test suite of 8 speakers.
    • NCHLT Sesotho Text Corpora 

      Martin Puttkammer, et al. (North-West University; Centre for Text Technology (CTexT), 2014-05-30) ~ Resource Catalogue
      Collection of source text documents, genre classified text documents, raw corpus, clean corpus, lexicon, frequency list and named-entity lists developed ...