Project: NCHLT Text IV

Type: GloVe
Language: Sesotho (st)
Date: 2023-02-28
Version: 1.0

Description: 
	Static word embedding model based on the Global Vectors architecture (Pennington et al., 2014).
	The embeddings provide real-valued vector representations for Sesotho text.

Model:
	Vocab size: 53,051
	Embedding dimensions: 400

Training data sources:
	The model is trained on a combination of data sources, including:
		- NCHLT Sesotho Text Corpora
		- Autshumato Sesotho Monolingual corpus
		- Leipzig Corpus Collection
		- Common Crawl
	as well as data internally available at CTexT which is not available for distribution.
	All data was language-identified with the NCHLT Web Services Language Identifier (Puttkammer et al., 2018). Duplicate paragraphs were then removed across the entire set, because at least four of the corpora draw their data from the web and overlap substantially.
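	The cross-corpus duplicate removal described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the training corpus. The function names are hypothetical, and matching paragraphs after whitespace normalisation is an assumption, since the exact matching criterion is not documented here.

```python
import hashlib


def paragraph_key(paragraph: str) -> str:
    """Return a fingerprint for a paragraph.

    Assumption: paragraphs count as duplicates when they are identical
    after collapsing runs of whitespace. The real criterion may differ.
    """
    normalised = " ".join(paragraph.split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def dedupe_paragraphs(paragraphs):
    """Keep the first occurrence of each paragraph, preserving order."""
    seen = set()
    unique = []
    for paragraph in paragraphs:
        key = paragraph_key(paragraph)
        if key not in seen:
            seen.add(key)
            unique.append(paragraph)
    return unique
```

	Applying the function to the concatenation of all sources, rather than to each source separately, is what removes overlap between corpora.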

Training corpus:
	Paragraphs: 535,853
	Token count: 17,425,650

Usage:
	Example code for loading the GloVe model is available from https://github.com/reiselen/TrainEmbeddings/scripts
	More details on the GloVe architecture are available from https://github.com/stanfordnlp/GloVe
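	For orientation, a minimal loading sketch using only the Python standard library is given below. It assumes the vectors are stored in the plain-text format written by the Stanford GloVe tools (one word per line, followed by space-separated floating-point values). The file name in the comment is hypothetical, and the referenced example scripts remain the authoritative source.

```python
import math


def load_glove_text(path, expected_dim=None):
    """Read GloVe text-format vectors into a dict of word -> list of floats.

    Assumption: each line is a word followed by space-separated floats.
    """
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word, values = parts[0], parts[1:]
            if expected_dim is not None and len(values) != expected_dim:
                raise ValueError(
                    f"{word!r}: expected {expected_dim} values, got {len(values)}"
                )
            vectors[word] = [float(value) for value in values]
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Example usage (hypothetical file name; 400 matches the model card):
# vectors = load_glove_text("glove.st.400.txt", expected_dim=400)
# print(cosine_similarity(vectors["motho"], vectors["batho"]))
```

	Passing expected_dim=400 checks every line against the embedding dimensions listed above, so a truncated or mismatched file fails early instead of producing vectors of mixed length.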

Reference:
	Pennington, Jeffrey, Richard Socher & Christopher D. Manning. 2014. GloVe: Global vectors for word representation. Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics.
	Puttkammer, Martin J., Roald Eiselen, Justin Hocking & Frederik Koen. 2018. NLP Web Services for Resource-Scarce Languages. 56th Annual Meeting of the Association for Computational Linguistics 2018. Melbourne, Australia: Association for Computational Linguistics.
_________________________________________________________________________________
Licence for distribution: Creative Commons Attribution 4.0 International
 
URL: http://creativecommons.org/licenses/by/4.0/
 
Attribute work to: 
	CTexT® (Centre for Text Technology, North-West University), South Africa; 
	Department of Sport, Arts and Culture, South Africa.
Attribute work to URL:	
	http://humanities.nwu.ac.za/ctext
	http://www.dac.gov.za
