Word Vector

WordVector

cotk.wordvector provides classes and functions for downloading and loading word vectors automatically.

class cotk.wordvector.WordVector

Base class of all word vector loaders.

classmethod get_all_subclasses() → Iterable[Any]

Return a generator of all subclasses.

classmethod load_class(class_name) → Any

Return the subclass whose name matches class_name, case-insensitively.

Parameters

class_name (str) – target class name.
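
The lookup described above can be illustrated with a minimal sketch. It does not reproduce cotk's implementation: LoaderBase, GeneralLoader and GloveLoader are hypothetical stand-ins, and returning None when nothing matches is an assumption.

```python
from typing import Any, Iterable


class LoaderBase:
    """Hypothetical stand-in for a loader base class (not cotk's code)."""

    @classmethod
    def get_all_subclasses(cls) -> Iterable[Any]:
        # Walk the subclass tree depth-first, yielding every descendant.
        for subclass in cls.__subclasses__():
            yield subclass
            yield from subclass.get_all_subclasses()

    @classmethod
    def load_class(cls, class_name: str) -> Any:
        # Compare lowercased names so "gloveloader" and "GloveLoader" both match.
        for subclass in cls.get_all_subclasses():
            if subclass.__name__.lower() == class_name.lower():
                return subclass
        return None  # assumption: no match yields None


class GeneralLoader(LoaderBase):
    """Hypothetical direct subclass."""


class GloveLoader(GeneralLoader):
    """Hypothetical indirect subclass."""
```

With these definitions, LoaderBase.load_class("gloveloader") returns the GloveLoader class, since the generator also visits indirect subclasses.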

GeneralWordVector

class cotk.wordvector.GeneralWordVector(file_id)

Bases: dataloader.WordVector

This class is a general pretrained word vector.

Parameters

file_id (str, None) – A string indicating the source of the word vectors. It can be a local path ("./data"), a resource name ("resources://dataset"), or a URL ("http://test.com/dataset.zip"). See cotk.file_utils.get_resource_file_path() for further details. If None, no pretrained word vectors are used.

Input Format

The path should contain a text file named wordvec.txt. In the file, each word vector is described in two lines: the first line is the word (or phrase), and the next line holds the floats that make up its embedding, separated by spaces.

Example of wordvec.txt:

word
0.0 1.0 -2.3
phrases
0.3 -1.2 3.4
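
As an illustration of this two-line layout, the sketch below parses such lines into a dict. parse_wordvec_lines is a hypothetical helper, not part of cotk, and it assumes the input has no blank lines and that a trailing word without a vector line can be ignored.

```python
from typing import Dict, Iterable, List


def parse_wordvec_lines(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse alternating word / vector lines into a word-to-vector dict."""
    vectors: Dict[str, List[float]] = {}
    iterator = iter(lines)
    for word_line in iterator:
        # The line after each word holds its embedding.
        vector_line = next(iterator, None)
        if vector_line is None:
            break  # assumption: a trailing word without a vector is ignored
        word = word_line.strip()
        vectors[word] = [float(x) for x in vector_line.split()]
    return vectors


# The same content as the wordvec.txt example above.
sample = ["word\n", "0.0 1.0 -2.3\n", "phrases\n", "0.3 -1.2 3.4\n"]
parsed = parse_wordvec_lines(sample)
```

To read a real file, the open file object can be passed directly, since it iterates over lines.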

load_matrix(n_dims, vocab_list, mean=None, std=None, default_embeddings=None) → numpy.ndarray

Load the pretrained word vectors and return a 2-d numpy array. The ith row is the embedding of the ith word in vocab_list. If a word is not included in the pretrained word vectors, its embedding is initialized by:

  • default_embeddings, if it is not None.

  • normal distribution with mean and std, otherwise.

Parameters
  • n_dims (int) – The dimension of the returned word vectors. If n_dims is larger than the dimension of the pretrained word vectors, the remaining dimensions are initialized by default_embeddings or a normal distribution.

  • vocab_list (list) – The vocabulary list used in the data loader. If a word does not appear in the pretrained word vectors, its embedding is initialized by default_embeddings or a normal distribution.

  • mean (float, Any, None) – The mean of the normal distribution. It can be a float or an array whose shape is [n_dims]. If None, it is set to the mean of the loaded word vector embeddings. Default: None.

  • std (float, Any, None) – The standard deviation of the normal distribution. It can be a float or an array whose shape is [n_dims]. If None, it is set to the standard deviation of the loaded word vector embeddings. Default: None.

  • default_embeddings (Any, optional) – The default embeddings. Its shape should be [len(vocab_list), n_dims]. Default: None, which means the embeddings are initialized from the normal distribution with mean and std.

Returns

(numpy.ndarray) – A 2-d array. Shape: [len(vocab_list), n_dims].
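
The initialization rules above can be sketched with NumPy as follows. This is not cotk's implementation: build_matrix and its seed parameter are hypothetical, the pretrained vectors are passed in as a non-empty dict, and the fallback mean and std are taken as scalar statistics over all loaded values, which is an assumption.

```python
from typing import Dict, List, Optional

import numpy as np


def build_matrix(
    pretrained: Dict[str, np.ndarray],
    n_dims: int,
    vocab_list: List[str],
    mean: Optional[float] = None,
    std: Optional[float] = None,
    default_embeddings: Optional[np.ndarray] = None,
    seed: int = 0,  # hypothetical parameter, added for reproducibility
) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # Stack the pretrained vectors, keeping at most n_dims columns.
    loaded = np.stack([vec[:n_dims] for vec in pretrained.values()])
    n_loaded = loaded.shape[1]

    if default_embeddings is not None:
        # Copy so the caller's array is left untouched.
        matrix = np.array(default_embeddings, dtype=float)
    else:
        # assumption: scalar statistics of the loaded vectors are the fallback
        mu = loaded.mean() if mean is None else mean
        sigma = loaded.std() if std is None else std
        matrix = rng.normal(mu, sigma, size=(len(vocab_list), n_dims))

    # Overwrite the leading columns of rows whose word has a pretrained vector.
    for i, word in enumerate(vocab_list):
        if word in pretrained:
            matrix[i, :n_loaded] = pretrained[word][:n_dims]
    return matrix


pretrained = {"cat": np.array([1.0, 2.0]), "dog": np.array([3.0, 4.0])}
matrix = build_matrix(pretrained, n_dims=3, vocab_list=["cat", "bird", "dog"])
```

Here the rows for "cat" and "dog" start with their pretrained values, while the row for "bird" and the third column of every row come from the fallback initialization.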

load_dict(vocab_list) → Dict[str, numpy.ndarray]

Load word vector and return a dict that maps words to vectors.

Parameters

vocab_list (list) – The vocabulary list used in the data loader. Words that do not appear in the pretrained word vectors are omitted from the returned dict.

Returns

(dict) – Maps a word (str) to its pretrained embedding (numpy.ndarray), whose shape is [ndims].
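
The filtering behaviour described for load_dict can be sketched in a few lines. filter_vectors is a hypothetical helper, not cotk's code, and plain lists stand in for numpy arrays to keep the example dependency-free.

```python
from typing import Dict, List


def filter_vectors(
    pretrained: Dict[str, List[float]], vocab_list: List[str]
) -> Dict[str, List[float]]:
    # Keep only vocabulary words that have a pretrained vector; skip the rest.
    return {word: pretrained[word] for word in vocab_list if word in pretrained}


pretrained = {"cat": [1.0, 2.0], "dog": [3.0, 4.0]}
result = filter_vectors(pretrained, ["cat", "bird"])
```

In this sketch "bird" has no pretrained vector, so it is absent from result, and "dog" is absent because it is not in the vocabulary list.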

Glove

class cotk.wordvector.Glove(file_id='resources://Glove300d')

Bases: dataloader.GeneralWordVector, dataloader.WordVector

GloVe (Global Vectors for Word Representation) is a set of pre-trained word vectors.

References

[1] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

Parameters

file_id (str, None) – A string indicating the source of the word vectors. It can be a local path ("./data"), a resource name ("resources://dataset"), or a URL ("http://test.com/dataset.zip"). See cotk.file_utils.get_resource_file_path() for further details. If None, no pretrained word vectors are used. Default: resources://Glove300d, in which case a 300-d pretrained GloVe is downloaded (or loaded from cache) and used.
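
Putting the signatures documented above together, a call might look like the sketch below. It assumes cotk is installed and that the default resource can be downloaded or is already cached. load_glove_matrix is a hypothetical wrapper, not part of cotk, and the import is placed inside it so that defining the function does not require cotk.

```python
from typing import List


def load_glove_matrix(vocab_list: List[str], n_dims: int = 300):
    """Hypothetical wrapper: build an embedding matrix from the default GloVe."""
    # Imported lazily so this module can be imported without cotk installed.
    from cotk.wordvector import Glove

    wordvec = Glove()  # default file_id: "resources://Glove300d"
    # Row i of the result corresponds to vocab_list[i].
    return wordvec.load_matrix(n_dims, vocab_list)
```

Calling load_glove_matrix(["the", "cat"]) would, under these assumptions, return an array of shape [2, 300].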