Word Vector
WordVector
cotk.wordvector
provides classes and functions for downloading and
loading word vectors automatically.
-
class
cotk.wordvector.
WordVector
Base class of all word vector loaders.
-
classmethod
get_all_subclasses
() → Iterable[Any]
Return a generator over all subclasses.
-
classmethod
load_class
(class_name) → Any
Return the subclass whose name matches class_name, case-insensitively.
- Parameters
class_name (str) – Name of the target class.
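The two class methods above can be illustrated with a small stand-alone sketch. It does not import cotk and is not the library's implementation: the classes `Base`, `General`, and `Glove` below are hypothetical stand-ins, and returning `None` for an unknown name is an assumption made for this example only.

```python
from typing import Any, Iterator, Optional


class Base:
    """Hypothetical base class standing in for a loader hierarchy."""

    @classmethod
    def get_all_subclasses(cls) -> Iterator[Any]:
        # Yield each direct subclass, then recurse into its own subclasses.
        for subclass in cls.__subclasses__():
            yield subclass
            yield from subclass.get_all_subclasses()

    @classmethod
    def load_class(cls, class_name: str) -> Optional[Any]:
        # Compare lower-cased names so that "glove" matches "Glove".
        for subclass in cls.get_all_subclasses():
            if subclass.__name__.lower() == class_name.lower():
                return subclass
        # Assumption for this sketch: unknown names yield None.
        return None


class General(Base):
    pass


class Glove(General):
    pass


found = Base.load_class("glove")
```

Looking a class up by name like this lets a configuration file or command-line option select a loader without the caller importing each subclass explicitly.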
GeneralWordVector
-
class
cotk.wordvector.
GeneralWordVector
(file_id)
Bases: dataloader.WordVector
This class loads general pretrained word vectors.
- Parameters
file_id (str, None) – A string indicating the source of the word vectors. It can be a local path ("./data"), a resource name ("resources://dataset"), or a URL ("http://test.com/dataset.zip"). See cotk.file_utils.get_resource_file_path() for further details. If None, no pretrained word vectors are used.
- Input Format
A text file named wordvec.txt should be contained in the path. In the file, each word vector is described in two lines: the first line is the word (or phrase), and the next line contains the floats of its embedding.
Example of wordvec.txt:
    word
    0.0 1.0 -2.3
    phrases
    0.3 -1.2 3.4
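As a minimal sketch of how the two-line format can be read, the hypothetical helper `parse_wordvec` below turns the file content into a mapping from word to vector. It is not part of cotk; it assumes floats are separated by whitespace and that blank lines can be ignored.

```python
from typing import Dict, List


def parse_wordvec(text: str) -> Dict[str, List[float]]:
    """Parse the two-lines-per-entry format into a word -> vector mapping."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    vectors: Dict[str, List[float]] = {}
    # Lines alternate: the word (or phrase), then its whitespace-separated floats.
    for word_line, vec_line in zip(lines[0::2], lines[1::2]):
        vectors[word_line.strip()] = [float(x) for x in vec_line.split()]
    return vectors


sample = "word\n0.0 1.0 -2.3\nphrases\n0.3 -1.2 3.4\n"
parsed = parse_wordvec(sample)
```

Keeping the word on its own line means a phrase containing spaces does not need quoting or escaping.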
-
load_matrix
(n_dims, vocab_list, mean=None, std=None, default_embeddings=None) → numpy.ndarray
Load pretrained word vectors and return a 2-d numpy array. The i-th row is the embedding of the i-th word in vocab_list. If a word is not included in the pretrained word vectors, its embedding is initialized by:
- default_embeddings, if it is not None.
- a normal distribution with mean and std, otherwise.
- Parameters
n_dims (int) – The dimension of the word vectors. If n_dims is larger than the dimension of the pretrained word vectors, the remaining dimensions are initialized by default_embeddings or a normal distribution.
vocab_list (list) – The vocabulary list used in the data loader. Any word that does not appear in the pretrained word vectors is initialized by default_embeddings or a normal distribution.
mean (float, Any, None) – The mean of the normal distribution. It can be a float or an array of shape [n_dims]. If None, it is set to the mean of the loaded word vector embeddings. Default: None.
std (float, Any, None) – The standard deviation of the normal distribution. It can be a float or an array of shape [n_dims]. If None, it is set to the standard deviation of the loaded word vector embeddings. Default: None.
default_embeddings (Any, optional) – The default embeddings; its shape should be [len(vocab_list), n_dims]. Default: None, which means the embeddings are initialized from the normal distribution with mean and std.
- Returns
(numpy.ndarray) – A 2-d array. Size: [len(vocab_list), n_dims].
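The fallback order described for load_matrix can be sketched as follows. This is an illustration under stated assumptions, not cotk's implementation: `build_matrix` and its `pretrained` and `seed` arguments are hypothetical, and when mean or std is None the sketch uses a single scalar computed over all loaded values, which may differ from what the library computes.

```python
import numpy as np


def build_matrix(pretrained, n_dims, vocab_list, mean=None, std=None,
                 default_embeddings=None, seed=0):
    """Hypothetical sketch of the fallback order; not cotk's implementation."""
    # Pretrained vectors for the words that the vocabulary actually contains.
    known = [np.asarray(pretrained[w], dtype=float)
             for w in vocab_list if w in pretrained]
    # Assumption: a scalar statistic over all loaded values when not given.
    if mean is None:
        mean = np.concatenate(known).mean() if known else 0.0
    if std is None:
        std = np.concatenate(known).std() if known else 1.0

    if default_embeddings is not None:
        # Copy so the caller's array is left untouched.
        matrix = np.array(default_embeddings, dtype=float)
    else:
        rng = np.random.default_rng(seed)
        matrix = rng.normal(mean, std, size=(len(vocab_list), n_dims))

    # Overwrite the fallback values wherever a pretrained value exists.
    for i, word in enumerate(vocab_list):
        if word in pretrained:
            vec = np.asarray(pretrained[word], dtype=float)
            k = min(n_dims, vec.shape[0])
            matrix[i, :k] = vec[:k]
    return matrix


pretrained = {"word": [0.0, 1.0, -2.3], "phrases": [0.3, -1.2, 3.4]}
vocab = ["word", "unknown", "phrases"]
padded = build_matrix(pretrained, 4, vocab, default_embeddings=np.zeros((3, 4)))
sampled = build_matrix(pretrained, 2, vocab)
```

In `padded`, the row for "unknown" and the fourth column of every row keep their default values, because no pretrained value exists for them. In `sampled`, the same positions would instead be drawn from the normal distribution.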
-
load_dict
(vocab_list) → Dict[str, numpy.ndarray]
Load word vectors and return a dict that maps words to vectors.
- Parameters
vocab_list (list) – The vocabulary list used in the data loader. Words that do not appear in the pretrained word vectors are omitted from the returned dict.
- Returns
(dict) – Maps a word (str) to its pretrained embedding (numpy.ndarray) of shape [ndims].
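The filtering behaviour of load_dict can be sketched in a few lines. The helper `select_known` and the in-memory `pretrained` mapping are hypothetical and use plain lists instead of numpy arrays; the sketch only shows that vocabulary words without a pretrained vector are left out of the result.

```python
from typing import Dict, List, Sequence


def select_known(pretrained: Dict[str, List[float]],
                 vocab_list: Sequence[str]) -> Dict[str, List[float]]:
    # Keep only vocabulary words that have a pretrained vector.
    return {word: pretrained[word] for word in vocab_list if word in pretrained}


pretrained = {"word": [0.0, 1.0, -2.3], "phrases": [0.3, -1.2, 3.4]}
result = select_known(pretrained, ["word", "unknown"])
```

Unlike the matrix form, the dict form leaves it to the caller to decide how to handle words without a pretrained vector.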
Glove
-
class
cotk.wordvector.
Glove
(file_id='resources://Glove300d')
Bases: dataloader.GeneralWordVector, dataloader.WordVector
GloVe (Global Vectors for Word Representation) is a set of pre-trained word vectors.
References
[1] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
- Parameters
file_id (str, None) – A string indicating the source of the word vectors. It can be a local path ("./data"), a resource name ("resources://dataset"), or a URL ("http://test.com/dataset.zip"). See cotk.file_utils.get_resource_file_path() for further details. If None, no pretrained word vectors are used. Default: resources://Glove300d, in which case a 300-d pretrained GloVe is downloaded (or loaded from cache) and used.