Data Loader¶
cotk.dataloader
provides classes and functions for downloading and
loading benchmark data automatically. It reduces the cost of preprocessing
data and provides a fair dataset for every model. It also helps you adapt
your model from one dataset to other datasets.
Overview¶
Dataloaders are essential components in CoTK
for building models and performing fair evaluation.
CoTK
uses a dataloader class, LanguageProcessing
, to handle all language-related tasks.
Here we give an overview of what makes up a LanguageProcessing
dataloader.
- A dataloader may have multiple sets of data. In this example, the names of the 3 sets (set_name) are "train", "dev", and "test".
- Each set stores the data read from a text file. In this example, the 3 sets are read from "train.txt", "dev.txt", and "test.txt".
- A set may have multiple data fields. In this example, the "train" set has two fields, whose names (field_name) are "post" and "resp".
- Data fields are specified by Field instances. A Field defines the way that the dataloader reads, processes, and outputs the data. (It does not store the data; the data is stored in the dataloader.)
- A Field instance can be shared between data fields. In this example, "post" in the "train" set and "post" in the "dev" set share an instance.
- Tokenizer defines the method to tokenize a sentence.
- Vocab defines the vocabulary. An instance of Vocab can be shared between multiple Field instances, where the data from those fields is used to construct the vocabulary together (see the sketch below).
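The sharing described above can be expressed roughly as follows. This is a minimal sketch; the constructor arguments shown here are assumptions based on the classes documented later in this page.
>>> from cotk.dataloader import GeneralVocab, SentenceDefault
>>> vocab = GeneralVocab(min_frequent_vocab_times=10)            # one vocabulary object (assumed arguments)
>>> postField = SentenceDefault(tokenizer="space", vocab=vocab)  # field used for "post"
>>> respField = SentenceDefault(tokenizer="space", vocab=vocab)  # field used for "resp", sharing the same Vocab
The same postField object can then be reused for the "post" field of several sets, so those data fields share one Field instance.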
Building a Dataloader¶
Predefined Tasks¶
CoTK
provides several predefined tasks and benchmarks, including the classes documented below.
Choosing an adequate class for your task is the simplest way to build a dataloader. Each class explains how its dataloader is composed.
Customized Tasks¶
If the predefined classes do not satisfy your needs, you can construct an instance of LanguageProcessing
.
To specify the data format of the customized task, the initialization of LanguageProcessing
receives an argument named fields
.
A full specification of fields
looks like the example below.
>>> postField = SentenceDefault(...)
>>> respField = SentenceDefault(...)
>>> labelField = DenseLabel(...)
>>> fields = {
>>> "train": [("post", postField), ("resp", respField)],
>>> "test": [("post", postField), ('resp', respField), ('label', labelField)]
>>> }
>>> dataloader = LanguageProcessing("/path/to/dataset", fields)
"train"
and"test"
is the name of the split sets in the dataset. There should be two text file namedtrain.txt
andtest.txt
under/path/to/dataset/
, corresponding to the two sets,"train"
and"test"
respectively.fields["train"]
describes the data format oftrain.txt
. Every sample intrain
set has two data fields, which is represented byField
objects. AsSentenceDefault
(a subclass ofField
) only read one line per each sample, a sample intrain.txt
occupy two lines. The first line are named by"post"
, the second line are named"resp"
.Similarily,
fields["test"]
describes the data format oftest.txt
. Every sample intest
set occupies three lines, where the first line is"post"
, the second line is"resp"
, and the third line is an integer indicating"label"
.
A valid input example:
/path/to/dataset/train.txt

How are you?
I am fine.
What's up?
Everything is good.

/path/to/dataset/test.txt

What is your name?
Jack.
1
How about the food?
Terrible.
0
The Field
instances define how dataloaders read the file, process the data, and provide the data to networks.
See fields for further details.
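Once the dataloader is constructed, batches can be fetched by set name. A brief sketch (the keys below are what SentenceDefault and DenseLabel would typically contribute; the exact keys come from each field's get_batch(), documented later):
>>> batch = dataloader.get_batch("test", [0])
>>> sorted(batch.keys())
['label', 'post', 'post_allvocabs', 'post_length', 'post_str',
 'resp', 'resp_allvocabs', 'resp_length', 'resp_str']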
Omit Set Names
If you have three sets named "train"
, "dev"
, "test"
, and the data format is the same, you can
specify the fields
argument in the initialization of LanguageProcessing
with the following code:
>>> fields = [("post", postField), ("resp", respField)]
which is equivalent to
>>> fields = {
>>> "train": [("post", postField), ("resp", respField)],
>>> "dev": [("post", postField), ("resp", respField)],
>>> "test": [("post", postField), ("resp", respField)]
>>> }
Use Simple Create
You can use LanguageProcessing.simple_create()
to initialize a dataloader, using class names of Field
subclasses instead of instances. The method receives arguments for initializing subclasses of Vocab
and Field
.
>>> fields = {
>>> "train": [("post", "SentenceDefault"), ("resp", "SentenceDefault")],
>>> "dev": [("post", "SentenceDefault"), ("resp", "SentenceDefault")],
>>> "test": [("post", "SentenceDefault"), ("resp", "SentenceDefault")],
>>> }
>>> #or fields = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
>>> dataloader = LanguageProcessing.simple_create("/path/to/dataset", fields, \
>>> max_sent_length=10, tokenizer="space", min_frequent_vocab_times=10)
In this example, it will first create a GeneralVocab
instance with min_frequent_vocab_times=10
.
Then it initializes SentenceDefault
objects with max_sent_length=10, tokenizer="space"
and the created Vocab
.
Use Context Manager
There is another way to use class names of Field
subclasses instead of instances: initialize the LanguageProcessing
in the context of FieldContext
and VocabContext
.
>>> fields = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
>>> with FieldContext.set_parameters(max_sent_length=10, tokenizer="space"):
>>> with VocabContext.set_parameters(min_frequent_vocab_times=10):
>>> dataloader = LanguageProcessing("/path/to/dataset", fields)
which is equivalent to
>>> fields = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
>>> dataloader = LanguageProcessing.simple_create("/path/to/dataset", fields, max_sent_length=10, tokenizer="space", min_frequent_vocab_times=10)
Context is used to provide default values for Field
and Vocab
instances.
See Context for further details.
Field¶
Field
indicates data fields, which work behind the scenes in dataloaders.
They define how dataloaders read the file, process the data, and provide the data to networks.
CoTK
provides several fields, including those documented below.
Note Field
never stores data, because an instance can be shared between different data fields in a dataloader.
Read the File¶
Field
defines the way to read the file. For example,
Sentence
reads one line per sample, which is a sentence string.
Session
reads multiple lines per sample, stopping when an empty line is read.
DenseLabel
reads one line per sample, which is an integer.
See the documentation in each class for details.
Process the Data¶
Each subclass of Field
defines the methods to process the input.
For example, Sentence
processes the sentence into different formats:
(str) The whole sentence.
(List[str]) The tokenized sentence.
(List[int]) The indices of tokens in the vocabulary.
Sentence
also provides methods to convert a sentence from one format to another, such as convert_tokens_to_ids() and convert_ids_to_tokens().
The dataloader has similar methods, which invoke the corresponding methods of the default field.
See LanguageProcessing.set_default_field()
for details.
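For example, using the dataloader-level wrappers documented below (the returned values are illustrative and depend on the tokenizer and vocabulary):
>>> tokens = dataloader.tokenize("how are you")
>>> tokens
['how', 'are', 'you']
>>> ids = dataloader.convert_tokens_to_ids(tokens, add_special=True)   # e.g. [2, 4, 5, 6, 3]
>>> dataloader.convert_ids_to_sentence(ids)
'how are you'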
Provide the Data to Networks¶
Each subclass of Field
defines Field.get_batch()
,
which returns a dict of data for training the networks.
For example, if an instance of SentenceDefault
is named "sent"
,
SentenceDefault.get_batch()
will return a dict containing:
sent
sent_length
sent_allvocabs
sent_str
LanguageProcessing.get_batch()
will collect dict returned from every field and merge them.
For example, for a dataloader with two SentenceDefault
fields named "post"
and "resp",
LanguageProcessing.get_batch()
will return a dict containing:
post
post_allvocabs
post_length
post_str
resp
resp_allvocabs
resp_length
resp_str
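In code, this merged result looks roughly like the following sketch (the values are numpy arrays or lists, as described for SentenceDefault.get_batch() below):
>>> batch = dataloader.get_batch("train", [0, 1])
>>> sorted(batch.keys())
['post', 'post_allvocabs', 'post_length', 'post_str',
 'resp', 'resp_allvocabs', 'resp_length', 'resp_str']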
Pretrained Field¶
Default fields like SentenceDefault
and SessionDefault
are designed
for common use in different language processing tasks. They use <go>
and <eos>
to mark
the start and the end of sentences.
For some pretrained models like GPT2
, <go>
is not in the pretrained vocabulary and thus not available.
We design different fields for different pretrained models, including:
GPT2:
SentenceGPT2
,SessionGPT2
BERT:
SentenceBERT
,SessionBERT
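A hedged sketch of using the GPT2 variants; the way the transformers tokenizer is wrapped below is an assumption, so refer to SentenceGPT2, PretrainedTokenizer, and PretrainedVocab for the exact interfaces:
>>> from transformers import GPT2Tokenizer
>>> gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> tokenizer = PretrainedTokenizer(gpt2_tokenizer)   # assumed: wraps the transformers tokenizer
>>> vocab = PretrainedVocab(gpt2_tokenizer)           # assumed: vocabulary taken from the pretrained model
>>> sentField = SentenceGPT2(tokenizer=tokenizer, vocab=vocab)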
Tokenizer¶
Tokenizer
defines the method to tokenize a sentence, which is used by Field
.
CoTK
provides several tokenizers, including:
- SimpleTokenizer: A simple tokenizer for general use in CoTK, supporting space or nltk tokenization.
- PretrainedTokenizer: A pretrained tokenizer from the transformers package, for example, the tokenizer for GPT2.
When creating a dataloader, it often receives either a str or a Tokenizer. If a str, the following values are acceptable:
- space: Split by spaces.
- nltk: nltk.tokenize.WordPunctTokenizer will be used.
A SimpleTokenizer will be created from the str argument.
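For example (a small sketch; the import path and constructor signature of SimpleTokenizer shown here are assumptions):
>>> from cotk.dataloader import SimpleTokenizer
>>> tokenizer = SimpleTokenizer("space")      # or SimpleTokenizer("nltk")
>>> tokenizer.tokenize("How are you?")
['How', 'are', 'you?']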
Vocabulary¶
Vocab
defines the vocabulary, which is used by Field
.
CoTK
provides several vocabularies, including:
- GeneralVocab: A vocabulary for general use in CoTK. The vocabulary list is often built while processing the input data. Saving and loading a predefined vocabulary is also supported.
- PretrainedVocab: A pretrained vocabulary from the transformers package, for example, the vocabulary for GPT2.
Type of Tokens¶
All tokens that appear in the dataset (including those that only appear in the test set) are divided into 2 categories.
- Frequent Vocabularies (frequent_vocabs)
Tokens that the model should read, predict, and generate.
These tokens are important in evaluation. They include common words and usually cover most of the tokens in the dataset.
They are extracted from the training set only, because models should be blind to the test set. Hence, they are defined as the tokens that appear no less than a specified number of times (min_frequent_vocab_times) in the training set.
- Rare Vocabularies (rare_vocabs)
Tokens that the model can optionally read, but will not predict or generate most of the time (except that some models can generate rare words using copy mechanisms or external knowledge).
These tokens are less important but DO affect the evaluation.
They are extracted from both the training set and the test set, because they are defined with evaluation in mind. Hence, they are defined as the tokens (excluding frequent_vocabs) that appear no less than a specified number of times (min_rare_vocab_times) in the whole dataset.
There are also some other terms for vocabularies.
- All Vocabularies (allvocabs)
The union of frequent vocabularies and rare vocabularies is called all vocabularies.
- Special Tokens (special_tokens)
The most used special tokens are <pad>, <unk>, <go>, and <eos>.
Special tokens are counted as frequent vocabularies.
- Unknown Tokens (<unk>)
<unk> means "out of vocabulary", but the meaning of <unk> varies with the situation.
If it appears in a list whose name contains allvocabs (e.g., sent_allvocabs), <unk> indicates a token out of all vocabularies.
If it appears in a list whose name does not contain allvocabs (e.g., sent), <unk> indicates a token out of frequent vocabularies, which means it may be a rare vocabulary.
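A small worked example with hypothetical token counts and thresholds:
>>> # training set counts: "hello" x3, "world" x2, "cotk" x1
>>> # test set counts:     "hello" x1, "cotk" x1, "foo" x1
>>> # with min_frequent_vocab_times=2 and min_rare_vocab_times=2:
>>> #   frequent_vocabs = ["hello", "world"]   # counted on the training set only
>>> #   rare_vocabs     = ["cotk"]             # counted on the whole dataset, frequent words excluded
>>> #   "foo" is out of all vocabularies and maps to <unk> even in *_allvocabs lists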
Why CoTK Uses Rare Words¶
In traditional implementations, the vocabulary only contains frequent words.
CoTK
uses frequent vocabulary and rare vocabulary to support fair comparisons across different configurations.
For example, we test two models on the same dataset, but with different vocabularies.
- Model A: Frequent vocabulary F_A; rare vocabulary R_A.
- Model B: Frequent vocabulary F_B; rare vocabulary R_B.
The fairness of comparisons can be guaranteed under the following conditions:
- metric.PerplexityMetric: F_A + R_A == F_B + R_B.
- metric.BleuCorpusMetric: F_A + R_A == F_B + R_B if tokenizer is None; always fair if tokenizer is set.
See each metric for when the fairness can be guaranteed. Hash values can help users determine whether a comparison is fair.
Connecting Field and Vocab¶
GeneralVocab
is often shared between fields to construct the vocabulary list together.
To identify whether the tokens from a field are regarded as part of the training set or the test set
(which is relevant to the division of frequent vocab and rare vocab), Sentence
uses an argument named vocab_from_mappings
.
vocab_from_mappings
is a dict, which infers the type of tokens from the set name. By default:
Set Name    | Type
train       | train
training    | train
dev         | test
development | test
valid       | test
validation  | test
test        | test
evaluation  | test
For example, a token from the training
set will have a type of train
.
The type will be passed to Vocab.add_tokens()
as vocab_from
.
There are 3 types:
- train: Frequent vocabs are selected from tokens of this type.
- test: Rare vocabs are selected from tokens of this type.
- extra: Tokens of this type will not be selected as frequent or rare vocabs.
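For example, the mapping can be overridden when constructing a field. A sketch, where vocab is a GeneralVocab as in the earlier examples and the dict below is a hypothetical configuration:
>>> # Treat the "dev" set as extra data: its tokens are added to the vocabulary object
>>> # but are selected as neither frequent nor rare vocabs.
>>> mappings = {"train": "train", "dev": "extra", "test": "test"}
>>> field = SentenceDefault(tokenizer="space", vocab=vocab, vocab_from_mappings=mappings)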
Context¶
FieldContext
and VocabContext
are used to set
the default arguments for subclasses of Field
and Vocab
respectively.
>>> vocab = GeneralVocab(...)
>>> with FieldContext.set_parameters(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10):
>>> field = SentenceDefault()
is equivalent to:
>>> vocab = GeneralVocab(...)
>>> field = SentenceDefault(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10)
Contexts can be stacked. The weak
argument controls whether the inner context overrides the outer one: with weak=True, values already set by the outer context are kept.
>>> vocab = GeneralVocab(...)
>>> with FieldContext.set_parameters(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10):
>>> with FieldContext.set_parameters(min_frequent_vocab_times=20):
>>> field1 = SentenceDefault() # min_frequent_vocab_times=20
>>> with FieldContext.set_parameters(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10):
>>> with FieldContext.set_parameters(min_frequent_vocab_times=20, weak=True):
>>> field2 = SentenceDefault() # min_frequent_vocab_times=10
Contexts usually work together with the initialization of LanguageProcessing
, without creating instances of Field
or Vocab
.
See the examples here.
Hash Value for Dataloader¶
It is usually difficult to track the differences among different configurations, so CoTK provides hash codes to identify each part of the dataloader, including the input data, vocabularies, and settings.
For example, if two data loaders have the same general hash, their data, vocabularies and settings are guaranteed to be the same.
LanguageProcessing
provides the following hash values:
- LanguageProcessing.get_raw_data_hash(): Tracks the raw input files before processing.
- LanguageProcessing.get_data_hash(): Tracks the data after processing.
- LanguageProcessing.get_vocab_hash(): Tracks the vocabulary.
- LanguageProcessing.get_setting_hash(): Tracks the settings (arguments of the dataloader).
- LanguageProcessing.get_general_hash(): Tracks all of the above.
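For example (a sketch; the hash strings themselves are implementation-defined, only their equality matters):
>>> a = LanguageProcessing.simple_create("/path/to/dataset", fields, tokenizer="space", min_frequent_vocab_times=10)
>>> b = LanguageProcessing.simple_create("/path/to/dataset", fields, tokenizer="nltk", min_frequent_vocab_times=10)
>>> a.get_raw_data_hash() == b.get_raw_data_hash()   # same raw files
True
>>> a.get_general_hash() == b.get_general_hash()     # different tokenizer, hence different settings
False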
Dataloader¶
LanguageProcessing¶
-
class
cotk.dataloader.
LanguageProcessing
(file_id, fields)[source]¶ Bases:
dataloader.Dataloader
Base class for all language processing tasks. This is an abstract class.
During the initialization of a dataloader,
Vocab
,Tokenizer
orField
may be created. See how to create a dataloader.- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id.fields (List, OrderedDict, Dict) – This argument supports multiple input types:
If OrderedDict or List, it specifies the data format of the "train", "dev", and "test" sets.
A data format should be an OrderedDict, or a List[Tuple] that can be converted to an OrderedDict
.The
key
ofdata format
is the name of a Field (used byget_batch()
), and thevalue
is either a class name of a Field or aField
object.Examples:
>>> postField = SentenceDefault(...) >>> respField = SentenceDefault(...) >>> data_format = [("post", postField), ("resp", respField)]
or
>>> data_format = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
Examples:
>>> fields = data_format
is equivalent to
>>> fields = {"train": data_format, "dev": data_format, "test": data_format}
If
Dict
,fields[key]
describesdata format
of the set namedkey
. Examples:
>>> fields = {"train": data_format, "extra": data_format}
-
static
LanguageProcessing.
simple_create
(file_id, fields, **kwargs) → cotk.dataloader.dataloader.LanguageProcessing[source]¶ A simple way to create a dataloader. Instead of using
VocabContext
andFieldContext
, specify all the possible parameters here.- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id.fields (List, OrderedDict, Dict) – See initialization of
LanguageProcessing
for explanation.
Tokenizer, Vocabulary, and Field¶
-
LanguageProcessing.
fields
¶ This instance attribute shows fields of the dataloader (See the initialization of
LanguageProcessing
). For example, the fields can be printed as follows:
{
    'train': OrderedDict([('sent', <cotk.dataloader.field.SentenceDefault object at 0x000001E170F8B588>)]),
    'dev': OrderedDict([('sent', <cotk.dataloader.field.SentenceDefault object at 0x000001E170F8BB48>)]),
    'test': OrderedDict([('sent', <cotk.dataloader.field.SentenceDefault object at 0x000001E170F8BEC8>)])
}
-
LanguageProcessing.
get_default_tokenizer
() → cotk.dataloader.tokenizer.Tokenizer[source]¶ Get the default
Tokenizer
in this dataloader. It can be set byset_default_field()
.
-
LanguageProcessing.
get_default_vocab
() → cotk.dataloader.vocab.Vocab[source]¶ Get the default
Vocab
in this dataloader. It can be set byset_default_field()
.
-
LanguageProcessing.
get_default_field
() → cotk.dataloader.field.Field[source]¶ Get the default
Field
in this dataloader. It can be set byset_default_field()
.
-
LanguageProcessing.
set_default_field
(set_name, field_name)[source]¶ Set the default
Field
in this dataloader. In the meanwhile, the defaultVocab
andTokenizer
is also set according to the field (if the field have vocab and tokenizer).The default field will affect the action in the following methods:
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.field_name (str) – The name of field.
Batched Data¶
LanguageProcessing.
get_batch
(set_name, indexes) → Dict[str, Any][source]¶Get a batch of data with specified
indexes
. Return a merged dict containing all the data from each field by callingfield.get_batch()
. See examples in subclasses for the return value of predefined tasks.
get_next_batch()
,get_batches()
,get_all_batch()
provide other methods to get batched data. Their return values are consistent with this method.
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.indexes (list) – a list of specified indexes of batched data.
LanguageProcessing.
restart
(set_name, batch_size=None, shuffle=True)[source]¶Initialize batches. This function must be called before
get_next_batch()
or after an epoch ends. See get_next_batch()
for examples.
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
batch_size (int) – the number of samples in a batch. Default: if
None
, lastbatch_size
is used.shuffle (bool) – whether to shuffle the data. Default:
True
.
LanguageProcessing.
get_next_batch
(set_name, ignore_left_samples=False) → Optional[Dict[str, Any]][source]¶Get next batch. It can be called only after Initializing batches (
restart()
). Return a dict likeget_batch()
, or None if the epoch has ended.
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
ignore_left_samples (bool) – If the number of samples is not divisible by
batch_size
, ignore the leftover samples (fewer than batch_size).
Setting it to True
ensures that every batch has the same number of samples. Default: False
.Examples
>>> dataloader.restart("train") >>> while True: >>> data = dataloader.get_next_batch("train") >>> if data: >>> break >>> print(data)
LanguageProcessing.
get_batches
(set_name, batch_size=None, shuffle=True, ignore_left_samples=False) → Iterable[Dict[str, Any]][source]¶An iterable generator over batches. It first calls
restart()
, and thenget_next_batch()
until no more data is available. Returns an iterable generator where each element is likeget_batch()
.
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.batch_size (int, optional) – default:
None
. Usebatch_size
by default.shuffle (bool) – whether to shuffle the data. Default:
True
ignore_left_samples (bool) – If the number of samples is not divisible by
batch_size
, ignore the leftover samples (fewer than batch_size).
Setting it to True
ensures that every batch has the same number of samples. Default: False
.
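A typical epoch of training can then be written as the following sketch, where train_step is a hypothetical user-defined function:
>>> for batch in dataloader.get_batches("train", batch_size=32, shuffle=True):
>>>     loss = train_step(batch)   # hypothetical training step consuming the batched dict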
LanguageProcessing.
get_all_batch
(set_name) → Dict[str, List[Any]][source]¶Concatenate all batches to a single dict, where padding will not be applied.
Returns a dict like
get_batch()
with all validindexes
, but the sentences are not padded and their type will be converted to list. Specifically, this function calls get_batch()
where len(indexes)==1
multiple times and concatenates all the values in the returned dicts.
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.
Sentences and Manipulations¶
-
LanguageProcessing.
tokenize
(sentence) → List[str]¶ Tokenize
sentence
.It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.Convert tokens to lower case if
sentence.convert_to_lower_letter
isTrue
.
- Parameters
sentence (str) – The sentence to be tokenized.
-
LanguageProcessing.
tokenize_sentences
(sentences) → List[List[str]]¶ Tokenize
sentences
.It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.Convert tokens to lower case if
sentence.convert_to_lower_letter
isTrue
.
- Parameters
sentences (List[str]) – The list of sentences to be tokenized.
-
LanguageProcessing.
convert_tokens_to_ids
(tokens, add_special=False, only_frequent_word=False) → List[int]¶ Convert list of tokens to list of ids. It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.- Parameters
tokens (List[str]) – The tokens to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
LanguageProcessing.
convert_ids_to_tokens
(ids, remove_special=True, trim=True) → List[str]¶ Convert list of ids to list of tokens. It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
LanguageProcessing.
convert_ids_to_sentence
(ids, remove_special=True, trim=True) → str¶ Convert list of tokens to a sentence. It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
LanguageProcessing.
convert_sentence_to_ids
(sentence, add_special=False, only_frequent_word=False) → List[int]¶ Convert a sentence to a list of ids. It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.- Parameters
sentence (str) – The sentence to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
LanguageProcessing.
add_special_to_ids
(ids) → List[int]¶ Add special tokens, such as
go_id
oreos_id
to the inputids
. It calls the identical method of theSentence
instancesentence
, fromget_default_field()
.- Parameters
ids (List[int]) – The input ids.
-
LanguageProcessing.
remove_special_in_ids
(ids, remove_special=True, trim=True) → List[int]¶ Remove special ids in input ids. It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.- Parameters
ids (List[int]) – Input ids.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
LanguageProcessing.
process_sentences
(sentences, add_special=True, only_frequent_word=False, cut=True) → List[List[int]]¶ Process input sentences.
It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.If sentences haven’t been tokenized, tokenize them by invoking
Sentence.tokenize_sentences()
.Then, convert the list of tokens to a list of ids.
If
sentence.max_sent_length
is notNone
andcut
isTrue
, sentences whose length is more than sentence.max_sent_length
, are shortened to the first sentence.max_sent_length
tokens.
- Parameters
sentences (List[str], List[List[str]]) – sentences can be a list of sentences or a list of lists of tokens.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:True
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.cut (bool, optional) – Whether to cut sentences with too many tokens. Default:
True
.
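A brief example, reusing the vocabulary from the examples above (the ids are illustrative):
>>> dataloader.process_sentences(["how are you", "hello"])
[[2, 4, 5, 6, 3], [2, 7, 3]]   # <go> how are you <eos> / <go> hello <eos>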
-
LanguageProcessing.
trim_in_ids
(ids) → List[int]¶ Find the first special token indicating the sentence is over and remove all the tokens after it (included). Then remove all trailing
pad
. It calls the identical method of theSentence
instancesentence
, fromget_default_field()
.- Parameters
ids (List[int]) – The input ids.
Vocabulary¶
-
LanguageProcessing.
frequent_vocab_size
¶ int – The number of frequent words. It calls the identical method of the
Vocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
all_vocab_size
¶ int – The number of frequent words and rare words. It calls the identical method of the
Vocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
frequent_vocab_list
¶ list – The list of frequent words. It calls the identical method of the
Vocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
all_vocab_list
¶ list – The list of frequent words and rare words. Frequent words are always in the front of the list. It calls the identical method of the
Vocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
get_special_tokens_mapping
() → MutableMapping[str, str]¶ Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
get_special_tokens_id
(name) → int¶ Get id of special token specifying the general name. Raise
KeyError
if no such token in this instance. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.- Parameters
name (str) – the general name, must be one of the following,
pad
,unk
,go
,eos
,sep
,cls
,mask
.
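For example (the concrete mapping depends on the vocabulary; the values below are illustrative):
>>> dataloader.get_special_tokens_mapping()
OrderedDict([('pad', '<pad>'), ('unk', '<unk>'), ('go', '<go>'), ('eos', '<eos>')])
>>> dataloader.get_special_tokens_id("unk")
1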
-
LanguageProcessing.
pad_id
¶ int – The id of pad token. Raise
KeyError
if no pad token in this instance. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
unk_id
¶ int – The id of unk token. Raise
KeyError
if no unk token in this instance. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
go_id
¶ int – The id of go token. Raise
KeyError
if no go token in this instance. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
eos_id
¶ int – The id of eos token. Raise
KeyError
if no eos token in this instance. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.
Hash¶
-
LanguageProcessing.
get_general_hash
() → str[source]¶ General hash. Identifies all details of the dataloader, including the raw data before processing, the tokenized data, the vocabulary, and the settings.
See dataloader hash for explanation.
-
LanguageProcessing.
get_raw_data_hash
() → str[source]¶ Raw data hash. Identifies the raw data before processing.
See dataloader hash for explanation.
-
LanguageProcessing.
get_data_hash
() → str[source]¶ Data hash. Identifies the data after processing (tokenization).
See dataloader hash for explanation.
-
LanguageProcessing.
get_vocab_hash
() → str[source]¶ Vocab hash. Identifies the vocabulary.
See dataloader hash for explanation.
-
LanguageProcessing.
get_setting_hash
() → str[source]¶ Setting hash. Identifies the settings used to create the dataloader.
See dataloader hash for explanation.
LanguageGeneration¶
-
class
cotk.dataloader.
LanguageGeneration
(file_id, *, tokenizer=None, max_sent_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, pretrained=None)[source]¶ Bases:
dataloader.LanguageProcessing
This class is designed for language modeling tasks or language generation tasks without any inputs.
- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id.tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
.convert_to_lower_letter (bool, optional) – Whether convert all the tokens to lower case after tokenization. Default:
False
.min_frequent_vocab_times (int, optional) – Tokens from training data appeared no less than
min_frequent_vocab_times
will be regarded as frequent words. Default:0
min_rare_vocab_times (int, optional) – Tokens from training data or test data appeared more than
min_rare_vocab_times
will be regarded as rare words (frequent word excluded). Default:0
pretrained (str, optional) – Use pretrained field instead of
SentenceDefault
. Default: IfNone
, no pretrained field used.
-
get_batch
(set_name, indexes) → Dict[str, Any][source]¶ Get a batch of data with specified
indexes
. Returns a dict at least contains:sent_length (
numpy.ndarray
): A 1-d array, the length of sentence in each batch. Size:[batch_size]
sent (
numpy.ndarray
): A 2-d padding array containing id of tokens. Only provide frequent tokens.unk_id
will be used for a rare token. Size:[batch_size, max(sent_length)]
sent_allvocabs (
numpy.ndarray
): A 2-d padding array containing id of tokens. Provide both frequent and rare tokens. Size:[batch_size, max(sent_length)]
sent_str (
List[str]
): A list containing raw sentences before tokenizing, converting to ids, or padding. Do not contain any special tokens. Size:[batch_size]
get_next_batch()
,get_batches()
,get_all_batch()
provide other methods to get batched data. Their return values are consistent with this method.- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.indexes (list) – a list of specified indexes of batched data.
Examples
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", >>> # "hello", "i", "am", "fine"] >>> # frequent_vocab_size = 9 >>> # frequent_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", "hello", "i"] >>> dataloader.get_batch('train', [0, 1, 2]) { "sent": numpy.array([ [2, 4, 5, 6, 3, 0], # first sentence: <go> how are you <eos> <pad> [2, 7, 3, 0, 0, 0], # second sentence: <go> hello <eos> <pad> <pad> <pad> [2, 7, 8, 1, 1, 3] # third sentence: <go> hello i <unk> <unk> <eos> ]), "sent_length": numpy.array([5, 3, 6]), # length of sentences "sent_allvocabs": numpy.array([ [2, 4, 5, 6, 3, 0], # first sentence: <go> how are you <eos> <pad> [2, 7, 3, 0, 0, 0], # second sentence: <go> hello <eos> <pad> <pad> <pad> [2, 7, 8, 9, 10, 3] # third sentence: <go> hello i am fine <eos> ]), "sent_str": [ "how are you", "hello", "hello i am fine" ], }
-
get_teacher_forcing_metric
(gen_log_prob_key='gen_log_prob') → cotk.metric.metric.MetricChain[source]¶ Get metrics for teacher-forcing. In other words, this function provides metrics for the language modelling task.
It contains:
See the above class for details of arguments.
- Parameters
gen_log_prob_key (str, optional) – The key of predicted log probability over words. Default:
gen_log_prob
.
-
get_inference_metric
(gen_key='gen', sample_in_bleu=1000, sample_in_ngram_perplexity=10000, seed=1229, cpu_count=None) → cotk.metric.metric.MetricChain[source]¶ Get metrics for inference. In other words, this function provides metrics for language generation tasks.
It contains:
See the above class for details of arguments.
- Parameters
gen_key (str, optional) – The key of generated sentences. Default:
gen
.sample_in_bleu (int, optional) – Number of examples sampled from the generated sentences. Default:
1000
.sample_in_ngram_perplexity (int, optional) – Number of examples sampled from the generated sentences. Default:
10000
.seed (int, optional) – Random seed for sampling. Default:
1229
.cpu_count (int, optional) – Number of used cpu for multiprocessing. Multiprocessing will NOT be used when
cpu_count
is set to1
or the dataset is small. Default: IfNone
, the environment variableCPU_COUNT
will be used when available, or all available cpu will be used otherwise.
MSCOCO¶
-
class
cotk.dataloader.
MSCOCO
(file_id, *, tokenizer='nltk', max_sent_length=50, convert_to_lower_letter=False, min_frequent_vocab_times=10, min_rare_vocab_times=0, pretrained=None)[source]¶ Bases:
dataloader.LanguageGeneration
A dataloader for preprocessed MSCOCO dataset. Refer to
LanguageGeneration
andLanguageProcessing
for attributes and methods.- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id. Default:resources://MSCOCO
.tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. Default:nltk
max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default: 50
.convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
min_frequent_vocab_times (int, optional) – Tokens from training data appeared no less than
min_frequent_vocab_times
will be regarded as frequent words. Default:10
.min_rare_vocab_times (int, optional) – Tokens from training data or test data appeared more than
min_rare_vocab_times
will be regarded as rare words (frequent word excluded). Default:0
.pretrained (str, optional) – Use pretrained field instead of
SentenceDefault
. Default: IfNone
, no pretrained field used.
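A minimal usage sketch (assuming the default resources://MSCOCO resource can be downloaded in your environment):
>>> dataloader = MSCOCO("resources://MSCOCO")
>>> dataloader.restart("train", batch_size=32, shuffle=True)
>>> batch = dataloader.get_next_batch("train")   # keys: "sent", "sent_length", "sent_allvocabs", "sent_str"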
References
[1] http://images.cocodataset.org/annotations/annotations_trainval2017.zip
[2] Chen X, Fang H, Lin T Y, et al. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv:1504.00325, 2015.
SingleTurnDialog¶
-
class
cotk.dataloader.
SingleTurnDialog
(file_id, *, tokenizer=None, max_sent_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, pretrained=None)[source]¶ Bases:
dataloader.LanguageProcessing
This class is designed for sequence-to-sequence generation tasks, especially single-turn dialog tasks.
- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id.tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
.convert_to_lower_letter (bool, optional) – Whether convert all the tokens to lower case after tokenization. Default:
False
.min_frequent_vocab_times (int, optional) – Tokens from training data appeared no less than
min_frequent_vocab_times
will be regarded as frequent words. Default:0
min_rare_vocab_times (int, optional) – Tokens from training data or test data appeared more than
min_rare_vocab_times
will be regarded as rare words (frequent word excluded). Default:0
pretrained (str, optional) – Use pretrained field instead of
SentenceDefault
. Default: IfNone
, no pretrained field used.
-
get_batch
(set_name, indexes) → Dict[str, Any][source]¶ Get a batch of data with specified
indexes
. Return a dict contains:post_length (
numpy.ndarray
): A 1-d array, the length of post in each batch. Size:[batch_size]
post (
numpy.ndarray
): A 2-d padded array containing tokens of id form in posts. Only provide frequent tokens.unk_id
will be used for a rare token. Size:[batch_size, max(sent_length)]
post_allvocabs (
numpy.ndarray
): A 2-d padded array containing tokens of id form in posts. Provide both frequent and rare vocabs. Size:[batch_size, max(sent_length)]
post_str (
List[str]
): A list containing raw posts before tokenizing, converting to ids, or padding. Do not contain any special tokens. Size:[batch_size]
resp_length (
numpy.ndarray
): A 1-d array, the length of response in each batch. Size:[batch_size]
resp (
numpy.ndarray
): A 2-d padded array containing tokens of id form in responses. Only provide frequent vocabs. unk_id
will be used for a rare token. Size:[batch_size, max(sent_length)]
resp_allvocabs (
numpy.ndarray
): A 2-d padded array containing tokens of id form in responses. Provide both frequent and rare vocabs. Size: [batch_size, max(sent_length)]
resp_str (
List[str]
): A list containing raw responses before tokenizing, converting to ids, or padding. Does not contain any special tokens. Size: [batch_size]
get_next_batch()
,get_batches()
,get_all_batch()
provide other methods to get batched data. Their return values are consistent with this method.- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.indexes (list) – a list of specified indexes of batched data.
Examples
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", >>> # "hello", "i", "am", "fine"] >>> # frequent_vocab_size = 9 >>> # frequent_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", "hello", "i"] >>> dataloader.get_batch('train', [0, 1]) { "post_str": [ "are you fine", "hello", ], "post_allvocabs": numpy.array([ [2, 5, 6, 10, 3], # first post: <go> are you fine <eos> [2, 7, 3, 0, 0], # second post: <go> hello <eos> <pad> <pad> ]), "post": numpy.array([ [2, 5, 6, 1, 3], # first post: <go> are you <unk> <eos> [2, 7, 3, 0, 0], # second post: <go> hello <eos> <pad> <pad> ]), "resp_str": [ "i am fine", "hello" ], "resp_allvocabs": numpy.array([ [2, 8, 9, 10, 3], # first response: <go> i am fine <eos> [2, 7, 3, 0, 0], # second response: <go> hello <eos> <pad> <pad> ]), "resp": numpy.array([ [2, 8, 1, 1, 3], # first response: <go> i <unk> <unk> <eos> [2, 7, 3, 0, 0], # second response: <go> hello <eos> <pad> <pad> ]), "post_length": numpy.array([5, 3]), # length of posts "resp_length": numpy.array([5, 3]), # length of responses }
-
get_teacher_forcing_metric
(gen_log_prob_key='gen_log_prob', generate_rare_vocab=False) → MetricChain[source]¶ Get metrics for teacher-forcing.
It contains:
- Parameters
gen_log_prob_key (str) – The key of predicted log probability over words. Refer to
metric.PerplexityMetric
. Default:gen_log_prob
.generate_rare_vocab (bool) – Whether
gen_log_prob
contains rare vocab. Refer to
. Default:False
.
-
get_inference_metric
(gen_key='gen') → MetricChain[source]¶ Get metrics for inference.
It contains:
- Parameters
gen_key (str) – The key of generated sentences in index form. Refer to
metric.BleuCorpusMetric
ormetric.SingleTurnDialogRecorder
. Default:gen
.
OpenSubtitles¶
-
class
cotk.dataloader.
OpenSubtitles
(file_id='resources://OpenSubtitles', *, tokenizer='nltk', max_sent_length=50, convert_to_lower_letter=False, min_frequent_vocab_times=10, min_rare_vocab_times=0, pretrained=None)[source]¶ Bases:
dataloader.SingleTurnDialog
A dataloader for OpenSubtitles dataset. Refer to
SingleTurnDialog
,LanguageProcessing
for attributes and methods.- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id. Default:resources://OpenSubtitles
.tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. Default:nltk
max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:50
.convert_to_lower_letter (bool, optional) – Whether convert all the tokens to lower case after tokenization. Default:
False
min_frequent_vocab_times (int, optional) – Tokens from training data appeared no less than
min_frequent_vocab_times
will be regarded as frequent words. Default:10
.min_rare_vocab_times (int, optional) – Tokens from training data or test data appeared more than
min_rare_vocab_times
will be regarded as rare words (frequent word excluded). Default:0
pretrained (str, optional) – Use pretrained field instead of
SentenceDefault
. Default: IfNone
, no pretrained field used.
References
[1] http://opus.nlpl.eu/OpenSubtitles.php
[2] P. Lison and J. Tiedemann, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. LREC 2016.
MultiTurnDialog¶
-
class
cotk.dataloader.
MultiTurnDialog
(file_id, tokenizer=None, max_sent_length=None, max_turn_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, fields=None, pretrained=None)[source]¶ Base class for multi-turn dialog datasets. This is an abstract class.
Arguments:
Attributes:
Notes
A
Session
field must be set as default field. When invoking__init__()
ofMultiTurnDialog
, the default field, which may be reset in subclass, is set as self.fields[‘train’][‘session’].-
get_batch
(set_name, indexes) → Dict[str, Any]¶ Get a batch of data with specified
indexes
. Return a merged dict containing all the data from each field by callingfield.get_batch()
. See examples in subclasses for the return value of predefined tasks.get_next_batch()
,get_batches()
,get_all_batch()
provide other methods to get batched data. Their return values are consistent with this method.- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.indexes (list) – a list of specified indexes of batched data.
-
get_teacher_forcing_metric
(multi_turn_gen_log_prob_key='multi_turn_gen_log_prob')[source]¶ Get metric for teacher-forcing.
It contains:
- Parameters
multi_turn_gen_log_prob_key (str) – The key of predicted log probability over words. Refer to
metric.MultiTurnPerplexityMetric
. Default: multi_turn_gen_log_prob
.- Returns
A
metric.MetricChain
object.
-
get_inference_metric
(multi_turn_gen_key='multi_turn_gen')[source]¶ Get metric for inference.
It contains:
- Parameters
multi_turn_gen_key (str) – The key of generated sentences in index form. Refer to
metric.BleuCorpusMetric
ormetric.MultiTurnDialogRecorder
. Default: multi_turn_gen
.- Returns
A
metric.MetricChain
object.
-
UbuntuCorpus¶
-
class
cotk.dataloader.
UbuntuCorpus
(file_id='resources://Ubuntu', min_frequent_vocab_times=10, max_sent_length=50, max_turn_length=20, min_rare_vocab_times=0, tokenizer='nltk', pretrained=None)[source]¶ A dataloader for Ubuntu dataset.
- Parameters
file_id (str) – a str indicating the source of the UbuntuCorpus dataset. Default:
resources://Ubuntu
. A preset dataset is downloaded and cached.min_frequent_vocab_times (int) – A cut-off threshold of frequent tokens. All tokens that appear not less than min_frequent_vocab_times times in the training set will be marked as frequent words. Default:
10
.max_sent_length (int) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. Default:50
.max_turn_length (int) – All sessions longer than
max_turn_length
will be shortened to firstmax_turn_length
sentences. Default:20
.min_rare_vocab_times (int) – A cut-off threshold of rare tokens. All tokens that appear not less than
min_rare_vocab_times
times in the whole dataset (except frequent words) will be marked as rare words. Otherwise, they are unknown words, both in training and testing stages. Default:0
(No unknown words).
Refer to
MultiTurnDialog
for attributes and methods.References
[1] https://github.com/rkadlec/ubuntu-ranking-dataset-creator
[2] Lowe R, Pow N, Serban I, et al. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. SIGDIAL 2015.
SwitchboardCorpus¶
-
class
cotk.dataloader.
SwitchboardCorpus
(file_id='resources://SwitchboardCorpus', min_frequent_vocab_times=5, max_sent_length=50, max_turn_length=1000, min_rare_vocab_times=0, tokenizer='nltk', pretrained=None)[source]¶ A dataloader for Switchboard dataset.
In this dataset, all sessions start with a
<d>
representing empty context.- Parameters
file_id (str) – a string indicating the source of SwitchboardCorpus dataset. Default:
resources://SwitchboardCorpus
. A preset dataset is downloaded and cached.
Refer to
MultiTurnDialog
for attributes and methods.References
[1] https://catalog.ldc.upenn.edu/LDC97S62
[2] John J G and Edward H. Switchboard-1 release 2. Linguistic Data Consortium, Philadelphia 1997.
SentenceClassification¶
-
class
cotk.dataloader.
SentenceClassification
(file_id, tokenizer=None, max_sent_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, fields=None, pretrained=None)[source]¶ Base class for sentence classification datasets. This is an abstract class.
Arguments:
Notes
A
Sentence
field must be set as default field. When invoking__init__()
ofSentenceClassification
, the default field, which may be reset in subclass, is set as self.fields[‘train’][‘sent’].-
get_batch
(set_name, indexes)[source]¶ Get a batch of specified indexes.
- Parameters
set_name (str) – must be contained in key_name
indexes (list) – a list of specified indexes
- Returns
(dict) –
A dict at least contains:
sent_length(
numpy.array
): A 1-d array, the length of sentence in each batch. Size: [batch_size]sent(
numpy.array
): A 2-d padding array containing id of words. Only provide valid words. unk_id will be used if a word is not valid. Size: [batch_size, max(sent_length)]label(
numpy.array
): A 1-d array, the label of sentence in each batch.sent_allvocabs(
numpy.array
): A 2-d padding array containing id of words. Provide both valid and invalid words. Size: [batch_size, max(sent_length)]
Examples
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", >>> # "hello", "i", "am", "fine"] >>> # vocab_size = 9 >>> # vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", "hello", "i"] >>> dataloader.get_batch('train', [0, 1, 2]) { "sent": numpy.array([ [2, 4, 5, 6, 3, 0], # first sentence: <go> how are you <eos> <pad> [2, 7, 3, 0, 0, 0], # second sentence: <go> hello <eos> <pad> <pad> <pad> [2, 7, 8, 1, 1, 3] # third sentence: <go> hello i <unk> <unk> <eos> ]), "label": numpy.array([1, 2, 0]) # label of sentences "sent_length": numpy.array([5, 3, 6]), # length of sentences "sent_allvocabs": numpy.array([ [2, 4, 5, 6, 3, 0], # first sentence: <go> how are you <eos> <pad> [2, 7, 3, 0, 0, 0], # second sentence: <go> hello <eos> <pad> <pad> <pad> [2, 7, 8, 9, 10, 3] # third sentence: <go> hello i am fine <eos> ]), }
-
get_metric
(prediction_key='prediction')[source]¶ Get metrics for accuracy. In other words, this function provides metrics for sentence classification task.
It contains:
metric.AccuracyMetric
- Parameters
prediction_key (str) – The key of prediction over sentences. Refer to
metric.AccuracyMetric
. Default:prediction
.- Returns
A
metric.MetricChain
object.
-
SST¶
-
class
cotk.dataloader.
SST
(file_id, min_frequent_vocab_times=10, max_sent_length=50, min_rare_vocab_times=0, tokenizer='space', pretrained=None)[source]¶ A dataloader for preprocessed SST dataset.
- Parameters
file_id (str) – a str indicating the source of the SST dataset.
min_frequent_vocab_times (int) – A cut-off threshold of frequent tokens. All tokens that appear not less than min_frequent_vocab_times times in the training set will be marked as frequent words. Default: 10.
max_sent_length (int) – All sentences longer than max_sent_length will be shortened to first max_sent_length tokens. Default: 50.
min_rare_vocab_times (int) – A cut-off threshold of rare tokens. All tokens that appear not less than min_rare_vocab_times times in the whole dataset (except frequent words) will be marked as rare words. Otherwise, they are unknown words, both in training and testing stages. Default: 0 (No unknown words).
Refer to
SentenceClassification
for attributes and methods.References
[1] https://nlp.stanford.edu/sentiment/
[2] Socher R, Perelygin A, Wu J, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013.
Field¶
-
class
cotk.dataloader.
Field
[source]¶ A base class of data field, which specifies the format of the dataset. See Field and building a dataloader for a customized task for usages.
Notice
Field
object may be shared between different data fields, data sets, or dataloaders. Thus it only defines settings and does NOT store data.-
classmethod
get_all_subclasses
() → Iterable[Any]¶ Return a generator of all subclasses.
-
classmethod
load_class
(class_name) → Any¶ Return a subclass of
class_name
, case insensitively.- Parameters
class_name (str) – target class name.
-
get_vocab
() → Optional[cotk.dataloader.vocab.Vocab][source]¶ Get
Vocab
object for the field.None
if the field does not have a Vocab
.
-
get_tokenizer
() → Optional[cotk.dataloader.tokenizer.Tokenizer][source]¶ Get
Tokenizer
object for the field.None
if the field does not have a Tokenizer
.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Sentence¶
-
class
cotk.dataloader.
Sentence
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]¶ Bases:
dataloader.Field
A field for sentence. This class is a virtual class and the base of
SentenceDefault
,SentenceGPT2
andSentenceBERT
.If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set is used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
.convert_to_lower_letter (bool, optional) – Whether convert all the tokens to lower case after tokenization. Default:
False
.
- Input Formats
This field reads one line of sentence per sample.
-
tokenize
(sentence) → List[str][source]¶ Tokenize
sentence
.Convert tokens to lower case if
self.convert_to_lower_letter
isTrue
.
- Parameters
sentence (str) – The sentence to be tokenized.
-
tokenize_sentences
(sentences) → List[List[str]][source]¶ Tokenize
sentences
.Convert tokens to lower case if
self.convert_to_lower_letter
isTrue
.
- Parameters
sentences (List[str]) – The list of sentences to be tokenized.
-
convert_tokens_to_ids
(tokens, add_special=False, only_frequent_word=False) → List[int][source]¶ Convert list of tokens to list of ids.
- Parameters
tokens (List[str]) – The tokens to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
convert_ids_to_tokens
(ids, remove_special=True, trim=True) → List[str][source]¶ Convert list of ids to list of tokens.
- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
convert_sentence_to_ids
(sentence, add_special=False, only_frequent_word=False) → List[int][source]¶ Convert a sentence to a list of ids.
- Parameters
sentence (str) – The sentence to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
convert_ids_to_sentence
(ids, remove_special=True, trim=True) → str[source]¶ Convert list of tokens to a sentence.
- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
add_special_to_ids
(ids) → List[int][source]¶ Add special tokens, such as
go_id
oreos_id
to the inputids
.- Parameters
ids (List[int]) – The input ids.
-
remove_special_in_ids
(ids, remove_special=True, trim=True) → List[int][source]¶ Remove special ids in input ids.
- Parameters
ids (List[int]) – Input ids.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
trim_in_ids
(ids) → List[int][source]¶ Find the first special token indicating the sentence is over and remove all the tokens after it (included). Then remove all trailing
pad
.- Parameters
ids (List[int]) – The input ids.
-
process_sentences
(sentences, add_special=True, only_frequent_word=False, cut=True) → List[List[int]][source]¶ Process input sentences.
If sentences haven’t been tokenized, tokenize them by invoking
Sentence.tokenize_sentences()
.Then, convert the list of tokens to a list of ids.
If
self.max_sent_length
is notNone
andcut
isTrue
, sentences whose length is more than self.max_sent_length
, are shortened to the first self.max_sent_length
tokens.
- Parameters
sentences (List[str], List[List[str]]) – sentences can be a list of sentences or a list of lists of tokens.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:True
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.cut (bool, optional) – Whether to cut sentences with too many tokens. Default:
True
.
-
frequent_vocab_size
¶ int – The number of frequent words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
all_vocab_size
¶ int – The number of frequent words and rare words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
frequent_vocab_list
¶ list – The list of frequent words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
all_vocab_list
¶ list – The list of frequent words and rare words. Frequent words are always in the front of the list. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
get_special_tokens_mapping
() → MutableMapping[str, str]¶ Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
-
get_special_tokens_id
(name) → int¶ Get the id of a special token by specifying its general name. Raise
KeyError
if no such token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.- Parameters
name (str) – the general name, must be one of the following,
pad
,unk
,go
,eos
,sep
,cls
,mask
.
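Example (a sketch for a field whose vocabulary is a GeneralVocab with the default special tokens):
>>> field.get_special_tokens_mapping()
OrderedDict([('pad', '<pad>'), ('unk', '<unk>'), ('go', '<go>'), ('eos', '<eos>')])
>>> field.get_special_tokens_id("eos")
3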
-
pad_id
¶ int – The id of pad token. Raise
KeyError
if no pad token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
-
unk_id
¶ int – The id of unk token. Raise
KeyError
if no unk token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
SentenceDefault¶
-
class
cotk.dataloader.
SentenceDefault
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]¶ Bases:
dataloader.Sentence
,dataloader.Field
A commonly used field for sentences.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Formats
This field reads one line (a single sentence) per sample.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.The function will return a dict, containing:
FIELDNAME
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It only contains frequent vocabs, and rare words are replaced byunk_id
.FIELDNAME_allvocabs
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It contains frequent vocabs and rare vocabs.FIELDNAME_length
(np.ndarray[batch_size]
): The length of sentences.FIELDNAME_str
(List[str]
): The raw sentences.
where
FIELDNAME
is the name of the field.batch_size
islen(indexes)
.max_sent_length_in_batch
is the maximum length of sentences in the batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "Life", "is", "short", ".", >>> # "PHP", "the", "best", "language", "in", "world"] >>> # frequent_vocab_size = 11 >>> # frequent_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "Life", "is", "short", ".", >>> # "PHP", "the", "best"] >>> field.get_batch('sent', data, [0, 1]) { "sent": numpy.array([ [2, 4, 5, 6, 7, 3, 0, 0, 0, 0, 0], # <go> Life is short . <eos> <pad> <pad> <pad> <pad> <pad> [2, 8, 5, 9, 10, 1, 1, 9, 1, 7, 3], # <go> PHP is the best <unk> <unk> the <unk> . <eos> ]), "sent_length": numpy.array([6, 11]), # length of sentences "sent_allvocabs": numpy.array([ [2, 4, 5, 6, 7, 3, 0, 0, 0, 0, 0], # <go> Life is short . <eos> <pad> <pad> <pad> <pad> <pad> [2, 8, 5, 9, 10, 11, 12, 9, 13, 7, 3], # <go> PHP is the best language in the world . <eos> ]), "sent_str": [ "Life is short.", "PHP is the best language in the world.", ], }
SentenceGPT2¶
-
class
cotk.dataloader.
SentenceGPT2
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]¶ Bases:
dataloader.Sentence
,dataloader.Field
A field for sentence in the format of GPT2.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Formats
This field reads one line (a single sentence) per sample.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.The function will return a dict, containing:
FIELDNAME
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It only contains frequent vocabs, and rare words are replaced byunk_id
.FIELDNAME_allvocabs
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It contains frequent vocabs and rare vocabs.FIELDNAME_length
(np.ndarray[batch_size]
): The length of sentences.FIELDNAME_str
(List[str]
): The raw sentences.
where
FIELDNAME
is the name of the field.batch_size
islen(indexes)
.max_sent_length_in_batch
is the maximum length of sentences in the batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # This example is based on GPT2Tokenizer. The vocab files are in ./tests/dummy_gpt2vocab. >>> # field.eos_id = 413 # <|endoftext|>, also used for <pad>, <unk>, <go> >>> field.get_batch('sent', data, [0, 2]) { "sent": numpy.array([ [413, 6, 134, 321, 407, 107, 157, 121, 372, 201, 402, 105, 413, 413, 413, 413], # ['<|endoftext|>', 'A', 'Ġbicycle', 'Ġreplica', 'Ġwith', 'Ġa', 'Ġclock', 'Ġas', 'Ġthe', # 'Ġfront', 'Ġwheel', 'Ġ.', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>'] [413, 6, 149, 370, 330, 384, 126, 298, 236, 130, 107, 255, 298, 149, 105, 413], # ['<|endoftext|>', 'A', 'Ġcar', 'Ġthat', 'Ġseems', 'Ġto', 'Ġbe', 'Ġparked', 'Ġillegally', # 'Ġbehind', 'Ġa', 'Ġlegally', 'Ġparked', 'Ġcar', 'Ġ.', '<|endoftext|>'] ]), "sent_length": numpy.array([13, 16]), # length of sentences "sent_allvocabs": numpy.array([ [413, 6, 134, 321, 407, 107, 157, 121, 372, 201, 402, 105, 413, 413, 413, 413], # ['<|endoftext|>', 'A', 'Ġbicycle', 'Ġreplica', 'Ġwith', 'Ġa', 'Ġclock', 'Ġas', 'Ġthe', # 'Ġfront', 'Ġwheel', 'Ġ.', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>'] [413, 6, 149, 370, 330, 384, 126, 298, 236, 130, 107, 255, 298, 149, 105, 413], # ['<|endoftext|>', 'A', 'Ġcar', 'Ġthat', 'Ġseems', 'Ġto', 'Ġbe', 'Ġparked', 'Ġillegally', # 'Ġbehind', 'Ġa', 'Ġlegally', 'Ġparked', 'Ġcar', 'Ġ.', '<|endoftext|>'] ]), "sent_str": [ "A bicycle replica with a clock as the front wheel .", "A car that seems to be parked illegally behind a legally parked car .", ], }
SentenceBERT¶
-
class
cotk.dataloader.
SentenceBERT
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]¶ Bases:
dataloader.Sentence
,dataloader.Field
A field for sentence in the format of BERT.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Formats
This field reads one line (a single sentence) per sample.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.The function will return a dict, containing:
FIELDNAME
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It only contains frequent vocabs, and rare words are replaced byunk_id
.FIELDNAME_allvocabs
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It contains frequent vocabs and rare vocabs.FIELDNAME_length
(np.ndarray[batch_size]
): The length of sentences.FIELDNAME_str
(List[str]
): The raw sentences.
where
FIELDNAME
is the name of the field.batch_size
islen(indexes)
.max_sent_length_in_batch
is the maximum length of sentences in the batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # This example is based on BertTokenizer. The vocab files are in ./tests/dummy_bertvocab. >>> field.get_batch('sent', data, [0, 1]) { "sent": numpy.array([ [101, 147, 37, 29, 359, 102, 0, 0, 0, 0, 0, 0, 0], # ['<cls>', 'How', 'are', 'you', '?', '<sep>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'] [101, 375, 334, 379, 127, 341, 350, 29, 328, 9, 29, 359, 102] # ['<cls>', 'i', ''', 'm', 'fine', '.', 'thank', 'you', '!', 'and', 'you', '?', '<sep>'] ]), "sent_length": numpy.array([6, 13]), # length of sentences, "sent_allvocabs": numpy.array([ [101, 147, 37, 29, 359, 102, 0, 0, 0, 0, 0, 0, 0], # ['<cls>', 'how', 'are', 'you', '?', '<sep>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'] [101, 375, 334, 379, 127, 341, 350, 29, 328, 9, 29, 359, 102] # ['<cls>', 'i', ''', 'm', 'fine', '.', 'thank', 'you', '!', 'and', 'you', '?', '<sep>'] ]), "sent_str": [ "How are you?", "I'm fine. Thank you! And you?" ], }
Session¶
-
class
cotk.dataloader.
Session
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]¶ Bases:
dataloader.Field
A field for session. Each session is a list of sentences.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.max_turn_length (int, _InfiniteLength, optional) – Set the maximum turn length of a session. If it’s an integer, any session whose turn length is more than
max_turn_length
is shortened to the firstmax_turn_length
turns. The remaining turns are ignored. If it’sNone
orSentence.INFINITE_LENGTH
, sessions won’t be shortened and all turns are retained. Default:None
.
- Input Format
This field reads multiple lines (one sentence per line) per sample, until a blank line.
-
tokenize
(sentence) → List[str]¶ Tokenize
sentence
.Convert tokens to lower case if
self.convert_to_lower_letter
isTrue
.
- Parameters
sentence (str) – The sentence to be tokenized.
-
tokenize_sentences
(sentences) → List[List[str]]¶ Tokenize
sentences
.Convert tokens to lower case if
self.convert_to_lower_letter
isTrue
.
- Parameters
sentences (List[str]) – The list of sentences to be tokenized.
-
tokenize_sessions
(sessions) → List[List[List[str]]][source]¶ Tokenize
sessions
.Convert the tokens to lower case if
self.convert_to_lower_letter
isTrue
.
- Parameters
sessions (List[List[str]]) – The list of sessions to be tokenized.
-
convert_tokens_to_ids
(tokens, add_special=False, only_frequent_word=False) → List[int]¶ Convert list of tokens to list of ids.
- Parameters
tokens (List[str]) – The tokens to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
convert_ids_to_tokens
(ids, remove_special=True, trim=True) → List[str]¶ Convert list of ids to list of tokens.
- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
convert_sentence_to_ids
(sentence, add_special=False, only_frequent_word=False) → List[int]¶ Convert a sentence to a list of ids.
- Parameters
sentence (str) – The sentence to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
convert_ids_to_sentence
(ids, remove_special=True, trim=True) → str¶ Convert a list of ids to a sentence.
- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
convert_multi_turn_tokens_to_ids
(session, add_special=False, only_frequent_word=False) → List[List[int]][source]¶ Convert list of tokenized sentences to list of sentence ids.
- Parameters
session (List[List[str]]) – The tokenized sentences to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
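Example (a sketch; the ids assume a hypothetical vocabulary where "Life", "is", "short", ".", "PHP", "the", "best" map to 4–10 and go_id = 2, eos_id = 3):
>>> field.convert_multi_turn_tokens_to_ids(
...     [["Life", "is", "short", "."], ["PHP", "is", "the", "best", "."]],
...     add_special=True)
[[2, 4, 5, 6, 7, 3], [2, 8, 5, 9, 10, 7, 3]]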
-
convert_multi_turn_ids_to_tokens
(session_ids, remove_special=True, trim=True)[source]¶ Convert a list of sentence ids to a list of tokenized sentences.
- Parameters
session_ids (List[List[int]]) – The sentence ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
add_special_to_ids
(ids) → List[int]¶ Add special tokens, such as
go_id
oreos_id
to the inputids
.- Parameters
ids (List[int]) – The input ids.
-
remove_special_in_ids
(ids, remove_special=True, trim=True) → List[int]¶ Remove special ids in input ids.
- Parameters
ids (List[int]) – Input ids.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
trim_in_ids
(ids) → List[int]¶ Find the first special token indicating the sentence is over, then remove that token and all tokens after it. Finally, remove all trailing
pad
.- Parameters
ids (List[int]) – The input ids.
-
multi_turn_trim_in_ids
(session_ids) → List[List[int]][source]¶ For each sentence's ids in the session, find the first special token indicating the sentence is over, then remove that token and all tokens after it. Finally, remove all trailing
pad
.- Parameters
session_ids (List[List[int]]) – The input ids of session.
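Example (a sketch assuming eos_id = 3 and pad_id = 0; every sentence in the session is trimmed independently):
>>> field.multi_turn_trim_in_ids([[2, 4, 5, 3, 0, 0], [2, 8, 9, 3]])
[[2, 4, 5], [2, 8, 9]]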
-
process_sentences
(sentences, add_special=True, only_frequent_word=False, cut=True) → List[List[int]]¶ Process input sentences.
If sentences haven’t been tokenized, tokenize them by invoking
Sentence.tokenize_sentences()
.Then, convert the list of tokens to a list of ids.
If
self.max_sent_length
is notNone
andcut
isTrue
, sentences whose length is more thanself.max_sent_length
are shortened to the firstself.max_sent_length
tokens.
- Parameters
sentences (List[str], List[List[str]]) – sentences can be a list of sentences or a list of lists of tokens.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:True
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.cut (bool, optional) – Whether to cut sentences with too many tokens. Default:
True
.
-
process_sessions
(sessions, add_special=True, only_frequent_word=False, cut=True)[source]¶ Process input sessions.
If
self.max_turn_length
is notNone
andcut
isTrue
, sessions whose number of turns is more thanself.max_turn_length
are shortened to the firstself.max_turn_length
sentences. If sessions haven’t been tokenized, tokenize them by invoking
self.tokenize_sessions()
. Then, convert the list of tokens to a list of ids.
If
self.max_sent_length
is notNone
andcut
isTrue
, sentences whose length is more thanself.max_sent_length
are shortened to the firstself.max_sent_length
tokens.
- Parameters
sessions (List[List[str]], List[List[List[str]]]) – sentences in a session can be a str or a list of tokens.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:True
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.cut (bool, optional) – Whether to cut sessions/sentences with too many sentences/tokens. Default:
True
.
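Example (a sketch with untokenized input, reusing the hypothetical vocabulary of the examples above; as with process_sentences(), the result is not padded):
>>> field.process_sessions([["Life is short .", "PHP is the best ."]])
[[[2, 4, 5, 6, 7, 3], [2, 8, 5, 9, 10, 7, 3]]]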
-
frequent_vocab_size
¶ int – The number of frequent words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
all_vocab_size
¶ int – The number of frequent words and rare words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
frequent_vocab_list
¶ list – The list of frequent words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
all_vocab_list
¶ list – The list of frequent words and rare words. Frequent words are always in the front of the list. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
get_special_tokens_mapping
() → MutableMapping[str, str]¶ Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
-
get_special_tokens_id
(name) → int¶ Get the id of a special token by specifying its general name. Raise
KeyError
if no such token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.- Parameters
name (str) – the general name, must be one of the following,
pad
,unk
,go
,eos
,sep
,cls
,mask
.
-
pad_id
¶ int – The id of pad token. Raise
KeyError
if no pad token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
-
unk_id
¶ int – The id of unk token. Raise
KeyError
if no unk token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
SessionDefault¶
-
class
cotk.dataloader.
SessionDefault
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]¶ Bases:
dataloader.Session
,dataloader.Field
A commonly used field for sessions.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Format
This field reads multiple lines (one sentence per line) per sample, until a blank line.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ - Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.
The function will return a dict, containing:
FIELDNAME
(np.ndarray[batch_size, max_turn_length_in_batch, max_sent_length_in_batch]
): Padded sessions in id formats. It only contains frequent vocabs, and rare words are replaced byunk_id
.FIELDNAME_allvocabs
(np.ndarray[batch_size, max_turn_length_in_batch, max_sent_length_in_batch]
): Padded sessions in id formats. It contains frequent vocabs and rare vocabs.FIELDNAME_turn_length
(np.ndarray[batch_size]
): The turn numbers of sessions.FIELDNAME_sent_length
(List[List[int]]
): The length of sentences of sessions.FIELDNAME_str
(List[str]
): The raw sessions.
where
FIELDNAME
is the name of the field.batch_size
islen(indexes)
.max_turn_length_in_batch
is the maximum turn number of sessions in the batch.max_sent_length_in_batch
is the maximum length of sentences in the batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # dataset = iter(['How are you?\n', "I'm fine. And you?\n", "I'm fine, too.\n", "\n", >>> # "How to install cotk?\n", "pip install cotk.\n", "\n"]) >>> # min_frequent_vocab_times = 2 >>> # all_vocab_list = ['<pad>', '<unk>', '<go>', '<eos>', '.', '?', "'", 'How', 'I', >>> # 'cotk', 'fine', 'install', 'm', 'you', ',', 'And', 'are', 'pip', 'to', 'too'] >>> # frequent_vocab_size = 14 >>> # frequent_vocab_list = ['<pad>', '<unk>', '<go>', '<eos>', '.', '?', "'", 'How', 'I', >>> # 'cotk', 'fine', 'install', 'm', 'you'] >>> # data = { >>> # 'id': [ >>> # [ >>> # [2, 7, 16, 13, 5, 3], >>> # [2, 8, 6, 12, 10, 4, 15, 13, 5, 3], >>> # [2, 8, 6, 12, 10, 14, 19, 4, 3], >>> # ], >>> # [ >>> # [2, 7, 18, 11, 9, 5, 3], >>> # [2, 17, 11, 9, 4, 3], >>> # ] >>> # ], >>> # 'str': [ >>> # [ >>> # 'How are you?', >>> # "I'm fine. And you?", >>> # "I'm fine, too." >>> # ], >>> # [ >>> # 'How to install cotk?', >>> # 'pip install cotk.' >>> # ] >>> # >>> # } >>> field.get_batch('session', data, [0, 1]) { 'session_turn_length': numpy.array([3, 2]), 'session_sent_length': [ [6, 10, 9], [7, 6] ], 'session': numpy.array([ [ [ 2, 7, 1, 13, 5, 3, 0, 0, 0, 0], # <go> How <unk> you? <eos> <pad> <pad> <pad> <pad> [ 2, 8, 6, 12, 10, 4, 1, 13, 5, 3], # <go> I'm fine. <unk> you? <eos> [ 2, 8, 6, 12, 10, 1, 1, 4, 3, 0] # <go> I'm fine <unk> <unk>. <eos> <pad> ], [ [ 2, 7, 1, 11, 9, 5, 3, 0, 0, 0], # <go> How <unk> install cotk? <eos> <pad> <pad> <pad> [ 2, 1, 11, 9, 4, 3, 0, 0, 0, 0], # <go> <unk> install cotk. <eos> <pad> <pad> <pad> <pad> [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # all <pad> ] ]), 'session_allvocabs': numpy.array([ [ [ 2, 7, 16, 13, 5, 3, 0, 0, 0, 0], # <go> How are you? <eos> <pad> <pad> <pad> <pad> [ 2, 8, 6, 12, 10, 4, 15, 13, 5, 3], # <go> I'm fine. And you? <eos> [ 2, 8, 6, 12, 10, 14, 19, 4, 3, 0] # <go> I'm fine, too. <eos> <pad> ], [ [ 2, 7, 18, 11, 9, 5, 3, 0, 0, 0], # <go> How to install cotk? <eos> <pad> <pad> <pad> [ 2, 17, 11, 9, 4, 3, 0, 0, 0, 0], # <go> pip install cotk. <eos> <pad> <pad> <pad> <pad> [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # all <pad> ] ]), 'session_str': [ [ 'How are you?', "I'm fine. And you?", "I'm fine, too." ], [ 'How to install cotk?', 'pip install cotk.' ] ] }
SessionGPT2¶
-
class
cotk.dataloader.
SessionGPT2
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]¶ Bases:
dataloader.Session
,dataloader.Field
A field for session in the format of GPT2.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Format
This field reads multiple lines (one sentence per line) per sample, until a blank line.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ - Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Note: only the structure of the return value of get_batch is shown here; the real value of each entry may depend on the loaded vocab.
Examples
>>> from transformers.tokenization_gpt2 import GPT2Tokenizer >>> from cotk.dataloader.tokenizer import PretrainedTokenizer >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2') >>> field = SessionGPT2(PretrainedTokenizer(tokenizer)) >>> field_content = field._create('train') >>> dataset = iter(['How are you?\n', "I'm fine. Thank you! And you?\n", "I'm fine, too.\n", "\n", "How to install CoTk?\n", "pip install cotk.\n", "\n"]) >>> while True: ... try: ... field_content.read_next(dataset) ... except StopIteration: ... break >>> field_content.process_before_vocab() >>> field.vocab.build_vocab() >>> data = field_content.get_data() >>> data {'id': [[[2, 8, 18, 6, 5, 3], [2, 9, 7, 12, 10, 4, 17, 6, 13, 15, 6, 5, 3], [2, 9, 7, 12, 10, 14, 22, 4, 3]], [[2, 8, 21, 11, 16, 5, 3], [2, 20, 11, 19, 4, 3]]], 'str': [['How are you?', "I'm fine. Thank you! And you?", "I'm fine, too."], ['How to install CoTk?', 'pip install cotk.']]} >>> batch_data = field.get_batch('session', data, [1]) >>> batch_data {'session_turn_length': array([2]), 'session_sent_length': [[7, 6]], 'session': array([[[ 2, 8, 21, 11, 16, 5, 3], [ 2, 20, 11, 19, 4, 3, 0]]]), 'session_allvocabs': array([[[ 2, 8, 21, 11, 16, 5, 3], [ 2, 20, 11, 19, 4, 3, 0]]]), 'session_str': [['How to install CoTk?', 'pip install cotk.']]} >>> # 'session_turn_length' (`name` + '_turn_length') is a :class:`np.ndarray` object with shape == (batch size, ). Each element is the length of corresponding sssion. >>> # 'session_sent_length' (`name` + '_sent_length') is List[List[int]]. Each integer is the length of corresponding sentence. >>> # 'session' (`name`) is a :class:`np.ndarray` object with shape == (batch size, max turn length, max sentence length). >>> # batch_data['session'][i, j] is a sentence. batch_data['session'][i, j, k] is an id. >>> # If `self.max_turn_length` is not None and j >= `self.max_turn_length` or `self.max_sent_length` is not None and k >= `self.max_sent_length`, >>> # batch_data['session'][i, j, k] is `self.eos_id`. >>> # 'session_allvocabs' (`name` + '_allvocabs') is the same with 'session'.
SessionBERT¶
-
class
cotk.dataloader.
SessionBERT
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]¶ Bases:
dataloader.Session
,dataloader.Field
A field for session in the format of BERT.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Format
This field reads multiple lines (one sentence per line) per sample, until a blank line.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ - Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Note: only the structure of the return value of get_batch is shown here; the real value of each entry may depend on the loaded vocab.
Examples
>>> from transformers.tokenization_bert import BertTokenizer >>> from cotk.dataloader.tokenizer import PretrainedTokenizer >>> tokenizer = BertTokenizer.from_pretrained('bert') >>> field = SessionBERT(PretrainedTokenizer(tokenizer)) >>> field_content = field._create('train') >>> dataset = iter(['How are you?\n', "I'm fine. Thank you! And you?\n", "I'm fine, too.\n", "\n", "How to install CoTk?\n", "pip install cotk.\n", "\n"]) >>> while True: ... try: ... field_content.read_next(dataset) ... except StopIteration: ... break >>> field_content.process_before_vocab() >>> field.vocab.build_vocab() >>> data = field_content.get_data() >>> data {'id': [[[2, 8, 18, 6, 5, 3], [2, 9, 7, 12, 10, 4, 17, 6, 13, 15, 6, 5, 3], [2, 9, 7, 12, 10, 14, 22, 4, 3]], [[2, 8, 21, 11, 16, 5, 3], [2, 20, 11, 19, 4, 3]]], 'str': [['How are you?', "I'm fine. Thank you! And you?", "I'm fine, too."], ['How to install CoTk?', 'pip install cotk.']]} >>> batch_data = field.get_batch('session', data, [1]) >>> batch_data {'session_turn_length': array([2]), 'session_sent_length': [[7, 6]], 'session': array([[[ 2, 8, 21, 11, 16, 5, 3], [ 2, 20, 11, 19, 4, 3, 0]]]), 'session_allvocabs': array([[[ 2, 8, 21, 11, 16, 5, 3], [ 2, 20, 11, 19, 4, 3, 0]]]), 'session_str': [['How to install CoTk?', 'pip install cotk.']]} >>> # 'session_turn_length' (`name` + '_turn_length') is a :class:`np.ndarray` object with shape == (batch size, ). Each element is the length of corresponding sssion. >>> # 'session_sent_length' (`name` + '_sent_length') is List[List[int]]. Each integer is the length of corresponding sentence. >>> # 'session' (`name`) is a :class:`np.ndarray` object with shape == (batch size, max turn length, max sentence length). >>> # batch_data['session'][i, j] is a sentence. batch_data['session'][i, j, k] is an id. >>> # If `self.max_turn_length` is not None and j >= `self.max_turn_length` or `self.max_sent_length` is not None and k >= `self.max_sent_length`, >>> # batch_data['session'][i, j, k] is `self.pad_id`. >>> # 'session_allvocabs' (`name` + '_allvocabs') is the same with 'session'.
DenseLabel¶
-
class
cotk.dataloader.
DenseLabel
[source]¶ Bases:
dataloader.Field
A field of categorical labels whose values are integers ranging from
0
tolabel_types - 1
.See
dataloader.SparseLabel
for labels instr
or sparse integer.- Parameters
This class does not contain arguments for initialization.
- Input Format
This field reads one line per sample. The line must be an integer.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.The function will return a dict, containing:
FIELDNAME
(np.ndarray[batch_size]
): Labels of corresponding batched data.
where
FIELDNAME
is the name of the field.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # data = {'label': [1, 0]} >>> field.get_batch('label', data, [0, 1]) { 'label': numpy.array([1, 0]) }
SparseLabel¶
-
class
cotk.dataloader.
SparseLabel
(vocab=None)[source]¶ Bases:
dataloader.Field
A field of categorical labels whose values are strings or sparse integers.
See
dataloader.DenseLabel
for labels in dense integers.If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
vocab (
SimpleVocab
, optional) – The vocab to store all the labels. IfNone
, aSimpleVocab
is automatically created.
- Input Format
This field reads one line per sample. The line can be an arbitrary string.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.The function will return a dict, containing:
FIELDNAME_id
(np.ndarray[batch_size]
): Ids of corresponding labels.FIELDNAME_str
(List[str]
): Raw labels of the batched data.
where
FIELDNAME
is the name of the field.
- Parameters
name (str) – name of the field.
data (Dict[str, Any]) – the object returned by
_SparseLabelContent.get_data()
.
data[‘str’] contains the raw labels; data[‘id’] contains the ids of the labels.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # data = { >>> # 'id': [0, 2, 1, 0], >>> # 'str': ['Java', 'Python', 'Cpp', 'Java'] >>> # } >>> field.get_batch('label', data, [0, 1]) { 'label_id': numpy.array([0, 2]), # Ids of corresponding labels. 'label_str': ['Java', 'Python'] # Raw labels. }
Tokenizer¶
-
class
cotk.dataloader.
Tokenizer
[source]¶ Tokenizer is used for splitting sentences into tokens. This is an abstract base class. It often works as a part of
Field.
-
tokenize
(sentence) → List[str][source]¶ Tokenize a sentence to a list of tokens.
- Parameters
sentence (str) – a sentence to tokenize.
-
tokenize_sentences
(sentences) → List[List[str]][source]¶ Tokenize a list of sentences to a list of lists of tokens.
- Parameters
sentences (List[str]) – sentences to tokenize.
-
tokenize_sessions
(sessions) → List[List[List[str]]][source]¶ Tokenize sessions to a 3-d list of tokens.
- Parameters
sessions (List[List[str]]) – sessions to tokenize.
-
convert_tokens_to_sentence
(tokens) → str[source]¶ Convert tokens to sentence. It usually works like the reverse operation of
tokenize()
, but this is not guaranteed. It may be like" ".join(tokens)
, but some special conditions and tokens will be taken care of.- Parameters
tokens (List[str]) – tokenized sentence
-
SimpleTokenizer¶
-
class
cotk.dataloader.
SimpleTokenizer
(method, special_tokens=None)[source]¶ Bases:
dataloader.Tokenizer
A simple tokenizer.
method
can either benltk
orspace
. Ifnltk
, useWordPunctTokenizer
fromnltk.tokenize
. Ifspace
, usestr.split(" ")
.- Parameters
method (str) – the tokenization method,
nltk
orspace
.special_tokens (List[str]) – special tokens not to tokenize, such as
<go>
.
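Example (a sketch; with method="space" the sentence is split on single spaces, while "nltk" would also separate punctuation from words):
>>> from cotk.dataloader import SimpleTokenizer
>>> tokenizer = SimpleTokenizer("space", special_tokens=["<go>", "<eos>"])
>>> tokenizer.tokenize("<go> Life is short . <eos>")
['<go>', 'Life', 'is', 'short', '.', '<eos>']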
PretrainedTokenizer¶
-
class
cotk.dataloader.
PretrainedTokenizer
(tokenizer)[source]¶ Bases:
dataloader.Tokenizer
A wrapper for
PreTrainedTokenizer
fromtransformers
package. If you don’t want to do tokenization on some special tokens, seetransformers.PreTrainedTokenizer.add_special_tokens
.- Parameters
tokenizer (transformers.PreTrainedTokenizer) – An instance of
transformers.PreTrainedTokenizer
.
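Example (a sketch following the imports used in the SessionGPT2 example above; it assumes the transformers package and the "gpt2" pretrained files are available, and the exact subword split depends on the pretrained vocabulary):
>>> from transformers.tokenization_gpt2 import GPT2Tokenizer
>>> from cotk.dataloader.tokenizer import PretrainedTokenizer
>>> tokenizer = PretrainedTokenizer(GPT2Tokenizer.from_pretrained('gpt2'))
>>> tokenizer.tokenize("Life is short .")
['Life', 'Ġis', 'Ġshort', 'Ġ.']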
Vocab¶
-
class
cotk.dataloader.
Vocab
[source]¶ A class for storing vocabulary. This is an abstract base class. It often works as a part of
Field
or is shared betweenField
.See introduction of vocabulary for more information.
- Parameters
This class does not contain arguments for initialization.
-
classmethod
get_all_subclasses
() → Iterable[Any]¶ Return a generator of all subclasses.
-
classmethod
load_class
(class_name) → Any¶ Return the subclass named
class_name
, case insensitively.- Parameters
class_name (str) – target class name.
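Example (a sketch; matching is case-insensitive):
>>> from cotk.dataloader import Vocab
>>> Vocab.load_class("GeneralVocab")
<class 'cotk.dataloader.vocab.GeneralVocab'>
>>> Vocab.load_class("generalvocab")
<class 'cotk.dataloader.vocab.GeneralVocab'>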
-
add_tokens
(tokens, vocab_from) → None[source]¶ Add tokens to this vocabulary instance; the tokens will be used for building the vocabulary list. Must be called before
build_vocab()
.- Parameters
tokens (List[str]) – A list of tokens to add to the vocabulary.
vocab_from (str) – One of
train
,test
,extra
.train
: The tokens are from the training data. Frequent vocabs are selected from tokens of this type.test
: The tokens are from the validation data or test data. Rare vocabs are selected from tokens of this type.extra
: The tokens are from extra data. The tokens of this type will not be selected as frequent or rare vocabs.
-
build_vocab
()[source]¶ Build the vocabulary list according to the tokens from
add_tokens()
.
-
convert_tokens_to_ids
(tokens, only_frequent_word=False) → List[int][source]¶ Convert list of tokens to list of ids.
- Parameters
tokens (List[str]) – List of tokens.
only_frequent_word (bool, optional) – Use
unk
for rare tokens. Defaults: False.
-
convert_ids_to_tokens
(ids) → List[str][source]¶ Convert list of ids to list of tokens.
- Parameters
ids (List[int]) – List of ids.
-
frequent_vocab_size
¶ int – The number of frequent words.
-
all_vocab_size
¶ int – The number of frequent words and rare words.
-
frequent_vocab_list
¶ list – The list of frequent words.
-
all_vocab_list
¶ list – The list of frequent words and rare words. Frequent words are always in the front of the list.
-
get_special_tokens_mapping
() → MutableMapping[str, str][source]¶ Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
.
-
get_special_tokens_id
(name) → int[source]¶ Get the id of a special token by specifying its general name. Raise
KeyError
if no such token in this instance.- Parameters
name (str) – the general name, must be one of the following,
pad
,unk
,go
,eos
,sep
,cls
,mask
.
-
pad_id
¶ int – The id of pad token. Raise
KeyError
if no pad token in this instance.
-
unk_id
¶ int – The id of unk token. Raise
KeyError
if no unk token in this instance.
-
go_id
¶ int – The id of go token. Raise
KeyError
if no go token in this instance.
-
eos_id
¶ int – The id of eos token. Raise
KeyError
if no eos token in this instance.
GeneralVocab¶
-
class
cotk.dataloader.
GeneralVocab
(min_frequent_vocab_times=None, min_rare_vocab_times=None, special_tokens_mapping=None, special_appeared_in_data=None)[source]¶ Bases:
dataloader.Vocab
A vocabulary class for general use.
This class always has the following 4 special tokens:
pad
,unk
,go
,eos
.If any argument is not specified, the value will be first retrieved from
VocabContext
. If stillNone
, default value will be used.- Parameters
min_frequent_vocab_times (int, optional) – Tokens from training data that appear no less than
min_frequent_vocab_times
times will be regarded as frequent words. Default:0
min_rare_vocab_times (int, optional) – Tokens from training data or test data that appear more than
min_rare_vocab_times
times will be regarded as rare words (frequent words excluded). Default:0
special_tokens_mapping (OrderedDict, optional) – Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It must at least containpad
,unk
,go
,eos
. The values of different special tokens cannot be the same. Default: IfNone
, it will beOrderedDict([("pad", "<pad>"), ("unk", "<unk>"), ("go", "<go>"), ("eos", "<eos>")])
.special_appeared_in_data (bool, optional) – Whether the strings of special tokens may appear in the data. Default: If not specified, it will be
False
.
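Example (a sketch of the typical build workflow; with min_frequent_vocab_times=2, only tokens appearing at least twice in the training data become frequent words, so the size below rests on that assumption):
>>> from cotk.dataloader import GeneralVocab
>>> vocab = GeneralVocab(min_frequent_vocab_times=2)
>>> vocab.add_tokens(["Life", "is", "short", ".", "Life", "is"], vocab_from="train")
>>> vocab.add_tokens(["PHP", "is", "the", "best", "."], vocab_from="test")
>>> vocab.build_vocab()
>>> vocab.frequent_vocab_size    # 4 special tokens + "Life" + "is"
6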
-
static
from_predefined
(vocab_list, frequent_vocab_size, special_tokens_mapping=None) → cotk.dataloader.vocab.GeneralVocab[source]¶ Return a
GeneralVocab
instance, whose vocabulary comes from a predefined list. Seefrom_predefined_vocab()
if you want to use the vocabulary from an existingGeneralVocab
instance.- Parameters
vocab_list (List[str]) – A list of all vocabulary.
frequent_vocab_size (int) – the number of frequent words. The frequent words must be at the front of the
vocab_list
.special_tokens_mapping (OrderedDict, optional) – Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It must at least containpad
,unk
,go
,eos
. The values of different special tokens cannot be the same. Special tokens MUST be in the front of thefrequent_vocab_list
(order sensitive). Default: IfNone
, it will beOrderedDict([("pad", "<pad>"), ("unk", "<unk>"), ("go", "<go>"), ("eos", "<eos>")])
.
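Example (a sketch; the first frequent_vocab_size entries of vocab_list are treated as frequent words and the rest as rare words):
>>> from cotk.dataloader import GeneralVocab
>>> vocab = GeneralVocab.from_predefined(
...     ["<pad>", "<unk>", "<go>", "<eos>", "Life", "is", "short", ".", "PHP"],
...     frequent_vocab_size=8)
>>> vocab.all_vocab_size
9
>>> vocab.convert_tokens_to_ids(["Life", "is", "PHP"], only_frequent_word=True)  # "PHP" is rare -> unk_id
[4, 5, 1]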
-
static
from_predefined_vocab
(vocab) → cotk.dataloader.vocab.GeneralVocab[source]¶ Return a new
GeneralVocab
instance fromvocab
. The new instance has the same vocabulary list as the old one.- Parameters
vocab (
GeneralVocab
) – The old instance.
-
static
from_frequent_word
(frequent_vocab_list, special_tokens_mapping=None) → cotk.dataloader.vocab.GeneralVocab[source]¶ Return a
GeneralVocab
instance, whose vocabulary comes from a predefined frequent word list. Its rare word list can be built later. See
if you want to use the frequent vocabulary from an existingGeneralVocab
instance.- Parameters
frequent_vocab_list (List[str]) – A list of frequent vocabulary.
special_tokens_mapping (OrderedDict, optional) – Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It must at least containpad
,unk
,go
,eos
. The values of different special tokens cannot be the same. Special tokens MUST be in the front of thefrequent_vocab_list
(order sensitive). Default: IfNone
, it will beOrderedDict([("pad", "<pad>"), ("unk", "<unk>"), ("go", "<go>"), ("eos", "<eos>")])
.
-
static
from_frequent_word_of_vocab
(vocab) → cotk.dataloader.vocab.GeneralVocab[source]¶ Return a
GeneralVocab
instance, which has the same frequent vocabulary list as the old one. The rare word list can be built later.- Parameters
vocab (
GeneralVocab
) – The old instance to provide frequent words.
PretrainedVocab¶
-
class
cotk.dataloader.
PretrainedVocab
(tokenizer)[source]¶ Bases:
dataloader.Vocab
Use the vocabulary from a pretrained tokenizer in
transformers
package. This class is usually used for pretrained models, and it does NOT have rare words.Unlike
GeneralVocab
, this class does not always have
,unk
,go
,eos
. Some special tokens may refer to the same token.- Parameters
tokenizer (
transformers.PreTrainedTokenizer
) – A pretrained tokenizer from the transformers package.
-
frequent_vocab_list
¶ list – The list of frequent words.
-
all_vocab_list
¶ list – The list of frequent words and rare words. Frequent words are always in the front of the list.
SimpleVocab¶
-
class
cotk.dataloader.
SimpleVocab
[source]¶ Bases:
dataloader.Vocab
A very simple vocabulary class. No rare vocabs or special tokens. Used by
SparseLabel
.- Parameters
This class does not contain arguments for initialization.
Context¶
-
class
cotk.dataloader.
Context
(parameter_dict, weak=False, none_as_ignored=True)[source]¶ An abstract base class for context manager.
This class is used for setting default parameters for
Field
orVocab
, without directly passing parameters to__init__
of the object.See examples for how to use context manager.
- Parameters
parameter_dict (Dict[str, Any]) – Key-value dict for changed parameters.
weak (bool, optional) – When
False
, overwrite existing parameters. Default:False
.none_as_ignored (bool, optional) – When
True
,None
values inparameter_dict
are ignored. Otherwise, the corresponding key will be set toNone
. Default:True
.
-
classmethod
get
(key, default=None, no_default=False) → Any[source]¶ Get the value of parameter named
key
stored in this class.- Parameters
key (str) – name of the parameter
default (Any, optional) – Default value if
key
is not set. Defaults:None
.no_default (bool, optional) – When
True
, RaiseKeyError
ifkey
is not set. Defaults:False
.
-
classmethod
set
(key, value, weak=False, none_as_ignored=True) → Any[source]¶ Set the parameter named
key
tovalue
, stored in this class. If weak isTrue
, do not overwrite ifkey
is already set. Return the old value.- Parameters
key (str) – The name of the changed parameter.
value (Any) – The new value of changed parameter. If want to delete the key, use
Context.UNDEFINED
.weak (bool, optional) – When
False
, overwrite existing parameters. Defaults:False
.none_as_ignored (bool, optional) – When
True
,None
values inparameter_dict
are ignored. Otherwise, the corresponding value will be set toNone
. Default:True
.
FieldContext¶
-
class
cotk.dataloader.
FieldContext
(parameter_dict, weak=False, none_as_ignored=True)[source]¶ Bases:
dataloader.Context
A context class for setting default parameters for
Field
.-
classmethod
set_parameters
(*, weak=False, none_as_ignored=True, **kwargs) → cotk.dataloader.context.FieldContext[source]¶ Set a context for initialization of
Field
. See examples for how to use context manager.- Parameters
weak (bool, optional) – When
False
, overwrite existing parameters. Defaults:False
.none_as_ignored (bool, optional) – When
True
,None
values inkwargs
are ignored. Otherwise, the corresponding value will be set toNone
. Default:True
.**kwargs – Any parameters to be set. Set
key
toFieldContext.UNDEFINED
to delete a parameter.
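Example (a sketch of the intended usage; it assumes FieldContext.set_parameters() is used as a context manager and that "space" is accepted as a str tokenizer value):
>>> from cotk.dataloader import FieldContext, SentenceDefault, GeneralVocab
>>> with FieldContext.set_parameters(tokenizer="space", vocab=GeneralVocab(), max_sent_length=20):
...     post_field = SentenceDefault()   # tokenizer, vocab and max_sent_length come from the context
...     resp_field = SentenceDefault()   # shares the same defaults (and the same vocab object)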
VocabContext¶
-
class
cotk.dataloader.
VocabContext
(parameter_dict, weak=False, none_as_ignored=True)[source]¶ Bases:
dataloader.Context
A context class for setting default parameters for
Vocab
.-
classmethod
set_parameters
(*, weak=False, none_as_ignored=True, **kwargs) → cotk.dataloader.context.VocabContext[source]¶ Set a context for initialization of
Vocab
. See examples for how to use context manager.- Parameters
weak (bool, optional) – When
False
, overwrite existing parameters. Defaults:False
.none_as_ignored (bool, optional) – When
True
,None
values inkwargs
are ignored. Otherwise, the corresponding value will be set toNone
. Default:True
.**kwargs – Any parameters to be set. Set
key
toVocabContext.UNDEFINED
to delete a parameter.
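Example (a sketch in the same spirit as the FieldContext example above):
>>> from cotk.dataloader import VocabContext, GeneralVocab
>>> with VocabContext.set_parameters(min_frequent_vocab_times=3, min_rare_vocab_times=1):
...     vocab = GeneralVocab()   # both thresholds are taken from the context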
-
classmethod