Data Loader

cotk.dataloader provides classes and functions for downloading and loading benchmark datasets automatically. It reduces the cost of preprocessing data and provides fair datasets for every model. It also helps you adapt your model from one dataset to another.

Overview

Dataloaders are essential components in CoTK for building models and doing fair evaluation. CoTK uses a dataloader class, LanguageProcessing, to handle all language-related tasks. Here we give an overview of what makes up a LanguageProcessing dataloader.

(Figure: structure of a LanguageProcessing dataloader — _images/dataloader_structure.png)
  • A dataloader may have multiple sets of data. In this case, the names of the 3 sets (set_name) are "train", "dev", and "test".

  • Each set stores the data read from a text file. In this example, the 3 sets correspond to "train.txt", "dev.txt", and "test.txt".

  • A set may have multiple data fields. In this case, the "train" set has two fields, whose names (field_name) are "post" and "resp".

  • Data fields are specified by Field instances. A Field defines the way that the dataloader reads, processes, and outputs the data. (It does not store the data; the data is stored in the dataloader.)

  • A Field instance can be shared between data fields. In the example, "post" in the "train" set and "post" in the "dev" set share an instance.

  • A Field may contain a Tokenizer and a Vocab.

  • Tokenizer defines the methods to tokenize a sentence.

  • Vocab defines the vocabulary. An instance of Vocab can be shared among multiple Fields, in which case the data from all of them is used to construct the vocabulary (see the sketch below).
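
For concreteness, here is a minimal sketch of how these pieces can fit together in code. This is a sketch only: the set layout and the field names "post" and "resp" are illustrative, and the GeneralVocab and SentenceDefault arguments shown are explained in the sections below.

>>> vocab = GeneralVocab(min_frequent_vocab_times=10)          # one Vocab shared by all fields
>>> postField = SentenceDefault(tokenizer="space", vocab=vocab)
>>> respField = SentenceDefault(tokenizer="space", vocab=vocab)
>>> fields = {
>>>     "train": [("post", postField), ("resp", respField)],
>>>     "dev": [("post", postField), ("resp", respField)],
>>>     "test": [("post", postField), ("resp", respField)],
>>> }
>>> dataloader = LanguageProcessing("/path/to/dataset", fields)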

Building a Dataloader

Predefined Tasks

CoTK provides several predefined tasks and benchmarks, including

LanguageGeneration
SingleTurnDialog
MultiTurnDialog
SentenceClassification

Choosing an appropriate class for your task is the simplest and best way to build a dataloader. The documentation of each class explains how the dataloader is composed.

Customized Tasks

If the predefined classes do not satisfy your needs, you can construct an instance of LanguageProcessing.

To specify the data format of a customized task, the initialization of LanguageProcessing receives an argument named fields. A full description of fields looks like the example below.

>>> postField = SentenceDefault(...)
>>> respField = SentenceDefault(...)
>>> labelField = DenseLabel(...)
>>> fields = {
>>>    "train": [("post", postField), ("resp", respField)],
>>>    "test": [("post", postField), ('resp', respField), ('label', labelField)]
>>> }
>>> dataloader = LanguageProcessing("/path/to/dataset", fields)
  • "train" and "test" is the name of the split sets in the dataset. There should be two text file named train.txt and test.txt under /path/to/dataset/, corresponding to the two sets, "train" and "test" respectively.

  • fields["train"] describes the data format of train.txt. Every sample in train set has two data fields, which is represented by Field objects. As SentenceDefault (a subclass of Field) only read one line per each sample, a sample in train.txt occupy two lines. The first line are named by "post", the second line are named "resp".

  • Similarly, fields["test"] describes the data format of test.txt. Every sample in the test set occupies three lines, where the first line is "post", the second line is "resp", and the third line is an integer indicating "label".

A valid input example:

  • /path/to/dataset/train.txt

    How are you?
    I am fine.
    What's up?
    Everything is good.
    
  • /path/to/dataset/test.txt

    What is your name?
    Jack.
    1
    How about the food?
    Terrible.
    0
    

The Field instances define how dataloaders read the file, process the data, and provide the data to networks. See fields for further details.

Omit Set Names

If you have three sets named "train", "dev", and "test", and the data format is the same for all of them, you can specify the fields argument in the initialization of LanguageProcessing with the following code:

>>> fields = [("post", postField), ("resp", respField)]

which is equivalent to

>>> fields = {
>>>    "train": [("post", postField), ("resp", respField)],
>>>    "dev": [("post", postField), ("resp", respField)],
>>>    "test": [("post", postField), ("resp", respField)]
>>> }

Use Simple Create

You can use LanguageProcessing.simple_create() to initialize a dataloader, using the class names of Field subclasses instead of instances. The method receives arguments for initializing subclasses of Vocab and Field.

>>> fields = {
>>>    "train": [("post", "SentenceDefault"), ("resp", "SentenceDefault")],
>>>    "dev": [("post", "SentenceDefault"), ("resp", "SentenceDefault")],
>>>    "test": [("post", "SentenceDefault"), ("resp", "SentenceDefault")],
>>> }
>>> #or fields = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
>>> dataloader = LanguageProcessing.simple_create("/path/to/dataset", fields, \
>>>     max_sent_length=10, tokenizer="space", min_frequent_vocab_times=10)

In this example, it will first create a GeneralVocab instance with min_frequent_vocab_times=10. Then it initializes SentenceDefault objects with max_sent_length=10, tokenizer="space", and the created Vocab.

Use Context Manager

There is another way to use the class names of Field instead of instances: initialize LanguageProcessing inside the contexts of FieldContext and VocabContext.

>>> fields = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
>>> with FieldContext.set_parameters(max_sent_length=10, tokenizer="space"):
>>>     with VocabContext.set_parameters(min_frequent_vocab_times=10):
>>>         dataloader = LanguageProcessing("/path/to/dataset", fields)

which is equivalent to

>>> fields = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
>>> dataloader = LanguageProcessing.simple_create("/path/to/dataset", fields, max_sent_length=10, tokenizer="space", min_frequent_vocab_times=10)

Context is used to provide default values for Field and Vocab instances. See Context for further details.

Field

Field indicates data fields, which work behind the scenes in dataloaders. They define how dataloaders read the file, process the data, and provide the data to networks.

CoTK provides several fields, including Sentence, Session, and DenseLabel (described below).

Note: A Field never stores data, because an instance can be shared between different data fields in a dataloader.

Read the File

Field defines the way to read the file. For example,

  • Sentence reads one line per sample, which is a string of sentence.

  • Session reads multiple lines per sample, stopping when an empty line is read.

  • DenseLabel reads one line per sample, which is an integer.

See the documentation in each class for details.

Process the Data

Each subclass of Field defines the methods to process the input.

For example, Sentence processes the sentence into different formats:

  • (str) The whole sentence.

  • (List[str]) The tokenized sentence.

  • (List[id]) The indexes of tokens in the vocabulary.

Sentence also provides methods to convert a sentence from one format to another, such as convert_tokens_to_ids(), convert_ids_to_tokens(), and convert_ids_to_sentence().

The dataloader has similar methods, which invoke the corresponding methods of the default field. See LanguageProcessing.set_default_field() for details.
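
For instance, once a default field has been set, conversions can be chained through the dataloader. This is a sketch; the ids shown are illustrative and depend on the actual vocabulary.

>>> tokens = dataloader.tokenize("how are you")                       # str -> List[str]
>>> tokens
['how', 'are', 'you']
>>> ids = dataloader.convert_tokens_to_ids(tokens, add_special=True)  # List[str] -> List[id]
>>> ids
[2, 4, 5, 6, 3]
>>> dataloader.convert_ids_to_sentence(ids)                           # List[id] -> str
'how are you'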

Provide the Data to Networks

Each subclass of Field defines Field.get_batch(), which returns a dict of data for training the networks.

For example, if an instance of SentenceDefault is named "sent", SentenceDefault.get_batch() will return a dict containing:

  • sent

  • sent_length

  • sent_allvocabs

  • sent_str

LanguageProcessing.get_batch() will collect the dicts returned from every field and merge them. For example, for a dataloader with two SentenceDefault fields named "post" and "resp", LanguageProcessing.get_batch() will return a dict containing:

  • post

  • post_allvocabs

  • post_length

  • post_str

  • resp

  • resp_allvocabs

  • resp_length

  • resp_str

Pretrained Field

Default fields like SentenceDefault and SessionDefault are designed for common use in different language processing tasks. They use <go> and <eos> to mark the start and the end of sentences.

For some pretrained models like GPT2, <go> is not in the pretrained vocabulary and thus not available. We design different fields for different pretrained models, such as SentenceGPT2 and SentenceBERT.

Tokenizer

Tokenizer defines the method to tokenize a sentence, which is used by Field.

CoTK provides several tokenizers, including

  • SimpleTokenizer: A simple tokenizer for general use in CoTK, supporting space or nltk tokenization.

  • PretrainedTokenizer: A pretrained Tokenizer from the transformers package. For example, tokenizer for GPT2.

When creating a dataloader, the tokenizer argument often accepts a str or a Tokenizer. If a str is given, the following values are acceptable:

  • space: Split by spaces.

  • nltk: nltk.tokenize.WordPunctTokenizer will be used.

A SimpleTokenizer will be created from the str argument.
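
For example (a sketch; we assume SimpleTokenizer accepts the method name as its first argument, mirroring the str form above):

>>> from cotk.dataloader import SimpleTokenizer
>>> SimpleTokenizer("space").tokenize("How are you?")
['How', 'are', 'you?']
>>> SimpleTokenizer("nltk").tokenize("How are you?")
['How', 'are', 'you', '?']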

Vocabulary

Vocab defines the vocabulary, which is used by Field.

CoTK provides several vocabularies, including

  • GeneralVocab: A vocabulary for general use in CoTK. The vocabulary list is usually built while processing the input data. Saving and loading a predefined vocabulary is also supported.

  • PretrainedVocab: A predefined vocabulary from the transformers package. For example, the vocabulary for GPT2.

Type of Tokens

All tokens appearing in the dataset (including those that only appear in the test set) are split into 2 sets.

Frequent Vocabularies (frequent_vocabs)
  • Tokens that the model should read, predict, and generate.

  • These tokens are important in evaluation. They include common words and usually cover most of the tokens in the dataset.

  • They are extracted only from the training set, because models should be blind to the test set. Hence they are defined as the tokens that appear no less than a specified number of times (min_frequent_vocab_times) in the training set.

Rare Vocabularies (rare_vocabs)
  • Tokens that the model can optionally read, but usually will not predict or generate (except for models that can generate rare words using a copy mechanism or external knowledge).

  • These tokens are less important but DO affect the evaluation.

  • They are extracted from both the training set and the test set, because they are defined with evaluation in mind. Hence, they are defined as the tokens (excluding frequent_vocabs) that appear more than a specified number of times (min_rare_vocab_times) in the whole dataset.

There are also some other terms for vocabularies.

All Vocabularies (allvocabs)
  • The union of frequent vocabularies and rare vocabularies is called all vocabularies.

Special Tokens (special_tokens)
  • The most commonly used special tokens are <pad>, <unk>, <go>, and <eos>.

  • Special tokens are counted as valid vocabularies.

Unknown Tokens (<unk>)
  • <unk> means "out of vocabulary", but the exact meaning of <unk> varies between situations.

  • If it appears in a list whose name contains allvocabs (e.g. sent_allvocabs), <unk> indicates a token outside all vocabularies.

  • If it appears in a list whose name does not contain allvocabs (e.g. sent), <unk> indicates a token outside the frequent vocabularies, which means it may be a rare word (see the example below).
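
A small sketch of the difference (illustrative ids; here "fine" is assumed to be a rare word and "you" a frequent one, matching the batch examples later on this page):

>>> dataloader.convert_tokens_to_ids(["you", "fine"])                           # ids over all vocabularies
[6, 10]
>>> dataloader.convert_tokens_to_ids(["you", "fine"], only_frequent_word=True)  # rare word replaced by <unk>
[6, 1]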

Why CoTK Uses Rare Words

In traditional implementations, the vocabulary only contains frequent words. CoTK uses frequent vocabulary and rare vocabulary to support fair comparisons across different configurations.

For example, suppose we test two models on the same dataset, but with different vocabularies.

  • Model A: Frequent vocabulary F_A; Rare vocabulary R_A.

  • Model B: Frequent vocabulary F_B; Rare vocabulary R_B.

The fairness of comparisons can be guaranteed under certain conditions.

See each metric for when the fairness can be guaranteed. Hash values can help users determine whether a comparison is fair.

Connecting Field and Vocab

GeneralVocab is often shared between fields so that they construct the vocabulary list together. To identify whether the tokens from a field are regarded as belonging to the training set or the test set (which affects the division into frequent and rare vocab), Sentence uses an argument named vocab_from_mappings.

vocab_from_mappings is a dict, which infers the type of a token from the set name. By default:

Set Name        Type
-----------     -----
train           train
training        train
dev             test
development     test
valid           test
validation      test
test            test
evaluation      test

For example, a token from the training set will have the type train. The type will be passed to Vocab.add_tokens() as vocab_from. There are 3 types (a usage sketch follows this list):

  • train: Frequent vocabs are selected from tokens of this type.

  • test: Rare vocabs are selected from tokens of this type.

  • extra: The tokens of this type will not be selected as frequent or rare vocabs.
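
A sketch of overriding the default mapping (the set name "unlabeled" is illustrative; its tokens can still be read by the dataloader, but contribute to neither frequent nor rare vocab):

>>> field = SentenceDefault(
>>>     tokenizer="space",
>>>     vocab=GeneralVocab(),
>>>     vocab_from_mappings={
>>>         "train": "train",
>>>         "dev": "test",
>>>         "test": "test",
>>>         "unlabeled": "extra",
>>>     },
>>> )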

Context

FieldContext and VocabContext are used to set the default arguments for subclasses of Field and Vocab respectively.

>>> vocab = GeneralVocab(...)
>>> with FieldContext.set_parameters(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10):
>>>     field = SentenceDefault()

which is equivalent to:

>>> vocab = GeneralVocab(...)
>>> field = SentenceDefault(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10)

Contexts can be stacked, and weak determines whether the inner context overwrites the outer one (with weak=True, values already set by the outer context are kept).

>>> vocab = GeneralVocab(...)
>>> with FieldContext.set_parameters(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10):
>>>     with FieldContext.set_parameters(min_frequent_vocab_times=20):
>>>         field1 = SentenceDefault()  # min_frequent_vocab_times=20
>>> with FieldContext.set_parameters(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10):
>>>     with FieldContext.set_parameters(min_frequent_vocab_times=20, weak=True):
>>>         field2 = SentenceDefault()  # min_frequent_vocab_times=10

It usually works with the initialization of LanguageProcessing, without creating instances of Field or Vocab explicitly. See the examples here.

Hash Value for Dataloader

It is usually difficult to track the differences among configurations, so CoTK provides hash codes to identify each part of a dataloader, including the input data, vocabularies, and settings.

For example, if two data loaders have the same general hash, their data, vocabularies and settings are guaranteed to be the same.
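
For example (a sketch; fields is as in the earlier examples), two dataloaders built from the same files but with different settings share the raw data hash, while their general hashes differ:

>>> dl_a = LanguageProcessing.simple_create("/path/to/dataset", fields, tokenizer="space")
>>> dl_b = LanguageProcessing.simple_create("/path/to/dataset", fields, tokenizer="nltk")
>>> dl_a.get_raw_data_hash() == dl_b.get_raw_data_hash()   # same input files
True
>>> dl_a.get_general_hash() == dl_b.get_general_hash()     # different settings
False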

LanguageProcessing provides the following hash values: a general hash, a raw data hash, a data hash, a vocab hash, and a setting hash (see Hash below).

Dataloader

class cotk.dataloader.Dataloader[source]

Base class of Dataloader.

classmethod get_all_subclasses() → Iterable[Any]

Return a generator of all subclasses.

classmethod load_class(class_name) → Any

Return a subclass of class_name, case insensitively.

Parameters

class_name (str) – target class name.

LanguageProcessing

class cotk.dataloader.LanguageProcessing(file_id, fields)[source]

Bases: dataloader.Dataloader

Base class for all language processing tasks. This is an abstract class.

During the initialization of a dataloader, Vocab, Tokenizer or Field may be created. See how to create a dataloader.

Parameters
  • file_id (str) – A string indicating the source (path) of the dataset. It can be local path ("./data"), a resource name ("resources://dataset"), or an url ("http://test.com/dataset.zip"). See the details of file id.

  • fields (List, OrderedDict, Dict) – This argument supports multiple input types:

    • If an OrderedDict or a List, it specifies the data format of the "train", "dev", and "test" sets.

      • A data format should be an OrderedDict, or a List[Tuple] that can be converted to an OrderedDict.

      • The key of data format is the name of a Field (used by get_batch()), and the value is either a class name of a Field or a Field object.

      • Examples:

        >>> postField = SentenceDefault(...)
        >>> respField = SentenceDefault(...)
        >>> data_format = [("post", postField), ("resp", respField)]
        

        or

        >>> data_format = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
        
      • Examples:

        >>> fields = data_format
        

        which is equivalent to

        >>> fields = {"train": data_format, "dev": data_format, "test": data_format}
        
    • If Dict, fields[key] describes the data format of the set named key. Examples:

      >>> fields = {"train": data_format, "extra": data_format}
      
    • See how to create a dataloader.

static LanguageProcessing.simple_create(file_id, fields, **kwargs) → cotk.dataloader.dataloader.LanguageProcessing[source]

A simple way to create a dataloader. Instead of using VocabContext and FieldContext, specify all the possible parameters here.

Parameters
  • file_id (str) – A string indicating the source (path) of the dataset. It can be local path ("./data"), a resource name ("resources://dataset"), or an url ("http://test.com/dataset.zip"). See the details of file id.

  • fields (List, OrderedDict, Dict) – See initialization of LanguageProcessing for explanation.

  • **kwargs – Arguments passed to created Vocab and Field.

Tokenizer, Vocabulary, and Field

LanguageProcessing.fields

This instance attribute shows the fields of the dataloader (see the initialization of LanguageProcessing). For example, the fields can be printed as follows:

{
    'train': OrderedDict([('sent', <cotk.dataloader.field.SentenceDefault object at 0x000001E170F8B588>)]),
    'dev': OrderedDict([('sent', <cotk.dataloader.field.SentenceDefault object at 0x000001E170F8BB48>)]),
    'test': OrderedDict([('sent', <cotk.dataloader.field.SentenceDefault object at 0x000001E170F8BEC8>)])
}
LanguageProcessing.get_default_tokenizer() → cotk.dataloader.tokenizer.Tokenizer[source]

Get the default Tokenizer in this dataloader. It can be set by set_default_field().

LanguageProcessing.get_default_vocab() → cotk.dataloader.vocab.Vocab[source]

Get the default Vocab in this dataloader. It can be set by set_default_field().

LanguageProcessing.get_default_field() → cotk.dataloader.field.Field[source]

Get the default Field in this dataloader. It can be set by set_default_field().

LanguageProcessing.set_default_field(set_name, field_name)[source]

Set the default Field in this dataloader. Meanwhile, the default Vocab and Tokenizer are also set according to the field (if the field has a vocab and a tokenizer).

The default field will affect the behavior of the sentence and vocabulary helper methods listed in the sections below. A short usage sketch follows the parameters.

Parameters
  • set_name (str) – The name of set. For example: "train", "dev", "test".

  • field_name (str) – The name of field.
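
A short usage sketch (the set and field names are illustrative, and the output assumes a space tokenizer):

>>> dataloader.set_default_field("train", "post")   # the "post" field of the "train" set becomes the default
>>> dataloader.tokenize("how are you")              # now delegates to the default field's tokenizer
['how', 'are', 'you']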

LanguageProcessing.get_field(set_name, field_name) → cotk.dataloader.field.Field[source]

Get Field according to name of set and field.

Parameters
  • set_name (str) – The name of set. For example: "train", "dev", "test".

  • field_name (str) – The name of field.

Batched Data

LanguageProcessing.get_batch(set_name, indexes) → Dict[str, Any][source]

Get a batch of data with the specified indexes. Return a merged dict containing all the data from each field by calling field.get_batch(). See examples in subclasses for the return values of predefined tasks.

get_next_batch(), get_batches(), and get_all_batch() provide other ways to get batched data; their return values are consistent with this method.

Parameters
  • set_name (str) – The name of set. For example: "train", "dev", "test".

  • indexes (list) – a list of specified indexes of batched data.

LanguageProcessing.restart(set_name, batch_size=None, shuffle=True)[source]

Initialize batches. This function should be called before get_next_batch() or when an epoch ends. See get_next_batch() for examples.

Parameters
  • set_name (str) – The name of set. For example: "train", "dev", "test".

  • batch_size (int) – the number of samples in a batch. Default: if None, the last batch_size is used.

  • shuffle (bool) – whether to shuffle the data. Default: True.

LanguageProcessing.get_next_batch(set_name, ignore_left_samples=False) → Optional[Dict[str, Any]][source]

Get the next batch. It can only be called after initializing batches (restart()). Return a dict like get_batch(), or None if the epoch has ended.

Parameters
  • set_name (str) – The name of set. For example: "train", "dev", "test".

  • ignore_left_samples (bool) – If the number of samples is not divisible by batch_size, ignore the remaining samples (fewer than batch_size). Setting it to True makes every batch have the same number of samples. Default: False.

Examples

>>> dataloader.restart("train")
>>> while True:
>>>     data = dataloader.get_next_batch("train")
>>>     if data is None:
>>>         break
>>>     print(data)
LanguageProcessing.get_batches(set_name, batch_size=None, shuffle=True, ignore_left_samples=False) → Iterable[Dict[str, Any]][source]

An iterable generator over batches. It first calls restart(), and then calls get_next_batch() until no more data is available. Returns an iterable generator where each element is like get_batch(). A typical usage sketch follows the parameters below.

Parameters
  • set_name (str) – The name of set. For example: "train", "dev", "test".

  • batch_size (int, optional) – The number of samples in a batch. Default: None, in which case the last batch_size is used.

  • shuffle (bool) – whether to shuffle the data. Default: True.

  • ignore_left_samples (bool) – If the number of samples is not divisible by batch_size, ignore the remaining samples (fewer than batch_size). Setting it to True makes every batch have the same number of samples. Default: False.
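
A typical epoch loop looks like this (a sketch; train_step is a hypothetical user function):

>>> for data in dataloader.get_batches("train", batch_size=32, shuffle=True):
>>>     loss = train_step(data)   # data is a dict like the one returned by get_batch()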

LanguageProcessing.get_all_batch(set_name) → Dict[str, List[Any]][source]

Concatenate all batches to a single dict, where padding will not be applied.

Returns a dict like get_batch() with all valid indexes, but the sentences are not padded and their type is converted to list. More precisely, this function calls get_batch() with len(indexes)==1 multiple times and concatenates all the values in the returned dicts.

Parameters

set_name (str) – The name of set. For example: "train", "dev", "test".

Sentences and Manipulations

LanguageProcessing.tokenize(sentence) → List[str]

Tokenize sentence.

It calls the identical method of the Sentence instance sentence, from get_default_field().

  • Convert tokens to lower case if sentence.convert_to_lower_letter is True.

Parameters

sentence (str) – The sentence to be tokenized.

LanguageProcessing.tokenize_sentences(sentences) → List[List[str]]

Tokenize sentences.

It calls the identical method of the Sentence instance sentence, from get_default_field().

  • Convert tokens to lower case if sentence.convert_to_lower_letter is True.

Parameters

sentences (List[str]) – The list of sentences to be tokenized.

LanguageProcessing.convert_tokens_to_ids(tokens, add_special=False, only_frequent_word=False) → List[int]

Convert list of tokens to list of ids. It calls the identical method of the Sentence instance sentence, from get_default_field().

Parameters
  • tokens (List[str]) – The tokens to be converted.

  • add_special (bool, optional) – If True, special tokens (e.g. go, eos) are added. Default: False.

  • only_frequent_word (bool, optional) – If True, rare vocabs will be replaced by unk_id. Default: False.

LanguageProcessing.convert_ids_to_tokens(ids, remove_special=True, trim=True) → List[str]

Convert list of ids to list of tokens. It calls the identical method of the Sentence instance sentence, from get_default_field().

Parameters
  • ids (List[int]) – The ids to be converted.

  • remove_special (bool, optional) – If True, detect and try to do a reverse operation of add_special in convert_tokens_to_ids(). It will not remove unk or special tokens in the middle of sentences. Default: True.

  • trim (bool, optional) – If True, use trim_in_ids() to remove trailing pad and eos. Default: True.

LanguageProcessing.convert_ids_to_sentence(ids, remove_special=True, trim=True) → str

Convert list of tokens to a sentence. It calls the identical method of the Sentence instance sentence, from get_default_field().

Parameters
  • ids (List[int]) – The ids to be converted.

  • remove_special (bool, optional) – If True, detect and try to do a reverse operation of add_special in convert_tokens_to_ids(). It will not remove unk or special tokens in the middle of sentences. Default: True.

  • trim (bool, optional) – If True, use trim_in_ids() to remove trailing pad and eos. Default: True.

LanguageProcessing.convert_sentence_to_ids(sentence, add_special=False, only_frequent_word=False) → List[int]

Convert a sentence to a list of ids. It calls the identical method of the Sentence instance sentence, from get_default_field().

Parameters
  • sentence (str) – The sentence to be converted.

  • add_special (bool, optional) – If True, special tokens (e.g. go, eos) are added. Default: False.

  • only_frequent_word (bool, optional) – If True, rare vocabs will be replaced by unk_id. Default: False.

LanguageProcessing.add_special_to_ids(ids) → List[int]

Add special tokens, such as go_id or eos_id to the input ids. It calls the identical method of the Sentence instance sentence, from get_default_field().

Parameters

ids (List[int]) – The input ids.

LanguageProcessing.remove_special_in_ids(ids, remove_special=True, trim=True) → List[int]

Remove special ids in input ids. It calls the identical method of the Sentence instance sentence, from get_default_field().

Parameters
  • ids (List[int]) – Input ids.

  • remove_special (bool, optional) – If True, detect and try to do a reverse operation of add_special in convert_tokens_to_ids(). It will not remove unk or special tokens in the middle of sentences. Default: True.

  • trim (bool, optional) – If True, use trim_in_ids() to remove trailing pad and eos. Default: True.

LanguageProcessing.process_sentences(sentences, add_special=True, only_frequent_word=False, cut=True) → List[List[int]]

Process input sentences.

It calls the identical method of the Sentence instance sentence, from get_default_field().

  • If sentences haven’t been tokenized, tokenize them by invoking Sentence.tokenize_sentences().

  • Then, convert the list of tokens to a list of ids.

  • If sentence.max_sent_length is not None and cut is True, sentences whose length is more than sentence.max_sent_length are shortened to the first sentence.max_sent_length tokens.

Parameters
  • sentences (List[str], List[List[str]]) – sentences can be a list of sentences or a list of lists of tokens.

  • add_special (bool, optional) – If True, special tokens (e.g. go, eos) are added. Default: True.

  • only_frequent_word (bool, optional) – If True, rare vocabs will be replaced by unk_id. Default: False.

  • cut (bool, optional) – Whether to cut sentences with too many tokens. Default: True.

LanguageProcessing.trim_in_ids(ids) → List[int]

Find the first special token indicating the sentence is over and remove all the tokens after it (included). Then remove all trailing pad. It calls the identical method of the Sentence instance sentence, from get_default_field().

Parameters

ids (List[int]) – The input ids.

Vocabulary

LanguageProcessing.frequent_vocab_size

int – The number of frequent words. It calls the identical method of the Vocab instance vocab, from get_default_vocab().

LanguageProcessing.all_vocab_size

int – The number of frequent words and rare words. It calls the identical method of the Vocab instance vocab, from get_default_vocab().

LanguageProcessing.frequent_vocab_list

list – The list of frequent words. It calls the identical method of the Vocab instance vocab, from get_default_vocab().

LanguageProcessing.all_vocab_list

list – The list of frequent words and rare words. Frequent words are always in the front of the list. It calls the identical method of the Vocab instance vocab, from get_default_vocab().

LanguageProcessing.get_special_tokens_mapping() → MutableMapping[str, str]

Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following: pad, unk, go, eos, sep, cls, mask. The value can be arbitrary string, e.g., "<pad>", "<unk>". It calls the identical method of the Vocab instance vocab, from get_default_vocab().

LanguageProcessing.get_special_tokens_id(name) → int

Get id of special token specifying the general name. Raise KeyError if no such token in this instance. It calls the identical method of the Vocab instance vocab, from get_default_vocab().

Parameters

name (str) – the general name, must be one of the following, pad, unk, go, eos, sep, cls, mask.

LanguageProcessing.pad_id

int – The id of pad token. Raise KeyError if no pad token in this instance. It calls the identical method of the Vocab instance vocab, from get_default_vocab().

LanguageProcessing.unk_id

int – The id of unk token. Raise KeyError if no unk token in this instance. It calls the identical method of the Vocab instance vocab, from get_default_vocab().

LanguageProcessing.go_id

int – The id of go token. Raise KeyError if no go token in this instance. It calls the identical method of the Vocab instance vocab, from get_default_vocab().

LanguageProcessing.eos_id

int – The id of eos token. Raise KeyError if no eos token in this instance. It calls the identical method of the Vocab instance vocab, from get_default_vocab().

Hash

LanguageProcessing.get_general_hash() → str[source]

General hash. Identifying all details of the dataloader, including the raw data before processing, the tokenized data, the vocabulary, and the settings.

See dataloader hash for explanation.

LanguageProcessing.get_raw_data_hash() → str[source]

Raw data hash. Identifying the raw data before processing.

See dataloader hash for explanation.

LanguageProcessing.get_data_hash() → str[source]

Data hash. Identifying the data after processing (tokenization).

See dataloader hash for explanation.

LanguageProcessing.get_vocab_hash() → str[source]

Vocab hash. Identifying the vocabulary.

See dataloader hash for explanation.

LanguageProcessing.get_setting_hash() → str[source]

Setting hash. Identifying the settings used to create the dataloader.

See dataloader hash for explanation.

LanguageGeneration

class cotk.dataloader.LanguageGeneration(file_id, *, tokenizer=None, max_sent_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, pretrained=None)[source]

Bases: dataloader.LanguageProcessing

This class supports language modeling tasks or language generation tasks without any input.

Parameters
  • file_id (str) – A string indicating the source (path) of the dataset. It can be local path ("./data"), a resource name ("resources://dataset"), or an url ("http://test.com/dataset.zip"). See the details of file id.

  • tokenizer (Tokenizer, str, optional) – How to tokenize sentences. If str, see tokenizer for possible values. No default value; a KeyError will be raised if not specified.

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. If it's None or Sentence.INFINITE_LENGTH, sentences won't be shortened no matter how long they are. Default: None.

  • convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default: False.

  • min_frequent_vocab_times (int, optional) – Tokens from the training data that appear no less than min_frequent_vocab_times times will be regarded as frequent words. Default: 0.

  • min_rare_vocab_times (int, optional) – Tokens from the training data or test data that appear more than min_rare_vocab_times times will be regarded as rare words (frequent words excluded). Default: 0.

  • pretrained (str, optional) – Use a pretrained field instead of SentenceDefault. Default: None (no pretrained field is used).

get_batch(set_name, indexes) → Dict[str, Any][source]

Get a batch of data with specified indexes. Returns a dict that at least contains:

  • sent_length (numpy.ndarray): A 1-d array, the length of sentence in each batch. Size: [batch_size]

  • sent (numpy.ndarray): A 2-d padding array containing id of tokens. Only provide frequent tokens. unk_id will be used for a rare token. Size: [batch_size, max(sent_length)]

  • sent_allvocabs (numpy.ndarray): A 2-d padding array containing id of tokens. Provide both frequent and rare tokens. Size: [batch_size, max(sent_length)]

  • sent_str (List[str]): A list containing raw sentences before tokenizing, converting to ids, or padding. Do not contain any special tokens. Size: [batch_size]

get_next_batch(), get_batches(), and get_all_batch() provide other ways to get batched data; their return values are consistent with this method.

Parameters
  • set_name (str) – The name of set. For example: "train", "dev", "test".

  • indexes (list) – a list of specified indexes of batched data.

Examples

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you",
>>> #   "hello", "i", "am", "fine"]
>>> # frequent_vocab_size = 9
>>> # frequent_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", "hello", "i"]
>>> dataloader.get_batch('train', [0, 1, 2])
{
    "sent": numpy.array([
        [2, 4, 5, 6, 3, 0],   # first sentence: <go> how are you <eos> <pad>
        [2, 7, 3, 0, 0, 0],   # second sentence:  <go> hello <eos> <pad> <pad> <pad>
        [2, 7, 8, 1, 1, 3]    # third sentence: <go> hello i <unk> <unk> <eos>
    ]),
    "sent_length": numpy.array([5, 3, 6]), # length of sentences
    "sent_allvocabs": numpy.array([
        [2, 4, 5, 6, 3, 0],   # first sentence: <go> how are you <eos> <pad>
        [2, 7, 3, 0, 0, 0],   # second sentence:  <go> hello <eos> <pad> <pad> <pad>
        [2, 7, 8, 9, 10, 3]   # third sentence: <go> hello i am fine <eos>
    ]),
    "sent_str": [
        "how are you",
        "hello",
        "hello i am fine"
    ],
}
get_teacher_forcing_metric(gen_log_prob_key='gen_log_prob') → cotk.metric.metric.MetricChain[source]

Get metrics for teacher-forcing. In other words, this function provides metrics for language modelling task.

It contains:

See the above class for details of arguments.

Parameters

gen_log_prob_key (str, optional) – The key of predicted log probability over words. Default: gen_log_prob.

get_inference_metric(gen_key='gen', sample_in_bleu=1000, sample_in_ngram_perplexity=10000, seed=1229, cpu_count=None) → cotk.metric.metric.MetricChain[source]

Get metrics for inference. In other words, this function provides metrics for language generation tasks.

It contains:

See the above class for details of arguments.

Parameters
  • gen_key (str, optional) – The key of generated sentences. Default: gen.

  • sample_in_bleu (int, optional) – Number of examples sampled from the generated sentences. Default: 1000.

  • sample_in_ngram_perplexity (int, optional) – Number of examples sampled from the generated sentences. Default: 10000.

  • seed (int, optional) – Random seed for sampling. Default: 1229.

  • cpu_count (int, optional) – Number of used cpu for multiprocessing. Multiprocessing will NOT be used when cpu_count is set to 1 or the dataset is small. Default: If None, the environment variable CPU_COUNT will be used when available, or all available cpu will be used otherwise.
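
A sketch of how the returned MetricChain is typically used (model_generate is a hypothetical user function producing generated token ids for each sample; metric objects are fed with forward() and report results with close()):

>>> metric = dataloader.get_inference_metric(gen_key="gen")
>>> for data in dataloader.get_batches("test", batch_size=32, shuffle=False):
>>>     data["gen"] = model_generate(data)   # hypothetical model call
>>>     metric.forward(data)
>>> result = metric.close()                  # a dict of metric values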

MSCOCO

class cotk.dataloader.MSCOCO(file_id, *, tokenizer='nltk', max_sent_length=50, convert_to_lower_letter=False, min_frequent_vocab_times=10, min_rare_vocab_times=0, pretrained=None)[source]

Bases: dataloader.LanguageGeneration

A dataloader for preprocessed MSCOCO dataset. Refer to LanguageGeneration and LanguageProcessing for attributes and methods.

Parameters
  • file_id (str) – A string indicating the source (path) of the dataset. It can be local path ("./data"), a resource name ("resources://dataset"), or an url ("http://test.com/dataset.zip"). See the details of file id. Default: resources://MSCOCO.

  • tokenizer (Tokenizer, str, optional) – How to tokenize sentence. if str, see tokenizer for possible value. Default: nltk

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. If it's None or Sentence.INFINITE_LENGTH, sentences won't be shortened no matter how long they are. Default: 50.

  • convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default: False.

  • min_frequent_vocab_times (int, optional) – Tokens from the training data that appear no less than min_frequent_vocab_times times will be regarded as frequent words. Default: 10.

  • min_rare_vocab_times (int, optional) – Tokens from the training data or test data that appear more than min_rare_vocab_times times will be regarded as rare words (frequent words excluded). Default: 0.

  • pretrained (str, optional) – Use pretrained field instead of SentenceDefault. Default: If None, no pretrained field used.

References

[1] http://images.cocodataset.org/annotations/annotations_trainval2017.zip

[2] Chen X, Fang H, Lin T Y, et al. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv:1504.00325, 2015.

SingleTurnDialog

class cotk.dataloader.SingleTurnDialog(file_id, *, tokenizer=None, max_sent_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, pretrained=None)[source]

Bases: dataloader.LanguageProcessing

This class supports sequence-to-sequence generation tasks, especially single-turn dialog tasks.

Parameters
  • file_id (str) – A string indicating the source (path) of the dataset. It can be local path ("./data"), a resource name ("resources://dataset"), or an url ("http://test.com/dataset.zip"). See the details of file id.

  • tokenizer (Tokenizer, str, optional) – How to tokenize sentences. If str, see tokenizer for possible values. No default value; a KeyError will be raised if not specified.

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. If it's None or Sentence.INFINITE_LENGTH, sentences won't be shortened no matter how long they are. Default: None.

  • convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default: False.

  • min_frequent_vocab_times (int, optional) – Tokens from the training data that appear no less than min_frequent_vocab_times times will be regarded as frequent words. Default: 0.

  • min_rare_vocab_times (int, optional) – Tokens from the training data or test data that appear more than min_rare_vocab_times times will be regarded as rare words (frequent words excluded). Default: 0.

  • pretrained (str, optional) – Use a pretrained field instead of SentenceDefault. Default: None (no pretrained field is used).

get_batch(set_name, indexes) → Dict[str, Any][source]

Get a batch of data with specified indexes. Return a dict that contains:

  • post_length (numpy.ndarray): A 1-d array, the length of post in each batch. Size: [batch_size]

  • post (numpy.ndarray): A 2-d padded array containing tokens of id form in posts. Only provide frequent tokens. unk_id will be used for a rare token. Size: [batch_size, max(sent_length)]

  • post_allvocabs (numpy.ndarray): A 2-d padded array containing tokens of id form in posts. Provide both frequent and rare vocabs. Size: [batch_size, max(sent_length)]

  • post_str (List[str]): A list containing raw posts before tokenizing, converting to ids, or padding. Do not contain any special tokens. Size: [batch_size]

  • resp_length (numpy.ndarray): A 1-d array, the length of response in each batch. Size: [batch_size]

  • resp (numpy.ndarray): A 2-d padded array containing tokens of id form in responses. Only provide frequent tokens. unk_id will be used for a rare token. Size: [batch_size, max(sent_length)]

  • resp_allvocabs (numpy.ndarray): A 2-d padded array containing tokens of id form in responses. Provide both frequent and rare tokens. Size: [batch_size, max(sent_length)]

  • resp_str (List[str]): A list containing raw responses before tokenizing, converting to ids, or padding. Do not contain any special tokens. Size: [batch_size]

get_next_batch(), get_batches(), and get_all_batch() provide other ways to get batched data; their return values are consistent with this method.

Parameters
  • set_name (str) – The name of set. For example: "train", "dev", "test".

  • indexes (list) – a list of specified indexes of batched data.

Examples

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you",
>>> #   "hello", "i", "am", "fine"]
>>> # frequent_vocab_size = 9
>>> # frequent_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", "hello", "i"]
>>> dataloader.get_batch('train', [0, 1])
{
    "post_str": [
        "are you fine",
        "hello",
    ],
    "post_allvocabs": numpy.array([
        [2, 5, 6, 10, 3],  # first post:  <go> are you fine <eos>
        [2, 7, 3, 0, 0],   # second post: <go> hello <eos> <pad> <pad>
    ]),
    "post": numpy.array([
        [2, 5, 6, 1, 3],   # first post:  <go> are you <unk> <eos>
        [2, 7, 3, 0, 0],   # second post: <go> hello <eos> <pad> <pad>
    ]),
    "resp_str": [
        "i am fine",
        "hello"
    ],
    "resp_allvocabs": numpy.array([
        [2, 8, 9, 10, 3],  # first response:  <go> i am fine <eos>
        [2, 7, 3, 0, 0],   # second response: <go> hello <eos> <pad> <pad>
    ]),
    "resp": numpy.array([
        [2, 8, 1, 1, 3],   # first response:  <go> i <unk> <unk> <eos>
        [2, 7, 3, 0, 0],   # second response: <go> hello <eos> <pad> <pad>
    ]),
    "post_length": numpy.array([5, 3]), # length of posts
    "resp_length": numpy.array([5, 3]), # length of responses
}
get_teacher_forcing_metric(gen_log_prob_key='gen_log_prob', generate_rare_vocab=False) → MetricChain[source]

Get metrics for teacher-forcing.

It contains:

Parameters
  • gen_log_prob_key (str) – The key of predicted log probability over words. Refer to metric.PerplexityMetric. Default: gen_log_prob.

  • generate_rare_vocab (bool) – Whether gen_log_prob contains invalid vocab. Refer to metric.PerplexityMetric. Default: False.

get_inference_metric(gen_key='gen') → MetricChain[source]

Get metrics for inference.

It contains:

Parameters

gen_key (str) – The key of generated sentences in index form. Refer to metric.BleuCorpusMetric or metric.SingleTurnDialogRecorder. Default: gen.

OpenSubtitles

class cotk.dataloader.OpenSubtitles(file_id='resources://OpenSubtitles', *, tokenizer='nltk', max_sent_length=50, convert_to_lower_letter=False, min_frequent_vocab_times=10, min_rare_vocab_times=0, pretrained=None)[source]

Bases: dataloader.SingleTurnDialog

A dataloader for OpenSubtitles dataset. Refer to SingleTurnDialog, LanguageProcessing for attributes and methods.

Parameters
  • file_id (str) – A string indicating the source (path) of the dataset. It can be local path ("./data"), a resource name ("resources://dataset"), or an url ("http://test.com/dataset.zip"). See the details of file id. Default: resources://OpenSubtitles.

  • tokenizer (Tokenizer, str, optional) – How to tokenize sentence. if str, see tokenizer for possible value. Default: nltk

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to first max_sent_length tokens. If it’s None or Sentence.INFINITE_LENGTH, sentences won’t be shortened no matter how long they are. Default: 50.

  • convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default: False.

  • min_frequent_vocab_times (int, optional) – Tokens from the training data that appear no less than min_frequent_vocab_times times will be regarded as frequent words. Default: 10.

  • min_rare_vocab_times (int, optional) – Tokens from the training data or test data that appear more than min_rare_vocab_times times will be regarded as rare words (frequent words excluded). Default: 0.

  • pretrained (str, optional) – Use pretrained field instead of SentenceDefault. Default: If None, no pretrained field used.

References

[1] http://opus.nlpl.eu/OpenSubtitles.php

[2] P. Lison and J. Tiedemann, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. LREC 2016.

MultiTurnDialog

class cotk.dataloader.MultiTurnDialog(file_id, tokenizer=None, max_sent_length=None, max_turn_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, fields=None, pretrained=None)[source]

Base class for multi-turn dialog datasets. This is an abstract class.

Arguments:

Attributes:

Notes

A Session field must be set as the default field. When invoking __init__() of MultiTurnDialog, the default field, which may be reset in a subclass, is set to self.fields["train"]["session"].

get_batch(set_name, indexes) → Dict[str, Any]

Get a batch of data with the specified indexes. Return a merged dict containing all the data from each field by calling field.get_batch(). See examples in subclasses for the return values of predefined tasks.

get_next_batch(), get_batches(), and get_all_batch() provide other ways to get batched data; their return values are consistent with this method.

Parameters
  • set_name (str) – The name of set. For example: "train", "dev", "test".

  • indexes (list) – a list of specified indexes of batched data.

get_teacher_forcing_metric(multi_turn_gen_log_prob_key='multi_turn_gen_log_prob')[source]

Get metric for teacher-forcing.

It contains:

Parameters

multi_turn_gen_log_prob_key (str) – The key of predicted log probability over words. Refer to metric.MultiTurnPerplexityMetric. Default: multi_turn_gen_log_prob.

Returns

A metric.MetricChain object.

get_inference_metric(multi_turn_gen_key='multi_turn_gen')[source]

Get metric for inference.

It contains:

Parameters

multi_turn_gen_key (str) – The key of generated sentences in index form. Refer to metric.BleuCorpusMetric or metric.MultiTurnDialogRecorder. Default: multi_turn_gen.

Returns

A metric.MetricChain object.

UbuntuCorpus

class cotk.dataloader.UbuntuCorpus(file_id='resources://Ubuntu', min_frequent_vocab_times=10, max_sent_length=50, max_turn_length=20, min_rare_vocab_times=0, tokenizer='nltk', pretrained=None)[source]

A dataloader for Ubuntu dataset.

Parameters
  • file_id (str) – a str indicating the source of the UbuntuCorpus dataset. Default: resources://Ubuntu. A preset dataset is downloaded and cached.

  • min_frequent_vocab_times (int) – A cut-off threshold of frequent tokens. All tokens that appear no less than min_frequent_vocab_times times in the training set will be marked as frequent words. Default: 10.

  • max_sent_length (int) – All sentences longer than max_sent_length will be shortened to first max_sent_length tokens. Default: 50.

  • max_turn_length (int) – All sessions longer than max_turn_length will be shortened to first max_turn_length sentences. Default: 20.

  • min_rare_vocab_times (int) – A cut-off threshold of rare tokens. All tokens that appear no less than min_rare_vocab_times times in the whole dataset (excluding frequent words) will be marked as rare words. Otherwise, they are unknown words, both in training and testing stages. Default: 0 (no unknown words).

Refer to MultiTurnDialog for attributes and methods.

References

[1] https://github.com/rkadlec/ubuntu-ranking-dataset-creator

[2] Lowe R, Pow N, Serban I, et al. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. SIGDIAL 2015.

SwitchboardCorpus

class cotk.dataloader.SwitchboardCorpus(file_id='resources://SwitchboardCorpus', min_frequent_vocab_times=5, max_sent_length=50, max_turn_length=1000, min_rare_vocab_times=0, tokenizer='nltk', pretrained=None)[source]

A dataloader for Switchboard dataset.

In this dataset, all sessions start with a <d> representing empty context.

Parameters

file_id (str) – a string indicating the source of SwitchboardCorpus dataset. Default: resources://SwitchboardCorpus. A preset dataset is downloaded and cached.

Refer to MultiTurnDialog for attributes and methods.

References

[1] https://catalog.ldc.upenn.edu/LDC97S62

[2] John J G and Edward H. Switchboard-1 release 2. Linguistic Data Consortium, Philadelphia 1997.

SentenceClassification

class cotk.dataloader.SentenceClassification(file_id, tokenizer=None, max_sent_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, fields=None, pretrained=None)[source]

Base class for sentence classification datasets. This is an abstract class.

Arguments:

Notes

A Sentence field must be set as the default field. When invoking __init__() of SentenceClassification, the default field, which may be reset in a subclass, is set to self.fields["train"]["sent"].

get_batch(set_name, indexes)[source]

Get a batch of specified indexes.

Parameters
  • set_name (str) – must be contained in key_name

  • indexes (list) – a list of specified indexes

Returns

(dict)

A dict at least contains:

  • sent_length(numpy.array): A 1-d array, the length of sentence in each batch. Size: [batch_size]

  • sent(numpy.array): A 2-d padding array containing id of words. Only provide valid words. unk_id will be used if a word is not valid. Size: [batch_size, max(sent_length)]

  • label(numpy.array): A 1-d array, the label of sentence in each batch.

  • sent_allvocabs(numpy.array): A 2-d padding array containing id of words. Provide both valid and invalid words. Size: [batch_size, max(sent_length)]

Examples

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you",
>>> #   "hello", "i", "am", "fine"]
>>> # vocab_size = 9
>>> # vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", "hello", "i"]
>>> dataloader.get_batch('train', [0, 1, 2])
{
    "sent": numpy.array([
            [2, 4, 5, 6, 3, 0],   # first sentence: <go> how are you <eos> <pad>
            [2, 7, 3, 0, 0, 0],   # second sentence:  <go> hello <eos> <pad> <pad> <pad>
            [2, 7, 8, 1, 1, 3]    # third sentence: <go> hello i <unk> <unk> <eos>
        ]),
    "label": numpy.array([1, 2, 0]) # label of sentences
    "sent_length": numpy.array([5, 3, 6]), # length of sentences
    "sent_allvocabs": numpy.array([
            [2, 4, 5, 6, 3, 0],   # first sentence: <go> how are you <eos> <pad>
            [2, 7, 3, 0, 0, 0],   # second sentence:  <go> hello <eos> <pad> <pad> <pad>
            [2, 7, 8, 9, 10, 3]   # third sentence: <go> hello i am fine <eos>
        ]),
}
get_metric(prediction_key='prediction')[source]

Get metrics for accuracy. In other words, this function provides metrics for sentence classification task.

It contains:

  • metric.AccuracyMetric

Parameters

prediction_key (str) – The key of prediction over sentences. Refer to metric.AccuracyMetric. Default: prediction.

Returns

A metric.MetricChain object.
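
A sketch of typical usage (classify is a hypothetical user function returning a predicted label per sample; metric objects are fed with forward() and report results with close()):

>>> metric = dataloader.get_metric(prediction_key="prediction")
>>> for data in dataloader.get_batches("test", batch_size=32, shuffle=False):
>>>     data["prediction"] = classify(data["sent"])   # hypothetical model call
>>>     metric.forward(data)
>>> print(metric.close())                             # contains the accuracy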

SST

class cotk.dataloader.SST(file_id, min_frequent_vocab_times=10, max_sent_length=50, min_rare_vocab_times=0, tokenizer='space', pretrained=None)[source]

A dataloader for preprocessed SST dataset.

Parameters
  • file_id (str) – a str indicating the source of the SST dataset.

  • min_frequent_vocab_times (int) – A cut-off threshold of frequent tokens. All tokens that appear no less than min_frequent_vocab_times times in the training set will be marked as frequent words. Default: 10.

  • max_sent_length (int) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. Default: 50.

  • min_rare_vocab_times (int) – A cut-off threshold of rare tokens. All tokens that appear no less than min_rare_vocab_times times in the whole dataset (excluding frequent words) will be marked as rare words. Otherwise, they are unknown words, both in training and testing stages. Default: 0 (no unknown words).

Refer to SentenceClassification for attributes and methods.

References

[1] https://nlp.stanford.edu/sentiment/

[2] Socher R, Perelygin A, Wu J, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013.

Field

class cotk.dataloader.Field[source]

A base class of data field, which specifies the format of the dataset. See Field and building a dataloader of a customized task for usage.

Notice: A Field object may be shared between different fields, data sets, or dataloaders. Thus it only defines settings and does NOT store data.

classmethod get_all_subclasses() → Iterable[Any]

Return a generator of all subclasses.

classmethod load_class(class_name) → Any

Return a subclass of class_name, case insensitively.

Parameters

class_name (str) – target class name.

get_vocab() → Optional[cotk.dataloader.vocab.Vocab][source]

Get the Vocab object for the field. None if the field does not have a Vocab.

get_tokenizer() → Optional[cotk.dataloader.tokenizer.Tokenizer][source]

Get the Tokenizer object for the field. None if the field does not have a Tokenizer.

get_batch(name, data, indexes) → Dict[str, Any][source]

Invoked by LanguageProcessing.get_batch(), return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.

Parameters
  • name (str) – name of the field.

  • data (Any) – the data stored in dataloader.

  • indexes (List[int]) – the indexes of the data in this batch

Sentence

class cotk.dataloader.Sentence(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]

Bases: dataloader.Field

A field for sentences. This class is a virtual class and the base of SentenceDefault, SentenceGPT2, and SentenceBERT.

If any argument is not specified, the value will be first retrieved from FieldContext. If still None, default value will be used.

Parameters
  • tokenizer (Tokenizer, str, optional) – How to tokenize sentences. If str, see tokenizer for possible values. No default value; a KeyError will be raised if not specified.

  • vocab (Vocab, optional) – The vocabulary used for this field. Sharing this object between fields builds the vocabulary together. No default value; a KeyError will be raised if not specified.

  • vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example, DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test" means that the words from the "dev" set are used for test. Default: See the table for the default value.

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. If it's None or Sentence.INFINITE_LENGTH, sentences won't be shortened no matter how long they are. Default: None.

  • convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default: False.

Input Formats

This field reads one line (a sentence) per sample.

tokenize(sentence) → List[str][source]

Tokenize sentence.

  • Convert tokens to lower case if self.convert_to_lower_letter is True.

Parameters

sentence (str) – The sentence to be tokenized.

tokenize_sentences(sentences) → List[List[str]][source]

Tokenize sentences.

  • Convert tokens to lower case if self.convert_to_lower_letter is True.

Parameters

sentences (List[str]) – The list of sentences to be tokenized.

convert_tokens_to_ids(tokens, add_special=False, only_frequent_word=False) → List[int][source]

Convert list of tokens to list of ids.

Parameters
  • tokens (List[str]) – The tokens to be converted.

  • add_special (bool, optional) – If True, special tokens (e.g. go, eos) are added. Default: False.

  • only_frequent_word (bool, optional) – If True, rare vocabs will be replaced by unk_id. Default: False.

convert_ids_to_tokens(ids, remove_special=True, trim=True) → List[str][source]

Convert list of ids to list of tokens.

Parameters
  • ids (List[int]) – The ids to be converted.

  • remove_special (bool, optional) – If True, detect and try to do a reverse operation of add_special in convert_tokens_to_ids(). It will not remove unk or special tokens in the middle of sentences. Default: True.

  • trim (bool, optional) – If True, use trim_in_ids() to remove trailing pad and eos. Default: True.

convert_sentence_to_ids(sentence, add_special=False, only_frequent_word=False) → List[int][source]

Convert a sentence to a list of ids.

Parameters
  • sentence (str) – The sentence to be converted.

  • add_special (bool, optional) – If True, special tokens (e.g. go, eos) are added. Default: False.

  • only_frequent_word (bool, optional) – If True, rare vocabs will be replaced by unk_id. Default: False.

convert_ids_to_sentence(ids, remove_special=True, trim=True) → str[source]

Convert list of tokens to a sentence.

Parameters
  • ids (List[int]) – The ids to be converted.

  • remove_special (bool, optional) – If True, detect and try to do a reverse operation of add_special in convert_tokens_to_ids(). It will not remove unk or special tokens in the middle of sentences. Default: True.

  • trim (bool, optional) – If True, use trim_in_ids() to remove trailing pad and eos. Default: True.

add_special_to_ids(ids) → List[int][source]

Add special tokens, such as go_id or eos_id to the input ids.

Parameters

ids (List[int]) – The input ids.

remove_special_in_ids(ids, remove_special=True, trim=True) → List[int][source]

Remove special ids in input ids.

Parameters
  • ids (List[int]) – Input ids.

  • remove_special (bool, optional) – If True, detect and try to do a reverse operation of add_special in convert_tokens_to_ids(). It will not remove unk or special tokens in the middle of sentences. Default: True.

  • trim (bool, optional) – If True, use trim_in_ids() to remove trailing pad and eos. Default: True.

trim_in_ids(ids) → List[int][source]

Find the first special token indicating the sentence is over and remove all the tokens after it (included). Then remove all trailing pad.

Parameters

ids (List[int]) – The input ids.
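
For example, a minimal sketch (the ids and special-token positions below are assumed, not taken from a real vocabulary):

>>> # assume eos_id == 3 and pad_id == 0 (hypothetical ids)
>>> field.trim_in_ids([2, 4, 5, 3, 0, 0])
[2, 4, 5]  # everything from the first <eos> on is removed; no trailing <pad> remains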

process_sentences(sentences, add_special=True, only_frequent_word=False, cut=True) → List[List[int]][source]

Process input sentences.

  • If sentences haven’t been tokenized, tokenize them by invoking Sentence.tokenize_sentences().

  • Then, convert the list of tokens to a list of ids.

  • If self.max_sent_length is not None and cut is True, sentences whose length is more than self.max_sent_length are shortened to the first self.max_sent_length tokens.

Parameters
  • sentences (List[str], List[List[str]]) – sentences can be a list of sentences or a list of lists of tokens.

  • add_special (bool, optional) – If True, special tokens (e.g. go, eos) are added. Default: True.

  • only_frequent_word (bool, optional) – If True, rare vocabs will be replaced by unk_id. Default: False.

  • cut (bool, optional) – Whether to cut sentences with too many tokens. Default: True.
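
As a rough sketch (not from the original docs; the exact ids depend on the vocabulary that has been built), the conversion methods above can be chained like this:

>>> tokens = field.tokenize("Life is short .")          # e.g. ['Life', 'is', 'short', '.'] with a space tokenizer
>>> ids = field.convert_tokens_to_ids(tokens, add_special=True)
>>> field.convert_ids_to_sentence(ids)                  # back to a plain string
>>> field.process_sentences(["Life is short ."])        # tokenize + convert + cut in one call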

frequent_vocab_size

int – The number of frequent words. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

all_vocab_size

int – The number of frequent words and rare words. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

frequent_vocab_list

list – The list of frequent words. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

all_vocab_list

list – The list of frequent words and rare words. Frequent words are always in the front of the list. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

get_special_tokens_mapping() → MutableMapping[str, str]

Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following: pad, unk, go, eos, sep, cls, mask. The value can be arbitrary string, e.g., "<pad>", "<unk>". It calls the method with the identical name of the Vocab instance, from self.get_vocab().

get_special_tokens_id(name) → int

Get id of special token specifying the general name. Raise KeyError if no such token in this instance. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

Parameters

name (str) – the general name, must be one of the following, pad, unk, go, eos, sep, cls, mask.

pad_id

int – The id of pad token. Raise KeyError if no pad token in this instance. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

unk_id

int – The id of unk token. Raise KeyError if no unk token in this instance. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

go_id

int – The id of go token. Raise KeyError if no go token in this instance. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

eos_id

int – The id of eos token. Raise KeyError if no eos token in this instance. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

SentenceDefault

class cotk.dataloader.SentenceDefault(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]

Bases: dataloader.Sentence, dataloader.Field

A common use field for sentence.

If any argument is not specified, the value will be first retrieved from FieldContext. If still None, default value will be used.

Parameters
  • tokenizer (Tokenizer, str, optional) – How to tokenize sentences. If str, see Tokenizer for possible values. No default value; a KeyError will be raised.

  • vocab (Vocab, optional) – The vocabulary used for this field. Sharing this object between fields builds the vocabulary together. No default value; a KeyError will be raised.

  • vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example, DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test" means that the words from the “dev” set are used for test. Default: See the table for default values.

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. If it’s None or Sentence.INFINITE_LENGTH, sentences won’t be shortened no matter how long they are. Default: None.

  • convert_to_lower_letter (bool, optional) – Whether to convert all tokens to lower case after tokenization. Default: False.

Input Formats

This field reads one line (a sentence) per sample.

get_batch(name, data, indexes) → Dict[str, Any][source]

Invoked by LanguageProcessing.get_batch(), return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.

The function will return a dict, containing:

  • FIELDNAME (np.ndarray[batch_size, max_sent_length_in_batch]): Padded sentences in id formats. It only contains frequent vocabs, and rare words are replaced by unk_id.

  • FIELDNAME_allvocabs (np.ndarray[batch_size, max_sent_length_in_batch]): Padded sentences in id formats. It contains frequent vocabs and rare vocabs.

  • FIELDNAME_length (np.ndarray[batch_size]): The length of sentences.

  • FIELDNAME_str (List[str]): The raw sentences.

where

  • FIELDNAME is the name of the field.

  • batch_size is len(indexes).

  • max_sent_length_in_batch is the maximum length of sentences in the batch.

Parameters
  • name (str) – name of the field.

  • data (Any) – the data stored in dataloader.

  • indexes (List[int]) – the indexes of the data in this batch

Examples

>>> #   all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "Life", "is", "short", ".",
>>> #       "PHP", "the", "best", "language", "in", "world"]
>>> #   frequent_vocab_size = 11
>>> #   frequent_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "Life", "is", "short", ".",
>>> #       "PHP", "the", "best"]
>>> field.get_batch('sent', data, [0, 1])
{
    "sent": numpy.array([
        [2, 4, 5, 6, 7, 3, 0, 0, 0, 0, 0],    # <go> Life is short . <eos> <pad> <pad> <pad> <pad> <pad>
        [2, 8, 5, 9, 10, 1, 1, 9, 1, 7, 3],  # <go> PHP is the best <unk> <unk> the <unk> . <eos>
    ]),
    "sent_length": numpy.array([6, 11]), # length of sentences
    "sent_allvocabs": numpy.array([
        [2, 4, 5, 6, 7, 3, 0, 0, 0, 0, 0],    # <go> Life is short . <eos> <pad> <pad> <pad> <pad> <pad>
        [2, 8, 5, 9, 10, 11, 12, 9, 13, 7, 3],  # <go> PHP is the best language in the world . <eos>
    ]),
    "sent_str": [
        "Life is short.",
        "PHP is the best language in the world.",
    ],
}

SentenceGPT2

class cotk.dataloader.SentenceGPT2(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]

Bases: dataloader.Sentence, dataloader.Field

A field for sentence in the format of GPT2.

If any argument is not specified, the value will be first retrieved from FieldContext. If still None, default value will be used.

Parameters
  • tokenizer (Tokenizer, str, optional) – How to tokenize sentences. If str, see Tokenizer for possible values. No default value; a KeyError will be raised.

  • vocab (Vocab, optional) – The vocabulary used for this field. Sharing this object between fields builds the vocabulary together. No default value; a KeyError will be raised.

  • vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example, DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test" means that the words from the “dev” set are used for test. Default: See the table for default values.

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. If it’s None or Sentence.INFINITE_LENGTH, sentences won’t be shortened no matter how long they are. Default: None.

  • convert_to_lower_letter (bool, optional) – Whether to convert all tokens to lower case after tokenization. Default: False.

Input Formats

This field reads one line (a sentence) per sample.

get_batch(name, data, indexes) → Dict[str, Any][source]

Invoked by LanguageProcessing.get_batch(), return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.

The function will return a dict, containing:

  • FIELDNAME (np.ndarray[batch_size, max_sent_length_in_batch]): Padded sentences in id formats. It only contains frequent vocabs, and rare words are replaced by unk_id.

  • FIELDNAME_allvocabs (np.ndarray[batch_size, max_sent_length_in_batch]): Padded sentences in id formats. It contains frequent vocabs and rare vocabs.

  • FIELDNAME_length (np.ndarray[batch_size]): The length of sentences.

  • FIELDNAME_str (List[str]): The raw sentences.

where

  • FIELDNAME is the name of the field.

  • batch_size is len(indexes).

  • max_sent_length_in_batch is the maximum length of sentences in the batch.

Parameters
  • name (str) – name of the field.

  • data (Any) – the data stored in dataloader.

  • indexes (List[int]) – the indexes of the data in this batch

Examples

>>> # This example is based on GPT2Tokenizer. The vocab files are in ./tests/dummy_gpt2vocab.
>>> # field.eos_id = 413 # <|endoftext|>, also used for <pad>, <unk>, <go>
>>> field.get_batch('sent', data, [0, 2])
{
    "sent": numpy.array([
        [413, 6, 134, 321, 407, 107, 157, 121, 372, 201, 402, 105, 413, 413, 413, 413],
            # ['<|endoftext|>', 'A', 'Ġbicycle', 'Ġreplica', 'Ġwith', 'Ġa', 'Ġclock', 'Ġas', 'Ġthe',
            #   'Ġfront', 'Ġwheel', 'Ġ.', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>']
        [413, 6, 149, 370, 330, 384, 126, 298, 236, 130, 107, 255, 298, 149, 105, 413],
            # ['<|endoftext|>', 'A', 'Ġcar', 'Ġthat', 'Ġseems', 'Ġto', 'Ġbe', 'Ġparked', 'Ġillegally',
            #   'Ġbehind', 'Ġa', 'Ġlegally', 'Ġparked', 'Ġcar', 'Ġ.', '<|endoftext|>']
    ]),
    "sent_length": numpy.array([13, 16]), # length of sentences
    "sent_allvocabs": numpy.array([
        [413, 6, 134, 321, 407, 107, 157, 121, 372, 201, 402, 105, 413, 413, 413, 413],
            # ['<|endoftext|>', 'A', 'Ġbicycle', 'Ġreplica', 'Ġwith', 'Ġa', 'Ġclock', 'Ġas', 'Ġthe',
            #   'Ġfront', 'Ġwheel', 'Ġ.', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>']
        [413, 6, 149, 370, 330, 384, 126, 298, 236, 130, 107, 255, 298, 149, 105, 413],
            # ['<|endoftext|>', 'A', 'Ġcar', 'Ġthat', 'Ġseems', 'Ġto', 'Ġbe', 'Ġparked', 'Ġillegally',
            #   'Ġbehind', 'Ġa', 'Ġlegally', 'Ġparked', 'Ġcar', 'Ġ.', '<|endoftext|>']
    ]),
    "sent_str": [
        "A bicycle replica with a clock as the front wheel .",
        "A car that seems to be parked illegally behind a legally parked car .",
    ],
}

SentenceBERT

class cotk.dataloader.SentenceBERT(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]

Bases: dataloader.Sentence, dataloader.Field

A field for sentence in the format of BERT.

If any argument is not specified, the value will be first retrieved from FieldContext. If still None, default value will be used.

Parameters
  • tokenizer (Tokenizer, str, optional) – How to tokenize sentences. If str, see Tokenizer for possible values. No default value; a KeyError will be raised.

  • vocab (Vocab, optional) – The vocabulary used for this field. Sharing this object between fields builds the vocabulary together. No default value; a KeyError will be raised.

  • vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example, DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test" means that the words from the “dev” set are used for test. Default: See the table for default values.

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. If it’s None or Sentence.INFINITE_LENGTH, sentences won’t be shortened no matter how long they are. Default: None.

  • convert_to_lower_letter (bool, optional) – Whether to convert all tokens to lower case after tokenization. Default: False.

Input Formats

This field reads one line (a sentence) per sample.

get_batch(name, data, indexes) → Dict[str, Any][source]

Invoked by LanguageProcessing.get_batch(), return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.

The function will return a dict, containing:

  • FIELDNAME (np.ndarray[batch_size, max_sent_length_in_batch]): Padded sentences in id formats. It only contains frequent vocabs, and rare words are replaced by unk_id.

  • FIELDNAME_allvocabs (np.ndarray[batch_size, max_sent_length_in_batch]): Padded sentences in id formats. It contains frequent vocabs and rare vocabs.

  • FIELDNAME_length (np.ndarray[batch_size]): The length of sentences.

  • FIELDNAME_str (List[str]): The raw sentences.

where

  • FIELDNAME is the name of the field.

  • batch_size is len(indexes).

  • max_sent_length_in_batch is the maximum length of sentences in the batch.

Parameters
  • name (str) – name of the field.

  • data (Any) – the data stored in dataloader.

  • indexes (List[int]) – the indexes of the data in this batch

Examples

>>> # This example is based on BertTokenizer. The vocab files are in ./tests/dummy_bertvocab.
>>> field.get_batch('sent', data, [0, 1])
{
    "sent": numpy.array([
        [101, 147,  37,  29, 359, 102,   0,   0,   0,   0,   0,   0,   0],
            # ['<cls>', 'How', 'are', 'you', '?', '<sep>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
        [101, 375, 334, 379, 127, 341, 350,  29, 328,   9,  29, 359, 102]
            # ['<cls>', 'i', ''', 'm', 'fine', '.',  'thank', 'you', '!', 'and', 'you', '?', '<sep>']
    ]),
    "sent_length": numpy.array([6, 13]), # length of sentences,
    "sent_allvocabs": numpy.array([
        [101, 147,  37,  29, 359, 102,   0,   0,   0,   0,   0,   0,   0],
            # ['<cls>', 'how', 'are', 'you', '?', '<sep>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
        [101, 375, 334, 379, 127, 341, 350,  29, 328,   9,  29, 359, 102]
            # ['<cls>', 'i', ''', 'm', 'fine', '.',  'thank', 'you', '!', 'and', 'you', '?', '<sep>']
    ]),
    "sent_str": [
        "How are you?",
        "I'm fine. Thank you! And you?"
    ],
}

Session

class cotk.dataloader.Session(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]

Bases: dataloader.Field

A field for session. Each session is a list of sentences.

If any argument is not specified, the value will be first retrieved from FieldContext. If still None, default value will be used.

Parameters
  • tokenizer (Tokenizer, str, optional) – How to tokenize sentences. If str, see Tokenizer for possible values. No default value; a KeyError will be raised.

  • vocab (Vocab, optional) – The vocabulary used for this field. Sharing this object between fields builds the vocabulary together. No default value; a KeyError will be raised.

  • vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example, DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test" means that the words from the “dev” set are used for test. Default: See the table for default values.

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. If it’s None or Sentence.INFINITE_LENGTH, sentences won’t be shortened no matter how long they are. Default: None.

  • convert_to_lower_letter (bool, optional) – Whether to convert all tokens to lower case after tokenization. Default: False.

  • max_turn_length (int, _InfiniteLength, optional) – Set the maximum turn length of a session. If it’s an integer, any session whose turn length is more than max_turn_length is shortened to the first max_turn_length turns; the remaining turns are ignored. If it’s None or Sentence.INFINITE_LENGTH, sessions won’t be shortened and all turns are kept. Default: None.

Input Format

This field reads multiple lines of sentences per sample, until a blank line.

tokenize(sentence) → List[str]

Tokenize sentence.

  • Convert tokens to lower case if self.convert_to_lower_letter is True.

Parameters

sentence (str) – The sentence to be tokenized.

tokenize_sentences(sentences) → List[List[str]]

Tokenize sentences.

  • Convert tokens to lower case if self.convert_to_lower_letter is True.

Parameters

sentences (List[str]) – The list of sentences to be tokenized.

tokenize_sessions(sessions) → List[List[List[str]]][source]

Tokenize sessions.

  • Convert the tokens to lower case if self.convert_to_lower_letter is True.

Parameters

sessions (List[List[str]]) – The list of sessions to be tokenized.

convert_tokens_to_ids(tokens, add_special=False, only_frequent_word=False) → List[int]

Convert list of tokens to list of ids.

Parameters
  • tokens (List[str]) – The tokens to be converted.

  • add_special (bool, optional) – If True, special tokens (e.g. go, eos) are added. Default: False.

  • only_frequent_word (bool, optional) – If True, rare vocabs will be replaced by unk_id. Default: False.

convert_ids_to_tokens(ids, remove_special=True, trim=True) → List[str]

Convert list of ids to list of tokens.

Parameters
  • ids (List[int]) – The ids to be converted.

  • remove_special (bool, optional) – If True, detect and try to do a reverse operation of add_special in convert_tokens_to_ids(). It will not remove unk or special tokens in the middle of sentences. Default: True.

  • trim (bool, optional) – If True, use trim_in_ids() to remove trailing pad and eos. Default: True.

convert_sentence_to_ids(sentence, add_special=False, only_frequent_word=False) → List[int]

Convert a sentence to a list of ids.

Parameters
  • sentence (str) – The sentence to be converted.

  • add_special (bool, optional) – If True, special tokens (e.g. go, eos) are added. Default: False.

  • only_frequent_word (bool, optional) – If True, rare vocabs will be replaced by unk_id. Default: False.

convert_ids_to_sentence(ids, remove_special=True, trim=True) → str

Convert a list of ids to a sentence.

Parameters
  • ids (List[int]) – The ids to be converted.

  • remove_special (bool, optional) – If True, detect and try to do a reverse operation of add_special in convert_tokens_to_ids(). It will not remove unk or special tokens in the middle of sentences. Default: True.

  • trim (bool, optional) – If True, use trim_in_ids() to remove trailing pad and eos. Default: True.

convert_multi_turn_tokens_to_ids(session, add_special=False, only_frequent_word=False) → List[List[int]][source]

Convert list of tokenized sentences to list of sentence ids.

Parameters
  • session (List[List[str]]) – The tokenized sentences to be converted.

  • add_special (bool, optional) – If True, special tokens (e.g. go, eos) are added. Default: False.

  • only_frequent_word (bool, optional) – If True, rare vocabs will be replaced by unk_id. Default: False.

convert_multi_turn_ids_to_tokens(session_ids, remove_special=True, trim=True)[source]

Convert list of sentence ids to list of sentences.

Parameters
  • session_ids (List[List[int]]) – The sentence ids to be converted.

  • remove_special (bool, optional) – If True, detect and try to do a reverse operation of add_special in convert_tokens_to_ids(). It will not remove unk or special tokens in the middle of sentences. Default: True.

  • trim (bool, optional) – If True, use trim_in_ids() to remove trailing pad and eos. Default: True.

add_special_to_ids(ids) → List[int]

Add special tokens, such as go_id or eos_id to the input ids.

Parameters

ids (List[int]) – The input ids.

remove_special_in_ids(ids, remove_special=True, trim=True) → List[int]

Remove special ids in input ids.

Parameters
  • ids (List[int]) – Input ids.

  • remove_special (bool, optional) – If True, detect and try to do a reverse operation of add_special in convert_tokens_to_ids(). It will not remove unk or special tokens in the middle of sentences. Default: True.

  • trim (bool, optional) – If True, use trim_in_ids() to remove trailing pad and eos. Default: True.

trim_in_ids(ids) → List[int]

Find the first special token indicating the sentence is over and remove all the tokens after it (included). Then remove all trailing pad.

Parameters

ids (List[int]) – The input ids.

multi_turn_trim_in_ids(session_ids) → List[List[int]][source]

For each sentence ids in session, find the first special token indicating the sentence is over and remove all the tokens after it (included). Then remove all trailing pad.

Parameters

session_ids (List[List[int]]) – The input ids of session.

process_sentences(sentences, add_special=True, only_frequent_word=False, cut=True) → List[List[int]]

Process input sentences.

  • If sentences haven’t been tokenized, tokenize them by invoking Sentence.tokenize_sentences().

  • Then, convert the list of tokens to a list of ids.

  • If self.max_sent_length is not None and cut is True, sentences whose length is more than self.max_sent_length are shortened to the first self.max_sent_length tokens.

Parameters
  • sentences (List[str], List[List[str]]) – sentences can be a list of sentences or a list of lists of tokens.

  • add_special (bool, optional) – If True, special tokens (e.g. go, eos) are added. Default: True.

  • only_frequent_word (bool, optional) – If True, rare vocabs will be replaced by unk_id. Default: False.

  • cut (bool, optional) – Whether to cut sentences with too many tokens. Default: True.

process_sessions(sessions, add_special=True, only_frequent_word=False, cut=True)[source]

Process input sessions.

  • If self.max_turn_length is not None and cut is True, sessions whose turn length is more than self.max_turn_length are shortened to the first self.max_turn_length sentences.

  • If sessions haven’t been tokenized, tokenize them by invoking self.tokenize_sessions().

  • Then, convert the list of tokens to a list of ids.

  • If self.max_sent_length is not None and cut is True, sentences whose length is more than self.max_sent_length are shortened to the first self.max_sent_length tokens.

Parameters
  • sessions (List[List[str]], List[List[List[str]]]) – each sentence in a session can be a str or a list of tokens.

  • add_special (bool, optional) – If True, special tokens (e.g. go, eos) are added. Default: True.

  • only_frequent_word (bool, optional) – If True, rare vocabs will be replaced by unk_id. Default: False.

  • cut (bool, optional) – Whether to cut sessions/sentences with too many sentences/tokens. Default: True.
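
A rough sketch of how process_sessions() could be called (not from the original docs; the returned ids depend on the built vocabulary):

>>> sessions = [["How are you?", "I am fine."], ["What's up?"]]
>>> field.process_sessions(sessions)   # a nested list of ids, one List[int] per sentence of each session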

frequent_vocab_size

int – The number of frequent words. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

all_vocab_size

int – The number of frequent words and rare words. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

frequent_vocab_list

list – The list of frequent words. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

all_vocab_list

list – The list of frequent words and rare words. Frequent words are always in the front of the list. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

get_special_tokens_mapping() → MutableMapping[str, str]

Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following: pad, unk, go, eos, sep, cls, mask. The value can be arbitrary string, e.g., "<pad>", "<unk>". It calls the method with the identical name of the Vocab instance, from self.get_vocab().

get_special_tokens_id(name) → int

Get id of special token specifying the general name. Raise KeyError if no such token in this instance. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

Parameters

name (str) – the general name, must be one of the following, pad, unk, go, eos, sep, cls, mask.

pad_id

int – The id of pad token. Raise KeyError if no pad token in this instance. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

unk_id

int – The id of unk token. Raise KeyError if no unk token in this instance. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

go_id

int – The id of go token. Raise KeyError if no go token in this instance. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

eos_id

int – The id of eos token. Raise KeyError if no eos token in this instance. It calls the method with the identical name of the Vocab instance, from self.get_vocab().

SessionDefault

class cotk.dataloader.SessionDefault(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]

Bases: dataloader.Session, dataloader.Field

A common use field for sessions.

If any argument is not specified, the value will be first retrieved from FieldContext. If still None, default value will be used.

Parameters
  • tokenizer (Tokenizer, str, optional) – How to tokenize sentences. If str, see Tokenizer for possible values. No default value; a KeyError will be raised.

  • vocab (Vocab, optional) – The vocabulary used for this field. Sharing this object between fields builds the vocabulary together. No default value; a KeyError will be raised.

  • vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example, DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test" means that the words from the “dev” set are used for test. Default: See the table for default values.

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. If it’s None or Sentence.INFINITE_LENGTH, sentences won’t be shortened no matter how long they are. Default: None.

  • convert_to_lower_letter (bool, optional) – Whether to convert all tokens to lower case after tokenization. Default: False.

  • max_turn_length (int, _InfiniteLength, optional) – Set the maximum turn length of a session. If it’s an integer, any session whose turn length is more than max_turn_length is shortened to the first max_turn_length turns; the remaining turns are ignored. If it’s None or Sentence.INFINITE_LENGTH, sessions won’t be shortened and all turns are kept. Default: None.

Input Format

This field reads multiple lines of sentences per sample, until a blank line.

get_batch(name, data, indexes) → Dict[str, Any][source]

Invoked by LanguageProcessing.get_batch(), return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.

The function will return a dict, containing:

  • FIELDNAME (np.ndarray[batch_size, max_turn_length_in_batch, max_sent_length_in_batch]): Padded sessions in id formats. It only contains frequent vocabs, and rare words are replaced by unk_id.

  • FIELDNAME_allvocabs (np.ndarray[batch_size, max_turn_length_in_batch, max_sent_length_in_batch]): Padded sessions in id formats. It contains frequent vocabs and rare vocabs.

  • FIELDNAME_turn_length (np.ndarray[batch_size]): The number of turns in each session.

  • FIELDNAME_sent_length (List[List[int]]): The length of each sentence in each session.

  • FIELDNAME_str (List[List[str]]): The raw sessions.

where

  • FIELDNAME is the name of the field.

  • batch_size is len(indexes).

  • max_turn_length_in_batch is the maximum turn number of sessions in the batch.

  • max_sent_length_in_batch is the maximum length of sentences in the batch.

Parameters
  • name (str) – name of the field.

  • data (Any) – the data stored in dataloader.

  • indexes (List[int]) – the indexes of the data in this batch

Examples

>>> #   dataset = iter(['How are you?\n', "I'm fine. And you?\n", "I'm fine, too.\n", "\n",
>>> #       "How to install cotk?\n", "pip install cotk.\n", "\n"])
>>> #   min_frequent_vocab_times = 2
>>> #   all_vocab_list = ['<pad>', '<unk>', '<go>', '<eos>', '.', '?', "'", 'How', 'I',
>>> #       'cotk', 'fine', 'install', 'm', 'you', ',', 'And', 'are', 'pip', 'to', 'too']
>>> #   frequent_vocab_size = 14
>>> #   frequent_vocab_list = ['<pad>', '<unk>', '<go>', '<eos>', '.', '?', "'", 'How', 'I',
>>> #       'cotk', 'fine', 'install', 'm', 'you']
>>> #   data = {
>>> #       'id': [
>>> #           [
>>> #               [2, 7, 16, 13, 5, 3],
>>> #               [2, 8, 6, 12, 10, 4, 15, 13, 5, 3],
>>> #               [2, 8, 6, 12, 10, 14, 19, 4, 3],
>>> #           ],
>>> #           [
>>> #               [2, 7, 18, 11, 9, 5, 3],
>>> #               [2, 17, 11, 9, 4, 3],
>>> #           ]
>>> #       ],
>>> #       'str': [
>>> #           [
>>> #               'How are you?',
>>> #               "I'm fine. And you?",
>>> #               "I'm fine, too."
>>> #           ],
>>> #           [
>>> #               'How to install cotk?',
>>> #               'pip install cotk.'
>>> #           ]
>>> #
>>> #   }
>>> field.get_batch('session', data, [0, 1])
{
    'session_turn_length': numpy.array([3, 2]),
    'session_sent_length': [
        [6, 10, 9],
        [7, 6]
    ],
    'session': numpy.array([
        [
            [ 2,  7,  1, 13,  5,  3,  0,  0,  0,  0], # <go> How <unk> you? <eos> <pad> <pad> <pad> <pad>
            [ 2,  8,  6, 12, 10,  4,  1, 13,  5,  3], # <go> I'm fine. <unk> you? <eos>
            [ 2,  8,  6, 12, 10,  1,  1,  4,  3,  0]  # <go> I'm fine <unk> <unk>. <eos> <pad>
        ],
        [
            [ 2,  7,  1, 11,  9,  5,  3,  0,  0,  0], # <go> How <unk> install cotk? <eos> <pad> <pad> <pad>
            [ 2,  1, 11,  9,  4,  3,  0,  0,  0,  0], # <go> <unk> install cotk. <eos> <pad> <pad> <pad> <pad>
            [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0]  # all <pad>
        ]
    ]),
    'session_allvocabs': numpy.array([
        [
            [ 2,  7, 16, 13,  5,  3,  0,  0,  0,  0], # <go> How are you? <eos> <pad> <pad> <pad> <pad>
            [ 2,  8,  6, 12, 10,  4, 15, 13,  5,  3], # <go> I'm fine. And you? <eos>
            [ 2,  8,  6, 12, 10, 14, 19,  4,  3,  0]  # <go> I'm fine, too. <eos> <pad>
        ],
        [
            [ 2,  7, 18, 11,  9,  5,  3,  0,  0,  0], # <go> How to install cotk? <eos> <pad> <pad> <pad>
            [ 2, 17, 11,  9,  4,  3,  0,  0,  0,  0], # <go> pip install cotk. <eos> <pad> <pad> <pad> <pad>
            [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0]  # all <pad>
        ]
    ]),
    'session_str': [
        [
            'How are you?',
            "I'm fine. And you?",
            "I'm fine, too."
        ],
        [
            'How to install cotk?',
            'pip install cotk.'
        ]
    ]
}

SessionGPT2

class cotk.dataloader.SessionGPT2(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]

Bases: dataloader.Session, dataloader.Field

A field for session in the format of GPT2.

If any argument is not specified, the value will be first retrieved from FieldContext. If still None, default value will be used.

Parameters
  • tokenizer (Tokenizer, str, optional) – How to tokenize sentences. If str, see Tokenizer for possible values. No default value; a KeyError will be raised.

  • vocab (Vocab, optional) – The vocabulary used for this field. Sharing this object between fields builds the vocabulary together. No default value; a KeyError will be raised.

  • vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example, DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test" means that the words from the “dev” set are used for test. Default: See the table for default values.

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. If it’s None or Sentence.INFINITE_LENGTH, sentences won’t be shortened no matter how long they are. Default: None.

  • convert_to_lower_letter (bool, optional) – Whether to convert all tokens to lower case after tokenization. Default: False.

  • max_turn_length (int, _InfiniteLength, optional) – Set the maximum turn length of a session. If it’s an integer, any session whose turn length is more than max_turn_length is shortened to the first max_turn_length turns; the remaining turns are ignored. If it’s None or Sentence.INFINITE_LENGTH, sessions won’t be shortened and all turns are kept. Default: None.

Input Format

This field reads multiple lines of sentences per sample, until a blank line.

get_batch(name, data, indexes) → Dict[str, Any][source]

Invoked by LanguageProcessing.get_batch(), return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.

Parameters
  • name (str) – name of the field.

  • data (Any) – the data stored in dataloader.

  • indexes (List[int]) – the indexes of the data in this batch

Note: only the structure of the return value of get_batch() is shown below; the real value of each entry depends on the loaded vocab.

Examples

>>> from transformers.tokenization_gpt2 import GPT2Tokenizer
>>> from cotk.dataloader.tokenizer import PretrainedTokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
>>> field = SessionGPT2(PretrainedTokenizer(tokenizer))
>>> field_content = field._create('train')
>>> dataset = iter(['How are you?\n', "I'm fine. Thank you! And you?\n", "I'm fine, too.\n", "\n", "How to install CoTk?\n", "pip install cotk.\n", "\n"])
>>> while True:
...     try:
...         field_content.read_next(dataset)
...     except StopIteration:
...         break
>>> field_content.process_before_vocab()
>>> field.vocab.build_vocab()
>>> data = field_content.get_data()
>>> data
{'id': [[[2, 8, 18, 6, 5, 3],
        [2, 9, 7, 12, 10, 4, 17, 6, 13, 15, 6, 5, 3],
        [2, 9, 7, 12, 10, 14, 22, 4, 3]],
       [[2, 8, 21, 11, 16, 5, 3], [2, 20, 11, 19, 4, 3]]],
  'str': [['How are you?', "I'm fine. Thank you! And you?", "I'm fine, too."],
      ['How to install CoTk?', 'pip install cotk.']]}
>>> batch_data = field.get_batch('session', data, [1])
>>> batch_data
{'session_turn_length': array([2]),
  'session_sent_length': [[7, 6]],
  'session': array([[[ 2,  8, 21, 11, 16,  5,  3],
                 [ 2, 20, 11, 19,  4,  3,  0]]]),
  'session_allvocabs': array([[[ 2,  8, 21, 11, 16,  5,  3],
                 [ 2, 20, 11, 19,  4,  3,  0]]]),
  'session_str': [['How to install CoTk?', 'pip install cotk.']]}
>>> # 'session_turn_length' (`name` + '_turn_length') is a :class:`np.ndarray` object with shape == (batch size, ). Each element is the length of the corresponding session.
>>> # 'session_sent_length' (`name` + '_sent_length') is List[List[int]]. Each integer is the length of the corresponding sentence.
>>> # 'session' (`name`) is a :class:`np.ndarray` object with shape == (batch size, max turn length, max sentence length).
>>>             # batch_data['session'][i, j] is a sentence. batch_data['session'][i, j, k] is an id.
>>>             # If `self.max_turn_length` is not None and j >= `self.max_turn_length` or `self.max_sent_length` is not None and k >= `self.max_sent_length`,
>>>             # batch_data['session'][i, j, k] is `self.eos_id`.
>>> # 'session_allvocabs' (`name` + '_allvocabs') is the same as 'session'.

SessionBERT

class cotk.dataloader.SessionBERT(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]

Bases: dataloader.Session, dataloader.Field

A field for session in the format of BERT.

If any argument is not specified, the value will be first retrieved from FieldContext. If still None, default value will be used.

Parameters
  • tokenizer (Tokenizer, str, optional) – How to tokenize sentences. If str, see Tokenizer for possible values. No default value; a KeyError will be raised.

  • vocab (Vocab, optional) – The vocabulary used for this field. Sharing this object between fields builds the vocabulary together. No default value; a KeyError will be raised.

  • vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example, DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test" means that the words from the “dev” set are used for test. Default: See the table for default values.

  • max_sent_length (int, _InfiniteLength, optional) – All sentences longer than max_sent_length will be shortened to the first max_sent_length tokens. If it’s None or Sentence.INFINITE_LENGTH, sentences won’t be shortened no matter how long they are. Default: None.

  • convert_to_lower_letter (bool, optional) – Whether to convert all tokens to lower case after tokenization. Default: False.

  • max_turn_length (int, _InfiniteLength, optional) – Set the maximum turn length of a session. If it’s an integer, any session whose turn length is more than max_turn_length is shortened to the first max_turn_length turns; the remaining turns are ignored. If it’s None or Sentence.INFINITE_LENGTH, sessions won’t be shortened and all turns are kept. Default: None.

Input Format

This field reads multiple lines of sentences per sample, until a blank line.

get_batch(name, data, indexes) → Dict[str, Any][source]

Invoked by LanguageProcessing.get_batch(), return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.

Parameters
  • name (str) – name of the field.

  • data (Any) – the data stored in dataloader.

  • indexes (List[int]) – the indexes of the data in this batch

Note: only the structure of the return value of get_batch() is shown below; the real value of each entry depends on the loaded vocab.

Examples

>>> from transformers.tokenization_bert import BertTokenizer
>>> from cotk.dataloader.tokenizer import PretrainedTokenizer
>>> tokenizer = BertTokenizer.from_pretrained('bert')
>>> field = SessionBERT(PretrainedTokenizer(tokenizer))
>>> field_content = field._create('train')
>>> dataset = iter(['How are you?\n', "I'm fine. Thank you! And you?\n", "I'm fine, too.\n", "\n", "How to install CoTk?\n", "pip install cotk.\n", "\n"])
>>> while True:
...     try:
...         field_content.read_next(dataset)
...     except StopIteration:
...         break
>>> field_content.process_before_vocab()
>>> field.vocab.build_vocab()
>>> data = field_content.get_data()
>>> data
{'id': [[[2, 8, 18, 6, 5, 3],
        [2, 9, 7, 12, 10, 4, 17, 6, 13, 15, 6, 5, 3],
        [2, 9, 7, 12, 10, 14, 22, 4, 3]],
       [[2, 8, 21, 11, 16, 5, 3], [2, 20, 11, 19, 4, 3]]],
  'str': [['How are you?', "I'm fine. Thank you! And you?", "I'm fine, too."],
      ['How to install CoTk?', 'pip install cotk.']]}
>>> batch_data = field.get_batch('session', data, [1])
>>> batch_data
{'session_turn_length': array([2]),
  'session_sent_length': [[7, 6]],
  'session': array([[[ 2,  8, 21, 11, 16,  5,  3],
                 [ 2, 20, 11, 19,  4,  3,  0]]]),
  'session_allvocabs': array([[[ 2,  8, 21, 11, 16,  5,  3],
                 [ 2, 20, 11, 19,  4,  3,  0]]]),
  'session_str': [['How to install CoTk?', 'pip install cotk.']]}
>>> # 'session_turn_length' (`name` + '_turn_length') is a :class:`np.ndarray` object with shape == (batch size, ). Each element is the length of the corresponding session.
>>> # 'session_sent_length' (`name` + '_sent_length') is List[List[int]]. Each integer is the length of the corresponding sentence.
>>> # 'session' (`name`) is a :class:`np.ndarray` object with shape == (batch size, max turn length, max sentence length).
>>>             # batch_data['session'][i, j] is a sentence. batch_data['session'][i, j, k] is an id.
>>>             # If `self.max_turn_length` is not None and j >= `self.max_turn_length` or `self.max_sent_length` is not None and k >= `self.max_sent_length`,
>>>             # batch_data['session'][i, j, k] is `self.pad_id`.
>>> # 'session_allvocabs' (`name` + '_allvocabs') is the same as 'session'.

DenseLabel

class cotk.dataloader.DenseLabel[source]

Bases: dataloader.Field

A field of categorical labels whose values are integers ranging from 0 to label_types - 1.

See dataloader.SparseLabel for labels in str or sparse integers.

Parameters

This class does not contain arguments for initialization.

Input Format

This field reads one line per sample. The line must be an integer.

get_batch(name, data, indexes) → Dict[str, Any][source]

Invoked by LanguageProcessing.get_batch(), return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.

The function will return a dict, containing:

  • FIELDNAME (np.ndarray[batch_size]): Labels of corresponding batched data.

where

  • FIELDNAME is the name of the field.

Parameters
  • name (str) – name of the field.

  • data (Any) – the data stored in dataloader.

  • indexes (List[int]) – the indexes of the data in this batch

Examples

>>> #   data = {'label': [1, 0]}
>>> field.get_batch('label', data, [0, 1])
{
    'label': numpy.array([1, 0])
}

SparseLabel

class cotk.dataloader.SparseLabel(vocab=None)[source]

Bases: dataloader.Field

A field of categorical labels whose values are strings or sparse integers.

See dataloader.DenseLabel for labels in dense integers.

If any argument is not specified, the value will be first retrieved from FieldContext. If still None, default value will be used.

Parameters

vocab (SimpleVocab, optional) – The vocab to store all the labels. If None, a SimpleVocab is automatically created.

Input Format

This field reads one line per sample. The line can be an arbitrary string.

get_batch(name, data, indexes) → Dict[str, Any][source]

Invoked by LanguageProcessing.get_batch(), return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.

The function will return a dict, containing:

  • FIELDNAME_id (np.ndarray[batch_size]): Ids of corresponding labels.

  • FIELDNAME_str (List[str]): Raw labels of the batched data.

where

  • FIELDNAME is the name of the field.

Parameters
  • name (str) – name of the field.

  • data (Dict[str, Any]) – the object returned by _SparseLabelContent.get_data(). data['str'] is the raw labels; data['id'] is the ids of labels.

  • indexes (List[int]) – the indexes of the data in this batch

Examples

>>> #   data = {
>>> #       'id': [0, 2, 1, 0],
>>> #       'str': ['Java', 'Python', 'Cpp', 'Java']
>>> #   }
>>> field.get_batch('label', data, [0, 1])
{
    'label_id': numpy.array([0, 2]),  # Ids of corresponding labels.
    'label_str': ['Java', 'Python']   # Raw labels.
}

Tokenizer

class cotk.dataloader.Tokenizer[source]

Tokenizer is used for splitting sentences into tokens. This is an abstract base class. It often works as a part of Field.

tokenize(sentence) → List[str][source]

Tokenize a sentence to a list of tokens.

Parameters

sentence (str) – a sentence to tokenize.

tokenize_sentences(sentences) → List[List[str]][source]

Tokenize a list of sentences to a list of lists of tokens.

Parameters

sentences (List[str]) – sentences to tokenize.

tokenize_sessions(sessions) → List[List[List[str]]][source]

Tokenize sessions to a 3-d list of tokens.

Parameters

sessions (List[List[str]]) – sessions to tokenize.

convert_tokens_to_sentence(tokens) → str[source]

Convert tokens to a sentence. It usually works like the reverse operation of tokenize(), but this is not guaranteed. It may behave like " ".join(tokens), but some special conditions and tokens will be taken care of.

Parameters

tokens (List[str]) – tokenized sentence

get_setting_hash() → str[source]

Return the setting hash of this tokenizer instance. See here for the explanation of setting hash.

SimpleTokenizer

class cotk.dataloader.SimpleTokenizer(method, special_tokens=None)[source]

Bases: dataloader.Tokenizer

A simple tokenizer. method can either be nltk or space. If nltk, use WordPunctTokenizer from nltk.tokenize. If space, use str.split(" ").

Parameters
  • method (str) – the tokenization method, nltk or space.

  • special_tokens (List[str]) – special tokens not to tokenize, such as <go>.
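
A minimal usage sketch (output shown for the space method, which simply splits on spaces):

>>> from cotk.dataloader import SimpleTokenizer
>>> tokenizer = SimpleTokenizer("space", special_tokens=["<go>", "<eos>"])
>>> tokenizer.tokenize("<go> Life is short . <eos>")
['<go>', 'Life', 'is', 'short', '.', '<eos>']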

PretrainedTokenizer

class cotk.dataloader.PretrainedTokenizer(tokenizer)[source]

Bases: dataloader.Tokenizer

A wrapper for PreTrainedTokenizer from the transformers package. If you don’t want to tokenize some special tokens, see transformers.PreTrainedTokenizer.add_special_tokens.

Parameters

tokenizer (transformers.PreTrainedTokenizer) – An instance of transformers.PreTrainedTokenizer.

get_tokenizer_class() → str[source]

Get the class name of pretrained tokenizer.
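
A hypothetical usage sketch (the wrapped tokenizer and the shown tokens are only illustrative):

>>> from transformers.tokenization_gpt2 import GPT2Tokenizer
>>> from cotk.dataloader.tokenizer import PretrainedTokenizer
>>> tokenizer = PretrainedTokenizer(GPT2Tokenizer.from_pretrained('gpt2'))
>>> tokenizer.tokenize("Life is short.")   # BPE tokens, e.g. ['Life', 'Ġis', 'Ġshort', '.']
>>> tokenizer.get_tokenizer_class()        # the class name of the wrapped tokenizer, e.g. 'GPT2Tokenizer'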

Vocab

class cotk.dataloader.Vocab[source]

A class for storing vocabulary. This is an abstract base class. It often works as a part of Field or is shared between multiple Field instances.

See introduction of vocabulary for more information.

Parameters

This class does not contain arguments for initialization.

classmethod get_all_subclasses() → Iterable[Any]

Return a generator of all subclasses.

classmethod load_class(class_name) → Any

Return a subclass of class_name, case insensitively.

Parameters

class_name (str) – target class name.

add_tokens(tokens, vocab_from) → None[source]

Add tokens to this vocabulary instance; the tokens will be used for building the vocabulary list. Must be called before build_vocab().

Parameters
  • tokens (List[str]) – A list of tokens to add to the vocabulary.

  • vocab_from (str) – One of train, test, extra.

    • train: The tokens are from the training data. Frequent vocabs are selected from tokens of this type.

    • test: The tokens are from the validation data or test data. Rare vocabs are selected from tokens of this type.

    • extra: The tokens are from extra data. The tokens of this type will not be selected as frequent or rare vocabs.

build_vocab()[source]

Build the vocabulary list according to the tokens from add_tokens().

convert_tokens_to_ids(tokens, only_frequent_word=False) → List[int][source]

Convert list of tokens to list of ids.

Parameters
  • tokens (List[str]) – List of tokens.

  • only_frequent_word (bool, optional) – Use unk for rare tokens. Defaults: False.

convert_ids_to_tokens(ids) → List[str][source]

Convert list of ids to list of tokens.

Parameters

ids (List[int]) – List of ids.
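
A rough sketch of the typical workflow (using GeneralVocab, the concrete subclass documented below; the actual ids depend on how the vocabulary is built):

>>> from cotk.dataloader import GeneralVocab
>>> vocab = GeneralVocab(min_frequent_vocab_times=2)
>>> vocab.add_tokens(["Life", "is", "short", ".", "Life", "is", "good", "."], vocab_from="train")
>>> vocab.build_vocab()
>>> ids = vocab.convert_tokens_to_ids(["Life", "is", "good"])
>>> vocab.convert_ids_to_tokens(ids)   # ['Life', 'is', 'good']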

frequent_vocab_size

int – The number of frequent words.

all_vocab_size

int – The number of frequent words and rare words.

frequent_vocab_list

list – The list of frequent words.

all_vocab_list

list – The list of frequent words and rare words. Frequent words are always in the front of the list.

get_special_tokens_mapping() → MutableMapping[str, str][source]

Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following: pad, unk, go, eos, sep, cls, mask. The value can be arbitrary string, e.g., "<pad>", "<unk>".

get_special_tokens_id(name) → int[source]

Get id of special token specifying the general name. Raise KeyError if no such token in this instance.

Parameters

name (str) – the general name, must be one of the following, pad, unk, go, eos, sep, cls, mask.

pad_id

int – The id of pad token. Raise KeyError if no pad token in this instance.

unk_id

int – The id of unk token. Raise KeyError if no unk token in this instance.

go_id

int – The id of go token. Raise KeyError if no go token in this instance.

eos_id

int – The id of eos token. Raise KeyError if no eos token in this instance.

get_setting_hash() → str[source]

Get setting hash for the Vocabulary instance. See here for the explanation of setting hash.

get_vocab_hash() → str[source]

Get vocab hash for the Vocabulary instance. See here for the explanation of vocab hash.

GeneralVocab

class cotk.dataloader.GeneralVocab(min_frequent_vocab_times=None, min_rare_vocab_times=None, special_tokens_mapping=None, special_appeared_in_data=None)[source]

Bases: dataloader.Vocab

A vocabulary class for general use.

This class always has the following 4 special tokens: pad, unk, go, eos.

If any argument is not specified, the value will be first retrieved from VocabContext. If still None, default value will be used.

Parameters
  • min_frequent_vocab_times (int, optional) – Tokens from training data that appear no less than min_frequent_vocab_times times will be regarded as frequent words. Default: 0

  • min_rare_vocab_times (int, optional) – Tokens from training data or test data that appear more than min_rare_vocab_times times will be regarded as rare words (frequent words excluded). Default: 0

  • special_tokens_mapping (OrderedDict, optional) – Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following: pad, unk, go, eos, sep, cls, mask. The value can be an arbitrary string, e.g., "<pad>", "<unk>". It must at least contain pad, unk, go, eos. The values of different special tokens cannot be the same. Default: If None, it will be OrderedDict([("pad", "<pad>"), ("unk", "<unk>"), ("go", "<go>"), ("eos", "<eos>")]).

  • special_appeared_in_data (bool, optional) – Whether the strings of special tokens appear in the data. Default: If not specified, it will be False.

static from_predefined(vocab_list, frequent_vocab_size, special_tokens_mapping=None) → cotk.dataloader.vocab.GeneralVocab[source]

Return a GeneralVocab instance, whose vocabulary comes from a predefined list. See from_predefined_vocab() if you want to use the vocabulary from an existing GeneralVocab instance.

Parameters
  • vocab_list (List[str]) – A list of all vocabulary.

  • frequent_vocab_size (int) – the number of frequent words. The frequent words must be at the front of vocab_list.

  • special_tokens_mapping (OrderedDict, optional) – Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following: pad, unk, go, eos, sep, cls, mask. The value can be an arbitrary string, e.g., "<pad>", "<unk>". It must at least contain pad, unk, go, eos. The values of different special tokens cannot be the same. Special tokens MUST be in the front of the frequent_vocab_list (order sensitive). Default: If None, it will be OrderedDict([("pad", "<pad>"), ("unk", "<unk>"), ("go", "<go>"), ("eos", "<eos>")]).
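
A hypothetical sketch (the word list and sizes are made up for illustration):

>>> from cotk.dataloader import GeneralVocab
>>> vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "Life", "is", "short", "."]
>>> vocab = GeneralVocab.from_predefined(vocab_list, frequent_vocab_size=6)
>>> # the first 6 words (including the 4 special tokens) are frequent; the rest are rare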

static from_predefined_vocab(vocab) → cotk.dataloader.vocab.GeneralVocab[source]

Return a new GeneralVocab instance from vocab. The new instance has the same vocabulary list as the old one.

Parameters

vocab (GeneralVocab) – The old instance.

static from_frequent_word(frequent_vocab_list, special_tokens_mapping=None) → cotk.dataloader.vocab.GeneralVocab[source]

Return a GeneralVocab instance, whose vocabulary comes from a predefined frequent list. And its rare word list can be built later. See from_frequent_word_of_vocab() if you want to use the frequent vocabulary from an existing GeneralVocab instance.

Parameters
  • frequent_vocab_list (List[str]) – A list of frequent vocabulary.

  • special_tokens_mapping (OrderedDict, optional) – Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following: pad, unk, go, eos, sep, cls, mask. The value can be an arbitrary string, e.g., "<pad>", "<unk>". It must at least contain pad, unk, go, eos. The values of different special tokens cannot be the same. Special tokens MUST be in the front of the frequent_vocab_list (order sensitive). Default: If None, it will be OrderedDict([("pad", "<pad>"), ("unk", "<unk>"), ("go", "<go>"), ("eos", "<eos>")]).

static from_frequent_word_of_vocab(vocab) → cotk.dataloader.vocab.GeneralVocab[source]

Return a GeneralVocab instance, which has the same frequent vocabulary list as the old one. The rare word list can be built later.

Parameters

vocab (GeneralVocab) – The old instance to provide frequent words.

PretrainedVocab

class cotk.dataloader.PretrainedVocab(tokenizer)[source]

Bases: dataloader.Vocab

Use the vocabulary from a pretrained tokenizer in the transformers package. This class is usually used for pretrained models, and it does NOT have rare words.

Unlike GeneralVocab, this class does not always have pad, unk, go, eos. Some special tokens may refer to the same token.

Parameters

tokenizer (transformers.PretrainedTokenizer) – A pretrained tokenizer from transformers package.

frequent_vocab_list

list – The list of frequent words.

all_vocab_list

list – The list of frequent words and rare words. Frequent words are always in the front of the list.
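
A hypothetical usage sketch (the pretrained model name is only illustrative):

>>> from transformers.tokenization_gpt2 import GPT2Tokenizer
>>> from cotk.dataloader import PretrainedVocab
>>> vocab = PretrainedVocab(GPT2Tokenizer.from_pretrained('gpt2'))
>>> vocab.frequent_vocab_size   # equals the vocabulary size of the pretrained tokenizer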

SimpleVocab

class cotk.dataloader.SimpleVocab[source]

Bases: dataloader.Vocab

A very simple vocabulary class. No rare vocabs or special tokens. Used by SparseLabel.

Parameters

This class do not contains arguments for initialization.

Context

class cotk.dataloader.Context(parameter_dict, weak=False, none_as_ignored=True)[source]

An abstract base class for context manager.

This class is used for setting default parameters for Field or Vocab, without directly passing parameters to __init__ of the object.

See examples for how to use context manager.

Parameters
  • parameter_dict (Dict[str, Any]) – Key-value dict for changed parameters.

  • weak (bool, optional) – When False, overwrite existing parameters. Default: False.

  • none_as_ignored (bool, optional) – When True, None values in parameter_dict are ignored. Otherwise, the corresponding key will be set to None. Default: True.

classmethod get(key, default=None, no_default=False) → Any[source]

Get the value of parameter named key stored in this class.

Parameters
  • key (str) – name of the parameter

  • default (Any, optional) – Default value if key is not set. Defaults: None.

  • no_default (bool, optional) – When True, Raise KeyError if key is not set. Defaults: False.

classmethod set(key, value, weak=False, none_as_ignored=True) → Any[source]

Set the parameter named key to value, stored in this class. If weak is True, do not overwrite if key is already set. Return the old value.

Parameters
  • key (str) – The name of the changed parameter.

  • value (Any) – The new value of changed parameter. If want to delete the key, use Context.UNDEFINED.

  • weak (bool, optional) – When False, overwrite existing parameters. Defaults: False.

  • none_as_ignored (bool, optional) – When True, None values in parameter_dict are ignored. Otherwise, the corresponding value will be set to None. Default: True.

close()[source]

Restore the old parameter.
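
A minimal sketch of the class-level get/set interface (assumed usage; FieldContext below is a concrete subclass):

>>> old = FieldContext.set("convert_to_lower_letter", True)
>>> FieldContext.get("convert_to_lower_letter")
True
>>> FieldContext.set("convert_to_lower_letter", old)   # restore the previous value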

FieldContext

class cotk.dataloader.FieldContext(parameter_dict, weak=False, none_as_ignored=True)[source]

Bases: dataloader.Context

A context class for setting default parameters for Field.

classmethod set_parameters(*, weak=False, none_as_ignored=True, **kwargs) → cotk.dataloader.context.FieldContext[source]

Set a context for initialization of Field. See examples for how to use context manager.

Parameters
  • weak (bool, optional) – When False, overwrite existing parameters. Defaults: False.

  • none_as_ignored (bool, optional) – When True, None values in kwargs are ignored. Otherwise, the corresponding value will be set to None. Default: True.

  • **kwargs – Any parameters to be set. Set key to FieldContext.UNDEFINED to delete a parameter.
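
A rough sketch (assumed usage; the tokenizer and vocab values are only illustrative) of setting defaults for fields created inside the block:

>>> from cotk.dataloader import FieldContext, GeneralVocab, SentenceDefault
>>> with FieldContext.set_parameters(tokenizer="space", vocab=GeneralVocab()):
...     field = SentenceDefault()   # tokenizer and vocab are taken from the context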

VocabContext

class cotk.dataloader.VocabContext(parameter_dict, weak=False, none_as_ignored=True)[source]

Bases: dataloader.Context

A context class for setting default parameters for Vocab.

classmethod set_parameters(*, weak=False, none_as_ignored=True, **kwargs) → cotk.dataloader.context.VocabContext[source]

Set a context for initialization of Vocab. See examples for how to use context manager.

Parameters
  • weak (bool, optional) – When False, overwrite existing parameters. Defaults: False.

  • none_as_ignored (bool, optional) – When True, None values in kwargs are ignored. Otherwise, the corresponding value will be set to None. Default: True.

  • **kwargs – Any parameters to be set. Set key to VocabContext.UNDEFINED to delete a parameter.
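
Similarly, a rough sketch (assumed usage) of setting vocabulary defaults:

>>> from cotk.dataloader import VocabContext, GeneralVocab
>>> with VocabContext.set_parameters(min_frequent_vocab_times=10):
...     vocab = GeneralVocab()   # min_frequent_vocab_times defaults to 10 inside the block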