Data Loader¶
cotk.dataloader
provides classes and functions for downloading and
loading benchmark data automatically. It reduces the cost of preprocessing
data and provides a fair dataset for every model. It also helps you adapt
your model from one dataset to other datasets.
Overview¶
Dataloaders are essential components in CoTK
for building models and performing fair evaluation.
CoTK
uses a dataloader class, LanguageProcessing
, to handle all language-related tasks.
Here we give an overview of what makes up a LanguageProcessing
dataloader.
- A dataloader may have multiple sets of data. In this example, the names of the 3 sets (set_name) are "train", "dev", and "test".
- Each set stores the data read from a text file. In this example, the 3 sets are read from "train.txt", "dev.txt", and "test.txt".
- A set may have multiple data fields. In this example, the "train" set has two fields, whose names (field_name) are "post" and "resp".
- Data fields are specified by Field instances. A Field defines the way that the dataloader reads, processes, and outputs the data. (It does not store the data; the data is stored in the dataloader.)
- A Field instance can be shared between data fields. In this example, "post" in the "train" set and "post" in the "dev" set share an instance.
- Tokenizer defines the method to tokenize a sentence.
- Vocab defines the vocabulary. An instance of Vocab can be shared between multiple Field instances, where the data from those fields is used to construct the vocabulary together (see the sketch below).
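The sharing described above can be expressed roughly as follows. This is a minimal sketch; the constructor arguments shown here are assumptions based on the classes documented later in this page.
>>> from cotk.dataloader import GeneralVocab, SentenceDefault
>>> vocab = GeneralVocab(min_frequent_vocab_times=10)            # one vocabulary object (assumed arguments)
>>> postField = SentenceDefault(tokenizer="space", vocab=vocab)  # field used for "post"
>>> respField = SentenceDefault(tokenizer="space", vocab=vocab)  # field used for "resp", sharing the same Vocab
The same postField object can then be reused for the "post" field of several sets, so those data fields share one Field instance.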
Building a Dataloader¶
Predefined Tasks¶
CoTK
provides several predefined tasks and benchmarks, including the classes documented below.
Choosing an adequate class for your task is the simplest way to build a dataloader. Each class explains how its dataloader is composed.
Customized Tasks¶
If the predefined classes do not satisfy your needs, you can construct an instance of LanguageProcessing
.
To specify the data format of the customized task, the initialization of LanguageProcessing
receives an argument named fields
.
A full specification of fields
looks like the example below.
>>> postField = SentenceDefault(...)
>>> respField = SentenceDefault(...)
>>> labelField = DenseLabel(...)
>>> fields = {
>>> "train": [("post", postField), ("resp", respField)],
>>> "test": [("post", postField), ('resp', respField), ('label', labelField)]
>>> }
>>> dataloader = LanguageProcessing("/path/to/dataset", fields)
"train"
and"test"
is the name of the split sets in the dataset. There should be two text file namedtrain.txt
andtest.txt
under/path/to/dataset/
, corresponding to the two sets,"train"
and"test"
respectively.fields["train"]
describes the data format oftrain.txt
. Every sample intrain
set has two data fields, which is represented byField
objects. AsSentenceDefault
(a subclass ofField
) only read one line per each sample, a sample intrain.txt
occupy two lines. The first line are named by"post"
, the second line are named"resp"
.Similarily,
fields["test"]
describes the data format oftest.txt
. Every sample intest
set occupies three lines, where the first line is"post"
, the second line is"resp"
, and the third line is an integer indicating"label"
.
A valid input example:
/path/to/dataset/train.txt

How are you?
I am fine.
What's up?
Everything is good.

/path/to/dataset/test.txt

What is your name?
Jack.
1
How about the food?
Terrible.
0
The Field
instances define how dataloaders read the file, process the data, and provide the data to networks.
See fields for further details.
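Once the dataloader is constructed, batches can be fetched by set name. A brief sketch (the keys below are what SentenceDefault and DenseLabel would typically contribute; the exact keys come from each field's get_batch(), documented later):
>>> batch = dataloader.get_batch("test", [0])
>>> sorted(batch.keys())
['label', 'post', 'post_allvocabs', 'post_length', 'post_str',
 'resp', 'resp_allvocabs', 'resp_length', 'resp_str']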
Omit Set Names
If you have three sets named "train"
, "dev"
, "test"
, and the data format is the same, you can
specify the fields
argument in the initialization of LanguageProcessing
with the following code:
>>> fields = [("post", postField), ("resp", respField)]
which is equivalent to
>>> fields = {
>>> "train": [("post", postField), ("resp", respField)],
>>> "dev": [("post", postField), ("resp", respField)],
>>> "test": [("post", postField), ("resp", respField)]
>>> }
Use Simple Create
You can use LanguageProcessing.simple_create()
to initialize a dataloader, using class names of Field
subclasses instead of instances. The method receives arguments for initializing subclasses of Vocab
and Field
.
>>> fields = {
>>> "train": [("post", "SentenceDefault"), ("resp", "SentenceDefault")],
>>> "dev": [("post", "SentenceDefault"), ("resp", "SentenceDefault")],
>>> "test": [("post", "SentenceDefault"), ("resp", "SentenceDefault")],
>>> }
>>> #or fields = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
>>> dataloader = LanguageProcessing.simple_create("/path/to/dataset", fields, \
>>> max_sent_length=10, tokenizer="space", min_frequent_vocab_times=10)
In this example, it will first create a GeneralVocab
instance with min_frequent_vocab_times=10
.
Then it initializes SentenceDefault
objects with max_sent_length=10, tokenizer="space"
and the created Vocab
.
Use Context Manager
There is another way to use class names of Field
subclasses instead of instances: initialize the LanguageProcessing
in the context of FieldContext
and VocabContext
.
>>> fields = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
>>> with FieldContext.set_parameters(max_sent_length=10, tokenizer="space"):
>>> with VocabContext.set_parameters(min_frequent_vocab_times=10):
>>> dataloader = LanguageProcessing("/path/to/dataset", fields)
which is equivalent to
>>> fields = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
>>> dataloader = LanguageProcessing.simple_create("/path/to/dataset", fields, max_sent_length=10, tokenizer="space", min_frequent_vocab_times=10)
Context is used to provide default values for Field
and Vocab
instances.
See Context for further details.
Field¶
Field
indicates data fields, which work behind the scenes in dataloaders.
They define how dataloaders read the file, process the data, and provide the data to networks.
CoTK
provides several fields, including those documented below.
Note Field
never stores data, because an instance can be shared between different data fields in a dataloader.
Read the File¶
Field
defines the way to read the file. For example,
Sentence
reads one line per sample, which is a sentence string.
Session
reads multiple lines per sample, stopping when an empty line is read.
DenseLabel
reads one line per sample, which is an integer.
See the documentation in each class for details.
Process the Data¶
Each subclass of Field
defines the methods to process the input.
For example, Sentence
processes the sentence into different formats:
(str) The whole sentence.
(List[str]) The tokenized sentence.
(List[int]) The indices of tokens in the vocabulary.
Sentence
also provides methods to convert a sentence from one format to another, such as convert_tokens_to_ids() and convert_ids_to_tokens().
The dataloader has similar methods, which invoke the corresponding methods of the default field.
See LanguageProcessing.set_default_field()
for details.
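For example, using the dataloader-level wrappers documented below (the returned values are illustrative and depend on the tokenizer and vocabulary):
>>> tokens = dataloader.tokenize("how are you")
>>> tokens
['how', 'are', 'you']
>>> ids = dataloader.convert_tokens_to_ids(tokens, add_special=True)   # e.g. [2, 4, 5, 6, 3]
>>> dataloader.convert_ids_to_sentence(ids)
'how are you'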
Provide the Data to Networks¶
Each subclass of Field
defines Field.get_batch()
,
which returns a dict of data for training the networks.
For example, if an instance of SentenceDefault
is named "sent"
,
SentenceDefault.get_batch()
will return a dict containing:
sent
sent_length
sent_allvocabs
sent_str
LanguageProcessing.get_batch()
will collect dict returned from every field and merge them.
For example, for a dataloader with two SentenceDefault
fields named "post"
and "resp",
LanguageProcessing.get_batch()
will return a dict containing:
post
post_allvocabs
post_length
post_str
resp
resp_allvocabs
resp_length
resp_str
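In code, this merged result looks roughly like the following sketch (the values are numpy arrays or lists, as described for SentenceDefault.get_batch() below):
>>> batch = dataloader.get_batch("train", [0, 1])
>>> sorted(batch.keys())
['post', 'post_allvocabs', 'post_length', 'post_str',
 'resp', 'resp_allvocabs', 'resp_length', 'resp_str']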
Pretrained Field¶
Default fields like SentenceDefault
and SessionDefault
are designed
for common use in different language processing tasks. They use <go>
and <eos>
to mark
the start and the end of sentences.
For some pretrained models like GPT2
, <go>
is not in the pretrained vocabulary and thus not available.
We design different fields for different pretrained models, including:
GPT2:
SentenceGPT2
,SessionGPT2
BERT:
SentenceBERT
,SessionBERT
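A hedged sketch of using the GPT2 variants; the way the transformers tokenizer is wrapped below is an assumption, so refer to SentenceGPT2, PretrainedTokenizer, and PretrainedVocab for the exact interfaces:
>>> from transformers import GPT2Tokenizer
>>> gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> tokenizer = PretrainedTokenizer(gpt2_tokenizer)   # assumed: wraps the transformers tokenizer
>>> vocab = PretrainedVocab(gpt2_tokenizer)           # assumed: vocabulary taken from the pretrained model
>>> sentField = SentenceGPT2(tokenizer=tokenizer, vocab=vocab)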
Tokenizer¶
Tokenizer
defines the method to tokenize a sentence, which is used by Field
.
CoTK
provides several tokenizers, including:
- SimpleTokenizer: A simple tokenizer for general use in CoTK, supporting space or nltk tokenization.
- PretrainedTokenizer: A pretrained tokenizer from the transformers package, for example, the tokenizer for GPT2.
When creating a dataloader, it often receives either a str or a Tokenizer. If a str, the following values are acceptable:
- space: Split by spaces.
- nltk: nltk.tokenize.WordPunctTokenizer will be used.
A SimpleTokenizer will be created from the str argument.
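For example (a small sketch; the import path and constructor signature of SimpleTokenizer shown here are assumptions):
>>> from cotk.dataloader import SimpleTokenizer
>>> tokenizer = SimpleTokenizer("space")      # or SimpleTokenizer("nltk")
>>> tokenizer.tokenize("How are you?")
['How', 'are', 'you?']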
Vocabulary¶
Vocab
defines the vocabulary, which is used by Field
.
CoTK
provides several vocabularies, including:
- GeneralVocab: A vocabulary for general use in CoTK. The vocabulary list is often built while processing the input data. Saving and loading a predefined vocabulary is also supported.
- PretrainedVocab: A pretrained vocabulary from the transformers package, for example, the vocabulary for GPT2.
Type of Tokens¶
All tokens that appear in the dataset (including those that only appear in the test set) are divided into 2 categories.
- Frequent Vocabularies (frequent_vocabs)
Tokens that the model should read, predict, and generate.
These tokens are important in evaluation. They include common words and usually cover most of the tokens in the dataset.
They are extracted from the training set only, because models should be blind to the test set. Hence, they are defined as the tokens that appear no less than a specified number of times (min_frequent_vocab_times) in the training set.
- Rare Vocabularies (rare_vocabs)
Tokens that the model can optionally read, but will not predict or generate most of the time (except that some models can generate rare words using copy mechanisms or external knowledge).
These tokens are less important but DO affect the evaluation.
They are extracted from both the training set and the test set, because they are defined with evaluation in mind. Hence, they are defined as the tokens (excluding frequent_vocabs) that appear no less than a specified number of times (min_rare_vocab_times) in the whole dataset.
There are also some other terms for vocabularies.
- All Vocabularies (allvocabs)
The union of frequent vocabularies and rare vocabularies is called all vocabularies.
- Special Tokens (special_tokens)
The most used special tokens are <pad>, <unk>, <go>, and <eos>.
Special tokens are counted as frequent vocabularies.
- Unknown Tokens (<unk>)
<unk> means "out of vocabulary", but the meaning of <unk> varies with the situation.
If it appears in a list whose name contains allvocabs (e.g., sent_allvocabs), <unk> indicates a token out of all vocabularies.
If it appears in a list whose name does not contain allvocabs (e.g., sent), <unk> indicates a token out of frequent vocabularies, which means it may be a rare vocabulary.
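A small worked example with hypothetical token counts and thresholds:
>>> # training set counts: "hello" x3, "world" x2, "cotk" x1
>>> # test set counts:     "hello" x1, "cotk" x1, "foo" x1
>>> # with min_frequent_vocab_times=2 and min_rare_vocab_times=2:
>>> #   frequent_vocabs = ["hello", "world"]   # counted on the training set only
>>> #   rare_vocabs     = ["cotk"]             # counted on the whole dataset, frequent words excluded
>>> #   "foo" is out of all vocabularies and maps to <unk> even in *_allvocabs lists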
Why CoTK Uses Rare Words¶
In traditional implementations, the vocabulary only contains frequent words.
CoTK
uses frequent vocabulary and rare vocabulary to support fair comparisons across different configurations.
For example, we test two models on the same dataset, but with different vocabularies.
- Model A: Frequent vocabulary F_A; rare vocabulary R_A.
- Model B: Frequent vocabulary F_B; rare vocabulary R_B.
The fairness of comparisons can be guaranteed under the following conditions:
- metric.PerplexityMetric: F_A + R_A == F_B + R_B.
- metric.BleuCorpusMetric: F_A + R_A == F_B + R_B if tokenizer is None; always fair if tokenizer is set.
See each metric for when the fairness can be guaranteed. Hash values can help users determine whether a comparison is fair.
Connecting Field and Vocab¶
GeneralVocab
is often shared between fields to construct the vocabulary list together.
To identify whether the tokens from a field are regarded as part of the training set or the test set
(which is relevant to the division of frequent vocab and rare vocab), Sentence
uses an argument named vocab_from_mappings
.
vocab_from_mappings
is a dict, which infers the type of tokens from the set name. By default:
Set Name    | Type
train       | train
training    | train
dev         | test
development | test
valid       | test
validation  | test
test        | test
evaluation  | test
For example, a token from the training
set will have a type of train
.
The type will be passed to Vocab.add_tokens()
as vocab_from
.
There are 3 types:
- train: Frequent vocabs are selected from tokens of this type.
- test: Rare vocabs are selected from tokens of this type.
- extra: Tokens of this type will not be selected as frequent or rare vocabs.
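For example, the mapping can be overridden when constructing a field. A sketch, where vocab is a GeneralVocab as in the earlier examples and the dict below is a hypothetical configuration:
>>> # Treat the "dev" set as extra data: its tokens are added to the vocabulary object
>>> # but are selected as neither frequent nor rare vocabs.
>>> mappings = {"train": "train", "dev": "extra", "test": "test"}
>>> field = SentenceDefault(tokenizer="space", vocab=vocab, vocab_from_mappings=mappings)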
Context¶
FieldContext
and VocabContext
are used to set
the default arguments for subclasses of Field
and Vocab
respectively.
>>> vocab = GeneralVocab(...)
>>> with FieldContext.set_parameters(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10):
>>> field = SentenceDefault()
is equivalent to:
>>> vocab = GeneralVocab(...)
>>> field = SentenceDefault(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10)
Contexts can be stacked. The weak
argument controls whether the inner context overrides the outer one: with weak=True, values already set by the outer context are kept.
>>> vocab = GeneralVocab(...)
>>> with FieldContext.set_parameters(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10):
>>> with FieldContext.set_parameters(min_frequent_vocab_times=20):
>>> field1 = SentenceDefault() # min_frequent_vocab_times=20
>>> with FieldContext.set_parameters(vocab=vocab, tokenizer="space", min_frequent_vocab_times=10):
>>> with FieldContext.set_parameters(min_frequent_vocab_times=20, weak=True):
>>> field2 = SentenceDefault() # min_frequent_vocab_times=10
Contexts usually work together with the initialization of LanguageProcessing
, without creating instances of Field
or Vocab
.
See the examples here.
Hash Value for Dataloader¶
It is usually difficult to track the differences among different configurations, so CoTK provides hash codes to identify each part of the dataloader, including the input data, vocabularies, and settings.
For example, if two data loaders have the same general hash, their data, vocabularies and settings are guaranteed to be the same.
LanguageProcessing
provides the following hash values:
- LanguageProcessing.get_raw_data_hash(): Tracks the raw input files before processing.
- LanguageProcessing.get_data_hash(): Tracks the data after processing.
- LanguageProcessing.get_vocab_hash(): Tracks the vocabulary.
- LanguageProcessing.get_setting_hash(): Tracks the settings (arguments of the dataloader).
- LanguageProcessing.get_general_hash(): Tracks all of the above.
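For example (a sketch; the hash strings themselves are implementation-defined, only their equality matters):
>>> a = LanguageProcessing.simple_create("/path/to/dataset", fields, tokenizer="space", min_frequent_vocab_times=10)
>>> b = LanguageProcessing.simple_create("/path/to/dataset", fields, tokenizer="nltk", min_frequent_vocab_times=10)
>>> a.get_raw_data_hash() == b.get_raw_data_hash()   # same raw files
True
>>> a.get_general_hash() == b.get_general_hash()     # different tokenizer, hence different settings
False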
Dataloader¶
LanguageProcessing¶
-
class
cotk.dataloader.
LanguageProcessing
(file_id, fields)[source]¶ Bases:
dataloader.Dataloader
Base class for all language processing tasks. This is an abstract class.
During the initialization of a dataloader,
Vocab
,Tokenizer
orField
may be created. See how to create a dataloader.- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id.fields (List, OrderedDict, Dict) – This argument supports multiple input types:
If OrderedDict or List, it specifies the data format of the "train", "dev", and "test" sets.
A data format should be an OrderedDict, or a List[Tuple] that can be converted to an OrderedDict
.The
key
ofdata format
is the name of a Field (used byget_batch()
), and thevalue
is either a class name of a Field or aField
object.Examples:
>>> postField = SentenceDefault(...) >>> respField = SentenceDefault(...) >>> data_format = [("post", postField), ("resp", respField)]
or
>>> data_format = [("post", "SentenceDefault"), ("resp", "SentenceDefault")]
Examples:
>>> fields = data_format
is equivalent to
>>> fields = {"train": data_format, "dev": data_format, "test": data_format}
If
Dict
,fields[key]
describesdata format
of the set namedkey
. Examples:
>>> fields = {"train": data_format, "extra": data_format}
-
static
LanguageProcessing.
simple_create
(file_id, fields, **kwargs) → cotk.dataloader.dataloader.LanguageProcessing[source]¶ A simple way to create a dataloader. Instead of using
VocabContext
andFieldContext
, specify all the possible parameters here.- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id.fields (List, OrderedDict, Dict) – See initialization of
LanguageProcessing
for explanation.
Tokenizer, Vocabulary, and Field¶
-
LanguageProcessing.
fields
¶ This instance attribute shows fields of the dataloader (See the initialization of
LanguageProcessing
). For example, the fields can be printed as follows:
{
    'train': OrderedDict([('sent', <cotk.dataloader.field.SentenceDefault object at 0x000001E170F8B588>)]),
    'dev': OrderedDict([('sent', <cotk.dataloader.field.SentenceDefault object at 0x000001E170F8BB48>)]),
    'test': OrderedDict([('sent', <cotk.dataloader.field.SentenceDefault object at 0x000001E170F8BEC8>)])
}
-
LanguageProcessing.
get_default_tokenizer
() → cotk.dataloader.tokenizer.Tokenizer[source]¶ Get the default
Tokenizer
in this dataloader. It can be set byset_default_field()
.
-
LanguageProcessing.
get_default_vocab
() → cotk.dataloader.vocab.Vocab[source]¶ Get the default
Vocab
in this dataloader. It can be set byset_default_field()
.
-
LanguageProcessing.
get_default_field
() → cotk.dataloader.field.Field[source]¶ Get the default
Field
in this dataloader. It can be set byset_default_field()
.
-
LanguageProcessing.
set_default_field
(set_name, field_name)[source]¶ Set the default
Field
in this dataloader. In the meanwhile, the defaultVocab
andTokenizer
is also set according to the field (if the field have vocab and tokenizer).The default field will affect the action in the following methods:
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.field_name (str) – The name of field.
Batched Data¶
LanguageProcessing.
get_batch
(set_name, indexes) → Dict[str, Any][source]¶Get a batch of data with specified
indexes
. Return a merged dict containing all the data from each field by callingfield.get_batch()
. See examples in subclasses for the return value of predefined tasks.
get_next_batch()
,get_batches()
,get_all_batch()
provide other methods to get batched data. Their return values are consistent with this method.
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.indexes (list) – a list of specified indexes of batched data.
LanguageProcessing.
restart
(set_name, batch_size=None, shuffle=True)[source]¶Initialize batches. This function must be called before
get_next_batch()
or after an epoch ends. See get_next_batch()
for examples.
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
batch_size (int) – the number of samples in a batch. Default: if
None
, lastbatch_size
is used.shuffle (bool) – whether to shuffle the data. Default:
True
.
LanguageProcessing.
get_next_batch
(set_name, ignore_left_samples=False) → Optional[Dict[str, Any]][source]¶Get next batch. It can be called only after Initializing batches (
restart()
). Return a dict likeget_batch()
, or None if the epoch has ended.
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
ignore_left_samples (bool) – If the number of samples is not divisible by
batch_size
, ignore the leftover samples (fewer than batch_size).
Setting it to True
ensures that every batch has the same number of samples. Default: False
.Examples
>>> dataloader.restart("train") >>> while True: >>> data = dataloader.get_next_batch("train") >>> if data: >>> break >>> print(data)
LanguageProcessing.
get_batches
(set_name, batch_size=None, shuffle=True, ignore_left_samples=False) → Iterable[Dict[str, Any]][source]¶An iterable generator over batches. It first calls
restart()
, and thenget_next_batch()
until no more data is available. Returns an iterable generator where each element is likeget_batch()
.
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.batch_size (int, optional) – default:
None
. Usebatch_size
by default.shuffle (bool) – whether to shuffle the data. Default:
True
ignore_left_samples (bool) – If the number of samples is not divisible by
batch_size
, ignore the leftover samples (fewer than batch_size).
Setting it to True
ensures that every batch has the same number of samples. Default: False
.
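A typical epoch of training can then be written as the following sketch, where train_step is a hypothetical user-defined function:
>>> for batch in dataloader.get_batches("train", batch_size=32, shuffle=True):
>>>     loss = train_step(batch)   # hypothetical training step consuming the batched dict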
LanguageProcessing.
get_all_batch
(set_name) → Dict[str, List[Any]][source]¶Concatenate all batches to a single dict, where padding will not be applied.
Returns a dict like
get_batch()
with all validindexes
, but the sentences are not padded and their type will be converted to list. Specifically, this function calls get_batch()
where len(indexes)==1
multiple times and concatenates all the values in the returned dicts.
- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.
Sentences and Manipulations¶
-
LanguageProcessing.
tokenize
(sentence) → List[str]¶ Tokenize
sentence
.It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.Convert tokens to lower case if
sentence.convert_to_lower_letter
isTrue
.
- Parameters
sentence (str) – The sentence to be tokenized.
-
LanguageProcessing.
tokenize_sentences
(sentences) → List[List[str]]¶ Tokenize
sentences
.It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.Convert tokens to lower case if
sentence.convert_to_lower_letter
isTrue
.
- Parameters
sentences (List[str]) – The list of sentences to be tokenized.
-
LanguageProcessing.
convert_tokens_to_ids
(tokens, add_special=False, only_frequent_word=False) → List[int]¶ Convert list of tokens to list of ids. It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.- Parameters
tokens (List[str]) – The tokens to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
LanguageProcessing.
convert_ids_to_tokens
(ids, remove_special=True, trim=True) → List[str]¶ Convert list of ids to list of tokens. It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
LanguageProcessing.
convert_ids_to_sentence
(ids, remove_special=True, trim=True) → str¶ Convert list of tokens to a sentence. It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
LanguageProcessing.
convert_sentence_to_ids
(sentence, add_special=False, only_frequent_word=False) → List[int]¶ Convert a sentence to a list of ids. It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.- Parameters
sentence (str) – The sentence to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
LanguageProcessing.
add_special_to_ids
(ids) → List[int]¶ Add special tokens, such as
go_id
oreos_id
to the inputids
. It calls the identical method of theSentence
instancesentence
, fromget_default_field()
.- Parameters
ids (List[int]) – The input ids.
-
LanguageProcessing.
remove_special_in_ids
(ids, remove_special=True, trim=True) → List[int]¶ Remove special ids in input ids. It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.- Parameters
ids (List[int]) – Input ids.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
LanguageProcessing.
process_sentences
(sentences, add_special=True, only_frequent_word=False, cut=True) → List[List[int]]¶ Process input sentences.
It calls the identical method of the
Sentence
instancesentence
, fromget_default_field()
.If sentences haven’t been tokenized, tokenize them by invoking
Sentence.tokenize_sentences()
.Then, convert the list of tokens to a list of ids.
If
sentence.max_sent_length
is notNone
andcut
isTrue
, sentences whose length is more than sentence.max_sent_length
, are shortened to the first sentence.max_sent_length
tokens.
- Parameters
sentences (List[str], List[List[str]]) – sentences can be a list of sentences or a list of lists of tokens.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:True
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.cut (bool, optional) – Whether to cut sentences with too many tokens. Default:
True
.
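A brief example, reusing the vocabulary from the examples above (the ids are illustrative):
>>> dataloader.process_sentences(["how are you", "hello"])
[[2, 4, 5, 6, 3], [2, 7, 3]]   # <go> how are you <eos> / <go> hello <eos>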
-
LanguageProcessing.
trim_in_ids
(ids) → List[int]¶ Find the first special token indicating the sentence is over and remove all the tokens after it (included). Then remove all trailing
pad
. It calls the identical method of theSentence
instancesentence
, fromget_default_field()
.- Parameters
ids (List[int]) – The input ids.
Vocabulary¶
-
LanguageProcessing.
frequent_vocab_size
¶ int – The number of frequent words. It calls the identical method of the
Vocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
all_vocab_size
¶ int – The number of frequent words and rare words. It calls the identical method of the
Vocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
frequent_vocab_list
¶ list – The list of frequent words. It calls the identical method of the
Vocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
all_vocab_list
¶ list – The list of frequent words and rare words. Frequent words are always in the front of the list. It calls the identical method of the
Vocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
get_special_tokens_mapping
() → MutableMapping[str, str]¶ Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
get_special_tokens_id
(name) → int¶ Get id of special token specifying the general name. Raise
KeyError
if no such token in this instance. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.- Parameters
name (str) – the general name, must be one of the following,
pad
,unk
,go
,eos
,sep
,cls
,mask
.
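For example (the concrete mapping depends on the vocabulary; the values below are illustrative):
>>> dataloader.get_special_tokens_mapping()
OrderedDict([('pad', '<pad>'), ('unk', '<unk>'), ('go', '<go>'), ('eos', '<eos>')])
>>> dataloader.get_special_tokens_id("unk")
1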
-
LanguageProcessing.
pad_id
¶ int – The id of pad token. Raise
KeyError
if no pad token in this instance. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
unk_id
¶ int – The id of unk token. Raise
KeyError
if no unk token in this instance. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
go_id
¶ int – The id of go token. Raise
KeyError
if no go token in this instance. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.
-
LanguageProcessing.
eos_id
¶ int – The id of eos token. Raise
KeyError
if no eos token in this instance. It calls the identical method of theVocab
instancevocab
, fromget_default_vocab()
.
Hash¶
-
LanguageProcessing.
get_general_hash
() → str[source]¶ General hash. Identifies all details of the dataloader, including the raw data before processing, the tokenized data, the vocabulary, and the settings.
See dataloader hash for explanation.
-
LanguageProcessing.
get_raw_data_hash
() → str[source]¶ Raw data hash. Identifies the raw data before processing.
See dataloader hash for explanation.
-
LanguageProcessing.
get_data_hash
() → str[source]¶ Data hash. Identifies the data after processing (tokenization).
See dataloader hash for explanation.
-
LanguageProcessing.
get_vocab_hash
() → str[source]¶ Vocab hash. Identifies the vocabulary.
See dataloader hash for explanation.
-
LanguageProcessing.
get_setting_hash
() → str[source]¶ Setting hash. Identifies the settings used to create the dataloader.
See dataloader hash for explanation.
LanguageGeneration¶
-
class
cotk.dataloader.
LanguageGeneration
(file_id, *, tokenizer=None, max_sent_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, pretrained=None)[source]¶ Bases:
dataloader.LanguageProcessing
This class is designed for language modeling tasks or language generation tasks without any inputs.
- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id.tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
.convert_to_lower_letter (bool, optional) – Whether convert all the tokens to lower case after tokenization. Default:
False
.min_frequent_vocab_times (int, optional) – Tokens from training data appeared no less than
min_frequent_vocab_times
will be regarded as frequent words. Default:0
min_rare_vocab_times (int, optional) – Tokens from training data or test data appeared more than
min_rare_vocab_times
will be regarded as rare words (frequent word excluded). Default:0
pretrained (str, optional) – Use pretrained field instead of
SentenceDefault
. Default: IfNone
, no pretrained field used.
-
get_batch
(set_name, indexes) → Dict[str, Any][source]¶ Get a batch of data with specified
indexes
. Returns a dict at least contains:sent_length (
numpy.ndarray
): A 1-d array, the length of sentence in each batch. Size:[batch_size]
sent (
numpy.ndarray
): A 2-d padding array containing id of tokens. Only provide frequent tokens.unk_id
will be used for a rare token. Size:[batch_size, max(sent_length)]
sent_allvocabs (
numpy.ndarray
): A 2-d padding array containing id of tokens. Provide both frequent and rare tokens. Size:[batch_size, max(sent_length)]
sent_str (
List[str]
): A list containing raw sentences before tokenizing, converting to ids, or padding. Do not contain any special tokens. Size:[batch_size]
get_next_batch()
,get_batches()
,get_all_batch()
provide other methods to get batched data. Their return values are consistent with this method.- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.indexes (list) – a list of specified indexes of batched data.
Examples
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", >>> # "hello", "i", "am", "fine"] >>> # frequent_vocab_size = 9 >>> # frequent_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", "hello", "i"] >>> dataloader.get_batch('train', [0, 1, 2]) { "sent": numpy.array([ [2, 4, 5, 6, 3, 0], # first sentence: <go> how are you <eos> <pad> [2, 7, 3, 0, 0, 0], # second sentence: <go> hello <eos> <pad> <pad> <pad> [2, 7, 8, 1, 1, 3] # third sentence: <go> hello i <unk> <unk> <eos> ]), "sent_length": numpy.array([5, 3, 6]), # length of sentences "sent_allvocabs": numpy.array([ [2, 4, 5, 6, 3, 0], # first sentence: <go> how are you <eos> <pad> [2, 7, 3, 0, 0, 0], # second sentence: <go> hello <eos> <pad> <pad> <pad> [2, 7, 8, 9, 10, 3] # third sentence: <go> hello i am fine <eos> ]), "sent_str": [ "how are you", "hello", "hello i am fine" ], }
-
get_teacher_forcing_metric
(gen_log_prob_key='gen_log_prob') → cotk.metric.metric.MetricChain[source]¶ Get metrics for teacher-forcing. In other words, this function provides metrics for the language modelling task.
It contains:
See the above class for details of arguments.
- Parameters
gen_log_prob_key (str, optional) – The key of predicted log probability over words. Default:
gen_log_prob
.
-
get_inference_metric
(gen_key='gen', sample_in_bleu=1000, sample_in_ngram_perplexity=10000, seed=1229, cpu_count=None) → cotk.metric.metric.MetricChain[source]¶ Get metrics for inference. In other words, this function provides metrics for language generation tasks.
It contains:
See the above class for details of arguments.
- Parameters
gen_key (str, optional) – The key of generated sentences. Default:
gen
.sample_in_bleu (int, optional) – Number of examples sampled from the generated sentences. Default:
1000
.sample_in_ngram_perplexity (int, optional) – Number of examples sampled from the generated sentences. Default:
10000
.seed (int, optional) – Random seed for sampling. Default:
1229
.cpu_count (int, optional) – Number of used cpu for multiprocessing. Multiprocessing will NOT be used when
cpu_count
is set to1
or the dataset is small. Default: IfNone
, the environment variableCPU_COUNT
will be used when available, or all available cpu will be used otherwise.
MSCOCO¶
-
class
cotk.dataloader.
MSCOCO
(file_id, *, tokenizer='nltk', max_sent_length=50, convert_to_lower_letter=False, min_frequent_vocab_times=10, min_rare_vocab_times=0, pretrained=None)[source]¶ Bases:
dataloader.LanguageGeneration
A dataloader for preprocessed MSCOCO dataset. Refer to
LanguageGeneration
andLanguageProcessing
for attributes and methods.- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id. Default:resources://MSCOCO
.tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. Default:nltk
max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default: 50
.convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
min_frequent_vocab_times (int, optional) – Tokens from training data appeared no less than
min_frequent_vocab_times
will be regarded as frequent words. Default:10
.min_rare_vocab_times (int, optional) – Tokens from training data or test data appeared more than
min_rare_vocab_times
will be regarded as rare words (frequent word excluded). Default:0
.pretrained (str, optional) – Use pretrained field instead of
SentenceDefault
. Default: IfNone
, no pretrained field used.
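A minimal usage sketch (assuming the default resources://MSCOCO resource can be downloaded in your environment):
>>> dataloader = MSCOCO("resources://MSCOCO")
>>> dataloader.restart("train", batch_size=32, shuffle=True)
>>> batch = dataloader.get_next_batch("train")   # keys: "sent", "sent_length", "sent_allvocabs", "sent_str"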
References
[1] http://images.cocodataset.org/annotations/annotations_trainval2017.zip
[2] Chen X, Fang H, Lin T Y, et al. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv:1504.00325, 2015.
SingleTurnDialog¶
-
class
cotk.dataloader.
SingleTurnDialog
(file_id, *, tokenizer=None, max_sent_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, pretrained=None)[source]¶ Bases:
dataloader.LanguageProcessing
This class is designed for sequence-to-sequence generation tasks, especially single-turn dialog tasks.
- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id.tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
.convert_to_lower_letter (bool, optional) – Whether convert all the tokens to lower case after tokenization. Default:
False
.min_frequent_vocab_times (int, optional) – Tokens from training data appeared no less than
min_frequent_vocab_times
will be regarded as frequent words. Default:0
min_rare_vocab_times (int, optional) – Tokens from training data or test data appeared more than
min_rare_vocab_times
will be regarded as rare words (frequent word excluded). Default:0
pretrained (str, optional) – Use pretrained field instead of
SentenceDefault
. Default: IfNone
, no pretrained field used.
-
get_batch
(set_name, indexes) → Dict[str, Any][source]¶ Get a batch of data with specified
indexes
. Return a dict contains:post_length (
numpy.ndarray
): A 1-d array, the length of post in each batch. Size:[batch_size]
post (
numpy.ndarray
): A 2-d padded array containing tokens of id form in posts. Only provide frequent tokens.unk_id
will be used for a rare token. Size:[batch_size, max(sent_length)]
post_allvocabs (
numpy.ndarray
): A 2-d padded array containing tokens of id form in posts. Provide both frequent and rare vocabs. Size:[batch_size, max(sent_length)]
post_str (
List[str]
): A list containing raw posts before tokenizing, converting to ids, or padding. Do not contain any special tokens. Size:[batch_size]
resp_length (
numpy.ndarray
): A 1-d array, the length of response in each batch. Size:[batch_size]
resp (
numpy.ndarray
): A 2-d padded array containing tokens of id form in responses. Only provide frequent vocabs. unk_id
will be used for a rare token. Size:[batch_size, max(sent_length)]
resp_allvocabs (
numpy.ndarray
): A 2-d padded array containing tokens of id form in responses. Provide both frequent and rare vocabs. Size: [batch_size, max(sent_length)]
resp_str (
List[str]
): A list containing raw responses before tokenizing, converting to ids, or padding. Does not contain any special tokens. Size: [batch_size]
get_next_batch()
,get_batches()
,get_all_batch()
provide other methods to get batched data. Their return values are consistent with this method.- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.indexes (list) – a list of specified indexes of batched data.
Examples
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", >>> # "hello", "i", "am", "fine"] >>> # frequent_vocab_size = 9 >>> # frequent_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", "hello", "i"] >>> dataloader.get_batch('train', [0, 1]) { "post_str": [ "are you fine", "hello", ], "post_allvocabs": numpy.array([ [2, 5, 6, 10, 3], # first post: <go> are you fine <eos> [2, 7, 3, 0, 0], # second post: <go> hello <eos> <pad> <pad> ]), "post": numpy.array([ [2, 5, 6, 1, 3], # first post: <go> are you <unk> <eos> [2, 7, 3, 0, 0], # second post: <go> hello <eos> <pad> <pad> ]), "resp_str": [ "i am fine", "hello" ], "resp_allvocabs": numpy.array([ [2, 8, 9, 10, 3], # first response: <go> i am fine <eos> [2, 7, 3, 0, 0], # second response: <go> hello <eos> <pad> <pad> ]), "resp": numpy.array([ [2, 8, 1, 1, 3], # first response: <go> i <unk> <unk> <eos> [2, 7, 3, 0, 0], # second response: <go> hello <eos> <pad> <pad> ]), "post_length": numpy.array([5, 3]), # length of posts "resp_length": numpy.array([5, 3]), # length of responses }
-
get_teacher_forcing_metric
(gen_log_prob_key='gen_log_prob', generate_rare_vocab=False) → MetricChain[source]¶ Get metrics for teacher-forcing.
It contains:
- Parameters
gen_log_prob_key (str) – The key of predicted log probability over words. Refer to
metric.PerplexityMetric
. Default:gen_log_prob
.generate_rare_vocab (bool) – Whether
gen_log_prob
contains rare vocab. Refer to
. Default:False
.
-
get_inference_metric
(gen_key='gen') → MetricChain[source]¶ Get metrics for inference.
It contains:
- Parameters
gen_key (str) – The key of generated sentences in index form. Refer to
metric.BleuCorpusMetric
ormetric.SingleTurnDialogRecorder
. Default:gen
.
OpenSubtitles¶
-
class
cotk.dataloader.
OpenSubtitles
(file_id='resources://OpenSubtitles', *, tokenizer='nltk', max_sent_length=50, convert_to_lower_letter=False, min_frequent_vocab_times=10, min_rare_vocab_times=0, pretrained=None)[source]¶ Bases:
dataloader.SingleTurnDialog
A dataloader for OpenSubtitles dataset. Refer to
SingleTurnDialog
,LanguageProcessing
for attributes and methods.- Parameters
file_id (str) – A string indicating the source (path) of the dataset. It can be local path (
"./data"
), a resource name ("resources://dataset"
), or an url ("http://test.com/dataset.zip"
). See the details of file id. Default:resources://OpenSubtitles
.tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. Default:nltk
max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:50
.convert_to_lower_letter (bool, optional) – Whether convert all the tokens to lower case after tokenization. Default:
False
min_frequent_vocab_times (int, optional) – Tokens from training data appeared no less than
min_frequent_vocab_times
will be regarded as frequent words. Default:10
.min_rare_vocab_times (int, optional) – Tokens from training data or test data appeared more than
min_rare_vocab_times
will be regarded as rare words (frequent word excluded). Default:0
pretrained (str, optional) – Use pretrained field instead of
SentenceDefault
. Default: IfNone
, no pretrained field used.
References
[1] http://opus.nlpl.eu/OpenSubtitles.php
[2] P. Lison and J. Tiedemann, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. LREC 2016.
MultiTurnDialog¶
-
class
cotk.dataloader.
MultiTurnDialog
(file_id, tokenizer=None, max_sent_length=None, max_turn_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, fields=None, pretrained=None)[source]¶ Base class for multi-turn dialog datasets. This is an abstract class.
Arguments:
Attributes:
Notes
A
Session
field must be set as default field. When invoking__init__()
ofMultiTurnDialog
, the default field, which may be reset in subclass, is set as self.fields[‘train’][‘session’].-
get_batch
(set_name, indexes) → Dict[str, Any]¶ Get a batch of data with specified
indexes
. Return a merged dict containing all the data from each field by callingfield.get_batch()
. See examples in subclasses for the return value of predefined tasks.get_next_batch()
,get_batches()
,get_all_batch()
provide other methods to get batched data. Their return values are consistent with this method.- Parameters
set_name (str) – The name of set. For example:
"train"
,"dev"
,"test"
.indexes (list) – a list of specified indexes of batched data.
-
get_teacher_forcing_metric
(multi_turn_gen_log_prob_key='multi_turn_gen_log_prob')[source]¶ Get metric for teacher-forcing.
It contains:
- Parameters
multi_turn_gen_log_prob_key (str) – The key of predicted log probability over words. Refer to
metric.MultiTurnPerplexityMetric
. Default: multi_turn_gen_log_prob
.- Returns
A
metric.MetricChain
object.
-
get_inference_metric
(multi_turn_gen_key='multi_turn_gen')[source]¶ Get metric for inference.
It contains:
- Parameters
multi_turn_gen_key (str) – The key of generated sentences in index form. Refer to
metric.BleuCorpusMetric
ormetric.MultiTurnDialogRecorder
. Default: multi_turn_gen
.- Returns
A
metric.MetricChain
object.
-
UbuntuCorpus¶
-
class
cotk.dataloader.
UbuntuCorpus
(file_id='resources://Ubuntu', min_frequent_vocab_times=10, max_sent_length=50, max_turn_length=20, min_rare_vocab_times=0, tokenizer='nltk', pretrained=None)[source]¶ A dataloader for Ubuntu dataset.
- Parameters
file_id (str) – a str indicating the source of the UbuntuCorpus dataset. Default:
resources://Ubuntu
. A preset dataset is downloaded and cached.min_frequent_vocab_times (int) – A cut-off threshold of frequent tokens. All tokens that appear not less than min_frequent_vocab_times times in the training set will be marked as frequent words. Default:
10
.max_sent_length (int) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. Default:50
.max_turn_length (int) – All sessions longer than
max_turn_length
will be shortened to firstmax_turn_length
sentences. Default:20
.min_rare_vocab_times (int) – A cut-off threshold of rare tokens. All tokens that appear not less than
min_rare_vocab_times
times in the whole dataset (except frequent words) will be marked as rare words. Otherwise, they are unknown words, both in training and testing stages. Default:0
(No unknown words).
Refer to
MultiTurnDialog
for attributes and methods.References
[1] https://github.com/rkadlec/ubuntu-ranking-dataset-creator
[2] Lowe R, Pow N, Serban I, et al. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. SIGDIAL 2015.
SwitchboardCorpus¶
-
class
cotk.dataloader.
SwitchboardCorpus
(file_id='resources://SwitchboardCorpus', min_frequent_vocab_times=5, max_sent_length=50, max_turn_length=1000, min_rare_vocab_times=0, tokenizer='nltk', pretrained=None)[source]¶ A dataloader for Switchboard dataset.
In this dataset, all sessions start with a
<d>
representing empty context.- Parameters
file_id (str) – a string indicating the source of SwitchboardCorpus dataset. Default:
resources://SwitchboardCorpus
. A preset dataset is downloaded and cached.
Refer to
MultiTurnDialog
for attributes and methods.References
[1] https://catalog.ldc.upenn.edu/LDC97S62
[2] John J G and Edward H. Switchboard-1 release 2. Linguistic Data Consortium, Philadelphia 1997.
SentenceClassification¶
-
class
cotk.dataloader.
SentenceClassification
(file_id, tokenizer=None, max_sent_length=None, convert_to_lower_letter=None, min_frequent_vocab_times=None, min_rare_vocab_times=None, fields=None, pretrained=None)[source]¶ Base class for sentence classification datasets. This is an abstract class.
Arguments:
Notes
A
Sentence
field must be set as default field. When invoking__init__()
ofSentenceClassification
, the default field, which may be reset in subclass, is set as self.fields[‘train’][‘sent’].-
get_batch
(set_name, indexes)[source]¶ Get a batch of specified indexes.
- Parameters
set_name (str) – must be contained in key_name
indexes (list) – a list of specified indexes
- Returns
(dict) –
A dict at least contains:
sent_length(
numpy.array
): A 1-d array, the length of sentence in each batch. Size: [batch_size]sent(
numpy.array
): A 2-d padding array containing id of words. Only provide valid words. unk_id will be used if a word is not valid. Size: [batch_size, max(sent_length)]label(
numpy.array
): A 1-d array, the label of sentence in each batch.sent_allvocabs(
numpy.array
): A 2-d padding array containing id of words. Provide both valid and invalid words. Size: [batch_size, max(sent_length)]
Examples
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", >>> # "hello", "i", "am", "fine"] >>> # vocab_size = 9 >>> # vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "how", "are", "you", "hello", "i"] >>> dataloader.get_batch('train', [0, 1, 2]) { "sent": numpy.array([ [2, 4, 5, 6, 3, 0], # first sentence: <go> how are you <eos> <pad> [2, 7, 3, 0, 0, 0], # second sentence: <go> hello <eos> <pad> <pad> <pad> [2, 7, 8, 1, 1, 3] # third sentence: <go> hello i <unk> <unk> <eos> ]), "label": numpy.array([1, 2, 0]) # label of sentences "sent_length": numpy.array([5, 3, 6]), # length of sentences "sent_allvocabs": numpy.array([ [2, 4, 5, 6, 3, 0], # first sentence: <go> how are you <eos> <pad> [2, 7, 3, 0, 0, 0], # second sentence: <go> hello <eos> <pad> <pad> <pad> [2, 7, 8, 9, 10, 3] # third sentence: <go> hello i am fine <eos> ]), }
-
get_metric
(prediction_key='prediction')[source]¶ Get metrics for accuracy. In other words, this function provides metrics for sentence classification task.
It contains:
metric.AccuracyMetric
- Parameters
prediction_key (str) – The key of prediction over sentences. Refer to
metric.AccuracyMetric
. Default:prediction
.- Returns
A
metric.MetricChain
object.
-
SST¶
-
class
cotk.dataloader.
SST
(file_id, min_frequent_vocab_times=10, max_sent_length=50, min_rare_vocab_times=0, tokenizer='space', pretrained=None)[source]¶ A dataloader for preprocessed SST dataset.
- Parameters
file_id (str) – a str indicating the source of the SST dataset.
min_frequent_vocab_times (int) – A cut-off threshold of frequent tokens. All tokens that appear not less than min_frequent_vocab_times times in the training set will be marked as frequent words. Default: 10.
max_sent_length (int) – All sentences longer than max_sent_length will be shortened to first max_sent_length tokens. Default: 50.
min_rare_vocab_times (int) – A cut-off threshold of rare tokens. All tokens that appear not less than min_rare_vocab_times times in the whole dataset (except frequent words) will be marked as rare words. Otherwise, they are unknown words, both in training and testing stages. Default: 0 (No unknown words).
Refer to
SentenceClassification
for attributes and methods.References
[1] https://nlp.stanford.edu/sentiment/
[2] Socher R, Perelygin A, Wu J, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. EMNLP 2013.
Field¶
-
class
cotk.dataloader.
Field
[source]¶ A base class of data field, which specifies the format of the dataset. See Field and building a dataloader for a customized task for usages.
Notice
Field
object may be shared between different data fields, data sets, or dataloaders. Thus it only defines settings and does NOT store data.-
classmethod
get_all_subclasses
() → Iterable[Any]¶ Return a generator of all subclasses.
-
classmethod
load_class
(class_name) → Any¶ Return a subclass of
class_name
, case insensitively.- Parameters
class_name (str) – target class name.
-
get_vocab
() → Optional[cotk.dataloader.vocab.Vocab][source]¶ Get
Vocab
object for the field.None
if the field does not have a Vocab
.
-
get_tokenizer
() → Optional[cotk.dataloader.tokenizer.Tokenizer][source]¶ Get
Tokenizer
object for the field.None
if the field does not have a Tokenizer
.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Sentence¶
-
class
cotk.dataloader.
Sentence
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]¶ Bases:
dataloader.Field
A field for sentence. This class is a virtual class and the base of
SentenceDefault
,SentenceGPT2
andSentenceBERT
.If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set is used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
.convert_to_lower_letter (bool, optional) – Whether convert all the tokens to lower case after tokenization. Default:
False
.
- Input Formats
This field reads one line of sentence per sample.
-
tokenize
(sentence) → List[str][source]¶ Tokenize
sentence
.Convert tokens to lower case if
self.convert_to_lower_letter
isTrue
.
- Parameters
sentence (str) – The sentence to be tokenized.
-
tokenize_sentences
(sentences) → List[List[str]][source]¶ Tokenize
sentences
.Convert tokens to lower case if
self.convert_to_lower_letter
isTrue
.
- Parameters
sentences (List[str]) – The list of sentences to be tokenized.
-
convert_tokens_to_ids
(tokens, add_special=False, only_frequent_word=False) → List[int][source]¶ Convert list of tokens to list of ids.
- Parameters
tokens (List[str]) – The tokens to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
convert_ids_to_tokens
(ids, remove_special=True, trim=True) → List[str][source]¶ Convert list of ids to list of tokens.
- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
convert_sentence_to_ids
(sentence, add_special=False, only_frequent_word=False) → List[int][source]¶ Convert a sentence to a list of ids.
- Parameters
sentence (str) – The sentence to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
convert_ids_to_sentence
(ids, remove_special=True, trim=True) → str[source]¶ Convert list of tokens to a sentence.
- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
add_special_to_ids
(ids) → List[int][source]¶ Add special tokens, such as
go_id
oreos_id
to the inputids
.- Parameters
ids (List[int]) – The input ids.
-
remove_special_in_ids
(ids, remove_special=True, trim=True) → List[int][source]¶ Remove special ids in input ids.
- Parameters
ids (List[int]) – Input ids.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
trim_in_ids
(ids) → List[int][source]¶ Find the first special token indicating the sentence is over and remove all the tokens after it (included). Then remove all trailing
pad
.- Parameters
ids (List[int]) – The input ids.
-
process_sentences
(sentences, add_special=True, only_frequent_word=False, cut=True) → List[List[int]][source]¶ Process input sentences.
If sentences haven’t been tokenized, tokenize them by invoking
Sentence.tokenize_sentences()
.Then, convert the list of tokens to a list of ids.
If
self.max_sent_length
is notNone
andcut
isTrue
, sentences whose length is more than self.max_sent_length
, are shortened to the first self.max_sent_length
tokens.
- Parameters
sentences (List[str], List[List[str]]) – sentences can be a list of sentences or a list of lists of tokens.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:True
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.cut (bool, optional) – Whether to cut sentences with too many tokens. Default:
True
.
-
frequent_vocab_size
¶ int – The number of frequent words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
all_vocab_size
¶ int – The number of frequent words and rare words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
frequent_vocab_list
¶ list – The list of frequent words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
all_vocab_list
¶ list – The list of frequent words and rare words. Frequent words are always in the front of the list. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
get_special_tokens_mapping
() → MutableMapping[str, str]¶ Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
-
get_special_tokens_id
(name) → int¶ Get the id of a special token by specifying its general name. Raise
KeyError
if no such token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.- Parameters
name (str) – the general name, must be one of the following,
pad
,unk
,go
,eos
,sep
,cls
,mask
.
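Example (a sketch for a field whose vocabulary is a GeneralVocab with the default special tokens):
>>> field.get_special_tokens_mapping()
OrderedDict([('pad', '<pad>'), ('unk', '<unk>'), ('go', '<go>'), ('eos', '<eos>')])
>>> field.get_special_tokens_id("eos")
3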
-
pad_id
¶ int – The id of pad token. Raise
KeyError
if no pad token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
-
unk_id
¶ int – The id of unk token. Raise
KeyError
if no unk token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
SentenceDefault¶
-
class
cotk.dataloader.
SentenceDefault
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]¶ Bases:
dataloader.Sentence
,dataloader.Field
A commonly used field for sentences.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Formats
This field reads one line (a single sentence) per sample.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.The function will return a dict, containing:
FIELDNAME
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It only contains frequent vocabs, and rare words are replaced byunk_id
.FIELDNAME_allvocabs
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It contains frequent vocabs and rare vocabs.FIELDNAME_length
(np.ndarray[batch_size]
): The length of sentences.FIELDNAME_str
(List[str]
): The raw sentences.
where
FIELDNAME
is the name of the field.batch_size
islen(indexes)
.max_sent_length_in_batch
is the maximum length of sentences in the batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "Life", "is", "short", ".", >>> # "PHP", "the", "best", "language", "in", "world"] >>> # frequent_vocab_size = 11 >>> # frequent_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "Life", "is", "short", ".", >>> # "PHP", "the", "best"] >>> field.get_batch('sent', data, [0, 1]) { "sent": numpy.array([ [2, 4, 5, 6, 7, 3, 0, 0, 0, 0, 0], # <go> Life is short . <eos> <pad> <pad> <pad> <pad> <pad> [2, 8, 5, 9, 10, 1, 1, 9, 1, 7, 3], # <go> PHP is the best <unk> <unk> the <unk> . <eos> ]), "sent_length": numpy.array([6, 11]), # length of sentences "sent_allvocabs": numpy.array([ [2, 4, 5, 6, 7, 3, 0, 0, 0, 0, 0], # <go> Life is short . <eos> <pad> <pad> <pad> <pad> <pad> [2, 8, 5, 9, 10, 11, 12, 9, 13, 7, 3], # <go> PHP is the best language in the world . <eos> ]), "sent_str": [ "Life is short.", "PHP is the best language in the world.", ], }
SentenceGPT2¶
-
class
cotk.dataloader.
SentenceGPT2
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]¶ Bases:
dataloader.Sentence
,dataloader.Field
A field for sentence in the format of GPT2.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Formats
This field reads one line (a single sentence) per sample.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.The function will return a dict, containing:
FIELDNAME
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It only contains frequent vocabs, and rare words are replaced byunk_id
.FIELDNAME_allvocabs
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It contains frequent vocabs and rare vocabs.FIELDNAME_length
(np.ndarray[batch_size]
): The length of sentences.FIELDNAME_str
(List[str]
): The raw sentences.
where
FIELDNAME
is the name of the field.batch_size
islen(indexes)
.max_sent_length_in_batch
is the maximum length of sentences in the batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # This example is based on GPT2Tokenizer. The vocab files are in ./tests/dummy_gpt2vocab. >>> # field.eos_id = 413 # <|endoftext|>, also used for <pad>, <unk>, <go> >>> field.get_batch('sent', data, [0, 2]) { "sent": numpy.array([ [413, 6, 134, 321, 407, 107, 157, 121, 372, 201, 402, 105, 413, 413, 413, 413], # ['<|endoftext|>', 'A', 'Ġbicycle', 'Ġreplica', 'Ġwith', 'Ġa', 'Ġclock', 'Ġas', 'Ġthe', # 'Ġfront', 'Ġwheel', 'Ġ.', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>'] [413, 6, 149, 370, 330, 384, 126, 298, 236, 130, 107, 255, 298, 149, 105, 413], # ['<|endoftext|>', 'A', 'Ġcar', 'Ġthat', 'Ġseems', 'Ġto', 'Ġbe', 'Ġparked', 'Ġillegally', # 'Ġbehind', 'Ġa', 'Ġlegally', 'Ġparked', 'Ġcar', 'Ġ.', '<|endoftext|>'] ]), "sent_length": numpy.array([13, 16]), # length of sentences "sent_allvocabs": numpy.array([ [413, 6, 134, 321, 407, 107, 157, 121, 372, 201, 402, 105, 413, 413, 413, 413], # ['<|endoftext|>', 'A', 'Ġbicycle', 'Ġreplica', 'Ġwith', 'Ġa', 'Ġclock', 'Ġas', 'Ġthe', # 'Ġfront', 'Ġwheel', 'Ġ.', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>', '<|endoftext|>'] [413, 6, 149, 370, 330, 384, 126, 298, 236, 130, 107, 255, 298, 149, 105, 413], # ['<|endoftext|>', 'A', 'Ġcar', 'Ġthat', 'Ġseems', 'Ġto', 'Ġbe', 'Ġparked', 'Ġillegally', # 'Ġbehind', 'Ġa', 'Ġlegally', 'Ġparked', 'Ġcar', 'Ġ.', '<|endoftext|>'] ]), "sent_str": [ "A bicycle replica with a clock as the front wheel .", "A car that seems to be parked illegally behind a legally parked car .", ], }
SentenceBERT¶
-
class
cotk.dataloader.
SentenceBERT
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None)[source]¶ Bases:
dataloader.Sentence
,dataloader.Field
A field for sentence in the format of BERT.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Formats
This field reads one line (a single sentence) per sample.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.The function will return a dict, containing:
FIELDNAME
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It only contains frequent vocabs, and rare words are replaced byunk_id
.FIELDNAME_allvocabs
(np.ndarray[batch_size, max_sent_length_in_batch]
): Padded sentences in id formats. It contains frequent vocabs and rare vocabs.FIELDNAME_length
(np.ndarray[batch_size]
): The length of sentences.FIELDNAME_str
(List[str]
): The raw sentences.
where
FIELDNAME
is the name of the field.batch_size
islen(indexes)
.max_sent_length_in_batch
is the maximum length of sentences in the batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # This example is based on BertTokenizer. The vocab files are in ./tests/dummy_bertvocab. >>> field.get_batch('sent', data, [0, 1]) { "sent": numpy.array([ [101, 147, 37, 29, 359, 102, 0, 0, 0, 0, 0, 0, 0], # ['<cls>', 'How', 'are', 'you', '?', '<sep>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'] [101, 375, 334, 379, 127, 341, 350, 29, 328, 9, 29, 359, 102] # ['<cls>', 'i', ''', 'm', 'fine', '.', 'thank', 'you', '!', 'and', 'you', '?', '<sep>'] ]), "sent_length": numpy.array([6, 13]), # length of sentences, "sent_allvocabs": numpy.array([ [101, 147, 37, 29, 359, 102, 0, 0, 0, 0, 0, 0, 0], # ['<cls>', 'how', 'are', 'you', '?', '<sep>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'] [101, 375, 334, 379, 127, 341, 350, 29, 328, 9, 29, 359, 102] # ['<cls>', 'i', ''', 'm', 'fine', '.', 'thank', 'you', '!', 'and', 'you', '?', '<sep>'] ]), "sent_str": [ "How are you?", "I'm fine. Thank you! And you?" ], }
Session¶
-
class
cotk.dataloader.
Session
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]¶ Bases:
dataloader.Field
A field for session. Each session is a list of sentences.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.max_turn_length (int, _InfiniteLength, optional) – Set the maximum turn length of a session. If it’s an integer, any session whose turn length is more than
max_turn_length
is shortened to the firstmax_turn_length
turns. The remaining turns are ignored. If it’sNone
orSentence.INFINITE_LENGTH
, sessions won’t be shortened and all turns are retained. Default:None
.
- Input Format
This field reads multiple lines (one sentence per line) per sample, until a blank line.
-
tokenize
(sentence) → List[str]¶ Tokenize
sentence
.Convert tokens to lower case if
self.convert_to_lower_letter
isTrue
.
- Parameters
sentence (str) – The sentence to be tokenized.
-
tokenize_sentences
(sentences) → List[List[str]]¶ Tokenize
sentences
.Convert tokens to lower case if
self.convert_to_lower_letter
isTrue
.
- Parameters
sentences (List[str]) – The list of sentences to be tokenized.
-
tokenize_sessions
(sessions) → List[List[List[str]]][source]¶ Tokenize
sessions
.Convert the tokens to lower case if
self.convert_to_lower_letter
isTrue
.
- Parameters
sessions (List[List[str]]) – The list of sessions to be tokenized.
-
convert_tokens_to_ids
(tokens, add_special=False, only_frequent_word=False) → List[int]¶ Convert list of tokens to list of ids.
- Parameters
tokens (List[str]) – The tokens to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
convert_ids_to_tokens
(ids, remove_special=True, trim=True) → List[str]¶ Convert list of ids to list of tokens.
- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
convert_sentence_to_ids
(sentence, add_special=False, only_frequent_word=False) → List[int]¶ Convert a sentence to a list of ids.
- Parameters
sentence (str) – The sentence to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
-
convert_ids_to_sentence
(ids, remove_special=True, trim=True) → str¶ Convert a list of ids to a sentence.
- Parameters
ids (List[int]) – The ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
convert_multi_turn_tokens_to_ids
(session, add_special=False, only_frequent_word=False) → List[List[int]][source]¶ Convert list of tokenized sentences to list of sentence ids.
- Parameters
session (List[List[str]]) – The tokenized sentences to be converted.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:False
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.
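Example (a sketch; the ids assume a hypothetical vocabulary where "Life", "is", "short", ".", "PHP", "the", "best" map to 4–10 and go_id = 2, eos_id = 3):
>>> field.convert_multi_turn_tokens_to_ids(
...     [["Life", "is", "short", "."], ["PHP", "is", "the", "best", "."]],
...     add_special=True)
[[2, 4, 5, 6, 7, 3], [2, 8, 5, 9, 10, 7, 3]]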
-
convert_multi_turn_ids_to_tokens
(session_ids, remove_special=True, trim=True)[source]¶ Convert a list of sentence ids to a list of tokenized sentences.
- Parameters
session_ids (List[List[int]]) – The sentence ids to be converted.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
add_special_to_ids
(ids) → List[int]¶ Add special tokens, such as
go_id
oreos_id
to the inputids
.- Parameters
ids (List[int]) – The input ids.
-
remove_special_in_ids
(ids, remove_special=True, trim=True) → List[int]¶ Remove special ids in input ids.
- Parameters
ids (List[int]) – Input ids.
remove_special (bool, optional) – If
True
, detect and try to do a reverse operation ofadd_special
inconvert_tokens_to_ids()
. It will not removeunk
or special tokens in the middle of sentences. Default:True
.trim (bool, optional) – If
True
, usetrim_in_ids()
to remove trailingpad
andeos
. Default:True
.
-
trim_in_ids
(ids) → List[int]¶ Find the first special token indicating the sentence is over, then remove that token and all tokens after it. Finally, remove all trailing
pad
.- Parameters
ids (List[int]) – The input ids.
-
multi_turn_trim_in_ids
(session_ids) → List[List[int]][source]¶ For each sentence's ids in the session, find the first special token indicating the sentence is over, then remove that token and all tokens after it. Finally, remove all trailing
pad
.- Parameters
session_ids (List[List[int]]) – The input ids of session.
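Example (a sketch assuming eos_id = 3 and pad_id = 0; every sentence in the session is trimmed independently):
>>> field.multi_turn_trim_in_ids([[2, 4, 5, 3, 0, 0], [2, 8, 9, 3]])
[[2, 4, 5], [2, 8, 9]]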
-
process_sentences
(sentences, add_special=True, only_frequent_word=False, cut=True) → List[List[int]]¶ Process input sentences.
If sentences haven’t been tokenized, tokenize them by invoking
Sentence.tokenize_sentences()
.Then, convert the list of tokens to a list of ids.
If
self.max_sent_length
is notNone
andcut
isTrue
, sentences whose length is more thanself.max_sent_length
are shortened to the firstself.max_sent_length
tokens.
- Parameters
sentences (List[str], List[List[str]]) – sentences can be a list of sentences or a list of lists of tokens.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:True
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.cut (bool, optional) – Whether to cut sentences with too many tokens. Default:
True
.
-
process_sessions
(sessions, add_special=True, only_frequent_word=False, cut=True)[source]¶ Process input sessions.
If
self.max_turn_length
is notNone
andcut
isTrue
, sessions whose number of turns is more thanself.max_turn_length
are shortened to the firstself.max_turn_length
sentences. If sessions haven’t been tokenized, tokenize them by invoking
self.tokenize_sessions()
. Then, convert the list of tokens to a list of ids.
If
self.max_sent_length
is notNone
andcut
isTrue
, sentences whose length is more thanself.max_sent_length
are shortened to the firstself.max_sent_length
tokens.
- Parameters
sessions (List[List[str]], List[List[List[str]]]) – sentences in a session can be a str or a list of tokens.
add_special (bool, optional) – If
True
, special tokens (e.g.go
,eos
) are added. Default:True
.only_frequent_word (bool, optional) – If
True
, rare vocabs will be replaced byunk_id
. Default:False
.cut (bool, optional) – Whether to cut sessions/sentences with too many sentences/tokens. Default:
True
.
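Example (a sketch with untokenized input, reusing the hypothetical vocabulary of the examples above; as with process_sentences(), the result is not padded):
>>> field.process_sessions([["Life is short .", "PHP is the best ."]])
[[[2, 4, 5, 6, 7, 3], [2, 8, 5, 9, 10, 7, 3]]]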
-
frequent_vocab_size
¶ int – The number of frequent words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
all_vocab_size
¶ int – The number of frequent words and rare words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
frequent_vocab_list
¶ list – The list of frequent words. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
all_vocab_list
¶ list – The list of frequent words and rare words. Frequent words are always in the front of the list. It calls the method with the identical name of the
Vocab
instance, fromself.get_vocab()
.
-
get_special_tokens_mapping
() → MutableMapping[str, str]¶ Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
-
get_special_tokens_id
(name) → int¶ Get the id of a special token by specifying its general name. Raise
KeyError
if no such token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.- Parameters
name (str) – the general name, must be one of the following,
pad
,unk
,go
,eos
,sep
,cls
,mask
.
-
pad_id
¶ int – The id of pad token. Raise
KeyError
if no pad token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
-
unk_id
¶ int – The id of unk token. Raise
KeyError
if no unk token in this instance. It calls the method with the identical name of theVocab
instance, fromself.get_vocab()
.
SessionDefault¶
-
class
cotk.dataloader.
SessionDefault
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]¶ Bases:
dataloader.Session
,dataloader.Field
A commonly used field for sessions.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Format
This field reads multiple lines (one sentence per line) per sample, until a blank line.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ - Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.
The function will return a dict, containing:
FIELDNAME
(np.ndarray[batch_size, max_turn_length_in_batch, max_sent_length_in_batch]
): Padded sessions in id formats. It only contains frequent vocabs, and rare words are replaced byunk_id
.FIELDNAME_allvocabs
(np.ndarray[batch_size, max_turn_length_in_batch, max_sent_length_in_batch]
): Padded sessions in id formats. It contains frequent vocabs and rare vocabs.FIELDNAME_turn_length
(np.ndarray[batch_size]
): The turn numbers of sessions.FIELDNAME_sent_length
(List[List[int]]
): The length of sentences of sessions.FIELDNAME_str
(List[str]
): The raw sessions.
where
FIELDNAME
is the name of the field.batch_size
islen(indexes)
.max_turn_length_in_batch
is the maximum turn number of sessions in the batch.max_sent_length_in_batch
is the maximum length of sentences in the batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # dataset = iter(['How are you?\n', "I'm fine. And you?\n", "I'm fine, too.\n", "\n", >>> # "How to install cotk?\n", "pip install cotk.\n", "\n"]) >>> # min_frequent_vocab_times = 2 >>> # all_vocab_list = ['<pad>', '<unk>', '<go>', '<eos>', '.', '?', "'", 'How', 'I', >>> # 'cotk', 'fine', 'install', 'm', 'you', ',', 'And', 'are', 'pip', 'to', 'too'] >>> # frequent_vocab_size = 14 >>> # frequent_vocab_list = ['<pad>', '<unk>', '<go>', '<eos>', '.', '?', "'", 'How', 'I', >>> # 'cotk', 'fine', 'install', 'm', 'you'] >>> # data = { >>> # 'id': [ >>> # [ >>> # [2, 7, 16, 13, 5, 3], >>> # [2, 8, 6, 12, 10, 4, 15, 13, 5, 3], >>> # [2, 8, 6, 12, 10, 14, 19, 4, 3], >>> # ], >>> # [ >>> # [2, 7, 18, 11, 9, 5, 3], >>> # [2, 17, 11, 9, 4, 3], >>> # ] >>> # ], >>> # 'str': [ >>> # [ >>> # 'How are you?', >>> # "I'm fine. And you?", >>> # "I'm fine, too." >>> # ], >>> # [ >>> # 'How to install cotk?', >>> # 'pip install cotk.' >>> # ] >>> # >>> # } >>> field.get_batch('session', data, [0, 1]) { 'session_turn_length': numpy.array([3, 2]), 'session_sent_length': [ [6, 10, 9], [7, 6] ], 'session': numpy.array([ [ [ 2, 7, 1, 13, 5, 3, 0, 0, 0, 0], # <go> How <unk> you? <eos> <pad> <pad> <pad> <pad> [ 2, 8, 6, 12, 10, 4, 1, 13, 5, 3], # <go> I'm fine. <unk> you? <eos> [ 2, 8, 6, 12, 10, 1, 1, 4, 3, 0] # <go> I'm fine <unk> <unk>. <eos> <pad> ], [ [ 2, 7, 1, 11, 9, 5, 3, 0, 0, 0], # <go> How <unk> install cotk? <eos> <pad> <pad> <pad> [ 2, 1, 11, 9, 4, 3, 0, 0, 0, 0], # <go> <unk> install cotk. <eos> <pad> <pad> <pad> <pad> [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # all <pad> ] ]), 'session_allvocabs': numpy.array([ [ [ 2, 7, 16, 13, 5, 3, 0, 0, 0, 0], # <go> How are you? <eos> <pad> <pad> <pad> <pad> [ 2, 8, 6, 12, 10, 4, 15, 13, 5, 3], # <go> I'm fine. And you? <eos> [ 2, 8, 6, 12, 10, 14, 19, 4, 3, 0] # <go> I'm fine, too. <eos> <pad> ], [ [ 2, 7, 18, 11, 9, 5, 3, 0, 0, 0], # <go> How to install cotk? <eos> <pad> <pad> <pad> [ 2, 17, 11, 9, 4, 3, 0, 0, 0, 0], # <go> pip install cotk. <eos> <pad> <pad> <pad> <pad> [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # all <pad> ] ]), 'session_str': [ [ 'How are you?', "I'm fine. And you?", "I'm fine, too." ], [ 'How to install cotk?', 'pip install cotk.' ] ] }
SessionGPT2¶
-
class
cotk.dataloader.
SessionGPT2
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]¶ Bases:
dataloader.Session
,dataloader.Field
A field for session in the format of GPT2.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Format
This field reads multiple lines (one sentence per line) per sample, until a blank line.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ - Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Note: only the structure of the return value of get_batch is shown here; the real value of each entry may depend on the loaded vocab.
Examples
>>> from transformers.tokenization_gpt2 import GPT2Tokenizer >>> from cotk.dataloader.tokenizer import PretrainedTokenizer >>> tokenizer = GPT2Tokenizer.from_pretrained('gpt2') >>> field = SessionGPT2(PretrainedTokenizer(tokenizer)) >>> field_content = field._create('train') >>> dataset = iter(['How are you?\n', "I'm fine. Thank you! And you?\n", "I'm fine, too.\n", "\n", "How to install CoTk?\n", "pip install cotk.\n", "\n"]) >>> while True: ... try: ... field_content.read_next(dataset) ... except StopIteration: ... break >>> field_content.process_before_vocab() >>> field.vocab.build_vocab() >>> data = field_content.get_data() >>> data {'id': [[[2, 8, 18, 6, 5, 3], [2, 9, 7, 12, 10, 4, 17, 6, 13, 15, 6, 5, 3], [2, 9, 7, 12, 10, 14, 22, 4, 3]], [[2, 8, 21, 11, 16, 5, 3], [2, 20, 11, 19, 4, 3]]], 'str': [['How are you?', "I'm fine. Thank you! And you?", "I'm fine, too."], ['How to install CoTk?', 'pip install cotk.']]} >>> batch_data = field.get_batch('session', data, [1]) >>> batch_data {'session_turn_length': array([2]), 'session_sent_length': [[7, 6]], 'session': array([[[ 2, 8, 21, 11, 16, 5, 3], [ 2, 20, 11, 19, 4, 3, 0]]]), 'session_allvocabs': array([[[ 2, 8, 21, 11, 16, 5, 3], [ 2, 20, 11, 19, 4, 3, 0]]]), 'session_str': [['How to install CoTk?', 'pip install cotk.']]} >>> # 'session_turn_length' (`name` + '_turn_length') is a :class:`np.ndarray` object with shape == (batch size, ). Each element is the length of corresponding sssion. >>> # 'session_sent_length' (`name` + '_sent_length') is List[List[int]]. Each integer is the length of corresponding sentence. >>> # 'session' (`name`) is a :class:`np.ndarray` object with shape == (batch size, max turn length, max sentence length). >>> # batch_data['session'][i, j] is a sentence. batch_data['session'][i, j, k] is an id. >>> # If `self.max_turn_length` is not None and j >= `self.max_turn_length` or `self.max_sent_length` is not None and k >= `self.max_sent_length`, >>> # batch_data['session'][i, j, k] is `self.eos_id`. >>> # 'session_allvocabs' (`name` + '_allvocabs') is the same with 'session'.
SessionBERT¶
-
class
cotk.dataloader.
SessionBERT
(tokenizer=None, vocab=None, vocab_from_mappings=None, max_sent_length=None, convert_to_lower_letter=None, max_turn_length=None)[source]¶ Bases:
dataloader.Session
,dataloader.Field
A field for session in the format of BERT.
If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
tokenizer (
Tokenizer
, str, optional) – How to tokenize sentence. ifstr
, see tokenizer for possible value. No default value,KeyError
will be raised.vocab (
Vocab
, optional) – The vocabulary used for this field. Sharing this object between fields can build vocabulary together. No default value,KeyError
will be raised.vocab_from_mappings (Dict[str, str], optional) – Infer the set type (train, test, or extra) from the set name. For example,
DEFAULT_VOCAB_FROM_MAPPINGS["dev"] == "test"
means that the words from the “dev” set are used for test. Default: See the table for default value.max_sent_length (int, _InfiniteLength, optional) – All sentences longer than
max_sent_length
will be shortened to firstmax_sent_length
tokens. If it’sNone
orSentence.INFINITE_LENGTH
, sentences won’t be shortened no matter how long they are. Default:None
convert_to_lower_letter (bool, optional) – Whether to convert all the tokens to lower case after tokenization. Default:
False
.
- Input Format
This field reads multiple lines (one sentence per line) per sample, until a blank line.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ - Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Note: only the structure of the return value of get_batch is shown here; the real value of each entry may depend on the loaded vocab.
Examples
>>> from transformers.tokenization_bert import BertTokenizer >>> from cotk.dataloader.tokenizer import PretrainedTokenizer >>> tokenizer = BertTokenizer.from_pretrained('bert') >>> field = SessionBERT(PretrainedTokenizer(tokenizer)) >>> field_content = field._create('train') >>> dataset = iter(['How are you?\n', "I'm fine. Thank you! And you?\n", "I'm fine, too.\n", "\n", "How to install CoTk?\n", "pip install cotk.\n", "\n"]) >>> while True: ... try: ... field_content.read_next(dataset) ... except StopIteration: ... break >>> field_content.process_before_vocab() >>> field.vocab.build_vocab() >>> data = field_content.get_data() >>> data {'id': [[[2, 8, 18, 6, 5, 3], [2, 9, 7, 12, 10, 4, 17, 6, 13, 15, 6, 5, 3], [2, 9, 7, 12, 10, 14, 22, 4, 3]], [[2, 8, 21, 11, 16, 5, 3], [2, 20, 11, 19, 4, 3]]], 'str': [['How are you?', "I'm fine. Thank you! And you?", "I'm fine, too."], ['How to install CoTk?', 'pip install cotk.']]} >>> batch_data = field.get_batch('session', data, [1]) >>> batch_data {'session_turn_length': array([2]), 'session_sent_length': [[7, 6]], 'session': array([[[ 2, 8, 21, 11, 16, 5, 3], [ 2, 20, 11, 19, 4, 3, 0]]]), 'session_allvocabs': array([[[ 2, 8, 21, 11, 16, 5, 3], [ 2, 20, 11, 19, 4, 3, 0]]]), 'session_str': [['How to install CoTk?', 'pip install cotk.']]} >>> # 'session_turn_length' (`name` + '_turn_length') is a :class:`np.ndarray` object with shape == (batch size, ). Each element is the length of corresponding sssion. >>> # 'session_sent_length' (`name` + '_sent_length') is List[List[int]]. Each integer is the length of corresponding sentence. >>> # 'session' (`name`) is a :class:`np.ndarray` object with shape == (batch size, max turn length, max sentence length). >>> # batch_data['session'][i, j] is a sentence. batch_data['session'][i, j, k] is an id. >>> # If `self.max_turn_length` is not None and j >= `self.max_turn_length` or `self.max_sent_length` is not None and k >= `self.max_sent_length`, >>> # batch_data['session'][i, j, k] is `self.pad_id`. >>> # 'session_allvocabs' (`name` + '_allvocabs') is the same with 'session'.
DenseLabel¶
-
class
cotk.dataloader.
DenseLabel
[source]¶ Bases:
dataloader.Field
A field of categorical labels whose values are integers ranging from
0
tolabel_types - 1
.See
dataloader.SparseLabel
for labels instr
or sparse integer.- Parameters
This class does not contain arguments for initialization.
- Input Format
This field reads one line per sample. The line must be an integer.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.The function will return a dict, containing:
FIELDNAME
(np.ndarray[batch_size]
): Labels of corresponding batched data.
where
FIELDNAME
is the name of the field.
- Parameters
name (str) – name of the field.
data (Any) – the data stored in dataloader.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # data = {'label': [1, 0]} >>> field.get_batch('label', data, [0, 1]) { 'label': numpy.array([1, 0]) }
SparseLabel¶
-
class
cotk.dataloader.
SparseLabel
(vocab=None)[source]¶ Bases:
dataloader.Field
A field of categorical labels whose values are strings or sparse integers.
See
dataloader.DenseLabel
for labels in dense integers.If any argument is not specified, the value will be first retrieved from
FieldContext
. If stillNone
, default value will be used.- Parameters
vocab (
SimpleVocab
, optional) – The vocab to store all the labels. IfNone
, aSimpleVocab
is automatically created.
- Input Format
This field reads one line per sample. The line can be an arbitrary string.
-
get_batch
(name, data, indexes) → Dict[str, Any][source]¶ Invoked by
LanguageProcessing.get_batch()
, return the batched data specified by this field. This function is for INTERNAL USE only, but it shows the data format of the returned batch.The function will return a dict, containing:
FIELDNAME_id
(np.ndarray[batch_size]
): Ids of corresponding labels.FIELDNAME_str
(List[str]
): Raw labels of the batched data.
where
FIELDNAME
is the name of the field.
- Parameters
name (str) – name of the field.
data (Dict[str, Any]) – the object returned by
_SparseLabelContent.get_data()
.
data[‘str’] contains the raw labels; data[‘id’] contains the ids of the labels.
indexes (List[int]) – the indexes of the data in this batch
Examples
>>> # data = { >>> # 'id': [0, 2, 1, 0], >>> # 'str': ['Java', 'Python', 'Cpp', 'Java'] >>> # } >>> field.get_batch('label', data, [0, 1]) { 'label_id': numpy.array([0, 2]), # Ids of corresponding labels. 'label_str': ['Java', 'Python'] # Raw labels. }
Tokenizer¶
-
class
cotk.dataloader.
Tokenizer
[source]¶ Tokenizer is used for splitting sentences into tokens. This is an abstract base class. It often works as a part of
Field.
-
tokenize
(sentence) → List[str][source]¶ Tokenize a sentence to a list of tokens.
- Parameters
sentence (str) – a sentence to tokenize.
-
tokenize_sentences
(sentences) → List[List[str]][source]¶ Tokenize a list of sentences to a list of lists of tokens.
- Parameters
sentences (List[str]) – sentences to tokenize.
-
tokenize_sessions
(sessions) → List[List[List[str]]][source]¶ Tokenize sessions to a 3-d list of tokens.
- Parameters
sessions (List[List[str]]) – sessions to tokenize.
-
convert_tokens_to_sentence
(tokens) → str[source]¶ Convert tokens to sentence. It usually works like the reverse operation of
tokenize()
, but this is not guaranteed. It may be like" ".join(tokens)
, but some special conditions and tokens will be taken care of.- Parameters
tokens (List[str]) – tokenized sentence
-
SimpleTokenizer¶
-
class
cotk.dataloader.
SimpleTokenizer
(method, special_tokens=None)[source]¶ Bases:
dataloader.Tokenizer
A simple tokenizer.
method
can either benltk
orspace
. Ifnltk
, useWordPunctTokenizer
fromnltk.tokenize
. Ifspace
, usestr.split(" ")
.- Parameters
method (str) – the tokenization method,
nltk
orspace
.special_tokens (List[str]) – special tokens not to tokenize, such as
<go>
.
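Example (a sketch; with method="space" the sentence is split on single spaces, while "nltk" would also separate punctuation from words):
>>> from cotk.dataloader import SimpleTokenizer
>>> tokenizer = SimpleTokenizer("space", special_tokens=["<go>", "<eos>"])
>>> tokenizer.tokenize("<go> Life is short . <eos>")
['<go>', 'Life', 'is', 'short', '.', '<eos>']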
PretrainedTokenizer¶
-
class
cotk.dataloader.
PretrainedTokenizer
(tokenizer)[source]¶ Bases:
dataloader.Tokenizer
A wrapper for
PreTrainedTokenizer
fromtransformers
package. If you don’t want to do tokenization on some special tokens, seetransformers.PreTrainedTokenizer.add_special_tokens
.- Parameters
tokenizer (transformers.PreTrainedTokenizer) – An instance of
transformers.PreTrainedTokenizer
.
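Example (a sketch following the imports used in the SessionGPT2 example above; it assumes the transformers package and the "gpt2" pretrained files are available, and the exact subword split depends on the pretrained vocabulary):
>>> from transformers.tokenization_gpt2 import GPT2Tokenizer
>>> from cotk.dataloader.tokenizer import PretrainedTokenizer
>>> tokenizer = PretrainedTokenizer(GPT2Tokenizer.from_pretrained('gpt2'))
>>> tokenizer.tokenize("Life is short .")
['Life', 'Ġis', 'Ġshort', 'Ġ.']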
Vocab¶
-
class
cotk.dataloader.
Vocab
[source]¶ A class for storing vocabulary. This is an abstract base class. It often works as a part of
Field
or is shared betweenField
.See introduction of vocabulary for more information.
- Parameters
This class does not contain arguments for initialization.
-
classmethod
get_all_subclasses
() → Iterable[Any]¶ Return a generator of all subclasses.
-
classmethod
load_class
(class_name) → Any¶ Return the subclass named
class_name
, case insensitively.- Parameters
class_name (str) – target class name.
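Example (a sketch; matching is case-insensitive):
>>> from cotk.dataloader import Vocab
>>> Vocab.load_class("GeneralVocab")
<class 'cotk.dataloader.vocab.GeneralVocab'>
>>> Vocab.load_class("generalvocab")
<class 'cotk.dataloader.vocab.GeneralVocab'>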
-
add_tokens
(tokens, vocab_from) → None[source]¶ Add tokens to this vocabulary instance; the tokens will be used for building the vocabulary list. Must be called before
build_vocab()
.- Parameters
tokens (List[str]) – A list of tokens to add to the vocabulary.
vocab_from (str) – One of
train
,test
,extra
.train
: The tokens are from the training data. Frequent vocabs are selected from tokens of this type.test
: The tokens are from the validation data or test data. Rare vocabs are selected from tokens of this type.extra
: The tokens are from extra data. The tokens of this type will not be selected as frequent or rare vocabs.
-
build_vocab
()[source]¶ Build the vocabulary list according to the tokens from
add_tokens()
.
-
convert_tokens_to_ids
(tokens, only_frequent_word=False) → List[int][source]¶ Convert list of tokens to list of ids.
- Parameters
tokens (List[str]) – List of tokens.
only_frequent_word (bool, optional) – Use
unk
for rare tokens. Defaults: False.
-
convert_ids_to_tokens
(ids) → List[str][source]¶ Convert list of ids to list of tokens.
- Parameters
ids (List[int]) – List of ids.
-
frequent_vocab_size
¶ int – The number of frequent words.
-
all_vocab_size
¶ int – The number of frequent words and rare words.
-
frequent_vocab_list
¶ list – The list of frequent words.
-
all_vocab_list
¶ list – The list of frequent words and rare words. Frequent words are always in the front of the list.
-
get_special_tokens_mapping
() → MutableMapping[str, str][source]¶ Get special tokens mapping. Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
.
-
get_special_tokens_id
(name) → int[source]¶ Get the id of a special token by specifying its general name. Raise
KeyError
if no such token in this instance.- Parameters
name (str) – the general name, must be one of the following,
pad
,unk
,go
,eos
,sep
,cls
,mask
.
-
pad_id
¶ int – The id of pad token. Raise
KeyError
if no pad token in this instance.
-
unk_id
¶ int – The id of unk token. Raise
KeyError
if no unk token in this instance.
-
go_id
¶ int – The id of go token. Raise
KeyError
if no go token in this instance.
-
eos_id
¶ int – The id of eos token. Raise
KeyError
if no eos token in this instance.
GeneralVocab¶
-
class
cotk.dataloader.
GeneralVocab
(min_frequent_vocab_times=None, min_rare_vocab_times=None, special_tokens_mapping=None, special_appeared_in_data=None)[source]¶ Bases:
dataloader.Vocab
A vocabulary class for general use.
This class always has the following 4 special tokens:
pad
,unk
,go
,eos
.If any argument is not specified, the value will be first retrieved from
VocabContext
. If stillNone
, default value will be used.- Parameters
min_frequent_vocab_times (int, optional) – Tokens from training data that appear no less than
min_frequent_vocab_times
times will be regarded as frequent words. Default:0
min_rare_vocab_times (int, optional) – Tokens from training data or test data that appear more than
min_rare_vocab_times
times will be regarded as rare words (frequent words excluded). Default:0
special_tokens_mapping (OrderedDict, optional) – Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It must at least containpad
,unk
,go
,eos
. The values of different special tokens cannot be the same. Default: IfNone
, it will beOrderedDict([("pad", "<pad>"), ("unk", "<unk>"), ("go", "<go>"), ("eos", "<eos>")])
.special_appeared_in_data (bool, optional) – Whether the strings of special tokens may appear in the data. Default: If not specified, it will be
False
.
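Example (a sketch of the typical build workflow; with min_frequent_vocab_times=2, only tokens appearing at least twice in the training data become frequent words, so the size below rests on that assumption):
>>> from cotk.dataloader import GeneralVocab
>>> vocab = GeneralVocab(min_frequent_vocab_times=2)
>>> vocab.add_tokens(["Life", "is", "short", ".", "Life", "is"], vocab_from="train")
>>> vocab.add_tokens(["PHP", "is", "the", "best", "."], vocab_from="test")
>>> vocab.build_vocab()
>>> vocab.frequent_vocab_size    # 4 special tokens + "Life" + "is"
6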
-
static
from_predefined
(vocab_list, frequent_vocab_size, special_tokens_mapping=None) → cotk.dataloader.vocab.GeneralVocab[source]¶ Return a
GeneralVocab
instance, whose vocabulary comes from a predefined list. Seefrom_predefined_vocab()
if you want to use the vocabulary from an existingGeneralVocab
instance.- Parameters
vocab_list (List[str]) – A list of all vocabulary.
frequent_vocab_size (int) – the number of frequent words. The frequent words must be at the front of the
vocab_list
.special_tokens_mapping (OrderedDict, optional) – Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It must at least containpad
,unk
,go
,eos
. The values of different special tokens cannot be the same. Special tokens MUST be in the front of thefrequent_vocab_list
(order sensitive). Default: IfNone
, it will beOrderedDict([("pad", "<pad>"), ("unk", "<unk>"), ("go", "<go>"), ("eos", "<eos>")])
.
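Example (a sketch; the first frequent_vocab_size entries of vocab_list are treated as frequent words and the rest as rare words):
>>> from cotk.dataloader import GeneralVocab
>>> vocab = GeneralVocab.from_predefined(
...     ["<pad>", "<unk>", "<go>", "<eos>", "Life", "is", "short", ".", "PHP"],
...     frequent_vocab_size=8)
>>> vocab.all_vocab_size
9
>>> vocab.convert_tokens_to_ids(["Life", "is", "PHP"], only_frequent_word=True)  # "PHP" is rare -> unk_id
[4, 5, 1]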
-
static
from_predefined_vocab
(vocab) → cotk.dataloader.vocab.GeneralVocab[source]¶ Return a new
GeneralVocab
instance fromvocab
. The new instance has the same vocabulary list as the old one.- Parameters
vocab (
GeneralVocab
) – The old instance.
-
static
from_frequent_word
(frequent_vocab_list, special_tokens_mapping=None) → cotk.dataloader.vocab.GeneralVocab[source]¶ Return a
GeneralVocab
instance, whose vocabulary comes from a predefined frequent word list. Its rare word list can be built later. See
if you want to use the frequent vocabulary from an existingGeneralVocab
instance.- Parameters
frequent_vocab_list (List[str]) – A list of frequent vocabulary.
special_tokens_mapping (OrderedDict, optional) – Special tokens mapping is an ordered dict mapping the general name of special tokens to its string. The key must be one of the following:
pad
,unk
,go
,eos
,sep
,cls
,mask
. The value can be arbitrary string, e.g.,"<pad>"
,"<unk>"
. It must at least containpad
,unk
,go
,eos
. The values of different special tokens cannot be the same. Special tokens MUST be in the front of thefrequent_vocab_list
(order sensitive). Default: IfNone
, it will beOrderedDict([("pad", "<pad>"), ("unk", "<unk>"), ("go", "<go>"), ("eos", "<eos>")])
.
-
static
from_frequent_word_of_vocab
(vocab) → cotk.dataloader.vocab.GeneralVocab[source]¶ Return a
GeneralVocab
instance, which has the same frequent vocabulary list as the old one. The rare word list can be built later.- Parameters
vocab (
GeneralVocab
) – The old instance to provide frequent words.
PretrainedVocab¶
-
class
cotk.dataloader.
PretrainedVocab
(tokenizer)[source]¶ Bases:
dataloader.Vocab
Use the vocabulary from a pretrained tokenizer in
transformers
package. This class is usually used for pretrained models, and it does NOT have rare words.Unlike
GeneralVocab
, this class does not always have
,unk
,go
,eos
. Some special tokens may refer to the same token.- Parameters
tokenizer (
transformers.PreTrainedTokenizer
) – A pretrained tokenizer from the transformers package.
-
frequent_vocab_list
¶ list – The list of frequent words.
-
all_vocab_list
¶ list – The list of frequent words and rare words. Frequent words are always in the front of the list.
SimpleVocab¶
-
class
cotk.dataloader.
SimpleVocab
[source]¶ Bases:
dataloader.Vocab
A very simple vocabulary class. No rare vocabs or special tokens. Used by
SparseLabel
.- Parameters
This class does not contain arguments for initialization.
Context¶
-
class
cotk.dataloader.
Context
(parameter_dict, weak=False, none_as_ignored=True)[source]¶ An abstract base class for context manager.
This class is used for setting default parameters for
Field
orVocab
, without directly passing parameters to__init__
of the object.See examples for how to use context manager.
- Parameters
parameter_dict (Dict[str, Any]) – Key-value dict for changed parameters.
weak (bool, optional) – When
False
, overwrite existing parameters. Default:False
.none_as_ignored (bool, optional) – When
True
,None
values inparameter_dict
are ignored. Otherwise, the corresponding key will be set toNone
. Default:True
.
-
classmethod
get
(key, default=None, no_default=False) → Any[source]¶ Get the value of parameter named
key
stored in this class.- Parameters
key (str) – name of the parameter
default (Any, optional) – Default value if
key
is not set. Defaults:None
.no_default (bool, optional) – When
True
, RaiseKeyError
ifkey
is not set. Defaults:False
.
-
classmethod
set
(key, value, weak=False, none_as_ignored=True) → Any[source]¶ Set the parameter named
key
tovalue
, stored in this class. If weak isTrue
, do not overwrite ifkey
is already set. Return the old value.- Parameters
key (str) – The name of the changed parameter.
value (Any) – The new value of changed parameter. If want to delete the key, use
Context.UNDEFINED
.weak (bool, optional) – When
False
, overwrite existing parameters. Defaults:False
.none_as_ignored (bool, optional) – When
True
,None
values inparameter_dict
are ignored. Otherwise, the corresponding value will be set toNone
. Default:True
.
FieldContext¶
-
class
cotk.dataloader.
FieldContext
(parameter_dict, weak=False, none_as_ignored=True)[source]¶ Bases:
dataloader.Context
A context class for setting default parameters for
Field
.-
classmethod
set_parameters
(*, weak=False, none_as_ignored=True, **kwargs) → cotk.dataloader.context.FieldContext[source]¶ Set a context for initialization of
Field
. See examples for how to use context manager.- Parameters
weak (bool, optional) – When
False
, overwrite existing parameters. Defaults:False
.none_as_ignored (bool, optional) – When
True
,None
values inkwargs
are ignored. Otherwise, the corresponding value will be set toNone
. Default:True
.**kwargs – Any parameters to be set. Set
key
toFieldContext.UNDEFINED
to delete a parameter.
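Example (a sketch of the intended usage; it assumes FieldContext.set_parameters() is used as a context manager and that "space" is accepted as a str tokenizer value):
>>> from cotk.dataloader import FieldContext, SentenceDefault, GeneralVocab
>>> with FieldContext.set_parameters(tokenizer="space", vocab=GeneralVocab(), max_sent_length=20):
...     post_field = SentenceDefault()   # tokenizer, vocab and max_sent_length come from the context
...     resp_field = SentenceDefault()   # shares the same defaults (and the same vocab object)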
VocabContext¶
-
class
cotk.dataloader.
VocabContext
(parameter_dict, weak=False, none_as_ignored=True)[source]¶ Bases:
dataloader.Context
A context class for setting default parameters for
Vocab
.-
classmethod
set_parameters
(*, weak=False, none_as_ignored=True, **kwargs) → cotk.dataloader.context.VocabContext[source]¶ Set a context for initialization of
Vocab
. See examples for how to use context manager.- Parameters
weak (bool, optional) – When
False
, overwrite existing parameters. Defaults:False
.none_as_ignored (bool, optional) – When
True
,None
values inkwargs
are ignored. Otherwise, the corresponding value will be set toNone
. Default:True
.**kwargs – Any parameters to be set. Set
key
toVocabContext.UNDEFINED
to delete a parameter.
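Example (a sketch in the same spirit as the FieldContext example above):
>>> from cotk.dataloader import VocabContext, GeneralVocab
>>> with VocabContext.set_parameters(min_frequent_vocab_times=3, min_rare_vocab_times=1):
...     vocab = GeneralVocab()   # both thresholds are taken from the context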
-
classmethod