Metric¶
cotk.metric provides commonly used metrics for cotk.dataloader.
All metric objects receive a batch of data per call of forward.
The batch of data is represented by a dict which contains the model's outputs and
the answers. The answers are highly relevant to the corresponding dataloader:
their types and shapes are usually identical to the return value of get_batch
in the dataloader, as long as the correct key names are set.
forward can be called several times, and at last
close can be called to retrieve the results.
Here is an example:
>>> dm = OpenSubtitles()
>>> metric = BleuCorpusMetric(dm, gen_key="gen",
...     reference_allvocabs_key="resp_allvocabs")
... # "resp_allvocabs" is a key name in get_batch()
>>> for data in dm.get_batches("test", batch_size=32):
... data["gen"] = predict(data["post"])
... assert "resp_allvocabs_key" in data
... metric.forward(data)
>>> print(metric.close())
{"bleu": 0.135, "bleu hashvalue": b"XXXX"}
We also provide default metrics in the dataloader. You can use the "get_metric"-like
functions (for example, SingleTurnDialog.get_inference_metric()) to get
the default metrics and avoid dealing with complex key names.
Here is an example:
>>> dm = OpenSubtitles()
>>> metric = dm.get_inference_metric(gen_key="gen")
>>> for data in dm.get_batches("test", batch_size=32):
... data["gen"] = predict(data["post"])
... metric.forward(data)
>>> print(metric.close())
{"bleu": 0.135, "bleu hashvalue": b"XXXX", ...}
Hash Value¶
MetricBase.close() returns a dict containing a hash value,
which can be used to validate whether two models used the same test data and the
same settings. Only results from two models that used the same metric and obtained
the same hash value can be compared with each other.
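For instance, a minimal sketch of such a comparison (metric_a and metric_b are placeholders for two models' metric objects built with the same settings):
>>> result_a = metric_a.close()   # e.g. {"bleu": 0.135, "bleu hashvalue": "..."}
>>> result_b = metric_b.close()   # e.g. {"bleu": 0.148, "bleu hashvalue": "..."}
>>> assert result_a["bleu hashvalue"] == result_b["bleu hashvalue"], \
...     "different test data or settings; the scores are not comparable"
>>> # only now does comparing result_a["bleu"] with result_b["bleu"] make sense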
Basic Classes¶
class cotk.metric.MetricBase(name, version)[source]¶
Base class for metrics.
forward(data)[source]¶
Processing a batch of data.
- Parameters
data (dict) – A dict containing the data that the metric needs.
close() → Dict[Any, Any][source]¶
Close the metric and return a dict containing the results. Once the metric is closed, any operation on the metric (e.g. forward or another close) will raise a ValueError.
Metric class¶
PerplexityMetric¶
class cotk.metric.PerplexityMetric(dataloader, reference_allvocabs_key='ref_allvocabs', reference_len_key='ref_length', gen_log_prob_key='gen_log_prob', generate_rare_vocab=False, full_check=False)[source]¶
Metric for calculating perplexity.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
reference_allvocabs_key (str, optional) – The key of reference sentences. Default: ref_allvocabs.
reference_len_key (str, optional) – The key of lengths of reference sentences. Default: ref_length.
gen_log_prob_key (str, optional) – The key of predicted log probability over words. Default: gen_log_prob.
generate_rare_vocab (bool, optional) – Whether gen_log_prob contains invalid vocab. Default: False.
full_check (bool, optional) – Whether to perform a full check on gen_log_prob to make sure the sum of probabilities is 1. Otherwise, a random check will be performed for efficiency. If PyTorch is used, a full check is always performed and this argument will be ignored. Default: False.
Here is an example:
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> reference_allvocabs_key = "ref_allvocabs"
>>> reference_len_key = "ref_length"
>>> gen_log_prob_key = "gen_log_prob"
>>> metric = cotk.metric.PerplexityMetric(dl,
...     reference_allvocabs_key=reference_allvocabs_key,
...     reference_len_key=reference_len_key,
...     gen_log_prob_key=gen_log_prob_key)
>>> data = {
...     reference_allvocabs_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # reference_allvocabs_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
...     reference_len_key: [5, 5],
...     gen_log_prob_key: [[[-11.31, -11.31, -0.69, ..., -11.31, -11.31, -11.31], ...], ...]  # shape == (batch, length, vocab_size)
... }
>>> metric.forward(data)
>>> metric.close()
{'perplexity': 81458.00000000006, 'perplexity hashvalue': '7f9b88b8f9996f5d49a512258f250fbc56adee714952b2c696c0b36cce36f648'}
forward(data)[source]¶
Processing a batch of data. Smoothing will be performed for rare vocabs.
- Parameters
data (dict) – A dict at least contains the following keys:
data[reference_allvocabs_key] (list, numpy.ndarray, torch.Tensor): A 2-d jagged or padded array of int. Reference sentences with allvocabs in index form. Contains start token (eg: <go>) and end token (eg: <eos>). Size: [batch_size, ~ref_sentence_length], where “~” means different sizes in this dimension are allowed.
data[reference_len_key] (list, numpy.ndarray): Length of reference sentences. Contains start token (eg: <go>) and end token (eg: <eos>). Size: [batch_size].
data[gen_log_prob_key] (list, numpy.ndarray, torch.Tensor): The log softmax probability of the sentences that the generation model outputs. A 3-d jagged or padded array of float. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, ~gen_sentence_length, vocab_size] for generate_rare_vocab = False, or [batch_size, ~gen_sentence_length, all_vocab_size] for generate_rare_vocab = True, where “~” means different sizes in this dimension are allowed. If torch.Tensor is used, the other data should also be torch.Tensor.
Here is an example for data:
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     reference_allvocabs_key: [[2,4,3], [2,5,6,3]],
...     reference_len_key: [3,4],
...     gen_log_prob_key: [[[-3.80666249, -3.11351531, -2.7080502, -2.42036813, -2.19722458,
...         -2.01490302, -1.86075234, -1.72722095, -1.60943791], ...], ...]
... }
Warning
data[gen_log_prob_key] must be processed after log_softmax. That means, np.sum(np.exp(gen_log_prob), -1) equals np.ones((batch_size, gen_sentence_length)).
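For instance, a minimal sketch of producing a valid data[gen_log_prob_key] from raw model logits with PyTorch (logits and data are placeholders, not part of the metric API):
>>> import torch
>>> # logits: raw decoder scores of shape (batch_size, gen_sentence_length, vocab_size)
>>> gen_log_prob = torch.nn.functional.log_softmax(logits, dim=-1)
>>> # the check required by the warning above: probabilities sum to 1 over the vocab axis
>>> assert torch.allclose(gen_log_prob.exp().sum(-1), torch.ones(gen_log_prob.shape[:-1]))
>>> data["gen_log_prob"] = gen_log_prob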
MultiTurnPerplexityMetric¶
class cotk.metric.MultiTurnPerplexityMetric(dataloader, multi_turn_reference_allvocabs_key='multi_turn_ref_allvocabs', multi_turn_reference_len_key='multi_turn_ref_length', multi_turn_gen_log_prob_key='multi_turn_gen_log_prob', generate_rare_vocab=False, full_check=False)[source]¶
Metric for calculating multi-turn perplexity.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
multi_turn_reference_allvocabs_key (str, optional) – The key of reference sentences. Default: multi_turn_ref_allvocabs.
multi_turn_reference_len_key (str, optional) – The key of lengths of reference sentences. Default: multi_turn_ref_length.
multi_turn_gen_log_prob_key (str, optional) – The key of predicted log probability over words. Default: multi_turn_gen_log_prob.
generate_rare_vocab (bool, optional) – Whether gen_log_prob contains invalid vocab. Default: False.
full_check (bool, optional) – Whether to perform a full check on gen_log_prob to make sure the sum of probabilities is 1. Otherwise, a random check will be performed for efficiency. If PyTorch is used, a full check is always performed and this argument will be ignored. Default: False.
Here is an example:
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> multi_turn_reference_allvocabs_key = "multi_turn_ref_allvocabs"
>>> multi_turn_reference_len_key = "multi_turn_ref_length"
>>> multi_turn_gen_log_prob_key = "multi_turn_gen_log_prob"
>>> metric = cotk.metric.MultiTurnPerplexityMetric(dl,
...     multi_turn_reference_allvocabs_key="multi_turn_ref_allvocabs",
...     multi_turn_reference_len_key="multi_turn_ref_length",
...     multi_turn_gen_log_prob_key="multi_turn_gen_log_prob")
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2, 10, 64, 851, 3], [2, 10, 64, 479, 3]], [[2, 10, 64, 279, 1460, 3]]],
...     # multi_turn_reference_allvocabs_key = [[["<go>", "I", "like", "python", "<eos>"],
...     #   ["<go>", "I", "like", "java", "<eos>"]],
...     #   [["<go>", "I", "like", "machine", "learning", "<eos>"]]]
...     multi_turn_reference_len_key: [[5, 5], [6]],
...     multi_turn_gen_log_prob_key: [[[[-11.30784283, -11.30784283, -0.69312263, ..., -11.30784283, -11.30784283, -11.30784283], ...], ...], ...]
... }
>>> metric.forward(data)
>>> metric.close()
{'perplexity': 81458.00000000006, 'perplexity hashvalue': '3a7647507f2e0d05a235c1d3a29515dc8885650884d625a5b76d305541dca685'}
forward(data)[source]¶
Processing a batch of data.
- Parameters
data (dict) – A dict at least contains the following keys:
data[multi_turn_reference_allvocabs_key] (list, numpy.ndarray, torch.Tensor): A 3-d jagged or padded array of int. Multi-turn reference sentences with all vocabs. Contains start token (eg: <go>) and end token (eg: <eos>). Size: [batch_size, ~turn_length, ~sentence_length], where “~” means different sizes in this dimension are allowed.
data[multi_turn_reference_len_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. If padded, redundant positions must be set to 0. Length of multi-turn reference sentences. Contains start token (eg: <go>) and end token (eg: <eos>). Size: [batch_size, ~turn_length], where “~” means different sizes in this dimension are allowed.
data[multi_turn_gen_log_prob_key] (list, numpy.ndarray, torch.Tensor): The log softmax probability of the sentences that the generation model outputs. A 4-d jagged or padded array of float. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, ~turn_length, ~gen_sentence_length, vocab_size] for generate_rare_vocab = False, or [batch_size, ~turn_length, ~gen_sentence_length, all_vocab_size] for generate_rare_vocab = True, where “~” means different sizes in this dimension are allowed. If torch.Tensor is used, the other data should also be torch.Tensor.
Here is an example for data:
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2,4,3], [2,5,6,3]], [[2,7,6,8,3]]],
...     multi_turn_reference_len_key: [[3, 4], [5]],
...     multi_turn_gen_log_prob_key: [[[[-3.80666249, -3.11351531, -2.7080502, -2.42036813, -2.19722458,
...         -2.01490302, -1.86075234, -1.72722095, -1.60943791], ...], ...], ...]
... }
Warning
data[multi_turn_gen_log_prob_key] must be processed after log_softmax. That means, np.sum(np.exp(multi_turn_gen_log_prob), -1) equals np.ones((batch_size, ~turn_length, ~gen_sentence_length)).
BleuCorpusMetric¶
class cotk.metric.BleuCorpusMetric(dataloader, ngram=4, *, tokenizer=None, reference_num=1, ignore_smoothing_error=False, reference_allvocabs_key='ref_allvocabs', gen_key='gen', reference_str_key='ref_str')[source]¶
Metric for calculating BLEU.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
ngram (int, optional) – The order of ngram to calculate metrics like BLEU and Perplexity. Default: 4.
tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.
reference_num (int, None, optional) – The number of references used to calculate BLEU. If None, the number of references is uncertain and will be determined by the argument of forward(). Default: 1.
ignore_smoothing_error (bool, optional) – Specifies whether to ignore the smoothing error when calculating BLEU. Default: False.
reference_allvocabs_key (str, optional) – The key of reference sentences. Default: ref_allvocabs.
gen_key (str, optional) – The key of generated sentences. Default: gen.
reference_str_key (str, optional) – The key of reference sentences in string form. Default: ref_str.
Here is an example:
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> reference_allvocabs_key = "ref_allvocabs"
>>> gen_key = "gen"
>>> metric = cotk.metric.BleuCorpusMetric(dl,
...     reference_allvocabs_key=reference_allvocabs_key,
...     gen_key=gen_key)
>>> data = {
...     reference_allvocabs_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # reference_allvocabs_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
...     gen_key: [[10, 1028, 479, 285, 220, 3], [851, 17, 2451, 3]]
...     # gen_key: [["I", "love", "java", "very", "much", "<eos>"], ["python", "is", "excellent", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'bleu': 0.08582363099612991, 'bleu hashvalue': '70e019630fef24d9477034a3d941a5349fcbff5a3dc6978a13ea3d85290114fb'}
forward(data)[source]¶
Processing a batch of data.
- Parameters
data (dict) – A dict at least contains the following keys:
data[reference_allvocabs_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Reference sentences with allvocabs in index form. Contains start token (eg: <go>) and end token (eg: <eos>). Size: [batch_size, ~ref_sentence_length], where “~” means different sizes in this dimension are allowed.
data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
Here is an example for data:
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     reference_allvocabs_key: [[2,4,3], [2,5,6,3]],
...     gen_key: [[4,5,3], [6,7,8,3]]
... }
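For intuition only, the core of corpus-level BLEU-4 over already tokenized sentences resembles nltk's corpus_bleu; the metric additionally converts ids back to tokens, trims special tokens such as <eos>, and produces the hash value (a rough sketch, not the exact implementation):
>>> from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction
>>> refs = [[["I", "like", "python"]], [["I", "use", "python"]]]   # one list of references per sample
>>> hyps = [["I", "love", "java"], ["python", "is", "excellent"]]
>>> bleu = corpus_bleu(refs, hyps, weights=(0.25,) * 4,
...                    smoothing_function=SmoothingFunction().method1)  # a float in [0, 1]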
SelfBleuCorpusMetric¶
class cotk.metric.SelfBleuCorpusMetric(dataloader, ngram=4, *, tokenizer=None, gen_key='gen', sample=1000, seed=1229, cpu_count=None)[source]¶
Metric for calculating Self-BLEU.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
ngram (int, optional) – The order of ngram to calculate metrics like BLEU and Perplexity. Default: 4.
tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.
gen_key (str, optional) – The key of generated sentences. Default: gen.
sample (int, optional) – Number of examples sampled from the generated sentences. Default: 1000.
seed (int, optional) – Random seed for sampling. Default: 1229.
cpu_count (int, optional) – Number of CPUs used for multiprocessing. Multiprocessing will NOT be used when cpu_count is set to 1 or the dataset is small. Default: If None, the environment variable CPU_COUNT will be used when available, otherwise all available CPUs will be used.
Warning
The calculation of hashvalue considers the actual sample size of hypotheses, which will be less than sample if the number of hypotheses is smaller than sample.
Here is an example:
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> gen_key = 'gen'
>>> metric = cotk.metric.SelfBleuCorpusMetric(dl, gen_key=gen_key)
>>> data = {
...     gen_key: [[10, 64, 851, 3], [10, 48, 851, 3]],
...     # gen_key: [["I", "like", "python", "<eos>"], ["I", "use", "python", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'self-bleu': 0.13512001548070346, 'self-bleu hashvalue': '53cf55829c1b080c86c392c846a5d39a54340c70d838ec953f952aa6731118fb'}
forward(data)[source]¶
Processing a batch of data.
- Parameters
data (dict) – A dict at least contains the following keys:
data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
Here is an example for data:
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     gen_key: [[4,5,3], [6,7,8,3]]
... }
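Conceptually, Self-BLEU scores each generated sentence against all the other generated sentences as references and averages the results; a rough sketch of that idea (the metric itself also samples at most sample sentences and may use multiprocessing):
>>> from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
>>> gens = [["I", "like", "python"], ["I", "use", "python"], ["hello", "world"]]
>>> smooth = SmoothingFunction().method1
>>> scores = [sentence_bleu(gens[:i] + gens[i + 1:], hyp, weights=(0.25,) * 4,
...                         smoothing_function=smooth)
...           for i, hyp in enumerate(gens)]
>>> self_bleu = sum(scores) / len(scores)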
FwBwBleuCorpusMetric¶
class cotk.metric.FwBwBleuCorpusMetric(dataloader, reference_test_list, ngram=4, *, tokenizer=None, gen_key='gen', sample=1000, seed=1229, cpu_count=None)[source]¶
Metric for calculating FwBw-BLEU.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
reference_test_list (list) – Reference sentences with all vocabs in test data.
ngram (int, optional) – The order of ngram to calculate metrics like BLEU and Perplexity. Default: 4.
tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.
gen_key (str, optional) – The key of generated sentences. Default: gen.
sample (int, optional) – Number of examples sampled from the generated sentences. Default: 1000.
seed (int, optional) – Random seed for sampling. Default: 1229.
cpu_count (int, optional) – Number of CPUs used for multiprocessing. Multiprocessing will NOT be used when cpu_count is set to 1 or the dataset is small. Default: If None, the environment variable CPU_COUNT will be used when available, otherwise all available CPUs will be used.
Warning
The calculation of hashvalue considers the actual sample sizes of hypotheses and references. Therefore hashvalue may vary if the number of hypotheses or references is smaller than sample.
Here is an example:
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> gen_key = 'gen'
>>> metric = cotk.metric.FwBwBleuCorpusMetric(dl,
...     reference_test_list=dl.get_all_batch('test')['session'][0].tolist(),
...     gen_key=gen_key)
>>> data = {
...     gen_key: [[10, 64, 851, 3], [10, 48, 851, 3]],
...     # gen_key: [["I", "like", "python", "<eos>"], ["I", "use", "python", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'fw-bleu': 0.007688528488990184, 'bw-bleu': 0.0012482612634667945, 'fw-bw-bleu': 0.002147816509441494, 'fw-bw-bleu hashvalue': '0e3f58a90225af615ff780f04c91613759e04a3c7b4329670b1d03b679adf8cd'}
forward(data)[source]¶
Processing a batch of data.
- Parameters
data (dict) – A dict at least contains the following keys:
data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
Here is an example for data:
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     gen_key: [[4,5,3], [6,7,8,3]]
... }
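For intuition, forward BLEU scores each generated sentence against the reference set (precision-like), backward BLEU scores each reference sentence against the generated set (recall-like), and fw-bw-bleu combines the two by a harmonic mean; a rough sketch under these assumptions (the metric itself also samples, tokenizes and hashes):
>>> from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
>>> smooth = SmoothingFunction().method1
>>> refs = [["I", "like", "python"], ["I", "use", "python"]]
>>> gens = [["I", "love", "java"], ["python", "is", "excellent"]]
>>> fw = sum(sentence_bleu(refs, g, smoothing_function=smooth) for g in gens) / len(gens)
>>> bw = sum(sentence_bleu(gens, r, smoothing_function=smooth) for r in refs) / len(refs)
>>> fw_bw = 2 * fw * bw / (fw + bw)   # harmonic mean of forward and backward BLEU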
MultiTurnBleuCorpusMetric¶
class cotk.metric.MultiTurnBleuCorpusMetric(dataloader, ignore_smoothing_error=False, multi_turn_reference_allvocabs_key='reference_allvocabs', multi_turn_gen_key='multi_turn_gen', turn_len_key='turn_length')[source]¶
Metric for calculating multi-turn BLEU.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
ignore_smoothing_error (bool, optional) – Specifies whether to ignore the smoothing error when calculating BLEU. Default: False.
multi_turn_reference_allvocabs_key (str, optional) – The key of reference sentences. Default: reference_allvocabs.
multi_turn_gen_key (str, optional) – The key of generated sentences. Default: multi_turn_gen.
turn_len_key (str, optional) – The key of the length of turns. Default: turn_length.
Here is an example:
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> multi_turn_reference_allvocabs_key = "reference_allvocabs"
>>> multi_turn_gen_key = "multi_turn_gen"
>>> turn_len_key = "turn_length"
>>> metric = cotk.metric.MultiTurnBleuCorpusMetric(dl,
...     multi_turn_reference_allvocabs_key=multi_turn_reference_allvocabs_key,
...     multi_turn_gen_key=multi_turn_gen_key,
...     turn_len_key=turn_len_key)
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2, 10, 64, 851, 3], [2, 10, 64, 479, 3]], [[2, 10, 64, 279, 1460, 3]]],
...     # multi_turn_reference_allvocabs_key = [[["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "like", "java", "<eos>"]],
...     #   [["<go>", "I", "like", "machine", "learning", "<eos>"]]]
...     turn_len_key: [2, 1],
...     # turn_len_key: [len(multi_turn_reference_allvocabs_key[0]), len(multi_turn_reference_allvocabs_key[1])]
...     multi_turn_gen_key: [[[851, 17, 2451, 3], [2019, 17, 393, 3]], [[10, 64, 34058, 805, 2601, 3]]]
...     # multi_turn_gen_key = [[["python", "is", "excellent", "<eos>"], ["PHP", "is", "best", "<eos>"]],
...     #   [["I", "like", "natural", "language", "processing", "<eos>"]]]
... }
>>> metric.forward(data)
>>> metric.close()
{'bleu': 0.12081744577265555, 'bleu hashvalue': 'c65b44c454dee5a8d393901644c7f1acfdb847bae3ab03823cb5b9f643958960'}
forward(data)[source]¶
Processing a batch of data.
- Parameters
data (dict) – A dict at least contains the following keys:
data[multi_turn_reference_allvocabs_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Multi-turn reference sentences with all vocabs. Contains start token (eg: <go>) and end token (eg: <eos>). Size: [batch_size, ~turn_length, ~sentence_length], where “~” means different sizes in this dimension are allowed.
data[multi_turn_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Sentences generated by the model. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, ~max_turn_length, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
data[turn_len_key] (list, numpy.ndarray): Length of turns in each sample. Size: [batch_size].
Here is an example for data:
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2,4,3], [2,5,6,3]], [[2,7,6,8,3]]],
...     turn_len_key: [2, 1],
...     multi_turn_gen_key: [[[6,7,8,3], [4,5,3]], [[7,3]]]
... }
BleuPrecisionRecallMetric¶
class cotk.metric.BleuPrecisionRecallMetric(dataloader, ngram, generated_num_per_context, candidates_allvocabs_key='candidate_allvocabs', multiple_gen_key='multiple_gen')[source]¶
Metric for calculating sentence BLEU precision and recall.
References
[1] Zhao, T., Zhao, R., & Eskenazi, M. (2017). Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
ngram (int) – Specifies using BLEU-ngram.
generated_num_per_context (int) – The number of sentences generated per context.
candidate_allvocabs_key (str, optional) – The key of reference sentences. Default: candidate_allvocabs.
multiple_gen_key (str, optional) – The key of multiple generated sentences. Default: multiple_gen.
Here is an example:
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> candidate_allvocabs_key = 'candidate_allvocabs'
>>> multiple_gen_key = 'multiple_gen'
>>> metric = cotk.metric.BleuPrecisionRecallMetric(dl, 2, 2)
>>> data = {
...     candidate_allvocabs_key: [[[10, 64, 851], [10, 48, 851]]],
...     # candidate_allvocabs_key: [[["I", "like", "python"], ["I", "use", "python"]]],
...     multiple_gen_key: [[[10, 64, 479, 3], [10, 48, 2019, 3]]],
...     # multiple_gen_key: [[["I", "like", "java", "<eos>"], ["I", "use", "PHP", "<eos>"]]],
... }
>>> metric.forward(data)
>>> metric.close()
{'BLEU-2 precision': 0.12909944355487823, 'BLEU-2 recall': 0.12909944355487823, 'BLEU-2 hashvalue': '1652cd40276078ec8722d367f18008bf14053572ac15ce10e270eb41eae34bbf'}
_score(gen, reference) → float[source]¶
Return a BLEU score in [0, 1], used to calculate BLEU-ngram precision and recall.
- Parameters
gen (list) – list of generated word ids.
reference (list) – list of word ids of a reference.
Here is an example:
>>> gen = [4,5]
>>> reference = [5,6]
>>> self._score(gen, reference)
0.150 # assume self.weights = [0.25,0.25,0.25,0.25]
close() → Dict[str, Any]¶
Return a dict which contains:
res_prefix precision: average precision.
res_prefix recall: average recall.
res_prefix hashvalue: hash value for the precision & recall metric; the same hash value stands for the same evaluation settings.
forward(data)¶
Processing a batch of data.
- Parameters
data (dict) – A dict at least contains the following keys:
data[candidate_allvocabs_key] (list, numpy.ndarray): A 3-d jagged list of index. Multiple reference sentences for a single context. Does not contain start token (eg: <go>) and end token (eg: <eos>). Size: [batch_size, ~sentence_num, ~word_num], where “~” means different sizes in this dimension are allowed.
data[multiple_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array. Sentences generated by the model. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, generated_num_per_context, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
Here is an example for data:
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     candidate_allvocabs_key: [[[4], [5,6]], [[4,5,6]]],
...     multiple_gen_key: [[[5,6,3]], [[4,5,7,3], [8,3]]]
... }
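Roughly speaking, for one context the metric scores every generated sentence against every candidate reference with _score, then takes precision as the average best match of the generated sentences and recall as the average best match of the references; a sketch of that aggregation under these assumptions (the final values are averaged over all contexts):
>>> # candidates: reference sentences of one context; gens: sentences generated for it
>>> precision = sum(max(self._score(gen, ref) for ref in candidates)
...                 for gen in gens) / len(gens)
>>> recall = sum(max(self._score(gen, ref) for gen in gens)
...              for ref in candidates) / len(candidates)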
EmbSimilarityPrecisionRecallMetric¶
class cotk.metric.EmbSimilarityPrecisionRecallMetric(dataloader, word2vec, mode, generated_num_per_context, candidates_allvocabs_key='candidate_allvocabs', multiple_gen_key='multiple_gen')[source]¶
Metric for calculating cosine similarity precision and recall.
References
[1] Zhao, T., Zhao, R., & Eskenazi, M. (2017). Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
word2vec (dict) – Maps a word (str) to its pretrained embedding (numpy.ndarray or list).
mode (str) – Specifies the operation that computes the bag-of-word representation. Must be avg or extrema: avg takes the element-wise average of the word embeddings, extrema takes the element-wise maximum of the word embeddings (see the sketch after this list).
generated_num_per_context (int) – The number of sentences generated per context.
candidate_allvocabs_key (str, optional) – The key of reference sentences. Default: candidate_allvocabs.
multiple_gen_key (str, optional) – The key of multiple generated sentences. Default: multiple_gen.
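As a rough illustration of the two modes (a minimal numpy sketch over a single tokenized sentence, not the library implementation; sentence and ref_repr are placeholders):
>>> import numpy as np
>>> embs = np.array([word2vec[w] for w in sentence if w in word2vec])  # (num_words, dim)
>>> avg_repr = embs.mean(axis=0)      # 'avg': element-wise average of the word embeddings
>>> extrema_repr = embs.max(axis=0)   # 'extrema': element-wise maximum of the word embeddings
>>> # precision/recall are then built from cosine similarities between such representations
>>> cos = float(avg_repr @ ref_repr / (np.linalg.norm(avg_repr) * np.linalg.norm(ref_repr)))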
Here is an example:
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> candidate_allvocabs_key = 'candidate_allvocabs'
>>> multiple_gen_key = 'multiple_gen'
>>> wordvector = cotk.wordvector.Glove()
>>> metric = cotk.metric.EmbSimilarityPrecisionRecallMetric(dl,
...     wordvector.load_dict(dl.all_vocab_list), 'avg', 2)
>>> data = {
...     candidate_allvocabs_key: [[[10, 64, 851], [10, 48, 851]]],
...     # candidate_allvocabs_key: [[["I", "like", "python"], ["I", "use", "python"]]],
...     multiple_gen_key: [[[10, 64, 479, 3], [10, 48, 2019, 3]]],
...     # multiple_gen_key: [[["I", "like", "java", "<eos>"], ["I", "use", "PHP", "<eos>"]]],
... }
>>> metric.forward(data)
>>> metric.close()
>>> # metric.close() returns a dict like this:
>>> # {'avg-bow precision': 0.0,
>>> #  'avg-bow recall': 0.0,
>>> #  'avg-bow hashvalue': '5abaaa9a8e709b3f05467e3f6d0e27c6cc904fceebd3accb3b768928595e729a'}
_score(gen, reference) → float[source]¶
Return a cosine similarity score in [0, 1] between two sentence embeddings, used to calculate cosine similarity precision and recall.
- Parameters
gen (list) – list of generated word ids.
reference (list) – list of word ids of a reference.
Here is an example:
>>> gen = [4,5]
>>> reference = [5,6]
>>> self._score(gen, reference)
0.135 # assume self.mode = 'avg'
close() → Dict[str, Any]¶
Return a dict which contains:
res_prefix precision: average precision.
res_prefix recall: average recall.
res_prefix hashvalue: hash value for the precision & recall metric; the same hash value stands for the same evaluation settings.
forward(data)¶
Processing a batch of data.
- Parameters
data (dict) – A dict at least contains the following keys:
data[candidate_allvocabs_key] (list, numpy.ndarray): A 3-d jagged list of index. Multiple reference sentences for a single context. Does not contain start token (eg: <go>) and end token (eg: <eos>). Size: [batch_size, ~sentence_num, ~word_num], where “~” means different sizes in this dimension are allowed.
data[multiple_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array. Sentences generated by the model. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, generated_num_per_context, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
Here is an example for data:
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     candidate_allvocabs_key: [[[4], [5,6]], [[4,5,6]]],
...     multiple_gen_key: [[[5,6,3]], [[4,5,7,3], [8,3]]]
... }
NgramFwBwPerplexityMetric¶
class cotk.metric.NgramFwBwPerplexityMetric(dataloader, reference_test_list, ngram=4, *, tokenizer=None, gen_key='gen', sample=10000, seed=1229, cpu_count=None)[source]¶
Metric for calculating n-gram forward perplexity and backward perplexity.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
reference_test_list (list) – Reference sentences with all vocabs in test data.
ngram (int, optional) – The order of ngram to calculate metrics like BLEU and Perplexity. Default: 4.
tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.
gen_key (str, optional) – The key of generated sentences. Default: gen.
sample (int, optional) – Number of examples sampled from the generated sentences. Default: 10000.
seed (int, optional) – Random seed for sampling. Default: 1229.
cpu_count (int, optional) – Number of CPUs used for multiprocessing. Multiprocessing will NOT be used when cpu_count is set to 1 or the dataset is small. Default: If None, the environment variable CPU_COUNT will be used when available, otherwise all available CPUs will be used.
Here is an example (it only shows the format of the results, not their exact values):
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> gen_key = "gen"
>>> metric = cotk.metric.NgramFwBwPerplexityMetric(dl,
...     dl.get_all_batch('test')['session'][0].tolist(), 2, gen_key=gen_key)
>>> data = {
...     gen_key: [[10, 1028, 479, 285, 220, 3], [851, 17, 2451, 3]]
...     # gen_key: [["I", "love", "java", "very", "much", "<eos>"], ["python", "is", "excellent", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'fwppl': 51.44751843841384, 'bwppl': 138.954327895075, 'fwppl hashvalue': '2ea52377084692953f602e4ebad23e8a46e1c4bb527947d29a03c14b426efe67', 'bwppl hashvalue': '2ea52377084692953f602e4ebad23e8a46e1c4bb527947d29a03c14b426efe67'}
forward(data)[source]¶
Processing a batch of data.
- Parameters
data (dict) – A dict at least contains the following keys:
data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
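Conceptually, the forward perplexity fits an n-gram language model on the reference sentences and evaluates it on the generated sentences, while the backward perplexity does the reverse; a hand-rolled bigram sketch with add-one smoothing (the metric's own estimator, smoothing and sampling differ; the token lists are placeholders):
>>> from collections import Counter
>>> import math
>>> def bigram_ppl(train, test):
...     """Perplexity of `test` under an add-one-smoothed bigram model fit on `train`."""
...     contexts, bigrams = Counter(), Counter()
...     vocab = {w for sent in train + test for w in sent} | {"<s>"}
...     for sent in train:
...         tokens = ["<s>"] + sent
...         contexts.update(tokens[:-1])
...         bigrams.update(zip(tokens[:-1], tokens[1:]))
...     log_prob, count = 0.0, 0
...     for sent in test:
...         tokens = ["<s>"] + sent
...         for prev, word in zip(tokens[:-1], tokens[1:]):
...             log_prob += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab)))
...             count += 1
...     return math.exp(-log_prob / count)
>>> fwppl = bigram_ppl(reference_token_lists, generated_token_lists)
>>> bwppl = bigram_ppl(generated_token_lists, reference_token_lists)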
Metric-like class¶
SingleTurnDialogRecorder¶
class cotk.metric.SingleTurnDialogRecorder(dataloader, post_allvocabs_key='post_allvocabs', resp_allvocabs_key='resp_allvocabs', gen_key='gen')[source]¶
A metric-like class for recording generated sentences and references.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
post_allvocabs_key (str, optional) – The key of dialog posts with allvocabs. Default: post_allvocabs.
resp_allvocabs_key (str, optional) – The key of dialog responses with allvocabs. Default: resp_allvocabs.
gen_key (str, optional) – The key of generated sentences. Default: gen.
Here is an example:
>>> post_allvocabs_key = "post_allvocabs"
>>> resp_allvocabs_key = "resp_allvocabs"
>>> gen_key = "gen"
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> metric = cotk.metric.SingleTurnDialogRecorder(dl,
...     post_allvocabs_key=post_allvocabs_key,
...     resp_allvocabs_key=resp_allvocabs_key,
...     gen_key=gen_key)
>>> data = {
...     post_allvocabs_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # post_allvocabs_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
...     resp_allvocabs_key: [[2, 10, 1214, 479, 3], [2, 851, 17, 2451, 3]],
...     # resp_allvocabs_key: [["<go>", "I", "prefer", "java", "<eos>"], ["<go>", "python", "is", "excellent", "<eos>"]],
...     gen_key: [[10, 64, 2019, 3], [851, 17, 4124, 3]],
...     # gen_key: [["I", "like", "PHP", "<eos>"], ["python", "is", "powerful", "<eos>"]]
... }
>>> metric.forward(data)
>>> metric.close()
{'post': [['I', 'like', 'python'], ['I', 'use', 'python']], 'resp': [['I', 'prefer', 'java'], ['python', 'is', 'excellent']], 'gen': [['I', 'like', 'PHP'], ['python', 'is', 'powerful']]}
forward(data)[source]¶
Processing a batch of data.
- Parameters
data (dict) – A dict at least contains the following keys:
data[post_allvocabs_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Post sentences with allvocabs in index form. Contains start token (eg: <go>) and end token (eg: <eos>). Size: [batch_size, ~ref_sentence_length], where “~” means different sizes in this dimension are allowed.
data[resp_allvocabs_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Response sentences with allvocabs in index form. Contains start token (eg: <go>) and end token (eg: <eos>). Size: [batch_size, ~ref_sentence_length], where “~” means different sizes in this dimension are allowed.
data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
Here is an example for data:
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     post_allvocabs_key: [[2,4,3], [2,5,6,3]],
...     resp_allvocabs_key: [[2,5,4,3], [2,6,3]],
...     gen_key: [[6,7,8,3], [4,5,3]]
... }
close() → Dict[str, Any][source]¶
Return a dict which contains:
post: a list of post sentences. A jagged 2-d array of int. Size: [batch_size, ~sent_length], where “~” means different sizes in this dimension are allowed.
resp: a list of response sentences. A jagged 2-d array of int. Size: [batch_size, ~sent_length], where “~” means different sizes in this dimension are allowed.
gen: a list of generated sentences. A jagged 2-d array of int. Size: [batch_size, ~sent_length], where “~” means different sizes in this dimension are allowed.
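The recorded sentences are convenient for qualitative inspection; for example, they can be dumped side by side to a text file (a sketch, assuming record is the dict returned by close()):
>>> record = metric.close()
>>> with open("samples.txt", "w") as f:
...     for post, resp, gen in zip(record["post"], record["resp"], record["gen"]):
...         f.write("post: %s\nresp: %s\ngen:  %s\n\n"
...                 % (" ".join(post), " ".join(resp), " ".join(gen)))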
MultiTurnDialogRecorder¶
class cotk.metric.MultiTurnDialogRecorder(dataloader, multi_turn_reference_allvocabs_key='multi_turn_ref_allvocabs', multi_turn_gen_key='multi_turn_gen', turn_len_key='turn_length')[source]¶
A metric-like class for recording generated sentences and references.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Session) – A language generation dataloader.
multi_turn_reference_allvocabs_key (str, optional) – The key of dialog references with allvocabs. Default: multi_turn_ref_allvocabs.
multi_turn_gen_key (str, optional) – The key of generated sentences. Default: multi_turn_gen.
turn_len_key (str, optional) – The key of the length of turns. Default: turn_length.
Here is an example:
>>> multi_turn_reference_allvocabs_key = "multi_turn_ref_allvocabs"
>>> multi_turn_gen_key = "multi_turn_gen"
>>> turn_len_key = "turn_length"
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> metric = cotk.metric.MultiTurnDialogRecorder(dl,
...     multi_turn_reference_allvocabs_key=multi_turn_reference_allvocabs_key,
...     multi_turn_gen_key=multi_turn_gen_key,
...     turn_len_key=turn_len_key)
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2, 10, 64, 851, 3], [2, 10, 64, 479, 3]], [[2, 10, 64, 279, 1460, 3]]],
...     # multi_turn_reference_allvocabs_key = [[["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "like", "java", "<eos>"]],
...     #   [["<go>", "I", "like", "machine", "learning", "<eos>"]]]
...     turn_len_key: [2, 1],
...     # turn_len_key: [len(multi_turn_reference_allvocabs_key[0]), len(multi_turn_reference_allvocabs_key[1])]
...     multi_turn_gen_key: [[[851, 17, 2451, 3], [2019, 17, 393, 3]], [[10, 64, 34058, 805, 2601, 3]]]
...     # multi_turn_gen_key = [[["python", "is", "excellent", "<eos>"], ["PHP", "is", "best", "<eos>"]],
...     #   [["I", "like", "natural", "language", "processing", "<eos>"]]]
... }
>>> metric.forward(data)
>>> metric.close()
{'reference': [[['I', 'like', 'python'], ['I', 'like', 'java']], [['I', 'like', 'machine', 'learning']]], 'gen': [[['python', 'is', 'excellent'], ['PHP', 'is', 'best']], [['I', 'like', 'natural', 'language', 'processing']]]}
forward(data)[source]¶
Processing a batch of data.
- Parameters
data (dict) – A dict at least contains the following keys:
data[multi_turn_reference_allvocabs_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Multi-turn reference sentences with all vocabs. Contains start token (eg: <go>) and end token (eg: <eos>). Size: [batch_size, ~turn_length, ~sentence_length], where “~” means different sizes in this dimension are allowed.
data[multi_turn_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Sentences generated by the model. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, ~max_turn_length, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
data[turn_len_key] (list, numpy.ndarray): Length of turns in each sample. Size: [batch_size].
Here is an example for data:
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     multi_turn_context_allvocabs_key: [[[2,4,3], [2,5,6,3]], [[2,7,6,8,3]]],
...     multi_turn_reference_allvocabs_key: [[[2,6,7,3], [2,5,3]], [[2,7,6,8,3]]],
...     multi_turn_gen_key: [[[6,7,8,3], [4,5,3]], [[7,3]]],
...     turn_len_key: [2,1]
... }
close() → Dict[str, Any][source]¶
Return a dict which contains:
reference: a list of response sentences. A jagged 3-d array of int. Size: [batch_size, ~turn_length, ~sent_length], where “~” means different sizes in this dimension are allowed.
gen: a list of generated sentences. A jagged 3-d array of int. Size: [batch_size, ~turn_length, ~sent_length], where “~” means different sizes in this dimension are allowed.
LanguageGenerationRecorder¶
class cotk.metric.LanguageGenerationRecorder(dataloader, gen_key='gen')[source]¶
A metric-like class for recording generated sentences.
- Parameters
dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
gen_key (str, optional) – The key of generated sentences. Default: gen.
Here is an example:
>>> gen_key = "gen_key"
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> metric = cotk.metric.LanguageGenerationRecorder(dl, gen_key=gen_key)
>>> data = {
...     gen_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # gen_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'gen': [['<go>', 'I', 'like', 'python'], ['<go>', 'I', 'use', 'python']]}
forward(data)[source]¶
Processing a batch of data.
- Parameters
data (dict) – A dict at least contains the following keys:
data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains end token (eg: <eos>), but without start token (eg: <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
Here is an example for data:
>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     gen_key: [[6,7,8,3], [4,5,3]]
... }
MetricChain¶
class cotk.metric.MetricChain[source]¶
A metric-like class for stacked metrics. You can use this class to combine multiple metrics and use them as one.
Examples
>>> metric = MetricChain()
>>> metric.add_metric(BleuCorpusMetric(dataloader))
>>> metric.add_metric(SingleTurnDialogRecorder(dataloader))
Todo: Give more examples of combining forward and close.
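A fuller sketch of running a chained metric end to end (dl, predict and the key names are placeholders following the earlier examples; forward feeds the batch to every added metric and close merges their result dicts):
>>> metric = MetricChain()
>>> metric.add_metric(BleuCorpusMetric(dl, gen_key="gen",
...                                    reference_allvocabs_key="resp_allvocabs"))
>>> metric.add_metric(SingleTurnDialogRecorder(dl, gen_key="gen"))
>>> for data in dl.get_batches("test", batch_size=32):
...     data["gen"] = predict(data["post"])
...     metric.forward(data)
>>> result = metric.close()
>>> # e.g. {"bleu": ..., "bleu hashvalue": ..., "post": ..., "resp": ..., "gen": ...}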
add_metric(metric)[source]¶
Add a metric for processing.
- Parameters
metric (metric.MetricBase) – a metric object.