
cotk.metric provides commonly used metrics for cotk.dataloader. Every metric object receives one batch of data per call to forward. The batch is a dict that contains the model's outputs and the reference answers. The answers are tied to the corresponding dataloader: their types and shapes are usually identical to the return value of get_batch in the dataloader, provided the correct key names are set. forward can be called several times; finally, close is called to obtain the results.

Here is an example:

>>> dm = OpenSubtitles()
>>> metric = BleuCorpusMetric(dm, gen_key="gen",
...     reference_allvocabs_key="resp_allvocabs_key")
>>> # "resp_allvocabs_key" is a key name in get_batch()
>>> for data in dm.get_batches("test", batch_size=32):
...     data["gen"] = predict(data["post"])
...     assert "resp_allvocabs_key" in data
...     metric.forward(data)
>>> print(metric.close())
{"bleu": 0.135, "bleu hashvalue": b"XXXX"}

We also provide default metrics in the dataloader. You can use “get_metric”-like functions (for example, SingleTurnDialog.get_inference_metric()) to get the default metrics and avoid dealing with complex key names.

Here is an example:

>>> dm = OpenSubtitles()
>>> metric = dm.get_inference_metric(gen_key="gen")
>>> for data in dm.get_batches("test", batch_size=32):
...     data["gen"] = predict(data["post"])
...     metric.forward(data)
>>> print(metric.close())
{"bleu": 0.135, "bleu hashvalue": b"XXXX", ...}

Hash Value

MetricBase.close() returns a dict containing a hash value, which can be used to check whether two models were evaluated on the same test data with the same settings. Results from two models are comparable only if they use the same metric and the returned hash values are identical.
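To illustrate the idea, here is a minimal sketch of a hash value computed over the evaluation settings and the reference data. The helper toy_hashvalue and the use of SHA-256 over repr() strings are assumptions made for illustration; they are not the hashing scheme that cotk actually uses.

```python
import hashlib


def toy_hashvalue(setting, references):
    """Hypothetical helper: digest of the metric setting and the reference ids."""
    hasher = hashlib.sha256()
    hasher.update(repr(setting).encode("utf-8"))
    for sentence in references:
        hasher.update(repr(list(sentence)).encode("utf-8"))
    return hasher.hexdigest()


refs = [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]]
# Same setting and same references give the same digest.
same = toy_hashvalue({"ngram": 4}, refs) == toy_hashvalue({"ngram": 4}, refs)
# Changing the setting changes the digest, so the results are not comparable.
different = toy_hashvalue({"ngram": 4}, refs) != toy_hashvalue({"ngram": 2}, refs)
```

Two runs whose digests differ were evaluated under different conditions, which is the situation the hash value is meant to expose.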

Basic Classes

class cotk.metric.MetricBase(name, version)[source]

Base class for metrics.


Processing a batch of data.


data (dict) – A dict containing the data that the metric needs.

close() → Dict[Any, Any][source]

Close the metric and return a dict containing results. Once the metric is closed, any operation on the metric (e.g. forward or another close) will raise a ValueError.
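The forward/close life cycle described above can be sketched with a toy class. ToyAverageMetric is hypothetical and does not inherit from MetricBase; it only mimics the documented behaviour of accumulating batches and raising ValueError after close.

```python
class ToyAverageMetric:
    """Hypothetical metric (not part of cotk) that mimics forward() and close()."""

    def __init__(self, value_key="value"):
        self.value_key = value_key
        self._values = []
        self._closed = False

    def forward(self, data):
        # Accumulate one batch; refuse to work once the metric is closed.
        if self._closed:
            raise ValueError("The metric has been closed.")
        self._values.extend(data[self.value_key])

    def close(self):
        # Return the results once; a second close is an error.
        if self._closed:
            raise ValueError("The metric has been closed.")
        self._closed = True
        return {"average": sum(self._values) / len(self._values)}


metric = ToyAverageMetric()
metric.forward({"value": [1.0, 2.0]})
metric.forward({"value": [3.0]})
result = metric.close()  # {"average": 2.0}
```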


Invoked by forward() or close() to hash relevant data when computing a metric.


data_list (list) – relevant data organized as list.


Invoked by close() to return the recorded hash value.

Metric class


class cotk.metric.PerplexityMetric(dataloader, reference_allvocabs_key='ref_allvocabs', reference_len_key='ref_length', gen_log_prob_key='gen_log_prob', generate_rare_vocab=False, full_check=False)[source]

Metric for calculating perplexity.

  • dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.

  • reference_allvocabs_key (str, optional) – The key of reference sentences. Default: ref_allvocabs.

  • reference_len_key (str, optional) – The key of lengths of reference sentences. Default: ref_length.

  • gen_log_prob_key (str, optional) – The key of predicted log probability over words. Default: gen_log_prob.

  • generate_rare_vocab (bool, optional) – Whether gen_log_prob contains invalid vocab. Default: False.

  • full_check (bool, optional) – Whether to perform a full check on gen_log_prob to make sure the sum of probability is 1. Otherwise, a random check will be performed for efficiency. If PyTorch is used, a full check is always performed and this argument will be ignored. Default: False.

Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> reference_allvocabs_key="ref_allvocabs"
>>> reference_len_key="ref_length"
>>> gen_log_prob_key="gen_log_prob"
>>> metric = cotk.metric.PerplexityMetric(dl,
...     reference_allvocabs_key=reference_allvocabs_key,
...     reference_len_key=reference_len_key,
...     gen_log_prob_key=gen_log_prob_key)
>>> data = {
...     reference_allvocabs_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # reference_allvocabs_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
...     reference_len_key: [5, 5],
...     gen_log_prob_key: [[[-11.31, -11.31,  -0.69, ..., -11.31, -11.31, -11.31],...],...] # shape == (batch, length, vocab_size)
... }
>>> metric.forward(data)
>>> metric.close()
{'perplexity': 81458.00000000006,
 'perplexity hashvalue': '7f9b88b8f9996f5d49a512258f250fbc56adee714952b2c696c0b36cce36f648'}

Processing a batch of data. Smoothing will be performed for rare vocabs.


data (dict) – A dict containing at least the following keys:

  • data[reference_allvocabs_key] (list, numpy.ndarray, torch.Tensor): A 2-d jagged or padded array of int. Reference sentences with allvocabs in index form. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~ref_sentence_length], where “~” means that different sizes are allowed in this dimension.

  • data[reference_len_key] (list, numpy.ndarray): Lengths of the reference sentences, counting the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size].

  • data[gen_log_prob_key] (list, numpy.ndarray, torch.Tensor): The log softmax probabilities output by the sentence generation model. A 3-d jagged or padded array of float. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length, vocab_size] for generate_rare_vocab = False, or [batch_size, ~gen_sentence_length, all_vocab_size] for generate_rare_vocab = True, where “~” means that different sizes are allowed in this dimension. If torch.Tensor is used, the following data should also be torch.Tensor.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     reference_allvocabs_key: [[2,4,3], [2,5,6,3]],
...     reference_len_key: [3,4],
...     gen_log_prob_key: [[[-3.80666249, -3.11351531, -2.7080502 , -2.42036813, -2.19722458,
...         -2.01490302, -1.86075234, -1.72722095, -1.60943791],...],...]
... }


data[gen_log_prob_key] must be the output of log_softmax. That is, np.sum(np.exp(gen_log_prob), -1) must equal np.ones((batch_size, gen_sentence_length)).
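The normalization requirement can be checked before calling forward. The following is a minimal sketch with NumPy; the log_softmax helper and the random logits are assumptions for illustration, and a real model would supply its own log probabilities.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log softmax over the last (vocabulary) axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))


# Hypothetical logits with shape (batch_size, gen_sentence_length, vocab_size).
logits = np.random.RandomState(0).randn(2, 3, 9)
gen_log_prob = log_softmax(logits)

# Probabilities must sum to 1 at every position of every sentence.
prob_sum = np.sum(np.exp(gen_log_prob), -1)
is_normalized = np.allclose(prob_sum, np.ones((2, 3)))
```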

close() → Dict[str, Any][source]

Return a dict which contains

  • perplexity: perplexity value.

  • perplexity hashvalue: hash value for perplexity metric, same hash value stands for same evaluation settings.
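As a rough illustration of the returned value, the sketch below computes perplexity with the standard definition: the exponential of the negative mean log probability assigned to the reference tokens. The function toy_perplexity is hypothetical and ignores the smoothing for rare vocabs that PerplexityMetric performs, so it is not a substitute for the metric itself.

```python
import math


def toy_perplexity(references, gen_log_prob):
    """Hypothetical helper: exp of the negative mean log-probability of reference tokens."""
    total_log_prob = 0.0
    token_count = 0
    for ref, step_log_probs in zip(references, gen_log_prob):
        targets = ref[1:]  # drop the start token; predictions begin at the second token
        for target, log_probs in zip(targets, step_log_probs):
            total_log_prob += log_probs[target]
            token_count += 1
    return math.exp(-total_log_prob / token_count)


vocab_size = 4
uniform = [math.log(1.0 / vocab_size)] * vocab_size
references = [[2, 1, 3]]             # <go>, one word, <eos> in a hypothetical 4-word vocabulary
gen_log_prob = [[uniform, uniform]]  # two prediction steps with a uniform distribution
ppl = toy_perplexity(references, gen_log_prob)  # a uniform model has perplexity vocab_size
```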


class cotk.metric.MultiTurnPerplexityMetric(dataloader, multi_turn_reference_allvocabs_key='multi_turn_ref_allvocabs', multi_turn_reference_len_key='multi_turn_ref_length', multi_turn_gen_log_prob_key='multi_turn_gen_log_prob', generate_rare_vocab=False, full_check=False)[source]

Metric for calculating multi-turn perplexity.

  • dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.

  • multi_turn_reference_allvocabs_key (str, optional) – The key of reference sentences. Default: multi_turn_ref_allvocabs.

  • multi_turn_reference_len_key (str, optional) – The key of lengths of reference sentences. Default: multi_turn_ref_length.

  • multi_turn_gen_log_prob_key (str, optional) – The key of predicted log probability over words. Default: multi_turn_gen_log_prob.

  • generate_rare_vocab (bool, optional) – Whether gen_log_prob contains invalid vocab. Default: False.

  • full_check (bool, optional) – Whether to perform a full check on gen_log_prob to make sure the sum of probability is 1. Otherwise, a random check will be performed for efficiency. If PyTorch is used, a full check is always performed and this argument will be ignored. Default: False.

Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> multi_turn_reference_allvocabs_key = "multi_turn_ref_allvocabs"
>>> multi_turn_reference_len_key = "multi_turn_ref_length"
>>> multi_turn_gen_log_prob_key = "multi_turn_gen_log_prob"
>>> metric = cotk.metric.MultiTurnPerplexityMetric(dl,
...     multi_turn_reference_allvocabs_key="multi_turn_ref_allvocabs",
...     multi_turn_reference_len_key="multi_turn_ref_length",
...     multi_turn_gen_log_prob_key="multi_turn_gen_log_prob")
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2, 10, 64, 851, 3], [2, 10, 64, 479, 3]], [[2, 10, 64, 279, 1460, 3]]],
...     # multi_turn_reference_allvocabs_key = [[["<go>", "I", "like", "python", "<eos>"],
...     #   ["<go>", "I", "like", "java", "<eos>"]],
...     #   [["<go>", "I", "like", "machine", "learning", "<eos>"]]]
...     multi_turn_reference_len_key: [[5, 5], [6]],
...     multi_turn_gen_log_prob_key: [[[[-11.30784283, -11.30784283,  -0.69312263, ..., -11.30784283, -11.30784283, -11.30784283], ...], ...], ...]
... }
>>> metric.forward(data)
>>> metric.close()
{'perplexity': 81458.00000000006,
 'perplexity hashvalue': '3a7647507f2e0d05a235c1d3a29515dc8885650884d625a5b76d305541dca685'}

Processing a batch of data.


data (dict) – A dict containing at least the following keys:

  • data[multi_turn_reference_allvocabs_key] (list, numpy.ndarray, torch.Tensor): A 3-d jagged or padded array of int. Multi-turn reference sentences with all vocabs. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~turn_length, ~sentence_length], where “~” means that different sizes are allowed in this dimension.

  • data[multi_turn_reference_len_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. If padded, redundant positions must be set to 0. Lengths of the multi-turn reference sentences, counting the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~turn_length], where “~” means that different sizes are allowed in this dimension.

  • data[multi_turn_gen_log_prob_key] (list, numpy.ndarray, torch.Tensor): The log softmax probabilities output by the sentence generation model. A 4-d jagged or padded array. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~turn_length, ~gen_sentence_length, vocab_size] for generate_rare_vocab = False, or [batch_size, ~turn_length, ~gen_sentence_length, all_vocab_size] for generate_rare_vocab = True, where “~” means that different sizes are allowed in this dimension. If torch.Tensor is used, the following data should also be torch.Tensor.
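The padded form of data[multi_turn_reference_len_key] can be derived from jagged sessions as sketched below. The helper padded_lengths is hypothetical; it only shows that sessions with fewer turns are filled with 0, as required above.

```python
def padded_lengths(sessions):
    """Hypothetical helper: per-turn sentence lengths, padded with 0 to the longest session."""
    max_turns = max(len(session) for session in sessions)
    return [
        [len(sentence) for sentence in session] + [0] * (max_turns - len(session))
        for session in sessions
    ]


# Two sessions: the first has two turns, the second has one.
sessions = [[[2, 4, 3], [2, 5, 6, 3]], [[2, 7, 6, 8, 3]]]
lengths = padded_lengths(sessions)  # [[3, 4], [5, 0]]
```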

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2,4,3], [2,5,6,3]], [[2,7,6,8,3]]],
...     multi_turn_reference_len_key: [[3, 4], [5]],
...     multi_turn_gen_log_prob_key: [[[[-3.80666249, -3.11351531, -2.7080502,
...         -2.42036813, -2.19722458, -2.01490302, -1.86075234, -1.72722095,
...         -1.60943791], ...], ...], ...]
... }


data[multi_turn_gen_log_prob_key] must be the output of log_softmax. That is, np.sum(np.exp(multi_turn_gen_log_prob), -1) must equal np.ones((batch_size, ~turn_length, ~gen_sentence_length)).

close() → Dict[str, Any][source]

Return a dict which contains

  • perplexity: perplexity value.

  • perplexity hashvalue: hash value for perplexity metric, same hash value stands for same evaluation settings.


class cotk.metric.BleuCorpusMetric(dataloader, ngram=4, *, tokenizer=None, reference_num=1, ignore_smoothing_error=False, reference_allvocabs_key='ref_allvocabs', gen_key='gen', reference_str_key='ref_str')[source]

Metric for calculating BLEU.

  • dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.

  • ngram (int, optional) – The order of ngram to calculate metrics like BLEU and Perplexity. Default: 4.

  • tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.

  • reference_num (int, None, optional) – The number of references used to calculate BLEU. If None, the number of references is uncertain and will be determined by the argument of forward(). Default: 1.

  • ignore_smoothing_error (bool, optional) – Specifies whether to ignore the smoothing error when calculating BLEU. Default: False.

  • reference_allvocabs_key (str, optional) – The key of reference sentences. Default: ref_allvocabs.

  • gen_key (str, optional) – The key of generated sentences. Default: gen.

  • reference_str_key (str, optional) – The key of reference sentences in the string form. Default: ref_str.

Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> reference_allvocabs_key = "ref_allvocabs"
>>> gen_key = "gen"
>>> metric = cotk.metric.BleuCorpusMetric(dl,
...     reference_allvocabs_key=reference_allvocabs_key,
...     gen_key=gen_key)
>>> data = {
...     reference_allvocabs_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # reference_allvocabs_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
...     gen_key: [[10, 1028, 479, 285, 220, 3], [851, 17, 2451, 3]]
...     # gen_key: [["I", "love", "java", "very", "much", "<eos>"], ["python", "is", "excellent", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'bleu': 0.08582363099612991,
'bleu hashvalue': '70e019630fef24d9477034a3d941a5349fcbff5a3dc6978a13ea3d85290114fb'}

Processing a batch of data.


data (dict) – A dict containing at least the following keys:

  • data[reference_allvocabs_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Reference sentences with allvocabs in index form. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~ref_sentence_length], where “~” means that different sizes are allowed in this dimension.

  • data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means that different sizes are allowed in this dimension.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     reference_allvocabs_key: [[2,4,3], [2,5,6,3]],
...     gen_key: [[4,5,3], [6,7,8,3]]
... }

close() → Dict[str, Any][source]

Return a dict which contains

  • bleu: bleu value.

  • bleu hashvalue: hash value for bleu metric, same hash value stands for same evaluation settings.
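For readers unfamiliar with BLEU, its core ingredient is the clipped (modified) n-gram precision. The sketch below shows that quantity for a single sentence pair in plain Python. It is a simplified illustration, not the implementation behind BleuCorpusMetric, which also aggregates over the corpus, combines several n-gram orders and applies a brevity penalty; the helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of one hypothesis against one reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


reference = [4, 5, 6, 3]   # reference ids with the start token already removed
hypothesis = [4, 5, 7, 3]  # generated ids, ending with <eos>
p1 = modified_precision(reference, hypothesis, 1)  # 3 of 4 unigrams match
p2 = modified_precision(reference, hypothesis, 2)  # 1 of 3 bigrams matches
```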


class cotk.metric.SelfBleuCorpusMetric(dataloader, ngram=4, *, tokenizer=None, gen_key='gen', sample=1000, seed=1229, cpu_count=None)[source]

Metric for calculating Self-BLEU.

  • dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.

  • ngram (int, optional) – The order of ngram to calculate metrics like BLEU and Perplexity. Default: 4.

  • tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.

  • gen_key (str, optional) – The key of generated sentences. Default: gen.

  • sample (int, optional) – Number of examples sampled from the generated sentences. Default: 1000.

  • seed (int, optional) – Random seed for sampling. Default: 1229.

  • cpu_count (int, optional) – Number of CPUs used for multiprocessing. Multiprocessing will NOT be used when cpu_count is set to 1 or the dataset is small. Default: If None, the environment variable CPU_COUNT will be used when available; otherwise all available CPUs will be used.


The calculation of hashvalue considers the actual sample size of hypotheses, which will be less than sample if the number of hypotheses is smaller than sample.
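The effect of sample and seed can be pictured with the sketch below. The function sample_hypotheses is a hypothetical stand-in that uses Python's random module; the sampling routine inside SelfBleuCorpusMetric may differ, but the capping of the sample size at the number of hypotheses is the behaviour described above.

```python
import random


def sample_hypotheses(hypotheses, sample=1000, seed=1229):
    """Hypothetical helper: draw at most `sample` hypotheses with a fixed seed."""
    actual_size = min(sample, len(hypotheses))  # fewer hypotheses than `sample` are all kept
    rng = random.Random(seed)
    return rng.sample(hypotheses, actual_size)


hyps = [[10, 64, 851, 3], [10, 48, 851, 3], [851, 17, 2451, 3]]
small = sample_hypotheses(hyps, sample=1000)  # only 3 hypotheses exist, so 3 are used
capped = sample_hypotheses(hyps, sample=2)    # 2 hypotheses, reproducible for a fixed seed
```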

Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> gen_key = 'gen'
>>> metric = cotk.metric.SelfBleuCorpusMetric(dl, gen_key=gen_key)
>>> data = {
...     gen_key: [[10, 64, 851, 3], [10, 48, 851, 3]],
...     # gen_key: [["I", "like", "python", "<eos>"], ["I", "use", "python", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'self-bleu': 0.13512001548070346,
'self-bleu hashvalue': '53cf55829c1b080c86c392c846a5d39a54340c70d838ec953f952aa6731118fb'}

Processing a batch of data.


data (dict) – A dict containing at least the following keys:

  • data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means that different sizes are allowed in this dimension.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     gen_key: [[4,5,3], [6,7,8,3]]
... }

close() → Dict[str, Any][source]

Return a dict which contains

  • self-bleu: self-bleu value.

  • self-bleu hashvalue: hash value for self-bleu metric, same hash value stands for same evaluation settings.


class cotk.metric.FwBwBleuCorpusMetric(dataloader, reference_test_list, ngram=4, *, tokenizer=None, gen_key='gen', sample=1000, seed=1229, cpu_count=None)[source]

Metric for calculating FwBw-BLEU.

  • dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.

  • reference_test_list (list) – Reference sentences with all vocabs in test data.

  • ngram (int, optional) – The order of ngram to calculate metrics like BLEU and Perplexity. Default: 4.

  • tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.

  • gen_key (str, optional) – The key of generated sentences. Default: gen.

  • sample (int, optional) – Number of examples sampled from the generated sentences. Default: 1000.

  • seed (int, optional) – Random seed for sampling. Default: 1229.

  • cpu_count (int, optional) – Number of CPUs used for multiprocessing. Multiprocessing will NOT be used when cpu_count is set to 1 or the dataset is small. Default: If None, the environment variable CPU_COUNT will be used when available; otherwise all available CPUs will be used.


The calculation of hashvalue considers the actual sample sizes of hypotheses and references. The hashvalue may therefore vary with the number of hypotheses or references if either is smaller than sample.

Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> gen_key = 'gen'
>>> metric = cotk.metric.FwBwBleuCorpusMetric(dl,
...     reference_test_list=dl.get_all_batch('test')['session'][0].tolist(),
...     gen_key=gen_key)
>>> data = {
...     gen_key: [[10, 64, 851, 3], [10, 48, 851, 3]],
...     # gen_key: [["I", "like", "python", "<eos>"], ["I", "use", "python", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'fw-bleu': 0.007688528488990184,
 'bw-bleu': 0.0012482612634667945,
 'fw-bw-bleu': 0.002147816509441494,
 'fw-bw-bleu hashvalue': '0e3f58a90225af615ff780f04c91613759e04a3c7b4329670b1d03b679adf8cd'}

Processing a batch of data.


data (dict) – A dict containing at least the following keys:

  • data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means that different sizes are allowed in this dimension.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     gen_key: [[4,5,3], [6,7,8,3]]
... }

close() → Dict[str, Any][source]

Return a dict which contains

  • fw-bleu: fw bleu value.

  • bw-bleu: bw bleu value.

  • fw-bw-bleu: harmonic mean of the fw and bw bleu values.

  • fw-bw-bleu hashvalue: hash value for fwbwbleu metric, same hash value stands for same evaluation settings.
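The relation between the three returned scores can be written out as a short sketch. The helper harmonic_mean is hypothetical; fed with the fw-bleu and bw-bleu values from the example output above, it gives a value that agrees with the listed fw-bw-bleu to within rounding.

```python
def harmonic_mean(fw_bleu, bw_bleu):
    """Harmonic mean of the forward and backward BLEU scores."""
    if fw_bleu + bw_bleu == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2.0 * fw_bleu * bw_bleu / (fw_bleu + bw_bleu)


# Values taken from the example output above.
fw_bw = harmonic_mean(0.007688528488990184, 0.0012482612634667945)
```

Because the harmonic mean is dominated by the smaller of the two scores, a model must do well in both directions to obtain a high fw-bw-bleu.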


class cotk.metric.MultiTurnBleuCorpusMetric(dataloader, ignore_smoothing_error=False, multi_turn_reference_allvocabs_key='reference_allvocabs', multi_turn_gen_key='multi_turn_gen', turn_len_key='turn_length')[source]

Metric for calculating multi-turn BLEU.

  • dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.

  • ignore_smoothing_error (bool, optional) – Specifies whether to ignore the smoothing error when calculating BLEU. Default: False.

  • multi_turn_reference_allvocabs_key (str, optional) – The key of reference sentences. Default: reference_allvocabs.

  • multi_turn_gen_key (str, optional) – The key of generated sentences. Default: multi_turn_gen.

  • turn_len_key (str, optional) – The key of length of turns. Default: turn_length.

Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> multi_turn_reference_allvocabs_key = "reference_allvocabs"
>>> multi_turn_gen_key = "multi_turn_gen"
>>> turn_len_key = "turn_length"
>>> metric = cotk.metric.MultiTurnBleuCorpusMetric(dl,
...     multi_turn_reference_allvocabs_key=multi_turn_reference_allvocabs_key,
...     multi_turn_gen_key=multi_turn_gen_key,
...     turn_len_key=turn_len_key)
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2, 10, 64, 851, 3], [2, 10, 64, 479, 3]], [[2, 10, 64, 279, 1460, 3]]],
...     # multi_turn_reference_allvocabs_key = [[["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "like", "java", "<eos>"]],
...     #   [["<go>", "I", "like", "machine", "learning", "<eos>"]]]
...     turn_len_key: [2, 1],
...     # turn_len_key: [len(multi_turn_reference_allvocabs_key[0]), len(multi_turn_reference_allvocabs_key[1])]
...     multi_turn_gen_key: [[[851, 17, 2451, 3], [2019, 17, 393, 3]], [[10, 64, 34058, 805, 2601, 3]]]
...     # multi_turn_gen_key = [[["python", "is", "excellent", "<eos>"], ["PHP", "is", "best", "<eos>"]],
...     #   [["I", "like", "natural", "language", "processing", "<eos>"]]]
... }
>>> metric.forward(data)
>>> metric.close()
{'bleu': 0.12081744577265555,
'bleu hashvalue': 'c65b44c454dee5a8d393901644c7f1acfdb847bae3ab03823cb5b9f643958960'}

Processing a batch of data.


data (dict) – A dict containing at least the following keys:

  • data[multi_turn_reference_allvocabs_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Multi-turn reference sentences with all vocabs. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~turn_length, ~sentence_length], where “~” means that different sizes are allowed in this dimension.

  • data[multi_turn_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~max_turn_length, ~gen_sentence_length], where “~” means that different sizes are allowed in this dimension.

  • data[turn_len_key] (list, numpy.ndarray): Number of turns in each sample. Size: [batch_size].

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2,4,3], [2,5,6,3]], [[2,7,6,8,3]]],
...     turn_len_key: [2, 1],
...     multi_turn_gen_key: [[[6,7,8,3], [4,5,3]], [[7,3]]]
... }

close() → Dict[str, Any][source]

Return a dict which contains

  • bleu: bleu value.

  • bleu hashvalue: hash value for bleu metric, same hash value stands for same evaluation settings.


class cotk.metric.BleuPrecisionRecallMetric(dataloader, ngram, generated_num_per_context, candidates_allvocabs_key='candidate_allvocabs', multiple_gen_key='multiple_gen')[source]

Metric for calculating sentence BLEU precision and recall.


[1] Zhao, T., Zhao, R., & Eskenazi, M. (2017). Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.

  • dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.

  • generated_num_per_context (int) – The number of sentences generated per context.

  • candidates_allvocabs_key (str, optional) – The key of reference sentences. Default: candidate_allvocabs.

  • multiple_gen_key (str, optional) – The key of multiple generated sentences. Default: multiple_gen.

  • ngram (int) – Specifies using BLEU-ngram.

Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> candidate_allvocabs_key = 'candidate_allvocabs'
>>> multiple_gen_key='multiple_gen'
>>> metric = cotk.metric.BleuPrecisionRecallMetric(dl, 2, 2)
>>> data = {
...     candidate_allvocabs_key: [[[10, 64, 851], [10, 48, 851]]],
...     # candidate_allvocabs_key: [[["I", "like", "python"], ["I", "use", "python"]]],
...     multiple_gen_key: [[[10, 64, 479, 3], [10, 48, 2019, 3]]],
...     # multiple_gen_key: [[["I", "like", "java", "<eos>"], ["I", "use", "PHP", "<eos>"]]],
... }
>>> metric.forward(data)
>>> metric.close()
{'BLEU-2 precision': 0.12909944355487823,
 'BLEU-2 recall': 0.12909944355487823,
 'BLEU-2 hashvalue': '1652cd40276078ec8722d367f18008bf14053572ac15ce10e270eb41eae34bbf'}

_score(gen, reference) → float[source]

Return a BLEU score in [0, 1] to calculate BLEU-ngram precision and recall.

  • gen (list) – list of generated word ids.

  • reference (list) – list of word ids of a reference.

Here is an example:

>>> gen = [4,5]
>>> reference = [5,6]
>>> self._score(gen, reference)
0.150 # assume self.weights = [0.25,0.25,0.25,0.25]

close() → Dict[str, Any]

Return a dict which contains

  • res_prefix precision: average precision.

  • res_prefix recall: average recall.

  • res_prefix hashvalue: hash value for precision & recall metric, same hash value stands for same evaluation settings.
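The average precision and recall follow the definition in Zhao et al. [1]: for one context, precision averages over the generated sentences the best score against any reference, and recall averages over the references the best score against any generated sentence. The sketch below illustrates this for a single context. The function precision_recall and the stand-in score function overlap are hypothetical; the metric itself uses _score and further averages over all contexts.

```python
def precision_recall(generations, references, score):
    """Precision and recall for one context, given a pairwise score function."""
    precision = sum(max(score(g, r) for r in references) for g in generations) / len(generations)
    recall = sum(max(score(g, r) for g in generations) for r in references) / len(references)
    return precision, recall


def overlap(gen, reference):
    """Hypothetical stand-in score: fraction of generated ids that occur in the reference."""
    return len(set(gen) & set(reference)) / len(set(gen))


generations = [[10, 64, 479], [10, 48, 2019]]
references = [[10, 64, 851], [10, 48, 851]]
precision, recall = precision_recall(generations, references, overlap)
```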


Processing a batch of data.


data (dict) – A dict containing at least the following keys:

  • data[candidate_allvocabs_key] (list, numpy.ndarray): A 3-d jagged list of indices. Multiple reference sentences for a single context. Contains neither the start token (e.g. <go>) nor the end token (e.g. <eos>). Size: [batch_size, ~sentence_num, ~word_num], where “~” means that different sizes are allowed in this dimension.

  • data[multiple_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, generated_num_per_context, ~gen_sentence_length], where “~” means that different sizes are allowed in this dimension.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     candidate_allvocabs_key: [[[4], [5,6]], [[4,5,6]]],
...     multiple_gen_key: [[[5,6,3]], [[4,5,7,3], [8,3]]]
... }


class cotk.metric.EmbSimilarityPrecisionRecallMetric(dataloader, word2vec, mode, generated_num_per_context, candidates_allvocabs_key='candidate_allvocabs', multiple_gen_key='multiple_gen')[source]

Metric for calculating cosine similarity precision and recall.


[1] Zhao, T., Zhao, R., & Eskenazi, M. (2017). Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.

  • dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.

  • generated_num_per_context (int) – The number of sentences generated per context.

  • candidates_allvocabs_key (str, optional) – The key of reference sentences. Default: candidate_allvocabs.

  • multiple_gen_key (str, optional) – The key of multiple generated sentences. Default: multiple_gen.

  • word2vec (dict) – Maps a word (str) to its pretrained embedding (numpy.ndarray or list).

  • mode (str) – Specifies the operation that computes the bag-of-word representation. Must be avg or extrema:

    • avg : element-wise average word embeddings.

    • extrema : element-wise maximum word embeddings.
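The two modes can be illustrated with a small pure-Python sketch. The helpers sentence_embedding and cosine, and the two-dimensional toy vectors, are assumptions for illustration; "extrema" is implemented here as the element-wise maximum, following the description above, and the metric's own code may treat details such as unknown words differently.

```python
import math


def sentence_embedding(words, word2vec, mode):
    """Hypothetical helper: bag-of-words sentence embedding in 'avg' or 'extrema' mode."""
    vectors = [word2vec[w] for w in words if w in word2vec]
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    if mode == "avg":
        return [sum(col) / len(col) for col in columns]
    if mode == "extrema":
        return [max(col) for col in columns]
    raise ValueError("mode must be 'avg' or 'extrema'")


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy 2-dimensional embeddings, invented for this example.
word2vec = {"I": [1.0, 0.0], "like": [0.0, 1.0], "python": [1.0, 1.0]}
avg = sentence_embedding(["I", "like"], word2vec, "avg")          # [0.5, 0.5]
extrema = sentence_embedding(["I", "like"], word2vec, "extrema")  # [1.0, 1.0]
similarity = cosine(avg, word2vec["python"])
```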

Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> candidate_allvocabs_key = 'candidate_allvocabs'
>>> multiple_gen_key='multiple_gen'
>>> wordvector = cotk.wordvector.Glove()
>>> metric = cotk.metric.EmbSimilarityPrecisionRecallMetric(dl, wordvector.load_dict(dl.all_vocab_list), 'avg', 2)
>>> data = {
...     candidate_allvocabs_key: [[[10, 64, 851], [10, 48, 851]]],
...     # candidate_allvocabs_key: [[["I", "like", "python"], ["I", "use", "python"]]],
...     multiple_gen_key: [[[10, 64, 479, 3], [10, 48, 2019, 3]]],
...     # multiple_gen_key: [[["I", "like", "java", "<eos>"], ["I", "use", "PHP", "<eos>"]]],
... }
>>> metric.forward(data)
>>> metric.close()
>>> # metric.close() returns a dict like this.
>>> # {'avg-bow precision': 0.0,
>>> # 'avg-bow recall': 0.0,
>>> # 'avg-bow hashvalue': '5abaaa9a8e709b3f05467e3f6d0e27c6cc904fceebd3accb3b768928595e729a'}

_score(gen, reference) → float[source]

Return a cosine similarity score in [0, 1] between two sentence embeddings to calculate cosine similarity precision and recall.

  • gen (list) – list of generated word ids.

  • reference (list) – list of word ids of a reference.

Here is an example:

>>> gen = [4,5]
>>> reference = [5,6]
>>> self._score(gen, reference)
0.135 # assume self.mode = 'avg'

close() → Dict[str, Any]

Return a dict which contains

  • res_prefix precision: average precision.

  • res_prefix recall: average recall.

  • res_prefix hashvalue: hash value for precision & recall metric, same hash value stands for same evaluation settings.


Processing a batch of data.


data (dict) – A dict containing at least the following keys:

  • data[candidate_allvocabs_key] (list, numpy.ndarray): A 3-d jagged list of indices. Multiple reference sentences for a single context. Contains neither the start token (e.g. <go>) nor the end token (e.g. <eos>). Size: [batch_size, ~sentence_num, ~word_num], where “~” means that different sizes are allowed in this dimension.

  • data[multiple_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, generated_num_per_context, ~gen_sentence_length], where “~” means that different sizes are allowed in this dimension.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     candidate_allvocabs_key: [[[4], [5,6]], [[4,5,6]]],
...     multiple_gen_key: [[[5,6,3]], [[4,5,7,3], [8,3]]]
... }


class cotk.metric.NgramFwBwPerplexityMetric(dataloader, reference_test_list, ngram=4, *, tokenizer=None, gen_key='gen', sample=10000, seed=1229, cpu_count=None)[source]

Metric for calculating n-gram forward perplexity and backward perplexity.

  • dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.

  • reference_test_list (list) – Reference sentences with all vocabs in test data.

  • ngram (int, optional) – The order of ngram to calculate metrics like BLEU and Perplexity. Default: 4.

  • tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.

  • gen_key (str, optional) – The key of generated sentences. Default: gen.

  • sample (int, optional) – Number of examples sampled from the generated sentences. Default: 10000.

  • seed (int, optional) – Random seed for sampling. Default: 1229.

  • cpu_count (int, optional) – Number of CPUs used for multiprocessing. Multiprocessing will NOT be used when cpu_count is set to 1 or the dataset is small. Default: If None, the environment variable CPU_COUNT will be used when available; otherwise all available CPUs will be used.

Here is an example (it shows only the format of the results, not exact values):

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> gen_key = "gen"
>>> metric = cotk.metric.NgramFwBwPerplexityMetric(dl, dl.get_all_batch('test')['session'][0].tolist(), 2, gen_key=gen_key)
>>> data = {
...     gen_key: [[10, 1028, 479, 285, 220, 3], [851, 17, 2451, 3]]
...     # gen_key: [["I", "love", "java", "very", "much", "<eos>"], ["python", "is", "excellent", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'fwppl': 51.44751843841384,
 'bwppl': 138.954327895075,
 'fwppl hashvalue': '2ea52377084692953f602e4ebad23e8a46e1c4bb527947d29a03c14b426efe67',
 'bwppl hashvalue': '2ea52377084692953f602e4ebad23e8a46e1c4bb527947d29a03c14b426efe67'}

Processing a batch of data.


data (dict) – A dict that contains at least the following keys:

  • data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>), but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means that different sizes are allowed in this dimension.

close() → Dict[str, Any][source]

Return a dict which contains:

  • fwppl: The forward perplexity value.

  • bwppl: The backward perplexity value.

  • fwppl hashvalue: The hash value of the forward perplexity.

  • bwppl hashvalue: The hash value of the backward perplexity.
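To make the two values more concrete, here is a minimal sketch under stated assumptions. It assumes that forward perplexity fits an n-gram model on the reference sentences and scores the generated ones, and that backward perplexity swaps the two roles; this page does not confirm that definition. It also swaps in simple add-one smoothing, which is not necessarily the smoothing cotk uses, so the numbers will differ from those returned by the metric.

```python
import math
from collections import Counter


def train_ngram_counts(sentences, n):
    """Count n-grams and their (n-1)-gram contexts over id sequences."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for sent in sentences:
        padded = ["<s>"] * (n - 1) + list(sent)
        vocab.update(sent)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts, vocab


def perplexity(train_sentences, test_sentences, n=2):
    """Add-one smoothed n-gram perplexity of test_sentences."""
    ngrams, contexts, vocab = train_ngram_counts(train_sentences, n)
    vocab_size = len(vocab) + 1  # one extra bucket for unseen tokens
    log_prob, token_count = 0.0, 0
    for sent in test_sentences:
        padded = ["<s>"] * (n - 1) + list(sent)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            count = ngrams[context + (padded[i],)]
            prob = (count + 1) / (contexts[context] + vocab_size)
            log_prob += math.log(prob)
            token_count += 1
    return math.exp(-log_prob / token_count)


reference = [[4, 5, 6, 3], [4, 5, 3]]
generated = [[4, 5, 3], [4, 6, 3]]
# Assumed convention: "forward" trains on references, "backward" on generations.
fwppl = perplexity(reference, generated)
bwppl = perplexity(generated, reference)
```

Under this reading, a low fwppl suggests the generated sentences look like the references, while a low bwppl suggests the generated sentences cover what appears in the references.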

Metric-like Classes


class cotk.metric.SingleTurnDialogRecorder(dataloader, post_allvocabs_key='post_allvocabs', resp_allvocabs_key='resp_allvocabs', gen_key='gen')[source]

A metric-like class for recording generated sentences and references.


Here is an example:

>>> post_allvocabs_key = "post_allvocabs"
>>> resp_allvocabs_key = "resp_allvocabs"
>>> gen_key = "gen"
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> metric = cotk.metric.SingleTurnDialogRecorder(dl,
...     post_allvocabs_key=post_allvocabs_key,
...     resp_allvocabs_key=resp_allvocabs_key,
...     gen_key=gen_key)
>>> data = {
...     post_allvocabs_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # post_allvocabs_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
...     resp_allvocabs_key: [[2, 10, 1214, 479, 3], [2, 851, 17, 2451, 3]],
...     # resp_allvocabs_key: [["<go>", "I", "prefer", "java", "<eos>"], ["<go>", "python", "is", "excellent", "<eos>"]],
...     gen_key: [[10, 64, 2019, 3], [851, 17, 4124, 3]],
...     # gen_key: [["I", "like", "PHP", "<eos>"], ["python", "is", "powerful", "<eos>"]]
... }
>>> metric.forward(data)
>>> metric.close()
{'post': [['I', 'like', 'python'], ['I', 'use', 'python']],
 'resp': [['I', 'prefer', 'java'], ['python', 'is', 'excellent']],
 'gen': [['I', 'like', 'PHP'], ['python', 'is', 'powerful']]}

Processing a batch of data.


data (dict) – A dict that contains at least the following keys:

  • data[post_allvocabs_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Post sentences with allvocabs in index form. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~post_sentence_length], where “~” means that different sizes are allowed in this dimension.

  • data[resp_allvocabs_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Response (reference) sentences with allvocabs in index form. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~ref_sentence_length], where “~” means that different sizes are allowed in this dimension.

  • data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>), but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means that different sizes are allowed in this dimension.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     post_allvocabs_key: [[2,4,3], [2,5,6,3]],
...     resp_allvocabs_key: [[2,5,4,3], [2,6,3]],
...     gen_key: [[6,7,8,3], [4,5,3]]
... }

close() → Dict[str, Any][source]

Return a dict which contains:

  • post: A list of post sentences. A jagged 2-d array of tokens (str), as in the example above. Size: [batch_size, ~sent_length], where “~” means that different sizes are allowed in this dimension.

  • resp: A list of response sentences. A jagged 2-d array of tokens (str), as in the example above. Size: [batch_size, ~sent_length], where “~” means that different sizes are allowed in this dimension.

  • gen: A list of generated sentences. A jagged 2-d array of tokens (str), as in the example above. Size: [batch_size, ~sent_length], where “~” means that different sizes are allowed in this dimension.
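The conversion from index form to recorded tokens can be sketched without cotk. The helpers ids_to_tokens and record_batch below are hypothetical, and the toy vocabulary is the one from the data example above. The sketch assumes that a leading <go> is dropped and that everything from the first <eos> onward (including padding) is cut, which matches the example output shown for this class but is not taken from the cotk source.

```python
ALL_VOCAB_LIST = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
                  "been", "to", "China"]
GO_ID, EOS_ID = 2, 3


def ids_to_tokens(sentence, vocab=ALL_VOCAB_LIST):
    """Drop a leading <go>, cut at the first <eos>, and map ids to tokens."""
    ids = list(sentence)
    if ids and ids[0] == GO_ID:
        ids = ids[1:]
    if EOS_ID in ids:
        ids = ids[:ids.index(EOS_ID)]
    return [vocab[i] for i in ids]


def record_batch(data, keys):
    """Convert every sentence stored under the given keys of one batch."""
    return {name: [ids_to_tokens(s) for s in data[key]]
            for name, key in keys.items()}


data = {
    "post_allvocabs": [[2, 4, 3], [2, 5, 6, 3]],
    "resp_allvocabs": [[2, 5, 4, 3], [2, 6, 3]],
    "gen": [[6, 7, 8, 3], [4, 5, 3, 0, 0]],  # second row is padded with <pad>
}
recorded = record_batch(
    data, {"post": "post_allvocabs", "resp": "resp_allvocabs", "gen": "gen"})
# recorded["gen"] == [["been", "to", "China"], ["I", "have"]]
```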


class cotk.metric.MultiTurnDialogRecorder(dataloader, multi_turn_reference_allvocabs_key='multi_turn_ref_allvocabs', multi_turn_gen_key='multi_turn_gen', turn_len_key='turn_length')[source]

A metric-like class for recording generated sentences and references.

  • dataloader (dataloader.LanguageProcessing, dataloader.Session) – A language generation dataloader.

  • multi_turn_reference_allvocabs_key (str, optional) – The key of dialog references with allvocabs. Default: multi_turn_ref_allvocabs.

  • multi_turn_gen_key (str, optional) – The key of generated sentences. Default: multi_turn_gen.

  • turn_len_key (str, optional) – The key of length of turns. Default: turn_length.

Here is an example:

>>> multi_turn_reference_allvocabs_key = "multi_turn_ref_allvocabs"
>>> multi_turn_gen_key = "multi_turn_gen"
>>> turn_len_key = "turn_length"
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> metric = cotk.metric.MultiTurnDialogRecorder(dl,
...     multi_turn_reference_allvocabs_key=multi_turn_reference_allvocabs_key,
...     multi_turn_gen_key=multi_turn_gen_key,
...     turn_len_key=turn_len_key)
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2, 10, 64, 851, 3], [2, 10, 64, 479, 3]], [[2, 10, 64, 279, 1460, 3]]],
...     # multi_turn_reference_allvocabs_key: [[["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "like", "java", "<eos>"]],
...     #   [["<go>", "I", "like", "machine", "learning", "<eos>"]]]
...     turn_len_key: [2, 1],
...     # turn_len_key: [len(multi_turn_reference_allvocabs_key[0]), len(multi_turn_reference_allvocabs_key[1])]
...     multi_turn_gen_key: [[[851, 17, 2451, 3], [2019, 17, 393, 3]], [[10, 64, 34058, 805, 2601, 3]]]
...     # multi_turn_gen_key: [[["python", "is", "excellent", "<eos>"], ["PHP", "is", "best", "<eos>"]],
...     #   [["I", "like", "natural", "language", "processing", "<eos>"]]]
... }
>>> metric.forward(data)
>>> metric.close()
{'reference': [[['I', 'like', 'python'], ['I', 'like', 'java']],
 [['I', 'like', 'machine', 'learning']]],
 'gen': [[['python', 'is', 'excellent'],
 ['PHP', 'is', 'best']],
 [['I', 'like', 'natural', 'language', 'processing']]]}

Processing a batch of data.


data (dict) – A dict that contains at least the following keys:

  • data[multi_turn_reference_allvocabs_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Multi-turn reference sentences with all vocabs. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~turn_length, ~sentence_length], where “~” means that different sizes are allowed in this dimension.

  • data[multi_turn_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>), but not the start token (e.g. <go>). Size: [batch_size, ~max_turn_length, ~gen_sentence_length], where “~” means that different sizes are allowed in this dimension.

  • data[turn_len_key] (list, numpy.ndarray): Length of turns in each sample. Size: [batch_size].

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2,6,7,3], [2,5,3]], [[2,7,6,8,3]]],
...     multi_turn_gen_key: [[[6,7,8,3], [4,5,3]], [[7,3]]],
...     turn_len_key: [2,1]
... }

close() → Dict[str, Any][source]

Return a dict which contains:

  • reference: A list of reference sentences. A jagged 3-d array of tokens (str), as in the example above. Size: [batch_size, ~turn_length, ~sent_length], where “~” means that different sizes are allowed in this dimension.

  • gen: A list of generated sentences. A jagged 3-d array of tokens (str), as in the example above. Size: [batch_size, ~turn_length, ~sent_length], where “~” means that different sizes are allowed in this dimension.
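When the 3-d arrays are padded, the turn lengths tell the recorder how many turns of each session are real. The following sketch shows one plausible way to use them. The helpers trim_sentence and trim_session are hypothetical and the vocabulary is the toy one from the data example above; none of this is taken from the cotk source.

```python
ALL_VOCAB_LIST = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
                  "been", "to", "China"]
GO_ID, EOS_ID = 2, 3


def trim_sentence(sentence):
    """Drop a leading <go>, cut at the first <eos>, and map ids to tokens."""
    ids = list(sentence)
    if ids and ids[0] == GO_ID:
        ids = ids[1:]
    if EOS_ID in ids:
        ids = ids[:ids.index(EOS_ID)]
    return [ALL_VOCAB_LIST[i] for i in ids]


def trim_session(session, turn_length):
    """Keep only the first turn_length turns of one (possibly padded) session."""
    return [trim_sentence(sentence) for sentence in session[:turn_length]]


# Padded batch: the second session has one real turn followed by an all-<pad> turn.
padded_gen = [
    [[6, 7, 8, 3], [4, 5, 3, 0]],
    [[7, 3, 0, 0], [0, 0, 0, 0]],
]
turn_length = [2, 1]
gen = [trim_session(s, n) for s, n in zip(padded_gen, turn_length)]
# gen == [[["been", "to", "China"], ["I", "have"]], [["to"]]]
```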


class cotk.metric.LanguageGenerationRecorder(dataloader, gen_key='gen')[source]

A metric-like class for recording generated sentences.


Here is an example:

>>> gen_key = "gen_key"
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> metric = cotk.metric.LanguageGenerationRecorder(dl, gen_key=gen_key)
>>> data = {
...     gen_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # gen_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'gen': [['<go>', 'I', 'like', 'python'], ['<go>', 'I', 'use', 'python']]}

Processing a batch of data.


data (dict) – A dict that contains at least the following keys:

  • data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>), but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means that different sizes are allowed in this dimension.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     gen_key: [[6,7,8,3], [4,5,3]]
... }

close() → Dict[str, Any][source]

Return a dict which contains:

  • gen: A list of generated sentences. A jagged 2-d array of tokens (str), as in the example above. Size: [batch_size, ~sent_length], where “~” means that different sizes are allowed in this dimension.
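The forward/close life cycle of a recorder can be illustrated with a small stand-in class. ToyGenRecorder below is hypothetical and does not use cotk. It only mirrors two documented behaviours: batches accumulate over several forward calls, and any operation after close raises a ValueError. For brevity it stores the ids as given instead of converting them to tokens.

```python
class ToyGenRecorder:
    """Hypothetical stand-in for a recorder; not the cotk implementation."""

    def __init__(self, gen_key="gen"):
        self.gen_key = gen_key
        self._gen = []
        self._closed = False

    def forward(self, data):
        """Store the generated sentences of one batch."""
        if self._closed:
            raise ValueError("The metric has been closed.")
        self._gen.extend(list(sentence) for sentence in data[self.gen_key])

    def close(self):
        """Return everything recorded so far and forbid further use."""
        if self._closed:
            raise ValueError("The metric has been closed.")
        self._closed = True
        return {"gen": self._gen}


recorder = ToyGenRecorder()
recorder.forward({"gen": [[6, 7, 8, 3]]})
recorder.forward({"gen": [[4, 5, 3]]})
result = recorder.close()
# result == {"gen": [[6, 7, 8, 3], [4, 5, 3]]}
```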


class cotk.metric.MetricChain[source]

A metric-like class for stacking metrics. You can use this class to combine multiple metrics and use them as one.


>>> metric = MetricChain()
>>> metric.add_metric(BleuCorpusMetric())
>>> metric.add_metric(SingleTurnDialogRecorder(dataloader))

Todo: Give more examples of combining forward and close.


Add metric for processing.


metric (metric.MetricBase) – A metric object to add to the chain.


Processing a batch of data.


data (dict) – A dict that contains at least the keys needed by all the metric components.

close() → Dict[Any, Any][source]

Return a dict containing the items which all the metric components return.
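How forward and close combine across components can be sketched with toy classes. ToyChain, ToyCounter and ToyAverageLength are hypothetical stand-ins written for this illustration and are not the cotk implementation. The sketch assumes that forward passes the same batch to every component and that close merges the dicts returned by the components.

```python
class ToyCounter:
    """Hypothetical component: counts generated sentences."""

    def __init__(self):
        self.count = 0

    def forward(self, data):
        self.count += len(data["gen"])

    def close(self):
        return {"num_sentences": self.count}


class ToyAverageLength:
    """Hypothetical component: average length of generated sentences."""

    def __init__(self):
        self.total, self.count = 0, 0

    def forward(self, data):
        self.total += sum(len(sentence) for sentence in data["gen"])
        self.count += len(data["gen"])

    def close(self):
        return {"avg_length": self.total / self.count}


class ToyChain:
    """Hypothetical chain: forwards each batch to all components."""

    def __init__(self):
        self.metrics = []

    def add_metric(self, metric):
        self.metrics.append(metric)

    def forward(self, data):
        for metric in self.metrics:
            metric.forward(data)

    def close(self):
        result = {}
        for metric in self.metrics:
            result.update(metric.close())  # later keys overwrite equal names
        return result


chain = ToyChain()
chain.add_metric(ToyCounter())
chain.add_metric(ToyAverageLength())
chain.forward({"gen": [[6, 7, 8, 3], [4, 5, 3]]})
chain.forward({"gen": [[7, 3]]})
result = chain.close()
# result == {"num_sentences": 3, "avg_length": 3.0}
```

In this sketch, components that return the same key name would overwrite each other, so distinct result names per component are advisable.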