Metric¶
cotk.metric provides commonly used metrics for cotk.dataloader.
All metric objects receive a batch of data per call of forward.
The batch of data is represented by a dict that contains the model's outputs and
the answers. The answers are closely tied to the corresponding
dataloader: their types and shapes are usually identical to the return value of
get_batch in the dataloader, as long as the correct key names are set.
The forward function can be called several times, and finally
close can be called to obtain the results.
Here is an example:
>>> dm = OpenSubtitles()
>>> metric = BleuCorpusMetric(gen_key="gen",
...     reference_allvocabs_key="resp_allvocabs_key")
... # "resp_allvocabs_key" is a key name in get_batch()
>>> for data in dm.get_batches("test", batch_size=32):
...     data["gen"] = predict(data["post"])
...     assert "resp_allvocabs_key" in data
...     metric.forward(data)
>>> print(metric.close())
{"bleu": 0.135, "bleu hashvalue": b"XXXX"}
We also provide default metrics in the dataloaders: you can use “get_metric”-like
functions (for example, SingleTurnDialog.get_inference_metric()) to obtain the
default metrics and avoid the mess of complex key names.
Here is an example:
>>> dm = OpenSubtitles()
>>> metric = dm.get_inference_metric(gen_key="gen")
>>> for data in dm.get_batches("test", batch_size=32):
...     data["gen"] = predict(data["post"])
...     metric.forward(data)
>>> print(metric.close())
{"bleu": 0.135, "bleu hashvalue": b"XXXX", ...}
Hash Value¶
MetricBase.close() returns a dict containing a hash value,
which can be used to validate whether two models were tested on the same data with the
same settings. Only results produced by the same metric with the same hash
value can be compared with each other.
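For instance, a minimal sketch of such a comparison (metric_for_model_a and metric_for_model_b are hypothetical metric objects that have already consumed each model's test batches):
>>> result_a = metric_for_model_a.close()
>>> result_b = metric_for_model_b.close()
>>> if result_a["bleu hashvalue"] == result_b["bleu hashvalue"]:
...     print("same test data and settings; the BLEU scores are comparable")
... else:
...     print("different evaluation settings; do not compare the scores")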
Basic Classes¶
class cotk.metric.MetricBase(name, version)[source]¶
Base class for metrics.

forward(data)[source]¶
Process a batch of data.

Parameters
- data (dict) – A dict containing the data that the metric needs.

close() → Dict[Any, Any][source]¶
Close the metric and return a dict containing the results. Once the metric is closed, any operation on it (e.g. forward or another close) will raise a ValueError.
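As a sketch of how a custom metric could be built on this interface (a toy example with hypothetical names; it assumes the base close() returns a dict as documented, and hash-value bookkeeping is omitted):
>>> class AverageLengthMetric(cotk.metric.MetricBase):
...     '''A toy metric that reports the average length of generated sentences.'''
...     def __init__(self, gen_key="gen"):
...         super().__init__("AverageLengthMetric", "1.0")
...         self.gen_key = gen_key
...         self.lengths = []
...     def forward(self, data):
...         super().forward(data)  # base-class checks on the batch
...         self.lengths.extend(len(sent) for sent in data[self.gen_key])
...     def close(self):
...         res = super().close()  # close the metric and get the result dict
...         res["avg-length"] = sum(self.lengths) / len(self.lengths)
...         return res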
Metric class¶
PerplexityMetric¶
class cotk.metric.PerplexityMetric(dataloader, reference_allvocabs_key='ref_allvocabs', reference_len_key='ref_length', gen_log_prob_key='gen_log_prob', generate_rare_vocab=False, full_check=False)[source]¶
Metric for calculating perplexity.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
- reference_allvocabs_key (str, optional) – The key of reference sentences. Default: ref_allvocabs.
- reference_len_key (str, optional) – The key of lengths of reference sentences. Default: ref_length.
- gen_log_prob_key (str, optional) – The key of predicted log probabilities over words. Default: gen_log_prob.
- generate_rare_vocab (bool, optional) – Whether gen_log_prob contains invalid vocab. Default: False.
- full_check (bool, optional) – Whether to perform a full check on gen_log_prob to make sure the probabilities sum to 1. Otherwise, a random check is performed for efficiency. If PyTorch is used, a full check is always performed and this argument is ignored. Default: False.
 
Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> reference_allvocabs_key = "ref_allvocabs"
>>> reference_len_key = "ref_length"
>>> gen_log_prob_key = "gen_log_prob"
>>> metric = cotk.metric.PerplexityMetric(dl,
...     reference_allvocabs_key=reference_allvocabs_key,
...     reference_len_key=reference_len_key,
...     gen_log_prob_key=gen_log_prob_key)
>>> data = {
...     reference_allvocabs_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # reference_allvocabs_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
...     reference_len_key: [5, 5],
...     gen_log_prob_key: [[[-11.31, -11.31, -0.69, ..., -11.31, -11.31, -11.31], ...], ...]  # shape == (batch, length, vocab_size)
... }
>>> metric.forward(data)
>>> metric.close()
{'perplexity': 81458.00000000006,
 'perplexity hashvalue': '7f9b88b8f9996f5d49a512258f250fbc56adee714952b2c696c0b36cce36f648'}
forward(data)[source]¶
Process a batch of data. Smoothing will be performed for rare vocabs.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[reference_allvocabs_key] (list, numpy.ndarray, torch.Tensor): A 2-d jagged or padded array of int. Reference sentences with allvocabs in index form. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~ref_sentence_length], where “~” means different sizes in this dimension are allowed.
  - data[reference_len_key] (list, numpy.ndarray): Lengths of reference sentences, counting the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size].
  - data[gen_log_prob_key] (list, numpy.ndarray, torch.Tensor): The log softmax probabilities output by the generation model. A 3-d jagged or padded array of float. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length, vocab_size] for generate_rare_vocab = False, or [batch_size, ~gen_sentence_length, all_vocab_size] for generate_rare_vocab = True, where “~” means different sizes in this dimension are allowed. If torch.Tensor is used for one of these keys, the other keys should also use torch.Tensor.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     reference_allvocabs_key: [[2,4,3], [2,5,6,3]],
...     reference_len_key: [3,4],
...     gen_log_prob_key: [[[-3.80666249, -3.11351531, -2.7080502, -2.42036813, -2.19722458,
...         -2.01490302, -1.86075234, -1.72722095, -1.60943791], ...], ...]
... }

Warning
data[gen_log_prob_key] must be processed after log_softmax. That means np.sum(np.exp(gen_log_prob), -1) equals np.ones((batch_size, gen_sentence_length)).
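As a sketch of how gen_log_prob can be produced from raw model logits (a numpy version; logits is a placeholder, and torch.nn.functional.log_softmax does the same for tensors):
>>> import numpy as np
>>> def log_softmax(logits):
...     # numerically stable log softmax over the vocabulary axis
...     shifted = logits - logits.max(axis=-1, keepdims=True)
...     return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
>>> logits = np.random.randn(32, 20, 9)  # (batch, length, vocab_size)
>>> gen_log_prob = log_softmax(logits)
>>> np.allclose(np.exp(gen_log_prob).sum(-1), 1.0)  # the property required by the warning
True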
 
MultiTurnPerplexityMetric¶
class cotk.metric.MultiTurnPerplexityMetric(dataloader, multi_turn_reference_allvocabs_key='multi_turn_ref_allvocabs', multi_turn_reference_len_key='multi_turn_ref_length', multi_turn_gen_log_prob_key='multi_turn_gen_log_prob', generate_rare_vocab=False, full_check=False)[source]¶
Metric for calculating multi-turn perplexity.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
- multi_turn_reference_allvocabs_key (str, optional) – The key of reference sentences. Default: multi_turn_ref_allvocabs.
- multi_turn_reference_len_key (str, optional) – The key of lengths of reference sentences. Default: multi_turn_ref_length.
- multi_turn_gen_log_prob_key (str, optional) – The key of predicted log probabilities over words. Default: multi_turn_gen_log_prob.
- generate_rare_vocab (bool, optional) – Whether multi_turn_gen_log_prob contains invalid vocab. Default: False.
- full_check (bool, optional) – Whether to perform a full check on multi_turn_gen_log_prob to make sure the probabilities sum to 1. Otherwise, a random check is performed for efficiency. If PyTorch is used, a full check is always performed and this argument is ignored. Default: False.
 
Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> multi_turn_reference_allvocabs_key = "multi_turn_ref_allvocabs"
>>> multi_turn_reference_len_key = "multi_turn_ref_length"
>>> multi_turn_gen_log_prob_key = "multi_turn_gen_log_prob"
>>> metric = cotk.metric.MultiTurnPerplexityMetric(dl,
...     multi_turn_reference_allvocabs_key="multi_turn_ref_allvocabs",
...     multi_turn_reference_len_key="multi_turn_ref_length",
...     multi_turn_gen_log_prob_key="multi_turn_gen_log_prob")
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2, 10, 64, 851, 3], [2, 10, 64, 479, 3]], [[2, 10, 64, 279, 1460, 3]]],
...     # multi_turn_reference_allvocabs_key = [[["<go>", "I", "like", "python", "<eos>"],
...     #   ["<go>", "I", "like", "java", "<eos>"]],
...     #   [["<go>", "I", "like", "machine", "learning", "<eos>"]]]
...     multi_turn_reference_len_key: [[5, 5], [6]],
...     multi_turn_gen_log_prob_key: [[[[-11.30784283, -11.30784283, -0.69312263, ..., -11.30784283, -11.30784283, -11.30784283], ...], ...], ...]
... }
>>> metric.forward(data)
>>> metric.close()
{'perplexity': 81458.00000000006,
 'perplexity hashvalue': '3a7647507f2e0d05a235c1d3a29515dc8885650884d625a5b76d305541dca685'}
forward(data)[source]¶
Process a batch of data.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[multi_turn_reference_allvocabs_key] (list, numpy.ndarray, torch.Tensor): A 3-d jagged or padded array of int. Multi-turn reference sentences with all vocabs. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~turn_length, ~sentence_length], where “~” means different sizes in this dimension are allowed.
  - data[multi_turn_reference_len_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. If padded, redundant positions must be set to 0. Lengths of multi-turn reference sentences, counting the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~turn_length], where “~” means different sizes in this dimension are allowed.
  - data[multi_turn_gen_log_prob_key] (list, numpy.ndarray, torch.Tensor): The log softmax probabilities output by the generation model. A 4-d jagged or padded array of float. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~turn_length, ~gen_sentence_length, vocab_size] for generate_rare_vocab = False, or [batch_size, ~turn_length, ~gen_sentence_length, all_vocab_size] for generate_rare_vocab = True, where “~” means different sizes in this dimension are allowed. If torch.Tensor is used for one of these keys, the other keys should also use torch.Tensor.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2,4,3], [2,5,6,3]], [[2,7,6,8,3]]],
...     multi_turn_reference_len_key: [[3, 4], [5]],
...     multi_turn_gen_log_prob_key: [[[[-3.80666249, -3.11351531, -2.7080502, -2.42036813, -2.19722458,
...         -2.01490302, -1.86075234, -1.72722095, -1.60943791], ...], ...], ...]
... }

Warning
data[multi_turn_gen_log_prob_key] must be processed after log_softmax. That means np.sum(np.exp(multi_turn_gen_log_prob), -1) equals np.ones((batch_size, ~turn_length, ~gen_sentence_length)).
 
BleuCorpusMetric¶
class cotk.metric.BleuCorpusMetric(dataloader, ngram=4, *, tokenizer=None, reference_num=1, ignore_smoothing_error=False, reference_allvocabs_key='ref_allvocabs', gen_key='gen', reference_str_key='ref_str')[source]¶
Metric for calculating BLEU.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
- ngram (int, optional) – The maximum order of n-grams used to calculate BLEU. Default: 4.
- tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.
- reference_num (int, None, optional) – The number of references used to calculate BLEU. If None, the number of references is uncertain and will be determined by the argument of forward(). Default: 1.
- ignore_smoothing_error (bool, optional) – Whether to ignore the smoothing error when calculating BLEU. Default: False.
- reference_allvocabs_key (str, optional) – The key of reference sentences. Default: ref_allvocabs.
- gen_key (str, optional) – The key of generated sentences. Default: gen.
- reference_str_key (str, optional) – The key of reference sentences in string form. Default: ref_str.
 
Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> reference_allvocabs_key = "ref_allvocabs"
>>> gen_key = "gen"
>>> metric = cotk.metric.BleuCorpusMetric(dl,
...     reference_allvocabs_key=reference_allvocabs_key,
...     gen_key=gen_key)
>>> data = {
...     reference_allvocabs_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # reference_allvocabs_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
...     gen_key: [[10, 1028, 479, 285, 220, 3], [851, 17, 2451, 3]]
...     # gen_key: [["I", "love", "java", "very", "much", "<eos>"], ["python", "is", "excellent", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'bleu': 0.08582363099612991,
 'bleu hashvalue': '70e019630fef24d9477034a3d941a5349fcbff5a3dc6978a13ea3d85290114fb'}
forward(data)[source]¶
Process a batch of data.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[reference_allvocabs_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Reference sentences with allvocabs in index form. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~ref_sentence_length], where “~” means different sizes in this dimension are allowed.
  - data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     reference_allvocabs_key: [[2,4,3], [2,5,6,3]],
...     gen_key: [[4,5,3], [6,7,8,3]]
... }
 
 
SelfBleuCorpusMetric¶
class cotk.metric.SelfBleuCorpusMetric(dataloader, ngram=4, *, tokenizer=None, gen_key='gen', sample=1000, seed=1229, cpu_count=None)[source]¶
Metric for calculating Self-BLEU.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
- ngram (int, optional) – The maximum order of n-grams used to calculate Self-BLEU. Default: 4.
- tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.
- gen_key (str, optional) – The key of generated sentences. Default: gen.
- sample (int, optional) – Number of examples sampled from the generated sentences. Default: 1000.
- seed (int, optional) – Random seed for sampling. Default: 1229.
- cpu_count (int, optional) – Number of CPUs used for multiprocessing. Multiprocessing will NOT be used when cpu_count is set to 1 or the dataset is small. Default: if None, the environment variable CPU_COUNT is used when available; otherwise all available CPUs are used.
 
Warning
The calculation of hashvalue considers the actual sample size of hypotheses, which will be less than sample if fewer hypotheses than sample are provided.

Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> gen_key = 'gen'
>>> metric = cotk.metric.SelfBleuCorpusMetric(dl, gen_key=gen_key)
>>> data = {
...     gen_key: [[10, 64, 851, 3], [10, 48, 851, 3]],
...     # gen_key: [["I", "like", "python", "<eos>"], ["I", "use", "python", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'self-bleu': 0.13512001548070346,
 'self-bleu hashvalue': '53cf55829c1b080c86c392c846a5d39a54340c70d838ec953f952aa6731118fb'}
forward(data)[source]¶
Process a batch of data.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     gen_key: [[4,5,3], [6,7,8,3]]
... }
 
 
FwBwBleuCorpusMetric¶
class cotk.metric.FwBwBleuCorpusMetric(dataloader, reference_test_list, ngram=4, *, tokenizer=None, gen_key='gen', sample=1000, seed=1229, cpu_count=None)[source]¶
Metric for calculating FwBw-BLEU.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
- reference_test_list (list) – Reference sentences (with all vocabs) from the test data.
- ngram (int, optional) – The maximum order of n-grams used to calculate BLEU. Default: 4.
- tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.
- gen_key (str, optional) – The key of generated sentences. Default: gen.
- sample (int, optional) – Number of examples sampled from the generated sentences. Default: 1000.
- seed (int, optional) – Random seed for sampling. Default: 1229.
- cpu_count (int, optional) – Number of CPUs used for multiprocessing. Multiprocessing will NOT be used when cpu_count is set to 1 or the dataset is small. Default: if None, the environment variable CPU_COUNT is used when available; otherwise all available CPUs are used.
 
Warning
The calculation of hashvalue considers the actual sample sizes of both hypotheses and references. Therefore, hashvalue may vary with the size of the hypotheses or references if either is smaller than sample.

Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> gen_key = 'gen'
>>> metric = cotk.metric.FwBwBleuCorpusMetric(dl,
...     reference_test_list=dl.get_all_batch('test')['session'][0].tolist(),
...     gen_key=gen_key)
>>> data = {
...     gen_key: [[10, 64, 851, 3], [10, 48, 851, 3]],
...     # gen_key: [["I", "like", "python", "<eos>"], ["I", "use", "python", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'fw-bleu': 0.007688528488990184,
 'bw-bleu': 0.0012482612634667945,
 'fw-bw-bleu': 0.002147816509441494,
 'fw-bw-bleu hashvalue': '0e3f58a90225af615ff780f04c91613759e04a3c7b4329670b1d03b679adf8cd'}
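Note that fw-bw-bleu above is consistent with the harmonic mean of fw-bleu and bw-bleu; a quick check with the numbers from the example (this relationship is inferred from the values, not stated by the API):
>>> fw, bw = 0.007688528488990184, 0.0012482612634667945
>>> 2 * fw * bw / (fw + bw)  # ≈ 0.0021478..., matching 'fw-bw-bleu' above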
forward(data)[source]¶
Process a batch of data.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     gen_key: [[4,5,3], [6,7,8,3]]
... }
 
 
MultiTurnBleuCorpusMetric¶
class cotk.metric.MultiTurnBleuCorpusMetric(dataloader, ignore_smoothing_error=False, multi_turn_reference_allvocabs_key='reference_allvocabs', multi_turn_gen_key='multi_turn_gen', turn_len_key='turn_length')[source]¶
Metric for calculating multi-turn BLEU.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
- ignore_smoothing_error (bool, optional) – Whether to ignore the smoothing error when calculating BLEU. Default: False.
- multi_turn_reference_allvocabs_key (str, optional) – The key of reference sentences. Default: reference_allvocabs.
- multi_turn_gen_key (str, optional) – The key of generated sentences. Default: multi_turn_gen.
- turn_len_key (str, optional) – The key of the number of turns. Default: turn_length.
 
Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> multi_turn_reference_allvocabs_key = "reference_allvocabs"
>>> multi_turn_gen_key = "multi_turn_gen"
>>> turn_len_key = "turn_length"
>>> metric = cotk.metric.MultiTurnBleuCorpusMetric(dl,
...     multi_turn_reference_allvocabs_key=multi_turn_reference_allvocabs_key,
...     multi_turn_gen_key=multi_turn_gen_key,
...     turn_len_key=turn_len_key)
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2, 10, 64, 851, 3], [2, 10, 64, 479, 3]], [[2, 10, 64, 279, 1460, 3]]],
...     # multi_turn_reference_allvocabs_key = [[["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "like", "java", "<eos>"]],
...     #   [["<go>", "I", "like", "machine", "learning", "<eos>"]]]
...     turn_len_key: [2, 1],
...     # turn_len_key: [len(multi_turn_reference_allvocabs_key[0]), len(multi_turn_reference_allvocabs_key[1])]
...     multi_turn_gen_key: [[[851, 17, 2451, 3], [2019, 17, 393, 3]], [[10, 64, 34058, 805, 2601, 3]]]
...     # multi_turn_gen_key = [[["python", "is", "excellent", "<eos>"], ["PHP", "is", "best", "<eos>"]],
...     #   [["I", "like", "natural", "language", "processing", "<eos>"]]]
... }
>>> metric.forward(data)
>>> metric.close()
{'bleu': 0.12081744577265555,
 'bleu hashvalue': 'c65b44c454dee5a8d393901644c7f1acfdb847bae3ab03823cb5b9f643958960'}
forward(data)[source]¶
Process a batch of data.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[multi_turn_reference_allvocabs_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Multi-turn reference sentences with all vocabs. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~turn_length, ~sentence_length], where “~” means different sizes in this dimension are allowed.
  - data[multi_turn_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~max_turn_length, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
  - data[turn_len_key] (list, numpy.ndarray): Number of turns in each sample. Size: [batch_size].

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2,4,3], [2,5,6,3]], [[2,7,6,8,3]]],
...     turn_len_key: [2, 1],
...     multi_turn_gen_key: [[[6,7,8,3], [4,5,3]], [[7,3]]]
... }
 
 
BleuPrecisionRecallMetric¶
class cotk.metric.BleuPrecisionRecallMetric(dataloader, ngram, generated_num_per_context, candidates_allvocabs_key='candidate_allvocabs', multiple_gen_key='multiple_gen')[source]¶
Metric for calculating sentence BLEU precision and recall.

References
[1] Zhao, T., Zhao, R., & Eskenazi, M. (2017). Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
- ngram (int) – Specifies using BLEU-ngram.
- generated_num_per_context (int) – The number of sentences generated per context.
- candidates_allvocabs_key (str, optional) – The key of reference sentences. Default: candidate_allvocabs.
- multiple_gen_key (str, optional) – The key of multiple generated sentences. Default: multiple_gen.
 
Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> candidate_allvocabs_key = 'candidate_allvocabs'
>>> multiple_gen_key = 'multiple_gen'
>>> metric = cotk.metric.BleuPrecisionRecallMetric(dl, 2, 2)
>>> data = {
...     candidate_allvocabs_key: [[[10, 64, 851], [10, 48, 851]]],
...     # candidate_allvocabs_key: [[["I", "like", "python"], ["I", "use", "python"]]],
...     multiple_gen_key: [[[10, 64, 479, 3], [10, 48, 2019, 3]]],
...     # multiple_gen_key: [[["I", "like", "java", "<eos>"], ["I", "use", "PHP", "<eos>"]]],
... }
>>> metric.forward(data)
>>> metric.close()
{'BLEU-2 precision': 0.12909944355487823,
 'BLEU-2 recall': 0.12909944355487823,
 'BLEU-2 hashvalue': '1652cd40276078ec8722d367f18008bf14053572ac15ce10e270eb41eae34bbf'}
_score(gen, reference) → float[source]¶
Return a BLEU score in [0, 1], which is used to calculate BLEU-ngram precision and recall.

Parameters
- gen (list) – A list of generated word ids.
- reference (list) – A list of word ids of a reference.

Here is an example:

>>> gen = [4,5]
>>> reference = [5,6]
>>> self._score(gen, reference)
0.150  # assuming self.weights = [0.25, 0.25, 0.25, 0.25]
close() → Dict[str, Any]¶
Return a dict which contains:
- res_prefix precision: average precision.
- res_prefix recall: average recall.
- res_prefix hashvalue: hash value for the precision & recall metric; the same hash value stands for the same evaluation settings.
 
forward(data)¶
Process a batch of data.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[candidates_allvocabs_key] (list, numpy.ndarray): A 3-d jagged list of int. Multiple reference sentences for a single context. Does not contain the start token (e.g. <go>) or the end token (e.g. <eos>). Size: [batch_size, ~sentence_num, ~word_num], where “~” means different sizes in this dimension are allowed.
  - data[multiple_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, generated_num_per_context, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     candidates_allvocabs_key: [[[4], [5,6]], [[4,5,6]]],
...     multiple_gen_key: [[[5,6,3]], [[4,5,7,3], [8,3]]]
... }
 
 
EmbSimilarityPrecisionRecallMetric¶
class cotk.metric.EmbSimilarityPrecisionRecallMetric(dataloader, word2vec, mode, generated_num_per_context, candidates_allvocabs_key='candidate_allvocabs', multiple_gen_key='multiple_gen')[source]¶
Metric for calculating cosine similarity precision and recall.

References
[1] Zhao, T., Zhao, R., & Eskenazi, M. (2017). Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. arXiv preprint arXiv:1703.10960.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
- word2vec (dict) – Maps a word (str) to its pretrained embedding (numpy.ndarray or list).
- mode (str) – Specifies the operation that computes the bag-of-words representation. Must be avg or extrema:
  - avg: element-wise average of word embeddings.
  - extrema: element-wise maximum of word embeddings.
- generated_num_per_context (int) – The number of sentences generated per context.
- candidates_allvocabs_key (str, optional) – The key of reference sentences. Default: candidate_allvocabs.
- multiple_gen_key (str, optional) – The key of multiple generated sentences. Default: multiple_gen.
 
 
Here is an example:

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> candidate_allvocabs_key = 'candidate_allvocabs'
>>> multiple_gen_key = 'multiple_gen'
>>> wordvector = cotk.wordvector.Glove()
>>> metric = cotk.metric.EmbSimilarityPrecisionRecallMetric(dl,
...     wordvector.load_dict(dl.all_vocab_list), 'avg', 2)
>>> data = {
...     candidate_allvocabs_key: [[[10, 64, 851], [10, 48, 851]]],
...     # candidate_allvocabs_key: [[["I", "like", "python"], ["I", "use", "python"]]],
...     multiple_gen_key: [[[10, 64, 479, 3], [10, 48, 2019, 3]]],
...     # multiple_gen_key: [[["I", "like", "java", "<eos>"], ["I", "use", "PHP", "<eos>"]]],
... }
>>> metric.forward(data)
>>> metric.close()
>>> # metric.close() returns a dict like this:
>>> # {'avg-bow precision': 0.0,
>>> #  'avg-bow recall': 0.0,
>>> #  'avg-bow hashvalue': '5abaaa9a8e709b3f05467e3f6d0e27c6cc904fceebd3accb3b768928595e729a'}
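To illustrate the two modes, a minimal numpy sketch of the bag-of-words sentence embedding (the embedding values here are made up; the operations follow the mode descriptions above):
>>> import numpy as np
>>> emb = np.array([[0.1, 0.9], [0.4, -0.2], [0.5, 0.3]])  # embeddings of a 3-word sentence
>>> avg_bow = emb.mean(axis=0)      # 'avg': element-wise average of word embeddings
>>> extrema_bow = emb.max(axis=0)   # 'extrema': element-wise maximum of word embeddings
>>> avg_bow, extrema_bow
(array([0.33333333, 0.33333333]), array([0.5, 0.9]))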
_score(gen, reference) → float[source]¶
Return a cosine similarity score in [0, 1] between two sentence embeddings, which is used to calculate cosine similarity precision and recall.

Parameters
- gen (list) – A list of generated word ids.
- reference (list) – A list of word ids of a reference.

Here is an example:

>>> gen = [4,5]
>>> reference = [5,6]
>>> self._score(gen, reference)
0.135  # assuming self.mode = 'avg'
close() → Dict[str, Any]¶
Return a dict which contains:
- res_prefix precision: average precision.
- res_prefix recall: average recall.
- res_prefix hashvalue: hash value for the precision & recall metric; the same hash value stands for the same evaluation settings.
 
forward(data)¶
Process a batch of data.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[candidates_allvocabs_key] (list, numpy.ndarray): A 3-d jagged list of int. Multiple reference sentences for a single context. Does not contain the start token (e.g. <go>) or the end token (e.g. <eos>). Size: [batch_size, ~sentence_num, ~word_num], where “~” means different sizes in this dimension are allowed.
  - data[multiple_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, generated_num_per_context, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     candidates_allvocabs_key: [[[4], [5,6]], [[4,5,6]]],
...     multiple_gen_key: [[[5,6,3]], [[4,5,7,3], [8,3]]]
... }
 
 
NgramFwBwPerplexityMetric¶
class cotk.metric.NgramFwBwPerplexityMetric(dataloader, reference_test_list, ngram=4, *, tokenizer=None, gen_key='gen', sample=10000, seed=1229, cpu_count=None)[source]¶
Metric for calculating n-gram forward perplexity and backward perplexity.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
- reference_test_list (list) – Reference sentences (with all vocabs) from the test data.
- ngram (int, optional) – The order of n-grams used to calculate perplexity. Default: 4.
- tokenizer (None, dataloader.Tokenizer, str, optional) – Specifies the tokenizer used in the metric. Default: None.
- gen_key (str, optional) – The key of generated sentences. Default: gen.
- sample (int, optional) – Number of examples sampled from the generated sentences. Default: 10000.
- seed (int, optional) – Random seed for sampling. Default: 1229.
- cpu_count (int, optional) – Number of CPUs used for multiprocessing. Multiprocessing will NOT be used when cpu_count is set to 1 or the dataset is small. Default: if None, the environment variable CPU_COUNT is used when available; otherwise all available CPUs are used.
 
Here is an example (shown to illustrate the output format rather than exact values):

>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> gen_key = "gen"
>>> metric = cotk.metric.NgramFwBwPerplexityMetric(dl,
...     dl.get_all_batch('test')['session'][0].tolist(), 2, gen_key=gen_key)
>>> data = {
...     gen_key: [[10, 1028, 479, 285, 220, 3], [851, 17, 2451, 3]]
...     # gen_key: [["I", "love", "java", "very", "much", "<eos>"], ["python", "is", "excellent", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'fwppl': 51.44751843841384, 'bwppl': 138.954327895075,
 'fwppl hashvalue': '2ea52377084692953f602e4ebad23e8a46e1c4bb527947d29a03c14b426efe67',
 'bwppl hashvalue': '2ea52377084692953f602e4ebad23e8a46e1c4bb527947d29a03c14b426efe67'}
forward(data)[source]¶
Process a batch of data.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
 
 
 
Metric-like class¶
SingleTurnDialogRecorder¶
class cotk.metric.SingleTurnDialogRecorder(dataloader, post_allvocabs_key='post_allvocabs', resp_allvocabs_key='resp_allvocabs', gen_key='gen')[source]¶
A metric-like class for recording generated sentences and references.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
- post_allvocabs_key (str, optional) – The key of dialog posts with allvocabs. Default: post_allvocabs.
- resp_allvocabs_key (str, optional) – The key of dialog responses with allvocabs. Default: resp_allvocabs.
- gen_key (str, optional) – The key of generated sentences. Default: gen.
 
Here is an example:

>>> post_allvocabs_key = "post_allvocabs"
>>> resp_allvocabs_key = "resp_allvocabs"
>>> gen_key = "gen"
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> metric = cotk.metric.SingleTurnDialogRecorder(dl,
...     post_allvocabs_key=post_allvocabs_key,
...     resp_allvocabs_key=resp_allvocabs_key,
...     gen_key=gen_key)
>>> data = {
...     post_allvocabs_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # post_allvocabs_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
...     resp_allvocabs_key: [[2, 10, 1214, 479, 3], [2, 851, 17, 2451, 3]],
...     # resp_allvocabs_key: [["<go>", "I", "prefer", "java", "<eos>"], ["<go>", "python", "is", "excellent", "<eos>"]],
...     gen_key: [[10, 64, 2019, 3], [851, 17, 4124, 3]],
...     # gen_key: [["I", "like", "PHP", "<eos>"], ["python", "is", "powerful", "<eos>"]]
... }
>>> metric.forward(data)
>>> metric.close()
{'post': [['I', 'like', 'python'], ['I', 'use', 'python']],
 'resp': [['I', 'prefer', 'java'], ['python', 'is', 'excellent']],
 'gen': [['I', 'like', 'PHP'], ['python', 'is', 'powerful']]}
forward(data)[source]¶
Process a batch of data.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[post_allvocabs_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Dialog posts with allvocabs in index form. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~ref_sentence_length], where “~” means different sizes in this dimension are allowed.
  - data[resp_allvocabs_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Dialog responses with allvocabs in index form. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~ref_sentence_length], where “~” means different sizes in this dimension are allowed.
  - data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     post_allvocabs_key: [[2,4,3], [2,5,6,3]],
...     resp_allvocabs_key: [[2,5,4,3], [2,6,3]],
...     gen_key: [[6,7,8,3], [4,5,3]]
... }
 
close() → Dict[str, Any][source]¶
Return a dict which contains:
- post: a list of post sentences. A jagged 2-d array of int. Size: [batch_size, ~sent_length], where “~” means different sizes in this dimension are allowed.
- resp: a list of response sentences. A jagged 2-d array of int. Size: [batch_size, ~sent_length], where “~” means different sizes in this dimension are allowed.
- gen: a list of generated sentences. A jagged 2-d array of int. Size: [batch_size, ~sent_length], where “~” means different sizes in this dimension are allowed.
 
 
MultiTurnDialogRecorder¶
class cotk.metric.MultiTurnDialogRecorder(dataloader, multi_turn_reference_allvocabs_key='multi_turn_ref_allvocabs', multi_turn_gen_key='multi_turn_gen', turn_len_key='turn_length')[source]¶
A metric-like class for recording generated sentences and references.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Session) – A language generation dataloader.
- multi_turn_reference_allvocabs_key (str, optional) – The key of dialog references with allvocabs. Default: multi_turn_ref_allvocabs.
- multi_turn_gen_key (str, optional) – The key of generated sentences. Default: multi_turn_gen.
- turn_len_key (str, optional) – The key of the number of turns. Default: turn_length.
 
Here is an example:

>>> multi_turn_reference_allvocabs_key = "multi_turn_ref_allvocabs"
>>> multi_turn_gen_key = "multi_turn_gen"
>>> turn_len_key = "turn_length"
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> metric = cotk.metric.MultiTurnDialogRecorder(dl,
...     multi_turn_reference_allvocabs_key=multi_turn_reference_allvocabs_key,
...     multi_turn_gen_key=multi_turn_gen_key,
...     turn_len_key=turn_len_key)
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2, 10, 64, 851, 3], [2, 10, 64, 479, 3]], [[2, 10, 64, 279, 1460, 3]]],
...     # multi_turn_reference_allvocabs_key = [[["<go>", "I", "like", "python", "<eos>"],
...     #   ["<go>", "I", "like", "java", "<eos>"]],
...     #   [["<go>", "I", "like", "machine", "learning", "<eos>"]]]
...     turn_len_key: [2, 1],
...     # turn_len_key: [len(multi_turn_reference_allvocabs_key[0]), len(multi_turn_reference_allvocabs_key[1])]
...     multi_turn_gen_key: [[[851, 17, 2451, 3], [2019, 17, 393, 3]], [[10, 64, 34058, 805, 2601, 3]]]
...     # multi_turn_gen_key = [[["python", "is", "excellent", "<eos>"], ["PHP", "is", "best", "<eos>"]],
...     #   [["I", "like", "natural", "language", "processing", "<eos>"]]]
... }
>>> metric.forward(data)
>>> metric.close()
{'reference': [[['I', 'like', 'python'], ['I', 'like', 'java']], [['I', 'like', 'machine', 'learning']]],
 'gen': [[['python', 'is', 'excellent'], ['PHP', 'is', 'best']], [['I', 'like', 'natural', 'language', 'processing']]]}
forward(data)[source]¶
Process a batch of data.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[multi_turn_reference_allvocabs_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Multi-turn reference sentences with all vocabs. Contains the start token (e.g. <go>) and the end token (e.g. <eos>). Size: [batch_size, ~turn_length, ~sentence_length], where “~” means different sizes in this dimension are allowed.
  - data[multi_turn_gen_key] (list, numpy.ndarray): A 3-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~max_turn_length, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.
  - data[turn_len_key] (list, numpy.ndarray): Number of turns in each sample. Size: [batch_size].

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     multi_turn_reference_allvocabs_key: [[[2,6,7,3], [2,5,3]], [[2,7,6,8,3]]],
...     multi_turn_gen_key: [[[6,7,8,3], [4,5,3]], [[7,3]]],
...     turn_len_key: [2,1]
... }
 
close() → Dict[str, Any][source]¶
Return a dict which contains:
- reference: a list of reference sentences. A jagged 3-d array of int. Size: [batch_size, ~turn_length, ~sent_length], where “~” means different sizes in this dimension are allowed.
- gen: a list of generated sentences. A jagged 3-d array of int. Size: [batch_size, ~turn_length, ~sent_length], where “~” means different sizes in this dimension are allowed.
 
 
LanguageGenerationRecorder¶
class cotk.metric.LanguageGenerationRecorder(dataloader, gen_key='gen')[source]¶
A metric-like class for recording generated sentences.

Parameters
- dataloader (dataloader.LanguageProcessing, dataloader.Sentence, dataloader.Session) – A language generation dataloader.
- gen_key (str, optional) – The key of generated sentences. Default: gen.
 
Here is an example:

>>> gen_key = "gen_key"
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> metric = cotk.metric.LanguageGenerationRecorder(dl, gen_key=gen_key)
>>> data = {
...     gen_key: [[2, 10, 64, 851, 3], [2, 10, 48, 851, 3]],
...     # gen_key: [["<go>", "I", "like", "python", "<eos>"], ["<go>", "I", "use", "python", "<eos>"]],
... }
>>> metric.forward(data)
>>> metric.close()
{'gen': [['<go>', 'I', 'like', 'python'], ['<go>', 'I', 'use', 'python']]}
forward(data)[source]¶
Process a batch of data.

Parameters
- data (dict) – A dict that at least contains the following keys:
  - data[gen_key] (list, numpy.ndarray): A 2-d jagged or padded array of int. Sentences generated by the model. Contains the end token (e.g. <eos>) but not the start token (e.g. <go>). Size: [batch_size, ~gen_sentence_length], where “~” means different sizes in this dimension are allowed.

Here is an example for data:

>>> # all_vocab_list = ["<pad>", "<unk>", "<go>", "<eos>", "I", "have",
>>> #   "been", "to", "China"]
>>> data = {
...     gen_key: [[6,7,8,3], [4,5,3]]
... }
 
 
MetricChain¶
class cotk.metric.MetricChain[source]¶
A metric-like class for stacking metrics. You can use this class to combine multiple metrics and operate them as one.

Examples

>>> metric = MetricChain()
>>> metric.add_metric(BleuCorpusMetric(dataloader))
>>> metric.add_metric(SingleTurnDialogRecorder(dataloader))

Todo: give more examples of combining forward and close.

add_metric(metric)[source]¶
Add a metric for processing.

Parameters
- metric (metric.MetricBase) – A metric object.
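As a sketch of combining forward and close through a chain (key names follow the earlier examples; predict stands for a hypothetical model call):
>>> dl = cotk.dataloader.UbuntuCorpus('resources://Ubuntu_small')
>>> metric = cotk.metric.MetricChain()
>>> metric.add_metric(cotk.metric.BleuCorpusMetric(dl, gen_key="gen"))
>>> metric.add_metric(cotk.metric.SingleTurnDialogRecorder(dl, gen_key="gen"))
>>> for data in dl.get_batches("test", batch_size=32):
...     data["gen"] = predict(data["post"])  # hypothetical model call
...     metric.forward(data)                 # every metric in the chain sees the batch
>>> result = metric.close()
>>> # result merges the dicts returned by all chained metrics, e.g.
>>> # {'bleu': ..., 'bleu hashvalue': ..., 'post': [...], 'resp': [...], 'gen': [...]}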
 
 