Resources¶
File ID¶
Classes like dataloader.Dataloader usually need a file_id to locate resources. The string is passed to file_utils.file_utils.get_resource_file_path() or file_utils.file_utils.import_local_resources().
The format of file_id is name[@source][#processor], where source and processor are optional.
name can be:
- A string starting with "resources://", indicating a predefined resource.
- A string starting with "https://", indicating an online resource.
- A string indicating a local path, absolute or relative to the current working directory.
source only works when name indicates a predefined resource, and specifies where to download the dataset. See the section of a specific resource below for all available sources. If not specified, a default source is chosen (the first one listed under source).
processor is required when name is not a predefined resource. It has to be one of the subclasses of file_utils.file_utils.ResourceProcessor.
A usage sketch follows the examples table below.
Examples:
file_id | name | source | processor
---|---|---|---
resources://MSCOCO | MSCOCO | default (amazon) | default (MSCOCOResourceProcessor)
resources://MSCOCO@tsinghua | MSCOCO | tsinghua | default (MSCOCOResourceProcessor)
https://cotk-data.s3-ap-northeast-1.amazonaws.com/mscoco.zip#MSCOCO | MSCOCO | None | MSCOCOResourceProcessor
./mscoco.zip#MSCOCO | MSCOCO | None | MSCOCOResourceProcessor
Download / Import Resources Manually¶
If you want to use CoTK resources in an offline environment, you can download resources and import them into cotk manually.
The download URLs are listed under ./cotk/resource_config; see Resource Configs. The following command imports a local resource file into the cache:
cotk import <file_id> <file_path>
where file_id should start with resources://, and file_path is the path to the local resource. For example:
cotk import resources://MSCOCO ./MSCOCO.zip
More CLI commands here.
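The same import can presumably be done from Python via file_utils.file_utils.import_local_resources() mentioned above. The sketch below assumes the argument order mirrors the CLI (file_id first, then the local path); check the function's actual signature before relying on it.

```python
# Sketch only: argument order (file_id, local_path) is assumed to
# mirror `cotk import <file_id> <file_path>`.
from cotk.file_utils.file_utils import import_local_resources

import_local_resources("resources://MSCOCO", "./MSCOCO.zip")
```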
Predefined Resources¶
Word Vector¶
Glove50d¶
name: Glove50d
processor: _utils.resource_processor.Glove50dResourceProcessor
source: stanford
usage: wordvector.Glove
- Introduction
This is a file containing vector representations of words pretrained by GloVe, an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
This file is the 50-dimension version of glove.6B, which is trained on Wikipedia 2014 and Gigaword 5.
- Statistic
Property | Value
---|---
Dimension | 50
Vocabulary Size | 400,000
- Reference
[1] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/
[2] Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014: 1532-1543.
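As a rough sketch of how this resource is used with wordvector.Glove, the constructor below is assumed to accept a file_id string; the rest of the class's API is not covered here, so consult the wordvector.Glove documentation.

```python
from cotk.wordvector import Glove

# Download (on first use) and cache the pretrained 50d GloVe vectors.
# The constructor is assumed to take a file_id string; an explicit
# source can be selected with "resources://Glove50d@stanford".
wordvec = Glove("resources://Glove50d")
```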
Glove100d¶
name: Glove100d
processor: _utils.resource_processor.Glove100dResourceProcessor
source: stanford
usage: wordvector.Glove
- Introduction
This is a file containing vector representations of words pretrained by GloVe.
This file is the 100-dimension version of glove.6B. Refer to Glove50d for more information and references.
- Statistic
Property | Value
---|---
Dimension | 100
Vocabulary Size | 400,000
Glove200d¶
name: Glove200d
processor: _utils.resource_processor.Glove200dResourceProcessor
source: stanford
usage: wordvector.Glove
- Introduction
This is a file containing vector representations of words pretrained by GloVe.
This file is the 200-dimension version of glove.6B. Refer to Glove50d for more information and references.
- Statistic
Property | Value
---|---
Dimension | 200
Vocabulary Size | 400,000
Glove300d¶
name: Glove300d
processor: _utils.resource_processor.Glove300dResourceProcessor
source: stanford
usage: wordvector.Glove
- Introduction
This is a file containing vector representations of words pretrained by GloVe.
This file is the 300-dimension version of glove.6B. Refer to Glove50d for more information and references.
- Statistic
Property | Value
---|---
Dimension | 300
Vocabulary Size | 400,000
Glove50d_small¶
name: Glove50d_small
processor: _utils.resource_processor.Glove50dResourceProcessor
source: amazon
usage: wordvector.Glove
- Introduction
This is a file containing vector representations of words pretrained by GloVe.
This file is the 50-dimension version of glove.6B and only contains the 40,000 most frequent words. Refer to Glove50d for more information and references.
- Statistic
Property | Value
---|---
Dimension | 50
Vocabulary Size | 40,000
Datasets¶
MSCOCO¶
name: MSCOCO
processor: _utils.resource_processor.MSCOCOResourceProcessor
source: amazon, tsinghua
usage: dataloader.MSCOCO
- Introduction
MSCOCO is a dataset gathering images of complex everyday scenes containing common objects in their natural context. We ignore the images and only use the corresponding captions.
The original data is the 2017 Train/Val annotations [241MB]. We use the same train set as the original data, but split the val set into a dev set (odd-numbered sentences) and a test set (even-numbered sentences). We extract the captions and use nltk.tokenize.word_tokenize for tokenization. We also capitalize each sentence and add a full stop if it does not have one.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Sentences | 591,753 | 12,507 | 12,507
Minimum Sentence Length | 8 | 10 | 10
Maximum Sentence Length | 50 | 48 | 50
Average Sentence Length | 13.55 | 13.55 | 12.52
Std. Deviation of Sentence Length | 2.51 | 2.44 | 2.44
Vocabulary Size | 33,241 | 5,950 | 5,869
Frequent Vocabulary Size (frequency >= 10) | 8,064 | 1,050 | 1,034
- Reference
[1] COCO: Common Objects in Context. http://cocodataset.org
[2] Chen X, Fang H, Lin T Y, et al. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv:1504.00325, 2015.
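A minimal sketch of loading this dataset with dataloader.MSCOCO is shown below; it assumes the constructor accepts a file_id as its first argument and leaves all other hyperparameters (vocabulary cutoffs, tokenization settings) at their defaults, so treat anything beyond the file_id as an assumption.

```python
from cotk.dataloader import MSCOCO

# Download from the default source (amazon), process with
# MSCOCOResourceProcessor, and build vocabularies with default settings.
loader = MSCOCO("resources://MSCOCO")

# To download from the tsinghua mirror instead:
# loader = MSCOCO("resources://MSCOCO@tsinghua")
```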
MSCOCO_small¶
name: MSCOCO_small
processor: _utils.resource_processor.MSCOCOResourceProcessor
source: amazon
usage: dataloader.MSCOCO
OpenSubtitles¶
name: OpenSubtitles
processor: _utils.resource_processor.OpenSubtitlesResourceProcessor
source: amazon, tsinghua
usage: dataloader.OpenSubtitles
- Introduction
The OpenSubtitles dataset is collected from movie subtitles. To construct this dataset for single-turn dialogue generation, we follow [3] and regard a pair of adjacent sentences as one dialogue turn. We treat the former sentence as a post and the latter as the corresponding response. We remove pairs whose lengths are extremely long or short, and randomly sample 1,144,949/20,000/10,000 pairs as our train/dev/test sets. The dataset is tokenized and all tokens are lowercase.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Post/Response Pairs | 1,144,949 | 20,000 | 10,000
Average Length (Post/Response) | 9.08/9.10 | 9.06/9.13 | 9.04/9.05
Vocabulary Size | 104,683 | 18,653 | 12,769
Frequent Vocabulary Size (frequency >= 10) | 31,039 | 2,030 | 1,241
- Reference
[1] OpenSubtitles. http://opus.nlpl.eu/OpenSubtitles.php
[2] J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
[3] Vinyals O, Le Q. A Neural Conversational Model. arXiv preprint arXiv:1506.05869, 2015.
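A minimal sketch of loading this dataset with dataloader.OpenSubtitles follows, again assuming the constructor accepts a file_id string and that other hyperparameters keep their defaults.

```python
from cotk.dataloader import OpenSubtitles

# Single-turn post/response pairs, downloaded from the default source (amazon).
loader = OpenSubtitles("resources://OpenSubtitles")

# The smaller sampled version is loaded the same way:
# loader = OpenSubtitles("resources://OpenSubtitles_small")
```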
OpenSubtitles_small¶
name: OpenSubtitles_small
processor: _utils.resource_processor.OpenSubtitlesResourceProcessor
source: amazon
usage: dataloader.OpenSubtitles
- Introduction
The data is uniformly sampled at random from OpenSubtitles.
Refer to OpenSubtitles for more information and references.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Post/Response Pairs | 11,449 | 2,000 | 1,000
Vocabulary Size | 13,984 | 4,661 | 3,079
Frequent Vocabulary Size (frequency >= 10) | 1,333 | 367 | 211
SST¶
name: SST
processor: _utils.resource_processor.SSTResourceProcessor
source: stanford
usage: dataloader.SST
- Introduction
The Stanford Sentiment Treebank is the first corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language.
We retain the original split of the dataset. The dataset is tokenized and contains capitalized letters.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Sentences | 8,544 | 1,101 | 2,210
Average Length | 19.14 | 19.32 | 19.19
Vocabulary Size | 16,586 | 5,043 | 7,934
Frequent Vocabulary Size (frequency >= 10) | 1,685 | 232 | 441
- Reference
[1] Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. https://nlp.stanford.edu/sentiment/
[2] Socher R, Perelygin A, Wu J, et al. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013: 1631-1642.
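A minimal sketch of loading this dataset with dataloader.SST, assuming the constructor accepts a file_id string:

```python
from cotk.dataloader import SST

# Sentiment-labeled sentences with the original train/dev/test split,
# downloaded from the stanford source by default.
loader = SST("resources://SST")
```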
SwitchboardCorpus¶
name: SwitchboardCorpus
processor: _utils.resource_processor.SwitchboardCorpusResourceProcessor
source: amazon
usage: dataloader.SwitchboardCorpus
- Introduction
Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven robot operator system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.
We use the data processed by Zhao et al. [4], who constructed the multi_ref set for one-to-many dialog evaluation (one context, multiple responses). multi_ref was constructed by extracting multiple responses for a single context with a retrieval method and annotation on the other test set. For details, please refer to [4]. We use their training, dev and test sets; that is, capitalization, tokenization, dataset splits and all other aspects of our data are the same as theirs (in their version, utterances are lowercase and not tokenized).
However, there are two differences between our data and theirs:
- We ensure that any two consecutive utterances come from different speakers, by concatenating consecutive utterances from the same speaker during pre-processing in dataloader.SwitchboardCorpus. (This keeps the data compatible with other multi-turn dialog sets.)
- To avoid a gap between training and test, we have to remove some samples from multi_ref where the target speaker is the same as the last speaker in the context.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Sessions | 2,316 | 60 | 62
Minimum Number of Turns (per session) | 3 | 19 | 8
Maximum Number of Turns (per session) | 190 | 144 | 148
Average Number of Turns (per session) | 59.47 | 58.92 | 58.95
Std. Deviation of Number of Turns | 27.50 | 26.91 | 32.43
Minimum Sentence Length | 3 | 3 | 3
Maximum Sentence Length | 401 | 185 | 333
Average Sentence Length | 19.03 | 19.12 | 20.15
Std. Deviation of Sentence Length | 20.25 | 19.65 | 21.59
Vocabulary Size | 24,719 | 7,704 | 6,203
Frequent Vocabulary Size (frequency >= 10) | 9,508 | 1,021 | 1,095
- Reference
[1] Switchboard-1 release 2. https://catalog.ldc.upenn.edu/LDC97S62
[2] Godfrey J J, Holliman E. Switchboard-1 Release 2. Linguistic Data Consortium, Philadelphia, 1997.
[3] NeuralDialog-CVAE. https://github.com/snakeztc/NeuralDialog-CVAE
[4] Zhao, Tiancheng and Zhao, Ran and Eskenazi, Maxine. Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. ACL 2017.
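A minimal sketch of loading this dataset with dataloader.SwitchboardCorpus, assuming the constructor accepts a file_id string; how the multi_ref set described above is exposed by the class is not covered here.

```python
from cotk.dataloader import SwitchboardCorpus

# Multi-turn dialog sessions; consecutive utterances from the same speaker
# are concatenated during pre-processing, as described above.
loader = SwitchboardCorpus("resources://SwitchboardCorpus")
```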
SwitchboardCorpus_small¶
name: SwitchboardCorpus_small
processor: _utils.resource_processor.SwitchboardCorpusResourceProcessor
source: amazon
usage: dataloader.SwitchboardCorpus
- Introduction
The data is uniformly sampled at random from SwitchboardCorpus.
Refer to SwitchboardCorpus for details and references.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Sessions | 463 | 12 | 12
Vocabulary Size | 13,235 | 5,189 | 4,729
Frequent Vocabulary Size (frequency >= 10) | 3,923 | 341 | 317
Ubuntu¶
name: Ubuntu
processor: _utils.resource_processor.UbuntuResourceProcessor
source: amazon, tsinghua
usage: dataloader.UbuntuCorpus
- Introduction
Ubuntu Dialogue Corpus 2.0 is a dataset containing a large number of multi-turn dialogues. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets, and the unstructured nature of interactions from microblog services such as Twitter.
We build the dataset using the ubuntu-ranking-dataset-creator [1], without tokenization, lemmatization or stemming. The dataset contains capitalized characters. The positive example probability is set to 1.0 for the training set. The numbers of examples in the training/dev/test sets are the defaults of 1,000,000/19,560/18,920, and other settings are also left at their defaults.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Sessions | 1,000,000 | 19,560 | 18,920
Minimum Number of Turns (per session) | 3 | 3 | 3
Maximum Number of Turns (per session) | 19 | 19 | 19
Average Number of Turns (per session) | 4.95 | 4.79 | 4.85
Std. Deviation of Number of Turns | 2.97 | 2.79 | 2.85
Minimum Sentence Length | 2 | 2 | 2
Maximum Sentence Length | 977 | 343 | 817
Average Sentence Length | 17.98 | 19.40 | 19.61
Std. Deviation of Sentence Length | 16.26 | 17.25 | 17.94
Vocabulary Size | 576,693 | 47,288 | 47,847
Frequent Vocabulary Size (frequency >= 10) | 23,498 | 2,568 | 2,565
- Reference
[1] Ubuntu Dialogue Corpus v2.0. https://github.com/rkadlec/ubuntu-ranking-dataset-creator
[2] R. Lowe, N. Pow, I. Serban, and J. Pineau. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Special Interest Group on Discourse and Dialogue (SIGDIAL), 2015a.
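A minimal sketch of loading this dataset with dataloader.UbuntuCorpus, assuming the constructor accepts a file_id string (note that the resource name is Ubuntu, not UbuntuCorpus):

```python
from cotk.dataloader import UbuntuCorpus

# Multi-turn Ubuntu Dialogue Corpus 2.0 sessions, downloaded from the
# default source (amazon); use "resources://Ubuntu@tsinghua" for the mirror.
loader = UbuntuCorpus("resources://Ubuntu")
```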
Ubuntu_small¶
name: Ubuntu_small
processor: _utils.resource_processor.UbuntuResourceProcessor
source: amazon
usage: dataloader.UbuntuCorpus