Resources¶
File ID¶
Classes like dataloader.Dataloader usually need a file_id to locate resources. The string is passed to file_utils.file_utils.get_resource_file_path() or file_utils.file_utils.import_local_resources().
The format of file_id is name[@source][#processor], where source and processor are optional.
name can be:
- A string starting with "resources://", indicating a predefined resource.
- A string starting with "https://", indicating an online resource.
- A string indicating a local path, absolute or relative to the current working directory.
source only works when name indicates a predefined resource, and specifies where to download the dataset. See the section of a specific resource below for all available sources. If not specified, a default source is chosen (the first one listed under source).
processor is required when name is not a predefined resource. It has to be one of the subclasses of file_utils.file_utils.ResourceProcessor.
A usage sketch follows the examples table below.
Examples:
file_id | name | source | processor
---|---|---|---
resources://MSCOCO | MSCOCO | default (amazon) | default (MSCOCOResourceProcessor)
resources://MSCOCO@tsinghua | MSCOCO | tsinghua | default (MSCOCOResourceProcessor)
https://cotk-data.s3-ap-northeast-1.amazonaws.com/mscoco.zip#MSCOCO | MSCOCO | None | MSCOCOResourceProcessor
./mscoco.zip#MSCOCO | MSCOCO | None | MSCOCOResourceProcessor
Download / Import Resources Manually¶
If you want to use CoTK resources in an offline environment, you can download resources and import them into cotk manually.
The download URLs are listed under ./cotk/resource_config; see Resource Configs. The following command imports a local resource file into the cache:
cotk import <file_id> <file_path>
where file_id should start with resources://, and file_path is the path to the local resource. For example:
cotk import resources://MSCOCO ./MSCOCO.zip
More CLI commands here.
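The same import can presumably be done from Python via file_utils.file_utils.import_local_resources() mentioned above. The sketch below assumes the argument order mirrors the CLI (file_id first, then the local path); check the function's actual signature before relying on it.

```python
# Sketch only: argument order (file_id, local_path) is assumed to
# mirror `cotk import <file_id> <file_path>`.
from cotk.file_utils.file_utils import import_local_resources

import_local_resources("resources://MSCOCO", "./MSCOCO.zip")
```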
Predefined Resources¶
Word Vector¶
Glove50d¶
name: Glove50d
processor: _utils.resource_processor.Glove50dResourceProcessor
source: stanford
usage: wordvector.Glove
- Introduction
This is a file containing vector representations of words pretrained by GloVe, an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
This file is the 50-dimension version of glove.6B, which is trained on Wikipedia 2014 and Gigaword 5.
- Statistic
Property | Value
---|---
Dimension | 50
Vocabulary Size | 400,000
- Reference
[1] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/
[2] Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014: 1532-1543.
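As a rough sketch of how this resource is used with wordvector.Glove, the constructor below is assumed to accept a file_id string; the rest of the class's API is not covered here, so consult the wordvector.Glove documentation.

```python
from cotk.wordvector import Glove

# Download (on first use) and cache the pretrained 50d GloVe vectors.
# The constructor is assumed to take a file_id string; an explicit
# source can be selected with "resources://Glove50d@stanford".
wordvec = Glove("resources://Glove50d")
```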
Glove100d¶
name: Glove100d
processor: _utils.resource_processor.Glove100dResourceProcessor
source: stanford
usage: wordvector.Glove
- Introduction
This is a file containing vector representations of words pretrained by GloVe.
This file is the 100-dimension version of glove.6B. Refer to Glove50d for more information and references.
- Statistic
Property | Value
---|---
Dimension | 100
Vocabulary Size | 400,000
Glove200d¶
name: Glove200d
processor: _utils.resource_processor.Glove200dResourceProcessor
source: stanford
usage: wordvector.Glove
- Introduction
This is a file containing vector representations of words pretrained by GloVe.
This file is the 200-dimension version of glove.6B. Refer to Glove50d for more information and references.
- Statistic
Property | Value
---|---
Dimension | 200
Vocabulary Size | 400,000
Glove300d¶
name: Glove300d
processor: _utils.resource_processor.Glove300dResourceProcessor
source: stanford
usage: wordvector.Glove
- Introduction
This is a file containing vector representations of words pretrained by GloVe.
This file is the 300-dimension version of glove.6B. Refer to Glove50d for more information and references.
- Statistic
Property | Value
---|---
Dimension | 300
Vocabulary Size | 400,000
Glove50d_small¶
name: Glove50d_small
processor: _utils.resource_processor.Glove50dResourceProcessor
source: amazon
usage: wordvector.Glove
- Introduction
This is a file containing vector representations of words pretrained by GloVe.
This file is the 50-dimension version of glove.6B and only contains the 40,000 most frequent words. Refer to Glove50d for more information and references.
- Statistic
Property | Value
---|---
Dimension | 50
Vocabulary Size | 40,000
Datasets¶
MSCOCO¶
name: MSCOCO
processor: _utils.resource_processor.MSCOCOResourceProcessor
source: amazon, tsinghua
usage: dataloader.MSCOCO
- Introduction
MSCOCO is a dataset gathering images of complex everyday scenes containing common objects in their natural context. We ignore the images and only use the corresponding captions.
The original data is the 2017 Train/Val annotations [241MB]. We use the same train set as the original data, but split the val set into a dev set (odd-numbered sentences) and a test set (even-numbered sentences). We extract the captions and use nltk.tokenize.word_tokenize for tokenization. We also capitalize each sentence and add a full stop if it does not have one.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Sentences | 591,753 | 12,507 | 12,507
Minimum Sentence Length | 8 | 10 | 10
Maximum Sentence Length | 50 | 48 | 50
Average Sentence Length | 13.55 | 13.55 | 12.52
Std. Deviation of Sentence Length | 2.51 | 2.44 | 2.44
Vocabulary Size | 33,241 | 5,950 | 5,869
Frequent Vocabulary Size (frequency >= 10) | 8,064 | 1,050 | 1,034
- Reference
[1] COCO: Common Objects in Context. http://cocodataset.org
[2] Chen X, Fang H, Lin T Y, et al. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv:1504.00325, 2015.
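A minimal sketch of loading this dataset with dataloader.MSCOCO is shown below; it assumes the constructor accepts a file_id as its first argument and leaves all other hyperparameters (vocabulary cutoffs, tokenization settings) at their defaults, so treat anything beyond the file_id as an assumption.

```python
from cotk.dataloader import MSCOCO

# Download from the default source (amazon), process with
# MSCOCOResourceProcessor, and build vocabularies with default settings.
loader = MSCOCO("resources://MSCOCO")

# To download from the tsinghua mirror instead:
# loader = MSCOCO("resources://MSCOCO@tsinghua")
```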
MSCOCO_small¶
name: MSCOCO_small
processor: _utils.resource_processor.MSCOCOResourceProcessor
source: amazon
usage: dataloader.MSCOCO
OpenSubtitles¶
name: OpenSubtitles
processor: _utils.resource_processor.OpenSubtitlesResourceProcessor
source: amazon, tsinghua
usage: dataloader.OpenSubtitles
- Introduction
The OpenSubtitles dataset is collected from movie subtitles. To construct this dataset for single-turn dialogue generation, we follow [3] and regard a pair of adjacent sentences as one dialogue turn. We treat the former sentence as a post and the latter as the corresponding response. We remove pairs whose lengths are extremely long or short, and randomly sample 1,144,949/20,000/10,000 pairs as our train/dev/test sets. The dataset is tokenized and all tokens are lowercase.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Post/Response Pairs | 1,144,949 | 20,000 | 10,000
Average Length (Post/Response) | 9.08/9.10 | 9.06/9.13 | 9.04/9.05
Vocabulary Size | 104,683 | 18,653 | 12,769
Frequent Vocabulary Size (frequency >= 10) | 31,039 | 2,030 | 1,241
- Reference
[1] OpenSubtitles. http://opus.nlpl.eu/OpenSubtitles.php
[2] J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
[3] Vinyals O, Le Q. A Neural Conversational Model. arXiv preprint arXiv:1506.05869, 2015.
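A minimal sketch of loading this dataset with dataloader.OpenSubtitles follows, again assuming the constructor accepts a file_id string and that other hyperparameters keep their defaults.

```python
from cotk.dataloader import OpenSubtitles

# Single-turn post/response pairs, downloaded from the default source (amazon).
loader = OpenSubtitles("resources://OpenSubtitles")

# The smaller sampled version is loaded the same way:
# loader = OpenSubtitles("resources://OpenSubtitles_small")
```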
OpenSubtitles_small¶
name: OpenSubtitles_small
processor: _utils.resource_processor.OpenSubtitlesResourceProcessor
source: amazon
usage: dataloader.OpenSubtitles
- Introduction
The data is uniformly sampled at random from OpenSubtitles.
Refer to OpenSubtitles for more information and references.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Post/Response Pairs | 11,449 | 2,000 | 1,000
Vocabulary Size | 13,984 | 4,661 | 3,079
Frequent Vocabulary Size (frequency >= 10) | 1,333 | 367 | 211
SST¶
name: SST
processor: _utils.resource_processor.SSTResourceProcessor
source: stanford
usage: dataloader.SST
- Introduction
The Stanford Sentiment Treebank is the first corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language.
We retain the original split of the dataset. The dataset is tokenized and contains capitalized letters.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Sentences | 8,544 | 1,101 | 2,210
Average Length | 19.14 | 19.32 | 19.19
Vocabulary Size | 16,586 | 5,043 | 7,934
Frequent Vocabulary Size (frequency >= 10) | 1,685 | 232 | 441
- Reference
[1] Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. https://nlp.stanford.edu/sentiment/
[2] Socher R, Perelygin A, Wu J, et al. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013: 1631-1642.
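A minimal sketch of loading this dataset with dataloader.SST, assuming the constructor accepts a file_id string:

```python
from cotk.dataloader import SST

# Sentiment-labeled sentences with the original train/dev/test split,
# downloaded from the stanford source by default.
loader = SST("resources://SST")
```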
SwitchboardCorpus¶
name: SwitchboardCorpus
processor: _utils.resource_processor.SwitchboardCorpusResourceProcessor
source: amazon
usage: dataloader.SwitchboardCorpus
- Introduction
Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven robot operator system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.
We use the data processed by Zhao et al. [4], who constructed the multi_ref set for one-to-many dialog evaluation (one context, multiple responses). multi_ref was constructed by extracting multiple responses for a single context with a retrieval method and annotation on the other test set. For details, please refer to [4]. We use their training, dev and test sets; that is, capitalization, tokenization, dataset splits and all other aspects of our data are the same as theirs (in their version, utterances are lowercase and not tokenized).
However, there are two differences between our data and theirs:
- We ensure that any two consecutive utterances come from different speakers, by concatenating consecutive utterances from the same speaker during pre-processing in dataloader.SwitchboardCorpus. (This keeps the data compatible with other multi-turn dialog sets.)
- To avoid a gap between training and test, we have to remove some samples from multi_ref where the target speaker is the same as the last speaker in the context.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Sessions | 2,316 | 60 | 62
Minimum Number of Turns (per session) | 3 | 19 | 8
Maximum Number of Turns (per session) | 190 | 144 | 148
Average Number of Turns (per session) | 59.47 | 58.92 | 58.95
Std. Deviation of Number of Turns | 27.50 | 26.91 | 32.43
Minimum Sentence Length | 3 | 3 | 3
Maximum Sentence Length | 401 | 185 | 333
Average Sentence Length | 19.03 | 19.12 | 20.15
Std. Deviation of Sentence Length | 20.25 | 19.65 | 21.59
Vocabulary Size | 24,719 | 7,704 | 6,203
Frequent Vocabulary Size (frequency >= 10) | 9,508 | 1,021 | 1,095
- Reference
[1] Switchboard-1 release 2. https://catalog.ldc.upenn.edu/LDC97S62
[2] Godfrey J J, Holliman E. Switchboard-1 Release 2. Linguistic Data Consortium, Philadelphia, 1997.
[3] NeuralDialog-CVAE. https://github.com/snakeztc/NeuralDialog-CVAE
[4] Zhao, Tiancheng and Zhao, Ran and Eskenazi, Maxine. Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. ACL 2017.
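A minimal sketch of loading this dataset with dataloader.SwitchboardCorpus, assuming the constructor accepts a file_id string; how the multi_ref set described above is exposed by the class is not covered here.

```python
from cotk.dataloader import SwitchboardCorpus

# Multi-turn dialog sessions; consecutive utterances from the same speaker
# are concatenated during pre-processing, as described above.
loader = SwitchboardCorpus("resources://SwitchboardCorpus")
```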
SwitchboardCorpus_small¶
name: SwitchboardCorpus_small
processor: _utils.resource_processor.SwitchboardCorpusResourceProcessor
source: amazon
usage: dataloader.SwitchboardCorpus
- Introduction
The data is uniformly sampled at random from SwitchboardCorpus.
Refer to SwitchboardCorpus for details and references.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Sessions | 463 | 12 | 12
Vocabulary Size | 13,235 | 5,189 | 4,729
Frequent Vocabulary Size (frequency >= 10) | 3,923 | 341 | 317
Ubuntu¶
name: Ubuntu
processor: _utils.resource_processor.UbuntuResourceProcessor
source: amazon, tsinghua
usage: dataloader.UbuntuCorpus
- Introduction
Ubuntu Dialogue Corpus 2.0 is a dataset containing a large number of multi-turn dialogues. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets, and the unstructured nature of interactions from microblog services such as Twitter.
We build the dataset using the ubuntu-ranking-dataset-creator [1], without tokenization, lemmatization or stemming. The dataset contains capitalized characters. The positive example probability is set to 1.0 for the training set. The numbers of examples in the training/dev/test sets are the defaults of 1,000,000/19,560/18,920, and other settings are also left at their defaults.
- Statistic
Property | Train | Dev | Test
---|---|---|---
Number of Sessions | 1,000,000 | 19,560 | 18,920
Minimum Number of Turns (per session) | 3 | 3 | 3
Maximum Number of Turns (per session) | 19 | 19 | 19
Average Number of Turns (per session) | 4.95 | 4.79 | 4.85
Std. Deviation of Number of Turns | 2.97 | 2.79 | 2.85
Minimum Sentence Length | 2 | 2 | 2
Maximum Sentence Length | 977 | 343 | 817
Average Sentence Length | 17.98 | 19.40 | 19.61
Std. Deviation of Sentence Length | 16.26 | 17.25 | 17.94
Vocabulary Size | 576,693 | 47,288 | 47,847
Frequent Vocabulary Size (frequency >= 10) | 23,498 | 2,568 | 2,565
- Reference
[1] Ubuntu Dialogue Corpus v2.0. https://github.com/rkadlec/ubuntu-ranking-dataset-creator
[2] R. Lowe, N. Pow, I. Serban, and J. Pineau. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Special Interest Group on Discourse and Dialogue (SIGDIAL), 2015a.
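A minimal sketch of loading this dataset with dataloader.UbuntuCorpus, assuming the constructor accepts a file_id string (note that the resource name is Ubuntu, not UbuntuCorpus):

```python
from cotk.dataloader import UbuntuCorpus

# Multi-turn Ubuntu Dialogue Corpus 2.0 sessions, downloaded from the
# default source (amazon); use "resources://Ubuntu@tsinghua" for the mirror.
loader = UbuntuCorpus("resources://Ubuntu")
```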
Ubuntu_small¶
name: Ubuntu_small
processor: _utils.resource_processor.UbuntuResourceProcessor
source: amazon
usage: dataloader.UbuntuCorpus