convlab2.policy.ppo package¶

Subpackages¶

class convlab2.policy.ppo.ppo.PPO(is_train=False, dataset='Multiwoz')¶

est_adv(r, v, mask)¶: we save a trajectory in continuous space and it reaches the ending of current trajectory when mask=0. :param r: reward, Tensor, [b] :param v: estimated value, Tensor, [b] :param mask: indicates ending for 0 otherwise 1, Tensor, [b] :return: A(s, a), V-target(s), both Tensor

classmethod from_pretrained(archive_file='', model_file='https://convlab.blob.core.windows.net/convlab-2/ppo_policy_multiwoz.zip', is_train=False, dataset='Multiwoz')¶

predict(state)¶

Predict an system action given state. Args:

state (dict): Dialog state. Please refer to util/state.py

Returns:: action : System act, with the form of (act_type, {slot_name_1: value_1, slot_name_2, value_2, …})

Created on Sun Jul 14 16:14:07 2019 @author: truthless

convlab2.policy.ppo.train.sample(env, policy, batchsz, process_num)¶

Given batchsz number of task, the batchsz will be splited equally to each processes and when processes return, it merge all data and return

param env

param policy

Parameters

batchsz –

Returns

batch

convlab2.policy.ppo.train.sampler(pid, queue, evt, env, policy, batchsz)¶: This is a sampler function, and it will be called by multiprocess.Process to sample data from environment by multiple processes. :param pid: process id :param queue: multiprocessing.Queue, to collect sampled data :param evt: multiprocessing.Event, to keep the process alive :param env: environment instance :param policy: policy network, to generate action from current policy :param batchsz: total sampled items :return:

convlab2.policy.ppo.train.update(env, policy, batchsz, epoch, process_num)¶