Policy

Policy.py - abstract class for all policies

Copyright CUED Dialogue Systems Group 2015 - 2017

See also

CUED Imports/Dependencies:

import utils.Settings
import utils.DiaAct
import utils.ContextLogger
import ontology.OntologyUtils
import policy.SummaryAction


class policy.Policy.Action(action)

Dummy class representing one action. Used for recording and may be overridden by sub-class.

class policy.Policy.Episode(dstring=None)

An episode encapsulates the state-action-reward triplet which may be used for learning. Every entry represents one turn. The last entry should contain TerminalState and TerminalAction

check()

Checks whether the lengths of the internal state, action, and reward lists are equal.

getWeightedReward()

Returns the reward weighted by normalised accumulated weights. Used for multiagent learning in committee.

Returns:the reward weighted by normalised accumulated weights

record(state, action, reward, ma_weight=None)

Stores the state action reward in internal lists.

Parameters:
  • state (State) – the last belief state
  • action (Action) – the last system action
  • reward (int) – the reward of the last turn
  • ma_weight (float) – used by committee: the weight assigned by multiagent learning, optional

tostring()

Prints state, action, and reward lists to screen.
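For illustration, the recording interface of Episode can be mimicked with a minimal self-contained sketch. The class below is a hypothetical stand-in, not the PyDial implementation; the exact weighting semantics of getWeightedReward() may differ in the real code.

```python
class MiniEpisode:
    """Minimal stand-in for policy.Policy.Episode: parallel state/action/reward lists."""

    def __init__(self):
        self.strace, self.atrace, self.rtrace, self.weights = [], [], [], []

    def record(self, state, action, reward, ma_weight=None):
        # one entry per turn; weight defaults to 1.0 outside committee learning
        self.strace.append(state)
        self.atrace.append(action)
        self.rtrace.append(reward)
        self.weights.append(1.0 if ma_weight is None else ma_weight)

    def check(self):
        # lengths of the internal state, action, and reward lists must agree
        return len(self.strace) == len(self.atrace) == len(self.rtrace)

    def getWeightedReward(self):
        # total reward scaled by the normalised accumulated weights
        norm = sum(self.weights) / len(self.weights)
        return norm * sum(self.rtrace)


ep = MiniEpisode()
ep.record({'slot': 'food'}, 'request(food)', -1)
ep.record('TerminalState', 'TerminalAction', 20)  # last entry is terminal
assert ep.check()
print(ep.getWeightedReward())  # 19.0 (unit weights leave the reward sum unchanged)
```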

class policy.Policy.EpisodeStack(block_size=100)

A handler for episodes. Required if the stack size becomes very large - we may not want to hold all episodes in memory, but instead write them out to file.

add_episode(domain_episodes)

Items on the stack are dictionaries of episodes for each domain (since with BCM we can learn from two or more domains if a multi-domain dialogue happens).

retrieve_episode(episode_key)

NB: this should probably be an iterator, using yield, rather than return
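Following the NB above, a yield-based retrieval could look like the sketch below. The function name and the flat dictionary stack are hypothetical; the real EpisodeStack manages block_size and file backing, which are omitted here.

```python
def retrieve_episodes(stack, episode_keys):
    """Lazily yield (key, domain_episodes) pairs instead of returning a full list."""
    for key in episode_keys:
        if key in stack:  # silently skip unknown keys in this sketch
            yield key, stack[key]


# each stack item maps domains to episode objects (strings stand in here)
stack = {'ep0': {'CamRestaurants': 'episode-object'},
         'ep1': {'SFHotels': 'episode-object'}}
for key, domain_episodes in retrieve_episodes(stack, ['ep0', 'ep1']):
    print(key, sorted(domain_episodes))
```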

class policy.Policy.Policy(domainString, learning=False, specialDomain=False)

Interface class for a single domain policy. Responsible for selecting the next system action and handling the learning of the policy.

To create your own policy model or to change the state representation, derive from this class.
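A schematic of such a derivation, as a self-contained sketch: BasePolicy here is a stand-in for policy.Policy.Policy (only the nextAction hook is reproduced), and GreedyHelloPolicy with its trivial rule is entirely hypothetical.

```python
class BasePolicy:
    """Stand-in for policy.Policy.Policy with the relevant interface method."""

    def __init__(self, domainString, learning=False):
        self.domainString = domainString
        self.learning = learning

    def nextAction(self, beliefstate):
        raise NotImplementedError('override in sub-class')


class GreedyHelloPolicy(BasePolicy):
    def nextAction(self, beliefstate):
        # trivial rule: greet on the first turn, then request the least certain slot
        if not beliefstate.get('turns', 0):
            return 'hello()'
        slot = min(beliefstate['slots'], key=beliefstate['slots'].get)
        return 'request({})'.format(slot)


p = GreedyHelloPolicy('CamRestaurants')
print(p.nextAction({'turns': 0}))                                       # hello()
print(p.nextAction({'turns': 1, 'slots': {'food': 0.9, 'area': 0.2}}))  # request(area)
```

A real sub-class would typically also override train() and savePolicy(), and call record() each turn.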

act_on(state)

Main policy method: mapping of belief state to system action.

This method is automatically invoked by the agent at each turn after tracking the belief state.

May initially return ‘hello()’ as a hardcoded action. Keeps track of the last system action and last belief state.

Parameters:state (DialogueState) – the belief state to act on
Returns:

the next system action of type DiaAct

convertStateAction(state, action)

Converts the given state and action to policy-specific representations.

By default, the generic classes State and Action are used. To change this, override this method in a sub-class.

Parameters:
  • state (anything) – the state to be encapsulated
  • action (anything) – the action to be encapsulated

finalizeRecord(reward, domainInControl=None)

Records the final reward along with the terminal system action and terminal state. To change the type of state/action override convertStateAction().

This method is automatically executed by the agent at the end of each dialogue.

Parameters:
  • reward (int) – the final reward
  • domainInControl (str) – used by committee: the unique identifier domain string of the domain this dialogue originates in, optional
Returns:

None

nextAction(beliefstate)

Interface method for selecting the next system action. Should be overridden by sub-class.

This method is automatically executed by act_on() and thus called at each turn.

Parameters:beliefstate (dict) – the state the policy acts on
Returns:the next system action

record(reward, domainInControl=None, weight=None, state=None, action=None)

Records the current turn reward along with the last system action and belief state.

This method is automatically executed by the agent at the end of each turn.

To change the type of state/action override convertStateAction(). By default, the last master action is recorded. If you want another action to be recorded, e.g. the summary action, assign the respective object to self.actToBeRecorded in a derived class.

Parameters:
  • reward (int) – the turn reward to be recorded
  • domainInControl (str) – the domain string unique identifier of the domain the reward originates in
  • weight (float) – used by committee: the weight of the reward in case of multiagent learning
  • state (dict) – used by committee: the belief state to be recorded
  • action (str) – used by committee: the action to be recorded
Returns:

None

restart()

Restarts the policy. Resets internal variables.

This method is automatically executed by the agent at the end/beginning of each dialogue.

savePolicy(FORCE_SAVE=False)

Saves the learned policy model to file. Should be overridden by sub-class.

This method is automatically executed by the agent either at certain intervals or at least before shutting down the agent.

Parameters:FORCE_SAVE (bool) – used to force cleaning up of any learning and saving when we are powering off an agent.

train()

Interface method for initiating the training. Should be overridden by sub-class.

This method is called at the end of each dialogue by the PolicyManager if learning is enabled for the given domain policy.

class policy.Policy.State(state)

Dummy class representing one state. Used for recording and may be overridden by sub-class.

class policy.Policy.TerminalAction

Dummy class representing one terminal action. Used for recording and may be overridden by sub-class.

class policy.Policy.TerminalState

Dummy class representing one terminal state. Used for recording and may be overridden by sub-class.

PolicyManager.py - container for all policies

Copyright CUED Dialogue Systems Group 2015 - 2017

See also

CUED Imports/Dependencies:

import utils.Settings
import utils.ContextLogger
import ontology.Ontology
import ontology.OntologyUtils


class policy.PolicyManager.PolicyManager

The policy manager manages the policies for all domains.

It provides the interface to get the next system action based on the current belief state in act_on() and to initiate the learning in the policy in train().

_check_committee(committee)

Safety tool - should check some logical requirements on the list of domains given by the config

Parameters:committee (PolicyCommittee) – the committee to be checked

_load_committees()

Loads and instantiates the committee as configured in config file. The new object is added to the internal dictionary.

_load_domains_policy(domainString=None)

Loads and instantiates the respective policy as configured in config file. The new object is added to the internal dictionary.

Default is ‘hdc’.

Parameters:domainString (str) – the domain the policy will work on. Default is None.
Returns:the new policy object
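As a concrete illustration of the config-driven loading described above, a per-domain policy section might look like the fragment below. The section name and keys follow the usual PyDial pattern, but treat this fragment as illustrative rather than authoritative.

```ini
[policy_CamRestaurants]
policytype = gp
learning = True
```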
act_on(dstring, state)

Main policy method which maps the provided belief to the next system action. This is called at each turn by the DialogueAgent.

Parameters:
  • dstring (str) – the domain string unique identifier.
  • state (DialogueState) – the belief state the policy should act on
Returns:

the next system action as DiaAct

bootup(domainString)

Loads a policy for a given domain.

finalizeRecord(domainRewards)

Records the final rewards of all domains. In case of a committee, the recording is delegated.

This method is called once at the end of each dialogue by the DialogueAgent. (One dialogue may contain multiple domains.)

Parameters:domainRewards (dict) – a dictionary mapping from domains to final rewards
Returns:None

getLastSystemAction(domainString)

Returns the last system action of the specified domain.

Parameters:domainString (str) – the domain string unique identifier.
Returns:the last system action of the given domain or None

printEpisodes()

Prints the recorded episode of the current dialogue.

record(reward, domainString)

Records the current turn reward for the given domain. In case of a committee, the recording is delegated.

This method is called each turn by the DialogueAgent.

Parameters:
  • reward (int) – the turn reward to be recorded
  • domainString (str) – the domain string unique identifier of the domain the reward originates in
Returns:

None

restart()

Restarts all policies of all domains and resets internal variables.

savePolicy(FORCE_SAVE=False)

Initiates the policies of all domains to be saved.

Parameters:FORCE_SAVE (bool) – used to force cleaning up of any learning and saving when we are powering off an agent.

train(training_vec=None)

Initiates the training for the policies of all domains. This is called at the end of each dialogue by the DialogueAgent.

PolicyCommittee.py - implementation of the Bayesian committee machine for dialogue management

Copyright CUED Dialogue Systems Group 2015 - 2017

See also

CUED Imports/Dependencies:

import utils.Settings
import utils.ContextLogger
import utils.DiaAct


class policy.PolicyCommittee.CommitteeMember

Base class defining the interface methods which are needed in addition to the basic functionality provided by Policy

Committee members should derive from this class.

abstract_actions(actions)

Converts a list of domain acts to their abstract form

Parameters:actions (list of actions) – the actions to be abstracted

getMeanVar_for_executable_actions(belief, abstracted_currentstate, nonExecutableActions)

Computes the mean and variance of the Q value based on the abstracted belief state for each executable action.

Parameters:
  • belief (dict) – the unabstracted current domain belief
  • abstracted_currentstate (State or subclass) – the abstracted current belief
  • nonExecutableActions (list) – actions which are not selected for execution based on heuristic

getPriorVar(belief, act)

Returns prior variance for a given belief and action

Parameters:
  • belief (dict) – the unabstracted current domain belief state
  • act (str) – the unabstracted action

get_Action(action)

Converts the unabstracted domain action into an abstracted action to be used for multiagent learning.

Parameters:action (str) – the last system action

get_State(beliefstate, keep_none=False)

Converts the unabstracted domain state into an abstracted belief state to be used with getMeanVar_for_executable_actions().

Parameters:beliefstate (dict) – the unabstracted belief state

unabstract_action(actions)

Converts a list of abstract acts to their domain form

Parameters:actions (list of actions) – the actions to be unabstracted

class policy.PolicyCommittee.PolicyCommittee(policyManager, committeeMembers, learningmethod)

Manages everything related to policy committee. All policy members must inherit from Policy and CommitteeMember.

_bayes_committee_calculator(domainQs, priors, domainInControl, scale)

Given the means and variances of the committee members, forms the Bayesian committee distribution for each action, draws a sample from each, and returns the action with the highest sample.

Note

this implementation is probably slow – one could reformat domainQs and redo this via matrices and slicing

Parameters:
  • domainQs (dict of domains and dict of actions and dict of variance/mu and values) – the means and variances of all Q-value estimates of all domains
  • priors (dict of actions and values) – the prior of the Q-value
  • domainInControl (str) – the domain the dialogue is in
  • scale (float) – a scaling factor used to control exploration during learning
Returns:

the next abstract system action
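A numerical sketch of the combination step, assuming the product-of-Gaussians form of the Bayesian committee machine (combined precision is the sum of member precisions minus (M-1) prior precisions). The function name, the example Q estimates, and the prior variances below are made up for illustration.

```python
import math
import random


def bayes_committee_sample(domainQs, priors, scale=1.0, seed=0):
    """Combine per-domain Gaussian Q estimates per action (BCM form),
    sample once from each combined Gaussian, and return the best action."""
    rng = random.Random(seed)
    samples = {}
    for act, prior_var in priors.items():
        members = [q[act] for q in domainQs.values() if act in q]
        M = len(members)
        # combined precision: sum of member precisions minus (M-1) prior precisions
        prec = sum(1.0 / m['variance'] for m in members) - (M - 1) / prior_var
        var = 1.0 / prec
        mu = var * sum(m['mu'] / m['variance'] for m in members)
        # scale controls exploration; scale=0 degenerates to the mean
        samples[act] = rng.gauss(mu, scale * math.sqrt(var))
    return max(samples, key=samples.get)


domainQs = {
    'CamRestaurants': {'inform': {'mu': 2.0, 'variance': 0.5},
                       'request': {'mu': 1.0, 'variance': 0.5}},
    'SFRestaurants':  {'inform': {'mu': 1.5, 'variance': 1.0},
                       'request': {'mu': 0.5, 'variance': 1.0}},
}
priors = {'inform': 5.0, 'request': 5.0}
print(bayes_committee_sample(domainQs, priors))
```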

_set_multi_agent_learning_weights(comm_meansVars, chosen_act)

Sets the reward scalings for each committee member. Implements the NAIVE approach from “Multi-agent learning in multi-domain spoken dialogue systems”, Milica Gasic et al. 2015.

Parameters:
  • comm_meansVars (dict of domains and dict of actions and dict of variance/mu and values) – the means and variances of all committee members
  • chosen_act (str) – the abstract system action to be executed
Returns:

None

act_on(domainInControl, state)

Provides the next system action based on the domain in control and the belief state.

The belief state is mapped to an abstract representation which is used for all committee members.

Parameters:
  • domainInControl (str) – the domain unique identifier string of the domain in control
  • state (DialogueState) – the belief state to act on
Returns:

the next system action

finalizeRecord(reward, domainInControl)

Records, for each committee member, the reward and the domain the dialogue has been in.

Parameters:
  • reward (int) – the final reward to be recorded
  • domainInControl (str) – the domain the reward was achieved in

record(reward, domainInControl)

Records the turn reward for the committee members. In case of multiagent learning, the information held in the committee is used along with the reward to record the (b, a) pair and r.

Parameters:
  • reward (int) – the turn reward to be recorded
  • domainInControl (str) – the domain the reward was achieved in
Returns:

None

HDCPolicy.py - Handcrafted dialogue manager

Copyright CUED Dialogue Systems Group 2015 - 2017

See also

CUED Imports/Dependencies:

import policy.Policy
import policy.PolicyUtils
import policy.SummaryUtils
import utils.Settings
import utils.ContextLogger


class policy.HDCPolicy.HDCPolicy(domainString)

The handcrafted policy derives from the Policy base class. Based on the slots defined in the ontology and fixed thresholds, it implements a rule-based policy.

If no information is provided by the user, the system will always ask for slot information in the same order, based on the ontology.

GPPolicy.py - Gaussian Process policy

Copyright CUED Dialogue Systems Group 2015 - 2017

Relevant Config variables [Default values]:

[gppolicy]
kernel = polysort
thetafile = ''    

See also

CUED Imports/Dependencies:

import policy.GPLib
import policy.Policy
import policy.PolicyCommittee
import ontology.Ontology
import utils.Settings
import utils.ContextLogger


class policy.GPPolicy.GPPolicy(domainString, learning, sharedParams=None)

An implementation of the dialogue policy based on a Gaussian process and the GPSarsa algorithm to optimise actions, where states are represented as GPState and actions as GPAction.

The class implements the public interfaces from Policy and CommitteeMember.

class policy.GPPolicy.Kernel(kernel_type, theta, der=None, action_kernel_type='delta', action_names=None, domainString=None)

The Kernel class defining the kernel for the GPSARSA algorithm.

The kernel is usually divided into a belief part where a dot product or an RBF-kernel is used. The action kernel is either the delta function or a handcrafted or distributed kernel.
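The factored form described above, a belief kernel multiplied by a delta action kernel, can be sketched as follows. This is a simplified stand-in over flat feature vectors, not the Kernel class itself; the function name and the 'gauss' switch are illustrative.

```python
import math


def kernel(b1, a1, b2, a2, kernel_type='dot', sigma=1.0):
    """k((b,a),(b',a')) = k_B(b,b') * delta(a,a') -- simplified sketch."""
    if a1 != a2:                 # delta action kernel: zero for different actions
        return 0.0
    if kernel_type == 'gauss':   # RBF belief kernel
        sq = sum((x - y) ** 2 for x, y in zip(b1, b2))
        return math.exp(-sq / (2 * sigma ** 2))
    # default: dot-product belief kernel
    return sum(x * y for x, y in zip(b1, b2))


print(kernel([0.5, 0.5], 'inform', [0.5, 0.25], 'inform'))   # 0.375
print(kernel([0.5, 0.5], 'inform', [0.5, 0.25], 'request'))  # 0.0
```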

class policy.GPPolicy.GPAction(action, numActions, replace={})

Definition of summary action used for GP-SARSA.

class policy.GPPolicy.GPState(belief, keep_none=False, replace={}, domainString=None)

Definition of the state representation needed for the GP-SARSA algorithm. The main requirement is the ability to compute the kernel function over two states.

class policy.GPPolicy.TerminalGPAction

Class representing the action object recorded in the (b,a) pair along with the final reward.

class policy.GPPolicy.TerminalGPState

Basic object to explicitly denote the terminal state. The policy always transitions into this state at a dialogue's completion.

GPLib.py - Gaussian Process SARSA algorithm

Copyright CUED Dialogue Systems Group 2015 - 2017

This module encapsulates all classes and functionality which implement the GPSARSA algorithm for dialogue learning.

Relevant Config variables [Default values]. X is the domain tag:

[gpsarsa_X]
saveasprior = False 
random = False
learning = False
gamma = 1.0
sigma = 5.0
nu = 0.001
scale = -1
numprior = 0

See also

CUED Imports/Dependencies:

import utils.Settings
import utils.ContextLogger
import policy.PolicyUtils


class policy.GPLib.GPSARSA(in_policyfile, out_policyfile, domainString=None, learning=False, sharedParams=None)
Derives from GPSARSAPrior

Implements the GPSarsa algorithm, where the mean can have a predefined value: self._num_prior specifies the number of means and self._prior specifies the prior. If not specified, a zero mean is assumed.

Parameters needed to estimate the GP posterior:
  • self._K_tida_inv – inverse of the Gram matrix of dictionary state-action pairs
  • self.sharedParams[‘_C_tilda’] – covariance function needed to estimate the final variance of the posterior
  • self.sharedParams[‘_c_tilda’] – vector needed to calculate self.sharedParams[‘_C_tilda’]
  • self.sharedParams[‘_alpha_tilda’] – vector needed to estimate the mean of the posterior
  • self.sharedParams[‘_d’] and self.sharedParams[‘_s’] – sufficient statistics needed for the iterative estimation of the posterior

Parameters needed for the policy selection:
  • self._random – random policy choice
  • self._scale – scaling of the standard deviation when sampling the Q-value; if -1, the mean is taken
  • self.learning – if true, in learning mode
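The role of the alpha and C quantities can be seen in a compact sketch of the GP posterior at one state-action point: mean = k^T alpha and variance = k(x,x) - k^T C k. Everything below is illustrative; the variable names only mirror the _alpha_tilda/_C_tilda parameters described above.

```python
def gp_posterior(k_vec, k_xx, alpha_tilda, C_tilda):
    """Posterior mean and variance of Q at one state-action point.

    mean = k^T alpha_tilda
    var  = k(x,x) - k^T C_tilda k
    """
    n = len(k_vec)
    mean = sum(k * a for k, a in zip(k_vec, alpha_tilda))
    # kC[j] = (k^T C_tilda)[j]
    kC = [sum(k_vec[i] * C_tilda[i][j] for i in range(n)) for j in range(n)]
    var = k_xx - sum(kc * k for kc, k in zip(kC, k_vec))
    return mean, var


# two dictionary points; k_vec is the kernel vector to the query point
mean, var = gp_posterior([0.5, 0.25], 1.0, [2.0, -1.0], [[0.1, 0.0], [0.0, 0.2]])
print(mean, var)
```

During action selection the Q-value would then be sampled as mean plus scale times the posterior standard deviation.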

class policy.GPLib.GPSARSAPrior(in_policyfile, out_policyfile, numPrior=-1, learning=False, domainString=None, sharedParams=None)

Defines the GP prior. Derives from LearnerInterface.

class policy.GPLib.LearnerInterface

This class defines the basic interface for the GPSARSA algorithm.

Specifies the policy files:
  • self._inputDictFile – input dictionary file
  • self._inputParamFile – input parameter file
  • self._outputDictFile – output dictionary file
  • self._outputParamFile – output parameter file

The self.initial and self.terminal flags are needed for learning to specify the initial and terminal states in the episode.

HDCTopicManager.py - policy for the front end topic manager

Copyright CUED Dialogue Systems Group 2015 - 2017

See also

CUED Imports/Dependencies:

import policy.Policy
import utils.Settings
import utils.ContextLogger


class policy.HDCTopicManager.HDCTopicManagerPolicy(dstring=None, learning=None)

Handles the dialogue while the topic/domain of the conversation is still being determined.

At the current stage, this only happens at the beginning of the dialogue, so this policy has to take care of welcoming the user as well as creating actions which disambiguate/clarify the topic of the interaction.

It allows the system to hang up if the topic could not be identified after a specified number of attempts.

WikipediaTools.py - basic tools to access wikipedia

Copyright CUED Dialogue Systems Group 2015 - 2017

See also

CUED Imports/Dependencies:

import policy.Policy
import utils.Settings
import utils.ContextLogger


class policy.WikipediaTools.WikipediaDM

Dialogue Manager interface to Wikipedia – development state.

SummaryAction.py - Mapping between summary and master actions

Copyright CUED Dialogue Systems Group 2015 - 2017

See also

CUED Imports/Dependencies:

import policy.SummaryUtils
import ontology.Ontology
import utils.ContextLogger
import utils.Settings


class policy.SummaryAction.SummaryAction(domainString, empty=False, confreq=False)

The summary action class encapsulates the functionality of a summary action along with the conversion from summary to master actions.

Note

The list of all possible summary actions is defined in this class.

SummaryUtils.py - summarises dialog events for mapping from master to summary belief

Copyright CUED Dialogue Systems Group 2015 - 2017

Basic Usage:
>>> import SummaryUtils

Note

No classes; collection of utility methods

Local module variables:

global_summary_features:    (list) global actions/methods
REQUESTING_THRESHOLD:       (float) 0.5, the minimum value to consider a slot requested
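The REQUESTING_THRESHOLD decides which slots count as requested. A minimal sketch of that test, with a hypothetical flat belief layout (the real belief structure is richer):

```python
REQUESTING_THRESHOLD = 0.5


def requested_slots(belief_requested):
    """Return the slots whose 'requested' probability exceeds the threshold."""
    return [slot for slot, prob in sorted(belief_requested.items())
            if prob > REQUESTING_THRESHOLD]


print(requested_slots({'phone': 0.8, 'addr': 0.3, 'area': 0.55}))  # ['area', 'phone']
```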

See also

CUED Imports/Dependencies:

import ontology.Ontology
import utils.Settings
import utils.ContextLogger


PolicyUtils.py - Utility Methods for Policies

Copyright CUED Dialogue Systems Group 2015 - 2017

Note

PolicyUtils.py is a collection of utility functions only (No classes).

Local/file variables:

ZERO_THRESHOLD:             unused
REQUESTING_THRESHOLD:       affects getRequestedSlots() method

See also

CUED Imports/Dependencies:

import ontology.Ontology
import utils.DiaAct
import utils.Settings
import policy.SummaryUtils
import utils.ContextLogger


policy.PolicyUtils.REQUESTING_THRESHOLD = 0.5

Methods for global action.

policy.PolicyUtils.add_venue_count(input, belief, domainString)

Add venue count.

Parameters:
  • input (str) – the input act string
  • belief (dict) – the belief state
  • domainString (str) – domain tag like ‘SFHotels’
Returns:

act with venue count.

policy.PolicyUtils.checkDirExistsAndMake(fullpath)

Used when saving a policy – if the directory doesn't exist, it is created.

policy.PolicyUtils.getGlobalAction(belief, globalact, domainString)

Method for global action: returns action

Parameters:
  • belief (dict) – full belief state
  • globalact (str) – the global action name, e.g. ‘INFORM_REQUESTED’
  • domainString (str) – domain tag
Returns:

(str) action

policy.PolicyUtils.getInformAcceptedSlotsAboutEntity(acceptanceList, ent, numFeats)

Method for the global inform action: returns a filled-out inform() string. Needs to be cleaned up (Dongho).

Parameters:
  • acceptanceList (dict) – of slots with value:prob mass pairs
  • ent (dict) – slot:value properties for this entity
  • numFeats (int) – result of globalOntology.entity_by_features(acceptedValues)
Returns:

(str) filled out inform() act

policy.PolicyUtils.getInformAction(numAccepted, belief, domainString)

Method for the global inform action: returns an inform act via the getInformExactEntity() method, or null() if not enough slots are accepted.

Parameters:
  • belief (dict) – full belief state
  • numAccepted (int) – number of slots with prob. mass > 80%
  • domainString (str) – domain tag
Returns:

getInformExactEntity(acceptanceList,numAccepted)

policy.PolicyUtils.getInformExactEntity(acceptanceList, numAccepted, domainString)

Method for global inform action: creates inform act with none or an entity

Parameters:
  • acceptanceList (dict) – of slots with value:prob mass pairs
  • numAccepted (int) – number of accepted slots (> 80% prob. mass)
  • domainString (str) – domain tag
Returns:

getInformNoneVenue() or getInformAcceptedSlotsAboutEntity() as appropriate

BCM_Tools.py - Script for creating slot abstraction mapping files

Copyright CUED Dialogue Systems Group 2015 - 2017

Note

Collection of utility classes and methods

See also

CUED Imports/Dependencies:

import ontology.Ontology
import utils.Settings
import utils.ContextLogger


This script is used to create a mapping from slot names to abstract slot names (slot0, slot1, etc.), ordered from highest entropy to lowest. The mapping is written to a JSON file.
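The entropy-based ordering can be sketched as below. The function name, the value-count inputs, and the tie-breaking by slot name are all assumptions for illustration.

```python
import math


def abstract_slot_mapping(slot_value_counts):
    """Map each slot to slotN, ordered from highest value-entropy to lowest."""

    def entropy(counts):
        total = float(sum(counts))
        # Shannon entropy (bits) of the slot's value distribution
        return -sum(c / total * math.log(c / total, 2) for c in counts if c)

    ordered = sorted(slot_value_counts,
                     key=lambda s: (-entropy(slot_value_counts[s]), s))
    return {slot: 'slot%d' % i for i, slot in enumerate(ordered)}


mapping = abstract_slot_mapping({'food': [30, 30, 30, 10],
                                 'area': [50, 50],
                                 'pricerange': [98, 1, 1]})
print(mapping)  # food is most uniform, so it becomes slot0
```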

DeepRL Policies

A2CPolicy.py - Advantage Actor-Critic policy

Copyright CUED Dialogue Systems Group 2015 - 2017

The implementation of the advantage actor-critic with the temporal difference as an approximation of the advantage function. The network is defined in DRL.a2c.py. You can turn on importance sampling through the parameter A2CPolicy.importance_sampling.

The details of the implementation can be found here: https://arxiv.org/abs/1707.00130

See also: https://dl.acm.org/citation.cfm?id=3009806

See also

CUED Imports/Dependencies:

import Policy
import utils.ContextLogger


ACERPolicy.py - Sample Efficient Actor Critic with Experience Replay

Copyright CUED Dialogue Systems Group 2015 - 2017

The implementation of the sample-efficient actor-critic with truncated importance sampling with bias correction, the trust region policy optimization method, and RETRACE-like multi-step estimation of the value function. The behaviour is controlled by parameters such as ACERPolicy.c and ACERPolicy.alpha. The details of the implementation can be found here: https://arxiv.org/abs/1802.03753

See also: https://arxiv.org/abs/1611.01224 https://arxiv.org/abs/1606.02647

See also

CUED Imports/Dependencies:

import Policy
import utils.ContextLogger


BDQNPolicy.py - deep Bayesian Q network policy

Copyright CUED Dialogue Systems Group 2015 - 2018

Implementation of Bayes by Backprop. The prediction is used both at training and testing time. The model is highly dependent on its configuration parameters.

See also: https://arxiv.org/abs/1505.05424 http://zacklipton.com/media/papers/bbq-learning-dialogue-policy-lipton2016.pdf

See also

CUED Imports/Dependencies:

import Policy
import utils.ContextLogger


DQNPolicy.py - deep Q network policy

Copyright CUED Dialogue Systems Group 2015 - 2017

See also

CUED Imports/Dependencies:

import Policy
import utils.ContextLogger

Warning

Documentation not done.


ENACPolicy.py - Episodic Natural Actor-Critic policy

Copyright CUED Dialogue Systems Group 2015 - 2017

The implementation of the episodic natural actor-critic. The vanilla gradients are computed in DRL/enac.py using TensorFlow, and the natural gradient is then obtained through the function train. You can turn on importance sampling through the parameter ENACPolicy.importance_sampling.

The details of the implementation can be found here: https://arxiv.org/abs/1707.00130

See also: https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2007-125.pdf

See also

CUED Imports/Dependencies:

import Policy
import utils.ContextLogger


TRACERPolicy.py - Trust region advantage Actor-Critic policy with experience replay

Copyright CUED Dialogue Systems Group 2015 - 2017

The implementation of the actor-critic algorithm with off-policy learning and a trust region constraint for stable training. The network is defined and the approximation of the natural gradient is computed in DRL.na2c.py. You can turn on importance sampling through the parameter TRACERPolicy.importance_sampling.

The details of the implementation can be found here: https://arxiv.org/abs/1707.00130

See also: https://arxiv.org/abs/1611.01224 https://pdfs.semanticscholar.org/c79d/c0bdb138e5ca75445e84e1118759ac284da0.pdf

See also

CUED Imports/Dependencies:

import Policy
import utils.ContextLogger


FeudalRL Policies

Traditional Reinforcement Learning algorithms fail to scale to large domains due to the curse of dimensionality. A novel Dialogue Management architecture based on Feudal RL decomposes the decision into two steps: a first step where a master policy selects a subset of primitive actions, and a second step where a primitive action is chosen from the selected subset. The structural information included in the domain ontology is used to abstract the dialogue state space, taking the decisions at each step using different parts of the abstracted state. This, combined with an information-sharing mechanism between slots, increases the scalability to large domains.

For more information, please look at the paper Feudal Reinforcement Learning for Dialogue Management in Large Domains.
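The two-step Feudal decision can be sketched with trivial placeholder policies. The master/worker rules, subset names, and belief layout below are all hypothetical; real master and worker policies would be learned.

```python
def master(state):
    # step 1: the master policy selects a subset of primitive actions
    if state['food_belief'] < 0.8:
        return 'slot_dependent', ['request(food)', 'confirm(food)']
    return 'slot_independent', ['inform()']


# step 2: one worker policy per subset picks a primitive action from it
workers = {
    'slot_dependent': lambda state, acts: acts[0] if state['food_belief'] < 0.5 else acts[1],
    'slot_independent': lambda state, acts: acts[0],
}


def feudal_decision(state):
    subset, actions = master(state)
    return workers[subset](state, actions)


print(feudal_decision({'food_belief': 0.3}))  # request(food)
print(feudal_decision({'food_belief': 0.9}))  # inform()
```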