Probabilistic Dialogue Models with Prior Domain Knowledge

Probabilistic Dialogue Models with Prior Domain Knowledge
Pierre Lison, Language Technology Group, University of Oslo
SIGDIAL, July 6, 2012

Introduction
The use of statistical models is getting increasingly popular in spoken dialogue systems, but scalability remains a challenge for many domains.
Advantages:
- Explicit account of uncertainties, increased robustness to errors
- Better domain- and user-adaptivity, more natural and flexible conversational behaviours
Challenges:
- Good domain data is scarce and expensive to acquire
- Scalability to complex domains (the state space grows exponentially with the problem size)

Introduction (2)
This is a well-known problem in A.I. and machine learning. Solutions typically involve the use of more expressive representations:
- Capturing relevant aspects of the problem structure
- Taking advantage of hierarchical or relational abstractions
We present here such an abstraction mechanism, based on the concept of probabilistic rules.
Goal: leverage our prior domain knowledge to yield structured, compact probabilistic models.

Key idea
Observation: dialogue models exhibit a fair amount of internal structure:
- Probability (or utility) distributions can often be factored
- Even if the full distribution has many dependencies, the probability (or utility) of a specific outcome often depends on only a small subset of the variables
- Finally, the values of the dependent variables can often be grouped into partitions

Example of partitioning
Consider a dialogue where the user asks a robot yes/no questions about its location. The state contains the following variables:
- The last user dialogue act, e.g. a_u = AreYouIn(corridor)
- The robot location, e.g. location = kitchen
We want to learn the utility of the action a_m = SayYes. The combination of the two variables can take many values, but these can be partitioned into two sets:
- a_u = AreYouIn(x) ∧ location = x → positive utility
- a_u = AreYouIn(x) ∧ location ≠ x → negative utility

Probabilistic rules
Probabilistic rules attempt to capture this kind of structure. They take the form of structured if...then...else cases, mapping conditions to (probabilistic) effects:

  if (condition_1 holds) then
    P(effect_1) = θ_1, P(effect_2) = θ_2
  else if (condition_2 holds) then
    P(effect_3) = θ_3

For action-selection rules, the effect associates utilities with particular actions:

  if (condition_1 holds) then
    Q(actions) = θ_1

Probabilistic rules (2)
Conditions are arbitrary logical formulae on state variables; effects are value assignments on state variables.
Example of a rule for action selection:

  if (a_u = AreYouIn(x) ∧ location = x) then
    {Q(a_m = SayYes) = 3.0}
  else if (a_u = AreYouIn(x) ∧ location ≠ x) then
    {Q(a_m = SayNo) = 3.0}

Effect probabilities and utilities are parameters which can be estimated from data.
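To make the rule format concrete, here is a minimal Python sketch of how such a rule could be represented and matched against a state assignment. The function name, the dictionary-based state and the string encoding of dialogue acts are illustrative assumptions, not the paper's actual implementation:

```python
# Minimal sketch of an action-selection rule as a mapping from logical
# conditions to utility assignments, evaluated against a state.
# Illustrative assumption only, not the paper's actual implementation.

def rule_1(state):
    """Rule 1 from the talk: reward SayYes when the user's question
    matches the robot's location, and SayNo when it does not."""
    a_u = state.get("a_u")
    location = state.get("location")
    # The universally quantified x in AreYouIn(x) is bound by matching
    # the predicate's argument against the state.
    if a_u and a_u.startswith("AreYouIn(") and a_u.endswith(")"):
        x = a_u[len("AreYouIn("):-1]
        if x == location:
            return {"SayYes": 3.0}  # Q(a_m = SayYes) = 3.0
        return {"SayNo": 3.0}       # Q(a_m = SayNo) = 3.0
    return {}                        # no condition holds: void effect

print(rule_1({"a_u": "AreYouIn(kitchen)", "location": "kitchen"}))
# {'SayYes': 3.0}
```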

Rule-based state update
How are these rules applied in practice?
- The architecture revolves around a shared dialogue state, represented as a Bayesian network.
- At runtime, the rules are instantiated in the network, updating and expanding it with new nodes and dependencies.
- The rules thus function as high-level templates for a classical probabilistic model.

Example
Rule 1: if (a_u = AreYouIn(x) ∧ location = x) then {Q(a_m = SayYes) = 3.0}
        else if (a_u = AreYouIn(x) ∧ location ≠ x) then {Q(a_m = SayNo) = 3.0}
The current dialogue state contains:
- location: office [P=0.95], kitchen [P=0.05]
- a_u: AreYouIn(kitchen) [P=0.7], AreYouIn(corridor) [P=0.2], None [P=0.1]
Instantiating Rule 1 adds a condition node ϕ1 and an effect node ψ1 to the network:
- ϕ1: cond1 [P=0.035], cond2 [P=0.865], cond3 [P=0.1]
- ψ1 yields the expected utilities Q(a_m = SayYes) = 0.105 and Q(a_m = SayNo) = 2.6
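The expected utilities above follow from marginalizing the rule's condition over the input variables. A small sketch, assuming independence of a_u and location as in the diagram, reproduces the numbers on the slide:

```python
# Sketch: compute the distribution of Rule 1's condition node phi_1 by
# marginalizing over the (independent) input variables, then derive the
# expected utilities. Reproduces the figures shown on the slide.
location = {"office": 0.95, "kitchen": 0.05}
a_u = {"AreYouIn(kitchen)": 0.7, "AreYouIn(corridor)": 0.2, "None": 0.1}

cond = {"cond1": 0.0, "cond2": 0.0, "cond3": 0.0}
for act, p_act in a_u.items():
    for loc, p_loc in location.items():
        if act.startswith("AreYouIn("):
            x = act[len("AreYouIn("):-1]
            key = "cond1" if x == loc else "cond2"  # location = x vs. location != x
        else:
            key = "cond3"                           # a_u = None: neither case holds
        cond[key] += p_act * p_loc

print({k: round(p, 3) for k, p in cond.items()})
# {'cond1': 0.035, 'cond2': 0.865, 'cond3': 0.1}
print("Q(SayYes) =", round(3.0 * cond["cond1"], 3))  # 3.0 * 0.035 = 0.105
print("Q(SayNo)  =", round(3.0 * cond["cond2"], 3))  # 3.0 * 0.865 = 2.595, i.e. ~2.6
```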

Example (continued)
Rule 2: if (a_u = None) then {Q(a_m = AskRepeat) = 0.5}
Instantiating Rule 2 adds a second condition node ϕ2 and effect node ψ2 to the same network, so that the action node a_m now carries three expected utilities:
- Q(a_m = SayYes) = 0.105
- Q(a_m = SayNo) = 2.6
- Q(a_m = AskRepeat) = 0.45

Processing workflow
- The dialogue state, encoded as a Bayesian network, is the central, shared information repository.
- Each processing task (understanding, management, generation, etc.) reads from and writes to it.
- Many of these tasks are expressed in terms of collections of probabilistic rules.
(Diagram: speech recognition, speech understanding, dialogue interpretation, action selection, generation and speech synthesis modules arranged around the shared dialogue state.)
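As a rough illustration of this shared-state design, the sketch below wires processing tasks to a dialogue state that notifies them on every update. The class and the callback API are assumptions made for illustration, not the system's actual interfaces:

```python
# Sketch of the shared-state workflow: modules register as listeners on
# the dialogue state and react to updates by applying their own rules.

class DialogueState:
    def __init__(self):
        self.variables = {}   # distributions over state variables
        self.listeners = []   # processing tasks watching the state

    def attach(self, listener):
        self.listeners.append(listener)

    def update(self, variable, distribution):
        self.variables[variable] = distribution
        for listener in self.listeners:
            listener(self)    # e.g. understanding, action selection

def action_selection(state):
    if "a_u" in state.variables:
        print("selecting action given", state.variables["a_u"])

state = DialogueState()
state.attach(action_selection)
state.update("a_u", {"AreYouIn(kitchen)": 0.7, "None": 0.3})
```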

Parameter learning
- The rule parameters (probabilities or utilities) must be estimated from empirical data.
- We adopted a Bayesian approach, where the parameters are themselves defined as variables.
- The parameter distributions are then modified given the evidence from the training data.
(Diagram: a parameter node θ attached to the rule's condition and effect nodes ϕ and ψ.)
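As a toy illustration of such a Bayesian update, the sketch below maintains a discretized posterior over a single utility parameter θ and reweights it by the likelihood of each observed wizard action. The softmax choice model and the fixed competing utilities are assumptions for the example, not necessarily the likelihood used in the paper:

```python
import math

# Toy Bayesian update of a single utility parameter theta on a grid.
# The rule assigns Q(AskRepeat) = theta; the competing actions have
# fixed utilities. Softmax choice model = illustrative assumption.
grid = [i / 10.0 for i in range(0, 51)]          # theta in [0, 5]
posterior = {t: 1.0 / len(grid) for t in grid}   # uniform prior

def choice_prob(theta, chosen):
    """P(wizard picks `chosen`) under a softmax over action utilities."""
    q = {"AskRepeat": theta, "SayYes": 0.1, "SayNo": 2.6}
    z = sum(math.exp(v) for v in q.values())
    return math.exp(q[chosen]) / z

for observed in ["AskRepeat", "AskRepeat", "SayNo"]:  # wizard actions
    posterior = {t: p * choice_prob(t, observed) for t, p in posterior.items()}
    norm = sum(posterior.values())
    posterior = {t: p / norm for t, p in posterior.items()}

mean = sum(t * p for t, p in posterior.items())
print("posterior mean of theta:", round(mean, 2))
```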

Evaluation
- Policy learning task in a human-robot interaction scenario, based on Wizard-of-Oz training data.
- Objective: estimate the utilities of the possible system actions.
- Baselines: «rolled-out» versions of the model, i.e. «plain» probabilistic models with identical input and output variables, but without the condition and effect nodes as intermediary structures.
(Diagram: the rule-structured model, with nodes θ, ϕ, ψ mediating between inputs A, B and output D, compared with its rolled-out counterpart connecting A and B directly to D.)

Experimental setup
- Interaction scenario: users were instructed to teach the robot a sequence of basic movements (e.g. a small dance).
- Dialogue system comprising ASR and TTS modules, shallow components for understanding and generation, and libraries for robot control.
- The Wizard had access to the dialogue state and took decisions based on it (among a set of 14 alternatives).
- 20 interactions with 7 users, for a total of 1020 turns.
- Each sample d in the data set is a pair (b_d, t_d), where b_d is a recorded dialogue state and t_d is the «gold standard» system action selected by the Wizard in state b_d.

Empirical results
- Data set split into training (75%) and testing (25%) sets.
- Accuracy measure: percentage of actions corresponding to the ones selected by the Wizard (though the Wizard was sometimes inconsistent or unpredictable).
- The rule-structured model outperformed the two baselines in both accuracy and convergence speed.

Type of model          Accuracy (%)
Plain model            67.35
Linear model           61.85
Rule-structured model  82.82

Learning curve (linear scale)
(Plot: accuracy on the testing set (in %) against the number of training samples, from 0 to 760, for the rule-structured, linear and plain models.)

Learning curve (log-2 scale)
(Plot: the same accuracy curves with the number of training samples on a log-2 scale, up to 760 samples.)

Conclusions
- Probabilistic rules can be used to capture the underlying structure of dialogue models.
- They allow developers to exploit powerful generalisations and domain knowledge without sacrificing the probabilistic nature of the model.
- The framework was validated on a policy learning task based on a Wizard-of-Oz data set.
- Future work: extend the approach towards model-based Bayesian reinforcement learning.