Active Sensing as Bayes-Optimal Sequential Decision-Making


Active Sensing as Bayes-Optimal Sequential Decision-Making Sheeraz Ahmad & Angela J. Yu Department of Computer Science and Engineering University of California, San Diego December 7, 2012

Outline Introduction Active Sensing: Background Visual Search Task POMDP Formulation Bayesian Inference Optimal Action Selection Results Scalability Issues Low Dimensional Approximate Control Results Comparison with Infomax Policy Conclusion


Introduction Active sensing falls under the more general area of closed-loop decision making. The underlying problem structure is illustrated on the slide (diagram not transcribed).

Introduction Other examples of such decision-making problems include: sensor management [Hero and Cochran, 2011]; generalized binary search [Nowak, 2011]; teaching word meanings [Whitehill and Movellan, 2012]; underwater object classification [Hollinger et al., 2011]; menu design for a P300 prosthetic [Jarzebowski et al., 2012]. A natural framework for studying these problems is Markov Decision Processes (MDPs) or Partially Observable Markov Decision Processes (POMDPs). Exact solutions are computationally expensive, especially for POMDPs. General as well as application-specific approximations are an active area of research [Powell, 2007; Lagoudakis and Parr, 2003; Kaplow, 2010].

Outline Introduction Active Sensing: Background Visual Search Task POMDP Formulation Bayesian Inference Optimal Action Selection Results Scalability Issues Low Dimensional Approximate Control Results Comparison with Infomax Policy Conclusion

Active Sensing: Background The problem of choosing fixation locations has been well studied. Feedforward approaches include random fixations, saliency maps [Itti et al., 1998], fixating class-separating locations [Lacroix et al., 2008], etc. These are usually very simple and describe some free-viewing behavior. Some shortcomings: no provision to query peripheral locations; no inherent mechanism to implement inhibition of return; and saliency has been shown to play little role in goal-oriented visual tasks [Yarbus, 1967].

Active Sensing: Background Feedback approaches include maximizing one-step detection probability [Najemnik and Geisler, 2005], minimizing entropy [Butko and Movellan, 2010], etc. Such surrogate goals can yield computationally tractable policies, with some performance guarantees [Williams et al., 2007]. Some shortcomings: no provision for task-specific demands or behavioral costs; they require an ad hoc stopping criterion for the terminal decision; and they are more descriptive than predictive. Ideal goal: a computationally tractable policy that also overcomes these shortcomings. Contribution: solve for the exact optimal policy and show that it explains human data; use the insights gained to design approximations and to augment existing algorithms.

Outline Introduction Active Sensing: Background Visual Search Task POMDP Formulation Bayesian Inference Optimal Action Selection Results Scalability Issues Low Dimensional Approximate Control Results Comparison with Infomax Policy Conclusion

Visual Search Task [Huang and Yu, SfN, 2010] Task: find the target amongst the distractors (target and distractor stimuli shown on the slide). A gaze-contingent display allows exact measurement of where the subject obtains sensory input. The sequence of stimuli is controlled by the subject.

Visual Search Task Some locations are more likely to be the target than others (prior ratio 1:3:9, i.e. prior probabilities 1/13, 3/13, 9/13). Reward policy: specified on the slide (not transcribed).

Outline Introduction Active Sensing: Background Visual Search Task POMDP Formulation Bayesian Inference Optimal Action Selection Results Scalability Issues Low Dimensional Approximate Control Results Comparison with Infomax Policy Conclusion

POMDP Formulation The loss formulation of a POMDP is a six-tuple $(S, A, O, T, \Omega, L)$.
$S$ (set of states): set of target locations $\{1, 2, 3\}$.
$A$ (set of actions): next location to fixate $\{1, 2, 3\}$, plus the terminal (stopping) action $\{0\}$.
$O$ (set of observations): direction of dots $\{0\ (\text{right}),\ 1\ (\text{left})\}$.
$T$ (transition probabilities): the $3 \times 3$ identity matrix (the target location is static).
$\Omega$ (observation probabilities): $\Omega(o \mid s, a) = 1_{\{s = a\}}\,\mathrm{Bern}(o; \beta) + 1_{\{s \neq a\}}\,\mathrm{Bern}(o; 1 - \beta)$.
$L$ (loss function):
$L(s, a_{t-1}, a_t) = \begin{cases} 1_{\{s \neq a_{t-1}\}} & \text{if } a_t = 0 \\ c + c_s\, 1_{\{a_t \neq a_{t-1}\}} & \text{if } a_t \in \{1, 2, 3\} \end{cases}$
where $c$ is the cost of unit time and $c_s$ is the cost of a switch.
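To make the formulation concrete, here is a minimal Python sketch of the observation model $\Omega$ and the loss function $L$ defined above. The slides only fix the symbols $\beta$, $c$, and $c_s$; the convention that $\beta$ is the probability of the "left" observation at the fixated target, the default parameter values, and the function names are assumptions for illustration.

```python
def observation_prob(o, s, a, beta=0.9):
    """Omega(o | s, a): likelihood of dot direction o (0 = right, 1 = left)
    when the target is at location s and the agent fixates location a.
    Assumption: Bern(o; beta) is read as P(o = 1) = beta."""
    p_left = beta if s == a else 1.0 - beta
    return p_left if o == 1 else 1.0 - p_left

def loss(s, a_prev, a_t, c=0.1, c_s=0.0):
    """L(s, a_{t-1}, a_t): 0/1 error when stopping (a_t = 0), otherwise the
    time cost c plus the switch cost c_s if fixation moves."""
    if a_t == 0:                                  # terminal (stopping) action
        return 0.0 if s == a_prev else 1.0
    return c + (c_s if a_t != a_prev else 0.0)
```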

Bayesian Inference The agent does not know the exact state (target location). Instead it maintains a probability distribution over states, known as the belief state:
$b_t = \big( p(s = 1 \mid o_{1:t}; a_{1:t}),\ p(s = 2 \mid o_{1:t}; a_{1:t}),\ p(s = 3 \mid o_{1:t}; a_{1:t}) \big)$
where $o_{1:t}$ is the observation history and $a_{1:t}$ is the fixation-location history up to time $t$. Belief update using Bayes' rule:
$b_t(s) \propto p(o_t \mid s; a_t)\, p(s \mid o_{1:t-1}; a_{1:t-1}) = \Omega(o_t \mid s, a_t)\, b_{t-1}(s)$
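A minimal sketch of this Bayes update for the three-location task. The observation convention (that $\beta$ is the probability of observing "left" at the fixated target) and the example parameter values carry over the assumptions noted in the previous sketch.

```python
import numpy as np

def belief_update(b_prev, o, a, beta=0.9):
    """b_t(s) proportional to Omega(o_t | s, a_t) * b_{t-1}(s), renormalized.
    States and fixation locations are 1-indexed: s, a in {1, 2, 3}."""
    states = np.array([1, 2, 3])
    p_left = np.where(states == a, beta, 1.0 - beta)   # P(o = 1 | s, a)
    likelihood = p_left if o == 1 else 1.0 - p_left
    b = likelihood * np.asarray(b_prev, dtype=float)
    return b / b.sum()

# Example: start from the 1:3:9 prior, fixate location 3, observe "left"
b0 = np.array([1.0, 3.0, 9.0]) / 13.0
b1 = belief_update(b0, o=1, a=3)    # belief in location 3 increases
```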

Optimal Action Selection A policy ($\pi$) is a function mapping belief states to actions. The value of a policy is defined as the expected loss incurred by following it:
$V^{\pi}(b_t, a_t) = \sum_{t' = t+1}^{\infty} E\left[ L_{t'} \mid b_t, \pi \right]$
The optimal policy is thus:
$\pi^*(b_t, a_t) = \arg\min_{\pi} V^{\pi}(b_t, a_t)$
Bellman optimality equation [Bellman, 1952]:
$V^*(b_t, a_t) = \min_{a_{t+1}} \begin{cases} 1 - b_t(a_t) & \text{if } a_{t+1} = 0 \\ c + c_s\, 1_{\{a_{t+1} \neq a_t\}} + E\left[ V^*(b_{t+1}, a_{t+1}) \right] & \text{otherwise} \end{cases}$
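The following sketch evaluates the right-hand side of the Bellman equation for a given belief $b$ and previous fixation, assuming some approximation of the optimal cost-to-go is available as a callable $V(b, a)$. The callable interface, default parameters, and names are illustrative, not part of the original slides.

```python
import numpy as np

def q_values(b, a_prev, V, c=0.1, c_s=0.0, beta=0.9):
    """Expected cost of each action: key 0 is the stopping action, keys 1..3
    are the fixation locations. V(b_next, a_next) is a user-supplied
    approximation of the optimal cost-to-go."""
    states = np.array([1, 2, 3])
    q = {0: float(1.0 - b[a_prev - 1])}           # stop: P(declared location is wrong)
    for a_next in (1, 2, 3):
        cost = c + (c_s if a_next != a_prev else 0.0)
        p_left = np.where(states == a_next, beta, 1.0 - beta)
        expected_V = 0.0
        for p_o_given_s in (p_left, 1.0 - p_left):       # o = left, o = right
            p_o = float(p_o_given_s @ b)                 # predictive probability of o
            b_next = p_o_given_s * b / p_o               # Bayes update
            expected_V += p_o * V(b_next, a_next)
        q[a_next] = cost + expected_V
    return q

# Greedy choice against the Q-values (0 means stop and report a_prev):
# q = q_values(b, a_prev, V); best_action = min(q, key=q.get)
```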

Results: Optimal Policy Results are shown over a gridded belief space (grid size = 201). The grid-based approximation improves with grid density [Lovejoy, 1991], but is computationally expensive. Qualitative effect of the environment parameters $(c, c_s, \beta)$ on the optimal policy:
(0.1, 0, 0.9): stop at high certainty.
(0.1, 0, 0.7): stop early.
(0.1, 0.1, 0.9): switch less.
(0.2, 0, 0.9): stop early.
(0.2, 0.2, 0.9): stop early, switch less.

Results: Confirmation Bias I (figure: P(target selection) vs. prior expectation)

Results: Confirmation Bias II (figure)

Results: Confirmation Bias III (figure: time to confirm and time to disconfirm vs. prior expectation)

Scalability Issues Belief-state MDP formulations suffer from the curse of dimensionality: the state space (the belief space) is continuous, so it contains uncountably many states. The algorithmic complexity of the grid-based solution is $O(k n^{k-1})$ for $k$ sensing locations and a grid size of $n$, i.e. exponential in the number of locations. Next we present simpler approximations, with complexity linear in $k$, that still retain context sensitivity.

Outline Introduction Active Sensing: Background Visual Search Task POMDP Formulation Bayesian Inference Optimal Action Selection Results Scalability Issues Low Dimensional Approximate Control Results Comparison with Infomax Policy Conclusion

Low Dimensional Approximate Control
1. Fix $M$ radial basis functions (RBFs): $\phi_i(b) = \frac{1}{\sigma (2\pi)^{k/2}} \exp\!\left( -\frac{\| b - \mu_i \|^2}{2\sigma^2} \right)$.
2. Generate $m$ points randomly from the belief space ($b$).
3. Initialize the value function $\{V(b_i)\}_{i=1}^{m}$ with the stopping costs.
4. Find $w$, the minimum-norm solution of $V(b) = \Phi(b) w$.
5. Generate a new set of $m$ random belief-state points ($b'$).
6. Evaluate the $V$ values required for value iteration using the current $w$.
7. Update $V(b')$ using value iteration.
8. Find a new $w$ from $V(b') = \Phi(b') w$.
9. Repeat steps 5 through 8 until $w$ converges.
A sketch of this procedure is given below.
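A minimal Python sketch of steps 1-9, under some simplifying assumptions: the dependence of $V$ on the previous fixation is dropped (exact only when $c_s = 0$, so stopping is scored against the most likely location), the $M$ RBF centers are sampled randomly from the simplex rather than placed on a grid, and the parameter values, tolerance, and function names are illustrative.

```python
import numpy as np

def rbf_features(B, centers, sigma):
    """phi_i(b) = exp(-||b - mu_i||^2 / (2 sigma^2)) / (sigma (2 pi)^(k/2)),
    evaluated for every belief point in B (rows) and every center mu_i."""
    k = B.shape[1]
    d2 = ((B[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2)) / (sigma * (2.0 * np.pi) ** (k / 2.0))

def sample_beliefs(m, k, rng):
    """m random points from the belief space (uniform on the k-simplex)."""
    return rng.dirichlet(np.ones(k), size=m)

def backup(B, w, centers, sigma, c, beta):
    """Value-iteration update at the sampled points, using V(b) ~ phi(b) . w
    for the continuation values (previous-fixation argument dropped, c_s = 0)."""
    m, k = B.shape
    V_new = np.empty(m)
    for i, b in enumerate(B):
        q_stop = 1.0 - b.max()                     # stop and declare the MAP location
        q_fix = np.full(k, c)                      # time cost for each fixation
        for a in range(k):
            p_left = np.where(np.arange(k) == a, beta, 1.0 - beta)
            for p_o_s in (p_left, 1.0 - p_left):   # o = left, o = right
                p_o = float(p_o_s @ b)
                b_next = (p_o_s * b / p_o)[None, :]
                q_fix[a] += p_o * float(rbf_features(b_next, centers, sigma) @ w)
        V_new[i] = min(q_stop, q_fix.min())
    return V_new

def rbf_approximate_control(k=3, M=49, m=1000, sigma=0.25, c=0.1, beta=0.9,
                            max_iter=50, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    centers = sample_beliefs(M, k, rng)            # 1. fix M RBF centers
    B = sample_beliefs(m, k, rng)                  # 2. sample m belief points
    V = 1.0 - B.max(axis=1)                        # 3. initialize with stopping costs
    w = np.linalg.lstsq(rbf_features(B, centers, sigma), V, rcond=None)[0]   # 4.
    for _ in range(max_iter):                      # 5.-9. iterate until w converges
        B = sample_beliefs(m, k, rng)              # 5. new random belief points
        V = backup(B, w, centers, sigma, c, beta)  # 6.-7. value-iteration update
        w_new = np.linalg.lstsq(rbf_features(B, centers, sigma), V, rcond=None)[0]  # 8.
        if np.linalg.norm(w_new - w) < tol:        # 9. convergence check
            return w_new, centers
        w = w_new
    return w, centers
```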

Results: Comparison with Approximate Policies Results shown for the RBF approximation, Gaussian process regression (GPR) [Williams and Rasmussen, 1996], and GPR with automatic relevance determination (ARD). Grid size = 201; RBF: $M = 49$, $m = 1000$. Environments: $(c, c_s, \beta) = (0.1, 0, 0.9)$ and $(0.1, 0.1, 0.9)$.

Outline Introduction Active Sensing: Background Visual Search Task POMDP Formulation Bayesian Inference Optimal Action Selection Results Scalability Issues Low Dimensional Approximate Control Results Comparison with Infomax Policy Conclusion

Comparison with Infomax Policy Infomax [Butko and Movellan, 2010] also tackles a visual search problem; it uses a finite-horizon entropy as the cost function. Insights gained from the geometry of the optimal policy can be used to parametrically augment the Infomax policy. Figure: policies shown over 201 bins, with c = 0.1, c_s = 0, β = 0.9. (A) Behavioral policy. (B) Infomax policy (stop when the posterior belief exceeds 0.9).
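For comparison, here is a sketch of a greedy, one-step infomax-style policy with the stopping rule mentioned in the figure caption (stop once the maximum posterior belief exceeds 0.9). Butko and Movellan's Infomax uses a finite-horizon entropy objective; the one-step greedy form, the thresholding, and the names used here are simplifying assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (0 log 0 := 0)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def infomax_step(b, beta=0.9, threshold=0.9):
    """Greedy one-step infomax-style policy with a stopping threshold:
    return 0 (stop) once the maximum belief exceeds `threshold`, otherwise
    return the 1-indexed location minimizing the expected posterior entropy."""
    b = np.asarray(b, dtype=float)
    k = len(b)
    if b.max() > threshold:
        return 0
    expected_H = np.empty(k)
    for a in range(k):
        p_left = np.where(np.arange(k) == a, beta, 1.0 - beta)
        H = 0.0
        for p_o_s in (p_left, 1.0 - p_left):       # o = left, o = right
            p_o = float(p_o_s @ b)
            H += p_o * entropy(p_o_s * b / p_o)
        expected_H[a] = H
    return int(np.argmin(expected_H)) + 1

# Example: with belief (0.2, 0.3, 0.5) the policy keeps sampling
# print(infomax_step(np.array([0.2, 0.3, 0.5])))
```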

Outline Introduction Active Sensing: Background Visual Search Task POMDP Formulation Bayesian Inference Optimal Action Selection Results Scalability Issues Low Dimensional Approximate Control Results Comparison with Infomax Policy Conclusion

Conclusion Presented an active sensing framework that takes into account task demands and behavioral costs. Applying it to a simple visual search task yields intuitive predictions. Comparison with human data shows a close fit and explains confirmation bias. Presented approximate algorithms that are computationally tractable yet context sensitive. This work aims to add to the growing literature on decision processes, to spur new approximations, and to augment existing algorithms. We believe that a framework sensitive to behavioral costs can not only lead to better artificial agents but also shed light on the neural underpinnings of active sensing.

References I
R. Bellman. On the theory of dynamic programming. PNAS, 38(8):716-719, 1952.
N. J. Butko and J. R. Movellan. Infomax control of eye movements. IEEE Transactions on Autonomous Mental Development, 2(2):91-107, 2010.
A. O. Hero and D. Cochran. Sensor management: Past, present, and future. IEEE Sensors Journal, 11(12):3064-3075, December 2011.
G. A. Hollinger, U. Mitra, and G. S. Sukhatme. Active classification: Theory and application to underwater inspection. arXiv preprint arXiv:1106.5829, 2011.
L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254-1259, 1998.
J. Jarzebowski, R. Ma, N. Aghasadeghi, T. Bretl, and T. P. Coleman. A stochastic control approach to optimally designing variable-sized menus in P300 communication prostheses. 2012.
R. Kaplow. Point-based POMDP solvers: Survey and comparative analysis. Thesis, McGill University, 2010.
J. Lacroix, E. Postma, J. Van Den Herik, and J. Murre. Toward a visual cognitive system using active top-down saccadic control. International Journal of Humanoid Robotics, 5(02):225-246, 2008.
M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.

References II
W. S. Lovejoy. Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1):162-175, 1991.
J. Najemnik and W. S. Geisler. Optimal eye movement strategies in visual search. Nature, 434(7031):387-391, 2005.
R. D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893-7906, 2011.
W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality, volume 703. Wiley-Interscience, 2007.
J. Whitehill and J. Movellan. Teaching word meanings by visual examples. Journal of Machine Learning Research, 2012.
C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514-520. MIT Press, Cambridge, MA, 1996.
J. L. Williams, J. W. Fisher III, and A. S. Willsky. Performance guarantees for information theoretic active inference. In AI & Statistics (AISTATS), 2007.
A. F. Yarbus. Eye Movements and Vision. Plenum Press, New York, 1967.

Thanks!!

Additional Slides The complexity of the RBF approximation is $O(k(mM + M^3))$. The complexity of the GPR approximation is $O(kN^3)$, where $N$ is the number of points used for regression. For the GPR simulations: 200 points were used for extrapolation at each step, with length scale = 1, signal strength = 1, and noise strength = 0.1. The approximation is motivated by Powell's book [Powell, 2007] and by LSPI [Lagoudakis and Parr, 2003].