Markov Decision Processes and Reinforcement Learning


Lecture 14: Markov Decision Processes and Reinforcement Learning
Marco Chiarandini, Department of Mathematics & Computer Science, University of Southern Denmark
Slides by Stuart Russell and Peter Norvig

Course Overview
- Introduction: Artificial Intelligence, Intelligent Agents
- Search: Uninformed Search, Heuristic Search
- Uncertain knowledge and Reasoning: Probability and the Bayesian approach, Bayesian Networks, Hidden Markov Chains, Kalman Filters
- Learning: Supervised (Decision Trees, Neural Networks, Learning Bayesian Networks), Unsupervised (EM Algorithm)
- Games and Adversarial Search: Minimax search and Alpha-beta pruning, Multiagent search
- Knowledge representation and Reasoning: Propositional logic, First order logic, Inference
- Planning

Recap
- Supervised learning: from examples (x_1, y_1), (x_2, y_2), ... learn a function y = f(x)
- Unsupervised learning: from x_1, x_2, ... learn a distribution Pr(X = x)
- Reinforcement learning: from traces (s, a, s', a', s'', ...) plus rewards at some states, learn a policy π(s)

Consider chess: we would like to learn the correct move for each state, but no such feedback is available. The only feedback is a reward (reinforcement) at the end of a sequence of moves, or at some intermediate states. The agent then learns a transition model from this experience. Other examples: backgammon, helicopter flight, etc.

Recall: environments are categorized along several dimensions:
fully observable vs. partially observable; deterministic vs. stochastic; episodic vs. sequential; static vs. dynamic; discrete vs. continuous; single-agent vs. multi-agent.

Sequential decision problems: the outcome depends on a sequence of decisions. They include search and planning as special cases:
- search: problem solving in a state space (deterministic and fully observable)
- planning: interleaves planning and execution, gathering feedback from the environment because of stochasticity, partial observability, or multiple agents (belief state space)
- learning: deals with uncertainty about how the environment works

Environment:          Deterministic    Stochastic
Fully observable      A*, DFS, BFS     MDP

MDP: fully observable environment, and the agent knows the reward function.
Now: fully observable environment, but no knowledge of how it works (transition and reward functions), and probabilistic actions.

Outline
1. Markov Decision Processes
2. Reinforcement Learning

Terminology and Notation

Sequential decision problem in a fully observable, stochastic environment with a Markovian transition model and additive rewards:
- s ∈ S                     states
- a ∈ A(s)                  actions
- s_0                       start state
- p(s' | s, a)              transition probability; the world is stochastic; Markovian assumption
- R(s) or R(s, a, s')       reward
- U([s_0, s_1, ..., s_n]) or V(·)   utility function; depends on the sequence of states (sum of rewards)

[Figure: the 4x3 grid world with a START state and terminal states +1 and -1; each action moves in the intended direction with probability 0.8 and perpendicular to it with probability 0.1 each.]

A fixed action sequence is not good because actions are probabilistic.
Policy π: specification of what to do in any state.
Optimal policy π*: the policy with highest expected utility.
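To make the notation concrete, here is a minimal sketch of how such an MDP could be represented in Python; the two-state example and the dictionary layout (keys "states", "actions", "P", "R") are illustrative assumptions, not part of the lecture material.

```python
# A minimal, hypothetical MDP: states, actions A(s),
# transition probabilities p(s' | s, a) and rewards R(s).
mdp = {
    "states": ["s0", "s1"],
    "actions": {"s0": ["stay", "go"], "s1": ["stay"]},
    # P[(s, a)] maps each successor state s' to p(s' | s, a)
    "P": {
        ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
        ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
    },
    # R[s] is the reward collected in state s
    "R": {"s0": 0.0, "s1": 1.0},
}
```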

Highest Expected Utility

Utility of a state sequence (discounted sum of rewards):
U([s_0, s_1, ..., s_n]) = R(s_0) + γ R(s_1) + γ² R(s_2) + ... + γⁿ R(s_n)

Utility of a state under policy π:
U^π(s) = E_π[ Σ_{t=0}^∞ γ^t R(s_t) ] = R(s) + γ Σ_{s'} Pr(s' | s, a = π(s)) U^π(s')
(looks onwards: the utility of s depends on the utilities of its future neighbours)

Optimal policy:
U^{π*}(s) = max_π U^π(s)
π* = argmax_π U^π(s)

Choosing actions by maximum expected utility gives the Bellman equation:
U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s'} Pr(s' | s, a) U(s')
π*(s) = argmax_{a ∈ A(s)} [ R(s) + γ Σ_{s'} Pr(s' | s, a) U(s') ]

Value Iteration
1. Calculate the utility of each state using the iterative procedure below.
2. Use the state utilities to select an optimal action in each state.

For step 1, repeat the Bellman update until the utilities converge:
U_{i+1}(s) ← R(s) + γ max_{a ∈ A(s)} Σ_{s'} Pr(s' | s, a) U_i(s')
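As a concrete illustration of step 1, here is a minimal value iteration sketch in Python, assuming the dictionary-based MDP format sketched above; the function name and the convergence test are illustrative choices, not the course's reference implementation.

```python
def value_iteration(mdp, gamma=0.9, eps=1e-6):
    """Repeat the Bellman update U(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    U = {s: 0.0 for s in mdp["states"]}
    while True:
        delta, U_new = 0.0, {}
        for s in mdp["states"]:
            acts = mdp["actions"].get(s, [])
            # expected utility of the best action (0 if the state has no actions)
            best = max((sum(p * U[s2] for s2, p in mdp["P"][(s, a)].items()) for a in acts),
                       default=0.0)
            U_new[s] = mdp["R"][s] + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:          # stop when no utility changed by more than eps
            return U

# e.g. U = value_iteration(mdp, gamma=0.9) with the mdp dictionary sketched above
```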

Q-Values

For step 2, once the optimal utilities U* have been calculated:
π*(s) = argmax_{a ∈ A(s)} [ R(s) + γ Σ_{s'} Pr(s' | s, a) U*(s') ]

Hence we would need to compute the sum over successors for each action a. Idea: store Q-values
Q*(s, a) = R(s) + γ Σ_{s'} Pr(s' | s, a) U*(s')
so that actions are easier to select:
π*(s) = argmax_{a ∈ A(s)} Q*(s, a)
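A small sketch of this policy-extraction step, again assuming the hypothetical mdp dictionary and the utilities returned by value_iteration above:

```python
def q_from_utilities(mdp, U, gamma=0.9):
    """Q*(s, a) = R(s) + gamma * sum_s' P(s'|s,a) U*(s')."""
    return {
        (s, a): mdp["R"][s] + gamma * sum(p * U[s2] for s2, p in mdp["P"][(s, a)].items())
        for s in mdp["states"]
        for a in mdp["actions"].get(s, [])
    }

def greedy_policy(mdp, Q):
    """pi*(s) = argmax_a Q*(s, a), for states that have actions."""
    return {
        s: max(mdp["actions"][s], key=lambda a: Q[(s, a)])
        for s in mdp["states"]
        if mdp["actions"].get(s)
    }
```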

Example

python gridworld.py -a value -i 1 --discount 0.9 --noise 0.2 -r 0 -k 1 -t

Values after 1 iteration (rows y = 2, 1, 0; columns x = 0..3; ##### is the wall, S the start):

y=2:   0.00    0.00    0.00   +1.00
y=1:   0.00   #####    0.00   -1.00
y=0:  S 0.00   0.00    0.00    0.00

The Q-value display shows that only actions pointing into the terminal states have non-zero Q-values (e.g. 0.72 for moving east next to the +1 exit, -0.72 next to the -1 exit).

Example

python gridworld.py -a value -i 2 --discount 0.9 --noise 0.2 -r 0 -k 1 -t

Values after 2 iterations:

y=2:   0.00    0.00    0.72   +1.00
y=1:   0.00   #####    0.00   -1.00
y=0:  S 0.00   0.00    0.00    0.00

The value 0.72 = 0.8 · (0.9 · 1.00) + 0.1 · 0 + 0.1 · 0 has propagated back one step from the +1 exit.

Example

python gridworld.py -a value -i 3 --discount 0.9 --noise 0.2 -r 0 -k 1 -t

Values after 3 iterations:

y=2:   0.00    0.52    0.78   +1.00
y=1:   0.00   #####    0.43   -1.00
y=0:  S 0.00   0.00    0.00    0.00

[Figure: optimal policies for the 4x3 world under different rewards of the non-terminal states: R(s) < -1.6284; -0.4278 < R(s) < -0.0850; -0.0221 < R(s) < 0; R(s) > 0.]

Balancing of risk and reward.

Outline
1. Markov Decision Processes
2. Reinforcement Learning

Terminology and Notation
- s ∈ S                   states
- a ∈ A(s)                actions
- s_0                     start state
- p(s' | s, a)            transition probability; the world is stochastic
- R(s) or R(s, a, s')     reward

In reinforcement learning, we do not know p and R.

Agent type              knows     learns     uses
utility-based agent     p, R      U          U
Q-learning agent                  Q(s, a)    Q
reflex agent                      π(s)       π

Passive RL: the policy is fixed.
Active RL: the policy can be changed.

Passive RL

Perform a set of trials under the fixed policy and build up the utility function table.

[Figure: the 4x3 grid world with the fixed policy and the resulting utilities:]

Row 3:  0.812   0.868   0.918    +1
Row 2:  0.762  #####    0.660    -1
Row 1:  0.705   0.655   0.611   0.388

Passive RL
- Direct utility estimation: a Monte Carlo method that waits until the end of the episode to determine the increment to U(s).
- Temporal difference (TD) learning: waits only until the next time step.

Passive RL: Temporal difference learning

If a nonterminal state s_t is visited at time t, update the estimate of U(s_t) based on what happens after that visit and on the old estimate.

Exponential moving average (running interpolation update):
x̄_n = (1 - α) x̄_{n-1} + α x_n
x̄_n = [x_n + (1 - α) x_{n-1} + (1 - α)² x_{n-2} + ...] / [1 + (1 - α) + (1 - α)² + ...]

This makes recent samples more important and forgets the past (old samples were based on wrong estimates anyway).

α is the learning rate: if it decreases appropriately with the number of visits, the average converges. E.g. α = 1/N_s or α(N_s) = 1000/(1000 + N_s), where N_s is the number of visits to s.

NewEstimate ← (1 - α) · OldEstimate + α · NewSample

U(s) ← (1 - α) U(s) + α [r + γ U(s')]
U(s) ← U(s) + α(N_s) [r + γ U(s') - U(s)]

Initialize U(s) arbitrarily, and π to the policy to be evaluated
repeat                                   /* for each episode */
    Initialize s
    repeat                               /* for each step of the episode */
        a ← action given by π for s
        Take action a, observe reward r and next state s'
        U(s) ← U(s) + α(N_s) [r + γ U(s') - U(s)]
        s ← s'
    until s is terminal
until convergence
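A minimal Python sketch of this passive TD(0) loop, assuming a hypothetical env object with reset() and step(a) methods (returning next state, reward, done flag) and a fixed policy given as a dictionary; these names are illustrative, not part of the lecture code.

```python
from collections import defaultdict

def passive_td(env, policy, gamma=0.9, episodes=500):
    """Evaluate a fixed policy with the TD(0) update U(s) <- U(s) + alpha*(r + gamma*U(s') - U(s))."""
    U = defaultdict(float)       # utility estimates
    N = defaultdict(int)         # visit counts, used for the decaying learning rate
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy[s]                       # fixed policy: no action choice to learn
            s_next, r, done = env.step(a)
            N[s] += 1
            alpha = 1.0 / N[s]                  # e.g. alpha = 1/N_s
            target = r if done else r + gamma * U[s_next]
            U[s] += alpha * (target - U[s])
            s = s_next
    return U
```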

Learning curves

[Figure: utility estimates for states (4,3), (3,3), (1,3), (1,1), (2,1) as a function of the number of trials (0 to 500).]

Active Learning

Greedy agent: after each TD update, recompute a new policy π that is greedy with respect to the current estimates.

Problem: actions not only provide rewards according to the learned model, but also influence the learning itself by affecting the percepts that are received.

[Figure: the policy found by the greedy agent in the 4x3 world, and its RMS error and policy loss over 500 trials.]

Remedy: adjust TD with a scheme that is Greedy in the Limit of Infinite Exploration (GLIE).

Exploration/Exploitation

Simplest approach: random actions (ɛ-greedy):
- at every time step, draw a number uniformly in [0, 1]
- if it is smaller than ɛ, act randomly
- if it is larger than ɛ, act according to the current greedy policy

ɛ can be lowered over time.

Another solution: an exploration function.
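A tiny sketch of ɛ-greedy action selection, assuming Q-values stored in a dictionary indexed by (state, action) pairs as in the earlier sketches:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly, otherwise pick the greedy action argmax_a Q[(s, a)]."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```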

Active RL: Q-learning

We can choose the actions we like, with the goal of learning the optimal policy
π*(s) = argmax_a [ R(s) + γ Σ_{s'} Pr(s' | s, a) U*(s') ]
the same expression as in the value iteration algorithm, but now learned online rather than computed off-line.

Q-values are more useful to learn:
Q*(s, a) = R(s) + γ Σ_{s'} Pr(s' | s, a) U*(s')
π*(s) = argmax_a Q*(s, a)

The Sarsa algorithm learns Q-values in the same way as the TD algorithm learns utilities.

In the value iteration algorithm:
U_{i+1}(s) ← R(s) + γ max_{a ∈ A(s)} Σ_{s'} Pr(s' | s, a) U_i(s')

The same with Q-values:
Q_{i+1}(s, a) ← R(s) + γ Σ_{s'} Pr(s' | s, a) max_{a' ∈ A(s')} Q_i(s', a')

Sample-based Q-learning:
- observe a sample (s, a, s', r)
- consider the old estimate Q(s, a)
- derive the new sample estimate: the expectation over s' is replaced by the observed successor, giving
  sample = R(s) + γ max_{a'} Q(s', a')
- incorporate the new estimate into the running average:
  Q(s, a) ← (1 - α) Q(s, a) + α · sample
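A minimal sketch of one sample-based Q-learning update; the data structures (a defaultdict for Q, a list of actions available in s') are illustrative assumptions:

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(s, a)], defaults to 0.0

def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """One update: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```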

Initialize Q(s, a) arbitrarily
repeat                                   /* for each episode */
    Initialize s
    Choose a from s using a policy derived from Q (e.g., ɛ-greedy)
    repeat                               /* for each step of the episode */
        Take action a, observe reward r and next state s'
        Choose a' from s' using a policy derived from Q (e.g., ɛ-greedy)
        Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
        s ← s'; a ← a'
    until s is terminal
until convergence

Note: the update is not
Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
since with ɛ-greedy we are allowed not to choose the best action (this is the on-policy Sarsa update).
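The same loop as a Python sketch (on-policy, Sarsa-style), reusing the hypothetical env interface and the epsilon_greedy helper sketched above; none of these names come from the lecture materials.

```python
from collections import defaultdict

def sarsa(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """On-policy TD control: update with the action actually chosen in s', not with max_a'."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions(s), epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                                  # terminal: no future value
            else:
                a_next = epsilon_greedy(Q, s_next, actions(s_next), epsilon)
                target = r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if not done:
                s, a = s_next, a_next
    return Q
```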

Example

python gridworld.py -a q -k 10 --discount 0.9 --noise 0.2 -r 0 -e 0.1 -t

(Note: now episodes are used for training; there are no value iteration sweeps, but rather steps that end when a terminal state is reached.)

Properties

Converges if the agent explores enough and α is small enough but does not decrease too quickly.

Learns the optimal policy without necessarily following it (off-policy learning).