Planning and Control: Markov Decision Processes
1 CSE-571 AI-based Mobile Robotics Planning and Control: Markov Decision Processes
2 Planning. The agent must decide: what action next? Key dimensions of the problem: Static vs. Dynamic environment; Predictable vs. Unpredictable dynamics; Fully vs. Partially Observable state; Perfect vs. Noisy percepts; Deterministic vs. Stochastic actions; Discrete vs. Continuous outcomes; Full vs. Partial goal satisfaction.
3 Classical Planning. Environment: static, predictable, fully observable; percepts: perfect; actions: discrete, deterministic. What action next?
4 Stochastic Planning. Environment: static, unpredictable, fully observable; percepts: perfect; actions: discrete, stochastic. What action next?
5 Deterministic, Fully Observable
6 Stochastic, Fully Observable
7 Stochastic, Partially Observable
8 Markov Decision Process (MDP). S: a set of states; A: a set of actions; Pr(s'|s,a): transition model; C(s,a,s'): cost model; G: set of goals; s0: start state; γ: discount factor; R(s,a,s'): reward model.
9 Role of the Discount Factor (γ). It keeps the total reward/total cost finite, which is useful for infinite-horizon problems and sometimes for indefinite-horizon problems (if there are dead ends). Intuition (economics): money today is worth more than money tomorrow. Total reward: r1 + γ r2 + γ² r3 + … ; total cost: c1 + γ c2 + γ² c3 + …
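A one-line derivation (not on the slide) of why γ < 1 keeps the return finite, assuming per-step rewards are bounded by R_max:

```latex
% If |r_t| <= R_max for all t, the discounted return is bounded
% by a geometric series, hence finite for 0 <= gamma < 1.
\[
\left| \sum_{t=1}^{\infty} \gamma^{\,t-1} r_t \right|
\;\le\; \sum_{t=1}^{\infty} \gamma^{\,t-1} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1 .
\]
```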
10 Objective of a Fully Observable MDP. Find a policy π: S → A which optimises (discounted or undiscounted): minimises expected cost to reach a goal, maximises expected reward, or maximises expected (reward − cost); given a finite, infinite, or indefinite horizon; assuming full observability.
11 Examples of MDPs. Goal-directed, indefinite-horizon, cost-minimisation MDP: <S, A, Pr, C, G, s0>. Infinite-horizon, discounted reward-maximisation MDP: <S, A, Pr, R, γ>, with reward = Σ_t γ^t r_t. Goal-directed, finite-horizon, probability-maximisation MDP: <S, A, Pr, G, s0, T>.
12 Bellman Equations for MDP 1. For <S, A, Pr, C, G, s0>, define J*(s) (the optimal cost) as the minimum expected cost to reach a goal from state s. J* should satisfy: J*(s) = 0 for s ∈ G, and otherwise J*(s) = min_{a ∈ Ap(s)} Q*(s,a), where Q*(s,a) = Σ_{s'} Pr(s'|s,a) [ C(s,a,s') + J*(s') ].
13 Bellman Equations for MDP 2. For <S, A, Pr, R, s0, γ>, define V*(s) (the optimal value) as the maximum expected discounted reward from state s. V* should satisfy: V*(s) = max_{a ∈ Ap(s)} Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ].
14 Bellman Backup. Given an estimate of the V* function (say V_n), a backup of V_n at state s calculates a new estimate V_{n+1}(s) = max_a Q_{n+1}(s,a), where Q_{n+1}(s,a) = Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V_n(s') ] is the value/cost of the strategy: execute action a in s, then execute π_n subsequently. The greedy action is π_n(s) = argmax_{a ∈ Ap(s)} Q_n(s,a).
15 Bellman Backup (example). [Figure: state s0 with actions a1, a2, a3 leading to successors s1 (V0 = 20), s2 (V0 = 2), s3 (V0 = 3); the backup computes Q1(s0,a1), Q1(s0,a2), Q1(s0,a3), sets V1(s0) = 25 = max_a Q1(s0,a), and the greedy action is a_greedy = a1.]
16 Value Iteration [Bellman 57]. Assign an arbitrary value V0 to each non-goal state. Repeat: for all states s, compute V_{n+1}(s) by a Bellman backup at s. Until max_s |V_{n+1}(s) − V_n(s)| < ε. The quantity |V_{n+1}(s) − V_n(s)| is the residual at s; stopping when the maximum residual drops below ε is called ε-convergence.
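The loop above as a minimal Python sketch; the two-state MDP at the bottom is made up, and the dict-based encodings of Pr and R are assumptions for this example only:

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """P[s][a] -> list of (s_next, prob); R[(s, a, s_next)] -> reward."""
    V = {s: 0.0 for s in states}              # arbitrary initial estimate V0
    while True:
        V_new = {}
        for s in states:
            # Bellman backup: V_{n+1}(s) = max_a sum_s' Pr(s'|s,a)[R + gamma*V_n(s')]
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[s][a])
                for a in actions)
        residual = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if residual < eps:                    # eps-convergence on the max residual
            return V

# Tiny made-up two-state example.
states, actions = ["s0", "s1"], ["stay", "go"]
P = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.8), ("s0", 0.2)]},
     "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]}}
R = {(s, a, s2): 1.0 if s2 == "s1" else 0.0
     for s in states for a in actions for s2 in states}
print(value_iteration(states, actions, P, R))
```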
17 Complexity of Value Iteration. One iteration takes O(|A| |S|²) time. The number of iterations required is poly(|S|, |A|, 1/(1−γ)). Overall, the algorithm is polynomial in the size of the state space, and thus exponential in the number of state variables.
18 Policy Computation. The optimal policy is stationary and time-independent for infinite/indefinite horizon problems: it acts greedily with respect to Q*. Policy evaluation (computing V^π for a fixed policy π) is a system of linear equations in |S| variables.
19 Markov Decision Process (example). [Figure: a small example MDP over states s1…s5 with transition probabilities such as 0.7 and 0.2 on the edges and state rewards such as r=1, r=0, and r=−10.]
20 Value Function and Policy. [Figure: plots of the value residual and the policy residual over iterations.]
21 Changing the Search Space. Value iteration: search in value space, then compute the resulting policy. Policy iteration [Howard 60]: search in policy space, then compute the resulting value.
22 Policy Iteration [Howard 60]. Assign an arbitrary policy π0 to each state. Repeat: compute V_{n+1}, the evaluation of π_n; then for all states s compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s,a). Until π_{n+1} = π_n. Advantage: it searches a finite (policy) space as opposed to an uncountably infinite (value) space, so convergence is faster. Modified policy iteration: exact policy evaluation is costly, O(n³), so approximate it by value iteration with the policy held fixed; all other properties follow!
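A compact policy-iteration sketch reusing the same illustrative MDP encoding as the value-iteration example above; policy evaluation solves the |S| linear equations (I − γ P_π) V = R_π exactly with NumPy, and the modified variant would replace np.linalg.solve with a few fixed-policy value-iteration sweeps:

```python
import numpy as np

def policy_iteration(states, actions, P, R, gamma=0.9):
    idx = {s: i for i, s in enumerate(states)}
    pi = {s: actions[0] for s in states}           # arbitrary initial policy pi_0
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        P_pi = np.zeros((len(states), len(states)))
        R_pi = np.zeros(len(states))
        for s in states:
            for s2, p in P[s][pi[s]]:
                P_pi[idx[s], idx[s2]] += p
                R_pi[idx[s]] += p * R[(s, pi[s], s2)]
        V = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, R_pi)
        # Policy improvement: greedy action w.r.t. the evaluated value function.
        pi_new = {
            s: max(actions, key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[idx[s2]]) for s2, p in P[s][a]))
            for s in states}
        if pi_new == pi:                           # converged: pi_{n+1} = pi_n
            return pi, V
        pi = pi_new
```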
23 LP Formulation. Minimise Σ_{s ∈ S} V*(s) under the constraints: for every s and a, V*(s) ≥ R(s) + γ Σ_{s' ∈ S} Pr(s'|s,a) V*(s'). This is a big LP, so other tricks are used to solve it!
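A hedged sketch of this LP with scipy.optimize.linprog; rearranging each constraint into (γ P_a − I) V ≤ −R puts it in linprog's A_ub x ≤ b_ub form. The tiny two-state MDP here is a made-up illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up MDP: P[a][s][s'] transition matrices and per-state reward R[s].
P = np.array([[[1.0, 0.0], [0.0, 1.0]],        # action 0
              [[0.2, 0.8], [1.0, 0.0]]])       # action 1
R = np.array([0.0, 1.0])
gamma, n = 0.9, 2

# minimise sum_s V(s)  s.t.  V(s) >= R(s) + gamma * sum_s' P[a,s,s'] V(s'),
# i.e. (gamma * P[a] - I) V <= -R, one block of constraints per action.
A_ub = np.vstack([gamma * P[a] - np.eye(n) for a in range(P.shape[0])])
b_ub = np.tile(-R, P.shape[0])
res = linprog(c=np.ones(n), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n)
print(res.x)    # the optimal value function V*
```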
24 Hybrid MDPs. Hybrid Markov decision process: Markov state = (n, x), where n is the discrete component (a set of fluents) and x is the continuous component. Bellman's equation: V^{t+1}_n(x) = max_{a ∈ A_n(x)} Σ_{n' ∈ N} Pr(n'|n,x,a) ∫_{x' ∈ X} Pr(x'|n,x,a,n') [ R_{n'}(x') + V^t_{n'}(x') ] dx'.
26 Convolutions: discrete-discrete; constant-discrete [Feng et al. 04]; constant-constant [Li & Littman 05].
27 Result of Convolutions. Convolving a probability density function (rows) with a value function (columns) yields:

    pdf \ value fn    discrete    constant    linear
    discrete          discrete    constant    linear
    constant          constant    linear      quadratic
    linear            linear      quadratic   cubic
28 Value Iteration for Motion Planning (assumes knowledge of the robot's location).
29 Frontier-based Exploration Every unknown location is a target point.
30 Manipulator Control Arm with two joints Configuration space
31 Manipulator Control Path State space Configuration space
33 Collision Avoidance via Planning. Potential field methods have local minima. Instead, perform efficient path planning in the local perceptual space, where path costs depend on length and closeness to obstacles. [Konolige, Gradient method]
34 Paths and Costs. A path is a list of points P = {p1, p2, …, pk}, where pk is the only point in the goal set. The cost of a path is separable into an intrinsic cost at each point plus an adjacency cost of moving from one point to the next: F(P) = Σ_i I(p_i) + Σ_i A(p_i, p_{i+1}). The adjacency cost is typically Euclidean distance; the intrinsic cost is typically occupancy or distance to the nearest obstacle.
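The cost decomposition F(P) = Σ_i I(p_i) + Σ_i A(p_i, p_{i+1}) in Python; the particular intrinsic cost (inverse clearance from a single made-up obstacle) and adjacency cost (Euclidean distance) are illustrative choices:

```python
import math

def path_cost(path, intrinsic, adjacency):
    """F(P) = sum_i I(p_i) + sum_i A(p_i, p_{i+1})."""
    return (sum(intrinsic(p) for p in path) +
            sum(adjacency(p, q) for p, q in zip(path, path[1:])))

# Illustrative costs: penalise closeness to one obstacle, step Euclideanly.
obstacle = (2.0, 2.0)
def intrinsic(p):
    d = math.dist(p, obstacle)
    return 1.0 / d if d > 0 else float("inf")
adjacency = math.dist

print(path_cost([(0, 0), (1, 0), (1, 1), (2, 1)], intrinsic, adjacency))
```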
35 Navigation Function. A navigation function assigns a potential-field value to every element in configuration space [Latombe 91]; the goal set is always downhill, so there are no local minima. The navigation function of a point is the cost of the minimal-cost path that starts at that point: N_k = min_{P_k} F(P_k), where P_k ranges over paths starting at point k.
36 Computation of the Navigation Function. Initialization: points in the goal set get cost 0, all other points get infinite cost, and the active list is the goal set. Repeat: take a point from the active list and update its neighbors; if a neighbor's cost changes, add that point to the active list. Until the active list is empty.
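A sketch of this wavefront computation on a 4-connected grid, using a priority queue so updates propagate in best-first order (a Dijkstra-style variant of the active-list loop above); the grid and cost values are illustrative:

```python
import heapq

def navigation_function(grid, goals):
    """grid[r][c] is the intrinsic cost of a cell (None = obstacle);
    goals is a set of (r, c) cells. Returns minimal path costs per cell."""
    N = {g: 0.0 for g in goals}
    active = [(0.0, g) for g in goals]      # active list, kept as a min-heap
    heapq.heapify(active)
    while active:
        cost, (r, c) = heapq.heappop(active)
        if cost > N.get((r, c), float("inf")):
            continue                         # stale entry, already improved
        for r2, c2 in ((r+1, c), (r-1, c), (r, c+1), (r, c-1)):
            if 0 <= r2 < len(grid) and 0 <= c2 < len(grid[0]) \
                    and grid[r2][c2] is not None:
                new_cost = cost + 1.0 + grid[r2][c2]   # adjacency + intrinsic
                if new_cost < N.get((r2, c2), float("inf")):
                    N[(r2, c2)] = new_cost             # cost changed:
                    heapq.heappush(active, (new_cost, (r2, c2)))  # re-activate
    return N

grid = [[0, 0, 0], [0, None, 0], [0, 0, 0]]   # None marks an obstacle
print(navigation_function(grid, {(0, 0)}))
```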
37 Challenges. Where do we get the state space from? Where do we get the model from? What happens when the world is slightly different? Where does the reward come from? Continuous state variables. Continuous action spaces.
38 How to solve larger problems? If the problem is deterministic, use Dijkstra's algorithm. If there are no back-edges, use backward Bellman updates, and prioritize Bellman updates to maximize information flow. If the initial state is known, use dynamic programming plus heuristic search (LAO*, RTDP and variants). Divide an MDP into sub-MDPs and solve the hierarchy. Aggregate states with similar values. Relational MDPs.
39 Approximations: n-step lookahead. n = 1 is greedy: π1(s) = argmax_a R(s,a). General n-step lookahead: πn(s) = argmax_a Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V_{n−1}(s') ], where V_{n−1} is the (n−1)-step lookahead value.
40 Approximation: Incremental approaches. Loop: build a deterministic relaxation → run a deterministic planner → obtain a plan → evaluate it by stochastic simulation → identify weaknesses → solve/merge, and repeat.
41 Approximations: Planning and Replanning. Build a deterministic relaxation → run a deterministic planner → obtain a plan → execute the action → send the state reached back to the planner, and replan.
42 CSE-571 AI-based Mobile Robotics. Planning and Control: (1) Reinforcement Learning, (2) Partially Observable Markov Decision Processes.
43 Reinforcement Learning. We still have an MDP and are still looking for a policy. The new twist: we don't know Pr and/or R, i.e. we don't know which states are good or what the actions do. We must actually try out actions and states to learn.
44 Model-based methods. Visit different states, perform different actions, and estimate Pr and R. Once the model is built, do planning using value iteration or other methods. Cons: requires huge amounts of data.
45 Model-free methods: TD learning. Directly learn the Q*(s,a) values: sample = R(s,a,s') + γ max_{a'} Q_n(s',a'), then nudge the old estimate towards the new sample: Q_{n+1}(s,a) ← (1−α) Q_n(s,a) + α · sample.
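The TD update above as a tabular Q-learning sketch; the environment interface (env.reset() and env.step(a) returning (next_state, reward, done)) is an assumed gym-style placeholder, not something defined on the slides:

```python
from collections import defaultdict
import random

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                  # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration (see the next slide)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a: Q[(s, a)])
            s2, r, done = env.step(a)       # assumed environment interface
            bootstrap = 0.0 if done else gamma * max(Q[(s2, a2)] for a2 in actions)
            sample = r + bootstrap
            # nudge: Q_{n+1}(s,a) <- (1-alpha) Q_n(s,a) + alpha * sample
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s2
    return Q
```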
46 Properties. This converges to the optimal Q* if you explore enough and if you make the learning rate (α) small enough, but do not decrease it too quickly.
47 Exploration vs. Exploitation: ε-greedy. Each time step, flip a coin: with probability ε, act randomly; with probability 1−ε, take the current greedy action. Lower ε over time to increase exploitation as more learning has happened.
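A small ε-greedy selector with a decaying ε, sketched under the assumption that q is a dict mapping (state, action) to the current estimate; the exponential decay schedule is an arbitrary illustrative choice:

```python
import random

def make_epsilon_greedy(actions, q, eps0=1.0, decay=0.995, eps_min=0.05):
    state = {"eps": eps0}
    def select(s):
        # With prob eps act randomly; with prob 1-eps take the greedy action.
        if random.random() < state["eps"]:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a: q[(s, a)])
        state["eps"] = max(eps_min, state["eps"] * decay)  # lower eps over time
        return a
    return select
```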
48 Q-learning Problems. There are too many states to visit during learning, and Q(s,a) is a BIG table; we want to generalize from a small set of training examples. Solutions: value function approximators, policy approximators, hierarchical reinforcement learning.
49 Task Hierarchy: MAXQ Decomposition [Dietterich 00]. [Task graph: Root has children Fetch and Deliver (children of a task are unordered); Fetch decomposes into Take and Navigate(loc); Deliver into Navigate(loc) and Give; Take into Extend-arm and Grab; Give into Release and Extend-arm; Navigate(loc) into Move s, Move w, Move e, Move n.]
50 MAXQ Decomposition. Augment the state s by adding the subtask i: [s,i]. Define C([s,i],j) as the reward received in i after j finishes. Then Q([s,Fetch],Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch],Navigate(prr)): the first term is the reward received while navigating, the second the reward received after navigation. Express V in terms of C, and learn C instead of learning Q.
51 MAXQ Decomposition (contd). State abstraction: finding irrelevant actions; finding funnel actions.
52 POMDPs: Recall example
53 Partially Observable Markov Decision Processes
54 POMDPs. In POMDPs we apply the very same idea as in MDPs. Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states. Let b be the belief of the agent about the current state. POMDPs compute a value function over belief space.
55 Problems. Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution. This is problematic, since probability distributions are continuous. Additionally, we have to deal with the huge complexity of belief spaces. For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.
56 An Illustrative Example. [Figure: a two-state POMDP with states x1 and x2, terminal actions u1 and u2 with associated payoffs, a sensing action u3 that can cause a state transition, and measurements z1 and z2.]
57 The Parameters of the Example. The actions u1 and u2 are terminal actions. The action u3 is a sensing action that potentially leads to a state transition. The horizon is finite and γ = 1.
58 Payoff in POMDPs. In MDPs, the payoff (or return) depended on the state of the system. In POMDPs, however, the true state is not exactly known. Therefore, we compute the expected payoff by integrating over all states: r(b,u) = E_x[r(x,u)], i.e. for our two-state example r(b,u) = p1 r(x1,u) + p2 r(x2,u).
59 Payoffs in Our Example (1). If we are totally certain that we are in state x1 and execute action u1, we receive a reward of −100. If, on the other hand, we definitely know that we are in x2 and execute u1, the reward is +100. In between, it is the linear combination of the extreme values weighted by the probabilities: r(b,u1) = −100 p1 + 100 (1 − p1).
60 Payoffs in Our Example (2). [Figure: the expected payoffs r(b,u) plotted as linear functions of p1.]
61 The Resulting Policy for T=1. Given a finite POMDP with T=1, we would use V1(b) to determine the optimal policy. In our example, the optimal policy for T=1 picks, at each belief, the action whose expected payoff is maximal; this is the upper thick graph in the diagram.
62 Piecewise Linearity, Convexity. The resulting value function V1(b) is the maximum of the three payoff functions at each point; it is therefore piecewise linear and convex.
63 Pruning. If we carefully consider V1(b), we see that only the first two components contribute. The third component can therefore safely be pruned away from V1(b).
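For this two-state example every value component is a line over p1 = b(x1), so pruning can be sketched as keeping only lines that are maximal somewhere on [0, 1]; this grid test is a simple stand-in for exact LP-based pruning, and the u2 and u3 payoff numbers are assumptions consistent with the −100/+100 figures above:

```python
import numpy as np

def prune(lines, n_grid=1001):
    """lines: list of (slope, intercept) pairs, each a linear value component
    over p1 in [0, 1]. Keep only components that are maximal somewhere."""
    p = np.linspace(0.0, 1.0, n_grid)
    vals = np.array([[m * pi + c for pi in p] for m, c in lines])
    winners = set(vals.argmax(axis=0))       # which line is on top where
    return [lines[i] for i in sorted(winners)]

# Payoff lines r(b,u) = p1*r(x1,u) + (1-p1)*r(x2,u), in slope/intercept form.
lines = [(-200.0, 100.0),   # u1: -100*p1 + 100*(1-p1)
         (150.0, -50.0),    # u2 (assumed payoffs): 100*p1 - 50*(1-p1)
         (0.0, -1.0)]       # u3 (assumed payoff): constant -1, never maximal
print(prune(lines))         # the u3 component is pruned away
```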
64 Increasing the Time Horizon. Assume the robot can make an observation before deciding on an action. [Figure: V1(b).]
65 Increasing the Time Horizon. Assume the robot can make an observation before deciding on an action. Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3. Given the observation z1, we update the belief using Bayes rule: p'1 = p(z1|x1) p1 / p(z1) = 0.7 p1 / p(z1) and p'2 = 0.3 (1 − p1) / p(z1), where p(z1) = 0.7 p1 + 0.3 (1 − p1) = 0.4 p1 + 0.3.
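The Bayes update for the two-state example as a tiny function; the 0.7/0.3 likelihoods are the ones given on the slide:

```python
def belief_update(p1, likelihood_x1=0.7, likelihood_x2=0.3):
    """Posterior belief p'_1 = p(x1 | z1) after observing z1."""
    p_z = likelihood_x1 * p1 + likelihood_x2 * (1.0 - p1)  # p(z1) = 0.4*p1 + 0.3
    return likelihood_x1 * p1 / p_z

print(belief_update(0.5))   # 0.7*0.5 / (0.4*0.5 + 0.3) = 0.7
```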
66 Value Function. [Figure: V1(b) and the updated value V1(b|z1), plotted over the belief b and the updated belief b|z1.]
67 Increasing the Time Horizon. Assume the robot can make an observation before deciding on an action. Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3. Given the observation z1, we update the belief using Bayes rule. Thus V1(b|z1) is given by evaluating V1 at the updated belief b|z1.
68 Expected Value after Measuring. Since we do not know in advance what the next measurement will be, we have to compute the expected belief: V̄1(b) = E_z[V1(b|z)] = Σ_i p(z_i) V1(b|z_i) = Σ_i p(z_i) V1( p(z_i|x1) p1 / p(z_i) ).
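A sketch of this expectation, combining the belief update above with V1 represented as a max of linear components (the same assumed payoff lines as in the pruning sketch):

```python
def v1(p1, lines=((-200.0, 100.0), (150.0, -50.0))):
    """V1(b) as the max of the linear payoff components (assumed payoffs)."""
    return max(m * p1 + c for m, c in lines)

def expected_value_after_measuring(p1):
    # z1: p(z1|x1)=0.7, p(z1|x2)=0.3; z2 has the complementary likelihoods.
    total = 0.0
    for lx1, lx2 in ((0.7, 0.3), (0.3, 0.7)):
        p_z = lx1 * p1 + lx2 * (1.0 - p1)        # p(z_i)
        total += p_z * v1(lx1 * p1 / p_z)        # p(z_i) * V1(b|z_i)
    return total

print(expected_value_after_measuring(0.6))
```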
70 Resulting Value Function. The four possible combinations yield the following function, which can then be simplified and pruned.
71 Value Function. [Figure: p(z1) V1(b|z1) and p(z2) V1(b|z2) plotted over the belief b, together with their sum V̄1(b).]
72 State Transitions (Prediction). When the agent selects u3, its state potentially changes. When computing the value function, we have to take these potential state changes into account.
73 Resulting Value Function after Executing u3. Taking the state transitions into account, we finally obtain the value function V̄1(b|u3) shown on the next slide.
74 Value Function after Executing u3. [Figure: V̄1(b) and V̄1(b|u3).]
75 Value Function for T=2. Taking into account that the agent can either directly perform u1 or u2, or first u3 and then u1 or u2, we obtain V2(b) (after pruning).
76 Graphical Representation of V2(b). [Figure: a region where u1 is optimal, a region where u2 is optimal, and an unclear middle region where the outcome of measuring is important.]
77 Deep Horizons and Pruning. We have now completed a full backup in belief space. This process can be applied recursively. The value functions for T=10 and T=20 are shown on the next slide.
78 Deep Horizons and Pruning. [Figure: pruned value functions for T=10 and T=20.]
80 Why Pruning is Essential. Each update introduces additional linear components to V, and each measurement squares the number of linear components. Thus, an unpruned value function for T=20 includes more than 10^547,864 linear functions, and at T=30 we have 10^561,012,337 linear functions. The pruned value function at T=20, in comparison, contains only 12 linear components. This combinatorial explosion of linear components in the value function is the major reason why POMDPs are impractical for most applications.
81 POMDP Summary. POMDPs compute the optimal action in partially observable, stochastic domains. For finite-horizon problems, the resulting value functions are piecewise linear and convex. In each iteration the number of linear constraints grows exponentially. So far, POMDPs have only been applied successfully to very small state spaces with small numbers of possible observations and actions.
82 POMDP Approximations. Point-based value iteration; QMDPs; AMDPs.
83 Point-based Value Iteration. Maintains a set of example beliefs. Only considers constraints that maximize the value function for at least one of the examples.
84 Point-based Value Iteration. [Figure: value functions for T=30, comparing the exact value function with PBVI.]
85 Example Application. [Figure.]
86 Example Application. [Figure.]
87 QMDPs. QMDPs only consider state uncertainty in the first step; after that, the world is assumed to become fully observable.
88 QMDP value of executing action u in state x_i: Q(x_i, u) = r(x_i, u) + Σ_{j=1}^{N} V(x_j) p(x_j | u, x_i). The action is then selected via arg max_u Σ_{i=1}^{N} p_i Q(x_i, u).
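A sketch of QMDP action selection, assuming the underlying MDP value function V has already been computed (e.g. with value iteration); the array layout P[a, i, j] = p(x_j | u_a, x_i) is an assumption of this sketch:

```python
import numpy as np

def qmdp_action(belief, P, R, V):
    """belief: p_i over states; P[a, i, j] = p(x_j | u_a, x_i);
    R[i, a] = r(x_i, u_a); V[j] = MDP value of state x_j."""
    # Q[i, a] = r(x_i, u_a) + sum_j p(x_j | u_a, x_i) V(x_j)
    Q = R + np.einsum("aij,j->ia", P, V)
    # argmax_u sum_i p_i Q(x_i, u): belief-weighted Q value per action.
    return int(np.argmax(belief @ Q))
```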
89 Augmented MDPs. Augmentation adds an uncertainty component to the state space: b is mapped to the pair ( argmax_x b(x), H_b(x) ), where H_b(x) = −∫ b(x) log b(x) dx is the entropy of the belief. Planning is performed by an MDP in this augmented state space. The transition, observation and payoff models have to be learned.
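The augmentation as a small helper: collapse a discrete belief to its most likely state plus its entropy, using a discrete sum in place of the integral above:

```python
import math

def augment(belief):
    """Map a discrete belief {state: prob} to the augmented MDP state
    (most likely state, belief entropy H_b)."""
    ml_state = max(belief, key=belief.get)                 # argmax_x b(x)
    entropy = -sum(p * math.log(p) for p in belief.values() if p > 0)
    return ml_state, entropy

print(augment({"x1": 0.7, "x2": 0.3}))   # ('x1', 0.6108...)
```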