SIMULATION BASED REINFORCEMENT LEARNING FOR PATH TRACKING ROBOT
Tony 1, M. Rahmat Widyanto 2

1 Department of Computer System, Faculty of Information Technology, Tarumanagara University, Jl. Letjen S. Parman No. 1, Jakarta
2 Faculty of Computer Science, University of Indonesia, Kampus Baru UI, Depok

tony.b@fti.utara.org 1, widyanto@cs.ui.ac.id 2

ABSTRACT

In this paper, we present a simulation-based reinforcement learning approach for a path tracking robot using the Q-Learning algorithm. Q-Learning is a reinforcement learning technique used mainly in robotics. The technique estimates the value of each state-action pair and chooses the action with the maximum reward value, so that the shortest path is taken. A simulation based on Q-Learning has been conducted in an environment consisting of eight rooms, one of which is the goal. Our experiment shows that the algorithm chooses the actions with maximum reward and finds the shortest path to the goal.

Keywords: path tracking, Q-Learning, reinforcement learning, shortest path

1 INTRODUCTION

Reinforcement learning is a sub-area of machine learning concerned with learning from rewards and punishments. An agent takes actions in an environment so as to maximize the reward value. A reinforcement learning algorithm finds a policy that maps states of the environment to the actions the agent should take in those states. The basic reinforcement learning model commonly consists of environment states, actions, and rewards. Figure 1 shows a typical reinforcement learning system. The agent receives descriptions of the environment, called states, and chooses actions to perform. The effect of an action on the environment is evaluated and fed back to the agent in the form of positive or negative rewards. The mission of the agent is to find the action rules that achieve maximum reward through its interaction with the environment.
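The interaction loop of Figure 1 can be sketched in a few lines of code (a toy illustration only: the two-state environment and the `Environment`/`Agent` names are our own placeholders, not part of the paper):

```python
import random

class Environment:
    """Toy environment: from state 0, action 1 reaches the goal state 1."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Reward 100 for the action that reaches the goal, 0 otherwise.
        if self.state == 0 and action == 1:
            self.state = 1
            return self.state, 100, True      # next state, reward, episode done
        return self.state, 0, False

class Agent:
    """Agent with a trivial random policy over two possible actions."""
    def act(self, state):
        return random.choice([0, 1])

random.seed(0)
env, agent = Environment(), Agent()
state, total_reward, done = env.state, 0, False
while not done:
    action = agent.act(state)                 # agent chooses an action
    state, reward, done = env.step(action)    # environment returns state + reward
    total_reward += reward                    # agent accumulates the reward signal
print(total_reward)                           # -> 100
```

A learning agent would use the accumulated reward to improve its action choices; Q-Learning, introduced next, is one way to do so.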
One of the most important breakthroughs in reinforcement learning was the development of Q-Learning by Watkins in 1989 [1]. In this paper, we present a simulation-based reinforcement learning approach for a path tracking robot using this well-known Q-Learning algorithm. The rest of the paper is organized as follows: section 2 introduces the Q-Learning algorithm, the environment model, and the experiment; section 3 discusses the simulation results; section 4 concludes the paper.

Figure 1. A typical structure of reinforcement learning

[The 5th International Conference on Information & Communication Technology and Systems, paper C44]

2 ALGORITHM, ENVIRONMENT MODEL, AND EXPERIMENT

This section describes the Q-Learning algorithm, its environment model, and the experiment.

2.1 Q-Learning Algorithm

The task of reinforcement learning is generally stated as follows [2]. For each transition of the system from one state to another, a value called the reward is assigned. The system receives the reward after the transition is carried out. The purpose is to find a control policy that maximizes the expected discounted sum of rewards, known as the return. The value function is a prediction of the return from any state x_t:

V(x_t) = E[ Σ_{k=0}^{∞} γ^k · r_{t+k} ]

where r_t is the reward received in the transition from state x_t to x_{t+1}, and γ is the discount factor (0 ≤ γ ≤ 1). V(x_t) is the discounted sum of the rewards from time t onward. This sum depends on the sequence of actions chosen, which is determined by the control policy. The system has to find a control policy that maximizes V(x_t) for each state.

The Q-Learning algorithm does not work with the value function directly. It employs the Q-function, whose arguments are not only a state but also an action. This makes it possible to construct the Q-function by an iterative method and thus find an optimal control policy. The Q-function is expressed as:

Q(x_t, a_t) = r_t + γ · V(x_{t+1})

where a_t is the action chosen at time t out of the set of all possible actions. Because the purpose of the system is to maximize the total reward, V(x_{t+1}) is replaced by max_{a_{t+1}} Q(x_{t+1}, a_{t+1}), and the following function is obtained:

Q(x_t, a_t) = r_t + γ · max_{a_{t+1}} Q(x_{t+1}, a_{t+1})

Q values are stored in a matrix whose indices are a state and an action. In systems that employ Q-Learning, the expression above is usually combined with the method of temporal differences, TD(λ), proposed by Sutton [3]. If the temporal-difference parameter λ is equal to zero, the method is called single-step Q-Learning, because only the current and the next prediction of the Q-value participate in the update. The update rule for single-step Q-Learning is:

Q(x_t, a_t) ← Q(x_t, a_t) + α · ( r_t + γ · max_{a_{t+1}} Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t) )

Single-step Q-Learning is a reinforcement learning technique that learns an action-value function giving the expected utility of taking a given action in a given state and following a fixed policy thereafter. It lets an agent acquire optimal control strategies from delayed rewards, even when it has no prior knowledge of the effects of its actions on the environment [4]. The single-step Q-Learning algorithm is described as follows [5]:

Given: a state diagram with a goal state (R matrix)
Find: the shortest path from any initial state to the goal state (Q matrix)

1. Set the parameter γ and the reward (R) matrix.
2. Initialize the Q matrix as a zero matrix.
3. For each episode:
   Select a random initial state.
   Do while the goal state is not reached:
   a. Select one among all possible actions for the current state.
   b. Using this possible action, consider going to the next state.
   c. Get the maximum Q value of this next state over all possible actions.
   d. Compute: Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]
   e. Set the next state as the current state.
   End do.
End for.

2.2 Environment Model

The environment model consists of eight rooms connected by doors, as shown in Figure 2. The rooms are labeled A to H. Room D is the target room, or goal. Notice that only one door leads to the goal room: the door from room C.

Figure 2. Environment model

The environment model in Figure 2 can be represented by a graph, with each room as a vertex and each door as an edge. The graph is shown in Figure 3.

Figure 3. Graph of the environment

The goal room is node D. Each door, or edge of the graph, has a reward value. The edge that leads immediately to the goal has an instant reward of 100 (see Figure 4); edges that have no direct connection to the target room have zero reward. Because each door is two-way (from A the agent can go to E, and from E it can go back to A), each edge of the previous graph is assigned two arrows, and each arrow carries a reward value. The graph thus becomes the state diagram shown in Figure 4. An additional loop with the highest reward (100) is given to the goal room (D back to D), so that once the agent arrives at the goal it will remain there forever. This is called an absorbing goal: when the agent reaches the goal state, it stays there.

Figure 4. State diagram

The virtual robot acts as an agent that can move from one room to another without knowledge of the environment. It does not know which sequence of doors to pass through to reach the target room; it learns through experience using Q-Learning. Suppose the agent is in room A and must learn to reach room D, the goal (see Figure 2). Each room in the environment is called a state, and the agent's movement from one room to another is called an action. A state is represented by a node in the state diagram, and an action by an arrow (see Figure 4).

Suppose the agent is in state A. From state A, the agent cannot go directly to state B, because no door or arrow connects A and B; it can only go to state E, which is connected to A. From state E, the agent can only go to state F. From state F, the agent can go to state B, to state G, or back to state E (look at the arrows out of state F). From state B, the agent can go to state C or back to state F. From state C, the agent can go to state D, to state G, or back to state B. From state G, the agent can go to state C, to state H, or back to state F. From state H, the agent can only go to state G. If the agent is in state D, it remains there, because state D is the target room.

The state diagram and reward values can be written as the following reward table, or R matrix. A negative value in the matrix means that the row state has no action leading to the column state; for example, state A cannot go to states B, C, D, F, G, or H, because there is no connecting door. Zero means that the action from one state to another does not reach the goal. The reward of 100 is given to actions leading from a state to the goal state.
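The reward table described above can be built directly from the door list of Figure 2. A short Python sketch (the -1/0/100 encoding follows the description above; the room indexing is our own):

```python
# Rooms A..H indexed 0..7; the door list follows Figure 2, the goal is room D.
A, B, C, D, E, F, G, H = range(8)
doors = [(A, E), (B, C), (B, F), (C, D), (C, G), (E, F), (F, G), (G, H)]
GOAL = D

# -1 = no door, 0 = ordinary transition, 100 = transition into the goal.
R = [[-1] * 8 for _ in range(8)]
for i, j in doors:                 # every door is two-way, so set both directions
    R[i][j] = 100 if j == GOAL else 0
    R[j][i] = 100 if i == GOAL else 0
R[GOAL][GOAL] = 100                # absorbing loop D -> D with the highest reward

print(R[C][D], R[D][C], R[A][B])   # -> 100 0 -1
```

Row C of this matrix contains the single 100-valued entry into the goal, matching the single door into room D.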
2.3 Experiment

In our experiment, we use JavaScript, PHP (PHP: Hypertext Preprocessor), and HTML (HyperText Markup Language) as programming tools. To support these, we need XAMPP (cross-platform, Apache, MySQL, PHP, and Perl) as the web server, plus an internet browser that supports JavaScript (such as Internet Explorer or Mozilla Firefox) to run the simulation. After XAMPP is installed, the folder containing the simulation program must be copied to the folder C:\ProgramFiles\XAMPP\htdocs. To run the program, open the internet browser and type the program address on the address bar.

The implementation of the Q-Learning algorithm is described as follows.

First, set the learning parameter γ = 0.8 and the reward (R) matrix. The reward value of each action is: R[A,E] = 0; R[B,C] = 0; R[B,F] = 0; R[C,B] = 0; R[C,D] = 100; R[C,G] = 0; R[D,C] = 0; R[D,D] = 100; R[E,A] = 0; R[E,F] = 0; R[F,B] = 0; R[F,E] = 0; R[F,G] = 0; R[G,C] = 0; R[G,F] = 0; R[G,H] = 0; R[H,G] = 0. The resulting R matrix (rows: current state A-H; columns: action to go to state A-H) is shown in Figure 5.

Figure 5. R matrix

Next, initialize the Q matrix as a zero matrix: the initial value of every state-action pair is zero. The Q matrix (rows: state A-H; columns: action A-H) is shown in Figure 6.

Figure 6. Q matrix

For each episode, set the initial state (s_t) by random selection and, using one possible action, go to the next state. Look at the fourth column of the R matrix (actions going to state D): there is only one state from which the agent can go to state D, namely state C; we call this action Q(C,D). Get the maximum Q value of this next state over all possible actions. Because state D is the goal state, this is max[Q(next state, all actions)] = max[Q(D,D)]. Then compute the Q value:

Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)]

or Q(s_t, a_t) = R(s_t, a_t) + γ · max[Q(s_{t+1}, a_{t+1})], giving

Q(C,D) = R(C,D) + γ · max[Q(D,D)] = 100 + 0.8 · 0 = 100

Because state D is the goal state, this finishes one episode. The updated Q matrix is shown in Figure 7.

Figure 7. Updated Q matrix

Set s_t as s_{t+1}: choose an initial state that can go to state C. There are two such states, B and G, so the actions are Q(B,C) and Q(G,C). Compute the Q values again using the updated Q matrix (that is, repeat steps c, d, and e of the Q-Learning algorithm above). The agent gathers more and more experience through many episodes until the Q matrix converges. Once convergence is reached, the agent can choose the shortest path to the goal state.

3 RESULT

The simulation of the path tracking is shown in Figure 8. The Q matrix in the simulation program is the convergence value; Figure 9 shows the convergent Q matrix, which can also be represented as a graph (see Figure 10). The reward (R) matrix in the simulation program is the same as the R matrix in Figure 5. In the simulation program, we can select any state as the initial state; the goal state is state D. If we start from state D, the agent does not need to find a shortest path to the goal state: it simply remains there, because state D is the goal state.
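Repeating this update over many episodes drives the Q matrix to convergence. A minimal Python sketch of the whole procedure (the episode count, random seed, and ascending-index tie-breaking are our own illustrative choices, not taken from the paper's JavaScript/PHP program):

```python
import random

# Rooms A..H indexed 0..7; doors and rewards follow Figures 2-5, goal is D.
A, B, C, D, E, F, G, H = range(8)
doors = [(A, E), (B, C), (B, F), (C, D), (C, G), (E, F), (F, G), (G, H)]
GOAL, GAMMA = D, 0.8

R = [[-1] * 8 for _ in range(8)]          # -1 = no door between the two rooms
for i, j in doors:
    R[i][j] = 100 if j == GOAL else 0
    R[j][i] = 100 if i == GOAL else 0
R[GOAL][GOAL] = 100                       # absorbing loop D -> D

Q = [[0.0] * 8 for _ in range(8)]
random.seed(1)
for _ in range(1000):                     # episodes from random initial states
    s = random.randrange(8)
    while s != GOAL:                      # an episode ends at the absorbing goal
        a = random.choice([j for j in range(8) if R[s][j] >= 0])
        Q[s][a] = R[s][a] + GAMMA * max(Q[a])   # Q(s,a) = R(s,a) + γ·max Q(s',·)
        s = a

# Following the largest Q value from any start state traces a shortest path.
path, s = ['A'], A
while s != GOAL:
    s = max((j for j in range(8) if R[s][j] >= 0), key=lambda j: Q[s][j])
    path.append('ABCDEFGH'[s])
print('-'.join(path))                     # greedy path from room A to room D
```

With these settings the converged values match those reported in section 3: Q(A,E) ≈ 41, Q(E,F) ≈ 51, Q(F,B) = Q(F,G) = 64, Q(B,C) = Q(G,C) = 80, Q(C,D) = 100, and the greedy walk from A yields A-E-F-B-C-D (the tie at F means A-E-F-G-C-D is equally short).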
As a first example, we choose room A as the initial state. Figure 10 can be described as follows: starting from room A, the agent chooses room E with reward value 41. From room E, it chooses room F with reward value 51. From room F, it chooses room B or room G with reward value 64. From room B or room G, it chooses room C with reward value 80. From room C, it goes to room D (the goal state) with reward value 100.

Figure 8. Screen shot of the simulation program: (a) agent and environment model, (b) convergence Q matrix, (c) R matrix

Figure 9. The convergence Q matrix

Figure 10. The graph of the convergence Q matrix with room A as the initial state

Based on the reward values, the shortest paths from the initial state (room A) to the goal state (room D) are A-E-F-B-C-D and A-E-F-G-C-D. There are two options for the shortest path because the reward values from F to B and from F to G are the same.

As discussed above, we can select any room as the initial state. For example, we choose room H as the initial state. The result is shown in Figure 11.

Figure 11. The graph of the convergence Q matrix with room H as the initial state

Figure 11 can be explained as follows: starting from room H, the agent chooses room G with reward value 64. From room G, it chooses room C with reward value 80. From room C, it goes to room D (the goal state) with reward value 100. Based on the reward values, the shortest path from the initial state (room H) to the goal state (room D) is H-G-C-D.

4 CONCLUSION AND DISCUSSION

As a reinforcement learning technique, the Q-Learning algorithm developed by Watkins in 1989 can be used in many robotics applications, especially path tracking. Our experiment shows that, using the simulation program, the reinforcement learning technique with the Q-Learning algorithm can find the shortest path for a path tracking robot. The simulation shows that there were two shortest paths from initial state A to goal state D: A-E-F-B-C-D and A-E-F-G-C-D. Based on the reward values, we found the same value for some actions. This is a weakness of the Q-Learning algorithm: it considers only one step ahead and updates the Q matrix for only one action at a time. In the future, we will consider a more complex algorithm such as the Q(λ) algorithm, an online multi-step learning algorithm also developed by Watkins [6], to perform faster and more accurate actions in finding a shortest path.

REFERENCES

[1] C. J. C. H. Watkins (1989). Learning From Delayed Rewards. PhD Dissertation, Cambridge University.
[2] Valery Kuzmin (2002). Connectionist Q-Learning in Robot Control Task. Scientific Proceedings of Riga Technical University.
[3] R. S. Sutton (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning 3, Kluwer Academic Publishers, Boston.
[4] C. Clausen and H. Wechsler (2000). Quad Q-Learning. IEEE Transactions on Neural Networks 11.
[5] Kardi T. (2005). Q-Learning by Examples [Online]. Available at: forcementlearning/index.html [Accessed: March 3rd, 2009].
[6] Hyun-Chang Y., et al. (2007). Hexagon-Based Q-Learning Algorithm and Applications. International Journal of Control, Automation, and Systems 5.
Figure 12 (a) Start from A and go to E
Figure 12 (b) In E and go to F
Figure 12 (c) In F and go to B
Figure 12 (d) In B and go to C
Figure 12 (e) In C and go to D
Figure 12 (f) Reach the goal state D
Figure 13 (a) Start from H and go to G
Figure 13 (b) In G and go to C
Figure 13 (c) In C and go to D
Figure 13 (d) Reach the goal state D
More informationMULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION
MULTI ORIENTATION PERFORMANCE OF FEATURE EXTRACTION FOR HUMAN HEAD RECOGNITION Panca Mudjirahardjo, Rahmadwati, Nanang Sulistiyanto and R. Arief Setyawan Department of Electrical Engineering, Faculty of
More informationTechniques. IDSIA, Istituto Dalle Molle di Studi sull'intelligenza Articiale. Phone: Fax:
Incorporating Learning in Motion Planning Techniques Luca Maria Gambardella and Marc Haex IDSIA, Istituto Dalle Molle di Studi sull'intelligenza Articiale Corso Elvezia 36 - CH - 6900 Lugano Phone: +41
More informationA Symmetric Multiprocessor Architecture for Multi-Agent Temporal Difference Learning
A Symmetric Multiprocessor Architecture for Multi-Agent Temporal Difference Learning Scott Fields, Student Member, IEEE, Itamar Elhanany, Senior Member, IEEE Department of Electrical & Computer Engineering
More informationResidual Advantage Learning Applied to a Differential Game
Presented at the International Conference on Neural Networks (ICNN 96), Washington DC, 2-6 June 1996. Residual Advantage Learning Applied to a Differential Game Mance E. Harmon Wright Laboratory WL/AAAT
More informationA fast point-based algorithm for POMDPs
A fast point-based algorithm for POMDPs Nikos lassis Matthijs T. J. Spaan Informatics Institute, Faculty of Science, University of Amsterdam Kruislaan 43, 198 SJ Amsterdam, The Netherlands {vlassis,mtjspaan}@science.uva.nl
More informationApproximating a Policy Can be Easier Than Approximating a Value Function
Computer Science Technical Report Approximating a Policy Can be Easier Than Approximating a Value Function Charles W. Anderson www.cs.colo.edu/ anderson February, 2 Technical Report CS-- Computer Science
More informationIn Homework 1, you determined the inverse dynamics model of the spinbot robot to be
Robot Learning Winter Semester 22/3, Homework 2 Prof. Dr. J. Peters, M.Eng. O. Kroemer, M. Sc. H. van Hoof Due date: Wed 6 Jan. 23 Note: Please fill in the solution on this sheet but add sheets for the
More informationFaculty Guide to Blackboard
Faculty Guide to Blackboard August 2012 1 Table of Contents Description of Blackboard... 3 Uses of Blackboard... 3 Hardware Configurations and Web Browsers... 3 Logging Into Blackboard... 3 Customizing
More informationMarkov Decision Processes. (Slides from Mausam)
Markov Decision Processes (Slides from Mausam) Machine Learning Operations Research Graph Theory Control Theory Markov Decision Process Economics Robotics Artificial Intelligence Neuroscience /Psychology
More informationRough Sets-based Prototype Optimization in Kanerva-based Function Approximation
215 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Rough Sets-based Prototype Optimization in Kanerva-based Function Approximation Cheng Wu School of Urban Rail
More informationAIR FORCE INSTITUTE OF TECHNOLOGY
Scaling Ant Colony Optimization with Hierarchical Reinforcement Learning Partitioning THESIS Erik Dries, Captain, USAF AFIT/GCS/ENG/07-16 DEPARTMENT OF THE AIR FORCE AIR UNIVERSITY AIR FORCE INSTITUTE
More informationSlides credited from Dr. David Silver & Hung-Yi Lee
Slides credited from Dr. David Silver & Hung-Yi Lee Review Reinforcement Learning 2 Reinforcement Learning RL is a general purpose framework for decision making RL is for an agent with the capacity to
More informationTowards Traffic Anomaly Detection via Reinforcement Learning and Data Flow
Towards Traffic Anomaly Detection via Reinforcement Learning and Data Flow Arturo Servin Computer Science, University of York aservin@cs.york.ac.uk Abstract. Protection of computer networks against security
More informationClustering with Reinforcement Learning
Clustering with Reinforcement Learning Wesam Barbakh and Colin Fyfe, The University of Paisley, Scotland. email:wesam.barbakh,colin.fyfe@paisley.ac.uk Abstract We show how a previously derived method of
More informationCall Admission Control for Multimedia Cellular Networks Using Neuro-dynamic Programming
Call Admission Control for Multimedia Cellular Networks Using Neuro-dynamic Programming Sidi-Mohammed Senouci, André-Luc Beylot 2, and Guy Pujolle Laboratoire LIP6 Université de Paris VI 8, rue du Capitaine
More informationLocally Weighted Learning for Control. Alexander Skoglund Machine Learning Course AASS, June 2005
Locally Weighted Learning for Control Alexander Skoglund Machine Learning Course AASS, June 2005 Outline Locally Weighted Learning, Christopher G. Atkeson et. al. in Artificial Intelligence Review, 11:11-73,1997
More informationGeneralized Inverse Reinforcement Learning
Generalized Inverse Reinforcement Learning James MacGlashan Cogitai, Inc. james@cogitai.com Michael L. Littman mlittman@cs.brown.edu Nakul Gopalan ngopalan@cs.brown.edu Amy Greenwald amy@cs.brown.edu Abstract
More informationNovel Function Approximation Techniques for. Large-scale Reinforcement Learning
Novel Function Approximation Techniques for Large-scale Reinforcement Learning A Dissertation by Cheng Wu to the Graduate School of Engineering in Partial Fulfillment of the Requirements for the Degree
More informationUsing Continuous Action Spaces to Solve Discrete Problems
Using Continuous Action Spaces to Solve Discrete Problems Hado van Hasselt Marco A. Wiering Abstract Real-world control problems are often modeled as Markov Decision Processes (MDPs) with discrete action
More informationResearch Project. Reinforcement Learning Policy Gradient Algorithms for Robot Learning
University of Girona Department of Electronics, Informatics and Automation Research Project Reinforcement Learning Policy Gradient Algorithms for Robot Learning Work presented by Andres El-Fakdi Sencianes
More informationCMPSCI 250: Introduction to Computation. Lecture #22: Graphs, Paths, and Trees David Mix Barrington 12 March 2014
CMPSCI 250: Introduction to Computation Lecture #22: Graphs, Paths, and Trees David Mix Barrington 12 March 2014 Graphs, Paths, and Trees Graph Definitions Paths and the Path Predicate Cycles, Directed
More informationA Framework for A Graph- and Queuing System-Based Pedestrian Simulation
A Framework for A Graph- and Queuing System-Based Pedestrian Simulation Srihari Narasimhan IPVS Universität Stuttgart Stuttgart, Germany Hans-Joachim Bungartz Institut für Informatik Technische Universität
More informationDeep Q-Learning to play Snake
Deep Q-Learning to play Snake Daniele Grattarola August 1, 2016 Abstract This article describes the application of deep learning and Q-learning to play the famous 90s videogame Snake. I applied deep convolutional
More informationSmart Search: A Firefox Add-On to Compute a Web Traffic Ranking. A Writing Project. Presented to. The Faculty of the Department of Computer Science
Smart Search: A Firefox Add-On to Compute a Web Traffic Ranking A Writing Project Presented to The Faculty of the Department of Computer Science San José State University In Partial Fulfillment of the
More informationCompetition Between Reinforcement Learning Methods in a Predator-Prey Grid World
Competition Between Reinforcement Learning Methods in a Predator-Prey Grid World Jacob Schrum (schrum2@cs.utexas.edu) Department of Computer Sciences University of Texas at Austin Austin, TX 78712 USA
More informationPROBLEM SOLVING WITH
PROBLEM SOLVING WITH REINFORCEMENT LEARNING Gavin Adrian Rummery A Cambridge University Engineering Department Trumpington Street Cambridge CB2 1PZ England This dissertation is submitted for consideration
More informationA New Technique for Ranking Web Pages and Adwords
A New Technique for Ranking Web Pages and Adwords K. P. Shyam Sharath Jagannathan Maheswari Rajavel, Ph.D ABSTRACT Web mining is an active research area which mainly deals with the application on data
More informationA Connectionist Learning Control Architecture for Navigation
A Connectionist Learning Control Architecture for Navigation Jonathan R. Bachrach Department of Computer and Information Science University of Massachusetts Amherst, MA 01003 Abstract A novel learning
More informationHuman-level Control Through Deep Reinforcement Learning (Deep Q Network) Peidong Wang 11/13/2015
Human-level Control Through Deep Reinforcement Learning (Deep Q Network) Peidong Wang 11/13/2015 Content Demo Framework Remarks Experiment Discussion Content Demo Framework Remarks Experiment Discussion
More informationLecture 4: Linear Programming
COMP36111: Advanced Algorithms I Lecture 4: Linear Programming Ian Pratt-Hartmann Room KB2.38: email: ipratt@cs.man.ac.uk 2017 18 Outline The Linear Programming Problem Geometrical analysis The Simplex
More informationCellular Learning Automata-Based Color Image Segmentation using Adaptive Chains
Cellular Learning Automata-Based Color Image Segmentation using Adaptive Chains Ahmad Ali Abin, Mehran Fotouhi, Shohreh Kasaei, Senior Member, IEEE Sharif University of Technology, Tehran, Iran abin@ce.sharif.edu,
More informationLearning to bounce a ball with a robotic arm
Eric Wolter TU Darmstadt Thorsten Baark TU Darmstadt Abstract Bouncing a ball is a fun and challenging task for humans. It requires fine and complex motor controls and thus is an interesting problem for
More informationStandalone Mobile Application for Shipping Services Based on Geographic Information System and A-Star Algorithm
Journal of Physics: Conference Series PAPER OPEN ACCESS Standalone Mobile Application for Shipping Services Based on Geographic Information System and A-Star Algorithm To cite this article: D Gunawan et
More informationA STRUCTURAL OPTIMIZATION METHODOLOGY USING THE INDEPENDENCE AXIOM
Proceedings of ICAD Cambridge, MA June -3, ICAD A STRUCTURAL OPTIMIZATION METHODOLOGY USING THE INDEPENDENCE AXIOM Kwang Won Lee leekw3@yahoo.com Research Center Daewoo Motor Company 99 Cheongchon-Dong
More informationLearning. Learning agents Inductive learning. Neural Networks. Different Learning Scenarios Evaluation
Learning Learning agents Inductive learning Different Learning Scenarios Evaluation Slides based on Slides by Russell/Norvig, Ronald Williams, and Torsten Reil Material from Russell & Norvig, chapters
More informationProject Title REPRESENTATION OF ELECTRICAL NETWORK USING GOOGLE MAP API. Submitted by: Submitted to: SEMANTA RAJ NEUPANE, Research Assistant,
- 1 - Project Title REPRESENTATION OF ELECTRICAL NETWORK USING GOOGLE MAP API Submitted by: SEMANTA RAJ NEUPANE, Research Assistant, Department of Electrical Energy Engineering, Tampere University of Technology
More informationReinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019
Reinforcement Learning and Optimal Control ASU, CSE 691, Winter 2019 Dimitri P. Bertsekas dimitrib@mit.edu Lecture 1 Bertsekas Reinforcement Learning 1 / 21 Outline 1 Introduction, History, General Concepts
More information