Representations and Control in Atari Games Using Reinforcement Learning


Representations and Control in Atari Games Using Reinforcement Learning
by Yitao Liang, Class of 2016
A thesis submitted in partial fulfillment of the requirements for honors in Computer Science
May 29, 2016
Department of Mathematics & Computer Science, Franklin & Marshall College

Abstract

The Arcade Learning Environment (ALE) is a challenging framework composed of dozens of qualitatively different Atari 2600 games, and thus an ideal testbed for general AI competency. As in many other complex reinforcement learning domains, finding a good representation for predicting expected cumulative rewards has proven to be the key to success in the ALE. Many recent approaches rely on non-linear function approximation and neural networks, which incur a high computational cost. This thesis presents a simple, computationally practical, linear feature representation, Blob-PROST (blob pairwise offsets in space and time), whose performance is competitive with the current state-of-the-art results generated by Deep Q-Networks (DQN). In addition, we provide a simple and reproducible benchmark for comparison with future work in the ALE. The thesis also addresses two major drawbacks inherent in linear function approximation: 1) finding the right set of features is itself challenging, and 2) usually a small subset of the features captures most of the representational power while the rest contribute little. To address these issues, a new framework called A-BPROS (adaptive blob pairwise offsets in space), inspired by and built upon Blob-PROST, is developed as a method for feature expansion in the ALE. Initial results suggest that, in terms of representational power, more work is needed to make A-BPROS as competitive as Blob-PROST, while in terms of memory savings A-BPROS is promising. Finally, directions for future work are discussed.

Acknowledgements

First of all, I want to thank my supervisor, Dr. Erik Talvitie; working with him has been a wonderful and very rewarding experience. I am very grateful to have had the freedom to work on problems that I found interesting, along with his support and guidance whenever I needed them. The same thanks go to all of my defense board members, Dr. Jing Hu, Dr. Danel Draguljic, and Instructor Anthony Weaver; your suggestions and comments about my honors thesis and my defense were all of great value. I also thank all the friends I have made at Franklin & Marshall College; thank you for always being encouraging and supportive. Finally, I would like to express my deep gratitude to my parents. Their love and support keep me chasing my dream and being who I truly am. This research was supported by Alberta Innovates Technology Futures and the Alberta Innovates Centre for Machine Learning. Computing resources were provided by Compute Canada through Calcul Québec.

Contents

1 Introduction
2 Reinforcement Learning Problem
   2.1 The Problem Setting
   2.2 Markov Properties of the Environment
   2.3 Value Functions
   2.4 Features & Linear Function Approximation
   2.5 SARSA
3 Arcade Learning Environment
   3.1 ALE
   3.2 Preprocessing
   3.3 Basic Features
   3.4 BASS
4 Spatial Invariance
   4.1 Empirical Evaluation
5 Short Order Markov Features
   5.1 Empirical Evaluation
6 Object Detection
   6.1 Empirical Evaluation
7 Comparison with DQN
   7.1 Deep Q-Networks (DQN)
   7.2 DQN Evaluation Methodology
   7.3 Comparison with DQN in Performance
   7.4 Comparison with DQN in Computational Cost
8 A New Benchmark
9 Conclusion and Discussions about Blob-PROST
   9.1 Limits of Blob-PROST
10 A-BPROS Approach
   A-BPROS Algorithm Details
      Candidate Feature Generation
      Refinement of the Resolution for Anchors
      Addition of New Offsets
      Refinement of the Resolution for Offsets
      Calculating Relevances
      Candidate Feature Promotion
   Other Implementation Details
   Experiment Results and Discussion
      Direct Evaluation after Expanding
      Learning after Resetting
      Representation of an Observed Policy
   Conclusion and Future Work for A-BPROS
Appendix

1 Introduction

Reinforcement learning (RL) is concerned with creating an agent that is able to learn efficiently and effectively from its own experience. In an RL problem, an agent interacts with an unknown environment and learns from that experience to achieve some goal. Several notable real-world problems, such as autonomous driving, drone control, or playing board games, can all be framed this way. Many RL problems such as those mentioned above use the real world as the source of interactions. As such, they are not suitable for our research: first, the computational resources needed to solve those problems are unaffordable; second, successful strategies for those problems usually involve preset rules and extracting useful information from possibly inaccurate sensors. Those strategies are worth investigating, but they also drive the research focus away from designing agents that are capable of learning in a domain-independent manner. Instead, the Arcade Learning Environment (ALE) is a good fit for researching domain-independent AI. ALE consists of roughly 50 qualitatively different Atari 2600 video games [1]. Because a successful strategy in one game may not work in another, it is extremely hard for an agent whose strategy relies heavily on game-specific information to succeed; in other words, ALE is a particularly suitable platform for testing domain-independent learning. Since all interactions happen on computer screens, extracting information is convenient and no sensors are needed; this ease of control and interaction allows us to focus fully on RL research. Furthermore, ALE's games were designed by humans and intended to be played by humans, so the complexity of the situations an agent encounters in ALE is comparable to that of many real-world problems, which makes the platform particularly interesting and challenging. Lastly, the success of an agent in ALE is easily measured, as most video games have a numerical point system that makes the goal of the game straightforward.

In order to make rational decisions in ALE, the agent usually implements a mapping from situations of the environment (or state-action pairs) to numerical values. This mapping is called a value function, and the numerical values estimate how good it is to take a certain action in a given state. From the value function an agent derives its policy, which is a function that maps situations to the probability of choosing each action. The action with the highest value is called the best action, and an agent using a purely greedy policy would always follow the best action in every situation. Under an ɛ-greedy policy, with probability 1 − ɛ the agent picks the best action and with probability ɛ it picks an action at random. In this kind of approach, a typical way to achieve generalization over different situations of the environment is to design features to represent the environment. Instead of mapping environment situations to values, the value function then maps combinations of features to values. Since the quality of the feature set plays a vital role in determining whether an RL agent succeeds, much RL research has focused on designing features. However, many successful feature designs involve engineering that captures problem-specific information, which diminishes the agent's autonomy and reduces its flexibility. One important contribution of this thesis is to present a computationally practical, fixed, generic feature representation called Blob-PROST that is able to yield human-level performance in ALE. Moreover, the feature set introduced here may shed some light on the minimum information required to obtain such a level of performance in ALE and provide some insight into what constitutes a good representation in a visual domain, such as the encoding of pairwise distances between objects and the use of temporally extended representations. Blob-PROST mainly captures pairwise relationships between objects on the screen. It is a logical speculation that if the feature set could be extended to capture relationships among more than two objects, the agent's performance could be further improved (which is confirmed by our preliminary experiments). However, adding features that focus on such relationships is impractical, as the time complexity would grow exponentially.

A natural follow-up research direction is therefore to design a method that expands the feature set as the agent learns, so that different games may end up with different feature sets. The second main contribution of this thesis is to introduce such a method, called A-BPROS. Unlike many previous works [2], A-BPROS does not generate new features by combining existing features; instead, it starts with a small initial feature set and generates candidate features based on it, with the aim of iteratively improving the feature set's representational power. A-BPROS is expected to pick up only important features and thus should achieve the same representational power as Blob-PROST using far less memory. Note that one of A-BPROS's main jobs can be viewed as seeking the minimal subset of features within Blob-PROST that yields the same level of performance as Blob-PROST.

2 Reinforcement Learning Problem

In this chapter, we introduce reinforcement learning and its common approaches. The notation and definitions largely follow Sutton and Barto [3].

2.1 The Problem Setting

The reinforcement learning problem is concerned with learning from the experience of interactions to achieve a certain goal. The learner or action taker is called the agent and everything else is called the environment. An environment usually consists of many, if not countless, situations, which we intuitively call the environment's states. The agent interacts with the environment continuously: whenever the agent takes an action (being idle is also considered an action), the environment responds and evolves to a new state. In common approaches, the times at which the agent is allowed to act are discretized, making the problem more practical. Usually, the environment also presents the agent with rewards, whose cumulative value is what the agent tries to maximize.

Figure 1: Reinforcement Learning Problem

Based on the description above, a basic reinforcement learning problem can be modeled as an entity consisting of the following four important components:

a set of environmental states S;

a set of actions A;
a set of transition rules between states P;
a reward function R, which determines the immediate reward.

The transition rules between states are usually stochastic, and the agent is assumed to observe the current environmental state and the reward associated with its last action. We call a sequence of the agent's interactions with the environment finite when the agent eventually ends up at some natural endpoint. One finite sequence of agent-environment interactions is called an episode, and the natural endpoint is called the terminal state or end state. A policy defines which action to take in a given state, and the ultimate goal of a successful agent is to obtain a policy that, when followed, is expected to yield the maximum overall reward.

Figure 2: A Tiny Example of a Reinforcement Learning Problem

Example 1. Consider the tiny reinforcement learning problem shown in Figure 2. The agent can end up in any of the 11 green boxes, and being in each box is a different situation for the agent; in other words, in this example we can consider these 11 boxes to be the environmental states. Some boxes in the figure have numbers while others do not; we can treat the empty boxes as having a number of 0. The reward function can be summarized as follows: when the agent enters a state, it receives a reward equal to the number associated with that box. The agent has four actions: Up,

Down, Left, and Right. The only knowledge the agent possesses before it starts is the set of actions; by observing the environment, it can determine which state it is in. The blue arrows represent the agent's current policy. By following this policy, the agent starts from the bottom-left state and ends up in the upper-right state, which is the terminal or end state in this example.

Since in a typical reinforcement learning problem the agent usually possesses minimal prior knowledge about the environment, it is crucial to develop an exploration mechanism for the agent to succeed. In many RL problems, exploration can be as simple as a random walk. However, randomly selecting actions without reference to the knowledge the agent has obtained from its previous interactions would intuitively result in poor performance. After the agent gathers "enough" information to make informal guesses and estimates about the environment, it needs a mechanism to find and follow the currently best policy, which is known as exploitation. A purely greedy policy can be considered an example of exploitation: it always chooses the action that is expected to lead to the most overall reward in any given state. Keeping a good balance between exploration and exploitation has proven to be key in determining whether an agent is capable of making a breakthrough in many RL games [3, 4]. The most commonly used mechanism to keep such a balance is the ɛ-greedy policy: with probability 1 − ɛ the agent follows the purely greedy policy, and with probability ɛ it picks a random action.
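As a concrete illustration, the following minimal sketch shows one common way to implement ɛ-greedy action selection over estimated action values; the function and variable names are illustrative only and are not part of any implementation described later.

    import random

    def epsilon_greedy(q_values, epsilon):
        """Pick an action index from a list of estimated action values.

        With probability epsilon choose uniformly at random (exploration);
        otherwise choose an action with the highest estimated value
        (exploitation), breaking ties at random.
        """
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        best = max(q_values)
        best_actions = [a for a, q in enumerate(q_values) if q == best]
        return random.choice(best_actions)

    # Example: three actions with values 0.1, 0.7, 0.7 and epsilon = 0.01
    # returns action 1 or 2 about 99% of the time.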

2.2 Markov Properties of the Environment

In our setting, the agent makes its decisions based on its observations of the environmental states it encounters. Ideally, the observations should compactly contain all the relevant information about the states, in which case the environment is said to be Markov. Formally speaking, being Markov refers to the memoryless nature of the environment's state. When a problem satisfies the Markov property, the term "state" may be used to refer to both the state and the observation. Consider how a general environment might respond to an agent's action at a specific time: in general, the response might depend on the full history of the environmental states and the agent's actions. If the state is Markov, the dynamics simplify to depend only on the environment's current state and the agent's current action, regardless of any previous history.

By assuming the problem satisfies the Markov property, the environment can be formulated as a Markov Decision Process (MDP) [5] M, defined as a 5-tuple M = (S, A, R, P, γ). At each time t ∈ {0, 1, 2, ...}, the agent is in a state s_t ∈ S, where it selects an action a_t ∈ A. After executing the action a_t, it receives an immediate reward r_{t+1} defined by the reward function R(s_t, a_t, s_{t+1}) and transitions to a next state s_{t+1} ∈ S drawn from the transition distribution P(s_{t+1} | s_t, a_t). The goal of the agent is to find a policy π : S → A, a set of rules determining which action to choose in each environmental state, that maximizes the expected cumulative reward.

2.3 Value Functions

Many reinforcement learning methods are based on estimating value functions, functions of states (or state-action pairs) that measure how good it is to be in a given state, where the notion of "good" is defined in terms of the expected sum of rewards. Whether a state is good or not clearly depends on the agent's decisions about which actions to choose, or in other words on the agent's chosen policy. Accordingly, value functions are defined with respect to particular policies. Informally, suppose the agent is in state s_t and, by following its chosen policy π, ends up in some terminal state s_{t_f}, receiving an immediate reward after each action it takes. These rewards form a sequence whose sum R = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_{t_f} can be treated as the value of the state s_t. However, this reward sequence has two flaws. First, in some RL problems there is no terminal state, in which case the sum of the reward sequence can easily grow to infinity. Second, it should not treat short-term and long-term rewards equally. As the saying "a dollar today is worth more than a dollar tomorrow" suggests, people intuitively prefer short-term rewards when their numerical values are similar to those of long-term ones.

(a) Following the chosen policy π. (b) Deviating from the chosen policy π at the start state.
Figure 3: Policy Iteration

In order to fix these two issues and build a mechanism that balances short-term and long-term rewards, an additional concept needs to be introduced: discounting. A future reward is multiplied by a discount factor that shrinks with the length of the time interval between the current time step and the time step at which the reward is expected to be collected. With this new concept, the reward sequence is transformed into

\[ R = \gamma^{0} r_{t+1} + \gamma^{1} r_{t+2} + \gamma^{2} r_{t+3} + \cdots, \]

where γ is the discount rate, with 0 < γ ≤ 1. When a discount rate is introduced, this new sum R is considered the value of the state s_t.

Example 2. Let us consider our tiny reinforcement learning problem again, now shown in Figure 3. Note that Figure 3a is exactly the same as Figure 2. Assume the agent is currently at the start state and, for simplicity, that the discount rate is 1. Then, by following the policy specified by the blue arrows, the sequence of rewards the agent collects (which includes a −1 along the way) sums to 9.

Formally, the value of a state s with respect to a certain policy π is defined as the sum of discounted rewards expected by following the policy π starting from the state s:

\[ V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right] \]

where V^π(s) denotes the state-value function with respect to policy π, and E_π[ · | s_t = s ] denotes the expected sum of rewards given that the agent follows policy π starting from the state s. Note that the value of a terminal state, if any, is always set to 0. Similarly, we define the state-action value function under policy π, denoted Q^π(s, a), as the cumulative discounted reward the agent is expected to receive by selecting action a in state s and following the policy π afterwards:

\[ Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ R(s_t, a_t, s_{t+1}) + \gamma V^{\pi}(s_{t+1}) \,\middle|\, s_t = s,\; a_t = a \right] \]

Note that in this state-action value function, when the action a_t does not equal the action specified by the chosen policy π, the agent actually deviates from π at the state s_t. However, regardless of what a_t is, the agent always follows the chosen policy π from the state s_{t+1} onwards. Specifically, we can break the Q equation into two parts to reflect this idea: the term R(s_t, a_t, s_{t+1}) denotes the immediate reward received by deviating from the chosen policy π at the state s_t and choosing the action a_t instead, and the term γV^π(s_{t+1}) denotes the discounted reward expected by following the chosen policy π starting from the state s_{t+1}.

Example 3. According to the definition of the state-value function, the discounted sum we calculated in the last example is also the value of the start state with respect to our current policy π; in mathematical notation, V^π(start state) = 9. As shown in Figure 3b, the agent decides to deviate from the current policy π at the start state. The immediate reward it receives by doing so is 0, and the discounted value of following the policy π afterwards is 10. According to our definition, with respect to the chosen policy π, the value of the state-action pair (start state, Up) is 10, which in mathematical notation is expressed as Q^π(start state, Up) = 10.

If Q^π(s_t, a_t) turns out to be larger than V^π(s_t), we can improve our chosen policy π by replacing the action it specifies at s_t with the action a_t. We call this process policy iteration. An optimal policy π* maximizes the expected return in every state s ∈ S, thus satisfying the

Bellman equation:

\[ \pi^{*}(s) = \operatorname*{arg\,max}_{a} Q^{\pi^{*}}(s, a) \]

Example 4. Let us consider the tiny reinforcement learning problem one last time. It turns out that Q^π(start state, Up) is larger than V^π(start state), which means the agent can improve its policy simply by changing the action specified at the start state to Up. This improvement is one step of policy iteration.

2.4 Features & Linear Function Approximation

In problems with a large or infinite number of states, it may be infeasible to calculate a value for each state-action pair. Instead, we extract important information that is representative of the environmental states and use it to build a function that approximates state-action values. We call these pieces of important information features. Of special interest are binary features, features whose values are set to 1 when active and 0 when inactive. A good feature set usually provides meaningful generalization about the environment, so that an agent's learning experience in one state can be drawn upon when the agent steps into a similar yet different state.

Figure 4: Atari 2600 Game Pong

Example 5. Consider the Atari 2600 game Pong as our example. In this game, the brown and green paddles can only move Up and Down, and the agent controls the green paddle, though it may not be aware of that. When the brown paddle fails to bounce back the ball, the agent wins a

point. When the green paddle fails to catch the ball, the agent loses a point. Any small change to the absolute position of the brown paddle, the green paddle, or the ball leads to a new environmental state, and the number of combinations of the three positions is at least in the thousands, if not effectively countless. In this game, using features to generalize different environmental states into similar situations can greatly reduce the difficulty of deriving a value function. Consider a very small feature set consisting of only three binary features: the ball is in the left half of the screen, the ball is in the right half, and the green paddle is in the right half. Since these are binary features, each can take only two values, so this feature set reduces all possible environmental states to only 2^3 = 8 possible situations. Note that this particular feature set may not be effective in this game.

Essentially, a function approximation represents Q^π in a lower-dimensional space: F(φ_{s,a}; θ) ≈ Q^π(s, a), where θ is a parameter vector, φ_{s,a} denotes a static feature representation of the state s when selecting the action a, and F(φ_{s,a}; θ) is a function of φ_{s,a} parameterized by θ. This means the approximate Q^π(s, a) values depend entirely on the static feature set and the parameter associated with each feature. Note that although the feature set is static, different environmental states usually activate different subsets of features. Further note that even though learning happens when the parameter values are updated, the real key in determining an agent's ultimate learning capacity lies in the quality of the static feature set. More specifically, given enough learning samples, the parameter vector will eventually be tuned to be optimal with respect to the given feature set, and during this process the error between the approximation and the true value hopefully decreases. However, whether this "optimal" approximation is close enough to the true value still depends entirely on the quality of the provided feature set. Imagine a feature set that provides no useful generalization about the environment: a feature may be active in one state and inactive in another, almost identical state. This conflicting information may pull the update for the parameter associated with that feature in two opposite directions, which in turn prevents the parameter from being optimized and makes any decrease in approximation error impossible.

One of the most important and widely used forms of function approximation is that in which the approximation is a linear function of the parameter vector θ. Correspondingly, there is a feature vector φ_{s,a} with the same number of components as θ. Formally, the approximate linear state-action value function is defined as

\[ Q^{\pi}(s, a) \approx \sum_{i=1}^{n} \theta(i)\, \phi_{s,a}(i) \]

Linear function approximation provides one important advantage: the gradient of the approximate value function with respect to θ is simply

\[ \nabla_{\theta} Q^{\pi}(s, a) = \phi_{s,a}. \]

Gradient descent is used to optimize the parameter vector θ, and this computational convenience reduces the updates of θ to a very simple form. Because of its simplicity, this linear method is one of the most favorable ways to set up a reinforcement learning architecture. If we use binary features in a linear function approximation, we gain two additional advantages. First, the approximate values can be computed more efficiently, as the whole computation reduces to looking up and summing the parameters of the active features. Second, the parameter vector provides very intuitive information about the importance of each feature.

Example 6. Let us consider the game Pong again. The linear function approximation using the simple feature set introduced in the last example would be Q(s, a) ≈ F(φ_{s,a}; θ) = w_1 · (the ball is in the left half) + w_2 · (the ball is in the right half) + w_3 · (the green paddle is in the right half). When the ball is in the left half, the brown paddle is likely to miss it, which means the agent may win a point, so w_1 should be positive. Similarly, when the ball is in the right half, the agent may lose a point, so w_2 should be negative. Regardless of the ball's position, the green paddle is always in the right half, which means w_3 should be close to 0.
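To make the linear form concrete, the following minimal sketch evaluates a linear Q-function with sparse binary features, where a state-action pair is represented simply by the indices of its active features. The names and the two-action setup are illustrative assumptions, not part of our implementation.

    import numpy as np

    NUM_FEATURES = 8  # e.g. the tiny 3-feature Pong set would use 3

    # One parameter vector per action (separate vectors per action are
    # used in Section 2.5 as well).
    theta = {action: np.zeros(NUM_FEATURES) for action in ("Up", "Down")}

    def q_value(active_features, action):
        """With binary features, Q(s, a) is just the sum of the parameters
        associated with the features that are active in state s."""
        return theta[action][active_features].sum()

    # Example: features 0 and 2 are active in the current state.
    print(q_value([0, 2], "Up"))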

2.5 SARSA

One of the most popular online reinforcement learning algorithms is SARSA(λ) [3], which stands for State-Action-Reward-State-Action. The basis of SARSA is policy iteration, which was covered in the value functions subsection: recall that as states are visited and the rewards from those states are obtained, the state-action value function is updated and consequently the policy is improved. Note that the update to a particular state-action value estimate is partly based on the estimates of other state-action values. When using function approximation, it is common to further denote Q values as Q(s, a; θ), where θ is the parameter vector associated with the chosen feature set. In order to better approximate Q values for different actions, it is common to use the same feature set but keep a separate parameter vector for each action. In particular, the SARSA(λ) update equations when using function approximation are [3]:

\[ \delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}; \theta_t) - Q(s_t, a_t; \theta_t) \]
\[ e_t = \gamma \lambda\, e_{t-1} + \nabla_{\theta} Q(s_t, a_t; \theta_t) \]
\[ \theta_{t+1} = \theta_t + \alpha\, \delta_t\, e_t, \]

where α denotes the step size (learning rate), which determines how fast newly acquired information overrides old information: a value of 0 means no new information is learned, while a value of 1 makes the agent consider only the most recent information. δ_t is the temporal-difference error and γ is the discount rate. The λ in the parentheses following SARSA refers to the use of an eligibility trace, denoted e_t in the update equations. An eligibility trace is a temporal record of the occurrence of events; in this thesis, its values are positively related to the occurrence of feature-action pairs. Every time a feature-action pair is active, its eligibility increases, and the longer the feature-action pair stays inactive, the more its eligibility decays. The decay rate is specified by λ in the eligibility trace update equation. When an update is made, only those

feature-action pairs which have non-zero eligibilities are assigned credit for the update, and thus only their associated parameters are changed. One can view the eligibility trace as a mechanism for properly distributing the error between estimates and true reward values; the traces thus serve as bridges connecting the features to the training information. In this thesis, a particular type of eligibility trace, the replacing eligibility trace, is used. Specifically, with a replacing eligibility trace, whenever a state-action pair becomes active its eligibility is reset to 1, and while the pair is inactive its eligibility decays at the rate γλ. Thus, the eligibility trace update can be simplified to the following:

e(s, a) ← 1 (for the current state s and current action a)
e(s, a) ← γλ e(s, a) (for all other states and actions)

All the update equations mentioned above are applied after each nonterminal state s_t. If the episode ends at state s_{t+1}, then Q(s_{t+1}, a_{t+1}) is defined to be zero for every possible action. With all the necessary elements defined, we can present the pseudo-code (see Algorithm 1), shown here in its tabular form.

Algorithm 1: SARSA(λ) with replacing traces
    Initialize Q(s, a) arbitrarily (usually 0) and e(s, a) = 0 for all s, a
    for every episode do
        Initialize s
        Choose a from s using the policy derived from Q (e.g. ɛ-greedy)
        while s is not terminal do
            Take action a, observe r, s'
            Choose a' from s' using the policy derived from Q (e.g. ɛ-greedy)
            δ ← r + γ Q(s', a') − Q(s, a)
            e(s, a) ← 1
            for all s, a do
                Q(s, a) ← Q(s, a) + α δ e(s, a)
                e(s, a) ← γλ e(s, a)
            s ← s'; a ← a'
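To connect the pseudo-code with the linear, sparse binary features used throughout this thesis, the following minimal sketch shows one way the SARSA(λ) updates might be implemented with one parameter vector per action and replacing traces. It is an illustrative sketch only: the environment interface (reset() returning the active feature indices, and step(a) returning the next active features, the reward, and a termination flag) and all names are assumptions, not our actual implementation.

    import random
    import numpy as np

    def sarsa_lambda(env, num_features, num_actions, episodes,
                     alpha=0.5, gamma=0.99, lam=0.9, epsilon=0.01):
        theta = np.zeros((num_actions, num_features))

        def q(phi, a):
            # Linear Q with binary features: sum of active parameters.
            return theta[a][phi].sum()

        def choose(phi):
            if random.random() < epsilon:
                return random.randrange(num_actions)
            return int(np.argmax([q(phi, a) for a in range(num_actions)]))

        for _ in range(episodes):
            e = np.zeros_like(theta)              # eligibility traces
            phi = env.reset()                     # active feature indices
            a = choose(phi)
            done = False
            while not done:
                phi2, r, done = env.step(a)
                a2 = choose(phi2) if not done else None
                target = r if done else r + gamma * q(phi2, a2)
                delta = target - q(phi, a)        # temporal-difference error
                e[a][phi] = 1.0                   # replacing traces for active features
                theta += alpha * delta * e
                e *= gamma * lam                  # decay all traces
                phi, a = phi2, a2
        return theta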

3 Arcade Learning Environment

This section presents a detailed introduction to the Arcade Learning Environment (ALE) and discusses previous approaches that have yielded promising performance in it.

3.1 ALE

Games have always been an important arena for demonstrating the latest advances in artificial intelligence. We follow this tradition and choose the Arcade Learning Environment as our test platform. Many games in ALE are qualitatively different from each other: some are first-person shooters, which require the agent to identify the right timing to fire (e.g. Battle Zone, see Figure 5); some are prize-collecting games (e.g. Seaquest, see Figure 6), which require the agent to distinguish high-valued prizes from less valuable targets; some are sports games (e.g. Ice Hockey, see Figure 7), which require the agent to find a way past the opponent's defense while also blocking the opponent's offense; and others belong to yet other genres. Because of this diversity, success in ALE inherently exhibits a degree of robustness and generality.

ALE screens are 160 pixels wide by 210 pixels high with an NTSC 128-color palette. Agents have access only to the screen information and aim to maximize the episode score of the game being played using the 18 actions of a standard joystick, without any game-specific prior information. All 18 actions form the full action set, while each game also has a minimal action set containing only the actions indispensable for playing that game. An episode begins on the first frame after a reset command is issued, and terminates when the game ends or after 18,000 frames of play (the equivalent of 5 minutes of real-time play), whichever comes first. In ALE, it is very common to treat screens as the environmental states. Note that in most games a single screen does not constitute a Markovian state. That said, all Atari 2600 games are deterministic, so the next screen is fully determined by some length of interaction history. It is most common for ALE agents to base their decisions only on the most recent few screens.
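For concreteness, a minimal sketch of an interaction loop with ALE through its Python bindings is shown below. This is not the implementation used in this thesis, and the exact API (assumed here to be ALEInterface from ale_py with setInt, loadROM, getMinimalActionSet, reset_game, game_over, act, and getScreenRGB) may differ between ALE releases; the ROM path is also hypothetical.

    import random
    from ale_py import ALEInterface

    ale = ALEInterface()
    ale.setInt("max_num_frames_per_episode", 18000)  # 5 minutes of play
    ale.loadROM("seaquest.bin")                      # hypothetical ROM path

    actions = ale.getMinimalActionSet()              # indispensable actions only
    for episode in range(5):
        ale.reset_game()
        total = 0
        while not ale.game_over():
            a = random.choice(actions)               # random policy as a placeholder
            total += ale.act(a)                      # act() returns the reward
            screen = ale.getScreenRGB()              # 210 x 160 x 3 observation
        print("episode", episode, "score", total)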

Figure 5: A Screenshot of Battle Zone
Figure 6: A Screenshot of Seaquest
Figure 7: A Screenshot of Ice Hockey

3.2 Preprocessing

As discussed in Chapter 2, the quality of the feature set used for linear function approximation heavily determines whether we can build an agent that learns effectively and efficiently in Atari 2600 games, so our research focus lies in finding a good linear feature representation for ALE. However, directly extracting features from raw Atari screens can be computationally demanding, in terms of both CPU cycles and memory requirements. To make screens sparse, following previous work (e.g. [1, 6]), we subtract the background from the screen at each frame before processing any visual input. The background is precomputed offline using 18,000 sample screens obtained from random trajectories. The trajectories are generated by first following a human-provided, prefixed sequence of actions for a random number of steps and subsequently following actions selected uniformly at random. A histogram stores the frequency of each color at each pixel position on the screen, and the most frequent color at a pixel position is defined as that position's background color. After the background is obtained, whenever background subtraction is applied during training, the agent compares the screen with the background, and only those pixels whose colors differ from the background colors are used to generate features; pixels whose colors match the background are discarded.
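A rough illustration of this preprocessing step is sketched below, under the assumption that the sampled screens are available as a NumPy array of per-pixel palette indices; the function names and the use of −1 to mark background pixels are illustrative choices, not part of our implementation.

    import numpy as np

    def compute_background(screens, num_colors=128):
        """screens: array of shape (num_samples, height, width) holding
        palette indices. Returns the per-pixel most frequent color."""
        counts = np.zeros((num_colors,) + screens.shape[1:], dtype=np.int32)
        for s in screens:
            np.add.at(counts, (s,) + tuple(np.indices(s.shape)), 1)
        return counts.argmax(axis=0)

    def subtract_background(screen, background):
        """Keep only pixels that differ from the background; mark the
        rest with -1 so later feature extraction can skip them."""
        return np.where(screen != background, screen, -1)

In this sketch, compute_background would be run once offline on the sampled screens, and subtract_background would then be applied to every frame before feature extraction.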

3.3 Basic Features

Early work on ALE focused on developing generic feature representations for use with linear RL methods. Along with setting ALE up for public research use, Bellemare et al. [1] also presented a benchmark using four different feature sets. The most relevant one here is the Basic feature set, as we use it as a starting point. Basic features were introduced as an attempt to encode the colors on the screen. Since Atari game developers frequently use distinct colors as one of the primary ways to distinguish objects on the screen, we can consider Basic features a crude form of object detection.

Figure 8: Basic Features Extracted from the Screen of Freeway [1]

Since most Atari game objects are larger than a few pixels, directly encoding the absolute positions of colored pixels is likely to break an object into too many pieces, which in turn leads to a feature set that provides almost no generalization about which pixels belong to the same object or where those objects are. Instead, Basic features encode the positions of colors at a coarse resolution. Specifically, to obtain Basic features we first divide the 160 × 210 pixel Atari screen into 16 × 14 tiles, each consisting of 10 × 15 pixels. For every tile (c, r) and color k, where c ∈ {1, ..., 16}, r ∈ {1, ..., 14}, and k ∈ {1, ..., 128}, we check whether some pixel of color k is present in the tile; if so, the binary feature φ_{c,r,k} is set to 1, and otherwise to 0. In total, there are 16 × 14 × 128 = 28,672 Basic features. Intuitively, a Basic feature encodes information like "there is a blue pixel within the tile in the upper-right corner of the screen." As shown in Figure 8, the Basic feature denoted by the color blue contained in the leftmost tile in the right picture captures the absolute position of the leftmost blue car on the screen of Freeway (the left picture).
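A minimal sketch of this extraction is given below, assuming the background-subtracted screen is a 210 × 160 array of palette indices with −1 marking background pixels (as in the preprocessing sketch above); the encoding of feature indices is an illustrative choice.

    import numpy as np

    TILE_W, TILE_H, NUM_COLORS = 10, 15, 128

    def basic_features(screen):
        """Return the set of active Basic feature indices, one index per
        (column, row, color) triple with a non-background pixel present."""
        active = set()
        height, width = screen.shape            # 210 x 160
        for r in range(height // TILE_H):       # 14 tile rows
            for c in range(width // TILE_W):    # 16 tile columns
                tile = screen[r * TILE_H:(r + 1) * TILE_H,
                              c * TILE_W:(c + 1) * TILE_W]
                for k in np.unique(tile):
                    if k >= 0:                  # skip the background marker
                        active.add((c * 14 + r) * NUM_COLORS + int(k))
        return active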

3.4 BASS

Though Basic features are able to capture the absolute positions of colors that potentially represent important objects, they lack the ability to encode any spatial relationships between objects. In many games, understanding how objects interact with each other may be more important than knowing about each object individually. This is the motivation behind the first extension to Basic features, the BASS features [1]. BASS features behave identically to Basic features with two exceptions. First, BASS augments Basic with pairwise products of its features. Second, BASS uses a smaller, SECAM 8-color palette to keep the number of features tractable. With this smaller palette, BASS generates a total of 1,792 Basic features and 3,211,264 pairwise features.

Figure 9: BASS Features Extracted from the Screen of Freeway [1]

As shown in Figure 9, a BASS feature encodes information like "there is a blue pixel within a tile in the left part of the screen and there is a yellow pixel within some tile in the upper-left part of the screen." By referring to the actual Freeway screen, we can see that this example BASS feature is, in effect, making a spatial connection between the leftmost blue car and the yellow car.

4 Spatial Invariance

Recall that the BASS feature set, one of the best-performing feature sets, acknowledges the importance of pairwise spatial relationships between objects. That said, it still does not capture such relationships very efficiently, because the feature set does not generalize enough.

Figure 10: Cartoon Freeway Illustrating BASS Features [7, 8]

To illustrate the generalization issue inherent in BASS features, consider a cartoon version of the game Freeway (see Figure 10). In this game, the chicken tries to cross the road without being hit by cars. Intuitively, never crossing when a car is nearby is a good survival strategy. However, it would be hard for an agent using BASS features to develop such a strategy: as shown in Figure 10, with BASS features the agent must learn the same strategy separately for each of the three combinations of the car's and the chicken's absolute positions. Yet all three combinations can easily be generalized into one and the same situation if we ignore the absolute positions and only consider the relative spatial offset between the car and the chicken. This is exactly the motivation for our first extension to Basic features, the Basic Pairwise Relative Offsets in Space (B-PROS) features. B-PROS behaves similarly to BASS with one exception. Instead of directly encoding the

pairwise products of Basic features, it encodes the relative spatial offsets between Basic features at a given coarse resolution. The resolution is the same as the one used for Basic features. Since that resolution divides the Atari screen into 16 × 14 tiles, the range of possible relative offsets along the x axis is [−15, 15] and along the y axis is [−13, 13].

Figure 11: B-PROS Features Extracted from the Screen of Freeway [1]

More specifically, besides including all Basic features, B-PROS further checks whether there exists a pair of Basic features with colors k_1, k_2 ∈ {1, ..., 128} at some offset (i, j) from each other, where −15 ≤ i ≤ 15 and −13 ≤ j ≤ 13. If so, φ_{k_1,k_2,(i,j)} is set to 1, indicating that a pixel of color k_1 is contained within some tile (c, r) and a pixel of color k_2 is contained within the tile (c + i, r + j). Intuitively, as shown in Figure 11, B-PROS encodes information such as "there is a yellow pixel three tiles right of and two tiles above a blue pixel." As B-PROS imposes spatial invariance, the three BASS features mentioned in the cartoon Freeway example are reduced to only one B-PROS feature (see Figure 12).

Figure 12: Cartoon Freeway Illustrating B-PROS Features [7, 8]

Note that the time complexity of computing B-PROS features is similar to that of BASS, though ultimately far fewer features are generated. Due to its relatively small memory usage, using the whole NTSC 128-color palette is possible. Directly applying the method described above may produce redundant features (e.g. φ_{blue,yellow,(1,2)} and φ_{yellow,blue,(−1,−2)} are identical features), but it is straightforward to eliminate them. After this redundancy is eliminated, the B-PROS feature set has 6,885,440 features in total: 6,885,440 = 28,672 + ((128 · 127)/2) · (31 · 27) + 128 · ((31 · 27 + 1)/2).
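The construction can be sketched as follows; this is illustrative only, and assumes that the Basic-feature pass has already produced, for each color, the set of tiles in which that color appears.

    def bpros_features(tiles_by_color):
        """tiles_by_color: dict mapping color k -> set of (c, r) tiles where
        a pixel of color k appears on the current screen. Returns the set of
        active B-PROS offset features as (k1, k2, (i, j)) tuples, with the
        redundant mirrored form removed by ordering the color pair."""
        active = set()
        colors = sorted(tiles_by_color)
        for k1 in colors:
            for k2 in colors:
                if k2 < k1:
                    continue                       # keep one ordering of the pair
                for (c1, r1) in tiles_by_color[k1]:
                    for (c2, r2) in tiles_by_color[k2]:
                        i, j = c2 - c1, r2 - r1
                        if k1 == k2 and (i, j) < (0, 0):
                            i, j = -i, -j          # canonicalize same-color offsets
                        active.add((k1, k2, (i, j)))
        return active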

4.1 Empirical Evaluation

Our first set of experiments compares B-PROS to previous state-of-the-art linear representations in ALE, to evaluate the impact of the spatial invariance we impose on BASS. For the sake of comparison, we follow Bellemare et al.'s methodology [1]. Specifically, in 24 independent trials the agent was trained for 5,000 episodes. After the learning phase we froze the weights and evaluated the learned policy by recording its average performance over 499 episodes; we report the average evaluation score across the 24 trials. We used a frame-skipping technique, in which the agent selects actions and updates its value function every x frames, repeating the selected action for the next x − 1 frames. This allows the agent to play approximately x times faster. Following Bellemare et al., we use x = 5. We used Sarsa(λ) with replacing traces and an ɛ-greedy policy. We performed a parameter sweep over nine games, which we call the "training" games. Our agent's performance was best with the following parameters: discount rate γ = 0.99, exploration rate ɛ = 0.01, step size α = 0.5, and eligibility trace decay rate λ.

Game             Best Linear    B-PROS (std. dev.)
Asterix                            (1016.0)
Beam Rider                         (504.4)
Breakout                           (11.0)
Enduro                             (24.2)
Freeway                            (0.7)
Pong                               (6.6)
Q*Bert                             (740.0)
Seaquest                           (321.2)
Space Invaders                     (87.2)

Table 1: Comparison of linear representations. Bold denotes the larger value between B-PROS and Best Linear [2]. See text for more details.

Table 1 summarizes the comparison between our agent and existing state-of-the-art agents using linear function approximation on our set of training games (note that Breakout, Enduro, Pong, and Q*bert were not in the set of training games for the previous linear methods). "Best Linear" denotes the best performance obtained among four different feature sets: Basic, BASS, DISCO, and LSH [1]. According to Table 1, B-PROS surpasses the original benchmarks by a large margin. Furthermore, some of the improvements represent qualitative breakthroughs. For instance, in Pong, B-PROS allows the agent to win the game on average (whoever gets 21 points first wins), while previous methods rarely scored more than a few points. When comparing these feature sets across all 53 games, B-PROS performs better on average than all of Basic, BASS, DISCO, and LSH in 77% (41/53) of the games. The dramatic improvement yielded by B-PROS clearly indicates that focusing on relative spatial relationships rather than absolute positions is vital to success in the ALE (and likely in other visual domains as well).

5 Short Order Markov Features

Though B-PROS is capable of encoding relative distances between objects, it still fails to capture movement on the screen. In some games, distinguishing movement is very important and gives agents the ability to develop a more robust, better performing strategy (e.g. "an enemy is very close to the agent's avatar and moving away from it" is a totally different situation from "an enemy is very close and moving toward the agent's avatar"). In order to encode object movement, Basic Pairwise Relative Offsets in Time (B-PROT) features were proposed. Many previous linear representations relied mainly on the current screen; in contrast, B-PROT extracts information from the two most recent screens, allowing it to encode short-order-Markov features of the game screens. B-PROT features behave similarly to B-PROS features with one exception: one of the Basic features is obtained from the screen five frames in the past, while the other is obtained from the current screen. More specifically, for every pair of colors k_1, k_2 ∈ {1, 2, ..., 128} and every offset (i, j), where −15 ≤ i ≤ 15 and −13 ≤ j ≤ 13, a B-PROT feature φ_{k_1,k_2,(i,j)} is set to 1 if a pixel of color k_1 is contained within some tile (c, r) in the current screen and a pixel of color k_2 is contained within the tile (c + i, r + j) in the screen five frames ago. Intuitively, B-PROT encodes information such as "there is a yellow pixel two tiles away from where a yellow pixel was 5 frames ago." Such information helps the agent capture object movement in some sense. After these extensions, the final result is the B-PROST feature set, which contains Basic, B-PROS, and B-PROT features. B-PROT can be considered a variant of B-PROS; as a direct consequence, the memory requirement and time complexity of generating the B-PROT feature set are similar to those of B-PROS, though ultimately roughly twice as many B-PROT features are generated, because there is no redundancy among the color-pair and offset combinations (e.g. φ_{blue,yellow,(1,2)} and φ_{yellow,blue,(−1,−2)} are identical in the B-PROS feature set, while they are different in the B-PROT feature set because the yellow and blue Basic features are extracted from two different screens). B-PROST has a total of 20,598,848 = 6,885,440 + 13,713,408 sparse binary features.

Game             B-PROS Avg. (std. dev.)   B-PROST Avg. (std. dev.)   Blob-PROST Avg. (std. dev.)
Asterix               (1802.9)                  (1714.5)                   (1201.3)
Beam Rider            (255.2)                   (309.2)                    (423.9)
Breakout          7.1 (1.7)                 15.0 (4.7)                 46.7 (45.7)
Enduro           34.9 (71.5)                     (23.1)                     (79.8)
Freeway          23.8 (6.7)                 29.1 (6.3)                 31.5 (1.2)
Pong             10.9 (5.2)                 18.9 (1.3)                 20.1 (0.5)
Q*Bert                (1273.3)                  (1129.0)                   (3036.4)
Seaquest              (445.9)                   (519.5)                    (440.4)
Space Invaders        (130.4)                   (139.0)                    (101.5)

Table 2: Comparison of relative offset features. Bold indicates the best average of the three columns. Significant differences between B-PROST and Blob-PROST are marked.

5.1 Empirical Evaluation

We have shown B-PROS to be superior in performance to previous linear methods. Subsequent extensions will be compared primarily to DQN, the current state-of-the-art (see Section 7). Since this subsection only compares B-PROST with B-PROS to investigate the impact of short-order-Markov features, we wait until Section 7.1 to fully introduce DQN; for now, it suffices to know that DQN is currently the best-performing agent in ALE. In these experiments we adopted an evaluation protocol similar to Mnih et al.'s for the sake of comparison [9]. The biggest difference between the protocol used here and the one used in the previous experiment is the number of learning samples provided. Specifically, each agent was trained for 200,000,000 frames (equivalent to 40,000,000 decisions) over 24 independent trials. The increase in learning samples from 5,000 episodes to 200 million frames also gave us an invaluable chance to investigate whether our agent is able to keep learning when more learning samples are provided. The learned policy in each trial was evaluated by recording its average performance over 499 episodes with no learning.

(a) Trial One's Learning Curve in Freeway. (b) Trial One's Learning Curve in Seaquest.
Figure 13: Typical B-PROST Learning Curves

We report the average evaluation score over the 24 trials. In an effort to make our results comparable to DQN's, we also started each episode with a random number of "no-op" (idle) actions; the number of no-ops was selected uniformly at random from {1, 2, ..., 30} before a new episode officially began. Mnih et al.'s main purpose in adding no-ops was to prevent the agent from overfitting to the Atari's determinism. Furthermore, we restricted the agent to the minimal set of actions that have a unique effect in each game. Such information was never exploited by earlier work in the ALE; since in most games the minimal action set contains fewer actions than the full action set, agents with access to the minimal set may benefit from a faster effective learning rate.

The first two columns of Table 2 present results using B-PROS and B-PROST in the training games. B-PROST outperforms B-PROS in all but one training game. One particularly dramatic improvement occurs in Pong: B-PROS wins with a score of 21-9 on average, while B-PROST rarely allows the opponent to score at all. Another result worth noting is in Enduro: the randomized starting conditions with no-ops seem to have significantly harmed the performance of B-PROS in this game, while B-PROST seems to be robust to this effect. According to Table 4 in the Appendix, when evaluated over all 49 games, the average score using B-PROST is higher than that using B-PROS in 82% of the games (40/49). This clearly indicates the critical importance of short-order-Markov features to success in the ALE.

Since we are also interested in the impact of a large number of learning samples, we present our agent's learning curves when using the B-PROST feature set (see Figure 13). Brown lines in the two learning curves indicate where 5,000 episodes end. Note that a typical episode lasts a different number of frames in different games (e.g. in Freeway a typical episode lasts roughly 15,000 frames, while in Seaquest a typical episode lasts fewer than 10,000 frames). As the learning curves show, our agent's performance improves by a large margin when more learning samples are provided. However, in Seaquest the agent's performance clearly reaches a plateau and no further improvement is gained. It is hard to draw a general conclusion about the impact of large numbers of learning samples. Still, our preliminary results demonstrate that whether an agent's performance benefits from more learning samples depends at least on which specific game the agent is playing, on the agent's learning efficiency, and on the ultimate learning capacity the agent is given by a particular feature set. As demonstrated by the B-PROST learning curve in Seaquest, reaching a score of roughly 30 points is clearly the ultimate learning capacity the agent can attain using B-PROST features, and this limit cannot be broken by simply increasing the number of learning samples. This further demonstrates the point briefly discussed in Section 2.4: the dominant factor in determining an agent's ultimate learning capacity is the quality of the feature set.

6 Object Detection

One of the main motivations behind the Basic feature set is that low-resolution colors may crudely capture meaningful objects. However, since it encodes the positions of individual pixels, it struggles to distinguish which pixels are part of the same object. Since both B-PROS and B-PROT are built on top of Basic features, they suffer from the same problem. In order to measure the impact of a more sophisticated low-level object detection method, we apply a simple extension to Basic that exploits the fact that many Atari game objects consist of contiguous same-color pixels (see Figure 14). We call these groups of pixels "blobs".

Figure 14: A Typical Car in Freeway

Rather than directly representing the coarsened positions of pixels, we first process the screen to find a list of blobs; blob features then represent the coarsened positions of blobs on the screen. Changing the foundation from Basic features to blobs yields the Blob-PROST feature set. Note that in many Atari games, as would be true in more natural images, gaps often exist between pixels that actually belong to the same object. Consequently, grouping only strictly contiguous same-color pixels would result in many redundant, meaningless blobs that are each only part of the same object. To address this issue, we add a tolerance to the contiguity condition: more specifically, we consider pixels that lie within the same s × s pixel square to be contiguous. This approach has inherent trade-offs. On one hand, it could yield appealing advantages: with a sufficiently large s, we may successfully represent each object with few blobs. It would


More information

Monte Carlo Tree Search PAH 2015

Monte Carlo Tree Search PAH 2015 Monte Carlo Tree Search PAH 2015 MCTS animation and RAVE slides by Michèle Sebag and Romaric Gaudel Markov Decision Processes (MDPs) main formal model Π = S, A, D, T, R states finite set of states of the

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS Summer Introduction to Artificial Intelligence Midterm You have approximately minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. Mark your answers

More information

Training Intelligent Stoplights

Training Intelligent Stoplights Training Intelligent Stoplights Thomas Davids, Michael Celentano, and Luke Knepper December 14, 2012 1 Introduction Traffic is a huge problem for the American economy. In 2010, the average American commuter

More information

Deep Reinforcement Learning

Deep Reinforcement Learning Deep Reinforcement Learning 1 Outline 1. Overview of Reinforcement Learning 2. Policy Search 3. Policy Gradient and Gradient Estimators 4. Q-prop: Sample Efficient Policy Gradient and an Off-policy Critic

More information

Particle Swarm Optimization applied to Pattern Recognition

Particle Swarm Optimization applied to Pattern Recognition Particle Swarm Optimization applied to Pattern Recognition by Abel Mengistu Advisor: Dr. Raheel Ahmad CS Senior Research 2011 Manchester College May, 2011-1 - Table of Contents Introduction... - 3 - Objectives...

More information

Unsupervised Learning. CS 3793/5233 Artificial Intelligence Unsupervised Learning 1

Unsupervised Learning. CS 3793/5233 Artificial Intelligence Unsupervised Learning 1 Unsupervised CS 3793/5233 Artificial Intelligence Unsupervised 1 EM k-means Procedure Data Random Assignment Assign 1 Assign 2 Soft k-means In clustering, the target feature is not given. Goal: Construct

More information

Space-Progressive Value Iteration: An Anytime Algorithm for a Class of POMDPs

Space-Progressive Value Iteration: An Anytime Algorithm for a Class of POMDPs Space-Progressive Value Iteration: An Anytime Algorithm for a Class of POMDPs Nevin L. Zhang and Weihong Zhang lzhang,wzhang @cs.ust.hk Department of Computer Science Hong Kong University of Science &

More information

Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving

Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving Xi Xiong Jianqiang Wang Fang Zhang Keqiang Li State Key Laboratory of Automotive Safety and Energy, Tsinghua University

More information

Monitoring Interfaces for Faults

Monitoring Interfaces for Faults Monitoring Interfaces for Faults Aleksandr Zaks RV 05 - Fifth Workshop on Runtime Verification Joint work with: Amir Pnueli, Lenore Zuck Motivation Motivation Consider two components interacting with each

More information

6.001 Notes: Section 6.1

6.001 Notes: Section 6.1 6.001 Notes: Section 6.1 Slide 6.1.1 When we first starting talking about Scheme expressions, you may recall we said that (almost) every Scheme expression had three components, a syntax (legal ways of

More information

(Refer Slide Time: 1:27)

(Refer Slide Time: 1:27) Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 1 Introduction to Data Structures and Algorithms Welcome to data

More information

This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane?

This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane? Intersecting Circles This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane? This is a problem that a programmer might have to solve, for example,

More information

15-780: MarkovDecisionProcesses

15-780: MarkovDecisionProcesses 15-780: MarkovDecisionProcesses J. Zico Kolter Feburary 29, 2016 1 Outline Introduction Formal definition Value iteration Policy iteration Linear programming for MDPs 2 1988 Judea Pearl publishes Probabilistic

More information

Assignment 4: CS Machine Learning

Assignment 4: CS Machine Learning Assignment 4: CS7641 - Machine Learning Saad Khan November 29, 2015 1 Introduction The purpose of this assignment is to apply some of the techniques learned from reinforcement learning to make decisions

More information

Approximate Linear Programming for Average-Cost Dynamic Programming

Approximate Linear Programming for Average-Cost Dynamic Programming Approximate Linear Programming for Average-Cost Dynamic Programming Daniela Pucci de Farias IBM Almaden Research Center 65 Harry Road, San Jose, CA 51 pucci@mitedu Benjamin Van Roy Department of Management

More information

Residual Advantage Learning Applied to a Differential Game

Residual Advantage Learning Applied to a Differential Game Presented at the International Conference on Neural Networks (ICNN 96), Washington DC, 2-6 June 1996. Residual Advantage Learning Applied to a Differential Game Mance E. Harmon Wright Laboratory WL/AAAT

More information

CS221 Final Project: Learning Atari

CS221 Final Project: Learning Atari CS221 Final Project: Learning Atari David Hershey, Rush Moody, Blake Wulfe {dshersh, rmoody, wulfebw}@stanford December 11, 2015 1 Introduction 1.1 Task Definition and Atari Learning Environment Our goal

More information

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation Unit 5 SIMULATION THEORY Lesson 39 Learning objective: To learn random number generation. Methods of simulation. Monte Carlo method of simulation You ve already read basics of simulation now I will be

More information

Reinforcement Learning (2)

Reinforcement Learning (2) Reinforcement Learning (2) Bruno Bouzy 1 october 2013 This document is the second part of the «Reinforcement Learning» chapter of the «Agent oriented learning» teaching unit of the Master MI computer course.

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

DriveFaster: Optimizing a Traffic Light Grid System

DriveFaster: Optimizing a Traffic Light Grid System DriveFaster: Optimizing a Traffic Light Grid System Abstract CS221 Fall 2016: Final Report Team Members: Xiaofan Li, Ahmed Jaffery Traffic lights are the central point of control of traffic for cities

More information

Hierarchical Assignment of Behaviours by Self-Organizing

Hierarchical Assignment of Behaviours by Self-Organizing Hierarchical Assignment of Behaviours by Self-Organizing W. Moerman 1 B. Bakker 2 M. Wiering 3 1 M.Sc. Cognitive Artificial Intelligence Utrecht University 2 Intelligent Autonomous Systems Group University

More information

Final Exam. Introduction to Artificial Intelligence. CS 188 Spring 2010 INSTRUCTIONS. You have 3 hours.

Final Exam. Introduction to Artificial Intelligence. CS 188 Spring 2010 INSTRUCTIONS. You have 3 hours. CS 188 Spring 2010 Introduction to Artificial Intelligence Final Exam INSTRUCTIONS You have 3 hours. The exam is closed book, closed notes except a two-page crib sheet. Please use non-programmable calculators

More information

State Space Reduction For Hierarchical Reinforcement Learning

State Space Reduction For Hierarchical Reinforcement Learning In Proceedings of the 17th International FLAIRS Conference, pp. 59-514, Miami Beach, FL 24 AAAI State Space Reduction For Hierarchical Reinforcement Learning Mehran Asadi and Manfred Huber Department of

More information

Marco Wiering Intelligent Systems Group Utrecht University

Marco Wiering Intelligent Systems Group Utrecht University Reinforcement Learning for Robot Control Marco Wiering Intelligent Systems Group Utrecht University marco@cs.uu.nl 22-11-2004 Introduction Robots move in the physical environment to perform tasks The environment

More information

REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION

REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION ABSTRACT Mark A. Mueller Georgia Institute of Technology, Computer Science, Atlanta, GA USA The problem of autonomous vehicle navigation between

More information

Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty

Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty Michael Comstock December 6, 2012 1 Introduction This paper presents a comparison of two different machine learning systems

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T.

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T. Although this paper analyzes shaping with respect to its benefits on search problems, the reader should recognize that shaping is often intimately related to reinforcement learning. The objective in reinforcement

More information

Maximizing the Spread of Influence through a Social Network. David Kempe, Jon Kleinberg and Eva Tardos

Maximizing the Spread of Influence through a Social Network. David Kempe, Jon Kleinberg and Eva Tardos Maximizing the Spread of Influence through a Social Network David Kempe, Jon Kleinberg and Eva Tardos Group 9 Lauren Thomas, Ryan Lieblein, Joshua Hammock and Mary Hanvey Introduction In a social network,

More information

Learning to bounce a ball with a robotic arm

Learning to bounce a ball with a robotic arm Eric Wolter TU Darmstadt Thorsten Baark TU Darmstadt Abstract Bouncing a ball is a fun and challenging task for humans. It requires fine and complex motor controls and thus is an interesting problem for

More information

6.001 Notes: Section 8.1

6.001 Notes: Section 8.1 6.001 Notes: Section 8.1 Slide 8.1.1 In this lecture we are going to introduce a new data type, specifically to deal with symbols. This may sound a bit odd, but if you step back, you may realize that everything

More information

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = (

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = ( Floating Point Numbers in Java by Michael L. Overton Virtually all modern computers follow the IEEE 2 floating point standard in their representation of floating point numbers. The Java programming language

More information

Logistic Regression and Gradient Ascent

Logistic Regression and Gradient Ascent Logistic Regression and Gradient Ascent CS 349-02 (Machine Learning) April 0, 207 The perceptron algorithm has a couple of issues: () the predictions have no probabilistic interpretation or confidence

More information

Per-decision Multi-step Temporal Difference Learning with Control Variates

Per-decision Multi-step Temporal Difference Learning with Control Variates Per-decision Multi-step Temporal Difference Learning with Control Variates Kristopher De Asis Department of Computing Science University of Alberta Edmonton, AB T6G 2E8 kldeasis@ualberta.ca Richard S.

More information

Strengths, Weaknesses, and Combinations of Model-based and Model-free Reinforcement Learning. Kavosh Asadi

Strengths, Weaknesses, and Combinations of Model-based and Model-free Reinforcement Learning. Kavosh Asadi Strengths, Weaknesses, and Combinations of Model-based and Model-free Reinforcement Learning by Kavosh Asadi A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1

Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1 Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1 HAI THANH MAI AND MYOUNG HO KIM Department of Computer Science Korea Advanced Institute of Science

More information

Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

More information

Lecture 3: Linear Classification

Lecture 3: Linear Classification Lecture 3: Linear Classification Roger Grosse 1 Introduction Last week, we saw an example of a learning task called regression. There, the goal was to predict a scalar-valued target from a set of features.

More information

Novel Function Approximation Techniques for. Large-scale Reinforcement Learning

Novel Function Approximation Techniques for. Large-scale Reinforcement Learning Novel Function Approximation Techniques for Large-scale Reinforcement Learning A Dissertation by Cheng Wu to the Graduate School of Engineering in Partial Fulfillment of the Requirements for the Degree

More information

16.410/413 Principles of Autonomy and Decision Making

16.410/413 Principles of Autonomy and Decision Making 16.410/413 Principles of Autonomy and Decision Making Lecture 17: The Simplex Method Emilio Frazzoli Aeronautics and Astronautics Massachusetts Institute of Technology November 10, 2010 Frazzoli (MIT)

More information

γ 2 γ 3 γ 1 R 2 (b) a bounded Yin set (a) an unbounded Yin set

γ 2 γ 3 γ 1 R 2 (b) a bounded Yin set (a) an unbounded Yin set γ 1 γ 3 γ γ 3 γ γ 1 R (a) an unbounded Yin set (b) a bounded Yin set Fig..1: Jordan curve representation of a connected Yin set M R. A shaded region represents M and the dashed curves its boundary M that

More information

Notebook Assignments

Notebook Assignments Notebook Assignments These six assignments are a notebook using techniques from class in the single concrete context of graph theory. This is supplemental to your usual assignments, and is designed for

More information

Neural Episodic Control. Alexander pritzel et al (presented by Zura Isakadze)

Neural Episodic Control. Alexander pritzel et al (presented by Zura Isakadze) Neural Episodic Control Alexander pritzel et al. 2017 (presented by Zura Isakadze) Reinforcement Learning Image from reinforce.io RL Example - Atari Games Observed States Images. Internal state - RAM.

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

Math 340 Fall 2014, Victor Matveev. Binary system, round-off errors, loss of significance, and double precision accuracy.

Math 340 Fall 2014, Victor Matveev. Binary system, round-off errors, loss of significance, and double precision accuracy. Math 340 Fall 2014, Victor Matveev Binary system, round-off errors, loss of significance, and double precision accuracy. 1. Bits and the binary number system A bit is one digit in a binary representation

More information

Planning and Control: Markov Decision Processes

Planning and Control: Markov Decision Processes CSE-571 AI-based Mobile Robotics Planning and Control: Markov Decision Processes Planning Static vs. Dynamic Predictable vs. Unpredictable Fully vs. Partially Observable Perfect vs. Noisy Environment What

More information

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs Integer Programming ISE 418 Lecture 7 Dr. Ted Ralphs ISE 418 Lecture 7 1 Reading for This Lecture Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Wolsey Chapter 7 CCZ Chapter 1 Constraint

More information

Improved Attack on Full-round Grain-128

Improved Attack on Full-round Grain-128 Improved Attack on Full-round Grain-128 Ximing Fu 1, and Xiaoyun Wang 1,2,3,4, and Jiazhe Chen 5, and Marc Stevens 6, and Xiaoyang Dong 2 1 Department of Computer Science and Technology, Tsinghua University,

More information

This work is about a new method for generating diffusion curve style images. Although this topic is dealing with non-photorealistic rendering, as you

This work is about a new method for generating diffusion curve style images. Although this topic is dealing with non-photorealistic rendering, as you This work is about a new method for generating diffusion curve style images. Although this topic is dealing with non-photorealistic rendering, as you will see our underlying solution is based on two-dimensional

More information

AI Technology for Quickly Solving Urban Security Positioning Problems

AI Technology for Quickly Solving Urban Security Positioning Problems AI Technology for Quickly Solving Urban Security Positioning Problems Hiroaki Iwashita Kotaro Ohori Hirokazu Anai Security games are used for mathematically optimizing security measures aimed at minimizing

More information

Partially Observable Markov Decision Processes. Silvia Cruciani João Carvalho

Partially Observable Markov Decision Processes. Silvia Cruciani João Carvalho Partially Observable Markov Decision Processes Silvia Cruciani João Carvalho MDP A reminder: is a set of states is a set of actions is the state transition function. is the probability of ending in state

More information

Applications of Reinforcement Learning. Ist künstliche Intelligenz gefährlich?

Applications of Reinforcement Learning. Ist künstliche Intelligenz gefährlich? Applications of Reinforcement Learning Ist künstliche Intelligenz gefährlich? Table of contents Playing Atari with Deep Reinforcement Learning Playing Super Mario World Stanford University Autonomous Helicopter

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

9. MATHEMATICIANS ARE FOND OF COLLECTIONS

9. MATHEMATICIANS ARE FOND OF COLLECTIONS get the complete book: http://wwwonemathematicalcatorg/getfulltextfullbookhtm 9 MATHEMATICIANS ARE FOND OF COLLECTIONS collections Collections are extremely important in life: when we group together objects

More information

Competition Between Reinforcement Learning Methods in a Predator-Prey Grid World

Competition Between Reinforcement Learning Methods in a Predator-Prey Grid World Competition Between Reinforcement Learning Methods in a Predator-Prey Grid World Jacob Schrum (schrum2@cs.utexas.edu) Department of Computer Sciences University of Texas at Austin Austin, TX 78712 USA

More information

Lottery Looper. User Manual

Lottery Looper. User Manual Lottery Looper User Manual Lottery Looper 2.2 copyright Timersoft. All rights reserved. http://www.timersoft.com The information contained in this document is subject to change without notice. This document

More information

Short-Cut MCMC: An Alternative to Adaptation

Short-Cut MCMC: An Alternative to Adaptation Short-Cut MCMC: An Alternative to Adaptation Radford M. Neal Dept. of Statistics and Dept. of Computer Science University of Toronto http://www.cs.utoronto.ca/ radford/ Third Workshop on Monte Carlo Methods,

More information

(Refer Slide Time: 1:40)

(Refer Slide Time: 1:40) Computer Architecture Prof. Anshul Kumar Department of Computer Science and Engineering, Indian Institute of Technology, Delhi Lecture - 3 Instruction Set Architecture - 1 Today I will start discussion

More information

Simulated Transfer Learning Through Deep Reinforcement Learning

Simulated Transfer Learning Through Deep Reinforcement Learning Through Deep Reinforcement Learning William Doan Griffin Jarmin WILLRD9@VT.EDU GAJARMIN@VT.EDU Abstract This paper encapsulates the use reinforcement learning on raw images provided by a simulation to

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

THE preceding chapters were all devoted to the analysis of images and signals which

THE preceding chapters were all devoted to the analysis of images and signals which Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to

More information

A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY

A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY KARL L. STRATOS Abstract. The conventional method of describing a graph as a pair (V, E), where V and E repectively denote the sets of vertices and edges,

More information

Introduction to Deep Q-network

Introduction to Deep Q-network Introduction to Deep Q-network Presenter: Yunshu Du CptS 580 Deep Learning 10/10/2016 Deep Q-network (DQN) Deep Q-network (DQN) An artificial agent for general Atari game playing Learn to master 49 different

More information

LAO*, RLAO*, or BLAO*?

LAO*, RLAO*, or BLAO*? , R, or B? Peng Dai and Judy Goldsmith Computer Science Dept. University of Kentucky 773 Anderson Tower Lexington, KY 40506-0046 Abstract In 2003, Bhuma and Goldsmith introduced a bidirectional variant

More information

15-451/651: Design & Analysis of Algorithms October 11, 2018 Lecture #13: Linear Programming I last changed: October 9, 2018

15-451/651: Design & Analysis of Algorithms October 11, 2018 Lecture #13: Linear Programming I last changed: October 9, 2018 15-451/651: Design & Analysis of Algorithms October 11, 2018 Lecture #13: Linear Programming I last changed: October 9, 2018 In this lecture, we describe a very general problem called linear programming

More information

Q-learning with linear function approximation

Q-learning with linear function approximation Q-learning with linear function approximation Francisco S. Melo and M. Isabel Ribeiro Institute for Systems and Robotics [fmelo,mir]@isr.ist.utl.pt Conference on Learning Theory, COLT 2007 June 14th, 2007

More information

Convolutional Restricted Boltzmann Machine Features for TD Learning in Go

Convolutional Restricted Boltzmann Machine Features for TD Learning in Go ConvolutionalRestrictedBoltzmannMachineFeatures fortdlearningingo ByYanLargmanandPeterPham AdvisedbyHonglakLee 1.Background&Motivation AlthoughrecentadvancesinAIhaveallowed Go playing programs to become

More information

Breaking Grain-128 with Dynamic Cube Attacks

Breaking Grain-128 with Dynamic Cube Attacks Breaking Grain-128 with Dynamic Cube Attacks Itai Dinur and Adi Shamir Computer Science department The Weizmann Institute Rehovot 76100, Israel Abstract. We present a new variant of cube attacks called

More information

Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning

Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning Jan Peters 1, Stefan Schaal 1 University of Southern California, Los Angeles CA 90089, USA Abstract. In this paper, we

More information

Robotics. Lecture 5: Monte Carlo Localisation. See course website for up to date information.

Robotics. Lecture 5: Monte Carlo Localisation. See course website  for up to date information. Robotics Lecture 5: Monte Carlo Localisation See course website http://www.doc.ic.ac.uk/~ajd/robotics/ for up to date information. Andrew Davison Department of Computing Imperial College London Review:

More information

Unit 1 Algebraic Functions and Graphs

Unit 1 Algebraic Functions and Graphs Algebra 2 Unit 1 Algebraic Functions and Graphs Name: Unit 1 Day 1: Function Notation Today we are: Using Function Notation We are successful when: We can Use function notation to evaluate a function This

More information