Representations and Control in Atari Games Using Reinforcement Learning


Representations and Control in Atari Games Using Reinforcement Learning
by Yitao Liang, Class of 2016
A thesis submitted in partial fulfillment of the requirements for honors in Computer Science
May 29, 2016
Department of Mathematics & Computer Science, Franklin & Marshall College

Abstract

The Arcade Learning Environment (ALE) is a challenging framework composed of dozens of qualitatively different Atari 2600 games, and thus an ideal testbed for general AI competency. As in many other complex reinforcement learning domains, finding a good representation for predicting expected cumulative rewards has proven to be the key to success in the ALE. Many recent approaches rely on non-linear function approximation and neural networks, which incur a high computational cost. This thesis presents a simple, computationally practical, linear feature representation, Blob-PROST (blob pairwise offsets in space and time), whose performance is competitive with the current state-of-the-art results generated by Deep Q-Networks (DQN). In addition, we provide a simple and reproducible benchmark for comparison with future work in the ALE. The thesis also addresses two major drawbacks inherent in linear function approximation: 1) finding the right set of features is itself challenging, and 2) usually a small subset of the features captures most of the representational power while the rest contribute little. To address these issues, a new framework called A-BPROS (adaptive blob pairwise offsets in space), inspired by and built upon Blob-PROST, is developed as a method for feature expansion in the ALE. Initial results suggest that, in terms of representational power, more work is needed to make A-BPROS as competitive as Blob-PROST, while in terms of memory savings A-BPROS is promising. Finally, directions for future work are discussed.

Acknowledgements

First of all, I want to thank my supervisor, Dr. Erik Talvitie; working with him has been a wonderful and very rewarding experience. I am very grateful to have had the freedom to work on problems that I found interesting, along with his support and guidance whenever I needed them. The same thanks go to all of my defense board members, Dr. Jing Hu, Dr. Danel Draguljic, and Instructor Anthony Weaver; your suggestions and comments about my honors thesis and my defense were all of great value. I also thank all the friends I have made at Franklin & Marshall College; thank you for always being encouraging and supportive. Finally, I would like to express my deep gratitude to my parents. Their love and support keep me chasing my dream and being who I truly am. This research was supported by Alberta Innovates Technology Futures and the Alberta Innovates Centre for Machine Learning. Computing resources were provided by Compute Canada through Calcul Québec.

Contents

1 Introduction
2 Reinforcement Learning Problem
   2.1 The Problem Setting
   2.2 Markov Properties of the Environment
   2.3 Value Functions
   2.4 Features & Linear Function Approximation
   2.5 SARSA
3 Arcade Learning Environment
   3.1 ALE
   3.2 Preprocessing
   3.3 Basic Features
   3.4 BASS
4 Spatial Invariance
   4.1 Empirical Evaluation
5 Short Order Markov Features
   5.1 Empirical Evaluation
6 Object Detection
   6.1 Empirical Evaluation
7 Comparison with DQN
   7.1 Deep Q-Networks (DQN)
   7.2 DQN Evaluation Methodology
   7.3 Comparison with DQN in Performance
   7.4 Comparison with DQN in Computational Cost
8 A New Benchmark
9 Conclusion and Discussions about Blob-PROST
   9.1 Limits of Blob-PROST
10 A-BPROS Approach
   A-BPROS Algorithm Details
      Candidate Feature Generation
      Refinement of the Resolution for Anchors
      Addition of New Offsets
      Refinement of the Resolution for Offsets
      Calculating Relevances
      Candidate Feature Promotion
   Other Implementation Details
   Experiment Results and Discussion
      Direct Evaluation after Expanding
      Learning after Resetting
      Representation of an Observed Policy
   Conclusion and Future Work for A-BPROS
Appendix

1 Introduction

Reinforcement learning (RL) is concerned with creating an agent that is able to learn efficiently and effectively from its own experience. In an RL problem, an agent interacts with an unknown environment and learns from that experience to achieve some goal. Several notable real-world problems, such as autonomous driving, drone control, or playing board games, can all be framed this way. Many RL problems such as those mentioned above use the real world as the source of interactions. As such, they are not suitable for our research: first, the computational resources needed to solve those problems are unaffordable; second, successful strategies for those problems usually involve preset rules and extracting useful information from possibly inaccurate sensors. Those strategies are worth investigating, but they also drive the research focus away from designing agents that are capable of learning in a domain-independent manner. Instead, the Arcade Learning Environment (ALE) is a good fit for researching domain-independent AI. ALE consists of roughly 50 qualitatively different Atari 2600 video games [1]. Because a successful strategy in one game may not work in another, it is extremely hard for an agent whose strategy relies heavily on game-specific information to succeed; in other words, ALE is a particularly suitable platform for testing domain-independent learning. Since all interactions happen on computer screens, extracting information is convenient and no sensors are needed; this ease of control and interaction allows us to focus fully on RL research. Furthermore, ALE's games were designed by humans and intended to be played by humans, so the complexity of the situations an agent encounters in ALE is comparable to that of many real-world problems, which makes the platform particularly interesting and challenging. Lastly, the success of an agent in ALE is easily measured, as most video games have a numerical point system that makes the goal of the game straightforward.

In order to make rational decisions in ALE, the agent usually implements a mapping from situations of the environment (or state-action pairs) to numerical values. This mapping is called a value function, and the numerical values estimate how good it is to take a certain action in a given state. From the value function an agent derives its policy, which is a function that maps situations to the probability of choosing each action. The action with the highest value is called the best action, and an agent using a purely greedy policy would always follow the best action in every situation. Under an ɛ-greedy policy, with probability 1 − ɛ the agent picks the best action and with probability ɛ it picks an action at random. In this kind of approach, a typical way to achieve generalization over different situations of the environment is to design features to represent the environment. Instead of mapping environment situations to values, the value function then maps combinations of features to values. Since the quality of the feature set plays a vital role in determining whether an RL agent succeeds, much RL research has focused on designing features. However, many successful feature designs involve engineering that captures problem-specific information, which diminishes the agent's autonomy and reduces its flexibility. One important contribution of this thesis is to present a computationally practical, fixed, generic feature representation called Blob-PROST that is able to yield human-level performance in ALE. Moreover, the feature set introduced here may shed some light on the minimum information required to obtain such a level of performance in ALE and provide some insight into what constitutes a good representation in a visual domain, such as the encoding of pairwise distances between objects and the use of temporally extended representations. Blob-PROST mainly captures pairwise relationships between objects on the screen. It is a logical speculation that if the feature set could be extended to capture relationships among more than two objects, the agent's performance could be further improved (which is confirmed by our preliminary experiments). However, adding features that focus on such relationships is impractical, as the time complexity would grow exponentially.

A natural follow-up research direction is therefore to design a method that expands the feature set as the agent learns, so that different games may end up with different feature sets. The second main contribution of this thesis is to introduce such a method, called A-BPROS. Unlike many previous works [2], A-BPROS does not generate new features by combining existing features; instead, it starts with a small initial feature set and generates candidate features based on it, with the aim of iteratively improving the feature set's representational power. A-BPROS is expected to pick up only important features and thus should achieve the same representational power as Blob-PROST using far less memory. Note that one of A-BPROS's main jobs can be viewed as seeking the minimal subset of features within Blob-PROST that yields the same level of performance as Blob-PROST.

2 Reinforcement Learning Problem

In this chapter, we introduce reinforcement learning and its common approaches. The notation and definitions largely follow Sutton and Barto [3].

2.1 The Problem Setting

The reinforcement learning problem is concerned with learning from the experience of interactions to achieve a certain goal. The learner or action taker is called the agent and everything else is called the environment. An environment usually consists of many, if not countless, situations, which we intuitively call the environment's states. The agent interacts with the environment continuously: whenever the agent takes an action (being idle is also considered an action), the environment responds and evolves to a new state. In common approaches, the times at which the agent is allowed to act are discretized, making the problem more practical. Usually, the environment also presents the agent with rewards, whose cumulative value is what the agent tries to maximize.

Figure 1: Reinforcement Learning Problem

Based on the description above, a basic reinforcement learning problem can be modeled as an entity consisting of the following four important components:

a set of environmental states S;

a set of actions A;
a set of transition rules between states P;
a reward function R, which determines the immediate reward.

The transition rules between states are usually stochastic, and the agent is assumed to observe the current environmental state and the reward associated with its last action. We call a sequence of the agent's interactions with the environment finite when the agent eventually ends up at some natural endpoint. One finite sequence of agent-environment interactions is called an episode, and the natural endpoint is called the terminal state or end state. A policy defines which action to take in a given state, and the ultimate goal of a successful agent is to obtain a policy that, when followed, is expected to yield the maximum overall reward.

Figure 2: A Tiny Example of a Reinforcement Learning Problem

Example 1. Consider the tiny reinforcement learning problem shown in Figure 2. The agent can end up in any of the 11 green boxes, and being in each box is a different situation for the agent; in other words, in this example we can consider these 11 boxes to be the environmental states. Some boxes in the figure have numbers while others do not; we can treat the empty boxes as having a number of 0. The reward function can be summarized as follows: when the agent enters a state, it receives a reward equal to the number associated with that box. The agent has four actions: Up,

Down, Left, and Right. The only knowledge the agent possesses before it starts is the set of actions; by observing the environment, it can determine which state it is in. The blue arrows represent the agent's current policy. By following this policy, the agent starts from the bottom-left state and ends up in the upper-right state, which is the terminal or end state in this example.

Since in a typical reinforcement learning problem the agent usually possesses minimal prior knowledge about the environment, it is crucial to develop an exploration mechanism for the agent to succeed. In many RL problems, exploration can be as simple as a random walk. However, randomly selecting actions without reference to the knowledge the agent has obtained from its previous interactions would intuitively result in poor performance. After the agent gathers "enough" information to make informal guesses and estimates about the environment, it needs a mechanism to find and follow the currently best policy, which is known as exploitation. A purely greedy policy can be considered an example of exploitation: it always chooses the action that is expected to lead to the most overall reward in any given state. Keeping a good balance between exploration and exploitation has proven to be key in determining whether an agent is capable of making a breakthrough in many RL games [3, 4]. The most commonly used mechanism to keep such a balance is the ɛ-greedy policy: with probability 1 − ɛ the agent follows the purely greedy policy, and with probability ɛ it picks a random action.
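As a concrete illustration, the following minimal sketch shows one common way to implement ɛ-greedy action selection over estimated action values; the function and variable names are illustrative only and are not part of any implementation described later.

    import random

    def epsilon_greedy(q_values, epsilon):
        """Pick an action index from a list of estimated action values.

        With probability epsilon choose uniformly at random (exploration);
        otherwise choose an action with the highest estimated value
        (exploitation), breaking ties at random.
        """
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        best = max(q_values)
        best_actions = [a for a, q in enumerate(q_values) if q == best]
        return random.choice(best_actions)

    # Example: three actions with values 0.1, 0.7, 0.7 and epsilon = 0.01
    # returns action 1 or 2 about 99% of the time.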

2.2 Markov Properties of the Environment

In our setting, the agent makes its decisions based on its observations of the environmental states it encounters. Ideally, the observations should compactly contain all the relevant information about the states, in which case the environment is said to be Markov. Formally speaking, being Markov refers to the memoryless nature of the environment's state. When a problem satisfies the Markov property, the term "state" may be used to refer to both the state and the observation. Consider how a general environment might respond to an agent's action at a specific time: in general, the response might depend on the full history of the environmental states and the agent's actions. If the state is Markov, the dynamics simplify to depend only on the environment's current state and the agent's current action, regardless of any previous history.

By assuming the problem satisfies the Markov property, the environment can be formulated as a Markov Decision Process (MDP) [5] M, defined as a 5-tuple M = (S, A, R, P, γ). At each time t ∈ {0, 1, 2, ...}, the agent is in a state s_t ∈ S, where it selects an action a_t ∈ A. After executing the action a_t, it receives an immediate reward r_{t+1} defined by the reward function R(s_t, a_t, s_{t+1}) and transitions to a next state s_{t+1} ∈ S drawn from the transition distribution P(s_{t+1} | s_t, a_t). The goal of the agent is to find a policy π : S → A, a set of rules determining which action to choose in each environmental state, that maximizes the expected cumulative reward.

2.3 Value Functions

Many reinforcement learning methods are based on estimating value functions, functions of states (or state-action pairs) that measure how good it is to be in a given state, where the notion of "good" is defined in terms of the expected sum of rewards. Whether a state is good or not clearly depends on the agent's decisions about which actions to choose, or in other words on the agent's chosen policy. Accordingly, value functions are defined with respect to particular policies. Informally, suppose the agent is in state s_t and, by following its chosen policy π, ends up in some terminal state s_{t_f}, receiving an immediate reward after each action it takes. These rewards form a sequence whose sum R = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_{t_f} can be treated as the value of the state s_t. However, this reward sequence has two flaws. First, in some RL problems there is no terminal state, in which case the sum of the reward sequence can easily grow to infinity. Second, it should not treat short-term and long-term rewards equally. As the saying "a dollar today is worth more than a dollar tomorrow" suggests, people intuitively prefer short-term rewards when their numerical values are similar to those of long-term ones.

(a) Following the chosen policy π. (b) Deviating from the chosen policy π at the start state.
Figure 3: Policy Iteration

In order to fix these two issues and build a mechanism that balances short-term and long-term rewards, an additional concept needs to be introduced: discounting. A future reward is multiplied by a discount factor that shrinks with the length of the time interval between the current time step and the time step at which the reward is expected to be collected. With this new concept, the reward sequence is transformed into

\[ R = \gamma^{0} r_{t+1} + \gamma^{1} r_{t+2} + \gamma^{2} r_{t+3} + \cdots, \]

where γ is the discount rate, with 0 < γ ≤ 1. When a discount rate is introduced, this new sum R is considered the value of the state s_t.

Example 2. Let us consider our tiny reinforcement learning problem again, now shown in Figure 3. Note that Figure 3a is exactly the same as Figure 2. Assume the agent is currently at the start state and, for simplicity, that the discount rate is 1. Then, by following the policy specified by the blue arrows, the sequence of rewards the agent collects (which includes a −1 along the way) sums to 9.

Formally, the value of a state s with respect to a certain policy π is defined as the sum of discounted rewards expected by following the policy π starting from the state s:

\[ V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right] \]

where V^π(s) denotes the state-value function with respect to policy π, and E_π[ · | s_t = s ] denotes the expected sum of rewards given that the agent follows policy π starting from the state s. Note that the value of a terminal state, if any, is always set to 0. Similarly, we define the state-action value function under policy π, denoted Q^π(s, a), as the cumulative discounted reward the agent is expected to receive by selecting action a in state s and following the policy π afterwards:

\[ Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ R(s_t, a_t, s_{t+1}) + \gamma V^{\pi}(s_{t+1}) \,\middle|\, s_t = s,\; a_t = a \right] \]

Note that in this state-action value function, when the action a_t does not equal the action specified by the chosen policy π, the agent actually deviates from π at the state s_t. However, regardless of what a_t is, the agent always follows the chosen policy π from the state s_{t+1} onwards. Specifically, we can break the Q equation into two parts to reflect this idea: the term R(s_t, a_t, s_{t+1}) denotes the immediate reward received by deviating from the chosen policy π at the state s_t and choosing the action a_t instead, and the term γV^π(s_{t+1}) denotes the discounted reward expected by following the chosen policy π starting from the state s_{t+1}.

Example 3. According to the definition of the state-value function, the discounted sum we calculated in the last example is also the value of the start state with respect to our current policy π; in mathematical notation, V^π(start state) = 9. As shown in Figure 3b, the agent decides to deviate from the current policy π at the start state. The immediate reward it receives by doing so is 0, and the discounted value of following the policy π afterwards is 10. According to our definition, with respect to the chosen policy π, the value of the state-action pair (start state, Up) is 10, which in mathematical notation is expressed as Q^π(start state, Up) = 10.

If Q^π(s_t, a_t) turns out to be larger than V^π(s_t), we can improve our chosen policy π by replacing the action it specifies at s_t with the action a_t. We call this process policy iteration. An optimal policy π* maximizes the expected return in every state s ∈ S, thus satisfying the

Bellman equation:

\[ \pi^{*}(s) = \operatorname*{arg\,max}_{a} Q^{\pi^{*}}(s, a) \]

Example 4. Let us consider the tiny reinforcement learning problem one last time. It turns out that Q^π(start state, Up) is larger than V^π(start state), which means the agent can improve its policy simply by changing the action specified at the start state to Up. This improvement is one step of policy iteration.

2.4 Features & Linear Function Approximation

In problems with a large or infinite number of states, it may be infeasible to calculate a value for each state-action pair. Instead, we extract important information that is representative of the environmental states and use it to build a function that approximates state-action values. We call these pieces of important information features. Of special interest are binary features, features whose values are set to 1 when active and 0 when inactive. A good feature set usually provides meaningful generalization about the environment, so that an agent's learning experience in one state can be drawn upon when the agent steps into a similar yet different state.

Figure 4: Atari 2600 Game Pong

Example 5. Consider the Atari 2600 game Pong as our example. In this game, the brown and green paddles can only move Up and Down, and the agent controls the green paddle, though it may not be aware of that. When the brown paddle fails to bounce back the ball, the agent wins a

point. When the green paddle fails to catch the ball, the agent loses a point. Any small change to the absolute position of the brown paddle, the green paddle, or the ball leads to a new environmental state, and the number of combinations of the three positions is at least in the thousands, if not effectively countless. In this game, using features to generalize different environmental states into similar situations can greatly reduce the difficulty of deriving a value function. Consider a very small feature set consisting of only three binary features: the ball is in the left half of the screen, the ball is in the right half, and the green paddle is in the right half. Since these are binary features, each can take only two values, so this feature set reduces all possible environmental states to only 2^3 = 8 possible situations. Note that this particular feature set may not be effective in this game.

Essentially, a function approximation represents Q^π in a lower-dimensional space: F(φ_{s,a}; θ) ≈ Q^π(s, a), where θ is a parameter vector, φ_{s,a} denotes a static feature representation of the state s when selecting the action a, and F(φ_{s,a}; θ) is a function of φ_{s,a} parameterized by θ. This means the approximate Q^π(s, a) values depend entirely on the static feature set and the parameter associated with each feature. Note that although the feature set is static, different environmental states usually activate different subsets of features. Further note that even though learning happens when the parameter values are updated, the real key in determining an agent's ultimate learning capacity lies in the quality of the static feature set. More specifically, given enough learning samples, the parameter vector will eventually be tuned to be optimal with respect to the given feature set, and during this process the error between the approximation and the true value hopefully decreases. However, whether this "optimal" approximation is close enough to the true value still depends entirely on the quality of the provided feature set. Imagine a feature set that provides no useful generalization about the environment: a feature may be active in one state and inactive in another, almost identical state. This conflicting information may pull the update for the parameter associated with that feature in two opposite directions, which in turn prevents the parameter from being optimized and makes any decrease in approximation error impossible.

One of the most important and widely used forms of function approximation is that in which the approximation is a linear function of the parameter vector θ. Correspondingly, there is a feature vector φ_{s,a} with the same number of components as θ. Formally, the approximate linear state-action value function is defined as

\[ Q^{\pi}(s, a) \approx \sum_{i=1}^{n} \theta(i)\, \phi_{s,a}(i) \]

Linear function approximation provides one important advantage: the gradient of the approximate value function with respect to θ is simply

\[ \nabla_{\theta} Q^{\pi}(s, a) = \phi_{s,a}. \]

Gradient descent is used to optimize the parameter vector θ, and this computational convenience reduces the updates of θ to a very simple form. Because of its simplicity, this linear method is one of the most favorable ways to set up a reinforcement learning architecture. If we use binary features in a linear function approximation, we gain two additional advantages. First, the approximate values can be computed more efficiently, as the whole computation reduces to looking up and summing the parameters of the active features. Second, the parameter vector provides very intuitive information about the importance of each feature.

Example 6. Let us consider the game Pong again. The linear function approximation using the simple feature set introduced in the last example would be Q(s, a) ≈ F(φ_{s,a}; θ) = w_1 · (the ball is in the left half) + w_2 · (the ball is in the right half) + w_3 · (the green paddle is in the right half). When the ball is in the left half, the brown paddle is likely to miss it, which means the agent may win a point, so w_1 should be positive. Similarly, when the ball is in the right half, the agent may lose a point, so w_2 should be negative. Regardless of the ball's position, the green paddle is always in the right half, which means w_3 should be close to 0.
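To make the linear form concrete, the following minimal sketch evaluates a linear Q-function with sparse binary features, where a state-action pair is represented simply by the indices of its active features. The names and the two-action setup are illustrative assumptions, not part of our implementation.

    import numpy as np

    NUM_FEATURES = 8  # e.g. the tiny 3-feature Pong set would use 3

    # One parameter vector per action (separate vectors per action are
    # used in Section 2.5 as well).
    theta = {action: np.zeros(NUM_FEATURES) for action in ("Up", "Down")}

    def q_value(active_features, action):
        """With binary features, Q(s, a) is just the sum of the parameters
        associated with the features that are active in state s."""
        return theta[action][active_features].sum()

    # Example: features 0 and 2 are active in the current state.
    print(q_value([0, 2], "Up"))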

2.5 SARSA

One of the most popular online reinforcement learning algorithms is SARSA(λ) [3], which stands for State-Action-Reward-State-Action. The basis of SARSA is policy iteration, which was covered in the value functions subsection: recall that as states are visited and the rewards from those states are obtained, the state-action value function is updated and consequently the policy is improved. Note that the update to a particular state-action value estimate is partly based on the estimates of other state-action values. When using function approximation, it is common to further denote Q values as Q(s, a; θ), where θ is the parameter vector associated with the chosen feature set. In order to better approximate Q values for different actions, it is common to use the same feature set but keep a separate parameter vector for each action. In particular, the SARSA(λ) update equations when using function approximation are [3]:

\[ \delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}; \theta_t) - Q(s_t, a_t; \theta_t) \]
\[ e_t = \gamma \lambda\, e_{t-1} + \nabla_{\theta} Q(s_t, a_t; \theta_t) \]
\[ \theta_{t+1} = \theta_t + \alpha\, \delta_t\, e_t, \]

where α denotes the step size (learning rate), which determines how fast newly acquired information overrides old information: a value of 0 means no new information is learned, while a value of 1 makes the agent consider only the most recent information. δ_t is the temporal-difference error and γ is the discount rate. The λ in the parentheses following SARSA refers to the use of an eligibility trace, denoted e_t in the update equations. An eligibility trace is a temporal record of the occurrence of events; in this thesis, its values are positively related to the occurrence of feature-action pairs. Every time a feature-action pair is active, its eligibility increases, and the longer the feature-action pair stays inactive, the more its eligibility decays. The decay rate is specified by λ in the eligibility trace update equation. When an update is made, only those

feature-action pairs which have non-zero eligibilities are assigned credit for the update, and thus only their associated parameters are changed. One can view the eligibility trace as a mechanism for properly distributing the error between estimates and true reward values; the traces thus serve as bridges connecting the features to the training information. In this thesis, a particular type of eligibility trace, the replacing eligibility trace, is used. Specifically, with a replacing eligibility trace, whenever a state-action pair becomes active its eligibility is reset to 1, and while the pair is inactive its eligibility decays at the rate γλ. Thus, the eligibility trace update can be simplified to the following:

e(s, a) ← 1 (for the current state s and current action a)
e(s, a) ← γλ e(s, a) (for all other states and actions)

All the update equations mentioned above are applied after each nonterminal state s_t. If the episode ends at state s_{t+1}, then Q(s_{t+1}, a_{t+1}) is defined to be zero for every possible action. With all the necessary elements defined, we can present the pseudo-code (see Algorithm 1), shown here in its tabular form.

Algorithm 1: SARSA(λ) with replacing traces
    Initialize Q(s, a) arbitrarily (usually 0) and e(s, a) = 0 for all s, a
    for every episode do
        Initialize s
        Choose a from s using the policy derived from Q (e.g. ɛ-greedy)
        while s is not terminal do
            Take action a, observe r, s'
            Choose a' from s' using the policy derived from Q (e.g. ɛ-greedy)
            δ ← r + γ Q(s', a') − Q(s, a)
            e(s, a) ← 1
            for all s, a do
                Q(s, a) ← Q(s, a) + α δ e(s, a)
                e(s, a) ← γλ e(s, a)
            s ← s'; a ← a'
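To connect the pseudo-code with the linear, sparse binary features used throughout this thesis, the following minimal sketch shows one way the SARSA(λ) updates might be implemented with one parameter vector per action and replacing traces. It is an illustrative sketch only: the environment interface (reset() returning the active feature indices, and step(a) returning the next active features, the reward, and a termination flag) and all names are assumptions, not our actual implementation.

    import random
    import numpy as np

    def sarsa_lambda(env, num_features, num_actions, episodes,
                     alpha=0.5, gamma=0.99, lam=0.9, epsilon=0.01):
        theta = np.zeros((num_actions, num_features))

        def q(phi, a):
            # Linear Q with binary features: sum of active parameters.
            return theta[a][phi].sum()

        def choose(phi):
            if random.random() < epsilon:
                return random.randrange(num_actions)
            return int(np.argmax([q(phi, a) for a in range(num_actions)]))

        for _ in range(episodes):
            e = np.zeros_like(theta)              # eligibility traces
            phi = env.reset()                     # active feature indices
            a = choose(phi)
            done = False
            while not done:
                phi2, r, done = env.step(a)
                a2 = choose(phi2) if not done else None
                target = r if done else r + gamma * q(phi2, a2)
                delta = target - q(phi, a)        # temporal-difference error
                e[a][phi] = 1.0                   # replacing traces for active features
                theta += alpha * delta * e
                e *= gamma * lam                  # decay all traces
                phi, a = phi2, a2
        return theta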

3 Arcade Learning Environment

This section presents a detailed introduction to the Arcade Learning Environment (ALE) and discusses previous approaches that have yielded promising performance in it.

3.1 ALE

Games have always been an important arena for demonstrating the latest advances in artificial intelligence. We follow this tradition and choose the Arcade Learning Environment as our test platform. Many games in ALE are qualitatively different from each other: some are first-person shooters, which require the agent to identify the right timing to fire (e.g. Battle Zone, see Figure 5); some are prize-collecting games (e.g. Seaquest, see Figure 6), which require the agent to distinguish high-valued prizes from less valuable targets; some are sports games (e.g. Ice Hockey, see Figure 7), which require the agent to find a way past the opponent's defense while also blocking the opponent's offense; and others belong to yet other genres. Because of this diversity, success in ALE inherently exhibits a degree of robustness and generality.

ALE screens are 160 pixels wide by 210 pixels high with an NTSC 128-color palette. Agents have access only to the screen information and aim to maximize the episode score of the game being played using the 18 actions of a standard joystick, without any game-specific prior information. All 18 actions form the full action set, while each game also has a minimal action set containing only the actions indispensable for playing that game. An episode begins on the first frame after a reset command is issued, and terminates when the game ends or after 18,000 frames of play (the equivalent of 5 minutes of real-time play), whichever comes first. In ALE, it is very common to treat screens as the environmental states. Note that in most games a single screen does not constitute a Markovian state. That said, all Atari 2600 games are deterministic, so the next screen is fully determined by some length of interaction history. It is most common for ALE agents to base their decisions only on the most recent few screens.
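For concreteness, a minimal sketch of an interaction loop with ALE through its Python bindings is shown below. This is not the implementation used in this thesis, and the exact API (assumed here to be ALEInterface from ale_py with setInt, loadROM, getMinimalActionSet, reset_game, game_over, act, and getScreenRGB) may differ between ALE releases; the ROM path is also hypothetical.

    import random
    from ale_py import ALEInterface

    ale = ALEInterface()
    ale.setInt("max_num_frames_per_episode", 18000)  # 5 minutes of play
    ale.loadROM("seaquest.bin")                      # hypothetical ROM path

    actions = ale.getMinimalActionSet()              # indispensable actions only
    for episode in range(5):
        ale.reset_game()
        total = 0
        while not ale.game_over():
            a = random.choice(actions)               # random policy as a placeholder
            total += ale.act(a)                      # act() returns the reward
            screen = ale.getScreenRGB()              # 210 x 160 x 3 observation
        print("episode", episode, "score", total)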

Figure 5: A Screenshot of Battle Zone
Figure 6: A Screenshot of Seaquest
Figure 7: A Screenshot of Ice Hockey

3.2 Preprocessing

As discussed in Chapter 2, the quality of the feature set used for linear function approximation heavily determines whether we can build an agent that learns effectively and efficiently in Atari 2600 games, so our research focus lies in finding a good linear feature representation for ALE. However, directly extracting features from raw Atari screens can be computationally demanding, in terms of both CPU cycles and memory requirements. To make screens sparse, following previous work (e.g. [1, 6]), we subtract the background from the screen at each frame before processing any visual input. The background is precomputed offline using 18,000 sample screens obtained from random trajectories. The trajectories are generated by first following a human-provided, prefixed sequence of actions for a random number of steps and subsequently following actions selected uniformly at random. A histogram stores the frequency of each color at each pixel position on the screen, and the most frequent color at a pixel position is defined as that position's background color. After the background is obtained, whenever background subtraction is applied during training, the agent compares the screen with the background, and only those pixels whose colors differ from the background colors are used to generate features; pixels whose colors match the background are discarded.
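A rough illustration of this preprocessing step is sketched below, under the assumption that the sampled screens are available as a NumPy array of per-pixel palette indices; the function names and the use of −1 to mark background pixels are illustrative choices, not part of our implementation.

    import numpy as np

    def compute_background(screens, num_colors=128):
        """screens: array of shape (num_samples, height, width) holding
        palette indices. Returns the per-pixel most frequent color."""
        counts = np.zeros((num_colors,) + screens.shape[1:], dtype=np.int32)
        for s in screens:
            np.add.at(counts, (s,) + tuple(np.indices(s.shape)), 1)
        return counts.argmax(axis=0)

    def subtract_background(screen, background):
        """Keep only pixels that differ from the background; mark the
        rest with -1 so later feature extraction can skip them."""
        return np.where(screen != background, screen, -1)

In this sketch, compute_background would be run once offline on the sampled screens, and subtract_background would then be applied to every frame before feature extraction.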

3.3 Basic Features

Early work on ALE focused on developing generic feature representations for use with linear RL methods. Along with setting ALE up for public research use, Bellemare et al. [1] also presented a benchmark using four different feature sets. The most relevant one here is the Basic feature set, as we use it as a starting point. Basic features were introduced as an attempt to encode the colors on the screen. Since Atari game developers frequently use distinct colors as one of the primary ways to distinguish objects on the screen, we can consider Basic features a crude form of object detection.

Figure 8: Basic Features Extracted from the Screen of Freeway [1]

Since most Atari game objects are larger than a few pixels, directly encoding the absolute positions of colored pixels is likely to break an object into too many pieces, which in turn leads to a feature set that provides almost no generalization about which pixels belong to the same object or where those objects are. Instead, Basic features encode the positions of colors at a coarse resolution. Specifically, to obtain Basic features we first divide the 160 × 210 pixel Atari screen into 16 × 14 tiles, each consisting of 10 × 15 pixels. For every tile (c, r) and color k, where c ∈ {1, ..., 16}, r ∈ {1, ..., 14}, and k ∈ {1, ..., 128}, we check whether some pixel of color k is present in the tile; if so, the binary feature φ_{c,r,k} is set to 1, and otherwise to 0. In total, there are 16 × 14 × 128 = 28,672 Basic features. Intuitively, a Basic feature encodes information like "there is a blue pixel within the tile in the upper-right corner of the screen." As shown in Figure 8, the Basic feature denoted by the color blue contained in the leftmost tile in the right picture captures the absolute position of the leftmost blue car on the screen of Freeway (the left picture).
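A minimal sketch of this extraction is given below, assuming the background-subtracted screen is a 210 × 160 array of palette indices with −1 marking background pixels (as in the preprocessing sketch above); the encoding of feature indices is an illustrative choice.

    import numpy as np

    TILE_W, TILE_H, NUM_COLORS = 10, 15, 128

    def basic_features(screen):
        """Return the set of active Basic feature indices, one index per
        (column, row, color) triple with a non-background pixel present."""
        active = set()
        height, width = screen.shape            # 210 x 160
        for r in range(height // TILE_H):       # 14 tile rows
            for c in range(width // TILE_W):    # 16 tile columns
                tile = screen[r * TILE_H:(r + 1) * TILE_H,
                              c * TILE_W:(c + 1) * TILE_W]
                for k in np.unique(tile):
                    if k >= 0:                  # skip the background marker
                        active.add((c * 14 + r) * NUM_COLORS + int(k))
        return active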

3.4 BASS

Though Basic features are able to capture the absolute positions of colors that potentially represent important objects, they lack the ability to encode any spatial relationships between objects. In many games, understanding how objects interact with each other may be more important than knowing about each object individually. This is the motivation behind the first extension to Basic features, the BASS features [1]. BASS features behave identically to Basic features with two exceptions. First, BASS augments Basic with pairwise products of its features. Second, BASS uses a smaller, SECAM 8-color palette to keep the number of features tractable. With this smaller palette, BASS generates a total of 1,792 Basic features and 3,211,264 pairwise features.

Figure 9: BASS Features Extracted from the Screen of Freeway [1]

As shown in Figure 9, a BASS feature encodes information like "there is a blue pixel within a tile in the left part of the screen and there is a yellow pixel within some tile in the upper-left part of the screen." By referring to the actual Freeway screen, we can see that this example BASS feature is, in effect, making a spatial connection between the leftmost blue car and the yellow car.

4 Spatial Invariance

Recall that the BASS feature set, one of the best-performing feature sets, acknowledges the importance of pairwise spatial relationships between objects. That said, it still does not capture such relationships very efficiently, because the feature set does not generalize enough.

Figure 10: Cartoon Freeway Illustrating BASS Features [7, 8]

To illustrate the generalization issue inherent in BASS features, consider a cartoon version of the game Freeway (see Figure 10). In this game, the chicken tries to cross the road without being hit by cars. Intuitively, never crossing when a car is nearby is a good survival strategy. However, it would be hard for an agent using BASS features to develop such a strategy: as shown in Figure 10, with BASS features the agent must learn the same strategy separately for each of the three combinations of the car's and the chicken's absolute positions. Yet all three combinations can easily be generalized into one and the same situation if we ignore the absolute positions and only consider the relative spatial offset between the car and the chicken. This is exactly the motivation for our first extension to Basic features, the Basic Pairwise Relative Offsets in Space (B-PROS) features. B-PROS behaves similarly to BASS with one exception. Instead of directly encoding the

pairwise products of Basic features, it encodes the relative spatial offsets between Basic features at a given coarse resolution. The resolution is the same as the one used for Basic features. Since that resolution divides the Atari screen into 16 × 14 tiles, the range of possible relative offsets along the x axis is [−15, 15] and along the y axis is [−13, 13].

Figure 11: B-PROS Features Extracted from the Screen of Freeway [1]

More specifically, besides including all Basic features, B-PROS further checks whether there exists a pair of Basic features with colors k_1, k_2 ∈ {1, ..., 128} at some offset (i, j) from each other, where −15 ≤ i ≤ 15 and −13 ≤ j ≤ 13. If so, φ_{k_1,k_2,(i,j)} is set to 1, indicating that a pixel of color k_1 is contained within some tile (c, r) and a pixel of color k_2 is contained within the tile (c + i, r + j). Intuitively, as shown in Figure 11, B-PROS encodes information such as "there is a yellow pixel three tiles right of and two tiles above a blue pixel." As B-PROS imposes spatial invariance, the three BASS features mentioned in the cartoon Freeway example are reduced to only one B-PROS feature (see Figure 12).

Figure 12: Cartoon Freeway Illustrating B-PROS Features [7, 8]

Note that the time complexity of computing B-PROS features is similar to that of BASS, though ultimately far fewer features are generated. Due to its relatively small memory usage, using the whole NTSC 128-color palette is possible. Directly applying the method described above may produce redundant features (e.g. φ_{blue,yellow,(1,2)} and φ_{yellow,blue,(−1,−2)} are identical features), but it is straightforward to eliminate them. After this redundancy is eliminated, the B-PROS feature set has 6,885,440 features in total: 6,885,440 = 28,672 + ((128 · 127)/2) · (31 · 27) + 128 · ((31 · 27 + 1)/2).
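The construction can be sketched as follows; this is illustrative only, and assumes that the Basic-feature pass has already produced, for each color, the set of tiles in which that color appears.

    def bpros_features(tiles_by_color):
        """tiles_by_color: dict mapping color k -> set of (c, r) tiles where
        a pixel of color k appears on the current screen. Returns the set of
        active B-PROS offset features as (k1, k2, (i, j)) tuples, with the
        redundant mirrored form removed by ordering the color pair."""
        active = set()
        colors = sorted(tiles_by_color)
        for k1 in colors:
            for k2 in colors:
                if k2 < k1:
                    continue                       # keep one ordering of the pair
                for (c1, r1) in tiles_by_color[k1]:
                    for (c2, r2) in tiles_by_color[k2]:
                        i, j = c2 - c1, r2 - r1
                        if k1 == k2 and (i, j) < (0, 0):
                            i, j = -i, -j          # canonicalize same-color offsets
                        active.add((k1, k2, (i, j)))
        return active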

4.1 Empirical Evaluation

Our first set of experiments compares B-PROS to previous state-of-the-art linear representations in ALE, to evaluate the impact of the spatial invariance we impose on BASS. For the sake of comparison, we follow Bellemare et al.'s methodology [1]. Specifically, in 24 independent trials the agent was trained for 5,000 episodes. After the learning phase we froze the weights and evaluated the learned policy by recording its average performance over 499 episodes; we report the average evaluation score across the 24 trials. We used a frame-skipping technique, in which the agent selects actions and updates its value function every x frames, repeating the selected action for the next x − 1 frames. This allows the agent to play approximately x times faster. Following Bellemare et al., we use x = 5. We used Sarsa(λ) with replacing traces and an ɛ-greedy policy. We performed a parameter sweep over nine games, which we call the "training" games. Our agent's performance was best with the following parameters: discount rate γ = 0.99, exploration rate ɛ = 0.01, step size α = 0.5, and eligibility trace decay rate λ.

Game             Best Linear    B-PROS (std. dev.)
Asterix                            (1016.0)
Beam Rider                         (504.4)
Breakout                           (11.0)
Enduro                             (24.2)
Freeway                            (0.7)
Pong                               (6.6)
Q*Bert                             (740.0)
Seaquest                           (321.2)
Space Invaders                     (87.2)

Table 1: Comparison of linear representations. Bold denotes the larger value between B-PROS and Best Linear [2]. See text for more details.

Table 1 summarizes the comparison between our agent and existing state-of-the-art agents using linear function approximation on our set of training games (note that Breakout, Enduro, Pong, and Q*bert were not in the set of training games for the previous linear methods). "Best Linear" denotes the best performance obtained among four different feature sets: Basic, BASS, DISCO, and LSH [1]. According to Table 1, B-PROS surpasses the original benchmarks by a large margin. Furthermore, some of the improvements represent qualitative breakthroughs. For instance, in Pong, B-PROS allows the agent to win the game on average (whoever gets 21 points first wins), while previous methods rarely scored more than a few points. When comparing these feature sets across all 53 games, B-PROS performs better on average than all of Basic, BASS, DISCO, and LSH in 77% (41/53) of the games. The dramatic improvement yielded by B-PROS clearly indicates that focusing on relative spatial relationships rather than absolute positions is vital to success in the ALE (and likely in other visual domains as well).

5 Short Order Markov Features

Though B-PROS is capable of encoding relative distances between objects, it still fails to capture movement on the screen. In some games, distinguishing movement is very important and gives agents the ability to develop a more robust, better performing strategy (e.g. "an enemy is very close to the agent's avatar and moving away from it" is a totally different situation from "an enemy is very close and moving toward the agent's avatar"). In order to encode object movement, Basic Pairwise Relative Offsets in Time (B-PROT) features were proposed. Many previous linear representations relied mainly on the current screen; in contrast, B-PROT extracts information from the two most recent screens, allowing it to encode short-order-Markov features of the game screens. B-PROT features behave similarly to B-PROS features with one exception: one of the Basic features is obtained from the screen five frames in the past, while the other is obtained from the current screen. More specifically, for every pair of colors k_1, k_2 ∈ {1, 2, ..., 128} and every offset (i, j), where −15 ≤ i ≤ 15 and −13 ≤ j ≤ 13, a B-PROT feature φ_{k_1,k_2,(i,j)} is set to 1 if a pixel of color k_1 is contained within some tile (c, r) in the current screen and a pixel of color k_2 is contained within the tile (c + i, r + j) in the screen five frames ago. Intuitively, B-PROT encodes information such as "there is a yellow pixel two tiles away from where a yellow pixel was 5 frames ago." Such information helps the agent capture object movement in some sense. After these extensions, the final result is the B-PROST feature set, which contains Basic, B-PROS, and B-PROT features. B-PROT can be considered a variant of B-PROS; as a direct consequence, the memory requirement and time complexity of generating the B-PROT feature set are similar to those of B-PROS, though ultimately roughly twice as many B-PROT features are generated, because there is no redundancy among the color-pair and offset combinations (e.g. φ_{blue,yellow,(1,2)} and φ_{yellow,blue,(−1,−2)} are identical in the B-PROS feature set, while they are different in the B-PROT feature set because the yellow and blue Basic features are extracted from two different screens). B-PROST has a total of 20,598,848 = 6,885,440 + 13,713,408 sparse binary features.

Game             B-PROS Avg. (std. dev.)   B-PROST Avg. (std. dev.)   Blob-PROST Avg. (std. dev.)
Asterix               (1802.9)                  (1714.5)                   (1201.3)
Beam Rider            (255.2)                   (309.2)                    (423.9)
Breakout          7.1 (1.7)                 15.0 (4.7)                 46.7 (45.7)
Enduro           34.9 (71.5)                     (23.1)                     (79.8)
Freeway          23.8 (6.7)                 29.1 (6.3)                 31.5 (1.2)
Pong             10.9 (5.2)                 18.9 (1.3)                 20.1 (0.5)
Q*Bert                (1273.3)                  (1129.0)                   (3036.4)
Seaquest              (445.9)                   (519.5)                    (440.4)
Space Invaders        (130.4)                   (139.0)                    (101.5)

Table 2: Comparison of relative offset features. Bold indicates the best average of the three columns. Significant differences between B-PROST and Blob-PROST are marked.

5.1 Empirical Evaluation

We have shown B-PROS to be superior in performance to previous linear methods. Subsequent extensions will be compared primarily to DQN, the current state-of-the-art (see Section 7). Since this subsection only compares B-PROST with B-PROS to investigate the impact of short-order-Markov features, we wait until Section 7.1 to fully introduce DQN; for now, it suffices to know that DQN is currently the best-performing agent in ALE. In these experiments we adopted an evaluation protocol similar to Mnih et al.'s for the sake of comparison [9]. The biggest difference between the protocol used here and the one used in the previous experiment is the number of learning samples provided. Specifically, each agent was trained for 200,000,000 frames (equivalent to 40,000,000 decisions) over 24 independent trials. The increase in learning samples from 5,000 episodes to 200 million frames also gave us an invaluable chance to investigate whether our agent is able to keep learning when more learning samples are provided. The learned policy in each trial was evaluated by recording its average performance over 499 episodes with no learning.

(a) Trial One's Learning Curve in Freeway. (b) Trial One's Learning Curve in Seaquest.
Figure 13: Typical B-PROST Learning Curves

We report the average evaluation score over the 24 trials. In an effort to make our results comparable to DQN's, we also started each episode with a random number of "no-op" (idle) actions; the number of no-ops was selected uniformly at random from {1, 2, ..., 30} before a new episode officially began. Mnih et al.'s main purpose in adding no-ops was to prevent the agent from overfitting to the Atari's determinism. Furthermore, we restricted the agent to the minimal set of actions that have a unique effect in each game. Such information was never exploited by earlier work in the ALE; since in most games the minimal action set contains fewer actions than the full action set, agents with access to the minimal set may benefit from a faster effective learning rate.

The first two columns of Table 2 present results using B-PROS and B-PROST in the training games. B-PROST outperforms B-PROS in all but one training game. One particularly dramatic improvement occurs in Pong: B-PROS wins with a score of 21-9 on average, while B-PROST rarely allows the opponent to score at all. Another result worth noting is in Enduro: the randomized starting conditions with no-ops seem to have significantly harmed the performance of B-PROS in this game, while B-PROST seems to be robust to this effect. According to Table 4 in the Appendix, when evaluated over all 49 games, the average score using B-PROST is higher than that using B-PROS in 82% of the games (40/49). This clearly indicates the critical importance of short-order-Markov features to success in the ALE.

Since we are also interested in the impact of a large number of learning samples, we present our agent's learning curves when using the B-PROST feature set (see Figure 13). Brown lines in the two learning curves indicate where 5,000 episodes end. Note that a typical episode lasts a different number of frames in different games (e.g. in Freeway a typical episode lasts roughly 15,000 frames, while in Seaquest a typical episode lasts fewer than 10,000 frames). As the learning curves show, our agent's performance improves by a large margin when more learning samples are provided. However, in Seaquest the agent's performance clearly reaches a plateau and no further improvement is gained. It is hard to draw a general conclusion about the impact of large numbers of learning samples. Still, our preliminary results demonstrate that whether an agent's performance benefits from more learning samples depends at least on which specific game the agent is playing, on the agent's learning efficiency, and on the ultimate learning capacity the agent is given by a particular feature set. As demonstrated by the B-PROST learning curve in Seaquest, reaching a score of roughly 30 points is clearly the ultimate learning capacity the agent can attain using B-PROST features, and this limit cannot be broken by simply increasing the number of learning samples. This further demonstrates the point briefly discussed in Section 2.4: the dominant factor in determining an agent's ultimate learning capacity is the quality of the feature set.

6 Object Detection

One of the main motivations behind the Basic feature set is that low-resolution colors may crudely capture meaningful objects. However, since it encodes the positions of individual pixels, it struggles to distinguish which pixels are part of the same object. Since both B-PROS and B-PROT are built on top of Basic features, they suffer from the same problem. In order to measure the impact of a more sophisticated low-level object detection method, we apply a simple extension to Basic that exploits the fact that many Atari game objects consist of contiguous same-color pixels (see Figure 14). We call these groups of pixels "blobs".

Figure 14: A Typical Car in Freeway

Rather than directly representing the coarsened positions of pixels, we first process the screen to find a list of blobs; blob features then represent the coarsened positions of blobs on the screen. Changing the foundation from Basic features to blobs yields the Blob-PROST feature set. Note that in many Atari games, as would be true in more natural images, gaps often exist between pixels that actually belong to the same object. Consequently, grouping only strictly contiguous same-color pixels would result in many redundant, meaningless blobs that are each only part of the same object. To address this issue, we add a tolerance to the contiguity condition: more specifically, we consider pixels that lie within the same s × s pixel square to be contiguous. This approach has inherent trade-offs. On one hand, it could yield appealing advantages: with a sufficiently large s, we may successfully represent each object with few blobs. It would


More information

Monte Carlo Tree Search PAH 2015

Monte Carlo Tree Search PAH 2015 Monte Carlo Tree Search PAH 2015 MCTS animation and RAVE slides by Michèle Sebag and Romaric Gaudel Markov Decision Processes (MDPs) main formal model Π = S, A, D, T, R states finite set of states of the

More information

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet.

The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. CS Summer Introduction to Artificial Intelligence Midterm You have approximately minutes. The exam is closed book, closed calculator, and closed notes except your one-page crib sheet. Mark your answers

More information

Training Intelligent Stoplights

Training Intelligent Stoplights Training Intelligent Stoplights Thomas Davids, Michael Celentano, and Luke Knepper December 14, 2012 1 Introduction Traffic is a huge problem for the American economy. In 2010, the average American commuter

More information

Deep Reinforcement Learning

Deep Reinforcement Learning Deep Reinforcement Learning 1 Outline 1. Overview of Reinforcement Learning 2. Policy Search 3. Policy Gradient and Gradient Estimators 4. Q-prop: Sample Efficient Policy Gradient and an Off-policy Critic

More information

Particle Swarm Optimization applied to Pattern Recognition

Particle Swarm Optimization applied to Pattern Recognition Particle Swarm Optimization applied to Pattern Recognition by Abel Mengistu Advisor: Dr. Raheel Ahmad CS Senior Research 2011 Manchester College May, 2011-1 - Table of Contents Introduction... - 3 - Objectives...

More information

Unsupervised Learning. CS 3793/5233 Artificial Intelligence Unsupervised Learning 1

Unsupervised Learning. CS 3793/5233 Artificial Intelligence Unsupervised Learning 1 Unsupervised CS 3793/5233 Artificial Intelligence Unsupervised 1 EM k-means Procedure Data Random Assignment Assign 1 Assign 2 Soft k-means In clustering, the target feature is not given. Goal: Construct

More information

Space-Progressive Value Iteration: An Anytime Algorithm for a Class of POMDPs

Space-Progressive Value Iteration: An Anytime Algorithm for a Class of POMDPs Space-Progressive Value Iteration: An Anytime Algorithm for a Class of POMDPs Nevin L. Zhang and Weihong Zhang lzhang,wzhang @cs.ust.hk Department of Computer Science Hong Kong University of Science &

More information

Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving

Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving Xi Xiong Jianqiang Wang Fang Zhang Keqiang Li State Key Laboratory of Automotive Safety and Energy, Tsinghua University

More information

Monitoring Interfaces for Faults

Monitoring Interfaces for Faults Monitoring Interfaces for Faults Aleksandr Zaks RV 05 - Fifth Workshop on Runtime Verification Joint work with: Amir Pnueli, Lenore Zuck Motivation Motivation Consider two components interacting with each

More information

6.001 Notes: Section 6.1

6.001 Notes: Section 6.1 6.001 Notes: Section 6.1 Slide 6.1.1 When we first starting talking about Scheme expressions, you may recall we said that (almost) every Scheme expression had three components, a syntax (legal ways of

More information

(Refer Slide Time: 1:27)

(Refer Slide Time: 1:27) Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 1 Introduction to Data Structures and Algorithms Welcome to data

More information

This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane?

This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane? Intersecting Circles This blog addresses the question: how do we determine the intersection of two circles in the Cartesian plane? This is a problem that a programmer might have to solve, for example,

More information

15-780: MarkovDecisionProcesses

15-780: MarkovDecisionProcesses 15-780: MarkovDecisionProcesses J. Zico Kolter Feburary 29, 2016 1 Outline Introduction Formal definition Value iteration Policy iteration Linear programming for MDPs 2 1988 Judea Pearl publishes Probabilistic

More information

Assignment 4: CS Machine Learning

Assignment 4: CS Machine Learning Assignment 4: CS7641 - Machine Learning Saad Khan November 29, 2015 1 Introduction The purpose of this assignment is to apply some of the techniques learned from reinforcement learning to make decisions

More information

Approximate Linear Programming for Average-Cost Dynamic Programming

Approximate Linear Programming for Average-Cost Dynamic Programming Approximate Linear Programming for Average-Cost Dynamic Programming Daniela Pucci de Farias IBM Almaden Research Center 65 Harry Road, San Jose, CA 51 pucci@mitedu Benjamin Van Roy Department of Management

More information

Residual Advantage Learning Applied to a Differential Game

Residual Advantage Learning Applied to a Differential Game Presented at the International Conference on Neural Networks (ICNN 96), Washington DC, 2-6 June 1996. Residual Advantage Learning Applied to a Differential Game Mance E. Harmon Wright Laboratory WL/AAAT

More information

CS221 Final Project: Learning Atari

CS221 Final Project: Learning Atari CS221 Final Project: Learning Atari David Hershey, Rush Moody, Blake Wulfe {dshersh, rmoody, wulfebw}@stanford December 11, 2015 1 Introduction 1.1 Task Definition and Atari Learning Environment Our goal

More information

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation

You ve already read basics of simulation now I will be taking up method of simulation, that is Random Number Generation Unit 5 SIMULATION THEORY Lesson 39 Learning objective: To learn random number generation. Methods of simulation. Monte Carlo method of simulation You ve already read basics of simulation now I will be

More information

Reinforcement Learning (2)

Reinforcement Learning (2) Reinforcement Learning (2) Bruno Bouzy 1 october 2013 This document is the second part of the «Reinforcement Learning» chapter of the «Agent oriented learning» teaching unit of the Master MI computer course.

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

DriveFaster: Optimizing a Traffic Light Grid System

DriveFaster: Optimizing a Traffic Light Grid System DriveFaster: Optimizing a Traffic Light Grid System Abstract CS221 Fall 2016: Final Report Team Members: Xiaofan Li, Ahmed Jaffery Traffic lights are the central point of control of traffic for cities

More information

Hierarchical Assignment of Behaviours by Self-Organizing

Hierarchical Assignment of Behaviours by Self-Organizing Hierarchical Assignment of Behaviours by Self-Organizing W. Moerman 1 B. Bakker 2 M. Wiering 3 1 M.Sc. Cognitive Artificial Intelligence Utrecht University 2 Intelligent Autonomous Systems Group University

More information

Final Exam. Introduction to Artificial Intelligence. CS 188 Spring 2010 INSTRUCTIONS. You have 3 hours.

Final Exam. Introduction to Artificial Intelligence. CS 188 Spring 2010 INSTRUCTIONS. You have 3 hours. CS 188 Spring 2010 Introduction to Artificial Intelligence Final Exam INSTRUCTIONS You have 3 hours. The exam is closed book, closed notes except a two-page crib sheet. Please use non-programmable calculators

More information

State Space Reduction For Hierarchical Reinforcement Learning

State Space Reduction For Hierarchical Reinforcement Learning In Proceedings of the 17th International FLAIRS Conference, pp. 59-514, Miami Beach, FL 24 AAAI State Space Reduction For Hierarchical Reinforcement Learning Mehran Asadi and Manfred Huber Department of

More information

Marco Wiering Intelligent Systems Group Utrecht University

Marco Wiering Intelligent Systems Group Utrecht University Reinforcement Learning for Robot Control Marco Wiering Intelligent Systems Group Utrecht University marco@cs.uu.nl 22-11-2004 Introduction Robots move in the physical environment to perform tasks The environment

More information

REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION

REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION ABSTRACT Mark A. Mueller Georgia Institute of Technology, Computer Science, Atlanta, GA USA The problem of autonomous vehicle navigation between

More information

Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty

Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty Non-Homogeneous Swarms vs. MDP s A Comparison of Path Finding Under Uncertainty Michael Comstock December 6, 2012 1 Introduction This paper presents a comparison of two different machine learning systems

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T.

Formal Model. Figure 1: The target concept T is a subset of the concept S = [0, 1]. The search agent needs to search S for a point in T. Although this paper analyzes shaping with respect to its benefits on search problems, the reader should recognize that shaping is often intimately related to reinforcement learning. The objective in reinforcement

More information

Maximizing the Spread of Influence through a Social Network. David Kempe, Jon Kleinberg and Eva Tardos

Maximizing the Spread of Influence through a Social Network. David Kempe, Jon Kleinberg and Eva Tardos Maximizing the Spread of Influence through a Social Network David Kempe, Jon Kleinberg and Eva Tardos Group 9 Lauren Thomas, Ryan Lieblein, Joshua Hammock and Mary Hanvey Introduction In a social network,

More information

Learning to bounce a ball with a robotic arm

Learning to bounce a ball with a robotic arm Eric Wolter TU Darmstadt Thorsten Baark TU Darmstadt Abstract Bouncing a ball is a fun and challenging task for humans. It requires fine and complex motor controls and thus is an interesting problem for

More information

6.001 Notes: Section 8.1

6.001 Notes: Section 8.1 6.001 Notes: Section 8.1 Slide 8.1.1 In this lecture we are going to introduce a new data type, specifically to deal with symbols. This may sound a bit odd, but if you step back, you may realize that everything

More information

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = (

Table : IEEE Single Format ± a a 2 a 3 :::a 8 b b 2 b 3 :::b 23 If exponent bitstring a :::a 8 is Then numerical value represented is ( ) 2 = ( Floating Point Numbers in Java by Michael L. Overton Virtually all modern computers follow the IEEE 2 floating point standard in their representation of floating point numbers. The Java programming language

More information

Logistic Regression and Gradient Ascent

Logistic Regression and Gradient Ascent Logistic Regression and Gradient Ascent CS 349-02 (Machine Learning) April 0, 207 The perceptron algorithm has a couple of issues: () the predictions have no probabilistic interpretation or confidence

More information

Per-decision Multi-step Temporal Difference Learning with Control Variates

Per-decision Multi-step Temporal Difference Learning with Control Variates Per-decision Multi-step Temporal Difference Learning with Control Variates Kristopher De Asis Department of Computing Science University of Alberta Edmonton, AB T6G 2E8 kldeasis@ualberta.ca Richard S.

More information

Strengths, Weaknesses, and Combinations of Model-based and Model-free Reinforcement Learning. Kavosh Asadi

Strengths, Weaknesses, and Combinations of Model-based and Model-free Reinforcement Learning. Kavosh Asadi Strengths, Weaknesses, and Combinations of Model-based and Model-free Reinforcement Learning by Kavosh Asadi A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1

Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1 Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1 HAI THANH MAI AND MYOUNG HO KIM Department of Computer Science Korea Advanced Institute of Science

More information

Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

More information

Lecture 3: Linear Classification

Lecture 3: Linear Classification Lecture 3: Linear Classification Roger Grosse 1 Introduction Last week, we saw an example of a learning task called regression. There, the goal was to predict a scalar-valued target from a set of features.

More information

Novel Function Approximation Techniques for. Large-scale Reinforcement Learning

Novel Function Approximation Techniques for. Large-scale Reinforcement Learning Novel Function Approximation Techniques for Large-scale Reinforcement Learning A Dissertation by Cheng Wu to the Graduate School of Engineering in Partial Fulfillment of the Requirements for the Degree

More information

16.410/413 Principles of Autonomy and Decision Making

16.410/413 Principles of Autonomy and Decision Making 16.410/413 Principles of Autonomy and Decision Making Lecture 17: The Simplex Method Emilio Frazzoli Aeronautics and Astronautics Massachusetts Institute of Technology November 10, 2010 Frazzoli (MIT)

More information

γ 2 γ 3 γ 1 R 2 (b) a bounded Yin set (a) an unbounded Yin set

γ 2 γ 3 γ 1 R 2 (b) a bounded Yin set (a) an unbounded Yin set γ 1 γ 3 γ γ 3 γ γ 1 R (a) an unbounded Yin set (b) a bounded Yin set Fig..1: Jordan curve representation of a connected Yin set M R. A shaded region represents M and the dashed curves its boundary M that

More information

Notebook Assignments

Notebook Assignments Notebook Assignments These six assignments are a notebook using techniques from class in the single concrete context of graph theory. This is supplemental to your usual assignments, and is designed for

More information

Neural Episodic Control. Alexander pritzel et al (presented by Zura Isakadze)

Neural Episodic Control. Alexander pritzel et al (presented by Zura Isakadze) Neural Episodic Control Alexander pritzel et al. 2017 (presented by Zura Isakadze) Reinforcement Learning Image from reinforce.io RL Example - Atari Games Observed States Images. Internal state - RAM.

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

Math 340 Fall 2014, Victor Matveev. Binary system, round-off errors, loss of significance, and double precision accuracy.

Math 340 Fall 2014, Victor Matveev. Binary system, round-off errors, loss of significance, and double precision accuracy. Math 340 Fall 2014, Victor Matveev Binary system, round-off errors, loss of significance, and double precision accuracy. 1. Bits and the binary number system A bit is one digit in a binary representation

More information

Planning and Control: Markov Decision Processes

Planning and Control: Markov Decision Processes CSE-571 AI-based Mobile Robotics Planning and Control: Markov Decision Processes Planning Static vs. Dynamic Predictable vs. Unpredictable Fully vs. Partially Observable Perfect vs. Noisy Environment What

More information

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs Integer Programming ISE 418 Lecture 7 Dr. Ted Ralphs ISE 418 Lecture 7 1 Reading for This Lecture Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Wolsey Chapter 7 CCZ Chapter 1 Constraint

More information

Improved Attack on Full-round Grain-128

Improved Attack on Full-round Grain-128 Improved Attack on Full-round Grain-128 Ximing Fu 1, and Xiaoyun Wang 1,2,3,4, and Jiazhe Chen 5, and Marc Stevens 6, and Xiaoyang Dong 2 1 Department of Computer Science and Technology, Tsinghua University,

More information

This work is about a new method for generating diffusion curve style images. Although this topic is dealing with non-photorealistic rendering, as you

This work is about a new method for generating diffusion curve style images. Although this topic is dealing with non-photorealistic rendering, as you This work is about a new method for generating diffusion curve style images. Although this topic is dealing with non-photorealistic rendering, as you will see our underlying solution is based on two-dimensional

More information

AI Technology for Quickly Solving Urban Security Positioning Problems

AI Technology for Quickly Solving Urban Security Positioning Problems AI Technology for Quickly Solving Urban Security Positioning Problems Hiroaki Iwashita Kotaro Ohori Hirokazu Anai Security games are used for mathematically optimizing security measures aimed at minimizing

More information

Partially Observable Markov Decision Processes. Silvia Cruciani João Carvalho

Partially Observable Markov Decision Processes. Silvia Cruciani João Carvalho Partially Observable Markov Decision Processes Silvia Cruciani João Carvalho MDP A reminder: is a set of states is a set of actions is the state transition function. is the probability of ending in state

More information

Applications of Reinforcement Learning. Ist künstliche Intelligenz gefährlich?

Applications of Reinforcement Learning. Ist künstliche Intelligenz gefährlich? Applications of Reinforcement Learning Ist künstliche Intelligenz gefährlich? Table of contents Playing Atari with Deep Reinforcement Learning Playing Super Mario World Stanford University Autonomous Helicopter

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

9. MATHEMATICIANS ARE FOND OF COLLECTIONS

9. MATHEMATICIANS ARE FOND OF COLLECTIONS get the complete book: http://wwwonemathematicalcatorg/getfulltextfullbookhtm 9 MATHEMATICIANS ARE FOND OF COLLECTIONS collections Collections are extremely important in life: when we group together objects

More information

Competition Between Reinforcement Learning Methods in a Predator-Prey Grid World

Competition Between Reinforcement Learning Methods in a Predator-Prey Grid World Competition Between Reinforcement Learning Methods in a Predator-Prey Grid World Jacob Schrum (schrum2@cs.utexas.edu) Department of Computer Sciences University of Texas at Austin Austin, TX 78712 USA

More information

Lottery Looper. User Manual

Lottery Looper. User Manual Lottery Looper User Manual Lottery Looper 2.2 copyright Timersoft. All rights reserved. http://www.timersoft.com The information contained in this document is subject to change without notice. This document

More information

Short-Cut MCMC: An Alternative to Adaptation

Short-Cut MCMC: An Alternative to Adaptation Short-Cut MCMC: An Alternative to Adaptation Radford M. Neal Dept. of Statistics and Dept. of Computer Science University of Toronto http://www.cs.utoronto.ca/ radford/ Third Workshop on Monte Carlo Methods,

More information

(Refer Slide Time: 1:40)

(Refer Slide Time: 1:40) Computer Architecture Prof. Anshul Kumar Department of Computer Science and Engineering, Indian Institute of Technology, Delhi Lecture - 3 Instruction Set Architecture - 1 Today I will start discussion

More information

Simulated Transfer Learning Through Deep Reinforcement Learning

Simulated Transfer Learning Through Deep Reinforcement Learning Through Deep Reinforcement Learning William Doan Griffin Jarmin WILLRD9@VT.EDU GAJARMIN@VT.EDU Abstract This paper encapsulates the use reinforcement learning on raw images provided by a simulation to

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

THE preceding chapters were all devoted to the analysis of images and signals which

THE preceding chapters were all devoted to the analysis of images and signals which Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to

More information

A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY

A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY A GRAPH FROM THE VIEWPOINT OF ALGEBRAIC TOPOLOGY KARL L. STRATOS Abstract. The conventional method of describing a graph as a pair (V, E), where V and E repectively denote the sets of vertices and edges,

More information

Introduction to Deep Q-network

Introduction to Deep Q-network Introduction to Deep Q-network Presenter: Yunshu Du CptS 580 Deep Learning 10/10/2016 Deep Q-network (DQN) Deep Q-network (DQN) An artificial agent for general Atari game playing Learn to master 49 different

More information

LAO*, RLAO*, or BLAO*?

LAO*, RLAO*, or BLAO*? , R, or B? Peng Dai and Judy Goldsmith Computer Science Dept. University of Kentucky 773 Anderson Tower Lexington, KY 40506-0046 Abstract In 2003, Bhuma and Goldsmith introduced a bidirectional variant

More information

15-451/651: Design & Analysis of Algorithms October 11, 2018 Lecture #13: Linear Programming I last changed: October 9, 2018

15-451/651: Design & Analysis of Algorithms October 11, 2018 Lecture #13: Linear Programming I last changed: October 9, 2018 15-451/651: Design & Analysis of Algorithms October 11, 2018 Lecture #13: Linear Programming I last changed: October 9, 2018 In this lecture, we describe a very general problem called linear programming

More information

Q-learning with linear function approximation

Q-learning with linear function approximation Q-learning with linear function approximation Francisco S. Melo and M. Isabel Ribeiro Institute for Systems and Robotics [fmelo,mir]@isr.ist.utl.pt Conference on Learning Theory, COLT 2007 June 14th, 2007

More information

Convolutional Restricted Boltzmann Machine Features for TD Learning in Go

Convolutional Restricted Boltzmann Machine Features for TD Learning in Go ConvolutionalRestrictedBoltzmannMachineFeatures fortdlearningingo ByYanLargmanandPeterPham AdvisedbyHonglakLee 1.Background&Motivation AlthoughrecentadvancesinAIhaveallowed Go playing programs to become

More information

Breaking Grain-128 with Dynamic Cube Attacks

Breaking Grain-128 with Dynamic Cube Attacks Breaking Grain-128 with Dynamic Cube Attacks Itai Dinur and Adi Shamir Computer Science department The Weizmann Institute Rehovot 76100, Israel Abstract. We present a new variant of cube attacks called

More information

Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning

Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning Jan Peters 1, Stefan Schaal 1 University of Southern California, Los Angeles CA 90089, USA Abstract. In this paper, we

More information

Robotics. Lecture 5: Monte Carlo Localisation. See course website for up to date information.

Robotics. Lecture 5: Monte Carlo Localisation. See course website  for up to date information. Robotics Lecture 5: Monte Carlo Localisation See course website http://www.doc.ic.ac.uk/~ajd/robotics/ for up to date information. Andrew Davison Department of Computing Imperial College London Review:

More information

Unit 1 Algebraic Functions and Graphs

Unit 1 Algebraic Functions and Graphs Algebra 2 Unit 1 Algebraic Functions and Graphs Name: Unit 1 Day 1: Function Notation Today we are: Using Function Notation We are successful when: We can Use function notation to evaluate a function This

More information