Learning in Games via Opponent Strategy Estimation and Policy Search

Yavar Naddaf
Department of Computer Science
University of British Columbia, Vancouver, BC
yavar@naddaf.name

Nando de Freitas (Supervisor)
Department of Computer Science
University of British Columbia, Vancouver, BC
nando@cs.ubc.ca

Abstract

In this paper, we model the problem of learning to play a game as a POMDP. The agent's hidden state is the opponent's current strategy and the agent's action is to choose a corresponding strategy. We apply particle filters to estimate the opponent's strategy and use policy search to find a good strategy for the AI agent.

1 Introduction

Despite the advances in the artificial intelligence and machine learning fields, game AI in the video game industry is still mainly implemented by creating large FSMs [3]. The time and labour required to implement and fine-tune these FSMs grow quickly as games become more complicated. Also, since FSMs are by nature deterministic, the AI's actions usually become predictable to players, which results in a poor gaming experience [5]. Here, we demonstrate how a combination of particle filtering and reinforcement learning methods can be used so that the game AI estimates the player's strategy and computes its own strategy accordingly. Instead of coding a single large FSM for all players, the idea is to learn a different FSM for each individual player, fine-tuned to her strategy.

2 Model Specification

2.1 Game Features

Our assumption is that in each instance of the game, the human player chooses her actions based on some features of the game environment. For a game environment X, we define the feature vector F(X) = [f_1(X), f_2(X), ..., f_m(X)], where each feature f_j(X) captures some aspect of the game. These features vary between games and are picked by human experts. For instance, in a strategy game the features can capture the number of the opponent's attack units, the amount of resources left in the game, or the time passed since the opponent's last attack.
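To make this concrete, the following minimal Python sketch shows one way such a feature vector could be computed. The particular features, the dictionary keys, and the scaling constants are hypothetical illustrations rather than part of the paper's model; the only assumption carried over is that F(X) is a fixed-length numeric vector chosen by a human expert.

    import numpy as np

    def extract_features(state):
        # F(X) = [f_1(X), ..., f_m(X)]: each entry captures one aspect of the game.
        # The keys and scales below are illustrative; a real game would expose its own quantities.
        return np.array([
            state["opponent_attack_units"] / 100.0,   # opponent's attack units (roughly normalized)
            state["resources_left"] / 1000.0,         # resources remaining in the game
            state["turns_since_last_attack"] / 50.0,  # time since the opponent's last attack
        ])

    example_state = {"opponent_attack_units": 12, "resources_left": 430, "turns_since_last_attack": 3}
    print(extract_features(example_state))            # a length-3 feature vector F(X)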

2.2 Modeling players as classifiers

Given the game features F(X), we can model the human player as a classifier, where at time t the probability of action y_t is given by

    p(y_t | F(X_t), θ_op) = φ(F(X_t), θ_op)

Here φ is a parametric classification function, such as a neural network or ridge regression, with parameters θ_op. The main idea is that by estimating θ_op, we can predict the opponent's actions in different game settings. Furthermore, the AI player too can be modeled as a classifier, with parameters θ_ai. In this setting, learning to play a game against an opponent is reduced to finding a good set of θ_ai given our estimate of θ_op.

3 Opponent strategy estimation via particle filtering

3.1 General approach

To estimate the opponent's strategy, we employ a particle filtering approach suggested by Hojen-Sorensen, de Freitas and Fog for on-line binary classification [6]. This method is well suited to our problem because it is able to compute the class memberships as the classes themselves evolve over time. This allows us to estimate a non-stationary strategy which varies through the game. (There is a significantly more efficient alternative to what we use here, introduced by Andrieu, de Freitas and Doucet, in which instead of directly sampling θ_t, an augmented variable of much smaller dimension is sampled; see [1].)

We adopt a Hidden Markov Model in which, at time t, the opponent's classification parameters θ_t are the hidden state and the opponent's action y_t is the observation. We further assume that the classification parameters follow a random walk

    θ_t = θ_{t-1} + μ_t,   where μ_t ~ N(0, δ² I)

In a generic importance sampling algorithm, we start with N prior samples θ_0^(i). On each time step, we predict N new samples θ̂_t^(i) ~ p(θ_t | θ_{t-1}^(i)) and set the importance weight of each sample to its likelihood, ω_t^(i) = p(y_t | θ̂_t^(i)). We then eliminate the samples with low importance weights (small likelihood) and multiply the samples with high importance weights (large likelihood) to obtain N random samples θ_t^(i). To avoid a skewed importance-weight distribution, in which many particles have no children and some have a large number of children, we also introduce an MCMC step. (See [4] and [2] for a detailed introduction to particle filtering and MCMC methods.)

3.2 Binomial likelihood

Let us first consider a game with only two possible actions {0, 1}. We choose φ(F(X_t), θ_t) = P(1 | F(X_t)), i.e. given the game features F(X_t), our classification function returns the probability that the opponent will carry out action 1. Similarly, P(0 | F(X_t)) = 1 − φ(F(X_t), θ_t). The likelihood of the opponent's action y_t is given by a binomial distribution:

    P(y_t | F(X_t), θ_t) = φ(F(X_t), θ_t)^{y_t} (1 − φ(F(X_t), θ_t))^{1 − y_t}
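As a minimal sketch of the binomial case, the code below evaluates this likelihood assuming a logistic classification function (the same choice the paper later makes in Section 5.2); any other parametric φ could be swapped in, and the feature and parameter values shown are arbitrary examples.

    import numpy as np

    def phi(features, theta):
        # Logistic classification function: probability of action 1 given F(X_t).
        return 1.0 / (1.0 + np.exp(-np.dot(theta, features)))

    def binomial_likelihood(action, features, theta):
        # P(y_t | F(X_t), theta_t) = phi^y * (1 - phi)^(1 - y) for y in {0, 1}.
        p1 = phi(features, theta)
        return p1 if action == 1 else 1.0 - p1

    # Example: likelihood of observing action 1 under one particular parameter sample.
    features = np.array([0.2, 0.7, 0.1])
    theta = np.array([1.5, -0.5, 0.3])
    print(binomial_likelihood(1, features, theta))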

3.3 Multinomial likelihood

In a game with more than two actions {0, 1, ..., K}, a possible approach is to use a set of parameter vectors θ_t^(1:K) and let the probability that the opponent will play action i be P(y_t = i | F(X_t)) = φ(F(X_t), θ_t^(i)). However, this approach does not ensure that the probabilities of all actions sum to 1. To make the action probabilities sum to 1, we use a softmax function, so that the probability of action i becomes:

    P(y_t = i | F(X_t), θ_t) = φ(F(X_t), θ_t^(i)) / Σ_{j=1}^{K} φ(F(X_t), θ_t^(j))

3.4 Particle filtering algorithm

Now that we can compute the importance weight ω_t^(i) = p(y_t | F(X_t), θ̂_t^(i)) for each particle, we can use the particle filtering algorithm presented in Figure 1 to estimate the opponent's classification parameters.

Start with N prior samples θ_0^(i), i = 1, ..., N. On each time step t of the game:

1. Observe the game features F(X_t) and the opponent's action y_t.

2. Importance sampling step:
   For i = 1, ..., N, sample θ̂_t^(i) = θ_{t-1}^(i) + u_t.
   For i = 1, ..., N, set ω_t^(i) = p(y_t | F(X_t), θ̂_t^(i)).
   For i = 1, ..., N, normalize the importance weights: ω_t^(i) = ω_t^(i) / Σ_{j=1}^{N} ω_t^(j).

3. Selection step:
   Eliminate samples with low importance weights and multiply samples with high importance weights to obtain N random samples θ_t^(i).

4. MCMC step:
   For i = 1, ..., N:
     Sample v ~ U[0, 1].
     Sample the proposal candidate θ*_t^(i) = θ_{t-1}^(i) + u_t.
     If v ≤ min{1, p(y_t | F(X_t), θ*_t^(i)) / p(y_t | F(X_t), θ_t^(i))}, accept the move: θ_t^(i) = θ*_t^(i); otherwise reject the move and keep θ_t^(i).

Figure 1: Particle filtering algorithm for estimating the opponent's classification parameters
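The following Python sketch is one possible reading of Figure 1, using the softmax likelihood of Section 3.3 and multinomial resampling for the selection step. The particle count, the random-walk scale, and the exact form of the MCMC proposal are illustrative choices, not values prescribed by the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax_likelihood(action, features, thetas):
        # thetas has shape (K, m): one parameter vector per action.
        scores = 1.0 / (1.0 + np.exp(-thetas @ features))  # phi(F(X_t), theta^(k)) for each action k
        return scores[action] / scores.sum()                # P(y_t = action | F(X_t), theta_t)

    def particle_filter_step(particles, features, action, delta=0.05):
        # particles: (N, K, m) array of samples of theta_t; returns the resampled, rejuvenated set.
        N = particles.shape[0]

        # 2. Importance sampling: propagate through the random walk and weight by likelihood.
        proposed = particles + delta * rng.standard_normal(particles.shape)
        weights = np.array([softmax_likelihood(action, features, p) for p in proposed])
        weights /= weights.sum()

        # 3. Selection: multiply high-weight samples, eliminate low-weight ones.
        idx = rng.choice(N, size=N, p=weights)
        resampled = proposed[idx]

        # 4. MCMC step: propose a fresh random-walk move from the previous-step particles
        #    and accept it with a Metropolis ratio to rejuvenate the sample set.
        for i in range(N):
            candidate = particles[idx[i]] + delta * rng.standard_normal(particles.shape[1:])
            ratio = (softmax_likelihood(action, features, candidate)
                     / softmax_likelihood(action, features, resampled[i]))
            if rng.uniform() < min(1.0, ratio):
                resampled[i] = candidate
        return resampled

    # Example: track a 3-action opponent with a 4-dimensional feature vector.
    N, K, m = 200, 3, 4
    particles = rng.standard_normal((N, K, m))
    features = np.array([0.1, 0.5, 0.0, 0.3])
    particles = particle_filter_step(particles, features, action=2)
    print(particles.mean(axis=0))   # posterior mean estimate of theta_t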

4 Learning to play as a POMDP problem

4.1 Modeling the problem as a POMDP

If we model players as classifiers (as discussed in Section 2.2), learning to play a game can be achieved in two steps. First, we estimate the opponent's current classification parameters θ_op by looking at the history of the game features and the corresponding opponent actions. Second, once we have an estimate of the opponent's classification parameters (which enables us to predict the opponent's actions in different game settings), we try to find classification parameters for the AI player that maximize its expected reward.

These two steps can be modeled by a Partially Observable Markov Decision Process, in which the agent's current (hidden) state is the opponent's classification parameters, the agent's observations are the game features plus the opponent's action, and the agent's action is choosing its own classification parameters.

4.2 Policy Search

Policy search is a method for finding an optimal policy in a POMDP by computing

    π* = arg max_{π ∈ Π} V^π(s_0)

where Π is a set of policy functions π, each mapping the set of states S to the set of possible actions A, and V^π(s_0) is the expected reward of starting at state s_0 and following the policy π. In essence, we are searching a set of policy functions for the policy that maximizes the expected reward. Of course, we do not know the expected reward of each policy, but we can approximate it using a Monte Carlo approach. Assuming that we know the initial state s_0 and we have a finite horizon of N steps, we run M simulations to approximate V^π(s_0). On each simulation i, we start from s_0, choose an action a_0 based on the policy π, and end up in state s_1 according to the state-transition model p(s' | s, a), receiving the reward R(s_1). Then, we choose another action a_1 based on π and transition to s_2, receiving the reward R(s_2). We continue this until we reach the final state s_N. After M simulations, the Monte Carlo estimate of the expected reward of π is:

    V̂^π(s_0) = (1/M) Σ_{i=1}^{M} Σ_{t=0}^{N} γ^t R(s_t^(i))

where γ is the discount factor. Now that we can evaluate the expected reward of each policy, finding π* ∈ Π becomes an optimization problem which can be solved by any standard optimization method, such as gradient descent or simulated annealing.

A problem with the above approach is that in order to estimate the expected reward of each policy, we have to sample a set of random numbers to simulate the state transitions. (If the rewards are stochastic, as is the case in our example in Section 5, computing R(s_t) will also require sampling random numbers.) Using different random numbers to estimate each policy results in a high variance in V̂^π(s_0), and a poor π* will be found by the algorithm. The solution is to sample M random seeds at the beginning of the algorithm and use the same set of random seeds to estimate the expected reward of each policy. (This approach is based on Andrew Ng's Pegasus algorithm; see his thesis [7] for a comprehensive introduction to POMDPs, policy search methods, and Pegasus in particular.)
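A minimal sketch of this fixed-seed Monte Carlo evaluation is given below. The toy environment, the horizon, and the discount factor are placeholders invented for illustration; the point is only that every candidate policy is scored with the same frozen seeds, so that differences in V̂^π(s_0) reflect the policies rather than the simulation noise.

    import numpy as np

    class ToyEnv:
        # Stand-in environment (hypothetical): a 1-D state that drifts with the chosen action plus noise.
        def reset(self):
            return 0.0

        def step(self, state, action, rng):
            next_state = state + action + 0.1 * rng.standard_normal()
            reward = -abs(next_state)          # stochastic reward, through the noisy transition
            return next_state, reward

    def estimate_value(policy, env, seeds, horizon=20, gamma=0.95):
        # Monte Carlo estimate of V^pi(s_0): average discounted return over len(seeds) rollouts.
        # Reusing the same `seeds` for every candidate policy is the Pegasus-style variance fix.
        returns = []
        for seed in seeds:
            rng = np.random.default_rng(seed)  # all simulation randomness is drawn from this seed
            state = env.reset()
            total, discount = 0.0, 1.0
            for _ in range(horizon):
                action = policy(state)
                state, reward = env.step(state, action, rng)
                total += discount * reward
                discount *= gamma
            returns.append(total)
        return float(np.mean(returns))

    seeds = list(range(50))                    # frozen once, reused for every candidate policy
    env = ToyEnv()
    candidates = [lambda s: -0.5 * s, lambda s: -1.0 * s, lambda s: 0.0 * s]
    scores = [estimate_value(pi, env, seeds) for pi in candidates]
    print(scores, "best candidate:", int(np.argmax(scores)))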

5 Experiment: Sand-Wars

5.1 Game description

To test our POMDP player, we developed a simple strategy game called Sand-Wars. There are two players in the game, each controlling a castle and a catapult. On every turn, each player receives a bag of sand and is given three choices on what to do with it. The player can either upgrade her castle walls, upgrade her catapult, or throw the bag toward the other castle. Table 1 describes the outcome of each action. (Note that a wall upgrade is only effective for the current and the following turn. A similar time limit applies to upgrading the catapult, which is only effective for the very next turn.) Once both players have chosen their actions, both actions are carried out simultaneously. If a player hits the other's castle, she receives one point. At the end of N turns, the player with the higher score wins the game.

Table 1: Possible actions in Sand-Wars and their outcomes

    Upgrade the castle wall: the probability of being hit is reduced by w_1 in the current turn and by w_2 in the following turn.
    Upgrade the catapult: the probability of hitting the opponent is increased by c.
    Throw the sand bag: the opponent's castle is hit with probability P(hit) = f(player's catapult upgrade, opponent's castle wall upgrade).

5.2 Implementation

To apply the algorithms described in Sections 3 and 4 to the Sand-Wars game, we chose the game features to be a vector of four elements: F(X) = [player's wall upgrade, player's catapult upgrade, opponent's wall upgrade, opponent's catapult upgrade], with all values normalized to lie between 0 and 1. We used a logistic function as our parametric classification function, so that:

    φ(F(X), θ) = 1 / (1 + e^{−θ·F(X)})

Finally, we chose simulated annealing as the optimization method for the policy search algorithm.
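For concreteness, a single Sand-Wars turn could be resolved roughly as in the sketch below. The paper does not give the values of w_1, w_2, c, the base hit probability, or the exact form of f, so the constants here are hypothetical placeholders and the one-turn carry-over of the wall upgrade (w_2) is omitted; only the broad structure of Table 1 (three actions, simultaneous resolution, one point per hit) is followed.

    import numpy as np

    W1, C, BASE = 0.3, 0.2, 0.5        # hypothetical effect sizes and base hit probability
    rng = np.random.default_rng(1)

    def hit_probability(catapult_bonus, wall_bonus):
        # P(hit) = f(attacker's catapult upgrade, defender's castle wall upgrade), clipped to [0, 1].
        return float(np.clip(BASE + catapult_bonus - wall_bonus, 0.0, 1.0))

    def resolve_turn(actions, wall, catapult, scores):
        # actions[i] in {"wall", "catapult", "throw"}; both choices are resolved simultaneously.
        for i, act in enumerate(actions):
            if act == "wall":
                wall[i] = W1           # being hit becomes less likely (the w_2 carry-over is omitted here)
            elif act == "catapult":
                catapult[i] = C        # hitting the opponent becomes more likely
        for i, act in enumerate(actions):
            if act == "throw":
                j = 1 - i              # the other player's castle
                if rng.uniform() < hit_probability(catapult[i], wall[j]):
                    scores[i] += 1     # one point per successful hit
        return scores

    print(resolve_turn(["throw", "wall"], wall=[0.0, 0.0], catapult=[0.0, 0.0], scores=[0, 0]))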

5.3 Results

We played our POMDP player against a number of hand-coded opponents for a few thousand rounds. In general, the algorithm was able to predict the opponent's actions with small variance within a few hundred rounds. The player's policy also usually improved significantly. However, policy search did not always find the optimal policy by the end of the game. A big challenge was finding good parameters for the simulated annealing algorithm (e.g. the initial temperature, the cooling rate, the number of optimization steps, etc.). Improved results may be possible by either finding better parameters for the simulated annealing algorithm or using a different optimization method.

Figure 2-(a) shows the particle filter's accumulated error in predicting the opponent's actions when playing against a conservative player for 5000 rounds. A conservative player is a hard-coded player who always upgrades her castle walls if no wall upgrade is in place, and throws the sand bag otherwise. Figure 2-(b) shows the reward obtained in the same game. We started the policy-search algorithm at the 1000th round and re-ran it every 50 rounds.

Figure 2: Playing vs. the conservative player. (a) Accumulated prediction error. (b) POMDP player's reward.

In the game plotted in Figure 2, the POMDP player learned to repeatedly upgrade the catapult and then throw the sand bag, which seemed like a reasonable strategy given our values of w_1, w_2, and c. In later experiments, we changed w_1, w_2, and c so that the catapult upgrade was less beneficial. The POMDP player then gave up on the catapult upgrades and learned to simply throw the sand bag on every turn.

6 Conclusion

In this paper we modeled the problem of learning to play a game as a POMDP, and applied particle filtering and policy search to try to solve it. We demonstrated our approach on a simple strategy game with some promising results.

References

[1] Andrieu, C., de Freitas, N., Doucet, A. (2001). Rao-Blackwellised particle filtering via data augmentation. Advances in Neural Information Processing Systems, 13.
[2] Andrieu, C., de Freitas, N., Doucet, A., Jordan, M. (2001). An introduction to MCMC for machine learning. Machine Learning, 50, 5-43.
[3] Buckland, M. (2002). AI Techniques for Game Programming. Cincinnati: Thomson Course Technology.
[4] Doucet, A., de Freitas, N., Gordon, N. (2001). An introduction to sequential Monte Carlo methods. Sequential Monte Carlo Methods in Practice, 3-14.
[5] Fairclough, C., Fagan, M., Namee, B., Cunningham, P. (2001). Research directions for AI in computer games. Proceedings of the Twelfth Irish Conference on Artificial Intelligence and Cognitive Science, 333-344.
[6] Hojen-Sorensen, P., de Freitas, N., Fog, T. (2000). On-line probabilistic classification with particle filters. Neural Networks for Signal Processing X: Proceedings of the 2000 IEEE Signal Processing Society Workshop, 1, 386-395.
[7] Ng, A. Y. (2003). Shaping and Policy Search in Reinforcement Learning. PhD thesis, EECS, University of California, Berkeley.