Learning in Games via Opponent Strategy Estimation and Policy Search

Yavar Naddaf
Department of Computer Science
University of British Columbia
Vancouver, BC
yavar@naddaf.name

Nando de Freitas (Supervisor)
Department of Computer Science
University of British Columbia
Vancouver, BC
nando@cs.ubc.ca

Abstract

In this paper, we model the problem of learning to play a game as a POMDP. The agent's hidden state is the opponent's current strategy, and the agent's action is to choose a corresponding strategy. We apply particle filters to estimate the opponent's strategy and use policy search to find a good strategy for the AI agent.

1 Introduction

Despite the advances in the artificial intelligence and machine learning fields, the game-AI in the video game industry is still mainly implemented by creating large FSMs [3]. The time and labour required to implement and fine-tune these FSMs grow rapidly as the games become more complicated. Also, since FSMs are by nature deterministic, the AI actions usually become predictable for the players, which results in a poor gaming experience [5]. Here, we demonstrate how a combination of particle filtering and reinforcement learning methods can be used so that the game-AI estimates the player's strategy and computes its own strategy accordingly. Instead of coding a single large FSM for all players, the idea is to learn a different FSM for each individual player, fine-tuned to her strategy.

2 Model Specification

2.1 Game Features

Our assumption is that in each instance of the game, the human player chooses her actions based on some features of the game environment. For some game environment X, we define the feature vector F(X) = [f_1(X), f_2(X), ..., f_m(X)], where each feature f_j(X) captures some aspect of the game. These features vary between different games and are picked by human experts. For instance, in a strategy game the features can capture the number of the opponent's attack units, the amount of resources left in the game, or the time passed since the opponent's last attack.
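As a concrete illustration, a feature vector for the strategy-game example above might be built as follows. This is a hypothetical sketch: the feature names, state dictionary keys, and normalization constants are invented for illustration and are not part of any particular game.

```python
import numpy as np

def feature_vector(game_state):
    """Map a game environment X to F(X) = [f_1(X), ..., f_m(X)].

    The features and scale factors are illustrative placeholders.
    """
    return np.array([
        game_state["enemy_attack_units"] / 100.0,      # opponent's attack units
        game_state["resources_left"] / 1000.0,         # resources left in the game
        game_state["turns_since_last_attack"] / 50.0,  # time since the last attack
    ])

state = {"enemy_attack_units": 40, "resources_left": 250, "turns_since_last_attack": 5}
F = feature_vector(state)  # F(X) = [0.4, 0.25, 0.1]
```
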
2.2 Modeling players as classifiers

Given the game features F(X), we can model the human player as a classifier, where at time t the probability of action y_t is given by

p(y_t | F(X_t), θ_op) = φ(F(X_t), θ_op)

Here φ is a parametric classification function, such as a neural network or ridge regression, with parameters θ_op. The main idea is that by estimating θ_op, we can predict the opponent's actions in different game settings. Furthermore, the AI player too can be modeled as a classifier, with parameters θ_ai. In this setting, learning to play a game against an opponent is reduced to finding a good set of θ_ai given our estimate of θ_op.

3 Opponent strategy estimation via particle filtering

3.1 General approach

To estimate the opponent's strategy, we employ a particle filtering approach suggested by Hojen-Sorensen, de Freitas and Fog for on-line binary classification [6]. This method is well suited to our problem because it is able to compute the class memberships as the classes themselves evolve over time. This allows us to estimate a non-stationary strategy which varies through the game¹. We adopt a Hidden Markov Model in which, at time t, the opponent's classification parameters θ_t are the hidden state and the opponent's action y_t is the observation. We further assume that the classification parameters follow a random walk

θ_t = θ_{t-1} + u_t,  where u_t ~ N(0, δ²I)

In a generic importance sampling algorithm, we start with N prior samples θ_0^(i). On each time step, we predict N new samples θ̂_t^(i) ~ p(θ_t | θ_{t-1}^(i)), and set the importance weight of each sample to its likelihood, ω_t^(i) = p(y_t | θ̂_t^(i)). We then eliminate the samples with low importance weights (small likelihood) and multiply the samples with high importance weights (large likelihood) to obtain N random samples θ_t^(i). To avoid a skewed importance-weight distribution, in which many particles have no children and some have a large number of children, we also introduce an MCMC step. (See [4] and [2] for a detailed introduction to particle filtering and MCMC methods.)

3.2 Binomial likelihood

Let us first consider a game with only two possible actions, {0, 1}. We choose φ(F(X_t), θ_t) = P(1 | F(X_t), θ_t), i.e.
given the game features F(X_t), our classification function returns the probability that the opponent will carry out action 1. Similarly, P(0 | F(X_t), θ_t) = 1 − φ(F(X_t), θ_t). The likelihood of the opponent's action y_t is given by a binomial distribution:

P(y_t | F(X_t), θ_t) = φ(F(X_t), θ_t)^{y_t} (1 − φ(F(X_t), θ_t))^{1−y_t}

3.3 Multinomial likelihood

In a game with more than two actions, {0, 1, ..., K}, a possible approach is to use a set of parameters θ^(1:K) and let the probability that the opponent will play action i be P(y_t = i | F(X_t), θ_t) = φ(F(X_t), θ_t^(i)). However, this approach does not ensure that the probabilities of all actions

¹There is a significantly more efficient alternative to what we use here, introduced by Andrieu, de Freitas and Doucet, in which instead of directly sampling θ_t, an augmented variable of much smaller dimension is sampled. See [1].
will sum to 1. To make the action probabilities sum to 1, we use a softmax function, so that the probability of action i becomes:

P(y_t = i | F(X_t), θ_t) = φ(F(X_t), θ_t^(i)) / Σ_{j=1}^{K} φ(F(X_t), θ_t^(j))

3.4 Particle Filtering algorithm

Now that we can compute the importance weight ω_t^(i) = p(y_t | F(X_t), θ̂_t^(i)) for each particle, we can use the particle filtering algorithm presented in Figure 1 to estimate the opponent's classification parameters.

Start with N prior samples θ_0^(i), i = 1, ..., N.
On each time step t of the game:
1. Observe the game features F(X_t) and the opponent's action y_t.
2. Importance sampling step:
   For i = 1, ..., N, sample θ̂_t^(i) = θ_{t-1}^(i) + u_t.
   For i = 1, ..., N, set ω_t^(i) = p(y_t | F(X_t), θ̂_t^(i)).
   For i = 1, ..., N, normalize the importance weights: ω̃_t^(i) = ω_t^(i) / Σ_{j=1}^{N} ω_t^(j).
3. Selection step:
   Eliminate the samples with low importance weights and multiply the samples with high importance weights to obtain N random samples θ_t^(i).
4. MCMC step:
   For i = 1, ..., N:
     Sample v ~ U[0, 1].
     Sample the proposal candidate θ*_t^(i) = θ_{t-1}^(i) + u_t.
     If v ≤ min{1, p(y_t | F(X_t), θ*_t^(i)) / p(y_t | F(X_t), θ_t^(i))}, accept the move: θ_t^(i) = θ*_t^(i); else reject the move and keep θ_t^(i).

Figure 1: Particle filtering algorithm for estimating the opponent's classification parameters

4 Learning to play as a POMDP problem

4.1 Modeling the problem as a POMDP

If we model players as classifiers (as discussed in Section 2.2), learning to play a game can be achieved in two steps:
First, we estimate the opponent's current classification parameters θ_op by looking at the history of the game features and the corresponding opponent actions. Then, once we have an estimate of the opponent's classification parameters (which enables us to predict the opponent's actions in different game settings), we try to come up with classification parameters for the AI player that maximize its expected reward. These two steps can be modeled by a Partially Observable Markov Decision Process, in which the agent's current (hidden) state is the opponent's classification parameters, the agent's observations are the game features plus the opponent's action, and the agent's action is choosing its own classification parameters.

4.2 Policy Search

Policy search is a method for finding an optimal policy in a POMDP by computing:

π* = arg max_{π ∈ Π} V^π(s_0)

where Π is a set of policy functions, each mapping the set of states S to the set of possible actions A, and V^π(s_0) is the expected reward of starting at state s_0 and following the policy π. In essence, what we are trying to do is to search in a set of policy functions for the policy that maximizes the expected reward. Of course, we do not know what the expected reward of each policy is, but we can approximate it using a Monte Carlo approach. Assuming that we know the initial state s_0 and we have a finite horizon of N steps, we run M simulations to approximate V^π(s_0). On each simulation i, we start from s_0, choose an action a_0 based on the policy π, and end up in state s_1 based on the state-transition model p(s' | s, a), receiving the reward R(s_1). Then, we choose another action a_1 based on π and transit to s_2, receiving the reward R(s_2). We continue this until we reach the final state s_N. After M simulations, the Monte Carlo estimate of the expected reward of π is:

V̂^π(s_0) = (1/M) Σ_{i=1}^{M} Σ_{t=0}^{N} γ^t R(s_t^(i))

where γ is the discount factor. Now that we can evaluate the expected reward of each policy, finding π* ∈ Π becomes an optimization problem which can be solved by any standard optimization method, such as gradient descent or simulated annealing.
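The Monte Carlo estimate of V^π(s_0) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy state space (integers), the transition model (a noisy shift), and the reward R(s) = s are all invented for demonstration.

```python
import random

def estimate_value(policy, s0, horizon, num_sims, gamma=0.95, seed=0):
    """Monte Carlo estimate of V^pi(s0): average discounted return over
    num_sims simulated trajectories of length `horizon`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_sims):
        s, ret = s0, 0.0
        for t in range(horizon):
            a = policy(s)                       # a_t = pi(s_t)
            s = s + a + rng.choice([-1, 0, 1])  # sample s_{t+1} from p(s' | s_t, a_t)
            ret += (gamma ** t) * s             # accumulate gamma^t * R(s_{t+1}), with R(s) = s
        total += ret
    return total / num_sims

# A policy that always moves "up" accumulates positive reward in this toy chain:
v = estimate_value(lambda s: 1, s0=0, horizon=10, num_sims=1000)
```

In the paper's setting, the policy would be parameterized by the AI player's classification parameters θ_ai, and the simulator would roll the game forward using the estimated opponent parameters.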
A problem with the above approach is that in order to estimate the expected reward of each policy, we have to sample a set of random numbers to simulate the state transitions². Using different random numbers to estimate each policy results in a high variance in V̂^π(s_0), and a poor π* will be found by the algorithm. The solution is to sample M random seeds at the beginning of the algorithm, and use the same set of random seeds to estimate the expected reward of each policy. (This approach is based on Andrew Ng's Pegasus algorithm; see his thesis [7] for a very comprehensive introduction to POMDPs, policy search methods, and in particular Pegasus.)

²If the rewards are stochastic, as is the case in our example in Section 5, finding R(s_t) will also require sampling random numbers.
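The fixed-seed evaluation, in the spirit of Pegasus [7], can be sketched as follows. Again the toy transition model and reward R(s) = s are invented for illustration; the point is only that every candidate policy is evaluated on the same M pre-drawn seeds.

```python
import random

def estimate_value_fixed_seeds(policy, s0, horizon, seeds, gamma=0.95):
    """Monte Carlo value estimate where simulation i always uses seeds[i]."""
    total = 0.0
    for seed in seeds:                          # one pre-drawn seed per simulation
        rng = random.Random(seed)
        s, ret = s0, 0.0
        for t in range(horizon):
            a = policy(s)
            s = s + a + rng.choice([-1, 0, 1])  # the same noise for every policy
            ret += (gamma ** t) * s             # R(s) = s, discounted
        total += ret
    return total / len(seeds)

master = random.Random(0)
seeds = [master.randrange(2**32) for _ in range(50)]  # M seeds, sampled once up front

# Re-evaluating a policy now gives exactly the same estimate, so differences
# between candidate policies are not masked by simulation noise:
v1 = estimate_value_fixed_seeds(lambda s: 1, 0, 10, seeds)
v2 = estimate_value_fixed_seeds(lambda s: 1, 0, 10, seeds)
assert v1 == v2
```
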
Table 1: Possible actions in Sand-Wars and their outcomes

ACTION | OUTCOME
Upgrade the castle wall | The probability of being hit is reduced by w_1 in the current turn and by w_2 in the following turn
Upgrade the catapult | The probability of hitting the opponent is increased by c
Throw the sand bag | The opponent's castle is hit with probability P(hit) = f(player's catapult upgrade, opponent's castle wall upgrade)

5 Experiment: Sand-Wars

5.1 Game description

To test our POMDP player, we developed a simple strategy game called Sand-Wars. There are two players in the game, each controlling a castle and a catapult. On every turn, each player receives a bag of sand and is given three choices on what to do with it. The player can either upgrade her castle walls, upgrade her catapult, or throw the bag toward the other castle. Table 1 describes the outcome of each action³. Once both players have chosen their actions, both actions are carried out simultaneously. If a player hits the other's castle, she receives one point. At the end of N turns, the player with the higher score wins the game.

5.2 Implementation

To apply the algorithms described in Sections 3 and 4 to the Sand-Wars game, we chose the game features to be a vector of four elements: F(X) = [player's wall upgrade, player's catapult upgrade, opponent's wall upgrade, opponent's catapult upgrade] (all values are normalized to be between 0 and 1). We used a logistic function as our parametric classification function, so that:

φ(F(X), θ) = 1 / (1 + e^{−θ·F(X)})

Finally, we chose simulated annealing as our optimization method for the policy search algorithm.

5.3 Results

We played our POMDP player against a number of hand-coded opponents for a few thousand rounds. In general, the algorithm was able to predict the opponent's actions with small variance within a few hundred rounds. Also, the player's policy usually improved significantly. However, policy search did not always find the optimal policy by the end of the game. A big challenge was finding good parameters for the simulated annealing algorithm (e.g.
the initial temperature, the cooling rate, the number of optimization steps, etc.). Improved results may be possible by either finding better parameters for the simulated annealing algorithm, or using a different optimization method. Figure 2-(a) shows the particle filter's accumulated error in predicting the opponent's actions when playing against a conservative player for 5000 rounds. A conservative

³Note that a wall upgrade is only effective for the current and the following turn. A similar time limit applies to upgrading the catapult, which is only effective for the very next turn.
player is a hard-coded player who always upgrades her castle walls if she has no wall upgrades in effect, and throws the sand bag otherwise. Figure 2-(b) shows the obtained reward in the same game. We started the policy-search algorithm at the 1000th round and re-ran it every 50 rounds.

[Figure 2: Playing vs. the conservative player. (a) Accumulated prediction error over 5000 time steps; (b) POMDP player's reward over 5000 time steps.]

In the game plotted in Figure 2, the POMDP player learned to repeatedly upgrade the catapult and then throw the sand bag, which seemed like a reasonable strategy given our values of w_1, w_2, and c. In later experiments, we changed w_1, w_2, and c so that the catapult upgrade was less beneficial. The POMDP player then gave up on the catapult upgrades and learned to simply throw the sand bag on all turns.

6 Conclusion

In this paper we modeled the problem of learning to play a game as a POMDP, and applied particle filtering and policy search to solve it. We demonstrated our approach on a simple strategy game, with some promising results.

References

[1] Andrieu, C., de Freitas, N., Doucet, A. (2001). Rao-Blackwellised particle filtering via data augmentation. Advances in Neural Information Processing Systems, 13.
[2] Andrieu, C., de Freitas, N., Doucet, A., Jordan, M. (2001). An introduction to MCMC for machine learning. Machine Learning, 50, 5–43.
[3] Buckland, M. (2002). AI Techniques for Game Programming. Cincinnati: Thomson Course Technology.
[4] Doucet, A., de Freitas, N., Gordon, N. (2001). An introduction to sequential Monte Carlo methods. Sequential Monte Carlo Methods in Practice, 3–14.
[5] Fairclough, C., Fagan, M., Namee, B., Cunningham, P. (2001). Research directions for AI in computer games. Proceedings of the Twelfth Irish Conference on Artificial Intelligence and Cognitive Science, 333–344.
[6] Hojen-Sorensen, P., de Freitas, N., Fog, T. (2000).
On-line probabilistic classification with particle filters. Neural Networks for Signal Processing X: Proceedings of the 2000 IEEE Signal Processing Society Workshop, 1, 386–395.
[7] Ng, A. Y. (2003). Shaping and Policy Search in Reinforcement Learning. PhD thesis, EECS, University of California, Berkeley.