Ball. Player X. Player O. X Goal. O Goal

Size: px

Start display at page:

Download "Ball. Player X. Player O. X Goal. O Goal"

Griselda Ryan
5 years ago
Views:

1 Generlizing Adversril Reinforcement Lerning Willim T. B. Uther nd Mnuel M. Veloso Computer Science Deprtment Crnegie Mellon University Pittsburgh, PA Abstrct Reinforcement Lerning hs been used for number of yers in single gent environments. This rticle reports on our investigtion of Reinforcement Lerning techniques in multi-gent nd dversril environment with continuous observble stte informtion. Our frmework for evluting lgorithms is two-plyer hexgonl grid soccer. We introduce n extension to Prioritized Sweeping tht llows generliztion of lernt knowledge over neighboring sttes in the domin nd we introduce n extension to the U Tree generlizing lgorithm tht llows the hndling of continuous stte spces. Introduction Multi-gent dversril environments hve trditionlly been ddressed s gme plying situtions. Indeed, one of the rst res studied in Articil Intelligence ws gme plying. For exmple, the pioneering checkers plying lgorithm by Smuel (1959) used both serch nd mchine lerning strtegies. Interestingly, his pproch is similr to modern Reinforcement Lerning techniques (summrized in Kelbling, Littmn, & Moore 1996) in tht both lern n evlution function bsed on experience nd Mrkov ssumption bout the world. In the Reinforcement Lerning prdigm n gent is plced in sitution without knowledge of ny gols or other informtion bout the environment. As the gent cts in the environment it is given feedbck: reinforcement vlue or rewrd tht denes the utility of being in the current stte. Over time the gent is supposed to customize its ctions to the environment so s to mximize the sum of this rewrd. By only giving the gent rewrd when gol is reched, the gent lerns to chieve its gols. Since Smuel's work however, Reinforcement Lerning techniques were not used gin in n dversril setting until quite recently. In n dversril setting there re multiple (t lest two) gents in the world. In gme with two plyers, when n gent wins gme it is given positive reinforcement nd its opponent is given negtive reinforcement. Mximizing rewrd corresponds directly to winning gmes. Over time the gent is lerning to ct so tht it wins the gme. Trditionl Reinforcement Lerning lgorithms rely on prior discretiztion nd do not generlize over multiple sttes in tht generliztion. In this pper we introduce n extension to Prioritized Sweeping (Moore & Atkeson 1993), Fitted Prioritized Sweeping, nd modiction of the U Tree lgorithm (McCllum 1995), Continuous U Tree, s exmples of lgorithms tht generlize over multiple sttes. Tesuro (1995) nd Thrun (1995) hve both used neurl nets in Reinforcement Lerning prdigm. Tesuro's work in the gme of bckgmmon ws successful, but required hnd tuned fetures being fed to the lgorithm for high qulity ply. Thrun ws modertely successful in using similr techniques in chess, but these techniques were not s successful s they hd been in the bckgmmon domin. This work hs been repeted in other domins, but gin, without the sme success s in the bckgmmon domin (Kelbling, Littmn, & Moore 1996). We use similr environment to the one used by Littmn (1994) to investigte Mrkov gmes. Our environment is lrger, both in number of sttes nd number of ctions per stte. This increses our bility to test the generliztion cpbilities of our lgorithms. Neither of the lgorithms we test perform ny internl serch or lookhed when deciding ctions; they use just the current stte nd their lernt evlution for tht stte. While serch would improve performnce, we considered it orthogonl to lerning the evlution function. We intend to explore methods of combining serch with these techniques in future work. Hexcer: The Adversril Lerning Environment As substrte to our investigtion, we introduce hexgonl grid bsed soccer simultion, Hexcer, which is similr to the gme frmework used by Littmn (1994) to test Minimx Q Lerning. Hexcer consists of bord with hexgonl grid with 53 cells, two plyers nd bll (see Figure 1). Ech plyer cn be in

2 Bll X Gol Plyer X Plyer O O Gol Figure 1: The Hexcer bord ny cell not lredy occupied by the other plyer. The bll is either in the center of the bord, or is held by one of the plyers. This gives totl of 8268 distinct sttes in the gme. The two plyers strt in xed positions on the bord, s shown. The gme then proceeds in rounds. During ech round the plyers mke simultneous moves to neighboring cell in ny of the six possible directions. Plyers must specify direction in which to move, but if plyer ttempts to move o the edge of the grid, it remins in the sme cell. Once one plyer moves onto the bll, the bll stys with tht plyer until stolen by the other plyer. When the two plyers try to move onto the sme cell one of them succeeds nd the other fils, remining in its originl position. The choice of who moves into the contested cell is usully mde rndomly with ech plyer hving n equl chnce. The single exception is when one of the plyers is lredy in tht cell (for exmple, if it tried to move into wll nd so didn't move). In this cse the plyer lredy occupying the cell is considered to win nd remins in tht cell. The bll, if it ws in the control of one of the plyers, moves into the contested cell regrdless of which plyer wins nd so my chnge hnds. Plyers score by tking the bll into their opponents gol. When the bll rrives in gol the gme ends. The plyer who owns the gol loses the gme nd gets negtive rewrd. Their opponent, usully the plyer tht took the bll into the gol, receives n equl mgnitude, but positive, rewrd. It does not mtter which plyer took the bll into the gol; tht is, if I tke the bll into my own gol my opponent receives positive rewrd nd I receive negtive rewrd. The hexcer gme provides n interesting environment for studying the use of Reinforcement Lerning. Ech plyer observes its own position, the position of the bll nd the position of its dversry. Initilly, it does not know tht the objective of the gme is to tke the bll to gol position. Reinforcement lerning seems the pproprite technique to cquire the necessry ction selection informtion for ech stte. Reinforcement Lerning Revisited In the mchine lerning formultion of Reinforcement Lerning (Kelbling, Littmn, & Moore 1996) there re discrete set of sttes, s, nd ctions,. The gent cn detect its current stte, nd in ech stte cn choose to perform n ction which will in turn move it to its next stte. For ech stte/ction/resulting stte triple, 0 (s; ; s ) there is reinforcement signl, R(s; ; s 0 ). If ction by the gent when it is in stte s leds to stte s 0, the gent receives the rewrd ssocited with tht stte/ction/resulting stte triple, R(s; ; s 0 ). The world is ssumed to be Mrkov. Tht is, the current stte denes ll relevnt informtion bout the world. There is no predictive power gined by knowing the gent's history in rriving t the current stte. This does not men the world must be deterministic. It is quite possible tht the mpping from stte/ction pirs to next sttes is probbilistic. Tht is, for ech stte/ction pir, (s; ), there is probbility distribution, P(s;)(s 0 ), giving the probbility of reching prticulr successor stte, s 0, from the stte s when ction is performed in tht stte. Becuse the opponent is not necessrily deterministic, this is the cse in Hexcer.

3 As stted bove, the gol is for the gent \to mximize its rewrd over time." To keep this sum of future rewrd bounded, we discount future rewrds. A discount fctor,, 0 < < 1, is dded to the sum giving: P G = 1 t=1 t R t. At ech step the gent is supposed to behve in wy tht mximizes this sum of expected future discounted rewrd. cn be interpreted in different wys. It corresponds to there being chnce, probbility (1? ), tht the world ends between ech move. It cn lso be seen s n interest rte, or just s trick to mke the sum bounded. In our experiments, = 0:95. The foundtions of Reinforcement Lerning re the Bellmn Equtions: X Q(s; ) = 0 0 P(s;)(s )(R(s; ; s ) + 0 V (s )) (1) s 0 V (s) = mx(q(s; )) (2) These equtions dene Q function, Q(s; ), nd vlue function, V (s). The Q function is function from stte/ction pirs to n expected sum of discounted rewrd (Wtkins & Dyn 1992). The result is the expected discounted rewrd for executing tht ction in tht stte then behving optimlly from then on. The vlue function is function from sttes to sums of discounted rewrd. It is the expected sum of discounted rewrd for behving optimlly in tht stte s well s from then on. Prioritized Sweeping (Moore & Atkeson 1993) builds model of the world by recording stte trnsition probbilities nd the rewrd ssocited with ech stte nd ction. Using this model nd the Bellmn equtions Prioritized Sweeping cn clculte Q vlues directly. Re-solving the Bellmn equtions entirely t every timestep is too expensive even if ll the probbilities re known. Prioritized Sweeping proceeds by updting rst those entries those inputs hve chnged most. As the gent moves through the environment stte trnsition grph is built (see Figure 2). Ech time the gent mkes trnsition from stte s to stte s 0 by executing ction, the trnsition count on the link from s to s 0 lbelled is incremented. At the sme time the rewrd for tht trnsition is recorded if not lredy known. Q(s; ) cn then be clculted directly for the stte s using the Bellmn equtions. The trnsition probbilities from stte s, P(s;)(s 0 ) cn be clculted from the trnsition counts when we updte Q(s; ). We performed two stte vlue updtes per simultion step. One updte ws performed fter ech step to reect the chnge in trnsition probbilities fter tht step. The second updte ws performed on the stte whose input hs chnged most since it ws lst updted to propgte the chnge in Q vlues. Ech time Q vlue chnges, sy Q(s 0 ; ), its stte/ction pir, (s 0 ; ), is entered into priority queue with its priority being the mgnitude of the chnge. Ech timestep the gent cn pull the top pir from the queue nd updte the vlue of the stte, V (s 0 ). This chnge in vlue for tht stte leds to chnge in the Q vlue for trnsitions into tht stte, in our exmple Q(s; ) nd Q(s; 0 ). If those Q vlues chnge, their stte/ction pirs re entered into the priority queue nd the wve of updtes continues. Generlizing Algorithms Mny previous Reinforcement Lerning lgorithms hve mjor wekness: they cnnot generlize experience over multiple sttes. In single gent environment this is not good, but in multi-gent environment it is disstrous. Imgine gme of Hexcer where our gent hs the bll in the center of the eld nd the opposing gent is well behind us. By moving directly towrds the gol we cn score every time regrdless of the opponent's exct position. Unfortuntely, chnging the opponent's position t ll, even by the smllest mount, will plce our gent in new stte. If it hs never been in this prticulr stte before then it will not know how to behve. If the opponent's position is irrelevnt then the gent should relise this nd generlize over ll those sttes tht only dier in opponent position, lerning the concept of \opponent behind". One nive generliztion method is to tke simple model-free lgorithm like Q Lerning nd replce the tble of vlues with generlizing function pproximtor (e.g. Neurl Net). This hs been shown to be unstble in some cses (Boyn & Moore 1995) never converging to solution, lthough some good results hve been obtined with this method (Tesuro 1995). Stble Methods exist for using either Neurl Nets (Bird 1995) or Decision Trees (McCllum 1995). First we introduce stble method which uses both tble nd generlizing function pproximtor. Then McCllum's (1995) method using Decision Trees is described nd we extend it to hndle continuous stte vlues. Finlly we compre the methods. Fitted Prioritized Sweeping As pointed out by Boyn & Moore (1996), generlizing function pproximtors re not problem if the pproximtion is not iterted, i.e., if n pproximtion is not used to updte itself. Fitted Prioritized Sweeping mkes use of this result by using stndrd Prioritized Sweeping s bse, nd then doing the generliztion fterwrds. At ech timestep during the gme, the gent updtes stndrd Prioritized Sweeping Q-function over stndrd discretiztion of the stte spce. To choose its ctions however, it reds the Q-vlues from generlizing function pproximtor tht ws lernt from the previous gme. At the end of ech gme the set of Q vlues generted by the Prioritized Sweeping lgorithm for ll visited stte/ction pirs is used s the input to updte the generlizing function pproximtor.

4 ' s'' s ' s' s''' Figure 2: The Grph stored by Prioritized Sweeping Here the pproximtor is used to decide the gent's ctions, but ll dt tht is used s input to the pproximtor is dt gthered from the rel world. The pproximtor decides where to smple, but the vlue of the smple itself is not pproximte. No pproximte vlues re recycled s input to the pproximtor directly. In our implementtion piecewise liner function pproximtor ws used, lthough ny generlizing function pproximtor could be used. The piecewise liner pproximtion ws built up using recursive splitting technique similr to decision tree. At ech stge in the recursion the best wy of splitting the dt is found. If this split gives better t thn hving no split then the split is ccepted. Ech prt of the dt is then t recursively in similr mnner forming tree of splits. In the nl tree, the leves contin lest squres liner ts of the dt tht fll in tht lef. Initilly there is single lef in the tree with constnt vlue of 0. Ech time the tree is updted, the leves of the tree re further divided. Erly splits re never reconsidered. Figure 3 contins the Fitted Prioritized Sweeping Algorithm. During Gme: Updte Prioritized Sweeping tble entry Choose move using liner pproximtion tree for Q vlues After Gme: Updte Prioritized Sweeping tble till chnges fll below threshold Grow liner pproximtion tree using Prioritized Sweeping Q tble entries Figure 3: The Fitted Prioritized Sweeping Algorithm Ech stte ws represented by vector of ttributes. For Hexcer there were 7 ttributes. These were the x nd y coordintes of ech plyer, the dierence in x nd y coordintes between the two plyers, nd the loction of the bll (On X, on O or in the middle of the bord). The vlues in the Q tble were sorted by ech ttribute in turn. For ech ttribute, the hlfwy point between every djcent pir of vlues ws considered s possible split point. A liner t ws mde of the dt on ech side of this split point nd the 2 sttistic of the combined ts ws compred to the 2 sttistic of single liner t for ll the dt. If the split with the best combined 2 sttistic is better thn no split then tht split is recorded nd the splitting proceeds recursively. Continuous U Tree The U Tree lgorithm (McCllum 1995) is method for using decision or regression tree (Breimn et l. 1984; Quinln 1986) insted of tble of vlues in the Prioritized Sweeping lgorithm. In the originl U Tree lgorithm the stte spce is described in terms of discrete ttributes. If continuous vlue is needed it is split over multiple ttributes in mnner nlogous to using one ttribute for ech bit of the continuous vlue. We developed \Continuous U Tree" lgorithm, similr to U Tree, but which hndles continuous stte spces directly. Like U Tree, Continuous U Tree cn hndle modertely high dimensionl stte spces. The originl U Tree work uses this cpbility to remove the Mrkov ssumption. As we were plying Mrkov gme we did not implement this prt of the U Tree lgorithm in Continuous U Tree lthough there is no reson why this could not be done. Continuous U Tree is dierent from the Prioritized Sweeping lgorithms previously mentioned in tht it does not require prior discretiztion of the world into seprte sttes. The lgorithm cn tke continuous stte vector nd utomticlly form its own discretiztion upon which ny Reinforcement Lerning lgorithm cn be used. We used Prioritized Sweeping over this new discretiztion. In the Continuous U Tree lgorithm there re two distinct concepts of stte. The rst type of stte is the full continuous stte of the world the gent is moving through; or t lest wht the gent cn see of it. We term this sensory input. The second type of stte is the position in the discretiztion being formed by the lgorithm. We will use the term stte in this sec-

5 ond sense. Ech stte is n re in the sensory input spce, nd ech sensory input flls within stte. In previous lgorithms these two concepts were identicl. Sensory input ws discrete nd ech dierent sensory input corresponded to stte. In order to form the discretiztion, dt must be sved from the gent's experience. Unfortuntely, s the gent is forming its own discretiztion, there is no prior discretiztion tht cn be used to ggregte the dt. The dtpoints must be sved individully with full sensor ccurcy. Ech dtpoint sved is vector of the sensory input t the strt I, the ction performed, the resulting sensory input I 0 nd the rewrd obtined for tht trnsition r, (I; ; I 0 ; r). As in Fitted Prioritized Sweeping the sensory input is itself vector of vlues, one for ech ttribute of the sensory input. Unlike Fitted Prioritized Sweeping these vlues cn be fully continuous. In Hexcer we used the sme sensory input vector for Continuous U Tree nd Fitted Prioritized Sweeping. The discretiztion formed is tree-structure found by recursive prtitioning of the dtpoints. Initilly the world is considered to be single stte with n expected rewrd, V (s), of 0. At ech stge in the recursive prtitioning we re-clculte expected rewrd vlues of ll the dtpoints nd if signicnt dierence between dtpoint vlues is found within stte tht stte is split in two. To clculte the vlue of dtpoint, q(i; ) we rely on the Bellmn equtions. Ech dtpoint ends in stte. Tht is, the resulting sensory input, I 0, of trnsition, (I; ; I 0 ; r), cn be pssed down the tree structure till it reches lef. Tht lef is stte in the discretiztion, sy s 0. Tht stte is considered to be the nl stte for the trnsition. Using the expected rewrd for tht stte, V (s 0 ), nd the recorded rewrd for the trnsition, r, we cn ssign vlue to the initil sensory input/ction pir: q(i; ) = r + V (s 0 ). For ech stte in the current discretiztion we cn then loop through ll possible single splits for tht stte nd choose the split tht mximizes the dierence between the dtpoint vlues, q(i; ), on either side of the split. This dierence is mesured using the Kolmogorov-Smirnov sttisticl test 1. If the most sttisticlly signicnt split point hs probbility of being rndom less thn some threshold (we used p < 0:01) then tht split is dded - forming two new sttes to replce the old one. Hving clculted vlues for the dtpoints nd used tht to discretize the world we fll bck into stndrd, discrete, Reinforcement Lerning problem: we need to nd Q vlues for our new sttes, Q(s; ). This is done by clculting stte trnsition probbilities from the given dt nd then using Prioritized Sweeping. If the initil sensory input nd the resulting sensory input for dtpoint re in the sme stte then 1 see Press et l. (1992) for more informtion on these sorts of tests. tht is considered self-trnsition. If the initil nd - nl sensory inputs re in two dierent sttes then tht is n exmple of the ction resulting in stte chnge. Figure 4 contins the Continuous U Tree lgorithm. During Gme: Pss the current sensory input, I, down the stte tree to nd the current stte in the discretiztion, s Use Q vlues for the stte s to choose n ction, Store trnsition dtpoint (I; ; I 0 ; r) After Gme: For ech lef: Updte dtpoint vlues, q(i; ), for ech dtpoint in tht lef Find best split point If split point is sttisticlly signicnt then split lef into two sttes Add ll sttes to priority queue Run Prioritized Sweeping over new sttes until ll chnges re below threshold Figure 4: The Continuous U Tree Algorithm Results We performed empiricl comprisons long two different dimensions. Firstly, how fst does ech lgorithm lern to ply? Secondly, wht level of expertise is reched, discounting \resonble" initil lerning period? In ech experiment pir of lgorithms plyed ech other t Hexcer for 1000 gmes. Wins were recorded for the rst 500 gmes nd the second 500 gmes. The rst 500 gmes llow us to mesure lerning speed s ll gents strt out with no knowledge of the gme. The second 500 gmes give n indiction of the level of bility of the lgorithm fter n initil lerning period. This 1000 gme test ws then repeted 20 times for ech pir of lgorithms. The results shown in ech tble re the number of gmes (men stndrd devition) tht the rst plyer listed wins. A check mrk ( p ) indictes sttisticl signicnce, i.e., the probbility rndom vrition would produce dierence this gret is less thn 1%. As well s being tested ginst ech other, we lso compre these lgorithms to Prioritized Sweeping which we showed in Uther & Veloso (1997) ws one of the better non-generlizing lgorithms. In Tble 1 Fitted Prioritized Sweeping is compred to stndrd Prioritized Sweeping. It lerns signicntly fster thn stndrd Prioritized Sweeping, nd keeps performing better even fter the initil gmes. Stndrd Prioritized Sweeping hsn't even nished

6 Prioritized Sweeping Fitted Prioritized Sweeping First 500 gmes % 15% % 15% Second 500 gmes % 16% % 16% Totl % 14% % 14% p p p Tble 1: Prioritized Sweeping vs. Fitted Prioritized Sweeping Prioritized Sweeping Continuous U Tree First 500 gmes % 19% % 19% p Second 500 gmes % 20% % 20% Totl % 18% % 18% p Tble 2: Prioritized Sweeping vs. Continuous U Tree lerning fter this short tril period. Fitted Prioritized Sweeping hs lernt to be n eective plyer with much less dt. While eective, the Fitted Prioritized Sweeping lgorithm still requires prior discretiztion of the world. It cnnot hndle continuous stte spces. Tble 2 shows tht Continuous U Tree performs better thn stndrd Prioritized Sweeping, lthough the dierences for the second 500 gmes re only signicnt t the 5% level. Tble 3 shows tht Continuous U Tree lso performs comprbly to Fitted Prioritized Sweeping. These dierences re not sttisticlly signicnt t the 1% level. However, the results for the totl re signicnt t the 5% level. Continuing Work We re currently extending this work long 4 dierent dimensions. Firstly, we expect the vlue function to be in n exponentil form. Trees with constnt vlue leves do not represent continuous functions very eciently. To solve this we re looking t embedding exponentil functions in the leves of the tree. Preliminry results show tht this system grows smller trees thn using constnt vlued leves. We re lso extending U Tree to hndle continuous ction spces. A gme similr to hexcer, but using fully continuous stte nd ction spce where ech plyer sets pir of polr coordintes for ech move hs been developed. An ction tree is plced in ech lef of the stte tree. Ech ction tree discrimintes between regions of the ction spce in tht stte tht hve dierent expected rewrds. Ech of these regions cn then be treted s seprte ction for the discrete Reinforcement Lerning lgorithm used. We re exploring the use of constructive induction in the stte tree to build bstrctions over the stte-spce nd so improve both generliztion nd plnning. We re intending to move towrds tems of gents on lrger bord to test the generliztion cpbilities of U Tree. Conclusion Introducing extr gents into domin increses the stte spce exponentilly. Not only does this ect lerning speed, but often the exct loction of the other gents is not relevnt. We show eective generlizing Reinforcement Lerning lgorithms to help reduce this stte spce size explosion. Both our generlizing lgorithms perform signicntly better thn non-generlizing lgorithms. Unlike nive use of generliztion, both lgorithms re gurnteed to converge. The second lgorithm, Continuous U Tree, does not require prior discretiztion of the stte spce. It performs comprbly to the Fitted Prioritized Sweeping lgorithm which uses discretiztion. References Bird, L. C Residul lgorithms: Reinforcement lerning with function pproximtion. In Prieditis, A., nd Russell, S., eds., Mchine Lerning: Proceedings of the Twelfth Interntionl Conference (ICML95), 30{37. Sn Mteo, CA: Morgn Kufmnn. Boyn, J. A., nd Moore, A. W Generliztion in reinforcement lerning: Sfely pproximting the vlue function. In Tesuro, G.; Touretzky, D. S.; nd Leen, T. K., eds., Advnces in Neurl Informtion Processing Systems, volume 7. Cmbridge, MA: The MIT Press. Boyn, J. A., nd Moore, A. W Lerning evlution functions for lrge cyclic domins. In Sitt, L., ed., Mchine Lerning: Proceedings of the Thirteenth Interntionl Conference (ICML96). Sn Mteo, CA: Morgn Kufmnn. Breimn, L.; Friedmn, J. H.; Olshen, R. A.; nd Stone, C. J Clssiction And Regression Trees. The Wdsworth Sttistics/Probbility Series. Monterey, Cliforni: Wdsworth nd Brooks/Cole Advnced Books nd Softwre. Kelbling, L. P.; Littmn, M. L.; nd Moore, A. W Reinforcement lerning: A survey. Journl of Articil Intelligence Reserch 4:237{285.

7 Fitted Prioritized Sweeping Continuous U Tree First 500 gmes % 20% % 20% Second 500 gmes % 15% % 15% Totl % 9% % 9% Tble 3: Fitted Prioritized Sweeping vs. Continuous U Tree Littmn, M. L Mrkov gmes s frmework for multi-gent reinforcement lerning. In Mchine Lerning: Proceedings of the Eleventh Interntionl Conference (ICML94), 157{163. Sn Mteo, CA: Morgn Kufmnn. McCllum, A. K Reinforcement Lerning with Selective Perception nd Hidden Stte. Phd. thesis, Deprtment of Computer Science, University of Rochester, Rochester, NY. Moore, A., nd Atkeson, C. G Prioritized sweeping: Reinforcement lerning with less dt nd less rel time. Mchine Lerning 13. Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; nd Flnnery, B. P Numericl Recipies in C: the rt of scientic computing. Cmbridge: Cmbridge University Press, 2nd edition. Quinln, J. R Induction of decision trees. Mchine Lerning 1:81{106. Smuel, A. L Some studies in mchine lerning using the gme of checkers. IBM Reserch Journl 3(3). Tesuro, G Prcticl issues in temporl dierence lerning. Mchine Lerning 8:257{277. Tesuro, G Temporl dierence lerning nd td-gmmon. Communictions of the ACM 38(3):58{ 67. Thrun, S Lerning to ply the gme of chess. In Tesuro, G., nd Touretzky, D. S., eds., Advnces in Neurl Informtion Processing Systems, volume 7. Cmbridge, MA: The MIT Press. Uther, W. T. B., nd Veloso, M. M Adversril reinforcement lerning. Journl of Articil Intelligence Reserch. To be submitted. Wtkins, C. J. C. H., nd Dyn, P Q-lerning. Mchine Lerning 8(3):279{292.

A New Learning Algorithm for the MAXQ Hierarchical Reinforcement Learning Method

A New Learning Algorithm for the MAXQ Hierarchical Reinforcement Learning Method A New Lerning Algorithm for the MAXQ Hierrchicl Reinforcement Lerning Method Frzneh Mirzzdeh 1, Bbk Behsz 2, nd Hmid Beigy 1 1 Deprtment of Computer Engineering, Shrif University of Technology, Tehrn,