A New Learning Algorithm for the MAXQ Hierarchical Reinforcement Learning Method


Farzaneh Mirzazadeh 1, Babak Behsaz 2, and Hamid Beigy 1
1 Department of Computer Engineering, Sharif University of Technology, Tehran, Iran {mirzazadeh, beigy}@ce.sharif.edu
2 Department of Computer Engineering and Information Technology, Amirkabir University of Technology (Tehran Polytechnic), Tehran, Iran behsaz@ce.aut.ac.ir

Abstract

One of the most effective methods in hierarchical reinforcement learning is the MAXQ method introduced in [1]. Although this method has been shown to be effective in many applications, it is computationally expensive in applications with a deep hierarchy [2], which makes it impractical for such applications. In this paper, we propose a new learning algorithm for the MAXQ method to address the open problem of reducing its computational complexity. This new algorithm, which is an improved version of the MAXQ-Q learning algorithm [2], learns value functions instead of computing them with a complete search of all paths through the MAXQ graph. We use the new learning algorithm to solve some instances of the simple Taxi Domain Problem. In this domain, our experimental results show that the new learning algorithm always converges to an optimal policy, that its convergence behavior is similar to that of the MAXQ-Q learning algorithm, and that, as expected, its overall running time is less than that of the MAXQ-Q learning algorithm.

1. Introduction

Reinforcement learning [3, 4], i.e. learning from interaction, has achieved great success in solving problems in machine learning. However, current tabular reinforcement learning methods suffer from the curse of dimensionality [5], which means the exponential growth of computational and memory requirements with the number of state variables. These methods treat the state-action space of the problem as a single flat search space. To handle this problem, two groups of methods exist. The first group uses function approximation to represent the value functions or policies in a more compact way.
The second group uses a hierarchical representation of problems to change the flat space into a hierarchy of simpler spaces, and introduces mechanisms for abstraction and sharing of subtasks to overcome the curse of dimensionality [6]. Using hierarchy in reinforcement learning has three main benefits. First, learning is done in fewer trials because fewer parameters must be learned. In addition, abstraction can be used to ignore irrelevant information in the states of subtasks, and learned subtasks can be shared between several parent subtasks. Second, learned subtasks can be reused for a new problem. Third, exploration is improved by the existence of high-level actions that can search a larger space in one step.

The MAXQ method introduced in [1] is a well-known approach in hierarchical reinforcement learning, which is based on the MAXQ decomposition of the value functions. One of its learning algorithms, MAXQ-Q, converges with probability one to a recursively optimal policy (the best policy that is consistent with the hierarchy of tasks). Although a recursively optimal policy can be suboptimal, in big problems even a suboptimal policy is valuable. As T.G. Dietterich mentioned in his paper [2], a problem with the MAXQ-Q (or MAXQ-0) algorithm is the time complexity of computing the value function in each inner node, which forces a complete search of all paths starting from that node and finishing in the leaves. To the best of our knowledge, this problem has remained open and no solution for it exists [7]. In this paper, we address this open problem and propose a new learning rule that solves the problem of value function calculation and reduces its time complexity.

The organization of this paper is as follows: In the next section, we briefly introduce the MAXQ method. In section 3, we present our new learning algorithm for the MAXQ method. In section 4, we report the experimental results of implementing our approach in comparison with the MAXQ-Q method. Finally, we conclude the paper in section 5.

2. MAXQ Method

The MAXQ approach works by decomposing the whole task into a set of subtasks, which are in turn decomposed into smaller subtasks. This structure forms a hierarchy whose leaves are primitive actions. This method is analogous to the introduction of subroutines in programming, but the order in which subtasks are executed is arbitrary. Once the programmer defines the hierarchy, it is the reinforcement learning system that writes the code for each subroutine. Each subtask has some termination conditions: once they are fulfilled, control returns to the parent subtask. Termination conditions are not necessarily desirable conditions. For example, an inappropriate invocation of a subtask by the reinforcement learning system can also bring it to a termination condition. The desired subset of termination conditions, i.e. the conditions that indicate a successful invocation and execution of the subtask, are called goals.

A hierarchy can be represented by a task graph. An example task graph is shown in Figure 1. It relates to the simple Taxi Domain Problem, a domain used for introducing and evaluating MAXQ methods [1, 2].

Figure 1. The task graph for the taxi problem [2].

The taxi problem consists of a 5-by-5 grid world with four stations, each in a different color (Red, Green, Blue, and Yellow), a taxi, and a passenger (Figure 2).

Figure 2. The taxi domain [2].

This problem is an episodic task which starts with the taxi in a randomly chosen square and the passenger in one of the stations. The taxi's task is to pick up the passenger and bring him to his destination, which is another station. There are six primitive actions in this domain: four navigation actions that move the taxi one square North, South, East, or West; a Pickup action; and a Putdown action.
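The task-graph idea described above can be sketched as a small data structure. The following Python fragment is only an illustration: the edges of `TASK_GRAPH` are an assumption based on the usual presentation of the taxi hierarchy (the figure itself is not reproduced in this text), the parameter t of Navigate(t) is omitted for brevity, and the helper names `is_primitive`, `primitive_actions`, and `count_paths` are hypothetical.

```python
# Hypothetical encoding of a taxi task graph. The edges are an assumption
# based on the usual presentation of this domain, not taken from the text.
TASK_GRAPH = {
    "Root": ["Get", "Put"],
    "Get": ["Pickup", "Navigate"],
    "Put": ["Putdown", "Navigate"],
    "Navigate": ["North", "South", "East", "West"],
}


def is_primitive(node):
    """A node with no entry in TASK_GRAPH is a primitive action (a leaf)."""
    return node not in TASK_GRAPH


def primitive_actions(node):
    """Collect the set of primitive actions reachable from `node`."""
    if is_primitive(node):
        return {node}
    leaves = set()
    for child in TASK_GRAPH[node]:
        leaves |= primitive_actions(child)
    return leaves


def count_paths(node):
    """Count the distinct paths from `node` down to a primitive action."""
    if is_primitive(node):
        return 1
    return sum(count_paths(child) for child in TASK_GRAPH[node])


if __name__ == "__main__":
    print(sorted(primitive_actions("Root")))
    print(count_paths("Root"))
```

Because Navigate is shared by Get and Put, it appears once in the graph but lies on several root-to-leaf paths. This sharing is what makes the number of paths larger than the number of nodes.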
There is a reward of -1 for each action and an extra reward of +20 for successfully delivering the passenger. There is a reward of -10 if the taxi attempts to execute the Putdown or Pickup actions illegally. If a navigation action would cause the taxi to hit a wall, the action is a no-op, and there is only the usual reward of -1. As can be seen in the task graph, the subtasks Put, Get, and Navigate(t) can be defined by the programmer. As the action space of the problem is decomposed by the task graph, the action-value function $Q^{\pi}(p, s, a)$, i.e. the total expected reward of performing action $a$ in subtask $p$ and then following the hierarchical policy $\pi = (\pi_0, \pi_1, \ldots, \pi_n)$ thereafter, can be decomposed into two components.

The first component is the expected total reward received while executing action $a$ in state $s$ and following policy $\pi$, denoted by $V^{\pi}(a, s)$, and the second component is the expected total reward of completing the parent task $p$ following policy $\pi$ after $a$ has finished, denoted by $C^{\pi}(p, s, a)$ and called the completion function. Thus, we have:

$$Q^{\pi}(p, s, a) = V^{\pi}(a, s) + C^{\pi}(p, s, a) \qquad (1)$$

where

$$C^{\pi}(p, s, a) = \sum_{s', N} P_p^{\pi}(s', N \mid s, a)\, \gamma^{N}\, Q^{\pi}(p, s', \pi_p(s')),$$

and

$$V^{\pi}(i, s) = \begin{cases} Q^{\pi}(i, s, \pi_i(s)) & \text{if } i \text{ is composite} \\ \sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is primitive.} \end{cases}$$

Equation (1) shows the relation of the action-value function of a parent task to the action-value functions of its child tasks. Applied recursively, it shows how we can decompose the action-value function of the root task into a summation of action-value functions of its descendant subtasks. In Theorem 2 of [2], it is shown that this decomposition can represent the value function of any hierarchical policy.

Using equation (1), a learning algorithm for hierarchical reinforcement learning can be obtained. The MAXQ-Q learning algorithm is a hierarchical version of Q-learning motivated by this equation. This algorithm is shown in Figure 3.

Figure 3. MAXQ-Q learning algorithm [8].

As can be seen in the pseudo-code of the algorithm, the value functions of composite tasks are required in the learning process. This learning algorithm is based on storing the composite value functions in a distributed manner. In fact, composite value functions are calculated at run time using a modified version of equation (1) by performing a depth-first search. This search traverses all paths from the current node to the leaf nodes, which is computationally expensive (because the number of paths can grow exponentially with the number of nodes). Finally, it should be noted that this algorithm is proven to converge to a recursively optimal policy [2]. In the next section, we suggest an alternative approach, which learns and stores value functions as well as completion functions for all nodes.

3. The New Learning Algorithm

As stated in the last section, traversing all paths to compute value functions of composite nodes is computationally expensive.
Thus, some methods have been suggested to reduce the effect of this problem. One suggestion is to perform a best-first search and use branch-and-bound approaches to prune some subtrees. A more effective method is to make the computation incremental, such that only those nodes whose values have changed in the current step are reconsidered (like what was developed in the RETE algorithm of the SOAR architecture [9]). It should be noted that these approaches can only reduce the computational complexity to some extent and cannot solve the problem in general. Now, we present our approach, which can solve this problem. Our approach is to learn and store value functions of all primitive and composite nodes (not only the value functions of the primitive nodes).
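To make the cost of the run-time search concrete, the following Python sketch computes the value of a composite node depth-first from stored completion values and counts how many nodes it visits. It is a minimal illustration under stated assumptions, not the MAXQ-Q pseudo-code: the function names, the dictionary-based tables `C` and `V_prim`, and the layered test hierarchy built by `chain_hierarchy` are all hypothetical.

```python
def evaluate(node, state, children, C, V_prim, counter):
    """Depth-first computation of V(node, state) from stored C values.

    For a primitive node the stored value is returned. For a composite node
    the value is the max over children a of V(a, state) + C[(node, state, a)],
    where each V(a, state) is itself computed recursively.
    """
    counter[0] += 1  # count every node visit to expose the search cost
    if node not in children:
        return V_prim.get((node, state), 0.0)
    return max(
        evaluate(a, state, children, C, V_prim, counter)
        + C.get((node, state, a), 0.0)
        for a in children[node]
    )


def chain_hierarchy(height, branching):
    """Build a hypothetical layered task graph in which every node of a layer
    has all nodes of the next layer as children (subtasks are shared)."""
    children = {}
    layers = [["root"]]
    for depth in range(1, height + 1):
        layer = [f"n{depth}_{k}" for k in range(branching)]
        for parent in layers[-1]:
            children[parent] = layer
        layers.append(layer)
    return children


def visits_for(height, branching=2):
    """Number of node visits needed to evaluate the root once."""
    counter = [0]
    evaluate("root", "s", chain_hierarchy(height, branching), {}, {}, counter)
    return counter[0]
```

Under these assumptions, `visits_for(10)` performs 2047 node visits although the graph contains only 21 nodes, because shared subtasks are re-evaluated once per path that reaches them.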

Before presenting the way to learn the value functions of nodes, we need to explain the theoretical intuition behind the new learning algorithm. By the Bellman equation for the value function of an optimal policy in a semi-Markov decision process, we have:

$$V^{*}(p, s) = \max_{a} \sum_{s', N} P(s', N \mid s, a) \left[ R(s', N \mid s, a) + \gamma^{N} V^{*}(p, s') \right].$$

Since the term in the maximization is equal to $Q^{*}(p, s, a)$, we obtain:

$$V^{*}(p, s) = \max_{a} Q^{*}(p, s, a),$$

and by utilizing equation (1), we get:

$$V^{*}(p, s) = \max_{a} \left[ V^{*}(a, s) + C^{*}(p, s, a) \right].$$

Now, using the above equation, we can reach the following update rule:

$$V_{t+1}(p, s) = (1 - \alpha)\, V_{t}(p, s) + \alpha \max_{a} \left[ V_{t}(a, s) + C_{t}(p, s, a) \right].$$

By this new learning rule, the computation of each value function is reduced to a simple addition instead of a depth-first search. Based on this learning rule, our suggested algorithm is illustrated in Figure 4.

Figure 4. New proposed algorithm.

As can be seen in the figure, for each subtask both V and C values are calculated and stored in the node corresponding to the subtask. This algorithm has two potential problems compared to the original algorithm. The first and most important is that we do not have a convergence proof for it. The other is the following: although the computational complexity of our method in each iteration is much less than that of the original algorithm, it is possible that our algorithm needs many more iterations to converge, which would yield an overall running time even worse than that of the MAXQ-Q algorithm. In the next section, we show experimentally that we can expect not to face these problems in real problems.

4. Experimental Results

As stated before, the new learning algorithm may have two potential problems, but it seems that these problems will not occur in real situations. Although we do not have any rigorous proof for this claim, in this section we support it experimentally for the simple Taxi Domain Problem (footnote 1). It should be noted that in the implementation, we used similar abstractions for the new algorithm and the MAXQ-Q algorithm, which shows, firstly, that the only difference in their results is due to the difference in the learning algorithm and, secondly, that our algorithm works in the presence of abstraction.
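The update rule of section 3 can be sketched in Python as follows. This is a minimal tabular illustration under stated assumptions, not the algorithm of Figure 4: the class name `CachedMaxNodeValues`, its method names, the constant learning rate `alpha`, and the dictionary layout are hypothetical, and the update of the completion values C is left out.

```python
from collections import defaultdict


class CachedMaxNodeValues:
    """Tabular V and C values, with V stored for composite nodes as well."""

    def __init__(self, children, alpha=0.1):
        self.children = children     # composite node -> list of child nodes
        self.alpha = alpha           # learning rate (assumed constant here)
        self.V = defaultdict(float)  # V[(node, state)]
        self.C = defaultdict(float)  # C[(parent, state, child)]

    def update_primitive(self, action, state, reward):
        """Move V(action, state) toward the observed one-step reward."""
        key = (action, state)
        self.V[key] = (1 - self.alpha) * self.V[key] + self.alpha * reward
        return self.V[key]

    def update_composite(self, parent, state):
        """Apply V(p,s) <- (1-alpha) V(p,s) + alpha * max_a [V(a,s) + C(p,s,a)].

        Each V(a, s) is a table lookup, so the cost is one addition per child
        rather than a search below the children.
        """
        target = max(
            self.V[(a, state)] + self.C[(parent, state, a)]
            for a in self.children[parent]
        )
        key = (parent, state)
        self.V[key] = (1 - self.alpha) * self.V[key] + self.alpha * target
        return self.V[key]
```

The design choice illustrated here is the trade described in the text: one extra table entry per composite node and state, in exchange for replacing the recursive evaluation with lookups.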
First, we executed the learning algorithm on three instances of the simple Taxi Domain Problem. The percentage of convergence to an optimal policy in 100 trials is presented in Table 1 (we assumed that the algorithm has converged if, after 2000 iterations, the absolute difference between the learned value and the optimal value is less than 0.01). These promising results show that the learning algorithm always converges to an optimal policy for these three instances, and suggest that for other instances and even other domains the new learning algorithm can also be expected to converge to an optimal policy most of the time.

Footnote 1. The source code related to the experiments can be accessed and downloaded from the link below:

Table 1. Convergence percentages for three instances of the simple Taxi Domain Problem.

Instance of Problem                                               Convergence Percent
Taxi Location = (0,0), Passenger Location = B, Destination = G    100%
Taxi Location = (2,4), Passenger Location = G, Destination = R    100%
Taxi Location = (2,4), Passenger Location = R, Destination = Y    100%

Second, we compared the convergence behavior of the new learning algorithm with that of MAXQ-Q to examine whether the second potential problem exists. The mean square error of the learned values from the optimal values, averaged over 200 trials, is shown in Figure 5. Although, as expected, the original MAXQ-Q converges better than the new algorithm, the difference is very small and their convergence behavior is similar. In fact, despite this small difference in the mean square error of the values, both algorithms converged to the optimal policy in the first few iterations, which is our main objective. The variance of the learned values in each episode is shown in Figure 6. As this diagram suggests, the new algorithm, like MAXQ-Q, has the good convergence property that its variance across trials decreases and it becomes more stable as the algorithm progresses.

Figure 5. MSE of learned values from optimal values averaged over 200 trials.

Figure 6. Variance of learned values in different trials.

Finally, we compared the running time of the two algorithms. The result for an instance of the Taxi Domain Problem is shown in Table 2 (for 100 iterations and 200 trials). This result shows that even in this problem, whose hierarchy has a very small height, the new algorithm is approximately two times faster than the MAXQ-Q method.

Table 2. Comparison of running time (in seconds) of the two algorithms.

Instance of Problem                                               MAXQ-Q    New Algorithm
Taxi Location = (0,0), Passenger Location = B, Destination = G

5. Conclusion

One of the ways to handle the curse of dimensionality in reinforcement learning problems is to utilize hierarchical methods such as MAXQ hierarchical reinforcement learning. For tasks with a deep hierarchy (i.e. a hierarchy with a large height), this method is computationally expensive. To solve this problem, we proposed a new algorithm and showed experimentally that we can expect it to converge to an optimal policy. Note that the new algorithm can easily be extended to reduce the computational complexity of the MAXQ-0 learning algorithm [2], too. In addition, it seems that the convergence of the new algorithm to an optimal policy can be proved by induction on the height of the hierarchy, using a stochastic approximation result (Proposition 4.5 of [3]). Furthermore, although for the Taxi Domain Problem our algorithm was just two times faster than MAXQ-Q, careful consideration shows that for problems with even a small hierarchy height (for example, about 10), the running time of our algorithm is much less than that of the MAXQ-Q algorithm. In the future, we will first try to prove its convergence to an optimal policy. Then, we will run it in some more complicated domains with deep hierarchies to show that it can be used in problems where using MAXQ-Q is impractical.

References

[1] T.G. Dietterich, The MAXQ Method for Hierarchical Reinforcement Learning, in Fifteenth International Conference on Machine Learning, 1998, pp
[2] T.G. Dietterich, Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, in Journal of Artificial Intelligence Research, vol. 13, 2000, pp
[3] D.P. Bertsekas and J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont,
[4] R. Sutton and A.G. Barto, Introduction to Reinforcement Learning, MIT Press, Cambridge,
[5] R.E. Bellman, Dynamic Programming, Princeton University Press, Princeton,
[6] A.G. Barto and S. Mahadevan, Recent Advances in Hierarchical Reinforcement Learning, in Discrete Event Dynamic Systems: Theory and Applications, vol. 13, 2003, pp
[7] T.G. Dietterich, Personal Communication with F. Mirzazadeh and B. Behsaz,
[8] T.G. Dietterich, An Overview of MAXQ Hierarchical Reinforcement Learning, in Proceedings of the Symposium on Abstraction, and Reformulation,
[9] M. Tambe and P.S. Rosenbloom, Investigating production system representations for non-combinatorial match, in Journal of Artificial Intelligence, vol. 68, no. 1, 1994, pp
