Reinforcement Learning Based on Active Learning Method

Second International Symposium on Intelligent Information Technology Application

Hesam Sagha 1, Saeed Bagheri Shouraki 2, Hosein Khasteh 1, Ali Akbar Kiaei 1
1 ACECR, Nasir Branch, Tehran, Iran
2 Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran
sagha@ce.sharif.ir, bagheri-s@sharif.edu, h_khasteh@ce.sharif.ir, kiaei@ce.sharif.edu

Abstract

In this paper, a new reinforcement learning approach is proposed which is based on a powerful modeling concept named Active Learning Method (ALM). ALM expresses any multi-input-single-output system as a fuzzy combination of some single-input-single-output systems. The proposed method is an actor-critic system, similar to the Generalized Approximate Reasoning based Intelligent Control (GARIC) structure, that adapts the ALM by delayed reinforcement signals. Our system uses Temporal Difference (TD) learning to model the behavior of useful actions of a control system. The goodness of an action is modeled on a Reward-Penalty Plane. IDS planes are updated according to this plane. It is shown that the system can learn with a predefined fuzzy system or without it (through random actions).

1. Introduction

ALM [1,2,3,4] is a recursive fuzzy algorithm, which expresses a multi-input-single-output system as a fuzzy combination of several single-input-single-output systems. It models the input-output relations for each input and combines these models to find the overall system model. ALM starts by gathering data and projecting them on different data planes. The horizontal axis of each data plane is one of the inputs and the vertical axis is the output. The IDS processing engine looks for a behavior curve, hereafter the narrow line, on each data plane. If the spread of data over the narrow line exceeds a threshold, the data domains are divided and the algorithm runs again. The heart of this learning algorithm is a fuzzy interpolation method, which is used to derive a smooth curve among data points.
The interpolation is done by applying a three-dimensional membership function to each data point, which expresses the belief for the data point and its neighbors. Each data point is considered as a source of light with a pyramid-shaped illumination pattern. As the vertical distance from this source of light increases, its illumination pattern interferes with those of its neighbors, forming new bright areas. The projection of this process on the plane is called IDS. As shown in Fig. 2, we can use a pyramid as a three-dimensional fuzzy membership function of a data point and its neighboring points. By applying the IDS method to each data plane, two different types of information are extracted. One is the narrow path and the other is the deviation of the data points around each narrow path. Each narrow path shows the behavior of the output relative to an input, and the spread of the data points around this path shows the degree of importance of that input in the overall system behavior. Less deviation of data points around the path represents a higher degree of importance, and vice versa.

Sagha et al. [5] proposed a method that combines a genetic algorithm and IDS to obtain better partitions over the input variables. Their method is called GIDS (Genetic IDS). Shahdi et al. [6] proposed the RIDS method, which replaces each two consecutive points with their midpoint instead of applying a 2-D fuzzy membership function to each datum. RIDS converges to the center of gravity of the data and increases the number of points in order to keep the data expansion in the plane. In addition, they proposed another method, Modified RIDS (MRIDS), that supports negative data points. In MRIDS, if two consecutive points are positive, the result is the same as in RIDS and the replacing point is their midpoint. If one of the points is negative, however, the replacing point is located near the positive point on the line that connects the two points, so a negative point has the effect of deviating the center of gravity from the positive points.
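The IDS step described above (a pyramid-shaped membership function spread around every data point, followed by extraction of a narrow path and a spread per data plane) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that data are normalised to [0, 1], that the narrow path is taken as the centre of gravity of the ink in each input column, and that the spread is the width of the inked band; the function names `ids_plane` and `narrow_path_and_spread` and the parameters `grid` and `radius` are hypothetical.

```python
def ids_plane(xs, ys, grid=64, radius=4):
    """Spread a pyramid-shaped ink drop around each data point.

    xs, ys are assumed to be normalised to [0, 1]. plane[row][col] holds the
    accumulated ink; columns index the input, rows index the output.
    """
    plane = [[0.0] * grid for _ in range(grid)]
    for x, y in zip(xs, ys):
        c0, r0 = round(x * (grid - 1)), round(y * (grid - 1))
        for r in range(max(r0 - radius, 0), min(r0 + radius, grid - 1) + 1):
            for c in range(max(c0 - radius, 0), min(c0 + radius, grid - 1) + 1):
                # Pyramid membership: 1 at the data point, 0 at `radius` cells away.
                dist = max(abs(r - r0), abs(c - c0))
                plane[r][c] += 1.0 - dist / radius
    return plane


def narrow_path_and_spread(plane):
    """Per input column: centre of gravity of the ink and width of the inked band.

    Both choices are assumptions made for this sketch.
    """
    grid = len(plane)
    path, spread = [], []
    for c in range(grid):
        column = [plane[r][c] for r in range(grid)]
        mass = sum(column)
        if mass == 0.0:
            path.append(None)      # no data near this input value
            spread.append(None)
            continue
        inked = [r for r, v in enumerate(column) if v > 0.0]
        path.append(sum(r * v for r, v in enumerate(column)) / mass / (grid - 1))
        spread.append((inked[-1] - inked[0]) / (grid - 1))
    return path, spread


# Example: for data lying on y = x the narrow path follows the diagonal.
xs = [i / 49 for i in range(50)]
plane = ids_plane(xs, xs)
path, spread = narrow_path_and_spread(plane)
```

In this sketch a small spread in a column would correspond to a more important input, in line with the text; how the spreads are turned into fuzzy weights is left out.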
MRIDS assumes that rewards and punishments are available after each action; when they are delayed by an undetermined amount, it does not converge correctly. Here we use another method, called Reinforcement ALM (RALM), to add reinforcement capability to the algorithm. We use the concepts of the Action Selection Network (ASN), Action Evaluation Network (AEN), and Stochastic Action Modifier (SAM), which are proposed in GARIC [7] as an actor-critic algorithm.

GARIC: The architecture of GARIC is shown schematically in Fig. 3. The Action Selection Network (ASN) maps a state vector into a recommended action, F, using fuzzy inference. The Action Evaluation Network (AEN) maps a state vector and a failure signal into a scalar score that indicates state goodness. This score is also used to produce the internal reinforcement, r'. The AEN can be a neural network, a fuzzy system [8], or the like. The Stochastic Action Modifier (SAM) uses both F and r' to produce an action F', which is applied to the plant. Learning occurs by fine-tuning the parameters in the two networks: in the AEN, the weights or fuzzy parameters are adjusted; in the ASN, the parameters describing the fuzzy membership functions are changed. Both adjustments are made by gradient descent. The AEN parameters are updated via the Temporal Difference learning method.

Temporal Difference Learning: This is a prediction method. It updates its current estimate based on previously learned estimates, assuming that subsequent predictions are often correlated in some sense. A prediction is made, and when the observation is available the prediction is adjusted to better match the observation. If each state s_t has a prediction value v(s_t) that denotes the goodness of the actions taken in that state, then the update formula is

$$ v(s_t) \leftarrow v(s_t) + \alpha_t \bigl( R_{t+1} + \delta\, v(s_{t+1}) - v(s_t) \bigr) \tag{1} $$

where \alpha_t is the learning rate, \delta is a constant in the range [0,1], and R_{t+1} is the reward received at time t+1 [9].

2. Proposed Method

In our method we use a structure similar to GARIC. The ASN is an IDS fuzzy system. The AEN is made up of a plane called the Reward-Penalty Plane (RPP). This plane holds the information of how good the action taken in a specific state is. From a control viewpoint, this plane can be called the Error-Change-in-Error Plane, because one axis denotes the error and the other denotes the change in error.

Figure 1. Flowchart of ALM
Figure 2. IDS method and fuzzy membership function
Figure 3. The architecture of GARIC
Figure 4. Initial Reward-Penalty Plane for an inverted pendulum system. Middle points denote the desired states (-0.012R < θ < 0.012R and -0.05R/s < Δθ < 0.05R/s) and have the maximum value (1), margin points denote penalty areas and have the negative minimum value (less than zero, more than -1), and other points are in the play area.
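The temporal-difference update of Eq. (1) can be sketched in a few lines. The sketch below assumes a tabular value function and a constant learning rate; the function name `td0_update`, the three-state chain, and the parameter values are hypothetical and serve only to illustrate the rule v(s_t) ← v(s_t) + α (R_{t+1} + δ v(s_{t+1}) − v(s_t)).

```python
def td0_update(v, s, s_next, reward, alpha=0.1, delta=0.9):
    """Apply Eq. (1) to v[s] in place and return the temporal-difference error."""
    td_error = reward + delta * v[s_next] - v[s]
    v[s] += alpha * td_error
    return td_error


# Hypothetical three-state chain 0 -> 1 -> 2; only the step into state 2 is rewarded.
v = [0.0, 0.0, 0.0]
for _ in range(200):
    td0_update(v, 1, 2, reward=1.0)
    td0_update(v, 0, 1, reward=0.0)
```

After enough sweeps the value of state 1 approaches the reward itself and the value of state 0 approaches the reward discounted once by δ, which is how a delayed reward propagates back to earlier states.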

The value of each point in this plane shows how much we can trust the action selected by the ASN in that specific state. The SAM changes the value of the fuzzy system output by considering the output of the AEN: the more relevant an action is, the less the SAM changes it.

The RPP is a surface defined over the variables to be controlled. Initially there are three regions on the RPP:
i) Reward area: the desired area into which we would like the controller to take the state. This area has the fixed value of one.
ii) Penalty area: consists of states that are not desirable in the system and make it unstable. The value of this area is fixed at the lowest negative value.
iii) Play area: the rest of the surface, which initially has the value zero.
An initial plane is shown in Figure 4 for a system in which we would like to stabilize the angle variable.

During the run, when data are available, the value of the RPP at the previous time step and its neighbors approaches the value of the RPP at the current time step and its neighbors, in the same way as TD(0):

$$ \mathrm{delta} = \lambda \cdot wn \cdot \bigl( RPP(e(t), ce(t)) - RPP(e(t-1), ce(t-1)) \bigr) \tag{2} $$

$$ RPP(e(t-1), ce(t-1)) \leftarrow RPP(e(t-1), ce(t-1)) + \mathrm{delta} \tag{3} $$

where delta is the change applied to the RPP, e(t) is the error of the control variable at time t, ce is the change in error, and wn is the IDS window, which determines how much the neighbors are affected; it can be a pyramid, a Gaussian window with a center value of one, or the like. When an action is rewarded, the update rate λ is high, but when an action is penalized, λ is very low. This is because we assume that the number of false actions is much larger than the number of true actions for a system.

After updating the RPP, we must update the IDS planes for the previous action and its neighbors. In this case, we reward the action if it leads to a better state and punish it if it leads to a worse state. The total goodness of an action is obtained by averaging over delta values:

$$ IDS_i(in_i(t-1), out(t-1)) \leftarrow IDS_i(in_i(t-1), out(t-1)) + \mathrm{delta} \tag{4} $$

where in_i is the i-th input variable. The fuzzy system can be adapted online.
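A minimal sketch of the RPP update of Eqs. (2) and (3) follows, under several assumptions that the text does not fix: the plane is a grid indexed by discretised (e, ce) cells, the window wn is a square pyramid with centre value 1, an update counts as a reward when the current cell is better than the previous one, and the reward and penalty areas are passed in as `fixed` cells that are never changed. The names `pyramid_window` and `update_rpp` are hypothetical; the default rates 0.9 and 0.05 are the λ values quoted in the Results section.

```python
def pyramid_window(radius):
    """Square pyramid window wn with centre value 1 (assumed shape)."""
    size = 2 * radius + 1
    return [[1.0 - max(abs(i - radius), abs(j - radius)) / (radius + 1)
             for j in range(size)] for i in range(size)]


def update_rpp(rpp, prev, curr, radius=1, lam_reward=0.9, lam_penalty=0.05,
               fixed=frozenset()):
    """Pull the RPP around the previous cell toward the RPP around the current cell."""
    rows, cols = len(rpp), len(rpp[0])
    wn = pyramid_window(radius)
    deltas = {}
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            p = (prev[0] + di, prev[1] + dj)
            c = (curr[0] + di, curr[1] + dj)
            if not all(0 <= i < rows and 0 <= j < cols for i, j in (p, c)):
                continue                      # neighbour falls outside the plane
            if p in fixed:
                continue                      # reward/penalty cells keep their value
            diff = rpp[c[0]][c[1]] - rpp[p[0]][p[1]]
            # Assumption: a positive difference is treated as a reward (high rate).
            lam = lam_reward if diff > 0 else lam_penalty
            deltas[p] = lam * wn[di + radius][dj + radius] * diff   # Eq. (2)
    for (i, j), delta in deltas.items():
        rpp[i][j] += delta                                          # Eq. (3)
    return deltas


# Example: one step from a play-area cell into the reward cell at (2, 2).
rpp = [[0.0] * 5 for _ in range(5)]
rpp[2][2] = 1.0
update_rpp(rpp, prev=(2, 1), curr=(2, 2), fixed={(2, 2)})
```

All deltas are computed from a snapshot before any cell is changed, so overlapping neighbourhoods do not feed back into each other within one step; this is a design choice of the sketch. The returned deltas are the values that Eq. (4) would add to the IDS planes.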
In online adaptation, after spreading each datum and neglecting data in the negative areas of the IDS planes, the narrow lines of a predefined fuzzy inference system are updated. To select the next action (at time step t) after the fuzzy inference procedure in the ASN, for exploration and exploitation of the space, we change the obtained value by the following formula:

$$ F(t) = ASN(t) + N(0, Var) \cdot \bigl( \exp(-\alpha\, RPP(e(t), ce(t))) - \exp(-\alpha) \bigr) \tag{5} $$

where N(0, Var) is a normally distributed random variable with variance Var and α is a constant. When the RPP gives the best score for an action (i.e. 1), the action selected by the ASN is applied without manipulation.

For offline learning, after some data have been captured, when the RPP converges and no more changes occur in it, or a specific number of iterations has passed, we get IDS planes that have both positive and negative values. Negative areas show that the actions chosen in this part of the space move the system into a worse state. Therefore, by neglecting the data in these areas, we can filter out bad actions. Finally, by estimating the narrow lines and using ALM, we can construct the fuzzy system.

Another problem arises when the state is within the reward area. If we use the originally generated fuzzy system, we get vibrations in this range, because the learning system has not learned how to act when the state is in the reward area. To handle this problem we use fuzzy scaling. In this kind of scaling, the range of each input variable of the fuzzy system is scaled in proportion to the ratio of reward range to input range:

$$ In_i' = In_i \cdot Range(Reward(In_i)) / Range(In_i) \tag{6} $$

where In_i is the i-th input that is a part of the RPP. The output range is scaled by

$$ \max_i \bigl( Range(Reward(In_i)) / Range(In_i) \bigr) \tag{7} $$

Figure 5. Final IDS planes for the inverted pendulum system. White areas have positive values and darker ones have negative values.
Figure 6. Selected data
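The action-modification rule of Eq. (5) can be sketched as follows. The sketch assumes that the equation reads F(t) = ASN(t) + N(0, Var) · (exp(−α · RPP) − exp(−α)), which is the form consistent with the statement that an RPP value of 1 leaves the ASN action unchanged. The function names and default parameter values are hypothetical.

```python
import math
import random


def exploration_scale(rpp_value, alpha=1.0):
    """Factor that multiplies the noise in Eq. (5); it is zero when rpp_value == 1."""
    return math.exp(-alpha * rpp_value) - math.exp(-alpha)


def modified_action(asn_output, rpp_value, var=1.0, alpha=1.0, rng=random):
    """Perturb the ASN output; the less trusted the state, the larger the noise."""
    noise = rng.gauss(0.0, math.sqrt(var))   # gauss() expects a standard deviation
    return asn_output + noise * exploration_scale(rpp_value, alpha)


# A fully trusted action (RPP = 1) passes through unchanged.
trusted = modified_action(2.5, rpp_value=1.0)
# In the play area (RPP = 0) or near penalties (RPP < 0) the noise grows.
scales = [exploration_scale(r) for r in (1.0, 0.0, -0.5)]
```

Because the scale grows as the RPP value drops, exploration concentrates in states where past actions have not been rewarded, while well-rated states are exploited almost deterministically.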

Therefore, we do not need another fuzzy system; the variables only need to be scaled before the generated fuzzy system is used. This approach has some advantages in comparison with MRIDS, especially when reaching a desired state involves delays. Also, we explicitly define the reward and penalty areas, and there is no need to define how and on what trajectory the system can reach the goal.

3. Results

We modeled the well-known inverted pendulum problem with two inputs, the angle θ (theta) and the angular velocity θ̇ (Dtheta). The reward area was chosen as θ = [-0.23, 0.23] rad and θ̇ = [-0.98, 0.98] rad/s, and the penalty areas are where either variable exceeds 0.9 of its input range. λ is chosen to be 0.9 for rewards and 0.05 for penalties. Penalty areas are set to -0.5. The time step was chosen to be 0.[...].

We used two methods of action selection at the beginning of the run. The first used random actions instead of a predefined fuzzy system, so no manipulation by formula (5) was needed. After 20000 sequences, with only 32 successes among them, we obtained the IDS planes shown in Fig. 5. A success is when the system state is in the reward area. White areas are positive and dark areas are negative. Figure 6 shows the data that remain after removing the bad data located in the negative parts of the IDS planes. The final Reward-Penalty Plane is shown in Fig. 7. After applying ALM, we obtained a fuzzy system with only four rules. The input-output (force) surface is shown in Fig. 8. Fig. 9 shows some random initial states and their convergence to the desired point. The rise time is 2.71 and the overshoot is 0.0%.

The second method uses an incorrect fuzzy system with four rules in the ASN. After about [...] time steps of online learning, the system learned to be stable. Fig. 10 shows the first 1000 time steps and the last 1000. Learning with GARIC takes about [...] time steps, but in our system it takes fewer than [...] time steps.

In addition, we modeled the ball and beam system. It is assumed that the system has three inputs, θ, x0, and v0, and two outputs, x and v.
θ is the angle of the beam with respect to the horizontal line passing through the origin, x0 is the initial distance of the ball from the origin, v0 is the initial speed of the ball, and x and v are the final values of distance and speed. Our goal is to move the ball to position zero, so we define the RPP with respect to x and v. To control it, we have two inputs, v0 and x0, and one output, θ. The results of RALM for some random inputs are shown in Fig. 11. The generated fuzzy system has four rules. The rise time is 1.6 s and the overshoot is 0.0%. Table 1 shows the results of other proposed algorithms based on ALM and of FALCON. It can be seen that the rise time is about 13% lower than that of supervised ALM and no overshoot is detected.

Table 1. Comparing control parameters of four controlling methods

Method              Fuzzy rules   Overshoot   Rise time
FALCON-ART          [...]         [...]%      2.11
Unsupervised ALM    [...]         [...]%      1.87
Supervised ALM      [...]         [...]       [...]
RALM                [...]         [...]       [...]

Figure 7. Reward-Penalty Plane
Figure 8. Fuzzy system surface

4. Conclusion

ALM is a powerful idea for modeling. We changed it to support reinforcement learning. Our approach uses another plane to capture the information carried by reinforcement signals. The approach is useful when there is no explicit idea about the goodness of an action and delayed rewards and penalties must be considered. RALM handles these very well. Results show that RALM learns better than other proposed ALM-based algorithms.

5. References

[1] S. Bagheri, G. Yuasa, N. Honda, Fuzzy Controller Design by an Active Learning Method, 31st Symposium of Intelligent Control, SIC 98.
[2] S. Bagheri, N. Honda, Hardware Simulation of Brain Learning Process, 15th Fuzzy Symposium, Osaka, June 99.
[3] S. Bagheri, N. Honda, A New Method for Establishing and Saving Fuzzy Membership Functions, 13th Fuzzy Symposium, Toyama, [...].
[4] S. Bagheri, N. Honda, Outlines of a Soft Computer for Brain Simulation, Methodologies for the Conception, Design and Application of Soft Computing, IIZUKA, 1998.
[5] H. Sagha, S. B. Shouraki, M. Dehghan, Genetic Ink Drop Spread, International Symposium in Intelligent Information Technology Application, IITA, China, 2008.
[6] A. Shahdi, S. Bagheri, Supervised Active Learning Method as an Intelligent Linguistic Controller and Its Hardware Implementation, IASTED, Spain, 2002.
[7] H. Berenji, P. Khedkar, Learning and Tuning Fuzzy Logic Controllers Through Reinforcement, IEEE Transactions on Neural Networks, Vol. 3, No. 5, [...].
[8] H. Berenji, P. Khedkar, Using Fuzzy Logic for Performance Evaluation in Reinforcement Learning, NASA-TM [...].
[9] R. Sutton, A. Barto, Reinforcement Learning, MIT Press, 1998.

Figure 9. Final system output for the inverted pendulum.
Figure 10. Online learning. Left: first 1000 time steps. Right: time steps between [...] and [...].
Figure 11. Final system output for the ball and beam problem.
