Variable Resolution Discretization in the Joint Space

Christopher K. Monson, David Wingate, and Kevin D. Seppi
Computer Science, Brigham Young University

Todd S. Peterson
Computer and Networking Sciences, Utah Valley State College

Abstract

We present JoSTLe, an algorithm that performs value iteration on control problems with continuous actions, allowing this useful reinforcement learning technique to be applied to problems where a priori action discretization is inadequate. The algorithm is an extension of a variable resolution technique that works for problems with continuous states and discrete actions [6]. Results are given that indicate that JoSTLe is a promising step toward reinforcement learning in a fully continuous domain.

1. Introduction

Reinforcement Learning (RL) can be a useful way of representing control problems because of its simplicity. RL solution techniques, such as value iteration, can discover complex solutions that may be difficult to represent or compute in closed form. In the specific case of value iteration, a problem is solved by computing an approximation to the value function. The value function may then be used to create a control policy [4].

One of the more attractive properties of value iteration is that it may be performed iteratively. The value of each point in space may be updated according to an equation like the following:

$$ V(s, a) = \int_0^{\tau} \gamma^{t} R(s(t), a) \, dt + \gamma^{\tau} \sup_{a'} V(s(\tau), a') \tag{1} $$

where V(s, a) denotes the value of the initial state s and action a, s(t) is the state resulting from the application of a for t units of time, R(s(t), a) is the current reinforcement, γ ∈ [0, 1) is the discount factor, and τ is the amount of time for which a is executed. Note that this assumes a deterministic environment, so transition probabilities have been omitted.¹

¹ Additionally, even though this definition of the value function looks like a quality function, we maintain the nomenclature of value iteration so as to avoid confusion with Q-learning.

Implementing (1) directly is impossible.
In a discrete setting, (1) becomes something like the following:

$$ V(s, a) = \phi \sum_{t=0}^{T-1} \gamma^{t\phi} R(s(t\phi), a) + \gamma^{T\phi} \sup_{a'} V(s(T\phi), a') \tag{2} $$

where φ is a problem-specific time step. With discrete actions, the sup operator becomes a max operator and a simple linear search is sufficient to implement it. In domains with continuous actions, however, the presence of the supremum is problematic, as a perfect implementation would require an exhaustive search of an infinite space.

Fortunately this issue does not plague all continuous action RL problems; a well-known result from optimal control theory states that many minimum time control problems may be solved optimally using only bang-bang control [2]. This fact allows many researchers to optimally discretize continuous action spaces a priori. Though many interesting problems may be solved using a naive discretization, many others may not. Even if a problem may be solved using bang-bang control in simulation, the policy generated can rarely be run on real hardware. A method for solving these problems without a static action discretization is needed.

This paper presents the Joint Space Triangulation Learner (JoSTLe), which enables value iteration to solve problems with continuous actions. It is based on Variable Resolution Discretization (VRD) as presented by Munos and Moore [6], and relies on the same fundamental observation that not all portions of the problem space are of equal importance. JoSTLe uses a homogeneous data structure to dynamically allocate resources across both state and action spaces. In the same way that VRD allows each problem to dictate regions of interest in the state space, JoSTLe allows each problem to dictate those regions in the combined state-action (or joint) space. This creates the possibility for discretizing actions differently at each state.

This is not the first work to address continuous action problems, but it is to our knowledge the first to work with general continuous state and action problems. Other approaches typically involve using one discretization technique on the state space and then performing some form of regression in the action space [1] at each state. This approach is useful, but can run into problems when the value function has discontinuous boundaries. It can also be inefficient when one action representation could span multiple states.

Figure 1. The Kuhn triangulation of a cube

2. Basic Variable Resolution Discretization

VRD discretizes the $d_s$-dimensional state space into hypercubes, arranged hierarchically in a kd-trie. The root node covers the entire space. At every branch, a split is performed in one of the state dimensions, creating two smaller hypercubes. A Kuhn triangulation is implemented at each leaf, effectively splitting it into $d_s!$ simplices (Figure 1). The overall effect is a complete triangulation of the space. The value function is interpolated linearly within this triangulation using barycentric coordinates [6].

VRD proceeds in two phases: value iteration and state space refinement. It begins with a rough initial discretization of the state space. The discretization is used to perform value iteration until it converges. The information gained from the value iteration process is used to refine the discretization, ideally splitting in areas that require finer representation to generate a good policy. The iteration and refinement steps are repeated until a satisfactory representation of the value function is obtained. The kd-trie allows for efficient point localization while the Kuhn triangulation allows for efficient interpolation. A full description of the algorithm with comprehensive citations on component elements is contained in [6].

3. Extending to the Joint Space

JoSTLe is an extension of VRD and retains many of its characteristics, including the kd-trie and Kuhn triangulation. The primary difference between the two is that JoSTLe works in the joint state-action space rather than just the state space.

Figure 2.
Final state exits adjacent hypercubes (shaded)

Let $d_s$ be the dimensionality of the state space and $d_a$ be the dimensionality of the action space. We may then define the joint space as the Cartesian product of state and action spaces, yielding a space whose dimensionality is given by $d = d_s + d_a$. This joint space is tessellated by hypercubes in the same manner as VRD state spaces. Each vertex of each cube is composed of the concatenation of state and action vectors: v = (s, a). Associated with each vertex is a value V(s, a), sometimes abbreviated as V(v). This extension into a higher dimensional space represents the primary mathematical difference between JoSTLe and VRD. Equation (2) still applies directly.

There is another minor difference in the way that a trajectory stopping point is determined. In VRD a trajectory stops when it exits the initial simplex. JoSTLe stops once the state no longer intersects any hypercubes adjacent to the initial state/action point. Figure 2 illustrates this idea. The circled vertex is the starting point, and the trajectory in the state space does not end until it is no longer intersecting any of that vertex's hypercubes.

This paper addresses the issues of implementing such a system. The addition of action dimensions to the discretization poses some unique problems that require attention. The first issue is that of finding actions at each state that produce the maximum value. The second issue is related to the generalization of the search algorithm to arbitrary dimensions. The third is that of deciding when and how to split hypercubes. These issues are addressed in the next sections.

3.1. Searching for Maxima

The search for actions that produce the maximum value at a given state is a fundamental part of value iteration. In traditional value iteration the action space is discrete and homogeneous. In the discretization produced by JoSTLe the action space is continuous with a heterogeneous discretization. The piecewise linear interpolation implemented in JoSTLe addresses this problem effectively.

Figure 3. A joint space simplex

Figure 3 illustrates a joint space with two state dimensions and one action dimension. The tessellation generated by the kd-trie would actually cover the entire space with hypercubes, each of which would be triangulated, but for purposes of explanation only one triangle is shown. The action search problem may be viewed as a search along the line in the figure. The line represents a region of constant state and variable action. Regions of this nature must be searched at every step of value iteration in order to calculate a discounted value. While in general this is a nonlinear programming problem, the representation allows for significant simplification. Because the interior values of each simplex follow a linear function, the maximum must occur at a simplex boundary. It cannot occur uniquely at the interior. This is easily proven by showing that the gradient of the interpolation function is constant. The proof that this holds even when constrained along a region such as that in Figure 3 is also fairly straightforward [5]. This insight allows us to restrict our search to the points where the line (as depicted in Figure 3) intersects with the simplex boundaries, effectively transforming a continuous problem into a discrete problem. The search is performed by finding the intersection points and picking the maximum.

3.2. Generalization to Arbitrary Dimensions

Finding the intersection points can be difficult. Figure 4 illustrates the problem and an insight that serves to generalize the algorithm. Each possible spatial intersection generates a different kind of search space. In Figure 4(a) the regions of interest are points on a line. In Figure 4(b), however, the interesting portions of the space are vertices of a triangle formed by slicing a tetrahedron with a plane. In general, we are always seeking the vertices of a simplex formed from such a slice, and the shape of that simplex will change based on the dimensionalities of the state and action spaces.
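The observation from Section 3.1, that a piecewise linear interpolant searched along a line of constant state attains its maximum at a simplex boundary, can be illustrated with a minimal sketch. It assumes a single unit hypercube with two state dimensions and one action dimension, triangulated in the Kuhn fashion; the function names `kuhn_interpolate` and `best_action` and the random vertex values are hypothetical and are not taken from the paper's implementation. Inside a unit cube the Kuhn simplices are separated by the hyperplanes where two coordinates are equal, so along the action line the only candidate points are the cube faces a = 0 and a = 1 and the points where a equals one of the state coordinates.

```python
import random


def kuhn_interpolate(point, vertex_value):
    """Linear interpolation inside the Kuhn simplex of the unit cube
    that contains `point`.

    `vertex_value` maps each cube corner (a tuple of 0/1 entries) to
    its stored value.
    """
    # Sorting coordinates in descending order identifies the simplex.
    order = sorted(range(len(point)), key=lambda i: point[i], reverse=True)
    coords = [point[i] for i in order]
    corner = [0] * len(point)
    # Barycentric weight of the origin corner.
    result = (1.0 - coords[0]) * vertex_value[tuple(corner)]
    for k, axis in enumerate(order):
        corner[axis] = 1  # walk to the next vertex of the simplex
        next_coord = coords[k + 1] if k + 1 < len(coords) else 0.0
        result += (coords[k] - next_coord) * vertex_value[tuple(corner)]
    return result


def best_action(state, vertex_value):
    """Search the single action dimension at a fixed state."""
    # Candidates: the cube faces and every point where the action line
    # crosses a simplex boundary (a equal to a state coordinate).
    candidates = sorted({0.0, 1.0, *state})
    return max(candidates,
               key=lambda a: kuhn_interpolate((*state, a), vertex_value))


# Illustrative data only: random values on the eight corners.
rng = random.Random(0)
corners = [(i, j, k) for i in (0, 1) for j in (0, 1) for k in (0, 1)]
values = {c: rng.uniform(-1.0, 1.0) for c in corners}
state = (0.3, 0.7)
a_star = best_action(state, values)
print(a_star, kuhn_interpolate((*state, a_star), values))
```

A dense scan over the action interval never finds a larger value than the best of these four candidates, which is the sense in which the continuous search becomes a discrete one.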
Linear interpolation once again allows us to take a useful shortcut. Figures 4(a) and 4(b) also show the projection of the simplex into the state space. Note that in both projections, the region of interest is a single point in overlapping simplices. Again, because interpolation is linear, the points of intersection are easily found by projecting the boundaries onto the state space and performing interpolation on these lower dimensional shadows. Though boundaries will overlap in the lower dimensional space, each will have a unique set of vertices and produce different answers. One of these will be the maximum.

Figure 4. Joint space projections: (a) $d_s = 2$, $d_a = 1$; (b) $d_s = 1$, $d_a = 2$

The algorithm for searching the action space at a given state s thus becomes very simple and elegant:

1. Find all hypercubes intersected by the hyperplane at s. This is easily done with a kd-trie using an orthogonal range query.
2. For each hypercube, enumerate all Kuhn simplices.
3. For each simplex, enumerate all $d_s$-dimensional boundaries.
4. Project each boundary into the state space and interpolate at s.

The proof that this method is equivalent to finding intersections in the joint space is given in [5].

3.3. Splitting Criteria

In principle, it would be best to refine the joint space only where it will improve the policy. Unfortunately, this is not generally computable in advance. An alternative would be to refine regions of the space that improve the value function estimate. This is also not generally knowable, but an approximation may be made. We refine the model based on maximum interpolation error. The error function is defined over all points p = (s, a) within each simplex S as

$$ E(p) = \left| V(p) - \hat{V}(p) \right| \tag{3} $$

where V(p) is the multistep discounted value at p, and $\hat{V}(p)$ is the interpolated value at p, which is a weighted sum of the V values of the vertices in the enclosing simplex. This error function is defined everywhere in the interior of each hypercube.

The purpose of splitting is to reduce the maximum error. Splitting is therefore done when $\sup_p E(p)$ surpasses some suitable threshold ɛ. In practice, $\sup_p E(p)$ is approximated with a set of random sample points $P_H$ within a hypercube by $\max_{p \in P_H} E(p)$. Whenever this value is greater than ɛ for a hypercube, the cube is marked for splitting.² Once all cubes have been evaluated, the marked cubes are split.

What remains is to determine in which dimension to split them. We split in the dimension that produces a new cube with the smallest maximum error. More precisely, let $C^L_x(H)$ and $C^R_x(H)$ be the left and right children resulting from a split of hypercube H in dimension x:

$$ x^* = \arg\min_{x \in \{1, \ldots, d\}} \left[ \min_{D \in \{L, R\}} \left( \max_{p \in P_{C^D_x(H)}} E(p) \right) \right] \tag{4} $$

This may result in two child cubes with very different errors. The split is performed so as to minimize the maximum error of one of the cubes, leaving the possibility that the other will have a very high error. This is tolerable because the high error cube is likely to be split at the next iteration to improve its error characteristics. No splitting is done if the hypercube has low error or if the hypercube is smaller than the smallest feature of interest, a parameter described in the next section.

3.4. JoSTLe Parameters

JoSTLe adds a small number of tunable parameters to the standard value iteration algorithm, shown in Table 1. The minimum feature length in dimension i is denoted $\omega_i$.
This parameter determines when a cube is too small to be split regardless of error, and it serves to keep the algorithm from splitting forever around discontinuous boundaries in the value function. This parameter is typically easy to obtain given that many reinforcement learning problems have known reward boundaries. The smallest reward feature size is often a good starting point for this parameter. In practice, the performance of the algorithm degrades smoothly as this parameter is increased (as splits are limited). Smaller values always yield better accuracy, but often at the expense of convergence speed.

² Random sampling is a naive and simple way to approach the problem, but it is not likely to be the best. Other approximations are the subject of future work.

Table 1. JoSTLe parameters

  Symbol   Meaning
  ω_i      Minimum feature length in dimension i
  Ω        Minimum Lebesgue measure of a hypercube
  ɛ        Error threshold
  σ        Number of sample points per hypercube

The $\omega_i$ parameter may be used in a number of ways. It can limit a cube's ability to split in a particular dimension if its length in that dimension is less than $\omega_i$. Alternatively, it can be used to compute a minimum allowed Lebesgue measure $\Omega = \prod_{i=1}^{d} \omega_i$. If this latter method is used, then a cube is not split in any dimension if its Lebesgue measure is smaller than Ω. The experiments outlined in this paper use the former, though the latter was tested with similar results.

The error threshold ɛ is also used during the splitting process. If a cube's maximum error is less than ɛ, the cube need not be split. Determining an appropriate value for ɛ can be challenging, but a practical approximation may be made based on the maximum range of reward values $R_{\max} - R_{\min}$, the time step φ, and the discount factor γ. If the problem has only terminal reinforcement, an upper bound on the error is given by $\phi (R_{\max} - R_{\min})$, since the integral over trajectory rewards will be zero until the end of the trajectory. Although this is a fairly conservative bound, it works well in practice.
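The pieces introduced so far (the sampled maximum error, the threshold ɛ expressed as a fraction of the terminal-reinforcement bound, the per-dimension limit ω_i, and the choice of split dimension in Equation (4)) can be combined in a minimal sketch. Everything named here is an assumption made for illustration: `error_fn` is a hypothetical stand-in for E(p), cubes are represented by lists of lower and upper corners, the sample count is passed in directly, and the numeric values in the usage lines are arbitrary. It is not the authors' implementation.

```python
import random


def terminal_error_bound(phi, r_max, r_min):
    """Error bound for problems with only terminal reinforcement."""
    return phi * (r_max - r_min)


def max_sampled_error(lo, hi, error_fn, n_samples, rng):
    """Approximate sup_p E(p) over the cube [lo, hi] by random sampling."""
    return max(
        error_fn([rng.uniform(l, h) for l, h in zip(lo, hi)])
        for _ in range(n_samples)
    )


def choose_split_dimension(lo, hi, error_fn, epsilon, omega, n_samples, rng):
    """Return the dimension to split in, or None to leave the cube alone."""
    if max_sampled_error(lo, hi, error_fn, n_samples, rng) <= epsilon:
        return None  # sampled error does not exceed the threshold
    best_dim, best_score = None, float("inf")
    for x in range(len(lo)):
        if hi[x] - lo[x] < omega[x]:
            continue  # shorter than the minimum feature length
        mid = 0.5 * (lo[x] + hi[x])
        left_hi = hi[:x] + [mid] + hi[x + 1:]
        right_lo = lo[:x] + [mid] + lo[x + 1:]
        # Equation (4): score a split by the smaller of the two
        # children's sampled maximum errors.
        score = min(
            max_sampled_error(lo, left_hi, error_fn, n_samples, rng),
            max_sampled_error(right_lo, hi, error_fn, n_samples, rng),
        )
        if score < best_score:
            best_dim, best_score = x, score
    return best_dim


def example_error(p):
    """Hypothetical error surface that grows along dimension 0 only."""
    return p[0]


rng = random.Random(0)
# Arbitrary illustrative values: threshold at 5% of the bound.
eps = 0.05 * terminal_error_bound(phi=1.0, r_max=1.0, r_min=-1.0)
dim = choose_split_dimension([0.0, 0.0], [1.0, 1.0], example_error,
                             eps, [0.1, 0.1], 200, rng)
print(dim)
```

Because the example error varies only along dimension 0, halving the cube in that dimension is the only split that produces a child with a noticeably smaller maximum error, so that dimension is selected.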
Problems with non-terminal reinforcement may use an alternate upper bound, determined by accounting for an infinite string of discounted rewards: $\phi (R_{\max} - R_{\min}) / (1 - \gamma)$. These upper bounds can be used to define a more intuitive error threshold. For example, the threshold may be set to some fraction of the upper bound, making it easy to generate reasonable values. A threshold of 1% of the maximum error is often reasonable.

The number of points σ scattered in a given hypercube may be computed from the minimum Lebesgue measure:

$$ \sigma = \frac{\Omega_H}{\Omega} \tag{5} $$

where $\Omega_H$ is the Lebesgue measure of hypercube H. The smallest allowable feature is allocated exactly one point. If σ becomes 1, then the hypercube can no longer be effectively tested for splitting. Initially, the number of sample points can be very large, resulting in a substantial increase in time spent sampling and testing for splits. This number may be limited to a reasonable maximum.

3.5. Convergence

Gordon addressed the issue of convergence at length, and showed that averaging function approximators will allow the value iteration process to converge [3]. Among these are barycentric interpolators, of which the linear interpolation method described here is one. Because value iteration is done separately from the refinement process, and we operate over a finite set of actions, Gordon's convergence result still applies.

4. Experiments

4.1. 1D Golf

Figure 5. 1D Golf reward boundaries (axes: State, Action)

Figure 6. JoSTLe discretization and policy for 1D Golf: (a) Discretization; (b) Policy

One-dimensional Golf is a test problem with low dimensionality and a continuous state and action space. A golf ball is sitting on a one-dimensional line and must be hit into a hole in the center of the space. The state space is described by s ∈ [-1, 1]. The action space is also continuous: a ∈ [-1, 1]. The hole is centered at (0, 0) and is 0.5 units wide. The environment is deterministic and accessible. The system characteristics are

    s_{t+1} = s_t + a 1 a. (6) a

If the ball hits a wall, it stops and a reinforcement of -1 is received. If it lands in the hole (i.e., s_{t+1} ∈ [-0.25, 0.25]) a reinforcement of 1 is received. In all other cases, a reinforcement of 0 is received. A graphical representation of the joint space with the positive (in the center) and negative (in the corners) reward boundaries is shown in Figure 5.

This problem is more interesting than it first appears. The region of high reward is very small and nonlinear. Additionally, reinforcements are not located strictly at the boundary of the problem space, making them difficult for VRD to find.

4.2. Results

Both JoSTLe and VRD were applied to the golf problem. Though there are many possible splitting criteria for VRD, in a 1-dimensional problem average corner value difference works as well as any of them (more complex criteria are only helpful in higher dimensions [6]).
In VRD, splitting occurred if the value difference was above .1, and in the case of JoSTLe, ɛ was set to 5% of the upper bound on the error. The time step φ = 1. Both used a γ of 0 since it was known a priori that only one step is ever needed. JoSTLe began with a single joint space hypercube and learned the appropriate discretization over time. VRD began with a single line segment in the state space and was applied using several different uniform action discretizations.

For each algorithm, policy accuracy was calculated after every round of splitting and iteration. Since the optimal policy is known for this problem and always consists of a single step, the accuracy was calculated by scanning the state space and querying the model for correct policy values. The accuracy is the ratio of correct actions to total states queried.

The policy obtained by the joint learner is shown in Figure 6. In all available states, a correct action is chosen (the accuracy is 100%). The discretization hypercubes are also shown in Figure 6; it is clear that the learner concentrated its resources on areas of sharp reward transition. This behavior is expected since γ = 0.
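Why a discount factor of zero reduces the problem to a single step can be seen from a minimal sketch of the discrete update (2) for a deterministic system. The helper names (`discrete_backup`, `toy_step`, `toy_reward`, `toy_value`) and the toy dynamics are hypothetical and are not the golf system; the supremum over actions is replaced here by a maximum over a small finite candidate set, which is an assumption of the sketch.

```python
def discrete_backup(s, a, step, reward, value, actions, phi, gamma, T):
    """One application of the discrete update (2) at the point (s, a)."""
    total = 0.0
    state = s
    for t in range(T):
        # phi * gamma^(t * phi) * R(s(t * phi), a)
        total += phi * gamma ** (t * phi) * reward(state, a)
        state = step(state, a, phi)  # advance the trajectory by phi
    # The supremum is approximated by a max over candidate actions.
    tail = max(value(state, a2) for a2 in actions)
    return total + gamma ** (T * phi) * tail


# Hypothetical toy system used only to exercise the function.
def toy_step(s, a, phi):
    return s + a * phi


def toy_reward(s, a):
    return -abs(s)


def toy_value(s, a):
    return -abs(s) - 0.1 * abs(a)


candidates = (-1.0, 0.0, 1.0)
one_step = discrete_backup(0.5, 0.25, toy_step, toy_reward, toy_value,
                           candidates, phi=1.0, gamma=0.0, T=3)
discounted = discrete_backup(0.5, 0.25, toy_step, toy_reward, toy_value,
                             candidates, phi=1.0, gamma=0.5, T=2)
print(one_step, discounted)
```

With γ = 0 every term except the first vanishes, so the value at (s, a) is just φ times the immediate reinforcement, and the learned value function mirrors the reward boundaries directly.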

Figure 7. JoSTLe and VRD accuracy vs. number of vertices (axes: Vertices (log scale), Accuracy)

Figure 8. JoSTLe policy for Mountain Car (axes: Position, Velocity)

The accuracy of JoSTLe vs. the number of vertices used is shown in Figure 7. The behavior of VRD is shown on the same graph. The numbered labels indicate the number of discrete actions available to the algorithm throughout its trial. The graph shows that JoSTLe's policy accuracy went up quickly with every refinement, while the accuracy obtained with VRD rose slowly. VRD needed 256 available actions to match the accuracy of JoSTLe, and far more state/action pairs. Given fewer actions, VRD peaks at a particular policy accuracy and then levels off, since finer action discretization is required but not available.

4.3. Additional Experiments

The first additional experiment was also done using the 1D Golf problem, altered so that the reward boundaries did not cover the entire state space. This required JoSTLe to do real value iteration (with a nonzero discount factor). The results were just as good as with the original problem.

Additionally, JoSTLe found a nearly optimal policy for the Mountain Car problem [6] without any prior knowledge either of what actions would be useful or of which parts of the state space were interesting. It learned that the two bang-bang actions (full forward and full reverse) are more useful than the others. These turn out to be exactly the same two actions used in [6].

The policy learned by JoSTLe is shown in Figure 8. The policy is not perfect, but is close to optimal. Some artifacts exist and the reason for their existence is still being explored. Alternate views of the policy are shown in Figure 9. Figure 9(a) shows the full policy and highlights the fact that JoSTLe focused most of its attention on two actions: full forward and full reverse. These are the bang-bang actions used by VRD. Figures 9(b) and 9(c) show the policy projected onto the Position/Acceleration and Velocity/Acceleration planes.
The presence of points at the interior of these figures, rather than exclusively at the top and bottom, indicates that other suboptimal actions crept into the policy in some areas of the state space, a matter which needs to be studied further.

5. Liabilities

One unfortunate characteristic of JoSTLe when compared to VRD is its higher complexity. Because the dimensionality is increased, all of the worst space and time characteristics of VRD are exacerbated in JoSTLe (e.g. d is higher for JoSTLe, and each split still produces $2^d$ new vertices). Additionally, while VRD never enumerates the simplices of a hypercube, JoSTLe must. Each hypercube has exactly $d!$ Kuhn simplices, each of which JoSTLe must decompose into all $\binom{d}{d_s}$ boundaries. This has a negative impact on dimensional scalability.

Some optimizations can alleviate the complexity, placing it on a more equal footing with VRD. The culling of degenerate and redundant simplices, as well as the fact that JoSTLe often needs fewer nodes overall, can help to significantly reduce the complexity in practice. More work must be done to determine where else the complexity may be reduced.

Another problem is exposed by the fact that a perfectly optimal policy for Mountain Car was never achieved. Research revealed several potential areas of improvement. First, it is not clear that the splitting criterion presented here is exactly what is needed to generate a good policy. Second, though the integrity of the final value update equation was maintained throughout the development of the algorithm, it appears that one of the basic assumptions of value iteration may have been violated: the Markov property. VRD muscled itself into retaining this property in a continuous space by treating interpolation weights as transition probabilities. JoSTLe has no similar interpretation of interpolated points, as it must first choose an action to determine the most likely current state. In other words, we don't know where we are until we go somewhere else, a clear violation of the requirement that our decisions be based only on the current state. That JoSTLe works as well as it does indicates that the violation may not be serious, but it does merit further exploration and should be addressed in a future work.

Figure 9. JoSTLe policy for Mountain Car: (a) 3D policy; (b) Position/Acceleration projection; (c) Velocity/Acceleration projection

6. Conclusions and Future Research

JoSTLe represents a productive step toward the ability to perform value iteration on problems with continuous actions. It provides a homogeneous framework for refinement of both state and action and has an elegant appeal. Even so, there is much room for improvement.

The Markov property needs to be studied in greater detail in this context to determine whether there is a useful interpretation of JoSTLe that does not violate this property. Research in that area is ongoing. Additionally, VRD proposed influence and variance to discretize only those portions of the space that affect the overall policy; more research could be devoted to an analysis of joint-space analogues.

Non-uniform splitting is another possible research direction. Both JoSTLe and VRD split regions of space in half, which is not always optimal in terms of efficiency. The ability to perform oblique splits may also allow for a more efficient representation. Some preliminary work in this area indicates promise, but more research is needed.

References

[1] L. Baird and A. Klopf. Reinforcement learning with high-dimensional, continuous actions. Technical Report WL-TR, Wright-Patterson Air Force Base, Ohio.
[2] B. Friedland. Advanced Control System Design. Prentice Hall, New Jersey.
[3] G. J. Gordon. Stable function approximation in dynamic programming. In A. Prieditis and S. Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.
[4] L. P. Kaelbling, M. L. Littman, and A. P. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4.
[5] C. K. Monson. Reinforcement learning in the joint space: Value iteration in worlds with continuous states and actions. Master's thesis, Brigham Young University, Computer Science Department, Apr. 2003.
[6] R. Munos and A. Moore. Variable resolution discretization in optimal control. Machine Learning, 49(2/3), November/December 2002.

