State Indexed Policy Search by Dynamic Programming

Charles DuHadway, Yi Gu
5435537, 503372

December 4, 2007

Abstract

We consider the reinforcement learning problem of simultaneous trajectory-following and obstacle avoidance by a radio-controlled car. A space-indexed non-stationary controller policy class is chosen that is linear in the feature set, where the multiplier of each feature in each controller is learned using the policy search by dynamic programming algorithm. Under the control of the learned policies, the radio-controlled car is shown to be capable of reliably following a predefined trajectory while avoiding point obstacles along its way.

1. Introduction

We consider the reinforcement learning problem of trajectory-following involving discrete decisions. Casting the problem as a Markov decision process (MDP), we have a tuple (S, A, {P_sa}, γ, R) denoting, respectively, the set of states, the set of actions, the state transition probabilities, the discount factor and the reward function. The objective is to find a policy set Π_opt that is optimal with respect to some choice of R. In cases where the system dynamics, R and Π are linear or linearizable, Π_opt can be readily solved for, e.g. by linear quadratic regulators (LQR) [1]. Our current work however focuses on the case where such linearity does not hold, and therefore generalizes the class of trajectory-following problems to scenarios involving non-trivial environmental constraints. We use a radio-controlled (RC) car as a test platform, and introduce non-linearities to R and Π through the need to make discrete decisions in trying to avoid a point obstacle along a predefined trajectory. We choose the policy search by dynamic programming (PSDP) algorithm as the means of finding Π_opt. This is because, given a base distribution (indicating roughly how often we expect a good policy to visit each state), PSDP can efficiently compute Π_opt in spite of non-linear R and Π [2]; hence our trajectory-following problem fits naturally into this framework.
In trajectory-following problems, actions customized for local path characteristics are especially useful. Requisite to such exploitation, however, are: a) a non-stationary policy set, i.e. Π changes as the car traverses the trajectory, and b) an accurate correspondence between spatial position and the policy used. However, the accuracy of this correspondence is inevitably degraded by run-time uncertainties. Consequently we chose for the policies to be indexed over space instead of time increments, as under such a scheme the correspondence would be immune to the said uncertainties. The goal of our work is to investigate the feasibility of employing PSDP in trajectory-following problems under a space-indexed scheme. More concretely, we aim to find a suitable parameterization of the system: the state space S, action space A, discount mechanism, reward function R and policy class Π, such that the RC car would reliably follow a predefined trajectory in the absence of obstacles, and avoid obstacles with some margin in their presence. We will first present a space-indexed parameterization of the system, and then define the policy class and feature set. This leads on to the algorithm's implementation, starting with a brief definition of PSDP, followed by a section-by-section account of the choices made for the various functions, parameters and other algorithmic components. The results and accompanying analysis are provided in the final part of the paper.

2. System parameterization

Typically, a reasonable system parameterization for a vehicle could be

s_t = [x_t, y_t, ψ_t, u_t, v_t, u_{t-1}, v_{t-1}]^T

where x_t and y_t are the Cartesian coordinates of the vehicle in the world coordinate system (WCS), ψ_t is the orientation of the car in the WCS, and u_t and v_t are the longitudinal and lateral velocities in the vehicle coordinate system (VCS). All values are with respect to the center of the vehicle, which is assumed to be a rectangle. The subscripts t and t-1 indicate the time instants to which the values correspond.
This parameterization is illustrated graphically in Figure 1.
Figure 1. Time-indexed states

Note that time is not included as a state, as the mapping of each controller to the time instant at which it is used makes this implicit. Under the space-indexed scheme for a vehicle driving in R², however, each controller is mapped to a one-dimensional subspace which we will call a waypoint. We define each waypoint to be a subspace orthogonal to the trajectory tangent at the point where the waypoint exists. This is illustrated in Figure 2.

Figure 2. Space-indexed states

To localize the vehicle along a waypoint, we define l_d as the cross-track error between where the target trajectory intersects the waypoint and where the center of the vehicle intersects the same waypoint. This gives a space-indexed system parameterization

s_d = [t_d, l_d, ψ_d, u_d, v_d, u_{d-1}, v_{d-1}]^T

where t_d denotes the time at space index d along the predefined target trajectory, while ψ_d, u_d and v_d are defined similarly as before except they now denote state values at space index d. Note that the position of the vehicle is implicitly but completely specified. The simple bicycle model ṡ = f(s, a) is used to describe the vehicle dynamics, where a ∈ A is an action. State propagation from index d to d+1 is achieved by first computing Δt such that at t_{d+1} = t_d + Δt the vehicle is precisely at waypoint d+1; the states at d+1 are then computed by propagating s_d forward through f(s, a) for exactly Δt. For brevity, states capturing historic values of the steering angle, used to model the steering dynamics, are not shown.

3. Policy class

The ultimate performance of the algorithm depends critically on the policy class, which is defined by both its format and the features that constitute it. As mentioned previously, the class of policy employed is non-stationary, i.e. π(Θ) = (π_0(θ_0), …, π_{D-1}(θ_{D-1})), where Θ = (θ_0, …, θ_{D-1}), so that we have a one-to-one mapping between each index d and stationary policy π_d(θ_d). All π_d(θ_d): S → A come from the same policy class, and for simplicity we chose each π_d(θ_d) to be linear in the feature set, i.e.
π_d(s_d) = θ_d^T g(s_d)

θ_d = [θ_{d,1}, θ_{d,2}, θ_{d,3}, θ_{d,4}]^T

g(s_d) = [g_1, g_2, g_3, g_4]^T

g_1 is simply the cross-track error l_d. g_2 is the yaw of the car with respect to the target trajectory tangent at d; for example, if the orientation of the vehicle at d in the WCS is 40° while the trajectory tangent at d in the WCS is 37°, then g_2 = 3°. Together g_1 and g_2 keep the vehicle close and parallel to the target trajectory. g_3 is the anticipated clearance surplus from the obstacle. For instance, if l_d = 1.5 cm (1.5 cm to the right of the trajectory, looking in the direction of travel) as the car is approaching a point obstacle some distance ahead situated 15 cm to the left of the trajectory, and the desired clearance is 20 cm to the right of the obstacle, then g_3 = (1.5 + 15) - 20 = -3.5 cm, i.e. the car needs to be 3.5 cm further to the right to avoid colliding with the obstacle. As a conservative measure, g_3's value is offset by 5 cm, so that g_3 = -25 cm at -20 cm real clearance, increasing to -5 cm at 0 cm real clearance, beyond which its absolute value decays exponentially while remaining negative. g_3 is intended to steer the car off the target trajectory and away from an obstacle. g_3 = 0 when no obstacle is within the control horizon. We allow the car to reevaluate which side to pass an obstacle on at each step, the rule being that if the obstacle is currently to the left of the car, then pass it on the right, and vice versa; g_3 is computed accordingly. g_4 is the obstacle-conditioned yaw. Its value is non-zero only when the car is approaching an obstacle and is steering in a direction that would reduce the clearance. In such cases, g_4 is the yaw of the trajectory tangent as measured in the VCS. This feature is included to preserve any clearance achieved via g_3, since an existing clearance would tend to make g_3 small, and a controller without g_4 would be dominated by g_1 and g_2. This concludes the definition of the policy class, and we note in particular that the policy class is clearly non-linear and cannot be easily cast into a form suitable for a linear solver.
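As an illustration, the linear space-indexed policy can be sketched as follows. This is a minimal sketch: the weights θ_d and the feature values are hypothetical placeholders, not values from the learned controllers.

```python
import numpy as np

def features(cross_track_err, rel_yaw, clearance_surplus, obstacle_yaw):
    """Assemble g(s_d) = [g1, g2, g3, g4]^T as described above."""
    return np.array([cross_track_err, rel_yaw, clearance_surplus, obstacle_yaw])

def steering_command(theta_d, g, max_lock=28.0):
    """Linear policy pi_d(s_d) = theta_d^T g(s_d), clipped to the physical
    steering range of the car (+/- 28 degrees)."""
    return float(np.clip(theta_d @ g, -max_lock, max_lock))

# Hypothetical weights for one waypoint index d (PSDP learns one such
# vector per index).
theta_d = np.array([-0.5, -1.2, 0.8, -0.6])

# Feature values echoing the g_3 example above: 1.5 cm cross-track error,
# 3 degrees of relative yaw, no obstacle within the control horizon.
g = features(1.5, 3.0, 0.0, 0.0)
steering = steering_command(theta_d, g)  # a steering angle in degrees
```

Because the policy is linear in g(s_d), learning reduces to searching over the four multipliers per waypoint, which is what makes the PSDP backup tractable.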
4. Algorithm implementation

4.1 Definition

We start by considering a trajectory that spans D space indices and a policy class as defined in section 3, and we define the value function V(s):

V_{π_d, …, π_{D-1}}(s) = E[ Σ_{i=d}^{D-1} R(s_i) | s_d = s; π_d, …, π_{D-1} ]
                       = R(s) + E_{s' ~ P_{s π_d(s)}}[ V_{π_{d+1}, …, π_{D-1}}(s') ]

where s' ~ P_{s π_d(s)} indicates that the expectation is with respect to s' drawn from the state transition probability P_{s π_d(s)}. The PSDP algorithm takes the form:

for d = D-1, D-2, …, 0:
    set π_d = argmax_{π' ∈ Π} E_{s ~ μ_d}[ V_{π', π_{d+1}, …, π_{D-1}}(s) ]

where we assume we are given a sequence of base distributions μ_d over the states. PSDP is thus a policy iteration algorithm that yields an optimal π_d for every space index d. The dynamic programming part is evident in that the solution is built from the end of the trajectory backwards towards the beginning.

4.2 Initialization

We now turn our attention to μ_d, and note that we do not know the set of base distributions (μ_0, μ_1, …, μ_{D-1}) over a predefined trajectory. However, if we had a controller that could simultaneously track a target trajectory and avoid obstacles, then by running the RC car over the trajectory using that controller a large number m of times, we would see a representative distribution of s_d at each d across all m trials. In this way we would effectively have sampled from μ_d without knowing the explicit form of μ_d; the problem is that if we had such a controller, we wouldn't have needed to run PSDP to begin with. One way around this catch-22 is to start PSDP by sampling from a poor approximation of μ_d, i.e.:

1. Run the RC car over the target trajectory without obstacles m times using a suboptimal yet reasonable controller. Add obstacles to each of the m trajectories.
2. Derive a set of policies π using PSDP based on these m samples.
3. Run the RC car over m trajectories with obstacles using π.
4. Iterate over steps 2 and 3 until π converges, at which point we deem that we have been sampling from a good approximation of (μ_0, μ_1, …, μ_{D-1}).

In our implementation, the suboptimal controller is derived in two steps.
First we compute a noiseless state trajectory using a heuristically-tuned PID controller over the target trajectory with no obstacles. The resulting state sequence is used as the starting point for finding an LQR controller using differential dynamic programming (DDP). We finally obtain the m samples by using the LQR controller in step 1 above. Graphically, the PSDP algorithm as defined in 4.1 and initialized in 4.2 appears as in Figure 3.

Figure 3. PSDP algorithmic flow

4.3 Action space

The control of an RC car involves steering and throttle. If we choose to drive the vehicle at constant velocity, then our action space is simplified to comprise only steering. The steering angle range of the RC car is a continuous interval from -28° (full lock left) to 28° (full lock right). In order to keep the calculation of V(s) tractable, however, this interval is discretized to 10 points: A = {-28°, -21.78°, …, 21.78°, 28°}.

4.4 Transition probability

Two mechanisms influence the transition from state s_d to s_{d+1}, viz. system dynamics and noise. The former is captured by f(s, a), while the noise is modeled as zero-mean Gaussian measurement noise. Consequently, our transition probability is

s_{d+1} ~ N(s_d + f(s_d, a_d) Δt, σ²)

where σ² is chosen to be consistent with the variance of state measurements obtained from the RC car's Kalman filter.

4.5 Discount factor

Recall that V(s) is the expected sum of R(s) over distance, and terms for which γ^h R(s_{d+h-1}) < ε for some small ε are ignored as they have negligible effect (those more than h-1 indices down the length of the trajectory from d). If γ is close to 1 then the learned policies are anticipatory of the future and tend to be more globally optimal. However, the closer γ is to 1 the larger h becomes for a given ε, with a correspondingly longer computation time. If γ is close to 0, the computation time would be shorter, but
the policies would also be less sensitive to future high-cost consequences of current actions, making them only locally optimal. We found that the geometrically decaying weighting envelope γ imposes could not yield a satisfactory balance between global optimality and computation time for any value between 0 and 1. Consequently we sacrificed envelope continuity and instead employed a rectangular weighting envelope: V(s) is thus the sum of a fixed number h of reward functions, each of which is weighted equally.

4.6 Reward function

The desired behaviors of the car include: a) close tracking of the target trajectory, b) reliable avoidance of obstacles, and c) use of steering actions that are realizable, i.e. limited bandwidth actuation. Since it is more natural to treat these attributes in the negative, e.g. we do not wish the car to be far from the target trajectory, R(s) will henceforth refer to a cost function instead of a reward function. The problem remains unchanged, only now we wish to minimize R(s) instead of maximizing it. From these, a 4-component cost function was conceived:

R(s_d) = { b_1 l_d² + b_2 q̇²                       if no collision
         { b_2 q̇² + b_3 (c̄ - c)/c̄ + b_4           otherwise

where q̇ is the derivative of the steering angle, c is the distance between the center of the car and a point obstacle, and c̄ is the desired minimum clearance to an obstacle. If c < c̄, then a collision is considered to have occurred; this effectively treats the car as a circular object and was a conscious and conservative choice. The term involving l_d penalizes deviation from the target path, the term involving q̇ penalizes rapid changes in steering, the constant b_4 penalizes a collision, and the term involving c and c̄ enhances the repulsion of the car away from the obstacle during learning. Figure 4 is a visualization aid for R(s), which has discrete regimes and is clearly non-linear. Note in particular the expected parabolic appearance of the b_1 contribution, and the half cylinder and the slight half cone atop it in (b), attributable to the b_4 and b_3 terms respectively.
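As a sketch, the two-regime cost can be written out as follows. The weights b_1 to b_4 and the desired clearance are hypothetical placeholders (the paper chose its values experimentally), and the linear repulsion term is one plausible reading of the cone-shaped b_3 contribution in Figure 4(b).

```python
# Hypothetical weights; the paper's b1..b4 were chosen experimentally.
B1, B2, B3, B4 = 1.0, 0.1, 50.0, 100.0
C_BAR = 20.0  # desired minimum clearance to an obstacle, in cm

def cost(l, q_dot, c):
    """l: cross-track error (cm); q_dot: steering-angle rate;
    c: distance from the car center to the nearest point obstacle (cm)."""
    if c >= C_BAR:
        # no collision: penalize path deviation and rapid steering
        return B1 * l ** 2 + B2 * q_dot ** 2
    # collision (c < c_bar): flat penalty b4 plus a repulsion term that
    # grows linearly as the car gets closer to the obstacle
    return B2 * q_dot ** 2 + B3 * (C_BAR - c) / C_BAR + B4
```

Note the cost is nowhere flat inside the collision regime: it strictly increases as c shrinks, which gives the learner a gradient away from the obstacle.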
The weighting parameters b_1 to b_4 were chosen experimentally.

(a) No obstacle  (b) Collision
Figure 4. Cost function visualization

Note that at this point we have also fully specified the MDP in the context of our trajectory-following problem, having defined the tuple (S, A, {P_sa}, γ, R).

4.7 Optimization

Referring to Figure 3, in each iteration of PSDP we have m sample state trajectories, each containing D points. For a DP backup at d:

1. We start with s_d on the first sample trajectory, note the random number seed, and take action a_1 ∈ A.
2. Compute s_{d+1}, …, s_{d+h-1} (h as defined in 4.5), the corresponding costs R(s) and V_{a_1}(s_d) (the subscript a_1 indicates unique correspondence to action a_1).
3. Re-seed the random number generator with the value noted in step 1, take action a_2 ∈ A and repeat step 2.
4. Repeat steps 2 and 3 for actions a_3 to a_10.
5. Repeat steps 1 to 4 for sample trajectories 2 to m.

At the end of these 5 steps, we have a record of how V(s) varies with action at index d for each of the m samples. We then use the coordinate descent algorithm with dynamic resolution to find the θ_d that minimizes the expected value, i.e. we solve

π_d(θ_d) = argmin_{θ_d ∈ Θ} E_{s ~ μ_d}[ V_{π_d(θ_d), π_{d+1}, …, π_{d+h}}(s) ]

To expedite the numeric search, the starting point is chosen to be the least squares estimate θ_init = (B^T B)^{-1} B^T b, where b is the vector of value-minimizing steering angles
specific to each trajectory, and B is a matrix containing the feature values associated with each trajectory. This is based on the observation that the least squares fit of θ_d to the individual steering actions that minimize the individual V(s) is often close to the solution that minimizes the aggregate.

5. Results and discussion

5.1 Performance

The learned controllers performed well in both simulation and the physical system. Example trajectories from both are shown in Figure 5. To quantitatively evaluate the performance of a learned controller we used both the cost function and the obstacle collision rate. The controller shown in Figure 5 had a collision rate of 3.3%. The target and actual trajectories are shown in blue and black respectively. Obstacles are shown as double red circles. The inner circle denotes the closest the physical car could pass by the obstacle without collision if perfectly aligned (exactly tangent to the obstacle). The outer circle denotes the farthest the car could be and still manage to hit the obstacle given worst-case alignment (exactly orthogonal to the obstacle). Note the controller performs well despite the obvious difference in sampling rates between the actual and simulated systems.

Figure 5. Simulated and physical trajectories

Circular trajectories were used to test the actual car system due to physical limitations in our tracking system. Similar performance was observed in simulation for both straight and oval trajectories, and we expect our system to perform well over a large variety of trajectories.

5.2 Choosing a policy class and cost function

Choosing the appropriate policy class and cost function was non-trivial and required a series of tests to choose between several alternatives. Figure 6 shows the results of one such test comparing two cost functions: one with an accurate bounding box collision test and one with a conservatively inaccurate bounding circle collision test.
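The difference between the two collision tests can be sketched as follows; the car dimensions here are hypothetical, as the real system's geometry is not given in the paper.

```python
import math

CAR_LEN, CAR_WID = 40.0, 20.0  # hypothetical car dimensions, in cm

def collides_box(dx, dy):
    """Accurate test: point obstacle at (dx, dy) in the car frame
    against the car's bounding rectangle."""
    return abs(dx) <= CAR_LEN / 2 and abs(dy) <= CAR_WID / 2

def collides_circle(dx, dy):
    """Conservative test: the car is treated as its circumscribed circle,
    so some near misses are flagged as collisions."""
    radius = math.hypot(CAR_LEN, CAR_WID) / 2
    return math.hypot(dx, dy) <= radius

# A point just off the car's nose: not a box collision, but the
# conservative circle test still flags it.
near_miss = (21.0, 0.0)
```

The circle test over-reports collisions, but because it is rotation-invariant it never misses one, which matters when the cost function is driving the learner away from obstacles.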
In this case the latter's performance far exceeded the former's.

Figure 6. Bounding box vs. bounding circle collision test

Similar tests were used to choose the specific feature set and cost function described above. Among other things, we learned that the standard integral term was unnecessary for our learning task. We also found it important to use a cost function with no flat areas.

5.3 Conclusions

We have implemented a system that demonstrates the feasibility and effectiveness of employing PSDP in a state-indexed scheme to solve a reinforcement learning problem with a non-linear policy class and reward function. Our approach takes full advantage of PSDP's strengths while eliminating its dependence on accurate space-time correlation.

5.4 Next Steps

The obvious next step is a detailed comparison of state-indexed PSDP with time-indexed PSDP. We are confident that such a comparison will clearly demonstrate the superiority of state-based PSDP. This should be especially clear over long or non-uniform trajectories.

Acknowledgments

The authors wish to acknowledge valuable inputs from Andrew Ng and Adam Coates during the course of this research project.

References

[1] Abbeel P., Coates A., Quigley M., and Ng A. Y., "An Application of Reinforcement Learning to Aerobatic Helicopter Flight," in NIPS 19, 2007.
[2] Bagnell J. A., Kakade S., Ng A. Y., and Schneider J., "Policy Search by Dynamic Programming," in NIPS 16, 2004.