State Indexed Policy Search by Dynamic Programming

Charles DuHadway    Yi Gu
5435537    503372

December 4, 2007

Abstract

We consider the reinforcement learning problem of simultaneous trajectory-following and obstacle avoidance by a radio-controlled car. A space-indexed non-stationary controller policy class is chosen that is linear in the feature set, where the multiplier of each feature in each controller is learned using the policy search by dynamic programming algorithm. Under the control of the learned policies, the radio-controlled car is shown to be capable of reliably following a predefined trajectory while avoiding point obstacles along its way.

1. Introduction

We consider the reinforcement learning problem of trajectory-following involving discrete decisions. Casting the problem as a Markov decision process (MDP), we have a tuple (S, A, {P_sa}, γ, R) denoting, respectively, the set of states, the set of actions, the state transition probabilities, the discount factor and the reward function; the objective is to find a policy set Π_opt that is optimal with respect to some choice of R. In cases where the system dynamics, R and Π are linear or linearizable, Π_opt can be readily solved for, e.g. by linear quadratic regulators (LQR) [1]. Our current work, however, focuses on the case where such linearity does not hold, and therefore generalizes the class of trajectory-following problems to scenarios involving non-trivial environmental constraints. We use a radio-controlled (RC) car as a test platform, and introduce non-linearities to R and Π through the need to make discrete decisions in trying to avoid a point obstacle along a predefined trajectory.

We choose the policy search by dynamic programming (PSDP) algorithm as the means of finding Π_opt. This is because, given a base distribution (indicating roughly how often we expect a good policy to visit each state), PSDP can efficiently compute Π_opt in spite of non-linear R and Π [2]; hence our trajectory-following problem fits naturally into this framework.

In trajectory-following problems, actions customized for local path characteristics are especially useful. Requisite to such exploitation, however, are: a) a non-stationary policy set, i.e. Π changes as the car traverses the trajectory, and b) an accurate correspondence between spatial position and the policy used. However, the accuracy of this correspondence is inevitably degraded by run-time uncertainties. Consequently we chose for the policies to be indexed over space instead of time increments, as under such a scheme the correspondence would be immune to the said uncertainties.

The goal of our work is to investigate the feasibility of employing PSDP in trajectory-following problems under a space-indexed scheme. More concretely, we aim to find a suitable parameterization of the system S, action space A, discount mechanism, reward function R and policy class Π, such that the RC car would reliably follow a predefined trajectory in the absence of obstacles, and avoid obstacles with some margin in their presence. We will first present a space-indexed parameterization of the system, and then define the policy class and feature set. This leads on to the algorithm's implementation, starting with a brief definition of PSDP, followed by a section-by-section account of the choices made for the various functions, parameters and other algorithmic components. The results and accompanying analysis are provided in the final part of the paper.

2. System parameterization
Typically, a reasonable system parameterization for a vehicle could be

s_t = [x_t, y_t, θ_t, u_t, v_t, u_{t-1}, v_{t-1}]^T

where x_t and y_t are the Cartesian coordinates of the vehicle in the world coordinate system (WCS), θ_t is the orientation of the car in the WCS, and u_t and v_t are the longitudinal and lateral velocities in the vehicle coordinate system (VCS). All values are with respect to the center of the vehicle, which is assumed to be a rectangle. The subscripts t and t−1 indicate the time instants to which the values correspond. This parameterization is illustrated graphically in Figure 1.

Figure. Timeinexe states Note that time is not inclue as a state as the mapping of controller to the time instant it is use makes this implicit. Uner the space-inexe scheme for a vehicle riving in R 2 however, each controller is mappe to a subspace in R which we will call a waypoint. We efine each waypoint to be a subspace orthogonal to the trajectory tangent at the point where the waypoint exists. This is illustrate in Figure 2. Figure 2. Space-inexe states To localize the vehicle along a waypoint, we efine l as the cross-track error between where the target trajectory intersects the waypoint an where the center of the vehicle intersects with the same waypoint. This gives a space-inexe system parameterization [ t l u v u v ] T s = where t enotes the time at space inex along the preefine target trajectory, while, u an v are efine similarly as before except they now enote state values at a space inex. Note that the position of the vehicle is implicitly but completely specifie. The simple bicycle moel s & = f ( s, a) is use to escribe the vehicle ynamics, where a A is an action. State propagation from inex to + is achieve by first computing t such that at t + =t + t the vehicle is precisely at waypoint +, then states at + are compute by propagating s forwar through f(s,a) by exactly t. For brevity, states capturing historic values of the steering angle use to moel the steering ynamics are not shown. 3. Policy class The ultimate performance of the algorithm epens critically on the policy class, which is efine by both its format an the features that constitutes it. As mentione previously, the class of policy employe is nonstationary i.e. (Θ)=( ( ),, D- ( D- )), where Θ=(,, D- ), so that we have a one-to-one mapping between each inex an stationary policy ( ). All ( ): S A come form the same policy class, an for simplicity we chose each ( ) to be linear in the feature set i.e. T = g s ( ) ( ) = [ ],,2,3 [ g g g g ] T,4 g( s ) = 2 3 4 g is simply the cross-track error l, g 2 is the yaw of the car with respect to the target trajectory tangent at ; for example if the orientation of the vehicle at in WCS is 40 while the trajectory tangent at in WCS is 37, then g 2 =3. Together g an g 2 keep the vehicle close an parallel to the target trajectory. g 3 is the anticipate clearance surplus from the obstacle. For instance if l =.5 cm (.5 cm to the right of the trajectory looking in the irection of travel) as the car is approaching a point obstacle some istance ahea situate 5 cm to the left of the trajectory, an the esire clearance is 20 cm to the right of the obstacle, then g 3 =(.5+5)-20=-3.5 cm i.e. the car nees to be 3.5 cm further to the right to avoi colliing with the obstacle. As a conservative measure, g 3 s value is offset by 5 cm so that g 3 =-25 cm at -20 cm real clearance, increasing to -5 cm at 0 cm real clearance, beyon which its absolute value ecays exponentially while remaining negative. g 3 is intene to steer the car off the target trajectory away an from an obstacle. g 3 =0 when no obstacle is within the control horizon. We allow the car to reevaluate which sie to pass an obstacle on at each step, the rule being that if the obstacle is currently to the left of the car, then pass it on the right an vice versa, an g 3 is compute accoringly. g 4 is the obstacle conitione yaw. Its value is non-zero only when the car is approaching an obstacle an is steering in a irection that woul reuce the clearance. In such cases, g 4 is yaw of the trajectory tangent as measure in the VCS. 
3. Policy class

The ultimate performance of the algorithm depends critically on the policy class, which is defined by both its format and the features that constitute it. As mentioned previously, the class of policy employed is non-stationary, i.e. π(Θ) = (π_0(θ_0), ..., π_{D-1}(θ_{D-1})), where Θ = (θ_0, ..., θ_{D-1}), so that we have a one-to-one mapping between each index d and a stationary policy π_d(θ_d). All π_d(θ_d): S → A come from the same policy class, and for simplicity we chose each π_d(θ_d) to be linear in the feature set, i.e.

π_d(s_d) = θ_d^T g(s_d), with θ_d = [θ_{d,1}, θ_{d,2}, θ_{d,3}, θ_{d,4}]^T and g(s_d) = [g_1, g_2, g_3, g_4]^T.

g_1 is simply the cross-track error l_d. g_2 is the yaw of the car with respect to the target trajectory tangent at d; for example, if the orientation of the vehicle at d in the WCS is 40 while the trajectory tangent at d in the WCS is 37, then g_2 = 3. Together, g_1 and g_2 keep the vehicle close and parallel to the target trajectory.

g_3 is the anticipated clearance surplus from the obstacle. For instance, if l_d = 1.5 cm (1.5 cm to the right of the trajectory looking in the direction of travel) as the car is approaching a point obstacle some distance ahead situated 15 cm to the left of the trajectory, and the desired clearance is 20 cm to the right of the obstacle, then g_3 = (1.5 + 15) − 20 = −3.5 cm, i.e. the car needs to be 3.5 cm further to the right to avoid colliding with the obstacle. As a conservative measure, g_3's value is offset by 5 cm, so that g_3 = −25 cm at −20 cm real clearance, increasing to −5 cm at 0 cm real clearance, beyond which its absolute value decays exponentially while remaining negative. g_3 is intended to steer the car off the target trajectory and away from an obstacle; g_3 = 0 when no obstacle is within the control horizon. We allow the car to re-evaluate which side to pass an obstacle on at each step, the rule being that if the obstacle is currently to the left of the car, then it is passed on the right, and vice versa, and g_3 is computed accordingly.

g_4 is the obstacle-conditioned yaw. Its value is non-zero only when the car is approaching an obstacle and is steering in a direction that would reduce the clearance. In such cases, g_4 is the yaw of the trajectory tangent as measured in the VCS. This feature is included to preserve any clearance achieved via g_3, since an existing clearance would tend to make g_3 small, and a controller without g_4 would be dominated by g_1 and g_2. This concludes the definition of the policy class, and we note in particular that the policy class is clearly non-linear and cannot be easily cast into a form suitable for a linear solver.
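A minimal sketch of the linear-in-features policy follows. The feature helpers are simplified placeholders for the geometry described above (in particular, g_3's exponential tail and the side-switching rule are omitted), the discrete action set anticipates Section 4.3, and the θ_d values in the usage example are arbitrary.

```python
import numpy as np

ACTIONS = np.linspace(-28.0, 28.0, 10)     # discrete steering angles (Section 4.3)

def clearance_surplus(l_d, obstacle_offset, desired_clearance=20.0, safety_offset=5.0):
    """g_3: anticipated clearance surplus in cm (simplified: no exponential tail)."""
    if obstacle_offset is None:
        return 0.0                          # no obstacle within the control horizon
    lateral_gap = abs(l_d - obstacle_offset)
    return lateral_gap - desired_clearance - safety_offset

def features(l_d, rel_yaw, obstacle_offset, tangent_yaw_vcs, closing_on_obstacle):
    """g(s_d) = [g_1, g_2, g_3, g_4] as described in Section 3."""
    g1 = l_d                                # cross-track error
    g2 = rel_yaw                            # yaw relative to the trajectory tangent
    g3 = clearance_surplus(l_d, obstacle_offset)
    g4 = tangent_yaw_vcs if closing_on_obstacle else 0.0   # obstacle-conditioned yaw
    return np.array([g1, g2, g3, g4])

def policy(theta_d, feat):
    """pi_d(s_d) = theta_d^T g(s_d), snapped to the nearest admissible steering action."""
    raw = float(theta_d @ feat)
    return ACTIONS[int(np.argmin(np.abs(ACTIONS - raw)))]

# Example: the scenario from the text (l_d = 1.5 cm, obstacle 15 cm to the left, yaw error 3).
feat = features(l_d=1.5, rel_yaw=3.0, obstacle_offset=-15.0,
                tangent_yaw_vcs=0.0, closing_on_obstacle=False)
steer = policy(np.array([-0.5, -0.8, 0.2, 0.0]), feat)
```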

4. Algorithm implementation

4.1 Definition

We start by considering a trajectory that spans D space indices and a policy class as defined in Section 3. We define the value function

V_{π_d,...,π_{D-1}}(s_d) = E[ Σ_{i=d}^{D-1} R(s_i) | s_d; π_d,...,π_{D-1} ]
                         = R(s_d) + E_{s_{d+1} ~ P_{s_d π_d(s_d)}}[ V_{π_{d+1},...,π_{D-1}}(s_{d+1}) ]

where s_{d+1} ~ P_{s_d π_d(s_d)} indicates that the expectation is with respect to s_{d+1} drawn from the state transition probability P_{s_d π_d(s_d)}. The PSDP algorithm takes the form:

for d = D−1, D−2, ..., 0:
    set π_d = arg max_{π ∈ Π} E_{s ~ μ_d}[ V_{π, π_{d+1},...,π_{D-1}}(s) ]

where we assume we are given a sequence of base distributions μ_d over the states. PSDP is thus a policy iteration algorithm that yields an optimal π_d for every space index d. The dynamic programming character is evident in that the policies are solved for starting from the end of the trajectory and working towards the beginning.

4.2 Initialization

We now turn our attention to μ_d, and note that we do not know the set of base distributions (μ_0, μ_1, ..., μ_{D-1}) over a predefined trajectory. However, if we had a controller that could simultaneously track a target trajectory and avoid obstacles, then by running the RC car over the trajectory using that controller a large number m of times, we would see a representative distribution of s_d at each index d across all m trials. In this way we would have effectively sampled from μ_d without knowing its explicit form; the problem is that if we had such a controller, we wouldn't have needed to run PSDP to begin with. One way around this catch-22 is to start PSDP by sampling from a poor approximation of μ_d, i.e.:

1. Run the RC car over the target trajectory without obstacles m times using a suboptimal yet reasonable controller. Add obstacles to each of the m trajectories.
2. Derive a set of policies (π_0, ..., π_{D-1}) using PSDP based on these m samples.
3. Run the RC car over the m trajectories with obstacles using these policies.
4. Iterate over steps 2 and 3 until the policies converge, at which point we deem that we have been sampling from a good approximation of (μ_0, μ_1, ..., μ_{D-1}).

In our implementation, the suboptimal controller is derived in two steps. First we compute a noiseless state trajectory using a heuristically-tuned PI controller over the target trajectory with no obstacles. The resulting state sequence is then used as the starting point for finding an LQR controller using differential dynamic programming (DDP). We finally obtain the m samples by using the LQR controller in step 1 above. Graphically, the PSDP algorithm as defined in 4.1 and initialized in 4.2 appears as in Figure 3.

Figure 3. PSDP algorithmic flow

4.3 Action space

The control of an RC car involves steering and throttle. If we choose to drive the vehicle at constant velocity, our action space is simplified to comprise only steering. The steering angle range of the RC car is a continuous interval from −28 (full lock left) to 28 (full lock right). In order to keep the calculation of V(s) tractable, however, this interval is discretized to 10 points: A = {−28, −21.78, ..., 21.78, 28}.

4.4 Transition probability

Two mechanisms influence the transition from state s_d to s_{d+1}, viz. system dynamics and noise. The former is captured by f(s, a), while the noise is modelled as zero-mean Gaussian measurement noise. Consequently, our transition probability is

s_{d+1} ~ N( s_d + f(s_d, a_d) Δt_d , σ² )

where σ² is chosen to be consistent with the variance of state measurements obtained from the RC car's Kalman filter.
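To summarize Sections 4.1 and 4.2 in code form, a schematic sketch of the outer PSDP loop is given below. It is not the authors' implementation: the rollout, value-estimation and per-index optimization routines are stub interfaces standing in for the machinery described in Sections 4.3 to 4.7, and the fixed number of outer iterations stands in for the convergence test in step 4.

```python
import numpy as np

def psdp(D, rollout, optimize_theta, estimate_value, n_outer_iters=5, n_features=4):
    """Space-indexed PSDP with resampled base distributions (Sections 4.1-4.2).

    rollout(thetas)                     -> m sampled trajectories, each a list of D states
    estimate_value(d, theta, thetas, s) -> Monte Carlo estimate of V at state s if theta is
                                           used at index d and thetas[d+1:] thereafter
    optimize_theta(d, states, value_fn) -> theta_d minimizing the empirical expected value
    """
    thetas = [np.zeros(n_features) for _ in range(D)]   # one linear policy per space index

    for _ in range(n_outer_iters):                      # outer loop: re-approximate mu_d
        samples = rollout(thetas)                       # sample states under current policies
        for d in reversed(range(D)):                    # backward sweep: d = D-1, ..., 0
            states_d = [traj[d] for traj in samples]    # empirical base distribution at d
            thetas[d] = optimize_theta(
                d, states_d,
                value_fn=lambda th, s, d=d: estimate_value(d, th, thetas, s))
    return thetas
```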
4.5 Discount factor

Recall that V(s) is the expected sum of R(s) over distance; terms for which γ^h R(s_{d+h−1}) < ε for some small ε are ignored, as they have a negligible effect (these are the terms more than h−1 indices down the length of the trajectory from d). If γ is close to 1 then the learned policies are anticipatory of the future and tend to be more globally optimal. However, the closer γ is to 1 the larger h becomes for a given ε, with a correspondingly longer computation time. If γ is close to 0, the computation time would be shorter, but the policies would also be less sensitive to the future high-cost consequences of current actions, making them only locally optimal. We found that the geometrically decaying weighting envelope imposed by γ could not yield a satisfactory balance between global optimality and computation time for any value between 0 and 1. Consequently we sacrificed envelope continuity and instead employed a rectangular weighting envelope: V(s) is thus the sum of a fixed number h of reward terms, each of which is weighted equally.
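As a small illustration of the two weighting envelopes compared here (a sketch only; the values of γ, ε and h are arbitrary):

```python
import numpy as np

def geometric_value(costs, gamma, eps=1e-3):
    """Geometric envelope: accumulate gamma^i * R(s_{d+i}) until terms fall below eps."""
    total = 0.0
    for i, r in enumerate(costs):
        term = (gamma ** i) * r
        if abs(term) < eps:
            break
        total += term
    return total

def rectangular_value(costs, h):
    """Rectangular envelope: equally weighted sum of the next h terms."""
    return float(np.sum(costs[:h]))

costs = np.random.rand(50)          # stand-in for R(s_d), ..., R(s_{d+49})
print(geometric_value(costs, gamma=0.9), rectangular_value(costs, h=10))
```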

4.6 Reward function

The desired behaviors of the car include: a) close tracking of the target trajectory, b) reliable avoidance of obstacles, and c) use of steering actions that are realizable, i.e. limited-bandwidth actuation. Since it is more natural to treat these attributes in the negative (e.g. we do not wish the car to be far from the target trajectory), R(s) will henceforth refer to a cost function instead of a reward function. The problem remains unchanged, only now we wish to minimize R(s) instead of maximizing it. From these a 4-component cost function was conceived:

R(s_d) = b_1 l_d² + b_2 q̇²                              if there is no collision
R(s_d) = b_2 q̇² + b_3 (c̄ − c_d)/c̄ + b_4                otherwise

where q̇ is the derivative of the steering angle, c_d is the distance between the center of the car and a point obstacle, and c̄ is the desired minimum clearance to an obstacle. If c_d < c̄, then a collision is considered to have occurred; this effectively treats the car as a circular object and was a conscious and conservative choice. The term involving l_d penalizes deviation from the target path, the term involving q̇ penalizes rapid changes in steering, the constant b_4 penalizes a collision, and the term involving c_d and c̄ enhances the repulsion of the car away from the obstacle during learning. Figure 4 is a visualization aid for R(s), which has discrete regimes and is clearly non-linear. Note in particular the expected parabolic appearance of the b_1 contribution, and the half cylinder and the slight half cone atop it in (b), attributable to the b_4 and b_3 terms respectively. The weighting parameters b_1 to b_4 were chosen experimentally.

Figure 4. Cost function visualization: (a) no obstacle, (b) collision

Note that at this point we have also fully specified the MDP in the context of our trajectory-following problem, having defined the tuple (S, A, {P_sa}, γ, R).

4.7 Optimization

Referring to Figure 3, in each iteration of PSDP we have m sample state trajectories, each containing D points. For a DP backup at index d:

1. We start with s_d on the first sample trajectory, note the random number seed, and take action a_1 ∈ A.
2. Compute s_{d+1}, ..., s_{d+h-1} (h as defined in 4.5) and the corresponding costs R(s) and V_{a_1}(s_d) (the subscript a_1 indicates unique correspondence to action a_1).
3. Re-seed the random number generator with the value noted in step 1, take action a_2 ∈ A, and repeat step 2.
4. Repeat steps 2 and 3 for actions a_3 to a_10.
5. Repeat steps 1 to 4 for sample trajectories 2 to m.

At the end of these 5 steps, we have a record of how V(s) varies with the action at index d for each of the m samples. We then use a coordinate descent algorithm with dynamic resolution to find the θ_d that minimizes the expected value, i.e. we solve

θ_d = arg min_{θ_d : π_d(θ_d) ∈ Π} E_{s ~ μ_d}[ V_{π_d(θ_d), ..., π_{d+h}}(s) ]

To expedite the numeric search, the starting point is chosen to be the least squares estimate θ_init = (B^T B)^{-1} B^T b, where b is the vector of value-minimizing steering angles specific to each trajectory and B is a matrix containing the feature values associated with each trajectory.
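As a direct transcription of the piecewise cost defined in 4.6 (a sketch: the weights b_1 to b_4 and the clearance c̄ are illustrative placeholders, and the collision branch follows the linear "cone" form given in the equation above):

```python
def cost(l_d, q_dot, c_d, c_bar=20.0, b1=1.0, b2=0.1, b3=5.0, b4=100.0):
    """Piecewise cost R(s_d): tracking and steering-rate terms, plus collision terms.
    l_d: cross-track error, q_dot: steering-angle rate, c_d: distance to the obstacle,
    c_bar: desired minimum clearance (collision if c_d < c_bar)."""
    if c_d >= c_bar:                                            # no collision
        return b1 * l_d ** 2 + b2 * q_dot ** 2
    return b2 * q_dot ** 2 + b3 * (c_bar - c_d) / c_bar + b4    # half cylinder plus cone
```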

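Below is a sketch of one DP backup at index d, following the five steps of 4.7. The simulator and coordinate-descent routines are assumed interfaces, but the re-seeding of the random number generator before each candidate action (so that every action sees the same noise) and the least-squares warm start θ_init = (B^T B)^{-1} B^T b are as described above.

```python
import numpy as np

def dp_backup(d, sample_states, actions, simulate_h_steps, features, coordinate_descent):
    """One PSDP backup at space index d (Section 4.7).

    simulate_h_steps(state, action, seed) -> h-step cost-to-go for that action (steps 2-4)
    coordinate_descent(objective, theta0) -> locally optimal theta_d (dynamic resolution)
    """
    acts = np.asarray(actions, dtype=float)
    best_actions, feat_rows, value_rows = [], [], []

    for s in sample_states:                             # steps 1 and 5: each of the m samples
        seed = np.random.randint(2 ** 31 - 1)           # note the seed once per sample
        values = [simulate_h_steps(s, a, seed) for a in acts]   # same noise for every action
        value_rows.append(values)
        best_actions.append(acts[int(np.argmin(values))])
        feat_rows.append(features(s))

    B = np.asarray(feat_rows)                           # m x 4 matrix of feature values
    b = np.asarray(best_actions)                        # value-minimizing steering per sample
    theta_init, *_ = np.linalg.lstsq(B, b, rcond=None)  # theta_init = (B^T B)^-1 B^T b

    def objective(theta):                               # empirical E_{s~mu_d}[V(s)] under theta
        chosen = [int(np.argmin(np.abs(acts - float(theta @ f)))) for f in feat_rows]
        return float(np.mean([row[i] for row, i in zip(value_rows, chosen)]))

    return coordinate_descent(objective, theta_init)
```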
This is based on the observation that the least squares fit of θ_d to the individual steering actions that minimize the individual V(s) is often close to the solution that minimizes the aggregate.

5. Results and discussion

5.1 Performance

The learned controllers performed well in both simulation and on the physical system. Example trajectories from both are shown in Figure 5. To quantitatively evaluate the performance of a learned controller we use both the cost function and the obstacle collision rate. The controller shown in Figure 5 had a collision rate of 3.3%. The target and actual trajectories are shown in blue and black respectively. Obstacles are shown as double red circles. The inner circle denotes the closest the physical car could pass by the obstacle without collision if perfectly aligned (exactly tangent to the obstacle). The outer circle denotes the farthest the car could be and still manage to hit the obstacle given worst-case alignment (exactly orthogonal to the obstacle). Note that the controller performs well despite the obvious difference in sampling rates between the actual and simulated systems.

Figure 5. Simulated and physical trajectories

Circular trajectories were used to test the actual car system due to physical limitations in our tracking system. Similar performance was observed in simulation for both straight and oval trajectories, and we expect our system to perform well over a large variety of trajectories.

5.2 Choosing a policy class and cost function

Choosing the appropriate policy class and cost function was non-trivial and required a series of tests to choose between several alternatives. Figure 6 shows the results of one such test comparing two cost functions: one with an accurate, bounding-box collision test and one with a conservatively inaccurate bounding-circle collision test. In this case the latter's performance far exceeded the former's.

Figure 6. Bounding box vs. bounding circle collision test (collision rate (%) and cost plotted against iterations)

Similar tests were used to choose the specific feature set and cost function described above. Among other things we learned that the standard integral term was unnecessary for our learning task. We also found it important to use a cost function with no flat areas.

5.3 Conclusions

We have implemented a system that demonstrates the feasibility and effectiveness of employing PSDP in a state-indexed scheme to solve a reinforcement learning problem with a non-linear policy class and reward function. Our approach takes full advantage of PSDP's strengths while eliminating its dependence on accurate space-time correlation.

5.4 Next steps

The obvious next step is a detailed comparison of state-indexed PSDP with time-indexed PSDP. We are confident that such a comparison will clearly demonstrate the superiority of state-based PSDP. This should be especially clear over long or non-uniform trajectories.

Acknowledgments

The authors wish to acknowledge valuable inputs from Andrew Ng and Adam Coates during the course of this research project.

References

[1] Abbeel P., Coates A., Quigley M., and Ng A. Y. An Application of Reinforcement Learning to Aerobatic Helicopter Flight, in NIPS 19, 2007.
[2] Bagnell J. A., Kakade S., Ng A. Y., and Schneider J. Policy Search by Dynamic Programming, in NIPS 16, 2004.