Locally Weighted Learning for Control. Alexander Skoglund, Machine Learning Course, AASS, June 2005


Outline
- Locally Weighted Learning, Christopher G. Atkeson et al., Artificial Intelligence Review, 11:11-73, 1997
- Locally Weighted Learning for Control, Christopher G. Atkeson et al., Artificial Intelligence Review, 11:75-113, 1997
  - Temporally independent tasks
  - Nonlinear optimal control: a simulated puck (similar to the mountain-car problem)
- Scalable Techniques from Nonparametric Statistics for Real Time Robot Learning, Stefan Schaal et al., Applied Intelligence, 17:49-60, 2002 (if you want to implement it)

What is Local Learning?
"Local methods assign a weight to each training observation that regulates its influence on the training process. This weight depends upon the location of the training point in the input variable space relative to that of the point to be predicted. Training observations closer to the prediction point generally receive higher weights." - Jerome H. Friedman

Introduction
- Lazy learning: store the data and process it only when a query arrives. Also called memory-based learning.
- The relevance of a data point is measured by its distance to the query.
- Locally Weighted Learning (LWL) is suitable for real-time online robot learning because it offers fast incremental learning and avoids interference between old and new data (unlearning).

Alternative Techniques
Nonparametric regression, multilayer sigmoidal neural networks, radial basis function networks, regression trees, projection pursuit regression, and global regression techniques.

An Example: Distance Weighted Averaging
- A prediction is based on a weighted average of n training examples (x_i, y_i).
- The relevance d(x_i, q) of a data point x_i to the query q is given by a distance function, typically the Euclidean distance.
- The weighting function K, also called the kernel function, turns distances into weights.
- Two approaches for weighting: weight the data directly, or weight the error criterion.

Weighting Approaches
- Weight the data directly:
  y_hat(q) = [Σ_i y_i K(d(x_i, q))] / [Σ_i K(d(x_i, q))]
- Weight the error criterion:
  C(q) = Σ_{i=1..n} (y_hat(q) - y_i)^2 K(d(x_i, q))
  C(q) is minimized when ∂C(q)/∂y_hat(q) = 0, which yields the same weighted average as above.
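
A minimal NumPy sketch of the direct distance-weighting predictor above. The Gaussian kernel and the bandwidth h are illustrative assumptions; any kernel K over the distances would do.

import numpy as np

def distance_weighted_average(X, y, q, h=1.0):
    # Predict y at query q as the kernel-weighted average of the stored targets.
    # X: (n, d) training inputs, y: (n,) training targets, q: (d,) query point.
    d = np.linalg.norm(X - q, axis=1)      # distances d(x_i, q), Euclidean here
    w = np.exp(-(d / h) ** 2)              # kernel weights K(d(x_i, q)), Gaussian assumed
    return np.sum(w * y) / np.sum(w)       # y_hat(q) = sum_i y_i K_i / sum_i K_i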

Locally Weighted Regression
- Consider a global model with parameters β trained to minimize a cost over all data:
  C = Σ_i L(f(x_i, β), y_i)
- For a global model we usually minimize the squared error. If that fails we can either use a more complex global model, or fit a simple local model to the region of interest.
- We are interested in the latter...

A Linear Global Model
A global model that is linear in the parameter vector β can be written as (each row of X is one data point, the number of columns is the input dimensionality):
  X β = y
β can be solved for via the normal equations:
  X^T X β = X^T y
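
A small sketch of fitting such a global linear model with the normal equations; the synthetic data and noise level are just placeholders.

import numpy as np

rng = np.random.default_rng(0)
n, dim = 50, 2
X = np.column_stack([rng.uniform(-1, 1, size=(n, dim)), np.ones(n)])  # constant column for the offset
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.05 * rng.normal(size=n)

# Normal equations X^T X beta = X^T y (np.linalg.lstsq is the numerically safer route).
beta = np.linalg.solve(X.T @ X, X.T @ y)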

The Weights as a Physical Interpretation
The weights can be pictured as springs connecting each data point to the fitted surface. In the left figure the springs are equally strong for nearby and far points; in the right figure close points have stronger springs than far points.

Direct Data Weighting
- For simplicity, center the data on the query point q: x'_i = [(x_i - q)^T 1]^T
- For each data point, compute its distance to the query point and the corresponding weight w_i = K(d(x_i, q)).
- Weight the inputs: z_i = w_i x'_i, or in matrix form Z = WX.
- Weight the outputs: v_i = w_i y_i, or in matrix form v = Wy.
- Solve for β using these new variables: Z^T Z β = Z^T v
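
A sketch of the whole recipe above for a single query point, assuming a Gaussian kernel; the prediction at q is the constant coefficient of the centered local model.

import numpy as np

def lwr_predict(X, y, q, h=1.0):
    # Locally weighted linear regression at query q.
    n = X.shape[0]
    Xc = np.column_stack([X - q, np.ones(n)])    # x'_i = [(x_i - q)^T  1]^T
    dist = np.linalg.norm(X - q, axis=1)
    w = np.exp(-(dist / h) ** 2)                 # w_i = K(d(x_i, q)); some texts use sqrt(K) here
    Z = w[:, None] * Xc                          # Z = W X
    v = w * y                                    # v = W y
    beta = np.linalg.lstsq(Z, v, rcond=None)[0]  # solves Z^T Z beta = Z^T v
    return beta[-1]                              # local model evaluated at the (centered) query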

Distance Functions
Where the distance function is defined:
- Global d(): the same for all data.
- Query-based local d(): set by the query point.
- Point-based local d(): a different d() for each stored point (chosen in advance, not dependent on the query).
The distance function scales the importance of each input dimension:
- If all scaling factors are non-zero, this may lead to better predictions.
- If some dimensions are scaled to zero, the model becomes global in those dimensions.

Distance Functions for Continuous Inputs
Different distance functions (see the sketch below):
- Unweighted Euclidean distance
- Diagonally weighted Euclidean distance
- Fully weighted Euclidean distance (the weighting is represented by a matrix M)
- Unweighted Lp norm (Minkowski metric)
- Diagonally weighted and fully weighted Lp norms
For symbolic inputs, see Atkeson 1996.
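
Possible implementations of the three Euclidean variants; the scaling vector m and matrix M are whatever the distance-tuning step provides.

import numpy as np

def unweighted_euclidean(x, q):
    return np.sqrt(np.sum((x - q) ** 2))

def diagonally_weighted_euclidean(x, q, m):
    # m: per-dimension scaling factors; a zero entry makes the model global in that dimension
    return np.sqrt(np.sum((m * (x - q)) ** 2))

def fully_weighted_euclidean(x, q, M):
    # full matrix M: d(x, q) = || M (x - q) ||
    diff = M @ (x - q)
    return np.sqrt(diff @ diff)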

Smoothing Parameters
The smoothing parameter (bandwidth), represented by h, defines the scale over which generalization is performed. The different selection types are:
- Fixed bandwidth (large variations; undefined if there is no data)
- Nearest-neighbor bandwidth
- Global bandwidth
- Query-based local bandwidth
- Point-based local bandwidth, preferred over query-based because it can handle rapid asymmetric changes in the data

Weighting Functions
- K(d) represents the weighting; the kernel weighting function can take many different shapes.
- There is no evidence that the choice of kernel shape is critical.
- Watch out for regions with no data!

Ridge Regression
If a query point lies in a region with only a small number of nearby data points, the β parameters cannot be solved for. In that case, use ridge regression, which can be viewed as adding fake data to regions with no or insufficient data.
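
A minimal sketch of ridge regression for the local fit: adding a small constant lam to the diagonal keeps Z^T Z invertible when there is little data near the query. The value of lam is an assumption to be tuned.

import numpy as np

def ridge_solve(Z, v, lam=1e-3):
    # Solve (Z^T Z + lam * I) beta = Z^T v; the lam * I term acts like fake
    # data points that pull beta toward zero where real data is missing.
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ v)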

Dimensionality Reduction
To identify dimensions with no or little data one can use:
- Locally weighted PCA to identify manifolds in the data (SVD of the Z matrix)
- SVD where small singular values (of the inverse of Z^T Z) are set to zero, which eliminates those directions (see the sketch below)
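
One way to implement the second idea, as a sketch: solve the weighted system through an SVD of Z and drop the directions whose singular values are (nearly) zero. The tolerance is an assumption.

import numpy as np

def svd_solve(Z, v, tol=1e-6):
    # Solve Z beta ~= v with a truncated SVD: directions of the weighted input
    # space that carry (almost) no data have tiny singular values and are dropped.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_inv = np.where(s > tol * s.max(), 1.0 / s, 0.0)   # zero out weak directions
    return Vt.T @ (s_inv * (U.T @ v))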

Predictions
- It is possible to estimate the prediction error.
- The variance is estimated as the sum of squared errors divided by the sum of squared (modified) weights.
- Bias = E(y_hat(q)) - y_true(q)
- Leave-one-out cross-validation is easy to perform in memory-based methods (in contrast to methods with a separate training stage): pretend that data point x_i was never seen and compute how well the remaining data points predict y_i.
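
A sketch of leave-one-out cross-validation for a memory-based predictor such as the hypothetical lwr_predict above: no retraining is needed, each point is simply withheld from the stored data when it is predicted.

import numpy as np

def loo_cv_error(X, y, predict, **kwargs):
    # predict(X_train, y_train, q, **kwargs) returns a scalar prediction at q.
    n = len(y)
    errors = []
    for i in range(n):
        mask = np.arange(n) != i                      # pretend x_i was never seen
        y_hat = predict(X[mask], y[mask], X[i], **kwargs)
        errors.append((y_hat - y[i]) ** 2)
    return float(np.mean(errors))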

Identifying Outliers
- Global case: attach a weight to each data point based on cross-validation over all the data.
- Local case: at query time, generate the weight by cross-validation of nearby points.
- Robust regression combines this weight u_i with the kernel: w_i = u_i K(d(x_i, q)).

Tuning
Parameters to tune:
- Bandwidth (smoothing parameter) h
- Distance metric d()
- Weighting (kernel) function K()
- In addition: the ridge regression parameters and the outlier thresholds
Levels at which to tune:
- Global (minimize cross-validation errors; see the sketch below)
- Query-based local
- Point-based local (different for each point)
How to tune:
- Plug-in approach
- Optimization
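
A sketch of the global tuning option for the bandwidth, reusing the hypothetical lwr_predict and loo_cv_error sketches above; the candidate grid is arbitrary, and X, y are the stored training data.

candidate_h = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
best_h = min(candidate_h, key=lambda h: loo_cv_error(X, y, lwr_predict, h=h))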

Interference
This is the most important motivation for using LWL! To illustrate the problem, we compare an ANN with LWL on learning the inverse dynamics of a two-link arm.

Key Issues
Is LWL a better function approximator than, say, an RBF or sigmoidal neural network? No, but it avoids negative interference with old data! LWL forms the model only after the query is given.

Temporally Independent Tasks
- y = f(x, u) + noise, where u is the action, x the observed state, and y the outcome.
- f() is not known; we learn a model f'() of it.
- The inverse model is u = f^-1(x, y).
- During training, each training sample is stored in a database: tuples <x(i), y(i) -> u(i)> for the inverse model, and <x(i), u(i) -> y(i)> for the forward model (see the sketch below).
- A query can then be matched against the database by nearest neighbor, interpolation, etc.
- The forward model y = f'(x, u) can be used to perform a mental simulation.
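
A sketch of the memory-based inverse model: each experience is stored as a <x, y -> u> tuple, and a query (x, desired y) is answered by the nearest stored experience (locally weighted regression over the neighbors would be the smoother variant). The class and its names are hypothetical.

import numpy as np

class InverseModelMemory:
    def __init__(self):
        self.keys, self.actions = [], []

    def add(self, x, y, u):
        # store the experience < x(i), y(i) -> u(i) >
        self.keys.append(np.concatenate([np.atleast_1d(x), np.atleast_1d(y)]))
        self.actions.append(u)

    def query(self, x, y_desired):
        # nearest-neighbor lookup of the action for the desired outcome
        k = np.concatenate([np.atleast_1d(x), np.atleast_1d(y_desired)])
        d = np.linalg.norm(np.asarray(self.keys) - k, axis=1)
        return self.actions[int(np.argmin(d))]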

Forward Model
To choose an action we must search over actions:
- Grid search (sketched below)
- Random search
- First-order gradient search
- Second-order gradient search
The inverse model can be used to give the forward-model search a starting point.
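
A sketch of the simplest of these, grid search: evaluate a learned forward model f'(x, u) on a grid of candidate actions and keep the action whose predicted outcome is closest to the desired one. f_prime, y_goal, and the candidate grid are assumptions.

import numpy as np

def choose_action_grid(f_prime, x, y_goal, u_candidates):
    # f_prime(x, u) is the learned forward model y = f'(x, u)
    errors = [np.linalg.norm(np.atleast_1d(f_prime(x, u)) - y_goal)
              for u in u_candidates]
    return u_candidates[int(np.argmin(errors))]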

Billiard
- A small pool table with a spring-actuated cue.
- The only sensors are two cameras: one looking along the cue, the other giving a top view of the pool table.
- Each shot proceeds as follows:
  - Every shot starts from the same position, i.e., the cue ball is placed at the same position every time.
  - The ball we want to sink is placed randomly; its position (x, y) is the state.
  - The inverse model, followed by a search over the forward model, is used to find the action u predicted to sink the ball.

Billiard, 2
- The outcome is the cushion, and the position on that cushion, where the ball makes its first hit.
- The memory base (database) is updated with each new observation.
- Both the forward and the inverse model use locally weighted regression with outlier removal and cross-validation for choosing the kernel width.
- The forward and inverse models were used together.
- To work, the action requires 1% (?) accuracy or better, which LWR provides.

Billiard, 3
The cue swivels until the centroid of the object ball (shown by the vertical line) coincides with the chosen action (the cross).

Performance
- With fewer than 100 trials, a 75% success rate is achieved (equal to an MIT billiards champion); some shots are virtually impossible.
- One reason for the success: the data distribution is non-uniform, since the training data cluster around the state-action pairs that succeeded. Little exploration was done.

Nonlinear Optimal Control
- We need a global model (control law) built from many local models.
- Given a cost function g(t) = G(x(t), u(t), t), the task is to minimize the total cost.
- This gives a delayed-reward problem, since the action taken at time t affects all subsequent states. Reinforcement learning!

Example 2: The Puck
- Objective: drive the puck up the hill to the goal in the minimum number of time steps.
- The state is the position x and the velocity x'.
- Constraint: the maximum force is -4 <= a <= 4, where a is the action.
- No friction!

Example 2: The optimal solution

Example 2: Reinforcement Learning
- The state space is discretized into a 60x60 grid for the forward model and the value function.
- The actions are discretized into 5 levels: -4, -2, 0, 2, 4 N.
- All transitions are remembered.
- Exploration is achieved by giving unvisited states a cost of zero (visited bad states are given negative values).

Example 2: Reinforcement Learning, Continued
- Same setup as before, except that transitions between cells are filled in by predictions from the LWR forward model. This means that unvisited transitions are extrapolated from actual experience.
- The value function is a table in both cases.
- Much faster than before in terms of trials.
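
A tabular value-iteration sketch consistent with both variants above: the value function is a 60x60 table, the actions are the 5 force levels, and next_state / cost stand in for either the remembered transitions or the LWR forward-model predictions (both are assumptions here).

import numpy as np

N_POS, N_VEL = 60, 60                    # discretized position x velocity grid
ACTIONS = [-4.0, -2.0, 0.0, 2.0, 4.0]    # discretized force levels (N)

def value_iteration(next_state, cost, goal, n_iters=200):
    # next_state((i, j), a) -> (i', j') and cost((i, j), a) -> scalar are supplied
    # by the remembered transitions or by the LWR forward model.
    V = np.zeros((N_POS, N_VEL))
    for _ in range(n_iters):
        V_new = np.empty_like(V)
        for i in range(N_POS):
            for j in range(N_VEL):
                s = (i, j)
                if s == goal:
                    V_new[s] = 0.0       # no further cost once the goal is reached
                else:
                    V_new[s] = min(cost(s, a) + V[next_state(s, a)] for a in ACTIONS)
        V = V_new
    return V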