Radial Basis Function Networks: Algorithms


Neural Computation: Lecture 14
John A. Bullinaria, 2015

1. The RBF Mapping
2. The RBF Network Architecture
3. Computational Power of RBF Networks
4. Training an RBF Network
5. Unsupervised Optimization of the Basis Functions
6. Computing the Output Weights
7. Supervised RBF Network Training

The Radial Basis Function (RBF) Mapping

We are working in the standard regression framework of function approximation, with a set of N training data points in a D-dimensional input space, such that each input vector $\mathbf{x}^p = \{x_i^p : i = 1, \dots, D\}$ has a corresponding K-dimensional target output $\mathbf{t}^p = \{t_k^p : k = 1, \dots, K\}$. The target outputs will generally be generated by some underlying functions $g_k(\mathbf{x})$ plus random noise. The goal is to approximate the $g_k(\mathbf{x})$ with functions $y_k(\mathbf{x})$ of the form

$$y_k(\mathbf{x}) = \sum_{j=0}^{M} w_{kj}\,\phi_j(\mathbf{x})$$

We shall concentrate on the case of Gaussian basis functions

$$\phi_j(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x}-\boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$$

which have centres $\{\boldsymbol{\mu}_j\}$ and widths $\{\sigma_j\}$. Naturally, the way to proceed is to develop a process for finding the appropriate values for $M$, $\{w_{kj}\}$, $\{\mu_{ij}\}$ and $\{\sigma_j\}$.
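As a concrete illustration of this mapping, the following minimal Python/NumPy sketch evaluates the Gaussian basis functions and the linear output layer. The function and variable names (rbf_design_matrix, rbf_predict, centres, widths, W) are illustrative, not from the lecture.

```python
import numpy as np

def rbf_design_matrix(X, centres, widths):
    """Gaussian basis activations phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma_j^2)).

    X:       (N, D) array of input vectors x^p
    centres: (M, D) array of centres mu_j
    widths:  (M,)   array of widths sigma_j
    Returns an (N, M+1) matrix whose first column is the bias basis phi_0(x) = 1.
    """
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)   # (N, M)
    Phi = np.exp(-sq_dists / (2.0 * widths ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def rbf_predict(X, centres, widths, W):
    """Network outputs y_k(x) = sum_{j=0}^{M} w_kj phi_j(x); W has shape (K, M+1)."""
    return rbf_design_matrix(X, centres, widths) @ W.T
```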

The RBF Network Architecture

The RBF mapping can be cast into a form that resembles a neural network. [Diagram: D input units $x_i$ feed M hidden basis-function units $\phi_j(\mathbf{x}, \boldsymbol{\mu}_j, \sigma_j)$ through first-layer weights $\mu_{ij}$; the hidden units feed K output units $y_k$ through second-layer weights $w_{kj}$.]

The hidden-to-output part operates like a standard feed-forward MLP network, with the sum of the weighted hidden unit activations giving the output unit activations. The hidden unit activations are given by the basis functions $\phi_j(\mathbf{x}, \boldsymbol{\mu}_j, \sigma_j)$, which depend on the weights $\{\mu_{ij}, \sigma_j\}$ and input activations $\{x_i\}$ in a non-standard manner.

Computational Power of RBF Networks

Intuitively, it is not difficult to understand why linear superpositions of localised basis functions are capable of universal approximation. More formally:

Hartman, Keeler & Kowalski (1990, Neural Computation, vol. 2, pp. 210-215) provided a formal proof of this property for networks with Gaussian basis functions in which the widths $\{\sigma_j\}$ are treated as adjustable parameters.

Park & Sandberg (1991, Neural Computation, vol. 3, pp. 246-257; and 1993, Neural Computation, vol. 5, pp. 305-316) showed that with only mild restrictions on the basis functions, the universal function approximation property still holds.

As with the corresponding proofs for MLPs, these are existence proofs which rely on the availability of an arbitrarily large number of hidden units (i.e. basis functions). However, they do provide a theoretical foundation on which practical applications can be based with confidence.

Training RBF Networks

The proofs about computational power tell us what an RBF network can do, but nothing about how to find values for all its parameters/weights $\{M, w_{kj}, \boldsymbol{\mu}_j, \sigma_j\}$.

Unlike in MLPs, the hidden and output layers in RBF networks operate in very different ways, and the corresponding weights have very different meanings and properties. It is therefore appropriate to use different learning algorithms for them.

The input-to-hidden weights (i.e., the basis function parameters $\{\mu_{ij}, \sigma_j\}$) can be trained (or set) using any one of several possible unsupervised learning techniques. Then, after the input-to-hidden weights are found, they are kept fixed while the hidden-to-output weights are learned. This second stage of training only involves a single layer of weights $\{w_{kj}\}$ and linear output activation functions, and we have already seen how the necessary weights can be found very quickly and easily using simple matrix pseudo-inversion, as in Single Layer Regression Networks or Extreme Learning Machines.

Basis Function Optimization

One major advantage of RBF networks is the possibility of determining suitable hidden unit/basis function parameters without having to perform a full non-linear optimization of the whole network. We shall now look at three ways of doing this:

1. Fixed centres selected at random
2. Clustering based approaches
3. Orthogonal Least Squares

These are all unsupervised techniques, which will be particularly useful in situations where labelled data is in short supply, but there is plenty of unlabelled data (i.e. inputs without output targets). Later we shall look at how one might try to get better results by performing a full supervised non-linear optimization of the network instead.

With all these approaches, determining a good value for M remains a problem. It will generally be appropriate to compare the results for a range of different values, following the same kind of validation/cross-validation methodology used for optimizing MLPs.

Fixed Centres Selected At Random

The simplest and quickest approach to setting the RBF parameters is to fix the centres at M points selected at random from the N data points, and to set all the widths to be equal and fixed at an appropriate size for the distribution of data points. Specifically, we can use normalised RBFs centred at $\{\boldsymbol{\mu}_j\}$ defined by

$$\phi_j(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x}-\boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$$

where $\{\boldsymbol{\mu}_j\} \subset \{\mathbf{x}^p\}$ and the $\sigma_j$ are all related in the same way to the maximum or average distance between the chosen centres $\boldsymbol{\mu}_j$. Common choices are

$$\sigma_j = \frac{d_{\max}}{\sqrt{2M}} \qquad \text{or} \qquad \sigma_j = 2\,d_{\mathrm{ave}}$$

which ensure that the individual RBFs are neither too wide nor too narrow for the given training data. For large training sets, this approach gives reasonable results.
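A minimal Python/NumPy sketch of this procedure, assuming the shared-width heuristics quoted above (the function name random_centres_and_widths and its arguments are illustrative):

```python
import numpy as np

def random_centres_and_widths(X, M, width_rule="dmax", rng=None):
    """Pick M centres at random from the N training inputs X (shape (N, D)) and
    set a single shared width for every basis function.

    width_rule "dmax": sigma = d_max / sqrt(2M), d_max = max distance between centres
    width_rule "dave": sigma = 2 * d_ave,        d_ave = mean distance between centres
    """
    rng = np.random.default_rng(rng)
    centres = X[rng.choice(X.shape[0], size=M, replace=False)]
    dists = np.linalg.norm(centres[:, None, :] - centres[None, :, :], axis=2)
    if width_rule == "dmax":
        sigma = dists.max() / np.sqrt(2.0 * M)
    else:
        iu = np.triu_indices(M, k=1)              # distinct pairs of centres only
        sigma = 2.0 * dists[iu].mean()
    return centres, np.full(M, sigma)
```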

Examples 1-4 illustrate the effect of the number of basis functions M and the width σ (figures from C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995):

Example 1: M = N, σ = 2 d_ave
Example 2: M << N, σ << 2 d_ave
Example 3: M << N, σ >> 2 d_ave
Example 4: M << N, σ = 2 d_ave

Clustering Based Approaches

An obvious problem with picking the RBF centres to be a random subset of the data points is that the data points may not be evenly distributed throughout the input space, so relying on randomly chosen points to be the best subset is a risky strategy. Rather than picking random data points, a better approach is to use a principled clustering technique to find a set of RBF centres that more accurately reflects the distribution of the data points. We shall consider two such approaches:

1. K-Means Clustering, which we shall look at now.
2. Self Organizing Maps, which we shall look at next week.

Both approaches identify subsets of neighbouring data points and use them to partition the input space, and an RBF centre can then be placed at the centre of each partition/cluster. Once the RBF centres have been determined in this way, each RBF width can be set according to the variance of the points in the corresponding cluster.

K-Means Clustering Algorithm

The K-means clustering algorithm starts by picking the number K of centres and randomly assigning the data points $\{\mathbf{x}^p\}$ to K subsets. It then uses a simple re-estimation procedure to end up with a partition of the data points into K disjoint subsets or clusters $S_j$ containing $N_j$ data points that minimizes the sum-squared clustering function

$$J = \sum_{j=1}^{K} \sum_{p \in S_j} \left\| \mathbf{x}^p - \boldsymbol{\mu}_j \right\|^2$$

where $\boldsymbol{\mu}_j$ is the mean/centroid of the data points in set $S_j$, given by

$$\boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{p \in S_j} \mathbf{x}^p$$

It does that by iteratively finding the nearest mean $\boldsymbol{\mu}_j$ to each data point $\mathbf{x}^p$, reassigning the data points to the associated clusters $S_j$, and then recomputing the cluster means $\boldsymbol{\mu}_j$. The clustering process terminates when no more data points switch from one cluster to another. Multiple runs can be carried out to find the local minimum with the lowest J.
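A minimal K-means sketch for placing the RBF centres, in Python/NumPy. It initialises the centres from randomly chosen data points rather than from a random partition (both are common), and sets each width from the spread of the corresponding cluster as suggested above; all names are illustrative.

```python
import numpy as np

def kmeans_rbf_centres(X, K, n_iter=100, rng=None):
    """Return (centres, widths) for an RBF network: K-means centres plus widths
    set to the standard deviation of the points assigned to each cluster."""
    rng = np.random.default_rng(rng)
    centres = X[rng.choice(X.shape[0], size=K, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: nearest centre for every data point
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: recompute each cluster mean (keep old centre if cluster empty)
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(K)])
        if np.allclose(new_centres, centres):   # no centre moved => assignments stable
            break
        centres = new_centres
    widths = np.array([X[labels == j].std() if np.any(labels == j) else 1.0
                       for j in range(K)])
    return centres, widths
```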

Orthogonal Least Squares

Another principled approach for selecting a subset of data points as the basis function centres is based on the technique of orthogonal least squares. This involves the sequential addition of new basis functions, each centred on one of the data points. If we already have L such basis functions, there remain N - L possibilities for the next, and we can determine the output weights for each of those N - L potential networks. The (L+1)th basis function which leaves the smallest residual sum-squared output error is then chosen, and we go on to choose the next basis function in the same way.

This sounds wasteful, but if we construct a set of orthogonal vectors in the space S spanned by the vectors of hidden unit activations for each pattern in the training set, we can calculate directly which data point should be chosen as the next basis function at each stage, and the output layer weights can be determined at the same time. To get good generalization, we can use validation/cross-validation to stop the process when an appropriate number of data points have been selected as centres.
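To make the selection criterion concrete, here is a deliberately naive Python/NumPy sketch of the greedy forward selection: at each stage it re-solves the least-squares problem for every remaining candidate centre and keeps the one with the smallest residual error. The real OLS algorithm reaches the same choices far more cheaply via orthogonalisation; the shared width sigma and all names here are illustrative.

```python
import numpy as np

def greedy_centre_selection(X, T, sigma, n_centres):
    """Naive forward selection of RBF centres by residual sum-squared error.
    X: (N, D) inputs, T: (N, K) targets, sigma: shared Gaussian width."""
    chosen, remaining = [], list(range(X.shape[0]))
    for _ in range(n_centres):
        best_err, best_p = np.inf, None
        for p in remaining:
            centres = X[chosen + [p]]
            Phi = np.exp(-((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
                         / (2.0 * sigma ** 2))
            W, *_ = np.linalg.lstsq(Phi, T, rcond=None)   # output weights for this trial
            err = ((Phi @ W - T) ** 2).sum()
            if err < best_err:
                best_err, best_p = err, p
        chosen.append(best_p)
        remaining.remove(best_p)
    return X[chosen]
```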

Dealing with the Output Layer

Since the hidden unit activations $\phi_j(\mathbf{x}, \boldsymbol{\mu}_j, \sigma_j)$ are fixed while the output weights $\{w_{kj}\}$ are determined, we essentially only have to find the weights that optimize a single-layer linear network. As with MLPs, we can define a sum-squared output error measure

$$E = \frac{1}{2} \sum_p \sum_k \left( t_k^p - y_k(\mathbf{x}^p) \right)^2$$

and here the outputs are a simple linear combination of the hidden unit activations, i.e.

$$y_k(\mathbf{x}^p) = \sum_{j=0}^{M} w_{kj}\,\phi_j(\mathbf{x}^p)$$

At the minimum of E the gradients with respect to all the weights $w_{ki}$ will be zero, so

$$\frac{\partial E}{\partial w_{ki}} = -\sum_p \left( t_k^p - \sum_{j=0}^{M} w_{kj}\,\phi_j(\mathbf{x}^p) \right) \phi_i(\mathbf{x}^p) = 0$$

and linear equations like this are well known to be easy to solve analytically.

Computing the Output Weights

The equations for the weights are most conveniently written in matrix form by defining matrices with components $(\mathbf{W})_{kj} = w_{kj}$, $(\boldsymbol{\Phi})_{pj} = \phi_j(\mathbf{x}^p)$, and $(\mathbf{T})_{pk} = t_k^p$. This gives

$$\boldsymbol{\Phi}^{\mathrm{T}} \left( \mathbf{T} - \boldsymbol{\Phi} \mathbf{W}^{\mathrm{T}} \right) = 0$$

and hence the formal solution for the weights is

$$\mathbf{W}^{\mathrm{T}} = \left( \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{T} = \boldsymbol{\Phi}^{\dagger} \mathbf{T}$$

This involves the standard pseudo-inverse of $\boldsymbol{\Phi}$, defined as

$$\boldsymbol{\Phi}^{\dagger} \equiv \left( \boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^{\mathrm{T}}$$

which can be seen to have the property $\boldsymbol{\Phi}^{\dagger} \boldsymbol{\Phi} = \mathbf{I}$. Thus the network weights can be computed by fast linear matrix inversion techniques. In practice, it is normally best to use Singular Value Decomposition (SVD) techniques that can avoid problems due to possible ill-conditioning of $\boldsymbol{\Phi}$, i.e. $\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}$ being singular or near singular.
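In Python/NumPy this whole stage reduces to a couple of lines; a minimal sketch (names illustrative) using NumPy's SVD-based pseudo-inverse, which copes with an ill-conditioned design matrix:

```python
import numpy as np

def solve_output_weights(Phi, T):
    """Least-squares output weights for fixed basis functions.

    Phi: (N, M+1) design matrix with (Phi)_pj = phi_j(x^p)
    T:   (N, K)   target matrix with (T)_pk = t_k^p
    Returns W with shape (K, M+1), so that predictions are Phi @ W.T.
    np.linalg.pinv computes the pseudo-inverse via SVD.
    """
    W_T = np.linalg.pinv(Phi) @ T    # (M+1, K)
    return W_T.T
```

Together with the rbf_design_matrix sketch from earlier, Phi = rbf_design_matrix(X, centres, widths) followed by W = solve_output_weights(Phi, T) completes the two-stage training.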

Supervised RBF Network Training

Supervised training of the basis function parameters may lead to better results than the above unsupervised procedures, but the computational costs are usually enormous. The obvious approach would be to perform gradient descent on a sum-squared output error function, as we did for MLPs. That would give the error function

$$E = \sum_p \sum_k \left( t_k^p - y_k(\mathbf{x}^p) \right)^2 = \sum_p \sum_k \left( t_k^p - \sum_{j=0}^{M} w_{kj}\,\phi_j(\mathbf{x}^p, \boldsymbol{\mu}_j, \sigma_j) \right)^2$$

and one could iteratively update the weights/basis function parameters using

$$\Delta w_{kj} = -\eta_w \frac{\partial E}{\partial w_{kj}} \qquad \Delta \mu_{ij} = -\eta_\mu \frac{\partial E}{\partial \mu_{ij}} \qquad \Delta \sigma_j = -\eta_\sigma \frac{\partial E}{\partial \sigma_j}$$

This involves all the problems of choosing the learning rates $\eta$, avoiding local minima, and so on, that arise when training MLPs by gradient descent. There is also a tendency for the basis function widths to grow large, leaving non-localised basis functions.
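A minimal full-batch gradient-descent step for all three parameter sets, in Python/NumPy (a sketch only: the bias basis function is omitted, and the learning rates are illustrative placeholders):

```python
import numpy as np

def supervised_rbf_step(X, T, centres, widths, W,
                        eta_w=1e-3, eta_mu=1e-3, eta_sigma=1e-3):
    """One batch gradient-descent step on E = sum_{p,k} (t_k^p - y_k(x^p))^2,
    updating output weights W (K, M), centres (M, D) and widths (M,) together.
    X is (N, D), T is (N, K)."""
    diff = X[:, None, :] - centres[None, :, :]                  # (N, M, D)
    sq = (diff ** 2).sum(axis=2)                                # (N, M) squared distances
    Phi = np.exp(-sq / (2.0 * widths ** 2))                     # basis activations
    err = Phi @ W.T - T                                         # (N, K) output errors
    # dE/dW, dE/dPhi, then chain rule through the Gaussian for centres and widths
    dW = 2.0 * err.T @ Phi                                      # (K, M)
    dPhi = 2.0 * err @ W                                        # (N, M)
    dCentres = ((dPhi * Phi / widths ** 2)[:, :, None] * diff).sum(axis=0)   # (M, D)
    dWidths = (dPhi * Phi * sq / widths ** 3).sum(axis=0)       # (M,)
    return W - eta_w * dW, centres - eta_mu * dCentres, widths - eta_sigma * dWidths
```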

Overview and Reading

1. We began by defining Radial Basis Function (RBF) mappings and the corresponding network architecture.
2. Then we considered the computational power of RBF networks.
3. We then saw how the two layers of network weights were rather different, and that different techniques were appropriate for training each of them.
4. We looked at three different unsupervised techniques for optimizing the basis functions, and then at how the corresponding output weights could be computed using fast linear matrix pseudo-inversion techniques.
5. We ended by looking at how supervised RBF training would work.

Reading

1. Bishop: Sections 5.2, 5.3, 5.9, 5.10, 3.4
2. Haykin-2009: Sections 5.4, 5.5, 5.6, 5.7