Extending reservoir computing with random static projections: a hybrid between extreme learning and RC


John Butcher 1, David Verstraeten 2, Benjamin Schrauwen 2, Charles Day 1 and Peter Haycock 1

1 - Institute for the Environment, Physical Sciences and Applied Mathematics (EPSAM), Keele University, Staffordshire, ST5 5BG, United Kingdom
2 - Department of Electronics and Information Systems, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium

SciSite Ltd and the UK EPSRC are gratefully acknowledged for supporting this research. The Department of Electronics and Information Systems of Ghent University is also gratefully acknowledged for its very kind and useful feedback.

Abstract. Reservoir Computing is a relatively new paradigm in the field of neural networks that has shown promise in applications where traditional recurrent neural networks have performed poorly. The main advantage of using reservoirs is that only the output weights are trained, reducing computational requirements significantly. There is a trade-off, however, between the amount of memory a reservoir can possess and its capability of mapping data into a highly non-linear transformation space. A new hybrid architecture, combining a reservoir with an extreme learning machine, is presented which overcomes this trade-off; its performance is demonstrated on a 4th-order polynomial modelling task and an isolated spoken digit recognition task.

1 Introduction

A recent addition to the field of Recurrent Artificial Neural Networks (RANNs) is Reservoir Computing (RC) [1], which is based on a randomly and sparsely connected reservoir of neurons combined with a linear readout. RC is a unification of several techniques such as Echo State Networks (ESNs), Liquid State Machines (LSMs) and the back-propagation decorrelation neural network (see [1] for more details on RC techniques). Recent studies using RC for predicting and classifying complex time-series data have shown that RC techniques considerably outperform conventional RANNs (see [2] for an overview). This paper reports on the use of RC for tasks that require both non-linearity and memory, and presents a novel customised RC architecture designed to better fulfil both of these requirements.

Standard reservoir architectures consist of three layers of neurons: an input layer, a recurrently connected reservoir layer and an output layer. Each neuron in a layer is randomly connected to every neuron in the next layer. The reservoir weights are globally scaled by the spectral radius, which is the largest absolute eigenvalue of the weight matrix. Usually a spectral radius value of less than, but close to, unity is used to create reservoirs with suitable dynamics [3]. Increasing the spectral radius leads to the spreading of activation distributions and the eventual saturation of activations towards +1 and -1. This results in a chaotic reservoir whose generalisation characteristics disappear. See [4] for a thorough overview of the effects of different parameters, including the spectral radius, on the dynamics of the reservoir.
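
As a concrete illustration of the scaling described above, the following sketch builds a sparse random reservoir weight matrix and rescales it so that its largest absolute eigenvalue equals a chosen spectral radius. This is a minimal NumPy sketch, not the authors' toolbox code; the reservoir size, sparsity level and random seed are arbitrary illustrative choices.

```python
import numpy as np

def make_reservoir_weights(n_neurons=150, sparsity=0.1, spectral_radius=0.9, seed=0):
    """Sparse random reservoir matrix rescaled to a target spectral radius.

    A sketch of the standard ESN recipe described in the text; the size,
    sparsity and seed are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_neurons, n_neurons))
    # Keep only a random fraction of the connections (sparse connectivity).
    W *= rng.random((n_neurons, n_neurons)) < sparsity
    # Rescale so the largest absolute eigenvalue equals spectral_radius.
    current_radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (spectral_radius / current_radius)
```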

The reservoirs presented in the remainder of this work are ESN-type networks, meaning that the reservoir consists of tanh neurons. The activations of the reservoir and output neurons at time t are given by Equations 1 and 2 respectively:

x(t) = f(W^{input}_{res} u(t) + W^{res}_{res} x(t-1))    (1)

y(t) = W^{res}_{out} (u(t), x(t))    (2)

where t is the current time step, t-1 is the previous time step, u(t) is the input activation at time t, x(t-1) is the vector of activations of the reservoir neurons at time t-1, and f is the activation function, which is tanh here. W^{a}_{b} denotes a weight matrix from a to b.

The main advantage of using RC is that only the reservoir-to-output connections W^{res}_{out} are modified during training, which can be done in a one-shot fashion using linear regression, whereas training traditional RANNs involves modification of all weights in the network. This makes RC techniques much faster to train, which is useful for real-world applications where fast, and sometimes real-time, training is required. Linear regression training also guarantees convergence, which is not the case with most other RANN learning rules, which are based on gradient-descent error minimisation algorithms. These two advantages combined overcome the main stumbling blocks to using RANNs within real-world engineering domains.

2 The need for non-linearity and memory

Reservoirs can be created to have long-term memory or to have highly non-linear mapping capabilities, depending on the task at hand. However, at present there is no parameter that allows these properties to be tuned independently. For instance, tuning the spectral radius affects the memory of the reservoir, but also its non-linear mapping. In order to overcome this problem, a new architecture is outlined which combines the memory capabilities of a reservoir with the non-linear separation capabilities of traditional ANNs.

2.1 R2SP: a Reservoir with Random Static Projections

The use of a feedforward network where only the output weights are trained was previously presented in [5] as the Extreme Learning Machine (ELM) and was later extended as the Optimally-Pruned Extreme Learning Machine (OP-ELM) [6]. Both techniques involve two layers of feedforward connections. The first layer's weights are chosen at random at network creation, while the second layer's weights are trained using simple linear regression. Therefore, as with RC, only the readout connections are trained. OP-ELM extends ELM further by pruning the least significant neurons from the hidden layer, which regularises the network and improves its generalisation capabilities.
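
To make Equations 1 and 2 and the one-shot readout training concrete, here is a minimal NumPy sketch (not the authors' code). It runs a tanh reservoir over an input sequence, collects the concatenated (u(t), x(t)) vectors as in Equation 2, and fits the output weights by ordinary least squares; the dimensions, the placeholder teacher signal and the crude reservoir scaling are all illustrative assumptions.

```python
import numpy as np

def run_reservoir(u, W_in, W_res):
    """Collect x(t) = tanh(W_in u(t) + W_res x(t-1)) for each time step (Eq. 1)."""
    x = np.zeros(W_res.shape[0])
    states = []
    for u_t in u:
        x = np.tanh(W_in @ u_t + W_res @ x)
        states.append(np.concatenate([u_t, x]))  # readout sees (u(t), x(t)), as in Eq. 2
    return np.array(states)

def train_readout(states, targets):
    """One-shot linear-regression fit of the output weights (Eq. 2)."""
    W_out, *_ = np.linalg.lstsq(states, targets, rcond=None)
    return W_out  # prediction: y(t) = states(t) @ W_out

# Illustrative usage on random data (all dimensions and the target are placeholders).
rng = np.random.default_rng(1)
u = rng.uniform(-1, 1, size=(1000, 1))             # input sequence, 1-dimensional
y = np.roll(u, 1, axis=0)                          # placeholder teacher signal
W_in = rng.uniform(-1, 1, size=(100, 1))
W_res = 0.1 * rng.uniform(-1, 1, size=(100, 100))  # crude stand-in for spectral-radius scaling
W_out = train_readout(run_reservoir(u, W_in, W_res), y)
```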

The novel architecture proposed here, named Reservoir with Random Static Projections (R2SP), contains a standard reservoir and two static hidden layers of non-recurrent neurons with no within-layer connections, combining the underlying principles of the ELM with RC. The first static layer receives the same inputs as the reservoir and has connections to the output layer, thus giving an instantaneous non-linear transformation of the data at each time step. The second static layer receives input from the reservoir, giving a non-linear mapping of the reservoir state at each time step. The outputs of the reservoir and the two static layers are combined and, as in RC and the ELM, only the weights connecting to the output nodes are trained, using linear regression. The activations of the newly added static layers are calculated as follows:

x_{sl1}(t) = f(W^{input}_{sl1} u(t))    (3)

x_{sl2}(t) = f(W^{res}_{sl2} x(t))    (4)

where x_{sl1}(t), x_{sl2}(t), W^{input}_{sl1} and W^{res}_{sl2} are the activations and input weight matrices of static layers 1 and 2 respectively. All other symbols are as in Equations 1 and 2. Figure 1 shows a schematic overview of the R2SP.

Fig. 1: Reservoir with Random Static Projections (R2SP). SL1 and SL2 are the numbers of neurons in the respective static layers.
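
The data flow of Equations 1 to 4 can be summarised in a few lines. The sketch below is a hypothetical NumPy reconstruction based only on the description above, not the authors' toolbox code; including u(t) in the feature vector fed to the readout follows Equation 2, but the exact concatenation is an assumption.

```python
import numpy as np

def r2sp_features(u, W_in_res, W_res_res, W_in_sl1, W_res_sl2):
    """Concatenated R2SP features at each time step; only the readout on top is trained.

    x(t)     = tanh(W_in_res u(t) + W_res_res x(t-1))   (Eq. 1)
    x_sl1(t) = tanh(W_in_sl1 u(t))                      (Eq. 3)
    x_sl2(t) = tanh(W_res_sl2 x(t))                     (Eq. 4)
    """
    x = np.zeros(W_res_res.shape[0])
    feats = []
    for u_t in u:
        x = np.tanh(W_in_res @ u_t + W_res_res @ x)
        x_sl1 = np.tanh(W_in_sl1 @ u_t)   # static non-linear projection of the input
        x_sl2 = np.tanh(W_res_sl2 @ x)    # static non-linear projection of the reservoir state
        feats.append(np.concatenate([u_t, x, x_sl1, x_sl2]))
    return np.array(feats)

# The returned feature matrix plays the role of `states` in the readout training
# shown earlier: only the linear map from these features to the outputs is trained.
```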

3 Experimental Setup and Datasets

The input scaling, spectral radius and reservoir size are known to influence the performance of a reservoir for a given task [1]. In order to find the optimal settings for both reservoir architectures, separate parameter sweeps were carried out for each architecture on two different datasets. The leak rate of each neuron in the reservoir (a measure of how a neuron's activation decreases over a period of time) was also swept to investigate its effect on performance for both datasets, as it has been shown to improve performance by altering the dynamics of the reservoir to match the timescale of a given dataset. In order to average out the randomness of the reservoir weights at creation, 100 reservoirs of each type were trained. (All training and testing was performed using the open-source Matlab RC Toolbox, available from http://www.elis.ugent.be/rct.)

3.1 Fourth order polynomial data

This artificial dataset contains random uniform inputs ranging from -1 to +1 and outputs determined by a 4th-order polynomial, with coefficients also drawn uniformly at random from -1 to +1. The dataset consisted of 20,000 datapoints and was split into 20 samples of 1,000 datapoints for training and testing. The output at time t was determined by the following equation:

y_t = c_1 + c_2 u_t + c_3 u_{t-1} + ... + c_{15} u_{t-1}^4    (5)

where c_1, ..., c_{15} denote the 15 random, uniformly distributed coefficients. All other symbols are as described above. The output weights were calculated using ridge regression, with the performance of each reservoir measured by the Normalised Root Mean Squared Error (NRMSE) under ten-fold cross-validation, using a grid search for the optimal regularisation parameter.

3.2 Isolated spoken digits recognition task

The dataset is a subset of the TI46 isolated spoken digits dataset. It consists of 10 utterances of ten digits spoken by five different female speakers, resulting in 500 digit samples. Since this task is relatively easy to solve with standard reservoirs (in simulations not reported here, errors around 0 were obtained), it was made more challenging by adding babble noise (from the NOISEX database hosted by Rice University: http://spib.rice.edu/spib/data/signals/noise/babble.html) to the samples at a signal-to-noise ratio of 3 dB. The samples were then preprocessed using the Lyon cochlear model with a subsampling factor of 128, resulting in a 77-dimensional time-varying feature vector. This 77-dimensional vector was used as input to both architectures. For this task, the error was again determined using ten-fold cross-validation, but without optimisation of the regularisation parameter, because the noise added to the dataset is enough to avoid overfitting.
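
Equation 5 and the ridge-regression readout can be sketched as follows. The expansion of the 4th-order polynomial in u_t and u_{t-1} into 15 monomial terms, and the grid of regularisation values, are assumptions made for illustration; the paper does not list them explicitly.

```python
import numpy as np

def polynomial_target(u, coeffs):
    """Eq. 5: y_t as a 4th-order polynomial in u_t and u_{t-1}.

    The 15 terms are assumed to be all monomials u_t^i * u_{t-1}^j with i + j <= 4,
    the expansion consistent with 15 coefficients.
    """
    u_prev = np.concatenate([[0.0], u[:-1]])      # u_{t-1}, with an arbitrary initial value
    terms = [u**i * u_prev**j for i in range(5) for j in range(5) if i + j <= 4]
    return np.column_stack(terms) @ coeffs        # 15 columns, 15 coefficients

def ridge_readout(S, y, lam):
    """Ridge-regression output weights: (S^T S + lam I)^{-1} S^T y."""
    n = S.shape[1]
    return np.linalg.solve(S.T @ S + lam * np.eye(n), S.T @ y)

# Dataset generation as described: uniform inputs and coefficients in [-1, 1].
rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, size=20_000)
coeffs = rng.uniform(-1, 1, size=15)
y = polynomial_target(u, coeffs)

# Hypothetical grid of regularisation values; given a matrix S of collected
# reservoir/static-layer states, the readout for one candidate value would be
# ridge_readout(S, y, lam), with lam chosen by ten-fold cross-validation.
lambda_grid = [1e-8, 1e-6, 1e-4, 1e-2, 1.0]
```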

4 Results

Both architectures contained 150 neurons in total (the R2SP consisted of 50 nodes in each of the reservoir and the two static layers). Table 1 shows the optimal parameters for both reservoir approaches and their corresponding error rates on both datasets. Varying the leak rate of the neurons for the polynomial task led to little improvement in performance for either architecture and was therefore not investigated further. Varying the leak rate for the digit recognition task did, however, affect the performance of both architectures. Figure 2 shows the test error of a standard reservoir and of the R2SP on the digit recognition task when varying the spectral radius and the leak rate.

Fig. 2: Test error plots (word error rate) of the standard reservoir (left) and the R2SP (right) when applied to the digit recognition task, ranging over the leak rate and the spectral radius. Note the different scaling of the Z (error) axis.

Table 1: Errors obtained from a standard reservoir architecture (Std-Res) and the R2SP on the polynomial (Poly) and the digit recognition (Speech) datasets.

Poly (error in NRMSE)
  Type      Error   Input scale   Spectral radius
  Std-Res   0.73    25.1          1.2
  R2SP      0.5      2.5          0.6

Speech (error in word error rate)
  Type      Error   Input scale   Spectral radius
  Std-Res   0.17     1            1.3
  R2SP      0.08     1            1.1
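
The grid search behind Figure 2 can be reproduced in outline as follows. This is a schematic sketch rather than the actual experiment: it uses a toy target, reports in-sample NRMSE instead of the ten-fold cross-validated errors of Section 3, and assumes one common leaky-integrator formulation for the leak rate; the grid values only roughly echo the axes of Figure 2.

```python
import numpy as np

def nrmse(y_pred, y_true):
    """Normalised RMSE: RMSE divided by the target's standard deviation
    (one common definition; the paper does not spell out which variant it uses)."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2)) / np.std(y_true)

def leaky_states(u, W_in, W_res, leak_rate):
    """Leaky-integrator tanh reservoir for a scalar input sequence (assumed formulation):
    x(t) = (1 - a) x(t-1) + a tanh(W_in u(t) + W_res x(t-1))."""
    x = np.zeros(W_res.shape[0])
    states = np.empty((len(u), len(x)))
    for t, u_t in enumerate(u):
        x = (1 - leak_rate) * x + leak_rate * np.tanh(W_in * u_t + W_res @ x)
        states[t] = x
    return states

rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, size=2000)
y = np.roll(u, 1) ** 2                           # toy target needing memory and non-linearity
W_in = rng.uniform(-1, 1, size=100)
W0 = rng.uniform(-1, 1, size=(100, 100))
W0 /= np.max(np.abs(np.linalg.eigvals(W0)))      # unit spectral radius; rescaled in the loop

for rho in [0.5, 0.7, 0.9, 1.1, 1.3]:            # spectral radius sweep
    for a in [0.1, 0.3, 1.0]:                    # leak rate sweep
        S = leaky_states(u, W_in, rho * W0, a)
        W_out, *_ = np.linalg.lstsq(S[100:], y[100:], rcond=None)  # drop warm-up steps
        print(rho, a, nrmse(S[100:] @ W_out, y[100:]))             # in-sample error only
```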

5 Discussion

As the results show, the R2SP outperforms an optimised standard reservoir quite significantly, particularly on the digit recognition task: there the standard reservoir has a classification error of 17%, while the R2SP achieves a smaller error of 8%. Analysis of Table 1 and Figure 2 shows that the optimal spectral radius of the R2SP is lower than that of a standard reservoir. This indicates that the static layers perform more of the non-linear mapping into the high-dimensional state space, leaving a reservoir with longer-term memory (as indicated by the lower spectral radius) and therefore yielding a system which has highly non-linear mapping capabilities and memory at the same time. Similar comments apply to the standard reservoir on the polynomial dataset: its optimal input scaling and spectral radius are high, which results in a reservoir with highly saturated states (activations at either +1 or -1 for the whole dataset). This creates a more non-linear reservoir which therefore has less memory capacity, giving a larger error rate.

6 Conclusion

A novel architecture, the Reservoir with Random Static Projections (R2SP), has been presented which combines the recurrent characteristics of a reservoir with the instantaneous non-linear separation capabilities of standard feedforward networks, enabling a reservoir to have both memory and non-linear mapping capabilities. Such characteristics are required for datasets with short-lived non-linear features which also require memory, such as the polynomial and spoken digit recognition datasets presented here. The ease of training of reservoirs is preserved, as only the output weights of the newly added layers are trained, alongside the reservoir outputs. As a result of the two combined characteristics, the new R2SP architecture significantly outperforms a standard reservoir approach on the two tasks considered here. Future work will include an investigation of further techniques used in the OP-ELM, such as different activation functions and pruning of the least significant neurons, in order to obtain an optimal network size for a given task. Further investigations will also be conducted on applying the architecture to data collected from the real world whose underlying characteristics are suspected to require highly non-linear mapping as well as short-term memory, and which should therefore benefit from the novel R2SP architecture introduced here.

References

[1] D. Verstraeten, B. Schrauwen, M. D'Haene, and D. Stroobandt. An experimental unification of reservoir computing methods. Neural Networks, 20:391-403, 2007.
[2] M. Lukoševičius and H. Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127-149, 2009.
[3] H. Jaeger. The "echo state" approach to analysing and training recurrent neural networks. Technical report, German National Research Institute for Computer Science, 2001.
[4] X. Dutoit. Reservoir Computing for Intelligent Mobile Systems. PhD thesis, Department of Mechanical Engineering, Katholieke Universiteit Leuven, Belgium, 2009.
[5] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. Extreme learning machine: Theory and applications. Neurocomputing, 70:489-501, 2006.
[6] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse. OP-ELM: Optimally pruned extreme learning machine. IEEE Transactions on Neural Networks, 21(1):158-162, December 2009.