Exploring difficulties in learning spatio-temporal data using hierarchical Echo State Networks

Size: px

Start display at page:

Download "Exploring difficulties in learning spatio-temporal data using hierarchical Echo State Networks"

Job Briggs
5 years ago
Views:

1 Exploring difficulties in learning spatio-temporal data using hierarchical Echo State Networks by Ivaylo Enchev Bachelor Thesis in Computer Science Prof. Dr. Herbert Jaeger Name and title of the supervisor Date of Submission: May 12, 2013 Jacobs University School of Engineering and Science

2 Abstract The multiple scales in which real world data has to be analyzed in order to extract relevant features from it, create many difficulties even for current state of the art learning architectures. An idea that may achieve good results on such data is to combine hierarchical processing with top-down feedback. So far, despite mutliple attempts, there has been no success to combine these features in a single learning architecture that could be successfully trained on complex data. Hierarchical Echo State Networks are another attempt with this goal. Unfortunately, this learning architecture is not performing well, when applied to multiscale data. The current work aims at experimentally discovering and documenting the reasons behind the difficulties that arise when using Hierarchical Echo State Networks on such data.

3 Contents 1 Introduction and related work 4 2 Statement of Research Goals 6 3 Introduction of the Architecture Structural Overview Formal Description Learning Experiments and results Dataset Setup Higher level adaptation Lower level adaptation Two layer adaptation without pre-training Discussion 25 6 Aknowledgements 25 A Appendix A: Derivation of sigmoid from logistic function 26 B Appendix B: Computing the error derivatives used in weight update equations 27 3

4 1 Introduction and related work One of the challenges in modern machine learning is the design of robust learning architectures that can work well with noisy and high dimensional real world data. The applications of such learning architectures would be immense - speech processing, handwriting and gesture recognition, video sequence analysis and more. Unfortunately to this day only a partial success on solving this task has been achieved. As per input from Jaeger, there is still no learning model that is close to the performance of humans when it comes to working with multiscale data. Some of the reasons that were pointed out are: many of those models work only with static or low dimensional input whereas real world data is rarely such a lot of models rely on preprocessing of the input and extracting carefully hand-crafted features, which imposes heavy additional workload on the designer Bengio [1][2] and LeCun [2] discuss the need for hierarchical learning architectures when dealing with complex tasks. Such architectures is used to, gradually (level by level), produce more and more abstract representations of the raw input data, which in the end can be used to answer questions about the input (e.g classification). For example, let us consider the problem of processing human speech. The modules in each layer of a hierarchical learning architecture put to this task may extract features on more and more coarse timescales. Lower layers may extract simple phonemes, while higher layers extract whole words and phrases. Starting off from the assumption that, in order to formally express a complex behavior such as recognising human speech, the architecture must be able to learn functions that are highly varying with respect to the raw input, they argue that hierarchical models are required in order to efficiently and successfully represent those functions. The two main arguments are: Shallow architectures (architectures with insufficient amount of processing layers) may require many more computational elements (e.g artificial neurons, when working with artificial neural networks), when compared to architectures with more appropriate for the task number of layers. An architecture with sufficient number of processing layers can compactly represent the highly varying functions which need to be learned when dealing with complex data. Certain learning algorithms for shallow architectures (local estimators) rely on smoothness of the input and give unsatisfactory performance when applied to highly varying functions. Such algorithms would require a much 4

5 bigger amount of samples for training to successfully capture the high variability of the target function. As pointed out by Jaeger, some learning architectures that implement such hierarchy and achieve particularly good results on temporal data are: Convolutional Neural Networks [3], Hidden Markov models and Multidimensional Recurrent Neural Networks [4]. Convolutional Neural Networks are organized in layers, where the units in each layer are organized into planes, called feature maps. Units in a feature map perform the same operation on different sets of neighbouring units (called receptive fields) from the previous layer. All weights connecting units from the previous layer to any unit in a particular feature map are shared. Depending on the operation that is performed the layers are called convolutional or subsampling. Each feature map in a convolutional layer detects a particular feature in different parts of the input from the previous layer. Subsampling layers are used to reduce the resolution of the input and the sensitivity of the output to shifts or distortions. This way the architecture is able to detect features on several spatial scales. In the end the weights are trained using back-propagation. Although such a hierarchical learning architecture may have many adaptable connections between layer, the sharing of weights technique reduces this number substantially. Multidimensional Recurrent Neural Networks are itself a Recurrent Neural Network which has as many recurrent connections as there are dimensions in the data. They consist of an input and output layer and one or more hidden layers, all connected by only feed forward connections. The hidden layer scans the input in such a way that a point in that layer is being fed the hidden activations of all points one-step back along each dimension (through recurrent connections) and the input. An extension to this approach has several hidden layers, each one receiving hidden activations from a different directions. An n-dimensional input would then require 2 n such hidden layers. The whole architecture is then trained by an n-dimensional variant of the back-propagation algorithm. The connections between the layers in the suggested hierarchical learning architectures are all feed-forward, lower layers are influencing representations in higher layers, but there is no influence in the other direction. According to Jaeger, the additional feature that a hierarchical learning architecture must have, in order to achieve very good results on complex real world data, is top-down feedback - connections that go from higher layers back to lower layers. According to Friston [5] and Clark [6], from a point of view of the field of Bayesian brain, such backward connections are essential in cases when the processes that generate the input data are non-invertible and highly non-linear. The 5

6 key idea in this field is that the brain is working with the goal of reducing the error in the task of predicting the next step of its input. The brain is supposed to discover the causes which generate the input it receives. To achieve this task it utilizes a hierarchical architecture in which higher levels are trying to predict the input to lower layers by building models of the causal structure of this input. Errors in this prediction task force higher levels to adapt. The top-down connections are able to trace the interactions between causes in the input data, explain the driving input and introduce constraints that higher levels impose on lower levels and thus add context and empirical prior into the system. Jaeger pointed out well performing examples of architectures that implement topdown feedback such as Hierarchical Bayes [5], systems based of Adaptive Resonance Theory and Deep Belief Networks [7]. Even though they achieve good results on static data, these architectures are not able to cope with temporal data. So far, a successful idea of combining hierarchy with top down connections that scales well with more complex data has not been found. The main topic of this work is to try to shed some light on the problems that inhibit the performance of one system that also implements these ideas - Hierarchical Echo State Networks. 2 Statement of Research Goals As explained in the previous chapter, hierarchical learning architectures with topdown feedback represent a promising idea towards solving many complex problems in modern machine learning. Even though there are several architectures that implement these features, they do not achieve satisfactory results when applied to multiscale data. Hierarchical Echo State Networks [8] are another learning architecture that uses this idea. As per input from Jaeger, it is also not able to scale when presented with complex data. The main goal of the presented work is to experimentally investigate the performance and document difficulties that arise when using Hierarchical Echo State Networks. 3 Introduction of the Architecture 3.1 Structural Overview The main purpose of the presented architecture is to discover in an unsupervised way, dynamical features in a multiscale time series u(n). To achieve this the ar- 6

top), and can be used to approximate the original signal u(n). The architecture is composed of several layers, each of which hosts an Echo State Network(ESN) [9][10].

7 chitecture is trained on a one-step input signal prediction task. The discovered features are again time series(signals), which are related to each other in a hierarchical way ( fast/local features in the bottom of the hierarchy and slow/global features at the top), and can be used to approximate the original signal u(n). The architecture is composed of several layers, each of which hosts an Echo State Network(ESN) [9][10]. Each layer operates on a different time scale, decreasing from bottom to top - the lowest level operates on the same timescale as the original time series, while higher levels operate at increasingly slower scales. Each level computes a representation of the input signal at the timescale of that level. This representation is produced by combining the extracted, by the ESN at that level, dynamical features using a set of weights (also called votes) coming from the layer above. As an input, each layer takes the output produced from the previous lower layer (bottom-up flow), the first layer receives the input to the whole architecture. The weights which are used to combine the extracted features at a particular level are the output from the next higher level (top-down flow). See Fig. 1 Figure 1: Schematic of approximating a signal by feature vote combination. Picture taken from [8] 7

8 3.2 Formal Description Each layer in the architecture has the same structure. Assume the input signal u(n) has d dimensions and the architecture has k layers. For each layer we have the following parameters: 1. Hierarchy parameters F - number of extracted features in the level f i (n) - feature vector, computed as the output from the ESN on the current level, i = 1...F f(n) - matrix where column i is f i (n), i = 1...F v(n) - F -dimensional vector of votes, passed down from the next higher level in the hierarchy. The highest layer has no votes available from above and thus votes passed down from it are computed directly from the output of the ESN on that level 2. ESN parameters a - leaking rate of neurons in the ESN in the level. x(n) - reservoir state of the ESN in the level. The output of the ESN are the feature vectors f i (n) W in - input weight matrix of the ESN in the level. W - internal weight matrix of the ESN in the level. W out i - output weight matrix used to compute f i (n) from the reservoir state x(n). The matrices W out i are lumped together to form the ESN output weight matrix W out i(n) - input to the reservoirs in the level λ - leaky integration parameter Let the layers have labels R 1..R k, with R 1 corresponding to the lowest layer and R k corresponding the highest layer in the hierarchy. In order to refer to a specific parameter from a specific layer we use the notation R 1.F - meaning the number of extracted features in the lowest layer. Keep in mind that each layer has its own set of parameters, thus the value of R 1.F is not necessarily the same as R 2.F and so on. For the inputs to each layer we have: u(n 1) if k = 1 R k.i(n) = û(n 1) if k = 2 R k 2.v(n 1) if 2 < k <= (k) The n-th update cycle of the hierarchy works in the following way: 8 (1)

9 1. First the reservoir state in the ESN at each layer is updated according to the leaky integration state update equation. For each k = 1... k: R k.x(n) = (1 R k.a)r k.x(n 1)+σ(R k.w R k.x(n 1)+R k.w in R k.i(n 1)) (2) where sigma is the logistic sigmoid function: σ(q) = exp( q) (3) 2. Then for each layer k = 1... k the feature matrix R k.f(n) is obtained from by setting its columns to be the feature vectors R k.f i (n) computed by: R k.f i (n) = R k.w out i [R k.x(n); R k.i(n)] (4) where if u and v a two column vectors, [u; v] denotes vertical concatenation of u and v. 3. Then the votes are passed down starting from the highest level: R k 1.v(n) = σ(l R k 1.λ(R k.w out [R k.x(n); R k.i(n)])) (5) where L λ (q(n)) is leaking integration with leaking rate λ of a signal q(n), carried out according to: L λ (q(n)) = (1 λ)l λ (q(n 1)) + λq(n) (6) where the recursive leaky integration is initialized to 0. For the layers k = k 2...1, we obtain the votes by: R k.v(n) = σ(l Rk.λ(R k+1.f(n)r k+1.v(n))) (7) In the end the output û(n) is obtained by: û(n) = R 1.f(n)R 1.v(n) (8) For a graphical example of the flow in the hierarchy, see Fig. 2 9

10 Figure 2: Overview of the architecture flow with 3 layers. The processing steps of one time increment are shown. Vectors with same texture have the same dimension. Picture taken from [8] 10

11 3.3 Learning The only adaptive parameters in the hierarchy are the output weights of the ESN - R k.w out. At each time step W out is adapted, on all levels of the hierarchy, using stochastic gradient descent. The gradient is taken with respect to the squared prediction error: ε(n) = u(n) û(n) 2 (9) For each level except the last one (k = 1.. k 1, we refer to a vote potential as the leaky integrated quantity before passing through the sigmoid function: R k.p(n) = L Rk.λ(R k+1.w out [R k+1.x(n); R[k + 1].i(n)]) (10) Let s denote by σ (q) = σ(q)(1 σ(q)) the derivative of the logistic sigmoid σ(q), q j denote the j-th coordinate of a vector q,. denote component wise multiplication of two matrices. Then we can compute the error terms by: E(n) = u(n) û(n) (11) E 1 (n) = E(n) (12) E k (n) = R k 1.f T (n)e k 1 (n). R k 1.λσ (R k 1.p(n)) (13) Then the updated weights are: R k.wi out (n + 1) = R k.wi out (n) + R k.γr k.v i (n)e k (n)[r k.x(n); R k.i(n)] T (14) The quantities E k (n) can be interpreted formally interpreted as error vectors. Each E k can be considered as back-propagated versions of E 1. 4 Experiments and results 4.1 Dataset The dataset used during the experiments is the Triple Generator Dataset [11]. This dataset uses 3 generator signals - tent map, sine wave and constant (see Fig. 3). The 3 signals alternate depending on a random switch occurring with certain probability. The nature of the dataset allows easy modification to which signals are to be used and with what probability should they switch. 11

Figure 3: Signal generated by the Triple Generator Dataset with length 1000 4.

12 Figure 3: Signal generated by the Triple Generator Dataset with length Setup The experiments shown in this paper aimed to show the performance of the architecture in rather simple conditions and set a ground on which to further build upon. As a first step, the architecture was tested under very clear and simplified conditions The dataset was simplified to only to 2 signal generators - sine wave and constant and the switching probability was set to Also during each switch the constant signal retains its value, that is the constant signal has always the same value no matter when or how the switch happens. The sine wave is the simply the function 0.5sin(0.8x) It has values between 0 and 1, with a period of approximately 8 steps. Within these conditions it is easy to see that 2 features would be enough in order to construct a perfect approximation of the next step of the 2-step generator signal: the first layer should produce 2 features - one that is optimized in case the next step is a sine wave and another that is optimized in case the next step is the constant signal in that case, the second layer should pass down 2 votes that switch only between 0 and 1, indicating which is of the two signals is used in the next step of the input. To achieve this, a simplified version of the original architecture has been used. This version uses only 2 layers and has no leaky integration on the passed votes from the second layer (but still go through a sigmoid function). The output of the ESN in the second layer and sigmoid function through which it passes on the topdown flow should work in way that the result after passing through the sigmoid is 2 indicators values - one that has a value 1 when the next step is from the sine wave and 0 otherwise, while the other has a value 1 when the next step is from the 12

13 constant signal and 0 otherwise. The sigmoid chosen for the experiments shown in this paper is derived from the logistic sigmoid: σ(x) = ( 1 1 e ( e x ) + 1)1 2 (15) The exact function has been derived using scaling and shifting. This sigmoid has the nice property that intersects 0 and 1 at points -5 and 5 respectively and smoothes the output from the ESN on the second layer(see Fig. 4). For details see Appendix A Figure 4: Sigmoid used for smoothing the output of the second layer ESN With these simplifications we can directly use the following weight update equations for each layer ( derived from back-propagating the error of the final approximation to the corresponding layer. For details see Appendix B). For the first layer: R 1.W out (n + 1) = R 1.W out (n) + R 1.γE(n)R 2.v(n)[R 1.x(n); R 1.i(n)] T (16) For the second layer, since we have no leaky integration anymore, the vote potential is just the output from the ESN on that layer R 2.p(n) = R 2.f(n) (17) And thus the update of the weights in the second layer can be computed using: R 2.W out (n + 1) = R 2.W out (n)+ R 2.γE(n)(R 1.f(n). σ (R 2.p(n)))[R 1.x(n); R 1.i(n)] T (18) 13

14 The prediction error was computed using the normalized root mean square error (NRMSE), computed by: mean((û(n) u(n))2 ) NRMSE = (19) σ 2 where u(n) is the value of the next step of the input, while û(n) is the predicted result. 4.3 Higher level adaptation An experiment to diagnose the adaptation of the higher level has been prepared according to the following scheme: 1. Two ESN are pre-trained to produce an approximation of the features and votes described in the previous section. The first ESN produces 2 features, each of which is optimized in case the next step is a sine wave and the other in case the next step is a constant. The second ESN produces values such that after passing them through a specified sigmoid, result in indicators with values either 0 or 1, depending on which type of signal is the next step. 2. After the pre-training, the two ESN are embedded in the 2 layer Hierarchical Echo State Network described in the previous section. 3. The learning rate of the first layer is set to 0, and the whole system is fed the 2-generator input signal. Since the first layer has 0 learning rate, adaptations should only happen in the second layer. 4. The performance of the adaptation is estimated by iteratively adapting the system for a fixed number of steps, after which adaptation is turned off and then the prediction error on a fixed testing dataset is computed Following the above scheme two ESNs with reservoir sizes of 100 units and spectral radii 1. The leaking rates of the networks were set to 0.7 and 0.5, for the first and second layer respectively. The first ESN was fed an input from the 2-generator dataset with length and trained on the task of producing 2 output signals - one that is an approximation of the next step of the sine wave, while the other is an approximation with the next step of a constant signal. The second ESN was fed the exact same input and trained to 2 output values, one 14

15 that has a value of 5 in case the next step is from the sine wave and 5 otherwise, and the other has a value of 5 in case the next step is from the constant signal and 5 otherwise. The reason for this is that the sigmoid chosen in Section 4.2 has a value of 1 for x = 5 and 0 for x = 5, which means that after passing the output from the second ESN through the sigmoid we get exactly the indicators described in the scheme. The adaptation of the weights in the first and second ESN was done using ridge regression (also known as Tikhonov regression)[12] with a regularization coefficient of 0.01 and 0.1 respectively. The NMRSE from the two prediction tasks in training and over a steps long testing sequence are shown in Table 1. Although the precision and results of this task can be increased using different methods for linear regression (such as using the Moore-Penrose pseudoinverse), we found that ridge regression was particularly useful in our case due to the better numerical stability of the resulting weight matrix. Using Moore-Penrose pseudoinverse resulted in weight matrices with very big entries, which yielded very sensitive solutions. This resulted into very unstable learning process of the hierarchical learning architecture, when the pre-trained ESNs were embedded in its layers. The regularization coefficients can be optimized using cross-validation, but for the purpose of just diagnostics the above mentioned values yielded solutions that generalized well enough. For plots of the produced features and votes see Fig. 5. NRMSE Training Testing ESN for Layer ESN for Layer Table 1: NMRSE during training and testing of two ESN producing approximations of the features and votes described in Section

(a) The two feature signals from ESN for Layer 1 (b) The two vote signals from ESN for Layer 2 Figure 5: Features and Votes produced during pre-training.

16 (a) The two feature signals from ESN for Layer 1 (b) The two vote signals from ESN for Layer 2 Figure 5: Features and Votes produced during pre-training. The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave. The 2-layered Hierarchical Echo State Network, that uses the two pre-trained ESNs in its 2 layers, is trained for steps by reiterating over a step long input sequence consisting of sine wave and constant signal. The learning rates of the first and second layer are set to 0 and 0.01 respectively. After each 16

17 1000 steps, adaptation is switched off, and the NRMSE over a 5000 steps long testing sequence is computed. The development of this error is plotted in Fig. 6. Figure 6: Development of the NRMSE over a fixed test sequence when adaption for the first layer is switched off. Final NRMSE is The development of the NRMSE clearly shows that the upper layer adapts correctly and reduces the overall prediction error in the system. The small jumps towards the end of the simulation can be accounted to using a too big learning rate for the stochastic gradient descent learning. A snapshot of the votes that the second layer produces can be seen in Fig. 7. The imperfect voting and appearance of oscillations during the switch to the constant signal are due to the fact that there are infinitely many possibilities to pick votes that are able to reconstruct the constant signal by combining them with 2 constant signal features. The constant signal is too easy to reconstruct which leads to the observed instability. An important factor to the success of the whole learning is the choice of an activation function for ESN in each layer. The current results were achieved using tanh(x) as an activation function. Experiments with a logistic activation function gave a much worse results (for the same number of training steps) during the pre-training of the ESNs in each layer, which later led to instability of the learning process of the whole hierarchical architecture. 17

18 Figure 7: Snapshot of the votes produced by the second level after training with steps. The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave 4.4 Lower level adaptation The adaptation of the first layer is diagnosed using the same procedure, with the only difference that now the second layer has a learning rate of 0 and the first layer has a learning rate of This way the second layer should pass down approximately perfect votes. The development of the NRMSE over the test sequence along with a snapshot of the final features produced by the first layer can be seen in Fig. 8. From the figure we can clearly see that adaptation within this first layer works correctly and the error is reduced as expected. Oscillations in this development are due to a high learning rate set for the layer. A snapshot of the final features can be seen in Fig

19 Figure 8: NMRSE development over a fixed test sequence during adaptation of only the lowest layer. Final NRMSE is Figure 9: Snapshot of features produced by the first layer. The blue and green line show the two features, whereas the red line is the correct next step. The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave 19

20 4.5 Two layer adaptation without pre-training As a final diagnostic the whole architecture has been run, without any pre-training of the two layers. The development of the NRMSE in this diagnostic can be seen in Fig. 10. Again the development of the NRMSE over the test sequence works as expected and reduces over time. It is interesting to note that in this case the produced features and votes (see Fig. 11) do not resemble the ones which were forced with pre-training in previous experiments(see Fig. 5). A possible explanation for this obvious difference is that the system is reaching a local minimum and would need many more training steps to escape from it. Figure 10: NRMSE development over a fixed test sequence during adaptation of both layers and no pre-training. Final NRMSE is

(a) The two feature signals produced from(b) The two vote signals passed down from Layer 1 Layer 2 Figure 11: Features and Votes after adaptation of both layers without pre-training.

Additionally the architecture was presented with increasingly more complex data by increasing the switching probability in the dataset.

21 (a) The two feature signals produced from(b) The two vote signals passed down from Layer 1 Layer 2 Figure 11: Features and Votes after adaptation of both layers without pre-training. The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave. Additionally the architecture was presented with increasingly more complex data by increasing the switching probability in the dataset. This makes the input signal more chaotic and the adaptation harder. Increasing this factor in the data required setting the learning rates to for both layers in order to reduce the instabilities in the learning. Two experiments have been run - one with a switching probability of 0.02 and one with switching probability of The architecture has been trained for steps. Snapshot of the resulting approximations and feature/vote combinations can be seen in Fig. 12 and Fig. 13 respectively. We can see that increasing the switching factor inhibits the learning severely and the architecture is not adapting well to the often switching of the input signal. 21

Figure 12: Features, votes and final result over a 1000 step sequence. The final NRMSE is 0.21244.

22 (a) The two feature signals produced from(b) The two vote signals passed down from Layer 1 Layer 2 (c) Resulting approximation, red curve shows the correct output whereas the blue line shows the output from the learning architecture. Figure 12: Features, votes and final result over a 1000 step sequence. The final NRMSE is The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave. 22

Figure 13: Features, votes and final result over a 1000 step sequence. The final NRMSE is 0.31678.

the input signal is the sine wave. The discovered instability when using the constant signal (see Fig.

23 (a) The two feature signals produced from(b) The two vote signals passed down from Layer 1 Layer 2 (c) Resulting approximation, red curve shows the correct output whereas the blue line shows the output from the learning architecture. Figure 13: Features, votes and final result over a 1000 step sequence. The final NRMSE is The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave. The discovered instability when using the constant signal (see Fig. 7) inspired the idea to try the architecture on other simple patterns. Instead of the constant signal, a scaled and slowed down modification of a sine wave has been used sin(0.05x) This new signal has values between 0 and 0.5 with a period of approximately 125 steps. The switching probability has been set to and 23

the learning rates for both layers to 0.001. The architecture has been trained for 5.10 6 steps.

The architecture still performs well and is able to adapt to the slow sine wave almost perfectly.

red curve shows the correct output whereas the blue line shows the output from the learning architecture.

24 the learning rates for both layers to The architecture has been trained for steps. The final result can be seen in Fig. 14. The architecture still performs well and is able to adapt to the slow sine wave almost perfectly. (a) The two feature signals produced from(b) The two vote signals passed down from Layer 1 Layer 2 (c) Resulting approximation, red curve shows the correct output whereas the blue line shows the output from the learning architecture. Figure 14: Features, votes and final result over a 1000 step sequence. The final NRMSE is The light blue colored intervals on the horizontal axis indicate that the input signal is the slow sine wave, while the black colored intervals indicate that the input signal is the fast sine wave. 24

25 5 Discussion The experiments in this paper analyse the performance of a 2-layer hierarchical architecture with top-down feedback when presented with a simple 2-generator dataset. Isolated adaptations of the upper and lower layer have been tried out, as well as a complete simulation of the whole architecture. Initially this was done by modifying an already existing implementation of Hierarchical Echo State Network that allowed users to easily modify its parameters and features. The architecture and the dataset were gradually simplified but no apparent progress has been made even in very simplified conditions. The experiments worked very slowly and required a huge number of trainings steps to achieve moderate results. After many failed attempts to try and detect the issues in the code, the whole learning procedure for the specific architecture described in Section 4.2 has been implemented from scratch. This was the key to making initial progress on the designed experiments and detect some of the problems in the old code. The main differences in the reimplementation (and probably the reason for the improved performance) are the use of the recomputed update equations from Section 4.2 and the use of tanh(x) as an activation function for the ESN in each layer. These changes enabled the system to perform much better and achieve good results with smaller number of training steps. The presented results can be used as a ground base on which to further build upon. Several issues that can inhibit the performance of the learning architecture have been discovered. Currently we can see that initially the system makes very fast progress and reduces the error rather quickly, after which we observe oscillations in the error, which are probably due to a very high learning rate. It would be interesting to see if using an adaptive learning rate can improve this behavior. It has been shown through several experiments that one of the factors in the data that inhibits the learning severely is the switching probability. Additionally, as seen in Fig. 7, instabilities may arise due to the use of the constant signal as input. A direction which can be further explored is the use of different patterns besides the already presented ones. Possible future work may also include training the architecture on more complex data and analysing modifications to the system that can cope with the mentioned issues, boost the performance and/or speed up the learning. 6 Aknowledgements I would like to thank my supervisor Prof. Dr. Herbert Jaeger for guiding my first steps in the exciting field of machine learning and for constantly devoting time 25

26 and effort to give me feedback and directions about the difficulties I had along the way of writing this thesis. A Appendix A: Derivation of sigmoid from logistic function In the experiments show in this paper, we needed a sigmoid that intersects 0 at point A and intersects 1 at point A, for some fixed positive constant A. The bigger this constant is, the more smoothing is applied to the input of the sigmoid. In order to derive a concrete formula for such a function we started off from the logistic sigmoid: 1 f(x) = (20) e x + 1 By subtracting 1 from the above equation, we get an odd function with two 2 horizontal asymptotes - at 1 and 1: 2 2 g(x) = f(x) 1 2 (21) After that we scale this function, so that it intersects 1 and 1 at points A and A respectively. This is easy to achieve since the function is odd. We only have to multiply by the reciprocal of the value of the function at A: t(x) = 1 g(x) (22) g(a) Now the resulting function intersects 1 and 1 at points A and A, but we want to make it so that it intersects 0 and 1 at A and A. To achieve this we shift the function up by adding 1, which makes it intersect 0 and 2 at A and A. Now we get the desired sigmoid, by scaling this whole function by 1. Thus we get: 2 σ(x) = (t(x) + 1) 1 2 (23) When we replace t(x) and g(x) by their definitions and choosing A = 5 we get exactly the result from section 4.2: σ(x) = ( 1 1 e ( e x ) + 1)1 2 (24) 26

27 B Appendix B: Computing the error derivatives used in weight update equations Let s define the following variables: f 1 (n) and f 2 (n) - outputs from the ESN in the first layer v 1 (n) and v 2 (n) - outputs from the ESN from the second layer x 1 (n) and x 2 (n) - reservoirs states of layer 1 and 2 respectively w 1 (n) and w 2 (n) - output weight vectors for feature 1 and feature 2 from layer 1 q 1 (n) and q 2 (n) - output weight vectors for feature 1 and feature 2 from layer 2 u(n) and û(n) - correct value of next step of the input and output of the whole architecture, respectively E(n = u(n) û(n) Then we have that û(n) = f 1 (n)v 1 (n) + f 2 (n)v 2 (n) and the squared error is ε(n) = E(n) 2. Thus for the partial derivative of the error with respect to the weight vector of the first feature in the first layer we get: ε(n) w 1 (n) = 2E(n) (E(n)) w 1 (n) (u(n) û(n)) = 2E(n) w 1 (n) (25) (26) = 2E(n) (u(n) f 1(n)v 1 (n) f 2 (n)v 2 (n)) w 1 (n) (27) = 2E(n) (u(n) w 1(n)x 1 (n)v 1 (n) f 2 (n)v 2 (n)) w 1 (n) (28) = 2E(n)x 1 (n)v 1 (n) (29) Following the same logic, we get the derivative of the error with respect to the weight vector of the second feature in the first layer to be ε(n) = 2E(n)x w 1 (n) 1v 2 (n) Then for the partial derivative of the error with respect to the weight vector of the first feature in second layer we get: 27

28 ε(n) q 1 (n) = 2E(n) (E(n)) q 1 (n) (u(n) û(n)) = 2E(n) q 1 (n) (30) (31) = 2E(n) (u(n) f 1(n)v 1 (n) f 2 (n)v 2 (n)) q 1 (n) (32) = 2E(n) (u(n) f 1(n)σ(q 1 (n)x(n)) f 2 (n)v 2 (n)) q 1 (n) (33) = 2E(n)f 1 (n)x(n)σ (q 1 (n)x(n)) (34) 28

29 References [1] J. Bengio. Learning deep architectures for AI. Technical Report 1312, Dept. IRO, Universite de Montreal, [2] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In Bottou L., Chapelle O., DeCoste D., and Weston J., editors, Large-Scale Kernel Machines. MIT Press, [3] Y. LeCun, L. Bottou, J. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): , [4] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Proc. NIPS MIT Press, [5] K. Friston. Learning and inference in the brain. Neural Networks, 16: , [6] A. Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 2012 (to appear). [7] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7): , JUL [8] Herbert Jaeger. Discovering multiscale dynamical features with hierarchical echo state networks. Technical report, Jacobs University Bremen, [9] Herbert Jaeger. The echo state approach to analysing and training recurrent neural networks - with an erratum note. German National Research Center for Information Technology GMD Technical Report, 148, [10] Herbert Jaeger. Short term memory in echo state networks. German National Research Center for Information Technology GMD Technical Report, 152, [11] Triple generator dataset. organic/benchmarks/triplegenerator. [12] David Verstraeten. Reservoir Computing: computation with dynamical systems. Phd thesis, Ghent University, Ghent,

Convolutional Neural Networks

Lecturer: Barnabas Poczos Introduction to Machine Learning (Lecture Notes) Convolutional Neural Networks Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.