Exploring difficulties in learning spatio-temporal data using hierarchical Echo State Networks


by Ivaylo Enchev

Bachelor Thesis in Computer Science

Supervisor: Prof. Dr. Herbert Jaeger

Date of Submission: May 12, 2013

Jacobs University, School of Engineering and Science

Abstract

The multiple scales at which real-world data has to be analyzed in order to extract relevant features from it create many difficulties even for current state-of-the-art learning architectures. An idea that may achieve good results on such data is to combine hierarchical processing with top-down feedback. So far, despite multiple attempts, there has been no success in combining these features in a single learning architecture that can be successfully trained on complex data. Hierarchical Echo State Networks are another attempt with this goal. Unfortunately, this learning architecture does not perform well when applied to multiscale data. The current work aims at experimentally discovering and documenting the reasons behind the difficulties that arise when using Hierarchical Echo State Networks on such data.

Contents

1 Introduction and related work
2 Statement of Research Goals
3 Introduction of the Architecture
   3.1 Structural Overview
   3.2 Formal Description
   3.3 Learning
4 Experiments and results
   4.1 Dataset
   4.2 Setup
   4.3 Higher level adaptation
   4.4 Lower level adaptation
   4.5 Two layer adaptation without pre-training
5 Discussion
6 Acknowledgements
A Appendix A: Derivation of sigmoid from logistic function
B Appendix B: Computing the error derivatives used in weight update equations

1 Introduction and related work

One of the challenges in modern machine learning is the design of robust learning architectures that can work well with noisy and high-dimensional real-world data. The applications of such learning architectures would be immense: speech processing, handwriting and gesture recognition, video sequence analysis and more. Unfortunately, to this day only partial success on this task has been achieved. As pointed out by Jaeger, there is still no learning model that comes close to human performance when working with multiscale data. Some of the reasons that were pointed out are:

- many of those models work only with static or low-dimensional input, whereas real-world data is rarely such;
- a lot of models rely on preprocessing the input and extracting carefully hand-crafted features, which imposes a heavy additional workload on the designer.

Bengio [1][2] and LeCun [2] discuss the need for hierarchical learning architectures when dealing with complex tasks. Such architectures are used to gradually (level by level) produce more and more abstract representations of the raw input data, which in the end can be used to answer questions about the input (e.g. classification). For example, consider the problem of processing human speech. The modules in each layer of a hierarchical learning architecture put to this task may extract features on coarser and coarser timescales: lower layers may extract simple phonemes, while higher layers extract whole words and phrases. Starting from the assumption that, in order to formally express a complex behavior such as recognising human speech, the architecture must be able to learn functions that are highly varying with respect to the raw input, they argue that hierarchical models are required in order to efficiently and successfully represent those functions. The two main arguments are:

- Shallow architectures (architectures with an insufficient number of processing layers) may require many more computational elements (e.g. artificial neurons, when working with artificial neural networks) than architectures with a number of layers more appropriate for the task. An architecture with a sufficient number of processing layers can compactly represent the highly varying functions which need to be learned when dealing with complex data.
- Certain learning algorithms for shallow architectures (local estimators) rely on smoothness of the input and give unsatisfactory performance when applied to highly varying functions. Such algorithms would require a much bigger number of training samples to successfully capture the high variability of the target function.

As pointed out by Jaeger, some learning architectures that implement such a hierarchy and achieve particularly good results on temporal data are Convolutional Neural Networks [3], Hidden Markov models and Multidimensional Recurrent Neural Networks [4].

Convolutional Neural Networks are organized in layers, where the units in each layer are organized into planes called feature maps. The units in a feature map perform the same operation on different sets of neighbouring units (called receptive fields) from the previous layer; all weights connecting units from the previous layer to any unit in a particular feature map are shared. Depending on the operation that is performed, the layers are called convolutional or subsampling. Each feature map in a convolutional layer detects a particular feature in different parts of the input from the previous layer. Subsampling layers are used to reduce the resolution of the input and the sensitivity of the output to shifts and distortions. In this way the architecture is able to detect features on several spatial scales. In the end the weights are trained using back-propagation. Although such a hierarchical learning architecture may have many adaptable connections between layers, the weight-sharing technique reduces this number substantially.

Multidimensional Recurrent Neural Networks are Recurrent Neural Networks with as many recurrent connections as there are dimensions in the data. They consist of an input layer, an output layer and one or more hidden layers, connected by feed-forward connections only. The hidden layer scans the input in such a way that a point in that layer is fed the hidden activations of all points one step back along each dimension (through recurrent connections), together with the input. An extension of this approach has several hidden layers, each one receiving hidden activations from a different direction; an n-dimensional input then requires $2^n$ such hidden layers. The whole architecture is trained by an n-dimensional variant of the back-propagation algorithm.
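To make the weight-sharing and subsampling idea from the convolutional description above concrete, the following is a tiny NumPy sketch. It is illustrative only, not taken from any of the cited systems and unrelated to the thesis architecture: one shared kernel is slid over all receptive fields of a one-dimensional input, and the resulting feature map is then subsampled.

```python
import numpy as np

def feature_map_1d(signal, kernel):
    """One feature map: every unit applies the SAME shared weights (kernel)
    to a different receptive field of the previous layer."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

def subsample_1d(feature_map, factor=2):
    """Subsampling layer: average neighbouring units to reduce the resolution
    and the sensitivity to small shifts of the input."""
    usable = len(feature_map) - len(feature_map) % factor
    return feature_map[:usable].reshape(-1, factor).mean(axis=1)

# Illustrative use: a rising-edge detector applied everywhere in a toy signal.
signal = np.array([0, 0, 1, 1, 0, 0, 1, 1, 0, 0], dtype=float)
kernel = np.array([-1.0, 1.0])            # shared weights of the feature map
pooled = subsample_1d(feature_map_1d(signal, kernel))
```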

The connections between the layers in the suggested hierarchical learning architectures are all feed-forward: lower layers influence the representations in higher layers, but there is no influence in the other direction. According to Jaeger, the additional feature that a hierarchical learning architecture must have, in order to achieve very good results on complex real-world data, is top-down feedback, i.e. connections that go from higher layers back to lower layers. According to Friston [5] and Clark [6], from the point of view of the Bayesian brain field, such backward connections are essential in cases where the processes that generate the input data are non-invertible and highly non-linear. The key idea in this field is that the brain works with the goal of reducing the error in the task of predicting the next step of its input. The brain is supposed to discover the causes which generate the input it receives. To achieve this it utilizes a hierarchical architecture in which higher levels try to predict the input to lower layers by building models of the causal structure of this input. Errors in this prediction task force the higher levels to adapt. The top-down connections are able to trace the interactions between causes in the input data, explain the driving input and introduce constraints that higher levels impose on lower levels, and thus add context and empirical priors into the system. Jaeger pointed out well-performing examples of architectures that implement top-down feedback, such as Hierarchical Bayes [5], systems based on Adaptive Resonance Theory, and Deep Belief Networks [7]. Even though they achieve good results on static data, these architectures are not able to cope with temporal data. So far, a successful way of combining hierarchy with top-down connections that scales well to more complex data has not been found. The main topic of this work is to try to shed some light on the problems that inhibit the performance of one system that implements these ideas: Hierarchical Echo State Networks.

2 Statement of Research Goals

As explained in the previous chapter, hierarchical learning architectures with top-down feedback represent a promising idea for solving many complex problems in modern machine learning. Even though there are several architectures that implement these features, they do not achieve satisfactory results when applied to multiscale data. Hierarchical Echo State Networks [8] are another learning architecture that uses this idea; as pointed out by Jaeger, it is also not able to scale when presented with complex data. The main goal of the presented work is to experimentally investigate the performance of Hierarchical Echo State Networks and to document the difficulties that arise when using them.

3 Introduction of the Architecture

3.1 Structural Overview

The main purpose of the presented architecture is to discover, in an unsupervised way, dynamical features in a multiscale time series u(n). To achieve this, the architecture is trained on a one-step input-signal prediction task.

The discovered features are themselves time series (signals), which are related to each other in a hierarchical way (fast/local features at the bottom of the hierarchy and slow/global features at the top), and can be used to approximate the original signal u(n). The architecture is composed of several layers, each of which hosts an Echo State Network (ESN) [9][10]. Each layer operates on a different timescale, decreasing from bottom to top: the lowest level operates on the same timescale as the original time series, while higher levels operate at increasingly slower scales. Each level computes a representation of the input signal at the timescale of that level. This representation is produced by combining the dynamical features extracted by the ESN at that level, using a set of weights (also called votes) coming from the layer above. As input, each layer takes the output produced by the previous, lower layer (bottom-up flow); the first layer receives the input to the whole architecture. The weights used to combine the extracted features at a particular level are the output of the next higher level (top-down flow). See Fig. 1.

Figure 1: Schematic of approximating a signal by feature-vote combination. Picture taken from [8].

3.2 Formal Description

Each layer in the architecture has the same structure. Assume the input signal u(n) has d dimensions and the architecture has $\bar{k}$ layers. Each layer has the following parameters:

1. Hierarchy parameters:
- $F$ - number of extracted features in the level
- $f_i(n)$ - feature vector, computed as output $i$ of the ESN on the current level, $i = 1, \ldots, F$
- $f(n)$ - matrix whose column $i$ is $f_i(n)$, $i = 1, \ldots, F$
- $v(n)$ - $F$-dimensional vector of votes, passed down from the next higher level in the hierarchy. The highest layer has no votes available from above, and thus the votes it passes down are computed directly from the output of the ESN on that level.

2. ESN parameters:
- $a$ - leaking rate of the neurons in the ESN of the level
- $x(n)$ - reservoir state of the ESN in the level; the outputs of the ESN are the feature vectors $f_i(n)$
- $W^{in}$ - input weight matrix of the ESN in the level
- $W$ - internal weight matrix of the ESN in the level
- $W^{out}_i$ - output weight matrix used to compute $f_i(n)$ from the reservoir state $x(n)$; the matrices $W^{out}_i$ are lumped together to form the ESN output weight matrix $W^{out}$
- $i(n)$ - input to the reservoir in the level
- $\lambda$ - leaky integration parameter for the votes

Let the layers have labels $R_1, \ldots, R_{\bar{k}}$, with $R_1$ corresponding to the lowest layer and $R_{\bar{k}}$ to the highest layer in the hierarchy. In order to refer to a specific parameter of a specific layer we use the notation $R_1.F$, meaning the number of extracted features in the lowest layer. Keep in mind that each layer has its own set of parameters; thus the value of $R_1.F$ is not necessarily the same as $R_2.F$, and so on. The inputs to the layers are given by

$$R_k.i(n) = \begin{cases} u(n-1) & \text{if } k = 1 \\ \hat{u}(n-1) & \text{if } k = 2 \\ R_{k-2}.v(n-1) & \text{if } 2 < k \le \bar{k} \end{cases} \qquad (1)$$

The n-th update cycle of the hierarchy works in the following way:

1. First, the reservoir state of the ESN at each layer is updated according to the leaky-integration state update equation. For each $k = 1, \ldots, \bar{k}$:
$$R_k.x(n) = (1 - R_k.a)\,R_k.x(n-1) + \sigma\big(R_k.W\,R_k.x(n-1) + R_k.W^{in}\,R_k.i(n-1)\big) \qquad (2)$$
where $\sigma$ is the logistic sigmoid function
$$\sigma(q) = \frac{1}{1 + \exp(-q)} \qquad (3)$$

2. Then, for each layer $k = 1, \ldots, \bar{k}$, the feature matrix $R_k.f(n)$ is obtained by setting its columns to the feature vectors $R_k.f_i(n)$ computed by
$$R_k.f_i(n) = R_k.W^{out}_i\,[R_k.x(n);\, R_k.i(n)] \qquad (4)$$
where, for two column vectors $u$ and $v$, $[u; v]$ denotes their vertical concatenation.

3. Then the votes are passed down, starting from the highest level:
$$R_{\bar{k}-1}.v(n) = \sigma\Big(L_{R_{\bar{k}-1}.\lambda}\big(R_{\bar{k}}.W^{out}\,[R_{\bar{k}}.x(n);\, R_{\bar{k}}.i(n)]\big)\Big) \qquad (5)$$
where $L_{\lambda}(q(n))$ is leaky integration of a signal $q(n)$ with leaking rate $\lambda$, carried out according to
$$L_{\lambda}(q(n)) = (1 - \lambda)\,L_{\lambda}(q(n-1)) + \lambda\,q(n) \qquad (6)$$
with the recursive leaky integration initialized to 0. For the layers $k = \bar{k}-2, \ldots, 1$, we obtain the votes by
$$R_k.v(n) = \sigma\Big(L_{R_k.\lambda}\big(R_{k+1}.f(n)\,R_{k+1}.v(n)\big)\Big) \qquad (7)$$

In the end, the output $\hat{u}(n)$ is obtained by
$$\hat{u}(n) = R_1.f(n)\,R_1.v(n) \qquad (8)$$

For a graphical example of the flow in the hierarchy, see Fig. 2.

Figure 2: Overview of the architecture flow with 3 layers. The processing steps of one time increment are shown. Vectors with the same texture have the same dimension. Picture taken from [8].
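As a reading aid, the update cycle can be written out in code for the two-layer instantiation that is used in the experiments of Section 4. The following NumPy sketch is not the implementation used in the thesis: the random weight generation, reservoir sizes, leaking rates and the driving sine signal are illustrative assumptions, and the output weights are left untrained (their adaptation is the subject of Section 3.3).

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda q: 1.0 / (1.0 + np.exp(-q))           # logistic sigmoid, Eq. (3)

class Layer:
    """One level R_k of the hierarchy: a leaky-integrator ESN with a linear readout."""
    def __init__(self, n_in, n_res, n_out, a, lam):
        self.a, self.lam = a, lam                    # leaking rates (reservoir / votes)
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        self.W = W / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 1
        self.W_out = rng.uniform(-0.1, 0.1, (n_out, n_res + n_in))
        self.x = np.zeros(n_res)                     # reservoir state x(n)
        self.leak = np.zeros(n_out)                  # leaky integrator L_lambda, Eq. (6)

    def step(self, i_n):
        """Eq. (2): reservoir update; Eq. (4): readout on the extended state [x(n); i(n)]."""
        i_n = np.atleast_1d(np.asarray(i_n, dtype=float))
        self.x = (1 - self.a) * self.x + sigma(self.W @ self.x + self.W_in @ i_n)
        return self.W_out @ np.concatenate([self.x, i_n])

def update_cycle(bottom, top, u_prev, u_hat_prev):
    """One time increment of a two-layer hierarchy (Eqs. 1-8 with kbar = 2)."""
    f = bottom.step(u_prev)                          # Eq. (1): R_1.i(n) = u(n-1); feature matrix f(n)
    p = top.step(u_hat_prev)                         # Eq. (1): R_2.i(n) = u_hat(n-1); raw top-layer output
    bottom.leak = (1 - bottom.lam) * bottom.leak + bottom.lam * p   # Eq. (6), lambda of R_1
    v = sigma(bottom.leak)                           # Eq. (5): votes passed down to layer 1
    return float(f @ v), v                           # Eq. (8): u_hat(n) = f(n) v(n)

# Illustrative run on a one-dimensional sine input with untrained output weights.
bottom = Layer(n_in=1, n_res=100, n_out=2, a=0.7, lam=0.5)
top    = Layer(n_in=1, n_res=100, n_out=2, a=0.5, lam=0.5)
u_hat = 0.0
for n in range(1, 300):
    u_prev = 0.5 * np.sin(0.8 * (n - 1)) + 0.5
    u_hat, votes = update_cycle(bottom, top, u_prev, u_hat)
```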

3.3 Learning

The only adaptive parameters in the hierarchy are the output weights of the ESNs, $R_k.W^{out}$. At each time step, $W^{out}$ is adapted on all levels of the hierarchy using stochastic gradient descent. The gradient is taken with respect to the squared prediction error
$$\varepsilon(n) = \|u(n) - \hat{u}(n)\|^2 \qquad (9)$$
For each level except the highest one ($k = 1, \ldots, \bar{k}-1$), we refer to the leaky-integrated quantity before it passes through the sigmoid function as the vote potential:
$$R_k.p(n) = L_{R_k.\lambda}\big(R_{k+1}.W^{out}\,[R_{k+1}.x(n);\, R_{k+1}.i(n)]\big) \qquad (10)$$
Let $\sigma'(q) = \sigma(q)(1 - \sigma(q))$ denote the derivative of the logistic sigmoid $\sigma(q)$, let $q_j$ denote the $j$-th coordinate of a vector $q$, and let $\odot$ denote component-wise multiplication. Then we can compute the error terms by
$$E(n) = u(n) - \hat{u}(n) \qquad (11)$$
$$E_1(n) = E(n) \qquad (12)$$
$$E_k(n) = \big(R_{k-1}.f^{T}(n)\,E_{k-1}(n)\big) \odot R_{k-1}.\lambda\,\sigma'\big(R_{k-1}.p(n)\big) \qquad (13)$$
The updated weights are then
$$R_k.W^{out}_i(n+1) = R_k.W^{out}_i(n) + R_k.\gamma\,R_k.v_i(n)\,E_k(n)\,[R_k.x(n);\, R_k.i(n)]^{T} \qquad (14)$$
where $R_k.\gamma$ is the learning rate of the level. The quantities $E_k(n)$ can formally be interpreted as error vectors; each $E_k$ can be considered a back-propagated version of $E_1$.
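To make the gradient step concrete, here is a small sketch (not the thesis code) of one update of the output weights for a two-layer hierarchy with a one-dimensional input signal. The leaky integrator is collapsed into a single scaling by lambda (with the integrator started at 0); all names, sizes and the learning rates are illustrative.

```python
import numpy as np

sigma  = lambda q: 1.0 / (1.0 + np.exp(-q))
dsigma = lambda q: sigma(q) * (1.0 - sigma(q))        # sigma'(q) = sigma(q)(1 - sigma(q))

def sgd_step(W1_out, W2_out, u_n, z1, z2, gamma1, gamma2, lam=1.0):
    """One stochastic-gradient update of the output weights (Eqs. 11-14),
    written out for two layers and a one-dimensional input signal.
    z1 = [x_1(n); i_1(n)] and z2 = [x_2(n); i_2(n)] are the extended states;
    W1_out produces the features, W2_out the vote potentials of the top layer;
    lam plays the role of R_1.lambda."""
    f = W1_out @ z1                                   # features f_i(n) of layer 1, Eq. (4)
    p = lam * (W2_out @ z2)                           # vote potential, Eq. (10) with the integrator at 0
    v = sigma(p)                                      # votes passed down to layer 1
    u_hat = f @ v                                     # Eq. (8)

    E1 = u_n - u_hat                                  # Eqs. (11), (12): E_1(n) = E(n)
    E2 = (f * E1) * (lam * dsigma(p))                 # Eq. (13), component-wise
    W1_out = W1_out + gamma1 * np.outer(v * E1, z1)   # Eq. (14), layer 1: row i gets gamma * v_i * E_1 * z1^T
    W2_out = W2_out + gamma2 * np.outer(E2, z2)       # Eq. (14), top layer (no incoming votes)
    return W1_out, W2_out, u_hat

# Illustrative use with random placeholder states and weights.
rng = np.random.default_rng(0)
W1, W2 = 0.01 * rng.standard_normal((2, 101)), 0.01 * rng.standard_normal((2, 101))
W1, W2, u_hat = sgd_step(W1, W2, u_n=0.7, z1=rng.standard_normal(101),
                         z2=rng.standard_normal(101), gamma1=0.01, gamma2=0.01)
```

For lam = 1 (no leaky integration of the votes) the two update lines coincide with the simplified equations (16)-(18) derived in Section 4.2.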

4 Experiments and results

4.1 Dataset

The dataset used during the experiments is the Triple Generator Dataset [11]. This dataset uses 3 generator signals: a tent map, a sine wave and a constant (see Fig. 3). The 3 signals alternate depending on a random switch occurring with a certain probability. The nature of the dataset allows easy modification of which signals are used and with what probability they switch.

Figure 3: Signal generated by the Triple Generator Dataset.

4.2 Setup

The experiments shown in this paper aim to show the performance of the architecture under rather simple conditions and to set a ground on which to build further. As a first step, the architecture was tested under very clear and simplified conditions. The dataset was simplified to only 2 signal generators, the sine wave and the constant, and the switching probability was set to a fixed value. Also, during each switch the constant signal retains its value; that is, the constant signal always has the same value no matter when or how the switch happens. The sine wave is simply the function 0.5 sin(0.8x), shifted so that it has values between 0 and 1, with a period of approximately 8 steps. Under these conditions it is easy to see that 2 features are enough to construct a perfect approximation of the next step of the 2-generator signal:

- the first layer should produce 2 features: one that is optimized for the case that the next step is a sine wave, and another that is optimized for the case that the next step is the constant signal;
- in that case, the second layer should pass down 2 votes that switch only between 0 and 1, indicating which of the two signals is used in the next step of the input.

To achieve this, a simplified version of the original architecture has been used. This version uses only 2 layers and applies no leaky integration to the votes passed down from the second layer (they still go through a sigmoid function). The output of the ESN in the second layer, and the sigmoid function through which it passes on the top-down flow, should work in such a way that the result after passing through the sigmoid is 2 indicator values: one that has a value of 1 when the next step is from the sine wave and 0 otherwise, while the other has a value of 1 when the next step is from the constant signal and 0 otherwise.
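For concreteness, the simplified two-generator signal could be produced as in the following sketch (not the thesis code; the switching probability, the constant's value and the exact offset of the sine are illustrative assumptions):

```python
import numpy as np

def two_generator_signal(length, p_switch=0.005, const_value=0.5, seed=0):
    """Alternate between a sine generator and a constant generator; at every step
    the active generator is swapped with probability p_switch, and the constant
    always keeps the same value."""
    rng = np.random.default_rng(seed)
    u = np.empty(length)
    use_sine = True
    for n in range(length):
        if rng.random() < p_switch:
            use_sine = not use_sine
        u[n] = 0.5 * np.sin(0.8 * n) + 0.5 if use_sine else const_value
    return u

signal = two_generator_signal(10000)
```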

The sigmoid chosen for the experiments shown in this paper is derived from the logistic sigmoid:
$$\sigma(x) = \frac{1}{2}\left(\frac{\frac{1}{e^{-x}+1} - \frac{1}{2}}{\frac{1}{e^{-5}+1} - \frac{1}{2}} + 1\right) \qquad (15)$$
The exact function has been derived using scaling and shifting. This sigmoid has the nice property that it intersects 0 and 1 at the points -5 and 5, respectively, and it smoothes the output of the ESN on the second layer (see Fig. 4). For details see Appendix A.

Figure 4: Sigmoid used for smoothing the output of the second-layer ESN.

With these simplifications we can directly use the following weight update equations for each layer (derived by back-propagating the error of the final approximation to the corresponding layer; for details see Appendix B). For the first layer:
$$R_1.W^{out}(n+1) = R_1.W^{out}(n) + R_1.\gamma\,E(n)\,R_2.v(n)\,[R_1.x(n);\, R_1.i(n)]^{T} \qquad (16)$$
For the second layer, since we no longer have leaky integration, the vote potential is just the output of the ESN on that layer,
$$R_2.p(n) = R_2.f(n) \qquad (17)$$
and thus the update of the weights in the second layer can be computed using
$$R_2.W^{out}(n+1) = R_2.W^{out}(n) + R_2.\gamma\,E(n)\,\big(R_1.f(n) \odot \sigma'(R_2.p(n))\big)\,[R_2.x(n);\, R_2.i(n)]^{T} \qquad (18)$$
The prediction error was computed using the normalized root mean square error (NRMSE),
$$\mathrm{NRMSE} = \sqrt{\frac{\mathrm{mean}\big((\hat{u}(n) - u(n))^2\big)}{\sigma^2}} \qquad (19)$$
where $u(n)$ is the value of the next step of the input, $\hat{u}(n)$ is the predicted result, and $\sigma^2$ is the variance of the target signal.
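Both the scaled sigmoid of Eq. (15) and the NRMSE of Eq. (19) are straightforward to state in code. The following small sketch is not the thesis implementation and assumes that the normalization in Eq. (19) is the variance of the target signal:

```python
import numpy as np

def scaled_sigmoid(x, A=5.0):
    """Logistic sigmoid rescaled and shifted so that it crosses 0 at -A and 1 at A
    (Eq. (15) for A = 5; the derivation is given in Appendix A)."""
    g = lambda q: 1.0 / (1.0 + np.exp(-q)) - 0.5
    return 0.5 * (g(x) / g(A) + 1.0)

def nrmse(u_hat, u):
    """Normalized root mean square error, Eq. (19), normalizing by the target variance."""
    return np.sqrt(np.mean((u_hat - u) ** 2) / np.var(u))

print(scaled_sigmoid(np.array([-5.0, 0.0, 5.0])))    # crossing points: [0.0, 0.5, 1.0]
```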

14 The prediction error was computed using the normalized root mean square error (NRMSE), computed by: mean((û(n) u(n))2 ) NRMSE = (19) σ 2 where u(n) is the value of the next step of the input, while û(n) is the predicted result. 4.3 Higher level adaptation An experiment to diagnose the adaptation of the higher level has been prepared according to the following scheme: 1. Two ESN are pre-trained to produce an approximation of the features and votes described in the previous section. The first ESN produces 2 features, each of which is optimized in case the next step is a sine wave and the other in case the next step is a constant. The second ESN produces values such that after passing them through a specified sigmoid, result in indicators with values either 0 or 1, depending on which type of signal is the next step. 2. After the pre-training, the two ESN are embedded in the 2 layer Hierarchical Echo State Network described in the previous section. 3. The learning rate of the first layer is set to 0, and the whole system is fed the 2-generator input signal. Since the first layer has 0 learning rate, adaptations should only happen in the second layer. 4. The performance of the adaptation is estimated by iteratively adapting the system for a fixed number of steps, after which adaptation is turned off and then the prediction error on a fixed testing dataset is computed Following the above scheme two ESNs with reservoir sizes of 100 units and spectral radii 1. The leaking rates of the networks were set to 0.7 and 0.5, for the first and second layer respectively. The first ESN was fed an input from the 2-generator dataset with length and trained on the task of producing 2 output signals - one that is an approximation of the next step of the sine wave, while the other is an approximation with the next step of a constant signal. The second ESN was fed the exact same input and trained to 2 output values, one 14

The reason for this is that the sigmoid chosen in Section 4.2 has a value of 1 for x = 5 and 0 for x = -5, which means that after passing the output of the second ESN through the sigmoid we get exactly the indicators described in the scheme. The adaptation of the weights in the first and second ESN was done using ridge regression (also known as Tikhonov regularization) [12], with regularization coefficients of 0.01 and 0.1, respectively. The NRMSE of the two prediction tasks during training and over a separate testing sequence is shown in Table 1. Although the precision of this task could be increased using different methods for linear regression (such as the Moore-Penrose pseudoinverse), we found ridge regression particularly useful in our case due to the better numerical stability of the resulting weight matrix. Using the Moore-Penrose pseudoinverse resulted in weight matrices with very large entries, which yielded very sensitive solutions; this in turn made the learning process of the hierarchical architecture very unstable once the pre-trained ESNs were embedded in its layers. The regularization coefficients could be optimized using cross-validation, but for the purpose of diagnostics the above-mentioned values yielded solutions that generalized well enough. For plots of the produced features and votes see Fig. 5.

Table 1: NRMSE during training and testing of the two ESNs producing approximations of the features and votes described in Section 4.2.
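Pre-training an ESN readout with ridge regression amounts to a single regularized linear solve. A minimal sketch (not the thesis code; the collected state matrix, target matrix and regularization coefficient below are placeholders):

```python
import numpy as np

def train_readout_ridge(states, targets, beta=0.01):
    """Ridge (Tikhonov-regularized) regression for ESN output weights:
    W_out = Y X^T (X X^T + beta I)^{-1},
    where the columns of X are the collected extended states [x(n); i(n)]
    and the columns of Y are the desired outputs."""
    X, Y = states, targets
    return Y @ X.T @ np.linalg.inv(X @ X.T + beta * np.eye(X.shape[0]))

# Illustrative use with random placeholder data.
X = np.random.default_rng(0).standard_normal((101, 5000))   # 100 reservoir units + 1 input
Y = np.random.default_rng(1).standard_normal((2, 5000))     # 2 target output signals
W_out = train_readout_ridge(X, Y, beta=0.01)                # shape (2, 101)
```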

(a) The two feature signals from the ESN for Layer 1. (b) The two vote signals from the ESN for Layer 2.

Figure 5: Features and votes produced during pre-training. The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave.

The 2-layered Hierarchical Echo State Network that uses the two pre-trained ESNs in its 2 layers is trained by reiterating over a fixed input sequence consisting of the sine wave and the constant signal. The learning rates of the first and second layer are set to 0 and 0.01, respectively. After each 1000 training steps, adaptation is switched off and the NRMSE over a 5000-step testing sequence is computed.
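The train/evaluate alternation used here could be organized roughly as in the following sketch; adapt_one_step and predict_sequence are hypothetical stand-ins for the actual adaptation and prediction routines of the hierarchy, not functions defined in the thesis.

```python
import numpy as np

def nrmse(u_hat, u):
    return np.sqrt(np.mean((u_hat - u) ** 2) / np.var(u))   # Eq. (19)

def diagnose_adaptation(adapt_one_step, predict_sequence, train_seq, test_seq,
                        n_blocks=50, block=1000):
    """Alternately adapt the hierarchy for `block` steps on the training signal and,
    with adaptation switched off, measure the one-step prediction NRMSE on a fixed
    test signal; returns the development of the test error."""
    errors, pos = [], 0
    for _ in range(n_blocks):
        for _ in range(block):
            adapt_one_step(train_seq[pos % len(train_seq)])
            pos += 1
        u_hat = predict_sequence(test_seq[:-1])      # predictions for test_seq[1:]
        errors.append(nrmse(u_hat, test_seq[1:]))
    return np.array(errors)
```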

The development of this error is plotted in Fig. 6.

Figure 6: Development of the NRMSE over a fixed test sequence when adaptation of the first layer is switched off.

The development of the NRMSE clearly shows that the upper layer adapts correctly and reduces the overall prediction error of the system. The small jumps towards the end of the simulation can be attributed to using too big a learning rate for the stochastic gradient descent. A snapshot of the votes that the second layer produces can be seen in Fig. 7. The imperfect voting and the appearance of oscillations during the switch to the constant signal are due to the fact that there are infinitely many possible choices of votes that reconstruct the constant signal when combined with 2 constant-signal features. The constant signal is too easy to reconstruct, which leads to the observed instability.

An important factor in the success of the whole learning procedure is the choice of activation function for the ESN in each layer. The current results were achieved using tanh(x) as the activation function. Experiments with a logistic activation function gave much worse results (for the same number of training steps) during the pre-training of the ESNs in each layer, which later led to instability of the learning process of the whole hierarchical architecture.

Figure 7: Snapshot of the votes produced by the second level after training. The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave.

4.4 Lower level adaptation

The adaptation of the first layer is diagnosed using the same procedure, with the only difference that now the second layer has a learning rate of 0 and only the first layer's output weights are adapted. This way the second layer should pass down approximately perfect votes. The development of the NRMSE over the test sequence can be seen in Fig. 8. From the figure we can clearly see that adaptation within the first layer works correctly and the error is reduced as expected. Oscillations in this development are due to the high learning rate set for the layer. A snapshot of the final features produced by the first layer can be seen in Fig. 9.

Figure 8: NRMSE development over a fixed test sequence during adaptation of only the lowest layer.

Figure 9: Snapshot of the features produced by the first layer. The blue and green lines show the two features, whereas the red line is the correct next step. The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave.

4.5 Two layer adaptation without pre-training

As a final diagnostic, the whole architecture has been run without any pre-training of the two layers. The development of the NRMSE in this diagnostic can be seen in Fig. 10. Again, the NRMSE over the test sequence develops as expected and is reduced over time. It is interesting to note that in this case the produced features and votes (see Fig. 11) do not resemble the ones which were forced by pre-training in the previous experiments (see Fig. 5). A possible explanation for this obvious difference is that the system reaches a local minimum and would need many more training steps to escape from it.

Figure 10: NRMSE development over a fixed test sequence during adaptation of both layers without pre-training.

(a) The two feature signals produced by Layer 1. (b) The two vote signals passed down from Layer 2.

Figure 11: Features and votes after adaptation of both layers without pre-training. The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave.

Additionally, the architecture was presented with increasingly more complex data by increasing the switching probability in the dataset. This makes the input signal more chaotic and the adaptation harder. Increasing this factor in the data required adjusting the learning rates of both layers in order to reduce the instabilities in the learning. Two experiments have been run, one with a switching probability of 0.02 and one with a larger switching probability. The architecture has been trained for a fixed number of steps in each case. Snapshots of the resulting approximations and feature/vote combinations can be seen in Fig. 12 and Fig. 13, respectively. We can see that increasing the switching probability inhibits the learning severely and the architecture does not adapt well to the frequent switching of the input signal.

(a) The two feature signals produced by Layer 1. (b) The two vote signals passed down from Layer 2. (c) Resulting approximation; the red curve shows the correct output, whereas the blue line shows the output of the learning architecture.

Figure 12: Features, votes and final result over a 1000-step sequence. The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave.

(a) The two feature signals produced by Layer 1. (b) The two vote signals passed down from Layer 2. (c) Resulting approximation; the red curve shows the correct output, whereas the blue line shows the output of the learning architecture.

Figure 13: Features, votes and final result over a 1000-step sequence. The light blue colored intervals on the horizontal axis indicate that the input signal is the constant, while the black colored intervals indicate that the input signal is the sine wave.

The instability discovered when using the constant signal (see Fig. 7) inspired the idea of trying the architecture on other simple patterns. Instead of the constant signal, a scaled and slowed-down modification of a sine wave, based on sin(0.05x), has been used. This new signal has values between 0 and 0.5, with a period of approximately 125 steps. The switching probability and the learning rates of both layers were set to fixed values, and the architecture has been trained for a fixed number of steps.

The final result can be seen in Fig. 14. The architecture still performs well and is able to adapt to the slow sine wave almost perfectly.

(a) The two feature signals produced by Layer 1. (b) The two vote signals passed down from Layer 2. (c) Resulting approximation; the red curve shows the correct output, whereas the blue line shows the output of the learning architecture.

Figure 14: Features, votes and final result over a 1000-step sequence. The light blue colored intervals on the horizontal axis indicate that the input signal is the slow sine wave, while the black colored intervals indicate that the input signal is the fast sine wave.

5 Discussion

The experiments in this paper analyse the performance of a 2-layer hierarchical architecture with top-down feedback when presented with a simple 2-generator dataset. Isolated adaptation of the upper and of the lower layer has been tried out, as well as a complete simulation of the whole architecture. Initially this was done by modifying an already existing implementation of the Hierarchical Echo State Network that allows users to easily modify its parameters and features. The architecture and the dataset were gradually simplified, but no apparent progress was made even under very simplified conditions. The experiments ran very slowly and required a huge number of training steps to achieve moderate results. After many failed attempts to detect the issues in the code, the whole learning procedure for the specific architecture described in Section 4.2 was implemented from scratch. This was the key to making initial progress on the designed experiments and to detecting some of the problems in the old code. The main differences in the reimplementation (and probably the reason for the improved performance) are the use of the recomputed update equations from Section 4.2 and the use of tanh(x) as the activation function for the ESN in each layer. These changes enabled the system to perform much better and achieve good results with a smaller number of training steps.

The presented results can be used as a base on which to build further. Several issues that can inhibit the performance of the learning architecture have been discovered. Currently we can see that the system initially makes very fast progress and reduces the error rather quickly, after which we observe oscillations in the error, which are probably due to a very high learning rate. It would be interesting to see whether an adaptive learning rate can improve this behavior. It has been shown through several experiments that one of the factors in the data that severely inhibits the learning is the switching probability. Additionally, as seen in Fig. 7, instabilities may arise due to the use of the constant signal as input. A direction which can be explored further is the use of different patterns besides the ones already presented. Possible future work may also include training the architecture on more complex data and analysing modifications to the system that can cope with the mentioned issues, boost the performance and/or speed up the learning.

6 Acknowledgements

I would like to thank my supervisor Prof. Dr. Herbert Jaeger for guiding my first steps in the exciting field of machine learning and for constantly devoting time and effort to giving me feedback and directions about the difficulties I had along the way of writing this thesis.

A Appendix A: Derivation of sigmoid from logistic function

In the experiments shown in this paper, we needed a sigmoid that intersects 0 at the point -A and intersects 1 at the point A, for some fixed positive constant A. The bigger this constant is, the more smoothing is applied to the input of the sigmoid. In order to derive a concrete formula for such a function, we started off from the logistic sigmoid:
$$f(x) = \frac{1}{e^{-x} + 1} \qquad (20)$$
By subtracting 1/2 from this function, we get an odd function with two horizontal asymptotes, at -1/2 and 1/2:
$$g(x) = f(x) - \frac{1}{2} \qquad (21)$$
After that we scale this function so that it intersects -1 and 1 at the points -A and A, respectively. This is easy to achieve since the function is odd; we only have to multiply by the reciprocal of the value of the function at A:
$$t(x) = \frac{1}{g(A)}\,g(x) \qquad (22)$$
The resulting function intersects -1 and 1 at the points -A and A, but we want it to intersect 0 and 1 at -A and A. To achieve this we shift the function up by adding 1, which makes it intersect 0 and 2 at -A and A, and then we scale the whole function by 1/2. Thus we get the desired sigmoid:
$$\sigma(x) = \big(t(x) + 1\big)\,\frac{1}{2} \qquad (23)$$
Replacing t(x) and g(x) by their definitions and choosing A = 5, we get exactly the result from Section 4.2:
$$\sigma(x) = \frac{1}{2}\left(\frac{\frac{1}{e^{-x}+1} - \frac{1}{2}}{\frac{1}{e^{-5}+1} - \frac{1}{2}} + 1\right) \qquad (24)$$
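The derivation can be checked numerically; the following small sketch (not part of the thesis) follows the steps above for A = 5 and verifies the crossing points:

```python
import numpy as np

A = 5.0
f = lambda x: 1.0 / (np.exp(-x) + 1.0)        # logistic sigmoid, Eq. (20)
g = lambda x: f(x) - 0.5                      # odd, asymptotes at -1/2 and 1/2, Eq. (21)
t = lambda x: g(x) / g(A)                     # crosses -1 and 1 at -A and A, Eq. (22)
sigma = lambda x: (t(x) + 1.0) * 0.5          # crosses 0 and 1 at -A and A, Eq. (23)

x = np.array([-A, 0.0, A])
assert np.allclose(sigma(x), [0.0, 0.5, 1.0])
```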

B Appendix B: Computing the error derivatives used in weight update equations

Let us define the following variables:
- $f_1(n)$ and $f_2(n)$ - the outputs (features) of the ESN in the first layer
- $v_1(n)$ and $v_2(n)$ - the outputs (votes) of the ESN in the second layer
- $x_1(n)$ and $x_2(n)$ - the reservoir states of layers 1 and 2, respectively
- $w_1(n)$ and $w_2(n)$ - the output weight vectors for feature 1 and feature 2 of layer 1
- $q_1(n)$ and $q_2(n)$ - the output weight vectors for the first and second output of layer 2
- $u(n)$ and $\hat{u}(n)$ - the correct value of the next step of the input and the output of the whole architecture, respectively
- $E(n) = u(n) - \hat{u}(n)$

Then $\hat{u}(n) = f_1(n)v_1(n) + f_2(n)v_2(n)$ and the squared error is $\varepsilon(n) = E(n)^2$. For the partial derivative of the error with respect to the weight vector of the first feature in the first layer we get:
$$\frac{\partial \varepsilon(n)}{\partial w_1(n)} = 2E(n)\,\frac{\partial E(n)}{\partial w_1(n)} \qquad (25)$$
$$= 2E(n)\,\frac{\partial\big(u(n) - \hat{u}(n)\big)}{\partial w_1(n)} \qquad (26)$$
$$= 2E(n)\,\frac{\partial\big(u(n) - f_1(n)v_1(n) - f_2(n)v_2(n)\big)}{\partial w_1(n)} \qquad (27)$$
$$= 2E(n)\,\frac{\partial\big(u(n) - w_1(n)x_1(n)v_1(n) - f_2(n)v_2(n)\big)}{\partial w_1(n)} \qquad (28)$$
$$= -2E(n)\,x_1(n)\,v_1(n) \qquad (29)$$
Following the same logic, the derivative of the error with respect to the weight vector of the second feature in the first layer is $\frac{\partial \varepsilon(n)}{\partial w_2(n)} = -2E(n)\,x_1(n)\,v_2(n)$. For the partial derivative of the error with respect to the weight vector of the first output of the second layer we get:

$$\frac{\partial \varepsilon(n)}{\partial q_1(n)} = 2E(n)\,\frac{\partial E(n)}{\partial q_1(n)} \qquad (30)$$
$$= 2E(n)\,\frac{\partial\big(u(n) - \hat{u}(n)\big)}{\partial q_1(n)} \qquad (31)$$
$$= 2E(n)\,\frac{\partial\big(u(n) - f_1(n)v_1(n) - f_2(n)v_2(n)\big)}{\partial q_1(n)} \qquad (32)$$
$$= 2E(n)\,\frac{\partial\big(u(n) - f_1(n)\,\sigma(q_1(n)x_2(n)) - f_2(n)v_2(n)\big)}{\partial q_1(n)} \qquad (33)$$
$$= -2E(n)\,f_1(n)\,x_2(n)\,\sigma'\big(q_1(n)x_2(n)\big) \qquad (34)$$
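The analytic derivatives (29) and (34) can be sanity-checked against finite differences; the following sketch (not from the thesis) does this for a scalar toy instance in which f_i = w_i x_1 and v_i = sigma(q_i x_2):

```python
import numpy as np

sigma  = lambda q: 1.0 / (1.0 + np.exp(-q))
dsigma = lambda q: sigma(q) * (1.0 - sigma(q))

def loss(w1, w2, q1, q2, x1, x2, u):
    """Squared one-step error eps(n) of the scalar toy model:
    u_hat = f1*v1 + f2*v2 with f_i = w_i*x1 and v_i = sigma(q_i*x2)."""
    u_hat = (w1 * x1) * sigma(q1 * x2) + (w2 * x1) * sigma(q2 * x2)
    return (u - u_hat) ** 2

# A random operating point.
w1, w2, q1, q2, x1, x2, u = np.random.default_rng(0).standard_normal(7)
E = u - ((w1 * x1) * sigma(q1 * x2) + (w2 * x1) * sigma(q2 * x2))

# Analytic derivatives, Eqs. (29) and (34).
d_w1 = -2.0 * E * x1 * sigma(q1 * x2)
d_q1 = -2.0 * E * (w1 * x1) * x2 * dsigma(q1 * x2)

# Central finite differences.
eps = 1e-6
fd_w1 = (loss(w1 + eps, w2, q1, q2, x1, x2, u) - loss(w1 - eps, w2, q1, q2, x1, x2, u)) / (2 * eps)
fd_q1 = (loss(w1, w2, q1 + eps, q2, x1, x2, u) - loss(w1, w2, q1 - eps, q2, x1, x2, u)) / (2 * eps)
assert np.allclose([d_w1, d_q1], [fd_w1, fd_q1], atol=1e-5)
```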

References

[1] Y. Bengio. Learning deep architectures for AI. Technical Report 1312, Dept. IRO, Université de Montréal.
[2] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines. MIT Press.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
[4] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Proc. NIPS. MIT Press.
[5] K. Friston. Learning and inference in the brain. Neural Networks, 16.
[6] A. Clark. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 2012 (to appear).
[7] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 2006.
[8] H. Jaeger. Discovering multiscale dynamical features with hierarchical echo state networks. Technical report, Jacobs University Bremen.
[9] H. Jaeger. The echo state approach to analysing and training recurrent neural networks - with an erratum note. German National Research Center for Information Technology GMD Technical Report 148.
[10] H. Jaeger. Short term memory in echo state networks. German National Research Center for Information Technology GMD Technical Report 152.
[11] Triple generator dataset. organic/benchmarks/triplegenerator.
[12] D. Verstraeten. Reservoir Computing: computation with dynamical systems. PhD thesis, Ghent University, Ghent.


More information

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION 6 NEURAL NETWORK BASED PATH PLANNING ALGORITHM 61 INTRODUCTION In previous chapters path planning algorithms such as trigonometry based path planning algorithm and direction based path planning algorithm

More information

Learning visual odometry with a convolutional network

Learning visual odometry with a convolutional network Learning visual odometry with a convolutional network Kishore Konda 1, Roland Memisevic 2 1 Goethe University Frankfurt 2 University of Montreal konda.kishorereddy@gmail.com, roland.memisevic@gmail.com

More information

Lecture 2 Notes. Outline. Neural Networks. The Big Idea. Architecture. Instructors: Parth Shah, Riju Pahwa

Lecture 2 Notes. Outline. Neural Networks. The Big Idea. Architecture. Instructors: Parth Shah, Riju Pahwa Instructors: Parth Shah, Riju Pahwa Lecture 2 Notes Outline 1. Neural Networks The Big Idea Architecture SGD and Backpropagation 2. Convolutional Neural Networks Intuition Architecture 3. Recurrent Neural

More information

Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers

Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers A. Salhi, B. Minaoui, M. Fakir, H. Chakib, H. Grimech Faculty of science and Technology Sultan Moulay Slimane

More information

Artificial Intelligence Introduction Handwriting Recognition Kadir Eren Unal ( ), Jakob Heyder ( )

Artificial Intelligence Introduction Handwriting Recognition Kadir Eren Unal ( ), Jakob Heyder ( ) Structure: 1. Introduction 2. Problem 3. Neural network approach a. Architecture b. Phases of CNN c. Results 4. HTM approach a. Architecture b. Setup c. Results 5. Conclusion 1.) Introduction Artificial

More information

PSU Student Research Symposium 2017 Bayesian Optimization for Refining Object Proposals, with an Application to Pedestrian Detection Anthony D.

PSU Student Research Symposium 2017 Bayesian Optimization for Refining Object Proposals, with an Application to Pedestrian Detection Anthony D. PSU Student Research Symposium 2017 Bayesian Optimization for Refining Object Proposals, with an Application to Pedestrian Detection Anthony D. Rhodes 5/10/17 What is Machine Learning? Machine learning

More information

3D Visualization of Sound Fields Perceived by an Acoustic Camera

3D Visualization of Sound Fields Perceived by an Acoustic Camera BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No 7 Special Issue on Information Fusion Sofia 215 Print ISSN: 1311-972; Online ISSN: 1314-481 DOI: 1515/cait-215-88 3D

More information

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization More on Learning Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization Neural Net Learning Motivated by studies of the brain. A network of artificial

More information

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2018

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2018 CPSC 340: Machine Learning and Data Mining Deep Learning Fall 2018 Last Time: Multi-Dimensional Scaling Multi-dimensional scaling (MDS): Non-parametric visualization: directly optimize the z i locations.

More information

Allstate Insurance Claims Severity: A Machine Learning Approach

Allstate Insurance Claims Severity: A Machine Learning Approach Allstate Insurance Claims Severity: A Machine Learning Approach Rajeeva Gaur SUNet ID: rajeevag Jeff Pickelman SUNet ID: pattern Hongyi Wang SUNet ID: hongyiw I. INTRODUCTION The insurance industry has

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Space Filling Curves and Hierarchical Basis. Klaus Speer

Space Filling Curves and Hierarchical Basis. Klaus Speer Space Filling Curves and Hierarchical Basis Klaus Speer Abstract Real world phenomena can be best described using differential equations. After linearisation we have to deal with huge linear systems of

More information

Channel Locality Block: A Variant of Squeeze-and-Excitation

Channel Locality Block: A Variant of Squeeze-and-Excitation Channel Locality Block: A Variant of Squeeze-and-Excitation 1 st Huayu Li Northern Arizona University Flagstaff, United State Northern Arizona University hl459@nau.edu arxiv:1901.01493v1 [cs.lg] 6 Jan

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

An Algorithm For Training Multilayer Perceptron (MLP) For Image Reconstruction Using Neural Network Without Overfitting.

An Algorithm For Training Multilayer Perceptron (MLP) For Image Reconstruction Using Neural Network Without Overfitting. An Algorithm For Training Multilayer Perceptron (MLP) For Image Reconstruction Using Neural Network Without Overfitting. Mohammad Mahmudul Alam Mia, Shovasis Kumar Biswas, Monalisa Chowdhury Urmi, Abubakar

More information

Using Capsule Networks. for Image and Speech Recognition Problems. Yan Xiong

Using Capsule Networks. for Image and Speech Recognition Problems. Yan Xiong Using Capsule Networks for Image and Speech Recognition Problems by Yan Xiong A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science Approved November 2018 by the

More information

Recurrent Neural Network (RNN) Industrial AI Lab.

Recurrent Neural Network (RNN) Industrial AI Lab. Recurrent Neural Network (RNN) Industrial AI Lab. For example (Deterministic) Time Series Data Closed- form Linear difference equation (LDE) and initial condition High order LDEs 2 (Stochastic) Time Series

More information

Face Detection Using Convolutional Neural Networks and Gabor Filters

Face Detection Using Convolutional Neural Networks and Gabor Filters Face Detection Using Convolutional Neural Networks and Gabor Filters Bogdan Kwolek Rzeszów University of Technology W. Pola 2, 35-959 Rzeszów, Poland bkwolek@prz.rzeszow.pl Abstract. This paper proposes

More information

Machine Learning Techniques at the core of AlphaGo success

Machine Learning Techniques at the core of AlphaGo success Machine Learning Techniques at the core of AlphaGo success Stéphane Sénécal Orange Labs stephane.senecal@orange.com Paris Machine Learning Applications Group Meetup, 14/09/2016 1 / 42 Some facts... (1/3)

More information

Time Series prediction with Feed-Forward Neural Networks -A Beginners Guide and Tutorial for Neuroph. Laura E. Carter-Greaves

Time Series prediction with Feed-Forward Neural Networks -A Beginners Guide and Tutorial for Neuroph. Laura E. Carter-Greaves http://neuroph.sourceforge.net 1 Introduction Time Series prediction with Feed-Forward Neural Networks -A Beginners Guide and Tutorial for Neuroph Laura E. Carter-Greaves Neural networks have been applied

More information

Noisy iris recognition: a comparison of classifiers and feature extractors

Noisy iris recognition: a comparison of classifiers and feature extractors Noisy iris recognition: a comparison of classifiers and feature extractors Vinícius M. de Almeida Federal University of Ouro Preto (UFOP) Department of Computer Science (DECOM) viniciusmdea@gmail.com Vinícius

More information

Knowledge Discovery and Data Mining. Neural Nets. A simple NN as a Mathematical Formula. Notes. Lecture 13 - Neural Nets. Tom Kelsey.

Knowledge Discovery and Data Mining. Neural Nets. A simple NN as a Mathematical Formula. Notes. Lecture 13 - Neural Nets. Tom Kelsey. Knowledge Discovery and Data Mining Lecture 13 - Neural Nets Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-13-NN

More information

FUNCTIONS AND MODELS

FUNCTIONS AND MODELS 1 FUNCTIONS AND MODELS FUNCTIONS AND MODELS 1.3 New Functions from Old Functions In this section, we will learn: How to obtain new functions from old functions and how to combine pairs of functions. NEW

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Lecture 13 - Neural Nets Tom Kelsey School of Computer Science University of St Andrews http://tom.home.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-13-NN

More information

Data Mining. Neural Networks

Data Mining. Neural Networks Data Mining Neural Networks Goals for this Unit Basic understanding of Neural Networks and how they work Ability to use Neural Networks to solve real problems Understand when neural networks may be most

More information

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning Partitioning Data IRDS: Evaluation, Debugging, and Diagnostics Charles Sutton University of Edinburgh Training Validation Test Training : Running learning algorithms Validation : Tuning parameters of learning

More information

Automated Crystal Structure Identification from X-ray Diffraction Patterns

Automated Crystal Structure Identification from X-ray Diffraction Patterns Automated Crystal Structure Identification from X-ray Diffraction Patterns Rohit Prasanna (rohitpr) and Luca Bertoluzzi (bertoluz) CS229: Final Report 1 Introduction X-ray diffraction is a commonly used

More information

Face Recognition using Convolutional Neural Network and Simple Logistic Classifier

Face Recognition using Convolutional Neural Network and Simple Logistic Classifier Face Recognition using Convolutional Neural Network and Simple Logistic Classifier Hurieh Khalajzadeh, Mohammad Mansouri and Mohammad Teshnehlab Intelligent Systems Laboratory (ISLAB), Faculty of Electrical

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

Rotation Invariance Neural Network

Rotation Invariance Neural Network Rotation Invariance Neural Network Shiyuan Li Abstract Rotation invariance and translate invariance have great values in image recognition. In this paper, we bring a new architecture in convolutional neural

More information

Slides adapted from Marshall Tappen and Bryan Russell. Algorithms in Nature. Non-negative matrix factorization

Slides adapted from Marshall Tappen and Bryan Russell. Algorithms in Nature. Non-negative matrix factorization Slides adapted from Marshall Tappen and Bryan Russell Algorithms in Nature Non-negative matrix factorization Dimensionality Reduction The curse of dimensionality: Too many features makes it difficult to

More information

The Fly & Anti-Fly Missile

The Fly & Anti-Fly Missile The Fly & Anti-Fly Missile Rick Tilley Florida State University (USA) rt05c@my.fsu.edu Abstract Linear Regression with Gradient Descent are used in many machine learning applications. The algorithms are

More information

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning,

Index. Umberto Michelucci 2018 U. Michelucci, Applied Deep Learning, A Acquisition function, 298, 301 Adam optimizer, 175 178 Anaconda navigator conda command, 3 Create button, 5 download and install, 1 installing packages, 8 Jupyter Notebook, 11 13 left navigation pane,

More information

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

COMP 551 Applied Machine Learning Lecture 14: Neural Networks COMP 551 Applied Machine Learning Lecture 14: Neural Networks Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted for this course

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Functions. Copyright Cengage Learning. All rights reserved.

Functions. Copyright Cengage Learning. All rights reserved. Functions Copyright Cengage Learning. All rights reserved. 2.2 Graphs Of Functions Copyright Cengage Learning. All rights reserved. Objectives Graphing Functions by Plotting Points Graphing Functions with

More information