Artificial Neural Networks Introduction to Computational Neuroscience Ardi Tampuu 7.0.206
Artificial neural network NB! Inspired by biology, not based on biology!
Applications Automatic speech recognition Automatic image classification and tagging Natural language modeling
Learning objectives How do artificial neural networks work? What types of artificial neural networks are used for what tasks? What are the state-of-the-art results achieved with artificial neural networks?
Part 1 How do neural networks work?
Frank Rosenblatt (1957) Added a learning rule to the McCulloch-Pitts neuron.
Perceptron
Prediction: y = 1 if x1·w1 + x2·w2 + b > 0, otherwise y = 0
(Diagram: inputs x1, x2 weighted by w1, w2, summed with the bias b by Σ to give y.)
Perceptron
Prediction: y = 1 if x1·w1 + x2·w2 + b > 0, otherwise y = 0
Learning: wi ← wi + (t − y)·xi, b ← b + (t − y)
If prediction == target, do nothing.
If prediction < target, increase weights of positive inputs, decrease weights of negative inputs.
If prediction > target, vice versa.
Let's try it out!
X  Y  X OR Y
0  0  0
0  1  1
1  0  1
1  1  1
Weights A (for X), B (for Y) and bias C. Initialize A, B, C = 0, so the output is 0. Go over the examples in the table:
1. t = y, so no changes
2. y = 0, t = 1 → A = 0, B = 1, C = 1
3. y = 1, t = 1
4. y = 1, t = 1
5. y = 1, t = 0 → A = 0, B = 1, C = 0
6. y = 1, t = 1
7. y = 0, t = 1 → A = 1, B = 1, C = 0
Learning: wi ← wi + (t − y)·xi, b ← b + (t − y)
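A minimal Python sketch of this procedure (the weight names A, B and bias C follow the slide; cycling repeatedly over the truth table in order is an assumption):

```python
# Perceptron learning of OR, using the update rule from the slide:
# w_i <- w_i + (t - y) * x_i,  b <- b + (t - y)

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # (X, Y) -> X OR Y

A, B, C = 0, 0, 0  # weights for X and Y, and the bias

for epoch in range(10):                            # a few passes over the table
    for (x1, x2), t in data:
        y = 1 if A * x1 + B * x2 + C > 0 else 0    # prediction
        A += (t - y) * x1                          # changes only when y != t
        B += (t - y) * x2
        C += (t - y)

print(A, B, C)  # a separating solution for OR, e.g. A = 1, B = 1, C = 0
for (x1, x2), t in data:
    print((x1, x2), 1 if A * x1 + B * x2 + C > 0 else 0, t)
```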
Perceptron limitations The perceptron learning algorithm converges only for linearly separable problems (because the perceptron has only one layer). Minsky, Papert, Perceptrons (1969)
Multi-layer perceptrons Add non-linear activation functions. Add hidden layer(s). Universal approximation theorem (important!): any continuous function can be approximated by a finite feed-forward neural network with one hidden layer.
Forward propagation
y1 = σ(b1 + x1·w11 + x2·w12)
y2 = σ(b2 + x1·w21 + x2·w22)
z = b3 + y1·v1 + y2·v2 (no nonlinearity on the output)
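A minimal sketch of this forward pass in Python (the numeric weight values are arbitrary, chosen only to illustrate the computation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Example weights (arbitrary values, just to make the sketch runnable)
w11, w12, b1 = 0.5, -0.3, 0.1   # first hidden unit
w21, w22, b2 = 0.2,  0.8, -0.4  # second hidden unit
v1, v2, b3   = 1.0, -1.0, 0.05  # linear output unit

def forward(x1, x2):
    y1 = sigmoid(b1 + x1 * w11 + x2 * w12)  # hidden unit 1
    y2 = sigmoid(b2 + x1 * w21 + x2 * w22)  # hidden unit 2
    z = b3 + y1 * v1 + y2 * v2              # output, no nonlinearity
    return y1, y2, z

print(forward(1.0, 2.0))
```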
Loss function
Function approximation: L = (t − z)² / 2
Binary classification: L = −log(z) if t = 1, L = −log(1 − z) if t = 0
Multi-class classification: L = −Σ_j t_j · log(z_j)
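The three losses written as small Python functions (a sketch; the example targets and predictions are made up):

```python
import numpy as np

def regression_loss(t, z):
    # Function approximation: squared error
    return 0.5 * (t - z) ** 2

def binary_loss(t, z):
    # Binary classification: cross-entropy; t is 0 or 1, z is the predicted probability
    return -np.log(z) if t == 1 else -np.log(1 - z)

def multiclass_loss(t, z):
    # Multi-class: t is a one-hot target vector, z a vector of predicted probabilities
    return -np.sum(t * np.log(z))

print(regression_loss(10.0, 8.0))   # 2.0
print(binary_loss(1, 0.9))          # ~0.105
print(multiclass_loss(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))  # ~0.357
```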
Backpropagation
Output error: dL/dz = e_z = z − t
Output layer gradients: Δv1 = e_z·y1, Δv2 = e_z·y2, Δb3 = e_z
Hidden layer errors: e_y1 = e_z·v1·σ'(b1 + x1·w11 + x2·w12), e_y2 = e_z·v2·σ'(b2 + x1·w21 + x2·w22)
Hidden layer gradients: Δw11 = e_y1·x1, Δw12 = e_y1·x2, Δb1 = e_y1, Δw21 = e_y2·x1, Δw22 = e_y2·x2, Δb2 = e_y2
Derivative of the sigmoid: σ'(x) = σ(x)·(1 − σ(x))
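A sketch of the same forward and backward pass in Python for this small 2-2-1 network with the squared-error loss (the ordering of values inside the params vector is my own choice for the example):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_backward(x1, x2, t, params):
    w11, w12, b1, w21, w22, b2, v1, v2, b3 = params
    # Forward pass (same computation as on the forward-propagation slide)
    y1 = sigmoid(b1 + x1 * w11 + x2 * w12)
    y2 = sigmoid(b2 + x1 * w21 + x2 * w22)
    z = b3 + y1 * v1 + y2 * v2
    # Backward pass for the squared-error loss L = (t - z)^2 / 2
    e_z = z - t                         # dL/dz
    e_y1 = e_z * v1 * y1 * (1 - y1)     # uses sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a))
    e_y2 = e_z * v2 * y2 * (1 - y2)
    grads = np.array([
        e_y1 * x1, e_y1 * x2, e_y1,     # dL/dw11, dL/dw12, dL/db1
        e_y2 * x1, e_y2 * x2, e_y2,     # dL/dw21, dL/dw22, dL/db2
        e_z * y1,  e_z * y2,  e_z,      # dL/dv1,  dL/dv2,  dL/db3
    ])
    return 0.5 * (t - z) ** 2, grads

loss, grads = forward_backward(1.0, 2.0, 0.5, np.full(9, 0.1))
print(loss, grads)
```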
Gradient Descent Gradient descent finds weight values that result in a small loss. Gradient descent is guaranteed to find only a local minimum. But there are plenty of them, and they are often good enough!
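A minimal illustration of the update rule on a one-dimensional loss L(w) = (w − 3)²; in a real network the gradient would come from backpropagation as on the previous slide:

```python
# Gradient descent on a simple loss surface: L(w) = (w - 3)^2
w = -5.0            # arbitrary starting point
lr = 0.1            # learning rate

for step in range(100):
    grad = 2 * (w - 3)    # dL/dw
    w = w - lr * grad     # step in the direction that decreases L

print(w)  # approaches 3, the minimum of L
```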
Walking around in the energy (loss) landscape based only on local gradient information.
Things to remember... The perceptron was the first artificial neuron model, invented in the late 1950s. A perceptron can learn only linearly separable classification problems. Feed-forward networks with non-linear activation functions and hidden layers can overcome the limitations of perceptrons. Multi-layer artificial neural networks are trained using backpropagation and gradient descent.
Part 2 Neural networks taxonomy
Simple feed-forward networks
Architecture: each node is connected to all nodes of the previous layer; information moves in one direction only.
Used for: function approximation, simple classification problems, not too many inputs (~100).
(Diagram: input layer → hidden layer → output layer.)
Convolutional neural networks
Hubel & Wiesel (1959) Performed experiments with an anesthetized cat. Discovered topographical mapping, sensitivity to orientation, and hierarchical processing. Simple cells ~ convolution, complex cells ~ pooling.
Convolution in neural nets Recommending music on Spotify
Convolutional neural networks
Architecture: convolutional layer (local connections + weight sharing), pooling layer (translation invariance).
Used for: images, and any other data with a locality property (e.g. adjacent characters make up a word).
(Diagram: input layer → convolutional layer with shared weights → max-pooling layer.)
Convolution Convolution searches for the same pattern over the entire image and calculates a score for each match.
Convolution Now try this with different filters: convolution still searches for the same pattern over the entire image and calculates a score for each match.
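A sketch of the sliding-window computation (strictly speaking cross-correlation, which is what most neural-network libraries call convolution); the example image and filter below are made up:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and return the matching score
    at every position: the sum of element-wise products."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A small image containing a diagonal pattern, and a filter that looks for it
image = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
kernel = np.array([[1, 0],
                   [0, 1]], dtype=float)
print(convolve2d(image, kernel))  # highest scores where the diagonal pattern occurs
```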
What do these filters do? For example:
0  1  0
1 −4  1
0  1  0
Pooling Pooling achieves translation invariance by taking the maximum of adjacent convolution scores.
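A sketch of 2×2 max pooling over a grid of convolution scores (the example scores are made up):

```python
import numpy as np

def max_pool(scores, size=2):
    """Take the maximum over non-overlapping size x size blocks of convolution scores."""
    h, w = scores.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            block = scores[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = block.max()
    return out

scores = np.array([[1, 3, 0, 0],
                   [2, 4, 0, 1],
                   [0, 0, 5, 2],
                   [1, 0, 2, 6]], dtype=float)
print(max_pool(scores))
# [[4. 1.]
#  [1. 6.]]
# Shifting a pattern by one pixel often leaves the pooled output unchanged:
# this is the translation invariance mentioned above.
```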
Example: handwritten digit recognition Y. LeCun et al., Handwritten digit recognition: Applications of neural net chips and automatic learning, 1989.
Recurrent neural networks
Architecture: hidden layer nodes are connected to each other, which allows the network to retain internal state and memory.
Used for: speech recognition, handwriting recognition, any time series (brain activity, DNA reads).
(Diagram: input layer → recurrent hidden layer → output layer.)
Backpropagation through time
(Diagram: the recurrent network is unrolled over time. Inputs I1...I4 arrive one per time step; the hidden state H0 is updated to H1...H4, each new state depending on the previous one; outputs O1...O4 are compared with targets T1...T4. The same weights W are used at every time step.)
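A sketch of a simple recurrent network's forward pass; note that the same weight matrices (named W_in, W_rec, W_out here purely for illustration) are reused at every time step, which is what backpropagation through time has to account for:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_hidden, n_out = 3, 4, 2
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (n_hidden, n_in))       # input -> hidden
W_rec = rng.normal(0, 0.1, (n_hidden, n_hidden))  # hidden -> hidden (recurrent)
W_out = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output

def run(inputs):
    h = np.zeros(n_hidden)                   # initial hidden state H0
    outputs = []
    for x in inputs:                         # the SAME weights at every time step
        h = sigmoid(W_in @ x + W_rec @ h)    # new state depends on the old one
        outputs.append(W_out @ h)            # output at this time step
    return outputs

sequence = [rng.normal(size=n_in) for _ in range(4)]  # I1..I4
print(run(sequence))                                   # O1..O4
```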
Auto-encoders
Architecture: input and output are the same! The hidden layer functions as a bottleneck; the network is trained to reconstruct the input from the hidden layer activations.
Used for: image search, dimensionality reduction.
(Diagram: input layer → hidden bottleneck layer → output layer = input layer.)
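A sketch of the autoencoder idea with an untrained encoder/decoder pair (the names W_enc and W_dec are illustrative); training would minimize the reconstruction loss with backpropagation and gradient descent:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_input, n_hidden = 8, 2            # the hidden layer is a narrow bottleneck
rng = np.random.default_rng(0)
W_enc = rng.normal(0, 0.1, (n_hidden, n_input))
W_dec = rng.normal(0, 0.1, (n_input, n_hidden))

def encode(x):
    return sigmoid(W_enc @ x)        # low-dimensional code

def decode(h):
    return W_dec @ h                 # reconstruction of the input

x = rng.normal(size=n_input)
x_hat = decode(encode(x))
loss = 0.5 * np.sum((x - x_hat) ** 2)   # training makes the reconstruction match the input
print(loss)
```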
We didn't talk about... Restricted Boltzmann Machines (RBMs), Long Short-Term Memory networks (LSTMs), Echo State Networks / Liquid State Machines, Hopfield networks, Self-organizing maps (SOMs), Radial basis function networks (RBFs). But we covered the most important ones!
Things to remember... Simple feed-forward networks are usually used for function approximation, e.g. predicting energy consumption. Convolutional neural networks are mostly used for images. Recurrent neural networks are used for speech recognition and language modeling. Autoencoders are used for dimensionality reduction.
Part 3 State-of-the-art results
Deep Learning Artificial neural networks and backpropagation have been around since the 1980s. What's all this fuss about deep learning? What has changed: we have much bigger datasets, we have much faster computers (think GPUs), and we have learned a few tricks for training networks with very, very many (50) layers.
GoogLeNet ImageNet 2014 winner. 27 layers, 5M weights. Szegedy et al., Going Deeper with Convolutions (2014).
ImageNet classification: current best 4.9% error (human error 5.1%). Try it yourself: http://www.clarifai.com/#demo Wu et al., Deep Image: Scaling up Image Recognition (2015). Ioffe, Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015).
Automatic image descriptions Karpathy, Fei-Fei, Deep Visual-Semantic Alignments for Generating Image Descriptions (2014)
Reinforcement learning
(Diagram: the network takes the game screen and score as input and outputs actions.)
Games: Pong, Seaquest, Breakout, Beam Rider, Space Invaders, Enduro.
https://github.com/tambetm/simple_dqn
Mnih et al., Human-level control through deep reinforcement learning (2015)
Multiagent reinforcement learning Videos on YouTube about competitive mode and collaborative mode. Tampuu, Matiisen et al., Multiagent Cooperation and Competition with Deep Reinforcement Learning (2015)
Program execution Curriculum learning (learning simple expressions first, then more complex ones) proved to be essential. Zaremba, Sutskever, Learning to Execute (2015).
The future of AI? Neural Turing Machines, Memory Networks: writing to and reading from external memory (infinite memory). For example: Hybrid computing using a neural network with dynamic external memory (Graves, Hassabis et al., 2016).
Thank you!