Deep Learning 861.061 Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD asan.agibetov@meduniwien.ac.at Medical University of Vienna Center for Medical Statistics, Informatics and Intelligent Systems Institute for Artificial Intelligence and Decision Support Spitalgasse 23, 1090 Vienna, BT88.04.808 November 7, 2017
Introduction References (available online for free): "Neural Networks and Deep Learning", Michael A. Nielsen, Determination Press, 2015 (intuition first, math after); "Deep Learning", Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press, 2016 (formal with a fair amount of intuition, more general than Nielsen). These slides are based on DL courses: Course notes "CNN for Visual Recognition" (Stanford, Spring 2017); Course notes "An introduction to Deep Learning", Marc'Aurelio Ranzato (Facebook AI Research), DeepLearn Summer School - Bilbao, 17-21 July 2017
Why Deep Learning Peter Norvig's 1 recollection on Geoff Hinton's 2 talk on Boltzmann Machine 3 work (back in 1980): 1. Cognitive plausibility in terms of a model of the brain 2. A model that learns from experience rather than being programmed by hand 3. Continuous representations rather than Boolean, as in traditional symbolic expert systems 1 Research Director at Google, co-author of classical texts on AI 2 Professor at the University of Toronto, one of the pioneers of Deep Learning 3 The Boltzmann Machine (and Probabilistic Graphical Models) is one of the theoretical foundations of generative DL models
Neural networks and Deep Learning Neural networks - a biologically-inspired programming paradigm that enables computers to learn from observational data; a universal function approximation machine 4. Deep learning - a powerful set of techniques for learning in neural networks; harnesses GPU resources to parallelize and speed up matrix-vector computations; gives rise to a modularized approach to learning. 4 Hornik, "Approximation capabilities of Multilayer Feedforward Networks", Neural Networks, 1991
Deep Learning - what's in the name? DL, roughly speaking, is a NN with many layers and many neurons in each layer; not true in all cases though (e.g., embeddings are often shallow) Figure 1: Simple and Deep NNs (image credit 5 ) 5 https://hackernoon.com/log-analytics-with-deep-learning-and-machine-learning-20a1891ff70e
Hierarchical feature learning DL learns features automatically, and hierarchically Figure 2: (Convolutional) Neural Network to detect a face 6 6 credit "Michael A. Nielsen"
Hierarchical feature learning (cont.) Learnt features can be combined Figure 3: Further decomposition of learnt features 7 7 credit "Michael A. Nielsen"
Neural networks Figure 4: 2-hidden-layer network / 4-layer network (+ input, output) 8 Universal function approximation that maps input to output, f : R^n → R^m. Class of functions considered to map input to output: a composition of simpler (including non-linear 9 ) functions, f(x) = o(h_2(h_1(x))), where h_1 is non-linear, e.g., max(0, W x + b), aka ReLU. 8 image credit M-A. Ranzato (Facebook AI Research) 9 a composition of only linear functions would be equivalent to a single linear function
Forward propagation Figure 5: Forward pass on the network x ∈ R^D, W_1 ∈ R^{N_1 × D}, b_1 ∈ R^{N_1}, h_1 ∈ R^{N_1}; h_1 = max(0, W_1 x + b_1); W_1 - 1st-layer weight matrix (weights); b_1 - 1st-layer biases
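A minimal numpy sketch of such a forward pass (layer sizes and variable names are illustrative, not taken from the slides): two ReLU layers followed by a linear output, i.e. f(x) = o(h_2(h_1(x))).

import numpy as np

D, N1, N2, M = 4, 8, 6, 3          # input dim, hidden sizes, output dim (illustrative)

x  = np.random.randn(D)            # input vector, x in R^D
W1 = np.random.randn(N1, D)        # 1st-layer weights, W_1 in R^{N_1 x D}
b1 = np.random.randn(N1)           # 1st-layer biases,  b_1 in R^{N_1}
W2 = np.random.randn(N2, N1)
b2 = np.random.randn(N2)
Wo = np.random.randn(M, N2)
bo = np.random.randn(M)

h1 = np.maximum(0, W1 @ x + b1)    # h_1 = max(0, W_1 x + b_1), i.e. ReLU
h2 = np.maximum(0, W2 @ h1 + b2)
f  = Wo @ h2 + bo                  # f(x) = o(h_2(h_1(x)))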
Why non-linear layers? ReLU layers provide a piece-wise linear tiling: # planes grows exponentially w. # hidden units. Multiple layers yield exponential savings in # parameters (parameter sharing). Figure 6: with ReLU the mapping is locally linear 10 10 Montufar et al. "On the number of linear regions of DNNs", arXiv, 2014
How good is the network: task-dependent loss function V_i. Regression: MSE (mean squared error), V_1(y, f) = (y − f(x))^2. Classification: variants of the cross-entropy loss; class (category) index k ∈ 1...C; predicted classes f(x) = [0 0 ... 1 ... 0] with f(x)_k = 1; true classes y = [1 0 ... 0 ... 0] with y_k = 0; probability that x belongs to class c_k: p(c_k = 1 | x) = e^{f(x)_k} / Σ_{j=1}^{C} e^{f(x)_j}; loss function with log-likelihoods (easier to optimize): V_2(y, f) = −Σ_k y_k log p(c_k | x)
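A small numpy sketch of the softmax probabilities p(c_k | x) and the cross-entropy loss V_2 described above; the function names and the example scores are illustrative assumptions.

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())        # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y_onehot, scores):
    p = softmax(scores)                      # p(c_k | x)
    return -np.sum(y_onehot * np.log(p))     # V_2(y, f) = -sum_k y_k log p(c_k | x)

scores = np.array([2.0, 0.5, -1.0])          # f(x) for C = 3 classes
y      = np.array([1.0, 0.0, 0.0])           # true class is c_1
print(cross_entropy(y, scores))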
Optimization: finding the best f. Typical setup for optimization: f can be parameterized with Θ (f = Θ x in the linear case); minimizing (learning) the loss function V over all training examples 1...n, plus regularizations: λ_2(f) - controls the complexity of the function (usually a norm of f); λ_1(f, Θ) - sparsity of the solution, where Θ are the parameters of f. f* = argmin_f Σ_{i=1}^{n} V(y_i, f(x_i)) + λ_2(f) + λ_1(f, Θ). To find f* you need to minimize a complicated function; backpropagation gives the gradients of that complicated function.
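For illustration, one (stochastic) gradient-descent step on the linear case f = Θ x with squared loss, sketched in numpy; the variable names, data and learning rate are assumptions, not part of the slides.

import numpy as np

def sgd_step(theta, grad, lr=0.1):
    # one gradient-descent update: theta <- theta - lr * grad
    return theta - lr * grad

theta = np.zeros(3)                           # parameters of the linear f
x, y  = np.array([1.0, 2.0, 0.5]), 1.0        # a single training example
grad  = -2 * (y - theta @ x) * x              # gradient of V = (y - theta x)^2 w.r.t. theta
theta = sgd_step(theta, grad)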
Recap Neural nets - a chain (composition) of non-linear operations, implementing highly non-linear functions. The forward pass computes the error between the currently learnt mapping and the actual output. The backward pass computes gradients w.r.t. the inputs and parameters at each layer. Optimization (minimization of the loss error) is done by stochastic gradient descent (or variants of it).
Computation: speed up and parallelize with GPU In a nutshell, DL is all about matrix multiplication. Figure 7: Matrix-matrix multiplication 11 Entries of the A × C result matrix can be computed in parallel on the GPU; rows of the A × B matrix and columns of the B × C matrix are loaded into shared memory. 11 image credit: Course notes "CNN for Visual Recognition" (Stanford, Spring 2017)
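A minimal PyTorch sketch of moving such a matrix-matrix product to the GPU; the shapes are illustrative, and the snippet assumes PyTorch is installed and simply falls back to the CPU if no CUDA device is available.

import torch

A, B, C = 512, 256, 128
M1 = torch.randn(A, B)
M2 = torch.randn(B, C)
if torch.cuda.is_available():
    M1, M2 = M1.cuda(), M2.cuda()    # move operands to GPU memory
out = M1 @ M2                        # A x C result, entries computed in parallel on the GPU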
Function composition and computational graph
f(x, y, z) = Σ_{i=1}^{n} (x_i y_i + z_i)
in vector notation: = Σ_{i}^{n} (x ⊙ y + z)_i, where ⊙ is the Hadamard product (elementwise multiplication)
= Σ_{i}^{n} (a + z)_i with a = x ⊙ y
= Σ_{i}^{n} b_i with b = a + z
= c with c = Σ_i b_i
Function composition and computational graph (contd.) f(x, y, z) = Σ_{i=1}^{n} (x_i y_i + z_i) Figure 8: computational graph with numpy 12 12 image credit: Course notes "CNN for Visual Recognition" (Stanford, Spring 2017)
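A possible numpy version of this forward computation, decomposed into the intermediate nodes a, b, c of the graph (the node names follow the slide; the concrete inputs are illustrative, not the exact code in the figure).

import numpy as np

n = 5
x, y, z = np.random.randn(n), np.random.randn(n), np.random.randn(n)

a = x * y          # Hadamard (elementwise) product
b = a + z
c = b.sum()        # f(x, y, z) = c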
Gradients of function composition
∇_x f = ∂/∂x Σ_{i=1}^{n} (x_i y_i + z_i), with ∂f/∂x_i = (∂f/∂c)(∂c/∂b_i)(∂b_i/∂a_i)(∂a_i/∂x_i)
∇_x f = y, ∇_y f = x, ∇_z f = 1
Gradients of function composition (contd.) Cons of using numpy only: Manual computation of gradients for all f No GPU support Figure 9: computational graph and gradients with numpy 13 13 image credit: Course notes "CNN for Visual Recognition" (Stanford, Spring 2017)
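What the manual gradient computation might look like in plain numpy, applying the chain rule by hand for the same graph; a sketch of the idea in the figure, not its exact code.

import numpy as np

n = 5
x, y, z = np.random.randn(n), np.random.randn(n), np.random.randn(n)

# forward
a = x * y
b = a + z
c = b.sum()

# backward (chain rule, written out by hand -- the "cons" of numpy only)
grad_b = np.ones(n)        # dc/db_i = 1
grad_a = grad_b            # db_i/da_i = 1
grad_z = grad_b            # df/dz = 1
grad_x = grad_a * y        # df/dx = y
grad_y = grad_a * x        # df/dy = x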
Deep Learning frameworks Goals: 1. Easily build big computation graphs 2. Easily compute gradients in computational graphs (automatic gradient computation) 3. Run it all efficiently on GPU (wrap low-level NVIDIA and linear algebra libraries, e.g., cuDNN, cuBLAS). Academia/Industry open-source frameworks: Caffe (UC Berkeley), Caffe2 (Facebook), Torch (NYU/Facebook), PyTorch (Facebook), Theano (U Montreal), TensorFlow (Google). Industry (not necessarily open-source) frameworks: Paddle (Baidu), CNTK (Microsoft), MXNet (Amazon), and others... High-level frameworks: Keras (Theano, TensorFlow or CNTK as backend), good for beginners
DL frameworks comparison Figure 10: Computational graph definition in numpy, pytorch and tensorflow 15 15 image credit: Course notes "CNN for Visual Recognition" (Stanford, Spring 2017)
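For comparison, the same graph with automatic gradient computation in PyTorch; a minimal sketch with illustrative tensor sizes, not the code shown in the figure.

import torch

n = 5
x = torch.randn(n, requires_grad=True)
y = torch.randn(n, requires_grad=True)
z = torch.randn(n, requires_grad=True)

c = (x * y + z).sum()      # building the computational graph
c.backward()               # gradients computed automatically by autograd

print(x.grad)              # equals y
print(y.grad)              # equals x
print(z.grad)              # all ones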
DL frameworks: Demo
Deep Learning for Vision Idea: unwrap images (2d matrices) into 1d vectors, R^{200×200} → R^{40000}. Figure 11: fully connected layer for visual recognition (image credit Ranzato FAIR) Problem with feeding them into neural networks (fully connected layers): spatial correlation is local; waste of resources; not robust to transformations (scale, rotation, translation)
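A toy numpy sketch of the "unwrap and feed into a fully connected layer" idea, which also makes the waste of resources visible: with only 100 hidden units we already need 4 million weights (all sizes are illustrative assumptions).

import numpy as np

image = np.random.rand(200, 200)
x = image.reshape(-1)                 # R^{200x200} -> R^{40000}

W = np.random.randn(100, 40000)       # 100 hidden units => 4,000,000 weights
h = np.maximum(0, W @ x)              # one fully connected ReLU layer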
Convolutional Layer shared weights across the whole image convolution takes advantage of stationarity (similar statistics at different locations) local spatial correlation Figure 12: convolutional layer for visual recognition Figure 13: convolutions with learnt kernels
Multiple convolutional filters h_j^n = max(0, Σ_{k=1}^{K} h_k^{n−1} * w_{kj}^n) Figure 14: multiple convolutional filters for visual recognition (image credit Ranzato FAIR) Figure 15: one convolution layer
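A minimal PyTorch sketch of one convolutional layer with K input feature maps followed by a ReLU, matching the formula above; channel counts and spatial sizes are illustrative.

import torch
import torch.nn as nn

K, J = 3, 16                          # K input feature maps, J learnt filters
conv = nn.Conv2d(in_channels=K, out_channels=J, kernel_size=5)

h_prev = torch.randn(1, K, 32, 32)    # batch of 1, K feature maps of 32 x 32
h_next = torch.relu(conv(h_prev))     # J feature maps of 28 x 28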
Pooling layer Pooling layer goal: spatial robustness for feature extraction Assume our filter is eye dectector Pooling layer makes eye detector robust to exact location of eye Figure 16: pooling layer (image credit Ranzato FAIR)
Pooling layer (contd.) h_j^n(x, y) = max_{x' ∈ N(x), y' ∈ N(y)} h_j^{n−1}(x', y') Figure 17: pooling layer (image credit Ranzato FAIR) by pooling (e.g., taking max) filter responses at different locations we gain robustness to the exact spatial location of features
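The corresponding max-pooling operation sketched in PyTorch, here over a 2 × 2 neighbourhood N(x), N(y); the sizes are illustrative.

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)    # max over 2 x 2 neighbourhoods

h_prev = torch.randn(1, 16, 28, 28)   # 16 feature maps of 28 x 28
h_next = pool(h_prev)                 # 16 feature maps of 14 x 14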
ConvNets architecture Figure 18: LeCun et al., "Gradient-based learning applied to document recognition", Proceedings of the IEEE, 1998
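A small LeNet-style ConvNet sketched in PyTorch (conv → pool → conv → pool → fully connected); the layer sizes are illustrative and not the exact architecture of the paper.

import torch
import torch.nn as nn

class SmallConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(16 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.features(x)                 # hierarchical feature extraction
        return self.classifier(x.flatten(1)) # flatten and classify

logits = SmallConvNet()(torch.randn(1, 1, 28, 28))   # e.g. a 28 x 28 grayscale digit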
DL for vision Demo