Deep Learning and Its Applications

Size: px

Start display at page:

Download "Deep Learning and Its Applications"

Joanna Hancock
5 years ago
Views:

1 Convolutional Neural Network and Its Application in Image Recognition Oct 28, 2016

2 Outline 1 A Motivating Example 2 The Convolutional Neural Network (CNN) Model 3 Training the CNN Model 4 Issues and Recent Advances 5 Code Demonstration

3 A Motivating Example Image Classification: A classic application of CNN

4 The CNN Model Visualization of a typical CNN Model Layers in the model 1 Convolutional Layer: 1 feature map 8 feature maps 2 Pooling Layer: shrinks the resolution: Convolutional Layer: 8 feature maps 16 feature maps 4 Pooling Layer: shrinks the resolution: Fully Connected Layer: each output fully connects all pixels 6 Softmax Layer: applies the softmax function

Key Elements in Convolutional Layer This convolutional layer takes 1 feature map of size 24 24 to 8 feature maps of size 24 24.

5 Key Elements in Convolutional Layer This convolutional layer takes 1 feature map of size to 8 feature maps of size In: p in feature maps (images) of resolution m in n in : (denote each pixel by) I i [x, y], i [p in ], x [m in ], y [n in ], where [a] = {0, 1, 2,, a 1}. Out: p out feature maps of resolution m out n out : O j [x, y], j [p out ], x [m out ], y [n out ]. Filter size u v p in. Activation function h( ). E.g., relu function: h(x) = max{0, x}.

6 2-D Discrete Convolution For simplicity, assume p in = p out = 1, and h(x) = x for now. Let F [x, y] be the filter of size u v 1, that is, x [u], y [v]. Recall I [x, y] is the input image of size m in n in, while O[x, y] is the output image of size m out n out. Then, O[x, y] = a [u] b [v] I [x a, y b]f [a, b] := (I F )[x, y] for all x [m out ], y [n out ]. The operator denotes 2-D discrete convolution. For simplicity, denote this by O = I F Let I [x a, y b] = 0 when x a [u] or y b [v]. Visualization of 2-D convolution Convolution detects features in an image For simplicity, it is commonly used that (m out, n out ) = (m in, n in ). It is named same padding.

7 Convolutional Layer Let us remove the assumptions p in = p out = 1, and h(x) = x one-by-one to obtain the final form of a convolutional layer. When p in 1, Additionally, when p out 1, O = O j = i [p in ] i [p in ] I i F i. I i F (j ) i for all j. In the most general case, where h(x) x and there is an additional bias term, O j = h( I i F (j ) i + b j ) for all j. i [p in ] Note that in a convolutional layer, there are p out filters, each of size u v p in.

Key Elements in Pooling Layer This pooling layer takes 8 feature maps of size 24 24 to 8 feature maps of size 12 12.

8 Key Elements in Pooling Layer This pooling layer takes 8 feature maps of size to 8 feature maps of size In: p feature maps of resolution m in n in : I i [x, y], i [p], x [m in ], y [n in ]. Out: p feature maps of resolution m out n out : O i [x, y], i [p], x [m out ], y [n out ]. Pooling size: u v. Pooling method: max pooling, average pooling, etc.

9 Pooling For all i [p], obtain O i from I i. Intuitively, Max-pooling introduces non-linearity in the kernel.

10 Fully-connected Layer In: p in variables x R pin (every pixel is considered a variable). Out: p out variables y R pout. Activation function h( ). The model: y = h(w x + b), where W R pout pin, b R pout, and h( ) is a component-wise function.

11 Softmax Function For any x R K, the softmax function for every k [K ] is defined as softmax(k, x) = e x k i [K ] exi. Let p k = P(Y = k) be the probability that a given image is in the k th class, then under the model assumption P(Y = k) = 1 Z ewt k x+b k for all k [K ] and some explanatory variables x, we have p k = softmax(k, W x + b), where W = [w 1, w 2,, w K ] T and b = (b 1, b 2,, b K ) T. Combine fully connected layer with h(x) = x, p out = K and the softmax function to obtain class probabilities. Cross-entropy loss (essentially negative log-likelihood) can be used to train the model.

12 Recap of the CNN Model Convolutional Layer: generates features by varying filters. Pooling Layer: performs subsampling that shrinks the dimensionality. Fully-connected layer + Softmax function: obtain class probabilities. Note that Fully-connected layers can also be used alone to introduce non-linearity (h(x) x) or shrink dimensionality(p out < p in ). A typical CNN model: Conv-Pool-Conv-Pool-FC-FC-Softmax. The more layers there are, the deeper the model is. Fully connected layers are usually placed at the end of the model, because it fully connects all pixels and breaks their spatial correlation.

13 Optimization: Stochastic Gradient Descent When 1) data size is too large, or 2) data is obtained in a sequential way, that some future data is not accessible currently, gradient descent may not be practical to train a CNN model. Stochastic Gradient Descent (SGD) can be used. Split the whole dataset (X, y) into B mini-batches (X (i), y (i) ), i = 1, 2,, B. At iteration t, find a mini-batch (X t, y t ) (usually iteratively) and then perform update θ (t) θ (t 1) η θ f (θ (t 1) X t, y t ), where θ is the set of all parameters in the model, and f ( ) is the loss function. Challenge: vanishing gradient problem

14 Vanishing Gradient Problem Let σ(x) = 1/(1 + exp{ x}) be the commonly-used sigmoid function. Assume a neural network x k+1 = σ(w k x k ), where k = 1, 2,, K, where x 1 is the input. Let f (x K +1 ) be the loss function. f = f x K +1 x 3 x 2 = f (x K +1 ) w 1 x K +1 x K x 2 w 1 K σ (w k x k )w k σ (w 1 x 1 )x 1 k=2 Since σ f (x) 1/4, w 1 (1/4) K f (x K +1 ) K k=2 w kx 1. Intuitively, when w k s are not too large, the gradient vanishes as K increases. Practically, the first few layers are hard to train when the network is deep.

15 To Deal With This Issue There are several ways to deal with this issue in deep learning. 1 Use other activation functions, such as ReLU: h(x) = max{0, x}. Since max h (x) = 1, it helps to pass down the gradient while keeps non-linearity. 2 Use other ways to connect layers, such as short-cut connections. 3 Modify the way to update parameters, for example, use momentum in the updates.

16 Deep Learning Why deep models? Theoretically, the intuition is that, the deeper the model, the more complex the feature space is. Computationally, deep learning models, such as CNN, leads to easy massive parallelization. But hard to interpret. Some frontier applications of deep learning

17 Packages: Theano/TensorFlow These packages are quite similar. In fact, some developers of Theano went to develop TensorFlow for Google. Low-level, highly customizable. Automatic computation of derivatives. Same code can be run on either CPU or GPU. Fast (cuda) C implementations. High-level libraries, such as pylearn2, are available. Actively developed and maintained. (In the past year in Theano, they added support for multiple GPUs, implemented average pooling, and made faster implementations available as far as I know) Hard to debug. (My code is available online)

18 Graph c = a 2 + 2b 2a

Deep Learning with Tensorflow AlexNet

Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification