Machine learning for vision. It's the features, stupid! Winter 2016. Roland Memisevic. Lecture 2, January 26, 2016


[Figure: feature space with axes f1 and f2; classes "cathedral" and "high-rise"; a query point marked "?".]

A common computer vision pipeline before 2012

1. Find interest points.
2. Crop patches around them.
3. Represent each patch with a sparse local descriptor.
4. Combine the descriptors into a representation of the image.

[Figure: descriptors extracted from the patches of an image, and the resulting point in the f1/f2 feature space ("cathedral" vs. "high-rise").]

This creates a representation that even a linear classifier can deal with.

Bottom line: computer vision is all about building non-linear pipelines (a.k.a. "the representation matters").

What do good low-level features look like?

Local features that are often found to work well are based on oriented structure (such as Gabor features). These were discovered again and again (also in other areas) and are closely related to the Short-Time Fourier Transform.

The XOR problem
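As a minimal sketch of why the representation matters, the XOR problem named above becomes linearly separable once a single non-linear feature is added. The product feature x1 * x2 and the hand-picked weights are assumptions chosen for illustration; they are not taken from the lecture.

```python
import numpy as np

# The four XOR inputs and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0])

def features(X):
    """Append the (assumed) non-linear feature x1 * x2 to the raw inputs."""
    return np.column_stack([X, X[:, 0] * X[:, 1]])

# Hand-picked linear classifier on the new features (illustrative values):
# score = x1 + x2 - 2 * x1 * x2 - 0.5, predict class 1 when the score is positive.
w = np.array([1.0, 1.0, -2.0])
b = -0.5

def predict(X):
    return (features(X) @ w + b > 0).astype(int)

print(predict(X))  # compare with the labels t
```

On the raw inputs alone no choice of linear weights and bias classifies all four points correctly, which is what makes XOR the standard example for a non-linear pipeline.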

Neural networks are trainable pipelines

Most common networks interleave matrix multiplies with element-wise non-linearities:

$y(x) = W_{\text{out}}\, h(W_{23}\, h(W_{12}\, h(W_{01}\, x)))$

Common non-linearities:

sigmoid: $h(a) = \frac{1}{1 + \exp(-a)}$
ReLU: $h(a) = a \cdot [a > 0]$
tanh: $h(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$

Usually there are constant "bias" terms as well.
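The layered formula above can be sketched in a few lines of NumPy. The layer sizes, the random initialization and the helper name `forward` are hypothetical and only serve the illustration; the three non-linearities follow the definitions given in the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return a * (a > 0)

def tanh(a):
    return (np.exp(a) - np.exp(-a)) / (np.exp(a) + np.exp(-a))

def forward(x, weights, biases, h=relu):
    """Apply h(W x + b) for every layer except the last, which stays linear."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = h(W @ x + b)
    return weights[-1] @ x + biases[-1]

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 16, 3]  # hypothetical sizes: input, three hidden layers, output

# Four weight matrices, playing the roles of W_01, W_12, W_23 and W_out.
weights = [0.1 * rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = rng.standard_normal(sizes[0])
y = forward(x, weights, biases)
print(y.shape)
```

Passing `h=sigmoid` or `h=tanh` to `forward` swaps the non-linearity without touching the rest of the pipeline.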

For classification tasks, turn class outputs into probabilities using the softargmax function:

$p(c_k \mid x) = \frac{\exp(y_k(x))}{\sum_j \exp(y_j(x))}$

For training, use a (large) training set $\{(x_n, t_n)\}_{n=1 \dots N}$ and minimize a suitable cost function, using stochastic gradient descent (SGD).

The most common choices of cost function

Regression (predict real values):

$\text{cost} = \frac{1}{2} \sum_{n=1}^{N} \lVert y(x_n) - t_n \rVert^2$

Classification (predict discrete labels):

$\text{cost} = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \log p(c_k \mid x_n)$

where $t_{nk} = 1$ iff training case $n$ belongs to class $k$.

Stochastic gradient descent (SGD)

$\theta^{(\tau+1)} = \theta^{(\tau)} - \eta\, \frac{\partial\, \text{cost}(x_n, t_n)}{\partial \theta}$

Here $\theta^{(\tau+1)}$ is the new parameter value, $\theta^{(\tau)}$ the old parameter value, and $\eta$ the learning rate.

[Figure: SGD steps from $\theta^{(0)}$ to $\theta^{(\tau)}$.]
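The softargmax, the classification cost and the SGD update above can be put together in a small sketch. The toy data (three Gaussian blobs), the linear model without bias terms, and the values of the learning rate and epoch count are all assumptions made for the example, not settings from the lecture.

```python
import numpy as np

def softargmax(y):
    """Turn a vector of class outputs into probabilities."""
    e = np.exp(y - y.max())  # subtracting the max avoids overflow
    return e / e.sum()

def classification_cost(W, x, t):
    """Cross-entropy cost for one training case; t is a one-hot target vector."""
    return -np.sum(t * np.log(softargmax(W @ x)))

rng = np.random.default_rng(0)
K, D, N = 3, 2, 300  # hypothetical numbers of classes, input dimensions, cases
centers = np.array([[2.0, 0.0], [-2.0, 2.0], [-2.0, -2.0]])
labels = rng.integers(0, K, size=N)
X = centers[labels] + 0.5 * rng.standard_normal((N, D))
T = np.eye(K)[labels]  # t_nk = 1 iff case n belongs to class k

W = np.zeros((K, D))  # the parameters theta of a linear model y(x) = W x
eta = 0.1             # learning rate (illustrative value)
for epoch in range(20):
    for n in rng.permutation(N):
        p = softargmax(W @ X[n])
        grad = np.outer(p - T[n], X[n])  # d cost / d W for this linear model
        W -= eta * grad                  # theta_new = theta_old - eta * gradient

accuracy = np.mean(np.argmax(X @ W.T, axis=1) == labels)
print(accuracy)
```

Each update uses a single training case, which is what makes the descent "stochastic"; mini-batches would average `grad` over several cases before the update.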

Error back-propagation (backprop)

Use the chain rule. For regression and classification we get:

$\frac{\partial\, \text{cost}}{\partial y(x_n)} = y(x_n) - t_n$

(for classification, with the class probabilities in place of $y$ and the derivative taken with respect to the softargmax inputs).

Next: if $y$ has any parameters, $W_{\text{out}}$, collect them using:

$\frac{\partial\, \text{cost}}{\partial W_{\text{out}}} = (y(x_n) - t_n)\, \frac{\partial y(x_n)}{\partial W_{\text{out}}}$

Next: descend to the next layer by computing

$\frac{\partial\, \text{cost}}{\partial h_3} = \frac{\partial\, \text{cost}}{\partial y(x_n)}\, \frac{\partial y(x_n)}{\partial h_3(x_n)}$

...and so on...

Backprop general form

[Figure: a stack of modules h(x; w_h), g(h; w_g), f(g; w_f) topped by cost(y, t). Each module provides fprop (its output), bprop (the derivative with respect to its input, e.g. ∂f/∂g) and grad (the derivative with respect to its parameters, e.g. ∂f/∂w_f).]

Backprop can be thought of as an engineering principle that prescribes how to design an end-to-end trainable system from differentiable components:

Use components which provide the methods fprop, bprop and grad.
Then backprop can be automated.
Well-suited for support by software frameworks.

Implementing backprop

There are several software packages that implement backprop. Software packages like theano and tensorflow take the idea to the extreme by using symbolic differentiation, so you don't even need to implement bprop and grad yourself.

    import theano
    import theano.tensor as T
    from numpy.random import randn  # needed for the random test inputs below

    # Symbolic matrices.
    x = T.dmatrix("x")
    w = T.dmatrix("w")

    # Symbolic expression: sum of all entries of the product w x.
    somefunction = T.dot(w, x).sum()

    # Compile the expression into a callable Python function and evaluate it.
    python_function = theano.function([x, w], somefunction)
    python_function(randn(100, 10), randn(10, 100))

    # Symbolic derivative of the expression with respect to w.
    derivative = T.grad(somefunction, w)
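To make the fprop/bprop/grad idea concrete, here is a minimal NumPy sketch of components exposing those three methods, chained by a generic backward loop and checked against a finite-difference estimate. The class names `Linear` and `Tanh`, the helper `cost_and_grads`, the squared-error cost and the layer sizes are illustrative assumptions, not the lecture's code.

```python
import numpy as np

class Linear:
    """Component y = W x with a trainable parameter W."""
    def __init__(self, W):
        self.W = W
    def fprop(self, x):
        self.x = x                   # remember the input for the backward pass
        return self.W @ x
    def bprop(self, dy):
        return self.W.T @ dy         # derivative of the cost w.r.t. the input
    def grad(self, dy):
        return np.outer(dy, self.x)  # derivative of the cost w.r.t. W

class Tanh:
    """Element-wise non-linearity without parameters."""
    def fprop(self, a):
        self.h = np.tanh(a)
        return self.h
    def bprop(self, dh):
        return dh * (1.0 - self.h ** 2)

def cost_and_grads(layers, x, t):
    """Squared-error cost and the gradient for every Linear layer."""
    y = x
    for layer in layers:
        y = layer.fprop(y)
    cost = 0.5 * np.sum((y - t) ** 2)
    delta = y - t                    # d cost / d y
    grads = {}
    for i in reversed(range(len(layers))):
        if isinstance(layers[i], Linear):
            grads[i] = layers[i].grad(delta)
        delta = layers[i].bprop(delta)
    return cost, grads

rng = np.random.default_rng(0)
layers = [Linear(rng.standard_normal((4, 3))), Tanh(),
          Linear(rng.standard_normal((2, 4)))]
x, t = rng.standard_normal(3), rng.standard_normal(2)
cost, grads = cost_and_grads(layers, x, t)

# Central finite-difference check of one entry of the first weight matrix.
eps = 1e-6
layers[0].W[1, 2] += eps
cost_plus, _ = cost_and_grads(layers, x, t)
layers[0].W[1, 2] -= 2 * eps
cost_minus, _ = cost_and_grads(layers, x, t)
layers[0].W[1, 2] += eps
numeric = (cost_plus - cost_minus) / (2 * eps)
print(grads[0][1, 2], numeric)
```

The backward loop never inspects what a component computes; it only calls `bprop` and `grad`, which is why new differentiable components can be added without changing the training code.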

Potential Issues

But what about local minima?
But what about overfitting?
Vanishing gradients

The cost surface/local optima

Local minima are not an issue in practice. This is probably due to the high-dimensional parameter space, which causes most critical points to be saddle points, not local optima. Some recent theoretical work supports this view (Choromanska et al. 2014; Dauphin et al. 2014).

[Figure from Wikipedia.]

Overfitting

Overfitting in regression

[Figure from Bishop 2006: Pattern Recognition and Machine Learning.]

Preventing overfitting in neural networks

Early stopping. [Figure: training cost and validation cost plotted against training iteration.]
Weight decay (somewhat outdated): add a weight penalty to the training objective (weight constraints are now more common).
Dropout (Hinton et al., 2012): corrupt hidden unit activations (multiplicatively) during training.
More data.
Weight sharing (reduce the number of parameters).
Batch normalization (Ioffe, Szegedy 2015).

Weight sharing

[Figure: weight-sharing diagram with weights labelled W.]

Parameters can be shared by having them point to the same memory location. This is a very common way to reduce parameters and encode prior knowledge, and a central ingredient in conv-nets (CNNs) and recurrent nets (RNNs). But: it requires long-range communication.

Batch normalization (Ioffe, Szegedy 2015)

Normalize the pre-activation (the activation before the non-linearity) of each neuron, averaged over the current mini-batch, to have mean 0 and standard deviation 1. To allow the network to learn the original, unnormalized function, two parameters are added that allow it to undo the normalization. Batch normalization is a somewhat unusual operation, because it couples (independently for each neuron) all examples within the mini-batch (and requires back-propagating through them). It has been shown to stabilize training and prevent overfitting. Explanation attempt by Ioffe and Szegedy (2015): it prevents covariate shift.
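The batch normalization step described above can be sketched as a training-time forward pass. The parameter names `gamma` and `beta` for the two added parameters and the small constant `eps` are assumptions for this illustration; the running statistics used at test time and the backward pass are left out.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Normalize pre-activations a (shape: batch x neurons) per neuron over the mini-batch."""
    mean = a.mean(axis=0)
    var = a.var(axis=0)
    a_hat = (a - mean) / np.sqrt(var + eps)  # mean 0, standard deviation close to 1
    return gamma * a_hat + beta              # the two added parameters rescale and shift

rng = np.random.default_rng(0)
a = 3.0 + 2.0 * rng.standard_normal((64, 5))  # hypothetical mini-batch of pre-activations

out = batch_norm(a, gamma=np.ones(5), beta=np.zeros(5))
print(out.mean(axis=0), out.std(axis=0))

# Setting gamma to the batch standard deviation and beta to the batch mean
# undoes the normalization, so the unnormalized function stays learnable.
restored = batch_norm(a, gamma=np.sqrt(a.var(axis=0) + 1e-5), beta=a.mean(axis=0))
```

Because `mean` and `var` are computed over the batch axis, every output row depends on every other example in the mini-batch, which is the coupling the text refers to.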

The vanishing gradients problem

The backward pass is a sequence of matrix multiplies. Depending on the magnitude of the eigenvalues, initial values can blow up or decay to zero. This can make learning difficult or slow. Potential solutions: architectural tricks (for example, the LSTM unit).

Neural nets learn distributed representations

Neural networks encode information as vectors of real values. This makes it easy to encode conceptual similarities. In a text processing task, for example:

If a user searches for "Dell notebook battery size", we would like to match documents with "Dell laptop battery capacity".
If a user searches for "Seattle motel", we would like to match documents containing "Seattle hotel".

(Example from Chris Manning)

Universal approximation

A network with a single hidden layer can model any non-linear function under fairly mild conditions to arbitrary accuracy (e.g. Funahashi, 1989). Unfortunately, the proof relies on using an exponentially large number of hidden units, so the practical relevance of this result is very limited. In practice, networks with many layers have proven to be much more useful.

Back-prop using asynchronous, local computations?

In the brain, where is the backward channel?
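Returning to the vanishing gradients problem above, the effect of the eigenvalue magnitudes can be illustrated with a small numerical sketch. It deliberately ignores the non-linearities and uses the same scaled orthogonal matrix in every layer, so that all eigenvalues have magnitude `scale`; the width, depth and scale values are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 50, 60  # hypothetical layer width and number of layers

# Random orthogonal matrix: every eigenvalue has magnitude exactly 1.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

def backward_norm(scale):
    """Norm of a unit-length gradient after `depth` multiplications by (scale * Q)^T."""
    g = np.ones(n) / np.sqrt(n)
    for _ in range(depth):
        g = (scale * Q).T @ g
    return np.linalg.norm(g)

for scale in (0.8, 1.0, 1.2):
    print(scale, backward_norm(scale))
```

In this simplified setting the norm changes by a factor of `scale` per layer, so values slightly below 1 shrink the signal towards zero and values slightly above 1 make it grow rapidly with depth.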

Towards back-prop using local computations

Hinton 2007: Use the temporal derivative to encode the error derivative! (see also: Bengio et al. 2015)

Recall that the derivative of most common cost functions is, conveniently, given by

$\frac{\partial\, \text{cost}}{\partial y(x_n)} = y(x_n) - t_n$

How local back-prop may work

Let the top layer drive the activations towards the correct value.
Let feedback weights transport that change downward.
Make weight changes proportional to the rate of change of the postsynaptic neuron and the value of the presynaptic neuron.

Is the brain doing local back-prop? (Hinton 2007)

What would neuroscientists see if this is what's happening in the brain? They should see this (and they do!):

[Figure: picture from "... dependent plasticity".]
