Neural Networks and Deep Learning


Example Learning Problem

Example Learning Problem Celebrity Faces in the Wild

Machine Learning Pipeline: raw data → feature extraction → feature computation → inference (prediction, recognition). Features are critical for accuracy and are traditionally hand-crafted. Instead of designing features, try to design feature detectors.

Machine Learning Pipeline: raw data → feature discovery → feature composition → inference (prediction, recognition). Low-level features → mid-level features → face-like features.

Logistic Regression unit: inputs weighted by w_1, w_2, w_3 with bias b. h(x; w, b) = 1 / (1 + e^{-(w^T x + b)}). Objective: determine the parameters w, b.

Training a neural network: given a training set (x_1, t_1), (x_2, t_2), (x_3, t_3), ..., minimize error = Σ_i ‖h(x_i; w, b) − t_i‖ by adjusting the parameters (w, b) over all nodes. Use gradient descent (backpropagation); can get stuck in local optima.
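A minimal sketch of this training loop for a single logistic unit, assuming NumPy and a squared-error objective; the toy data and learning rate below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(X, w, b):
    # logistic unit: h(x; w, b) = 1 / (1 + e^{-(w^T x + b)})
    return sigmoid(X @ w + b)

# hypothetical toy data: 2-D inputs with 0/1 targets (AND-like labels)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([0.0, 0.0, 0.0, 1.0])

w, b = np.zeros(2), 0.0
lr = 0.5
for epoch in range(2000):
    y = h(X, w, b)
    err = y - t                      # error vector
    grad = err * y * (1 - y)         # gradient of squared error through the sigmoid
    w -= lr * X.T @ grad / len(X)
    b -= lr * grad.mean()

print(np.round(h(X, w, b), 2))       # approaches [0, 0, 0, 1]
```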

MLP with Back-propagation: compare the outputs with the correct answer to get an error vector, then back-propagate the error to get gradients for parameter updating. [Figure: input vector → hidden layers → outputs.] [Hinton 09] DBN tutorial
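A compact sketch of that back-propagation step for a one-hidden-layer MLP, assuming NumPy; the layer sizes and the random data/targets are illustrative only and just exercise the updates:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# toy setup: 4 inputs -> 8 hidden units -> 3 outputs (sizes are arbitrary)
X = rng.normal(size=(32, 4))
T = rng.integers(0, 2, size=(32, 3)).astype(float)

W1, b1 = rng.normal(scale=0.1, size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(8, 3)), np.zeros(3)
lr = 0.1

for epoch in range(500):
    # forward pass: input vector -> hidden layer -> outputs
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)

    # compare outputs with the correct answer to get the error vector
    E = Y - T

    # back-propagate the error to get gradients for parameter updating
    dY = E * Y * (1 - Y)             # through the output sigmoid
    dH = (dY @ W2.T) * H * (1 - H)   # through the hidden sigmoid

    W2 -= lr * H.T @ dY / len(X); b2 -= lr * dY.mean(axis=0)
    W1 -= lr * X.T @ dH / len(X); b1 -= lr * dH.mean(axis=0)
```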

Why Deep? Brains are very deep. Humans organize their ideas hierarchically, through composition of simpler ideas. Insufficiently deep architectures can be exponentially inefficient: functions computable with a polynomial-size circuit of depth k may require exponential size at depth k-1 [Hastad 86]. Deep architectures help share features.

Why learn features? In the brain, very few filters are hard-coded: early visual deprivation produces irreversible damage in kittens [Hubel Wiesel 63]. Learning features avoids designing different feature extraction schemes for different kinds of input data. Hypothesis: good reconstruction ⇒ good recognition.

Drawbacks of Back-propagation: it is purely discriminative; it gets all its information from the labels, and the labels don't convey much information. It needs a substantial amount of labeled data. Gradient descent with random initialization leads to poor local minima.

Deep Belief Networks: pre-train the network from input data alone (generative step), then use the weights of the pre-trained network as the initial point for traditional backpropagation. Leads to quicker convergence. Pre-training is fast; fine-tuning can be slow.

[Figure: 2-layer auto-encoders (coarse) are stacked into a deep autoencoder.]

Pre-Training: Maximum likelihood learning. [Figure: alternating Gibbs sampling between visible units v_i and hidden units h_j at t = 0, 1, 2, ..., infinity.] Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. ∂ log p(v) / ∂w_ij = <v_i h_j>^0 − <v_i h_j>^∞. [Hinton 09] DBN tutorial

Pre-Training: Maximum likelihood learning. [Figure: data at t = 0, reconstruction at t = 1.] For each training vector: update all the hidden units in parallel; update all the visible units in parallel to get a reconstruction; update the hidden units again. Δw_ij ∝ (<v_i h_j>^0 − <v_i h_j>^1). [Hinton 09] DBN tutorial
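A rough NumPy sketch of this one-step weight update for a binary RBM; the sizes, learning rate, and sampling details (and the omission of biases) are illustrative assumptions, not the exact recipe from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden = 6, 4           # toy sizes
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
lr = 0.1

# batch of binary training vectors on the visible units
V0 = rng.integers(0, 2, size=(10, n_visible)).astype(float)

for step in range(100):
    # t = 0: update all hidden units in parallel from the data
    ph0 = sigmoid(V0 @ W)
    H0 = (rng.random(ph0.shape) < ph0).astype(float)

    # update all visible units in parallel to get a reconstruction (t = 1)
    pv1 = sigmoid(H0 @ W.T)
    V1 = (rng.random(pv1.shape) < pv1).astype(float)

    # update the hidden units again
    ph1 = sigmoid(V1 @ W)

    # delta w_ij proportional to <v_i h_j>^0 - <v_i h_j>^1
    W += lr * (V0.T @ ph0 - V1.T @ ph1) / len(V0)
```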

Deep Belief Networks. [Figure: in parameter space, input-based pre-training takes a random initial position to pre-trained weights, from which (slow) fine-tuning reaches a good solution; back-propagation directly from the random initial position is very slow.] [MS Ram]

Deep Belief Networks: pre-train the network from input data alone (generative step), then use the weights of the pre-trained network as the initial point for traditional backpropagation. Leads to quicker convergence. Pre-training is fast; fine-tuning can be slow.

Searching in parameter space: one layer with 1000 input + 1000 hidden units has 1 million weights, i.e. a million-dimensional optimization; we need to find the global (or at least a good) optimum from a random initialization. Impossibly slow for gradient descent.

Deep Belief Networks. [Figure: in the million-dimensional parameter space, back-propagation from a random initial position to a good solution is very slow.]

Searching in parameter space: one layer with 1000 input + 1000 hidden units has 1 million weights, i.e. a million-dimensional optimization; impossibly slow for gradient descent to find the global optimum from a random initialization. Added complications: the gradient magnitude is vanishingly small in the lower parts of the network, and deep networks tend to have more local minima than shallow networks.

In practice: MLP vs DBN on MNIST (error rates). MLP (1 hidden layer): 1 hour: 2.18%; 14 hours: 1.65%. DBN: 1 hour: 1.65%; 14 hours: 1.10%; 21 hours: 0.97%. (Intel QuadCore 2.83 GHz, 4 GB RAM.) [MS Ram]

Deep network architecture (MNIST), pre-training followed by fine-tuning. Layer sizes (# nodes): Input 784 → 500 → 500 → 2000 → Output 10. Weights per layer (# weights): 784×500 ≈ 0.4 mn, 500×500 = 0.25 mn, 500×2000 = 1 mn, 2000×10 = 0.02 mn.

Designing DBNs: relative importance of the ingredients. Depth of network: seems quite important. Layer-wise pre-training: counter-example on MNIST, a 6-layer MLP (784-2500-2000-1500-1000-500-10, trained on GPU with elastic distortions) reaches a 0.35% error rate [Ciresan et al 2010]: "No fashionable unsupervised pre-training is necessary!" - Jürgen Schmidhuber. Amount of labeled training data; affine and elastic distortions. Main benefit: DBNs work with less training data.

Kernel Methods. Bishop, Ch. 6; R & N, Ch. 18.6.

Parametric: Discriminative

Parametric: Generative. [Image from Herbrich 2002.]

Parametric vs Memory models. Parametric models: learn a model for the data, i.e. a parameter vector w / posterior distribution p(w | t_1..t_N), and discard the training set t; e.g. linear classifiers (perceptron). Non-parametric: models built on the data, e.g. k-NN; memory-based: some or all of the training data is saved; SVM: saves a set of support vectors.

MNIST dataset. [Image of handwritten digits, from Herbrich 2002.]

k-NN. [Figure: a test digit and its 5 nearest neighbours.]

Kernel methods: feature space mapping φ(x); k(x, x') = φ(x)^T φ(x'). Symmetric: k(x, x') = k(x', x). Linear kernel: φ(x) = x. Stationary kernel: k(x, x') = k(x − x') [stationary under translation]. Homogeneous kernel: k(x, x') = k(‖x − x'‖) (e.g. RBF).
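A small sketch of such kernels in NumPy (the particular parameter values and test points are arbitrary); each is symmetric in its arguments:

```python
import numpy as np

def linear_kernel(x, xp):
    # k(x, x') = phi(x)^T phi(x') with phi(x) = x
    return float(np.dot(x, xp))

def polynomial_kernel(x, xp, degree=2):
    # k(x, x') = (x^T x')^degree
    return float(np.dot(x, xp)) ** degree

def rbf_kernel(x, xp, sigma=1.0):
    # homogeneous (RBF) kernel: depends only on ||x - x'||
    return float(np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2)))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, rbf_kernel):
    assert np.isclose(k(x, xp), k(xp, x))   # symmetry: k(x, x') = k(x', x)
    print(k.__name__, k(x, xp))
```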

Support Vector Machines. Main idea: a linear classifier, but in the φ(x) kernel space; decision criterion: maximum margin. Algorithm: the user specifies a kernel function; weights are learned for the instances; no actual computation in the high-dimensional space. Classification: an average of the instance labels, weighted by (a) proximity and (b) instance weight.

Example: the XOR problem. φ(x_1, x_2) = {x_1, x_2, x_1 x_2}. Better: φ(x_1, x_2) = {1, √2 x_1, √2 x_2, x_1², √2 x_1 x_2, x_2²}.
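A quick check, assuming NumPy, that the x_1 x_2 coordinate of this feature map already makes XOR linearly separable (the simple map {x_1, x_2, x_1 x_2} is enough to show the idea):

```python
import numpy as np

# XOR with +/-1 inputs and labels
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
t = np.array([-1, 1, 1, -1])          # XOR: label -1 when the inputs agree

def phi(x):
    # feature map {x1, x2, x1*x2}
    return np.array([x[0], x[1], x[0] * x[1]])

Phi = np.array([phi(x) for x in X])
# In feature space the hyperplane w = (0, 0, -1), b = 0 separates the classes:
w = np.array([0.0, 0.0, -1.0])
print(np.sign(Phi @ w))               # matches t: [-1, 1, 1, -1]
```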

Margin maximization. [Figure: classes y_i = +1 and y_i = -1 on either side of the decision hyperplane.] Decision hyperplane: w^T x + b = 0. If t_i ∈ {+1, -1}, the margin is m = (1/‖w‖) min_i t_i(w^T x_i + b); w must satisfy the constraint that for all data (x_i, t_i): t_i(w^T x_i + b) ≥ m. The margin is maximized when 1/‖w‖ is maximal, i.e. by minimizing ‖w‖.
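For a concrete picture, here is a hedged sketch, assuming scikit-learn and NumPy are available, that fits a (nearly) hard-margin linear SVM on separable toy data and reads off the margin 1/‖w‖:

```python
import numpy as np
from sklearn.svm import SVC

# linearly separable toy data (made up for illustration)
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
t = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)     # large C approximates a hard margin
clf.fit(X, t)

w = clf.coef_[0]
b = clf.intercept_[0]
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w))
# support vectors satisfy t_i (w^T x_i + b) ~= 1 in the canonical scaling
print(t[clf.support_] * (X[clf.support_] @ w + b))
```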

Support Vector Machines. Main idea: a linear classifier, but in the φ(x) kernel space; decision criterion: maximum margin. Algorithm: the user specifies a kernel function; weights are learned for the instances; no actual computation in the high-dimensional space. Classification: an average of the instance labels, weighted by (a) proximity and (b) instance weight.

Kernel trick

Demo by Udi Aharoni: http://www.youtube.com/watch?v=3licbrzprza

Support Vector Machines. Main idea: a linear classifier, but in the φ(x) kernel space; decision criterion: maximum margin. Algorithm: the user specifies a kernel function; weights are learned for the instances via convex optimization; no actual computation in the high-dimensional space. Classification: an average of the instance labels, weighted by (a) proximity and (b) instance weight.

Kernel trick: the linear classifier lives in the high-dimensional φ(x) space, but no computation is done directly on φ(x); only the kernel k(x, x') = φ(x)^T φ(x') is computed. E.g. for φ(x_1, x_2) = {1, √2 x_1, √2 x_2, x_1², √2 x_1 x_2, x_2²}: k(x, x') = φ(x)^T φ(x') = ((1, x_1, x_2)(1, x'_1, x'_2)^T)² = (1 + <x, x'>)². Efficient only if the scalar product can be computed efficiently. Holds for k(x, x') continuous, symmetric and positive definite.
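A quick numeric check of this identity, assuming NumPy; the test points are arbitrary:

```python
import numpy as np

def phi(x):
    # feature map {1, sqrt(2) x1, sqrt(2) x2, x1^2, sqrt(2) x1 x2, x2^2}
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, s * x1 * x2, x2**2])

def k(x, xp):
    # kernel computed without ever forming phi: (1 + <x, x'>)^2
    return (1.0 + np.dot(x, xp)) ** 2

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(phi(x) @ phi(xp))   # explicit feature-space inner product
print(k(x, xp))           # same value via the kernel trick
assert np.isclose(phi(x) @ phi(xp), k(x, xp))
```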

Dual Representations. Scalar product representations arise naturally in many classes of problems, e.g. linear regression: J(w) = ½ Σ_n (w^T φ(x_n) − t_n)² + (λ/2) w^T w. Setting the gradient to zero: w = −(1/λ) Σ_n (w^T φ(x_n) − t_n) φ(x_n) = Φ^T a, where Φ^T = matrix of the φ(x_n), and a_n = −(1/λ)(w^T φ(x_n) − t_n).

Dual Representations. Instead of the parameter space w, use the parameter space a. Writing w = Φ^T a into J(w): J(a) = ½ a^T K K a − a^T K t + ½ t^T t + (λ/2) a^T K a, where the Gram matrix K = Φ Φ^T, with K_nm = φ(x_n)^T φ(x_m) = k(x_n, x_m). Solving for a by setting dJ(a)/da to zero: a = (K + λ I_N)^{-1} t.

Dual Representations. Substituting back into the original regression model: y(x) = w^T φ(x) = a^T Φ φ(x) = k(x)^T (K + λ I_N)^{-1} t, where k(x) has elements k_n(x) = k(x_n, x). Thus, all computations are performed purely with the kernel.
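A short sketch of this dual (kernel ridge regression) solution in NumPy, using an RBF kernel and made-up 1-D data; the kernel width and λ are arbitrary choices:

```python
import numpy as np

def rbf(x, xp, sigma=0.5):
    return np.exp(-((x - xp) ** 2) / (2 * sigma ** 2))

# toy 1-D training data
x_train = np.linspace(0, 2 * np.pi, 20)
t = np.sin(x_train)
lam = 0.1

# Gram matrix K_nm = k(x_n, x_m) and dual coefficients a = (K + lambda I)^{-1} t
K = rbf(x_train[:, None], x_train[None, :])
a = np.linalg.solve(K + lam * np.eye(len(x_train)), t)

# prediction y(x) = k(x)^T a, computed purely with the kernel
x_test = np.array([1.0, 2.5, 4.0])
y_pred = rbf(x_test[:, None], x_train[None, :]) @ a
print(y_pred)            # close to sin(x_test)
print(np.sin(x_test))
```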

SVM. Why so popular: very good classification performance, comparable with the best; fast (convex) and scalable learning; fast inference (but slower training). Difficulties: no model (discriminative, a black box); not as useful for discrete inputs; the art of specifying the kernel function; difficulties with multiple classes.

Choosing Kernels. Popular kernels that satisfy the Gram-matrix positive-definiteness criterion include: Linear kernels: k(x, x') = <x, x'>. Polynomials: k(x, x') = <x, x'>^n; for n = 2 (quadratic), polynomial basis functions. [Figure: polynomial basis functions and k(x, x') plotted for x' = 0.]

Choosing Kernels. Gaussian: k(x, x') = exp(−‖x − x'‖² / 2σ²), i.e. radial basis functions. Sigmoid (logistic) function.

Gaussian / Sigmoid Bases. [Figure: Gaussian bases and logistic bases, with the corresponding kernel k(x, 0); image from Bishop 2005.]

Constructing Kernels. If k_1 and k_2 are valid kernels, then so are: c·k_1(x, x') for c > 0; f(x) k_1(x, x') f(x'); q(k_1(x, x')) for a polynomial q with non-negative coefficients; exp(k_1(x, x')); k_1(x, x') + k_2(x, x'); k_1(x, x') · k_2(x, x').
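A small sanity check, assuming NumPy: build the Gram matrix of a sum and of a product of two valid kernels on random points and confirm it has no significantly negative eigenvalues (the points and base kernels are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))          # arbitrary sample points

def gram(kernel, X):
    return np.array([[kernel(x, xp) for xp in X] for x in X])

k1 = lambda x, xp: float(np.dot(x, xp))                          # linear kernel
k2 = lambda x, xp: float(np.exp(-np.sum((x - xp) ** 2) / 2.0))   # Gaussian kernel

for combo in (lambda x, xp: k1(x, xp) + k2(x, xp),   # sum of valid kernels
              lambda x, xp: k1(x, xp) * k2(x, xp)):  # product of valid kernels
    K = gram(combo, X)
    eigvals = np.linalg.eigvalsh(K)
    print(eigvals.min() >= -1e-9)     # expected: True (positive semi-definite)
```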

Reading (Machine Learning). Bishop: sections 1.1 to 1.2.4, 1.5.1-2, 6.1, 6.2. Russell & Norvig: 18.9.