Lecture 5: Multilayer Perceptrons


Roger Grosse

1 Introduction

So far, we've only talked about linear models: linear regression and linear binary classifiers. We noted that there are functions that can't be represented by linear models; for instance, linear regression can't represent quadratic functions, and linear classifiers can't represent XOR. We also saw one particular way around this issue: by defining features, or basis functions. E.g., linear regression can represent a cubic polynomial if we use the feature map ψ(x) = (1, x, x^2, x^3). We also observed that this isn't a very satisfying solution, for two reasons:

1. The features need to be specified in advance, and this can require a lot of engineering work.

2. It might require a very large number of features to represent a certain set of functions; e.g. the feature representation for cubic polynomials is cubic in the number of input features.

In this lecture, and for the rest of the course, we'll take a different approach. We'll represent complex nonlinear functions by connecting together lots of simple processing units into a neural network, each of which computes a linear function, possibly followed by a nonlinearity. In aggregate, these units can compute some surprisingly complex functions. By historical accident, these networks are called multilayer perceptrons. (Some people would claim that the methods covered in this course are really just adaptive basis function representations. I've never found this a very useful way of looking at things.)

1.1 Learning Goals

- Know the basic terminology for neural nets
- Given the weights and biases for a neural net, be able to compute its output from its input
- Be able to hand-design the weights of a neural net to represent functions like XOR
- Understand how a hard threshold can be approximated with a soft threshold
- Understand why shallow neural nets are universal, and why this isn't necessarily very interesting
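As a quick illustration of the basis-function idea above, here is a minimal sketch, not part of the original notes: ordinary least squares on the expanded features ψ(x) = (1, x, x^2, x^3) fits a noisy cubic, because the model is linear in the features even though it is cubic in x. The data and coefficients below are made up for illustration.

```python
# Hypothetical sketch: fitting a cubic with a *linear* model by expanding the
# input with the feature map psi(x) = (1, x, x^2, x^3).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
t = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.standard_normal(100)   # noisy cubic targets

Psi = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)  # feature map, shape (100, 4)
w, *_ = np.linalg.lstsq(Psi, t, rcond=None)               # ordinary least squares
print(w)  # roughly recovers (1, -2, 0, 0.5)
```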

2 Multilayer Perceptrons

In the first lecture, we introduced our general neuron-like processing unit:

a = φ( Σ_j w_j x_j + b ),

where the x_j are the inputs to the unit, the w_j are the weights, b is the bias, φ is the nonlinear activation function, and a is the unit's activation. We've seen a bunch of examples of such units:

- Linear regression uses a linear model, so φ(z) = z.
- In binary linear classifiers, φ is a hard threshold at zero.
- In logistic regression, φ is the logistic function σ(z) = 1/(1 + e^{-z}).

A neural network is just a combination of lots of these units. Each one performs a very simple and stereotyped function, but in aggregate they can do some very useful computations. For now, we'll concern ourselves with feed-forward neural networks, where the units are arranged into a graph without any cycles, so that all the computation can be done sequentially. This is in contrast with recurrent neural networks, where the graph can have cycles, so the processing can feed into itself. These are much more complicated, and we'll cover them later in the course.

The simplest kind of feed-forward network is a multilayer perceptron (MLP), as shown in Figure 1. (MLP is an unfortunate name. The perceptron was a particular algorithm for binary classification, invented in the 1950s. Most multilayer perceptrons have very little to do with the original perceptron algorithm.) Here, the units are arranged into a set of layers, and each layer contains some number of identical units. Every unit in one layer is connected to every unit in the next layer; we say that the network is fully connected. The first layer is the input layer, and its units take the values of the input features. The last layer is the output layer, and it has one unit for each value the network outputs (i.e. a single unit in the case of regression or binary classification, or K units in the case of K-class classification). All the layers in between these are known as hidden layers, because we don't know ahead of time what these units should compute, and this needs to be discovered during learning. The units in these layers are known as input units, output units, and hidden units, respectively.

Figure 1: A multilayer perceptron with two hidden layers. Left: with the units written out explicitly. Right: representing layers as boxes.

The number of layers is known as the depth, and the number of units in a layer is known as the width. (Terminology for the depth is very inconsistent. A network with one hidden layer could be called a one-layer, two-layer, or three-layer network, depending on whether you count the input and output layers.) As you might guess, deep learning refers to training neural nets with many layers.

As an example to illustrate the power of MLPs, let's design one that computes the XOR function. Remember, we showed that linear models cannot do this. We can verbally describe XOR as "one of the inputs is 1, but not both of them." So let's have hidden unit h1 detect if at least one of the inputs is 1, and have h2 detect if they are both 1. We can easily do this if we use a hard threshold activation function. You know how to design such units: it is an exercise of designing a binary linear classifier. Then the output unit will activate only if h1 = 1 and h2 = 0. A network which does this is shown in Figure 2.

Figure 2: An MLP that computes the XOR function. All activation functions are binary thresholds at 0.

Let's write out the MLP computations mathematically. Conceptually, there is nothing new here; we just have to pick a notation to refer to various parts of the network. As with the linear case, we'll refer to the activations of the input units as x_j and the activation of the output unit as y. The units in the lth hidden layer will be denoted h_i^(l). Our network is fully connected, so each unit receives connections from all the units in the previous layer. This means each unit has its own bias, and there is a weight for every pair of units in two consecutive layers. Therefore, the network's computations can be written out as:

h_i^(1) = φ^(1)( Σ_j w_ij^(1) x_j + b_i^(1) )
h_i^(2) = φ^(2)( Σ_j w_ij^(2) h_j^(1) + b_i^(2) )
y_i = φ^(3)( Σ_j w_ij^(3) h_j^(2) + b_i^(3) )        (1)

Note that we distinguish φ^(1) and φ^(2) because different layers may have different activation functions.
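To make this concrete, here is a minimal NumPy sketch, not from the original notes, of an XOR network written out unit by unit in the style of equations (1). Since Figure 2 itself is not reproduced here, the particular weights below are just one choice consistent with the verbal description above.

```python
# A sketch of an XOR MLP with hard-threshold units (assumed weights, chosen to
# match the verbal description: h1 fires if at least one input is 1, h2 fires
# only if both are 1, and y fires if h1 = 1 and h2 = 0).
import numpy as np

def step(z):
    """Hard threshold at zero."""
    return (np.asarray(z) > 0).astype(float)

def xor_mlp(x1, x2):
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)   # at least one input is 1
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)   # both inputs are 1
    y = step(1.0 * h1 - 2.0 * h2 - 0.5)    # h1 and not h2
    return y

for x1 in (0.0, 1.0):
    for x2 in (0.0, 1.0):
        print(int(x1), int(x2), int(xor_mlp(x1, x2)))   # prints the XOR truth table
```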

Since all these summations and indices can be cumbersome, we usually write the computations in vectorized form. Since each layer contains multiple units, we represent the activations of all its units with an activation vector h^(l). Since there is a weight for every pair of units in two consecutive layers, we represent each layer's weights with a weight matrix W^(l). Each layer also has a bias vector b^(l). The above computations are therefore written in vectorized form as:

h^(1) = φ^(1)( W^(1) x + b^(1) )
h^(2) = φ^(2)( W^(2) h^(1) + b^(2) )
y = φ^(3)( W^(3) h^(2) + b^(3) )        (2)

When we write the activation function applied to a vector, this means it is applied independently to all the entries.

Recall how in linear regression, we combined all the training examples into a single matrix X, so that we could compute all the predictions using a single matrix multiplication. We can do the same thing here. We can store all of each layer's hidden units for all the training examples as a matrix H^(l). Each row contains the hidden units for one example. The computations are written as follows (note the transposes):

H^(1) = φ^(1)( X W^(1)ᵀ + 1 b^(1)ᵀ )
H^(2) = φ^(2)( H^(1) W^(2)ᵀ + 1 b^(2)ᵀ )
Y = φ^(3)( H^(2) W^(3)ᵀ + 1 b^(3)ᵀ )        (3)

(If it is hard to remember when a matrix or vector is transposed, fear not. You can usually figure it out by making sure the dimensions match up.)

These equations can be translated directly into NumPy code which efficiently computes the predictions over the whole dataset.
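For instance, here is a sketch (mine, not from the notes) of equations (3) in NumPy: a forward pass over a whole dataset X, one example per row. The layer sizes, random weights, and the choice of the logistic activation are placeholders purely for illustration.

```python
# Vectorized forward pass over a dataset, following equations (3).
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, D, M1, M2, K = 5, 784, 100, 50, 10           # example sizes (arbitrary)
X = rng.standard_normal((N, D))                 # one training example per row

# W(l) has shape (units out, units in), matching equations (2).
W1, b1 = 0.01 * rng.standard_normal((M1, D)), np.zeros(M1)
W2, b2 = 0.01 * rng.standard_normal((M2, M1)), np.zeros(M2)
W3, b3 = 0.01 * rng.standard_normal((K, M2)), np.zeros(K)

# Note the transposes: multiplying by W(l).T keeps one example per row, and
# adding b(l) broadcasts the bias across rows (the "1 b(l).T" term).
H1 = logistic(X @ W1.T + b1)
H2 = logistic(H1 @ W2.T + b2)
Y = H2 @ W3.T + b3                              # linear output layer, shape (N, K)
print(Y.shape)                                  # (5, 10)
```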

3 Feature Learning

We already saw that linear regression could be made more powerful using a feature mapping. For instance, the feature mapping ψ(x) = (1, x, x^2, x^3) can represent third-degree polynomials. But static feature mappings were limited because it can be hard to design all the relevant features, and because the mappings might be impractically large. Neural nets can be thought of as a way of learning nonlinear feature mappings. E.g., in Figure 1, the last hidden layer can be thought of as a feature map ψ(x), and the output layer weights can be thought of as a linear model using those features. But the whole thing can be trained end-to-end with backpropagation, which we'll cover in the next lecture. The hope is that we can learn a feature representation where the data become linearly separable.

Consider training an MLP to recognize handwritten digits. (This will be a running example for much of the course.) The input is a 28 x 28 grayscale image, and all the pixels take values between 0 and 1. We'll ignore the spatial structure, and treat each input as a 784-dimensional vector. This is a multiway classification task with 10 categories, one for each digit class. Suppose we train an MLP with two hidden layers. We can try to understand what the first layer of hidden units is computing by visualizing the weights. Each hidden unit receives inputs from each of the pixels, which means the weights feeding into each hidden unit can be represented as a 784-dimensional vector, the same as the input size. In Figure 3, we display these vectors as images.

Figure 3: Left: Some training examples from the MNIST handwritten digit dataset. Each input is a 28 x 28 grayscale image, which we treat as a 784-dimensional vector. Right: A subset of the learned first-layer features. Observe that many of them pick up oriented edges.

In this visualization, positive values are lighter, and negative values are darker. Each hidden unit computes the dot product of these vectors with the input image, and then passes the result through the activation function. So if the light regions of the filter overlap the light regions of the image, and the dark regions of the filter overlap the dark regions of the image, then the unit will activate. E.g., look at the third filter in the second row. This corresponds to an oriented edge: it detects vertical edges in the upper right part of the image. This is a useful sort of feature, since it gives information about the locations and orientations of strokes. Many of the features are similar to this; in fact, oriented edges are very commonly learned by the first layers of neural nets for visual processing tasks.

It is harder to visualize what the second layer is doing. We'll see some tricks for visualizing this in a few weeks. We'll see that higher layers of a neural net can learn increasingly high-level and complex features. Later on, we'll talk about convolutional networks, which use the spatial structure of the image.
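As a rough sketch of how a visualization like the right panel of Figure 3 might be produced (this code is mine, not from the notes): each row of a trained first-layer weight matrix W1, assumed here to have shape (number of hidden units, 784), is reshaped to 28 x 28 and shown as a grayscale image, so larger weights appear lighter.

```python
# Hypothetical visualization of first-layer weights as 28 x 28 images.
# W1 stands in for an already-trained weight matrix; here we pass random
# weights only to make the sketch runnable.
import numpy as np
import matplotlib.pyplot as plt

def show_first_layer_weights(W1, n_rows=4, n_cols=8):
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for i, ax in enumerate(axes.ravel()):
        ax.imshow(W1[i].reshape(28, 28), cmap="gray")   # lighter = more positive
        ax.axis("off")
    plt.show()

show_first_layer_weights(np.random.randn(100, 784))
```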

4 Expressive Power

Linear models are fundamentally limited in their expressive power: they can't represent functions like XOR. Are there similar limitations for MLPs? It depends on the activation function.

4.1 Linear networks

Deep linear networks are no more powerful than shallow ones. The reason is simple: if we use the linear activation function φ(x) = x (and forget the biases for simplicity), the network's function can be expanded out as y = W^(L) W^(L-1) ... W^(1) x. But this could be viewed as a single linear layer with weights given by W = W^(L) W^(L-1) ... W^(1). Therefore, a deep linear network is no more powerful than a single linear layer, i.e. a linear model.

4.2 Universality

As it turns out, nonlinear activation functions give us much more power: under certain technical conditions, even a shallow MLP (i.e. one with a single hidden layer) can represent arbitrary functions. Therefore, we say it is universal.

Let's demonstrate universality in the case of binary inputs. We do this using the following game: suppose we're given a function mapping input vectors to outputs; we will need to produce a neural network (i.e. specify the weights and biases) which matches that function. The function can be given to us as a table which lists the output corresponding to every possible input vector. If there are D inputs, this table will have 2^D rows. An example is shown in Figure 4.

Figure 4: Designing a binary threshold network to compute a particular function.

For convenience, let's suppose these inputs are ±1, rather than 0 or 1. All of our hidden units will use a hard threshold at 0 (but we'll see shortly that these can easily be converted to soft thresholds), and the output unit will be linear. Our strategy will be as follows: we will have 2^D hidden units, each of which recognizes one possible input vector. We can then specify the function by specifying the weights connecting each of these hidden units to the outputs. For instance, suppose we want a hidden unit to recognize the input (-1, 1, -1). This can be done using the weights (-1, 1, -1) and bias -2.5, and this unit will be connected to the output unit with weight 1. (Can you come up with the general rule?) Using these weights, any input pattern will produce a set of hidden activations where exactly one of the units is active. The weights connecting the hidden units to the output can be set based on the input-output table. Part of the network is shown in Figure 4. This argument can easily be made into a rigorous proof, but this course won't be concerned with mathematical rigor.
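Here is a small sketch, not from the notes, of this construction for inputs in {-1, +1}^D: one hidden unit per possible input pattern, with weights equal to that pattern and bias -(D - 0.5), so exactly one hidden unit fires for any input, and the output unit simply reads off the corresponding table entry. The target function at the bottom is made up for the example.

```python
# Universality construction for binary (+/-1) inputs with hard-threshold hidden
# units and a linear output unit.
import itertools
import numpy as np

def build_universal_net(D, target_fn):
    patterns = np.array(list(itertools.product([-1.0, 1.0], repeat=D)))  # 2^D rows
    W_hid = patterns                              # one hidden unit per pattern
    b_hid = -(D - 0.5) * np.ones(len(patterns))   # each unit fires only on its own pattern
    w_out = np.array([target_fn(p) for p in patterns])  # read outputs off the table
    return W_hid, b_hid, w_out

def predict(x, W_hid, b_hid, w_out):
    h = (W_hid @ x + b_hid > 0).astype(float)     # exactly one unit is active
    return float(w_out @ h)

def xor12(p):
    """Example target: XOR of the first two (+/-1) inputs."""
    return float((p[0] > 0) != (p[1] > 0))

W_hid, b_hid, w_out = build_universal_net(3, xor12)
for p in itertools.product([-1.0, 1.0], repeat=3):
    x = np.array(p)
    assert predict(x, W_hid, b_hid, w_out) == xor12(x)   # matches the table everywhere
```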

Universality is a neat property, but it has a major catch: the network required to represent a given function might have to be extremely large (in particular, exponential). In other words, not all functions can be represented compactly. We desire compact representations for two reasons:

1. We want to be able to compute predictions in a reasonable amount of time.

2. We want to be able to train a network to generalize from a limited number of training examples; from this perspective, universality simply implies that a large enough network can memorize the training set, which isn't very interesting.

4.3 Soft thresholds

In the previous section, our activation function was a step function, which gives a hard threshold at 0. This was convenient for designing the weights of a network by hand. But recall from last lecture that it is very hard to directly learn a linear classifier with a hard threshold, because the loss derivatives are 0 almost everywhere. The same holds true for multilayer perceptrons. If the activation function for any unit is a hard threshold, we won't be able to learn that unit's weights using gradient descent. The solution is the same as it was in last lecture: we replace the hard threshold with a soft one. Does this cost us anything in terms of the network's expressive power? No it doesn't, because we can approximate a hard threshold using a soft threshold. In particular, if we use the logistic nonlinearity, we can approximate a hard threshold by scaling up the weights and biases: as the scale grows, the logistic output approaches a step function. (A small numerical illustration appears at the end of these notes.)

4.4 The power of depth

If shallow networks are universal, why do we need deep ones? One important reason is that deep nets can represent some functions more compactly than shallow ones. For instance, consider the parity function (on binary-valued inputs):

f_par(x_1, ..., x_D) = 1 if Σ_j x_j is odd, and 0 if it is even.        (4)

We won't prove this, but it requires an exponentially large shallow network to represent the parity function. On the other hand, it can be computed by a deep network whose size is linear in the number of inputs. Designing such a network is a good exercise.
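Finally, the numerical illustration promised in Section 4.3, again a sketch of my own rather than part of the notes: as the scale factor on a logistic unit's input grows, its output approaches the hard threshold. The scale factors below are arbitrary.

```python
# Scaling up the input of the logistic function approximates a hard threshold.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1.0, -0.1, 0.1, 1.0])
for c in (1.0, 5.0, 50.0):
    print(c, np.round(logistic(c * z), 3))
# As c grows, the outputs approach (0, 0, 1, 1), i.e. a step function at 0.
```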