Lecture 5: Multilayer Perceptrons


Roger Grosse

1 Introduction

So far, we've only talked about linear models: linear regression and linear binary classifiers. We noted that there are functions that can't be represented by linear models; for instance, linear regression can't represent quadratic functions, and linear classifiers can't represent XOR. We also saw one particular way around this issue: by defining features, or basis functions. E.g., linear regression can represent a cubic polynomial if we use the feature map ψ(x) = (1, x, x^2, x^3). We also observed that this isn't a very satisfying solution, for two reasons:

1. The features need to be specified in advance, and this can require a lot of engineering work.

2. It might require a very large number of features to represent a certain set of functions; e.g. the feature representation for cubic polynomials is cubic in the number of input features.

In this lecture, and for the rest of the course, we'll take a different approach. We'll represent complex nonlinear functions by connecting together lots of simple processing units into a neural network, each of which computes a linear function, possibly followed by a nonlinearity. In aggregate, these units can compute some surprisingly complex functions. By historical accident, these networks are called multilayer perceptrons. (Some people would claim that the methods covered in this course are really just adaptive basis function representations. I've never found this a very useful way of looking at things.)

1.1 Learning Goals

- Know the basic terminology for neural nets
- Given the weights and biases for a neural net, be able to compute its output from its input
- Be able to hand-design the weights of a neural net to represent functions like XOR
- Understand how a hard threshold can be approximated with a soft threshold
- Understand why shallow neural nets are universal, and why this isn't necessarily very interesting
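As a quick illustration of the basis-function idea above, here is a minimal sketch, not part of the original notes: ordinary least squares on the expanded features ψ(x) = (1, x, x^2, x^3) fits a noisy cubic, because the model is linear in the features even though it is cubic in x. The data and coefficients below are made up for illustration.

```python
# Hypothetical sketch: fitting a cubic with a *linear* model by expanding the
# input with the feature map psi(x) = (1, x, x^2, x^3).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
t = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.standard_normal(100)   # noisy cubic targets

Psi = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)  # feature map, shape (100, 4)
w, *_ = np.linalg.lstsq(Psi, t, rcond=None)               # ordinary least squares
print(w)  # roughly recovers (1, -2, 0, 0.5)
```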

2 Multilayer Perceptrons

In the first lecture, we introduced our general neuron-like processing unit:

a = φ( Σ_j w_j x_j + b ),

where the x_j are the inputs to the unit, the w_j are the weights, b is the bias, φ is the nonlinear activation function, and a is the unit's activation. We've seen a bunch of examples of such units:

- Linear regression uses a linear model, so φ(z) = z.
- In binary linear classifiers, φ is a hard threshold at zero.
- In logistic regression, φ is the logistic function σ(z) = 1/(1 + e^{-z}).

A neural network is just a combination of lots of these units. Each one performs a very simple and stereotyped function, but in aggregate they can do some very useful computations. For now, we'll concern ourselves with feed-forward neural networks, where the units are arranged into a graph without any cycles, so that all the computation can be done sequentially. This is in contrast with recurrent neural networks, where the graph can have cycles, so the processing can feed into itself. These are much more complicated, and we'll cover them later in the course.

The simplest kind of feed-forward network is a multilayer perceptron (MLP), as shown in Figure 1. (MLP is an unfortunate name. The perceptron was a particular algorithm for binary classification, invented in the 1950s. Most multilayer perceptrons have very little to do with the original perceptron algorithm.) Here, the units are arranged into a set of layers, and each layer contains some number of identical units. Every unit in one layer is connected to every unit in the next layer; we say that the network is fully connected. The first layer is the input layer, and its units take the values of the input features. The last layer is the output layer, and it has one unit for each value the network outputs (i.e. a single unit in the case of regression or binary classification, or K units in the case of K-class classification). All the layers in between these are known as hidden layers, because we don't know ahead of time what these units should compute, and this needs to be discovered during learning. The units in these layers are known as input units, output units, and hidden units, respectively.

Figure 1: A multilayer perceptron with two hidden layers. Left: with the units written out explicitly. Right: representing layers as boxes.

The number of layers is known as the depth, and the number of units in a layer is known as the width. (Terminology for the depth is very inconsistent. A network with one hidden layer could be called a one-layer, two-layer, or three-layer network, depending on whether you count the input and output layers.) As you might guess, deep learning refers to training neural nets with many layers.

As an example to illustrate the power of MLPs, let's design one that computes the XOR function. Remember, we showed that linear models cannot do this. We can verbally describe XOR as "one of the inputs is 1, but not both of them." So let's have hidden unit h1 detect if at least one of the inputs is 1, and have h2 detect if they are both 1. We can easily do this if we use a hard threshold activation function. You know how to design such units: it is an exercise of designing a binary linear classifier. Then the output unit will activate only if h1 = 1 and h2 = 0. A network which does this is shown in Figure 2.

Figure 2: An MLP that computes the XOR function. All activation functions are binary thresholds at 0.

Let's write out the MLP computations mathematically. Conceptually, there is nothing new here; we just have to pick a notation to refer to various parts of the network. As with the linear case, we'll refer to the activations of the input units as x_j and the activation of the output unit as y. The units in the lth hidden layer will be denoted h_i^(l). Our network is fully connected, so each unit receives connections from all the units in the previous layer. This means each unit has its own bias, and there is a weight for every pair of units in two consecutive layers. Therefore, the network's computations can be written out as:

h_i^(1) = φ^(1)( Σ_j w_ij^(1) x_j + b_i^(1) )
h_i^(2) = φ^(2)( Σ_j w_ij^(2) h_j^(1) + b_i^(2) )
y_i = φ^(3)( Σ_j w_ij^(3) h_j^(2) + b_i^(3) )        (1)

Note that we distinguish φ^(1) and φ^(2) because different layers may have different activation functions.
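To make this concrete, here is a minimal NumPy sketch, not from the original notes, of an XOR network written out unit by unit in the style of equations (1). Since Figure 2 itself is not reproduced here, the particular weights below are just one choice consistent with the verbal description above.

```python
# A sketch of an XOR MLP with hard-threshold units (assumed weights, chosen to
# match the verbal description: h1 fires if at least one input is 1, h2 fires
# only if both are 1, and y fires if h1 = 1 and h2 = 0).
import numpy as np

def step(z):
    """Hard threshold at zero."""
    return (np.asarray(z) > 0).astype(float)

def xor_mlp(x1, x2):
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)   # at least one input is 1
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)   # both inputs are 1
    y = step(1.0 * h1 - 2.0 * h2 - 0.5)    # h1 and not h2
    return y

for x1 in (0.0, 1.0):
    for x2 in (0.0, 1.0):
        print(int(x1), int(x2), int(xor_mlp(x1, x2)))   # prints the XOR truth table
```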

Since all these summations and indices can be cumbersome, we usually write the computations in vectorized form. Since each layer contains multiple units, we represent the activations of all its units with an activation vector h^(l). Since there is a weight for every pair of units in two consecutive layers, we represent each layer's weights with a weight matrix W^(l). Each layer also has a bias vector b^(l). The above computations are therefore written in vectorized form as:

h^(1) = φ^(1)( W^(1) x + b^(1) )
h^(2) = φ^(2)( W^(2) h^(1) + b^(2) )
y = φ^(3)( W^(3) h^(2) + b^(3) )        (2)

When we write the activation function applied to a vector, this means it is applied independently to all the entries.

Recall how in linear regression, we combined all the training examples into a single matrix X, so that we could compute all the predictions using a single matrix multiplication. We can do the same thing here. We can store all of each layer's hidden units for all the training examples as a matrix H^(l). Each row contains the hidden units for one example. The computations are written as follows (note the transposes):

H^(1) = φ^(1)( X W^(1)ᵀ + 1 b^(1)ᵀ )
H^(2) = φ^(2)( H^(1) W^(2)ᵀ + 1 b^(2)ᵀ )
Y = φ^(3)( H^(2) W^(3)ᵀ + 1 b^(3)ᵀ )        (3)

(If it is hard to remember when a matrix or vector is transposed, fear not. You can usually figure it out by making sure the dimensions match up.)

These equations can be translated directly into NumPy code which efficiently computes the predictions over the whole dataset.
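For instance, here is a sketch (mine, not from the notes) of equations (3) in NumPy: a forward pass over a whole dataset X, one example per row. The layer sizes, random weights, and the choice of the logistic activation are placeholders purely for illustration.

```python
# Vectorized forward pass over a dataset, following equations (3).
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, D, M1, M2, K = 5, 784, 100, 50, 10           # example sizes (arbitrary)
X = rng.standard_normal((N, D))                 # one training example per row

# W(l) has shape (units out, units in), matching equations (2).
W1, b1 = 0.01 * rng.standard_normal((M1, D)), np.zeros(M1)
W2, b2 = 0.01 * rng.standard_normal((M2, M1)), np.zeros(M2)
W3, b3 = 0.01 * rng.standard_normal((K, M2)), np.zeros(K)

# Note the transposes: multiplying by W(l).T keeps one example per row, and
# adding b(l) broadcasts the bias across rows (the "1 b(l).T" term).
H1 = logistic(X @ W1.T + b1)
H2 = logistic(H1 @ W2.T + b2)
Y = H2 @ W3.T + b3                              # linear output layer, shape (N, K)
print(Y.shape)                                  # (5, 10)
```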

3 Feature Learning

We already saw that linear regression could be made more powerful using a feature mapping. For instance, the feature mapping ψ(x) = (1, x, x^2, x^3) can represent third-degree polynomials. But static feature mappings were limited because it can be hard to design all the relevant features, and because the mappings might be impractically large. Neural nets can be thought of as a way of learning nonlinear feature mappings. E.g., in Figure 1, the last hidden layer can be thought of as a feature map ψ(x), and the output layer weights can be thought of as a linear model using those features. But the whole thing can be trained end-to-end with backpropagation, which we'll cover in the next lecture. The hope is that we can learn a feature representation where the data become linearly separable.

Consider training an MLP to recognize handwritten digits. (This will be a running example for much of the course.) The input is a 28 x 28 grayscale image, and all the pixels take values between 0 and 1. We'll ignore the spatial structure, and treat each input as a 784-dimensional vector. This is a multiway classification task with 10 categories, one for each digit class. Suppose we train an MLP with two hidden layers. We can try to understand what the first layer of hidden units is computing by visualizing the weights. Each hidden unit receives inputs from each of the pixels, which means the weights feeding into each hidden unit can be represented as a 784-dimensional vector, the same as the input size. In Figure 3, we display these vectors as images.

Figure 3: Left: Some training examples from the MNIST handwritten digit dataset. Each input is a 28 x 28 grayscale image, which we treat as a 784-dimensional vector. Right: A subset of the learned first-layer features. Observe that many of them pick up oriented edges.

In this visualization, positive values are lighter, and negative values are darker. Each hidden unit computes the dot product of these vectors with the input image, and then passes the result through the activation function. So if the light regions of the filter overlap the light regions of the image, and the dark regions of the filter overlap the dark regions of the image, then the unit will activate. E.g., look at the third filter in the second row. This corresponds to an oriented edge: it detects vertical edges in the upper right part of the image. This is a useful sort of feature, since it gives information about the locations and orientations of strokes. Many of the features are similar to this; in fact, oriented edges are very commonly learned by the first layers of neural nets for visual processing tasks.

It is harder to visualize what the second layer is doing. We'll see some tricks for visualizing this in a few weeks. We'll see that higher layers of a neural net can learn increasingly high-level and complex features. Later on, we'll talk about convolutional networks, which use the spatial structure of the image.
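As a rough sketch of how a visualization like the right panel of Figure 3 might be produced (this code is mine, not from the notes): each row of a trained first-layer weight matrix W1, assumed here to have shape (number of hidden units, 784), is reshaped to 28 x 28 and shown as a grayscale image, so larger weights appear lighter.

```python
# Hypothetical visualization of first-layer weights as 28 x 28 images.
# W1 stands in for an already-trained weight matrix; here we pass random
# weights only to make the sketch runnable.
import numpy as np
import matplotlib.pyplot as plt

def show_first_layer_weights(W1, n_rows=4, n_cols=8):
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    for i, ax in enumerate(axes.ravel()):
        ax.imshow(W1[i].reshape(28, 28), cmap="gray")   # lighter = more positive
        ax.axis("off")
    plt.show()

show_first_layer_weights(np.random.randn(100, 784))
```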

4 Expressive Power

Linear models are fundamentally limited in their expressive power: they can't represent functions like XOR. Are there similar limitations for MLPs? It depends on the activation function.

4.1 Linear networks

Deep linear networks are no more powerful than shallow ones. The reason is simple: if we use the linear activation function φ(x) = x (and forget the biases for simplicity), the network's function can be expanded out as y = W^(L) W^(L-1) ... W^(1) x. But this could be viewed as a single linear layer with weights given by W = W^(L) W^(L-1) ... W^(1). Therefore, a deep linear network is no more powerful than a single linear layer, i.e. a linear model.

4.2 Universality

As it turns out, nonlinear activation functions give us much more power: under certain technical conditions, even a shallow MLP (i.e. one with a single hidden layer) can represent arbitrary functions. Therefore, we say it is universal.

Let's demonstrate universality in the case of binary inputs. We do this using the following game: suppose we're given a function mapping input vectors to outputs; we will need to produce a neural network (i.e. specify the weights and biases) which matches that function. The function can be given to us as a table which lists the output corresponding to every possible input vector. If there are D inputs, this table will have 2^D rows. An example is shown in Figure 4.

Figure 4: Designing a binary threshold network to compute a particular function.

For convenience, let's suppose these inputs are ±1, rather than 0 or 1. All of our hidden units will use a hard threshold at 0 (but we'll see shortly that these can easily be converted to soft thresholds), and the output unit will be linear. Our strategy will be as follows: we will have 2^D hidden units, each of which recognizes one possible input vector. We can then specify the function by specifying the weights connecting each of these hidden units to the outputs. For instance, suppose we want a hidden unit to recognize the input (-1, 1, -1). This can be done using the weights (-1, 1, -1) and bias -2.5, and this unit will be connected to the output unit with weight 1. (Can you come up with the general rule?) Using these weights, any input pattern will produce a set of hidden activations where exactly one of the units is active. The weights connecting the hidden units to the output can be set based on the input-output table. Part of the network is shown in Figure 4. This argument can easily be made into a rigorous proof, but this course won't be concerned with mathematical rigor.
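Here is a small sketch, not from the notes, of this construction for inputs in {-1, +1}^D: one hidden unit per possible input pattern, with weights equal to that pattern and bias -(D - 0.5), so exactly one hidden unit fires for any input, and the output unit simply reads off the corresponding table entry. The target function at the bottom is made up for the example.

```python
# Universality construction for binary (+/-1) inputs with hard-threshold hidden
# units and a linear output unit.
import itertools
import numpy as np

def build_universal_net(D, target_fn):
    patterns = np.array(list(itertools.product([-1.0, 1.0], repeat=D)))  # 2^D rows
    W_hid = patterns                              # one hidden unit per pattern
    b_hid = -(D - 0.5) * np.ones(len(patterns))   # each unit fires only on its own pattern
    w_out = np.array([target_fn(p) for p in patterns])  # read outputs off the table
    return W_hid, b_hid, w_out

def predict(x, W_hid, b_hid, w_out):
    h = (W_hid @ x + b_hid > 0).astype(float)     # exactly one unit is active
    return float(w_out @ h)

def xor12(p):
    """Example target: XOR of the first two (+/-1) inputs."""
    return float((p[0] > 0) != (p[1] > 0))

W_hid, b_hid, w_out = build_universal_net(3, xor12)
for p in itertools.product([-1.0, 1.0], repeat=3):
    x = np.array(p)
    assert predict(x, W_hid, b_hid, w_out) == xor12(x)   # matches the table everywhere
```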

Universality is a neat property, but it has a major catch: the network required to represent a given function might have to be extremely large (in particular, exponential). In other words, not all functions can be represented compactly. We desire compact representations for two reasons:

1. We want to be able to compute predictions in a reasonable amount of time.

2. We want to be able to train a network to generalize from a limited number of training examples; from this perspective, universality simply implies that a large enough network can memorize the training set, which isn't very interesting.

4.3 Soft thresholds

In the previous section, our activation function was a step function, which gives a hard threshold at 0. This was convenient for designing the weights of a network by hand. But recall from last lecture that it is very hard to directly learn a linear classifier with a hard threshold, because the loss derivatives are 0 almost everywhere. The same holds true for multilayer perceptrons. If the activation function for any unit is a hard threshold, we won't be able to learn that unit's weights using gradient descent. The solution is the same as it was in last lecture: we replace the hard threshold with a soft one. Does this cost us anything in terms of the network's expressive power? No it doesn't, because we can approximate a hard threshold using a soft threshold. In particular, if we use the logistic nonlinearity, we can approximate a hard threshold by scaling up the weights and biases: as the scale grows, the logistic output approaches a step function. (A small numerical illustration appears at the end of these notes.)

4.4 The power of depth

If shallow networks are universal, why do we need deep ones? One important reason is that deep nets can represent some functions more compactly than shallow ones. For instance, consider the parity function (on binary-valued inputs):

f_par(x_1, ..., x_D) = 1 if Σ_j x_j is odd, and 0 if it is even.        (4)

We won't prove this, but it requires an exponentially large shallow network to represent the parity function. On the other hand, it can be computed by a deep network whose size is linear in the number of inputs. Designing such a network is a good exercise.
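Finally, the numerical illustration promised in Section 4.3, again a sketch of my own rather than part of the notes: as the scale factor on a logistic unit's input grows, its output approaches the hard threshold. The scale factors below are arbitrary.

```python
# Scaling up the input of the logistic function approximates a hard threshold.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1.0, -0.1, 0.1, 1.0])
for c in (1.0, 5.0, 50.0):
    print(c, np.round(logistic(c * z), 3))
# As c grows, the outputs approach (0, 0, 1, 1), i.e. a step function at 0.
```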