Neural Networks for Classification
Andrei Alexandrescu
June 19, 2007
- Neural Networks: History
- What is a Neural Network?
- Examples of Neural Networks
- Elements of a Neural Network
Neural Networks: History
- Modeled after the human brain
- Experimentation and marketing predated theory
- Considered the forefront of the AI spring
- Suffered from the AI winter
- Theory is still not fully developed and understood today
What is a Neural Network?
- Essentially: a network of interconnected functional elements, each with several inputs and one output:
  y(x_1, ..., x_n) = f(w_1 x_1 + w_2 x_2 + ... + w_n x_n)    (1)
- The w_i are parameters (weights)
- f is the activation function
- Crucial for learning that addition is used to integrate the inputs
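The element in equation (1) can be sketched in a few lines of Python; the logistic `sigmoid` below is one possible choice of activation f, not something the definition prescribes:

```python
import math

def unit(xs, ws, f):
    """One functional element: integrate the inputs by weighted sum, apply f."""
    return f(sum(w * x for w, x in zip(ws, xs)))

# One possible activation function; the element itself does not fix f.
def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

y = unit([1.0, 0.5], [0.3, -0.2], sigmoid)   # f(0.3*1.0 - 0.2*0.5) = f(0.2)
```

Any f can be plugged in the same way; the step and softmax activations later in the deck fit the same shape.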
Examples of Neural Networks
- Logical functions with 0/1 inputs and outputs
- Fourier series:
  F(x) = Σ_{i≥0} (a_i cos(ix) + b_i sin(ix))    (2)
- Taylor series:
  F(x) = Σ_{i≥0} a_i (x - x_0)^i    (3)
- Automata
Elements of a Neural Network
- The function performed by an element
- The topology of the network
- The method used to train the weights
- The Perceptron
- Perceptron Capabilities
- Bias
- Training the Perceptron
- Algorithm
- Summary of the Simple Perceptron
The Perceptron
- n inputs, one output:
  y(x_1, ..., x_n) = f(w_1 x_1 + ... + w_n x_n)    (4)
- Oldest activation function (McCulloch/Pitts), the step function:
  f(v) = 1_{v ≥ 0}(v), i.e. 1 if v ≥ 0 and 0 otherwise    (5)
Perceptron Capabilities
- Advertised to be as extensive as the brain itself
- Can (only) distinguish between two linearly separable sets
- Smallest function it cannot decide: XOR
- Minsky's proof started the AI winter
- It was not fully understood what connected layers could do
Bias
- Notice that the decision hyperplane must go through the origin
- Could be achieved by preprocessing the input
- Not always desirable or possible
- Add a bias input:
  y(x_1, ..., x_n) = f(w_0 + w_1 x_1 + ... + w_n x_n)    (6)
- Same as an input connected to the constant 1
- We consider that ghost input implicit henceforth
Training the Perceptron
- Switch to vector notation:
  y(x) = f(w·x) = f_w(x)    (7)
- Assume we need to separate two sets of points, A and B
- The error counts misclassified points:
  E(w) = Σ_{x∈A} (1 - f_w(x)) + Σ_{x∈B} f_w(x)    (8)
- Goal: E(w) = 0
- Start from a random w and improve it
Algorithm
1. Start with a random w; set t = 0
2. Select a vector x ∈ A ∪ B
3. If x ∈ A and w·x ≤ 0, then w_{t+1} = w_t + x
4. Else if x ∈ B and w·x ≥ 0, then w_{t+1} = w_t - x
5. Conditionally go to step 2
- Guaranteed to converge iff A and B are linearly separable!
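A minimal Python sketch of the steps above, assuming the bias is folded in as a constant-1 first input as the previous slide suggests; the sets `A`, `B`, the sweep-based stopping rule, and `max_steps` are illustrative choices:

```python
import random

def train_perceptron(A, B, max_steps=10000):
    """Perceptron learning: seek w with w.x > 0 for x in A and w.x < 0
    for x in B. Converges only if A and B are linearly separable."""
    w = [random.uniform(-1, 1) for _ in range(len(A[0]))]
    for _ in range(max_steps):
        mistakes = 0
        for x, in_A in [(x, True) for x in A] + [(x, False) for x in B]:
            s = sum(wi * xi for wi, xi in zip(w, x))
            if in_A and s <= 0:               # step 3: add misclassified A point
                w = [wi + xi for wi, xi in zip(w, x)]
                mistakes += 1
            elif not in_A and s >= 0:         # step 4: subtract misclassified B point
                w = [wi - xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                     # a full clean sweep: done
            return w
    return w

# AND-like separable sets, with the constant-1 bias input prepended.
A = [(1, 1, 1)]
B = [(1, 0, 0), (1, 1, 0), (1, 0, 1)]
w = train_perceptron(A, B)
```

For separable data the convergence theorem bounds the number of corrections, so the sweep loop terminates regardless of the random starting point.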
Summary of the Simple Perceptron
- Simple training
- Limited capabilities
- Reasonably efficient training
- Simplex and linear programming are better alternatives
- A Misunderstanding of Epic Proportions
- Workings
- Capabilities
- Training Prerequisite
- Output Activation
- The Backpropagation Algorithm
- The Task
- Training. The Delta Rule
- Gradient Locality
- Regularization
- Local Minima
- Let's connect the output of a perceptron to the input of another
- What can we compute with this horizontal combination?
- (We already take vertical combination for granted)
A Misunderstanding of Epic Proportions
- Some say "two-layered network": two cascaded layers of computational units
- Some say "three-layered network": there is one extra input layer that does nothing
- Let's arbitrarily choose "three-layered": Input, Hidden, Output
Workings
- The hidden layer maps inputs into a second space: feature space, or classification space
- This makes the job of the output layer easier
Capabilities
- Each hidden unit computes a linear separation of the input space
- Several hidden units can carve a polytope in the input space
- Output units can distinguish polytope membership
- Any union of polytopes can be decided
Training Prerequisite
- The step function is bad for gradient descent techniques
- Replace it with a smooth step function, the sigmoid:
  f(v) = 1 / (1 + e^{-v})    (9)
- Notable fact: f'(v) = f(v)(1 - f(v))
- The derivative comes free once f(v) is computed, which keeps training cycles cheap
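The notable fact is easy to check numerically; a small sketch, with the finite-difference comparison purely for verification:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    """The derivative reuses the function's own value: f'(v) = f(v)(1 - f(v)).
    No separate formula involving exp is needed during backpropagation."""
    s = sigmoid(v)
    return s * (1.0 - s)

# Central finite difference as an independent check of the identity.
v = 0.7
numeric = (sigmoid(v + 1e-6) - sigmoid(v - 1e-6)) / 2e-6
```

This is why implementations cache the forward activations: the backward pass gets every f'(v) by one multiplication.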
Output Activation
- Simple binary discrimination: zero-centered sigmoid:
  f(v) = (1 - e^{-v}) / (1 + e^{-v})    (10)
- Probability distribution: softmax:
  f(v_i) = e^{v_i} / Σ_j e^{v_j}    (11)
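Both output activations in equations (10) and (11), sketched directly from their formulas; the zero-centered sigmoid is the same function as tanh(v/2), which the test below exploits:

```python
import math

def zero_centered_sigmoid(v):
    """Maps any real score into (-1, 1); algebraically equal to tanh(v/2)."""
    return (1.0 - math.exp(-v)) / (1.0 + math.exp(-v))

def softmax(vs):
    """Normalizes a score vector into a probability distribution."""
    exps = [math.exp(v) for v in vs]
    total = sum(exps)
    return [e / total for e in exps]
```

Softmax preserves the ordering of the scores, so the most activated output is also the most probable label.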
The Backpropagation Algorithm
- Works on any differentiable activation function
- Gradient descent in weight space
- Metaphor: a ball rolls on the error function's envelope
- Condition: no flat portion, or the ball would stop in indifferent equilibrium
- Some add a slight pull term:
  f(v) = (1 - e^{-v}) / (1 + e^{-v}) + cv    (12)
The Task
- Minimize the error function:
  E = (1/2) Σ_{i=1}^{p} ||o_i - t_i||^2    (13)
- where:
  o_i are the actual outputs
  t_i are the desired outputs
  p is the number of patterns
Training. The Delta Rule
- Compute the gradient:
  ∇E = (∂E/∂w_1, ..., ∂E/∂w_l)
- Update the weights:
  Δw_i = -γ ∂E/∂w_i,  i = 1, ..., l    (14)
- Expect to find a point where ∇E = 0
- The algorithm for computing ∇E is backpropagation (beyond the scope of this class)
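Backpropagation itself is out of scope, but the delta rule can still be illustrated with a numerically estimated gradient standing in for it; the quadratic `E` below is a toy stand-in for the real error surface:

```python
def delta_rule_step(E, w, gamma=0.1, h=1e-6):
    """One update Δw_i = -γ ∂E/∂w_i. The gradient is estimated by central
    differences here; backpropagation would compute it exactly and cheaply."""
    grad = []
    for i in range(len(w)):
        wp = list(w); wp[i] += h
        wm = list(w); wm[i] -= h
        grad.append((E(wp) - E(wm)) / (2 * h))
    return [wi - gamma * g for wi, g in zip(w, grad)]

# Toy error with its minimum at (1, -2); repeated steps roll toward it.
E = lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2
w = [0.0, 0.0]
for _ in range(100):
    w = delta_rule_step(E, w)
```

On this convex toy surface the iteration reaches the minimum; the Local Minima slide explains why a real network's surface offers no such guarantee.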
Gradient Locality
- Only summation guarantees locality of backpropagation
- Otherwise backpropagation would propagate errors due to one input to all inputs
- Essential to use summation for input integration!
Regularization
- Weights can grow uncontrollably
- Add a regularization term that opposes weight growth:
  Δw_i = -γ ∂E/∂w_i - α w_i    (15)
- A very important practical trick
- Also avoids overspecialization
- Forces a smoother output
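As a sketch of equation (15), the delta-rule update gains one decay term; the function name and the constants γ, α below are illustrative:

```python
def regularized_step(w, grad, gamma=0.1, alpha=0.01):
    """Delta rule plus weight decay: w_i <- w_i - gamma * dE/dw_i - alpha * w_i.
    Even with zero gradient, every weight shrinks slightly toward 0."""
    return [wi - gamma * g - alpha * wi for wi, g in zip(w, grad)]

# With a zero gradient the update is pure decay: weights move toward 0.
w = regularized_step([1.0, -2.0], [0.0, 0.0])
```

The decay acts every step, so large weights pay a proportionally larger penalty, which is what keeps them from growing without bound.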
Local Minima
- The gradient descent can stop in a local minimum
- The biggest issue with neural networks; overspecialization is the second biggest
- Convergence is not guaranteed either, but regularization helps
- Discrete Inputs
- One-Hot Encoding
- Optimizing One-Hot Encoding
- One-Hot Encoding: Interesting Tidbits
Discrete Inputs
- Many NLP applications foster discrete features
- Neural nets expect real numbers
- Smooth: similar outputs for similar inputs
- Any two discrete inputs are just as different from each other
- Treating them as integral numbers would be undemocratic
One-Hot Encoding
- One discrete feature with n values becomes n real inputs
- The i-th feature value sets the i-th input to 1 and all others to 0
- The Hamming distance between any two distinct inputs is now constant!
- Disadvantage: the input vector gets much larger
Optimizing One-Hot Encoding
- Each hidden unit has all inputs zero except the i-th one
- Even that one is just multiplied by 1
- Regroup weights by discrete input, not by hidden unit!
- Matrix w of size n × l
- Input i just copies row i to the output (virtual multiplication by 1)
- Cheap computation
- The delta rule applies as usual
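The equivalence the slide relies on can be checked directly; the matrix `W` and the sizes n = 3, l = 2 below are made up for illustration:

```python
def hidden_input_dense(onehot, W):
    """Full matrix-vector product over the one-hot vector: W has one row per
    discrete value (n rows) and one column per hidden unit (l columns)."""
    n, l = len(W), len(W[0])
    return [sum(onehot[i] * W[i][j] for i in range(n)) for j in range(l)]

def hidden_input_lookup(index, W):
    """Optimized form: a one-hot input just selects row `index` of W."""
    return list(W[index])

W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # n = 3 feature values, l = 2 hidden units
onehot = [0, 1, 0]                          # second feature value is active
```

The lookup does no multiplications at all, which is what makes large discrete vocabularies affordable.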
One-Hot Encoding: Interesting Tidbits
- The row w_i is a continuous representation of discrete feature value i
- Only one row is trained per sample
- The size of the continuous representation can be chosen depending on the feature's complexity
- This continuous representation mixes freely with truly continuous features, such as acoustic features
- Multi-Label Classification
- Soft Training
Multi-Label Classification
- n real outputs summing to 1
- Normalization is included in the softmax function:
  f(v_i) = e^{v_i} / Σ_j e^{v_j} = e^{v_i - v_max} / Σ_j e^{v_j - v_max}    (16)
- Train with 1 - ǫ for the known label and ǫ/(n - 1) for all others (avoids saturation)
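Both tricks on this slide, sketched in Python: the max-subtracted softmax of equation (16), which computes the same distribution without overflowing, and the softened target vector (the ǫ value is an illustrative choice):

```python
import math

def stable_softmax(vs):
    """Softmax with v_max subtracted from every score: mathematically the
    same distribution, but exp never sees a large positive argument."""
    m = max(vs)
    exps = [math.exp(v - m) for v in vs]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_targets(label, n, eps=0.05):
    """1 - eps for the known label, eps/(n - 1) elsewhere, so the sigmoid-like
    outputs are never pushed to their saturated extremes 0 and 1."""
    return [1 - eps if i == label else eps / (n - 1) for i in range(n)]

# Scores this large would overflow a naive softmax; the stable form is fine.
p = stable_softmax([1000.0, 1001.0])
t = smoothed_targets(2, 5)
```

Note that the smoothed targets still sum to 1, so they remain a valid distribution for the softmax outputs to match.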
Soft Training
- Maybe the targets are a known probability distribution
- Or we want to reduce the number of training cycles
- Train with the actual desired distributions as desired outputs
- Example: for feature vector x, labels l_1, l_2, l_3 are possible with equal probability
- Train with (1 - ǫ)/3 for the three labels and ǫ/(n - 3) for all others
- Language Modeling
- Lexicon Learning
- Word Sense Disambiguation
Language Modeling
- Input: n-gram context
- May include arbitrary word features (cool!)
- Output: probability distribution of the next word
- Automatically figures out which features are important
Lexicon Learning
- Input: word-level features (root, stem, morphology)
- Input: most frequent previous/next words
- Output: probability distribution over the word's possible POS tags
Word Sense Disambiguation
- Input: bag of words in context, local collocations
- Output: probability distribution over senses
- Neural nets are a respectable machine learning technique
- Theory is not fully developed
- Local optima and overspecialization are killers
- Yet they can learn very complex functions
- Long training time, short testing time, small memory requirements