Neural Networks for Classification

Andrei Alexandrescu
June 19, 2007

- Neural Networks: History
- What is a Neural Network?
- Examples of Neural Networks
- Elements of a Neural Network

Neural Networks: History
- Modeled after the human brain
- Experimentation and marketing predated theory
- Considered the forefront of the AI spring
- Suffered from the AI winter
- The theory today is still not fully developed and understood

What is a Neural Network?
- Essentially: a network of interconnected functional elements, each with several inputs and one output:
  y(x_1, ..., x_n) = f(w_1 x_1 + w_2 x_2 + ... + w_n x_n)   (1)
- The w_i are parameters; f is the activation function
- It is crucial for learning that addition is used to integrate the inputs
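
Below is a minimal Python sketch of the single unit in equation (1); the step activation and the example weights/inputs are illustrative assumptions, not values from the lecture.

    # One functional element: y = f(w1*x1 + ... + wn*xn)
    def unit(weights, inputs, f):
        """Weighted sum of the inputs passed through the activation f."""
        s = sum(w * x for w, x in zip(weights, inputs))
        return f(s)

    # Illustrative activation (a step function) and a sample evaluation.
    step = lambda v: 1.0 if v >= 0 else 0.0
    print(unit([0.5, -0.3], [1.0, 2.0], step))  # f(0.5 - 0.6) = step(-0.1) = 0.0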

Examples of Neural Networks
- Logical functions with 0/1 inputs and outputs
- Fourier series: F(x) = \sum_{i >= 0} (a_i cos(ix) + b_i sin(ix))   (2)
- Taylor series: F(x) = \sum_{i >= 0} a_i (x - x_0)^i   (3)
- Automata

Elements of a Neural Network
- The function performed by an element
- The topology of the network
- The method used to train the weights

- The Perceptron
- Perceptron Capabilities
- Bias
- Training the Perceptron
- Algorithm
- Summary

The Perceptron
- n inputs, one output: y(x_1, ..., x_n) = f(w_1 x_1 + ... + w_n x_n)   (4)
- Oldest activation function (McCulloch/Pitts), the step function:
  f(v) = 1_{[0, \infty)}(v), i.e. f(v) = 1 if v >= 0 and 0 otherwise   (5)

Perceptron Capabilities
- Advertised to be as extensive as the brain itself
- Can (only) distinguish between two linearly separable sets
- Smallest function it cannot decide: XOR
- Minsky's proof started the AI winter
- It was not fully understood what connected layers could do

Bias
- Notice that the decision hyperplane must go through the origin
- This could be fixed by preprocessing the input, but that is not always desirable or possible
- Add a bias input: y(x_1, ..., x_n) = f(w_0 + w_1 x_1 + ... + w_n x_n)   (6)
- Same as an input connected to the constant 1
- We consider that ghost input implicit henceforth

Training the Perceptron
- Switch to vector notation: y(x) = f(w · x) = f_w(x)   (7)
- Assume we need to separate sets of points A and B
- Error: E(w) = \sum_{x \in A} (1 - f_w(x)) + \sum_{x \in B} f_w(x)   (8)
- Goal: E(w) = 0
- Start from a random w and improve it

Algorithm
1. Start with a random w, set t = 0
2. Select a vector x \in A \cup B
3. If x \in A and w · x <= 0, then w_{t+1} = w_t + x
4. Else if x \in B and w · x >= 0, then w_{t+1} = w_t - x
5. If misclassified points remain, go to step 2
Guaranteed to converge iff A and B are linearly separable!
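
A minimal Python sketch of the training loop above; the toy point sets, the random initialization range, and the iteration cap are illustrative assumptions.

    import random

    def train_perceptron(A, B, max_epochs=1000):
        """Perceptron rule: add misclassified points of A, subtract misclassified
        points of B, until no point is misclassified (each vector in A and B
        includes a constant bias component of 1)."""
        w = [random.uniform(-1, 1) for _ in range(len(A[0]))]
        for _ in range(max_epochs):
            updated = False
            for x in A + B:
                dot = sum(wi * xi for wi, xi in zip(w, x))
                if x in A and dot <= 0:
                    w = [wi + xi for wi, xi in zip(w, x)]
                    updated = True
                elif x in B and dot >= 0:
                    w = [wi - xi for wi, xi in zip(w, x)]
                    updated = True
            if not updated:  # every point classified correctly
                return w
        return w

    # Toy linearly separable sets; the last component is the bias input.
    A = [[2.0, 1.0, 1.0], [3.0, 2.0, 1.0]]
    B = [[-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]]
    print(train_perceptron(A, B))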

Summary
- Simple training
- Limited capabilities
- Reasonably efficient training
- Simplex and linear programming are better alternatives

- A Misunderstanding of Epic Proportions
- Workings
- Capabilities
- Training Prerequisite
- Output Activation
- The Backpropagation Algorithm
- The Task
- Training: The Delta Rule
- Gradient Locality
- Regularization
- Local Minima

- Let's connect the output of a perceptron to the input of another
- What can we compute with this horizontal combination?
- (We already take vertical combination for granted)

A Misunderstanding of Epic Proportions
- Some say two-layered network: two cascaded layers of computational units
- Some say three-layered network: there is one extra input layer that does nothing
- Let's arbitrarily choose three-layered: Input, Hidden, Output

Workings
- The hidden layer maps inputs into a second space: the feature space (classification space)
- This makes the job of the output layer easier

Capabilities
- Each hidden unit computes a linear separation of the input space
- Several hidden units can carve a polytope in the input space
- Output units can distinguish polytope membership
- Any union of polytopes can be decided
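
To make the polytope argument concrete, here is a small Python sketch of the classic XOR construction: two hidden step units carve half-planes and the output unit keeps the points inside one but outside the other. The weights are standard textbook values, not taken from the lecture.

    def step(v):
        """McCulloch-Pitts step activation."""
        return 1.0 if v >= 0 else 0.0

    def xor_net(x1, x2):
        """Two hidden units plus one output unit compute XOR."""
        h1 = step(x1 + x2 - 0.5)    # fires when x1 OR x2 is on
        h2 = step(x1 + x2 - 1.5)    # fires when x1 AND x2 are on
        return step(h1 - h2 - 0.5)  # inside the OR half-plane but outside the AND one

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", xor_net(a, b))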

Training Prerequisite
- The step function is bad for gradient descent techniques
- Replace it with a smooth step function, the sigmoid: f(v) = 1 / (1 + e^{-v})   (9)
- Notable fact: f'(v) = f(v)(1 - f(v))
- This makes training cycles cheap: the derivative comes directly from the output
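
A short Python sketch of equation (9) and its derivative identity; nothing here beyond the formulas on the slide.

    import math

    def sigmoid(v):
        """Smooth step function of equation (9)."""
        return 1.0 / (1.0 + math.exp(-v))

    def sigmoid_deriv(v):
        """Uses the identity f'(v) = f(v) * (1 - f(v))."""
        s = sigmoid(v)
        return s * (1.0 - s)

    print(sigmoid(0.0), sigmoid_deriv(0.0))  # 0.5 0.25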

Output Activation
- Simple binary discrimination: zero-centered sigmoid f(v) = (1 - e^{-v}) / (1 + e^{-v})   (10)
- Probability distribution: softmax f(v_i) = e^{v_i} / \sum_j e^{v_j}   (11)
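
A minimal Python sketch of the softmax in equation (11); the input values are illustrative.

    import math

    def softmax(v):
        """Equation (11): exponentiate and normalize into a probability distribution."""
        exps = [math.exp(vi) for vi in v]
        total = sum(exps)
        return [e / total for e in exps]

    print(softmax([1.0, 2.0, 3.0]))  # three positive values summing to 1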

The Backpropagation Algorithm
- Works on any differentiable activation function
- Gradient descent in weight space
- Metaphor: a ball rolls on the error function's envelope
- Condition: no flat portion (the ball would stop in indifferent equilibrium)
- Some add a slight pull term: f(v) = (1 - e^{-v}) / (1 + e^{-v}) + cv   (12)

The Task
- Minimize the error function: E = (1/2) \sum_{i=1}^{p} ||o_i - t_i||^2   (13)
- where the o_i are the actual outputs, the t_i the desired outputs, and p the number of patterns

Training: The Delta Rule
- Compute the gradient \nabla E = (\partial E/\partial w_1, ..., \partial E/\partial w_l)
- Update the weights: \Delta w_i = -\gamma \partial E/\partial w_i,  i = 1, ..., l   (14)
- Expect to find a point where \nabla E = 0
- The algorithm for computing \nabla E is backpropagation (beyond the scope of this class)
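
A small Python sketch of one delta-rule update on the error of equation (13). Since backpropagation itself is out of scope here, the gradient is estimated by finite differences as a stand-in; the learning rate, step size, toy linear model, and patterns are illustrative assumptions.

    def squared_error(w, model, patterns):
        """Equation (13) with scalar outputs: E = 1/2 * sum (o - t)^2."""
        return 0.5 * sum((model(w, x) - t) ** 2 for x, t in patterns)

    def delta_rule_step(w, model, patterns, gamma=0.1, h=1e-5):
        """One update w_i <- w_i - gamma * dE/dw_i; dE/dw_i is estimated
        numerically here, while backpropagation computes it exactly."""
        grad = []
        for i in range(len(w)):
            wp = list(w); wp[i] += h
            wm = list(w); wm[i] -= h
            grad.append((squared_error(wp, model, patterns)
                         - squared_error(wm, model, patterns)) / (2 * h))
        return [wi - gamma * gi for wi, gi in zip(w, grad)]

    # Toy usage: fit a single linear unit o = w0 + w1*x to two patterns.
    linear = lambda w, x: w[0] + w[1] * x
    patterns = [(0.0, 1.0), (2.0, 5.0)]  # (input, target) pairs
    w = [0.0, 0.0]
    for _ in range(200):
        w = delta_rule_step(w, linear, patterns)
    print(w)  # approaches [1.0, 2.0]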

Gradient Locality
- Only summation guarantees locality of backpropagation
- Otherwise backpropagation would propagate errors due to one input to all inputs
- It is essential to use summation for input integration!

Regularization
- Weights can grow uncontrollably
- Add a regularization term that opposes weight growth: \Delta w_i = -\gamma \partial E/\partial w_i - \alpha w_i   (15)
- A very important practical trick
- Also avoids overspecialization and forces a smoother output
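
A one-function Python sketch of the regularized update in equation (15); the gamma and alpha values are illustrative.

    def regularized_step(w, grad, gamma=0.1, alpha=0.01):
        """Equation (15): delta w_i = -gamma * dE/dw_i - alpha * w_i.
        The alpha term (weight decay) pulls every weight toward zero."""
        return [wi - gamma * gi - alpha * wi for wi, gi in zip(w, grad)]

    print(regularized_step([2.0, -1.0], [0.5, 0.0]))  # both weights shrink slightly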

Local Minima
- The gradient descent can stop in a local minimum
- This is the biggest issue with neural networks; overspecialization is the second biggest
- Convergence is not guaranteed either, but regularization helps

Discrete Inputs
- One-Hot Encoding
- Optimizing One-Hot Encoding
- One-Hot Encoding: Interesting Tidbits

- Many NLP applications involve discrete features
- Neural nets expect real numbers
- Smooth: similar outputs for similar inputs
- Any two discrete inputs are just as different from one another
- Treating them as integer codes would be undemocratic (it imposes an arbitrary ordering and distance)

One-Hot Encoding
- One discrete feature with n values becomes n real inputs
- The i-th feature value sets the i-th input to 1 and all others to 0
- The Hamming distance between any two distinct inputs is now constant!
- Disadvantage: the input vector becomes much larger
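
A tiny Python sketch of one-hot encoding as described above; the example index and size are illustrative.

    def one_hot(index, n):
        """Encode a discrete feature value (0 <= index < n) as n real inputs."""
        return [1.0 if i == index else 0.0 for i in range(n)]

    print(one_hot(2, 5))  # [0.0, 0.0, 1.0, 0.0, 0.0]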

Optimizing One-Hot Encoding
- Each hidden unit has all inputs zero except the i-th one, and even that one is just multiplied by 1
- Regroup the weights by discrete input, not by hidden unit!
- Weight matrix W of size n x l
- Input i just copies row i to the output (virtual multiplication by 1)
- Cheap computation; the delta rule applies as usual
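
A Python sketch of this optimization: with one-hot inputs the matrix product reduces to copying a row, so the weights are stored as one row of length l per discrete value. The matrix contents are illustrative.

    # W has n rows (one per discrete value), each of length l, following the slide.
    def hidden_contribution(W, i):
        """One-hot input i contributes exactly row i of W: a table lookup,
        no multiplications needed."""
        return W[i]

    W = [[0.1, -0.2, 0.4],   # row for discrete value 0
         [0.0,  0.3, -0.1],  # row for discrete value 1
         [0.2,  0.2,  0.2]]  # row for discrete value 2
    print(hidden_contribution(W, 1))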

One-Hot Encoding: Interesting Tidbits
- The row w_i is a continuous representation of discrete feature value i
- Only one row is trained per sample
- The size of the continuous representation can be chosen depending on the feature's complexity
- This continuous representation mixes freely with truly continuous features, such as acoustic features

- Multi-Label Classification
- Soft Training

Multi-Label Classification
- n real outputs summing to 1
- Normalization is included in the softmax function:
  f(v_i) = e^{v_i} / \sum_j e^{v_j} = e^{v_i - v_max} / \sum_j e^{v_j - v_max}   (16)
- Train with 1 - ε for the known label and ε/(n - 1) for all the others (avoids saturation)
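
A Python sketch of the numerically stable softmax of equation (16) and of the smoothed 1 - ε targets; the ε value is an illustrative assumption.

    import math

    def stable_softmax(v):
        """Equation (16): subtracting max(v) leaves the result unchanged
        but keeps exp() from overflowing."""
        vmax = max(v)
        exps = [math.exp(vi - vmax) for vi in v]
        total = sum(exps)
        return [e / total for e in exps]

    def smoothed_targets(label, n, eps=0.05):
        """1 - eps for the known label, eps/(n - 1) for every other label."""
        return [1.0 - eps if i == label else eps / (n - 1) for i in range(n)]

    print(stable_softmax([1000.0, 1001.0, 1002.0]))  # no overflow
    print(smoothed_targets(1, 4))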

Soft Training
- Maybe the targets are a known probability distribution, or we want to reduce the number of training cycles
- Train with the actual desired distributions as the desired outputs
- Example: for feature vector x, labels l_1, l_2, l_3 are possible with equal probability
- Train with (1 - ε)/3 for each of the three, ε/(n - 3) for all others

- Language Modeling
- Lexicon Learning
- Word Sense Disambiguation

Language Modeling
- Input: n-gram context; may include arbitrary word features (cool!!!)
- Output: probability distribution of the next word
- Automatically figures out which features are important

Lexicon Learning
- Input: word-level features (root, stem, morphology)
- Input: most frequent previous/next words
- Output: probability distribution over the word's possible parts of speech

Word Sense Disambiguation
- Input: bag of words in context, local collocations
- Output: probability distribution over senses

- Neural nets are a respectable machine learning technique
- The theory is not fully developed
- Local optima and overspecialization are the killers
- Yet they can learn very complex functions
- Long training time, short testing time, small memory requirements