Neural Networks. Aarti Singh. Machine Learning Nov 3, Slides Courtesy: Tom Mitchell

1 Neural Networks. Aarti Singh. Machine Learning, Nov 3, 2011. Slides Courtesy: Tom Mitchell

2 Logistic Regression. Assumes the following functional form for P(Y|X): the logistic function applied to a linear function of the data. Logistic function (or sigmoid): $\sigma(z) = \frac{1}{1 + e^{-z}}$. Features can be discrete or continuous!

3 Logistic Regression is a Linear Classifier! Assumes the following functional form for P(Y|X): the logistic function applied to a linear function of the data. Decision boundary: $w_0 + \sum_i w_i X_i = 0$ (linear decision boundary).
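Why the boundary is linear can be seen in a short derivation. This sketch assumes the common sign convention $P(Y=1 \mid X) = \sigma(w_0 + \sum_i w_i X_i)$; the slide's own convention is not recoverable from the text, and the opposite sign would only swap the two classes.

```latex
\begin{align*}
P(Y=1 \mid X) &= \sigma(z), \qquad z = w_0 + \sum_i w_i X_i, \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
\text{predict } Y=1 &\iff P(Y=1 \mid X) > P(Y=0 \mid X) \\
&\iff \sigma(z) > \tfrac{1}{2} \\
&\iff z > 0 \\
&\iff w_0 + \sum_i w_i X_i > 0
\end{align*}
```

The set where the two classes are equally likely is therefore the hyperplane $w_0 + \sum_i w_i X_i = 0$.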

4 Training Logistic Regression. How to learn the parameters $w_0, w_1, \ldots, w_d$? Training data. Maximum (conditional) likelihood estimates. Discriminative philosophy: don't waste effort learning P(X), focus on P(Y|X); that's all that matters for classification!

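5 Optimizing a convex function. Max conditional log-likelihood = min negative conditional log-likelihood. The negative conditional log-likelihood is a convex function. Gradient descent (convex). Gradient: $\nabla_w E(w) = \left[\frac{\partial E}{\partial w_0}, \ldots, \frac{\partial E}{\partial w_d}\right]$, where $E(w)$ denotes the function being minimized. Update rule: $w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$. Learning rate: $\eta > 0$.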
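The gradient descent recipe for logistic regression can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's code: the function names, the toy data, the learning rate `eta=0.5`, and the convention P(Y=1|X) = sigmoid(w_0 + sum_i w_i x_i) are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, x):
    # w[0] is the bias w_0; w[1:] pair with the features of x
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def neg_cond_log_likelihood(w, X, y):
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, t in zip(X, y):
        p = predict_proba(w, x)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total

def fit_logistic(X, y, eta=0.5, n_iters=300):
    d = len(X[0])
    w = [0.0] * (d + 1)
    losses = []
    for _ in range(n_iters):
        grad = [0.0] * (d + 1)
        for x, t in zip(X, y):
            err = predict_proba(w, x) - t  # d(loss)/d(linear score) for one example
            grad[0] += err
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        # update rule: w_i <- w_i - eta * dE/dw_i (gradient averaged over examples)
        w = [wi - eta * gi / len(X) for wi, gi in zip(w, grad)]
        losses.append(neg_cond_log_likelihood(w, X, y))
    return w, losses

# Toy, linearly separable data (hypothetical)
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
y = [1.0 if x[0] + x[1] > 0 else 0.0 for x in X]
w, losses = fit_logistic(X, y)
accuracy = sum((predict_proba(w, x) > 0.5) == (t == 1.0) for x, t in zip(X, y)) / len(X)
```

Because the objective is convex, a small enough learning rate makes the recorded loss decrease at every iteration; with neural networks that guarantee is lost.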

6 Logistic function as a Graph. [Figure: sigmoid unit]

7 Neural Networks to learn f: X → Y. f can be a non-linear function. X: (vector of) continuous and/or discrete variables. Y: (vector of) continuous and/or discrete variables. Neural networks: represent f by a network of logistic/sigmoid units. [Figure: sigmoid unit; input layer X, hidden layer H, output layer Y]

8 Neural network trained to distinguish vowel sounds using 2 formants (features). Two layers of logistic units. Highly non-linear decision surface. [Figure: input layer, hidden layer, output layer]

9 Neural network trained to drive a car! [Figure: weights of each pixel for one hidden unit; weights to output units from the hidden unit]

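11 Forward Propagation for prediction. Prediction: given a neural network (hidden units and weights), use it to predict the label of a test point. Forward propagation: start from the input layer; for each subsequent layer, compute the output of each sigmoid unit. Sigmoid unit: $o(x) = \sigma\left(w_0 + \sum_i w_i x_i\right)$. 1-hidden-layer, 1-output NN: $o(x) = \sigma\left(w_0 + \sum_h w_h \, \sigma\left(w_0^h + \sum_i w_i^h x_i\right)\right)$.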
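Forward propagation as described above can be sketched as a loop over layers. The representation below (each layer is a list of per-unit weight vectors, with the bias stored first) and the example weights are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(weights, inputs):
    # weights[0] is the bias w_0; the rest pair with the unit's inputs
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], inputs)))

def forward(layers, x):
    """Propagate x through the network.

    layers: list of layers, input side first; each layer is a list of
    weight vectors, one per sigmoid unit.
    """
    activations = list(x)  # start from the input layer
    for layer in layers:   # each subsequent layer consumes the previous outputs
        activations = [unit_output(unit_w, activations) for unit_w in layer]
    return activations

# Hypothetical 2-input, 2-hidden-unit, 1-output network
hidden = [[0.0, 1.0, -1.0], [0.5, -0.5, 0.5]]
output = [[0.0, 2.0, -2.0]]
o = forward([hidden, output], [1.0, 1.0])
```

Calling `forward([hidden], x)` returns the hidden-layer outputs alone, which is convenient when inspecting what the hidden units have learned.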

12 Training Neural Networks. [Figure: sigmoid unit] Differentiable.

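13 M(C)LE Training for Neural Networks. Consider regression problem f: X → Y, for scalar Y: $y = f(x) + \varepsilon$, where f is deterministic and the noise $\varepsilon \sim N(0, \sigma_\varepsilon)$ is i.i.d. Let's maximize the conditional data likelihood: $W \leftarrow \arg\max_W \ln \prod_l P(Y^l \mid X^l, W)$, which gives $W \leftarrow \arg\min_W \sum_l \left(y^l - \hat{f}(x^l)\right)^2$, where $\hat{f}$ is the learned neural network. Train weights of all units to minimize the sum of squared errors of predicted network outputs.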
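The step from conditional likelihood to squared error can be written out explicitly. This sketch assumes $\sigma_\varepsilon$ is the noise standard deviation, indexes training examples by $l$, and writes $\hat{f}(x; W)$ for the network output; these notational choices are assumptions, not taken from the slide.

```latex
\begin{align*}
P(y^l \mid x^l, W) &= \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon}
  \exp\!\left(-\frac{\big(y^l - \hat{f}(x^l; W)\big)^2}{2\sigma_\varepsilon^2}\right) \\
\ln \prod_l P(y^l \mid x^l, W) &= -\frac{1}{2\sigma_\varepsilon^2}
  \sum_l \big(y^l - \hat{f}(x^l; W)\big)^2 + \text{const} \\
\arg\max_W \ln \prod_l P(y^l \mid x^l, W) &=
  \arg\min_W \sum_l \big(y^l - \hat{f}(x^l; W)\big)^2
\end{align*}
```

The constant and the factor $1/(2\sigma_\varepsilon^2)$ do not depend on $W$, so they drop out of the optimization.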

14 MAP Training for Neural Networks. Consider regression problem f: X → Y, for scalar Y: $y = f(x) + \varepsilon$, where f is deterministic and the noise is $\varepsilon \sim N(0, \sigma_\varepsilon)$. Gaussian prior $P(W) = N(0, \sigma I)$, so $\ln P(W) \leftrightarrow c \sum_i w_i^2$. Train weights of all units to minimize the sum of squared errors of predicted network outputs plus weight magnitudes.
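Written as a single objective, the MAP estimate adds a weight penalty to the squared error. In this sketch $\sigma_\varepsilon$ and $\sigma$ are treated as standard deviations of the noise and of the prior, which is an assumption about the slide's notation; the constant $c$ then works out to $\sigma_\varepsilon^2 / \sigma^2$.

```latex
\begin{align*}
W_{\text{MAP}} &= \arg\max_W \Big[ \ln P(W) + \sum_l \ln P(y^l \mid x^l, W) \Big] \\
&= \arg\max_W \Big[ -\frac{1}{2\sigma^2} \sum_i w_i^2
   - \frac{1}{2\sigma_\varepsilon^2} \sum_l \big(y^l - \hat{f}(x^l; W)\big)^2 \Big] \\
&= \arg\min_W \Big[ c \sum_i w_i^2 + \sum_l \big(y^l - \hat{f}(x^l; W)\big)^2 \Big],
   \qquad c = \frac{\sigma_\varepsilon^2}{\sigma^2} > 0
\end{align*}
```

A tighter prior (smaller $\sigma$) gives a larger $c$ and therefore a stronger pull of the weights toward zero.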

15 E: mean squared error. For neural networks, E[w] is no longer convex in w.

16 Error Gradient for a Sigmoid Unit. [Figure: sigmoid unit, with the gradient derivation]
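The derivation on this slide did not survive extraction. The standard derivation for a single sigmoid unit under squared error is sketched here; the factor $\tfrac{1}{2}$, the example index $l$, and the name $\text{net}^l$ for the unit's linear input are assumptions.

```latex
\begin{align*}
E(w) &= \frac{1}{2} \sum_l \big(y^l - o^l\big)^2, \qquad
  o^l = \sigma(\text{net}^l), \qquad \text{net}^l = \sum_i w_i x_i^l \\
\frac{\partial E}{\partial w_i}
  &= -\sum_l \big(y^l - o^l\big)\, \frac{\partial o^l}{\partial \text{net}^l}\,
     \frac{\partial \text{net}^l}{\partial w_i} \\
  &= -\sum_l \big(y^l - o^l\big)\, o^l \big(1 - o^l\big)\, x_i^l
\end{align*}
```

The middle factor uses the identity $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$, which is what makes the sigmoid convenient for gradient-based training.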

17 (MLE) Using forward propagation. $y_k$ = target output (label); $o_k$, $o_h$ = unit outputs (obtained by forward propagation); $w_{ij}$ = weight from $i$ to $j$. Note: if $i$ is an input variable, $o_i = x_i$.
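With that notation, one stochastic gradient step of back-propagation for a network with a single hidden layer can be sketched as follows. The update rules used here are the standard textbook ones for squared error with sigmoid units (output delta $o_k(1-o_k)(y_k-o_k)$, hidden delta $o_h(1-o_h)\sum_k w_{hk}\delta_k$); the slide's own algorithm box is not recoverable, so the function names, network size, XOR data, and learning rate are assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W_h, W_k, x):
    xb = [1.0] + list(x)  # constant 1 feeds each unit's bias weight
    o_h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W_h]
    hb = [1.0] + o_h
    o_k = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in W_k]
    return xb, hb, o_k

def backprop_step(W_h, W_k, x, y, eta):
    """One stochastic gradient step on E = 1/2 * sum_k (y_k - o_k)^2; updates in place."""
    xb, hb, o_k = forward(W_h, W_k, x)
    # output-unit deltas
    d_k = [o * (1 - o) * (t - o) for o, t in zip(o_k, y)]
    # hidden-unit deltas, computed before any weight is changed;
    # hb[j + 1] is hidden unit j and W_k[k][j + 1] its weight to output k
    d_h = [hb[j + 1] * (1 - hb[j + 1])
           * sum(W_k[k][j + 1] * d_k[k] for k in range(len(W_k)))
           for j in range(len(W_h))]
    # w_ij <- w_ij + eta * delta_j * o_i
    for k, row in enumerate(W_k):
        for j in range(len(row)):
            row[j] += eta * d_k[k] * hb[j]
    for j, row in enumerate(W_h):
        for i in range(len(row)):
            row[i] += eta * d_h[j] * xb[i]

def squared_error(W_h, W_k, data):
    return 0.5 * sum((t - o) ** 2
                     for x, y in data
                     for o, t in zip(forward(W_h, W_k, x)[2], y))

# Hypothetical demo: XOR with 3 hidden units
random.seed(0)
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
W_k = [[random.uniform(-0.5, 0.5) for _ in range(4)]]
error_before = squared_error(W_h, W_k, data)
for _ in range(5000):
    for x, y in data:
        backprop_step(W_h, W_k, x, y, eta=0.5)
error_after = squared_error(W_h, W_k, data)
```

Comparing `error_before` with `error_after` shows how far training got. Since the error surface is not convex, a run can stall in a poor local minimum, so the result depends on the random initial weights.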

18 Using all training data D

19 Objective/Error no longer convex in weights

20 Dealing with Overfitting. Our learning algorithm involves a parameter n = number of gradient descent iterations. How do we choose n to optimize future error? (Note: similar issue for logistic regression, decision trees, ...) E.g., the n that minimizes the error rate of the neural net over future data.

21 Dealing with Overfitting. Our learning algorithm involves a parameter n = number of gradient descent iterations. How do we choose n to optimize future error? Separate the available data into a training set and a validation set. Use the training set to perform gradient descent. n ← number of iterations that optimizes validation set error.
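Choosing n on a validation set can be sketched as below. To keep the example short, a one-feature logistic model stands in for the neural network; that substitution, the synthetic data, and all function names are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(-z))

def error_rate(w, data):
    wrong = sum((sigmoid(w[0] + w[1] * x) > 0.5) != (y == 1) for x, y in data)
    return wrong / len(data)

def gradient_step(w, data, eta):
    g0 = g1 = 0.0
    for x, y in data:
        err = sigmoid(w[0] + w[1] * x) - y
        g0 += err
        g1 += err * x
    return [w[0] - eta * g0 / len(data), w[1] - eta * g1 / len(data)]

def choose_n(train, valid, max_iters=200, eta=0.5):
    """Run gradient descent on train; return the iteration count with the lowest validation error."""
    w = [0.0, 0.0]
    best_n, best_err, best_w = 0, error_rate(w, valid), list(w)
    for n in range(1, max_iters + 1):
        w = gradient_step(w, train, eta)  # the validation set never influences the gradient
        err = error_rate(w, valid)
        if err < best_err:
            best_n, best_err, best_w = n, err, list(w)
    return best_n, best_err, best_w

# Hypothetical noisy 1-D data, split into training and validation sets
random.seed(1)
points = [(x, 1 if x + random.gauss(0, 0.5) > 0 else 0)
          for x in (random.uniform(-2, 2) for _ in range(300))]
random.shuffle(points)
train, valid = points[:200], points[200:]
best_n, best_err, best_w = choose_n(train, valid)
```

Keeping a copy of the weights at the best iteration (`best_w`) avoids having to retrain for exactly `best_n` steps afterwards.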

22 K-fold Cross-validation. Idea: train multiple times, leaving out a disjoint subset of data each time for test. Average the test set accuracies.
Partition data into K disjoint subsets
For k = 1 to K
    testdata = kth subset
    h ← classifier trained* on all data except for testdata
    accuracy(k) = accuracy of h on testdata
end
FinalAccuracy = mean of the K recorded testset accuracies
* might withhold some of this to choose the number of gradient descent steps
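The K-fold procedure translates almost line for line into Python. In this sketch the training and scoring routines are passed in as functions; the majority-label classifier and the toy data are hypothetical stand-ins used only to make the example runnable.

```python
import random

def k_fold_accuracy(data, K, train_fn, accuracy_fn, seed=0):
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[k::K] for k in range(K)]  # K disjoint subsets
    accuracies = []
    for k in range(K):
        test_data = folds[k]
        # all data except the held-out fold
        train_data = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
        h = train_fn(train_data)
        accuracies.append(accuracy_fn(h, test_data))
    return sum(accuracies) / K  # mean of the K recorded test-set accuracies

# Hypothetical stand-in classifier: always predict the majority training label
def train_majority(train_data):
    labels = [y for _, y in train_data]
    return max(set(labels), key=labels.count)

def accuracy(h, test_data):
    return sum(h == y for _, y in test_data) / len(test_data)

data = [(i, 1 if i % 4 else 0) for i in range(40)]  # 30 of the 40 labels are 1
final_accuracy = k_fold_accuracy(data, K=5, train_fn=train_majority, accuracy_fn=accuracy)
loo_accuracy = k_fold_accuracy(data, K=len(data), train_fn=train_majority, accuracy_fn=accuracy)
```

Setting `K=len(data)` puts one example in each fold, which is leave-one-out cross-validation.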

23 Leave-one-out Cross-validation. This is just K-fold cross-validation leaving out one example each iteration.
Partition data into K disjoint subsets, each containing one example
For k = 1 to K
    testdata = kth subset
    h ← classifier trained* on all data except for testdata
    accuracy(k) = accuracy of h on testdata
end
FinalAccuracy = mean of the K recorded testset accuracies
* might withhold some of this to choose the number of gradient descent steps

24 Dealing with Overfitting. Cross-validation. Regularization: small weights imply the NN is close to linear (low VC dimension). Control the number of hidden units: low complexity. [Figure: logistic output plotted against $\sum_i w_i x_i$]

29 [Figure: network outputs left, straight, right, up; learned weights, including the bias $w_0$]

30 Semantic Memory Model Based on ANNs [McClelland & Rogers, Nature 2003]. No hierarchy given. Train with assertions, e.g., Can(Canary, Fly).

31 Humans act as though they have a hierarchical memory organization. 1. Victims of semantic dementia progressively lose knowledge of objects. But they lose specific details first and general properties later, suggesting hierarchical memory organization. 2. Children appear to learn general categories and properties first, following the same hierarchy, top down.* [Figure: hierarchy with nodes Thing, Living, NonLiving, Plant, Animal, Fish, Bird, Canary] Question: what learning mechanism could produce this emergent hierarchy? * Some debate remains on this.

32 Memory deterioration follows semantic hierarchy [McClelland & Rogers, Nature 2003]

35 Training Networks on Time Series. Suppose we want to predict the next state of the world, and it depends on a history of unknown length, e.g., a robot with forward-facing sensors trying to predict its next sensor reading as it moves and turns. Idea: use the hidden layer in the network to capture state history.
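One way to let a hidden layer carry history is to feed its previous output back in as an extra input at the next time step. The sketch below shows only the forward pass of such a recurrent network; the linear output unit, the weight layout, and the example weights are assumptions, and the slides do not specify this exact architecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def recurrent_step(x_t, h_prev, W_xh, W_hh, b_h, W_hy, b_y):
    # the hidden state sees the current input and the previous hidden state
    h_t = [sigmoid(b_h[j]
                   + sum(W_xh[j][i] * x_t[i] for i in range(len(x_t)))
                   + sum(W_hh[j][m] * h_prev[m] for m in range(len(h_prev))))
           for j in range(len(b_h))]
    # linear read-out of the prediction for the next reading (an assumption)
    y_t = [b_y[k] + sum(W_hy[k][j] * h_t[j] for j in range(len(h_t)))
           for k in range(len(b_y))]
    return h_t, y_t

def run_sequence(xs, params, n_hidden):
    h = [0.0] * n_hidden  # empty history at the start
    predictions = []
    for x_t in xs:
        h, y_t = recurrent_step(x_t, h, *params)
        predictions.append(y_t)
    return predictions, h

# Hypothetical weights: 1 input, 2 hidden units, 1 output
params = ([[1.0], [-1.0]],            # W_xh
          [[0.5, 0.0], [0.0, 0.5]],   # W_hh
          [0.0, 0.0],                 # b_h
          [[1.0, -1.0]],              # W_hy
          [0.0])                      # b_y
preds_a, _ = run_sequence([[1.0], [0.0]], params, n_hidden=2)
preds_b, _ = run_sequence([[-1.0], [0.0]], params, n_hidden=2)
```

Both sequences end with the same input, yet the final predictions differ because the hidden state remembers the earlier reading.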

36 Training Networks on Time Series. How can we train a recurrent net?

37 Artificial Neural Networks: Summary. Actively used to model distributed computation in the brain. Highly non-linear regression/classification. Vector-valued inputs and outputs. Potentially millions of parameters to estimate: overfitting. Hidden layers learn intermediate representations; how many to use? Prediction: forward propagation. Gradient descent (back-propagation), local minima problems. Mostly obsolete, since kernel tricks are more popular, but coming back in a new form as deep belief networks (probabilistic interpretation).
