Neural Networks. Aarti Singh & Barnabas Poczos. Machine Learning / Apr 24, Slides Courtesy: Tom Mitchell


1 Neural Networks. Aarti Singh & Barnabas Poczos. Machine Learning / Apr 24, 2014. Slides Courtesy: Tom Mitchell

2 Logistic Regression. Assumes the following functional form for P(Y|X): the logistic function applied to a linear function of the data, $z = w_0 + \sum_i w_i X_i$. Logistic function (or sigmoid): $\frac{1}{1+e^{-z}}$. [Figure: plot of logit(z) against z]

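3 Logistic Regression is a Linear Classifier! Assumes the following functional form for P(Y|X): the logistic function applied to $w_0 + \sum_i w_i X_i$. Decision boundary: $w_0 + \sum_i w_i X_i = 0$, where $P(Y=0 \mid X) = P(Y=1 \mid X)$ (Linear Decision Boundary)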

4 Training Logistic Regression. How to learn the parameters $w_0, w_1, \ldots, w_d$? Training data; Maximum (Conditional) Likelihood Estimates. Discriminative philosophy: don't waste effort learning P(X); focus on P(Y|X), since that's all that matters for classification!

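5 Optimizing a convex function. Max conditional log-likelihood = Min negative conditional log-likelihood. The negative conditional log-likelihood, written $\ell(w)$ here, is a convex function, so gradient descent applies (convex). Gradient: $\nabla_w \ell(w) = \left[\frac{\partial \ell(w)}{\partial w_0}, \ldots, \frac{\partial \ell(w)}{\partial w_d}\right]$. Update rule: $w_i \leftarrow w_i - \eta \frac{\partial \ell(w)}{\partial w_i}$. Learning rate: $\eta > 0$.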
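As an illustration of the gradient descent update rule named on this slide, here is a minimal sketch of batch gradient descent on the negative conditional log-likelihood. It assumes the convention P(Y=1|X) = sigmoid(w_0 + sum_i w_i X_i); the function names, the default learning rate and the iteration count are illustrative choices, not taken from the slides.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(w, x):
    """Assumed convention: P(Y=1|x) = sigmoid(w_0 + sum_i w_i x_i); w[0] is w_0."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def neg_log_likelihood(w, X, Y):
    """Negative conditional log-likelihood of labels Y (0 or 1) given inputs X."""
    total = 0.0
    for x, y in zip(X, Y):
        p = predict_prob(w, x)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def gradient(w, X, Y):
    """Gradient of the negative conditional log-likelihood with respect to w."""
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        err = predict_prob(w, x) - y      # (prediction - label)
        grad[0] += err                    # derivative with respect to w_0
        for i, xi in enumerate(x):
            grad[i + 1] += err * xi       # derivative with respect to w_i
    return grad

def gradient_descent(X, Y, eta=0.1, n_iters=200):
    """Repeat the update w_i <- w_i - eta * dE/dw_i, starting from w = 0."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iters):
        g = gradient(w, X, Y)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w
```

Because the objective is convex, a sufficiently small learning rate moves the weights toward the global minimum regardless of the starting point.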

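6 Logistic function as a Graph. [Figure: Sigmoid Unit with inputs $x_0, \ldots, x_d$ and weights $w_0, \ldots, w_d$, computing $net = \sum_{i=0}^{d} w_i x_i$ and output $o = \sigma(net) = \frac{1}{1+e^{-net}}$]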

7 Neural Networks to learn f: X → Y. f can be a non-linear function. X: (vector of) continuous and/or discrete variables. Y: (vector of) continuous and/or discrete variables. Neural networks represent f by a network of logistic/sigmoid units. [Figure: Sigmoid Unit; Output layer, Y; Hidden layer, H; Input layer, X]

8 Neural Network trained to distinguish vowel sounds using 2 formants (features). [Figure: output layer, hidden layer, input layer] Two layers of logistic units. Highly non-linear decision surface.

9 Neural Network trained to drive a car! [Figure: weights to output units from the hidden unit; weights of each pixel for one hidden unit]


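11 Prediction using Neural Networks. Prediction: given a neural network (hidden units and weights), use it to predict the label of a test point. Forward Propagation: start from the input layer; for each subsequent layer, compute the output of each sigmoid unit. Sigmoid unit: $o(x) = \sigma\left(w_0 + \sum_i w_i x_i\right)$. 1-hidden layer, 1 output NN: $o(x) = \sigma\left(w_0 + \sum_h w_h o_h\right)$, where $o_h$ is the output of hidden unit $h$.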
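Forward propagation for the 1-hidden-layer, 1-output network on this slide can be sketched as follows. This is a minimal illustration: the weight layout (bias stored first in each weight list) and the function names are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(weights, inputs):
    """Output of one sigmoid unit; weights[0] is the bias w_0."""
    return sigmoid(weights[0] + sum(w * x for w, x in zip(weights[1:], inputs)))

def forward(x, hidden_weights, output_weights):
    """Start from the input layer, then compute each subsequent layer.

    hidden_weights: one weight list per hidden unit (bias first).
    output_weights: weight list of the single output unit (bias first).
    Returns (hidden outputs o_h, network output o).
    """
    h = [unit_output(w_h, x) for w_h in hidden_weights]
    o = unit_output(output_weights, h)
    return h, o
```

With all weights set to zero, every unit outputs sigmoid(0) = 0.5, which is a quick sanity check for an implementation.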

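12 M(C)LE Training for Neural Networks. Consider a regression problem f: X → Y, for scalar Y: y = f(x) + ε, where f is deterministic and the noise ε ~ N(0, σ_ε) is assumed iid. Let's maximize the conditional data likelihood: $W \leftarrow \arg\max_W \ln \prod_l P(Y^l \mid X^l, W)$, which gives $W \leftarrow \arg\min_W \sum_l \left(y^l - \hat{f}(x^l)\right)^2$, where $\hat{f}$ is the learned neural network. Train the weights of all units to minimize the sum of squared errors of the predicted network outputs.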
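The step from maximum conditional likelihood to squared error can be spelled out as below. This is a sketch under two assumptions: $\sigma_\varepsilon$ is read as the noise standard deviation, and $\hat{f}(x; W)$ denotes the network output for weights $W$ over $N$ training examples indexed by $l$.

```latex
\begin{align*}
P(y^l \mid x^l, W)
  &= \frac{1}{\sqrt{2\pi\sigma_\varepsilon^2}}
     \exp\!\left(-\frac{\bigl(y^l - \hat{f}(x^l; W)\bigr)^2}{2\sigma_\varepsilon^2}\right) \\
\ln \prod_{l=1}^{N} P(y^l \mid x^l, W)
  &= -\frac{N}{2}\ln\bigl(2\pi\sigma_\varepsilon^2\bigr)
     - \frac{1}{2\sigma_\varepsilon^2}\sum_{l=1}^{N}\bigl(y^l - \hat{f}(x^l; W)\bigr)^2 \\
\arg\max_W \ln \prod_{l=1}^{N} P(y^l \mid x^l, W)
  &= \arg\min_W \sum_{l=1}^{N}\bigl(y^l - \hat{f}(x^l; W)\bigr)^2
\end{align*}
```

The first term of the log-likelihood does not depend on $W$, and the factor $1/(2\sigma_\varepsilon^2)$ is a positive constant, so only the sum of squared errors matters for the optimization.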

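13 MAP Training for Neural Networks. Consider a regression problem f: X → Y, for scalar Y: y = f(x) + ε, where f is deterministic and the noise is ε ~ N(0, σ_ε). Gaussian prior: P(W) = N(0, σI), so $\ln P(W) \leftrightarrow c \sum_i w_i^2$. Train the weights of all units to minimize the sum of squared errors of the predicted network outputs plus the weight magnitudes: $W \leftarrow \arg\min_W \left[ c \sum_i w_i^2 + \sum_l \left(y^l - \hat{f}(x^l)\right)^2 \right]$, where $\hat{f}$ is the learned neural network.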
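The MAP objective described on this slide, squared error plus a penalty on weight magnitudes, can be written as a small function. This is an illustrative sketch: the name `map_objective` and the flat list of weights are assumptions, and `c` is the constant from the Gaussian prior.

```python
def map_objective(weights, predictions, targets, c):
    """Sum of squared errors plus c * sum_i w_i^2 (the weight-magnitude penalty)."""
    sse = sum((y - o) ** 2 for y, o in zip(targets, predictions))
    penalty = c * sum(w ** 2 for w in weights)
    return sse + penalty
```

Setting `c = 0` recovers the plain sum of squared errors used for M(C)LE training; larger `c` pulls the weights toward zero.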

14 E: Mean Square Error. For Neural Networks, E[w] is no longer convex in w.

15 Training Neural Networks. [Figure: Sigmoid Unit, $net = \sum_{i=0}^{d} w_i x_i$, $o = \sigma(net)$] Differentiable.

16 Error Gradient for a Sigmoid Unit. [Figure: Sigmoid Unit; gradient of the error with respect to the weights, written in terms of the targets y]
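One standard way to obtain the error gradient for a single sigmoid unit is the chain-rule derivation below. It is a sketch under stated assumptions: squared error with a conventional factor of $\tfrac{1}{2}$, training examples indexed by $l$, targets $y^l$, and unit output $o^l = \sigma(net^l)$ with $net^l = \sum_{i=0}^{d} w_i x_i^l$.

```latex
\begin{align*}
E[w] &= \frac{1}{2}\sum_{l}\bigl(y^l - o^l\bigr)^2 \\
\frac{\partial E}{\partial w_i}
  &= \sum_{l}\bigl(y^l - o^l\bigr)\,\frac{\partial}{\partial w_i}\bigl(y^l - o^l\bigr)
   = -\sum_{l}\bigl(y^l - o^l\bigr)\,
      \frac{\partial o^l}{\partial net^l}\,
      \frac{\partial net^l}{\partial w_i} \\
\frac{\partial o^l}{\partial net^l} &= \sigma(net^l)\bigl(1 - \sigma(net^l)\bigr) = o^l\bigl(1 - o^l\bigr),
\qquad
\frac{\partial net^l}{\partial w_i} = x_i^l \\
\frac{\partial E}{\partial w_i}
  &= -\sum_{l}\bigl(y^l - o^l\bigr)\,o^l\bigl(1 - o^l\bigr)\,x_i^l
\end{align*}
```

The derivation relies on the sigmoid being differentiable, with derivative $\sigma(z)(1-\sigma(z))$.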

17 Using all training data D

18 (MLE) Using Forward propagation. $y_k$ = target output (label); $o_k$, $o_h$ = unit outputs (obtained by forward propagation); $w_{ij}$ = weight from i to j. Note: if i is an input variable, $o_i = x_i$.
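Using the notation on this slide, one stochastic gradient step of back-propagation for a network with one hidden layer and one sigmoid output can be sketched as below. The delta formulas are the standard ones for sigmoid units under squared error, assumed here rather than recovered from the slide; the function names and weight layout (bias first) are also illustrative.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_h, w_o):
    """Forward propagation; returns hidden outputs o_h and network output o."""
    h = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in W_h]
    o = sigmoid(w_o[0] + sum(wj * hj for wj, hj in zip(w_o[1:], h)))
    return h, o

def backprop_step(x, y, W_h, w_o, eta):
    """Update W_h and w_o in place for one example (x, y).

    Returns the squared error 0.5 * (y - o)^2 measured before the update.
    """
    h, o = forward(x, W_h, w_o)
    # Output unit: delta = o (1 - o) (y - o)
    delta_o = o * (1.0 - o) * (y - o)
    # Hidden units: delta_h = o_h (1 - o_h) * w_h * delta_o,
    # computed with the output weights as they were before this update.
    delta_h = [hj * (1.0 - hj) * w_o[j + 1] * delta_o for j, hj in enumerate(h)]
    # Weight update: w_ij <- w_ij + eta * delta_j * o_i (o_i = 1 for the bias).
    w_o[0] += eta * delta_o
    for j, hj in enumerate(h):
        w_o[j + 1] += eta * delta_o * hj
    for j, w in enumerate(W_h):
        w[0] += eta * delta_h[j]
        for i, xi in enumerate(x):
            w[i + 1] += eta * delta_h[j] * xi   # o_i = x_i for input variables
    return 0.5 * (y - o) ** 2
```

Calling `backprop_step` repeatedly over the training examples performs gradient descent on the squared error; because that error is not convex in the weights, the result can depend on the initial weights.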

19 Objective/Error no longer convex in weights

20 Dealing with Overfitting. Our learning algorithm involves a parameter n = number of gradient descent iterations. How do we choose n to optimize future error? (Note: similar issue for logistic regression, decision trees, ...) E.g., the n that minimizes the error rate of the neural net over future data.

21 Dealing with Overfitting. Our learning algorithm involves a parameter n = number of gradient descent iterations. How do we choose n to optimize future error? Separate available data into training and validation sets. Use the training set to perform gradient descent. n ← number of iterations that optimizes validation set error.
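Choosing n with a validation set can be sketched generically as below. The helper `choose_n_by_validation` and its two callables are hypothetical names for this example: `step()` is assumed to perform one gradient descent iteration on the training set, and `validation_error()` to return the current error on the validation set.

```python
def choose_n_by_validation(step, validation_error, max_iters):
    """Run up to max_iters gradient descent iterations.

    Returns (best_n, best_err): the iteration count with the lowest
    validation error seen, and that error. n = 0 means no training.
    """
    best_n, best_err = 0, validation_error()
    for n in range(1, max_iters + 1):
        step()                      # one gradient descent iteration on training data
        err = validation_error()    # error on held-out validation data
        if err < best_err:
            best_n, best_err = n, err
    return best_n, best_err
```

In practice one would also keep a copy of the weights at the best iteration, so the network can be restored to that point after training continues past it.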

22 K-fold Cross-validation. Idea: train multiple times, leaving out a disjoint subset of data each time for test. Average the test set accuracies.
Partition data into K disjoint subsets
For k=1 to K
  testdata = kth subset
  h ← classifier trained* on all data except for testdata
  accuracy(k) = accuracy of h on testdata
end
FinalAccuracy = mean of the K recorded testset accuracies
* might withhold some of this to choose number of gradient descent steps
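The K-fold procedure on this slide translates fairly directly into code. The sketch below assumes `data` is a list of examples and that the caller supplies `train(train_data)` returning a classifier and `accuracy(h, test_data)` returning a number; both callables and the interleaved partitioning scheme are illustrative assumptions.

```python
def k_fold_accuracy(data, K, train, accuracy):
    """Mean test accuracy over K disjoint folds of `data`."""
    # Partition data into K disjoint subsets (every K-th example per fold).
    folds = [data[k::K] for k in range(K)]
    scores = []
    for k in range(K):
        test_data = folds[k]
        # Train on all data except the k-th subset.
        train_data = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
        h = train(train_data)
        scores.append(accuracy(h, test_data))
    # FinalAccuracy = mean of the K recorded test set accuracies.
    return sum(scores) / K
```

Calling it with `K = len(data)` puts exactly one example in each fold, which corresponds to leave-one-out cross-validation.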

23 Leave-one-out Cross-validation. This is just k-fold cross-validation leaving out one example each iteration.
Partition data into K disjoint subsets, each containing one example
For k=1 to K
  testdata = kth subset
  h ← classifier trained* on all data except for testdata
  accuracy(k) = accuracy of h on testdata
end
FinalAccuracy = mean of the K recorded testset accuracies
* might withhold some of this to choose number of gradient descent steps

24 Dealing with Overfitting. Cross-validation. Regularization: small weights imply NN is linear (low VC dimension). Control number of hidden units: low complexity. [Figure: logistic output plotted against $\sum_i w_i x_i$]

31 [Figure: network outputs left, strt, right, up; weights including $w_0$]

32 Semantic Memory Model Based on ANNs [McClelland & Rogers, Nature 2003]. No hierarchy given. Train with assertions, e.g., Can(Canary, Fly)

33 Humans act as though they have a hierarchical memory organization. 1. Victims of Semantic Dementia progressively lose knowledge of objects, but they lose specific details first and general properties later, suggesting hierarchical memory organization. 2. Children appear to learn general categories and properties first, following the same hierarchy, top down*. [Figure: hierarchy with nodes Thing, Living, NonLiving, Plant, Animal, Fish, Bird, Canary] * Some debate remains on this.

34 Memory deterioration follows semantic hierarchy [McClelland & Rogers, Nature 2003]

36 Artificial Neural Networks: Summary. Actively used to model distributed computation in brain. Highly non-linear regression/classification. Vector-valued inputs and outputs. Potentially millions of parameters to estimate - overfitting. Hidden layers learn intermediate representations - how many to use? Prediction: forward propagation. Gradient descent (back-propagation), local minima problems. Coming back in new form as deep belief networks (probabilistic interpretation).
