Perceptron Learning Algorithm

Brandon Pitts
5 years ago

1 Perceptron Learning Algorithm
An iterative learning algorithm that finds a linear threshold function partitioning a linearly separable set of points. Assume a zero threshold value.
1) Set w(0) arbitrarily, j = 1, k = 0.
2) Pick point (x(j), d(j)).
3) If $w(k)^T x(j)\, d(j) > 0$, go to 5).
4) $w(k+1) = w(k) + \mu\, x(j)\, d(j)$, k = k + 1.
5) Increment j. If not at the end of the data, go to 2). At the end of the data, set j = 1 and check whether the last full cycle through the data changed w; if it did, go to 2).
6) Otherwise stop.
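The steps above can be sketched in Python as follows. This is a minimal illustration, not the lecturer's code: the function name `perceptron`, the `max_epochs` guard, the choice w(0) = 0, and the four sample points are all assumptions made for the example.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def perceptron(points, labels, mu=1.0, max_epochs=1000):
    """Perceptron updates with zero threshold.

    points: list of tuples x(j); labels: list of d(j) in {+1, -1}.
    Returns (w, k), where k is the number of weight updates.
    max_epochs is a safety guard (an assumption, not part of the slides),
    because the loop never ends on data that is not linearly separable.
    """
    w = [0.0] * len(points[0])      # step 1: w(0) = 0 is one arbitrary choice
    k = 0
    for _ in range(max_epochs):
        changed = False
        for x, d in zip(points, labels):          # step 2: pick (x(j), d(j))
            if d * dot(w, x) > 0:                 # step 3: correctly classified
                continue
            w = [wi + mu * d * xi for wi, xi in zip(w, x)]   # step 4: update
            k += 1
            changed = True
        if not changed:             # step 5: a full cycle left w unchanged
            return w, k             # step 6: stop
    raise RuntimeError("no separating w found within max_epochs")


# Hypothetical, linearly separable sample data.
points = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
w, k = perceptron(points, labels)
print(w, k)
```

On this sample the first point triggers the only update, after which every point satisfies the test in step 3.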

2 PLA comments
Energy function: J(w) = - (sum of synaptic strengths of misclassified points)
$w(k+1) = w(k) - \mu(k)\, \nabla J(w(k))$ (gradient descent)
Margins
Proof (Novikoff, requires margins)
Homogeneously linearly separable
Version space (weight space where feasible solutions lie)

3 Energy Function
$J(w) = -\sum_{x \in M} w^T x\, d$, where M is the set of points misclassified by w.
An optimal solution can be found with an iterative gradient descent algorithm, $w(k+1) = w(k) - \mu(k)\, \nabla J(w(k))$.
The gradient is given by $\nabla J(w(k)) = -\sum_{x \in M} x\, d$ (uses all misclassified points).
The perceptron algorithm given earlier uses an estimate of the gradient (one misclassified training input).
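As a rough illustration of the batch version, the sketch below evaluates J(w) and its gradient over all misclassified points and takes one gradient descent step. The helper name `energy_and_gradient`, the starting weight, the step size µ = 1, and the sample data are assumptions for the example; points with $w^T x\, d = 0$ are treated as misclassified, matching the strict test in the perceptron algorithm.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def energy_and_gradient(w, points, labels):
    """Return J(w) = -sum over misclassified x of w^T x d, and its gradient."""
    J = 0.0
    grad = [0.0] * len(w)
    for x, d in zip(points, labels):
        s = d * dot(w, x)
        if s <= 0:                                  # x belongs to M
            J -= s
            grad = [g - d * xi for g, xi in zip(grad, x)]
    return J, grad


# Hypothetical sample data and starting weight.
points = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
w = [1.0, -1.0]
mu = 1.0

J, grad = energy_and_gradient(w, points, labels)
w_next = [wi - mu * gi for wi, gi in zip(w, grad)]  # w(k+1) = w(k) - mu * grad
J_next, _ = energy_and_gradient(w_next, points, labels)
print(J, grad, w_next, J_next)
```

Here two points are misclassified at the start, and the single batch step drives the energy to zero; that outcome is specific to this small sample and step size.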

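4 Margins
[Figure: points x(1) and x(2) with margins γ(1) and γ(2) measured from the hyperplane with normal vector w]
Assume $\|w\| = 1$ and let the margin be $\gamma(i) = d(i)\, x(i)^T w$. Then $\gamma(i)$ is the signed distance from point x(i) to the hyperplane, positive when x(i) is classified correctly.
If the minimum margin is > 0, the training points are linearly separable.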
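A small numerical sketch of the margin definition follows. The weight vector and sample points are made up for illustration; dividing by the norm of w plays the role of the assumption $\|w\| = 1$.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def margins(w, points, labels):
    """Margins gamma(i) = d(i) x(i)^T w / ||w|| (zero threshold)."""
    norm = math.sqrt(dot(w, w))
    return [d * dot(w, x) / norm for x, d in zip(points, labels)]


# Hypothetical sample data and weight vector.
points = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
w = (3.0, 4.0)

gammas = margins(w, points, labels)
min_margin = min(gammas)
print(gammas, min_margin)
```

The minimum margin is positive for this w, so w separates the sample.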

5 Novikoff's Theorem
Let S be a nontrivial training set and assume zero threshold, with w* a solution, $\|w^*\| = 1$, and w(0) = 0.
Let $\max_k \|x(k)\| = \beta$ be finite and $\min_k d(k)\, x(k)^T w^* = \gamma > 0$.
Then the number of updates k is upper bounded by $k \le (\beta/\gamma)^2$.

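6 Proof of Novikoff's Theorem
Let (x(k-1), d(k-1)) denote the misclassified point used in the k-th update.
Lower bound on the projection onto w*:
$\langle w(k), w^* \rangle = \langle w(k-1) + \mu\, x(k-1)\, d(k-1),\, w^* \rangle \ge \langle w(k-1), w^* \rangle + \mu\gamma \ge \dots \ge k\mu\gamma.$
Upper bound on the norm (the cross term $2\mu\, d(k-1)\, w(k-1)^T x(k-1)$ is $\le 0$ because updates happen only on misclassified points):
$\|w(k)\|^2 \le \|w(k-1)\|^2 + \|\mu\, x(k-1)\|^2 \le \|w(k-1)\|^2 + \mu^2\beta^2 \le \dots \le k\mu^2\beta^2.$
Combining the two bounds with the Cauchy-Schwarz inequality:
$k\mu^2\beta^2 \|w^*\|^2 \ge \|w(k)\|^2 \|w^*\|^2 \ge \langle w(k), w^* \rangle^2 \ge (k\mu\gamma)^2.$
Since $\|w^*\| = 1$, it follows that $k \le (\beta/\gamma)^2$.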
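The bound can be checked numerically on a toy problem. The sketch below is only a sanity check under stated assumptions: the sample data, the unit-norm solution `w_star`, the helper name `count_updates`, and the `max_epochs` guard are all invented for the example. The value of γ, and hence the bound, depends on which solution w* is chosen.

```python
import math


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def count_updates(points, labels, mu=1.0, max_epochs=1000):
    """Run the perceptron from w(0) = 0 and return the number of updates k."""
    w = [0.0] * len(points[0])
    k = 0
    for _ in range(max_epochs):     # guard against non-separable data
        changed = False
        for x, d in zip(points, labels):
            if d * dot(w, x) <= 0:  # misclassified: update
                w = [wi + mu * d * xi for wi, xi in zip(w, x)]
                k += 1
                changed = True
        if not changed:
            return k
    raise RuntimeError("no separating w found within max_epochs")


# Hypothetical sample data and one unit-norm separating vector.
points = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
w_star = (0.6, 0.8)

beta = max(math.sqrt(dot(x, x)) for x in points)                # max ||x(k)||
gamma = min(d * dot(w_star, x) for x, d in zip(points, labels))  # min margin
bound = (beta / gamma) ** 2
k = count_updates(points, labels)
print(k, bound)
```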

7 Version space
The version space is the set W such that for any $w \in W$, all training points are correctly labeled.

8 Weight characteristics
If w(0) = 0, then $w^* = \sum_k \alpha(k)\, x(k)$: the weight vector is a weighted sum of the inputs.
The largest-magnitude α(k) is associated with the input that is updated the most and has the smallest margin.
A nonzero threshold is handled by adding an input fixed at 1.
Many solutions for the weight vector are possible, as observed from the version space.
PLA is an ill-posed problem.

9 Optimum Margin Classifiers
Consider methods based on optimum margin classifiers, or Support Vector Machines (SVMs).
Here we consider a different algorithm that is well posed, with a unique optimal solution.
The solution is based on finding the largest minimal margin over all training points.

10 Optimal Marginal Classifiers
Given a set of linearly separable points, which hyperplane should you choose to separate them?
Choose the hyperplane that maximizes the distance between the two sets of points.

11 Finding Optimal Hyperplane
Draw the convex hull around each set of points.
Find the shortest line segment connecting the two convex hulls.
Find the midpoint of that line segment.
The optimal hyperplane is perpendicular to the segment at its midpoint.
[Figure: optimal hyperplane, weight vector w, and margins]

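12 Alternative Characterization of Optimal Margin Classifiers
Maximizing the margins is equivalent to minimizing the magnitude of the weight vector.
Let u and v be points on the two margin hyperplanes, a distance 2m apart:
$w^T u + b = 1$
$w^T v + b = -1$
Subtracting gives $w^T (u - v) = 2$, so projecting onto the unit normal gives $w^T (u - v) / \|w\| = 2 / \|w\| = 2m$.
[Figure: weight vector w, points u and v on the margin hyperplanes, separation 2m]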

13 Quadratic Programming Problem
Problem statement:
$\max_{w,b} \min \{ \|x - x(i)\| : x \in \mathbb{R}^n,\ w^T x + b = 0,\ 1 \le i \le l \}$
which is equivalent to solving the quadratic programming (QP) problem:
$\min \tfrac{1}{2} \|w\|^2$ subject to $d(i)\, (w^T x(i) + b) \ge 1,\ \forall i$
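The sketch below does not solve the QP; it only evaluates the constraints and the objective for a few hand-picked candidates, to show how feasibility and $\tfrac{1}{2}\|w\|^2$ interact. The sample data, the candidate values, and the helper names `is_feasible` and `objective` are assumptions for the example.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def is_feasible(w, b, points, labels):
    """Check d(i) (w^T x(i) + b) >= 1 for every training point."""
    return all(d * (dot(w, x) + b) >= 1 for x, d in zip(points, labels))


def objective(w):
    """QP objective 1/2 ||w||^2."""
    return 0.5 * dot(w, w)


# Hypothetical sample data and hand-picked candidates (w, b).
points = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
candidates = [((3.0, 4.0), 0.0), ((0.5, 0.5), 0.0), ((0.25, 0.25), 0.0)]

results = [(is_feasible(w, b, points, labels), objective(w))
           for w, b in candidates]
for (w, b), (ok, value) in zip(candidates, results):
    print(w, b, ok, value)
```

The first two candidates satisfy every constraint, and the second has the smaller objective, so it is preferred between those two. The third has an even smaller norm but violates a constraint, so it is not admissible. No claim is made that any of these candidates is the QP optimum.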

14 Lagrange Multipliers
Constrained optimization can be dealt with by introducing Lagrange multipliers $\alpha(i) \ge 0$ and augmenting the objective function:
$L(w, b, \alpha) = \tfrac{1}{2} \|w\|^2 - \sum_i \alpha(i)\, \big( d(i)\, (w^T x(i) + b) - 1 \big)$

15 Support Vectors
The QP problem is convex. At the solution, $\partial L(w, b, \alpha) / \partial w = 0$ and $\partial L(w, b, \alpha) / \partial b = 0$, leading to
$\sum_i \alpha(i)\, d(i) = 0$ and $w = \sum_i \alpha(i)\, d(i)\, x(i)$.
The weight vector is expressed in terms of the subset of training vectors x(i) whose α(i) are nonzero. These are the support vectors.
By the KKT conditions, $\alpha(i)\, \big( d(i)\, (w^T x(i) + b) - 1 \big) = 0$, indicating that the support vectors lie on the margin.
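A worked two-point example may help make these conditions concrete. The training set below is invented for illustration: x(1) = (1, 0) with d(1) = +1 and x(2) = (-1, 0) with d(2) = -1.

```latex
\begin{align*}
&\text{Stationarity in } b: && \alpha(1) d(1) + \alpha(2) d(2) = 0
   \;\Rightarrow\; \alpha(1) = \alpha(2) = \alpha \\
&\text{Stationarity in } w: && w = \alpha(1) d(1) x(1) + \alpha(2) d(2) x(2)
   = 2\alpha \begin{pmatrix} 1 \\ 0 \end{pmatrix} \\
&\text{If } \alpha = 0: && w = 0 \text{ and the constraints } b \ge 1,\ -b \ge 1
   \text{ conflict, so } \alpha > 0 \\
&\text{KKT with } \alpha > 0: && d(1)\,(w^T x(1) + b) = 2\alpha + b = 1, \quad
   d(2)\,(w^T x(2) + b) = 2\alpha - b = 1 \\
&\text{Solution:} && b = 0, \quad \alpha = \tfrac{1}{2}, \quad
   w = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad
   \text{margin} = 1 / \|w\| = 1
\end{align*}
```

Both points are support vectors here, and each lies at distance 1 from the separating hyperplane, which is the vertical axis.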

16 Support Vectors
[Figure: points in red are support vectors.]

17 Dual optimization representation
Solving the primal QP problem is equivalent to solving the dual QP problem. The primal variables w, b are eliminated.
$\max_\alpha W(\alpha) = \sum_i \alpha(i) - \tfrac{1}{2} \sum_{i,j} \alpha(i)\, \alpha(j)\, d(i)\, d(j)\, x(i)^T x(j)$
subject to $\alpha(i) \ge 0$ and $\sum_i \alpha(i)\, d(i) = 0$.
The hyperplane decision function can be written as
$f(x) = \mathrm{sgn}\big( \sum_i \alpha(i)\, d(i)\, x^T x(i) + b \big)$
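The sketch below evaluates the dual objective and the decision function for a two-point toy problem. Everything specific is an assumption for illustration: the two training points, the multipliers α(1) = α(2) = 1/2 (the solution of this toy problem worked out by hand, not produced by a solver), and the helper names. The threshold b is recovered from a support vector using $d(i)\, (w^T x(i) + b) = 1$, and a score of exactly 0 is arbitrarily mapped to +1.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def dual_objective(alphas, labels, points):
    """W(alpha) = sum_i alpha(i) - 1/2 sum_ij alpha(i) alpha(j) d(i) d(j) x(i)^T x(j)."""
    n = len(alphas)
    quad = sum(alphas[i] * alphas[j] * labels[i] * labels[j]
               * dot(points[i], points[j])
               for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quad


def decide(x, alphas, labels, points, b):
    """f(x) = sgn(sum_i alpha(i) d(i) x^T x(i) + b); a zero score maps to +1."""
    score = sum(a * d * dot(x, xi)
                for a, d, xi in zip(alphas, labels, points)) + b
    return 1 if score >= 0 else -1


# Hypothetical two-point training set and its hand-derived multipliers.
points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
alphas = [0.5, 0.5]

# w = sum_i alpha(i) d(i) x(i); b from the first support vector.
w = [sum(a * d * x[j] for a, d, x in zip(alphas, labels, points))
     for j in range(2)]
b = labels[0] - dot(w, points[0])

value = dual_objective(alphas, labels, points)
pred_right = decide((3.0, 5.0), alphas, labels, points, b)
pred_left = decide((-0.5, 2.0), alphas, labels, points, b)
print(w, b, value, pred_right, pred_left)
```

For this toy problem the dual value equals the primal objective $\tfrac{1}{2}\|w\|^2$, as expected at the optimum. Note that the decision function needs only inner products between x and the training points.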

18 Comments about SVM solution
The solution can be found in the primal or the dual space. When we discuss kernels later, there are advantages to finding the solution in the dual space.
The threshold b is determined after the weight w is found.
In version space, the SVM solution is the center of the largest hypersphere that fits in the version space.
The solution can be modified to:
1) Handle training data that is not linearly separable, via slack variables.
2) Implement a nonlinear discriminant function, via kernels.
Solving quadratic programming problems
