Topics in Machine Learning Gilad Lerman School of Mathematics University of Minnesota Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng
Machine Learning - Motivation Arthur Samuel (1959): field of study that gives computers the ability to learn without being explicitly programmed At the intersection of computer science, statistics, and optimization Three categories (a soft division): Supervised learning Unsupervised learning Reinforcement learning
Difficulties Understanding the methods (requires knowledge of various areas) Understanding the data and application areas Sometimes hard to establish mathematical guarantees Sometimes hard to code and test Fast-developing area of research
Simplification To avoid such difficulties, but obtain a fine level of knowledge in 2 days, we'll follow An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani) Book is available online Plan: last 3 chapters (8-10) and a bit more.
Review Supervised learning (training and test sets) vs. unsupervised learning Examples of supervised learning: regression, classification Examples of unsupervised learning: density/function estimation, clustering, dimension reduction Recall: regression, bias-variance tradeoff, resampling (e.g., cross validation), linear and non-linear models
Quick Review of Regression and Nearest Neighbors Regression predicts a response variable Y (a quantitative variable) in terms of input variables (predictors) X_1, ..., X_p, given n samples in R^p; denote X = (X_1, ..., X_p) The regression function f(x) = E(Y | X = x) is the minimizer of the mean squared prediction error We cannot compute f exactly, since we have few if any samples at any given x
Estimating f by NN
Remarks on NN and Classification Need small p (say p ≤ 4) and sufficiently large n Nearest neighbors tend to be far away in high dimensions Can use kernel or spline smoothing Other common methods: parametric and structured models
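As a concrete illustration, here is a minimal sketch of the nearest-neighbor regression estimate of f, which averages the responses of the k closest training points; the data and the function name knn_regress are hypothetical:

```python
import numpy as np

def knn_regress(x0, X, y, k=5):
    """Estimate f(x0) = E(Y | X = x0) by averaging the responses
    of the k training points closest to x0 (Euclidean distance)."""
    dists = np.linalg.norm(X - x0, axis=1)      # distance to every training point
    nearest = np.argsort(dists)[:k]             # indices of the k nearest neighbors
    return y[nearest].mean()                    # average response in the neighborhood

# toy example: noisy observations of f(x) = sin(x) in one dimension
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
print(knn_regress(np.array([1.5]), X, y, k=10))  # should be close to sin(1.5) ~ 0.997
```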
Neighborhoods in Increasing Dimensions
More on Regression Assessing model accuracy:
More on Regression Dashed line explained later (the irreducible error) Flexibility = degrees of freedom (each square represents the method with the same color)
On Regression Error For an estimator f̂ learned on a training set, the mean squared error at x is E[(Y − f̂(X))² | X = x] Assume Y = f(X) + ε, where ε is independent noise with mean zero; then E[(Y − f̂(X))² | X = x] = E[(f(X) + ε − f̂(X))² | X = x] = E[(f(X) − f̂(X))² | X = x] + Var(ε) Var(ε) is the irreducible error E[(f(X) − f̂(X))² | X = x] is the reducible error (f̂ depends on the random training sample)
Regression Error: Bias and Variance E[(f(X) − f̂(X))² | X = x] = Var(f̂(X) | X = x) + Bias²(f̂(X) | X = x), where Bias(f̂(X) | X = x) = E[f̂(X) | X = x] − f(x) Hence E[(Y − f̂(X))² | X = x] = Var(f̂(X) | X = x) + Bias²(f̂(X) | X = x) + Var(ε)
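For completeness, a short derivation of the first identity (a standard argument, conditioning on X = x throughout): the cross term vanishes because E[f̂(x) − E f̂(x)] = 0.

```latex
\begin{aligned}
E\big[(f(x)-\hat f(x))^2\big]
  &= E\big[(\hat f(x)-E\hat f(x))^2\big]
   + \big(E\hat f(x)-f(x)\big)^2
   + 2\big(E\hat f(x)-f(x)\big)\,E\big[\hat f(x)-E\hat f(x)\big] \\
  &= \mathrm{Var}\big(\hat f(x)\big) + \mathrm{Bias}^2\big(\hat f(x)\big).
\end{aligned}
```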
Bias-Variance Tradeoff Two other tradeoffs:
Bias-Variance Tradeoff
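To see the tradeoff numerically, one can simulate it: fix a test point, refit a k-NN estimator on many independent training sets, and measure the bias² and variance of the predictions (small k = flexible, large k = rigid). The setup below (true function sin, noise level, sample sizes) is only an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                                  # true regression function
x0, n, sigma = 2.0, 100, 0.3                # test point, training size, noise level

def knn_fit_at(x0, X, y, k):
    idx = np.argsort(np.abs(X - x0))[:k]
    return y[idx].mean()

for k in (1, 5, 25, 75):                    # small k = flexible, large k = rigid
    preds = []
    for _ in range(500):                    # many independent training sets
        X = rng.uniform(0, 4, n)
        y = f(X) + sigma * rng.standard_normal(n)
        preds.append(knn_fit_at(x0, X, y, k))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print(f"k={k:3d}  bias^2={bias2:.4f}  var={var:.4f}  sum+sigma^2={bias2+var+sigma**2:.4f}")
```

The last column approximates the expected test MSE at x0, matching the decomposition Var + Bias² + Var(ε).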
Quick Review of Classification and Nearest Neighbors Classification: predicting a qualitative (categorical) response
Quick Review of Classification and Nearest Neighbors Example:
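For classification, the analogous nearest-neighbor rule takes a majority vote of the k nearest training labels; the vote fractions estimate P(Y = j | X = x), approximating the Bayes classifier. A minimal sketch, with a hypothetical function name:

```python
import numpy as np
from collections import Counter

def knn_classify(x0, X, y, k=5):
    """Classify x0 by majority vote among its k nearest training points;
    the vote fractions estimate P(Y = j | X = x0)."""
    dists = np.linalg.norm(X - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y[nearest]).most_common(1)[0][0]
```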
Chapter 9: SVM
Separation of 2 Classes by a Hyperplane Training set: n points x_i = (x_{i,1}, ..., x_{i,p}), 1 ≤ i ≤ n, with n labels y_i ∈ {−1, 1}, 1 ≤ i ≤ n A separating hyperplane (if one exists) satisfies y_i (β_0 + β_1 x_{i,1} + ... + β_p x_{i,p}) > 0 for all 1 ≤ i ≤ n
Separation of 2 Classes by a Hyperplane Example:
Separation of 2 Classes by a Hyperplane If a separating hyperplane exists, then for a test observation x*, a classifier is obtained by the sign of f(x*) = β_0 + β_1 x*_1 + ... + β_p x*_p (negative sign: class −1, positive sign: class 1) The magnitude of f(x*) provides confidence in the class assignment, since d(x*, hyperplane) = |β_0 + Σ_{i=1}^p β_i x*_i| / (Σ_{i=1}^p β_i²)^{1/2}
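In code, classification by the sign of f(x*) and the distance-based confidence read as follows (a sketch; variable names are ours):

```python
import numpy as np

def classify(x_star, beta0, beta):
    """Assign class -1 / +1 by the sign of f(x*) = beta0 + beta . x*;
    |f(x*)| / ||beta|| is the distance to the hyperplane (confidence)."""
    f = beta0 + beta @ x_star
    return np.sign(f), abs(f) / np.linalg.norm(beta)
```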
Maximal Margin Classifier
Maximal Margin Classifier The MMC is the solution of: maximize M over β_0, β_1, ..., β_p subject to Σ_{j=1}^p β_j² = 1 and y_i (β_0 + Σ_{j=1}^p β_j x_{i,j}) ≥ M for all 1 ≤ i ≤ n No explanation in the book, but immediate for a math student: with the normalization Σ β_j² = 1, the constraint says every training point lies at signed distance at least M from the hyperplane, so maximizing M maximizes the margin The actual algorithm is not discussed
Numerical Solution (following A. Ng's CS229 notes) Change of notation: y^(i) = y_i, x^(i) = (x_{i,1}, ..., x_{i,p}) Recall: the distance of (x^(i), y^(i)) to the hyperplane w^T x + b = 0 is |w^T x^(i) + b| / ||w||
Numerical Solution (following A. Ng's CS229 notes) Original problem (non-convex): maximize the geometric margin, max_{γ, w, b} γ subject to y^(i)(w^T x^(i) + b) ≥ γ for all i and ||w|| = 1 Equivalent non-convex problem via the functional margin γ̂ = γ ||w||: max_{γ̂, w, b} γ̂ / ||w|| subject to y^(i)(w^T x^(i) + b) ≥ γ̂ for all i
Numerical Solution (following A. Ng's CS229 notes) Scale w and b by the same constant so that γ̂ = 1 (no effect on the problem) and change to the convex problem (a quadratic program): min_{w, b} (1/2) ||w||² subject to y^(i)(w^T x^(i) + b) ≥ 1 for all i
Equivalent Formulation (following A. Ng's CS229 notes) Lagrangian: L(w, b, α) = (1/2) ||w||² − Σ_i α_i [y^(i)(w^T x^(i) + b) − 1] Dual: max_α Σ_i α_i − (1/2) Σ_{i,j} y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩ subject to α_i ≥ 0 and Σ_i α_i y^(i) = 0 Solution: w = Σ_i α_i y^(i) x^(i) Hence: w^T x + b = Σ_i α_i y^(i) ⟨x^(i), x⟩ + b (used later)
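A minimal numerical sketch of solving this dual with a general-purpose solver (scipy's SLSQP here; in practice a dedicated QP solver or the SMO algorithm from the CS229 notes would be used), assuming linearly separable data:

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    """Solve the dual  max_a sum(a) - 1/2 sum_ij a_i a_j y_i y_j <x_i, x_j>
    subject to a_i >= 0 and sum_i a_i y_i = 0, then recover (w, b)."""
    n = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T          # Q_ij = y_i y_j <x_i, x_j>
    obj = lambda a: 0.5 * a @ Q @ a - a.sum()          # minimize the negated dual
    grad = lambda a: Q @ a - np.ones(n)
    cons = {"type": "eq", "fun": lambda a: a @ y}      # sum_i a_i y_i = 0
    res = minimize(obj, np.zeros(n), jac=grad, bounds=[(0, None)] * n,
                   constraints=[cons], method="SLSQP")
    a = res.x
    w = (a * y) @ X                                    # w = sum_i a_i y_i x_i
    sv = np.argmax(a)                                  # index of a support vector (a_i > 0)
    b = y[sv] - w @ X[sv]                              # from y_sv (w^T x_sv + b) = 1
    return w, b, a
```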
A Non-separable Example
Non-robustness of the Maximal Margin Classifier
The Support Vector Classifier If ε_i = 0: correct side of the margin If ε_i > 0: wrong side of the margin If ε_i > 1: wrong side of the hyperplane The solution is affected only by the support vectors, i.e., observations on the margin or on the wrong side of the margin.
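For reference, the optimization problem defining the support vector classifier in the book (M the margin, C ≥ 0 a tuning budget for the slacks ε_i):

```latex
\max_{\beta_0,\beta_1,\dots,\beta_p,\;\varepsilon_1,\dots,\varepsilon_n,\;M} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p}\beta_j^2 = 1,\qquad
y_i\Big(\beta_0+\sum_{j=1}^{p}\beta_j x_{ij}\Big) \ge M(1-\varepsilon_i),\qquad
\varepsilon_i \ge 0,\quad \sum_{i=1}^{n}\varepsilon_i \le C .
```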
Concept Demonstration
More on the Optimization Problem C controls the number of observations allowed on the wrong side of the margin C controls the bias-variance trade-off The optimizer is affected only by the support vectors Increasing C, in clockwise order:
Equivalent Formulation (following A. Ng's CS229 notes) Dual: max_α Σ_i α_i − (1/2) Σ_{i,j} y^(i) y^(j) α_i α_j ⟨x^(i), x^(j)⟩ subject to 0 ≤ α_i ≤ C and Σ_i α_i y^(i) = 0 (same as before, but with the box constraint α_i ≤ C) Similarly to before, w^T x + b is a linear combination of the ⟨x, x^(i)⟩
Support Vector Machine (SVM) From linear to nonlinear boundaries by embedding into a higher-dimensional space The algorithm can be written entirely in terms of dot products Instead of embedding into a very high-dimensional space, replace the dot products with kernels
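A sketch of what this replacement looks like in code: the decision function needs only inner products with the support vectors, so swapping ⟨x, x_i⟩ for a kernel K(x, x_i) (here a radial/Gaussian kernel; gamma and the function names are our choices) yields a nonlinear boundary without ever forming the high-dimensional features explicitly:

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    """Radial (Gaussian) kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def svm_decision(x, X_sv, y_sv, alpha_sv, b, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i y_i K(x, x_i) + b, summed over the support vectors only."""
    return sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha_sv, y_sv, X_sv)) + b
```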
Clarification
More (following the book) By the solution of the SVC (recall the earlier comment), the linear classifier can be written as f(x) = β_0 + Σ_{i=1}^n α_i ⟨x, x_i⟩ Can use only the support vectors for the SVC: α_i = 0 unless x_i is a support vector, so f(x) = β_0 + Σ_{i ∈ S} α_i ⟨x, x_i⟩ For the SVM, replace the dot products with kernels: f(x) = β_0 + Σ_{i ∈ S} α_i K(x, x_i)
Demonstration
SVM for K>2 Classes OVO (One vs. One): For the training data, construct K(K−1)/2 +1/−1 classifiers, one for each pair of 2 out of the K classes. For a test point, use voting (the class with the most pairwise assignments) OVA (One vs. All): For training, construct K classifiers (class k labeled +1 vs. the rest of the classes labeled −1). For a test x*, classify according to the largest estimated f_k(x*) OVO is better when K is not too large
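A sketch of OVA with a generic binary classifier (scikit-learn's SVC is used here only as a stand-in; any classifier exposing a real-valued decision function f_k would do):

```python
import numpy as np
from sklearn.svm import SVC

def ova_fit_predict(X_train, y_train, X_test, C=1.0):
    """One-vs-All: fit one +1/-1 classifier per class, then assign each
    test point to the class with the largest decision value f_k(x*)."""
    classes = np.unique(y_train)
    scores = []
    for k in classes:
        clf = SVC(kernel="linear", C=C).fit(X_train, np.where(y_train == k, 1, -1))
        scores.append(clf.decision_function(X_test))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
```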
Chapter 8: Tree-based Methods (or CART) Decision Trees for Regression Demonstration of predicting log(salary/1000) as a function of the number of years in the major leagues and the number of hits in the previous year Terminology: leaf/terminal node, internal node, branch
Building a Decision Tree We wish to find regions (boxes) R_1, ..., R_J minimizing the RSS (residual sum of squares): Σ_{j=1}^J Σ_{i: x_i ∈ R_j} (y_i − ŷ_{R_j})², where ŷ_{R_j} is the mean response of the training observations in R_j Computationally infeasible; use instead recursive binary splitting (a top-down greedy procedure)
Recursive Binary Splitting At each node (top to bottom), determine the predictor X_j and cutoff s minimizing Σ_{i: x_i ∈ R_1(j,s)} (y_i − ŷ_{R_1(j,s)})² + Σ_{i: x_i ∈ R_2(j,s)} (y_i − ŷ_{R_2(j,s)})², where R_1(j,s) = {X : X_j < s} and R_2(j,s) = {X : X_j ≥ s}
Recursive Binary Splitting For each j = 1, ..., p, determine the cutoff s minimizing Σ_{i: x_i ∈ R_1(j,s)} (y_i − ŷ_{R_1(j,s)})² + Σ_{i: x_i ∈ R_2(j,s)} (y_i − ŷ_{R_2(j,s)})² Can be done by sorting the values of X_j and checking all n−1 consecutive pairs (x_i, x_{i+1}) (O(1) operations for each after sorting), reporting the average of x_i and x_{i+1} for the best i. Total cost is O(pn) per node (after sorting). We assumed continuous random variables (can modify for discrete ones)
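A sketch of the one-predictor split search just described: sort the values once, then sweep the n−1 candidate cutoffs while updating the left/right sums in O(1) per step (the function name and the midpoint convention are ours):

```python
import numpy as np

def best_split_1d(x, y):
    """Find the cutoff s on one predictor minimizing
    sum_{x_i < s}(y_i - ybar_L)^2 + sum_{x_i >= s}(y_i - ybar_R)^2.
    Sort once, then update the left/right sums in O(1) per candidate split."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = x.size
    tot, tot2 = y.sum(), (y ** 2).sum()
    left, left2 = 0.0, 0.0
    best = (np.inf, None)
    for i in range(n - 1):                         # split between x[i] and x[i+1]
        left += y[i]; left2 += y[i] ** 2
        m = i + 1
        # RSS of a region = sum(y^2) - (sum y)^2 / size
        rss = (left2 - left**2 / m) + (tot2 - left2 - (tot - left)**2 / (n - m))
        if rss < best[0]:
            best = (rss, 0.5 * (x[i] + x[i + 1]))  # report the midpoint as the cutoff
    return best                                    # (minimal RSS, cutoff s)
```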
More on Recursive Binary Splitting The previous process is repeated until a stopping criterion is met Predict the response by the mean of the training observations in the region the test sample belongs to
Tree Pruning Continue from page 17 of the book's slides (trees.pdf)