Image Analysis & Retrieval Lec 10 - Classification II


Slide 1: CS/EE 5590 / ENG 401 Special Topics, Spring 2018. Image Analysis & Retrieval, Lec 10 - Classification II. Zhu Li, Dept of CSEE, UMKC. Office Hour: Tue/Thr 2:30-4pm @ FH560E. Contact: lizhu@umkc.edu, Ph: x. Created using WPS Office and the EqualX LaTeX equation editor.

Slide 2: Outline
- Recap of Lecture 09: kNN classifier, GMM classifier
- HW-2 & extra credit work
- Support Vector Machine
- Summary

Slide 3: Classification in Image Retrieval
- A typical image retrieval pipeline: Image Formation -> Feature Computing -> Feature Aggregation -> Classification, drawing on a Knowledge/Data Base.

Slide 4: kNN Classifier
- kNN is a nonlinear classifier; its operation is a majority vote among the k nearest neighbors.
- Performance bound on k: the kNN error rate is no worse than 2 times the Bayes error rate; as k increases it improves, but not by much.
- Accelerating NN retrieval (see the sketch below):
  - Kd-tree: a space-partition scheme that limits the number of data-point comparison operations.
  - Kd-tree supported kNN search: approximate kNN search; kd-forest supported approximate search.
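
As an illustration of kd-tree accelerated kNN, a minimal Matlab sketch using the Statistics Toolbox; the data and variable names are illustrative assumptions, not course code, and knnsearch here performs exact kd-tree search (the approximate variants on the slide trade accuracy for further speed):

    X = randn(1000, 8);               % training features (illustrative)
    labels = randi(3, 1000, 1);       % training labels, 3 classes
    Mdl = KDTreeSearcher(X);          % build the kd-tree once
    q = randn(1, 8);                  % a query point
    idx = knnsearch(Mdl, q, 'K', 7);  % 7 nearest neighbors via the kd-tree
    pred = mode(labels(idx));         % majority vote -> predicted class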

Slide 5: kNN Classifier Distance Metric
- Mahalanobis distance as an implicit linear kernel: for x in R^d, given a projection A (p x d) with y = Ax in R^p,
  distance: $d(x_1, x_2) = \|A x_1 - A x_2\|$,
  (linear) kernel distance: $d^2(x_1, x_2) = (x_1 - x_2)^T A^T A \, (x_1 - x_2)$, i.e., the metric is $M = A^T A$.

Slide 6: Kernel and Metric
- Mahalanobis distance: corrects a non-white Gaussian to a white i.i.d. Gaussian (see getmahalmetric.m, which computes the metric via SVD).
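
A minimal sketch of what this whitening computation could look like; getmahalmetric.m is the course utility, so this reconstruction (names included) is only an assumption:

    % Whitening (Mahalanobis) projection via SVD of the covariance.
    % Save as getMahalProjection.m (hypothetical name).
    function A = getMahalProjection(X)
        C = cov(X);                          % d x d sample covariance
        [U, S, ~] = svd(C);                  % C = U*S*U' (C symmetric)
        A = diag(1 ./ sqrt(diag(S))) * U';   % whitening: A*C*A' = I
    end

    % Mahalanobis distance is then Euclidean distance after projection:
    % d = norm(A * (x1 - x2));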

Slide 7: GMM Model - Bayesian Decision Boundary
- The boundary is shaped by the relative shapes of the class covariance matrices; the posterior probability function for class i (on slide) is quadratic in x for Gaussian classes.
- Matlab: [rec_label, err] = classify(q, x, y, method); with method = 'quadratic'.
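
A hedged usage sketch of the (older Statistics Toolbox) classify call from the slide; the Fisher iris data and query points are illustrative assumptions:

    load fisheriris                          % example data (assumption)
    x = meas(:, 1:2);  y = species;          % training features and labels
    q = [5.8 2.9; 6.7 3.1];                  % query points
    % 'quadratic' fits one Gaussian per class with its own covariance,
    % giving the quadratic Bayesian decision boundary on the slide.
    [rec_label, err] = classify(q, x, y, 'quadratic');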

Slide 8: Retrieval Performance - Mean Average Precision
- MAP measures the retrieval performance across all queries.

Slide 9: MAP Example
- MAP is computed across all queries as the mean of the per-query average precision, i.e., precision averaged over the ranks at which relevant results are recalled (see the sketch below).
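
A minimal Matlab sketch of average precision for one ranked result list, with MAP as the mean over queries; the function name and data layout are assumptions:

    % rel: 0/1 relevance of each returned item, ordered by rank.
    function ap = averagePrecision(rel)
        rel  = rel(:)';                     % force row vector
        hits = cumsum(rel);                 % relevant items found so far
        prec = hits ./ (1:numel(rel));      % precision at each rank
        ap   = sum(prec .* rel) / max(1, sum(rel));
    end

    % MAP over a cell array of per-query relevance vectors:
    % map = mean(cellfun(@averagePrecision, relLists));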

Slide 10: Outline
- Recap of Lecture 09: kNN classifier, GMM classifier
- HW-2 & extra credit work
- Support Vector Machine
- Summary

Slide 11: HW-2 - Aggregation
- Data set: take the 10% CDVS landmark dataset, probably mixed with some images from Oxford and Paris.
- Algorithm:
  - SIFT: top K (K = 300, 500) SIFT features from the images.
  - Aggregation: GMM model with kd = [16, 24, 32], nc = [64, 128, 256].
  - FV re-learning: prune dimensions via a sorting, then keep the top m% of dimensions.

Slide 12: HEVC Intra-Prediction
- For modes 2~34, copy pixels from the reference according to the prediction angles.

Slide 13: Intra Prediction in HEVC
- Many more modes:
  - DC mode: copy the DC value from the neighbors.
  - Planar mode: average of the top row and left column.
  - Angular modes: pixels along a certain line.
- Like a sparse transform basis!
- Ref: Jani Lainema, Frank Bossen, Woojin Han, Junghye Min, Kemal Ugur, "Intra Coding of the HEVC Standard," IEEE Trans. Circuits Syst. Video Tech. 22(12), 2012.

Slide 14: HW-2 Extra Credit
- Machine learning to accelerate INTRA mode decision (a sketch follows):
  - Stack all top-row and left-column pixels into a 17-dimensional feature X.
  - Stack the ground-truth intra mode into Y.
  - Apply PCA on X and build a kNN classifier with a kd-tree and leaf-node label histogram.
- Relevant code will be provided.
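
A hedged Matlab sketch of this pipeline; the course will provide its own code, so the names, retained dimension, and parameter choices here are assumptions:

    % X: n x 17 reference-pixel features; Y: n x 1 ground-truth intra modes.
    [coeff, score] = pca(X);                  % PCA on the 17-dim features
    Xp  = score(:, 1:8);                      % keep, say, 8 components
    Mdl = KDTreeSearcher(Xp);                 % kd-tree over projected data
    xq  = X(1, :);                            % an illustrative query row
    q   = (xq - mean(X, 1)) * coeff(:, 1:8);  % project query like training
    idx = knnsearch(Mdl, q, 'K', 15);         % 15 nearest neighbors
    predMode = mode(Y(idx));                  % vote over neighbor labels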

Slide 15: Outline
- Recap of Lecture 09: kNN classifier, GMM classifier
- HW-2 & extra credit work
- Support Vector Machine
- Summary

Slide 16: Support Vector Machine (figure: linear vs. non-linear decision boundaries)

Slide 17: SVM Outline
- What is SVM?
- Why SVM?
- Gory details of the math behind SVM
- SVM in Matlab

Slide 18: Support Vectors
- Not all training data points are relevant to the decision boundary: only those that support the shape of the decision boundary are important.
- Recall condensing in the kNN classifier.
- (figure: linear and non-linear cases)

Slide 19: Support Vector Machine (SVM)
- Finds a decision boundary that maximizes the separation between the two classes.
- The decision boundary is determined by a small number of training data points, called support vectors.
- Can produce a non-linear boundary by replacing the inner product in the original space with a kernel function (recall the linear kernel in the Mahalanobis distance).

Slide 20: SVM Matlab Example
- Check out the Matlab tutorial: svm_tutorials.m
- (figure: training and classified points of both classes, with the support vectors marked)

Slide 21: Math Principles for SVM
- What is a good classifier/decision boundary?
- VC complexity/dimension
- SVM formulation: maximizing the gap between classes
- Solution via Lagrangian relaxation
- (portraits: L. Lagrange, V. Vapnik)

Slide 22: Decision Function
- In classification, let {(x_j, y_j), j = 1..n} be the data/features and their labels, drawn from an i.i.d. source P(x, y), where x is in R^d and y is in R.
- Decision function: f(x) -> y gives the prediction of the label for x.
- Classification loss function and structural risk function (reconstructed below).
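
The loss and risk formulas on this slide were rendered as images and lost in transcription; a hedged reconstruction in the notation of Burges' tutorial, for labels y in {-1, +1}:

    L(f(x), y) = \tfrac{1}{2} \, |f(x) - y|        \quad \text{(0-1 loss)}
    R(f) = \int L(f(x), y) \, dP(x, y)             \quad \text{(expected risk)}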

Slide 23: A Good Decision Function?
- How to choose the best decision boundary function? There are so many ways to separate two classes of data - what is the best?
- Is just minimizing R(f) good enough? (figure: a straight linear boundary vs. a gerrymandered one)

Slide 24: Empirical Risk
- A natural way to design the decision boundary function is to minimize the empirical risk function R_E(f) (reconstructed below).
- But this may overfit: the training error rate from minimizing R_E(f) may go down while the testing error actually goes up.
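
A hedged reconstruction of the (image-rendered) empirical risk on this slide - the average loss over the n training samples:

    R_E(f) = \frac{1}{n} \sum_{j=1}^{n} L(f(x_j), y_j)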

Slide 25: Better Objective - Structural Risk
- Penalize the complexity of the decision boundary function in the objective design: the structural risk function bounds the true risk by the empirical risk plus a complexity term (reconstructed below).
- The bound holds with probability (1 - eta), for n > h, where n is the number of training samples and h is the V(apnik)C(hervonenkis) complexity.
- Ref: C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Kluwer, 1998; V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
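
A hedged reconstruction of the bound, as given in Burges' tutorial: with probability $1 - \eta$,

    R(f) \le R_E(f) + \sqrt{\frac{h \left( \ln(2n/h) + 1 \right) - \ln(\eta/4)}{n}}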

Slide 26: VC Dimension - A Measure of Model Complexity
- Suppose we pick n data points and assign labels of + or - to them at random. If our model class is powerful enough to learn any association of labels with the data, it's too powerful!
- Maybe we can characterize the power of a model class by asking how many data points it can shatter, i.e., learn perfectly, for all possible assignments of labels. This number of data points is called the Vapnik-Chervonenkis (VC) dimension/complexity.
- The model does not need to shatter all sets of data points of size h; one set is sufficient.
  - For planes in 3-D, h = 4 even though 4 co-planar points cannot be shattered.

Slide 27: An Example of VC Dimension
- Suppose our model class is a linear hyperplane.
- In 2-D, we can find a plane (i.e., a line) to deal with any labeling of three points: a 2-D hyperplane shatters 3 points.
- But we cannot deal with some of the possible labelings of four points: a 2-D hyperplane (i.e., a line) does not shatter 4 points.

Slide 28: Some Examples of VC Dimension
- The VC dimension of a hyperplane in 2-D is 3; in k dimensions it is k+1.
- It's just a coincidence that the VC dimension of a hyperplane is almost identical to the number of parameters it takes to define a hyperplane.
- A sine wave, $f(x) = a \sin(bx)$, has infinite VC dimension and only 2 parameters! By choosing the phase and period carefully we can shatter any random collection of one-dimensional data points (except for nasty special cases).

Slide 29: VC Dimension Penalized Linear Classifier
- VC-dimension-penalized optimization of a hyperplane decision boundary function is equivalent to maximizing the gap along the decision plane:
  $w \cdot x_{c^+} + b \ge +1$ for positive cases,
  $w \cdot x_{c^-} + b \le -1$ for negative cases,
  with $\|w\|$ as small as possible.

Slide 30: SVM Formulation
- SVM: a linear classifier that maximizes the gap.
- Hyperplane: $\langle w, x_j \rangle + b = 0$; w is the normal vector, b is the shift.
- Canonical form constraint: w and b are scaled so that $\min_j |\langle w, x_j \rangle + b| = 1$; that is, the distance from the nearest data point to the hyperplane is the inverse of the norm of w, $1/\|w\|$.

Slide 31: SVM Gap
- SVM formulation: the gap (margin) between the two canonical hyperplanes $\langle w, x \rangle + b = \pm 1$ is $2/\|w\|$ (figure).

Slide 32: Separable SVM
- Assuming the data points are separable:
  $\langle w, x_i \rangle + b \ge +1$ for $y_i = +1$,
  $\langle w, x_i \rangle + b \le -1$ for $y_i = -1$.
- The two constraints can then be expressed jointly as $y_i(\langle w, x_i \rangle + b) \ge 1$.
- The problem formulation follows below.
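
A hedged reconstruction of the (image-rendered) primal formulation: minimize the norm of w, which maximizes the gap $2/\|w\|$, subject to every point lying on the correct side of its canonical hyperplane:

    \min_{w, b} \; \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i \left( \langle w, x_i \rangle + b \right) \ge 1, \;\; i = 1, \dots, n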

Slide 33: Lagrangian Relaxation
- Primal-dual decomposition (the Math department has a course covering this method in more detail).
- The Lagrangian is reconstructed below.
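
A hedged reconstruction of the Lagrangian for the primal above, with one multiplier per constraint:

    L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i \left( \langle w, x_i \rangle + b \right) - 1 \right], \quad \alpha_i \ge 0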

Slide 34: Primal-Dual Decomposition
- Minimizing the Lagrangian over (w, b) and maximizing over the multipliers yields the dual problem (reconstructed below).
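
A hedged reconstruction of the standard derivation: setting $\partial L/\partial w = 0$ and $\partial L/\partial b = 0$ gives $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$; substituting back produces the dual QP in the $\alpha_i$ alone:

    \max_{\alpha} \; \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, \langle x_i, x_j \rangle
    \quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \sum_{i} \alpha_i y_i = 0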

Slide 35: The Support Vector Solution (reconstructed below)
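
A hedged reconstruction of the solution on this slide: only points with $\alpha_i > 0$ (the support vectors, SV) carry weight, and b can be recovered from any support vector via $y_i(\langle w, x_i \rangle + b) = 1$:

    w = \sum_{i \in SV} \alpha_i y_i x_i, \qquad
    f(x) = \operatorname{sign}\left( \sum_{i \in SV} \alpha_i y_i \langle x_i, x \rangle + b \right)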

Slide 36: Non-Linear Case
- What if the data points are not separable by a linear decision boundary?
- Can we introduce a more flexible non-linear decision boundary? Yes, implicitly via the kernel trick: recall that the SVM dual problem and the resulting classifier both depend on the data only through inner product operations.

Slide 37: Apply the Kernel Trick
- Replace the inner product evaluation with a kernel function, implicitly mapping the features to a higher-dimensional space and yielding a non-linear decision boundary.
- Some common kernels (they must satisfy Mercer's theorem; see ref [Burges98]):
  - Polynomial: $K(x, y) = (x \cdot y + 1)^p$
  - Gaussian radial basis function: $K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$
  - Neural net (sigmoid): $K(x, y) = \tanh(\kappa \, x \cdot y - \delta)$
- The kernel parameters ($p$, $\sigma$, $\kappa$, $\delta$) must be chosen by the user.

Slide 38: Benefits of Kernel SVM
- It gives the feature space richer structure via the implicit non-linear mapping.
- The implicit dimension is usually higher, and the probability of linear separability grows with feature dimension.
- There are still a lot of heuristics to deal with in choosing kernel parameters.
- Complexity of training an SVM: a QP with an n x n Hessian.

Slide 39: SVM Math Summary
- SVM finds a linear decision boundary that maximizes the separation of the classes.
- The solution comes via a clever math trick called Lagrangian relaxation and primal-dual decomposition (which has many applications in engineering, economics, and finance).
- Interpretation of SVs: they are the data points with tight constraints, i.e., non-zero Lagrange multipliers, and only these data points shape the decision boundary's location and orientation.
- Kernel trick: both the dual problem and the decision boundary function rely on inner product evaluations, i.e., the n x n inner product matrix of the data determines the SVM. If a kernel function replaces the inner product, the data points are implicitly mapped to a (usually higher-dimensional) space with desirable structure, which can separate a linearly non-separable data set.

Slide 40: Outline
- Recap: Quiz-1; kNN classifier; GMM models and Bayesian classifier
- Classification with Support Vector Machines:
  - VC dimension and what makes a good classifier
  - A linear decision boundary that maximizes the gap
  - Solution by Lagrangian relaxation and primal-dual decomposition
  - Matlab implementation
- Summary

Slide 41: SVM Matlab - Preparing Data
- Check out: svm_tutorial.m

    if (0)  % allows manual input of data points
        figure; hold on; axis([ ]);          % axis limits elided in source
        nx = 20; ny = 20;
        fprintf('\n input x points...');
        for k = 1:nx
            [x1, x2] = ginput(1);            % click to place a point
            plot(x1, x2, '+r');
            x(k, :) = [x1, x2];
        end
        fprintf('\n input y points...');
        for k = 1:ny
            [x1, x2] = ginput(1);
            plot(x1, x2, '.k');
            y(k, :) = [x1, x2];
        end
        save svm_test_data.mat x y;
    else
        %load svm_test_data_linear.mat;
        load svm_test_data.mat;
        hold on; grid on; axis([ ]);         % axis limits elided in source
        plot(x(:,1), x(:,2), '+r');
        plot(y(:,1), y(:,2), '.k');
    end

Slide 42: SVM Matlab - Train Linear and Non-Linear SVMs

    if (0)  % train a linear svm
        svmstruct = svmtrain(data, labels, 'showplot', true);
    else    % train a non-linear svm (RBF kernel)
        svmstruct = svmtrain(data, labels, 'kernel_function', 'rbf', 'showplot', true);
    end

- Linear, RBF, and polynomial kernel SVMs (figure: linear, 3rd-order polynomial, and rbf - radial basis function - decision boundaries)

Slide 43: SVM Classification
- Classify with the SVM model:

    % test svm
    ntest = 3;
    for k = 1:ntest
        % input query data points
        [x1, x2] = ginput(1);
        hold on; grid on;
        plot(x1, x2, '*m');
        test(k, :) = [x1, x2];
    end
    classes = svmclassify(svmstruct, test, 'showplot', true);
    fprintf('%c ', classes);

Slide 44: Summary
- What is a good classifier? Not only good precision-recall performance on the training data (the empirical risk function); the model complexity must also be considered. Structural risk: penalize by the VC dimension.
- VC dimension: a good measure of model complexity - how many data points a given classifier can shatter.
- SVM:
  - For a linear hyperplane decision function, structural risk minimization is equivalent to gap maximization.
  - Solved by Lagrangian relaxation and primal-dual decomposition.
  - Support vectors: mathematically, the data points that have non-zero Lagrange multipliers associated with them.
  - Kernel trick: implicit mapping to a higher-dimensional space with richer structure; heuristic, and may carry overfitting risks (e.g., RBF).
