Content-based image and video analysis. Machine learning

1 Content-based image and video analysis: Machine learning for multimedia retrieval

2 What is machine learning?
Some problems are very hard to solve by writing a computer program by hand. Almost all problems in content-based retrieval fall into this category:
- Object detection in images
- Genre classification for videos
- Speech recognition
- and many more
Machine learning is concerned with developing generic algorithms that can solve such problems by learning from example data.

3 Classification
An important task in machine learning is classification: given an input pattern x, assign it to a class ω_i.
Example: given an image, assign the label face or non-face.
- x can be an image, a video, or (more commonly) any feature vector extracted from them
- ω_i is the desired (discrete) class label
- If the class label is a real number or a vector, the task is called regression
ML approach: use example patterns with given class labels to automatically learn a classifier.
Many tasks in multimedia retrieval can be posed as classification tasks: genre classification, object detection, high-level feature detection.

4 Classification pipeline
Data acquisition → Preprocessing → Segmentation → Feature extraction → Learning / Classification
Problems or errors in earlier stages can usually not be fixed in later stages!

5 Curse of dimensionality
In computer vision, the extracted feature vectors are often high-dimensional. Many intuitions about linear algebra are no longer valid in high-dimensional spaces, and classifiers often work better in low-dimensional spaces.
The problems that arise in high-dimensional spaces are collectively called the curse of dimensionality.

6 Sample densities
Consider an interval sampled with 5 samples. To sample with the same distance between samples (along each coordinate axis) we need:
1D: 5
2D: 5² = 25
3D: 5³ = 125
dD: 5^d
The number of samples needed for the same sample density increases exponentially with the dimensionality!

7 Unit hypercube vs. unit hypersphere
Volume of the unit hypercube: V_c(d) = 2^d
Volume of the unit hypersphere: V_s(d) = π^(d/2) / Γ(1 + d/2)
Ratio of volumes V_s(d) : V_c(d):
2D: π : 4 = 78.54 %
3D: 4π/3 : 8 = 52.36 %
8D: π⁴/24 : 256 = 1.59 %
The volume of the unit hypersphere is only a tiny fraction of the volume of the unit hypercube in high dimensions: almost all of the space is concentrated in the corners!
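These ratios are easy to verify numerically; a short sketch in plain Python (standard library only) that evaluates the two volume formulas above:

```python
from math import pi, gamma

def sphere_to_cube_ratio(d):
    """Ratio V_s(d) / V_c(d) of the unit hypersphere to the unit hypercube."""
    v_sphere = pi ** (d / 2) / gamma(1 + d / 2)  # V_s(d) = pi^(d/2) / Gamma(1 + d/2)
    v_cube = 2 ** d                              # V_c(d) = 2^d (side 2, radius 1)
    return v_sphere / v_cube

for d in (2, 3, 8, 20):
    print(f"{d:2d}D: {100 * sphere_to_cube_ratio(d):.4f} %")
```

By d = 20 the ratio is already far below one millionth of a percent, which is the "corners" effect in numbers.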

8 Dimensionality reduction: PCA
PCA: project the data onto the directions u_1, u_2, ... of largest variance and leave out the remaining dimensions, minimizing the reconstruction error.
[Figure: 2D point cloud with the principal directions u_1 and u_2]
(see Duda & Hart and the slides on pattern recognition in the WS lecture)
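To make the idea concrete, a minimal NumPy sketch of PCA via eigendecomposition of the sample covariance (one standard formulation; library code would more typically use an SVD):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k principal directions.

    Minimal sketch: eigendecomposition of the sample covariance matrix.
    """
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)             # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # k largest-variance directions
    return X_centered @ top                            # reduced representation

X = np.random.default_rng(0).normal(size=(200, 10))   # toy data
print(pca(X, 2).shape)                                 # (200, 2)
```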

9 Dimensionality reduction: LDA
LDA: maximize class separability.
[Figure: the same 2D data (axes x_1, x_2) projected onto the directions chosen by PCA and by LDA]
(see Duda & Hart and the slides on pattern recognition in the WS lecture)
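For two classes, the Fisher criterion behind LDA has a closed-form solution; a minimal sketch, assuming the within-class scatter matrix S_w is invertible (the function name is ours, for illustration):

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Two-class Fisher discriminant direction: w = S_w^{-1} (m1 - m0)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S_w = np.cov(X0, rowvar=False) * (len(X0) - 1) \
        + np.cov(X1, rowvar=False) * (len(X1) - 1)   # within-class scatter
    return np.linalg.solve(S_w, m1 - m0)
```

Projecting onto this single direction gives the 1D representation with maximal class separation in the Fisher sense.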

10 Classification process
Training: training feature vectors with class labels (x_j, ω_j) → training algorithm → classifier parameters
Classification: test feature vector x_i → classification algorithm → class label ω_i

11 Bayesian classification
We are given a feature vector x and want to know which class ω_i is most likely given x. Use Bayes' rule:
P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x)
posterior = likelihood × prior / normalization factor, with p(x) = Σ_i p(x | ω_i) P(ω_i)
Decide for the class ω_i with maximum posterior probability.
If we know p(x | ω_i) and P(ω_i), we can just compute the probabilities; it can be shown that this is the optimal solution.
Problem: p(x | ω_i) (and to a lesser degree P(ω_i)) is usually unknown and often hard to estimate from data.
The priors describe what we know about the classes before observing anything. They can be used to model prior knowledge and are sometimes easy to estimate (by counting).
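A tiny sketch of the decision rule, with made-up likelihood and prior values purely for illustration:

```python
import numpy as np

def posteriors(likelihoods, priors):
    """Bayes' rule for one feature vector x.

    likelihoods[i] = p(x | omega_i), priors[i] = P(omega_i);
    returns P(omega_i | x) for all classes.
    """
    joint = np.asarray(likelihoods) * np.asarray(priors)  # p(x|w_i) * P(w_i)
    return joint / joint.sum()                            # divide by p(x)

# Two classes with invented numbers: the decision goes to class 0.
print(posteriors([0.20, 0.05], [0.4, 0.6]))  # approx. [0.727, 0.273] -> omega_0
```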

12 Gaussian classification
Assumption: p(x | ω_i) ~ N(μ, Σ), i.e.
p(x | ω_i) = (2π)^(-d/2) |Σ|^(-1/2) exp(-(1/2) (x - μ)^T Σ^(-1) (x - μ))
This makes estimation easier: only μ and Σ have to be estimated.
To reduce the number of parameters, the covariance matrix can be restricted:
- Diagonal matrix: dimensions uncorrelated
- Multiple of the unit matrix: dimensions uncorrelated, with the same variance
Problem: if the assumptions do not hold, the model does not represent reality well, and performance will decrease (often severely).
Estimation of μ and Σ by maximum likelihood (ML): use the parameters that best explain the data (highest likelihood):
l(μ, Σ) = p(data | μ, Σ) = p(x_0, x_1, ..., x_n | μ, Σ) = p(x_0 | μ, Σ) · p(x_1 | μ, Σ) ··· p(x_n | μ, Σ)
log l(μ, Σ) = log p(x_0 | μ, Σ) + ... + log p(x_n | μ, Σ)
Maximize log l(μ, Σ) over μ and Σ.
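A short NumPy sketch of the ML estimates and the log-density above (the helper names are ours):

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood estimates of mu and Sigma from the rows of X."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)  # ML uses 1/n, not 1/(n-1)
    return mu, Sigma

def log_density(x, mu, Sigma):
    """log N(x; mu, Sigma), written out from the formula above."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.solve(Sigma, diff)   # Mahalanobis term
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)
```

A Gaussian classifier then fits one (μ_i, Σ_i) per class and decides for the class maximizing log p(x | ω_i) + log P(ω_i).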

13 Gaussian example
[Figure: random samples drawn from a Gaussian distribution]

14 Gaussian example
[Figure: estimation of a Gaussian with a multiple of the unit matrix as covariance]

15 Gaussian example
[Figure: estimation of a Gaussian with a diagonal covariance matrix]

16 Gaussian example
[Figure: estimation of a Gaussian with a full covariance matrix]

17 Gaussian example
[Figure: random samples drawn from a non-Gaussian distribution]

18 Gaussian example
Note that the region where the probability density of the model is highest has almost no data points!
[Figure: estimation of a Gaussian with a full covariance matrix on the non-Gaussian data]

19 Gaussian Mixture Models (GMMs)
Approximate the true density function by a weighted sum of several Gaussians:
p(x) = Σ_i w_i · (2π)^(-d/2) |Σ_i|^(-1/2) exp(-(1/2) (x - μ_i)^T Σ_i^(-1) (x - μ_i)), with Σ_i w_i = 1
Any density can be approximated this way with arbitrary precision, but this might need many Gaussians, and many parameters are difficult to estimate:
- Restrict the covariance matrices as before
- Or use only one shared covariance matrix for all Gaussians
Now the parameters of the Gaussians as well as the weights need to be estimated: Expectation Maximization (EM) algorithm.

20 GMM example
[Figure: random samples drawn from a non-Gaussian distribution]

21 GMM example
[Figure: estimated GMM with full covariance matrices]

22 Expectation Maximization (EM)
If we knew which Gaussian each data point belongs to, estimation would be simple. We don't know exactly, but we can estimate it.
EM algorithm:
- Initialize the parameters of the GMM randomly
- Repeat until convergence:
  - Expectation (E) step: compute the probability p_ij that data point i belongs to Gaussian j (take the value of each Gaussian at point i and normalize so that the values sum to one)
  - Maximization (M) step: compute new GMM parameters using the soft assignments p_ij (maximum likelihood with the data weighted according to p_ij)
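A compact NumPy/SciPy sketch of this EM loop for full-covariance GMMs; a teaching sketch, with no convergence test or numerical safeguards:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """EM for a GMM with full covariance matrices (minimal sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(k, 1.0 / k)
    means = X[rng.choice(n, k, replace=False)]        # random init from the data
    covs = np.array([np.cov(X, rowvar=False)] * k)

    for _ in range(n_iter):
        # E step: responsibility p_ij of Gaussian j for point i
        resp = np.column_stack([
            w * multivariate_normal.pdf(X, m, c)
            for w, m, c in zip(weights, means, covs)])
        resp /= resp.sum(axis=1, keepdims=True)       # normalize per point

        # M step: weighted maximum-likelihood updates
        n_j = resp.sum(axis=0)
        weights = n_j / n
        means = (resp.T @ X) / n_j[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / n_j[j]
    return weights, means, covs
```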

23 GMM applets
M.html

24 Some taxonomy: parametric vs. non-parametric
Gaussians and GMMs are called parametric classifiers: they assume a specific form of probability distribution with some parameters; then only the parameters need to be estimated.
Methods that do not assume a specific form of probability distribution are called non-parametric. Examples: Parzen windows, k-nearest neighbors.
Properties of parametric classifiers:
+ Need less training data, because there are fewer parameters to estimate
- Only work well if the model fits the data
Properties of non-parametric classifiers:
+ Work well for all types of distributions
- Need more data to correctly estimate the distribution

25 Some taxonomy: generative vs. discriminative
A method that models P(ω_i) and p(x | ω_i) explicitly is called a generative model: p(x | ω_i) allows us to generate new samples of class ω_i.
The other common approach is discriminative models: they directly model P(ω_i | x) or just output a decision ω_i given an input pattern x.
Often, discriminative models are easier to train because they solve a simpler problem. However, they can become complicated with a large number of classes N:
- One-vs-all binary classifiers (N of them)
- One-vs-one binary classifiers; the voting is sometimes ambiguous
- Real multi-class discriminative classifiers are sometimes possible

26 Linear discriminant functions
The simplest possible discriminative classifier:
f(x) = sign(w^T x + b), with w^T x = Σ_i w_i x_i
This is a hyperplane dividing the feature space.
First, a notational trick: define w' = [b, w]^T and x' = [1, x]^T; then w^T x + b becomes w'^T x'. This makes the math a bit easier.
y_i is the label of x_i; for binary classification, y_i ∈ {-1, 1}.
How to find w and b (or w')? The first and easiest solution: the perceptron learning algorithm.

27 Perceptron learning
Perceptron algorithm:
- Start with an initial w_0 (usually random or zero)
- Classify all training samples using ŷ_i = sign(w^T x_i)
- For all misclassified training samples, adjust w: w ← w + η y_i x_i (η is the learning rate)
- Repeat until all training samples are correctly classified
Demo applet: 5/Perceptron.html
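A direct transcription of this algorithm into NumPy, using the augmented vectors w' = [b, w] and x' = [1, x] introduced on the previous slide:

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning on augmented vectors; labels y in {-1, +1}.

    Returns w' = [b, w]; converges only if the data is linearly separable.
    """
    Xa = np.hstack([np.ones((len(X), 1)), X])   # prepend the constant 1
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(Xa, y):
            if np.sign(w @ x_i) != y_i:         # misclassified sample
                w += eta * y_i * x_i            # w <- w + eta * y_i * x_i
                errors += 1
        if errors == 0:                          # all samples correct: done
            break
    return w
```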

28 Which hyperplane is best?
The perceptron algorithm finds any hyperplane that separates the data. But which hyperplane is best?

29 Support vector machines
Find the hyperplane that maximizes the margin between the positive and negative examples.
[Figure: separating hyperplane, support vectors, and margin; w^T x / ||w|| is x projected onto the direction w]
On which side of the hyperplane is x? Compare w^T x / ||w|| with -b / ||w||, i.e. compare w^T x with -b:
- y_i = 1: w^T x_i > -b
- y_i = -1: w^T x_i < -b
- Combined: y_i (w^T x_i + b) > 0
The length of w does not change anything; fix it by requiring min_i y_i (w^T x_i + b) = 1.
- Distance of x to the hyperplane: |w^T x + b| / ||w||
- Minimal distance to the hyperplane: 1 / ||w||
- Margin: 2 / ||w||

30 Finding the hyperplane with maximum margin
Pose it as a constrained optimization problem:
Minimize ||w||² (i.e. maximize the margin 2 / ||w||)
subject to y_i (w^T x_i + b) >= 1
- Samples are on the correct side of the hyperplane (> 0)
- Samples are outside the margin (>= 1)
This is known as a quadratic optimization problem:
- It has only one global minimum, so there are no problems with local minima!
- Efficient algorithms for solving it are known.

31 Non-separable data
What to do with data that is not linearly separable? Are the offending points valid data, noise, or mislabeled?

32 Large margin vs. training error
Noise in the data or in the labels can make the training data non-separable, or seriously reduce the size of the margin.
What we want: a way to maximize the margin while ignoring (some) outliers that would otherwise lead to a small margin. This gives us an SVM algorithm that also works for data that is not linearly separable (but almost):
- Old way (hard margin): every sample must satisfy the margin constraint; training error is zero
- New way (soft margin): constraint violations ξ_i are allowed
A larger margin means better generalization.

33 Soft-margin SVM formulation
Reminder, hard-margin SVM:
Minimize ||w||² (i.e. maximize the margin 2 / ||w||)
subject to y_i (w^T x_i + b) >= 1
Soft-margin SVM: add the sum of constraint violations to the objective function and allow constraints to be violated:
Minimize ||w||² + C Σ_i ξ_i
subject to y_i (w^T x_i + b) >= 1 - ξ_i (with ξ_i >= 0)
The parameter C controls the trade-off between low training error and a large margin.
ξ_i has a simple interpretation:
- ξ_i = 0: sample classified correctly, outside the margin
- 0 < ξ_i < 1: sample classified correctly, inside the margin
- ξ_i > 1: sample classified incorrectly
Σ_i ξ_i is an upper bound on the number of training errors.
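A quick way to see the effect of C is scikit-learn's SVC, which is built on the LibSVM package listed later in these slides; the overlapping blob data here is made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two overlapping Gaussian blobs, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Small C: larger margin, more violations tolerated.
# Large C: fewer violations, smaller margin.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.n_support_, clf.score(X, y))  # support vectors per class, train acc.
```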

34 Really non-separable data
What about data that is really non-separable (not just because of noise)? No hyperplane will be able to perform well on this kind of data!

35 Non-linear SVMs: going to higher dimensions
Idea: transform the data into a high-dimensional space where it is linearly separable.

36 Kernel trick
Transforming feature vectors into a high-dimensional space can be difficult to compute; for infinite-dimensional spaces it is impossible.
Kernel trick:
- Rewrite the algorithm so that the feature vectors only appear in dot products (using Lagrange multipliers / the dual optimization problem)
- Use a kernel function to compute the dot product in the high-dimensional space directly from the low-dimensional input vectors
The high-dimensional space becomes an implicit property of the kernel function: nothing needs to be computed in the high-dimensional space! Even infinite-dimensional spaces are possible.
(For the mathematical details, see e.g. Burges 1998, or the machine learning lecture.)

37 Common SVM kernels
Some kernel functions are widely used:
- Linear kernel: k(x, x') = x^T x'
- Polynomial kernel: k(x, x') = (x^T x' + 1)^d
- Gaussian (RBF) kernel: k(x, x') = exp(-||x - x'||² / (2σ²))
- Sigmoid kernel: k(x, x') = tanh(x^T x' + c)
Kernels for non-vectorial data are also possible: graphs, sets, texts, etc.
The kernel defines a similarity function for the input vectors; problem-specific kernels can often improve performance.
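The vector kernels above are one-liners in NumPy; a sketch, where the rows of X and Z play the roles of x and x':

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def polynomial_kernel(X, Z, degree=2):
    """Polynomial kernel matrix K[i, j] = (x_i . z_j + 1)^degree."""
    return (X @ Z.T + 1) ** degree
```

A kernel SVM only ever touches the data through such a kernel matrix, which is exactly why the high-dimensional space never has to be constructed.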

38 Kernel SVM example
[Figure: decision boundary with a polynomial kernel of degree 2: k(x, x') = (x^T x' + 1)²]

39 Kernel SVM example
[Figure: decision boundary with an RBF kernel: k(x, x') = exp(-||x - x'||² / (2σ²))]

40 Model selection
The SVM learning algorithm has the parameter C, and kernels usually have parameter(s) too. These need to be optimized to achieve the best performance.
Simple idea: train several SVMs with different parameters and take the one with the best performance on the training set. But training error is not a good performance measure; what we want is good generalization capability (low test error).
n-fold cross-validation (CV):
- Split the training set into n folds
- For each fold i, train an SVM using all folds except fold i, and evaluate on fold i
Perform CV for a number of SVM parameter settings and use the ones with the best performance.
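With scikit-learn, this grid search over C and the RBF kernel width looks roughly as follows; the grid values are placeholders, powers of ten being a common starting point, and the data here is a made-up toy problem:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold CV per setting
search.fit(X, y)
print(search.best_params_, search.best_score_)
```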

41 Linear SVMs
Don't completely forget about linear SVMs! If the input space is already high-dimensional, linear SVMs can often perform well too.
Linear SVMs have some advantages:
- Speed: only one scalar product per classification; a kernel SVM needs #(support vectors) kernel evaluations
- Memory: only one vector w needs to be stored; a kernel SVM needs to store all support vectors
- Training: much faster, thanks to specialized primal optimization algorithms
- Model selection: only one parameter to optimize; kernel SVMs need to optimize C and the kernel parameters

42 Multi-class SVMs
The SVM is originally a two-class classifier. Generic methods to obtain a multi-class classifier:
- One-vs-all: one SVM per class
- One-vs-one: pairwise SVMs
Problems: classification is sometimes ambiguous, and multiple classifiers need to be trained and evaluated.
Today, real multi-class SVMs are available (Weston & Watkins 1998; Crammer & Singer 2001).

43 Some practical tips for using SVMs
Normalization of the feature vectors is very important:
- Normalize each feature separately: zero mean and unit variance, or scaling to [-1, 1], are popular
- Or normalize each feature vector to unit norm: x ← x / ||x||
Otherwise performance will suffer: training will be much slower and numerically unstable, and classification performance will be suboptimal.
Model selection, i.e. optimization of C and the kernel parameters, is also very important. Generally, grid search using cross-validation is used.
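A minimal sketch of the per-feature standardization variant; note that the statistics come from the training set only and are then applied to the test set (scikit-learn's StandardScaler does the same):

```python
import numpy as np

def standardize(X_train, X_test):
    """Zero mean, unit variance per feature, using training-set statistics only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                  # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```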

44 SVM software
- LibSVM [Chang et al. 2001]
- SVMlight [Joachims 1999]
Applets: 20SVM/svmapplet.html

45 k-nearest neighbors (kNN)
Look at the k closest training samples and assign the most frequent label among them.
The model consists of all training samples:
+ No information is lost
- A lot of data to manage
Naive implementation: compute the distance to each training sample every time.
A distance metric is needed and is an important design parameter: L1, L2, L∞, Mahalanobis, problem-specific distances, ...
kNN is often a good classifier, but it needs enough data and has scalability issues. More on this in later lectures.
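The naive implementation described above fits in a few lines of NumPy; here with the L2 distance:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=5):
    """Naive k-NN: L2 distance to every training sample, then majority vote."""
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to all samples
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

Here y_train is assumed to be a NumPy array of labels; swapping in another metric just means replacing the distance line.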

46 New problem setting: clustering
Only data points are given, no class labels. The task is to find structure in the given data. Generally, no single correct solution exists.

47 K-means
Algorithm:
- Randomly initialize k cluster centers
- Repeat until convergence:
  - Assign each data point to the closest cluster center
  - Compute each new cluster center as the mean of its assigned data points
Pros: simple and efficient.
Cons: k needs to be known in advance; the results depend on the initialization; it does not work well for clusters that are not hyperspherical (round) or that overlap.
K-means is very similar to the EM algorithm, but uses hard assignments instead of probabilistic (soft) assignments; the EM algorithm can also be used for clustering.
Applets
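The two alternating steps translate directly into NumPy; a sketch in which empty clusters would need special handling in practice:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means, initialized with k random data points as centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center for every point
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        # Update step: mean of the points assigned to each center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels
```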

48 Agglomerative hierarchical clustering
Algorithm:
- Start with one cluster for each data point
- Repeat: merge the two closest clusters
Several possibilities to measure the cluster distance:
- Min: minimal distance between elements
- Max: maximal distance between elements
- Avg: average distance between elements
- Mean: distance between the cluster means
The result is a tree called a dendrogram.
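SciPy ships this algorithm; a short sketch on made-up toy data, where the method argument selects the cluster-distance measure from the list above:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(0).normal(size=(30, 2))   # toy data

# method: 'single' = min, 'complete' = max, 'average' = avg,
# 'centroid' = distance between cluster centroids (the "mean" variant)
Z = linkage(X, method="average")                    # Z encodes the dendrogram
labels = fcluster(Z, t=3, criterion="maxclust")     # cut the tree into 3 clusters
```

Calling scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself.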

49 AHC dendrogram
[Figure: example dendrogram]
AHC applet: ml/appleth.html

50 Scaling
Scaling of the features can have a large impact on clustering: under one scaling the data forms nice clusters, after rescaling the clustering is not so clear anymore.
[Figure: the same data under two different feature scalings]

51 Literature
Classification (Bayes, Gaussians, EM, ...):
- Duda, Hart, Stork: Pattern Classification, 2nd ed., 2000
- Mitchell: Machine Learning, 1997
- Bishop: Pattern Recognition and Machine Learning, 2008
SVMs:
- C. Burges: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 1998
- Shawe-Taylor, Cristianini: Kernel Methods for Pattern Analysis, 2004
- Schölkopf, Smola: Learning with Kernels, 2001
