
ALBERT-LUDWIGS-UNIVERSITÄT FREIBURG
INSTITUT FÜR INFORMATIK
Lehrstuhl für Mustererkennung und Bildverarbeitung
Prof. Dr. Hans Burkhardt

Comparison of Content-Based Image Retrieval using one-class and two-class SVM

Studienarbeit
Julia Ick
May 2004 - August 2004

Declaration (Erklärung)

I hereby declare that this work was produced by me independently and only with the use of the listed aids.

Freiburg,

Contents

1 Motivation
2 Support Vector Machines
  2.1 Two-Class SVM
  2.2 One-Class SVM
3 Similarity Measures
  3.1 Euclidean Distance
  3.2 Histogram Intersection
  3.3 Image Retrieval with similarity measures
4 Kernel Functions
  4.1 Common Kernel Functions
  4.2 The Histogram Intersection Kernel
5 Relevance Feedback with SVMs
  5.1 Using a Two-Class SVM
  5.2 Using a One-Class SVM
  5.3 Using a Similarity Measure
6 Relevance ranking induced by the histogram intersection kernel
7 Implementation
8 Results
  8.1 Tests with the MPEG-7 Content Set
  8.2 Tests with the Benchathlon image collection
  8.3 Conclusions

Chapter 1

Motivation

Nowadays people are confronted more and more with very large multimedia data sets. For example, the number of images found in the world wide web has increased immensely in the past years and will keep growing in the future. To cope with these tremendous sets, it is necessary to develop good systems for searching multimedia data. In the following we will concentrate on image search systems.

One way to construct an image search system is to label each image with associated data, like the filename, the date of creation or descriptive words. Then it is possible to search for an image by searching in the associated data. A disadvantage of this method is that a lot of labeling work has to be done before the image search database can be used. The main problem with this method, however, is that it is very hard for the user of the search system to describe the image he is looking for in a way that gives the computer a chance to return the desired result. Descriptions of the same image by different users will rarely correspond; image descriptions are very subjective.

Another way to create an image search system is to search on the basis of the content of the image. Search systems that use this method are called Content-Based Image Retrieval (CBIR) systems. The main idea is that the user uploads or selects an image and requests the computer to find the most similar images in the database. To simplify the comparison of images, a small number of numerical, easy-to-detect low-level features are used instead of the raw image data. Low-level features are features like color histograms, textures, shapes and edges. A problem is that the user most likely is not able to express an image with low-level features alone. The user will need the help of high-level features, which cannot be detected by the computer. For example, what if the user wants to search for images with flying or sitting birds with no preference for the color of the feathers? How can this target concept be described with low-level features alone? A major goal for image search techniques would be to find a way to fill the gap between the perception of humans and computers. In the meantime we have to try to capture the user's target image concept with computer-detectable low-level features as well as possible.

Because our CBIR system will be confronted with different users with different search interests, the system has to be able to learn. It is important that the system tries to capture the target image concept quickly and with high accuracy. Learning means that the system finds and weights the features which best describe the user's desired target image.

Before choosing a learning method we have to define how many relevance classes should be distinguished. Here we will treat CBIR as a two-class classification problem. That means that an image will either be relevant or irrelevant.

To achieve a high accuracy, the system has to see a sufficient number of training examples. In the case of image retrieval the examples have to be provided by the user. One way to provide the training examples is that the user uploads them. But this is very time consuming and presupposes that the user has a set of examples at hand. A much easier way, which we will be using here, is to present a small number of images from the database to the user for labeling. A system which improves the search result iteratively by asking the user to label a set of query images is called a relevance feedback system.

The images for the query can be picked randomly or purposely by the system. Systems which pick their training examples randomly are called passive learners and systems which pick their training examples purposely are called active learners. A passive learner might take a long time and a lot of examples to learn the target image concept. No user has the patience to label more than a few dozen images, so it might be better to choose the images on purpose in some way. Now, what are the best images for the query? Training examples are good if they reduce the space of all possible target image concepts. Consequently the most informative images for a query round are those that restrict the space of all possible target image concepts the most.

The strategy used by our relevance feedback system to learn the target image concept is as follows. First the user is presented a small set of images for labeling. It is assumed that at least one image of the set is relevant. An example of an already labeled initial query set of our CBIR system is shown in figure 1.1. The blue frame marks the chosen relevant image and all unmarked images are considered irrelevant. When the user is done with labeling, the system uses this information to compute the image concept which it considers the most probable one. After that, the current best result images and the images of the second query round are presented to the user. The query set will consist of the most informative images. Figure 1.2 shows in the upper part the 20 best result images after the initial query round from figure 1.1. The images of the second query round are presented in the lower part. Based on the user's answers, the representation of the system's image concept can be updated. The best results after the second query round of our example can be viewed in figure 1.3. These query rounds are repeated until the user thinks the result is good enough.

Which machine learning method should we use in the relevance feedback CBIR system? Most easy-to-detect low-level features are numerical, and therefore an image can be represented as a vector of numerical values. Every image belongs to one of two classes: relevant or irrelevant. Therefore a possible learning method to use are Support Vector Machines (SVMs). Two different types of SVMs have already been successfully tested in CBIR relevance feedback systems. One of the two SVM types is the well-known two-class SVM, which tries to separate the training data with a hyperplane. This kind of SVM has been used for image retrieval by S. Tong and E. Chang [5]. The second type tested is the less well-known one-class SVM, which tries to fit a tight hypersphere around the positive training examples. This type of SVM was introduced for image retrieval by Y. Chen [4].

Figure 1.1: Screenshot of our CBIR system after labeling the first set of query images

Figure 1.2: Results after the initial query round

Figure 1.3: Results after the second query round

But which of these two SVM types is the better one for image search? The relevant images are only a very small fraction of the image database. One might assume that the relevant images cluster in the feature space in a certain way. The irrelevant images will certainly not cluster and might lie all around the cluster of relevant images. In this case one might think that a one-class SVM separates the two classes better, because it can easily fit a hypersphere around the cluster. A two-class SVM with a linear kernel will surely fail to separate the two classes, but maybe an rbf kernel will do? In the following these two SVM types will be compared.

Chapter 2

Support Vector Machines

Support Vector Machines were first introduced by Vapnik. They have a good generalization ability and can easily be applied to classification problems of the following form. Assume that each data instance can be represented as a vector $x \in \mathbb{R}^n$ and that each instance belongs to one of two classes. Let us call them the positive and the negative class. The instances of the positive class are labeled with 1 and the instances of the negative class with -1, so $L = \{-1, 1\}$ is the set of possible labels. Given a set of training examples $\{(x_1, y_1), \dots, (x_l, y_l)\}$ with $(x_i, y_i) \in \mathbb{R}^n \times L$, the problem is to find a function that assigns any unlabeled instance to the correct class. In the following it will be shown how the two SVM types, one-class and two-class, handle this problem.

2.1 Two-Class SVM

Figure 2.1: Positive (yellow) and negative (blue) training examples
Figure 2.2: Separating hyperplane of the two-class SVM

Two-class SVMs solve the classification problem by finding a maximal margin hyperplane that separates the positive training instances from the negative ones. All positive training instances will lie on one side of the hyperplane and all negative training instances on the other. The training instances that lie closest to the hyperplane are called support vectors. In most cases the training instances are not linearly separable in the original feature space $\mathbb{R}^n$.

In this case the training instances can be transformed nonlinearly into a higher-dimensional feature space $F$ with a mapping

$$\phi : \mathbb{R}^n \to F, \qquad x \mapsto \phi(x)$$

In the higher-dimensional feature space the instances can be separated much more easily. The classification function is

$$f(x) = \mathrm{sgn}\big(w \cdot \phi(x) + b\big)$$

Instances mapped to $F$ lying on the positive side of the hyperplane $w \cdot \phi(x) + b = 0$ are classified as members of the positive class and the ones on the other side are classified as members of the negative class. It is easy to see that members of $F$ appear only in dot products. Any algorithm that uses dot products can be performed implicitly in $F$ using a kernel function. Instead of mapping an instance to the possibly very high dimensional vector space $F$ and performing a dot product there, a kernel function that only operates on vectors of $\mathbb{R}^n$ can be used. This saves a lot of computational time. Another advantage of using a kernel function is that the decision function can be calculated even when the mapping $\phi$ cannot be described analytically. The decision function can be rewritten as

$$f(x) = \mathrm{sgn}\Big(\sum_{i=1}^{l} y_i \alpha_i k(x_i, x) + b\Big)$$

To calculate $b \in \mathbb{R}$ and $\alpha_1, \dots, \alpha_l \ge 0$ the following quadratic optimization problem has to be solved:

$$\text{maximize } W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$$

$$\text{subject to } \sum_{i=1}^{l} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, l$$

The parameter $C$ is called the cost and regulates the trade-off between an SVM that classifies all training instances correctly and an SVM that allows outliers. The smaller the value chosen for $C$, the more classification errors are tolerated on the training set. By tolerating some classification errors the SVM has the chance to choose a simpler decision boundary and to avoid overfitting. Only the instances nearest to the decision boundary, the support vectors, are needed to define the boundary. These instances will have a nonzero $\alpha_i$; all other instances will have $\alpha_i = 0$ and are irrelevant for the classification problem.
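As a concrete illustration of the dual decision function above, the following minimal NumPy sketch evaluates $f(x) = \mathrm{sgn}(\sum_i y_i \alpha_i k(x_i, x) + b)$ for an rbf kernel. The variable names (support_vectors, labels, alphas, b) are illustrative and not taken from the thesis implementation.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def two_class_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    # f(x) = sgn( sum_i y_i * alpha_i * k(x_i, x) + b )
    s = sum(y_i * a_i * kernel(x_i, x)
            for x_i, y_i, a_i in zip(support_vectors, labels, alphas))
    return np.sign(s + b)
```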

2.2 One-Class SVM

One-class SVMs solve the classification problem by finding the smallest hypersphere which contains most of the positive training instances. The information of the negative training instances is completely ignored while calculating the hypersphere. The hypersphere should be as small as possible to minimize the risk of including negative instances. All instances inside the "ball" will be classified as positive and all instances outside the ball will be classified as negative.

Figure 2.3: Positive (yellow) and negative (blue) training examples
Figure 2.4: Separating hypersphere of the one-class SVM

The hypersphere need not contain all positive training instances. Training instances may contain noise, therefore outliers should be detected and singled out. As in the case of the two-class SVM, the training instances $x$ can be projected nonlinearly with a mapping $\phi(x)$ into a higher-dimensional feature space $F$ and the hypersphere can be calculated there. This yields a more complex decision boundary in the original feature space. The goal is to compute a hypersphere which is as small as possible while at the same time containing most of the $l$ positive training instances. This can be formulated in a primal form as:

$$\min_{R \in \mathbb{R},\, \zeta \in \mathbb{R}^l,\, c \in F} \; R^2 + \frac{1}{\nu l} \sum_i \zeta_i$$

$$\text{subject to } \|\phi(x_i) - c\|^2 \le R^2 + \zeta_i, \quad \zeta_i \ge 0, \quad i = 1, \dots, l$$

Figure 2.5: Enclosing hypersphere of a one-class SVM with ν = 0.1
Figure 2.6: Enclosing hypersphere of a one-class SVM with ν = 0.8

The $\zeta_i$ are slack variables that denote the distance of an instance from the ball. They are used to penalize outliers. If $\zeta_i > 0$ then the positive training instance $x_i$ is detected as an outlier and lies outside of the hypersphere with radius $R$. To set the trade-off between the radius of the ball and the number of training instances it encloses, the parameter $\nu \in [0,1]$ is used.

If $\nu$ is chosen to be small, the hypersphere is allowed to grow so that more training instances can be put into the ball. If $\nu$ is chosen to be large, the hypersphere is kept small while a fraction of the training instances is allowed to lie outside.

The primal form of the optimization problem can be transformed into a dual form using Lagrange multipliers. The corresponding Lagrange function is:

$$L(R, \zeta, c, \alpha) = R^2 + \frac{1}{\nu l} \sum_{i=1}^{l} \zeta_i + \sum_{i=1}^{l} \alpha_i \big( \|\phi(x_i) - c\|^2 - R^2 - \zeta_i \big), \qquad \alpha_i \ge 0$$

This function has to be minimized. At the minimum the following conditions have to hold:

$$\frac{\partial L}{\partial R} = 0 \;\Rightarrow\; 2R - 2R \sum_{i=1}^{l} \alpha_i = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i = 1$$

$$\frac{\partial L}{\partial c} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i \big(2c - 2\phi(x_i)\big) = 0 \;\Rightarrow\; c = \frac{\sum_{i=1}^{l} \alpha_i \phi(x_i)}{\sum_{i=1}^{l} \alpha_i} = \sum_{i=1}^{l} \alpha_i \phi(x_i)$$

The center $c$ is completely determined by $\alpha$ alone, and since $\sum_i \alpha_i = 1$ the terms containing $R^2$ cancel, so $R$ drops out as well. The Lagrange function can therefore be written using only the variables $\zeta$ and $\alpha$:

$$L(\zeta, \alpha) = \frac{1}{\nu l} \sum_{i=1}^{l} \zeta_i + \sum_{i=1}^{l} \alpha_i \Big( \Big\| \phi(x_i) - \sum_{j=1}^{l} \alpha_j \phi(x_j) \Big\|^2 - \zeta_i \Big)$$

$$= \frac{1}{\nu l} \sum_{i=1}^{l} \zeta_i + \sum_{i=1}^{l} \alpha_i \Big[ \phi(x_i) \cdot \phi(x_i) + \sum_{j,k=1}^{l} \alpha_j \alpha_k \, \phi(x_j) \cdot \phi(x_k) - 2 \sum_{j=1}^{l} \alpha_j \, \phi(x_j) \cdot \phi(x_i) \Big] - \sum_{i=1}^{l} \alpha_i \zeta_i$$

$$= \frac{1}{\nu l} \sum_{i=1}^{l} \zeta_i + \sum_{i=1}^{l} \alpha_i \, \phi(x_i) \cdot \phi(x_i) - \sum_{i,j=1}^{l} \alpha_i \alpha_j \, \phi(x_i) \cdot \phi(x_j) - \sum_{i=1}^{l} \alpha_i \zeta_i$$

Now $L$ should be minimized with respect to the $\zeta_i$ subject to $\zeta_i \ge 0$. So either $\partial L / \partial \zeta_i = 0$ if such a point exists, or $\zeta_i = 0$ and $\partial L / \partial \zeta_i > 0$:

$$\frac{\partial L(\zeta, \alpha)}{\partial \zeta_i} = \frac{1}{\nu l} - \alpha_i \ge 0 \;\Rightarrow\; \alpha_i \le \frac{1}{\nu l}$$

Now $L$ can be rewritten without the $\zeta_i$:

$$L(\alpha) = \sum_{i=1}^{l} \alpha_i \, \phi(x_i) \cdot \phi(x_i) - \sum_{i,j=1}^{l} \alpha_i \alpha_j \, \phi(x_i) \cdot \phi(x_j)$$

This leads to the following dual form:

$$\min_{\alpha} \; \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i)$$

$$\text{subject to } 0 \le \alpha_i \le \frac{1}{\nu l}, \quad \sum_i \alpha_i = 1$$

The optimal $\alpha$ can be computed by solving this dual problem with the help of a QP optimization method. After that the center of the hypersphere can be calculated, if the mapping $\phi(x)$ is known:

$$c = \sum_i \alpha_i \phi(x_i)$$

But the mapping $\phi(x)$ will be unknown in most cases. The decision function

$$f(x) = \mathrm{sgn}\big(R^2 - \|\phi(x) - c\|^2\big)$$

can be computed without the center using the corresponding kernel function:

$$f(x) = \mathrm{sgn}\Big(R^2 - \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) + 2 \sum_i \alpha_i k(x_i, x) - k(x, x)\Big)$$

The support vectors are those instances $x_i$ with $0 < \alpha_i < 1/(\nu l)$, the $x_i$ with $\alpha_i = 1/(\nu l)$ are the outliers, and the $x_i$ with $\alpha_i = 0$ are the instances lying strictly inside the ball. The radius $R$ is computed such that all support vectors lie on the hull of the hypersphere; this is the case if for all support vectors the argument of the sgn is zero. The squared distance of an instance to the center, which will be needed for ranking the images later on, can be calculated in the following way:

$$d(x)^2 = \|\phi(x) - c\|^2 = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - 2 \sum_i \alpha_i k(x_i, x) + k(x, x)$$
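The squared distance above depends only on kernel evaluations and the dual coefficients, so it can be computed without ever knowing $\phi$ or $c$ explicitly. A minimal sketch with assumed variable names (not from the original implementation):

```python
import numpy as np

def center_distance_sq(x, train_points, alphas, kernel):
    # squared distance ||phi(x) - c||^2 to the hypersphere center c = sum_i alpha_i phi(x_i),
    # expressed purely through kernel evaluations
    const = sum(a_i * a_j * kernel(x_i, x_j)
                for x_i, a_i in zip(train_points, alphas)
                for x_j, a_j in zip(train_points, alphas))
    cross = sum(a_i * kernel(x_i, x) for x_i, a_i in zip(train_points, alphas))
    return const - 2.0 * cross + kernel(x, x)

def one_class_decision(x, train_points, alphas, radius, kernel):
    # f(x) = sgn(R^2 - ||phi(x) - c||^2): +1 inside the ball, -1 outside
    return np.sign(radius ** 2 - center_distance_sq(x, train_points, alphas, kernel))
```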

Chapter 3

Similarity Measures

Image retrieval is more of a ranking problem than a classification problem. It is not enough to label 100 images as relevant and 9900 as irrelevant in a database of 10000 images. It is more important that the images of the database are ranked according to their relevance and that the n most relevant images are returned. For the ranking a method for measuring the similarity of two images is needed. Two of many possible similarity measures are the Euclidean distance, which will be introduced in section 3.1, and histogram intersection, which will be discussed in section 3.2.

3.1 Euclidean Distance

The Euclidean distance is the distance measure induced by the $L_2$-norm. Let $x$ and $y$ be the feature vectors of length $n$ of two images. The Euclidean distance of the two images is defined as:

$$L_2(x, y) = \Big( \sum_{i=1}^{n} |x_i - y_i|^2 \Big)^{1/2}$$

The smaller the Euclidean distance of two images, the more similar they are considered to be.

3.2 Histogram Intersection

Histogram intersection is a method to measure the similarity of two images with respect to their colors. Let us denote the histograms of the images $A_{im}$ and $B_{im}$ with $A$ and $B$. Let both images consist of $N$ pixels and let both histograms consist of $m$ bins. The $i$-th bin ($i = 1, \dots, m$) of the histogram $A$ is denoted with $A_i$, and the $i$-th bin of the histogram $B$ is denoted with $B_i$. It holds that

$$\sum_{i=1}^{m} A_i = N \quad \text{and} \quad \sum_{i=1}^{m} B_i = N$$

Now, the histogram intersection is defined as

$$K_{int}(A, B) = \sum_{i=1}^{m} \min(A_i, B_i)$$

The higher the histogram intersection value of two images, the greater is the common part of the histograms, and the more similar the two images are. If the sum of the histogram bins is normalized to one, the highest possible similarity value of two images will also be one.

Histogram intersection is closely related to the $L_1$-norm. Let $x$ and $y$ be the feature vectors of length $n$ of two images. Then the $L_1$-distance is defined as

$$L_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

Histogram intersection is related to the $L_1$-distance in the following way:

$$K_{int}(x, y) = 1 - \frac{L_1(x, y)}{2}$$

3.3 Image Retrieval with similarity measures

Let us imagine that our image retrieval system is based on similarity measures. How are the most relevant images computed when only one image labeled as relevant is given? In this case the system will calculate the similarity values of all images of the database and will return the n most similar images. If more than one relevant example is given by the user, a reference point for the similarity comparison has to be chosen. Normally the mean of the feature vectors of the relevant example images is used. This method for finding a reference point treats each example the same and is not able to detect outliers. Another method, which we will be using in our CBIR system, is to train a linear one-class SVM with the relevant feature vectors and then use the center of the hypersphere as the reference point. The center can be calculated because the training data is not projected into a higher-dimensional feature space. After the reference point has been calculated, all images can be compared to it using the selected similarity measure and a relevance ranking can be computed.
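To make the ranking procedure concrete, here is a small NumPy sketch of both similarity measures and of ranking a database against a reference histogram (for example the mean of the relevant examples, or the hypersphere center of a linear one-class SVM). All function and variable names are illustrative only.

```python
import numpy as np

def euclidean_distance(x, y):
    # L2 distance between two feature vectors
    return np.sqrt(np.sum((x - y) ** 2))

def histogram_intersection(a, b):
    # sum_i min(a_i, b_i); equals 1 for identical histograms normalised to sum 1
    return np.sum(np.minimum(a, b))

def rank_by_similarity(database, reference, n=20):
    # indices of the n database histograms most similar to the reference point
    scores = np.array([histogram_intersection(h, reference) for h in database])
    return np.argsort(-scores)[:n]

# Example reference point: the mean of the relevant example histograms, e.g.
# reference = relevant_histograms.mean(axis=0)
```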

Chapter 4

Kernel Functions

The CBIR relevance feedback system has been tested with both SVM types and with various kernel functions. All except one are commonly known kernel functions. The common kernels will be listed in the next section and thereafter the histogram intersection kernel will be introduced.

4.1 Common Kernel Functions

Four of the five kernel functions we use in our relevance feedback CBIR system are:

linear kernel: $k(x_i, x_j) = x_i \cdot x_j$

polynomial kernel: $k(x_i, x_j) = (\gamma (x_i \cdot x_j) + coef0)^d, \quad \gamma > 0$

radial basis function (rbf) kernel: $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2), \quad \gamma > 0$

sigmoid kernel: $k(x_i, x_j) = \tanh(\gamma (x_i \cdot x_j) + coef0)$

The occurring kernel parameters are $\gamma$, the degree $d$ and a coefficient $coef0$.

4.2 The Histogram Intersection Kernel

In the previous section histogram intersection was introduced as a method for measuring the similarity of two images. Annalisa Barla [6] has shown that histogram intersection is a Mercer kernel, i.e. it yields hyperplanes with a guaranteed maximum margin in the mapped feature space. This can be shown by constructing a mapping from the feature space $\mathbb{R}^n$ to a higher-dimensional feature space $F$ in which histogram intersection is a dot product. Let $A_{im}$ and $B_{im}$ be images with $N$ pixels and let $A$ and $B$ be the corresponding histograms, each with $m$ bins. $A$ is mapped to an $(N \cdot m)$-dimensional binary vector $\bar{A}$ that consists, for each bin $i$, of a block of $N$ entries containing $A_i$ ones followed by $N - A_i$ zeros:

$$\bar{A} = ( \underbrace{1, \dots, 1}_{A_1}, 0, \dots, 0, \; \underbrace{1, \dots, 1}_{A_2}, 0, \dots, 0, \; \dots, \; \underbrace{1, \dots, 1}_{A_m}, 0, \dots, 0 )$$

$B$ is mapped to $\bar{B}$ similarly. The histogram intersection $K_{int}(A, B)$ is then equal to the standard inner product of the two vectors $\bar{A}$ and $\bar{B}$:

$$K_{int}(A, B) = \bar{A} \cdot \bar{B}$$

With this it has been shown that histogram intersection is a positive definite kernel function. Histogram intersection will be the fifth kernel used in our CBIR system.
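Since most SVM libraries do not ship a histogram intersection kernel, one common way to use it is to pass a precomputed Gram matrix. The sketch below assumes scikit-learn's LIBSVM wrapper and is only meant to illustrate the idea; the thesis itself added the kernel directly to LIBSVM.

```python
import numpy as np
from sklearn.svm import SVC

def intersection_gram(A, B):
    # Gram matrix G[i, j] = sum_k min(A[i, k], B[j, k]) for two sets of histograms
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

# Hypothetical usage with training histograms X_train and labels y_train in {-1, +1}:
# clf = SVC(kernel="precomputed", C=1.0)
# clf.fit(intersection_gram(X_train, X_train), y_train)
# scores = clf.decision_function(intersection_gram(X_test, X_train))
```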

Chapter 5

Relevance Feedback with SVMs

How will the relevance feedback part of the CBIR system be realized with the help of support vector machines? SVMs are binary classifiers which separate the relevant images from the irrelevant images with a boundary in a higher-dimensional feature space. The relevant images lie on one side and the irrelevant images lie on the other side. When asked to classify an unlabeled instance, the SVM normally returns only the computed class label. But the SVM can be changed slightly so that a numerical value is returned instead. What this numerical value looks like depends on the SVM type that is used; this will be discussed for both SVM types in sections 5.1 and 5.2. In any case, the numerical value returned by the SVM induces a relevance order on all images in the database. The n most relevant images can then easily be determined and presented to the user as the result. For the next feedback round the n most informative images have to be determined. This can also be done with the help of the numerical value returned by the SVM. After the labeling is finished, the newly labeled instances are added to the old training set and the SVM is trained with this new set. Instead of SVMs, the relevance feedback system can use any similarity measure to induce a relevance order on the image database. This is discussed further in section 5.3.
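The feedback loop just described can be summarised in a few lines of Python. Here `score` stands for whichever relevance value the chosen learner returns (signed distance to the hyperplane, distance to the hypersphere center, or similarity to the reference point); all names are illustrative and not taken from the actual PHP/LIBSVM implementation.

```python
def relevance_feedback(database, initial_query, ask_user, train, score,
                       select_informative, n_results=20, n_query=12, rounds=6):
    # generic relevance feedback loop: train a learner, rank the database,
    # show the best results and ask the user to label the most informative images
    labeled = list(ask_user(initial_query))           # user labels the initial image set
    results = []
    for _ in range(rounds):
        model = train(labeled)                        # e.g. fit a one-class or two-class SVM
        scores = [score(model, img) for img in database]
        order = sorted(range(len(database)), key=lambda i: scores[i], reverse=True)
        results = order[:n_results]                   # current n most relevant images
        query = select_informative(model, database, scores, n_query)
        labeled += list(ask_user(query))              # user labels the new query images
    return results
```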

5.1 Using a Two-Class SVM

Figure 5.1: Separating hyperplane of a linear two-class SVM
Figure 5.2: Relevance ranking of all points

Two-class SVMs separate the relevant images from the irrelevant images with a hyperplane in a higher-dimensional feature space. The most informative images for the feedback round are those which have the greatest influence on the position of the separating hyperplane. The only instances which influence the position of the hyperplane are the support vectors. The nearer an unlabeled image is to the hyperplane, the higher the probability that it would become a support vector of the new SVM if labeled and added to the training set. Therefore the n images with the shortest distance to the separating hyperplane are chosen for the feedback round. Because negative instances have negative distances, the absolute value has to be used.

The most relevant images are those on the positive side of the hyperplane that have the highest probability of lying on the correct side. This is shown in figure 5.2: the whiter a point is, the more relevant it is. The corresponding separating hyperplane was created by a two-class SVM with a linear kernel and is shown in figure 5.1. The n most relevant images for the result can therefore be determined by sorting the images according to their distance to the hyperplane and taking those with the greatest distance. Instances on the negative side have negative distances and therefore smaller values than any positive one. The SVM classifier can easily be changed to return the signed distance instead of the class label.
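For the two-class case the two rankings described above reduce to two sorts of the signed decision values. A hedged sketch, assuming `decision_values` holds the signed distance of every database image to the hyperplane:

```python
import numpy as np

def two_class_selection(decision_values, n_results=20, n_query=12):
    # decision_values[i]: signed distance of database image i to the separating hyperplane
    d = np.asarray(decision_values)
    results = np.argsort(-d)[:n_results]        # most relevant: largest signed distance
    query = np.argsort(np.abs(d))[:n_query]     # most informative: closest to the hyperplane
    return results, query
```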

5.2 Using a One-Class SVM

One-class SVMs capture the distribution of the relevant images with a hypersphere in a high-dimensional feature space. The most informative images for the feedback round are those which have the greatest influence on the center and radius of the separating hypersphere. The instances which potentially have the greatest influence are those which lie nearest to the center of the hypersphere. Therefore the n images with the shortest distance to the center of the separating hypersphere are chosen for the feedback round. The images lying inside the hypersphere are more likely to be relevant; therefore the n most relevant images for the result are those closest to the center. Figure 5.3 shows the separating hypersphere of a one-class SVM using the linear kernel. The ranking of the points induced by this hypersphere is illustrated in figure 5.4: the lighter a point is colored, the more relevant it is. The SVM classifier can easily be changed to return the distance instead of the class label.

Figure 5.3: Separating hypersphere of a linear one-class SVM
Figure 5.4: Relevance ranking of all points

5.3 Using a Similarity Measure

As mentioned before, the reference point of the comparison is computed by a one-class SVM. Therefore the most informative images are those that are most informative to the one-class SVM, namely those which lie nearest to the center of the hypersphere. Instead of the distance to the center, the similarity to the center is used to rank the images of the database. In the feedback round and as the result, the user is presented the n images most similar to the center. In our relevance feedback CBIR system we used histogram intersection as the similarity measure. One reason for this choice is that in previous studies, e.g. in [9], $L_1$-related distances like histogram intersection have been found to be better than $L_2$ distances when dealing with histogram-based feature vectors.

Chapter 6

Relevance ranking induced by the histogram intersection kernel

In the previous chapter it was already shown in figures what the relevance ranking of two-dimensional points induced by one-class and two-class SVMs with the linear kernel looks like. Now the same is done for one-class and two-class SVMs using the histogram intersection kernel. Figure 6.1 illustrates the relevance ranking induced by a one-class SVM which was trained with only one positive example point. This example point is chosen by the SVM to be the center of the hypersphere. The white area in figure 6.2 represents the points which have a distance of 0.5 or less to the center when transformed into the higher-dimensional feature space. Figures 6.3 and 6.4 show the same for a one-class SVM trained with two positive examples. Figure 6.5 shows the decision boundary of a one-class SVM in the upper left part and the decision boundary of a two-class SVM in the lower left part. Both SVMs used the histogram intersection kernel and were trained with a cluster of positive and surrounding negative examples. The induced relevance rankings of both SVMs are illustrated in the right half of figure 6.5.

Figure 6.1: Relevance ranking induced by a one-class SVM trained with only one positive example
Figure 6.2: Area of the points having a distance of 0.5 or less to the center of the hypersphere

Figure 6.3: Relevance ranking induced by a one-class SVM trained with only two positive examples
Figure 6.4: Area of the points having a distance of 0.5 or less to the center of the hypersphere
Figure 6.5: Decision boundaries (left) and relevance ranking (right) induced by a one-class SVM (top) and a two-class SVM (bottom)

Chapter 7

Implementation

For training and classification of an SVM, each image of the database has to be represented by a numerical vector of image features. Here an invariant feature histogram with 512 bins was chosen. A comprehensive description of this invariant feature histogram can be found in the dissertation of S. Siggelkow [7].

For the implementation of the SVMs the library LIBSVM [1] was used. LIBSVM supports two-class SVMs, but not the one-class SVMs defined here. Therefore an additional SVM type "One-Class-Ball" had to be added. The decomposition method implemented in LIBSVM solves quadratic problems of the form:

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha + p^T \alpha$$

$$\text{subject to } y^T \alpha = \Delta, \quad 0 \le \alpha_t \le C, \quad t = 1, \dots, l$$

where $y_t = \pm 1$, $t = 1, \dots, l$, and $Q$ is a matrix with $Q_{ij} = k(x_i, x_j)$. As seen before, the dual form of the one-class SVM optimization problem is:

$$\min_{\alpha} \; \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \sum_i \alpha_i k(x_i, x_i)$$

$$\text{subject to } 0 \le \alpha_i \le \frac{1}{\nu l}, \quad \sum_i \alpha_i = 1$$

To solve this problem with the decomposition method of LIBSVM, this dual form has to be transformed. The first step is to scale the whole problem such that $0 \le \alpha_i \le 1$ holds instead of $0 \le \alpha_i \le \frac{1}{\nu l}$:

$$\min_{\alpha} \; \frac{1}{(\nu l)^2} \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \frac{1}{\nu l} \sum_i \alpha_i k(x_i, x_i)$$

$$\text{subject to } 0 \le \alpha_i \le 1, \quad \sum_i \alpha_i = \nu l$$

After that the objective can be scaled by $\frac{(\nu l)^2}{2}$, which does not change the minimizer:

$$\min_{\alpha} \; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) - \frac{\nu l}{2} \sum_i \alpha_i k(x_i, x_i)$$

This can be rewritten in the form

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha + p^T \alpha \qquad \text{subject to } y^T \alpha = \Delta, \quad 0 \le \alpha_t \le C, \quad t = 1, \dots, l$$

with $p = (p_1, \dots, p_l)$ where $p_i = -\frac{\nu l}{2} k(x_i, x_i)$, $\Delta = \nu l$, $C = 1$, and the matrix $Q$ with $Q_{ij} = k(x_i, x_j)$.
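As a sanity check of this transformation, the small sketch below builds the quantities Q, p, Delta and C that a generic QP solver of LIBSVM's form would receive, following the scaled constraint and objective above; the names are taken from the formulas, not from the actual LIBSVM patch.

```python
import numpy as np

def one_class_ball_qp(X, nu, kernel):
    # build Q, p, Delta, C of the problem
    # min 1/2 a^T Q a + p^T a  s.t.  sum(a) = Delta, 0 <= a_i <= C
    l = len(X)
    Q = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Q_ij = k(x_i, x_j)
    p = -(nu * l / 2.0) * np.diag(Q)                          # p_i = -(nu*l/2) * k(x_i, x_i)
    delta = nu * l                                            # equality constraint sum(a) = nu*l
    C = 1.0                                                   # box constraint 0 <= a_i <= 1
    return Q, p, delta, C
```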

Now the optimization problem is in the form needed to be solved by the decomposition method of LIBSVM.

Another SVM type, "Histogram Intersection", was added to LIBSVM to support the learning method described in chapter 3. Furthermore, LIBSVM only supports the commonly known kernel functions, so the histogram intersection kernel had to be implemented, too. To realize the relevance feedback part of the CBIR system, an additional function was added to LIBSVM that returns

- the distance from the hyperplane in the case of the two-class SVM,
- the distance from the center of the hypersphere in the case of the one-class SVM,
- the result of the histogram intersection in the case of histogram intersection.

This value is used to rank the images for the feedback round and to determine the most relevant images. The distance returned for the two-class SVM is signed: for instances of the positive class positive distances and for negative instances negative distances are returned. The signed distance is needed for the computation of the most relevant images. For the selection of the most informative images the absolute value of the signed distance is used.

The relevance feedback system for content-based image search can be used through a PHP web interface. Some screenshots of the web interface can be viewed in figures 1.1 to 1.3. The user can choose between two databases of different sizes. With the help of a small menu, the SVM and all its necessary parameters can be selected. Alternatively, histogram intersection can be chosen as the learning method. After that the user is presented 12 random images from the selected database and is asked to mark all relevant images. The page then reloads and presents the current 20 best results in the upper part and the 12 query images of the second feedback round in the lower part. An example of an image retrieval query is shown in figures 7.1 and 7.2. The two images from figure 7.1 were the only ones of the 12 initial query images that were selected as relevant and were used to train a one-class SVM with the histogram intersection kernel.

Figure 7.1: Example of a set of relevant training images

The 20 most relevant images resulting from training the SVM can be found in figure 7.2. The user can take part in as many feedback rounds as he wants to. After each feedback round the user has the opportunity to compare his selected SVM with a new SVM with different parameters. The new SVM will be trained with the same images as the old one. To make a comparison of the results possible, the result images of the new SVM are presented in a new window. Another nice feature is that after each feedback round a graph showing the number of relevant images as a function of the number of returned images can be computed. For this purpose the 100 best result images are presented to the user for labeling. For comparison this statistic can be computed, with the help of the user, for different learning methods and then be shown as differently colored lines in a single graph.

Figure 7.2: The 20 most relevant images gained through training the CBIR system with a one-class SVM using only the two relevant examples from figure 7.1

Chapter 8

Results

The CBIR relevance feedback system has been tested with labeled data of the MPEG-7 Content Set, consisting of 2500 images, and the Benchathlon image database [8] with 4500 images. (We acknowledge Tristan Savatier, Alejandro Jaimes, and the Department of Water Resources, California, for providing the images under the Licensing Agreement for the MPEG-7 Content Set, MPEG 98/N2466.) For the MPEG-7 Content Set 15 images and for the Benchathlon collection 11 images were chosen randomly. After that, non-expert users generated lists of relevant images for each selected image. For each relevant image set a number of different SVMs were trained with the same training data, consisting of positive instances out of the relevant set and randomly picked negative images from the complementary set. After each SVM training all images of the database were ranked according to their relevance. An ideal learner would retrieve all relevant images of the set before retrieving any irrelevant ones.

To compare the different learning methods a precision-recall graph is used. The precision is defined as the ratio of the number of retrieved relevant images to the total number of retrieved images. The recall is the proportion of the retrieved relevant images out of the total number of relevant images in the collection. The precision-recall graph plots the precision as a function of the recall. In the ideal case the precision will be one for all recall values. In the next two sections two-class and one-class SVMs with different kernel functions will be compared. The precision-recall graphs shown are averages over all 11 or 15 relevance sets.
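Precision and recall can be computed directly from the ranked list returned by the system, which is how such a graph is typically produced. A small sketch with assumed names:

```python
import numpy as np

def precision_recall(ranking, relevant):
    # precision and recall after each retrieved image, given the ranked list of
    # image ids and the set of truly relevant ids
    relevant = set(relevant)
    hits = 0
    precision, recall = [], []
    for k, img in enumerate(ranking, start=1):
        hits += img in relevant
        precision.append(hits / k)              # retrieved relevant / retrieved
        recall.append(hits / len(relevant))     # retrieved relevant / all relevant
    return np.array(precision), np.array(recall)
```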

8.1 Tests with the MPEG-7 Content Set

The best kernel function for two-class SVMs

The goal of the first experiment was to find out which kernel function for two-class SVMs is best for image retrieval. The SVMs were trained only with the initial training data, which consisted of 3 relevant and 5 irrelevant images. No additional feedback rounds were allowed. For each kernel individual tests were made to find the best parameters. After that all kernels with their best parameters were compared. The best parameters of the kernel functions can be found in table 8.1 and the result of the comparison is shown in figure 8.1. As can be seen, the kernel with the best performance on the MPEG-7 Content Set is the $L_1$-distance-based histogram intersection kernel. The second best is the rbf kernel, which is based on the $L_2$ distance. This result was expected, because in previous studies, e.g. in [9], $L_1$ distances have been found to be better than $L_2$ distances when dealing with histogram-based feature vectors. Furthermore, all other kernels seem to be very inefficient compared to the intersection and rbf kernels.

Figure 8.1: Comparison of kernels for two-class SVMs

Table 8.1: Parameters of the best two-class SVMs (cost, γ, coef0 and degree for the linear, sigmoid, polynomial, rbf and intersection kernels)

The best kernel function for one-class SVMs

The same experiment was now carried out for one-class SVMs. The SVMs were again trained only with the initial training data, which this time consisted of only 3 relevant images, since negative instances in the training set have no influence on one-class SVMs. Again no additional feedback rounds were allowed. The best parameters for the kernel functions, which again were determined through individual tests, are shown in table 8.2. The result of the comparison of the best kernel functions is presented in figure 8.2. Even when using one-class SVMs as the learning method for the CBIR relevance feedback system, histogram intersection is clearly the best kernel function. This time, however, the only kernel that performs badly is the sigmoid kernel. All other kernel functions perform equally well, but not as well as the intersection kernel.

Figure 8.2: Comparison of kernels for one-class SVMs

Table 8.2: Parameters of the best one-class SVMs (ν, γ, coef0 and degree for the linear, sigmoid, polynomial, rbf and intersection kernels)

Figure 8.3: Comparison of one-class and two-class SVMs
Figure 8.4: Comparison of both SVM types after 6 feedback rounds

Comparison of two-class and one-class SVMs

In the next step the best two-class SVM was compared to the best one-class SVM. For both SVM types these were the SVMs using the histogram intersection kernel; the exact parameters can be looked up in tables 8.1 and 8.2. Additionally, histogram intersection was added as an alternative learning method to the comparison of the two SVMs. Again only 3 relevant and 5 irrelevant images were used for training and no additional feedback rounds were allowed. As can be seen in figure 8.3, the one-class SVM performs best, followed closely by histogram intersection.

Comparison of two-class and one-class SVMs after 6 query rounds

How do the three learning methods perform when allowed to make up to 6 feedback rounds? In the first feedback round the three learners were trained with the same 3 relevant and 5 irrelevant images. Each following feedback round consisted of 12 query images. The result of the comparison after the 6 feedback rounds is shown in figure 8.4. Now the two-class SVM is the best learner, followed by the one-class SVM. A possible reason why the two-class SVM is the best is that it is the only learner of the three that uses the whole information from the feedback rounds; the one-class SVM and histogram intersection completely ignore the information from the feedback images labeled as irrelevant.

Improvement of a two-class SVM with relevance feedback

Another interesting question is how an SVM improves after each relevance feedback round. Figure 8.5 shows the improvement of a two-class SVM with the histogram intersection kernel. In the first round the SVM was trained with one relevant and 5 irrelevant images. After each feedback round the precision-recall graph was updated, until 6 feedback rounds were performed. It can be seen that the SVM improves, as expected, with each passed feedback round.

Figure 8.5: Improvement of a two-class SVM with relevance feedback
Figure 8.6: Improvement of a two-class SVM without relevance feedback

Improvement of a two-class SVM without relevance feedback

How does the SVM improve if not the most informative, but random images are used in the query round? The results can be seen in figure 8.6. As expected, the improvement is not nearly as good.

8.2 Tests with the Benchathlon image collection

Figure 8.7: Comparison of SVMs

The best kernel function for one-class and two-class SVMs

The first thing done with the Benchathlon image collection is to determine the best one-class and two-class SVMs for image retrieval. For this the SVMs were trained only with the initial training set, consisting of 3 relevant and 5 irrelevant images. No additional feedback rounds were allowed.

For each SVM type and each kernel, individual tests were made to find the best parameters. After that all kernels with their best parameters were compared. The performances of the two best two-class SVMs, the two best one-class SVMs and histogram intersection can be found in figure 8.7; the corresponding parameters are shown in table 8.3. When comparing the curves of the MPEG-7 Content Set with the curves of the Benchathlon image collection, one can see that the results of the CBIR system using the Benchathlon database are not as good as the results using the MPEG-7 Content Set. A reason for this is that the Benchathlon database, with 4500 images, is larger than the MPEG-7 Content Set, which consists of 2500 images. When using the Benchathlon collection, the proportion of relevant to irrelevant images is much smaller and the probability that the CBIR system returns an irrelevant image before returning all relevant images is much higher.

Table 8.3: Parameters of the best one-class and two-class SVMs on the Benchathlon collection (cost or ν and γ for the intersection and rbf kernels)

Comparison of two-class and one-class SVMs

This time the best two SVMs are the two-class SVM and the one-class SVM, both with the histogram intersection kernel. One may even say that the two-class SVM is slightly better. The performance level of histogram intersection is now nowhere near the levels of the two best SVMs.

Improvement and comparison of one-class and two-class SVMs after 6 query rounds

In figure 8.8 and figure 8.9 the improvement over 6 relevance feedback rounds of the one-class SVM and the two-class SVM, both using the histogram intersection kernel, is shown. Both SVMs were trained in the first feedback round with 3 relevant and 5 irrelevant images.

Figure 8.8: Improvement of a one-class SVM
Figure 8.9: Improvement of a two-class SVM

Figure 8.10: Comparison of both SVM types after 6 feedback rounds

It can easily be seen that the two-class SVM improves more during the feedback rounds. One reason for this is that the two-class SVM makes use of the information given by the instances labeled as irrelevant. In figure 8.10 the performance of both SVM types after 6 feedback rounds is compared. As expected, the two-class SVM performs much better than the one-class SVM.

8.3 Conclusions

When using histogram-based image features, the best one-class SVM and the best two-class SVM are those using histogram intersection as the kernel function. But which SVM type should be used for image retrieval in which cases? If the user is asked to provide the training examples by uploading a small number of relevant images, then a one-class SVM is the best choice. In this case a two-class SVM will fail, because it needs at least one negative training example. Using a one-class SVM is also a good choice when the user is asked only once to label a small set of images. As seen in the previous sections, the improvement in performance from feedback round to feedback round of a two-class SVM is a lot better than that of a one-class SVM. After several feedback rounds the two-class SVM is clearly the best learning method. Therefore we would expect that, when using a CBIR system with multiple query rounds, a two-class SVM will perform better than a one-class SVM.

Bibliography

[1] Chih-Chung Chang and Chih-Jen Lin: LIBSVM: a library for support vector machines, 2001. Software available online.

[2] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola and R.C. Williamson: Estimating the support of a high-dimensional distribution. Technical Report No. 87, 1999.

[3] Y. Rui, T. Huang, M. Ortega, S. Mehrotra: Relevance feedback: A power tool in interactive content-based image retrieval. IEEE Trans. on Circuits and Systems for Video Technology 8(5), Sep. 1998.

[4] Y. Chen et al.: One-class SVM for learning in image retrieval. IEEE Intl. Conf. on Image Processing (ICIP 2001), Thessaloniki, Greece, October 7-10, 2001.

[5] S. Tong and E. Chang: Support vector machine active learning for image retrieval. In ACM International Conference on Multimedia, Ottawa, Canada, September 2001.

[6] A. Barla, E. Franceschi, F. Odone and A. Verri: Image kernels. In Proceedings of the International Workshop on Pattern Recognition with Support Vector Machines, satellite event of ICPR 2002, LNCS 2388, p. 83.

[7] S. Siggelkow: Feature Histograms for Content-Based Image Retrieval. PhD thesis, Albert-Ludwigs-Universität Freiburg, December 2002.

[8] Benchathlon home page.

[9] O. Chapelle, P. Haffner and V. Vapnik: SVMs for histogram-based image classification. IEEE Transactions on Neural Networks, accepted, special issue on Support Vectors.


More information

ALBERT-LUDWIGS-UNIVERSITÄT FREIBURG INSTITUT FÜR INFORMATIK

ALBERT-LUDWIGS-UNIVERSITÄT FREIBURG INSTITUT FÜR INFORMATIK ALBERT-LUDWIGS-UNIVERSITÄT FREIBURG INSTITUT FÜR INFORMATIK Lehrstuhl für Mustererkennung und Bildverarbeitung Fast Support Vector Machine Classification of very large Datasets Technical Report 2/07 Karina

More information

DM6 Support Vector Machines

DM6 Support Vector Machines DM6 Support Vector Machines Outline Large margin linear classifier Linear separable Nonlinear separable Creating nonlinear classifiers: kernel trick Discussion on SVM Conclusion SVM: LARGE MARGIN LINEAR

More information

Relevance Feedback for Content-Based Image Retrieval Using Support Vector Machines and Feature Selection

Relevance Feedback for Content-Based Image Retrieval Using Support Vector Machines and Feature Selection Relevance Feedback for Content-Based Image Retrieval Using Support Vector Machines and Feature Selection Apostolos Marakakis 1, Nikolaos Galatsanos 2, Aristidis Likas 3, and Andreas Stafylopatis 1 1 School

More information

Support Vector Machines.

Support Vector Machines. Support Vector Machines srihari@buffalo.edu SVM Discussion Overview. Importance of SVMs. Overview of Mathematical Techniques Employed 3. Margin Geometry 4. SVM Training Methodology 5. Overlapping Distributions

More information

Lecture 10: SVM Lecture Overview Support Vector Machines The binary classification problem

Lecture 10: SVM Lecture Overview Support Vector Machines The binary classification problem Computational Learning Theory Fall Semester, 2012/13 Lecture 10: SVM Lecturer: Yishay Mansour Scribe: Gitit Kehat, Yogev Vaknin and Ezra Levin 1 10.1 Lecture Overview In this lecture we present in detail

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines CS 536: Machine Learning Littman (Wu, TA) Administration Slides borrowed from Martin Law (from the web). 1 Outline History of support vector machines (SVM) Two classes,

More information

An User Preference Information Based Kernel for SVM Active Learning in Content-based Image Retrieval

An User Preference Information Based Kernel for SVM Active Learning in Content-based Image Retrieval An User Preference Information Based Kernel for SVM Active Learning in Content-based Image Retrieval Hua Xie and Antonio Ortega Integrated Media Systems Center and Signal and Image Processing Institute

More information

DECISION-TREE-BASED MULTICLASS SUPPORT VECTOR MACHINES. Fumitake Takahashi, Shigeo Abe

DECISION-TREE-BASED MULTICLASS SUPPORT VECTOR MACHINES. Fumitake Takahashi, Shigeo Abe DECISION-TREE-BASED MULTICLASS SUPPORT VECTOR MACHINES Fumitake Takahashi, Shigeo Abe Graduate School of Science and Technology, Kobe University, Kobe, Japan (E-mail: abe@eedept.kobe-u.ac.jp) ABSTRACT

More information

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Kernels + K-Means Matt Gormley Lecture 29 April 25, 2018 1 Reminders Homework 8:

More information

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from

LECTURE 5: DUAL PROBLEMS AND KERNELS. * Most of the slides in this lecture are from LECTURE 5: DUAL PROBLEMS AND KERNELS * Most of the slides in this lecture are from http://www.robots.ox.ac.uk/~az/lectures/ml Optimization Loss function Loss functions SVM review PRIMAL-DUAL PROBLEM Max-min

More information

Introduction to object recognition. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others

Introduction to object recognition. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others Introduction to object recognition Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, and others Overview Basic recognition tasks A statistical learning approach Traditional or shallow recognition

More information

Lecture 9: Support Vector Machines

Lecture 9: Support Vector Machines Lecture 9: Support Vector Machines William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 8 What we ll learn in this lecture Support Vector Machines (SVMs) a highly robust and

More information

Efficient Case Based Feature Construction

Efficient Case Based Feature Construction Efficient Case Based Feature Construction Ingo Mierswa and Michael Wurst Artificial Intelligence Unit,Department of Computer Science, University of Dortmund, Germany {mierswa, wurst}@ls8.cs.uni-dortmund.de

More information

Support Vector Machines

Support Vector Machines Support Vector Machines VL Algorithmisches Lernen, Teil 3a Norman Hendrich & Jianwei Zhang University of Hamburg, Dept. of Informatics Vogt-Kölln-Str. 30, D-22527 Hamburg hendrich@informatik.uni-hamburg.de

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Xiaojin Zhu jerryzhu@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [ Based on slides from Andrew Moore http://www.cs.cmu.edu/~awm/tutorials] slide 1

More information

Image retrieval based on bag of images

Image retrieval based on bag of images University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2009 Image retrieval based on bag of images Jun Zhang University of Wollongong

More information

Support Vector Machines and their Applications

Support Vector Machines and their Applications Purushottam Kar Department of Computer Science and Engineering, Indian Institute of Technology Kanpur. Summer School on Expert Systems And Their Applications, Indian Institute of Information Technology

More information

Table of Contents. Recognition of Facial Gestures... 1 Attila Fazekas

Table of Contents. Recognition of Facial Gestures... 1 Attila Fazekas Table of Contents Recognition of Facial Gestures...................................... 1 Attila Fazekas II Recognition of Facial Gestures Attila Fazekas University of Debrecen, Institute of Informatics

More information

Leave-One-Out Support Vector Machines

Leave-One-Out Support Vector Machines Leave-One-Out Support Vector Machines Jason Weston Department of Computer Science Royal Holloway, University of London, Egham Hill, Egham, Surrey, TW20 OEX, UK. Abstract We present a new learning algorithm

More information

Fast Support Vector Machine Classification of Very Large Datasets

Fast Support Vector Machine Classification of Very Large Datasets Fast Support Vector Machine Classification of Very Large Datasets Janis Fehr 1, Karina Zapién Arreola 2 and Hans Burkhardt 1 1 University of Freiburg, Chair of Pattern Recognition and Image Processing

More information

Kernel Methods & Support Vector Machines

Kernel Methods & Support Vector Machines & Support Vector Machines & Support Vector Machines Arvind Visvanathan CSCE 970 Pattern Recognition 1 & Support Vector Machines Question? Draw a single line to separate two classes? 2 & Support Vector

More information

On Combining One-Class Classifiers for Image Database Retrieval

On Combining One-Class Classifiers for Image Database Retrieval On Combining One-Class Classifiers for Image Database Retrieval Carmen Lai 1,DavidM.J.Tax 2,RobertP.W.Duin 3,Elżbieta P ekalska 3,and Pavel Paclík 3 1 DIEE, University of Cagliari, Sardinia, Italy carmen@ph.tn.tudelft.nl

More information

GENDER CLASSIFICATION USING SUPPORT VECTOR MACHINES

GENDER CLASSIFICATION USING SUPPORT VECTOR MACHINES GENDER CLASSIFICATION USING SUPPORT VECTOR MACHINES Ashwin Swaminathan ashwins@umd.edu ENEE633: Statistical and Neural Pattern Recognition Instructor : Prof. Rama Chellappa Project 2, Part (a) 1. INTRODUCTION

More information

Lecture 7: Support Vector Machine

Lecture 7: Support Vector Machine Lecture 7: Support Vector Machine Hien Van Nguyen University of Houston 9/28/2017 Separating hyperplane Red and green dots can be separated by a separating hyperplane Two classes are separable, i.e., each

More information

Lecture 10: Support Vector Machines and their Applications

Lecture 10: Support Vector Machines and their Applications Lecture 10: Support Vector Machines and their Applications Cognitive Systems - Machine Learning Part II: Special Aspects of Concept Learning SVM, kernel trick, linear separability, text mining, active

More information

Lecture Linear Support Vector Machines

Lecture Linear Support Vector Machines Lecture 8 In this lecture we return to the task of classification. As seen earlier, examples include spam filters, letter recognition, or text classification. In this lecture we introduce a popular method

More information

Some Advanced Topics in Linear Programming

Some Advanced Topics in Linear Programming Some Advanced Topics in Linear Programming Matthew J. Saltzman July 2, 995 Connections with Algebra and Geometry In this section, we will explore how some of the ideas in linear programming, duality theory,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Isabelle Guyon Notes written by: Johann Leithon. Introduction The process of Machine Learning consist of having a big training data base, which is the input to some learning

More information

Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning Overview T7 - SVM and s Christian Vögeli cvoegeli@inf.ethz.ch Supervised/ s Support Vector Machines Kernels Based on slides by P. Orbanz & J. Keuchel Task: Apply some machine learning method to data from

More information

Learning texture similarity with perceptual pairwise distance

Learning texture similarity with perceptual pairwise distance University of Wollongong Research Online Faculty of Engineering and Information Sciences - Papers: Part A Faculty of Engineering and Information Sciences 2005 Learning texture similarity with perceptual

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Chapter 9 Chapter 9 1 / 50 1 91 Maximal margin classifier 2 92 Support vector classifiers 3 93 Support vector machines 4 94 SVMs with more than two classes 5 95 Relationshiop to

More information

Data mining with Support Vector Machine

Data mining with Support Vector Machine Data mining with Support Vector Machine Ms. Arti Patle IES, IPS Academy Indore (M.P.) artipatle@gmail.com Mr. Deepak Singh Chouhan IES, IPS Academy Indore (M.P.) deepak.schouhan@yahoo.com Abstract: Machine

More information

KBSVM: KMeans-based SVM for Business Intelligence

KBSVM: KMeans-based SVM for Business Intelligence Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2004 Proceedings Americas Conference on Information Systems (AMCIS) December 2004 KBSVM: KMeans-based SVM for Business Intelligence

More information

Chap.12 Kernel methods [Book, Chap.7]

Chap.12 Kernel methods [Book, Chap.7] Chap.12 Kernel methods [Book, Chap.7] Neural network methods became popular in the mid to late 1980s, but by the mid to late 1990s, kernel methods have also become popular in machine learning. The first

More information

Instance-based Learning

Instance-based Learning Instance-based Learning Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 19 th, 2007 2005-2007 Carlos Guestrin 1 Why not just use Linear Regression? 2005-2007 Carlos Guestrin

More information

Lab 2: Support vector machines

Lab 2: Support vector machines Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2

More information

Rule extraction from support vector machines

Rule extraction from support vector machines Rule extraction from support vector machines Haydemar Núñez 1,3 Cecilio Angulo 1,2 Andreu Català 1,2 1 Dept. of Systems Engineering, Polytechnical University of Catalonia Avda. Victor Balaguer s/n E-08800

More information

Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms

Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002 1225 Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms S. Sathiya Keerthi Abstract This paper

More information

One-class Problems and Outlier Detection. 陶卿 中国科学院自动化研究所

One-class Problems and Outlier Detection. 陶卿 中国科学院自动化研究所 One-class Problems and Outlier Detection 陶卿 Qing.tao@mail.ia.ac.cn 中国科学院自动化研究所 Application-driven Various kinds of detection problems: unexpected conditions in engineering; abnormalities in medical data,

More information

Training Data Selection for Support Vector Machines

Training Data Selection for Support Vector Machines Training Data Selection for Support Vector Machines Jigang Wang, Predrag Neskovic, and Leon N Cooper Institute for Brain and Neural Systems, Physics Department, Brown University, Providence RI 02912, USA

More information

Summarizing Inter-Query Learning in Content-Based Image Retrieval via Incremental Semantic Clustering

Summarizing Inter-Query Learning in Content-Based Image Retrieval via Incremental Semantic Clustering Summarizing Inter-Query Learning in Content-Based Image Retrieval via Incremental Semantic Clustering Iker Gondra, Douglas R. Heisterkamp Department of Computer Science Oklahoma State University Stillwater,

More information

RETIN AL: An Active Learning Strategy for Image Category Retrieval

RETIN AL: An Active Learning Strategy for Image Category Retrieval RETIN AL: An Active Learning Strategy for Image Category Retrieval Philippe-Henri Gosselin, Matthieu Cord To cite this version: Philippe-Henri Gosselin, Matthieu Cord. RETIN AL: An Active Learning Strategy

More information

9. Support Vector Machines. The linearly separable case: hard-margin SVMs. The linearly separable case: hard-margin SVMs. Learning objectives

9. Support Vector Machines. The linearly separable case: hard-margin SVMs. The linearly separable case: hard-margin SVMs. Learning objectives Foundations of Machine Learning École Centrale Paris Fall 25 9. Support Vector Machines Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech Learning objectives chloe agathe.azencott@mines

More information

Online Mathematical Symbol Recognition using SVMs with Features from Functional Approximation

Online Mathematical Symbol Recognition using SVMs with Features from Functional Approximation Online Mathematical Symbol Recognition using SVMs with Features from Functional Approximation Birendra Keshari and Stephen M. Watt Ontario Research Centre for Computer Algebra Department of Computer Science

More information

Data Mining. Lesson 9 Support Vector Machines. MSc in Computer Science University of New York Tirana Assoc. Prof. Dr.

Data Mining. Lesson 9 Support Vector Machines. MSc in Computer Science University of New York Tirana Assoc. Prof. Dr. Data Mining Lesson 9 Support Vector Machines MSc in Computer Science University of New York Tirana Assoc. Prof. Dr. Marenglen Biba Data Mining: Content Introduction to data mining and machine learning

More information

Perceptron Learning Algorithm

Perceptron Learning Algorithm Perceptron Learning Algorithm An iterative learning algorithm that can find linear threshold function to partition linearly separable set of points. Assume zero threshold value. 1) w(0) = arbitrary, j=1,

More information

Accelerometer Gesture Recognition

Accelerometer Gesture Recognition Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate

More information

Support Vector Machines for Face Recognition

Support Vector Machines for Face Recognition Chapter 8 Support Vector Machines for Face Recognition 8.1 Introduction In chapter 7 we have investigated the credibility of different parameters introduced in the present work, viz., SSPD and ALR Feature

More information

Classification: Feature Vectors

Classification: Feature Vectors Classification: Feature Vectors Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just # free YOUR_NAME MISSPELLED FROM_FRIEND... : : : : 2 0 2 0 PIXEL 7,12

More information

Kernel Methods. Chapter 9 of A Course in Machine Learning by Hal Daumé III. Conversion to beamer by Fabrizio Riguzzi

Kernel Methods. Chapter 9 of A Course in Machine Learning by Hal Daumé III.   Conversion to beamer by Fabrizio Riguzzi Kernel Methods Chapter 9 of A Course in Machine Learning by Hal Daumé III http://ciml.info Conversion to beamer by Fabrizio Riguzzi Kernel Methods 1 / 66 Kernel Methods Linear models are great because

More information

An Introduction to Content Based Image Retrieval

An Introduction to Content Based Image Retrieval CHAPTER -1 An Introduction to Content Based Image Retrieval 1.1 Introduction With the advancement in internet and multimedia technologies, a huge amount of multimedia data in the form of audio, video and

More information

Kernel-based online machine learning and support vector reduction

Kernel-based online machine learning and support vector reduction Kernel-based online machine learning and support vector reduction Sumeet Agarwal 1, V. Vijaya Saradhi 2 andharishkarnick 2 1- IBM India Research Lab, New Delhi, India. 2- Department of Computer Science

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

Linear methods for supervised learning

Linear methods for supervised learning Linear methods for supervised learning LDA Logistic regression Naïve Bayes PLA Maximum margin hyperplanes Soft-margin hyperplanes Least squares resgression Ridge regression Nonlinear feature maps Sometimes

More information