ImageCLEF 2008: visual feature analysis in segmented images

Bálint Daróczy, Zsolt Fekete, Mátyás Brendel, Simon Rácz, András Benczúr, Dávid Siklósi, Attila Pereszlényi

Data Mining and Web search Research Group, Informatics Laboratory, Computer and Automation Research Institute of the Hungarian Academy of Sciences
{daroczyb, zsfekete, mbrendel, sracz, benczur, peresz, sdavid}@ilab.sztaki.hu

Abstract. We describe our image processing system used in the ImageCLEF 2008 Photo Retrieval and Visual Concept Detection tasks. Our method consists of image segmentation followed by feature generation over the segments based on color, shape and texture. In the paper we elaborate on the importance of choices in the segmentation procedure with emphasis on edge detection. We also measure the relative importance of the visual features as well as the right choice of the distance function. Finally, given the very large number of parameters in our image processing system, we give a method for parameter optimization by measuring how well the similarity measures separate sample images of the same topic from those of different topics.

1 Introduction

The ImageCLEF 2008 Photo Retrieval [1] and Visual Concept Detection [2] tasks both targeted image processing and visual feature generation over the IAPR TC-12 benchmark collection [3]. While the actual systems we used in the ImageCLEF 2008 campaign are described in the Working Notes [4, 5], in this paper we concentrate on the main lessons we have learned concerning the strength of various visual processing elements in image categorization and similarity search.

Our image processing system, common to both tasks, is based on image segmentation followed by feature generation for the individual image segments, typically around 100 for each image in the corpus. The segmentation procedure consists of a novel combination of the Felzenszwalb-Huttenlocher graph cut method [6] with smoothing over the Gaussian-Laplacian Pyramid [7].
We map all image segments into a roughly 400-dimensional space with features describing the color, shape and texture of the segment. While in the image categorization task we can learn the relative importance of the feature classes, the similarity search procedure used in our content based retrieval system is sensitive to the weights. We give an extensive analysis of the feature weights as well as a novel method to learn these weights based solely on the sample images of the photo retrieval topics.

We briefly describe our text IR system; for more details we refer to [4]. As our information retrieval system we use the Hungarian Academy of Sciences search engine, which is based on Okapi BM25 and takes the proximity of query terms into account. We used the original automatic query expansion formula of [8]. While we show results with and without query expansion, in our implementation the improvement turned out to be minor, and hence we omit a detailed description and analysis from this report. We also omit details on cluster recall, as clusters were typically organized based on the location of the photograph and, in our opinion, image processing could not assist in identifying images of the same cluster.

This work was supported by the EU FP7 project JUMAS Judicial Management by Digital Libraries Semantics and by grants OTKA NK and NKFP-07-A2 TEXTREND.

2 The Image Processing System

In our system we segment images by a novel combination of the graph based image segmentation method of Felzenszwalb and Huttenlocher [6] with the Gaussian-Laplacian Pyramid. While the pyramid has been used with success in the ImageCLEF campaign, for example in combination with the region of interest method [9], we find other elements of the segmentation procedure to be of more importance.

After segmentation we map each segment into a feature space characterizing its color, shape and texture, with description and dimensionality shown in Table 1. These features are used directly for image classification in Section 4. Their use in the content-based retrieval system (CBIR) of Section 3 is via the distance from sample images.

dimensions  description
3           Mean HSV (or RGB)
60          RGB histogram, 20 bins each
30          Hue histogram
15          Saturation histogram
15          Value histogram
210         Zig-Zag Fourier amplitude (105) and absolute phase (105) low frequency components
1           Size
1           Aspect ratio
64          Shape: density in 8x8 regions

Table 1. Description and number of visual features used to characterize a single image segment.
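As an illustration of the layout in Table 1, the following minimal Python sketch assembles the color, size and shape part of a segment descriptor. It is not the system described in the paper: the names `histogram` and `segment_features`, the pixel representation (a dict from coordinates to RGB triples in [0, 1]), the histogram normalization and the reading of "density in 8x8 regions" as the fraction of segment pixels falling into each cell of the bounding box are all assumptions. The 210 Fourier components are omitted, so the sketch yields 189 of the roughly 400 dimensions.

```python
import colorsys


def histogram(values, bins):
    """Normalized histogram of values in [0, 1] (normalization is an assumption)."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]


def segment_features(pixels):
    """pixels: dict mapping (x, y) -> (r, g, b) with channels in [0, 1]."""
    coords = list(pixels)
    rgb = [pixels[c] for c in coords]
    hsv = [colorsys.rgb_to_hsv(*p) for p in rgb]
    n = len(rgb)

    feats = [sum(p[i] for p in hsv) / n for i in range(3)]  # 3: mean HSV
    for channel in range(3):                                # 60: RGB histograms
        feats += histogram([p[channel] for p in rgb], 20)
    feats += histogram([p[0] for p in hsv], 30)             # 30: hue
    feats += histogram([p[1] for p in hsv], 15)             # 15: saturation
    feats += histogram([p[2] for p in hsv], 15)             # 15: value

    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    x0, y0 = min(xs), min(ys)
    width, height = max(xs) - x0 + 1, max(ys) - y0 + 1
    feats.append(float(n))                                  # 1: size
    feats.append(width / height)                            # 1: aspect ratio

    # 64: share of the segment's pixels in each cell of an 8x8 grid
    # laid over the bounding box (one possible reading of "density").
    grid = [0] * 64
    for x, y in coords:
        gx = (x - x0) * 8 // width
        gy = (y - y0) * 8 // height
        grid[gy * 8 + gx] += 1
    feats += [g / n for g in grid]
    return feats
```

For a 2x2 segment of pure red pixels the sketch returns a 189-dimensional vector whose first three entries are the mean HSV values (0.0, 1.0, 1.0).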
Given a pair of a sample and a target image, for each sample segment we compute the distance of the closest segment in the target image. The final (asymmetric) distance arises by simply averaging over all sample image segments.

Next we describe the details of the segmentation procedure (Section 2.1), and in Section 2.2 we give a novel method to learn the weight of the feature groups based solely on the sample images of the photo retrieval queries. The effect of various settings on the image processing quality is analyzed over the Visual Concept Detection (Section 4) and Photo Retrieval (Section 3) tasks of ImageCLEF 2008.

2.1 Segmentation

Our segmentation procedure is based on a multilevel Gaussian-Laplacian pyramid [7] that enables a gradual refinement of the segments, starting out from a coarse segmentation on the top level of the pyramid. Given a coarser segmentation on a higher level, we first try to replace each segment pixel by pixel with the four lower level pixels if their similarity in the RGB space is within a threshold. If the four pixels of the finer resolution are dissimilar, we remove those pixels from the segment. The remaining segments are kept together as starting segments for the lower level procedure, while the removed pixels can join existing segments or form new ones.

On the top level of the pyramid we use a modified Felzenszwalb-Huttenlocher graph cut method [6] that, on lower levels, simply continues the growth of the segments obtained on the higher level. Our main improvement over the original method is the use of Canny edge detection [10] values to weight the connection between neighboring pixels. The original method uses only the distance in the RGB space as weight, to which we add the edge detection weight.

We also require that images have a similar number of segments and that the segments are large enough to be meaningful for retrieval or classification purposes. The original Felzenszwalb-Huttenlocher method builds a minimum spanning forest where the addition of a new pixel to a component is constrained by the weight of the connection with the next pixel and the size of the existing component. We test two post-processing rules that reject the smallest segments. The pixels of rejected segments are then redistributed by the same minimum spanning forest method, but now without any further restriction on the growth of the existing large segments. The two different rules are as follows:

- Segments of size below a threshold are rejected.
- All segments are rejected except for the prescribed number of largest ones.

2.2 Learning feature weights for image similarity search

Our CBIR ranks images based on the distance of the target image segments from the sample image segments. Unlike in image classification, where classifiers may be capable of learning the relative importance of the features, when considering distances in the feature space we cannot distinguish between directions that are relevant or irrelevant with respect to image retrieval.

When applying feature weight optimization for the Photo Retrieval task, we face several problems. First, training data consists solely of the three sample images of the topics. Second, relevance to certain Photo Retrieval topics is based on aspects other than image similarity, such as the location of the scene. Third, the three sample images of the same topic are sometimes not even similar.

Our method for training the image processing weights is based on a test for topic separation. We manually select those topics where the three sample images are similar to one another. For ImageCLEF Photo 2008 the list of the selected topics (some of which are ImageCLEF 2007 only) is as follows: 01, 02, 04, 07, 14, 15, 17, 22, 24, 27, 33, 36, 41, 43, 45, 51, 53, 55, 58, 60. The training data consists of image pairs, with an identical number of pairs from the same topic and from different topics. Since our distance is asymmetric, we have six ordered pairs for each topic, which results in 120 positive pairs over the 20 topics. The negative pairs are formed by selecting, for each of the 60 sample images, two random pairs with an image from a different topic, which again gives 120 pairs.

We optimize the weights for the AUC value of this two-class classification. Since the task at hand is computationally very inexpensive, we simply performed a brute force parameter search. For larger problems we could choose from logistic regression (if we only train linear weights), simulated annealing or genetic algorithms, to name a few.

Given the post-campaign evaluation data, we could perform another, manual parameter search to find the best performing weights in terms of the MAP of the retrieval system. As shown in Section 3, our method comes very close to the best settings we found manually, a result that is in fact overtrained due to the use of all evaluation data.

3 The Photo Retrieval Task

For the Photo Retrieval task we combine the scores of our text retrieval system (with or without query expansion) with the following visual relevance score. For a target image to be ranked, we take each segment of a given topic sample image and find the closest segment in the target image. We average the distances over all these segments. Finally, among the three sample images we use the smallest value, which corresponds to the closest, most similar one. Since we compute distance instead of similarity, we simply negate the values.

When combining the much lower quality visual scores with the text retrieval scores, we use a method that basically optimizes for early precision but reaches very good improvement in MAP as well.
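The asymmetric segment distance, the visual relevance score and the test for topic separation described above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions rather than the system used in the campaign: all function names are hypothetical, an image is represented as a list of segment feature vectors, the weighted l1 distance uses one weight per feature group expanded to the dimensions of that group, and the weight grid is an arbitrary example.

```python
from itertools import product


def l1_distance(a, b, weights):
    """Weighted l1 distance between two segment feature vectors."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def image_distance(sample, target, weights):
    """Asymmetric distance: for each sample segment take the closest
    target segment, then average over the sample segments."""
    closest = [min(l1_distance(s, t, weights) for t in target) for s in sample]
    return sum(closest) / len(closest)


def visual_score(sample_images, target, weights):
    """Negated distance to the closest of the topic's sample images."""
    return -min(image_distance(s, target, weights) for s in sample_images)


def auc(pos, neg):
    """Share of (positive, negative) pairs in which the same-topic distance
    is the smaller one; ties count one half."""
    wins = sum(1.0 if p < n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def search_weights(pos_pairs, neg_pairs, group_sizes, grid=(0.0, 0.5, 1.0)):
    """Brute force search over one weight per feature group.
    pos_pairs / neg_pairs: lists of (image_a, image_b) from the same /
    from different topics; the grid values are an assumed example."""
    best_auc, best_combo = -1.0, None
    for combo in product(grid, repeat=len(group_sizes)):
        # Expand the group weights to one weight per feature dimension.
        weights = [w for w, size in zip(combo, group_sizes) for _ in range(size)]
        pos = [image_distance(a, b, weights) for a, b in pos_pairs]
        neg = [image_distance(a, b, weights) for a, b in neg_pairs]
        score = auc(pos, neg)
        if score > best_auc:
            best_auc, best_combo = score, combo
    return best_auc, best_combo
```

With two toy images `a = [[0.0, 0.0], [1.0, 1.0]]` and `b = [[0.0, 0.0], [3.0, 3.0]]` and unit weights, `image_distance(a, b, ...)` is 1.0 while `image_distance(b, a, ...)` is 2.0, which shows why each topic contributes six ordered pairs. The exhaustive loop grows exponentially in the number of feature groups, so it is only practical for a handful of groups and a coarse grid.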
Due to the low quality of the visual scores, low ranked images carry little information and act as noise when combining with text retrieval. Hence we replace all except the highest scores by a common value: after some position i, for all j > i we let score_j = score_i. In our experiments we choose i to be the first position where score_i = score_{i+1}.

Our results are summarized in Table 2 for a choice of 100 segments with the best segmentation method, which uses a 7-level Gaussian-Laplacian pyramid and Canny edge detection. We observe the following behavior. First, the l1 distance outperforms l2 in all cases. Second, better CBIR scores translate into better combined scores. Third, the test for topic separation (the method of Section 2.2) finds weights that perform nearly as well as the overtrained best weight setting, which we were only able to compute given all relevance assessment data, and by far outperforms the all-1.0 weight case. Unfortunately, query expansion gives only very minor improvement and we will have to revise this component of our system.

Table 2. Photo Retrieval performance of different methods, measured by MAP, P5 and P20 (numeric values not recoverable). Methods compared: visual only with w1.0, TST and best weights under the l1 and l2 norms; text only (txt, txt+qe); and the combinations txt+w1.0, txt+tst, txt+best, txt+qe+tst and txt+qe+best. Legend:
l1: l1 norm is used
l2: l2 norm is used
TST: visual feature weights optimized with the test for topic separation (Section 2.2)
w1.0: all weights set to 1.0
best: weights hand picked based on the evaluation data
txt: text based information retrieval
qe: query expansion

Figure 1 shows the performance of our best methods on the different topics. Topics are sorted by the MAP of the pure visual result. As can be seen, the visual result improves the text result in most of the topics, with the exception of four topics (31, 60, 17 and 15) only. Interestingly, for five topics (23, 59, 50 and 53) the MAP improvement is higher than the visual MAP itself.

The main factors in our CBIR performance are the l1 distance in the HSV and DFT feature space as well as our home grown parameter optimization method. Table 3 compares some variations. In general the HSV space is better than RGB, but RGB yields additional improvement in combination. The table also justifies the use of both the Gaussian-Laplacian pyramid and the Canny edge weight in the Felzenszwalb-Huttenlocher segmentation algorithm.

Table 3. Performance of the method variants evaluated by different measures (MAP, P5 and P20; numeric values not recoverable). Variants: RGB; RGB + Canny; RGB + pyramid; RGB + Canny + pyramid; RGB+HSV; RGB+HSV + Canny; RGB+HSV + pyramid; RGB+HSV + Canny + pyramid.

Finally, in Fig. 2 we compare the relative strength of the features. Five features, namely size, aspect ratio and the three mean HSV values, themselves form a strong similarity space. This is due to the large number of segments, so that these features act as histograms. Over this feature set DFT gives the largest additional improvement, while histograms and shape add a very small, though positive, increase in MAP.

Fig. 1. Performance of different methods by topic. The diff line denotes the improvement of the CBIR over text retrieval with query expansion.

Fig. 2. Performance of different feature combinations.

4 The Visual Concept Detection Task

For the Visual Concept Detection Task we used our image processing system with three main settings:

- global: features computed for the whole image: mean color, histogram and DFT;
- medium: 50 segments, features: size, ratio, mean color, histogram, shape and DFT;
- small: 100 segments, features: size, ratio, mean color, histogram, shape and DFT.

Logistic regression was used for classification with the global or segment features as input. For a single image we averaged the segment based predictions, which turned out to be more accurate than either the minimum or the maximum. We note that we did not use the class hierarchy information.

Our main classification results are summarized in Table 4 where, in addition to the three above settings for image processing, we give two additional combinations:

- Logreg: The outputs of the classifiers are combined by logistic regression, using a random 1/4 fraction of the training data as heldout set. For the rest of the training set, predictions are generated in a 3-fold crossvalidation.
- Mixed: For each class, the method performing best on the above defined heldout set was selected.

Left (EER and AUC for Glob1, Large and Small; numeric values not recoverable).

Right:
Concept     Glob1    Large  Small
Night       10.00/   /      /79.72
Overcast    18.77/   /      /79.24
Vegetation  35.47/   /      /77.87
Buildings   36.65/   /      /73.36

Table 4. Left: Performance of the three basic methods and their combination, evaluated by different measures. Right: Examples of global and local types of concepts with performance given in the form of EER/AUC.

As seen in Table 4, left, the best overall performance is attained with a high dimensional global feature space, closely followed by the medium resolution segmentation. We find a clear distinction between concepts that give an overall characterization of the image (day, night, overcast) and those that describe objects in the image (people, vegetation, buildings).
The former concepts are best classified in a global feature space, while the latter are best classified in a segmentwise local feature space (Table 4, right).
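The combination of segment based predictions into one image level prediction, as described above, can be illustrated with the following minimal Python sketch. It assumes an already trained logistic regression model given by hypothetical parameters `coef` and `intercept`; the function names and the `combine` switch are illustrative only and do not reproduce the actual classifier.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def segment_probability(features, coef, intercept):
    """Probability that one segment belongs to the concept."""
    return sigmoid(intercept + sum(c * f for c, f in zip(coef, features)))


def image_probability(segments, coef, intercept, combine="mean"):
    """Combine the per-segment predictions of one image.
    The text reports that averaging worked better than min or max."""
    probs = [segment_probability(s, coef, intercept) for s in segments]
    if combine == "mean":
        return sum(probs) / len(probs)
    if combine == "min":
        return min(probs)
    if combine == "max":
        return max(probs)
    raise ValueError("unknown combination rule: " + combine)
```

Keeping the minimum and maximum rules behind the same switch makes it easy to repeat the comparison on a heldout set: the minimum requires every segment to support the concept, the maximum lets a single segment decide, and the mean sits between the two.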

5 Conclusion and future work

We have demonstrated that image segmentation based retrieval and categorization systems perform well, and we have analyzed the right choice for the segmenter and the visual features. In future work we will conduct a more thorough investigation of possible features and use more sophisticated methods for computing image distances, such as the mixture of Gaussian models. We also plan to strengthen our results by improving our query expansion procedure, using more sophisticated methods for text and image retrieval fusion, and utilizing visual concepts for retrieval.

References

1. Arni, T., Clough, P., Sanderson, M., Grubinger, M.: Overview of the ImageCLEFphoto 2008 photographic retrieval task. In Peters, C., Giampiccolo, D., Ferro, N., Petras, V., Gonzalo, J., Peñas, A., Deselaers, T., Mandl, T., Jones, G., Kurimo, M., eds.: Evaluating Systems for Multilingual and Multimodal Information Access, 9th Workshop of the Cross-Language Evaluation Forum. Lecture Notes in Computer Science, Aarhus, Denmark (September 2008, printed in 2009)
2. Deselaers, T., Hanbury, A.: The visual concept detection task in ImageCLEF 2008. In Peters, C., Giampiccolo, D., Ferro, N., Petras, V., Gonzalo, J., Peñas, A., Deselaers, T., Mandl, T., Jones, G., Kurimo, M., eds.: Evaluating Systems for Multilingual and Multimodal Information Access, 9th Workshop of the Cross-Language Evaluation Forum. Lecture Notes in Computer Science, Aarhus, Denmark (September 2008, printed in 2009)
3. Grubinger, M., Clough, P., Müller, H., Deselaers, T.: The IAPR TC-12 benchmark - a new evaluation resource for visual information systems. In: OntoImage. (2006)
4. Rácz, S., Daróczy, B., Siklósi, D., Pereszlényi, A., Brendel, M., Benczúr, A.: Increasing cluster recall of cross-modal image retrieval. In: Working Notes for the CLEF 2008 Workshop, Aarhus, Denmark (September 2008)
5. Daróczy, B., Fekete, Z., Brendel, M.: ImageCLEF 2008 visual concept detection. In: Working Notes for the CLEF 2008 Workshop, Aarhus, Denmark (September 2008)
6. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. International Journal of Computer Vision 59 (2004)
7. Burt, P., Adelson, E.: The Laplacian Pyramid as a Compact Image Code. IEEE Transactions on Communications 31(4) (1983)
8. Xu, J., Croft, W.: Query expansion using local and global document analysis. Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval (1996)
9. Ah-Pine, J., Cifarelli, C., Clinchant, S., Csurka, G., Renders, J.: XRCE's Participation to ImageCLEF 2008. In: Working Notes of the 2008 CLEF Workshop. (2008)
10. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8(6) (November 1986)
