A User-driven Model for Content-based Image Retrieval

Yi Zhang*, Zhipeng Mo*, Wenbo Li* and Tianhao Zhao*
*Tianjin University, Tianjin, China
E-mail: yizhang@tju.edu.cn

Abstract

The intention of image retrieval systems is to provide retrieved results as close to users' expectations as possible. However, even for the same concept and keywords, users' requirements vary across application scenarios. In this paper, we introduce a personalized image retrieval model driven by users' operational history. In our simulated system, three types of data (browsing time, downloads and grades) are collected to generate a sort criterion for retrieved image sets. According to this criterion, the image collection is classified into a positive group, a negative group and a testing group. An SVM classifier is then trained on image features extracted from these groups and used to refine the retrieved results. We test the proposed method on several image sets. The experimental results show that our model effectively represents users' demands and helps improve retrieval accuracy.

I. INTRODUCTION

Over the past decade, the expansion of digital imagery in many directions has resulted in an explosion in the volume of image data that must be organized [1]. Although much progress has been made in teaching computers to understand, index and annotate pictures [2], it remains a challenge to find the ideal images for a particular user's requirement in a database of such size. Current commercial web image search engines cannot meet users' expectations because they retrieve images only by keyword rather than by content. Even for the same keyword, query demands may vary depending on an individual user's preferences, purpose and other factors. Thus, a personalized retrieval method is needed.
Relevance feedback (RF) provides personalization in image retrieval by asking users to label images, and it has been studied extensively in the content-based image retrieval (CBIR) community [3]. For example, a unified relevance feedback framework [4] combines textual feature-based RF and visual feature-based RF. However, most RF methods are far from practical owing to scalability, efficiency and effectiveness issues. To overcome these shortcomings, implicit RF [5] was proposed. Implicit RF was first applied to document retrieval and was introduced to the CBIR community only a few years ago. It gathers useful data indirectly by monitoring users' behavior during searching [5], instead of demanding much effort from the users. Although implicit RF methods are generally considered less accurate than explicit measures [6], their advantages are obvious.

In this paper, we propose an image retrieval personalization method that combines explicit image rating and implicit RF. Our method infers an individual user's interest in images from common browsing behaviors, with the purpose of improving the user experience of a traditional CBIR system without increasing the user's burden. In our model, we simulate users' behavior while searching for images by keyword and browsing the retrieved results. Three types of data are collected and recorded in the system: the browsing time of each image, the download count of each image, and the grades users give to images. These data are associated with a certain user under the given keyword and are used to predict that user's preferences. The user's preference for each retrieved image is calculated by combining the three indicators mentioned above. To obtain an ordered image set for training, we rank the images by their preference values. The ordered image set is then divided into a positive group, a negative group and a testing group according to the preference values of the images. Furthermore, we extract three kinds of features (color, edge and texture) from the images to train an SVM classifier. The classifier is saved for that user and helps to classify future retrieval results; images assigned to the positive class are displayed with higher priority. As more user information arrives, the classifier can be updated dynamically to improve classification performance.

The rest of this paper is organized as follows. Section 2 briefly reviews relevance feedback methods in image retrieval. Section 3 describes the framework and technical details of our method. Section 4 presents the experimental results and evaluations. Finally, Section 5 concludes the paper.

II. RELATED WORK

Relevance feedback is a supervised learning technique that focuses on the interactions between users and the image retrieval system. The main idea of a typical RF method is to improve retrieval accuracy through user labeling. For a given query, the retrieval system first retrieves an image set according to predefined similarity metrics, usually defined on an image feature space. The users are then asked to label positive and negative images, and these labels are used to refine the query dynamically so that the accuracy and performance of the retrieval system improve. Generally speaking, relevance feedback methods build a user satisfaction model by adjusting the weights of multiple image features: more important feature dimensions receive larger weights, while less important ones receive smaller weights. The model tends to approach the user's intent after several iterations. Among relevance feedback methods, Bayesian estimation [3] has been widely used: the feedback samples are treated as a sequence of independent queries and the retrieval error is minimized using Bayesian rules. However, conventional RF techniques demand a great deal of user effort, which is their main practical drawback. To overcome this overloading problem, the implicit RF approach was proposed. Implicit RF techniques collect user feedback indirectly by monitoring the behavior of the user during the searching and browsing process, instead of requiring specific manipulations. So far, several studies have addressed visual attention for implicit relevance feedback. Auer et al. [7] propose a system that infers users' intent from eye movements using machine learning; the system then learns a similarity metric over common image features depending on the current interests of the user. Similarly, the system of [8] improves retrieval performance by re-ranking the retrieved images according to color and texture features extracted from the regions to which users pay more attention; the users' interests are inferred from gaze information collected by an unobtrusive eye tracker. However, these studies also place an extra burden on users, because the system has to collect user data with the help of additional tools.

III. METHODS

3.1 Overview

In this paper, we propose a user-driven model for post-processing content-based image retrieval results. The goal of our model is to provide users with a better retrieval experience and with retrieval results that are as close as possible to their expectations. The primary contributions of this paper are summarized as follows: We build a user-driven model that re-ranks retrieved images by training on the user's preferred image set. Our goal is to promote personalization of image retrieval by making use of individuals' retrieval experience.
We experimentally evaluate various factors that influence the performance of the proposed model, such as the size of the training set, noise in the training set and the features used in training. The results demonstrate the effectiveness of our user-driven model. We also build a simulated image browsing system to collect user data implicitly; the interactions between users and the image retrieval system are more effective and convenient than in general RF methods. Fig. 1 shows a sketch of our method, which mainly consists of the following six steps. 1) Given an unordered initial set of images for the user's query, the user browses images and interacts unconsciously with the system in the simulated browsing system.

Fig. 1 The architecture of the system

2) The system writes the user's manipulation history into the system log, and a preference value for each image is calculated by the user-driven model from the data extracted from this history. 3) According to the preference values, the browsed image collection is divided into three groups: a positive group and a negative group for training, and all remaining images for testing. 4) Three types of image features (color, edge and texture) are extracted from both the positive and the negative image sets for classifier training. 5) An SVM classifier is trained and used to select the images that meet the user's expectations from the un-browsed image collection or from new retrieval results. 6) The classifier is reused and refined when the system receives a similar query from the same user.

3.2 User-driven Model

The user-driven model is used to estimate how well the browsed images match the user's expectations. According to the model, these images are divided into a positive training set and a negative training set, which are used to build a classifier. We build the model from user behaviors recorded in the simulated system; the output of the model is the image's preference value. A higher preference value indicates that the corresponding image has attracted more of the user's attention. Three types of user behavior are considered in our model: the browsing time of the image, the download count of the image and the user's grade for the image. Two of these three factors can be extracted implicitly from user operations. Several studies [9][10] have found that users' preferences have a large impact on the time they spend reading an article. Analogously, it can be expected that users will spend more browsing time on preferred images. Moreover, users tend to zoom in on or download the images they are interested in. We therefore take the download count of an image as another factor indicating the user's preference for it. Besides the implicit factors, we also employ grading, which requires explicit user feedback, as one measurement of the user's preference. The user's grade, however, is an optional factor in our model, so most of the user's effort can be saved. For each browsed image, the preference value V is calculated as

V = w_T * T + w_D * D + w_G * G    (1)

where T is the browsing time, D is the download count and G is the user's grade. w_T, w_D and w_G are the respective weights of the three factors, indicating their contributions to the preference value. When determining the weight of each factor, two properties are considered: universality and sensitivity. We give universality higher priority; in other words, common operations contribute more to the final preference value. When the universality of a factor is high enough, sensitivity plays the more important role. We analyze the three factors of our model from these two aspects as follows.

Factor browsing time. Browsing time has a high universality score because users strongly tend to spend more time browsing an image that attracts them. But even for uninteresting images, users have to glance over them to judge their importance. Therefore, the sensitivity of browsing time is relatively low, because every image receives some basic browsing time.

Factor download count. Users tend to download images that meet their expectations. Therefore, both the universality and the sensitivity of the download factor are high.

Factor grades. Grading is explicit feedback. We give this factor a low universality score, as the grading operation cannot be guaranteed owing to user laziness. In contrast, its sensitivity score is relatively high because it reflects the user's particular preference for an image.

General Measurement. On the basis of the factor discussion above and a number of experiments, we empirically determined the weights of the three factors as follows:

V = 0.30 * max(T/15, 1) + 0.60 * D + 0.10 * (|G - 2.5| / 2) * α    (2)

The base weight of the browsing-time factor is 30 percent. To strengthen its sensitivity, we set a threshold of 15: when the browsing time exceeds 15 seconds, the image gains additional importance. The weight of the download factor is 60 percent. The base weight of the grade factor is 10 percent owing to its low universality. As grading is an optional operation, we give extra rewards or punishments when users spend the effort to grade images. When an image receives a grade above the average grade of 2.5, its preference value is increased; conversely, an image loses some score if the user gives it a negative evaluation. α is the operator that makes the reward/punishment decision:

α = +1 if G > 2.5, α = -1 if G < 2.5    (3)

3.3 Feature Extraction

For computational simplicity and general applicability, we extract the three most widely used features to describe an image: a color histogram in HSV color space, an edge feature and a texture feature. These features are then used to train a classifier for the user's preference.

3.3.1 Color Feature Extraction

We employ the HSV color space for color histogram generation, as HSV represents color in a way similar to human color perception. The initial color information read directly from an image is RGB; the colors are transformed from RGB to HSV and a histogram is computed for each HSV channel. We divide the H channel into 32 bins and the S and V channels into 16 bins each, obtaining a 64-dimension color feature vector for each image.

3.3.2 Edge Feature Extraction

The edge feature is computed as the edge-pixel proportion in each sub-region.
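As a concrete illustration of the scoring model in Section 3.2, Eqs. (1)-(3) can be sketched in a few lines of Python. The function name and the treatment of a missing grade are our own choices; the weights and thresholds follow the reconstructed Eq. (2):

```python
def preference(T, D, G=None):
    """Preference value of a browsed image (a sketch of Eqs. (1)-(3)).

    T : browsing time in seconds (implicit factor)
    D : download count (implicit factor)
    G : optional explicit grade; 2.5 is the neutral average grade
    """
    # Browsing-time term: every image gets the base contribution; images
    # viewed longer than the 15-second threshold gain extra importance.
    v = 0.30 * max(T / 15.0, 1.0)
    # Download term: universality and sensitivity are both high.
    v += 0.60 * D
    # Optional grade term: alpha rewards grades above 2.5 and
    # punishes grades below it, per Eq. (3).
    if G is not None:
        alpha = 1.0 if G > 2.5 else -1.0
        v += 0.10 * (abs(G - 2.5) / 2.0) * alpha
    return v
```

Omitting `G` simply drops the grade term, which matches the paper's point that grading is optional and most user effort can be saved.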
Since edge information is independent of image color, edge extraction is performed in a gray-level color space. First, the input color image is converted to a gray-level image, and the Sobel operator is applied to obtain the salient edges of the objects. The edges extracted by the Sobel operator cannot semantically separate foreground objects from the background; however, the general edges and directions are sufficient for our application, so no further refinement is needed. We then partition each image into N non-overlapping sub-regions. Within each sub-region, the proportion of edge pixels among all pixels is calculated. To control the scale of edge similarity, the number of sub-regions N can be adjusted according to the image resolution and user requirements; in general, the measurement becomes stricter as N increases. We use N = 256 in our experiments. A formalized example of edge extraction is given below. Assume the input image is I and its gray-level transformation is I_gray. We apply the Sobel operator to generate an image R = Sobel(I_gray) that contains the edge information of I_gray. R is a binary image whose white pixels represent edges and whose black pixels represent background. We then partition R into 256 non-overlapping sub-regions arranged as 16 rows by 16 columns. Since few images have horizontal and vertical resolutions that are exact integer multiples of 16, we take the integer part of the division and discard the marginal remainder when partitioning R; empirically, the marginal parts contribute little to the image content. Finally, we calculate the proportion of white pixels among the total pixels of each sub-region, obtaining a 256-dimension edge vector for I.

3.3.3 Texture Feature Extraction

The Gabor filter approach has been successfully applied to a broad range of image processing tasks.
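The edge descriptor of Section 3.3.2 can be sketched directly in numpy. This is a minimal sketch that takes an already grayed image; the Sobel kernels are written as padded shifts, and the binarization threshold is our own choice, since the paper does not state one:

```python
import numpy as np

def edge_feature(img, grid=16, thresh=0.25):
    """Sketch of the Section 3.3.2 edge descriptor.

    img    : 2-D float array, a gray-level image with values in [0, 1]
    grid   : the image is cut into grid x grid sub-regions (16 x 16 = 256)
    thresh : gradient-magnitude binarization threshold (our assumption)
    Returns a grid*grid vector of edge-pixel proportions.
    """
    # Sobel gradients via padded shifts (no external image library needed).
    p = np.pad(img, 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    edges = np.hypot(gx, gy) > thresh          # binary edge map R

    # Discard the marginal remainder so the image divides evenly by 16,
    # as the paper describes.
    h, w = edges.shape
    ch, cw = h // grid, w // grid
    edges = edges[:ch * grid, :cw * grid]

    # Proportion of edge (white) pixels inside each sub-region.
    blocks = edges.reshape(grid, ch, grid, cw)
    return blocks.mean(axis=(1, 3)).ravel()    # 256-dimension vector
```

For a 64 x 64 image with a single vertical step edge, only the two grid columns straddling the edge receive non-zero proportions, which matches the descriptor's locality.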
In particular, Gabor filters have proved appropriate for texture representation and analysis. In our method, we employ a 2D Gabor filter [11] to extract the texture feature of an image. As in edge extraction, the input image is preprocessed by graying and partitioning. For an input image I, its gray-level transformation I_gray is partitioned into several overlapping equal-sized regions. The region size in our experiment is 48 * 48 pixels, and the overlap of adjacent regions in the horizontal direction is 24 pixels. We then apply the Gabor filter to obtain a 48-dimension texture feature vector for each region. The texture feature requires special treatment in measurement because it is independent of position but closely related to size: two images are considered similar in texture feature space when large parts of them look alike, without any constraint that those parts occupy consistent positions. Therefore, instead of concatenating the texture feature vectors of all regions, we select a limited number of representative regions and rearrange their texture feature vectors. Assuming that I_gray is divided into N regions, the initial texture feature matrix has size 48 * N. To reduce redundancy, we select the 32 of the N regions that represent the most important parts of the image. The K-means method is employed to find these parts. First, the texture feature vectors of the N regions are clustered into several groups by K-means. Then, from each group we select a certain number of regions close to the cluster center; the number of selected regions is determined by the ratio between the group's region count and N. Thus, for each image, we generate a 32 * 48-dimension texture feature vector.

IV. EXPERIMENTS AND RESULTS

To demonstrate the validity of the proposed method, we evaluate the user-driven model by classification performance under different parameter combinations. We prepare two groups of image sets for the experiments. The images are searched and downloaded from Google Images for the specified keywords and then loaded into our simulated system for the remaining operations. In the first group, we intend to find pictures about Apple Company using the keyword "apple".
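Stepping back to Section 3.3.3, the proportional region selection can be sketched as follows. The tiny Lloyd-iteration K-means, the cluster count k = 4 and the largest-remainder rounding rule are our own stand-ins for details the paper leaves unspecified:

```python
import numpy as np

def select_regions(feats, keep=32, k=4, iters=20, seed=0):
    """Pick `keep` representative region vectors (sketch of Section 3.3.3).

    feats : (N, 48) array, one texture feature vector per region
    Each cluster contributes a share of the kept regions proportional to
    its size; within a cluster, the regions nearest the center are chosen.
    """
    rng = np.random.default_rng(seed)
    n = len(feats)
    centers = feats[rng.choice(n, size=k, replace=False)]
    for _ in range(iters):                        # plain Lloyd iterations
        d = np.linalg.norm(feats[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = feats[labels == j].mean(axis=0)

    # Quota per cluster, proportional to cluster size (largest remainder).
    sizes = np.bincount(labels, minlength=k)
    quota = np.floor(keep * sizes / n).astype(int)
    order = np.argsort(-(keep * sizes / n - quota))
    for j in order[: keep - quota.sum()]:
        quota[j] += 1

    # Within each cluster, keep the regions nearest the cluster center.
    chosen = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        dist = np.linalg.norm(feats[members] - centers[j], axis=1)
        chosen.extend(members[np.argsort(dist)[: quota[j]]])
    return feats[np.sort(chosen)]                 # (keep, 48) matrix
```

With 64 regions split 40/24 between two texture types, the proportional quota keeps 20 regions of the first type and 12 of the second, matching the ratio rule described above.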
In the second group, we use the keywords "beach" and "sunset" to find scenery pictures. The sizes of the Apple data and the Beach&Sunset data are 200 and 400 images, respectively. In our experiments, no more than 50% of the images are used as training data, with the rest used as testing data; there is no overlap between training and testing data. Parts of the data used in training and testing are shown in Fig. 2 and Fig. 3.

The performance of the proposed method is evaluated by four measures: classification accuracy, precision, recall and user satisfaction. Accuracy indicates how many testing images are classified into the correct group by the trained preference classifier. Precision and recall are measures widely used in classification and information retrieval. Assume a given testing set contains N images in total, of which N1 are satisfied images and N2 are unsatisfied images, and that the trained classifier correctly recognizes M1 satisfied images and M2 unsatisfied images, while wrongly recognizing M3 images as satisfied and M4 images as unsatisfied. The classification accuracy γ, precision τ and recall υ are then defined as:

γ = (M1 + M2) / N
τ = M1 / (M1 + M3)    (4)
υ = M1 / (M1 + M4)

User satisfaction is the ratio of user-preferred images among the displayed top 100 images. It is reasonable to use only part of the images in this evaluation, because users usually pay more attention to the first hundred results and lose patience browsing the rest. In the simulated system, the user satisfaction before reordering is 0.3100 for the Apple data and 0.2625 for the Beach&Sunset data. To evaluate the proposed method, we organize the experiments into three cases considering the influence of the training parameters: the size of the training set, noise in the training and the feature combination used in the training.
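The four counts and the measures of Eq. (4) can be sketched directly; the argument names follow the notation above:

```python
def evaluate(m1, m2, m3, m4):
    """Classification measures of Eq. (4).

    m1 : satisfied images correctly recognized as satisfied
    m2 : unsatisfied images correctly recognized as unsatisfied
    m3 : unsatisfied images wrongly recognized as satisfied
    m4 : satisfied images wrongly recognized as unsatisfied
    """
    n = m1 + m2 + m3 + m4               # all N testing images
    accuracy = (m1 + m2) / n            # gamma
    precision = m1 / (m1 + m3)          # tau
    recall = m1 / (m1 + m4)             # upsilon, since N1 = m1 + m4
    return accuracy, precision, recall
```

For example, with counts (40, 30, 20, 10) the measures come out to 0.7, 2/3 and 0.8.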
Experiments on Training Size. We define the training size as the number of samples in the positive training data; the negative training data contains the same number of samples. To evaluate the impact of training data size, we perform training and classification on both datasets with different training sizes. The feature used in training is selected according to the user's classification standard in data preparation. For instance, edge information plays an important role in the Apple data because the users intend to find the trademark of Apple Company, not iPhones or iPads. Similarly, the Beach&Sunset data elicit more consistent responses in the color and texture features. The results in Fig. 4 indicate that classification accuracy and reliability rise as the training scale increases. When the training data size increases from 20 to 70, all four measures improve to varying degrees: classification accuracy improves by 9 percent for color feature training, 11 percent for edge feature training and 12 percent for texture feature training. However, there is an obvious decline when the size grows from 70 to 80, which indicates that there is a best training size for a given scale of data; a large training size may lead to overtraining by introducing sample noise. We take 50% of the images as the best training size for the positive and negative sets.

Fig. 2 Apple image set: (a) part of the testing data; (b) part of the negative training data; (c) part of the positive training data
Fig. 3 Beach&Sunset image set: (a) part of the testing data; (b) part of the negative training data; (c) part of the positive training data

Experiments on Noise Training. To evaluate the robustness of our method, we test classification performance after mixing 10% noisy images into the positive training set. The training features are selected by the same principle as above, and the training size is set to 70. The results in Fig. 5 show that noise-free training outperforms noisy training. Among the three features, the color feature is the most sensitive to noise, with an accuracy gap of up to 10 percent, while the edge and texture features show only 2 percent gaps on average. Noting the dimension difference between the color feature and the other two, these results support the conclusion that more feature dimensions can reduce the undesirable influence of noisy training. In contrast, user satisfaction is less affected by color noise than by noise in the other two features, which suggests that users have the lowest visual tolerance in color.

Experiments on Feature Combination. To evaluate the effectiveness of the features used in training, we design experiments with different feature combinations.
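With three features there are 2^3 - 1 = 7 non-empty combinations, which can be enumerated directly:

```python
from itertools import combinations

features = ["color", "edge", "texture"]
# All non-empty subsets: 3 single, 3 double and 1 triple combination.
combos = [c for r in range(1, len(features) + 1)
          for c in combinations(features, r)]
for combo in combos:
    print("+".join(combo))
```

This enumeration corresponds to the seven feature-combination cases evaluated below (single-feature, double-feature and all-feature training).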
Considering single-feature, double-feature and all-feature training, there are seven possible combinations of the three features used in our method. We run training and classification for each feature combination on both experimental datasets. In this experiment, the training data is free of noise and the training size remains 70. The comparisons in Fig. 6 show quite different results for the two image sets. For the Beach&Sunset data, multi-feature training always outperforms single-feature training; in particular, the combination of color and texture gives the best results, while the edge feature tends to have a generally negative effect across feature combinations. By contrast, the edge feature plays the most crucial role in improving performance for the Apple data. These results are consistent with the users' principle in preparing the experimental ground truth, as mentioned in the first experiment.

V. CONCLUSIONS

In this paper, we propose a user-driven model that combines explicit grading and implicit feedback for content-based image retrieval. Our model estimates a preference degree for browsed images and then generates positive and negative training data. A preference classifier is trained with an SVM and used to re-rank a given image collection so that the retrieved results are as close to users' expectations as possible. Our experimental results show that the proposed method achieves better user satisfaction, and it has the potential to be developed further into a post-processing tool for image retrieval applications. Future work will address the reduction of feature vectors and further discussion of feature effectiveness.

ACKNOWLEDGMENT

The authors wish to acknowledge the financial support for this research from the Key Projects in the Science and Technology Pillar Program of Tianjin under Grant No. 11ZCKFGX01200.

Fig. 4 Experimental results on training size: (a) color training for Beach&Sunset data; (b) edge training for Apple data; (c) texture training for Beach&Sunset data
Fig. 5 Experimental results on noisy training: (a) color training for Beach&Sunset data; (b) edge training for Apple data; (c) texture training for Beach&Sunset data
Fig. 6 Experimental results on feature combination training: (a) Beach&Sunset training without noise; (b) Apple training without noise. Feature key: 1 color, 2 edge, 3 texture, 4 color+edge, 5 color+texture, 6 edge+texture, 7 all

REFERENCES

[1] Datta, R., Joshi, D., Li, J. and Wang, J.Z. "Content-based image retrieval: approaches and trends of the new age." Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval: 253-262, 2005.
[2] Datta, R., Joshi, D., Li, J. and Wang, J.Z. "Image retrieval: Ideas, influences, and trends of the new age." ACM Computing Surveys 40(2): 5, 2008.
[3] Su, Z., Zhang, H., Li, S. and Ma, S. "Relevance feedback in content-based image retrieval: Bayesian framework, feature subspaces, and progressive learning." IEEE Transactions on Image Processing 12(8): 924-937, 2003.
[4] Cheng, E., Jing, F. and Zhang, L. "A unified relevance feedback framework for web image retrieval." IEEE Transactions on Image Processing 18(6): 1350-1357, 2009.
[5] Kelly, D. and Belkin, N.J. "Reading time, scrolling and interaction: exploring implicit sources of user preferences for relevance feedback." Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 408, 2001.
[6] Nichols, D.M. "Implicit ratings and filtering." Proceedings of the 5th DELOS Workshop on Filtering and Collaborative Filtering, Hungary: 31-36, 1997.
[7] Auer, P., Hussain, Z., Kaski, S., Klami, A., Kujala, J., Laaksonen, J., Leung, A.P., Pasupa, K. and Shawe-Taylor, J. "Pinview: Implicit feedback in content-based image retrieval." JMLR Workshop and Conference Proceedings 11: 51-57, 2010.
[8] Faro, A., Giordano, D., Pino, C. and Spampinato, C. "Visual attention for implicit relevance feedback in a content based image retrieval." Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications: 73-76, 2010.
[9] Morita, M. and Shinoda, Y. "Information filtering based on user behavior analysis and best match text retrieval." Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 272-281, 1994.
[10] Konstan, J.A., Miller, B.N., Maltz, D., Herlocker, J.L., Gordon, L.R. and Riedl, J. "GroupLens: applying collaborative filtering to Usenet news." Communications of the ACM 40(3): 77-87, 1997.
[11] Manjunath, B.S. and Ma, W.Y. "Texture features for browsing and retrieval of image data." IEEE Transactions on Pattern Analysis and Machine Intelligence 18(8): 837-842, 1996.