IEEE TRANSACTIONS ON IMAGE PROCESSING 1. A Unified Relevance Feedback Framework for Web Image Retrieval En Cheng, Feng Jing, and Lei Zhang


Abstract: Although relevance feedback (RF) has been extensively studied in the content-based image retrieval community, no commercial Web image search engine supports RF because of scalability, efficiency, and effectiveness issues. In this paper, we propose a unified relevance feedback framework for Web image retrieval. The framework has advantages over traditional RF mechanisms in three aspects. First, during the RF process, textual and visual features are used sequentially. To seamlessly combine textual feature-based RF and visual feature-based RF, a query concept-dependent fusion strategy is learned automatically. Second, the textual feature-based RF mechanism employs an effective search result clustering (SRC) algorithm to obtain salient phrases, from which we construct an accurate and low-dimensional textual space for the resulting Web images. Thus, RF can be integrated into Web image retrieval in a practical way. Last, a new user interface (UI) is proposed to support implicit RF. On the one hand, unlike a traditional RF UI, which forces users to make explicit judgments on the results, the new UI treats the users' click-through data as implicit relevance feedback in order to relieve the users of this burden. On the other hand, unlike a traditional RF UI, which simply substitutes subsequent results for previous ones, a recommendation scheme is used to help users better understand the feedback process and to mitigate the possible waiting caused by RF. Experimental results on a database of nearly three million Web images show that the proposed framework is wieldy, scalable, and effective.

Index Terms: Implicit feedback, relevance feedback (RF), search result clustering, Web image retrieval.

Manuscript received May 22, 2007; revised February 10, [...]. This work was performed at Microsoft Research Asia, Beijing, China. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Eli Saber. E. Cheng is with Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH USA (en.cheng@case.edu). F. Jing is with Tencent Research Center, Beijing, China (scenery.jf@gmail.com). L. Zhang is with Microsoft Research Asia, Beijing, China (leizhang@microsoft.com). Color versions of one or more of the figures in this paper are available online at [...]. Digital Object Identifier [...]/TIP[...]

I. INTRODUCTION

With the explosive growth of both the World Wide Web and the number of digital images, there is an increasingly urgent need for effective Web image retrieval systems. Most popular commercial search engines, such as Google [1], Yahoo! [2], and AltaVista [3], support image retrieval by keywords. There are also commercial search engines dedicated to image retrieval, e.g., Picsearch [4]. A common limitation of most existing Web image retrieval systems is that their search process is passive, i.e., it disregards the informative interactions between users and retrieval systems. An active system should bring the user into the loop so that personalized results can be provided for the specific user. To be active, the system can take advantage of relevance feedback techniques.

Relevance feedback, originally developed for information retrieval [5], is an online learning technique that aims to improve the effectiveness of the information retrieval system. The main idea of relevance feedback is to let the user guide the system. During the retrieval process, the user interacts with the system and rates the relevance of the retrieved documents according to his/her subjective judgment. With this additional information, the system dynamically learns the user's intention and gradually presents better results.
Since its introduction to image retrieval in the mid-1990s, relevance feedback has attracted tremendous attention in the content-based image retrieval (CBIR) community and has been shown to provide dramatic performance improvement [6]. However, no commercial Web image search engine supports relevance feedback because of usability, scalability, and efficiency issues.

Note that the textual features, on which most commercial search engines depend, are extracted from the file name, ALT text, URL, and surrounding text of the images. The usefulness of textual features is demonstrated by the popularity of the currently available Web image search engines. However, directly using the textual information to construct the textual space leads to time-consuming computation, and the performance suffers from noisy terms. Since the user interacts with the search engine in real time, the relevance feedback mechanism should be sufficiently fast and, if possible, avoid heavy computations over millions of retrieved images. To integrate relevance feedback into Web image retrieval in a practical way, an efficient and effective mechanism is required for constructing an accurate and low-dimensional textual space with respect to the resulting Web images.

Although all existing commercial Web image retrieval systems depend solely on textual information, Web images are characterized by both textual and visual features. With effective use of textual features, image retrieval benefits greatly from mature text retrieval techniques. However, as the proverb "a picture is worth one thousand words" suggests, the textual representation of an image is always insufficient compared to the visual content of the image itself. Therefore, visual features are required for a finer granularity of image description.
Considering the characteristics of both textual and visual features, it is reasonable to conclude that RF in textual space can guarantee relevance, while RF in visual space can meet the need for finer granularity. Thus, it is meaningful to introduce a unified relevance feedback framework for Web image retrieval that seamlessly combines textual feature-based RF and visual feature-based RF in a sequential way.

To strengthen the proposed framework, we employ implicit feedback to overcome the limitation of explicit feedback techniques, which place an increased cognitive burden on users. Unlike explicit feedback, implicit feedback can be collected at much lower cost, in much larger quantities, and without burdening users. As one of the most effective kinds of implicit feedback, click-through data has been used either as absolute relevance judgments [7] or as relative relevance judgments [8] in text retrieval. Fortunately, image retrieval has two characteristics that distinguish it from text retrieval. First, the thumbnail of an image conveys more information than the title and snippet of a Web page, so click-through information in image retrieval tends to be less noisy than in text retrieval. Second, unlike a textual document, the content of an image can be taken in at a glance. As a result, the user is likely to click more results in image retrieval than in text retrieval. Both characteristics imply that click-through data can be helpful for image retrieval.

In this paper, we propose a unified relevance feedback framework for Web image retrieval. The paper makes three main contributions.

A dynamic multimodal fusion scheme is proposed to seamlessly combine textual feature-based RF (TBRF) and visual feature-based RF (VBRF). More specifically, a TBRF algorithm is first used to quickly select a possibly relevant image set. Then, a VBRF algorithm is combined with the TBRF algorithm to further re-rank the resulting Web images. The fusion of VBRF and TBRF is query concept-dependent and automatically learned.

The textual feature-based RF mechanism employs an effective search result clustering (SRC) algorithm to obtain salient phrases, from which we construct an accurate and low-dimensional textual space for the resulting Web images.
As a result, RF can be integrated into Web image retrieval in a practical way.

A new UI is proposed to support implicit RF. On the one hand, unlike a traditional RF UI, which forces users to make explicit judgments on the results, the new UI treats the user's click-through data as implicit relevance feedback in order to relieve the user of this burden. On the other hand, unlike a traditional RF UI, which simply substitutes subsequent results for previous ones, a recommendation scheme is used to help the user better understand the feedback process and to mitigate the possible waiting caused by RF.

The remainder of this paper is organized as follows. In Section II, we describe the dynamic multimodal fusion mechanism. SRC-based textual space construction is illustrated in Section III. The user interface is introduced in Section IV. Experimental results are presented and analyzed in Section V. Finally, we conclude and discuss future work in Section VI.

II. DYNAMIC MULTIMODAL FUSION

A. Image Representation

The images collected from several photo forum sites, e.g., photosig [9], have rich metadata such as title, category, the photographer's comment, and other people's critiques. These images constitute the evaluation dataset for the proposed relevance feedback framework. For example, one photo from photosig has the following metadata. To facilitate later reference to this photo, we denote it by $P$.

Title: early morning.
Category: landscape, nature, rural.
Comment: I found this special light one early morning in Pyrenees along the Vicdessos river near our house.
One of the critiques: wow I like this picture very much I guess the light has to do with everything the light is great on the snow and on the sky (strange looking sky by the way) greatly composed nice crafted border a beauty.

All the aforementioned metadata is used as the textual source for the textual space construction. There are two approaches to building the textual space in our work.
One straightforward approach is to use the above metadata directly to obtain the textual feature. The other uses the Search Result Clustering (SRC) algorithm to construct the textual space; it is described in detail in Section III. To represent the textual feature, the vector space model [10] with the TF-IDF weighting scheme is adopted. More specifically, the textual feature of an image $I$ is an $n$-dimensional vector given by

$$F_I = (w_1, w_2, \ldots, w_n) \tag{1.1}$$

$$w_i = tf_i \cdot \log\frac{N}{n_i} \tag{1.2}$$

where $F_I$ is the textual feature of an image $I$; $w_i$ is the weight of the $i$th term in $I$'s textual space; $n$ is the number of distinct terms in the textual space of all images; $tf_i$ is the frequency of the $i$th term in $I$'s textual space; $N$ is the total number of images; and $n_i$ is the number of images whose metadata contains the $i$th term.

To illustrate the straightforward approach, in which all metadata is used to construct the textual space, we use the photo $P$ introduced at the beginning of this section as an example. Given the query "early morning", we have 151 resulting images, including photo $P$. From those resulting images, we collect all distinct terms from the metadata, which yields 358 distinct terms in total. $P$ itself has 48 distinct terms: early, morning, landscape, nature, rural, I, found, this, special, light, one, in, Pyrenees, along, the, Vicdessos, river, near, our, house, wow, like, picture, very, much, guess, has, to, do, with, everything, is, great, on, snow, and, sky, strange, looking, by, way, greatly, composed, nice, crafted, border, a, and beauty. Given $N = 151$, $n = 358$, and the 48 distinct terms of $P$, we can calculate $tf_i$ and $n_i$ for each distinct term with respect to $P$. We then obtain $w_i$ according to (1.2) and, finally, the textual feature $F_P$ of $P$ according to (1.1).
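The TF-IDF weighting described above can be illustrated with a minimal sketch. It assumes the metadata of each image has already been tokenized into a list of terms, and it uses the natural logarithm; the text does not state the base. The function name `tfidf_features` and the three toy documents are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_features(docs):
    """Return (vocabulary, vectors) for a list of token lists.

    Each vector follows w_i = tf_i * log(N / n_i), where N is the number
    of images and n_i the number of images containing term i.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_images = len(docs)
    # n_i: number of images whose metadata contains the term at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf_i: term frequency within this image's metadata
        vectors.append(
            [tf[term] * math.log(n_images / doc_freq[term]) for term in vocab]
        )
    return vocab, vectors


# Toy metadata for three images (hypothetical, already tokenized).
docs = [
    ["early", "morning", "light", "river"],
    ["early", "morning", "snow", "snow"],
    ["early", "sunset", "river"],
]
vocab, vectors = tfidf_features(docs)
```

In this toy example the term "early" occurs in every image, so its weight is zero everywhere, which shows why very common terms contribute nothing to the textual feature.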

To represent an image visually, a 64-dimensional feature [11] was extracted. It is a combination of three features: six-dimensional color moments [12], a 44-dimensional banded auto-correlogram [12], and 14-dimensional color texture moments [14]. For color moments, the first two moments from each channel of the CIE-LUV color space were extracted. For the correlogram, the HSV color space with inhomogeneous quantization into 44 colors is adopted [11]. For color texture moments, we apply templates derived from the local Fourier transform to the original image and obtain characteristic maps, each of which characterizes a certain aspect of the original image. As with color moments, we calculate the first and second moments of the characteristic maps, which represent the color texture information of the original image. The resulting visual feature of an image is a 64-dimensional vector. Each feature dimension is normalized to [0, 1] using Gaussian normalization for the convenience of further computation.

B. RF in Textual Space

To perform RF in textual space, Rocchio's algorithm [5] is used. The algorithm was developed in the mid-1960s and has proven to be one of the most effective RF algorithms in information retrieval. The key idea of Rocchio's algorithm is to construct a so-called optimal query so that the difference between the average score of a relevant document and the average score of a nonrelevant document is maximized. Cosine similarity is used to calculate the similarity between an image and the optimal query.
Since only clicked images are available in our proposed framework, we assume clicked images to be relevant and define the feature of the optimal query as follows:

$$Q_{opt} = Q_0 + \alpha \frac{1}{N_R} \sum_{I \in \mathrm{Rel}} F_I - \beta \frac{1}{N_N} \sum_{J \in \mathrm{Non\text{-}Rel}} F_J \tag{1.3}$$

where $Q_0$ is the vector of the initial query; $F_I$ is the vector of a relevant image; $F_J$ is the vector of a nonrelevant image; Rel is the relevant image set; Non-Rel is the nonrelevant image set; $N_R$ is the number of relevant images; $N_N$ is the number of nonrelevant images; $\alpha$ is the parameter controlling the relative contribution of relevant images and the initial query; and $\beta$ is the parameter controlling the relative contribution of nonrelevant images and the initial query. In our case, only relevant images are available, so we set $\alpha$ to 1 and $\beta$ to 0 in our experiments. Although Rocchio's algorithm is used currently, any vector-based RF algorithm could be used in the unified framework.

C. RF in Visual Space

To perform RF in visual space, Rui's algorithm [15] is used. Assuming clicked images to be relevant, both an optimal query and feature weights are learned from the clicked images. More specifically, the feature vector of the optimal query is the mean of the feature vectors of all clicked images. The weight of a feature dimension is proportional to the inverse of the standard deviation of the feature values of all clicked images [15]. Weighted Euclidean distance is used to calculate the distance between an image and the optimal query. Although Rui's algorithm is used currently, any RF algorithm that uses only relevant images could be used in the unified framework.

D. Dynamic Multimodal Fusion

There has been some work on the fusion of relevance feedback in different feature spaces [16]-[18]. A straightforward and widely used strategy is linear combination [16], [17]. Nonlinear combination using a support vector machine (SVM) was proposed in [18]. Since the super-kernel fusion algorithm [18] needs irrelevant images, it is unsuitable for systems that offer only relevant images.
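The two single-modality feedback steps of Sections II-B and II-C can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the small constant `eps` (which avoids division by zero when a dimension has zero deviation) and the normalization of the weights to sum to one are choices made here, and the text only states that weights are proportional to the inverse standard deviation.

```python
import math


def rocchio_relevant_only(q0, relevant, alpha=1.0):
    """Optimal textual query using relevant (clicked) images only.

    Corresponds to Rocchio's formula with the nonrelevant term dropped
    (beta = 0): Q_opt = Q_0 + alpha * mean(relevant vectors).
    """
    n = len(relevant)
    return [q + alpha * sum(v[i] for v in relevant) / n for i, q in enumerate(q0)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def visual_query_and_weights(clicked, eps=1e-6):
    """Optimal visual query (mean of clicked features) and per-dimension
    weights proportional to 1 / standard deviation.

    eps and the sum-to-one normalization are assumptions of this sketch.
    """
    n, dim = len(clicked), len(clicked[0])
    mean = [sum(v[j] for v in clicked) / n for j in range(dim)]
    std = [
        math.sqrt(sum((v[j] - mean[j]) ** 2 for v in clicked) / n)
        for j in range(dim)
    ]
    raw = [1.0 / (s + eps) for s in std]
    total = sum(raw)
    return mean, [w / total for w in raw]


def weighted_euclidean(x, q, weights):
    """Weighted Euclidean distance between a feature vector and the query."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, q)))


# Textual space: initial query plus two clicked images (toy vectors).
q_opt = rocchio_relevant_only([1, 0, 0], [[1, 1, 0], [1, 0, 1]])

# Visual space: the clicked images agree on dimension 1, so it gets more weight.
q_vis, w_vis = visual_query_and_weights([[0.2, 0.5], [0.4, 0.5]])
```

The inverse-deviation weighting rewards dimensions on which the clicked images agree: in the toy example the second dimension is identical across clicks and therefore dominates the distance.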
Since textual features are more semantic-oriented and more efficient than visual features, while visual features have finer descriptive granularity than textual features, we combine the RF in both feature spaces in a sequential way. The flowchart of the RF of our unified framework is shown in Fig. 1. First, RF in textual space is performed to rank the initial resulting images using the optimal query learned in (1.3). Then, RF in visual space is performed to re-rank the top $K$ images. The re-ranking process is based on a dynamic linear combination of the RF in both visual and textual spaces. Note that restricting the re-ranking to the top $K$ images has two advantages. First, the relevance of the top $K$ images is guaranteed by the preceding RF in textual space. Second, the efficiency of the RF process is ensured, since RF in visual space could be inefficient on a very large image set. The number $K$ of top images, which affects both the efficiency and the effectiveness of the RF process, is determined experimentally.

Fig. 1. Flowchart of the RF of the unified framework.

The re-ranking process is based on a dynamic multimodal fusion of the RF in visual and textual spaces. The combination weights that reflect the relative contribution of both spaces are automatically learned and query concept-dependent. Assume there are $m$ clicked images. The similarity metric used to re-rank a top image $I$ using RF in both visual and textual spaces is defined by (1.4)-(1.8), of which the first two are

$$S(I) = \lambda\, S_V(I) + (1 - \lambda)\, S_T(I) \tag{1.4}$$

$$\lambda = a \exp(-b\,\sigma) \tag{1.5}$$

with $\sigma$, $S_V$, and $D$ given by (1.6)-(1.8), where $S(I)$ is the similarity metric in both visual and textual spaces; $S_V(I)$ is the similarity between $I$'s visual feature and $Q_V$; $S_T(I)$ is the cosine similarity between $I$'s textual feature and $Q_{opt}$; $\lambda$ is the dynamic linear combination parameter for the similarity metric in both visual and textual spaces; $a$ and $b$ are parameters that control the relative contribution of RF in visual space; $\sigma$ is the deviation of the clicked images in visual space; $V_i$ is the visual feature vector of the $i$th clicked image; $Q_V$ is the feature vector of the optimal query in visual space; and $D(I, Q_V)$ is the weighted Euclidean distance between $I$'s visual feature and $Q_V$.

Note that $\lambda$ in (1.4) tunes the visual feature's contribution to the overall similarity metric according to the query concept. According to (1.5), $a$ controls the overall contribution of RF in visual space, and $b$ fine-tunes the contribution. If the query concept is well characterized by visual features, the clicked images should be visually consistent, and $\sigma$ will be small (near 0). According to (1.5), $\lambda$ will then be large, so the visual feature will be important. This is consistent with our intuition. Since $\sigma$ is query concept-dependent, the resulting combination parameter $\lambda$ is query concept-dependent as well. This property of the parameter results in a query concept-dependent fusion strategy for relevance feedback in both textual and visual space.

III. SRC-BASED TEXTUAL SPACE CONSTRUCTION

To construct an accurate and low-dimensional textual space for the resulting Web images, we use the SRC algorithm proposed in [19]. Its authors reformulate the clustering problem as a salient phrase ranking problem. Given a query and the ranked list of search results, the algorithm first parses the whole list of titles and snippets, extracts all possible phrases ($n$-grams) from the contents, and calculates five properties for each phrase. The five properties are phrase frequency/inverted document frequency (TFIDF), phrase length (LEN), intra-cluster similarity (ICS), cluster entropy (CE), and phrase independence (IND). The five properties are supposed to be related to the salience score of phrases. In our case, the comment and critiques are regarded as snippets. In the following, the current phrase (an $n$-gram) is denoted as $w$, and the set of documents that contain $w$ as $D(w)$. The five properties are then given by (1.9)-(1.15), where $f$ represents frequency calculation.

Given the above five properties, we use a single formula to combine them and calculate a single salience score for each phrase. In our case, each term can be represented by a vector of its five property values, $x = (\mathrm{TFIDF}, \mathrm{LEN}, \mathrm{ICS}, \mathrm{CE}, \mathrm{IND})$. A regression model learned from previous training data is then applied to combine the five properties into a single salience score. According to [19], linear regression performs best when compared with logistic regression and support vector regression. Therefore, we choose the linear regression model in our experiments. The linear regression model postulates that

$$y = c_0 + \sum_{j=1}^{5} c_j x_j + \varepsilon \tag{1.16}$$

where $\varepsilon$ is a random variable with mean zero and each $c_j$ is a coefficient determined by the condition that the sum of the squared residuals is as small as possible. The phrases are ranked according to the salience score, and the top-ranked phrases are taken as salient phrases. The resulting salient phrases are used to construct the textual space, on which we use (1.1) and (1.2) to compute the textual feature.

IV. FRIENDLY USER INTERFACE

To make the best of the implicit feedback information, a new Web image search UI named MindTracer is proposed. MindTracer consists of two types of pages: the main page and the detail page. The main page is shown in Fig. 3 and the detail page in Fig. 4. The main page has three frames: the search frame, the recommendation frame, and the result frame. The search frame contains an edit box in which users type a query phrase. Only text-based queries are supported by MindTracer, since they are friendly and familiar to typical Web users. After a user submits a query to MindTracer, the thumbnails of the result images are shown in the result frame in five rows and four columns. Initially, no images are shown in the recommendation frame.
When the user clicks an image of interest in the result frame, the recommendation function is activated and the dynamic multimodal RF is carried out. As a result, a finer ranking of the initial results is obtained, and the top 20 recommended images are shown in the recommendation frame. The images roll iteratively in the recommendation window, with a scroll bar that the user can control manually. Along with the user's click-through, the corresponding original image is shown in a detail page. The detail page has two frames: the image frame and the snapshot frame. If the user clicks another image in the result frame or the recommendation frame, then, besides the aforementioned system reactions, the former recommended images are shown in the snapshot frame of the detail page in case the user wants more images from the former recommended image list. If the user clicks an image in the snapshot frame, the corresponding original image is shown in the image frame. Once the user is satisfied with the recommended results, he/she can click the "refine" button to move all the recommended images from the recommendation frame to the result frame. With the asynchronous scheme for refreshing the detail page and the recommendation frame of the main page, no extra waiting time is required to support the recommendation scheme.

Fig. 2. Function curves of $\exp(-b x)$ with different $b$.
Fig. 3. Main page of MindTracer.
Fig. 4. Detail page of MindTracer.
Fig. 5. Flowchart of the UI.

The available functions of MindTracer include query-based search, result recommendation, and result refinement. The query-based search is similar to that of currently available search engines. The result recommendation and refinement functions are the contributions of MindTracer. The recommendation function is activated by the user's click-through, since MindTracer regards the user's click-through as implicit relevance feedback.
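The click-triggered re-ranking behind the recommendation function, i.e., the sequential fusion of Section II-D, can be sketched as follows. Several parts of this sketch are assumptions made for illustration and are not confirmed by the text available here: the convex combination of visual and textual similarity, the exponential form of the combination weight, the use of the mean distance of the clicked images to the visual query as their deviation, and the mapping from distance to similarity. The names `fusion_weight` and `rerank_top_k` are hypothetical. The default values 0.25, 64, and 200 are the parameter settings reported in Section V-B.

```python
import math


def fusion_weight(sigma, a=0.25, b=64.0):
    """Combination weight that shrinks as the clicked images become
    visually inconsistent (assumed form: a * exp(-b * sigma))."""
    return a * math.exp(-b * sigma)


def rerank_top_k(text_sim, visual_dist, clicked_dist, k=200, a=0.25, b=64.0):
    """Rank by textual similarity, then re-rank only the top k images.

    text_sim:     image id -> cosine similarity to the textual optimal query
    visual_dist:  image id -> weighted distance to the visual optimal query
    clicked_dist: distances of the clicked images to the visual optimal query
    """
    # Assumption: deviation = mean distance of clicked images to the query.
    sigma = sum(clicked_dist) / len(clicked_dist)
    weight = fusion_weight(sigma, a, b)

    by_text = sorted(text_sim, key=text_sim.get, reverse=True)
    top, rest = by_text[:k], by_text[k:]

    def fused(image_id):
        # Assumption: map a distance to a similarity in (0, 1].
        visual_sim = 1.0 / (1.0 + visual_dist[image_id])
        return weight * visual_sim + (1.0 - weight) * text_sim[image_id]

    # Images outside the top k keep their textual order.
    return sorted(top, key=fused, reverse=True) + rest


text_sim = {"x": 0.9, "y": 0.8, "z": 0.7}
visual_dist = {"x": 1.0, "y": 0.0, "z": 0.0}
ranking = rerank_top_k(text_sim, visual_dist, clicked_dist=[0.0], k=2)
```

In the toy call the clicked images are perfectly consistent, so the weight takes its maximum value and image "y" overtakes "x" inside the re-ranking scope, while "z" stays outside the scope and keeps its textual rank.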
Besides result recommendation, result refinement is another useful function: it displays the complete results obtained from the multimodal RF procedure when the user is satisfied with the recommendation results and clicks the "refine" button. Since the user is satisfied and clicks the refine button on his/her own initiative, the relevance of the refined results can be assured. The flowchart of MindTracer is shown in Fig. 5.

V. EXPERIMENTAL RESULTS

A. Evaluation Dataset

To construct the evaluation dataset, approximately three million images were crawled from several photo forum sites, e.g., photosig [9]. To automatically evaluate the proposed SRC-based RF mechanism, an image subset was selected and manually labeled as follows. First, ten representative queries were chosen. Then, for each query, the key terms related to the top 20 images were identified. Finally, all resulting images of each query were manually annotated with the corresponding key terms. The key terms and number of resulting images for each query are shown in Table I. In total, there are 160 key terms.

TABLE I. Queries and corresponding key terms. The number within parentheses is the number of result images.

To simulate the interactions between a user and a Web image retrieval system, for each query, each related key term was selected in turn to represent the user's search intention. Images annotated with the selected term were considered relevant to that intention. For each term, five iterations of user-and-system interaction were carried out. The system first ranked the initial resulting images using the optimal query learned in (1.3) and selected the top $K$ images for RF in both textual and visual space. After re-ranking the top $K$ images using (1.4), the system examined the top 20 images to collect the relevant images, which were regarded as click-through data. Relevant images labeled in previous iterations were placed directly in the top ranks and excluded from the examination. Precision is used as the basic evaluation measure. When the top 20 images are examined and $R$ of them are relevant, the precision within the top 20 images is defined to be $R/20$.

B. Evaluation of RF Fusion

The proposed RF fusion strategy (TVRF) has three parameters that need to be determined: $a$, which controls the overall contribution of RF in visual space; $b$, which fine-tunes the contribution; and $K$, the scope within which the resulting images are re-ranked by the combination of the textual and visual similarities. Because $K$ is less correlated with $a$ and $b$, we first chose $K$ based on a simplified version of (1.4) obtained by constraining $b$ to 0, i.e., the combination weight reduces to $\lambda = a$.

Fig. 6. Performance of TVRF under different $a$ and $K$.
Fig. 7. Performance of TVRF under different $a$ and $b$.
We conducted a series of experiments by varying the overall visual weight $a$ from 0 to 1 (in steps of 0.05) and the re-ranking scope $K$ from 100 to 1000 (in steps of 100). Fig. 6 shows the detailed performance of TVRF under different $a$ and $K$. $K$ was finally set to 200, which corresponds to the best result. Then, we fixed $K$ to 200 and chose $a$ and the fine-tuning parameter $b$ simultaneously. We conducted another series of experiments by varying $a$ from 0 to 1 (in steps of 0.05) and $b$ from 1 to 256 (doubling at each step). Fig. 7 shows the detailed performance of TVRF under different $a$ and $b$. We chose $a = 0.25$ and $b = 64$ as the best parameters. To further validate whether the scope $K = 200$ is the best parameter, we fixed $a$ to 0.25 and $b$ to 64, and varied $K$ from 100 to 1000 again. The validation result showed that $K = 200$ still corresponds to the best performance, which confirms the aforementioned assumption that $K$ is less correlated with $a$ and $b$.

Four RF strategies were evaluated and compared: RF using the textual feature only (TBRF), RF using the visual feature only (VBRF), linear combination of the RF in the two feature spaces (LTVRF), and the proposed RF fusion strategy (TVRF). Fig. 8 shows the detailed RF performance of the four strategies for the ten representative queries and the average. The average precision of the four RF strategies is [...], [...], [...], and 0.883, respectively. From the result, it can be seen that TVRF performs the best among the four strategies because it is capable of effectively combining textual and visual features. Although LTVRF also combines both features, it performs even worse than TBRF for the queries "Eiffel Tower", "Pear", and "Rainbow", because it is not query dependent and lacks the fine-tuning capability. This shows that an inappropriate combination of textual and visual features can seriously degrade RF performance. The results also show that VBRF performs the worst, except for the query "Tulip", because visual features are still ineffective in capturing most of the textual query concepts.

Fig. 8. Performance of the four strategies.

C. Evaluation of SRC-Based RF

In our experiment, two RF strategies were evaluated and compared: traditional RF and the proposed SRC-based RF. Both use Rocchio's algorithm to construct a so-called optimal query. The difference lies in how the textual space for the resulting images is constructed. Traditional RF uses all terms present in the metadata to construct the textual space, while the SRC-based RF uses the SRC algorithm to obtain the salient phrases, from which the textual space is constructed. Fig. 9 shows the detailed RF performance of the two strategies for the ten representative queries and the average. The average precision of the traditional RF and the SRC-based RF is [...] and [...], respectively. From the result, it can be seen that the SRC-based RF clearly outperforms the traditional RF strategy. The main reason is that SRC effectively detects and removes unimportant or noisy words, so that the resulting feature reflects the user's search intention more precisely.

Fig. 9. Precision comparison of two RF strategies.
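The salient-phrase filtering that SRC-based RF relies on can be sketched as a linear scoring of the five phrase properties followed by a top-n cut. The coefficient values and the property values below are purely hypothetical placeholders; in the method described in Section III the coefficients come from a regression model learned on training data. The names `salience` and `salient_phrases` are likewise illustrative.

```python
def salience(props, coef, intercept=0.0):
    """Linear salience score: intercept + sum of coefficient * property."""
    return intercept + sum(coef[name] * props[name] for name in coef)


def salient_phrases(candidates, coef, top_n):
    """Rank candidate phrases by salience and keep the top_n of them."""
    ranked = sorted(
        candidates, key=lambda p: salience(candidates[p], coef), reverse=True
    )
    return ranked[:top_n]


# Hypothetical coefficients (a real system would learn these by regression).
coef = {"TFIDF": 0.2, "LEN": 0.1, "ICS": 0.3, "CE": 0.2, "IND": 0.2}

# Hypothetical property values for three candidate phrases.
candidates = {
    "early morning": {"TFIDF": 3.0, "LEN": 2, "ICS": 0.8, "CE": 0.5, "IND": 0.9},
    "the": {"TFIDF": 0.1, "LEN": 1, "ICS": 0.2, "CE": 1.5, "IND": 0.3},
    "snow": {"TFIDF": 1.5, "LEN": 1, "ICS": 0.6, "CE": 0.4, "IND": 0.7},
}
textual_space = salient_phrases(candidates, coef, top_n=2)
```

Only the retained phrases become dimensions of the textual space, which is what keeps the space low-dimensional and removes noisy terms such as stop words.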
Besides the performance comparison, the time cost of the two strategies is another factor worth analyzing. Given a query and a term, the time cost for completing five iterations of user-and-system interaction is recorded. Based on the sum of each term's time cost, we obtain the average time cost for each query. Fig. 10 shows the time cost of the two strategies for the ten representative queries and the average. According to Fig. 10, the average time cost of the traditional RF and the SRC-based RF is 1.87 and 0.59 s, respectively. From the result, it can be seen that the SRC-based RF mechanism is more efficient than the traditional RF.

Fig. 10. Efficiency comparison of two RF strategies.

D. Efficiency of TVRF

To evaluate the real-time performance of the proposed technique, the efficiency of the proposed RF fusion strategy (TVRF) is worth discussing as well. Since there are two textual RF mechanisms available in our work, we refer to the SRC-based TVRF as SRC-TVRF and compare TVRF with SRC-TVRF. Given a query and a term, the time cost for completing five iterations of user-and-system interaction is recorded. Based on the sum of each term's time cost, we obtain the average time cost for each query. Note that each query needs to run the SRC procedure only once, and the resulting textual space is suitable for all the related terms. Fig. 11 shows the time cost of TVRF and SRC-TVRF for the ten representative queries and the average. According to Fig. 11, the average time cost of TVRF and SRC-TVRF is 3.02 and [...] s, respectively. From the result, it can be seen that the SRC-TVRF mechanism is more efficient than TVRF. Therefore, SRC-TVRF is more practical for a real Web image retrieval system.
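The simulated user-and-system interaction used throughout this section (Section V-A) can be sketched as a short loop. This is a simplified reading of the protocol: previously clicked images are kept at the top of the list, precision is measured over the first 20 positions, and the re-ranking step is passed in as a function. The names `precision_at`, `simulate_feedback`, and `oracle_rerank` are hypothetical, and the oracle re-ranker exists only to make the example runnable; it is not one of the evaluated strategies.

```python
def precision_at(ranked, relevant, n=20):
    """Fraction of the first n ranked images that are relevant (R / n)."""
    return sum(1 for image in ranked[:n] if image in relevant) / n


def simulate_feedback(initial_ranking, relevant, rerank, iterations=5, n=20):
    """Run several feedback iterations and record precision after each view.

    rerank(ranking, clicked) must return a new ranking of the same images.
    """
    clicked = []
    ranking = list(initial_ranking)
    history = []
    for _ in range(iterations):
        # Images labeled in earlier iterations are placed in the top ranks.
        rest = [image for image in ranking if image not in clicked]
        shown = clicked + rest
        history.append(precision_at(shown, relevant, n))
        # Relevant images among the top n are treated as click-through data.
        new_clicks = [
            image for image in shown[:n] if image in relevant and image not in clicked
        ]
        clicked = clicked + new_clicks
        ranking = rerank(shown, clicked)
    return history


relevant = {0, 5, 25, 30}


def oracle_rerank(ranking, clicked):
    """Toy re-ranker: moves every relevant image to the front (stable sort)."""
    return sorted(ranking, key=lambda image: image not in relevant)


history = simulate_feedback(list(range(40)), relevant, oracle_rerank)
```

With the toy oracle, precision rises from 0.1 in the first iteration to 0.2 afterwards, since two relevant images start outside the top 20 and are pulled in by the first re-ranking.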

Fig. 11. Efficiency comparison of TVRF and SRC-TVRF.

VI. CONCLUSION

In this paper, we have presented a unified relevance feedback framework for Web image retrieval. During the RF process, both textual features and visual features are used in a sequential way. A dynamic multimodal fusion strategy is proposed to seamlessly combine the RF in textual space and that in visual space. To integrate RF into Web image retrieval in a practical way, the textual feature-based RF mechanism employs an effective search result clustering (SRC) algorithm to construct an accurate and low-dimensional textual space for the resulting Web images. Besides explicit relevance feedback, implicit relevance feedback, e.g., click-through data, can also be integrated into the proposed mechanism. In addition, a new user interface (UI) is proposed to support implicit RF. Experimental results on a database consisting of nearly three million Web images show that the proposed mechanism is wieldy, scalable, and effective.

REFERENCES

[1] Google Image Search, [Online]. Available:
[2] Yahoo Image Search, [Online]. Available: com/
[3] AltaVista Image Search, [Online]. Available: image/
[4] Picsearch Image Search, [Online]. Available:
[5] J. Rocchio, Relevance Feedback in Information Retrieval. Upper Saddle River, NJ: Prentice-Hall,
[6] X. S. Zhou and T. S. Huang, "Relevance feedback in image retrieval: A comprehensive review," ACM Multimedia Syst., vol. 8, no. 6, pp ,
[7] Q. K. Zhao, S. C. H. Hoi, T. Y. Liu, S. S. Bhowmick, M. R. Lyu, and W. Y. Ma, "Time-dependent semantic similarity measure of queries using historical click-through data," in Proc. 15th Int. Conf. World Wide Web, 2006, pp
[8] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay, "Accurately interpreting clickthrough data as implicit feedback," in Proc. 28th Annu. Int. ACM SIGIR Conf. Research and Development in Information Retrieval, 2005, pp
[9] Photosig, [Online]. Available:
[10] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Reading, MA: Addison-Wesley,
[11] L. Zhang, Y. X. Hu, M. J. Li, W. Y. Ma, and H. J. Zhang, "Efficient propagation for face annotation in family albums," in Proc. 12th Annu. ACM Int. Conf. Multimedia, 2004, pp
[12] M. Stricker and M. Orengo, "Similarity of color images," Proc. SPIE Storage and Retrieval for Image and Video Databases, pp ,
[13] J. Huang, S. R. Kumar, M. Mitra, W. J. Zhu, and R. Zabih, "Image indexing using color correlograms," in Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, 1997, pp
[14] H. Yu, M. J. Li, H. J. Zhang, and J. F. Feng, "Color texture moments for content-based image retrieval," in Proc. Int. Conf. Image Processing, 2002, pp
[15] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, "Relevance feedback: A power tool for interactive content-based image retrieval," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp , May
[16] F. Jing, M. J. Li, H. J. Zhang, and B. Zhang, "A unified framework for image retrieval using keyword and visual features," IEEE Trans. Image Process., vol. 14, no. 7, pp , Jul
[17] Y. Lu, C. Hu, X. Zhu, H. J. Zhang, and Q. Yang, "A unified framework for semantics and feature based relevance feedback in image retrieval systems," in Proc. 8th Annu. ACM Int. Conf. Multimedia, 2000, pp
[18] Y. Wu, E. Y. Chang, K. C. C. Chang, and J. R. Smith, "Optimal multimodal fusion for multimedia data analysis," in Proc. 12th Annu. ACM Int. Conf. Multimedia, 2004, pp
[19] H. J. Zeng, Q. C. He, Z. Chen, W. Y. Ma, and J. W. Ma, "Learning to cluster web search results," in Proc. 27th Annu. Int. ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp

En Cheng received the B.S. and M.S. degrees in computer science from Huazhong University of Science and Technology, Wuhan, China, in 2003 and 2006, respectively. She is currently pursuing the Ph.D. degree in the Electrical Engineering and Computer Science Department, Case Western Reserve University, Cleveland, OH. From 2005 to 2006, she was with Microsoft Research Asia, Beijing, China, as a visiting student. Her research interests include knowledge management, semantic web, information retrieval, image processing, pattern recognition, and bioinformatics.

Feng Jing received the B.S. and Ph.D. degrees in computer science from Tsinghua University, Beijing, China, in 2000 and 2005, respectively. He was with Microsoft Research Asia, Beijing, from 2005 to Then, he joined Tencent Research Center, Beijing. His research interests include image retrieval, image annotation, web mining, text understanding, and machine learning.

Lei Zhang received the B.S., M.S., and Ph.D. degrees in computer science from Tsinghua University, Beijing, China, in 1993, 1995, and 2001, respectively. After two years of working in industry, he returned to Tsinghua University and received his Ph.D. degree. Then, he joined Microsoft Research Asia, Beijing. His research interests include machine learning, web image search, information retrieval, and text mining.
