
Task-Dependent Visual-Codebook Compression

Rongrong Ji, Member, IEEE, Hongxun Yao, Member, IEEE, Wei Liu, Student Member, IEEE, Xiaoshuai Sun, Student Member, IEEE, and Qi Tian, Senior Member, IEEE

Abstract: A visual codebook serves as a fundamental component in many state-of-the-art computer vision systems. Most existing codebooks are built by quantizing local feature descriptors extracted from training images, after which each image is represented as a high-dimensional bag-of-words histogram. Such a highly redundant image description lacks efficiency in both storage and retrieval, since only a few bins are nonzero and they are distributed sparsely. Furthermore, most existing codebooks are built solely from the visual statistics of local descriptors, without considering the supervised labels coming from the subsequent recognition or classification tasks. In this paper, we propose a task-dependent codebook compression framework to handle these two problems. First, we learn a compression function that maps an originally high-dimensional codebook into a compact codebook while maintaining its visual discriminability. This is achieved by a codeword sparse coding scheme with Lasso regression, which minimizes the descriptor distortions of training images after codebook compression. Second, we adapt our codebook compression to the subsequent recognition or classification tasks. This is achieved by introducing a label constraint kernel (LCK) into our compression loss function. In particular, our LCK can model heterogeneous kinds of supervision, i.e., (partial) category labels, correlative semantic annotations, and image query logs. We validated our codebook compression in three computer vision tasks: 1) object recognition in PASCAL Visual Object Class 07; 2) near-duplicate image retrieval in UKBench; and 3) web image search in a collection of 0.5 million Flickr photographs. Our compressed codebook has shown superior performance over several state-of-the-art supervised and unsupervised codebooks.

Index Terms: Image retrieval, indexing, local feature, object classification, supervised quantization, visual codebook.

Manuscript received October 27, 2010; revised March 02, 2011 and June 03, 2011; accepted June 04, 2011. Date of publication November 22, 2011; date of current version March 21, 2012. This work was supported in part by the National Science Foundation of China (Key Program) and in part by the Natural Science Foundation of China. The work of Q. Tian was supported in part by NSF IIS, by the Faculty Research Awards of Google and FXPAL, and by NEC Laboratories of America. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mark Liao.

R. Ji and X. Sun are with the Department of Computer Science, Harbin Institute of Technology, Harbin, China. H. Yao is with the Department of Computer Science, Harbin Institute of Technology, Harbin, China (e-mail: H.Yao@hit.edu.cn). W. Liu is with the Department of Electrical Engineering, Columbia University, New York, NY 10027 USA. Q. Tian is with the Department of Computer Science, University of Texas at San Antonio, San Antonio, TX USA.

I. INTRODUCTION

VISUAL-CODEBOOK representations are widely used in many computer vision tasks such as visual search, object categorization, scene recognition, and video event detection.
In a typical scenario, given the local feature descriptors extracted from training images, a visual codebook is built by quantizing these descriptors into discrete codeword regions in the descriptor feature space. Each image is then represented as a bag-of-words (BoW) histogram, which is robust against photographing variances in viewpoint, scale, and occlusion.

One important issue of this visual-codebook representation is its dimensionality. Generally speaking, most existing codebooks require a very large number of codewords to be tuned optimally [1]-[5]. Such high dimensionality introduces obvious computational cost in both processing time (to match visual descriptors online) and storage space (to maintain the search model in memory). This is especially critical in the state-of-the-art mobile visual search scenario [6], [7], where mobile devices directly extract the BoW histogram and transmit it over the wireless link. Meanwhile, since the number of local features extracted from each image is limited (typically hundreds), each image is represented as a sparse histogram with few nonzero codewords. Moreover, due to the curse of dimensionality in the subsequent classifier training procedure, most object recognition systems expect the BoW histogram to be compact.

Motivated by these contradictions, our first goal is to compress the state-of-the-art high-dimensional codebooks [1], [3], [4], [8], [9] to improve their storage and retrieval efficiency while maintaining their visual discriminative capability. Our second goal is to introduce the supervised labels from the subsequent recognition or classification tasks to guide our codebook compression. Our inspirations are twofold. First, most existing visual codebooks [1], [3]-[5], [8]-[11] are built solely from the visual statistics of training images, without considering their semantic labels to improve discriminability. Second, recent works in supervised codebook learning [12]-[15] directly incorporate the supervised labels into the initial codebook building procedure, which is computationally burdensome and not scalable to new labels, new data sets, or new tasks (even within the same data set). In addition, by binding supervision to codebook building, the existing supervised codebook learning strategies cannot be reused among each other [12]-[15]. An alternative approach is to refine an initial codebook. However, existing works [16]-[18] are not sufficiently flexible and scalable to model heterogeneous kinds of supervised labels, such as correlative semantic labels and image query logs.

In this paper, we present a task-dependent visual-codebook compression framework to achieve the above goals. We introduce a supervised sparse coding model to learn a compression function, which maps an originally high-dimensional codebook into a compact basis dictionary. In this model, we integrate both visual discriminability and task-dependent semantic discriminability (from supervised labeling) into the Lasso-based regression cost [19] for compression function learning.

Fig. 1. Proposed task-dependent visual-codebook compression framework.

Using the learned basis dictionary, each original BoW histogram is nonlinearly transformed into a low-dimensional reconstruction parameter vector, which serves as the compressed histogram of the given image. This histogram is more compact and discriminative for the subsequent recognition classifier training, retrieval model indexing, or similarity ranking procedures. To integrate visual discriminability into our basis dictionary learning, we model codeword importance using term frequency and inverse document frequency [20], which are embedded as a weighting vector into the compression cost function. To integrate task-dependent discriminability into our basis dictionary learning, we introduce a label constraint kernel (LCK) to model the labeling consistency loss in the compression function. In particular, we show how to specify the LCK for different tasks even within an identical database, including: 1) (partial) category labels for object recognition; 2) correlative semantic annotations for image understanding; and 3) image query logs (obtained from ground-truth image ranking labeling) for web image ranking. Fig. 1 outlines the proposed framework.

II. RELATED WORK

Visual-codebook construction: A visual codebook is typically constructed by unsupervised vector quantization such as k-means clustering [8], [21], which subdivides the local feature space into discrete codeword regions. Such a division represents an image as a BoW histogram, in which each bin counts how many local features of the image fall into the corresponding codeword region. In recent years, numerous vector quantization approaches have been proposed to build visual codebooks, such as spatial pyramids [22], vocabulary trees (VTs) [1], approximate k-means [4], and their variations [3], [9]. In addition to direct quantization, hashing-based approaches are also well exploited in the literature, e.g., locality-sensitive hashing [23] and its kernelized version [10]. Recent works have also investigated codeword uncertainty and ambiguity using methods such as Hamming embedding [11], soft assignments [24], [25], and kernelized codebook histograms [5].

Supervised codebook construction: Rather than using solely visual content statistics, works in [12]-[15] proposed to integrate semantic labels to supervise the codebook construction. For instance, Mairal et al. [12] used category-aware sparse coding to build a supervised vocabulary for object categorization. Lazebnik et al. [13] minimized mutual information loss to build a supervised codebook from a fully labeled local feature set. Moosmann et al. [14] built supervised indexing trees using an ERC-Forest that treats semantic labels as stopping tests. Ji et al. [15] adopted a hidden Markov random field to use correlative web labeling for supervised vocabulary construction. In image and video compression, there are also similar works on learning-based vector quantization [26]-[28]; methods such as self-organizing maps [27] and regression loss minimization [28] are utilized to reconstruct the original input signals while minimizing visual quantization distortions. Finally, there are also works on learning visual parts [29], [30] from images of identical categories by clustering local patches with spatial configurations.
Learning-based codeword refinement: Rather than directly supervising the codebook building procedure, works in [16]-[18] and [31] proposed to merge or split the initial codewords for the subsequent classifier training. For instance, Perronnin et al. [16] integrated category labels to adapt an initial vocabulary into several class-specific vocabularies. Winn et al. [31] also learned class-specific vocabularies from an initial vocabulary by merging codeword pairs, in which the codeword distribution was modeled by a Gaussian mixture model. Although the works in [16]-[18] worked well for limited category numbers, these methods cannot be scaled up to general scenarios that contain numerous and correlative category labels. To a certain degree, topic models [e.g., probabilistic latent semantic analysis (plsa) [32] and latent Dirichlet allocation (LDA) [33]] can also be treated as unsupervised codeword refinements, which work in a generative manner constrained by hyperparameters.

Sparse coding: Sparse coding theory has recently been well investigated for effective and efficient high-dimensional signal representation and compression [19], [34], [35]. Its main idea is to represent a signal as a linear combination of sparse bases from an overcomplete dictionary. Although the exact recovery of this dictionary is NP-hard, a sufficiently sparse linear representation can be efficiently approximated by convex optimization [34]. For instance, Tibshirani [19] proposed the Lasso regression strategy, which achieves coefficient sparsity by adding an $\ell_1$-norm penalty to the regression loss. Works in [12] and [36] also adopted sparse coding to directly build codebooks from the original local patch set. However, facing millions of local patches, the computational efficiency of direct sparse coding [12], [36] remains an open problem.

Compared with the above works, our novelties are twofold. First, our codeword-level sparse coding improves the essential computational efficiency: we aim to learn a compression transform to compress the quantized BoW histograms, rather than directly encoding the local patch collection.

This is very important for scalable applications that contain hundreds of millions of local patches, for which it is unsuitable to directly use sparse coding to learn codebooks. Second, existing supervised sparse coding models only allow supervision at the local patch level. On the contrary, the coding model proposed in this paper enables the modeling of heterogeneous types of semantic labels through our proposed LCK.

Fig. 2. Graphical model of task-dependent codebook compression.

Fig. 3. Visualized flowchart of our codebook compression pipeline.

Our contributions: Compared with previous works that directly supervise the codeword quantization procedure, we focus on compressing an initially high-dimensional codebook into an extremely compact codebook (see Figs. 2 and 3). Our first contribution is efficiency: we operate only on the BoW histograms and thus avoid the burden of directly supervising the quantization of millions or billions of local features. Our second contribution is extensibility: our approach can be deployed on most existing codebooks [1], [3], [4], [8], [9]; we treat their BoW histograms as input, and our subsequent compression is independent of the initial codebook building strategy. Our last contribution is flexibility: we propose a task-dependent discriminability embedding to model different kinds of semantic labels, either within an identical database or among different databases. This is achieved by introducing a flexible label constraint kernel in Section IV-B to model heterogeneous kinds of semantic labels.

III. PROBLEM FORMULATION

We denote scalars as italic letters, e.g., $x$; vectors as bold letters, e.g., $\boldsymbol{x}$; the instance space of $n$ instances as $\mathcal{X}=\{\boldsymbol{x}_i\}_{i=1}^{n}$; the inner product between two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ as $\langle\boldsymbol{x},\boldsymbol{y}\rangle$; the $\ell_p$ norm over $\boldsymbol{x}$ with $p\ge1$ as $\|\boldsymbol{x}\|_p$; the transposition of a matrix $\mathbf{M}$ as $\mathbf{M}^{T}$; and the inverse of $\mathbf{M}$ as $\mathbf{M}^{-1}$.

The input of our codebook compression algorithm is an initial codebook $C$ containing $M$ codewords, which represents the $n$ training images as a set of BoW histograms $V=\{\boldsymbol{v}_i\}_{i=1}^{n}$, in which each $\boldsymbol{v}_i$ denotes an $M$-bin BoW histogram. This initial codebook can be derived from most existing approaches, such as [1], [3], [4], [8], and [9]. Then, suppose the subsequent task (such as image retrieval or object recognition) provides its labels as $L=\{l_i\}_{i=1}^{n}$. To incorporate partial labels in our subsequent supervised codebook compression, we further assign $l_i=0$ to indicate that the $i$th image is unlabeled. We aim to learn a basis dictionary $\mathbf{B}\in\mathbb{R}^{M\times K}$ to compress the initial codebook; $\mathbf{B}$ serves as a compression function that transfers each original BoW histogram $\boldsymbol{v}_i$ into a more compact histogram $\boldsymbol{s}_i\in\mathbb{R}^{K}$. Therefore, for a new image, we first use $C$ to generate its original BoW histogram $\boldsymbol{v}$ and then use $\mathbf{B}$ to map $\boldsymbol{v}$ into a compact histogram $\boldsymbol{s}$. In general, we have $K\ll M$: $M$ is very large for many widely used codebooks, whereas $K$ is at the hundred level in our subsequent experiments.

Our first goal is to ensure that the basis dictionary $\mathbf{B}$ can still preserve the visual discriminability of the original codebook. To maintain such discriminability, we should slightly compress the important codewords and heavily compress the chaos-like codewords. To this end, we model the visual discriminability of each codeword in our compression loss function to learn $\mathbf{B}$ in Section IV-A. Our second goal is to integrate the task-dependent label constraints to supervise our compression.
Such labels come from numerous sources, such as object classes, correlative image annotations, and image ranking orders from user-query logs. "Task dependent" means that the compression process should depend on its specific task even within an identical database, e.g., searching semantically similar images versus searching near-duplicate (ND) objects. Task-dependent embedding is a challenging issue that remains unexplored in previous works. In Section IV-B, we further show how to integrate heterogeneous labels from different tasks into our codebook compression framework.

IV. TASK-DEPENDENT CODEBOOK COMPRESSION

Using the initial BoW histogram set $V$ extracted from the training images, we learn the basis dictionary $\mathbf{B}$ from the original codebook by minimizing the following cost:

$$\mathrm{Cost}=\min_{\mathbf{B}}\sum_{i=1}^{n}\mathrm{Loss}(\boldsymbol{v}_i,\mathbf{B}) \qquad (1)$$

where $\mathrm{Loss}(\boldsymbol{v}_i,\mathbf{B})$ denotes the loss function that measures how well the current basis dictionary reconstructs an initial BoW histogram. To this end, we aim to learn the best compression function to transfer each high-dimensional sparse $\boldsymbol{v}_i$ into a compact descriptor $\boldsymbol{s}_i$. To evaluate the compression distortion, we reversely measure the recovery residual of $\boldsymbol{v}_i$ from $(\mathbf{B},\boldsymbol{s}_i)$. Furthermore, considering that each histogram $\boldsymbol{v}_i$ is sparse, with only a few nonzero bins, we hope that our reconstruction coefficients $\boldsymbol{s}_i$ are also sparse. We add this sparse constraint into the loss function using an $\ell_1$ regularization:

$$\mathrm{Loss}(\boldsymbol{v}_i,\mathbf{B},\boldsymbol{s}_i)=\|\boldsymbol{v}_i-\mathbf{B}\boldsymbol{s}_i\|_2^2+\lambda\|\boldsymbol{s}_i\|_1 \qquad (2)$$

where $\boldsymbol{s}_i$ is the coefficient vector that reconstructs $\boldsymbol{v}_i$ from $\mathbf{B}$ and meanwhile serves as the new compressed BoW histogram of the $i$th image, and $\lambda$ controls the tradeoff between the sparsity of the reconstructed signal and its precision in recovering $\boldsymbol{v}_i$. While guaranteeing real sparsity through the $\ell_0$ norm is intractable [39], a Lasso-based $\ell_1$ penalty [19] can still approximate a sparse solution for the coefficients. To avoid arbitrarily small $\boldsymbol{s}_i$, we normalize each column $\boldsymbol{b}_j$ of $\mathbf{B}$ after each round of optimizing (1):

$$\boldsymbol{b}_j\leftarrow\frac{\boldsymbol{b}_j}{\|\boldsymbol{b}_j\|_2}\quad\text{s.t.}\quad\|\boldsymbol{b}_j\|_2\le1,\ j=1,\ldots,K. \qquad (3)$$

Note that Cost in (1) is not jointly convex with respect to $(\mathbf{B},\{\boldsymbol{s}_i\})$. Therefore, we resort to an iterative joint optimization between the basis dictionary $\mathbf{B}$ and the compressed BoW histograms $\{\boldsymbol{s}_i\}$ (other strategies, such as linear programming, could also be used to accelerate the learning of $\mathbf{B}$ and the estimation of $\{\boldsymbol{s}_i\}$): in each round, we fix one parameter set ($\mathbf{B}$ or $\{\boldsymbol{s}_i\}$) and optimize the other. We then rewrite (2) with constraints as

$$\min_{\mathbf{B},\{\boldsymbol{s}_i\}}\sum_{i=1}^{n}\|\boldsymbol{v}_i-\mathbf{B}\boldsymbol{s}_i\|_2^2+\lambda\|\boldsymbol{s}_i\|_1\quad\text{s.t.}\quad\|\boldsymbol{b}_j\|_2\le1,\ j=1,\ldots,K. \qquad (4)$$

One essential advantage is that, compared with direct sparse coding to build a codebook as in existing works [12], [36], we operate only at the dictionary level (typically containing only $M$ codewords), rather than on the entire local feature collection (which typically contains millions or billions of local features). Hence, the computational efficiency is largely improved. Meanwhile, as shown in our subsequent experiments, our codebook compression performance is comparable with that of the state-of-the-art methods [12], [36] that directly operate on the entire initial local feature collection.

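To make the sparse-coding half of (4) concrete, the following is a minimal sketch (not the authors' implementation) that solves (2) for a single histogram with scikit-learn's Lasso solver and applies the column normalization of (3); the matrix sizes and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso


def normalize_columns(B):
    """Project each basis column of B onto the unit l2 ball, as in (3)."""
    return B / np.maximum(np.linalg.norm(B, axis=0), 1e-12)


def sparse_code(v, B, lam=0.1):
    """Solve min_s ||v - B s||_2^2 + lam * ||s||_1 for one BoW histogram.

    v: (M,) initial BoW histogram; B: (M, K) basis dictionary.
    Returns s: (K,) compressed histogram (reconstruction coefficients).
    """
    M = B.shape[0]
    # sklearn's Lasso minimizes (1/(2*M)) * ||v - B s||_2^2 + alpha * ||s||_1,
    # so alpha is rescaled to match the objective of (2).
    reg = Lasso(alpha=lam / (2 * M), fit_intercept=False, max_iter=5000)
    reg.fit(B, v)
    return reg.coef_


# Toy usage: compress a 1000-bin histogram into 100 coefficients.
rng = np.random.default_rng(0)
B = normalize_columns(rng.standard_normal((1000, 100)))
v = np.maximum(rng.standard_normal(1000), 0.0)  # sparse-ish nonnegative histogram
s = sparse_code(v, B)
print(np.count_nonzero(s), "nonzero coefficients out of", s.size)
```

The $\ell_1$ penalty drives most of the $K$ coefficients to exactly zero, which is the sparsity behavior that (2) asks for.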
A. Visual Discriminability Embedding

Visual discriminability refers to whether a codeword is discriminative in distinguishing the training images in the BoW histogram. We exploit this criterion to guide our codebook compression: discriminative codewords should be slightly compressed, whereas chaos-like codewords should be heavily compressed. Following the principle of term frequency and inverse document frequency [20], we quantitatively measure the visual discriminability of the $j$th codeword $c_j$ as

$$w_j=\frac{f_j}{F}\times\log\frac{N}{n_j} \qquad (5)$$

where $w_j$ denotes the discriminative weighting of codeword $c_j$, $f_j$ is the number of local features quantized into $c_j$, $F$ is the total number of local features extracted from all training images, $N$ is the total number of training images, and $n_j$ is the number of images containing local features quantized into $c_j$. On the right-hand side of (5), the first term measures whether $c_j$ represents many local features (the more the better), and the second term measures whether $c_j$ is discriminative for only a few images (the fewer the better). The ensemble $\boldsymbol{w}=[w_1,\ldots,w_M]^{T}$ forms a visual discriminability weighting vector, which is embedded to refine our compression loss in (4) as

$$\mathrm{Loss}(\boldsymbol{v}_i,\mathbf{B},\boldsymbol{s}_i)=\big\|\mathrm{diag}(\boldsymbol{w})(\boldsymbol{v}_i-\mathbf{B}\boldsymbol{s}_i)\big\|_2^2+\lambda\|\boldsymbol{s}_i\|_1. \qquad (6)$$

Equation (6) highlights whether a given BoW histogram contains discriminative local features, as measured by the weighting $\boldsymbol{w}$. If so, the learning of $\mathbf{B}$ emphasizes the compression loss of $\boldsymbol{v}_i$ more, and vice versa.
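As a hedged illustration of the weighting in (5) under the notation reconstructed above (the exact constants of the original are not recoverable from this transcription), the following computes one weight per codeword from a matrix of per-image codeword counts:

```python
import numpy as np


def codeword_weights(counts):
    """Visual discriminability weights in the spirit of (5).

    counts: (N, M) matrix; counts[i, j] = number of local features of
            image i quantized into codeword j.
    Returns w: (M,) weighting vector, one weight per codeword.
    """
    total_features = max(counts.sum(), 1)        # F: all local features
    features_per_word = counts.sum(axis=0)       # f_j per codeword
    images_per_word = (counts > 0).sum(axis=0)   # n_j: images containing c_j
    n_images = counts.shape[0]                   # N: number of training images
    tf = features_per_word / total_features      # "represents many features"
    idf = np.log(n_images / np.maximum(images_per_word, 1))
    return tf * idf                              # chaos-like words get ~0 weight


counts = np.array([[3, 0, 1], [2, 1, 0], [4, 0, 0]])
print(codeword_weights(counts))  # codeword 0 appears in every image -> idf = 0
```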
B. Task-Dependent Discriminability Embedding

Task-dependent discriminability refers to whether the compressed codebook is suitable for its subsequent recognition or retrieval task. In this subsection, we integrate the task labels to supervise the learning of our basis dictionary. This is achieved by introducing an LCK into our compression loss for each BoW histogram $\boldsymbol{v}_i$ to evaluate its labeling consistency:

$$\mathrm{LCK}(\boldsymbol{v}_i)=\mathrm{Logistic}\Big(\sum_{l_j\in N(\boldsymbol{v}_i)}Q(l_i,l_j)\Big) \qquad (7)$$

where $\mathrm{Logistic}(\cdot)$ represents a logistic loss function on a scalar, which enjoys properties similar to the loss function of support vector machines (SVMs); $Q(l_i,l_j)$ defines a consistency measurement between two labels $l_i$ and $l_j$, which is revisited in Sections IV-B1-IV-B3; and $N(\boldsymbol{v}_i)$ denotes the labels of nearby BoW histograms that fall within the nearest neighbors of $\boldsymbol{v}_i$. Our goal then turns into learning $\mathbf{B}$ with LCK constraints. To that effect, we propose a refined loss function of (6) as

$$\mathrm{Loss_D}(\boldsymbol{v}_i,\mathbf{B},\boldsymbol{s}_i)=\gamma\,\mathrm{LCK}(\boldsymbol{v}_i)\,\big\|\mathrm{diag}(\boldsymbol{w})(\boldsymbol{v}_i-\mathbf{B}\boldsymbol{s}_i)\big\|_2^2+\lambda\|\boldsymbol{s}_i\|_1 \qquad (8)$$

where $\gamma$ controls the degree of task-dependent embedding. Learning the overall Cost using the refined Loss_D is still not convex with respect to $(\mathbf{B},\{\boldsymbol{s}_i\})$; similarly, we resort to a joint optimization between $\mathbf{B}$ and $\{\boldsymbol{s}_i\}$, as stated before. It is worth mentioning that, before the compression procedure, we preestimate and store the LCK for each $\boldsymbol{v}_i$ in a lookup table, which largely accelerates the subsequent iterative regression.

Our LCK can be used in several kinds of recognition or retrieval tasks by modeling heterogeneous labels, such as one-class, two-class, and multiple-class labels; partial labels; correlative labels; and lists of image ranking orders. We specify the LCK in (7) as follows.

1) Modeling Category Labels: In this case, the label vector of training images comes from several discrete categories, and we assume that any two given categories are independent. Hence, $Q$ takes effect only when two identical labels fall into $N(\boldsymbol{v}_i)$:

$$\mathrm{LCK}(\boldsymbol{v}_i)=\mathrm{Logistic}\Big(\sum_{l_j\in N(\boldsymbol{v}_i)}\delta(l_i=l_j)\Big) \qquad (9)$$

Here, we specialize the consistency measurement $Q$ as the count of identical labels within $N(\boldsymbol{v}_i)$, in which $\delta(\cdot)$ produces 1 for two identical labels and 0 otherwise. Intuitively, if $N(\boldsymbol{v}_i)$ contains more nearby identical category labels, the learning of $\mathbf{B}$ pays more attention to the reconstruction loss of $\boldsymbol{v}_i$. It is worth mentioning that partial labeling, as well as one-class labeling, can be easily modeled using our category-based LCK in (9).

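A minimal sketch of precomputing the category-based LCK lookup table of (9); since the transcription elides the exact logistic expression and the neighborhood size, the log(1 + exp(-x)) form and k are assumptions on our part.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def logistic_loss(x):
    # Assumed form of Logistic(.) in (7): the SVM-like loss log(1 + exp(-x)).
    return np.log1p(np.exp(-x))


def category_lck(V, labels, k=5):
    """Category-label LCK of (9): per image, the logistic of the number of
    identical labels among its k nearest BoW neighbors (label 0 = unlabeled)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(V)
    _, idx = nn.kneighbors(V)                 # idx[:, 0] is the image itself
    lck = np.empty(len(V))
    for i, neighbors in enumerate(idx[:, 1:]):
        consistency = 0.0
        if labels[i] != 0:                    # partial labels: skip unlabeled
            consistency = float(np.sum(labels[neighbors] == labels[i]))
        lck[i] = logistic_loss(consistency)
    return lck                                # stored once as a lookup table


# Toy usage: six 2-bin histograms, two categories plus one unlabeled image.
V = np.array([[1, 0], [0.9, 0.1], [0.8, 0], [0, 1], [0.1, 0.9], [0.5, 0.5]])
print(category_lck(V, labels=np.array([1, 1, 1, 2, 2, 0]), k=2))
```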
2) Modeling Correlative Semantic Annotations: We further remove the category-independence constraint, which allows the labels to be correlative with each other. For instance, a "furniture" annotation is closer to a "chair" annotation than to an "airplane" annotation. WordNet [40] is a powerful electronic lexical database providing such semantic relationships. We adopt the WordNet::Similarity [41] measurement $\mathrm{Sim}(l_i,l_j)$ to quantitatively measure the semantic closeness between two correlative annotations $l_i$ and $l_j$, which refines the LCK as

$$\mathrm{LCK}(\boldsymbol{v}_i)=\mathrm{Logistic}\Big(\sum_{l_j\in N(\boldsymbol{v}_i)}\mathrm{Sim}(l_i,l_j)\Big) \qquad (10)$$

Here, we specialize the consistency degree of $\boldsymbol{v}_i$ as its accumulated WordNet similarity within $N(\boldsymbol{v}_i)$. Intuitively, if $N(\boldsymbol{v}_i)$ contains more correlative annotations, $\boldsymbol{v}_i$ is more semantically sensitive; therefore, $\boldsymbol{v}_i$ receives less compression loss in generating $\mathbf{B}$ and $\boldsymbol{s}_i$. On the contrary, more diverse semantic annotations in $N(\boldsymbol{v}_i)$ lead to heavier compression of $\boldsymbol{v}_i$.

3) Modeling Image Ranking Lists: Rather than category or semantic labels, in some other cases, the task supervision cannot be directly obtained. For instance, content-based image search engines (e.g., TinEye [51] and Google Goggles [52]) usually collect large amounts of user-query logs, which offer similarity ranking orders instead of direct semantic labels. Suppose there are in total $G$ groups of user-query logs, and each group gives a ranking-order list $R$ for its image query. We model the ranking consistency of $Q$ as

$$Q(l_i,l_j)=\mathrm{sgn}\big(r(j)-r(i)\big) \qquad (11)$$

in which $r(i)$ denotes the ranking order of the $i$th image in $R$. We set $Q(l_i,l_j)=0$ once $i$ or $j$ is outside $R$, thereby ensuring a neutral LCK for unlabeled (unranked) images. Subsequently, our LCK is refined as

$$\mathrm{LCK}(\boldsymbol{v}_i)=\mathrm{Logistic}\Big(\sum_{l_j\in N(\boldsymbol{v}_i)}\mathrm{sgn}\big(r(j)-r(i)\big)\Big). \qquad (12)$$

Based on (12), we specialize the measurement of $\boldsymbol{v}_i$ as the sum of positive and negative sequential ranking pairs within $N(\boldsymbol{v}_i)$. Intuitively, if we obtain more positive sequential ranking pairs in $N(\boldsymbol{v}_i)$, $\boldsymbol{v}_i$ should be ranked higher. Hence, our LCK gives less compression loss for $\boldsymbol{v}_i$ in generating the basis dictionary $\mathbf{B}$ and $\boldsymbol{s}_i$.
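The pairwise term of (11) can be sketched as follows, under our sign-based reconstruction of the (garbled) original formula; the dict-based query log is an illustrative stand-in for real user-query logs.

```python
import numpy as np


def ranking_q(i, j, rank):
    """Pairwise ranking consistency in the spirit of (11); rank maps image
    id -> position in one query log, and ids outside the log yield Q = 0."""
    if i not in rank or j not in rank:
        return 0
    return int(np.sign(rank[j] - rank[i]))  # +1 when image i is ranked above j


def ranking_consistency(i, neighbor_ids, rank):
    """Sum of positive/negative sequential ranking pairs within N(v_i),
    i.e., the scalar fed into Logistic(.) in (12)."""
    return sum(ranking_q(i, j, rank) for j in neighbor_ids)


# Toy query log: image 7 ranked 1st, image 3 ranked 2nd, image 9 ranked 3rd.
rank = {7: 1, 3: 2, 9: 3}
print(ranking_consistency(7, [3, 9, 42], rank))  # 2: above both ranked neighbors
```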
C. Iterative Dictionary Learning

Learning the basis dictionary $\mathbf{B}$ in (8) is an optimization problem with respect to $(\mathbf{B},\{\boldsymbol{s}_i\})$. As stated before, since a jointly convex optimization of both $\mathbf{B}$ and $\{\boldsymbol{s}_i\}$ is unavailable, we resort to an iterative convex optimization between $\mathbf{B}$ and $\{\boldsymbol{s}_i\}$ using block coordinate descent [12], [42]-[44]. It iteratively alternates between the learning of $\mathbf{B}$ (denoted as supervised dictionary learning) and the inference of $\{\boldsymbol{s}_i\}$ (denoted as task-dependent sparse coding), as outlined in Algorithm 1.

Algorithm 1: Iterative Dictionary Learning

1 Input: Training images with initial BoW histograms $V=\{\boldsymbol{v}_i\}_{i=1}^{n}$, initial visual codebook $C$, task-dependent labels $L$, iteration counter $t$, and maximum iteration number $T$.
2 Initialization: Set $\mathbf{B}^{(0)}$ to a random Gaussian matrix with normalized columns. Use $\mathbf{B}^{(0)}$ and $V$ to calculate the initial $\{\boldsymbol{s}_i^{(0)}\}$.
3 while { not converged and $t\le T$ } do
4     For the given task, calculate and store the LCK response for each $\boldsymbol{v}_i$ using the respective scheme in Section IV-B.
5     Supervised Dictionary Learning:
6     for each $\boldsymbol{v}_i$ in the $t$th iteration do
7         Use Lasso [19] to learn $\mathbf{B}^{(t)}$ based on the following loss:
          $$\mathrm{Loss}=\sum_{i=1}^{n}\gamma\,\mathrm{LCK}(\boldsymbol{v}_i)\,\big\|\mathrm{diag}(\boldsymbol{w})\big(\boldsymbol{v}_i-\mathbf{B}^{(t)}\boldsymbol{s}_i^{(t-1)}\big)\big\|_2^2+\lambda\big\|\boldsymbol{s}_i^{(t-1)}\big\|_1 \qquad (13)$$
8     end
9     Task-Dependent Sparse Coding:
10    for each $\boldsymbol{v}_i$ in the $t$th iteration do
11        Estimate $\boldsymbol{s}_i^{(t)}$ as
          $$\boldsymbol{s}_i^{(t)}=\arg\min_{\boldsymbol{s}}\ \gamma\,\mathrm{LCK}(\boldsymbol{v}_i)\,\big\|\mathrm{diag}(\boldsymbol{w})\big(\boldsymbol{v}_i-\mathbf{B}^{(t)}\boldsymbol{s}\big)\big\|_2^2+\lambda\|\boldsymbol{s}\|_1 \qquad (14)$$
12    end
13 end
14 Output: The basis dictionary $\mathbf{B}$; the new compressed histograms $\{\boldsymbol{s}_i\}$.

It is worth mentioning that, in practice, we learn the compression function $\mathbf{B}$ and the compressed histograms $\{\boldsymbol{s}_i\}$ simultaneously using Algorithm 1 with iterative optimization between $\mathbf{B}$ and $\{\boldsymbol{s}_i\}$. From this viewpoint, our proposed approach can be regarded as single-stage processing.

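Putting the pieces together, here is a compact sketch of the alternation in Algorithm 1. The dictionary update uses a plain projected-gradient step on (13) instead of the authors' Lasso-based solver, a substitution on our part; $\gamma$ is assumed folded into the precomputed lck weights.

```python
import numpy as np
from sklearn.linear_model import Lasso


def _unit_columns(B):
    """Column normalization of (3): project each basis onto the unit l2 ball."""
    return B / np.maximum(np.linalg.norm(B, axis=0), 1e-12)


def learn_dictionary(V, K, w, lck, lam=0.1, T=20, step=1e-2, seed=0):
    """Alternate (14) and (13): task-dependent sparse coding of S, then a
    projected-gradient step on B.

    V: (n, M) BoW histograms; w: (M,) visual weights; lck: (n,) LCK weights.
    Returns the dictionary B and the compressed histograms S.
    """
    rng = np.random.default_rng(seed)
    n, M = V.shape
    B = _unit_columns(rng.standard_normal((M, K)))
    for _ in range(T):
        # --- task-dependent sparse coding: solve (14) per image ---
        S = np.empty((n, K))
        for i in range(n):
            # fold the per-image weight lck[i] and diag(w) into the design
            y = np.sqrt(lck[i]) * (w * V[i])
            X = np.sqrt(lck[i]) * (w[:, None] * B)
            reg = Lasso(alpha=lam / (2 * M), fit_intercept=False, max_iter=2000)
            reg.fit(X, y)
            S[i] = reg.coef_
        # --- dictionary update: gradient of the reconstruction term in (13) ---
        R = (V - S @ B.T) * (w ** 2)[None, :] * lck[:, None]  # weighted residual
        B = _unit_columns(B + step * (R.T @ S))               # enforce (3)
    return B, S
```

Convergence can be monitored through the decrease of (13) across outer iterations.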
V. EXPERIMENTAL COMPARISONS

In this section, we compare our task-dependent codebook compression with alternative approaches and five state-of-the-art works [1], [16], [33], [36], [50]. We carry out our comparisons on two benchmark databases and a large-scale web photo data set: 1) object recognition in PASCAL Visual Object Class (VOC) 07, which aims to classify whether a given image contains a certain object (outputting a binary judgment); 2) ND image retrieval in UKBench, which aims to find ND images that contain identical objects or identical scenes regardless of photographing variances; and 3) a 0.5 million Flickr data set, which provides quantitative performance evaluations for three kinds of task-dependent codebook compression, namely category image search, semantically similar image search, and image ranking.

A. Benchmark Databases and Evaluations

1) PASCAL VOC 07 Object Recognition Benchmark: The PASCAL VOC 2007 benchmark [45] contains 20 image categories: Aeroplane, Bicycle, Bird, Boat, Bottle, Bus, Car, Cat, Chair, Cow, Diningtable, Dog, Horse, Motorbike, Person, Pottedplant, Sheep, Sofa, Train, and Tvmonitor. There are around 2500 training images, 2500 validation images, and 5000 test images in total. We use PASCAL VOC 07 to evaluate our category-based LCK (see Section IV-B1) with comparisons to [16] and [36].

2) UKBench ND Retrieval Benchmark: The UKBench data set [1] contains images of 2500 objects, with four images per object, offering sufficient variances in viewpoint, lighting, scale, occlusion, and affine transformation. These object groupings are modeled as category labels in our category LCK, one label per group of four images. To offer the MSER+SIFT baseline reported in [1], we follow the identical setting of [1]: we build our retrieval model over the entire UKBench database and then select the first image per object to form a query set. This ensures that the top returned photo is the query itself; the performance then depends on how highly we rank the remaining three photos of the object. This is evaluated by the correct returning measurement [1], which reports the average number of correctly returned images in the top four results. The UKBench data set is used to evaluate our codebook compression with the category-based LCK of Section IV-B1, with comparisons to [1] and [33].

3) 0.5 Million Flickr Database: We crawled over 0.5 million collaboratively labeled photos from Flickr, which provides a real-world evaluation for our task-dependent compression. It contains over 180 million local features, with labels drawn from over 3600 unique keywords. Within this 0.5 million Flickr data set, we carry out three task-dependent codebook compressions for the following tasks:

1) ND-0.5Million Evaluation: Our first task is ND image search in the 0.5 million Flickr data set. Following the TinEye search evaluation methodology [51], we collected and manually labeled 23 groups of web images (1100 in total) from both TinEye and Google image search. The images in each group are partial duplicates of each other. We then add these ground-truth images into our 0.5 million data set. This enriched data set is referred to as ND-0.5Million (ND evaluation in a 0.5 million database).
Since the original 0.5 million Flickr data set could also contain partial duplicates of our 23 ground-truth categories, we use every image in the ground-truth set to query the database and subsequently identify and remove any duplicates from the original 0.5 million Flickr data set. We quantitatively evaluate our task-dependent codebook compression using a category-aware LCK, with comparison to the related works in [1] and [50]. We measure our retrieval performance by mean average precision (MAP) at $n$. It represents the mean precision over queries, where each query reveals its position-sensitive ranking precision within the top $n$ positions:

$$\mathrm{MAP}@n=\frac{1}{Q}\sum_{q=1}^{Q}\frac{1}{m_q}\sum_{k=1}^{m_q}P_q(\mathrm{rank}_k) \qquad (15)$$

where $Q$ is the number of queries, $m_q$ is the number of relevant images for query $q$, and $P_q(\mathrm{rank}_k)$ is the precision at the cutoff rank of the $k$th relevant image.

2) SS-0.5Million Evaluation: Our second task is semantic-sensitive image retrieval in the 0.5 million Flickr data set. We select 20 labels identical to the categories in PASCAL VOC 07. For each label, we randomly pick two images from our 0.5 million Flickr data set and rank the top 100 similar photos for each. For each ranking list, we ask a group of volunteers to identify whether each ranked image comes from the identical category of the query; if so, we treat it as a correct image, and otherwise as an incorrect image. Similar to the ND-0.5Million Evaluation, we also use MAP to measure performance. We denote this evaluation as SS-0.5Million (semantically similar evaluation in a 0.5 million database), which is used to compare performances between our semantic-annotation-based LCK (see Section IV-B2) and the work in [50].

3) RS-0.5Million Evaluation: Our third task is image ranking based on user-query log learning. For each query in the above SS-0.5Million Evaluation (containing 40 queries), we collect the correctly returned images and their ranking orders as user-query logs, which are used to evaluate our ranking-based LCK of Section IV-B3. We denote this group as RS-0.5Million (ranking-sensitive evaluation in a 0.5 million database), which evaluates our ranking-consistency LCK (see Section IV-B3) with the normalized discounted cumulative gain (NDCG) measurement:

$$\mathrm{NDCG}@n=Z_n\sum_{j=1}^{n}\frac{2^{r(j)}-1}{\log(1+j)} \qquad (16)$$

where $r(j)$ represents the relevance rating of the $j$th sample in the ranked list with respect to the query, bounded in [0, 1], and $Z_n$ is a normalization constant for the given ranked list (so that a perfect ranking yields $\mathrm{NDCG}@n=1$). Compared with precision and recall, NDCG is sensitive to the positions of the highest rated images, regardless of the variable lengths of different ranking lists.

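A minimal sketch of the two evaluation measures as reconstructed in (15) and (16); the log base and the relevance-rating convention are assumptions, since the transcription drops them.

```python
import numpy as np


def average_precision(is_relevant, m_q, n=100):
    """Per-query inner sum of (15): is_relevant flags the top-n returned
    positions; m_q is the number of relevant images for the query."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(is_relevant[:n], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant cutoff
    return precision_sum / max(m_q, 1)


def mean_average_precision(runs, n=100):
    """MAP@n of (15): mean of the per-query average precisions."""
    return float(np.mean([average_precision(rel, m, n) for rel, m in runs]))


def ndcg(ratings, n):
    """NDCG@n of (16), assuming base-2 logs; Z_n normalizes by the gain of
    the ideally reordered list so a perfect ranking scores 1."""
    top = np.asarray(ratings[:n], dtype=float)
    gains = (2.0 ** top - 1) / np.log2(np.arange(2, len(top) + 2))
    ideal = np.sort(np.asarray(ratings, dtype=float))[::-1][:n]
    ideal_gains = (2.0 ** ideal - 1) / np.log2(np.arange(2, len(ideal) + 2))
    z = ideal_gains.sum()
    return float(gains.sum() / z) if z > 0 else 0.0


print(mean_average_precision([([True, False, True], 2)]))  # 0.833...
print(ndcg([1.0, 0.0, 1.0], n=3))                          # < 1: one misplaced hit
```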
Fig. 4. Parameter tuning of the best compressed codebook size in three databases. The tuning performance in (b) is measured by correct returning; the tuning performances in (c) and (d) are measured by MAP. (a) Tuning $(K,\lambda,\gamma)$ in PASCAL VOC 07. (b) Tuning $(K,\lambda,\gamma)$ in UKBench. (c) Tuning $(K,\lambda,\gamma)$ in the ND-0.5Million Evaluation. (d) Tuning $(K,\lambda,\gamma)$ in the SS-0.5Million Evaluation.

B. Baselines and Alternative Approaches

We summarize the baselines [1], [12], [16], [33], [50] and alternative approaches compared in the subsequent quantitative comparisons:

1) VT [1]: The first baseline is unsupervised visual-codebook generation based on k-means clustering. We resort to its hierarchical version [1] to ensure scalable indexing and search.

2) BoW plsa [33]: Bosch et al. [33] reported state-of-the-art recognition performance on Caltech 101. They use plsa to extract topics from the original BoW histograms and then train an SVM over the topic-level features. Since [33] also compresses an initial codebook, we directly compare our performance to the quantitative results in [33]. In Section V-E, we further discuss the theoretical differences between [33] and this paper.

3) Learning-based codebook refinement [16], [31] (BoW Learning): To build a supervised vocabulary for object recognition tasks, we also reimplement the work in [16], which incorporated category learning to adapt an initial vocabulary into several class-specific vocabularies. We compare our work to [16] in both the object categorization task (PASCAL VOC 07) and the ND image retrieval task (UKBench).

4) Sparse-coding-based visual codebook [36] (Sparse Coding): We further compare our codebook compression with the recent work in [36]. Different from our scheme, which operates over the BoW histogram, the work in [36] used fast sparse coding to generate a compact dictionary from local patch collections. Since [36] is also deployed on PASCAL VOC 07, we can directly offer quantitative comparisons to [36] in the object categorization task.

5) Aggregating local features (AggreSIFT): Jegou et al. [50] proposed to aggregate local descriptors with principal component analysis (PCA) and hashing indexing, which produces an approximately 32-bit descriptor per image, as a successor of [53]. We also compare our scheme to [50] on our 0.5 million Flickr data set.

6) Compressed codebook without visual discriminability embedding (without visual): As an alternative approach, we embed neither the visual discriminability weighting nor the LCK in our loss function [see (8)]; only direct sparse coding is used to compress the initial codebook [using (2) in place of (8)]. This baseline demonstrates the effectiveness of our visual discriminability embedding.

7) Compressed codebook without task-dependent embedding (without task): As another alternative approach, we embed only the visual discriminability weighting, but not the LCK, to compress the initial codebook [using (6) in place of (8)].

8) BoW PCA: A straightforward approach is to compress the high-dimensional BoW histograms using PCA, which reduces the dimensionality while maintaining the main characteristics of the BoW histogram (a minimal sketch follows this list). In Section V-D, we compare our codebook compression with BoW PCA in both UKBench and the 0.5 million Flickr data set.
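For completeness, baseline 8) reduces to a few lines with scikit-learn; the sizes below are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA

# BoW PCA baseline sketch: project high-dimensional BoW histograms onto
# their first K principal components.
rng = np.random.default_rng(0)
bow = rng.poisson(0.05, size=(500, 4096)).astype(float)  # 500 sparse histograms
pca = PCA(n_components=100)
compressed = pca.fit_transform(bow)                      # shape (500, 100)
print(compressed.shape, pca.explained_variance_ratio_.sum())
```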
Fig. 5. Parameter tuning of the best hierarchical layers for the initial visual codebooks within the three benchmark databases. (a) Tuning H in PASCAL VOC 07. (b) Tuning H in UKBench. (c) Tuning H in the 0.5 million Flickr database.

C. Parameter Tunings

1) Codebook Construction Tuning: For each benchmark data set, we extract SIFT [46] features from its training images. Using all SIFT features, we build a VT [1] to obtain the initial codebook, which generates a BoW histogram for each training image. We denote the hierarchical level (deciding how many layers to build in the VT) as H and the branching factor (deciding how many child clusters each parent node has) as b. We stop the quantization division once fewer than 1000 SIFT features remain in a word, which bounds the number of words at the finest level. Fig. 4 gives the parameter tuning of codebook size in the three data sets. Based on the tuning results in Fig. 5, the initial vocabulary settings (H, b) are fixed separately for the UKBench, PASCAL VOC 07, and 0.5 million Flickr data sets. For the object classification task, a linear SVM [48] is adopted to predict the category labels. For ND retrieval (UKBench) and image search (0.5 million Flickr), we use the L2 distance to rank image similarity with inverted indexing; a sketch of the hierarchical quantization follows.
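A minimal sketch of the hierarchical k-means quantization behind the VT [1], with illustrative parameters (the tuned H and b values are not recoverable from this transcription):

```python
import numpy as np
from sklearn.cluster import KMeans


def build_vocab_tree(descriptors, branch=4, depth=3, min_leaf=1000, seed=0):
    """Recursively split the descriptor set with branch-way k-means,
    stopping after `depth` levels (H) or when a node holds fewer than
    `min_leaf` descriptors, as in the stopping rule above."""
    if depth == 0 or len(descriptors) < min_leaf:
        return None  # leaf node: one visual word
    km = KMeans(n_clusters=branch, n_init=4, random_state=seed).fit(descriptors)
    children = [build_vocab_tree(descriptors[km.labels_ == c],
                                 branch, depth - 1, min_leaf, seed)
                for c in range(branch)]
    return {"kmeans": km, "children": children}


def quantize(tree, descriptor):
    """Descend to a leaf; the branch path serves as the visual word id."""
    path = []
    while tree is not None:
        c = int(tree["kmeans"].predict(descriptor[None, :])[0])
        path.append(c)
        tree = tree["children"][c]
    return tuple(path)


sift = np.random.default_rng(0).random((20000, 128)).astype(np.float32)
tree = build_vocab_tree(sift, branch=4, depth=3, min_leaf=500)
print(quantize(tree, sift[0]))  # e.g., (2, 0, 3): one leaf word
```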

Fig. 6. Average precision on all 20 object categories of the PASCAL VOC 07 data set, with comparisons to the state-of-the-art works VT [1], BoW plsa [33], learning codebook [16], and sparse coding [36]. The curve "codebook compression" denotes our final approach with both visual and task-dependent discriminability embedding; "without visual" denotes baseline 6), and "without task" denotes baseline 7). Our final approach is the one that includes both visual discriminability and task-dependent (category-based LCK) discriminability in codebook compression. In "without visual," "without task," and the final approach, we fix the same compressed size $K$ for learning the basis dictionary.

2) Codebook Compression Tuning: Based on the best tuned visual codebook, we carry out our codebook compression to obtain the compressed codebook. Here, we briefly discuss the choice of three compression parameters: $\lambda$ [controlling reconstruction sparsity in (2), (4), (6), and (8)], $\gamma$ [controlling the degree of task-dependent embedding in (8)], and $K$ [controlling the dimension of the basis dictionary]. While a straightforward choice is to tune all of them simultaneously using cross-validation, we resort to a more efficient strategy that sequentially optimizes $K$, $\lambda$, and $\gamma$: we increase the volume of $K$ and, for each $K$, tune the best $(\lambda,\gamma)$ pair as follows. First, the sparsity parameter $\lambda$ is fixed to the setting empirically found in [12]. Then, $\gamma$ serves as a regularization parameter controlling how strongly the compressed codebook fits the task labels: a larger $\gamma$ fits the training labels more closely but increases the risk of overfitting, and vice versa. We estimate the best $\gamma$ by leave-one-out cross-validation. Fig. 4 presents the best tuned parameters in each data set.

D. Quantitative Results

1) Object Recognition in PASCAL VOC 07: Fig. 6 shows our object recognition performance on PASCAL VOC 07. First, the purely visual codebook [1] [baseline 1)] with a linear SVM [48] obtains the lowest performance among all approaches; this was also observed in the PASCAL VOC 07 evaluations of BoW [49]. Employing plsa [33] to compress the codebook [baseline 2)] improves performance by reducing the feature dimension, thereby avoiding the curse of dimensionality in the SVM (also observed in [33]). Using the learning codebook [16] [baseline 3), in which a class-specific codebook is learned for each object class] with a one-class SVM, we find that integrating semantic supervision can largely improve performance. Employing sparse coding to directly learn a compressed codebook [36] [baseline 4)] outperforms baselines 1)-3) in effectiveness; however, its time cost is extremely high, since it operates directly over the entire local feature collection (our coding, instead, operates on the BoW histograms). Baseline 6), which compresses the initial codebook without either visual discriminability or task-dependent discriminability embedding, achieves better performance than both VT [1] and BoW plsa [33] (the latter being the work most similar to our alternative approach [baseline 7)] with solely visual discriminability embedding). However, it achieves lower performance than the learning codebook [16] [baseline 3)] and sparse coding [36] [baseline 4)]. We explain this by the fact that both baselines 3) [16] and 4) [36] integrate label supervision to build codebooks.
Compared with our scheme variant that embeds only visual discriminability, such integration largely boosts their performance. With solely visual discriminability embedding [baseline 7)] and without task-dependent embedding, we outperform the VT [1] [baseline 1)], the BoW plsa [33] [baseline 2)], and the learning codebook [16] [baseline 3)]. In addition, we achieve performance comparable to the sparse coding scheme [36] [baseline 4)]. Finally, embedding both visual and task-dependent discriminability yields the best performance over all baselines; in particular, our approach outperforms the supervised sparse coding [36] in 16 categories.

Advantages in computational efficiency: It is worth mentioning that the work in [36] directly encodes the gigantic collection of original local descriptors (such as SIFT). This is extremely time-consuming and cannot be scaled up to millions or billions of descriptors; indeed, the time complexity of [36] is already very high for PASCAL VOC 07 (around 10 000 images). On the other hand, the work of [16] learns category-specific codebooks and hence also cannot be scaled up to real-world databases containing thousands of correlative semantic categories. The work in [31] has a similar issue, as it also learns one codebook per category to train the subsequent classifiers.

Table I further shows the visual search efficiency in the 0.5 million Flickr data set. Using both visual and task-dependent discriminability embedding does not significantly increase the computational cost compared with building the original VT. In online search, due to the online coding requirement, we also need additional cost to build the compressed BoW histogram, but this remains slightly more efficient than extracting the plsa topic features of the state-of-the-art work in [33]. In addition, adding GNP would significantly increase the online query time. Note that the additional search-time cost is twofold: 1) transforming the BoW histogram into the compressed histogram and 2) searching the new inverted index of the compressed codewords. The latter costs more time, since the inverted file for the compact codebook is less sparse; a toy sketch of such an inverted-file search follows.

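This sketch makes visible why denser compressed histograms touch more postings lists; the data structure and the inner-product scoring are illustrative assumptions rather than the paper's exact indexing scheme.

```python
from collections import defaultdict

import numpy as np


def build_inverted_index(S):
    """One postings list per compressed codeword, holding (image_id, value)
    pairs for the nonzero bins of each compressed histogram."""
    index = defaultdict(list)
    for img_id, s in enumerate(S):
        for word in np.flatnonzero(s):
            index[int(word)].append((img_id, float(s[word])))
    return index


def search(index, q, n_images):
    """Accumulate inner-product scores via postings lists; a denser query
    (or database) histogram visits more lists, which is the extra cost."""
    scores = np.zeros(n_images)
    for word in np.flatnonzero(q):
        for img_id, value in index[int(word)]:
            scores[img_id] += q[word] * value
    return np.argsort(-scores)  # image ids, best match first


S = np.array([[0.5, 0.0, 0.2], [0.0, 0.9, 0.0], [0.4, 0.1, 0.0]])
index = build_inverted_index(S)
print(search(index, S[0], len(S)))  # querying with image 0: it ranks first
```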
TABLE I. Time and storage analysis in the 0.5 million Flickr database.

Fig. 7. Performance of the ND image search task in the UKBench database. We compare our task-dependent codebook compression with seven baselines: 1) VT [1]; 2) BoW plsa [33]; 3) learning codebook [16]; 4) direct sparse coding [36]; 6) without visual discriminability embedding; 7) without task embedding; and 8) BoW PCA. In this case, the four images of an identical object share an identical category label in our category LCK.

2) ND Image Retrieval in UKBench: Fig. 7 further shows the performance of the above baselines on UKBench, where observations similar to Fig. 6 hold. Note that the quantity varied on the x axis and the measurement on the y axis are identical to the quantitative comparisons in [1]. In addition, there are two interesting observations. 1) ND image retrieval depends more on visual discriminability and less on task-dependent discriminability. This is validated by the large margin between baselines 6) and 7), where baseline 7) performs better due to its visual discriminability embedding. 2) For ND search, the original VT [1] [baseline 1)] even outperforms both BoW plsa [33] [baseline 2)] and the learning codebook [16] [baseline 3)]. This is because the labels (four images as an identical category) do not always offer further information for codebook building, since the intracategory images are already near duplicates. Another reason is that ND search involves no classifier training; hence, the curse of dimensionality that affects [1] in Fig. 6 is largely avoided.

3) Image Ranking Tasks in the 0.5 Million Flickr Data Set: Fig. 8(a) further shows the ND-0.5Million Evaluation in the Flickr data set. While performance similar to Fig. 6 can be observed, there are two additional observations compared with ND image retrieval. 1) The scenario of semantic image search depends more on the task-dependent embedding and less on the visual discriminability embedding: Fig. 8(a) shows the large margin of our approach and baseline 7) over baselines 1), 5), and 6). 2) Baseline 5) [50] outperforms our alternative approaches [baselines 6) and 7)] but performs worse than our final approach. However, we should note that baseline 5) [50] gives an extremely compact image signature of only approximately 32 B, whereas our final approach gives an over 1-kB representation, as shown in Table I.

Fig. 8(b) further shows the SS-0.5Million Evaluation in the 0.5 million Flickr data set. Similarly, there is a large margin separating the learning codebook [16] [baseline 3)] and our final approach from the VT [1] [baseline 1)] and our alternative approaches. Note that baseline 3) outperforms our alternative approach [baseline 7)] with only visual discriminability embedding; however, our final approach, using both visual and task-dependent discriminability embedding, still outperforms baseline 3). The main advantage comes from the annotation-based LCK of Section IV-B2.
Finally, Fig. 8(c) shows the RS-0.5Million Evaluation on our 0.5 million Flickr data set, which is used to evaluate our ranking-based LCK of Section IV-B3, with comparisons to our alternative approaches [baselines 6) and 7)]. In this case, learning from the user-query logs can largely improve our ranking performance. When the number of user-query logs grows, our learning complexity increases at most linearly.

As shown in Figs. 7 and 8, BoW PCA, serving as a straightforward solution, does not perform better than our scheme. While PCA preserves the principal components of the original BoW histogram, it hardly captures the visual discriminability of the BoW representation in search. In addition, due to the computational cost of principal-component extraction, PCA is less efficient for large codebooks.

4) Where Is the Matching?: We record the spatial locations of the local patches. After quantizing them into the initial BoW histogram $\boldsymbol{v}$, we transform $\boldsymbol{v}$ into $\boldsymbol{s}$ using $\mathbf{B}$. Then, we recover $\hat{\boldsymbol{v}}=\mathbf{B}\boldsymbol{s}$ from $\boldsymbol{s}$. Since only part of the bins in $\hat{\boldsymbol{v}}$ are nonzero, we can map these nonzero bins back to the local features that were quantized into them, as shown in Fig. 9.
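The bin-to-patch mapping behind Fig. 9 can be sketched as follows, with toy data standing in for the stored patch locations:

```python
import numpy as np


def recover_matched_patches(B, s, word_of_patch, xy_of_patch, eps=1e-6):
    """Recover v_hat = B s and return the (x, y) locations of the local
    patches whose visual words survive compression (nonzero bins of v_hat)."""
    v_hat = B @ s
    alive_words = set(np.flatnonzero(np.abs(v_hat) > eps).tolist())
    return [xy_of_patch[i] for i, wd in enumerate(word_of_patch)
            if wd in alive_words]


# Toy usage: four patches quantized into a 3-word codebook, K = 2 bases.
B = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])  # word 1 is dropped by B
s = np.array([0.7, 0.3])
patches_xy = [(10, 12), (40, 8), (25, 30), (5, 44)]
word_ids = [0, 1, 2, 0]
print(recover_matched_patches(B, s, word_ids, patches_xy))
# [(10, 12), (25, 30), (5, 44)]: the patch quantized to word 1 disappears
```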

Fig. 8. (a) Performance of the ND image search task in the 0.5 million Flickr data set (ND-0.5Million Evaluation), with comparisons to VT [1] [baseline 1)], aggregated local features [50] [baseline 5)], without visual discriminability embedding [baseline 6)], without task embedding [baseline 7)], and BoW PCA [baseline 8)]; here, the category-based LCK is used. (b) Performance of the semantic-sensitive image search task in the 0.5 million Flickr data set (SS-0.5Million Evaluation), with comparisons to the same baselines; here, the image annotations serve as correlative annotation supervision through our LCK. (c) Performance of the ranking-sensitive image search task in the 0.5 million Flickr data set (RS-0.5Million Evaluation), with comparisons to the same baselines. Note that we use the NDCG measurement rather than MAP in this last group of evaluations.

Fig. 9. Spatial locations of the recovered local patches from our compressed signature $\boldsymbol{s}$. In each subfigure, the top-left part is the image in the database, the top-right patches are the local patches extracted from this image, and the bottom part shows the patches belonging to the nonzero visual words, together with their spatial locations, after reconstructing $\hat{\boldsymbol{v}}$ from $\boldsymbol{s}$.

Fig. 11. Ranking examples in the ND-0.5Million Evaluation, in which each left (large) image denotes a query generated from our ground-truth set. Two groups of ranking results are shown: upper row, the original VT-based ranking [1]; lower row, our codebook-compression-based ranking with category-based LCK embedding.

Fig. 10. Average distribution of the compressed bag-of-words histograms (reconstruction parameter vectors).

5) Compressed Histogram Distribution: In Fig. 10, we sum the $\boldsymbol{s}_i$ of 100 test images to obtain an average histogram, in which each bin shows the averaged reconstruction parameter (we sum all $\boldsymbol{s}_i$ and then divide each bin by 100, yielding an average reconstruction-parameter histogram that shows the new codeword distribution). The compression function is learned from the initial word vocabulary with the tuned compressed size $K$. We can see that, compared with the originally sparse BoW histogram, the compressed reconstruction-parameter histogram (which is also our compressed descriptor) is very dense, which largely reduces the storage space (in memory) and improves retrieval efficiency. Finally, training classifiers on such a low-dimensional histogram also avoids the curse of dimensionality incurred by the previous high-dimensional BoW histograms.

E. Further Discussions

On the relation to visual-codebook topic models: One straightforward solution for codebook compression is to use unsupervised topic models [such as plsa [32], [33] and LDA [33]] to build a higher level abstraction of the initial codebook (see Fig. 11).
In addition to our superior quantitative performance, we further discuss two main differences between our scheme and topic models. 1) plsa and LDA are generative models, which need predefined hyperparameters to decide the topic numbers; however, it is unsuitable to assert that a given codebook has a fixed number of topics in its compression for different task scenarios (even within an identical database).

2) plsa and LDA are computationally inefficient; for instance, the singular value decomposition is hard to scale up. In contrast, our codebook compression uses gradient descent operations to iteratively learn the basis dictionary, which can be easily parallelized (over the optimization of either $\mathbf{B}$ or $\{\boldsymbol{s}_i\}$).

On the relation to ICA dictionary learning: Independent component analysis (ICA) [47] also builds a coding dictionary from the local patch collection, but it is unsuitable for our case for the following reasons. 1) Components in ICA are constrained to be mutually independent, whereas our compression imposes no such constraint: any two bases in $\mathbf{B}$ need not be orthogonal. This relaxation enables the integration of both visual and task-dependent discriminability. 2) ICA is computationally inefficient when deployed over the initial local patch collection; this cost is unaffordable at large scale (a drawback shared by the sparse-coding-based codebooks [12], [36]). 3) ICA focuses on the reconstruction capability of the original input signal, whereas our model emphasizes discriminability (both visual and task-dependent) for the subsequent visual search or object recognition tasks.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have proposed to compress an originally high-dimensional visual codebook with the help of supervised task labels, based on sparse coding. In addition to the computational benefits in both processing time and storage space, our compressed codebook is also discriminative, extensible, and flexible. The discriminability comes from integrating the codeword term-weighting measurement into our compression cost, by which the important codewords are slightly compressed and vice versa. The extensibility comes from deploying our compression scheme over most existing codebooks [1], [3], [4], [8], [9] as a task-dependent postadaptation. The flexibility comes from supervising the compression via task labels, achieved by a task-dependent LCK in our compression loss. We have conducted extensive experiments on PASCAL VOC 07, UKBench, and a 0.5 million Flickr data set and have shown superior performance compared with state-of-the-art codebooks [1], [16], [33], [36], [50].

Two interesting questions remain open. First, because numerous user-query logs are generated every day in most existing image search engines, how to adapt the supervised codebook compression to incremental query logs is still an open problem; it would be interesting to extend our codebook compression into an incremental version. Second, to a certain degree, a formerly compressed codebook also offers valuable knowledge for new tasks. We would further exploit the transferability of task-dependent codebooks, e.g., adapting a compressed codebook from PASCAL VOC 07 to Caltech 101.

REFERENCES

[1] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proc. CVPR, 2006.
[2] J. Yang, Y. Jiang, A. Hauptmann, and C. W. Ngo, "Evaluating bag-of-visual-words representations in scene classification," in Proc. Multimedia Inf. Retrieval, 2007.
[3] R. Ji, X. Xie, H. Yao, and W. Y. Ma, "Vocabulary hierarchy optimization for effective and transferable retrieval," in Proc. CVPR, 2009.
[4] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabulary and fast spatial matching," in Proc. CVPR, 2007.
[5] J. van Gemert, C. Veenman, A. Smeulders, and J. Geusebroek, "Visual word ambiguity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7, Jul. 2010.
[6] D. Chen, S. Tsai, and V. Chandrasekhar, "Tree histogram coding for mobile image matching," in Proc. Data Compression Conf., 2009.
[7] D. Chen, S. Tsai, V. Chandrasekhar, G. Takacs, R. Vedantham, R. Grzeszczuk, and B. Girod, "Inverted index compression for scalable image matching," in Proc. Data Compression Conf., 2010.
[8] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proc. ICCV, 2003.
[9] G. Schindler and M. Brown, "City-scale location recognition," in Proc. CVPR, 2007.
[10] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing for scalable image search," in Proc. ICCV, 2009.
[11] H. Jegou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in Proc. ECCV, 2008.
[12] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Supervised dictionary learning," in Proc. NIPS, 2008.
[13] S. Lazebnik and M. Raginsky, "Supervised learning of quantizer codebooks by information loss minimization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 7, Jul. 2009.
[14] F. Moosmann, B. Triggs, and F. Jurie, "Fast discriminative visual codebooks using randomized clustering forests," in Proc. NIPS, 2006.
[15] R. Ji, H. Yao, X. Sun, B. Zhong, and W. Gao, "Towards semantic embedding in visual vocabulary," in Proc. CVPR, 2010.
[16] F. Perronnin, C. Dance, G. Csurka, and M. Bressan, "Adapted vocabularies for generic visual categorization," in Proc. ECCV, 2006.
[17] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, "Local features and kernels for classification of texture and object categories: A comprehensive review," Int. J. Comput. Vis., vol. 73, no. 2, Jun. 2007.
[18] J. Liu, Y. Yang, and M. Shah, "Learning semantic visual vocabularies using diffusion distance," in Proc. CVPR, 2009.
[19] R. Tibshirani, "Regression shrinkage and selection via the Lasso," J. Roy. Stat. Soc. B, vol. 58, no. 1, pp. 267-288, 1996.
[20] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inf. Process. Manage., vol. 24, no. 5, pp. 513-523, 1988.
[21] P. Quelhas, F. Monay, J. M. Odobez, D. Gatica-Perez, T. Tuytelaars, and L. J. Van Gool, "Modeling scenes with local descriptors and latent aspects," in Proc. ICCV, 2005.
[22] K. Grauman and T. Darrell, "The pyramid match kernel: Discriminative classification with sets of image features," in Proc. ICCV, 2005.
[23] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in Proc. VLDB, 1999.
[24] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Lost in quantization: Improving particular object retrieval in large scale image databases," in Proc. CVPR, 2008.
[25] Y.-G. Jiang, C.-W. Ngo, and J. Yang, "Towards optimal bag-of-features for object categorization and semantic video retrieval," in Proc. CIVR, 2007.
[26] T. Kohonen, "Learning vector quantization for pattern recognition," Helsinki Inst. Technol., Helsinki, Finland, Tech. Rep. TKK-F-A601.
[27] T. Kohonen, Self-Organizing Maps, 3rd ed. New York: Springer-Verlag, 2001.
[28] A. Rao, D. Miller, K. Rose, and A. Gersho, "A generalized VQ method for combined compression and estimation," in Proc. ICASSP, 1996.
[29] B. Leibe, A. Leonardis, and B. Schiele, "Combined object categorization and segmentation with an implicit shape model," in Proc. ECCV, 2004.
[30] S. Agarwal and D. Roth, "Learning a sparse representation for object detection," in Proc. ECCV, 2002.
[31] J. Winn, A. Criminisi, and T. Minka, "Object categorization by learned universal visual dictionary," in Proc. ICCV, 2005, pp. –.
[32] F.-F. Li and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. ICCV, 2007, pp. –.

[33] A. Bosch, A. Zisserman, and X. Munoz, "Scene classification using a hybrid generative/discriminative approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. –, Apr.
[34] D. Donoho, "For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution," Commun. Pure Appl. Math., vol. 59, no. 7, pp. –.
[35] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. –, Feb.
[36] S. Bengio, F. Pereira, and Y. Singer, "Group sparse coding," in Proc. NIPS, 2009, pp. –.
[37] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. ICML, 2009, pp. –.
[38] H. Liu, M. Palatucci, and J. Jiang, "Blockwise coordinate descent procedures for the multi-task Lasso, with applications to neural semantic basis discovery," in Proc. ICML, 2009, pp. –.
[39] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, pp. –.
[40] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
[41] T. Pedersen, S. Patwardhan, and J. Michelizzi, "WordNet::Similarity: Measuring the relatedness of concepts," in Proc. AAAI, 2004, pp. –.
[42] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?," Vis. Res., vol. 37, no. 23, pp. –, Dec.
[43] M. Aharon, M. Elad, and A. M. Bruckstein, "The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representations," IEEE Trans. Signal Process., vol. 54, no. 11, pp. –, Nov.
[44] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Proc. NIPS, 2006, pp. –.
[45] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes Challenge (VOC) Results. [Online]. Available:
[46] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. –, Nov.
[47] A. Hyvarinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Netw., vol. 13, no. 4/5, pp. –, May/Jun.
[48] J. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Process. Lett., vol. 9, no. 3, pp. –, Jun.
[49] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. –, Jun.
[50] H. Jegou, M. Douze, C. Schmid, and P. Perez, "Aggregating local descriptors into a compact image representation," in Proc. CVPR, 2010, pp. –.
[51] [Online]. Available:
[52] [Online]. Available:
[53] H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. –, Jan.

Rongrong Ji (M'11) received the Ph.D. degree from the Harbin Institute of Technology, Harbin, China. From 2007 to 2008, he was a Research Intern with the Web Search and Mining Group, Microsoft Research Asia, Beijing, China, mentored by Xing Xie. From May 2010 to June 2010, he was a Visiting Student with the University of Texas at San Antonio, San Antonio, where he cooperated with Professor Qi Tian. From July 2010 to November 2011, he was a Visiting Student with the Institute of Digital Media, Peking University, Beijing, under the supervision of Professor Wen Gao.
He is currently a Postdoctoral Researcher with the Department of Electronic Engineering, Columbia University, New York. He is the author of over 40 refereed journal and conference papers, in venues including the International Journal of Computer Vision, the IEEE TRANSACTIONS ON IMAGE PROCESSING, Computer Vision and Pattern Recognition, ACM Multimedia, the International Joint Conference on Artificial Intelligence, IEEE MULTIMEDIA, etc. His research interests include image retrieval and annotation, and video retrieval and understanding.
Dr. Ji is the recipient of a Microsoft Fellowship. He won the Best Paper Award of ACM Multimedia. He was the Session Chair of ICME 2008 and a Program Committee Member of ACM Multimedia. He is a Reviewer for IEEE Signal Processing Magazine, the IEEE TRANSACTIONS ON MULTIMEDIA, SMC, TKDE, and the ACM Multimedia conferences, and an Associate Editor for the International Journal of Computer Applications.

Hongxun Yao (M'03) received the B.S. and M.S. degrees in computer science from the Harbin Shipbuilding Engineering Institute, Harbin, China, in 1987 and 1990, respectively, and the Ph.D. degree in computer science from the Harbin Institute of Technology, Harbin. She is currently a Professor with the School of Computer Science and Technology, Harbin Institute of Technology. She is the author of five books and over 200 scientific papers. Her research interests include pattern recognition, multimedia processing, and digital watermarking.

Wei Liu (S'06) received the B.S. degree from Zhejiang University, Hangzhou, China, in 2001 and the M.E. degree from the Chinese Academy of Sciences, Beijing, China. He is currently working towards the Ph.D. degree in the Department of Electrical Engineering, Columbia University, New York. He was a Research Assistant with the Department of Information Engineering, Chinese University of Hong Kong, Hong Kong.

Xiaoshuai Sun (S'10) is currently working towards the Ph.D. degree at the Harbin Institute of Technology, Harbin, China. His research interests include image and video understanding, focusing on saliency analysis.

Qi Tian (M'96–SM'03) received the B.E. degree in electronic engineering from Tsinghua University, Beijing, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University, Philadelphia, PA, in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign, Urbana. He is currently an Associate Professor with the Department of Computer Science, University of Texas at San Antonio (UTSA), San Antonio. His research interests include multimedia information retrieval and computer vision.
Dr. Tian is the recipient of a Best Student Paper Award at ICASSP 2006, a Best Paper Candidate at PCM 2007, the 2010 ACM Service Award, and a Top 10% Best Paper Award at MMSP. He has served as Program Chair, Organization Committee Member, and TPC member for numerous IEEE and ACM conferences, including ACM Multimedia, SIGIR, ICCV, and ICME. He was a Guest Editor of the IEEE TRANSACTIONS ON MULTIMEDIA, the Journal of Computer Vision and Image Understanding, Pattern Recognition Letters, the EURASIP Journal on Advances in Signal Processing, and the Journal of Visual Communication and Image Representation. He is on the Editorial Board of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the Journal of Multimedia, and Machine Vision and Applications.
