Hidden Markov Models for Automatic Annotation and Content-Based Retrieval of Images and Video


Arnab Ghoshal, Johns Hopkins University, Baltimore, MD, U.S.A.
Pavel Ircing, University of West Bohemia, Pilsen, Czech Republic
Sanjeev Khudanpur, Johns Hopkins University, Baltimore, MD, U.S.A.

ABSTRACT

This paper introduces a novel method for automatic annotation of images with keywords from a generic vocabulary of concepts or objects for the purpose of content-based image retrieval. An image, represented as a sequence of feature-vectors characterizing low-level visual features such as color, texture or oriented edges, is modeled as having been stochastically generated by a hidden Markov model, whose states represent concepts. The parameters of the model are estimated from a set of manually annotated (training) images. Each image in a large test collection is then automatically annotated with the a posteriori probability of the concepts present in it. This annotation supports content-based search of the image collection via keywords. Various aspects of model parameterization, parameter estimation, and image annotation are discussed. Empirical retrieval results are presented on two image collections: COREL and keyframes from TRECVID. Comparisons are made with two other recently developed techniques on the same datasets.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Retrieval Models; I.5.1 [Models]: Statistical; I.4.8 [Scene Analysis]: Object Recognition

General Terms

Design, Experimentation, Performance

Keywords

Hidden Markov models, Image & video retrieval

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'05, August 15-19, 2005, Salvador, Brazil. Copyright 2005 ACM /05/ $5.00.

1. INTRODUCTION

The content of communications in the digital age is increasingly multi-modal in nature, with text, images and even speech or video being used in a single document. This is true of web pages on the Internet, reports at a workplace, and even interpersonal interactions, as witnessed by the file-contents of most personal computers. Content-based indexing and retrieval of multimedia is therefore becoming an increasingly important issue. Unlike search and retrieval of text documents, where the modality in which the user naturally, or at least usually, specifies her information need is the same as the modality of the search collection, there is relatively little work on image retrieval based on textual queries. The major reason, of course, is that open-domain image understanding remains a largely unsolved problem, and even detecting a single predetermined object in an image, such as a human face, is a fairly difficult problem (cf. e.g. [13]). Yet important progress has been made in the last few years in content-based image retrieval, as reported in [4, 2, 3, 7] and elsewhere. We are not the first to observe that while the pure image understanding problem, i.e. the problem of recognizing all the objects in a given image, is very difficult due to several invariance issues, there are two aspects of the image search and retrieval problem which make it relatively much more tractable.
One is the availability of side information in the accompanying text: images in multimedia documents are often accompanied by captions or other descriptive text which a search engine may use to guess the content of an image. This is indeed what is used, e.g., by Google Images and similar tools. The other is that the search problem is far more tolerant of erroneous inferences than is usually assumed necessary for object recognition in individual images: the system need not recognize with high certainty whether any particular image in a collection contains, say, a tiger, since it suffices if the inferred likelihood of tiger in images containing tigers is higher than its likelihood in images not containing tigers, no matter how small either likelihood is. These two observations lead to the research paradigm subscribed to in this paper.

1. We develop a generative stochastic model for images and their accompanying captions. Training material for such a model, we believe, is readily obtained from naturally occurring image+caption pairs. The vocabulary of the words used in the captions is fairly generic, enabling the annotation of additional images that may not have surrounding text. If surrounding text is available, our model is capable of using it as a prior, and computing posterior probabilities of the words in the caption given the visual evidence in the image.

2. We estimate the parameters of our model from a manually annotated (training) collection of image+caption pairs, and then automatically annotate (or caption) a collection of test images using this fairly generic vocabulary of keywords.

3. We measure the efficacy of our image annotation model by performing image retrieval experiments on the test collection using single-word queries, and compute standard information retrieval performance measures. We take the view that as long as the retrieved images are in the right rank-order, human users are very efficient at scanning thumbnails for the images they need.

The remainder of this paper is organized as follows. Section 2 formally describes the hidden Markov model (HMM) used for image annotation, and compares it with two other recently developed techniques for the same task. Section 3 delves into various issues in the design and parameter estimation of the model. Section 4 presents a series of experimental results on two image collections, namely the COREL and TRECVID datasets. We conclude with some discussion in Section 5.

2. HIDDEN MARKOV MODELS FOR IMAGE ANNOTATION

Let a collection of image+caption pairs be provided and consider the problem of developing a stochastic generative model that jointly describes each pair. Let $I = i_1, \ldots, i_T$ denote segments (regions) in an image, and $C = \{c_1, \ldots, c_N\}$ denote the objects (concepts) present in that image, as specified by the corresponding label (caption) $C$. For each image region $i_t$, $t = 1, \ldots, T$, let $x_t \in \mathbb{R}^d$ represent the color, texture, edges, shape and other salient visual features of the region.¹ Finally, let $V$ denote the global vocabulary of the caption-words $c_n$ across the entire collection of images.

We propose to model the $\{x_t\}$-vectors of an image $I$ as a hidden Markov process [11], generated by an underlying unobserved Markov chain whose states $s_t$ take values in $C$. Specifically, each $x_t$ is generated according to some probability density function $f(\cdot \mid s_t)$ given the state $s_t$, where $s_t$ itself is a Markov chain with a known initial state $s_0$ and transition probabilities $p(s_t \mid s_{t-1})$. Figure 1 illustrates an image with $T = 6$ rectangular regions with the caption {tiger, grass, water, ground}, and the state-transition diagram of the underlying Markov chain which generates these regions. Formally, for a state-sequence $s_1, \ldots, s_T = s_1^T \in C^T$,

$$f(x_1, \ldots, x_T, s_1, \ldots, s_T \mid s_0) = \prod_{t=1}^{T} f(x_t \mid s_t)\, p(s_t \mid s_{t-1}).$$

Note that observing the state sequence $s_t$ is equivalent to knowing the alignment of each image region $i_t$ with one of the words in the caption. Since this level of detail is usually not provided in a caption, a hidden Markov model (HMM) is an appropriate formalism for computing the joint likelihood:

$$f(x_1^T, C \mid s_0) = \sum_{s_1^T \in C^T} \prod_{t=1}^{T} f(x_t \mid s_t)\, p(s_t \mid s_{t-1}), \qquad (1)$$

where $x_1^T$ denotes the $T$-length sequence $x_1, \ldots, x_T$.

¹The $T$ segments may be object-based, with each region corresponding to one semantically distinct object, or they may be a simple rectangular partition of the image into fixed-size blocks. We have chosen to use fixed-size segments in our experiments, based on a 4×6 or 5×7 partition of the image.

[Figure 1: Illustration of the State-Transition Graph and Output for an Image+Caption HMM.]

The emission densities $f$ and transition probabilities $p$ of the HMM are estimated, given a training collection of image+caption pairs, to maximize the joint likelihood (1).
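To make the hidden-state sum in (1) concrete, the following is a minimal sketch of the forward algorithm restricted to the caption states; the helper `log_f` and the layout of `log_p` are illustrative assumptions, not the HTK implementation used in the paper.

```python
import numpy as np

def joint_log_likelihood(X, log_f, log_p):
    """Forward algorithm for eq. (1): log f(x_1^T, C | s_0).

    X     : (T, d) array of region feature-vectors x_1..x_T.
    log_f : callable; log_f(x) returns the length-N vector of
            log f(x | c_n) for the N caption states c_1..c_N.
    log_p : (N+1, N) array of transition log-probs; row 0 holds
            log p(s_1 = c_j | s_0), rows 1..N hold log p(c_j | c_i).
    """
    T = X.shape[0]
    # alpha[j] = log f(x_1..x_t, s_t = c_j | s_0), updated recursively
    alpha = log_p[0] + log_f(X[0])
    for t in range(1, T):
        # log-sum-exp over the previous state realizes the sum in (1)
        alpha = log_f(X[t]) + np.logaddexp.reduce(
            alpha[:, None] + log_p[1:], axis=0)
    return np.logaddexp.reduce(alpha)  # marginalize the final state
```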
In particular, we model the output density $f(\cdot \mid c)$ for each state $c \in V$ as a mixture of multivariate Gaussian densities on $\mathbb{R}^d$:

$$f(x \mid c) = \sum_{m=1}^{M} \frac{w_{m,c}}{\sqrt{(2\pi)^d \,|\Sigma_{m,c}|}}\; e^{-\frac{1}{2}(x - \mu_{m,c})^T \Sigma_{m,c}^{-1} (x - \mu_{m,c})}, \qquad (2)$$

where the $w_{m,c}$ are the mixture weights, $\mu_{m,c}$ the mean-vector and $\Sigma_{m,c}$ the diagonal covariance-matrix of the $m$-th mixture component of the state $c$. The transition probabilities are non-parametric, with the initial values being the uniform pmf over all permissible states in $C$ (or in $V$, as described below). Details of this maximum likelihood estimation procedure are standard (cf. [11]) and therefore omitted.

Given an image+caption pair $(I, C)$, the most likely alignment of image regions with caption-words is

$$\hat{s}_1^T = \arg\max_{s_1^T \in C^T} \prod_{t=1}^{T} f(x_t \mid s_t)\, p(s_t \mid s_{t-1}). \qquad (3)$$

This is sometimes called the Viterbi alignment. Finally, the marginal likelihood of an image may be computed as

$$f(x_1^T \mid s_0) = f(x_1^T, V \mid s_0) = \sum_{s_1^T \in V^T} \prod_{t=1}^{T} f(x_t \mid s_t)\, p(s_t \mid s_{t-1}). \qquad (4)$$

When an image with no caption is provided, this marginal likelihood enables one to use the HMM to compute the conditional probability that a particular image region $i_t$ was generated by a particular concept $c \in V$, based on the total visual evidence $x_1^T$ in the image $I$:

$$p(s_t = c \mid x_1^T, s_0) = \frac{f(x_1^T, s_t = c \mid s_0)}{f(x_1^T \mid s_0)} = \frac{\sum_{s_1^T : s_t = c}\; \prod_{\tau=1}^{T} f(x_\tau \mid s_\tau)\, p(s_\tau \mid s_{\tau-1})}{f(x_1^T \mid s_0)}. \qquad (5)$$

Finally, the probability of a particular concept $c \in V$ being present (somewhere) in an image may be calculated as

$$p(c \in C_I \mid s_0) \propto \sum_{t=1}^{T} p(s_t = c \mid x_1^T, s_0). \qquad (6)$$
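Equations (5) and (6) are exactly what the forward-backward recursions compute; here is a companion sketch over the full vocabulary $V$, as used for uncaptioned test images (uniform initial probabilities are assumed, as in the decoding HMM of Section 3.2):

```python
import numpy as np

def concept_presence_scores(X, log_f, log_p):
    """Forward-backward sketch for eqs. (5)-(6).

    X     : (T, d) region features; log_f(x) -> (V,) emission log-pdfs;
    log_p : (V, V) transition log-probs (uniform in the decoding HMM).
    Returns a (V,) vector proportional to p(c present in I), eq. (6).
    """
    T, V = X.shape[0], log_p.shape[0]
    log_emit = np.stack([log_f(x) for x in X])            # (T, V)
    log_alpha = np.empty((T, V))
    log_beta = np.empty((T, V))
    log_alpha[0] = -np.log(V) + log_emit[0]               # uniform start
    for t in range(1, T):
        log_alpha[t] = log_emit[t] + np.logaddexp.reduce(
            log_alpha[t - 1][:, None] + log_p, axis=0)
    log_beta[T - 1] = 0.0
    for t in range(T - 2, -1, -1):
        log_beta[t] = np.logaddexp.reduce(
            log_p + (log_emit[t + 1] + log_beta[t + 1])[None, :], axis=1)
    log_gamma = log_alpha + log_beta                      # eq. (5), unnormalized
    log_gamma -= np.logaddexp.reduce(log_gamma, axis=1, keepdims=True)
    return np.exp(log_gamma).sum(axis=0)                  # eq. (6)
```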

Thus jointly modeling the image-features and caption-text as the realization of such a generative stochastic process, i.e. an HMM, provides for

1. estimating the parameters of the model from annotated image+caption pairs,
2. aligning image-regions with caption-words in an image+caption pair, and
3. computing the likelihood of a caption-word being present in an image.

HMMs have several strengths, and a few limitations, as models of stochastic processes, which our methodology inherits.

- It seems natural to use a common density $f(\cdot \mid c)$ to model image-features of regions corresponding to a concept $c$ across different images, and to use different densities to model image-features corresponding to different concepts in the same image.
- Image-regions in our dataset are not manually aligned to words in the image-captions. HMMs provide a natural way to treat the missing alignment information.
- HMMs provide efficient algorithms for model estimation and inference, as described in Steps 1-3 above. In particular, indexing a large number of (test) images simultaneously with all the concepts in the vocabulary $V$ is efficient: the computation grows linearly in the size of the test collection and, at worst, quadratically in the number of concepts, and does not depend on the size of the training collection.
- Features $\{x_t\}$ of adjacent image regions are assumed to be conditionally independent given the underlying concepts. We plan to address this limitation in future work using context-dependent HMM states and spatial derivatives of the image-features.

A freely available toolkit, HTK [12], efficiently implements all the basic parameter estimation and probability calculation algorithms needed for working with the model described above, and this is what we have used in the experiments described in Section 4 below.

Note that the model proposed above differs fundamentally from the use of HMMs by [9] for image annotation. In their work, different HMMs, one for each category (concept), are first estimated from all images labeled with that category. The states of the HMMs have no correspondence with any semantic concepts in an image. Only the likelihood (4) of each image in a test collection is computed under the HMM of every category. The actual assignment of index terms to a test image is done post hoc, based on (a few) categories under whose HMMs the test image is highly likely.

2.1 HMMs vs. Image Annotation as MT

In some recent work, Duygulu et al. [4] have presented the image annotation problem as a variant of the statistical machine translation problem. In their formulation, the representations $x_t$ of the segmented image regions are discretized and then treated as words, replacing $\mathbb{R}^d$ with a vocabulary $X$ of visual terms (visterms). The training corpus of image+caption pairs is then treated as a corpus of aligned bilingual text, and statistical machine translation techniques are employed to infer a stochastic translation lexicon $p(x \mid c)$ between $c \in V$ and $x \in X$. Given a test image $I$ with visterms $x_1^T$, a probability is calculated for each concept $c$ in the caption-vocabulary (in the simplest case) as

$$p(c \in C_I) = \sum_{x \in X} p(c \mid x)\, p(x \mid I) = \frac{1}{T} \sum_{t=1}^{T} p(c \mid x_t).$$

Note that the estimation of $p(x \mid c)$ in their case, particularly when using IBM Model-1, is identical in form to the embedded estimation of $f(x \mid c)$ in the HMM case, provided we set the transition probabilities to be uniform. In both cases, the alignment of individual image-regions to caption-words is treated as a hidden variable and the EM algorithm is employed.
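In code, the MT-style score is just an average of per-visterm translation probabilities; a sketch, assuming the translation table $p(c \mid x)$ has already been estimated (e.g. by an IBM Model-1 EM run):

```python
import numpy as np

def mt_concept_scores(visterms, p_c_given_x):
    """p(c in C_I) = (1/T) * sum_t p(c | x_t) for one test image.

    visterms    : length-T sequence of visterm indices for the image.
    p_c_given_x : (|X|, |V|) array with rows p(. | x)."""
    return p_c_given_x[np.asarray(visterms)].mean(axis=0)
```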
Thus the primary difference between our model and theirs is that the image features are modeled here as continuous-valued vectors, avoiding the need for quantization. A lesser difference is that, unlike object-based segmentation in their case, we have used regular rectangles to segment the image. In more recent work [6], the model of [4] has been investigated with rectangular regions. It will be shown in Section 4 that preserving the continuous-valued representation of the $x_t$'s results in significant improvement in annotation and retrieval performance.

2.2 HMMs vs. Continuous Relevance Models

Manmatha et al. [7, 8] have used a continuous version of the relevance model (CRM) to perform image annotation and retrieval. In their case, continuous-valued visual features $\{x_t\}$ are extracted from rectangular regions of the (test) image $I$ and compared with the visual features extracted from each training image $J$ using a Gaussian kernel function. In particular, for a training image $J$ with visual features $\{r_1, \ldots, r_T\}$ and caption $\{y_1, \ldots, y_N\}$,

$$k(x \mid J) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{\sqrt{(2\pi)^d \,|\Sigma|}}\; e^{-\frac{1}{2}(x - r_t)^T \Sigma^{-1} (x - r_t)}.$$

The parameter $\Sigma$ is estimated from held-out data. This kernel then yields the probability of each concept being present in $I$ as

$$p(c \mid I) \propto \sum_{J \in \mathcal{J}} \hat{P}(c \mid J)\, k(I, J) = \sum_{J \in \mathcal{J}} \hat{P}(c \mid J) \left[ \prod_{t=1}^{T} k(x_t \mid J) \right],$$

where $\hat{P}$ is computed from the caption of $J$, and $\mathcal{J}$ denotes the training collection. Note that the sum need only run over the training images which assign a nonzero $\hat{P}$-probability to the concept $c$. But apart from this respite, each test image needs to be compared, in principle, with every training image during annotation.

While, upon first glance, the continuous relevance model seems to differ fundamentally from the HMM paradigm proposed here, deeper analysis reveals remarkable similarities. Each concept $c \in V$ is modeled in the HMM by a mixture of Gaussians. The maximization of the joint likelihood (1) results in the emission probabilities $f(x \mid c)$ in the HMM being estimated by a fractional assignment of each $r_t \in J$ to one of the states $y_n$; i.e. image regions $r_t$ from all images labeled with $c$ contribute to the estimate of the probability density function (pdf) $f(x \mid c)$. Therefore, when the likelihood of a region $x$ in a test image is computed, concepts whose pdf's were estimated from similar-looking vectors $r_t$ will have high a posteriori probability (6).
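A sketch of the CRM kernel above; the diagonal form of $\Sigma$ is our simplification for illustration:

```python
import numpy as np

def crm_kernel(x, R, sigma_diag):
    """k(x | J): Gaussian kernel density centred at the region
    features r_1..r_T of one training image J.

    x          : (d,) test-region feature vector.
    R          : (T, d) region features of J.
    sigma_diag : (d,) diagonal of the shared covariance Sigma,
                 estimated from held-out data in [7, 8]."""
    d = x.shape[0]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(sigma_diag)))
    sq = -0.5 * np.sum((x - R) ** 2 / sigma_diag, axis=1)  # (T,)
    return np.exp(log_norm + sq).mean()  # average over the r_t
```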

In particular, if each state $c \in V$ has exactly as many Gaussian pdf's in the mixture as the number of images containing $c$ in their caption, if the mixture weights $w_m$ are uniform, if each Gaussian is modeled by a mean $\mu_{m,c} = r_t$, and if the variance $\Sigma$ is held constant, there is a remarkable similarity between the probability calculations of the HMM and the continuous relevance model. As may be anticipated, the annotation performance of the HMM and the relevance model are comparable, as will be shown in Section 4.

In a nutshell, the HMM performs an abstraction of the information presented in the paired image+caption training data. For each concept $c \in V$, only the sufficient statistics (mean and covariance) of the image features are retained. By contrast, the relevance model retains the entire training corpus for comparison with the test image. Consequently, the HMM performs the test annotations very efficiently, since each image-feature vector needs to be compared only with the Gaussian-mixture representation of each concept, not with each training image. This results in tremendous computational speed-ups compared to the relevance model. Our approach is also much more scalable as a consequence. In our current experiments (see Sec. 4), the average time required to annotate an image with all the concepts is less than 2 seconds. Also, the annotation is readily parallelized on multiple CPUs, leading to further speed-ups.

3. HMM DESIGN AND TRAINING ISSUES

The aforementioned comparison with the relevance model ought to convince the reader that the HMM paradigm permits several design choices which need to be explored. We briefly describe some of the major issues in this section.

Fixed vs. Object-Based Image Segmentation: Before we begin, note that we partition each image into a fixed number of rectangular blocks. One problem with rectangular partitioning is that a rectangular block might contain pixels belonging to different objects (concepts), whereas our statistical model assumes that it is generated by the state corresponding to a single object (concept). However, it was shown in [5] that models trained on rectangular partitions outperform those trained on object-based partitions. The decision to work with fixed blocks is also driven by pragmatic concerns: we were kindly provided the processed training and test data from the 2004 Johns Hopkins University Summer Workshop [6], where considerable effort had been made in extracting image-features for the entire COREL and TRECVID datasets, and since state-of-the-art image annotation and retrieval performance has been published on these datasets, this provides a convenient venue for comparisons with related techniques.

Visual Feature Vectors: The workshop team extracted color, texture and oriented-edge features of 30 dimensions (COREL) or 76 dimensions (TRECVID) from a rectangular partition of the images, and this is what we have used in our experiments. In the case of the COREL images, there are 4×6 blocks per image (T = 24), and in the case of the TRECVID collection, 5×7 (T = 35). The block size, we are told, is an ad hoc balance between two conflicting criteria: one would like each block to be sufficiently small to contain only one of the concepts with which the image is annotated, and at the same time one would like it to be as big as needed to cover the entire object, so that features $x_t$ that characterize the object may be easily localized.

[Figure 2: Illustration of Unlabeled Objects in Images: this image was labeled with sand, beach, trees and people, but not sky.]
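A minimal sketch of the fixed-grid partitioning, with mean color as a stand-in feature (the actual 30- and 76-dimensional features from [6] also include texture and oriented-edge statistics):

```python
import numpy as np

def fixed_blocks(image, rows=4, cols=6):
    """Partition an (H, W, 3) image into a rows x cols grid
    (4 x 6 for COREL, 5 x 7 for TRECVID) and return one crude
    feature-vector per block: here, just the mean color."""
    H, W = image.shape[:2]
    feats = []
    for i in range(rows):
        for j in range(cols):
            block = image[i * H // rows:(i + 1) * H // rows,
                          j * W // cols:(j + 1) * W // cols]
            feats.append(block.reshape(-1, image.shape[2]).mean(axis=0))
    return np.stack(feats)  # (rows * cols, d), i.e. T = 24 or 35 rows
```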
HMM Topology and Permitted Transitions: We begin the HMM-design discussion by noting that, given a set of image+caption pairs, we construct, for each image, an HMM with as many states as the number of words in its caption. The states of an HMM are fully connected, permitting any state to follow any other state. This corresponds to permitting any concept to occur in the image next to any other concept. One could, of course, investigate exploiting the spatial propensities of concepts such as sky and grass to appear in certain positions $t$ in the image, and the propensity of certain concepts such as tiger and sky (not) to appear in adjacent blocks. But we have not done so in this preliminary investigation. Every state has an equal likelihood of being reached at every position $t$.

Shared-State HMMs: A caption-word $c \in V$ appearing in two different images, of course, must be modeled by the same state. Therefore the HMM for each image shares states from a common pool of $|V|$ states. In HTK parlance, the states of the HMM of each training image are tied to the corresponding states of other HMMs. Note that $|V| = 375$ for the COREL dataset, and $|V| = 75$ for TRECVID, as provided to us by [6].

Accounting for Unlabeled Objects: An additional concept, which we call null, has been introduced into the vocabulary and is added to the caption of each image. The primary motivation for doing so was our observation that in several images, the annotators did not label some prominent concept in spite of having a word in $V$ corresponding to it. Figure 2 depicts an image in which the sky occupies more than a third of the image, and yet the image is annotated with sand, beach, trees and people, but not sky. The null state is akin to the silence model used in automatic speech recognition. In both cases, since the stochastic model of the image or speech is a generative model, the entire image or speech sequence needs to be accounted for by the model, and adding an optional silence or null to the transcripts or captions results in better modeling of the data.²

²Unlike the speech recognition application, however, we did not include the null state in the calculation of (6), since a null has no direct benefit for image retrieval.

3.1 HMM Training Procedure

Once the visual features of the images are extracted and the HMM for each training image defined, estimation of the HMM parameters proceeds in a standard manner as prescribed in [12]. In particular, we first use a single Gaussian pdf for each state $c \in V$. Each Gaussian pdf is initialized with the common mean and variance computed from all the images in the training data. Several iterations of the EM algorithm are carried out, updating the means and variances of the HMMs. We note that the transition probabilities are not updated in the experiments reported below; however, they have a negligible effect on the likelihood and therefore, hopefully, on the eventual annotation performance.

Mixture Splitting: Once the single Gaussian pdf's have been estimated, the pdf for each state $c \in V$ is replaced by a mixture of a pair of Gaussian pdf's obtained by minor perturbation. Further iterations of the EM algorithm are then carried out to (re)estimate the Gaussian mixture pdf's. This procedure is standard in training speech recognition systems, and the details are described in [12]. Note that if the image features $x_t$ that align with any state are very well modeled by a mixture of a certain number of Gaussian pdf's, i.e. there is no gain in likelihood from increasing the complexity of the pdf, then one may choose not to mix up the pdf of that state. Such a model-based stopping criterion has been implemented in the HTK tools, and the experimental results reported below are for this default setting.

Setting the Variance Floor: One subtle consideration in estimating Gaussian pdf's is the estimate of the variance. If a large number of mixture components are used, the amount of data used to estimate each component may be small, and the corresponding estimate of the variance may be too low. Besides being inappropriate as a statistical model, this gives rise to numerical problems (divide-by-zero) during the probability calculations. It is standard to impose a variance floor, a minimum value for the estimate of the variance. Typically, this is set to be a fraction of the total empirical variance of all the image data.

Maximum Mixture Size: As long as the data-likelihood increases with the number of mixture components in $f(x \mid c)$, we continue to increase the mixture size of the output pdf's of each state, until a certain pre-specified maximum mixture size is attained.

3.2 Image Annotation using the HMM

Once the output pdf's $f(x \mid c)$ of all the states $c \in V$ have been estimated from the annotated image+caption pairs, we proceed to construct the so-called decoding HMM. This is simply a fully connected $|V|$-state HMM, one state corresponding to each concept in the vocabulary. Transition probabilities of the HMM are set to be uniform. Given a test image, we run the Baum-Welch algorithm, which computes the posterior probability (6). Once this is done for each image in the test collection, we are ready for image retrieval experiments.

Note that the transition probabilities of the decoding HMM need not be uniform. A simple variation is to compute the co-occurrence statistics of the words in the concept vocabulary, and use those frequencies to adjust the transition probabilities. For instance, plane and sky tend to co-occur much more often than plane and water, suggesting that p(sky | plane) could be set higher than p(water | plane). In a contrastive experiment, we set these probabilities to be proportional to the corresponding relative frequencies, with adequate smoothing to avoid zero probabilities. This setting of transition probabilities is akin to the language model in speech recognition, and an experiment with this setting is so named in the results below.
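A sketch of the co-occurrence-based transition estimate just described; the add-constant smoothing is our illustrative choice, the text says only "adequate smoothing":

```python
import numpy as np

def cooccurrence_transitions(captions, vocab, smooth=1.0):
    """Estimate decoding-HMM transition probabilities p(c' | c) from
    caption co-occurrence counts, smoothed to avoid zeros.

    captions : iterable of sets of caption-words; vocab : list of words."""
    V = len(vocab)
    idx = {c: i for i, c in enumerate(vocab)}
    counts = np.full((V, V), smooth)
    for cap in captions:
        ids = [idx[c] for c in cap]
        for a in ids:
            for b in ids:
                if a != b:
                    counts[a, b] += 1.0  # c' = b co-occurs with c = a
    return counts / counts.sum(axis=1, keepdims=True)  # rows sum to 1
```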
4. EXPERIMENTAL RESULTS

We present image retrieval results on two collections as described below, one of still images and another of key-frames extracted from video.

The COREL Image Dataset consists of 5000 images from 50 Corel Stock Photo CDs provided to us by [4]. Each CD contains 100 images with a certain theme (e.g. polar bears), of which 90 are designated to be in the training set and 10 in the test set, resulting in 4500 training images and a balanced 500-image test collection. We note that this collection and training/test division is also that used by [7] and [5]. As stated earlier, each image is divided into 24 rectangular regions, and the 30-dimensional image features extracted from them were provided to us by [6]. The vocabulary has $|V| = 375$ words.

The TRECVID Dataset consists of key-frames extracted from Broadcast News video from several sources, such as ABC News and C-SPAN. A community-wide effort has resulted in 44,100 images from the video portion of the 2002 TREC SDR corpus being annotated with a set of 137 concepts arranged in a hierarchical tree. This collection has been divided into four parts by [1], named concept-train, concept-validate, concept-fusion-1 and concept-fusion-2. Each key-frame has been divided into 35 rectangles, and 76-dimensional visual feature-vectors extracted for each image-block by researchers at the 2004 Johns Hopkins Summer Workshop [6]. The workshop team also culled the concept vocabulary to remove annotations with very low counts, etc., so that the resulting collection of 38K images has captions covered by a $|V| = 75$ word vocabulary. Of the 44K manually labeled images, roughly 35K are used to train the HMMs, and about 9K are held out for image retrieval experiments. We again acknowledge the generosity of the workshop team in letting us use their training and test data in our experiments.

The TRECVID-2003 Dataset consists of about 32K key-frames, again from Broadcast News sources [10], and was used for multiple tasks in the 2003 benchmark tests conducted by the U.S. National Institute of Standards and Technology. One of the tasks, named feature detection, corresponds to our image annotation task. Participants were asked to detect 17 different features (concepts) in each shot of the video. Relevance judgements were created post hoc by pooling the retrieved shots of participating systems in the usual manner. Of these 17 concepts, 6 were of a sort that we are unable to cover using the generic concept annotation we trained our HMMs for, e.g. Madeleine Albright and Physical Violence. Of the remaining 11 concepts, some, such as Buildings, correspond directly to a concept in our vocabulary, making it easy for us to retrieve images for them. Finally, some concepts, such as Sporting Events, may be synthesized by a union of concepts in our vocabulary, e.g. basketball, hockey, football, baseball, etc. We therefore present retrieval results for this subset of 11 concepts (or features).

Evaluating Image Annotation: Since our goal is to evaluate image annotation performance in the service of image retrieval, we follow [7] and construct single-word queries for each word in our concept vocabulary $V$. Thus there are 375 queries for the COREL dataset (of which 260 have at least one relevant image in the test collection) and 75 for the TRECVID dataset. We retrieve images from the appropriate manually-labeled test sets (500 images for COREL, 9000 for TRECVID). All test images whose caption contains the query-word are deemed relevant to that query, and all other test images irrelevant. Standard mean average precision (mAP) is calculated over all queries using the standard (trec_eval) protocols, and precision-recall curves are plotted where appropriate.
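For concreteness, the uninterpolated average precision of a single one-word query can be computed as follows; trec_eval reports the same quantity, the sketch itself is ours:

```python
def average_precision(ranked_ids, relevant):
    """AP for one query: test images sorted by decreasing p(c in C_I);
    `relevant` is the set of test images whose caption contains the
    query-word. mAP is the mean of this over all queries."""
    hits, ap = 0, 0.0
    for rank, img in enumerate(ranked_ids, start=1):
        if img in relevant:
            hits += 1
            ap += hits / rank
    return ap / max(len(relevant), 1)
```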

[Table 1: Effect of Mixture-Size on Mean Average Precision for 260 One-Word Queries on COREL, at a fixed variance floor; numeric values not recovered.]

[Table 2: Effect of the Variance-Floor on Mean Average Precision for 260 One-Word Queries on COREL; number of mixture components set to 10; numeric values not recovered.]

[Table 3: Effect of Mixture-Size on Mean Average Precision for 75 One-Word Queries on TRECVID, at a fixed variance floor; numeric values not recovered.]

4.1 Annotation Performance: COREL

Tables 1 and 2 show the mean average precision of the HMM-based image annotation system on the 500 COREL images. Two primary design parameters are investigated: the maximum number of Gaussian mixture components in $f(x \mid c)$, and the variance floor set during re-estimation of the HMM parameters. All experiments use a word co-occurrence based language model during decoding. It is clear from the two tables that performance improves up to a certain point as more and more complex models are built by adding mixture components, and then begins to level off. It is also clear that a very conservative variance floor (1% of the global variance of all image blocks) hurts performance, and that individual concepts have sufficiently sharp densities that a variance floor of 0.1% is significantly better. The best performance is a mAP of […]. We note for the sake of comparison that on the 260-word query set, the normalized CRM of [5] has a mAP of 0.26, which is significantly higher than that of the HMM, while the MT-based model of [4] on the same training and test sets has a mAP of 0.15 [6].

4.2 Annotation Performance: TRECVID

The annotation performance of the HMM as a function of an increasing number of mixture components on the TRECVID dataset is shown in Table 3. It confirms the earlier conclusion, based on the COREL data, that up to a certain point, increasing model complexity improves performance. Note that the TRECVID training set is significantly larger than COREL's, and hence performance continues to increase even up to 20-Gaussian mixture densities. The improvement in mAP from 10 to 20 mixture components is significant at a p-value of less than […], while that from 16 to 20 is significant at […].

[Table 4: Effect of the Variance-Floor on Mean Average Precision for 75 One-Word Queries on TRECVID; number of mixture components set to 20; numeric values not recovered.]

Before proceeding to increase the number of mixture components beyond 20, it is fair to contemplate whether a concept which has been attested, say, 20 times in the training images should be modeled by an HMM state with a mixture of more than 20 Gaussian pdf's. Of course, each image has up to 35 feature vectors and a concept may span more than one image block. In order to establish where this trade-off may lie, we undertook a controlled mixture-growth experiment. With a variance floor of 0.001, we estimated 4 different HMMs. The output pdf $f(x \mid c)$ for concept $c$ was:
1. mixed up to a maximum of 20 pdf's, no matter how many tokens of $c$ were seen in the 28K image+caption pairs,
2. mixed up to a maximum of 1 pdf per 3 tokens of $c$,
3. mixed up to a maximum of 1 pdf per 5 tokens of $c$,
4. mixed up to a maximum of 1 pdf per 15 tokens of $c$.

Note that the very frequent concepts are not affected by this, since many of them have more than 300 tokens. It is only the infrequent concepts whose maximum mixture size is progressively reduced in the successive experiments. Table 5 presents the results for the increasingly conservative model-growth strategies.

[Table 5: Annotation Performance for Progressively Conservative Limits on the Maximum Number of Mixture Components for each Concept (max mixture size = 20, limited by 1 pdf per 15, 5 or 3 tokens, or no limit); mAP values not recovered.]

[Table 6: Annotation Performance with up to 100 Gaussian pdf's for each Concept (limited by 1 pdf per 15, 5 or 3 tokens, or no limit); mAP values not recovered.]

The reader should be convinced that the TRECVID dataset is large enough that one need not take a conservative view of model size in the range in which we are operating. Therefore, we went ahead and estimated an HMM system with up to 100 Gaussian pdf's per concept (state), limited of course by the likelihood-gain based automatic limits implemented in HTK. The annotation performance of the final system is presented in Table 6. We plot the 11-point precision-recall curve for the 100-Gaussian HMM system in Figure 3.
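The growth-limiting rule behind these four systems amounts to a per-concept cap; a one-line paraphrase of the HTK-side configuration we have described, not an excerpt from it:

```python
def max_mixture_size(token_count, cap=20, tokens_per_pdf=None):
    """At most `cap` Gaussians per concept, optionally further limited
    to one pdf per `tokens_per_pdf` training tokens of the concept
    (tokens_per_pdf = 15, 5, 3 or None in the experiments above)."""
    if tokens_per_pdf is None:  # the 'no limit' setting
        return cap
    return max(1, min(cap, token_count // tokens_per_pdf))
```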

[Figure 3: Precision-Recall Curve for 75 One-Word Queries on the TRECVID Dataset.]

The mAP of […] for the 100-Gaussian system is significantly better than the mAP of the 20-Gaussian system (p-value = […]). We note from [6] that the MT-based system has a mean average precision of 0.13 and the normalized CRM has 0.16 for identical training and test partitions and feature extraction. The MT-based system discretizes the $x_t$'s, while the CRM uses the 76-dimensional features. We believe that the HMM-based system is significantly better than either of them.³ The HMM-based approach is therefore competitive with the best of the available techniques.

³The authors of [5] state that the CRM attains a mean average precision of 0.18 on this task using 30-dimensional features instead of 76.

Figure 4 illustrates some of the more successful queries out of the 75, showing the top 5 retrieved images and the precision-recall curves for these queries. As the reader may notice, the images have several visually striking features, which the HMM has apparently managed to model even with such a primitive model. Note that one of the queries shown, weather news, happens to be among the features prescribed in the TRECVID-2003 feature-detection task, as described next.

[Figure 4: Top 5 Retrieved Images for Some Concepts from the TRECVID Dataset.]

4.3 Detection Performance: TRECVID-2003

We mapped 11 of the 17 feature-detection queries from TRECVID-2003 onto one or more concepts in our 75-word vocabulary, and performed image retrieval using the HMM developed on the TRECVID dataset. Some features, such as sporting events, were obtained by the union of concepts such as hockey, baseball, etc., while others, such as weather, had direct analogues in $V$. The average precision for the 11 concepts is reported in Table 7 and Figure 5. Also reported in Table 7 is the average precision of an almost comparable stage (BOU) from [1]. Note that BOU represents the best of three feature detectors, only one of which is comparable to our HMM-based system. Since the overall system submitted by [1] was the best-performing feature-detection system in TRECVID-2003, we feel confident in claiming that the HMM-based methods proposed here are competitive with the state of the art.

[Table 7: Feature Detection Performance on the TRECVID-2003 Feature-Detection Task. Rows: Outdoors, News subject face, People, Building, Road, Vegetation, Animal, Female speech, Car/truck/bus, Aircraft, News subject monologue, Non-studio setting, Sporting event, Weather news, Zoom in, Physical violence, Madeleine Albright; columns: HMM and IBM average precision, with mAP; numeric values not recovered.]

[Figure 5: Precision-Recall Curve for 11 Features on TRECVID-2003.]
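One way to score the "union of concepts" mapping is a noisy-OR over the per-concept presence probabilities; the independence assumption here is ours, since the paper does not specify how the union was combined:

```python
def feature_score(posteriors, concept_ids):
    """Probability that at least one constituent concept is present,
    e.g. Sporting event from {basketball, hockey, football, baseball}.

    posteriors  : dict or array mapping concept id -> p(c present in I).
    concept_ids : the vocabulary concepts mapped onto this feature."""
    miss = 1.0
    for c in concept_ids:
        miss *= 1.0 - posteriors[c]  # all constituents absent
    return 1.0 - miss
```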

5. DISCUSSION AND CONCLUSIONS

We have presented a novel method for image annotation for the purpose of content-based image retrieval, and evaluated it in several ways to establish that it is indeed competitive with or better than the state of the art. The system was implemented with publicly available tools and tested on standard datasets.

Much more is known in the speech recognition literature about training and decoding using hidden Markov models, including context-dependent modeling, unsupervised adaptation, adaptive and discriminative training, and minimum-error decoding, all of which remains to be applied to the image annotation and retrieval problem. Each of these techniques has resulted in significant improvements in recognition accuracy in speech recognition, and indeed they constitute the bulk of the progress in speech recognition research in the last decade. Adaptation, for instance, adjusts the models to overall variations from image to image; discriminative training replaces the maximization of the likelihood (1) with maximization of the ability to discriminate concepts; and context-dependent modeling accounts for the influence of adjacent objects on the visual features of an object. We expect that significant improvements in image annotation and retrieval performance will be obtained by intelligently importing these ideas from speech recognition.

6. ACKNOWLEDGMENTS

The authors gratefully acknowledge the considerable assistance, particularly the standardized image datasets and visual features, provided by the Johns Hopkins Summer Workshop team in conducting these experiments. This research was partially supported by the U.S. National Science Foundation via Grant No. IIS […], the U.S. Department of Defense via Contract No. MDA […]C-1005, and the Czech Academy of Sciences via Grant No. 1ET […].

7. REFERENCES

[1] A. Amir et al. IBM Research TRECVID-2003 Video Retrieval System. In Proc. TRECVID 2003, November 2003.
[2] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. M. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107-1135, 2003.
[3] D. M. Blei and M. I. Jordan. Modeling Annotated Data. In 26th Annual International ACM SIGIR Conference, pages 127-134, 2003.
[4] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth. Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary. In Seventh European Conference on Computer Vision, volume 4, pages 97-112, 2002.
[5] S. L. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages II-1002 - II-1009, 2004.
[6] G. Iyengar et al. Joint Visual-Text Modeling for Multimedia Retrieval. Available at: […]
[7] J. Jeon, V. Lavrenko, and R. Manmatha. Automatic Image Annotation and Retrieval using Cross-Media Relevance Models. In 26th Annual International ACM SIGIR Conference, pages 119-126, 2003.
[8] V. Lavrenko, S. L. Feng, and R. Manmatha. Statistical models for automatic video annotation and retrieval. In Proc. IEEE International Conf. on Acoustics, Speech and Signal Processing, volume 3, pages 17-21, May 2004.
[9] J. Li and J. Z. Wang. Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(9):1075-1088, 2003.
[10] NIST. Proceedings of the TREC Video Retrieval Evaluation Conference (TRECVID 2003), November 2003.
[11] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257-286, 1989.
[12] S. Young et al. The HTK Book. 2002.
[13] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages I-511 - I-518, December 2001.


More information

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection

Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection Petr Somol 1,2, Jana Novovičová 1,2, and Pavel Pudil 2,1 1 Dept. of Pattern Recognition, Institute of Information Theory and

More information

A Miniature-Based Image Retrieval System

A Miniature-Based Image Retrieval System A Miniature-Based Image Retrieval System Md. Saiful Islam 1 and Md. Haider Ali 2 Institute of Information Technology 1, Dept. of Computer Science and Engineering 2, University of Dhaka 1, 2, Dhaka-1000,

More information

Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV

Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV Gender-dependent acoustic models fusion developed for automatic subtitling of Parliament meetings broadcasted by the Czech TV Jan Vaněk and Josef V. Psutka Department of Cybernetics, West Bohemia University,

More information

I How does the formulation (5) serve the purpose of the composite parameterization

I How does the formulation (5) serve the purpose of the composite parameterization Supplemental Material to Identifying Alzheimer s Disease-Related Brain Regions from Multi-Modality Neuroimaging Data using Sparse Composite Linear Discrimination Analysis I How does the formulation (5)

More information

Time series, HMMs, Kalman Filters

Time series, HMMs, Kalman Filters Classic HMM tutorial see class website: *L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol.77, No.2, pp.257--286, 1989. Time series,

More information

An Approach for Reduction of Rain Streaks from a Single Image

An Approach for Reduction of Rain Streaks from a Single Image An Approach for Reduction of Rain Streaks from a Single Image Vijayakumar Majjagi 1, Netravati U M 2 1 4 th Semester, M. Tech, Digital Electronics, Department of Electronics and Communication G M Institute

More information

Textural Features for Image Database Retrieval

Textural Features for Image Database Retrieval Textural Features for Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington Seattle, WA 98195-2500 {aksoy,haralick}@@isl.ee.washington.edu

More information

Efficient Acquisition of Human Existence Priors from Motion Trajectories

Efficient Acquisition of Human Existence Priors from Motion Trajectories Efficient Acquisition of Human Existence Priors from Motion Trajectories Hitoshi Habe Hidehito Nakagawa Masatsugu Kidode Graduate School of Information Science, Nara Institute of Science and Technology

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition Pattern Recognition Kjell Elenius Speech, Music and Hearing KTH March 29, 2007 Speech recognition 2007 1 Ch 4. Pattern Recognition 1(3) Bayes Decision Theory Minimum-Error-Rate Decision Rules Discriminant

More information

CHAPTER 6 QUANTITATIVE PERFORMANCE ANALYSIS OF THE PROPOSED COLOR TEXTURE SEGMENTATION ALGORITHMS

CHAPTER 6 QUANTITATIVE PERFORMANCE ANALYSIS OF THE PROPOSED COLOR TEXTURE SEGMENTATION ALGORITHMS 145 CHAPTER 6 QUANTITATIVE PERFORMANCE ANALYSIS OF THE PROPOSED COLOR TEXTURE SEGMENTATION ALGORITHMS 6.1 INTRODUCTION This chapter analyzes the performance of the three proposed colortexture segmentation

More information

Clustering Documents in Large Text Corpora

Clustering Documents in Large Text Corpora Clustering Documents in Large Text Corpora Bin He Faculty of Computer Science Dalhousie University Halifax, Canada B3H 1W5 bhe@cs.dal.ca http://www.cs.dal.ca/ bhe Yongzheng Zhang Faculty of Computer Science

More information

Introduction to SLAM Part II. Paul Robertson

Introduction to SLAM Part II. Paul Robertson Introduction to SLAM Part II Paul Robertson Localization Review Tracking, Global Localization, Kidnapping Problem. Kalman Filter Quadratic Linear (unless EKF) SLAM Loop closing Scaling: Partition space

More information

An Introduction to Pattern Recognition

An Introduction to Pattern Recognition An Introduction to Pattern Recognition Speaker : Wei lun Chao Advisor : Prof. Jian-jiun Ding DISP Lab Graduate Institute of Communication Engineering 1 Abstract Not a new research field Wide range included

More information

Formulating Semantic Image Annotation as a Supervised Learning Problem

Formulating Semantic Image Annotation as a Supervised Learning Problem Appears in IEEE Conf. on Computer Vision and Pattern Recognition, San Diego, 2005. Formulating Semantic Image Annotation as a Supervised Learning Problem Gustavo Carneiro, Department of Computer Science

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach Outline Objective Approach Experiment Conclusion and Future work Objective Automatically establish linguistic indexing of pictures

More information

PSU Student Research Symposium 2017 Bayesian Optimization for Refining Object Proposals, with an Application to Pedestrian Detection Anthony D.

PSU Student Research Symposium 2017 Bayesian Optimization for Refining Object Proposals, with an Application to Pedestrian Detection Anthony D. PSU Student Research Symposium 2017 Bayesian Optimization for Refining Object Proposals, with an Application to Pedestrian Detection Anthony D. Rhodes 5/10/17 What is Machine Learning? Machine learning

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Clustering and EM Barnabás Póczos & Aarti Singh Contents Clustering K-means Mixture of Gaussians Expectation Maximization Variational Methods 2 Clustering 3 K-

More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

Browsing News and TAlk Video on a Consumer Electronics Platform Using face Detection

Browsing News and TAlk Video on a Consumer Electronics Platform Using face Detection MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Browsing News and TAlk Video on a Consumer Electronics Platform Using face Detection Kadir A. Peker, Ajay Divakaran, Tom Lanning TR2005-155

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

ImgSeek: Capturing User s Intent For Internet Image Search

ImgSeek: Capturing User s Intent For Internet Image Search ImgSeek: Capturing User s Intent For Internet Image Search Abstract - Internet image search engines (e.g. Bing Image Search) frequently lean on adjacent text features. It is difficult for them to illustrate

More information

ORT EP R RCH A ESE R P A IDI! " #$$% &' (# $!"

ORT EP R RCH A ESE R P A IDI!  #$$% &' (# $! R E S E A R C H R E P O R T IDIAP A Parallel Mixture of SVMs for Very Large Scale Problems Ronan Collobert a b Yoshua Bengio b IDIAP RR 01-12 April 26, 2002 Samy Bengio a published in Neural Computation,

More information

CS 231A Computer Vision (Fall 2011) Problem Set 4

CS 231A Computer Vision (Fall 2011) Problem Set 4 CS 231A Computer Vision (Fall 2011) Problem Set 4 Due: Nov. 30 th, 2011 (9:30am) 1 Part-based models for Object Recognition (50 points) One approach to object recognition is to use a deformable part-based

More information

CSE 586 Final Programming Project Spring 2011 Due date: Tuesday, May 3

CSE 586 Final Programming Project Spring 2011 Due date: Tuesday, May 3 CSE 586 Final Programming Project Spring 2011 Due date: Tuesday, May 3 What I have in mind for our last programming project is to do something with either graphical models or random sampling. A few ideas

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 5 Inference

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

Bipartite Graph Partitioning and Content-based Image Clustering

Bipartite Graph Partitioning and Content-based Image Clustering Bipartite Graph Partitioning and Content-based Image Clustering Guoping Qiu School of Computer Science The University of Nottingham qiu @ cs.nott.ac.uk Abstract This paper presents a method to model the

More information

Eye Detection by Haar wavelets and cascaded Support Vector Machine

Eye Detection by Haar wavelets and cascaded Support Vector Machine Eye Detection by Haar wavelets and cascaded Support Vector Machine Vishal Agrawal B.Tech 4th Year Guide: Simant Dubey / Amitabha Mukherjee Dept of Computer Science and Engineering IIT Kanpur - 208 016

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

Lecture 21 : A Hybrid: Deep Learning and Graphical Models

Lecture 21 : A Hybrid: Deep Learning and Graphical Models 10-708: Probabilistic Graphical Models, Spring 2018 Lecture 21 : A Hybrid: Deep Learning and Graphical Models Lecturer: Kayhan Batmanghelich Scribes: Paul Liang, Anirudha Rayasam 1 Introduction and Motivation

More information

Computationally Efficient M-Estimation of Log-Linear Structure Models

Computationally Efficient M-Estimation of Log-Linear Structure Models Computationally Efficient M-Estimation of Log-Linear Structure Models Noah Smith, Doug Vail, and John Lafferty School of Computer Science Carnegie Mellon University {nasmith,dvail2,lafferty}@cs.cmu.edu

More information

A Model Selection Criterion for Classification: Application to HMM Topology Optimization

A Model Selection Criterion for Classification: Application to HMM Topology Optimization A Model Selection Criterion for Classification Application to HMM Topology Optimization Alain Biem IBM T. J. Watson Research Center P.O Box 218, Yorktown Heights, NY 10549, USA biem@us.ibm.com Abstract

More information

TWO-STEP SEMI-SUPERVISED APPROACH FOR MUSIC STRUCTURAL CLASSIFICATION. Prateek Verma, Yang-Kai Lin, Li-Fan Yu. Stanford University

TWO-STEP SEMI-SUPERVISED APPROACH FOR MUSIC STRUCTURAL CLASSIFICATION. Prateek Verma, Yang-Kai Lin, Li-Fan Yu. Stanford University TWO-STEP SEMI-SUPERVISED APPROACH FOR MUSIC STRUCTURAL CLASSIFICATION Prateek Verma, Yang-Kai Lin, Li-Fan Yu Stanford University ABSTRACT Structural segmentation involves finding hoogeneous sections appearing

More information

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University NIPS 2008: E. Sudderth & M. Jordan, Shared Segmentation of Natural

More information

Patch-Based Image Classification Using Image Epitomes

Patch-Based Image Classification Using Image Epitomes Patch-Based Image Classification Using Image Epitomes David Andrzejewski CS 766 - Final Project December 19, 2005 Abstract Automatic image classification has many practical applications, including photo

More information

IMAGE SEGMENTATION. Václav Hlaváč

IMAGE SEGMENTATION. Václav Hlaváč IMAGE SEGMENTATION Václav Hlaváč Czech Technical University in Prague Faculty of Electrical Engineering, Department of Cybernetics Center for Machine Perception http://cmp.felk.cvut.cz/ hlavac, hlavac@fel.cvut.cz

More information

Cluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008

Cluster Analysis. Jia Li Department of Statistics Penn State University. Summer School in Statistics for Astronomers IV June 9-14, 2008 Cluster Analysis Jia Li Department of Statistics Penn State University Summer School in Statistics for Astronomers IV June 9-1, 8 1 Clustering A basic tool in data mining/pattern recognition: Divide a

More information

Implementation Techniques

Implementation Techniques V Implementation Techniques 34 Efficient Evaluation of the Valid-Time Natural Join 35 Efficient Differential Timeslice Computation 36 R-Tree Based Indexing of Now-Relative Bitemporal Data 37 Light-Weight

More information

Joint Visual-Text Modeling for Automatic Retrieval of Multimedia Documents

Joint Visual-Text Modeling for Automatic Retrieval of Multimedia Documents Joint Visual-Text Modeling for Automatic Retrieval of Multimedia Documents G. Iyengar IBM TJ Watson Research Center giyengar@us.ibm.com P. Ircing Univ. West Bohemia ircing@kky.zcu.cz M. R. Krause Georgetown

More information