Using the forest to see the trees: exploiting context for visual object detection and localization


A. Torralba
Massachusetts Institute of Technology, 32 Vassar St., Cambridge, MA, USA

K. P. Murphy
Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada

W. T. Freeman
Massachusetts Institute of Technology, 32 Vassar St., Cambridge, MA, USA

ABSTRACT

Recognizing objects in images is an active area of research in computer vision. In the last two decades there has been much progress, and object recognition systems are already operating in commercial products. However, most of the algorithms for detecting objects perform an exhaustive search across all locations and scales in the image, comparing local image regions with an object model. That approach ignores the semantic structure of scenes and tries to solve the recognition problem by brute force. In the real world, objects tend to co-vary with other objects, providing a rich collection of contextual associations. These contextual associations can be used to reduce the search space by looking only in places where the object is expected to be; this also increases performance, by rejecting patterns that look like the target but appear in unlikely places. Most modeling attempts so far have defined the context of an object in terms of other previously recognized objects. The drawback of this approach is that inferring the context becomes as difficult as detecting each object. An alternative view of context relies on using the entire scene information holistically. This approach is algorithmically attractive since it dispenses with the need for a prior step of individual object recognition. In this paper we use a probabilistic framework for encoding the relationships between context and object properties, and we show how an integrated system provides improved performance. We view this as a significant step towards general purpose machine vision systems.

(An early version of this paper, entitled "Using the forest to see the trees: a graphical model relating features, objects and scenes", was published in Neural Information Processing Systems, 2003, MIT Press; see Ref. [9].)

1. INTRODUCTION

Visual object detection, such as finding cars and people in images, is an important but challenging task. It is important because of its inherent scientific interest (understanding how to make machines see may shed light on biological vision), and because it is useful for many applications, such as content-based image retrieval, robotics, etc. It is challenging because the appearance of objects can vary a lot from instance to instance, and from image to image, due to factors such as variation in pose, lighting, style, articulation, occlusion, low quality imaging, etc. Over the last two decades, much progress has been made in visual object detection using machine learning techniques. Most of these approaches rely on using supervised learning to train a classifier to distinguish between instances of the object class and the background.
The trained classifier is then applied to thousands of small overlapping patches or windows of each test image, and the locations of the high-confidence detections are returned. The features computed inside each patch are usually the outputs of standard image processing operations, such as a histogram of responses to Gabor filters at different scales and orientations. The classifiers themselves are standard supervised learning models such as SVMs, neural networks or boosted decision stumps [20]. This sliding window classifier technique has been quite successful in certain domains such as detecting cars, pedestrians and faces. Indeed, most contemporary digital cameras employ such a technique to detect faces, which they use to set the auto-focus. Also, some cars now come equipped with pedestrian detection systems based on similar principles.

One major problem with the standard approach is that even a relatively low false-positive rate per class can be unacceptable when there are many classes or categories. For example, if each detector generates about 1 false alarm every 10 images, and there are 1,000 classes, we will have 100 false alarms per image. An additional problem is that running every detector on every image can be slow. These are both fundamental obstacles to building a general purpose vision system.

One reason for the relatively high false alarm rate of standard approaches is that most object detection systems are myopic, in the sense that they only look at local features of the image. One possible remedy is to leverage global features of the image, and to use these to compute the prior probability that each object category is present, and if so, its likely location and scale. Previous work (e.g., [17]) has shown that simple global image features, known as the "gist" of the image, are sufficient to provide robust predictions about the presence and location of different object categories. Such features are fast to compute, and provide information that is useful for many classes and locations simultaneously.

In this paper, which is an extension of our previous work [17, 9, 8], we present a simple approach for combining standard sliding-window object detection systems, which use local, bottom-up image features, with systems that predict the presence and location of object categories based on global, or top-down, image features.
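To make the sliding-window technique described above concrete, the following Python/NumPy sketch scans a patch classifier over all locations and scales of an image. The function score_patch, the patch size, the stride and the scale pyramid are hypothetical placeholders, not the exact detector used in this paper.

```python
import numpy as np

def sliding_window_detect(image, score_patch, patch_h=30, patch_w=80,
                          stride=4, scales=(1.0, 0.8, 0.64), thresh=0.0):
    """Scan a classifier over all locations and scales of a gray image.

    score_patch: hypothetical function mapping a (patch_h x patch_w) patch
    to a real-valued confidence. Returns (score, x, y, scale) tuples for
    all windows scoring above thresh.
    """
    detections = []
    for s in scales:
        # Resize the image (not the window) so a fixed-size classifier
        # effectively searches over object sizes.
        h, w = int(image.shape[0] * s), int(image.shape[1] * s)
        ys = np.linspace(0, image.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, image.shape[1] - 1, w).astype(int)
        scaled = image[np.ix_(ys, xs)]  # crude nearest-neighbor resize
        for y in range(0, h - patch_h + 1, stride):
            for x in range(0, w - patch_w + 1, stride):
                c = score_patch(scaled[y:y + patch_h, x:x + patch_w])
                if c > thresh:
                    # Map the window back to original-image coordinates.
                    detections.append((c, x / s, y / s, s))
    return detections
```

Resizing the image rather than the window is the standard way to obtain a multiscale search with a fixed-size classifier.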

Figure 1: In the presence of image degradation (e.g., blur), object recognition is strongly influenced by contextual information. The visual system makes assumptions regarding object identities based on their size and location in the scene. In these images, the same black blob can be interpreted as a plate, bottle, cell phone, car, pedestrian or shoe, depending on the context. (Each circled blob has identical pixels, but in some cases has been rotated.)

These global features serve to define the context in which the object detection is happening. The importance of context is illustrated in Figure 1, which shows that the same black blob, when placed in different surroundings, can be interpreted as a plate or bottle on the table, a cell phone, a pedestrian or car, or even a shoe. Another example is shown in Figure 2: it is easy to infer that there is very probably a computer monitor behind the blacked-out region of the image.

We are not the first to point out the importance of context in computer vision. For example, Strat and Fischler emphasized its importance in their 1991 paper [16]. However, there are two key differences between our approach and previous work. First, in early work, such as [16], the systems consist of hand-engineered if-then rules, whereas more recent systems rely on statistical models that are fit to data. Second, most other approaches define the context in terms of other objects [6, 14, 18, 13]; but this introduces a chicken-and-egg problem: to detect an object of type 1 you first have to detect an object of type 2. By contrast, we propose a hierarchical approach, in which we define the context in terms of an overall scene category. This can be reliably inferred using global image features. Conditioned on the scene category, we assume that objects are independent. While not strictly true, this results in a simple yet effective approach, as we will show below.

Figure 2: What is hidden behind the mask? In this example, context is so strong that one can reliably infer that the hidden object is a computer monitor.

In the following sections, we describe the different components of our model. We will start by showing how we can represent contextual information without using objects as an intermediate representation. Then we will show how that representation can be integrated with an object detector.

2. GLOBAL IMAGE FEATURES: THE GIST OF AN IMAGE

In the same way that an object can be recognized without decomposing it into a set of nameable parts (e.g., the most successful face detectors do not try to detect the eyes and mouth first; instead they search for less semantically meaningful features), scenes can also be recognized without necessarily decomposing them into objects. The advantage of this is that it provides an additional source of information that can be used to provide contextual information for object recognition. As suggested in [1, 11], it is possible to build a global representation of the scene that bypasses object identities, in which the scene is represented as a single entity. Recent work in computer vision has highlighted the importance of global scene representations for scene recognition [11, 1, 7] and as a source of contextual information [17, 9, 3].
These representations are based on computing statistics of low-level features (similar to representations available in early visual areas, such as oriented edges, vector quantized image patches, etc.) over fixed image regions. One example of a global image representation is the gist descriptor [11]. The gist descriptor is a vector of features g, where each individual feature g_k is computed as

    g_k = Σ_{x,y} w_k(x,y) |I(x,y) ⊗ h_k(x,y)|^2     (1)

where ⊗ denotes image convolution and the weighting by w_k(x,y) is a pixel-wise multiplication. I(x,y) is the luminance channel of the input image, h_k(x,y) is a filter from a bank of multiscale oriented Gabor filters (6 orientations and 4 scales), and w_k(x,y) is a spatial window that computes the average output energy of each filter at different image locations. The windows w_k(x,y) divide the image into a 4 × 4 grid of non-overlapping windows. This results in a descriptor with a dimensionality of 6 × 4 × 16 = 384.
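As a rough illustration of Equation 1, the following Python sketch computes a gist-like descriptor with a small self-built Gabor bank. The filter frequencies, the kernel size and the use of a single cosine-phase filter per channel are simplifying assumptions, not the exact parameters of [11].

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(freq, theta, size=21):
    """A real-valued (cosine-phase) Gabor kernel at one scale/orientation."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    xr = x * np.cos(theta) + y * np.sin(theta)  # coordinate along the carrier
    envelope = np.exp(-(x ** 2 + y ** 2) * (freq ** 2) / 2.0)
    return envelope * np.cos(2 * np.pi * freq * xr)

def gist_descriptor(I, n_orient=6, n_scales=4, grid=4):
    """Eq. (1): average Gabor energy over a grid x grid partition.

    I is a 2-D luminance image; the result has length
    n_orient * n_scales * grid * grid (= 384 for the defaults).
    """
    feats = []
    for s in range(n_scales):
        freq = 0.25 / (2 ** s)  # assumed scale spacing: one octave apart
        for o in range(n_orient):
            h = gabor_kernel(freq, np.pi * o / n_orient)
            energy = fftconvolve(I, h, mode='same') ** 2
            # The windows w_k of Eq. (1): average energy per grid cell.
            H, W = energy.shape
            for gy in range(grid):
                for gx in range(grid):
                    cell = energy[gy * H // grid:(gy + 1) * H // grid,
                                  gx * W // grid:(gx + 1) * W // grid]
                    feats.append(cell.mean())
    return np.array(feats)
```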

Figure 3: This figure illustrates the information encoded by the gist features for three different images. See text for details.

Figure 3 illustrates the amount of information preserved by the gist descriptor. The middle column shows the average of the output magnitude of the multiscale oriented filters on a polar plot (note that the orientation of each plot is orthogonal to the direction of the edges in the image). The average response of each filter is computed locally by splitting the image into 4 × 4 windows. Each scale is color coded (red for high spatial frequencies, and blue for the low spatial frequencies), and the intensity is proportional to the energy of each filter output. In order to illustrate the amount of information preserved by this representation, the right column of Figure 3 shows noise images that are coerced to have the same gist features as the target image, using the texture synthesis method of [2]. As shown in Figure 3, the gist descriptor provides a coarse description of the textures present in the image and their spatial organization. The gist descriptor preserves relevant information needed for categorizing scenes into categories (e.g., classifying an image as a beach scene, a street or a living room). As reported in [12], when trying to discriminate across 15 different scene categories, the gist descriptor correctly classifies 75% of the images. Recognizing the scene depicted by a picture is an important task on its own, but in addition it can be used to provide strong contextual priors, as we will discuss in the next section.

3. JOINT SCENE CLASSIFICATION AND OBJECT DETECTION

In this section, we describe our approach in more detail. In Section 3.1, we briefly describe the standard approach to object detection and localization using local features. In Sections 3.2 and 3.3 we describe how to use global features for object presence detection and localization, respectively. In Section 3.4 we discuss how to integrate these local and global features. A comparison of the performance of local and global features is deferred until Section 4.

3.1 Object presence detection and localization using local features

In our previous paper [9], we considered detecting four different types or classes of objects: cars, people, keyboards and screens (computer monitors). In this paper, we will mostly focus on cars, for brevity. We use a subset of the LabelMe dataset [11, 15] for training and testing (details are in Section 4). There are two tasks that we want to address: object presence detection (where the goal is to predict if the object is present or absent in the image, i.e., to answer the question: is there any car in this image?) and object localization (where the goal is to precisely locate all the instances of an object class within each image). Solving the object presence detection task can be done even if the object localization is not accurate.

We can formalize the object presence detection and localization problem as follows. Let P_t = 1 if one or more objects of type t are present anywhere in the image, and P_t = 0 otherwise. The goal of object presence detection is to estimate the probability p(P_t = 1 | I), where I is the image.
Later we will generalize this slightly by trying to estimate the number of instances of the object class that might be present, p(N_t | I), where N_t ∈ {0, 1, 2, 3-5, 5-10, >10}. We call this object counting.

The goal of object localization is to specify the location and size of each of the object instances. More precisely, let O_i^t be a binary random variable representing whether image patch i contains an object of type t or not, for i ∈ {1, ..., N}, where N ≫ 1 is the number of image patches. (The size and shape of the image patches vary according to the object type; for side views of cars, we use patches of size 30 × 80 pixels; to handle cars of different sizes, we apply the technique to multiple versions of the image at different scales.) One way to perform localization is to compute the log-likelihood ratio

    c_i^t = log [ p(f_i^t | O_i^t = 1) / p(f_i^t | O_i^t = 0) ]     (2)

for each i and t, and then to return all the locations where this log-likelihood ratio is above some threshold. Here f_i^t is a set of local features extracted from image I at patch i for class t. The details of the features and classifier that we used can be found in [19]. For simplicity, in this paper we select the D most confident detections (after performing local non-maximum suppression); let their locations be denoted by l_i^t, for i ∈ {1, ..., D}. Figure 6(a) gives an illustration of the output of our system on a typical image. For the results in this paper, we set D = 10, which is large enough that no correct detections are discarded and still small enough to be efficient. In the figure we show only the top 4 detections to avoid clutter. The location of each detection l_i^t is indicated by the position and scale of the box, and its confidence c_i^t is indicated by the thickness of the border. In Figure 6(b-top), we see that although the system has detected the car, it has also detected 3 false positives.
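The post-processing step just described can be sketched as follows: given per-patch scores c_i (Equation 2) and their boxes, keep the D most confident detections after a greedy non-maximum suppression. The intersection-over-union test below is a standard stand-in, not necessarily the suppression rule of [19].

```python
import numpy as np

def top_detections(scores, boxes, D=10, overlap=0.5):
    """Keep the D most confident detections after greedy NMS.

    scores: array of log-likelihood ratios c_i (Eq. 2), one per patch.
    boxes: (N, 4) array of [x1, y1, x2, y2] patch locations.
    Returns the indices of the kept detections, most confident first.
    """
    order = np.argsort(scores)[::-1]  # descending confidence
    keep = []
    for i in order:
        suppressed = False
        for j in keep:
            # Intersection-over-union between candidate i and kept box j.
            xx1 = max(boxes[i][0], boxes[j][0]); yy1 = max(boxes[i][1], boxes[j][1])
            xx2 = min(boxes[i][2], boxes[j][2]); yy2 = min(boxes[i][3], boxes[j][3])
            inter = max(0, xx2 - xx1) * max(0, yy2 - yy1)
            a_i = (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1])
            a_j = (boxes[j][2] - boxes[j][0]) * (boxes[j][3] - boxes[j][1])
            if inter / (a_i + a_j - inter) > overlap:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
        if len(keep) == D:
            break
    return keep
```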

Figure 4: Predicting the presence/absence of cars in images and their locations using the gist. Each panel shows an image, its inferred scene category (highway, inside city, street, open country, mountain), and the predicted distribution over car counts, p(N = n | S = s). The outputs shown here do not incorporate any information coming from a car detector and are based only on context. Note that in the dataset used to fit the distributions of object counts for each scene category, it is more common to find cars in street scenes (with many cars circulating and parked) than in highway scenes, where there are many shots of empty roads. Hence the histogram for highway shows p(N_car = 0) = 0.6.

This is fairly typical of this kind of approach. Below we will see how to eliminate many of these false positives by using global context.

3.2 Object presence detection using global image features

To determine if an object class is present in an image given the gist, we could directly learn a binary classifier of the form p(P_t = 1 | g). Similarly, to predict the number of objects, we could learn an ordinal regression function of the form p(N_t | g). Instead, we choose a two-step approach in which we first estimate the category or type of scene, p(S = s | g), and then use this to predict the number of objects present, p(N_t | S = s). This approach has the benefit of having an explicit representation of the scene category (e.g., a street, a highway, a forest), which is also an important desired output of an integrated model. We can classify the scene using a simple Parzen-window based density estimator,

    p(S = s | g) ∝ p(g | S = s) = (1/J) Σ_{j=1}^{J} N(g; μ_j, σ_j^2 I),

where J is the number of mixture components for each class-conditional density. Some examples of scene classification are shown in Figure 4. As shown in [12], this technique classifies 75% of the images correctly across 15 different scene categories. Other classifiers give similar performance. Once we have estimated the scene category, we can predict the number of objects that are present using

    p(N_t = n | g) = Σ_s p(N_t = n | S = s) p(S = s | g)     (3)

where p(N_t = n | S = s) is estimated by simple counting.

3.3 Object localization using global image features

The gist captures the overall spatial layout of the image, and hence can be used to predict the expected vertical location of each object class before running any detectors; we call this location priming. However, the gist is not useful for predicting the horizontal locations of objects, which are usually not very constrained by the overall structure of the scene (except possibly by the horizontal location of other objects, a possibility we ignore in this paper). We can use any non-linear regression function to learn the mapping from gist to expected vertical location. We used a mixture of experts model [4], which is a simple weighted average of locally linear regression models. More precisely, we define

    p(Y_t | g) = Σ_{k=1}^{K} w_k(g) N(Y_t; β_k^T g, σ_k^2)

where Y_t is the vertical location of class t, K is the number of experts or mixture components, N represents a Gaussian or normal distribution, β_k are the regression weights for mixture component k, σ_k^2 is the residual variance, and w_k(g) is the weight or responsibility of expert k, given by the softmax or multinomial logistic function

    w_k(g) = exp(v_k^T g) / Σ_{k'=1}^{K} exp(v_{k'}^T g).

We illustrate the predictions made by this model in Figure 6(b), where we scale the intensity of each image pixel by the probability density function p(Y_t | g). We see that the effect is to mask out regions of the image which are unlikely to contain the object of interest.
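The two-step computation of Section 3.2 (scene classification followed by count prediction, Equation 3) can be written compactly as below. The shared spherical variance sigma2 and the uniform prior over scene categories are simplifying assumptions of this sketch.

```python
import numpy as np
from scipy.special import logsumexp

def scene_posterior(g, centers, sigma2):
    """p(S = s | g) via the Parzen-window density of Section 3.2.

    centers: one (J, dim) array of training gists mu_j per scene class;
    sigma2: shared spherical variance (an assumed simplification of the
    per-component sigma_j^2). A uniform scene prior is assumed.
    """
    log_lik = []
    for mu in centers:  # class-conditional p(g | S = s)
        d2 = ((g - mu) ** 2).sum(axis=1)
        comp = -0.5 * d2 / sigma2 - 0.5 * g.size * np.log(2 * np.pi * sigma2)
        log_lik.append(logsumexp(comp) - np.log(len(mu)))
    log_post = np.array(log_lik)
    return np.exp(log_post - logsumexp(log_post))

def count_posterior(g, centers, sigma2, counts_given_scene):
    """Eq. (3): p(N_t = n | g) = sum_s p(N_t = n | S = s) p(S = s | g).

    counts_given_scene: (n_scenes, n_count_bins) table estimated by
    simple counting on the training set.
    """
    return scene_posterior(g, centers, sigma2) @ counts_given_scene
```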
Some more examples can be seen in Figure 4.

3.4 Integrated model

We now discuss how to combine the various pieces described above. The basic idea is to use the global features to make top-down predictions about how many object instances should be present, and where, and then to use the local patch classifiers to provide bottom-up signals.

Figure 5: Integrated system represented as a directed graphical model. We show two object types, t and t', for simplicity. The observed variables are shaded circles, the unknown variables are clear circles. The variables are: O_i^t, indicator of presence of object class t in box i; Y_t, vertical location of object class t; N_t, number of instances of object class t; l_i^t, location of box i for object class t; c_i^t, score of box i for object class t; D, number of high-confidence detections; g, gist descriptor; S, scene category. The Σ_t node is a dummy node used to enforce the constraint between the N_t nodes and the O_i^t nodes.

The key issue is how to combine these two information sources. The approach we take is as follows (this differs slightly from the method originally described in [9]). Let us initially ignore location information. We treat the confidence score of the detector (c_i^t, defined in Equation 2) as a local likelihood term, and fit a model of the form

    p(c_i^t | O_i^t = o) = N(c_i^t; μ_o^t, σ_o^t)

for o ∈ {0, 1}. We can learn the parameters of this Gaussian by computing the empirical mean and variance of the scores when the detector is applied to a set of patches which do contain the object (so o = 1) and which do not contain the object (so o = 0). If we have a uniform prior over whether each detection is a true or false positive, p(O_i^t = 1) = 0.5, we can compute the posterior using Bayes' rule as follows:

    p(O_i^t = 1 | c_i^t) = p(c_i^t | O_i^t = 1) / [p(c_i^t | O_i^t = 1) + p(c_i^t | O_i^t = 0)]

However, the detections are not all independent, since we have the constraint that N_t = Σ_{i=1}^{D} I(O_i^t = 1), where N_t is the number of objects of type t. If we have top-down information about N_t from the gist, based on Equation 3, then we can compute the posterior distribution over detections in O(2^D) time, given the gist, as follows:

    p(O_{1:D}^t | g) ∝ Σ_{n=0}^{D} p(O_{1:D}^t | n) p(N_t = n | g)

Here the term p(O_{1:D}^t | n) is 1 only if the bit vector O_{1:D}^t of length D has precisely n elements turned on. For compactness, we use the notation 1:D to denote the indices 1, ..., D. We can combine this with the local detectors as follows:

    p(O_{1:D}^t | c_{1:D}^t, g) ∝ p(O_{1:D}^t | g) Π_{i=1}^{D} p(c_i^t | O_i^t)

If the gist strongly suggests that the object class is absent, then p(N_t = 0 | g) ≈ 1, so we turn all the object bits off in the posterior regardless of the detector scores, p(O_{1:D}^t = 0 | c_{1:D}^t, g) ≈ 1. If the gist strongly indicates that one object is present, then p(N_t = 1 | g) ≈ 1, and only one O_i^t bit will be turned on in the posterior; this will be the one with the highest detector score. And so on.

Now we discuss how to integrate location information. Let l_i^t be the location of the i-th detection for class t. Since Y_t represents the expected location of an object of class t, we define another local likelihood term

    p(l_i^t | O_i^t = 1, Y_t) = N(l_i^t; Y_t, τ_t),

where τ_t is the variance around the predicted location. If the object is absent, we use a uniform distribution, p(l_i^t | O_i^t = 0, Y_t) ∝ 1. Of course, Y_t is not observed directly, but we can predict it based on the gist; this yields

    p(l_i^t | O_i^t, g) = ∫ p(l_i^t | O_i^t, Y_t) p(Y_t | g) dY_t

which can be solved in closed form, since it is the convolution of two Gaussians.
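Because the counting constraint couples the bits O_{1:D}, the posterior above is computed by enumerating all 2^D bit vectors. A direct, deliberately naive sketch of this O(2^D) computation:

```python
import numpy as np
from itertools import product

def detection_posterior(log_lik_on, log_lik_off, p_count):
    """Posterior over the D detection bits O_{1:D}, with the count constraint.

    log_lik_on[i]  = log p(c_i | O_i = 1)  (plus the location term, if used)
    log_lik_off[i] = log p(c_i | O_i = 0)
    p_count[n] = p(N_t = n | g) for n = 0..D, assumed to contain at least
    one nonzero entry. Enumerates all 2^D bit vectors, so D must be small.
    """
    D = len(log_lik_on)
    post = {}
    for bits in product([0, 1], repeat=D):
        n = sum(bits)
        if p_count[n] == 0:
            continue  # p(O_{1:D} | n) is an indicator on the bit count
        ll = sum(log_lik_on[i] if b else log_lik_off[i]
                 for i, b in enumerate(bits))
        post[bits] = np.log(p_count[n]) + ll
    # Normalize in the log domain for numerical stability.
    m = max(post.values())
    z = sum(np.exp(v - m) for v in post.values())
    return {b: np.exp(v - m) / z for b, v in post.items()}
```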
We can now combine expected location and detections as follows:

    p(O_{1:D}^t | c_{1:D}^t, l_{1:D}^t, g) ∝ p(O_{1:D}^t | g) Π_{i=1}^{D} p(c_i^t | O_i^t) p(l_i^t | O_i^t, g)

To see the effect of this, suppose that the gist strongly suggests that only one object of type t is present, p(N_t = 1 | g) ≈ 1; in this case, the object bit which is turned on will be the one that has the highest score and which is in the most likely location. Thus confident detections in improbable locations are suppressed; similarly, unconfident detections in likely locations are boosted.

Finally, we discuss how to combine multiple types of objects. Intuitively, the presence of a car makes the presence of a pedestrian more likely, but the presence of a computer monitor less likely. However, it is impractical to encode a joint distribution of the form p(P_1, ..., P_T) directly, since this would require O(2^T) parameters. (Encoding p(N_1, ..., N_T) directly would be even worse.) Instead, we introduce the scene category latent variable S, and assume that the presence (and number) of object types is conditionally independent given the scene category:

    p(N_1, ..., N_T) = Σ_s p(S = s) Π_{t=1}^{T} p(N_t | S = s)

Given this assumption, we can perform inference for multiple object types in parallel as follows: for each possible scene category, compute the posterior p(O_{1:D}^t | c_{1:D}^t, l_{1:D}^t, g, S = s) as described above, and then combine the results using a weighted average with p(S = s | g) as the weights.
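The per-detection factors in this product can be assembled as below, with the integral over Y_t carried out analytically as the convolution of two Gaussians. Here p(Y_t | g) is collapsed to a single Gaussian, an assumed simplification of the full mixture of experts; the resulting log-likelihood pairs plug directly into the enumeration sketch above.

```python
import numpy as np

def per_detection_log_liks(c, l, mu_c, var_c, y_mean, y_var, tau2, extent=1.0):
    """Per-detection log-likelihood terms for the integrated model.

    c, l: arrays of detector scores c_i and vertical locations l_i.
    mu_c, var_c: score-model parameters, indexed by o in {0, 1}.
    y_mean, y_var: moments of p(Y_t | g), collapsed to one Gaussian.
    tau2: variance of the location likelihood around Y_t.
    extent: range of l, so the absent-case density is uniform 1/extent.
    """
    def log_gauss(x, m, v):
        return -0.5 * (x - m) ** 2 / v - 0.5 * np.log(2 * np.pi * v)

    # Integrating out Y_t adds its variance to tau2 (Gaussian convolution).
    log_on = log_gauss(c, mu_c[1], var_c[1]) + log_gauss(l, y_mean, y_var + tau2)
    log_off = log_gauss(c, mu_c[0], var_c[0]) - np.log(extent)
    return log_on, log_off
```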

In summary, our whole model is the following joint probability distribution:

    p(O_{1:D}^{1:T}, N_{1:T}, Y_{1:T}, S | c_{1:D}^{1:T}, l_{1:D}^{1:T}, g) ∝ p(S | g) Π_{t=1}^{T} [ p(Y_t | g) p(N_t | S) p(O_{1:D}^t | N_t) Π_{i=1}^{D} p(l_i^t | O_i^t, Y_t) p(c_i^t | O_i^t) ]

This is illustrated as a probabilistic graphical model (see, e.g., [5]) in Figure 5. There is one node for each random variable: the shaded nodes are observed (these are deterministic functions of the image), and the unshaded nodes are hidden or unknown, and need to be inferred. There is a directed edge into each node from all the variables it directly depends on. For example, the g → S arc reflects the scene classifier; the g → Y_t arc reflects the location priming based on the gist; the S → N_t arc reflects the object counts given the scene category; the O_i^t → c_i^t arc reflects the fact that the presence or absence of an object of type t in patch i affects the detector score or confidence c_i^t; the O_i^t → l_i^t arc is a deterministic link encoding the location of patch i; the Y_t → l_i^t arc reflects the p(l_i^t | Y_t, O_i^t) term; finally, there are the O_i^t → Σ_t and N_t → Σ_t arcs, which are simply a trick for enforcing the N_t = Σ_{i=1}^{D} I(O_i^t = 1) constraint. The Σ_t node is a dummy node used to enforce the constraint between the N_t nodes and the O_i^t nodes. Specifically, it is clamped to a fixed state, and we then define p(Σ_t | O_{1:D}^t, N_t = n) = I(Σ_i O_i^t = n); conditional on the observed child Σ_t, all the parent nodes N_t and O_i^t become correlated, due to the explaining-away phenomenon [5].

From Figure 5, it is clear that by conditioning on S, we can perform inference on each type of object independently, in parallel. The time complexity for exact inference in this model is O(S T 2^D), ignoring the cost of running the detectors. (Techniques for quickly evaluating detectors on large images, using cascades of features, are discussed in [20].) We can speed up inference in several ways. For example, we can prune out improbable object categories (and not run their detectors) if p(N_t > 0 | g) is too low, which is very effective since g is fast to compute. Of the categories that survive, we can just run their detectors in the primed region, near E(Y_t | g). This will reduce the number of detections D per category. Finally, if necessary, we can use Monte Carlo inference (such as Gibbs sampling) in the resulting pruned graphical model to reduce the time complexity.

4. RESULTS

Examples of the integrated system in action are shown in Figure 6(c): we see that location priming, based on the gist, has down-weighted the scores of the detections in improbable locations, thus eliminating false positives. In the second row, the local detector produces one confident detection, but the second car yields only a low-confidence detection. As this low-confidence detection falls inside the predicted region, its confidence increases. Note that in this example there are two false alarms that also happen to fall within the predicted region. In this case, the overall system will increase the magnitude of the error: if the detector produces errors that are contextually correct, the integrated model will not be able to discard them. The third row shows a different example of failure of the integrated model.
In this case, the structure of the scene makes the system think that this is a street scene, and it then mistakes the boats for cars. Despite these sources of error, the performance of the integrated system is substantially better than the performance of the car detector in isolation.

For a more quantitative study of the performance of our method, we used the scenes dataset from [11], consisting of 2688 images covering 8 scene categories (streets, building facades, skyscrapers, highways, mountainous landscapes, coast, beach and fields). We use half of the dataset to train the models and the other half for testing. Figure 7 shows performance at two tasks: object localization and object presence detection. The plots are precision-recall plots: the horizontal axis denotes the percentage of cars in the database that have been detected for a particular detection threshold (recall), and the vertical axis is the percentage of correct detections for the same threshold (precision). Different points on the graph are achieved by varying the decision threshold. For both tasks, the plot shows the performance of an object detector alone, the performance of the integrated model, and the performance of an integrated model with an oracle that reveals the true context for each image. The performance of the integrated model has to lie between the performance of the detector alone and that of the context oracle.

Figure 7(right) shows a precision-recall curve which quantifies the performance of 3 different systems for detecting object presence. The worst one is based on an object detector using local features alone, the middle one is our integrated system which uses local and global features, and the best one is an oracle system based on using the true scene category label. We see that our integrated model does much better than just using a detector, but it is clear that better scene classification would improve the results further. It is important to note that detecting whether an object is present in an image can be done with good accuracy even without object localization; knowledge of the scene depicted by the image can be enough. For instance, in a picture of a street it is quite certain that a car will appear in the picture, while it is unlikely that a car will appear in a beach scene. Therefore, the relation between the scene category and the object can provide a lot of information even when the detector fails to locate the object in the image.

Figure 7(left) shows a precision-recall curve which quantifies the performance of 3 different systems for localizing objects. Again, the worst one is based on an object detector using local features alone, the middle one is our integrated system which uses local and global features, and the best one is an oracle system based on using the true scene category label. In this case, knowing the true scene category does not help as much: it can eliminate false positives such as cars in indoor scenes, but it cannot eliminate false positives such as cars detected in a street scene but up in the sky. (Of course, the gist-based location priming system tries to eliminate such spatial outliers, but knowing the scene category label does not help with localization.) Object localization is a much harder task than merely detecting the presence of an object. This is evident from the horizontal scale in Figure 7(left): the recall never goes beyond about 30%, meaning that about 70% of cars are missed by the detector, mostly due to occlusion.
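For reference, precision-recall curves like those in Figure 7 can be produced by sweeping the decision threshold over the ranked detections, as in this small sketch (the matching of detections to ground truth, which determines is_true_positive, is assumed to have been done already):

```python
import numpy as np

def precision_recall(scores, is_true_positive, n_ground_truth):
    """Precision-recall curve obtained by sweeping the detection threshold.

    scores: confidences of all returned detections;
    is_true_positive: boolean array marking which of them are correct;
    n_ground_truth: total number of annotated objects (e.g., cars).
    """
    order = np.argsort(scores)[::-1]          # rank detections by confidence
    tp = np.cumsum(np.asarray(is_true_positive)[order])
    n_det = np.arange(1, len(scores) + 1)     # detections kept at each cutoff
    precision = tp / n_det
    recall = tp / n_ground_truth
    return precision, recall
```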
Even if context can be used to narrow down the search space and to remove false alarms that occur outside the relevant image region, if the detector is not able to localize the object, context information will not be able to precisely localize the object either.

Figure 6: a) Three input images. b) Top 4 detections from an object detector based on local features; the thickness of the boxes is related to the confidence of the detections. c) Predicted location of the car based on global features (location priming). d) Output of the integrated model, combining local and global features.

Figure 7: Performance on car localization (left) and car presence detection (right). Each panel shows precision-recall curves for the detector alone, the integrated model, and the integrated model with a context oracle (the presence detection panel also shows chance performance).

Figure 8: An object which is out of context may be falsely eliminated by our system.

The use of global context (even with the oracle) does not increase the recall (as this requires the detector to work); however, context is able to increase the precision, as it removes false alarms in scenes in which cars are not expected. It is possible that a finer-grained notion of context, perhaps based on other objects, could help in such cases. Note, however, that for image retrieval applications (e.g., on the web), object presence detection is sufficient. For speed reasons, we could adopt the following two-stage approach: first select images that are predicted to contain the object based on the gist alone, since this is much faster than applying a sliding window classifier; then apply the integrated model to further reduce false positives.

5. CONCLUSIONS

We have discussed one approach for combining local and global features in visual object detection and localization. Of course, the system is not perfect. For example, sometimes objects appear out of context and may be accidentally eliminated if the local evidence is ambiguous (see Figure 8). The only way to prevent this is for the local detector to give a sufficiently strong bottom-up signal. Conversely, if the detector makes a false positive error in a contextually plausible location, it will not be ruled out by our system. But even people can suffer from such hallucinations. In more general terms, we see our system as a good example of probabilistic information fusion, an approach which is widely used in other areas such as speech recognition, where local acoustic models are combined with longer-range language models. Since computer vision is inherently a difficult inverse problem, we believe it will be necessary to combine as many sources of evidence as possible when trying to infer the true underlying scene structure.

6. ACKNOWLEDGMENTS

Funding for this work was provided by NGA NEGI-1582-04-0004, MURI Grant N00014-06-1-0734, NSF Career award IIS-0747120, an NSF contract (IIS), a National Defense Science and Engineering Graduate Fellowship, and gifts from Microsoft and Google. KPM would like to thank NSERC and CIFAR for support.

7. REFERENCES

[1] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.

[2] D. Heeger and J. R. Bergen. Pyramid-based texture analysis/synthesis. In SIGGRAPH '95: Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, New York, NY, USA, 1995. ACM.

[3] D. Hoiem, A. Efros, and M. Hebert. Geometric context from a single image. In IEEE Intl. Conf. on Computer Vision, 2005.

[4] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.

[5] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[6] S. Kumar and M. Hebert. Discriminative random fields: A discriminative framework for contextual interaction in classification. In IEEE Intl. Conf. on Computer Vision, 2003.

[7] S. Lazebnik, C. Schmid, and J. Ponce.
Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.

[8] K. Murphy, A. Torralba, D. Eaton, and W. T. Freeman. Object detection and localization using local and global features. In J. Ponce, M. Hebert, C. Schmid, and A. Zisserman, editors, Toward Category-Level Object Recognition. Springer, 2006.

[9] K. Murphy, A. Torralba, and W. Freeman. Using the forest to see the trees: a graphical model relating features, objects and scenes. In Advances in Neural Info. Proc. Systems, 2003.

[10] A. Oliva and P. G. Schyns. Diagnostic color blobs mediate scene recognition. Cognitive Psychology, 41:176-210, 2000.

[11] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145-175, 2001.

[12] A. Quattoni and A. Torralba. Recognizing indoor scenes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.

[13] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In

IEEE Intl. Conf. on Computer Vision, Rio de Janeiro, 2007.

[14] X. He, R. S. Zemel, and M. A. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2004.

[15] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. LabelMe: a database and web-based tool for image annotation. Intl. J. Computer Vision, 77(1-3):157-173, 2008.

[16] T. M. Strat and M. A. Fischler. Context-based vision: recognizing objects using information from both 2-D and 3-D imagery. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(10):1050-1065, 1991.

[17] A. Torralba. Contextual priming for object detection. Intl. J. Computer Vision, 53(2):169-191, 2003.

[18] A. Torralba, K. Murphy, and W. Freeman. Contextual models for object detection using boosted random fields. In Advances in Neural Info. Proc. Systems, 2004.

[19] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):854-869, 2007.

[20] P. Viola and M. Jones. Robust real-time object detection. Intl. J. Computer Vision, 57(2):137-154, 2004.


More information

Joint design of data analysis algorithms and user interface for video applications

Joint design of data analysis algorithms and user interface for video applications Joint design of data analysis algorithms and user interface for video applications Nebojsa Jojic Microsoft Research Sumit Basu Microsoft Research Nemanja Petrovic University of Illinois Brendan Frey University

More information

A Hierarchical Compositional System for Rapid Object Detection

A Hierarchical Compositional System for Rapid Object Detection A Hierarchical Compositional System for Rapid Object Detection Long Zhu and Alan Yuille Department of Statistics University of California at Los Angeles Los Angeles, CA 90095 {lzhu,yuille}@stat.ucla.edu

More information

CS 223B Computer Vision Problem Set 3

CS 223B Computer Vision Problem Set 3 CS 223B Computer Vision Problem Set 3 Due: Feb. 22 nd, 2011 1 Probabilistic Recursion for Tracking In this problem you will derive a method for tracking a point of interest through a sequence of images.

More information

Window based detectors

Window based detectors Window based detectors CS 554 Computer Vision Pinar Duygulu Bilkent University (Source: James Hays, Brown) Today Window-based generic object detection basic pipeline boosting classifiers face detection

More information

Class 5: Attributes and Semantic Features

Class 5: Attributes and Semantic Features Class 5: Attributes and Semantic Features Rogerio Feris, Feb 21, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Project

More information

Learning to Recognize Faces in Realistic Conditions

Learning to Recognize Faces in Realistic Conditions 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Classification of objects from Video Data (Group 30)

Classification of objects from Video Data (Group 30) Classification of objects from Video Data (Group 30) Sheallika Singh 12665 Vibhuti Mahajan 12792 Aahitagni Mukherjee 12001 M Arvind 12385 1 Motivation Video surveillance has been employed for a long time

More information

String distance for automatic image classification

String distance for automatic image classification String distance for automatic image classification Nguyen Hong Thinh*, Le Vu Ha*, Barat Cecile** and Ducottet Christophe** *University of Engineering and Technology, Vietnam National University of HaNoi,

More information

Hand Posture Recognition Using Adaboost with SIFT for Human Robot Interaction

Hand Posture Recognition Using Adaboost with SIFT for Human Robot Interaction Hand Posture Recognition Using Adaboost with SIFT for Human Robot Interaction Chieh-Chih Wang and Ko-Chih Wang Department of Computer Science and Information Engineering Graduate Institute of Networking

More information

Edge and corner detection

Edge and corner detection Edge and corner detection Prof. Stricker Doz. G. Bleser Computer Vision: Object and People Tracking Goals Where is the information in an image? How is an object characterized? How can I find measurements

More information

Object recognition (part 1)

Object recognition (part 1) Recognition Object recognition (part 1) CSE P 576 Larry Zitnick (larryz@microsoft.com) The Margaret Thatcher Illusion, by Peter Thompson Readings Szeliski Chapter 14 Recognition What do we mean by object

More information

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK 1 Po-Jen Lai ( 賴柏任 ), 2 Chiou-Shann Fuh ( 傅楸善 ) 1 Dept. of Electrical Engineering, National Taiwan University, Taiwan 2 Dept.

More information

Schedule for Rest of Semester

Schedule for Rest of Semester Schedule for Rest of Semester Date Lecture Topic 11/20 24 Texture 11/27 25 Review of Statistics & Linear Algebra, Eigenvectors 11/29 26 Eigenvector expansions, Pattern Recognition 12/4 27 Cameras & calibration

More information

Contexts and 3D Scenes

Contexts and 3D Scenes Contexts and 3D Scenes Computer Vision Jia-Bin Huang, Virginia Tech Many slides from D. Hoiem Administrative stuffs Final project presentation Dec 1 st 3:30 PM 4:45 PM Goodwin Hall Atrium Grading Three

More information

Comparing Local Feature Descriptors in plsa-based Image Models

Comparing Local Feature Descriptors in plsa-based Image Models Comparing Local Feature Descriptors in plsa-based Image Models Eva Hörster 1,ThomasGreif 1, Rainer Lienhart 1, and Malcolm Slaney 2 1 Multimedia Computing Lab, University of Augsburg, Germany {hoerster,lienhart}@informatik.uni-augsburg.de

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 12 Combining

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

CS 231A Computer Vision (Fall 2012) Problem Set 4

CS 231A Computer Vision (Fall 2012) Problem Set 4 CS 231A Computer Vision (Fall 2012) Problem Set 4 Master Set Due: Nov. 29 th, 2012 (23:59pm) 1 Part-based models for Object Recognition (50 points) One approach to object recognition is to use a deformable

More information

Recovering Intrinsic Images from a Single Image

Recovering Intrinsic Images from a Single Image Recovering Intrinsic Images from a Single Image Marshall F Tappen William T Freeman Edward H Adelson MIT Artificial Intelligence Laboratory Cambridge, MA 02139 mtappen@ai.mit.edu, wtf@ai.mit.edu, adelson@ai.mit.edu

More information

Optimizing Monocular Cues for Depth Estimation from Indoor Images

Optimizing Monocular Cues for Depth Estimation from Indoor Images Optimizing Monocular Cues for Depth Estimation from Indoor Images Aditya Venkatraman 1, Sheetal Mahadik 2 1, 2 Department of Electronics and Telecommunication, ST Francis Institute of Technology, Mumbai,

More information

Detecting Object Instances Without Discriminative Features

Detecting Object Instances Without Discriminative Features Detecting Object Instances Without Discriminative Features Edward Hsiao June 19, 2013 Thesis Committee: Martial Hebert, Chair Alexei Efros Takeo Kanade Andrew Zisserman, University of Oxford 1 Object Instance

More information

Bag-of-features. Cordelia Schmid

Bag-of-features. Cordelia Schmid Bag-of-features for category classification Cordelia Schmid Visual search Particular objects and scenes, large databases Category recognition Image classification: assigning a class label to the image

More information

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213) Recognition of Animal Skin Texture Attributes in the Wild Amey Dharwadker (aap2174) Kai Zhang (kz2213) Motivation Patterns and textures are have an important role in object description and understanding

More information

Eye Detection by Haar wavelets and cascaded Support Vector Machine

Eye Detection by Haar wavelets and cascaded Support Vector Machine Eye Detection by Haar wavelets and cascaded Support Vector Machine Vishal Agrawal B.Tech 4th Year Guide: Simant Dubey / Amitabha Mukherjee Dept of Computer Science and Engineering IIT Kanpur - 208 016

More information

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba

Beyond bags of features: Adding spatial information. Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba Beyond bags of features: Adding spatial information Many slides adapted from Fei-Fei Li, Rob Fergus, and Antonio Torralba Adding spatial information Forming vocabularies from pairs of nearby features doublets

More information

CSC411 Fall 2014 Machine Learning & Data Mining. Ensemble Methods. Slides by Rich Zemel

CSC411 Fall 2014 Machine Learning & Data Mining. Ensemble Methods. Slides by Rich Zemel CSC411 Fall 2014 Machine Learning & Data Mining Ensemble Methods Slides by Rich Zemel Ensemble methods Typical application: classi.ication Ensemble of classi.iers is a set of classi.iers whose individual

More information

Markov Random Fields and Gibbs Sampling for Image Denoising

Markov Random Fields and Gibbs Sampling for Image Denoising Markov Random Fields and Gibbs Sampling for Image Denoising Chang Yue Electrical Engineering Stanford University changyue@stanfoed.edu Abstract This project applies Gibbs Sampling based on different Markov

More information

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016

Pedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016 edestrian Detection Using Correlated Lidar and Image Data EECS442 Final roject Fall 2016 Samuel Rohrer University of Michigan rohrer@umich.edu Ian Lin University of Michigan tiannis@umich.edu Abstract

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University September 30, 2016 1 Introduction (These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan.

More information

Latest development in image feature representation and extraction

Latest development in image feature representation and extraction International Journal of Advanced Research and Development ISSN: 2455-4030, Impact Factor: RJIF 5.24 www.advancedjournal.com Volume 2; Issue 1; January 2017; Page No. 05-09 Latest development in image

More information

Skin and Face Detection

Skin and Face Detection Skin and Face Detection Linda Shapiro EE/CSE 576 1 What s Coming 1. Review of Bakic flesh detector 2. Fleck and Forsyth flesh detector 3. Details of Rowley face detector 4. Review of the basic AdaBoost

More information

CS4670: Computer Vision

CS4670: Computer Vision CS4670: Computer Vision Noah Snavely Lecture 6: Feature matching and alignment Szeliski: Chapter 6.1 Reading Last time: Corners and blobs Scale-space blob detector: Example Feature descriptors We know

More information