Semantic Visual Decomposition Modelling for Improving Object Detection in Complex Scene Images

Ge Qin, Department of Computing, University of Surrey, United Kingdom
Bogdan Vrusias, Department of Computing, University of Surrey, United Kingdom

Abstract: We propose a systematic method for constructing a compositional model for recognising object instances in images of real-life scenes. The model is trained on a set of visual examples of the objects contained in given images, in order to capture the visual characteristics of those objects and to derive the spatial relationships between the key internal sub-components of each object instance. The recognition method extracts visual similarities at the component level in three feature spaces: the histogram of boundary distribution, the intensity histogram, and the histogram of oriented gradient (HOG). Principal Component Analysis (PCA) is used for component selection and feature weighting. The proposed recognition method not only improves the accuracy of popular object detection algorithms, but also offers a systematic way of generating detection models.

Keywords: Contextual object recognition, semantic object modelling, visual object decomposition.

I. INTRODUCTION

Visual recognition has been one of the most popular research areas in computer vision for the last half century. The research community nowadays focuses on semantically understanding objects and their surrounding environment, beyond visual appearance alone. Human beings are naturally capable of identifying both visual and semantic similarities in a given set of images. On one hand, we are able to extract similarities in shape, colour, texture or patterns in other photometric domains; on the other hand, we can interpret contextual information beyond visual appearance and associate objects or scenes based on their semantic similarities. This combination of visual and semantic analysis gives us the flexibility to select which information, visual or semantic, to use for a particular recognition task.

It is unquestionable that better visual processing techniques provide better object recognition results and further simplify the derivation of semantics: improving the performance of classic image processing techniques [1, 2] has a direct impact on the performance of object recognition. Compared with classic content-based information retrieval (CBIR) systems [3, 4], research interest has gradually moved from specific context-based object retrieval towards generic knowledge-based scene understanding [5, 6], focusing on visual analysis over images using queries relating to visual features and compositions of visual features.

Composition-based recognition is a commonly accepted way of exploiting prior knowledge about the detection model in the form of parts and the relationships between them [7, 8]. Borenstein [9] proposed a recognition system that extracts a cow or a runner from its natural background by combining visual-similarity-driven bottom-up segment stitching with knowledge-driven top-down splitting. Although the recognition only works on simple data, i.e. a single object that is visually distinctive from the background, it provides a way to systematically recognise small, visually descriptive pieces and group them into semantically descriptive objects guided by a model template.
It can be considered a first step towards deriving high-level object knowledge by analysing low-level visual descriptions. Oliva and Torralba's research reveals that the statistical structure within the processed images plays a fundamental role in generic scene understanding [10, 11]. Boutell's work [12, 13] in natural scene recognition analyses the trend of the spatial colour moments and uses it as a semantic feature to recognise outdoor scenes. He also developed a generative model to monitor pair-wise spatial relationships between the semantic objects appearing in a scene instance. Currently, most scene understanding is performed on long-distance natural landscape scenes. Such a scene domain is advantageous because the semantic features are monolithic and normally apply to the whole image. Furthermore, segmentation of long-distance scene images usually outputs fewer regions, which simplifies the spatial relationship analysis between those regions. However, this scene understanding approach is difficult to apply to the recognition of structured objects in indoor or closed scenes, which contain more detailed semantic relationships within an object or between objects.

The present work attempts to improve recognition performance over existing image processing techniques by adding systematically extracted semantic information about the objects detected in the image. An object model is trained in a supervised fashion [14, 15], and the visually distinctive features within each key component of the detected object are extracted and weighted accordingly. As shown in Figure 1, the recognition process is split into two stages: Hypothesis Generation and Hypothesis Validation. Hypothesis Generation produces image patches that have an overall similarity to the object model; Hypothesis Validation examines the visual appearance and spatial relationships of the components inside each generated hypothesis to determine whether sufficient detail has been extracted to declare a recognition. Unlike similar research that focuses mainly on natural landscape scenes, the presented work focuses on street scenes with structured objects, where semantic relationships are embedded within the image details and are more consistent than general landscape themes.

Fig. 1. Object recognition proposal.

II. MODEL CONSTRUCTION

Many cutting-edge composition-based recognition methods focus on building a codebook containing a large number of discriminative local features to describe the detected object. In this work, instead of attempting to recognise a complicated structured object directly from arbitrary local features, we propose an intermediate stage that fuses the low-level visual information into components carrying basic semantic information. The overall recognition of an object then depends on the successful recognition of several of its key sub-components, which makes the recognition less dependent on the overall visual appearance of the detected object. The object model holds information at two feature levels: at the global level, it captures the boundary distribution of the entire object; at the local level, it records the visual patterns of every component in the object's collection of components. Each component is represented using three visual feature descriptors: the boundary distribution (BND), the histogram of oriented gradient (HOG), and the intensity histogram (INT). The model is constructed in a supervised approach, where the detected object and its inner components are labelled in the training samples. The feature map for each component is built in an unsupervised way, in which visual similarities among the training samples over a specific feature space are extracted; the feature map only records the most distinctive features shared among most of the training samples.

A. Object Decomposition

Object decomposition breaks the targeted contextual objects into visually simple but semantically meaningful components. Such a compositional approach enables us to construct a hierarchical knowledge model for the detected object, containing the visual semantics (i.e. a visual grammar) of that object. In this work, the decomposition rule is derived by modelling the labelled inner-city street scene image samples from the MIT LabelMe dataset [16].
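As an illustration of the two-level model layout described in this section, the following Python sketch shows one plausible data structure; the field names and types are our own illustrative choices, not taken from the paper.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ComponentModel:
    """Feature maps for one labelled sub-component (e.g. wheel, rim, window)."""
    bnd_mean: np.ndarray   # mean boundary-distribution vector over the samples
    bnd_std: np.ndarray    # per-element spread, used later for distance normalisation
    hog_mean: np.ndarray   # mean 9-bin HOG vector
    hog_std: np.ndarray
    int_mean: np.ndarray   # mean 32-bin intensity histogram
    int_std: np.ndarray
    weights: dict          # per-feature-space weighting, e.g. {'bnd': 0.4, ...}
    rel_location: tuple    # expected offset relative to the object window
    rel_size: tuple        # expected size relative to the object window

@dataclass
class ObjectModel:
    """Two-level model: a global boundary map plus per-component feature maps."""
    global_boundary: np.ndarray                     # boundary map of the whole object
    components: dict = field(default_factory=dict)  # component name -> ComponentModel
```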
B. Feature Extraction

Shape is a robust feature against photometric variations. In this work, the PB boundary detector [17] is used to extract boundaries in the processed image. From the output of the PB boundary detection, we accumulate the boundary maps to compute a histogram map of boundary distribution, and use it to monitor the similarity of boundary orientation shared among individual object instances. Every point in the histogram map of boundary distribution is assigned a value indicating the likelihood of detecting a boundary point at that location [14]. The map is used as the global-level feature descriptor to generate recognition hypotheses, and also as one of the local-level descriptors to validate the generated hypotheses.

The Histogram of Oriented Gradient (HOG) is another commonly used descriptor, which captures local feature appearance by analysing the distribution of intensity gradients over a targeted area of interest [18]. HOG performs well in capturing strong directional features in localised regions. Following Felzenszwalb's approach [19], we apply the gradient filter kernels [-1, 0, 1] and its transpose over each 8x8 sub-region of the target gray-scale image using a sliding window approach. The gradient magnitude of each pixel is summarised into a one-dimensional 9-bin histogram, each bin recording the gradient intensity at a specific direction. Further normalisation adjusts the gradient histogram vector according to its surrounding windows. Finally, we accumulate the HOG distributions over all samples of a given component to derive the similarity in oriented gradient for that component.

The intensity distribution is tracked and monitored using gray-scale intensity histograms in a 32-bin feature vector. Because the intensity distribution is unstable at the global object level, intensity analysis is only applied at the component level. The intensity similarity is calculated between the image patch and the model, then multiplied by the customised weighting to compute the recognition confidence for the component, which in turn contributes to the overall object recognition score.
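A minimal sketch of the 9-bin gradient histogram described above, assuming a single gray-scale patch as input; the unsigned-orientation binning follows the standard HOG formulation rather than any code released with the paper.

```python
import numpy as np

def hog_9bin(patch: np.ndarray) -> np.ndarray:
    """Compute a 9-bin histogram of oriented gradients for one gray-scale patch.

    patch: 2-D float array of intensities (e.g. an 8x8 sub-region).
    Returns an L2-normalised 9-element histogram over unsigned orientations.
    """
    # Gradients via the [-1, 0, 1] kernel and its transpose.
    gx = np.zeros_like(patch)
    gy = np.zeros_like(patch)
    gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]
    gy[1:-1, :] = patch[2:, :] - patch[:-2, :]

    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180), quantised into 9 bins of 20 degrees each.
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((orientation / 20.0).astype(int), 8)

    hist = np.zeros(9)
    np.add.at(hist, bins.ravel(), magnitude.ravel())  # accumulate gradient energy per bin

    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```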

C. Component Selection

Since an object's sub-components are defined by the manual annotations provided in the sample image dataset, each component has its own distinctiveness and therefore contributes differently to the object recognition process. Principal Component Analysis (PCA) is used to extract the set of key principal components that dominate the object recognition. As explained in the previous section, we take into consideration three features for every component, i.e. the boundary distribution, the intensity distribution and the histogram of oriented gradient (HOG), together with three pieces of relational information, i.e. occurrence frequency, relative location, and relative size. During the encoding process, each component is converted into a 55-element feature vector (9 elements for the boundary distribution, 32 for the intensity histogram, 9 for the HOG distribution, 1 for occurrence, 2 for relative location, and 2 for relative size). For each element in the feature vector, we calculate the difference between the element value x_i and its mean value mu_i, then normalise the difference by dividing it by two standard deviations. We discard elements that reside outside two standard deviations, thereby covering 95% of the sample data. The relevance score for each component is then calculated by averaging the normalised distances over the n elements of the feature vector, as shown in (1):

r = (1/n) * sum_i |x_i - mu_i| / (2 * sigma_i)    (1)

Every vehicle sample is converted into an N-element vector, each element representing the relevance score of one component. To balance performance against computation overhead, we select only the top six components, which cumulatively contribute to the recognition of 77% of the samples: wheel, rim, window, tail light, head light, and windshield.

D. Component Recognition

To recognise a targeted component, a set of similarity measures is computed over the different feature spaces between the component candidate and the feature maps stored in the model. For each feature space, we convert the extracted feature into a 1-D vector and measure its correlation against the mean vector. The boundary map of a given sample is divided into 8-by-8 pixel windows; within each window, we compute an overall score by dividing the total intensity energy by the total number of edge points in that window, generating a 1-D vector with one element per window. Similarly, for the histogram of oriented gradient (HOG), every window is represented by a 9-bin 1-D vector, each bin representing a direction. For colour intensity, we convert the intensity distribution of each of the R, G and B bands into a 32-bin array.

For each feature of any given component, a mean vector is computed across the complete sample set. The normalised Euclidean distance, i.e. the Mahalanobis distance with a diagonal covariance, shown in (2), is calculated between every sample x and the mean vector to measure how close the sample is to the centroid of the entire sample set:

D(x) = sqrt( sum_i (x_i - mu_i)^2 / sigma_i^2 )    (2)

We then compute the standard deviation of the Mahalanobis distances to represent the spread of individual samples around the mean; the standard deviation is inverted and rescaled into a value between 0 and 1, as shown in (3), and used as the weighting score in the processed feature space for that component. Successful recognition of an individual component contributes towards the recognition of the whole object.
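A short sketch of the diagonal-covariance Mahalanobis distance of (2), with one plausible inversion of the distance spread into a [0, 1] weight; the exact mapping used for (3) is not recoverable from the text, so the 1/(1 + sigma) form below is our assumption.

```python
import numpy as np

def mahalanobis_diag(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> float:
    """Normalised Euclidean distance of a feature vector to the sample mean (eq. 2)."""
    eps = 1e-8  # guard against zero-variance elements
    return float(np.sqrt(np.sum(((x - mean) / (std + eps)) ** 2)))

def component_weight(samples: np.ndarray) -> float:
    """Weight a feature space by how tightly its training samples cluster (eq. 3).

    samples: (num_samples, dim) matrix of training feature vectors for one component.
    Returns a value in (0, 1]; tight clusters (small spread) get weights near 1.
    The 1/(1 + sigma) mapping is an assumption, not taken from the paper.
    """
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)
    dists = np.array([mahalanobis_diag(s, mean, std) for s in samples])
    return 1.0 / (1.0 + dists.std())
```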
Assessment of the semantic relationships is also carried out between the component candidate and the other identified components. Figure 2 shows the spatial relationships between the filtered key components within the detected object. The relative size of each component is also monitored, to ensure that the recognition of each individual component is consistent with the recognition of the whole contextual object. This mutual spatial map, together with the relative size restriction, significantly reduces the search domain for the remaining components once one component has been identified, and therefore improves detection efficiency considerably; the sketch below illustrates the idea.

Fig. 2. Boundary & HOG distribution for each component (headlight, taillight, wheel, rim, window, shield).
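A minimal sketch of how such a pairwise spatial constraint could prune the search: given one identified component, the expected relative offset and a tolerance bound the region scanned for the next component. The helper name, the tolerance value, and the fractional-offset convention are illustrative assumptions.

```python
def candidate_region(found_xy, rel_offset, obj_size, tolerance=0.05):
    """Restrict the search window for component B given a detection of component A.

    found_xy:   (x, y) centre of the already-identified component.
    rel_offset: expected (dx, dy) from A to B, as fractions of the object window.
    obj_size:   (width, height) of the current object hypothesis window.
    tolerance:  allowed deviation, also as a fraction of the object size.
    Returns (x_min, y_min, x_max, y_max) of the reduced search region.
    """
    w, h = obj_size
    cx = found_xy[0] + rel_offset[0] * w   # expected centre of component B
    cy = found_xy[1] + rel_offset[1] * h
    return (cx - tolerance * w, cy - tolerance * h,
            cx + tolerance * w, cy + tolerance * h)

# Example: having found a wheel, only a small band of the hypothesis window
# needs to be scanned for the rim, rather than the whole image patch.
rim_region = candidate_region(found_xy=(120, 210), rel_offset=(0.0, 0.0),
                              obj_size=(200, 80))
```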

III. OBJECT RECOGNITION

The object recognition process is divided into two stages: an approximation process, Hypothesis Generation, is first applied to quickly restrict the search areas; a more comprehensive matching, Hypothesis Validation, is then performed to verify the generated hypotheses by recognising each component and examining their inter-spatial relationships. Thresholding is applied to determine when recognition in a specific feature space is achieved. The standard deviation is computed between the feature maps of individual samples and the feature means stored in the object model. Thresholds are set dynamically, depending on whether recognition is applied at the global object level or at the local component level. For hypothesis generation at the global object level, we set the threshold to within 3 standard deviations of the mean, to include the maximum number of true positive hypotheses. For hypothesis validation, we set the threshold for each component in each feature space to within 1 standard deviation of the mean, to filter out as many false positive hypotheses as possible and thereby increase the recognition accuracy.

A. Hypothesis Generation

For hypothesis generation, an exhaustive sliding-window search is applied over a set of scales, scanning the whole image to generate potential object hypotheses. The set of scales is pre-defined, covering from 5% to 50% of the size of the processed image. Boundary detection is first applied to each image patch to extract its boundary map, as shown in Figure 3: each candidate window (top row) is compared against the boundary distribution map of the global object stored in the object model (bottom row). The boundary map is segmented into 8x8 pixel windows, and each window is compared against the corresponding boundary window stored in the model. For every pixel within a window W at location (x, y), we measure the boundary intensity difference between the sample map B_s and the model map B_m, summing the difference over every point in the window and dividing by the total number of points in that window, as shown in (4):

d(x, y) = (1/|W|) * sum_{p in W} |B_s(p) - B_m(p)|    (4)

Fig. 3. Hypothesis validation for boundary matching.

The processed sample is thus converted into a 1-D vector, each element representing the boundary intensity difference of the corresponding window at location (x, y). A mean boundary vector is extracted in the same way from the model's boundary histogram map. The boundary distribution match for the object is calculated as the Mahalanobis distance between the two vectors, in the same form as (2), shown in (5). Every sample that passes the pre-defined matching threshold is considered a potential object candidate; the set of candidates output by the hypothesis generation is then passed to the hypothesis validation process to verify the recognition.
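A compact sketch of the multi-scale sliding-window stage, reusing mahalanobis_diag from the component recognition sketch above; the stride, the scale set, the canonical model size, and the resampling step are our own assumptions, while the 3-standard-deviation threshold is taken from the text.

```python
import numpy as np

def resample_to_model(patch, size=(64, 128)):
    """Nearest-neighbour resample onto the model's canonical grid (illustrative)."""
    ys = np.linspace(0, patch.shape[0] - 1, size[0]).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, size[1]).astype(int)
    return patch[np.ix_(ys, xs)]

def window_scores(patch, win=8):
    """Average boundary energy of each win x win cell, flattened to 1-D (eq. 4)."""
    h, w = (patch.shape[0] // win) * win, (patch.shape[1] // win) * win
    cells = patch[:h, :w].reshape(h // win, win, w // win, win)
    return cells.mean(axis=(1, 3)).ravel()

def generate_hypotheses(boundary_map, model_vec, model_std,
                        scales=(0.05, 0.1, 0.2, 0.35, 0.5), stride=8):
    """Scan the image boundary map at several window scales (eqs. 4-5).

    Yields (x, y, w, h, distance) for windows within the generation threshold.
    """
    H, W = boundary_map.shape
    for s in scales:
        w, h = int(W * s), int(H * s)
        for y in range(0, H - h + 1, stride):
            for x in range(0, W - w + 1, stride):
                patch = resample_to_model(boundary_map[y:y + h, x:x + w])
                vec = window_scores(patch)                       # eq. (4)
                d = mahalanobis_diag(vec, model_vec, model_std)  # eq. (5), defined earlier
                if d <= 3.0:  # 3-standard-deviation generation threshold
                    yield (x, y, w, h, d)
```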
B. Hypothesis Validation

Hypothesis generation provides a set of locations with a high probability of containing object instances. All hypotheses that passed the threshold during hypothesis generation are treated as potential object candidates and decomposed into sub-regions according to the object model for further validation. Hypothesis validation examines the corresponding sub-regions of each extracted hypothesis and attempts to validate it by identifying its essential sub-components according to the object model. The component recognition results are then consolidated to validate the recognition of the whole object. Recognition at the local component level is carried out in three feature spaces: the boundary distribution (BND), the histogram of oriented gradient (HOG), and the intensity histogram (INT).

The validation of the boundary distribution for each component is similar to the boundary distribution matching at the global object level. For any particular component, we divide the boundary map into 8x8 pixel windows and calculate the boundary distribution for each window at location (x, y). We then compute the boundary distribution match between the processed image region and the model as a Mahalanobis distance (6). For the histogram of oriented gradient, we compute the HOG feature for every 8x8 pixel window of the component to generate a HOG map; within each window, we use the direction with the highest gradient intensity to represent the gradient of the window (7), and the HOG match between sample and model is again calculated as a Mahalanobis distance (8). For the intensity histogram, we convert the gray-scale intensity map of the processed image into a 32-bin histogram and compute the Mahalanobis distance between the processed image and the histogram in the model (9). The final recognition score for each component is the sum of the matching results from all three feature spaces, each multiplied by its corresponding weighting, as shown in (10):

S = w_bnd * D_bnd + w_hog * D_hog + w_int * D_int    (10)

Object hypotheses whose validation scores pass the threshold are considered correct hypotheses.
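A sketch of the component validation of (6)-(10), combining the three per-feature distances into one weighted score; it reuses mahalanobis_diag and the ComponentModel structure sketched earlier, and the pass test is our reading of the 1-standard-deviation rule quoted in the text.

```python
def validate_component(candidate_feats: dict, comp) -> tuple:
    """Score one component candidate against its model (eqs. 6-10).

    candidate_feats: {'bnd': vec, 'hog': vec, 'int': vec} extracted from the
                     candidate sub-region.
    comp: a ComponentModel as sketched earlier (mean/std per feature + weights).
    Returns (score, passed); lower scores mean a closer match.
    """
    pairs = [('bnd', comp.bnd_mean, comp.bnd_std),
             ('hog', comp.hog_mean, comp.hog_std),
             ('int', comp.int_mean, comp.int_std)]
    score = 0.0
    for name, mean, std in pairs:
        d = mahalanobis_diag(candidate_feats[name], mean, std)  # eqs. (6), (8), (9)
        score += comp.weights[name] * d                         # weighted sum, eq. (10)
    passed = score <= 1.0 * len(pairs)  # one-sigma-per-feature rule (our reading)
    return score, passed
```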

IV. EXPERIMENT

We compare the performance of the proposed semantic visual decomposition modelling (SVDM) method against popular existing recognition methods: contour template matching (CTM) [15], top-down and bottom-up segment merging/splitting (TDBU) [9], and part-based deformable models (PBDM) [19].

A. Dataset

The training samples for model construction are extracted from street scene images in the MIT LabelMe dataset [16], an online dataset allowing customised annotations at the component level. The object model is built on 40 training samples selected to contain sufficient visual detail for each annotated component. The recognition performance is evaluated on the MIT StreetScene dataset, which contains professionally labelled and verified annotations at the contextual object level. We compare the recognition results against state-of-the-art methods and against the manual annotation benchmark provided with the MIT StreetScene dataset.

B. Contour Template Matching (CTM)

Contour template matching is a simple but classic recognition method based on matching the boundary orientation of an object candidate with a contour model. The distance between the centre point and the contour intersection point at a particular angle is measured for both the object candidate and the contour model, as stated in (11):

delta(theta) = |d_candidate(theta) - d_model(theta)|    (11)

Two intersection points are considered a matching pair if their distance difference is within a pre-defined threshold, and an object instance is declared when sufficient matching pairs are identified to support the hypothesis. In this work, we set out to examine contour matching with different numbers of distance pairs between the candidate window and the vehicle model, different thresholds on the distance variation, and different thresholds on the number of matches. In general, the method executes quickly; however, the recognition is easily disturbed by noisy regions. For instance, foliage regions often match any shape, due to the large number of evenly distributed noise edges generated by illumination changes. Increasing the number of distance pairs and the thresholds improves the recognition accuracy; the consequence is that the computational complexity also increases proportionally to the number of pairs involved in the recognition.

C. Top-Down and Bottom-Up Matching (TDBU)

The combination of top-down and bottom-up matching (TDBU) is a recognition method using template matching guided by segmentation maps from the two extreme directions. Traversing the coarse segmentation maps in a top-down fashion restricts the search areas for the detected object. Once the locations of potential object hypotheses are identified, a bottom-up pass examines those locations to validate the hypotheses by merging or splitting their segments under the guidance of the detailed segmentation maps. The target image is first over-segmented, and individual segments are recursively merged based on colour and texture saliency against adjacent segments. A hierarchy of segment maps can be generated from the merging order, with a few distinctive segments at the top of the hierarchy and over-segmented regions at the bottom. Traversing the hierarchy from top to bottom, a set of hypotheses can be generated by matching the overlapping area between the template and the grouped segments at each hypothesis location.
The main drawback of the TDBU method is that it does not cope well with recognising objects against a complicated background, since its performance is heavily influenced by the initial over-segmentation: the detected object cannot be recognised if it cannot be separated from the surrounding segments in the merging decision tree. Furthermore, TDBU turns out to be computationally intensive when processing complicated real-life images in which the detected objects are small compared with the background, and the situation worsens when the detected objects are visually indistinguishable from the background areas.

D. Part-Based Deformable Modelling (PBDM)

The part-based deformable modelling method examined in this paper is based on the work of Felzenszwalb [19] and builds on the histogram of oriented gradient (HOG). Like other codebook-based approaches, the part-based deformable model is constructed in a loosely supervised manner, training the object model on labelled object samples while leaving the recognisable inner parts of the object to deform in an unsupervised way; the template model is illustrated in [19]. Hypotheses are generated through a coarse matching at the root level and are then reinforced by a deformable-parts matching, aiming to capture the detailed patterns that are not visible at the coarse level. PBDM therefore performs recognition at two levels: a quick object detection is carried out with a sliding window at the coarse root level, followed by recognition of the deformable parts at a refined level. The object recognition result is the sum of the recognition scores of the individual deformable parts, computed by comparing the HOG feature map extracted from each image patch against the object model, where the coarse-level root model and the individual deformable part models are each matched at their own scale.
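The root-plus-parts score just described can be sketched schematically; this follows the general part-based scoring form (root response plus part responses penalised by displacement), with simplified notation of our own rather than Felzenszwalb's released implementation.

```python
def pbdm_score(root_resp, part_resps, displacements, deform_cost=0.1):
    """Combine a coarse root-filter response with deformable part responses.

    root_resp:     HOG filter response of the whole-object (root) template.
    part_resps:    best filter response found for each part near its anchor.
    displacements: (dx, dy) of each part's best location from its anchor.
    Parts may move away from their anchors, but pay a quadratic
    deformation penalty for doing so.
    """
    score = root_resp
    for resp, (dx, dy) in zip(part_resps, displacements):
        score += resp - deform_cost * (dx * dx + dy * dy)
    return score

# Example: two parts, one sitting on its anchor and one displaced slightly.
s = pbdm_score(root_resp=1.2, part_resps=[0.8, 0.6],
               displacements=[(0, 0), (1, 2)])
```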

In the HOG vehicle model, the most distinctive features are concentrated around the wheel regions, while the other vehicle regions can be described by horizontal HOG features. Matching this HOG model against image patches is sufficient to robustly separate vehicle instances from the remaining regions, even in such a challenging dataset with complicated backgrounds. However, the recognition performance drops when processing images that contain other horizontal structures in the HOG feature space. Furthermore, Felzenszwalb's method only requires marking the whole training samples with bounding boxes, leaving the objects' inner components to "self-deform" based on visual integrity. Those inner parts are grouped on visual similarity alone; they encapsulate limited semantic information and thus cannot be used to help filter out semantically false positive hypotheses.

E. Semantic Visual Decomposition Modelling (SVDM)

SVDM extracts and analyses features in the form of the histogram of boundary distribution, the histogram of oriented gradient, and the intensity histogram. Instead of generating the inner components by an unsupervised deformable approach, SVDM encapsulates both the visual appearance and the spatial relationships of the inner components into the object model, based on the object annotations provided with the training dataset. In the recognition process, template matching is performed for each component, examining boundary, HOG, and intensity histogram at the specific locations given by the spatial map stored in the object model. Template matching is applied across the whole image using a sliding window approach to generate object hypotheses. Figure 4 (a) shows a processed image sample; we pass it through hypothesis generation to output a hypothesis map recording where each hypothesis is located and its similarity match against the object model, shown as intensity in the map of Figure 4 (b). For each object hypothesis generated, recursive matching is applied to identify the components that form the object and validate the hypothesis, as shown in Figure 4 (c) to (g). The final recognition result is the combination of the hypothesis generation result at the object level with the hypothesis validation result at the component level, shown in Figure 4 (h).

Fig. 4. Semantic Visual Decomposition Modelling (SVDM) process for validating the hypothesis: a) processed image, b) vehicle hypotheses map, c) head light candidates map, d) tail light candidates map, e) wheel candidates map, f) rim candidates map, g) window candidates map, h) recognition result.

Compared with PBDM, and with a more restrictive validation process deployed during recognition, the SVDM approach is able to filter out false positives generated during the hypothesis generation process. SVDM also experiences the side effect that troubles many component-based recognition methods, but unlike the others, the proposed method can recover from the effect of low thresholds, since the validation stage eliminates most of the false positives. Another type of misrecognition (false positive) can occur when recognition at the component level over-rules recognition at the vehicle level. For example, the wheels and wheel rims from two different vehicles may be extracted at their expected locations and match with high confidence, forming a spurious vehicle object: SVDM then produces a misrecognition by stitching together components from different objects that match the recognition model. This can be eliminated by increasing the threshold in the hypothesis generation stage to minimise the number of candidate objects considered, thereby preventing the object-level matching from admitting too many false positive objects.

F. Method Evaluation

For the performance comparison across the different methods, we used a subset of the MIT StreetScene dataset, randomly selecting 80 images containing 150 side-view vehicle instances and mixing them with 80 randomly selected street images with no vehicle present. The number of vehicle instances and the location of each vehicle instance are not restricted in the image set.
The manual annotation of this subset of StreetScene images is considered by the research community as the ground truth for recognition performance evaluation. To evaluate the recognition, we consider a vehicle instance to be identified correctly if the recognition result has an intersection ratio greater than 90% with the manual annotation provided in the MIT StreetScene dataset. For the recognition validation at the component level, thresholding is applied to determine whether a key component has been identified; as explained previously, the threshold is set to the mean plus 1 standard deviation.

Based on the results obtained from the experiments, the proposed SVDM method (see Figure 5 and Table I) outperformed both the CTM and TDBU methods, and also delivered a performance matching that of the HOG-based PBDM method.
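The 90% intersection-ratio criterion above can be made concrete with a small helper; whether the paper normalises the overlap by the annotation area or by the union is not stated, so the annotation-area form below is an assumption.

```python
def intersection_ratio(det: tuple, gt: tuple) -> float:
    """Ratio of the overlap area to the ground-truth box area.

    det, gt: boxes as (x_min, y_min, x_max, y_max).
    A detection counts as correct when this ratio exceeds 0.9.
    """
    ix = max(0.0, min(det[2], gt[2]) - max(det[0], gt[0]))
    iy = max(0.0, min(det[3], gt[3]) - max(det[1], gt[1]))
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return (ix * iy) / gt_area if gt_area > 0 else 0.0

# Example: a detection covering most of the annotated vehicle passes.
assert intersection_ratio((10, 10, 110, 60), (12, 12, 112, 62)) > 0.9
```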

Like the PBDM method, SVDM has a high recall (97% and 95% respectively) and therefore retrieves most objects from the scene, but the proposed SVDM method has slightly better precision (61%, against 59% for PBDM) and therefore a higher F-measure, at 0.74, the highest of all the methods compared. In general, the proposed method works well for vehicle recognition, owing to the highly structured representation of vehicles. With a tight threshold, the proposed SVDM generates more accurate recognition results than the PBDM method. However, when a loose threshold is applied, SVDM does not cope as well as PBDM and more easily confuses vehicle instances with background patches that share similar visual patterns.

Fig. 5. Performance comparison.

TABLE I. RECOGNITION RESULT COMPARISON

Method   Precision   Recall   F-measure
CTM      42%         88%      0.57
TDBU     62%         73%      0.67
PBDM     59%         97%      0.73
SVDM     61%         95%      0.74

V. CONCLUSION

In conclusion, we proposed a method to automatically construct object models by analysing the visual and semantic spatial characteristics of each object's compositional inner parts. The proposed method is built on the boundary and Histogram of Oriented Gradient (HOG) features. A comparison was carried out against existing benchmark recognition methods, namely CTM, TDBU, and PBDM, over the same dataset. The proposed SVDM method improves the recognition accuracy compared with those popular detection methods, with the added benefit of offering an automatic way of generating models for object detection. Further work will focus on extending SVDM to monitor objects from multiple classes. For example, SVDM can be extended to scene recognition by using statistical analysis focusing mainly on the co-occurrence and spatial relationships between objects of different classes instead of the visual appearance of inner object components. Other features can also be considered for the object recognition, so that more false positives can be discarded.

REFERENCES

[1] J. Harel, C. Koch, et al., "Graph-based visual saliency", Proceedings of Neural Information Processing Systems, 2006.
[2] S. Bileschi and L. Wolf, "A Unified System For Object Detection, Texture Recognition, and Context Analysis Based on the Standard Model Feature Set", British Machine Vision Conference.
[3] J. R. Smith and S.-F. Chang, "VisualSEEk: a fully automated content-based image query system", Proceedings of ACM Multimedia.
[4] Y. Rui, T. S. Huang, et al., "Relevance feedback: a power tool in interactive content-based image retrieval", IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 5.
[5] J. Vogel and B. Schiele, "Semantic Modeling of Natural Scenes for Content-Based Image Retrieval", International Journal of Computer Vision, vol. 72.
[6] L. Li, R. Socher, et al., "Towards Total Scene Understanding: Classification, Annotation and Segmentation in an Automatic Framework", Joint VCL-ViSU Workshop.
[7] A. Oliva and A. Torralba, "Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope", International Journal of Computer Vision, vol. 42, no. 3.
[8] M. A. Grudin, "On internal representations in face recognition systems", Pattern Recognition, vol. 33, no. 7.
[9] E. Borenstein, E. Sharon, and S. Ullman, "Combining Top-down and Bottom-up Segmentation", IEEE Conf. on Computer Vision and Pattern Recognition.
[10] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope", International Journal of Computer Vision.
[11] A. Torralba, "Contextual priming for object detection", International Journal of Computer Vision.
[12] M. Boutell, A. Choudhury, J. Luo, and C. M. Brown, "Using Semantic Features for Scene Classification: How Good Do They Need to Be?", IEEE Intl. Conf. on Multimedia and Expo.
[13] M. R. Boutell, J. Luo, and C. M. Brown, "Scene Parsing Using Region-Based Generative Models", IEEE Transactions on Multimedia, vol. 9, no. 1, December 2006.
[14] G. Qin and B. Vrusias, "Adaptable Models and Semantic Filtering for Object Recognition in Street Images", Int. Conf. on Signal and Image Processing Applications.
[15] G. Qin, B. Vrusias, and L. Gilliam, "Background Filtering for Improving of Object Detection in Images", International Conference on Pattern Recognition.
[16] B. C. Russell and A. Torralba, "LabelMe: a database and web-based tool for image annotation", International Journal of Computer Vision, vol. 77.
[17] D. Martin, C. Fowlkes, and J. Malik, "Learning to detect natural image boundaries using local brightness, color, and texture cues", IEEE Trans. PAMI, vol. 26.
[18] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection", IEEE Conf. on Computer Vision and Pattern Recognition.
[19] P. F. Felzenszwalb, R. B. Girshick, et al., "Object detection with discriminatively trained part-based models", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, 2010.


More information

arxiv: v1 [cs.cv] 31 Mar 2016

arxiv: v1 [cs.cv] 31 Mar 2016 Object Boundary Guided Semantic Segmentation Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo arxiv:1603.09742v1 [cs.cv] 31 Mar 2016 University of Southern California Abstract.

More information

Textural Features for Image Database Retrieval

Textural Features for Image Database Retrieval Textural Features for Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington Seattle, WA 98195-2500 {aksoy,haralick}@@isl.ee.washington.edu

More information

Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi

Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi Discovering Visual Hierarchy through Unsupervised Learning Haider Razvi hrazvi@stanford.edu 1 Introduction: We present a method for discovering visual hierarchy in a set of images. Automatically grouping

More information

SIFT - scale-invariant feature transform Konrad Schindler

SIFT - scale-invariant feature transform Konrad Schindler SIFT - scale-invariant feature transform Konrad Schindler Institute of Geodesy and Photogrammetry Invariant interest points Goal match points between images with very different scale, orientation, projective

More information

Seeing and Reading Red: Hue and Color-word Correlation in Images and Attendant Text on the WWW

Seeing and Reading Red: Hue and Color-word Correlation in Images and Attendant Text on the WWW Seeing and Reading Red: Hue and Color-word Correlation in Images and Attendant Text on the WWW Shawn Newsam School of Engineering University of California at Merced Merced, CA 9534 snewsam@ucmerced.edu

More information

Evaluation and comparison of interest points/regions

Evaluation and comparison of interest points/regions Introduction Evaluation and comparison of interest points/regions Quantitative evaluation of interest point/region detectors points / regions at the same relative location and area Repeatability rate :

More information

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b 1

More information

Computer Vision. Recap: Smoothing with a Gaussian. Recap: Effect of σ on derivatives. Computer Science Tripos Part II. Dr Christopher Town

Computer Vision. Recap: Smoothing with a Gaussian. Recap: Effect of σ on derivatives. Computer Science Tripos Part II. Dr Christopher Town Recap: Smoothing with a Gaussian Computer Vision Computer Science Tripos Part II Dr Christopher Town Recall: parameter σ is the scale / width / spread of the Gaussian kernel, and controls the amount of

More information

Linear combinations of simple classifiers for the PASCAL challenge

Linear combinations of simple classifiers for the PASCAL challenge Linear combinations of simple classifiers for the PASCAL challenge Nik A. Melchior and David Lee 16 721 Advanced Perception The Robotics Institute Carnegie Mellon University Email: melchior@cmu.edu, dlee1@andrew.cmu.edu

More information

An Implementation on Histogram of Oriented Gradients for Human Detection

An Implementation on Histogram of Oriented Gradients for Human Detection An Implementation on Histogram of Oriented Gradients for Human Detection Cansın Yıldız Dept. of Computer Engineering Bilkent University Ankara,Turkey cansin@cs.bilkent.edu.tr Abstract I implemented a Histogram

More information

2/15/2009. Part-Based Models. Andrew Harp. Part Based Models. Detect object from physical arrangement of individual features

2/15/2009. Part-Based Models. Andrew Harp. Part Based Models. Detect object from physical arrangement of individual features Part-Based Models Andrew Harp Part Based Models Detect object from physical arrangement of individual features 1 Implementation Based on the Simple Parts and Structure Object Detector by R. Fergus Allows

More information

DIGITAL IMAGE ANALYSIS. Image Classification: Object-based Classification

DIGITAL IMAGE ANALYSIS. Image Classification: Object-based Classification DIGITAL IMAGE ANALYSIS Image Classification: Object-based Classification Image classification Quantitative analysis used to automate the identification of features Spectral pattern recognition Unsupervised

More information

Robotics Programming Laboratory

Robotics Programming Laboratory Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK REVIEW ON CONTENT BASED IMAGE RETRIEVAL BY USING VISUAL SEARCH RANKING MS. PRAGATI

More information

Segmentation of Images

Segmentation of Images Segmentation of Images SEGMENTATION If an image has been preprocessed appropriately to remove noise and artifacts, segmentation is often the key step in interpreting the image. Image segmentation is a

More information

Very Fast Image Retrieval

Very Fast Image Retrieval Very Fast Image Retrieval Diogo André da Silva Romão Abstract Nowadays, multimedia databases are used on several areas. They can be used at home, on entertainment systems or even in professional context

More information

https://en.wikipedia.org/wiki/the_dress Recap: Viola-Jones sliding window detector Fast detection through two mechanisms Quickly eliminate unlikely windows Use features that are fast to compute Viola

More information

Detecting Digital Image Forgeries By Multi-illuminant Estimators

Detecting Digital Image Forgeries By Multi-illuminant Estimators Research Paper Volume 2 Issue 8 April 2015 International Journal of Informative & Futuristic Research ISSN (Online): 2347-1697 Detecting Digital Image Forgeries By Multi-illuminant Estimators Paper ID

More information

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information

Content Based Image Retrieval Using Color Quantizes, EDBTC and LBP Features

Content Based Image Retrieval Using Color Quantizes, EDBTC and LBP Features Content Based Image Retrieval Using Color Quantizes, EDBTC and LBP Features 1 Kum Sharanamma, 2 Krishnapriya Sharma 1,2 SIR MVIT Abstract- To describe the image features the Local binary pattern (LBP)

More information

Edge Detection. Computer Vision Shiv Ram Dubey, IIIT Sri City

Edge Detection. Computer Vision Shiv Ram Dubey, IIIT Sri City Edge Detection Computer Vision Shiv Ram Dubey, IIIT Sri City Previous two classes: Image Filtering Spatial domain Smoothing, sharpening, measuring texture * = FFT FFT Inverse FFT = Frequency domain Denoising,

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation Object detection using Region Proposals (RCNN) Ernest Cheung COMP790-125 Presentation 1 2 Problem to solve Object detection Input: Image Output: Bounding box of the object 3 Object detection using CNN

More information

Quasi-thematic Features Detection & Tracking. Future Rover Long-Distance Autonomous Navigation

Quasi-thematic Features Detection & Tracking. Future Rover Long-Distance Autonomous Navigation Quasi-thematic Feature Detection And Tracking For Future Rover Long-Distance Autonomous Navigation Authors: Affan Shaukat, Conrad Spiteri, Yang Gao, Said Al-Milli, and Abhinav Bajpai Surrey Space Centre,

More information

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection Hyunghoon Cho and David Wu December 10, 2010 1 Introduction Given its performance in recent years' PASCAL Visual

More information

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos Sung Chun Lee, Chang Huang, and Ram Nevatia University of Southern California, Los Angeles, CA 90089, USA sungchun@usc.edu,

More information

Towards the completion of assignment 1

Towards the completion of assignment 1 Towards the completion of assignment 1 What to do for calibration What to do for point matching What to do for tracking What to do for GUI COMPSCI 773 Feature Point Detection Why study feature point detection?

More information

Performance Evaluation Metrics and Statistics for Positional Tracker Evaluation

Performance Evaluation Metrics and Statistics for Positional Tracker Evaluation Performance Evaluation Metrics and Statistics for Positional Tracker Evaluation Chris J. Needham and Roger D. Boyle School of Computing, The University of Leeds, Leeds, LS2 9JT, UK {chrisn,roger}@comp.leeds.ac.uk

More information