Multi-Evidence Filtering and Fusion for Multi-Label Classification, Object Detection and Semantic Segmentation Based on Weakly Supervised Learning

Size: px
Start display at page:

Download "Multi-Evidence Filtering and Fusion for Multi-Label Classification, Object Detection and Semantic Segmentation Based on Weakly Supervised Learning"

Transcription

1 Multi-Evidence Filtering and Fusion for Multi-Label Classification, Object Detection and Semantic Segmentation Based on Weakly Supervised Learning Weifeng Ge Sibei Yang Yizhou Yu Department of Computer Science, The University of Hong Kong Abstract Supervised object detection and semantic segmentation require object or even pixel level annotations. When there exist image level labels only, it is challenging for weakly supervised algorithms to achieve accurate predictions. The accuracy achieved by top weakly supervised algorithms is still significantly lower than their fully supervised counterparts. In this paper, we propose a novel weakly supervised curriculum learning pipeline for multi-label object recognition, detection and semantic segmentation. In this pipeline, we first obtain intermediate object localization and pixel labeling results for the training images, and then use such results to train task-specific deep networks in a fully supervised manner. The entire process consists of four stages, including object localization in the training images, filtering and fusing object instances, pixel labeling for the training images, and task-specific network training. To obtain clean object instances in the training images, we propose a novel algorithm for filtering, fusing and classifying object instances collected from multiple solution mechanisms. In this algorithm, we incorporate both metric learning and density-based clustering to filter detected object instances. Experiments show that our weakly supervised pipeline achieves state-of-the-art results in multi-label image classification as well as weakly supervised object detection and very competitive results in weakly supervised semantic segmentation on MS-COCO, PASCAL VOC 2007 and PASCAL VOC Introduction Deep neural networks give rise to many breakthroughs in computer vision by usinging huge amounts of labeled training data. Supervised object detection and semantic segmentation require object or even pixel level annotations, which are much more labor-intensive to obtain than image level labels. On the other hand, when there exist image level labels only, due to incomplete annotations, it is very challenging to predict accurate object locations, pixel-wise labels, or even image level labels in multi-label image classification. Given image level supervision only, researchers have proposed many weakly supervised algorithms for detecting objects and labeling pixels. These algorithms employ different mechanisms, including bottom-up, top-down [44, 23] and hybrid approaches [32], to dig out useful information. In bottom-up algorithms, pixels are usually grouped into many object proposals, which are further classified, and the classification results are merged to match groundtruth image labels. In top-down algorithms, images first go through a forward pass of a deep neural network, and the result is then propagated backward to discover which pixels actually contribute to the final result [44, 23]. There are also hybrid algorithms [32] that consider both bottom-up and top-down cues in their pipeline. Although there exist many weakly supervised algorithms, the accuracy achieved by top weakly supervised algorithms is still significantly lower than their fully supervised counterparts. This is reflected in both the precision and recall of their results. In terms of precision, results from weakly supervised algorithms contain much more noise and outliers due to indirect and incomplete supervision. Likewise, such algorithms also achieve much lower recall because there is insufficient labeled information for them to learn comprehensive feature representations of target object categories. However, different types of weakly supervised algorithms may return different but complementary subsets of the ground truth. These observations motivate an approach that first collect as many evidences and results as possible from multiple types of solution mechanisms, put them together, and then remove noise and outliers from the fused results using powerful filtering techniques. This is in contrast to deep neural networks trained from end to end. Although this approach needs to collect results from multiple separately trained networks, the filtered and fused evidences are eventually used for training a single network used for the testing stage. Therefore, the running time of the final network during the testing stage is still comparable to that of state-ofthe-art end-to-end networks. According to the above observations, we propose a weakly supervised curriculum learning pipeline for object recognition, detection and segmentation. At a high level, we 1277

2 obtain object localization and pixelwise semantic labeling results for the training images first using their image level labels, and then use such intermediate results to train object detection, semantic segmentation, and multi-label image classification networks in a fully supervised manner. Since image level, object level and pixel level analysis has mutual dependencies, they are not performed independently but organized into a single pipeline with four stages. In the first stage, we collect object localization results in the training images from both bottom-up and topdown weakly supervised object detection algorithms. In the second stage, we incorporate both metric learning and density-based clustering to filter detected object instances. In this way, we obtain a relatively clean and complete set of object instances. Given these object instances, we further train a single-label object classifier, which is applied to all object instances to obtain their final class labels. Third, to obtain a relatively clean pixel-wise probability map for every class and every training image, we fuse the image level attention map, object level attention maps and an object detection heat map. The pixel-wise probability maps are used for training a fully convolutional network, which is applied to all training images to obtain their final pixel-wise label maps. Finally, the obtained object instances and pixel-wise label maps for all the training images are used for training deep networks for object detection and semantic segmentation respectively. To make pixel-wise label maps of the training images help multi-label image classification, we perform multi-task learning by training a single deep network with two branches, one for multi-label image classification and the other for pixel labeling. Experiments show that our weakly supervised curriculum learning system is capable of achieving state-of-the-art results in multi-label image classification as well as weakly supervised object detection and very competitive results in weakly supervised semantic segmentation on MS-COCO [26], PASCAL VOC 2007 and PASCAL VOC 2012 [12]. In summary, this paper has the following contributions. We introduce a novel weakly supervised pipeline for multi-label object recognition, detection and semantic segmentation. In this pipeline, we first obtain intermediate labeling results for the training images, and then use such results to train task-specific networks in a fully supervised manner. To localize object instances relatively accurately in the training images, we propose a novel algorithm for filtering, fusing and classifying object instances collected from multiple solution mechanisms. In this algorithm, we incorporate both metric learning and density-based clustering to filter detected object instances. To obtain a relatively clean pixel-wise probability map for every class and every training image, we propose an algorithm for fusing image level and object level attention maps with an object detection heat map. The fused maps are used for training a fully convolutional network for pixel labeling. 2. Related Work Weakly Supervised Object Detection and Segmentation. Weakly supervised object detection and segmentation respectively locates and segments objects with image-level labels only [28, 7]. They are important for two reasons: first, learning complex visual concepts from image level labels is one of the key components in image understanding; second, fully supervised deep learning is too data hungry. Methods in [28, 10, 9] treat the weakly supervised localization problem as an image classification problem, and obtain object locations in specific pooling layers of their networks. Methods in [4, 38] extract object instances from images using selective search [40] or edge boxes [48], convert the weakly supervised detection problem into a multiinstance learning problem [8]. The method in [8] at first learns object masks as in [10, 9], and then uses the E-M algorithm to force the network to learn object segmentation masks obtained at previous stages. Since it is very hard for a network to directly learn object locations and pixel labels without sufficient supervision, in this paper, we decompose object detection and pixel labeling into multiple easier problems, and solve them progressively in multiple stages. Neural Attention. Many efforts [44, 2, 23] have been made to explain how neural networks work. The method in [23] extends layer-wise relevance propagation (LRP) [1] to comprehend inherent structured reasoning of deep neural networks. To further ignore the cluttered background, a positive neural attention back-propagation scheme, called excitation back-propagation (Excitation BP), is introduced in [44]. The method in [2] locates top activations in each convolutional map, and maps these top activation areas into the input image using bilinear interpolation. In our pipeline, we adopt the excitation BP [44] to calculate pixel-wise class probabilities. However for images with multiple category labels, a deep neural network could fuse the activations of different categories in the same neurons. To solve this problem, we train a single-label object instance classification network and perform excitation BP in this network to obtain more accurate pixel level class probabilities. Curriculum Learning. Curriculum learning [3] is part of the broad family of machine learning methods that starts with easier subtasks and gradually increases the difficulty level of the tasks. In [3], Yoshua et al. describe the concept of curriculum learning, and use a toy classification problem to show the advantage of decomposing a complex problem into several easier ones. In fact, the idea behind curriculum learning has been widely used before [3]. Hinton et al. [17] trained a deep neural network layer by layer using a restricted Boltzmann machine [36] to avoid the local min- 1278

3 (a) Image Level Stage: Proposal Generation and Multi Evidence Fusion Input/Image Object Heatmap Object Instances (b) Instance Level Stage: Outlier Detection and Object Instance Filtering Triplet Loss Net (c) Pixel Level Stage: Probability Map Fusion and Pixel Label Prediction Filtered Object Instances Label Map with Uncertainty Instance Attention Map Image Attention Map Probability Map Instance Classifier Figure 1. The proposed weakly supervised pipeline. From left to right: (a) Image level stage: fuse the object heatmaps H and the image attention map Ag to generate object instances R for the instance level stage, and provide these two maps for information fusion at the pixel level stage. (b) Instance level stage: perform triplet loss based metric learning and density based clustering for outlier detection, and train a single label instance classifier φs (, ) for instance filtering. (c) Pixel level stage: integrate the object heatmaps H, instance attention map Al, and image attention map Ag for pixel labeling with uncertainty. ima in deep neural networks. Many machine learning algorithms [37, 14] follow a similar divide-and-conquer strategy in curriculum learning. In this paper, we adopt this strategy to decompose the pixel labeling problem into image level learning, object instance level learning and pixel level learning. All the learning tasks in these three stages are relatively simple using the training data in the current stage and the output from the previous stage. 3. Weakly Supervised Curriculum Learning 3.1. Overview Given an image I associated with an image level label vector y I = [y 1, y 2,..., y C ]T, our weakly supervised curriculum learning aims to obtain pixel-wise labels Y I = [y 1, y 2,..., y P ]T, and then use these labels to assist weakly supervised object detection, semantic segmentation and multi-label image classification. Here C is the total number of object classes, P is the total number of pixels in I, and y l is binary. y l = 1 means the l-th object class exists in I, and y l = 0 otherwise. The label of a pixel p is denoted by a C-dimensional binary vector y p. The number of object classes existing in I, which is the same as the number of positive components of y I is denoted by K. Following the divide-and-conquer idea in curriculum learning [3], we decompose the pixel labeling task into three stages: the image level stage, the instance level stage and the pixel level stage Image Level Stage The image level stage not only decomposes multi-label image classification into a set of single-label object instance classifications, but also provides an initial set of pixel-wise probability maps for the pixel level stage. (a) Heatmap Proposals (b) Attention Proposals (c) Fused Proposals Figure 2. (a) Proposals Rh and Rl generated from an object heatmap, (b) proposals generated from an attention map, (c) filtered proposals (green), heatmap proposals (red and blue), and attention proposals (purple). Object Heatmaps. Unlike the fully supervised case, weakly supervised object detection produces object instances with higher uncertainty and also misses a higher percentage of true objects. To reduce the number of missing detections, we propose to compute an object heatmap H for every object class existing in the image. For an image I with width W and height H, a dense set of object proposals R = (R1, R2,..., Rn ) are generated using sliding anchor windows. And the feature stride λs is set to 8. The number of locations in the input image where we can place anchor windows is H/λs W/λs. Denote the short side of image I by Lρ. Following the setting used for RPN [29], we let the anchor windows at a single location have four scales [Lρ/8, Lρ/4, Lρ/2, Lρ] and three aspect ratios [0.5, 1, 2]. After proposals out of image borders have been removed, there are usually remaining proposals per image. Here we define a stack of object heatmaps H = [H 1, H 2,..., H C ] as a C H W matrix, and all values are 1279

4 set to zero initially. The object detection and classification network φ d (, ) used here is the weakly supervised object testing net VGG-16 from [38]. For every proposal R i in R, its object class probability vector φ d (I,R i ) is added to all the pixels in the corresponding window in the heatmaps. Then every heatmap is normalized to [0, 1] as follows, H c = (H c min(h c ))/max(h c ), where H c is the heatmap for the c-th object class. Note that only the heatmaps for object classes existing in I are normalized. All the other heatmaps are ignored and set to zeros. Multiple Evidence Fusion. The object heatmaps highlight the regions that may contain objects even when the level of supervision is very weak. However, since they are generated using sliding anchor windows at multiple scales and aspect ratios, they tend to highlight pixels near but outside true objects, as shown in Fig 2. Given an image classification network trained using the image level labels (here we use GoogleNet V1 [44]), neural attention calculates the contribution of every pixel to the final classification result. It tends to focus on the most influential regions but not necessarily the entire objects. Note that false positive regions may occur during excitation BP [44]. To obtain more accurate object instances, we integrate the top-down attention mapsa g = [A 1 g,a 2 g,...,a C g] with the object heatmaps H = [H 1,H 2,...,H C ]. For object classes existing in image I, their corresponding heatmaps H and attention maps A g are thresholded by distinct values. The heatmaps H are too smooth to indicate accurate object boundaries, but they provide important spatial priors to constrain object instances obtained from the attention maps. We assume that regions with a sufficiently high value in the object heatmaps should at least include parts of objects, and regions with sufficiently low values everywhere do not contain any objects. Following this assumption, we threshold the heatmaps with two values 0.65 and 0.1 to identify highly confident object proposals R h = (R1,R h 2,...,R h N h h ) and relatively low confident object proposals R l = (R1,R l 2,...,R l N l l ) after connected component extraction. Then the attention maps are thresholded by 0.5 to attention proposals R a = (R1,R a 2,...,R a N a a ) as shown in Fig 2. N h, N l and N a are the proposal numbers of R h, R l and R a. All these object proposals have corresponding class labels. During the fusion, for each object class, the attention proposalsr a which cover more than 0.5 of any proposals in R h are preserved. We denote these proposals by R, each of which is modified slightly to completely enclose the corresponding proposal inr h meanwhile be completely contained inside the corresponding proposal inr l (Fig 2). (a) Input Proposals (b) Distance Map Figure 3. (a) Input proposals of the triplet-loss network, (b) distance map computed using features from the triplet-loss network Instance Level Stage Since multiple object categories present in the same image make it hard for neural attention to obtain an accurate pixel-wise attention map for each class, we train a singlelabel object instance classification network and compute attention maps in this network to obtain more accurate pixel level class probabilities. The fused object instances from the image level stage are further filtered by metric learning and density-based clustering. The remaining labeled object proposals are used for training this object instance classifier, which can also be used to further remove remaining false positive object instances. Metric Learning for Feature Embedding. Metric learning is popular in face recognition [34], person reidentification and object tracking [34, 46, 39]. It embeds an image X into a multi-dimensional feature space by associating this image with a fixed size vector, φ t (X, ), in the feature space. This embedding makes similar images close to each other and dissimilar images apart in the feature space. Thus the similarity between two images can be measured by their distance in this space. The tripletloss network φ t (, ) proposed in [34] has the additional property that it can well separate classes even when intraclass distances have large variations. When there exist training samples associated with incorrect class labels, the loss stays at a high value and the distances between correctly labeled and mislabeled samples remain very large even after the training process has run for a long time. Now let R = [R 1,R 2,...,R O ] T denote the fused object instances from all training images in the image level stage, and Y = [y 1,y 2,...,y O ] T are their labels. Here O is the total number of fused instances, and y l is the label vector of instancer l. We train a triplet-loss networkφ t (, ) using GoogleNet V2 with BatchNorm as in [34]. Each mini-batch first chooses b object classes randomly, and then chooses a instances from these classes randomly. These instances are cropped out from the training images and fed into φ t (, ). Fig. 3 visualizes a mini-batch composition and the corresponding pairwise distances among instances. 1280

5 Clustering for Outlier Removal. Clustering aims to remove outliers that are less similar to other object instances in the same class. Specifically, we perform density based clustering [31] to form a single cluster of normal instances within each object class independently, and instances outside this cluster are considered outliers. This is different from that in [31]. Let R c denote instances in R with class label c, and N c is the number of instances in R c. Calculate the pairwise distances d(, ) among these instances, and obtain the N c by N c distance matrix D c. For an instance R c n, if its distance from another instance is less than λ d (= 0.8), its density d c n is increased by 1. Rank these instances by their densities in a descending order, and choose the instance ranked at the top as the seed of the cluster. Then add instances to the cluster following the descending order if their distance to any element in the cluster is less than λ d and their density is higher than N c /4. Instance Classifier for Re-labeling. Since metric learning and clustering screen object instances in an aggressive way and may heavily decrease their recall, we use the normal instances surviving the previous clustering step to train an instance classifier, which is in turn used to re-label all object proposals generated in the image level stage again. This is a single-label classification problem as each object instance is allowed a single label. GoogleNet V1 with the SoftMax loss serves as the classifier φ s (, ), and it is finetuned from the image level classifier. For every object proposal generated in the previous image level stage, if its label predicted by the instance classifier does not match its original label, it is labeled as an outlier and permanently discarded Pixel Level Stage In previous stages, we have already built an image classifier, a weakly supervised object detector, and an object instance classifier. Each of these deep networks produces its own inference result from the input image. For example, the image classifier generates a global attention map, and the object detector generates the object heatmaps. In the pixel level stage, we still perform multi-evidence filtering and fusion to integrate the inference results from all these component networks to obtain the pixelwise probability map indicating potential object categories at every pixel. The global attention map A g from the image classifier has a full knowledge about the objects in an image but sometimes only focuses on the most important object parts. The object instance classifier has a local view of each individual object. With the help of object-specific local attention maps generated from the instance classifier, we can avoid missing small objects. Instance Attention Map. Here we define the instance attention map A l as a C H W matrix, and all values are zero initially. For every surviving object instance from the instance level stage, the object instance classifier φ s (, ) is used to extract its local attention map, and add it to the corresponding region in the instance attention map A l. Normalize the range ofa l to [0, 1] as we did for object heatmaps. Probability Map Integration. The final attention map A is obtained by calculating the element-wise maximum between the image attention map A g and the instance attention map A l. That is, A = max(a l,a g ). For both the heatmaph and the attention mapa, only the classes existing in the image are considered. The background maps of A andh are defined as follows, A 0 = max(0,1 Σ C l=1y l A l ), H 0 = max(0,1 Σ C l=1y l H l ). Now bothaandh become(c+1) H W matrices. For the l-th channel, if y l = 0, A l = 0 and H l = 0. Then we perform softmax on both maps along the channel dimension independently. The final probability map P is defined as the result of applying normalization in the class channel to the element-wise product between A and H by treating H as a filter. That is, P = normalize(h A). Pixel Labeling with Uncertainty. Pixel labels Y I are initialized with the probability map P. For every pixel p, if the maximum element in its label vectory p is larger than a threshold (=0.8), we simply set the maximum element to 1 and other elements to 0; otherwise, the class label at p is uncertain. To inspect these uncertain pixels more carefully, we obtain additional evidence by computing their saliency scores S (normalized into [0, 1]) using an existing state-ofthe-art salient object detection algorithm [25]. Given an uncertain pixel q with a high saliency score (S q 0.3), if the maximum element in its label vector y q is larger than a threshold (=0.6) and this element does not correspond to the background, we set the maximum element to 1 and other elements to 0. Given another uncertain pixel o with a low saliency score (S o < 0.3), if the maximum element in its label vector y o corresponds to the background, we set the background element to 1 and other elements to Object Recognition, Detection and Segmentation 4.1. Semantic Segmentation Given pixel-wise labels generated at the end of the pixel level stage for all training images, we train a fully convolutional network (FCN) similar to the network in [27] to perform semantic segmentation. Note that all pixels with uncertain class labels are excluded during training. In the prediction part, we adopt atrous spatial pyramid pooling as in [5]. The resulting trained network can be used for labeling all pixels in any testing image as well as pixels with uncertain labels in all training images. 1281

6 (a) Input/Image (b) Object Heatmap (c) Image Attention (d) Instance Attention (e) Probability (e) Segmentation Figure 4. The pixel labeling process in the pixel level stage. White pixels in the last column indicate pixels with uncertain labels Object Detection Once all pixels with uncertain labels in the training images have been re-labeled using the above network for semantic segmentation, we generate object instances in these images by computing bounding boxes of connected pixels sharing the same semantic label. As in [38] and [24], we train fast RCNN [13] using these bounding boxes and their associated labels. Since the bounding boxes generated from the semantic label maps may contain noise, we filter them using our object instance classifier as in Section 3.3. VGG- 16 is still the base network of our object detector, which is trained with five scales and flip as in [38] Multi label Classification The main component in our multi-label classification network is the structure of ResNet-101 [16]. There are two branches after layer res4b22 relu of the main component, one branch for classification and the other for semantic segmentation. Both branches share the same structure after layer res4b22 relu. Here we adopt multi-task learning to train both branches. The idea is using the training data for the segmentation branch to make the convolutional kernels in the main component more discriminative and powerful. This network architecture is shown in the supplemental materials. Layer pool5 of ResNet-101 in the classification branch is removed, and the output X( R ) of layer res5c is a matrix. X is directly fed into a C convolutional layer, and a classification map Ŷ cls( R C ) is obtained. We let the semantic label map Ŷ seg( R C ) play the role of an attention map Ŷ att after the summation over each channel of the semantic label map is normalized to 1. The final image level probability vector ŷ is the result of spatial average pooling over the element-wise product between Ŷ cls and Ŷ att. Here Ŷ att is used to identify important image regions and assign them larger weights. At the end, the probability vectorŷ is fully connected to an output layer, which performs binary classification for each of the C classes. The cross-entropy loss is used for training the multi-label classification network. The segmentation branch uses atrous spatial pyramid pooling to perform semantic segmentation, and softmax is applied to enforce a single label per pixel. 5. Experimental Results All our experiments are implemented using Caffe [18] and run on an NVIDIA TITAN X(Maxwell) GPU with 12GB memory. The hyper-parameters in Section 3 are set according to common sense and confirmed after we visually verify that the segmentation results on a few training samples are valid. The same parameter setting is used for all datasets and has not been tuned on any validation sets Semantic Segmentation Datasets and performance measures. The Pascal VOC 2012 dataset [11] serves as a benchmark in most existing work on weakly-supervised semantic segmentation. It has 21 classes and training images (the VOC 2012 training set and additional data annotated in [15]), 1449 for validation and 1456 for testing. Only image tags are used as training data in our experiments. We report results on both the validation (supplemental materials) and test sets. Implementation details. Our network is based on VGG- 16. The layers after relu5 3 and layer pool4 are removed. Dilations in layers conv5 1, conv5 2, and conv5 3 are set to 2. The feature strideλ s at layerrelu5 3 is 8. We add the atrous spatial pyramid pooling as in DeepLab V3 [5] after layer relu5 3. The dilations in our atrous spatial pyramid pooling layers are [1,2,4,6]. This FCN is implemented in py-faster-rcnn [30]. For data augmentation, we use five image scales (480, 576, 688, 864, 1024) (the shorter side is resized to one of these scales) and horizontal flip, and cap the longer side at During testing, the original size of an input image is preserved. The network is fine-tuned from the pre-trained model for ImageNet in [35]. The learning rate γ is set to in the first 20k iterations, and in the next 20k iterations. The weight decay is , and the mini-batch size is 1. Post-processing using CRF [22] is added during testing. Result comparison. We compare our method with existing state-of-the-art algorithms. Table 1 lists the results of weakly supervised semantic segmentation on Pascal VOC The proposed method achieves 55.6% mean IoU, comparable to the state of the art (AE-SPL [43]). Recent al- 1282

7 Figure 5. The detection and semantic segmentation results on Pascal VOC 2012 test set (the first row) and Pascal VOC 2007 test set (the second row). The detection results are gotten by select proposals with the highest confidence of every class. The semantic segmentation results are post-processed by CRF [22]. method bg aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv miou SEC[21] FCL[32] TP-BM[20] AE-PSL[43] Ours+CRF Table 1. Comparison among weakly supervised semantic segmentation methods on PASCAL VOC 2012segmentation test set. gorithms, including AE-PSL[43], F-B [33], FCL [32], and SEC [21], all conduct end-to-end training to learn object score maps. Our method demonstrates that if we filter and integrate multiple types of intermediate evidences at different granularities during weakly supervised training, the results become equally competitive or even better Object Detection Datasets and performance measures. The performance of our object detector in Section 4.2 is evaluated on the popular Pascal VOC 2007 and Pascal VOC 2012 datasets [11]. Each of these two datasets is divided into train, val and test sets. The trainval sets (5011 images for 2007 and images for 2012) are used for training, and only image tags are used. Two measures are used to test our model: map and CorLoc. According to the standard Pascal VOC protocol, the mean average precision (map) is used for testing our trained models on the test sets, and the correct localization (CorLoc) is used for measuring the object localization accuracy [6] on the trainval sets whose image tags are already used as training data. Implementation details. We use the code for py-fasterrcnn [30] to implement fast R-CNN [13]. The network is still VGG-16. The learning rate is set to in the first 30k iterations, and in the next 10k iterations. The momentum and weight decay are set to 0.9 and respectively. We follow the same data augmentation setting in [38], use five image scales (480, 576, 688, 864, 1200) and horizontal flip, and cap the longer image side at Result comparison. Object detection results on Pascal VOC 2007 test set (Table 2) and Pascal VOC 2012 test set (supplemental materials) are reported. Object localization results on Pascal VOC 2007 trainval set and Pascal VOC 2012 trainval set are also reported (supplemental material). On Pascal VOC 2012 test set, our algorithm achieves the highest map (47.5%), at least 5.0% higher than the latest state-of-the-art algorithms including OICR [38] and HCP+DSD+OSSH3[19]. Our trained model also achieves the highest map (51.2%) among all weakly supervised algorithms on Pascal VOC 2007 test set, 4.2% higher than the latest result from [38]. The object localization accuracy (CorLoc) of our trained model on Pascal VOC 2007 trainval set and Pascal VOC 2012 trainval set are respectively 67% and 69.4%, which are 2.7% and 3.8% higher than the previous best Multi Label Classification Dataset and performance measures. Microsoft COCO [26] is the most popular dataset in multi-label classification. MS-COCO was primarily built for object recognition tasks in the context of scene understanding. The training set is composed of images in 80 classes, on average 2.9 object labels per image. Since the groundtruth labels of the test set is not available, performance evaluation is conducted on the validation set with images. We train our models on the training set and test them on the validation set. Performance measures for multi-label classification is 1283

8 method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv map OM+MIL+FRCNN[24] HCP+DSD+OSSH3[19] OICR-Ens+FRCNN[38] Ours+FRCNN w/o clustering Ours+FRCNN w/o uncertainty Ours+FRCNN w/o instances Ours+FRCNN Table 2. Average precision (in %) of weakly supervised methods on PASCAL VOC 2007detection test set. method F1-C P-C R-C F1-O P-O R-O F1-C/top3 P-C/top3 R-C/top3 F1-O/top3 P-O/top3 R-O/top3 CNN-RNN[41] RLSD[45] RNN-Attention[42] ResNet101-SRN[47] ResNet101( )(baseline) Ours Table 3. Performance comparison among multi-label classification methods on Microsoft COCO 2014validation set. quite different from those for single-label classification. Following [47, 42], we employ macro/micro precision, macro/micro recall, and macro/micro F1-measure to evaluate our trained models. For precision, recall and F1- measure, labels with confidence higher than 0.5 are considered positive. P-C, R-C and F1-C represent the average per-class precision, recall and F1-measure while P-O, R-O and F1-O represent the average overall precision, recall and F1-measure. These measures do not require a fixed number of labels per image. To compare with existing state-of-the-art algorithms, we also report the results of top-3 labels with confidence higher than 0.5 as in [42]. Implementation details. Our main network for multi-label classification is ResNet-101 as described earlier. The resolution of the input images is at We first train a network with the classification branch only. As a common practice, a pre-trained model for ImageNet is fine-tuned with the learning rate γ set to in the first 20k iterations, and in the next 20k iterations. The weight decay is Then we add the segmentation branch and train this new branch only by fixing all the layers before layer res4b22 relu and the classification branch. The learning rate is set to in the frist 20k iterations, and in the next 20k iterations. At last, we train the entire network with both branches using the cross-entropy loss for multi-label classification for 30k iterations with a learning rate while still fixing the layers before layer res4b22 relu. Result comparison. In addition to our two-branch network, we also train a ResNet-101 classification network as our baseline. The multi-label classification performance of both networks on MS-COCO is reported in Table 3. Since the input resolution of our baseline is , in comparison to the latest work (ResNet101-SRN) [47], the performance of our baseline is slightly better. Specifically, the F1-C of our baseline is 72.8%, which is 2.8% higher than the F1- C of ResNet101-SRN. In comparison to the baseline, our two-branch network further achieves overall better performance. Specifically, the P-C of our two-branch network is 6.6% higher than the baseline, the R-C is 2.7% lower, and the F1-C is 2.1% higher. All F1-measures (F1-C, F1-O, F1- C/top3 and F1-O/top3) of our two-branch network are the highest among all state-of-the-art algorithms Ablation Study We perform an ablation study on Pascal VOC 2007 detection test set by replacing or removing a single component in our pipeline every time. First, to verify the importance of object instances, we remove all steps related to object instances, including the entire instance level stage and the operations related to the instance attention map in the pixel level stage. The map is decreased by 3.1% as shown in Table 2. Second, the clustering and outlier detection step in the instance level stage is removed. We directly train an instance classifier using the object proposals from the image level stage. The map is decreased by 2.7%. Third, instead of labeling a subset of pixels only in the pixel level stage, we assign a unique label to every pixel even in the case of low confidence. The map drops to 47.5%, 3.7% lower than the performance of the original pipeline. 6. Conclusions In this paper, we have presented a new pipeline for weakly supervised object recognition, detection and segmentation. Different from previous algorithms, we fuse and filter object instances from different techniques and perform pixel labeling with uncertainty. We use the resulting pixelwise labels to generate groundtruth bounding boxes for object detection and attention maps for multi-label classification. Our pipeline has achieved clearly better performance in all of these tasks. Nevertheless, how to simplify the steps in our pipeline deserves further investigation. 1284

9 References [1] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for nonlinear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e , [2] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July [3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages ACM, , 3 [4] H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages , [5] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arxiv preprint arxiv: , , 6 [6] T. Deselaers, B. Alexe, and V. Ferrari. Weakly supervised localization and learning with generic knowledge. International journal of computer vision, 100(3): , [7] A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, and L. Van Gool. Weakly supervised cascaded convolutional networks. arxiv preprint arxiv: , [8] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence, 89(1):31 71, [9] T. Durand, T. Mordan, N. Thome, and M. Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), [10] T. Durand, N. Thome, and M. Cord. Weldon: Weakly supervised learning of deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages , [11] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98 136, , 7 [12] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2): , [13] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages , , 7 [14] A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu. Automated curriculum learning for neural networks. arxiv preprint arxiv: , [15] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages IEEE, [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages , [17] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7): , [18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages ACM, [19] Z. Jie, Y. Wei, X. Jin, J. Feng, and W. Liu. Deep self-taught learning for weakly supervised object localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July , 8 [20] D. Kim, D. Cho, D. Yoo, and I. So Kweon. Two-phase learning for weakly supervised object localization. In The IEEE International Conference on Computer Vision (ICCV), Oct [21] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In European Conference on Computer Vision, pages Springer, [22] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pages , , 7 [23] S. Lapuschkin, A. Binder, G. Montavon, K.-R. Muller, and W. Samek. Analyzing classifiers: Fisher vectors and deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages , , 2 [24] D. Li, J.-B. Huang, Y. Li, S. Wang, and M.-H. Yang. Weakly supervised object localization with progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages , , 8 [25] G. Li and Y. Yu. Deep contrast learning for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages , [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages Springer, , 7 [27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages , [28] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages , [29] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In 1285

10 Advances in neural information processing systems, pages 91 99, [30] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), , 7 [31] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 344(6191): , [32] A. Roy and S. Todorovic. Combining bottom-up, top-down, and smoothness cues for weakly supervised image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages , , 7 [33] F. Saleh, M. S. A. Akbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez. Built-in foreground/background prior for weakly-supervised semantic segmentation. In European Conference on Computer Vision, pages Springer, [34] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages , [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/ , [36] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, COL- ORADO UNIV AT BOULDER DEPT OF COMPUTER SCIENCE, [37] L. Sun, Q. Huo, W. Jia, and K. Chen. A robust approach for text detection from natural scene images. Pattern Recognition, 48(9): , [38] P. Tang, X. Wang, X. Bai, and W. Liu. Multiple instance detection network with online instance classifier refinement. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July , 4, 6, 7, 8 [39] G. Tsagkatakis and A. Savakis. Online distance metric learning for object tracking. IEEE Transactions on Circuits and Systems for Video Technology, 21(12): , [40] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2): , [41] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu. Cnn-rnn: A unified framework for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages , [42] Z. Wang, T. Chen, G. Li, R. Xu, and L. Lin. Multi-label image recognition by recurrently discovering attentional regions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages , [43] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In IEEE CVPR, , 7 [44] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Topdown neural attention by excitation backprop. In European Conference on Computer Vision, pages Springer, , 2, 4 [45] J. Zhang, Q. Wu, C. Shen, J. Zhang, and J. Lu. Multi-label image classification with regional latent semantic dependencies. arxiv preprint arxiv: , [46] L. Zheng, Y. Yang, and A. G. Hauptmann. Person reidentification: Past, present and future. arxiv preprint arxiv: , [47] F. Zhu, H. Li, W. Ouyang, N. Yu, and X. Wang. Learning spatial regularization with image-level supervisions for multi-label image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July [48] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages Springer,

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION Kingsley Kuan 1, Gaurav Manek 1, Jie Lin 1, Yuan Fang 1, Vijay Chandrasekhar 1,2 Institute for Infocomm Research, A*STAR, Singapore 1 Nanyang Technological

More information

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun Presented by Tushar Bansal Objective 1. Get bounding box for all objects

More information

Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network Anurag Arnab and Philip H.S. Torr University of Oxford {anurag.arnab, philip.torr}@eng.ox.ac.uk 1. Introduction

More information

Content-Based Image Recovery

Content-Based Image Recovery Content-Based Image Recovery Hong-Yu Zhou and Jianxin Wu National Key Laboratory for Novel Software Technology Nanjing University, China zhouhy@lamda.nju.edu.cn wujx2001@nju.edu.cn Abstract. We propose

More information

Feature-Fused SSD: Fast Detection for Small Objects

Feature-Fused SSD: Fast Detection for Small Objects Feature-Fused SSD: Fast Detection for Small Objects Guimei Cao, Xuemei Xie, Wenzhe Yang, Quan Liao, Guangming Shi, Jinjian Wu School of Electronic Engineering, Xidian University, China xmxie@mail.xidian.edu.cn

More information

Object Detection Based on Deep Learning

Object Detection Based on Deep Learning Object Detection Based on Deep Learning Yurii Pashchenko AI Ukraine 2016, Kharkiv, 2016 Image classification (mostly what you ve seen) http://tutorial.caffe.berkeleyvision.org/caffe-cvpr15-detection.pdf

More information

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection Zeming Li, 1 Yilun Chen, 2 Gang Yu, 2 Yangdong

More information

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network Liwen Zheng, Canmiao Fu, Yong Zhao * School of Electronic and Computer Engineering, Shenzhen Graduate School of

More information

Gradient of the lower bound

Gradient of the lower bound Weakly Supervised with Latent PhD advisor: Dr. Ambedkar Dukkipati Department of Computer Science and Automation gaurav.pandey@csa.iisc.ernet.in Objective Given a training set that comprises image and image-level

More information

Object detection with CNNs

Object detection with CNNs Object detection with CNNs 80% PASCAL VOC mean0average0precision0(map) 70% 60% 50% 40% 30% 20% 10% Before CNNs After CNNs 0% 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 year Region proposals

More information

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK Wenjie Guan, YueXian Zou*, Xiaoqun Zhou ADSPLAB/Intelligent Lab, School of ECE, Peking University, Shenzhen,518055, China

More information

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Deep learning for object detection. Slides from Svetlana Lazebnik and many others Deep learning for object detection Slides from Svetlana Lazebnik and many others Recent developments in object detection 80% PASCAL VOC mean0average0precision0(map) 70% 60% 50% 40% 30% 20% 10% Before deep

More information

YOLO9000: Better, Faster, Stronger

YOLO9000: Better, Faster, Stronger YOLO9000: Better, Faster, Stronger Date: January 24, 2018 Prepared by Haris Khan (University of Toronto) Haris Khan CSC2548: Machine Learning in Computer Vision 1 Overview 1. Motivation for one-shot object

More information

Fully Convolutional Networks for Semantic Segmentation

Fully Convolutional Networks for Semantic Segmentation Fully Convolutional Networks for Semantic Segmentation Jonathan Long* Evan Shelhamer* Trevor Darrell UC Berkeley Chaim Ginzburg for Deep Learning seminar 1 Semantic Segmentation Define a pixel-wise labeling

More information

arxiv: v1 [cs.cv] 12 Jun 2018

arxiv: v1 [cs.cv] 12 Jun 2018 Weakly-Supervised Semantic Segmentation by Iteratively Mining Common Object Features Xiang Wang 1 Shaodi You 2,3 Xi Li 1 Huimin Ma 1 1 Department of Electronic Engineering, Tsinghua University 2 DATA61-CSIRO

More information

Lecture 7: Semantic Segmentation

Lecture 7: Semantic Segmentation Semantic Segmentation CSED703R: Deep Learning for Visual Recognition (207F) Segmenting images based on its semantic notion Lecture 7: Semantic Segmentation Bohyung Han Computer Vision Lab. bhhanpostech.ac.kr

More information

HIERARCHICAL JOINT-GUIDED NETWORKS FOR SEMANTIC IMAGE SEGMENTATION

HIERARCHICAL JOINT-GUIDED NETWORKS FOR SEMANTIC IMAGE SEGMENTATION HIERARCHICAL JOINT-GUIDED NETWORKS FOR SEMANTIC IMAGE SEGMENTATION Chien-Yao Wang, Jyun-Hong Li, Seksan Mathulaprangsan, Chin-Chin Chiang, and Jia-Ching Wang Department of Computer Science and Information

More information

Supplementary Material: Unconstrained Salient Object Detection via Proposal Subset Optimization

Supplementary Material: Unconstrained Salient Object Detection via Proposal Subset Optimization Supplementary Material: Unconstrained Salient Object via Proposal Subset Optimization 1. Proof of the Submodularity According to Eqns. 10-12 in our paper, the objective function of the proposed optimization

More information

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Efficient Segmentation-Aided Text Detection For Intelligent Robots Efficient Segmentation-Aided Text Detection For Intelligent Robots Junting Zhang, Yuewei Na, Siyang Li, C.-C. Jay Kuo University of Southern California Outline Problem Definition and Motivation Related

More information

Weakly Supervised Region Proposal Network and Object Detection

Weakly Supervised Region Proposal Network and Object Detection Weakly Supervised Region Proposal Network and Object Detection Peng Tang, Xinggang Wang, Angtian Wang, Yongluan Yan, Wenyu Liu ( ), Junzhou Huang 2,3, Alan Yuille 4 School of EIC, Huazhong University of

More information

Unified, real-time object detection

Unified, real-time object detection Unified, real-time object detection Final Project Report, Group 02, 8 Nov 2016 Akshat Agarwal (13068), Siddharth Tanwar (13699) CS698N: Recent Advances in Computer Vision, Jul Nov 2016 Instructor: Gaurav

More information

Channel Locality Block: A Variant of Squeeze-and-Excitation

Channel Locality Block: A Variant of Squeeze-and-Excitation Channel Locality Block: A Variant of Squeeze-and-Excitation 1 st Huayu Li Northern Arizona University Flagstaff, United State Northern Arizona University hl459@nau.edu arxiv:1901.01493v1 [cs.lg] 6 Jan

More information

TS 2 C: Tight Box Mining with Surrounding Segmentation Context for Weakly Supervised Object Detection

TS 2 C: Tight Box Mining with Surrounding Segmentation Context for Weakly Supervised Object Detection TS 2 C: Tight Box Mining with Surrounding Segmentation Context for Weakly Supervised Object Detection Yunchao Wei 1, Zhiqiang Shen 1,2, Bowen Cheng 1, Honghui Shi 3, Jinjun Xiong 3, Jiashi Feng 4, and

More information

arxiv: v1 [cs.cv] 31 Mar 2016

arxiv: v1 [cs.cv] 31 Mar 2016 Object Boundary Guided Semantic Segmentation Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu and C.-C. Jay Kuo arxiv:1603.09742v1 [cs.cv] 31 Mar 2016 University of Southern California Abstract.

More information

arxiv: v1 [cs.cv] 13 Jul 2018

arxiv: v1 [cs.cv] 13 Jul 2018 arxiv:1807.04897v1 [cs.cv] 13 Jul 2018 TS 2 C: Tight Box Mining with Surrounding Segmentation Context for Weakly Supervised Object Detection Yunchao Wei 1, Zhiqiang Shen 1,2, Bowen Cheng 1, Honghui Shi

More information

arxiv: v1 [cs.cv] 24 May 2016

arxiv: v1 [cs.cv] 24 May 2016 Dense CNN Learning with Equivalent Mappings arxiv:1605.07251v1 [cs.cv] 24 May 2016 Jianxin Wu Chen-Wei Xie Jian-Hao Luo National Key Laboratory for Novel Software Technology, Nanjing University 163 Xianlin

More information

Towards Weakly- and Semi- Supervised Object Localization and Semantic Segmentation

Towards Weakly- and Semi- Supervised Object Localization and Semantic Segmentation Towards Weakly- and Semi- Supervised Object Localization and Semantic Segmentation Lecturer: Yunchao Wei Image Formation and Processing (IFP) Group University of Illinois at Urbanahttps://weiyc.githu Champaign

More information

arxiv: v4 [cs.cv] 6 Jul 2016

arxiv: v4 [cs.cv] 6 Jul 2016 Object Boundary Guided Semantic Segmentation Qin Huang, Chunyang Xia, Wenchao Zheng, Yuhang Song, Hao Xu, C.-C. Jay Kuo (qinhuang@usc.edu) arxiv:1603.09742v4 [cs.cv] 6 Jul 2016 Abstract. Semantic segmentation

More information

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm Instructions This is an individual assignment. Individual means each student must hand in their

More information

RSRN: Rich Side-output Residual Network for Medial Axis Detection

RSRN: Rich Side-output Residual Network for Medial Axis Detection RSRN: Rich Side-output Residual Network for Medial Axis Detection Chang Liu, Wei Ke, Jianbin Jiao, and Qixiang Ye University of Chinese Academy of Sciences, Beijing, China {liuchang615, kewei11}@mails.ucas.ac.cn,

More information

Lecture 5: Object Detection

Lecture 5: Object Detection Object Detection CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 5: Object Detection Bohyung Han Computer Vision Lab. bhhan@postech.ac.kr 2 Traditional Object Detection Algorithms Region-based

More information

arxiv: v2 [cs.cv] 18 Jul 2017

arxiv: v2 [cs.cv] 18 Jul 2017 PHAM, ITO, KOZAKAYA: BISEG 1 arxiv:1706.02135v2 [cs.cv] 18 Jul 2017 BiSeg: Simultaneous Instance Segmentation and Semantic Segmentation with Fully Convolutional Networks Viet-Quoc Pham quocviet.pham@toshiba.co.jp

More information

arxiv: v2 [cs.cv] 8 Apr 2018

arxiv: v2 [cs.cv] 8 Apr 2018 Single-Shot Object Detection with Enriched Semantics Zhishuai Zhang 1 Siyuan Qiao 1 Cihang Xie 1 Wei Shen 1,2 Bo Wang 3 Alan L. Yuille 1 Johns Hopkins University 1 Shanghai University 2 Hikvision Research

More information

arxiv: v1 [cs.cv] 15 Oct 2018

arxiv: v1 [cs.cv] 15 Oct 2018 Instance Segmentation and Object Detection with Bounding Shape Masks Ha Young Kim 1,2,*, Ba Rom Kang 2 1 Department of Financial Engineering, Ajou University Worldcupro 206, Yeongtong-gu, Suwon, 16499,

More information

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS Kuan-Chuan Peng and Tsuhan Chen School of Electrical and Computer Engineering, Cornell University, Ithaca, NY

More information

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma Mask R-CNN presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma Mask R-CNN Background Related Work Architecture Experiment Mask R-CNN Background Related Work Architecture Experiment Background From left

More information

Segmenting Objects in Weakly Labeled Videos

Segmenting Objects in Weakly Labeled Videos Segmenting Objects in Weakly Labeled Videos Mrigank Rochan, Shafin Rahman, Neil D.B. Bruce, Yang Wang Department of Computer Science University of Manitoba Winnipeg, Canada {mrochan, shafin12, bruce, ywang}@cs.umanitoba.ca

More information

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta Encoder-Decoder Networks for Semantic Segmentation Sachin Mehta Outline > Overview of Semantic Segmentation > Encoder-Decoder Networks > Results What is Semantic Segmentation? Input: RGB Image Output:

More information

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong TABLE I CLASSIFICATION ACCURACY OF DIFFERENT PRE-TRAINED MODELS ON THE TEST DATA

More information

Spatial Localization and Detection. Lecture 8-1

Spatial Localization and Detection. Lecture 8-1 Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday

More information

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution and Fully Connected CRFs Zhipeng Yan, Moyuan Huang, Hao Jiang 5/1/2017 1 Outline Background semantic segmentation Objective,

More information

Optimizing Object Detection:

Optimizing Object Detection: Lecture 10: Optimizing Object Detection: A Case Study of R-CNN, Fast R-CNN, and Faster R-CNN Visual Computing Systems Today s task: object detection Image classification: what is the object in this image?

More information

arxiv: v1 [cs.cv] 1 Apr 2017

arxiv: v1 [cs.cv] 1 Apr 2017 Multiple Instance Detection Network with Online Instance Classifier Refinement Peng Tang Xinggang Wang Xiang Bai Wenyu Liu School of EIC, Huazhong University of Science and Technology {pengtang,xgwang,xbai,liuwy}@hust.edu.cn

More information

Fewer is More: Image Segmentation Based Weakly Supervised Object Detection with Partial Aggregation

Fewer is More: Image Segmentation Based Weakly Supervised Object Detection with Partial Aggregation GE, WANG, QI: FEWER IS MORE 1 Fewer is More: Image Segmentation Based Weakly Supervised Object Detection with Partial Aggregation Ce Ge nwlgc@bupt.edu.cn Jingyu Wang wangjingyu@bupt.edu.cn Qi Qi qiqi@ebupt.com

More information

PASCAL VOC Classification: Local Features vs. Deep Features. Shuicheng YAN, NUS

PASCAL VOC Classification: Local Features vs. Deep Features. Shuicheng YAN, NUS PASCAL VOC Classification: Local Features vs. Deep Features Shuicheng YAN, NUS PASCAL VOC Why valuable? Multi-label, Real Scenarios! Visual Object Recognition Object Classification Object Detection Object

More information

Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task

Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task Kyunghee Kim Stanford University 353 Serra Mall Stanford, CA 94305 kyunghee.kim@stanford.edu Abstract We use a

More information

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang SSD: Single Shot MultiBox Detector Author: Wei Liu et al. Presenter: Siyu Jiang Outline 1. Motivations 2. Contributions 3. Methodology 4. Experiments 5. Conclusions 6. Extensions Motivation Motivation

More information

G-CNN: an Iterative Grid Based Object Detector

G-CNN: an Iterative Grid Based Object Detector G-CNN: an Iterative Grid Based Object Detector Mahyar Najibi 1, Mohammad Rastegari 1,2, Larry S. Davis 1 1 University of Maryland, College Park 2 Allen Institute for Artificial Intelligence najibi@cs.umd.edu

More information

arxiv: v1 [cs.cv] 9 Aug 2017

arxiv: v1 [cs.cv] 9 Aug 2017 BlitzNet: A Real-Time Deep Network for Scene Understanding Nikita Dvornik Konstantin Shmelkov Julien Mairal Cordelia Schmid Inria arxiv:1708.02813v1 [cs.cv] 9 Aug 2017 Abstract Real-time scene understanding

More information

Yiqi Yan. May 10, 2017

Yiqi Yan. May 10, 2017 Yiqi Yan May 10, 2017 P a r t I F u n d a m e n t a l B a c k g r o u n d s Convolution Single Filter Multiple Filters 3 Convolution: case study, 2 filters 4 Convolution: receptive field receptive field

More information

[Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors

[Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors [Supplementary Material] Improving Occlusion and Hard Negative Handling for Single-Stage Pedestrian Detectors Junhyug Noh Soochan Lee Beomsu Kim Gunhee Kim Department of Computer Science and Engineering

More information

arxiv: v1 [cs.cv] 4 Jun 2015

arxiv: v1 [cs.cv] 4 Jun 2015 Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks arxiv:1506.01497v1 [cs.cv] 4 Jun 2015 Shaoqing Ren Kaiming He Ross Girshick Jian Sun Microsoft Research {v-shren, kahe, rbg,

More information

PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL

PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL Yingxin Lou 1, Guangtao Fu 2, Zhuqing Jiang 1, Aidong Men 1, and Yun Zhou 2 1 Beijing University of Posts and Telecommunications, Beijing,

More information

Classifying a specific image region using convolutional nets with an ROI mask as input

Classifying a specific image region using convolutional nets with an ROI mask as input Classifying a specific image region using convolutional nets with an ROI mask as input 1 Sagi Eppel Abstract Convolutional neural nets (CNN) are the leading computer vision method for classifying images.

More information

Multi-Glance Attention Models For Image Classification

Multi-Glance Attention Models For Image Classification Multi-Glance Attention Models For Image Classification Chinmay Duvedi Stanford University Stanford, CA cduvedi@stanford.edu Pararth Shah Stanford University Stanford, CA pararth@stanford.edu Abstract We

More information

Structured Prediction using Convolutional Neural Networks

Structured Prediction using Convolutional Neural Networks Overview Structured Prediction using Convolutional Neural Networks Bohyung Han bhhan@postech.ac.kr Computer Vision Lab. Convolutional Neural Networks (CNNs) Structured predictions for low level computer

More information

Dataset Augmentation with Synthetic Images Improves Semantic Segmentation

Dataset Augmentation with Synthetic Images Improves Semantic Segmentation Dataset Augmentation with Synthetic Images Improves Semantic Segmentation P. S. Rajpura IIT Gandhinagar param.rajpura@iitgn.ac.in M. Goyal IIT Varanasi manik.goyal.cse15@iitbhu.ac.in H. Bojinov Innit Inc.

More information

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b 1

More information

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks Si Chen The George Washington University sichen@gwmail.gwu.edu Meera Hahn Emory University mhahn7@emory.edu Mentor: Afshin

More information

Pixel Offset Regression (POR) for Single-shot Instance Segmentation

Pixel Offset Regression (POR) for Single-shot Instance Segmentation Pixel Offset Regression (POR) for Single-shot Instance Segmentation Yuezun Li 1, Xiao Bian 2, Ming-ching Chang 1, Longyin Wen 2 and Siwei Lyu 1 1 University at Albany, State University of New York, NY,

More information

Dense Image Labeling Using Deep Convolutional Neural Networks

Dense Image Labeling Using Deep Convolutional Neural Networks Dense Image Labeling Using Deep Convolutional Neural Networks Md Amirul Islam, Neil Bruce, Yang Wang Department of Computer Science University of Manitoba Winnipeg, MB {amirul, bruce, ywang}@cs.umanitoba.ca

More information

Real-time Object Detection CS 229 Course Project

Real-time Object Detection CS 229 Course Project Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection

More information

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab. [ICIP 2017] Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab., POSTECH Pedestrian Detection Goal To draw bounding boxes that

More information

arxiv: v1 [cs.cv] 26 May 2017

arxiv: v1 [cs.cv] 26 May 2017 arxiv:1705.09587v1 [cs.cv] 26 May 2017 J. JEONG, H. PARK AND N. KWAK: UNDER REVIEW IN BMVC 2017 1 Enhancement of SSD by concatenating feature maps for object detection Jisoo Jeong soo3553@snu.ac.kr Hyojin

More information

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Instance-aware Semantic Segmentation via Multi-task Network Cascades Instance-aware Semantic Segmentation via Multi-task Network Cascades Jifeng Dai, Kaiming He, Jian Sun Microsoft research 2016 Yotam Gil Amit Nativ Agenda Introduction Highlights Implementation Further

More information

MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection

MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection ILSVRC 2016 Object Detection from Video Byungjae Lee¹, Songguo Jin¹, Enkhbayar Erdenee¹, Mi Young Nam², Young Gui Jung², Phill Kyu

More information

Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation

Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation Optimizing Intersection-Over-Union in Deep Neural Networks for Image Segmentation Md Atiqur Rahman and Yang Wang Department of Computer Science, University of Manitoba, Canada {atique, ywang}@cs.umanitoba.ca

More information

arxiv: v3 [cs.cv] 13 Apr 2016 Abstract

arxiv: v3 [cs.cv] 13 Apr 2016 Abstract ProNet: Learning to Propose Object-specific Boxes for Cascaded Neural Networks Chen Sun 1,2 Manohar Paluri 2 Ronan Collobert 2 Ram Nevatia 1 Lubomir Bourdev 3 1 USC 2 Facebook AI Research 3 UC Berkeley

More information

CEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015

CEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015 CEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015 Etienne Gadeski, Hervé Le Borgne, and Adrian Popescu CEA, LIST, Laboratory of Vision and Content Engineering, France

More information

Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform Supplementary Material

Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform Supplementary Material Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform Supplementary Material Xintao Wang 1 Ke Yu 1 Chao Dong 2 Chen Change Loy 1 1 CUHK - SenseTime Joint Lab, The Chinese

More information

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech Convolutional Neural Networks Computer Vision Jia-Bin Huang, Virginia Tech Today s class Overview Convolutional Neural Network (CNN) Training CNN Understanding and Visualizing CNN Image Categorization:

More information

arxiv: v2 [cs.cv] 25 Jul 2018

arxiv: v2 [cs.cv] 25 Jul 2018 C-WSL: Count-guided Weakly Supervised Localization Mingfei Gao 1, Ang Li 2, Ruichi Yu 1, Vlad I. Morariu 3, and Larry S. Davis 1 1 University of Maryland, College Park 2 DeepMind 3 Adobe Research {mgao,richyu,lsd}@umiacs.umd.edu

More information

Computer Vision Lecture 16

Computer Vision Lecture 16 Computer Vision Lecture 16 Deep Learning Applications 11.01.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Seminar registration period starts

More information

Faster R-CNN Implementation using CUDA Architecture in GeForce GTX 10 Series

Faster R-CNN Implementation using CUDA Architecture in GeForce GTX 10 Series INTERNATIONAL JOURNAL OF ELECTRICAL AND ELECTRONIC SYSTEMS RESEARCH Faster R-CNN Implementation using CUDA Architecture in GeForce GTX 10 Series Basyir Adam, Fadhlan Hafizhelmi Kamaru Zaman, Member, IEEE,

More information

DeepBox: Learning Objectness with Convolutional Networks

DeepBox: Learning Objectness with Convolutional Networks DeepBox: Learning Objectness with Convolutional Networks Weicheng Kuo Bharath Hariharan Jitendra Malik University of California, Berkeley {wckuo, bharath2, malik}@eecs.berkeley.edu Abstract Existing object

More information

Final Report: Smart Trash Net: Waste Localization and Classification

Final Report: Smart Trash Net: Waste Localization and Classification Final Report: Smart Trash Net: Waste Localization and Classification Oluwasanya Awe oawe@stanford.edu Robel Mengistu robel@stanford.edu December 15, 2017 Vikram Sreedhar vsreed@stanford.edu Abstract Given

More information

Finding Tiny Faces Supplementary Materials

Finding Tiny Faces Supplementary Materials Finding Tiny Faces Supplementary Materials Peiyun Hu, Deva Ramanan Robotics Institute Carnegie Mellon University {peiyunh,deva}@cs.cmu.edu 1. Error analysis Quantitative analysis We plot the distribution

More information

Object Detection. TA : Young-geun Kim. Biostatistics Lab., Seoul National University. March-June, 2018

Object Detection. TA : Young-geun Kim. Biostatistics Lab., Seoul National University. March-June, 2018 Object Detection TA : Young-geun Kim Biostatistics Lab., Seoul National University March-June, 2018 Seoul National University Deep Learning March-June, 2018 1 / 57 Index 1 Introduction 2 R-CNN 3 YOLO 4

More information

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK 1 Po-Jen Lai ( 賴柏任 ), 2 Chiou-Shann Fuh ( 傅楸善 ) 1 Dept. of Electrical Engineering, National Taiwan University, Taiwan 2 Dept.

More information

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material

DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material DeepIM: Deep Iterative Matching for 6D Pose Estimation - Supplementary Material Yi Li 1, Gu Wang 1, Xiangyang Ji 1, Yu Xiang 2, and Dieter Fox 2 1 Tsinghua University, BNRist 2 University of Washington

More information

Human Pose Estimation with Deep Learning. Wei Yang

Human Pose Estimation with Deep Learning. Wei Yang Human Pose Estimation with Deep Learning Wei Yang Applications Understand Activities Family Robots American Heist (2014) - The Bank Robbery Scene 2 What do we need to know to recognize a crime scene? 3

More information

Computer Vision Lecture 16

Computer Vision Lecture 16 Announcements Computer Vision Lecture 16 Deep Learning Applications 11.01.2017 Seminar registration period starts on Friday We will offer a lab course in the summer semester Deep Robot Learning Topic:

More information

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks

Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Show, Discriminate, and Tell: A Discriminatory Image Captioning Model with Deep Neural Networks Zelun Luo Department of Computer Science Stanford University zelunluo@stanford.edu Te-Lin Wu Department of

More information

A Novel Representation and Pipeline for Object Detection

A Novel Representation and Pipeline for Object Detection A Novel Representation and Pipeline for Object Detection Vishakh Hegde Stanford University vishakh@stanford.edu Manik Dhar Stanford University dmanik@stanford.edu Abstract Object detection is an important

More information

arxiv: v1 [cs.cv] 26 Jun 2017

arxiv: v1 [cs.cv] 26 Jun 2017 Detecting Small Signs from Large Images arxiv:1706.08574v1 [cs.cv] 26 Jun 2017 Zibo Meng, Xiaochuan Fan, Xin Chen, Min Chen and Yan Tong Computer Science and Engineering University of South Carolina, Columbia,

More information

Learning Fully Dense Neural Networks for Image Semantic Segmentation

Learning Fully Dense Neural Networks for Image Semantic Segmentation Learning Fully Dense Neural Networks for Image Semantic Segmentation Mingmin Zhen 1, Jinglu Wang 2, Lei Zhou 1, Tian Fang 3, Long Quan 1 1 Hong Kong University of Science and Technology, 2 Microsoft Research

More information

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR Object Detection CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR Problem Description Arguably the most important part of perception Long term goals for object recognition: Generalization

More information

arxiv: v1 [cs.cv] 11 Apr 2018

arxiv: v1 [cs.cv] 11 Apr 2018 arxiv:1804.03821v1 [cs.cv] 11 Apr 2018 ExFuse: Enhancing Feature Fusion for Semantic Segmentation Zhenli Zhang 1, Xiangyu Zhang 2, Chao Peng 2, Dazhi Cheng 3, Jian Sun 2 1 Fudan University, 2 Megvii Inc.

More information

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia Deep learning for dense per-pixel prediction Chunhua Shen The University of Adelaide, Australia Image understanding Classification error Convolution Neural Networks 0.3 0.2 0.1 Image Classification [Krizhevsky

More information

Deep Learning for Object detection & localization

Deep Learning for Object detection & localization Deep Learning for Object detection & localization RCNN, Fast RCNN, Faster RCNN, YOLO, GAP, CAM, MSROI Aaditya Prakash Sep 25, 2018 Image classification Image classification Whole of image is classified

More information

Team G-RMI: Google Research & Machine Intelligence

Team G-RMI: Google Research & Machine Intelligence Team G-RMI: Google Research & Machine Intelligence Alireza Fathi (alirezafathi@google.com) Nori Kanazawa, Kai Yang, George Papandreou, Tyler Zhu, Jonathan Huang, Vivek Rathod, Chen Sun, Kevin Murphy, et

More information

arxiv: v2 [cs.cv] 28 Jul 2017

arxiv: v2 [cs.cv] 28 Jul 2017 Exploiting Web Images for Weakly Supervised Object Detection Qingyi Tao Hao Yang Jianfei Cai Nanyang Technological University, Singapore arxiv:1707.08721v2 [cs.cv] 28 Jul 2017 Abstract In recent years,

More information

Learning From Weakly Supervised Data by The Expectation Loss SVM (e-svm) algorithm

Learning From Weakly Supervised Data by The Expectation Loss SVM (e-svm) algorithm Learning From Weakly Supervised Data by The Expectation Loss SVM (e-svm) algorithm Jun Zhu Department of Statistics University of California, Los Angeles jzh@ucla.edu Junhua Mao Department of Statistics

More information

arxiv: v1 [cs.cv] 14 Dec 2015

arxiv: v1 [cs.cv] 14 Dec 2015 Instance-aware Semantic Segmentation via Multi-task Network Cascades Jifeng Dai Kaiming He Jian Sun Microsoft Research {jifdai,kahe,jiansun}@microsoft.com arxiv:1512.04412v1 [cs.cv] 14 Dec 2015 Abstract

More information

Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neural Networks

Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neural Networks Comparison of Fine-tuning and Extension Strategies for Deep Convolutional Neural Networks Nikiforos Pittaras 1, Foteini Markatopoulou 1,2, Vasileios Mezaris 1, and Ioannis Patras 2 1 Information Technologies

More information

Single-Shot Refinement Neural Network for Object Detection -Supplementary Material-

Single-Shot Refinement Neural Network for Object Detection -Supplementary Material- Single-Shot Refinement Neural Network for Object Detection -Supplementary Material- Shifeng Zhang 1,2, Longyin Wen 3, Xiao Bian 3, Zhen Lei 1,2, Stan Z. Li 4,1,2 1 CBSR & NLPR, Institute of Automation,

More information

Recurrent Attentional Reinforcement Learning for Multi-label Image Recognition

Recurrent Attentional Reinforcement Learning for Multi-label Image Recognition Recurrent Attentional Reinforcement Learning for Multi-label Image Recognition Tianshui Chen 1, Zhouxia Wang 1,2, Guanbin Li 1, Liang Lin 1,2 1 School of Data and Computer Science, Sun Yat-sen University,

More information

Geometry-aware Traffic Flow Analysis by Detection and Tracking

Geometry-aware Traffic Flow Analysis by Detection and Tracking Geometry-aware Traffic Flow Analysis by Detection and Tracking 1,2 Honghui Shi, 1 Zhonghao Wang, 1,2 Yang Zhang, 1,3 Xinchao Wang, 1 Thomas Huang 1 IFP Group, Beckman Institute at UIUC, 2 IBM Research,

More information

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing 3 Object Detection BVM 2018 Tutorial: Advanced Deep Learning Methods Paul F. Jaeger, of Medical Image Computing What is object detection? classification segmentation obj. detection (1 label per pixel)

More information

JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS. Zhao Chen Machine Learning Intern, NVIDIA

JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS. Zhao Chen Machine Learning Intern, NVIDIA JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS Zhao Chen Machine Learning Intern, NVIDIA ABOUT ME 5th year PhD student in physics @ Stanford by day, deep learning computer vision scientist

More information