RGB-D Object Recognition and Pose Estimation based on Pre-trained Convolutional Neural Network Features

Max Schwarz, Hannes Schulz, and Sven Behnke

Abstract— Object recognition and pose estimation from RGB-D images are important tasks for manipulation robots which can be learned from examples. Creating and annotating datasets for learning is expensive, however. We address this problem with transfer learning from deep convolutional neural networks (CNN) that are pre-trained for image categorization and provide a rich, semantically meaningful feature set. We incorporate depth information, which the CNN was not trained with, by rendering objects from a canonical perspective and colorizing the depth channel according to distance from the object center. We evaluate our approach on the Washington RGB-D Objects dataset, where we find that the generated feature set naturally separates classes and instances well and retains pose manifolds. We outperform state-of-the-art on a number of subtasks and show that our approach can yield superior results when only little training data is available.

I. INTRODUCTION

The recent success of deep convolutional neural networks (CNN) in computer vision can largely be attributed to massive amounts of data and immense processing speed for training these non-linear models. However, the amount of data available varies considerably depending on the task. Robotics applications in particular typically rely on very little data, since generating and annotating data is highly specific to the robot and the task (e.g., grasping) and thus prohibitively expensive. This paper addresses the problem of small datasets in robotic vision by reusing features learned by a CNN on a large-scale task and applying them to different tasks on a comparably small household object dataset. Work in other domains [1]–[3] demonstrated that this transfer learning is a promising alternative to feature design.

Figure 1 gives an overview of our approach. To employ a CNN, the image data needs to be carefully prepared. Our algorithm segments objects, removes confounding background information, and adjusts them to the input distribution of the CNN. Features computed by the CNN are then fed to a support vector machine (SVM) to determine object class, instance, and pose. While already on par with other state-of-the-art methods, this approach does not make use of depth information. Depth sensors are prevalent in today's robotics, but large datasets for CNN training are not available. Here, we propose to transform depth data into a representation which is easily interpretable by a CNN trained on color images. After detecting the ground plane and segmenting the object, we render it from a canonical pose and color it according to distance from the object center.

Fig. 1: Overview of our approach. Images are pre-processed by extracting foreground, reprojecting to canonical pose, and colorizing depth. The two images are processed independently by a convolutional neural network. CNN features are then used to successively determine category, instance, and pose.

All authors are with Rheinische Friedrich-Wilhelms-Universität Bonn, Computer Science Institute VI, Autonomous Intelligent Systems, Friedrich-Ebert-Allee 144, Bonn; schwarzm@cs.uni-bonn.de, schulzh@ais.uni-bonn.de, behnke@cs.uni-bonn.de
Combined with the image color features, this method outperforms other recent approaches on the Washington RGB-D Objects dataset on a number of subtasks. This dataset requires categorization of household objects, recognizing category instances, and estimating their pose.

In short, our contributions are as follows:
1) We introduce a novel pre-processing pipeline for RGB-D images facilitating CNN use for object categorization, instance recognition, and pose regression.
2) We analyze features produced by our pipeline and a pre-trained CNN. We show that they naturally separate common household object categories, as well as their instances, and produce low-dimensional pose manifolds.
3) We demonstrate that with discriminative training, our features improve state-of-the-art on the Washington RGB-D Objects classification dataset.
4) We show that, in contrast to previous work, only few instances are required to attain good results.
5) Finally, we demonstrate that with our method even category-level pose estimation is possible without sacrificing accuracy.

After discussing related work, we describe our feature extraction pipeline and supervised learning setup in Sections III and IV, respectively. We analyze the features and their performance in Section V.

II. RELATED WORK

Deep convolutional neural networks (CNN) [4]–[6] became the dominant method in the ImageNet large-scale image classification challenge [7] since the seminal work of Krizhevsky et al. [8]. Their success can mainly be attributed to large amounts of available data and fast GPU implementations, enabling the use of large non-linear models. To solve the task of distinguishing 1000 sometimes very similar object categories, these networks compute new representations of the image with repeated convolutions, spatial max-pooling [9], and non-linearities [8]. The higher-level representations are of special interest, as they provide a generic description of the image in an increasingly semantic feature space [10]. This observation is supported by the impressive performance of CNN features learned purely on classification tasks when applied to novel tasks in computer vision such as object detection [1], [2], subcategorization, domain adaptation, scene recognition [3], attribute detection, and instance retrieval [2]. In many cases, the results are on par with or surpass the state of the art on the respective datasets.

While Girshick et al. [1] report improvements when fine-tuning the CNN features on the new task, this approach is prone to overfitting in the case of very few training instances. We instead follow Donahue et al. [3] and only re-interpret features computed by the CNN. In contrast to the investigations of Donahue et al. [3] and Razavian et al. [2], we focus on a dataset in a robotic setting with few labeled instances. We use a pre-trained CNN in conjunction with preprocessed depth images, which is not addressed by the works discussed so far. Very recently, Gupta et al. [11] proposed a similar technique, where color channels are given by horizontal disparity, height above ground, and angle with the vertical. In contrast to their method, we propose an object-centered colorization scheme, which is tailored to the classification and pose estimation task.

Previous work on the investigated RGB-D Objects dataset begins with the dataset publication by Lai et al. [12], who use a combination of several hand-crafted features (SIFT, Texton, color histogram, spin images, 3D bounding boxes) and compare the performance of several classifiers (linear SVM, Gaussian kernel SVM, random forests). These baseline results have been improved significantly in later publications. Lai et al. [13] propose a very efficient hierarchical classification method, which optimizes classification and pose estimation jointly on all hierarchy levels. The method uses stochastic gradient descent (SGD) for training and is able to warm-start training when adding new objects to the hierarchy. While the training method is very interesting and could possibly be applied to this work, the reported results stay significantly behind the state of the art. Finally, Bo et al. [14] show a very significant improvement in classification accuracy and reduction in pose estimation error. Their method learns hierarchical feature representations from RGB-D Objects data without supervision using hierarchical matching pursuit (HMP). This work shows the promise of feature learning from raw data and is the current state-of-the-art approach on the RGB-D Objects dataset.

Fig. 2: Overview of the RGB preprocessing pipeline: (a) region of interest from object segmentation, (b) distance transform of the object mask, and (c) faded output used for CNN feature extraction.
However, the number of training examples must be suitably large to allow for robust feature learning, in contrast to our work, which uses pre-learned features and is thus able to learn from few examples. Of course, the feature learning process is not needed in our case, which leads to a significant runtime advantage for our method. Finally, our method generates fewer feature dimensions (10,192 vs. 188,300) and thus is also faster in the classifier training and recall steps.

III. CNN FEATURE EXTRACTION PIPELINE

In order to use a pre-trained CNN for feature extraction, it is necessary to preprocess the input data into the format that matches the training set of the neural network. CaffeNet [15], which we use here, expects square RGB images depicting one dominant object, as in the ImageNet Large Scale Visual Recognition Challenge [7].

A. RGB Image Preprocessing

Since the investigated CNN was trained on RGB images, not much pre-processing is needed to extract robust features from color images. Example images from the preprocessing pipeline can be seen in Fig. 2. We first crop the image to a square region of interest (see Fig. 2a). In a live situation, this region of interest is simply the bounding box of all object points determined by the tabletop segmentation (Section III-B). During evaluation on the Washington RGB-D Objects dataset (Section V), we use the provided object segmentation mask to determine the bounding box. The extracted image region is then scaled to fit the CNN input size, in our case 227×227 pixels.

To reduce the CNN's response to the background, we apply a fading operation to the image (Fig. 2c). Each pixel is interpolated between its RGB color $c_0 = (r_0, g_0, b_0)$ and the corresponding ILSVRC 2012 mean image pixel $c_m = (r_m, g_m, b_m)$ based on its pixel distance $r$ to the nearest object pixel:

$c := \alpha\, c_0 + (1 - \alpha)\, c_m$,  (1)

where

$\alpha := \begin{cases} 1 & \text{if } r = 0, \\ 0 & \text{if } r > R, \\ \left(\frac{R-r}{R}\right)^{\beta} & \text{otherwise.} \end{cases}$  (2)
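To make the fading step concrete, the following is a minimal NumPy/OpenCV sketch of Eqs. (1) and (2). The function name, the boolean-mask input format, and the use of cv2.distanceTransform are our assumptions, not details of the original implementation.

```python
import cv2
import numpy as np

def fade_background(rgb, mask, mean_image, R=30.0, beta=0.75):
    """Fade non-object pixels towards the dataset mean image (Eqs. 1-2).

    rgb, mean_image: HxWx3 float arrays; mask: HxW boolean object mask.
    """
    # r: pixel distance to the nearest object pixel (0 inside the mask).
    # distanceTransform measures distance to the nearest zero pixel,
    # so the mask is inverted (object = 0, background = 1).
    inv = (~mask).astype(np.uint8)
    r = cv2.distanceTransform(inv, cv2.DIST_L2, 5)

    # alpha = 1 on the object, 0 beyond radius R, ((R - r)/R)^beta between.
    alpha = np.clip((R - r) / R, 0.0, 1.0) ** beta
    alpha = alpha[..., None]  # broadcast over the color channels

    return alpha * rgb + (1.0 - alpha) * mean_image
```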

The fade radius R = 30 was manually tuned to exclude as much background as possible while keeping objects with non-optimal segmentation intact. The exponent β = 0.75 was later roughly tuned for best cross-validation score in the category-level classification.

B. Depth Image Preprocessing

Feeding depth images to a CNN poses a harder problem. The investigated CNN was trained on RGB images and is thus not expected to perform well on raw depth images. To address this, we render image-like views of the object from the depth data using a five-step pipeline, which will be detailed below. Figure 3 illustrates all steps for an example.

Fig. 3: Overview of the depth preprocessing pipeline. (a) input depth map, (b) extracted segmentation mask after tabletop segmentation [16], (c) region of interest containing only the object with unavailable depth pixels shown in red, (d) missing depth values filled in, (e) mesh extracted from point cloud, (f) reprojection of mesh to a canonical camera pose, (g) final image used for CNN feature extraction.

In the first step, we perform a basic segmentation to extract the horizontal surface the object is resting on. Following Holz et al. [16], we estimate surface normals from the depth image, discard points from non-horizontal surfaces, and register a planar model using Random Sample Consensus (RANSAC). The main objective here is to construct a local reference frame which is fixed in all dimensions, except for rotation around the vertical axis. To this end, we find points on the plane and extract object clusters with Euclidean Clustering (see Fig. 3b).

In a second step, we fill in holes in the depth map. We employ a common scheme based on the work of Levin et al. [17], who investigated the colorization of grayscale images from few given color pixels. The colorization is guided by the grayscale image in such a way that regions with similar intensity are colored the same. We fill in depth values using the same technique, guided by a grayscale version of the RGB image. This has the advantage of using the RGB information for disambiguation between different depth layers, whereas the standard approach of depth image median filtering cannot include color information. In detail, the objective of the fill-in step is to minimize the squared difference between the depth value $D(p)$ with $p = (u, v)$ and the weighted average of the depth at neighboring pixels,

$J(D) = \sum_{p} \Bigl( D(p) - \sum_{s \in N(p)} w_{ps} D(s) \Bigr)^2$,  (3)

with

$w_{ps} = \exp\bigl[ -(G(p) - G(s))^2 / (2\sigma_p^2) \bigr]$,  (4)

where $G(p)$ is the grayscale intensity of $p$ and $\sigma_p$ is the variance of intensity in a window around $p$ defined by $N(\cdot)$. Minimizing $J(D)$ leads to a sparse linear equation system, which we solve with the BiCGSTAB solver of the Eigen linear algebra package. A result of the fill-in operation can be seen in Fig. 3d.

After the fill-in operation, we filter the depth map using a shadow filter, where points whose normals are perpendicular to the view ray get discarded. This operation is only executed on the boundaries of the object to keep the object depth map continuous.

To increase invariance of the generated images against camera pitch angle changes, we normalize the viewing angle by an optional reprojection step. The goal is to create a view of the object from a canonical camera pitch angle. To enable reprojection, we first create a mesh using straightforward triangulation from the filled depth map (Fig. 3e).
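As an illustration of the fill-in objective in Eqs. (3) and (4), here is a small SciPy sketch that assembles the sparse system over a 4-neighborhood and solves it with BiCGSTAB. The paper uses Eigen's C++ solver; SciPy's equivalent is substituted here, a fixed σ replaces the local intensity variance σ_p for brevity, and all names are ours.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import bicgstab

def fill_depth(depth, gray, sigma=0.1):
    """Colorization-style depth fill-in (Eqs. 3-4), sketched with SciPy.

    depth: HxW array with NaN for missing values; gray: HxW intensities.
    """
    H, W = depth.shape
    n = H * W
    idx = np.arange(n).reshape(H, W)
    known = ~np.isnan(depth)

    rows, cols, vals = [], [], []
    b = np.zeros(n)
    for y in range(H):
        for x in range(W):
            p = idx[y, x]
            if known[y, x]:
                # Known pixels enter as hard constraints: D(p) = depth(p).
                rows.append(p); cols.append(p); vals.append(1.0)
                b[p] = depth[y, x]
                continue
            # Unknown pixels: D(p) - sum_s w_ps D(s) = 0 over N(p).
            nbrs = [(y + dy, x + dx)
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= y + dy < H and 0 <= x + dx < W]
            w = np.array([np.exp(-(gray[y, x] - gray[s]) ** 2
                                 / (2.0 * sigma ** 2)) for s in nbrs])
            w /= w.sum()  # weights normalized to sum to 1
            rows.append(p); cols.append(p); vals.append(1.0)
            for (sy, sx), wv in zip(nbrs, w):
                rows.append(p); cols.append(idx[sy, sx]); vals.append(-wv)

    A = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
    sol, info = bicgstab(A, b, atol=1e-8)
    return sol.reshape(H, W)
```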
We then render the mesh from the canonical perspective to create a new depth image. Our naïve meshing approach creates a linear interpolation for previously hidden surfaces of the object (see Fig. 3f). We believe this to be a plausible guess of the unknown geometry without making further assumptions about the object, such as symmetries.

The final preprocessing step is to calculate a color $C$ for each depth image pixel $p = (u, v)$ with 3D reprojection $p' = (x, y, z)$. We first estimate a 3D object center $q$ from the bounding box of the object point cloud. The points are then colored according to their distance $r$ from a line $g$ through $q$ in the direction of the plane normal $n$ from tabletop segmentation (Fig. 4):

$C(p) = P(\operatorname{dist}_g(p'))$.  (5)
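The distance-to-axis coloring of Eq. (5) can be sketched as follows. The palette anchors approximate the green-red-blue-yellow interpolation described later in this section; the exact anchor colors and the fixed metric range r_max are not given in the paper, so the values below are assumptions.

```python
import numpy as np

# Four anchor colors approximating the paper's green-red-blue-yellow
# palette P; the exact RGB anchors are assumptions.
PALETTE = np.array([[0, 255, 0], [255, 0, 0],
                    [0, 0, 255], [255, 255, 0]], dtype=float)

def colorize_depth(points, center, normal, r_max=0.3):
    """Color 3D points by distance to the axis through the object center
    (Eq. 5). points: Nx3 array; center q: 3-vector; normal n: plane normal.
    """
    n = normal / np.linalg.norm(normal)
    d = points - center
    # Distance to the line g: remove the component along n, keep the rest.
    radial = d - np.outer(d @ n, n)
    r = np.linalg.norm(radial, axis=1)

    # Piecewise-linear interpolation between the palette anchors. r_max is
    # a fixed metric range shared by all objects, so the coloring stays
    # un-normalized per object and absolute scale remains visible.
    t = np.clip(r / r_max, 0.0, 1.0) * (len(PALETTE) - 1)
    lo = np.minimum(t.astype(int), len(PALETTE) - 2)
    frac = (t - lo)[:, None]
    return (1 - frac) * PALETTE[lo] + frac * PALETTE[lo + 1]
```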

Fig. 4: Coordinate system used for colorization. The detected plane normal n through the object center is depicted as a green bar, while an exemplary distance to a colorized point is shown as a red bar.

We chose a fixed RGB interpolation from green over red and blue to yellow as palette function P. Since the coloring is not normalized, this allows the network to discriminate between scaled versions of the same shape. If scale-invariant shape recognition is desired, the coloring can easily be normalized. Note that depth likely carries less information than color and could be processed at a coarser resolution. We keep the resolution constant, however, since the input size of the learned CNN cannot be changed.

C. Image Feature Extraction

We investigated the winning CNN from the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012 by Krizhevsky et al. [8]. The open-source Caffe framework [15] provides a pre-trained version of this network. We extract features from the previous-to-last and the last fully connected layers in the network (named fc7 and fc8 in Caffe). This gives us 4096 + 1000 = 5096 features per RGB and depth image each, resulting in 10,192 features per RGB-D frame. Reprojection and coloring are only used for instance-level classification and pose regression, since object categorization cannot benefit from a canonical perspective given the evaluation regime of the Washington RGB-D dataset.

Figure 5 shows responses of the first convolutional layer to RGB and depth stimuli. The same filters show different behavior in RGB and depth channels. As intended by our preprocessing, the activation images exhibit little activity in faded-out background regions.

IV. LEARNING METHOD

A. Object Classification

For classification, we use linear Support Vector Machines (SVMs). We follow a hierarchical approach as in Lai et al. [13]: in a first level, a linear multiclass SVM predicts the object category. The next level contains SVMs predicting the instance in each particular category.

B. Object Pose Estimation

The RGB-D Objects dataset makes the assumption that object orientation is defined by a single angle α around the normal vector of the planar surface. This angle is consistently annotated for instances of each object category. However, annotations are not guaranteed to be consistent across categories. Instead of regressing α directly, we construct a hierarchy for pose estimation to avoid the discontinuity at α = 0° = 360°, which is hard for a regressor to match. We first predict a rough angle interval for α using a linear SVM. In our experiments, four angle intervals of 90° gave best results. For each interval, we then train one RBF-kernel support vector regressor to predict α. During training, we include samples from the neighboring angle intervals to increase robustness against misclassifications on the interval level. This two-step regressor is trained for each instance. We further train the regressor for each category to provide pose estimation without instance identification, which is supported by the dataset but not reported by other works, albeit being required in any real-world household robotics application.
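A compact scikit-learn sketch of this two-step pose estimator follows (interval SVM, then one RBF regressor per interval). The class name, the overlap width, and all hyper-parameters are our assumptions, since the paper does not specify them.

```python
import numpy as np
from sklearn.svm import LinearSVC, SVR

class TwoStagePoseEstimator:
    """Angle-interval classifier plus per-interval RBF regressors.

    X: (N, D) feature matrix; alpha: (N,) angles in degrees in [0, 360).
    """

    def __init__(self, n_intervals=4, overlap=20.0):
        self.width = 360.0 / n_intervals
        self.overlap = overlap  # degrees borrowed from neighboring intervals
        self.clf = LinearSVC()
        self.regs = [SVR(kernel="rbf") for _ in range(n_intervals)]

    def fit(self, X, alpha):
        k = (alpha // self.width).astype(int) % len(self.regs)
        self.clf.fit(X, k)
        for i, reg in enumerate(self.regs):
            lo = i * self.width - self.overlap
            hi = (i + 1) * self.width + self.overlap
            # Unwrap angles relative to the interval start so the target is
            # continuous, and include neighboring samples for robustness.
            a = ((alpha - lo) % 360.0) + lo
            m = a <= hi
            reg.fit(X[m], a[m])
        return self

    def predict(self, X):
        k = self.clf.predict(X)
        return np.array([self.regs[i].predict(x[None, :])[0]
                         for i, x in zip(k, X)]) % 360.0
```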
V. EVALUATION

A. Evaluation Protocol

We evaluate our approach on the Washington RGB-D Objects dataset [12]. It contains 300 objects organized in 51 categories. For each object, there are three turntable sequences captured from different camera elevation angles (30°, 45°, 60°). The sequences were captured with an ASUS Xtion Pro Live camera with 640×480 resolution in both RGB and depth channels. The dataset also contains approximate ground truth labels for the turntable rotation angle. Furthermore, the dataset provides an object segmentation based on depth and color. We use this segmentation mask in our pre-processing pipeline. However, since our RGB preprocessing needs background pixels for smooth background fading (Section III-A), we could not use the provided pre-masked evaluation dataset but instead had to use the corresponding frames from the full dataset. Since our method fades out most of the background, only features close to the object remain. This includes the turntable surface, which is not interesting for classification or pose regression, and the turntable markers, which do not simplify the regression problem since the objects are placed randomly on the turntable in each view pose. Thus, we believe that our results are still comparable to other results on the same dataset.

For evaluation, we follow the protocol established by Lai et al. [12] and Bo et al. [14]. We use every fifth frame for training and evaluation. For category recognition, we report the cross-validation accuracy for ten predefined folds over the objects, i.e., in each fold the test instances are completely unknown to the system. For instance recognition and pose estimation, we employ the Leave-Sequence-Out scheme of Bo et al. [14], where the system is trained on the 30° and 60° sequences, while evaluation is performed on the 45° sequence of every instance.
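For reference, the Leave-Sequence-Out split can be expressed in a few lines of Python; the frame-dict fields used below ('elevation', 'index') are hypothetical stand-ins for whatever metadata a dataset loader provides.

```python
def leave_sequence_out(frames):
    """Split frames per the Leave-Sequence-Out protocol of Bo et al. [14].

    Every fifth frame is kept; training uses the 30 and 60 degree
    sequences, testing the 45 degree sequence.
    """
    kept = [f for f in frames if f["index"] % 5 == 0]
    train = [f for f in kept if f["elevation"] in (30, 60)]
    test = [f for f in kept if f["elevation"] == 45]
    return train, test
```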

Fig. 5: CNN activations for sample RGB-D frames. The first column shows the CNN input image (color and depth of a pitcher and a banana); all other columns show corresponding selected responses from the first convolutional layer. Note that each column is the result of the same filter applied to color and pre-processed depth.

TABLE I: Comparison of category and instance level classification accuracies (%) on the Washington RGB-D Objects dataset, for RGB and RGB-D input. Compared methods: Lai et al. [12], Bo et al. [14], PHOW [18], and ours.

Fig. 6: Visualization of our CNN-based features. The top row shows t-SNE embeddings of 1/10 of the Washington RGB-D Objects dataset using CNN and PHOW features, colored by class. CNN separates classes better than PHOW. The bottom row shows a separate t-SNE embedding of the instant noodles class (45° sequence), colored and connected by pose. The CNN separates instances and creates pose manifolds.

B. Results

In addition to the work of Bo et al. [14], we compare our proposed method to a baseline of dense SIFT features (PHOW, [18]), which are extracted at multiple scales, quantized using k-means, and histogrammed in a 2×2 and a 4×4 grid over the image. We used VLFeat with standard settings, which are optimized for the Caltech 101 dataset. We then apply SVM training for classification and pose estimation as described in Section IV.

Without any supervised learning, we can embed the features produced by the CNN and PHOW in R² using a t-SNE embedding [19]. The result is shown in Fig. 6. While the upper row shows that CNN object classes are well separated in the input space, the lower row demonstrates that object instances of a single class also become well separated. Similar poses of the same object remain close in the feature space, expressing a low-dimensional manifold. These are highly desirable properties for an unsupervised feature mapping which facilitate learning with very few instances. In contrast, PHOW features only exhibit these properties to a very limited extent: classes and instances are less well separated, although pose similarities are largely retained.

Table I summarizes our recognition results and compares them with other works. We improve on the state-of-the-art in category and instance recognition accuracy for RGB and RGB-D data. The exception is RGB-based instance recognition, where the HMP approach by Bo et al. [14] wins by 0.1%. Analyzing the confusion matrix (Fig. 7), the category-level classification exhibits few systematic errors. Some object categories prove to be very difficult, since they contain instances with widely varying shape but only few examples (e.g., mushroom), or instances which are very similar in color and shape to instances of other classes (e.g., pitcher and coffee mug). Telling apart the peaches from similarly rounded but brightly colored sponges would likely profit from more examples and detailed texture analysis.
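The unsupervised embedding behind Fig. 6 amounts to running t-SNE [19] on the raw feature vectors; a minimal scikit-learn sketch (our function name and parameter choices) follows.

```python
from sklearn.manifold import TSNE

def embed_features(feats, seed=0):
    """Project (N, 10192) CNN feature vectors to 2-D for visualization,
    as in the Fig. 6 embeddings; perplexity etc. are left at defaults."""
    return TSNE(n_components=2, random_state=seed).fit_transform(feats)
```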

Fig. 7: Top: confusion matrix for category recognition, normalized by the number of samples for each ground truth label. Selected outliers: 1) pitcher recognized as coffee mug, 2) peach as sponge, 3) keyboard as food bag. Bottom: sample images for pitcher, coffee mug, peach, and sponge.

Classification performance degrades gracefully when the dataset size is reduced, as shown in Fig. 8. We reduce the dataset for category and instance recognition by uniform stratified sampling on the category and instance level, respectively. With only 30% of the training set available, category classification accuracy decreases by only 0.65 percentage points (PHOW: 2.2%), while instance classification decreases by roughly 2% (PHOW: 25.2% from 62.6%, not shown). This supports our observation that the CNN feature space already separates the categories of the RGB-D objects in a semantically meaningful way.

Fig. 8: Learning curves for classification accuracy (top and center: CNN RGB-D, CNN RGB, PHOW (RGB), and Bo et al. [14] (RGB-D)) and median pose estimation error (bottom: CNN RGB-D, instance and category level) over the relative training set size. We report cross-validation accuracy for category recognition and accuracy on the 45° sequence for instance recognition.

We also performed a categorization experiment with the Leave-Sequence-Out evaluation protocol. When training on the two camera pitch angles of 30° and 60° and testing the same objects at a pitch angle of 45°, our method achieves near-perfect category recognition accuracy (99.6%).

We also improve the state-of-the-art in pose estimation by a small margin. Table II reports the pose estimation error of the instance-level estimation and the category-level estimation. Notably, our average pose error is significantly lower than the pose error of the other methods. We were not able to produce reasonable accuracies for pose estimation based on the PHOW features, since the large instance classification error strongly affects all pose estimation metrics. Surprisingly, our category-level pose regression achieves an even lower median pose error, surpassing the state-of-the-art result of Bo et al. [14]. The category-level estimation is less precise only in the MedPose(I) and AvePose(I) metrics, where its broader knowledge is not as useful as precise fitting to the specific instance.

Figure 9 shows the distribution of pose errors over categories. We note that the dataset contains objects in at least three categories which exhibit rotation symmetries and do not support estimating pose. This effect is mitigated by category-level pose estimation, which shows that pose estimation can greatly benefit from the generalization across instances provided by category-level training.

Fig. 9: Distribution of median pose error. The left plot shows the median pose error over categories, the right plot over instances; the median over all categories is shown in red. Some objects of type lemon, lime, and tomato exhibit high rotation symmetry and do not support pose estimation.
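The uniform stratified sampling used to shrink the training set for the learning curves in Fig. 8 can be sketched as follows; the function and argument names are ours.

```python
import numpy as np

def stratified_subsample(labels, fraction, seed=0):
    """Keep `fraction` of the samples of every label (category or
    instance), so class balance is preserved when reducing the dataset.
    Returns the sorted indices of the retained samples."""
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        rng.shuffle(idx)
        keep.extend(idx[: max(1, int(round(fraction * len(idx))))])
    return np.sort(np.array(keep))
```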

TABLE II: Median and average pose estimation error (angular error in degrees) on the Washington RGB-D Objects dataset, in terms of MedPose, MedPose(C), MedPose(I), AvePose, AvePose(C), and AvePose(I), for Lai et al. [13], Bo et al. [14], our instance-level pose regression, and our category-level pose regression. Wrong classifications are penalized with 180° error. (C) and (I) describe subsets with correct category and instance classification, respectively.

The MedPose error is the median pose error, with a 180° penalty if the class or instance of the object was not recognized. MedPose(C) is the median pose error of only the cases where the class was correctly identified, again with a 180° penalty if the instance is predicted wrongly. Finally, MedPose(I) only counts the samples where class and instance were identified correctly. AvePose, AvePose(C), and AvePose(I) describe the average pose error in each case, respectively.

TABLE IV: Color palettes for depth colorization with corresponding instance recognition accuracy (%), for depth-only and RGB-D prediction. Compared palettes: gray, green, and green-red-blue-yellow.

The color palette choice for our depth colorization is a crucial parameter. We compared the four-color palette introduced in Section III to two simpler colorization schemes (black and green with brightness gradients), shown in Table IV, by instance recognition accuracy. Especially when considering purely depth-based prediction, the four-color palette wins by a large margin. We conclude that more colors result in more discriminative depth features.

Since computing power is usually very constrained in robotic applications, we benchmarked runtime for feature extraction and prediction on a lightweight mobile computer with an Intel Core i7-4800MQ 2.7 GHz CPU and a common mobile graphics card (NVidia GeForce GT 730M) for CUDA computations. As can be seen in Table III, the runtime of our approach is dominated by the depth preprocessing pipeline, which is not yet optimized for speed. Still, our runtimes are low enough to allow frame rates of up to 5 Hz in a future real-time application.

TABLE III: Runtimes of various algorithm steps per input frame in seconds (feature extraction for RGB, feature extraction for depth, and total), for our work and Bo et al. [14], measured on the hardware above. Timings include all preprocessing steps.

VI. CONCLUSION

We presented an approach which allows object categorization, instance recognition, and pose estimation of objects on planar surfaces. Instead of handcrafting or learning features, we relied on a convolutional neural network (CNN) which was trained on a large image categorization dataset. We made use of depth features by rendering objects from canonical views and proposed a CNN-compatible coloring scheme which encodes metric distance from the object center. We evaluated our approach on the challenging Washington RGB-D Objects dataset and found that in feature space, categories and instances are well separated. Supervised learning on the CNN features improves the state-of-the-art in classification as well as average pose accuracy. Our performance degrades gracefully when the dataset size is reduced.

REFERENCES

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," arXiv:1311.2524, 2013.
[2] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in CVPR DeepVision Workshop, 2014.
[3] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proceedings of the International Conference on Machine Learning (ICML), 2014.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, 1998.
[5] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in cortex," Nature Neuroscience, vol. 2, no. 11, 1999.
[6] S. Behnke, Hierarchical Neural Networks for Image Interpretation, ser. Lecture Notes in Computer Science (LNCS). Springer, 2003, vol. 2766.
[7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "ImageNet large scale visual recognition challenge," arXiv:1409.0575, 2014.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
[9] D. Scherer, A. Müller, and S. Behnke, "Evaluation of pooling operations in convolutional architectures for object recognition," in Artificial Neural Networks (ICANN), Springer, 2010.
[10] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision (ECCV), 2014.
[11] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, "Learning rich features from RGB-D images for object detection and segmentation," in European Conference on Computer Vision (ECCV), 2014.
[12] K. Lai, L. Bo, X. Ren, and D. Fox, "A large-scale hierarchical multi-view RGB-D object dataset," in International Conference on Robotics and Automation (ICRA), 2011.
[13] K. Lai, L. Bo, X. Ren, and D. Fox, "A scalable tree-based approach for joint object and pose recognition," in Proceedings of the Conference on Artificial Intelligence (AAAI), 2011.
[14] L. Bo, X. Ren, and D. Fox, "Unsupervised feature learning for RGB-D based object recognition," in International Symposium on Experimental Robotics (ISER), 2013.
[15] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv:1408.5093, 2014.
[16] D. Holz, S. Holzer, R. B. Rusu, and S. Behnke, "Real-time plane segmentation using RGB-D cameras," in RoboCup 2011: Robot Soccer World Cup XV, 2012.
[17] A. Levin, D. Lischinski, and Y. Weiss, "Colorization using optimization," ACM Transactions on Graphics (TOG), vol. 23, 2004.
[18] A. Bosch, A. Zisserman, and X. Munoz, "Image classification using random forests and ferns," in International Conference on Computer Vision (ICCV), 2007.
[19] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, 2008.


More information

Su et al. Shape Descriptors - III

Su et al. Shape Descriptors - III Su et al. Shape Descriptors - III Siddhartha Chaudhuri http://www.cse.iitb.ac.in/~cs749 Funkhouser; Feng, Liu, Gong Recap Global A shape descriptor is a set of numbers that describes a shape in a way that

More information

Metric Learning for Large-Scale Image Classification:

Metric Learning for Large-Scale Image Classification: Metric Learning for Large-Scale Image Classification: Generalizing to New Classes at Near-Zero Cost Florent Perronnin 1 work published at ECCV 2012 with: Thomas Mensink 1,2 Jakob Verbeek 2 Gabriela Csurka

More information

Project 3 Q&A. Jonathan Krause

Project 3 Q&A. Jonathan Krause Project 3 Q&A Jonathan Krause 1 Outline R-CNN Review Error metrics Code Overview Project 3 Report Project 3 Presentations 2 Outline R-CNN Review Error metrics Code Overview Project 3 Report Project 3 Presentations

More information

Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling

Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling [DOI: 10.2197/ipsjtcva.7.99] Express Paper Cost-alleviative Learning for Deep Convolutional Neural Network-based Facial Part Labeling Takayoshi Yamashita 1,a) Takaya Nakamura 1 Hiroshi Fukui 1,b) Yuji

More information

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation.

Equation to LaTeX. Abhinav Rastogi, Sevy Harris. I. Introduction. Segmentation. Equation to LaTeX Abhinav Rastogi, Sevy Harris {arastogi,sharris5}@stanford.edu I. Introduction Copying equations from a pdf file to a LaTeX document can be time consuming because there is no easy way

More information

arxiv: v3 [cs.cv] 3 Oct 2012

arxiv: v3 [cs.cv] 3 Oct 2012 Combined Descriptors in Spatial Pyramid Domain for Image Classification Junlin Hu and Ping Guo arxiv:1210.0386v3 [cs.cv] 3 Oct 2012 Image Processing and Pattern Recognition Laboratory Beijing Normal University,

More information

Finding Tiny Faces Supplementary Materials

Finding Tiny Faces Supplementary Materials Finding Tiny Faces Supplementary Materials Peiyun Hu, Deva Ramanan Robotics Institute Carnegie Mellon University {peiyunh,deva}@cs.cmu.edu 1. Error analysis Quantitative analysis We plot the distribution

More information

Edge Detection Using Convolutional Neural Network

Edge Detection Using Convolutional Neural Network Edge Detection Using Convolutional Neural Network Ruohui Wang (B) Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong, China wr013@ie.cuhk.edu.hk Abstract. In this work,

More information

Content-Based Image Recovery

Content-Based Image Recovery Content-Based Image Recovery Hong-Yu Zhou and Jianxin Wu National Key Laboratory for Novel Software Technology Nanjing University, China zhouhy@lamda.nju.edu.cn wujx2001@nju.edu.cn Abstract. We propose

More information

arxiv: v1 [cs.cv] 28 Sep 2018

arxiv: v1 [cs.cv] 28 Sep 2018 Camera Pose Estimation from Sequence of Calibrated Images arxiv:1809.11066v1 [cs.cv] 28 Sep 2018 Jacek Komorowski 1 and Przemyslaw Rokita 2 1 Maria Curie-Sklodowska University, Institute of Computer Science,

More information

Classifying Depositional Environments in Satellite Images

Classifying Depositional Environments in Satellite Images Classifying Depositional Environments in Satellite Images Alex Miltenberger and Rayan Kanfar Department of Geophysics School of Earth, Energy, and Environmental Sciences Stanford University 1 Introduction

More information

Learning-based Localization

Learning-based Localization Learning-based Localization Eric Brachmann ECCV 2018 Tutorial on Visual Localization - Feature-based vs. Learned Approaches Torsten Sattler, Eric Brachmann Roadmap Machine Learning Basics [10min] Convolutional

More information

Supplemental Material for Can Visual Recognition Benefit from Auxiliary Information in Training?

Supplemental Material for Can Visual Recognition Benefit from Auxiliary Information in Training? Supplemental Material for Can Visual Recognition Benefit from Auxiliary Information in Training? Qilin Zhang, Gang Hua, Wei Liu 2, Zicheng Liu 3, Zhengyou Zhang 3 Stevens Institute of Technolog, Hoboken,

More information

Contextual Dropout. Sam Fok. Abstract. 1. Introduction. 2. Background and Related Work

Contextual Dropout. Sam Fok. Abstract. 1. Introduction. 2. Background and Related Work Contextual Dropout Finding subnets for subtasks Sam Fok samfok@stanford.edu Abstract The feedforward networks widely used in classification are static and have no means for leveraging information about

More information

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION Kingsley Kuan 1, Gaurav Manek 1, Jie Lin 1, Yuan Fang 1, Vijay Chandrasekhar 1,2 Institute for Infocomm Research, A*STAR, Singapore 1 Nanyang Technological

More information

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture

More information

CNN-based Human Body Orientation Estimation for Robotic Attendant

CNN-based Human Body Orientation Estimation for Robotic Attendant Workshop on Robot Perception of Humans Baden-Baden, Germany, June 11, 2018. In conjunction with IAS-15 CNN-based Human Body Orientation Estimation for Robotic Attendant Yoshiki Kohari, Jun Miura, and Shuji

More information

Lecture 5: Object Detection

Lecture 5: Object Detection Object Detection CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 5: Object Detection Bohyung Han Computer Vision Lab. bhhan@postech.ac.kr 2 Traditional Object Detection Algorithms Region-based

More information

Object detection with CNNs

Object detection with CNNs Object detection with CNNs 80% PASCAL VOC mean0average0precision0(map) 70% 60% 50% 40% 30% 20% 10% Before CNNs After CNNs 0% 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 year Region proposals

More information

arxiv: v1 [cs.cv] 26 Jun 2017

arxiv: v1 [cs.cv] 26 Jun 2017 Detecting Small Signs from Large Images arxiv:1706.08574v1 [cs.cv] 26 Jun 2017 Zibo Meng, Xiaochuan Fan, Xin Chen, Min Chen and Yan Tong Computer Science and Engineering University of South Carolina, Columbia,

More information