arxiv: v3 [cs.cv] 21 Oct 2016


Associating Grasping with Convolutional Neural Network Features

Li Yang Ku, Erik Learned-Miller, and Rod Grupen
University of Massachusetts Amherst, Amherst, MA, USA

Abstract. In this work, we provide a solution for pre-shaping a human-like robot hand for grasping based on visual information. Our approach uses convolutional neural networks (CNNs) to define a mapping between images and grasps. Applying CNNs to robotics applications is non-trivial for two reasons. First, collecting enough robot data to train a CNN at the same scale as the models trained in the vision community is extremely difficult. In this work, we demonstrate that by using a pre-trained CNN, a small set of grasping examples is sufficient for generalizing across different objects of similar shapes. Second, the final output of a CNN contains little location information of the observed object, which is essential for the robot to manipulate the object. We take advantage of the hierarchical nature of CNN layers and identify the 3D positions of features that capture the hierarchical support relations between filters in different CNN layers using an approach we call targeted backpropagation. Targeted backpropagation traces the activation of higher level features in a CNN backwards through the network to discover the locations in the observation that were responsible for making them fire, thus localizing important structures that are manipulable in the environment. We show that this approach outperforms approaches without targeted backpropagation in a cluttered scene. We further implement a hierarchical controller that controls fingers and palms based on features located in different layers of the CNN for pre-shaping the robot hand and demonstrate that this approach outperforms a point cloud based approach on a grasping task on Robonaut-2.

1 Introduction

In this work, we provide a solution for pre-shaping a human-like robot hand based on visual information.
It has been shown that humans are capable of pre-shaping their hands to grasp an object based merely on visual information from the object. Experiments done with patient DF by Goodale and Milner [1] have shown that humans' pre-shaping ability is likely handled by the dorsal pathway in our brain rather than the ventral pathway that recognizes objects. This suggests that the ability to pre-shape our hand before grasping is not just the result of recognizing the object. Although visual input without haptic feedback may not be sufficient for certifying a robust grasp, pre-shaping is still a fundamental ability that can guide the robot hand before it makes contact with

the object. In this work, the purpose of pre-shaping is not just for picking up the object but to grasp it in a specific way. We believe this may result in a more stable grasp and more consistent interactions with the object.

Fig. 1. Overall architecture and hierarchical CNN features. A tuple of a yellow dot, cyan dot, and magenta dot represents a hierarchical CNN feature in the conv-3 layer. The magenta dots represent filters that are part of the hierarchical CNN features in the conv-3 layer and contribute to the same conv-5 layer filter represented by the yellow dot. These features can be traced back to the image input and mapped to the point cloud to support manipulation. The arrows from the CNN layers to the controllers indicate a mapping from hierarchical CNN features in the conv-4 and conv-3 layers to the arm and hand controllers. Target points for the arm and hand controllers are generated based on their positions relative to the corresponding features.

Our approach uses convolutional neural networks (CNNs) to define a mapping between images and grasps. Since 2012, CNNs have attracted a great deal of attention in the computer vision community and have outperformed other algorithms on many benchmarks. However, applying CNNs to robotics applications is non-trivial for two reasons. First, collecting enough robot data to train a CNN from scratch at the same scale as the models trained in the vision community is extremely difficult. In this work, we demonstrate that by using a CNN trained on ImageNet [2], a small set of grasping examples is sufficient for generalizing across different objects of similar shapes. Second, the final output of a CNN contains little location information about the observed object, which is

essential for the robot to manipulate the object. We take advantage of the hierarchical nature of CNN layers and identify a set of features we call hierarchical CNN features that capture the hierarchical support relations between filters in different CNN layers. The 3D positions of such features can be identified using an approach we call targeted backpropagation. Targeted backpropagation traces the activation of high-level filters in a CNN backwards through the network to discover the locations in the observation that were responsible for making them fire, thus localizing important structures that are manipulable in the environment. Figure 1 shows the overall architecture and how such hierarchical CNN features are mapped to a point cloud to support manipulation.

Fig. 2. Robonaut-2 pre-shaping its hand before grasping an object. Notice that in the box example the fingers are pre-shaped to grasp the faces of the box and in the air duster example the fingers are pre-shaped such that the air duster would be wrapped in the hand.

We analyze our algorithm based on the grasp point accuracy through cross-validation on the R2 Grasping Dataset we collected. Approaches with and without the proposed targeted backpropagation are compared. This approach is then tested in a grasping experiment on the humanoid robot Robonaut-2 [3]. A hierarchical controller that controls fingers and palms based on features in different CNN layers is implemented for pre-shaping the robot hand. We demonstrate that the proposed approach outperforms a baseline approach that grasps the centroid of the object point cloud. Figure 2 shows Robonaut-2 pre-shaping its hand before grasping.

2 Related Work

The idea that our brain encodes visual stimuli in two separate regions was first proposed in [4]. Ungerleider and Mishkin further discovered that the two areas, inferotemporal cortex and posterior parietal cortex, receive independent

sets of projections from the striate cortex and proposed that the ventral stream that ends in the inferotemporal cortex plays a critical role in identifying objects, while the dorsal stream that ends in the posterior parietal cortex encodes the spatial location of those same objects [5]. This hypothesis is often known as the distinction of "what" and "where" between the two visual pathways. However, in 1992 Goodale and Milner proposed an alternative perspective on the functionality of these two visual pathways based on many observations made with patient DF [6]. Patient DF is unique in the sense that she developed a profound visual form agnosia due to anoxic damage to her ventral stream. Despite DF's inability to recognize the shape, size, and orientation of visual objects, she is capable of grasping the very same objects with accurate hand and finger movements. Based on a series of experiments [1], Goodale and Milner proposed the perception-action model, which suggests that the dorsal pathway provides action-relevant information about the structural characteristics of objects and not just about their position. Instead of planning a grasp action based on the result of object recognition, our approach follows Goodale and Milner's perception-action model, which separates pre-shaping from object recognition. Recently, studies on this perception-action model in the field of cognitive science have shifted from identifying functional dissociation between ventral and dorsal streams to how these two streams interact with each other [7]. The attentional blink experiments done in [8] suggest that object recognition processes in the ventral stream activate action-related processing in the dorsal stream, which leads to enhanced selection and consolidation of objects that are related to the observed object.
This interaction between the ventral and dorsal streams disappears when the association between items is semantic rather than visual. This result is consistent with our solution, where the action process is instantiated by the classification process and uses visual cues instead of object identities for grasp actions.

A lot of work has been done on generating robotic grasp plans from visual information. In [9], a single grasp point is identified using a probabilistic model on a set of visual features such as edges, textures, and colors. In [10], online learning is used to associate different types of grasp with the object's height and width. In [11], the probability of a hand configuration successfully grasping a novel object is calculated using contact, center of mass, and force closure properties based on point cloud and image information. A shape template approach for grasping novel objects was also introduced in [12]. The authors introduced a shape descriptor called a height map to capture local object geometry for matching part of a point cloud generated by a novel object to a known grasp template. A geometric approach for grasping novel objects based on point clouds is proposed in [13]. An antipodal grasp is determined by finding cutting planes that satisfy geometric constraints. A similar approach based on local object geometry was also introduced in [14]. Our approach associates CNN features learned from ImageNet with a known grasp. We consider not just a local feature but also where it is in a higher level structure. Since CNN features are based on RGB image input, our approach can infer a grasp pose from the side even when there

is only point cloud information of the object's top surface. In [15], a deep network trained on 1035 examples is used to determine a successful grasp based on RGB-D data. Grasp positions are exhaustively searched and evaluated. Our approach uses pre-trained features and generates grasp points based on a small set of grasping examples.

Several authors have applied CNNs to robotics. In [16], visuomotor policies are learned using an end-to-end neural network that takes images and outputs joint torques. A three layer CNN is used without any max pooling layer to maintain spatial information. In our work, we also use filters in the third convolution layer; but unlike the previous work, we consider their relationship with higher layer filters. In [17], an autoencoder is used to learn spatial information of features of a neural network. Our approach uses targeted backpropagation to find the receptive field that causes a high-level response in a particular image. In [18], a CNN is used to learn what features are graspable through 50 thousand trials collected using a Baxter robot. The final layer is used to select 1 out of 18 grasp orientations. Our approach considers the relative position between features and robot finger tips; it is therefore capable of generating more complicated grasp poses using a human-like hand.

Our work is also related to work in visualizing CNNs. In [19], the filters of each layer are visualized by looking for image patches that result in the maximum activation. Deconvolution is also used to find what pixels activated each filter. Another approach that trains a separate CNN to learn the reverse mapping from different layers of a CNN back to the original image has also been introduced in [20]. In our work, we use the visualization tool introduced by Yosinski [21] to visualize filters that represent meaningful characteristics in each layer.
Some authors have explored using filter activations from layers other than the final layer of a CNN. In [22], hypercolumns, which are defined as the activation of all CNN units above a pixel, are used on tasks such as simultaneous detection and segmentation, keypoint localization, and part labeling. Our approach groups filters in different layers based on their hierarchical relationship instead of their spatial relationship. In [23], the last two layers of two CNNs, one that takes an image as input and one that takes depth as input, are used to identify object category, instance, and pose. In [24], the last layer is used to identify object instance while the fifth convolution layer is used to determine the aspect of an object. In our work, we consider a feature to be the activation of a lower layer filter that causes a specific higher layer filter to activate, and we plan grasp poses based on these features.

Our work is also inspired by deformable part models [25], where a class model is composed of smaller models of parts; e.g., wheels are parts of a bicycle. We view CNNs similarly; a filter in a higher layer is a combination of filters in the lower layer. If a higher layer filter represents a high level structure, the lower layer filters that contribute to this higher layer filter can be seen as representing local parts of this structure and may provide useful information for manipulating an object.

3 Dataset

In this work, we created the R2 Grasping Dataset, which contains a detailed grasp pose for a set of objects. The data is collected using an Asus Xtion camera and the Robonaut-2 simulator [26]. The object is placed on a flat surface, with the camera about 70 cm above and looking down at a 55 degree angle. The point cloud generated by the camera is then projected into the simulator based on the location at which the Xtion camera is mounted on Robonaut-2. We then manually adjust the left robot arm and each finger of the left hand so that the robot hand can perform a firm grasp on the object. For cuboid objects, we adjust the thumb tip and index finger tip to the front and back faces of the cuboid and about 3 cm away from the left face. For the cylindrical objects, we adjust the thumb tip and index finger tip such that the whole hand wraps around the object. The robot arm movement is executed using MoveIt! [27]. We then record the point cloud, the RGB image, and the complete configuration of the hand in the camera frame.

Fig. 3. The R2 Grasping Dataset is collected through an interactive interface where the robot arm and hand are adjusted to the grasp pose.

We collected a total of 120 grasping examples of twelve different objects. Six of the objects are cylindrical and six are cuboids, as shown in Figure 4. The same object is presented at different orientations and under different lighting conditions. In addition to grasping examples with a single object, we also created 24 grasping examples in cluttered scenarios. Twelve of them include a single cylindrical object and twelve of them include a single cuboid object. The configuration of the hand while grasping these objects is also recorded. Figure 8 shows a few examples of the cluttered test set.

Fig. 4. Left: The set of objects used in the R2 Grasping Dataset. Right: The set of novel objects used in the grasping experiment.

4 Approach

To learn how to grasp a previously unseen object, we use AlexNet [28], a CNN trained on ImageNet, implemented with Caffe [29] to identify visual features that represent meaningful characteristics of the geometric shape of an object to support grasping. In this section, we first describe how to identify features that activate consistently among the training data. We call these identified features, which capture the hierarchical support relations between filters in different CNN layers, hierarchical CNN features. We further use an approach we call targeted backpropagation to discover the locations of these features by tracing the activation of filters in a CNN backwards through the network. We then discuss how grasp distributions that store the relative positions between features and robot end effector frames are learned.

4.1 Identifying Consistent Hierarchical CNN Features

Given a set of grasp poses on objects with similar shapes, our goal is to find a set of visual features that activate consistently. Our assumption is that some of these features will represent meaningful geometric information regarding the shape of an object that supports grasping. For each training example, we first do a masked forward pass on AlexNet. For each collected point cloud, we segment the flat supporting surface through Random Sample Consensus (RANSAC) [30] and create a 2D mask that excludes pixels that are not part of an object in the RGB image. This mask is dilated to include the boundary of the object.
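The mask construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes an organized point cloud stored as an H x W grid of `(x, y, z)` tuples (or `None` where depth is missing), and the function names, the 1 cm inlier threshold, the iteration count, and the dilation radius are all hypothetical choices.

```python
import random

def fit_plane(p, q, r):
    """Plane through three points as (unit normal, offset); None if they are collinear."""
    ux, uy, uz = q[0] - p[0], q[1] - p[1], q[2] - p[2]
    vx, vy, vz = r[0] - p[0], r[1] - p[1], r[2] - p[2]
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = (nx * nx + ny * ny + nz * nz) ** 0.5
    if norm < 1e-12:
        return None
    n = (nx / norm, ny / norm, nz / norm)
    return n, -(n[0] * p[0] + n[1] * p[1] + n[2] * p[2])

def ransac_plane(points, threshold=0.01, iterations=200, seed=0):
    """Indices of the points lying on the dominant plane (the supporting surface)."""
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        plane = fit_plane(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        inliers = [i for i, (x, y, z) in enumerate(points)
                   if abs(n[0] * x + n[1] * y + n[2] * z + d) < threshold]
        if len(inliers) > len(best):
            best = inliers
    return set(best)

def object_mask(cloud, threshold=0.01):
    """True for pixels whose 3D point is off the supporting plane (assumed to be object)."""
    h, w = len(cloud), len(cloud[0])
    pixels = [(r, c) for r in range(h) for c in range(w) if cloud[r][c] is not None]
    table = ransac_plane([cloud[r][c] for r, c in pixels], threshold)
    mask = [[False] * w for _ in range(h)]
    for i, (r, c) in enumerate(pixels):
        mask[r][c] = i not in table
    return mask

def dilate(mask, radius=1):
    """Grow the mask by `radius` pixels so that it includes the object boundary."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = True
    return out
```

In this sketch, pixels without a depth reading are treated as background, which is a simplification.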
During the forward pass, for each convolution layer we check each filter and zero out the responses that are not marked as part of the object according to the mask. This

approach removes filter responses that are not part of the object. The masked forward pass approach is used instead of directly segmenting the object in the RGB image to avoid learning the sharp edges caused by segmentation.

Fig. 5. Hierarchical CNN feature visualizations among cuboid objects (left) and cylinders (right). Each square figure is the visualization of a CNN filter while the edges connect a lower layer filter to a parent filter in a higher layer. The numbers under the squares are the corresponding filter indices. Filters are visualized using the visualization tool introduced in [21]. Notice that the lower level filters represent local structures of a parent filter.

We first identify consistent filters, i.e., filters that activate consistently over similar object types, in the fifth convolution (conv-5) layer. The top $N_5$ filters that have the highest sum of log responses among all the grasping examples of the same type of grasp are pinpointed. We observe that many filters in the conv-5 layer represent high level structures and fire on box-like objects, tube-like objects, faces, etc. Knowing what type of object is observed can determine the type of the grasp but is not sufficient for grasping, since boxes can have different sizes and be placed at different poses. However, if we can identify lower level features such as the front edge and the back edge of a box, we can place the robot fingers relative to these lower level feature locations. In this work, we exploit the fact that CNN features are by nature hierarchical; a filter in a higher layer with little location information is a combination of lower level features with higher spatial accuracy. Instead of simply associating features with filters in a CNN, we consider a feature to be a lower layer filter in the CNN that causes certain higher layer filters to become active.
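The masking of filter responses and the ranking of consistent filters by their sum of log responses can be sketched as below. This is an illustration under stated assumptions, not the paper's code: the mask is assumed to be already resampled to the resolution of the filter response, each filter is summarized per example by its peak masked response, and the small `eps` that guards against `log(0)` is a hypothetical detail.

```python
import math

def mask_response(response, mask):
    """Zero out entries of a 2D filter response that fall outside the object mask."""
    return [[v if m else 0.0 for v, m in zip(resp_row, mask_row)]
            for resp_row, mask_row in zip(response, mask)]

def consistent_filters(peak_responses, top_n, eps=1e-6):
    """Rank filters by the sum of log responses over all examples of one grasp type.

    peak_responses[e][f] is the peak masked response of filter f on example e.
    Returns the indices of the top_n filters, best first.
    """
    n_filters = len(peak_responses[0])
    scores = [sum(math.log(example[f] + eps) for example in peak_responses)
              for f in range(n_filters)]
    return sorted(range(n_filters), key=lambda f: scores[f], reverse=True)[:top_n]
```

Summing logs rather than raw responses favors filters that fire on every example: a filter that responds strongly on one example but is silent on the others receives a very low score.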
For example, we found that filter 87 in the conv-5 layer in AlexNet represents a box shaped object. This filter is a combination of several filters in the fourth convolution (conv-4) layer. Among these filters, filter 190 in the conv-4 layer represents the lower right corner of the box and filter 133 in the conv-4 layer represents the top left

corner of the box. Filter 190 is also a combination of several filters in the third convolution (conv-3) layer. Among these filters, filter 168 in the conv-3 layer represents the diagonal right edge of the lower right corner of the box and filter 54 in the conv-3 layer represents the diagonal left edge of the lower right corner of the box. Simply looking at filter 168 in the conv-3 layer will find all diagonal edges in an image, but if we only look at the part of this filter that has a hierarchical relationship with filter 190 in the conv-4 layer and filter 87 in the conv-5 layer, we find local features that correspond to meaningful parts of a box-like object. In other words, instead of representing a feature with a single filter in the conv-3 layer, we use a tuple of filters to represent a feature, such as $(f^5_{87}, f^4_{190}, f^3_{168})$ in the previous example, where $f^m_j$ represents the $j$th filter in the $m$th convolutional layer. We call such features hierarchical CNN features. Within a hierarchical CNN feature we call a higher level filter the parent filter of a lower level filter. Figure 5 shows the visualization of several hierarchical CNN features identified in cuboid and cylindrical objects, including the feature in the previous example.

In order to identify consistent filters in layers lower than conv-5, we first find the max response of the activation map for each of the consistent filters $f^5_i$ in the conv-5 layer. For each of these max responses we zero out all other responses and perform backpropagation to obtain its derivative with respect to the conv-4 layer. We then find filters that contribute to $f^5_i$ consistently in the conv-4 layer by identifying the top $N_4$ filters that have the highest sum of log derivatives. For each of these $N_4$ filters $f^4_j$, we zero out all derivatives except for the maximum derivative of filter $f^4_j$.
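The idea of keeping a single unit per layer while backpropagating can be illustrated on a toy network. The sketch below is a deliberate simplification and not the authors' Caffe implementation: it uses two fully connected ReLU layers in place of convolution layers, treats each unit as one (filter, location) pair, and seeds the backward pass with a derivative of 1 at the chosen top unit. All names are hypothetical.

```python
def relu_layer(weights, x):
    """Forward pass of one fully connected ReLU layer."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def backprop_step(weights, output_act, output_grad):
    """Gradient with respect to a layer's input, given the gradient at its ReLU output."""
    grad = [0.0] * len(weights[0])
    for row, act, g in zip(weights, output_act, output_grad):
        if act > 0.0:  # ReLU passes gradient only through active units
            for i, w in enumerate(row):
                grad[i] += g * w
    return grad

def keep_only(values, index):
    """Zero out every entry except the one at `index`."""
    return [v if i == index else 0.0 for i, v in enumerate(values)]

def targeted_backprop(w1, w2, image, top_unit, mid_unit=None):
    """Trace one top-layer unit back to the input through a single middle-layer unit.

    If mid_unit is None, the middle unit with the largest derivative is used.
    Returns (mid_unit, gradient with respect to the input).
    """
    h1 = relu_layer(w1, image)
    h2 = relu_layer(w2, h1)
    seed = keep_only([1.0] * len(h2), top_unit)   # all other top responses zeroed out
    g1 = backprop_step(w2, h2, seed)
    if mid_unit is None:
        mid_unit = max(range(len(g1)), key=lambda i: g1[i])
    g1 = keep_only(g1, mid_unit)                  # one unit per layer survives
    return mid_unit, backprop_step(w1, h1, g1)

def mean_location(grad):
    """Magnitude-weighted mean index of the input gradient; None if it is all zero."""
    total = sum(abs(g) for g in grad)
    if total == 0.0:
        return None
    return sum(i * abs(g) for i, g in enumerate(grad)) / total
```

Choosing a different `mid_unit` under the same `top_unit` gives a different localized response, which corresponds to a different hierarchical feature with the same parent.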
In other words, for each hierarchical CNN feature we only perform backpropagation on one filter, and on the one location of that filter that has the maximum derivative, per convolutional layer. With the same procedure we then find $N_3$ filters that contribute to $f^4_j$ consistently. We call this kind of backpropagation on one specific filter per layer targeted backpropagation. To further locate a feature in 3D we then perform backpropagation to the image and map the mean of the response locations to the corresponding 3D point in the point cloud. We define the response of a hierarchical CNN feature as the maximum targeted backpropagation derivative of the lowest layer filter in the tuple. Figure 6 shows an example of results of targeted backpropagation to the image layer from hierarchical CNN features in the conv-5, conv-4, and conv-3 layers. The conv-3 layer features can be interpreted as representing lower level features such as edges and corners of the cuboid object.

4.2 Grasp Distribution

A grasp distribution is used to map from feature locations to grasp points; it can be used to determine how likely it is that the end effectors' positions relative to the corresponding features result in the intended grasp. For each type of grasp we store a distribution of grasp targets for each end effector with respect to each identified consistent feature. A grasp target is the relative position from a robot frame position to a feature position. In our experiment, we create a distribution from the robot palm to each consistent feature in the conv-4 layer and distributions from the robot index finger tip and thumb tip to each consistent

Fig. 6. Targeted backpropagation example. The color image is the input image. Each square image with a black background represents the result of targeted backpropagation from hierarchical CNN features in the conv-5, conv-4, and conv-3 layers to the image layer. The blue dot in each targeted backpropagation result represents the mean of the response locations. Notice that the mean response locations of conv-3 layer features are located closer to edges and corners of the cuboid object compared to the mean response locations of conv-4 and conv-5 features. These conv-3 layer features can be interpreted as representing local structures of the cuboid object.

feature in the conv-3 layer. We call the combination of all of these distributions and their corresponding hierarchical CNN features the grasp distribution. Although specifying grasp points based on the offsets to feature positions may not be invariant to objects with different sizes, as long as the end effector is close to certain feature points the difference would be small.

Not all features that fire consistently on objects of the same type are good features to plan actions relative to. For example, when grasping a box, positioning the index finger relative to the closer edge of the box will result in positions with high variance, since the size of the box may vary. In contrast, positioning the index finger relative to the further edge of the box will result in a lower variance. For each end effector position we pick the top $N$ hierarchical CNN features in the grasp distribution that have the lowest position variance among training examples and have the same parent filter in the conv-5 layer. The other features are then removed from the grasp distribution. We found that restricting all the hierarchical CNN features to have the same parent filter allows our approach to perform well in a cluttered scenario.

During testing, the hierarchical CNN features in the grasp distribution are first identified. The 2D locations of these features in the image plane are then located through targeted backpropagation and mapped to 3D positions on the 3D point cloud. The distributions that belong to the same end effector are then merged in the same coordinate frame based on the 3D positions of the corresponding hierarchical CNN features. The grasp targets for the robot palm, thumb tip, and index finger tip are then determined by the weighted mean position of the merged distribution with the feature responses as weights.
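The variance-based feature selection and the response-weighted grasp target for a single end effector can be sketched as follows. This is a minimal illustration under assumptions that are not stated in the text: each stored offset is taken to be the end effector position minus the feature position, variance is measured as the sum of per-axis population variances, and the feature identifiers and function names are hypothetical. The restriction to a shared conv-5 parent filter is omitted for brevity.

```python
from statistics import pvariance

def offset_variance(offsets):
    """Sum of per-axis population variances of a list of 3D offsets."""
    return sum(pvariance(axis) for axis in zip(*offsets))

def select_features(distribution, top_n):
    """Keep the top_n features whose stored offsets vary least across training examples.

    distribution maps a feature id to its list of (dx, dy, dz) offsets.
    """
    ranked = sorted(distribution, key=lambda f: offset_variance(distribution[f]))
    return {f: distribution[f] for f in ranked[:top_n]}

def grasp_target(distribution, detections):
    """Response-weighted mean of all candidate targets, or None if nothing was detected.

    detections maps a feature id to (3D feature position, feature response).
    Each stored offset of a detected feature contributes one candidate target.
    """
    total = 0.0
    acc = [0.0, 0.0, 0.0]
    for feature, offsets in distribution.items():
        if feature not in detections:
            continue
        position, response = detections[feature]
        for offset in offsets:
            for k in range(3):
                acc[k] += response * (position[k] + offset[k])
            total += response
    if total == 0.0:
        return None
    return tuple(a / total for a in acc)
```

In use, `select_features` would run once after training for each end effector, and `grasp_target` would run at test time once the selected features have been located in the point cloud.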
Figure 7 shows examples of generated grasp positions and grasp distributions on different objects.

5 Experiments

In this work, we ran two sets of experiments. In the first set of experiments, we evaluate based on the difference between the calculated grasp points and the ground truth in the R2 Grasping Dataset. In the second set of experiments, we test our approach on grasping novel objects on Robonaut-2 [3] and compare the number of successful grasps with a baseline approach.

5.1 Experiments on the R2 Grasping Dataset

In this set of experiments we analyze the performance of our approach on the R2 Grasping Dataset. First, we perform cross-validation and calculate the accuracy of grasp points. Second, we compare results with and without targeted backpropagation.

Cross-Validation Results

We apply cross-validation by leaving out one object instance at a time during training and testing on the left out object by comparing the calculated grasp points to the ground truth. The distance between the example position and the targeted position of the palm, index finger tip, and

Fig. 7. Sample cross-validation results for the single object scenario. The red, green, and blue spheres represent the target palm, thumb tip, and index finger tip locations of the left robot hand. The target locations are the weighted mean of the colored dots, each of which represents a relative position from a feature to a grasp point in one training example. Notice that for the cuboid object the target thumb tip and index finger tip are located on opposing faces and about 3 cm away from the left face of the cuboid, as it was trained. For the cylinder object the thumb tip and index finger tip are on the right side of the cylinder, such that the whole hand wraps around the object. The black pixels are locations behind the point cloud that are not observable.

thumb tip is calculated and shown in Table 1. The average grasp position error for the palm is higher than for the thumb and index finger; this is likely because the positions of local features alone are not sufficient to predict an accurate position for the palm. However, since the palm is not contacting the object, its position is less crucial for a successful grasp. Figure 7 shows a few results of cross-validation on different objects at different poses and lighting. Similar to the training data, the cuboid objects are grasped at positions closer to the left side while the cylindrical objects are grasped such that the fingers would wrap around the cylinder.

Table 1. Average grasp position error on cylindrical and cuboid objects in meters. Rows (left hand): thumb tip, index tip, palm. Columns for cylindrical objects: cetaphil jar, wood cylinder, maxwell can, blue jar, paper roll, yellow jar, average. Columns for cuboid objects: cube box, redtea box, bandage box, twinings box, brillo box, tazo box, average.

Hierarchical CNN Feature and Targeted Backpropagation

To evaluate the proposed hierarchical CNN feature and targeted backpropagation approach (tbp), we compare cross-validation results to four alternative methods. The first (notbp-test) uses only the lowest level filter of the hierarchical CNN features, therefore removing the relationship between lower and higher level CNN filters. The second alternative method (notbp-train) identifies the same number of consistent features in each layer during training but finds individual CNN filters instead of hierarchical CNN features. The grasp distribution is then generated based on these individual filters in each layer. Similar to the second alternative, the third (notbp-conv5) uses individual CNN filters instead of hierarchical CNN features but only considers filters in the conv-5 layer.
The fourth alternative (single-conv5) identifies the top five consistent filters in the conv-5 layer and uses the one that has the highest response during testing. To make the comparison fair, we also remove filters other than the top N hierarchical CNN features in the grasp distribution that have the lowest position variance among training

examples for the first three comparative approaches. We use $N_5 = N_4 = N_3 = 5$ and $N = 15$ in this experiment. The results are shown in the first row of Table 2. Our approach performs better than all four alternatives. However, the difference is small compared to the first and third alternatives; this is because, when only one object is present, the lower level filter that has the maximum response is mostly the same with or without restricting it to have the same parent filter. In the next test, we show that the targeted backpropagation approach has a greater advantage when low level features may be generated from different high level structures, i.e., in clutter. The fact that our approach outperforms the single-conv5 approach shows the benefit of using lower level features in addition to higher level features when planning actions. In contrast, the notbp-conv5 approach, which also uses only high level filters, performs well because although filters in the conv-5 layer are more likely to represent higher level object structures, many of them also represent local features like corners and edges.

Targeted backpropagation is most useful when the scene is more complex and the same filter in the lower layers fires at multiple places. Since our approach limits the filters in the conv-3 layer to have the same parent filter in the conv-5 layer, only lower layer features that belong to the same high level structure are considered. We further tested these four alternative approaches on the cluttered test set. On the cluttered test set we consider a test case successful if the distance errors of the thumb tip and index finger tip are both less than 5 cm and the palm error is less than 10 cm. The results are shown in the second row of Table 2. Figure 8 shows a few example results using the targeted backpropagation approach on cluttered test cases.
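The success criterion for the cluttered test cases stated above is simple enough to write down directly. The sketch below assumes positions are `(x, y, z)` tuples in meters and that each test case is a pair of dictionaries keyed by `"thumb"`, `"index"`, and `"palm"`; these names and the data layout are hypothetical.

```python
import math

FINGER_TOL_M = 0.05   # thumb tip and index finger tip error must be below 5 cm
PALM_TOL_M = 0.10     # palm error must be below 10 cm

def grasp_errors(predicted, truth):
    """Euclidean distance between predicted and ground-truth position per end effector."""
    return {name: math.dist(predicted[name], truth[name])
            for name in ("thumb", "index", "palm")}

def is_successful(predicted, truth):
    """Apply the cluttered-scene criterion: both finger errors and the palm error in range."""
    e = grasp_errors(predicted, truth)
    return (e["thumb"] < FINGER_TOL_M
            and e["index"] < FINGER_TOL_M
            and e["palm"] < PALM_TOL_M)

def count_failures(cases):
    """Number of (predicted, truth) pairs that do not meet the criterion."""
    return sum(1 for predicted, truth in cases if not is_successful(predicted, truth))
```

`count_failures` corresponds to the quantity reported in the second row of Table 2, computed over the 24 cluttered cases.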
The targeted backpropagation approach performs significantly better than the notbp-test, notbp-train, and notbp-conv5 approaches, since filters that are not constrained to a single high level filter may fire on different objects. Figure 9 shows two comparison results between the targeted backpropagation approach and the notbp-test approach.

Table 2. Comparing with and without targeted backpropagation. Columns: tbp, notbp-test, notbp-train, notbp-conv5, single-conv5. Rows: single object experiment, cross-validation average grasp position error (cm); cluttered experiment, number of failed cluttered cases (24 total).

Fig. 8. Examples of grasping in a cluttered scenario. The red, green, and blue spheres represent the target palm, thumb tip, and index finger tip locations of the left robot hand. Each target location is the weighted mean of the colored dots, each of which represents a relative position from a feature to a grasp point in one training example. The top two rows are trained on grasping cuboid objects and the bottom two rows are trained on grasping cylinder objects. Notice that our approach is able to identify the only cuboid or cylinder in the scene and generate grasp points similar to the training examples.
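The weighted-mean step mentioned in this caption can be illustrated with a short sketch. It assumes that each training example contributes one candidate grasp point (a feature's 3D position in the test scene plus the stored feature-to-grasp-point offset) together with a non-negative weight. How the weights are computed is not specified here, so they are treated as given inputs, and the name `weighted_target` is hypothetical.

```python
def weighted_target(feature_positions, offsets, weights):
    """Weighted mean of candidate grasp points.

    feature_positions: list of (x, y, z) feature locations in the test scene.
    offsets: list of (dx, dy, dz) feature-to-grasp-point offsets, one per
        training example (assumed layout for this sketch).
    weights: list of non-negative floats, one per candidate.
    """
    if not (len(feature_positions) == len(offsets) == len(weights)):
        raise ValueError("inputs must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Each candidate (a "colored dot") is a feature position shifted by its offset.
    candidates = [
        tuple(p + d for p, d in zip(pos, off))
        for pos, off in zip(feature_positions, offsets)
    ]
    # The per-coordinate weighted average gives the target (a "sphere").
    return tuple(
        sum(w * c[axis] for w, c in zip(weights, candidates)) / total
        for axis in range(3)
    )
```

One such target would be computed separately for the palm, the thumb tip, and the index finger tip.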

Fig. 9. Comparing with and without targeted backpropagation in a cluttered scenario. The left column uses the targeted backpropagation approach and the right column uses the notbp-test approach. The top row uses features learned on a cuboid object and the bottom row uses features learned on a cylinder object. The red, green, and blue spheres represent the target palm, thumb tip, and index finger tip locations of the left robot hand. Each target location is the weighted mean of the colored dots, each of which represents a relative position from a feature to a grasp point in one training example. Notice that without targeted backpropagation the colored dots are scattered, since the highest response filters in the conv-3 and conv-4 layers are no longer restricted to the same high level structure.
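The effect described in this caption can be mimicked with a toy example. The sketch below is not the targeted backpropagation procedure itself. It only assumes that, for one lower-layer filter, we have a 2D response map and a same-sized boolean mask marking the locations that support the chosen parent filter (in the paper's setting such a mask would come from tracing the parent's activation backwards). All names and values are illustrative.

```python
def argmax_location(response, support=None):
    """Return (row, col) of the strongest response.

    response: 2D list of floats (one filter's response map).
    support: optional 2D list of bools of the same shape; when given, only
        locations where support[r][c] is True are considered. In this sketch
        the mask stands in for "locations that contributed to the parent
        filter", which is an assumption made for illustration.
    """
    best, best_loc = None, None
    for r, row in enumerate(response):
        for c, value in enumerate(row):
            if support is not None and not support[r][c]:
                continue
            if best is None or value > best:
                best, best_loc = value, (r, c)
    return best_loc


# Two objects in view: the same low-level filter fires on both.
response = [
    [0.1, 0.0, 0.0, 0.9],  # strong response on a distractor (right side)
    [0.0, 0.7, 0.0, 0.0],  # weaker response on the target object (left side)
]
# Hypothetical mask: only the left half supports the chosen parent filter.
support = [
    [True, True, False, False],
    [True, True, False, False],
]

unrestricted = argmax_location(response)         # lands on the distractor
restricted = argmax_location(response, support)  # stays on the target object
```

Without the mask the selected location jumps to whichever object produces the strongest response, which is the scattering visible in the right column of the figure.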

Experiments on Robonaut-2

In this section, we describe how we evaluate the proposed pre-shaping algorithm based on the percentage of successful grasps on a set of novel objects on Robonaut-2 [3]. We describe the experimental setting, the hierarchical controller used for pre-shaping, and the results in the following.

Experimental Setting. For each trial, an object in the novel object set is placed on a flat surface within the robot's reach. Given the object image and point cloud, the robot moves its wrist and fingers to the pre-shaping pose. After reaching the pre-shaping pose, the hand changes to a pre-defined closed posture and tries to pick up the object by moving the hand up vertically. We consider a grasp successful if the object does not drop after the robot tries to pick it up. We tested a total of 100 grasping trials on 10 novel objects with our proposed approach and a comparative baseline approach. The novel objects used in this experiment are shown in Figure 4.

Cylindrical objects                      tumbler      wipe package  basil container  hemp protein  duster          average
Point cloud approach                     40%          60%           0%               20%           40%             36%
Hierarchical CNN feature (Our approach)  100%         100%          100%             100%          100%            100%

Cuboid objects                           cracker box  ritz box      bevita box       bag box       energy bar box  average
Point cloud approach                     80%          20%           60%              60%           60%             52%
Hierarchical CNN feature (Our approach)  100%         80%           100%             100%          100%            96%

Table 3. Grasp success rate on novel objects based on 5 trials per object.

Hierarchical Controller. A hierarchical controller that corresponds to hierarchical CNN features in different CNN layers is implemented to reach the pre-shaping pose. Given the object image and point cloud, our approach generates targets for the robot palm, index finger tip, and thumb tip. The palm target is determined based on CNN features in the conv-4 layer, while the thumb tip and index finger tip targets are determined based on CNN features in the conv-3 layer. The pre-shaping is executed in two steps. First, the arm controller moves the arm such that the sum of distances from the palm, index finger tip, and thumb tip to their corresponding targets is minimized. Once the arm controller converges, the hand controller moves the wrist and fingers to minimize the sum of distances from the index finger tip and thumb tip to their corresponding targets. These controllers are based on the control basis framework [31] and can be written in the form $\phi|^{\sigma}_{\tau}$, where $\phi$ is a potential function that describes the sum of distances to the targets, $\sigma$ represents the sensory resources allocated, and $\tau$ represents the motor resources allocated.

Results. We compare our algorithm to a baseline point cloud approach that moves the robot hand to a position where the object point cloud center is located at the center of the hand after the hand is fully closed. The experiment results are shown in Table 3. Among the 50 grasping trials with our approach, only one grasp failed, due to a failure in controlling the index finger to the target position. We demonstrate that our approach has a much higher probability of success in grasping novel objects than the point cloud based approach. Figure 10 shows Robonaut-2 grasping novel objects during testing.

6 Discussion

Past studies have shown that CNN features are only effective for object recognition when most of the filter responses are used. In [32], it is shown that to achieve 90% object recognition accuracy on the PASCAL dataset, most object classes require using at least 30 to 40 filters in the conv-5 layer. Research on inverting CNN features [20] has also shown that most information is contained not in the top-5 activated filters but in the remaining filters with small probabilities.
In this work, we demonstrate that a subset of hierarchical CNN features that have the same parent filter in the conv-5 layer is sufficient for pre-shaping the robot hand for grasping common household objects with simple shapes. Although using all filters may provide subtle information for object classification, we argue that when interacting with objects, the strength of CNN features lies in their hierarchical nature and a small set of filters is sufficient to support manipulation. However, our current work is limited to objects with simple shapes that can be represented by a single filter in the conv-5 layer. In future work, we plan to extend to more complicated objects by performing targeted backpropagation from a combination of higher layer filters that represent more complicated object types.

Fig. 10. Robonaut-2 grasping 10 different novel objects. The first and third columns show the pre-shaping steps while the second and fourth columns show the corresponding grasp and pickup. Notice that the cuboid objects are grasped on the faces while the cylinder objects are grasped such that the object is wrapped in the hand.

7 Conclusion

In this work, we tackle the problem of pre-shaping a human-like robot hand for grasping based on visual input. We collect a grasping dataset with detailed hand configurations of Robonaut-2 and introduce the hierarchical CNN feature that captures the hierarchical relationship between filters in a pre-trained CNN. Our approach first identifies hierarchical CNN features that are active consistently among the same type of grasps in training; an approach we call targeted backpropagation is then used to locate each feature's 3D position. With these feature locations we generate distributions for the robot palm, thumb, and index finger positions based on hierarchical CNN features in different CNN layers. We evaluate our approach on the R2 Grasping Dataset we collected and show significant improvement over approaches without hierarchical CNN features and targeted backpropagation in cluttered scenarios. We further tested our solution in a grasping experiment where 50 grasp trials on novel objects are performed on Robonaut-2. We compared to a point cloud based baseline approach and showed that our approach results in a much higher percentage of successful grasps.

8 Acknowledgment

This material is based upon work supported under Grant NASA-GCT-NNX12AR16A and a NASA Space Technology Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Aeronautics and Space Administration.

References

1. Goodale, M., Milner, D.: Sight unseen: An exploration of conscious and unconscious vision. OUP Oxford (2013)
2. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3) (2015)
3. Diftler, M.A., Mehling, J., Abdallah, M.E., Radford, N.A., Bridgwater, L.B., Sanders, A.M., Askew, R.S., Linn, D.M., Yamokoski, J.D., Permenter, F., et al.: Robonaut 2 - the first humanoid robot in space. In: Robotics and Automation (ICRA), 2011 IEEE International Conference on, IEEE (2011)
4. Schneider, G.E.: Two visual systems. Science (1969)
5. Mishkin, M., Ungerleider, L.G., Macko, K.A.: Object vision and spatial vision: two cortical pathways. Trends in Neurosciences 6 (1983)
6. Goodale, M.A., Milner, A.D.: Separate visual pathways for perception and action. Trends in Neurosciences 15(1) (1992)
7. McIntosh, R.D., Schenk, T.: Two visual streams for perception and action: Current trends. Neuropsychologia 47(6) (2009)
8. Adamo, M., Ferber, S.: A picture says more than a thousand words: Behavioural and ERP evidence for attentional enhancements due to action affordances. Neuropsychologia 47(6) (2009)
9. Saxena, A., Driemeyer, J., Ng, A.Y.: Robotic grasping of novel objects using vision. The International Journal of Robotics Research 27(2) (2008)
10. Platt, R., Grupen, R.A., Fagg, A.H.: Re-using schematic grasping policies. In: Humanoid Robots, IEEE-RAS International Conference on, IEEE (2005)

11. Saxena, A., Wong, L.L., Ng, A.Y.: Learning grasp strategies with partial shape information. In: AAAI. Volume 3. (2008)
12. Herzog, A., Pastor, P., Kalakrishnan, M., Righetti, L., Bohg, J., Asfour, T., Schaal, S.: Learning of grasp selection based on shape-templates. Autonomous Robots 36(1-2) (2014)
13. Pas, A.t., Platt, R.: Using geometry to detect grasps in 3D point clouds. arXiv preprint arXiv: (2015)
14. Zhang, L.E., Ciocarlie, M., Hsiao, K.: Grasp evaluation with graspable feature matching. In: RSS Workshop on Mobile Manipulation: Learning to Manipulate. (2011)
15. Lenz, I., Lee, H., Saxena, A.: Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34(4-5) (2015)
16. Levine, S., Finn, C., Darrell, T., Abbeel, P.: End-to-end training of deep visuomotor policies. arXiv preprint arXiv: (2015)
17. Finn, C., Tan, X.Y., Duan, Y., Darrell, T., Levine, S., Abbeel, P.: Deep spatial autoencoders for visuomotor learning. reconstruction 117(117) (2015)
18. Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. arXiv preprint arXiv: (2015)
19. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Computer Vision ECCV. Springer (2014)
20. Dosovitskiy, A., Brox, T.: Inverting convolutional networks with convolutional networks. arXiv preprint arXiv: (2015)
21. Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., Lipson, H.: Understanding neural networks through deep visualization. In: Deep Learning Workshop, International Conference on Machine Learning (ICML). (2015)
22. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015)
23. Schwarz, M., Schulz, H., Behnke, S.: RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In: Robotics and Automation (ICRA), 2015 IEEE International Conference on, IEEE (2015)
24. Wilkinson, E., Takahashi, T.: Efficient aspect object models using pre-trained convolutional neural networks. In: Humanoid Robots (Humanoids), 2015 IEEE-RAS 15th International Conference on, IEEE (2015)
25. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on 32(9) (2010)
26. Dinh, P., Hart, S.: NASA Robonaut 2 Simulator (2013) [Online; accessed 7-July-2014].
27. Sucan, I.A., Chitta, S.: MoveIt! (2013) [Online].
28. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. (2012)
29. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv: (2014)
30. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6) (1981)

31. Huber, M.: A hybrid architecture for adaptive robot control. PhD thesis, University of Massachusetts Amherst (2000)
32. Agrawal, P., Girshick, R., Malik, J.: Analyzing the performance of multilayer neural networks for object recognition. In: Computer Vision ECCV. Springer (2014)


RGB-D Object Recognition and Pose Estimation based on Pre-trained Convolutional Neural Network Features RGB-D Object Recognition and Pose Estimation based on Pre-trained Convolutional Neural Network Features Max Schwarz, Hannes Schulz, and Sven Behnke Abstract Object recognition and pose estimation from

More information

International Journal of Computer Engineering and Applications, Volume XII, Special Issue, September 18,

International Journal of Computer Engineering and Applications, Volume XII, Special Issue, September 18, REAL-TIME OBJECT DETECTION WITH CONVOLUTION NEURAL NETWORK USING KERAS Asmita Goswami [1], Lokesh Soni [2 ] Department of Information Technology [1] Jaipur Engineering College and Research Center Jaipur[2]

More information

Computer Vision Lecture 16

Computer Vision Lecture 16 Announcements Computer Vision Lecture 16 Deep Learning Applications 11.01.2017 Seminar registration period starts on Friday We will offer a lab course in the summer semester Deep Robot Learning Topic:

More information

Computer Vision Lecture 16

Computer Vision Lecture 16 Computer Vision Lecture 16 Deep Learning Applications 11.01.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Seminar registration period starts

More information

3D Shape Analysis with Multi-view Convolutional Networks. Evangelos Kalogerakis

3D Shape Analysis with Multi-view Convolutional Networks. Evangelos Kalogerakis 3D Shape Analysis with Multi-view Convolutional Networks Evangelos Kalogerakis 3D model repositories [3D Warehouse - video] 3D geometry acquisition [KinectFusion - video] 3D shapes come in various flavors

More information

Deep Learning for Computer Vision II

Deep Learning for Computer Vision II IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

More information

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong TABLE I CLASSIFICATION ACCURACY OF DIFFERENT PRE-TRAINED MODELS ON THE TEST DATA

More information

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

COMP 551 Applied Machine Learning Lecture 16: Deep Learning COMP 551 Applied Machine Learning Lecture 16: Deep Learning Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless otherwise noted, all

More information

arxiv: v1 [cs.cv] 20 Dec 2016

arxiv: v1 [cs.cv] 20 Dec 2016 End-to-End Pedestrian Collision Warning System based on a Convolutional Neural Network with Semantic Segmentation arxiv:1612.06558v1 [cs.cv] 20 Dec 2016 Heechul Jung heechul@dgist.ac.kr Min-Kook Choi mkchoi@dgist.ac.kr

More information

Visual Perception for Robots

Visual Perception for Robots Visual Perception for Robots Sven Behnke Computer Science Institute VI Autonomous Intelligent Systems Our Cognitive Robots Complete systems for example scenarios Equipped with rich sensors Flying robot

More information

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images 1 Introduction - Steve Chuang and Eric Shan - Determining object orientation in images is a well-established topic

More information

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection Zeming Li, 1 Yilun Chen, 2 Gang Yu, 2 Yangdong

More information

A model for full local image interpretation

A model for full local image interpretation A model for full local image interpretation Guy Ben-Yosef 1 (guy.ben-yosef@weizmann.ac.il)) Liav Assif 1 (liav.assif@weizmann.ac.il) Daniel Harari 1,2 (hararid@mit.edu) Shimon Ullman 1,2 (shimon.ullman@weizmann.ac.il)

More information

Depth Estimation from a Single Image Using a Deep Neural Network Milestone Report

Depth Estimation from a Single Image Using a Deep Neural Network Milestone Report Figure 1: The architecture of the convolutional network. Input: a single view image; Output: a depth map. 3 Related Work In [4] they used depth maps of indoor scenes produced by a Microsoft Kinect to successfully

More information

Supervised Learning of Classifiers

Supervised Learning of Classifiers Supervised Learning of Classifiers Carlo Tomasi Supervised learning is the problem of computing a function from a feature (or input) space X to an output space Y from a training set T of feature-output

More information

Pose estimation using a variety of techniques

Pose estimation using a variety of techniques Pose estimation using a variety of techniques Keegan Go Stanford University keegango@stanford.edu Abstract Vision is an integral part robotic systems a component that is needed for robots to interact robustly

More information

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 18: Deep learning and Vision: Convolutional neural networks Teacher: Gianni A. Di Caro DEEP, SHALLOW, CONNECTED, SPARSE? Fully connected multi-layer feed-forward perceptrons: More powerful

More information

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta Encoder-Decoder Networks for Semantic Segmentation Sachin Mehta Outline > Overview of Semantic Segmentation > Encoder-Decoder Networks > Results What is Semantic Segmentation? Input: RGB Image Output:

More information

Combining RGB and Points to Predict Grasping Region for Robotic Bin-Picking

Combining RGB and Points to Predict Grasping Region for Robotic Bin-Picking Combining RGB and Points to Predict Grasping Region for Robotic Bin-Picking Quanquan Shao a, Jie Hu Shanghai Jiao Tong University Shanghai, China e-mail: a sjtudq@qq.com Abstract This paper focuses on

More information

Semantic RGB-D Perception for Cognitive Robots

Semantic RGB-D Perception for Cognitive Robots Semantic RGB-D Perception for Cognitive Robots Sven Behnke Computer Science Institute VI Autonomous Intelligent Systems Our Domestic Service Robots Dynamaid Cosero Size: 100-180 cm, weight: 30-35 kg 36

More information

arxiv:submit/ [cs.cv] 13 Jan 2018

arxiv:submit/ [cs.cv] 13 Jan 2018 Benchmark Visual Question Answer Models by using Focus Map Wenda Qiu Yueyang Xianzang Zhekai Zhang Shanghai Jiaotong University arxiv:submit/2130661 [cs.cv] 13 Jan 2018 Abstract Inferring and Executing

More information

Using Geometric Blur for Point Correspondence

Using Geometric Blur for Point Correspondence 1 Using Geometric Blur for Point Correspondence Nisarg Vyas Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA Abstract In computer vision applications, point correspondence

More information

Dynamic Routing Between Capsules

Dynamic Routing Between Capsules Report Explainable Machine Learning Dynamic Routing Between Capsules Author: Michael Dorkenwald Supervisor: Dr. Ullrich Köthe 28. Juni 2018 Inhaltsverzeichnis 1 Introduction 2 2 Motivation 2 3 CapusleNet

More information

Team Description Paper Team AutonOHM

Team Description Paper Team AutonOHM Team Description Paper Team AutonOHM Jon Martin, Daniel Ammon, Helmut Engelhardt, Tobias Fink, Tobias Scholz, and Marco Masannek University of Applied Science Nueremberg Georg-Simon-Ohm, Kesslerplatz 12,

More information

High precision grasp pose detection in dense clutter*

High precision grasp pose detection in dense clutter* High precision grasp pose detection in dense clutter* Marcus Gualtieri, Andreas ten Pas, Kate Saenko, Robert Platt College of Computer and Information Science, Northeastern University Department of Computer

More information

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Deep learning for object detection. Slides from Svetlana Lazebnik and many others Deep learning for object detection Slides from Svetlana Lazebnik and many others Recent developments in object detection 80% PASCAL VOC mean0average0precision0(map) 70% 60% 50% 40% 30% 20% 10% Before deep

More information

Ryerson University CP8208. Soft Computing and Machine Intelligence. Naive Road-Detection using CNNS. Authors: Sarah Asiri - Domenic Curro

Ryerson University CP8208. Soft Computing and Machine Intelligence. Naive Road-Detection using CNNS. Authors: Sarah Asiri - Domenic Curro Ryerson University CP8208 Soft Computing and Machine Intelligence Naive Road-Detection using CNNS Authors: Sarah Asiri - Domenic Curro April 24 2016 Contents 1 Abstract 2 2 Introduction 2 3 Motivation

More information

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University LSTM and its variants for visual recognition Xiaodan Liang xdliang328@gmail.com Sun Yat-sen University Outline Context Modelling with CNN LSTM and its Variants LSTM Architecture Variants Application in

More information

Optimizing Monocular Cues for Depth Estimation from Indoor Images

Optimizing Monocular Cues for Depth Estimation from Indoor Images Optimizing Monocular Cues for Depth Estimation from Indoor Images Aditya Venkatraman 1, Sheetal Mahadik 2 1, 2 Department of Electronics and Telecommunication, ST Francis Institute of Technology, Mumbai,

More information

Human Pose Estimation with Deep Learning. Wei Yang

Human Pose Estimation with Deep Learning. Wei Yang Human Pose Estimation with Deep Learning Wei Yang Applications Understand Activities Family Robots American Heist (2014) - The Bank Robbery Scene 2 What do we need to know to recognize a crime scene? 3

More information

(Deep) Learning for Robot Perception and Navigation. Wolfram Burgard

(Deep) Learning for Robot Perception and Navigation. Wolfram Burgard (Deep) Learning for Robot Perception and Navigation Wolfram Burgard Deep Learning for Robot Perception (and Navigation) Lifeng Bo, Claas Bollen, Thomas Brox, Andreas Eitel, Dieter Fox, Gabriel L. Oliveira,

More information

Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network Anurag Arnab and Philip H.S. Torr University of Oxford {anurag.arnab, philip.torr}@eng.ox.ac.uk 1. Introduction

More information

Rotation Invariance Neural Network

Rotation Invariance Neural Network Rotation Invariance Neural Network Shiyuan Li Abstract Rotation invariance and translate invariance have great values in image recognition. In this paper, we bring a new architecture in convolutional neural

More information

arxiv: v1 [cs.ro] 31 Dec 2018

arxiv: v1 [cs.ro] 31 Dec 2018 A dataset of 40K naturalistic 6-degree-of-freedom robotic grasp demonstrations. Rajan Iyengar, Victor Reyes Osorio, Presish Bhattachan, Adrian Ragobar, Bryan Tripp * University of Waterloo arxiv:1812.11683v1

More information

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides Deep Learning in Visual Recognition Thanks Da Zhang for the slides Deep Learning is Everywhere 2 Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object

More information

Efficient Grasping from RGBD Images: Learning Using a New Rectangle Representation. Yun Jiang, Stephen Moseson, Ashutosh Saxena Cornell University

Efficient Grasping from RGBD Images: Learning Using a New Rectangle Representation. Yun Jiang, Stephen Moseson, Ashutosh Saxena Cornell University Efficient Grasping from RGBD Images: Learning Using a New Rectangle Representation Yun Jiang, Stephen Moseson, Ashutosh Saxena Cornell University Problem Goal: Figure out a way to pick up the object. Approach

More information

Know your data - many types of networks

Know your data - many types of networks Architectures Know your data - many types of networks Fixed length representation Variable length representation Online video sequences, or samples of different sizes Images Specific architectures for

More information

Robotics Programming Laboratory

Robotics Programming Laboratory Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car

More information

An Exploration of Computer Vision Techniques for Bird Species Classification

An Exploration of Computer Vision Techniques for Bird Species Classification An Exploration of Computer Vision Techniques for Bird Species Classification Anne L. Alter, Karen M. Wang December 15, 2017 Abstract Bird classification, a fine-grained categorization task, is a complex

More information

Bilinear Models for Fine-Grained Visual Recognition

Bilinear Models for Fine-Grained Visual Recognition Bilinear Models for Fine-Grained Visual Recognition Subhransu Maji College of Information and Computer Sciences University of Massachusetts, Amherst Fine-grained visual recognition Example: distinguish

More information

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU, Machine Learning 10-701, Fall 2015 Deep Learning Eric Xing (and Pengtao Xie) Lecture 8, October 6, 2015 Eric Xing @ CMU, 2015 1 A perennial challenge in computer vision: feature engineering SIFT Spin image

More information