arxiv: v3 [cs.cv] 21 Oct 2016

Similar documents
Associating Grasp Configurations with Hierarchical Features in Convolutional Neural Networks

TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK

arxiv: v1 [cs.cv] 31 Mar 2016

Volumetric and Multi-View CNNs for Object Classification on 3D Data Supplementary Material

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

Aspect Transition Graph: an Affordance-Based Model

Aspect Transition Graph: an Affordance-Based Model

Part Localization by Exploiting Deep Convolutional Networks

Learning Hand-Eye Coordination for Robotic Grasping with Large-Scale Data Collection

Does the Brain do Inverse Graphics?

The Crucial Components to Solve the Picking Problem

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Learning to Use a Ratchet by Modeling Spatial Relations in Demonstrations

JOINT DETECTION AND SEGMENTATION WITH DEEP HIERARCHICAL NETWORKS. Zhao Chen Machine Learning Intern, NVIDIA

The Hilbert Problems of Computer Vision. Jitendra Malik UC Berkeley & Google, Inc.

3D model classification using convolutional neural network

arxiv: v3 [cs.ro] 9 Nov 2017

Learning to Grasp Objects: A Novel Approach for Localizing Objects Using Depth Based Segmentation

In Defense of Fully Connected Layers in Visual Representation Transfer

Tunnel Effect in CNNs: Image Reconstruction From Max-Switch Locations

3D Object Recognition and Scene Understanding from RGB-D Videos. Yu Xiang Postdoctoral Researcher University of Washington

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing

Does the Brain do Inverse Graphics?

IDE-3D: Predicting Indoor Depth Utilizing Geometric and Monocular Cues

Recognize Complex Events from Static Images by Fusing Deep Channels Supplementary Materials

arxiv: v1 [cs.cv] 28 Sep 2018

Channel Locality Block: A Variant of Squeeze-and-Excitation

Learning Semantic Environment Perception for Cognitive Robots

Perceiving the 3D World from Images and Videos. Yu Xiang Postdoctoral Researcher University of Washington

Deep Neural Networks:

Fully Convolutional Networks for Semantic Segmentation

arxiv: v1 [cs.cv] 6 Jul 2016

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing Supplementary Material

Computer Vision Lecture 16

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

A Hierarchial Model for Visual Perception

Real-Time Depth Estimation from 2D Images

Learning a visuomotor controller for real world robotic grasping using simulated depth images

Why equivariance is better than premature invariance

Real-time Hand Tracking under Occlusion from an Egocentric RGB-D Sensor Supplemental Document

Content-Based Image Recovery

Object Purpose Based Grasping

Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task

Efficient Segmentation-Aided Text Detection For Intelligent Robots

Supplementary Material for Zoom and Learn: Generalizing Deep Stereo Matching to Novel Domains

Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network. Nathan Sun CIS601

arxiv: v1 [cs.ro] 11 Jul 2016

CEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015

Multi-Glance Attention Models For Image Classification

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Two-Stream Convolutional Networks for Action Recognition in Videos

Learning from Successes and Failures to Grasp Objects with a Vacuum Gripper

Contextual Dropout. Sam Fok. Abstract. 1. Introduction. 2. Background and Related Work

Real-time Object Detection CS 229 Course Project

Deformable Part Models

arxiv: v4 [cs.cv] 6 Jul 2016

arxiv: v1 [q-bio.nc] 24 Nov 2014

RGB-D Object Recognition and Pose Estimation based on Pre-trained Convolutional Neural Network Features

International Journal of Computer Engineering and Applications, Volume XII, Special Issue, September 18,

Computer Vision Lecture 16

Computer Vision Lecture 16

3D Shape Analysis with Multi-view Convolutional Networks. Evangelos Kalogerakis

Deep Learning for Computer Vision II

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

arxiv: v1 [cs.cv] 20 Dec 2016

Visual Perception for Robots

Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images

R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection

A model for full local image interpretation

Depth Estimation from a Single Image Using a Deep Neural Network Milestone Report

Supervised Learning of Classifiers

Pose estimation using a variety of techniques

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

Combining RGB and Points to Predict Grasping Region for Robotic Bin-Picking

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

Semantic RGB-D Perception for Cognitive Robots

arxiv:submit/ [cs.cv] 13 Jan 2018

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Using Geometric Blur for Point Correspondence

Dynamic Routing Between Capsules

Team Description Paper Team AutonOHM

High precision grasp pose detection in dense clutter*

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Ryerson University CP8208. Soft Computing and Machine Intelligence. Naive Road-Detection using CNNS. Authors: Sarah Asiri - Domenic Curro

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

Optimizing Monocular Cues for Depth Estimation from Indoor Images

Human Pose Estimation with Deep Learning. Wei Yang

(Deep) Learning for Robot Perception and Navigation. Wolfram Burgard

Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Rotation Invariance Neural Network

arxiv: v1 [cs.ro] 31 Dec 2018

Efficient Grasping from RGBD Images: Learning Using a New Rectangle Representation. Yun Jiang, Stephen Moseson, Ashutosh Saxena Cornell University

Know your data - many types of networks

Robotics Programming Laboratory

An Exploration of Computer Vision Techniques for Bird Species Classification

Bilinear Models for Fine-Grained Visual Recognition

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016

Transcription:

Associating Grasping with Convolutional Neural Network Features Li Yang Ku, Erik Learned-Miller, and Rod Grupen University of Massachusetts Amherst, Amherst, MA, USA {lku,elm,grupen}@cs.umass.edu arxiv:1609.03947v3 [cs.cv] 21 Oct 2016 Abstract. In this work, we provide a solution for pre-shaping a humanlike robot hand for grasping based on visual information. Our approach uses convolutional neural networks (CNNs) to define a mapping between images and grasps. Applying CNNs to robotics applications is non-trivial for two reasons. First, collecting enough robot data to train a CNN at the same scale as the models trained in the vision community is extremely difficult. In this work, we demonstrate that by using a pre-trained CNN, a small set of grasping examples is sufficient for generalizing across different objects of similar shapes. Second, the final output of a CNN contains little location information of the observed object, which is essential for the robot to manipulate the object. We take advantage of the hierarchical nature of CNN layers and identify the 3D positions of features that capture the hierarchical support relations between filters in different CNN layers using an approach we call targeted backpropagation. Targeted backpropagation traces the activation of higher level features in a CNN backwards through the network to discover the locations in the observation that were responsible for making them fire, thus localizing important structures that are manipulable in the environment. We show that this approach outperforms approaches without targeted backpropagation in a cluttered scene. We further implement a hierarchical controller that controls fingers and palms based on features located in different layers of the CNN for pre-shaping the robot hand and demonstrate that this approach outperforms a point cloud based approach on a grasping task on Robonaut-2. 1 Introduction In this work, we provide a solution for pre-shaping a human-like robot hand based on visual information. It has been shown that humans are capable of pre-shaping their hands to grasp an object based merely on visual information from an object. Experiments done with patient DF by Goodale and Milner [1] have shown that human s pre-shaping ability is likely handled by the dorsal pathway in our brain rather then the ventral pathway that recognizes objects. This suggests that the ability to pre-shape our hand before grasping is not just the result of recognizing the object. Although without haptic feedback, visual input may not be sufficient for certifying a robust grasp, pre-shaping is still a fundamental ability that can guide the robot hand before it makes contact with

2 Associating Grasping with Convolutional Neural Network Features Fig. 1. Overall architecture and hierarchical CNN features. A tuple of a yellow dot, cyan dot, and magenta dot represents a hierarchical CNN feature in the conv-3 layer. The magenta dots represent filters part of the hierarchical CNN features in the conv-3 layer that contribute to the same conv-5 layer filter represented by the yellow dot. These features can be traced back to the image input and mapped to the point cloud to support manipulation. The arrows from the CNN layers to the controllers indicate a mapping from hierarchical CNN features in the conv-4 and conv-3 layers to the arm and hand controllers. Target points for the arm and hand controllers are generated based on the relative positions to corresponding features. the object. In this work, the purpose of pre-shaping is not just for picking up the object but to grasp it in a specific way. We believe this may result in a more stable grasp and more consistent interactions with the object. Our approach uses convolutional neural networks (CNNs) to define a mapping between images and grasps. Since 2012, CNNs have attracted a great deal of attention in the computer vision community and have outperformed other algorithms on many benchmarks. However applying CNNs to robotics applications is non-trivial for two reasons. First, collecting enough robot data to train a CNN at the same scale as the models trained in the vision community from scratch is extremely difficult. In this work, we demonstrate that by using a CNN trained on ImageNet [2], a small set of grasping examples is sufficient for generalizing across different objects of similar shapes. Second, the final output of a CNN contains little location information from the observed object, which is

Associating Grasping with Convolutional Neural Network Features 3 essential for the robot to manipulate the object. We take advantage of the hierarchical nature of CNN layers and identify a set of features we call hierarchical CNN features that capture the hierarchical support relations between filters in different CNN layers. The 3D positions of such features can be identified using an approach we call targeted backpropagation. Targeted backpropagation traces the activation of high-level filters in a CNN backwards through the network to discover the locations in the observation that were responsible for making them fire, thus localizing important structures that are manipulable in the environment. Figure 1 shows the overall architecture and how such hierarchical CNN features are mapped to a point cloud to support manipulation. Fig. 2. Robonaut-2 pre-shaping its hand before grasping an object. Notice that in the box example the fingers are preshaped to grasp the faces of the box and in the air duster example the fingers are preshaped such that the air duster would be wrapped in the hand. We analyze our algorithm based on the grasp point accuracy through crossvalidation on the R2 Grasping Dataset we collected. Approaches with and without the proposed targeted backpropagation are compared. This approach is then tested in a grasping experiment on the humanoid robot Robonaut-2 [3]. A hierarchical controller that controls fingers and palms based on features in different CNN layers is implemented for pre-shaping the robot hand. We demonstrate that the proposed approach outperforms a baseline approach that grasps the centroid of the object point cloud. Figure 2 shows Robonaut-2 pre-shaping its hand before grasping. 2 Related Work The idea that our brain encodes visual stimulus in two separate regions was first proposed in [4]. Ungerleider and Mishkin further discovered that the two areas, inferotemporal cortex and posterior parietal cortex receive independent

4 Associating Grasping with Convolutional Neural Network Features sets of projections from the striate cortex and propose that the ventral stream that ends in the inferotemporal cortex plays a critical role in identifying objects, while the dorsal stream that ends in the posterior parietal cortex encodes the spatial location of those same objects [5]. This hypothesis is often known as the distinction of what and where between the two visual pathways. However, in 1992 Goodale and Milner proposed an alternative perspective on the functionality of these two visual pathways based on many observations made with patient DF [6]. Patient DF is unique in the sense that she developed a profound visual form agnosia due to anoxic damage to her ventral stream. Despite DFs inability to recognize the shape, size and orientation of visual objects, she is capable of grasping the very same object with accurate hand and finger movements. Based on a series of experiments [1], Goodale and Milner proposed the perceptionaction model, which suggests that the dorsal pathway provides action-relevant information about the structural characteristic of objects and not just about their position. In our work, instead of planning a grasp action based on the result of object recognition, our approach is consistent with Goodale and Milner s perception-action model that separates pre-shaping from object recognition. Recently, studies on this perception-action model in the field of cognitive science have shifted from identifying functional dissociation between ventral and dorsal streams to how these two streams interact with each other [7]. The attentional blink experiments done in [8] suggest that object recognition processes in the ventral stream activate action-related processing in the dorsal stream, which will lead to enhanced selection and consolidation of objects that is related to the observed object. This interaction between the ventral and dorsal streams disappears when the association between items is semantic rather than visual. This result is consistent with our solution where the action process is instantiated by the classification process and uses visual clues instead of object identities for grasp actions. A lot of work has been done on generating robotic grasp plans from visual information. In [9], a single grasp point is identified using a probabilistic model on a set of visual features such as edges, textures, and colors. In [10], online learning is used to associate different types of grasp with the object s height and width. In [11] the probability of a hand configuration successfully grasping a novel object is calculated using contact, center of mass, and force closure properties based on point cloud and image information. A shape template approach for grasping novel objects was also introduced in [12]. The authors introduced a shape descriptor called height map to capture local object geometry for matching part of a point cloud generated by a novel object to a known grasp template. A geometric approach for grasping novel objects based on point cloud is proposed in [13]. An antipodal grasp is determined through finding cutting planes that satisfy geometric constraints. A similar approach based on local object geometry was also introduced in [14]. Our approach associates CNN features learned from the ImageNet to a known grasp. We consider not just a local feature but also where it is in a higher level structure. Since CNN features are based on RGB image input, our approach can infer a grasp pose from the side even when there

Associating Grasping with Convolutional Neural Network Features 5 are only point cloud information of the object s top surface. In [15], a deep network trained on 1035 examples is used to determine a successful grasp based on RGB-D data. Grasp positions are exhaustively searched and evalutated. Our approach uses pre-trained features and generates grasp points based on a small set of grasping examples. Several authors have applied CNNs to robotics. In [16], visuomotor policies are learned using an end-to-end neural network that takes images and outputs joint torques. A three layer CNN is used without any max pooling layer to maintain spatial information. In our work, we also use filters in the third convolution layers; but unlike the previous work, we consider their relationship with higher layer filters. In [17], an autoencoder is used to learn spatial information of features of a neural network. Our approach uses targeted backpropagation to find the receptive field that causes a high-level response in a particular image. In [18], a CNN is used to learn what features are graspable through 50 thousand trials collected using a Baxter robot. The final layer is used to select 1 out of 18 grasp orientations. Our approach considers the relative position between features and robot finger tips; therefore, capable of generating more complicated grasp poses using a human-like hand. Our work is also related to work in visualizing CNNs. In [19], the filters of each layer are visualized by looking for image patches that result in the maximum activation. Deconvolution is also used to find what pixels activated each filter. Another approach that trains a separate CNN to learn the reverse mapping from different layers of a CNN back to the original image has also been introduced in [20]. In our work, we use the visualization tool introduced by Yosinski [21] to visualize filters that represent meaningful characteristics in each layer. Some authors have explored using filter activation other than the final layer of a CNN. In [22], hypercolumns, which are defined as the activation of all CNN units above a pixel, are used on tasks such as simultaneous detection and segmentation, keypoint localization, and part labeling. Our approach groups filters in different layers based on their hierarchical relationship instead of spatial relationship. In [23], the last two layers of two CNNs, one that takes an image as input and one that takes depth as input, are used to identify object category, instance, and pose. In [24], the last layer is used to identify object instance while the fifth convolution layer is used to determine the aspect of an object. In our work we consider a feature as the activation of a lower layer filter that causes a specific higher layer filter to activate and plan grasp poses based on these features. Our work is also inspired by deformable part models [25] where a class model is composed of smaller models of parts; e.g. wheels are parts of a bicycle. We view Convolutional Neural Networks similarly; a filter in a higher layer is a combination of filters in the lower layer. If a higher layer filter represents a high level structure, the lower layer filters that contribute to this higher layer filter can be seen as representing local parts of this structure and may provide useful information for manipulating an object.

6 Associating Grasping with Convolutional Neural Network Features 3 Dataset In this work, we created the R2 Grasping Dataset that contains a detailed grasp pose for a set of objects. The data is collected using an Asus Xtion camera and the Robonaut-2 simulator [26]. The object is placed on a flat surface where the camera is about 70 cm above and looking down at 55 degree angle. The point cloud generated by the camera is then projected into the simulator based on the location which the xtion camera is mounted on Robonaut-2. We then manually adjust the left robot arm and each finger of the left hand so that the robot hand can perform a firm grasp on the object. For cuboid objects, we adjust the thumb tip and index finger tip to the front and back faces of the cuboid and about 3cm away from the left face. For the cylinder objects, we adjust the thumb tip and index finger tip such that the whole hand wraps around the object. The robot arm movement is executed using Moveit! [27]. We then record the point cloud, the RGB image, and the complete configuration of the hand in the camera frame. Fig. 3. The R2 Grasping Dataset is collected through an interactive interface where the robot arm and hand is adjusted to the grasp pose. We collected a total of 120 grasping examples of twelve different objects. six of the objects are cylindrical shaped and six of them are cuboids as shown in Figure 4. The same object is presented at different orientations and under different lighting conditions. In addition to grasping examples with a single object, we also created 24 grasping examples in cluttered scenarios. Twelve of them include a single cylin-

Associating Grasping with Convolutional Neural Network Features 7 Fig. 4. Left: The set of objects used in the R2 Grasping Dataset. Right: The set of novel objects used in the grasping experiment. drical object and twelve of them include a single cuboid object. The configuration of the hand while grasping these objects are also recorded. Figure 8 shows a few examples of the cluttered test set. 4 Approach To learn how to grasp a previously unseen object, we use AlexNet [28], a CNN trained on ImageNet, implemented with Caffe [29] to identify visual features that represent meaningful characteristics of the geometric shape of an object to support grasping. In this section, we first describe how to identify features that activate consistently among the training data. We call these identified features that capture the hierarchical support relations between filters in different CNN layers hierarchical CNN features. We further use an approach we call targeted backpropagation to discover the locations of these features by tracing the activation of filters in a CNN backwards through the network. We then discuss how a grasp distributions that stores the relative position between features and robot end effector frames are learned. 4.1 Identifying Consistent Hierarchical CNN Features Given a set of grasp poses on objects with similar shapes, our goal is to find a set of visual features that activate consistently. Our assumption is that some of these features will represent meaningful geometric information regarding the shape of an object that supports grasping. For each training example, we first do a masked forward pass on AlexNet. For each collected point cloud, we segment the flat supporting surface through Random Sample Consensus (RANSAC) [30] and create a 2D mask that excludes pixels that are not part of an object in the RGB image. This mask is dilated to include the boundary of the object. During the forward pass, for each convolution layer we check each filter and zero out the responses that are not marked as part of the object according to the mask. This

8 Associating Grasping with Convolutional Neural Network Features Fig. 5. Hierarchical CNN feature visualizations among cuboid objects (left) and cylinders (right). Each square figure is the visualization of a CNN filter while the edges connect a lower layer filter to a parent filter in a higher layer. The numbers under the squares are the corresponding filter indices. Filters are visualized using the visualization tool introduced in [21]. Notice that the lower level filters represent local structures of a parent filter. approach removes filter responses that are not part of the object. The masked forward pass approach is used instead of directly segmenting the object in the RGB image to avoid learning the sharp edges caused by segmentation. We first identify consistent filters, filters that activate consistently over similar object types, in the fifth convolution (conv-5) layer. The top N 5 filters that have the highest sum of log responses among all the grasping examples of the same type of grasp are pinpointed. We observe that many filters in the conv-5 layer represent high level structures and fire on box-like objects, tube-like objects, faces, etc. Knowing what type of object is observed can determine the type of the grasp but is not sufficient for grasping, since boxes can have different sizes and be placed at a different pose. However, if we can identify lower level features such as the front edge and the back edge of a box we can place the robot fingers relative to these lower level feature locations. In this work, we exploit the fact that CNN features are by nature hierarchical; a filter in a higher layer with little location information is a combination of lower level features with higher spatial accuracy. Instead of simply associating features with filters in a CNN, we consider a feature to be a lower layer filter in the CNN that causes certain higher layer filters to become active. For example, we found that filter 87 in the conv-5 layer in AlexNet represents a box shaped object. This filter is a combination of several filters in the fourth convolution (conv-4) layer. Among these filters, filter 190 in the conv-4 layer represents the lower right corner of the box and filter 133 in the conv-4 layer represents the top left

Associating Grasping with Convolutional Neural Network Features 9 corner of the box. Filter 190 is also a combination of several filters in the third convolution (conv-3) layer. Among these filters, filter 168 in the conv-3 layer represents the diagonal right edge of the lower right corner of the box and filter 54 in the conv-3 layer represents the diagonal left edge of the lower right corner of the box. Simply looking at filter 168 in the conv-3 layer will find all diagonal edges in an image, but if we only look at part of this filter that has a hierarchical relationship with filter 190 in the conv-4 layer and filter 87 in the conv-5 layer, we find local features that correspond to meaningful parts of a box-like object. In other words, instead of representing a feature with a single filter in the conv-3 layer, we use a tuple of filters to represent a feature such as (f87, 5 f190, 4 f168) 3 in the previous example, where fj m represents the jth filter in the mth convolutional layer. We call such features hierarchical CNN features. Within a hierarchical CNN feature we call a higher level filter the parent filter of a lower level filter. Figure 5 shows the visualization of several hierarchical CNN features identified in cuboid and cylindrical objects, including the feature in the previous example. In order to identify consistent filters in layers lower than conv-5, we first find the max response of the activation map for each of the consistent filters fi 5 in the conv-5 layer. For each of these max responses we zero out all other responses and perform backpropagation to obtain its derivative with respect to the conv-4 layer. We then find filters that contribute to fi 5 consistently in the conv-4 layer by identifying the top N 4 filters that have the highest sum of log derivatives. For each of these N 4 filters fj 4, we zero out all derivatives except for the maximum derivative of filter fj 4. In other words, for each hierarchical CNN feature we only perform backpropagation on one filter and one of its location that has the maximum derivative per convolutional layer. With the same procedure we then find N 3 filters that contribute to fj 4 consistently. We call this kind of backpropagation on one specific filter per layer targeted backpropagation. To further locate a feature in 3D we then perform backpropagation to the image and map the mean of the response location to the corresponding 3D point in the point cloud. We define the response of a hierarchical CNN feature as the maximum targeted backpropagation derivative of the lowest layer filter in the tuple. Figure 6 shows an example of results of targeted backpropagation to the image layer from hierarchical CNN features in the conv-5, conv-4, and conv-3 layers. The conv-3 layer features can be interpreted as representing lower level features such as edges and corners of the cuboid object. 4.2 Grasp Distribution A grasp distribution is used to map from feature locations to grasp points; it can be used to determine how likely that the end effectors relative positions to the corresponding features results in the intended grasp. For each type of grasp we store a distribution of grasp targets for each end effector with respect to each identified consistent feature. A grasp target is the relative position from a robot frame position to a feature position. In our experiment, we create a distribution from the robot palm to each consistent features in the conv-4 layer and distributions from the robot index finger tip and thumb tip to each consistent

10 Associating Grasping with Convolutional Neural Network Features Fig. 6. Targeted backpropagation example. The color image is the input image. Each square image with a black background represents the result of targeted backpropagation from hierarchical CNN features in the conv-5, conv-4, and conv-3 layers to the image layer. The blue dot in each targeted backpropagation result represents the mean of the response locations. Notice that the mean response locations of conv-3 layer features are located closer to edges and corners of the cuboid object compared to the mean response locations of conv-4 and conv-5 features. These conv-3 layer features can be interpreted as representing local structures of the cuboid object.

Associating Grasping with Convolutional Neural Network Features 11 features in the conv-3 layer. We call the combination of all of these distributions and their corresponding hierarchical CNN features the grasp distribution. Although specifying grasp points based on the offsets to feature positions may not be invariant to objects with different sizes, as long as the end effector is close to certain feature points the difference would be small. Not all features that fire consistently on objects within the same type are good features to plan actions relative to. For example, when grasping a box, positioning the index finger relative to the closer edge of the box will result in positions with high variance, since the size of the box may vary. In contrast, positioning the index finger relative to the further edge of the box will result in a lower variance. For each end effector position we pick the top N hierarchical CNN features in the grasp distribution that have the lowest position variance among training examples and have the same parent filter in the conv-5 layer. The other features are then removed from the grasp distribution. We found that restricting all the hierarchical CNN features to have the same parent filter allows our approach to perform well in a cluttered scenario. During testing, the hierarchical CNN features in the grasp distribution are first identified. The 2D locations of these features in the image plane are then located through targeted backpropagation and mapped to 3D positions on the 3D point cloud. The distributions that belong to the same end effector are then merged in the same coordinate based on the 3D positions of the corresponding hierarchical CNN features. The grasp targets for the robot palm, thump tip, and index finger tip are then determined by the weighted mean position of the merged distribution with the feature responses as weights. Figure 7 shows examples of generated grasp positions and grasp distributions on different objects. 5 Experiments In this work, we ran two sets of experiments. In the first set of experiments, we evaluate based on the difference between the calculated grasp points and the ground truth in the R2 Grasping Dataset. In the second set of experiments, we test our approach on grasping novel objects on Robonaut-2 [3] and compare the number of successful grasps with a baseline approach. 5.1 Experiments on the R2 Grasping Dataset In this set of experiments we analyze the performance of our approach on the R2 Grasping Dataset. First, we perform cross-validation and calculate the accuracy of grasp points. Second, we compare results with and without targeted backpropagation. Cross-Validation Results We apply cross-validation by leaving out one object instance at a time during training and test on the left out object by comparing the calculated grasp points to the ground truth. The distance between the example position and the targeted position of the palm, index finger tip, and

12 Associating Grasping with Convolutional Neural Network Features Fig. 7. Sample cross-validation results for single object scenario. The red, green, and blue spheres represent the target palm, thumb tip, and index finger tip locations of the left robot hand. The target locations are the weighted mean of the colored dots that each represents a relative position from a feature to a grasp point in one training example. Notice that for the cuboid object the target thumb tip and index finger tip is located on the opposing face and about 3cm away from the left face of the cuboid as it was trained. For the cylinder object the thumb tip and index finger tip is on the right side of the cylinder, such that the whole hand wraps around the object. The black pixels are locations behind the point cloud that are not observable.

Associating Grasping with Convolutional Neural Network Features 13 thumb tip is calculated and shown in Table 1. The average grasp position error for the palm is higher than the thumb and index finger; this is likely because given just the positions of local features is not sufficient to predict an accurate position for the palm. However, since the palm is not contacting the object, its position is less crucial for a successful grasp. Figure 7 shows a few results of cross-validation on different objects at different pose and lighting. Similar to the training data, the cuboid objects are grasped at positions closer to the left side while the cylindrical objects are grasped such that the fingers would wrap around the cylinder. cylindrical objects cetaphil wood maxwell blue paper yellow left hand jar cylinder can jar roll jar average thumb tip 0.0197 0.0164 0.0134 0.0202 0.0136 0.0147 0.0163 index tip 0.011 0.0111 0.023 0.017 0.0155 0.0126 0.015 palm 0.0128 0.0393 0.0234 0.0183 0.0328 0.0173 0.024 cuboid objects cube redtea bandage twinings brillo tazo left hand box box box box box box average thumb tip 0.0094 0.0101 0.0093 0.0095 0.0088 0.0111 0.01 index tip 0.0135 0.0207 0.0278 0.0134 0.0098 0.0157 0.0168 palm 0.0241 0.0203 0.0195 0.0177 0.0177 0.0285 0.0213 Table 1. Average grasp position error on cylindrical and cuboid objects in meters. Hierarchical CNN Feature and Targeted Backpropagation To evaluate the proposed hierarchical CNN feature and targeted backpropagation approach (tbp), we compare cross-validation results to four alternative methods. The first (notbp-test) uses only the lowest level filter of the hierarchical CNN features, therefore removing the relationship between lower and higher level CNN filters. The second alternative method (notbp-train) identifies the same number of consistent features in each layer during training but finds individual CNN filters instead of hierarchical CNN features. The grasp distribution is then generated based on these individual filters in each layer. Similar to the second alternative, the third (notbp-conv5) uses individual CNN filters instead of hierarchical CNN features but only considers filters in the conv-5 layer. The fourth alternative (single-conv5) identifies the top five consistent filters in the conv-5 layer and uses the one that has the highest response during testing. To make the comparison fair, we also remove filters other than the top N hierarchical CNN features in the grasp distribution that have the lowest position variance among training

14 Associating Grasping with Convolutional Neural Network Features examples for the first three comparative approaches. We use N 5 = N 4 = N 3 = 5 and N = 15 in this experiment. The results are shown in the first row of Table 2. Our approach performs better than all four alternatives. However the difference is small compared to the first and third alternatives, this is because the lower level filter that has the maximum response is mostly the same with or without restricting it to have the same parent filter when only one object is presented. In the next test, we will show that the targeted backpropagation approach has a greater advantage when low level features may be generated from different high level structures, i.e. clutter. The fact that our approach outperforms the single-conv5 approach shows the benefit of using lower level features to higher level features on planning actions. On the contrary, the notbp-train-conv5 approach that also only uses high level filters performs well because although filters in the conv-5 layer are more likely to represent higher level object structures, many of them also represent local features like corners and edges. Targeted backpropagation is most useful when the scene is more complex and the same filter in the lower layers fires at multiple places. Since our approach limits the filters in the conv-3 layer to have the same parent filter in the conv-5 layer, only lower layer features that belong to the same high level structure are considered. We further tested these four alternative approaches on the cluttered test set. On the cluttered test set we consider a test case successful if the distance errors of the thumb tip and index finger tip are both less than 5cm and the palm error is less than 10cm. The results are shown in the second row of Table 2. Figure 8 shows a few example results using the target backpropagation approach on cluttered test cases. The targeted backpropagation approach performs significantly better than the notbp-test, notbp-train, and notbp-conv5 approaches since filters may fire on different objects without constraining it to a single high level filter. Figure 9 shows two comparison results between the targeted backpropagation approach and the notbp-test approach. tbp notbp- notbp- notbp- singletest train conv-5 conv-5 Single Object Experiment: cross validation average 1.719 1.805 2.114 1.755 2.138 grasp position error (cm) Cluttered Experiment: number of failed cluttered 2 20 13 19 2 cases (24 total) Table 2. Comparing with and without targeted backpropagation.

Associating Grasping with Convolutional Neural Network Features 15 Fig. 8. Examples of grasping in a cluttered scenario. The red, green, and blue spheres represent the target palm, thumb tip, and index finger tip locations of the left robot hand. The target locations are the weighted mean of the colored dots that each represents a relative position from a feature to a grasp point in one training example. The top two rows are trained on grasping cuboid objects and the bottom two rows are trained on grasping cylinder objects. Notice that our approach is able to identify the only cuboid or cylinder in the scene and generate grasp points similar to the training examples.

16 Associating Grasping with Convolutional Neural Network Features Fig. 9. Comparing with and without targeted backpropagation in a cluttered scenario. The left column uses the targeted backpropagation approach and the right column uses the notbp-test approach. The top row uses features learned on a cuboid object and the bottom row uses features learned on a cylinder object. The red, green, and blue spheres represent the target palm, thumb tip, and index finger tip locations of the left robot hand. The target locations are the weighted mean of the colored dots that each represents a relative position from a feature to a grasp point in one training example. Notice that without targeted backpropagation the colored dots are scattered around since the highest response filter in conv-3 or conv-4 layer are no longer restricted to the same high level structure.

Associating Grasping with Convolutional Neural Network Features 17 5.2 Experiments on Robonaut-2 In this section, we describe how we evaluate the proposed pre-shaping algorithm based on the percentage of successful grasps on a set of novel objects on Robonaut-2 [3]. We describe the experiment setting, the hierarchical controller used for pre-shaping, and results in the following. Experimental Setting For each trial, an object in the novel object set is placed on a flat surface within the robot s reach. Given the object image and point cloud, the robot moves its wrist and fingers to the pre-shaping pose. After reaching the pre-shaping pose, the hand changes to a pre-defined closed posture and tries to pick up the object by moving the hand up vertically. We consider a grasp to be successful if the object did not drop after the robot tries to pick it up. We tested a total of 100 grasping trials on 10 novel objects with our proposed approach and a comparative baseline approach. The novel objects used in this experiment are shown in Figure 4. Point cloud approach Hierarchical CNN feature (Our approach) cylindrical objects tumbler wipe package basil container hemp protein duster average 40% 60% 0% 20% 40% 36% 100% 100% 100% 100% 100% 100% Point cloud approach Hierarchical CNN feature (Our approach) cuboid objects cracker box ritz box bevita box bag box energy bar box average 80% 20% 60% 60% 60% 52% 100% 80% 100% 100% 100% 96% Table 3. Grasp success rate on novel objects based on 5 trials per object. Hierarchical Controller A hierarchical controller that corresponds to hierarchical CNN features in different CNN layers is implemented to reach the preshaping pose. Given the object image and point cloud, our approach generates targets for the robot palm, index finger tip, and thumb tip. The palm target is determined based on CNN features in the conv-4 layer while the thumb tip and

18 Associating Grasping with Convolutional Neural Network Features index finger tip target is determined based on CNN features in the conv-3 layer. The pre-shaping is executed in two steps. First, the arm controller moves the arm such that the sum of distance from the palm, index finger tip, and thumb tip to their corresponding target is minimized. Once the arm controller converges, the hand controller moves the wrist and fingers to minimize the sum of distance from the index finger tip and thumb tip to their corresponding target. These controllers are based on the control basis framework [31] and can be written in the form φ σ τ, where φ is a potential function that describes the sum of distance to the targets, σ represents sensory resources allocated, and τ represents the motor resources allocated. Results We compare our algorithm to a baseline point cloud approach that moves the robot hand to a position where the object point cloud center is located at the center of the hand after the hand is fully closed. The experiment results are shown in Table 3. Among the 50 grasping trials only one grasp failed with our approach due to a failure in controlling the index finger to the target position. We demonstrate that our approach has a much higher probability of success in grasping novel objects compared to the point cloud based approach. Figure 10 shows Robonaut-2 grasping novel objects during testing. 6 Discussion Past studies have shown that CNN features are only effective for object recognition when most of the filter responses are used. In [32], it is shown that to achieve 90% object recognition accuracy on the PASCAL dataset most object classes require using at least 30 to 40 filters in the conv-5 layer. Research done in inverting CNN features [20] has also shown that most information is contained not in the top-5 activated filters but the rest of the filters with small probabilities. In this work, we demonstrate that a subset of hierarchical CNN features that has the same parent filter in the conv-5 layer are sufficient for pre-shaping the robot hand for grasping common household objects with simple shapes. Although using all filters may provide subtle information for object classification, we argue that when interacting with objects, the strength of CNN features lies in their hierarchical nature and a small set of filters are sufficient to support manipulation. However, our current work is limited to objects with simple shapes that can be represented by a single filter in the conv-5 layer. In future work, we plan to extend to more complicated objects by performing targeted backpropagation from a combination of higher layer filters that represent more complicated object types. 7 Conclusion In this work, we tackle the problem of pre-shaping a human-like robot hand for grasping based on visual input. We collect a grasping dataset with detailed hand

Associating Grasping with Convolutional Neural Network Features 19 Fig. 10. Robonaut-2 grasping 10 different novel objects. The first and third column show the pre-shaping steps while the second and fourth column show the corresponding grasp and pickup. Notice that the cuboid objects are grasped on the faces while the cylinder objects are grasped such that the object is wrapped in the hand.

20 Associating Grasping with Convolutional Neural Network Features configuration of Robonaut-2 and introduce the hierarchical CNN feature that captures the hierarchical relationship between filters in a pre-trained CNN. Our approach first identifies hierarchical CNN features that are active consistently among the same type of grasps in training, then an approach we call targeted backpropagation is used to locate each feature s 3D position. With these feature locations we generate distributions for the robot palm, thumb, and index finger positions based on hierarchical CNN features in different CNN layers. We evaluate our approach on the R2 Grasping Dataset we collected and show significant improvement over approaches without using hierarchical CNN features and targeted backpropagation in cluttered scenarios. We further tested our solution in a grasping experiment where 50 grasp trials on novel objects are performed on Robonaut-2. We compared to a point cloud based baseline approach and showed that our approach results in a much higher percentage of successful grasps. 8 ACKNOWLEDGMENT This material is based upon work supported under Grant NASA-GCT-NNX12AR16A and a NASA Space Technology Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Aeronautics and Space Administration. References 1. Goodale, M., Milner, D.: Sight unseen: An exploration of conscious and unconscious vision. OUP Oxford (2013) 2. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3) (2015) 211 252 3. Diftler, M.A., Mehling, J., Abdallah, M.E., Radford, N.A., Bridgwater, L.B., Sanders, A.M., Askew, R.S., Linn, D.M., Yamokoski, J.D., Permenter, F., et al.: Robonaut 2-the first humanoid robot in space. In: Robotics and Automation (ICRA), 2011 IEEE International Conference on, IEEE (2011) 2178 2183 4. Schneider, G.E.: Two visual systems. Science (1969) 5. Mishkin, M., Ungerleider, L.G., Macko, K.A.: Object vision and spatial vision: two cortical pathways. Trends in neurosciences 6 (1983) 414 417 6. Goodale, M.A., Milner, A.D.: Separate visual pathways for perception and action. Trends in neurosciences 15(1) (1992) 20 25 7. McIntosh, R.D., Schenk, T.: Two visual streams for perception and action: Current trends. Neuropsychologia 47(6) (2009) 1391 1396 8. Adamo, M., Ferber, S.: A picture says more than a thousand words: Behavioural and erp evidence for attentional enhancements due to action affordances. Neuropsychologia 47(6) (2009) 1600 1608 9. Saxena, A., Driemeyer, J., Ng, A.Y.: Robotic grasping of novel objects using vision. The International Journal of Robotics Research 27(2) (2008) 157 173 10. Platt, R., Grupen, R.A., Fagg, A.H.: Re-using schematic grasping policies. In: Humanoid Robots, 2005 5th IEEE-RAS International Conference on, IEEE (2005) 141 147

Associating Grasping with Convolutional Neural Network Features 21 11. Saxena, A., Wong, L.L., Ng, A.Y.: Learning grasp strategies with partial shape information. In: AAAI. Volume 3. (2008) 1491 1494 12. Herzog, A., Pastor, P., Kalakrishnan, M., Righetti, L., Bohg, J., Asfour, T., Schaal, S.: Learning of grasp selection based on shape-templates. Autonomous Robots 36(1-2) (2014) 51 65 13. Pas, A.t., Platt, R.: Using geometry to detect grasps in 3d point clouds. arxiv preprint arxiv:1501.03100 (2015) 14. Zhang, L.E., Ciocarlie, M., Hsiao, K.: Grasp evaluation with graspable feature matching. In: RSS Workshop on Mobile Manipulation: Learning to Manipulate. (2011) 15. Lenz, I., Lee, H., Saxena, A.: Deep learning for detecting robotic grasps. The International Journal of Robotics Research 34(4-5) (2015) 705 724 16. Levine, S., Finn, C., Darrell, T., Abbeel, P.: End-to-end training of deep visuomotor policies. arxiv preprint arxiv:1504.00702 (2015) 17. Finn, C., Tan, X.Y., Duan, Y., Darrell, T., Levine, S., Abbeel, P.: Deep spatial autoencoders for visuomotor learning. reconstruction 117(117) (2015) 240 18. Pinto, L., Gupta, A.: Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. arxiv preprint arxiv:1509.06825 (2015) 19. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Computer vision ECCV 2014. Springer (2014) 818 833 20. Dosovitskiy, A., Brox, T.: Inverting convolutional networks with convolutional networks. arxiv preprint arxiv:1506.02753 (2015) 21. Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., Lipson, H.: Understanding neural networks through deep visualization. In: Deep Learning Workshop, International Conference on Machine Learning (ICML). (2015) 22. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 447 456 23. Schwarz, M., Schulz, H., Behnke, S.: Rgb-d object recognition and pose estimation based on pre-trained convolutional neural network features. In: Robotics and Automation (ICRA), 2015 IEEE International Conference on, IEEE (2015) 1329 1335 24. Wilkinson, E., Takahashi, T.: Efficient aspect object models using pre-trained convolutional neural networks. In: Humanoid Robots (Humanoids), 2015 IEEE- RAS 15th International Conference on, IEEE (2015) 284 289 25. Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on 32(9) (2010) 1627 1645 26. Dinh, P., Hart, S.: NASA Robonaut 2 Simulator (2013) [Online; accessed 7-July- 2014]. 27. Sucan, I.A., Chitta, S.: Moveit! (2013) [Online]. 28. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. (2012) 1097 1105 29. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arxiv preprint arxiv:1408.5093 (2014) 30. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6) (1981) 381 395

22 Associating Grasping with Convolutional Neural Network Features 31. Huber, M.: A hybrid architecture for adaptive robot control. PhD thesis, University of Massachusetts Amherst (2000) 32. Agrawal, P., Girshick, R., Malik, J.: Analyzing the performance of multilayer neural networks for object recognition. In: Computer Vision ECCV 2014. Springer (2014) 329 344