
A Minimalist Approach to 3D Visually-Guided Grasping for Service Robots

Johannes Speth 1, Mario Prats 2, and Pedro J. Sanz 2

1 Technische Universität München, München, Germany, jospeth@gmail.com
2 Universitat Jaume I, Castellón, Spain, {mprats, sanzp}@uji.es

Abstract. Despite a large body of previous work on robotic manipulation, carried out from very different perspectives, no significant advance has yet been achieved towards a multipurpose system for autonomous manipulation in the Service Robotics context. In particular, since we are interested in a vision-guided system able to manipulate all kinds of real, unknown objects in unstructured environments, a twofold difficulty must be overcome. First, a dexterous robotic hand is necessary to provide manipulation skills roughly similar to those of humans. Second, a suitable 3D visual representation of the objects to be manipulated must be obtained. Until now these two problems, vision and manipulation, have often been treated in an insufficiently connected manner, in spite of a few contributions from the hand-eye coordination perspective. We therefore propose a more effective approach that couples vision and manipulation very tightly, following a biologically inspired strategy. Thus, we attempt neither a 3D reconstruction nor a recognition process, but build only the minimalist representation necessary to guarantee a suitable grip on previously unknown objects. As experimental validation of this approach, a set of different objects has been successfully reached and grasped with our robotic system (a 7 d.o.f. robot arm, a three-fingered hand, and a camera in hand). In summary, by increasing the degree of coupling between vision and manipulation, a new way of progress towards autonomous manipulation for service robots has been opened, demonstrating the reliability and feasibility of this system, at the moment over isolated, previously unknown objects.

Keywords: 3D visually-guided grasping; autonomous manipulation; service robots.

1 Introduction

First of all, a question arises: why, after all these years, does the 3D visually-guided grasping problem remain unsolved, while being of crucial importance for the robotics community? Perhaps a lot of work has been carried out in a direction that is not the best one. In the real world, characterized by dynamic changes and uncertainty, nobody needs perfect geometric models to succeed in daily manipulation tasks. Thus, from our point of view, all previous approaches that rely on predefined perfect models, or that try to build complete reconstructions of scenes, are far from a successful solution in the Service Robotics context, characterized by uncertainty and continuous dynamic change. In the following, related work and the motivation for this work are presented.

1.1 Related Work

Research in robotic grasping can be said to have started 25 years ago with the work of Asada [1]. Since then, hundreds of works related to grasping and manipulation have been presented at the most important robotics conferences and journals. However, after 25 years of effort, we are still not able to robustly grasp and manipulate arbitrary objects with robots in the way children easily do. Works on grasping can be classified into three groups: grasp planning, which tries to obtain a set of contacts over the object in order to grasp it robustly; grasp analysis, which evaluates possible grasps and tries to select the best one; and grasp execution, which is aimed at actually grasping the object. Recent surveys on robotic grasping can be found in [6,7]. Without doubt, robotic grasping is a difficult problem, in the sense that it needs much external information that is very difficult to obtain with current sensors, such as the object's model, the friction coefficient or the object's mass. For that reason, a simplified version of the problem was addressed first, namely 2D grasping. This approach is concerned with the stable grasping of an object using only a top view of it. It is the approach adopted by almost all works that make use of vision for grasping because, in this case, the vision processing simply consists in extracting the object's contour and possible grasp regions. Nice results have been obtained along this line [2,3,4], but unfortunately they can only cope with planar objects seen from a restricted set of views. The 3D grasping approach deals with grasping any kind of object, not necessarily one with planar faces. Works along this line are usually classified into two groups, depending on whether or not they assume an already known 3D model of the object. Most works on 3D grasping belong to the first group: they take a known 3D model of the object and try to find a set of contacts that guarantee grasp stability. One example of this approach is [5], where the object's surface is sampled into a set of points that are used for computing 4-fingered force-closure grasps in 3D. In [8], the authors implement a randomized grasp generation algorithm for quickly finding force-closure grasps in 3D, taking the object's 3D model as input. For more examples, see [9] and [10]. The main drawback of these approaches is that they usually require a lot of computation time and are not fast enough for the applications we are interested in. Although some recent works achieve good processing times [9], the object's 3D model must still be known. This is a strong assumption that can be difficult to satisfy in real-life scenarios, where service robots must perform their tasks among hundreds of different unknown objects. To overcome this problem, another research line is concerned with grasping 3D objects whose model is unknown. This is more interesting from the point of view of service robotics, where the number of objects that the robot can encounter is so large that it is impossible to store a 3D model for each of them. However, some 3D geometrical information about the object is still needed in order to feed the grasp planning algorithms. The general approach is therefore to make a full 3D reconstruction of the object, in order to recover its 3D model, and then use this model as input for a grasp planning algorithm. There are several ways of retrieving the 3D geometry of an object, which have led to a research line of its own, namely 3D reconstruction. Perhaps the easiest solution is to use a laser range sensor, which can provide accurate 3D information about the environment. Some works, such as [11], use this kind of sensor for computing the 3D structure of a given object with the intention of grasping it. Other works use vision as the sensor for retrieving the 3D structure of the world.
Depending on the image processing technique, several methodologies exist, such as shape from stereo, shape from silhouette, shape from shading, shape from motion, shape from shadow, etc. Some of them have been applied to robotic grasping and others have not. In [12], for example, the authors use stereo vision to obtain a cloud of 3D points that are later grouped into polygonal meshes and compared with stored object models whose grasp points are already known. Perhaps the most active line is shape from silhouette, because of its robustness. Other techniques rely on shading, shadows, textures, etc.; shading depends on the illumination, and textures might not be present, which makes these techniques fail in certain scenarios under certain conditions. However, any physical object under any condition has a boundary that generates a silhouette in the image, and this silhouette provides much information about the object's geometry [13]. With the aim of computing the object's geometry from this simple and robust feature, several authors in the vision community have opted for the volume intersection approach [14], which consists in back-projecting the silhouette from several viewpoints and intersecting all the resulting volumes (visual hulls). Although this is a very simple approach, some important problems remain, such as planning the minimum set of views required for capturing an object's model [15]. In addition, the intersection of 3D volumes is a time-consuming operation, and when the number of viewpoints is high this method becomes slow. Some authors have addressed this problem and published efficient algorithms [16,17]. Another approach to silhouette-based 3D object reconstruction consists in observing how the silhouette changes under small viewpoint displacements. Good works along this line are [18,19], where the authors make use of differential geometry to simultaneously estimate the object's geometry and the camera motion. As far as we know, there are no works that integrate 3D reconstruction and grasping into a whole. The general approach is to use 3D reconstruction algorithms for computing the object's geometry, and grasp planning algorithms for computing contacts [11]. But little effort has been devoted to answering the question of what kind of 3D information grasp algorithms actually need in order to compute a stable grasp. In our opinion, a complete 3D model is neither necessary nor suitable for 3D grasping, and recent results from neuroscience support this thesis [20,21]. In order to have service robots working in unstructured and dynamic scenarios, we need simple and fast visual processing, as in the 2D case, together with effective grasping capabilities, as in the 3D case. We therefore follow a minimalist approach that consists in retrieving 3D structure from simple 2D image processing, which results in a fast and highly integrated reconstruction and grasping algorithm. To the best of our knowledge, only one work [22] has followed an approach similar to ours. In [22], the authors proposed an active vision approach for visually-guided grasping of unknown, smooth objects with a parallel-jaw gripper. Instead of reconstructing the object, they searched for an antipodal grasp in 3D, which is a basic condition of stability in two-fingered grasps. Thus, by means of simple 2D image processing, they were able to guide the gripper in 3D towards the object. However, they only tested the algorithm in a specific scenario (a pile of potatoes). As we are interested in service robotics applications, our algorithm must deal with any kind of object, and we cannot store a 3D model for each one. Our aim is to follow a minimalist approach and obtain only the strictly necessary 3D information, enough to ensure a stable 3D grasp in a reasonable amount of time (less than 35 seconds if the object is inside the workspace of our robot arm).

1.2 Motivation

After more than ten years working on different aspects of robotic manipulation at the Robotic Intelligence Laboratory of Universitat Jaume-I, we have recently started a new project entitled "A Complete Autonomous Manipulation System based on Data Fusion and Learning Techniques for Service Robots". Within it we hope to offer novel contributions in very different aspects, such as the new architecture that will be tested; the data fusion techniques that will be included in order to guide the grasping operation by means of force, tactile and vision information; the human-robot interface that will allow the use of voice commands, as well as a very simple dialogue by means of voice synthesis; a collision avoidance module that will guarantee the safety of people around the robot; and, finally, new 3D grasp determination, evaluation (including active learning) and execution techniques, using a multifingered hand, that will be validated in 3D real-life scenarios.
In this paper we present the first results obtained within this project, representing a first step towards a robust solution to the 3D visually-guided grasping problem. It is worth noting that, independently of the kind of robotic problem to be solved, it has always been important for us to look at nature in order to find a good biological source of inspiration. In particular, some recent contributions from neuroscience have been of great importance for this work [20,21]; this kind of investigation underlies the minimalist thesis that is validated here. From [20] it is confirmed that, when observers explore objects, they concentrate on plan views, such as the front and side views, because these views are unstable and can be thought of as singularities in the viewing space of an object. In other words, these are the views where there is the greatest amount of change in the visibility of object features as the object is rotated by a small amount, and inspection strategies that concentrate on such views might facilitate the encoding of the object's 3D structure. This result encourages us to extract a 3D representation of an object using only a pair of views with our camera-in-hand system. Another important result for us is derived from [21], where it is confirmed that our hand posture anticipates the purpose of our action and the function of a target object, but that even then semantic information about the object must be integrated with accurate metrical information about its absolute size, location and orientation with respect to our hand. This is related to the strategy we implemented to grasp a previously unknown object, combining the early 3D representation with the capabilities of our three-fingered hand.

The organization of this paper is the following: the next section deals with the methodological aspects, including the procedure to extract the 3D representation of an object and the subsequent prehension strategy. Section 3 presents the experimental validation results. Finally, Section 4 concludes the paper.

2 Methodological Aspects

The basic idea is to take advantage of our expertise in the 2D visually-guided grasping problem and, by means of an original sequence of operations, obtain the 3D parameters necessary to successfully execute the corresponding 3D visually-guided grasping operation. To clarify the ideas underlying this approach, we start with a very simple example, a box to be grasped, showing in Figs. 1 to 3 the sequence of actions that produces the images from which the 3D representation, later used as input for the grasping algorithms, is obtained. The goal is to first analyze the object (in this case a box) by taking images of it from predefined directions, and then to move to a computed grasping position with a computed hand pre-shape. In the last step the grasp is executed and the box is lifted up.

Figure 1. Steps from top, where two images are taken

Figure 2. Moving to front position

Figure 3. Reconstruction from front, where two images are taken again

Furthermore, in order to give a global view of the structure of the implemented program, two flowchart diagrams showing the main implemented modules are presented in Figs. 4 and 5. These flowcharts give a schematic description of the sequence of executed functions, as well as an overview of the most important classes and the relationships among them. Fig. 4 shows all the steps prior to the grasp strategy, i.e. the image processing and basic robot actions required to achieve a successful 3D object representation that serves as input for the grasp procedure. The 3D object model is then taken and the contour of each view is analyzed for feasible grasp regions in the grasp analysis step. Such a feasible grasp region, together with the camera pose of the view and the extracted 3D information about the object, is then used to generate a desired grasping mode, tool pose and hand pre-shape (i.e. grasp synthesis). Then the approach and the grasp are performed (see Fig. 5). In the following, the steps of the algorithm are listed in note form in order to give an overview of what happens, leaving deeper explanations for afterwards:

Step-1. In order to overcome the parallax problem, a visual servoing loop is used to follow the centroid of the object surface. A position directly above the object is thus reached (see Fig. 1), so that only the contour as projected straight down onto the workspace is seen (i.e. Initialization in Fig. 4).

Step-2. The object contour is analyzed by computing some well-known geometric features (i.e. Contour Analysis in Fig. 4): axes of inertia; bounding box along these axes; length and width of the bounding box in pixels; intersection points of the axes of inertia with the bounding box; curvature of the contour (i.e. κ-angular bending); symmetry degree; and circularity.
Step-3. This information, together with the pose of the camera in base coordinates, is saved for every viewpoint. This internal representation of the object is also referred to as an (object) view (i.e. Save View in Fig. 4).
Step-4. The camera is moved along its optical axis (see Fig. 1).
Step-5. The second view is created: steps 2 and 3 are repeated from the new position.
Step-6. The first 3D analysis is performed: the intersection points of the axes of inertia with the bounding box are reconstructed. From these, the length, width and height of the bounding box as seen from above are calculated, and from these values the center of mass of the bounding box is computed.
Step-7. The front pose is calculated from the direction of the minimum inertia axis, I_min, the size and the centroid of the object (i.e. Calculate new Pose in Fig. 4).
Step-8. The camera is moved to the front position (see Fig. 2).
Step-9. The object is analyzed again and two views are saved: steps 2-6 are repeated (see Fig. 3).
Step-10. The 3D analysis is performed again and the final 3D object model is created. From the front view the height and width of the object model can be refined. In general, every new viewpoint provides a lot of extra information about the object; in this case only the width and height can be refined, since the chosen object model (a bounding box) is so simple. Yet, as a starting point, this simple model works for many different objects, as can be seen in Section 3. For more advanced purposes, more complex models can be considered, such as the intersection of the re-projected contours of every view, but these are beyond the scope of this work.
Step-11. The resulting 3D object model is the input to the grasp analysis procedure.
Step-12. The contours of the top and front views are analyzed for feasible grasp regions (i.e. Grasp Analysis in Fig. 5).
Step-13. Feasible grasps are proposed to the user, who has to choose one.
Step-14. For the chosen grasp, the hand pose and pre-shape are calculated (i.e. Grasp Synthesis in Fig. 5).
Step-15. The approach is performed: the palm of the hand is moved to one side of the bounding box, above or in front of the chosen grasping position (i.e. the force focus point).
Step-16. The grasp is executed: the fingers are closed until all of them make contact with the object, and a predefined force is exerted by the fingers.
Step-17. The object is lifted up.

Let us now look at these main steps in more detail, in two separate subsections: one for the preliminary aspects summarized in Fig. 4, and the other for those shown in Fig. 5, concerning the proposed grasp strategy.

2.1 Extracting the 3D Object Representation

The image processing module in Fig. 4 starts with a grey-level image and ends with the corresponding object contour. As mentioned above, we make use of previous successful research to solve this kind of processing [23,24]. Our contour extraction algorithm extracts the longest contour of the binarized image, and these contour points are the basis of all further calculations. See an example in Fig. 6 (top).
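To make Steps 2-3 concrete, the following sketch shows how the longest contour and the moment-based features used below (centroid, orientation of the inertia axes, circularity) could be computed with OpenCV. It is only an illustration under our own assumptions (fixed binarization threshold, dark object on a light background), not the authors' implementation from [23,24].

```python
import cv2
import numpy as np

def analyze_top_view(grey_image):
    """Extract the longest contour and basic moment features (illustrative sketch)."""
    # Placeholder binarization: dark object on a light background is assumed.
    _, binary = cv2.threshold(grey_image, 100, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=len)                    # longest contour, as in the paper

    m = cv2.moments(contour)                            # moments up to order two suffice
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]   # centroid of the contour
    # Orientation of the principal inertia axes (I_min / I_max directions).
    theta = 0.5 * np.arctan2(2.0 * m["mu11"], m["mu20"] - m["mu02"])

    # Circularity C = A / (pi * maxdist^2): close to 1 only for circular shapes.
    pts = contour.reshape(-1, 2).astype(np.float64)
    maxdist = np.max(np.linalg.norm(pts - np.array([cx, cy]), axis=1))
    circularity = cv2.contourArea(contour) / (np.pi * maxdist ** 2)

    return contour, (cx, cy), theta, circularity
```

The circularity value computed here is the feature used in equation (2.3) below.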

Figure 4. The flowchart concerning the geometric analysis process is displayed.

Figure 5. The flowchart concerning the grasp planning procedure is shown.

Figure 6. An example of the Image Processing module (top) and Geometric Features extraction (bottom) acting on a toy lizard.

Regarding the initialization module (Step-1) in Fig. 4, a visual servoing loop has been used to overcome the parallax problem, see Fig. 7. As can be appreciated in this figure, the contours of the object as seen from these two points have hardly anything in common. Because the goal of this work is to calculate some 3D features of the object and eventually find grasping regions on its contour, it is crucial to solve this problem, so that only contours that represent the object in a manner suitable for grasping it are produced. Hence, our way to prevent the parallax problem is to move straight over the object, so that the sides of the object do not falsify the contour. This means the optical axis of the camera must lie inside the cone projected by the contour of the object; in other words, the center of the image (the principal point) must lie inside the projected contour of the object. The centroid of the contour is the optimal point with this property. Thus, a 2D visual servoing loop with the centroid as visual primitive can be used to move the camera straight over the centroid. This is done by a loop that calculates the distance between the image center (p_x, p_y) and the centroid (c_x, c_y) of the contour and multiplies it by a factor k, obtained empirically. The result is the velocity screw used to move the arm. For this movement only the components v_x and v_y of the velocity screw are relevant, since we only move parallel to the plane of the work area along the x- and y-axes of the tool coordinate system:

v_x = k (c_x - p_x)   (2.1)
v_y = k (c_y - p_y)   (2.2)

The loop stops when the centroid coincides with the center of the image, where the velocity screw is also zero. Since the image processing may be slower than required, we must guarantee that the arm does not move too fast, by keeping the value of k consistent with the maximum arm velocity limit; otherwise uncontrollable movements of the arm can result.

Figure 7. A bottle is shown to exemplify the parallax problem.
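A rough sketch of this centroid-following loop (equations (2.1) and (2.2)) is given below; grab_contour_centroid and send_velocity_screw are hypothetical hooks standing in for the real image processing and arm interfaces, and the gain and limits are placeholder values.

```python
import numpy as np

def servo_over_centroid(grab_contour_centroid, send_velocity_screw,
                        image_center, k=0.002, v_max=0.05, tol_px=2.0):
    """2D visual servoing on the contour centroid (sketch of eqs. 2.1-2.2).

    grab_contour_centroid() -> (c_x, c_y) in pixels    [hypothetical camera hook]
    send_velocity_screw(v)  -> commands (v_x, v_y, 0, 0, 0, 0) in the tool frame
    """
    p_x, p_y = image_center
    while True:
        c_x, c_y = grab_contour_centroid()
        err = np.array([c_x - p_x, c_y - p_y])
        if np.linalg.norm(err) < tol_px:            # centroid has reached the image center
            send_velocity_screw(np.zeros(6))
            return
        v_xy = k * err                              # eqs. (2.1) and (2.2)
        speed = np.linalg.norm(v_xy)
        if speed > v_max:                           # keep k * error below the arm velocity limit
            v_xy *= v_max / speed
        send_velocity_screw(np.array([v_xy[0], v_xy[1], 0.0, 0.0, 0.0, 0.0]))
```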

The next step (Step-2) concerns the extraction of geometric features from the top view. Once the position straight above the object is reached, moment-based features are computed from the object contour and used in the further course of the algorithm. Only moments up to order two are needed and, from them, the centroid and the orientation of the main axis. See an example in Fig. 6 (bottom). Low-level details about how these features are calculated can be found elsewhere [24,25]; here we only explain their use in this algorithm. For most features a distinction must also be made between the top and the front view. In this section only their use for the top view is explained; their use for the front view is described later in this section.

The axes of inertia are a very important feature when looking at the object from the top, because they are very likely to also be axes of mirror symmetry, if the object has any (see details elsewhere [26]). Since humans also tend to analyze and grasp objects along the directions of their main axes, as described in [20], it seems a promising idea to follow on the robotics side, and some previous works in the grasp planning domain confirm this thesis [26,27,28]. Thus, in our case, the axes of inertia of the top contour point in the directions from which the object should be analyzed next.

The bounding box serves as an approximation of the dimensions of the object. Since we do not want a complete reconstruction of the object, a very simplified model had to be assumed. In this case a 3D bounding box around the object was chosen, whose length, width and height are calculated and serve as an indicator of whether the object is graspable and of how the object is best grasped with the used gripper (the Barrett hand). For example, if the object is not higher than the width of the hand's wrist, it is not possible to grasp it from the side. The dimensions of this bounding box are also used to calculate the position of the palm for grasping the object. For the top view, the bounding box is spanned around the object along the directions of I_min and I_max.

Feature points are needed in order to perform some 3D reconstruction and thus to know the distance to the object as well as its real size. These feature points must be invariant to the scaling of the contour while moving along the optical axis, so that corresponding points can easily be found for the reconstruction. For the top view, the intersections of I_min and I_max with the bounding box can be used to calculate the real length and width of the bounding box by reconstruction. The distance of these reconstructed points to the camera can also be used to calculate an approximation of the real height of the object, if the level of the workspace is known. This can be seen in Fig. 1.

The curvature of the contour is calculated and will be used to compute the grasping regions, as explained later in Section 2.2.

The normalized global symmetric deficiency is used to obtain information about the degree of symmetry of the object along the directions of the axes of inertia (see details elsewhere [26]). Since the movement to the front position depends on the direction of I_min, the symmetry degree of the object with respect to this axis, as seen from above, is a good indicator of the usefulness of the front view.
If the symmetry degree from above is high enough, the contour as seen from the front position can be assumed to represent the object shape in a suitable manner. If the degree of symmetry is very low, the left and right parts of the front contour will not actually face each other, and are therefore inappropriate regions for computing a suitable grasp. This means that objects with poor symmetry from above should not be grasped from the front.

The circularity is used to obtain more predictable directions for the axes of inertia of circular contours (e.g. projections of spherical and cylindrical objects), see Fig. 7 (left). Their axes of inertia have undefined directions and can change drastically under the mere influence of noise, and since the algorithm relies on the invariance that normally holds for the axes of inertia, such unstable calculations can be a source of drastic failure of the system. That is why it is important to detect these kinds of objects. The formula for this feature, namely the circularity of the contour, is

C = A / (π · maxdist²)   (2.3)

where A is the area of the region and maxdist is the distance of the farthest contour pixel to the centroid. Only circular objects get values close to 1 (C = 1 for a perfect circle). Thus, with a threshold of 0.85, set empirically, the system was able to recognize circular shapes very reliably. When such an object is found, the orientation of I_min is simply set to a predefined value (e.g. 90 degrees), which results in unvarying axes of inertia between different images also for a circular object.

Let us now consider the displacement along the optical axis (Step-4). In the work by Rüttinger [29] it was pointed out that the system can only reliably grasp objects from the front if they are rotationally symmetric, or at least do not change their projected shape much when looked at from different viewpoints. This is because a cube, for example, much like a bottle, projects an outer contour onto the image plane that cannot be recognized as a cube unless the camera sees exactly one side of it. In fact, this problem arises on two different occasions during the execution of our algorithm. The first is in the starting position, where it is possible to get a falsified view of the object; this is taken care of by following the centroid, as described in Step-1. The second occasion arises when moving the camera to the second position needed for triangulation. For triangulation with disparity, the two camera positions have to be close to each other, and the camera might then see a different shape of the object from each of them, due to the parallax problem. Since an eye-in-hand system is used in this work, a way around this problem is to move along the optical axis (see Fig. 8): the contour in the image then only changes in size, not in shape, and it becomes much more robust to reconstruct corresponding points for non-symmetric objects. Detailed information about 3D reconstruction can be found elsewhere [30].

Figure 8. Geometric interpretation of re-projected rays (left). Skew rays in 3D space (right).

In our case, the ray for an image point p (in homogeneous pixel coordinates) of a camera with pose C and calibration matrix C_cal can be described in base coordinates by

ray(m) = t + m · R · C_cal^-1 · p   (2.4)

with C = (R t) being the 3×4 matrix containing the relative pose (rotation R and translation t) of the center of projection with respect to the reference coordinate system. The factor m can be varied to scale the length of the ray. One problem here is that, in general, the two rays do not intersect in 3D space, due to the discreteness of the pixel coordinates. In fact, the rays will be skew, as can be seen in Fig. 8 (right), leaving us with the problem of intersecting skew lines in 3D space. The point of intersection of two skew lines of sight is defined as the point with the shortest distance to both lines [31]. This point is the middle of the segment n connecting the two points P_1 and P_2 in Fig. 8 (right), which are the closest points of each line with respect to the other. By definition, n is normal to both lines of sight (i.e. ray_1 and ray_2). Since the dot product of a vector with its normal is always zero, this restriction can be used to formulate the following equations:

(ray_2(m_2) − ray_1(m_1)) · d_1 = 0   (2.5)
(ray_2(m_2) − ray_1(m_1)) · d_2 = 0   (2.6)

where d_1 and d_2 are the direction vectors of ray_1 and ray_2. We thus have two linear equations and two unknowns, namely m_1 and m_2 of rays ray_1 and ray_2 in equation (2.4). If the lines of sight are not parallel, this system of equations always has a unique solution, giving the points P_1 and P_2, the closest points of the two skew lines towards each other. The virtual reconstructed point P is then obtained as the midpoint between them:

P = (P_1 + P_2) / 2   (2.7)

In the following, Steps 7 and 8 (moving to the front position) are clarified. Another challenge encountered during the development of the algorithm was to find a position, apart from the top position, that could be used to extract more information about the object. Another view from a side of the object obviously provides a lot of extra information, which can be of great help when analyzing the object for possible grasps. First of all, the view from the side can refine the height of the object: as can be appreciated in Fig. 9, the height of an object cannot always be calculated from the contour of the top view, and the front view can add vital information for calculating the real height. Furthermore, the contour as seen from the front view can also be analyzed for grasping regions, which is the requirement for making grasps from the front possible as well. As mentioned before, the direction of I_min in the top view is used to move to the front position, because of the high probability that it is also the axis of mirror symmetry of the object. Only in this case does the front view show a contour of the object that can be used for grasp analysis; otherwise the perspective distortion results in incomprehensible contours that cannot be used with this simple object model representation. At the beginning of this work, we tried to reach the position shown in Fig. 9, since this would again result in a position facing exactly one side of the object, with the advantage, described above, of seeing only the desired side. Unfortunately, several restrictions made the movement of the hand/camera to a perfectly horizontal front position, as displayed in Fig. 9, impossible in most cases: for small objects the hand would hit the work surface in order to get the camera into the right position; tall objects do not fit into the field of view; and the joints of the arm cannot perform such an extreme movement. Because of these problems, the decision was made to move to a top-front position, as shown in Fig. 2. One of the problems occurring here was to keep the object in the field of view, if possible even centered.
This is necessary in order to guarantee an equal expansion of the object towards all sides of the image during the movement along the optical axis, preventing parts of the object from moving outside the image before others. In this way the algorithm is also optimized for large objects.
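Returning to the reconstruction of feature points (Step-6 and equations (2.5)-(2.7)), the midpoint of the shortest segment between two skew viewing rays can be computed as in the following sketch; this is our own illustration of the method, not the authors' code.

```python
import numpy as np

def midpoint_triangulation(o1, d1, o2, d2):
    """Pseudo-intersection of two skew rays (sketch of eqs. 2.5-2.7).

    Each ray is ray_i(m) = o_i + m * d_i, with origin o_i (camera center in
    base coordinates) and direction d_i (e.g. R @ inv(C_cal) @ p for pixel p,
    as in eq. 2.4). Returns the midpoint P of the shortest connecting segment.
    """
    o1, d1, o2, d2 = map(np.asarray, (o1, d1, o2, d2))
    w = o2 - o1
    # Enforce (ray_2(m2) - ray_1(m1)) . d1 = 0 and (ray_2(m2) - ray_1(m1)) . d2 = 0:
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    b = np.array([w @ d1, w @ d2])
    m1, m2 = np.linalg.solve(A, b)        # unique solution if the rays are not parallel
    P1 = o1 + m1 * d1
    P2 = o2 + m2 * d2
    return 0.5 * (P1 + P2)                # eq. (2.7)
```

Each feature point of a view pair can be reconstructed by one call of this function, with the two ray directions obtained from equation (2.4).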

Figure 9. Example of height refinement through the front view.

In order to keep the object centered in the image, the angle θ by which the camera has to be pitched must be calculated from the movement of the arm and the height of the object, according to equation (2.8). Fig. 2 gives a geometric interpretation, with dx being the distance traveled by the camera along the direction of I_min and dz the difference between the z-components of the top-front camera pose and the calculated center of mass of the bounding box. As shown before, this center of mass can lie too low if the height of the object cannot be seen correctly from above; in that case the object might be out of the center of the image.

The geometric features from the front (Step-9) are explained now. As mentioned before, some of the geometric features must be interpreted in a different way for the front view. The differences are the following. The axes of inertia may be completely useless for the front view: since parts of the top surface are likely to be projected into the front view as well, it cannot be analyzed like the top view. Instead, some knowledge about the setup must be taken into account: since the objects are supposed to be standing upright, the directions of the x- and y-axes of the image coordinate system can be assumed to be the main directions of the object. For the bounding box from the front, this means that it is spanned around the object contour along the x- and y-axes of the image coordinate system. The feature points are chosen as the most extreme points of the front contour with respect to their x- and y-components. The reconstructed point with the smallest y-value is then the real highest point of the object, whereas the point with the highest y-value corresponds to its lowest point; this can be used to calculate the real height of the object.

2.2 The Grasp Determination and Execution Procedure

Now that the contours are extracted, the curvature of the contours is known, and the dimensions and geometric features of the object are calculated, there is only one thing left to do: find a suitable grasp and pre-shape (Steps 11-14) and execute it (Steps 15-17). It sounds very easy, yet it constitutes another very complex part of the whole procedure. The procedure used with the box will serve as an example for explaining the following steps. First the contours of the views have to be analyzed and a grasp has to be chosen; then the arm has to be moved to the object; in the last step the fingers have to be closed until they make contact with the object. These steps are now described in more detail.

From Contours to Feasible Grasp Regions. Finding feasible grasp regions is subdivided into several steps. What we have are the contours of two views from above and the contours of two views from the top-front position. These four images can be seen in Fig. 10. Since the contours of the near and far views only differ in size, the contour of the near position is chosen for analysis, because its resolution is higher. The corresponding curvature functions can be seen in Fig. 11. Again, the quality of the contour description is rather poor, due to the poor image quality. The following paragraphs describe the steps needed to analyze the curvature of an object for possible grasps; they are partly based on the work by Morales [32].

Grasp Regions. The first step is to find parts of small curvature by applying a curvature threshold to the curvature function: consecutive contour points with a curvature lower than this threshold are grouped together and represented by a straight line. These line segments are from now on referred to as grasp regions. With the curvature threshold alone, circular contours with large diameters would never be split into smaller grasp regions, as they would never reach the threshold value. That is why a second threshold is necessary, the accumulation threshold, which puts an upper bound on the sum of curvature over all the points added to a grasp region; once this accumulated threshold is reached, a new grasp region is started. With these two thresholds, complex contours can be represented by grasp regions, which greatly simplifies the following steps. These grasp regions already fulfill the curvature condition of the stability criteria [32]. All grasp regions found for the box object can be seen in Fig. 12.

Figure 10. Contours from top views (top); contours from top-front views (bottom).

Figure 11. Curvature (κ-bending).
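A minimal sketch of this two-threshold segmentation of the curvature function into grasp regions follows; the threshold values and the exact bookkeeping are placeholders for illustration, not the ones used in the real system.

```python
import numpy as np

def extract_grasp_regions(curvature, curv_thresh=0.05, accum_thresh=1.0):
    """Group consecutive low-curvature contour points into grasp regions.

    curvature[i] is the (absolute) curvature at contour point i. A region is
    closed when a high-curvature point is met (curvature threshold) or when
    the curvature accumulated inside the region exceeds accum_thresh.
    Returns a list of (start_index, end_index) pairs, end index exclusive.
    """
    regions, start, accum = [], None, 0.0
    for i, k in enumerate(np.abs(curvature)):
        if k < curv_thresh and (start is None or accum + k <= accum_thresh):
            if start is None:                          # open a new region
                start, accum = i, 0.0
            accum += k
        else:
            if start is not None and i - start > 1:
                regions.append((start, i))             # close the current region
            # restart immediately if the point itself is still low-curvature
            start, accum = (i, k) if k < curv_thresh else (None, 0.0)
    if start is not None and len(curvature) - start > 1:
        regions.append((start, len(curvature)))
    return regions
```

In the real system each region would additionally be represented by a straight line, whose direction and inward normal are what the compatibility tests below operate on.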

Figure 12. The set of grasp regions found for the box.

Compatible Regions. In the next step, compatible regions have to be found. These are all the regions that fulfill the force-closure criteria [6,7] with respect to a given reference grasp region [32]. Since these criteria are different for two- and three-finger grasps, a distinction must be made between them.

The two-finger case. Compatible regions for two fingers have to face each other: the angle between their normal vectors must be in the range 180.0º ± 2θ, with θ being the opening angle of the friction cone, as described in [32] (see Fig. 13).

Figure 13. Geometric interpretation of the Coulomb friction model.

The three-finger case. As expected, a more complex approach is necessary in this case. Since the fingers can be spread and can thereby change the direction of the exerted forces, the threshold on the maximum allowed angle between the surface normal vectors of compatible regions must be less restrictive. In this work, three grasp regions are assumed to be compatible regions if their normal vectors positively span a plane. The geometric interpretation can be seen in Fig. 14, and an example of compatible 3-finger regions is shown in Fig. 15.

Feasible Regions. Once all compatible regions for the two- and three-finger cases are computed, the last step is to reduce and refine them to feasible regions. Again, the two- and three-finger cases must be differentiated.

The two-finger case. In order to guarantee that the compatible regions can actually be grasped, they have to be projected onto each other. Each region is projected along its normal vector onto the other region; the resulting sub-regions are then called feasible regions with respect to the one that has been projected. Only if both regions can be projected onto each other are the resulting sub-regions chosen as a feasible (i.e. possible) grasp. The force focus point in the two-finger case is simply the midpoint of the line connecting the midpoints of the feasible regions, see Fig. 17.
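Both compatibility tests can be illustrated on 2D contour normals as follows, assuming each grasp region is summarized by its outward unit normal; the friction-cone half-angle used here is a placeholder, not the value used by the authors.

```python
import numpy as np

def two_finger_compatible(n1, n2, friction_half_angle_deg=8.0):
    """Two-finger test: the normals must face each other within 180 deg +/- 2*theta."""
    n1 = np.asarray(n1, float) / np.linalg.norm(n1)
    n2 = np.asarray(n2, float) / np.linalg.norm(n2)
    angle = np.degrees(np.arccos(np.clip(n1 @ n2, -1.0, 1.0)))
    return abs(angle - 180.0) <= 2.0 * friction_half_angle_deg

def three_finger_compatible(n1, n2, n3):
    """Three-finger test: the normals must positively span the plane, i.e. no
    half-plane contains all of them (every angular gap is below 180 deg)."""
    angles = np.sort([np.arctan2(n[1], n[0]) for n in (n1, n2, n3)])
    gaps = np.diff(np.append(angles, angles[0] + 2.0 * np.pi))
    return bool(np.all(gaps < np.pi))

# Two opposite box faces are two-finger compatible; three faces of a hexagonal
# box whose normals are 120 degrees apart are three-finger compatible.
print(two_finger_compatible([1, 0], [-1, 0]))                          # True
print(three_finger_compatible([1, 0], [-0.5, 0.866], [-0.5, -0.866]))  # True
```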

Figure 14. Adapted from [32]. (a) Vectors positively span a plane; (b) vectors do not.

Figure 15. Example of compatible 3-finger regions.

Figure 16. Projection of compatible regions results in feasible 2-finger regions.

The three-finger case. In order to check whether three compatible regions are also feasible regions, their friction cones must be intersected and the centroid of the resulting area must be projected onto each involved region. Only if this projection intersects all these regions are they feasible regions with respect to each other. The computation of this intersection area is a complex problem, known as the vertex enumeration problem [32], whose implementation was out of the scope of this work. In order to obtain three-finger results anyway, a simplified calculation was developed and implemented, which still proved quite useful. A good example of this simplified approach can be seen for the hexagonal object in Fig. 17. Instead of intersecting the friction cones and taking the centroid of the intersection area, the following is done:

o Intersect the inward normals of the two compatible regions, originating at the center of each region.
o Check whether the resulting point can be projected onto the reference region (thumb region).
o Check whether the absolute difference |θ_1 − θ_2| is lower than a given threshold, in order to account for the symmetric spread of the hand.
o If both checks are positive, the intersection point is taken as the force focus point.

By doing this, only a small subset of the possible three-fingered grasps is calculated, but their quality is very high due to the strict conditions.

Figure 17. Intersection of the normal vectors results in the force focus.

Choosing the Best Grasp. It is not the goal of this work to provide rules and conditions to evaluate and compare feasible grasps. Still, once the feasible grasps are computed, all of them can be executed. In order to decide which one is to be executed, the operator has to choose the desired grasp from the possible ones. At first, all two-finger grasps from the top are shown to the operator; if none is selected, the three-finger grasps are shown; if again none is selected, the system switches to the front power grasps. Once the operator selects a grasp, it is executed by the system. If no grasp seems good and the operator does not choose one, the system goes back to the starting position and the program quits.

Grasping the Object (Steps 15-17). The 3D information about the object, together with the feasible regions and a force focus point, can now be used to calculate the movement of the hand so that the object can be grasped. This means the center of the palm, which is also the origin of the hand coordinate system, must be moved to the force focus point. Additionally, the hand must be turned so that the thumb is aligned with the reference grasp region, i.e. the hand is rolled by a certain angle. In the three-finger case, the two other fingers must also be spread by the angle between the normal vectors of the feasible regions and the reference region (see Fig. 17). Fig. 18 shows an L-shaped object, which is used to describe the procedure of moving to the chosen grasp position from the top and from the front and then executing the grasp.

Approaching from Top. Fig. 19 shows the important characteristics of the approaching step when the object is lying on the work area.

Figure 18. L-shaped object.

The Displacement Component. The placement of the hand above the force focus point can be calculated from the vector δ between the center of the image and the force focus point, together with the camera calibration matrix C_int and the known distance to the surface of the object, using equation (2.8). This equation describes the relative movement of the hand corresponding to the vector δ: the force focus point (the marked central point in Fig. 17) is multiplied by the inverse calibration matrix C_int^-1, thus re-projecting it into 3D space, and the re-projected vector is then multiplied by the known distance, dist, from the camera to the surface of the object model, in this case the 3D bounding box. This can be interpreted as the projection of the image point onto the real surface of the object model, and results in the grasp point relative to the camera position.

Figure 19. Lying L-shaped object with the features necessary for approaching.

The rotational component of the grasp pose, on the other hand, can simply be described mathematically by the well-known roll matrix Roll(θ), as in equation (2.9).

The Grasp Pose. Since the calculations are made in the camera frame and not in the arm base frame, the relative movements calculated for a view cannot be used directly to move the arm; they must first be transformed into a pose in base coordinates. Therefore, a relative grasp pose in homogeneous coordinates, P_rel, is created. This matrix can then be transformed into base coordinates by multiplying it by the known camera pose. As said before, the camera pose P_cam of every view is saved, so the following formula is used:

P_grasp = P_cam · P_rel   (2.10)

with both P_cam and P_rel being homogeneous 4×4 matrices.

Approaching from Front. Finding feasible grasp regions from the top-front view of an object is even more complicated. The problem arising here, again, is the perspective distortion of the contour: as can be seen in Fig. 20, the parallel sides of the object no longer appear parallel in the contour. One possibility to solve this problem would be to find all grasp regions in the near and far top-front views and then use corresponding pixels to reconstruct the grasp regions in 3D. For this, two things are necessary: an algorithm for finding corresponding contour points, and an exact camera calibration (intrinsic and extrinsic parameters). Since in our case neither was available with sufficient accuracy (the errors would simply be too large), the approach used to obtain some front grasps is the one illustrated in Fig. 20 (right). If the difference between the two angles φ_1 and φ_2 is not higher than an empirically adjusted threshold (5 degrees in our case), the regions are assumed to be parallel and therefore compatible. By projecting the regions onto each other, this time along the x-axis of the image coordinate system, feasible grasp regions are extracted. Then, again, the force focus point is re-projected onto the bounding box of the object, producing the desired position of the palm. There are obviously circumstances in which this assumption does not hold. Yet, since an operator is responsible for choosing the grasp to be executed, this is not a problem at the moment. In the future, of course, a better grasp analysis from the front and an evaluation process to automatically find the best grasp among the possible ones will be needed.

Figure 20. L-shaped object from top-front (left), contour with features (right).

Grasp Execution. Two control laws have been implemented to improve the grasping performance of the hand: (a) simultaneous closing of the fingers and (b) compensation of the z-movement for precision grasps. Control law (a) is necessary because the fingers might exert unwanted forces on the object if their closing extents differ much. In velocity control this can easily happen, since a commanded velocity results in slightly different velocities for the different fingers. To overcome this, a control law is used to make sure the fingers close simultaneously. For finger F1 this means:

v_F1set = v_F1des + κ · ((θ_M2 + θ_M3)/2 − θ_M1)   (2.11)

20 (θ M1 ) and the average of the other fingers θ M2 and θ M3 ). This difference is multiplied by a factor κ to change the effect of the regulation. This regulation is performed for each finger. If the object is centered in the palm of the hand this makes sure the fingers intersect with the object at the same time and unwanted forces and torques are minimized. The control (b), takes into account that the algorithm also makes a difference between grasps from top and grasps from front. Top grasps are performed as precision grasps whereas front grasps are performed as power grasps. The difference is that power grasps try to enclose the object in the hand and also use the palm to support the grasp, whereas precision grasps only make contact with the finger tips. For power grasps the palm must be moved as close as possible to object and the fingers are just closed, thus reaching as far around the object as possible. In the case of precision grasps the hand kinematics are used to make sure that the finger tips stay at a fixed height while closing the hand. This is done by moving the arm in velocity control to compensate the movement of the fingers in z-direction of the tool coordinate system. By this the fingers always stay close to the top of the object and certainly over the center of mass, which is important for stable grasps from above. 3 Experimental Validation This section has been divided into two different subsections, one devoted to the arrangements description, ant the other compiling different results Experimental Setup All the experiments were implemented in the real system shown in the Fig. 21. A well-known 7 d.o.f. Mitsubishi PA10 robot arm was used, including a camera-in-hand and a three-fingered Barrett Hand. The camera model is a Sony XC-333, which is a very compact remote head color camera very common in industrial applications. With its size of about cm it can be assembled between the spreadable fingers of the Barrett Hand (i.e. see Finger 1 and Finger 2 in Fig. 21) without being in the way. The resolution of the CCD-Chip is pixels. The Matlab calibration toolbox was used to calculate the intrinsic camera parameters. Respecting the Barrett Hand, in Fig. 21, the main parameters of its kinematics can be seen, adapted from [32], as a thumb (i.e. Finger 3) and two opposite fingers (i.e. Finger 1 and Finger 2); the extension of the fingers, d 1, d 2 and d 3 ; the spread of the fingers, α; Coordinates of the centre of the wrist, W x, W y. Furthermore, the Barrett hand has some special features which shall be explained in more detail. These features are of great importance since they define the possibilities for grasping objects. See the list in the following: Each finger can be opened and closed individually. Although each finger has two joints this movement is executed by only one motor per finger. The closing mechanism is based on the patented Torque Switch mechanism [33] which keeps the angles of the inner and outer joint of each finger in a fixed relation to each other, comparable to the human hand. If the outer link of a finger has first contact with the object the motors are stopped once the desired motor torque is applied. If the inner link of a finger is stopped first the motor torque is completely transmitted to the outer link to ensure maximum possible closure of the hand for maximum firmness and security of the grasp. Of course, in this case the relation of the joint angles changes. 
Each finger contains a so-called strain gauge sensor to measure the force applied to the outer joint (finger tip). Fingers F1 and F2 have an extra degree of freedom, as they can be spread around the palm by up to 180 degrees. The spread of the fingers is always symmetric, since they are mechanically connected: both fingers can only be spread together by the same amount. The difference between the closing motors and the spread motor is that the closing motors are locked and stay in place once no command is sent to them, whereas the spread motor is fully back-drivable. Like the Mitsubishi arm, the Barrett Hand can be moved in position control as well as in velocity control for real-time applications. The maximum force that can be actively produced at each finger tip is 2 kg.

Figure 21. A redundant Mitsubishi PA10 was used as robot arm, including a camera-in-hand system and a three-fingered Barrett Hand. The kinematic model is shown in the left column, and the real system on the right.

It is also necessary to discuss the limitations of the Barrett Hand, since these also limit the possibilities for grasping an object.

The closing mechanism. Due to the fixed relation between the joint angles of each finger, the closing of the fingers is very inflexible. A different opening of the fingers also results in different z-levels in tool coordinates. Furthermore, the direction of the force vector also depends on the closing extension of the fingers. That is why, for asymmetric grasps, unwanted torques are exerted which unbalance possible grasps or even make some of them impossible.

22 The strain sensors. Another encountered problem is that of the finger strain sensors. These sensors are there to measure the torque in the outer joint of each finger. In fact the value needed is mostly the force in the finger tips. It is not possible to calculate the real force at the finger tip very easily, since it depends on the point of intersection in the outer link and the angle between the finger and the plane of intersection. Further more the sensor can not be used very sensitively, as static friction in the joint itself might be interpreted as contact. That means light weight objects might tip over or at least be moved if one finger make contact before the others because this light touch cannot be measured. The last problem with those strain sensors is that they tend to have failure. The cord which runs inside the finger to transmit the force on the finger tip can easily jump from the knurl which is used as a transmitter of the force to the sensor. If this happens no force can be measured at all The System in Action In order to test the algorithm experiments were made with different kinds of objects. The following results show the high potential of the algorithm. With the simple model of a box and only analyzing the outer contour from two camera positions the system can grasp all these objects without knowing them in advance. An overview over the 3D objects setup can be appreciated in the Fig. 22. Small box Big box Filled box L-object Hexagonal box Tape roll Triangular object bottle Toy lizard Figure 22. The complete setup of unknown 3D objects used in the experiments. Some examples of two-fingered grasp from top can be seen in Fig. 23. Basically, four kinds of objects are processed. The first one, the box, is one of the simplest objects that can be grasped, since it is equivalent to the chosen model. The extraction of the grasp regions is also very easy because the edges are very distinctive. There are 4 ways to grasp this box from above and also 2 grasps from front are possible. On the other hand, grasping the filled box is kind of like grasping the first one. The main difference is that this box is filled with some batteries and cables. Since the longest contour is extracted from the image, it always results in the contour around the box. Thus it does not matter what is in the box. One problem that can occur is if the

One problem that can occur is that, if the box is filled with heavy material, the fingers will slip unless the finger force is adapted. Since in our current system there is no way to measure the weight of the object or the slipping of the fingers, this must be taken into consideration before executing the grasp.

Figure 23. Feasible grasps (on the left) and executed two-fingered grasps (on the right) for a few examples: big box, filled box, L-object, and hexagonal box (from top to bottom).

Regarding the L-object, it can be seen that the palm is always moved directly over the force focus point. If the reconstruction of the feature points is performed correctly, so that the real distance to this point is computed accurately, this movement will always be precise. Due to the poor image quality and the imprecise hand-eye calibration, this computation may occasionally fail, which is the main source of error in the whole process.
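Because the placement of the palm over the force focus point depends on how well the feature points are reconstructed from the two views, a minimal triangulation sketch is given below. It assumes that the intrinsics and the camera poses from the hand-eye calibration are available, and recovers a 3D point as the midpoint of the shortest segment between the two back-projected rays; all names and the midpoint method itself are illustrative, not necessarily the procedure used in the original system.

```python
import numpy as np

def backproject_ray(K, R, t, pixel):
    """Ray (origin, unit direction) in the world frame for a pixel of a calibrated camera.

    K: 3x3 intrinsics; R, t: world-to-camera rotation and translation."""
    d_cam = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    origin = -R.T @ t                      # camera centre expressed in world coordinates
    direction = R.T @ d_cam
    return origin, direction / np.linalg.norm(direction)

def triangulate_midpoint(ray1, ray2):
    """Midpoint of the shortest segment between two (possibly skew) rays."""
    (p1, d1), (p2, d2) = ray1, ray2
    # Closest points are p1 + s*d1 and p2 + u*d2; solve the 2x2 normal equations.
    A = np.array([[d1 @ d1, -(d1 @ d2)],
                  [d1 @ d2, -(d2 @ d2)]])
    b = np.array([(p2 - p1) @ d1, (p2 - p1) @ d2])
    s, u = np.linalg.solve(A, b)
    return 0.5 * ((p1 + s * d1) + (p2 + u * d2))
```

The residual distance between the two rays at the recovered point gives a simple sanity check: when the hand-eye calibration is poor, that residual grows, which is consistent with the error source identified above.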

Finally, the hexagonal box is a very interesting test object because it can be grasped from many directions and with both two- and three-finger configurations. Again, it has very sharp edges, which makes the extraction of grasp regions much easier.

Some examples of three-fingered grasps are displayed in Fig. 24. The hexagonal box, as already said, can be grasped with two- and three-fingered configurations. In the three-fingered case, an error becomes apparent that arises from the simplification of the hand model: the calculation of the grasp pose and preshape does not take into account that the fingers do not intersect at a single point. The force focus point is simply taken as the intersection point of the normal vectors to the grasp regions, whereas in fact the joints of the two spreadable fingers are separated by 5 cm. A more accurate grasp calculation would yield a slightly different palm position and spread opening, see the work by Morales [32]. This should be considered a possible improvement; since the goal of this preliminary work was to end up with a system that can execute a grasp, it was not implemented due to lack of time. Nevertheless, the error was never large enough to prevent the object from being grasped, because graspable objects must have a certain minimum size for the hand, and for larger objects this relative error becomes smaller.

Figure 24. Feasible grasps (on the left) and executed three-fingered grasps (on the right) for a few examples: hexagonal box and triangular object (from top to bottom).

Results for two-fingered grasps from the front, also referred to as power grasps, are displayed in Fig. 25. Regarding the big box, it can be seen from the right feasible region that the edge separating the top from the front side was not recognized, because the angle in the curvature is not large enough; only the projection of the grasp regions onto each other yields the remaining feasible grasp regions shown. There is clearly room for improving the algorithm for front grasps. The bottle example demonstrates another problem with the top-front contour: due to perspective distortion, the curvature appears much more extreme than it actually is. This results in many short grasp regions, which makes it harder to choose one. Finally, the L-object is actually very convenient for grasping from the front if it faces in the right direction, because it then looks just like a tall box from the front; this is the optimal case. If the object is turned by 180 degrees, this is no longer true: the concavity of the shape then produces a very large error, because the model of the rigid box no longer applies to the real object at all.
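As a concrete illustration of how a force focus point can be obtained as the intersection of the normals to the selected grasp regions, the sketch below computes the least-squares intersection of an arbitrary number of lines, one per grasp region. The example numbers are made up, and a correction for the 5 cm joint separation mentioned above would have to be applied on top of this result.

```python
import numpy as np

def force_focus_point(points, normals):
    """Least-squares intersection of the lines p_i + t * n_i (works in 2D or 3D).

    points: one point on each grasp region; normals: the region normals."""
    dim = len(points[0])
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for p, n in zip(points, normals):
        n = np.asarray(n, dtype=float)
        n = n / np.linalg.norm(n)
        P = np.eye(dim) - np.outer(n, n)   # projector orthogonal to this line
        A += P
        b += P @ np.asarray(p, dtype=float)
    return np.linalg.solve(A, b)           # fails only if all normals are parallel

# Illustrative three-finger example: region midpoints with inward normals.
pts = [(0.10, 0.00), (-0.05, 0.08), (-0.05, -0.08)]
nrm = [(-1.0, 0.0), (0.5, -0.8), (0.5, 0.8)]
print(force_focus_point(pts, nrm))
```

With two regions the lines generally intersect exactly; with three or more the least-squares point spreads the residual over all normals, which is one reasonable choice when the fingers cannot physically meet at a single point.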

Figure 25. Feasible grasps (on the left) and executed two-fingered grasps (on the right) for a few examples: big box, bottle, and L-object (from top to bottom).

On the other hand, given that both the visual processing and the robot/hand control were implemented on an ordinary PC running at 3 GHz with 512 MB of RAM, an estimate of the time consumed during execution is presented in the following. Bear in mind that the analog camera stream is converted into a digital signal and connected to the PC via a FireWire port; the robot arm is accessed through an optical connection using a special PCI card; and the Barrett Hand controller is connected to a standard RS-232 serial port. The system was tested with the set of objects shown in Fig. 22, and the execution time was measured for each of the following stages:

- System initialization, which consists in detecting the object and centering it in the image to avoid the parallax problem, by using visual servoing techniques (a minimal sketch of such a centering law is given after this list).
- Top views, which include the arm movements necessary for obtaining the two top views of the object and processing the corresponding geometrical features.
- Front views, which include the arm movements necessary for obtaining the two front views of the object and processing the corresponding geometrical features.
- User action, which involves the user selecting the grasp to apply.
- Grasping, which starts when the grasp is selected and ends when the object is grasped.

A total of 10 trials were made for each of these objects; the mean values are shown in Table 1.
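A minimal sketch of the kind of image-based centering law used in the initialization stage is given below: the pixel error of the object centroid is mapped proportionally to a camera translation, assuming a rough depth estimate and the camera focal length are available. The gain, the depth estimate, and the interface to the arm are all assumptions, and the sign depends on the camera frame convention.

```python
import numpy as np

def centering_velocity(centroid_px, image_center_px, depth, focal_px, gain=0.5):
    """Camera x/y translation that drives the object centroid towards the image centre.

    For a point feature and pure x/y camera translation, the pixel coordinate
    evolves approximately as du/dt = -(f / Z) * vx, so inverting this factor in a
    proportional law yields exponential decay of the centering error.  The sign
    assumes image axes aligned with the camera x/y axes."""
    err = np.asarray(centroid_px, dtype=float) - np.asarray(image_center_px, dtype=float)
    return gain * (depth / focal_px) * err   # [vx, vy] in the camera frame [m/s]

# Hypothetical control loop (camera and arm interfaces are placeholders, not a real API):
# while not centered:
#     u, v = detect_object_centroid(grab_frame())
#     vx, vy = centering_velocity((u, v), (width / 2, height / 2), Z_est, fx)
#     arm.send_cartesian_velocity([vx, vy, 0.0, 0.0, 0.0, 0.0])
```

Once the centroid error falls below a threshold, the object is considered centered and the system proceeds to acquire the top views.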
