Master's thesis
Master in Automatic Control and Robotics

Object recognition and grasping using a bimanual robot
Report

Author: Aleix Ripoll Ruiz
Director: Jan Rosell
Call: October 2016

Escola Tècnica Superior d'Enginyeria Industrial de Barcelona


Abstract

This document presents an algorithm to recognise and locate known bulky objects in a workspace and to compute grasps using two anthropomorphic hands. Once valid grasp points have been found and they are kinematically reachable by the manipulator, motion planning and a collision check are performed.

First, this document presents the general architecture of the system and the hardware devices and software implemented. Then, a detailed description of the object recognition, grasp and motion planning, and manager modules is provided. The Robot Operating System (ROS) is the framework chosen to handle the communication between nodes with appropriate messages, to manage the data from the sensors and to execute the robot tasks. Second, attention is focused on the evaluation of the object recognition and grasp system. The most important conclusions drawn from that evaluation are:

- The camera's calibration plays an important role in obtaining good performance and reliability of the system.
- This project proposes an approach that uses multiple camera views to construct a final point-cloud scene and obtain a complete 3D object.
- This project presents a satisfactory approach for filtering the desired object by its colours instead of by its shape.
- The more complete the point-cloud is, the smaller the object pose error.
- A system which grasps a bulky object with six contact points, three for each hand, instead of the four recommended by other works, is more robust.
- The bimanual manipulator has 22 DoF for each arm-hand chain; therefore, the inverse kinematics problem is a laborious process to converge to a feasible solution.
- A ROS service is less robust to service provider changes or server failures than a ROS action, which provides feedback on the task progress and allows cancellation at any time.

Glossary

AAS: Allegro Application Studio
API: Application Programming Interface
AR: Augmented Reality
BSD: Berkeley Software Distribution
C-Space: Configuration Space
DLS: Damped Least Squares
DoF: Degrees of Freedom
EST: Expansive Space Trees
FK: Forward Kinematics
FoV: Field of View
FPFH: Fast Point Feature Histogram
GA2H: Grasping with two Allegro Hands library
GIKL: General Inverse Kinematics Library
GUI: Graphical User Interface
HLS: Hue Lightness Saturation
ICP: Iterative Closest Point
IK: Inverse Kinematics
IOC: Institut d'Organització i Control de sistemes industrials
KDL: Orocos Kinematics and Dynamics Library
KPIECE: Kinematic Planning by Interior-Exterior Cell Exploration
MIT: Massachusetts Institute of Technology
MPL2: Mozilla Public License 2.0
NaN: Not A Number
OMPL: Open Motion Planning Library
PCL: Point Cloud Library
PFH: Point Feature Histogram
PR: Personal Robots
PRM: Probabilistic Roadmap Method
ROS: Robot Operating System
RRT: Rapidly-exploring Random Tree
SAC-IA: SAmple Consensus Initial Alignment
SBL: Single-query Bi-directional Lazy collision checking planner
SPFH: Simplified Point Feature Histogram
STAIR: STanford AI Robot
SVD: Singular Value Decomposition
ToF: Time-of-Flight
UPC: Universitat Politècnica de Catalunya
UR: Universal Robot
URDF: Unified Robot Description Format
XML: eXtensible Markup Language

Contents

1. INTRODUCTION
   1.1. Objectives and scope
   1.2. Project activities
2. GENERAL ARCHITECTURE OF THE SYSTEM
   2.1. Hardware architecture
        2.1.1. Camera sensors
        2.1.2. Robotic Hand
        2.1.3. Bimanual Manipulator
   2.2. Software architecture
        2.2.1. Robot Operating System (ROS)
        2.2.2. ROS packages: openni_launch and iai_kinect2
        2.2.3. C++ libraries
        2.2.4. The Kautham Project
3. OBJECT RECOGNITION MODULE
   3.1. Data representation
        3.1.1. Point-clouds
        3.1.2. Data acquisition
   3.2. Camera Calibration
   3.3. Object recognition node
        3.3.1. Point-cloud processing
        3.3.2. ROS service: objects_recognition
4. GRASP AND MOTION PLANNING MODULE
   Object grasp node
        Approach overview
   Robot inverse kinematics node
        Robot model simulation
        General Inverse Kinematics Library (GIKL)
   Motion planning node
        Introduction to motion planning
        Basic features of the Kautham Project
5. MANAGER AND EXECUTION MODULE
   Input data files
   Manager node
6. RESULTS
   6.1. Evaluation of the Object Recognition Module
        Camera calibration
        Multiple Kinects evaluation
        Object pose estimation
        Object recognition
   6.2. Evaluation of the Grasp and Motion Planning Module
PROJECT BUDGET
ENVIRONMENTAL IMPACT
CONCLUSIONS AND FURTHER WORK
ACKNOWLEDGEMENTS
BIBLIOGRAPHY

1. Introduction

During the last decades, the number of applications of bimanual human-like robots has increased steadily, from tasks related to picking and placing items on an industrial production line to an emerging class of personal robots that assist elderly or disabled people by cleaning, delivering items or preparing meals at home. In this scope, many of these applications require an efficient system for sensing the environment of the robot and an effective bimanual robot with anthropomorphic hands for executing some tasks, for instance manipulating two objects simultaneously. The module responsible for sensing the surroundings of the bimanual robot can provide the ability to recognise and locate 3D objects, avoid collisions with external obstacles or localise the robot while it maps its environment. This project focuses on determining whether a specific object is in the workspace and, if so, grasping it with a bimanual robot.

Human beings tend to rely on visual data to react to the world around them. Robots can carry out many tasks with sensors such as RFID sensors, RGB-D cameras, ranging sensors, tactile sensors, accelerometers, gyroscopes, etc. However, the performance of these sensors is still far from that of biological human senses, and high performance comes at a high cost. Fortunately, the robotics community has seen massive progress in low-cost hardware such as depth cameras and anthropomorphic robotic hands.

An object recognition system finds and locates objects in the real world from the data provided by the robot sensors, knowing the object models a priori. This task is surprisingly difficult. This project discusses the stages of the object recognition process applied to 3D scenes captured with RGB-D cameras. These types of sensors either illuminate the scene with a light pattern and observe its deformation on the surfaces of the scene, or emit an infrared pulse and estimate the time it needs to travel to the scene and bounce back to the depth camera. The first technology is used by the Microsoft Kinect 360 [1] and the second one by the Microsoft Kinect One [2].

Bimanual grasping systems allow robots to manipulate certain objects and perform complex tasks that are not feasible with single end-effectors. This project presents the system architecture used to plan motions towards grasping configurations and to perform robust grasping operations with a system composed of two Allegro Hands [3] attached to UR5 robots [4].

The general architecture of the system, the main hardware devices and the software tools used in this project are presented in chapter 2. Chapter 3 presents the theory about point-cloud processing, the calibration procedure and the ROS service implemented. Chapter 4 presents how the object grasp planning, the robot inverse kinematics and the motion planning have been integrated, as well as the main theoretical background. Chapter 5 presents the manager module. Chapter 6 shows the results for each module, and chapter 7 concludes and presents the future work.

1.1. Objectives and scope

In order to clarify what is expected during the development of this project, it is important to establish its objectives and limits. The aim of the project is to develop a system able to automatically locate and grasp a specific object using a bimanual robot. To accomplish this, an open C++ software application running on the Robot Operating System (ROS) [5] is implemented. The main objectives are:

a) Development of a method to recognise and locate the object based on point-cloud models using RGB-D cameras.
b) Development of the communications with a grasp and motion planning module (already developed at the IOC) to determine the points at which to grasp the object, taking into account the hand tips, and to find an admissible collision-free path that connects the initial configuration of the arms to the target one.
c) Implementation of an execution module responsible for managing the communication between the other modules.
d) Validation of the system by performing some experiments.

The project will use open software tools and hardware devices available at the IOC's laboratory:

- RGB-D cameras: Microsoft Kinect 360 and Kinect One.
- Bimanual manipulator: two Allegro Hands attached to two UR5 robots.
- Open software tools: The Kautham Project, the General Inverse Kinematics Library (GIKL) and the Grasping with two Allegro Hands library (GA2H).

1.2. Project activities

This section presents the tasks carried out in this project in order to meet the objectives initially proposed. Below, all tasks are listed and a diagram illustrates all activities without a time-line representation.

The project consists of four stages. Stage I deals with all the tasks prior to the C++ programming: defining the project activities, searching for similar works (state of the art) and studying the theory about point-cloud processing, motion planning and robot inverse kinematics. Stage II consists of programming the different modules responsible for object recognition, object grasping, path planning and robot inverse kinematics. Each of these modules is tested in Stage III in order to study its performance and reliability. At the end, Stage IV summarises the results and conclusions in a document. During all stages there is a parallel Continuous stage, which contains the tasks related to project management. For each stage, a division into tasks and subtasks has been made.

Stage I: Previous study
  Task 1: IOC needs
    T1.1: Define objectives.
    T1.2: Define project activities.
  Task 2: Documentation
    T2.1: State of the art.
    T2.2: C++ programming language.

Stage II: Programming
  Task 3: Create a database of point-cloud models of common objects
    T3.1: Database generation.
  Task 4: System architecture
    T4.1: Install required packages and device drivers.
    T4.2: Definition of ROS nodes.
    T4.3: Build CMake file.
    T4.4: Classes and methods definition.
    T4.5: Definition of the types of communication required.
  Task 5: Object recognition module
    T5.1: Camera calibration system.
    T5.2: Point-cloud processing system.
    T5.3: ROS service: object_recognition.
    T5.4: Build CMake file.
  Task 6: Grasping and motion planning module
    T6.1: Integration of GA2H.
    T6.2: Integration of the Kautham application.
    T6.3: Build CMake file.
  Task 7: Inverse kinematics node
    T7.1: Integration of GIKL.
    T7.2: Build CMake file.
  Task 8: Manager module
    T8.1: Implementation of the manager and execution node.
    T8.2: Build CMake file.

Stage III: Experimental
  Task 9: System robustness: find unexpected issues and fix as many as possible.
    T9.1: Run the execution node to find new unexpected issues.
  Task 10: System performance: find poorly performing code and improve it as much as possible.
    T10.1: Run the execution node to find performance bottlenecks.

Stage IV: Report
  Task 11: Results and conclusions.
  Task 12: Final review and creation of the final report.

Continuous stage:
  Task R: Project reviews and weekly meetings with the IOC supervisors.

The following diagram does not represent a real time line, as some tasks could be finished earlier or later.

Figure 1.1. Project task diagram. It is not scaled in time.

2. General architecture of the system

This chapter provides useful information about the general structure of the system, the hardware devices and the software implemented. According to the Engineering Systems Division at the Massachusetts Institute of Technology (MIT) [6], the architecture of any complex system is an abstract description of the entities involved in a system and how they are related. Constructing the architecture of a system involves determining what the system is supposed to do and how specifically it will do it. Moreover, the basic process of creating the system architecture is decomposition, in which a top-level concept of the system's functions is broken down into subsystems capable of performing subfunctions. With these definitions, the architecture of the system and its sub-systems can be modelled. The functional architecture [6] is designed to accomplish the system's objectives presented in section 1.1:

Figure 2.1. System architecture.

Figure 2.2 shows the general architecture of the Object Recognition and Grasping System and the connectivity between its modules. The first module consists of the object recognition process, which identifies and recognises the desired objects to be manipulated later. This subsystem estimates the pose of each recognised object in the world coordinate system in order to grasp and manipulate them with the bimanual robot (see chapter 3). The second module is responsible for determining the strategy to pick the object with the robot's end effectors and place it at the final desired pose with an optimal motion plan (see chapter 4). Finally, the execution module manages the information given by the other two modules (camera sensor, planner and robot information) and the input user tasks (see chapter 5).

Figure 2.2. Block diagram of the Object Recognition and Grasping System.

The architecture decomposition of each module is presented below. Here, the distinction between hardware and software is made as in [6]. Figure 2.3 shows the architecture of the object recognition module, Figure 2.4 shows the architecture of the object grasp and motion planning module, and Figure 2.5 presents the manager and execution module. The distinction between physical (hardware) and non-physical (software) entities is clear in most engineering projects, including the present one, so two schemes can be constructed: one with the relations between the hardware components and another for the software.

Figure 2.3. Architecture of the object recognition module.

Figure 2.4. Architecture of the object grasp and motion planning module.

Figure 2.5. Architecture of the manager and execution module.

2.1. Hardware architecture

The hardware architecture refers to the identification and description of the system's physical components and their interaction and compatibility within the system architecture. Figure 2.6 describes the hardware used in this project: the camera sensors, the bimanual robot, two robotic hands and a computer that manages the whole system.

Figure 2.6. Main hardware components used in this project.

2.1.1. Camera sensors

The Kinect 360 and the Kinect One have been used in this project. Both are sensors for videogames developed by Microsoft [2]. These cameras are widely used by the robotics community in the field of computer vision because of their low cost. The main differences between these two cameras are the technique used to estimate the distances to object surfaces, the colour and depth camera resolutions, the depth range and the field of view (FoV):

- Technology: the Kinect 360 uses structured light and the Kinect One uses the time-of-flight (ToF) technique.
- Colour camera resolution: the Kinect 360 has 640x480 and the Kinect One has 1920x1080.
- Depth camera resolution: the Kinect 360 has 320x240 and the Kinect One has 512x424.
- Depth range: the Kinect 360 ranges from 0.4 m to 4 m and the Kinect One from 0.5 m to 4.5 m.
- Field of view: the Kinect 360 has 43° vertical and 57° horizontal and the Kinect One has 60° vertical and 70° horizontal.

The technical specifications of these two cameras are presented in Table 2.1 and Table 2.2:

Table 2.1. Kinect 360 technical specifications [1].

- Technology: structured light (see section 3.1.2)
- Colour camera resolution: 640x480 at 30 fps
- Pixel size: 3.6 μm
- Depth camera resolution: 320x240 at 30 fps
- Depth range: 0.4 m to 4 m
- Field of view (FoV): 43° vertical by 57° horizontal
- Tilt motor: yes, vertical tilt range of ±27°
- Frame rate (depth and colour streams): up to 30 fps
- USB standard: 2.0
- Audio: a four-microphone array with signal processing, including acoustic echo cancellation and noise suppression; 16-kHz, 24-bit mono pulse code modulation (PCM)
- Accelerometer: a 2G/4G/8G accelerometer configured for the 2G range, with a 1° accuracy upper limit
- Price: 270

Figure 2.7 shows the main components of the Kinect 360:

Figure 2.7. Kinect 360 components [1].

Table 2.2. Kinect One technical specifications [2].

- Technology: time-of-flight (see section 3.1.2)
- Colour camera resolution: 1920x1080
- Depth camera resolution: 512x424
- Depth range: 0.5 m to 4.5 m
- Field of view (FoV): 60° vertical by 70° horizontal
- Tilt motor: no
- Frame rate (depth and colour streams): up to 30 fps
- USB standard: 3.0
- Audio: four microphones to capture sound, record audio, and find the location of the sound source and the direction of the audio wave
- Dimensions: 24.9 cm x 6.6 cm x 6.7 cm; the Kinect cable is 2.9 m long
- Weight: 1.4 kg
- Accelerometer: no
- Price: 450

Figure 2.8 shows the main components of the Kinect One:

Figure 2.8. Kinect One components [1].

2.1.2. Robotic Hand

Two Allegro Hands have been used in this project. This hand is developed by SimLab's Allegro Application Studio (AAS) [3]. The Allegro Hand is a low-cost and adaptive robotic hand for a variety of commercial robots. It is composed of four fingers and sixteen independently controlled joints. For this reason, it is a good platform for grasp and manipulation applications, capable of holding a variety of object geometries. The technical specifications are presented in Table 2.3:

Table 2.3. Allegro Hand technical specifications [3].

- Weight: finger 0.17 kg, thumb 0.19 kg, total 1.08 kg
- Design: anthropomorphic
- DoF: 16 (4 DoF x 4 fingers)
- Payload: 5 kg
- Actuation: DC motor; gear ratio 1:369; max. torque 0.70 Nm
- Joint resolution: 0.002° (measured with potentiometer)
- Communication: CAN (NI, Softing, Kvaser or ESD CAN) at 333 Hz

- Power: 7.4 VDC (7.0 V to 8.1 V), 5 A minimum
- System requirements: CPU Intel Core 2 Duo or higher; RAM at least 2 GB; HDD at least 2 GB; graphics with OpenGL 3.0 hardware acceleration enabled and at least 64 MB of video RAM; OS MS Windows and Linux (ROS)

Figure 2.9 shows the main components of the Allegro Hand:

Figure 2.9. Allegro Hand dimensions in mm and its main components [3].

2.1.3. Bimanual Manipulator

The bimanual robot is composed of two UR5 robots fixed to a T-shaped structure built at the IOC [7]. Figure 2.10 shows the system with a table to manipulate objects and two Allegro Hands (16 DoF) coupled to the UR5 arms (6 DoF). The whole system tries to emulate a human-like morphology.

Figure 2.10. Bimanual manipulator structure. Image from [8].

The UR5 robot is a collaborative arm with six joints developed by Universal Robots [4]. The UR5 is designed to mimic the range of motion of a human arm with its flexibility.

This robot is widely used in many applications such as industrial assembly lines, food handling, painting, pick and place, etc. The technical specifications are presented in Table 2.4:

Table 2.4. UR5 technical specifications [4].

- Weight: 18.4 kg
- Workspace: working radius up to 0.85 m
- DoF: 6 revolute joints
- Joint ranges: ±360°
- Speed: joints 180°/s; tool 1 m/s
- Payload: 5 kg
- Actuation: max. torque 2.3 Nm
- Repeatability: ±0.1 mm
- Communication: TCP/IP 100 Mbit (IEEE 802.3u, 100BASE-TX), Ethernet socket and Modbus TCP
- Power supply: 100-240 VAC, 50-60 Hz
- Power consumption: 200 W

The UR5 workspace is a sphere of 0.85 m radius, excluding a middle cylindrical area of 0.117 m radius. Figure 2.11 shows the UR5 manipulator and its workspace:

Figure 2.11. Workspace of the UR5 manipulator [4].

2.2. Software architecture

The software architecture refers to the identification and description of the system's non-physical components and their interaction and compatibility within the system architecture. Computer software interprets the information provided by the physical hardware from which the system is built. Software can be divided into application software and system software. The first one uses the computer system to perform a specific functionality for the benefit of the user (a typical computer program). The second one includes the operating systems, which manage resources and provide common services for other software working on top of them, and the device drivers, which control the devices connected to the computer. Figure 2.12 describes the system software and the application software used in this project:

Figure 2.12. Main software components used in this project.

The object recognition and grasping system is programmed in C++ and uses the components described in Figure 2.12. Additional information on each one is presented in the following subsections.

2.2.1. Robot Operating System (ROS)

The Robot Operating System (ROS) [5] is a framework for building robot software. It started at Stanford University in the mid-2000s with the Stanford AI Robot (STAIR) and the Personal Robots (PR) programs. It is open source, with a large collection of tools, libraries and conventions that simplify the work of creating complex and robust robot applications for a wide variety of robotic platforms.

The ROS community allows exchanging robot software and knowledge through ROS distributions, which make it easier to install a collection of software; repositories, where different institutions can develop and release their own robot software; and the ROS wiki, where anyone can contribute documentation and tutorials [9].

The ROS software is structured into packages, each of which contains some combination of code, data and documentation. Each package directory has to include a CMakeLists.txt and a package.xml file that describe the contents of the package and how catkin should interact with it; catkin is the ROS build system that generates executable programs, libraries and interfaces [10].

The ROS system allows different nodes [11] to communicate with each other, exchanging information and data. However, the whole system needs a running ROS Master [12] in order to notify nodes of the existence of other nodes so that they can start communicating with each other. The ROS Master enables individual ROS nodes to locate one another in the system, and it tracks publishers and subscribers to topics and services. The communication between nodes can be done with client/server or publisher/subscriber methodologies: ROS topics [13] implement a publish/subscribe communication mechanism, while ROS services [14] and actions use a client/server communication method. Figure 2.13 shows a representation of these three communication mechanisms between nodes:

Figure 2.13. ROS communication mechanisms.

One of the most common ways to exchange data in a distributed system is the publish/subscribe communication mechanism implemented by topics. Before node (A) starts to transmit data over a topic, it must first advertise the topic name and the type of message that is going to be sent. Node (B) and node (C) have to subscribe to this topic by making a request to the ROS Master; then, both nodes will receive messages from this topic. A topic is a one-way communication and it is useful when multiple nodes might be listening [15].
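As an illustration of the publish/subscribe mechanism, the following minimal sketch shows a publisher and a subscriber written with the standard roscpp API; the node, topic and message names are illustrative and do not come from the project code:

// talker.cpp: advertises a std_msgs/String topic and publishes at 1 Hz
#include <ros/ros.h>
#include <std_msgs/String.h>

int main(int argc, char **argv)
{
    ros::init(argc, argv, "talker");                 // register the node with the ROS Master
    ros::NodeHandle nh;
    ros::Publisher pub = nh.advertise<std_msgs::String>("chatter", 10);
    ros::Rate rate(1.0);
    while (ros::ok())
    {
        std_msgs::String msg;
        msg.data = "hello";
        pub.publish(msg);                            // every node subscribed to "chatter" receives this
        rate.sleep();
    }
    return 0;
}

// listener.cpp: subscribes to the same topic and prints every message received
#include <ros/ros.h>
#include <std_msgs/String.h>

void chatterCallback(const std_msgs::String::ConstPtr &msg)
{
    ROS_INFO("I heard: %s", msg->data.c_str());
}

int main(int argc, char **argv)
{
    ros::init(argc, argv, "listener");
    ros::NodeHandle nh;
    ros::Subscriber sub = nh.subscribe("chatter", 10, chatterCallback);
    ros::spin();                                     // hand control to ROS; the callback fires on each message
    return 0;
}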

A service is a synchronous two-way communication that allows one node to call a function that executes in another node. The server node specifies a function and advertises the service; by making a request, the client node can access this service and then awaits a response from the server node [15].

An action is an asynchronous two-way communication between nodes. It is similar to the request and response of a service, which for actions are called a goal and a result, respectively. Moreover, the server node can provide feedback with updates on the progress, and the client node can cancel the previously sent request at any time [15].

2.2.2. ROS packages: openni_launch and iai_kinect2

OpenNI SDK, libfreenect and libfreenect2 are available libraries for accessing the Microsoft Kinect USB cameras [16], [17], [18], [19]. They search for a Kinect by serial number, a parameter necessary for distinguishing between multiple Kinects and running them at the same time.

One purpose of this project is to use ROS to manage data between different modules that process different types of information. There are ROS packages available that translate the information provided by the camera drivers into useful ROS data types. The following ROS packages have been used to integrate the Kinect 360 and the Kinect One into the ROS system:

1. Packages needed for launching the Kinect 360 with libfreenect and the OpenNI SDK:
   a. openni_camera: publishes raw depth, RGB and IR image streams from Microsoft Kinect 360, PrimeSense PSDK, ASUS Xtion Pro or Pro Live cameras using the OpenNI SDK driver [19], [20].
   b. openni_launch: launches the RGB-D processing through rgbd_launch with the OpenNI driver [21].
   c. rgbd_launch: opens an OpenNI SDK device to convert the raw depth/RGB/IR streams into depth images, disparity images and registered point-clouds. This file is launched internally from openni_launch [22].
2. Package needed for launching the Kinect One with libfreenect2:
   a. iai_kinect2: a collection of tools and libraries for ROS to interface with the Kinect One device using libfreenect2. It contains a calibration tool for calibrating the IR sensor with respect to the RGB sensor and the depth measurements, and it converts the raw depth/RGB/IR streams into depth images and registered point-clouds [23].
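Once one of these packages is running, a node can consume the registered point-cloud simply by subscribing to the corresponding topic. The sketch below assumes the default topic name published by openni_launch (/camera/depth_registered/points); the equivalent topics published by iai_kinect2 (e.g. /kinect2/sd/points) work in the same way:

// cloud_listener.cpp: receives a registered point-cloud and converts it into a PCL cloud
#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
#include <pcl_conversions/pcl_conversions.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>

void cloudCallback(const sensor_msgs::PointCloud2::ConstPtr &msg)
{
    pcl::PointCloud<pcl::PointXYZRGB> cloud;
    pcl::fromROSMsg(*msg, cloud);                    // ROS message -> PCL data structure
    ROS_INFO("Received a cloud with %zu points", cloud.size());
}

int main(int argc, char **argv)
{
    ros::init(argc, argv, "cloud_listener");
    ros::NodeHandle nh;
    // Default topic of openni_launch; for iai_kinect2 use e.g. /kinect2/sd/points.
    ros::Subscriber sub = nh.subscribe("/camera/depth_registered/points", 1, cloudCallback);
    ros::spin();
    return 0;
}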

2.2.3. C++ libraries

This section presents the high-level C++ libraries that provide useful utilities for this project. Eigen is used for basic matrix and vector operations, such as computing a translation and rotation of a reference frame [24]. Basic colour image processing algorithms are handled with the OpenCV library [25]. The Point Cloud Library (PCL) offers the most important point-cloud processing algorithms needed for the object recognition application [26]. The Grasping with two Allegro Hands library (GA2H) and the General Inverse Kinematics Library (GIKL) are two libraries developed at the Institute of Industrial and Control Engineering (IOC): GA2H provides the grasp points of the 3D object and GIKL computes the inverse kinematics of the bimanual manipulator to reach the grasp points. A short description of each library follows:

1. Eigen: a C++ library for linear algebra, matrix and vector operations and geometrical transformations. Eigen is an open source library licensed under the Mozilla Public License 2.0 (MPL2) [24].
2. OpenCV: an open source library under the Berkeley Software Distribution (BSD) license that provides real-time computer vision functionality. OpenCV was developed to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in commercial products. OpenCV's application areas include facial detection and recognition, human gesture recognition, object segmentation and recognition, motion tracking, producing 3D point-clouds from stereo cameras, stitching images together to produce a high-resolution image, etc. [25].
3. Point Cloud Library: an open source library for 2D and 3D image and point-cloud processing under the Berkeley Software Distribution (BSD) license, and thus free for commercial and research use. PCL offers numerous algorithms for point-cloud processing, such as filtering outliers from noisy data, stitching 3D point-clouds together, extracting features to recognise objects based on their geometric appearance, surface reconstruction, registration, segmenting relevant parts of a scene, etc. [26], [27].
4. Grasping with two Allegro Hands: a C++ library developed at the Institute of Industrial and Control Engineering (IOC). The library computes the grasp points of a 3D object in order to manipulate it with two Allegro Hands mounted on a bimanual manipulator (see section 2.1.3). The algorithm also provides the inverse kinematics and the motion planning solutions to grasp the 3D object and manipulate it [8].
5. General Inverse Kinematics Library: a C++ library developed at the Institute of Industrial and Control Engineering (IOC). The library deals with the main kinematic problems of any tree-like robot structure. It provides simple enhancements to the Orocos Kinematics and Dynamics Library (KDL), implements analytical Jacobian solutions for specific manipulators and offers alternatives for the numerical inverse kinematics solver, such as the Damped Least Squares (DLS) method, which is not included in KDL. Additional information can be found in Annex A.

2.2.4. The Kautham Project

The Kautham Project is a software tool developed at the Institute of Industrial and Control Engineering (IOC) for teaching and research in robot motion planning [28]. It is developed in C++ and uses the Open Motion Planning Library (OMPL) for the large number of planners it provides [29]. OMPL is an open source library under the terms of the Berkeley Software Distribution (BSD) license. The library offers a large number of sampling-based motion planning algorithms, such as the Probabilistic Roadmap Method (PRM), the Rapidly-exploring Random Tree (RRT), Expansive Space Trees (EST), etc. OMPL does not contain any collision detection method or visualization tool [30], so Kautham uses Qt for the user interface, the Coin3D library for graphic rendering, the PQP library for collision checking and the ODE library for the dynamic simulation of rigid bodies. The XML format is used for the input problem formulation, and robots and 3D obstacles are defined as kinematic chains from Denavit-Hartenberg files or Unified Robot Description Format (URDF) files. The following figure shows the Kautham Project software application:

Figure 2.14. Kautham graphical user interface.

Different basic planners can be easily used and parameterized, allowing users to learn each of them in depth. The application can simulate the same problem with several planners and different strategies by changing their parameters, such as the maximum number of sampling points, the maximum number of nearest neighbours, the maximum distance to connect two nodes without collision, the sampling method, etc. [28]. An example of the configuration space of a problem solved using a PRM planner is shown in Figure 2.15:

Figure 2.15. Visualization of the configuration space of a problem solved with a PRM.

The user can set different solver limitations, such as the number of runs to simulate each problem, the maximum memory used or the maximum time to find a solution. Kautham creates a benchmarking output file which contains the simulation results [28]. An example of the benchmarking output is provided comparing the execution times in seconds of the PRM, EST and Single-query Bi-directional Lazy collision checking (SBL) planners:

Figure 2.16. Example of a boxplot in the benchmarking output.
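Although Kautham wraps the planner configuration in its own XML problem files, the underlying OMPL usage follows the standard SimpleSetup pattern. The sketch below poses a PRM query directly with OMPL; the state-space dimension, joint bounds and the trivial validity checker are illustrative assumptions, not the project's actual configuration:

// prm_query.cpp: minimal OMPL setup solving a query with a PRM planner
#include <ompl/base/ScopedState.h>
#include <ompl/base/spaces/RealVectorStateSpace.h>
#include <ompl/geometric/SimpleSetup.h>
#include <ompl/geometric/planners/prm/PRM.h>
#include <iostream>

namespace ob = ompl::base;
namespace og = ompl::geometric;

int main()
{
    // 6-dimensional configuration space (e.g. one 6-DoF arm) with bounded joint values.
    auto space = std::make_shared<ob::RealVectorStateSpace>(6);
    ob::RealVectorBounds bounds(6);
    bounds.setLow(-3.14);
    bounds.setHigh(3.14);
    space->setBounds(bounds);

    og::SimpleSetup ss(space);
    // Collision checking goes here; Kautham delegates this check to the PQP library.
    ss.setStateValidityChecker([](const ob::State *) { return true; });

    ob::ScopedState<> start(space), goal(space);
    start.random();
    goal.random();
    ss.setStartAndGoalStates(start, goal);

    ss.setPlanner(std::make_shared<og::PRM>(ss.getSpaceInformation()));
    if (ss.solve(5.0))                               // planning time budget in seconds
        ss.getSolutionPath().print(std::cout);
    return 0;
}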

3. Object Recognition Module

The object recognition module identifies specific 3D object models within a captured 3D scene with colour information. Moreover, the object's pose with respect to the world coordinate system is provided. Figure 3.1 shows the block diagram of the module, indicating its inputs and outputs:

Figure 3.1. Object Recognition Module block.

The point-cloud data representation, the data acquisition, the procedure for calibrating RGB-D cameras and the point-cloud processing pipeline are explained in the following subsections. Finally, this chapter presents how a ROS node is constructed to advertise a ROS service with the described functionality, so that any client can request an object recognition from a 3D scene with colour information.

3.1. Data representation

An image taken with a common optical camera represents a projection of a partial 3D world scene onto a 2D plane. The 2D digital image consists of a numerical matrix whose values correspond to light intensities. The matrix elements are called pixels. Each pixel can have one or more channels: one channel can define a grayscale image and three channels a colour image. When the captured image adds, for each pixel, the information of distances in Cartesian space, the data represents a 3D world scene.

3.1.1. Point-clouds

A point-cloud is a data structure used to represent a collection of multi-dimensional points and is commonly used to represent three-dimensional data in some coordinate system. These points are usually defined by X, Y, Z geometric coordinates and are often intended to represent the external surface of an object. Moreover, a point-cloud can also include the corresponding colour information, so that each point contains the light intensity values of red, green and blue and the transparency; the point-cloud becomes 4D when colour information is present. A point-cloud can be generated artificially from a computer program, for instance creating a point-cloud sphere with algebra, or it can be acquired from hardware sensors such as stereo cameras, 3D scanners or time-of-flight cameras (see section 3.1.2).

Figure 3.2 shows a point-cloud example. The coordinate system of the point-cloud is set with respect to a fixed coordinate system; if the point-cloud represents a 3D world scene, the fixed coordinate system is usually the sensor origin.

Figure 3.2. Point-cloud example with different objects on a table. Image from [26].

3.1.2. Data acquisition

There are several ways of estimating the distances of an arbitrary object in the 3D scene and converting them into a point-cloud. This section presents some hardware sensors, such as stereo cameras, 3D scanners or time-of-flight cameras, and how they can construct a point-cloud with colour information captured from a real scene. The type of sensor chosen depends on the situation and the application to develop: sometimes a high-resolution point-cloud is required, sometimes a faster sensor with a high frame rate, or a point-cloud without noise.

An RGB-D camera combines colour and depth information. This allows capturing the texture of an object in the scene from the colour image and its geometry from the depth sensor. This camera sensor contains:

- An optical sensor for acquiring the colour information.
- A depth sensor, composed of an infrared emitter and an infrared sensor, which estimates the distances of the objects in the scene.

Some types of RGB-D cameras are:

1. Stereo cameras: imitate human binocular vision by using two or more lenses, each with a separate image sensor, which gives the ability to capture 3D images [31].
2. Time-of-flight (ToF): measures the time that a ray or pulse of light takes to travel the distance between the camera and the object. The entire scene is captured with a single pulse of light, as opposed to point-by-point with a laser beam. A ToF camera has an active light source that is used to illuminate the scene; the camera registers the light reflected from the scene and the depth information is calculated [32].

3. Structured light: a technique to measure 3D surfaces by projecting patterns of infrared light. The pattern appears distorted when viewed from a perspective different from that of the projector. By analysing this distortion, depth information can be retrieved and the surface reconstructed. The main advantage is that the projected features are easily distinguished by the camera [33].

How the point-cloud is constructed from the optical camera and the depth sensor is not trivial. Note that a camera calibration is required in order to obtain the correspondences between the pixels from the colour and depth sensors, because both sensors are not in the same position inside the camera; a translation and rotation matrix is needed to relate both sensors. Section 3.2 shows the procedure to calibrate the RGB-D camera, and an example of an uncalibrated and a calibrated point-cloud is shown in Figure 3.5. The Kinect 360 and the Kinect One have been used in this project and their main specifications are presented in section 2.1.1. The following figures show the images from the Kinect 360 and the point-cloud generated.

Figure 3.3. Example of images taken with the Kinect 360.

Figure 3.4. Coloured point-cloud generated with the Kinect 360.

3.2. Camera Calibration

Object recognition, navigation systems and 3D scene reconstruction are examples of robotics applications that use the camera calibration parameters. The process of camera calibration gives a mathematical model of the camera's geometry, a distortion model of the lens and the pose of the camera with respect to a reference coordinate frame. Camera calibration can be divided into two main stages:

- Calibration of the intrinsic parameters: lens distortion, focal length, optical centre and, for RGB-D cameras, the colour and depth image offsets.
- Calibration of the extrinsic parameters: the position and orientation of the camera in a reference coordinate frame.

The intrinsic model transforms a 3D point of the real-world scene into a corresponding 2D point in the camera image plane. For RGB-D cameras, the intrinsic calibration also solves the colour and depth offsets, because the optical and IR sensors are not in the same position inside the camera. Since the intrinsic calibration usually needs to be done only once per camera, it makes sense to save the estimated model after a successful calibration into a file in XML or YAML format [34], [35]; later, simply loading these files into the program allows the calibration stage to be skipped.

Reliable calibration methods already exist and are widely used [23], [36]. The default intrinsic parameters of the Kinect 360 provided by the ROS openni_camera package [37] are quite accurate, so an intrinsic calibration procedure has not been applied to it. However, the Kinect One has been calibrated with the method provided by the ROS iai_kinect2 package [23], which gives the camera intrinsic matrix, the distortion coefficients of the lens, the rotation matrix and the projection matrix for the RGB and IR cameras, and saves the information into a YAML file. Moreover, another YAML file is generated containing the matrix that relates both cameras, i.e. the rotation and translation from the RGB to the IR camera.

Figure 3.5 shows how the colour is now correctly applied on the depth data. This is clearly evident around edges, such as the edge of the Stäubli robot or the rugby ball.

Figure 3.5. Uncalibrated and calibrated point-cloud.

The iai_kinect2 package uses a chessboard pattern, a symmetric circle grid or an asymmetric circle grid to calibrate the intrinsic parameters. Typical calibration steps for the optical and IR (depth) sensors are:

1. Start at a short distance, where the calibration pattern covers most of the image.
2. Tilt the calibration pattern vertically and horizontally.
3. Move the pattern further away.
4. Move the pattern with different orientations around most of the image.

Figure 3.6. Calibration steps with the chessboard pattern.
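Since the intrinsic model is stored in a file, it can be reloaded at start-up so that the calibration is performed only once. A minimal sketch using OpenCV's FileStorage is shown below; the file name and the key names (camera_matrix, distortion_coefficients) are assumptions for illustration and do not necessarily match the exact keys written by iai_kinect2:

// load_intrinsics.cpp: reads a previously saved intrinsic calibration from a YAML file
#include <opencv2/core.hpp>
#include <iostream>

int main()
{
    cv::FileStorage fs("calib_color.yaml", cv::FileStorage::READ);
    if (!fs.isOpened())
    {
        std::cerr << "Could not open the calibration file" << std::endl;
        return 1;
    }
    cv::Mat cameraMatrix, distCoeffs;
    fs["camera_matrix"] >> cameraMatrix;             // 3x3 intrinsic matrix
    fs["distortion_coefficients"] >> distCoeffs;     // lens distortion coefficients
    fs.release();
    std::cout << "K = " << cameraMatrix << std::endl;
    return 0;
}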

Figure 3.7 shows the pattern used to calibrate the optical and IR sensors of the Kinect One:

Figure 3.7. Chess5x7x0.03 pattern detected for the optical and IR sensors.

The extrinsic calibration determines the transformation from the camera coordinate system to the world coordinate system. This transformation can be decomposed into a rotation and a translation. Homogeneous transformations based on 4x4 real matrices are the representation most often used within the robotics and computer vision community. The homogeneous matrix is described below in block-partitioned form [38]:

{}^{A}T_{B} = \begin{bmatrix} {}^{A}R_{B} & {}^{A}\mathbf{p}_{B} \\ \mathbf{0}^{T} & 1 \end{bmatrix}    (Eq. 3.1)

where {}^{A}R_{B} is the relative orientation and {}^{A}\mathbf{p}_{B} is the relative position of frame B expressed in frame A.

The pattern transformation matrix (the pose of the pattern in the world frame) is known, and the current pose of the camera with respect to the pattern coordinate system is also provided. Then, the camera coordinate system with respect to the world coordinate system is determined by multiplying the homogeneous transformation matrices:

{}^{W}T_{C} = {}^{W}T_{P} \, {}^{P}T_{C}    (Eq. 3.2)

where W, P and C denote the world, pattern and camera frames, respectively.
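A minimal sketch of Eq. 3.2 using Eigen is given below; the numeric values of the pattern pose and of the camera-to-pattern transform are placeholders:

// camera_to_world.cpp: composes homogeneous transformations as in Eq. 3.2
#include <Eigen/Geometry>
#include <cmath>
#include <iostream>

int main()
{
    // World <- Pattern: pose of the calibration pattern in the world frame (known by construction).
    Eigen::Isometry3d T_world_pattern = Eigen::Isometry3d::Identity();
    T_world_pattern.translate(Eigen::Vector3d(0.5, 0.0, 0.0));

    // Pattern <- Camera: pose of the camera in the pattern frame (e.g. obtained from solvePnP and inverted).
    Eigen::Isometry3d T_pattern_camera = Eigen::Isometry3d::Identity();
    T_pattern_camera.translate(Eigen::Vector3d(0.0, 0.0, 1.2));
    T_pattern_camera.rotate(Eigen::AngleAxisd(M_PI, Eigen::Vector3d::UnitY()));

    // World <- Camera = (World <- Pattern) * (Pattern <- Camera)
    Eigen::Isometry3d T_world_camera = T_world_pattern * T_pattern_camera;
    std::cout << "Camera pose in the world frame:\n" << T_world_camera.matrix() << std::endl;
    return 0;
}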

Figure 3.8 shows how multiple cameras can be calibrated with the same pattern coordinate system.

Figure 3.8. Extrinsic calibration of two cameras with the same pattern.

Figure 3.9 presents how a single camera can be calibrated with multiple patterns. This process determines the pose of the camera with respect to the world coordinate system more robustly, because it combines the information of multiple patterns.

Figure 3.9. Extrinsic calibration of a single camera with multiple patterns.

Note that symmetric markers, such as the chessboard used for intrinsic calibration (Figure 3.7), do not provide a correct orientation of the camera with respect to the world coordinate system: the algorithm is not able to distinguish whether the pattern is oriented up or down. This issue is solved by using asymmetric patterns. Reliable algorithms already exist and are widely used:

1. The asymmetrical circle pattern of the OpenCV library [39].
2. ARToolKit: calculates the real camera position and orientation relative to physical markers in real time [40].
3. The ROS package ar_pose: an Augmented Reality marker pose estimation package that uses ARToolKit. It provides a transform between the camera and a single AR marker, or an array of transforms for multiple AR markers [41].
4. AR Track Alvar: newer and more advanced than ARToolKit, which has been the core of other ROS AR marker packages [42].

AR markers are printouts that look much like QR codes. They have a special structure that makes it easy for common object recognition algorithms to find them inside an image or camera stream. First, they have a clear border that separates the marker from the outside world. Second, inside this border there is a unique pattern that is easily recognisable by image recognition algorithms, as it provides clear edges. Third, knowing the size of the marker helps to transform the 2D image coordinates into 3D world coordinates. However, all types of patterns have some restrictions, such as the marker size and the angle at which the camera looks at the marker.

Figure 3.10. Three examples of AR tags [42].

The OpenCV algorithms have been chosen for their simplicity instead of external ROS packages, which would add a ROS package dependency. Therefore, an asymmetrical circle grid is used (see Figure 3.13) and AR markers will be integrated in future work.

The method developed to calibrate the extrinsic parameters is the same whatever camera is used. The camera driver, which decodes the sensor information, has to be adapted to a ROS message topic in order to provide the colour image data to the extrinsic camera calibration algorithm. Figure 3.11 shows the inputs and outputs of the extrinsic calibration function. The inputs are the colour image topic, the intrinsic camera parameters and the position and orientation of all the asymmetric circle grid patterns in the scene with respect to the world coordinate system. The output is the position and orientation of the camera in the world coordinate system.

Figure 3.11. Extrinsic camera calibration block.

A diagram of the code structure is shown in Figure 3.12. Two OpenCV functions have been used for the generation of the extrinsic camera matrix: findCirclesGrid and solvePnP. The first one attempts to determine whether the input image contains a grid of circles; it locates the centres of the circles and returns a non-zero value if all of the centres have been found [43]. The second one estimates the object pose with respect to the pattern coordinates, given a set of object points, their corresponding image projections, the camera matrix and the distortion coefficients [43]. Then, Object3DChessboard determines an array of object points in the pattern coordinate system; these object points are correlated with the centres of the circles recognised in the input image. Finally, the pose of the camera in the world coordinate system is determined by a simple transformation: knowing the position and orientation of the detected pattern and the average rotation (rvec) and translation (tvec) of the camera with respect to the detected patterns, the extrinsic camera matrix is set.

- rvec: an output rotation vector that brings points from the model coordinate system to the camera coordinate system. This vector is based on the Rodrigues rotation formula, an efficient algorithm for rotating a vector in space given an axis and an angle of rotation [43].
- tvec: an output translation vector.

Figure 3.12. Block diagram of the camera set_pose function.
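The detection and pose-estimation steps can be sketched with the two OpenCV calls mentioned above. In the sketch, the grid size matches the 4x11 asymmetric grid of Figure 3.13, while the circle spacing and the camera parameters are placeholders; converting rvec/tvec into the camera pose in the world frame then follows Eq. 3.2:

// pattern_pose.cpp: detects an asymmetric circle grid and estimates its pose in the camera frame
#include <opencv2/calib3d.hpp>
#include <vector>

// Builds the 3D coordinates of the grid circle centres in the pattern coordinate system.
static std::vector<cv::Point3f> gridObjectPoints(cv::Size patternSize, float spacing)
{
    std::vector<cv::Point3f> points;
    for (int i = 0; i < patternSize.height; ++i)
        for (int j = 0; j < patternSize.width; ++j)
            points.emplace_back((2 * j + i % 2) * spacing, i * spacing, 0.0f);
    return points;
}

// Returns true and fills rvec/tvec (pattern pose in the camera frame) if the grid is found.
bool estimatePatternPose(const cv::Mat &image, const cv::Mat &cameraMatrix, const cv::Mat &distCoeffs,
                         cv::Mat &rvec, cv::Mat &tvec)
{
    const cv::Size patternSize(4, 11);                       // asymmetric circle grid 4x11 (Figure 3.13)
    std::vector<cv::Point2f> centers;
    if (!cv::findCirclesGrid(image, patternSize, centers, cv::CALIB_CB_ASYMMETRIC_GRID))
        return false;                                         // pattern not visible in this image
    std::vector<cv::Point3f> objectPoints = gridObjectPoints(patternSize, 0.02f);  // 2 cm spacing (assumed)
    return cv::solvePnP(objectPoints, centers, cameraMatrix, distCoeffs, rvec, tvec);
}

Inverting the pattern-in-camera transform given by rvec (converted to a rotation matrix with cv::Rodrigues) and tvec yields the camera pose in the pattern frame, which Eq. 3.2 then brings into the world frame.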

Figure 3.13 shows a detected asymmetric circle grid of 4x11 located in front of the camera.

Figure 3.13. Detected asymmetric circle grid 4x11 used for the extrinsic camera calibration.

3.3. Object recognition node

The most important concepts presented in this section are the point-cloud processing algorithms implemented, such as filtering, segmentation, keypoint detection and feature extraction, and how the object recognition node advertises a ROS service in order to provide object recognition whenever a client requests it.

3.3.1. Point-cloud processing

A general processing pipeline is presented in Figure 3.14. The main stages are: point-cloud filtering, which removes noise that can cause poor system performance and reduces the point-cloud density, thus decreasing the computational cost of the following stages; a segmentation stage, which permits distinguishing the points of the 3D object to recognise from the entire captured point-cloud scene; descriptor extraction, which captures the most relevant geometric information of the object before the comparison with an object model; and a last stage that computes the position and orientation of the matched object.

Figure 3.14. Point-cloud processing pipeline.

RGB-D cameras generate point-clouds with varying point densities. Additionally, measurement errors cannot be avoided, which complicates the estimation of point-cloud parameters such as surface normals. These irregularities lead to erroneous values and make the object recognition process fail. However, these issues can be addressed by applying some algorithms.

The point-cloud filtering process consists in removing the points of no interest, which do not provide useful information: points at infinity caused by unexpected ray reflections in the scene, unpredictable sensor noise, points with a not-a-number (NaN) value, a massive amount of data, occlusions, etc. The PCL functions used to address these issues are listed below (a code sketch of these filters is given after the list):

1. PassThrough filter [44]: performs a simple filtering along a specified dimension, cutting off values that are either inside or outside a given user range. This allows defining a workspace where the objects to be recognised are placed by the user. Figure 3.15 shows how this function works: one image contains the original point-cloud and the other only the points inside the workspace fixed by the user.

Figure 3.15. Example of pass-through filtering.

2. removeNaNFromPointCloud [45]: removes all the points of the input cloud that contain a NaN value. Most PCL functions used in the following stages cannot handle NaN values, so they cannot process the point-cloud if the NaN values are not removed beforehand.

3. VoxelGrid filter [46]: this PCL function is used to downsample the point-cloud in order to reduce the number of points. The method consists in using a 3D voxel grid, like a set of 3D boxes, and approximating all the points inside each voxel by their centroid. This approach is slower than approximating them by the centre of the voxel, but it represents the object surface more accurately. Figure 3.16 shows how this function works: one image contains the original point-cloud and the other the downsampled result.

Figure 3.16. Downsampling a point-cloud using a voxel grid.

Two additional functions have been developed:

4. Colour filtering: removes all the points which are not close to the colour threshold fixed by the user. The function follows the HLS model representation, which rearranges the geometry of RGB in order to be more intuitive than the Cartesian cube representation. HLS is defined as a double cone or cylinder and stands for hue, lightness and saturation. The two vertices correspond to the white and black colours, the angle defines the hue value, the distance to the vertical axis defines the saturation value, and the distance along the white-black axis corresponds to the lightness.

Figure 3.17. HLS model representation.

In order to compare the colours of the point-cloud with the colour threshold fixed by the user, it is convenient to obtain the pure hue values of the point-cloud. Note that points close to the white and black values are always filtered out. A radius threshold is applied in order to decide which values are close to the colour reference and which ones are close to the white and black colours (see Figure 3.18). Therefore, a point is set to its maximum saturation and to half lightness if it lies inside the blue region.

Figure 3.18. HLS model: three regions delimited. Points in the blue region are considered as colour.

Figure 3.19 shows the original point-cloud and the final one after applying the colour filtering:

Figure 3.19. An example of maximum colour saturation and half lightness.

5. Cluster filtering or radius outlier filtering: discards clusters with a low density of points. The user specifies the number of neighbours that every point must have within a specified radius to remain in the point-cloud. For example, if two neighbours and a given radius are specified, the blue points will be removed from the point-cloud (see Figure 3.20).

Figure 3.20. An example of point cluster filtering.

Figure 3.21 shows the clusters with a low density of points selected to be removed from the original point-cloud.

Figure 3.21. Cluster filtering process.
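The sketch below chains the PCL filters described above (PassThrough, NaN removal, VoxelGrid downsampling and radius outlier removal). The axis limits, leaf size and neighbour thresholds are illustrative values rather than the ones tuned for the project, and the custom colour filter (item 4) is omitted:

// filter_cloud.cpp: filtering and downsampling pipeline with PCL
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/filters/passthrough.h>
#include <pcl/filters/filter.h>                     // removeNaNFromPointCloud
#include <pcl/filters/voxel_grid.h>
#include <pcl/filters/radius_outlier_removal.h>
#include <vector>

typedef pcl::PointCloud<pcl::PointXYZRGB> Cloud;

Cloud::Ptr filterScene(const Cloud::ConstPtr &input)
{
    Cloud::Ptr passed(new Cloud), denoised(new Cloud), downsampled(new Cloud), filtered(new Cloud);

    // 1. Keep only the points inside the workspace along the camera z axis.
    pcl::PassThrough<pcl::PointXYZRGB> pass;
    pass.setInputCloud(input);
    pass.setFilterFieldName("z");
    pass.setFilterLimits(0.4, 1.5);                 // metres; the workspace limits are an assumption
    pass.filter(*passed);

    // 2. Remove the points with NaN coordinates.
    std::vector<int> indices;
    pcl::removeNaNFromPointCloud(*passed, *denoised, indices);

    // 3. Downsample with a voxel grid, replacing the points of each voxel by their centroid.
    pcl::VoxelGrid<pcl::PointXYZRGB> voxel;
    voxel.setInputCloud(denoised);
    voxel.setLeafSize(0.005f, 0.005f, 0.005f);      // 5 mm voxels (assumed)
    voxel.filter(*downsampled);

    // 5. Discard isolated points: each point must have enough neighbours within a radius.
    pcl::RadiusOutlierRemoval<pcl::PointXYZRGB> outlier;
    outlier.setInputCloud(downsampled);
    outlier.setRadiusSearch(0.02);                  // 2 cm search radius
    outlier.setMinNeighborsInRadius(10);
    outlier.filter(*filtered);

    return filtered;
}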

Segmentation methods divide large amounts of data into smaller clusters in order to regroup elements with the same properties and to decrease the computational cost of processing the data in the next stages. The segmentation method implemented is based on a colour-based region-growing algorithm. The region-growing approach consists in connecting two points that lie within a specified distance; if a point with the same properties is not close to another, the algorithm splits the remaining point-cloud into separate clusters. Colour-based region growing uses colour instead of point normals: clusters are made by merging points close to the same colour, and then two neighbouring clusters with a small difference between their average colours are merged together. During this step, every single cluster is checked for the number of points it contains; if this number is less than the specified threshold, the current cluster is merged with the closest neighbouring cluster.

Figure 3.22 shows how the RegionGrowingRGB function [47] splits the whole scene into different clusters based on colour:

Figure 3.22. Segmented scene with colour-based region growing.

Figure 3.23 and Figure 3.24 present how the object recognition module is capable of detecting single or multiple objects. This is possible with the colour-based region-growing segmentation, where each detected cluster can be an independent object (see Figure 3.23) or several clusters can form a single object with different colours (see Figure 3.24):

Figure 3.23. Segmented scene with multiple objects.

Figure 3.24. Segmented scene with a single object with different colours.

At this point, all the clusters are taken as keypoints for the following stages. The keypoints contain the important points that can describe the entire cloud; for this reason, the input cluster can be downsampled using the VoxelGrid filter [46] with a specific leaf size, and all the sub-sampled points become the keypoints. Then, the normals of these points are used to build the shape histogram used in the descriptors. A normal carries the information related to the curvature of the surface, and the normals can be computed with the NormalEstimation function [48].

Object recognition compares different sets of points, or templates, in order to find correspondences between models. The information that uniquely characterizes a point is known as a descriptor or point-feature representation [49]. Descriptors represent the information of an image or a point-cloud and are a fundamental part of an object recognition system. A descriptor is built from the information of the point's neighbourhood and its normals; if the point-cloud is not dense, the descriptors will lack information about the curvature of the object [50]. The PCL function used to extract the descriptors is:

1. FPFHEstimation [51]: a class to compute Fast Point Feature Histogram (FPFH) [52] descriptors from the input point-cloud and its surface normals. The FPFH descriptor is a local descriptor that improves the performance of the Point Feature Histogram (PFH) descriptor by only considering the direct connections between the keypoints and their neighbours, removing the additional links between neighbours. A PFH model is based on the relationships between the points in the k-neighbourhood and their estimated surface normals. Figure 3.25 presents a diagram of the PFH computation:

Figure 3.25. The influence region diagram for a PFH.

The point P_q, marked in red and placed in the middle of a circle (a sphere in 3D) with radius r, is fully interconnected in a mesh with all its k neighbours (the points with distances smaller than the specified radius). The final PFH descriptor is then computed from the relationships between all pairs of points in the neighbourhood, which represents a computational complexity of O(nk^2), with n the number of keypoints and k the number of neighbours of each point [52].

The FPFH algorithm reduces the computational complexity of the PFH algorithm to O(nk). This method computes the descriptor of a point P_q considering only the pairs between itself and its neighbours (the red lines illustrated in Figure 3.26), using a Simplified Point Feature Histogram (SPFH) [52]. An additional step is performed to compensate for this loss of information about the connections between neighbours: FPFH adds to the point P_q a weighted SPFH of its neighbours according to their distance. This procedure incorporates additional information about the connections between the neighbours and the point P_q up to a distance of 2r [52]:

FPFH(P_q) = SPFH(P_q) + \frac{1}{k} \sum_{i=1}^{k} \frac{1}{\omega_i} \, SPFH(P_i)    (Eq. 3.3)

where \omega_i represents the distance between the point P_q and its neighbour P_i.

The extra FPFH connections are shown with black lines in Figure 3.26. As the diagram shows, some of the value pairs will be counted twice (marked with thicker lines in the figure).

Figure 3.26. The influence region diagram for a FPFH.

To sum up, FPFH has been implemented in this project because it reduces the overall complexity of PFH, making it usable in real-time applications. FPFH includes additional point pairs outside the r-radius sphere (at most 2r away), but it does not fully interconnect all the neighbours of the point P_q, as can be seen in Figure 3.26. Some value pairs are missing, which may cause a loss of information about the geometry around the point; however, the re-weighting scheme recaptures some of the neighbouring point values using the SPFH, as presented in the equation above.

The final stage presented in Figure 3.1 consists in an alignment process called template registration or matching. By registering a template to a new point-cloud, the position and orientation of the object can be determined; moreover, a fitness score of the alignment is provided. Two PCL functions have been implemented to deal with the registration stage:

1. SampleConsensusInitialAlignment [53]: this method takes an object template as input and aligns it to the target cloud. The fitness score gives an idea of how successful the alignment was, and the transformation matrix gives the position and orientation of the object aligned to the target cloud; this matrix can be described in block-partitioned form as shown in Eq. 3.1. The Sample Consensus Initial Alignment (SAC-IA) procedure [52] uses point-cloud intrinsic, rotation-invariant features. However, its computational complexity is high, because the matching step requires analysing all possible correspondence pairs, searching for similar histograms in both point-clouds.

2. IterativeClosestPoint (ICP) [54]: minimises the difference between two point-clouds iteratively [55]; a transformation is estimated based on the Singular Value Decomposition (SVD). ICP requires an accurate initialization and mostly overlapping scans. This function has three termination criteria: a maximum number of iterations, a minimum difference between the previous and the current estimated transformation, and a minimum Euclidean squared error.

44 38 Report maximum number of iterations, a minimum difference value between the previous and the current estimated transformation, and a minimum Euclidean squared error. The fitness score is updated after ICP alignment. A high value indicates that either the alignment between template and object scene is not the same object or the alignment could require more iteration to get the best match between both point-clouds. A lower number means a better alignment. A transformation matrix that describes how template points should be rotated and translated in order to best align with the points in the target cloud is also provided. Note that the alignment score says a lot about the final results. The 3D object is recognised in the entire scene and its position and orientation with respect to the world coordinate system is fixed, if this parameter has a low value. A general registration pipeline used in this project is presented in Figure 3.27: Figure Block diagram of the registration procedure.
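
To make the pipeline of Figure 3.27 more concrete, the following sketch chains the PCL classes mentioned in this section (VoxelGrid, NormalEstimation, FPFHEstimation, SampleConsensusInitialAlignment and IterativeClosestPoint). It is only an illustration under assumed settings: the file names, leaf size and search radii are placeholders, not the values used in this project.

#include <iostream>
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <pcl/filters/voxel_grid.h>
#include <pcl/features/normal_3d.h>
#include <pcl/features/fpfh.h>
#include <pcl/registration/ia_ransac.h>
#include <pcl/registration/icp.h>

int main()
{
  pcl::PointCloud<pcl::PointXYZ>::Ptr scene(new pcl::PointCloud<pcl::PointXYZ>);
  pcl::PointCloud<pcl::PointXYZ>::Ptr model(new pcl::PointCloud<pcl::PointXYZ>);
  pcl::io::loadPCDFile("scene.pcd", *scene);   // placeholder file names
  pcl::io::loadPCDFile("model.pcd", *model);

  // 1. Downsample both clouds: the voxel centroids become the keypoints.
  pcl::VoxelGrid<pcl::PointXYZ> voxel;
  voxel.setLeafSize(0.005f, 0.005f, 0.005f);   // leaf size is scene dependent
  pcl::PointCloud<pcl::PointXYZ>::Ptr scene_keys(new pcl::PointCloud<pcl::PointXYZ>);
  pcl::PointCloud<pcl::PointXYZ>::Ptr model_keys(new pcl::PointCloud<pcl::PointXYZ>);
  voxel.setInputCloud(scene); voxel.filter(*scene_keys);
  voxel.setInputCloud(model); voxel.filter(*model_keys);

  // 2. Surface normals (the curvature information used by the descriptor).
  pcl::NormalEstimation<pcl::PointXYZ, pcl::Normal> ne;
  ne.setRadiusSearch(0.02);
  pcl::PointCloud<pcl::Normal>::Ptr scene_n(new pcl::PointCloud<pcl::Normal>);
  pcl::PointCloud<pcl::Normal>::Ptr model_n(new pcl::PointCloud<pcl::Normal>);
  ne.setInputCloud(scene_keys); ne.compute(*scene_n);
  ne.setInputCloud(model_keys); ne.compute(*model_n);

  // 3. FPFH descriptor for every keypoint.
  pcl::FPFHEstimation<pcl::PointXYZ, pcl::Normal, pcl::FPFHSignature33> fpfh;
  fpfh.setRadiusSearch(0.05);                  // should be larger than the normal radius
  pcl::PointCloud<pcl::FPFHSignature33>::Ptr scene_f(new pcl::PointCloud<pcl::FPFHSignature33>);
  pcl::PointCloud<pcl::FPFHSignature33>::Ptr model_f(new pcl::PointCloud<pcl::FPFHSignature33>);
  fpfh.setInputCloud(scene_keys); fpfh.setInputNormals(scene_n); fpfh.compute(*scene_f);
  fpfh.setInputCloud(model_keys); fpfh.setInputNormals(model_n); fpfh.compute(*model_f);

  // 4. Coarse alignment of the model template onto the scene (SAC-IA).
  pcl::SampleConsensusInitialAlignment<pcl::PointXYZ, pcl::PointXYZ, pcl::FPFHSignature33> sac;
  sac.setInputSource(model_keys);  sac.setSourceFeatures(model_f);
  sac.setInputTarget(scene_keys);  sac.setTargetFeatures(scene_f);
  pcl::PointCloud<pcl::PointXYZ> coarse;
  sac.align(coarse);

  // 5. Fine alignment with ICP, initialised with the SAC-IA result.
  pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
  icp.setInputSource(coarse.makeShared());
  icp.setInputTarget(scene_keys);
  icp.setMaximumIterations(100);
  pcl::PointCloud<pcl::PointXYZ> aligned;
  icp.align(aligned);

  // Object pose = ICP refinement composed with the SAC-IA transform; lower score = better fit.
  Eigen::Matrix4f pose = icp.getFinalTransformation() * sac.getFinalTransformation();
  std::cout << "fitness: " << icp.getFitnessScore() << "\n" << pose << std::endl;
  return 0;
}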

3.3.2. ROS service: objects_recognition

A ROS service is a communication mechanism to pass data between nodes [14]. A service client sends a request message to a service server and then waits for a response. This is a limitation of ROS services, because the client can wait forever if the service is not available for some reason, such as the server having died unexpectedly or the service name having been misspelled in the client call. A service allows one ROS node [11] to call a function that executes in another node. Each service has a unique name and a type that determines the structure of the request and response messages in the ROS network. A client can make a persistent connection to a service, which enables higher performance but less robustness to service provider changes. Service calls are well suited to operations that only need to be done occasionally and that take a bounded amount of time to complete. To get more information about ROS services, see the API documentation [14]. Service communication has the following characteristics:

Figure 3.28. ROS service scheme.

A single node initiates the request and only one node can receive it and send back a response. Service files are just lists of message types, which can be built using existing ROS messages [56] such as std_msgs, geometry_msgs, nav_msgs or sensor_msgs, or they can be messages defined by the user. A ROS message is a simple data structure used in the communication between ROS nodes. Messages can contain standard primitive types such as integer, floating point, boolean, etc. [56].

The object recognition node advertises a service named objects_recognition associated with the objectsrecognition.srv service type (see Figure 3.30). When the service node receives a client request, the server executes the function srvlocateobjects. This function runs the point-cloud processing algorithm explained in section 3.3.1. When the algorithm ends, the server sends a response to the client with the list of recognised objects and their poses with respect to the world coordinate system. Section 6.1 presents some tests and the quantitative results of this module. The scheme of the objects_recognition service can be seen in Figure 3.29.

Figure 3.29. Scheme of ROS service: object_recognition.

Finally, the files created to make the ROS service communication possible are presented in Figure 3.30. A single service file (.srv) and three message files (.msg) are created to build the ROS service communication. A service definition cannot embed another service inside it, but messages can contain other types of messages. In the service definition, the request and the response are separated with three dashes (---).

Figure 3.30. Service and message types available in this project.
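
As an illustration of this layout (the real field names are those shown in Figure 3.30 and are not reproduced here), a service definition of this kind could look as follows, together with a minimal roscpp client; the package name, field names and types are hypothetical.

int32 max_objects
---
string[] names
geometry_msgs/Pose[] poses
float64[] fitness_scores

#include <ros/ros.h>
#include <object_recognition/objectsRecognition.h>  // hypothetical package and generated service header

int main(int argc, char** argv)
{
  ros::init(argc, argv, "objects_recognition_client");
  ros::NodeHandle nh;
  // The call() below blocks until the server answers or the connection fails.
  ros::ServiceClient client =
      nh.serviceClient<object_recognition::objectsRecognition>("objects_recognition");

  object_recognition::objectsRecognition srv;
  srv.request.max_objects = 1;
  if (client.call(srv))
    ROS_INFO("Recognised %zu object(s)", srv.response.names.size());
  else
    ROS_ERROR("Service objects_recognition is not available");
  return 0;
}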

4. Grasp and Motion Planning Module

General robotic applications need the ability to interact with and manipulate objects in the physical world. Humans see specific objects and know immediately how they would grab them to pick them up; robotic grasp detection lags far behind human performance. This module focuses, firstly, on the problem of finding a good grasp given a 3D object model and its pose and, secondly, on finding an optimal trajectory in a collision-free space from the initial configuration of the bimanual manipulator to the target object. The module computes precision grasps for bulky objects using two anthropomorphic hands. Figure 4.1 shows the block diagram of the module indicating the inputs and outputs:

Figure 4.1. Object Grasp and Motion Planning Module block.

4.1. Object grasp node

This section presents the node responsible for computing the grasps for bulky objects using two anthropomorphic hands. This type of robotic hand has been developed as a versatile end-effector tool able to perform complex grasp and manipulation tasks. There are several approaches to finding good grasp points [57], [58], [59], but this project is focused on the approach developed at the IOC [8]. Figure 4.2 shows the object grasping points block with the inputs and the output elements:

Figure 4.2. Object Grasping points block.

4.1.1. Approach overview

Most projects define a grasp as an end-effector configuration which achieves partial or complete form closure of a given object. This is a challenging problem because it depends on the pose and configuration of the robotic gripper as well as on the shape and physical properties of the object to be grasped, so it requires a search over a large number of possible gripper configurations.

The work in [8] deals with the problem of how to grasp bulky objects that cannot be grasped with only one hand, using two anthropomorphic hands simultaneously. The software library provided considers the constraints of the hands and arms in the search for the contact points on the surface of the object. The approach relies on a full 3D object model to find an appropriate grasp. The method consists of searching, in an iterative way, for three reachable contact points for each hand. Using frictional contact points, two fingers and the thumb of each hand are sufficient to obtain a robust grasp. The approach has the following steps (a minimal sketch of this search loop is given after Figure 4.3):

1. The surface of the object model is divided into two sets of slices, one set for each hand.
2. A set of triplets, each composed of three points with their normal directions, is computed in each slice.
3. Each triplet of the first set is tested together with each triplet of the second set in order to find a couple of triplets that satisfies the force-closure condition.
4. The grasp quality is evaluated as the largest perturbation that the grasp can resist in any direction.

In the end, the algorithm returns the solution grasp (see Figure 4.3) if a collision-free path for the whole system is found. The method tries another couple of triplets in case the inverse kinematics or the collision check does not return a solution. If, after exploring the object through all triplets and along its three axes, no grasp points have been found, the algorithm returns an error [8]. To illustrate the proposed approach, Figure 4.3 shows a rugby ball divided into blue and orange slices and the resulting contact points:

Figure 4.3. Rugby ball divided in slices based on its inertial axis and the resulting contact points.
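
The following C++ skeleton is only a sketch of the iterative search described above; the predicates are trivial placeholders standing in for the GA2H force-closure test, the GIKL inverse kinematics call and the collision check, none of which are reproduced here.

#include <optional>
#include <vector>

// A triplet groups three candidate contact points and their surface normals (details omitted).
struct Triplet { /* three contact points and their normals */ };
struct Grasp   { Triplet left, right; double quality; };

// Placeholder predicates: in the real system these are the GA2H, GIKL and collision-check calls.
bool forceClosure(const Triplet&, const Triplet&)              { return true; }
bool reachableAndCollisionFree(const Triplet&, const Triplet&) { return true; }
double graspQuality(const Triplet&, const Triplet&)            { return 1.0; }

// Iterate over couples of triplets, one from each set of slices, until a feasible grasp is found.
std::optional<Grasp> searchGrasp(const std::vector<Triplet>& leftTriplets,
                                 const std::vector<Triplet>& rightTriplets)
{
  for (const Triplet& tl : leftTriplets)
    for (const Triplet& tr : rightTriplets)
    {
      if (!forceClosure(tl, tr)) continue;                // step 3: force-closure condition
      if (!reachableAndCollisionFree(tl, tr)) continue;   // reject if IK or collision check fails
      return Grasp{tl, tr, graspQuality(tl, tr)};         // step 4: quality of the accepted grasp
    }
  return std::nullopt;                                    // no feasible grasp: the caller reports an error
}

int main()
{
  std::vector<Triplet> left(3), right(3);                 // placeholder triplet sets
  return searchGrasp(left, right) ? 0 : 1;
}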

4.2. Robot inverse kinematics node

Kinematics is the branch of physics that studies the motion of systems without considering the forces, masses or moments that cause the motion. Robot kinematics thus refers to the analytical study of the motion of a robot manipulator, analysing the relationship between the dimensions and connectivity of kinematic chains and the pose (position and orientation) of each link of the manipulator [38]. It is therefore crucial to define suitable kinematic models of a robot mechanism in order to analyse the behaviour of manipulators.

A manipulator is characterized by an arm that provides mobility, a wrist that confers dexterity and an end-effector that performs the robot's task. The mechanical structure of a robot manipulator consists of a sequence of rigid bodies, called links, interconnected by means of joints, which provide pure rotation or translation between two consecutive links. A kinematic model is therefore built as a hierarchic structure with parent-child relationships: if a joint is rotated around an arbitrary axis, all its children also rotate around the same axis because they inherit all of their parent's transformations [38].

The robot kinematics describes the analytical relationship between the joint positions and the pose of the end-effector. Hence, the formulation of the kinematic relationships allows the study of the forward kinematic problem and the inverse kinematic problem. Forward kinematics uses the kinematic equations of a manipulator to compute the position of the end-effector from specified values of the joint parameters. In contrast, for serial manipulators inverse kinematics is a much more difficult problem, because it involves the transformation of cartesian coordinates, i.e. positions and orientations of the end-effector, into joint coordinates of the robot manipulator. In fact, the existence of revolute joints makes the problem more difficult to solve because of the presence of nonlinear equations [38]. The following figure shows the relationship between forward and inverse kinematics:

Figure 4.4. The schematic representation of forward and inverse kinematics.

There are two main techniques used to solve an inverse kinematic problem: the analytical one and the numerical one [60]. With the first method, the joint variables are solved analytically according to a given end-effector pose, while with the second one, the joint variables are obtained with an iterative technique. There are several methods for solving inverse kinematic problems numerically, including cyclic coordinate descent methods, pseudoinverse methods, Jacobian transpose methods, Levenberg-Marquardt damped least squares methods, quasi-Newton and conjugate gradient methods, and neural network and artificial intelligence methods [61]. The GIKL provides three Jacobian IK solvers: the Transpose method, the Pseudoinverse method and the Damped Least Squares method. Figure 4.5 shows the block responsible for calculating the joint positions from the robot model topology and the object grasping points computed in the previous stage (see section 4.1):

Figure 4.5. Robot inverse kinematics block.

In annex A.1, additional information about robot kinematics theory and numerical inverse kinematics solvers is provided.

4.2.1. Robot model simulation

Creating a complete virtual model of a robot, by simulating its components and control programs, can significantly impact the overall efficiency of a project. Some benefits of simulation are:

- Reducing the costs involved in robot manufacturing.
- Simulating various alternatives without incurring physical costs.
- Testing robots or components before implementation.
- Demonstrating a system to determine whether it is viable or not.
- Compatibility with a wide range of programming languages.

The Unified Robot Description Format (URDF) is an eXtensible Markup Language (XML) [34] format for representing a robot model. Its main limitation is that only tree structures can be represented, excluding all parallel robots. Also, the specification assumes the robot consists of rigid links connected by joints, so flexible elements are not supported. The XML specifications for robot models are:

- Sensor: describes a sensor such as a camera, ray sensor, IR, etc.
- Link: describes the kinematic and dynamic properties of a link.
- Transmission: transmits link actuators to joints and represents their mechanical coupling.
- Joint: describes the kinematic and dynamic properties of a joint.
- Gazebo: describes simulation properties such as damping, friction, etc.
- Model_state: describes the state of a model at a certain time.
- Model: describes the kinematic and dynamic properties of a robot structure [62].

The specifications highlighted in bold above consist of: the kinematic and dynamic description of the robot, the visual representation of the robot, and the collision model of the robot. The description of a robot model involves a set of link elements and a set of joint elements connecting the links together. The link element describes a rigid body with inertia, collision box and visual features [63]. The joint element describes the kinematics and dynamics of the joint and also specifies the joint limits [64]. Figure 4.6 shows the main attributes of a link and a joint element:

Figure 4.6. Link and Joint representations inside URDF.

4.2.2. General Inverse Kinematics Library (GIKL)

Nowadays, the number of robot manipulator applications is increasing and robot morphologies change according to the required functionality. Computing a closed-form solution for the inverse kinematics of a general manipulator can lead to complex problems. An inverse kinematics library that uses different numerical methods to obtain a suitable configuration for any robot pose is therefore a good alternative, mainly because it generalizes to all kinds of n-DoF, tree-like structures and mobile manipulators, and because it avoids computing the analytical formulation for each robot, which only exists in some cases.

There are some robot kinematics libraries available that satisfy these initial needs, are very robust and have quite good performance. In this project, it has been decided to choose one of these libraries and expand its functionality while improving its performance. This is the case of the Orocos Kinematics and Dynamics Library, commonly known as KDL, which computes the robot inverse kinematics with the Jacobian pseudoinverse but has some drawbacks [65]:

1. Difficulty converging in the presence of joint limits.
2. Only one Jacobian pseudoinverse method implemented.
3. Difficulty converging with tree manipulators.
4. No action taken when local minima appear during the iteration process.
5. No function to load a robot from a file.

Issues 3-5 above can be mitigated by simple implementation enhancements to the KDL library, such as using a URDF file to load any open manipulator, implementing the analytical Jacobian solution for tree manipulators, and taking action when local minima appear. Issues 1 and 2 require the consideration of alternative IK algorithms such as the Jacobian Transpose and Damped Least Squares methods. All these issues are addressed in the General Inverse Kinematics Library (GIKL).

GIKL is a C++ library developed at the Institute of Industrial and Control Engineering (IOC) which deals with the main kinematic problems of any tree-like robot structure. The purpose of using this library in this project is to find the final joint configuration of the robot (see Figure 2.10) that reaches the grasping points of the object with the two robotic hands, so an inverse kinematic problem has to be solved. Figure 4.7 (a particularization of Figure 4.5) shows the input parameters required to use the inverse_kinematic function and the output variables:

Figure 4.7. GIKL inverse kinematics block diagram.

First, the path of the URDF file has to be provided to the IK solver. The solver internally constructs the kinematic robot structure according to the information in this file and also identifies all the robot's end-effectors. The manipulator presented in Figure 2.10 has eight end-effectors, which correspond to the fingers of the two Allegro Hands. Second, the desired positions and orientations of the robot's end-effectors have to be provided. The numerical IK solver type, the initial joint configuration of the robot, the maximum number of iterations allowed and the maximum error in position and orientation allowed have to be set, too. Finally, the inverse_kinematic function computes the final joint configuration of the robot, placing each end-effector at the target points of the object with the desired orientation. The solver tries to find the solution before reaching the maximum number of iterations and it returns the final error in position and orientation. In annex A.2, additional information about the GIKL structure and its functionality is provided.
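
The GIKL interface itself is documented in annex A.2. Purely as an illustration of the damped least squares update that such solvers iterate, the following self-contained example solves the position IK of a toy two-link planar arm with Eigen; the link lengths, damping factor and target are arbitrary values chosen for the example, not parameters of the bimanual system.

#include <Eigen/Dense>
#include <cmath>
#include <iostream>

// Toy 2-link planar arm (link lengths 0.5 m) used only to illustrate the DLS update
// that GIKL applies to the full bimanual tree built from the URDF.
static const double L1 = 0.5, L2 = 0.5;

Eigen::Vector2d fk(const Eigen::Vector2d& q)
{
  return { L1*std::cos(q(0)) + L2*std::cos(q(0)+q(1)),
           L1*std::sin(q(0)) + L2*std::sin(q(0)+q(1)) };
}

Eigen::Matrix2d jacobian(const Eigen::Vector2d& q)
{
  Eigen::Matrix2d J;
  J << -L1*std::sin(q(0)) - L2*std::sin(q(0)+q(1)), -L2*std::sin(q(0)+q(1)),
        L1*std::cos(q(0)) + L2*std::cos(q(0)+q(1)),  L2*std::cos(q(0)+q(1));
  return J;
}

int main()
{
  Eigen::Vector2d q(0.1, 0.1);              // initial joint configuration
  Eigen::Vector2d x_des(0.6, 0.4);          // desired end-effector position
  const double lambda = 0.1, tol = 1e-5;    // damping factor and position tolerance

  for (int it = 0; it < 200; ++it)          // bounded number of iterations
  {
    Eigen::Vector2d e = x_des - fk(q);      // task-space error
    if (e.norm() < tol) break;              // converged within the allowed error
    Eigen::Matrix2d J = jacobian(q);
    // DLS step: dq = J^T (J J^T + lambda^2 I)^-1 e
    Eigen::Matrix2d damped = J * J.transpose() + lambda*lambda*Eigen::Matrix2d::Identity();
    q += J.transpose() * damped.ldlt().solve(e);
  }
  std::cout << "q = " << q.transpose()
            << ", error = " << (x_des - fk(q)).norm() << std::endl;
  return 0;
}

The damping term lambda²·I is what keeps the update well behaved near singular configurations, which is the main practical difference with respect to the plain pseudoinverse method.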

The complete evaluation of the GIKL library is presented in annex A.3. The most important results are presented in Table 4.1:

Table 4.1. Performance of the GIKL.
Chain/Tree DoF GIKL pinv solve rate GIKL pinv avg time GIKL transpose solve rate GIKL transpose avg time GIKL DLS solve rate GIKL DLS avg time UR5 5 kukalwr 7 87,50% 4,22ms 49,00% 183,04ms 88,90% 7,79ms 100,00% 2,35ms 63,30% 158,63ms 100,00% 4,88ms Bimanual manipulator % 1555ms

4.3. Motion planning node

Robot motion planning is responsible for finding a solution plan from an initial joint configuration to a target joint configuration. In addition, the algorithm searches for a collision-free path, so that the manipulator avoids collisions with its environment until it reaches the target object, where contact is accepted. Figure 4.8 shows the block diagram of the module indicating the inputs and outputs:

Figure 4.8. Robot Motion planning block.

4.3.1. Introduction to motion planning

The motion planning problem tries to produce a continuous motion that connects a start configuration (q_ini) and a goal configuration (q_goal) while avoiding collisions with known obstacles. Many robot applications require the motion planning formulation, such as mobile robot navigation, robot autonomy optimization, grasping objects, etc. A robot motion planning problem can be stated as follows:

- Fix a start pose of the robot.
- Set a desired goal pose of the robot.
- Describe the geometric structure of the robot.
- Provide the geometric description of the world.

The algorithm tries to find a path that moves the robot gradually from start to goal in configuration space. The configuration space (C-Space) is the set of all possible robot configurations relative to a fixed coordinate system; usually a configuration is expressed as a vector of positions and orientations. In addition, the free space (C_free) and the obstacle region (C_obs) are defined inside the C-Space: the free space contains all robot configurations that avoid collision with obstacles. Sampling-based algorithms explore the free C-Space by sampling it with enough points to find a solution. Some of these algorithms are:

- The Probabilistic Roadmap Method (PRM) [66].
- The Rapidly-exploring Random Trees (RRT) [67].
- RRT Connect (bidirectional version of RRT) [68].
- Expansive Space Trees (EST) [69].
- Single-query Bi-directional planner with Lazy collision checking (SBL) [70].
- Kinodynamic motion Planning by Interior-Exterior Cell Exploration (KPIECE) [71].

The Rapidly-exploring Random Tree (RRT) is a single-query planner which attempts to solve a query as fast as possible and does not focus on exploring the entire map as PRM does. RRT uses a random sequence to construct a single tree rooted at q_ini that grows toward the q_goal configuration [67]. The RRT algorithm consists of the following steps (a minimal 2D code sketch of this construction is given further below, after the Kautham problem definition):

1. First, q_rand is selected randomly from a uniform distribution over the C-Space.
2. Then, the algorithm adds a new configuration (q_new) to the tree depending on the step size and on q_near, the closest node of the tree; q_new is obtained by moving from q_near by one step size toward q_rand.

Figure 4.9 shows the procedure used by RRT to add a new configuration based on q_rand and the step size from q_near:

Figure 4.9. The basic RRT construction algorithm.

It is important to take into account the tradeoff between the exploration of C_free and the number of samples added to the tree, especially for high-dimensional problems. If the step size is small, the exploration steps are short and the nodes of the tree are close together. As the number of nodes becomes large, memory consumption increases and finding the nearest neighbour becomes expensive, which reduces the performance of the planner [72]. Figure 4.10 shows an example of a solved RRT problem:

Figure 4.10. Final RRT construction.

Finally, the implementation of a bidirectional version of RRT such as RRT-Connect, which grows two trees toward each other, rooted at q_ini and q_goal, until they merge into one, improves the efficiency of the motion planning system [68].

4.3.2. Basic features of the Kautham Project

The Kautham Project is developed in the C++ programming language and uses the OMPL library for the large number of planners that it provides. Different basic planners can be easily used and parameterized, allowing users to learn different planning algorithms in depth. This application software can simulate the same problem with several planners and different strategies, changing their parameters such as the maximum number of sampling points, the maximum number of nearest neighbours, the maximum distance to connect two nodes without collision, the sampler method, etc. Moreover, the application can visualize the solution path of the formulated problem and animate it [28]. The main information required to define a problem is:

- The path to the robot and obstacle description files: the bimanual manipulator with Allegro Hands (URDF) and the object.
- The initial and final configurations of the robot(s): the initial and final joint positions of the bimanual manipulator provided by the inverse kinematics block (section 4.2).
- The configuration of the obstacle(s): the object's position and orientation with respect to the world coordinate system, provided by the object recognition module (section 3).
- Other parameters of the planner.
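
The following is the minimal 2D sketch referred to in section 4.3.1: an RRT grown in a unit-square configuration space with a single disc obstacle, illustrating only the sample / nearest-node / fixed-step extend loop. It is unrelated to the OMPL planners actually used through the Kautham Project, and all sizes and limits are arbitrary.

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <vector>

// Minimal 2D RRT: each node stores its configuration and the index of its parent.
struct Node { double x, y; int parent; };

// Toy obstacle region C_obs: a disc of radius 0.2 centred at (0.5, 0.5).
bool collisionFree(double x, double y)
{
  return std::hypot(x - 0.5, y - 0.5) > 0.2;
}

int main()
{
  const double step = 0.05, goal_tol = 0.05;
  const Node q_init{0.1, 0.1, -1}, q_goal{0.9, 0.9, -1};
  std::vector<Node> tree{q_init};
  std::srand(0);

  for (int it = 0; it < 20000; ++it)
  {
    // 1. Sample q_rand uniformly in the unit-square C-Space.
    double rx = std::rand() / double(RAND_MAX);
    double ry = std::rand() / double(RAND_MAX);

    // 2. Find q_near, the tree node closest to q_rand.
    int near = 0;
    double best = 1e9;
    for (std::size_t i = 0; i < tree.size(); ++i)
    {
      double d = std::hypot(tree[i].x - rx, tree[i].y - ry);
      if (d < best) { best = d; near = int(i); }
    }
    if (best < 1e-9) continue;

    // 3. Move from q_near toward q_rand by one step size to obtain q_new.
    double s = std::min(1.0, step / best);
    double nx = tree[near].x + s * (rx - tree[near].x);
    double ny = tree[near].y + s * (ry - tree[near].y);
    if (!collisionFree(nx, ny)) continue;                 // keep only configurations in C_free
    tree.push_back({nx, ny, near});

    // 4. Stop when the new node is close enough to the goal configuration.
    if (std::hypot(nx - q_goal.x, ny - q_goal.y) < goal_tol)
    {
      std::cout << "path found after " << tree.size() << " nodes" << std::endl;
      return 0;
    }
  }
  std::cout << "no path found" << std::endl;
  return 0;
}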

The Kautham Project has been encapsulated as a ROS service, so the Object Recognition and Grasping system can use the ROS service communication mechanism to pass data between the manager_node and the kautham_node. Each service has a unique name and a type that determines the structure of the request and response messages in the ROS network. Some of the services defined by the kautham_node server are:

- kautham_node/setquery
- kautham_node/checkcollision
- kautham_node/openproblem
- kautham_node/closeproblem
- kautham_node/setinit
- kautham_node/setgoal
- kautham_node/setrobotsconfig
- kautham_node/setobstaclesconfig
- kautham_node/setrobcontrols
- kautham_node/setplanner
- kautham_node/setplannerparameter
- kautham_node/solve
- kautham_node/addobstacle
- kautham_node/connect
- etc.

A service example of the kautham_node is presented in Figure 4.11. In case the system needs to set the initial and goal queries, the manager_node sends a request to the server node and waits for a response. The initial and goal queries are determined by the initial and final joint positions of the bimanual manipulator provided by the inverse kinematics block.

Figure 4.11. ROS service scheme of the kautham_node/setquery and its service file.

5. Manager and Execution Module

The manager and execution module is responsible for the supervision and control of the different modules of the system. The manager is designed to accomplish the requirements of the user by using the functionalities of the object recognition module and the grasping module. Figure 5.1 shows the block diagram of the module indicating the inputs and outputs:

Figure 5.1. Manager and Execution Module block.

The following subsections detail the user input file definition, how a ROS client node is constructed to call a ROS service, the execution pipeline and other features.

5.1. Input data files

The Object Recognition and Grasping System uses simple text files to load the user input parameters needed to accomplish the tasks. One text file is used for the camera calibration setup and contains the most important parameters needed by several OpenCV functions described in section 3.2. This file has to include the following arguments:

1. PATTERN_TYPE: ASYMMETRIC_CIRCLES_GRID.
2. PATTERN_SIZE: number of inner corners per chessboard row and column.
3. SQUARE_SIZE: real distance, measured in meters, between pattern corners.
4. PATTERN_POSE: the pose of the pattern with respect to the world reference frame, described by the cartesian position in meters (x, y, z) and the angle-axis description of the orientation in degrees (ax, ay, az, angle).

A text file example is shown in Figure 5.2:

Figure 5.2. Example of user input file used in the camera calibration stage.

A second file provides important information to the system, such as how many objects have to be found in the scene, the model to be recognised, the hue colours of the object and the dimensions of the workspace where the objects are. This file has to include the following arguments:

1. OBJ_SEARCH: single object or multiple objects to be recognised in the scene.
2. MODEL: the path where the model template is.
3. COLORS: the hue levels which the system will focus on.
4. WORKSPACE: dimensions in meters of the box where the system will focus on finding the objects. The box is located at the origin of the world reference frame.

A text file example is shown in Figure 5.3:

Figure 5.3. Example of user input file used in the manager and execution module.
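
Since Figures 5.2 and 5.3 are not reproduced here, the following lines only illustrate what such files might look like. The keys come from the lists above and the pattern size matches the 4x11 asymmetric circle grid mentioned in section 6.1.1, but every value (square size, pattern pose, model path, hue range and workspace dimensions) is a placeholder chosen for the example, not the configuration actually used in this project. A calibration file could contain:

PATTERN_TYPE ASYMMETRIC_CIRCLES_GRID
PATTERN_SIZE 4 11
SQUARE_SIZE 0.02
PATTERN_POSE 0.5 0.0 0.0 0.0 0.0 1.0 90.0

and the object recognition file:

OBJ_SEARCH SINGLE
MODEL /path/to/models/rugby_ball.pcd
COLORS 20 40
WORKSPACE 1.0 1.0 0.5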

5.2. Manager node

A general execution pipeline is presented in Figure 5.4. A ROS node is built to implement these execution stages: request the objects_recognition service to get the position and orientation of the objects in the scene; use GA2H to compute the grasping points of the recognised object; compute the inverse kinematics of the bimanual manipulator; and, finally, run the motion planning to produce a continuous motion that connects a start configuration and a goal configuration while avoiding collisions with obstacles.

Figure 5.4. Execution processing pipeline.

First, the manager node reads the user input files described in section 5.1 and sends a service request to locate the object or objects in the scene using the objects_recognition service (see section 3.3.2). While the request is being processed, the client node waits for the response.

Second, the manager executes the GA2H library to compute the grasping points of the recognised object. The inputs are the name of the model, the PCD file of the model object and its position and orientation. The outputs are the target points on the object where the fingertips of the two Allegro Hands will be placed (see section 4.1).

Then, the manager node executes the GIKL library to solve the inverse kinematics of the bimanual manipulator. The manager sends the desired position and orientation for each fingertip of both Allegro Hands. The library also needs the URDF file of the bimanual manipulator in order to build internally the tree-like structure of the manipulator. The inverse kinematics solver computes the joint positions from the initial joint positions, a method (pseudoinverse, transpose or DLS), the maximum number of iterations and error allowed, and the vector of desired end-effector poses. More information is presented in section 4.2.

Finally, the manager node queries the Kautham Project to solve a basic motion planning problem, determining a continuous motion that connects a start configuration of the bimanual manipulator and a goal configuration. The goal configuration of the bimanual manipulator corresponds to the final joint positions computed in the previous stage. Furthermore, the Kautham Project determines a path in which the continuous motion of the manipulator avoids collisions with the known obstacles in the scene. Additional information about motion planning can be found in section 4.3. When the solution path is found, the Kautham Project offers the possibility to execute the plan simultaneously in a virtual simulation and in a real execution [73]. More information about how to run the Kautham Project is given in Annex C.

6. Results

The performance of the Object Recognition and Grasping System is analyzed separately for each module. Quantitative and qualitative results of the object recognition module are presented first, and some object grasps are illustrated at the end of the chapter.

6.1. Evaluation of the Object Recognition Module

6.1.1. Camera calibration

The calibration process of any type of sensor plays an important role in obtaining good system performance. All RGB-D cameras used in this project have been calibrated with the methodology described in section 3.2. The Object Recognition Module processes the point-cloud data provided by the camera sensors. Therefore, the system has to compute the intrinsic transformation matrix of each camera in order to align the colour and depth images, due to the offset between both sensors. Figure 6.1 shows how the colour is now correctly applied on the depth data; this is clearly visible around edges such as the edge of the computers or the rugby ball.

Figure 6.1. Uncalibrated and calibrated point-cloud.

The extrinsic camera calibration defines the transformation from the camera coordinate system to the world coordinate system. This process requires the definition of an asymmetric pattern and its transformation to the world coordinate system; the camera then determines its pose with respect to the world by recognising the pattern. Section 3.2 explained in depth the asymmetric patterns used and the functions implemented to calibrate the extrinsic parameters of a camera. In this section, two different configurations are provided in order to study the extrinsic calibration performance. Note that each camera views a single pattern in both cases. Running multiple Kinects at the same time causes interference in the depth sensors (see the results in section 6.1.2). Thus, it is important to minimise the area captured by both depth sensors in common, and to place the cameras in such a way that the interference caused by the projection of the infrared light patterns is minimised (see section 3.1.2).

The first configuration consists of object recognition using two cameras capturing both sides of the object. The cameras can be placed at 180º from each other, and two patterns allow calibrating both cameras with respect to the world coordinate system, as Figure 6.2 shows:

Figure 6.2. Object recognition with two cameras and two patterns.

The second configuration consists of object recognition using two cameras which capture different sides of the object and a common area of it. Both cameras are placed at almost 90º from each other in order to detect the same pattern while minimising the common area of the object captured.

Figure 6.3. Object recognition with two cameras and a single pattern.

The main problem of using two patterns to find the transformation matrices that link the cameras to the world coordinate system is the numerical error committed. Thus, when the two point-clouds are merged, there is an offset in translation and rotation. This can be minimised by setting both patterns in the world coordinate system with high precision, but other factors can still add numerical errors to the transformations. The second configuration, on the other hand, presents an object recognition system with two cameras and a single pattern. This reduces the numerical error because both cameras focus on the same pattern and, thus, the same transformation matrix is applied for both cameras. The results of merging the two point-clouds captured in both configurations can be seen in Figure 6.4:

Figure 6.4. Results of merging two point-clouds: first configuration (left), second configuration (right).

Even so, the second configuration does not merge both point-clouds perfectly. This could come from the type of pattern used. As commented in section 3.2, the asymmetrical circle grid (4x11) is used for the extrinsic camera calibration. This pattern is not square, and more circles define the z axis than the y axis of the pattern with respect to the world coordinate system. So, when the two point-clouds are merged, more offset error appears in the y axis than in the z axis. Moreover, the largest offset between the point-clouds appears in the x axis, because the distance between the camera frame and the pattern frame is determined by 2D image processing functions: the functions estimate the x distance according to the size of the circles (small circles indicate a larger distance and vice versa).

However, the translation offsets are relatively small when the object recognition is focused on bulky objects such as a rugby ball, and orientation offsets do not appear between the point-clouds. The object recognition module is able to determine the pose of the object in the presence of these translation offsets (see Figure 6.5).

Figure 6.5. Translation offsets between two merged point-clouds using the same pattern calibration.

6.1.2. Multiple Kinects evaluation

This project presents some configurations where the RGB-D cameras are set at different positions, analysing qualitatively which scenario provides the best performance. However, the use of multiple structured-light cameras such as the Kinect 360 at the same time has some drawbacks. The Kinect 360 measures 3D surfaces by projecting patterns of infrared light. Using multiple Kinects 360 causes overlapping views of these light patterns, which confuses the depth sensors of both Kinects 360 and results in holes and noise in the computed depth images. Figure 6.6 shows a depth map when only a single Kinect 360 is running and when two Kinects 360 are projecting their light patterns at the same time:

Figure 6.6. Depth image without interferences (left) and with interferences (right).

When the point-cloud is constructed, the interferences also appear on it (see Figure 6.7). Thus, it is important to minimise the surface captured by both depth sensors in common in order to obtain a good point-cloud quality; the interferences are minimised if the cameras are located correctly. In addition, the Kinect One is a Time-of-Flight camera, which measures the time that a pulse of light takes to travel the distance between the camera and the surface (see section 3.1.2). Although its technology is different from that of the Kinect 360, the infrared light projected by the Kinect One also interferes with the Kinect 360 sensors, and vice versa.

Figure 6.7. Point-cloud without interferences and with interferences.

Despite the loss of information in the final point-cloud due to the interferences, the system with multiple Kinects is able to recognise and estimate the pose of the objects, as can be seen in the results in sections 6.1.3 and 6.1.4.

6.1.3. Object pose estimation

The aim of this section is to present quantitative results for the object recognition and pose estimation mechanism presented in section 3.3. In order to determine the number of fixed cameras and their configuration in the workspace, four scenarios have been designed. The intention is to analyse which Kinect type provides better results and whether the results improve by using more than one camera. Four different simulations are presented:

a) Simulation using only the Kinect One with an aerial front view.
b) Simulation using only the Kinect 360 with an aerial front view.
c) Simulation using two Kinects 360 with lateral views.
d) Simulation using a Kinect One with an aerial front view and two Kinects 360 with lateral views.

The pose estimation error and the computation time have been averaged over 100 simulations for all the scenarios presented. Therefore, four systems are presented with their accuracy and computation time. Figure 6.8 shows the asymmetrical circle grid used for the camera calibration, an initial pose of the Rugby ball object and the initial configuration of the Kinect One and Kinect 360:

Figure 6.8. Initial configuration of the elements involved in the simulations.

Note that the accuracy of the system is determined by the difference between the estimated pose provided by the Object Recognition Module and the real pose of the object. The real pose of the object is set by the user and could contain some small error. The centre of the Rugby ball is placed at:

- For scenarios 1 and 2: cartesian position (0, 0, 0.165) meters and orientation expressed in angle-axis form as (1, 0, 0) with an angle of π/2 radians.
- For scenarios 3 and 4: cartesian position (0, 0, 0.165) meters and orientation expressed in angle-axis form as (0, 1, 0) with an angle of π/2 radians.

The system is analyzed with an object that is big enough to be grasped with the bimanual manipulator (see Figure 2.10); it does not make sense to use a small object that can be manipulated with one hand. The official dimensions of the rugby ball size 5 are:

Figure 6.9. Official dimensions of the Rugby ball size 5 [74].

If the object pose estimation algorithm commits a position error of up to the size of the rugby ball, the system will be considered to have poor performance. Therefore, the Object Recognition Module is designed to estimate the pose of a bulky object with as little error as possible. Four different scenarios have been simulated:

a) Scenario 1: Kinect One with aerial front view

The Object Recognition Module is tested using the Kinect One with an aerial front view capturing the scene. A representation of the scenario is sketched in Figure 6.10:

Figure 6.10. Scheme of scenario 1 with the elements involved in the simulations.

b) Scenario 2: Kinect 360 with aerial front view

The second experiment is exactly the same as the previous one, but now the Kinect 360 is used. A representation of the scenario is sketched in Figure 6.11:

Figure 6.11. Scheme of scenario 2 with the elements involved in the simulations.

c) Scenario 3: two Kinects 360 with lateral view

The third experiment consists of adding an additional depth camera to increase the point-cloud density of the scene and provide the processing algorithms with more information about the scene. A representation of the scenario is sketched in Figure 6.12:

Figure 6.12. Scheme of scenario 3 with the elements involved in the simulations.

d) Scenario 4: Kinect One and two Kinects 360

The last experiment consists of adding the Kinect One to the previous scenario to increase the point-cloud density and provide an aerial view of the scene. A representation of the scenario is sketched in Figure 6.13:

Figure 6.13. Scheme of scenario 4 with the elements involved in the simulations.

The position error has been calculated as the difference between the estimated and the real position, expressed in meters. Figure 6.14, Figure 6.15 and Figure 6.16 show the position errors committed during the simulations of the four different scenarios. The estimated position errors of all simulations are represented with plus signs in different colours, and the desired error with a black circle sign. The simulation results for each scenario are presented in independent figures in Annex B.

Figure 6.14. Plane XY of the estimated position error of the Rugby ball.

Figure 6.15. Plane XZ of the estimated position error of the Rugby ball.

Figure 6.16. Plane YZ of the estimated position error of the Rugby ball.

The orientation error has been calculated as the difference between the estimated and the real orientation, expressed in angle-axis form. The error committed when estimating the orientation of the Rugby ball object is very high (see Table 6.1): some simulations present errors of around 3 radians in some axes. This means that the system is very poor at estimating the orientation of the objects. However, the results presented in section 6.1.4 show that the orientation errors decrease for objects with asymmetric geometry. The Rugby ball can be described as an ellipsoid, which is symmetric about its semi-major and semi-minor axes. This is a problem for fixing the orientation of the object frame appropriately: in our case, the orientation of the Rugby ball can only be fixed with respect to the z axis, adjusting it to the longitudinal direction, while the other two axes are set randomly.

Finally, Table 6.1 summarises the results of all the scenarios presented. The position and orientation errors of each scenario have been averaged, and a confidence interval is provided to indicate the region with the highest probability of containing the position and orientation error of the object. In addition, the density of points of each scenario is presented, as well as the average computation time, which is directly related to the size of the point-cloud to be processed.

Table 6.1. Results of the four simulated scenarios. For each scenario (Kinect One; Kinect 360; two Kinects 360; Kinect One and two Kinects 360) the table reports the number of points, the object position errors in x, y and z (in meters, with confidence intervals), the object orientation angle error (in degrees, with confidence interval) and the average time in seconds.

6.1.4. Object recognition

This section presents qualitative results of the Object Recognition Module in order to study how versatile the designed system is at recognising different objects, with different positions and orientations, and to test the system with different camera configurations. The 3D object models are:

- Cylinders with different colours.
- Cylinders in a T structure.
- Rugby ball.
- Skate helmet.

Figure 6.17. Object models used for testing the Object Recognition Module.

Three different scenarios have been simulated:

a) Simulation using only the Kinect One with an aerial front view.
b) Simulation using two Kinects 360 with lateral views.
c) Simulation using a Kinect One with an aerial front view and two Kinects 360 with lateral views.

The three scenarios follow a basic methodology to show the visual results clearly: first, a colour image of the workspace is provided in order to present the objects to be recognised, and then a set of views of the recognised 3D object. The point-cloud visualization is done with the PCLVisualizer [26], an empty window that displays a black background and the world coordinate system (x axis in red, y axis in green and z axis in blue).

a) Scenario 1: Kinect One with aerial front view

The aim of this scenario is to analyze the performance of the Object Recognition Module when the point-cloud scene is taken with a single Kinect One with an aerial front view. A green cylinder with a length of 16 cm and a diameter of 6 cm is the first object model used to carry out the simulations. Three simulations are performed by orienting the longitudinal direction of the cylinder along the three orthogonal planes located at the origin of the coordinate system:

Figure 6.18. Experiment 1: three different orientations of the green cylinder model.

With the longitudinal direction of the cylinder oriented perpendicular to the XY plane, a perfect match between the model template and the cylinder in the scene is achieved (see Figure 6.19). When the cylinder is perpendicular to the XZ plane, the surface of the cylinder is tangential to the infrared rays projected by the Kinect, which causes distortion in the point-cloud (see Figure 6.20). The last figure shows how the match between the model template and the object in the scene is sometimes only partial, probably because of too few iterations to align both models: Figure 6.21 shows how the cylinder template needs more iterations to align itself to the cover of the cylinder.

Figure 6.19. Results of experiment 1: three object views of the cylinder perpendicular to the plane XY.

Figure 6.20. Results of experiment 1: three object views of the cylinder perpendicular to the plane XZ.

Figure 6.21. Results of experiment 1: three object views of the cylinder perpendicular to the plane YZ.

The second experiment consists of testing the system with multiple objects in the workspace. Cylinder models with different colours (green, pink and blue) are placed as shown in Figure 6.22:

Figure 6.22. Experiment 2: two different sets of the cylinder models.

Although the Object Recognition Module correctly detects the number of objects in the scene, the object pose estimation does not achieve highly accurate results. As commented previously, the match between the model template and the object in the scene is sometimes only partial, which is caused by too few iterations to align both models. Figure 6.23 shows a perfect match for the pink cylinder, while the green cylinder could need more iterations to align. Figure 6.24 shows that the pink and green cylinders, oriented vertically, achieve a better match than the blue cylinder oriented horizontally; this effect was already detected in the previous experiment (see Figure 6.20).

Finally, the Object Recognition Module provides the object poses with respect to the world coordinate system. However, when the model template has a symmetric geometry, such as a sphere, the object frame is set at the centre and the orientation is chosen randomly. In our case, the orientation of the cylinders can only be fixed with respect to the z axis, adjusting it to the longitudinal direction of the cylinder. At this point, it is interesting to perform some experiments with asymmetric objects.

Figure 6.23. Results of experiment 2: three object views of two cylinders perpendicular to the plane XY.

Figure 6.24. Results of experiment 2: three object views of three cylinders.

The third experiment consists of recognising a single object with a T structure. This object is composed of two colours and is placed as shown in Figure 6.25:

Figure 6.25. Experiment 3: two different orientations of the T model.

The object model used in this experiment maintains some symmetric geometry in two transversal planes. However, the T structure allows the system to set the orientation of the object by fixing the z axis to the longitudinal direction of the vertical cylinder (green) and the y axis to the longitudinal direction of the horizontal cylinder (blue); the x axis is then fixed by definition.

Figure 6.26. Results of experiment 3: three object views of the T model.

Figure 6.27. Results of experiment 3: two object views of the T model with random orientation.

The previous experiments were designed to analyze the behaviour of the Object Recognition Module with objects that can be manipulated with one robotic hand. The fourth experiment consists of recognising bulky objects to be manipulated with a bimanual robot. Figure 6.28 shows how a Rugby ball and a Skate helmet are placed in the workspace:

Figure 6.28. Experiment 4: Rugby ball and Skate helmet models.

The results obtained with the Kinect One with an aerial front view indicate that the Object Recognition Module could not obtain good matches for bulky objects, as shown in Figure 6.29. One factor could be the low density of points describing the scene. Therefore, a single depth camera capturing all objects in the scene might not be enough, and multiple depth cameras would be necessary.

Figure 6.29. Results of experiment 4: Rugby ball and Skate helmet pose estimation.

b) Scenario 2: two Kinects 360 with lateral view

The aim of this scenario is to analyze the performance of the Object Recognition Module when the point-cloud scene is taken with two Kinects 360. The experiments presented try to improve the results obtained in the previous simulations with bulky objects by choosing the location of the cameras appropriately. The fifth experiment consists of recognising the Rugby ball with the information provided by two depth cameras; the point-cloud is much denser than in the previous experiments. The Rugby ball pose is shown in Figure 6.30:

Figure 6.30. Experiment 5: two lateral views of the Rugby ball model.

The final result can be seen in Figure 6.31. The pose estimation is not completely accurate because the two point-clouds, taken separately by two different Kinects 360, are not merged well, as can be appreciated in the figure. Section 6.1.1 explained the causes of these offsets between point-clouds. Moreover, the object is illuminated by the infrared projectors of both depth cameras at the same time, which causes interferences in the depth sensors, and the visible side of the object is distorted.

Figure 6.31. Results of experiment 5: Rugby ball pose estimation.

The Rugby ball is then oriented differently in order to capture two different sides of the ball while avoiding interferences between the camera sensors (see section 6.1.2):

Figure 6.32. Experiment 6: two lateral views of the Rugby ball model.

The object has been oriented so as to minimise the surface captured by both depth sensors in common, in order to obtain a good point-cloud quality. Although the final point-cloud obtained with the two Kinects 360 presents translation offsets, the system with multiple Kinects is able to recognise and estimate the pose of the object.

Figure 6.33. Results of experiment 6: Rugby ball pose estimation.

c) Scenario 3: two Kinects 360 with lateral view and Kinect One

The aim of the last scenario is to analyze the performance of the Object Recognition Module when the point-cloud scene is taken from three different views, so three point-clouds are merged into one.

Figure 6.34. Experiment 7: three different views of the Rugby ball model.

Finally, the Rugby ball can be reconstructed almost completely, missing only the bottom and the back part. Therefore, a better match between the object template and the object in the scene is obtained with respect to the previous experiments with the Rugby ball:

Figure 6.35. Results of experiment 7: Rugby ball pose estimation.

6.2. Evaluation of the Grasp and Motion Planning Module

The aim of this section is to present visual results for the object grasp and motion planning procedure presented in section 4. The rugby ball has been selected as the object model to test the system in four different poses:

Table 6.2. Rugby ball pose with respect to the workspace coordinate system (position x, y, z in meters; orientation as axis-angle ax, ay, az and angle in degrees).
a) 0.0 m, 0.0 m, 0.15 m
b) 0.0 m, 0.0 m, 0.15 m
c) 0.1 m, 0.1 m, 0.15 m
d) 0.0 m, 0.0 m, 0.15 m

The figures below illustrate how the object has been divided into slices to place the left and right fingers and thumbs, show the resulting contact points, and present several pictures of the trajectory followed by the manipulator to reach the object. Note that these experiments have been performed in uncluttered environments, with the object placed against a uniform background; grasping in cluttered environments is a harder problem, both from the robot perception and from the robot motion planning point of view. In this evaluation, robust strategies for object grasping are needed to overcome the positioning errors previously analysed (see section 6.1.3).

Figure Test a): Rugby ball divided in slices and the resulting contact points.

Figure Test a): Manipulator movements following the path planner solution.

Figure Test b): Rugby ball divided in slices and the resulting contact points.

Figure Test b): Manipulator movements following the path planner solution.

Figure Test c): Rugby ball divided in slices and the resulting contact points.

Figure Test c): Manipulator movements following the path planner solution.

Figure Test d): Rugby ball divided in slices and the resulting contact points.

Figure Test d): Manipulator movements following the path planner solution.


More information

Preface...vii. Printed vs PDF Versions of the Book...ix. 1. Scope of this Volume Installing the ros-by-example Code...3

Preface...vii. Printed vs PDF Versions of the Book...ix. 1. Scope of this Volume Installing the ros-by-example Code...3 Contents Preface...vii Printed vs PDF Versions of the Book...ix 1. Scope of this Volume...1 2. Installing the ros-by-example Code...3 3. Task Execution using ROS...7 3.1 A Fake Battery Simulator...8 3.2

More information

An Interactive Technique for Robot Control by Using Image Processing Method

An Interactive Technique for Robot Control by Using Image Processing Method An Interactive Technique for Robot Control by Using Image Processing Method Mr. Raskar D. S 1., Prof. Mrs. Belagali P. P 2 1, E&TC Dept. Dr. JJMCOE., Jaysingpur. Maharashtra., India. 2 Associate Prof.

More information

Hand. Desk 4. Panda research 5. Franka Control Interface (FCI) Robot Model Library. ROS support. 1 technical data is subject to change

Hand. Desk 4. Panda research 5. Franka Control Interface (FCI) Robot Model Library. ROS support. 1 technical data is subject to change TECHNICAL DATA 1, 2 Arm degrees of freedom 7 DOF payload 3 kg sensitivity joint torque sensors in all 7 axes maximum reach 855 mm joint position limits A1: -170/170, A2: -105/105, [ ] A3: -170/170, A4:

More information

BIL Computer Vision Apr 16, 2014

BIL Computer Vision Apr 16, 2014 BIL 719 - Computer Vision Apr 16, 2014 Binocular Stereo (cont d.), Structure from Motion Aykut Erdem Dept. of Computer Engineering Hacettepe University Slide credit: S. Lazebnik Basic stereo matching algorithm

More information

ToF Camera for high resolution 3D images with affordable pricing

ToF Camera for high resolution 3D images with affordable pricing ToF Camera for high resolution 3D images with affordable pricing Basler AG Jana Bartels, Product Manager 3D Agenda Coming next I. Basler AG II. 3D Purpose and Time-of-Flight - Working Principle III. Advantages

More information

calibrated coordinates Linear transformation pixel coordinates

calibrated coordinates Linear transformation pixel coordinates 1 calibrated coordinates Linear transformation pixel coordinates 2 Calibration with a rig Uncalibrated epipolar geometry Ambiguities in image formation Stratified reconstruction Autocalibration with partial

More information

3D Computer Vision. Depth Cameras. Prof. Didier Stricker. Oliver Wasenmüller

3D Computer Vision. Depth Cameras. Prof. Didier Stricker. Oliver Wasenmüller 3D Computer Vision Depth Cameras Prof. Didier Stricker Oliver Wasenmüller Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de

More information

Personal Navigation and Indoor Mapping: Performance Characterization of Kinect Sensor-based Trajectory Recovery

Personal Navigation and Indoor Mapping: Performance Characterization of Kinect Sensor-based Trajectory Recovery Personal Navigation and Indoor Mapping: Performance Characterization of Kinect Sensor-based Trajectory Recovery 1 Charles TOTH, 1 Dorota BRZEZINSKA, USA 2 Allison KEALY, Australia, 3 Guenther RETSCHER,

More information

Creating a distortion characterisation dataset for visual band cameras using fiducial markers.

Creating a distortion characterisation dataset for visual band cameras using fiducial markers. Creating a distortion characterisation dataset for visual band cameras using fiducial markers. Robert Jermy Council for Scientific and Industrial Research Email: rjermy@csir.co.za Jason de Villiers Council

More information

3D Perception. CS 4495 Computer Vision K. Hawkins. CS 4495 Computer Vision. 3D Perception. Kelsey Hawkins Robotics

3D Perception. CS 4495 Computer Vision K. Hawkins. CS 4495 Computer Vision. 3D Perception. Kelsey Hawkins Robotics CS 4495 Computer Vision Kelsey Hawkins Robotics Motivation What do animals, people, and robots want to do with vision? Detect and recognize objects/landmarks Find location of objects with respect to themselves

More information

A 3-D Scanner Capturing Range and Color for the Robotics Applications

A 3-D Scanner Capturing Range and Color for the Robotics Applications J.Haverinen & J.Röning, A 3-D Scanner Capturing Range and Color for the Robotics Applications, 24th Workshop of the AAPR - Applications of 3D-Imaging and Graph-based Modeling, May 25-26, Villach, Carinthia,

More information

Integrated Grasp and Motion Planning using Independent Contact Regions 1

Integrated Grasp and Motion Planning using Independent Contact Regions 1 Integrated Grasp and Motion Planning using Independent Contact Regions 1 Abstract Traditionally, grasp and arm motion planning are considered as separate tasks. This might lead to problems such as limitations

More information

Dense Tracking and Mapping for Autonomous Quadrocopters. Jürgen Sturm

Dense Tracking and Mapping for Autonomous Quadrocopters. Jürgen Sturm Computer Vision Group Prof. Daniel Cremers Dense Tracking and Mapping for Autonomous Quadrocopters Jürgen Sturm Joint work with Frank Steinbrücker, Jakob Engel, Christian Kerl, Erik Bylow, and Daniel Cremers

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction This dissertation will describe the mathematical modeling and development of an innovative, three degree-of-freedom robotic manipulator. The new device, which has been named the

More information

Machine vision. Summary # 11: Stereo vision and epipolar geometry. u l = λx. v l = λy

Machine vision. Summary # 11: Stereo vision and epipolar geometry. u l = λx. v l = λy 1 Machine vision Summary # 11: Stereo vision and epipolar geometry STEREO VISION The goal of stereo vision is to use two cameras to capture 3D scenes. There are two important problems in stereo vision:

More information

A study of a multi-kinect system for human body scanning

A study of a multi-kinect system for human body scanning A study of a multi-kinect system for human body scanning A Seminar Report Submitted in partial fulfillment of requirements for the degree of Master of Technology by Shashwat Rohilla Roll No: 133050009

More information

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov

Structured Light II. Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Structured Light II Johannes Köhler Johannes.koehler@dfki.de Thanks to Ronen Gvili, Szymon Rusinkiewicz and Maks Ovsjanikov Introduction Previous lecture: Structured Light I Active Scanning Camera/emitter

More information

Flexible Calibration of a Portable Structured Light System through Surface Plane

Flexible Calibration of a Portable Structured Light System through Surface Plane Vol. 34, No. 11 ACTA AUTOMATICA SINICA November, 2008 Flexible Calibration of a Portable Structured Light System through Surface Plane GAO Wei 1 WANG Liang 1 HU Zhan-Yi 1 Abstract For a portable structured

More information

This week. CENG 732 Computer Animation. Warping an Object. Warping an Object. 2D Grid Deformation. Warping an Object.

This week. CENG 732 Computer Animation. Warping an Object. Warping an Object. 2D Grid Deformation. Warping an Object. CENG 732 Computer Animation Spring 2006-2007 Week 4 Shape Deformation Animating Articulated Structures: Forward Kinematics/Inverse Kinematics This week Shape Deformation FFD: Free Form Deformation Hierarchical

More information

Final Project Report: Mobile Pick and Place

Final Project Report: Mobile Pick and Place Final Project Report: Mobile Pick and Place Xiaoyang Liu (xiaoyan1) Juncheng Zhang (junchen1) Karthik Ramachandran (kramacha) Sumit Saxena (sumits1) Yihao Qian (yihaoq) Adviser: Dr Matthew Travers Carnegie

More information

New Sony DepthSense TM ToF Technology

New Sony DepthSense TM ToF Technology ADVANCED MATERIAL HANDLING WITH New Sony DepthSense TM ToF Technology Jenson Chang Product Marketing November 7, 2018 1 3D SENSING APPLICATIONS Pick and Place Drones Collision Detection People Counting

More information

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University.

3D Computer Vision. Structured Light II. Prof. Didier Stricker. Kaiserlautern University. 3D Computer Vision Structured Light II Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Introduction

More information

Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision

Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision Fundamentals of Stereo Vision Michael Bleyer LVA Stereo Vision What Happened Last Time? Human 3D perception (3D cinema) Computational stereo Intuitive explanation of what is meant by disparity Stereo matching

More information

DD2423 Image Analysis and Computer Vision IMAGE FORMATION. Computational Vision and Active Perception School of Computer Science and Communication

DD2423 Image Analysis and Computer Vision IMAGE FORMATION. Computational Vision and Active Perception School of Computer Science and Communication DD2423 Image Analysis and Computer Vision IMAGE FORMATION Mårten Björkman Computational Vision and Active Perception School of Computer Science and Communication November 8, 2013 1 Image formation Goal:

More information

The NAO Robot, a case of study Robotics Franchi Alessio Mauro

The NAO Robot, a case of study Robotics Franchi Alessio Mauro The NAO Robot, a case of study Robotics 2013-2014 Franchi Alessio Mauro alessiomauro.franchi@polimi.it Who am I? Franchi Alessio Mauro Master Degree in Computer Science Engineer at Politecnico of Milan

More information

Sensor Modalities. Sensor modality: Different modalities:

Sensor Modalities. Sensor modality: Different modalities: Sensor Modalities Sensor modality: Sensors which measure same form of energy and process it in similar ways Modality refers to the raw input used by the sensors Different modalities: Sound Pressure Temperature

More information

Humanoid Robotics. Inverse Kinematics and Whole-Body Motion Planning. Maren Bennewitz

Humanoid Robotics. Inverse Kinematics and Whole-Body Motion Planning. Maren Bennewitz Humanoid Robotics Inverse Kinematics and Whole-Body Motion Planning Maren Bennewitz 1 Motivation Plan a sequence of configurations (vector of joint angle values) that let the robot move from its current

More information

Team Description Paper Team AutonOHM

Team Description Paper Team AutonOHM Team Description Paper Team AutonOHM Jon Martin, Daniel Ammon, Helmut Engelhardt, Tobias Fink, Tobias Scholz, and Marco Masannek University of Applied Science Nueremberg Georg-Simon-Ohm, Kesslerplatz 12,

More information

Simple and Robust Tracking of Hands and Objects for Video-based Multimedia Production

Simple and Robust Tracking of Hands and Objects for Video-based Multimedia Production Simple and Robust Tracking of Hands and Objects for Video-based Multimedia Production Masatsugu ITOH Motoyuki OZEKI Yuichi NAKAMURA Yuichi OHTA Institute of Engineering Mechanics and Systems University

More information

Introduction to Robotics

Introduction to Robotics Université de Strasbourg Introduction to Robotics Bernard BAYLE, 2013 http://eavr.u-strasbg.fr/ bernard Modelling of a SCARA-type robotic manipulator SCARA-type robotic manipulators: introduction SCARA-type

More information

Reinforcement Learning for Appearance Based Visual Servoing in Robotic Manipulation

Reinforcement Learning for Appearance Based Visual Servoing in Robotic Manipulation Reinforcement Learning for Appearance Based Visual Servoing in Robotic Manipulation UMAR KHAN, LIAQUAT ALI KHAN, S. ZAHID HUSSAIN Department of Mechatronics Engineering AIR University E-9, Islamabad PAKISTAN

More information

Introduction to State-of-the-art Motion Planning Algorithms. Presented by Konstantinos Tsianos

Introduction to State-of-the-art Motion Planning Algorithms. Presented by Konstantinos Tsianos Introduction to State-of-the-art Motion Planning Algorithms Presented by Konstantinos Tsianos Robots need to move! Motion Robot motion must be continuous Geometric constraints Dynamic constraints Safety

More information

Freehand Voxel Carving Scanning on a Mobile Device

Freehand Voxel Carving Scanning on a Mobile Device Technion Institute of Technology Project in Image Processing and Analysis 234329 Freehand Voxel Carving Scanning on a Mobile Device Author: Student Number: 305950099 Supervisors: Aaron Wetzler, Yaron Honen,

More information

Plane Detection and Segmentation For DARPA Robotics Challange. Jacob H. Palnick

Plane Detection and Segmentation For DARPA Robotics Challange. Jacob H. Palnick Plane Detection and Segmentation For DARPA Robotics Challange A Major Qualifying Project Report submitted to the Faculty of WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for

More information

Introducing Robotics Vision System to a Manufacturing Robotics Course

Introducing Robotics Vision System to a Manufacturing Robotics Course Paper ID #16241 Introducing Robotics Vision System to a Manufacturing Robotics Course Dr. Yuqiu You, Ohio University c American Society for Engineering Education, 2016 Introducing Robotics Vision System

More information

Humanoid Robotics. Inverse Kinematics and Whole-Body Motion Planning. Maren Bennewitz

Humanoid Robotics. Inverse Kinematics and Whole-Body Motion Planning. Maren Bennewitz Humanoid Robotics Inverse Kinematics and Whole-Body Motion Planning Maren Bennewitz 1 Motivation Planning for object manipulation Whole-body motion to reach a desired goal configuration Generate a sequence

More information

Dense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera

Dense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera Dense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera Tomokazu Satoy, Masayuki Kanbaray, Naokazu Yokoyay and Haruo Takemuraz ygraduate School of Information

More information

Project report Augmented reality with ARToolKit

Project report Augmented reality with ARToolKit Project report Augmented reality with ARToolKit FMA175 Image Analysis, Project Mathematical Sciences, Lund Institute of Technology Supervisor: Petter Strandmark Fredrik Larsson (dt07fl2@student.lth.se)

More information

ArchGenTool: A System-Independent Collaborative Tool for Robotic Architecture Design

ArchGenTool: A System-Independent Collaborative Tool for Robotic Architecture Design ArchGenTool: A System-Independent Collaborative Tool for Robotic Architecture Design Emanuele Ruffaldi (SSSA) I. Kostavelis, D. Giakoumis, D. Tzovaras (CERTH) Overview Problem Statement Existing Solutions

More information

Super Assembling Arms

Super Assembling Arms Super Assembling Arms Yun Jiang, Nan Xiao, and Hanpin Yan {yj229, nx27, hy95}@cornell.edu Abstract Although there are more and more things personal robots can do for us at home, they are unable to accomplish

More information

Basilio Bona ROBOTICA 03CFIOR 1

Basilio Bona ROBOTICA 03CFIOR 1 Kinematic chains 1 Readings & prerequisites Chapter 2 (prerequisites) Reference systems Vectors Matrices Rotations, translations, roto-translations Homogeneous representation of vectors and matrices Chapter

More information

New Sony DepthSense TM ToF Technology

New Sony DepthSense TM ToF Technology ADVANCED MATERIAL HANDLING WITH New Sony DepthSense TM ToF Technology Jenson Chang Product Marketing November 7, 2018 1 3D SENSING APPLICATIONS Pick and Place Drones Collision Detection People Counting

More information

Grasping Known Objects with Aldebaran Nao

Grasping Known Objects with Aldebaran Nao CS365 Project Report Grasping Known Objects with Aldebaran Nao By: Ashu Gupta( ashug@iitk.ac.in) Mohd. Dawood( mdawood@iitk.ac.in) Department of Computer Science and Engineering IIT Kanpur Mentor: Prof.

More information

Mobile Manipulation for the KUKA youbot Platform. A Major Qualifying Project Report. submitted to the Faculty. of the WORCESTER POLYTECHNIC INSTITUTE

Mobile Manipulation for the KUKA youbot Platform. A Major Qualifying Project Report. submitted to the Faculty. of the WORCESTER POLYTECHNIC INSTITUTE Mobile Manipulation for the KUKA youbot Platform A Major Qualifying Project Report submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degree

More information

Spring 2010: Lecture 9. Ashutosh Saxena. Ashutosh Saxena

Spring 2010: Lecture 9. Ashutosh Saxena. Ashutosh Saxena CS 4758/6758: Robot Learning Spring 2010: Lecture 9 Why planning and control? Video Typical Architecture Planning 0.1 Hz Control 50 Hz Does it apply to all robots and all scenarios? Previous Lecture: Potential

More information

Introduction to ROS. Lasse Einig, Dennis Krupke, Florens Wasserfall

Introduction to ROS. Lasse Einig, Dennis Krupke, Florens Wasserfall Introduction to ROS Lasse Einig, Dennis Krupke, Florens Wasserfall University of Hamburg Faculty of Mathematics, Informatics and Natural Sciences Technical Aspects of Multimodal Systems April 6, 2015 L.

More information

Advanced Imaging Applications on Smart-phones Convergence of General-purpose computing, Graphics acceleration, and Sensors

Advanced Imaging Applications on Smart-phones Convergence of General-purpose computing, Graphics acceleration, and Sensors Advanced Imaging Applications on Smart-phones Convergence of General-purpose computing, Graphics acceleration, and Sensors Sriram Sethuraman Technologist & DMTS, Ittiam 1 Overview Imaging on Smart-phones

More information

CS4495/6495 Introduction to Computer Vision

CS4495/6495 Introduction to Computer Vision CS4495/6495 Introduction to Computer Vision 9C-L1 3D perception Some slides by Kelsey Hawkins Motivation Why do animals, people & robots need vision? To detect and recognize objects/landmarks Is that a

More information

Outline. ETN-FPI Training School on Plenoptic Sensing

Outline. ETN-FPI Training School on Plenoptic Sensing Outline Introduction Part I: Basics of Mathematical Optimization Linear Least Squares Nonlinear Optimization Part II: Basics of Computer Vision Camera Model Multi-Camera Model Multi-Camera Calibration

More information

Real-time Video Surveillance for Large Scenes

Real-time Video Surveillance for Large Scenes Real-time Video Surveillance for Large Scenes Hanyu Liu Technical Report Abstract Video surveillance provides security monitoring of target scenes. As the public safety is confronted of more and more neoteric

More information

Accurate 3D Face and Body Modeling from a Single Fixed Kinect

Accurate 3D Face and Body Modeling from a Single Fixed Kinect Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this

More information

CMU Facilities. Motion Capture Lab. Panoptic Studio

CMU Facilities. Motion Capture Lab. Panoptic Studio CMU Facilities Motion Capture Lab The 1700 square foot Motion Capture Lab provides a resource for behavior capture of humans as well as measuring and controlling robot behavior in real time. It includes

More information

S7316: Real-Time Robotics Control and Simulation for Deformable Terrain Applications Using the GPU

S7316: Real-Time Robotics Control and Simulation for Deformable Terrain Applications Using the GPU S7316: Real-Time Robotics Control and Simulation for Deformable Terrain Applications Using the GPU Daniel Melanz Copyright 2017 Energid Technology Overview 1. Who are we? 2. What do we do? 3. How do we

More information

ROBOTICS 01PEEQW. Basilio Bona DAUIN Politecnico di Torino

ROBOTICS 01PEEQW. Basilio Bona DAUIN Politecnico di Torino ROBOTICS 01PEEQW Basilio Bona DAUIN Politecnico di Torino Control Part 4 Other control strategies These slides are devoted to two advanced control approaches, namely Operational space control Interaction

More information

CHAPTER 2: THREE DIMENSIONAL TOPOGRAPHICAL MAPPING SYSTEM. Target Object

CHAPTER 2: THREE DIMENSIONAL TOPOGRAPHICAL MAPPING SYSTEM. Target Object CHAPTER 2: THREE DIMENSIONAL TOPOGRAPHICAL MAPPING SYSTEM 2.1 Theory and Construction Target Object Laser Projector CCD Camera Host Computer / Image Processor Figure 2.1 Block Diagram of 3D Areal Mapper

More information

Cecilia Laschi The BioRobotics Institute Scuola Superiore Sant Anna, Pisa

Cecilia Laschi The BioRobotics Institute Scuola Superiore Sant Anna, Pisa University of Pisa Master of Science in Computer Science Course of Robotics (ROB) A.Y. 2016/17 cecilia.laschi@santannapisa.it http://didawiki.cli.di.unipi.it/doku.php/magistraleinformatica/rob/start Robot

More information

Week 2: Two-View Geometry. Padua Summer 08 Frank Dellaert

Week 2: Two-View Geometry. Padua Summer 08 Frank Dellaert Week 2: Two-View Geometry Padua Summer 08 Frank Dellaert Mosaicking Outline 2D Transformation Hierarchy RANSAC Triangulation of 3D Points Cameras Triangulation via SVD Automatic Correspondence Essential

More information

Robotics Tasks. CS 188: Artificial Intelligence Spring Manipulator Robots. Mobile Robots. Degrees of Freedom. Sensors and Effectors

Robotics Tasks. CS 188: Artificial Intelligence Spring Manipulator Robots. Mobile Robots. Degrees of Freedom. Sensors and Effectors CS 188: Artificial Intelligence Spring 2006 Lecture 5: Robot Motion Planning 1/31/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Motion planning (today) How to move from

More information

CS A485 Computer and Machine Vision

CS A485 Computer and Machine Vision CS A485 Computer and Machine Vision Lecture 1 Introduction Part-2 January 14, 2014 Sam Siewert Biological Vision vs. Machine Vision (Why A Honey Bee is Better than HPC for CV) Humans - 100 million Photoreceptors

More information

ÉCOLE POLYTECHNIQUE DE MONTRÉAL

ÉCOLE POLYTECHNIQUE DE MONTRÉAL ÉCOLE POLYTECHNIQUE DE MONTRÉAL MODELIZATION OF A 3-PSP 3-DOF PARALLEL MANIPULATOR USED AS FLIGHT SIMULATOR MOVING SEAT. MASTER IN ENGINEERING PROJET III MEC693 SUBMITTED TO: Luc Baron Ph.D. Mechanical

More information

EEE 187: Robotics Summary 2

EEE 187: Robotics Summary 2 1 EEE 187: Robotics Summary 2 09/05/2017 Robotic system components A robotic system has three major components: Actuators: the muscles of the robot Sensors: provide information about the environment and

More information

HOW TO RECONSTRUCT DAMAGED PARTS BASED ON PRECISE AND (PARTLY-)AUTOMATISED SCAN METHODS

HOW TO RECONSTRUCT DAMAGED PARTS BASED ON PRECISE AND (PARTLY-)AUTOMATISED SCAN METHODS HOW TO RECONSTRUCT DAMAGED PARTS BASED ON PRECISE AND (PARTLY-)AUTOMATISED SCAN METHODS cirp GmbH 3D models features spare part identification load simulation mesh repair search in data bases machine learning

More information

ROBOTICS 01PEEQW. Basilio Bona DAUIN Politecnico di Torino

ROBOTICS 01PEEQW. Basilio Bona DAUIN Politecnico di Torino ROBOTICS 01PEEQW Basilio Bona DAUIN Politecnico di Torino Kinematic chains Readings & prerequisites From the MSMS course one shall already be familiar with Reference systems and transformations Vectors

More information

Dynamic Time Warping for Binocular Hand Tracking and Reconstruction

Dynamic Time Warping for Binocular Hand Tracking and Reconstruction Dynamic Time Warping for Binocular Hand Tracking and Reconstruction Javier Romero, Danica Kragic Ville Kyrki Antonis Argyros CAS-CVAP-CSC Dept. of Information Technology Institute of Computer Science KTH,

More information

Robotics Programming Laboratory

Robotics Programming Laboratory Chair of Software Engineering Robotics Programming Laboratory Bertrand Meyer Jiwon Shin Lecture 8: Robot Perception Perception http://pascallin.ecs.soton.ac.uk/challenges/voc/databases.html#caltech car

More information

3D Visualization through Planar Pattern Based Augmented Reality

3D Visualization through Planar Pattern Based Augmented Reality NATIONAL TECHNICAL UNIVERSITY OF ATHENS SCHOOL OF RURAL AND SURVEYING ENGINEERS DEPARTMENT OF TOPOGRAPHY LABORATORY OF PHOTOGRAMMETRY 3D Visualization through Planar Pattern Based Augmented Reality Dr.

More information

Industrial Robots : Manipulators, Kinematics, Dynamics

Industrial Robots : Manipulators, Kinematics, Dynamics Industrial Robots : Manipulators, Kinematics, Dynamics z z y x z y x z y y x x In Industrial terms Robot Manipulators The study of robot manipulators involves dealing with the positions and orientations

More information

Multiple View Geometry

Multiple View Geometry Multiple View Geometry Martin Quinn with a lot of slides stolen from Steve Seitz and Jianbo Shi 15-463: Computational Photography Alexei Efros, CMU, Fall 2007 Our Goal The Plenoptic Function P(θ,φ,λ,t,V

More information

Outdoor Scene Reconstruction from Multiple Image Sequences Captured by a Hand-held Video Camera

Outdoor Scene Reconstruction from Multiple Image Sequences Captured by a Hand-held Video Camera Outdoor Scene Reconstruction from Multiple Image Sequences Captured by a Hand-held Video Camera Tomokazu Sato, Masayuki Kanbara and Naokazu Yokoya Graduate School of Information Science, Nara Institute

More information

Intern Presentation:

Intern Presentation: : Gripper Stereo and Assisted Teleoperation Stanford University December 13, 2010 Outline 1. Introduction 2. Hardware 3. Research 4. Packages 5. Conclusion Introduction Hardware Research Packages Conclusion

More information

INSTITUTE OF AERONAUTICAL ENGINEERING

INSTITUTE OF AERONAUTICAL ENGINEERING Name Code Class Branch Page 1 INSTITUTE OF AERONAUTICAL ENGINEERING : ROBOTICS (Autonomous) Dundigal, Hyderabad - 500 0 MECHANICAL ENGINEERING TUTORIAL QUESTION BANK : A7055 : IV B. Tech I Semester : MECHANICAL

More information

A Simple Interface for Mobile Robot Equipped with Single Camera using Motion Stereo Vision

A Simple Interface for Mobile Robot Equipped with Single Camera using Motion Stereo Vision A Simple Interface for Mobile Robot Equipped with Single Camera using Motion Stereo Vision Stephen Karungaru, Atsushi Ishitani, Takuya Shiraishi, and Minoru Fukumi Abstract Recently, robot technology has

More information

MULTI-MODAL MAPPING. Robotics Day, 31 Mar Frank Mascarich, Shehryar Khattak, Tung Dang

MULTI-MODAL MAPPING. Robotics Day, 31 Mar Frank Mascarich, Shehryar Khattak, Tung Dang MULTI-MODAL MAPPING Robotics Day, 31 Mar 2017 Frank Mascarich, Shehryar Khattak, Tung Dang Application-Specific Sensors Cameras TOF Cameras PERCEPTION LiDAR IMU Localization Mapping Autonomy Robotic Perception

More information

Vision-Based Hand Detection for Registration of Virtual Objects in Augmented Reality

Vision-Based Hand Detection for Registration of Virtual Objects in Augmented Reality International Journal of Future Computer and Communication, Vol. 2, No. 5, October 213 Vision-Based Hand Detection for Registration of Virtual Objects in Augmented Reality Kah Pin Ng, Guat Yew Tan, and

More information

Three-dimensional nondestructive evaluation of cylindrical objects (pipe) using an infrared camera coupled to a 3D scanner

Three-dimensional nondestructive evaluation of cylindrical objects (pipe) using an infrared camera coupled to a 3D scanner Three-dimensional nondestructive evaluation of cylindrical objects (pipe) using an infrared camera coupled to a 3D scanner F. B. Djupkep Dizeu, S. Hesabi, D. Laurendeau, A. Bendada Computer Vision and

More information

Ch 22 Inspection Technologies

Ch 22 Inspection Technologies Ch 22 Inspection Technologies Sections: 1. Inspection Metrology 2. Contact vs. Noncontact Inspection Techniques 3. Conventional Measuring and Gaging Techniques 4. Coordinate Measuring Machines 5. Surface

More information