Representation and Perception for Robotic Garment Manipulation


CENTER FOR MACHINE PERCEPTION, CZECH TECHNICAL UNIVERSITY IN PRAGUE

Representation and Perception for Robotic Garment Manipulation
(PhD Thesis Proposal)

Jan Stria

CTU CMP, August 29, 2014

Available at ftp://cmp.felk.cvut.cz/pub/cmp/articles/stria/stria-tr pdf

Thesis Advisors: prof. Ing. Václav Hlaváč, CSc. and RNDr. Daniel Průša, Ph.D.

The author was supported by the European Commission under the grant agreement FP7-ICT CloPeMa and by the Grant Agency of the Czech Technical University in Prague under the project SGS13/205/OHK3/3T/13.

Research Reports of CMP, Czech Technical University in Prague, No. 15, 2014. Published by Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Technická 2, Prague 6, Czech Republic.


Abstract

We deal with the representation and perception of garments for their automated robotic manipulation. This is a challenging task because garments are non-rigid objects and can thus deform significantly. Both the motivation and possible applications of this research topic are discussed. The main goal of this work is to give an overview of the state of the art in the field. Many existing methods are summarized, and several publicly available databases relevant to the visual perception of garments are introduced. We also present our own past work, which aims at folding a single garment with a dual-armed robot. The work is concluded by proposing our future research plans.

Contents

1. Introduction and problem definition
2. Motivation
   2.1. Household robots
   2.2. Industrial robots
   2.3. Non-robotic applications
3. Related work
   3.1. Tokyo Denki University
   3.2. National Institute of Advanced Industrial Science and Technology
   3.3. Daido Institute of Technology
   3.4. Kagawa University
   3.5. University of Tokyo
   3.6. Keio University
   3.7. University of California
   3.8. Bosch Research and Technology Center North America
   3.9. Clemson University
   3.10. Institut de Robòtica i Informàtica Industrial
   3.11. Center for Research and Technology Hellas
   3.12. University of Glasgow
   3.13. Columbia University
   3.14. Existing datasets
4. Our past contribution
5. Future work
Bibliography
A. Polygonal Models for Clothing
B. Garment perception and its folding using a dual-arm robot

1. Introduction and problem definition

This work deals with the perception and representation of garments. Visual perception is an important component of robotic systems able to manipulate garments autonomously. The perception is challenging because garments mostly consist of fabric; they are non-rigid objects which can deform significantly. Their shape thus cannot be described and recognized easily, as it depends on their actual configuration. This is a significant difference compared to rigid objects, whose shape is invariant and whose 3D position and orientation are the only quantities varying in time. Robotic manipulation of garments, and of non-rigid objects in general, is challenging for similar reasons. The grasping strategy must take into account not only the actual 3D position and orientation of the garment, but also its current deformation. Moreover, the garment is usually deformed further while it is being manipulated. On the other hand, this fact provides a great opportunity for using an active perception and manipulation loop.

The work is organized as follows. Chap. 2 gives the motivation for garment perception and manipulation. Possible applications in both household and industrial robotics are discussed, and several non-robotic applications are also mentioned. Chap. 3 is the core of this work. We introduce and review many existing state-of-the-art methods for automated garment perception and manipulation. The methods deal with various tasks, including localization of garments in a real-world environment, grasping and lifting a single garment from a pile, unfolding a crumpled garment, recognizing the category of a previously unseen garment, and garment folding. We also describe several publicly available databases relevant to the visual perception of garments. Chap. 4 summarizes our past work, which aims at folding a single garment with a dual-armed robot. Our future research plans are proposed in Chap. 5: we would like to improve and extend our existing folding method, and moreover to develop an advanced semantic representation of garments.

2. Motivation

2.1. Household robots

According to [Yamazaki et al., 2012], aging society is a serious problem, caused mainly by increasing life expectancy and a decreasing birthrate. The authors claim that about 20% of the Japanese population was over the age of 65 at the time of writing, and that this ratio is expected to double to 40% in the following decades. Although many senior people are self-sustaining, some of them may suffer from reduced mobility. As a result, they are not able to perform usual household activities such as cooking, washing dishes, laundering, ironing, clothes folding, vacuuming or sweeping. Some of them may moreover be unable to get dressed or to bathe. The quality of their lives can be significantly improved by personal assistance, and [Yamazaki et al., 2012] believe that such assistance could be provided by robots in the future.

The construction of such robots will probably reflect their intended usage for various household tasks. Similarly to humans, the robots will probably use general manipulators to perform various tasks. They will also use these manipulators to operate common household facilities, e.g. a sweeper, sink, dishes or washing machine. The robots will act mostly autonomously, but they will also cooperate with humans in some activities. For psychological reasons, it is thus preferred that they resemble humans. Examples of several existing experimental assistive robots can be seen in Fig. 2.1. Household robots could eventually also be used by people who do not require personal assistance but who, for example, have a time-demanding job and do not want to spend their free time on housekeeping. Some of them already hire human housekeepers these days and might consider buying a robotic assistant in the future.

One of the required skills of a household robot is the perception and manipulation of non-rigid objects. These skills are needed to manipulate clothes during laundering, drying, ironing and folding, to hang curtains, to wash floors and furniture with a piece of cloth, to hold and carry plastic cups, etc. In past years, there has already been some research on automated perception and manipulation of non-rigid objects by general humanoid robots applied to housekeeping tasks. Most of the publications focus on clothes perception and manipulation. [Kaneko and Kakikura, 2001] propose a vision-based pipeline for grasping and isolating a single piece of cloth from a pile of washed clothes and for unfolding it subsequently. [Yamazaki and Inaba, 2009] employ the home assistant robot AR to identify and collect clothes randomly tossed around a room. [Maitin-Shepard et al., 2010] utilize the humanoid robot PR2 for towel folding and [Miller et al., 2012] for folding various types of clothes including shirts and pants. The same PR2 robot is used by [Wang et al., 2011] for automated pairing of socks.

There is also the RoboCup@Home league, a part of the RoboCup initiative, which is an annual competition focused on the development of autonomous assistive robots. The robots act in simulated, non-standardized real-world environments, e.g. a kitchen, living room, bathroom or restaurant. They have to perform sets of actions aimed at navigation, object recognition and manipulation, human-computer interaction or behavior

adaptation. The competing robots are evaluated based on their performance in predefined scenarios. According to the RoboCup@Home organizers, the scenarios should include perception and manipulation of non-rigid objects in the future.

Figure 2.1. Examples of existing experimental robots performing various household activities: a) AR robot controlling a washing machine during laundering; b) ARMAR robot putting dirty dishes into a washer; c) HRP-2 robot pouring water into a cup; d) ASIMO robot serving drinks; e) PR2 robot folding a towel; f) Nara assisting with putting clothes on a mannequin.

2.2. Industrial robots

Automated perception and manipulation of non-rigid objects can also be utilized in industrial applications. In contrast to the general-purpose humanoids used for housekeeping, robots specialized for one particular task are often used. Industrial robots have to perform the intended tasks with high reliability, nearly without failures. They should operate at high speed, which is often not true for research projects. Industrial robots also have to be durable and easily maintainable. On the other hand, these robots are usually able to act with limited perception and simple planning mechanisms, as they are installed in an invariant environment and manipulate well-known objects.

An example of automated industrial manipulation of clothes are laundry companies which supply hotels and hospitals with clean towels, bed-clothes, bathrobes or gowns. Many of these companies already use partially automated lines, e.g. those manufactured by the Jensen-Group company. However, there are still several tasks which have to be performed by humans, e.g. feeding dirty clothes into washing machines or setting already washed clothes into mangling machines; see Fig. 2.2a. Working conditions in these laundry companies are often unhealthy because of dust, heat and humidity. Thus there is a demand for fully automated lines, as proposed by [Hata et al., 2008]. Another possible utilization are sewing workshops, in which the sewing machines have to be operated by human workers because of the complexity of the performed tasks.

Figure 2.2. Examples of possible industrial applications for robots able to manipulate non-rigid objects: a) machine for spreading towels which are fed by human workers; b) early prototype of a sewing robot by SoftWear Automation; c) worker assembling car doors.

However, there is already a university spinoff company, SoftWear Automation, which aims at the development of a fully automated sewing machine. It should utilize machine vision to accomplish very high sewing accuracy. It is shown in Fig. 2.2b.

Industrial manipulation of non-rigid objects is not limited to clothes. Another branch is the automotive industry. Although automobile manufacturers are already heavily automated, the robots work only with the rigid parts of cars. Such robots are produced by big corporations like the ABB Group or KUKA and are used e.g. for welding or painting car bodies. However, human workers still have to e.g. mount rubber seals or place floor mats. This is not only because of manipulation issues, but also because of perception. While rigid parts can usually be manipulated with simple or even no perception, advanced perception mechanisms will be needed for non-rigid objects. It can be said that there is still much room for the deployment of robots in industry.

2.3. Non-robotic applications

Applications of methods for non-rigid object perception are not limited to robotics, where they are combined with manipulation strategies. The perception of non-rigid objects is a challenging task because of the varying shapes and configurations of such objects. Moreover, the visual perception of non-rigid objects by itself has several applications. [Yamaguchi et al., 2012] propose a method for parsing clothes in fashion photographs of humans. They try to determine which pixels display certain types of clothes, e.g. jeans, a jumper or a hat. The recognized clothes are then used to refine the pose estimation of the people wearing them, as seen in Fig. 2.3a. Human pose estimation itself is a broadly studied problem with applications in human-computer interaction and activity recognition. [Kalantidis et al., 2013] are also interested in clothes parsing and recognition in everyday photographs of people. They use the identified clothes to suggest similar clothes from a database containing photographs of more than one million fashion products. The similarity is based on the color and texture of the clothes. [Manfredi et al., 2014] propose a method for finding clusters of similar clothes in a database of fashion photographs. The similarity is also based on color. The identified clusters can be used to recommend

visually similar products to the product currently viewed by the user. An example can be seen in Fig. 2.3b. The authors claim that they are currently testing the deployment of the method in an e-shop owned by a world-leading clothes retailer.

Figure 2.3. Example applications of clothes perception: refinement of human pose estimation [Yamaguchi et al., 2012] and visual clothes retrieval [Manfredi et al., 2014]. a) The original pose estimate (left) is used for clothes parsing (center), which helps to refine the original pose estimate (right). b) Clothes from the input image (left) are parsed and used to find similar clothes in the database (right).

In our opinion, the main drawback of the clothes retrieval methods is their exclusive focus on the color and texture of clothes. They do not consider the tightness, length or shape of clothes. They are also not able to detect the presence and position of shirring, a collar, pockets or a zip fastener. However, all these features are probably more important than color when deciding about the similarity of two pieces of clothes. Recognition of these advanced features thus remains an unsolved problem.

3. Related work

This chapter describes the state of the art in garment perception, representation and manipulation. Each section of the chapter, except the very last one, presents the work of one particular research group. The sections are named according to these groups; if more research groups are involved, we mention just the most notable one due to limited space. There are two main reasons for such a division of the chapter. First, the introduced papers deal with various overlapping tasks, and thus it is not possible to categorize them according to their topics. Moreover, the papers published by one particular research group usually gradually improve a certain method or propose new methods based on the previously published ones. It is thus convenient to survey all papers published by one group together and in chronological order. The last section of the chapter describes publicly available datasets relevant to the visual perception of garments.

3.1. Tokyo Denki University

The authors of [Hamajima and Kakikura, 2000] and [Kaneko and Kakikura, 2001] suggest a whole pipeline leading from a heap of several crumpled garments to a stack of correctly folded garments. A dual-arm robot having rotating wheels on its grippers is utilized for this task. The whole procedure is split into three main steps: isolation of a single piece of garment from the heap, unfolding it by a series of regrasping operations performed in the air, and finally putting the garment on a table and folding it. However, the authors describe only the isolation and unfolding methods; the folding is merely sketched out in the referred works.

[Hamajima and Kakikura, 2000] are concerned mainly with the isolation task. It is assumed that there is a heap of many garments on the table. A single image of the heap is taken from the top, and a mask denoting the part of the image occupied by the heap is provided manually. The task is to identify a large enough uniformly colored or textured region, compute its center and use the center as a grasping point. The regions are constructed recursively by extracting the pixels corresponding to the highest peak in the RGB color channel histograms, as sketched below. Textured areas, detected as areas with many fine edges, are excluded from the construction process. If the recursive procedure does not find a large enough uniform color region, the largest continuous textured area is used for grasping.

Once the grasping point is determined, the garment is lifted by a single robotic arm, which leads to a hanging state of the garment. The goal is to grasp the garment by the second arm and then regrasp it by the first arm in order to unfold it in the air. The final grasping points should be located at two different hemlines or at the endpoints of the same hemline. Detection of hemlines in the image is based on shadows appearing on the garment surface and on the approximate shape of the garment outline in the hanging state. The authors claim that the hemlines form convex outlines and cast shadows on their surroundings.
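The following is a minimal sketch of this histogram-peak region extraction, assuming an 8-bit RGB image and a jointly quantized color histogram; the size threshold and bin count are illustrative, and the original additionally splits the peak pixels into connected regions and handles textured areas separately.

```python
import numpy as np

def grasp_point_from_pile(image, pile_mask, min_size=2000, n_bins=32):
    """Repeatedly peel off the pixels under the highest peak of the
    quantized RGB histogram and grasp the center of the first peak
    region that is large enough. image: (H, W, 3) uint8,
    pile_mask: (H, W) bool."""
    remaining = pile_mask.copy()
    step = 256 // n_bins
    while remaining.any():
        ys, xs = np.nonzero(remaining)
        q = (image[ys, xs] // step).astype(np.int64)   # quantized RGB triples
        codes = (q[:, 0] * n_bins + q[:, 1]) * n_bins + q[:, 2]
        in_peak = codes == np.bincount(codes).argmax() # dominant color bin
        if in_peak.sum() >= min_size:                  # uniform region found
            return int(xs[in_peak].mean()), int(ys[in_peak].mean())
        remaining[ys[in_peak], xs[in_peak]] = False    # discard peak, retry
    return None  # no uniform region; fall back to the largest textured area
```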

Figure 3.1. Clothes isolation and regrasping by [Hamajima and Kakikura, 2000]: (a) the image of the washed heap is segmented into uniform regions and the center of the largest region is grasped; (b) hemlines, detected as convex shapes which cast shadows, are used for regrasping.

[Kaneko and Kakikura, 2001] describe recognition of the unfolded garment's category and pose after the regrasping operation. First, the outline of the hanging garment is matched to three template shapes. Once the best matching template is known, the category of the garment as well as its approximate pose are determined by a decision tree constructed manually for the particular template. The decision rules are rather complicated; they are based mainly on measuring various metric features of the outline and comparing them with predefined thresholds.

3.2. National Institute of Advanced Industrial Science and Technology

[Kita and Kita, 2002] are interested in pose estimation of a pullover being held by a single gripper in a hanging position. The approximate width and length of the pullover trunk and the length of its sleeves are known in advance. Thus a mass-spring model [Lander, 1999] of the pullover can be constructed, virtually spread on the ground, grasped using each of its 20 vertices, and lifted up. This gives 20 possible virtual states of the hanging pullover. The goal is to select the state which best resembles the observed image of the real pullover. The state space is first pruned by comparing the number of hanging sleeves and the position of the lowest point. The suitable hanging models are then overlaid over the observed image and partially deformed to fit the observed contour. The best model is selected by computing the overlap ratio of the rendered model and the binarized image (see the sketch below).

Once the state of the pullover is known, the next grasping point can be selected. The authors split the 20 possible states into three groups: A, one shoulder is already held; B, no shoulder is held but some shoulder is visible and can be grasped; C, no shoulder is either held or visible. Case C can always be transformed to B, and B to A, by a single regrasping. In case A, the other shoulder is grasped by the free gripper.

[Kita et al., 2004a] and [Kita et al., 2004b] extend the previously described method. In addition to the previously supported pullover, they also deal with trousers. The garment is now observed with a pair of cameras, and both images are compared with the model to check the consistency of the state estimation. The model-to-contour adjustment was replaced by a simpler width normalization and vertical translation of the model. The grasping procedure is described in more detail. The goal is to grasp the contour point which is closest to a certain node of the overlaid virtual model. Its 3D coordinates are computed from the stereo images.
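A minimal sketch of such an overlap score, written here as intersection over union of the rendered model silhouette and the binarized observation; the normalization actually used in the papers may differ.

```python
import numpy as np

def overlap_ratio(model_mask, observed_mask):
    """Overlap between two (H, W) boolean silhouette masks."""
    intersection = np.logical_and(model_mask, observed_mask).sum()
    union = np.logical_or(model_mask, observed_mask).sum()
    return intersection / union if union else 0.0

# The best of the 20 simulated hanging states would then be chosen as
# best = max(states, key=lambda s: overlap_ratio(render(s), observed)),
# where render() and the state list are hypothetical placeholders.
```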

Figure 3.2. [Kita et al., 2009a] attempt to match various 3D mesh models of clothes to the observed hanging garment. The model is always slightly deformed to fit the observed point cloud prior to computing the matching score.

The direction from which the gripper should approach the garment is decided from the model, because it would be complicated to reconstruct the garment surface correctly from the data. The proposed method was successful in 3 out of 8 attempts performed with two variously sized pullovers.

The methods are further improved in [Kita et al., 2009a]. The hanging garment is observed by a trinocular camera system in order to be reconstructed. The resulting point cloud is again matched to virtual models for various hanging locations. The models have been created in advance and physically simulated in the Maya software. Each model is partially deformed while matching it to the observed point cloud. The deformation starts from the held model vertex and is applied gradually to all neighbors of the currently processed vertices which have not been adjusted yet. Two types of forces are used for the deformation. The internal forces preserve the shape of the model by keeping the distances between vertices; they simulate elasticity and flexural rigidity. The external forces comprise the gravitational force and an attraction to the closest point of the observed cloud. Once the model is adjusted, its similarity with the observed garment is again computed as the overlap ratio in 2D projection. The state of the hanging pullover was correctly recognized in 22 out of 27 performed experiments.

[Kita et al., 2010] improve the previously described method by adding active manipulation of the hanging garment. The goal is to rotate or spread the garment so that its state can be recognized more easily. The rotation is accomplished by rotating the robot arm holding the garment; it is needed to see some currently occluded part of the garment. The spreading is performed by the free robot arm. It aims at setting the garment such that its front-view projection has the largest possible width. The spreading can be performed by pushing either the side or the center of the garment, which is decided automatically based on the convexity or concavity of the hanging garment towards the front vision system. The experiments have shown that the contribution of the proposed active manipulation is questionable: the spreading operation occasionally fails and makes the garment state even less recognizable.

[Kita et al., 2009b] deal mainly with planning of the grasping move. It is based on the idea introduced in [Kita et al., 2004a]: the grasp is planned using the virtual garment model matched to the observed data, rather than using the data themselves. First, the theoretically optimal grasping direction is computed. The gripper should approach a hemline of the garment perpendicularly, having its two jaw fingers aligned with the garment surface.

Figure 3.3. Garment unfolding and recognition by [Osawa et al., 2007]. The lying sweater is grasped, lifted up and shaken to be disentangled. Then it is manipulated to be held by both grippers. Images of the incrementally expanded garment are used for category recognition.

The final grasping direction also considers the limits of the robot's movements. [Kita et al., 2011] extend the pipeline with grasping the garment from a table and lifting it up. It is assumed that the approximate garment location on the table is known in advance, so the garment can be segmented by finding a connected component in the reconstructed 3D data. The height of the table is also computed from the observed data in order to avoid collision. The preferred grasping locations are vertically oriented parts of the surface located high above the table. The rest of the pipeline remains unchanged.

3.3. Daido Institute of Technology

[Osawa et al., 2007] propose a system for unfolding and recognizing the category of a hanging garment (short-sleeved and long-sleeved shirts, trousers, underpants, bras, towels and handkerchiefs). The lying garment is grasped by the first arm, lifted and shaken to be disentangled. Then the lowest point of the garment is visually examined in order to find out whether it is a corner. The corner detection is achieved by rotating the garment and measuring curvature changes on the contour adjacent to the lowest point. The lowest point is then grasped by the second arm, the first grasping point is released and the whole procedure is repeated. The procedure finishes when both the grasped top point and the examined bottom point are corners. The distance of these corners is measured, and the bottom corner is grasped and lifted up such that both corners are held near each other at the same horizontal level. The arms are then moved horizontally, expanding the distance of the held points to the maximum distance measured in the hanging state. Images of the garment being expanded are acquired and matched to predefined template shapes for various clothes categories and various combinations of their grasped corners. The matching is based on computing covariances between the width-normalized observed images and all template images. The covariance coefficients for the individual expansion steps are weighted and summed, and the best matching template is selected. The clothes category recognition rate is between 92% and 100% for the various categories, depending on the correctness of the grasped corner recognition.
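A rough sketch of this template matching, using normalized cross-correlation as the per-step coefficient; the weights, normalization and image preprocessing are assumptions, not values from the paper.

```python
import numpy as np

def match_expansion_sequence(observed, templates, weights):
    """observed:  list of T width-normalized images (H, W)
    templates: dict name -> list of T template images (H, W)
    weights:   length-T sequence of per-step weights"""
    def corr(a, b):
        a, b = a.ravel().astype(float), b.ravel().astype(float)
        a -= a.mean(); b -= b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return (a @ b) / denom if denom else 0.0

    # Weighted sum of per-step correlations; the best template encodes
    # both the clothes category and the combination of grasped corners.
    scores = {name: sum(w * corr(o, t)
                        for w, o, t in zip(weights, observed, seq))
              for name, seq in templates.items()}
    return max(scores, key=scores.get)
```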

Figure 3.4. Contour tracing of rectangular garments by [Osawa and Kano, 2012]. The gripper contains a back light facing the camera. The goal is to trace the contour to the corner once or twice, depending on how the garment is held at the beginning.

[Osawa and Kano, 2012] describe a system for contour tracing of rectangular garments. They employ a robot with two arms. The first arm is used to hold the garment corner. A special tracing gripper with two fingers is mounted on the second arm: one finger carries a camera and the other one contains a LED light facing the camera. This gripper is used to trace the hemline adjacent to the held corner. The tracing procedure is driven by the visual information obtained from the finger camera. The contour is detected by simple brightness thresholding, which is possible thanks to the strong back light. The magnitude and orientation of the forces applied by both the holding and the tracing gripper change according to the mutual positions of the grippers. The tracing is finished when the gripper finds a corner, detected visually as a contour point having very high curvature. The corner detection success rate is 96%. The tracing procedure takes tens of seconds, depending on the length of the traced hemline and the frequency of visual checks.

3.4. Kagawa University

[Hata et al., 2008] propose an experimental system for picking up a single towel from a pile, unfolding it in the air and feeding it into an industrial mangling and folding line. The whole procedure consists of several steps. First, the highest point of the pile is selected as the grasping point. Once the towel is grasped and lifted up, it hangs such that one of its corners is its lowest point. This corner is grasped by the second arm and the initial grasping point is released. The first arm then grasps the lowest point of the towel, which is a corner again. Since the towel is now held by two corners, it can be stretched and put into the line.

Since towels usually have no significant texture, it would be difficult to find pairs of corresponding points for 3D reconstruction from stereo images. Thus a pattern consisting of a mesh and dots is projected onto the garment surface by a LED light. Detection of the highest point of the pile is performed in two steps. First, a rough approximation of the pile surface is reconstructed from corresponding pattern dots. The pattern mesh surrounding the highest dot is then utilized to refine the neighboring surface in order to find the grasping point. The lowest corner detection is also performed in two steps. First, the lowest point of the hanging towel is found by tracing its contour in a single image. The neighborhood of the corner is then reconstructed in 3D in order to compute its position in space, as well as the orientation of the adjacent hemline which is used for grasping [Kobayashi et al., 2008]. The previous work is extended in [Hata et al., 2009].

Figure 3.5. Towel picking by [Kobayashi et al., 2008] utilizes a pattern projector and a pair of cameras for 3D reconstruction. The towel corners are detected in the reconstructed model.

A special sliding mechanism is introduced which grasps the first identified corner and extends the towel before grasping the second corner. The authors also describe a novel adjusting mechanism which pushes the towel so that its hemline is parallel to the mangling and folding machine. It is claimed that the overall success rate of taking the towel from a pile and putting it into the mangling and folding line is around 80%. The most problematic step is the grasping of the first corner, which is successful in 84% of cases.

3.5. University of Tokyo

[Yamazaki and Inaba, 2009] aim at visual localization of garments in a real-world environment. It is needed e.g. to construct a home assistant robot able to tidy up clothes tossed around a room [Yamazaki et al., 2010]. The detection method is based on the natural wrinkledness of the fabric forming the garment. The authors work with color images of a living room acquired by a camera attached to the robot head. The input image is first filtered by variously oriented Gabor filters [Fogel and Sagi, 1989], employing the sliding window method, in order to localize regions embodying significant changes in the frequency domain. The responses obtained from all filters are then summed, and intensity histograms having 20 bins are constructed for all windows (a sketch of this feature extraction is given below). The histograms are processed by an SVM classifier trained on positive clothes samples and negative background samples. The classification gives a trimap of highly reliable clothes pixels, highly reliable background pixels and pixels which are hard to decide. These uncertain pixels are decided based on the reliable pixels in their neighborhood which have a similar color. The similarity propagation is achieved by employing the modified grabcut segmentation algorithm [Rother et al., 2004] seeded from the reliable regions.

3.6. Keio University

[Sugiura et al., 2009] implemented an interactive graphical tool for folding a piece of spread garment. The contour of the garment is acquired from an image and approximated by a polygon. The user is then expected to interactively input several folding operations. The tool is a compromise between a physically reliable simulation of the garment and a purely geometrical approach. The planned sequence of folds is delivered to the robot and performed while visually checking the observed data against the virtual garment.
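Returning to the clothes detection of [Yamazaki and Inaba, 2009], the sketch below illustrates its feature extraction stage, assuming OpenCV; the Gabor kernel parameters and the window size are illustrative, not the paper's values.

```python
import cv2
import numpy as np

def wrinkle_features(gray, win=32, n_orient=8, n_bins=20):
    """Sum responses of Gabor filters at several orientations, then
    build a 20-bin intensity histogram per sliding window."""
    acc = np.zeros(gray.shape, np.float32)
    for i in range(n_orient):
        kern = cv2.getGaborKernel((21, 21), sigma=4.0,
                                  theta=np.pi * i / n_orient,
                                  lambd=10.0, gamma=0.5)
        acc += np.abs(cv2.filter2D(gray.astype(np.float32), -1, kern))
    acc = cv2.normalize(acc, None, 0, 1, cv2.NORM_MINMAX)
    feats, cells = [], []
    for y in range(0, gray.shape[0] - win + 1, win):
        for x in range(0, gray.shape[1] - win + 1, win):
            hist, _ = np.histogram(acc[y:y + win, x:x + win],
                                   bins=n_bins, range=(0, 1))
            feats.append(hist / hist.sum())   # normalized window histogram
            cells.append((y, x))
    return np.array(feats), cells

# The histograms would then be classified by an SVM trained on clothes
# vs. background windows (e.g. sklearn.svm.SVC) to produce the trimap.
```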

Figure 3.6. Clothes detection by [Yamazaki and Inaba, 2009] applied to a shirt hanging over a backrest (left block), multiple clothes on tables (middle) and an image containing no clothes (right). The input image (top-left image in each block) is filtered with Gabor filters (bottom-left). The responses are classified with an SVM to get the trimap (top-right). The final classification is given by segmenting the RGB image (bottom-right).

Figure 3.7. [van den Berg et al., 2011] utilize gravity to fold a rectangular cloth over a predefined directed segment. The cloth is gradually grasped at various locations (top-left corner first, then bottom-left corner) and moved on triangular paths up and down.

3.7. University of California

[van den Berg et al., 2011] describe a general method for folding a flattened article of cloth which lies on a table. The task is accomplished by incrementally grasping, lifting up, moving and laying down particular parts of the cloth while leaving other parts lying on the table. An example is folding a sleeve over the torso of a shirt lying on a table. The cloth is represented purely geometrically by its polygonal contour, either convex or concave. Its physical properties like flexibility, stretchability or friction are considered to have certain ideal values, mostly zero or infinity. The authors describe in general terms the contour points which must be grasped in order to immobilize the hanging part of the cloth. They also define a so-called g-fold by a directed segment in the table plane whose endpoints lie on the polygonal cloth contour. The cloth part on the left side of the segment is folded over the segment to its right side. The folded part is kept immobilized all the time. The grasped points move on triangular paths, as seen in Fig. 3.7.

The cloth is stacked into multiple overlaid layers by performing g-folds. The stack is described by facets, which are polygonal parts of the cloth bounded by hemlines and folds. Each facet is assigned its level in the stack and a transformation matrix from its original position. At the beginning there is only one facet: the whole cloth. Each facet intersected by the planned g-fold segment is subdivided into two new facets. The facet representation makes it possible to define which points of the stack must be grasped during a g-fold. It also has to be planned whether the gripper should grasp the whole stack or only several of its top layers. The proposed approach was successfully tested on the two-armed PR2 robot folding several towels, pants, long-sleeved tops and short-sleeved shirts. The user is supposed to specify the g-folds in a simple GUI because there is no perception mechanism.
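A minimal geometric sketch of a single g-fold, reflecting the vertices left of the directed segment across its line; the real method additionally subdivides the intersected facets and tracks their stacking order and transformations.

```python
import numpy as np

def g_fold(points, a, b):
    """Mirror the polygon vertices lying left of the directed segment
    a -> b across the segment line. points: (N, 2); a, b: (2,)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d = (b - a) / np.linalg.norm(b - a)        # unit fold-axis direction
    out = []
    for p in np.asarray(points, float):
        r = p - a
        # positive signed area means p lies left of the directed segment
        if d[0] * r[1] - d[1] * r[0] > 0:
            p = a + 2 * (r @ d) * d - r        # reflect across the axis
        out.append(p)
    return np.array(out)
```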

Figure 3.8. The towel processing procedure by [Maitin-Shepard et al., 2010] comprises grasping and lifting the towel, regrasping it in order to hold two corners, untwisting it in the air, setting it back on the table and folding it.

[Maitin-Shepard et al., 2010] propose an algorithm for robust visual detection of cloth corners, which are considered convenient grasping points. The algorithm is utilized for robotic towel folding performed partially in the air and partially on the table. The first step of the proposed corner detection algorithm is the detection of cloth borders. They are defined as silhouette contour points which have significant depth discontinuities in several images acquired from various view angles. The corners are found by RANSAC fitting [Fischler and Bolles, 1981] of corner shapes to the detected borders. A corner shape is any angle between 45° and 110°. The goal is to find a large corner according to the trackable length of its adjacent edges. The 2D corners found in the image are verified by fitting a 3D plane to the observed point cloud.

The corner grasping point detection is a key component of the robotic towel folding procedure. The towels are assumed to be crumpled and piled on a table at the beginning. The pile is segmented and its center is grasped and lifted. A sequence of 16 images for corner detection is acquired by rotating the hanging garment around its vertical axis in steps of 12°. The first corner is chosen arbitrarily. The lowest point of the hanging towel is omitted from the selection of the second corner because the goal is to hold two adjacent corners, not the diagonally opposite ones. Once two adjacent corners are held, the towel is repeatedly pulled taut in order to partially untwist it. The fine untwisting is achieved by rotating the wrists while maximizing a certain objective function. Optionally, the towel is regrasped to be held by its shorter edge. The towel is then pulled across the edge of the table in order to be spread out. Finally, the 3D positions of its corners are estimated from stereo, and the towel is folded twice and placed on a stack of already folded towels. The proposed method succeeded in all 50 out of 50 end-to-end trials; the average running time of one trial is 25 minutes.

[Cusumano-Towner et al., 2011] are generally interested in bringing an unknown garment from an initial crumpled configuration to a desired configuration using a dual-arm PR2 robot. The proposed method consists of two phases. In the disambiguation phase, the garment is identified and manipulated into an arbitrary recognizable configuration. In the reconfiguration phase, it is brought to the desired configuration by another series of manipulations. The disambiguation phase starts with grasping a random point of the lying garment and lifting it up. The garment is then regrasped several times by its lowest point, and finally it is grasped with both grippers. The sequence of manipulations is modeled by a hidden Markov model (HMM). The garment is modeled as a 3D triangulated mesh.

Figure 3.9. The folding proposed by [Miller et al., 2011] is based on polygonal models of clothing. The model is matched to the segmented garment contour to recognize both its category and pose. The folded models are used to visually check the folding progress in each step.

The currently grasped mesh vertices and the garment category form the hidden states of the HMM. Their prior probability distributions are uniform, as the garment is picked up at a random point. The transition model is based on comparing the heights of the observed garment with the simulated mesh after the individual regrasps. The final transition is based on matching the observed contour to the simulated one while the garment is held by both grippers. The task is to infer the garment category and to identify the two vertices held at the end.

The next step is the reconfiguration phase. For each category, two vertices of the mesh representation which should be held are specified, e.g. the shoulders for shirts or the hips for pants. The garment is laid on the table, the grippers are opened and two new vertices are grasped. These are not necessarily the desired final vertices, which may not be directly reachable. The series of regrasps leading to the final configuration is planned using a graph of configurations built in advance from a simulation. The disambiguation experiments show that both the category and the finally held vertices can be identified with more than 90% accuracy. The method can also distinguish various sizes of one category of clothes; however, the accuracy then drops to 64%. The whole procedure, including both disambiguation and reconfiguration, succeeded in 20 out of 30 trials.

[Miller et al., 2011] and [Miller et al., 2012] deal with the folding of an unknown garment which is fully spread and unfolded on a known green table. Segmentation can thus be performed by subtracting the background, and the garment contour can be easily extracted. The contour is then fitted with a parametric polygonal model specific to a particular category of clothes. Each legal setting of the parameters uniquely determines the positions of the polygon vertices, e.g. shirt armpits or towel corners. A dense contour of the model is generated and matched to the observed contour while minimizing the distances between the model contour points and their closest observed contour points, and vice versa. The minimization utilizes a coordinate-wise descent over the parameters of the model, maintaining an adaptive step size for each parameter separately; the gradients are evaluated numerically. The optimization comprises three phases, each one minimizing over a larger subset of the parameters. The last phase also allows breaking the model symmetries, e.g. the left-right symmetry of a shirt. The described procedure is also used to recognize the garment category by fitting models for various categories and selecting the best matching one. The perceived garment is folded in a series of g-folds introduced by [van den Berg et al., 2011]. The garment configuration is checked after each fold by fitting a folded model derived from the original one. The described method was deployed on the two-armed PR2 robot and successfully tested in several end-to-end folding trials. The average fitting error is usually under 2 cm. The fitting procedure takes 30 to 150 seconds; however, it occasionally does not converge.
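A sketch of such a coordinate-wise descent with per-parameter adaptive step sizes; here `energy` stands for the symmetric contour-distance objective, and the constants are illustrative rather than the paper's values.

```python
import numpy as np

def fit_model(params, energy, iters=100, step0=0.1, grow=1.5, shrink=0.5):
    """Minimize energy(params) one coordinate at a time, growing a
    parameter's step on success and shrinking it on failure."""
    params = np.array(params, float)
    steps = np.full(params.shape, step0)
    best = energy(params)
    for _ in range(iters):
        for i in range(len(params)):
            for sign in (+1.0, -1.0):         # numeric descent direction
                cand = params.copy()
                cand[i] += sign * steps[i]
                e = energy(cand)
                if e < best:
                    params, best = cand, e
                    steps[i] *= grow          # larger step next time
                    break
            else:
                steps[i] *= shrink            # neither direction helped
    return params
```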

Figure 3.10. [Bersch et al., 2011] cover the shirt with unique fiducial markers to simplify its registration. The shirt is represented as a triangular mesh. Vertical folds are detected in order to determine the next grasping point.

[Wang et al., 2011] are interested in the perception and manipulation of socks. The configuration of a previously unseen sock is recognized in the first step. Pairs of corresponding socks are then determined and folded together. The configuration recognition combines local texture and shape descriptors with the global contour model by [Miller et al., 2011] described previously. The sock texture is described by MR8 filter responses [Varma and Zisserman, 2005] and LBP (Local Binary Patterns) [Ojala et al., 1996]. The local shape is described by HOG (Histogram of Oriented Gradients) [Dalal and Triggs, 2005]. The descriptors are classified by an SVM in order to recognize specific parts of the sock, e.g. the heel or the toe. Once combined with the global model, each sock is decided to be in one of eight configurations: sideways, heel up, heel down and bunched, each of them either right-side out or inside out. The local texture and shape descriptors are also utilized to visually match pairs of socks. The authors test both greedy matching, selecting the best matching pair in each step, and the globally optimal minimum-cost perfect matching.

3.8. Bosch Research and Technology Center North America

[Bersch et al., 2011] introduce a complete pipeline for grasping a crumpled short-sleeved shirt from a table, unfolding it by a series of regrasps in the air, and finally folding it while laying it down on a table in one complex motion. The task is accomplished by the PR2 robot equipped with two arms, jaw grippers, two pairs of stereo cameras mounted on the robot head and standard mono cameras mounted on its arms. After grasping the highest point of the shirt and lifting it with a single arm, the shirt is rotated around its vertical axis and repeatedly scanned. The stereo cameras are used to construct its point cloud representation. The visual perception was simplified by covering the surface of the shirt with hundreds of fiducial markers, each of them uniquely coding a particular part of the shirt; these are detected by the mono cameras. The surface of the shirt is represented by a triangular mesh, which makes it possible to compute geodesic distances between surface points. Estimation of the grasped mesh vertex is based on the idea that the Euclidean distances between the grasped vertex and the observed markers should resemble their geodesic distances in the hanging state because of gravity (see the sketch below). The next grasping point is roughly selected by a greedy approach as the point closest to the shoulder, which is the desired final point to be held. Then vertical folds appearing on the hanging shirt are found by detecting vertical lines in the gradient image using the Hough transform [Duda and Hart, 1972], and the shirt is segmented into regions.
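A minimal sketch of this grasped-vertex estimate under the stated gravity assumption; `marker_pos` and `geodesic` are hypothetical inputs (detected marker positions and precomputed mesh distances), and the paper's exact objective may differ.

```python
import numpy as np

def estimate_grasped_vertex(marker_pos, geodesic):
    """Score every candidate vertex by how well the Euclidean distances
    to the observed markers match the precomputed geodesic distances.

    marker_pos: dict vertex_id -> observed 3D position, shape (3,)
    geodesic:   dict (v, u) -> geodesic distance on the mesh"""
    ids = list(marker_pos)
    pos = np.array([marker_pos[i] for i in ids], float)
    best, best_err = None, np.inf
    for v in ids:                              # candidate grasped vertex
        eucl = np.linalg.norm(pos - np.asarray(marker_pos[v]), axis=1)
        geo = np.array([geodesic[(v, u)] for u in ids])
        err = ((eucl - geo) ** 2).sum()        # hanging-state discrepancy
        if err < best_err:
            best, best_err = v, err
    return best
```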

Figure 3.11. [Willimon et al., 2013a] match a triangulated mesh to a sequence of color and depth images by minimizing an energy function expressing both the garment's internal forces and the alignment with the observed data.

Only the region containing the next grasping point is examined to compute the exact grasping position and orientation. The pose of the shirt is checked after each grasping. If the shirt is held near the shoulders by both grippers and if it is unfolded, the regrasping procedure is finished. The final step is laying the shirt on the table while rotating its sides inwards and folding the shirt over the edge of the table. The whole pipeline succeeded in 18 out of 20 attempts, i.e. 90%. Its average duration is 19 minutes, and the unfolding requires 6 regrasps on average.

3.9. Clemson University

[Willimon et al., 2012] are interested in configuration estimation of non-rigid objects. They employ sequences of RGBD images acquired by Kinect sensors. The goal is to register a predefined 3D model with each frame of the incoming sequence showing a garment which is sequentially deformed. The model is represented by a triangulated mesh of 3D points, and the goal is to find the shape of the mesh which minimizes its energy. The energy is given by a weighted sum of several terms. The smoothness term captures the internal energy of the mesh and ensures its smoothness; it expresses the second derivatives of the mesh vertices. The correspondence term compares the locations of corresponding points in the reference and the currently observed mesh; the correspondences are found by matching SURF descriptors [Bay et al., 2008]. The depth term evaluates the depth difference between each vertex of the mesh and the value measured by the Kinect; it makes the mesh fit the garment in textureless areas where no correspondences can be found. The boundary term makes the boundary vertices stay near the boundary of the observed object; it measures the distances of the boundary vertices to the closest contour points. A semi-implicit iterative algorithm by [Kass et al., 1988] is employed to minimize the overall energy function.

The described work is extended in [Willimon et al., 2013a]. The initial 3D mesh model is created automatically from the first frame of the sequence. The contour of the observed garment is newly extracted from both color and depth images by the grabcut algorithm [Rother et al., 2004]. The boundary term was redesigned to measure not only the distances of the boundary vertices to the closest contour points, but also the distances of the contour points to the closest boundary vertices, which makes the term more symmetric. The correspondence term was omitted without any decrease in performance. The energy minimization utilizes the same semi-implicit approach. The proposed approach can deal with significant deformations and occlusions of the garment being manipulated by a human. Moreover, the algorithm is able to reinitialize itself from a complex deformation by constructing a new mesh model. It was successfully tested on image sequences showing shirts, pants and shorts being translated, rotated and scaled by a human.

Figure 3.12. The garment grasping by [Willimon et al., 2011a] consists of several phases: the color image of the pile is segmented into several regions, the highest region is selected for grasping and the garment is lifted up in order to acquire two images needed for classification.

[Willimon et al., 2011a] propose a method for picking up a single garment from a pile of garments lying on a table and classifying its category. They distinguish 6 categories (short-sleeved shirts, long-sleeved shirts, pants, shorts, socks, underwear). A color image acquired by a camera above the table is used to segment the pile into several regions. The regions do not correspond to individual garments, but to their uniform parts. The segmentation utilizes the graph-based method by [Felzenszwalb and Huttenlocher, 2004]. The surface of the pile is then reconstructed from stereo cameras. The resulting 3D points are used to compute the average height of each region above the table. The geometric center of the region having the maximum average height is chosen as the grasping point. The robot then tries to grasp and lift the garment. The success of the grasp is checked visually by subtracting images of the arm acquired before and after the grasping. If the grasping was not successful, it is repeated closer to the table.

Two images of the hanging garment are acquired by rotating it 90° around its vertical axis. The garment is then dropped, grasped again and two new images are acquired. The whole procedure is repeated 10 times, which gives 20 images in total; the reason is to have images of the hanging garment being held at various random points. Each image is then matched with all annotated template images to find its nearest neighbor. The similarity measure combines the absolute difference of silhouette areas, the absolute difference of silhouette eccentricities, the Hausdorff distance of silhouette contours and the Hausdorff distance of detected Canny edges [Canny, 1986]; a sketch is given below. The silhouette is obtained by background subtraction. The experiments show that the active perception giving 20 images significantly improves the category recognition accuracy, compared to classification from a single image only.

[Willimon et al., 2013b] and [Willimon et al., 2013c] deal with the classification of an unflattened garment lying on a table. Their method should be robust enough for clothes to be classified according to category, gender, owner's age, color or season of use; however, only category recognition is discussed. The proposed classifier, called L-M-H or L-C-S-H, consists of multiple levels: low, mid (comprising characteristics and selection masks) and high. The low-level component utilizes both local and global image features to estimate whether the observed garment has a particular characteristic. The features are extracted from an already segmented image [Felzenszwalb and Huttenlocher, 2004]. The local feature vectors are combinations of SIFT [Lowe, 2004] describing texture and FPFH [Rusu et al., 2009] describing 3D shape. The global features combine an HSV color histogram, a histogram of Canny edge lengths [Canny, 1986], a histogram of transformed normals and a shape descriptor of distances from the silhouette center to its boundary. The low-level feature vectors are classified by 27 binary SVM classifiers in order to decide about 27 binary characteristics of the garment. The characteristics comprise the presence of a collar, the presence of a round neck, whether the garment is striped, whether it is made of denim, etc.
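A sketch of the [Willimon et al., 2011a] nearest-neighbor distance mentioned above, assuming precomputed silhouette areas and eccentricities; the weights are hypothetical and the Canny-edge Hausdorff term is omitted for brevity.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def silhouette_distance(c1, c2, area1, area2, ecc1, ecc2, w=(1.0, 1.0, 1.0)):
    """Combined dissimilarity of two silhouettes.
    c1, c2: (N, 2) contour point arrays."""
    hausdorff = max(directed_hausdorff(c1, c2)[0],   # symmetric Hausdorff
                    directed_hausdorff(c2, c1)[0])
    return (w[0] * abs(area1 - area2)
            + w[1] * abs(ecc1 - ecc2)
            + w[2] * hausdorff)

# Each of the 20 images of the hanging garment would be matched against
# all annotated templates, taking the template minimizing this distance.
```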

Figure 3.13. The L-C-S-H approach to clothes classification by [Willimon et al., 2013b] consists of several layers. The low-level layer is responsible for feature detection. The features are used to decide about the presence of characteristic parts. Each clothes category is described by these characteristics, which makes it possible to classify an unknown garment.

Figure 3.14. [Willimon et al., 2011b] are able to flatten and unfold a garment set on a table in tens of iterations of pulling moves. The procedure finishes when at least 95% of the garment is flat.

The mid level also comprises selection masks which determine a subset of characteristics important for the particular category of clothes. The masks are created incrementally by removing one characteristic at a time and checking the classification performance without it. The high-level category classification is based on comparing the characteristics vector of the observed garment with the masked average characteristics vectors of all clothes categories. The experiments show a significant improvement of the classification accuracy due to the mid layer, compared to SVM classification directly from the low-level features.

[Willimon et al., 2011b] describe a method for flattening and unfolding a garment lying on a table in two consecutive phases. In the first phase, minor wrinkles and folds are removed by circularly moving around the garment and pulling it in 8 directions, every 45°. The garment is always grasped by its contour in the particular direction and pulled away from its center. The second phase utilizes a depth map computed from stereo to detect and remove possible major folds. A so-called peak ridge is detected as the neighborhood of the highest point. The surface of the whole garment is split into several components according to height continuity. Then corners are detected in the binary garment segmentation mask using the algorithm by [Harris and Stephens, 1988]. The corners located in the same component as the peak ridge form the grasping candidates. One of these corners is selected and the garment is pulled towards or away from its center, depending on the exact position of the corners. The proposed algorithm was virtually tested in the Houdini 3D animation software. It is shown that the algorithm successfully increases the flatness of the garment, defined as the portion of the garment points located low above the table. However, tens of pulling manipulations are usually needed. The algorithm was also deployed on a real robot.

Figure 3.15. [Ramisa et al., 2012] extract features from both color and depth images and classify them in order to build a map of graspable locations. The map is searched for its local maxima, which are again classified in order to determine the grasping point.

3.10. Institut de Robòtica i Informàtica Industrial

[Ramisa et al., 2012] are interested in grasping highly wrinkled clothes lying on a table. They utilize features extracted from both the depth and the color image acquired by a Kinect sensor in order to determine the optimal grasping point. Both images are processed by a sliding window technique to compute a descriptor for each patch. Depth is described by a histogram of discretized point depths and discretized geodesic distances of individual points to the patch center. The texture descriptor of individual patches is based on SIFT [Lowe, 2004]. The depth and texture feature vectors are concatenated and quantized. Quantized vectors from neighboring patches are summarized in bag-of-features histograms [Lazebnik et al., 2006]. The bags of features are classified by a logistic regression classifier [Bishop, 2006] learned from positive samples (image windows showing graspable parts of clothes, e.g. collars) and negative samples (image windows covered at most 50% by clothes). This gives a probability map which is searched for local maxima. The neighborhoods of the maxima are then examined by a more expensive SVM classifier with a χ2 extended Gaussian kernel [Zhang et al., 2007] in order to refine the locations of the grasping point candidates. The best candidate is selected based on a wrinkledness measure computed from the estimated probabilistic distribution of depth normals in the candidate's neighborhood. The authors report a 30% to 70% success rate of collar detection, depending on the complexity of the testing database.

[Ramisa et al., 2013] introduce a new descriptor called FINDDD (Fast Integral Normal 3D), computed from depth maps. It should outperform the existing general-purpose descriptors in the characterization of textile materials. Normal vectors for all depth map points are first computed using integral images. All normal vectors are oriented into the hemisphere pointing from the image plane towards the camera. Thus each normal can be discretized into a histogram of vectors pointing from the centre of the hemisphere to vertices on its surface. The vertices (i.e. histogram bins) are obtained by a triangulation of the hemisphere. The histogram for each normal is built by splitting the normal among the bins according to their orientation similarity. The histograms constructed for the normals in a rectangular neighborhood are summed, and the FINDDD descriptor is formed by concatenating these summed histograms from a larger neighborhood.

The usefulness of the proposed FINDDD descriptor is verified in several experiments. They include the localization and categorization of wrinkles formed on the surface of a polo shirt. The next application is the recognition of clothes type using an SVM classifier trained on bags of features [Lazebnik et al., 2006] obtained by quantization of the FINDDD descriptors.
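A loose sketch of the FINDDD idea follows, with hard assignment to a fixed set of hemisphere directions instead of the paper's soft binning over a triangulated hemisphere, and without the integral-image speedup; all constants are illustrative.

```python
import numpy as np

def finddd_like_descriptor(depth, cell=16, n_bins=13):
    """Estimate a normal per depth pixel, assign it to orientation bins
    on the camera-facing hemisphere and sum the histograms per cell."""
    gy, gx = np.gradient(depth.astype(np.float32))
    normals = np.dstack([-gx, -gy, np.ones_like(gx)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    # Fixed bin directions: one pointing at the camera, a ring around it.
    phis = np.linspace(0, 2 * np.pi, n_bins - 1, endpoint=False)
    dirs = np.vstack([[0.0, 0.0, 1.0]]
                     + [[np.cos(p) * 0.6, np.sin(p) * 0.6, 0.8] for p in phis])
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    h, w = depth.shape
    cells = []
    for y in range(0, h - cell + 1, cell):
        for x in range(0, w - cell + 1, cell):
            n = normals[y:y + cell, x:x + cell].reshape(-1, 3)
            bins = (n @ dirs.T).argmax(axis=1)   # most similar direction
            cells.append(np.bincount(bins, minlength=n_bins))
    return np.concatenate(cells)
```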

Figure 3.16. The FINDDD descriptor proposed by [Ramisa et al., 2013] is computed from depth maps (left). The estimated normals (middle) are discretized into bins located on a hemispherical surface, and a histogram of their orientations is built (right).

Finally, the authors used the FINDDD descriptor to improve the previously mentioned collar detection accuracy up to 80%.

[Alenyà et al., 2012] deal with the benchmarking of various grasping strategies for textile materials. They identify several issues, including the different hardware configurations used by various researchers and the problematic repeatability of initial conditions in successive trials caused by clothes deformations. The authors propose to identify and abstract all possible grasping point perception strategies and all possible grasping action mechanisms. Each combination of a perception and an action strategy should then be tested and evaluated independently. Two sets of experiments were performed to prove this concept. The first set utilizes a simple perception method and is performed with various configurations of three-finger hands, whereas the second one is performed with a simple two-finger hand and utilizes a more advanced perception algorithm. The evaluation criterion is based on the ratio of successfully grasped garments.

3.11. Center for Research and Technology Hellas

[Mariolis and Malassiotis, 2013b] propose a method for matching predefined templates of unfolded clothes to images of folded clothes. The main idea is that the folding axis always becomes a part of the garment's outer boundary. In the first step, the contour of the folded garment is extracted from the image and approximated by a polygon. Each edge of the polygon is then examined as a potential folding axis. This is achieved by removing the edge and partially matching the remaining open polyline to the unfolded template. Each partial match generates a hypothesis about the location and orientation of the folding axis. The template is then virtually folded along the estimated axis and matched to the observed contour using inner-distance shape contexts [Ling and Jacobs, 2007]. The proposed pipeline can be applied recursively in order to deal with multiple folds. The matching accuracy is over 95% for synthetic data and 88% for real data; however, no practical unfolding experiments were performed on a robot.

[Doumanoglou et al., 2014] are interested in the category recognition and unfolding of a garment which was randomly grasped from a table and is hung by a gripper. First, the lowest point of the hanging garment is grasped and the original point is released in order to reduce the number of possible configurations. The second step is the concurrent recognition of the garment category and of the held point (short-sleeved shirt, trousers, two configurations of shorts, two configurations of long-sleeved shirt). The recognition utilizes random decision forests [Breiman, 2001] trained on simple features computed from a depth map. The features are: the depth distance of two points, the difference of the depth distances of three points, and the absolute value of the curvature at a certain point. A random set of tests for randomly selected points is generated at each node during training.
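A sketch of how such a random split test might look, here evaluated as a scalar feature over the depth map; the exact feature definitions and sampling ranges are simplified, and the curvature variant is omitted.

```python
import numpy as np

def random_depth_test(depth, rng):
    """Evaluate one randomly parameterized depth feature, of the kind
    thresholded at a tree node."""
    h, w = depth.shape
    (y1, x1), (y2, x2), (y3, x3) = rng.integers(0, [h, w], size=(3, 2))
    if rng.integers(0, 2) == 0:
        # depth distance of two random points
        return depth[y1, x1] - depth[y2, x2]
    # difference of depth distances among three random points
    return (depth[y1, x1] - depth[y2, x2]) - (depth[y2, x2] - depth[y3, x3])

# During training, many such tests would be generated per node and the
# one minimizing the entropy of the child label distributions kept:
# rng = np.random.default_rng(0); f = random_depth_test(depth, rng)
```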

Figure: [Doumanoglou et al., 2014] utilize decision forests for clothes category recognition and grasping point estimation. The inner nodes evaluate depth differences and curvatures of randomly selected points. The leaf nodes vote for the grasping point location.

[Doumanoglou et al., 2014] are interested in category recognition and unfolding of a garment which was randomly grasped from a table and is hung by a gripper. At first, the lowest point of the hanging garment is grasped and the original point is released in order to reduce the number of possible configurations. The second step is a concurrent recognition of the garment category and of the held point (short-sleeved shirt, trousers, two configurations of shorts, two configurations of a long-sleeved shirt). The recognition utilizes random decision forests [Breiman, 2001] trained on simple features computed from a depth map. The features are: the depth distance of two points, the difference of depth distances of three points, and the absolute value of curvature at a certain point. A random set of tests for randomly selected points is generated at each node during training. The test minimizing the information entropy in its children is selected as the decision rule in that node.

The following step is the estimation of the next grasping point using Hough forests [Gall et al., 2011]. The approach is similar to the category and configuration recognition. However, the positions predicted by the individual decision trees are accumulated into a Hough image [Duda and Hart, 1972] and the point with the highest concentration of votes is selected. The shape of its neighborhood is estimated by locally fitting a plane, and it is then grasped. The procedure is repeated to estimate the second grasping point with another set of Hough trees.

The category and configuration recognition as well as the grasping point estimation employ an active approach. The robot rotates the garment around its vertical axis until it can decide with sufficient confidence. The sequence of observations, recognitions and rotations is modeled as a partially observable Markov decision process [Kaelbling et al., 1998]. The robot keeps a probabilistic distribution of the current internal state, which is the category and configuration pair or the next grasped point. It incrementally updates this distribution according to the likelihood of the observed data and according to predefined state transition probabilities. The robot is positively rewarded for a correct decision about the internal state, highly penalized for an incorrect decision and slightly penalized for rotating the garment. The active approach increases the category recognition accuracy from 90% to 100%. The whole unfolding operation was successful in 93% of cases, requiring 2.4 rotations on average.

3.12 University of Glasgow

[Sun et al., 2013] deal with the task of flattening a rectangular piece of cloth. Their approach is based on the detection of wrinkles in a depth map. At first, a measure of wrinkledness is computed for each point as the average absolute deviation of depth values in its neighborhood (see the sketch below). The highly wrinkled points are selected by thresholding and clustered by the k-means algorithm. The obtained clusters are then linked by a hierarchical clustering algorithm [Hastie et al., 2009] in order to get bigger clusters corresponding to individual wrinkles. The position and orientation of the most salient wrinkle is used to plan a flattening move. The flattening is performed by pulling a corner or an edge of the cloth. The whole perception and manipulation procedure is repeated until the wrinkledness measure is reduced to a certain level. However, only virtual experiments with the cloth simulated as a mass-spring model [Lander, 1999] were performed.
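The wrinkledness measure can be computed densely with box filters: the average absolute deviation of depth in a window is approximated by the local mean of |depth − local mean|. A minimal sketch with SciPy; the window size, threshold and cluster count are illustrative choices of ours, not parameters from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter
from sklearn.cluster import KMeans

def wrinkledness(depth, size=15):
    """Approximate average absolute deviation of depth around each pixel."""
    local_mean = uniform_filter(depth, size)
    return uniform_filter(np.abs(depth - local_mean), size)

# usage on a synthetic depth map: threshold and cluster the wrinkled points
depth = np.random.rand(240, 320)
w = wrinkledness(depth)
pts = np.argwhere(w > w.mean() + 2 * w.std())       # highly wrinkled pixels
labels = KMeans(n_clusters=8, n_init=10).fit_predict(pts)
```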

Figure: Cloth flattening by [Sun et al., 2013] utilizes the depth map to compute a measure of wrinkledness. The highly wrinkled points are clustered and the orientation of the wrinkles is detected.

Figure: [Li et al., 2014] utilize depth maps acquired from 3D models in the training phase, or from a Kinect sensor in the testing phase. SIFT features are extracted and converted to combinations of codewords in order to recognize the garment category and the grasped point (pipeline: 3D model → depth map → SIFT vectors → codewords → recognition).

3.13 Columbia University

[Li et al., 2014] are interested in category and pose recognition of a hanging garment being held by a robotic arm. They use only depth maps acquired by the Kinect sensor to avoid dealing with potentially complex clothes texture. The training data were obtained by modeling and simulating garments in the Maya software. Each garment was virtually grasped at each of its predefined grasping points and lifted up under gravity. Its depth images were acquired by 90 virtual cameras from various positions and orientations. Then dense SIFT features [Lowe, 2004] are extracted from each image and quantized to get a codebook. Each feature vector is represented as a linear combination of several codewords. The whole image is represented as a bag of features [Lazebnik et al., 2006] which is used to train an SVM classifier utilizing max-pooling [Yang et al., 2009].

An unknown hanging garment is classified from 150 to 200 depth images acquired by rotating the holding arm. The classification consists of two layers. In the first layer, each depth image votes for the garment category to choose the dominant one. In the second layer, each depth image votes for one of the grasping points, depending on the selected category. The grasping points are processed by a label pruning algorithm: it iteratively computes the mean of the estimated grasping points and removes points far from this mean (a sketch is given below). The first layer utilizes RBF kernels, while the second one utilizes a linear SVM. The category recognition accuracy is 60% to 90%, depending on the testing data. Regarding the estimation of the grasped point, 50% of the points are estimated within a 10 cm tolerance and 70% within a 15 cm tolerance.
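The label pruning step is a simple robust-mean iteration over the per-view grasp point predictions. A minimal sketch; the 5 cm rejection radius and the iteration cap are illustrative choices, not the authors' parameters.

```python
import numpy as np

def prune_votes(points, radius=0.05, iters=10):
    """Iteratively discard grasp-point votes far from the running mean.
    points: (n, 3) array of 3D grasp point predictions, one per view."""
    kept = points
    for _ in range(iters):
        mean = kept.mean(axis=0)
        dist = np.linalg.norm(kept - mean, axis=1)
        inliers = kept[dist < radius]
        if len(inliers) in (0, len(kept)):   # converged or would empty the set
            break
        kept = inliers
    return kept.mean(axis=0)
```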

3.14 Existing datasets

This section provides references to the currently available datasets related to our field of study. They comprise mainly color images of clothes in various configurations, depending on the intended purpose of the particular dataset. Some datasets also include depth maps taken by range sensors. The images are usually accompanied by specific annotations.

Figure: The database by [Wagner et al., 2013] contains color images and depth maps (3rd image from left) of flat, wrinkled and folded garments.

Figure: The database by [Aragon-Camarasa et al., 2013] contains stereo pairs of high resolution images which can be used for 3D reconstruction of point clouds.

[Wagner et al., 2013] created a database of color and depth images showing garments lying on a flat wooden surface. The garments are spread out; some of them are additionally folded once or twice. The database includes more than 2000 entries of 17 different garments split into 9 categories. Each entry comprises a color image and a depth image taken by the Asus Xtion PRO LIVE device. The attached annotation text file defines pixel coordinates of important landmark points located on the garment contour, e.g. armpits, endpoints of sleeves or the crotch.

[Mariolis and Malassiotis, 2013a] published a database of garments which were spread and then folded once along a random folding line. The database contains 65 color images of 12 different garments lying on a monochromatic surface. Each image is annotated with the coordinates of important landmark points, as in the previous case.

The database by [Aragon-Camarasa et al., 2013] comprises stereo data for 16 different garments, each of them captured in 5 pose configurations (flat, half folded, fully folded, wrinkled and hanging). Each database item consists of a pair of stereo color images, a pair of garment segmentation masks, two stereo disparity maps (horizontal and vertical), two corresponding disparity confidence maps and calibration parameters for both cameras.

[Miller et al., 2011] provide a database of more than 500 color images of various garment types spread on a green surface. Most of the images were also transformed to a bird's-eye perspective. Each piece of garment was captured in different poses, e.g. with the sleeves arranged at various angles relative to the armpits. No annotations are provided.

Figure: The database by [Grana et al., 2014] contains fashion photos of garments.

Figure: The database by [Ramisa et al., 2012] contains images of garments with annotated polygonal regions showing the important parts.

The database published by [Grana et al., 2014] contains a large collection of color images of garments. There are 3 types of images: garments worn by a model, garments worn by a mannequin, and garments set on a monochromatic background. The images are provided in full and reduced resolutions which differ from image to image. Each image is accompanied by a binary background segmentation mask. There is also an annotation for each piece of garment giving its dominant color, a general category and a more detailed category.

[Ramisa et al., 2012] created a database containing more than 600 color images and depth maps of garments lying on a table. They were acquired by the Kinect device; the resolution is 640×480 for both color and depth images. Binary background segmentation masks are also available. The dataset aims at the recognition of clothes parts. Thus each image is provided with polygonal regions in pixel coordinates which delimit one of 11 predefined interesting clothes parts, e.g. collars, sleeves or hemlines. There are more than 1000 annotated regions. There are 6 instances of garments arranged in a folded pose, a wrinkled pose, or stacked on a heap of garments.

Figure: The database by [Willimon et al., 2013b] shows various garments being held and manipulated by a human.

[Willimon et al., 2013a] provide a database of color images and point clouds of garments lying on a monochromatic green background. There are flat, folded, wrinkled and crumpled garments. The database comprises 202 different garments (85 shirts, 30 cloths, 25 pants, 25 shorts, 10 dresses, 22 socks and 5 jackets), each of them captured 5 times. All the data were acquired by the Kinect device and thus are provided in its standard 640×480 resolution. The annotation gives the garment category, male/female/unisex type, intended age of the owner and the season to be worn. The main purpose of the database is the recognition of these features.

There is another database created by [Willimon et al., 2013b]. It contains image sequences of garments being held and manipulated by a human. The main purpose of this database is the estimation of the garment pose. There are 10 sequences of shirts, shorts and pants in the dataset, each of them consisting of nearly 200 images. Background segmentation masks for all images are also provided.

4. Our past contribution

Our research has been motivated by the needs of the CloPeMa (Clothes Perception and Manipulation) project up to now. It is a three year research project funded by the European Commission. It aims at the development of a robot which will be able to perceive and manipulate fabrics, textiles and garments. The CloPeMa robot consists of two industrial robotic arms mounted on a standard turn-table. Both arms are equipped with grippers developed by our project collaborators. The perception system consists of range and vision sensors mounted on the arms and of a binocular head.

The functionality of the robot is demonstrated in several tasks. Probably the most complex of them is a complete pipeline for processing a pile of crumpled garments. The pipeline comprises isolating a single garment from the pile, recognizing its category, unfolding it, laying the garment down on the table, folding it and eventually flattening it. We have developed a complete vision solution for the folding part of the pipeline and successfully integrated it into the existing manipulation framework of the robot. The folding procedure starts with a piece of garment spread on a flat surface. The procedure consists of several iterations of perception-decision-manipulation loops; each loop corresponds to performing one fold of the garment. Fig. 4.1 shows a short-sleeved shirt folding sequence.

In [Stria et al., 2014a] (see Appendix A), we introduce the initial version of our computer vision solution for automated clothes folding. We assume that there is a single piece of garment spread on a flat surface. We know the garment category in advance, e.g. towel, pants or shirt. The color properties of that surface are known in advance too; they are stored in the form of a learned probabilistic color model. Segmentation of the garment and the background surface is achieved using a modified grabcut algorithm proposed by [Rother et al., 2004]. The modification lies in an automatic initialization of the algorithm which does not engage the user. The segmentation is followed by extraction of the garment contour. The contour is approximated by a polygon using a dynamic programming approach proposed by [Perez and Vidal, 1994]. The simplified contour is then matched to a polygonal model, also by dynamic programming. The polygonal model combines a manually defined structure with probabilistic distributions of inner angles learned from training data. There is one model for each type of clothes. Once the polygonal model is matched to the contour, we know the positions of the important landmark points, e.g. armpits, inner and outer corners of sleeves or the crotch. The landmarks are then used by the robot for performing the folding sequence. In this initial version of the method, the vision sensing is performed only once and the garment configuration is not checked between individual folds. The overall loop is sketched below.
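The perception-decision-manipulation loop can be summarized as follows. This is a high-level sketch only: the callable arguments are placeholders for the corresponding stages of our pipeline, not an actual API of the robot framework.

```python
def fold_garment(robot, camera, pipeline, plan):
    """One perception-decision-manipulation loop per planned fold.
    pipeline: callable image -> landmark dictionary, chaining segmentation,
              contour tracing, polygonal simplification and model matching
    plan:     sequence of fold objects providing plan(landmarks) -> (grasp, place)"""
    for fold in plan:
        image = camera.capture()               # perception
        landmarks = pipeline(image)            # grabcut -> contour -> DP matching
        grasp, place = fold.plan(landmarks)    # decision
        robot.execute_fold(grasp, place)       # manipulation
```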

Figure 4.1. Our dual-arm CloPeMa robot successfully folding a blue short-sleeved shirt in a sequence of three successive moves.

In [Stria et al., 2014b] (see Appendix B), we improve the original method in several ways. The process of segmentation and contour simplification remains unaltered. However, the polygonal model is improved to also consider probabilistic distributions of the relative lengths of its edges. The dynamic programming procedure for matching a contour to the model is significantly improved by considering not only neighboring points of the contour; for more details refer to Appendix B. The folding procedure is improved to repeat the vision sensing after performing each single fold. This feedback is important because of possible unintentional translations and rotations of the garment caused by the robotic manipulation. The simplified contour of the folded garment is matched to a polygonal model of the folded garment. The folded model is created automatically from the original one in agreement with the planned fold.

The proposed method was compared to the previous state-of-the-art algorithm described in [Miller et al., 2011] and [Miller et al., 2012]. The precision of landmark point localization is similar for both methods. The localization displacements are roughly around 1 cm, which is sufficient for automated folding. However, our method is approximately 10 to 50 times faster. The proposed sensing algorithm was also deployed on our CloPeMa robot and successfully tested in real folding of towels, pants, shorts, short-sleeved shirts and long-sleeved shirts. The folding procedure was integrated into the complex procedure solving the task of isolating a single crumpled garment from a pile of clothes, unfolding it, laying it on a table and folding it.

5. Future work

The ultimate goal of the thesis is to advance the methods for machine perception and automated manipulation of non-rigid objects, mainly clothes. The particular subgoals are partially motivated by the needs of the aforementioned CloPeMa project and possible follow-up projects. However, we would also like to work on the more general idea of an advanced representation of non-rigid objects and its usage for active perception.

The first goal is related to the folding pipeline proposed in Chap. 4. The folding sequence is not always reliable; its overall success rate is approximately 70%. One source of failures is the setting of the garment on a table before folding it, which may result in creases and wrinkles as seen in Fig. 5.1a. Another source of failures is an incorrectly performed fold, which can be caused by several factors. The first of them is an unreliable gripper mechanism which drops the garment before finishing the planned folding move. The second one is sliding of the whole garment on the table during the folding, which leads to an incorrectly positioned folding line. The third factor is the inability of the robotic wrist to rotate the grasped garment while folding it, which may lead to an unintentional backward fold on top of the planned fold. These issues can be seen in Fig. 5.1b and Fig. 5.1c.

We believe that the described failures can be automatically detected and then fixed by performing additional robotic manipulations. The vision based detection can profit from the fact that we know both the initial state of the cloth and the planned folding move. Thus we can model what the final state should be and compare the predicted model with the observed data; a simple sketch of such a check is given after the figure description below. The perception should utilize both color images and depth maps obtained from range detectors and stereo cameras. The depth maps can be used to reconstruct the surface of the garment. The color image provides additional information about discontinuities of the garment surface caused by folds and creases. From our experience, these discontinuities are often not visible in the depth map because the garment is thin. Therefore we would like to use the reconstructed surface as a smooth approximation of the real garment geometry and combine it with edge detection in the color image. The edge detection should be based both on shading and on discontinuities in texture tiling.

Figure 5.1. Examples of folding issues which could possibly be detected by additional perception and fixed by performing an extra manipulation: (a) creases and wrinkles arisen while setting garments on the table; (b) a displaced folding line due to garment sliding; (c) an unintentional backward fold caused by the non-rotating gripper.
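One possible form of the fold verification is purely geometric: predict the post-fold contour by mirroring the matched polygon across the planned folding line (as in the reflection sketched in Sec. 3.11) and compare it with the newly observed contour. A minimal sketch using the Shapely library, which is an assumption of ours; the overlap threshold is illustrative.

```python
from shapely.geometry import Polygon

def fold_succeeded(predicted_contour, observed_contour, thresh=0.9):
    """Check a fold by intersection-over-union of the predicted post-fold
    shape against the newly observed garment shape.
    predicted_contour, observed_contour: sequences of (x, y) vertices."""
    predicted = Polygon(predicted_contour).buffer(0)  # repair self-crossings
    observed = Polygon(observed_contour).buffer(0)
    iou = predicted.intersection(observed).area / predicted.union(observed).area
    return iou > thresh
```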

Figure 5.2. Images taken by the camera attached to the robotic wrist. They show the wrist approaching the garment, grasping it by closing the gripper and picking the garment up.

The detected folding issues can be fixed in several ways. Some of them can be fixed by pushing some part of the garment or by grasping and moving it. More serious issues may require reverting the last fold and performing it again. If a failure ends in a totally crumpled state of the garment, it is always possible to restart the whole CloPeMa pipeline, including unfolding, setting the garment on a table and folding it correctly.

Another interesting idea is real-time visual checking performed during the robotic manipulation. It would be a significant improvement over the current state, where the perception procedure is performed only at the beginning and then after each fold. Moreover, real-time visual perception is used by almost none of the currently available methods discussed in Chap. 3. The real-time perception could be used e.g. for checking whether the gripper has successfully grasped the garment or whether the garment has been unintentionally dropped during manipulation. Fig. 5.2 shows example real-time folding images. The real-time checking could also be realized by analyzing data provided by the tactile and force sensors which are integrated in the gripper. A simple sketch of the visual variant of such a check follows.
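A very simple instance of such a real-time check is to watch the amount of garment visible in a region of interest around the gripper in the wrist camera stream: if the grasped cloth disappears from the region, it has probably been dropped. A minimal sketch, assuming a per-frame garment segmentation routine is available; the region and threshold are illustrative.

```python
import numpy as np

def grasp_monitor(frames, segment, roi, min_fill=0.05):
    """Yield a drop alarm for every wrist-camera frame.
    frames:  iterable of images from the wrist camera
    segment: function returning a boolean garment mask for an image
    roi:     (y0, y1, x0, x1) region around the gripper fingers"""
    y0, y1, x0, x1 = roi
    for frame in frames:
        mask = segment(frame)[y0:y1, x0:x1]
        fill = np.count_nonzero(mask) / mask.size
        yield fill < min_fill   # True = garment probably dropped
```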

Clothes are not just pieces of fabric. They consist of many meaningful parts which can be easily recognized by a human, e.g. a collar, neckline, hem, pocket, button, zip fastener etc. We would like to be able to recognize these parts and describe their qualities, e.g. the shape of the neckline or the type of the pocket. This is the first step towards a semantic description of the garment. Moreover, the garment itself has its own qualities. It can be said whether a skirt is short or long, whether jeans have a low or high rise, whether a shirt is slim fit or fly away etc.

The automated semantic description obtained from an image of a garment could be used in several scenarios. The first scenario is directly related to our clothes folding pipeline. Since we recognize the configuration of the garment from its contour, we are not able to decide whether we see its front or back side. However, humans take the front-back orientation of the garment into account while folding it. We would like to define various rules, e.g. that jeans usually have big pockets on the back side and a button on the front side as in Fig. 5.3a, that there is a zipper on the front side of a jacket, or that the front neckline is usually more cut out than the back one. These rules could be used for active vision. The robot would look at the garment and possibly decide whether it is set front or back side up. If it could not decide with certainty, it would reverse the garment and check it from the other side. Once a decision was made, the robot would eventually reverse the garment to the correct front-back orientation prior to folding.

Figure 5.3. A semantic description of clothes can be used for: (a) recognition of the front and back side of clothes (front and back side of jean shorts); (b) retrieval of similar clothes having dissimilar color or texture (the same dress in various colors).

The semantic description of clothes would also be very useful for the retrieval systems mentioned in Sec. 2.3. The currently available methods are based on color and texture similarity; however, the mentioned qualitative features are far more discriminative in our opinion, as illustrated in Fig. 5.3b. This research could perhaps have an industrial application in clothes recommendation systems used in online stores.

There is another task related to clothes semantics: an advanced representation of the fabric forming the garment. The goal is to automatically create such a representation from the perceived data and utilize it for automated manipulation. We strongly believe that the representation can be based on the theory of manifolds [Tu, 2008]. In mathematics, an n-dimensional manifold is a topological space which resembles the n-dimensional Euclidean space in a close neighborhood of every point. An example of a 1D manifold is a circle in the Euclidean plane; an example of a 2D manifold is a sphere or a torus in 3D space. In our case, we deal with a 2D manifold in 3D space which is the surface of the garment. If we were able to construct a manifold from the perceived garment and to understand its semantics, we would know its precise configuration completely. Then all the garment manipulation tasks could be solved relatively easily and the correct robotic motion planning would be the only remaining problem.

However, the construction of the semantic manifold is very challenging. We would like to be able to construct it not only for a spread garment, but also for a crumpled one. It should be achieved by incorporating the active vision paradigm. In each step, the robot will construct a representation of the currently visible part of the garment, memorize it and then manipulate the garment in order to view another part of it. Thus the manifold will be built incrementally. A small sketch of computing the intrinsic metric of such a surface follows.
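A first ingredient of such a manifold representation is the intrinsic (geodesic) metric of the garment surface, which is invariant to how the cloth is crumpled in 3D. The sketch below approximates geodesic distances on a perceived point cloud by shortest paths in a k-nearest-neighbor graph; this is an illustrative choice of ours, not an established part of the thesis pipeline.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path
from scipy.spatial import cKDTree

def geodesic_distances(points, k=8):
    """Approximate intrinsic distances on a surface sampled as a point cloud.
    points: (n, 3) array of 3D surface points from the range sensor."""
    n = len(points)
    tree = cKDTree(points)
    dist, idx = tree.query(points, k=k + 1)   # first neighbor is the point itself
    rows = np.repeat(np.arange(n), k)
    graph = csr_matrix((dist[:, 1:].ravel(), (rows, idx[:, 1:].ravel())),
                       shape=(n, n))
    # symmetric shortest paths approximate the manifold's geodesic metric
    return shortest_path(graph, method="D", directed=False)
```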

Bibliography

[Alenyà et al., 2012] Alenyà, G., Ramisa, A., Moreno-Noguer, F., and Torras, C. (2012). Characterization of textile grasping experiments. In Proc. ICRA Workshop on Conditions for Replicable Experiments and Performance Comparison in Robotics Research.

[Aragon-Camarasa et al., 2013] Aragon-Camarasa, G., Oehler, S. B., Liu, Y., Li, S., Cockshott, P., and Siebert, J. P. (2013). Glasgow's stereo image database of garments.

[Bay et al., 2008] Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008). Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3).

[Bersch et al., 2011] Bersch, C., Pitzer, B., and Kammel, S. (2011). Bimanual robotic cloth manipulation for laundry folding. In Proc. IEEE International Conference on Intelligent Robots and Systems (IROS 2011). IEEE.

[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

[Breiman, 2001] Breiman, L. (2001). Random forests. Machine Learning, 45(1).

[Canny, 1986] Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6).

[Cusumano-Towner et al., 2011] Cusumano-Towner, M., Singh, A., Miller, S., O'Brien, J. F., and Abbeel, P. (2011). Bringing clothing into desired configurations with limited perception. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2011). IEEE.

[Dalal and Triggs, 2005] Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1. IEEE.

[Doumanoglou et al., 2014] Doumanoglou, A., Kargakos, A., Kim, T.-K., and Malassiotis, S. (2014). Autonomous active recognition and unfolding of clothes using random decision forests and probabilistic planning. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2014). IEEE.

[Duda and Hart, 1972] Duda, R. O. and Hart, P. E. (1972). Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1).

[Felzenszwalb and Huttenlocher, 2004] Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2).

[Fischler and Bolles, 1981] Fischler, M. A. and Bolles, R. C. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6).

[Fogel and Sagi, 1989] Fogel, I. and Sagi, D. (1989). Gabor filters as texture discriminator. Biological Cybernetics, 61(2).

[Gall et al., 2011] Gall, J., Yao, A., Razavi, N., Van Gool, L., and Lempitsky, V. (2011). Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11).

[Grana et al., 2014] Grana, C., Manfredi, M., Calderara, S., and Cucchiara, R. (2014). Garment selection and color classification. imagelab/researchactivity.asp?idattivita=

[Hamajima and Kakikura, 2000] Hamajima, K. and Kakikura, M. (2000). Planning strategy for task of unfolding clothes. Robotics and Autonomous Systems, 32(2).

[Harris and Stephens, 1988] Harris, C. and Stephens, M. (1988). A combined corner and edge detector. In Proc. Alvey Vision Conference.

[Hastie et al., 2009] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer.

[Hata et al., 2009] Hata, S., Hiroyasu, T., Hayashi, J., Hojoh, H., and Hamada, T. (2009). Flexible handling robot system for cloth. In Proc. IEEE International Conference on Mechatronics and Automation (ICMA 2009). IEEE.

[Hata et al., 2008] Hata, S., Hiroyasu, T., Hayashi, J., Hojoh, H., and Hamada, T. (2008). Robot system for cloth handling. In Proc. Annual Conference of the IEEE Industrial Electronics Society (IECON 2008). IEEE.

[Kaelbling et al., 1998] Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1).

[Kalantidis et al., 2013] Kalantidis, Y., Kennedy, L., and Li, L.-J. (2013). Getting the look: Clothing recognition and segmentation for automatic product suggestions in everyday photos. In Proc. ACM International Conference on Multimedia Retrieval (ICMR 2013). ACM.

[Kaneko and Kakikura, 2001] Kaneko, M. and Kakikura, M. (2001). Planning strategy for putting away laundry isolating and unfolding task. In Proc. IEEE International Symposium on Assembly and Task Planning 2001. IEEE.

[Kass et al., 1988] Kass, M., Witkin, A., and Terzopoulos, D. (1988). Snakes: Active contour models. International Journal of Computer Vision, 1(4).

[Kita et al., 2011] Kita, Y., Kanehiro, F., Ueshiba, T., and Kita, N. (2011). Clothes handling based on recognition by strategic observation. In Proc. IEEE International Conference on Humanoid Robots (Humanoids 2011). IEEE.

[Kita and Kita, 2002] Kita, Y. and Kita, N. (2002). A model-driven method of estimating the state of clothes for manipulating it. In Proc. IEEE Workshop on Applications of Computer Vision (WACV 2002). IEEE.

[Kita et al., 2010] Kita, Y., Neo, E. S., Ueshiba, T., and Kita, N. (2010). Clothes handling using visual recognition in cooperation with actions. In Proc. IEEE International Conference on Intelligent Robots and Systems (IROS 2010). IEEE.

[Kita et al., 2004a] Kita, Y., Saito, F., and Kita, N. (2004a). A deformable model driven method for handling clothes. In Proc. International Conference on Pattern Recognition (ICPR 2004). IEEE.

[Kita et al., 2004b] Kita, Y., Saito, F., and Kita, N. (2004b). A deformable model driven visual method for handling clothes. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2004). IEEE.

[Kita et al., 2009a] Kita, Y., Ueshiba, T., Neo, E. S., and Kita, N. (2009a). Clothes state recognition using 3D observed data. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2009). IEEE.

[Kita et al., 2009b] Kita, Y., Ueshiba, T., Neo, E. S., and Kita, N. (2009b). A method for handling a specific part of clothing by dual arms. In Proc. IEEE International Conference on Intelligent Robots and Systems (IROS 2009). IEEE.

[Kobayashi et al., 2008] Kobayashi, H., Hata, S., Hojoh, H., Hamada, T., and Kawai, H. (2008). A study on handling system for cloth using 3-D vision sensor. In Proc. IEEE Annual Conference of Industrial Electronics (IECON 2008). IEEE.

[Lander, 1999] Lander, J. (1999). Devil in the blue-faceted dress: Real-time cloth animation. Game Developer Magazine.

[Lazebnik et al., 2006] Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2006), volume 2. IEEE.

[Li et al., 2014] Li, Y., Chen, C.-F., and Allen, P. K. (2014). Recognition of deformable object category and pose. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2014). IEEE.

[Ling and Jacobs, 2007] Ling, H. and Jacobs, D. W. (2007). Shape classification using the inner-distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2).

[Lowe, 2004] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2).

[Maitin-Shepard et al., 2010] Maitin-Shepard, J., Cusumano-Towner, M., Lei, J., and Abbeel, P. (2010). Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2010). IEEE.

[Manfredi et al., 2014] Manfredi, M., Grana, C., Calderara, S., and Cucchiara, R. (2014). A complete system for garment segmentation and color classification. Machine Vision and Applications, 25(4).

[Mariolis and Malassiotis, 2013a] Mariolis, I. and Malassiotis, S. (2013a). CERTH color image dataset of folded garments. ftp://clopema.iti.gr/certh_folded_clothes_database.

[Mariolis and Malassiotis, 2013b] Mariolis, I. and Malassiotis, S. (2013b). Matching folded garments to unfolded templates using robust shape analysis techniques. In Proc. Computer Analysis of Images and Patterns (CAIP 2013). Springer.

[Miller et al., 2011] Miller, S., Fritz, M., Darrell, T., and Abbeel, P. (2011). Parametrized shape models for clothing. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2011). IEEE.

[Miller et al., 2012] Miller, S., Van Den Berg, J., Fritz, M., Darrell, T., Goldberg, K., and Abbeel, P. (2012). A geometric approach to robotic laundry folding. The International Journal of Robotics Research, 31(2).

[Ojala et al., 1996] Ojala, T., Pietikäinen, M., and Harwood, D. (1996). A comparative study of texture measures with classification based on featured distributions. Pattern Recognition, 29(1).

[Osawa and Kano, 2012] Osawa, F. and Kano, K. (2012). Contour tracking of soft sheet materials using local contour image data. International Journal of Automation Technology, 6(5).

[Osawa et al., 2007] Osawa, F., Seki, H., and Kamiya, Y. (2007). Unfolding of massive laundry and classification types by dual manipulator. Journal of Advanced Computational Intelligence and Intelligent Informatics, 11(5).

[Perez and Vidal, 1994] Perez, J.-C. and Vidal, E. (1994). Optimum polygonal approximation of digitized curves. Pattern Recognition Letters, 15(8).

[Ramisa et al., 2012] Ramisa, A., Alenyà, G., Moreno-Noguer, F., and Torras, C. (2012). Using depth and appearance features for informed robot grasping of highly wrinkled clothes. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2012). IEEE.

[Ramisa et al., 2013] Ramisa, A., Alenyà, G., Moreno-Noguer, F., and Torras, C. (2013). FINDDD: A fast 3D descriptor to characterize textiles for robot manipulation. In Proc. IEEE International Conference on Intelligent Robots and Systems (IROS 2013). IEEE.

[Rother et al., 2004] Rother, C., Kolmogorov, V., and Blake, A. (2004). GrabCut: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3).

[Rusu et al., 2009] Rusu, R. B., Blodow, N., and Beetz, M. (2009). Fast point feature histograms (FPFH) for 3D registration. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2009). IEEE.

[Stria et al., 2014a] Stria, J., Průša, D., and Hlaváč, V. (2014a). Polygonal models for clothing. In Proc. Towards Autonomous Robotic Systems (TAROS 2014). To appear.

[Stria et al., 2014b] Stria, J., Průša, D., Hlaváč, V., Wagner, L., Petrík, V., Krsek, P., and Smutný, V. (2014b). Garment perception and its folding using a dual-arm robot. In Proc. International Conference on Intelligent Robots and Systems (IROS 2014). To appear.

[Sugiura et al., 2009] Sugiura, Y., Igarashi, T., Takahashi, H., Gowon, T. A., Fernando, C. L., Sugimoto, M., and Inami, M. (2009). Graphical instruction for a garment folding robot. In Proc. ACM SIGGRAPH Emerging Technologies, page 12. ACM.

[Sun et al., 2013] Sun, L., Aragon-Camarasa, G., Siebert, J. P., and Rogers, S. (2013). A heuristic-based approach for flattening wrinkled clothes. In Proc. Towards Autonomous Robotic Systems (TAROS 2013). Springer.

[Tu, 2008] Tu, L. W. (2008). An Introduction to Manifolds, 2nd ed. Springer.

[van den Berg et al., 2011] van den Berg, J., Miller, S., Goldberg, K., and Abbeel, P. (2011). Gravity-based robotic cloth folding. In Proc. International Workshop on the Algorithmic Foundations of Robotics (WAFR 2010). Springer.

[Varma and Zisserman, 2005] Varma, M. and Zisserman, A. (2005). A statistical approach to texture classification from single images. International Journal of Computer Vision, 62(1-2).

[Wagner et al., 2013] Wagner, L., Krejčová, D., and Smutný, V. (2013). CTU color and depth image dataset of spread garments. Technical Report CTU-CMP, Czech Technical University in Prague.

[Wang et al., 2011] Wang, P. C., Miller, S., Fritz, M., Darrell, T., and Abbeel, P. (2011). Perception for the manipulation of socks. In Proc. IEEE International Conference on Intelligent Robots and Systems (IROS 2011). IEEE.

[Willimon et al., 2011a] Willimon, B., Birchfield, S., and Walker, I. (2011a). Classification of clothing using interactive perception. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2011). IEEE.

[Willimon et al., 2011b] Willimon, B., Birchfield, S., and Walker, I. (2011b). Model for unfolding laundry using interactive perception. In Proc. IEEE International Conference on Intelligent Robots and Systems (IROS 2011). IEEE.

[Willimon et al., 2012] Willimon, B., Hickson, S., Walker, I., and Birchfield, S. (2012). An energy minimization approach to 3D non-rigid deformable surface estimation using RGBD data. In Proc. IEEE International Conference on Intelligent Robots and Systems (IROS 2012). IEEE.

[Willimon et al., 2013a] Willimon, B., Walker, I., and Birchfield, S. (2013a). 3D non-rigid deformable surface estimation without feature correspondence. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2013). IEEE.

[Willimon et al., 2013b] Willimon, B., Walker, I., and Birchfield, S. (2013b). Classification of clothing using mid-level layers. ISRN Robotics.

[Willimon et al., 2013c] Willimon, B., Walker, I., and Birchfield, S. (2013c). A new approach to clothing classification using mid-level layers. In Proc. IEEE International Conference on Robotics and Automation (ICRA 2013). IEEE.

[Yamaguchi et al., 2012] Yamaguchi, K., Kiapour, M. H., Ortiz, L. E., and Berg, T. L. (2012). Parsing clothing in fashion photographs. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012). IEEE.

[Yamazaki and Inaba, 2009] Yamazaki, K. and Inaba, M. (2009). A cloth detection method based on image wrinkle feature for daily assistive robots. In Proc. IAPR Conference on Machine Vision Applications (MVA 2009).

[Yamazaki et al., 2012] Yamazaki, K., Ueda, R., Nozawa, S., Kojima, M., Okada, K., Matsumoto, K., Ishikawa, M., Shimoyama, I., and Inaba, M. (2012). Home-assistant robot for an aging society. Proceedings of the IEEE, 100(8).

[Yamazaki et al., 2010] Yamazaki, K., Ueda, R., Nozawa, S., Mori, Y., Maki, T., Hatao, N., Okada, K., and Inaba, M. (2010). System integration of a daily assistive robot and its application to tidying and cleaning rooms. In Proc. IEEE International Conference on Intelligent Robots and Systems (IROS 2010). IEEE.

[Yang et al., 2009] Yang, J., Yu, K., Gong, Y., and Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009). IEEE.

[Zhang et al., 2007] Zhang, J., Marszałek, M., Lazebnik, S., and Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2).

A. Polygonal Models for Clothing

A reprint of our paper Polygonal Models for Clothing which will be published in the proceedings of the 15th Towards Autonomous Robotic Systems (TAROS 2014) conference.

Polygonal Models for Clothing

Jan Stria, Daniel Průša, and Václav Hlaváč

Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Karlovo nám. 13, Prague 2, Czech Republic

Abstract. We address the problem of recognizing a configuration of a piece of garment fairly spread out on a flat surface. We suppose that the background surface is invariant and that its color is sufficiently dissimilar from the color of the garment. This assumption enables quite reliable segmentation followed by extraction of the garment contour. The contour is approximated by a polygon which is then fitted to a polygonal garment model. The model is specific for each category of garment (e.g. towel, pants, shirt) and its parameters are learned from training data. The fitting procedure is based on the minimization of an energy function expressing dissimilarities between observed and expected data. The fitted model provides a reliable estimation of garment landmark points which can be utilized for automated folding using a pair of robotic arms. The proposed method was experimentally verified on a dataset of images. It was also deployed on a robot and tested in real-time automated folding.

Keywords: clothes folding, robotic manipulation

1 Introduction

We present a solution for identifying an arrangement of a piece of garment spread out on a flat surface. Our research is motivated by the needs of the European Commission funded project Clothes Perception and Manipulation (CloPeMa) [16]. This project focuses on garment manipulation (sorting, folding, etc.) by a two-armed industrial robot which is shown in Fig. 1. The general aim is to advance the state of the art in the autonomous perception and manipulation of limp materials like fabrics, textiles and garments, placing the emphasis on universality and robustness of the methods.

The task of clothes state recognition has already been approached by Miller et al. [9]. They consider a garment fairly spread on a green surface, which allows segmenting the images by simple color thresholding. The obtained garment contour is fitted to a parametric polygonal model specific for a particular category of clothing. The fitting procedure is based on the iterative estimation of numeric parameters of the given model. The authors report quite accurate results. However, the main drawback is slow performance: it takes tens of seconds for a single contour and a single model. This makes the algorithm practically unusable for real-time operation.

Fig. 1: Our robotic test platform utilizes two industrial hollow-wrist welding manipulators Motoman MA1400 with mounted cameras, Kinect-like rangefinders and dedicated grippers. We use it as the experimental platform for folding of various types of clothes.

The authors also use the parametric model for the recognition and fitting of pairs of socks [15]; information about the texture of the socks is utilized there as well. Another successful application is the automated folding of towels based on a robust visual detection of their corner points [8]. The two-armed robot starts with a towel randomly dropped on a table and folds it in a sequence of manipulations performed both on the table and in the air. Kita et al. use a single-view image [5] and stereo images [6] to estimate the state of hanging clothes being held by a gripper. Their approach is based on matching a deformable model to the observed data. Hata et al. [3] solve the problem of lifting a single towel from a pile of highly wrinkled towels and grasping it by its corner. The solution is based on the detection of the highest point of the pile followed by corner detection in stereo data. Ramisa et al. [12] are also interested in the determination of the grasping point. They combine features computed from both color and range images in order to locate highly wrinkled regions. The research area of cloth modeling is explored mainly by the computer graphics community; Hu et al. [4] give an overview of the known methods.

In this work, we propose a complete pipeline for clothes configuration recognition by estimating the positions of the most important landmark points (e.g. all four corners of a towel). The identified landmarks can be used for automated folding performed by robotic arms. We introduce our own polygonal models describing contours of various categories of clothing and develop a fast, dynamic programming based method for an efficient fitting of an unknown contour to the models. Moreover, we have modified the grabcut image segmentation algorithm to work automatically without being initialized by user input, utilizing a background model learned in advance from training data. The recognition pipeline can be summarized as follows:

1. Capturing input: The input is a single color image of a piece of garment spread on a table. We assume that the type of the clothing (e.g. towel, pants, shirt) is known in advance. The image is taken from a bird's eye perspective by a camera attached to the robot. Since the relative position of the table and the camera is known, all pixels not displaying the table and the garment lying on it can be cropped.

2. Segmentation: The goal is to segment the garment and its background, which is a wooden table in our case. We assume that the table and the garment have statistically dissimilar colors. We also assume that the table is invariant and thus its color properties can be learned from data. These assumptions make it possible to modify the grabcut segmentation algorithm [13] in a way that it does not require manual initialization.

3. Contour detection: The binary mask obtained from the segmentation is processed by Moore's algorithm [2] for tracing the 8-connected boundary of a region. This gives a bounding polygon of the garment whose vertices are formed by individual contour pixels.

4. Polygonal approximation: The dense boundary is then approximated by a polygon having fewer vertices. Their exact count depends on the model of garment which we want to fit in the following step. Generally, the number of vertices is higher than the number of landmark points for a specific model.

5. Model fitting: The polygonal approximation of the garment contour is matched to the polygonal model defined for the corresponding type of garment. The matching procedure employs a dynamic programming approach to find correspondences between approximating vertices and landmark points defining the specific polygonal model. The matching considers mainly local features of the approximating polygon. As there are more vertices than landmarks, some of the contour vertices remain unmatched.

2 Contour extraction

2.1 Learning the background color model

The background color model is a conditional probabilistic distribution of RGB values of background pixels. The distribution is represented as a mixture of K 3D Gaussians (GMM):

p(z) = \sum_{k=1}^{K} \pi_k N(z; \mu_k, \Sigma_k) = \sum_{k=1}^{K} \pi_k \frac{\exp(-\frac{1}{2}(z - \mu_k)^T \Sigma_k^{-1} (z - \mu_k))}{\sqrt{(2\pi)^3 |\Sigma_k|}}    (1)

Here π_k is the prior probability of the k-th component and N(z; μ_k, Σ_k) denotes the 3D normal distribution having a mean vector μ_k and a covariance matrix Σ_k. The mixture is learned from training data, i.e. from a set Z = {z_n = (z_n^R, z_n^G, z_n^B)^T ∈ [0, 1]^3} of vectors representing RGB intensities of |Z| background pixels. The number of GMM components K is determined empirically, based on the number of visible clusters in the RGB cube with visualized training data. E.g. for a nearly uniform green background one component should be sufficient; for the table in our experiments we choose three components.

To train the GMM probabilistic distribution, we first split the training data into K clusters C_1 ... C_K, employing the binary tree algorithm for palette design [10]. The algorithm starts with all training data Z assigned to a single cluster and iteratively constructs a binary-tree-like hierarchy of clusters in a top-down manner. In each iteration, the cluster having the greatest variance is split into two new clusters. The separating plane passes through the center of the cluster and its normal vector is parallel to the principal eigenvector of the cluster. The algorithm stops after K − 1 iterations with K clusters. The prior probability π_k, the mean vector μ_k and the covariance matrix Σ_k of the k-th GMM component are computed using the maximum likelihood principle [1] from the data vectors contained in the corresponding cluster C_k:

\pi_k = \frac{|C_k|}{|Z|}, \quad \mu_k = \frac{1}{|C_k|} \sum_{z_n \in C_k} z_n, \quad \Sigma_k = \frac{1}{|C_k|} \sum_{z_n \in C_k} (z_n - \mu_k)(z_n - \mu_k)^T    (2)
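A minimal sketch of this training step, assuming the binary-tree palette design of [10] is realized as a repeated principal-axis split; the stopping test and data are illustrative.

```python
import numpy as np

def split_palette(Z, K):
    """Binary-tree palette design: repeatedly split the cluster with the
    largest total variance by a plane through its mean, normal to its
    principal eigenvector."""
    clusters = [Z]
    while len(clusters) < K:
        i = max(range(len(clusters)), key=lambda j: clusters[j].var(axis=0).sum())
        C = clusters.pop(i)
        mu = C.mean(axis=0)
        _, V = np.linalg.eigh(np.cov(C.T))   # eigenvectors, ascending eigenvalues
        side = (C - mu) @ V[:, -1] > 0       # split along the principal axis
        clusters += [C[side], C[~side]]
    return clusters

def fit_gmm(clusters, n_total):
    """Maximum-likelihood GMM parameters per cluster (Eq. 2)."""
    return [(len(C) / n_total,               # prior pi_k
             C.mean(axis=0),                 # mean mu_k
             np.cov(C.T, bias=True))         # covariance Sigma_k (1/|C_k|)
            for C in clusters]

# usage: Z is an (n, 3) array of background RGB samples in [0, 1]
Z = np.random.rand(5000, 3)
params = fit_gmm(split_palette(Z, K=3), len(Z))
```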

2.2 Unsupervised segmentation

The segmentation is based on the grabcut algorithm [13], which is originally a supervised method. It expects an RGB image Z = {z_n ∈ [0, 1]^3 : n = 1 ... WH} of size W × H. Moreover, the user is expected to determine a trimap T = {t_n ∈ {F, B, U} : n = 1 ... WH}. The value t_n determines for the n-th pixel whether the user considers it a part of the foreground (t_n = F), a part of the background (t_n = B), or whether the pixel should be classified automatically (t_n = U). The trimap T is usually defined via some interactive tool enabling the user to draw a stroke over foreground pixels, another stroke over background pixels and leave the other pixels undecided. In the proposed method, the input trimap is created automatically using the background GMM probabilistic model learned in Eq. 1 and two predetermined probability thresholds P_F and P_B:

t_n = \begin{cases} F, & p(z_n) < P_F \\ U, & P_F \le p(z_n) \le P_B \\ B, & P_B < p(z_n) \end{cases}    (3)

The thresholds P_B and P_F are set based on the training data so that 3% of the training pixels have probability lower than P_F and 80% of the training pixels have probability higher than P_B in the learned background model. The foreground component of the trimap is thus initialized by pixels with low background probability, while the background component by highly probable pixels.

The core part of the grabcut algorithm [13] is an iterative energy minimization. It repeatedly goes through two phases. First, the GMM models of the foreground and the background color are reestimated. Second, the individual pixels are relabeled based on finding the minimum cut in a special graph. To estimate the GMM color models, we utilize the binary tree algorithm [10] described in Sec. 2.1, followed by the maximum likelihood estimation introduced in Eq. 2. We use three components both for the background and the foreground GMM, which is sufficient in our case of a not very varied table and garment.
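The automatic initialization translates naturally to OpenCV's mask-based grabcut interface. A minimal sketch, assuming the background model of Eq. 1 is available as a list of (π_k, μ_k, Σ_k) triples; mapping the trimap's F label to the "probable foreground" flag is our simplification, since hard labels would be excluded from relabeling.

```python
import cv2
import numpy as np
from scipy.stats import multivariate_normal

def background_prob(img, params):
    """Evaluate the learned background GMM (Eq. 1) on every pixel."""
    z = img.reshape(-1, 3).astype(np.float64) / 255.0
    p = np.zeros(len(z))
    for pi, mu, Sigma in params:
        p += pi * multivariate_normal.pdf(z, mean=mu, cov=Sigma)
    return p.reshape(img.shape[:2])

def auto_grabcut(img, params, p_f, p_b, iters=3):
    """Grabcut initialized from the background model instead of user strokes."""
    p = background_prob(img, params)
    mask = np.full(img.shape[:2], cv2.GC_PR_BGD, np.uint8)  # U -> undecided
    mask[p < p_f] = cv2.GC_PR_FGD    # F: unlikely background -> probable garment
    mask[p > p_b] = cv2.GC_BGD       # B: highly probable background
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    cv2.grabCut(img, mask, None, bgd, fgd, iters, cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))       # garment mask
```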

Fig. 2: (a) Input is formed by a single RGB image. (b) The input trimap for the grabcut algorithm is automatically initialized with foreground (plotted in yellow), background (blue) and unknown (red) pixels. The resulting segmentation gives a garment contour (green). (c) The contour is simplified by approximating it by a polygon (magenta).

The grabcut algorithm iterates until convergence, which usually takes 5 to 15 cycles. However, the segmentation mask changes only slightly in the later cycles. Since we need to get the segmentation as fast as possible, we stop the optimization after three cycles.

2.3 Contour simplification

The segmentation algorithm proposed in the previous section is followed by Moore's algorithm [2] for contour tracing. The result is a closed contour in the image plane, i.e. a list (q_1 ... q_L) of 2D coordinates q_i = (x_i, y_i). The number of distinct points L depends on the image resolution as well as on the size of the piece of garment. Typically, L is on the order of hundreds or thousands. To be able to fit our polygonal model to the contour effectively, we need to simplify the contour by approximating it with a polygon having N vertices, where N ≪ L. More precisely, we want to select a subsequence of N points (p_1 ... p_N) ⊆ (q_1 ... q_L). Additionally, we want to minimize the sum of Euclidean distances of the original points (q_1 ... q_L) to the edges of the approximating polygon (p_1 ... p_N), as seen in Fig. 3a.

The simplification procedure is based on the dynamic programming algorithm for the optimal approximation of an open curve by a polyline [11], [7]. It iteratively constructs the optimal approximation of points (q_1 ... q_i) by n vertices from previously found approximations of (q_1 ... q_j), where j ∈ {n−1 ... i−1}, by n−1 points. The construction is demonstrated in Fig. 3b. The time complexity of the algorithm is O(L²N). Since the algorithm works with an open curve, it would have to be called L times, for every possible cycle breaking point q_i, to obtain the optimal approximation of the closed curve. However, we only call it constantly many times to get a suboptimal approximation, which is sufficient for our purpose.
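A minimal transcription of the open-curve DP for illustration. It keeps the curve endpoints fixed and recomputes segment costs from scratch, so it runs in O(L³N); the incremental-minimization trick mentioned in [11] would reduce it to the stated O(L²N).

```python
import numpy as np

def seg_cost(pts, lo, hi):
    """Sum of distances of points strictly between lo and hi to segment (lo, hi)."""
    a, b = pts[lo], pts[hi]
    d = b - a
    length = np.hypot(d[0], d[1])
    mid = pts[lo + 1:hi]
    if len(mid) == 0:
        return 0.0
    if length == 0:
        return np.linalg.norm(mid - a, axis=1).sum()
    return np.abs(d[0] * (mid[:, 1] - a[1]) - d[1] * (mid[:, 0] - a[0])).sum() / length

def approx_polyline(pts, N):
    """DP approximation of the open curve pts[0..L-1] by N vertices,
    after [Perez and Vidal, 1994]."""
    L = len(pts)
    cost = np.full((L, N), np.inf)
    back = np.zeros((L, N), dtype=int)
    cost[0, 0] = 0.0
    for n in range(1, N):
        for i in range(n, L):
            for j in range(n - 1, i):
                c = cost[j, n - 1] + seg_cost(pts, j, i)
                if c < cost[i, n]:
                    cost[i, n], back[i, n] = c, j
    idx, i = [L - 1], L - 1            # backtrack the selected vertex indices
    for n in range(N - 1, 0, -1):
        i = back[i, n]
        idx.append(i)
    return pts[idx[::-1]]
```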

Fig. 3: (a) The original contour (q_1 ... q_L) (plotted in red) is simplified by approximating it with a polygon (p_1 ... p_N) (blue) while minimizing the distances of the original points q_i to the polygon edges. (b) The dynamic programming algorithm for polygonal approximation utilizes previously constructed approximations of points (q_1 ... q_4), (q_1 ... q_5) and (q_1 ... q_6) by n−1 vertices (plotted in various colors) to obtain an approximation of points (q_1 ... q_7) by n vertices.

3 Polygonal models

3.1 Models definition and learning

To be able to recognize the configuration of a piece of garment, we describe the contours of various types of clothing by simple polygonal models. The models are determined by their vertices. The inner angles incident to the vertices are learned from training data. Additional conditions are defined in some cases to deal with inner symmetries or similarities of distinct models. We use the following models:

1. Towel is determined by 4 corner vertices as shown in Fig. 4. All inner angles incident to the vertices share the same probability distribution. There is an additional condition that the height of the towel (the distance between the top edge and the bottom edge) is required to be longer than its width (the distance between the left edge and the right edge).

2. Pants are determined by 7 vertices. There are 3 distinct shared distributions of inner angles, as shown in Fig. 4.

3. Short-sleeved shirt is determined by 10 vertices and 4 shared distributions of inner angles, as shown in Fig. 4. There is an additional condition that the distance between the armpit and the inner corner of the sleeve is required to be at most 50% of the distance between the armpit and the bottom corner of the shirt.

4. Long-sleeved shirt is similar to the short-sleeved model. The distance between the armpit and the inner corner of the sleeve should be at least 50% of the distance between the armpit and the bottom corner of the shirt.

The probability distributions of the inner angles incident to the vertices of the polygonal models are learned from manually annotated data. We assume that the angles have normal distributions. This seems a reasonable assumption, since e.g. a corner angle of a towel should be approximately 90° with a certain variance caused by deformations of the contour.
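Such a model can be captured by a small data structure: an ordered list of named landmark vertices, a grouping of vertices that share one angle distribution, and optional extra conditions. A minimal sketch of one possible representation; the field names, the towel instance and the helper predicates are ours, chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class PolygonalModel:
    vertices: list       # landmark names, in clockwise contour order
    angle_groups: dict   # group label -> list of vertex names sharing one distribution
    angle_params: dict   # group label -> (mean, variance) of the inner angle
    conditions: list = field(default_factory=list)  # extra predicates on landmarks

def height(lm): return abs(lm["bot-left"][1] - lm["top-left"][1])
def width(lm):  return abs(lm["top-right"][0] - lm["top-left"][0])

towel = PolygonalModel(
    vertices=["top-left", "top-right", "bot-right", "bot-left"],
    angle_groups={"alpha": ["top-left", "top-right", "bot-right", "bot-left"]},
    angle_params={"alpha": (1.5708, 0.01)},  # ~90 degrees, illustrative variance
    conditions=[lambda lm: height(lm) > width(lm)],
)
```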

Fig. 4: Polygonal models for towel, short-sleeved shirt and pants, with landmark vertices such as the top-left and top-right corners, shoulders, armpits, inner and outer sleeve corners and the crotch. Angles sharing one distribution are denoted by the same letter (α, β, γ, δ) and plotted with the same color.

Fig. 5: Angle distributions (over the range 0 to 2π) learned for various types of clothes models: (a) towel, (b) short-sleeved shirt, (c) pants. The colors of the plotted distributions correspond to the angles in Fig. 4.

The mean and the variance of each normal distribution are estimated using the maximum-likelihood principle, similarly to Eq. 2. Various vertices of a model can share the same angle distribution because of obvious symmetries; e.g. all 4 corners of a towel should be statistically identical.

3.2 Problem of model matching

We described in Sec. 2.3 how to approximate a contour by N points (p_1 ... p_N). Each polygonal model defined in Sec. 3.1 is determined by M vertices (v_1 ... v_M), where M is specific for the particular model; see the examples of models in Fig. 4. We show how to match an unknown simplified contour onto a given model. We assume that N ≥ M, i.e. the simplified contour contains more points than the number of vertices of the model to be matched. The problem of matching can then be defined as the problem of finding a mapping of simplified contour points to model vertices f : {p_1 ... p_N} → {v_1 ... v_M} ∪ {s}. The symbol s represents a dummy vertex which corresponds to a segment of the polygonal model. It makes it possible to leave some of the contour points unmapped to a real vertex. Additionally, a proper mapping f has to satisfy several conditions:

Fig. 6: Points of the simplified contour (p_1 ... p_7) are matched (plotted in blue) to vertices of the polygonal model (v_1 ... v_4). Some of them remain unmatched (green), i.e. they are mapped to the dummy vertex s representing a segment. The energy of the particular matching is based on the similarity of corresponding inner angles (red).

1. For each vertex v_m there exists a point p_i mapped to it. More formally, ∀ v_m ∃ p_i : f(p_i) = v_m.

2. No two points p_i and p_j are mapped to the same vertex v_m. However, many points can be mapped to segments represented by the dummy vertex s. Formally, ∀ p_i ≠ p_j : f(p_i) ≠ f(p_j) ∨ f(p_i) = f(p_j) = s.

3. The mapping preserves the ordering of points on the polygonal contour and the ordering of vertices of the polygonal model in the clockwise direction.

An example of such a proper mapping is shown in Fig. 6. The number of mappings f satisfying the aforementioned conditions for N contour points and M model vertices is given by the combinatorial formula

N \binom{N-1}{M-1} \ge N \left( \frac{N-1}{M-1} \right)^{M-1}    (4)

The interpretation is that we can choose 1 of N points to be mapped to the first vertex v_1. From the remaining N − 1 points, we select a subset of M − 1 points which are mapped to the vertices v_2 ... v_M. All other points are mapped to the dummy vertex s representing all segments of the polygonal model.

We introduce an energy function E(f) associated with a matching f. Let us denote by φ_i the inner angle adjacent to the point p_i of the simplified contour. Let us also denote by μ_m the mean value and by σ_m² the variance of the normal distribution of inner angles N(φ; μ_m, σ_m²) learned for the vertex v_m of a particular polygonal model. We recall that the same distribution can be shared by several vertices of one polygonal model, as seen in Fig. 4. The energy function is then given by:

E(f) = -\sum_{f(p_i) = v_m} \log N(\varphi_i; \mu_m, \sigma_m^2) - \sum_{f(p_i) = s} \log N\left(\varphi_i; \pi, \frac{\pi^2}{16}\right)    (5)

It can be seen that we force the angles of unmatched points (p_i such that f(p_i) = s) to be close to π, i.e. we want the unmatched parts of the contour to resemble straight segments. We set the variance π²/16 for unmatched points empirically. Since the energy is inversely proportional to a probability, the optimal mapping f* is obtained as f* = arg min_f E(f).

straight segments. We set the variance $\pi^2/16$ for unmatched points empirically. Since a lower energy corresponds to a higher likelihood, the optimal mapping is obtained as $f^* = \arg\min_f E(f)$.

3.3 Matching algorithm

Eq. 4 shows that the count of all admissible mappings is exponential in the number of vertices $M$. Thus it would be inefficient to evaluate the energy function for each mapping. We have instead developed an algorithm employing a dynamic programming approach, which has a polynomial time complexity.

The dynamic programming optimization procedure seen in Alg. 1 is called for every shifted simplified contour $(p'_1, \ldots, p'_N) = (p_d, \ldots, p_N, p_1, \ldots, p_{d-1})$, where the shift is $d \in \{1, \ldots, N\}$. The reason is that Alg. 1 finds a mapping $f$ such that $f(p_{i_m}) = v_m$ for $m \in \{1, \ldots, M\}$ and $1 \leq i_1 < i_2 < \ldots < i_M \leq N$, i.e. one of the first points is mapped to the vertex $v_1$, some of its successors along the contour to the vertex $v_2$ and so on. Thus we have to try various shifts in order to be able to map any point to the vertex $v_1$, as seen in Fig. 6.

Alg. 1 does not work with points and vertices directly. It expects a precomputed matrix $V \in \mathbb{R}^{N \times M}$ and a vector $S \in \mathbb{R}^{N}$ instead. The value $V_{i,m}$ is the cost of matching the inner angle $\phi_i$ associated with the point $p_i$ to the learned angle distribution for vertex $v_m$, i.e. $V_{i,m} = -\log \mathcal{N}(\phi_i; \mu_m, \sigma_m^2)$ as in Eq. 5. The value $S_i$ is the cost of matching $\phi_i$ to the angle of a dummy vertex $s$, i.e. $S_i = -\log \mathcal{N}(\phi_i; \pi, \pi^2/16)$ as in Eq. 5.

Both minimizations in Alg. 1 can be performed incrementally in $O(N)$ time by remembering the summation value for the previous $j$. The first minimization is performed $N$ times, the second one $O(NM)$ times. Thus the time complexity of Alg. 1 is $O(N^2 M)$. Alg. 1 is called $N$ times for the variously shifted contour, i.e. for $d \in \{1, \ldots, N\}$. Thus the overall complexity of contour matching is $O(N^3 M)$.

Algorithm 1 Contour matching algorithm
Input:  $V_{i,m}$ = cost of mapping point $p_i$ to vertex $v_m$
        $S_i$ = cost of mapping point $p_i$ to segment $s$
Output: $T_{i,m}$ = cost of mapping sub-contour $(p_1, \ldots, p_i)$ to vertices $(v_1, \ldots, v_m)$

for all $i \in \{1, \ldots, N\}$ do
    $T_{i,1} \leftarrow \min_{j \in \{1, \ldots, i\}} \left( \sum_{k=1}^{j-1} S_k + V_{j,1} + \sum_{k=j+1}^{i} S_k \right)$
end for
for all $m \in \{2, \ldots, M\}$ do
    for all $i \in \{m, \ldots, N\}$ do
        $T_{i,m} \leftarrow \min_{j \in \{m, \ldots, i\}} \left( T_{j-1,m-1} + V_{j,m} + \sum_{k=j+1}^{i} S_k \right)$
    end for
end for
return $T_{N,M}$
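To make the recurrences concrete, the following Python sketch transcribes Alg. 1 directly. It is illustrative only: the function and variable names are ours, the costs are the negative log-likelihoods defined above, and prefix sums of $S$ stand in for the incremental minimization, giving the same $O(N^2 M)$ complexity per shift and $O(N^3 M)$ in total.

import math

def neglog_normal(x, mu, var):
    # Negative log-density of the normal distribution N(mu, var) at x.
    return 0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)

def match_shifted(V, S):
    # V[i][m] = cost of mapping point p_{i+1} to vertex v_{m+1} (0-indexed),
    # S[i] = cost of mapping point p_{i+1} to the dummy segment vertex s.
    N, M = len(V), len(V[0])
    P = [0.0] * (N + 1)                 # P[t] = S_1 + ... + S_t (prefix sums)
    for i in range(N):
        P[i + 1] = P[i] + S[i]
    INF = float("inf")
    T = [[INF] * (M + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):           # T_{i,1}: some p_j -> v_1, rest are segments
        T[i][1] = min(P[j - 1] + V[j - 1][0] + (P[i] - P[j])
                      for j in range(1, i + 1))
    for m in range(2, M + 1):           # T_{i,m}: some p_j -> v_m with j >= m
        for i in range(m, N + 1):
            T[i][m] = min(T[j - 1][m - 1] + V[j - 1][m - 1] + (P[i] - P[j])
                          for j in range(m, i + 1))
    return T[N][M]

def match_contour(phi, vertex_params):
    # phi = inner angles of the simplified contour points,
    # vertex_params = (mu_m, var_m) of the learned distribution per vertex.
    V = [[neglog_normal(a, mu, var) for mu, var in vertex_params] for a in phi]
    S = [neglog_normal(a, math.pi, math.pi ** 2 / 16) for a in phi]
    N = len(phi)
    return min(match_shifted(V[d:] + V[:d], S[d:] + S[:d]) for d in range(N))

Recovering the optimal mapping itself would additionally store the minimizing index $j$ in each cell and trace back, as is usual in dynamic programming.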

4 Experiments

The proposed methods were tested on a dataset of spread garments collected at the Czech Technical University [14]. The dataset contains color images (as in Fig. 2) and depth maps taken by a Kinect-like device from a bird's-eye perspective. All images were manually annotated by specifying the positions of landmark points which correspond to vertices of the proposed polygonal models in Fig. 4. The edge of 1 pixel approximately corresponds to 0.09 cm in real-world coordinates. We used 158 testing images (29 towels, 45 pants, 45 short-sleeved shirts and 39 long-sleeved shirts).

The algorithms were implemented mainly in Matlab. Some of the most time-critical functions were reimplemented in C++. The performance was evaluated on a notebook with a 2.5 GHz processor and 4 GB memory. The input images were downsampled for the purpose of segmentation. The smaller resolution preserves all desired details and significantly improves the time performance of the segmentation algorithm. In total, 153 of the 158 input images were correctly segmented, which gives a 97% success rate. The incorrectly segmented images were excluded from the further evaluation. The time spent by segmenting one image is 0.87 seconds on average.

The contour simplification algorithm is the most time-consuming operation of the proposed pipeline. The running times can be seen in Tab. 1. They highly depend on the length of the contour, which is induced mainly by the shape complexity of the particular category of clothing. The subsequent model matching procedure works with the already simplified contour and thus it is very fast, as seen in Tab. 1. The whole pipeline, including segmentation and contour simplification, runs in around 5 seconds in the worst case. This is a significant improvement over the model fitting times reported by Miller et al. [9], which are roughly two orders of magnitude longer.

Table 1: Time performance (in seconds) of the contour simplification phase and the polygonal model matching phase for the towel, pants, short-sleeved and long-sleeved categories of clothing.

Table 2: Displacements (in centimeters) of the identified vertices from the ground-truth vertices found by polygonal model matching; the median, mean and standard deviation are reported for the towel, pants, short-sleeved and long-sleeved categories.

Fig. 7: Displacements between the vertices found by model matching (plotted in green) and the manually annotated landmarks (red). The displacements were computed for various configurations of garments and then projected to the canonical image.

Tab. 2 summarizes the displacements of vertices found by the proposed algorithm compared to the manually annotated landmark points. These errors are similar to those reported by Miller et al. [9]. They are small enough to determine the configuration of a piece of garment reliably and then use this information to manipulate the garment with robotic arms. Fig. 7 visualizes the displacements for selected representatives of clothing. The errors were computed for various configurations of the same piece of garment and then projected to a canonical image. The biggest source of displacements is the shoulders, as seen in Fig. 7 for the green long-sleeved sweater. However, estimating their position can be ambiguous even for a human. Moreover, their exact position is rather unimportant for automated manipulation. A few other significant errors were made while estimating the armpits of a shirt with very short sleeves, as seen in Fig. 7 for the white shirt. They are caused by the indistinguishable shape of the sleeves on the contour, which resembles a straight line around the armpits.

The proposed algorithms were deployed to a real robot and successfully tested in several folding sequences of various garments, as seen in Fig. 1. The folding procedure succeeds in approximately 70% of attempts. However, the observed folding failures were almost never caused by the described vision pipeline. The main sources of these failures are an unreliable grasping mechanism and the occasional inability to plan a motion of the robotic arms.

5 Conclusion

We have fulfilled our goal and proposed a fast method for recognizing the configuration of a piece of garment. We have achieved a good accuracy, comparable to that of known approaches, despite using a more challenging non-uniform background. The presented model has proved to be sufficient for the studied situation. The recognition procedure was deployed to a real robot and successfully tested in fully automated folding. In the future, we would like to strengthen the power of the model by introducing more global constraints. Our intention is to generalize the method to folded pieces of garment. We would also like to teach the robot how to detect folding failures and how to recover from them.

6 Acknowledgment

The authors were supported by the European Commission under the project FP7-ICT CloPeMa (J. Stria), by the Grant Agency of the Czech Technical University in Prague under the project SGS13/205/OHK3/3T/13 (J. Stria, D. Průša) and by the Technology Agency of the Czech Republic under the project TE Center Applied Cybernetics (V. Hlaváč).

References

1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd ed. Wiley (2000)
2. Gonzalez, R.C., Woods, R.E., Eddins, S.L.: Digital Image Processing Using MATLAB, 2nd ed. Gatesmark (2009)
3. Hata, S., Hiroyasu, T., Hayashi, J., Hojoh, H., Hamada, T.: Robot system for cloth handling. In: Proc. Annual Conf. of IEEE Industrial Electronics Society (IECON) (2008)
4. Hu, X., Bai, Y., Cui, S., Du, X., Deng, Z.: Review of cloth modeling. In: Proc. ISECS Int. Colloquium on Computing, Communication, Control and Management (CCCM) (2009)
5. Kita, Y., Kita, N.: A model-driven method of estimating the state of clothes for manipulating it. In: Proc. IEEE Workshop on Applications of Computer Vision (WACV) (2002)
6. Kita, Y., Ueshiba, T., Neo, E.S., Kita, N.: Clothes state recognition using 3D observed data. In: Proc. IEEE Int. Conf. on Robotics and Automation (ICRA) (2009)
7. Kolesnikov, A., Fränti, P.: Polygonal approximation of closed discrete curves. Pattern Recognition 40(4) (2007)
8. Maitin-Shepard, J., Cusumano-Towner, M., Lei, J., Abbeel, P.: Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In: Proc. IEEE Int. Conf. on Robotics and Automation (ICRA) (2010)
9. Miller, S., Fritz, M., Darrell, T., Abbeel, P.: Parametrized shape models for clothing. In: Proc. IEEE Int. Conf. on Robotics and Automation (ICRA) (2011)
10. Orchard, M., Bouman, C.: Color quantization of images. IEEE Trans. on Signal Processing 39(12) (1991)
11. Perez, J.C., Vidal, E.: Optimum polygonal approximation of digitized curves. Pattern Recognition Letters 15(8) (1994)
12. Ramisa, A., Alenyà, G., Moreno-Noguer, F., Torras, C.: Using depth and appearance features for informed robot grasping of highly wrinkled clothes. In: Proc. IEEE Int. Conf. on Robotics and Automation (ICRA) (2012)
13. Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": Interactive foreground extraction using iterated graph cuts. ACM Trans. on Graphics 23(3) (2004)
14. Wagner, L., Krejčová, D., Smutný, V.: CTU color and depth image dataset of spread garments. Tech. Rep. CTU-CMP, Center for Machine Perception, Czech Technical University (2013)
15. Wang, P.C., Miller, S., Fritz, M., Darrell, T., Abbeel, P.: Perception for the manipulation of socks. In: Proc. IEEE Int. Conf. on Intelligent Robots and Systems (IROS) (2011)
16. CloPeMa project: Clothes perception and manipulation.

B. Garment perception and its folding using a dual-arm robot

A reprint of our paper Garment Perception and its Folding Using a Dual-arm Robot, which will be published in the proceedings of the 27th International Conference on Intelligent Robots and Systems (IROS 2014).

Garment Perception and its Folding Using a Dual-arm Robot

Jan Stria, Daniel Průša, Václav Hlaváč, Libor Wagner, Vladimír Petrík, Pavel Krsek, Vladimír Smutný¹

Abstract: The work addresses the problem of clothing perception and manipulation by a two-armed industrial robot, aiming at real-time automated folding of a piece of garment spread out on a flat surface. A complete solution combining vision sensing, garment segmentation and understanding, planning of the manipulation and its real execution on a robot is proposed. A new polygonal model of a garment is introduced. Fitting the model to a segmented garment contour is used to detect garment landmark points. It is shown how folded variants of the unfolded model can be derived automatically. The universality and usefulness of the model is demonstrated by its favorable performance within the whole folding procedure, which is applicable to a variety of garment categories (towel, pants, shirt, etc.) and evaluated experimentally using the two-armed robot. The principal novelty with respect to the state of the art lies in the new garment polygonal model and its manipulation planning algorithm, which leads to a speed-up of two orders of magnitude.

I. INTRODUCTION

The reported research contributes to garment sensing and manipulation (sorting, folding, etc.), performed in our case on a dual-arm robot within the European Commission funded project CloPeMa [1]. The project advances the state of the art in the autonomous perception and manipulation of limp materials like fabrics, textiles and garments. The emphasis is put on universality and robustness. We propose a method for autonomous folding of a piece of garment spread out on a flat surface. Our aim is to provide a real-time procedure which is applicable to an extended collection of garments of various shapes and gives satisfactory results when employed on a real robot. The method involves vision sensing, garment understanding, planning and manipulation tasks.

A similar objective has already been approached by Miller et al. [2], [3] and applied to garment folding on the Willow Garage PR2 robot. The authors consider a garment fairly spread on a uniform green surface, which greatly simplifies the segmentation task. A contour is obtained by segmenting a single image taken by the robot camera and it is fitted to a parametric polygonal model specific for a particular category of garment. The fitting procedure is an iterative estimation of the numeric parameters of the model. Quite accurate fitting results are reported. However, the main drawback is the slow performance, as fitting a single contour takes a long time, depending on the complexity of the particular model. The same parametric polygonal model was used for sock configuration recognition and for pairing corresponding socks [4]. The previously used global model matching approach is combined with local fitting of texture and shape descriptors in this work.

An automated folding of towels based on a robust visual detection of their corner points is presented in [5]. The PR2 robot starts with a towel dropped on a table and folds it in a sequence of manipulations performed both on the table and in the air. This problem is also partially solved by Hata et al.

¹ All authors are with the Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic. Contacts to all authors can be found at our web page.
[6], who are interested in lifting a single towel from a pile of highly wrinkled towels. The towel is then regrasped so that it is held by its corner in order to simplify future manipulation. The solution is based on detecting the highest point of the pile, followed by corner detection in stereo data. Ramisa et al. [7], [8] combine features computed from both color and range images to define a measure of clothes wrinkledness. The identified highly wrinkled regions are good candidates for automated grasping.

State estimation of hanging clothes held by a single gripper is approached by Kita et al. They utilize both single-view [9] and stereo [10] images. The method is based on matching the observed data to many precomputed deformable 3D models, each one for an individual grasping location, and selecting the best fitting model. The recognition accuracy can be improved by pushing the hanging garment with the second robotic hand in order to bring it into a more distinguishable position [11]. Doumanoglou et al. [12] are interested in category and pose recognition of a hanging garment in order to bring it into the desired position. They work with depth images acquired by a range sensor. The recognition algorithm utilizes decision forests trained on very simple local features. The manipulation planning is formulated in a probabilistic Markov framework. Willimon et al. [13] estimate the configuration of moderately wrinkled clothes by fitting a triangulated surface to the observed range image. The fitting procedure is based on iterative minimization of an energy function which expresses surface smoothness and both visual and spatial similarity of the data.

We already explained the computer vision basics of our approach in [14], where the former method detected unfolded garments, which allowed us to launch a scripted, sensing-less sequence of robot actions to fold a piece of garment. Here we present a more accurate polygonal model incorporating relative lengths of contour segments. In addition, we propose a generic procedure allowing folded models to be derived from the basic, unfolded ones. A new dynamic programming-based algorithm for matching the model to observed data is proposed. This work also documents the robotics side, i.e. it introduces our dual-arm testbed and performs the proposed

Fig. 1: Testbed and a detail of the arm with a mounted gripper and Xtion on the wrist.

methods on it. The testbed used within CloPeMa is documented in Sec. II. All vision-related models and algorithms are described in Sec. III. Sec. IV gives a brief overview of the automated manipulation planning and execution. Experiments, the accuracy of recognition and the performance of the folding by the robot are evaluated in Sec. V.

II. TESTBED DESCRIPTION

The CloPeMa testbed consists of two Motoman MA1400 robotic arms, the R750 turn-table and two DX100 controllers. The robotic arms are mounted on the turn-table tilted from the vertical axis and the angle between them is 30°. The arms and the turn-table are driven by two controllers (master and slave) at a low level. The high-level control system is built on the Robot Operating System [15] (ROS Hydro). The system runs on a PC connected to the master controller by a local Ethernet. Basic software allows moving the robotic arms to required positions as well as reading the actual positions of the arm joints. This functionality is supported by the MotoROS package, distributed by the robot manufacturer, and extended to support the dual-arm robot.

The testbed is fitted with sensors and grippers. Xtion range finder sensors are attached to the wrists of both arms. An additional Xtion sensor is mounted to a camera-head holder. The binocular head [16] is placed on top of the holder. Two grippers specialized for garment manipulation [17] are attached to the robot wrists. Each arm is also equipped with a force and torque sensor in the wrist. Fig. 1 shows the testbed and a detail of the gripper and Xtion sensor on the wrist. The reported work uses only the Xtion vision sensors.

III. VISION SENSING AND UNDERSTANDING

The crucial task in our system for automated folding is to recognize the configuration of a piece of garment to be manipulated by the robot. This is performed as a computer vision task utilizing a single color image captured by the Xtion camera mounted on the robotic arm. The algorithm precedes the folding sequence to determine the initial configuration of the spread garment. Vision sensing is also repeated after performing each single fold, utilizing results from the previous runs.

Fig. 2: (a) Input image, (b) segmentation, (c) simplified contour. Pixels of the input image are used to initialize the trimap for the grabcut algorithm. The trimap consists of foreground (plotted in cyan), background (yellow) and unknown (magenta) pixels. The resulting segmentation gives a garment contour (red) which is approximated by a polygon (blue).

The vision procedure can be split into several steps:
- The input is formed by a color image of a piece of garment placed on a table. The color of the table, which has a wooden-like surface, differs from the garment color.
- The garment location in the image is determined automatically by a segmentation method trained from data.
- The contour of the garment is extracted from the segmentation mask and approximated by a polygon, which reduces the number of contour points (now polygon vertices) significantly.
- The approximated polygonal contour is matched to a polygonal model specific for a particular category of clothing. Once the polygonal model is matched, the positions of the contour landmark points are known.

A. Segmentation

Segmentation is the first phase of the recognition pipeline. Since the table beneath the garment is unchanged, its color can be learned from training data. We model the table color probabilistically as a Gaussian mixture model (GMM) of RGB triples.
Components of the GMM are initialized by a binary tree algorithm for palette design [18], which repeatedly splits RGB vectors into subsets in the direction of the greatest variance. The number of GMM components is determined empirically to model the variability of the table color sufficiently. We use 3 components in our case of a wooden table. The prior probabilities, mean vectors and covariance matrices of the individual components are learned according to the maximum likelihood principle [19].

The color of the garment lying on the table is unknown. However, we suppose that it is sufficiently different from the color of the table. Thus pixels visualizing the table should have a higher probability in the trained GMM than pixels visualizing an unknown garment. Based on this assumption, we label the $n$-th pixel with label $t_n$ according to the probability $p(z_n)$ of its color $z_n$ in the trained GMM model:

$$t_n = \begin{cases} \text{foreground}, & p(z_n) < P_F \\ \text{unknown}, & P_F \leq p(z_n) \leq P_B \\ \text{background}, & P_B < p(z_n) \end{cases} \qquad (1)$$

The probability thresholds $P_F$ and $P_B$ are chosen so that 3% of training pixels have probability lower than $P_F$ and 80% of them have probability higher than $P_B$.
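To illustrate Eq. 1, the following Python sketch trains the table color model and produces the trimap labels. It is a minimal sketch under our own assumptions: scikit-learn's GaussianMixture is used, so its default k-means initialization stands in for the binary tree palette design of [18], and the thresholds are applied to log-densities, which preserves the ordering required by Eq. 1 while avoiding numerical underflow.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_table_model(table_pixels, n_components=3):
    # table_pixels: (n, 3) array of RGB triples sampled from the empty table.
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(table_pixels)
    log_p = gmm.score_samples(table_pixels)    # log-likelihood of each pixel
    log_pf = np.percentile(log_p, 3)           # 3% of table pixels fall below P_F
    log_pb = np.percentile(log_p, 20)          # 80% of them lie above P_B
    return gmm, log_pf, log_pb

def label_trimap(image, gmm, log_pf, log_pb):
    # image: (h, w, 3); returns 0 = foreground, 1 = unknown, 2 = background (Eq. 1).
    log_p = gmm.score_samples(image.reshape(-1, 3).astype(float))
    log_p = log_p.reshape(image.shape[:2])
    labels = np.ones(image.shape[:2], dtype=np.uint8)
    labels[log_p < log_pf] = 0
    labels[log_p > log_pb] = 2
    return labels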

Fig. 2b shows an example of such a labeling.

The described labeling is used to automatically initialize the color models of the grabcut segmentation algorithm [20]. This is done instead of requiring the user to draw a stroke over foreground (garment) and background (table) pixels to initialize the color models. The grabcut algorithm is based on iterative re-estimation of the GMM color models and relabeling of pixels by minimizing a certain energy function. In our implementation, we use a GMM with 3 components and we interrupt the optimization after performing 3 iterations.

B. Contour processing

A contour of the garment is extracted from the obtained segmentation mask by Moore's algorithm [21] for border tracking. The contour is formed by pixels at the garment's border, as seen in Fig. 2b. Thus it can be considered to be a polygon with several hundreds or thousands of vertices. Their exact number depends mainly on the size of the garment and on the resolution of the input image. However, the shape of the contour is much simpler and thus it can be closely approximated by a polygon which has at most tens of vertices, depending on the shape complexity and on the desired precision. An example of a simplified contour can be seen in Fig. 2c. The simplification applies iteratively the algorithm for optimal approximation of an open curve by a polyline [22]. The algorithm utilizes a dynamic programming approach to minimize the overall distance of the original contour points to the edges of the approximating polygon. To find the global optimum, the inner algorithm would have to be run for each point of the original cyclic contour, breaking it into an open curve. However, in practice it is possible to stop it after several iterations to obtain a sufficient approximation.

C. Polygonal models of clothing

The shapes of possible contours for a particular category of clothing are described by polygonal models. We distinguish the following categories of clothing: towel, pants, short-sleeved shirt and long-sleeved shirt. Fig. 3 visualizes all the models and Fig. 4 shows a more detailed polygonal model for a short-sleeved shirt.

Fig. 3: Polygonal models for all currently supported categories of clothing (towel, pants, short-sleeved shirt and long-sleeved shirt).

Fig. 4: Polygonal model for a short-sleeved shirt: (a) the polygonal model, (b) distributions of inner angles, (c) distributions of relative lengths. Inner angles sharing the same distribution are denoted by the same Greek letter α...δ. Edges sharing the same distribution of relative lengths are denoted by the same Latin letter a...f. Colors of the probability distributions correspond to the colors of angles and edges in the polygonal model.

Each polygonal model is determined by its vertices and their relative mutual positions. The mutual positions are described by the inner angles adjacent to the vertices as well as by the relative lengths of the polygon edges with respect to its perimeter. The inner angles and relative lengths are learned from training data. They are modeled probabilistically by independent normal distributions, as seen in Fig. 4b and Fig. 4c. Some distributions are shared by several vertices and edges of the same model, e.g. the distribution of inner angles adjacent to the left and right armpit of a shirt, or the distribution of relative lengths of the towel top and bottom edges.
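For a quick reproduction of the contour processing step, the sketch below extracts and simplifies the garment contour from a binary segmentation mask. It only approximates the described pipeline: OpenCV's contour extraction (OpenCV 4 return signature) replaces Moore's border tracking [21], and the Douglas-Peucker heuristic of cv2.approxPolyDP replaces the optimal dynamic programming approximation of [22]; the tolerance eps_frac is our own illustrative parameter.

import cv2

def simplified_contour(mask, eps_frac=0.01):
    # mask: uint8 binary segmentation mask, garment pixels set to 255.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)      # garment = largest region
    # Approximation tolerance proportional to the contour perimeter.
    eps = eps_frac * cv2.arcLength(contour, True)
    polygon = cv2.approxPolyDP(contour, eps, True)
    return polygon.reshape(-1, 2)                     # (K, 2) polygon vertices

With a tolerance around 1% of the perimeter, typical garment contours of thousands of border pixels reduce to the tens of vertices mentioned above.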
The sharing of distributions is justified by the obvious left-right and top-bottom symmetries of clothing shapes, as in Fig. 4.

D. Matching contours to polygonal models

Now we have a contour of a garment approximated by a polygon having $N$ vertices $p_1, \ldots, p_N$, which we will call points from now on. The polygonal model is determined by vertices $v_1, \ldots, v_M$. The number of vertices $M$ is specific for a particular category of clothing. It always holds that $N > M$. We describe how to match points of the polygonal contour to vertices and segments of the polygonal model, i.e. how to find a mapping $f$ such that for all $i \in \{1, \ldots, N\}$:

$$f(p_i) = \begin{cases} v_m, & \text{point } p_i \text{ is mapped to vertex } v_m, \\ s, & \text{point } p_i \text{ is aligned to a segment.} \end{cases} \qquad (2)$$

The mapping has to satisfy several conditions:
- For each vertex $v_m$, there exists a point $p_i$ mapped to it.
- No two points $p_i$ and $p_j$ are mapped to the same vertex $v_m$. However, many points can be mapped to segments represented by the symbol $s$.
- The mapping preserves the ordering of points on the polygonal contour and the ordering of vertices of the polygonal model in the clockwise direction. See Fig. 5.

The number of all possible mappings $f$ satisfying the conditions can be enumerated easily. To do so, select one of the $N$ points to be mapped to vertex $v_1$. Then select an arbitrary subset of $M-1$ points from the remaining $N-1$ points to be mapped to vertices $v_2, \ldots, v_M$. Thus the number of all possible mappings is:

$$N \binom{N-1}{M-1} \qquad (3)$$
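The enumeration argument behind Eq. 3 is easy to verify by brute force. The following short check, with our own illustrative names, enumerates all assignments of points to vertices or segments, keeps those satisfying the three conditions, and compares the count with the closed form.

from itertools import product
from math import comb

def count_valid_mappings(N, M):
    # Brute force over all assignments of points to {v_1..v_M, s};
    # the dummy segment vertex s is encoded as the label M.
    count = 0
    for f in product(range(M + 1), repeat=N):
        hit = [i for i, m in enumerate(f) if m < M]   # points on real vertices
        labels = [f[i] for i in hit]
        if sorted(labels) != list(range(M)):          # each vertex exactly once
            continue
        k = labels.index(0)
        # Clockwise order must be preserved up to a cyclic rotation of the contour.
        if labels[k:] + labels[:k] == list(range(M)):
            count += 1
    return count

assert count_valid_mappings(7, 4) == 7 * comb(6, 3)   # = 140, matching Eq. 3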

Fig. 5: Visualization of the function $f$ which maps points $p_1, \ldots, p_N$ to vertices $v_1, \ldots, v_M$ (blue arrows). Some points are mapped to segments (red arrows). The mapping preserves the clockwise ordering of both points and vertices.

To compare the quality of mappings, we define a cost function $C(f)$ associated with a mapping $f$. The overall cost is a summation of local costs which express local qualities of a particular mapping. Let us assume that indices in all following equations iterate in closed cycles, namely $i+1$ is understood as $(i \bmod N) + 1$, $i-1$ as $((i-2) \bmod N) + 1$, $m+1$ as $(m \bmod M) + 1$ and $m-1$ as $((m-2) \bmod M) + 1$.

The vertex matching cost $V^m_{i,j,k}$ is defined for each triple of contour points $p_i, p_j, p_k$ and each model vertex $v_m$:

$$V^m_{i,j,k} = -\lambda_V \log \mathcal{N}(\angle p_i p_j p_k;\, \mu_m, \sigma_m^2) \qquad (4)$$

It expresses how the size of the oriented angle $\angle p_i p_j p_k$ fits the normal distribution $\mathcal{N}(\cdot; \mu_m, \sigma_m^2)$ of inner angles adjacent to the vertex $v_m$. The mean $\mu_m$ and variance $\sigma_m^2$ of the distribution are learned from data, as in Fig. 4b. The symbol $\lambda_V$ denotes the weight of the vertex matching cost.

The edge matching cost $E^m_{j,k}$ is defined for each pair of points $p_j, p_k$ and each polygonal model vertex $v_m$:

$$E^m_{j,k} = -\lambda_E \log \mathcal{N}\!\left(\frac{\|p_j p_k\|}{\sum_{i=1}^{N} \|p_i p_{i+1}\|};\, \nu_m, \tau_m^2\right) \qquad (5)$$

It expresses how the relative length of the line segment $p_j p_k$ (with respect to the overall length of the contour) fits the distribution of relative lengths of the model edge $v_m v_{m+1}$. The mean $\nu_m$ and variance $\tau_m^2$ of the distribution are learned from data, as in Fig. 4c. The symbol $\lambda_E$ denotes the weight of the edge matching cost.

The segment matching cost $S_{j,k}$ is defined for each pair of simplified contour points $p_j, p_k$ in the following way:

$$S_{j,k} = -\lambda_S \sum_{i \in I_{j,k}} \log \mathcal{N}(\angle p_{i-1} p_i p_{i+1};\, \xi, \varphi^2) \qquad (6)$$

The range $I_{j,k}$ passed by the index $i$ is defined as:

$$I_{j,k} = \begin{cases} \{j+1, \ldots, k-1\}, & j \leq k \\ \{j+1, \ldots, N, 1, \ldots, k-1\}, & j > k \end{cases} \qquad (7)$$

The segment matching cost expresses the penalty paid for points not matched to any vertex. These points, together with the neighboring segments, should resemble straight lines, as seen in Fig. 5. This is why the mean and the variance are set empirically as $\xi = \pi$ and $\varphi^2 = \pi^2/16$. The symbol $\lambda_S$ denotes the weight of the segment matching cost.

Fig. 6: (a) Minimization of the total cost $T^m_{j,k}$ goes over all $p_i$, where $i \in \{m-1, \ldots, j-1\}$; various choices of $p_i$ are visualized in different colors. (b) The total matching cost $T^m_{j,k}$ is given by summing the previous total cost $T^{m-1}_{i,j}$ (plotted in magenta), the vertex cost $V^m_{i,j,k}$ (red), the edge cost $E^m_{j,k}$ (green) and the segment cost $S_{j,k}$ (blue).

The weights of the matching costs were set empirically as $\lambda_V = 1$, $\lambda_E = 1/3$ and $\lambda_S = 1$ to balance the typical values of the costs. Note that both the vertex and segment matching costs evaluate angles and so their weights are equal, whereas the edge matching cost evaluates relative lengths. All three types of costs are visualized in Fig. 6b in different colors. The overall cost is given by summing the costs for all vertices $v_m$ and points $p_i, p_j, p_k$ such that $f(p_i) = v_{m-1}$, $f(p_j) = v_m$, $f(p_k) = v_{m+1}$. The goal is to find the mapping having the minimal overall cost $f^* = \arg\min_f C(f)$.

E. Dynamic programming algorithm for contour matching

The number of all possible mappings (see Eq. 3) is exponential in the number of vertices. It would be infeasible to compute the costs of all such mappings. Instead, we propose an efficient dynamic programming algorithm.
The main part of the algorithm is listed in Alg. 1. It assumes that $f(p_1) = v_1$ and $f(p_r) = v_M$, where $r \in \{M, \ldots, N\}$. It finds the cost of the optimal mapping to the remaining vertices $v_2, \ldots, v_{M-1}$. The optimal mapping itself can be constructed by also remembering the index of the point $p_i$ minimizing the cost $T^m_{j,k}$, followed by backward tracing, as is usual in dynamic programming algorithms. The global optimum can be found by calling Alg. 1 for each combination of the $N$ possible $n$-shifts of the contour points $(p_1, \ldots, p_N) \rightarrow (p_n, \ldots, p_N, p_1, \ldots, p_{n-1})$ with the $N - M + 1$ options of selecting $p_r$, i.e. Alg. 1 is called $O(N^2)$ times in total.

Alg. 1 is based on an iterative evaluation of the cost $T^m_{j,k}$ for increasing $m$. The cost is a summation of the local costs defined in Eq. 4, Eq. 5 and Eq. 6 for points $p_1, \ldots, p_j$ optimally mapped to vertices $v_1, \ldots, v_m$ so that $f(p_r) = v_M$, $f(p_1) = v_1$, $f(p_j) = v_m$, $f(p_k) = v_{m+1}$. The main step of the algorithm is the minimization searching for a point $p_i$ mapped to the previous vertex $v_{m-1}$. The minimization is visualized in Fig. 6a. The purpose of the individual costs is summarized in Fig. 6b.

The most time-complex part of Alg. 1 are the three nested loops computing $O(N^2 M)$ costs $T^m_{j,k}$, each of them obtained as a minimization over $O(N)$ elements. Thus the overall complexity of Alg. 1 is $O(N^3 M)$. Because Alg. 1 is called $O(N^2)$ times, the overall time complexity of the proposed contour matching algorithm is $O(N^5 M)$.
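Before the dynamic program of Alg. 1 (listed below) can run, the local costs of Eq. 4, Eq. 5 and Eq. 6 are tabulated. The following sketch shows one way to compute them; indices are 0-based, the angle convention and helper names are our own, and the weights are the empirical values given above.

import math
import numpy as np

def neglog_normal(x, mu, var):
    # Negative log-density of the normal distribution N(mu, var) at x.
    return 0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)

def inner_angle(a, b, c):
    # Oriented angle at b of the polyline a-b-c, in [0, 2*pi); the sign
    # convention must agree with the clockwise orientation of the contour.
    v1, v2 = a - b, c - b
    ang = math.atan2(v1[0] * v2[1] - v1[1] * v2[0], float(np.dot(v1, v2)))
    return ang % (2 * math.pi)

def vertex_cost(p, i, j, k, mu_m, var_m, lam_v=1.0):          # Eq. 4
    return lam_v * neglog_normal(inner_angle(p[i], p[j], p[k]), mu_m, var_m)

def edge_cost(p, j, k, nu_m, tau2_m, lam_e=1.0 / 3.0):        # Eq. 5
    perimeter = sum(np.linalg.norm(p[i] - p[(i + 1) % len(p)])
                    for i in range(len(p)))
    rel_len = np.linalg.norm(p[j] - p[k]) / perimeter
    return lam_e * neglog_normal(rel_len, nu_m, tau2_m)

def segment_cost(p, j, k, lam_s=1.0):                         # Eq. 6
    # Penalizes the points strictly between p_j and p_k (cyclically),
    # whose inner angles should stay close to pi (a straight line).
    n = len(p)
    idx = [(j + t) % n for t in range(1, (k - j) % n)]
    return lam_s * sum(
        neglog_normal(inner_angle(p[(i - 1) % n], p[i], p[(i + 1) % n]),
                      math.pi, math.pi ** 2 / 16)
        for i in idx)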

Algorithm 1 Contour matching algorithm
In:  $r$ = index such that $f(p_r) = v_M$
     $V^m_{i,j,k}$ = cost of matching $\angle p_i p_j p_k$ to $v_m$
     $E^m_{j,k}$ = cost of matching $p_j p_k$ to $v_m v_{m+1}$
     $S_{j,k}$ = cost of approximating $p_{j+1}, \ldots, p_{k-1}$ by $p_j p_k$
Out: $T^m_{j,k}$ = cost of matching $p_1, \ldots, p_{k-1}$ to $v_1, \ldots, v_m$ such that $f(p_j) = v_m$, $f(p_k) = v_{m+1}$

for all $j \in \{2, \ldots, r-M+2\}$ do
    for all $k \in \{j+1, \ldots, r-M+3\}$ do
        $T^2_{j,k} \leftarrow (V^1_{r,1,j} + V^2_{1,j,k}) + (E^M_{r,1} + E^1_{1,j} + E^2_{j,k}) + (S_{r,1} + S_{1,j} + S_{j,k})$
    end for
end for
for all $m \in \{3, \ldots, M-1\}$ do
    for all $j \in \{m, \ldots, r-M+m\}$ do
        for all $k \in \{j+1, \ldots, r-M+m+1\}$ do
            $T^m_{j,k} \leftarrow \min_{i \in \{m-1, \ldots, j-1\}} \left( T^{m-1}_{i,j} + V^m_{i,j,k} \right) + E^m_{j,k} + S_{j,k}$
        end for
    end for
end for
$T^M_{r,1} \leftarrow \min_{i \in \{M-1, \ldots, r-1\}} \left( T^{M-1}_{i,r} + V^M_{i,r,1} \right)$
return $T^M_{r,1}$

Although the degree of the polynomial is fairly high, the real performance is very good, as we show in Sec. V. This is because we choose the number of simplified contour points $N$ between 10 and 20, depending on the complexity of the matched model, which is enough for a precise approximation of the original contour.

F. Generating folded models

The proposed pipeline for landmark point recognition is not used only for a spread garment. After revealing its initial configuration, the garment is folded by the robot and a new image is taken. The contour is extracted and simplified in the same way as already described. The matching algorithm is also unaltered; however, we have to use a modified polygonal model which reflects the performed fold.

Fig. 7: Incremental creation of folded models for a short-sleeved shirt. The original vertices are being replaced by new vertices denoting the endpoints of individual folds (plotted in various colors).

The incremental creation of folded models is shown in Fig. 7. The original vertices are replaced by vertices denoting the endpoints of individual folds. The $s$-th fold is performed in the clockwise direction along the contour from position $F_s$ to $T_s$. All the original vertices positioned either between or near $F_s$ and $T_s$ are removed, and two new vertices $F_s$ and $T_s$ connected by an edge are added. The distributions of inner angles and relative lengths, which are used to evaluate the penalty $V^m_{i,j,k}$ in Eq. 4 and the penalty $E^m_{j,k}$ in Eq. 5, are adjusted to correspond to the observed image and the planned fold. The means $\mu_m$ and $\nu_m$ for the next folded model are set to the angles and relative lengths measured in the actual image, considering the line of the planned fold. The variances $\sigma_m^2$ and $\tau_m^2$ adjacent to the original vertices are all set to the smallest variance learned for the original model, as all the following manipulation is performed with that particular piece of garment. The variances adjacent to the newly added vertices are set to twice that value because of the uncertainty in the performed fold.
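The adjustment of the folded-model distributions described above is simple enough to state as code. The sketch below uses our own illustrative names, and for brevity it pools the angle and length variances into a single minimum, whereas the paper adjusts $\sigma_m^2$ and $\tau_m^2$ separately.

def folded_model_distributions(measured_angles, measured_lengths,
                               learned_variances, is_fold_endpoint):
    # measured_angles[m], measured_lengths[m]: inner angle and relative edge
    # length measured in the current image for vertex m of the folded model,
    # considering the line of the planned fold.
    # learned_variances: variances learned for the original spread model.
    # is_fold_endpoint[m]: True for the newly added vertices F_s and T_s.
    base_var = min(learned_variances)          # smallest learned variance
    mu = list(measured_angles)                 # means follow the actual image
    nu = list(measured_lengths)
    var = [2.0 * base_var if new else base_var for new in is_fold_endpoint]
    return mu, nu, var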
IV. PLANNING, GRASPING AND MANIPULATION

ROS provides packages to perform various robotics tasks. We utilize the MoveIt package [23], which is included to support motion planning. It implements interfaces for common robotics libraries. We use two approaches to generate robot trajectories for our purposes: planning and interpolation. The planning uses the Open Motion Planning Library (OMPL) [24] to schedule collision-free trajectories from one joint state to another. We have tested several planning algorithms from OMPL and found that RRT-Connect [25] suits our needs best, since it successfully finds a plan in most cases and in a reasonable time.

The interpolation generates points evenly distributed on a line in Cartesian coordinates. It then computes the inverse kinematics for each point to produce the final trajectory, which is sent to the robot controller.

We adopted the scheme proposed in [26] for the folding. The robot moves the grasped corners along a triangular path and utilizes the gravitational force acting on the garment. Since our gripper is not suitable for grasping a flat garment from above, our lower finger slides under the garment and grasps it. The motion near the garment and with the garment in the grippers uses interpolation to have full control over the actual trajectory and to prevent tearing the garment. The rest of the motion utilizes the planning discussed above.
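As an illustration of the interpolation approach, the sketch below generates the evenly spaced Cartesian waypoints. The step size is our own illustrative choice, gripper orientation is ignored, and the robot-specific inverse kinematics call that would turn each waypoint into a joint state is left abstract.

import numpy as np

def linear_waypoints(start, goal, step=0.01):
    # start, goal: 3D gripper positions in metres; returns an (n, 3) array of
    # evenly spaced waypoints, endpoints included. Each waypoint would then be
    # passed through the arm's inverse kinematics to obtain the joint-space
    # trajectory sent to the controller.
    start = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    n = max(2, int(np.ceil(np.linalg.norm(goal - start) / step)) + 1)
    return np.linspace(start, goal, n)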

V. EXPERIMENTS

We have performed two sets of experiments to test the proposed methods. Sec. V-A shows the performance of the computer vision pipeline on a dataset of static images. Sec. V-B describes folding experiments performed on a robot.

A. Experiments on the dataset of images

We have tested the proposed computer vision pipeline on a dataset of spread garments collected by our team [27]. The dataset contains color images taken from a bird's-eye perspective. One pixel roughly corresponds to 0.09 cm in world coordinates. All images were manually annotated by specifying the locations of the vertices of the described polygonal models. We used 170 testing images (41 towels, 45 pants, 45 short-sleeved shirts and 39 long-sleeved shirts). The algorithms were implemented in MATLAB and C++. The performance was evaluated on a notebook with an Intel processor and 8 GB memory.

Segmentation was performed on downsampled images to achieve a better time performance. In total, 165 of the 170 input images were correctly segmented. The incorrectly segmented images were excluded from the further evaluation. The time spent by segmenting one image is 0.83 seconds on average. The contour simplification algorithm is the most time-consuming operation; it takes between 0.5 and 3.5 seconds, depending mainly on the contour complexity. The subsequent model matching procedure works with the already simplified contour. Its runtime is 0.14 seconds on average. The complete computer vision pipeline runs almost always under 5 seconds, which is a significant improvement compared to the model fitting times of Miller et al. [2].

TABLE I: Displacements of the vertices found by the polygonal model matching for the towel, pants, short-sleeved and long-sleeved categories of clothing; the median, mean and standard deviation are reported in centimeters.

Fig. 8: Displacements between the vertices found by model matching (plotted in green) and the manually annotated landmarks (red). The displacements were computed for various configurations of garments and then projected to the canonical image.

Tab. I summarizes the displacements of vertices found by the proposed algorithm compared to the manual annotations. They are similar to those achieved by Miller et al. [2] on their own dataset. Moreover, the reported displacements are approximately 20% lower than the displacements achieved by our former algorithm [14] on a subset of the current dataset. Fig. 8 visualizes the displacements for various configurations of the selected pieces of garments. The experiments showed that determining the exact locations of shoulders is the most problematic task for our model. They are sometimes confused with the neckline. However, since the positions of shoulders are not used for the automated folding, these errors cause no problems.

B. Experiments on the CloPeMa testbed

We performed several experiments on the CloPeMa testbed to test the proposed algorithms. A piece of garment was spread manually on the table next to the robot in each experiment. A ROS application following predefined folding steps was started next. In each step, the robot was moved into the start configuration and the scene was perceived using the Asus Xtion RGB camera. The captured image was sent to the vision pipeline to find the polygonal model of the garment. The matched model was used to determine the position of the next fold, i.e. the grasping points and the points where the garment had to be placed. Several gripper approach directions were generated for each grasp point and each place point to increase the chance of a successful planning. If the planning was successful, the resulting trajectory was executed by the robot. Stages of the folding process are captured in Fig. 9. Fig. 10 gives more detailed examples.

Fig. 9: The robot performs a series of folds with a short-sleeved shirt. The images shown in the left column were taken by the robot's camera in order to fit the plotted polygonal model.

Fig. 10: Detailed view of the robot successfully folding a red towel, black jeans and a violet long-sleeved sweater.


More information

Stereo and Epipolar geometry

Stereo and Epipolar geometry Previously Image Primitives (feature points, lines, contours) Today: Stereo and Epipolar geometry How to match primitives between two (multiple) views) Goals: 3D reconstruction, recognition Jana Kosecka

More information

CSE/EE-576, Final Project

CSE/EE-576, Final Project 1 CSE/EE-576, Final Project Torso tracking Ke-Yu Chen Introduction Human 3D modeling and reconstruction from 2D sequences has been researcher s interests for years. Torso is the main part of the human

More information

Chapter 4. Clustering Core Atoms by Location

Chapter 4. Clustering Core Atoms by Location Chapter 4. Clustering Core Atoms by Location In this chapter, a process for sampling core atoms in space is developed, so that the analytic techniques in section 3C can be applied to local collections

More information

FOREGROUND DETECTION ON DEPTH MAPS USING SKELETAL REPRESENTATION OF OBJECT SILHOUETTES

FOREGROUND DETECTION ON DEPTH MAPS USING SKELETAL REPRESENTATION OF OBJECT SILHOUETTES FOREGROUND DETECTION ON DEPTH MAPS USING SKELETAL REPRESENTATION OF OBJECT SILHOUETTES D. Beloborodov a, L. Mestetskiy a a Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University,

More information

Efficient Surface and Feature Estimation in RGBD

Efficient Surface and Feature Estimation in RGBD Efficient Surface and Feature Estimation in RGBD Zoltan-Csaba Marton, Dejan Pangercic, Michael Beetz Intelligent Autonomous Systems Group Technische Universität München RGB-D Workshop on 3D Perception

More information

Human Body Recognition and Tracking: How the Kinect Works. Kinect RGB-D Camera. What the Kinect Does. How Kinect Works: Overview

Human Body Recognition and Tracking: How the Kinect Works. Kinect RGB-D Camera. What the Kinect Does. How Kinect Works: Overview Human Body Recognition and Tracking: How the Kinect Works Kinect RGB-D Camera Microsoft Kinect (Nov. 2010) Color video camera + laser-projected IR dot pattern + IR camera $120 (April 2012) Kinect 1.5 due

More information

Human Motion Detection and Tracking for Video Surveillance

Human Motion Detection and Tracking for Video Surveillance Human Motion Detection and Tracking for Video Surveillance Prithviraj Banerjee and Somnath Sengupta Department of Electronics and Electrical Communication Engineering Indian Institute of Technology, Kharagpur,

More information

Image Features: Local Descriptors. Sanja Fidler CSC420: Intro to Image Understanding 1/ 58

Image Features: Local Descriptors. Sanja Fidler CSC420: Intro to Image Understanding 1/ 58 Image Features: Local Descriptors Sanja Fidler CSC420: Intro to Image Understanding 1/ 58 [Source: K. Grauman] Sanja Fidler CSC420: Intro to Image Understanding 2/ 58 Local Features Detection: Identify

More information

Stereo Vision. MAN-522 Computer Vision

Stereo Vision. MAN-522 Computer Vision Stereo Vision MAN-522 Computer Vision What is the goal of stereo vision? The recovery of the 3D structure of a scene using two or more images of the 3D scene, each acquired from a different viewpoint in

More information

A Fast and Stable Approach for Restoration of Warped Document Images

A Fast and Stable Approach for Restoration of Warped Document Images A Fast and Stable Approach for Restoration of Warped Document Images Kok Beng Chua, Li Zhang, Yu Zhang and Chew Lim Tan School of Computing, National University of Singapore 3 Science Drive 2, Singapore

More information

Physics-Based Models for Robotic Soft Fabric Manipulation

Physics-Based Models for Robotic Soft Fabric Manipulation CENTER FOR MACHINE PERCEPTION CZECH TECHNICAL UNIVERSITY IN PRAGUE Physics-Based Models for Robotic Soft Fabric Manipulation Ph.D. Thesis Proposal Vladimír Petrík vladimir.petrik@fel.cvut.cz CTU CMP 2016

More information

Using Perspective Rays and Symmetry to Model Duality

Using Perspective Rays and Symmetry to Model Duality Using Perspective Rays and Symmetry to Model Duality Alex Wang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2016-13 http://www.eecs.berkeley.edu/pubs/techrpts/2016/eecs-2016-13.html

More information

Contents I IMAGE FORMATION 1

Contents I IMAGE FORMATION 1 Contents I IMAGE FORMATION 1 1 Geometric Camera Models 3 1.1 Image Formation............................. 4 1.1.1 Pinhole Perspective....................... 4 1.1.2 Weak Perspective.........................

More information

/10/$ IEEE 4048

/10/$ IEEE 4048 21 IEEE International onference on Robotics and Automation Anchorage onvention District May 3-8, 21, Anchorage, Alaska, USA 978-1-4244-54-4/1/$26. 21 IEEE 448 Fig. 2: Example keyframes of the teabox object.

More information

2D image segmentation based on spatial coherence

2D image segmentation based on spatial coherence 2D image segmentation based on spatial coherence Václav Hlaváč Czech Technical University in Prague Center for Machine Perception (bridging groups of the) Czech Institute of Informatics, Robotics and Cybernetics

More information

Lab 9. Julia Janicki. Introduction

Lab 9. Julia Janicki. Introduction Lab 9 Julia Janicki Introduction My goal for this project is to map a general land cover in the area of Alexandria in Egypt using supervised classification, specifically the Maximum Likelihood and Support

More information

C. Premsai 1, Prof. A. Kavya 2 School of Computer Science, School of Computer Science Engineering, Engineering VIT Chennai, VIT Chennai

C. Premsai 1, Prof. A. Kavya 2 School of Computer Science, School of Computer Science Engineering, Engineering VIT Chennai, VIT Chennai Traffic Sign Detection Via Graph-Based Ranking and Segmentation Algorithm C. Premsai 1, Prof. A. Kavya 2 School of Computer Science, School of Computer Science Engineering, Engineering VIT Chennai, VIT

More information

Advanced Video Content Analysis and Video Compression (5LSH0), Module 4

Advanced Video Content Analysis and Video Compression (5LSH0), Module 4 Advanced Video Content Analysis and Video Compression (5LSH0), Module 4 Visual feature extraction Part I: Color and texture analysis Sveta Zinger Video Coding and Architectures Research group, TU/e ( s.zinger@tue.nl

More information

Color Image Segmentation Editor Based on the Integration of Edge-Linking, Region Labeling and Deformable Model

Color Image Segmentation Editor Based on the Integration of Edge-Linking, Region Labeling and Deformable Model This paper appears in: IEEE International Conference on Systems, Man and Cybernetics, 1999 Color Image Segmentation Editor Based on the Integration of Edge-Linking, Region Labeling and Deformable Model

More information

Segmentation. Separate image into coherent regions

Segmentation. Separate image into coherent regions Segmentation II Segmentation Separate image into coherent regions Berkeley segmentation database: http://www.eecs.berkeley.edu/research/projects/cs/vision/grouping/segbench/ Slide by L. Lazebnik Interactive

More information

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT SIFT: Scale Invariant Feature Transform; transform image

More information

Occlusion Detection of Real Objects using Contour Based Stereo Matching

Occlusion Detection of Real Objects using Contour Based Stereo Matching Occlusion Detection of Real Objects using Contour Based Stereo Matching Kenichi Hayashi, Hirokazu Kato, Shogo Nishida Graduate School of Engineering Science, Osaka University,1-3 Machikaneyama-cho, Toyonaka,

More information

A New Feature Local Binary Patterns (FLBP) Method

A New Feature Local Binary Patterns (FLBP) Method A New Feature Local Binary Patterns (FLBP) Method Jiayu Gu and Chengjun Liu The Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA Abstract - This paper presents

More information

PEOPLE IN SEATS COUNTING VIA SEAT DETECTION FOR MEETING SURVEILLANCE

PEOPLE IN SEATS COUNTING VIA SEAT DETECTION FOR MEETING SURVEILLANCE PEOPLE IN SEATS COUNTING VIA SEAT DETECTION FOR MEETING SURVEILLANCE Hongyu Liang, Jinchen Wu, and Kaiqi Huang National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Science

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

A NEW FEATURE BASED IMAGE REGISTRATION ALGORITHM INTRODUCTION

A NEW FEATURE BASED IMAGE REGISTRATION ALGORITHM INTRODUCTION A NEW FEATURE BASED IMAGE REGISTRATION ALGORITHM Karthik Krish Stuart Heinrich Wesley E. Snyder Halil Cakir Siamak Khorram North Carolina State University Raleigh, 27695 kkrish@ncsu.edu sbheinri@ncsu.edu

More information

Dynamic Time Warping for Binocular Hand Tracking and Reconstruction

Dynamic Time Warping for Binocular Hand Tracking and Reconstruction Dynamic Time Warping for Binocular Hand Tracking and Reconstruction Javier Romero, Danica Kragic Ville Kyrki Antonis Argyros CAS-CVAP-CSC Dept. of Information Technology Institute of Computer Science KTH,

More information

EE 701 ROBOT VISION. Segmentation

EE 701 ROBOT VISION. Segmentation EE 701 ROBOT VISION Regions and Image Segmentation Histogram-based Segmentation Automatic Thresholding K-means Clustering Spatial Coherence Merging and Splitting Graph Theoretic Segmentation Region Growing

More information

Object Category Detection. Slides mostly from Derek Hoiem

Object Category Detection. Slides mostly from Derek Hoiem Object Category Detection Slides mostly from Derek Hoiem Today s class: Object Category Detection Overview of object category detection Statistical template matching with sliding window Part-based Models

More information

Scene Management. Video Game Technologies 11498: MSc in Computer Science and Engineering 11156: MSc in Game Design and Development

Scene Management. Video Game Technologies 11498: MSc in Computer Science and Engineering 11156: MSc in Game Design and Development Video Game Technologies 11498: MSc in Computer Science and Engineering 11156: MSc in Game Design and Development Chap. 5 Scene Management Overview Scene Management vs Rendering This chapter is about rendering

More information

Manipulating a Large Variety of Objects and Tool Use in Domestic Service, Industrial Automation, Search and Rescue, and Space Exploration

Manipulating a Large Variety of Objects and Tool Use in Domestic Service, Industrial Automation, Search and Rescue, and Space Exploration Manipulating a Large Variety of Objects and Tool Use in Domestic Service, Industrial Automation, Search and Rescue, and Space Exploration Sven Behnke Computer Science Institute VI Autonomous Intelligent

More information

Modeling the Virtual World

Modeling the Virtual World Modeling the Virtual World Joaquim Madeira November, 2013 RVA - 2013/2014 1 A VR system architecture Modeling the Virtual World Geometry Physics Haptics VR Toolkits RVA - 2013/2014 2 VR object modeling

More information

Real-time Object Detection CS 229 Course Project

Real-time Object Detection CS 229 Course Project Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection

More information

Classification of objects from Video Data (Group 30)

Classification of objects from Video Data (Group 30) Classification of objects from Video Data (Group 30) Sheallika Singh 12665 Vibhuti Mahajan 12792 Aahitagni Mukherjee 12001 M Arvind 12385 1 Motivation Video surveillance has been employed for a long time

More information

Multiple View Geometry

Multiple View Geometry Multiple View Geometry CS 6320, Spring 2013 Guest Lecture Marcel Prastawa adapted from Pollefeys, Shah, and Zisserman Single view computer vision Projective actions of cameras Camera callibration Photometric

More information

Learning Semantic Environment Perception for Cognitive Robots

Learning Semantic Environment Perception for Cognitive Robots Learning Semantic Environment Perception for Cognitive Robots Sven Behnke University of Bonn, Germany Computer Science Institute VI Autonomous Intelligent Systems Some of Our Cognitive Robots Equipped

More information

Estimating Human Pose in Images. Navraj Singh December 11, 2009

Estimating Human Pose in Images. Navraj Singh December 11, 2009 Estimating Human Pose in Images Navraj Singh December 11, 2009 Introduction This project attempts to improve the performance of an existing method of estimating the pose of humans in still images. Tasks

More information