Articulated Object Recognition: a General Framework and a Case Study


Luigi Cinque, Enver Sangineto, Computer Science Department, University of Rome La Sapienza, via Salaria 113, Rome, Italy
Steven Tanimoto, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA

Abstract

We present in this paper a general-purpose approach for articulated object recognition. We split the recognition process into two distinct phases. In the first we use standard model-based techniques to recognize and localize in the input image the rigid components the articulated object is composed of. In the second phase the spatial configurations formed by the recognized components are analyzed and compared with the valid configurations of the object we are searching for. The comparison is based on a constraint satisfaction method which can deal with both missing components and false positives. The proposed method relies on a redundant set of constraints which represent the valid spatial configurations of the object's components. Such constraints are neither embedded in the system nor domain-specific: they are learned during a suitable training phase. We show how this approach can be used in different scenarios with different kinds of articulated objects, and we present a case study concerning a robotic application.

1 Introduction

Recognition of articulated objects is an open research problem with implications in several important fields, first among which is the study of the human body: human body detection (e.g., for surveillance purposes [9] or pedestrian detection in driving support systems [1]), posture recognition (e.g., in smart rooms), gesture recognition [11], and human motion estimation [9] (e.g., for human-computer interaction or medical purposes).
Other interesting articulated and/or deformable objects are the human face in its different expressions, biological individuals [4], and articulated artifacts (e.g., robots [3, 13] or utensils [5]). Since object recognition is a difficult task even for rigid (i.e., non-deformable) objects, the degrees of freedom one has to take into account when dealing with the appearance of an articulated object make the task even more difficult. This is the reason why most works on articulated object recognition are domain-specific, while there are relatively few proposals for general-purpose approaches. Grimson [5] deals with nonrigid objects whose rigid subparts can stretch and/or rotate with respect to one another. The degrees of freedom of the object's components are described by a set of parameters which must satisfy some specific constraints. Recognition of the object is realized using a constraint propagation technique. Some drawbacks of this proposal are the typically high computational cost of the constraint propagation process and the need for ad hoc constraints which must be embedded in the recognition system for each specific object. Forsyth et al. describe biological articulated objects (e.g., animal bodies [4] and dressed people [12]) using ribbons, i.e., near-rectangular patches in the image. For instance, a side view of a horse body can be described by means of a ribbon for the body, a ribbon for the head, and four ribbons for the two legs (splitting each leg into upper leg and lower leg). Given an input image, it is first segmented using color and texture features, looking for skin-like patches which are approximated by a set of ribbons. Such ribbons are then progressively grouped and, for each group, perspective-invariant features are computed taking into account all the components of the group. The corresponding feature vector is used to classify the object using a previously trained statistical classifier.
Kinematic and dynamic constraints are used in [12] to prune inconsistent ribbon groups. Domain-specific kinematic and/or dynamic constraints are also used in several other human motion and posture recognition approaches (e.g., [8, 9]). Other common approaches for deformable object recognition are graph-based representations of the silhouette's skeleton [3, 6] and Point Distribution Model (PDM) approaches [1]. Both methods rely on the extraction of the object's silhouette from the analyzed image and are thus either sensitive to partial occlusions or in need of manual initialization. In this paper we propose a method for general-purpose articulated object recognition which can be applied in different scenarios.

Our approach is based on two distinct phases. If O is the articulated object to recognize (e.g., a robot or a human figure) and O is composed of N rigid components O_1, ..., O_N, the first phase is dedicated to separately searching for O_1, ..., O_N in the image I using standard model-based techniques. For each O_i, such techniques look for a rigid geometric transformation able to align the object model (M_i) with a portion of I representing M_i; they can be applied to images with occlusions and cluttered scenes and to generic shapes (M_i). The relative positions of the objects recognized in I are subsequently checked using generalized configuration constraints, which represent all the possible legal configurations of O. No specific knowledge about the domain is used in the definition of the generalized configuration constraints, because they are automatically learned by the system in the training phase from a set of examples.

2 First phase: recognition of the single rigid components

When the shape of the object to recognize is either fixed or of limited variability, there are many recognition approaches in the literature which allow a system to localize it in a given image. If the image can be (at least partially) segmented, the task can usually be executed reliably. For instance, in video data the human figure(s) and/or other moving objects can be separated from the background using motion detection techniques. Furthermore, skin detection can help the system localize hands and heads [14]. In some domains, blobs of uniform color/texture can be approximated by ribbons [4, 12]. In Section 6.2 we show how model-based approaches can be used to precisely localize, in a cluttered and noisy scene, the appearances of rigid robot components having fixed (but generic) shapes and colours. The method presented in Section 6.2 assumes sharp shapes.
Nevertheless, the articulated object recognition approach we present in this paper is independent of the specific method used for the recognition of the object's components. For instance, simpler and widely used template matching techniques based on a rough representation of the object's 2D appearance are usually able to localize the articulated object's components in colour-segmented images. We assume in the following that, every time a rigid or semi-rigid component O_i of O has been localized in the image I, it is associated with:

T_i = (x_i, y_i, θ_i),   (1)

where (x_i, y_i) is the position of the centroid of O_i in I and θ_i is the angle between the main axis of O_i and the x-axis of I. Let us call a_i the main axis of O_i, a_i being the line passing through the point (x_i, y_i) with orientation θ_i (see Figure 1). The axis of a rigid object O_i is an arbitrarily defined axis fixed when its model M_i is created. For simplicity we assume to deal only with 2D transformations; in Section 6 we show how 2D transformations are sufficient for our robotic domain. Nevertheless, the whole method can easily be extended to 3D transformations by increasing the number of parameters used in Sections 3 and 4. Finally, the reason why Equation (1) contains no scale information will be clarified in Section 5. From now on we call O_i a rigid component of O, even if it can be a non-rigid shape (e.g., a hand), for instance approximated by a blob of skin-like pixels. Moreover, we assume that, in the training phase of the algorithm (Section 4), all the components O_1, ..., O_N of O have been localized in each example image. This assumption will be relaxed in the recognition phase (Section 5).

3 Configuration representation

In this section we show how we represent a configuration of the rigid components O_1, ..., O_N of O in a given image I.
For each pair of components (O_i, O_j) and transformations T_i, T_j mapping O_i and O_j into I, we compute:

T_ij = (ρ_ij, α_ij, β_ij),   (2)

where ρ_ij = √((x_i − x_j)² + (y_i − y_j)²), α_ij is the angle between the vector passing through the points (x_i, y_i) and (x_j, y_j) and the line a_i, and β_ij is the angle between a_i and a_j. Figure 1 shows the relation between two components, a torso marker and a leg marker of a robot (see Figure 2), respectively represented by two schematic silhouettes. The first component is centered in P_i and the second in P_j; a_i is the axis of the first component, passing through P_i and forming the angle θ_i with the x-axis. Note that, while T_i depends on a fixed reference frame (the image reference frame), T_ij does not, because all its values are computed with respect to the center and the axis of O_i (i.e., P_i and a_i). A configuration C = {T_ij | 1 ≤ i, j ≤ N} is the set of all the possible mutual spatial relationships between the components of O. It is worth noticing that this representation contains redundant information (the dimension of C is O(N²), while the position of N rigid objects in the plane can be represented with O(N) values). This redundancy will be exploited in the recognition phase (Section 5) to deal with possibly missed components. Finally, we do not use continuity information, in order to be robust to partially occluded objects and to situations in which noise does not allow the system to reliably detect the adjacency of the component appearances in I (see Section 5). Next we need a way to recognize valid configurations.
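As an illustration, the relative descriptor of Equation (2) can be computed directly from two pose triples of the form (1). The sketch below is our own helper, with hypothetical names; angles are in radians:

```python
import math

def relative_descriptor(pose_i, pose_j):
    """Compute T_ij = (rho_ij, alpha_ij, beta_ij) from the poses
    T_i = (x_i, y_i, theta_i) and T_j = (x_j, y_j, theta_j)."""
    x_i, y_i, th_i = pose_i
    x_j, y_j, th_j = pose_j
    # rho_ij: Euclidean distance between the two centroids P_i and P_j.
    rho = math.hypot(x_i - x_j, y_i - y_j)
    # alpha_ij: angle between the vector P_i -> P_j and the axis a_i,
    # i.e. the bearing of O_j measured in O_i's own reference frame.
    alpha = (math.atan2(y_j - y_i, x_j - x_i) - th_i) % (2 * math.pi)
    # beta_ij: angle between the two main axes a_i and a_j.
    beta = (th_j - th_i) % (2 * math.pi)
    return rho, alpha, beta
```

Because every value is measured relative to P_i and a_i, the triple is unchanged when both components are translated and rotated together, which is exactly the frame independence of T_ij noted above.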

Figure 1. An example of two components O_i and O_j in a given image.

4 Generalized Configuration Constraints

Suppose we have a training set of images showing O in different configurations C_1, ..., C_E. For instance, C_1, ..., C_E can have been extracted from E different images showing O with its articulations moved into different positions. The system should generalize these examples by learning which configurations are valid. For each pair of components (O_i, O_j), let RC_ij be the relative configuration space describing the possible spatial relationships between O_i and O_j. RC_ij is a simple space with only three dimensions, corresponding to the parameters of Equation (2). Given a configuration example C_k (1 ≤ k ≤ E), the element T_ij^k of C_k is a positive example of the spatial relation the system is trying to learn, and it can be represented by a point in RC_ij. A generalized configuration constraint is a generalization over all the examples T_ij^1, ..., T_ij^E and is represented by a region R in the RC_ij space such that R is the minimum connected region including all the positive examples T_ij^1, ..., T_ij^E. For instance, R can be defined as the convex hull of the points T_ij^1, ..., T_ij^E. Other, more sophisticated machine learning techniques could be used to learn the decision surfaces of R, for instance nonlinear approaches (e.g., artificial neural networks). Nevertheless, simply connected point distributions in RC_ij, such as those involved in the description of simple rotational joints (e.g., see Section 6), can be reliably represented by means of a polyhedron. In the current implementation of the system, R is approximated by the minimum parallelepiped P_ij including all the points T_ij^1, ..., T_ij^E. The error introduced by this approximation is compensated by the redundancy of the constraint structure and leads to a noticeable computational saving (Section 5).
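Since P_ij is just the axis-aligned bounding box of the training points, both the learning step and the constraint test are a few lines of code. The sketch below uses our own hypothetical names; note that angular dimensions wrapping around 2π would need circular intervals, a detail this sketch ignores:

```python
def learn_constraint(examples):
    """Learn P_ij: the minimum axis-aligned parallelepiped enclosing the
    training points T_ij^1, ..., T_ij^E, each a (rho, alpha, beta) triple."""
    lows = tuple(min(e[d] for e in examples) for d in range(3))
    highs = tuple(max(e[d] for e in examples) for d in range(3))
    return lows, highs

def satisfies(box, t_ij):
    """Check T_ij in P_ij: three interval tests, i.e. O(1)."""
    lows, highs = box
    return all(lo <= v <= hi for lo, v, hi in zip(lows, t_ij, highs))
```

The convex hull mentioned above would give a tighter region R; the bounding box trades a looser constraint for constant-time membership tests.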
5 Recognition of the valid configurations

Each component O_i of O is represented in a viewer-independent Cartesian reference system. In the rigid recognition phase each O_i is sequentially localized in I (see Sections 2 and 6.2). Every time the rigid recognition procedure succeeds, it returns a rigid geometric transformation mapping O_i to a given position of I. Suppose the returned location is represented by L = (t_x, t_y, θ, k), t_x and t_y being the translation parameters, θ the rotation and k the scaling factor. We associate such a success with the component instance X = ⟨i, L⟩, where i is the index denoting the i-th component O_i of O. Note that O_i can be associated with zero, one, or more than one component instances. We merge instances into groups, each group G = {X_1, ..., X_N'} (N' ≤ N) satisfying the following constraints:

1. If i, j are instance indexes of G, then i ≠ j.

2. The scaling factor k must be (approximately) the same for all the elements L of the instances of G.

3. Other possible domain-specific constraints.

Point 2 above is used to prune invalid configurations under the assumption that O is not large in the direction perpendicular to the camera plane. An example of an object which does not satisfy this assumption is a railroad track going off to the horizon. However, such objects are quite uncommon, and the assumption of a quasi-equal scale factor for all the components of O can be used to reduce the dimensions of the configuration spaces (Section 4). In order to keep the number of possible groups to analyze under control, we adopt an incremental strategy similar to those presented in [4] and [12] (see Section 1). Groups are built by means of an iterative process in which the system alternates between adding new elements and verifying their constraints. In particular, in the RoboCup domain (Section 6) we start from the main component, the robot's torso, and we add any leg component recognized in I.
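Point 2 can be enforced incrementally while a group is grown: a new instance is accepted only if its scale agrees with every scale already in the group. A minimal sketch follows; the tuple layout and the tolerance value are illustrative assumptions, not taken from the paper:

```python
def scale_consistent(group, new_instance, tol=0.2):
    """group: list of instances X = (i, (tx, ty, theta, k));
    new_instance: candidate X to add.  Accept the candidate only if its
    scale k stays within a relative tolerance of every scale already in
    the group (the quasi-equal scale assumption of Point 2)."""
    k_new = new_instance[1][3]
    for _, (_, _, _, k) in group:
        if abs(k - k_new) > tol * max(k, k_new):
            return False
    return True
```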
Backtracking to previous clustering choices is possible, but this is not a problem for a small number of components (N ≈ 5) and a small number of candidate locations in I [4, 12]. When either the candidate locations L_1, L_2, ... or the number of model components N grows, continuity constraints should be used to keep the computational cost of the grouping phase under control. Given a group G = {⟨i_1, L_1⟩, ..., ⟨i_N', L_N'⟩}, we check whether the corresponding configuration is valid. If X_k = ⟨i_k, L_k⟩ ∈ G, we can use L_k to compute the parameters of T_{i_k} as defined in Equation (1). Then, using T_{i_1}, ..., T_{i_N'}, we compute the configuration C'_G = {T_ij | i, j ∈ {i_1, ..., i_N'}}, where T_ij is defined as in Equation (2). The value of each T_ij ∈ C'_G is checked against the generalized configuration constraint P_ij (see Section 4). The constraint is satisfied when T_ij ∈ P_ij. Note that representing a generalized constraint by means of a parallelepiped (P_ij) makes this check O(1). Note also that generalizing the approach to deal with 3D transformations would lead to a six-dimensional configuration space RC_ij. Nevertheless, representing a generalized constraint by means of a hyper-parallelepiped P_ij makes it possible to verify the constraint satisfaction (T_ij ∈ P_ij) still in O(1). Groups with a sufficient number of configuration constraints satisfied are recognized as instances of the articulated object. The advantage of representing a given configuration C as a set of points ({T_ij}), each belonging to a separate space (RC_ij), over the common approach in which a single point represents the positions of all the object's components in the pose space [9] is the robustness of our method to possible missed elements and false positives produced in the rigid recognition phase. In fact, one of the main problems of common feature space representations is their dependence on fixed-size feature vectors, which need to be filled in using feature values extracted from the whole object [2]: how can such representations deal with partially occluded objects in which some components have not been detected in the scene? Conversely, the possibility of separately checking the pose correctness of each recognized component with respect to the others implies that a number N' of components (N' ≤ N) is sufficient for pose recognition purposes. Depending on the specific domain, some components can be more important than others. For instance, in Section 6.3 each group must include at least one instance of a torso component of the robot. At the same time, false positives and wrong assemblies of components can be pruned using the spatial information about their relative positions.

6 A Case Study

In this section we present a working system using the approach shown in the previous sections in a specific domain: recognition of articulated dog-like robots in the RoboCup championship. RoboCup is an international competition among autonomous robot teams playing a soccer-like game.
In the four-legged league each team is composed of four AIBO dog-like robots (see Figure 2 (a)). The game field is a green rectangle of 4.6 × 3.1 meters, and every object in the field (e.g., the ball, the doors, etc.) is characterized by a different colour. The robots of the same team wear a uniform whose colour is either blue or red, depending on the team. A robot uniform is composed of a set of non-connected markers, each having a specific shape, attached to the rigid components of the robot (e.g., the torso or the legs). Figure 2 (b) shows the shape of the markers attached to both sides of the torso. A low-resolution camera is fixed on the head of every robot. Markers are represented in the currently grabbed frame by either red or blue blobs. Nevertheless, in a given frame, the blobs representing different markers (belonging either to the same robot or to several robots) can be merged together as a result of the articulated movement, of partial occlusions of one robot by another, or of noise effects (see Figure 2 (a)). In order to recognize the opponent robots it is then necessary to associate a set of n blobs with a set of m robots possibly present in the current image. Moreover, since each robot is an articulated object, its components, as well as the corresponding markers, can vary their relative positions, making the robot's recognition particularly difficult. There are very few proposals for the recognition of opponent robots in RoboCup. Most teams do not deal with the problem of multiple blobs and multiple robots in the same image, explicitly or implicitly assuming a one-to-one correspondence between blobs and robots [3, 13].

6.1 Preprocessing operations

Each frame of the video sequence is analyzed in turn. Pixels of the current image are segmented into homogeneous color groups, searching for the colours of the uniforms of the opponent robots (e.g., red).
Moreover, the red (blue) pixels are grouped into connected sets of points called blobs. Given a blob B of I, we select those points of B representing its external contour. Let us call C_B the closed curve representing such a contour. Finally, C_B is approximated using a standard split-based polygonal approximation technique [10]. The aim of this last step is the selection of a set of points of C_B which represent sharp corners and which will be used in the rigid recognition phase to drive the matching process. We call A_B the set of break-points of C_B produced by the polygonal approximation phase (the anchor points of B [7]). Figure 2 (a) shows some examples of video frames (top) together with the corresponding sets of blobs (bottom) and anchor points (black dots).

6.2 Recognition of the robot rigid components

In this section we show how to recognize the markers attached to the rigid components of the AIBO body. Let B be a blob of I with contour C_B, and let M be one of the marker models to search for in I, with contour C_M. Figure 2 (b) shows the model M of the marker attached to the sides of the robot's torso. Furthermore, suppose A_B and A_M are, respectively, the sets of anchor points of B and M. The Rigid Alignment procedure, derived from [7], is the following.

1. Hypothesis phase: for each pair of consecutive points ⟨p_1, p_2⟩ of A_B and each pair of consecutive points ⟨q_1, q_2⟩ of A_M, compute the transformation T such that p_1 = T(q_1) and p_2 = T(q_2).

Figure 2. (a) Three examples of real images and their corresponding color segmentation. (b) The rigid model of the lateral marker and its best overlap onto the tolerance area of an input image.

   (a) Test phase: use T to overlap C_M on I. Let M' = T(C_M) be the set of points of I corresponding to C_M after the transformation T.

   (b) Let Matched be the cardinality of the set {p ∈ M' : CloseTo(p, C_B)}.

   (c) If Matched is above the acceptance threshold, return ⟨Matched, T⟩.

2. Fail.

The above procedure is iterated for every blob B of the image and every model M_i representing the i-th marker. The transformation T = (t_x, t_y, θ, k), computed in Step 1, is composed of a 2D rotation (θ), a 2D translation (t_x and t_y), and a scaling factor (k). It can be computed by matching a pair of points of I (p_1 and p_2) with a pair of points of M (q_1 and q_2) and solving a simple equation system in which t_x, t_y, θ, and k are the unknowns. Moreover, we use consecutive anchor points in order to reduce the number of hypotheses to be verified. This is possible because, even when some blobs are merged together, there is usually a large connected part of the blob's contour free from occlusions which can be used to find segments for matching. We now describe the verification process in Steps 1 (a)-(c). We do not compare the inner points of the model (M) and the blob (B) because, due to possible blob merging, a single blob can represent two or more objects. In this case, matching the right model with a part of the blob could lead to rejecting the match if we compared the internal surfaces. Instead, we compare the external contours of the blob and the model, checking that a sufficient fraction of C_M (0.4 in our implementation) can be aligned with C_B using T. Given a point p ∈ C_M, we can verify that T(p) is close to some point of C_B (Step 1 (b)) by performing a dilation operation on C_B. The product of this dilation is a tolerance area (F_B) surrounding C_B (see Figure 2 (b)). If T(p) ∈ F_B then it is close to C_B.
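The tolerance area F_B can be materialized as a boolean grid obtained by dilating the rasterized contour C_B, after which the closeness test is a single array lookup. A dependency-free sketch (the grid size and dilation radius are illustrative choices, not values from the paper):

```python
def build_tolerance_mask(contour, width, height, radius=2):
    """Dilate the blob contour C_B (a list of integer (x, y) points) with a
    square structuring element, producing the tolerance area F_B as a grid."""
    mask = [[False] * width for _ in range(height)]
    for x, y in contour:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                px, py = x + dx, y + dy
                if 0 <= px < width and 0 <= py < height:
                    mask[py][px] = True
    return mask

def close_to(mask, p):
    """CloseTo(p, C_B) in O(1): one lookup in the precomputed mask F_B."""
    x, y = p
    return 0 <= y < len(mask) and 0 <= x < len(mask[0]) and mask[y][x]
```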
F_B is computed once for every blob of I in the preprocessing phase; it is stored in a suitable array associated with B and used by the function CloseTo. This representation of F_B makes it possible to check the condition CloseTo(p, C_B) in O(1). Figure 2 (b) shows an example of the output of this phase. The computational cost of the whole rigid recognition phase is O(h n' m' m), where n' = #A_I is the number of segments of the polygonal approximation of all the blobs of I, m' = #A_M, m = #C_M, and h is the number of models used for representing all the markers. Currently h = 3, since we use a model for the lateral torso marker (Figure 2 (b)), a model for the frontal torso marker (e.g., Figure 3 (b)), and a ribbon-like model for the leg markers. Note that the number of the articulated object's components (N) does not need to be equal to the number of rigid models (h).

6.3 Merging the recognized components into distinct robots

Once the single rigid components of the robots have been recognized in the current frame, they are incrementally grouped as shown in Section 5, every final group being a different robot with a well-defined position and orientation with respect to the viewer. The only domain-dependent constraint we have used (see Point 3 of Section 5) is the following: every final group obtained at the end of the incremental clustering process must include at least one instance of a torso component. A robot is represented by N = 5 components (the torso and four legs).

6.4 Results

The system has been implemented in C++ and a preliminary experimentation has been performed using a Pentium III, 850 MHz. We have used as test images 100 snapshots extracted from a video sequence grabbed by a robot during a real match. The images represent 130 robots in different positions and with the body in different configurations.
The images input to the system had previously been segmented [13] and their pixels classified into the color categories of the RoboCup domain (e.g., Figures 2 and 3). The average execution time is 0.72 seconds per image, including the preprocessing operations (blob and anchor point extraction, etc.). Figure 3 (b) shows an example of correct system output. The two black blobs represent two torso components, the light gray region shows an unrecognized blob (in this case a minor part of a lateral torso marker), and the dark gray blobs represent leg markers. The first and the second legs (from left to right) have been correctly associated by the system with the first torso component, and the third leg has been associated with the second torso of the image. Figure 3 (c) is an example of an image in which the system did not recognize any rigid model (and, hence, any robot) due to the great amount of noise produced both by the low resolution of the robot camera and by the color segmentation process. The recognition rate obtained by the system is 0.8, with 104 correctly recognized robot configurations out of 130 total configurations. Most of the missed items were produced by failures in the rigid recognition phase, which in turn are mainly due to noise in the color-segmented images (e.g., Figure 3 (c)).

Figure 3. (a) A colour-segmented image. (b) The corresponding output, in which two robot configurations have been correctly recognized. (c) A noisy image.

7 Conclusions

Although we have used the RoboCup domain to test the proposed method, it is intended to apply to other articulated objects as well, since no domain-specific hypothesis is made. This flexibility is also possible because the representation of the legal configurations by means of generalized configuration constraints satisfies the following properties: (1) it is invariant with respect to similarity transformations (rotations, translations and scale changes); (2) it is robust to partial occlusions and to possible missed recognition of some components of O; (3) it is not based on heuristics depending on the specific object O; rather, the constraints of a specific domain are learned by the system in the off-line training phase. We are currently studying how the present approach can be used to classify human gestures for natural human-computer interfaces. Given a set of human gesture classes and a set of corresponding examples for each class, generalized configuration constraints are used to describe the different classes. The rigid recognition phase used for the robot markers must be adapted to deal with non-rigid blobs representing the head, the hands and the arms of the segmented human figure. This work is in progress.
References

[1] Davis, L. S., Philomin, V., and Duraiswami, R. Tracking humans from a moving platform. In ICPR (2000).
[2] Duda, R. O., Hart, P. E., and Stork, D. G. Pattern Classification (2nd ed.). Wiley Interscience.
[3] Estivill-Castro, V., and Lovell, N. Improved object recognition: the RoboCup 4-legged league. In Proc. of 4th Int. Conf. IDEAL (2003).
[4] Forsyth, D. A., and Fleck, M. M. Body plans. In IEEE Conference on Computer Vision and Pattern Recognition (1997).
[5] Grimson, W. E. L. Object Recognition by Computer: The Role of Geometric Constraints. The MIT Press, Cambridge, Massachusetts.
[6] Haritaoglu, I., Harwood, D., and Davis, L. S. Ghost: A human body part labeling system using silhouettes. In 14th ICPR (1998).
[7] Huttenlocher, D. P., and Ullman, S. Recognizing solid objects by alignment with an image. Int. J. of Computer Vision 5, No. 2 (1990).
[8] Ju, S. X., Black, M. J., and Yacoob, Y. Cardboard people: A parameterized model of articulated image motion. In Second International Conf. on Automatic Face and Gesture Recognition (1996).
[9] Moeslund, T. B., and Granum, E. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding 81 (2001).
[10] Pavlidis, T. Structural Pattern Recognition. Springer-Verlag, Berlin Heidelberg New York.
[11] Pavlovic, V. I., Sharma, R., and Huang, T. S. Visual interpretation of hand gestures for human-computer interaction. IEEE Trans. on PAMI 19, No. 7 (1997).
[12] Ramanan, D., and Forsyth, D. A. Finding and tracking people from the bottom up. In CVPR 2003, Madison, WI.
[13] Röfer, T., et al. GermanTeam.
[14] Wren, C. R., Azarbayejani, A., Darrell, T., and Pentland, A. P. Pfinder: Real-time tracking of the human body. IEEE Trans. on PAMI 19, No. 7 (1997).


A Real Time System for Detecting and Tracking People. Ismail Haritaoglu, David Harwood and Larry S. Davis. University of Maryland

A Real Time System for Detecting and Tracking People. Ismail Haritaoglu, David Harwood and Larry S. Davis. University of Maryland W 4 : Who? When? Where? What? A Real Time System for Detecting and Tracking People Ismail Haritaoglu, David Harwood and Larry S. Davis Computer Vision Laboratory University of Maryland College Park, MD

More information

Announcements. Recognition (Part 3) Model-Based Vision. A Rough Recognition Spectrum. Pose consistency. Recognition by Hypothesize and Test

Announcements. Recognition (Part 3) Model-Based Vision. A Rough Recognition Spectrum. Pose consistency. Recognition by Hypothesize and Test Announcements (Part 3) CSE 152 Lecture 16 Homework 3 is due today, 11:59 PM Homework 4 will be assigned today Due Sat, Jun 4, 11:59 PM Reading: Chapter 15: Learning to Classify Chapter 16: Classifying

More information

Fast Lighting Independent Background Subtraction

Fast Lighting Independent Background Subtraction Fast Lighting Independent Background Subtraction Yuri Ivanov Aaron Bobick John Liu [yivanov bobick johnliu]@media.mit.edu MIT Media Laboratory February 2, 2001 Abstract This paper describes a new method

More information

Recognition (Part 4) Introduction to Computer Vision CSE 152 Lecture 17

Recognition (Part 4) Introduction to Computer Vision CSE 152 Lecture 17 Recognition (Part 4) CSE 152 Lecture 17 Announcements Homework 5 is due June 9, 11:59 PM Reading: Chapter 15: Learning to Classify Chapter 16: Classifying Images Chapter 17: Detecting Objects in Images

More information

Accurate 3D Face and Body Modeling from a Single Fixed Kinect

Accurate 3D Face and Body Modeling from a Single Fixed Kinect Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this

More information

PART-LEVEL OBJECT RECOGNITION

PART-LEVEL OBJECT RECOGNITION PART-LEVEL OBJECT RECOGNITION Jaka Krivic and Franc Solina University of Ljubljana Faculty of Computer and Information Science Computer Vision Laboratory Tržaška 25, 1000 Ljubljana, Slovenia {jakak, franc}@lrv.fri.uni-lj.si

More information

Human Upper Body Pose Estimation in Static Images

Human Upper Body Pose Estimation in Static Images 1. Research Team Human Upper Body Pose Estimation in Static Images Project Leader: Graduate Students: Prof. Isaac Cohen, Computer Science Mun Wai Lee 2. Statement of Project Goals This goal of this project

More information

TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA

TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA Tomoki Hayashi 1, Francois de Sorbier 1 and Hideo Saito 1 1 Graduate School of Science and Technology, Keio University, 3-14-1 Hiyoshi,

More information

Horus: Object Orientation and Id without Additional Markers

Horus: Object Orientation and Id without Additional Markers Computer Science Department of The University of Auckland CITR at Tamaki Campus (http://www.citr.auckland.ac.nz) CITR-TR-74 November 2000 Horus: Object Orientation and Id without Additional Markers Jacky

More information

Segmentation and Tracking of Partial Planar Templates

Segmentation and Tracking of Partial Planar Templates Segmentation and Tracking of Partial Planar Templates Abdelsalam Masoud William Hoff Colorado School of Mines Colorado School of Mines Golden, CO 800 Golden, CO 800 amasoud@mines.edu whoff@mines.edu Abstract

More information

ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall Midterm Examination

ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall Midterm Examination ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall 2008 October 29, 2008 Notes: Midterm Examination This is a closed book and closed notes examination. Please be precise and to the point.

More information

Tracking of video objects using a backward projection technique

Tracking of video objects using a backward projection technique Tracking of video objects using a backward projection technique Stéphane Pateux IRISA/INRIA, Temics Project Campus Universitaire de Beaulieu 35042 Rennes Cedex, FRANCE ABSTRACT In this paper, we present

More information

Colorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science.

Colorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science. Professor William Hoff Dept of Electrical Engineering &Computer Science http://inside.mines.edu/~whoff/ 1 Model Based Object Recognition 2 Object Recognition Overview Instance recognition Recognize a known

More information

Tracking of Human Body using Multiple Predictors

Tracking of Human Body using Multiple Predictors Tracking of Human Body using Multiple Predictors Rui M Jesus 1, Arnaldo J Abrantes 1, and Jorge S Marques 2 1 Instituto Superior de Engenharia de Lisboa, Postfach 351-218317001, Rua Conselheiro Emído Navarro,

More information

Template Matching Rigid Motion

Template Matching Rigid Motion Template Matching Rigid Motion Find transformation to align two images. Focus on geometric features (not so much interesting with intensity images) Emphasis on tricks to make this efficient. Problem Definition

More information

Data-driven Approaches to Simulation (Motion Capture)

Data-driven Approaches to Simulation (Motion Capture) 1 Data-driven Approaches to Simulation (Motion Capture) Ting-Chun Sun tingchun.sun@usc.edu Preface The lecture slides [1] are made by Jessica Hodgins [2], who is a professor in Computer Science Department

More information

Towards a Calibration-Free Robot: The ACT Algorithm for Automatic Online Color Training

Towards a Calibration-Free Robot: The ACT Algorithm for Automatic Online Color Training Towards a Calibration-Free Robot: The ACT Algorithm for Automatic Online Color Training Patrick Heinemann, Frank Sehnke, Felix Streichert, and Andreas Zell Wilhelm-Schickard-Institute, Department of Computer

More information

Component-based Face Recognition with 3D Morphable Models

Component-based Face Recognition with 3D Morphable Models Component-based Face Recognition with 3D Morphable Models B. Weyrauch J. Huang benjamin.weyrauch@vitronic.com jenniferhuang@alum.mit.edu Center for Biological and Center for Biological and Computational

More information

Expanding gait identification methods from straight to curved trajectories

Expanding gait identification methods from straight to curved trajectories Expanding gait identification methods from straight to curved trajectories Yumi Iwashita, Ryo Kurazume Kyushu University 744 Motooka Nishi-ku Fukuoka, Japan yumi@ieee.org Abstract Conventional methods

More information

Trainable Pedestrian Detection

Trainable Pedestrian Detection Trainable Pedestrian Detection Constantine Papageorgiou Tomaso Poggio Center for Biological and Computational Learning Artificial Intelligence Laboratory MIT Cambridge, MA 0239 Abstract Robust, fast object

More information

Nao Devils Dortmund. Team Description Paper for RoboCup Matthias Hofmann, Ingmar Schwarz, and Oliver Urbann

Nao Devils Dortmund. Team Description Paper for RoboCup Matthias Hofmann, Ingmar Schwarz, and Oliver Urbann Nao Devils Dortmund Team Description Paper for RoboCup 2017 Matthias Hofmann, Ingmar Schwarz, and Oliver Urbann Robotics Research Institute Section Information Technology TU Dortmund University 44221 Dortmund,

More information

An Efficient Need-Based Vision System in Variable Illumination Environment of Middle Size RoboCup

An Efficient Need-Based Vision System in Variable Illumination Environment of Middle Size RoboCup An Efficient Need-Based Vision System in Variable Illumination Environment of Middle Size RoboCup Mansour Jamzad and Abolfazal Keighobadi Lamjiri Sharif University of Technology Department of Computer

More information

An Object Detection System using Image Reconstruction with PCA

An Object Detection System using Image Reconstruction with PCA An Object Detection System using Image Reconstruction with PCA Luis Malagón-Borja and Olac Fuentes Instituto Nacional de Astrofísica Óptica y Electrónica, Puebla, 72840 Mexico jmb@ccc.inaoep.mx, fuentes@inaoep.mx

More information

Face Alignment Under Various Poses and Expressions

Face Alignment Under Various Poses and Expressions Face Alignment Under Various Poses and Expressions Shengjun Xin and Haizhou Ai Computer Science and Technology Department, Tsinghua University, Beijing 100084, China ahz@mail.tsinghua.edu.cn Abstract.

More information

An Interactive Technique for Robot Control by Using Image Processing Method

An Interactive Technique for Robot Control by Using Image Processing Method An Interactive Technique for Robot Control by Using Image Processing Method Mr. Raskar D. S 1., Prof. Mrs. Belagali P. P 2 1, E&TC Dept. Dr. JJMCOE., Jaysingpur. Maharashtra., India. 2 Associate Prof.

More information

Car tracking in tunnels

Car tracking in tunnels Czech Pattern Recognition Workshop 2000, Tomáš Svoboda (Ed.) Peršlák, Czech Republic, February 2 4, 2000 Czech Pattern Recognition Society Car tracking in tunnels Roman Pflugfelder and Horst Bischof Pattern

More information

Local qualitative shape from stereo. without detailed correspondence. Extended Abstract. Shimon Edelman. Internet:

Local qualitative shape from stereo. without detailed correspondence. Extended Abstract. Shimon Edelman. Internet: Local qualitative shape from stereo without detailed correspondence Extended Abstract Shimon Edelman Center for Biological Information Processing MIT E25-201, Cambridge MA 02139 Internet: edelman@ai.mit.edu

More information

Bitangent 3. Bitangent 1. dist = max Region A. Region B. Bitangent 2. Bitangent 4

Bitangent 3. Bitangent 1. dist = max Region A. Region B. Bitangent 2. Bitangent 4 Ecient pictograph detection Dietrich Buesching TU Muenchen, Fakultaet fuer Informatik FG Bildverstehen 81667 Munich, Germany email: bueschin@informatik.tu-muenchen.de 1 Introduction Pictographs are ubiquitous

More information

Template Matching Rigid Motion. Find transformation to align two images. Focus on geometric features

Template Matching Rigid Motion. Find transformation to align two images. Focus on geometric features Template Matching Rigid Motion Find transformation to align two images. Focus on geometric features (not so much interesting with intensity images) Emphasis on tricks to make this efficient. Problem Definition

More information

3. International Conference on Face and Gesture Recognition, April 14-16, 1998, Nara, Japan 1. A Real Time System for Detecting and Tracking People

3. International Conference on Face and Gesture Recognition, April 14-16, 1998, Nara, Japan 1. A Real Time System for Detecting and Tracking People 3. International Conference on Face and Gesture Recognition, April 14-16, 1998, Nara, Japan 1 W 4 : Who? When? Where? What? A Real Time System for Detecting and Tracking People Ismail Haritaoglu, David

More information

Real-time 3-D Hand Posture Estimation based on 2-D Appearance Retrieval Using Monocular Camera

Real-time 3-D Hand Posture Estimation based on 2-D Appearance Retrieval Using Monocular Camera Real-time 3-D Hand Posture Estimation based on 2-D Appearance Retrieval Using Monocular Camera Nobutaka Shimada, Kousuke Kimura and Yoshiaki Shirai Dept. of Computer-Controlled Mechanical Systems, Osaka

More information

Estimation of common groundplane based on co-motion statistics

Estimation of common groundplane based on co-motion statistics Estimation of common groundplane based on co-motion statistics Zoltan Szlavik, Laszlo Havasi 2, Tamas Sziranyi Analogical and Neural Computing Laboratory, Computer and Automation Research Institute of

More information

Optimal Grouping of Line Segments into Convex Sets 1

Optimal Grouping of Line Segments into Convex Sets 1 Optimal Grouping of Line Segments into Convex Sets 1 B. Parvin and S. Viswanathan Imaging and Distributed Computing Group Information and Computing Sciences Division Lawrence Berkeley National Laboratory,

More information

Backpack: Detection of People Carrying Objects Using Silhouettes

Backpack: Detection of People Carrying Objects Using Silhouettes Backpack: Detection of People Carrying Objects Using Silhouettes Ismail Haritaoglu, Ross Cutler, David Harwood and Larry S. Davis Computer Vision Laboratory University of Maryland, College Park, MD 2742

More information

The Analysis of Animate Object Motion using Neural Networks and Snakes

The Analysis of Animate Object Motion using Neural Networks and Snakes The Analysis of Animate Object Motion using Neural Networks and Snakes Ken Tabb, Neil Davey, Rod Adams & Stella George e-mail {K.J.Tabb, N.Davey, R.G.Adams, S.J.George}@herts.ac.uk http://www.health.herts.ac.uk/ken/vision/

More information

High-Level Computer Vision

High-Level Computer Vision High-Level Computer Vision Detection of classes of objects (faces, motorbikes, trees, cheetahs) in images Recognition of specific objects such as George Bush or machine part #45732 Classification of images

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

BabyTigers-98: Osaka Legged Robot Team

BabyTigers-98: Osaka Legged Robot Team BabyTigers-98: saka Legged Robot Team Noriaki Mitsunaga and Minoru Asada and Chizuko Mishima Dept. of Adaptive Machine Systems, saka niversity, Suita, saka, 565-0871, Japan Abstract. The saka Legged Robot

More information

Model Based Perspective Inversion

Model Based Perspective Inversion Model Based Perspective Inversion A. D. Worrall, K. D. Baker & G. D. Sullivan Intelligent Systems Group, Department of Computer Science, University of Reading, RG6 2AX, UK. Anthony.Worrall@reading.ac.uk

More information

Detecting and Segmenting Humans in Crowded Scenes

Detecting and Segmenting Humans in Crowded Scenes Detecting and Segmenting Humans in Crowded Scenes Mikel D. Rodriguez University of Central Florida 4000 Central Florida Blvd Orlando, Florida, 32816 mikel@cs.ucf.edu Mubarak Shah University of Central

More information

Human Activity Recognition Using Multidimensional Indexing

Human Activity Recognition Using Multidimensional Indexing Human Activity Recognition Using Multidimensional Indexing By J. Ben-Arie, Z. Wang, P. Pandit, S. Rajaram, IEEE PAMI August 2002 H. Ertan Cetingul, 07/20/2006 Abstract Human activity recognition from a

More information

Unsupervised Human Members Tracking Based on an Silhouette Detection and Analysis Scheme

Unsupervised Human Members Tracking Based on an Silhouette Detection and Analysis Scheme Unsupervised Human Members Tracking Based on an Silhouette Detection and Analysis Scheme Costas Panagiotakis and Anastasios Doulamis Abstract In this paper, an unsupervised, automatic video human members(human

More information

A Graph Theoretic Approach to Image Database Retrieval

A Graph Theoretic Approach to Image Database Retrieval A Graph Theoretic Approach to Image Database Retrieval Selim Aksoy and Robert M. Haralick Intelligent Systems Laboratory Department of Electrical Engineering University of Washington, Seattle, WA 98195-2500

More information

Computational Foundations of Cognitive Science

Computational Foundations of Cognitive Science Computational Foundations of Cognitive Science Lecture 16: Models of Object Recognition Frank Keller School of Informatics University of Edinburgh keller@inf.ed.ac.uk February 23, 2010 Frank Keller Computational

More information

Registration of Dynamic Range Images

Registration of Dynamic Range Images Registration of Dynamic Range Images Tan-Chi Ho 1,2 Jung-Hong Chuang 1 Wen-Wei Lin 2 Song-Sun Lin 2 1 Department of Computer Science National Chiao-Tung University 2 Department of Applied Mathematics National

More information

HOUGH TRANSFORM CS 6350 C V

HOUGH TRANSFORM CS 6350 C V HOUGH TRANSFORM CS 6350 C V HOUGH TRANSFORM The problem: Given a set of points in 2-D, find if a sub-set of these points, fall on a LINE. Hough Transform One powerful global method for detecting edges

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

Color Image Segmentation

Color Image Segmentation Color Image Segmentation Yining Deng, B. S. Manjunath and Hyundoo Shin* Department of Electrical and Computer Engineering University of California, Santa Barbara, CA 93106-9560 *Samsung Electronics Inc.

More information

Last week. Multi-Frame Structure from Motion: Multi-View Stereo. Unknown camera viewpoints

Last week. Multi-Frame Structure from Motion: Multi-View Stereo. Unknown camera viewpoints Last week Multi-Frame Structure from Motion: Multi-View Stereo Unknown camera viewpoints Last week PCA Today Recognition Today Recognition Recognition problems What is it? Object detection Who is it? Recognizing

More information

Remarks on Human Body Posture Estimation From Silhouette Image Using Neural Network

Remarks on Human Body Posture Estimation From Silhouette Image Using Neural Network Remarks on Human Body Posture Estimation From Silhouette Image Using Neural Network Kazuhiko Takahashi Faculty of Engineering Yamaguchi University Tokiwadai Ube, Yamaguchi, Japan kylyn@yanaguchi-u.ac.jp

More information

OBJECT TRACKING AND RECOGNITION BY EDGE MOTOCOMPENSATION *

OBJECT TRACKING AND RECOGNITION BY EDGE MOTOCOMPENSATION * OBJECT TRACKING AND RECOGNITION BY EDGE MOTOCOMPENSATION * L. CAPODIFERRO, M. GRILLI, F. IACOLUCCI Fondazione Ugo Bordoni Via B. Castiglione 59, 00142 Rome, Italy E-mail: licia@fub.it A. LAURENTI, G. JACOVITTI

More information

MIME: A Gesture-Driven Computer Interface

MIME: A Gesture-Driven Computer Interface MIME: A Gesture-Driven Computer Interface Daniel Heckenberg a and Brian C. Lovell b a Department of Computer Science and Electrical Engineering, The University of Queensland, Brisbane, Australia, 4072

More information

The Kinect Sensor. Luís Carriço FCUL 2014/15

The Kinect Sensor. Luís Carriço FCUL 2014/15 Advanced Interaction Techniques The Kinect Sensor Luís Carriço FCUL 2014/15 Sources: MS Kinect for Xbox 360 John C. Tang. Using Kinect to explore NUI, Ms Research, From Stanford CS247 Shotton et al. Real-Time

More information

FAST AND RELIABLE RECOGNITION OF HUMAN MOTION FROM MOTION TRAJECTORIES USING WAVELET ANALYSIS

FAST AND RELIABLE RECOGNITION OF HUMAN MOTION FROM MOTION TRAJECTORIES USING WAVELET ANALYSIS FAST AND RELIABLE RECOGNITION OF HUMAN MOTION FROM MOTION TRAJECTORIES USING WAVELET ANALYSIS Shu-Fai WONG 1 and Kwan-Yee Kenneth WONG 1 1 Department of Computer Science and Information Systems, The University

More information

Robust Model-Free Tracking of Non-Rigid Shape. Abstract

Robust Model-Free Tracking of Non-Rigid Shape. Abstract Robust Model-Free Tracking of Non-Rigid Shape Lorenzo Torresani Stanford University ltorresa@cs.stanford.edu Christoph Bregler New York University chris.bregler@nyu.edu New York University CS TR2003-840

More information

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision

Object Recognition Using Pictorial Structures. Daniel Huttenlocher Computer Science Department. In This Talk. Object recognition in computer vision Object Recognition Using Pictorial Structures Daniel Huttenlocher Computer Science Department Joint work with Pedro Felzenszwalb, MIT AI Lab In This Talk Object recognition in computer vision Brief definition

More information

3D Object Scanning to Support Computer-Aided Conceptual Design

3D Object Scanning to Support Computer-Aided Conceptual Design ABSTRACT 3D Object Scanning to Support Computer-Aided Conceptual Design J.S.M. Vergeest and I. Horváth Delft University of Technology Faculty of Design, Engineering and Production Jaffalaan 9, NL-2628

More information

Texture Segmentation by Windowed Projection

Texture Segmentation by Windowed Projection Texture Segmentation by Windowed Projection 1, 2 Fan-Chen Tseng, 2 Ching-Chi Hsu, 2 Chiou-Shann Fuh 1 Department of Electronic Engineering National I-Lan Institute of Technology e-mail : fctseng@ccmail.ilantech.edu.tw

More information

(Refer Slide Time 00:17) Welcome to the course on Digital Image Processing. (Refer Slide Time 00:22)

(Refer Slide Time 00:17) Welcome to the course on Digital Image Processing. (Refer Slide Time 00:22) Digital Image Processing Prof. P. K. Biswas Department of Electronics and Electrical Communications Engineering Indian Institute of Technology, Kharagpur Module Number 01 Lecture Number 02 Application

More information

Combining Edge and Color Features for Tracking Partially Occluded Humans

Combining Edge and Color Features for Tracking Partially Occluded Humans Combining Edge and Color Features for Tracking Partially Occluded Humans Mandar Dixit and K.S. Venkatesh Computer Vision Lab., Department of Electrical Engineering, Indian Institute of Technology, Kanpur

More information

TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA

TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA TEXTURE OVERLAY ONTO NON-RIGID SURFACE USING COMMODITY DEPTH CAMERA Tomoki Hayashi, Francois de Sorbier and Hideo Saito Graduate School of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku,

More information

Recognizing Deformable Shapes. Salvador Ruiz Correa Ph.D. UW EE

Recognizing Deformable Shapes. Salvador Ruiz Correa Ph.D. UW EE Recognizing Deformable Shapes Salvador Ruiz Correa Ph.D. UW EE Input 3-D Object Goal We are interested in developing algorithms for recognizing and classifying deformable object shapes from range data.

More information

The Analysis of Animate Object Motion using Neural Networks and Snakes

The Analysis of Animate Object Motion using Neural Networks and Snakes The Analysis of Animate Object Motion using Neural Networks and Snakes Ken Tabb, Neil Davey, Rod Adams & Stella George e-mail {K.J.Tabb, N.Davey, R.G.Adams, S.J.George}@herts.ac.uk http://www.health.herts.ac.uk/ken/vision/

More information

Judging Whether Multiple Silhouettes Can Come from the Same Object

Judging Whether Multiple Silhouettes Can Come from the Same Object Judging Whether Multiple Silhouettes Can Come from the Same Object David Jacobs 1, eter Belhumeur 2, and Ian Jermyn 3 1 NEC Research Institute 2 Yale University 3 New York University Abstract. We consider

More information

Particle-Filter-Based Self-Localization Using Landmarks and Directed Lines

Particle-Filter-Based Self-Localization Using Landmarks and Directed Lines Particle-Filter-Based Self-Localization Using Landmarks and Directed Lines Thomas Röfer 1, Tim Laue 1, and Dirk Thomas 2 1 Center for Computing Technology (TZI), FB 3, Universität Bremen roefer@tzi.de,

More information

Human Detection and Motion Tracking

Human Detection and Motion Tracking Human Detection and Motion Tracking Technical report - FI - VG20102015006-2011 04 Ing. Ibrahim Nahhas Ing. Filip Orság, Ph.D. Faculty of Information Technology, Brno University of Technology December 9,

More information

Why study Computer Vision?

Why study Computer Vision? Why study Computer Vision? Images and movies are everywhere Fast-growing collection of useful applications building representations of the 3D world from pictures automated surveillance (who s doing what)

More information

Connected Component Analysis and Change Detection for Images

Connected Component Analysis and Change Detection for Images Connected Component Analysis and Change Detection for Images Prasad S.Halgaonkar Department of Computer Engg, MITCOE Pune University, India Abstract Detection of the region of change in images of a particular

More information

using an omnidirectional camera, sufficient information for controlled play can be collected. Another example for the use of omnidirectional cameras i

using an omnidirectional camera, sufficient information for controlled play can be collected. Another example for the use of omnidirectional cameras i An Omnidirectional Vision System that finds and tracks color edges and blobs Felix v. Hundelshausen, Sven Behnke, and Raul Rojas Freie Universität Berlin, Institut für Informatik Takustr. 9, 14195 Berlin,

More information

People Recognition and Pose Estimation in Image Sequences

People Recognition and Pose Estimation in Image Sequences People Recognition and Pose Estimation in Image Sequences Chikahito Nakajima Central Research Institute of Electric Power Industry, 2-11-1, Iwado Kita, Komae, Tokyo Japan. nakajima@criepi.denken.or.jp

More information

Articulated Pose Estimation with Flexible Mixtures-of-Parts

Articulated Pose Estimation with Flexible Mixtures-of-Parts Articulated Pose Estimation with Flexible Mixtures-of-Parts PRESENTATION: JESSE DAVIS CS 3710 VISUAL RECOGNITION Outline Modeling Special Cases Inferences Learning Experiments Problem and Relevance Problem:

More information

Face Detection by Means of Skin Detection

Face Detection by Means of Skin Detection Face Detection by Means of Skin Detection Vitoantonio Bevilacqua 1,2, Giuseppe Filograno 1, and Giuseppe Mastronardi 1,2 1 Department of Electrical and Electronics, Polytechnic of Bari, Via Orabona, 4-7125

More information

Hand Gesture Recognition Based On The Parallel Edge Finger Feature And Angular Projection

Hand Gesture Recognition Based On The Parallel Edge Finger Feature And Angular Projection Hand Gesture Recognition Based On The Parallel Edge Finger Feature And Angular Projection Zhou Yimin 1,2, Jiang Guolai 1, Xu Guoqing 1 and Lin Yaorong 3 1 Shenzhen Institutes of Advanced Technology, Chinese

More information

Countermeasure for the Protection of Face Recognition Systems Against Mask Attacks

Countermeasure for the Protection of Face Recognition Systems Against Mask Attacks Countermeasure for the Protection of Face Recognition Systems Against Mask Attacks Neslihan Kose, Jean-Luc Dugelay Multimedia Department EURECOM Sophia-Antipolis, France {neslihan.kose, jean-luc.dugelay}@eurecom.fr

More information

Representation of 2D objects with a topology preserving network

Representation of 2D objects with a topology preserving network Representation of 2D objects with a topology preserving network Francisco Flórez, Juan Manuel García, José García, Antonio Hernández, Departamento de Tecnología Informática y Computación. Universidad de

More information

3D Models and Matching

3D Models and Matching 3D Models and Matching representations for 3D object models particular matching techniques alignment-based systems appearance-based systems GC model of a screwdriver 1 3D Models Many different representations

More information

Measurement of Pedestrian Groups Using Subtraction Stereo

Measurement of Pedestrian Groups Using Subtraction Stereo Measurement of Pedestrian Groups Using Subtraction Stereo Kenji Terabayashi, Yuki Hashimoto, and Kazunori Umeda Chuo University / CREST, JST, 1-13-27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan terabayashi@mech.chuo-u.ac.jp

More information

Chapter 3 Image Registration. Chapter 3 Image Registration

Chapter 3 Image Registration. Chapter 3 Image Registration Chapter 3 Image Registration Distributed Algorithms for Introduction (1) Definition: Image Registration Input: 2 images of the same scene but taken from different perspectives Goal: Identify transformation

More information

CS4733 Class Notes, Computer Vision

CS4733 Class Notes, Computer Vision CS4733 Class Notes, Computer Vision Sources for online computer vision tutorials and demos - http://www.dai.ed.ac.uk/hipr and Computer Vision resources online - http://www.dai.ed.ac.uk/cvonline Vision

More information

Markerless human motion capture through visual hull and articulated ICP

Markerless human motion capture through visual hull and articulated ICP Markerless human motion capture through visual hull and articulated ICP Lars Mündermann lmuender@stanford.edu Stefano Corazza Stanford, CA 93405 stefanoc@stanford.edu Thomas. P. Andriacchi Bone and Joint

More information

Human motion analysis: methodologies and applications

Human motion analysis: methodologies and applications Human motion analysis: methodologies and applications Maria João M. Vasconcelos, João Manuel R. S. Tavares maria.vasconcelos@fe.up.pt CMBBE 2008 8th International Symposium on Computer Methods in Biomechanics

More information

Non-rigid body Object Tracking using Fuzzy Neural System based on Multiple ROIs and Adaptive Motion Frame Method

Non-rigid body Object Tracking using Fuzzy Neural System based on Multiple ROIs and Adaptive Motion Frame Method Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics San Antonio, TX, USA - October 2009 Non-rigid body Object Tracking using Fuzzy Neural System based on Multiple ROIs

More information

coding of various parts showing different features, the possibility of rotation or of hiding covering parts of the object's surface to gain an insight

coding of various parts showing different features, the possibility of rotation or of hiding covering parts of the object's surface to gain an insight Three-Dimensional Object Reconstruction from Layered Spatial Data Michael Dangl and Robert Sablatnig Vienna University of Technology, Institute of Computer Aided Automation, Pattern Recognition and Image

More information

Face Detection and Recognition in an Image Sequence using Eigenedginess

Face Detection and Recognition in an Image Sequence using Eigenedginess Face Detection and Recognition in an Image Sequence using Eigenedginess B S Venkatesh, S Palanivel and B Yegnanarayana Department of Computer Science and Engineering. Indian Institute of Technology, Madras

More information

Categorization by Learning and Combining Object Parts

Categorization by Learning and Combining Object Parts Categorization by Learning and Combining Object Parts Bernd Heisele yz Thomas Serre y Massimiliano Pontil x Thomas Vetter Λ Tomaso Poggio y y Center for Biological and Computational Learning, M.I.T., Cambridge,

More information