Understanding People Pointing: The Perseus System

Roger E. Kahn and Michael J. Swain
Department of Computer Science
University of Chicago
1100 East 58th Street
Chicago, IL 60637

Abstract

In this paper we present Perseus, a purposive visual system used by our robot, CHIP, to locate objects being pointed at by people. Perseus uses knowledge about the task and environment at all levels of processing to perform visual tasks more accurately and efficiently.

1 Introduction

One of the tasks our robot, CHIP, performs quite well is to pick up trash around our offices and throw it away[6]. Sometimes we get tired of watching it clean up the entire room and decide we want it to pick up one particular piece of trash. Providing a verbal description of the trash's location to CHIP is awkward; it is far more natural to simply point at it. For CHIP to find objects this way it needs to notice people, recognize when they are pointing, determine which area they are pointing to, and find the object in that area. In this paper we present Perseus, a visual architecture implemented for CHIP that enables it to perform this task. One important aspect of Perseus is that processing at all levels is influenced by knowledge about the task and environment. Thus, unnecessary computations are avoided and we are able to tune the operations used to the system's specific needs. Perseus receives the higher level knowledge through its interface with a reactive execution system, such as the RAP system[5].

2 The system

Perseus is a purposive vision system[1] composed of six components (see Figure 1): a set of retinotopic feature maps[16, 13, 12, 17, 3, 14], a short term visual memory (STVM), a collection of object representations (ORs), markers[3], a visual routine library[18, 3], and a long term visual memory (LTVM). A reactive execution system activates the visual routines, providing higher level information about the task being performed to each component of Perseus. This information is used to select regions of attention and to parameterize visual operators. Computations at even the lowest level in the system, the feature maps, are influenced by the task being performed.

[Figure 1: The Perseus system in the pointing task. The diagram shows the early maps (intensity, edge, color, motion, disparity), the object representations (lights, floor, trash/Pepsi, people), the markers (head, hand, trash), the short term visual memory (segmentation map), the long term visual memory (symbols), and the visual routines (Identify-Object, Find-Object-Pointed-At).]

2.1 Feature maps

Feature maps are retinotopic descriptions of the scene that provide information about the location of features to ORs and markers. This information is used to segment, track, locate, and recognize objects. To maximize the usefulness of feature maps, ORs and markers parameterize each map based on their goals. The feature maps used in the pointing task are edge (Sobel), motion (normal flow [9]), disparity (NEC stereo algorithm[4]), and intensity. Other feature maps, such as color histogram backprojection[15, 8, 7] and texture measures, have proven useful in other tasks. Each map can be parameterized with a resolution and an area to operate over. Additional parameters are shown in Figure 2. The parameterization of feature maps not only increases their sensitivity to the particular feature values of interest to the task, but it also makes possible certain types of feature maps that could not be used if parameterization were not allowed. For instance, the color histogram backprojection algorithm requires a histogram to compute the feature map.

Figure 2: Feature map parameters

  Map        Parameters
  Color      Histogram
  Edge       Sobel threshold
  Motion     Min velocity, gradient threshold
  Disparity  Disparity range
  Intensity  Upper and lower intensity bounds
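As an illustration of how such parameterization might look in code, the sketch below implements two of the maps from Figure 2 using OpenCV and NumPy. This is our own minimal reconstruction, not the paper's implementation; the function names and the optional region-of-attention argument are hypothetical.

    import cv2
    import numpy as np

    def edge_map(image, sobel_threshold, region=None):
        # Edge feature map: Sobel gradient magnitude, thresholded with the
        # parameter supplied by the requesting OR or marker (see Figure 2).
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        if region is not None:
            x, y, w, h = region            # compute only over the area of attention
            gray = gray[y:y + h, x:x + w]
        gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
        return cv2.magnitude(gx, gy) > sobel_threshold

    def intensity_map(image, lower, upper, region=None):
        # Intensity feature map: marks pixels within the requested bounds.
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        if region is not None:
            x, y, w, h = region
            gray = gray[y:y + h, x:x + w]
        return (gray >= lower) & (gray <= upper)

A map requested with different parameters by two clients would simply be computed once per parameterization, matching the per-OR map instances described in Section 2.2.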

2.2 Object representations

In Perseus, object representations are structures containing data that describes an object and methods for segmenting, tracking, recognizing, and locating it. Their purpose is to acquire information about the real-world object they represent. This is done by invoking their methods. The methods call visual operators and store the results in the OR. Thus, ORs are not immutable: they can be specialized as their methods acquire information about the object they represent. This specialization includes acquiring knowledge about the object's location, trajectory, color, etc. Furthermore, the addition of knowledge to an OR may influence how its methods interpret the scene.

ORs are initially instantiated either by a visual routine or by being recalled from LTVM. Once an OR is instantiated, visual routines send it messages requesting information or specifying that an action be performed. Some of the messages an OR may receive include: locate object within area, segment object, grab object's color histogram, and place a marker on object. The methods available to an OR can be quite specific to that object. For instance, the person representation may be able to perform face recognition. However, many methods will be common to a large number of ORs, and the underlying operations performed by them will often be the same. We may have many ORs that use color histogram backprojection to locate an object.

ORs use feature maps to determine which parts of the scene are most important to the operations they are performing. When using the maps, an OR may parameterize them so they better serve the task at hand. For instance, when a Pepsi can OR is asked for its location, it may instantiate the color map with a Pepsi histogram and focus its search on peaks in that map. When multiple ORs parameterize a single feature map, the feature map process creates separate maps for each of them. Thus ORs cannot interfere with each other's maps, and the number of feature maps computed at any given time is limited by the number of ORs using them. We have found that the number of objects simultaneously being attended to by the agent is typically small in real-world problems, so the total number of maps requiring computation at any given time is reasonably small. Also, ORs may instantiate other ORs and send requests to them. For example, when the segmentation method of a person OR is invoked, it instantiates a floor OR and invokes its segmentation method. The person OR uses the floor's segmentation to eliminate possible locations of the person. ORs can also instantiate markers[3] to track important parts of the scene.

2.3 Short term visual memory

The STVM is a retinotopic segmentation map. When an OR performs a segmentation, the STVM is updated. Each point in the STVM is marked as belonging to an object that has claimed it in a segmentation, or as free. Any OR may access the STVM to aid in a computation. For instance, when an OR is performing a segmentation it can use the STVM to eliminate points based on whether or not they belong to another object.
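The claim/free semantics of the STVM can be captured in a few lines. The following sketch is hypothetical (the class and method names are ours), but it shows how one OR's segmentation can mask out points already claimed by another.

    import numpy as np

    FREE = 0

    class STVM:
        # Retinotopic segmentation map: each point is FREE or holds the id
        # of the object representation that claimed it in a segmentation.
        def __init__(self, height, width):
            self.map = np.full((height, width), FREE, dtype=np.int32)

        def claim(self, mask, or_id):
            # An OR records its segmentation by claiming the masked points.
            self.map[mask] = or_id

        def unclaimed(self):
            # Points no OR has claimed; an OR segmenting a new object can
            # intersect its candidate region with this mask.
            return self.map == FREE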
2.4 Markers

Markers are created by ORs to track objects or subparts of an object (like a person's hand), much as those used by Chapman[3]. Once a marker is tracking an object, visual routines and ORs may ask the marker for its position. If the marker has not lost track of the object, it returns its position in retinotopic coordinates; otherwise it returns a message saying so. When a marker is instantiated it is parameterized with a starting location and a tracking method. An example of a tracking method is a function that tracks the centroid of a blob in a feature map using a Kalman filter. As with ORs, it is important for markers to be able to track a large number of objects with a relatively small number of tracking methods so that this architecture can be used for a wide variety of tasks.
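A marker can be sketched as a position plus a pluggable tracking method; the centroid tracker below stands in for the blob-centroid method mentioned above (the Kalman filter smoothing is omitted for brevity, and all names are our own illustration).

    import numpy as np

    class Marker:
        def __init__(self, start_xy, tracking_method):
            self.xy = np.asarray(start_xy, dtype=float)
            self.track = tracking_method
            self.lost = False

        def update(self, feature_map):
            new_xy = self.track(feature_map, self.xy)
            if new_xy is None:
                self.lost = True      # callers must re-locate the object
            else:
                self.xy = new_xy

        def position(self):
            # Retinotopic coordinates, or None if track has been lost.
            return None if self.lost else tuple(self.xy)

    def centroid_tracker(feature_map, near_xy, radius=40):
        # Centroid of the active feature-map points within `radius` of the
        # previous position; returns None when the blob has disappeared.
        ys, xs = np.nonzero(feature_map)
        keep = (xs - near_xy[0]) ** 2 + (ys - near_xy[1]) ** 2 < radius ** 2
        if not keep.any():
            return None
        return np.array([xs[keep].mean(), ys[keep].mean()])

Parameterizing the same tracker to follow the top or the side of a component instead of its centroid yields the head and hand trackers of Sections 3.3 and 3.4.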

2.5 Visual routines

The interface between Perseus and the higher level system is a library of task-specific visual routines[18, 3]. Visual routines use markers, ORs, and LTVM to perform functions for the higher level system. Typical visual routines may find an object a person is pointing to, or identify a person in the room. Visual routines may be passed parameters when they are called. One type of parameter for visual routines is a symbol that the LTVM associates with a particular OR. This allows the symbolic, higher level system to call visual routines with particular ORs. Additionally, other types of data, such as ORs and markers, may be passed to visual routines. This commonly occurs when one visual routine calls another.

2.6 Long term visual memory

The LTVM associates symbols with ORs to allow the higher level system to reference objects when it calls visual routines. Each OR is typed and contains a comparison method used to determine whether two ORs represent the same object in the scene. The comparison method may use different operations depending upon the information contained in each OR. For instance, if two soda can representations exist that have previously been sent a segment-and-grab-histogram message, color histogram intersection may be used to compare the two representations. However, if the histogram has not been acquired, but an edge representation for each can is known instead, the Hausdorff distance may be used for the comparison. By performing comparisons between objects, the higher level system is able to determine if it has seen a particular object before.

3 Example: Finding trash

The Perseus system has been used to identify trash pointed to by people. We assume that a stationary camera is observing a scene containing a single person. The pointing gesture is defined to occur when the person extends either arm to their side and holds it still for a while. This system has been implemented on CHIP and used to locate objects pointed to by visitors to our laboratory.

The high level system needs to find an object the person is pointing to. To do this it calls the visual routine Find-Object-Pointed-At. This visual routine requires two arguments: an OR for a person (person) and an OR for the object pointed at (trash). LTVM contains generic ORs of people and trash, i.e., ORs that do not have details such as a color histogram of the trash in their data, but do have information such as bounds on the size of trash. The high level system passes the symbols associated with these ORs to Find-Object-Pointed-At, which accesses the ORs via LTVM. Methods of person and trash will be invoked by the visual routine to fill in details about the objects they represent. Thus, once these ORs are passed to Find-Object-Pointed-At they are no longer thought of as generic ORs, but instead as particular ORs being used in this particular task. Eventually, very specific information will be acquired by the ORs' methods and stored in them. For instance, the location of the person's head will eventually be acquired by one of person's methods and stored in the person OR. It should be noted that if the higher level system's goal were to find a particular type of trash being pointed at by a particular person, it would use that trash's OR rather than the generic one. The data contained in it would influence how it was found in the scene. For instance, if we know the person will point at a Pepsi can, we could use the pepsi-can OR. This would result in the particular colors of Pepsi cans being used to further restrict the search in the area pointed to.

3.1 Finding the person

Find-Object-Pointed-At first invokes person's find-location method. This method currently assumes that only people move in the scene, so motion is used to find a person. It parameterizes the motion feature map to detect medium velocities and waits for a large area of motion in the map (see Figure 3). Once motion is detected, a marker is placed on the centroid of the largest moving connected component. The marker is parameterized to track the centroid of this component in the motion map. The marker is returned.
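Under the stated assumption that only the person moves, find-location reduces to waiting for a large moving blob. A rough sketch of this step (our own, with OpenCV's connected components standing in for the paper's implementation):

    import cv2
    import numpy as np

    def find_person(motion_map, min_speed, min_area):
        # Threshold the motion feature map at a medium velocity...
        moving = (motion_map > min_speed).astype(np.uint8)
        n, labels, stats, centroids = cv2.connectedComponentsWithStats(moving)
        if n < 2:
            return None                      # label 0 is the background
        areas = stats[1:, cv2.CC_STAT_AREA]
        best = 1 + int(np.argmax(areas))
        # ...and wait until the largest moving component is large enough.
        if areas[best - 1] < min_area:
            return None
        return tuple(centroids[best])        # starting position for the marker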
3.2 Segmenting the person

Next, Find-Object-Pointed-At invokes person's find-body-parts method, which segments the person and places markers on the person's head and hands. The first step in the method for finding the hands and head is to segment the person. The disparity and motion maps are used to find an average disparity for the person. This is done by averaging the disparity at each point in the scene that moves (we are assuming only one person is in the scene). Then the disparity map is parameterized to detect disparities within a range, corresponding to a typical person's width, around this average. The person is segmented by extracting connected components from the resulting disparity map near the marker tracking the person's body.

Since the lights in the room saturate our cameras, the stereo algorithm may give erroneous information about their disparity. Therefore, before segmenting, person instantiates an OR for the lights (lights) and uses it to ignore the lights in the scene. Another difficulty is that the floor has the same disparity as the person's feet where they are standing. person instantiates a floor OR (floor) so it can ignore the floor as well. Once lights is instantiated, person invokes its segment method. This method sets the intensity feature map to threshold the image for high intensities. The STVM is updated with the saturated points from the intensity map. person then instantiates floor and calls its segment method. This method assumes that the floor is textureless, so to get a rough segmentation all we must do is find the lowest edges in the scene. floor parameterizes the edge map and segments the floor as the pixels below the lowest edge in each column of the map. This is the same method Horswill uses for navigation of his robot Polly[10]. The STVM is then updated with the resulting segmentation. Figure 3 shows the STVM after the lights and floor have been segmented. person uses the disparity map and the STVM to segment the person by accepting regions in the disparity map that are near the marker on the person and are not part of the floor or lights; see Figure 3.

[Figure 3: The motion feature map, the short term visual memory, and the segmented person]
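The floor OR's lowest-edge rule is simple enough to state directly. In the sketch below (our reconstruction, assuming image rows grow downward), each column is scanned for its lowest edge and everything beneath it is labeled floor:

    import numpy as np

    def segment_floor(edge_map):
        # edge_map: boolean retinotopic map from the parameterized edge operator.
        h, w = edge_map.shape
        floor = np.zeros((h, w), dtype=bool)
        for col in range(w):
            rows = np.nonzero(edge_map[:, col])[0]
            lowest = rows.max() if rows.size else -1   # no edges: whole column is floor
            floor[lowest + 1:, col] = True
        return floor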

3.3 Finding the head

The find-body-parts method now finds the head by locating the highest point in the person segmentation that is approximately above the marker tracking the person's body. If the head is found, a marker is placed on it and parameterized to track the head as the highest point in its connected component in the motion map. This tracking method is the same as the one used to track the centroid of the person's body in Section 3.1, except it is parameterized to track the top rather than the centroid of the component.

3.4 Finding the hands

Now find-body-parts finds the hands by finding the leftmost and rightmost points in the person segmentation. These points are required to be approximately to the side of the marker tracking the person's body. They are also required to be connected to the body by segments that are thin enough to be arms. Each hand found has a marker placed on it, but both hands do not have to be found. The hand markers are parameterized to track the leftmost/rightmost point in their connected component in the motion map. This tracking method is the same as the one used to track the centroid of the person's body in Section 3.1, except it is parameterized to track the side rather than the centroid of the component.

3.5 Finding the area pointed to

Find-Object-Pointed-At now monitors the positions of the head and hand markers. If either is lost, it relocates them as described above. If either hand marker maintains a steady position far to the side of the person for a short time, the pointing gesture has occurred. Find-Object-Pointed-At computes the line of sight from the head to the hand. This line approximately denotes the area occluded from the person by their hand. Next, Perseus computes a two-degree-wide cone centered around this line, starting at the hand and projecting away from the person. The area contained within this cone is the region pointed at.

3.6 Finding the trash

Find-Object-Pointed-At now invokes trash's locate-in-region method with the search cone as a parameter. This method locates trash by searching for small, densely textured areas in the scene. A more sophisticated method could be used, but for the purpose of this experiment this definition is sufficient (the dense texture definition locates cans, styrofoam cups, and crumpled pieces of paper on textureless surfaces like the floor of our lab). trash parameterizes the edge map to locate fine textures over the search area. If trash is found, a marker is placed on it and parameterized to track the centroid of the region in the texture map; otherwise a message describing the situation is returned. The method used to track the trash is the same as the one used to track the centroid of the person's body in Section 3.1, except it is parameterized to use the edge map rather than the motion map. If trash is not located, the cone's width is increased until either the trash is found or the cone becomes too wide. Figure 4 shows the search area, head marker, hand marker, and trash marker.

[Figure 4: Perseus has found a Pepsi can]
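The cone test itself is a small geometric predicate. A sketch (ours; the paper gives no formulas) that checks whether a candidate image point lies inside a cone of the given width, starting at the hand and projecting away from the person:

    import numpy as np

    def in_pointing_cone(head, hand, point, cone_width_deg=2.0):
        d = np.asarray(hand, float) - np.asarray(head, float)   # line of sight
        v = np.asarray(point, float) - np.asarray(hand, float)  # hand -> candidate
        if np.dot(v, d) <= 0:
            return False                 # behind the hand, toward the person
        cos_angle = np.dot(v, d) / (np.linalg.norm(v) * np.linalg.norm(d))
        return cos_angle >= np.cos(np.radians(cone_width_deg / 2.0))

Retrying the textured-region search with a growing cone_width_deg reproduces the widening behavior described above.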
3.7 Identifying the trash

Find-Object-Pointed-At returns with a message to the higher level system saying whether the trash has been found. The higher level system now determines if it knows the name of this trash. It does this by calling the Identify-Object visual routine with trash as an argument. Identify-Object compares the OR it is passed with the other ORs in LTVM of the same type, using their comparison methods. LTVM has ORs of Pepsi cans (pepsi-can) and Mountain Dew cans (mountain-dew-can). These ORs contain color histograms of the cans they represent, and their method for comparison involves performing a histogram intersection[15]. When trash is compared with these ORs, the comparison method asks trash for its color histogram. The first time trash receives this request, its grab-histogram method is invoked, which finds the convex hull of the textured region in the edge map and histograms the segmentation. In the future we will use a more sophisticated segmentation method, but in our restricted domain this procedure works well. Once the histogram is known, future segmentations are not required. trash is compared with pepsi-can and mountain-dew-can. If either of these comparisons has a sufficiently high color intersection, it is considered a match and returned.
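Histogram intersection[15] is the core of these comparison methods. A minimal sketch with normalized histograms (our own code; the threshold name is hypothetical):

    import numpy as np

    def histogram_intersection(h1, h2):
        # 1.0 means identical color distributions, 0.0 means disjoint ones.
        h1 = h1 / h1.sum()
        h2 = h2 / h2.sum()
        return np.minimum(h1, h2).sum()

    # Hypothetical use by Identify-Object:
    # if histogram_intersection(trash_hist, pepsi_hist) > MATCH_THRESHOLD:
    #     return pepsi_can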

4 Experiments

This system has been tested with a variety of people pointing at objects in numerous positions on the floor. Typically when someone visits our lab we show them the system, and we have found that it can successfully find objects pointed to by our guests. Here we describe the system's success with five people who had never used it before. Each person was told to place a piece of trash on the ground where CHIP could see it. Then he or she was asked to stand anywhere in CHIP's field of view and point at the trash for at least three seconds. The subjects were not told to hold their arm in any particular position, yet we found that in almost all cases the line of sight from their head to their hand successfully described the region they were pointing at. Figure 5 shows the angle between the search line Perseus computes and the trash pointed at. Numbers in bold denote objects that Perseus correctly located, while angles in normal type refer to failures. Blanks in the table occur in places where Perseus gave up.

[Figure 5: Success of Perseus over successive trials, tabulated by person and pointing trial. The number in each box describes the angle the trash was from the line of sight; it is in bold type if Perseus correctly found it. Blank entries indicate Perseus gave up.]

It is interesting to note that the system has no problem finding trash lying at a different distance from the camera than the person. This is because lines project to lines under perspective projection, and therefore the line of sight from the head through the hand to the trash is still a line in the image plane. The only failure in this situation is when the trash lies between the person and CHIP, which occurs because Perseus currently can only find the person's hand when it is to the side of the body.

Our current implementation exhibits three other failure types. The first type occurs when Perseus finds the person's head and hands correctly, but the trash pointed at is not near the search cone. This happened only once in the 40 runs above, which suggests that people often do point along the line from their eye to their hand. The second failure type occurs when Perseus incorrectly locates the person's body parts. This type of error is almost always the result of an incorrect segmentation of the person. When a person stands too close to CHIP, their legs may be included in the floor segmentation because of the floor OR's simplistic assumption that any region adjoining the bottom of the image is the floor. Another problem occurs when the person stands too close to other objects.
This causes the disparity-based segmentation to fail. The final type of failure results from our oversimplified trash representation: non-trash objects will occasionally be found by the trash OR if they lie within the region pointed to.

As can be seen from the data, the system has been quite successful. It has proved robust to variations in the clothing people wear, the hand they point with, their height, and many other variables. In the above experiments it correctly located the object 68% of the time, and 80% of the time the object was within 20 degrees of the search area.

5 Future work

The failures described in Section 4 have resulted primarily from an over-simplification of our ORs, not from the Perseus architecture. Currently we are modifying the methods used by the person and floor ORs to be more robust, and we are rewriting the trash OR from scratch. The floor OR is being improved so that it can be used on textured floors. It will electronically verge the cameras to place their horopter on the ground plane. Segmentation is then done with a zero-disparity filter[2]. This allows us to find textured floors, and it does not fail if a person is standing close to the camera. We are also experimenting with other methods for segmenting the person that can be used in conjunction with the current disparity segmentation procedure. In particular, we have found that histogramming the intensity of each point in the image over time, and performing figure/ground segmentation based upon how each pixel in the current frame differs from the mode of its histogram, can improve our segmentation[11].
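The temporal-mode idea in [11] can be sketched as follows (our illustration; the bin count and threshold are arbitrary choices): each pixel's intensity is histogrammed over time, the histogram mode serves as the background estimate, and pixels in the current frame that differ enough from it are labeled figure.

    import numpy as np

    def background_mode(frames, bins=64):
        # frames: list of equally sized grayscale images (values 0-255).
        stack = np.stack(frames).astype(np.int64)        # (T, H, W)
        quantized = stack * bins // 256
        t, h, w = quantized.shape
        flat = quantized.reshape(t, -1)
        modes = np.empty(flat.shape[1], dtype=np.int64)
        for i in range(flat.shape[1]):                   # per-pixel histogram mode
            modes[i] = np.bincount(flat[:, i], minlength=bins).argmax()
        return (modes.reshape(h, w) * 256) // bins       # back to intensity units

    def figure_ground(frame, background, threshold=30):
        # Figure pixels differ from the background mode by more than threshold.
        return np.abs(frame.astype(np.int64) - background) > threshold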
6 Summary and conclusions

A system based upon Perseus, a purposive vision architecture, has been presented and used to locate objects being pointed at in a non-engineered environment in real time. The system has been tested with numerous people pointing at objects before various backgrounds while wearing a variety of clothes. The success of the system lies in its ability to use information about the task and environment at every level of the architecture: feature maps are parameterized by the objects that are important to the system's goals, a short term visual memory allows object representations to know where other objects relevant to the task are, markers track objects with different methods depending on what they are tracking, the information extracted about objects in the scene is determined by the visual routines being run, and, of course, the visual routines are selected based on the goals of the higher level system. Because of this, Perseus is able to avoid unneeded computations, optimize operations for the particular goals of the system, and perform calculations at the lowest level that are not possible without task-specific information.

7 Acknowledgements

The authors would like to thank R. James Firby and Peter N. Prokopowicz for their contributions toward the design of our system and the Perseus architecture.

References

[1] J. Aloimonos. Purposive and qualitative active vision. International Conference on Pattern Recognition, pages 346-360.
[2] P. J. Burt, P. Anandan, K. Hanna, and G. van der Wal. A front end vision processor for unmanned vehicles. Technical report, David Sarnoff Research Center.
[3] D. Chapman. Vision, Instruction, and Action. MIT Press, 1991.
[4] I. J. Cox, S. Hingorani, B. M. Maggs, and S. B. Rao. Stereo without disparity gradient smoothing: a Bayesian sensor fusion solution. British Machine Vision Conference, pages 337-346.
[5] R. J. Firby. Adaptive Execution in Complex Dynamic Worlds. PhD thesis, Yale, 1989.
[6] R. J. Firby, R. E. Kahn, P. N. Prokopowicz, and M. J. Swain. Collecting trash: A test of purposive vision. Workshop on Vision for Robots, August.
[7] B. V. Funt and G. D. Finlayson. Color constant color indexing. IEEE Pattern Analysis and Machine Intelligence, in press.
[8] G. Healey and D. Slater. Using illumination invariant color histogram descriptors for recognition. Computer Vision and Pattern Recognition, pages 355-360.
[9] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185-203, August 1981.
[10] I. Horswill. Polly: A vision-based artificial agent. American Association for Artificial Intelligence, 1993.
[11] R. E. Kahn and M. J. Swain. Background detection for segmentation. Animate Agent Project Working Note 8, University of Chicago, August.
[12] J. Malik and P. Perona. Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7(5):923-932, 1990.
[13] D. Marr. Vision. W. H. Freeman and Company, San Francisco, California, 1982.
[14] R. Milanese. Detecting Salient Regions in an Image: From Biological Evidence to Computer Implementation. PhD thesis, University of Geneva.
[15] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer Vision, 7:11-32, 1991.
[16] A. Treisman and G. Gelade. A feature-integration theory of attention. Cognitive Psychology, 12:97-136, 1980.
[17] J. K. Tsotsos. Analyzing vision at the complexity level. Behavioral and Brain Sciences, 13:423-469, 1990.
[18] S. Ullman. Visual routines. Cognition, 18:97-159, 1984.


Moving Object Tracking in Video Using MATLAB Moving Object Tracking in Video Using MATLAB Bhavana C. Bendale, Prof. Anil R. Karwankar Abstract In this paper a method is described for tracking moving objects from a sequence of video frame. This method

More information

Document Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T.

Document Image Restoration Using Binary Morphological Filters. Jisheng Liang, Robert M. Haralick. Seattle, Washington Ihsin T. Document Image Restoration Using Binary Morphological Filters Jisheng Liang, Robert M. Haralick University of Washington, Department of Electrical Engineering Seattle, Washington 98195 Ihsin T. Phillips

More information

Detecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution

Detecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution Detecting Salient Contours Using Orientation Energy Distribution The Problem: How Does the Visual System Detect Salient Contours? CPSC 636 Slide12, Spring 212 Yoonsuck Choe Co-work with S. Sarma and H.-C.

More information

Vision-based Mobile Robot Localization and Mapping using Scale-Invariant Features

Vision-based Mobile Robot Localization and Mapping using Scale-Invariant Features Vision-based Mobile Robot Localization and Mapping using Scale-Invariant Features Stephen Se, David Lowe, Jim Little Department of Computer Science University of British Columbia Presented by Adam Bickett

More information

COS Lecture 10 Autonomous Robot Navigation

COS Lecture 10 Autonomous Robot Navigation COS 495 - Lecture 10 Autonomous Robot Navigation Instructor: Chris Clark Semester: Fall 2011 1 Figures courtesy of Siegwart & Nourbakhsh Control Structure Prior Knowledge Operator Commands Localization

More information

Motion and Tracking. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE)

Motion and Tracking. Andrea Torsello DAIS Università Ca Foscari via Torino 155, Mestre (VE) Motion and Tracking Andrea Torsello DAIS Università Ca Foscari via Torino 155, 30172 Mestre (VE) Motion Segmentation Segment the video into multiple coherently moving objects Motion and Perceptual Organization

More information

PERSON FOLLOWING AND MOBILE ROBOT VIA ARM-POSTURE DRIVING USING COLOR AND STEREOVISION. Bogdan Kwolek

PERSON FOLLOWING AND MOBILE ROBOT VIA ARM-POSTURE DRIVING USING COLOR AND STEREOVISION. Bogdan Kwolek PERSON FOLLOWING AND MOBILE ROBOT VIA ARM-POSTURE DRIVING USING COLOR AND STEREOVISION Bogdan Kwolek Rzeszów University of Technology, W. Pola 2, 35-959 Rzeszów, Poland Abstract: This paper describes an

More information

Colorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science.

Colorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science. Professor William Hoff Dept of Electrical Engineering &Computer Science http://inside.mines.edu/~whoff/ 1 Introduction to 2 What is? A process that produces from images of the external world a description

More information

Backpack: Detection of People Carrying Objects Using Silhouettes

Backpack: Detection of People Carrying Objects Using Silhouettes Backpack: Detection of People Carrying Objects Using Silhouettes Ismail Haritaoglu, Ross Cutler, David Harwood and Larry S. Davis Computer Vision Laboratory University of Maryland, College Park, MD 2742

More information

People detection and tracking using stereo vision and color

People detection and tracking using stereo vision and color People detection and tracking using stereo vision and color Rafael Munoz-Salinas, Eugenio Aguirre, Miguel Garcia-Silvente. In Image and Vision Computing Volume 25 Issue 6 (2007) 995-1007. Presented by

More information

HISTOGRAMS OF ORIENTATIO N GRADIENTS

HISTOGRAMS OF ORIENTATIO N GRADIENTS HISTOGRAMS OF ORIENTATIO N GRADIENTS Histograms of Orientation Gradients Objective: object recognition Basic idea Local shape information often well described by the distribution of intensity gradients

More information

Coarse-to-fine image registration

Coarse-to-fine image registration Today we will look at a few important topics in scale space in computer vision, in particular, coarseto-fine approaches, and the SIFT feature descriptor. I will present only the main ideas here to give

More information

Stereo Wrap + Motion. Computer Vision I. CSE252A Lecture 17

Stereo Wrap + Motion. Computer Vision I. CSE252A Lecture 17 Stereo Wrap + Motion CSE252A Lecture 17 Some Issues Ambiguity Window size Window shape Lighting Half occluded regions Problem of Occlusion Stereo Constraints CONSTRAINT BRIEF DESCRIPTION 1-D Epipolar Search

More information