Kinematic tracking and activity recognition using motion primitives


Germán González

Kinematic tracking and activity recognition using motion primitives

Master Thesis, January 27, 2006

Master thesis at the Department of Computer and Systems Sciences at the Royal Institute of Technology (corresponding to 20 full-time working weeks).

Abstract. In recent years there has been increasing interest in monocular human tracking and activity recognition systems, due to the large number of applications where those capabilities can be used. Standard algorithms are impractical for human tracking because of the computational cost arising from the high number of degrees of freedom of the human body and from the ambiguity of the images obtained from a single camera. Constraints on the configuration of the human body can be used to reduce this complexity. The constraints can be deduced from demonstration, based on human performance of different activities. A human tracking system is developed using such constraints and then evaluated. Because the constraints are based on activities, the tracker can also infer, while tracking, the activity the human is performing.

Contents

1 Introduction: Background; Problem; Objective; Method; Purpose; Disposition; Acknowledgment
2 Background knowledge: Looking at people; 3D approaches to human tracking; Approaches to activity recognition; Particle filtering; Motion modeling; Figure-background segmentation
3 Experimental setup: System overview; Human Model; Image acquisition; Positioning system; Human pose estimator (Motion primitives; Particle Filter; Likelihood function); Activity recognition; Hardware and software

4 Results: Overview; Motion primitives accuracy; Human tracking system (Particle filter performance; Tracking of fast movements; Different perspectives; Whole body movements); Activity recognition system
5 Conclusion: Reflections; Future work
A Dimension Reduction of BVH files
B Silhouette extraction
C Human model

List of Figures

2.1 Condensation algorithm
Screenshot of the system
Block diagram of the program
Human model
Silhouettes and 3D reconstruction
Positioning system
Motion primitive
Resampling of particles
Bending cone
Distance transform
Validity of the motion primitives
Ground truth vs. particle filter
Fast movements
Different viewpoint
Whole body movement
Motion primitives probability density function
Relative position progress
Relative position progress of punching action


1 Introduction

1.1 Background

Human tracking and activity recognition are receiving increasing attention among computer scientists due to the wide spectrum of applications where they can be used, ranging from athletic performance analysis to video surveillance. By human tracking we refer to the ability of a computer to recover the position and orientation of the limbs of a human from a sequence of images. There have been several different approaches to allow computers to automatically derive the kinematic pose (the joint angles of the human) and the human activity from image sequences. A review is presented in Chapter 2. We will focus on monocular (i.e. single-camera) tracking and activity recognition systems. Due to the complexity of the problem space and the ambiguity of the images obtained from the camera, standard algorithms are extremely computationally expensive. One approach to reduce the problem space and make the problem computationally tractable is to impose constraints on the positions of the human body. Constraints can be based on temporal information, camera configuration, joint angles, or any combination of these. Camera configuration constraints are usually expressed by making assumptions about the relative position of the subject with respect to the camera. Temporal constraints reflect the fact that a human being can only move up to a certain speed; given a certain configuration, the human can therefore only reach a subset of all possible configurations in the next time step. Joint angle constraints restrict the configurations of the human body, either because of its physical nature or for other reasons (e.g. the activity performed or the domain of the application). This last kind of constraint can be modeled as manifolds in the high-dimensional space of the joint angles of the human body. These manifolds can be learned from human movement analysis, as demonstrated by Jenkins and Matarić (2004), and represent motion primitives, a general representation of human action.

Given an appropriate transformation, the learned manifold can be described as a subspace of the joint angle space, thus reducing the dimensionality of the problem and making algorithms such as particle filters appropriate for solving it. Furthermore, the motion primitives allow us to predict the next state of the human body in the joint angle space, making them suitable as priors in the context of particle filtering. Several researchers have recently used motion constraints successfully to derive the human pose from images, for example Elgammal and Lee (2004), Ong, Hilton, and Micilotta (2005), and Urtasun, Fleet, Hertzmann, and Fua (2005). These approaches are geared towards one single activity, whereas we will present a system able to cope with several competing human activities. Activity recognition has usually been considered a higher-level task: once accurate tracking and recovery of the joint angles of the human in the image sequence are available, the activity is derived with techniques such as template matching or state-space approaches. An alternative to these approaches is presented here. Activity recognition is performed at the same time as the recovery of the body structure. Each activity is described by a set of constraints in the joint angle space; the set of constraints that produces the most accurate tracking is the one that best describes the action performed.

1.2 Problem

Human pose recovery from images is a difficult problem due to the high dimensionality of the human body. The coarsest human models require 28 dimensions, and when accuracy is desired the number of dimensions increases up to one hundred. Furthermore, the information provided by an image sequence obtained from a single camera is usually ambiguous or incomplete. Ambiguity comes from the fact that several poses can result in the same silhouette due to self-occlusion or perspective. The second problem tackled is the inference of the activity the human is performing in front of the camera. The set of activities that a human being can perform is enormous. To reduce the complexity of the problem, an activity vocabulary is defined and loaded into the system. A framework for deciding which activity is being performed is therefore necessary.

1.3 Objective

We pursue two objectives. The first is to demonstrate that motion constraints based on activities, like those resulting from the work of Jenkins and Matarić (2004), can be used for human tracking when the human is performing the activities described by those constraints.

The second objective is to perform human tracking using several competing motion constraints while deciding which of the activities described by the motion primitives the human is performing.

1.4 Method

In order to fulfill the goals described above, a software application was developed. A deductive approach is followed to demonstrate with empirical data the hypothesis that kinematic tracking and activity recognition can be performed using the Jenkins and Matarić (2004) motion primitives. The software engineering was done in the AI Lab of the Computer Science Department of Brown University in Providence, USA, under the constant supervision of Prof. Odest Chadwicke Jenkins. Regular meetings were held every week to discuss the state of the work and the directions to follow. The requirements that Brown established for the code were: use a particle filter over the motion primitives resulting from the work of Jenkins and Matarić (2004) to perform kinematic tracking and activity recognition; make the system as fast as possible, aiming at a real-time application; and make the system able to track long video sequences. Before any coding or software development, a literature review was carried out in order to establish the state of the art of human tracking systems, activity recognition systems, and the applications of the particle filter algorithm. The most relevant articles are listed in the References section. Most of the articles were suggested by Prof. Jenkins, but some of them come from searches in the Xplore search engine of the Institute of Electrical and Electronics Engineers (2005) and the digital library of the Association for Computing Machinery (2006). The software methodology followed was top-down programming. The main structure was designed first and then divided into small pieces that could be easily coded. Nevertheless, due to the experimental nature of the project and the lack of tight requirements, it was necessary to change the structure several times before arriving at a complete system. The software was developed collaboratively by Matthew Loper, a master's student at Brown, and the author of this thesis. Both of us completed and modified each other's code, making it impossible to attribute

individual authorship. The engineering ideas, such as the design of a likelihood function, were also shared: brainstorming became a common technique, and we built on each other's ideas. The system was evaluated afterwards to assess its performance. The evaluation covered two different areas: accuracy of the joint angle estimation and accuracy of the activity recognition. The evaluations were done over video sequences captured for this project. A paper describing this project and its results has been submitted to the Computer Vision and Pattern Recognition 2006 conference and is still pending approval.

1.5 Purpose

Two objectives are pursued, each with different applications: human tracking and activity recognition. The system developed could be used in both fields. The recovery of the configuration of the human body can be useful in the following areas:

Athletic applications: The accurate derivation of the position and movement of human limbs helps explain why some athletes perform better than others. Such systems can be used to improve the performance of athletes. They can also be used in automatic personal training systems, as a method to evaluate how well the person in front of the camera is performing the indicated exercises.

Animation: Character animation is a difficult task. Current approaches use motion capture systems to gather information on human movement and transfer it to avatars. Usually those systems are based on markers. With the system proposed, there is no need for markers, allowing a more natural setup for the actor. Also, by using known configurations of the human body, the system can correct defects in the actor's performance.

Applications of an entirely different kind arise from an activity recognition system. They are all based on the same idea, the automatic semantic analysis of video sequences. Some examples of these applications are:

Video surveillance: Surveillance systems can be designed in which subjects performing a suspicious activity (e.g., punching another human, stealing) are highlighted, thus helping the human behind the monitor notice that something is going wrong.

Proactive services: In modern housing there is often a video surveillance system. The images of that video could be used to determine the activities the inhabitants are performing, and to provide services according to those activities. One relevant example would be systems that take care of elderly people living alone: if they detect an activity such as falling down or not moving, they could call the medical service for help. In these applications, the computer is able to sense the real world and react to its changes, adapting the running programs and services accordingly.

Human robot/computer interaction: Personal robots are becoming a reality (Sony Aibo, iRobot Roomba, Sony QRIO), interacting with non-technical people. With an interface based on motion primitives, it is not the user who learns how to use the robot, but the robot that knows how to interact with the human. Here the thesis overlaps with artificial intelligence and context-aware computing, allowing computers or robots to have a semantic representation of the images they sense from the world.

Video indexing: With current computers, video can be easily stored and retrieved, even through the internet. Describing videos textually is a tedious and repetitive task, but very useful for searching over their contents. If an activity recognition system is run over a video sequence, it can automatically produce a textual representation, thus allowing search over its contents. Given the state of the art of video analysis tools this may seem utopian, but if the domain of the video sequence is known (e.g. a soccer match), the set of activities and video features is reduced, making the problem simpler and more tractable.

Most of these applications involve knowing what a person is doing, and it could be possible to identify the human in the image. This can result in a loss of privacy. How to use that information in an ethical way, and who has access to it, are issues to take into account when developing any proactive service or video surveillance system using this technology.

1.6 Disposition

This thesis is divided into five chapters with the following disposition:

Chapter 1. Contains a brief introduction, the problems faced in this work, the methodology followed and the purpose of the thesis.

Chapter 2. Contains a brief overview of current human tracking and activity recognition systems. The concepts needed to understand the system developed are presented here.

Chapter 3. Presents the system developed to perform human tracking and activity recognition using the motion primitives of the work of Jenkins and Matarić (2004).

Chapter 4. Contains the evaluation and performance of the system presented in Chapter 3 with regard to its two uses: human tracking and activity recognition.

Chapter 5. The conclusions, reflections and future work arising from this thesis are presented here.

The PDF of this thesis, videos demonstrating the performance of the system, and code can be downloaded from its web page, x04- ggo/msc/. A DVD containing the databases used for this project is available upon request.

1.7 Acknowledgment

This work is part of a research project at Brown University, Rhode Island, U.S.A., under the supervision of Prof. Odest Chadwicke Jenkins. I would like to thank him for his guidance and advice. The coding was done in collaboration with Matthew Loper, a master's student at the same institution; I would like to thank him for all the help with the programming. And, of course, thanks to Magnus Boman for his supervision and guidance at DSV. Special mention goes to my parents, Germán and Rosario, who have always supported me and given me the best gift that can be given to a son. Without their support and education I would never have been able to be where I am now. And thanks to my brother Bruno for cheering me up in difficult moments.

2 Background knowledge

2.1 Looking at people

The automatic processing of video sequences involving humans is a wide topic that can be approached from several perspectives. Perhaps the most important classification criterion is the focus of attention of the analysis: it can be on the face and expressiveness of the actors in front of the camera, on hand gestures, or on the movement of the whole body. We will focus on the last. According to Gavrila (1999), systems that perform whole-body tracking can be classified by several parameters: the human model representation, the dimensionality of the tracking space, the sensor modality, the sensor multiplicity, sensor placement and sensor mobility. From now on we will focus only on monocular tracking systems, i.e. systems that use only one camera. Another difference among systems that analyze images involving whole-body human movement is whether they use a human model or only image features. According to Moeslund and Granum (2001), to make a computer understand video sequences involving human action, four problems should be addressed:

Initialization: Finding the initial pose of the human body and the model that fits the person.

Kinematic tracking: Establishing coherent relationships of the subject and/or its limbs along several frames. For accurate tracking, it is important to segment the image into the pixels that correspond to the subject and the background pixels.

Pose estimation: Obtaining the configuration of the human body in the images and how it changes over time.

Activity recognition: Finding what activities the humans in the images are performing.

These problems are usually interrelated and can be solved at the same time. Nevertheless, current work usually covers only one or two of the problems mentioned, making assumptions about or ignoring the others. There have been several different approaches to automatically performing segmentation, kinematic tracking and activity recognition, varying in the complexity of the techniques used. Some of the techniques are:

Use of image features: The images are analyzed according to the movement or position of their pixels.

Use of 2D human models: A 2D human model is used to help with the connection of image features.

Use of 3D human models: The system recovers the 3D configuration of the human performing in front of the camera.

An example of a system that uses image features for activity recognition can be found in the work of Niu and Abdel-Mottaleb (2004). There, the authors find the silhouette of the human in the image and compare it with a database of shapes that define an activity, finding the nearest neighbor. No model of the human body is used. There is no recovery of the pose of the human body, but the system is able to discriminate among different activities with the use of Hidden Markov Models. This work recovers the activity the human is performing without recovering the body configuration. Another example is the work of Min and Kasturi (2004), where the optical flow of the hands and legs of a ballet dancer is used to create motion trajectories and derive the movements the ballerina is performing in front of the camera. The use of a 2D human model in image analysis involving human activity can be found, for instance, in the work of Ramanan, Forsyth, and Zisserman (2005). There the authors track long sequences of human movement using color information of the limbs of the actors in the images, learned in previous steps. A 2D human model is used to provide global constraints for the tracking and connection of the limbs and to perform a hierarchical search, finding first the torso of the actors and then limiting the region of the image in which to search for the rest of the limbs. This work focuses on finding where a person is in the frame. We will focus on systems that follow the third approach, the use of a 3D human model to recover the position and orientation of the human limbs from the images. These kinds of systems are reviewed in the next section.

2.2 3D approaches to human tracking

3D approaches to human tracking deal with the recovery of the 3D position and orientation of the limbs of the actor from a sequence of 2D images. The two main problems this approach has to deal with are the high dimensionality of the human body and the ambiguity of the information obtained from the camera images due to baggy clothing or self-occlusion. The problem of ambiguity of the images can be partially solved by using several cameras situated in different places: with several simultaneous viewpoints, self-occlusion present in one image need not be present in the others. There have been several approaches to dealing with the high dimensionality of the human body. One option is to localize the limbs of the human body independently, assuming that its configuration can be partitioned into the configurations of individual independent limbs. Another option is to constrain the possible configurations of the human body, reducing the problem space. As an example of a 3D approach to human tracking, Ren, Shakhnarovich, Hodgins, Pfister, and Viola (2005) use information from silhouettes obtained from three cameras to recover the kinematic pose of the actor, searching through a set of previously learned silhouettes and their corresponding poses. In order to perform a quick search of the database of images, machine learning techniques are used to derive easy-to-compute, discriminative image features of the silhouettes. Another approach to recovering the 3D pose is to render figures with a known configuration and compare them with the images obtained from the camera. The figure that obtains the best score in the comparison is the one that best describes the pose of the human in the image. New problems arise in this approach:

Dimensionality of the space: The coarsest description of the human body involves 28 dimensions. Even this dimensionality makes the search computationally intractable for current computers and standard algorithms.

Evaluation: The figures are rendered from parameters in the joint angle space, and the projection of the rendered figure is compared to the image obtained from the camera. Two points very close in the joint angle space can produce very different images. How to find a comparison function between images that reflects the distance in the joint angle space is an open question.

Positioning of the human in the image: A figure with a joint angle configuration different from the one in the image can produce a good match if it is positioned in the right place, and vice versa. There are also issues

with the ambiguity of the images and with the mirror symmetry of the human body.

Accuracy of the model: Another important issue is how well the human model used to describe the image fits the actor. Factors such as size, weight, complexion, clothing and gender drastically modify the shape of human beings even under the same joint angle configuration. A lack of accuracy in the model distorts the comparison between the rendered figure and the image obtained, degrading the performance of the algorithm. A review of different human models is presented in Aggarwal and Cai (1999).

To perform the search in the high-dimensional joint angle space, several approaches have been followed. Perhaps one of the most important is the use of particle filtering, but a particle filter by itself is computationally very expensive due to the high dimensionality of the space. The concept of particle filtering and its applications will be reviewed below. If the activity the human is performing is known, then the search can be constrained to the set of configurations that describe that action. Several approaches to describing the actions have been followed by Urtasun, Fleet, Hertzmann, and Fua (2005), Ong, Hilton, and Micilotta (2005), Elgammal and Lee (2004), and Jenkins and Matarić (2004). This issue will be covered in Section 2.5.

2.3 Approaches to activity recognition

Activity recognition is usually considered a higher-level task built on the successful tracking of the human being (Aggarwal and Cai 1999). It is usually done either with template matching or with state-space techniques. Template matching is used in 2D approaches, for instance by analyzing the optical flow of some limbs of the human being, as Min and Kasturi (2004) do. State-space approaches define each pose or set of poses as a state. The states are connected by links with probabilities, and the system recognizes the activity by analyzing the sequence of states that the actor visits. One advantage of this approach over template matching is that the state-space is not sensitive to the speed of performance, as each state can be connected to itself. An example of the use of these techniques can be found in the work of Green and Guan (2004), where the authors use a Hidden Markov Model to perform activity recognition. Recently, another approach has been followed by several authors, such as Elgammal and Lee (2004) and Jenkins and Matarić (2004). There the activities

are described as manifolds in the space of joint angles of the human body. The same activities are used to perform kinematic tracking of the human in the image. By analyzing which manifold the actor is following, the activity can be derived. This approach is followed in the paper that has arisen from this thesis, submitted to the 2006 Computer Vision and Pattern Recognition conference and pending approval.

2.4 Particle filtering

Particle filtering, or the condensation algorithm, was presented in the work of Isard and Blake (1998) as an extension of the Kalman filter to handle simultaneous alternative hypotheses and non-linear systems. A Kalman filter is a well-known algorithm that estimates the state of a process x(t), governed by a linear stochastic equation, from a measurement z(t). The algorithm assumes that the relationships between x(t) and x(t-1), and between z(t) and x(t), are linear with Gaussian white noise. Due to its similarity with the particle filter algorithm, we will only describe the operation of the latter; for a further explanation of the Kalman filter the reader is referred to Welch and Bishop (1995). Given a set of observations Z_t and a set of states in a problem space X_t, particle filtering is an algorithm created to approximate the conditional probability density function p(x_t | Z_t) when computing that function directly is costly or impossible. It is based on the idea of sequential importance sampling, where random samples with associated weights are used to represent the function and compute its estimates. Some assumptions must be made in order to apply particle filtering to a problem:

The problem consists of a set of ordered observations Z_t = {z_t, z_{t-1}, ..., z_0}. The observations have an unknown correspondence with a problem space.

The state of the problem x_t can be represented as a set of points in an n-dimensional space. Each of those points is called a particle and consists not only of its position, but also of a weight that measures how well that particle describes the observation of that time step, z_t. Each particle represents a hypothesis of the configuration of the system that produced the observation z_t. We define the history of the state of the system as X_t = {x_t, x_{t-1}, x_{t-2}, ..., x_0}.

The state of the system at time step t+1 has a relationship with the state of the system at the previous time step t. This is the same as saying that there is a defined probability density function p(x_t | X_{t-1}). Furthermore, that relationship forms a Markov chain: p(x_t | X_{t-1}) = p(x_t | x_{t-1}).

The observations are independent, both mutually and with respect to the dynamical process.

There is a likelihood function p(z_t | x_t = p_i). This function compares the state of the system described by a point in the space, the particle p_i, with the observation at that time step, z_t. Ideally the function gives a score that represents the distance between p_i and the position in the problem space that best represents the observation z_t.

The state of the system that corresponds to the observation z_t can be derived from the positions of the particles through any operator, such as taking the particle with the highest likelihood, the mean of the particle positions, or any other operation. The mathematical formula to derive the state of the system given the previous state and the observation of the problem is the propagation formula (2.1):

p(x_t | Z_t) = k p(z_t | x_t) \int_{x_{t-1}} p(x_t | x_{t-1}) p(x_{t-1} | Z_{t-1}) dx_{t-1}    (2.1)

We now describe how to use this formula step by step; Figure 2.1 is a graphical representation of the algorithm and will help the reader understand it. At time step 0 the particles are distributed uniformly over the problem space, covering all of it. This is done because there is no information about the state of the system in the previous time step, so all of the problem space has the same probability of representing the observation. The particles are then weighted according to the observation z_0 and the likelihood function. After that, for each time step:

1. The system takes the particle distribution and weights from the previous time step, p(x_{t-1} | Z_{t-1}). This is the posterior of the previous time step.

2. A new set of particles is generated by sampling the previous one. Particles with higher weights are sampled more often, so some particles are repeated in the new set.

3. The particles are moved according to the probability density function p(x_t | x_{t-1}); this is the drift step in Figure 2.1. It is important that p(x_t | x_{t-1}) is not a deterministic function, so that particles starting from the same position end up in different places (the diffusion step in Figure 2.1). Once moved, the particle distribution is called the prior and represents the prediction of the system that can be made from the previous time step, p(x_t | Z_{t-1}).

4. The particles are weighted according to how well the state of the problem they represent resembles the observation. This is done through the use of the likelihood function p(z_t | x_t = p_i).

In Figure 2.1 the weights are represented by the size of the circles in the measure step; each circle represents a particle. This new distribution of particles is called the posterior and represents the hypotheses of the state of the system for the current time step. Deductions about the state of the system can be made by measuring the distribution and weights of the particles.

5. The algorithm loops. The positions and weights of the particles are kept and used as the prior for the next time step.

Fig. 2.1. Activities performed on the particles in one time step. Illustration taken from Isard and Blake (1998).

For a detailed description of particle filters the reader is referred to Arulampalam, Maskell, Gordon, and Clapp (2002) and Isard and Blake (1998). The direct application of particle filters to human tracking systems usually fails due to the high dimensionality of the human body and to the difficulty of finding an accurate likelihood function p(z_t | x_t = p_i). Deutscher, Blake, and Reid (2000) address this problem using a set of increasingly sharp comparison functions instead of a single likelihood function.
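To make the resample-drift-diffuse-measure cycle above concrete, the following is a minimal sketch in Python with NumPy. The thesis system itself was written in C++; the toy 1D dynamics and the function names here are illustrative assumptions, not the thesis code.

```python
import numpy as np

def particle_filter_step(particles, weights, drift, diffuse, likelihood, z):
    """One resample-drift-diffuse-measure cycle of the condensation algorithm.

    particles: (N, d) array of hypotheses in the problem space.
    weights:   (N,) normalized weights from the previous time step.
    drift/diffuse: deterministic and stochastic parts of p(x_t | x_{t-1}).
    likelihood: function implementing p(z_t | x_t = p_i).
    """
    n = len(particles)
    # Resample: high-weight particles are duplicated, low-weight ones die out.
    idx = np.random.choice(n, size=n, p=weights)
    particles = particles[idx]
    # Predict: move each particle with the (non-deterministic) dynamics.
    particles = diffuse(drift(particles))
    # Measure: weight each hypothesis against the observation z, then normalize.
    weights = np.array([likelihood(z, p) for p in particles])
    weights /= weights.sum()
    return particles, weights

# Toy example: x_t = x_{t-1} + 1 + noise, observed with Gaussian noise.
rng = np.random.default_rng(0)
particles = rng.uniform(0.0, 100.0, size=(200, 1))   # uniform initialization
weights = np.full(200, 1.0 / 200)
particles, weights = particle_filter_step(
    particles, weights,
    drift=lambda p: p + 1.0,
    diffuse=lambda p: p + rng.normal(0.0, 1.0, p.shape),
    likelihood=lambda z, p: np.exp(-0.5 * (z - p[0]) ** 2),
    z=5.0)
estimate = particles[np.argmax(weights)]             # best-particle estimate
```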

2.5 Motion modeling

By motion model, or motion primitive, we refer to the description of a given human activity. Given a known state, this description can predict the next state of the configuration of the human body while the actor is performing the activity described. Two main problems arise when creating motion models:

Generality: how to generate motion models that account for a general human action.

Activities: which motions should be modeled.

One approach to the generation of motion models is to use the domain knowledge of a human expert to develop them. Nevertheless, this method is prone to errors due to the difficulty of generating the models and to the subjectivity of the experts. Another approach, which has recently received more attention, is to generate them automatically with machine learning techniques from motion capture data; the motion primitives are learned from human performance. This approach is followed by Jenkins and Matarić (2004), who segment a continuous stream of data containing several underlying activities into several motion primitives. Other authors use an extensive database of motion capture data to learn the motion primitives, as in the work of Elgammal and Lee (2004). Meanwhile, Urtasun, Fleet, Hertzmann, and Fua (2005) use only one performance example to derive the underlying motion model. After creating the motion model, some authors condense the data into fewer dimensions using linear or non-linear dimension reduction techniques. This is possible because most activities can be described by the movement of a few human limbs; punching, for instance, involves a small torso rotation and the movement of the arms.

2.6 Figure-background segmentation

Most systems designed to perform kinematic tracking or activity recognition need some method to determine which pixels of the image correspond to the human figure and which correspond to the background. According to Moeslund and Granum (2001), this segmentation of the image can be based on temporal or spatial information. The use of temporal information is based on the fact that the background of video sequences is usually constant or varies slowly with time, so all the movement that appears in the images is due to the foreground. Two different approaches to segmentation based on temporal information can be

followed: subtraction and flow. The first works by obtaining an image of the background and subtracting it from the image obtained from the camera: pixels with values close to zero belong to the background, while the remaining pixels belong to the foreground. Flow approaches instead use the coherent temporal movement of pixels between frames; an example is the use of optical flow in Min and Kasturi (2004). The use of spatial data for segmentation also has two variants. The first, perhaps the simplest, applies a threshold to the image: if the subject is wearing clothes of a different color or intensity than the background, the image can be segmented accurately. Statistical approaches use the characteristics of pixels or groups of pixels, such as color, the size of regions with the same intensity, etc., to perform segmentation. Note that spatial segmentation is done in the color space of the pixels and not in their (x, y) coordinates. Temporal and spatial approaches can be combined. For instance, Ramanan, Forsyth, and Zisserman (2005) use a statistical approach to find the color description of the limbs of the human in the image and then use a 2D human model to perform temporal correlation between frames.


3 Experimental setup

3.1 System overview

In order to test the approach to kinematic tracking and activity recognition described in Chapter 1, a computer program was developed. The program uses the motion primitives described in the work of Jenkins and Matarić (2004) as constraints in the space of the joint angles of the human body. Those motion primitives model human actions. The program is able to track a human moving in front of the camera if the motion primitive that describes that action has been loaded. Furthermore, the program is able to decide which of the loaded motion primitives is being used to perform the tracking, therefore performing activity recognition. A screenshot of the program can be seen in Figure 3.1. The program is composed of various subsystems, each of which solves a particular task for a time step using information from other subsystems in the previous time step. The subsystems are: image acquisition, global positioning, kinematic pose estimation and activity recognition. A block diagram of their connections is depicted in Figure 3.2. The system works as follows:

Image acquisition: The video sequences recorded for testing the program are split into frames. For each of them, the silhouette of the actor is extracted using image segmentation techniques and stored in a sequence of ASCII files.

Global positioning: The position of the hips in the silhouette is obtained using the previous time step's information (if any) about the position of the hips and the best estimate of the joint angles of the silhouette under each motion primitive.

Kinematic pose estimation: The system keeps several hypotheses of the joint angles of the human in the previous time step.

Fig. 3.1. Human pose estimator and activity recognition system. On the left we can see the image obtained from the camera with the computer's interpretation of the kinematic pose superimposed. In the top left corner, the silhouette obtained from the image is drawn; this is the information the computer uses to derive the kinematic pose of the actor. On the right of the image, the best estimate of the pose for each motion primitive is drawn, as well as the first two dimensions of the latent space of each motion primitive and the positions of the particles in that space.

Fig. 3.2. Block diagram of the program developed to test how different motion primitives can be used in parallel to perform human tracking and activity recognition.

From these hypotheses, it derives the next set of hypotheses according to the motion models present in the system. The new hypotheses are weighted according to the similarity between the silhouette they produce and the silhouette obtained from the camera. All this processing is done in the context of particle filtering.

Activity recognition: The activity is derived from the weight of the best hypothesis of each motion primitive and the temporal coherence among those weights.

In the next sections we take a closer look at these subsystems and the elements that make them work.

3.2 Human Model

Even though the human model is not a subsystem of the program, it is a key element in all the subsystems; it is therefore explained here, before entering into subsystem details. The human model is used to render human silhouettes from a joint angle description. Those silhouettes are compared afterwards with the silhouette obtained from the camera in order to infer whether the joint angle description matches the joint angle configuration of the actor's performance. Issues of adaptability of the human model appear here, as the same pose can result in significantly different silhouettes depending on the model. Usually a human model consists of two parts: a skeleton and flesh covering it. For the skeleton we used the model described in the motion primitives of Jenkins and Matarić (2004). Each motion primitive is described in a Biovision format file, which contains both the motion data and the skeleton of the human model to which that data corresponds. A skeleton representation is given in Appendix C. For the flesh, we rendered cylinders and spheres covering that skeleton using the OpenGL library (Woo 1997). The radii of the cylinders depend on the part of the human body they represent. The human model is drawn in Figure 3.3.

Fig. 3.3. Human model used in the system.

This model favors simplicity and generality. It has been tested in the complete system with three subjects of different body builds, and the system performed the tracking. Its refinement and evaluation are left for future work. As pointed out in Green and Guan (2004), the success of a tracking system depends on the accuracy with which the model represents the human. The problem of the human model can be thought of as an initialization problem, or as an adaptation problem where the human model is adapted to fit the actor while the tracking is performed. Those two approaches are described extensively in Moeslund and Granum (2001).

3.3 Image acquisition

The images were acquired using a Fire-i firewire digital camera (Unibrain) and the software that comes with it by default, BTV Pro Carbon (Bensoftware). The hardware platform used to record the videos was an Apple iBook G4, chosen for its mobility, so we were able to record videos in different scenarios and positions. All the automatic features of the camera were turned off in order to facilitate background subtraction. The frame rate was set to 15 fps. The resolution of the camera was variable. For each trial two videos were recorded, the first with only background images and the second with the actor's performance. The videos were recorded in QuickTime movie format (.mov) and split afterwards into image sequences in bitmap format (.bmp) using the BTV software. After that they were processed offline in Matlab to perform background subtraction. The silhouette is obtained with the following procedure:

The background of the video sequence is generated by finding the average color of each pixel among the frames of the recorded background video.

For each frame:

Save the frame in the PPM format, as it is the easiest format to read in a C/C++ program.

The difference between each pixel of the image and the pixel in the same position in the background is calculated over the Red, Green and Blue dimensions according to the following formula:

d(x, y) = \sum_{RGB} (p(x, y) - b(x, y))^2    (3.1)

where p(x, y) corresponds to the pixels of the image, b(x, y) to the pixels of the background, and d(x, y) is the difference image.

The difference image is thresholded; the value of the threshold is determined empirically for each video sequence. This yields a noisy image of the silhouette, t(x, y):

t(x, y) = 1 if d(x, y) > threshold, 0 otherwise    (3.2)

A two-dimensional median filter is applied to the thresholded image t(x, y) in order to reduce the salt-and-pepper noise. A median filter is an operation that assigns to each pixel the median value over a neighborhood of that pixel (including the pixel itself). The result of that operation is the silhouette s(x, y). The silhouette is stored in a file in ASCII format.
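As an illustration of the procedure above, here is a minimal sketch in Python with NumPy and SciPy. The thesis performed this step offline in Matlab; the function name and the 3x3 filter size are assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

def extract_silhouette(frame, background, threshold):
    """frame, background: (H, W, 3) float RGB images; returns binary (H, W)."""
    # Squared color difference to the background, summed over R, G, B (Eq. 3.1).
    d = ((frame - background) ** 2).sum(axis=2)
    # Threshold into a noisy binary silhouette (Eq. 3.2).
    t = (d > threshold).astype(np.uint8)
    # Median filter to suppress salt-and-pepper noise.
    return median_filter(t, size=3)

# The background is the per-pixel mean over the background-only video frames:
# background = np.mean(np.stack(bg_frames, axis=0), axis=0)
```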

It is important for the overall performance of the system that the silhouette is extracted accurately and with as little noise as possible, because noise distorts the distance transform of the silhouette (see Section 3.5.3), affecting the convergence of the particle filter algorithm.

Fig. 3.4. Sequence of silhouettes extracted from the camera and the pose reconstruction using the human model. The second row shows the reconstruction of the pose from a different point of view.

3.4 Positioning system

Knowing where the hips are in the silhouette is of great importance, as this point is the center of coordinates used for rendering the guessed silhouette. A bad positioning will result in a displacement of the rendered silhouettes, and the likelihood function (see Section 3.5.3) will produce a poor score even if the rendered silhouette is similar to the one obtained with the camera. The positioning system finds the position of the hips in two different ways, depending on the time step of the system. In the first frame, the center of mass of the silhouette is found, and the position of the hips is placed at that point plus an offset determined empirically. The position of the hips is calculated according to the following formulas:

x̂ = (\sum_{x=1}^{ResX} \sum_{y=1}^{ResY} x s(x, y)) / (ResX ResY) + x̄    (3.3)

ŷ = (\sum_{x=1}^{ResX} \sum_{y=1}^{ResY} y s(x, y)) / (ResX ResY) + ȳ    (3.4)

where ResX stands for the horizontal resolution of the camera, ResY for the vertical one, and x̄, ȳ are the empirical offsets for those two dimensions.
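A short sketch of this first-frame positioning (Python with NumPy, illustrative). Note that, following the formulas of Eqs. 3.3-3.4, the sums are normalized by the image area ResX·ResY rather than by the silhouette area, with the empirical offsets absorbing the difference.

```python
import numpy as np

def initial_hip_position(s, x_offset, y_offset):
    """s: binary silhouette of shape (ResY, ResX); offsets are empirical."""
    ys, xs = np.indices(s.shape)               # pixel coordinate grids
    area = s.shape[0] * s.shape[1]             # ResX * ResY
    x_hat = (xs * s).sum() / area + x_offset   # Eq. 3.3
    y_hat = (ys * s).sum() / area + y_offset   # Eq. 3.4
    return x_hat, y_hat
```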

In the following time steps, the best estimate of the joint angles in the previous time step is used to render a silhouette. From now on, we will call that rendered silhouette the guess, to prevent confusion with the silhouette obtained from the camera. The guess is rendered at the position obtained in the previous time step, then translated and scaled randomly and compared to the silhouette from the camera. The translation/scale that gives the best score determines the coordinates of the hips of the human in the image. We are accepting an error here: from frame to frame the silhouette of the actor stretches and bends, so using the joint angles from the previous time step will produce a guess different from the silhouette. Nevertheless, if the movements of the actor are not fast, the guess and the silhouette will be similar enough to make the method appropriate. A solution to this problem would be to iterate between the positioning system and the kinematic pose estimation for each frame. The comparison function used between the guess and the silhouette is:

C(I, G) = (\sum_{x=0}^{ResX} \sum_{y=0}^{ResY} I(x, y) ∧ G(x, y)) / (\sum_{x=0}^{ResX} \sum_{y=0}^{ResY} I(x, y) ∨ G(x, y))    (3.5)

where I represents the pixels of the silhouette, G the pixels of the guess, and C(I, G) is the comparison score.

3.5 Human pose estimator

The human pose estimator is responsible for obtaining the kinematic pose of a silhouette given the position of its hips. It is composed of a battery of particle filters, each of them operating in the space described by one motion primitive. The number of motion primitives is variable, limited only by the speed of the computer. Each particle filter has access to the silhouette provided by the image acquisition system and produces a silhouette with its best estimate of the joint angle configuration in the image. The particle filter whose particle produces the silhouette that most resembles the image is chosen; that particle represents the best estimate the system can make of the joint angle configuration of the human in the image.

3.5.1 Motion primitives

The motion primitives used in this system were provided by Jenkins and Matarić (2004). They are stored in Biovision format and include both the description of the human action represented in the joint angle space and the description of the skeleton of the human model associated with that joint angle space.

Fig. 3.5. This image represents the battery of particle filters. Each of them produces a silhouette that best explains the image using its motion primitive. The activity recognition system chooses among those silhouettes the one that most resembles the image obtained from the camera and calculates the probability of each activity.

Each motion primitive contains about 23,000 points in the joint angle space; those points describe the activity. The joint angle space has 96 dimensions, and the data is stored as a 23000x96 matrix, D. The points are grouped in trajectories, each consisting of a sequence of one hundred connected points. The points are not equally distributed in the space; they form a manifold, so dimension reduction techniques can be used to simplify the description of the data. We chose Singular Value Decomposition (SVD) to perform dimension reduction due to its linearity and the simplicity of transforming from the reduced, or latent, space to the original one. SVD is a mathematical technique used to decompose a matrix into three other matrices:

D_{n×p} = U_{n×n} S_{n×p} V^T_{p×p}    (3.6)

where the columns of U are the description of the points of the motion primitive in a new space; S is diagonal and contains the singular values of D, which have the property of being ordered decreasingly, s_{ii} > s_{jj} for all i < j; and V^T can be thought of as a matrix that describes the relation between the space described by D and the space of U. The economy-size SVD is a slightly different version of SVD, where the dimensions of U are reduced to allow faster computation. The decomposition of D is:

D_{n×p} = U_{n×p} S_{p×p} V^T_{p×p}    (3.7)

Dimension reduction can be performed on the matrix U. As the singular values in S are ordered decreasingly, the columns (dimensions) of U are also ordered in decreasing order of relevance, and keeping only the first few produces a good approximation to the points of the problem. In fact, for our problem, keeping between 8 and 12 dimensions out of 96 preserves 90% of the variance of the motion primitive. The matrix U is then cropped, keeping only the number of columns needed to preserve 90% of the variance. This matrix, once reduced, represents the latent space where the particles of the particle filter algorithm reside. We chose SVD for dimension reduction due to its linearity and simplicity: the relationship between the latent space and the original space is easy and fast to compute, as it is only two matrix multiplications. Nevertheless, other non-linear techniques could describe the manifold more accurately with fewer dimensions. For each motion primitive the following files are generated (as an example, we will call the motion primitive punch.bvh):

punch.bvh.reduced: Represents the positions of the points of the motion primitive in the latent space. It is the matrix U after being reduced to a few dimensions.

punch.bvh.motionmodel: Represents the normalized vectors that connect the points of the motion primitive. Each vector corresponds to one point in the latent space. Figure 3.6 shows the first three dimensions of the points of a motion primitive and the vectors connecting them.

punch.bvh.v: Represents the matrix V of the SVD. It is used to compute the correspondence between the latent space and the joint angle space.

punch.bvh.s: Contains the values of the S matrix of the SVD.
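A sketch of this dimension reduction step in Python with NumPy (the thesis used Matlab offline; the input file name is a hypothetical placeholder):

```python
import numpy as np

D = np.loadtxt("punch.bvh.data")                   # hypothetical 23000x96 matrix
U, s, Vt = np.linalg.svd(D, full_matrices=False)   # economy-size SVD (Eq. 3.7)

# Keep the leading columns of U that preserve 90% of the variance
# (between 8 and 12 dimensions out of 96 for these primitives).
variance = (s ** 2) / (s ** 2).sum()
k = int(np.searchsorted(np.cumsum(variance), 0.90)) + 1

latent = U[:, :k]                                  # contents of punch.bvh.reduced
# Mapping latent points back to the 96-D joint angle space is just two
# matrix multiplications, which is why SVD was chosen:
joints = latent @ np.diag(s[:k]) @ Vt[:k, :]
```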

Fig. 3.6. Graph of the first three dimensions of a motion primitive after SVD decomposition. Note how the points are connected, forming trajectories.

3.5.2 Particle Filter

A particle filter (see Section 2.4) is instantiated for each motion primitive loaded into the human pose estimation system. The particles reside in the latent space generated for that motion primitive. In each time step, resampling, movement and weighting of the particles are performed. The resampling of the particles is done with probabilities according to the weights of the particles in the previous time step: the accumulated probability distribution function of the particles is generated, and then, if N is the number of particles, N random numbers are generated between zero and one. The particle whose interval contains a given random number is copied into the new set of particles; if another random number falls in the same interval, the particle is copied again. Figure 3.7 shows a 1D example of the resampling procedure. The particles are moved according to the trajectories of the motion primitives described in Jenkins and Matarić (2004). In order to deal with the non-linearity of those trajectories, a bending cone probability distribution is used. For each particle, this distribution is formed by finding the point of the manifold that is closest to the particle. The trajectory to which that point belongs is then used as the axis of the bending cone; that axis is composed of a set of connected vectors. A random number, r, is generated from a Gaussian probability density function, and the particle is moved forward along that axis by as many vectors as the number indicates. The variance of the function used to generate that number depends on the video being analyzed: a large variance helps predict fast movements but produces coarse tracking, while a small variance produces fine tracking but fails with fast movements. In our code, the random number is generated from a Gaussian probability density function with variance twenty and mean zero, truncating the outcome to an integer. To prevent overshooting, we subtract an offset from the number obtained:

r = ⌊N(0, 20)⌋ - 3    (3.8)

Fig. 3.7. The image on the left represents the probability density function of the particles. In the image on the right, the accumulated probability function of the particles is represented by the staircase function, and the random numbers generated for the sampling are drawn as red diamonds. From each of them, a line parallel to the x-axis is drawn to indicate in which interval of the accumulated probability function it falls. Finally, the green dots represent the new distribution of particles. Note that, as two random numbers belong to the interval [0.5, 0.9], there are two particles at position eight on the x-axis.

After moving the particle, cylindrical noise is added to approximate the bending cone distribution. The axis of the cylinder is the vector on which the particle is situated, and its width is proportional to the random number generated previously. The cylinder is generated by finding an orthonormal basis of the hyperspace perpendicular to the vector that forms the axis of the cylinder, which we will call b_0. This is done by applying the Gram-Schmidt orthonormalization method to find a basis B that represents the space of the primitive using b_0 as its first vector. The vectors [b_1 ... b_N] represent the basis of the hyperspace perpendicular to b_0. Then Gaussian noise is added along all the vectors of that basis, with variance proportional to the previously generated random number r, to create radial noise. Uniform noise is added in the direction of b_0, with length equal to the distance between the point of the manifold to which the axis corresponds and the next point in the trajectory. This process generates a piecewise approximation to the bending cone distribution. If a particle p has nearest point p_i, where i is the position of the point in its trajectory, the new position of the particle, p', is:

p' = p_{i+r} + U[0, d(p_{i+r}, p_{i+r+1})] b_0 + \sum_{k=1}^{N} N(0, αr) b_k    (3.9)
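The following sketch illustrates this particle movement step (Python with NumPy, illustrative). A QR decomposition is used in place of the explicit Gram-Schmidt procedure to obtain the orthonormal basis, and alpha is the radial-noise gain α of Eq. 3.9; the boundary handling is an assumption.

```python
import numpy as np

def move_particle(p, trajectory, alpha, rng):
    """p: particle in latent space; trajectory: (T, d) ordered points of the
    trajectory containing the manifold point nearest to p."""
    i = int(np.argmin(np.linalg.norm(trajectory - p, axis=1)))   # nearest point
    r = max(int(rng.normal(0.0, np.sqrt(20.0))) - 3, 0)          # Eq. 3.8
    j = min(i + r, len(trajectory) - 2)                          # advance r steps
    b0 = trajectory[j + 1] - trajectory[j]                       # local axis
    d = np.linalg.norm(b0)
    # Orthonormal basis with b0/|b0| as first vector (QR instead of Gram-Schmidt).
    M = np.eye(len(p))
    M[:, 0] = b0 / d
    Q, _ = np.linalg.qr(M)
    if Q[:, 0] @ b0 < 0:                                         # fix QR sign flip
        Q = -Q
    axial = rng.uniform(0.0, d) * Q[:, 0]                        # U[0, d] along b0
    radial = Q[:, 1:] @ rng.normal(0.0, alpha * max(r, 1), len(p) - 1)
    return trajectory[j] + axial + radial                        # Eq. 3.9
```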

Fig. 3.8. Representation of the bending cone distribution. The origin of the cone is the initial position of the particle. In green there is a linear movement of the particle, which produces a bad prediction. The black dot represents the particle moved along the axis, p_{i+r}, prior to the addition of the noise. From that point two blue vectors are drawn, representing the radial and axial noise. Finally, the red vector represents the noise vector and the red dot the final position of the particle.

3.5.3 Likelihood function

To weight a particle of the particle filter, a silhouette is generated from the joint angles that the particle represents (we will call it the guess) and compared to the silhouette obtained from the camera using a function that measures their similarity, the likelihood function. Both the guess and the silhouette obtained from the camera are binary images: in each there is a set of points P belonging to the silhouette, assigned the value 1, and a set of points Q with no silhouette, with value 0. In order to explain how the likelihood function works, the concept of the distance transform must be introduced. The distance transform of a binary image assigns to each point q that does not belong to the silhouette (i.e., q ∈ Q) a value that represents the minimum distance to any point belonging to the set P. Mathematically:

DT(q) = min_{p ∈ P} d(q, p)    (3.10)

where d(q, p) is any measure of the distance between two points in the image, such as the Euclidean distance or the Manhattan distance; in our program we used the latter. The distance transform of a binary image can be seen as a grayscale image, where the intensity of a pixel grows with its distance to the silhouette.
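A sketch of the Manhattan distance transform via the classic two-pass chamfer sweep (Python with NumPy, illustrative; scipy.ndimage.distance_transform_cdt applied to the inverted image computes the same city-block transform):

```python
import numpy as np

def manhattan_dt(silhouette):
    """silhouette: binary (H, W) array; returns DT(q) = min over p in P of d(q, p).
    Equivalent to scipy.ndimage.distance_transform_cdt(silhouette == 0,
    metric='taxicab')."""
    h, w = silhouette.shape
    big = h + w                                  # upper bound on any distance
    dt = np.where(silhouette > 0, 0, big).astype(np.int32)
    for y in range(h):                           # forward pass: top-left sweep
        for x in range(w):
            if y > 0: dt[y, x] = min(dt[y, x], dt[y - 1, x] + 1)
            if x > 0: dt[y, x] = min(dt[y, x], dt[y, x - 1] + 1)
    for y in range(h - 1, -1, -1):               # backward pass: bottom-right sweep
        for x in range(w - 1, -1, -1):
            if y < h - 1: dt[y, x] = min(dt[y, x], dt[y + 1, x] + 1)
            if x < w - 1: dt[y, x] = min(dt[y, x], dt[y, x + 1] + 1)
    return dt
```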

For a further explanation of distance transforms the reader is referred to Marchand-Maillet and Sharaiha (2000).

Fig. 3.9. Distance transform of a silhouette. The image on the left is the binary silhouette. The image on the right is the distance transform of that silhouette; the color indicates the value of the distance transform, according to the scale on the right.

Another concept used in the comparison between images is the dot product. When the images are grayscale or binary, the dot product between them can be defined as:

dot(I, S) = \sum_{x=0}^{ResX} \sum_{y=0}^{ResY} I(x, y) S(x, y)    (3.11)

Finally, the likelihood function is described by the following formula:

Lk = 1 / (dot(I, DT(S)) + dot(DT(I), S) + ε)    (3.12)

where dot represents the dot product between images, DT is the distance transform of a binary image, and ε is a constant introduced to eliminate divide-by-zero errors.
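Putting Eqs. 3.11 and 3.12 together, here is a minimal sketch of the likelihood computation (Python with NumPy, reusing manhattan_dt from the sketch above; the epsilon value is an illustrative choice):

```python
def likelihood(I, G, eps=1e-6):
    """I: camera silhouette, G: rendered guess; binary arrays of equal shape.
    Higher score means more similar silhouettes (Eq. 3.12)."""
    # dot(A, B) = sum over all pixels of A(x, y) * B(x, y)   (Eq. 3.11)
    mismatch = (I * manhattan_dt(G)).sum() + (manhattan_dt(I) * G).sum()
    return 1.0 / (mismatch + eps)
```

The two cross terms penalize both guess pixels far from the camera silhouette and silhouette pixels far from the guess, so a perfect overlap drives the mismatch toward zero and the likelihood toward its maximum.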

3.6 Activity recognition

There are two ways of performing activity recognition with the system. The first is based on the weight, or likelihood, of the best particle of each motion model, while the second is based on the progression of that particle along the trajectories described in the motion primitive.

In the first approach, we generate for each frame a probability distribution over all the motion primitives in the system. This probability density function is created by normalizing the likelihood of the best particle of each motion primitive over all the motion primitives, according to the following formula:

p(B_i[t] \mid z[t]) = \frac{p(z[t] \mid x_i[t])}{\sum_{B} p(z[t] \mid x_i[t])}    (3.13)

where B_i represents the motion primitives. As the probabilities for each motion primitive are very noisy, a low-pass filter is applied before the normalization. The motion primitive whose likelihood is maximal is then selected as the activity being performed.

For the second approach, we use the information about how the particle moves along the trajectories defined in the motion primitives. In an ideal case, the best particle of each motion primitive would follow one of those trajectories when the person is performing the action defined in that primitive. That is usually not the case, for several reasons: the difference between the movements of the actor and the movements used to learn the motion primitive, and errors due to baggy clothing, self-occlusion, etc. Even with these difficulties, the trajectory of the best particle should start at the beginning of one trajectory and finish at the end of another. Thus, by measuring the relative position of the best particle within the trajectories of the motion primitive at each frame, we can perform activity recognition. If the relative position of the best particle in the motion trajectories increases monotonically and covers most of the relative space of the motion primitive, then we can assume that the action described by the motion primitive is being performed. This method of activity recognition was not completely implemented in the system, but promising preliminary results were obtained.
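A minimal sketch of the first recognition approach, assuming the best-particle likelihoods are collected into one array per frame and using a simple exponential filter as the low-pass stage (both the filter type and its constant are our assumptions):

```python
import numpy as np

def activity_posterior(best_likelihoods, smoothed, beta=0.8):
    """Per-frame activity distribution over motion primitives (eq. 3.13).

    best_likelihoods : (M,) likelihood of the best particle of each of
                       the M motion primitives at the current frame
    smoothed         : (M,) low-pass filter state from the previous frame
    beta             : smoothing factor of the exponential low-pass filter
    Returns the updated filter state and the normalized posterior
    p(B_i[t] | z[t]); the recognized activity is its argmax.
    """
    smoothed = beta * smoothed + (1.0 - beta) * np.asarray(best_likelihoods)
    posterior = smoothed / np.sum(smoothed)
    return smoothed, posterior

# Usage per frame:
#   state, post = activity_posterior(lk_per_primitive, state)
#   activity = int(np.argmax(post))
```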

3.7 Hardware and software

To create the program described above, we used the infrastructure available at the AI Lab of the Computer Science department of Brown University. The hardware platform used to develop and run the program was an AMD personal computer with 1 GB of RAM and an Nvidia 6200 graphics card, running the Debian distribution of Linux. The programming language varied according to the function of the software: for processing the motion primitives and performing background subtraction on the images obtained from the camera we used Matlab, owing to its simplicity and to the fact that these were offline processes; for the rest of the program we used C++, as it produces code with better performance than Matlab.

4 Results

4.1 Overview

The system described in the previous chapter performs two actions simultaneously: deriving the activity the human is performing, and finding the limb configuration (joint angles) that best describes the silhouette obtained from the camera, using the information from the motion primitives. Several questions arise regarding the evaluation of the system:

1. Accuracy of the motion primitives. Are the motion primitives representative of the video sequence?
2. Performance of the particle filter algorithm. Does the particle filter find the point on the motion primitive that produces the silhouette most similar to the one obtained from the camera?
3. Performance of the activity recognition system. Is it possible to derive the activity the person is performing from the tracking system? Under which conditions?
4. Performance of the human tracking. Does the recovered limb configuration represent the limb configuration of the person?

To answer these questions we applied the system described in Chapter 3, or variations of it, to several video sequences. The actors were the developers of the system and collaborators performing the motions described by the motion primitives. To know which actions were associated with those motion primitives, a program to play back BVH models was developed. The video sequences recorded to test the system are:

mattnew.mov The actor performs, sequentially and without interruption, several circles with the hands, followed by vertical waving of the right hand and horizontal waving of the same hand. The camera is positioned in front of the actor.

matttop.mov The actor performs horizontal hand waving. The camera is placed above the actor. This video was recorded to test that the system can work with different camera orientations.

germanfast.mov The actor raises his arm as fast as he can. Recorded to test how the system reacts to fast movements; the whole action lasts only five frames.

germanpunch.mov The actor performs a punch. In this activity the actor uses the whole body instead of only some limbs.

4.2 Motion primitives accuracy

In order to evaluate whether the motion primitives are able to describe the movement recorded in the video sequence, we performed the following test: using the mattnew video, the positioning system is disabled and the position of the hips of the actor in the images is determined manually. Under these conditions, an exhaustive search is performed over all the points of the motion primitive to find the point that produces the silhouette most similar to the one of each frame (a schematic of this search is sketched below). The search is performed by rendering the silhouette associated with each point of the motion primitive and comparing it with the silhouette obtained from the camera, using the procedure described in equation 3.5. We used this comparison function instead of the one described in equation 3.12 because it is much faster and provides similar results when the limbs of the silhouettes overlap, as happens in this problem. The point found is stored. The primitives used for the search are the ones describing the activity the actor performs at each time: first the one describing hand circles, then the one for vertical hand waving, and finally the one for horizontal hand waving.

In an ideal case, the stored point would follow a trajectory in the motion primitive, starting at the beginning of the trajectory and finishing at its end. Nevertheless, the video used to test the system is different from the ones used to learn the activities. As every human moves in a different way, we permit the particle to jump from trajectory to trajectory if they are close enough, and a smaller coverage of the trajectory is also accepted. The results are drawn in Figure 4.1 for an instance of hand circles. As can be seen in the image on the left, the trajectory followed by the best point (in red) in the latent space of the motion primitive is quite similar to the trajectories described by the motion primitive (in blue). The red line is formed by segments because the number of frames in which there is tracking is only 15 (the action lasts one second), instead of the one hundred points that make up a trajectory in the motion primitive.
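The exhaustive search referenced above can be sketched as follows. Here render_silhouette is a hypothetical stand-in for the system's 3D-model renderer, and since equation 3.5 is not reproduced in this section, a plain silhouette-overlap count takes its place as the similarity measure.

```python
import numpy as np

def exhaustive_search(primitive_points, observed, render_silhouette):
    """Find the motion-primitive point whose rendered silhouette best
    matches the observed one (the ground-truth test of Section 4.2).

    primitive_points  : iterable of latent-space points of the primitive
    observed          : binary silhouette obtained from the camera
    render_silhouette : hypothetical renderer mapping a latent point
                        to a binary silhouette of the human model
    """
    best_point, best_score = None, -np.inf
    for point in primitive_points:
        guess = render_silhouette(point)
        # Stand-in similarity: count of overlapping silhouette pixels
        score = np.sum(guess * observed)
        if score > best_score:
            best_point, best_score = point, score
    return best_point
```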

Fig. 4.1. The image on the left shows the trajectory followed by the point that best describes the image during the performance of the action described in the motion primitive (in red), together with the trajectories of the motion primitive (in blue). The figure on the right shows the progression of the relative position of the best point along the trajectory; ideally it would be a straight line, and in practice it is an approximation of one.

The mattnew sequence consists of 800 frames. It took approximately ten minutes to derive the best point for each frame. The outcome of the test was superimposed on the images from the camera, as shown in Figure 3.1. Visually, the tracking of the human limbs is good. The results can be seen in a video on the web page of the thesis.

4.3 Human tracking system

The human tracking system is in charge of recovering the joint angles of the actor performing in front of the camera, using motion primitives. We made several experiments to test different aspects of it, such as the performance of the particle filter, how it handled movements performed at different speeds, and how it worked from different camera positions.

4.3.1 Particle filter performance

The system described in Chapter 3 evaluates only a small set of points in each motion primitive to determine the position of the human limbs. Is the outcome of this system similar to the best description that can be obtained from the manifold for the video sequence? To answer this question, we ran the system under the same conditions as in the previous experiment (positioning system disabled, fixed hips), but using particle filters with different numbers of particles. The output of the tracking system (the derived joint-angle configuration of the actor) is compared with the results of Section 4.2, as they represent the most accurate tracking that can be done of the video sequence using that motion primitive.

The results are depicted in Figure 4.2.

Fig. 4.2. In the figure on the left, the trajectories of the best particle for systems with four, ten, 128, and 1000 particles are compared with the results of the exhaustive search of Section 4.2. The figure on the right shows the distance to the ground truth between the trajectory of Section 4.2 in the latent space and the trajectories followed by the particle filters, plotted against log2 of the number of particles.

The results show that the difference between the ground-truth trajectory in the latent space and the trajectory followed by the best estimate of the particle filter decreases with the number of particles used. Nevertheless, this error is approximately constant when using more than ten particles. Therefore, we can assume that by using only ten particles per motion primitive we obtain a good approximation of the most accurate tracking that can be done with that motion primitive. Note that other authors, such as Elgammal and Lee (2004), use 200 particles to perform the tracking.

This test also gives insight into the computational cost of using a particle filter instead of a brute-force search. With ten particles and six different motion primitives, the system runs at a rate of one frame per second, whereas a brute-force search takes 45 minutes per frame. We believe that a better implementation of the system could speed it up to real-time computation. The most expensive function in the code is the calculation of the distance transform, which is done entirely on the CPU. With modern programmable graphics cards, this function could be implemented on the Graphics Processing Unit using computer graphics methods; given the speed of the GPU and its capability for working with images, this would be faster than computing it on the CPU.

4.3.2 Tracking of fast movements

We applied the system to the germanfast video sequence, in which the activity is performed in only five frames. Visually, the system is able to superimpose the skeleton over the captured images.

The results are given in Figure 4.3.

Fig. 4.3. Image sequence of an activity performed in only five frames (0.33 s), with the interpretation of the system superimposed.

4.3.3 Different perspectives

By design, the system should be able to track human movement from different camera perspectives, as long as the position of the camera relative to the human is indicated to the program. This property is due to the fact that we use a 3D human model to render the silhouettes; we can therefore render them from any viewpoint of the virtual camera. We used the matttop sequence to test this property. The system is able to derive the kinematic pose of the actor, as can be seen in Figure 4.4.

Fig. 4.4. The system is robust with respect to the position of the camera relative to the actor. In this image we can see how the system is able to track human movement in an image sequence taken from above the actor's head. The first row shows the images obtained from the camera, the second row the extracted silhouettes, and the third and fourth rows the system's pose estimation from different points of view.
