Monocular Multiple People Tracking



EDIC RESEARCH PROPOSAL

Timur Bagautdinov
CVLAB, I&C, EPFL

Proposal submitted to committee: May 1st, 2014; Candidacy exam date: May 7th, 2014; Candidacy exam committee: Prof. Pierre Dillenbourg, Prof. Pascal Fua, Dr. Francois Fleuret, Dr. Pierre Vandergheynst.

Abstract: In this work we present a generic approach to the task of tracking multiple people using a single camera [1], and cover several specific aspects, such as people detection using Hough forests [2] and pose estimation using Laplacian Eigenmaps [3], both being state-of-the-art approaches to the corresponding tasks. Furthermore, we propose a novel probabilistic approach for detecting and tracking people by incorporating motion models and temporal consistency, and describe our related preliminary work, more specifically, an algorithm for people detection in a single depth map.

Index Terms: people tracking, monocular tracking, pose estimation, people detection, motion models

I. INTRODUCTION

Detecting and tracking people in videos is a challenging problem for a number of reasons. Changes in the background and illumination, human body deformation, appearance variability, as well as occlusions and self-occlusions all contribute to its complexity. However, interest in the problem keeps growing, since cameras are becoming cheaper and easier to use, which opens up many potential applications in surveillance, behavior analysis, sports and entertainment. Single-camera setups are even more appealing because they are cheaper and much easier to install and maintain.

In spite of the problems mentioned above, significant progress has been made in the field.
Many state-of-the-art methods follow the tracking-by-detection paradigm [1], [4], meaning that people are detected in individual frames, and those detections are then merged into trajectories. This can be done, for instance, by formulating the task as a global optimization problem on a graph [5], [6].

The problem of detecting people, or, more generally, objects, in static images is one of the key problems in computer vision. In [7], the authors provide a fairly complete survey and an evaluation of state-of-the-art pedestrian detectors. They state that approaches using Histograms of Oriented Gradients (HOG) generally perform better, with support vector machines (SVM) or boosting used as the classification model. Their general conclusion, however, is that the overall performance of current detectors is poor even under favorable conditions, and they specifically note that the detector using motion features performs best. One method not mentioned in the survey is Hough forests [2], a relatively simple yet efficient method that combines Hough transforms with random forests. It is especially interesting for us because, when used in an on-line fashion, it can be applied directly to tracking.

One of the major difficulties in multiple-object tracking is occlusion, and many tracking-by-detection approaches use a full-body detector [7], [2] that does not provide an explicit way to model it. This can lead to poor performance in cluttered scenes. A relatively simple detection technique that handles occlusion naturally is the Probabilistic Occupancy Map (POM) [8].
Even though POM depends strongly on multiple views and lacks discriminative power, a probabilistic model that relates people locations to the image evidence is quite attractive: it is straightforward to interpret, supports occlusion reasoning, plugs into tracking methods [5], [6] and, importantly, provides a principled way of introducing prior knowledge.

Another way to improve the performance of people tracking is to use additional appearance cues, such as faces, skin and clothing colors, as well as depth data. It has been shown [7], [1], [9] that these can significantly improve general tracking performance.

Most of the approaches described above rely on a multiple-camera setup, which makes it easier to disambiguate and reason about the image evidence. In [1], a complete framework is proposed for tracking multiple people from a single, possibly moving camera. The authors formulate the tracking problem as a maximum a-posteriori (MAP) problem and use Reversible-Jump Markov Chain Monte Carlo (RJ-MCMC) to solve it. The approach is rather flexible in that it incorporates various appearance cues, as well as motion priors, but has limited

means to handle occlusions. Another interesting example of handling the monocular case is presented in [11], where occlusions are handled explicitly by using a bank of detectors, each trained for a specific body part, which are then combined as a mixture of experts [12]. The authors use a Hidden Markov Model (HMM) for long-term tracking, which can possibly be improved upon by global optimization methods such as those based on linear programming [5], [6].

Our proposed approach is based on the generative model described in [8], into which we aim to introduce a motion model and an improved appearance model. In the following sections, we will first focus on a generic monocular tracking system described in [1], and then go through promising methods for people detection [2] and pose estimation [3]. Then, we will discuss in more detail how we are planning to improve upon existing methods and also present some preliminary results on our depth-based people detector.

II. GENERAL FRAMEWORK FOR MONOCULAR PEOPLE TRACKING

We start with a description of a general framework for people tracking. In [1] a principled probabilistic approach is presented for tracking multiple objects with a single moving camera. At this point, we are not particularly interested in the moving camera setup, yet the possibility of generalizing the algorithms to support it is quite appealing, since it opens up even more application areas, such as robotics and autonomous driving. In the following sections, we will describe the model and inference procedure in more detail, following [1].

A. Model

Let $I_t$ denote the image evidence at time $t$, $Z_t^i$ the location and velocity of person $i$ at time $t$ in world coordinates, $\Theta_t$ the camera parameters, and $G_t^j$ the location of static scene feature $j$ in world coordinates, where the index $t$ denotes time. Note that these geometric features are necessary to estimate the camera pose.
The tracking problem in [1] is approached in a principled Bayesian way. That is, all the $I_t$ are considered observed random variables, while $Z_t$, $G_t$ and $\Theta_t$ are considered hidden variables, with $\Omega_t = \{Z_t, G_t, \Theta_t\}$ representing the complete hidden state of the scene. The relationship between the random variables is specified as a joint probability distribution, which can be formalized as the following factorized posterior:

\[ p(\Omega_t \mid I_{1:t}) \propto p(I_t \mid \Omega_t) \int p(\Omega_t \mid \Omega_{t-1})\, p(\Omega_{t-1} \mid I_{1:t-1})\, d\Omega_{t-1} \tag{1} \]

where $p(I_t \mid \Omega_t)$ is the observation likelihood, representing the generative model of the image evidence $I_t$ given the values of the state variables $\Omega_t$ at time $t$; $p(\Omega_t \mid \Omega_{t-1})$ is the motion model (prior), which represents how the state evolves over time; and $p(\Omega_{t-1} \mid I_{1:t-1})$ is the posterior at the previous time $t-1$. The optimal value of $\Omega_t$ is then estimated by finding the MAP solution, assuming that the posterior at time $t = 1$ is known. Now, let us consider the observation likelihood and the motion model in more detail, omitting some of the technicalities.

Observation Likelihood: The observation likelihood can be considered a measure of how well a particular value of $\Omega_t$ fits the image evidence $I_t$:

\[ p(I_t \mid \Omega_t) = \prod_i p(I_t \mid Z_t^i, \Theta_t) \prod_j p(I_t \mid G_t^j, \Theta_t) = \prod_i p(I_t \mid f_{\Theta_t}(Z_t^i)) \prod_j p(I_t \mid f_{\Theta_t}(G_t^j)) \tag{2} \]

where $f_{\Theta_t}$ denotes the camera projection function, which assumes either a simple camera model (all objects are on the ground plane) or the pinhole camera model. Note that the likelihood is factorized over $i$ and $j$, where $p(I_t \mid f_{\Theta_t}(Z_t^i))$ is the $i$-th target observation likelihood, and $p(I_t \mid f_{\Theta_t}(G_t^j))$ is the $j$-th feature observation likelihood. The target observation likelihood is modeled as a weighted combination, in the log domain, of the outputs of several detectors:

\[ p(I_t \mid f_{\Theta_t}(Z_t^i)) \propto \exp\Big( \sum_j w_j \log p_j(I_t \mid Z_t^i) \Big) \tag{3} \]

where $w_j$ is the weight of detector $j$.
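The log-linear combination in (3) can be illustrated with a minimal sketch. The detector responses and weights below are hypothetical placeholders chosen for illustration, not values from [1], and the small constant `eps` is an added assumption that keeps the logarithm finite.

```python
import math

def target_log_likelihood(detector_probs, weights, eps=1e-12):
    """Unnormalised log of eq. (3): sum_j w_j * log p_j(I_t | Z_t^i)."""
    if len(detector_probs) != len(weights):
        raise ValueError("need exactly one weight per detector")
    # eps keeps the logarithm finite when a detector reports probability 0
    return sum(w * math.log(max(p, eps))
               for p, w in zip(detector_probs, weights))

def target_likelihood(detector_probs, weights):
    """Unnormalised eq. (3): exp of the weighted sum of log-probabilities."""
    return math.exp(target_log_likelihood(detector_probs, weights))

# Hypothetical responses of three detectors for one target hypothesis
probs = [0.8, 0.6, 0.9]
weights = [0.5, 0.3, 0.2]   # illustrative weights, not values from [1]
score = target_likelihood(probs, weights)
```

When the weights sum to one, the score is a weighted geometric mean of the detector outputs, so a single detector reporting a very low probability pulls the combined score down much more strongly than it would in an arithmetic average.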
In total, seven different detectors are used (when depth cues are available):

- HOG-based full-body detector
- HOG-based upper-body detector
- Face detector
- Skin color detector
- Depth-based shape detector
- Depth-based motion detector
- Target-specific appearance detector

Now let us take a look at $p(I_t \mid f_{\Theta_t}(G_t^j))$, the geometric feature observation likelihood. The likelihood can be interpreted as a generative model of projecting the world coordinates of feature $j$ onto the image and then detecting that projection, possibly corrupted by Gaussian noise, using an interest point detector. It is also possible that a feature cannot be detected, e.g. due to occlusions, which is encoded as a separate uniform component, meant to make the estimation more robust. This model can be formalized as follows:

\[ p(I_t \mid f_{\Theta_t}(G_t^j)) = \begin{cases} \mathcal{N}(\tau_t^j;\, f_{\Theta_t}(G_t^j), \Sigma_G) & \text{if } G_t^j \text{ is valid} \\ K_B & \text{otherwise} \end{cases} \tag{4} \]

where $\tau_t^j$ is the interest point corresponding to the feature $G_t^j$, $\Sigma_G$ is the noise covariance, and $K_B$ is the uniform component.

Motion Model: The motion model represents (smooth) transitions between states $\Omega_t$ over time, assuming that the transitions of the camera, tracked targets and geometric features are a-priori independent:

\[ p(\Omega_t \mid \Omega_{t-1}) = p(\Theta_t \mid \Theta_{t-1})\, p(Z_t \mid Z_{t-1})\, p(G_t \mid G_{t-1}) \tag{5} \]

Camera motion is assumed to be smooth over a short period of time, and so is modeled as a linear dynamic model. The target motion prior $p(Z_t \mid Z_{t-1})$ is defined as a product of two factors:

\[ p(Z_t \mid Z_{t-1}) = p_e(Z_t \mid Z_{t-1})\, p_m(Z_t \mid Z_{t-1}) \tag{6} \]

The first factor $p_e(Z_t \mid Z_{t-1})$ is the existence prior, which specifies the probability of existence of the targets at time

$t$. It is defined as a product of Bernoulli distributions over all the targets $i$. A very interesting part of the model is the factor $p_m$, which encodes how the targets move and possibly interact with each other. The most common assumption for target motion is that targets move independently with constant velocity. The authors claim, however, that modeling interactions can lead to better tracking performance. Thus, an interaction model is introduced as a mixture of a repulsion model and a group model. Such a mixture is meant to encode two common scenarios of people's behavior: they can choose to keep a distance from each other, or, otherwise, move together. It can be formalized as a Markov Random Field (MRF) as follows:

\[ p_m(Z_t \mid Z_{t-1}) = \prod_{a < b} \psi(Z_t^a, Z_t^b;\, \beta_t^{a,b}) \prod_{a < b} p(\beta_t^{a,b} \mid \beta_{t-1}^{a,b}) \prod_{a=1}^{N} p(Z_t^a \mid Z_{t-1}^a) \tag{7} \]

where $\beta_t^{a,b}$ is a hidden variable indicating which component of the mixture is used for a pair of targets $a, b$. The potential $\psi(\cdot)$ is a sigmoid of the distance between the two targets if $\beta_t^{a,b} = 1$ (minimum if they are close), and an exponential of the distance otherwise (maximum if they are close). As for the geometric motion prior, it encodes whether the features are valid or not, and whether they are consistent over time.

B. Tracking

An important part of the approach [1] is the MAP procedure based on RJ-MCMC [13]. The main reason for using sampling-based inference is the complexity of the posterior (1), which makes it hard to apply other methods. Tracking is done by finding the MAP configuration $\hat{\Omega}_t$:

\[ \hat{\Omega}_t = \arg\max_{\Omega_t} p(\Omega_t \mid I_{1:t}) \tag{8} \]

where the posterior at each time $t$ is approximated using a set of samples $\{\Omega_t^r\}_{r=1}^{N}$, with $N$ being the total number of samples. The Metropolis-Hastings algorithm is used to obtain new samples, which means that at each iteration new values of $\Omega_t$ are generated by a proposal distribution, and are rejected or accepted according to the acceptance ratio.
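The accept/reject mechanism described above can be sketched on a toy problem. This is plain Metropolis-Hastings on a one-dimensional target with a symmetric random-walk proposal, not the reversible-jump sampler of [1]; the target density, step size and seed are assumptions made only for illustration.

```python
import math
import random

def log_target(x):
    """Toy unnormalised log-posterior (assumption): Gaussian, mean 2, std 0.5."""
    return -0.5 * ((x - 2.0) / 0.5) ** 2

def metropolis_hastings(log_p, x0, n_samples, step=0.5, seed=0):
    """Draw samples from exp(log_p) and keep track of the best (MAP) sample."""
    rng = random.Random(seed)
    x = x0
    best_x, best_lp = x0, log_p(x0)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian random walk, so the proposal ratio equals 1
        proposal = x + rng.gauss(0.0, step)
        log_a = log_p(proposal) - log_p(x)   # log of the acceptance ratio
        # Accept if the proposal is more likely, else with probability a
        if log_a >= 0 or rng.random() < math.exp(log_a):
            x = proposal
        lp = log_p(x)
        if lp > best_lp:
            best_x, best_lp = x, lp
        samples.append(x)
    return samples, best_x

samples, x_map = metropolis_hastings(log_target, x0=0.0, n_samples=5000)
```

Working with the logarithm of the acceptance ratio avoids numerical underflow when the likelihood terms are products of many small factors, as they are in (2).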
Due to the high dimensionality, sampling from the complete configuration space can lead to very slow convergence, which is why on each iteration only one of the variables $Z_t$, $G_t$ or $\Theta_t$ is randomly selected for an update. Thus, the proposal distribution $Q(\Omega_t^{r+1}; \Omega_t^r)$ incorporates three parts, the target proposal $Q_Z$, the geometric feature proposal $Q_G$ and the camera proposal $Q_\Theta$; each can be perturbed with a corresponding fixed probability, but only one at a time.

The target proposal $Q_Z$ is used to generate a new sample for the target parameters $Z_t^{r+1}$ given the current $Z_t^r$. In order to be able to explore the whole state space, the authors propose the following set of reversible jump moves: stay-leave, add-delete, update and interaction flip, where each move is meant to explore a part of the state space and has a reversible counterpart. More specifically, the stay move means that one of the targets that existed at the previous timestamp, yet is absent from the current sample, should be present in the next accepted sample, and leave is the inverse of that. Add explores the possibility of initiating a new tracking target from the new detections, with delete being its inverse. Update simply proposes a new location for the target (it is its own inverse), and interaction flip changes the interaction mode. Analogously, for the geometric feature proposal distribution $Q_G$, the defined proposal moves are stay-leave and update. The camera proposal distribution $Q_\Theta$ is modeled as a Gaussian with its mean at the value from the previous step.

Once the complete proposal distribution is defined, a new sample $\Omega_t^{r+1}$ can be obtained and accepted or rejected based on the following acceptance ratio:

\[ a = \frac{P(I_t \mid \Omega_t^{r+1})}{P(I_t \mid \Omega_t^{r})} \cdot \frac{P(\Omega_t^{r+1} \mid I_{1:t-1})}{P(\Omega_t^{r} \mid I_{1:t-1})} \cdot \frac{Q(\Omega_t^{r}; \Omega_t^{r+1})}{Q(\Omega_t^{r+1}; \Omega_t^{r})} \tag{9} \]

which is a product of three ratios: between image likelihoods, approximated predictions, and proposal distributions. If $a \geq 1$, the sample $\Omega_t^{r+1}$ is more likely than $\Omega_t^r$, and it should be accepted; otherwise, following the standard Metropolis-Hastings rule, it is accepted with probability $a$.

C. Discussion

In general, the idea of having a complete model that merges most of the tracking aspects in a principled probabilistic framework, including multiple detection cues, motion priors, modeling of people interactions and even camera pose estimation, is very attractive. However, there are still some shortcomings of the proposed approach. The performance presented in [1] indicates that when no depth cues are available the proposed method performs slightly worse than the state-of-the-art. The reason for that could be the absence of explicit occlusion handling, as well as the errors caused by the crude sampling approximation of the posterior distribution. It is also not quite clear how the weights of the different detectors are defined. Within the proposed framework, it might be interesting to see whether incorporating a mixture of detectors for different body parts, as e.g. in [10], can improve the performance.

III. PEOPLE DETECTION USING HOUGH FORESTS

In the previous section, we have described a general probabilistic framework for tracking multiple people in monocular settings, proposed in [1], which relies heavily on the results of HOG-based people detectors. Designing a well-performing detector is a very challenging task on its own, and considering the performance of the state-of-the-art [7], there is still a lot to be done in that field. One of the promising detectors that can be used in such a tracking system is proposed in [2]. In fact, the authors propose a generic approach for object detection, tracking and action recognition - Hough forests. From our perspective, the approach is interesting since it defines both a well-performing detection algorithm and a way to do object tracking over time within a unified framework. In this section, we will describe the basic idea of Hough forests, and how they can be trained and then used for people detection and tracking.

A. Hough Forests

A Hough forest is a specific type of random forest, in which each tree is a mapping from the image appearance features $I$ to probabilistic votes $h$ in a Hough space $H$, the space of all hypotheses for the location of the object center and the class of the object. The detection is then made by applying such a forest to an image of interest, and searching for the maxima in the resulting Hough space representation.

Training: Training a Hough forest requires a set of examples for each possible object class $c \in \{0, 1\}$, given as triples $(I_i, c_i, d_i)$, where $c_i$ are class indicators ($c_i = 0$ for background, $c_i = 1$ for people) and $d_i$ is the displacement of patch $i$ from the center of the particular training sample (set to zero for $c = 0$). Each tree is built from a random subset of the training set in a greedy fashion: starting from the root, the training set is split into two parts by evaluating a binary test, and splitting continues until a termination criterion is met, that is, the maximum depth of the tree or the minimum number of samples is reached. Each leaf $L$ in the constructed tree then contains a number of patches, with the proportion of patches of class $c$ among those being $p(c \mid L)$.

In order to construct the tree, we need to specify how the binary test is selected given a set of training samples $A = \{(I_i, c_i, d_i)\}$. The non-leaf nodes represent a binary split based on the following test, which compares the values of one of the feature channels at two locations on the patch:

\[ t_{f,p,q,\tau}(I) = \begin{cases} 0 & \text{if } I^f(p) < I^f(q) + \tau \\ 1 & \text{otherwise} \end{cases} \tag{10} \]

where $p$ and $q$ are locations on the image patch, $f$ is a feature channel and $\tau$ is an offset. Hough forests aim to minimize two objectives simultaneously: classification uncertainty and displacement (regression) uncertainty. The uncertainty of the class label $c$ is measured as follows:

\[ U_1(A) = -|A| \sum_{c \in \{0,1\}} p(c \mid A) \log p(c \mid A) \tag{11} \]

and the displacement uncertainty is measured as:

\[ U_2(A) = \sum_{(I_i, c_i, d_i) \in A :\, c_i = 1} \| d_i - \bar{d} \|^2 \tag{12} \]

where $\bar{d}$ is the mean of the positive displacement vectors (negatives do not influence the displacement uncertainty measure). Whenever a new split in the tree is to be created, we randomly select which uncertainty measure will be used. Then, among a set of randomly generated binary tests, with the values $f, p, q, \tau$ all selected at random, we select the one that minimizes the selected measure.

Detection: In order to do object detection, one should extract the appearance features $I$ for all patches and pass them through each tree in the Hough forest. Let us consider a patch $P(y) = (I(y), c(y), d(y))$ at a location $y$, where $I(y)$ is the appearance of the patch, and $c(y)$, $d(y)$ are respectively the hidden (unobserved) object class and displacement. The output of a tree $t$ for the given appearance $I(y)$ is a leaf $L_t(y)$, which defines the conditional probability $p(h \mid L_t(y))$ of a hypothesis $h = (c = 1, x)$ of observing an object of class $c = 1$ at a location $x$. Note that in order to handle multiple scales or aspect ratios, the hypothesis space can be naturally extended by adding the corresponding dimensions. The hypothesis conditional can be computed as a product of two factors:

\[ p(h \mid L_t(y)) = p(h \mid c(y) = 1, L_t(y))\, p(c(y) = 1 \mid L_t(y)) = p(d(y) = y - x \mid c(y) = 1, L_t(y))\, p(c(y) = 1 \mid L_t(y)) \tag{13} \]

where the latter factor is just the proportion of positive examples in the leaf $L_t(y)$, and the former can be estimated e.g. using a Parzen window:

\[ p(d(y) = y - x \mid c(y) = 1, L_t(y)) = \frac{1}{|D_{L_t}|} \sum_{d \in D_{L_t}} \mathcal{N}(y - x;\, d, \sigma^2) \tag{14} \]

where $D_{L_t}$ is the set of displacement vectors in the leaf $L_t(y)$, and $\sigma^2$ is the covariance parameter indicating the window width.
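The binary test (10) and the two split-quality measures (11) and (12) can be sketched as follows. The data layout (a patch as nested lists indexed by channel, row and column, and training samples as `(class, displacement)` pairs) and the helper `split_cost` are assumptions made for this illustration, not the representation used in [2].

```python
import math

def binary_test(patch, f, p, q, tau):
    """Eq. (10): compare channel f of the patch at locations p and q."""
    return 0 if patch[f][p[0]][p[1]] < patch[f][q[0]][q[1]] + tau else 1

def class_uncertainty(samples):
    """Eq. (11): U1(A) = -|A| * sum_c p(c|A) log p(c|A)."""
    n = len(samples)
    if n == 0:
        return 0.0
    entropy = 0.0
    for c in (0, 1):
        p = sum(1 for cls, _ in samples if cls == c) / n
        if p > 0:
            entropy -= p * math.log(p)
    return n * entropy

def offset_uncertainty(samples):
    """Eq. (12): squared deviation of positive displacements from their mean."""
    pos = [d for cls, d in samples if cls == 1]
    if not pos:
        return 0.0
    mean = [sum(d[k] for d in pos) / len(pos) for k in range(2)]
    return sum((d[0] - mean[0]) ** 2 + (d[1] - mean[1]) ** 2 for d in pos)

def split_cost(left, right, measure):
    """Cost of a candidate split: the chosen measure summed over both children."""
    return measure(left) + measure(right)

# Tiny illustrative training set: (class, displacement) pairs
mixed = [(0, (0, 0)), (1, (0, 0)), (0, (0, 0)), (1, (2, 0))]
u1 = class_uncertainty(mixed)
u2 = offset_uncertainty(mixed)
```

A candidate test is then scored by routing every sample to the left or right child according to `binary_test` and evaluating `split_cost` with whichever measure was drawn at random for that node.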
Then, summing over all the $T$ trees in the forest and all the given patches, we get the following Hough vote for a hypothesis $h$:

\[ p(h \mid I) \propto \sum_{y} \frac{1}{T} \sum_{t=1}^{T} p(h \mid L_t(y)) \tag{15} \]

Computing these votes for each possible hypothesis $h$ results in a Hough image. Note that the sums do not return true probabilities, yet they are more numerically stable. It is possible, however, to use log-probabilities in order to get a strict probabilistic interpretation. After the Hough image is available, the detection is made by a local maxima search. The local maxima $\hat{h}$ found correspond to the hypotheses for the centers of the possible detections, whereas $p(\hat{h} \mid I)$ measures the confidence of those detections.

Tracking: One of the advantages of Hough forests is their ability to be adapted on-line. This can be particularly useful for appearance-based tracking. Detectors trained off-line try to solve a more general problem than tracking, that is, they search for any object of the class of interest, whereas tracking ultimately requires the detection of a specific target. One can assume that the statistics of the target appearance do not change rapidly over time. Hence, it should be possible to update the statistics in the leaves of the Hough forest such that it is tuned to a specific target. The authors of [2] suggest a fairly straightforward counting scheme. Similarly to estimating the conditional of the object presence hypothesis $h$ in (13), the probability of the hypothesis $h_E$ for the target $E$ can be computed as follows:

\[ p(h_E \mid L(y)) = \sum_{t=1}^{T} p(d(y) = y - x \mid c(y) = 1, L_t(y))\, p(h_E = h \mid c(y) = 1, L_t(y))\, p(c(y) = 1 \mid L_t(y)) \tag{16} \]

where the term $p(h_E = h \mid c(y) = 1, L_t(y))$ is the one that is estimated on-line. Namely, it is the proportion of times a specific entry $d \in D_{L_t}$ of the leaf votes for the target $E$.
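The vote accumulation of (13) to (15) can be sketched on a small grid. As a simplifying assumption, the Gaussian Parzen window of (14) is replaced here by a discrete vote into the single cell $x = y - d$; smoothing the resulting image with a Gaussian would approximate (14). The input layout (each patch paired with one leaf per tree, each leaf given as a positive-class proportion and a list of integer displacements) is hypothetical.

```python
def hough_image(patches, height, width):
    """Accumulate eq. (15) on a grid.

    patches: list of (y, leaves), where y = (row, col) and leaves holds, for
    each tree, a pair (p_pos, displacements) describing the leaf reached.
    """
    votes = [[0.0] * width for _ in range(height)]
    for (y_row, y_col), leaves in patches:
        n_trees = len(leaves)
        for p_pos, displacements in leaves:
            if not displacements:
                continue  # a leaf without positive samples casts no votes
            # Eq. (13): class proportion times uniform weight per displacement,
            # averaged over trees as in eq. (15)
            weight = p_pos / (len(displacements) * n_trees)
            for d_row, d_col in displacements:
                x_row, x_col = y_row - d_row, y_col - d_col  # since d = y - x
                if 0 <= x_row < height and 0 <= x_col < width:
                    votes[x_row][x_col] += weight
    return votes

def strongest_hypothesis(votes):
    """Return the location and value of the global maximum of the Hough image."""
    value, row, col = max((v, r, c)
                          for r, line in enumerate(votes)
                          for c, v in enumerate(line))
    return (row, col), value

# Two patches whose leaves both point to the object center (3, 4)
patches = [((5, 6), [(1.0, [(2, 2)])]),
           ((1, 4), [(0.5, [(-2, 0)])])]
votes = hough_image(patches, height=8, width=8)
center, confidence = strongest_hypothesis(votes)
```

A full detector would search for all local maxima above a threshold rather than only the global one, and would add scale as a further dimension of the vote array.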

B. Discussion

In [2], the authors demonstrate that their approach performs quite well compared to the state-of-the-art on the majority of the evaluated datasets. One of the flaws, mentioned by the authors themselves, is that votes are cast toward the center of the object, which can fail for non-rigid objects or objects with large variability in pose, such as people performing sport activities. Moreover, no direct way is shown to handle occluded instances, which is crucial for people detection in crowded scenes. However, in a more recent work [14] it has been demonstrated that it is possible to introduce explicit occlusion reasoning into the Hough forest framework. It is also worth noting that we have focused on applying the method to people detection and tracking, whereas Hough forests are actually more generic: they can be used for multi-class detection, and can also work on sequences of images to perform action recognition.

IV. POSE ESTIMATION WITH MOTION PRIORS

In the previous sections, we have been looking at a general tracking framework and a discriminative people detector. Our belief is that it is possible to improve the tracking quality with stronger motion models, and, in particular, by introducing a notion of pose. Even a rough hypothesis about the pose of a person might simplify reasoning about occlusions and people location, not to mention that the pose itself is of great interest for a variety of applications. In this section, we will consider a generative model for pose estimation - the Laplacian-Eigenmaps Latent Variable Model (LELVM) [3].

Human poses and motions are high-dimensional, making brute-force optimization practically impossible. But even approximate methods that avoid exhaustive search can still suffer from high dimensionality. This is why dimensionality reduction (DR) methods have been explored for the purpose of pose reconstruction [3], [15].

A. Nonlinear Dimensionality Reduction

The need for nonlinear DR rather than methods like PCA lies in the fact that some data is poorly captured by linear functions. Human motion is highly nonlinear and, as demonstrated e.g. in [3], [15], methods that account for that significantly outperform PCA. There are various ways to perform nonlinear DR. Most of them rely on the idea of a low-dimensional manifold: a smooth, curved subspace of a higher-dimensional Euclidean space, in which it is embedded. One such approach, which lies at the core of the pose estimation method of [3], is Laplacian Eigenmaps (LE).

B. Laplacian Eigenmaps

Given a set of points $\{y_i\}_{i=1}^{N}$, the LE algorithm constructs a graph in which each node corresponds to a point $y_i$, and weighted edges connect nodes that are close to each other, either being nearest neighbors or lying in some $\epsilon$-neighborhood. The weights of those edges $W_{ij}$ can be defined by a Gaussian affinity function, or by setting them to ones and zeros to represent a fixed number of nearest neighbors. The main assumption behind LE is that points that are close to each other should be mapped to points that are close in the low-dimensional space:

\[ \min_{X} \sum_i \sum_j \| x_i - x_j \|^2 W_{ij} \tag{17} \]

Or, in matrix form:

\[ \min_{X} \operatorname{tr}(X L X^T) \tag{18} \]

with $L = D - W$ being the graph Laplacian, and $D$ a diagonal matrix such that $D_{ii} = \sum_j W_{ij}$. Note that in order to avoid the trivial solutions $X = 0$ and $x_1 = \dots = x_N$ of (18), the minimization is done subject to $X D X^T = I$ and $X D \mathbf{1} = 0$.

C. Latent Variable Model

LELVM can be considered a way to define an out-of-sample mapping and a density model for LE. Mapping a new point $y$ to the latent space, $F(y) = x$, can be formulated as follows:

\[ \min_{x} \operatorname{tr}\left( \begin{pmatrix} X & x \end{pmatrix} \begin{pmatrix} L & -K(y) \\ -K(y)^T & \mathbf{1}^T K(y) \end{pmatrix} \begin{pmatrix} X^T \\ x^T \end{pmatrix} \right) \tag{19} \]

where $K(y, y_n) = \exp\left( -\frac{\| y - y_n \|^2}{2\sigma^2} \right)$ if $y_n$ is a nearest neighbor of $y$, and 0 otherwise. The solution of (19) is as follows:

\[ x = \sum_{n=1}^{N} \frac{K(y, y_n)}{\sum_{n'=1}^{N} K(y, y_{n'})}\, x_n \tag{20} \]

The LVM itself is then defined as a joint kernel density estimate (KDE):

\[ p(x, y) = \frac{1}{N} \sum_{n=1}^{N} K_y(y, y_n)\, K_x(x, x_n) \tag{21} \]

where both $K_y$ and $K_x$ are Gaussian kernels. From this follow the dimensionality reduction and reconstruction mappings:

\[ F(y) = \sum_{n=1}^{N} p(n \mid y)\, x_n \tag{22} \]

\[ f(x) = \sum_{n=1}^{N} p(n \mid x)\, y_n \tag{23} \]

where:

\[ p(n \mid x) = \frac{K_x(x, x_n)}{\sum_{n'=1}^{N} K_x(x, x_{n'})} \tag{24} \]

and $p(n \mid y)$ is defined analogously using $K_y$.
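Both mappings (22) and (23) are kernel-weighted averages of the training points, which a short sketch makes concrete. The toy data, the bandwidths and the function names `reduce_dim` and `reconstruct` are assumptions for illustration; the sketch also uses a full Gaussian kernel over all training points rather than a nearest-neighbor-truncated one.

```python
import math

def gaussian_kernel(a, b, sigma):
    """Unnormalised Gaussian kernel between two vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_average(query, inputs, outputs, sigma):
    """sum_n p(n | query) * outputs[n], with p(n | query) as in eq. (24)."""
    weights = [gaussian_kernel(query, u, sigma) for u in inputs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from all training points")
    dim = len(outputs[0])
    return [sum(w * o[k] for w, o in zip(weights, outputs)) / total
            for k in range(dim)]

def reduce_dim(y, Y, X, sigma_y):
    """Eq. (22): map an observed point y to the latent space."""
    return kernel_average(y, Y, X, sigma_y)

def reconstruct(x, X, Y, sigma_x):
    """Eq. (23): map a latent point x back to the observed space."""
    return kernel_average(x, X, Y, sigma_x)

# Toy training pairs: 2-D observations Y with 1-D latent coordinates X
Y = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
X = [(0.0,), (1.0,), (2.0,)]
x_new = reduce_dim((1.0, 0.0), Y, X, sigma_y=0.5)
y_new = reconstruct((0.0,), X, Y, sigma_x=0.1)
```

Because the outputs are convex combinations of training points, both mappings always stay inside the convex hull of the training data, which is part of what makes the model usable as a prior over plausible poses.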

D. Tracking

The tracking framework can be thought of as a simplified version of the one described in [1]:

\[ p(s_t \mid z_{1:t}) \propto p(z_t \mid s_t) \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid z_{1:t-1})\, ds_{t-1} \tag{25} \]

where $s = (x, d)$ are the hidden state variables, with $x \in \mathbb{R}^L$ being the pose representation in the low-dimensional latent space, and $d \in \mathbb{R}^3$ the location of the center of mass; $z$ represents the observed image evidence, that is, the 2D image locations of the pose joints. The observation model $p(z_t \mid s_t)$ is a Gaussian with isotropic covariance, and a mean defined by the following transformation:

\[ z = P(d)\, y = P(d)\, f(x) \tag{26} \]

where $P(d)$ is the perspective projection that shifts each 3D point by $d$. The dynamics model is a product of Gaussian random walks for $x$ and $d$ and the LELVM prior $p(x_t)$:

\[ p(s_t \mid s_{t-1}) \propto p_d(d_t \mid d_{t-1})\, p_x(x_t \mid x_{t-1})\, p(x_t) \tag{27} \]

which means that both the pose and the location are expected to be close to their values at the previous step. Tracking is performed by running inference on the specified model using a particle filter based on a Gaussian mixture distribution and a Sigma-point Kalman filter.

E. Discussion

The authors demonstrated that LELVM can successfully be used as a motion prior for monocular pose estimation. However, it is not clear from the results presented in [3] whether the quality is sufficient for it to be used in unconstrained environments and to be useful as part of a multiple people tracking system. In fact, how to efficiently incorporate the pose estimate into a multiple people tracking system such as [1] or [6] is an open research question.

V. DISCUSSION AND THESIS PROPOSAL

In this section, we will introduce our plan for tackling the problem of monocular people tracking and discuss the preliminary work that has been done. We propose an approach for monocular tracking that is based on POM [8], extended by introducing temporal consistency (a motion model) and by using a more realistic appearance model, such as learned silhouettes.
Note that both of these can be merged together to get a rough estimate of the human body motion, by introducing dynamics between silhouettes, which can be considered weak pose estimates. Even though this estimate might not be very precise, it can significantly help to restrict the original high-dimensional problem of pose estimation.

Unfortunately, in general the position of an object in a single RGB image is scale-ambiguous, and the original background-subtraction-based POM will fail since it relies on multiple views to resolve this. To solve that, we propose detecting the body part whose position is probably the least ambiguous (at least in pedestrian tracking settings): the feet. It is also worth looking at heads, since they are relatively easy to detect compared to other body parts and often also have a very typical location. A discriminative detector such as Hough forests can be used for that purpose, and will probably work more reliably when solving the easier task of estimating the position of a specific body part rather than looking for the whole human body.

Working in single-camera settings brings many difficulties from the beginning, including depth ambiguity, variations in illumination and shadows. This is why we have decided to first try out our idea of temporal consistency and what we call weak pose estimates on depth maps. We are then planning to gradually switch to more complicated settings, that is, to stereo images, and then to single RGB images.

A. Preliminary Work: People Detection in Depth Maps

One of the ways to handle the complexity of human motion is to use additional sensor data, such as the depth acquired by depth sensors (structured-light or time-of-flight), such as Kinect [16]. This has become especially relevant as the number of affordable sensors of this kind is increasing. It was demonstrated in [16] that, with a vast amount of training data, it is possible to get reliable results for pose estimation in single depth maps.

Modeling Depth Maps: Let $X = \{X_1, \dots, X_K\}$ be the individual occupancy variables for each location of the ground plane. The idea behind the original POM algorithm [8] is to model people as rectangles, and then to model the dependency between the observed signal $B$ (binary images, the result of background subtraction) and the artificial signal $A$ (which is a deterministic function of the occupancy variables $X$) as follows:

\[ P(B = b \mid A = a) \propto \exp(-\Psi(b, a)) \tag{28} \]

where $\Psi(b, a)$ is a dissimilarity measure (pseudo-distance) between the realizations of the variables $B$ and $A$, which can also be thought of as $-\log P(B = b \mid X)$, that is, the negative log-likelihood of observing a binary image $b$ given the locations of people $X$ or, in other words, a generative model for binary images. Then, an inference procedure is used to get the posterior distributions over $X$.

Here, we take a very similar approach, but for depth maps rather than binary ones. Let $Z \in \mathbb{R}^{W \times H}$ and $A \in \mathbb{R}^{W \times H}$ be the captured and the artificial depth maps, respectively. Given the values of the occupancy variables $X$ for all locations, the artificial depth map $A$ can be generated as a deterministic function of those, by putting at each pixel the depth of the closest silhouette (if any):

\[ A_{ij} = \min\Big( z_{ij}^{bg},\ \min_{k :\, X_k = 1,\ (i,j) \in S_k} z_k \Big) \tag{29} \]

where $S_k$ stands for the silhouette at location $k$, and $z_k$ is the depth of that silhouette w.r.t. the camera. At this point, in our model each $S_k$ is a projection of a 3D plane. $z_{ij}^{bg}$ denotes the depth of the background. Note that it is also possible that $z_{ij}^{bg} = z_\varnothing$, which is a special no-value, corresponding to

the case when no depth is captured, or when the captured value is very unreliable. Note that we assume that silhouettes are sorted w.r.t. their distance to the camera. Further, we specify the following data likelihood, factorized over pixel locations:

\[ P(Z = z \mid X) = \prod_{(i,j)} P(z_{ij} \mid X) \tag{30} \]

where $P(z_{ij} \mid X)$ is a shorthand for $P(Z_{ij} = z_{ij} \mid X)$. This factorization assumption is, of course, too strong, and is made primarily to make inference easier. When it is known that silhouettes $S_{1:k-1}$ are absent and $S_k$ is present (which is equivalent to $X_{1:k-1} = 0$ and $X_k = 1$), the distribution of $z_{ij}$ is assumed to be the following:

\[ P(z_{ij} \mid X_{1:k-1} = 0, X_k = 1) = \begin{cases} p_\varnothing & \text{if } z_{ij} = z_\varnothing \\ (1 - p_\varnothing)\, \tilde{\mathcal{N}}(z_{ij};\, A_{ij}, \sigma^2, p_{\min}) & \text{otherwise} \end{cases} \tag{31} \]

where $p_\varnothing$ is the (Bernoulli) probability of observing the no-value knowing that $X_k = 1$ (a constant), and $\tilde{\mathcal{N}}$ is a long-tailed distribution, which can be thought of as a clamped Gaussian PDF that does not take values lower than $p_{\min}$. When it is certain that the background depth is being observed ($X_{1:K} = 0$), we again use a distribution $\tilde{\mathcal{N}}$, handling the no-value separately:

\[ P(z_{ij} \mid X_{1:K} = 0) = \begin{cases} p_{bg} & \text{if } z_{ij} = z_\varnothing \\ (1 - p_{bg})\, \tilde{\mathcal{N}}(z_{ij};\, \mu_m, \sigma_m^2, p_{\min}) & \text{otherwise} \end{cases} \tag{32} \]

Inference: We do not provide the complete derivation of the inference equations; in general, we follow the variational inference procedure [17] for our generative model. The final update equation for the posterior estimates $Q(X_k = 1)$ is:

\[ Q(X_k = 1) = \frac{1}{1 + \exp\big( \langle \log P(Z, X) \rangle_{X_k = 0} - \langle \log P(Z, X) \rangle_{X_k = 1} \big)} \tag{33} \]

where $\langle \cdot \rangle$ is the expectation w.r.t. the approximate posterior $\prod_{j \neq k} Q(X_j)$, that is, over all occupancy variables except $X_k$.

Preliminary Results: We have run our detector on several simple samples, and the quality of the detections looks promising, especially keeping in mind the simplicity of the appearance model and its ability to handle occlusions. Sample output of the generative model for the estimated locations is given in Figure 1.
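The generative side of the model, that is, the artificial depth map (29) and the per-pixel likelihood of (30) and (31), can be sketched as follows. The data layout (silhouettes as pairs of a depth and a set of pixels), the use of `None` as the no-value, and all numeric parameters are assumptions for this illustration; for brevity the sketch also applies the same clamped Gaussian to every pixel instead of the separate background distribution (32).

```python
import math

NO_VALUE = None  # stands in for the special "no depth captured" value

def artificial_depth_map(background, silhouettes, occupancy):
    """Eq. (29): per pixel, the depth of the closest present silhouette."""
    A = [row[:] for row in background]  # copy so the background is untouched
    for x_k, (z_k, pixels) in zip(occupancy, silhouettes):
        if x_k != 1:
            continue
        for i, j in pixels:
            if z_k < A[i][j]:
                A[i][j] = z_k
    return A

def clamped_gaussian(z, mean, sigma, p_min):
    """Gaussian PDF that never drops below p_min (long-tailed)."""
    pdf = math.exp(-0.5 * ((z - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return max(pdf, p_min)

def pixel_likelihood(z, a, sigma, p_min, p_no_value):
    """Eq. (31) for one pixel with artificial depth a."""
    if z is NO_VALUE:
        return p_no_value
    return (1.0 - p_no_value) * clamped_gaussian(z, a, sigma, p_min)

def log_likelihood(Z, A, sigma=0.1, p_min=1e-3, p_no_value=0.05):
    """Log of eq. (30): sum of per-pixel log-likelihoods."""
    return sum(math.log(pixel_likelihood(z, a, sigma, p_min, p_no_value))
               for z_row, a_row in zip(Z, A)
               for z, a in zip(z_row, a_row))

# Toy 2x3 scene: flat background at depth 10 and two overlapping silhouettes
background = [[10.0] * 3 for _ in range(2)]
silhouettes = [(4.0, {(0, 0), (0, 1)}), (6.0, {(0, 1), (1, 1)})]
A_both = artificial_depth_map(background, silhouettes, [1, 1])
A_far = artificial_depth_map(background, silhouettes, [0, 1])
```

The clamping by `p_min` bounds the penalty that any single pixel can contribute, so a few badly explained pixels, for instance at silhouette boundaries, cannot dominate the likelihood of an otherwise good configuration.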
However, we have not yet performed a formal performance comparison of our approach against the few existing alternatives. The main reason is that there are few open datasets for people tracking in depth maps, nor are there many purely depth-based people detectors openly available. Note that the original Kinect algorithm [16] for pose estimation is available within the Kinect SDK; however, it can track at most 6 people and is optimized for people facing the camera. Moreover, it cannot be run on arbitrary depth maps, which makes it impossible to evaluate on an arbitrary dataset. It would be interesting to compare to the approach described in [18], one of the very few depth-based people detectors available, since the dataset for that work is available. However, our algorithm is unlikely to provide better results, since their approach strongly relies on a state-of-the-art RGB-based detector, whereas our model makes no use of RGB cues at all. We believe that, once such cues are introduced, our approach can provide competitive results.

B. Preliminary Work: EM for Silhouette Learning

The next step for our depth-based detector is to substitute the crude silhouettes with more realistic ones learned from data. For that, we are working on a weakly supervised expectation-maximization (EM) algorithm: it learns a dictionary of silhouettes from a set of positive examples, that is, depth maps in which exactly one person is present.

C. Future Work

The ultimate goal of our work is to be able to reliably track multiple people in complex crowded environments. As mentioned earlier, we are planning to introduce a notion of motion into the original POM framework and to improve its appearance model by means of learned silhouettes. The motion part means that, instead of estimating the locations of the people individually in each frame, we will jointly estimate the people locations in k subsequent frames.
Then, for a given timestamp $t$ and the image evidence $I_{1:t}$, we will seek an estimate of the posterior $p(X_{t-k}, \ldots, X_t \mid I_{1:t})$, where $X_t$ now represents not only the ground plane location, but also the silhouette that better explains the image of the person. These location-silhouette detections can then be used quite naturally, e.g., in the linear programming framework for tracking [9] over longer time periods. It should also be possible to use these silhouette cues to reduce the search space for complete pose estimation, such as the one in [3].


Fig. 1: Output of the generative model for the detections. Top row shows the output of the generative model based on the inferred people locations. Bottom row shows the input depth maps.

REFERENCES

[1] W. Choi, C. Pantofaru, and S. Savarese, "A general framework for tracking multiple people from a moving camera," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 7.
[2] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky, "Hough forests for object detection, tracking, and action recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 11.
[3] Z. Lu, M. A. Carreira-Perpiñán, and C. Sminchisescu, "People tracking with the Laplacian eigenmaps latent variable model," in NIPS, vol. 20, 2007.
[4] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," in Computer Vision and Pattern Recognition, CVPR IEEE Conference on. IEEE, 2008.
[5] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, "Multiple object tracking using k-shortest paths optimization," IEEE Transactions on Pattern Analysis and Machine Intelligence.
[6] H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua, "Multi-commodity network flow for tracking multiple people."
[7] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 4.
[8] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multicamera people tracking with a probabilistic occupancy map," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 2.
[9] H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua, "Tracking multiple people under global appearance constraints," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[10] C. Wojek, S. Walk, S. Roth, and B. Schiele, "Monocular 3D scene understanding with explicit occlusion reasoning," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
[11] C. Wojek, S. Walk, S. Roth, K. Schindler, and B. Schiele, "Monocular visual scene understanding: Understanding multi-object traffic scenes," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 4.
[12] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1.
[13] Z. Khan, T. Balch, and F. Dellaert, "MCMC-based particle filtering for tracking a variable number of interacting targets," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 11.
[14] T. Wang, X. He, and N. Barnes, "Learning structured Hough voting for joint object detection and occlusion reasoning," in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013.
[15] R. Urtasun, D. J. Fleet, and P. Fua, "3D people tracking with Gaussian process dynamical models," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 1. IEEE, 2006.
[16] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, "Real-time human pose recognition in parts from single depth images," Communications of the ACM, vol. 56, no. 1.
[17] J. Winn and C. M. Bishop, "Variational message passing," Journal of Machine Learning Research, vol. 6, no. 1, p. 661.
[18] M. Munaro and E. Menegatti, "Fast RGB-D people tracking for service robots," Autonomous Robots, pp. 1–16, 2014.
