Monocular Multiple People Tracking

EDIC RESEARCH PROPOSAL

Timur Bagautdinov
CVLAB, I&C, EPFL

Abstract: In this work we present a generic approach to tracking multiple people with a single camera [1] and cover several specific aspects of the task, namely people detection using Hough forests [2] and pose estimation using Laplacian Eigenmaps [3], both state-of-the-art approaches to their respective tasks. Furthermore, we propose a novel probabilistic approach for detecting and tracking people that incorporates motion models and temporal consistency, and describe our related preliminary work, specifically an algorithm for people detection in a single depth map.

Index Terms: people tracking, monocular tracking, pose estimation, people detection, motion models

Proposal submitted to committee: May 1st, 2014; Candidacy exam date: May 7th, 2014; Candidacy exam committee: Prof. Pierre Dillenbourg, Prof. Pascal Fua, Dr. Francois Fleuret, Dr. Pierre Vandergheynst.

This research plan has been approved: Date: Doctoral candidate: (name and signature) Thesis director: (name and signature) Thesis co-director: (if applicable) (name and signature) Doct. prog. director: (B. Falsafi) (signature) EDIC-ru/

I. INTRODUCTION

Detecting and tracking people in videos is challenging for a number of reasons. Changes in the background and illumination, human body deformation, appearance variability, as well as occlusions and self-occlusions all contribute to its complexity. Nevertheless, interest in the problem keeps growing: cameras are becoming cheaper and easier to use, which opens up many potential applications in surveillance, behavior analysis, sports and entertainment. Single-camera setups are even more appealing because they are cheaper and much easier to install and maintain.

In spite of these difficulties, significant progress has been made in the field.
Many state-of-the-art methods follow the tracking-by-detection paradigm [1], [4]: people are detected in individual frames, and the detections are then linked into trajectories. This can be done, for instance, by formulating the linking as a global optimization problem on a graph [5], [6].

The problem of detecting people or, more generally, objects in static images is one of the key problems in computer vision. In [7], the authors provide a fairly complete survey and evaluation of state-of-the-art pedestrian detectors. They report that approaches using Histograms of Oriented Gradients (HOG) generally perform better, with support vector machines (SVM) or boosting used as the classification model. Their general conclusion, however, is that the overall performance of current detectors is poor even under favorable conditions; they specifically note that the detector using motion features performs best. One method not covered in the survey is Hough forests [2], a relatively simple yet efficient method that combines Hough transforms with random forests. It is especially interesting for us because, when used in an on-line fashion, it can be applied directly to tracking.

One of the major difficulties in multiple-object tracking is occlusion, and many tracking-by-detection approaches use a full-body detector [7], [2] that provides no explicit way to model it. This can lead to poor performance in cluttered scenes. A relatively simple detection technique that handles occlusion naturally is the Probabilistic Occupancy Map (POM) [8].
Even though POM depends strongly on multiple views and lacks discriminative power, a probabilistic model that relates people locations to the image evidence is quite attractive: it is straightforward to interpret, supports occlusion reasoning, plugs into tracking methods [5], [6] and, importantly, provides a principled way of introducing prior knowledge.

Another way to improve people tracking is to use additional appearance cues, such as faces, skin and clothing colors, as well as depth data. It has been shown [7], [1], [9] that these can significantly improve general tracking performance.

Most of the approaches described above rely on a multiple-camera setup, which makes it easier to disambiguate and reason about the image evidence. In [1], a complete framework is proposed for tracking multiple people from a single, possibly moving, camera. The authors formulate tracking as a maximum a-posteriori (MAP) problem and use Reversible-Jump Markov Chain Monte Carlo (RJ-MCMC) to solve it. The approach is rather flexible in that it incorporates various appearance cues as well as motion priors, but it has limited means to handle occlusions. Another interesting treatment of the monocular case is presented in [11], where occlusions are handled explicitly by a bank of detectors, each trained for a specific body part and then combined as a mixture of experts [12]. That work uses a Hidden Markov Model (HMM) for long-term tracking, which could possibly be improved upon by global optimization methods such as those based on linear programming [5], [6].

Our proposed approach is based on the generative model described in [8], into which we aim to introduce a motion model and an improved appearance model. In the following sections, we first focus on the generic monocular tracking system described in [1], and then go through promising methods for people detection [2] and pose estimation [3]. We then discuss in more detail how we plan to improve upon existing methods and present some preliminary results on our depth-based people detector.

II. GENERAL FRAMEWORK FOR MONOCULAR PEOPLE TRACKING

We start with a description of a general framework for people tracking. In [1], a principled probabilistic approach is presented for tracking multiple objects with a single moving camera. At this point we are not particularly interested in the moving-camera setup, yet the possibility of generalizing the algorithms to support it is appealing, since it opens up further application areas such as robotics and autonomous driving. In the following sections, we describe the model and the inference procedure in more detail, following [1].

A. Model

Let $I_t$ denote the image evidence at time $t$, $Z_t^i$ the location and velocity of person $i$ at time $t$ in world coordinates, $\Theta_t$ the camera parameters, and $G_t^j$ the location of static scene feature $j$ in world coordinates, where the index $t$ denotes time. Note that these geometric features are necessary to estimate the camera pose.
The tracking problem in [1] is approached in a principled Bayesian way. All $I_t$ are treated as observed random variables, while $Z_t$, $G_t$ and $\Theta_t$ are hidden variables, with $\Omega_t = \{Z_t, G_t, \Theta_t\}$ representing the complete hidden state of the scene. The relationship between the random variables is specified through a joint probability distribution, which leads to the following recursive, factorized posterior:

$$p(\Omega_t \mid I_{1:t}) \propto p(I_t \mid \Omega_t) \int p(\Omega_t \mid \Omega_{t-1})\, p(\Omega_{t-1} \mid I_{1:t-1})\, d\Omega_{t-1} \tag{1}$$

where $p(I_t \mid \Omega_t)$ is the observation likelihood, i.e. the generative model of the image evidence $I_t$ given the state variables $\Omega_t$ at time $t$; $p(\Omega_t \mid \Omega_{t-1})$ is the motion model (prior), which describes how the state evolves over time; and $p(\Omega_{t-1} \mid I_{1:t-1})$ is the posterior at the previous time $t-1$. The optimal value of $\Omega_t$ is then estimated by finding the MAP solution, assuming that the posterior at time $t = 1$ is known. We now consider the observation likelihood and the motion model in more detail, omitting some of the technicalities.

Observation Likelihood: The observation likelihood measures how well a particular value of $\Omega_t$ fits the image evidence $I_t$:

$$p(I_t \mid \Omega_t) = \prod_i p(I_t \mid Z_t^i, \Theta_t) \prod_j p(I_t \mid G_t^j, \Theta_t) = \prod_i p(I_t \mid f_{\Theta_t}(Z_t^i)) \prod_j p(I_t \mid f_{\Theta_t}(G_t^j)) \tag{2}$$

where $f_{\Theta_t}$ denotes the camera projection function, which assumes either a simple camera model (all objects are on the ground plane) or the pinhole camera model. Note that the likelihood factorizes over $i$ and $j$, where $p(I_t \mid f_{\Theta_t}(Z_t^i))$ is the observation likelihood of the $i$-th target and $p(I_t \mid f_{\Theta_t}(G_t^j))$ is the observation likelihood of the $j$-th feature. The target observation likelihood is modeled as a weighted combination, in the log domain, of the outputs of several detectors:

$$p(I_t \mid f_{\Theta_t}(Z_t^i)) \propto \exp\Big( \sum_j w_j \log p_j(I_t \mid Z_t^i) \Big) \tag{3}$$

where $w_j$ is the weight of detector $j$.
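The log-linear combination in (3) can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `fused_log_likelihood`, the detector responses and the weights are hypothetical values chosen for illustration, and [1] does not prescribe this particular implementation.

```python
import math

def fused_log_likelihood(detector_probs, weights, eps=1e-12):
    """Unnormalised log of eq. (3): sum_j w_j * log p_j(I_t | Z_t^i).

    detector_probs and weights are hypothetical inputs; eps guards
    against log(0) when a detector returns exactly zero.
    """
    return sum(w * math.log(max(p, eps))
               for p, w in zip(detector_probs, weights))

# Hypothetical responses of three detectors for one target hypothesis.
probs = [0.9, 0.6, 0.3]
weights = [0.5, 0.3, 0.2]

# Exponentiating gives the (unnormalised) fused likelihood, which equals
# the weighted geometric combination prod_j p_j ** w_j.
score = math.exp(fused_log_likelihood(probs, weights))
```

Working in the log domain keeps the combination numerically stable, and a weight of zero simply switches a detector off.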
In total, seven different detectors are used (when depth cues are available): a HOG-based full-body detector, a HOG-based upper-body detector, a face detector, a skin color detector, a depth-based shape detector, a depth-based motion detector, and a target-specific appearance detector.

Now consider $p(I_t \mid f_{\Theta_t}(G_t^j))$, the geometric feature observation likelihood. It can be interpreted as a generative model that projects the world coordinates of feature $j$ onto the image and then detects that projection, possibly corrupted by Gaussian noise, using an interest point detector. It is also possible that a feature cannot be detected, e.g. due to occlusion; this is encoded as a separate uniform component, meant to make the estimation more robust. The model can be formalized as follows:

$$p(I_t \mid f_{\Theta_t}(G_t^j)) = \begin{cases} \mathcal{N}(\tau_t^j;\, f_{\Theta_t}(G_t^j),\, \Sigma_G) & \text{if } G_t^j \text{ is valid} \\ K_B & \text{otherwise} \end{cases} \tag{4}$$

where $\tau_t^j$ is the interest point corresponding to the feature $G_t^j$, $\Sigma_G$ is the noise covariance, and $K_B$ is the uniform component.

Motion Model: The motion model represents (smooth) transitions of the state $\Omega_t$ over time, assuming that the transitions of the camera, the tracked targets and the geometric features are a-priori independent:

$$p(\Omega_t \mid \Omega_{t-1}) = p(\Theta_t \mid \Theta_{t-1})\, p(Z_t \mid Z_{t-1})\, p(G_t \mid G_{t-1}) \tag{5}$$

Camera motion is assumed to be smooth over a short period of time, and is therefore modeled with a linear dynamic model. The target motion prior $p(Z_t \mid Z_{t-1})$ is defined as a product of two factors:

$$p(Z_t \mid Z_{t-1}) = p_e(Z_t \mid Z_{t-1})\, p_m(Z_t \mid Z_{t-1}) \tag{6}$$

The first factor $p_e(Z_t \mid Z_{t-1})$ is the existence prior, which specifies the probability that the targets exist at time $t$. It is defined as a product of Bernoulli distributions over all targets $i$. A very interesting part of the model is the factor $p_m$, which encodes how the targets move and possibly interact with each other. The most common assumption is that targets move independently with constant velocity. The authors claim, however, that modeling interactions can lead to better tracking performance. The interaction model is therefore introduced as a mixture of a repulsion model and a group model. Such a mixture is meant to encode two common patterns of people's behavior: they either keep a distance from each other or move together. It can be formalized as a Markov Random Field (MRF) as follows:

$$p_m(Z_t \mid Z_{t-1}) = \prod_{a<b} \psi(Z_t^a, Z_t^b;\, \beta_t^{a,b}) \prod_{a<b} p(\beta_t^{a,b} \mid \beta_{t-1}^{a,b}) \prod_{a=1}^{N} p(Z_t^a \mid Z_{t-1}^a) \tag{7}$$

where $\beta_t^{a,b}$ is a hidden variable indicating which component of the mixture is used for the pair of targets $a, b$. The potential $\psi(\cdot)$ is a sigmoid of the distance between the two targets if $\beta_t^{a,b} = 1$ (minimal when they are close), and an exponential of the distance otherwise (maximal when they are close). The geometric motion prior encodes whether the features are valid and whether they are consistent over time.

B. Tracking

An important part of the approach of [1] is the MAP procedure based on RJ-MCMC [13]. Sampling-based inference was chosen mainly because of the complexity of the posterior (1), which makes other methods hard to apply. Tracking is done by finding the MAP configuration $\hat{\Omega}_t$:

$$\hat{\Omega}_t = \arg\max_{\Omega_t} p(\Omega_t \mid I_{1:t}) \tag{8}$$

where the posterior at each time $t$ is approximated by a set of samples $\{\Omega_t^r\}_{r=1}^{N}$, with $N$ being the total number of samples. The Metropolis-Hastings algorithm is used to obtain new samples: at each iteration, a new value of $\Omega_t$ is generated from a proposal distribution and is then accepted or rejected according to the acceptance ratio.
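To make the accept/reject mechanism concrete, here is a minimal sketch of a textbook Metropolis-Hastings loop on a toy one-dimensional target. It is not the RJ-MCMC sampler of [1]: there are no reversible-jump moves and the state is a single scalar. The names `metropolis_hastings`, `log_target`, `propose` and `log_q`, as well as the Gaussian target and the step size, are assumptions made purely for illustration.

```python
import math
import random

def metropolis_hastings(log_target, propose, log_q, x0, n_iter, rng):
    """Textbook Metropolis-Hastings sampler (illustrative sketch).

    log_target(x)       -- unnormalised log posterior of state x
    propose(x, rng)     -- draws a candidate state given the current one
    log_q(x_to, x_from) -- log proposal density of moving x_from -> x_to
    """
    x, samples = x0, [x0]
    for _ in range(n_iter):
        cand = propose(x, rng)
        # log acceptance ratio: posterior ratio times reverse/forward proposal ratio
        log_a = (log_target(cand) - log_target(x)) + (log_q(x, cand) - log_q(cand, x))
        # accept outright if a >= 1, otherwise accept with probability a
        if log_a >= 0.0 or rng.random() < math.exp(log_a):
            x = cand
        samples.append(x)
    return samples

# Toy problem (assumed for illustration): Gaussian posterior centred at 2.
def log_target(x):
    return -0.5 * (x - 2.0) ** 2

def propose(x, rng):
    return x + rng.gauss(0.0, 0.5)  # Gaussian random-walk proposal

def log_q(x_to, x_from):
    return -0.5 * ((x_to - x_from) / 0.5) ** 2

rng = random.Random(0)
samples = metropolis_hastings(log_target, propose, log_q, 5.0, 2000, rng)

# MAP estimate from the sample set, in the spirit of eq. (8).
map_estimate = max(samples, key=log_target)
```

Computing the ratio in the log domain avoids underflow when the posterior values are very small, which is typical for image likelihoods.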
Because of the high dimensionality, sampling from the complete configuration space can lead to very slow convergence, which is why at each iteration only one of the variables $Z_t$, $G_t$ or $\Theta_t$ is randomly selected for an update. The proposal distribution $Q(\Omega_t^{r+1}; \Omega_t^r)$ thus consists of three parts: the target proposal $Q_Z$, the geometric feature proposal $Q_G$ and the camera proposal $Q_\Theta$. Each is selected with a corresponding fixed probability, but only one at a time.

The target proposal $Q_Z$ is used to generate a new sample of the target parameters $Z_t^{r+1}$ given the current $Z_t^r$. In order to explore the whole state space, the authors propose the following set of reversible jump moves: stay-leave, add-delete, update and interaction flip, where each move is meant to explore a part of the state space and has a reversible counterpart. More specifically, the stay move proposes that a target that existed at the previous timestamp but is absent from the current sample be present in the next accepted sample; leave is its inverse. Add explores the possibility of initiating a new tracking target from the new detections, with delete being its inverse. Update simply proposes a new location for a target (it is its own inverse), and interaction flip changes the interaction mode. Analogously, the moves defined for the geometric feature proposal $Q_G$ are stay-leave and update. The camera proposal $Q_\Theta$ is modeled as a Gaussian whose mean is the value at the previous step.

Once the complete proposal distribution is defined, a new sample $\Omega_t^{r+1}$ can be drawn and accepted or rejected based on the following acceptance ratio:

$$a = \frac{P(I_t \mid \Omega_t^{r+1})}{P(I_t \mid \Omega_t^{r})} \cdot \frac{P(\Omega_t^{r+1} \mid I_{1:t-1})}{P(\Omega_t^{r} \mid I_{1:t-1})} \cdot \frac{Q(\Omega_t^{r}; \Omega_t^{r+1})}{Q(\Omega_t^{r+1}; \Omega_t^{r})} \tag{9}$$

which is a product of three ratios: between the image likelihoods, the approximated predictions, and the proposal distributions. If $a \geq 1$, the sample $\Omega_t^{r+1}$ is more likely than $\Omega_t^r$ and is accepted; otherwise, following the standard Metropolis-Hastings rule, it is accepted with probability $a$.

C. Discussion

In general, the idea of a complete model that merges most aspects of tracking in a principled probabilistic framework, including multiple detection cues, motion priors, models of people's interactions and even camera pose estimation, is very attractive. However, the proposed approach still has some shortcomings. The results presented in [1] indicate that, when no depth cues are available, the method performs slightly worse than the state of the art. The reason could be the absence of explicit occlusion handling, as well as errors caused by the crude sampling approximation of the posterior distribution. It is also not quite clear how the weights of the different detectors are defined. Within the proposed framework, it would be interesting to see whether incorporating a mixture of detectors for different body parts, as e.g. in [10], can improve performance.

III. PEOPLE DETECTION USING HOUGH FORESTS

In the previous section, we described a general probabilistic framework for tracking multiple people in monocular settings, proposed in [1], which relies heavily on the results of HOG-based people detectors. Designing a well-performing detector is a very challenging task on its own, and considering the performance of the state of the art [7], much remains to be done in that field. One of the promising detectors that can be used in such a tracking system is proposed in [2]. In fact, the authors propose a generic approach for object detection, tracking and action recognition: Hough forests. From our perspective, the approach is interesting because it defines both a well-performing detection algorithm and a way to track objects over time within a unified framework. In this section, we describe the basic idea of Hough forests, how they can be trained, and how they can then be used for people detection and tracking.

A. Hough Forests

A Hough forest is a specific type of random forest in which each tree maps image appearance features $I$ to probabilistic votes $h$ in a Hough space $H$, the space of all hypotheses for the location of the object center and the class of the object. Detection is performed by applying such a forest to an image of interest and searching for maxima in the resulting Hough space representation.

Training: Training a Hough forest requires a set of examples for each possible object class $c \in \{0, 1\}$, given as triples $(I_i, c_i, d_i)$, where $c_i$ are class indicators ($c_i = 0$ for background, $c_i = 1$ for people) and $d_i$ is the displacement of patch $i$ from the center of the particular training sample (set to zero for $c_i = 0$). Each tree is built from a random subset of the training set in a greedy fashion: starting from the root, the training set is split into two parts by evaluating a binary test, and splitting continues until a termination criterion is met, that is, until the maximum depth of the tree or the minimum number of samples is reached. Each leaf $L$ of the constructed tree then contains a number of patches, with $p(c \mid L)$ being the proportion of patches of class $c$ among them. In order to construct the tree, we need to specify how the binary test is selected given a set of training samples $A = \{(I_i, c_i, d_i)\}$. Each non-leaf node represents a binary split based on the following test, which compares the values of one feature channel at two locations in the patch:

$$t_{f,p,q,\tau}(I) = \begin{cases} 0 & \text{if } I^f(p) < I^f(q) + \tau \\ 1 & \text{otherwise} \end{cases} \tag{10}$$

where $p$ and $q$ are locations in the image patch, $f$ is a feature channel and $\tau$ is an offset. Hough forests aim to minimize two objectives simultaneously: classification uncertainty and displacement (regression) uncertainty. The uncertainty of the class label $c$ is measured as

$$U_1(A) = -|A| \sum_{c \in \{0,1\}} p(c \mid A) \log p(c \mid A) \tag{11}$$

and the displacement uncertainty as

$$U_2(A) = \sum_{(I_i, c_i, d_i) \in A:\, c_i = 1} \| d_i - \bar{d} \|^2 \tag{12}$$

where $\bar{d}$ is the mean of the positive displacement vectors (negatives do not influence the displacement uncertainty measure). Whenever a new split is to be created, we randomly select which uncertainty measure will be used. Then, among a set of randomly generated binary tests, with the values $f$, $p$, $q$, $\tau$ all selected at random, we choose the one that minimizes the selected measure.

Detection: In order to detect objects, one extracts the appearance features of the patches $I$ and passes them through each tree of the Hough forest. Consider a patch $\mathcal{P}(y) = (I(y), c(y), d(y))$ at location $y$, where $I(y)$ is the appearance of the patch, and $c(y)$ and $d(y)$ are, respectively, the hidden (unobserved) object class and displacement. The output of a tree $t$ for the given appearance $I(y)$ is a leaf $L_t(y)$, which defines the conditional probability $p(h \mid L_t(y))$ of a hypothesis $h = (c = 1, x)$ of observing an object of class $c = 1$ at location $x$. Note that, in order to handle multiple scales or aspect ratios, the hypothesis space can be naturally extended with the corresponding dimensions. The hypothesis conditional can be computed as a product of two factors:

$$p(h \mid L_t(y)) = p(h \mid c(y) = 1, L_t(y))\, p(c(y) = 1 \mid L_t(y)) = p(d(y) = y - x \mid c(y) = 1, L_t(y))\, p(c(y) = 1 \mid L_t(y)) \tag{13}$$

where the latter factor is simply the proportion of positive examples in the leaf $L_t(y)$, and the former can be estimated, e.g., using a Parzen window:

$$p(d(y) = y - x \mid c(y) = 1, L_t(y)) = \frac{1}{|D_{L_t}|} \sum_{d \in D_{L_t}} \mathcal{N}(y - x;\, d,\, \sigma^2) \tag{14}$$

where $D_{L_t}$ is the set of displacement vectors stored in the leaf $L_t(y)$ and $\sigma^2$ is the covariance parameter that determines the window width.
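The split criteria (10)-(12) are simple enough to write down directly. The sketch below is a plain-Python illustration under a few assumptions: training samples are represented as `(patch, c, d)` tuples, a patch is a nested list indexed as `patch[channel][row][col]`, and the function names are hypothetical. Scoring a candidate split by summing the chosen measure over the two resulting subsets is also an assumption, since the text above does not spell out this detail.

```python
import math

def binary_test(patch, f, p, q, tau):
    """Eq. (10): compare channel f at locations p and q with offset tau."""
    return 0 if patch[f][p[0]][p[1]] < patch[f][q[0]][q[1]] + tau else 1

def class_uncertainty(samples):
    """Eq. (11): U1(A) = -|A| * sum_c p(c|A) * log p(c|A)."""
    n = len(samples)
    u = 0.0
    for c in (0, 1):
        count = sum(1 for _, ci, _ in samples if ci == c)
        if count > 0:
            prob = count / n
            u -= n * prob * math.log(prob)
    return u

def offset_uncertainty(samples):
    """Eq. (12): squared deviation of positive displacements from their mean."""
    pos = [d for _, c, d in samples if c == 1]
    if not pos:
        return 0.0
    mean_r = sum(d[0] for d in pos) / len(pos)
    mean_c = sum(d[1] for d in pos) / len(pos)
    return sum((d[0] - mean_r) ** 2 + (d[1] - mean_c) ** 2 for d in pos)

def split_cost(samples, test_params, measure):
    """Assumed scoring rule: sum of the measure over both child subsets."""
    f, p, q, tau = test_params
    left = [s for s in samples if binary_test(s[0], f, p, q, tau) == 0]
    right = [s for s in samples if binary_test(s[0], f, p, q, tau) == 1]
    return measure(left) + measure(right)
```

At each node one would draw a handful of random `(f, p, q, tau)` tuples, pick `class_uncertainty` or `offset_uncertainty` at random, and keep the test with the lowest `split_cost`.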
Then, summing over all $T$ trees in the forest and over all given patches, we obtain the following Hough vote for a hypothesis $h$:

$$p(h \mid I) \propto \sum_{y} \frac{1}{T} \sum_{t=1}^{T} p(h \mid L_t(y)) \tag{15}$$

Computing these votes for each possible hypothesis $h$ results in a Hough image. Note that the sums do not yield true probabilities, yet they are more numerically stable. It is possible, however, to use log-probabilities in order to obtain a strict probabilistic interpretation. Once the Hough image is available, detection is performed by a local maxima search. The local maxima $\hat{h}$ correspond to hypotheses for the centers of possible detections, whereas $p(\hat{h} \mid I)$ serves as the measure of confidence for those detections.

Tracking: One of the advantages of Hough forests is that they can be adapted on-line, which is particularly useful for appearance-based tracking. Detectors trained off-line solve a more general problem than tracking: they search for any object of the class of interest, whereas tracking ultimately requires the detection of a specific target. One can assume that the statistics of the target appearance do not change rapidly over time. Hence, it should be possible to update the statistics in the leaves of the Hough forest so that it is tuned for a specific target. The authors of [2] suggest a fairly straightforward counting scheme. Similarly to the conditional of the object presence hypothesis $h$ in (13), the probability of the hypothesis $h_E$ for target $E$ can be computed as follows:

$$p(h_E \mid L_t(y)) = p(d(y) = y - x \mid c(y) = 1, L_t(y))\; p(h_E = h \mid c(y) = 1, L_t(y))\; p(c(y) = 1 \mid L_t(y)) \tag{16}$$

where the term $p(h_E = h \mid c(y) = 1, L_t(y))$ is the one estimated on-line. Namely, it is the proportion of times a specific entry $d \in D_{L_t}$ of the leaf votes for the target $E$.
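The vote accumulation of (13)-(15) and the subsequent maximum search can be sketched as follows. This is an illustrative simplification rather than the implementation of [2]: each stored displacement casts a single point vote (the Gaussian Parzen window of (14) would correspond to smoothing the resulting image), the leaves reached by each patch are assumed to be given, and the names `hough_image` and `patch_votes` are hypothetical.

```python
import numpy as np

def hough_image(shape, patch_votes):
    """Accumulate Hough votes in the spirit of eq. (15), using point votes.

    patch_votes: list of ((row, col), leaves) pairs, one per image patch.
    leaves: one (p_fg, displacements) tuple per tree, where p_fg is the
    proportion of positive samples in the reached leaf and displacements
    is the list of stored (d_row, d_col) offsets.
    """
    H = np.zeros(shape)
    for (yr, yc), leaves in patch_votes:
        T = len(leaves)
        for p_fg, disps in leaves:
            if not disps:
                continue
            w = p_fg / (T * len(disps))  # 1/T * p(c=1|L) * 1/|D_L|
            for dr, dc in disps:
                r, c = yr - dr, yc - dc  # d = y - x, hence x = y - d
                if 0 <= r < shape[0] and 0 <= c < shape[1]:
                    H[r, c] += w
    return H

# Two hypothetical patches (one tree each) that agree on the centre (3, 4).
votes = [
    ((5, 5), [(1.0, [(2, 1)])]),
    ((1, 4), [(0.5, [(-2, 0)])]),
]
H = hough_image((8, 8), votes)
peak = np.unravel_index(np.argmax(H), H.shape)
```

In this toy example `argmax` finds the single global peak; a detector for multiple people would instead look for all local maxima above a confidence threshold.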

B. Discussion

In [2], the authors demonstrate that their approach performs quite well compared to the state of the art on the majority of the evaluated datasets. One limitation, mentioned by the authors themselves, is that votes are cast for the center of the object, which can fail for non-rigid objects or objects with large pose variability, such as people performing sports activities. Moreover, no direct way to handle occluded instances is shown, which is crucial for people detection in crowded scenes. However, a more recent work [14] demonstrates that explicit occlusion reasoning can be introduced into the Hough forest framework. It is also worth noting that we focused on applying the method to people detection and tracking, whereas Hough forests are more generic: they can be used for multi-class detection and can also operate on image sequences to perform action recognition.

IV. POSE ESTIMATION WITH MOTION PRIORS

In the previous sections, we looked at a general tracking framework and a discriminative people detector. We believe that tracking quality can be improved with stronger motion models and, in particular, by introducing a notion of pose. Even a rough hypothesis about a person's pose might simplify reasoning about occlusions and people's locations, not to mention that the pose itself is of great interest for a variety of applications. In this section, we consider a generative model for pose estimation, the Laplacian Eigenmaps Latent Variable Model (LELVM) [3].

Human poses and motions are high-dimensional, which makes brute-force optimization practically impossible. Even approximate methods that avoid exhaustive search can still suffer from the high dimensionality. This is why dimensionality reduction (DR) methods have been explored for pose reconstruction [3], [15].

A. Nonlinear Dimensionality Reduction

Nonlinear DR is needed, rather than methods like PCA, because some data are poorly captured by linear functions. Human motion is highly nonlinear and, as demonstrated e.g. in [3], [15], methods that account for this significantly outperform PCA. There are various ways to perform nonlinear DR. Most of them rely on the idea of a low-dimensional manifold: a smooth, curved subspace of the higher-dimensional Euclidean space in which it is embedded. One such approach, which lies at the core of the pose estimation method of [3], is Laplacian Eigenmaps (LE).

B. Laplacian Eigenmaps

Given a set of points $\{y_i\}_{i=1}^{N}$, the LE algorithm constructs a graph in which each node corresponds to a point $y_i$, and weighted edges connect nodes that are close to each other, either as nearest neighbors or within some $\epsilon$-neighborhood. The weights $W_{ij}$ of those edges can be defined by a Gaussian affinity function, or by setting them to ones and zeros to represent a fixed number of nearest neighbors. The main idea behind LE is to map points that are close to each other to points that are close in the low-dimensional space:

$$\min_{X} \sum_{i} \sum_{j} (x_i - x_j)^2\, W_{ij} \tag{17}$$

Or, in matrix form:

$$\min_{X} \operatorname{tr}(X L X^T) \tag{18}$$

with $L = D - W$ being the graph Laplacian and $D$ a diagonal matrix such that $D_{ii} = \sum_j W_{ij}$. Note that, in order to avoid the trivial solutions $X = 0$ and $x_1 = \dots = x_N$ of (18), the minimization is carried out subject to $X D X^T = I$ and $X D \mathbf{1} = 0$.

C. Latent Variable Model

LELVM can be seen as a way to define an out-of-sample mapping and a density model for LE. Mapping a new point $y$ to the latent space, $F(y) = x$, can be formulated as follows:

$$\min_{x} \operatorname{tr}\left( \begin{pmatrix} X & x \end{pmatrix} \begin{pmatrix} L & -K(y) \\ -K(y)^T & \mathbf{1}^T K(y) \end{pmatrix} \begin{pmatrix} X^T \\ x^T \end{pmatrix} \right) \tag{19}$$

where $K(y)$ is the vector with entries $K(y, y_n) = \exp\left( -\frac{(y - y_n)^2}{2\sigma^2} \right)$ if $y_n$ is a nearest neighbor of $y$, and 0 otherwise. The solution of (19) is:

$$x = \sum_{n=1}^{N} \frac{K(y, y_n)}{\sum_{n'=1}^{N} K(y, y_{n'})}\, x_n \tag{20}$$

The LVM itself is then defined as a joint kernel density estimate (KDE):

$$p(x, y) = \frac{1}{N} \sum_{n=1}^{N} K_y(y, y_n)\, K_x(x, x_n) \tag{21}$$

where both $K_y$ and $K_x$ are Gaussian kernels. The dimensionality reduction mapping $F$ and the reconstruction mapping $f$ then follow as conditional means:

$$F(y) = \sum_{n=1}^{N} p(n \mid y)\, x_n \tag{22}$$

$$f(x) = \sum_{n=1}^{N} p(n \mid x)\, y_n \tag{23}$$

where

$$p(n \mid x) = \frac{K_x(x, x_n)}{\sum_{n'=1}^{N} K_x(x, x_{n'})} \tag{24}$$

and $p(n \mid y)$ is defined analogously using $K_y$.
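The embedding of Section IV-B and the out-of-sample mapping (20) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation of [3]: the graph uses Gaussian affinities restricted to a symmetrised k-nearest-neighbor graph that is assumed to be connected, the constrained problem (18) is solved through the equivalent symmetric normalised eigenproblem, and points are stored as rows (so the returned array is the transpose of the matrix $X$ used in the text). The function names and the toy data are hypothetical.

```python
import numpy as np

def laplacian_eigenmaps(Y, n_components, sigma, k):
    """Embed the rows of Y (N x D) into n_components dimensions."""
    N = Y.shape[0]
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))  # Gaussian affinities
    # Keep only the k nearest neighbours of each point, then symmetrise.
    idx = np.argsort(sq, axis=1)[:, 1:k + 1]
    mask = np.zeros((N, N), dtype=bool)
    mask[np.repeat(np.arange(N), k), idx.ravel()] = True
    mask |= mask.T
    W = W * mask
    d = W.sum(axis=1)                     # degrees D_ii
    L = np.diag(d) - W                    # graph Laplacian L = D - W
    # Solve L v = lambda D v via the symmetric form D^{-1/2} L D^{-1/2}.
    d_isqrt = 1.0 / np.sqrt(d)
    L_sym = d_isqrt[:, None] * L * d_isqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    # Skip the trivial eigenvector (eigenvalue 0) and undo the scaling.
    return vecs[:, 1:n_components + 1] * d_isqrt[:, None]

def out_of_sample(y, Y, X, sigma, k):
    """Eq. (20): kernel-weighted average of the neighbours' embeddings."""
    sq = np.sum((Y - y) ** 2, axis=1)
    nn = np.argsort(sq)[:k]
    w = np.exp(-sq[nn] / (2.0 * sigma ** 2))
    return (w[:, None] * X[nn]).sum(axis=0) / w.sum()

# Toy data (assumed): 20 points on a line embedded in 2D.
Y = np.column_stack([np.arange(20.0), np.zeros(20)])
X = laplacian_eigenmaps(Y, n_components=2, sigma=1.0, k=2)
x_new = out_of_sample(np.array([3.4, 0.0]), Y, X, sigma=1.0, k=2)
```

The scaling by `d_isqrt` is what enforces the constraint $X D X^T = I$; dropping the first eigenvector removes the constant solution excluded by $X D \mathbf{1} = 0$.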

D. Tracking

The tracking framework can be thought of as a simplified version of the one described in [1]:

$$p(s_t \mid z_{1:t}) \propto p(z_t \mid s_t) \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid z_{1:t-1})\, ds_{t-1} \tag{25}$$

where $s = (x, d)$ are the hidden state variables, with $x \in \mathbb{R}^L$ being the pose representation in the low-dimensional latent space and $d \in \mathbb{R}^3$ the location of the center of mass; $z$ represents the observed image evidence, that is, the 2D image locations of the pose joints. The observation model $p(z_t \mid s_t)$ is a Gaussian with isotropic covariance and a mean defined by the following transformation:

$$z = P(d)\, y = P(d)\, f(x) \tag{26}$$

where $P(d)$ is the perspective projection that shifts each 3D point by $d$. The dynamics model is a product of Gaussian random walks for $x$ and $d$ and the LELVM prior $p(x_t)$:

$$p(s_t \mid s_{t-1}) \propto p_d(d_t \mid d_{t-1})\, p_x(x_t \mid x_{t-1})\, p(x_t) \tag{27}$$

which means that both the pose and the location are expected to stay close to their values at the previous step. Tracking is performed by running inference on this model using a particle filter based on a Gaussian mixture distribution and a Sigma-point Kalman filter.

E. Discussion

The authors demonstrated that LELVM can successfully be used as a motion prior for monocular pose estimation. However, it is not clear from the results presented in [3] whether the quality is sufficient for unconstrained environments and for use as part of a multiple people tracking system. In fact, how to efficiently incorporate a pose estimate into a multiple people tracking system such as [1] or [6] is an open research question.

V. DISCUSSION AND THESIS PROPOSAL

In this section, we introduce our plan for tackling the problem of monocular people tracking and discuss the preliminary work done so far. We propose an approach to monocular tracking that is based on POM [8] and extends it with temporal consistency (a motion model) and a more realistic appearance model, such as learned silhouettes.
Note that these two extensions can be combined to obtain a rough estimate of human body motion by introducing dynamics between silhouettes, which can be regarded as weak pose estimates. Even though such an estimate might not be very precise, it can significantly help to restrict the original high-dimensional problem of pose estimation.

Unfortunately, the position of an object in a single RGB image is in general scale-ambiguous, and the original background-subtraction-based POM will fail because it relies on multiple views to resolve this ambiguity. To address this, we propose detecting the body part whose position is probably the least ambiguous (at least in pedestrian tracking settings): the feet. It is also worth looking at heads, since they are relatively easy to detect compared to other body parts and often have a very typical location as well. A discriminative detector such as Hough forests can be used for that purpose, and it will probably work more reliably on the easier task of estimating the position of a specific body part than on finding the whole human body.

Working in single-camera settings brings many difficulties from the outset, including depth ambiguity, variations in illumination, and shadows. This is why we have decided to first try out our ideas of temporal consistency and what we call weak pose estimates on depth maps. We then plan to gradually move to more complicated settings, that is, to stereo images and then to single RGB images.

A. Preliminary Work: People Detection in Depth Maps

One way to handle the complexity of human motion is to use additional sensor data, such as the depth acquired by depth sensors such as Kinect [16]. This has become especially relevant as the number of affordable sensors of this kind is increasing. It was demonstrated in [16] that, with a vast amount of training data, it is possible to obtain reliable pose estimates from single depth maps.

Modeling Depth Maps: Let $X = \{X_1, \dots, X_K\}$ be the individual occupancy variables for each location of the ground plane. The idea behind the original POM algorithm [8] is to model people as rectangles and to model the dependency between the observed signal $B$ (binary images, the result of background subtraction) and the artificial signal $A$ (a deterministic function of the occupancy variables $X$) as follows:

$$P(B = b \mid A = a) \propto \exp(-\Psi(b, a)) \tag{28}$$

where $\Psi(b, a)$ is a dissimilarity function between the realizations of the variables $B$ and $A$. It can also be thought of as $-\log P(B = b \mid X)$, that is, the negative log-likelihood of observing a binary image $b$ given the locations of people $X$ or, in other words, a generative model for binary images. An inference procedure is then used to obtain the posterior distributions over $X$.

Here, we take a very similar approach, but for depth maps rather than binary images. Let $Z \in \mathbb{R}^{W \times H}$ and $A \in \mathbb{R}^{W \times H}$ be the captured and the artificial depth maps, respectively. Given the values of the occupancy variables $X$ for all locations, the artificial depth map $A$ can be generated as a deterministic function of those values by assigning to each pixel the depth of the closest silhouette (if any):

$$A_{ij} = \min\Big( z^{bg}_{ij},\ \min_{k:\, X_k = 1,\ (i,j) \in S_k} z_k \Big) \tag{29}$$

where $S_k$ stands for the silhouette at location $k$, and $z_k$ is the depth of that silhouette w.r.t. the camera. At this point, each $S_k$ in our model is the projection of a 3D plane. $z^{bg}_{ij}$ denotes the depth of the background. Note that it is also possible that $z^{bg}_{ij} = z_\varnothing$, a special no-value corresponding to the case when no depth is captured or the captured value is very unreliable. Note also that we assume the silhouettes are sorted w.r.t. their distance to the camera. Further, we specify the following data likelihood, factorized over pixel locations:

$$P(Z = z \mid X) = \prod_{(i,j)} P(z_{ij} \mid X) \tag{30}$$

where $P(z_{ij} \mid X)$ is shorthand for $P(Z_{ij} = z_{ij} \mid X)$. This factorization assumption is, of course, too strong, and is made primarily to make inference easier. When it is known that silhouettes $S_{1:k-1}$ are absent and $S_k$ is present (which is equivalent to $X_{1:k-1} = 0$ and $X_k = 1$), the distribution of $z_{ij}$ is assumed to be:

$$P(z_{ij} \mid X_{1:k-1} = 0, X_k = 1) = \begin{cases} p_\varnothing & \text{if } z_{ij} = z_\varnothing \\ (1 - p_\varnothing)\, \tilde{\mathcal{N}}(z_{ij};\, A_{ij}, \sigma^2, p_{\min}) & \text{otherwise} \end{cases} \tag{31}$$

where $p_\varnothing$ is the (Bernoulli) probability of observing the no-value given that $X_k = 1$ (a constant), and $\tilde{\mathcal{N}}$ is a long-tailed distribution, which can be thought of as a clamped Gaussian PDF that does not take values lower than $p_{\min}$. When it is certain that the background depth is being observed ($X_{1:K} = 0$), we again use the distribution $\tilde{\mathcal{N}}$, handling the no-value separately:

$$P(z_{ij} \mid X_{1:K} = 0) = \begin{cases} p_{bg} & \text{if } z_{ij} = z_\varnothing \\ (1 - p_{bg})\, \tilde{\mathcal{N}}(z_{ij};\, \mu_m, \sigma_m^2, p_{\min}) & \text{otherwise} \end{cases} \tag{32}$$

Inference: We do not give the complete derivation of the inference equations; in general, we simply follow the variational inference procedure [17] for our generative model. The final update equation for the posterior estimates $Q(X_k = 1)$ is:

$$Q(X_k = 1) = \frac{1}{1 + \exp\big( \langle \log P(Z, X) \mid X_k = 0 \rangle - \langle \log P(Z, X) \mid X_k = 1 \rangle \big)} \tag{33}$$

where $\langle \cdot \rangle$ denotes the expectation w.r.t. the approximate posterior $\prod_{j \neq k} Q(X_j)$ (with $X_k$ marginalized out).

Preliminary Results: We have run our detector on several simple samples, and the quality of the detections looks promising, especially considering the simplicity of the appearance model and its ability to handle occlusions. Sample output of the generative model for the estimated locations is given in Figure 1.
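The generative depth model of (29)-(31) can be illustrated with a short NumPy sketch. It makes several simplifying assumptions that are not part of the model description above: arrays are indexed as (row, column), silhouettes are given as boolean masks, the background depth is assumed to be valid everywhere, and the per-pixel term of (31) with mean $A_{ij}$ is applied to all pixels, including background ones, instead of the separate background model (32). All function names and numeric values are hypothetical.

```python
import numpy as np

def artificial_depth_map(z_bg, silhouettes, depths, occupancy):
    """Eq. (29): per pixel, the depth of the closest present silhouette,
    or the background depth if no present silhouette covers the pixel."""
    A = z_bg.copy()
    for S, z, x in zip(silhouettes, depths, occupancy):
        if x:
            A[S] = np.minimum(A[S], z)
    return A

def clamped_gaussian(z, mean, sigma, p_min):
    """Gaussian PDF that is not allowed to drop below p_min."""
    pdf = np.exp(-0.5 * ((z - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return np.maximum(pdf, p_min)

def log_likelihood(Z, A, valid, sigma, p_min, p_none):
    """Sum over pixels of the log of the eq. (31)-style term (simplified)."""
    ll_valid = np.log1p(-p_none) + np.log(clamped_gaussian(Z, A, sigma, p_min))
    return float(np.where(valid, ll_valid, np.log(p_none)).sum())

# Toy scene (assumed): flat background at 5 m, two overlapping silhouettes.
z_bg = np.full((4, 4), 5.0)
S1 = np.zeros((4, 4), dtype=bool)
S1[0:2, :] = True
S2 = np.zeros((4, 4), dtype=bool)
S2[1:3, :] = True
sils, depths = [S1, S2], [2.0, 3.0]

Z = artificial_depth_map(z_bg, sils, depths, [1, 1])  # "observed" map
valid = np.ones_like(Z, dtype=bool)                   # no missing depth values
ll_true = log_likelihood(Z, Z, valid, 0.1, 1e-3, 0.05)
A_empty = artificial_depth_map(z_bg, sils, depths, [0, 0])
ll_empty = log_likelihood(Z, A_empty, valid, 0.1, 1e-3, 0.05)
```

As expected, the occupancy configuration that generated the observation explains it better than the empty scene, and the clamping by `p_min` keeps a single badly explained pixel from dominating the score.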
We have not yet performed a formal comparison of our approach against the few existing alternatives. The main reason is that there are few open datasets for people tracking in depth maps, nor are there many purely depth-based people detectors openly available. Note that the original Kinect algorithm [16] for pose estimation is available within the Kinect SDK; however, it can track at most 6 people and is optimized for people facing the camera. Moreover, it cannot be run on arbitrary depth maps, which makes it impossible to evaluate on an arbitrary dataset. It would be interesting to compare to the approach described in [18], one of the very few depth-based people detectors available, since the dataset for that work is available. However, our algorithm is unlikely to provide better results, since their approach relies strongly on a state-of-the-art RGB-based detector, whereas our model makes no use of RGB cues at all. We believe that after introducing those cues we can provide competitive results.

B. Preliminary Work: EM for Silhouette Learning

The next step for our depth-based detector is to replace the crude silhouettes with more realistic ones learned from data. For that, we are working on a weakly-supervised expectation-maximization algorithm, meaning that it learns a dictionary of silhouettes from a set of positive examples: depth maps in which exactly one person is present.

C. Future Work

The ultimate goal of our work is to be able to reliably track multiple people in complex crowded environments. As mentioned earlier, we plan to introduce a notion of motion into the original POM framework and to improve its appearance model by means of learned silhouettes. The motion part would basically mean that, instead of estimating the locations of people individually in each frame, we will jointly estimate the locations of people in several ($k$) subsequent frames.
Then, for a given timestamp $t$ and image evidence $I_{1:t}$, we will seek an estimate of the posterior $p(X_{t-k}, \dots, X_t \mid I_{1:t})$, where $X_t$ now represents not only the ground plane location, but also the silhouette that best explains the appearance of the person. These location-silhouette detections can then be used quite naturally, e.g., in the linear programming framework for tracking [9] over longer time periods. It should also be possible to use these silhouette cues to reduce the search space for complete pose estimation, such as the one in [3].

Fig. 1: Output of the generative model for the detections. The top row shows the output of the generative model based on the inferred people locations. The bottom row shows the input depth maps.

REFERENCES
[1] W. Choi, C. Pantofaru, and S. Savarese, "A general framework for tracking multiple people from a moving camera," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 7.
[2] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky, "Hough forests for object detection, tracking, and action recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 11.
[3] Z. Lu, M. A. Carreira-Perpiñán, and C. Sminchisescu, "People tracking with the Laplacian Eigenmaps latent variable model," in NIPS, vol. 20, 2007.
[4] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," in Computer Vision and Pattern Recognition, CVPR IEEE Conference on. IEEE, 2008.
[5] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, "Multiple object tracking using k-shortest paths optimization," IEEE Transactions on Pattern Analysis and Machine Intelligence.
[6] H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua, "Multi-commodity network flow for tracking multiple people."
[7] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 4.
[8] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multicamera people tracking with a probabilistic occupancy map," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 2.
[9] H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua, "Tracking multiple people under global appearance constraints," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
[10] C. Wojek, S. Walk, S. Roth, and B. Schiele, "Monocular 3D scene understanding with explicit occlusion reasoning," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
[11] C. Wojek, S. Walk, S. Roth, K. Schindler, and B. Schiele, "Monocular visual scene understanding: Understanding multi-object traffic scenes," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 4.
[12] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1.
[13] Z. Khan, T. Balch, and F. Dellaert, "MCMC-based particle filtering for tracking a variable number of interacting targets," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 27, no. 11.
[14] T. Wang, X. He, and N. Barnes, "Learning structured Hough voting for joint object detection and occlusion reasoning," in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013.
[15] R. Urtasun, D. J. Fleet, and P. Fua, "3D people tracking with Gaussian process dynamical models," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 1. IEEE, 2006.
[16] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore, "Real-time human pose recognition in parts from single depth images," Communications of the ACM, vol. 56, no. 1.
[17] J. Winn and C. M. Bishop, "Variational message passing," Journal of Machine Learning Research, vol. 6, no. 1, p. 661.
[18] M. Munaro and E. Menegatti, "Fast RGB-D people tracking for service robots," Autonomous Robots, pp. 1–16, 2014.


More information

Machine Learning. Supervised Learning. Manfred Huber

Machine Learning. Supervised Learning. Manfred Huber Machine Learning Supervised Learning Manfred Huber 2015 1 Supervised Learning Supervised learning is learning where the training data contains the target output of the learning system. Training data D

More information

Homework. Gaussian, Bishop 2.3 Non-parametric, Bishop 2.5 Linear regression Pod-cast lecture on-line. Next lectures:

Homework. Gaussian, Bishop 2.3 Non-parametric, Bishop 2.5 Linear regression Pod-cast lecture on-line. Next lectures: Homework Gaussian, Bishop 2.3 Non-parametric, Bishop 2.5 Linear regression 3.0-3.2 Pod-cast lecture on-line Next lectures: I posted a rough plan. It is flexible though so please come with suggestions Bayes

More information

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos

Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos Definition, Detection, and Evaluation of Meeting Events in Airport Surveillance Videos Sung Chun Lee, Chang Huang, and Ram Nevatia University of Southern California, Los Angeles, CA 90089, USA sungchun@usc.edu,

More information

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 10: Learning with Partially Observed Data Theo Rekatsinas 1 Partially Observed GMs Speech recognition 2 Partially Observed GMs Evolution 3 Partially Observed

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Overview of Part Two Probabilistic Graphical Models Part Two: Inference and Learning Christopher M. Bishop Exact inference and the junction tree MCMC Variational methods and EM Example General variational

More information

COMPUTER VISION > OPTICAL FLOW UTRECHT UNIVERSITY RONALD POPPE

COMPUTER VISION > OPTICAL FLOW UTRECHT UNIVERSITY RONALD POPPE COMPUTER VISION 2017-2018 > OPTICAL FLOW UTRECHT UNIVERSITY RONALD POPPE OUTLINE Optical flow Lucas-Kanade Horn-Schunck Applications of optical flow Optical flow tracking Histograms of oriented flow Assignment

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Graphical Models for Computer Vision

Graphical Models for Computer Vision Graphical Models for Computer Vision Pedro F Felzenszwalb Brown University Joint work with Dan Huttenlocher, Joshua Schwartz, Ross Girshick, David McAllester, Deva Ramanan, Allie Shapiro, John Oberlin

More information

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation

Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial Region Segmentation IJCSNS International Journal of Computer Science and Network Security, VOL.13 No.11, November 2013 1 Moving Object Segmentation Method Based on Motion Information Classification by X-means and Spatial

More information

6.801/866. Segmentation and Line Fitting. T. Darrell

6.801/866. Segmentation and Line Fitting. T. Darrell 6.801/866 Segmentation and Line Fitting T. Darrell Segmentation and Line Fitting Gestalt grouping Background subtraction K-Means Graph cuts Hough transform Iterative fitting (Next time: Probabilistic segmentation)

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

Behaviour based particle filtering for human articulated motion tracking

Behaviour based particle filtering for human articulated motion tracking Loughborough University Institutional Repository Behaviour based particle filtering for human articulated motion tracking This item was submitted to Loughborough University's Institutional Repository by

More information

Efficient Feature Learning Using Perturb-and-MAP

Efficient Feature Learning Using Perturb-and-MAP Efficient Feature Learning Using Perturb-and-MAP Ke Li, Kevin Swersky, Richard Zemel Dept. of Computer Science, University of Toronto {keli,kswersky,zemel}@cs.toronto.edu Abstract Perturb-and-MAP [1] is

More information

University of Cambridge Engineering Part IIB Paper 4F10: Statistical Pattern Processing Handout 11: Non-Parametric Techniques.

University of Cambridge Engineering Part IIB Paper 4F10: Statistical Pattern Processing Handout 11: Non-Parametric Techniques. . Non-Parameteric Techniques University of Cambridge Engineering Part IIB Paper 4F: Statistical Pattern Processing Handout : Non-Parametric Techniques Mark Gales mjfg@eng.cam.ac.uk Michaelmas 23 Introduction

More information

Static Gesture Recognition with Restricted Boltzmann Machines

Static Gesture Recognition with Restricted Boltzmann Machines Static Gesture Recognition with Restricted Boltzmann Machines Peter O Donovan Department of Computer Science, University of Toronto 6 Kings College Rd, M5S 3G4, Canada odonovan@dgp.toronto.edu Abstract

More information

Online Signature Verification Technique

Online Signature Verification Technique Volume 3, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Online Signature Verification Technique Ankit Soni M Tech Student,

More information

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation

Self Lane Assignment Using Smart Mobile Camera For Intelligent GPS Navigation and Traffic Interpretation For Intelligent GPS Navigation and Traffic Interpretation Tianshi Gao Stanford University tianshig@stanford.edu 1. Introduction Imagine that you are driving on the highway at 70 mph and trying to figure

More information

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points] CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, 2015. 11:59pm, PDF to Canvas [100 points] Instructions. Please write up your responses to the following problems clearly and concisely.

More information

Markov Random Fields for Recognizing Textures modeled by Feature Vectors

Markov Random Fields for Recognizing Textures modeled by Feature Vectors Markov Random Fields for Recognizing Textures modeled by Feature Vectors Juliette Blanchet, Florence Forbes, and Cordelia Schmid Inria Rhône-Alpes, 655 avenue de l Europe, Montbonnot, 38334 Saint Ismier

More information

A Keypoint Descriptor Inspired by Retinal Computation

A Keypoint Descriptor Inspired by Retinal Computation A Keypoint Descriptor Inspired by Retinal Computation Bongsoo Suh, Sungjoon Choi, Han Lee Stanford University {bssuh,sungjoonchoi,hanlee}@stanford.edu Abstract. The main goal of our project is to implement

More information

Automatic Parameter Adaptation for Multi-Object Tracking

Automatic Parameter Adaptation for Multi-Object Tracking Automatic Parameter Adaptation for Multi-Object Tracking Duc Phu CHAU, Monique THONNAT, and François BREMOND {Duc-Phu.Chau, Monique.Thonnat, Francois.Bremond}@inria.fr STARS team, INRIA Sophia Antipolis,

More information

A Feature Point Matching Based Approach for Video Objects Segmentation

A Feature Point Matching Based Approach for Video Objects Segmentation A Feature Point Matching Based Approach for Video Objects Segmentation Yan Zhang, Zhong Zhou, Wei Wu State Key Laboratory of Virtual Reality Technology and Systems, Beijing, P.R. China School of Computer

More information