
KERNEL-BASED TRACKING USING SPATIAL STRUCTURE

Nicole M. Artner 1, Salvador B. López Mármol 2, Csaba Beleznai 1 and Walter G. Kropatsch 2

Abstract

We extend the concept of kernel-based tracking by modeling the spatial structure of multiple tracked feature points belonging to the same object with a simple graph-based representation. Tracking parts or multiple feature points of an object without considering the underlying structure becomes ambiguous if the target representation (for example, color histograms) is similar to that of other nearby targets or of the background. Instead of treating the tracking of multiple targets as isolated processes, we propose an approach that incorporates spatial dependencies between tracked targets, together with an iterative technique that efficiently locates the spatial arrangement of targets maximizing the joint posterior. We present a series of experiments demonstrating that the proposed method provides improved tracking stability and accuracy compared to standard Mean Shift tracking. Furthermore, we analyze the robustness of the proposed method by assessing its performance in scenarios where occlusions are present.

1. Introduction

Object representation is a crucial issue for object tracking and has a significant impact on the quality of a tracking algorithm. Tracking quality can, for example, be gauged in terms of spatial accuracy and temporal stability. To achieve reliable tracking performance, the employed object representation needs to be discriminative, invariant against variations in object appearance, and computationally efficient enough to be applicable to multiple targets, as is often the case in realistic scenarios. These requirements are especially difficult to meet for targets with deformable shape that simultaneously undergo photometric variations. Similar challenges exist in the field of visual object recognition.
For object recognition, significant achievements have been made in recent years by devising part-based object representations of high representational power at low computational complexity [5, 4]. Part-based or structural representations, however, are still relatively unexplored for the tracking task, where mainly either strictly local (point tracking) or global object representations prevail. This is surprising, considering that structure is an important invariant. In this paper, we propose an initial concept for combining deterministic tracking of object parts with a graph representation encoding structural dependencies between the parts. In general, image graphs can be used to represent structure and topology. The output of any segmentation algorithm which produces regions with closed boundaries (e.g. the watershed algorithm) can be represented as a region adjacency graph (RAG) [16]. We use a Maximally Stable Extremal Regions (MSER) detector [10] to generate regions, which represent the nodes of the graph. For this purpose, other region detectors (e.g. Harris-Affine, Salient Regions [11]) can also be used. The edges between the nodes define the region adjacencies. We compute color histograms on the nodes, thus obtaining an attributed graph (AG). Other region descriptors, such as SIFT [8], can also be used to encode node properties into a compact representation.

1 Austrian Research Centers GmbH - ARC, Smart Systems Division, Vienna, Austria, {nicole.artner, csaba.beleznai}@arcs.ac.at
2 PRIP, Vienna University of Technology, Austria, {salva, krw}@prip.tuwien.ac.at

When the structure of the tracked object is represented by a graph, the data association task between adjacent frames (an integral part of most tracking algorithms) becomes a graph matching problem. As graph matching is NP-complete, it is only feasible on graphs with few nodes; this usually motivates the use of multiple resolutions of the graph structure [3]. In our case, we avoid the complexity of graph matching and use the mode-seeking property of the Mean Shift algorithm for inter-frame node association. During object tracking, the color histograms of the AG and the spring-like edge energies of the structure drive a gradient-based iteration on the joint (color similarity and structure) surface, maximizing color similarity while minimizing structural energy.

The remainder of this paper is organized as follows: Section 2 gives an overview of related approaches employing structure to track objects. Section 3 explains our approach and its algorithmic steps. Section 4 presents the experiments and discusses the results. Finally, Section 5 concludes and describes future work.

2. Related work

The few approaches which use a graph-based structure representation for tracking can be grouped into three categories:

2.1. Graph-based methods using graph matching

Graphs offer a way to represent structure in a rich and compact manner. After setting up node attributes - such as size, average color and position - edges are defined to specify the spatial relationships (adjacency, border) between the nodes.
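As an illustration, such an attributed region graph can be held in ordinary containers: each node carries a color histogram and a center, and each edge records an adjacency. The region format (`pixels_xy`, `colors`) and the helper names below are hypothetical conveniences for this sketch, not part of the paper:

```python
import numpy as np

def color_histogram(colors, bins=8):
    """Normalized RGB histogram of an (n, 3) array of pixel colors."""
    hist, _ = np.histogramdd(colors, bins=(bins,) * 3, range=((0, 256),) * 3)
    hist = hist.flatten()
    return hist / hist.sum()

def build_attributed_graph(regions, adjacent):
    """Nodes: region index -> attributes (center, histogram).
    Edges: set of sorted index pairs for adjacent regions."""
    nodes = {i: {"center": r["pixels_xy"].mean(axis=0),
                 "hist": color_histogram(r["colors"])}
             for i, r in enumerate(regions)}
    edges = {tuple(sorted(pair)) for pair in adjacent}
    return nodes, edges

# Two toy regions with random pixel positions and colors, declared adjacent.
rng = np.random.default_rng(0)
regions = [{"pixels_xy": rng.uniform(0, 50, (40, 2)),
            "colors": rng.integers(0, 256, (40, 3))} for _ in range(2)]
nodes, edges = build_attributed_graph(regions, [(0, 1)])
```

In the paper the node attributes are the 3D color histograms used by the Mean Shift trackers, and the edges come from a Delaunay triangulation of the region centers.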
Graph matching methods can be used to associate structures acquired at different time instances. In [7], links between RAGs of consecutive frames are established using temporal edges, which connect the same regions in different frames. Tang and Tao [15] model tracked targets with Scale-Invariant Feature Transform (SIFT) features and represent their relationships with Attributed Relational Graphs (ARGs); the graph matching problem is solved efficiently by relaxation labeling. The quality of a RAG obtained by segmenting an image depends heavily on the characteristics of the image, and the process is usually slow. Because of this, in [14] graph matching is used where some features are detected in the image directly instead of on the RAG.

2.2. Graph-based methods not using graph matching

In these approaches, graphs are only used to represent structure, not to associate consecutive measurements. For example, Ma et al. [9] represent the spatial configuration of multiple targets by a graph and solve the association problem with a maximum a posteriori formulation. Graph matching in such a case would be highly complex, since the spatial relations between multiple individual objects may change significantly over time. A different approach to graph-based tracking is proposed by Conte et al. in [3]. They use graph pyramids to describe each frame at several levels of detail. By the use of graph pyramids, the method is able to assign labels (such as occluded, not occluded or background) to each pixel of a moving foreground region during partial occlusions.

2.3. Other approaches

Some methods use graphs to represent task-specific prior knowledge. For example, Rehg and Kanade deal in [12] with self-occluding articulated objects, applying a kinematic model to predict occlusions and using a graph with just one level. In their experiments they track the fingers of a hand, and they distinguish occlusion cases by the ordering of the finger templates relative to the camera. Related approaches attempt to recover the pose of a 3D articulated model from 2D video sequences. Sminchisescu and Triggs present in [13] an approach which uses graph-based structural constraints on human motion together with a high-dimensional search strategy.

3. Our approach

This section introduces the methods used in our approach and explains their combination.

3.1. MSER

For initializing the attributed graph (representing the object to be tracked), we use the Maximally Stable Extremal Regions (MSER) detector, developed by Matas et al. [10]. It has been evaluated [6, 11] as the most reliable interest point detector in terms of detection repeatability across various geometric transformations, image blur and photometric changes. Maximally stable extremal regions are connected components of an image thresholded according to a specific scheme. The term extremal refers to the property that all pixels within an extremal region have either higher (bright or positive extremal regions) or lower (dark or negative extremal regions) intensity values than the pixels at the region's outer boundary. The maximally stable property refers to the criterion used to select an optimum threshold for a given region.
The threshold selection criterion accepts a given threshold - and creates a maximally stable extremal region - if, in the neighborhood of the current threshold, the rate of area change has a local minimum with respect to the threshold variation. The output of the MSER algorithm is not a simple binary image: for certain parts of an image, multiple thresholds may exist (each fulfilling the criterion of maximum stability), creating in such cases a nested subset of regions. For more details refer to [10]. The MSER detector is applied to the image region containing the object, delineated manually by the user. The MSER computation is used only once, to initialize the graph structure and the Mean Shift trackers at each node. Given the extremal property of the MSER regions, the computed node-specific local histograms - used by the Mean Shift tracking - are well-defined (narrowly peaked) due to the high color uniformity within the detected regions.

3.2. Mean Shift

The Mean Shift algorithm is used to associate the nodes of the structure between adjacent image frames. Mean Shift is a robust statistical procedure which locates local density maxima in a given probability distribution. It uses a search window positioned over a section of the probability distribution; within this window, the density maximum can be estimated by a simple weighted average computation. The search window is then moved to the position of this maximum and the calculation is repeated until the algorithm converges. Convergence of the mode seeking process implies that the nearest local density maximum (mode) is found and that the Mean Shift offset becomes, after a certain number of iterations, very small.

The implementation of tracking with Mean Shift in this paper mainly follows the ideas in [2]. For every region obtained in the initialization step by the MSER algorithm, we build a target model \hat{q} in the form of a 3D color histogram. Every dimension of the histogram corresponds to one channel of the RGB color space. The histogram is subdivided into bins u = 1 \ldots m to reduce the amount of data and to cluster similar colors. The discrete distribution of color probabilities is computed according to the following formula [2]:

\hat{q}_u = C \sum_{i=1}^{n} k(\|x_i^*\|^2)\, \delta(b(x_i) - u),   (1)

where C is a normalizing factor such that

\sum_{u=1}^{m} \hat{q}_u = 1.   (2)

k in Equation (1) stands for the Epanechnikov kernel [1] and is used to control the influence of the pixels in the region on the target model: the pixels are weighted depending on their distance to the center of the region. x_i are the pixel positions in the image and x_i^* are the normalized pixel positions. b is a function mapping a pixel in the 2D image space to the 1D space of histogram bin indices: depending on the RGB value of a pixel, b provides the index of the corresponding histogram bin. \delta is the Kronecker delta function. As proposed in [2], in every frame we calculate a candidate model \hat{p} in addition to the target model \hat{q} from the initialization. The candidate model

\hat{p}_u(c) = C \sum_{i=1}^{n} k(\|x_i^*\|^2)\, \delta(b(x_i) - u)   (3)

is created from the pixels in the search window at the current position c = (c_x, c_y); Equation (3) is a reformulation of Equation (1) at position c. The candidate and target histogram models are used to compute the new position

c' = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i}   (4)

of the target object within the Mean Shift iterations.
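The target model of Equation (1) can be sketched in a few lines of numpy. The function names and the use of precomputed flat bin indices b(x_i) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def epanechnikov_profile(r2):
    """Kernel profile k(r^2): proportional to 1 - r^2 inside the unit ball, 0 outside."""
    return np.where(r2 < 1.0, 1.0 - r2, 0.0)

def target_model(positions, bin_indices, center, bandwidth, m):
    """Eq. (1): kernel-weighted color histogram q_hat over m bins.
    positions: (n, 2) pixel coordinates; bin_indices: b(x_i) in [0, m)."""
    x_star = (positions - center) / bandwidth            # normalized pixel positions
    weights = epanechnikov_profile((x_star ** 2).sum(axis=1))
    q = np.bincount(bin_indices, weights=weights, minlength=m)
    return q / q.sum()                                   # normalization C, so Eq. (2) holds

# Toy region: 200 random pixel positions with random bin indices.
rng = np.random.default_rng(1)
pos = rng.uniform(-5, 5, (200, 2))
bins_ = rng.integers(0, 16, 200)
q_hat = target_model(pos, bins_, center=np.zeros(2), bandwidth=6.0, m=16)
```

The paper's 3D RGB histogram corresponds to m = bins^3 flat indices here; pixels near the region center contribute more, pixels outside the kernel support contribute nothing.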
The pixels used to calculate the position c' are weighted according to

w_i = \sum_{u=1}^{m} \sqrt{\frac{\hat{q}_u}{\hat{p}_u(c)}}\, \delta(b(x_i) - u),   (5)

where \hat{q} and \hat{p}(c) are the target and candidate models. The obtained weight w_i denotes the probability value of a pixel within the search window.
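One Mean Shift position update (Equations 3-5) might then look as follows. This is a sketch assuming precomputed bin indices and the Bhattacharyya-derived square-root weighting of [2]; names and the toy example are illustrative:

```python
import numpy as np

def mean_shift_step(positions, bin_indices, q_hat, c, bandwidth):
    """One Mean Shift position update.
    positions: (n, 2) pixel coordinates in the search window; bin_indices: b(x_i)."""
    m = len(q_hat)
    x_star = (positions - c) / bandwidth
    k = np.maximum(1.0 - (x_star ** 2).sum(axis=1), 0.0)    # Epanechnikov profile
    p = np.bincount(bin_indices, weights=k, minlength=m)
    p = p / p.sum()                                         # candidate model, Eq. (3)
    ratio = np.sqrt(np.divide(q_hat, p, out=np.zeros(m), where=p > 0))
    w = ratio[bin_indices]                                  # pixel weights, Eq. (5)
    return (positions * w[:, None]).sum(axis=0) / w.sum()   # new position, Eq. (4)

# Toy example: the target model favors bin 0, whose pixels sit at the origin,
# so the window center is pulled from (1.5, 0) toward the bin-0 cluster.
positions = np.array([[0.0, 0.0]] * 10 + [[3.0, 0.0]] * 10)
bins_ = np.array([0] * 10 + [1] * 10)
q_hat = np.array([1.0, 0.0])
new_c = mean_shift_step(positions, bins_, q_hat, np.array([1.5, 0.0]), bandwidth=5.0)
```

Iterating this update until the offset ‖c' − c‖ falls below a small tolerance reproduces the mode seeking behavior described above.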

Figure 1. Edge relaxation examples. B' and B denote the deformed and equilibrium node locations, respectively.

3.3. Graph relaxation

The graph encoding the structural dependencies between the MSER regions is obtained using the Delaunay triangulation. Our objective is to link the processes of (1) structural energy minimization of the graph and (2) color histogram similarity maximization at the nodes by Mean Shift tracking. The graph relaxation step introduces a mechanism which - upon drift in the Mean Shift tracking results - imposes structural constraints on the Mean Shift mode seeking process. As the tracked objects are rigid, the objective of the relaxation is to keep the tracked structure as similar as possible to the initial structure; graph relaxation thus minimizes the dissimilarity between the initial and the tracked structure. This is an energy minimization problem on the total energy of the structure, E_t. The total energy of the structure in the initial state is zero, because the initial structure is considered the true object structure. During tracking, E_t usually changes because of the spatial tracking errors of the Mean Shift tracker. The structural energy E_t is computed using the concept of spring-like edges between nodes:

E_t = \sum_{e} k\,(e' - e)^2,   (6)

where e' and e denote the deformed and undeformed edge lengths. The variations of the edge lengths and their directions are used to determine a structural offset component for each node. The direction of the offset points toward the maximum descent of the structural energy function; in other words, the offset vector represents the direction in which a given node should move so that its edges restore their initial lengths and the energy of the structure is minimized.
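A minimal sketch of the spring energy of Equation (6), assuming edges are stored as node-index pairs with their initial (rest) lengths; the data layout is an illustrative assumption:

```python
import numpy as np

def total_energy(node_pos, edges, rest_length, k=0.2):
    """Eq. (6): E_t = sum over edges of k * (|e'| - |e|)^2, where |e'| is the
    current (deformed) edge length and |e| the undeformed (initial) length."""
    e_t = 0.0
    for (a, b) in edges:
        cur = np.linalg.norm(node_pos[a] - node_pos[b])
        e_t += k * (cur - rest_length[(a, b)]) ** 2
    return e_t

# A single edge of rest length 1: zero energy initially, positive when stretched.
edges = [(0, 1)]
rest = {(0, 1): 1.0}
e0 = total_energy({0: np.array([0.0, 0.0]), 1: np.array([1.0, 0.0])}, edges, rest)
e1 = total_energy({0: np.array([0.0, 0.0]), 1: np.array([2.0, 0.0])}, edges, rest)
```

With k = 0.2 as in the experiments, stretching the edge by one unit yields e1 = 0.2, while the undeformed structure has e0 = 0.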
We calculate this structural energy minimization offset vector O for each node n as follows:

O(n) = \sum_{e \in E(n)} 2k\,(e' - e)\,(-d(e, n)),   (7)

where E(n) is the set of edges e incident to node n, k is the elasticity constant of the edges in the structure, and d(e, n) is the unit vector in the direction of edge e that points toward n. Figure 1(a) shows three possible states of an edge. First, the initial state is shown. Next, a state where the edge is contracted: in this case, the offset vector O forces node B to move so that the edge is enlarged back to its initial length. In the third case the edge is too long, so O tends to contract it. Figure 1(b) shows how the sum of the offset vectors of the incident edges moves node B' to its structurally correct position B.

3.4. Combining Mean Shift and graph relaxation

In the proposed combined Algorithm 1, graph relaxation is embedded into the iterative Mean Shift tracking process. For every frame we perform Mean Shift and structural iterations until the algorithm converges, i.e. until a maximum number of iterations ɛ_i is reached or the graph structure attains equilibrium (its total energy falls beneath a threshold ɛ_e). To compute the position of each region (node), the Mean Shift offset and the structure-induced offset are combined using a mixing coefficient g. The ordering of the region selection during the iterations is randomized to minimize deterministic errors. One could instead order the regions by the confidences of their Mean Shift trackers; this would have the advantage that the iteration process does not start with an occluded region, but the drawback that it could introduce the deterministic errors mentioned above. The algorithmic combination represents a joint iterative mode seeking process on the color similarity and structural energy surfaces. As demonstrated in the next section, the joint use of Mean Shift and structural constraints significantly improves tracking in the presence of occlusions, or when multiple similarly colored nearby objects are tracked on a patterned background. The calculation of the 3D color histograms for the Mean Shift iterations represents the largest part of the computational cost; consequently, the complexity of our algorithm scales linearly with the number of regions.
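The structural offset of Equation (7) and the mixing step of Algorithm 1 can be sketched together. The constants g = 0.55 and k = 0.2 follow the experimental settings; the data layout and names are illustrative assumptions:

```python
import numpy as np

def structural_offset(n, node_pos, edges_of, rest_length, k=0.2):
    """Eq. (7): offset moving node n toward lower structural energy.
    edges_of[n] lists the edges (a, b) incident to n; rest_length maps
    sorted edge pairs to their undeformed lengths."""
    o = np.zeros(2)
    for (a, b) in edges_of[n]:
        other = b if a == n else a
        d = node_pos[n] - node_pos[other]
        cur = np.linalg.norm(d)
        d_unit = d / cur                          # unit vector pointing toward n
        rest = rest_length[(min(a, b), max(a, b))]
        o += 2 * k * (cur - rest) * (-d_unit)     # stretched edge pulls n inward
    return o

def combined_update(n, p_ms, node_pos, edges_of, rest_length, g=0.55):
    """Algorithm 1 mixing step: p_n = (1 - g) * p_ms + g * p_s."""
    p_s = node_pos[n] + structural_offset(n, node_pos, edges_of, rest_length)
    return (1 - g) * p_ms + g * p_s

# Node 1 sits one unit too far from node 0 (rest length 1), so the structural
# offset pulls it back, and the combined update mixes this with the Mean Shift
# position (here taken as the current position of node 1).
node_pos = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 0.0])}
edges_of = {1: [(0, 1)]}
rest = {(0, 1): 1.0}
o1 = structural_offset(1, node_pos, edges_of, rest)
p_n = combined_update(1, node_pos[1], node_pos, edges_of, rest)
```

Running this update for all nodes in randomized order, recomputing E_t, and repeating until E_t < ɛ_e or the iteration limit ɛ_i is reached corresponds to the inner loop of Algorithm 1.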
Algorithm 1 Mean Shift using spatial structure

1: TRACKER(V, nf, R, nr, S, g)
   (V video sequence, nf number of frames, R regions from MSER, nr number of regions, S initial structure, g mixing coefficient; ɛ_e threshold for the total energy of the structure, ɛ_i maximum number of iterations)
2: i ← 1, counter ← 1   (iteration counters)
3: while (i ≤ nf) do
4:   converged ← false
5:   while (not converged) do
6:     rir ← list of randomized indices of the regions R
7:     for ir ← 1, nr do
8:       p_ms ← Mean Shift iteration for R(rir(ir))   (p_ms: position from Mean Shift)
9:       p_s ← structural iteration for R(rir(ir))   (p_s: position from structure)
10:      calculate the new position p_n for R(rir(ir)): p_n = (1 − g) p_ms + g p_s
11:    end for
12:    E_t ← total energy of the structure
13:    counter ← counter + 1
14:    if E_t < ɛ_e or counter ≥ ɛ_i then
15:      converged ← true
16:    end if
17:  end while
18:  i ← i + 1
19: end while
20: end

4. Results and discussion

In this section, the results for one synthetic and two real video sequences are presented. In our experiments we used a Matlab implementation, which can process a frame in less than one second on a 2.8 GHz Pentium with 512 MB RAM. Note that our implementation is only a prototype, and better performance could certainly be achieved. For all sequences, the MSER algorithm is used to initialize the Mean Shift tracking process and to build the graph representing the structure. In all experiments, a mixing coefficient g of 0.55 and a spring constant k of 0.2 are used. The thresholds defined in Algorithm 1, ɛ_i and ɛ_e, are set to 4 and 0.5, respectively.

Figure 2. Tracking results for the synthetic sequence (the interesting parts of the frames were cropped). Top row (a, b, c, d): without structure. Bottom row (e, f, g, h): with structure. The black graphs are ground truth and the white graphs are the results.

Figure 3. Deviation from ground truth for the synthetic sequence. (a) Without structure. (b) With structure.

4.1. Comparison of Mean Shift with and without structure

In the synthetic sequence, the task is to track 2 homogeneous, rigidly connected regions of an image pattern. During the sequence the pattern is translated and rotated. Figure 2 shows the results with and without the use of structure; the accuracy of tracking without structure is clearly worse. Figure 3 shows the time evolution of the spatial deviation (Euclidean distance) between ground truth and the results: tracking without structure gradually loses several tracked nodes, while with structure no significant drift is present.

Figure 4. Temporal evolution of the total structural energy, (a) for the whole synthetic sequence and (b) over the iterations for one frame.

4.2. Energy evolution of the structure over time

The total energy E_t of the structure depends on the configuration of the graph. Figure 4(a) visualizes the evolution of E_t over time for the synthetic sequence (see Figure 2). Comparing Figure 4(a) with Figure 3(b) shows the coherence between energy and spatial deviation. During the iterations of one frame, the total energy of the structure is minimized as far as possible; Figure 4(b) shows the evolution of the energy during the iterations for one frame of the synthetic sequence.

4.3. Robustness against occlusions

Occlusions are a serious problem for tracking with Mean Shift: they corrupt the color distribution locally and lead to erroneous Mean Shift offsets. Figure 5 contains several frames of the synthetic video sequence with an occlusion, shown without structure (top row) and with structure (bottom row); the use of structure again produces improved results. Figure 6 displays the temporal evolution of the spatial deviations for the occluded case. Figure 7 demonstrates the robustness of Mean Shift tracking using structure against occlusions, and Table 1 summarizes the occlusion parameters and the obtained results. The synthetic sequence was used for tracking with an increasing area of the occluding block (white rectangle). The spatial deviations do not change significantly (see Figure 7(a) to (c)) even though the occlusion size grows. When the occluding region becomes too large (see Figure 7(d)), too many nodes of the graph are affected by erroneous Mean Shift measurements and the structure - while keeping the correct topology - starts to drift. The temporal evolution of the total energy shows increasing oscillations with a growing amount of occlusion (see Figure 7(c) and (d)).
These are due to the occluder-induced local perturbations. Nevertheless, the global structural constraints are able - up to a point - to restore the equilibrium structure.

4.4. Behavior in real video sequences

The two real video sequences show a checkerboard moving around in the scene. In real video sequence 1 the checkerboard is rotated, and in sequence 2 it is occluded by a hand. Both real video sequences showed that tracking with structure was successful in comparison to tracking without structure.

Figure 5. Comparison of the tracking performance during occlusion without (a, b, c, d) and with (e, f, g, h) structure. Ground truth is marked with black graphs and the results with white. The occlusion is the white rectangle.

Figure 6. Spatial deviations over time from ground truth within the synthetic sequence (55 x 55 pixels occlusion). (a) Deviations without structure. (b) Deviations with structure.

Table 1. Results for the synthetic sequence with different occlusions using structure. First column: plot index in Figure 7. Second column: occlusion size (pixels). Third column: maximum occluded area relative to the area of the pattern. Fourth column: maximum number of occluded nodes out of the 2 nodes of the graph. Fifth column: maximum spatial deviation during the sequence. Last column: highest energy.

Figure 7. Deviations from ground truth (red, thin) and evolution of E_t (black, bold) within the synthetic sequence with occlusions, using structure, for four increasing occlusion sizes (a)-(d).

The challenge in the real videos is, on the one hand, the noise and, on the other hand, the task of tracking part of a checkerboard pattern without drifting. Figures 8 and 9 show interesting frames of the sequences with and without structure.

5. Conclusion and future work

The approach proposed in this paper improves tracking stability, accuracy and robustness compared to standard Mean Shift in difficult scenes (similar objects and background) and during occlusions. The iterative concept of our approach enables real-time performance, although there is no guarantee that the algorithm converges to the global optimum, due to the local nature of the employed search mechanism. Nevertheless, the experiments show that the simultaneous use of structural and color similarity constraints produces the optimum solution in most cases. The method is easily extensible with stochastic optimization, such as particle filtering, which is planned as a further improvement. Future work will extend the proposed method to non-rigid objects (structures) and include an adaptation process for the graph representation.

Figure 8. Results for real video sequence 1 without (top row) and with (bottom row) the use of structure.

Figure 9. Results for real video sequence 2 without (top row) and with (bottom row) the use of structure.

6. Acknowledgment

Partially supported by the Austrian Science Fund under grants P18716-N13 and S913-N13.

References

[1] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. PAMI, 24(5):603-619, 2002.
[2] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. PAMI, 25(5):564-577, 2003.

[3] D. Conte, P. Foggia, J.-M. Jolion, and M. Vento. A graph-based, multi-resolution algorithm for tracking objects in presence of occlusions. In Graph-Based Representations in Pattern Recognition. Springer, 2005.
[4] D. J. Crandall and D. P. Huttenlocher. Composite models of objects and scenes for category recognition. In CVPR, pages 1-8, 2007.
[5] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. IJCV, 61(1):55-79, 2005.
[6] F. Fraundorfer and H. Bischof. A novel performance evaluation method of local detectors on non-planar scenes. In CVPR Workshop on Empirical Evaluation Methods in Computer Vision, pages 1-8, 2005.
[7] J. K. Lee, J. H. Oh, and S. Hwang. Clustering of video objects by graph matching. In IEEE International Conference on Multimedia and Expo, July 2005.
[8] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[9] Y. Ma, Q. Yu, and I. Cohen. Multiple hypothesis target tracking using merge and split of graph's nodes. In Advances in Visual Computing. Springer, 2006.
[10] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In British Machine Vision Conference, pages 384-393, 2002.
[11] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65:43-72, 2005.
[12] J. M. Rehg and T. Kanade. Model-based tracking of self-occluding articulated objects. In ICCV, Boston, 1995. IEEE.
[13] C. Sminchisescu and B. Triggs. Covariance scaled sampling for monocular 3D body tracking. In CVPR, 2001. IEEE.
[14] M. Taj, E. Maggio, and A. Cavallaro. Multi-feature graph-based object tracking. In Multimodal Technologies for Perception of Humans. Springer, 2007.
[15] F. Tang and H. Tao. Object tracking with dynamic feature graph. In International Conference on Computer Communications and Networks, pages 25-32, Washington, DC, 2005. IEEE.
[16] A. Tremeau and P. Colantoni. Regions adjacency graph applied to color image segmentation. IEEE Trans. on Image Processing, 9(4), 2000.


More information

A Comparison and Matching Point Extraction of SIFT and ISIFT

A Comparison and Matching Point Extraction of SIFT and ISIFT A Comparison and Matching Point Extraction of SIFT and ISIFT A. Swapna A. Geetha Devi M.Tech Scholar, PVPSIT, Vijayawada Associate Professor, PVPSIT, Vijayawada bswapna.naveen@gmail.com geetha.agd@gmail.com

More information

Augmented Reality VU. Computer Vision 3D Registration (2) Prof. Vincent Lepetit

Augmented Reality VU. Computer Vision 3D Registration (2) Prof. Vincent Lepetit Augmented Reality VU Computer Vision 3D Registration (2) Prof. Vincent Lepetit Feature Point-Based 3D Tracking Feature Points for 3D Tracking Much less ambiguous than edges; Point-to-point reprojection

More information

Structure Guided Salient Region Detector

Structure Guided Salient Region Detector Structure Guided Salient Region Detector Shufei Fan, Frank Ferrie Center for Intelligent Machines McGill University Montréal H3A2A7, Canada Abstract This paper presents a novel method for detection of

More information

Learning Efficient Linear Predictors for Motion Estimation

Learning Efficient Linear Predictors for Motion Estimation Learning Efficient Linear Predictors for Motion Estimation Jiří Matas 1,2, Karel Zimmermann 1, Tomáš Svoboda 1, Adrian Hilton 2 1 : Center for Machine Perception 2 :Centre for Vision, Speech and Signal

More information

Vision and Image Processing Lab., CRV Tutorial day- May 30, 2010 Ottawa, Canada

Vision and Image Processing Lab., CRV Tutorial day- May 30, 2010 Ottawa, Canada Spatio-Temporal Salient Features Amir H. Shabani Vision and Image Processing Lab., University of Waterloo, ON CRV Tutorial day- May 30, 2010 Ottawa, Canada 1 Applications Automated surveillance for scene

More information

CS 4495 Computer Vision A. Bobick. CS 4495 Computer Vision. Features 2 SIFT descriptor. Aaron Bobick School of Interactive Computing

CS 4495 Computer Vision A. Bobick. CS 4495 Computer Vision. Features 2 SIFT descriptor. Aaron Bobick School of Interactive Computing CS 4495 Computer Vision Features 2 SIFT descriptor Aaron Bobick School of Interactive Computing Administrivia PS 3: Out due Oct 6 th. Features recap: Goal is to find corresponding locations in two images.

More information

Feature Based Registration - Image Alignment

Feature Based Registration - Image Alignment Feature Based Registration - Image Alignment Image Registration Image registration is the process of estimating an optimal transformation between two or more images. Many slides from Alexei Efros http://graphics.cs.cmu.edu/courses/15-463/2007_fall/463.html

More information

Local Image Features

Local Image Features Local Image Features Ali Borji UWM Many slides from James Hayes, Derek Hoiem and Grauman&Leibe 2008 AAAI Tutorial Overview of Keypoint Matching 1. Find a set of distinctive key- points A 1 A 2 A 3 B 3

More information

Visual Tracking. Antonino Furnari. Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania

Visual Tracking. Antonino Furnari. Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania Visual Tracking Antonino Furnari Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania furnari@dmi.unict.it 11 giugno 2015 What is visual tracking? estimation

More information

Building a Panorama. Matching features. Matching with Features. How do we build a panorama? Computational Photography, 6.882

Building a Panorama. Matching features. Matching with Features. How do we build a panorama? Computational Photography, 6.882 Matching features Building a Panorama Computational Photography, 6.88 Prof. Bill Freeman April 11, 006 Image and shape descriptors: Harris corner detectors and SIFT features. Suggested readings: Mikolajczyk

More information

Object Recognition with Invariant Features

Object Recognition with Invariant Features Object Recognition with Invariant Features Definition: Identify objects or scenes and determine their pose and model parameters Applications Industrial automation and inspection Mobile robots, toys, user

More information

Image Features: Local Descriptors. Sanja Fidler CSC420: Intro to Image Understanding 1/ 58

Image Features: Local Descriptors. Sanja Fidler CSC420: Intro to Image Understanding 1/ 58 Image Features: Local Descriptors Sanja Fidler CSC420: Intro to Image Understanding 1/ 58 [Source: K. Grauman] Sanja Fidler CSC420: Intro to Image Understanding 2/ 58 Local Features Detection: Identify

More information

Salient Visual Features to Help Close the Loop in 6D SLAM

Salient Visual Features to Help Close the Loop in 6D SLAM Visual Features to Help Close the Loop in 6D SLAM Lars Kunze, Kai Lingemann, Andreas Nüchter, and Joachim Hertzberg University of Osnabrück, Institute of Computer Science Knowledge Based Systems Research

More information

Computer Vision I - Filtering and Feature detection

Computer Vision I - Filtering and Feature detection Computer Vision I - Filtering and Feature detection Carsten Rother 30/10/2015 Computer Vision I: Basics of Image Processing Roadmap: Basics of Digital Image Processing Computer Vision I: Basics of Image

More information

Global localization from a single feature correspondence

Global localization from a single feature correspondence Global localization from a single feature correspondence Friedrich Fraundorfer and Horst Bischof Institute for Computer Graphics and Vision Graz University of Technology {fraunfri,bischof}@icg.tu-graz.ac.at

More information

A Keypoint Descriptor Inspired by Retinal Computation

A Keypoint Descriptor Inspired by Retinal Computation A Keypoint Descriptor Inspired by Retinal Computation Bongsoo Suh, Sungjoon Choi, Han Lee Stanford University {bssuh,sungjoonchoi,hanlee}@stanford.edu Abstract. The main goal of our project is to implement

More information

SEARCH BY MOBILE IMAGE BASED ON VISUAL AND SPATIAL CONSISTENCY. Xianglong Liu, Yihua Lou, Adams Wei Yu, Bo Lang

SEARCH BY MOBILE IMAGE BASED ON VISUAL AND SPATIAL CONSISTENCY. Xianglong Liu, Yihua Lou, Adams Wei Yu, Bo Lang SEARCH BY MOBILE IMAGE BASED ON VISUAL AND SPATIAL CONSISTENCY Xianglong Liu, Yihua Lou, Adams Wei Yu, Bo Lang State Key Laboratory of Software Development Environment Beihang University, Beijing 100191,

More information

Bundling Features for Large Scale Partial-Duplicate Web Image Search

Bundling Features for Large Scale Partial-Duplicate Web Image Search Bundling Features for Large Scale Partial-Duplicate Web Image Search Zhong Wu, Qifa Ke, Michael Isard, and Jian Sun Microsoft Research Abstract In state-of-the-art image retrieval systems, an image is

More information

Introduction. Introduction. Related Research. SIFT method. SIFT method. Distinctive Image Features from Scale-Invariant. Scale.

Introduction. Introduction. Related Research. SIFT method. SIFT method. Distinctive Image Features from Scale-Invariant. Scale. Distinctive Image Features from Scale-Invariant Keypoints David G. Lowe presented by, Sudheendra Invariance Intensity Scale Rotation Affine View point Introduction Introduction SIFT (Scale Invariant Feature

More information

Computer Vision for HCI. Topics of This Lecture

Computer Vision for HCI. Topics of This Lecture Computer Vision for HCI Interest Points Topics of This Lecture Local Invariant Features Motivation Requirements, Invariances Keypoint Localization Features from Accelerated Segment Test (FAST) Harris Shi-Tomasi

More information

Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds

Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds 9 1th International Conference on Document Analysis and Recognition Detecting Printed and Handwritten Partial Copies of Line Drawings Embedded in Complex Backgrounds Weihan Sun, Koichi Kise Graduate School

More information

Video Google faces. Josef Sivic, Mark Everingham, Andrew Zisserman. Visual Geometry Group University of Oxford

Video Google faces. Josef Sivic, Mark Everingham, Andrew Zisserman. Visual Geometry Group University of Oxford Video Google faces Josef Sivic, Mark Everingham, Andrew Zisserman Visual Geometry Group University of Oxford The objective Retrieve all shots in a video, e.g. a feature length film, containing a particular

More information

Construction of Precise Local Affine Frames

Construction of Precise Local Affine Frames Construction of Precise Local Affine Frames Andrej Mikulik, Jiri Matas, Michal Perdoch, Ondrej Chum Center for Machine Perception Czech Technical University in Prague Czech Republic e-mail: mikulik@cmp.felk.cvut.cz

More information

Local Image Features

Local Image Features Local Image Features Computer Vision CS 143, Brown Read Szeliski 4.1 James Hays Acknowledgment: Many slides from Derek Hoiem and Grauman&Leibe 2008 AAAI Tutorial This section: correspondence and alignment

More information

Robust Online Object Learning and Recognition by MSER Tracking

Robust Online Object Learning and Recognition by MSER Tracking Computer Vision Winter Workshop 28, Janez Perš (ed.) Moravske Toplice, Slovenia, February 4 6 Slovenian Pattern Recognition Society, Ljubljana, Slovenia Robust Online Object Learning and Recognition by

More information

Human Upper Body Pose Estimation in Static Images

Human Upper Body Pose Estimation in Static Images 1. Research Team Human Upper Body Pose Estimation in Static Images Project Leader: Graduate Students: Prof. Isaac Cohen, Computer Science Mun Wai Lee 2. Statement of Project Goals This goal of this project

More information

Fuzzy based Multiple Dictionary Bag of Words for Image Classification

Fuzzy based Multiple Dictionary Bag of Words for Image Classification Available online at www.sciencedirect.com Procedia Engineering 38 (2012 ) 2196 2206 International Conference on Modeling Optimisation and Computing Fuzzy based Multiple Dictionary Bag of Words for Image

More information

Performance Evaluation of Scale-Interpolated Hessian-Laplace and Haar Descriptors for Feature Matching

Performance Evaluation of Scale-Interpolated Hessian-Laplace and Haar Descriptors for Feature Matching Performance Evaluation of Scale-Interpolated Hessian-Laplace and Haar Descriptors for Feature Matching Akshay Bhatia, Robert Laganière School of Information Technology and Engineering University of Ottawa

More information

Shape recognition with edge-based features

Shape recognition with edge-based features Shape recognition with edge-based features K. Mikolajczyk A. Zisserman C. Schmid Dept. of Engineering Science Dept. of Engineering Science INRIA Rhône-Alpes Oxford, OX1 3PJ Oxford, OX1 3PJ 38330 Montbonnot

More information

Real-Time Human Detection using Relational Depth Similarity Features

Real-Time Human Detection using Relational Depth Similarity Features Real-Time Human Detection using Relational Depth Similarity Features Sho Ikemura, Hironobu Fujiyoshi Dept. of Computer Science, Chubu University. Matsumoto 1200, Kasugai, Aichi, 487-8501 Japan. si@vision.cs.chubu.ac.jp,

More information

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking Feature descriptors Alain Pagani Prof. Didier Stricker Computer Vision: Object and People Tracking 1 Overview Previous lectures: Feature extraction Today: Gradiant/edge Points (Kanade-Tomasi + Harris)

More information

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014

SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT: SCALE INVARIANT FEATURE TRANSFORM SURF: SPEEDED UP ROBUST FEATURES BASHAR ALSADIK EOS DEPT. TOPMAP M13 3D GEOINFORMATION FROM IMAGES 2014 SIFT SIFT: Scale Invariant Feature Transform; transform image

More information

Chapter 9 Object Tracking an Overview

Chapter 9 Object Tracking an Overview Chapter 9 Object Tracking an Overview The output of the background subtraction algorithm, described in the previous chapter, is a classification (segmentation) of pixels into foreground pixels (those belonging

More information

UNSUPERVISED OBJECT MATCHING AND CATEGORIZATION VIA AGGLOMERATIVE CORRESPONDENCE CLUSTERING

UNSUPERVISED OBJECT MATCHING AND CATEGORIZATION VIA AGGLOMERATIVE CORRESPONDENCE CLUSTERING UNSUPERVISED OBJECT MATCHING AND CATEGORIZATION VIA AGGLOMERATIVE CORRESPONDENCE CLUSTERING Md. Shafayat Hossain, Ahmedullah Aziz and Mohammad Wahidur Rahman Department of Electrical and Electronic Engineering,

More information

Viewpoint Invariant Features from Single Images Using 3D Geometry

Viewpoint Invariant Features from Single Images Using 3D Geometry Viewpoint Invariant Features from Single Images Using 3D Geometry Yanpeng Cao and John McDonald Department of Computer Science National University of Ireland, Maynooth, Ireland {y.cao,johnmcd}@cs.nuim.ie

More information

CEE598 - Visual Sensing for Civil Infrastructure Eng. & Mgmt.

CEE598 - Visual Sensing for Civil Infrastructure Eng. & Mgmt. CEE598 - Visual Sensing for Civil Infrastructure Eng. & Mgmt. Section 10 - Detectors part II Descriptors Mani Golparvar-Fard Department of Civil and Environmental Engineering 3129D, Newmark Civil Engineering

More information

Comparison of Local Feature Descriptors

Comparison of Local Feature Descriptors Department of EECS, University of California, Berkeley. December 13, 26 1 Local Features 2 Mikolajczyk s Dataset Caltech 11 Dataset 3 Evaluation of Feature Detectors Evaluation of Feature Deriptors 4 Applications

More information

Scale Invariant Segment Detection and Tracking

Scale Invariant Segment Detection and Tracking Scale Invariant Segment Detection and Tracking Amaury Nègre 1, James L. Crowley 1, and Christian Laugier 1 INRIA, Grenoble, France firstname.lastname@inrialpes.fr Abstract. This paper presents a new feature

More information

ECG782: Multidimensional Digital Signal Processing

ECG782: Multidimensional Digital Signal Processing Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu ECG782: Multidimensional Digital Signal Processing Spring 2014 TTh 14:30-15:45 CBC C313 Lecture 10 Segmentation 14/02/27 http://www.ee.unlv.edu/~b1morris/ecg782/

More information

Finding people in repeated shots of the same scene

Finding people in repeated shots of the same scene 1 Finding people in repeated shots of the same scene Josef Sivic 1 C. Lawrence Zitnick Richard Szeliski 1 University of Oxford Microsoft Research Abstract The goal of this work is to find all occurrences

More information

IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES

IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES IMAGE RETRIEVAL USING VLAD WITH MULTIPLE FEATURES Pin-Syuan Huang, Jing-Yi Tsai, Yu-Fang Wang, and Chun-Yi Tsai Department of Computer Science and Information Engineering, National Taitung University,

More information

Color Image Segmentation Using a Spatial K-Means Clustering Algorithm

Color Image Segmentation Using a Spatial K-Means Clustering Algorithm Color Image Segmentation Using a Spatial K-Means Clustering Algorithm Dana Elena Ilea and Paul F. Whelan Vision Systems Group School of Electronic Engineering Dublin City University Dublin 9, Ireland danailea@eeng.dcu.ie

More information

SIFT - scale-invariant feature transform Konrad Schindler

SIFT - scale-invariant feature transform Konrad Schindler SIFT - scale-invariant feature transform Konrad Schindler Institute of Geodesy and Photogrammetry Invariant interest points Goal match points between images with very different scale, orientation, projective

More information

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation

Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation. Range Imaging Through Triangulation Obviously, this is a very slow process and not suitable for dynamic scenes. To speed things up, we can use a laser that projects a vertical line of light onto the scene. This laser rotates around its vertical

More information

The SIFT (Scale Invariant Feature

The SIFT (Scale Invariant Feature The SIFT (Scale Invariant Feature Transform) Detector and Descriptor developed by David Lowe University of British Columbia Initial paper ICCV 1999 Newer journal paper IJCV 2004 Review: Matt Brown s Canonical

More information

AK Computer Vision Feature Point Detectors and Descriptors

AK Computer Vision Feature Point Detectors and Descriptors AK Computer Vision Feature Point Detectors and Descriptors 1 Feature Point Detectors and Descriptors: Motivation 2 Step 1: Detect local features should be invariant to scale and rotation, or perspective

More information

School of Computing University of Utah

School of Computing University of Utah School of Computing University of Utah Presentation Outline 1 2 3 4 Main paper to be discussed David G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, IJCV, 2004. How to find useful keypoints?

More information

Segmentation & Clustering

Segmentation & Clustering EECS 442 Computer vision Segmentation & Clustering Segmentation in human vision K-mean clustering Mean-shift Graph-cut Reading: Chapters 14 [FP] Some slides of this lectures are courtesy of prof F. Li,

More information

CS231A Section 6: Problem Set 3

CS231A Section 6: Problem Set 3 CS231A Section 6: Problem Set 3 Kevin Wong Review 6 -! 1 11/09/2012 Announcements PS3 Due 2:15pm Tuesday, Nov 13 Extra Office Hours: Friday 6 8pm Huang Common Area, Basement Level. Review 6 -! 2 Topics

More information

Local invariant features

Local invariant features Local invariant features Tuesday, Oct 28 Kristen Grauman UT-Austin Today Some more Pset 2 results Pset 2 returned, pick up solutions Pset 3 is posted, due 11/11 Local invariant features Detection of interest

More information

Deformation Invariant Image Matching

Deformation Invariant Image Matching Deformation Invariant Image Matching Haibin Ling David W. Jacobs Center for Automation Research, Computer Science Department University of Maryland, College Park {hbling, djacobs}@ umiacs.umd.edu Abstract

More information

Video Google: A Text Retrieval Approach to Object Matching in Videos

Video Google: A Text Retrieval Approach to Object Matching in Videos Video Google: A Text Retrieval Approach to Object Matching in Videos Josef Sivic, Frederik Schaffalitzky, Andrew Zisserman Visual Geometry Group University of Oxford The vision Enable video, e.g. a feature

More information

Lecture 10 Detectors and descriptors

Lecture 10 Detectors and descriptors Lecture 10 Detectors and descriptors Properties of detectors Edge detectors Harris DoG Properties of detectors SIFT Shape context Silvio Savarese Lecture 10-26-Feb-14 From the 3D to 2D & vice versa P =

More information

SCALE INVARIANT FEATURE TRANSFORM (SIFT)

SCALE INVARIANT FEATURE TRANSFORM (SIFT) 1 SCALE INVARIANT FEATURE TRANSFORM (SIFT) OUTLINE SIFT Background SIFT Extraction Application in Content Based Image Search Conclusion 2 SIFT BACKGROUND Scale-invariant feature transform SIFT: to detect

More information

A Novel Algorithm for Color Image matching using Wavelet-SIFT

A Novel Algorithm for Color Image matching using Wavelet-SIFT International Journal of Scientific and Research Publications, Volume 5, Issue 1, January 2015 1 A Novel Algorithm for Color Image matching using Wavelet-SIFT Mupuri Prasanth Babu *, P. Ravi Shankar **

More information

Patch Descriptors. EE/CSE 576 Linda Shapiro

Patch Descriptors. EE/CSE 576 Linda Shapiro Patch Descriptors EE/CSE 576 Linda Shapiro 1 How can we find corresponding points? How can we find correspondences? How do we describe an image patch? How do we describe an image patch? Patches with similar

More information

Motion Estimation. There are three main types (or applications) of motion estimation:

Motion Estimation. There are three main types (or applications) of motion estimation: Members: D91922016 朱威達 R93922010 林聖凱 R93922044 謝俊瑋 Motion Estimation There are three main types (or applications) of motion estimation: Parametric motion (image alignment) The main idea of parametric motion

More information

Matching Local Invariant Features with Contextual Information: An Experimental Evaluation.

Matching Local Invariant Features with Contextual Information: An Experimental Evaluation. Matching Local Invariant Features with Contextual Information: An Experimental Evaluation. Desire Sidibe, Philippe Montesinos, Stefan Janaqi LGI2P - Ecole des Mines Ales, Parc scientifique G. Besse, 30035

More information

Designing Applications that See Lecture 7: Object Recognition

Designing Applications that See Lecture 7: Object Recognition stanford hci group / cs377s Designing Applications that See Lecture 7: Object Recognition Dan Maynes-Aminzade 29 January 2008 Designing Applications that See http://cs377s.stanford.edu Reminders Pick up

More information

Reinforcement Matching Using Region Context

Reinforcement Matching Using Region Context Reinforcement Matching Using Region Context Hongli Deng 1 Eric N. Mortensen 1 Linda Shapiro 2 Thomas G. Dietterich 1 1 Electrical Engineering and Computer Science 2 Computer Science and Engineering Oregon

More information

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao Motivation Image search Building large sets of classified images Robotics Background Object recognition is unsolved Deformable shaped

More information

Local Features: Detection, Description & Matching

Local Features: Detection, Description & Matching Local Features: Detection, Description & Matching Lecture 08 Computer Vision Material Citations Dr George Stockman Professor Emeritus, Michigan State University Dr David Lowe Professor, University of British

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information

The goals of segmentation

The goals of segmentation Image segmentation The goals of segmentation Group together similar-looking pixels for efficiency of further processing Bottom-up process Unsupervised superpixels X. Ren and J. Malik. Learning a classification

More information

Image Feature Evaluation for Contents-based Image Retrieval

Image Feature Evaluation for Contents-based Image Retrieval Image Feature Evaluation for Contents-based Image Retrieval Adam Kuffner and Antonio Robles-Kelly, Department of Theoretical Physics, Australian National University, Canberra, Australia Vision Science,

More information

Computer Vision I - Basics of Image Processing Part 2

Computer Vision I - Basics of Image Processing Part 2 Computer Vision I - Basics of Image Processing Part 2 Carsten Rother 07/11/2014 Computer Vision I: Basics of Image Processing Roadmap: Basics of Digital Image Processing Computer Vision I: Basics of Image

More information

CAP 5415 Computer Vision Fall 2012

CAP 5415 Computer Vision Fall 2012 CAP 5415 Computer Vision Fall 01 Dr. Mubarak Shah Univ. of Central Florida Office 47-F HEC Lecture-5 SIFT: David Lowe, UBC SIFT - Key Point Extraction Stands for scale invariant feature transform Patented

More information

Verslag Project beeldverwerking A study of the 2D SIFT algorithm

Verslag Project beeldverwerking A study of the 2D SIFT algorithm Faculteit Ingenieurswetenschappen 27 januari 2008 Verslag Project beeldverwerking 2007-2008 A study of the 2D SIFT algorithm Dimitri Van Cauwelaert Prof. dr. ir. W. Philips dr. ir. A. Pizurica 2 Content

More information

Simultaneous Recognition and Homography Extraction of Local Patches with a Simple Linear Classifier

Simultaneous Recognition and Homography Extraction of Local Patches with a Simple Linear Classifier Simultaneous Recognition and Homography Extraction of Local Patches with a Simple Linear Classifier Stefan Hinterstoisser 1, Selim Benhimane 1, Vincent Lepetit 2, Pascal Fua 2, Nassir Navab 1 1 Department

More information

Color Image Segmentation

Color Image Segmentation Color Image Segmentation Yining Deng, B. S. Manjunath and Hyundoo Shin* Department of Electrical and Computer Engineering University of California, Santa Barbara, CA 93106-9560 *Samsung Electronics Inc.

More information

HISTOGRAMS OF ORIENTATIO N GRADIENTS

HISTOGRAMS OF ORIENTATIO N GRADIENTS HISTOGRAMS OF ORIENTATIO N GRADIENTS Histograms of Orientation Gradients Objective: object recognition Basic idea Local shape information often well described by the distribution of intensity gradients

More information

2D Image Processing Feature Descriptors

2D Image Processing Feature Descriptors 2D Image Processing Feature Descriptors Prof. Didier Stricker Kaiserlautern University http://ags.cs.uni-kl.de/ DFKI Deutsches Forschungszentrum für Künstliche Intelligenz http://av.dfki.de 1 Overview

More information

3D reconstruction how accurate can it be?

3D reconstruction how accurate can it be? Performance Metrics for Correspondence Problems 3D reconstruction how accurate can it be? Pierre Moulon, Foxel CVPR 2015 Workshop Boston, USA (June 11, 2015) We can capture large environments. But for

More information

MULTIVIEW REPRESENTATION OF 3D OBJECTS OF A SCENE USING VIDEO SEQUENCES

MULTIVIEW REPRESENTATION OF 3D OBJECTS OF A SCENE USING VIDEO SEQUENCES MULTIVIEW REPRESENTATION OF 3D OBJECTS OF A SCENE USING VIDEO SEQUENCES Mehran Yazdi and André Zaccarin CVSL, Dept. of Electrical and Computer Engineering, Laval University Ste-Foy, Québec GK 7P4, Canada

More information

ROBUST OBJECT TRACKING BY SIMULTANEOUS GENERATION OF AN OBJECT MODEL

ROBUST OBJECT TRACKING BY SIMULTANEOUS GENERATION OF AN OBJECT MODEL ROBUST OBJECT TRACKING BY SIMULTANEOUS GENERATION OF AN OBJECT MODEL Maria Sagrebin, Daniel Caparròs Lorca, Daniel Stroh, Josef Pauli Fakultät für Ingenieurwissenschaften Abteilung für Informatik und Angewandte

More information

Shape Descriptors for Maximally Stable Extremal Regions

Shape Descriptors for Maximally Stable Extremal Regions Shape Descriptors for Maximally Stable Extremal Regions Per-Erik Forssén and David G. Lowe Department of Computer Science University of British Columbia {perfo,lowe}@cs.ubc.ca Abstract This paper introduces

More information

Particle Filtering. CS6240 Multimedia Analysis. Leow Wee Kheng. Department of Computer Science School of Computing National University of Singapore

Particle Filtering. CS6240 Multimedia Analysis. Leow Wee Kheng. Department of Computer Science School of Computing National University of Singapore Particle Filtering CS6240 Multimedia Analysis Leow Wee Kheng Department of Computer Science School of Computing National University of Singapore (CS6240) Particle Filtering 1 / 28 Introduction Introduction

More information

A Novel Extreme Point Selection Algorithm in SIFT

A Novel Extreme Point Selection Algorithm in SIFT A Novel Extreme Point Selection Algorithm in SIFT Ding Zuchun School of Electronic and Communication, South China University of Technolog Guangzhou, China zucding@gmail.com Abstract. This paper proposes

More information

Detecting Object Instances Without Discriminative Features

Detecting Object Instances Without Discriminative Features Detecting Object Instances Without Discriminative Features Edward Hsiao June 19, 2013 Thesis Committee: Martial Hebert, Chair Alexei Efros Takeo Kanade Andrew Zisserman, University of Oxford 1 Object Instance

More information