Scale-invariant visual tracking by particle filtering


Scale-invariant visual tracking by particle filtering

Arie Nahmani* a, Allen Tannenbaum a,b
a Dept. of Electrical Engineering, Technion - Israel Institute of Technology, Haifa 32000, Israel
b Schools of Electrical and Computer and Biomedical Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0250

ABSTRACT

Visual tracking is an important task that has received a lot of attention in recent years. Robust generic tracking tools are of major interest for applications ranging from surveillance and security to image-guided surgery. In these applications, the objects of interest may be translated and scaled. We present here an algorithm that uses scaled normalized cross-correlation matching as the likelihood within the particle filtering framework. Our algorithm does not require color or contour cues. Experimental results with constant rectangular templates show that the method is reliable in noisy and cluttered scenarios, and provides accurate and smooth trajectories in cases of target translation and scaling.

Keywords: Tracking, cross-correlation, CONDENSATION algorithm, scale-invariant, surveillance

1. INTRODUCTION

In this note, we investigate the problem of tracking arbitrary targets in video sequences. Many of the available algorithms tend to be application-specific, are appropriate for a very limited class of video sequences, and assume strong prior information about the tracked target (e.g., shape, texture, size, color, camera dynamics, or motion constraints). On the other hand, a number of more generic visual tracking algorithms search for distinctive features that can be followed from frame to frame. For these reasons, any progress on trackers for general arbitrary targets (without distinctive features) will be of interest for active vision, recognition, and surveillance applications. In the present work, we propose a video tracking framework for tracking non-articulated (blob-like) targets, which lack prominent features.
The proposed algorithm works in a variety of scenarios, and deals naturally with clutter and noise in the scenes, target scaling, and low-contrast targets. The most important assumption is that the target motion and scaling are smooth, without abrupt changes. We suppose that the target of interest is selected (by a human operator or by an automatic detection algorithm) in the first frame of the video sequence. Tracking is performed by acquiring the target's centroid trajectory in a given bounding box. We should note that this problem formulation is not new, and a large literature is available on the topic. We mention here only a few of the works most relevant to the approach taken in this paper. A comprehensive survey of visual tracking methods can be found in the paper by Yilmaz et al. [1]. A deep analysis of particle filters is provided in [2], where rigorous theory and applications of particle filters are presented. Also, a powerful application of particle filters to image sequences (the CONDENSATION algorithm) can be found in the paper by Blake and Isard [3]. Possible solutions to scale-invariant template matching are presented in [4-6]; see these works and the references therein. Although several attempts at combining area template matching with particle filtering have been made previously [7, 8], they used adaptive and learning schemes, which makes them different from the algorithm given in this paper. The remainder of this paper is organized as follows. Section 2 explains the scale-invariant template-matching problem. We briefly discuss classical template matching with the normalized cross-correlation coefficient function (NCC), and we define the concept of scaled normalized cross-correlation (SNCC). In Section 3, we consider the general problem of tracking with particle filters, and present the algorithm using measurement steps that are based on the SNCC. In Section 4, we test our algorithm on three video sequences that illustrate some of its key features.
Finally, in Section 5, we summarize our research and present our conclusions. We also discuss several problems that remain open, and propose future directions for the research.

2. SCALE-INVARIANT TEMPLATE MATCHING

Let I(m,n) denote the intensity value of the image (or the search region), and P(i,j) denote the intensity value of the template patch. We assume that the size of I is $M_x \times M_y$, and the size of P is $N_x \times N_y$; clearly, the size of I is greater than the size of P. It is known that a noisy version of the patch is placed somewhere in the image I. Our goal is to determine the most probable position of the patch in image I. The standard approach to this problem is to compute the coordinates of the maximum of the normalized cross-correlation coefficient (NCC) between the image and the template. These coordinates represent the location of the best match. The normalized cross-correlation coefficient is defined for any pixel (m,n) by:

$$\mathrm{NCC}(m,n) = \frac{\sum_{i=1}^{N_x}\sum_{j=1}^{N_y}\left(I(i+m-1,\,j+n-1) - \bar{I}(m,n)\right)\left(P(i,j) - \bar{P}\right)}{\sqrt{\sum_{i=1}^{N_x}\sum_{j=1}^{N_y}\left(I(i+m-1,\,j+n-1) - \bar{I}(m,n)\right)^2 \,\sum_{i=1}^{N_x}\sum_{j=1}^{N_y}\left(P(i,j) - \bar{P}\right)^2}} \quad (1)$$

where the mean intensities are defined by:

$$\bar{P} = \frac{1}{N_x N_y}\sum_{i=1}^{N_x}\sum_{j=1}^{N_y} P(i,j), \quad (2)$$

$$\bar{I}(m,n) = \frac{1}{N_x N_y}\sum_{i=1}^{N_x}\sum_{j=1}^{N_y} I(i+m-1,\,j+n-1), \quad (3)$$

$$m = 1,2,\ldots,M_x - N_x + 1, \qquad n = 1,2,\ldots,M_y - N_y + 1. \quad (4)$$

The values of NCC(m,n) lie between -1 and 1 (1 for a perfect match, and 0 for no correlation). This technique is used in many practical applications, and has demonstrated robustness to noise and intensity variations [9]. Unfortunately, it fails in the case of a scaling (zoom) of the desired target in the image I. The straightforward solution to this problem is to find the location of the maximum of the scaled normalized cross-correlation function (SNCC):

$$\mathrm{SNCC}(m,n,s) = \frac{\sum_{i=1}^{N_x}\sum_{j=1}^{N_y}\left(J - \bar{J}(m,n)\right)\left(P(i,j) - \bar{P}\right)}{\sqrt{\sum_{i=1}^{N_x}\sum_{j=1}^{N_y}\left(J - \bar{J}(m,n)\right)^2 \,\sum_{i=1}^{N_x}\sum_{j=1}^{N_y}\left(P(i,j) - \bar{P}\right)^2}} \quad (5)$$

where s > 0 is the scaling factor, $J = I(m + s(i-1),\, n + s(j-1))$ (if the indices are not integers, they should be rounded, or the value of J should be interpolated from the closest neighbors), $\bar{P}$ is defined in (2), and

$$\bar{J}(m,n) = \frac{1}{N_x N_y}\sum_{i=1}^{N_x}\sum_{j=1}^{N_y} I(m + s(i-1),\, n + s(j-1)) \quad (6)$$
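As an illustration, the dense search implied by equation (1) can be sketched as follows. This is a minimal NumPy sketch with names of our own choosing, not the authors' implementation; production systems typically use the FFT-based formulation of [9] instead of this double loop.

```python
import numpy as np

def ncc_map(image, patch):
    """Dense normalized cross-correlation (eq. 1).

    Returns an (Mx-Nx+1) x (My-Ny+1) score map; the argmax of the map
    is the best-match location, with a score of 1 for a perfect match.
    """
    Mx, My = image.shape
    Nx, Ny = patch.shape
    p = patch - patch.mean()                      # P(i,j) - P_bar
    p_energy = np.sqrt((p ** 2).sum())
    out = np.zeros((Mx - Nx + 1, My - Ny + 1))
    for m in range(Mx - Nx + 1):
        for n in range(My - Ny + 1):
            win = image[m:m + Nx, n:n + Ny]       # window under the patch
            w = win - win.mean()                  # I(...) - I_bar(m,n)
            denom = np.sqrt((w ** 2).sum()) * p_energy
            out[m, n] = (w * p).sum() / denom if denom > 0 else 0.0
    return out
```

Note that the code uses 0-based indices, whereas the paper's equations (1)-(4) are written with 1-based indices; the score map and its argmax are the same either way.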

In other words, the template patch is compared to a scaled version of the image I, and the best match is found. Since the number of possible scalings is infinite, even an approximate solution over a grid of scales can be very computationally demanding, and is not appropriate for real-time applications. We propose to overcome this problem by assuming that the scale does not change abruptly, so that it can be modeled as a simple Markov process, e.g., for frame k:

$$s_k = s_{k-1} + v_k; \qquad v_k \sim N(0, \sigma_s^2); \qquad s_0 = 1 \quad (7)$$

Remarks: 1) One should make sure that $s_k$ remains positive for each frame. 2) If some prior knowledge about changes in scale is available, it can be incorporated into the model by modifying the distribution of $v_k$. For example, if we suppose that most of the time the scale does not change, then we should choose a truncated normal distribution added to a delta distribution at s = 0. This definition fits well into the particle filtering framework, and makes the problem tractable. Furthermore, we are interested only in non-negative values of the SNCC, so we use the half-wave rectified scaled cross-correlation, in which the negative values are replaced by zeros. In the next section, we combine the advantages of the SNCC and particle filtering techniques.

3. PARTICLE FILTERING

3.1 Particle filtering

Our tracker is based on the CONDENSATION algorithm proposed by Isard and Blake [3]. In this section, a short overview of the algorithm is given, and the application to scale-invariant tracking is presented. The algorithm uses the SNCC as the likelihood for determining the target's position. We refer the reader to [2] for complete background on particle filtering. In general, the goal of particle filtering is to estimate the sequence of hidden state parameters $X_k$ based only on the observed data $Z_k$. These estimates follow from the posterior distribution $P(X_k \mid Z_0, Z_1, \ldots, Z_k)$. It is assumed that the state and the observations are first-order Markov processes, and that each $Z_k$ depends only on $X_k$.
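A single SNCC evaluation at a hypothesized position and scale, as in equation (5), can be sketched as follows. This is our own illustrative code (0-based indices, nearest-neighbor rounding for non-integer indices, half-wave rectified as described above), not the authors' implementation.

```python
import numpy as np

def sncc(image, patch, m, n, s):
    """Half-wave rectified scaled NCC (eq. 5) at position (m, n), scale s > 0.

    Samples J(i, j) = I(m + s*i, n + s*j) (0-based) with nearest-neighbor
    rounding, clipped to the image borders, then correlates J against the
    template patch P. Negative correlations are clipped to zero.
    """
    Nx, Ny = patch.shape
    ii = np.clip(np.round(m + s * np.arange(Nx)).astype(int), 0, image.shape[0] - 1)
    jj = np.clip(np.round(n + s * np.arange(Ny)).astype(int), 0, image.shape[1] - 1)
    J = image[np.ix_(ii, jj)]                 # scaled window, same size as P
    Jc = J - J.mean()                         # J - J_bar(m,n)
    Pc = patch - patch.mean()                 # P - P_bar
    denom = np.sqrt((Jc ** 2).sum() * (Pc ** 2).sum())
    if denom == 0:
        return 0.0
    return max(0.0, float((Jc * Pc).sum() / denom))
```

At s = 1 this reduces to the plain NCC of equation (1) evaluated at (m, n).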
The particle filter estimates the distribution $P(X_k \mid Z_0, Z_1, \ldots, Z_k)$, and it does not require any linearity or Gaussianity assumptions on the model. The particle filter generates a set of N samples that approximate the filtering distribution. For the k-th frame, we denote the state vector by $X_k = (x_1, x_2, \ldots)$. For example, the state can be the top-left corner coordinates of the desired target ($x_1 = x$, $x_2 = y$) in the frame, and its scaling ($x_3 = s$). Additionally, the state can include the velocity and acceleration of the target. The state estimate is obtained recursively as follows:

$$P(X_k \mid Z_0, Z_1, \ldots, Z_k) \propto P(Z_k \mid X_k)\, P(X_k \mid X_{k-1})\, P(X_{k-1} \mid Z_0, Z_1, \ldots, Z_{k-1}) \quad (8)$$

where the likelihood is given by the SNCC:

$$P(Z_k \mid X_k) = P^{(\mathrm{SNCC})}(Z_k \mid X_k) \propto \mathrm{SNCC}_k \quad (9)$$

The prediction step, which corresponds to the distribution $P(X_k \mid X_{k-1})$, is governed by the system state dynamical equations. For example, if the state is assumed to evolve smoothly in time, and there is no additional information about the target dynamics, then the simplest model, given by

$$X_k = X_{k-1} + v_k, \qquad v_k \sim N(0, \Sigma) \quad (10)$$

is often appropriate. The mean of $X_k$ over all the particles approximates the actual value of $X_k$.

3.2 The algorithm

The state estimation is carried out by updating weighted particles according to (8). The algorithm steps are summarized as follows.

INITIALIZATION: The N particles $X_0^{(n)}$, $(n = 1, \ldots, N)$ are drawn from the uniform distribution, or selected by the operator. For each video frame (the k-th frame), we perform the following steps:

STEP 1: Using the particles from the previous frame, predict the new state by sampling from:

$$X_k^{(n)} \sim P(X_k \mid X_{k-1}^{(n)}). \quad (11)$$

STEP 2: Measure and weight the new position in terms of the measured features $Z_k$:

$$w_k^{(n)} = P^{(\mathrm{SNCC})}(Z_k \mid X_k^{(n)}), \qquad \sum_{n=1}^{N} w_k^{(n)} = 1. \quad (12)$$

STEP 3: Resample the particles $X_k^{(n)}$, $(n = 1, \ldots, N)$ according to the weights $w_k^{(n)}$.

STEP 4: Compute the state estimate from:

$$\hat{X}_k = \frac{1}{N}\sum_{n=1}^{N} X_k^{(n)}, \quad (13)$$

and repeat steps 1-4 for the next video frame. The result of this algorithm is the estimated state $\hat{X}_k$, which contains the position and scaling of the tracked target in every video frame.

4. EXPERIMENTAL RESULTS AND DISCUSSION

We tested the proposed algorithm in various situations, including highly cluttered exterior scenes with shadows and partial occlusions, with a high rate of success. A single template was used for every video. We chose the simplest motion model (10). We selected the target manually in the first video frame, and tracked the targets with 60 particles. The video resolution is 240x320, and the frame rate is 25 frames per second.
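Steps 1-4 of Section 3.2 can be sketched as a single per-frame update. This is a hedged illustration under our own naming conventions, not the authors' code: it assumes a caller-supplied `likelihood` function that maps a state vector (x, y, s) to its non-negative, half-wave rectified SNCC score, and the random-walk dynamics of equation (10).

```python
import numpy as np

def condensation_step(particles, likelihood, sigma, rng):
    """One frame of a CONDENSATION-style update (steps 1-4 of Sec. 3.2).

    particles : (N, d) array of states, e.g. (x, y, s) per particle.
    likelihood: maps one state vector to a non-negative score (the SNCC).
    sigma     : (d,) std. deviations of the random-walk dynamics (eq. 10).
    Returns (resampled particles, state estimate X_hat).
    """
    N = len(particles)
    # Step 1: predict by sampling from P(X_k | X_{k-1}) -- a random walk.
    pred = particles + rng.normal(0.0, sigma, size=particles.shape)
    pred[:, -1] = np.maximum(pred[:, -1], 1e-3)   # keep the scale positive
    # Step 2: weight each particle by the measurement likelihood (eq. 12).
    w = np.array([likelihood(p) for p in pred])
    total = w.sum()
    w = w / total if total > 0 else np.full(N, 1.0 / N)
    # Step 3: resample with replacement according to the weights.
    idx = rng.choice(N, size=N, p=w)
    resampled = pred[idx]
    # Step 4: the state estimate is the mean over the particles (eq. 13).
    x_hat = resampled.mean(axis=0)
    return resampled, x_hat
```

Starting from particles drawn uniformly over the frame (the INITIALIZATION step), repeating this update once per frame yields the estimated trajectory of the target's position and scale.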

Figure 1: Maneuvering vehicle sequence with the tracking results.

4.1 Sequence 1: Maneuvering Vehicle

In the first sequence, we want to track a vehicle. Despite the significant zoom and the moving camera, our tracker manages to follow the target (see Figure 1). This video represents a challenging scenario for tracking in outdoor conditions.

4.2 Sequence 2: Boat

In the second sequence, a boat is tracked. The contrast of the boat with the background is so low that following the boat is hard even for a human observer (see Figure 2). Additionally, the scene is very noisy (water glare and the plume behind the boat). The tracker manages to overcome these problems. Although in frame 798 the tracker has a wrong estimate of the scale (because of bad measurements), the algorithm reestablishes the correct estimate after a few frames.

Figure 2: The boat sequence with the tracking results.

4.3 Sequence 3: A Crowded Party

In this sequence, we want to track a single person in a large crowd. The tracking results are shown in Figure 3. In frame 83, the person is tracked despite variations in form and partial occlusion. In frames 115-125, a full occlusion occurs. At frame 123, our tracker temporarily loses track and the scaling is wrong. Nevertheless, the tracker finds the right position after the person reappears. We note that for all sequences, we used a simple target dynamics model and a constant template. We assumed that no additional information about the target is given besides the template. With learned higher-order models and a smoothly changing adaptive template, we expect to get even better results with the same algorithm.

Figure 3: Crowded party sequence with the tracking results.

5. CONCLUSION

In this paper, we presented an algorithm for tracking scaled and translated targets in video sequences without the need for adaptation and learning mechanisms. Using a rather low-dimensional state space, we achieve robust tracking results on many complicated and cluttered real-world video sequences, including sequences with a moving camera. The combination of the particle filter with a correlation tracker makes it possible to obtain smooth target trajectories. The algorithm can cope with translations, and with moderate deformations of the tracked target when the deformations affect only a small portion of the pixels in the template. The algorithm is also appropriate for small targets with low contrast. It is time efficient, and should be suitable for real-time applications. The disadvantage of our approach is that it is not capable of tracking targets subjected to large rotations. The problems of partial and full occlusions should also be addressed. The next step in our research will be to add rotation states to the particle filter definition, and to choose good dynamic models for rotation, in order to achieve rotation-invariant tracking. In addition, other types of correlation measures should be tested. Finally, the algorithm should be extended to multiple-target tracking.

REFERENCES

[1] Yilmaz, A., Javed, O., and Shah, M., "Object Tracking: A Survey," ACM Computing Surveys, Vol. 38(4), (2006).
[2] Doucet, A., de Freitas, N., and Gordon, N., Sequential Monte Carlo Methods in Practice, Springer, (2001).
[3] Isard, M., and Blake, A., "CONDENSATION - Conditional Density Propagation for Visual Tracking," International Journal of Computer Vision, Vol. 29(1), pp. 5-28, (1998).
[4] Cahn von Seelen, U.M., and Bajcsy, R., "Adaptive Correlation Tracking of Targets with Changing Scale," Reconnaissance, Surveillance, and Target Acquisition for the Unmanned Ground Vehicle, Morgan Kaufmann Publishers, San Francisco, CA, pp. 313-322, (1997).
[5] Zhao, F., Huang, Q., and Gao, W., "Image Matching by Normalized Cross-Correlation," ICASSP Proceedings, (2006).
[6] Ooi, J., and Rao, K., "New Insights Into Correlation-Based Template Matching," Proceedings of SPIE, Vol. 1468, pp. 740-751, (1991).
[7] Mei, X., Zhou, S.K., and Porikli, F., "Probabilistic Visual Tracking via Robust Template Matching and Incremental Subspace Update," IEEE International Conference on Multimedia and Expo, pp. 1818-1821, (2007).
[8] Zhou, S., Chellappa, R., and Moghaddam, B., "Appearance Tracking Using Adaptive Models in a Particle Filter," Proc. of Asian Conf. on Computer Vision, (2004).
[9] Lewis, J.P., "Fast Normalized Cross-Correlation," Vision Interface, Quebec, Canada, pp. 120-123, (1995).