VISUAL tracking plays an important role in signal processing


IEEE TRANSACTIONS ON CYBERNETICS

Correlation Filter Learning Toward Peak Strength for Visual Tracking

Yao Sui, Guanghui Wang, and Li Zhang

Abstract: This paper presents a novel visual tracking approach that learns a correlation filter toward the peak strength of the correlation response. Previous methods leverage all features of the target and the immediate background to learn a correlation filter. Some features, however, may be distractive to tracking, such as those from occlusion and local deformation, resulting in unstable tracking performance. This paper addresses this issue and proposes a novel algorithm to learn the correlation filter. By imposing an elastic net constraint on the filter, the proposed approach can adaptively eliminate those distractive features in the correlation filtering. A new peak strength metric is proposed to measure the discriminative capability of the learned correlation filter. It is demonstrated that the proposed approach effectively strengthens the peak of the correlation response, leading to more discriminative performance than previous methods. Extensive experiments on a challenging visual tracking benchmark demonstrate that the proposed tracker outperforms most state-of-the-art methods.

Index Terms: Correlation filtering, elastic net, kernel method, regression, visual tracking.

I. INTRODUCTION

Visual tracking plays an important role in signal processing and computer vision with various applications, such as video processing, motion analysis, and unmanned control systems. Visual tracking, in general, is classified into single object tracking and multiple object tracking. The two are associated with different applications and research methodologies. This paper addresses single object tracking. Recent years have witnessed rapid development in visual tracking [1], [2]. The performance of visual trackers has improved significantly in terms of accuracy, robustness, and running speed.
Manuscript received September 5, 2016; revised January 15, 2017 and March 30, 2017; accepted April 1, This work was supported in part by the National Aeronautics and Space Administration LEARN II Program under Grant NNX15AN94N, in part by the New Faculty General Research Fund of the University of Kansas under grant , in part by the National Natural Science Foundation of China (NSFC) under Grant and Grant , and in part by the Joint Fund of Civil Aviation Research by the NSFC and Civil Aviation Administration under Grant U This paper was recommended by Associate Editor W. Hu. (Corresponding author: Yao Sui.)
Y. Sui and G. Wang are with the Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS USA (suiyao@gmail.com; ghwang@ku.edu).
L. Zhang is with the Department of Electronic Engineering, Tsinghua University, Beijing , China (chinazhangli@tsinghua.edu.cn).
Color versions of one or more of the figures in this paper are available online at
Digital Object Identifier /TCYB

Some challenges, however, such as heavy occlusions, nonrigid deformations, illumination changes, scale variations, background clutter, and in-plane/out-of-plane rotations, still hinder the practical application of visual tracking.

Recently, there has been significant interest in developing correlation filtering-based [3] visual trackers. Under this paradigm, a correlation filter is learned from the temporally obtained targets and the neighboring background. The target is identified as the region that has the strongest response to the learned filter when a correlation is performed within a search area around the possible target location. Note that the target localization is in effect a brute-force search within a local region using a sliding window, and is thus computationally expensive for tracking. Fortunately, by Parseval's identity, the correlation can be implemented in the frequency domain using the Fourier transform.
As a result, the computational complexity is reduced to $O(n \log n)$ for a target of size $n \times n$ pixels. This implementation greatly speeds up visual tracking, leading to a high-speed visual tracker [3]. However, the properties of tracking that can facilitate tracking in challenging situations are neglected in [3]. A circulant structure of tracking is exploited in [4] to improve the tracking performance in various challenging cases. The samples used to learn the correlation filter are represented by cyclic shifts of a base sample, which is equivalent to dense sampling. Furthermore, the correlation filtering is interpreted from the perspective of ridge regression, resulting in a typical discriminative tracking model that focuses on distinguishing the target from its surrounding background. The work in [4] significantly improves the tracking performance in various challenging situations by taking account of the circulant structure of tracking, while achieving a high running speed. However, it is unstable in the presence of scale variations because the size of the learned filter is fixed during tracking. Although several studies [5]–[7] design strategies to estimate the target scale, they do so at the cost of reduced tracking speed.

The correlation filtering approach localizes the target at the position where the maximum (peak) of the filter response (correlation output) appears over a search region. Motivated by the fact that the target localization depends only on the relative response values of the candidate regions, rather than the absolute values, we aim at learning a correlation filter that achieves a peak-strengthened (PS) filter response. With the strengthened peak, the target is represented more discriminatively against the background by the correlation filter. In this paper, we construct a robust correlation filter that has strong peak strength over the search region by introducing the intrinsic structure of visual tracking, while ensuring that the tracker runs fast.

© 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

Note that the correlation in all existing methods, such as [3]–[12], is applied over all the pixels within a search region. However, some of these pixels may be distractive, such as those from an occlusion or a significant nonrigid deformation. These distractions influence the filter response over the search region, leading to a weak peak strength and, in turn, to inaccurate target localization. Thus, a correlation approach that can adaptively ignore the distractive regions or pixels is required to enhance the peak strength of the filter response. To this end, a sparsity constraint [13] on the filter to be learned is desired to reduce the response from these distractive pixels by zeroing the corresponding entries of the correlation results. From a regression point of view (see footnote 1), such a sparsity-based correlation filter corresponds to the LASSO [14] in the case that the training samples are obtained by the cyclic shift of a base sample. Although LASSO performs well on feature (pixel) selection, it is unable to group the features in the regression. Note that, in visual tracking, the distractive pixels usually appear densely in several regions. Thus, a region-level selection strategy is preferred over a pixel-level selection. As a result, an elastic net regularization [15], which integrates the squared and the sparsity constraints, is enforced on the correlation filter.

In brief, from a filtering perspective, we design a robust correlation filter to augment the peak strength of the response, while from a tracking-by-detection perspective, we construct a feature-adaptive regressor via an elastic net regularization [15] to reliably separate the target from the surrounding background.
Moreover, we define a new metric to quantitatively demonstrate the strengthened peak of the filter response from a large number of empirical results, and reveal how the proposed correlation filter works. Furthermore, we leverage a multiresolution approach to incorporate scale variations into the correlation filter, which can accurately estimate the scale changes of the target appearance in each frame. Extensive experimental results demonstrate the effectiveness of the proposed correlation filter, which significantly improves the tracking performance in various challenging situations against most state-of-the-art trackers.

The remainder of this paper is organized as follows. In Section II, the work related to ours is reviewed. In Section III, the proposed approach is elaborated in detail. The experimental results are reported in Section IV. Finally, we conclude this paper in Section V.

II. RELATED WORK

A. General Tracking Algorithms

Visual tracking models are in general classified into two categories: 1) generative and 2) discriminative. The generative tracking model focuses on searching for a region that best matches a learned target model, while the discriminative tracking model regards tracking as a binary classification and aims to train a classifier that can separate the target from the background.

Footnote 1: The equivalence between correlation filtering and ridge regression is addressed in [4] when applying the cyclic shift to the training samples.

Generative tracking models have good generalization capability while requiring only a relatively small number of training samples. There is an extensive literature on the generative tracking model, including subspace learning [16]–[19], sparse representation [20]–[26], low-rank approximation [27]–[30], tensor subspace [31], [32], log-Euclidean Riemannian subspace [33], Gaussian process regression [34], histogram matching [35], [36], fragment strategy [37], graph model [38], and compact representation [39].
Subspace learning is a popular generative tracking model because of the inherent characteristics of the targets. Ross et al. [16] introduced incremental principal component analysis (PCA) to visual tracking. Kwon and Lee [17] employed a sparse PCA method to decompose visual tracking into different submodules. Sui et al. [18] constructed a structured subspace that can deal with heavy distractions during tracking. Sui et al. [19] proposed a tracking approach that leveraged matrix completion to maintain a latent subspace for visual tracking.

Sparse representation is usually adopted with subspace learning to improve the robustness of the tracker. Mei and Ling [20] introduced sparse representation to visual tracking. Zhang et al. [21] used a joint sparse representation to speed up [20]. Wang et al. [24], [40] formulated the target by a subspace and modeled the occlusions by a sparse error. Zhang et al. [41] exploited the circulant structure within the sparse representation.

Tensor subspace is a straightforward extension of the subspace learning approach. It treats the target in the image domain. Li et al. [31] and Hu et al. [32] proposed incremental tensor subspace learning for visual tracking. Wang and Lu [42] used a 2-D PCA method to construct the target subspace.

The graph model is also a popular approach to formulating visual tracking. Li et al. [38] proposed a tracking algorithm via random walks on a graph. Sui et al. [28] constructed a low-rank graph to conduct visual tracking.

Discriminative tracking models have achieved impressive performance in recent years. Various methods have been developed based on this tracking model, including support vector machine (SVM) [43]–[47], boosting [48]–[50], compressive sensing [51], superpixel [52], correlation filtering [3], [4], [10], [12], structural learning [53], [54], multiple instance learning [55], segmentation [56], hashing [57], and deep learning [58]–[62].
Avidan [43] used an SVM classifier to separate the target from its surrounding background. Hare et al. [63] proposed a structured output SVM for visual tracking. Babenko et al. [55] leveraged multiple instance learning to solve the sample label ambiguity problem in the discriminative tracking model. Avidan [48] employed a boosting approach to combine several weak classifiers into a strong classifier to solve the tracking problem. Grabner and Bischof [64] proposed an online learning algorithm using a boosting classifier. Zhang et al. [51] represented the target and the background by compressive sensing and separated them by an online learned Bayesian classifier. Bolme et al. [3] introduced the correlation filtering method to visual tracking. Henriques et al. [4] exploited the circulant structure within the correlation filtering. Fan et al. [56] designed a segmentation-based tracking algorithm to distinguish the target from

its surrounding background. Recently, deep learning has been extensively used in visual tracking and achieves impressive performance. Zhang et al. [65] proposed a tracking algorithm via convolutional networks with training. Wang et al. [59] employed fully convolutional networks to conduct visual tracking. Ma et al. [58] analyzed the deep features from different layers of the convolutional networks.

B. Tracking Algorithms Based on Correlation Filtering

Recently, there has been significant interest in the design of correlation filtering-based tracking algorithms. In terms of both tracking accuracy and running speed, the correlation filtering-based visual trackers achieve state-of-the-art results. Under this paradigm, a correlation filter is learned from the previously obtained targets and their surrounding background. The learning problem can be exactly transferred to the frequency domain by the Fourier transform and Parseval's identity. Consequently, the filter learning is computationally efficient in the frequency domain. This paradigm is a discriminative model, and the target is located by means of tracking-by-detection. The target is detected by a correlation operation in the frequency domain over a region containing the possible target location, which ensures the high running speed of this paradigm. Finally, the target is localized at the location where the peak of the filter response appears.

Bolme et al. [3] introduced this paradigm to visual tracking and achieved a high-speed visual tracker. Henriques et al. [4] exploited the circulant structure of the training samples to approximate locally dense sampling.
They also explained the correlation filtering in visual tracking from a regression perspective, i.e., the correlation filtering is equivalent to a ridge regression over the target and its surrounding background, and the target is assigned the largest regression value. In their substantial work [10], they theoretically proved the connection between the correlation filtering and the ridge regression and extended their circulant structure to a high-dimensional nonlinear feature space by kernel tricks. Sui et al. [12] raised the problem that ridge regression with a squared loss function may lead to overfitting when training the correlation filter over the circulant structure. They proposed to leverage robust loss functions to compensate for the possible overfitting and explained their approach from both robust regression and anisotropic filter response perspectives.

Note that, in the above-mentioned approaches, maintaining a correlation filter of a fixed size during tracking to ensure a high running speed also weakens the scale adaptivity, because the target is always detected at the fixed size of the filter itself. These approaches are thus unable to deal with scale variations of the target. Some studies aimed at estimating the scale of the target within this paradigm by using a multiresolution approach [6] and a motion model [5], [7]. However, the scale estimations significantly increase the computational load. Thus, a balance between the scale estimation and the running speed should be carefully considered for a robust correlation filter.

Motivated by previous success, our approach is conducted within the framework that exploits the circulant structure of training samples with the kernel tricks [10].

Fig. 1. Illustration of the cyclic shift. (a) Base image patch. (b) Cyclic shift of the base image by ±15 pixels in both x and y directions.
Before presenting our approach in detail, it is helpful to clarify the difference between our method and similar ones. From the regression perspective, a regression formulation includes two parts: 1) a data fitting term governed by a loss function and 2) a regularization term that induces certain properties in the regressor. Sui et al. [12] improved the data fitting performance by employing different robust loss functions, leading to a robustness-promoted correlation filter. Different from their method, our approach imposes an elastic net constraint as the regularization term, following a PS design for the filter response, to avoid acquiring distractive information, which easily causes tracking failure, during the correlation filter learning. In addition, we also explain our approach from a feature selection point of view. Note that there are extensive studies on feature selection-based visual tracking algorithms. Sui et al. [18] proposed a sparsity-induced subspace learning method to exclude the distractive features during tracking. Zhang et al. [66] leveraged a feature selection strategy to improve the multiple instance learning framework, leading to an effective discriminative tracking model.

III. PROPOSED APPROACH

Given a target location, we generate the training samples around the target and its immediately surrounding background by a pixel-wise sliding method. This can be efficiently implemented by a cyclic shift of a base image patch [4], as illustrated in Fig. 1. The base image in this paper is a spatially expanded region of the target. We stack the base image into a column vector, and denote it by $x$. The training samples, denoted by the matrix $X$, each row of which is a sample, are then composed of the full cyclic shifts of $x$.
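As a hedged illustration of the cyclic-shift construction, the following one-dimensional NumPy sketch stacks all shifts of a base vector into a sample matrix and checks that multiplying by this matrix matches an FFT-based circular correlation. The function name `cyclic_sample_matrix` and the toy sizes are assumptions made for this example; the paper itself works with two-dimensional image patches.

```python
import numpy as np

def cyclic_sample_matrix(x):
    """Return the matrix whose i-th row is the base sample x cyclically shifted by i."""
    return np.stack([np.roll(x, i) for i in range(x.size)])

rng = np.random.default_rng(0)
x = rng.standard_normal(8)   # toy 1-D "base image" (hypothetical data)
w = rng.standard_normal(8)   # arbitrary filter coefficients

X = cyclic_sample_matrix(x)

# Dense evaluation: one inner product per shifted sample, O(n^2) overall.
dense = X @ w

# The same values from FFTs in O(n log n), because X is circulant and x is real.
fast = np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)).real
```

The dense product and the FFT route agree to numerical precision, which is why the sample matrix never has to be formed explicitly in practice.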
Note that the sample matrix $X$ has a good property [67], that is

$$X = D\,\mathrm{diag}(\hat{x})\,D^{H} \tag{1}$$

where $D$ denotes the discrete Fourier transform (DFT) matrix, the hat $\hat{\ }$ stands for the DFT here and hereafter, and $H$ denotes the transpose and complex-conjugate. It indicates that the sample matrix $X$ can be efficiently represented in the frequency domain. The goal of our approach is to construct an efficient and effective discriminative tracking model over the samples $X$ to separate the target from its surrounding background.

A. Problem Statement

To distinguish the target from the background, a linear regressor can be employed, such that the target is localized in the region where the largest regression value appears. The regressor is simply trained by

$$\min_{w} \|y - Xw\|_2^2 \tag{2}$$

where $w$ denotes the linear coefficients, and $y$ contains the regression values of the samples $X$. This is the well-known least squares regression and has a closed-form solution. Note, however, that all features of the samples are used to train the linear regressor in (2). Consequently, some distractive features, such as the pixels from occlusions or nonrigid deformations, may significantly degrade the accuracy of the regressor. For this reason, the regressor is expected to adaptively ignore these distractive features. To this end, sparsity on $w$ is desired to promote the robustness of the regressor, because zeros located at the distractive features eliminate their contributions to the regression values. Thus, the regressor is trained by

$$\min_{w} \|y - Xw\|_2^2 + \tau \|w\|_1 \tag{3}$$

where $\|\cdot\|_1$ denotes the $\ell_1$-norm that returns the sum of the absolute values of all elements, and $\tau > 0$ is a weight parameter. The regressor in (3) is known as the LASSO [14]. The features that facilitate the regression are adaptively selected by the nonzeros of $w$, while the remaining features are ignored due to the zeros of $w$. Note that LASSO fails to group the features for the selection. However, in tracking, the distractive features, like occlusions, usually appear in several regions (i.e., groups of pixels). Thus, a grouping strategy is required to complement the sparsity of the regressor. As a result, the regressor is reformulated as

$$\min_{w} \|y - Xw\|_2^2 + \lambda \|w\|_2^2 + \tau \|w\|_1 \tag{4}$$

where $\lambda > 0$ denotes a weight parameter. In (4), the squared regularization $\|w\|_2^2$ is used to group the features. This is also known as the elastic net regression [15]. To further enhance the regression, a popular method, known as the kernel trick, is employed to transform the samples into a high-dimensional and nonlinear space, such that the regressor has a good nonlinearly separable capability. Let $\phi(\cdot)$ denote a nonlinear function, and $\alpha$ denote the dual conjugate of $w$, such that $w = \sum_i \alpha_i \phi(x_i)$.
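To make the move to the dual space concrete, the short derivation below substitutes the expansion of $w$ into the data-fitting and squared terms of (4). The symbol $\Phi$, denoting the matrix whose rows are $\phi(x_i)^T$, is introduced here only for illustration and is not part of the paper's notation.

```latex
\begin{aligned}
w &= \sum_i \alpha_i \phi(x_i) = \Phi^{T}\alpha, \qquad K = \Phi\Phi^{T},\\
\Phi w &= \Phi\Phi^{T}\alpha = K\alpha,\\
\|w\|_2^2 &= \alpha^{T}\Phi\Phi^{T}\alpha = \alpha^{T}K\alpha .
\end{aligned}
```

The residual of the mapped samples therefore becomes $y - K\alpha$ and the squared regularizer becomes $\alpha^{T}K\alpha$, both of which depend on the samples only through the inner products $\phi(x_i)^{T}\phi(x_j)$.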
The regression problem in (4) is described in its dual space as

$$\min_{\alpha} \|y - K\alpha\|_2^2 + \lambda \alpha^{T} K \alpha + \tau \|\alpha\|_1 \tag{5}$$

where $K$ denotes the kernel matrix, whose elements are $k_{ij} = \phi(x_i)^{T}\phi(x_j)$. The above regression is conducted in the high-dimensional and nonlinear space defined by $\phi(\cdot)$ over the adaptively selected features. We will show in Section III-E that this regression is equivalent to a correlation filtering with a strengthened peak of the filter response. According to [10], similar to $X$, the matrix $K$ from some kernels, such as Gaussian or polynomial, also has a circulant structure, and can be diagonalized as

$$K = D\,\mathrm{diag}(\hat{k}_1)\,D^{H} \tag{6}$$

where $k_1$ denotes the first row of $K$. In fact, the kernel matrix $K$ is obtained from the full cyclic shifts of its first row $k_1$.

B. Correlation Filter Learning

The problem in (5) involves joint minimization of both the squared form and the $\ell_1$-norm with respect to $\alpha$. For the convenience of computation, we relax it by introducing another variable $\beta$

$$\min_{\alpha,\beta} \|y - K\alpha\|_2^2 + \lambda \alpha^{T} K \alpha + \tau \|\beta\|_1 + \mu \|\alpha - \beta\|_2^2 \tag{7}$$

where $\mu > 0$ is a weight parameter. Note that (7) is convex with respect to $\alpha$ when $\beta$ is fixed, and vice versa. This allows us to develop an iterative algorithm, like block coordinate descent, to approximate the solution to (5). Thus, the two subproblems with respect to $\alpha$ and $\beta$, respectively, are presented as follows:

$$\min_{\alpha} \|y - K\alpha\|_2^2 + \lambda \alpha^{T} K \alpha + \mu \|\alpha - \beta\|_2^2 \tag{8}$$

$$\min_{\beta} \tau \|\beta\|_1 + \mu \|\alpha - \beta\|_2^2. \tag{9}$$

Note that (8) only contains the squared forms of $\alpha$. Thus, through least squares, it has a globally unique solution with the closed form

$$\alpha = \left(K^{H}K + \lambda K + \mu I\right)^{-1}\left(K^{H}y + \mu\beta\right) \tag{10}$$

where $I$ denotes the identity matrix. Note that (10) involves a matrix inversion, whose computational complexity is typically $O(n^3)$ for an $n \times n$ matrix. Such a complexity cannot satisfy the speed requirement of tracking. Fortunately, by combining (6), the above problem can be transformed into the Fourier domain and solved efficiently with a complexity of $O(n \log n)$.
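The saving from the circulant structure can be checked numerically. The sketch below, a minimal one-dimensional example under stated assumptions, solves the linear system of (10) once with a dense solver and once element-wise in the Fourier domain. The helper names, the Gaussian kernel bandwidth `sigma`, and the values of `lam` and `mu` are all illustrative choices made here, not the paper's settings, and the FFT route assumes a symmetric kernel matrix so that the spectrum of its first row is real.

```python
import numpy as np

def gaussian_kernel_row(x, sigma):
    """First row of K: Gaussian kernel between x and each of its cyclic shifts."""
    return np.array([np.exp(-np.sum((x - np.roll(x, j)) ** 2) / sigma ** 2)
                     for j in range(x.size)])

def solve_alpha_dense(k1, y, beta, lam, mu):
    """Solve (K^H K + lam*K + mu*I) alpha = K^H y + mu*beta with a dense solver."""
    n = k1.size
    K = np.stack([np.roll(k1, i) for i in range(n)])  # circulant kernel matrix
    A = K.T @ K + lam * K + mu * np.eye(n)
    return np.linalg.solve(A, K.T @ y + mu * beta)

def solve_alpha_fft(k1, y, beta, lam, mu):
    """Solve the same system element-wise in the Fourier domain."""
    k1h = np.fft.fft(k1).real  # real-valued because this K is symmetric
    num = k1h * np.fft.fft(y) + mu * np.fft.fft(beta)
    den = k1h * k1h + lam * k1h + mu
    return np.fft.ifft(num / den).real

rng = np.random.default_rng(1)
n = 16
x = rng.standard_normal(n)                            # toy base sample
d = np.minimum(np.arange(n), n - np.arange(n))        # cyclic distance to shift 0
y = np.exp(-0.5 * (d / 2.0) ** 2)                     # Gaussian-shaped regression target
beta = rng.standard_normal(n)
k1 = gaussian_kernel_row(x, sigma=4.0)

alpha_dense = solve_alpha_dense(k1, y, beta, lam=1e-4, mu=1e-2)
alpha_fft = solve_alpha_fft(k1, y, beta, lam=1e-4, mu=1e-2)
```

Both routes return the same coefficients up to rounding, while the FFT version never forms or inverts the $n \times n$ matrix.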
As a result, $\alpha$ can be obtained from the following proposition.

Proposition 1: Guaranteed by Parseval's identity, $\alpha$ can be solved in the Fourier domain by

$$\hat{\alpha} = \frac{\hat{k}_1^{*} \odot \hat{y} + \mu\hat{\beta}}{\hat{k}_1^{*} \odot \hat{k}_1 + \lambda\hat{k}_1 + \mu} \tag{11}$$

where $\odot$ denotes the Hadamard product, i.e., $(a \odot b)_i = a_i b_i$, the division is performed element-wise, and the asterisk denotes the conjugate operation.

The proof of the above proposition is presented in the Appendix. Equation (9) is a standard $\ell_1$-regularized least squares problem. Through the shrinkage threshold algorithm [68], it can be solved by the following proposition.

Proposition 2: Equation (9) has a globally unique solution

$$\beta = \delta\!\left(\frac{\tau}{2\mu}, \alpha\right) \tag{12}$$

where $\delta(\cdot,\cdot)$ denotes the shrinkage operator, defined as

$$\delta(\varepsilon, x) = \mathrm{sign}(x)\max(0, |x| - \varepsilon). \tag{13}$$

Therefore, (7) can be solved iteratively by alternately optimizing $\alpha$ and $\beta$ using (11) and (12), respectively. The iteration stops when the difference of the objective values between two consecutive iterations is very small, e.g., $10^{-8}$ in this paper. In each iteration, the fast Fourier transform dominates the complexity, yielding $O(n \log n)$. According to the empirical results in this paper, the iterative algorithm converges after around ten


iterations. The formal description of the iterative algorithm is depicted in Algorithm 1.

Algorithm 1: Correlation Filter Learning Toward Peak Strength
Input: Training samples $X$, and regression objective $y$.
Output: The peak-strengthened correlation filter $\alpha$.
1: Calculate the kernel matrix $K$ with $k_{ij} = \phi(x_i)^{T}\phi(x_j)$.
2: Initialize $\hat{\beta} = \hat{k}_1 \odot \hat{y}$.
3: while not converged do
4:   Compute $\alpha$ from Eq. (11).
5:   Compute $\beta$ from Eq. (12).
6: end

Fig. 2. Visualization of the PS correlation filter learning. A frame in the case of occlusion is shown on the left, where the target is marked in red and the base image is marked in blue. The features selected to train the filter are shown on the right, highlighted in green.

Note that, due to the circulant structure of $K$, (11) in fact learns a correlation filter in the Fourier domain. To exploit the temporal information, and to prevent the correlation filter from changing abruptly in successive frames, the base image $x_t$ and the correlation filter $\alpha_t$ in the $t$th frame are, respectively, updated in an incremental manner

$$\hat{x}_t = (1 - \pi)\hat{x}_{t-1} + \pi\hat{x} \tag{14}$$

$$\hat{\alpha}_t = (1 - \pi)\hat{\alpha}_{t-1} + \pi\hat{\alpha} \tag{15}$$

where $\hat{x}$ is the DFT of the spatially expanded region of the $(t-1)$th target, $\hat{\alpha}$ is obtained from (11), and $\pi \in (0, 1)$ controls the update rate.

C. Target Localization

In the $t$th frame, a large number of target candidates are sampled pixel-wise within the search area defined by a base image denoted by $x$. The base image is sampled by a spatial expansion of the region where the $(t-1)$th target is located. With the circulant structure, these target candidates are obtained from the full cyclic shifts of $x$. As a result, the regression values of these candidates are computed from

$$f(x) = \mathcal{F}^{-1}\!\left(\hat{k} \odot \hat{\alpha}_t\right) \tag{16}$$

where $\hat{k}$ is the DFT of the kernel correlation $\phi^{T}(x)\phi(x_t)$ between the temporally obtained target and the candidate regions, and $\mathcal{F}^{-1}(\cdot)$ denotes the inverse fast Fourier transform.
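The alternating updates of Algorithm 1 and the detection step (16) can be sketched in one dimension as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: it uses a Gaussian kernel on raw values, assumes a symmetric kernel matrix so that the spectrum of its first row is real, omits the incremental updates (14) and (15), and uses illustrative values for `lam`, `tau`, `mu`, and `sigma` that are not the paper's settings. All function names are hypothetical.

```python
import numpy as np

def kernel_correlation(z, x, sigma):
    """g[m] = Gaussian kernel between z cyclically shifted by m and x."""
    return np.array([np.exp(-np.sum((np.roll(z, m) - x) ** 2) / sigma ** 2)
                     for m in range(x.size)])

def soft_threshold(eps, v):
    """Shrinkage operator of (13): sign(v) * max(0, |v| - eps)."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - eps)

def relaxed_objective(k1h, y, alpha, beta, lam, tau, mu):
    """Value of the relaxed problem (7); K is applied through the FFT."""
    K_alpha = np.fft.ifft(k1h * np.fft.fft(alpha)).real
    return (np.sum((y - K_alpha) ** 2) + lam * (alpha @ K_alpha)
            + tau * np.sum(np.abs(beta)) + mu * np.sum((alpha - beta) ** 2))

def learn_filter(x, y, lam, tau, mu, sigma, tol=1e-8, max_iter=200):
    """Alternate the alpha update (11) and the beta update (12)."""
    k1h = np.fft.fft(kernel_correlation(x, x, sigma)).real  # symmetric K
    yh = np.fft.fft(y)
    beta = np.fft.ifft(k1h * yh).real        # initialization, step 2 of Algorithm 1
    history = []
    for _ in range(max_iter):
        alpha_h = (k1h * yh + mu * np.fft.fft(beta)) / (k1h ** 2 + lam * k1h + mu)
        alpha = np.fft.ifft(alpha_h).real
        beta = soft_threshold(tau / (2.0 * mu), alpha)
        history.append(relaxed_objective(k1h, y, alpha, beta, lam, tau, mu))
        # Stop when the objective changes by less than tol between iterations.
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break
    return alpha, beta, history

def respond(z, x, alpha, sigma):
    """Regression values (16) for all cyclic shifts of the candidate base z."""
    kh = np.fft.fft(kernel_correlation(z, x, sigma))
    return np.fft.ifft(kh * np.fft.fft(alpha)).real

rng = np.random.default_rng(2)
n = 32
x = rng.standard_normal(n)                        # toy base sample
d = np.minimum(np.arange(n), n - np.arange(n))    # cyclic distance to shift 0
y = np.exp(-0.5 * (d / 2.0) ** 2)                 # Gaussian-shaped regression target
alpha, beta, history = learn_filter(x, y, lam=1e-4, tau=1e-2, mu=1.0, sigma=4.0)
```

Because each update exactly minimizes (7) over one block of variables, the recorded objective values do not increase from one iteration to the next. Shifting the candidate base cyclically shifts the response map by the same amount, which is what lets the peak position be read as the target displacement.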
The target is localized as the region where the largest regression value (filter response) is located. Note that (16) is in fact a spatial correlation over the search area in the dual space defined by $\phi(x_t)$. By transforming it into the Fourier domain, the correlation is implemented by the Hadamard product, which significantly improves the computational efficiency.

D. Scale Estimation

During tracking, the size of the filter $\alpha$ is fixed to maintain the fast speed. Thus, to incorporate the scale estimation, we employ a scale pool $S = \{s_1, s_2, \ldots, s_m\}$ containing $m$ scales. We sample $m$ groups of target candidates, i.e., $m$ base images $x_{1:m}$, and in the $i$th group, the target candidates have the scale $s_i$. Then, we compute the regression values of the $m$ groups of target candidates by using (16). As a result, the scale of the target is estimated by

$$s^{*} = \arg\max_{s \in S} f_s(x) \tag{17}$$

where $f_s(\cdot)$ denotes the regression values with respect to the scale $s$. Correspondingly, the criterion of the target localization is revised as the region where the largest regression value is located with respect to the estimated scale.

E. Discussion

The goal of the proposed approach is to learn a PS correlation filter that can augment the discriminative capability to separate the target from its surrounding background. An elastic net constraint is imposed on the correlation filter to achieve this goal. To make the proposed approach clearer, we discuss the learning method from the perspectives of feature selection and correlation filtering, respectively. Note that we use the raw pixels as the features in this example for clarity and simplicity. In the actual implementation of the proposed tracker, other more effective features are adopted, such as the histogram of oriented gradients (HOG) feature.

1) Feature Selection: As presented in (4) and (5), an elastic net constraint is enforced on the correlation filter learning.
The sparsity (i.e., through the $\ell_1$-norm) can adaptively ignore the distractive features (pixels), such as occlusions and cluttered background, by zeroing the corresponding entries of the correlation filter $w$ or $\alpha$. Meanwhile, these distractive features usually appear within a region (i.e., groups of pixels). To reflect the group property, a quadratic constraint is then applied through the $\ell_2$-norm. We visualize the correlation filter learning in the case of occlusions, as shown in Fig. 2. A representative frame is shown on the left, where the target (i.e., the man's face) is marked by the red box and the base image is marked by the blue box. On the right, the features selected to learn the correlation filter are highlighted in green. It is evident that the pixels from the face are extensively selected, while the pixels from the occlusion (i.e., the book) are rarely considered. Moreover, the pixels from the background that differ greatly from the target are also adaptively selected. Such a selection promotes the robustness of the correlation filtering and augments the discriminative capability of the correlation response. It is also found that the features at the top right of the target region (from the hair) are extensively excluded. Note that

the correlation filter has a symmetric property over the base image region. Because the book (the distractive object) needs to be excluded, and its symmetric region falls on the hair, the hair region is excluded as well. In addition, from another point of view, the hair within the target region is less distinguishable from its immediately surrounding background. For this reason, the hair region from either the target or the background is extensively excluded. All the selection is made adaptively by the proposed approach. This example shows that the proposed feature selection is effective in learning the correlation filter by excluding the distractive features.

2) Correlation Filtering: The correlation filtering method localizes the target in terms of the position of the peak of the correlation response. Note that the target localization depends only on the relative response values of the candidate regions, rather than the absolute response values. Intuitively, a good correlation filter produces a much stronger response at the target location and a weak response in other regions, even the regions very close to the target. From a tracking-by-detection point of view, the larger the difference of the response values between the target location and other regions, the more discriminative the correlation filter. We investigate quantitatively to what extent the proposed approach improves the discriminative capability. Clearly, we are interested in how strong the response of the target is relative to several other competitive candidate regions. We also consider how accurately the response peak appears at the center of the target location.
To this end, we define a new metric, the peak strength

$$s = \frac{1}{n}\sum_{j=1}^{n}\left(p - r_j\right)^2 - \frac{1}{2}\left\|\,[x_p, y_p]^{T} - [x_{gt}, y_{gt}]^{T}\right\|_2 \tag{18}$$

to evaluate the discriminative performance of the learned correlation filter, where $p$ denotes the peak value of the response, $r_j$ denotes the $j$th response value, $n$ denotes the number of the neighboring response values around the peak, and $[x_p, y_p]^{T}$ and $[x_{gt}, y_{gt}]^{T}$ denote the positions of the response peak (correlation output) and the ground truth peak (center of the target location), respectively. The peak strength is expected to be high for a correlation filter with good discriminative performance. Note that, to balance their scales, the two terms in (18) are each normalized to [0, 1] over a video sequence.

To demonstrate that the proposed approach improves the discriminative performance of the correlation filtering (i.e., PS), we construct another correlation filter learned only with a quadratic constraint. We evaluate the peak strength of the two filters on the OTB 2013 benchmark [69], a popular tracking benchmark containing 50 challenging video sequences. We use the eight immediate neighbors of the peak value to calculate the peak strength, i.e., we set $n = 8$ in (18). Fig. 3(a) shows the average peak strength on each of the 50 video sequences, from which it is evident that, on most video sequences, the proposed approach (PS) has higher peak strength than its competing counterpart (NPS). Furthermore, the distributions of the peak strength obtained by the two approaches on all frames are shown in Fig. 3(b). It is evident that the proposed approach has significantly higher peak strength over the tracking benchmark. This demonstrates that the elastic net constraint leads to a PS correlation filtering, as well as improved discriminative performance.

Fig. 3. Investigation on the peak strength over the OTB 2013 benchmark. (a) Average peak strength on each video sequence. (b) Distribution of the peak strength on all frames.
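For a single response map, the two ingredients of the peak strength can be sketched as below. This is only an illustration under assumptions: it reads (18) as the mean squared gap between the peak and its eight immediate neighbors minus half the Euclidean localization error, it wraps neighbors at the borders, and it omits the per-sequence normalization to [0, 1] described above. The function names and the toy response map are hypothetical.

```python
import numpy as np

def peak_strength_terms(response, gt_rc):
    """Return (sharpness, localization_error) for a 2-D response map.

    sharpness: mean squared gap between the peak and its 8 immediate neighbors.
    localization_error: Euclidean distance between the peak and the ground truth,
    both given as (row, column) positions.
    """
    r_p, c_p = np.unravel_index(np.argmax(response), response.shape)
    p = response[r_p, c_p]
    rows, cols = response.shape
    gaps = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            # Assumption: wrap at the borders, in line with the cyclic response.
            r_j = response[(r_p + dr) % rows, (c_p + dc) % cols]
            gaps.append((p - r_j) ** 2)
    sharpness = float(np.mean(gaps))
    error = float(np.hypot(r_p - gt_rc[0], c_p - gt_rc[1]))
    return sharpness, error

def peak_strength(response, gt_rc):
    """Unnormalized peak strength: sharpness minus half the localization error."""
    sharpness, error = peak_strength_terms(response, gt_rc)
    return sharpness - 0.5 * error

# Toy response: a peak of 1.0 surrounded by eight neighbors of 0.5.
resp = np.zeros((5, 5))
resp[1:4, 1:4] = 0.5
resp[2, 2] = 1.0
```

Keeping the two terms separate makes it straightforward to normalize each of them over a sequence before combining, as the text prescribes.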
For more thorough evaluations, please refer to the results shown in Fig. 13.

IV. EXPERIMENTS

A. Implementation Details

The proposed tracking algorithm is implemented in MATLAB on a PC with an Intel Xeon W3520 CPU at 2.67 GHz. The MATLAB scripts are written without any code optimization. The average running speed of the proposed tracker is 13.4 frames/s. The training samples X are the full cyclic shifts of the base image patch centered at the current target location, with a spatially expanded region 1.5 times the size of the current target. A cosine window is applied to the base image to alleviate the discontinuity caused by the cyclic shifts. HOG features are extracted from the training samples, and a Gaussian kernel is employed to embed the features into a high-dimensional nonlinear space. The target candidates, sampled from the image patch centered at the most recently obtained target location, are generated by the same procedure as the training samples. As recommended in [10], the parameter λ in (5) is set to $10^{-4}$. Following the suggestion in [15], the parameter τ in (5) is set to $1 - λ$. We empirically set the parameter μ in (7) to $10^{-5}$, and build the scale pool from seven different scaling coefficients, i.e., S = {0.985, 0.99, 0.995, 1, 1.005, 1.01, 1.015}. The source codes will be available on the authors' websites.

B. Evaluation Setting

The proposed tracker is evaluated on two popular visual tracking benchmarks, the OTB 2013 benchmark [69] and the OTB 2015 benchmark [70], which contain 50 and 100 video sequences, respectively, with various challenging situations, such as illumination change, occlusion, nonrigid deformation, in-plane/out-of-plane rotation, and scale variation. In each frame of the video sequences of the two benchmarks, the target is manually labeled by a rectangular bounding box that is used as the ground truth in the quantitative evaluations.
Eighteen other state-of-the-art trackers serve as the competing methods in the evaluations, including the top five trackers on the OTB 2013 benchmark according to Wu et al.'s evaluation [69] (Struck [63], SCM [71],


TLD [54], ASLA [72], and CXT [73]), the top five trackers on the OTB 2015 benchmark according to Wu et al.'s evaluation [70] (Struck [63], SCM [71], ASLA [72], CSK [4], and L1APG [22]), eight other correlation filtering-based trackers (RCF [12], SRDCF [7], KCF [10], SAMF [6], DSST [5], STC [8], CN [9], and CSK [4]), two deep learning-based trackers (CNT [65] and HCFT [58]), and two feature selection-based trackers (ODFS [66] and SSL [18]). Two criteria are used to evaluate the performance of the proposed tracker, which are defined as follows.

1) Precision: The percentage of frames where the tracking location errors (TLEs) are less than a predefined threshold. The TLE is defined as the Euclidean distance between the centers of the tracking and the ground truth bounding boxes.

2) Success Rate: The percentage of frames where the overlap rates (ORs) are greater than a predefined threshold. The OR is defined as $(A_t \cap A_g)/(A_t \cup A_g)$, where $A_t$ and $A_g$ denote the areas of the tracking and the ground truth bounding boxes, respectively.

TABLE I. TRACKING PERFORMANCE ON THE 50 VIDEO SEQUENCES OF THE OTB 2013 BENCHMARK. ρ AND φ DENOTE LOCATION ERROR THRESHOLD AND OVERLAP THRESHOLD, RESPECTIVELY. THE BEST RESULTS ARE MARKED IN BOLD-FACE FONTS. THE SECOND BEST RESULTS ARE MARKED BY UNDERLINES

Fig. 4. Tracking performance of the proposed tracker and the top five trackers in Wu et al.'s evaluation [69] on the 50 video sequences of the OTB 2013 benchmark.

Fig. 5. Tracking performance of the proposed tracker and eight other state-of-the-art trackers based on correlation filtering on the 50 video sequences of the OTB 2013 benchmark.

C. Evaluations on the OTB 2013 Benchmark

Fig. 4 shows the comparison of the tracking performance of the proposed tracker and the top five trackers in Wu et al.'s evaluation [69] on the 50 video sequences of the OTB 2013 benchmark.
It is evident that the proposed tracker significantly outperforms its five counterparts in this comparison. According to the quantitative evaluation results shown in Table I, the proposed tracker outperforms its five counterparts by 16.5% and 18.4% in terms of precision (ρ = 20) and success rate (φ = 0.5), respectively, on the OTB 2013 benchmark.

Fig. 5 shows the comparison of the tracking performance of the proposed tracker and eight other state-of-the-art trackers based on correlation filtering on the 50 video sequences of the OTB 2013 benchmark. The proposed tracker obtains the best performance in terms of precision and the second best performance in terms of success rate. Specifically, according to Table I, although the success rate of the proposed tracker is slightly inferior to that of the SRDCF tracker, the proposed tracker runs 2.5 times faster than the SRDCF tracker, as shown in Table III.

Fig. 6 shows the comparison of the tracking performance of the proposed tracker, a deep learning-based tracker (CNT), and two feature selection-based trackers. It is evident that the proposed tracker outperforms its three counterparts in this comparison by 10.5% and 9.2% in terms of precision (ρ = 20) and success rate (φ = 0.5), respectively.

Overall, the evaluation results on the OTB 2013 benchmark demonstrate that the proposed tracker achieves competitive performance against the state-of-the-art approaches on the 50 challenging video sequences.

D. Evaluations on the OTB 2015 Benchmark

Fig. 7 shows the comparison of the tracking performance of the proposed tracker and the top five trackers in Wu et al.'s evaluation [70] on the 100 video sequences of the OTB 2015 benchmark. It is evident that the proposed tracker significantly outperforms its five counterparts in this comparison.
According to the quantitative evaluation results shown in Table II, the proposed tracker outperforms its five counterparts by 14.1% and 14.2% in terms of precision (ρ = 20) and success rate (φ = 0.5), respectively, on the OTB 2015 benchmark.
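The two criteria behind these numbers, precision at ρ = 20 and success rate at φ = 0.5, can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the benchmark toolkit itself. The box layout `(x, y, width, height)` with `(x, y)` the top-left corner, and all function names, are assumptions made for the example.

```python
import numpy as np


def center_errors(pred, gt):
    """Tracking location error (TLE): Euclidean distance between box centers."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    pred_centers = pred[:, :2] + pred[:, 2:] / 2.0
    gt_centers = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_centers - gt_centers, axis=1)


def overlap_rates(pred, gt):
    """Overlap rate (OR): intersection area divided by union area, per frame."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    # Clip at zero so that disjoint boxes yield an empty intersection.
    inter = np.clip(x2 - x1, 0.0, None) * np.clip(y2 - y1, 0.0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union


def precision(pred, gt, rho=20.0):
    """Fraction of frames whose TLE is less than the threshold rho (pixels)."""
    return float(np.mean(center_errors(pred, gt) < rho))


def success_rate(pred, gt, phi=0.5):
    """Fraction of frames whose OR is greater than the threshold phi."""
    return float(np.mean(overlap_rates(pred, gt) > phi))
```

For a two-frame sequence in which the tracker is exact in the first frame and 30 pixels off with no overlap in the second, both `precision(..., rho=20)` and `success_rate(..., phi=0.5)` evaluate to 0.5.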


Fig. 8 shows the comparison of the tracking performance of the proposed tracker and six other state-of-the-art trackers based on correlation filtering on the 100 video sequences of the OTB 2015 benchmark. The proposed tracker obtains the best performance in terms of precision and the second best performance in terms of success rate. However, the proposed tracker runs 2.5 times faster than the SRDCF tracker, which achieves the best success rate. Overall, the evaluation results on the OTB 2015 benchmark demonstrate that the proposed tracker achieves competitive performance against the state-of-the-art approaches on the 100 challenging video sequences.

TABLE II. TRACKING PERFORMANCE ON THE 100 VIDEO SEQUENCES OF THE OTB 2015 BENCHMARK. ρ AND φ DENOTE LOCATION ERROR THRESHOLD AND OVERLAP THRESHOLD, RESPECTIVELY. THE BEST RESULTS ARE MARKED IN BOLD-FACE FONTS. THE SECOND BEST RESULTS ARE MARKED BY UNDERLINES

TABLE III. RUNNING SPEEDS (IN FRAMES/S) OF THE PROPOSED AND THE EIGHT OTHER STATE-OF-THE-ART CORRELATION FILTERING BASED TRACKERS

Fig. 6. Tracking performance of the proposed tracker, the deep learning-based tracker [65], and two state-of-the-art feature selection-based trackers on the 50 video sequences of the OTB 2013 benchmark.

Fig. 7. Tracking performance of the proposed tracker, the deep tracker [58], and the top five trackers in Wu et al.'s evaluation [70] on the 100 video sequences of the OTB 2015 benchmark.

Fig. 8. Tracking performance of the proposed tracker and six other state-of-the-art trackers based on correlation filtering on the 100 video sequences of the OTB 2015 benchmark.

E. Evaluations on Running Speed

Considering the time-sensitive nature of practical tracking applications, running speed is also a critical factor in the performance of a visual tracker. Table III shows the running speeds of the proposed tracker and the eight state-of-the-art correlation filtering-based trackers.
Taken together with the tracking performance in terms of precision and success rate shown in Tables I and II, these results indicate that the proposed tracker makes a good tradeoff between tracking accuracy and running speed. Even without any code optimization in MATLAB, it still achieves a competitive speed compared with its eight counterparts.

F. Evaluations on Various Challenging Situations

To thoroughly evaluate the proposed tracker, its performance in various challenging situations is investigated. The results are shown in Figs. 9-11. The OTB 2013 and the OTB 2015 benchmarks group their 50 and 100 video sequences, respectively, into 11 challenging situations: occlusion, deformation, out-of-plane rotation, in-plane rotation, illumination variation, scale variation, background clutter, motion blur, fast motion, low resolution, and out of view. In this evaluation, five (SRDCF [7], RCF [12], KCF [10], SAMF [6], and DSST [5]) and two (SRDCF [7] and KCF [10]) other correlation filtering-based trackers are employed as the baselines on the OTB 2013 and the OTB 2015 benchmarks, respectively. The evaluation results in several challenging situations are reported below.

1) Occlusion: During tracking, the target is often occluded by other objects, leading to partial or complete abrupt changes in its appearance. Occlusion is usually the major factor resulting in tracking failure. Fig. 9(a) shows the tracking performance of the proposed and the competing trackers in the case of occlusion. It is evident that the proposed tracker outperforms its counterparts in this case. Some tracking results in the case of occlusion in representative frames are shown in Fig. 12(a), where a football player is running on the field and

sometimes occluded by other players with similar appearances. Only SRDCF and the proposed tracker successfully track the player, and the proposed tracker obtains a more accurate center location and scale estimation, while the other competing trackers fail in this experiment.

Fig. 9. Tracking performance of the proposed and the baseline trackers in various challenging situations on the OTB 2013 and the OTB 2015 benchmarks. The number in the legend of each precision plot denotes the precision under the threshold ρ = 20, and the number in the legend of each success rate plot denotes the average success rate over all thresholds φ ∈ [0, 1]. (a) Occlusion on the OTB 2013 (left two) and the OTB 2015 (right two) benchmarks. (b) Deformation on the OTB 2013 (left two) and the OTB 2015 (right two) benchmarks. (c) Out-of-plane rotation on the OTB 2013 (left two) and the OTB 2015 (right two) benchmarks. (d) In-plane rotation on the OTB 2013 (left two) and the OTB 2015 (right two) benchmarks.

2) Nonrigid Deformation: In this situation, the target deforms locally in its appearance. This enlarges the representation error of the target, making the tracking unstable. Fig. 9(b) shows the tracking performance of the visual trackers in the case of nonrigid deformation. The proposed tracker achieves superior performance over its competing peers in this case. Fig. 12(b) shows the qualitative tracking results in some representative frames where nonrigid deformation occurs. The proposed tracker performs very well in tracking the sprinter, whose body appearance undergoes significant nonrigid deformations. In contrast, the SRDCF tracker drifts to another sprinter early in the sequence.

3) Out-of-Plane Rotation: Due to the motion of the target and the viewpoint change of the camera, the target appearance often undergoes out-of-plane rotation in successive frames. Fig. 9(c) shows the tracking performance of the visual trackers in the case of out-of-plane rotation. The proposed tracker outperforms its competing counterparts in this case. Fig. 12(c) illustrates the qualitative tracking results in several representative frames in this case, where a soccer player is celebrating his victory with his teammates. It can be seen that the appearance of this player changes drastically during tracking due to the out-of-plane rotation. The proposed tracker succeeds in tracking this player and obtains good results in these challenging frames.

4) In-Plane Rotation: During tracking, the motion of the target often causes in-plane rotation in its appearance. In this case, it is difficult to estimate the boundary between the target and the background. This is thus a major factor


influencing the accuracy of target localization. The tracking performance of the visual trackers is shown in Fig. 9(d). The proposed tracker yields the second and the third best performance in this case in terms of precision and success rate, respectively. Fig. 12(d) shows the qualitative tracking results in several representative frames where in-plane rotation happens. It can be seen that the proposed tracker performs well in these frames although the face of the person being tracked rotates in the camera plane during tracking.

Fig. 10. Tracking performance of the proposed and the baseline trackers in various challenging situations on the OTB 2013 and the OTB 2015 benchmarks. The number in the legend of each precision plot denotes the precision under the threshold ρ = 20, and the number in the legend of each success rate plot denotes the average success rate over all thresholds φ ∈ [0, 1]. (a) Illumination variation on the OTB 2013 (left two) and the OTB 2015 (right two) benchmarks. (b) Scale variation on the OTB 2013 (left two) and the OTB 2015 (right two) benchmarks. (c) Background clutter on the OTB 2013 (left two) and the OTB 2015 (right two) benchmarks. (d) Motion blur on the OTB 2013 (left two) and the OTB 2015 (right two) benchmarks.

5) Illumination Change: Due to changes in the lighting conditions of the scene where tracking is conducted, the entire target appearance varies abruptly. Fig. 10(a) shows the tracking performance of the proposed and the competing trackers in the case of illumination change. The proposed tracker performs the third best in this case. Fig. 12(e) shows representative results in this case, where a person is walking in a room in which the illumination changes drastically. The proposed tracker accurately tracks the person in this scene.

6) Scale Variation: During tracking, the scale of the target often varies in successive frames because of the motion of the target, the camera, or both.
In this case, a good tracker is required to estimate the scale as accurately as possible, while keeping the TLE low. Fig. 10(b) shows the tracking performance of the visual trackers in the case of scale variation. The proposed tracker performs the best and the second best in this case in terms of precision and success rate, respectively. Fig. 12(f) shows the qualitative results in this case in several representative frames, where a woman is walking toward the far end of a corridor. The proposed tracker accurately estimates the scale of the appearance of this woman in these frames.

It can also be seen from Fig. 10(d) that the proposed tracker underperforms the SRDCF tracker in the case of motion blur. Because the target is smoothed in the presence of motion blur, the features (pixels) are averaged within the target region. As a result, the sparsity-based feature selection strategy is unable to

accurately localize the distractive features. This is a limitation of the proposed approach.

Fig. 11. Tracking performance of the proposed and the baseline trackers in various challenging situations on the OTB 2013 and the OTB 2015 benchmarks. The number in the legend of each precision plot denotes the precision under the threshold ρ = 20, and the number in the legend of each success rate plot denotes the average success rate over all thresholds φ ∈ [0, 1]. (a) Fast motion on the OTB 2013 (left two) and the OTB 2015 (right two) benchmarks. (b) Low resolution on the OTB 2013 (left two) and the OTB 2015 (right two) benchmarks. (c) Out of view on the OTB 2013 (left two) and the OTB 2015 (right two) benchmarks.

G. Demonstration of the Peak Strength

To analyze to what extent the proposed approach to PS correlation filter learning improves the tracking performance, another comparison is conducted on the OTB 2013 benchmark between the PS and no PS (NPS) trackers, as shown in Fig. 13. It can be seen that the PS learning method leads to 6.5% and 6.3% performance improvements in terms of precision and success rate, respectively.

H. Demonstration of the Feature Grouping

To investigate the contribution of the feature grouping [i.e., the $\ell_2$-norm in (5)] to the tracking performance, a comparison is conducted on the OTB 2013 benchmark between two trackers implemented via the elastic net (combination of $\ell_1$- and $\ell_2$-norms) and the LASSO ($\ell_1$-norm only) constraints, respectively. For this purpose, we constructed another tracker via the LASSO constraint. The comparison results are shown in Fig. 14. It is evident that the grouping constraint leads to 7.1% and 6.0% performance improvements on the OTB 2013 benchmark in terms of precision and success rate, respectively. The $\ell_1$ constraint can ignore the distractive features during the filter learning, but it fails to select features from the same group.
However, the distractive features often appear in a local region, i.e., in a group form. For this reason, an $\ell_2$ constraint is combined with the $\ell_1$ constraint to group the features. As a result, the correlation filter can suppress, through its zero coefficients, the response at the regions with significant appearance changes, such as those caused by occlusion, deformation, illumination variation, and in-plane/out-of-plane rotation. This also enhances the target region in the correlation filtering, leading to a strengthened peak of the filter response and thus improved discrimination performance.

I. Investigation on Parameters

In this paper, as recommended in [10], the parameter λ in (5) is set to $10^{-4}$. Following the suggestion in [15], the parameter τ in (5) is set to $1 - λ$. In (7), the parameter μ is investigated comprehensively. According to our initial observations, when μ is set to a relatively small value, the proposed tracker performs more stably and robustly. Thus, we investigate μ within the selected range $\{10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}$ to observe its influence on the tracking performance on the benchmark. As shown in Fig. 15, the proposed tracker performs best when setting μ = $10^{-5}$. Note that the parameters λ, τ, and μ balance the weights between their corresponding terms in (7). Because λ, which controls the weight of the stability of the correlation filter in a squared form, is set to $10^{-4}$, setting μ


slightly smaller than λ may ensure the stability of the learned filter. Also, too small a μ may make its corresponding term trivial. As a result, we set μ = $10^{-5}$ in all the experiments in this paper.

Fig. 12. Tracking results in representative frames in various challenging situations. (a) Occlusion. (b) Nonrigid deformation. (c) Out-of-plane rotation. (d) In-plane rotation. (e) Illumination change. (f) Scale variation.

Fig. 13. Tracking performance of the PS and NPS trackers on the 50 video sequences of the OTB 2013 benchmark. The number in the legend of the precision plot denotes the precision under the threshold ρ = 20, and the number in the legend of the success rate plot denotes the average success rate over all thresholds φ ∈ [0, 1].

Fig. 14. Tracking performance of the feature grouped and no feature grouped trackers on the 50 video sequences of the OTB 2013 benchmark. The number in the legend of the precision plot denotes the precision under the threshold ρ = 20, and the number in the legend of the success rate plot denotes the average success rate over all thresholds φ ∈ [0, 1].

Fig. 15. Investigation on the parameter μ in (7) on the OTB 2013 benchmark.

V. CONCLUSION

We have proposed a novel visual tracking approach to correlation filter learning toward peak strength of the filter response. An elastic net constraint has been imposed on the correlation filter during the filter learning, in which the squared term encourages the features to be grouped, while the sparsity term adaptively ignores the distractive features in the correlation filtering. A new metric to evaluate the peak strength has been proposed to measure the discriminative capability of the learned correlation filter, by which the proposed approach has been demonstrated to effectively strengthen the peak of the correlation response, leading to more discriminative performance than previous methods.
Extensive experiments on a popular visual tracking benchmark have demonstrated that the proposed tracker outperforms most state-of-the-art methods.

APPENDIX

This section presents the proof of Proposition 1. The problem in (8) has a closed-form solution that

is shown in (10).

\[ \alpha = \left(K^H K + \lambda K + \mu I\right)^{-1}\left(K^H y + \mu\beta\right). \]

By incorporating (6), $K = D\,\mathrm{diag}(\hat{k}_1)\,D^H$, we have the following derivation:

\[
\begin{aligned}
\alpha &= \left(K^H K + \lambda K + \mu I\right)^{-1}\left(K^H y + \mu\beta\right) \\
&= \left(D\,\mathrm{diag}(\hat{k}_1^* \odot \hat{k}_1)\,D^H + \lambda D\,\mathrm{diag}(\hat{k}_1)\,D^H + \mu I\right)^{-1}\left(D\,\mathrm{diag}(\hat{k}_1^*)\,D^H y + \mu\beta\right) \\
&= D\,\mathrm{diag}\!\left(\frac{1}{\hat{k}_1^* \odot \hat{k}_1 + \lambda\hat{k}_1 + \mu}\right)\mathrm{diag}(\hat{k}_1^*)\,D^H y + \mu D\,\mathrm{diag}\!\left(\frac{1}{\hat{k}_1^* \odot \hat{k}_1 + \lambda\hat{k}_1 + \mu}\right) D^H \beta,
\end{aligned}
\]

where the product $\odot$ and the division are element-wise. The conjugate of the DFT of α is then found by

\[
\hat{\alpha}^* = \mathrm{diag}\!\left(\frac{\hat{k}_1^*}{\hat{k}_1^* \odot \hat{k}_1 + \lambda\hat{k}_1 + \mu}\right)\hat{y}^* + \mu\,\mathrm{diag}\!\left(\frac{1}{\hat{k}_1^* \odot \hat{k}_1 + \lambda\hat{k}_1 + \mu}\right)\hat{\beta}^* = \frac{\hat{k}_1^* \odot \hat{y}^* + \mu\hat{\beta}^*}{\hat{k}_1^* \odot \hat{k}_1 + \lambda\hat{k}_1 + \mu}.
\]

Equivalently, the DFT of α is obtained from

\[
\hat{\alpha} = \frac{\hat{k}_1 \odot \hat{y} + \mu\hat{\beta}}{\hat{k}_1 \odot \hat{k}_1^* + \lambda\hat{k}_1^* + \mu}.
\]

REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, Object tracking: A survey, ACM Comput. Surveys, vol. 38, no. 4, Dec. 2006, Art. no. 13. [2] A. W. M. Smeulders et al., Visual tracking: An experimental survey, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp , Jul [3] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, Visual object tracking using adaptive correlation filters, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2010, pp [4] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in Proc. Eur. Conf. Comput. Vis. (ECCV, Florence, Italy, 2012, pp [5] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, Accurate scale estimation for robust visual tracking, in Proc. Brit. Mach. Vis. Conf. (BMVC, Linköping, Sweden, 2014, pp [6] Y. Li and J. Zhu, A scale adaptive kernel correlation filter tracker with feature integration, in Proc. Eur. Conf. Comput. Vis. Workshop, Zürich, Switzerland, 2014, pp [7] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV, Santiago, Chile, 2015, pp [8] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M.-H.
Yang, Fast visual tracking via dense spatio-temporal context learning, in Proc. Eur. Conf. Comput. Vis. (ECCV, Zürich, Switzerland, 2014, pp [9] M. Danelljan, F. S. Khan, M. Felsberg, and J. V. D. Weijer, Adaptive color attributes for real-time visual tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, Columbus, OH, USA, 2014, pp [10] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp , Mar [11] S. Liu, T. Zhang, X. Cao, and C. Xu, Structural correlation filter for robust visual tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2016, pp [12] Y. Sui, Z. Zhang, G. Wang, Y. Tang, and L. Zhang, Real-time visual tracking: Promoting the robustness of correlation filter learning, in Proc. Eur. Conf. Comput. Vis. (ECCV, Amsterdam, The Netherlands, 2016, pp [13] J. Wright et al., Sparse representation for computer vision and pattern recognition, Proc. IEEE, vol. 98, no. 6, pp , Jun [14] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Stat. Soc. B (Methodol., vol. 58, no. 1, pp , [15] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. Roy. Stat. Soc. B (Stat. Methodol., vol. 67, no. 2, pp , [16] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vis., vol. 77, nos. 1 3, pp , [17] J. Kwon and K. Lee, Visual tracking decomposition, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2010, pp [18] Y. Sui, S. Zhang, and L. Zhang, Robust visual tracking via sparsityinduced subspace learning, IEEE Trans. Image Process., vol. 24, no. 12, pp , Dec [19] Y. Sui, G. Wang, Y. Tang, and L. Zhang, Tracking completion, in Proc. Eur. Conf. Comput. Vis. (ECCV, Amsterdam, The Netherlands, 2016, pp [20] X. Mei and H. Ling, Robust visual tracking using L 1 minimization, in Proc. 
IEEE Int. Conf. Comput. Vis. (ICCV, 2009, pp [21] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, Robust visual tracking via multi-task sparse learning, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2012, pp [22] X. Mei and H. Ling, Robust visual tracking and vehicle classification via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp , Nov [23] M. Barnard, W. Wang, J. Kittler, S. M. Naqvi, and J. A. Chambers, A dictionary learning approach to tracking, in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP, Kyoto, Japan, 2012, pp [24] D. Wang, H. Lu, and M.-H. Yang, Least soft-threshold squares tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, Portland, OR, USA, 2013, pp [25] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, Robust visual tracking via structured multi-task sparse learning, Int. J. Comput. Vis., vol. 101, no. 2, pp , [26] T. Zhang et al., Structural sparse tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2015, pp [27] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, Low-rank sparse learning for robust visual tracking, in Proc. Eur. Conf. Comput. Vis. (ECCV, Florence, Italy, 2012, pp [28] Y. Sui et al., Self-expressive tracking, Pattern Recognit., vol. 48, no. 9, pp , [29] Y. Sui, Y. Tang, and L. Zhang, Discriminative low-rank tracking, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV, Santiago, Chile, 2015, pp [30] Y. Sui and L. Zhang, Robust tracking via locally structured representation, Int. J. Comput. Vis., vol. 119, no. 2, pp , [31] X. Li, W. Hu, Z. Zhang, X. Zhang, and G. Luo, Robust visual tracking based on incremental tensor subspace learning, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV, Rio de Janeiro, Brazil, 2007, pp [32] W. Hu et al., Incremental tensor subspace learning and its applications to foreground segmentation and tracking, Int. J. Comput. Vis., vol. 91, no. 3, pp , [33] W. 
Hu et al., Single and multiple object tracking using log-euclidean Riemannian subspace and block-division appearance model, IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, pp , Dec [34] Y. Sui and L. Zhang, Visual tracking via locally structured Gaussian process regression, IEEE Signal Process. Lett., vol. 22, no. 9, pp , Sep [35] B. Liu, J. Huang, L. Yang, and C. Kulikowsk, Robust tracking using local sparse appearance model and K-selection, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2011, pp [36] B. Liu, J. Huang, C. Kulikowski, and L. Yang, Robust visual tracking using local sparse appearance model and K-selection, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp , Dec

[37] A. Adam, E. Rivlin, and I. Shimshoni, Robust fragments-based tracking using the integral histogram, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, vol. 1. New York, NY, USA, 2006, pp [38] X. Li, Z. Han, L. Wang, and H. Lu, Visual tracking via random walks on graph model, IEEE Trans. Cybern., vol. 46, no. 9, pp. 2144-2155, Sep. 2016. [39] X. Li, A. Dick, C. Shen, A. van den Hengel, and H. Wang, Incremental learning of 3D-DCT compact representations for robust visual tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 4, pp , Apr [40] D. Wang, H. Lu, and M.-H. Yang, Online object tracking with sparse prototypes, IEEE Trans. Image Process., vol. 22, no. 1, pp , Jan [41] T. Zhang, A. Bibi, and B. Ghanem, In defense of sparse tracking: Circulant sparse tracker, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2016, pp. 3880-3888. [42] D. Wang and H. Lu, Object tracking via 2DPCA and L1-regularization, IEEE Signal Process. Lett., vol. 19, no. 11, pp , Nov [43] S. Avidan, Support vector tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp , Aug [44] S. Zhang, X. Yu, Y. Sui, S. Zhao, and L. Zhang, Object tracking with multi-view support vector machines, IEEE Trans. Multimedia, vol. 17, no. 3, pp , Mar [45] S. Zhang, Y. Sui, X. Yu, S. Zhao, and L. Zhang, Hybrid support vector machines for robust object tracking, Pattern Recognit., vol. 48, no. 8, pp , [46] S. Zhang, Y. Sui, S. Zhao, X. Yu, and L. Zhang, Multi-local-task learning with global regularization for object tracking, Pattern Recognit., vol. 48, no. 12, pp , [47] S. Zhang, S. Zhao, Y. Sui, and L. Zhang, Single object tracking with fuzzy least squares support vector machine, IEEE Trans. Image Process., vol. 24, no. 12, pp , Dec [48] S. Avidan, Ensemble tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp , Feb [49] H. Grabner, M. Grabner, and H. Bischof, Real-time tracking via online boosting, in Proc. Brit.
Mach. Vis. Conf. (BMVC, vol , pp [50] H. Grabner, C. Leistner, and H. Bischof, Semi-supervised on-line boosting for robust tracking, in Proc. Eur. Conf. Comput. Vis. (ECCV, Marseilles, France, 2008, pp [51] K. Zhang, L. Zhang, and M.-H. Yang, Fast compressive tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 10, pp , Oct [52] S. Wang, H. Lu, F. Yang, and M.-H. Yang, Superpixel tracking, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV, Barcelona, Spain, 2011, pp [53] Z. Kalal, J. Matas, and K. Mikolajczyk, P-N learning: Bootstrapping binary classifiers by structural constraints, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, Jun. 2010, pp [54] Z. Kalal, K. Mikolajczyk, and J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp , Jul [55] B. Babenko, M.-H. Yang, and S. Belongie, Robust object tracking with online multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp , Aug [56] J. Fan, X. Shen, and Y. Wu, Scribble tracker: A matting-based approach for robust tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp , Aug [57] D. Du, L. Zhang, H. Lu, X. Mei, and X. Li, Discriminative hash tracking with group sparsity, IEEE Trans. Cybern., vol. 46, no. 8, pp , Aug [58] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, Hierarchical convolutional features for visual tracking, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV, Santiago, Chile, 2015, pp [59] L. Wang, W. Ouyang, X. Wang, and H. Lu, Visual tracking with fully convolutional networks, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV, Santiago, Chile, 2015, pp [60] Y. Qi et al., Hedged deep tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2016, pp [61] L. Wang, W. Ouyang, X. Wang, and H. Lu, STCT: Sequentially training convolutional networks for visual tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2016, pp [62] H. Nam and B. 
Han, Learning multi-domain convolutional neural networks for visual tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2016, pp [63] S. Hare, A. Saffari, and P. H. S. Torr, Struck: Structured output tracking with kernels, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV, Barcelona, Spain, 2011, pp [64] H. Grabner and H. Bischof, On-line boosting and vision, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, vol , pp [65] K. Zhang, Q. Liu, Y. Wu, and M.-H. Yang, Robust visual tracking via convolutional networks without training, IEEE Trans. Image Process., vol. 25, no. 4, pp , Apr [66] K. Zhang, L. Zhang, and M.-H. Yang, Real-time object tracking via online discriminative feature selection, IEEE Trans. Image Process., vol. 22, no. 12, pp , Dec [67] R. M. Gray, Toeplitz and Circulant Matrices: A Review. Boston, MA, USA: Now, [68] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imag. Sci., vol. 2, no. 1, pp , Jan [69] Y. Wu, J. Lim, and M.-H. Yang, Online object tracking: A benchmark, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, Portland, OR, USA, 2013, pp [70] Y. Wu, J. Lim, and M. H. Yang, Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp , Sep [71] W. Zhong, H. Lu, and M.-H. Yang, Robust object tracking via sparsitybased collaborative model, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2012, pp [72] X. Jia, H. Lu, and M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, 2012, pp [73] T. B. Dinh, N. Vo, and G. Medioni, Context tracker: Exploring supporters and distracters in unconstrained environments, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR, Jun. 2011, pp Yao Sui received the Ph.D. 
degree in electronic engineering from Tsinghua University, Beijing, China, under the supervision of Prof. L. Zhang. He is currently a Post-Doctoral Research Fellow with the Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS, USA, researching with Prof. G. Wang. His current research interests include machine learning, computer vision, image processing, and pattern recognition.

Guanghui Wang received the Ph.D. degree from the University of Waterloo, Waterloo, ON, Canada. He is currently an Assistant Professor with the University of Kansas, Lawrence, KS, USA. He is an Adjunct Professor with the Institute of Automation, Chinese Academy of Sciences, Beijing, China. He has published one book with Springer-Verlag and over 80 papers in peer-reviewed journals and conference proceedings. His current research interests include computer vision, image processing, and robotics. Mr. Wang served as an Associate Editor and on the editorial board of two journals, as an Area Chair or a TPC Member of over 20 conferences, and as a Reviewer of over 20 journals.

Li Zhang received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China. He is currently a Professor with the Department of Electronic Engineering, Tsinghua University. He is directing the UAV Vision Laboratory, Tsinghua University, and is also a member of the National Laboratory of Pattern Recognition, Beijing. His current research interests include image processing, computer vision, pattern recognition, and computer graphics.

Real-Time Visual Tracking: Promoting the Robustness of Correlation Filter Learning

Yao Sui¹, Ziming Zhang², Guanghui Wang¹, Yafei Tang³, Li Zhang⁴
¹ Dept. of EECS, University of Kansas, Lawrence,