IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 27, NO. 3, MARCH 2018

Exploiting Spatial-Temporal Locality of Tracking via Structured Dictionary Learning

Yao Sui, Guanghui Wang, Senior Member, IEEE, Li Zhang, and Ming-Hsuan Yang, Senior Member, IEEE

Abstract: In this paper, a novel spatial-temporal locality is proposed and unified via a discriminative dictionary learning framework for visual tracking. By exploring the strong local correlations between the temporally obtained targets and their spatially distributed nearby background neighbors, a spatial-temporal locality is obtained. The locality is formulated as a subspace model and exploited under a unified structure of discriminative dictionary learning with a subspace structure. Using the learned dictionary, the target and its background can be described and distinguished effectively through their sparse codes. As a result, the target is localized by integrating both the descriptive and the discriminative qualities. Extensive experiments on various challenging video sequences demonstrate the superior performance of the proposed algorithm over other state-of-the-art approaches.

Index Terms: Visual tracking, dictionary learning, sparse representation, low-rank learning.

I. INTRODUCTION

As a fundamental component of computer vision systems, visual tracking has been one of the most active research topics in computer vision, with various potential applications such as video surveillance, human-computer interaction, autonomous driving, and robotics. Recent years have witnessed great success in visual tracking [1], [2]. Nonetheless, many challenging problems have still not been solved effectively, such as drastic illumination change, background clutter, scale variation, and heavy occlusion. A typical tracking approach is to learn a robust representation for the target,

Manuscript received October 7, 2016; revised February 19, 2017, June 19, 2017, and September 8, 2017; accepted November 3, 2017.
Date of publication December 4, 2017; date of current version December 7, 2017. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant , in part by the Kansas NASA EPSCoR Program under Grant KNEP-PDG KU, in part by the General Research Fund of the University of Kansas under Grant 8901, in part by the Joint Fund of Civil Aviation Research by NSFC and the Civil Aviation Administration under Grant U153313, in part by the NSF CAREER under Grant , and in part by gifts from Adobe and Nvidia. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Junsong Yuan. (Corresponding author: Yao Sui.) Y. Sui is with the Harvard Medical School, Harvard University, Boston, MA 02115 USA (suiyao@gmail.com). G. Wang is with the Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS USA (ghwang@ku.edu). L. Zhang is with the Department of Electronic Engineering, Tsinghua University, Beijing, China (chinazhangli@tsinghua.edu.cn). M.-H. Yang is with the Department of Electrical Engineering and Computer Science, University of California at Merced, Merced, CA USA (mhyang@ucmerced.edu). Color versions of one or more of the figures in this paper are available online. Digital Object Identifier /TIP

Fig. 1. Illustration of the spatial-temporal locality. (a) A video sequence being analyzed. (b) Temporal and spatial correlations. (c) Temporal correlations between pairwise targets. (d) Average spatial correlations between the target and the background patches in every frame. In (c) and (d), cooler colors indicate smaller values.

the background, or both, by which the intrinsic structure of the target and the background is exploited via subspace [3], sparsity [4], and compactness [5] models. In this work, we propose a unified structure for visual tracking, i.e., the spatial-temporal locality.
Such a structure leads to strong local correlations between the temporally obtained targets and their spatially nearby regions in the background. Fig. 1 shows an illustration of the spatial-temporal locality. The temporal and spatial correlations, as illustrated in Fig. 1(b), are analyzed on the video sequence shown in Fig. 1(a). The temporal correlations between pairwise targets are plotted in Fig. 1(c). It can be seen that large values are located near the diagonal blocks of the plot, which indicates that a target is strongly correlated with its temporal neighbors. Such a neighborhood results in salient locality in the temporal dimension. Meanwhile, the spatial correlations between the target and the background in every frame are investigated, and the average result over all frames is plotted in Fig. 1(d). It is evident that the closer the background is to the target, the stronger the correlation between the target and the background. This observation leads to a locality structure in the spatial dimension.
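As a toy illustration only (my own sketch, not the authors' code), a pairwise correlation matrix of the kind visualized in Fig. 1(c) could be computed from vectorized grayscale target patches, assuming Pearson correlation as the similarity measure and a synthetic random-walk appearance drift:

```python
import numpy as np

def correlation_matrix(patches):
    """Pairwise Pearson correlations between vectorized target patches,
    i.e., the kind of matrix visualized in Fig. 1(c)."""
    V = np.stack([p.ravel() for p in patches]).astype(float)
    V -= V.mean(axis=1, keepdims=True)            # center each patch
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    return V @ V.T                                # cosine of centered patches

# Toy targets drifting over time (hypothetical data, not a benchmark video).
rng = np.random.default_rng(0)
cur, targets = rng.random((20, 20)), []
for _ in range(10):
    targets.append(cur.copy())
    cur = cur + 0.1 * rng.standard_normal((20, 20))

C = correlation_matrix(targets)
# Temporal neighbors correlate more strongly than distant frames,
# which is exactly the diagonal-block pattern described for Fig. 1(c).
assert C[0, 1] > C[0, 9]
```

With real data, `targets` would be the cropped, resized target regions of consecutive frames; the block structure near the diagonal then reflects the temporal locality discussed above.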

To further verify the above observation, we analyze the temporal and spatial correlations on the 51 video sequences of the OTB 2013 benchmark [6], which include a total of 29,486 frames. The statistical results are shown in Fig. 2. The average correlations over all these frames between the ground truth targets within a temporal window are plotted in Fig. 2(a). It is evident that, as the window size grows (i.e., the temporal locality is weakened), the correlation decreases sharply. Meanwhile, the average correlations over all these frames between the target and the background within a certain radius of the target are plotted in Fig. 2(b). It can be seen that, as the background moves farther away from the target (i.e., the spatial locality becomes less obvious), the correlation curve drops rapidly. As discussed above, the locality in either the temporal or the spatial dimension is evident. As a result, we are encouraged to exploit the joint locality in both the temporal and spatial dimensions, i.e., the spatial-temporal locality, to further improve tracking performance. The preceding analysis demonstrates that there are strong correlations between the recently obtained targets and the background regions near these targets, leading to a joint locality in both the temporal and spatial dimensions. The temporal locality provides an effective description of the target, and the spatial locality implicitly introduces discrimination into tracking. For this reason, we propose to construct a tracking model with both descriptive and discriminative capabilities. Furthermore, it has been shown that sparse representation can well reflect the locality structure between the dictionary atoms and the samples being represented [7].
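The window statistic of Fig. 2(a) can be sketched in the same toy setting (synthetic drifting targets rather than the benchmark data; the drift model and all shapes are assumptions):

```python
import numpy as np

def mean_correlation_within_window(targets, window):
    """Average Pearson correlation over all target pairs whose frame
    indices differ by at most `window` frames (cf. Fig. 2(a))."""
    V = np.stack([t.ravel() for t in targets]).astype(float)
    V -= V.mean(axis=1, keepdims=True)
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    C = V @ V.T
    n = len(targets)
    vals = [C[i, j] for i in range(n)
            for j in range(i + 1, min(i + window + 1, n))]
    return float(np.mean(vals))

# Toy drifting targets: a random-walk appearance model.
rng = np.random.default_rng(0)
cur, targets = rng.random((20, 20)), []
for _ in range(30):
    targets.append(cur.copy())
    cur = cur + 0.1 * rng.standard_normal((20, 20))

# As the window grows, the temporal locality weakens and the
# average correlation decays, matching the trend described for Fig. 2(a).
assert (mean_correlation_within_window(targets, 2)
        > mean_correlation_within_window(targets, 25))
```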
To this end, discriminative dictionary learning for sparse representation has desirable properties for exploiting the spatial-temporal locality in this work:
- Sparse representation can provide a description from the perspective of the locality structure.
- Dictionary learning can promote the descriptive capability of the sparse representation.
- Discriminative learning can augment the separability between the target and the background.
- Additional structures can be embedded into the dictionary to well reflect the spatial-temporal locality.

To effectively describe and separate the target from the background, additional structures need to be embedded into the dictionary. As analyzed above, the spatial-temporal locality leads to strong correlations among the recently obtained targets and the background near these targets. Subspace learning is a classical method for dealing with correlations. In this work, these targets and background samples are considered to reside in a low-dimensional subspace. As a result, a subspace structure is embedded into the dictionary, such that the learned dictionary can well represent the spatial-temporal locality.

Motivated by the above observations, a novel tracking algorithm is proposed by exploiting the spatial-temporal locality of tracking via structured dictionary learning. Fig. 3 illustrates the flow diagram of our tracking algorithm. The dictionary, used to reflect the spatial-temporal locality, is learned from the recently obtained targets and the background near these targets. To capture the spatial-temporal locality structure, a subspace constraint is imposed on the dictionary atoms via low-rank learning.

Fig. 2. Statistics of the temporal and spatial correlations in the 29,486 frames of the 51 video sequences of the OTB 2013 benchmark [6]. (a) Average temporal correlations with respect to different temporal window sizes. (b) Average spatial correlations with respect to different spatial radii.
A discriminative constraint is also enforced in the learning process, such that the dictionary yields discriminative sparse codes, leading to linearly separable descriptions of the target and the background. During tracking, a number of target candidates are sampled from the current frame and then encoded by the learned dictionary. From the output sparse codes, the descriptive scores and the discriminative scores of these candidates are computed to measure the description and discrimination accuracy, respectively. Finally, the target is determined as the candidate with the largest likelihood, which is evaluated according to both the descriptive and the discriminative scores. Different from previous methods based on (discriminative) dictionary learning [8]-[10], which utilize only the temporally obtained targets, our approach exploits the temporally local correlation by analyzing the most recently obtained targets, as shown in Fig. 4(a). Meanwhile, the previous methods use background samples that are far away from the target locations, while our approach samples the background densely around the target locations to capture the spatially local correlation, as illustrated in Fig. 4(b). As a result, the most distinctive difference from the previous methods based on dictionary learning [8]-[10] is that a subspace structure is embedded into the dictionary to exploit the spatial-temporal locality of tracking in our approach. Unlike previous work that also considers temporal and spatial context information [11]-[13], our method constructs a representation through the spatial-temporal locality to describe both the target and the background. The work in [12] also utilizes the recently obtained targets and the nearby background to learn a subspace. However, the subspace transform in [12] is linear, while the proposed structured sparse coding is nonlinear, which is more powerful for representing the targets and the background.
Meanwhile, [12] incorporates the subspace reconstruction into the classifier, where the subspace reconstruction has the same feature dimension as the original samples, i.e., there is no dimension reduction or promotion, so the classifier is under more pressure to make the samples linearly separable. In contrast, in our approach the spatial-temporal locality is embedded into the dictionary and propagated to the sparse codes that are actually input to the classifier. Owing to the nonlinear coding scheme and the dimension promotion from the dictionary, the classifier performs more stably in linearly separating the targets and

the background, resulting in more discriminative capability.

Fig. 3. Flow diagram of our tracking approach.

Fig. 4. Differences between the previous methods and ours. (a) Temporal difference. (b) Spatial difference. The targets are marked in solid boxes and the background samples in dashed boxes. In each subfigure, the previous method is illustrated on the left and ours on the right.

In summary, the novelty and contributions of this work are that 1) embedding the spatial-temporal locality structure into dictionary learning is first introduced to visual tracking, and the joint learning of the sparsity and this structure is novel in visual tracking; 2) the formulation of the tracking model and its solution is new to the field of visual tracking; and 3) different from previous sparse trackers that aim at a structured representation to improve tracking performance, the proposed approach focuses on learning a structured dictionary that leads to an accurate and robust representation. Extensive experiments are conducted to evaluate the performance of the proposed technique under various challenging situations, such as abrupt appearance change, heavy occlusion, and cluttered background. Both qualitative and quantitative evaluations demonstrate that our tracking algorithm improves the performance of the baseline methods and outperforms most state-of-the-art approaches.

The remainder of this paper is organized as follows: the related work is reviewed in Section II; our approach is presented in Section III; the experimental results are reported in Section IV; the limitations of the proposed approach are briefly discussed in Section V; the conclusion of this paper is drawn in Section VI; and the evaluation algorithms for the dictionary learning and sparse coding are presented in the Appendix.

II. RELATED WORK

Constructing visual trackers via sparse learning has received much attention in recent years. It has been demonstrated that sparse learning is effective in handling partial occlusions [14]. Mei and Ling [4] introduced sparse representation to visual tracking. In their method, the current target is represented by a linear combination of a few target templates, and the residual error is used to model appearance changes such as occlusion and deformation. This paradigm is computationally expensive; subsequently, compressive sensing [15] and joint sparse representation [16] were employed to speed it up. Mei et al. [17] analyzed the minimum error bound for the sparse tracker with occlusion detection. In [18], Liu et al. proposed a dictionary learning method, named K-selection, for visual tracking and employed local sparse representation to describe the target. Jia et al. [19] developed a local sparsity based feature to improve tracking performance, and Zhong et al. [20] exploited the sparsity structure of tracking and proposed a collaborative appearance model. Sui and Zhang [21] leveraged sparse learning to construct a Gaussian process model and used regression to conduct tracking. A tracker was introduced by Zhang et al. [22] by exploiting global and local sparsity properties of appearance models. Wang et al. [23] proposed a tracking algorithm via joint optimization of representation and classification. Wang et al. [24] designed a locally weighted distance metric and employed an inverse sparse approach to formulate the tracking problem. Lan et al. [25] leveraged joint sparse representation and a feature-level fusion approach to address tracking. Dictionary learning can significantly improve the representation quality of the sparse codes [26], [27]. A typical method to learn an effective dictionary is embedding structures that are intrinsically consistent with the inherent

structures in the training data. Discrimination is one of the most widely used structures embedded in the dictionary, e.g., via least squares [28]-[30], label consistency [31], and the Fisher criterion [32]. In visual tracking, many algorithms have been proposed via various dictionary learning methods, but only a few of them embed structures into the dictionary. Bai et al. [33] learned a dictionary for sparse codes of the target and the background and adopted a support vector machine to separate them. Xing et al. [8] proposed an online multi-lifespan dictionary learning method for visual tracking. Wang et al. [9] injected the non-negativity property into the dictionary and developed an online learning method for tracking. Yang et al. [10] designed an online discriminative dictionary learning method for visual tracking. Yang et al. [34] gathered the appearance information from two collaborative feature sets by exploiting the geometric structures through dictionary learning. Xie et al. [35] proposed a discriminative object tracking approach via sparse coding based online dictionary learning. Fan and Xiang [36] adopted multi-task joint dictionary learning for visual tracking. Subspace learning is a simple but effective method in visual tracking. In this paradigm, the temporally obtained targets are assumed to reside in a low-dimensional subspace. Ross et al. [3] introduced incremental subspace learning to visual tracking. It has been demonstrated that subspace learning is robust to illumination variations and pose changes [37], [38]. However, subspace learning is very sensitive to occlusions. For this reason, many subspace learning methods [12], [39]-[44] leveraged sparse representation to enhance stability in the case of occlusions.
The basic idea of such methods is to employ subspace learning to describe the target and utilize the sparse representation to absorb the occlusions. Sui et al. [12], [13] proposed a discriminative subspace learning method to describe the target and the surrounding background, where a sparse error term is used to deal with occlusion. Wang et al. [4] proposed an approach that assumes the temporally obtained targets follow a Gaussian distribution (subspace prior) and the occlusions follow a Laplace distribution (sparsity prior). Extensive tracking algorithms beyond sparse learning and subspace learning have been proposed in recent years, achieving impressive tracking performance. Some representative approaches include structural discriminative learning [45], [46], correlation filtering [47]-[51], and deep learning and deep features [52]-[56]. Readers can refer to [1], [2] for a thorough review of state-of-the-art tracking methods.

III. PROPOSED APPROACH

During tracking, we use a rectangular box to mark the target. In each frame, a number of regions of interest (ROIs) are generated and used as the target candidates. We evaluate the likelihood of these candidates being the target, and the candidate with the highest likelihood is determined to be the target. Specifically, each candidate is associated with a motion state variable

$$s = \{x, y, \sigma\}, \tag{1}$$

where x and y denote the 2D position of a candidate and σ denotes the scale coefficient. The motion state variables are estimated by a Bayesian inference framework, as discussed in Sec. III-C. Given a candidate, we crop it out of the frame according to its motion state variable. All candidates are resized to the same size and stacked into column vectors, respectively. The likelihood of the candidates is evaluated over the corresponding column vectors.

A. Structured Dictionary Learning

Given a training sample set, denoted by X, the dictionary D used in standard sparse coding can be learned from

$$\min_{D, z_k} \sum_{x_k \in X} \ell(x_k - D z_k) \quad \text{s.t.} \quad \|z_k\|_0 \le z_0, \tag{2}$$

where ℓ(·) is an empirical loss function, z_k denotes the sparse codes of x_k, ‖·‖₀ counts the nonzeros of a vector, and z₀ is a predefined non-negative integer that controls the sparsity of the codes z_k. Because of the non-convexity of ‖·‖₀, and considering computational efficiency, the convex envelope ‖·‖₁ is usually used to approximate ‖·‖₀ in practice.

To obtain discriminative sparse codes, a classifier needs to be trained simultaneously. Assuming that each training sample x_k is assigned a binary class label y_k ∈ {+1, −1} (see Footnote 1), the discriminative dictionary is learned from

$$\min_{D, z_k, w} \sum_{x_k \in X} \ell(x_k - D z_k) + \lambda \sum_{x_k \in X} \ell_\delta(y_k - w^T z_k) \quad \text{s.t.} \quad \|z_k\|_0 \le z_0, \tag{3}$$

where ℓ_δ(·) is the classification loss function, w denotes the linear classifier (see Footnote 2), and λ is a weight parameter. As discussed above, the spatial-temporal locality is exploited by learning a dictionary with a subspace structure. We formulate the learning as

$$\min_{D, z_k, w} \sum_{x_k \in X} \ell(x_k - D z_k) + \lambda \sum_{x_k \in X} \ell_\delta(y_k - w^T z_k) \quad \text{s.t.} \quad \|z_k\|_0 \le z_0, \; \operatorname{rank}(D) \le r_0, \tag{4}$$

where rank(·) computes the rank of a matrix, and the non-negative integer r₀ controls the rank of D. Note that minimizing the rank(·) function is non-convex and has been demonstrated to be NP-hard. In practice, its convex envelope, the nuclear norm, is usually adopted to efficiently approximate the rank(·) function. In this work, to emphasize the spatial-temporal locality, the training samples consist of the recently obtained targets and the nearby background samples around the latest target location. Considering the abrupt appearance changes of the target during tracking, e.g., in the case of occlusion, we leverage the ℓ₁-loss for the empirical loss function ℓ(·), to allow for extremely large fitting errors. Meanwhile, we employ

Footnote 1: Because there are only two categories to consider, i.e., the target and the background, we discuss the problem only within binary classification.
Footnote 2: The bias unit of the linear classifier is truncated in this work.
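The paper's actual solver is the augmented Lagrange multiplier method given in the Appendix. Purely as an illustrative sketch (my own toy code, not the authors' algorithm), two standard building blocks behind formulations like (2)-(4) can be mocked up in NumPy: a greedy orthogonal-matching-pursuit step for the ℓ₀-constrained coding, and the singular value thresholding operator that arises once the rank constraint is relaxed to the nuclear norm:

```python
import numpy as np

def omp(D, x, z0):
    """Greedy (orthogonal matching pursuit) approximation of
    min_z ||x - D z||_2  s.t.  ||z||_0 <= z0, cf. Eq. (2).
    D: (d, m) dictionary with unit-norm atoms; x: (d,) sample."""
    residual, support = x.copy(), []
    z = np.zeros(D.shape[1])
    for _ in range(z0):
        k = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        if k not in support:
            support.append(k)
        # Refit the coefficients on the current support by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        z[:] = 0.0
        z[support] = coef
        residual = x - D @ z
    return z

def svt(M, tau):
    """Singular value thresholding: the proximal operator of the nuclear
    norm, which replaces the rank constraint once it is relaxed."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(1)
D = rng.standard_normal((50, 100))
D /= np.linalg.norm(D, axis=0)                    # unit-norm atoms
x = 2.0 * D[:, 3] - 1.5 * D[:, 70]                # a truly 2-sparse sample
z = omp(D, x, z0=2)
assert np.count_nonzero(z) <= 2
assert np.linalg.norm(x - D @ z) < 1e-6           # exact recovery here

L = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 30))
shrunk = svt(L + 0.01 * rng.standard_normal((40, 30)), tau=1.0)
assert np.linalg.matrix_rank(shrunk, tol=1e-6) <= 3   # noise directions removed
```

In an ALM-style solver, an operator like `svt` is applied to the dictionary variable at each iteration, which is how the low-rank (subspace) structure of Eq. (4) is enforced in practice.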

the ℓ₂-loss (squared loss) for the classification loss function ℓ_δ(·), to obtain stable classification performance. As a result, the dictionary in this work is learned from

$$\min_{D, Z, w} \|X - DZ\|_1 + \lambda \|y - w^T Z\|_F^2 \quad \text{s.t.} \quad \|Z\|_{0,\infty} \le z_0, \; \operatorname{rank}(D) \le r_0, \tag{5}$$

where each column of the matrix X = [x₁, x₂, ...] denotes a training sample, the matrix Z denotes the sparse codes of X, the row vector y = [y₁, y₂, ...] consists of the class labels of the training samples, ‖·‖₁ stands for the sum of the absolute values of all the elements of the input matrix, ‖·‖_F is the Frobenius norm, and ‖·‖₀,∞ returns the maximum number of nonzeros in any column of the input matrix. The above problem can be solved using the augmented Lagrange multiplier method; the computational details are provided in the Appendix.

Note that the low-rank structure imposed on the dictionary D is one of the key differences between our approach and previous methods. First, from the dictionary learning point of view, the dictionary is expected to reflect the intrinsic structure of the data samples. As analyzed above, the data samples have an apparent low-dimensional subspace structure. Thus, we impose a low-rank constraint on the dictionary to exploit this subspace structure, leading to a better capability to represent the data samples. Moreover, from the coding perspective, sparse coding benefits from the redundancy of the dictionary. In the proposed model, the dictionary is jointly learned with a linear classifier. Its redundancy leads to a sparser representation with a more nonlinear encoding scheme, resulting in better linear separability. Note that the joint learning framework can make a good trade-off between the dictionary redundancy (low-rank structure) and the linear separability (discrimination).

B. Target Localization

With the learned dictionary D, given a candidate, denoted by c, the sparse codes z are obtained from

$$\min_z \|c - Dz\|_1 \quad \text{s.t.} \quad \|z\|_0 \le z_0. \tag{6}$$
We can evaluate the descriptive performance on the candidate c via the fitting error ‖c − Dz‖₁: the smaller the error, the higher the descriptive performance. Under this method, a candidate more likely to be the target is expected to have a smaller error, i.e., to be better described by the learned dictionary, as illustrated in Fig. 5. As a result, we consider the fitting error as a criterion for target localization, called the descriptive score, which is defined as

$$\varepsilon(c) = \exp\left\{-\frac{1}{l_\varepsilon} \|c - Dz\|_1\right\}, \tag{7}$$

where l_ε denotes the scale of the exponential function. Meanwhile, with the learned linear classifier w, the candidate c can be classified as either the target or the background via its sparse codes z. In this work, instead of the class label, we evaluate the discriminative score of a candidate, which is defined as

$$\varphi(c) = \exp\left\{-\frac{1}{l_\varphi} \left|1 - w^T z\right|\right\}, \tag{8}$$

where l_φ denotes the scale of the exponential function. The above definition indicates that a candidate more likely to be the target is expected to have a classifier response much closer to the positive class label (the target label +1). Fig. 6 illustrates the discriminative scores of several candidates with different likelihoods of being the target. It is evident that the candidates more likely to be the target are assigned larger discriminative scores from the classifier responses.

Footnote 3: The solution of this problem is also presented in the Appendix.

Fig. 5. Illustration of descriptive performance on four different candidates. One candidate more likely to be the target (in a solid box) and three candidates less likely to be the target, with translations and scaling (in dashed boxes), are marked in the left image. Their fitting errors are shown in the right image.

Fig. 6. Illustration of discriminative scores of different candidates. From left to right, the classifier responses of the six candidates are 1.0, 0.6, 0.1, 0.1, 0.2, and 0.9, respectively.
The above discussion encourages us to consider both the descriptive score and the discriminative score when localizing the target. Consequently, we define a criterion

$$L(c) = \varepsilon(c)\, \varphi^\tau(c) \tag{9}$$

to evaluate the likelihood of a candidate c being the target, where τ is a weight parameter that balances the importance of the descriptive and the discriminative scores. Note that τ needs to be finely tuned with respect to different candidates in each frame. To circumvent tuning this parameter, we modify the above criterion into the following normalized one:

$$L(c) = \exp\left\{-\frac{1}{l} \left(\bar{\varepsilon}_c + \bar{\varphi}_c\right)\right\}, \tag{10}$$

where ε̄_c and φ̄_c are the normalized scores over all the candidates in one frame, defined as

$$\bar{\varepsilon}_c = \kappa\left\{\|c - Dz\|_1, \mathcal{C}\right\}, \tag{11}$$

$$\bar{\varphi}_c = \kappa\left\{\left|1 - w^T z\right|, \mathcal{C}\right\}, \tag{12}$$

where C denotes the set of all candidates in one frame, and the function

$$\kappa\left\{f(\alpha), \mathcal{A}\right\} = \frac{f(\alpha) - \min_{\alpha \in \mathcal{A}} f(\alpha)}{\max_{\alpha \in \mathcal{A}} f(\alpha) - \min_{\alpha \in \mathcal{A}} f(\alpha)} \tag{13}$$

normalizes the values of f(α) for all α into the range [0, 1]. In Eq. (10), the parameter l > 0 denotes the

scale of the exponential function. It is employed to exponentially enlarge the differences between the likelihoods of the candidates. In the step of target localization this scale is immaterial, because only the relative values of the likelihood are referred to. However, it matters for the resampling of the tracking framework, which relies on a Bayesian inference model, as presented in the next subsection. In all the experiments reported in this work, we set l = 10⁴ according to empirical results.

Fig. 7. Demonstration of the likelihood evaluation. (a) A frame where the ground truth target and the search area are marked in solid and dashed boxes, respectively. (b), (c), and (d) Likelihood of the candidates centered at every pixel within the search area, evaluated by using both the descriptive and the discriminative scores, the descriptive score only, and the discriminative score only, respectively. The largest likelihood is expected to appear at the center of each plot.

Below is an explanation of the likelihood function defined in Eq. (10). Fig. 7 presents a closer look at the target localization in one frame in the case of occlusions. We densely generate candidates around the ground truth target, with center locations at every pixel within the search region marked by the dashed box. We calculate and plot the likelihood of these candidates according to both the descriptive and the discriminative scores, the descriptive score only, and the discriminative score only, respectively. The largest likelihood is expected to appear at the center of each plot, i.e., to overlap with the ground truth target. Note that the ℓ₁-norm on the representation, as shown in Eqs. (5) and (6), can effectively handle the occlusions by compensating for a large error.
This indicates that the three criteria of the likelihood are robust against the occlusions in this example. However, it is evident that only the combination of both the descriptive and the discriminative scores obtains a reasonable result in this case, as shown in Fig. 7(b). Owing to the spatial-temporal locality, the dictionary and the classifier are trained on the recently obtained targets and the nearby background regions. For this reason, both the target and the nearby background can be described accurately by the dictionary. This leads to high descriptive scores for the target and the nearby background, as shown in Fig. 7(c). It indicates that the descriptive score cannot distinguish the target from its nearby regions. However, it can exclude the distant regions, because the dictionary describes the distant regions poorly. Meanwhile, although the classifier performs unstably in the distant regions due to the lack of training samples there, it can accurately distinguish the target from the nearby background regions according to the discriminative scores, as shown in Fig. 7(d). As a result, the likelihood defined in Eq. (10) penalizes the distant background regions and enhances the target region by integrating both the descriptive and the discriminative scores.

C. Tracking Framework

In the j-th frame, given the previously obtained targets t_{1:j−1}, the motion state of the j-th target, denoted by s_j, is estimated by maximizing the posterior distribution

$$p(s_j \mid t_{1:j-1}) = \int p(s_j \mid s_{j-1})\, p(s_{j-1} \mid t_{1:j-1})\, ds_{j-1}, \tag{14}$$

where p(s_j | s_{j−1}) denotes the motion model, which controls the candidate generation. Then, a candidate c is observed and the distribution is updated by

$$p(s_j \mid c, t_{1:j-1}) = \frac{p(c \mid s_j)\, p(s_j \mid t_{1:j-1})}{p(c \mid t_{1:j-1})}, \tag{15}$$

where p(c | s_j) denotes the observation model. The target in the j-th frame, denoted by t_j, is found by

$$t_j = \arg\max_{c \in \mathcal{C}} p(s_j \mid c, t_{1:j-1}). \tag{16}$$
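As an illustrative, self-contained sketch of one score-and-select step (Eqs. (10)-(13) and the diagonal-covariance Gaussian motion model are from the text; all shapes, the toy classifier, and the synthetic data are my own assumptions):

```python
import numpy as np

def minmax(v):
    """kappa{.} of Eq. (13): min-max normalization to [0, 1]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min() + 1e-12)

def likelihoods(C, Z, D, w, l=1e4):
    """Eq. (10): combine normalized descriptive and discriminative errors.
    C: candidate vectors (columns); Z: their sparse codes (columns)."""
    fit_err = np.abs(C - D @ Z).sum(axis=0)   # ||c - Dz||_1, cf. Eq. (11)
    cls_err = np.abs(1.0 - w @ Z)             # |1 - w^T z|,  cf. Eq. (12)
    return np.exp(-(minmax(fit_err) + minmax(cls_err)) / l)

def sample_states(prev, n=400, sigma=(3.0, 3.0, 0.05), rng=None):
    """Draw candidate motion states s = (x, y, scale) from the Gaussian
    motion model N(s_{j-1}, Sigma) with diagonal Sigma."""
    rng = rng or np.random.default_rng()
    return prev + rng.standard_normal((n, 3)) * np.array(sigma)

# Toy check: the candidate that is both best described and best
# classified receives the largest likelihood, per Eqs. (10) and (16).
rng = np.random.default_rng(3)
D = rng.standard_normal((20, 10))
Z = rng.standard_normal((10, 5))
# Toy classifier built so that candidate 0 responds exactly +1.
w, *_ = np.linalg.lstsq(Z.T, np.array([1., -1., -1., -1., -1.]), rcond=None)
C = D @ Z + 0.2 * rng.standard_normal((20, 5))
C[:, 0] = D @ Z[:, 0]                         # candidate 0 also fits perfectly
assert int(np.argmax(likelihoods(C, Z, D, w, l=1.0))) == 0

states = sample_states(np.array([100.0, 50.0, 1.0]), rng=rng)
assert states.shape == (400, 3)
```

In the full tracker, each sampled state would be cropped from the frame, resized, vectorized, and encoded via Eq. (6) before scoring; the argmax over candidates implements Eq. (16).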
In this work, the motion model p(s_j | s_{j−1}) is defined by a Gaussian distribution N(s_{j−1}, Σ), where the covariance matrix Σ is diagonal and its diagonal elements denote the variances of the x- and y-translations and the scaling. The observation model p(c | s_j) is defined to be proportional to the likelihood defined in Eq. (10). The covariance matrix is set to Σ = diag{3, 3, 0.05} in this work. We choose relatively small values for these three parameters, considering the trade-off between tracking accuracy and running speed. On the one hand, the Bayesian inference framework is used as a sampling strategy that can speed up tracking by conducting a local search with different possibilities. Note that smaller values of these three parameters imply a smaller search area in the sampling, which means fewer particles (candidates) are needed to cover it. Since the speed of the tracking algorithm is linearly proportional to the number of particles, we prefer fewer particles, i.e., smaller values of the three parameters. On the other hand, high tracking accuracy relies on a large search area for the candidates, especially in the case of fast motion and significant scale variation. This requires the x- and y-translations and the scale coefficient to cover a wide range in the candidate search, leading to large values of these three parameters. For example, the x- and y-translation parameters are set to 3 pixels in this work, which indicates that the tracker covers a search area with a radius of 6 pixels.

This is because 95% of the samples are located within the range of 3 standard deviations. In summary, as a trade-off between tracking accuracy and running speed, we select {3, 3, 0.05} for the three parameters of the covariance matrix.

Fig. 8. Tracking performance of the proposed tracker and the top 5 trackers in [6] over the OTB 2013 benchmark.

Fig. 9. Tracking performance of the proposed tracker and 7 other low-rank and/or sparse learning based state-of-the-art trackers over the OTB 2013 benchmark.

IV. EXPERIMENTS

The proposed tracking algorithm is implemented in Matlab R2015a on a PC with an Intel Core 2.6 GHz processor. The pixels in each frame are converted to grayscale and normalized to [0, 1]. 400 candidates are generated in each frame. The ROIs of the candidates are resized to 20 × 20 pixels. The number of dictionary atoms is set to 20, i.e., the dictionary D ∈ R^{400×20}, since the samples are 400-dimensional (20 × 20 pixels). In Eqs. (7) and (8), we set l_ε = 1 and l_φ = 1. The dictionary is learned every 5 frames, considering the efficiency of the tracking algorithm. In the dictionary learning, we use the 50 most recently obtained targets as the positive samples, and 56 background regions as the negative samples, which are located d pixels away from the latest target location, where d denotes the x- and/or y-translations of ρ times the width and/or height of the latest target, and ρ ∈ {0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4}.

We evaluated our tracker on three popular benchmarks, OTB 2013 [6], OTB 2015 [57], and VOT 2014, which consist of 51, 100, and 25 video sequences, respectively, covering various challenges such as illumination change, occlusion, background clutter, in-plane/out-of-plane rotation, and scale variation. To thoroughly evaluate the proposed tracker, we compare it with a broad set of state-of-the-art trackers. Two criteria are employed in the evaluations: precision and success rate.
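A minimal sketch of these two criteria as commonly computed on the OTB benchmarks (standard center-error and intersection-over-union definitions; the [x, y, w, h] box format and the toy data are my own assumptions):

```python
import numpy as np

def center_error(box_a, box_b):
    """Center location error (pixels) between two [x, y, w, h] boxes."""
    ca = np.array([box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2])
    cb = np.array([box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2])
    return float(np.linalg.norm(ca - cb))

def overlap(box_a, box_b):
    """Intersection-over-union of two [x, y, w, h] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union

def precision(results, gts, threshold=20.0):
    """Fraction of frames whose center error is below `threshold` pixels."""
    errs = [center_error(r, g) for r, g in zip(results, gts)]
    return float(np.mean(np.array(errs) < threshold))

def success_rate(results, gts, threshold=0.5):
    """Fraction of frames whose overlap with the ground truth exceeds `threshold`."""
    ious = [overlap(r, g) for r, g in zip(results, gts)]
    return float(np.mean(np.array(ious) > threshold))

gts = [[10, 10, 40, 40], [12, 10, 40, 40]]
results = [[10, 10, 40, 40], [52, 60, 40, 40]]   # second frame is a failure
assert precision(results, gts) == 0.5
assert success_rate(results, gts) == 0.5
```

Sweeping the thresholds (0-50 pixels for precision, 0-1 for overlap) traces out the precision and success plots reported in the figures below.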
The precision is defined as the percentage of frames in which the center location error between the tracking result and the ground truth is less than a threshold, which normally varies within the range [0, 50]. The success rate is defined as the percentage of frames in which the overlap rate between the tracking result and the ground truth is greater than a threshold varying from 0 to 1.

A. Evaluations on the Benchmarks

We extensively compared our tracker with most state-of-the-art trackers on the OTB 2013, OTB 2015, and VOT 2014 benchmarks. To thoroughly evaluate the proposed tracker, we group the comparisons into 5 categories, as shown in Figs. 8-12. In the first comparison, the baseline methods are the top 5 trackers over the OTB 2013 benchmark according to Wu et al.'s evaluation [6], including Struck [45], SCM [20], TLD [46], ASLA [19], and CXT [58].

Fig. 10. Tracking performance of the proposed tracker and 7 other state-of-the-art trackers over the OTB 2013 benchmark.

Fig. 11. Tracking performance of the proposed tracker and 8 other state-of-the-art trackers over the OTB 2015 benchmark.

Fig. 12. Tracking performance of the proposed tracker and 11 other state-of-the-art trackers over the VOT 2014 benchmark.

The results are shown in Fig. 8. It can be seen that our tracker outperforms the top 5 trackers with an increase of 6% in precision and 5% in success rate. Moreover, to demonstrate the performance improvement over the baseline methods, we compared our tracker with 7 other popular low-rank and/or sparse learning based trackers over the OTB 2013 benchmark, including LLR [43], L1APG [59], MTT [16], LRST [60], [61], CT [62], SSL [1], and PCOM [63], [64]. The results are shown in Fig. 9. It is evident that our tracker achieves the best result in terms of both success rate and precision, leading

SUI et al.: EXPLOITING SPATIAL-TEMPORAL LOCALITY OF TRACKING VIA STRUCTURED DICTIONARY LEARNING

Fig. 13. Tracking performance of our tracker and 3 other state-of-the-art trackers in different challenging cases on the OTB 2013 benchmark: (a) occlusion (31 sequences); (b) illumination change (25 sequences); (c) non-rigid deformation (19 sequences); (d) out-of-plane rotation (39 sequences); (e) background clutter (22 sequences); and (f) scale variation (28 sequences).

to significantly improved performance against the 7 baseline methods.

Furthermore, we compared our tracker with 7 additional state-of-the-art trackers over the OTB 2013 benchmark, including DSL [1], STC [11], CNT [65], KCF [48], MEEM [66], and FCNT [53]. These competing trackers leverage state-of-the-art models and algorithms, such as correlation filtering and deep learning, to address the tracking problem. The comparison results are shown in Fig. 10. It is evident that the proposed tracker obtains competitive performance compared to its counterparts; in particular, it achieves the second best performance in terms of success rate.

In addition, we compared our tracker with 8 other state-of-the-art trackers over the OTB 2015 benchmark, including PST [51], HCFT [52], KCF [48], Struck [45], SCM [20], ASLA [19], CSK [47], and L1APG [59]. These trackers are respectively related to state-of-the-art tracking models including deep learning, correlation filtering, structural learning, and sparse learning. From the comparison results shown in Fig. 11, it can be seen that the proposed tracker obtains comparable results against the 8 state-of-the-art trackers.

We also evaluated the proposed tracker on the VOT 2014 benchmark. 11 state-of-the-art trackers are employed as the baselines, including Struck [45], SCM [20], ASLA [19], IVT [3], MTT [16], LRST [60], 2DPCA [67], CT [62], PCOM [63], [64], ODFS [68], and LSST [42]. The evaluation results are shown in Fig.
12, from which it is evident that the proposed tracker outperforms the 11 baseline trackers on this benchmark in terms of both precision and success rate.

B. Evaluations in Challenging Situations

We thoroughly evaluated our tracker in various challenging situations, such as illumination change, occlusion, non-rigid deformation, and background clutter. The LLR tracker, the DSL tracker, and the STC tracker are used as the competing trackers. The results are shown in Fig. 13.

1) Occlusion: In this situation, the target is occluded by other similar or dissimilar objects, leading to abrupt changes in the target appearance. Our tracker obtains good tracking results in this scenario. This is attributed to two factors: (1) the target description is robust to occlusions due to the ℓ1-loss used in the dictionary learning, where occlusions are penalized by the sparse but large residual errors; and (2) the discrimination of the sparse codes is able to separate the target from the nearby background regions.

2) Illumination Change: In this case, the appearance of the target changes significantly due to variations of the lighting conditions in the scene. It can be seen from the results that, owing to the descriptive capability of the sparse codes, our tracker is insensitive to illumination changes. In fact, the sparse learning method can be regarded as a non-linear redundant

subspace approach, where the atoms (i.e., the column vectors of the dictionary) are considered as an overcomplete (i.e., non-orthogonal) set of basis vectors of the subspace, and the sparse codes are viewed as the coordinates in that subspace. It has been demonstrated that subspace methods are effective against illumination change [37], [38]. From the subspace perspective, this explains why our method performs better in this case.

3) Non-Rigid Deformation: Non-rigid deformation usually leads to local changes in the target appearance. The ℓ1-loss used in the dictionary learning is robust to these local changes; for this reason, our tracker performs very well in this situation. Moreover, from the subspace point of view, our method is insensitive to pose changes (one case of non-rigid deformation), because it has been shown that subspace methods can successfully deal with pose changes in visual tracking [37], [38].

4) Out-of-Plane Rotation: Motion of either the target or the camera can result in out-of-plane rotations. This challenge leads to significant changes in the target appearance. It is evident from the results that our tracker obtains much better results in this situation, because our method can effectively distinguish the target from the nearby background regions.

5) Background Clutter: In this scenario, the cluttered background distracts the tracker severely and leads to tracking drift. It can be seen from the results that our tracker performs the best in this case. This is also attributed to the good discriminative capability of our method, by which the target and the background can be accurately separated.

6) Scale Variation: In this situation, the scale of the target appearance varies across successive frames during tracking. It is evident from the results that our tracker obtains better results in the case of scale variation.
This is attributed to (1) the higher accuracy of the target description achieved through the dictionary learning, and (2) the fact that the motion model of our tracking framework accounts for the scale variation between successive frames through the scale coefficient σ in Eq. (1).

C. Effectiveness of the Spatial-Temporal Locality

To verify the effect of the spatial-temporal locality on the tracking performance, we conducted the following experiments on the OTB 2015 and the VOT 2014 benchmarks. We implemented two trackers: one, denoted STL, exploits the spatial-temporal locality; the other, denoted NSTL, does not. The STL tracker learns the dictionary over the recently obtained targets and the nearby background, while the NSTL tracker learns the dictionary over all the previously obtained targets, as well as both the distant and the nearby background, with the low-rank constraint removed. Except for the difference in the usage of the spatial-temporal locality, the two trackers are identical. We evaluated the two trackers on the OTB 2015 and the VOT 2014 benchmarks; the results are shown in Figs. 14 and 15. It is evident from the results that the STL tracker significantly outperforms the NSTL tracker on both benchmarks.

Fig. 14. Tracking performance of the trackers that exploit (STL) and do not exploit (NSTL) the spatial-temporal locality, respectively, on the OTB 2015 benchmark.

Fig. 15. Tracking performance of the trackers that exploit (STL) and do not exploit (NSTL) the spatial-temporal locality, respectively, on the VOT 2014 benchmark.

This indicates that exploiting the spatial-temporal locality of tracking can significantly improve the tracking performance.
Without the spatial-temporal locality, the NSTL tracker needs a large number of training samples to compensate for the diversity of the background samples and to obtain good discrimination with the linear classifier (indeed, a nonlinear classifier might be required). In contrast, the STL tracker successfully fits the inherent structure among the training samples by exploiting the spatial-temporal locality, leading to a classifier with stronger discriminative capability.

D. Effectiveness of the Structured Dictionary Learning

As shown above, exploiting the spatial-temporal locality can significantly improve the tracking performance. It is therefore reasonable to investigate how different implementations of the spatial-temporal locality influence the tracking performance. Our approach implements the spatial-temporal locality via discriminative dictionary learning with a subspace structure. We construct three other trackers, which implement the spatial-temporal locality via a random dictionary [43], template-based dictionary construction [59], and K-SVD dictionary learning [69], respectively. We use 20 atoms for our dictionary, the random dictionary, and the K-SVD dictionary. Following [59], we use 10 atoms for the template dictionary. We investigate the tracking performance on the OTB 2015 and the VOT 2014 benchmarks. The results are shown in Figs. 16 and 17. It is evident from the results that our approach yields more accurate target localization than its three peers. As a result, we recommend exploiting the spatial-temporal locality of tracking via discriminative dictionary learning with a subspace structure.

Furthermore, we investigate the influence of the number of dictionary atoms on the tracking performance. We select five dictionaries with 5, 10, 20, 30, and 50 atoms

and evaluate the corresponding trackers on the OTB 2015 and the VOT 2014 benchmarks. The results are shown in Figs. 18 and 19. It is evident that the tracker with 20 dictionary atoms obtains the best performance. A dictionary with more atoms has weaker representation capability, while a dictionary with fewer atoms lacks the redundancy required by the sparse coding. As a result, we empirically set the number of dictionary atoms to 20.

Fig. 16. Tracking performance of the four implementations of the spatial-temporal locality on the OTB 2015 benchmark.

Fig. 17. Tracking performance of the four implementations of the spatial-temporal locality on the VOT 2014 benchmark.

Fig. 18. Tracking performance with respect to different numbers d of the dictionary atoms on the OTB 2015 benchmark.

Fig. 19. Tracking performance with respect to different numbers d of the dictionary atoms on the VOT 2014 benchmark.

E. Representation vs. Discrimination

To investigate the contributions of the representation (through the dictionary learning based sparse representation) and the discrimination (via the linear classification) to the tracking performance, we construct two other trackers that leverage dictionary learning based sparse representation (DLSR) and linear classification (LC), respectively. Figs. 20 and 21 show the tracking performance of the two trackers on the OTB 2015 and the VOT 2014 benchmarks. It is evident that jointly exploiting the representation and the discrimination yields better performance than exploiting either of them independently on the two benchmarks.

Fig. 20. Tracking performance of the trackers that leverage dictionary learning based sparse representation (DLSR), linear classification (LC), and both (Ours), respectively, on the OTB 2015 benchmark.

Fig. 21. Tracking performance of the trackers that leverage dictionary learning based sparse representation (DLSR), linear classification (LC), and both (Ours), respectively, on the VOT 2014 benchmark.

TABLE I. RUNNING SPEED IN FPS (FRAMES PER SECOND) OF THE PROPOSED AND SOME SPARSE TRACKERS.

F. Complexity and Running Speed

The proposed approach relies on an iterative algorithm in which SVD and shrinkage operations are involved at each iteration. Note that the SVD has a cubic-order complexity, while the shrinkage has a linear-order computational cost; the SVD therefore dominates the computational complexity of the iterations. Specifically, the SVD is imposed on the dictionary, which is a matrix of size m × d, so the computational complexity of the proposed algorithm is O(md²).

We compared the running speed of our tracker and those of 6 other state-of-the-art sparse trackers. Table I shows the running speed in fps (frames per second) of the proposed and the other 6 sparse trackers. It is evident that although our tracker does not run in real time, it still runs much faster than the other 5 sparse trackers, and only slightly more slowly than the DSL tracker. Note that the proposed tracker can be sped up in a straightforward way. The proposed tracking model is driven by the

Bayesian inference framework. The running speed is linearly proportional to the number of candidates. Note that the candidates are evaluated independently for the target localization, which indicates that these evaluations can be conducted in parallel. In that case, the proposed tracker can be sped up significantly by employing concurrent computation methods, such as multi-thread/process implementations or a distributed computing system.

V. LIMITATION

Any model has its limitations for various data, applications, and systems. In this work, the Bayesian inference method is adopted to formulate the motion model of the proposed tracker, where the motion transition of the target between two consecutive frames is modeled by a Gaussian distribution with a zero mean and a diagonal covariance matrix. Two of the diagonal elements control the translations of the target in the x- and y-directions, respectively. Note that, for a Gaussian distribution, almost all samples (99.7%) are drawn from the range within three standard deviations. We set the variances of the translations (i.e., two of the diagonal elements) to 3. This indicates that we have a very high probability of covering the target in the next frame if the target does not move more than 6 pixels in that frame; otherwise, we might lose the target. We could set the variances of the translations to larger values, so that a wider range covers the target in the next frame. However, we would have to increase the number of candidates at the same time to ensure an adequate sampling density within this range. As a result, the tracker would be slowed down, since the running speed is proportional to the number of candidates. Thus, a trade-off between the tracking accuracy and the running speed has to be made to configure our tracker as presented above.

VI.
CONCLUSION

We have proposed a powerful structure for visual tracking, the spatial-temporal locality, constructed from the recently obtained targets and the nearby background regions, and we have exploited this locality via structured dictionary learning. A large number of experiments have been conducted, and the results have shown that (1) the spatial-temporal locality can significantly improve the tracking performance; and (2) discriminative dictionary learning with a subspace structure is the recommended way to exploit the spatial-temporal locality of tracking. Both the qualitative and quantitative evaluations on various challenging video sequences have demonstrated that our tracker outperforms most other state-of-the-art trackers.

APPENDIX A
ALGORITHM OF THE DICTIONARY LEARNING

The dictionary learning problem shown in Eq. (5) of the manuscript is

$$\min_{D, Z, w} \|X - DZ\|_1 + \lambda \|y - w^T Z\|_F^2, \quad \text{s.t. } \|z\|_0 \le z_0, \; \operatorname{rank}(D) \le r_0.$$

To solve it efficiently, we relax the constraint on $Z$ to $\|Z\|_1 \le n z_0$, where $n$ denotes the number of columns of $Z$. The minimization can then be written as

$$\min_{D, Z, E, w} \|Z\|_1 + \alpha \|E\|_1 + \beta \operatorname{rank}(D), \quad \text{s.t. } X = DZ + E, \; y = w^T Z, \tag{17}$$

where the relaxation variable $E$ is introduced to implement the $\ell_1$-loss, and $\alpha$ and $\beta$ are weight parameters. Note that $Z$ and $D$ are coupled by the multiplications in the constraints. To decouple them, two further relaxation variables $Z_2$ and $D_2$ are introduced, and $Z$ and $D$ are rewritten as $Z_1$ and $D_1$, respectively. In addition, considering the stability of the linear classifier and to avoid overfitting, a regularization on $w$ is enforced. As a result, the above problem is approximated by

$$\min_{D_1, D_2, Z_1, Z_2, E, w} \|Z_1\|_1 + \alpha \|E\|_1 + \beta \|D_1\|_* + \frac{\gamma}{2}\|w\|_2^2, \quad \text{s.t. } X = D_2 Z_2 + E, \; y = w^T Z_2, \; Z_1 = Z_2, \; D_1 = D_2, \tag{18}$$

where the $\ell_0$-norm and the $\operatorname{rank}(\cdot)$ function are approximated by their convex conjugates, the $\ell_1$-norm and the nuclear norm, respectively, and $\gamma$ is a weight parameter. Note that there are six variables in the above problem.
Although the problem is not convex with respect to all six variables jointly, it is convex with respect to each of them individually. For this reason, an iterative algorithm is developed: in each iteration, we alternately solve for one variable while fixing the other five. We use the augmented Lagrange multiplier (ALM) method to solve the above problem. The ALM formulation is

$$\begin{aligned} \min_{D_1, D_2, Z_1, Z_2, E, w} \; & \|Z_1\|_1 + \alpha\|E\|_1 + \beta\|D_1\|_* + \frac{\gamma}{2}\|w\|_2^2 \\ & + \langle Y_1, X - D_2 Z_2 - E\rangle + \langle Y_2, y - w^T Z_2\rangle + \langle Y_3, Z_2 - Z_1\rangle + \langle Y_4, D_2 - D_1\rangle \\ & + \frac{\mu}{2}\left( \|y - w^T Z_2\|_F^2 + \|X - D_2 Z_2 - E\|_F^2 + \|Z_2 - Z_1\|_F^2 + \|D_2 - D_1\|_F^2 \right), \end{aligned} \tag{19}$$

where $Y_1$, $Y_2$, $Y_3$, and $Y_4$ denote the Lagrange multipliers, $\mu$ is a penalty parameter, and $\langle \cdot, \cdot \rangle$ denotes the inner product. In this work, we empirically set $\alpha = \beta = \gamma = 1$ and $\mu = 10$.

Compute $D_1$. The subproblem with respect to $D_1$ reduces to

$$\min_{D_1} \beta\|D_1\|_* + \langle Y_4, D_2 - D_1\rangle + \frac{\mu}{2}\|D_2 - D_1\|_F^2. \tag{20}$$

By completing squares and using the singular value thresholding algorithm [70], $D_1$ is obtained from

$$D_1 = U S_{\beta/\mu}(S) V^T, \tag{21}$$

where $[U, S, V^T] = \operatorname{svd}(D_2 + Y_4/\mu)$, and $S_\theta(\cdot)$ denotes the shrinkage function, defined as

$$S_\theta(x) = \operatorname{sign}(x)\max(0, |x| - \theta), \quad \theta > 0. \tag{22}$$
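The two proximal steps above can be made concrete with a short sketch: the element-wise shrinkage $S_\theta$ of Eq. (22) and the singular value thresholding step of Eq. (21). This is an illustrative NumPy sketch under our own naming, not the authors' implementation.

```python
import numpy as np

def shrink(x, theta):
    """Element-wise shrinkage S_theta(x) = sign(x) * max(0, |x| - theta)."""
    return np.sign(x) * np.maximum(0.0, np.abs(x) - theta)

def svt(M, theta):
    """Singular value thresholding: apply S_theta to the singular values of M,
    i.e., return U S_theta(S) V^T (the proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * shrink(s, theta)) @ Vt  # scale columns of U by thresholded values
```

Thresholding the singular values zeroes out the small ones, so this update drives the dictionary toward low rank, which is exactly the role of the nuclear-norm term in Eq. (18).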

Compute $D_2$. The subproblem with respect to $D_2$ reduces to

$$\min_{D_2} \langle Y_1, X - D_2 Z_2 - E\rangle + \langle Y_4, D_2 - D_1\rangle + \frac{\mu}{2}\left(\|X - D_2 Z_2 - E\|_F^2 + \|D_2 - D_1\|_F^2\right), \tag{23}$$

which is a least squares problem with the closed-form solution

$$D_2 = \left[(X - E + Y_1/\mu) Z_2^T + D_1 - Y_4/\mu\right]\left(Z_2 Z_2^T + I\right)^{-1}, \tag{24}$$

where $I$ denotes the identity matrix.

Compute $Z_1$. The subproblem with respect to $Z_1$ reduces to

$$\min_{Z_1} \|Z_1\|_1 + \langle Y_3, Z_2 - Z_1\rangle + \frac{\mu}{2}\|Z_2 - Z_1\|_F^2. \tag{25}$$

By completing squares and using the iterative shrinkage algorithm [71], $Z_1$ is obtained from

$$Z_1 = S_{1/\mu}(Z_2 + Y_3/\mu). \tag{26}$$

Compute $Z_2$. The subproblem with respect to $Z_2$ reduces to

$$\min_{Z_2} \langle Y_1, X - D_2 Z_2 - E\rangle + \langle Y_2, y - w^T Z_2\rangle + \langle Y_3, Z_2 - Z_1\rangle + \frac{\mu}{2}\left(\|y - w^T Z_2\|_F^2 + \|X - D_2 Z_2 - E\|_F^2 + \|Z_2 - Z_1\|_F^2\right), \tag{27}$$

which is a least squares problem with the closed-form solution

$$Z_2 = \left(D_2^T D_2 + w w^T + I\right)^{-1}\left[Z_1 - Y_3/\mu + D_2^T(X - E + Y_1/\mu) + w(y + Y_2/\mu)\right], \tag{28}$$

where $I$ denotes the identity matrix.

Compute $E$. The subproblem with respect to $E$ reduces to

$$\min_{E} \alpha\|E\|_1 + \langle Y_1, X - D_2 Z_2 - E\rangle + \frac{\mu}{2}\|X - D_2 Z_2 - E\|_F^2. \tag{29}$$

By completing squares and using the iterative shrinkage algorithm [71], $E$ is obtained from

$$E = S_{\alpha/\mu}(X - D_2 Z_2 + Y_1/\mu). \tag{30}$$

Compute $w$. The subproblem with respect to $w$ reduces to

$$\min_{w} \frac{\gamma}{2}\|w\|_2^2 + \langle Y_2, y - w^T Z_2\rangle + \frac{\mu}{2}\|y - w^T Z_2\|_F^2, \tag{31}$$

which is a least squares problem with the closed-form solution

$$w = \left(\frac{\gamma}{\mu} I + Z_2 Z_2^T\right)^{-1} Z_2 \left(y + Y_2/\mu\right)^T, \tag{32}$$

where $I$ denotes the identity matrix.

The solutions of the six variables are obtained when the iterative algorithm converges. Convergence is assessed by evaluating $\|X - D_2 Z_2 - E\| + \|y - w^T Z_2\| + \|D_2 - D_1\| + \|Z_2 - Z_1\|$; when its difference between two consecutive iterations is smaller than a threshold, e.g., $10^{-8}$, the iterative algorithm stops.

APPENDIX B
ALGORITHM OF THE SPARSE CODING

The sparse coding problem shown in Eq. (6) of the manuscript is

$$\min_{z} \|c - Dz\|_1, \quad \text{s.t. } \|z\|_0 \le z_0,$$

which can be written as

$$\min_{z, e} \|z\|_1 + \lambda\|e\|_1, \quad \text{s.t. } c = Dz + e, \tag{33}$$

where a relaxation variable $e$ is introduced, $\lambda$ is a weight parameter, and the $\ell_0$-norm of $z$ is approximated by its convex conjugate, the $\ell_1$-norm. Note that the above problem is not convex with respect to $z$ and $e$ jointly; however, it is convex with respect to either $z$ or $e$. Furthermore, because $z$ is coupled by the multiplication in the constraint, another relaxation variable $z_2$ is introduced, and $z$ is rewritten as $z_1$. As a result, the ALM formulation of the problem is

$$\min_{z_1, z_2, e} \|z_1\|_1 + \lambda\|e\|_1 + \langle Y_1, c - Dz_2 - e\rangle + \langle Y_2, z_2 - z_1\rangle + \frac{\mu}{2}\left(\|c - Dz_2 - e\|^2 + \|z_2 - z_1\|^2\right), \tag{34}$$

where $Y_1$ and $Y_2$ denote the Lagrange multipliers. An iterative algorithm is derived to solve for $z_1$, $z_2$, and $e$: in each iteration, we alternately solve for one variable while fixing the others.

Compute $z_1$. The subproblem with respect to $z_1$ reduces to

$$\min_{z_1} \|z_1\|_1 + \langle Y_2, z_2 - z_1\rangle + \frac{\mu}{2}\|z_2 - z_1\|^2. \tag{35}$$

By completing squares and using the iterative shrinkage algorithm [71], $z_1$ is obtained from

$$z_1 = S_{1/\mu}(z_2 + Y_2/\mu). \tag{36}$$

Compute $z_2$. The subproblem with respect to $z_2$ reduces to

$$\min_{z_2} \langle Y_1, c - Dz_2 - e\rangle + \langle Y_2, z_2 - z_1\rangle + \frac{\mu}{2}\left(\|c - Dz_2 - e\|^2 + \|z_2 - z_1\|^2\right), \tag{37}$$

which is a least squares problem with the closed-form solution

$$z_2 = \left(D^T D + I\right)^{-1}\left[z_1 - Y_2/\mu + D^T(c - e + Y_1/\mu)\right], \tag{38}$$

where $I$ denotes the identity matrix.

Compute $e$. The subproblem with respect to $e$ reduces to

$$\min_{e} \lambda\|e\|_1 + \langle Y_1, c - Dz_2 - e\rangle + \frac{\mu}{2}\|c - Dz_2 - e\|^2. \tag{39}$$

By completing squares and using the iterative shrinkage algorithm [71], $e$ is obtained from

$$e = S_{\lambda/\mu}(c - Dz_2 + Y_1/\mu). \tag{40}$$

Convergence is determined from the difference between two consecutive iterations of the value $\|c - Dz_2 - e\| + \|z_2 - z_1\|$.
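The sparse coding iteration of Eqs. (34)-(40) can be sketched compactly as below. This is an illustrative implementation under assumptions of our own (a fixed penalty μ = 10, a fixed iteration count instead of the 10⁻⁸ stopping rule, and our own function names); it follows the update order z₁ → z₂ → e, with gradient-ascent updates of the multipliers Y₁ and Y₂.

```python
import numpy as np

def shrink(x, theta):
    """S_theta(x) = sign(x) * max(0, |x| - theta)."""
    return np.sign(x) * np.maximum(0.0, np.abs(x) - theta)

def sparse_code(c, D, lam=1.0, mu=10.0, n_iter=300):
    """ALM iteration for: min ||z||_1 + lam * ||e||_1  s.t.  c = D z + e."""
    m, d = D.shape
    z1, z2 = np.zeros(d), np.zeros(d)
    e = np.zeros(m)
    Y1, Y2 = np.zeros(m), np.zeros(d)
    A = np.linalg.inv(D.T @ D + np.eye(d))  # constant system matrix of Eq. (38)
    for _ in range(n_iter):
        z1 = shrink(z2 + Y2 / mu, 1.0 / mu)                  # Eq. (36)
        z2 = A @ (z1 - Y2 / mu + D.T @ (c - e + Y1 / mu))    # Eq. (38)
        e = shrink(c - D @ z2 + Y1 / mu, lam / mu)           # Eq. (40)
        Y1 += mu * (c - D @ z2 - e)  # ascent on the multiplier of c = Dz + e
        Y2 += mu * (z2 - z1)         # ascent on the multiplier of z1 = z2
    return z1, e
```

On a toy problem where c is exactly one dictionary atom scaled and e is heavily penalized, the recovered code concentrates on that atom, illustrating the sparsity of the solution.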

REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, Object tracking: A survey, ACM Comput. Surveys, vol. 38, no. 4, Dec. 2006.
[2] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, Visual tracking: An experimental survey, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, Jul. 2014.
[3] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vis., vol. 77, nos. 1-3, Aug. 2008.
[4] X. Mei and H. Ling, Robust visual tracking using ℓ1 minimization, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Sep. 2009.
[5] X. Li, A. Dick, C. Shen, A. van den Hengel, and H. Wang, Incremental learning of 3D-DCT compact representations for robust visual tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 4, Apr. 2013.
[6] Y. Wu, J. Lim, and M.-H. Yang, Online object tracking: A benchmark, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013.
[7] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, Sparse representation for computer vision and pattern recognition, Proc. IEEE, vol. 98, no. 6, Jun. 2010.
[8] J. Xing, J. Gao, B. Li, W. Hu, and S. Yan, Robust object tracking with online multi-lifespan dictionary learning, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2013.
[9] N. Wang, J. Wang, and D. Yeung, Online robust non-negative dictionary learning for visual tracking, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Sep. 2013.
[10] F. Yang, Z. Jiang, and L. S. Davis, Online discriminative dictionary learning for visual tracking, in Proc. IEEE Winter Conf. Appl. Comput. Vis.
(WACV), Mar. 2014.
[11] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M.-H. Yang, Fast visual tracking via dense spatio-temporal context learning, in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014.
[12] Y. Sui, Y. Tang, and L. Zhang, Discriminative low-rank tracking, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015.
[13] Y. Sui, Y. Tang, L. Zhang, and G. Wang, Visual tracking via subspace learning: A discriminative approach, Int. J. Comput. Vis., 2017.
[14] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, Feb. 2009.
[15] H. Li, C. Shen, and Q. Shi, Real-time visual tracking using compressive sensing, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011.
[16] T. Zhang, B. Ghanem, and S. Liu, Robust visual tracking via multi-task sparse learning, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012.
[17] X. Mei, H. Ling, and Y. Wu, Minimum error bounded efficient ℓ1 tracker with occlusion detection, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011.
[18] B. Liu, J. Huang, L. Yang, and C. Kulikowski, Robust tracking using local sparse appearance model and K-selection, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011.
[19] X. Jia, H. Lu, and M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012.
[20] W. Zhong, H. Lu, and M.-H. Yang, Robust object tracking via sparsity-based collaborative model, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2012.
[21] Y. Sui and L. Zhang, Visual tracking via locally structured Gaussian process regression, IEEE Signal Process. Lett., vol. 22, no. 9, Sep. 2015.
[22] T. Zhang et al., Structural sparse tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015.
[23] Q. Wang, F. Chen, W. Xu, and M.-H. Yang, Object tracking with joint optimization of representation and classification, IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 4, Apr. 2015.
[24] D. Wang, H. Lu, Z. Xiao, and M.-H. Yang, Inverse sparse tracker with a locally weighted distance metric, IEEE Trans. Image Process., vol. 24, no. 9, Sep. 2015.
[25] X. Lan, A. J. Ma, P. C. Yuen, and R. Chellappa, Joint sparse representation and robust feature-level fusion for multi-cue visual tracking, IEEE Trans. Image Process., vol. 24, no. 12, Dec. 2015.
[26] Y. Cong, J. Liu, G. Sun, Q. You, Y. Li, and J. Luo, Adaptive greedy dictionary selection for Web media summarization, IEEE Trans. Image Process., vol. 26, no. 1, Jan. 2017.
[27] X. Liu, D. Zhai, J. Zhou, S. Wang, D. Zhao, and H. Guo, Sparsity-based image error concealment via adaptive dual dictionary learning and regularization, IEEE Trans. Image Process., vol. 26, no. 2, Feb. 2017.
[28] D.-S. Pham and S. Venkatesh, Joint learning and dictionary construction for pattern recognition, in Proc. IEEE Conf. CVPR, Anchorage, AK, USA, Jun. 2008.
[29] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, Discriminative learned dictionaries for local image analysis, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2008.
[30] Q. Zhang and B. Li, Discriminative K-SVD for dictionary learning in face recognition, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010.
[31] Z. Jiang, Z. Lin, and L. S.
Davis, Learning a discriminative dictionary for sparse coding via label consistent K-SVD, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011.
[32] M. Yang, L. Zhang, X. Feng, and D. Zhang, Fisher discrimination dictionary learning for sparse representation, in Proc. 13th IEEE Int. Conf. Comput. Vis., Nov. 2011.
[33] T. Bai, Y. F. Li, and X. Zhou, Robust and fast visual tracking using constrained sparse coding and dictionary learning, in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012.
[34] F. Yang, H. Lu, and M.-H. Yang, Learning structured visual dictionary for object tracking, Image Vis. Comput., vol. 31, no. 12, Dec. 2013.
[35] Y. Xie, W. Zhang, Y. Qu, and Y. Zhang, Discriminative subspace learning with sparse representation view-based model for robust visual tracking, Pattern Recognit., vol. 47, no. 3, Mar. 2014.
[36] H. Fan and J. Xiang, Robust visual tracking with multitask joint dictionary learning, IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 5, May 2017.
[37] G. D. Hager and P. N. Belhumeur, Real-time tracking of image regions with changes in geometry and illumination, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 1996.
[38] P. N. Belhumeur and D. J. Kriegman, What is the set of images of an object under all possible lighting conditions? in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 1996.
[39] D. Wang, H. Lu, and M.-H. Yang, Online object tracking with sparse prototypes, IEEE Trans. Image Process., vol. 22, no. 1, Jan. 2013.

[40] Y. Sui, S. Zhang, and L. Zhang, Robust visual tracking via sparsity-induced subspace learning, IEEE Trans. Image Process., vol. 24, no. 12, Dec. 2015.
[41] Y. Sui, X. Zhao, S. Zhang, X. Yu, S. Zhao, and L. Zhang, Self-expressive tracking, Pattern Recognit., vol. 48, no. 9, 2015.
[42] D. Wang, H. Lu, and M.-H. Yang, Least soft-threshold squares tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013.
[43] Y. Sui and L. Zhang, Robust tracking via locally structured representation, Int. J. Comput. Vis., vol. 119, no. 2, 2016.
[44] Y. Sui, G. Wang, Y. Tang, and L. Zhang, Tracking completion, in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016.
[45] S. Hare, A. Saffari, and P. H. S. Torr, Struck: Structured output tracking with kernels, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011.
[46] Z. Kalal, K. Mikolajczyk, and J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, Jul. 2012.
[47] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in Proc. Eur. Conf. Comput. Vis. (ECCV), 2012.
[48] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, Mar. 2015.
[49] S. Liu, T. Zhang, X. Cao, and C. Xu, Structural correlation filter for robust visual tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016.
[50] Y. Sui, Z. Zhang, G. Wang, Y. Tang, and L. Zhang, Real-time visual tracking: Promoting the robustness of correlation filter learning, in Proc. Eur. Conf.
Comput. Vis. (ECCV), 2016.
[51] Y. Sui, G. Wang, and L. Zhang, Correlation filter learning toward peak strength for visual tracking, IEEE Trans. Cybern., to be published.
[52] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, Hierarchical convolutional features for visual tracking, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015.
[53] L. Wang, W. Ouyang, X. Wang, and H. Lu, Visual tracking with fully convolutional networks, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015.
[54] Y. Qi et al., Hedged deep tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016.
[55] L. Wang, W. Ouyang, X. Wang, and H. Lu, STCT: Sequentially training convolutional networks for visual tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016.
[56] H. Nam and B. Han, Learning multi-domain convolutional neural networks for visual tracking, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016.
[57] Y. Wu, J. Lim, and M.-H. Yang, Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, Sep. 2015.
[58] T. B. Dinh, N. Vo, and G. Medioni, Context tracker: Exploring supporters and distracters in unconstrained environments, in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011.
[59] X. Mei and H. Ling, Robust visual tracking and vehicle classification via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, Nov. 2011.
[60] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, Low-rank sparse learning for robust visual tracking, in Proc. Eur. Conf. Comput. Vis. (ECCV), 2012.
[61] T. Zhang, S. Liu, N. Ahuja, M.-H. Yang, and B. Ghanem, Robust visual tracking via consistent low-rank sparse learning, Int. J. Comput. Vis., vol. 111, no. 2, Jun. 2015.
[62] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time compressive tracking," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2012.
[63] D. Wang and H. Lu, "Visual tracking via probability continuous outlier model," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014.
[64] D. Wang, H. Lu, and C. Bo, "Fast and robust object tracking via probability continuous outlier model," IEEE Trans. Image Process., vol. 24, no. 12, Dec. 2015.
[65] K. Zhang, Q. Liu, Y. Wu, and M.-H. Yang, "Robust visual tracking via convolutional networks without training," IEEE Trans. Image Process., vol. 25, no. 4, Apr. 2016.
[66] J. Zhang, S. Ma, and S. Sclaroff, "MEEM: Robust tracking via multiple experts using entropy minimization," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014.
[67] D. Wang and H. Lu, "Object tracking via 2DPCA and ℓ1-regularization," IEEE Signal Process. Lett., vol. 19, no. 11, Nov. 2012.
[68] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time object tracking via online discriminative feature selection," IEEE Trans. Image Process., vol. 22, no. 12, Dec. 2013.
[69] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[70] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM J. Optim., vol. 20, no. 4, pp. 1956–1982, 2010.
[71] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.

Yao Sui received the master's degree in software engineering from Peking University, Beijing, China, and the Ph.D. degree in electronic engineering from Tsinghua University, Beijing.
He was a Post-Doctoral Researcher with the Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS, USA. He is currently a Research Fellow with Harvard Medical School, Harvard University, Boston, MA, USA. His research interests include machine learning, computer vision, image processing, pattern recognition, and medical imaging.

Guanghui Wang (SM'17) received the Ph.D. degree in computer vision from the University of Waterloo, Waterloo, ON, Canada. From 2003 to 2005, he was a Research Fellow and a Visiting Scholar with the Department of Electronic Engineering, The Chinese University of Hong Kong. From 2005 to 2007, he was a Professor with the Department of Control Engineering, Changchun Aviation University, China. From 2006 to 2010, he was a Research Fellow with the Department of Electrical and Computer Engineering, University of Windsor, Canada. He is currently an Assistant Professor with the University of Kansas, Lawrence, KS, USA. He is also an Adjunct Professor with the Institute of Automation, Chinese Academy of Sciences, China. He has authored the book Guide to Three-Dimensional Structure and Motion Factorization (Springer-Verlag) and has authored or co-authored over 90 papers in peer-reviewed journals and conferences. His research interests include computer vision, structure from motion, object detection and tracking, artificial intelligence, and robot localization and navigation. He has served as an associate editor and on the editorial board of two journals, as an area chair or TPC member of over 20 conferences, and as a reviewer for over 20 journals.

Li Zhang received the B.S., M.S., and Ph.D. degrees from Tsinghua University, Beijing, China, all in electronic engineering. He was an Instructor with the Department of Physics, Tsinghua University, from 1987 to 1992. He joined the Department of Electronic Engineering, Tsinghua University, as a faculty member in 1992, where he is currently a Professor and the Director of the UVA Vision Laboratory. He is also a member of the National Laboratory of Pattern Recognition, Beijing. His interests include image processing, computer vision, pattern recognition, and computer graphics.

Ming-Hsuan Yang (SM'06) received the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 2000. In 2008, he joined the University of California at Merced (UC Merced), where he is currently a Professor of electrical engineering and computer science. He was a Senior Research Scientist with the Honda Research Institute, where he worked on vision problems related to humanoid robots. He served as an Associate Editor of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE from 2007 to 2011. He is currently an Associate Editor of the International Journal of Computer Vision, Image and Vision Computing, and the Journal of Artificial Intelligence Research. He received the NSF CAREER Award in 2012, the Senate Award for Distinguished Early Career Research from UC Merced in 2011, and the Google Faculty Award in 2009. He is a Senior Member of the ACM.


BSIK-SVD: A DICTIONARY-LEARNING ALGORITHM FOR BLOCK-SPARSE REPRESENTATIONS. Yongqin Zhang, Jiaying Liu, Mading Li, Zongming Guo 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) BSIK-SVD: A DICTIONARY-LEARNING ALGORITHM FOR BLOCK-SPARSE REPRESENTATIONS Yongqin Zhang, Jiaying Liu, Mading Li, Zongming

More information

An Improved Approach For Mixed Noise Removal In Color Images

An Improved Approach For Mixed Noise Removal In Color Images An Improved Approach For Mixed Noise Removal In Color Images Ancy Mariam Thomas 1, Dr. Deepa J 2, Rijo Sam 3 1P.G. student, College of Engineering, Chengannur, Kerala, India. 2Associate Professor, Electronics

More information

Dynamic Shape Tracking via Region Matching

Dynamic Shape Tracking via Region Matching Dynamic Shape Tracking via Region Matching Ganesh Sundaramoorthi Asst. Professor of EE and AMCS KAUST (Joint work with Yanchao Yang) The Problem: Shape Tracking Given: exact object segmentation in frame1

More information

Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python

Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python J. Ge, X. Li, H. Jiang, H. Liu, T. Zhang, M. Wang and T. Zhao Abstract We describe a new library named picasso, which

More information

Multiresponse Sparse Regression with Application to Multidimensional Scaling

Multiresponse Sparse Regression with Application to Multidimensional Scaling Multiresponse Sparse Regression with Application to Multidimensional Scaling Timo Similä and Jarkko Tikka Helsinki University of Technology, Laboratory of Computer and Information Science P.O. Box 54,

More information

Face Recognition Based on LDA and Improved Pairwise-Constrained Multiple Metric Learning Method

Face Recognition Based on LDA and Improved Pairwise-Constrained Multiple Metric Learning Method Journal of Information Hiding and Multimedia Signal Processing c 2016 ISSN 2073-4212 Ubiquitous International Volume 7, Number 5, September 2016 Face Recognition ased on LDA and Improved Pairwise-Constrained

More information

Tracking Using Online Feature Selection and a Local Generative Model

Tracking Using Online Feature Selection and a Local Generative Model Tracking Using Online Feature Selection and a Local Generative Model Thomas Woodley Bjorn Stenger Roberto Cipolla Dept. of Engineering University of Cambridge {tew32 cipolla}@eng.cam.ac.uk Computer Vision

More information

Structural Sparse Tracking

Structural Sparse Tracking Structural Sparse Tracking Tianzhu Zhang 1,2 Si Liu 3 Changsheng Xu 2,4 Shuicheng Yan 5 Bernard Ghanem 1,6 Narendra Ahuja 1,7 Ming-Hsuan Yang 8 1 Advanced Digital Sciences Center 2 Institute of Automation,

More information

Sparsity Based Regularization

Sparsity Based Regularization 9.520: Statistical Learning Theory and Applications March 8th, 200 Sparsity Based Regularization Lecturer: Lorenzo Rosasco Scribe: Ioannis Gkioulekas Introduction In previous lectures, we saw how regularization

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information

4422 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 27, NO. 9, SEPTEMBER 2018

4422 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 27, NO. 9, SEPTEMBER 2018 4422 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 27, NO. 9, SEPTEMBER 2018 Robust 3D Hand Pose Estimation From Single Depth Images Using Multi-View CNNs Liuhao Ge, Hui Liang, Member, IEEE, Junsong Yuan,

More information

CS 664 Slides #11 Image Segmentation. Prof. Dan Huttenlocher Fall 2003

CS 664 Slides #11 Image Segmentation. Prof. Dan Huttenlocher Fall 2003 CS 664 Slides #11 Image Segmentation Prof. Dan Huttenlocher Fall 2003 Image Segmentation Find regions of image that are coherent Dual of edge detection Regions vs. boundaries Related to clustering problems

More information

Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging

Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging Guided Image Super-Resolution: A New Technique for Photogeometric Super-Resolution in Hybrid 3-D Range Imaging Florin C. Ghesu 1, Thomas Köhler 1,2, Sven Haase 1, Joachim Hornegger 1,2 04.09.2014 1 Pattern

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Learning to Track at 100 FPS with Deep Regression Networks - Supplementary Material

Learning to Track at 100 FPS with Deep Regression Networks - Supplementary Material Learning to Track at 1 FPS with Deep Regression Networks - Supplementary Material David Held, Sebastian Thrun, Silvio Savarese Department of Computer Science Stanford University {davheld,thrun,ssilvio}@cs.stanford.edu

More information

Nonrigid Surface Modelling. and Fast Recovery. Department of Computer Science and Engineering. Committee: Prof. Leo J. Jia and Prof. K. H.

Nonrigid Surface Modelling. and Fast Recovery. Department of Computer Science and Engineering. Committee: Prof. Leo J. Jia and Prof. K. H. Nonrigid Surface Modelling and Fast Recovery Zhu Jianke Supervisor: Prof. Michael R. Lyu Committee: Prof. Leo J. Jia and Prof. K. H. Wong Department of Computer Science and Engineering May 11, 2007 1 2

More information

A Laplacian Based Novel Approach to Efficient Text Localization in Grayscale Images

A Laplacian Based Novel Approach to Efficient Text Localization in Grayscale Images A Laplacian Based Novel Approach to Efficient Text Localization in Grayscale Images Karthik Ram K.V & Mahantesh K Department of Electronics and Communication Engineering, SJB Institute of Technology, Bangalore,

More information

Virtual Training Samples and CRC based Test Sample Reconstruction and Face Recognition Experiments Wei HUANG and Li-ming MIAO

Virtual Training Samples and CRC based Test Sample Reconstruction and Face Recognition Experiments Wei HUANG and Li-ming MIAO 7 nd International Conference on Computational Modeling, Simulation and Applied Mathematics (CMSAM 7) ISBN: 978--6595-499-8 Virtual raining Samples and CRC based est Sample Reconstruction and Face Recognition

More information

An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework

An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework IEEE SIGNAL PROCESSING LETTERS, VOL. XX, NO. XX, XXX 23 An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework Ji Won Yoon arxiv:37.99v [cs.lg] 3 Jul 23 Abstract In order to cluster

More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

Simplicial Global Optimization

Simplicial Global Optimization Simplicial Global Optimization Julius Žilinskas Vilnius University, Lithuania September, 7 http://web.vu.lt/mii/j.zilinskas Global optimization Find f = min x A f (x) and x A, f (x ) = f, where A R n.

More information