Fast Rigid Motion Segmentation via Incrementally-Complex Local Models

Fast Rigid Motion Segmentation via Inrementally-Complex Loal Models Fernando Flores-Mangas Allan D. Jepson Department of Computer Siene, University of Toronto {mangas,jepson}@s.toronto.edu Abstrat The problem of rigid motion segmentation of trajetory data under orthography has been long solved for nondegenerate motions in the absene of noise. But beause real trajetory data often inorporates noise, outliers, motion degeneraies and motion dependenies, reently proposed motion segmentation methods resort to non-trivial representations to ahieve state of the art segmentation auraies, at the expense of a large omputational ost. This paper proposes a method that dramatially redues this ost (by two or three orders of magnitude) with minimal auray loss (from 98.8% ahieved by the state of the art, to 96.2% ahieved by our method on the standard Hopkins 55 dataset). Computational effiieny omes from the use of a simple but powerful representation of motion that expliitly inorporates mehanisms to deal with noise, outliers and motion degeneraies. Subsets of motion models with the best balane between predition auray and model omplexity are hosen from a pool of andidates, whih are then used for segmentation. (a) Original video frames (b) Class-labeled trajetories () Model support region (d) Model inliers and ontrol points (e) Model residuals (f) Segmentation result Figure : Model instantiation and segmentation. a) f th original frame, Italian Grand Prix 202 Formula. b) Classlabeled, trajetory data W (red, green, blue and blak orrespond to hassis, helmet, bakground and outlier lasses respetively). ) Spatially-loal support subset W f for a andidate motion in blue. d) Candidate motion model inliers in red, ontrol points (fi from Eq. 3) in white. e) Residuals (rfi from Eq. ) olor-oded with label data, the radial oordinate is logarithmi. f) Segmentation result.. Rigid Motion Segmentation Rigid motion segmentation (MS) onsists on separating regions, features, or trajetories from a video sequene into spatio-temporally oherent subsets that orrespond to independent, rigidly-moving objets in the sene (Figure.b or.f). The problem urrently reeives renewed attention, partly beause of the extensive amount of video soures and appliations that benefit from MS to perform higher level omputer vision tasks, but also beause the state of the art is reahing funtional maturity. Motion Segmentation methods are widely diverse, but most apture only a small subset of onstraints or algebrai properties from those that govern the image formation proess of moving objets and their orresponding trajetories, suh as the rank limit theorem [9, 0], the linear independene onstraint (between trajetories from independent motions) [2, 3], the epipolar onstraint [7], and the redued rank property [, 5, 3]. Model-seletion based methods [, 6, 8] balane model omplexity with modeling auray and have been suessful at inorporating more of these aspets into a single formulation. For instane, in [8] most model parameters are estimated automatially from the data, inluding the number of independent motions and their omplexity, as well as the segmentation labels (inluding outliers). However, beause of the large number of neessary motion hypotheses that need to be instantiated, as well as the varying and potentially very large number of

model parameters that must be estimated, the flexibility offered by this method omes at a large omputational ost. Current state of the art methods follow the trend of using sparse low-dimensional subspaes to represent trajetory data. This representation is then fed into a lustering algorithm to obtain a segmentation result. A prime example of this type of method is Sparse Subspae Clustering (SSC) [3] in whih eah trajetory is represented as a sparse linear ombination of a few other basis trajetories. The assumption is that the basis trajetories must belong to the same rigid motion as the reonstruted trajetory (or else, the reonstrution would be impossible). When the assumption is true, the sparse mixing oeffiients an be interpreted as the onnetivity weights of a graph (or a similarity matrix), whih is then (spetral) lustered to obtain a segmentation result. At the time of publiation, SSC produed segmentation results three times more aurate than the best predeessor. The pratial downside, however, is the inherently large omputational ost of finding the optimal sparse representation, whih is at least ubi on the number of trajetories. The work of [4] also falls within the lass of subspae separation algorithms. Their approah is based on lustering the prinipal angles (CPA) of the loal subspaes assoiated to eah trajetory and its nearest neighbors. The lustering re-weights a traditional metri of subspae affinity between prinipal angles. Re-weighted affinities are then used for segmentation. The approah produes segmentation results with auraies similar to those of SSC, but the omputational ost is lose to 0 times bigger than SSC s. In this work we argue that ompetitive segmentation results are possible using a simple but powerful representation of motion that expliitly inorporates mehanisms to deal with noise, outliers and motion degeneraies. The proposed method is approximately 2 or 3 orders of magnitude faster than [3] and [4] respetively, urrently onsidered the state of the art... Affine Motion Projetive Geometry is often used to model the image motion of trajetories from rigid objets between pairs of frames. However, alternative geometri relationships that failitate parameter omputation have also been proven useful for this purpose. For instane, in perspetive projetion, general image motion from rigid objets an be modeled via the omposition of two elements: a 2D homography, and parallax residual displaements [5]. The homography desribes the motion of an arbitrary plane, and the parallax residuals aount for relative depths, that are unaounted for by the planar surfae model. Under orthography, in ontrast, image motion of rigid objets an be modeled via the omposition of a 2D affine transformation plus epipolar residual displaements. The 2D affine transformation models the motion of an arbitrary plane, and the epipolar residuals aount for relative depths. Cruially, these two omponents an be omputed separately and inrementally, whih enables an expliit mehanism to deal with motion degeneray. In the ontext of 3D motion, a motion is degenerate when the trajetories originate from a planar (or linear) objet, or when neither the amera nor the imaged objet exerise all of their degrees of freedom, suh as when the objet only translates, or when the amera only rotates. These are ommon situations in real world video sequenes. The inremental nature of the deompositions desribed above, failitate the transition between degenerate motions and nondegenerate ones. Planar Model Under orthography, the projetion of trajetories from a planar surfae an be modeled with the affine transformation: x [ ] x w x w y D t = 0 y w = A w 2D y w, () where D R 2 2 is an invertible matrix, and t R 2 is a translation vetor. Trajetory oordinates (x w, y w ) are in the plane s referene frame (modulo a 2D affine transformation) and (x, y ) are image oordinates. Now, let W R 2F P be matrix of trajetory data that ontains the x and y image oordinates of P feature points traked through F frames, as in x, x,p y, y,p W =...... (2) x F, x F,P y F, y F,P To ompute the parameters of A 2D from trajetory data, let C = [, 2, 3 ] R 2f 3 be three olumns (three full trajetories) from W, and let f i = [2f i, 2f i ] be the x and y oordinates of the i-th ontrol trajetory at frame f. Then the transformation between points from an arbitrary soure frame s to a target frame f an be written as: [ f f 2 f ] 3 = A s f 2D and A s f 2D A s f 2D = an be simply omputed as: [ f f 2 f 3 [ s s 2 s 3 ] [ s s 2 s 3 ], (3) ]. (4) The inverse in the right-hand side matrix of Eq. 4 exists so long as the points s i are not ollinear. For simpliity we refer to A s f 2D as A f 2D and onsequently As 2D is the identity matrix.

3D Model In order to upgrade a planar (degenerate) model into a full 3D one, relative depth must be aounted using the epipolar residual displaements. This means extending Eq. with a diretion vetor, saled by the orresponding relative depth of eah point, as in: x [ ] y D t = xw 0 y w + δz w a 3 a 23. (5) 0 The depth δz w is relative to the arbitrary plane whose motion is modeled by A 2D ; a point that lies on suh plane would have δz w = 0. We all the orthographi version of the plane plus parallax deomposition, the 2D Affine Plus Epipolar (2DAPE) deomposition. Eq. 5 is equivalent to x y = a a 2 a 3 t a 2 a 22 a 23 t 2 0 0 0 x w y w δz w (6) where it is lear that the parameters of A 3D define an orthographially projeted 3D affine transformation. Determining the motion and struture parameters of a 3D model from point orrespondenes an be done using the lassial matrix fatorization approah [0], but besides being sensitive to noise and outliers, the ommon senario where the solution beomes degenerate makes the approah diffiult to use in real-world appliations. Appropriately aommodating and dealing with the degenerate ases is one of the key features of our work. 2. Overview of the Method The proposed motion segmentation algorithm has three stages. First, a pool of M motion model hypotheses M = {M,..., M M } is generated using a method that ombines a Random Sampling and Consensus (RANSAC) [4] tehnique with the 2DAPE deomposition. The goal is to generate one motion model for eah of the N independent, rigidly-moving objets in the sene; N is assumed to be known a priori. The method instantiates many more models than those expeted neessary (M N) in an attempt inrease the likelihood of generating orret model proposals for all N motions. A good model aurately desribes a large subset of oherently moving trajetories with the smallest number of parameters ( 3). In the seond stage, subsets of motion models from M are ombined to explain all the trajetories in the sequene. The problem is framed as an objetive funtion that must be minimized. The objetive funtion is the negative loglikelihood over predition auray, regularized by model omplexity (number of model parameters) and modeling overlap (trajetories explained by multiple models). Notie that after this stage, the segmentation that results from the optimal model ombination ould be reported as a segmentation result ( 5). The third stage inorporates the results from a set of model ombinations that are losest to the optimal. Segmentation results are aggregated into an affinity matrix, whih is then passed to a spetral lustering algorithm to produe the final segmentation result. This refinement stage generally results in improved auray and redued segmentation variability ( 6). 3. Motion Model Instantiation Eah model M M is instantiated independently using RANSAC. This hoie is motivated beause of this method s well-known omputational effiieny and robustness to outliers, but also beause of its ability to inorporate spatially loal onstraints and (as explained below) beause most of the omputations neessary to evaluate a planar model an be reused to estimate the likelihoods of a potentially neessary 3D model, yielding signifiant omputational savings. The input to our model instantiation algorithm is a spatially-loal, randomly drawn subset of trajetory data Ŵ [2F I] W [2F P ] ( 3.). In turn, at eah RANSAC trial, the algorithm draws uniformly distributed, random subsets of three ontrol trajetories (C [2F 3] Ŵ [2F I] ). Eah set of ontrol trajetories is used to estimate the family of 2D affine transformations {A,..., A F } between the base frame and all other frames in the sequene, whih are then used to determine a omplete set of model parameters M = {B, σ, C, ω}. The matrix B {0, } [F I] indiates whether the i-th trajetory should be predited by model M at frame f (inlier, b f i = ) or not (outlier, b f i = 0), σ = {σ,..., σ F } are estimates of the magnitude of the noise for eah frame, and ω {2D, 3D} is the estimated model type. The goal is to find the ontrol points and the assoiated parameters that minimize the objetive funtion O(Ŵ, M) = b f i (ŵ ) Lω f i Af, σ f + Ψ(ω) + Γ(B) (7) f F i I aross a number of RANSAC trials, where ŵ f i = (x f i, yf i ) = (ŵ2f i, ŵ 2f i ) are the oordinates of the i-th trajetory from the support subset Ŵ at frame f. The negative log-likelihood term L ω ( ) penalizes reonstrution error, while Ψ( ) and Γ( ) are regularizers. The three terms are defined below. Knowing that 2D and 3D affine models have 6 and 8 degrees of freedom respetively, Ψ(ω) regularizes over model omplexity using: Ψ(ω) = { 6(F ), if ω = 2D 8(F ), if ω = 3D. (8)

Γ(B) strongly penalizes models that desribe too few trajetories: {, if I F Γ(B) = bf i < F λ i (9) 0, otherwise The ontrol set C whose M minimizes Eq. 7 aross a number of RANSAC trials beomes part of the pool of andidates M. 2D likelihoods. For the planar ase (ω = 2D) the negative log-likelihood term is evaluated with: ( { L 2D (ŵ f i Af, σ f ) = log exp }) 2π Σ 2 2 rf i Σ r f i, (0) whih is a zero-mean 2D Normal distribution evaluated at the residuals r f i. The spherial ovariane matrix is Σ = (σ f ) 2 I. The residuals r f i are determined by the differenes between the preditions made by a hypothesized model A f, and the observations at eah frame [ ] [ ] [ ] r f w f w = A f s. () 3D likelihoods. The negative log-likelihood term for the 3D ase is based on the the 2DAPE deomposition. The 2D affinities A f and residuals r f are reused, but to aount for the effet of relative depth, an epipolar line segment e f is robustly fit to the residual data at eah frame (please see supplementary material for details on the segment fitting algorithm). The 2DAPE does not onstrain relative depths to remain onstant aross frames, but only requires trajetories to be lose to the epipolar line. So, if the unitary vetor e f indiates the orthogonal diretion to e f, then the negative log-likelihood term for the 3D ase is estimated with: ( ) L 3D (ŵ f i Af, σ f ) = 2 log r f exp 2πσ f i e f 2 2(σ f ) 2, (2) whih is also a zero-mean 2D Normal distribution omputed as the produt of two idential, separable, singlevariate, normal distributions, evaluated at the distane from the residual to the epipolar line. The first one orresponds to the atual deviation in the diretion of e f, whih is analytially omputed using r f i e f. The seond one orresponds to an estimate of the deviation in the perpendiular diretion (e f ), whih annot be determined using the 2DAPE deomposition model, but an be approximated to be equal to r f i e f, whih is a plausible estimate under the isotropi noise assumption. Note that Eq. 7 does not evaluate the quality of a model using the number of inliers, as it is typial for RANSAC. Instead, we found that better motion models resulted from Algorithm : Motion model instantiation Input: Trajetory data W [2F P ], number of RANSAC trials K, arbitrary base frame b Output: Parameters of the motion model M = {B, σ n, ω} // determine the training set Ŵ rand(, P ); r rand(r min, r max) // random enter and radius Ŵ [2F I] trajetorieswithindisk(w, r, ) // support subset X homocoords(ŵ b ) // points at base frame for K RANSAC trials do rand3(, I) for f {,..., F } {b} do Y homocoords(ŵ f ) A Y X R AX Y // three random ontrol trajetory indies // points at frame f // 2D affine model // residuals [B f 2D, σf 2D ] ompute2dinliers(r) L f 2D ompute2dlikelihoods(r, σf 2D ) [U, S, V ] = svd(weightedcov(r, B 2D) // s s 2 if s s2 > + λ 3D then [B f 3D, σf 3D ] ompute3dinliers(r) L f 3D ompute3dlikelihoods(r, σf 3D ) // omplete penalized neg-loglikleihoods l 2D f i B2D(f, i)l2d(f, i) + Ψ(2D) + Γ(B2D) l 3D f i B3D(f, i)l3d(f, i) + Ψ(3D) + Γ(B3D) // keep the best model overall if (min(l 2D, l 3D) < l ) then if (l 2D < l 3D) then l l 2D; σ n σ2d; ω 2; B B 2D else l l 3D; σ n σ3d; ω 3; B B 3D return M = {B, σ n, ω } optimizing over the auray of the model preditions for an (estimated) inlier subset, whih also means that the effet of outliers is expliitly unounted. Figure.b shows an example of lass-labeled trajetory data,. shows a typial spatially-loal support subset. Figures.d and.e show a model s ontrol points and its orresponding (lass-labeled) residuals, respetively. A pseudoode desription of the motion instantiation algorithm is provided in Algorithm. Details on how to determine Ŵ, as well as B, σ, and ω follow. 3.. Loal Coherene The subset of trajetories Ŵ given to RANSAC to generate a model M is onstrained to a spatially loal region. The probability of hoosing an unontaminated set of 3 ontrol trajetories, neessary to ompute a 2D affine model, from a dataset with a ratio r of inliers, after k trials is: p = ( r 3 ) k. This means that the number of trials needed to find a subset of 3 inliers with probability p is k = log( p) log( r 3 ). (3) A ommon assumption is that trajetories from the same underlying motion are loally oherent. Hene, a ompat region is likely to inrease r, exponentially reduing

Figure 2: Preditions (red) from a 2D affine model with standard Gaussian noise (green) on one of the ontrol points (blak). Noiseless model preditions in blue. All four senarios have idential noise. The magnitude of the extrapolation error hanges with the distane between the ontrol points. k, and with it, RANSAC s omputation time by a proportional amount. The trade-off that results from drawing model ontrol points from a small region, however, is extrapolation error. A motion model is extrapolated when utilized to make preditions for trajetories outside the region defined by the ontrol points. The magnitude of modeling error depends on the magnitude of the noise affeting the ontrol points, and although hard to haraterize in general, extrapolation error an be expeted to grow with the distane from the predition to the ontrol points, and inversely with the distane between the ontrol points themselves. Figure 2 shows a series of syntheti senarios where one of the ontrol points is affeted by zero mean Gaussian noise of small magnitude. Idential noise is added to the same trajetory in all four senarios. The figure illustrates the relation between the distane between the ontrol points and the magnitude of the extrapolation errors. Our goal is to maximize the region size while limiting the number of outliers. Without any prior knowledge regarding the sale of the objets in the sene, determining a fixed size for the support region is unlikely to work in general. Instead, the issue is avoided by randomly sampling disk-shaped regions of varying sizes and loations to onstrut a diverse set of support subsets. Eah support subset is then determined by Ŵ = {w i (x b i o x ) 2 + (y b i o y ) 2 < r 2 }, (4) where (o x, o y ) are the oordinates of the enter of a disk of radius r. To promote uniform image overage, the disk is entered at a randomly hosen trajetory (o x, o y ) = (x b i, yb i ) with uniformly distributed i U(, P ) and base frame b U(, F ). To allow for different region sizes, the radius r is hosen from a uniform distribution r U(r min, r max ). If there are I trajetories within the support region, then Ŵ R 2F I. It is worth noting that the onstrution of the support region does not inorporate any knowledge about the motion of objets in the sene, and in onsequene Ŵ will likely ontain trajetories that originate from more than one independently moving objet (Figure 3). Figure 3: Two randomly drawn loal support sets. Left: A mixed set with some trajetories from the blue and green lasses. Right: Another mixed set with all of the trajetories in the red lass and some from the blue lass. 4. Charaterizing the Residual Distribution At eah RANSAC iteration, residuals r f are omputed using the 2D affine model A f that results from the onstraints provided by the ontrol trajetories C. Charaterizing the distribution of r f has three initial goals. The first one is to determine 2D model inliers b f 2D ( 4.), the seond one is to ompute estimates of the magnitude of the noise at every frame σ f 2D ( 4.2), and the third one is to determine whether the residual distribution originates from a planar or a 3D objet ( 4.3). If the objet is suspeted 3D, then two more goals need to be ahieved. The first one is to determine 3D model inliers b f 3D ( 4.4), and the seond one is to estimate the magnitude of the noise (σ f 3D ) to reflet the use of a 3D model ( 4.5). 4.. 2D Inlier Detetion Suppose the matrix Ŵ ontains trajetories Ŵ R 2F I and Ŵ 2 R 2F J from two independently moving objets, and that these trajetories are ontaminated with zero-mean Gaussian noise of spherial ovariane η N (0, (σ f ) 2 I): Ŵ = [Ŵ Ŵ2] + η. (5) Now, assume we know the true affine transformations A f and A f 2 that desribe the motion of trajetories for the subsets Ŵ and Ŵ2, respetively. If A f is used to ompute preditions for all of Ŵ (at frame f), the expeted value (denoted by ) of the magnitude of the residuals (r f from Eq. ) for trajetories in Ŵ will be in the order of the magnitude of the underlying noise r f i = σf for eah i {,..., I}. But in this senario, trajetories in Ŵ2 will be predited using the wrong model, resulting in residuals with magnitudes determined by the motion differential r f i = (A f Af 2 )ŵb i. If we an assume that the motion differential is bigger than the displaement due to noise: (A f Af 2 )wb i > σ f, (6)

then the model inliers an be determined by thresholding r f i with the magnitude of the noise, saled by a onstant (τ = λ σ σ f ): { b f i =, r f i τ (7) 0, otherwise. But beause σ f is generally unknown, the threshold (τ) is estimated from the residual data. To do so, let ˆr be the vetor of residual magnitudes where ˆr i ˆr i+. Now, let r = median (ˆr i+ ˆr i ). The threshold is then defined as τ = min{ˆr i (r i+ r i ) > λ r r}, (8) whih orresponds to the smallest residual magnitude before a salient magnitude gap. Our experiments showed this test to be effiient and effetive. Figure.e shows lasslabeled residuals. Notie the presene of a (low density) gap between the residuals from the trajetories explained by the orret model (in red, lose to the origin), and the rest. 4.2. Magnitude of the Noise, 2D Model Let ˆr f 2D ontain only the residuals of the inlier trajetories (those where b f i = ), and let USV be the singular value deomposition of the ovariane matrix of ˆr f 2D : ( ) USV = svd (ˆr f b f 2D ) ˆr f 2D. (9) p Then the magnitude of the noise orresponds to the largest singular value σ 2 = s, beause if the underlying geometry is in fat planar, then the only unaounted displaements aptured by the residuals are due to noise. Model apaity an also be determined from S, as explained next. 4.3. Model Capaity The ratio of largest over smallest singular values (s /s 2 ) determines when upgrading to a 3D model is benefiial. When the underlying geometry is atually non-planar, the residuals from a planar model should distribute along a line (the epipolar line), refleting that their relative depth is being unaounted for. This produes a ovariane matrix with a large ratio s /s 2. If on the other hand, if s /s 2, then there is no indiation of unexplained relative depth, in whih ase, fitting a line to spherially distributed residuals will only inrease the model omplexity without explaining the residual variane muh better. A small spherial residual ovariane strongly suggests a planar underlying geometry. 4.4. 3D Inlier Detetion When the residual distribution is elongated (s /s 2 ), a line segment is robustly fit to the (potentially ontaminated) set of residuals. The segment must go through the origin and its parameters are omputed using a Hough transform. Further details about this algorithm an be found in the supplementary material. Inlier detetion The resulting line segment is used to determine 3D model inliers. Trajetory i beomes an inlier at frame f if it satisfies two onditions. First, the projetion of r f i onto the line must lie within the segment limits (β r f i e f γ). Seond, the normalized distane to the line must be below a threshold (e f rf i σ 2 λ d ). Notie that the threshold depends on the smallest singular value from Eq. 9 to (roughly) aount for the presene of noise in the diretion perpendiular to the epipolar (e f ). 4.5. Magnitude of the Noise, 3D Model Similarly to the 2D ase, let ˆr f 3D ontain the residual data from the orresponding 3D inlier trajetories. An estimate for the magnitude of the noise that reflets the use of a 3D model an be obtained from the singular value deomposition of the ovariane matrix of ˆr f 3D (as in Eq. 9). In this ase, the largest singular value s aptures the spread of residuals along the epipolar line, so its magnitude is mainly related to the magnitude of the displaements due to relative depth. However, s 2 aptures deviations from the epipolar line, whih in a rigid 3D objet an only be attributed to noise, making σ 2 = s 2 a reasonable estimate for its magnitude. Optimal model parameters When both 2D and 3D models are instantiated, the one with the smallest penalized negative log-likelihood (7) beomes the winning model for the urrent RANSAC run. The same penalized negative loglikelihood metri is used to determine the better model from aross all RANSAC iterations. The winning model is added to the pool M, and the proess is repeated M times, forming the pool M = {M,..., M M }. 5. Optimal Model Subset The next step is to find the model ombination M M that maximizes predition auray for the whole trajetory data W, while minimizing model omplexity and modelling overlap. For this purpose, let M j = {M j,,..., M j,n } be the j-th model ombination, and let M! {M j } be the set of all M C N = N!(M N)!) ombinations of N-sized models than an be drawn from M. The model seletion problem is then formulated as M = argmin O S (M j ), (20) {M j}

where the objetive is O S(M j) = N n= p= P π p,ne (w p, M j,n) P + λ Φ Φ(w p, M j,n) + λ Ψ i= N Ψ(M j,n). n= (2) The first term aounts for predition auray, the other two are regularization terms. Details follow. Predition Auray In order to determine how well a model M predits an arbitrary trajetory w, the affine transformations estimated by RANSAC ould be re-used. However, the inherent noise in the ontrol points, and the potentially short distane between them, often render this approah impratial, partiularly when w is spatially distant from the ontrol points (see 3.). Instead, model parameters are omputed with a fatorization based [0] method. Given the inlier labeling B in M, let W B be the subset of trajetories where b f i = for at least half of the frames. The orthonormal basis S of a ω = 2D (or 3D) motion model an be determined by the 2 (or 3) left singular vetors of W B. Using S as the model s motion matries, predition auray an be omputed using: E(w, M) = SS w w 2, (22) whih is the sum of squared Eulidean deviations from the preditions (SS w), to the observed data (w). Our experiments indiated that, although sensitive to outliers, these model preditions are muh more robust to noise. Ownership variables Π {0, } [P N] indiate whether trajetory p is explained by the n-th model (π p,n = ) or not (π p,n = 0), and are determined by maximum predition auray (i.e. minimum Eulidean deviation):, if M j,n = argmin E(w p, M) π p,n = M M j (23) 0, otherwise. Regularization terms The seond term from Eq. 2 penalizes situations where multiple models explain a trajetory (w) with relatively small residuals. For brevity, let Ê(w, M) = exp{ E(w, M)}, then: Φ(w, M j ) = log max Ê(w, M) M M j Ê(w, M). (24) M M j The third term regularizes over the number of model parameters, and is evaluated using Eq. 8. The onstants λ Φ and λ Ψ modulate the effet of the orresponding regularizer. Table : Auray and run-time for the H55 dataset. Naive RANSAC inluded as a baseline with overall auray and total omputation time estimated using data from [2]. Algorithm Average Auray [%] Computation time [s] SSC [3] 98.76 4500 CPA [4] 98.75 47600 RANSAC 89.5 30 Ours 96.9 27 6. Refinement The optimal model subset M yields ownership variables Π whih an already be interpreted as a segmentation result. However, we found that segmentation auray an be improved by inorporating the labellings Π t from the top T subsets {M t t T } losest to optimal. Multiple labellings are inorporated into an affinity matrix F, where the f i,j entry indiates the frequeny with whih trajetory i is given the same label as trajetory j aross all T labellings, weighted by the relative objetive { funtion Õt = exp f i,j = O S(W M t ) O S (W M ) T t= Õt T t= } for suh a labelling: ( π t i,: π t j,: ) Õt (25) Note that the inner produt between the label vetors (π i,: π j,: ) is equal to one only when the labels are the same. A spetral lustering method is applied on F to produe the method s final segmentation result. 7. Experiments Evaluation was made through three experimental setups. Hopkins 55 The Hopkins 55 (H55) dataset has been the standard evaluation metri for the problem of motion segmentation of trajetory data sine 2007. It onsists of hekerboard, traffi and artiulated sequenes with either 2 or 3 motions. Data was automatially traked, but traking errors were manually orreted; further details are available in [2]. The use of a standard dataset enables diret omparison of auray and run-time performane. Table shows the relevant figures for the two most ompetitive algorithms that we are aware of. The data indiates that our algorithm has run-times that are lose to 2 or 3 orders of magnitude faster than the state of the art methods, with minimal auray loss. Computation times are measured in the same (or very similar) hardware arhitetures. Like in CPA, our implementation uses a single set of parameters for all the experiments, but as others had pointed out [4], it remains unlear whether the same is true for the results reported in the original SSC paper.

Figure 4: Auray error-bars aross artifiial H55 datasets with ontrolled levels of Gaussian noise. Artifiial Noise The seond experimental setup omplements an unexplored dimension in the H55 dataset: noise. The goal is to determine the effets of noise of different magnitudes towards the segmentation auray of our method, in omparison with the state of the art. We noted that H55 ontains strutured long-tailed noise, but for the purpose of this experiment we required a noise-free dataset as a baseline. To generate suh a dataset, ground-truth labels were used to ompute a rank 3 reonstrution of (mean-subtrated) trajetories for eah segment. Then, multiple versions of H55 were omputed by ontaminating the noise-free dataset with Gaussian noise of magnitudes σ n {0.0, 0.25, 0.5,, 2, 4, 8}. Our method, as well as SSC and CPA were run on these noise-ontrolled datasets; results are shown in Figure 4. The error bars on SSC and Ours indiate one standard deviation, omputed over 20 runs. The plot for CPA is generated with only one run for eah dataset (running time:.95 days). The graph indiates that our method only ompromises auray for large levels of noise, while still being around 2 or 3 orders of magnitude faster than the most ompetitive algorithms. KLT Traking The last experimental setup evaluates the appliability of the algorithm in real world onditions using raw traks from an off-the-shelf implementation [] of the Kanade-Luas-Tomasi algorithm. Several sequenes were traked and the resulting trajetories lassified by our method. Figure 5 shows qualitatively good motion segmentation results for four sequenes. Challenges inlude very small relative motions, traking noise, and a large presene of outliers. 8. Conlusions We introdued a omputationally effiient motion segmentation algorithm for trajetory data. Effiieny omes from the use of a simple but powerful representation of motion that expliitly inorporates mehanisms to deal with noise, outliers and motion degeneraies. Run-time omparisons indiate that our method is 2 or 3 orders of magnitude faster than the state of the art, with only a small loss in auray. The robustness of our method to Gaussian noise Figure 5: Segmentation results from raw KLT automati traks from four Formula sequenes. Italian Grand Prix 202 Formula. In this figure, all trajetories are given a motion label, inluding outliers. of different magnitudes was found ompetitive with state of the art, while retaining the inherent omputational effiieny. The method was also found to be useful for motion segmentation of real-world, raw trajetory data. Referenes [] http://www.es.lemnson.edu/ stb/klt. 8 [2] J. P. Costeira and T. Kanade. A Multibody Fatorization Method for Independently Moving Objets. IJCV, 998. [3] E. Elhamifar and R. Vidal. Sparse subspae lustering. In Pro. CVPR, 2009. 2, 7 [4] M. A. Fishler and R. C. Bolles. Random sample onsensus: a paradigm for model fitting with appliations to image analysis and automated artography. Commun. ACM, 98. 3 [5] M. Irani and P. Anandan. Parallax geometry of pairs of points for 3d sene analysis. Pro. ECCV, 996. 2 [6] K. Kanatani. Motion segmentation by subspae separation: Model seletion and reliability evaluation. International Journal Image Graphis, 2002. [7] H. Longuet-Higgins. A omputer algorithm for reonstruting a sene from two projetions. Readings in Computer Vision: Issues, Problems, Priniples, and Paradigms, MA Fishler and O. Firshein, eds, 987. [8] K. Shindler, D. Suter,, and H. Wang. A model-seletion framework for multibody struture-and-motion of image sequenes. Pro. IJCV, 79(2):59 77, 2008. [9] C. Tomasi and T. Kanade. Shape and motion without depth. Pro. ICCV, 990. [0] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a fatorization method. IJCV, 992., 3, 7 [] P. Torr. Geometri motion segmentation and model seletion. Phil. Tran. of the Royal So. of Lon. Series A: Mathematial, Physial and Engineering Sienes, 998. [2] R. Tron and R. Vidal. A Benhmark for the Comparison of 3-D Motion Segmentation Algorithms. In Pro. CVPR, 2007. 7 [3] J. Yan and M. Pollefeys. A fatorization-based approah for artiulated nonrigid shape, motion and kinemati hain reovery from video. PAMI, 2008. [4] L. Zappella, E. Provenzi, X. Lladó, and J. Salvi. Adaptive motion segmentation algorithm based on the prinipal angles onfiguration. Pro. ACCV, 20. 2, 7 [5] L. Zelnik-Manor and M. Irani. Degeneraies, dependenies and their impliations in multi-body and multi-sequene fatorizations. Pro. CVPR, 2003.