Low-Rank Matrix Recovery III: Fast Algorithms and Scalable Applications Zhouchen Lin


1 Low-Rank Matrix Recovery III: Fast Algorithms and Scalable Applications. Zhouchen Lin, Visual Computing Group, Microsoft Research Asia. Aug. 11, 2011.

2 Why do we need new algorithms? min ‖A‖_* + λ‖E‖_1 subject to A + E = D. This is high-dimensional, non-smooth convex optimization: 2n² unknowns (a 1000×1000 matrix already gives 2 million unknowns!). Existing second-order (e.g., interior-point) algorithms cannot solve moderate-to-large instances: their complexity is O(n⁶), and CVX (Stanford, Boyd) solves only up to about 80×80 on a typical PC.

3 Existing work. L1-norm minimization (min ‖E‖_1): Stanford University: Emmanuel Candès 06~09; Rice University: Wotao Yin and Yin Zhang; National University of Singapore: K.C. Toh 09; Technion, Israel Institute of Technology: Amir Beck; Tel Aviv University: Marc Teboulle; University of Washington: Paul Tseng. Nuclear norm minimization (min ‖A‖_*): Stanford University: Jianfeng Cai and Emmanuel Candès; Rice University: Wotao Yin and Yin Zhang 09; National University of Singapore: K.C. Toh, Zuowei Shen 09; Hong Kong Baptist University: Xiaoming Yuan 09; Columbia University: Shiqian Ma, Donald Goldfarb, Lifeng Chen 09; and many others!

4 THIS TALK: exciting development of fast algorithms. Time required to solve a 1000×1000 RPCA problem, min ‖A‖_* + λ‖E‖_1 subject to A + E = D. [Table: accuracy, rank(Â), ‖Ê‖_0, #iterations, and time (sec) for IT, DUAL, APG, APG (partial SVD), ALM (partial SVD), and ADM (partial SVD).] Roughly 10,000 times speedup! We will see: how to efficiently solve large matrix recovery problems by choosing the right first-order method (4 orders of magnitude speedup!), and the ideas behind the algorithms, which apply to many related problems.

5 Why are scalable solutions possible? The complexity of solving the generic convex problem min_x f(x) with first-order methods depends strongly on the smoothness of f: f smooth with Lipschitz ∇f: O(1/√ε) iterations; f differentiable: O(1/ε); f non-smooth: O(1/ε²). BAD NEWS: our model problem, min ‖A‖_* + λ‖E‖_1 subject to A + E = D, is large and non-smooth. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, 2003.

6 Why are scalable solutions possible? GOOD NEWS: the RPCA problem min ‖A‖_* + λ‖E‖_1 subject to A + E = D has special structure. KEY OBSERVATION: the proximal minimizations have closed-form solutions: S_ε(Q) = argmin_X ε‖X‖_1 + ½‖X − Q‖_F² and D_ε(Q) = argmin_X ε‖X‖_* + ½‖X − Q‖_F².

7 Why are scalable solutions possible? GOOD NEWS: the RPCA problem min ‖A‖_* + λ‖E‖_1 subject to A + E = D has special structure. KEY OBSERVATION: the proximal minimization S_ε(Q) = argmin_X ε‖X‖_1 + ½‖X − Q‖_F² has a closed-form solution. Lemma. The solution is given by applying the soft-thresholding operator S_ε(q) = max(|q| − ε, 0) · sgn(q) to each entry of the matrix Q. W. Yin, S. Osher, D. Goldfarb, and J. Darbon, Bregman iterative algorithms for l1-minimization with applications to compressed sensing, SIAM Journal on Imaging Sciences, 1(1):143-168, 2008.
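For concreteness, here is a minimal NumPy sketch of the entrywise soft-thresholding operator from the lemma (the helper name soft_threshold is ours; later sketches reuse it):

```python
import numpy as np

def soft_threshold(Q, eps):
    """Entrywise soft-thresholding S_eps: solves
    argmin_X eps*||X||_1 + 0.5*||X - Q||_F^2."""
    return np.sign(Q) * np.maximum(np.abs(Q) - eps, 0.0)
```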

8 Why are scalable solutions possible? GOOD NEWS: the RPCA problem min ‖A‖_* + λ‖E‖_1 subject to A + E = D has special structure. KEY OBSERVATION: the proximal minimization D_ε(Q) = argmin_X ε‖X‖_* + ½‖X − Q‖_F² has a closed-form solution. Lemma. The solution is given by applying soft-thresholding to the singular values of Q = UΣV^T: D_ε(Q) = U S_ε(Σ) V^T. J.-F. Cai, E. J. Candès, and Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization, 20(4):1956-1982, 2010.
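A matching sketch of the singular value thresholding operator D_ε, built here on a full SVD (later slides replace this with a partial SVD); svt is our own helper name:

```python
import numpy as np

def svt(Q, eps):
    """Singular value thresholding D_eps(Q) = U S_eps(Sigma) V^T:
    solves argmin_X eps*||X||_* + 0.5*||X - Q||_F^2."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    s = np.maximum(s - eps, 0.0)   # soft-threshold the singular values
    keep = s > 0                   # drop the zeroed ones
    return (U[:, keep] * s[keep]) @ Vt[keep, :]
```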

9 Our Roadmap. Computing time = (#iterations) × (time per iteration). Make the iterations as few as possible: Iterative Thresholding; Accelerated Proximal Gradient; Augmented Lagrange Multiplier; Alternating Direction Method of Multipliers. Make each iteration as efficient as possible: partial SVD, O(rn²); Block Lanczos with Warm Start (BLWS).

10 Our Roadmap. Computing time = (#iterations) × (time per iteration). Make the iterations as few as possible: Iterative Thresholding; Accelerated Proximal Gradient; Augmented Lagrange Multiplier; Alternating Direction Method of Multipliers. Make each iteration as efficient as possible: partial SVD, O(rn²); Block Lanczos with Warm Start (BLWS).

11 Iterative Thresholding. Model problem: min ‖A‖_* + λ‖E‖_1 subject to A + E = D. Convenient approximation (exact as μ ↘ 0): min ‖A‖_* + λ‖E‖_1 + (μ/2)(‖A‖_F² + ‖E‖_F²) subject to A + E = D.

12 Iterative Thresholding. Model problem: min ‖A‖_* + λ‖E‖_1 subject to A + E = D. Convenient approximation (exact as μ ↘ 0): min ‖A‖_* + λ‖E‖_1 + (μ/2)(‖A‖_F² + ‖E‖_F²) subject to A + E = D. Lagrangian: L(A, E, Y) := ‖A‖_* + λ‖E‖_1 + (μ/2)(‖A‖_F² + ‖E‖_F²) + ⟨Y, D − A − E⟩. Algorithm (Iterative Thresholding): (A_{k+1}, E_{k+1}) = argmin_{A,E} L(A, E, Y_k); Y_{k+1} = Y_k + δ_k(D − A_{k+1} − E_{k+1}). Theorem [Wright et al., à la Cai et al. 09]. Provided δ_k < 1, the iterates converge to the unique optimal solution of the approximated problem.

13 A recurring theme. Similar ideas appear in many places in the literature: min_{M(x)=b} f(x), with relaxation min_{M(x)=b} f(x) + (μ/2)‖x‖². Solution via Uzawa's algorithm: x_{k+1} = argmin_x f(x) + (μ/2)‖x‖² + ⟨y_k, b − M(x)⟩; y_{k+1} = y_k + δ_k(b − M(x_{k+1})). [Cai, Candès, Shen 09] A Singular Value Thresholding Algorithm for Matrix Completion. [Osher, Mao, Dong, Yin 09] Fast Linearized Bregman Iteration for Compressive Sensing and Sparse Denoising.

14 How do we solve the subproblem? Key subproblem: (A_{k+1}, E_{k+1}) = argmin L(A, E, Y_k) = argmin ‖A‖_* + λ‖E‖_1 + (μ/2)(‖A‖_F² + ‖E‖_F²) + ⟨Y_k, D − A − E⟩. Using our previous observations: A_{k+1} = argmin_A ‖A‖_* + (μ/2)‖A‖_F² − ⟨Y_k, A⟩ = argmin_A ‖A‖_* + (μ/2)‖A − μ⁻¹Y_k‖_F² = μ⁻¹ D_1(Y_k) (shrink singular values); E_{k+1} = argmin_E λ‖E‖_1 + (μ/2)‖E‖_F² − ⟨Y_k, E⟩ = argmin_E λ‖E‖_1 + (μ/2)‖E − μ⁻¹Y_k‖_F² = μ⁻¹ S_λ(Y_k) (shrink absolute values). So each iteration is relatively simple, yet expensive (the cost of one SVD).

15 Iterative Thresholding: Pros and Cons. An extremely simple algorithm for robust PCA: A_{k+1} = μ⁻¹ D_1(Y_k); E_{k+1} = μ⁻¹ S_λ(Y_k); Y_{k+1} = Y_k + δ_k(D − A_{k+1} − E_{k+1}). Strong points: in practice, it controls the rank of the iterates A_k; scalable: can solve medium-to-large problems. Weak point: slow; many iterations are needed for convergence. E.g., recovering a 1,000 × 1,000 matrix of rank 50 from 10% errors requires >8,000 iterations and >27 hours on a standard PC. Next: how can we cut the number of iterations?
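Putting the pieces together, a minimal sketch of the IT iteration, reusing the svt and soft_threshold helpers above (the parameter defaults are illustrative, not the authors' settings):

```python
import numpy as np

def rpca_it(D, lam=None, mu=1e-3, delta=0.9, max_iter=10000, tol=1e-6):
    """Iterative Thresholding for RPCA:
    A = mu^{-1} D_1(Y),  E = mu^{-1} S_lam(Y),  Y += delta*(D - A - E)."""
    lam = lam if lam is not None else 1.0 / np.sqrt(max(D.shape))  # common RPCA weight
    Y = np.zeros_like(D)
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    for _ in range(max_iter):
        A = svt(Y, 1.0) / mu                # shrink singular values
        E = soft_threshold(Y, lam) / mu     # shrink absolute values
        R = D - A - E
        if np.linalg.norm(R, 'fro') <= tol * np.linalg.norm(D, 'fro'):
            break
        Y = Y + delta * R                   # dual ascent step, delta < 1
    return A, E
```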

16 Our Roadmap. Computing time = (#iterations) × (time per iteration). Make the iterations as few as possible: Iterative Thresholding; Accelerated Proximal Gradient; Augmented Lagrange Multiplier; Alternating Direction Method of Multipliers. Make each iteration as efficient as possible: partial SVD, O(rn²); Block Lanczos with Warm Start (BLWS).

17 Accelerated Proximal Gradient (APG) Method. Gradient descent, x_{k+1} = x_k − α_k ∇f(x_k), converges at a rate of only O(1/k) in function values! Prior to 1983, the best known lower bound was O(1/k²) (much smaller); is it actually achievable? Theorem [Nesterov 83]: Consider the problem min f(x) with f convex. If f is differentiable with Lipschitz continuous gradient, ‖∇f(x_1) − ∇f(x_2)‖ ≤ L‖x_1 − x_2‖, there exists a first-order algorithm with O(1/k²) convergence rate in function values. Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372-376, 1983.

18 Nesterov's Optimal Gradient Method. Problem: min f(x), with f convex and differentiable and ∇f L-Lipschitz. First idea: minimize a sequence of quadratic approximations to f: f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖² =: Q_L(x, y). Repeat: x_{k+1} = argmin_x Q_L(x, y_k) = y_k − (1/L)∇f(y_k), where y_k is the point at which we form the approximation. The natural choice y_k = x_k gives the standard gradient algorithm: O(L/ε) iterations for an ε-suboptimal solution.

19 Nesterov's Optimal Gradient Method. Problem: min f(x), with f convex and differentiable and ∇f L-Lipschitz. First idea: minimize a sequence of quadratic approximations to f: f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖² =: Q_L(x, y). Repeat: x_{k+1} = argmin_x Q_L(x, y_k) = y_k − (1/L)∇f(y_k). Non-obvious alternative: t_{k+1} = (1 + √(1 + 4t_k²))/2; y_{k+1} = x_k + ((t_k − 1)/t_{k+1})(x_k − x_{k−1}): only O(√(L/ε)) iterations for an ε-suboptimal solution! Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372-376, 1983.

20 Generalization: min F(x) = g(x) + f(x), with g, f convex and f ∈ C^{1,1}. We can still form quadratic approximations to the smooth part, Q_L(x, y) = f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖² + g(x), and iteratively minimize them: x_{k+1} = argmin_x Q_L(x, y_k) = argmin_x g(x) + (L/2)‖x − (y_k − (1/L)∇f(y_k))‖². Theorem [Beck and Teboulle 09]: the above algorithm converges with rate F(x_k) − F* ≤ 2L‖x_0 − x*‖²/(k + 1)². Moral: if we can solve min_x Q_L(x, y_k) efficiently, we retain the advantages of Nesterov's algorithm even though F contains a nonsmooth term. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

21 Beck and Teboulle's APG method. Algorithm 1: Accelerated Proximal Gradient (APG) method. 1: Initialize x_0 = y_1, t_1 = 1, k = 1. 2: while not converged do 3: x_k = argmin_x Q_L(x, y_k); 4: t_{k+1} = (1 + √(1 + 4t_k²))/2, y_{k+1} = x_k + ((t_k − 1)/t_{k+1})(x_k − x_{k−1}); 5: end while. [Beck and Teboulle 09] A fast iterative shrinkage-thresholding algorithm for linear inverse problems (theory + application to sparse recovery). [Liu, Sun and Toh 09] An Implementable Proximal Point Algorithmic Framework for Nuclear Norm Minimization (application to matrix completion). [Ganesh, Lin, Ma, Wu, Wright 09] Fast Algorithms for Recovering a Corrupted Low-Rank Matrix (application to robust PCA).
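A generic sketch of this scheme; grad_f and prox_g are caller-supplied, with prox_g(v, t) returning argmin_x g(x) + (1/(2t))‖x − v‖²:

```python
import numpy as np

def apg(grad_f, prox_g, L, x0, max_iter=500):
    """Accelerated proximal gradient (FISTA):
    x_k = prox_{g,1/L}(y_k - grad_f(y_k)/L), plus Nesterov momentum on y."""
    x_old = x0
    y = x0.copy()
    t = 1.0
    for _ in range(max_iter):
        x = prox_g(y - grad_f(y) / L, 1.0 / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_new) * (x - x_old)
        x_old, t = x, t_new
    return x_old
```

For example, with f(x) = ½‖Mx − b‖² and g(x) = λ‖x‖_1 this recovers FISTA for sparse recovery: apg(lambda x: M.T @ (M @ x - b), lambda v, t: soft_threshold(v, lam * t), np.linalg.norm(M, 2) ** 2, np.zeros(M.shape[1])).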

22 What about matrix recovery? Model problem: min ‖A‖_* + λ‖E‖_1 subject to A + E = D. Penalized version (exact as μ ↘ 0): min μ(‖A‖_* + λ‖E‖_1) + ½‖D − A − E‖_F². The first term is non-differentiable; the second is smooth with Lipschitz gradient.

23 Solving the subproblem? x_{k+1} = argmin_x Q_L(x, y_k) = argmin_x g(x) + (L/2)‖x − (y − (1/L)∇f(y))‖². In our case, x = (x_A, x_E) ∈ R^{m×n} × R^{m×n} and y = (y_A, y_E) ∈ R^{m×n} × R^{m×n}. It is not difficult to show that L = 2 and ∇f(y) = (y_A + y_E − D, y_A + y_E − D). So the update equations are again given by shrinkage: A_{k+1} = argmin_A (μ/L)‖A‖_* + ½‖A − (y_A^k + L⁻¹(D − y_A^k − y_E^k))‖_F² = D_{μ/L}(y_A^k + L⁻¹(D − y_A^k − y_E^k)); E_{k+1} = argmin_E (λμ/L)‖E‖_1 + ½‖E − (y_E^k + L⁻¹(D − y_A^k − y_E^k))‖_F² = S_{λμ/L}(y_E^k + L⁻¹(D − y_A^k − y_E^k)).

24 APG: Pros and Cons. A_{k+1} = D_{μ/L}(y_A^k + L⁻¹(D − y_A^k − y_E^k)); E_{k+1} = S_{λμ/L}(y_E^k + L⁻¹(D − y_A^k − y_E^k)); t_{k+1} = (1 + √(1 + 4t_k²))/2; y_A^{k+1} = A_k + ((t_k − 1)/t_{k+1})(A_k − A_{k−1}); y_E^{k+1} = E_k + ((t_k − 1)/t_{k+1})(E_k − E_{k−1}). Strong points: scalable, can solve medium-to-large problems; dramatically improved iteration complexity: cuts #iterations from >8,000 to 135 (!). Weak points: does not control the rank of the iterates; requires continuation, μ_{k+1} = max(ημ_k, μ_min) with η ∈ (0, 1), for very accurate solutions. Next: are there better frameworks for continuation?
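A sketch of these updates with continuation, reusing svt and soft_threshold; the initial μ, the decay factor η, and μ_min are illustrative choices, not the authors' settings:

```python
import numpy as np

def rpca_apg(D, lam=None, eta=0.9, mu_min=1e-9, max_iter=1000):
    """APG for the penalized RPCA problem
    min mu*(||A||_* + lam*||E||_1) + 0.5*||D - A - E||_F^2."""
    lam = lam if lam is not None else 1.0 / np.sqrt(max(D.shape))
    L = 2.0                             # Lipschitz constant of the smooth term
    mu = 0.99 * np.linalg.norm(D, 2)    # start large, shrink geometrically
    A = A_old = np.zeros_like(D)
    E = E_old = np.zeros_like(D)
    t = t_old = 1.0
    for _ in range(max_iter):
        yA = A + ((t_old - 1.0) / t) * (A - A_old)   # momentum on A
        yE = E + ((t_old - 1.0) / t) * (E - E_old)   # momentum on E
        G = (D - yA - yE) / L                        # gradient step, smooth term
        A_old, E_old = A, E
        A = svt(yA + G, mu / L)                      # shrink singular values
        E = soft_threshold(yE + G, lam * mu / L)     # shrink absolute values
        t_old, t = t, 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        mu = max(eta * mu, mu_min)                   # continuation
    return A, E
```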

25 Our Roadmap. Computing time = (#iterations) × (time per iteration). Make the iterations as few as possible: Iterative Thresholding; Accelerated Proximal Gradient; Augmented Lagrange Multiplier; Alternating Direction Method of Multipliers. Make each iteration as efficient as possible: partial SVD, O(rn²); Block Lanczos with Warm Start (BLWS).

26 Augmented Lagrange Multiplier (ALM) Method. Model problem: min ‖A‖_* + λ‖E‖_1 subject to A + E = D. We have seen two approximations; can we just efficiently solve the exact problem? Write it as min_x f(x) s.t. g_i(x) = 0, i = 1, ..., m. Lagrangian: L(x, λ) = f(x) + Σ_{i=1}^m λ_i g_i(x). Augmented Lagrangian [Hestenes 69, Powell 69]: L̃(x, λ, μ) = f(x) + Σ_{i=1}^m λ_i g_i(x) + Σ_{i=1}^m (μ_i/2) g_i²(x), where the last sum is a penalty function. See, e.g., D. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods, 1982.

27 Augmented Lagrange Multiplier (ALM) Method. Algorithm 1: Augmented Lagrange Multiplier Method. 1: Initialize x_0, λ^{(0)}, μ^{(0)} > 0, k = 0, ρ > 1. 2: while not converged do 3: x_{k+1} = argmin_x L̃(x, λ^{(k)}, μ^{(k)}); 4: λ_i^{(k+1)} = λ_i^{(k)} + μ_i^{(k)} g_i(x_{k+1}); 5: μ_i^{(k+1)} = ρ μ_i^{(k)}; 6: k ← k + 1; 7: end while. ALM is advantageous when the subproblem x_{k+1} = argmin_x L̃(x, λ^{(k)}, μ^{(k)}) is easily solvable: fast convergence, O(1/μ_k). See, e.g., D. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods, 1982.
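To make the template concrete, here is a toy sketch for min ‖x‖_1 s.t. Mx = b, with the inner subproblem solved inexactly by a few proximal gradient steps (reusing soft_threshold; all parameter choices are illustrative):

```python
import numpy as np

def alm_l1(M, b, mu=1.0, rho=2.0, outer=30, inner=50):
    """Augmented Lagrangian method for min ||x||_1 s.t. Mx = b."""
    x = np.zeros(M.shape[1])
    lam = np.zeros(M.shape[0])               # Lagrange multiplier
    for _ in range(outer):
        L = mu * np.linalg.norm(M, 2) ** 2   # Lipschitz const. of smooth part
        for _ in range(inner):
            # x ~ argmin ||x||_1 + <lam, b - Mx> + (mu/2)*||Mx - b||^2
            grad = M.T @ (mu * (M @ x - b) - lam)
            x = soft_threshold(x - grad / L, 1.0 / L)
        lam = lam + mu * (b - M @ x)         # multiplier ascent
        mu *= rho                            # increase the penalty
    return x
```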

28 ALM: Solving the Subproblem. L̃(A, E, Y, μ) = ‖A‖_* + λ‖E‖_1 + ⟨Y, D − A − E⟩ + (μ/2)‖D − A − E‖_F². Solve the subproblem (A_{k+1}, E_{k+1}) = argmin_{A,E} L̃(A, E, Y_k, μ_k) by alternation: repeat A_{k+1}^{j+1} = D_{μ_k⁻¹}(D − E_{k+1}^j + μ_k⁻¹ Y_k) (shrink singular values); E_{k+1}^{j+1} = S_{λμ_k⁻¹}(D − A_{k+1}^{j+1} + μ_k⁻¹ Y_k) (shrink absolute values). Then update the Lagrange multiplier: Y_{k+1} = Y_k + μ_k(D − A_{k+1} − E_{k+1}). The inner loop slows down as μ_k grows, so the total number of SVDs grows! Next: do we need to solve the subproblem exactly? Z. Lin, M. Chen, L. Wu, and Y. Ma, The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices, submitted to Mathematical Programming.

29 Our Roadmap. Computing time = (#iterations) × (time per iteration). Make the iterations as few as possible: Iterative Thresholding; Accelerated Proximal Gradient; Augmented Lagrange Multiplier; Alternating Direction Method of Multipliers. Make each iteration as efficient as possible: partial SVD, O(rn²); Block Lanczos with Warm Start (BLWS).

30 ADM for RPCA. L̃(A, E, Y, μ) = ‖A‖_* + λ‖E‖_1 + ⟨Y, D − A − E⟩ + (μ/2)‖D − A − E‖_F². Minimizing L̃(A, E, Y, μ) over (A, E) simultaneously is nontrivial, but minimizing it with respect to A or E alone is easy: argmin_A L̃(A, E, Y, μ) = D_{μ⁻¹}(D − E + μ⁻¹Y); argmin_E L̃(A, E, Y, μ) = S_{λμ⁻¹}(D − A + μ⁻¹Y). Solution: the Alternating Direction Method of Multipliers [Gabay and Mercier 76]: A_{k+1} = D_{μ_k⁻¹}(D − E_k + μ_k⁻¹ Y_k); E_{k+1} = S_{λμ_k⁻¹}(D − A_{k+1} + μ_k⁻¹ Y_k); Y_{k+1} = Y_k + μ_k(D − A_{k+1} − E_{k+1}). We only have to update A and E once before updating Y!
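A minimal sketch of this ADM loop, again reusing svt and soft_threshold; the initializations of Y and μ and the cap on μ follow common practice rather than any prescribed setting:

```python
import numpy as np

def rpca_adm(D, lam=None, rho=1.5, max_iter=200, tol=1e-7):
    """ADM for min ||A||_* + lam*||E||_1  s.t.  A + E = D."""
    lam = lam if lam is not None else 1.0 / np.sqrt(max(D.shape))
    d_norm = np.linalg.norm(D, 'fro')
    mu = 1.25 / np.linalg.norm(D, 2)
    mu_max = mu * 1e7
    Y = D / max(np.linalg.norm(D, 2), np.abs(D).max() / lam)  # dual warm start
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    for _ in range(max_iter):
        A = svt(D - E + Y / mu, 1.0 / mu)             # shrink singular values
        E = soft_threshold(D - A + Y / mu, lam / mu)  # shrink absolute values
        R = D - A - E
        Y = Y + mu * R
        mu = min(mu * rho, mu_max)                    # nondecreasing mu_k
        if np.linalg.norm(R, 'fro') <= tol * d_norm:
            break
    return A, E
```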

31 Convergence of ADM for RPCA. Classical theory: convergence provided μ_k is bounded. In practice, increasing sequences μ_k yield much faster convergence. Recently justified by Lin et al.: Theorem. If {μ_k} is nondecreasing, then (A_k, E_k) globally converges to an optimal solution (A*, E*) of the RPCA problem if and only if Σ_{k=1}^{+∞} μ_k⁻¹ = +∞. Z. Lin, M. Chen, L. Wu, and Y. Ma, The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices, submitted to Mathematical Programming.

32 Summary of Representative Algorithms. IT: L(A, E, Y) = ‖A‖_* + λ‖E‖_1 + (μ/2)(‖A‖_F² + ‖E‖_F²) + ⟨Y, D − A − E⟩; repeat A_{k+1} = μ⁻¹D_1(Y_k) (shrink singular values); E_{k+1} = μ⁻¹S_λ(Y_k) (shrink absolute values); Y_{k+1} = Y_k + δ_k(D − A_{k+1} − E_{k+1}). ADM: L̃(A, E, Y, μ) = ‖A‖_* + λ‖E‖_1 + ⟨Y, D − A − E⟩ + (μ/2)‖D − A − E‖_F²; repeat A_{k+1} = D_{μ_k⁻¹}(D − E_k + μ_k⁻¹ Y_k) (shrink singular values); E_{k+1} = S_{λμ_k⁻¹}(D − A_{k+1} + μ_k⁻¹ Y_k) (shrink absolute values); Y_{k+1} = Y_k + μ_k(D − A_{k+1} − E_{k+1}).

33 ADM: Pros and Cons. A_{k+1} = D_{μ_k⁻¹}(D − E_k + μ_k⁻¹ Y_k); E_{k+1} = S_{λμ_k⁻¹}(D − A_{k+1} + μ_k⁻¹ Y_k); Y_{k+1} = Y_k + μ_k(D − A_{k+1} − E_{k+1}). Strong points: scalable, can solve medium-to-large problems; further improved iteration complexity, down to a few dozen iterations; the best algorithm for this problem, in our experience. Weak points: the convergence rate is still open (at least O(1/k) [X. Yuan et al.]); extensions to more than two terms are still open [X. Yuan and collaborators]. Next: can we reduce the complexity of each iteration? Z. Lin, M. Chen, L. Wu, and Y. Ma, The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices, submitted to Mathematical Programming.

34 Our Roadmap. Computing time = (#iterations) × (time per iteration). Make the iterations as few as possible: Iterative Thresholding; Accelerated Proximal Gradient; Augmented Lagrange Multiplier; Alternating Direction Method of Multipliers. Make each iteration as efficient as possible: partial SVD, O(rn²); Block Lanczos with Warm Start (BLWS).

35 Using Partial SVD. ADM: repeat A_{k+1} = D_{μ_k⁻¹}(D − E_k + μ_k⁻¹ Y_k) (shrink singular values); E_{k+1} = S_{λμ_k⁻¹}(D − A_{k+1} + μ_k⁻¹ Y_k) (shrink absolute values); Y_{k+1} = Y_k + μ_k(D − A_{k+1} − E_{k+1}). Recall: if the SVD of Q is Q = UΣV^T, then D_ε(Q) = U S_ε(Σ) V^T, where S_ε(x) = max(|x| − ε, 0) · sgn(x). So we only need the singular values of D − E_k + μ_k⁻¹ Y_k that are larger than μ_k⁻¹. This cuts the complexity from O(n³) to O(rn²)! PROPACK: we have optimized it, and also modified it. Zhouchen Lin, Some Software Packages for Partial SVD Computation, arXiv preprint.
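A sketch of the partial-SVD variant of the svt helper, using SciPy's Lanczos-based svds; predicting the rank bound k (e.g., the previous iterate's rank plus a small margin) is left to the caller:

```python
import numpy as np
from scipy.sparse.linalg import svds

def svt_partial(Q, eps, k):
    """D_eps(Q) from only the top-k singular triplets of Q; valid when
    at most k singular values of Q exceed eps."""
    k = min(k, min(Q.shape) - 1)    # svds requires k < min(m, n)
    U, s, Vt = svds(Q, k=k)         # Lanczos-based partial SVD
    s = np.maximum(s - eps, 0.0)    # soft-threshold the computed values
    keep = s > 0
    return (U[:, keep] * s[keep]) @ Vt[keep, :]
```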

36 Lanczos Method for Partial SVD. Approximate Q as Q ≈ U_k B_k V_k^T, where U_k and V_k can be found by the Lanczos procedure in k steps and B_k is bidiagonal. The Lanczos procedure starts from a random vector. If the SVD of B_k is B_k = Û_k Σ_k Ṽ_k^T, then the SVD of Q is approximately Q ≈ (U_k Û_k) Σ_k (V_k Ṽ_k)^T. The leading singular values/vectors in Σ_k, U_k Û_k, and V_k Ṽ_k converge to those of Q quickly as k increases. The complexity is O(kn² + k³). G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins Univ. Press, 1996.

37 Exciting development of fast algorithms. For a 1000×1000 matrix of rank 50, with 10% (100,000) of the entries randomly corrupted: min ‖A‖_* + λ‖E‖_1 subject to A + E = D. [Table: accuracy, rank(Â), ‖Ê‖_0, #iterations, and time (sec) for IT, DUAL, APG, APG (partial SVD), ALM (partial SVD), and ADM (partial SVD).] Roughly 10,000 times speedup! Provably robust PCA at only a constant factor (~20×) more computation than conventional PCA!

38 Our Roadmap. Computing time = (#iterations) × (time per iteration). Make the iterations as few as possible: Iterative Thresholding; Accelerated Proximal Gradient; Augmented Lagrange Multiplier; Alternating Direction Method of Multipliers. Make each iteration as efficient as possible: partial SVD, O(rn²); Block Lanczos with Warm Start (BLWS).

39 Key Observations for Acceleration. When solving the subproblem A_j = argmin_A ε_j‖A‖_* + ½‖A − Q_j‖_F², the matrix Q_j usually differs only slightly from Q_{j−1}. Computing the partial SVD of Q_j independently does not utilize this information. We are seeking the principal singular subspace of Q_j. Z. Lin and S. Wei, A Block Lanczos with Warm Start Technique for Accelerating Nuclear Norm Minimization Algorithms, submitted to Optimization Letters.

40 Drawbacks of the Vector-Based Lanczos Method. The initial vector q_1 used to start the Lanczos procedure does not carry enough information about the principal singular subspace, even if q_1 is the leading singular vector of Q_{j−1}. Z. Lin and S. Wei, A Block Lanczos with Warm Start Technique for Accelerating Nuclear Norm Minimization Algorithms, submitted to Optimization Letters.

41 Key Ideas. Use the block Lanczos method for the partial SVD, and use the principal singular subspace of Q_{j−1} to start the block Lanczos procedure. Code is available online. Z. Lin and S. Wei, A Block Lanczos with Warm Start Technique for Accelerating Nuclear Norm Minimization Algorithms, submitted to Optimization Letters.
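A simplified sketch of the warm-start idea. BLWS proper runs a block Lanczos procedure; this stand-in instead uses block power (subspace) iteration, started from the previous iterate's principal right singular subspace V0, to the same effect:

```python
import numpy as np

def warm_partial_svd(Q, V0, n_power=3):
    """Approximate top-k SVD of Q, warm-started from V0 (n x k), the
    principal right singular subspace of the previous iterate."""
    V, _ = np.linalg.qr(V0)
    for _ in range(n_power):          # refine the subspace
        U, _ = np.linalg.qr(Q @ V)
        V, _ = np.linalg.qr(Q.T @ U)
    W = Q @ V                         # m x k projection onto the subspace
    Uw, s, Zt = np.linalg.svd(W, full_matrices=False)
    return Uw, s, V @ Zt.T            # Q ~ Uw @ diag(s) @ (V @ Zt.T).T
```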

42 Experimental Results: BLWS for RPCA. [Table 1: BLWS-ADM vs. ADM on different synthetic data. Â and Ê are the computed low-rank and sparse matrices and A* is the ground truth; columns are m, method, ‖Â − A*‖_F/‖A*‖_F, rank(Â), ‖Ê‖_0, #iter, and time(s), for m = 500 and larger sizes.] Z. Lin and S. Wei, A Block Lanczos with Warm Start Technique for Accelerating Nuclear Norm Minimization Algorithms, submitted to Optimization Letters.

43 Experimental Results: BLWS for MC. [Table 1: BLWS-SVT vs. SVT on synthetic data. Â is the recovered low-rank matrix and A* is the ground truth; m is the size of the matrix, n is the number of sampled entries, and d_r = r(2m − r) is the number of degrees of freedom of an m×m matrix of rank r. Columns are m, r, n/d_r, n/m², algorithm, time(s), #iter, and ‖Â − A*‖_F/‖A*‖_F.] Z. Lin and S. Wei, A Block Lanczos with Warm Start Technique for Accelerating Nuclear Norm Minimization Algorithms, submitted to Optimization Letters.

44 Conclusions. State-of-the-art algorithms for RPCA and its many variations (up to 10⁵× speedup on a PC). A GPU implementation gives another ~5× speedup. An HPC cluster implementation uses distributed SVD. Next: applications! What can we do with these new theoretical and algorithmic tools?

45 Extensions and Broad Applications of RPCA: background modeling; image alignment; latent semantic indexing for text documents; photometric stereo; image tagging refinement; robust filtering; graphical model learning; and TILT, at computational scales well beyond what was previously possible.

46 Computation Examples: Background modeling. min ‖A‖_* + λ‖E‖_1 subject to D = A + E, A ≥ 0. High-resolution video, 720×576, 102 frames, on a workstation (Intel Xeon CPU, 4 cores, 24GB memory).

47 Computation Examples: Face Alignment [Peng et al.]. min ‖A‖_* + λ‖E‖_1 subject to D ∘ τ = A + E. 48 images at 80×60; 7 min on a workstation.

48 Computation Examples: Web document corpus analysis [Min et al.]. min ‖A‖_* + λ‖E‖_1 subject to D = A + E, E ≥ 0. D: a tf-idf matrix (one dimension is 18,320); 90.3 h on an HPC cluster.

49 Other Applications. Yi Ma, Visual Computing Group, Microsoft Research Asia. Aug. 11, 2011.

50 APPLICATIONS: Background modeling from video. Static camera surveillance video: 200 frames, 144 × 172 pixels, with significant foreground motion. Video = low-rank approximation + sparse error, via RPCA. Candès, Li, Ma, and Wright, Journal of the ACM, May 2011.

51 APPLICATIONS: Background modeling from video. Surveillance video: 250 frames, 128 × 160 pixels, with significant illumination variation. Shown: the video, results by RPCA, and results of Black and de la Torre. Candès, Li, Ma, and Wright, Journal of the ACM, May 2011.

52 APPLICATIONS: Repairing vintage movies. Original, repaired, and corruptions for Frame 1.

53 APPLICATIONS Repairing vintage movies Original Repaired Corruptions Frame 2

54 APPLICATIONS Repairing vintage movies Original Repaired Corruptions Frame 3

55 APPLICATIONS Repairing vintage movies Original Repaired Corruptions Frame 4

56 APPLICATIONS Repairing vintage movies Original Repaired Corruptions Frame 5

57 APPLICATIONS Repairing vintage movies Original Repaired Corruptions Frame 6

58 APPLICATIONS Repairing vintage movies Original Repaired Corruptions Frame 7

59 APPLICATIONS: Faces under varying illumination. 58 images of one person under varying lighting, decomposed by RPCA. Candès, Li, Ma, and Wright, Journal of the ACM, May 2011.

60 APPLICATIONS: Faces under varying illumination. 58 images of one person under varying lighting: RPCA removes specularities and cast shadows. Candès, Li, Ma, and Wright, Journal of the ACM, May 2011.

61 APPLICATIONS: High-quality photometric stereo. From images with specularities, shadows, and motion blurs, recover surface normals and relight.

62 Robust photometric stereo: synthesized images. Input images. [Table: mean angular errors (one value elided, 0.96°) and max angular errors (0.20°, 8.0°) of the recovered normals.] Wu, Ganesh, Li, Matsushita, and Ma, in ACCV 2010.

63 Robust photometric stereo: real images Wu, Ganesh, Li, Matsushita, and Ma, in ACCV 2010.

64 Robust Alignment via Sparse and Low-rank Decomposition. Problem: given a corrupted and misaligned observation D ∘ τ, recover the aligned low-rank signals A, the sparse errors E, and the parametric deformations τ (rigid, affine, projective, ...). Solution: Robust Alignment via Low-rank and Sparse (RASL) decomposition, iteratively solving the linearized convex program min ‖A‖_* + λ‖E‖_1 subject to D ∘ τ + Σ_i J_i Δτ ε_i ε_i^T = A + E, where J_i is the Jacobian of the i-th image with respect to its transformation parameters.

65 APPLICATIONS Batch face alignment Initial imprecise alignment, inappropriate for recognition: Peng, Ganesh, Wright, and Ma, CVPR 10


75 APPLICATIONS Batch face alignment Final result: per-pixel alignment Peng, Ganesh, Wright, and Ma, CVPR 10

76 APPLICATIONS: Batch face alignment, accuracy evaluation. 100 misaligned, corrupted images. [Table: mean error, error std., and max error (pixels) for the initial misalignment, Vedaldi CVPR 08 (direct/gradient), and RASL (this work).] Peng, Ganesh, Wright, and Ma, CVPR 10.

77 APPLICATIONS Simultaneous Alignment and Repairing Peng, Ganesh, Wright, Ma, CVPR 10

78 APPLICATIONS: Aligning Face Images from the Internet. 48 images collected from the Internet. Peng, Ganesh, Wright, Ma, CVPR 10.

79 APPLICATIONS: Faces Detected. Input: faces detected by a face detector; average shown. Peng, Ganesh, Wright, Ma, CVPR 10.

80 APPLICATIONS: Faces Aligned. Output: aligned faces; average shown. Peng, Ganesh, Wright, Ma, CVPR 10.

81 APPLICATIONS: Faces Repaired and Cleaned. Output: clean low-rank faces; average shown. Peng, Ganesh, Wright, Ma, CVPR 10.

82 APPLICATIONS: Sparse errors of the face images. Output: sparse error images. Peng, Ganesh, Wright, Ma, CVPR 10.

83 APPLICATIONS: Celebrities from the Internet. Average face before alignment & repairing: Gloria Macapagal Arroyo, Jennifer Capriati, Laura Bush, Serena Williams, Barack Obama, Ariel Sharon, Arnold Schwarzenegger, Colin Powell, Donald Rumsfeld, George W Bush, Gerhard Schroeder, Hugo Chavez, Jacques Chirac, Jean Chretien, John Ashcroft, Junichiro Koizumi, Lleyton Hewitt, Luiz Inacio Lula da Silva, Tony Blair, Vladimir Putin. Peng, Ganesh, Wright, Ma, CVPR 10.

84 APPLICATIONS: Face recognition with less controlled data? Average face after alignment & repairing: Gloria Macapagal Arroyo, Jennifer Capriati, Laura Bush, Serena Williams, Barack Obama, Ariel Sharon, Arnold Schwarzenegger, Colin Powell, Donald Rumsfeld, George W Bush, Gerhard Schroeder, Hugo Chavez, Jacques Chirac, Jean Chretien, John Ashcroft, Junichiro Koizumi, Lleyton Hewitt, Luiz Inacio Lula da Silva, Tony Blair, Vladimir Putin. Peng, Ganesh, Wright, Ma, CVPR 10.

85 APPLICATIONS: Aligning handwritten digits. Comparison with Learned-Miller PAMI 06 and Vedaldi CVPR 08. Peng, Ganesh, Wright, Ma, CVPR 10.

86 APPLICATIONS: 2D image matching and 3D modeling via 2D homographies. Peng, Ganesh, Wright, Ma, CVPR 10.

87 Other Applications: Web Document Corpus Analysis. Latent semantic indexing, the classical solution (PCA), forms a documents × words matrix of word frequencies (or TF/IDF); its low-rank approximation is dense and difficult to interpret. A better model/solution: a low-rank background topic model plus informative, discriminative keywords, i.e., low-dimensional topic models with keywords. Example document: "CHRYSLER SETS STOCK SPLIT, HIGHER DIVIDEND. Chrysler Corp said its board declared a three-for-two stock split in the form of a 50 pct stock dividend and raised the quarterly dividend by seven pct. The company said the dividend was raised to 37.5 cts a share from 35 cts on a pre-split basis, equal to a 25 ct dividend on a post-split basis. Chrysler said the stock dividend is payable April 13 to holders of record March 23, while the cash dividend is payable April 15 to holders of record March 23. It said cash will be paid in lieu of fractional shares. With the split, Chrysler said 13.2 mln shares remain to be purchased in its stock repurchase program that began in late . That program now has a target of 56.3 mln shares with the latest stock split. Chrysler said in a statement the actions 'reflect not only our outstanding performance over the past few years but also our optimism about the company's future.'"

88 Other Applications: Sparse Keywords Extracted. Reuters dataset: 1,000 longest documents; 3,000 most frequent words. [The same Chrysler article as above, with the extracted sparse keywords highlighted.] Min, Zhang, Wright, Ma, CIKM 2010.

89 Other Applications: Web Image Tag Refinement Zhu, Yan, and Ma, ACM MM 2010.

90 Other Applications: Robust Filtering and System ID. GPS on a car: ẋ = Ax + Bu, A ∈ R^{r×r}; y = Cx + z + e, where e contains gross sparse errors (due to buildings, trees, ...). Robust Kalman filter: x̂_{t+1} = A x̂_t + K(y_t − C x̂_t). Robust system ID: the Hankel matrix [y_n, y_{n−1}, ..., y_0; y_{n−1}, y_{n−2}, ..., y_1; ...] = O_{n×r} X_{r×n} + S, i.e., (low-rank observability times state trajectory) + (sparse errors).

91 Other Application: Graphical Models with Latent Variables. Variables are conditionally independent given the other variables. Separation principle: the sparsity pattern encodes conditional (in)dependence; the rank of the second component gives the number of hidden variables. Work of Chandrasekaran et al.

92 A Perfect Storm in the Cloud. Mathematical theory (high-dimensional statistics, measure concentration, combinatorics, ...), massive data (images, videos, texts, audio, speech, stocks, user rankings, ...), cloud computing (parallel, distributed, networked), and computational methods (convex optimization, first-order methods, hashing, approximate solutions, ...) together enable applications & services (data processing, analysis, compression, knowledge discovery, search, recognition, ...).

93 THANK YOU! Questions, please?


More information

The Alternating Direction Method of Multipliers

The Alternating Direction Method of Multipliers The Alternating Direction Method of Multipliers Customizable software solver package Peter Sutor, Jr. Project Advisor: Professor Tom Goldstein April 27, 2016 1 / 28 Background The Dual Problem Consider

More information

Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude

Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude A. Migukin *, V. atkovnik and J. Astola Department of Signal Processing, Tampere University of Technology,

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation

A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation , pp.162-167 http://dx.doi.org/10.14257/astl.2016.138.33 A Novel Image Super-resolution Reconstruction Algorithm based on Modified Sparse Representation Liqiang Hu, Chaofeng He Shijiazhuang Tiedao University,

More information

Robust l p -norm Singular Value Decomposition

Robust l p -norm Singular Value Decomposition Robust l p -norm Singular Value Decomposition Kha Gia Quach 1, Khoa Luu 2, Chi Nhan Duong 1, Tien D. Bui 1 1 Concordia University, Computer Science and Software Engineering, Montréal, Québec, Canada 2

More information

Lecture 17 Sparse Convex Optimization

Lecture 17 Sparse Convex Optimization Lecture 17 Sparse Convex Optimization Compressed sensing A short introduction to Compressed Sensing An imaging perspective 10 Mega Pixels Scene Image compression Picture Why do we compress images? Introduction

More information

Image Restoration using Accelerated Proximal Gradient method

Image Restoration using Accelerated Proximal Gradient method Image Restoration using Accelerated Proximal Gradient method Alluri.Samuyl Department of computer science, KIET Engineering College. D.Srinivas Asso.prof, Department of computer science, KIET Engineering

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Multilevel Approximate Robust Principal Component Analysis

Multilevel Approximate Robust Principal Component Analysis Multilevel Approximate Robust Principal Component Analysis Vahan Hovhannisyan Yannis Panagakis Stefanos Zafeiriou Panos Parpas Imperial College London, UK {v.hovhannisyan3, i.panagakis, s.zafeiriou, p.parpas}@imperial.ac.uk

More information

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs)

Data Mining: Concepts and Techniques. Chapter 9 Classification: Support Vector Machines. Support Vector Machines (SVMs) Data Mining: Concepts and Techniques Chapter 9 Classification: Support Vector Machines 1 Support Vector Machines (SVMs) SVMs are a set of related supervised learning methods used for classification Based

More information

Scanning Real World Objects without Worries 3D Reconstruction

Scanning Real World Objects without Worries 3D Reconstruction Scanning Real World Objects without Worries 3D Reconstruction 1. Overview Feng Li 308262 Kuan Tian 308263 This document is written for the 3D reconstruction part in the course Scanning real world objects

More information

CLEANING UP TOXIC WASTE: REMOVING NEFARIOUS CONTRIBUTIONS TO RECOMMENDATION SYSTEMS

CLEANING UP TOXIC WASTE: REMOVING NEFARIOUS CONTRIBUTIONS TO RECOMMENDATION SYSTEMS CLEANING UP TOXIC WASTE: REMOVING NEFARIOUS CONTRIBUTIONS TO RECOMMENDATION SYSTEMS Adam Charles, Ali Ahmed, Aditya Joshi, Stephen Conover, Christopher Turnes, Mark Davenport Georgia Institute of Technology

More information

Characterizing Improving Directions Unconstrained Optimization

Characterizing Improving Directions Unconstrained Optimization Final Review IE417 In the Beginning... In the beginning, Weierstrass's theorem said that a continuous function achieves a minimum on a compact set. Using this, we showed that for a convex set S and y not

More information

Stereo and Epipolar geometry

Stereo and Epipolar geometry Previously Image Primitives (feature points, lines, contours) Today: Stereo and Epipolar geometry How to match primitives between two (multiple) views) Goals: 3D reconstruction, recognition Jana Kosecka

More information