1 Krylov-type methods and perturbation analysis
Berkant Savas, Department of Mathematics, Linköping University
Workshop on Tensor Approximation in High Dimension, Hausdorff Institute for Mathematics, Universität Bonn, August 2, 2011
Joint work with Lars Eldén, Linköping University
2 A few multilinear algebraic definitions I
Inner product: ⟨A, B⟩ = Σ_{i,j,k} a_{ijk} b_{ijk}
Frobenius norm: ‖A‖_F = ⟨A, A⟩^{1/2}
Contracted tensor product: C = ⟨A, B⟩_{2,3} = ⟨A, B⟩_{-1}, with entries c_{αβ} = Σ_{j,k} a_{αjk} b_{βjk}
Multilinear rank: rank(A) = (r_1, r_2, r_3), where r_i = rank(A^{(i)}) and A^{(i)} is the mode-i unfolding of A
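As an illustration, these definitions can be computed with numpy's `einsum`; the tensor sizes and random data below are arbitrary assumptions for the example, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 6))
B = rng.standard_normal((4, 5, 6))

# Inner product <A, B> = sum_ijk a_ijk b_ijk and Frobenius norm
inner = np.einsum('ijk,ijk->', A, B)
fro = np.sqrt(np.einsum('ijk,ijk->', A, A))

# Contracted tensor product <A, B>_{2,3}: c_ab = sum_jk a_ajk b_bjk
C = np.einsum('ajk,bjk->ab', A, B)

# Multilinear rank: ranks of the three unfoldings A^(1), A^(2), A^(3)
r1 = np.linalg.matrix_rank(A.reshape(4, -1))
r2 = np.linalg.matrix_rank(np.moveaxis(A, 1, 0).reshape(5, -1))
r3 = np.linalg.matrix_rank(np.moveaxis(A, 2, 0).reshape(6, -1))
print(C.shape, (r1, r2, r3))
```

For a generic dense tensor each unfolding has full rank, so the multilinear rank here equals the mode sizes.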
3 A few multilinear algebraic definitions II
Multilinear tensor–matrix multiplication, with U, V, W matrices:
A = (U, V, W)·S, i.e. a_{ijk} = Σ_{λ,µ,ν} u_{iλ} v_{jµ} w_{kν} s_{λµν}
Convention: (U^T, V^T, W^T)·A = A·(U, V, W)
Special cases: (U, V)·A = U A V^T (matrix case) and (I, I, W)·A = (W)_3·A (multiplication along mode 3 only)
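A small numpy sketch of the multilinear multiplication and the transpose convention (the sizes and random data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((2, 3, 4))
U = rng.standard_normal((5, 2))
V = rng.standard_normal((6, 3))
W = rng.standard_normal((7, 4))

# A = (U, V, W)·S : a_ijk = sum_{l,m,n} u_il v_jm w_kn s_lmn
A = np.einsum('il,jm,kn,lmn->ijk', U, V, W, S)
print(A.shape)  # the mode sizes come from the row dimensions of U, V, W

# Convention: (U^T, V^T, W^T)·A = A·(U, V, W), contracting each mode with a factor
A_uvw = np.einsum('il,jm,kn,ijk->lmn', U, V, W, A)
```

Contracting A back against U, V, W returns a tensor with the core's dimensions, matching the convention on the slide.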
4 Multilinear low rank approximation
min_B ‖A − B‖ subject to rank(B) ≤ (r_1, r_2, r_3), with B = (X, Y, Z)·C, where the core C has dimensions r_1 × r_2 × r_3
5 Multilinear low rank approximation
min ‖A − (X, Y, Z)·C‖ (cf. the matrix case min ‖A − X C Y^T‖)
The problem is overparameterized: C can be eliminated. The problem is equivalent to
max ‖A·(X, Y, Z)‖ (cf. max ‖X^T A Y‖) subject to X^T X = I, Y^T Y = I, Z^T Z = I
6 Multilinear low rank approximation
In addition to the constraints X^T X = I, Y^T Y = I, Z^T Z = I, the objective is invariant under rotations of the bases:
‖A·(X, Y, Z)‖ = ‖A·(XU, YV, ZW)‖ for orthogonal U, V, W (cf. ‖X^T A Y‖ = ‖U^T X^T A Y V‖)
Grassmann manifold problem!
7 Computing the core C and the low rank approximation
Given orthonormal X, Y, Z:
min_C ‖A − (X, Y, Z)·C‖ (cf. min_C ‖A − X C Y^T‖)
Solution: C = A·(X, Y, Z) (cf. C = X^T A Y), and the low rank approximation is
Â = (X, Y, Z)·C = (X X^T, Y Y^T, Z Z^T)·A (cf. Â = X C Y^T = X X^T A Y Y^T)
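Given orthonormal factors, the core and the approximation can be formed exactly as above; in this sketch the orthonormal X, Y, Z are random (an assumption for illustration, not optimized factors).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 9, 10))
# Orthonormal columns via QR (illustrative choice)
X = np.linalg.qr(rng.standard_normal((8, 3)))[0]
Y = np.linalg.qr(rng.standard_normal((9, 3)))[0]
Z = np.linalg.qr(rng.standard_normal((10, 3)))[0]

# Core: C = A·(X, Y, Z)
C = np.einsum('il,jm,kn,ijk->lmn', X, Y, Z, A)

# Low rank approximation two equivalent ways:
# Â = (X, Y, Z)·C and Â = (XX^T, YY^T, ZZ^T)·A
A_hat = np.einsum('il,jm,kn,lmn->ijk', X, Y, Z, C)
A_proj = np.einsum('ia,jb,kc,abc->ijk', X @ X.T, Y @ Y.T, Z @ Z.T, A)
print(np.allclose(A_hat, A_proj))
```

The two expressions agree because multiplying by (X, Y, Z) after contracting with (X^T, Y^T, Z^T) is exactly the triple orthogonal projection.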
8 HOSVD [De Lathauwer et al. 2000]
A = (U, V, W)·S, i.e. a_{ijk} = Σ_{λ,µ,ν} u_{iλ} v_{jµ} w_{kν} s_{λµν}
U, V, W are orthogonal; S is all-orthogonal
9 Methods for best low rank approximation
Alternating minimization:
1 HOOI (Kroonenberg; De Lathauwer)
2 Trace maximization (Regalia)
Grassmann manifold approach:
1 Newton (Eldén, Savas)
2 Trust region / Newton (Ishteva, De Lathauwer et al.)
3 BFGS quasi-Newton (Savas, Lim)
4 Limited memory BFGS (Savas, Lim)
5 Symmetric tensors, without explicit representation (Morton)
In this talk we consider Krylov-type tensor computations. NB: this approach does not solve the best low rank problem.
10 Krylov-type tensor computations
11 Krylov subspaces for matrices
Given A ∈ R^{m×m} and a starting vector v ∈ R^m:
K_k(A, v) = span{v, Av, A²v, ..., A^{k−1}v}
If v_1 = v, then K_k(A, v) = span{v_1, v_2, v_3, ..., v_k} with v_{i+1} = A v_i, i = 1, ..., k−1.
Especially useful for large and sparse problems: systems of linear equations; eigenvalues and eigenvectors; singular values and singular vectors; approximation of matrices and of functions of matrices.
12 The Arnoldi process (A a general square matrix)
for k = 1, 2, ... do
1 h_k = U_k^T A u_k
2 v = A u_k − U_k h_k
3 β_k = h_{k+1,k} = ‖v‖_2
4 u_{k+1} = v / β_k
5 Ĥ_k = [Ĥ_{k−1}, h_k; 0, h_{k+1,k}]
end for
The Arnoldi decomposition: A U_k = U_{k+1} Ĥ_k, where Ĥ_k is a (k+1) × k Hessenberg matrix.
If A is symmetric this reduces to the Lanczos recurrence.
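The Arnoldi loop above can be sketched directly in numpy (matrix size, starting vector, and step count are illustrative assumptions):

```python
import numpy as np

def arnoldi(A, u1, k):
    """Arnoldi process: returns U with k+1 orthonormal columns and the
    (k+1) x k Hessenberg matrix H such that A @ U[:, :k] = U @ H."""
    m = A.shape[0]
    U = np.zeros((m, k + 1))
    H = np.zeros((k + 1, k))
    U[:, 0] = u1 / np.linalg.norm(u1)
    for j in range(k):
        v = A @ U[:, j]
        H[:j + 1, j] = U[:, :j + 1].T @ v     # h_k = U_k^T A u_k
        v -= U[:, :j + 1] @ H[:j + 1, j]      # v = A u_k - U_k h_k
        H[j + 1, j] = np.linalg.norm(v)       # beta_k = h_{k+1,k}
        U[:, j + 1] = v / H[j + 1, j]         # u_{k+1} = v / beta_k
    return U, H

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 20))
U, H = arnoldi(A, rng.standard_normal(20), 5)
print(np.allclose(A @ U[:, :5], U @ H))  # the Arnoldi decomposition holds
```

This plain Gram–Schmidt version is fine for a sketch; production codes reorthogonalize or restart to fight loss of orthogonality.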
13 Golub–Kahan bidiagonalization (A ∈ R^{m×n})
β_1 v_1 given, u_0 = 0
for j = 1, 2, ..., k do
1 α_j u_j = A^T v_j − β_j u_{j−1}
2 β_{j+1} v_{j+1} = A u_j − α_j v_j
end for
α_j, β_j are chosen to normalize u_j, v_j. With U_k = [u_1, ..., u_k] and V_{k+1} = [v_1, ..., v_{k+1}] we have
A U_k = V_{k+1} B_{k+1}, U_k^T U_k = I, V_{k+1}^T V_{k+1} = I,
where B_{k+1} is bidiagonal. U_k and V_{k+1} are orthonormal bases for
K_k(A^T A, u) = span{u, (A^T A)u, (A^T A)²u, ..., (A^T A)^{k−1}u}
K_k(A A^T, v) = span{v, (A A^T)v, (A A^T)²v, ..., (A A^T)^{k−1}v}
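The two-line recurrence translates almost verbatim into code; this sketch (with assumed sizes and random data) checks the resulting factorization A U_k = V_{k+1} B_{k+1}.

```python
import numpy as np

def golub_kahan(A, v1, k):
    """Golub-Kahan bidiagonalization following the slide's recurrence.
    Returns U (n x k), V (m x (k+1)) and the (k+1) x k bidiagonal B
    with A @ U = V @ B."""
    m, n = A.shape
    U = np.zeros((n, k))
    V = np.zeros((m, k + 1))
    B = np.zeros((k + 1, k))
    V[:, 0] = v1 / np.linalg.norm(v1)
    beta, u_prev = 0.0, np.zeros(n)
    for j in range(k):
        u = A.T @ V[:, j] - beta * u_prev      # alpha_j u_j = A^T v_j - beta_j u_{j-1}
        alpha = np.linalg.norm(u)
        U[:, j] = u / alpha
        v = A @ U[:, j] - alpha * V[:, j]      # beta_{j+1} v_{j+1} = A u_j - alpha_j v_j
        beta = np.linalg.norm(v)
        V[:, j + 1] = v / beta
        B[j, j], B[j + 1, j] = alpha, beta     # diagonal and subdiagonal of B
        u_prev = U[:, j]
    return U, V, B

rng = np.random.default_rng(4)
A = rng.standard_normal((15, 12))
U, V, B = golub_kahan(A, rng.standard_normal(15), 5)
print(np.allclose(A @ U, V @ B))
```

Column j of A U_k is α_j v_j + β_{j+1} v_{j+1}, which is exactly column j of V_{k+1} B_{k+1}.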
14 Matrix–vector and tensor–vector products
Matrix: A ∈ R^{m×n}, v ∈ R^n gives Av = A·(v)_2 ∈ R^m.
Tensor: A ∈ R^{l×m×n}, v ∈ R^m, w ∈ R^n gives
A·(w)_3 ∈ R^{l×m}, with [A·(w)_3]_{ij} = Σ_k a_{ijk} w_k
A·(v, w)_{2,3} ∈ R^l, with [A·(v, w)_{2,3}]_i = Σ_{j,k} a_{ijk} v_j w_k
and similarly A·(u, w)_{1,3} ∈ R^m and A·(u, v)_{1,2} ∈ R^n.
15 Minimal Krylov method I
Given A ∈ R^{l×m×n} and starting vectors with norm one, u_1 ∈ R^l, v_1 ∈ R^m, w_1 ∈ R^n:
u_{i+1} = A·(v_i, w_i)_{2,3}, i = 1, ..., k−1
v_{i+1} = A·(u_i, w_i)_{1,3}, i = 1, ..., k−1
w_{i+1} = A·(u_i, v_i)_{1,2}, i = 1, ..., k−1
Set U_k = [u_1 u_2 ... u_k], V_k = [v_1 v_2 ... v_k], W_k = [w_1 w_2 ... w_k]; orthogonalize explicitly using Gram–Schmidt on each sequence.
16 Minimal Krylov method II
Given A ∈ R^{l×m×n} and starting vectors with norm one, u_1 ∈ R^l, v_1 ∈ R^m, w_1 ∈ R^n, use the most recent vectors:
u_{i+1} = A·(v_i, w_i)_{2,3}, i = 1, ..., k−1
v_{i+1} = A·(u_{i+1}, w_i)_{1,3}, i = 1, ..., k−1
w_{i+1} = A·(u_{i+1}, v_{i+1})_{1,2}, i = 1, ..., k−1
Orthogonalize, set U_k = [u_1 u_2 ... u_k], V_k = [v_1 v_2 ... v_k], W_k = [w_1 w_2 ... w_k], and approximate
A ≈ (U_k U_k^T, V_k V_k^T, W_k W_k^T)·A
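The recursion with the most recent vectors can be sketched as follows; the tensor size, step count, and random starting vectors are assumptions for the example.

```python
import numpy as np

def orth_step(M, x):
    """Orthogonalize x against the columns of M (twice, for stability) and normalize."""
    x = x - M @ (M.T @ x)
    x = x - M @ (M.T @ x)
    return x / np.linalg.norm(x)

def minimal_krylov(A, k, rng):
    l, m, n = A.shape
    u = rng.standard_normal(l); u /= np.linalg.norm(u)
    v = rng.standard_normal(m); v /= np.linalg.norm(v)
    w = rng.standard_normal(n); w /= np.linalg.norm(w)
    U, V, W = u[:, None], v[:, None], w[:, None]
    for _ in range(k - 1):
        u = orth_step(U, np.einsum('ijk,j,k->i', A, V[:, -1], W[:, -1]))  # A·(v_i, w_i)_{2,3}
        U = np.column_stack([U, u])
        v = orth_step(V, np.einsum('ijk,i,k->j', A, u, W[:, -1]))         # A·(u_{i+1}, w_i)_{1,3}
        V = np.column_stack([V, v])
        w = orth_step(W, np.einsum('ijk,i,j->k', A, u, v))                # A·(u_{i+1}, v_{i+1})_{1,2}
        W = np.column_stack([W, w])
    return U, V, W

rng = np.random.default_rng(5)
A = rng.standard_normal((10, 11, 12))
U, V, W = minimal_krylov(A, 4, rng)
# Approximate A by (U U^T, V V^T, W W^T)·A
A_hat = np.einsum('ia,jb,kc,abc->ijk', U @ U.T, V @ V.T, W @ W.T, A)
print(U.shape, V.shape, W.shape)
```

Each new vector costs one two-mode tensor-vector contraction plus a Gram–Schmidt step against its own sequence, which is what makes the recursion "minimal".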
17 Krylov method on a low rank matrix
Let A ∈ R^{n×n} with rank(A) = k, so A = U_k Σ_k V_k^T. Then
K_k(A, u) = K_{k+p}(A, u) for all p ≥ 1.
We only need to do k steps of Arnoldi.
18 Low rank tensor and minimal Krylov method
Theorem. Let A ∈ R^{l×m×n} with rank(A) = (p, q, r), i.e. A = (X, Y, Z)·C, and assume p ≥ q ≥ r. Then, using a modified minimal Krylov recursion:
in p steps we get U_p s.t. span(U_p) = span(X);
in q steps we get V_q s.t. span(V_q) = span(Y);
in r steps we get W_r s.t. span(W_r) = span(Z).
19 Low rank tensors + noise and minimal Krylov method
Theorem. Let A ∈ R^{l×m×n} with rank(A) = (p, q, r) and add noise ρE, again assuming p ≥ q ≥ r, so that A = (X, Y, Z)·C + ρE. Then, using a modified minimal Krylov recursion:
in p steps we get U_p s.t. span(U_p) ≈ span(X) within the level of noise;
in q steps we get V_q s.t. span(V_q) ≈ span(Y) within the level of noise;
in r steps we get W_r s.t. span(W_r) ≈ span(Z) within the level of noise.
20 Maximal Krylov method
Generate all possible combinations at each step:
{u_1} × {v_1} → {w_1}
{v_1} × {w_1} → {u_2}
{u_1, u_2} × {w_1} → {v_2, v_3}
{u_1, u_2} × {v_1, v_2, v_3} → {w_1, w_2, w_3, w_4, w_5, w_6}
{v_1, v_2, v_3} × {w_1, w_2, ..., w_6} → {u_2, u_3, ..., u_19}
{u_1, u_2, ..., u_19} × {w_1, w_2, ..., w_6} → {v_2, v_3, v_4, ..., v_115}
21 Krylov factorization for maximal recursion
Theorem (Tensor Krylov factorizations).
After a complete u-loop: A·(V_k, W_l)_{2,3} = (U_j)_1·H_{jkl}.
After a complete v-loop: A·(U_j, W_l)_{1,3} = (V_m)_2·H_{jml}.
After a complete w-loop: A·(U_j, V_m)_{1,2} = (W_n)_3·H_{jmn}.
22 Example of a maximal Krylov method
1 {u_1} × {v_1} → {w_1}: A·(u_1, v_1)_{1,2} = (w_1)_3·H
2 {v_1} × {w_1} → {u_2}: A·(v_1, w_1)_{2,3} = ([u_1 u_2])_1·H
3 {u_1, u_2} × {w_1} → {v_2, v_3}: A·([u_1 u_2], w_1)_{1,3} = ([v_1 v_2 v_3])_2·H
4 {u_1, u_2} × {v_1, v_2, v_3} → {w_1, w_2, w_3, w_4, w_5, w_6}
5 ... A·([u_1 u_2], [v_1 v_2 v_3])_{1,2} = ([w_1 ... w_6])_3·H
The bad: the dimensions of the subspaces explode.
23 Krylov subspaces of contracted tensor products
Recall: for a matrix A ∈ R^{m×n} we have A A^T = ⟨A, A⟩_{-1} ∈ R^{m×m} and A^T A = ⟨A, A⟩_{-2} ∈ R^{n×n}.
For A ∈ R^{m×n×l}, u ∈ R^m, v ∈ R^n, w ∈ R^l, consider
⟨A, A⟩_{-1} = A^{(1)}(A^{(1)})^T ∈ R^{m×m} and K_k(⟨A, A⟩_{-1}, u)
⟨A, A⟩_{-2} = A^{(2)}(A^{(2)})^T ∈ R^{n×n} and K_k(⟨A, A⟩_{-2}, v)
⟨A, A⟩_{-3} = A^{(3)}(A^{(3)})^T ∈ R^{l×l} and K_k(⟨A, A⟩_{-3}, w)
These are symmetric matrices, so we can apply the Lanczos recurrence. All computations are implemented using A·(v, w)_{2,3}, A·(u, w)_{1,3}, A·(u, v)_{1,2}. The optimal subspaces give the truncated HOSVD.
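The key point for Lanczos is that the Gram matrix ⟨A, A⟩_{-1} never has to be formed: its matrix-vector product can be computed with two tensor contractions. A small sketch (sizes assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 6, 7))
u = rng.standard_normal(5)

# Explicit Gram matrix of the mode-1 unfolding: <A, A>_{-1} = A^(1) (A^(1))^T
A1 = A.reshape(5, -1)
direct = (A1 @ A1.T) @ u

# Same product using only tensor contractions, never forming the Gram matrix:
M = np.einsum('ijk,i->jk', A, u)          # contract mode 1 with u
via_contr = np.einsum('ajk,jk->a', A, M)  # contract modes 2 and 3 against A

print(np.allclose(direct, via_contr))
```

This is what makes the Lanczos recurrence on ⟨A, A⟩_{-1} feasible for large sparse tensors: only tensor-vector contractions are needed.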
24 Optimized tensor-krylov approach Let U i =[u 1 u i ], V i =[v 1 v i ], and W i 1 =[w 1 w i 1 ]. Find θ and η that give optimal ŵ ŵ = A (U i θ, V i η) 1,2, max θ,η ŵ, s.t. ŵ W i 1, θ = η =1, θ, η R i. Solution: best rank-(1, 1, 1) approximation ( ) (θ, η, ω) S 111 A U i, V i, I W i 1 Wi 1 T. [Goreinov, Oseledets and Savostyanov 2010] and [Savas and Eldén 2010]
25 Experiments: minimal Krylov vs truncated HOSVD
(Figure: the relative difference (‖Â_min‖ − ‖Â_hosvd‖)/‖Â_hosvd‖ between the approximations Â_min and Â_hosvd of a tensor A, over random runs. The rank of the approximation is (10, 10, 10).)
26 HOSVD using the minimal tensor Krylov method
Let A have exact low rank, A = (X, Y, Z)·C, with HOSVD A = (U, V, W)·S.
Alternative I:
1 SVD of A^{(1)} gives U
2 SVD of A^{(2)} gives V
3 SVD of A^{(3)} gives W
4 Compute S = A·(U, V, W)
Alternative II:
1 Minimal Krylov method on A gives U_p, V_q, and W_r
2 Compute C = A·(U_p, V_q, W_r)
3 Compute the HOSVD of C = (Ū, V̄, W̄)·S̄
4 Change basis: U = U_p Ū, V = V_q V̄, W = W_r W̄
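Alternative I can be sketched directly (the tensor is random and dense here, an assumption for illustration; the whole point of Alternative II is to avoid these full unfolding SVDs for large tensors):

```python
import numpy as np

def unfold(A, mode):
    """Mode-n unfolding A^(n): mode's fibers become the rows."""
    return np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 5, 6))

# Alternative I: SVDs of the three unfoldings give U, V, W
U = np.linalg.svd(unfold(A, 0), full_matrices=False)[0]
V = np.linalg.svd(unfold(A, 1), full_matrices=False)[0]
W = np.linalg.svd(unfold(A, 2), full_matrices=False)[0]

# Core S = A·(U, V, W); then A = (U, V, W)·S exactly
S = np.einsum('il,jm,kn,ijk->lmn', U, V, W, A)
A_back = np.einsum('il,jm,kn,lmn->ijk', U, V, W, S)

# S is all-orthogonal: slices along a mode are mutually orthogonal,
# so the mode-1 slice Gram matrix is diagonal
G = np.einsum('iab,jab->ij', S, S)
print(np.allclose(A, A_back), np.allclose(G, np.diag(np.diag(G))))
```

The diagonal slice Gram matrix is exactly the all-orthogonality property of the HOSVD core from slide 8.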
27 Experiments: applied to the Netflix data
Tensor of size users × movies × days. User mode: number of users; movie mode: number of movies; time mode: 2243 days.
Time for computing a rank-(100, 100, 100) approximation: 17 hours.
Bottleneck: u = A·(v, w)_{2,3}. Efficient storage schemes are needed. Do not compute A·(w)_3, as it will be dense.
28 Optimality conditions
29 Recall: multilinear low rank approximation
min ‖A − (X, Y, Z)·C‖
The problem is overparameterized, nonlinear, and equivalent to
max f(X, Y, Z) s.t. X^T X = I, Y^T Y = I, Z^T Z = I, where f(X, Y, Z) = ‖A·(X, Y, Z)‖²_F
30 First order optimality conditions
Objective function f(X, Y, Z) = ‖A·(X, Y, Z)‖²_F. At a stationary point (X, Y, Z) we have ∇f = 0:
∇f_x = ⟨A·(X_⊥, Y, Z), A·(X, Y, Z)⟩_{-1} = 0
∇f_y = ⟨A·(X, Y_⊥, Z), A·(X, Y, Z)⟩_{-2} = 0
∇f_z = ⟨A·(X, Y, Z_⊥), A·(X, Y, Z)⟩_{-3} = 0
where [X X_⊥], [Y Y_⊥], [Z Z_⊥] are orthogonal matrices.
For matrices we have f(X, Y) = ‖X^T A Y‖²_F and
⟨X_⊥^T A Y, X^T A Y⟩ = X_⊥^T A Y (X^T A Y)^T = 0.
Let [U_k U_⊥] and [V_k V_⊥] contain the left and right singular vectors; then
[U_k U_⊥]^T A [V_k V_⊥] = [U_k^T A V_k, U_k^T A V_⊥; U_⊥^T A V_k, U_⊥^T A V_⊥] = [Σ_k, 0; 0, Σ_⊥]
31 WOOW..., what is that??
⟨A·(X_⊥, Y, Z), A·(X, Y, Z)⟩_{-1} = 0
(Figure: the tensor partitioned into the blocks A·(X, Y, Z), A·(X_⊥, Y, Z), A·(X, Y_⊥, Z), A·(X, Y, Z_⊥), A·(X_⊥, Y_⊥, Z), etc. The figure visualizes the generalization of
[U_k U_⊥]^T A [V_k V_⊥] = [U_k^T A V_k, U_k^T A V_⊥; U_⊥^T A V_k, U_⊥^T A V_⊥] = [Σ_k, 0; 0, Σ_⊥].)
32 Second order optimality conditions: ordering
(X, Y, Z) is a local maximum of f(X, Y, Z) = ‖A·(X, Y, Z)‖²_F if the Hessian is negative definite.
Theorem. Let (X, Y, Z) be a local maximum and B = [A·(X, Y, Z); A·(X_⊥, Y, Z)]. Then
‖B(1, :, :)‖ ≥ ... ≥ ‖B(r_1, :, :)‖ > ‖B(r_1 + 1, :, :)‖ ≥ ... ≥ ‖B(l, :, :)‖
(Figure: block partition of the tensor as before.)
33 We can also get orthogonality
(X, Y, Z) is a local maximum of f(X, Y, Z) = ‖A·(X, Y, Z)‖²_F if the Hessian is negative definite.
Theorem. Let (X, Y, Z) be a local maximum and B = [A·(X, Y, Z); A·(X_⊥, Y, Z)]. Then
⟨B(i, :, :), B(j, :, :)⟩ = 0, i ≠ j
(Figure: block partition of the tensor as before.)
34 Compare with the HOSVD...
A = (U, V, W)·S, with U, V, W orthogonal and S all-orthogonal: ⟨S(:, :, i), S(:, :, j)⟩ = 0, i ≠ j.
(Figure: the slices S(:, :, 1), S(:, :, 2), S(:, :, 3).)
The properties hold in all three modes simultaneously.
35 Perturbation theory and concept of gap for tensors
36 Perturbation analysis: setup
Let (X, Y, Z) be a representative of a stationary point for ‖A − (X, Y, Z)·C‖²_F. Now perturb A with a small E: Ã = A + E. What is the first order perturbation (X̃, Ỹ, Z̃) of the stationary point?
X̃ = X + δX, Ỹ = Y + δY, Z̃ = Z + δZ
We want to bound ‖δX‖, ‖δY‖, and ‖δZ‖ in terms of some properties of A.
37 First order optimality condition
The perturbed point (X̃, Ỹ, Z̃) has to satisfy the first order optimality conditions, i.e. ∇f = (∇f_x, ∇f_y, ∇f_z) = 0:
∇f_x = ⟨Ã·(X̃_⊥, Ỹ, Z̃), Ã·(X̃, Ỹ, Z̃)⟩_{-1} = 0
∇f_y = ⟨Ã·(X̃, Ỹ_⊥, Z̃), Ã·(X̃, Ỹ, Z̃)⟩_{-2} = 0
∇f_z = ⟨Ã·(X̃, Ỹ, Z̃_⊥), Ã·(X̃, Ỹ, Z̃)⟩_{-3} = 0
38 First order optimality condition: shake the gradient...
..., and we get the Hessian! In operator form we have
[H_xx, H_xy, H_xz; H_yx, H_yy, H_yz; H_zx, H_zy, H_zz] applied to (δX, δY, δZ).
The Hessian appears on the LHS, and terms linear in the perturbation on the RHS:
H_xx(δX) + H_xy(δY) + H_xz(δZ) = (⟨F_x, E⟩ + ⟨E_x, F⟩)_{-1}
H_yx(δX) + H_yy(δY) + H_yz(δZ) = (⟨F_y, E⟩ + ⟨E_y, F⟩)_{-2}
H_zx(δX) + H_zy(δY) + H_zz(δZ) = (⟨F_z, E⟩ + ⟨E_z, F⟩)_{-3}
This looks messy... BUT: these equations will give us the gap!
39 Example 1: Matrix case
1 The perturbation system is
[Σ_1² ⊗ I, −Σ_1 ⊗ Σ_2; −Σ_1 ⊗ Σ_2, Σ_1² ⊗ I] [vec(δX); vec(δY)] = [(Σ_1 ⊗ I) vec(E_x); (Σ_1 ⊗ I) vec(E_y)]
2 We can uncouple this into 2 × 2 block systems. The worst one is
[σ_r², −σ_r σ_{r+1}; −σ_r σ_{r+1}, σ_r²] [δx; δy] = [σ_r e_x; σ_r e_y]
3 Solution:
[δx; δy] = 1/((σ_r − σ_{r+1})(σ_r + σ_{r+1})) [σ_r e_x + σ_{r+1} e_y; σ_{r+1} e_x + σ_r e_y]
4 Bounding gives:
‖δx‖ ≤ (‖e_x‖ + ‖e_y‖)/(σ_r − σ_{r+1}), ‖δy‖ ≤ (‖e_x‖ + ‖e_y‖)/(σ_r − σ_{r+1})
5 The quantity σ_r − σ_{r+1} is called the gap!
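A quick numerical illustration of the gap's role (the matrix, its singular values, and the perturbation are invented for the example): a small perturbation E moves the dominant singular subspace by roughly ‖E‖ divided by the gap σ_r − σ_{r+1}.

```python
import numpy as np

rng = np.random.default_rng(8)
# Build a matrix with a known singular value gap at r = 2
U0 = np.linalg.qr(rng.standard_normal((8, 8)))[0]
V0 = np.linalg.qr(rng.standard_normal((8, 8)))[0]
A = U0 @ np.diag([10.0, 9.0, 3.0, 1.0, 0.5, 0.3, 0.2, 0.1]) @ V0.T  # gap = 9 - 3 = 6

E = 1e-3 * rng.standard_normal((8, 8))   # small perturbation
X = np.linalg.svd(A)[0][:, :2]            # dominant left singular subspace of A
Xt = np.linalg.svd(A + E)[0][:, :2]       # same subspace for the perturbed matrix

# Distance between the subspaces (via projectors), next to the scale ||E|| / gap
angle = np.linalg.norm(Xt @ Xt.T - X @ X.T, 2)
print(angle, np.linalg.norm(E, 2) / 6.0)
```

Comparing projectors X X^T avoids the sign/rotation ambiguity of individual singular vectors; with a gap of 6 and ‖E‖ of order 10⁻³, the subspace moves by roughly the same 10⁻³ order.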
40 Example 2: Tensor case with rank-(1,1,1) approximation
Let (x, y, z) be a stationary point of f(x, y, z) = ‖A·(x, y, z)‖²_F. With c = A·(x, y, z) the perturbation equations become
[c² I, c F_xy, c F_xz; c F_yx, c² I, c F_yz; c F_zx, c F_zy, c² I] [δx; δy; δz] = [c e_x; c e_y; c e_z]
where F_xy = A·(X_⊥, Y_⊥, z), etc.
(Figure: the blocks A·(x, y, z), A·(X_⊥, y, z), A·(x, Y_⊥, z), A·(x, y, Z_⊥), ...; due to orthogonality the blue vectors are zero.)
41 Example 2: Tensor case with rank-(1,1,1) approximation
The perturbation satisfies
[c² I, c F_xy, c F_xz; c F_yx, c² I, c F_yz; c F_zx, c F_zy, c² I] [δx; δy; δz] = [c e_x; c e_y; c e_z]
We get the bound
‖(δx; δy; δz)‖ ≤ ‖G^{-1}‖ ‖(e_x; e_y; e_z)‖, where
G = c [I, 0, 0; 0, I, 0; 0, 0, I] + [0, F_xy, F_xz; F_yx, 0, F_yz; F_zx, F_zy, 0] = cI + F
What can we say about ‖G^{-1}‖?
42 Example 2: Tensor case with rank-(1,1,1) approximation
We have that ‖G^{-1}‖ ≤ 1/λ_min(G), with
G = c [I, 0, 0; 0, I, 0; 0, 0, I] + [0, F_xy, F_xz; F_yx, 0, F_yz; F_zx, F_zy, 0] = cI + F
The gap becomes λ_min(G) = c − µ_max, where µ_max is the largest eigenvalue of F.
Compare with the matrix case: [σ_1, 0; 0, Σ_⊥].
43 Summary
1 We considered the multilinear low rank approximation of a tensor, A ≈ (X, Y, Z)·C
2 Presented several Krylov-type procedures that generate low rank approximations
3 Interpreted first and second order optimality conditions: ordering and orthogonality
4 Generalized the gap from matrix sensitivity analysis to tensors
44 Thank you for your time!
More informationWhen Sparsity Meets Low-Rankness: Transform Learning With Non-Local Low-Rank Constraint For Image Restoration
When Sparsity Meets Low-Rankness: Transform Learning With Non-Local Low-Rank Constraint For Image Restoration Bihan Wen, Yanjun Li and Yoram Bresler Department of Electrical and Computer Engineering Coordinated
More informationConic Duality. yyye
Conic Linear Optimization and Appl. MS&E314 Lecture Note #02 1 Conic Duality Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/
More informationKeypoint detection. (image registration, panorama stitching, motion estimation + tracking, recognition )
Keypoint detection n n Many applications benefit from features localized in (x,y) (image registration, panorama stitching, motion estimation + tracking, recognition ) Edges well localized only in one direction
More informationTutorial on Convex Optimization for Engineers
Tutorial on Convex Optimization for Engineers M.Sc. Jens Steinwandt Communications Research Laboratory Ilmenau University of Technology PO Box 100565 D-98684 Ilmenau, Germany jens.steinwandt@tu-ilmenau.de
More informationON CONCAVITY OF THE PRINCIPAL S PROFIT MAXIMIZATION FACING AGENTS WHO RESPOND NONLINEARLY TO PRICES
ON CONCAVITY OF THE PRINCIPAL S PROFIT MAXIMIZATION FACING AGENTS WHO RESPOND NONLINEARLY TO PRICES Shuangjian Zhang This is joint work with my supervisor Robert J. McCann University of Toronto April 11,
More informationContents. Implementing the QR factorization The algebraic eigenvalue problem. Applied Linear Algebra in Geoscience Using MATLAB
Applied Linear Algebra in Geoscience Using MATLAB Contents Getting Started Creating Arrays Mathematical Operations with Arrays Using Script Files and Managing Data Two-Dimensional Plots Programming in
More informationTHE COMPUTER MODELLING OF GLUING FLAT IMAGES ALGORITHMS. Alekseí Yu. Chekunov. 1. Introduction
MATEMATIQKI VESNIK Corrected proof Available online 01.10.2016 originalni nauqni rad research paper THE COMPUTER MODELLING OF GLUING FLAT IMAGES ALGORITHMS Alekseí Yu. Chekunov Abstract. In this paper
More informationCPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017
CPSC 340: Machine Learning and Data Mining Kernel Trick Fall 2017 Admin Assignment 3: Due Friday. Midterm: Can view your exam during instructor office hours or after class this week. Digression: the other
More informationAnimation Curves and Splines 2
Animation Curves and Splines 2 Animation Homework Set up Thursday a simple avatar E.g. cube/sphere (or square/circle if 2D) Specify some key frames (positions/orientations) Associate Animation a time with
More informationMODEL REDUCTION FOR LARGE-SCALE SYSTEMS WITH HIGH-DIMENSIONAL PARAMETRIC INPUT SPACE
MODEL REDUCTION FOR LARGE-SCALE SYSTEMS WITH HIGH-DIMENSIONAL PARAMETRIC INPUT SPACE T. BUI-THANH, K. WILLCOX, AND O. GHATTAS Abstract. A model-constrained adaptive sampling methodology is proposed for
More informationMinima, Maxima, Saddle points
Minima, Maxima, Saddle points Levent Kandiller Industrial Engineering Department Çankaya University, Turkey Minima, Maxima, Saddle points p./9 Scalar Functions Let us remember the properties for maxima,
More informationThe exam is closed book, closed notes except your one-page (two-sided) cheat sheet.
CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or
More informationComputation of the gravity gradient tensor due to topographic masses using tesseroids
Computation of the gravity gradient tensor due to topographic masses using tesseroids Leonardo Uieda 1 Naomi Ussami 2 Carla F Braitenberg 3 1. Observatorio Nacional, Rio de Janeiro, Brazil 2. Universidade
More informationAn Approximate Singular Value Decomposition of Large Matrices in Julia
An Approximate Singular Value Decomposition of Large Matrices in Julia Alexander J. Turner 1, 1 Harvard University, School of Engineering and Applied Sciences, Cambridge, MA, USA. In this project, I implement
More informationMachine Learning for Signal Processing Lecture 4: Optimization
Machine Learning for Signal Processing Lecture 4: Optimization 13 Sep 2015 Instructor: Bhiksha Raj (slides largely by Najim Dehak, JHU) 11-755/18-797 1 Index 1. The problem of optimization 2. Direct optimization
More informationELEG Compressive Sensing and Sparse Signal Representations
ELEG 867 - Compressive Sensing and Sparse Signal Representations Introduction to Matrix Completion and Robust PCA Gonzalo Garateguy Depart. of Electrical and Computer Engineering University of Delaware
More informationQuaternion properties: addition. Introduction to quaternions. Quaternion properties: multiplication. Derivation of multiplication
Introduction to quaternions Definition: A quaternion q consists of a scalar part s, s, and a vector part v ( xyz,,, v 3 : q where, [ s, v q [ s, ( xyz,, q s+ ix + jy + kz i 2 j 2 k 2 1 ij ji k k Quaternion
More informationDubna 2018: lines on cubic surfaces
Dubna 2018: lines on cubic surfaces Ivan Cheltsov 20th July 2018 Lecture 1: projective plane Complex plane Definition A line in C 2 is a subset that is given by ax + by + c = 0 for some complex numbers
More informationContents. MATH 32B-2 (18W) (L) G. Liu / (TA) A. Zhou Calculus of Several Variables. 1 Homework 1 - Solutions 3. 2 Homework 2 - Solutions 13
MATH 32B-2 (8) (L) G. Liu / (TA) A. Zhou Calculus of Several Variables Contents Homework - Solutions 3 2 Homework 2 - Solutions 3 3 Homework 3 - Solutions 9 MATH 32B-2 (8) (L) G. Liu / (TA) A. Zhou Calculus
More informationLecture 11: Randomized Least-squares Approximation in Practice. 11 Randomized Least-squares Approximation in Practice
Stat60/CS94: Randomized Algorithms for Matrices and Data Lecture 11-10/09/013 Lecture 11: Randomized Least-squares Approximation in Practice Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 21 Outline 1 Course
More informationLagrangian methods for the regularization of discrete ill-posed problems. G. Landi
Lagrangian methods for the regularization of discrete ill-posed problems G. Landi Abstract In many science and engineering applications, the discretization of linear illposed problems gives rise to large
More informationA Detailed Look into Forward and Inverse Kinematics
A Detailed Look into Forward and Inverse Kinematics Kinematics = Study of movement, motion independent of the underlying forces that cause them September 19-26, 2016 Kinematics Preliminaries Preliminaries:
More informationSDLS: a Matlab package for solving conic least-squares problems
SDLS: a Matlab package for solving conic least-squares problems Didier Henrion 1,2 Jérôme Malick 3 June 28, 2007 Abstract This document is an introduction to the Matlab package SDLS (Semi-Definite Least-Squares)
More informationSection 1.8. Simplifying Expressions
Section 1.8 Simplifying Expressions But, first Commutative property: a + b = b + a; a * b = b * a Associative property: (a + b) + c = a + (b + c) (a * b) * c = a * (b * c) Distributive property: a * (b
More informationData-driven modeling: A low-rank approximation problem
1 / 34 Data-driven modeling: A low-rank approximation problem Ivan Markovsky Vrije Universiteit Brussel 2 / 34 Outline Setup: data-driven modeling Problems: system identification, machine learning,...
More informationEE613 Machine Learning for Engineers LINEAR REGRESSION. Sylvain Calinon Robot Learning & Interaction Group Idiap Research Institute Nov.
EE613 Machine Learning for Engineers LINEAR REGRESSION Sylvain Calinon Robot Learning & Interaction Group Idiap Research Institute Nov. 9, 2017 1 Outline Multivariate ordinary least squares Matlab code:
More information13. Learning Ballistic Movementsof a Robot Arm 212
13. Learning Ballistic Movementsof a Robot Arm 212 13. LEARNING BALLISTIC MOVEMENTS OF A ROBOT ARM 13.1 Problem and Model Approach After a sufficiently long training phase, the network described in the
More information