Divide and Conquer Kernel Ridge Regression
1 Divide and Conquer Kernel Ridge Regression
Yuchen Zhang, John Duchi, Martin Wainwright
University of California, Berkeley
COLT 2013
2 Problem set-up
Goal: solve the following problem:
    minimize E[(f(x) − y)²]  subject to f ∈ H,
where (x, y) is sampled from a joint distribution P, and H is a Reproducing Kernel Hilbert Space (RKHS).
H is defined by a kernel function k : X × X → R. Formally:
    H = { f : f = Σ_{i=1}^∞ α_i k(x_i, ·), x_i ∈ X }.
3 Kernel trick review
1 Linear regression f(x) = θᵀx only fits linear problems.
2 For non-linear problems, the trick is to map x onto a high-dimensional feature space x ↦ φ(x), then learn a model f(x) = θᵀφ(x), so that f is a non-linear function of x.
4 Kernel trick review (continued)
3 Usually, φ(x) is an infinite-dimensional vector or is expensive to compute. We can reformulate the learning algorithm so that the input vector enters only through the inner product k(x, x′) = φ(x)ᵀφ(x′).
4 k is called the kernel function and should be easy to compute. Examples:
    Polynomial kernel: k(x, x′) = (1 + xᵀx′)^d.
    Gaussian kernel: k(x, x′) = exp(−‖x − x′‖² / (2σ²)).
    Sobolev kernel on R¹: k(x, x′) = 1 + min(x, x′).
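The three example kernels above are straightforward to write down. A minimal numpy sketch (the function names are ours, not from the talk):

```python
import numpy as np

def polynomial_kernel(x, z, d=3):
    """Polynomial kernel k(x, x') = (1 + x^T x')^d."""
    return (1.0 + x @ z) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sobolev_kernel(x, z):
    """First-order Sobolev kernel on R^1: k(x, x') = 1 + min(x, x')."""
    return 1.0 + min(x, z)
```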
5 Kernel ridge regression
Given N samples (x₁, y₁), …, (x_N, y_N), we want to compute the empirical minimizer
    f̂ = argmin_{f ∈ H} (1/N) Σ_{i=1}^N (f(x_i) − y_i)² + λ‖f‖²_H
as an estimate of f* = argmin_{f ∈ H} E[(f(x) − y)²].
This minimization problem has a closed-form solution:
    f̂ = Σ_{i=1}^N α_i k(x_i, ·),  where α = (K + λN I)⁻¹ y.
K is the N × N kernel matrix defined by K_ij = k(x_i, x_j).
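The closed-form solution translates directly into a few lines of numpy. A sketch with our own helper names; it solves the linear system rather than forming the inverse explicitly:

```python
import numpy as np

def krr_fit(X, y, kernel, lam):
    """Solve (K + lam*N*I) alpha = y for the KRR coefficients alpha."""
    N = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.solve(K + lam * N * np.eye(N), y)

def krr_predict(X_train, alpha, kernel, x):
    """Evaluate f_hat(x) = sum_i alpha_i k(x_i, x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))
```

Using `np.linalg.solve` avoids the numerical instability of an explicit inverse, but the cost is still cubic in N, which motivates what follows.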
6 Think about large datasets
The matrix inversion α = (K + λN I)⁻¹ y takes O(N³) time and O(N²) memory, which can be prohibitively expensive when N is large.
Fast approaches to computing kernel ridge regression:
1 Low-rank matrix approximation:
    Kernel PCA.
    Incomplete Cholesky decomposition.
    Nyström sampling.
2 Iterative optimization algorithms:
    Gradient descent.
    Conjugate gradient methods.
However, none of these methods has been shown to achieve the same level of accuracy as the exact algorithm.
7 Our main idea
Only keep the diagonal blocks of the kernel matrix, so that the matrix inversion is fast. First randomly shuffle the rows and columns of K (equivalently, the samples), then block-diagonalize.
[figure: a 6 × 6 block kernel matrix (K_ij), its randomly shuffled version, and the block-diagonal matrix obtained by keeping only the diagonal blocks]
8 Fast kernel ridge regression (Fast-KRR)
We propose a divide-and-conquer approach:
1 Divide the set of samples {(x₁, y₁), …, (x_N, y_N)} evenly and uniformly at random into m disjoint subsets S₁, …, S_m ⊂ X × R.
2 For each i = 1, 2, …, m, compute the local KRR estimate
    f̂_i := argmin_{f ∈ H} (1/|S_i|) Σ_{(x,y) ∈ S_i} (f(x) − y)²  [local risk]  + λ‖f‖²_H  [under-regularized].
3 Average the local estimates and output f̄ = (1/m) Σ_{i=1}^m f̂_i.
Computation time: O(N³/m²); memory: O(N²/m²).
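The three steps above can be sketched in a few lines of numpy. This is our own serial sketch (function names are assumptions); note the m local fits are embarrassingly parallel:

```python
import numpy as np

def fast_krr_fit(X, y, kernel, lam, m, seed=0):
    """Divide-and-conquer KRR: randomly partition the N samples into
    m subsets and fit a local KRR estimate on each."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    models = []
    for block in np.array_split(perm, m):
        Xb, yb = X[block], y[block]
        n = len(Xb)
        K = np.array([[kernel(a, b) for b in Xb] for a in Xb])
        # Each subset uses the same (under-regularized) lambda.
        alpha = np.linalg.solve(K + lam * n * np.eye(n), yb)
        models.append((Xb, alpha))
    return models

def fast_krr_predict(models, kernel, x):
    """Average the m local predictions: f_bar(x) = (1/m) sum_i f_i(x)."""
    preds = [sum(a * kernel(xi, x) for a, xi in zip(alpha, Xb))
             for Xb, alpha in models]
    return float(np.mean(preds))
```

With m = 1 this reduces to exact KRR; each local solve touches only an n × n system with n = N/m, which is where the O(N³/m²) time and O(N²/m²) memory come from.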
9 Theoretical result
Theorem. With m splits, Fast-KRR achieves the mean squared error
    E[‖f̄ − f*‖²₂] ≤ C ( λ‖f*‖²_H [squared bias] + γ(λ)/N [variance] ) + T(λ, m).
λ‖f*‖²_H is the bias introduced by regularization.
T(λ, m) becomes a higher-order, negligible term when m is below the threshold m ≲ N/γ²(λ).
γ(λ) is the effective dimensionality: let μ₁ ≥ μ₂ ≥ … be the sequence of eigenvalues in kernel k's eigen-expansion; then
    γ(λ) = Σ_{k=1}^∞ μ_k / (λ + μ_k).
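The effective dimensionality is easy to evaluate numerically given (a truncation of) the eigenvalue sequence. A small sketch; the geometric eigendecay below is an illustrative assumption of ours, not a claim from the talk:

```python
import numpy as np

def effective_dimensionality(mu, lam):
    """gamma(lam) = sum_k mu_k / (lam + mu_k) over kernel eigenvalues mu_k."""
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(mu / (lam + mu)))

# Illustrative eigendecay mu_k = 2^{-k} (truncated; an assumption for demo).
mu = 2.0 ** -np.arange(1, 60)
```

As λ shrinks, each ratio μ_k/(λ + μ_k) grows toward 1, so γ(λ) increases: weaker regularization effectively "uses" more eigendirections, which in turn lowers the admissible number of partitions m ≲ N/γ²(λ).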
10 Apply to specific kernels
Corollary. For a polynomial kernel, if m ≤ cN / log N, then
    E[‖f̄ − f*‖²₂] = O(1/N)  (minimax optimal rate).
Time: O(N³) → O(N log² N). Space: O(N²) → O(log² N).
Corollary. For a Gaussian kernel, if m ≤ cN / log² N, then
    E[‖f̄ − f*‖²₂] = O(log N / N)  (minimax optimal rate).
Time: O(N³) → O(N log⁴ N). Space: O(N²) → O(log⁴ N).
11 Apply to specific kernels (continued)
Corollary. For a Sobolev kernel of smoothness ν, if m ≤ c N^{(2ν−1)/(2ν+1)} / log N, then
    E[‖f̄ − f*‖²₂] = O(N^{−2ν/(2ν+1)})  (minimax optimal rate).
Time: O(N³) → O(N^{(2ν+5)/(2ν+1)} log² N). Space: O(N²) → O(N^{4/(2ν+1)} log² N).
12 Simulation study
Data (x, y) is generated by y = min(x, 1 − x) + ε, where ε ~ N(0, 0.2).
[figure: scatter plot of y against x]
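The simulated regression problem is simple to reproduce. A sketch; the slide's N(0, 0.2) leaves the second argument ambiguous, and we read 0.2 as the standard deviation (an assumption):

```python
import numpy as np

def make_data(N, seed=0):
    """x ~ Uniform[0, 1], y = min(x, 1 - x) + eps.
    eps is Gaussian noise; we take std = 0.2 (reading of N(0, 0.2) is assumed)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=N)
    y = np.minimum(x, 1.0 - x) + rng.normal(0.0, 0.2, size=N)
    return x, y
```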
13 Compare Fast-KRR and exact KRR
We use a Sobolev kernel of smoothness 1 to fit the data.
[figure: mean squared error against total number of samples N, for m = 1, 4, 16, 64]
Fast-KRR's performance is very close to exact KRR for m ≤ 16.
14 Threshold for data partitioning
Mean squared error is plotted for varied choices of m.
[figure: mean squared error against log(# of partitions)/log(# of samples), for N = 256, 512, 1024, 2048, 4096, …]
As long as m ≤ N^0.45, the accuracy is not hurt.
15 Conclusion
We propose a divide-and-conquer approach for kernel ridge regression that leads to a substantial reduction in computation time and memory.
The proposed algorithm achieves the optimal convergence rate for the full sample size N, as long as the partition number m is not too large.
As concrete examples, our theory guarantees that m may grow polynomially in N for Sobolev spaces, and nearly linearly in N for finite-rank kernels and Gaussian kernels.
So far in the course Clustering Subhransu Maji : Machine Learning 2 April 2015 7 April 2015 Supervised learning: learning with a teacher You had training data which was (feature, label) pairs and the goal
More informationA Polygonal Line Algorithm for Constructing Principal Curves
A Polygonal Line Algorithm for Constructing s Balázs Kégl, Adam Krzyżak Dept. of Computer Science Concordia University 45 de Maisonneuve Blvd. W. Montreal, Canada H3G M8 keglcs.concordia.ca krzyzakcs.concordia.ca
More informationCPSC 340: Machine Learning and Data Mining. Robust Regression Fall 2015
CPSC 340: Machine Learning and Data Mining Robust Regression Fall 2015 Admin Can you see Assignment 1 grades on UBC connect? Auditors, don t worry about it. You should already be working on Assignment
More informationFeature selection. Term 2011/2012 LSI - FIB. Javier Béjar cbea (LSI - FIB) Feature selection Term 2011/ / 22
Feature selection Javier Béjar cbea LSI - FIB Term 2011/2012 Javier Béjar cbea (LSI - FIB) Feature selection Term 2011/2012 1 / 22 Outline 1 Dimensionality reduction 2 Projections 3 Attribute selection
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines CS 536: Machine Learning Littman (Wu, TA) Administration Slides borrowed from Martin Law (from the web). 1 Outline History of support vector machines (SVM) Two classes,
More informationVisual Representations for Machine Learning
Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering
More informationSecond Order SMO Improves SVM Online and Active Learning
Second Order SMO Improves SVM Online and Active Learning Tobias Glasmachers and Christian Igel Institut für Neuroinformatik, Ruhr-Universität Bochum 4478 Bochum, Germany Abstract Iterative learning algorithms
More informationTransductive and Inductive Methods for Approximate Gaussian Process Regression
Transductive and Inductive Methods for Approximate Gaussian Process Regression Anton Schwaighofer 1,2 1 TU Graz, Institute for Theoretical Computer Science Inffeldgasse 16b, 8010 Graz, Austria http://www.igi.tugraz.at/aschwaig
More informationLecture 27: Fast Laplacian Solvers
Lecture 27: Fast Laplacian Solvers Scribed by Eric Lee, Eston Schweickart, Chengrun Yang November 21, 2017 1 How Fast Laplacian Solvers Work We want to solve Lx = b with L being a Laplacian matrix. Recall
More informationContents. I The Basic Framework for Stationary Problems 1
page v Preface xiii I The Basic Framework for Stationary Problems 1 1 Some model PDEs 3 1.1 Laplace s equation; elliptic BVPs... 3 1.1.1 Physical experiments modeled by Laplace s equation... 5 1.2 Other
More informationAutoencoder Using Kernel Method
2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC) Banff Center, Banff, Canada, October 5-8, 2017 Autoencoder Using Kernel Method Yan Pei Computer Science Division University of
More informationSupport Vector Machines
Support Vector Machines About the Name... A Support Vector A training sample used to define classification boundaries in SVMs located near class boundaries Support Vector Machines Binary classifiers whose
More informationK-Means and Gaussian Mixture Models
K-Means and Gaussian Mixture Models David Rosenberg New York University June 15, 2015 David Rosenberg (New York University) DS-GA 1003 June 15, 2015 1 / 43 K-Means Clustering Example: Old Faithful Geyser
More informationKernel spectral clustering: model representations, sparsity and out-of-sample extensions
Kernel spectral clustering: model representations, sparsity and out-of-sample extensions Johan Suykens and Carlos Alzate K.U. Leuven, ESAT-SCD/SISTA Kasteelpark Arenberg B-3 Leuven (Heverlee), Belgium
More informationClustering. Subhransu Maji. CMPSCI 689: Machine Learning. 2 April April 2015
Clustering Subhransu Maji CMPSCI 689: Machine Learning 2 April 2015 7 April 2015 So far in the course Supervised learning: learning with a teacher You had training data which was (feature, label) pairs
More informationCS8803: Statistical Techniques in Robotics Byron Boots. Predicting With Hilbert Space Embeddings
CS8803: Statistical Techniques in Robotics Byron Boots Predicting With Hilbert Space Embeddings 1 HSE: Gram/Kernel Matrices bc YX = 1 N bc XX = 1 N NX (y i )'(x i ) > = 1 N Y > X 2 R 1 1 i=1 NX i=1 '(x
More informationConvex or non-convex: which is better?
Sparsity Amplified Ivan Selesnick Electrical and Computer Engineering Tandon School of Engineering New York University Brooklyn, New York March 7 / 4 Convex or non-convex: which is better? (for sparse-regularized
More information4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.
1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when
More informationConvex Optimization CMU-10725
Convex Optimization CMU-10725 Conjugate Direction Methods Barnabás Póczos & Ryan Tibshirani Conjugate Direction Methods 2 Books to Read David G. Luenberger, Yinyu Ye: Linear and Nonlinear Programming Nesterov:
More informationmodel order p weights The solution to this optimization problem is obtained by solving the linear system
CS 189 Introduction to Machine Learning Fall 2017 Note 3 1 Regression and hyperparameters Recall the supervised regression setting in which we attempt to learn a mapping f : R d R from labeled examples
More informationLab 2: Support vector machines
Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2
More informationChap.12 Kernel methods [Book, Chap.7]
Chap.12 Kernel methods [Book, Chap.7] Neural network methods became popular in the mid to late 1980s, but by the mid to late 1990s, kernel methods have also become popular in machine learning. The first
More informationStanford University. A Distributed Solver for Kernalized SVM
Stanford University CME 323 Final Project A Distributed Solver for Kernalized SVM Haoming Li Bangzheng He haoming@stanford.edu bzhe@stanford.edu GitHub Repository https://github.com/cme323project/spark_kernel_svm.git
More informationModule 4. Non-linear machine learning econometrics: Support Vector Machine
Module 4. Non-linear machine learning econometrics: Support Vector Machine THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Introduction When the assumption of linearity
More information