random fourier features for kernel ridge regression: approximation bounds and statistical guarantees
Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh
Tel Aviv University, EPFL, and MIT. ICML
overview

Our Contributions:
- Analyze the random Fourier features method (Rahimi & Recht '07) for kernel approximation using leverage score-based techniques.
- Concrete: introduce a new sampling distribution that gives statistical guarantees for kernel ridge regression when used to approximate the Gaussian kernel.
- High level: we hope that Fourier leverage scores will have further applications in kernel approximation, function approximation, and sparse Fourier transform methods.
kernel approximation

Kernel methods are expensive.
- Even writing down the kernel matrix $K$ requires $\Omega(n^2)$ time.
- Other operations require even more: a single iteration of a linear system solver takes $\Omega(n^2)$ time.
- For $n = 100{,}000$, $K$ has 10 billion entries, which takes 80 GB of storage if each entry is a double.
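The storage figure above is simple arithmetic; a quick check:

```python
n = 100_000
entries = n ** 2               # number of entries in the dense kernel matrix K
bytes_total = entries * 8      # 8 bytes per double-precision entry
print(entries)                 # 10_000_000_000 entries
print(bytes_total / 1e9)       # 80.0 GB
```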
solution: low-rank approximation

Employ the classic solution: a low-rank approximation $K \approx ZZ^T$ with a tall-and-thin factor $Z$ with $n$ rows and $s \ll n$ columns.
- Storing $Z$ uses $O(ns)$ space and computing $ZZ^T x$ takes $O(ns)$ time.
- Orthogonalization, eigendecomposition, and pseudo-inversion of $ZZ^T$ all take just $O(ns^2)$ time.
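A minimal sketch of why the factored form is cheap (the factor $Z$ here is an arbitrary random matrix, purely illustrative): applying $ZZ^T$ to a vector as $Z(Z^T x)$ uses two thin matrix-vector products in $O(ns)$ time and never forms the $n \times n$ matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 1000, 20
Z = rng.standard_normal((n, s))   # low-rank factor: K is approximated by Z Z^T
x = rng.standard_normal(n)

# O(ns) route: two thin matrix-vector products; no n x n matrix is formed.
y_fast = Z @ (Z.T @ x)

# O(n^2) route, for comparison only.
y_slow = (Z @ Z.T) @ x

print(np.allclose(y_fast, y_slow))  # True
```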
efficient low-rank approximation?

Low-rank approximation is itself an expensive task.
- Optimal low-rank approximation via a direct eigendecomposition, or even approximation via Krylov subspace methods, is out of the question, since these at least require fully forming $K$.
- Many faster methods have been studied: incomplete Cholesky factorization (Fine & Scheinberg '02, Bach & Jordan '02), entrywise sampling (Achlioptas, McSherry, & Schölkopf '01), Nyström approximation (Williams & Seeger '01), and random Fourier features (Rahimi & Recht '07).
random fourier features

Rahimi & Recht, NIPS '07:
- For any shift-invariant kernel $k(x_i, x_j) = k(x_i - x_j)$, let $p(\cdot)$ be the Fourier transform of $k(\cdot)$. By Bochner's theorem, $p(\eta) \geq 0$ for all $\eta$.
- Sample $\eta_1, \ldots, \eta_s \in \mathbb{R}^d$ with probabilities proportional to $p(\eta)$.
- Set $z_i = \frac{1}{\sqrt{s}}\left[e^{2\pi i \eta_1^T x_i}, \ldots, e^{2\pi i \eta_s^T x_i}\right]$.
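A hedged sketch of this scheme for the Gaussian kernel $k(z) = e^{-\|z\|^2/2}$, whose Fourier transform is itself Gaussian, so sampling with density proportional to $p(\eta)$ amounts to drawing $\eta \sim N(0, I_d)$ (we absorb the $2\pi$ factor into the frequency variable; all names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 5, 2, 40_000
X = rng.standard_normal((n, d))

# Exact Gaussian kernel K_ij = exp(-||x_i - x_j||^2 / 2).
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-D2 / 2)

# For this kernel, frequencies eta_1, ..., eta_s ~ N(0, I_d): the characteristic
# function of a standard Gaussian is exactly exp(-||z||^2 / 2).
etas = rng.standard_normal((s, d))

# Row i of Z is z_i = (1/sqrt(s)) [e^{i eta_1^T x_i}, ..., e^{i eta_s^T x_i}].
Z = np.exp(1j * X @ etas.T) / np.sqrt(s)

err = np.abs(Z @ Z.conj().T - K).max()
print(err)   # small, shrinking like 1/sqrt(s)
```

Each entry of $ZZ^*$ is an average of $s$ i.i.d. unbiased estimates of the corresponding kernel value, which is exactly the matrix-sampling view developed on the next slides.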
view as matrix sampling method

The Fourier transform $k(z) = \int_{\eta \in \mathbb{R}^d} p(\eta)\, e^{2\pi i \eta^T z}\, d\eta$ gives:

$\int_\eta \Phi_i(\eta)\, p(\eta)\, \Phi_j(\eta)^*\, d\eta = \int_\eta e^{2\pi i \eta^T (x_i - x_j)}\, p(\eta)\, d\eta = k(x_i - x_j) = K_{i,j},$

where $\Phi$ has entries $\Phi_i(\eta) = e^{2\pi i \eta^T x_i}$. Set $\bar\Phi = \Phi P^{1/2}$, with $P$ the diagonal operator with weights $p(\eta)$. So $K = \bar\Phi \bar\Phi^*$.
view as matrix sampling method

Sample $s$ scaled columns of $\bar\Phi$: set $Z^{(j)} = \frac{1}{\sqrt{s\, p(\eta)}}\, \bar\Phi(\eta)$ with probability $p(\eta)$. So $\mathbb{E}[ZZ^T] = K$.

This recovers $z_i = \frac{1}{\sqrt{s}}\left[e^{2\pi i \eta_1^T x_i}, \ldots, e^{2\pi i \eta_s^T x_i}\right]$ for $\eta_1, \ldots, \eta_s$ sampled according to $p(\eta)$.
view as matrix sampling method

$Z$ is a column sample of $\bar\Phi = \Phi P^{1/2}$. Columns are sampled with probability $p(\eta)$, i.e., proportional to their squared column norms.

It is well known from work on randomized methods in linear algebra that there are better sampling probabilities (in both theory and practice): the column leverage scores. This was also noted by Bach '17, and is implicit in Rudi et al.
comparison of approximation guarantees

Column norm sampling: $s = \tilde{O}(d/\epsilon^2)$ samples ensure that $(ZZ^T)_{i,j} = K_{i,j} \pm \epsilon$ for all $i, j$ with high probability [RR07].

Ridge leverage score sampling: $s = \tilde{O}(s_\lambda/\epsilon^2)$ samples give a spectral approximation:

$(1-\epsilon)(ZZ^T + \lambda I) \preceq K + \lambda I \preceq (1+\epsilon)(ZZ^T + \lambda I),$

where $s_\lambda = \operatorname{tr}(K(K+\lambda I)^{-1})$ is the statistical dimension.

Spectral approximation gives statistical guarantees for kernel ridge regression (this work), and approximation bounds for kernel PCA and $k$-means clustering (Cohen, Musco, Musco '16, '17).
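The statistical dimension $s_\lambda = \operatorname{tr}(K(K+\lambda I)^{-1})$ is easy to compute from the eigenvalues of $K$; a small illustration (the identity kernel matrix here is arbitrary, chosen so the answer is checkable by hand):

```python
import numpy as np

def statistical_dimension(K, lam):
    """s_lambda = tr(K (K + lam I)^{-1}) = sum_i mu_i / (mu_i + lam),
    where mu_i are the eigenvalues of the PSD matrix K."""
    mu = np.linalg.eigvalsh(K)
    return float((mu / (mu + lam)).sum())

# Sanity check: for K = I_n every eigenvalue is 1, so s_lambda = n / (1 + lam).
print(statistical_dimension(np.eye(4), 1.0))  # 2.0
```

Note that $s_\lambda \le n$ always, and it shrinks as $\lambda$ grows, which is why the $\tilde{O}(s_\lambda/\epsilon^2)$ sample bound can be far below $n$.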
how to leverage score sample

The ridge leverage score for frequency $\eta$ is given by:

$\tau_\lambda(\eta) = \Phi(\eta)^T (K + \lambda I)^{-1} \Phi(\eta).$

It is expensive to invert $K + \lambda I$. But even if you could do that efficiently, it is not at all clear that you could efficiently sample from the leverage score distribution.
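On a discretized frequency grid the scores can be computed directly, which also illustrates the identity that the scores sum to the statistical dimension. This is our own toy discretization for illustration, not the paper's algorithm; we use the 1-d self-dual Gaussian kernel $k(z) = e^{-\pi z^2}$, whose Fourier transform is $p(\eta) = e^{-\pi \eta^2}$.

```python
import numpy as np

x = np.array([0.0, 0.3, 0.7])            # 1-d data points
lam = 0.1

# Discretized frequency grid and kernel Fourier transform p(eta).
etas = np.linspace(-6, 6, 1201)
d_eta = etas[1] - etas[0]
p = np.exp(-np.pi * etas ** 2)

# Weighted feature matrix: column l is sqrt(p(eta_l) d_eta) * e^{2 pi i eta_l x}.
Phi_bar = np.sqrt(p * d_eta) * np.exp(2j * np.pi * np.outer(x, etas))
K = Phi_bar @ Phi_bar.conj().T           # discretized kernel matrix

# tau_lambda(eta_l) = Phi_bar(eta_l)^* (K + lam I)^{-1} Phi_bar(eta_l) per column.
M = np.linalg.solve(K + lam * np.eye(len(x)), Phi_bar)
tau = np.real(np.sum(Phi_bar.conj() * M, axis=0))

s_lam = np.real(np.trace(np.linalg.solve(K + lam * np.eye(len(x)), K)))
print(np.allclose(tau.sum(), s_lam))     # scores sum to the statistical dimension
```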
bounding the fourier leverage scores

Main goal: get a handle on the Fourier ridge leverage scores for common kernels by upper bounding them with simple distributions.

1. Improve random Fourier features.
2. Bound the statistical dimension by the sum of the leverage scores.
3. Connections with sparse Fourier transforms, Fourier interpolation, and other problems.
alternative leverage score characterization

The ridge leverage score $\tau_\lambda(\eta) = \Phi(\eta)^T (K + \lambda I)^{-1} \Phi(\eta)$ also equals:

$\tau_\lambda(\eta) = \min_y\; \lambda^{-1} \left\|\Phi y - \Phi(\eta)\right\|_2^2 + \|y\|_2^2.$

- Intuition: $\tau_\lambda(\eta)$ is small iff there exists a function $y : \mathbb{R}^d \to \mathbb{C}$ with low energy ($\|y\|_2^2$ small) whose ($\sqrt{p(\eta)}$-weighted) Fourier transform is close to the frequency $e^{2\pi i x_j^T \eta}$ at each data point.
- $y$ reconstructs frequency $\eta$ from other frequencies. The easier it is to reconstruct, the less important it is to sample.
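The two characterizations can be checked numerically on a discretized frequency grid, since the variational problem is a ridge regression with closed-form minimizer $y^* = (A^*A + \lambda I)^{-1} A^* b$. The discretization and the self-dual Gaussian kernel $k(z) = e^{-\pi z^2}$ below are our own toy choices:

```python
import numpy as np

x = np.array([0.0, 0.3, 0.7])            # 1-d data
lam = 0.1
etas = np.linspace(-6, 6, 401)           # discretized frequencies
d_eta = etas[1] - etas[0]
p = np.exp(-np.pi * etas ** 2)           # Fourier transform of k(z) = e^{-pi z^2}

A = np.sqrt(p * d_eta) * np.exp(2j * np.pi * np.outer(x, etas))  # weighted Phi (n x m)
K = A @ A.conj().T
b = A[:, 200]                            # weighted column at eta = 0

# Quadratic-form definition: tau = b^* (K + lam I)^{-1} b.
tau_qf = np.real(b.conj() @ np.linalg.solve(K + lam * np.eye(3), b))

# Variational definition: minimize lam^{-1} ||A y - b||^2 + ||y||^2,
# whose minimizer is y* = (A^* A + lam I)^{-1} A^* b.
m = A.shape[1]
y = np.linalg.solve(A.conj().T @ A + lam * np.eye(m), A.conj().T @ b)
resid = A @ y - b
tau_var = np.real(resid.conj() @ resid) / lam + np.real(y.conj() @ y)

print(np.isclose(tau_qf, tau_var))       # True: the two definitions agree
```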
frequency reconstruction for bounded data

Assume the data points are 1-dimensional and bounded: $x_1, \ldots, x_n \in [-\delta, \delta]$.
- One possibility is to choose $y$ whose ($\sqrt{p(\eta)}$-weighted) Fourier transform equals $e^{2\pi i x \eta}$ for all $x \in [-\delta, \delta]$.
- Achieved by the shifted sinc function weighted by $1/\sqrt{p(\eta)}$.

[Figure: the target function for bounding the leverage score $\tau_\lambda(\eta)$ — a pure wave with frequency $\eta$ (real component shown) — and a candidate test function $\mathrm{sinc}(2\delta[\xi - \eta])$ that reconstructs the target.]
improved test function

Unfortunately, the sinc function falls off too slowly.
- For the Gaussian kernel, the $\frac{1}{\sqrt{p(\eta)}} = e^{\eta^2/4}$ weighting grows faster than $\mathrm{sinc}(2\delta\eta) = \frac{\sin(2\delta\eta)}{\eta}$ falls off. So $\|y\|_2$ is unbounded.
- Solution: dampen the sinc by multiplying it with a Gaussian, which keeps the Fourier transform nearly identical.

[Figure: a smoothed approximation to the target function — the sinc and Gaussian components, and the dampened test function $g_\eta(\xi) = y_\eta(\xi)\sqrt{p(\xi)}$, which has lower energy.]
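A quick numerical illustration of this failure and its fix, using the slide's weighting $e^{\xi^2/4}$ (the grid and damping width are our own illustrative choices): the inverse-weighted sinc blows up at large frequencies, while multiplying the sinc by a Gaussian makes the weighted test function decay.

```python
import numpy as np

delta = 1.0
xi = np.linspace(0.1, 40, 4000)                   # frequency grid (avoiding xi = 0)
inv_weight = np.exp(xi ** 2 / 4)                  # 1/sqrt(p(xi)) for the Gaussian kernel
sinc = np.sin(2 * delta * xi) / xi                # the slide's sinc(2 delta xi)

plain = np.abs(sinc * inv_weight)                 # undampened: sinc alone
damped = np.abs(sinc * np.exp(-xi ** 2 / 2) * inv_weight)  # sinc times a Gaussian

# The undampened weighted function grows without bound; the dampened one decays.
print(plain[-1] > plain[0], damped[-1] < 1e-10)
```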
ultimate gaussian kernel leverage score bound

Upshot: it is easy to sample from an approximate leverage score distribution for the Gaussian kernel with $x_1, \ldots, x_n \in [-\delta, \delta]^d$:

$\tau_\lambda(\eta) \lesssim \begin{cases} \tilde{O}(\delta^d) & \text{when } \|\eta\| \leq \sqrt{\log(n/\lambda)}, \\ p(\eta) = e^{-\|\eta\|_2^2/2} & \text{otherwise.} \end{cases}$

[Figure: relative sampling probability as a function of frequency for the ridge leverage function, classical random Fourier features, and the proposed modified random Fourier features.]
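A sketch of sampling from such a capped distribution, in 1-d with the self-dual Gaussian kernel $k(z) = e^{-\pi z^2}$: the proposal mixes a uniform density over low frequencies $[-R, R]$ with the kernel's own Gaussian tail, and importance weights $\sqrt{p(\eta)/(s\,q(\eta))}$ keep the estimator unbiased. The mixture weights and the cutoff $R$ (standing in for the $\sqrt{\log(n/\lambda)}$ threshold) are our own illustrative choices, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([-0.4, 0.0, 0.25, 0.5])      # data in [-delta, delta]
s, R = 40_000, 3.0                        # R plays the role of the frequency cutoff

def p(eta):                               # Fourier transform of k(z) = exp(-pi z^2)
    return np.exp(-np.pi * eta ** 2)

def q(eta):                               # proposal: half uniform on [-R, R], half Gaussian tail
    return 0.5 / (2 * R) * (np.abs(eta) <= R) + 0.5 * p(eta)

coin = rng.random(s) < 0.5
etas = np.where(coin,
                rng.uniform(-R, R, s),                     # low-frequency component
                rng.normal(0, 1 / np.sqrt(2 * np.pi), s))  # density exactly p(eta)

# Importance-weighted features: column j is sqrt(p / (s q)) * e^{2 pi i eta_j x}.
Z = np.exp(2j * np.pi * np.outer(x, etas)) * np.sqrt(p(etas) / (s * q(etas)))

K = np.exp(-np.pi * (x[:, None] - x[None, :]) ** 2)        # exact kernel matrix
print(np.abs(Z @ Z.conj().T - K).max())                    # small sampling error
```

Since $q \ge p/2$ everywhere, the importance weights are bounded by 2, so no single sample can dominate the estimate.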
experimental results

Example of approximating a synthetic "wiggly" function. [Figure: the data, the true function, the exact KRR estimator, and the MRF and CRF estimators.] CRF = classic random Fourier features (column norm sampling); MRF = our modified sampling distribution.
Questions?
Announcements Edge and Corner Detection HW3 assigned CSE252A Lecture 13 Efficient Implementation Both, the Box filter and the Gaussian filter are separable: First convolve each row of input image I with
More informationDiscriminative Clustering for Image Co-segmentation
Discriminative Clustering for Image Co-segmentation Armand Joulin 1,2 Francis Bach 1,2 Jean Ponce 2 1 INRIA 2 Ecole Normale Supérieure January 15, 2010 Introduction Introduction problem: dividing simultaneously
More informationAnalyzing Stochastic Gradient Descent for Some Non- Convex Problems
Analyzing Stochastic Gradient Descent for Some Non- Convex Problems Christopher De Sa Soon at Cornell University cdesa@stanford.edu stanford.edu/~cdesa Kunle Olukotun Christopher Ré Stanford University
More informationLTI Thesis Defense: Riemannian Geometry and Statistical Machine Learning
Outline LTI Thesis Defense: Riemannian Geometry and Statistical Machine Learning Guy Lebanon Motivation Introduction Previous Work Committee:John Lafferty Geoff Gordon, Michael I. Jordan, Larry Wasserman
More informationEdge and corner detection
Edge and corner detection Prof. Stricker Doz. G. Bleser Computer Vision: Object and People Tracking Goals Where is the information in an image? How is an object characterized? How can I find measurements
More informationA Weighted Kernel PCA Approach to Graph-Based Image Segmentation
A Weighted Kernel PCA Approach to Graph-Based Image Segmentation Carlos Alzate Johan A. K. Suykens ESAT-SCD-SISTA Katholieke Universiteit Leuven Leuven, Belgium January 25, 2007 International Conference
More informationProblem Set 4. Assigned: March 23, 2006 Due: April 17, (6.882) Belief Propagation for Segmentation
6.098/6.882 Computational Photography 1 Problem Set 4 Assigned: March 23, 2006 Due: April 17, 2006 Problem 1 (6.882) Belief Propagation for Segmentation In this problem you will set-up a Markov Random
More informationClustering. SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic
Clustering SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic Clustering is one of the fundamental and ubiquitous tasks in exploratory data analysis a first intuition about the
More informationGRAPH LEARNING UNDER SPARSITY PRIORS. Hermina Petric Maretic, Dorina Thanou and Pascal Frossard
GRAPH LEARNING UNDER SPARSITY PRIORS Hermina Petric Maretic, Dorina Thanou and Pascal Frossard Signal Processing Laboratory (LTS4), EPFL, Switzerland ABSTRACT Graph signals offer a very generic and natural
More informationCommunication Efficient Distributed Kernel Principal Component Analysis
Communication Efficient Distributed Kernel Principal Component Analysis Maria-Florina Balcan School of Computer Science Carnegie Mellon University ninamf@cs.cmu.edu David Woodruff Almaden Research Center
More informationOne-class Problems and Outlier Detection. 陶卿 中国科学院自动化研究所
One-class Problems and Outlier Detection 陶卿 Qing.tao@mail.ia.ac.cn 中国科学院自动化研究所 Application-driven Various kinds of detection problems: unexpected conditions in engineering; abnormalities in medical data,
More informationarxiv: v4 [cs.lg] 13 Feb 2016
Communication Efficient Distributed Kernel Principal Component Analysis Maria-Florina Balcan, Yingyu Liang, Le Song, David Woodruff, Bo Xie arxiv:1503.06858v4 [cs.lg] 13 Feb 016 Abstract Kernel Principal
More informationSubsampling for Ridge Regression via Regularized Volume Sampling
Subsampling for Ridge Regression via Regularized Volume Sampling Michał Dereziński Department of Computer Science University of California Santa Cruz mderezin@ucsc.edu Manfred K. Warmuth Department of
More informationCPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017
CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2017 Assignment 3: 2 late days to hand in tonight. Admin Assignment 4: Due Friday of next week. Last Time: MAP Estimation MAP
More informationClustering: Classic Methods and Modern Views
Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering
More informationA Local Learning Approach for Clustering
A Local Learning Approach for Clustering Mingrui Wu, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics 72076 Tübingen, Germany {mingrui.wu, bernhard.schoelkopf}@tuebingen.mpg.de Abstract
More informationParallel Hybrid Monte Carlo Algorithms for Matrix Computations
Parallel Hybrid Monte Carlo Algorithms for Matrix Computations V. Alexandrov 1, E. Atanassov 2, I. Dimov 2, S.Branford 1, A. Thandavan 1 and C. Weihrauch 1 1 Department of Computer Science, University
More informationNumerical Analysis I - Final Exam Matrikelnummer:
Dr. Behrens Center for Mathematical Sciences Technische Universität München Winter Term 2005/2006 Name: Numerical Analysis I - Final Exam Matrikelnummer: I agree to the publication of the results of this
More informationBlind Identification of Graph Filters with Multiple Sparse Inputs
Blind Identification of Graph Filters with Multiple Sparse Inputs Santiago Segarra, Antonio G. Marques, Gonzalo Mateos & Alejandro Ribeiro Dept. of Electrical and Systems Engineering University of Pennsylvania
More informationSeismic data interpolation beyond aliasing using regularized nonstationary autoregression a
Seismic data interpolation beyond aliasing using regularized nonstationary autoregression a a Published in Geophysics, 76, V69-V77, (2011) Yang Liu, Sergey Fomel ABSTRACT Seismic data are often inadequately
More informationAn Empirical Comparison of Sampling Techniques for Matrix Column Subset Selection
An Empirical Comparison of Sampling Techniques for Matrix Column Subset Selection Yining Wang and Aarti Singh Machine Learning Department, Carnegie Mellonn University {yiningwa,aartisingh}@cs.cmu.edu Abstract
More informationGeometric Registration for Deformable Shapes 3.3 Advanced Global Matching
Geometric Registration for Deformable Shapes 3.3 Advanced Global Matching Correlated Correspondences [ASP*04] A Complete Registration System [HAW*08] In this session Advanced Global Matching Some practical
More informationSpectral Clustering X I AO ZE N G + E L HA M TA BA S SI CS E CL A S S P R ESENTATION MA RCH 1 6,
Spectral Clustering XIAO ZENG + ELHAM TABASSI CSE 902 CLASS PRESENTATION MARCH 16, 2017 1 Presentation based on 1. Von Luxburg, Ulrike. "A tutorial on spectral clustering." Statistics and computing 17.4
More informationModern Signal Processing and Sparse Coding
Modern Signal Processing and Sparse Coding School of Electrical and Computer Engineering Georgia Institute of Technology March 22 2011 Reason D etre Modern signal processing Signals without cosines? Sparse
More informationVisual Representations for Machine Learning
Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering
More informationIncremental Learning of Robot Dynamics using Random Features
Incremental Learning of Robot Dynamics using Random Features Arjan Gijsberts, Giorgio Metta Cognitive Humanoids Laboratory Dept. of Robotics, Brain and Cognitive Sciences Italian Institute of Technology
More information