Fast and Scalable Polynomial Kernels via Explicit Feature Maps

1 Fast and Scalable Polynomial Kernels via Explicit Feature Maps
Ninh Pham, IT University of Copenhagen
Rasmus Pagh, IT University of Copenhagen

2 Outline
Nonlinear versus Linear SVMs
Tensor Sketching: Tensor Product, Count Sketches, Tensor Sketches
Experiments
Conclusions and Future Work

3 Nonlinear SVMs
Support Vector Machine: constructs hyperplane classifiers in a high- or infinite-dimensional space by using the kernel trick.
Kernel trick: an implicit nonlinear mapping from the original data space into a high-dimensional feature space, in the expectation that the data gain linear structure there.
Scalability is the bottleneck: $O(dn^3)$ time and $O(n^2)$ space with $n$ training points in $d$-dimensional Euclidean space.
[Figure: data space mapped to feature space]

4 Linear SVMs
Rich data spaces are almost linearly separable (e.g. document classification).
Fast linear SVM solvers (SVMperf, Pegasos, LIBLINEAR) train in $O(dn)$ time.
[Figure: separating hyperplane and support vectors in the data space]

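5 From Linear SVMs to Nonlinear SVMs
Random feature mapping $f$ [RR 07] from the original data space to a randomized feature space such that
$E[\langle f(x), f(y) \rangle] = \langle \phi(x), \phi(y) \rangle = \kappa(x, y)$.
[Figure: the data space is mapped implicitly to the feature space (cost $O(dn^3)$) and explicitly, via $f$, to the randomized feature space (cost $O(dn)$)]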

6 Some Random Feature Mappings
Random Fourier features [RR 07] for the Gaussian kernel:
$f_{RR}(w, x): x \mapsto f_{RR}(x) = (\cos\langle w, x \rangle, \sin\langle w, x \rangle)$
Random Maclaurin features [KK 12] for the polynomial kernel:
$f_{KK}(w_1, \dots, w_p, x): x \mapsto f_{KK}(x) = \prod_{i=1}^{p} \langle w_i, x \rangle$
Complexity: $O(dDn)$ time and $O(dD)$ space for $D$ random features.
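To make the Random Maclaurin feature concrete, here is a minimal sketch that checks its unbiasedness for the homogeneous kernel $\langle x, y \rangle^p$ on a tiny example. It assumes the vectors $w_i$ are drawn uniformly from $\{-1, +1\}^d$ (as in [KK 12]; the slide does not state the distribution), and it averages over every possible draw rather than sampling. The helper names are hypothetical.

```python
from itertools import product


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def maclaurin_feature(ws, x):
    """One Random Maclaurin feature: the product of <w_i, x> over the p vectors in ws."""
    out = 1
    for w in ws:
        out *= dot(w, x)
    return out


def expected_feature_product(x, y, p):
    """Average f(x) * f(y) over every choice of p sign vectors in {-1, +1}^d.

    Assumption: each w_i is uniform on {-1, +1}^d, so enumerating all
    choices gives the exact expectation (only feasible for tiny d and p).
    """
    sign_vectors = list(product((-1, 1), repeat=len(x)))
    total = 0
    count = 0
    for ws in product(sign_vectors, repeat=p):
        total += maclaurin_feature(ws, x) * maclaurin_feature(ws, y)
        count += 1
    return total / count


x = (1, 2, -1)
y = (3, 1, 2)
p = 2
exact_kernel = dot(x, y) ** p          # <x, y>^p
estimate = expected_feature_product(x, y, p)
```

In practice one would average $D$ independently sampled features instead of enumerating all of them; each feature costs $O(dp)$ to evaluate, which is where the $O(dDn)$ construction cost comes from.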

7 Motivation and Contribution
Pipeline: data space, then feature construction in $O(dDn)$, then randomized feature space, then linear SVM training in $O(Dn)$.
Feature construction dominates linear SVM training in the case $D = O(d)$.
Tensor Sketch speeds up feature construction for polynomial kernels by a factor of $d / \log D$.
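An informal way to see where a factor of $d / \log D$ can come from, treating the degree $p$ as a constant and using the Tensor Sketching running time $O(np(d + D \log D))$ quoted on the code slide later in this deck. The symbols $T_{RM}$ and $T_{TS}$ are introduced here only for illustration, and the ratio of asymptotic bounds is a heuristic, not a formal statement.

```latex
\frac{T_{\mathrm{RM}}}{T_{\mathrm{TS}}}
  \approx \frac{dDn}{n\,(d + D\log D)}
  \approx \frac{dD}{D\log D}
  = \frac{d}{\log D}
  \qquad \text{when } D\log D \ge d .
```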

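8 Tensor Sketching: An Overview
[Diagram: the Tensor Product maps the data space to the polynomial feature space; Count Sketches (sparse random projection) map the polynomial feature space to the randomized feature space; Tensor Sketches map the data space to the randomized feature space directly]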

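10 Tensor Sketching: An Overview
[Diagram: two routes from a data point $x \in \mathbb{R}^d$ to its Tensor Sketch $Cx^{(p)} \in \mathbb{R}^D$. Direct route: form the tensor product $x^{(p)} \in \mathbb{R}^{d^p}$, then take its Count Sketch. Fast route: compute $p$ Count Sketches $C^{(1)}x, \dots, C^{(p)}x \in \mathbb{R}^D$ and combine them by convolution.]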

11 Tensor Product
2-level tensor product (outer product): $x^{(2)} = x \otimes x = (x_i x_j)_{i,j=1}^{d} \in \mathbb{R}^{d^2}$.
p-level tensor product: $x^{(p)} = x \otimes \cdots \otimes x$ ($p$ times) $\in \mathbb{R}^{d^p}$.
The tensor product is an explicit feature map for the polynomial kernel: $\langle x^{(p)}, y^{(p)} \rangle = \langle x, y \rangle^p$.
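The identity $\langle x^{(p)}, y^{(p)} \rangle = \langle x, y \rangle^p$ can be checked directly on a small example. The sketch below builds the flattened $p$-level tensor product explicitly, which takes $d^p$ entries and is only practical for tiny inputs; the function names are hypothetical.

```python
from itertools import product


def tensor_power(x, p):
    """Flattened p-level tensor product: one entry per index tuple (i_1, ..., i_p)."""
    out = []
    for idx in product(range(len(x)), repeat=p):
        entry = 1
        for i in idx:
            entry *= x[i]
        out.append(entry)
    return out


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


x = [1, 2, -1]
y = [3, 1, 2]
p = 3
lhs = dot(tensor_power(x, p), tensor_power(y, p))  # inner product in R^(d^p)
rhs = dot(x, y) ** p                               # polynomial kernel <x, y>^p
```

The exponential size of `tensor_power(x, p)` is the reason the explicit map is sketched rather than materialised.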

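12 Count Sketches
Definition: given hash functions $h: [d] \to [D]$ and $s: [d] \to \{+1, -1\}$, the Count Sketch of a point $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ is denoted by $Cx = ((Cx)_1, \dots, (Cx)_D) \in \mathbb{R}^D$, where $(Cx)_j = \sum_{i : h(i) = j} s(i) x_i$.
Example: [figure showing hash functions $h$ and $s$ for $d = 4$; values not recoverable]
Properties: $E[\langle Cx, Cy \rangle] = \langle x, y \rangle$ and $Var[\langle Cx, Cy \rangle] \le \frac{1}{D} \left( \langle x, y \rangle^2 + \|x\|^2 \|y\|^2 \right)$.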
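A minimal sketch of the Count Sketch definition, with two simplifications that are assumptions of this example rather than part of the slide: indices are 0-based, and the hash functions are stored as plain lookup tables instead of being drawn from a 2-wise independent family. The second half checks unbiasedness exactly by averaging over every possible pair $(h, s)$ for a tiny $d$ and $D$.

```python
from itertools import product


def count_sketch(x, h, s, D):
    """Count Sketch of x: bucket j accumulates s[i] * x[i] over all i with h[i] == j."""
    out = [0] * D
    for i, xi in enumerate(x):
        out[h[i]] += s[i] * xi
    return out


def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# A fixed example with d = 4 and D = 3 (values chosen for illustration only).
sketch = count_sketch([5, 3, 2, 7], h=[0, 2, 0, 1], s=[1, -1, -1, 1], D=3)


def mean_sketch_product(x, y, D):
    """Average <Cx, Cy> over every h: [d] -> [D] and every s: [d] -> {-1, +1}."""
    d = len(x)
    total = 0
    count = 0
    for h in product(range(D), repeat=d):
        for s in product((-1, 1), repeat=d):
            total += dot(count_sketch(x, h, s, D), count_sketch(y, h, s, D))
            count += 1
    return total / count


x = [1, 2, -1]
y = [3, 1, 2]
mean = mean_sketch_product(x, y, D=2)
```

Coordinates 0 and 2 of the fixed example collide in bucket 0 with opposite signs, which is how the sign hash keeps collisions from biasing the estimate.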

13 Convolution of Count Sketches
Observations on the outer product domain [Pagh 12]. View count sketching as a transformation into polynomials:
$P_1(\omega) = \sum_{i=1}^{d} s_1(i) x_i \omega^{h_1(i)}$ with hash functions $h_1$ and $s_1$,
$P_2(\omega) = \sum_{i=1}^{d} s_2(i) x_i \omega^{h_2(i)}$ with hash functions $h_2$ and $s_2$.
The product $P(\omega) = P_1(\omega) P_2(\omega)$, with exponents reduced mod $D$, is a Count Sketch of the outer product with hash functions $H(i, j) = (h_1(i) + h_2(j)) \bmod D$ and $S(i, j) = s_1(i) s_2(j)$. It can be computed as
$P(\omega) = FFT^{-1}(FFT(P_1(\omega)) * FFT(P_2(\omega)))$,
where $*$ is the component-wise product.
$P(\omega)$ can be seen as an explicit random feature mapping (random projection) for the degree-2 polynomial kernel (polynomial feature space).
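The claim that combining two Count Sketches gives a Count Sketch of the outer product can be verified exactly on integers. The sketch below uses a direct $O(D^2)$ circular convolution in place of the FFT (an assumption made to keep the arithmetic exact), 0-based indices, and lookup-table hash functions; the function names are hypothetical.

```python
import random


def count_sketch(x, h, s, D):
    """Count Sketch of x: bucket j accumulates s[i] * x[i] over all i with h[i] == j."""
    out = [0] * D
    for i, xi in enumerate(x):
        out[h[i]] += s[i] * xi
    return out


def circular_convolution(a, b):
    """Coefficients of a(w) * b(w) with exponents reduced mod D, where D = len(a)."""
    D = len(a)
    out = [0] * D
    for j, aj in enumerate(a):
        for k, bk in enumerate(b):
            out[(j + k) % D] += aj * bk
    return out


def sketch_of_outer_product(x, h1, s1, h2, s2, D):
    """Count Sketch of the outer product x x^T using H(i, j) and S(i, j) directly."""
    out = [0] * D
    for i, xi in enumerate(x):
        for j, xj in enumerate(x):
            out[(h1[i] + h2[j]) % D] += s1[i] * s2[j] * xi * xj
    return out


rng = random.Random(0)
d, D = 5, 4
x = [rng.randint(-3, 3) for _ in range(d)]
h1 = [rng.randrange(D) for _ in range(d)]
h2 = [rng.randrange(D) for _ in range(d)]
s1 = [rng.choice((-1, 1)) for _ in range(d)]
s2 = [rng.choice((-1, 1)) for _ in range(d)]

via_convolution = circular_convolution(
    count_sketch(x, h1, s1, D), count_sketch(x, h2, s2, D)
)
direct = sketch_of_outer_product(x, h1, s1, h2, s2, D)
```

The direct route touches all $d^2$ entries of the outer product, while the convolution route only needs the two length-$D$ sketches; replacing the quadratic convolution with FFTs brings its cost down to $O(D \log D)$.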

14 Convolution of Count Sketches
Generalization to the tensor product domain: $P(\omega)$ is a Count Sketch of the p-level tensor product with 2-wise independent hash functions
$H(i_1, \dots, i_p) = (h_1(i_1) + \cdots + h_p(i_p)) \bmod D$ and $S(i_1, \dots, i_p) = s_1(i_1) \cdots s_p(i_p)$.
Fast computation: $P(\omega) = FFT^{-1}(FFT(P_1(\omega)) * \cdots * FFT(P_p(\omega)))$, where $*$ is the component-wise product.
$P(\omega)$ can be seen as an explicit random feature mapping (random projection) for degree-p polynomial kernels (polynomial feature space).

15 Tensor Sketching in MATLAB

    function TENSOR_SKETCH = TensorSketch(DATA, p, D)
        [n, d] = size(DATA);                          % Data information
        indexhash = randi(D, p, d);                   % Hash functions h_k: [d] -> [D]
        bithash = double(randi(2, p, d) - 1.5) * 2;   % Hash functions s_k: [d] -> {-1, +1}
        TENSOR_SKETCH = zeros(n, D);                  % Initialize Tensor Sketches
        for Xi = 1 : n                                % Each point Xi
            temp = DATA(Xi, :);                       % Coordinates of point Xi
            P = zeros(p, D);                          % Polynomials correspond to different Count Sketches
            for Xij = 1 : d                           % Each coordinate Xij of point Xi
                for k = 1 : p                         % Each polynomial / Count Sketch
                    ihashindex = indexhash(k, Xij);
                    ihashbit = bithash(k, Xij);
                    P(k, ihashindex) = P(k, ihashindex) + ihashbit * temp(Xij);
                end
            end
            P = fft(P, [], 2);                        % FFT of each Count Sketch (row-wise)
            temp = prod(P, 1);                        % Component-wise product
            TENSOR_SKETCH(Xi, :) = real(ifft(temp));  % Inverse FFT; drop round-off imaginary parts
        end

Tensor Sketching runs in $O(np(d + D \log D))$ time.
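For readers without MATLAB, the same pipeline can be sketched in standard-library Python and checked against the definition. This is a minimal illustration under stated assumptions, not the authors' implementation: the FFT is a small recursive radix-2 routine (so $D$ must be a power of two), indices are 0-based, hash functions are lookup tables, and `direct_sketch` is a hypothetical reference that applies the combined hashes $H$ and $S$ to the explicit tensor product.

```python
import cmath
import random
from itertools import product


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(a):
    """Inverse FFT: conjugate twiddles, then divide by the length."""
    n = len(a)
    return [v / n for v in fft(a, invert=True)]


def count_sketch(x, h, s, D):
    """Count Sketch of x: bucket j accumulates s[i] * x[i] over all i with h[i] == j."""
    out = [0.0] * D
    for i, xi in enumerate(x):
        out[h[i]] += s[i] * xi
    return out


def tensor_sketch(x, hs, ss, D):
    """Sketch of the p-level tensor product of x, with p = len(hs)."""
    spectrum = [1 + 0j] * D
    for h, s in zip(hs, ss):
        transformed = fft(count_sketch(x, h, s, D))
        spectrum = [a * b for a, b in zip(spectrum, transformed)]  # component-wise product
    return [v.real for v in ifft(spectrum)]


def direct_sketch(x, hs, ss, D):
    """Reference: Count Sketch of the explicit tensor product with combined hashes H and S."""
    out = [0.0] * D
    for idx in product(range(len(x)), repeat=len(hs)):
        bucket, value = 0, 1.0
        for k, i in enumerate(idx):
            bucket += hs[k][i]
            value *= ss[k][i] * x[i]
        out[bucket % D] += value
    return out


rng = random.Random(1)
d, D, p = 4, 8, 3
x = [rng.uniform(-1, 1) for _ in range(d)]
hs = [[rng.randrange(D) for _ in range(d)] for _ in range(p)]
ss = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(p)]

fast = tensor_sketch(x, hs, ss, D)   # O(p (d + D log D)) per point
slow = direct_sketch(x, hs, ss, D)   # O(d^p) per point
max_gap = max(abs(a - b) for a, b in zip(fast, slow))
```

The two routes agree up to floating-point round-off, which is the content of the diagram on the overview slide: the fast route never forms the $d^p$-dimensional tensor product.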

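16 Error Analysis
Relative error bound: a tail bound on the deviation of $\langle Cx^{(p)}, Cy^{(p)} \rangle$ from $\langle x, y \rangle^p$, measured relative to $\langle x, y \rangle^p$; the bound shrinks with $D$ and depends on $\cos \theta_{xy}$. [formula not recoverable]
Absolute error bound: an exponential tail bound of the form $\Pr[\dots] \le 2\exp(\dots)$ on the deviation of $\langle Cx^{(p)}, Cy^{(p)} \rangle$ from $\langle x, y \rangle^p$, in terms of $D$, $R$ and $p$. [formula not recoverable]
Normalization preservation: [formula truncated in the source]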

17 Experiments
Random feature construction time: comparison of CPU time (s) between the Tensor Sketching (TS) and Random Maclaurin (RM) [KK 12] approaches on two datasets, Adult (d = 123) and Mnist (d = 780), using the kernel $(1 + \langle x, y \rangle)^4$.

18 Experiments
Accuracy of kernel approximation: comparison of relative errors between the Tensor Sketching (TS) and Random Maclaurin (RM) [KK 12] estimators on the Adult dataset (d = 123) using different polynomial kernels.

19 Experiments
Accuracy of classification on the Adult and Mnist datasets. Kernels: $\langle x, y \rangle^2$ and $(1 + \langle x, y \rangle)^2$.

20 Experiments
Training time and accuracy of classification: comparison between a linear SVM solver (LIBLINEAR) with Tensor Sketch (TS) or Random Maclaurin (RM) features and a nonlinear SVM solver (LIBSVM) on two datasets: Mnist (d = 780, D = 1,000) and Adult (d = 123, D = 200).
[Result panels: MNIST, Adult]

21 Conclusions and Future Work
Tensor Sketching: a fast and scalable random feature mapping for polynomial kernels.
Theoretical error analysis.
Experimental results on accuracy, effectiveness and efficiency on large-scale real-world datasets.
Future work: applying Tensor Sketching to dot product kernels (e.g. the Gaussian kernel) by exploiting their Taylor series approximations.

Fast and Scalable Polynomial Kernels via Explicit Feature Maps * Ninh Pham IT University of Copenhagen Copenhagen, Denmark ndap@itu.dk Rasmus Pagh IT University of Copenhagen Copenhagen, Denmark pagh@itu.dk