Virtual Vector Machine for Bayesian Online Classification


1 Virtual Vector Machine for Bayesian Online Classification
Thomas P. Minka, Rongjing Xiang, Yuan (Alan) Qi
Appeared in UAI 2009
Presented by Lingbo Li

2 Introduction
Online learning
- Update model parameters and make predictions based on data points received sequentially, with a fixed-size memory buffer
- Online classification: Perceptron
- Online regression: Kalman filter
Previous online learning work
- Natural gradient (Le Roux et al., 2007)
- Online Gaussian process classification (Csató & Opper, 2002)
- Passive-Aggressive algorithm (Crammer et al., 2006)
- Stochastic gradient: Perceptron (Rosenblatt, 1962)
  - Ignores previous data points

3 Virtual Vector Machine preview
A novel Bayesian online learning algorithm
- Summarize all previous samples by representative virtual points and dynamically update them
- Store both a Gaussian distribution and a data cache
  - Gaussian approximation factors
  - Virtual points from the data cache for non-Gaussian factors
    - Summarize multiple real data points
    - Flexible functional forms
    - Stored in a data cache with a user-defined size
Key intuition
- Many data points are not important to classification: prune them
- Many data points are similar: group them

4 Online Binary classification
Let $w$ denote the model parameters, $(x_1, \ldots, x_T)$ denote the data from time $1$ to time $T$, and $f(x_t; w)$ denote the likelihood function for the data point at time $t$.
The posterior at time $T$ is
$$p(w \mid x_1, \ldots, x_T) = \frac{p(w) \prod_{t=1}^{T} f(x_t; w)}{\int p(w) \prod_{t=1}^{T} f(x_t; w)\, dw}$$
where $p(w)$ is the prior distribution: $p(w) \sim \mathcal{N}(0, I)$.
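Because the likelihood factorizes over time steps, the posterior at time $T$ can be written in terms of the posterior at time $T-1$. The short derivation below follows from the posterior formula by standard Bayes-rule manipulation; it is not shown on the slide.

$$p(w \mid x_1, \ldots, x_T) \;\propto\; p(w) \prod_{t=1}^{T} f(x_t; w) \;=\; \Big[ p(w) \prod_{t=1}^{T-1} f(x_t; w) \Big] f(x_T; w) \;\propto\; p(w \mid x_1, \ldots, x_{T-1})\, f(x_T; w)$$

An exact online update would therefore multiply the previous posterior by the newest likelihood factor and renormalize.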

5 Online Binary classification
Flipping noise model
- The likelihood is a step function:
$$f(x; w) = \epsilon\,\big(1 - \Theta(w^T x)\big) + (1 - \epsilon)\,\Theta(w^T x), \qquad \Theta(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases} \tag{1}$$
- Labeling error rate $\epsilon \in [0, 1]$
- $x$: feature vector scaled by $1$ or $-1$ depending on the label
- The posterior distribution of $w$ is piecewise constant over convex polygonal regions (Herbrich et al., 1999)
- Approximate the posterior: $q(w) \sim \mathcal{N}(m_w, V_w)$
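The flipping noise likelihood can be sketched in a few lines of Python. This is a minimal illustration only; the function names `step` and `flip_noise_likelihood` and the example values are hypothetical and not from the slides.

```python
import numpy as np

def step(z):
    """Heaviside step Theta(z): 1 if z > 0, else 0."""
    return 1.0 if z > 0 else 0.0

def flip_noise_likelihood(x, w, eps):
    """f(x; w) = eps * (1 - Theta(w^T x)) + (1 - eps) * Theta(w^T x).

    x is assumed to be already multiplied by its label (+1 or -1).
    """
    t = step(float(np.dot(w, x)))
    return eps * (1.0 - t) + (1.0 - eps) * t

w = np.array([1.0, -1.0])
features, label = np.array([0.5, 2.0]), -1    # hypothetical labeled example
x = label * features                           # fold the label into the feature vector
print(flip_noise_likelihood(x, w, eps=0.1))    # w^T x > 0: correctly classified
print(flip_noise_likelihood(-x, w, eps=0.1))   # flipped label: misclassified
```

A correctly classified point gets likelihood $1 - \epsilon$ and a misclassified one gets $\epsilon$, regardless of its distance from the decision boundary.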

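6 Expectation Propagation (EP) Review
- Posterior distribution:
$$p(\theta \mid D) = \frac{p_0(\theta) \prod_i p(x_i; \theta)}{p(D)} \;\propto\; \prod_{i=0}^{N} t_i(\theta)$$
- Approximation $q(\theta) = \prod_{i=0}^{N} \tilde{t}_i(\theta)$ to minimize $\mathrm{KL}(p \,\|\, q)$
- Fit the approximation by finding
$$\min_{\tilde{t}_i} \; \mathrm{KL}\Big( t_i(\theta) \prod_{j \ne i} \tilde{t}_j(\theta) \,\Big\|\, \tilde{t}_i(\theta) \prod_{j \ne i} \tilde{t}_j(\theta) \Big)$$
- $\tilde{t}_i(\theta)$ should belong to the exponential family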

7 Online Binary classification
Gaussian approximation by EP
- EP approximates the exact posterior distribution by a distribution in an exponential family:
$$q_{EP}(w) = f_0(w) \prod_{t=1}^{T} \tilde{f}_t(w)$$
where $\tilde{f}_t(w)$ is the approximation of the likelihood factor $f(x_t; w)$, and $f_0(w) = p(w)$
- Both $\tilde{f}_t(w)$ and $f_0(w)$ are Gaussian
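The building block of this Gaussian approximation is a single moment-matching step: multiply the current Gaussian by one exact likelihood factor and project back onto a Gaussian. The sketch below does this for the flipping noise likelihood, using the standard closed-form moment-matching result for a step likelihood. The formulas do not appear on the slides, and the function name `adf_update` is hypothetical.

```python
import math
import numpy as np

def adf_update(m, V, x, eps):
    """Moment-match N(m, V) * f(x; w) for the flipping-noise step likelihood.

    Returns the mean and covariance of the Gaussian with the same first two
    moments as the tilted distribution, plus its normalizer Z.
    """
    Vx = V @ x
    s2 = float(x @ Vx)                       # variance of w^T x under N(m, V)
    z = float(m @ x) / math.sqrt(s2)
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    Z = eps + (1.0 - 2.0 * eps) * Phi        # E[f(x; w)] under N(m, V)
    alpha = (1.0 - 2.0 * eps) * pdf / (Z * math.sqrt(s2))
    m_new = m + alpha * Vx
    V_new = V - np.outer(Vx, Vx) * (alpha * float(x @ m_new) / s2)
    return m_new, V_new, Z

m0, V0 = np.zeros(2), np.eye(2)              # prior N(0, I)
x = np.array([1.0, 0.0])                     # label-scaled feature vector
m1, V1, Z1 = adf_update(m0, V0, x, eps=0.0)
print(m1, V1, Z1)
```

Applying this update once per data point, with no revisiting, corresponds to ADF; EP additionally stores each approximate factor so it can be removed and refined on later passes.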

8 Virtual Vector Machine
- Enlarge the approximation family: a Gaussian term times a set of virtual data likelihoods:
$$q(w) \propto r(w) \prod_i f(b_i; w), \qquad r(w) \sim \mathcal{N}(m_r, V_r)$$
- $b_i$ is the $i$-th virtual data point
- $f(b_i; w)$ has the form of the original likelihood factors
- $r(w)$ is the "residual"

9 Virtual Vector Machine
- Let $q^{new}(w)$ be the new approximation
- Minimize a distance measure $D = \mathrm{KL}(q^{+}(w) \,\|\, q^{new}(w))$, where $q^{+}(w) = q(w) f(x_T; w)$ contains one additional data point
- Too costly to minimize $D$ over the parameters $(B, m_r, V_r)$
- Two steps:
  1. Run EP on $q^{+}$ to get a fully Gaussian $\tilde{q}^{+}$ with approximate factors $\tilde{f}_i$
  2. Consider either evicting a virtual point or merging two points, based on their scores; merging never gets chosen over eviction

10 Virtual Vector Machine
Reduction to Gaussian
- From the augmented representation, we can reduce to a Gaussian by running EP over the virtual points with prior $r(w)$
- Approximate $q(w)$ with $\tilde{q}(w) \sim \mathcal{N}(m_w, V_w)$, where
$$\tilde{q}(w) = r(w) \prod_i \tilde{f}_i(w)$$
and each $\tilde{f}_i(w)$ is Gaussian

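11 Virtual Vector Machine
Cost function for finding virtual points
- Intuitively, try to keep the non-Gaussian information contained in $q^{+}$
- Maximize the non-Gaussianity of $q^{new}$, measured by the divergence between $q^{new}$ and its Gaussian approximation $\tilde{q}^{new}$: $D_2 = \mathrm{KL}(q^{new}(w) \,\|\, \tilde{q}^{new}(w))$, but this is difficult to compute exactly
- Maximize a surrogate function instead:
$$E = \sum_i \mathrm{KL}\Big( f(b_i; w)\, \tilde{q}^{new}(w) / \tilde{f}_i(w) \,\Big\|\, \tilde{q}^{new}(w) \Big)$$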

12 Virtual Vector Machine
Eviction: delete the least informative point
- Add the new point into the virtual point set
- Select $b_j$ by maximizing $E$
- Remove $b_j$ from the cache
- Update the residual by $r^{new}(w) = r(w)\, \tilde{f}_j(w)$
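The residual update on eviction is a product of two Gaussian terms, which is easiest to carry out in natural parameters: precisions add, and precision-weighted means add. The sketch below assumes, for illustration only, that the evicted point's approximate factor is a proper Gaussian with an invertible covariance; EP site factors need not satisfy this in general. The function name `multiply_gaussians` and all numeric values are hypothetical.

```python
import numpy as np

def multiply_gaussians(m1, V1, m2, V2):
    """Return (m, V) with N(w; m, V) proportional to N(w; m1, V1) * N(w; m2, V2)."""
    P1, P2 = np.linalg.inv(V1), np.linalg.inv(V2)   # precision matrices
    P = P1 + P2                                     # precisions add
    V = np.linalg.inv(P)
    m = V @ (P1 @ m1 + P2 @ m2)                     # precision-weighted means add
    return m, V

# Residual r(w) and a hypothetical Gaussian factor for the evicted point b_j.
m_r, V_r = np.zeros(2), np.eye(2)
m_j, V_j = np.array([1.0, 0.0]), np.diag([0.5, 4.0])
m_new, V_new = multiply_gaussians(m_r, V_r, m_j, V_j)
print(m_new, V_new)
```

After this step the evicted point's information survives only through its Gaussian summary inside the residual.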

13 Virtual Vector Machine
Eviction
[Figure omitted; figures are from the authors' original slides.]

14 Virtual Vector Machine
Merging: merge two similar points into one
- Remove $(b_1, b_2)$ from the cache
- Insert the merged point $b' = \frac{b_1 + b_2}{2}$ into the cache
- Update the residual via $r^{new}(w) = r(w)\, g(w)$, where the Gaussian residual term $g$ captures the information lost from the original two factors
- The residual $g$ is computed by matching the moments (inverse ADF projection):
$$\mathrm{proj}\big[ f(b_1; w)\, f(b_2; w)\, q_{+}^{\backslash 12}(w) \big] = \mathrm{proj}\big[ f(b'; w)\, g(w)\, q_{+}^{\backslash 12}(w) \big]$$
where
$$q_{+}^{\backslash 12}(w) = \frac{\tilde{q}^{+}(w)}{\tilde{f}_1(w)\, \tilde{f}_2(w)}$$

15 Virtual Vector Machine
Merging
[Figure omitted; figures are from the authors' original slides.]

16 Virtual Vector Machine

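17 Virtual Vector Machine
Nonlinear classification via random feature expansion
- Use random Fourier features to approximate an RBF (Gaussian) kernel:
$$z(x) = \frac{1}{\sqrt{D}} \big[ \cos(w_1^T x) \;\ldots\; \cos(w_D^T x) \;\; \sin(w_1^T x) \;\ldots\; \sin(w_D^T x) \big]^T$$
- $w_1, \ldots, w_D$ are i.i.d. samples from a special distribution $p(w)$ (the Fourier transform of the kernel)
- As shown by Rahimi & Recht (2007), these random features approximate the kernel $k$: $z(x)^T z(y) \approx k(x - y)$; the $1/\sqrt{D}$ scaling is what makes the inner product an average over the $D$ samples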
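A small numerical check of the random feature expansion, as a sketch under stated assumptions: the kernel is taken to be $k(x - y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, for which the sampling distribution is $\mathcal{N}(0, I/\sigma^2)$, and the features are scaled by $1/\sqrt{D}$ so the inner product is an average. The dimensions, bandwidth, and function name are illustrative choices, not values from the slides.

```python
import numpy as np

def random_fourier_features(X, W):
    """Map each row x of X to z(x) = [cos(W x), sin(W x)] / sqrt(D)."""
    proj = X @ W.T                                  # shape (n, D)
    D = W.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D, sigma = 3, 5000, 1.5                          # illustrative sizes, not from the slides
W = rng.normal(scale=1.0 / sigma, size=(D, d))      # w_i ~ N(0, I / sigma^2)

x, y = rng.normal(size=d), rng.normal(size=d)
Z = random_fourier_features(np.vstack([x, y]), W)
approx = float(Z[0] @ Z[1])                         # z(x)^T z(y)
exact = float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))
print(approx, exact)
```

The approximation error shrinks roughly as $1/\sqrt{D}$, so a larger $D$ trades computation for a closer match to the kernel.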

18 Experiment: synthetic dataset
Figure: Mean squared error of the estimated posterior mean obtained by EP, VVM, ADF and window-EP (W-EP). The exact posterior mean is obtained via a Monte Carlo method. EP performs smoothing on the full dataset of 300 points; VVM and window-EP use two buffer sizes (10 and 40 points); and ADF effectively uses a buffer of size one. The results are averaged over 20 runs.

19 Experiment: real dataset I

20 Experiment: real dataset II
Figure: Cumulative prediction error rates of VVM, the sparse online Gaussian process classifier (SOGP), the Passive-Aggressive (PA) algorithm and the Topmoumoute online natural gradient (NG) algorithm on the Spambase dataset. The dataset is randomly permuted 10 times and the results are averaged. The size of the virtual point set used by VVM is 30, while the online Gaussian process model has 143 basis points.

21 Experiment: real dataset III
Figure: Cumulative prediction error rates of VVM, the sparse online Gaussian process classifier (SOGP), the Passive-Aggressive (PA) algorithm and the Topmoumoute online natural gradient (NG) algorithm on the Thyroid dataset. The dataset is randomly permuted 10 times and the results are averaged. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 10). NG and VVM both use a buffer to cache 10 points, while SOGP and PA have 12 and 91 basis points, respectively.

22 Experiment: real dataset IV
Figure: Cumulative prediction error rates of VVM, the sparse online Gaussian process classifier (SOGP), the Passive-Aggressive (PA) algorithm and the Topmoumoute online natural gradient (NG) algorithm on the Ionosphere dataset. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 100). NG and VVM both use a buffer to cache 30 points, while SOGP and PA have 279 and 189 basis points, respectively.
