Online Learning. Lorenzo Rosasco, MIT 9.520

1 Online Learning Lorenzo Rosasco MIT, 9.520

2 About this class Goal To introduce theory and algorithms for online learning.

3 Plan: Different views on online learning; from batch to online least squares; other loss functions; theory.

4 (Batch) Learning Algorithms A learning algorithm $A$ is a map from the data space into the hypothesis space, $f_S = A(S)$, where $S = S_n = (x_0, y_0), \dots, (x_{n-1}, y_{n-1})$. We typically assume that: $A$ is deterministic; $A$ does not depend on the ordering of the points in the training set. Notation: note the weird (zero-based) numbering of the training set!

5 Online Learning Algorithms The pure online learning approach is: let $f_1 = \mathrm{init}$; for $n = 1, \dots$: $f_{n+1} = A(f_n, (x_n, y_n))$. The algorithm works sequentially and has a recursive definition.

6 Online Learning Algorithms (cont.) A related approach (similar to transductive learning) is: let $f_1 = \mathrm{init}$; for $n = 1, \dots$: $f_{n+1} = A(f_n, S_n, (x_n, y_n))$. Also in this case the algorithm works sequentially and has a recursive definition, but it requires storing the past data $S_n$.
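
A minimal Python sketch (not from the slides) of the two recursions above; update and update_with_memory stand in for the abstract map $A$, and the tiny gradient step inside them is an arbitrary placeholder:

import numpy as np

rng = np.random.default_rng(0)
stream = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(100)]

def update(f, sample):
    # placeholder for f_{n+1} = A(f_n, (x_n, y_n)); here a tiny gradient step
    x, y = sample
    return f + 0.01 * x * (y - x @ f)

# pure online: only the current estimate and the new sample are used
f = np.zeros(3)                                 # f_1 = init
for sample in stream:
    f = update(f, sample)                       # f_{n+1} = A(f_n, (x_n, y_n))

# memory-based variant: the past data S_n is also available to the update
def update_with_memory(f, past, sample):
    # placeholder; a real rule could recompute statistics of past + [sample]
    return update(f, sample)

f, past = np.zeros(3), []                       # f_1 = init, S_1 empty
for sample in stream:
    f = update_with_memory(f, past, sample)     # f_{n+1} = A(f_n, S_n, (x_n, y_n))
    past.append(sample)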

7 Why Online Learning? Different motivations/perspectives that often correspond to different theoretical frameworks: biological plausibility; stochastic approximation; incremental optimization; non-i.i.d. data, game-theoretic view.

8 Online Learning and Stochastic Approximation Our goal is to minimize the expected risk $I[f] = \mathbb{E}_{(x,y)}[V(f(x), y)] = \int V(f(x), y)\, d\mu(x, y)$ over the hypothesis space $\mathcal{H}$, but the data distribution is not known. The idea is to use the samples to build an approximate solution and to update such a solution as we get more data.

10 Online Learning and Stochastic Approximation (cont.) More precisely, if we are given samples $(x_i, y_i)_i$ in a sequential fashion, at the $n$-th step we have an approximation $G(f, (x_n, y_n))$ of the gradient of $I[f]$, and we can define the recursion: let $f_0 = \mathrm{init}$; for $n = 0, \dots$: $f_{n+1} = f_n - \gamma_n G(f_n, (x_n, y_n))$.
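
As an illustration (not from the slides), a sketch of this stochastic-approximation recursion for the square loss, drawing fresh samples from an assumed data distribution and using the pointwise gradient as $G$:

import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])

def sample():
    # assumed data distribution mu(x, y): Gaussian x, linear y plus noise
    x = rng.standard_normal(3)
    return x, x @ w_true + 0.1 * rng.standard_normal()

def G(w, x, y):
    # pointwise gradient of the square loss (y - x'w)^2 with respect to w
    return -2.0 * x * (y - x @ w)

w = np.zeros(3)                              # f_0 = init
for n in range(1, 5001):
    x, y = sample()
    w = w - (0.1 / n ** 0.6) * G(w, x, y)    # f_{n+1} = f_n - gamma_n G(f_n, (x_n, y_n))

print(np.round(w, 2))                        # should be close to w_true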

12 Incremental Optimization Here our goal is to solve empirical risk minimization $I_S[f]$, or regularized empirical risk minimization $I_S^\lambda[f] = I_S[f] + \lambda \|f\|^2$, over the hypothesis space $\mathcal{H}$, when the number of points is so big (say $n = \dots$) that standard solvers would not be feasible. Memory is the main constraint here.

13 Incremental Optimization (cont.) In this case we can consider: let $f_0 = \mathrm{init}$; for $t = 0, \dots$: $f_{t+1} = f_t - \gamma_t G(f_t, (x_{n_t}, y_{n_t}))$, where $G(f_t, (x_{n_t}, y_{n_t}))$ is a pointwise estimate of the gradient of $I_S$ or $I_S^\lambda$. Epochs: note that in this case the iteration counter is decoupled from the index of the training set points, so we can look at the data more than once, that is, consider several epochs.
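
A sketch of the incremental, multi-epoch variant on a fixed training set for the regularized least-squares objective; the data, $\lambda$ and the step sizes are arbitrary choices of the sketch:

import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1e-3
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def G(w, x_i, y_i):
    # pointwise estimate of the gradient of I_S^lambda at w
    return -2.0 * x_i * (y_i - x_i @ w) + 2.0 * lam * w

w, t = np.zeros(d), 0
for epoch in range(10):                      # several passes (epochs) over the data
    for i in rng.permutation(n):             # iteration counter t decoupled from index i
        t += 1
        w = w - (0.1 / t ** 0.6) * G(w, X[i], y[i])

w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print(round(float(np.linalg.norm(w - w_star)), 3))   # small gap after a few epochs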

14 Non i.i.d. data, game theoretic view If the data are not i.i.d., we can consider a setting where the data form a finite sequence that is disclosed to us in a sequential (possibly adversarial) fashion. Then we can see learning as a two-player game where, at each step, nature chooses a sample $(x_i, y_i)$ and the learner chooses an estimator $f_{i+1}$. The goal of the learner is to perform as well as if it could view the whole sequence.

16 Plan: Different views on online learning; from batch to online least squares; other loss functions; theory.

17 Recalling Least Squares We start by considering a linear kernel, so that $I_S[f] = \frac{1}{n}\sum_{i=0}^{n-1}(y_i - x_i^T w)^2 = \frac{1}{n}\|Y - Xw\|^2$. Remember that in this case $w_n = (X^T X)^{-1} X^T Y = C_n^{-1}\sum_{i=0}^{n-1} x_i y_i$, where $C_n = X^T X = \sum_{i=0}^{n-1} x_i x_i^T$. (Note that if we regularize we have $(C_n + \lambda I)^{-1}$ in place of $C_n^{-1}$.) Notation: note the weird (zero-based) numbering of the training set!
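
A small numpy check of the two equivalent closed forms on hypothetical data (the rows of X are the $x_i$, with the zero-based indexing of the slides):

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4
X = rng.standard_normal((n, d))                 # rows x_0, ..., x_{n-1}
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

C_n = X.T @ X
w_a = np.linalg.solve(C_n, X.T @ Y)             # (X'X)^{-1} X'Y
w_b = np.linalg.solve(C_n, sum(X[i] * Y[i] for i in range(n)))  # C_n^{-1} sum_i x_i y_i
print(np.allclose(w_a, w_b))                    # True

# regularized version: (C_n + lambda I)^{-1} X'Y
lam = 0.1
w_reg = np.linalg.solve(C_n + lam * np.eye(d), X.T @ Y)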

19 A Recursive Least Squares Algorithm Then we can consider $w_{n+1} = w_n + C_{n+1}^{-1} x_n [y_n - x_n^T w_n]$. Proof: $w_n = C_n^{-1}\big(\sum_{i=0}^{n-1} x_i y_i\big)$ and $w_{n+1} = C_{n+1}^{-1}\big(\sum_{i=0}^{n-1} x_i y_i + x_n y_n\big)$, so that $w_{n+1} - w_n = C_{n+1}^{-1}(x_n y_n) + C_{n+1}^{-1}(C_n - C_{n+1}) C_n^{-1} \sum_{i=0}^{n-1} x_i y_i$. Since $C_{n+1} - C_n = x_n x_n^T$, the second term equals $-C_{n+1}^{-1} x_n x_n^T w_n$, which gives the update above.
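
A sketch checking the recursion numerically: each $w_{n+1}$ produced by the update coincides with the batch solution on the first $n+1$ points. The warm start on an initial block of points, so that $C_n$ is invertible, is a convenience of this sketch, not part of the slides:

import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 4
X = rng.standard_normal((N, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

n0 = 10                                          # warm start so that C_n is invertible
C = X[:n0].T @ X[:n0]
w = np.linalg.solve(C, X[:n0].T @ Y[:n0])

max_gap = 0.0
for n in range(n0, N):
    x_n, y_n = X[n], Y[n]
    C = C + np.outer(x_n, x_n)                   # C_{n+1} = C_n + x_n x_n'
    w = w + np.linalg.solve(C, x_n) * (y_n - x_n @ w)
    w_batch = np.linalg.solve(X[:n + 1].T @ X[:n + 1], X[:n + 1].T @ Y[:n + 1])
    max_gap = max(max_gap, float(np.linalg.norm(w - w_batch)))

print(max_gap)                                   # essentially zero: the recursion matches batch ERM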

23 A Recursive Least Squares Algorithm (cont.) We derived the algorithm $w_{n+1} = w_n + C_{n+1}^{-1} x_n [y_n - x_n^T w_n]$. The above approach: is recursive; requires storing all the data; requires inverting a matrix, $C_{n+1}$, at each step.

24 A Recursive Least Squares Algorithm (cont.) The following matrix equality allows us to alleviate the computational burden. Matrix Inversion Lemma: $[A + BCD]^{-1} = A^{-1} - A^{-1}B[DA^{-1}B + C^{-1}]^{-1}DA^{-1}$. Then $C_{n+1}^{-1} = C_n^{-1} - \dfrac{C_n^{-1} x_n x_n^T C_n^{-1}}{1 + x_n^T C_n^{-1} x_n}$.
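
A quick numerical check of the rank-one case used here (the Sherman-Morrison form of the lemma), with an arbitrary positive definite $C_n$ and vector $x_n$:

import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
C = A @ A.T + np.eye(d)                  # a positive definite C_n
x = rng.standard_normal(d)

C_inv = np.linalg.inv(C)
lhs = np.linalg.inv(C + np.outer(x, x))
rhs = C_inv - np.outer(C_inv @ x, C_inv @ x) / (1.0 + x @ C_inv @ x)
print(np.allclose(lhs, rhs))             # True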

26 A Recursive Least Squares Algorithm (cont.) Moreover $C_{n+1}^{-1} x_n = C_n^{-1} x_n - \dfrac{C_n^{-1} x_n\, x_n^T C_n^{-1} x_n}{1 + x_n^T C_n^{-1} x_n} = \dfrac{C_n^{-1} x_n}{1 + x_n^T C_n^{-1} x_n}$, so we can derive the algorithm $w_{n+1} = w_n + \dfrac{C_n^{-1} x_n}{1 + x_n^T C_n^{-1} x_n}\,[y_n - x_n^T w_n]$. Since the above iteration is equivalent to empirical risk minimization (ERM), the conditions ensuring its convergence as $n \to \infty$ are the same as those for ERM.
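
Putting the pieces together, a sketch of the recursive algorithm that stores only $w_n$ and $C_n^{-1}$ and updates the inverse with the rank-one formula (again warm-started on an initial block so the inverse exists, a choice of this sketch):

import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 4
X = rng.standard_normal((N, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

n0 = 10                                          # warm start so that the inverse exists
C_inv = np.linalg.inv(X[:n0].T @ X[:n0])         # inverted once, then only updated
w = C_inv @ (X[:n0].T @ Y[:n0])

for n in range(n0, N):
    x_n, y_n = X[n], Y[n]
    v = C_inv @ x_n                              # C_n^{-1} x_n
    s = 1.0 + x_n @ v                            # 1 + x_n' C_n^{-1} x_n
    w = w + (v / s) * (y_n - x_n @ w)            # the update derived above
    C_inv = C_inv - np.outer(v, v) / s           # rank-one update of the inverse

w_batch = np.linalg.solve(X.T @ X, X.T @ Y)
print(np.allclose(w, w_batch))                   # True up to round-off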

29 A Recursive Least Squares Algorithm (cont.) The algorithm $w_{n+1} = w_n + \dfrac{C_n^{-1} x_n}{1 + x_n^T C_n^{-1} x_n}\,[y_n - x_n^T w_n]$ is of the form: let $f_0 = \mathrm{init}$; for $n = 1, \dots$: $f_{n+1} = A(f_n, S_n, (x_n, y_n))$. The above approach: is recursive; requires storing all the data; updates the inverse via matrix/vector multiplications only.

31 Recursive Least Squares (cont.) How about the memory requirement? The main constraint comes from storing and updating the matrix in $w_{n+1} = w_n + \dfrac{C_n^{-1} x_n}{1 + x_n^T C_n^{-1} x_n}\,[y_n - x_n^T w_n]$. We can instead consider $w_{n+1} = w_n + \gamma_n x_n [y_n - x_n^T w_n]$, where $(\gamma_n)_n$ is a decreasing sequence.

34 Online Least Squares The algorithm $w_{n+1} = w_n + \gamma_n x_n [y_n - x_n^T w_n]$: is recursive; does not require storing the data; does not require updating the inverse, only vector/vector multiplications.

35 Convergence of Online Least Squares (cont.) When we consider $w_{n+1} = w_n + \gamma_n x_n [y_n - x_n^T w_n]$ we are no longer guaranteed convergence. In other words: how shall we choose $\gamma_n$?

36 Batch vs Online Gradient Descent Some ideas come from the comparison with batch least squares. Online update: $w_{n+1} = w_n + \gamma_n x_n [y_n - x_n^T w_n]$. Note that $\nabla_w (y_n - x_n^T w)^2 = -2\, x_n [y_n - x_n^T w]$, so (absorbing constants into $\gamma_n$) the online update is a step along the negative gradient of the pointwise error. The batch gradient algorithm would instead be $w_{n+1} = w_n + \gamma_n \frac{1}{n}\sum_{i=0}^{n-1} x_i [y_i - x_i^T w_n]$, since $I_S(w) = \frac{1}{n}\sum_{i=0}^{n-1}(y_i - x_i^T w)^2$ has gradient $\nabla I_S(w) = -\frac{2}{n}\sum_{i=0}^{n-1} x_i (y_i - x_i^T w)$.
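
A side-by-side sketch of the two updates on the same synthetic data; step sizes and iteration counts are arbitrary choices of the sketch:

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.standard_normal(n)

# batch: every iteration uses the average gradient over all n points
w_batch = np.zeros(d)
for k in range(200):
    w_batch = w_batch + 0.1 * (X.T @ (Y - X @ w_batch)) / n

# online: a single pass, one point per update, decreasing step size
w_online = np.zeros(d)
for i in range(n):
    w_online = w_online + (0.5 / (i + 1) ** 0.6) * X[i] * (Y[i] - X[i] @ w_online)

print(np.round(w_batch, 2), np.round(w_online, 2))  # both approach w_true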

38 Convergence of Recursive Least Squares (cont.) If $\gamma_n$ decreases too fast, the iterates get stuck away from the true solution. In general the convergence of the online iteration is slower than that of recursive least squares, since we use a simpler step size. Polyak Averaging: if we choose $\gamma_n = n^{-\alpha}$, with $\alpha \in (1/2, 1)$, and use, at each step, the solution obtained by averaging all previous iterates (namely Polyak averaging), then convergence is ensured and is almost as good as recursive least squares.
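
A sketch of the averaged variant, with $\gamma_n = \gamma_0\, n^{-\alpha}$, $\alpha \in (1/2, 1)$ (the extra constant $\gamma_0$ is only there to keep this toy run stable) and the reported solution being the running average of the iterates:

import numpy as np

rng = np.random.default_rng(0)
N, d, alpha, gamma0 = 2000, 3, 0.75, 0.2
X = rng.standard_normal((N, d))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.1 * rng.standard_normal(N)

w = np.zeros(d)
w_avg = np.zeros(d)
for n in range(1, N + 1):
    x_n, y_n = X[n - 1], Y[n - 1]
    w = w + gamma0 * n ** (-alpha) * x_n * (y_n - x_n @ w)   # plain online iterate
    w_avg = w_avg + (w - w_avg) / n                          # running average of w_1, ..., w_n

print(np.round(w, 2), np.round(w_avg, 2))  # compare both with w_true; the average is smoother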

39 Extensions Other Loss functions Other Kernels.

40 Other Loss functions The same ideas can be extended to other convex differentiable loss functions, considering $w_{n+1} = w_n - \gamma_n \nabla_w V(y_n, x_n^T w_n)$. If the loss is convex but not differentiable, then more sophisticated methods must be used, e.g. proximal methods or subgradient methods.
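
The same recursion for another convex differentiable loss; as an illustration (not in the slides) we use the logistic loss $V(y, a) = \log(1 + e^{-ya})$ with labels $y \in \{-1, +1\}$:

import numpy as np

rng = np.random.default_rng(0)
N, d = 5000, 3
w_true = np.array([1.5, -1.0, 0.5])
X = rng.standard_normal((N, d))
Ylabels = np.where(X @ w_true + 0.3 * rng.standard_normal(N) > 0, 1.0, -1.0)

def dV(y, a):
    # derivative of log(1 + exp(-y a)) with respect to a
    return -y / (1.0 + np.exp(y * a))

w = np.zeros(d)
for n in range(1, N + 1):
    x_n, y_n = X[n - 1], Ylabels[n - 1]
    w = w - (1.0 / n ** 0.6) * dV(y_n, x_n @ w) * x_n   # w_{n+1} = w_n - gamma_n grad V

# learned direction vs true direction (rough agreement expected)
print(np.round(w / np.linalg.norm(w), 2), np.round(w_true / np.linalg.norm(w_true), 2))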

41 Non linear Kernels If we use a different kernel, $f(x) = \Phi(x)^T w = \sum_{i=0}^{n-1} c^i K(x, x_i)$, then we can consider the iteration $c^i_{n+1} = c^i_n$ for $i = 1, \dots, n$, and $c^{n+1}_{n+1} = \gamma_n [y_n - c_n^T K_n(x_n)]$, where $K_n(x) = (K(x, x_0), \dots, K(x, x_{n-1}))$ and $c_n = (c^1_n, \dots, c^n_n)$.

42 Non linear Kernels (cont.) Proof: $f_n(x) = w_n^T \Phi(x) = K_n(x)^T c_n$. The update $w_{n+1} = w_n + \gamma_n \Phi(x_n)[y_n - \Phi(x_n)^T w_n]$ gives, for all $x$, $f_{n+1}(x) = f_n(x) + \gamma_n [y_n - K_n(x_n)^T c_n] K(x, x_n)$, which can be written as $K_{n+1}(x)^T c_{n+1} = K_n(x)^T c_n + \gamma_n [y_n - K_n(x_n)^T c_n] K(x, x_n)$. QED

43 Non linear Kernels The iteration $c^i_{n+1} = c^i_n$, $i = 1, \dots, n$, $c^{n+1}_{n+1} = \gamma_n [y_n - c_n^T K_n(x_n)]$: is recursive; requires storing all the data; does not require updating an inverse, only a vector/vector multiplication, i.e. $O(n)$ per step.
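
A sketch of the kernel version (the Gaussian kernel and its width are arbitrary choices); one coefficient is appended per step while the previous ones stay fixed, which is why all the data must be stored and each prediction costs $O(n)$:

import numpy as np

rng = np.random.default_rng(0)

def K(x, z, sigma=0.5):
    # Gaussian kernel, an arbitrary choice for this sketch
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

N, d = 300, 2
X = rng.uniform(-1, 1, size=(N, d))
Y = np.sin(3 * X[:, 0]) * X[:, 1] + 0.05 * rng.standard_normal(N)

centers, coeffs = [], []              # stored points x_i and coefficients c^i

def f(x):
    # f_n(x) = sum_i c^i K(x, x_i); evaluating it costs O(n)
    return sum(c * K(x, xc) for c, xc in zip(coeffs, centers))

for n in range(N):
    x_n, y_n = X[n], Y[n]
    gamma_n = 1.0 / (n + 1) ** 0.5
    coeffs.append(gamma_n * (y_n - f(x_n)))   # new coefficient; previous ones unchanged
    centers.append(x_n)

x_test = np.array([0.3, -0.4])
print(round(f(x_test), 2), round(np.sin(3 * 0.3) * (-0.4), 2))  # prediction vs noiseless target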
