Logarithmic Time Prediction

1 Logarithmic Time Prediction
John Langford, Microsoft Research
DIMACS Workshop on Big Data through the Lens of Sublinear Algorithms

3 The Multiclass Prediction Problem
Repeatedly:
1. See $x$
2. Predict $\hat{y} \in \{1, \ldots, K\}$
3. See $y$
Goal: Find $h(x)$ minimizing the error rate $\Pr_{(x,y) \sim D}(h(x) \neq y)$, with $h(x)$ fast.

4 Why?

6 Trick #1: K is small

8 Trick #2: A hierarchy exists So use Trick #1 repeatedly.

10 Trick #3: Shared representation Very helpful... but computation in the last layer can still blow up.

12 Trick #4: Structured Prediction But what if the structure is unclear?

14 Trick #5: GPU 4 Teraflops is great... yet still burns energy.

18 How fast can we hope to go?
Theorem: There exist multiclass classification problems where achieving 0 error rate requires $\Omega(\log K)$ time to train or test per example.
Proof: By construction. Pick $y \sim U(1, \ldots, K)$.
Any prediction algorithm outputting fewer than $\log_2 K$ bits loses with constant probability.
Any training algorithm reading an example requires $\Omega(\log_2 K)$ time.

19 Can we predict in time $O(\log_2 K)$?
[Figure: "Computational Advantage of Log Time", plotting the benefit $K / \log(K)$ against $K$.]
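To get a feel for the size of the benefit $K / \log(K)$, the ratio can be tabulated for a few values of $K$. This is only an arithmetic sketch: the function name is hypothetical, the logarithm is assumed to be base 2, and constant factors are ignored.

```python
import math


def log_time_advantage(num_classes: int) -> float:
    """Ratio of O(K) work to O(log2 K) work, ignoring constant factors."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    return num_classes / math.log2(num_classes)


# Tabulate the ratio for a few class counts.
for k in (10, 1_000, 1_000_000):
    print(f"K={k:>9,}  K/log2(K) = {log_time_advantage(k):,.0f}")
```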

23 Not it #1: Sparse Error Correcting Output Codes
1. Create $O(\log K)$ binary vectors $b_{iy}$ of length $K$.
2. Train $O(\log K)$ binary classifiers $h_i$ to minimize the error rate $\Pr_{x,y}(h_i(x) \neq b_{iy})$.
3. Predict by finding the $y$ with minimal error.
Prediction is $\Omega(K)$.
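The decoding step is where the $\Omega(K)$ cost comes from. Below is a minimal sketch, assuming dense random codes and taking the outputs of the binary classifiers $h_i$ as given; `make_codes` and `decode` are hypothetical names, and labels are 0-indexed for convenience.

```python
import random


def make_codes(num_classes, num_bits, seed=0):
    """codes[i][y] is the target bit b_iy of binary problem i for label y."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(num_classes)]
            for _ in range(num_bits)]


def decode(codes, bit_predictions):
    """Return the label whose codeword disagrees least with the predicted bits."""
    num_classes = len(codes[0])
    best_label, best_errors = None, None
    for y in range(num_classes):  # this scan over all labels is the Omega(K) step
        errors = sum(1 for i, bit in enumerate(bit_predictions)
                     if codes[i][y] != bit)
        if best_errors is None or errors < best_errors:
            best_label, best_errors = y, errors
    return best_label


codes = make_codes(num_classes=16, num_bits=12)
true_label = 11
# Pretend every binary classifier was right about this example.
bits = [codes[i][true_label] for i in range(len(codes))]
print(decode(codes, bits))
```

Only $O(\log K)$ classifiers are evaluated, but the final loop still touches every label once.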

27 Not it #2: Hierarchy Construction
1. Build confusion matrix of errors.
2. Recursively partition to create a hierarchy.
3. Apply hierarchy solution.
Training is $\Omega(K)$ or worse.

30 Not it #3: Unnormalized learning
Train $K$ regressors as follows. For each example $(x, y)$:
1. Train regressor $y$ with $(x, 1)$.
2. Pick $y' \neq y$ uniformly at random.
3. Train regressor $y'$ with $(x, -1)$.
Prediction is still $\Omega(K)$.
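A minimal sketch of this scheme, assuming linear regressors trained by one squared-loss gradient step per update and a negative target of -1 for the randomly drawn label. The class name, learning rate, and 0-indexed labels are illustrative choices, not taken from the slides.

```python
import random


class UnnormalizedLearner:
    """K linear regressors; each example updates only two of them."""

    def __init__(self, num_classes, num_features, learning_rate=0.1, seed=0):
        self.weights = [[0.0] * num_features for _ in range(num_classes)]
        self.learning_rate = learning_rate
        self.rng = random.Random(seed)

    def _score(self, label, x):
        return sum(w * xi for w, xi in zip(self.weights[label], x))

    def _train_regressor(self, label, x, target):
        # One gradient step on squared loss (an assumed update rule).
        error = self._score(label, x) - target
        w = self.weights[label]
        for j, xi in enumerate(x):
            w[j] -= self.learning_rate * error * xi

    def learn(self, x, y):
        self._train_regressor(y, x, 1.0)
        # Draw a different label uniformly at random.
        other = self.rng.randrange(len(self.weights) - 1)
        if other >= y:
            other += 1
        self._train_regressor(other, x, -1.0)

    def predict(self, x):
        # Scoring every regressor is the Omega(K) prediction cost.
        return max(range(len(self.weights)),
                   key=lambda label: self._score(label, x))
```

Training cost per example no longer depends on $K$, but `predict` still evaluates all $K$ regressors.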

31 Can we predict in time $O(\log_2 K)$?

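32 Is logarithmic time even possible?
$P(y=1) = 0.4$, $P(y=2) = 0.3$, $P(y=3) = 0.3$
$P(\{2, 3\}) > P(1)$, so you lose for divide and conquer.
[Tree diagram: root node 1 v {2,3} with leaf 1 and child node 2 v 3, which has leaves 2 and 3.]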
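The failure can be checked by hand with the probabilities on this slide. The sketch below assumes each node of the tree makes the best possible binary decision; the function names are hypothetical.

```python
probs = {1: 0.4, 2: 0.3, 3: 0.3}


def divide_and_conquer_prediction(p):
    """Route greedily through the tree 1 v {2,3}, then 2 v 3."""
    # Root: choose the heavier side.
    if p[1] >= p[2] + p[3]:
        return 1
    # Inner node: choose the heavier of the two remaining labels.
    return 2 if p[2] >= p[3] else 3


def error_rate(p, prediction):
    return 1.0 - p[prediction]


tree_choice = divide_and_conquer_prediction(probs)
best_choice = max(probs, key=probs.get)
print("divide and conquer predicts", tree_choice,
      "with error", round(error_rate(probs, tree_choice), 2))
print("best single label is", best_choice,
      "with error", round(error_rate(probs, best_choice), 2))
```

Every node is individually optimal, yet the tree ends at label 2 (error 0.7) while always predicting label 1 has error 0.6.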

33 Filter Trees [BLR09]
$P(y=1) = 0.4$, $P(y=2) = 0.3$, $P(y=3) = 0.3$
1. Learn 2 v 3 first.
2. Throw away all error examples.
3. Learn 1 v Survivors.
[Tree diagram: root node 1 v {2,3} above node 2 v 3.]
Theorem: For all multiclass problems, for all binary classifiers,
Multiclass Regret $\leq$ Average Binary Regret $\times \log(K)$
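The three steps can be traced on the same distribution using exact probabilities instead of samples. This is a sketch under the assumption that each binary problem is solved perfectly (zero binary regret); `filter_tree_prediction` is a hypothetical name.

```python
def filter_tree_prediction(p):
    """Trace the filter-tree steps on a 3-label distribution p."""
    # Step 1: learn 2 v 3 using only examples whose label is 2 or 3.
    winner_23 = 2 if p[2] >= p[3] else 3
    # Step 2: examples that the 2 v 3 node got wrong are thrown away,
    # so only the mass of the winning label survives to the root.
    survivor_mass = p[winner_23]
    # Step 3: learn 1 v survivors on what is left.
    return 1 if p[1] >= survivor_mass else winner_23


print(filter_tree_prediction({1: 0.4, 2: 0.3, 3: 0.3}))
```

Filtering changes the root problem from 0.4 versus 0.6 into 0.4 versus 0.3, so the root now sends the example to label 1.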

37 Can you make it robust?
[Diagram: tournament with multiple winners.]
Theorem [BLR09]: For all multiclass problems, for all binary classifiers, a $\log(K)$-correcting tournament satisfies:
Multiclass Regret $\leq$ Average Binary Regret $\times\, 5.5$
This was used to determine the best paper prize at ICML 2012 (area chair decisions).

39 How do you learn structure?
Not all partitions are equally difficult. Compare {1, 7} v {3, 8} to {1, 8} v {3, 7}. Which is better?
[BWG10]: Better to confuse near leaves than near root. Intuition: the root predictor tends to be overconstrained while the leafward predictors are less constrained.

40 The Partitioning Problem [CL14]
Given a set of $n$ examples, each with one of $K$ labels, find a partitioner $h$ that maximizes:
$$E_{x,y}\,\bigl|\Pr(h(x)=1, y) - \Pr(h(x)=1)\Pr(y)\bigr|$$

41 The Partitioning Problem [CL14]
Given a set of $n$ examples, each with one of $K$ labels, find a partitioner $h$ that maximizes:
$$\sum_{y} \Pr(y)\,\bigl|\Pr(h(x)=1 \mid x \in X_y) - \Pr(h(x)=1)\bigr|$$
where $X_y$ is the set of $x$ associated with $y$.

42 The Partitioning Problem [CL14]
Given a set of $n$ examples, each with one of $K$ labels, find a partitioner $h$ that maximizes the objective above.
Nonconvex for any symmetric hypothesis class (ouch)
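As an illustration, the per-label form of the objective can be estimated from a finite sample. The sketch assumes the objective reads as the label-weighted absolute gap between $\Pr(h(x)=1 \mid x \in X_y)$ and $\Pr(h(x)=1)$; `partition_objective` and the toy data are hypothetical.

```python
from collections import Counter


def partition_objective(examples, h):
    """Estimate sum_y Pr(y) * |Pr(h(x)=1 | y) - Pr(h(x)=1)| from (x, y) pairs."""
    n = len(examples)
    label_counts = Counter(y for _, y in examples)
    positive_counts = Counter(y for x, y in examples if h(x) == 1)
    p_positive = sum(positive_counts.values()) / n
    total = 0.0
    for y, count in label_counts.items():
        p_y = count / n
        p_positive_given_y = positive_counts[y] / count
        total += p_y * abs(p_positive_given_y - p_positive)
    return total


# Toy data: four labels, one example each, with x equal to the label.
examples = [(1, 1), (2, 2), (3, 3), (4, 4)]
print(partition_objective(examples, lambda x: 1 if x <= 2 else 0))  # pure, balanced
print(partition_objective(examples, lambda x: 1))                   # everything one side
```

Under this reading, a split that is both balanced and pure scores highest, while sending every example the same way scores zero.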

45 Bottom Up doesn't work
Suppose you use linear representations.
Suppose you first build a 1 v 3 predictor.
Suppose you then build a 2 v {1 v 3} predictor. You lose.
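One way to see the problem is a one-dimensional toy case. The geometry here is an assumption made for illustration (the slide's figure is not available): classes 1, 2, 3 sit at positions 1, 2, 3 on a line, and the predictor is a single threshold, which is what a linear classifier reduces to in one dimension.

```python
def best_threshold_accuracy(points):
    """Best accuracy of any 1-D threshold classifier on (position, side) pairs."""
    xs = sorted(x for x, _ in points)
    # Candidate thresholds: below all points, between neighbours, above all points.
    thresholds = ([xs[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])
    best = 0.0
    for t in thresholds:
        for positive_side in (0, 1):  # which side label goes to the right of t
            correct = sum(1 for x, side in points
                          if (side == positive_side) == (x > t))
            best = max(best, correct / len(points))
    return best


one_v_three = [(1.0, 0), (3.0, 1)]             # class 1 against class 3
two_v_rest = [(1.0, 0), (2.0, 1), (3.0, 0)]    # class 2 against {1, 3}
print(best_threshold_accuracy(one_v_three))
print(best_threshold_accuracy(two_v_rest))
```

The 1 v 3 node is perfectly separable, but once 1 and 3 are merged, no threshold separates class 2 from the pair, so the parent node cannot be learned without error.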

46 Does partitioning recurse well?
Theorem: If at every node $n$,
$$E_{x,y}\,\bigl|\Pr(h(x)=1, y) - \Pr(h(x)=1)\Pr(y)\bigr| > \gamma$$
then after
$$\left(\frac{1}{\epsilon}\right)^{\frac{4(1-\gamma)^2 \ln K}{\gamma^2}}$$
splits, the multiclass error is less than $\epsilon$.
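To see how the bound behaves, it can be evaluated numerically. The sketch assumes the bound reads $(1/\epsilon)^{4(1-\gamma)^2 \ln K / \gamma^2}$; `splits_bound` is a hypothetical helper and the parameter values are arbitrary.

```python
import math


def splits_bound(gamma, epsilon, num_classes):
    """Evaluate (1/epsilon) ** (4 * (1 - gamma)^2 * ln K / gamma^2)."""
    if not 0 < gamma <= 1:
        raise ValueError("gamma must be in (0, 1]")
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must be in (0, 1)")
    exponent = 4 * (1 - gamma) ** 2 * math.log(num_classes) / gamma ** 2
    return (1 / epsilon) ** exponent


for gamma in (0.5, 0.7, 0.9):
    print(f"gamma={gamma}: {splits_bound(gamma, epsilon=0.1, num_classes=8):.3g} splits")
```

The bound is extremely sensitive to $\gamma$: it shrinks quickly as the per-node guarantee approaches 1 and grows very large for weak partitioners.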

48 Online Partitioning
Relax the optimization criteria:
$$E_{x,y}\,\bigl|E_{x \mid y}[\hat{y}(x)] - E_x[\hat{y}(x)]\bigr|$$
... and approximate with running average.
Let $e = 0$ and for all $y$, $e_y = 0$, $n_y = 0$.
For each example $(x, y)$:
1. if $e_y < e$ then $b = -1$ else $b = 1$
2. Update $w$ using $(x, b)$
3. $n_y \leftarrow n_y + 1$
4. $e_y \leftarrow \frac{(n_y - 1)\,e_y}{n_y} + \frac{\hat{y}(x)}{n_y}$
5. $e \leftarrow \frac{(t - 1)\,e}{t} + \frac{\hat{y}(x)}{t}$
Apply recursively to construct a tree structure.
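A single node of this procedure might be sketched as follows. Several details are assumptions rather than part of the slides: the predictor is linear with $\hat{y}(x) = w \cdot x$, the update is one squared-loss gradient step toward $b$, $\hat{y}(x)$ is read after the update, $t$ counts the examples seen at the node, and the class name is hypothetical.

```python
class OnlinePartitioner:
    """One tree node: a linear router plus running averages of its output."""

    def __init__(self, num_features, learning_rate=0.1):
        self.w = [0.0] * num_features
        self.learning_rate = learning_rate
        self.t = 0        # examples seen at this node
        self.e = 0.0      # running average of y_hat over all examples
        self.e_y = {}     # running average of y_hat per label
        self.n_y = {}     # example count per label

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def learn(self, x, y):
        e_label = self.e_y.get(y, 0.0)
        # Step 1: push the label toward the side it already leans to.
        b = -1.0 if e_label < self.e else 1.0
        # Step 2: one squared-loss gradient step toward b (assumed update rule).
        error = self.score(x) - b
        for j, xj in enumerate(x):
            self.w[j] -= self.learning_rate * error * xj
        y_hat = self.score(x)
        # Steps 3 and 4: per-label count and running average.
        n = self.n_y.get(y, 0) + 1
        self.n_y[y] = n
        self.e_y[y] = (n - 1) * e_label / n + y_hat / n
        # Step 5: overall running average.
        self.t += 1
        self.e = (self.t - 1) * self.e / self.t + y_hat / self.t

    def route(self, x):
        """Send the example right (+1) or left (-1)."""
        return 1 if self.score(x) > 0 else -1


node = OnlinePartitioner(num_features=2)
for _ in range(20):
    node.learn([1.0, 0.0], "a")
    node.learn([0.0, 1.0], "b")
print(node.route([1.0, 0.0]), node.route([0.0, 1.0]))
```

Building the full tree would mean creating one such node per split and passing each example down to the child chosen by `route`.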

49 Accuracy for a fixed training time
[Figure: LOMtree vs one-against-all (OAA); accuracy against number of classes for the isolet, sector, aloi, imagenet, and ODP datasets.]

50 Test Error %, optimized, no train-time constraint
[Figure: "Performance of Log-time algorithms"; test error % of Rand, Filter, and LOM on Isolet, Sector, Aloi, Imagenet, and ODP.]

51 Test Error %, optimized, no train-time constraint
[Figure: "Compared to OAA"; test error % of Rand, Filter, LOM, and OAA on Isolet, Sector, Aloi, Imagenet, and ODP.]

52 Classes vs Test time ratio
[Figure: LOMtree vs one-against-all; $\log_2$(time ratio) against $\log_2$(number of classes).]

55 Can we predict in time $O(\log_2 K)$?
What is the right way to achieve consistency and dynamic partition?
How can you balance representation complexity and sample complexity?

56 Bibliography
[BLR09] Alina Beygelzimer, John Langford, Pradeep Ravikumar, Error-Correcting Tournaments, 2009.
[BWG10] Samy Bengio, Jason Weston, David Grangier, Label embedding trees for large multi-class tasks, NIPS 2010.
[CL14] Anna Choromanska, John Langford, Logarithmic Time Online Multiclass prediction, 2014.
