Risk bounds for some classification and regression models that interpolate

1 Risk bounds for some classification and regression models that interpolate Daniel Hsu Columbia University Joint work with: Misha Belkin (The Ohio State University) Partha Mitra (Cold Spring Harbor Laboratory)

2 (Breiman, 1995) 2

3 When is "interpolation" justified in ML? Supervised learning: use training examples to find a function that predicts accurately on new examples. Interpolation: find a function that perfectly fits the training examples. Some call this "overfitting". PAC learning (Valiant, 1984; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987; ...): realizable, noise-free setting; bounded-capacity hypothesis class. Regression models: can interpolate if there is no noise! E.g., linear models with at least as many parameters as training examples, as in the sketch below. 3
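A minimal numpy sketch of that last point (my own illustration of the claim, not code from the talk): with more parameters than examples and noiseless labels, the minimum-norm least-squares solution fits the training data exactly.

```python
import numpy as np

# Illustration: fewer examples than parameters (n < d), noiseless labels.
rng = np.random.default_rng(0)
n, d = 20, 50
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true

# Minimum-norm solution of X w = y (one of infinitely many interpolating solutions).
w_hat = np.linalg.pinv(X) @ y

print(np.max(np.abs(X @ w_hat - y)))  # ~1e-13: training data fit exactly
```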

4 Overfitting 4

5 (Zhang, Bengio, Hardt, Recht, & Vinyals, 2017) Some observations from the field Can fit any training data, given enough time and large enough network. Can generalize even when training data has substantial amount of label noise. 5

6 (Belkin, Ma, & Mandal, 2018) More observations from the field MNIST Can fit any training data, given enough time and rich enough feature space. Can generalize even when training data has substantial amount of label noise. 6

7 Summary of some empirical observations Training produces a function $\hat{f}$ that perfectly fits noisy training data. $\hat{f}$ is likely a very complex function! Yet, the test error of $\hat{f}$ is non-trivial: e.g., noise rate + 5%. Can theory explain these observations? 7

8 "Classical" learning theory Generalization: 0 true error rate training error rate + deviation bound Deviation bound: depends on "complexity" of learned function Capacity control, regularization, smoothing, algorithmic stability, margins, None known to be non-trivial for functions interpolating noisy data. E.g., function is chosen from class rich enough to express all possible ways to label Ω(%) training examples. Bound must exploit specific properties of chosen function. 8

9 (Wyner, Olson, Bleich, & Mease, 2017) Even more observations from the field Some "local interpolation" methods are robust to label noise. Can limit influence of noisy points in other parts of data space. 9

10 What is known in theory? Nearest neighbor (Cover & Hart, 1967): predict with the label of the nearest training example. Interpolates training data. Not always consistent, but almost. Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998): bandwidth-free(!) Nadaraya-Watson smoothing kernel regression with kernel $K(x, x_i) = 1 / \|x - x_i\|^{d}$. Interpolates training data. Consistent(!!), but no rates. 10
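Minimal sketches of these two rules (illustrative numpy code based on the formulas above; function names and the tie-handling at training points are my own choices):

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """1-NN rule: predict with the label of the nearest training example."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return float(y_train[np.argmin(dists)])

def hilbert_kernel_predict(X_train, y_train, x):
    """Hilbert kernel estimate: Nadaraya-Watson with K(x, x_i) = 1/||x - x_i||^d,
    no bandwidth parameter; interpolates at training points."""
    d = X_train.shape[1]
    dists = np.linalg.norm(X_train - x, axis=1)
    if np.any(dists == 0.0):                     # at a training point, return its label
        return float(y_train[np.argmin(dists)])
    w = dists ** (-float(d))
    return float(w @ y_train / w.sum())

# Usage: both rules reproduce training labels exactly (interpolation).
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(100, 3))
y_train = rng.integers(0, 2, size=100).astype(float)
print(nearest_neighbor_predict(X_train, y_train, X_train[0]),
      hilbert_kernel_predict(X_train, y_train, X_train[0]),
      hilbert_kernel_predict(X_train, y_train, np.full(3, 0.5)))
```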

11 Our goals Counter the "conventional wisdom" re: interpolation Show interpolation methods can be consistent (or almost consistent) for classification & regression problems Identify some useful properties of certain local prediction methods Suggest connections to practical methods 11

12 Our new results Analyses of two new interpolation schemes 1. Simplicial interpolation Natural linear interpolation based on multivariate triangulation Asymptotic advantages compared to nearest neighbor rule 2. Weighted k-nn interpolation Consistency + non-asymptotic convergence rates 12

13 1. Simplicial interpolation 13

14 Interpolation via multivariate triangulation IID training examples $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times [0,1]$. Partition $C = \mathrm{conv}(x_1, \ldots, x_n)$ into simplices with the $x_i$ as vertices via Delaunay triangulation. Define $\hat\eta(x)$ on each simplex by affine interpolation of the vertices' labels. The result is piecewise linear on $C$. (Punt on what happens outside of $C$.) For classification ($y \in \{0,1\}$), let $\hat{f}$ be the plug-in classifier based on $\hat\eta$. 14
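A minimal sketch of this construction (illustrative Python using scipy's Delaunay triangulation; the function name and data are my own, not from the talk):

```python
import numpy as np
from scipy.spatial import Delaunay

def simplicial_interpolate(X_train, y_train, x):
    """Piecewise-linear interpolation of (X_train, y_train) at a point x
    inside conv(X_train), via Delaunay triangulation."""
    d = X_train.shape[1]
    tri = Delaunay(X_train)
    s = int(tri.find_simplex(x[np.newaxis, :])[0])
    if s == -1:
        raise ValueError("x lies outside the convex hull of the training points")
    # Barycentric coordinates of x in simplex s.
    T = tri.transform[s]
    b = T[:d].dot(x - T[d])
    bary = np.append(b, 1.0 - b.sum())
    # Affine interpolation of the vertices' labels.
    return float(bary @ y_train[tri.simplices[s]])

# Usage: regression estimate eta_hat(x); the plug-in classifier is 1{eta_hat(x) > 1/2}.
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(30, 2))
y_train = rng.integers(0, 2, size=30).astype(float)
x = X_train.mean(axis=0)                      # a point guaranteed to be in the hull
eta_hat = simplicial_interpolate(X_train, y_train, x)
print(eta_hat, int(eta_hat > 0.5))
```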

15 What happens on a single simplex Simplex on $x_1, \ldots, x_{d+1}$ with corresponding labels $y_1, \ldots, y_{d+1}$. Test point $x$ in the simplex, with barycentric coordinates $(w_1, \ldots, w_{d+1})$. Linear interpolation at $x$ (i.e., the least squares fit, evaluated at $x$): $\hat\eta(x) = \sum_{i=1}^{d+1} w_i\, y_i$. Key idea: aggregates information from all vertices to make the prediction. (Cf. nearest neighbor rule.) 15
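A small worked example of this formula (my own illustration, assuming a triangle in $\mathbb{R}^2$): barycentric coordinates can be obtained by solving a small linear system.

```python
import numpy as np

# Simplex in R^2 with vertices x_1, x_2, x_3 and labels y_1, y_2, y_3.
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([0.0, 0.0, 1.0])
x = np.array([0.25, 0.25])

# Barycentric coordinates w solve: sum_i w_i * v_i = x  and  sum_i w_i = 1.
A = np.vstack([V.T, np.ones(3)])   # 3x3 system
b = np.append(x, 1.0)
w = np.linalg.solve(A, b)          # -> [0.5, 0.25, 0.25]

eta_hat = w @ y                    # linear interpolation at x -> 0.25
print(w, eta_hat)
```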

16 Comparison to nearest neighbor rule Suppose $\eta(x) = \Pr(y = 1 \mid x) < 1/2$ for all points in a simplex, so the Bayes optimal prediction is 0 for all points in the simplex. Suppose $y_1 = \cdots = y_d = 0$, but $y_{d+1} = 1$ (due to "label noise"). [Figure: two panels, "Nearest neighbor rule" and "Simplicial interpolation", showing a simplex with vertices $x_1, x_2, x_3$, one of which carries the noisy label; the "$\hat y = 1$ here" region is much larger for the nearest neighbor rule than for simplicial interpolation.] The effect is even more pronounced in high dimensions! 16

17 Asymptotic risk Theorem: Assume the distribution of $x$ is uniform on some convex set and $\eta$ is Hölder smooth. Then the simplicial interpolation estimate $\hat\eta$ and the corresponding plug-in classifier $\hat f$ satisfy asymptotic ($\limsup$ over $n$) bounds on the excess mean squared error and the excess classification risk. Near-consistency in high dimension: the asymptotic risk of the plug-in classifier exceeds the Bayes risk only by a term that vanishes as the dimension $d$ grows. Cf. the nearest neighbor classifier, whose asymptotic risk can be twice the Bayes risk. A "blessing" of dimensionality (with a caveat about the convergence rate). 17

18 2. Weighted k-nn interpolation 18

19 Weighted k-nn scheme For a given test point $x$, let $x_{(1)}, \ldots, x_{(k)}$ be its $k$ nearest neighbors in the training data, and let $y_{(1)}, \ldots, y_{(k)}$ be the corresponding labels. Define $\hat\eta(x) = \frac{\sum_{i=1}^{k} w(x, x_{(i)})\, y_{(i)}}{\sum_{i=1}^{k} w(x, x_{(i)})}$ where $w(x, x_{(i)}) = \|x - x_{(i)}\|^{-\delta}$, $\delta > 0$. Interpolation: $\hat\eta(x) \to y_{(i)}$ as $x \to x_{(i)}$. 19
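A minimal sketch of this estimator (illustrative numpy code; the function name and the handling of exact coincidence with a training point are my own choices, not from the talk). The usage example uses $\delta = 1 < d/2$, matching the condition on the next slide.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k, delta):
    """Weighted k-NN interpolation: eta_hat(x) = sum_i w_i y_(i) / sum_i w_i,
    with singular weights w_i = ||x - x_(i)||^(-delta) over the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]
    if dists[nn[0]] == 0.0:              # x coincides with a training point:
        return float(y_train[nn[0]])     # interpolation forces eta_hat(x) = y_(i)
    w = dists[nn] ** (-delta)
    return float(w @ y_train[nn] / w.sum())

# Usage: reproduces training labels (interpolation) yet averages many neighbors elsewhere.
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(200, 5))
y_train = rng.integers(0, 2, size=200).astype(float)
print(weighted_knn_predict(X_train, y_train, X_train[0], k=25, delta=1.0))  # -> y_train[0]
print(weighted_knn_predict(X_train, y_train, np.full(5, 0.5), k=25, delta=1.0))
```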

20 Comparison to Hilbert kernel estimate Weighted k-nn: $\hat\eta(x) = \frac{\sum_{i=1}^{k} w(x, x_{(i)})\, y_{(i)}}{\sum_{i=1}^{k} w(x, x_{(i)})}$ with $w(x, x_{(i)}) = \|x - x_{(i)}\|^{-\delta}$; our analysis needs $0 < \delta < d/2$. Hilbert kernel (Devroye, Györfi, & Krzyżak, 1998): $\hat\eta(x) = \frac{\sum_{i=1}^{n} w(x, x_i)\, y_i}{\sum_{i=1}^{n} w(x, x_i)}$ with $w(x, x_i) = \|x - x_i\|^{-d}$; MUST have $\delta = d$ for consistency. Localization makes it possible to prove a non-asymptotic rate. 20

21 Convergence rates Theorem: Assume the distribution of $x$ is uniform on some compact set satisfying a regularity condition, and $\eta$ is $\alpha$-Hölder smooth. For an appropriate setting of $k$, the weighted k-nn estimate satisfies $\mathbb{E}\big[(\hat\eta(X) - \eta(X))^2\big] \leq \widetilde{O}\big(n^{-2\alpha/(2\alpha + d)}\big)$. If a Tsybakov noise condition with parameter $\beta > 0$ also holds, then the plug-in classifier, with an appropriate setting of $k$, satisfies $\Pr\big(\hat f(X) \neq f_{\mathrm{Bayes}}(X)\big) \leq \widetilde{O}\big(n^{-c}\big)$ for an exponent $c > 0$ depending on $\alpha$, $\beta$, and $d$. 21

22 Closing thoughts 22

23 Connections to models used in practice Kernel ridge regression: simplicial interpolation is like interpolation with the Laplace kernel in $\mathbb{R}^1$. Random forests: large ensembles with random thresholds may approximate locally-linear interpolation (Cutler & Zhao, 2001). Neural nets: many recent empirical studies find similarities between neural nets and k-nn in terms of performance and noise-robustness (Drory, Avidan, & Giryes, 2018; Cohen, Sapiro, & Giryes, 2018). 23

24 "Adversarial examples" Interpolation works because mass of region immediately around noisily-labeled training examples is small in high-dimensions. But also a great source of adversarial examples -- easy to find using local optimization around training examples. 24

25 Open problems Generalization theory to explain the behavior of interpolation methods. Kernel methods: $\min_{f \in \mathcal{H}} \|f\|_{\mathcal{H}}^2$ subject to $f(x_i) = y_i$ for all $i = 1, \ldots, n$. When does this work (with noisy labels)? Very recent work by T. Liang and A. Rakhlin (2018+) provides some analysis in some regimes. Benefits of interpolation? 25
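A minimal sketch of this minimum-norm kernel interpolation problem (my own illustration with a Gaussian kernel; not from the talk): the minimizer has the closed form $f(x) = k(x, X)\, K^{-1} y$, where $K$ is the kernel matrix on the training points, assuming $K$ is invertible.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Minimum-norm interpolant: f(x) = k(x, X) K^{-1} y.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 2))
y = np.sign(X[:, 0] * X[:, 1]) + 0.3 * rng.standard_normal(30)   # noisy labels

K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K, y)

f_train = gaussian_kernel(X, X) @ alpha
print(np.max(np.abs(f_train - y)))    # ~0: the noisy labels are fit exactly
```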

26 Acknowledgements National Science Foundation Sloan Foundation Simons Institute for the Theory of Computing arxiv.org/abs/
