Computational and Statistical Tradeoffs in VoI-Driven Learning


1 ARO MURI on Value-centered Theory for Adaptive Learning, Inference, Tracking, and Exploitation. Computational and Statistical Tradeoffs in VoI-Driven Learning. Co-PI: Michael Jordan, University of California, Berkeley

3 Four major axes of progress
Progress 1: Time/Data Tradeoffs. A new Gaussian squared complexity measure for VoI; operational tradeoff curves linking time and data resources for a class of denoising problems.
Progress 2: Distributed Resampling Methods. A new, distributed version of the bootstrap; terabyte-scale computation of confidence intervals.
Progress 3: Theory of Ranking and Theory of Privacy. Human-in-the-loop processing.
Progress 4: Streaming Variational Inference. Bayesian filtering with variational updates; approximate Bayesian inference for large-scale data.

4 1. Computation/Statistics. More data generally means more computation in our current state of understanding, but statistically more data generally means less risk (i.e., error), and statistical inferences are often simplified as the amount of data grows. Somehow these facts should have algorithmic consequences; i.e., somehow we should be able to get by with less computation as the amount of data grows. We need a new notion of controlled algorithm weakening.
Chandrasekaran, V. & Jordan, M. (2013). Computational and statistical tradeoffs via convex relaxation. Proceedings of the National Academy of Sciences, 110, E1181-E1190.

5 Time-Data Tradeoffs. Consider an inference problem with fixed risk. Vertical lines: classical estimation theory, well understood. [Figure: runtime versus number of samples n.]

6 Time-Data Tradeoffs. Consider an inference problem with fixed risk. Horizontal lines: complexity-theory lower bounds, poorly understood and dependent on the computational model. [Figure: runtime versus number of samples n.]

7 Time-Data Tradeoffs. Consider an inference problem with fixed risk. More data means a smaller runtime upper bound, so we need weaker algorithms for larger datasets. [Figure: runtime versus number of samples n.]

8 A Denoising Problem. Signal from a known (bounded) set; noise; observation model: observe n i.i.d. samples.

9 Convex Programming Estimator. The sample mean is a sufficient statistic. A natural estimator and its convex relaxation: C is a convex set that contains the signal set.
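The idea of slides 8 and 9 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the signal set is taken to be the sign vectors {-1, +1}^p, the noise is Gaussian, and the estimator is the Euclidean projection of the sample mean onto a convex set C. None of these choices, nor the function names, come from the slides. Two relaxations are compared: the hypercube [-1, 1]^p and the looser ball of radius sqrt(p), both of which contain the assumed signal set.

```python
import math
import random

def sample_mean(x_star, sigma, n, rng):
    """Average n i.i.d. observations y_i = x_star + sigma * z_i with z_i ~ N(0, I)."""
    p = len(x_star)
    total = [0.0] * p
    for _ in range(n):
        for j in range(p):
            total[j] += x_star[j] + sigma * rng.gauss(0.0, 1.0)
    return [t / n for t in total]

def project_hypercube(y):
    """Euclidean projection onto C = [-1, 1]^p (coordinate-wise clipping)."""
    return [max(-1.0, min(1.0, v)) for v in y]

def project_l2_ball(y, radius):
    """Euclidean projection onto the l2 ball of the given radius (rescaling)."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm <= radius:
        return list(y)
    return [v * radius / norm for v in y]

def average_risk(project, x_star, sigma, n, trials, seed=0):
    """Monte Carlo estimate of E||x_hat - x_star||^2 for a projection estimator."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        y_bar = sample_mean(x_star, sigma, n, rng)
        x_hat = project(y_bar)
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x_star))
    return total / trials

if __name__ == "__main__":
    p = 30
    x_star = [1.0] * p  # hypothetical signal: a vertex of the hypercube
    cube = average_risk(project_hypercube, x_star, 1.0, 10, 200, seed=1)
    ball = average_risk(lambda y: project_l2_ball(y, math.sqrt(p)), x_star, 1.0, 10, 200, seed=1)
    print(f"hypercube relaxation risk: {cube:.3f}")
    print(f"l2-ball relaxation risk:   {ball:.3f}")
```

Both projections are cheap here, so this toy case shows only the statistical side of the tradeoff: at the same n, the looser relaxation tends to give larger risk, so it needs more samples to reach the same risk level.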

10 Statistical Performance of Estimator. Consider the cone of feasible directions into C.

11 Statistical Performance of Estimator. Theorem (risk of the estimator). Intuition: only consider error in the feasible cone. Can be refined for better bias-variance tradeoffs.

12 Value. Corollary: a condition on n to obtain risk at most 1. Key point: if we have access to larger n, we can use a larger C.
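The slides state the theorem and corollary without showing formulas. The following LaTeX sketch records the result as given, to the best of my reading, in the cited Chandrasekaran and Jordan (2013) paper. All notation here (x^*, S, \sigma, \bar{y}, \hat{x}_n(C), T_C, g, B_2^p) is assumed rather than taken from the slides, and the statement should be checked against the paper.

```latex
% Assumed notation: x^* \in S \subset \mathbb{R}^p is the signal, C \supseteq S is convex,
% T_C(x^*) is the cone of feasible directions into C at x^*,
% B_2^p is the unit Euclidean ball, and Z \sim \mathcal{N}(0, I_p).
\begin{align*}
  Y_i &= x^* + \sigma Z_i, \qquad Z_i \sim \mathcal{N}(0, I_p), \quad i = 1, \dots, n, \\
  \hat{x}_n(C) &= \arg\min_{x \in C} \, \lVert \bar{y} - x \rVert_2^2,
      \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} Y_i, \\
  g(D) &= \mathbb{E}\Big[ \sup_{d \in D} \langle d, Z \rangle^2 \Big]
      \quad \text{(Gaussian squared complexity)}, \\
  \mathbb{E}\, \lVert x^* - \hat{x}_n(C) \rVert_2^2
      &\le \frac{\sigma^2}{n} \, g\big( T_C(x^*) \cap B_2^p \big), \\
  n \ge \sigma^2 \, g\big( T_C(x^*) \cap B_2^p \big)
      &\implies \mathbb{E}\, \lVert x^* - \hat{x}_n(C) \rVert_2^2 \le 1 .
\end{align*}
```

Read this way, enlarging C enlarges the feasible cone and hence g, which the bound offsets with a larger n. This matches the key point that more samples permit a weaker relaxation.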

13 Hierarchy of Convex Relaxations. If we have access to larger n, we can use a larger C → obtain a weaker estimation algorithm.

14 Hierarchy of Relaxations. If the signal set is algebraic, then one can obtain a family of outer convex approximations: polyhedral, semidefinite, and hyperbolic relaxations (Sherali-Adams, Parrilo, Lasserre, Garding, Renegar). The sets are ordered by computational complexity. A central role is played by lift-and-project.

15 Example: the signal set consists of cut matrices. E.g., collaborative filtering, clustering.

16 Other Examples
Estimation over the set that consists of all adjacency matrices of graphs with only a clique on square-root many nodes (cf. sparse PCA).
Banding estimators for covariance matrices where we don't know a priori the ordering of the variables, but must infer the ordering.
Estimation of perfect matchings in a graph (cf. network inference).

17 2. Distributed Resampling Methods. There is a critical need to obtain confidence intervals for VoI measures and other statistics in order to make reliable decisions. Analytical confidence intervals are rarely available. Resampling methods such as the bootstrap offer a solution, but there is a major scalability issue.
Kleiner, A., Talwalkar, A., Sarkar, P., & Jordan, M. (in press). A scalable bootstrap for massive data. Journal of the Royal Statistical Society, B.

18 The Bootstrap. Use the observed data to simulate multiple datasets of size n:
Repeatedly resample n points with replacement from the original dataset of size n.
Compute \hat{\theta}_n^* on each resample.
Compute \xi based on these multiple realizations of \hat{\theta}_n^* as our estimate of \xi for \hat{\theta}_n.
Diagram: X_1, ..., X_n → resamples X_1^{(j)}, ..., X_n^{(j)} → \hat{\theta}_n^{(j)} for j = 1, ..., m → \xi(\hat{\theta}_n^{(1)}, ..., \hat{\theta}_n^{(m)}).
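The resampling procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name and the choice of the sample mean as the estimator and the standard deviation as the quality measure \xi are assumptions made for the example.

```python
import random
import statistics

def bootstrap(data, estimator, quality, m, seed=0):
    """Draw m resamples of size n with replacement, apply the estimator to each,
    and summarize the m estimates with the quality measure xi."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(m):
        resample = rng.choices(data, k=n)  # n points drawn with replacement
        estimates.append(estimator(resample))
    return quality(estimates)

if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(0.0, 1.0) for _ in range(500)]
    # Hypothetical choice: xi is the standard error of the sample mean.
    se = bootstrap(data, statistics.fmean, statistics.stdev, m=200)
    print(f"bootstrap standard error of the mean: {se:.4f}")
```

For unit-variance data with n = 500, the result should be near 1/sqrt(500), roughly 0.045. Every resample has the full size n, which is the scalability problem the next slide raises.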

19 The Bootstrap: Computational Issues. The bootstrap is seemingly a wonderful match to modern parallel and distributed computing platforms. But the expected number of distinct points in a bootstrap resample is ~0.632n; e.g., if the original dataset has size 1 TB, then expect a resample to have size ~632 GB. We can't feasibly send resampled datasets of this size to distributed servers. Even if one could, we can't compute the estimate locally on datasets this large.
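The 0.632 figure follows from a short calculation: a given point is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e, so the expected fraction of distinct points tends to 1 - 1/e ≈ 0.632. The sketch below checks this numerically; the function names are illustrative only.

```python
import math
import random

def expected_distinct_fraction(n):
    """Exact expected fraction of distinct points in a size-n resample of n points."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def simulated_distinct_fraction(n, seed=0):
    """Fraction of distinct indices observed in one simulated resample."""
    rng = random.Random(seed)
    return len({rng.randrange(n) for _ in range(n)}) / n

if __name__ == "__main__":
    n = 100_000
    print(f"exact:     {expected_distinct_fraction(n):.4f}")
    print(f"simulated: {simulated_distinct_fraction(n):.4f}")
    print(f"limit:     {1.0 - math.exp(-1.0):.4f}")
```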

20 The Bag of Little Bootstraps.
From X_1, ..., X_n, draw m_1 subsamples \check{X}_1^{(i)}, ..., \check{X}_{b(n)}^{(i)} of size b(n), for i = 1, ..., m_1.
From each subsample i, draw m_2 resamples X_1^{(k)}, ..., X_n^{(k)} of size n and compute \hat{\theta}_n^{(k)} on each, for k = 1, ..., m_2.
For each subsample i, compute \xi_i = \xi(\hat{\theta}_n^{(1)}, ..., \hat{\theta}_n^{(m_2)}).
Output avg(\xi_1, ..., \xi_{m_1}).
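The diagram on this slide can be sketched as follows. The parameter names b, m1, and m2 mirror the slide's b(n), m_1, and m_2; everything else is an assumption for illustration. That includes the function names, the weighted-mean estimator, and the choice to represent each size-n resample by multinomial counts over the b subsample points, so that only b distinct values are ever stored.

```python
import random
import statistics
from collections import Counter

def weighted_mean(values, weights):
    """Mean of a resample represented as (distinct values, multiplicities)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def bag_of_little_bootstraps(data, weighted_estimator, quality, b, m1, m2, seed=0):
    """Average, over m1 subsamples of size b, the quality measure xi computed
    from m2 size-n resamples of each subsample."""
    rng = random.Random(seed)
    n = len(data)
    xis = []
    for _ in range(m1):
        subsample = rng.sample(data, b)  # b points drawn without replacement
        estimates = []
        for _ in range(m2):
            # n draws with replacement from the b points, stored as counts
            counts = Counter(rng.choices(range(b), k=n))
            weights = [counts[i] for i in range(b)]
            estimates.append(weighted_estimator(subsample, weights))
        xis.append(quality(estimates))
    return statistics.fmean(xis)

if __name__ == "__main__":
    rng = random.Random(7)
    data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    b = int(len(data) ** 0.7)  # hypothetical subsample size b(n) = n^0.7
    se = bag_of_little_bootstraps(data, weighted_mean, statistics.stdev, b, m1=5, m2=50)
    print(f"BLB standard error of the mean: {se:.4f}")
```

Each worker only needs its own subsample of b points, which is what makes this layout attractive for distributed settings. The stdlib sampling here still performs n draws per resample, so the sketch illustrates the data layout rather than an optimized implementation.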

21 Empirical Results: Bag of Little Bootstraps (BLB). [Figure: relative error versus time (sec); legend: BLB-0.7, BOOT.]

22 3. Theory of Ranking and Theory of Privacy. Human operators often prefer ranked alternatives to binary or multiway tests. Human data-providers often desire a controllable degree of anonymity. How do we develop VoI-based methods that permit statistical inference with ranking losses and under privacy constraints?
Duchi, J., Mackey, L., & Jordan, M. (in press). The asymptotics of ranking algorithms. Annals of Statistics.
Duchi, J., Jordan, M., & Wainwright, M. (in press). Local privacy and statistical minimax rates. Symposium on Foundations of Computer Science (FOCS).

23 4. Streaming Variational Bayes. Large, streaming data sets are increasingly the norm. Inference for Big Data has generally been non-Bayesian. Advantages of Bayes: complex models, coherent treatment of uncertainty, etc. We deliver:
SDA-Bayes, a framework for Streaming, Distributed, Asynchronous Bayesian inference.
Experiments demonstrating streaming topic discovery with comparable predictive performance to non-streaming algorithms (on Wikipedia and the scientific journal Nature).
Broderick, T., Boyd, N., Wibisono, A., Wilson, A., & Jordan, M. (in press). Streaming variational Bayes. In Proceedings of the Neural Information Processing Conference.
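The streaming and asynchronous update pattern can be illustrated with a toy model. This sketch is not SDA-Bayes itself: it uses an exact conjugate Beta-Bernoulli update as a stand-in for the variational approximation, and all function names are hypothetical. In the streaming variant, the posterior after one minibatch serves as the prior for the next. In the asynchronous variant, each worker starts from a possibly stale copy of the posterior and the master adds the resulting change in parameters.

```python
def local_posterior(prior, batch):
    """Exact Beta(a, b) update for a minibatch of 0/1 observations."""
    a, b = prior
    ones = sum(batch)
    return (a + ones, b + len(batch) - ones)

def streaming_update(prior, batches):
    """Process minibatches in sequence; each posterior is the next prior."""
    posterior = prior
    for batch in batches:
        posterior = local_posterior(posterior, batch)
    return posterior

def asynchronous_update(prior, batches):
    """Simulate workers that all read the initial prior (a stale copy),
    compute a local posterior, and send back only the parameter change."""
    master = prior
    stale = prior
    for batch in batches:
        local = local_posterior(stale, batch)
        delta = (local[0] - stale[0], local[1] - stale[1])
        master = (master[0] + delta[0], master[1] + delta[1])
    return master

if __name__ == "__main__":
    batches = [[1, 0, 1], [1, 1], [0]]
    print(streaming_update((1, 1), batches))
    print(asynchronous_update((1, 1), batches))
```

In this conjugate toy case the two schemes give the same posterior regardless of staleness. With an approximate inference step in place of `local_posterior`, that would generally no longer hold exactly.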

24 Streaming Variational Bayes. SDA performs at least as well as stochastic variational inference (SVI), an algorithm not designed for the streaming setting (table shows SDA with 32 threads or one thread). More threads improve SDA performance and runtime.

25 Future and ongoing focus areas
Support recovery and time/data tradeoffs
Communication-complexity lower bounds for distributed estimation
Nonparametric streaming variational Bayes
Aggregation-based ranking algorithms
Diagnostics for resampling-based confidence intervals
Topological notions of privacy

26 Publications in Year 2
V. Chandrasekaran & M. Jordan, Computational and statistical tradeoffs via convex relaxation, Proceedings of the National Academy of Sciences, 110, E1181-E1190, 2013.
J. Duchi, L. Mackey, & M. Jordan, The asymptotics of ranking algorithms, Annals of Statistics, in press.
A. Kleiner, A. Talwalkar, P. Sarkar, & M. Jordan, A scalable bootstrap for massive data, Journal of the Royal Statistical Society, Series B, in press.
M. Jordan, On statistics, computation and scalability, Bernoulli, 19.
J. Duchi, M. Jordan, & M. Wainwright, Local privacy and statistical minimax rates, Symposium on Foundations of Computer Science (FOCS).
Y. Zhang, J. Duchi, M. Jordan, & M. Wainwright, Information-theoretic lower bounds for distributed statistical estimation with communication constraints, In Proceedings of NIPS.
T. Broderick, N. Boyd, A. Wibisono, A. Wilson, & M. Jordan, Streaming variational Bayes, In Proceedings of NIPS.
J. Duchi, M. Jordan, & M. Wainwright, Local privacy and minimax bounds: Sharp rates for probability estimation, In Proceedings of NIPS.
A. Kleiner, A. Talwalkar, S. Agarwal, M. Jordan, & I. Stoica, A general bootstrap performance diagnostic, ACM Conference on Knowledge Discovery and Data Mining (SIGKDD).
X. Pan, J. Gonzalez, S. Jegelka, T. Broderick, & M. Jordan, Optimistic concurrency control for distributed unsupervised learning, In Proceedings of NIPS.
F. Wauthier, M. Jordan, & N. Jojic, Efficient ranking from pairwise comparisons, In Proceedings of ICML, 2013.
