2. Blackbox hyperparameter optimization and AutoML

1 AutoML: Automatic Selection, Configuration & Composition of ML Algorithms
Tutorial at ECML PKDD 2017, Skopje
Part 2: Blackbox Hyperparameter Optimization and AutoML
Pavel Brazdil, Frank Hutter, Holger Hoos, Joaquin Vanschoren

2 Outline
Blackbox hyperparameter optimization
AutoML systems based on blackbox hyperparameter optimization

3 Blackbox Hyperparameter Optimization of ML Algorithms
[Diagram: the hyperparameter optimizer proposes a hyperparameter configuration, the ML algorithm is trained & evaluated, and its cross-validation performance f(·) is fed back to the optimizer.]

4 Blackbox Hyperparameter Optimization of ML Pipelines
[Diagram: the same loop applied to a full ML pipeline: the optimizer proposes a pipeline configuration and receives its cross-validation performance f(·).]

5 A General Notion of Hyperparameters
Hyperparameter types:
Continuous (e.g., learning rate), integer (e.g., #units), ordinal
Categorical: finite domain, unordered, e.g., {SVM, RF, NN}
The hyperparameter space has structure:
E.g., a top-level hyperparameter A chooses the algorithm: {SVM, RF}
The SVM's soft-margin parameter C is only active if A = SVM
C is a conditional hyperparameter with parent A
Hyperparameters give rise to a structured space of algorithms:
Very many possible configurations
Configurations often yield qualitatively different behaviour
(A minimal sketch of such a conditional space follows below.)
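
To make conditional hyperparameters concrete, here is a minimal, self-contained Python sketch (the domains and the RF-specific parameter are illustrative assumptions, not the tutorial's actual space): a configuration is a dict, and C is only sampled when its parent A takes the value SVM.

```python
import random

# Illustrative conditional hyperparameter space:
# the top-level categorical hyperparameter A chooses the algorithm;
# the SVM's soft-margin C is only active (sampled) when A == "SVM".
def sample_configuration(rng=random):
    config = {"A": rng.choice(["SVM", "RF"])}
    if config["A"] == "SVM":
        # C is a conditional hyperparameter with parent A;
        # a log-uniform domain is common for scale parameters.
        config["C"] = 10 ** rng.uniform(-3, 3)
    else:
        # RF-specific hyperparameters are active instead.
        config["n_trees"] = rng.randint(10, 500)
    return config

if __name__ == "__main__":
    for _ in range(3):
        print(sample_configuration())
```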

6 The Simplest Strategy: Random Search
Select configurations uniformly at random
Completely uninformed
Global search: won't get stuck in a local region
At least it's better than grid search
[Image source: Bergstra & Bengio, Random Search for Hyperparameter Optimization, JMLR 2012]
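
Random search itself is only a few lines. The sketch below re-uses the illustrative conditional space from above together with a toy stand-in for cross-validation performance (both are assumptions for illustration); it samples uniformly and keeps the best configuration seen.

```python
import random

def sample_configuration(rng=random):
    # Same illustrative conditional space as before.
    config = {"A": rng.choice(["SVM", "RF"])}
    if config["A"] == "SVM":
        config["C"] = 10 ** rng.uniform(-3, 3)
    else:
        config["n_trees"] = rng.randint(10, 500)
    return config

def evaluate(config):
    # Stand-in for expensive cross-validation; lower is better.
    if config["A"] == "SVM":
        return abs(config["C"] - 1.0)           # toy optimum at C = 1
    return abs(config["n_trees"] - 100) / 100.0  # toy optimum at 100 trees

best_config, best_loss = None, float("inf")
for _ in range(50):  # evaluation budget
    config = sample_configuration()
    loss = evaluate(config)
    if loss < best_loss:
        best_config, best_loss = config, loss
print(best_config, best_loss)
```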

7 The Other Extreme: Gradient Descent (aka Hill Climbing)
Start with some configuration
repeat
    Modify a single parameter
    if performance on a benchmark set degrades then undo the modification
until no more improvement possible (or "good enough")
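
Made runnable, the slide's pseudocode looks roughly as follows; the toy objective, the step size, and the fixed iteration budget (in place of "until no more improvement possible") are illustrative assumptions.

```python
import random

def evaluate(config):
    # Stand-in for performance on a benchmark set; lower is better.
    return (config["x"] - 2.0) ** 2 + (config["y"] + 1.0) ** 2

config = {"x": 0.0, "y": 0.0}   # start with some configuration
current = evaluate(config)
for _ in range(1000):           # fixed budget instead of convergence test
    key = random.choice(list(config))           # modify a single parameter
    old = config[key]
    config[key] = old + random.uniform(-0.5, 0.5)
    new = evaluate(config)
    if new > current:           # performance degrades:
        config[key] = old       # undo the modification
    else:
        current = new
print(config, current)
```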

8 Stochastic Local Search
Balance intensification and diversification [e.g., Hoos and Stützle, 2005]
Intensification: gradient descent
Diversification: restarts, random steps, perturbations, etc.
Prominent general methods:
Tabu search [Glover, 1986]
Simulated annealing [Kirkpatrick, Gelatt & Vecchi, 1983]
Iterated local search [Lourenço, Martin & Stützle, 2003]
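
As one concrete instance of this balance, here is a hedged simulated-annealing sketch (toy 1-d objective; the cooling schedule is an arbitrary illustrative choice). Greedy acceptance of improvements provides intensification; occasionally accepting worse moves provides diversification.

```python
import math
import random

def evaluate(x):
    # Toy 1-d objective with several local optima; lower is better.
    return x ** 2 + 3 * math.sin(5 * x)

x = random.uniform(-2, 2)
loss = evaluate(x)
temperature = 1.0
for step in range(2000):
    candidate = x + random.uniform(-0.2, 0.2)   # local move
    cand_loss = evaluate(candidate)
    # Always accept improvements (intensification); accept some
    # worsening moves with temperature-dependent probability
    # (diversification, to escape local optima).
    if cand_loss < loss or random.random() < math.exp((loss - cand_loss) / temperature):
        x, loss = candidate, cand_loss
    temperature *= 0.997  # cool down: less diversification over time
print(x, loss)
```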

9 Population-based Methods
Population of configurations
Global + local search via the population
Maintain population fitness & diversity
Examples:
Genetic algorithms [e.g., Barricelli, '57; Goldberg, '89]
Evolution strategies [e.g., Beyer & Schwefel, '02]
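
A minimal (mu + lambda) evolution-strategy sketch (population sizes, mutation strength, and the toy objective are illustrative assumptions): keep the mu fittest configurations and refill the population from parents plus mutated offspring.

```python
import random

MU, LAMBDA = 5, 20  # parents kept, offspring per generation

def evaluate(x):
    # Toy objective over a 3-d continuous configuration; lower is better.
    return sum(v ** 2 for v in x)

def mutate(x, sigma=0.3):
    return [v + random.gauss(0, sigma) for v in x]

population = [[random.uniform(-3, 3) for _ in range(3)] for _ in range(MU)]
for generation in range(100):
    offspring = [mutate(random.choice(population)) for _ in range(LAMBDA)]
    # (mu + lambda) selection: parents compete with offspring, which
    # preserves fitness; mutation maintains diversity in the population.
    population = sorted(population + offspring, key=evaluate)[:MU]
print(population[0], evaluate(population[0]))
```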

10 Estimation of Distribution Algorithms (EDA) [e.g., Pelikan, Goldberg and Lobo, 2002]
Categorize performance into "good" and "bad", and fit a model (density estimator) of the good points in the space: P(x is good)
Often: independent Gaussians for each dimension
Sample the next point to evaluate from the model
[Image source: Wikipedia]
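
A hedged NumPy sketch of this loop (toy 2-d objective; treating the best 25% of points as "good" is an illustrative choice): fit one independent Gaussian per dimension to the good points and sample the next batch from that model.

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(X):
    # Toy objective over 2-d points; lower is better.
    return ((X - np.array([1.0, -2.0])) ** 2).sum(axis=1)

X = rng.uniform(-5, 5, size=(30, 2))  # initial random population
for iteration in range(20):
    scores = evaluate(X)
    good = X[scores <= np.quantile(scores, 0.25)]   # the "good" points
    # Density estimator: one independent Gaussian per dimension.
    mu, sigma = good.mean(axis=0), good.std(axis=0) + 1e-6
    X = rng.normal(mu, sigma, size=(30, 2))         # sample next batch
print(mu)
```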

11 Bayesian Optimization
Fit a (probabilistic) model of the function, p(f | x)
Use that model to trade off exploitation vs. exploration
Prominent method for expensive blackbox optimization [Mockus et al., '78]
Recent convergence results [Srinivas et al., '10; Bull, '11; de Freitas, Smola & Zoghi, '12; Kawaguchi et al., '15]
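
A minimal Bayesian-optimization sketch under common textbook choices (Gaussian-process surrogate, expected-improvement acquisition, acquisition maximized over random candidates; the 1-d objective is a toy stand-in):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def f(x):
    # Expensive blackbox objective (toy stand-in); lower is better.
    return np.sin(3 * x) + 0.3 * x ** 2

X = rng.uniform(-3, 3, size=(5, 1))   # initial design
y = f(X).ravel()
for iteration in range(20):
    # Fit the probabilistic model p(f | x).
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    candidates = rng.uniform(-3, 3, size=(500, 1))
    mu, sigma = gp.predict(candidates, return_std=True)
    # Expected improvement over the best observation (minimization):
    # trades off exploitation (low mean) against exploration (high sigma).
    best = y.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next)[0])
print(X[np.argmin(y)], y.min())
```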

12 AutoML Challenges for Bayesian Optimization
Problems for the standard Gaussian Process (GP) approach:
Complex hyperparameter space: high-dimensional (low effective dimensionality), mixed continuous/discrete hyperparameters, conditional hyperparameters, discrete change points
Noise: sometimes heteroscedastic, large, non-Gaussian
Robustness (usability out of the box)
Model overhead (the budget is runtime, not #function evaluations)
Simple solution: random forests [Breiman, '01], adapted to yield uncertainty estimates as a mixture model over trees
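
One simple way to obtain such uncertainty estimates, sketched here with scikit-learn on synthetic data: viewing the forest as a mixture over trees, the empirical mean and spread of the per-tree predictions serve as a predictive distribution.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(2 * X).ravel() + rng.normal(0, 0.1, size=40)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
# Treat the forest as a mixture over trees: each tree contributes one
# prediction; their mean and spread give a predictive distribution.
per_tree = np.stack([tree.predict(X_test) for tree in rf.estimators_])
mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)
print(np.c_[X_test.ravel(), mean, std])
```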

13 Bayesian Optimization with Random Forests [Hutter, Hoos & Leyton-Brown, LION 2011]
SMAC: Sequential Model-based Algorithm Configuration
repeat
    construct an RF model to predict performance
    use that model to select promising configurations
    compare each selected configuration against the best known
until time budget exhausted
Distributed SMAC:
Maintain a queue of promising configurations
Compare these to the incumbent on distributed worker cores
(A hedged sketch of this loop follows below.)
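
This is a hedged, self-contained sketch of a SMAC-style loop, not the actual SMAC implementation: the objective is a toy function, candidates are selected by predicted mean rather than SMAC's expected improvement, and the comparison against the incumbent is a single evaluation instead of SMAC's racing procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def evaluate(x):
    # Stand-in for running and cross-validating the configuration.
    return np.sin(3 * x[0]) + 0.3 * x[0] ** 2 + 0.1 * x[1] ** 2

X = [rng.uniform(-3, 3, size=2) for _ in range(5)]   # initial designs
y = [evaluate(x) for x in X]
incumbent = X[int(np.argmin(y))]
for iteration in range(30):  # "until time budget exhausted" (simplified)
    # Construct an RF model of configuration -> performance.
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(np.array(X), np.array(y))
    # Use the model to select a promising configuration among candidates.
    candidates = rng.uniform(-3, 3, size=(200, 2))
    promising = candidates[int(np.argmin(rf.predict(candidates)))]
    # Compare the selected configuration against the best known.
    X.append(promising)
    y.append(evaluate(promising))
    if y[-1] < min(y[:-1]):
        incumbent = promising
print(incumbent, min(y))
```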

14 Comparing Bayesian Hyperparameter Optimizers [Eggensperger, Feurer, Hutter, Bergstra, Snoek, Hoos & Leyton-Brown, BayesOpt 2013]
Hyperparameter optimization library: automl.org/hpolib
Benchmarks: from 2-dimensional continuous hyperparameter spaces to structured ones with 768 hyperparameters
Optimizers:
SMAC [Hutter et al., '11], based on random forests
Spearmint [Snoek et al., '12], based on Gaussian processes
TPE [Bergstra et al., '11], based on 1-d distributions of good values
Results:
GP-based Spearmint is best for low-dimensional & continuous spaces
RF-based SMAC is best for high-dimensional, categorical & conditional spaces

15 Neural Networks to the Rescue?
Two recent promising models for Bayesian optimization:
Neural networks with Bayesian linear regression on the features of the output layer [Snoek et al., ICML 2015]
Fully Bayesian neural networks, trained with stochastic gradient Hamiltonian Monte Carlo [Springenberg et al., NIPS 2016]
Good performance on low-dimensional HPOlib tasks
So far not studied for: high dimensionality, conditional hyperparameters

16 Outline
Blackbox hyperparameter optimization
AutoML systems based on blackbox hyperparameter optimization

17 Auto-WEKA's AutoML Approach [Thornton, Hutter, Hoos & Leyton-Brown; KDD 2013]
Expose the choices in a machine learning framework: algorithms, hyperparameters, preprocessors, etc.
Optimize CV performance using SMAC
Obtain a true push-button solution for machine learning
Here: use the broad range of methods implemented in WEKA [Witten et al., 1999-current]:
27 base classifiers (with up to 10 parameters each)
10 meta-methods
2 ensemble methods
(A sketch of this combined selection problem follows below.)
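
To illustrate the combined algorithm selection and hyperparameter optimization objective in a hedged way, here is a sketch that substitutes scikit-learn for WEKA and plain random search for SMAC (the two-algorithm space and all parameter ranges are illustrative assumptions):

```python
import random
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def sample_configuration():
    # Conditional space: the active hyperparameters depend on the algorithm.
    if random.random() < 0.5:
        return {"A": "SVM", "C": 10 ** random.uniform(-3, 3)}
    return {"A": "RF", "n_estimators": random.randint(10, 200)}

def cv_performance(config):
    # The blackbox objective: build the chosen model and cross-validate it.
    if config["A"] == "SVM":
        model = SVC(C=config["C"])
    else:
        model = RandomForestClassifier(n_estimators=config["n_estimators"])
    return cross_val_score(model, X, y, cv=5).mean()

# Random search as a stand-in for SMAC over the joint space.
configs = [sample_configuration() for _ in range(20)]
scores = [cv_performance(c) for c in configs]
best_idx = max(range(len(configs)), key=lambda i: scores[i])
print(configs[best_idx], scores[best_idx])
```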

18 WEKA's Configuration Space
Base classifiers: 27 choices, each with up to 10 subparameters
Coarse discretization: about 10^8 instantiations
Hierarchical structure on top of the base classifiers

19 WEKA's Configuration Space (cont'd)
Feature selection:
Search method: which feature subsets to evaluate
Evaluation method: how to evaluate feature subsets during search
Both methods have subparameters; about 10^7 instantiations
In total: 768 parameters, spanning an enormous number of configurations

20 Auto-WEKA: Results
Auto-WEKA performed better than the best base classifier, even when the best base classifier was determined by an oracle
In 6/21 datasets: more than 10% reductions in relative error
Available in the WEKA package manager; downloaded 400 times per week
Comparison to full grid search (union of grids over the parameters of all 27 base classifiers):
Auto-WEKA was 100 times faster
Auto-WEKA had better test performance in 15/21 cases
Auto-WEKA based on SMAC vs. TPE [Bergstra et al., NIPS '11]:
SMAC yielded better CV performance in 19/21 cases
SMAC yielded better test performance in 14/21 cases
Differences usually small; in 3 cases substantial (SMAC better)

21 The Auto-WEKA Approach Applied to Deep Nets
[Diagram: Bayesian optimization of a deep network's structure & hyperparameters against cross-validation performance f(·); the search space includes the number of convolutional and fully connected layers, units per layer, kernel sizes, learning rates, batch sizes, dropout rates, etc.]
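
A hedged sketch of what "parameterizing the network structure" can look like (tf.keras is used for illustration; the layer counts, ranges, and shared units-per-layer setting are assumptions, not the tutorial's actual space):

```python
import tensorflow as tf

def build_cnn(config, input_shape=(32, 32, 3), n_classes=10):
    # Map a hyperparameter configuration to a concrete network.
    model = tf.keras.Sequential([tf.keras.Input(shape=input_shape)])
    for _ in range(config["n_conv_layers"]):          # structural choice
        model.add(tf.keras.layers.Conv2D(
            filters=config["units_per_layer"],
            kernel_size=config["kernel_size"],
            padding="same", activation="relu"))
        model.add(tf.keras.layers.MaxPooling2D())
    model.add(tf.keras.layers.Flatten())
    for _ in range(config["n_fc_layers"]):            # structural choice
        model.add(tf.keras.layers.Dense(config["units_per_layer"],
                                        activation="relu"))
        model.add(tf.keras.layers.Dropout(config["dropout_rate"]))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(config["learning_rate"]),
        loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# One sample point in the (illustrative) search space:
model = build_cnn({"n_conv_layers": 2, "n_fc_layers": 1,
                   "units_per_layer": 64, "kernel_size": 3,
                   "dropout_rate": 0.25, "learning_rate": 1e-3})
model.summary()
```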

22 Application 1: Object Recognition [Domhan, Springenberg & Hutter, IJCAI 2015]
Parameterized the Caffe framework [Jia, 2013]
Convolutional neural network with up to 6 layers
81 hyperparameters: 9 network hyperparameters + 12 layer-wise hyperparameters for each of the 6 layers
Results for CIFAR-10:
New best result for CIFAR-10 without data augmentation
SMAC outperformed TPE (the only other applicable hyperparameter optimizer)

23 Application 2: Movement Decoding from EEG [Schirrmeister, Fiederer, Springenberg, Eggensperger, Ball, Hutter & Tangermann, Human Brain Mapping 2017]
Convolutional neural network for motor-execution data: tap fingers on left hand / right hand / do nothing / clench toes
EEG data from 128 channels
Results for Auto-Net:
Automatically selected a useful subset of channels
Outperformed the manual solution by 10% in relative error
Per-patient optimization: cross-validation error rates reduced by a factor of 2

24 Application 3: AutoML Challenge [Mendoza, Klein, Feurer, Springenberg & Hutter, AutoML 2016]
Unstructured data: fully-connected network with up to 5 layers (with 3 layer hyperparameters each)
14 network hyperparameters; in total 29 hyperparameters
Optimized for 18 h on 5 GPUs, with a timeout of 30 minutes per network (about 500 networks evaluated)
Auto-Net won several datasets against human experts
E.g., Alexis data set (5000 features, 18 classes): test set AUC 90%, all other (manual) approaches < 80%
First automated deep learning system to win an ML competition data set against human experts
