BAYESIAN GLOBAL OPTIMIZATION

BAYESIAN GLOBAL OPTIMIZATION
Using Optimal Learning to Tune Deep Learning Pipelines
Scott Clark, scott@sigopt.com

OUTLINE
1. Why is Tuning AI Models Hard?
2. Comparison of Tuning Methods
3. Bayesian Global Optimization
4. Deep Learning Examples
5. Evaluating Optimization Strategies

Deep learning / AI is extremely powerful. Tuning these systems is extremely non-intuitive.

"What is the most important unresolved problem in machine learning? ... we still don't really know why some configurations of deep neural networks work in some cases and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters."
- Xavier Amatriain, VP Engineering at Quora (former Director of Research at Netflix)
https://www.quora.com/what-is-the-most-important-unresolved-problem-in-machine-learning-3

TUNABLE PARAMETERS IN DEEP LEARNING

STANDARD METHODS FOR HYPERPARAMETER SEARCH

STANDARD TUNING METHODS
Parameter configurations (weights, thresholds, window sizes, transformations) feed the ML / AI model, which is trained on training data and scored by cross validation against testing data.
Standard search strategies: manual search, grid search, random search?

OPTIMIZATION FEEDBACK LOOP
Training data → ML / AI model → testing data → cross validation → objective metric → REST API → new configurations → better results. A toy sketch of this loop appears below.
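This loop is the core pattern every tuning method shares: suggest a configuration, train and validate, report the metric, repeat. A minimal Python sketch, with a toy random-search optimizer standing in for any suggestion service; every name here is illustrative, not a specific library's API.

```python
# A toy sketch of the optimization feedback loop. RandomSearchOptimizer
# stands in for any suggestion service (grid, random, Bayesian); all names
# here are illustrative, not a specific library's API.
import random


def train_and_validate(config):
    """Stub for the train + cross-validate stage: a real pipeline would fit
    the model with `config` and return the held-out objective metric."""
    return -(config["lr"] - 0.01) ** 2 + random.gauss(0.0, 1e-5)


class RandomSearchOptimizer:
    def __init__(self, bounds):
        self.bounds = bounds
        self.history = []  # (config, metric) pairs observed so far

    def suggest(self):
        """Propose a new configuration uniformly at random."""
        return {name: random.uniform(lo, hi)
                for name, (lo, hi) in self.bounds.items()}

    def observe(self, config, metric):
        self.history.append((config, metric))


opt = RandomSearchOptimizer({"lr": (1e-4, 1e-1)})
for _ in range(20):                      # one trip around the feedback loop:
    config = opt.suggest()               #   new configuration
    metric = train_and_validate(config)  #   train, test, cross-validate
    opt.observe(config, metric)          #   report the objective metric
print(max(opt.history, key=lambda pair: pair[1]))  # better results
```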

BAYESIAN GLOBAL OPTIMIZATION

OPTIMAL LEARNING
"...the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive." - Prof. Warren Powell, Princeton
"What is the most efficient way to collect information?" - Prof. Peter Frazier, Cornell
"How do we make the most money, as fast as possible?" - Scott Clark, CEO, SigOpt

BAYESIAN GLOBAL OPTIMIZATION
- Optimize an objective function: loss, accuracy, likelihood
- Given parameters: hyperparameters, feature/architecture parameters
- Find the best hyperparameters while sampling the function as few times as possible; training on big data is expensive

HOW DOES IT WORK?
SMBO: Sequential Model-Based Optimization

GP/EI SMBO
1. Build a Gaussian Process (GP) with the points sampled so far
2. Optimize the fit of the GP (covariance hyperparameters)
3. Find the point(s) of highest Expected Improvement within the parameter domain
4. Return the next point(s) to sample
A sketch of one iteration of this loop is shown below.
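A minimal sketch of one GP/EI iteration using scikit-learn, for a single 1-D parameter; the slide's step 2 happens inside gp.fit(), which maximizes the GP's log marginal likelihood. All names and the toy objective are illustrative, not the talk's implementation.

```python
# One GP/EI SMBO iteration, sketched with scikit-learn on a 1-D parameter.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def expected_improvement(gp, X_candidates, y_best):
    """Closed-form Expected Improvement for maximization."""
    mu, sigma = gp.predict(X_candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive std
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)


def propose_next_point(X_sampled, y_sampled, bounds, n_candidates=1000):
    # Steps 1-2: build the GP on the points sampled so far; fit() tunes the
    # covariance hyperparameters by maximizing the log marginal likelihood.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_sampled, y_sampled)
    # Step 3: search the domain for the point of highest EI (dense random
    # candidates here; real implementations use a proper inner optimizer).
    X_candidates = np.random.uniform(bounds[0], bounds[1], (n_candidates, 1))
    ei = expected_improvement(gp, X_candidates, y_sampled.max())
    # Step 4: return the next point to sample.
    return X_candidates[np.argmax(ei)]


X = np.random.uniform(0.0, 1.0, (5, 1))  # 5 points sampled so far
y = np.sin(6.0 * X).ravel()              # toy objective values
print(propose_next_point(X, y, bounds=(0.0, 1.0)))
```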

GAUSSIAN PROCESSES
[Figure sequence: Gaussian Process posterior illustrations; final panels compare covariance fits labeled overfit, good fit, underfit]
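For reference, the standard GP regression posterior behind these illustrations (a textbook result, e.g. Rasmussen & Williams, not specific to this talk): given observations y at points X, kernel k, and noise variance sigma_n^2, the predictive mean and variance at a test point x_* are

```latex
\begin{aligned}
\mu(x_*)      &= k(x_*, X)\,\bigl[K(X,X) + \sigma_n^2 I\bigr]^{-1}\mathbf{y},\\
\sigma^2(x_*) &= k(x_*, x_*) - k(x_*, X)\,\bigl[K(X,X) + \sigma_n^2 I\bigr]^{-1} k(X, x_*).
\end{aligned}
```

Fitting the covariance hyperparameters (step 2 of the SMBO loop) is what trades off the overfit / good fit / underfit behavior the panels show.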

EXPECTED IMPROVEMENT
[Figure sequence: Expected Improvement acquisition function illustrated over successive samples]
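For reference, the closed form of Expected Improvement under a GP posterior (a standard result, stated here in the maximization convention with f^* the best value observed so far):

```latex
\mathrm{EI}(x) = \mathbb{E}\bigl[\max(f(x) - f^*,\, 0)\bigr]
             = \bigl(\mu(x) - f^*\bigr)\,\Phi(z) + \sigma(x)\,\varphi(z),
\qquad z = \frac{\mu(x) - f^*}{\sigma(x)},
```

where Phi and phi are the standard normal CDF and PDF, and EI(x) = 0 wherever sigma(x) = 0. EI is large where the posterior mean is promising (exploitation) or the posterior uncertainty is large (exploration), which is what drives the sampling behavior in the figures.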

DEEP LEARNING EXAMPLES

SIGOPT + MXNET Classify movie reviews using a CNN in MXNet

TEXT CLASSIFICATION PIPELINE
Training text → ML / AI model (MXNet; hyperparameter configurations and feature transformations) → testing text → validation → accuracy → REST API → better results.

TUNABLE PARAMETERS IN DEEP LEARNING

STOCHASTIC GRADIENT DESCENT Comparison of several RMSProp SGD parametrizations
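For reference, the standard RMSProp update (the textbook formulation; the slide's exact parametrizations are not reproduced here, and implementations vary in where they place epsilon). The tunable hyperparameters are the learning rate eta, the decay gamma, and the stabilizer epsilon:

```latex
\begin{aligned}
E[g^2]_t     &= \gamma\, E[g^2]_{t-1} + (1 - \gamma)\, g_t^2,\\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\; g_t.
\end{aligned}
```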

ARCHITECTURE PARAMETERS

TUNING METHODS Grid Search Random Search?

MULTIPLICATIVE TUNING SPEED UP

SPEED UP #1: CPU -> GPU

SPEED UP #2: RANDOM/GRID -> SIGOPT

CONSISTENTLY BETTER AND FASTER

SIGOPT + TENSORFLOW Classify house numbers in an image dataset (SVHN)

COMPUTER VISION PIPELINE
Training images → ML / AI model (TensorFlow; hyperparameter configurations and feature transformations) → testing images → cross validation → accuracy → REST API → better results.

METRIC OPTIMIZATION

SIGOPT + NEON
- All-convolutional neural network: multiple convolutional and dropout layers
- Baseline hyperparameter optimization: a mixture of domain expertise and grid search (brute force)
- Reference: http://arxiv.org/pdf/1412.6806.pdf

COMPARATIVE PERFORMANCE
- Expert baseline: 0.8995 accuracy (using neon)
- SigOpt best: 0.9011 accuracy
- 1.6% relative reduction in error rate (error drops from 0.1005 to 0.0989)
- No expert time wasted in tuning

SIGOPT + NEON
- Residual network: "explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions"
- Variable depth
- Baseline hyperparameter optimization: a mixture of domain expertise and grid search (brute force)
- Reference: http://arxiv.org/pdf/1512.03385v1.pdf

COMPARATIVE PERFORMANCE
- Expert baseline (standard method, from the paper): 0.9339 accuracy
- SigOpt best: 0.9436 accuracy
- 15% relative reduction in error rate (error drops from 0.0661 to 0.0564)
- No expert time wasted in tuning

EVALUATING THE OPTIMIZER

OUTLINE
- Metric Definitions
- Benchmark Suite
- Eval Infrastructure
- Visualization Tool
- Baseline Comparisons

METRIC: BEST_FOUND
What is the best value found after optimization completes?
BEST_FOUND: BLUE 0.7225, RED 0.8949

METRIC: AUC
How quickly is the optimum found? (area under the best-so-far curve)
BEST_FOUND: BLUE 0.9439, RED 0.9435
AUC: BLUE 0.8299, RED 0.9358
A sketch of how both metrics can be computed is shown below.
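A small sketch of both metrics over an optimization trace. The traces here are made up, and the AUC normalization (mean of the best-so-far curve) is one plausible reading; the talk does not spell out its exact formula.

```python
# BEST_FOUND and AUC computed from a sequence of observed objective values.
import numpy as np


def best_found(values):
    """BEST_FOUND: the best objective value seen over the whole run."""
    return np.max(values)


def best_so_far_auc(values):
    """AUC: area under the best-so-far curve, normalized by run length.
    Higher means the optimum was approached faster."""
    return np.maximum.accumulate(values).mean()


blue = np.array([0.20, 0.50, 0.60, 0.70, 0.94])  # slow to improve
red = np.array([0.90, 0.93, 0.94, 0.94, 0.94])   # fast to improve

print(best_found(blue), best_found(red))            # nearly identical
print(best_so_far_auc(blue), best_so_far_auc(red))  # red clearly higher
```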

STOCHASTIC OPTIMIZATION

BENCHMARK SUITE
Optimization functions from the literature; ML datasets: LIBSVM, deep learning, etc.

TEST FUNCTION TYPE       COUNT
Continuous Params          184
Noisy Observations         188
Parallel Observations       45
Integer Params              34
Categorical Params / ML     47
Failure Observations        30
TOTAL                      489

INFRASTRUCTURE
- On-demand cluster in AWS for parallel evaluation-function optimization
- A full eval consists of ~20,000 optimizations and takes ~30 minutes

RANKING OPTIMIZERS
1. Mann-Whitney U tests using BEST_FOUND
2. Tied results are then partially ranked using AUC
3. Any remaining ties stay as ties in the final ranking

RANKING AGGREGATION
Aggregate the partial rankings across all eval functions using a Borda count (each method scores the number of methods ranked below it). A sketch of the ranking and aggregation steps is shown below.
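A sketch of the ranking-plus-aggregation procedure, assuming each method is run repeatedly on every test function. The significance level and the simplified tie handling (non-significant pairs simply score nothing, rather than being broken by AUC) are assumptions, not the talk's exact protocol.

```python
# Mann-Whitney U ranking per test function, aggregated with a Borda count.
from itertools import combinations
from statistics import median

from scipy.stats import mannwhitneyu


def borda_scores(per_function_results, alpha=0.05):
    """per_function_results: one dict per test function, mapping method name
    to its BEST_FOUND values across repeated runs. Returns Borda scores:
    for each method, the number of times another method ranked below it."""
    scores = {method: 0 for method in per_function_results[0]}
    for results in per_function_results:
        for a, b in combinations(results, 2):
            # Step 1: Mann-Whitney U test on BEST_FOUND.
            _, p = mannwhitneyu(results[a], results[b],
                                alternative="two-sided")
            if p < alpha:
                winner = a if median(results[a]) > median(results[b]) else b
                scores[winner] += 1
            # Steps 2-3 of the talk break remaining ties with AUC and leave
            # the rest tied; this sketch leaves non-significant pairs tied.
    return scores


per_function = [
    {"sigopt": [0.94, 0.95, 0.93, 0.95, 0.94],   # clearly separated: a win
     "random": [0.89, 0.90, 0.91, 0.88, 0.90]},
    {"sigopt": [0.81, 0.80, 0.82, 0.80, 0.81],   # overlapping: a tie
     "random": [0.80, 0.79, 0.81, 0.80, 0.79]},
]
print(borda_scores(per_function))
```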

SHORT RESULTS SUMMARY

BASELINE COMPARISONS

SIGOPT SERVICE

OPTIMIZATION FEEDBACK LOOP
Training data → ML / AI model → testing data → cross validation → objective metric → REST API → new configurations → better results.

SIMPLIFIED OPTIMIZATION
- Client libraries: Python, Java, R, MATLAB, and more
- Framework integrations: TensorFlow, scikit-learn, xgboost, Keras, Neon, and more
- Live Demo

DISTRIBUTED TRAINING
- SigOpt serves as a distributed scheduler for training models across workers
- Workers access the SigOpt API for the latest parameters to try for each model
- Enables easy distributed training of non-distributed algorithms across any number of models
A sketch of the worker loop is shown below.
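A minimal sketch of the suggest/observe worker loop, based on the sigopt-python client of this era; the experiment definition and parameter names are illustrative, and exact endpoints should be checked against the current SigOpt documentation.

```python
# Suggest/observe loop against the SigOpt REST API (sigopt-python client).
from sigopt import Connection


def train_and_evaluate(assignments):
    """Stub: train the model with the suggested assignments and return the
    objective metric (e.g. cross-validated accuracy)."""
    return -(assignments["log_learning_rate"] + 3.0) ** 2


conn = Connection(client_token="YOUR_API_TOKEN")
experiment = conn.experiments().create(
    name="CNN text classifier",  # illustrative experiment definition
    parameters=[
        dict(name="log_learning_rate", type="double",
             bounds=dict(min=-5.0, max=-1.0)),
        dict(name="num_filters", type="int",
             bounds=dict(min=32, max=256)),
    ],
)

# Each worker in a cluster can run this same loop independently; the API
# hands out suggestions and collects observations, acting as the scheduler.
for _ in range(50):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    value = train_and_evaluate(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=value,
    )
```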

COMPARATIVE PERFORMANCE
Better results, faster and cheaper: quickly get the most out of your models with our proven, peer-reviewed ensemble of Bayesian and global optimization methods.
- A Stratified Analysis of Bayesian Optimization Methods (ICML 2016)
- Evaluation System for a Bayesian Optimization Service (ICML 2016)
- Interactive Preference Learning of Utility Functions for Multi-Objective Optimization (NIPS 2016)
- And more...
Fully featured:
- Tune any model in any pipeline
- Scales to 100 continuous, integer, and categorical parameters and many thousands of evaluations
- Parallel tuning support across any number of models
- Simple integrations with many languages and libraries
- Powerful dashboards for introspecting your models and optimization
- Advanced features like multi-objective optimization, failure-region support, and more
Secure black-box optimization: your data and models never leave your system.

Try it yourself! https://sigopt.com/getstarted

Questions? contact@sigopt.com https://sigopt.com @SigOpt