Linear Regression Optimization

Gradient Descent

Linear Regression Optimization Goal: find $\mathbf{w}$ that minimizes $f(\mathbf{w}) = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2$. A closed-form solution exists; Gradient Descent is iterative (intuition: go downhill!). Scalar objective ($d = 1$): $f(w) = \|w\mathbf{x} - \mathbf{y}\|_2^2 = \sum_{j=1}^{n} (w x^{(j)} - y^{(j)})^2$ [figure: $f(w)$ plotted against $w$, with the minimum at $w^*$]

Gradient Descent Start at a random point. Determine a descent direction, choose a step size, and update. [figure: $f(w)$ plotted against $w$, with the starting point $w_0$, the updated point $w_1$, and the minimum $w^*$]

Gradient Descent Start at a random point. Repeat: determine a descent direction, choose a step size, update. Until the stopping criterion is satisfied. [figure: $f(w)$ plotted against $w$, with the iterates $w_0, w_1, w_2$ descending toward the minimum $w^*$]

Where Will We Converge? Convex $f(w)$: any local minimum is a global minimum. Non-convex $g(w)$: multiple local minima may exist. Least Squares, Ridge Regression, and Logistic Regression are all convex! [figure: a convex $f(w)$ with a single minimum $w^*$, and a non-convex $g(w)$ with both a local minimum and a global minimum $w^*$]

Choosing Descent Direction (1D) If the slope $\frac{df}{dw}(w_0)$ is positive, go left! If it is negative, go right! If it is zero, we're done! In 1D we can only move in two directions, and the negative slope is the direction of descent. Update rule: $w_{i+1} = w_i - \alpha_i \frac{df}{dw}(w_i)$, where $\alpha_i$ is the step size and $-\frac{df}{dw}(w_i)$ is the negative slope. [figure: $f(w)$ with a starting point $w_0$ on either side of the minimum $w^*$]

Choosing Descent Direction (2D) Example: [figure: function values shown in black/white, where black represents higher values; arrows are gradients. "Gradient2" by Sarang, licensed under CC BY-SA 2.5 via Wikimedia Commons: http://commons.wikimedia.org/wiki/file:gradient2.svg#/media/file:gradient2.svg] In $d$ dimensions we can move anywhere in $\mathbb{R}^d$, and the negative gradient is the direction of steepest descent. Update rule: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \nabla f(\mathbf{w}_i)$, where $\alpha_i$ is the step size and $-\nabla f(\mathbf{w}_i)$ is the negative gradient.

Gradient Descent for Least Squares Update rule: $w_{i+1} = w_i - \alpha_i \frac{df}{dw}(w_i)$ Scalar objective: $f(w) = \|w\mathbf{x} - \mathbf{y}\|_2^2 = \sum_{j=1}^{n} (w x^{(j)} - y^{(j)})^2$ Derivative (chain rule): $\frac{df}{dw}(w) = 2 \sum_{j=1}^{n} (w x^{(j)} - y^{(j)})\, x^{(j)}$ Scalar update (factor of 2 absorbed into $\alpha$): $w_{i+1} = w_i - \alpha_i \sum_{j=1}^{n} (w_i x^{(j)} - y^{(j)})\, x^{(j)}$ Vector update: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \sum_{j=1}^{n} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$
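
To make these updates concrete, here is a minimal NumPy sketch of batch gradient descent for least squares. The toy data, the constant step size, and all variable names are my own assumptions for illustration, not part of the lecture.

```python
import numpy as np

# Toy data: n points in d dimensions (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.standard_normal((n, d))                    # rows are the points x^(j)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)        # starting point
alpha = 0.001          # small constant step size, chosen here for stability

for i in range(1000):
    residuals = X @ w - y          # w^T x^(j) - y^(j) for all j
    gradient = X.T @ residuals     # sum_j (w^T x^(j) - y^(j)) x^(j)
    w = w - alpha * gradient       # vector update (factor of 2 absorbed into alpha)

print(w)   # approaches the least-squares solution, roughly [1.0, -2.0, 0.5]
```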

Choosing Step Size Too small: converges very slowly. Too big: overshoots and can even diverge. Solution: reduce the step size over time; theoretical convergence results exist for various step-size schedules. A common choice is $\alpha_i = \frac{\alpha}{n \sqrt{i}}$, where $\alpha$ is a constant, $n$ is the number of training points, and $i$ is the iteration number. [figure: three plots of $f(w)$ illustrating a step size that is too small, one that is too big, and one that decreases over time]
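
As a small illustration, the decaying schedule above could be coded as follows; this is a sketch assuming the schedule is $\alpha_i = \alpha / (n\sqrt{i})$, and the function name is mine.

```python
import numpy as np

def step_size(alpha, n, i):
    """Decaying step size alpha_i = alpha / (n * sqrt(i)) for iteration i >= 1."""
    return alpha / (n * np.sqrt(i))

# With alpha = 1.0 and n = 100 training points:
# step_size(1.0, 100, 1)   -> 0.01
# step_size(1.0, 100, 100) -> 0.001
```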

Parallel Gradient Descent for Least Squares Vector update: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha_i \sum_{j=1}^{n} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$ Compute the summands in parallel! (Note: all workers must have $\mathbf{w}_i$.) Example with $n = 6$ and 3 workers holding $\{\mathbf{x}^{(1)}, \mathbf{x}^{(5)}\}$, $\{\mathbf{x}^{(4)}, \mathbf{x}^{(6)}\}$, $\{\mathbf{x}^{(3)}, \mathbf{x}^{(2)}\}$: storage is $O(nd)$, distributed. map: each worker computes $(\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$ for its local points; $O(nd)$ distributed computation, $O(d)$ local storage. reduce: sum the summands to obtain $\sum_{j=1}^{n} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$ and form $\mathbf{w}_{i+1}$; $O(d)$ local computation, $O(d)$ local storage.
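
A hedged PySpark sketch of this map/reduce pattern is below. The toy data, step-size choice, and iteration count are my own, and a production implementation (e.g., MLlib's) would differ; the point is only that each worker maps over its local points and the reduce ships $O(d)$ per worker.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[3]", "parallel-gd-sketch")

# Toy data: n = 6 points (x, y) in d = 3 dimensions, spread over 3 partitions.
rng = np.random.default_rng(0)
points = [(rng.standard_normal(3), float(rng.standard_normal())) for _ in range(6)]
data = sc.parallelize(points, numSlices=3).cache()   # persist across iterations

n = data.count()
w = np.zeros(3)
alpha = 0.5

for i in range(1, 51):
    # map: each worker computes its local summands (w^T x^(j) - y^(j)) x^(j)
    # reduce: sum the d-dimensional summands; only O(d) per worker is communicated
    gradient = (data.map(lambda p: (w.dot(p[0]) - p[1]) * p[0])
                    .reduce(lambda a, b: a + b))
    w = w - (alpha / (n * np.sqrt(i))) * gradient     # decaying step size

print(w)   # current iterate of the weight vector
sc.stop()
```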

Gradient Descent Summary Pros: Easily parallelized Cheap at each iteration Stochastic variants can make things even cheaper Cons: Slow convergence (especially compared with closed-form) Requires communication across nodes!

Communication Hierarchy

Communication Hierarchy CPU: about 2 billion cycles/sec per core. Clock speeds are not changing, but the number of cores is growing with Moore's Law.

Communication Hierarchy RAM: 10-100 GB, roughly 50 GB/s to the CPU. Capacity is growing with Moore's Law.

Communication Hierarchy Disk: 1-2 TB, roughly 100 MB/s. Capacity is growing exponentially, but speed is not. (RAM to CPU: 50 GB/s.)

Communication Hierarchy Network: nodes in the same rack communicate through a top-of-rack switch at about 10 Gbps (1 GB/s); nodes in other racks at about 3 Gbps. (For comparison: CPU to RAM 50 GB/s, disk 100 MB/s.)

Summary Access rates fall sharply with distance: there is a 50x gap between memory and the network! CPU to RAM: 50 GB/s; local disks: 1 GB/s; nodes in the same rack: 1 GB/s; nodes in different racks: 0.3 GB/s. Must be mindful of this hierarchy when developing parallel algorithms!

Distributed ML: Communication Principles

Communication Hierarchy Access rates fall sharply with distance. Parallelism makes computation fast; the network makes communication slow. CPU to RAM: 50 GB/s; local disks: 1 GB/s; nodes in the same rack: 1 GB/s; nodes in different racks: 0.3 GB/s. Must be mindful of this hierarchy when developing parallel algorithms!

2nd Rule of thumb Perform parallel and in-memory computation. Persisting in memory reduces communication, especially for iterative computation such as gradient descent. Scale-up (a powerful multicore machine): no network communication, but expensive hardware that eventually hits a wall. [figure: a single machine with multiple CPUs, RAM, and disk]

2nd Rule of thumb Perform parallel and in-memory computation. Persisting in memory reduces communication, especially for iterative computation such as gradient descent. Scale-out (distributed, e.g., cloud-based): need to deal with network communication, but commodity hardware scales to massive problems. Persist the training data in memory across iterations! [figure: multiple machines, each with CPU, RAM, and disk, connected by a network]

3rd Rule of thumb Minimize Network Communication Q: How should we leverage distributed computing while mitigating network communication? First Observation: We need to store and potentially communicate Data, Model and Intermediate objects A: Keep large objects local

3rd Rule of thumb Minimize Network Communication - Stay Local Example: linear regression with big $n$ and small $d$. Solve via the closed form (not iterative!), communicate $O(d^2)$ intermediate data, and compute locally on the data (data parallel). With $n = 6$ and 3 workers holding $\{\mathbf{x}^{(1)}, \mathbf{x}^{(5)}\}$, $\{\mathbf{x}^{(4)}, \mathbf{x}^{(6)}\}$, $\{\mathbf{x}^{(3)}, \mathbf{x}^{(2)}\}$: map: each worker computes $\mathbf{x}^{(i)} (\mathbf{x}^{(i)})^\top$ for its local points. reduce: sum the outer products and invert, $\big(\sum_i \mathbf{x}^{(i)} (\mathbf{x}^{(i)})^\top\big)^{-1}$.
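
Here is a hedged PySpark sketch of this data-parallel, closed-form pattern. The toy data and names are mine, and I also accumulate $\mathbf{X}^\top \mathbf{y}$, which the final solve needs but the slide's map/reduce diagram omits.

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[3]", "closed-form-sketch")

# Toy data: n = 6 points (x, y) with small d = 3, spread over 3 partitions.
rng = np.random.default_rng(0)
points = [(rng.standard_normal(3), float(rng.standard_normal())) for _ in range(6)]
data = sc.parallelize(points, numSlices=3)

# map: each worker locally computes x^(i) (x^(i))^T (d x d) and x^(i) y^(i) (length d)
xtx = data.map(lambda p: np.outer(p[0], p[0])).reduce(lambda a, b: a + b)
xty = data.map(lambda p: p[0] * p[1]).reduce(lambda a, b: a + b)

# Only O(d^2) data crosses the network; the final solve is cheap and local.
w = np.linalg.solve(xtx, xty)     # (X^T X)^{-1} X^T y
print(w)
sc.stop()
```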

3rd Rule of thumb Minimize Network Communication - Stay Local Example: linear regression with big $n$ and big $d$. Use gradient descent and communicate $\mathbf{w}_i$; $O(d)$ communication is OK for fairly large $d$. Compute locally on the data (data parallel): map: each worker computes $(\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$ for its local points. reduce: sum to $\sum_{j=1}^{n} (\mathbf{w}_i^\top \mathbf{x}^{(j)} - y^{(j)})\, \mathbf{x}^{(j)}$ and form $\mathbf{w}_{i+1}$.

3rd Rule of thumb Minimize Network Communication - Stay Local Example: hyperparameter tuning for ridge regression with small $n$ and small $d$. The data is small, so we can communicate it. The model is a collection of regression models corresponding to different hyperparameters. Train each model locally (model parallel).
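
A hedged sketch of the model-parallel idea, assuming a toy dataset and a closed-form ridge solver of my own: the data is small enough to copy everywhere, and each hyperparameter's model can be trained independently (here in a local loop; on a cluster each fit could run on a different worker).

```python
import numpy as np

# Small toy dataset (small n, small d), cheap enough to copy to every worker.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

def ridge_closed_form(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The "model" is the whole collection of fits, one per hyperparameter value.
lambdas = [0.01, 0.1, 1.0, 10.0]
models = {lam: ridge_closed_form(X, y, lam) for lam in lambdas}
```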

3rd Rule of thumb Minimize Network Communication - Stay Local Example: linear regression with big $n$ and huge $d$. Gradient descent needs $O(d)$ communication, which is slow with hundreds of millions of parameters. Distribute both the data and the model (data and model parallel), often relying on sparsity to reduce communication.

3rd Rule of thumb Minimize Network Communication Q: How should we leverage distributed computing while mitigating network communication? First Observation: We need to store and potentially communicate Data, Model and Intermediate objects A: Keep large objects local Second Observation: ML methods are typically iterative A: Reduce # iterations

3rd Rule of thumb Minimize Network Communication - Reduce Iterations Distributed iterative algorithms must compute and communicate In Bulk Synchronous Parallel (BSP) systems, e.g., Apache Spark, we strictly alternate between the two Distributed Computing Properties Parallelism makes computation fast Network makes communication slow Idea: Design algorithms that compute more, communicate less Do more computation at each iteration Reduce total number of iterations

3rd Rule of thumb Minimize Network Communication - Reduce Iterations Extreme: Divide-and-conquer Fully process each partition locally, communicate final result Single iteration; minimal communication Approximate results
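
A hedged single-machine illustration of the divide-and-conquer extreme (the partitioning and the simple averaging are my own choices): each partition is solved completely on its own, and only the $d$-dimensional local solutions are communicated and combined, giving an approximate answer in a single round.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((600, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(600)

partitions = np.array_split(np.arange(600), 3)          # stand-in for 3 workers

# Each "worker" solves least squares fully on its local partition.
local_solutions = [np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
                   for idx in partitions]

# One round of O(d) communication per worker, then average: approximate result.
w_approx = np.mean(local_solutions, axis=0)
print(w_approx)
```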

3rd Rule of thumb Minimize Network Communication - Reduce Iterations Less extreme: Mini-batch Do more work locally than gradient descent before communicating Exact solution, but diminishing returns with larger batch sizes
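
A hedged single-machine illustration of mini-batch updates (batch size, step-size schedule, and data are my own choices): each update uses a batch of $b$ points rather than all $n$, so in a distributed setting more local work is done per round of communication.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 1000, 3, 100
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
alpha = 0.5
for i in range(1, 501):
    batch = rng.choice(n, size=b, replace=False)          # sample a mini-batch
    grad = X[batch].T @ (X[batch] @ w - y[batch])          # gradient on the batch
    w -= (alpha / (b * np.sqrt(i))) * grad                 # decaying step size

print(w)   # should be close to the least-squares solution
```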

3rd Rule of thumb Minimize Network Communication - Reduce Iterations Throughput: how many bytes per second can be read. Latency: the cost to send a message (independent of its size). Typical latencies: memory 1e-4 ms; hard disk 10 ms; network (same datacenter) 0.25 ms; network (US to Europe) >5 ms. We can amortize latency! Send larger messages and batch their communication, e.g., train multiple models together.

1st Rule of thumb Computation and storage should be linear (in n, d) 2nd Rule of thumb Perform parallel and in-memory computation 3rd Rule of thumb Minimize Network Communication

Lab Preview

Obtain Raw Data Goal: predict a song's release year from audio features. [pipeline diagram, repeated on the following slides: Obtain Raw Data → Split Data → Feature Extraction → Supervised Learning → Evaluation → Predict, mapping the full dataset to training/validation/test sets, a model, an accuracy estimate, and predictions for new entities]

Obtain Raw Data: the Millionsong Dataset from the UCI ML Repository. Explore the features, shift the labels so that they start at 0 (for interpretability), and visualize the data.

Split Data: create training, validation, and test sets.

Feature Extraction: initially use the raw features; subsequently compare with quadratic features.

Supervised Learning: least squares regression. First implement gradient descent from scratch, then use the MLlib implementation, and visualize performance by iteration.

Evaluation (Part 1): hyperparameter tuning. Use grid search to find good values for the regularization and step size hyperparameters, evaluate using RMSE, and visualize the grid search.
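
As a preview of the idea, here is a hedged, self-contained sketch of a grid search over regularization and step size evaluated by RMSE on a validation split. The toy data, grid values, and function names are mine and are not the lab's API.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(300)
X_train, y_train = X[:200], y[:200]          # simple train/validation split
X_val, y_val = X[200:], y[200:]

def rmse(pred, labels):
    return np.sqrt(np.mean((pred - labels) ** 2))

def train_ridge_gd(X, y, reg, alpha, iters=500):
    """Gradient descent on the ridge objective ||Xw - y||^2 + reg * ||w||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(1, iters + 1):
        grad = X.T @ (X @ w - y) + reg * w
        w -= (alpha / (n * np.sqrt(i))) * grad
    return w

best = None
for reg, alpha in itertools.product([1e-10, 1e-5, 1.0], [0.01, 0.1, 1.0]):
    w = train_ridge_gd(X_train, y_train, reg, alpha)
    err = rmse(X_val @ w, y_val)
    if best is None or err < best[0]:
        best = (err, reg, alpha)

print("best (RMSE, reg, alpha):", best)
```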

Evaluation (Part 2): evaluate the final model. Evaluate using RMSE and compare to a baseline model that returns the average song year in the training data.

Predict: the final model could be used to predict the song year for new songs (we won't do this, though).