Parallel Methods for Convex Optimization A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney

Problems: minimize $g(x) + f(x; A, b)$
- Sparse regression: $g(x) = \|x\|_1$, $f(x) = \|Ax - b\|_2^2$
- Sparse SVM: $g(x) = \|x\|_1$, $f(x) = \sum_{i=1}^{m} \max(1 - b_i A_i^T x,\, 0)^2$
- Elastic net: $g(x) = \frac{\lambda}{2}\|x\|_2^2 + (1 - \lambda)\|x\|_1$
- Group lasso: $g(x) = \sum_{j=1}^{J} \|x_j\|_{K_j}$

[Figure: two point clusters (cluster 1, cluster 2) in the $(a_1, a_2)$ plane separated by a linear classifier.]

Numerous applications
- Medical sciences: cancer diagnosis; prediction of common diseases such as diabetes and pre-diabetes
- Biological sciences: protein remote homology detection; protein function prediction; gene expression data classification
- Finance: financial time series forecasting; stock market prediction
- Web applications: web search; email filtering; relevance feedback; classification of articles, emails, and web pages
- Geo- and spatiotemporal environmental analysis

Existing directions and our goals
- Parallelizing linear algebra is monopolized by the HPC community; parallelizing optimization methods is monopolized by the ML community.
- ML community: data-parallel approach; uses Spark.
- HPC community: parallelizes linear algebra; uses MPI (the lingua franca of HPC).
- Our goal: generalize HPC/LA techniques to convex optimization.

First attempt at parallelism
- In ML, first-order methods are often king: optimal worst-case FLOPS, excellent for low-accuracy solutions.
- Let's parallelize the gradient computation of first-order methods (sketched below). What could go wrong?
- Communication becomes the bottleneck and the methods are slow.
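
To make the communication cost concrete, here is a minimal sketch (not the authors' code) of data-parallel gradient descent for the least-squares term $f(x) = \|Ax - b\|_2^2$, written in C++ with MPI as in the talk's experiments; the function and variable names are illustrative. Each rank holds a block of rows of A, and every iteration ends with an MPI_Allreduce of the full length-n gradient, which is exactly the per-iteration communication that becomes the bottleneck.

    // Sketch: data-parallel gradient descent for f(x) = ||Ax - b||_2^2.
    // Each rank owns a block of rows (A_loc, b_loc); the iterate x is replicated.
    // One MPI_Allreduce of an n-vector per iteration, so communication dominates
    // once the local FLOPs are cheap (sparse data, many iterations).
    #include <mpi.h>
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void gradient_descent(const std::vector<std::vector<double>>& A_loc, // m_loc x n rows
                          const std::vector<double>& b_loc,              // m_loc entries
                          std::vector<double>& x,                        // n entries, replicated
                          double step, int iters) {
      const int n = static_cast<int>(x.size());
      std::vector<double> g_loc(n), g(n);
      for (int t = 0; t < iters; ++t) {
        // Local part of the gradient: 2 * A_loc^T (A_loc * x - b_loc).
        std::fill(g_loc.begin(), g_loc.end(), 0.0);
        for (std::size_t i = 0; i < A_loc.size(); ++i) {
          double r = -b_loc[i];
          for (int j = 0; j < n; ++j) r += A_loc[i][j] * x[j];
          for (int j = 0; j < n; ++j) g_loc[j] += 2.0 * r * A_loc[i][j];
        }
        // The bottleneck: one global reduction per iteration.
        MPI_Allreduce(g_loc.data(), g.data(), n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        for (int j = 0; j < n; ++j) x[j] -= step * g[j];
      }
    }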

Parallelism, the ML way
- Partition the data across nodes. Each node works independently on its data. The outputs are averaged (a minimal code sketch follows below).
- Control the FLOPS/communication tradeoff by the number and quality of partitions:

                          Correlated data                      Uncorrelated data
      Large #partitions   many FLOPS, much communication       few FLOPS, excessive communication
      Small #partitions   excessive FLOPS, low communication   low FLOPS, low communication

[Figure: data matrix partitioned across nodes P1, P2, ..., P.]
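
For contrast with the per-iteration reduction above, a minimal sketch of the partition-and-average pattern described on this slide (illustrative only; average_models is my name, and local_solve stands in for whatever single-node solver each partition runs):

    // Sketch: ML-style data parallelism ("partition and average").
    // Each rank solves its own subproblem with heavy local computation and no
    // communication; the P local models are combined with a single reduction.
    #include <mpi.h>
    #include <functional>
    #include <vector>

    std::vector<double> average_models(
        int n, const std::function<std::vector<double>()>& local_solve) {
      std::vector<double> x_loc = local_solve();   // many FLOPs, zero communication
      std::vector<double> x_avg(n, 0.0);
      int P = 1;
      MPI_Comm_size(MPI_COMM_WORLD, &P);
      // One reduction in total (or per averaging round), not one per iteration.
      MPI_Allreduce(x_loc.data(), x_avg.data(), n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      for (double& v : x_avg) v /= P;              // average the P partition solutions
      return x_avg;
    }

The number and quality of the partitions then move the method along the FLOPS/communication tradeoff in the table above.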

Parallelism, the ML way for l1-regularization: minimize $\|x\|_1 + f(x; A, b)$
- Partition the data and the coordinates across nodes. Each node works independently on its data and coordinates. The gradients are averaged.
- The optimal solution is often very sparse, so the algorithm turns from parallel to serial as it converges, since many nodes become idle.

[Figure: data matrix and coordinates partitioned across nodes P1, P2, ..., P.]

Parallelism, the HPC way
- Our objective: solve the communication bottleneck of first-order methods using HPC techniques.
- Trade FLOPS for communication, directly.
- We get optimal worst-case FLOPS plus communication efficiency.
- We get good scalability for all sparse data.
- This talk + preliminary experiments.

An example: coordinate descent (1 communication per iteration)

Pseudo-code:
1. Sample a column of the data.
2. Compute the partial derivative.
3. Update the solution: x_new ← x_old − step · (partial derivative).
4. Repeat.
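
A minimal sketch of this loop for the least-squares case under a 1D row partition (illustrative, not the paper's implementation; names are mine): the residual is distributed by rows, so each partial derivative requires one scalar MPI_Allreduce, i.e. one latency-bound message round per iteration.

    // Sketch: distributed coordinate descent for f(x) = ||Ax - b||_2^2 with a
    // 1D row partition. A_loc holds this rank's rows (dense here for brevity) and
    // r_loc = A_loc * x - b_loc is kept up to date locally.
    // Assumes all ranks seed std::rand identically so they sample the same column.
    #include <mpi.h>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    void coordinate_descent(const std::vector<std::vector<double>>& A_loc,
                            std::vector<double>& r_loc,                 // local residual block
                            std::vector<double>& x,                     // replicated iterate
                            const std::vector<double>& col_sq_norms,    // global ||A_j||_2^2
                            int iters) {
      const int n = static_cast<int>(x.size());
      for (int t = 0; t < iters; ++t) {
        const int j = std::rand() % n;                     // 1. sample a column of the data
        double g_loc = 0.0, g = 0.0;
        for (std::size_t i = 0; i < A_loc.size(); ++i)     // 2. local part of A_j^T r
          g_loc += A_loc[i][j] * r_loc[i];
        MPI_Allreduce(&g_loc, &g, 1, MPI_DOUBLE, MPI_SUM,  // the 1 communication per iteration
                      MPI_COMM_WORLD);
        const double delta = -g / col_sq_norms[j];         // 3. exact step along coordinate j
        x[j] += delta;                                     //    x_new = x_old + delta * e_j
        for (std::size_t i = 0; i < A_loc.size(); ++i)     // keep the residual consistent
          r_loc[i] += delta * A_loc[i][j];
      }
    }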

An example: communication-avoiding coordinate descent (1 communication round per s iterations)

Pseudo-code:
1. Compute in parallel the anticipated computations for the next s iterations.
2. Redundantly store the result on all processors P1, P2, ..., P.
3. Each processor independently computes the next s iterations.
4. Repeat.
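
Continuing the previous sketch, a schematic version of the s-step idea for the same least-squares, single-coordinate (μ = 1) setting; this is illustrative and not the authors' exact algorithm. The "anticipated computations" are the Gram entries and residual inner products of the s sampled columns, reduced and stored redundantly in one round, after which each processor unrolls the s updates locally.

    // Sketch: communication-avoiding (s-step) coordinate descent for
    // f(x) = ||Ax - b||_2^2 with a 1D row partition. One all-reduce of O(s^2)
    // numbers every s iterations replaces s separate scalar all-reduces.
    #include <mpi.h>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    void ca_coordinate_descent(const std::vector<std::vector<double>>& A_loc,
                               std::vector<double>& r_loc,   // local residual block
                               std::vector<double>& x,       // replicated iterate
                               int s, int rounds) {
      const int n = static_cast<int>(x.size());
      for (int round = 0; round < rounds; ++round) {
        std::vector<int> J(s);                               // sample s columns up front
        for (int k = 0; k < s; ++k) J[k] = std::rand() % n;  // (same seed on every rank)

        // Anticipated computations: Gram entries G[k][l] = A_{J[k]}^T A_{J[l]} and
        // inner products p[k] = A_{J[k]}^T r, packed into one buffer.
        std::vector<double> loc(s * s + s, 0.0), glob(s * s + s, 0.0);
        for (std::size_t i = 0; i < A_loc.size(); ++i)
          for (int k = 0; k < s; ++k) {
            for (int l = 0; l < s; ++l) loc[k * s + l] += A_loc[i][J[k]] * A_loc[i][J[l]];
            loc[s * s + k] += A_loc[i][J[k]] * r_loc[i];
          }
        // The single communication round: result stored redundantly on all ranks.
        MPI_Allreduce(loc.data(), glob.data(), s * s + s, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        // s updates with no further communication, using the unrolled recurrence
        // A_{J[k]}^T r_current = p[k] + sum_{l<k} G[k][l] * delta[l].
        std::vector<double> delta(s, 0.0);
        for (int k = 0; k < s; ++k) {
          double g = glob[s * s + k];
          for (int l = 0; l < k; ++l) g += glob[k * s + l] * delta[l];
          delta[k] = -g / glob[k * s + k];
          x[J[k]] += delta[k];
        }
        for (std::size_t i = 0; i < A_loc.size(); ++i)       // refresh the local residual
          for (int k = 0; k < s; ++k) r_loc[i] += delta[k] * A_loc[i][J[k]];
      }
    }

Roughly speaking, in the block case the reduced quantity becomes the Gram matrix of the sampled blocks, a dense matrix-matrix product, which is the cache-efficient (BLAS-3) local computation referred to in the speedup slides later.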

Other examples
- Block coordinate descent
- Accelerated block coordinate descent
- Newton-type (first-order++) block coordinate descent
- Proximal versions of the above
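
For the proximal versions with the $\ell_1$ penalty used throughout the talk, each update is followed by a proximal step; for $g(x) = \lambda\|x\|_1$ that step is the standard soft-thresholding operator, sketched here for reference (the function name is illustrative):

    // Proximal operator of lambda * |x_j|: soft-thresholding, applied after the
    // gradient (or coordinate) step in the proximal variants.
    #include <algorithm>
    #include <cmath>

    double soft_threshold(double v, double lambda) {
      return std::copysign(std::max(std::abs(v) - lambda, 0.0), v);
    }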

Running time in Big-O. No free lunch: we exchange FLOPS and bandwidth for latency.

$$T(s) = \gamma\left(\frac{s H \mu^2 f m}{P} + H \mu^3\right) + \alpha\,\frac{H \log P}{s} + \beta\, s H \mu^2 \log P$$

where $\alpha$ = time per message, $\beta$ = time per word, $\gamma$ = time per FLOP, $P$ = number of cores, $\mu$ = number of coordinates at each iteration, $f = \mathrm{nnz}(A)/(mn)$, and $H$ = number of iterations.

Optimal parameter s (the value of s minimizing the running time above):

$$s^{*} = \sqrt{\frac{\alpha\, P \log P}{\gamma\, \mu^2 f m + \beta\, \mu^2 P \log P}}$$

with $\alpha$ = time per message, $\beta$ = time per word, $\gamma$ = time per FLOP, $P$ = number of cores, $\mu$ = number of coordinates at each iteration, $f = \mathrm{nnz}(A)/(mn)$.

Assume A is very sparse and $\mu = 1$ (single coordinate descent). On Cori or Edison this gives roughly $s^{*} = O\big(\sqrt{10^{-6}/10^{-10}}\big) = 100$.
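
A short derivation sketch of this formula, assuming the running-time model $T(s)$ reconstructed on the previous slide and reading the slide's $10^{-6}$ and $10^{-10}$ as the per-message and per-word times on Cori/Edison:

$$T(s) = \gamma\left(\frac{s H \mu^2 f m}{P} + H \mu^3\right) + \alpha\,\frac{H \log P}{s} + \beta\, s H \mu^2 \log P$$

$$\frac{dT}{ds} = \gamma\,\frac{H \mu^2 f m}{P} + \beta\, H \mu^2 \log P - \alpha\,\frac{H \log P}{s^2} = 0 \;\Longrightarrow\; s^{*} = \sqrt{\frac{\alpha\, P \log P}{\gamma\, \mu^2 f m + \beta\, \mu^2 P \log P}}$$

With $A$ very sparse and $\mu = 1$, the $\gamma$ term is negligible and $s^{*} \approx \sqrt{\alpha/\beta} \approx \sqrt{10^{-6}/10^{-10}} = 100$, matching the estimate above.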

Scalable results for all data layouts
- Data layouts: 1D row partition, 2D block partition, or 1D column partition of the matrix A.
- *Best performance depends on the dataset and the algorithm.

Datasets: summary of the (LIBSVM) datasets.

    Name     #Features   #Data points   Density of nonzeros
    url      3,231,961   2,396,130      0.0036%
    epsilon  2,000       400,000        100%
    news20   62,021      15,935         0.13%
    covtype  54          581,012        22%

Implementation: C++ using the Message Passing Interface (MPI), with the Intel MKL library for sparse and dense BLAS routines. All methods were tested on a Cray XC30.

Convergence of the re-organized algorithms
- The convergence rate remains the same in exact arithmetic.
- Empirically stable convergence: no divergence between the methods.

[Figure: objective function (×10^6) vs. iterations (0-300) for the baseline and our method.]

Scalability performance
- The more processors, the better.
- The gap between the CA and non-CA methods increases with the number of processors.

[Figures: performance scaling for url and epsilon; running time (sec) vs. number of processors P = 3072, 6144, 12288 for the baseline and our method (url: roughly 39-173 sec; epsilon: roughly 1.1-6.4 sec).]

Scalability performance (continued)
- The more processors, the better.
- The gap between the CA and non-CA methods increases with the number of processors.

[Figures: performance scaling for news20 (P = 192, 384, 768; roughly 9.6-36.6 sec) and covtype (P = 768, 1536, 3072; roughly 0.23-1.5 sec) for the baseline and our method.]

Speedup breakdown
- Large communication speedup, until bandwidth takes a hit.
- Computation speed is maintained thanks to local cache-efficient (BLAS-3) computations.

[Figures: total, communication, and computation speedup vs. the recurrence unrolling parameter s for url (best s = 64) and epsilon (best s = 64); the speedup = 1 line is marked for reference.]

Speedup breakdown (continued)
- Large communication speedup, until bandwidth takes a hit.
- Computation speed is maintained thanks to local cache-efficient (BLAS-3) computations.

[Figures: total, communication, and computation speedup vs. the recurrence unrolling parameter s for news20 (best s = 16) and covtype (best s = 32).]

Future directions
- Comparisons between the CA and ML approaches.
- Hybrid CA and ML: partition based on ML, use CA for the local problems.
- Solve the local subproblems using GPUs + CA first-order++ methods.
- Currently the ML worst-case theory is worse than the sequential theory. A possible solution: find a way to quantify and incorporate the correlation among partitions into the theory.

Thank You! Questions?