Parallel Methods for Convex Optimization A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney

Problems: minimize $g(x) + f(x; A, b)$
- Sparse regression: $g(x) = \|x\|_1$, $f(x) = \|Ax - b\|_2^2$
- Sparse SVM: $g(x) = \|x\|_1$, $f(x) = \sum_{i=1}^{m} \max(1 - b_i A_i^T x,\, 0)^2$
- Elastic net: $g(x) = \frac{\lambda}{2}\|x\|_2^2 + (1 - \lambda)\|x\|_1$
- Group lasso: $g(x) = \sum_{j=1}^{J} \|x_j\|_{K_j}$

[Figure: two point clusters (cluster 1, cluster 2) in the $(a_1, a_2)$ plane separated by a linear classifier.]

Numerous applications
- Medical sciences: cancer diagnosis; prediction of common diseases such as diabetes and pre-diabetes
- Biological sciences: protein remote homology detection; protein function prediction; gene expression data classification
- Finance: financial time series forecasting; stock market prediction
- Web applications: web search; email filtering; relevance feedback; classification of articles, emails, and web pages
- Geo- and spatiotemporal environmental analysis

Existing directions and our goals
- Parallelizing linear algebra is monopolized by the HPC community; parallelizing optimization methods is monopolized by the ML community.
- ML community: data-parallel approach; uses Spark.
- HPC community: parallelizes linear algebra; uses MPI (the lingua franca of HPC).
- Our goal: generalize HPC/LA techniques to convex optimization.

First attempt at parallelism
- In ML, first-order methods are often king: optimal worst-case FLOPS, excellent for low-accuracy solutions.
- Let's parallelize the gradient computation of first-order methods (sketched below). What could go wrong?
- Communication becomes the bottleneck and the methods are slow.
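
To make the communication cost concrete, here is a minimal sketch (not the authors' code) of data-parallel gradient descent for the least-squares term $f(x) = \|Ax - b\|_2^2$, written in C++ with MPI as in the talk's experiments; the function and variable names are illustrative. Each rank holds a block of rows of A, and every iteration ends with an MPI_Allreduce of the full length-n gradient, which is exactly the per-iteration communication that becomes the bottleneck.

    // Sketch: data-parallel gradient descent for f(x) = ||Ax - b||_2^2.
    // Each rank owns a block of rows (A_loc, b_loc); the iterate x is replicated.
    // One MPI_Allreduce of an n-vector per iteration, so communication dominates
    // once the local FLOPs are cheap (sparse data, many iterations).
    #include <mpi.h>
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    void gradient_descent(const std::vector<std::vector<double>>& A_loc, // m_loc x n rows
                          const std::vector<double>& b_loc,              // m_loc entries
                          std::vector<double>& x,                        // n entries, replicated
                          double step, int iters) {
      const int n = static_cast<int>(x.size());
      std::vector<double> g_loc(n), g(n);
      for (int t = 0; t < iters; ++t) {
        // Local part of the gradient: 2 * A_loc^T (A_loc * x - b_loc).
        std::fill(g_loc.begin(), g_loc.end(), 0.0);
        for (std::size_t i = 0; i < A_loc.size(); ++i) {
          double r = -b_loc[i];
          for (int j = 0; j < n; ++j) r += A_loc[i][j] * x[j];
          for (int j = 0; j < n; ++j) g_loc[j] += 2.0 * r * A_loc[i][j];
        }
        // The bottleneck: one global reduction per iteration.
        MPI_Allreduce(g_loc.data(), g.data(), n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        for (int j = 0; j < n; ++j) x[j] -= step * g[j];
      }
    }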

Parallelism, the ML way
- Partition the data across nodes. Each node works independently on its data. The outputs are averaged (a minimal code sketch follows below).
- Control the FLOPS/communication tradeoff by the number and quality of partitions:

                          Correlated data                      Uncorrelated data
      Large #partitions   many FLOPS, much communication       few FLOPS, excessive communication
      Small #partitions   excessive FLOPS, low communication   low FLOPS, low communication

[Figure: data matrix partitioned across nodes P1, P2, ..., P.]
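
For contrast with the per-iteration reduction above, a minimal sketch of the partition-and-average pattern described on this slide (illustrative only; average_models is my name, and local_solve stands in for whatever single-node solver each partition runs):

    // Sketch: ML-style data parallelism ("partition and average").
    // Each rank solves its own subproblem with heavy local computation and no
    // communication; the P local models are combined with a single reduction.
    #include <mpi.h>
    #include <functional>
    #include <vector>

    std::vector<double> average_models(
        int n, const std::function<std::vector<double>()>& local_solve) {
      std::vector<double> x_loc = local_solve();   // many FLOPs, zero communication
      std::vector<double> x_avg(n, 0.0);
      int P = 1;
      MPI_Comm_size(MPI_COMM_WORLD, &P);
      // One reduction in total (or per averaging round), not one per iteration.
      MPI_Allreduce(x_loc.data(), x_avg.data(), n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      for (double& v : x_avg) v /= P;              // average the P partition solutions
      return x_avg;
    }

The number and quality of the partitions then move the method along the FLOPS/communication tradeoff in the table above.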

Parallelism, the ML way for l1-regularization: minimize $\|x\|_1 + f(x; A, b)$
- Partition the data and the coordinates across nodes. Each node works independently on its data and coordinates. The gradients are averaged.
- The optimal solution is often very sparse, so the algorithm turns from parallel to serial as it converges, since many nodes become idle.

[Figure: data matrix and coordinates partitioned across nodes P1, P2, ..., P.]

Parallelism, the HPC way
- Our objective: solve the communication bottleneck of first-order methods using HPC techniques.
- Trade FLOPS for communication, directly.
- We get optimal worst-case FLOPS plus communication efficiency.
- We get good scalability for all sparse data.
- This talk + preliminary experiments.

An example: coordinate descent (1 communication per iteration)

Pseudo-code:
1. Sample a column of the data.
2. Compute the partial derivative.
3. Update the solution: x_new ← x_old − step · (partial derivative).
4. Repeat.
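
A minimal sketch of this loop for the least-squares case under a 1D row partition (illustrative, not the paper's implementation; names are mine): the residual is distributed by rows, so each partial derivative requires one scalar MPI_Allreduce, i.e. one latency-bound message round per iteration.

    // Sketch: distributed coordinate descent for f(x) = ||Ax - b||_2^2 with a
    // 1D row partition. A_loc holds this rank's rows (dense here for brevity) and
    // r_loc = A_loc * x - b_loc is kept up to date locally.
    // Assumes all ranks seed std::rand identically so they sample the same column.
    #include <mpi.h>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    void coordinate_descent(const std::vector<std::vector<double>>& A_loc,
                            std::vector<double>& r_loc,                 // local residual block
                            std::vector<double>& x,                     // replicated iterate
                            const std::vector<double>& col_sq_norms,    // global ||A_j||_2^2
                            int iters) {
      const int n = static_cast<int>(x.size());
      for (int t = 0; t < iters; ++t) {
        const int j = std::rand() % n;                     // 1. sample a column of the data
        double g_loc = 0.0, g = 0.0;
        for (std::size_t i = 0; i < A_loc.size(); ++i)     // 2. local part of A_j^T r
          g_loc += A_loc[i][j] * r_loc[i];
        MPI_Allreduce(&g_loc, &g, 1, MPI_DOUBLE, MPI_SUM,  // the 1 communication per iteration
                      MPI_COMM_WORLD);
        const double delta = -g / col_sq_norms[j];         // 3. exact step along coordinate j
        x[j] += delta;                                     //    x_new = x_old + delta * e_j
        for (std::size_t i = 0; i < A_loc.size(); ++i)     // keep the residual consistent
          r_loc[i] += delta * A_loc[i][j];
      }
    }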

An example: communication-avoiding coordinate descent (1 communication round per s iterations)

Pseudo-code:
1. Compute in parallel the anticipated computations for the next s iterations.
2. Redundantly store the result on all processors P1, P2, ..., P.
3. Each processor independently computes the next s iterations.
4. Repeat.
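
Continuing the previous sketch, a schematic version of the s-step idea for the same least-squares, single-coordinate (μ = 1) setting; this is illustrative and not the authors' exact algorithm. The "anticipated computations" are the Gram entries and residual inner products of the s sampled columns, reduced and stored redundantly in one round, after which each processor unrolls the s updates locally.

    // Sketch: communication-avoiding (s-step) coordinate descent for
    // f(x) = ||Ax - b||_2^2 with a 1D row partition. One all-reduce of O(s^2)
    // numbers every s iterations replaces s separate scalar all-reduces.
    #include <mpi.h>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    void ca_coordinate_descent(const std::vector<std::vector<double>>& A_loc,
                               std::vector<double>& r_loc,   // local residual block
                               std::vector<double>& x,       // replicated iterate
                               int s, int rounds) {
      const int n = static_cast<int>(x.size());
      for (int round = 0; round < rounds; ++round) {
        std::vector<int> J(s);                               // sample s columns up front
        for (int k = 0; k < s; ++k) J[k] = std::rand() % n;  // (same seed on every rank)

        // Anticipated computations: Gram entries G[k][l] = A_{J[k]}^T A_{J[l]} and
        // inner products p[k] = A_{J[k]}^T r, packed into one buffer.
        std::vector<double> loc(s * s + s, 0.0), glob(s * s + s, 0.0);
        for (std::size_t i = 0; i < A_loc.size(); ++i)
          for (int k = 0; k < s; ++k) {
            for (int l = 0; l < s; ++l) loc[k * s + l] += A_loc[i][J[k]] * A_loc[i][J[l]];
            loc[s * s + k] += A_loc[i][J[k]] * r_loc[i];
          }
        // The single communication round: result stored redundantly on all ranks.
        MPI_Allreduce(loc.data(), glob.data(), s * s + s, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        // s updates with no further communication, using the unrolled recurrence
        // A_{J[k]}^T r_current = p[k] + sum_{l<k} G[k][l] * delta[l].
        std::vector<double> delta(s, 0.0);
        for (int k = 0; k < s; ++k) {
          double g = glob[s * s + k];
          for (int l = 0; l < k; ++l) g += glob[k * s + l] * delta[l];
          delta[k] = -g / glob[k * s + k];
          x[J[k]] += delta[k];
        }
        for (std::size_t i = 0; i < A_loc.size(); ++i)       // refresh the local residual
          for (int k = 0; k < s; ++k) r_loc[i] += delta[k] * A_loc[i][J[k]];
      }
    }

Roughly speaking, in the block case the reduced quantity becomes the Gram matrix of the sampled blocks, a dense matrix-matrix product, which is the cache-efficient (BLAS-3) local computation referred to in the speedup slides later.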

Other examples
- Block coordinate descent
- Accelerated block coordinate descent
- Newton-type (first-order++) block coordinate descent
- Proximal versions of the above
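
For the proximal versions with the $\ell_1$ penalty used throughout the talk, each update is followed by a proximal step; for $g(x) = \lambda\|x\|_1$ that step is the standard soft-thresholding operator, sketched here for reference (the function name is illustrative):

    // Proximal operator of lambda * |x_j|: soft-thresholding, applied after the
    // gradient (or coordinate) step in the proximal variants.
    #include <algorithm>
    #include <cmath>

    double soft_threshold(double v, double lambda) {
      return std::copysign(std::max(std::abs(v) - lambda, 0.0), v);
    }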

Running time in Big-O. No free lunch: we exchange FLOPS and bandwidth for latency.

$$T(s) = \gamma\left(\frac{s H \mu^2 f m}{P} + H \mu^3\right) + \alpha\,\frac{H \log P}{s} + \beta\, s H \mu^2 \log P$$

where $\alpha$ = time per message, $\beta$ = time per word, $\gamma$ = time per FLOP, $P$ = number of cores, $\mu$ = number of coordinates at each iteration, $f = \mathrm{nnz}(A)/(mn)$, and $H$ = number of iterations.

Optimal parameter s (the value of s minimizing the running time above):

$$s^{*} = \sqrt{\frac{\alpha\, P \log P}{\gamma\, \mu^2 f m + \beta\, \mu^2 P \log P}}$$

with $\alpha$ = time per message, $\beta$ = time per word, $\gamma$ = time per FLOP, $P$ = number of cores, $\mu$ = number of coordinates at each iteration, $f = \mathrm{nnz}(A)/(mn)$.

Assume A is very sparse and $\mu = 1$ (single coordinate descent). On Cori or Edison this gives roughly $s^{*} = O\big(\sqrt{10^{-6}/10^{-10}}\big) = 100$.
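
A short derivation sketch of this formula, assuming the running-time model $T(s)$ reconstructed on the previous slide and reading the slide's $10^{-6}$ and $10^{-10}$ as the per-message and per-word times on Cori/Edison:

$$T(s) = \gamma\left(\frac{s H \mu^2 f m}{P} + H \mu^3\right) + \alpha\,\frac{H \log P}{s} + \beta\, s H \mu^2 \log P$$

$$\frac{dT}{ds} = \gamma\,\frac{H \mu^2 f m}{P} + \beta\, H \mu^2 \log P - \alpha\,\frac{H \log P}{s^2} = 0 \;\Longrightarrow\; s^{*} = \sqrt{\frac{\alpha\, P \log P}{\gamma\, \mu^2 f m + \beta\, \mu^2 P \log P}}$$

With $A$ very sparse and $\mu = 1$, the $\gamma$ term is negligible and $s^{*} \approx \sqrt{\alpha/\beta} \approx \sqrt{10^{-6}/10^{-10}} = 100$, matching the estimate above.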

Scalable results for all data layouts
- Data layouts: 1D row partition, 2D block partition, or 1D column partition of the matrix A.
- *Best performance depends on the dataset and the algorithm.

Datasets: summary of the (LIBSVM) datasets.

    Name     #Features   #Data points   Density of nonzeros
    url      3,231,961   2,396,130      0.0036%
    epsilon  2,000       400,000        100%
    news20   62,021      15,935         0.13%
    covtype  54          581,012        22%

Implementation: C++ using the Message Passing Interface (MPI), with the Intel MKL library for sparse and dense BLAS routines. All methods were tested on a Cray XC30.

Convergence of the re-organized algorithms
- The convergence rate remains the same in exact arithmetic.
- Empirically stable convergence: no divergence between the methods.

[Figure: objective function (×10^6) vs. iterations (0-300) for the baseline and our method.]

Scalability performance
- The more processors, the better.
- The gap between the CA and non-CA methods increases with the number of processors.

[Figures: performance scaling for url and epsilon; running time (sec) vs. number of processors P = 3072, 6144, 12288 for the baseline and our method (url: roughly 39-173 sec; epsilon: roughly 1.1-6.4 sec).]

Scalability performance (continued)
- The more processors, the better.
- The gap between the CA and non-CA methods increases with the number of processors.

[Figures: performance scaling for news20 (P = 192, 384, 768; roughly 9.6-36.6 sec) and covtype (P = 768, 1536, 3072; roughly 0.23-1.5 sec) for the baseline and our method.]

Speedup breakdown
- Large communication speedup, until bandwidth takes a hit.
- Computation speed is maintained thanks to local cache-efficient (BLAS-3) computations.

[Figures: total, communication, and computation speedup vs. the recurrence unrolling parameter s for url (best s = 64) and epsilon (best s = 64); the speedup = 1 line is marked for reference.]

Speedup breakdown (continued)
- Large communication speedup, until bandwidth takes a hit.
- Computation speed is maintained thanks to local cache-efficient (BLAS-3) computations.

[Figures: total, communication, and computation speedup vs. the recurrence unrolling parameter s for news20 (best s = 16) and covtype (best s = 32).]

Future directions
- Comparisons between the CA and ML approaches.
- Hybrid CA and ML: partition based on ML, use CA for the local problems.
- Solve the local subproblems using GPUs + CA first-order++ methods.
- Currently the ML worst-case theory is worse than the sequential theory. A possible solution: find a way to quantify and incorporate the correlation among partitions into the theory.

Thank You! Questions?