Parallel Methods for Convex Optimization. A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney

Size: px

Start display at page:

Download "Parallel Methods for Convex Optimization. A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney"

Paul Reeves
5 years ago
Views:

1 Parallel Methods for Convex Optimization A. Devarakonda, J. Demmel, K. Fountoulakis, M. Mahoney

2 Problems minimize g(x)+f(x; A, b) Sparse regression g(x) =kxk 1 f(x) =kax bk 2 2 mx Sparse SVM g(x) =kxk 1 f(x) = max(1 b i A T i x, 0) 2 i=1 Elastic net g(x) = 2 kxk2 2 +(1 )kxk cluster 1 cluster 2 linear classifier 0.7 Group lasso g(x) = JX kx j k Kj a j= a 2

3 Numerous applications Medical sciences Cancer diagnosis Prediction of common diseases: diabetes, pre-diabetes Biological sciences Protein remote homology detection Protein function prediction Gene expression data classification Finance sciences Financial time series forecasting Stock market prediction Web related applications Web searching filtering Relevance feedback Classification of articles, s, web-pages Geo and spatiotemporal environmental analysis. 3

4 Existing directions and our goals Parallelizing linear algebra is monopolized by HPC community Parallelizing optimization methods is monopolized by ML community ML community HPC community Data parallel approach Uses spark Parallelizes LA Uses MPI (lingua franca of HPC) Our goal: generalize HPC/LA techniques to convex optimization 4

5 First attempt to parallelism In ML first-order methods is often the king Optimal worst-case FLOPS Excellent for low accuracy solutions Let s parallelize computation of the gradient of first-order methods. What could go wrong? Communication becomes the bottleneck and the methods are slow.

6 Parallelism, the ML way Partition the data across nodes. Each node works independently on its data. The outputs are averaged. Control FLOPS/communication tradeoff by number and quality of partitions. Data matrix Correlated Uncorrelated Large #partitions Many FLOPS Much communication Few FLOPS Excessive communication P1 P2 P Small #partitions Excessive FLOPS Low communication Low FLOPS Low communication

7 Parallelism, the ML way for l1-regularization minimize kxk 1 + f(x; A, b) Coordinates Partition the data and coordinates across nodes. Each node works Data matrix independently on its data and coordinates. The gradients are averaged. The optimal solution is often very sparse P1 P2 P The algorithm turns from parallel to serial as the algorithm converges since many nodes become idle.

8 Parallelism, the HPC way Our objective: solve the communication bottleneck of first-order methods using HPC techniques. Exchange FLOPS with communication, directly. We get optimal worst-case FLOPS + communication efficiency We get good scalability for all sparse data This talk + preliminary experiments

9 An example: coordinate descent Pseudo-code 1 communication per Sample a column of data iteration Compute partial derivative Update solution new old Repeat 9

10 An example: communication avoiding coordinate descent Pseudo-code 1 communication round per s iterations s Compute in parallel anticipated computations s for the next s iterations Redundantly store the result in all processors P1 P2 P Each processor independently computes the next s iterations Repeat 10

11 Other examples Block coordinate descent Accelerated block coordinate descent Newton-type (first-order++) block coordinate descent Proximal versions of the above 11

12 Running time in Big-O No free lunch: Exchange FLOPS and bandwidth for latency s Hµ2 fm P + Hµ 3 + H log P s + shµ 2 log P Time per message Time per word Time per FLOP P Number of cores Number of coordinates µ at each iteration f = nnz(a)/mn H Number of iterations 12

13 Optimal parameter s s s = P log P µ 2 fm+ µ 2 P log P Time per message Time per word Time per FLOP P Number of cores µ Number of coordinates at each iteration f = nnz(a)/mn Let s assume A is very sparse and mu=1 (single coordinate descent) r Cori or Edison r 10 6! s = O =

14 Scalable results for all data layouts Data Layout 1D Row Partition 2D Block Partition 1D Column Partition A A A * Best performance depends on dataset and algorithm 14

15 Datasets Summary of (LIBSVM) datasets Name #Features #Data points Density of nonzeros url 3,231,961 2,396, % epsilon 2, , % news20 62,021 15, % covtype ,012 22% C++ using the Message Passing Interface (MPI). Intel MKL library for sparse and dense BLAS routines. All methods were tested on a Cray XC30. 15

16 Convergence of re-organized algorithms Convergence rate remains the same in exact arithmetic Empirically stable convergence: no divergence between methods 10 6 Baseline Objective function1.5 Ours Iterations 16

17 Scalability performance The more processors the better The gap between CA and non-ca increases w.r.t. #processors Running Time (sec) Performance scaling: url Baseline Ours Processors (P) Running Time (sec) Performance scaling: epsilon Baseline Ours Processors (P) 17

18 Scalability performance The more processors the better The gap between CA and non-ca increases w.r.t. #processors Performance scaling: news20 Performance scaling: covtype Running Time (sec) 36.6 Baseline Ours Running Time (sec) Baseline Ours Processors (P) Processors (P) 18

19 Speed up breakdown Large communication speedup until bandwidth takes a hit Computation is maintained due to local cache-efficient (BLAS-3) computations Speedup total communication computation speedup = 1 Best s = Speedup: url Recurrence unrolling parameter (s) Speedup Speedup: epsilon total communication computation speedup = 1 Best s = Recurrence unrolling parameter (s) 19

20 Speed up breakdown Large communication speedup until bandwidth takes a hit Computation is maintained due to local cache-efficient (BLAS-3) computations Speedup Speedup: news20 total communication computation speedup = 1 Best s = 16 Speedup Speedup: covtype total communication computation speedup = 1 Best s = Recurrence unrolling parameter (s) Recurrence unrolling parameter (s) 20

21 Future directions Comparisons among CA and ML approaches Hybrid CA and ML, Partition based on ML, use CA for local problems Solve the local subproblems using GPUs + CA first order++ Currently the ML worst-case theory is worse than sequential Possible solution, find a way to quantify and incorporate correlation among partitions into theory. 21

22 Thank You! Questions? 22

Avoiding communication in primal and dual block coordinate descent methods

Avoiding communication in primal and dual block coordinate descent methods Aditya Devarakonda Kimon Fountoulakis James Demmel Michael W. Mahoney Electrical Engineering and Computer Sciences University