Conquering Massive Clinical Models with GPU Parallelized Logistic Regression


Shawn Wilcox

1 Conquering Massive Clinical Models with GPU Parallelized Logistic Regression
M.D./Ph.D. candidate in Biomathematics, University of California, Los Angeles
Joint Statistical Meetings, Vancouver, Canada, July 30, 2018
Joint work with Trevor Shaddox and Marc Suchard
Funding: NSF grant IIS (Suchard), F31 LM (Tian), Paul and Daisy Soros Fellowships for New Americans (Tian)

2 Introduction
Observational data provide an enormous amount of information
How enormous?
  Truven CCAE has 122 million patients
  A study can have 100,000 patients and 100,000 covariates
Using this much data requires massive statistical models
Fitting massive models is computationally prohibitive (especially with cross-validation)
Exploit the massive parallelization potential of graphics processing units (GPUs)

3 Propensity Score Fitting
Propensity scores (PS) are used to reduce confounding in comparative effectiveness studies
Fitting a PS involves modeling the probability of [binary] treatment assignment
Logistic regression is the standard classification model for PS
Including all covariates requires fitting a huge regularized logistic regression (can take hours to tens of hours)

4 Graphics Processing Units (GPUs)
Thousands of threads execute the same piece of code simultaneously
Ideal for simple but highly parallel tasks
Why GPUs?
  Relatively cheap ($2k), portable, easy to use
  Can buy an enclosure for plug-and-play
  Buying a big desktop: more expensive, less portable
  Using cloud services: costs add up, difficult to use

5 GPU Architecture


6 How a GPU Works
All processors run one "kernel" (piece of code)
The job is broken down into "thread blocks" for multiprocessors
Each thread block is broken down into 32-thread "warps" for single processors
Each multiprocessor has its own local memory
NVIDIA Tesla K40 has 2880 threads on 90 SPs across 15 MPs
Global memory access costs hundreds of clock cycles
Host-to-device transfer is measured in tens of microseconds!

7 Logistic Regression
Fitting a propensity score with all pretreatment covariates
[Figure: timeline showing pretreatment covariates x_{i,j}, treatment y_i, and outcome z_i]
Penalized log likelihood to maximize:
$$ l(\beta) = \sum_{i=1}^{n} \left[ y_i x_i^T \beta - \log\left(1 + \exp(x_i^T \beta)\right) \right] - \lambda \sum_{j=2}^{p} |\beta_j| $$
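As a small illustration of this objective, here is a sketch in plain Python. It assumes the lasso penalty is subtracted from the log likelihood (the usual convention when maximizing) and that the first column of X is an unpenalized intercept, matching the penalty sum starting at j = 2. The function name and the toy data are hypothetical and are not taken from Cyclops.

```python
import math

def penalized_log_likelihood(X, y, beta, lam):
    """Lasso-penalized logistic log likelihood.

    Assumption: beta[0] is the intercept and is not penalized.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        xb = sum(x * b for x, b in zip(x_i, beta))
        # log(1 + exp(xb)) evaluated without overflow for large |xb|
        log1pexp = max(xb, 0.0) + math.log1p(math.exp(-abs(xb)))
        total += y_i * xb - log1pexp
    penalty = lam * sum(abs(b) for b in beta[1:])
    return total - penalty

# Toy data: intercept column plus one binary covariate
X = [[1.0, 0.0], [1.0, 1.0]]
y = [0, 1]
value = penalized_log_likelihood(X, y, [0.0, 0.0], lam=1.0)
```

At beta = 0 every subject contributes -log 2 and the penalty vanishes, so the value here is -2 log 2.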

8 Cyclic Coordinate Descent
Updates one coordinate at a time with one-dimensional Newton steps

initialize β;
while not yet converged do
    for j ← 1 to J do
        gradient ← $\frac{\partial}{\partial \beta_j} l(\beta) = \sum_{i=1}^{n} y_i x_{i,j} - \sum_{i=1}^{n} \frac{x_{i,j} \exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)} - \mathrm{Sign}(\beta_j) \lambda$;
        hessian ← $-\frac{\partial^2}{\partial \beta_j^2} l(\beta) = \sum_{i=1}^{n} \frac{x_{i,j}^2 \exp(x_i^T \beta)}{\left(1 + \exp(x_i^T \beta)\right)^2}$;
        delta ← gradient / hessian;
        β_j ← β_j + delta;
        update $x_i^T \beta$ and $\exp(x_i^T \beta)$ for subjects with $x_{i,j} \neq 0$;
    end
end
Algorithm 1: Cyclic coordinate descent
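The loop in Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration and not the Cyclops implementation. It assumes dense row lists, an unpenalized intercept in column 0, and a penalty subtracted from the log likelihood. The helper `lasso_step` is a hypothetical addition: because Sign(β_j) is undefined at zero, it tries both directions when β_j = 0 and stops a step at zero rather than letting it cross, a common device in lasso coordinate descent that the slides do not spell out.

```python
import math

def sigmoid(t):
    """Numerically stable exp(t) / (1 + exp(t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def lasso_step(beta_j, grad, hess, pen):
    """One-dimensional Newton step for a lasso-penalized coordinate.

    grad is the derivative of the unpenalized log likelihood and
    hess is the (positive) negated second derivative.
    """
    if beta_j == 0.0:
        if grad > pen:
            return (grad - pen) / hess
        if grad < -pen:
            return (grad + pen) / hess
        return 0.0
    sign = 1.0 if beta_j > 0 else -1.0
    delta = (grad - sign * pen) / hess
    if (beta_j + delta) * sign < 0:   # step would cross zero: stop at zero
        delta = -beta_j
    return delta

def ccd_logistic_lasso(X, y, lam, max_sweeps=500, tol=1e-10):
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    xb = [0.0] * n                    # cached x_i^T beta
    # sum_i y_i x_ij does not depend on beta: compute it once
    yx = [sum(y[i] * X[i][j] for i in range(n)) for j in range(p)]
    for _ in range(max_sweeps):
        max_step = 0.0
        for j in range(p):
            grad, hess = yx[j], 0.0
            for i in range(n):
                x = X[i][j]
                if x != 0.0:
                    prob = sigmoid(xb[i])
                    grad -= x * prob
                    hess += x * x * prob * (1.0 - prob)
            if hess <= 0.0:
                continue
            delta = lasso_step(beta[j], grad, hess, 0.0 if j == 0 else lam)
            if delta != 0.0:
                beta[j] += delta
                # only subjects with a nonzero covariate need updating
                for i in range(n):
                    if X[i][j] != 0.0:
                        xb[i] += delta * X[i][j]
            max_step = max(max_step, abs(delta))
        if max_step < tol:
            break
    return beta

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 1, 1, 1]
beta_hat = ccd_logistic_lasso(X, y, lam=10.0)
```

With a penalty this large the covariate coefficient stays at zero, so the fit reduces to an intercept-only model whose intercept is the log odds of treatment, log(3) for this toy data.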

9 CCD Optimization
Algorithm 1 (repeated from slide 8)
$\sum_{i=1}^{n} y_i x_{i,j}$ is constant: compute it once

10 CCD Optimization
Algorithm 1 (repeated from slide 8)
$\sum_{i=1}^{n} \frac{x_{i,j} \exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}$ and $\sum_{i=1}^{n} \frac{x_{i,j}^2 \exp(x_i^T \beta)}{\left(1 + \exp(x_i^T \beta)\right)^2}$ can be distributed and reduced across multiple processes

11 CCD Optimization
Algorithm 1 (repeated from slide 8)
Updating $x_i^T \beta$ and $\exp(x_i^T \beta)$ can be done across multiple independent processes

12 CCD on GPU

local gradient[512], hessian[512];
task ← id; sum0, sum1 ← 0;
while task < N_j do
    k ← K[task]; xb ← XB[k];
    sum0 ← sum0 + exp(xb) / (1 + exp(xb));
    sum1 ← sum1 + exp(xb) / (1 + exp(xb))^2;
    task ← task + 512;
end
gradient[id] ← sum0; hessian[id] ← sum1;
local parallel reduction to sum gradient and hessian;
if id = 0 then
    delta ← -gradient / hessian;
end
task ← id;
while task < N_j do
    k ← K[task]; XB[k] ← XB[k] + delta;
    task ← task + 512;
end
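To make the thread pattern concrete, the kernel can be emulated serially in plain Python, with an ordinary loop playing the role of the 512 threads. This is a sketch under stated assumptions, not device code: the covariate is taken to be binary (so only its nonzero row indices K are stored), the summands are taken to be the logistic terms exp(xb)/(1 + exp(xb)) and exp(xb)/(1 + exp(xb))^2, and, as on the slide, the constant term and the penalty are left out of the gradient. The function name `column_update` is hypothetical.

```python
import math

BLOCK = 512  # threads per block, as on the slide; must be a power of two

def column_update(K, XB, block=BLOCK):
    """Serial emulation of one coordinate update for a binary covariate.

    K  : row indices where the covariate is nonzero (must be non-empty)
    XB : cached x_i^T beta for every subject, updated in place
    """
    n_j = len(K)
    gradient = [0.0] * block
    hessian = [0.0] * block

    # Phase 1: each "thread" walks the nonzeros with stride `block`
    for tid in range(block):
        task, sum0, sum1 = tid, 0.0, 0.0
        while task < n_j:
            xb = XB[K[task]]
            prob = 1.0 / (1.0 + math.exp(-xb))   # exp(xb) / (1 + exp(xb))
            sum0 += prob
            sum1 += prob * (1.0 - prob)          # exp(xb) / (1 + exp(xb))^2
            task += block
        gradient[tid], hessian[tid] = sum0, sum1

    # Phase 2: tree reduction, halving the number of active threads each round
    stride = block // 2
    while stride > 0:
        for tid in range(stride):
            gradient[tid] += gradient[tid + stride]
            hessian[tid] += hessian[tid + stride]
        stride //= 2

    # Phase 3: thread 0 computes the step, then all threads apply it
    delta = -gradient[0] / hessian[0]
    for tid in range(block):
        task = tid
        while task < n_j:
            XB[K[task]] += delta
            task += block
    return delta

XB = [0.0] * 6
delta = column_update([0, 2, 5], XB)
```

On a real device the three phases would be separated by barriers within the work group; the serial loops here only reproduce the arithmetic and the access pattern.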

13 Performance Comparison
Sparse n x p matrix (5% nonzero)
80% of β set to 0, rest are Norm(0, 2)
Fixed lasso regularization penalty
Compare Tesla K40 to my MacBook Pro (3.1GHz, 4 cores, 8GB)

n    p    iterations    GPU    CPU
                        s      18.5 s
                        s      296 s
                        s      214 s
                        s      8998 s (est)

14 Cross Validation
Searching for the optimal regularization hyperparameter is the most expensive part of the model fit
[Figure: Partition 1, Partition 2, Partition 3, Partition 4]
Leave one fold out of each partition when training the model
Compute the out-of-sample likelihood on the left-out fold
Running CCD on the partitions is completely independent and can be parallelized
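A minimal sketch of the bookkeeping behind this scheme, in plain Python. It assumes each "partition" is one train/held-out split of the subjects, and it assigns folds by a simple round-robin rule purely for illustration. Both function names are hypothetical.

```python
import math

def k_fold_splits(n, k):
    """Return k (train, held_out) index pairs, one per partition."""
    folds = [list(range(f, n, k)) for f in range(k)]   # round-robin folds
    splits = []
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        splits.append((train, folds[f]))
    return splits

def out_of_sample_loglik(X, y, beta, held_out):
    """Unpenalized logistic log likelihood on the held-out subjects."""
    total = 0.0
    for i in held_out:
        xb = sum(x * b for x, b in zip(X[i], beta))
        total += y[i] * xb - (max(xb, 0.0) + math.log1p(math.exp(-abs(xb))))
    return total

splits = k_fold_splits(8, 4)
```

Each pair in `splits` can be fit on its own, since no partition reads another partition's coefficients or cached linear predictors.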


15 CCD on GPU
A warp reads in 32 contiguous values from global K
Our problems are sparse, so the indices k are non-contiguous
Here, the warp will request 28 blocks from global memory
It will need to wait for 500 x 28 = 14,000 clock cycles
Memory access is not coalesced
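The cost argument can be reproduced with a toy model in plain Python. The sketch assumes global memory is fetched in aligned blocks of 32 values and that each block costs 500 clock cycles, the figure used in the slide's arithmetic. Both constants and the function name are illustrative assumptions, not measured hardware properties.

```python
VALUES_PER_BLOCK = 32    # assumed size of one aligned memory block, in values
CYCLES_PER_BLOCK = 500   # assumed latency per block, from the slide's arithmetic

def blocks_touched(indices, values_per_block=VALUES_PER_BLOCK):
    """Number of distinct memory blocks a warp must fetch for these indices."""
    return len({k // values_per_block for k in indices})

# Coalesced: 32 threads read 32 adjacent, aligned values
coalesced = blocks_touched(range(32))

# Sparse: 28 threads land in 28 different blocks, 4 more share the first block
sparse_indices = [i * 32 for i in range(28)] + [1, 2, 3, 4]
scattered = blocks_touched(sparse_indices)
wait_cycles = CYCLES_PER_BLOCK * scattered
```

In this model the coalesced read touches one block while the scattered read touches 28, which is the 500 x 28 wait described on the slide.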


16 Cross Validation on GPU
Naive way to run different partitions: run each partition in a separate work group or kernel
Each partition accesses the same indices of a different XB vector
28 x 4 = 112 global memory accesses

17 Cross Validation on GPU
Improved way to run different partitions: lay out global memory XB so that partitions access contiguous memory
Each partition accesses a different XB value of the same index, and these values are next to each other in memory
29 global memory accesses
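The two layouts differ only in how a (partition, index) pair maps to an address, which a short plain-Python sketch can show. It assumes aligned memory blocks of 32 values and uses made-up sizes (10,000 subjects, 4 partitions, 32 sparse indices), so the counts it produces illustrate the effect and are not the slide's 112 and 29. All names are hypothetical.

```python
VALUES_PER_BLOCK = 32   # assumed size of one aligned memory block, in values

def blocks_touched(addresses):
    """Number of distinct memory blocks covering these addresses."""
    return len({a // VALUES_PER_BLOCK for a in addresses})

def naive_address(part, k, n):
    """One full XB vector per partition, stored one after another."""
    return part * n + k

def interleaved_address(part, k, n_parts):
    """Values of the same index k for all partitions stored side by side."""
    return k * n_parts + part

n, n_parts = 10_000, 4
K = [i * 100 for i in range(32)]   # sparse, non-contiguous row indices

naive = blocks_touched(
    naive_address(part, k, n) for part in range(n_parts) for k in K)
interleaved = blocks_touched(
    interleaved_address(part, k, n_parts) for part in range(n_parts) for k in K)
```

With these sizes the naive layout needs a separate block for every (partition, index) pair, while the interleaved layout serves all four partitions of an index from a single block.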

18 Performance Comparison
Sparse n x p matrix (5% nonzero)
80% of β set to 0, rest are Norm(0, 2)
Fit Cyclops: model "lr", lasso with cross validation
Compare Tesla K40 to my MacBook Pro (3.1GHz, 4 cores, 8GB)

n    p    CV    iterations    GPU    CPU
          x                   s      5.5 s
          x                   s      47.5 s
          x                   s      81.3 s
          x                   s      738 s
          x                   s      9579 s (est)
          x                   s      36.3 h (est)

GPU performance is dominated by global memory access

19 What About glmnet?
Without CV: glmnet seemingly much faster than Cyclops
  BUT Cyclops agrees with glm
  glmnet takes a "shortcut" with quadratic approximations that have no descent guarantee
With CV: comparable run time on a mediocre GPU, different results
  A good GPU may outcompete glmnet
  Very difficult to make direct comparisons

20 Conditional Logistic Regression (CLR)
For stratified data, condition on outcomes per stratum
Avoids estimating an intercept for each stratum (as unconditional logistic regression, ULR, requires)
Less biased than ULR when using "exact" CLR rather than an approximation
More challenging than regular LR: involves computationally large terms
We implemented CLR as well
Outperforms clogit, compares well with clogitL1

21 Conclusions
GPUs offer parallelization for large observational studies
Uniquely suited to running CV, different weights, or the bootstrap
Implementation involves exploiting the memory hierarchy

22 glm Results
n x p matrix (5% nonzero)
80% of β set to 0, rest are Norm(0, 2)
No regularization
Compare all on my MacBook Pro

dense
Size        Cyclops CPU     Cyclops GPU     glmnet            glm
10k / 1k    121 s [182]     40.9 s [215]    2.77 s* [143]     73.4 [9]

sparse
Size        Cyclops CPU     Cyclops GPU     glmnet
10k / 1k    9.03 s [235]    4.42 s [212]    s* [143]
10k / 5k    42.2 s [310]    ?               2.29 s** [158]

23 Performance Comparison
Warfarin vs dabigatran dataset
Call createPs from CohortMethod
Fit Cyclops: model "lr", lasso with cross validation
Compare Tesla K40 to iMac (3.9GHz, 8 cores, 16GB)

n p CV iterations GPU CPU
x m 46.1 m
x h
x h 12.5 h

GPU is several times faster for 10x10 CV
Shows the parallelization limits of LR
Might not be faster for 10x1 CV, but the GPU could also be running multiple values of lambda
