Parallel Multigrid Preconditioning on Graphics Processing Units (GPUs) for Robust Power Grid Analysis


1 Parallel Multigrid Preconditioning on Graphics Processing Units (GPUs) for Robust Power Grid Analysis
Design Automation Group
Zhuo Feng, Michigan Technological University
Zhiyu Zeng, Texas A&M University
2010 ACM/EDAC/IEEE Design Automation Conference

2 Motivation
On-chip power distribution network verification challenge
  Tens of millions of grid nodes (recent IBM design reaches ~400M)
  Need long simulation time for transient power grid verification
Parallel circuit simulation algorithms on GPUs
  Pros: very cost efficient: 240-core GPU costs $400
  Hardware resource usage limitations: shared memory size, number of registers, etc.
  Algorithm and data structure design preferences:
    Multilevel iterative algorithms for SIMD computing platform
    GPU-friendly device memory access patterns, simple control flow
Our contribution: a robust power grid simulation method for GPU
  Multigrid preconditioning assures fast convergence (< 20 iterations)
  GPU-specific data structures guarantee coalesced memory access

3 IR Drop in Power Distribution Network
IR drop: voltage drop due to non-ideal resistive wires
[Figure: power distribution network with VDD and GND connections; image credit: Cadence]

4 Power Grid Modeling & Analysis
Multi-layer interconnects are modeled as a 3D RC network
Switching gate effects are modeled by time-varying current loadings
DC analysis solves the linear system $G v = b$
Transient analysis solves $G v(t) + C \frac{dv(t)}{dt} = b(t)$
Tens of millions of unknowns!
  $G \in \mathbb{R}^{n \times n}$: conductance matrix
  $C \in \mathbb{R}^{n \times n}$: capacitance matrix
  $v \in \mathbb{R}^{n \times 1}$: node voltage vector
  $b \in \mathbb{R}^{n \times 1}$: current loading vector
[Figure: multi-layer power grid with Vdd pads]
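The slide does not say how the transient equation is discretized in time. As an illustration only, assuming a fixed step size $h$ and the backward Euler scheme (both assumptions, not taken from the slides), each time step reduces to a linear solve with the same structure as the DC problem:

```latex
% Backward Euler applied to  G v(t) + C dv(t)/dt = b(t)  with assumed step size h
G\,v(t+h) + C\,\frac{v(t+h) - v(t)}{h} = b(t+h)
\quad\Longrightarrow\quad
\left(G + \frac{C}{h}\right) v(t+h) = b(t+h) + \frac{C}{h}\,v(t)
```

Under this assumption the system matrix $G + C/h$ stays constant across time steps, so a preconditioner built once can be reused at every step.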


5 Prior Work
Prior power grid analysis approaches
Direct methods (LU factorization, Cholesky decomposition)
  Cholmod uses 7GB memory and >,000 s for a 9-million grid
Iterative methods
  Preconditioned conjugate gradient (T. Chen et al., DAC 01)
  Multigrid methods (S. Nassif et al., DAC 00)
Stochastic method
  Random walk (H. Qian et al., DAC 05)
[Figure: direct method, multigrid, and random walk illustrated on a VDD grid]

6 Prior Work (Cont.)
Recent GPU-based power grid analysis methods
Hybrid multigrid method on GPU (Z. Feng et al., ICCAD 08)
  Pros: very fast (solves four million nodes per second)
  Cons: convergence rate depends on 2D grid approximation
Poisson solver (J. Shi et al., DAC 09)
  Pros: public CUFFT library -> easier implementation
  Cons: only suitable for 2D regular grids
Robust preconditioned Krylov subspace iterative methods on GPU
  Preconditioners using incomplete LU or Cholesky matrix factors
    Matrix factors are hard to store and process on GPU
  Multigrid-based preconditioning methods
    SIMD multigrid solver + sparse matrix-vector operations on GPU

7 NVIDIA GPU Architecture
Streaming Multiprocessor (SM)
  8 streaming processors (SP)
  2 special function units (SFU)
  Multithreaded instruction fetch/dispatch unit
    Multi-threaded instruction dispatch to 52 active threads
    32 threads (a warp) share one instruction fetch
    Cover memory load latency
Some facts about an SM
  6 KB shared memory
  8,96 registers
  >30 Gflops peak performance
[Figure: GPU block diagram with thread execution manager, parallel data caches, texture caches, and read/write global memory; SM detail with instruction and data caches, instruction fetch/dispatch, SPs, SFUs, and shared memory]

8 GPU Memory Space (CUDA Memory Model)
Each thread can:
  R/W per-thread local memory
  R/W per-block shared memory
  R/W per-grid global memory
  Read only per-grid texture/constant memory
Device Memory Comparison
Memory   Local     Shared   Global    Texture
Read     Yes       Yes      Yes       Yes
Write    Yes       Yes      Yes       No
Size     Large     Small    Large     Large
BW       High      High     High      High
Cached?  No        Yes      No        Yes
Latency  500 cyc.  20 cyc.  500 cyc.  300 cyc.
[Figure: grids of thread blocks; each thread has local memory, each block has shared memory, and all grids share global memory]

9 Contribution of This Work
Multigrid preconditioned Krylov subspace iterative solver on GPU
Host (CPU) memory: 3D multi-layer irregular power grid
GPU global memory (DRAM): Jacobi smoother using sparse matrix + geometrical multigrid solver (GMD)
  Original grid: matrix + geometrical representation
  GPU-friendly multi-level iterative algorithm
MGPCG algorithm on GPU
  Set initial solution
  Get initial residual and search direction
  Update solution and residual
  Check convergence
    Not converged: multigrid preconditioning, update search direction, then update solution and residual again
    Converged: return final solution
DC: $G x = b$
TR: $G x(t) + C \frac{dx(t)}{dt} = b(t)$
[Figure: 3D power grid with VDD pads and the sparsity pattern of its 8x8 example matrix]

10 Multigrid Methods
Among the fastest numerical algorithms for PDE-like problems
  Linear complexity in the number of unknowns
  A hierarchy of exact to coarse replicas of the problem
  High (low) frequency errors damped on fine (coarse) grids
  Direct/iterative solvers for the coarsest grid
Multigrid operations
  Smoothing, restriction, prolongation, correction, etc.
Algebraic MG (AMG) and Geometric MG (GMD)
  GMD: suitable for the GPU's SIMD computation
  AMG: robust for handling irregular grids, but needs irregular memory access and complex control flow
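The four multigrid operations listed above can be sketched on a toy problem. The following is a minimal CPU illustration in plain Python, assuming a 1D Poisson model problem with zero boundary values, weighted Jacobi smoothing, full-weighting restriction, and linear interpolation. All function names and parameters are hypothetical, and it is not the GPU solver from the slides.

```python
# Toy geometric multigrid V-cycle for -u'' = f on (0, 1) with u(0) = u(1) = 0.
# Grid sizes are assumed to be 2**k - 1 so that coarsening is exact.

def residual(u, f, h):
    """r = f - A u for the 3-point stencil (-1, 2, -1) / h^2."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r

def smooth(u, f, h, sweeps=2, omega=2.0 / 3.0):
    """Weighted Jacobi: damps high-frequency error on the current grid."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * (h * h / 2.0) * ri for ui, ri in zip(u, r)]
    return u

def restrict(r):
    """Full weighting: coarse point j sits at fine index 2j + 1."""
    nc = (len(r) - 1) // 2
    return [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
            for j in range(nc)]

def prolong(ec, n_fine):
    """Linear interpolation of the coarse correction to the fine grid."""
    e = [0.0] * n_fine
    for j, v in enumerate(ec):
        e[2 * j + 1] = v
    for i in range(0, n_fine, 2):
        left = e[i - 1] if i > 0 else 0.0
        right = e[i + 1] if i < n_fine - 1 else 0.0
        e[i] = 0.5 * (left + right)
    return e

def v_cycle(u, f, h):
    n = len(u)
    if n == 1:                       # coarsest grid: solve exactly
        return [f[0] * h * h / 2.0]
    u = smooth(u, f, h)              # pre-smoothing
    rc = restrict(residual(u, f, h)) # restriction of the residual
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h)
    u = [ui + ei for ui, ei in zip(u, prolong(ec, n))]  # prolong and correct
    return smooth(u, f, h)           # post-smoothing

n = 31
h = 1.0 / (n + 1)
f = [1.0] * n
u = [0.0] * n
for _ in range(10):
    u = v_cycle(u, f, h)
exact = [((i + 1) * h) * (1.0 - (i + 1) * h) / 2.0 for i in range(n)]
```

For this right-hand side the discrete solution coincides with `x(1 - x)/2` at the grid points, which gives a simple way to check the result.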

11 Power Grid Topology Regularization
Location-based mapping (Z. Feng et al., ICCAD 08)
[Figure: metal layers 5~6, 3~4, and 1~2 mapped onto a 2D regular grid]

12 Parallel Multigrid Preconditioning
3D grid smoother + 2D grid GMD solver
  3D finest grid is stored using ELL-like sparse matrix format
  2D coarser to coarsest grids are processed geometrically
  Coalesced memory accesses are guaranteed on GPU
[Figure: preconditioning step from RHS to solution: Jacobi smoothing (iterative matrix solver), GMD solve (smooth, restrict, ..., prolong & correct, smooth), Jacobi smoothing (iterative matrix solver)]

13 GMD Smoother
Mixed block-wise relaxation on GPU
  Weighted Jacobi iterations within each block
  Gauss-Seidel iterations among blocks
[Figure: thread blocks scheduled over execution time onto multiprocessors SM1 to SM3, each with streaming processors SP1 to SP8 and shared memory, all backed by global memory]
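The mixed relaxation above can be emulated sequentially to show the update order. This is a rough CPU sketch in Python under stated assumptions: a small dense matrix stored as a list of lists, blocks visited one after another, and hypothetical names (`block_relax`, `block_size`, `inner_sweeps`, `omega`). On the GPU the blocks would be mapped to multiprocessors rather than looped over.

```python
def block_relax(A, b, x, block_size, inner_sweeps=2, omega=0.8):
    """One pass: weighted Jacobi inside each block, Gauss-Seidel across blocks."""
    n = len(b)
    for start in range(0, n, block_size):
        idx = range(start, min(start + block_size, n))
        for _ in range(inner_sweeps):
            # Jacobi within the block: all new values come from the old x.
            new_vals = []
            for i in idx:
                sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
                jac = (b[i] - sigma) / A[i][i]
                new_vals.append((1.0 - omega) * x[i] + omega * jac)
            for i, v in zip(idx, new_vals):
                x[i] = v
        # Later blocks now see this block's updated values (Gauss-Seidel).
    return x

# Small diagonally dominant test system (illustrative values only).
n = 8
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [sum(row) for row in A]          # chosen so the exact solution is all ones
x = [0.0] * n
for _ in range(50):
    x = block_relax(A, b, x, block_size=4)
```

Keeping Jacobi updates inside a block means the threads of one block never depend on each other within a sweep, which is what makes the inner loop suitable for SIMD execution.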

14 Memory Layout on GPU
Mixed data structures
  Original grid (finest grid level, Level 0): ELL-like sparse matrix
  Regularized coarse to coarsest grids (Level 1, Level 2, Level 3 as the coarsest grid): graphics pixels on GPU
[Figure: grid hierarchy in the X-Y plane from Level 0 to Level 3, linked by restriction and prolongation, next to the sparsity pattern of the 8x8 example matrix]

15 Nodal Analysis Matrix
ELL-like sparse matrix storage
$A = D + M$
  D: diagonal elements of A
  M: off-diagonal elements of A
Off-diagonal elements M: element value vector and element index vector, stored column by column (Col 1, Col 2), with zero padding for rows that have fewer off-diagonal elements
  Col 1: a1,4 a2,3 ... a7,5 a8,3
  Col 2: a1,5 a2,6 ... 0 a8,6
Inverse diagonal elements $D^{-1}$: 1/a1,1, 1/a2,2, ..., 1/a7,7, 1/a8,8
[Figure: 8x8 example matrix with nonzeros a1,1 a1,4 a1,5; a2,2 a2,3 a2,6; a3,2 a3,3 a3,8; a4,1 a4,4; a5,1 a5,5 a5,7; a6,2 a6,6 a6,8; a7,5 a7,7; a8,3 a8,6 a8,8]
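To make the storage scheme concrete, here is a small Python sketch that packs a sparse matrix into ELL-like arrays: column-major off-diagonal values, matching column indices, and a separate inverse-diagonal vector. The sparsity pattern follows the 8x8 example on the slide, but the numeric values (4.0 on the diagonal, -1.0 elsewhere), the padding index 0, and the function name `to_ell` are assumptions made for illustration.

```python
def to_ell(rows):
    """rows[i] maps column index -> value for row i (0-based, diagonal included)."""
    n = len(rows)
    width = max(len(r) - 1 for r in rows)          # max off-diagonals per row
    # values[k][i] is the k-th off-diagonal entry of row i; unused slots stay 0.0.
    values = [[0.0] * n for _ in range(width)]
    cols = [[0] * n for _ in range(width)]         # padded slots point at column 0
    inv_diag = [1.0 / rows[i][i] for i in range(n)]
    for i, row in enumerate(rows):
        off = sorted(c for c in row if c != i)
        for k, c in enumerate(off):
            values[k][i] = row[c]
            cols[k][i] = c
    return values, cols, inv_diag

# 1-based column pattern of each row, as in the slide's example matrix.
pattern = [[1, 4, 5], [2, 3, 6], [2, 3, 8], [1, 4],
           [1, 5, 7], [2, 6, 8], [5, 7], [3, 6, 8]]
# Hypothetical values: 4.0 on the diagonal, -1.0 off the diagonal.
rows = [{c - 1: (4.0 if c - 1 == i else -1.0) for c in cs}
        for i, cs in enumerate(pattern)]
values, cols, inv_diag = to_ell(rows)
```

Because `values[k]` is contiguous over the row index, thread `i` and thread `i + 1` would read neighboring addresses when each thread handles one row, which is the access pattern that allows coalescing.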

16 GPU Device Memory Access Pattern
GPU-based Jacobi iteration (smoother): $x^{(k+1)} = D^{-1}\left(b - M x^{(k)}\right)$
  $S = b - M x^{(k)}$
  $x^{(k+1)} = D^{-1} S$
[Figure: parallel units P reading consecutive entries of the element value vector and the inverse diagonal vector over execution time steps T1 to T4]
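The two-step Jacobi update can be written directly against ELL-like arrays. The sketch below is a sequential Python stand-in for the GPU kernel, assuming one thread per row on the device. The 4x4 tridiagonal test matrix, the padding convention (value 0.0, index 0), and the name `jacobi_ell` are illustrative assumptions.

```python
def jacobi_ell(values, cols, inv_diag, b, x, sweeps):
    """Run Jacobi sweeps x <- D^{-1} (b - M x) with M stored in ELL-like form."""
    n = len(b)
    for _ in range(sweeps):
        s = list(b)                                # S = b - M x^(k)
        for k in range(len(values)):               # one ELL column at a time
            for i in range(n):                     # on a GPU: one thread per row i
                s[i] -= values[k][i] * x[cols[k][i]]
        x = [inv_diag[i] * s[i] for i in range(n)] # x^(k+1) = D^{-1} S
    return x

# A = tridiag(-1, 4, -1), 4x4; padded slots hold value 0.0 and index 0.
values = [[-1.0, -1.0, -1.0, -1.0],
          [0.0, -1.0, -1.0, 0.0]]
cols = [[1, 0, 1, 2],
        [0, 2, 3, 0]]
inv_diag = [0.25, 0.25, 0.25, 0.25]
b = [3.0, 2.0, 2.0, 3.0]                           # A times the all-ones vector
x = jacobi_ell(values, cols, inv_diag, b, [0.0] * 4, sweeps=60)
```

The whole vector `s` is formed from the old `x` before any entry is overwritten, which is what distinguishes Jacobi from Gauss-Seidel and removes dependencies between rows.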

17 Algorithm Flow
Preconditioner: Jacobi smoother using sparse matrix + geometrical multigrid solver (GMD)
Set initial solution
Get initial residual and search direction
Update solution and residual
Check convergence
  Not converged: multigrid preconditioning, update search direction, then update solution and residual again
  Converged: return final solution
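The flow above matches the standard preconditioned conjugate gradient loop, sketched here in Python with the preconditioner passed in as a callback. In this sketch a simple diagonal preconditioner stands in for the multigrid preconditioner of the slides, and the names `pcg`, `matvec`, and `precond` are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def pcg(matvec, precond, b, tol=1e-10, max_iter=200):
    """Preconditioned CG for a symmetric positive definite system A x = b."""
    x = [0.0] * len(b)                   # set initial solution
    r = list(b)                          # initial residual (x = 0)
    z = precond(r)
    p = list(z)                          # initial search direction
    rz = dot(r, z)
    if rz == 0.0:                        # b is zero: x = 0 already solves it
        return x, 0
    b_norm = dot(b, b) ** 0.5
    for it in range(1, max_iter + 1):
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]    # update solution
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]   # update residual
        if dot(r, r) ** 0.5 <= tol * b_norm:             # check convergence
            return x, it
        z = precond(r)                   # preconditioning step
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]     # update search direction
        rz = rz_new
    return x, max_iter

# Illustrative system: A = tridiag(-1, 2, -1), exact solution all ones.
n = 10
def matvec(v):
    return [2.0 * v[i] - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0) for i in range(n)]
def diag_precond(r):                     # stand-in for the multigrid preconditioner
    return [ri / 2.0 for ri in r]
b = matvec([1.0] * n)
x, iters = pcg(matvec, diag_precond, b)
```

Swapping `diag_precond` for a function that applies one multigrid cycle to `r` would give a multigrid preconditioned CG in the same shape, provided that cycle acts as a fixed symmetric positive definite operator.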

18 Experiment Results
Linux computing system: C++ & CUDA
  CPU: Core 2 Quad 2.66GHz + 6GB DRAM
  GPU: NVIDIA GTX 285 1.5GHz with 240 SPs ($400)
Power grid test cases
  IBM power grid benchmark circuits CKT1~5 (0.3M ~ 7M).7
  Larger industrial power grid designs CKT6~8 (4.5M ~ 0M)
Direct solver on the host
  Cholmod with Supernodal and Metis functions
Iterative solvers on GPU
  MGPCG: multigrid preconditioned CG
  DPCG: diagonally preconditioned CG
  HMD: hybrid multigrid (Z. Feng et al., ICCAD 08)

19 Power Grid Design Information
CKT N_node N_layer N_nnz N_res N_cur
CKT1 27.0K K 209.7K 37.9K
CKT2 85.6K 5 3.7M.4M 20.K
CKT3 K 6 4.M.5M 277.0K
CKT4 .0M 3 4.3M.7M 540.8K
CKT5 .7M 3 6.6M 2.5M 76.5K
CKT6 4.7M 8 8.8M 6.8M 85.5K
CKT7 6.7M M 9.7M 267.3K
CKT8 0.5M M 4.8M 49.3K
N_layer: the number of metal layers
N_res: the number of resistors
N_cur: the number of current sources

20 Convergence Comparison
[Figure: residual vs. iteration number for HMD and MGPCG, log scale]
Max errors: HMD: e-3 Volt; MGPCG: e-5 Volt
HMD: hybrid multigrid method
MGPCG: multigrid preconditioned conjugate gradient method
MGPCG converges much faster

21 Results: Power Grid DC Analysis
CKT NCG NDPCG NMGPCG NHMD TCG TDPCG
CKT1 ,
CKT2 4,834 3,
CKT3 2,
CKT4 4, >
CKT5 6, >
NCG: the number of CG iterations
NDPCG: the number of diagonally preconditioned CG iterations
NMGPCG: the number of multigrid preconditioned CG iterations
NHMD: the number of hybrid multigrid iterations
TCG: the runtime of CG
TDPCG: the runtime of diagonally preconditioned CG iterations

22 Results (Cont.): Power Grid DC Analysis (Cont.)
CKT TMGPCG THMD TCHOL Eavg Emax Speedup
CKT1 e-4 4e-4 34X
CKT2 e-6 2e-5 40X
CKT3 e-5 e-4 54X
CKT4 0.9 > e-5 7e-4 22X
CKT5. > e-4 5e-4 25X
TMGPCG: the runtime of multigrid preconditioned CG iterations
THMD: the runtime of hybrid multigrid iterations
TCHOL: the runtime of direct matrix solver (Cholmod)
Eavg: average error
Emax: maximum error
Speedup: TCHOL/TMGPCG

23 Results (Cont.): DC Analysis of Large Circuits
CKT N_node N_MGPCG T_MGPCG T_CHOL Speedup
CKT6 4.7M X
CKT7 6.7M X
CKT8 0.5M.6 N/A N/A
N_MGPCG: the number of MGPCG iterations
T_MGPCG: the runtime of MGPCG solver
T_CHOL: the runtime of the Cholmod solver

24 Results (Cont.): Transient Analysis Results
CKT Tcpu Tgpu Ngpu Eavg Emax Speedup
CKT e-6 8e-4 20X
CKT e-5 3e-4 2X
CKT e-6 e-4 23X
CKT e-5 2e-4 8X
CKT e-5 e-4 2X
Tcpu: Cholmod solve time
Tgpu: MGPCG time
Ngpu: the number of MGPCG iterations

25 Transient Analysis: CKT1
CKT1 with 27K nodes, 500 time steps, 509 MGPCG iterations
Cholmod: 22s, GPU: 9.2s, 23X speedup
[Figure: two panels of voltage (V) vs. time (seconds), Cholmod vs. GPU waveforms]

26 Transient Analysis: CKT5
CKT5 with 7Mnodes.7M time steps, 693 MGPCG iterations
Cholmod: 2,700s, GPU: 28s, 22X speedup
[Figure: two panels of voltage (V) vs. time (seconds), Cholmod vs. GPU waveforms]

27 Conclusion and Future Work
Robust circuit simulation on GPU is challenging
  How to accelerate simulations for irregular problems?
  Hard to guarantee accuracy and robustness
Parallel multigrid preconditioning method for power grid analysis
  Multigrid preconditioning (geometrical + matrix representations)
    Geometrical multigrid solver on GPU
    ELL-like sparse matrix-vector operations for original grids on GPU
  Applicable to more general power grids with strong irregularities
  Much faster convergence & higher accuracy than ever before
Future work
  Node ordering and grid partitioning for Multi-Core-Multi-GPUs
  GPU performance modeling for further improving the solver efficiency
  Heterogeneous computing to adaptively balance the workloads
