Physis: An Implicitly Parallel Framework for Stencil Computations

1 Physis: An Implicitly Parallel Framework for Stencil Computations
Naoya Maruyama, RIKEN AICS (Formerly at Tokyo Tech)
GTC12, May

2 Multi-GPU Application Development
- Non-unified programming models: MPI for inter-node parallelism, CUDA/OpenCL/OpenACC for accelerators
- Optimization: blocking, overlapped computation and communication
→ Good performance with low programmer productivity

3 Goal and Approach
Goal: High performance, highly productive programming for heterogeneous clusters
Approach: High-level abstractions for structured parallel programming
- Simplifying programming models
- Portability across platforms
- Does not sacrifice too much performance

4 Stencil Computation
Pc = (Pc + Pn + Ps + Pw + Pe) * 1/5.0
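The update rule on this slide can be sketched in plain C as a single sweep over a 2-D grid. This is only an illustration of the formula: the function name `stencil5_sweep`, the row-major layout, the separate output array, and the choice to leave boundary cells unchanged are all assumptions of the sketch, not part of Physis.

```c
#include <stddef.h>

/* One sweep of the 5-point average
 *     Pc = (Pc + Pn + Ps + Pw + Pe) * 1/5.0
 * over an nx-by-ny row-major grid. Results go to a separate output
 * array so every point reads only old values. Boundary cells are
 * copied unchanged (an assumption of this sketch). */
void stencil5_sweep(const float *in, float *out, size_t nx, size_t ny)
{
    for (size_t y = 0; y < ny; ++y) {
        for (size_t x = 0; x < nx; ++x) {
            size_t c = y * nx + x;
            if (x == 0 || y == 0 || x == nx - 1 || y == ny - 1) {
                out[c] = in[c];      /* boundary: keep the old value */
                continue;
            }
            float pn = in[c - nx];   /* north neighbor (y - 1) */
            float ps = in[c + nx];   /* south neighbor (y + 1) */
            float pw = in[c - 1];    /* west neighbor  (x - 1) */
            float pe = in[c + 1];    /* east neighbor  (x + 1) */
            out[c] = (in[c] + pn + ps + pw + pe) * (1.0f / 5.0f);
        }
    }
}
```

Calling `stencil5_sweep(in, out, nx, ny)` repeatedly while swapping `in` and `out` gives the usual iterative form of the computation.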

5 Physis (Φύσις) Framework
Stencil DSL
- Declarative
- Portable
- Global-view
- C-based
DSL Compiler
- Target-specific code generation and optimizations
- Automatic parallelization

    void diffusion(int x, int y, int z,
                   PSGrid3DFloat g1, PSGrid3DFloat g2) {
      float v = PSGridGet(g1,x,y,z)
        +PSGridGet(g1,x-1,y,z)+PSGridGet(g1,x+1,y,z)
        +PSGridGet(g1,x,y-1,z)+PSGridGet(g1,x,y+1,z)
        +PSGridGet(g1,x,y,z-1)+PSGridGet(g1,x,y,z+1);
      PSGridEmit(g2,v/7.0);
    }

Physis targets: C, C+MPI, CUDA, CUDA+MPI, OpenMP, OpenCL

6 DSL Overview
C + custom data types and intrinsics
Grid data types
- PSGrid3DFloat, PSGrid3DDouble, etc.
Dense Cartesian domain types
- PSDomain1D, PSDomain2D, and PSDomain3D
Intrinsics
- Runtime management
- Grid object management (PSGrid3DFloatNew, etc.)
- Grid accesses (PSGridCopyin, PSGridGet, etc.)
- Applying stencils to grids (PSStencilMap, PSStencilRun)
- Grid reductions (PSGridReduce)

7 Writing Stencils
Stencil Kernel
- C functions describing a single flow of scalar execution on one grid element
- Executed over specified rectangular domains

    void diffusion(const int x, const int y, const int z,
                   PSGrid3DFloat g1, PSGrid3DFloat g2, float t) {
      float v = PSGridGet(g1,x,y,z)
        +PSGridGet(g1,x-1,y,z)+PSGridGet(g1,x+1,y,z)
        +PSGridGet(g1,x,y-1,z)+PSGridGet(g1,x,y+1,z)
        +PSGridGet(g1,x,y,z-1)+PSGridGet(g1,x,y,z+1);
      PSGridEmit(g2,v/7.0*t);
    }

Notes:
- PSGridEmit issues a write to grid g2
- Offsets in PSGridGet must be constant
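To make the "single flow of scalar execution on one grid element" idea concrete, here is a minimal sequential sketch in plain C of what applying such a kernel over a rectangular domain amounts to. The `Grid3D` struct, `grid_index`, `diffusion_point`, and `diffusion_apply_interior` are hypothetical stand-ins invented for this sketch; they are not the Physis types or intrinsics. The sketch also restricts the domain to interior points so that every neighbor access stays in range, which is an assumption rather than Physis behavior.

```c
#include <stddef.h>

/* Hypothetical stand-in for a Physis grid: a dense nx*ny*nz float array. */
typedef struct {
    float *data;
    int nx, ny, nz;
} Grid3D;

/* Linear index of (x, y, z), with x varying fastest. */
static size_t grid_index(const Grid3D *g, int x, int y, int z)
{
    return ((size_t)z * (size_t)g->ny + (size_t)y) * (size_t)g->nx + (size_t)x;
}

/* Scalar kernel for one grid element, mirroring the slide's diffusion
 * stencil: read seven points of g1 at constant offsets, write one
 * point of g2. */
static void diffusion_point(int x, int y, int z,
                            const Grid3D *g1, Grid3D *g2, float t)
{
    float v = g1->data[grid_index(g1, x, y, z)]
            + g1->data[grid_index(g1, x - 1, y, z)]
            + g1->data[grid_index(g1, x + 1, y, z)]
            + g1->data[grid_index(g1, x, y - 1, z)]
            + g1->data[grid_index(g1, x, y + 1, z)]
            + g1->data[grid_index(g1, x, y, z - 1)]
            + g1->data[grid_index(g1, x, y, z + 1)];
    g2->data[grid_index(g2, x, y, z)] = v / 7.0f * t;
}

/* Sequential driver: run the kernel on every interior point. The
 * outermost layer is skipped so that all neighbor reads are in range. */
void diffusion_apply_interior(const Grid3D *g1, Grid3D *g2, float t)
{
    for (int z = 1; z < g1->nz - 1; ++z)
        for (int y = 1; y < g1->ny - 1; ++y)
            for (int x = 1; x < g1->nx - 1; ++x)
                diffusion_point(x, y, z, g1, g2, t);
}
```

Because the kernel only describes one point, the loop nest in the driver is the part a framework is free to replace with a parallel schedule.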

8 Applying Stencils to Grids
- Map: Creates a stencil closure that encapsulates stencil and grids
- Run: Iteratively executes stencil closures

    PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ);
    PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
    PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ);
    PSStencilRun(PSStencilMap(diffusion,d,g1,g2,0.5),
                 PSStencilMap(diffusion,d,g2,g1,0.5),
                 10);

Grouping by PSStencilRun → Target for kernel fusion optimization
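The Map/Run split can be illustrated with a small plain-C sketch: "map" binds a kernel to its domain and arguments without running it, and "run" executes a group of such closures repeatedly. Everything here (`Grid1D`, `Domain1D`, `StencilClosure`, `stencil_map`, `stencil_run`, `avg3`) is hypothetical and reduced to 1-D for brevity. The sketch also assumes the iteration count repeats the whole group in order, which is one reading of the slide, not a statement about the Physis runtime.

```c
#include <stddef.h>

typedef struct { float *data; int n; } Grid1D;    /* hypothetical 1-D grid */
typedef struct { int begin, end; } Domain1D;      /* half-open index range */
typedef void (*Kernel1D)(int x, const Grid1D *in, Grid1D *out, float t);

/* A stencil "closure": a kernel plus the arguments bound to it. */
typedef struct {
    Kernel1D kernel;
    Domain1D dom;
    Grid1D *in;
    Grid1D *out;
    float t;
} StencilClosure;

/* "Map": capture kernel, domain, and grids; nothing executes yet. */
StencilClosure stencil_map(Kernel1D k, Domain1D d,
                           Grid1D *in, Grid1D *out, float t)
{
    StencilClosure c = { k, d, in, out, t };
    return c;
}

/* "Run": execute the closures in order, repeating the group iter times. */
void stencil_run(const StencilClosure *cs, size_t count, int iter)
{
    for (int i = 0; i < iter; ++i)
        for (size_t s = 0; s < count; ++s)
            for (int x = cs[s].dom.begin; x < cs[s].dom.end; ++x)
                cs[s].kernel(x, cs[s].in, cs[s].out, cs[s].t);
}

/* Example kernel: 3-point average scaled by t. */
void avg3(int x, const Grid1D *in, Grid1D *out, float t)
{
    out->data[x] = (in->data[x - 1] + in->data[x] + in->data[x + 1]) / 3.0f * t;
}
```

Passing two closures that swap the input and output grids, as the slide does with g1 and g2, gives a ping-pong iteration. Seeing the whole group at once is what would let a translator consider fusing the kernels.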

9 Implementation
Physis Code → Implementation Source Code → Executable Code
DSL translator
- Translate intrinsics calls to RT API calls
- Generate GPU kernels with boundary exchanges based on static analysis
- Using the ROSE compiler framework (LLNL)
Runtime
- Provides a shared memory-like interface for multidimensional grids over distributed CPU/GPU memory

10 CUDA Thread Blocking
- Each thread sweeps points in the Z dimension
- X and Y dimensions are blocked with AxB thread blocks, where A and B are user-configurable parameters (64x4 by default)
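The blocking scheme can be emulated on the host in plain C to check that it covers the domain. In this sketch the two outer loop pairs play the role of thread blocks and of threads within a block, and each emulated thread sweeps Z. `Dim2`, `blocks_needed`, and `sweep_blocked` are names made up for the illustration; the 64x4 default comes from the slide, the rest is assumption.

```c
#include <stddef.h>

typedef struct { int x, y; } Dim2;

/* Number of A x B blocks needed to cover an nx-by-ny plane (rounding up). */
Dim2 blocks_needed(int nx, int ny, int a, int b)
{
    Dim2 d = { (nx + a - 1) / a, (ny + b - 1) / b };
    return d;
}

/* Host-side emulation of the blocking: each (block, thread) pair owns
 * one (x, y) column and sweeps it along Z. visits must hold nx*ny*nz
 * zero-initialized ints; each covered point is incremented once. */
void sweep_blocked(int nx, int ny, int nz, int a, int b, int *visits)
{
    Dim2 grid = blocks_needed(nx, ny, a, b);
    for (int by = 0; by < grid.y; ++by)
        for (int bx = 0; bx < grid.x; ++bx)
            for (int ty = 0; ty < b; ++ty)
                for (int tx = 0; tx < a; ++tx) {
                    int x = bx * a + tx;
                    int y = by * b + ty;
                    if (x >= nx || y >= ny)
                        continue;   /* thread falls outside the domain */
                    for (int z = 0; z < nz; ++z)
                        visits[((size_t)z * ny + y) * nx + x] += 1;
                }
}
```

For a 100x10 plane with the default 64x4 blocks this needs 2x3 blocks, and the bounds check discards the threads that overhang the domain edge.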

11 Example: 7-point Stencil GPU Code

    __device__ void kernel(const int x, const int y, const int z,
                           PSGrid3DFloatDev *g, PSGrid3DFloatDev *g2) {
      float v = ((((((*PSGridGetAddrNoHaloFloat3D(g,x,y,z) +
                      *PSGridGetAddrFloat3D_0_fw(g,(x + 1),y,z)) +
                     *PSGridGetAddrFloat3D_0_bw(g,(x - 1),y,z)) +
                    *PSGridGetAddrFloat3D_1_fw(g,x,(y + 1),z)) +
                   *PSGridGetAddrFloat3D_1_bw(g,x,(y - 1),z)) +
                  *PSGridGetAddrFloat3D_2_bw(g,x,y,(z - 1))) +
                 *PSGridGetAddrFloat3D_2_fw(g,x,y,(z + 1)));
      *PSGridEmitAddrFloat3D(g2,x,y,z) = v;
    }

    __global__ void PSStencilRun_kernel(int offset0, int offset1, PSDomain dom,
                                        PSGrid3DFloatDev g, PSGrid3DFloatDev g2) {
      int x = blockIdx.x * blockDim.x + threadIdx.x + offset0,
          y = blockIdx.y * blockDim.y + threadIdx.y + offset1;
      /* Threads outside the local X/Y domain do nothing. */
      if (x < dom.local_min[0] || x >= dom.local_max[0] ||
          (y < dom.local_min[1] || y >= dom.local_max[1]))
        return;
      int z;
      /* Each thread sweeps its (x, y) column along Z. */
      for (z = dom.local_min[2]; z < dom.local_max[2]; ++z) {
        kernel(x,y,z,&g,&g2);
      }
    }

12 Example: 7-point Stencil CPU Code

    static void PSStencilRun_0(int iter, void **stencils) {
      struct dim3 block_dim(64,4,1);
      struct PSStencil_kernel *s0 = (struct PSStencil_kernel *)stencils[0];
      cudaFuncSetCacheConfig(PSStencilRun_kernel, cudaFuncCachePreferL1);
      struct dim3 s0_grid_dim((int)(ceil(PSGetLocalSize(0) / ((double)64))),
                              (int)(ceil(PSGetLocalSize(1) / ((double)4))), 1);
      PSDomainSetLocalSize(&s0->dom);
      s0->g = PSGetGridByID(s0->g_index);
      s0->g2 = PSGetGridByID(s0->g2_index);
      int i;
      for (i = 0; i < iter; ++i) {
        {
          int fw_width[3] = {1L, 1L, 1L};
          int bw_width[3] = {1L, 1L, 1L};
          PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
        }
        PSStencilRun_kernel<<<s0_grid_dim,block_dim>>>(
            PSGetLocalOffset(0), PSGetLocalOffset(1), s0->dom,
            *((PSGrid3DFloatDev *)(PSGridGetDev(s0->g))),
            *((PSGrid3DFloatDev *)(PSGridGetDev(s0->g2))));
      }
      cudaThreadSynchronize();
    }

13 Optimization: Overlapped Computation and Communication
1. Copy boundaries from GPU to CPU for non-unit stride cases
2. Computes interior points
3. Boundary exchanges with neighbors
4. Computes boundaries
Figure: timeline showing inner points and boundary over time
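The reason this ordering works is that interior points never read halo cells, so they can be computed before the exchange finishes. A minimal sequential C sketch of that dependency, reduced to a 1-D 3-point average, follows. The names `avg3_at`, `step_overlapped`, and `step_reference` are hypothetical, and the "exchange" is just two values passed in by the caller; a real implementation would run the exchange asynchronously while the interior is computed, which this sketch does not attempt.

```c
/* 3-point average at local index i. */
static float avg3_at(const float *in, int i)
{
    return (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

/* One step on a local subdomain of n >= 1 owned points stored in
 * in[1..n], with halo cells in[0] and in[n + 1]. The order follows the
 * slide: interior first, then boundary exchange, then boundary points. */
void step_overlapped(float *in, float *out, int n,
                     float left_halo, float right_halo)
{
    /* Interior points 2..n-1 read only owned cells, so they do not
     * have to wait for the halo values. */
    for (int i = 2; i <= n - 1; ++i)
        out[i] = avg3_at(in, i);

    /* Boundary exchange: stand-in for receiving neighbor data. */
    in[0] = left_halo;
    in[n + 1] = right_halo;

    /* Boundary points need the freshly received halo cells. */
    out[1] = avg3_at(in, 1);
    if (n > 1)
        out[n] = avg3_at(in, n);
}

/* Reference ordering: exchange first, then compute every owned point. */
void step_reference(float *in, float *out, int n,
                    float left_halo, float right_halo)
{
    in[0] = left_halo;
    in[n + 1] = right_halo;
    for (int i = 1; i <= n; ++i)
        out[i] = avg3_at(in, i);
}
```

Both orderings produce the same output; the split version simply exposes work that can proceed while communication is in flight.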

14 Optimization Example: 7-Point Stencil CPU Code

    for (i = 0; i < iter; ++i) {
      /* Computing interior points */
      PSStencilRun_kernel_interior<<<s0_grid_dim,block_dim,0,stream_interior>>>(
          PSGetLocalOffset(0), PSGetLocalOffset(1),
          PSDomainShrink(&s0->dom,1),
          *((PSGrid3DFloatDev *)(PSGridGetDev(s0->g))),
          *((PSGrid3DFloatDev *)(PSGridGetDev(s0->g2))));
      /* Boundary exchange */
      int fw_width[3] = {1L, 1L, 1L};
      int bw_width[3] = {1L, 1L, 1L};
      PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
      /* Computing boundary planes concurrently */
      PSStencilRun_kernel_boundary_1_bw<<<1,(dim3(1,128,4)),0,stream_boundary_kernel[0]>>>(
          PSDomainGetBoundary(&s0->dom,0,0,1,5,0),
          *((PSGrid3DFloatDev *)(PSGridGetDev(s0->g))),
          *((PSGrid3DFloatDev *)(PSGridGetDev(s0->g2))));
      PSStencilRun_kernel_boundary_1_bw<<<1,(dim3(1,128,4)),0,stream_boundary_kernel[1]>>>(
          PSDomainGetBoundary(&s0->dom,0,0,1,5,1),
          *((PSGrid3DFloatDev *)(PSGridGetDev(s0->g))),
          *((PSGrid3DFloatDev *)(PSGridGetDev(s0->g2))));
      PSStencilRun_kernel_boundary_2_fw<<<1,(dim3(128,1,4)),0,stream_boundary_kernel[11]>>>(
          PSDomainGetBoundary(&s0->dom,1,1,1,1,0),
          *((PSGrid3DFloatDev *)(PSGridGetDev(s0->g))),
          *((PSGrid3DFloatDev *)(PSGridGetDev(s0->g2))));
      cudaThreadSynchronize();
    }
    cudaThreadSynchronize();
    } /* closes the enclosing function, whose header is not shown */

15 Evaluation
Performance and productivity
Sample code
- 7-point diffusion kernel (#stencil: 1)
- Jacobi kernel from Himeno benchmark (#stencil: 1)
- Seismic simulation (#stencil: 15)
Platform: Tsubame 2.0
- Node: Westmere-EP 2.9GHz x 2 + M2050 x 3
- Dual Infiniband QDR with full bisection BW fat tree

16 Productivity
Figure: Increase of Lines of Code for Diffusion, Himeno, and Seismic; series: Original, MPI, Physis, Generated (No Opt), Generated (Opt)
Similar size as sequential code in C

17 Optimization Effects
Figure: Diffusion Weak Scaling Performance, Performance (GFLOPS); series: Baseline, Overlapped boundary exchange, +Multistream boundary kernels, Full opt, Manual

18 Diffusion Weak Scaling
Figure: GFlops versus Number of GPUs; series labels: x256x, x128x

19 Seismic Weak Scaling
Problem size: 256x256x256 per GPU
Figure: GFLOPS versus Number of GPUs (2 GPUs per node)

20 Diffusion Strong Scaling
Problem size: 512x512x
Figure: GFlops versus Number of GPUs; series labels: D, 2-D, 3-D

21 Himeno Strong Scaling
Problem size: XL (1024x1024x512)
Figure: Gflops versus Number of GPUs; series labels: 1-D, 2-D

22 Conclusion
High-level abstractions for stencil computations
- Portable
- Declarative
- Automatic parallelization
Future work
- Fault tolerance by automated checkpointing
- More performance tuning
  - 3.5D blocking [Nguyen, 2010]
- Support of other architectures
  - OpenMP/OpenCL ongoing
  - Porting to the K Computer
Acknowledgments
- JST CREST, FP3C, NVIDIA
- The ROSE project by Dan Quinlan et al. of LLNL

23 Further Information
Code is available at
Maruyama et al., Physis: Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers, SC'11, 2011.
