What does Fusion mean for HPC?

1 What does Fusion mean for HPC? Casey Battaglino, Aparna Chandramowlishwaran, Jee Choi, Kent Czechowski, Cong Hou, Chris McClanahan, Dave S. Noble, Jr., Richard (Rich) Vuduc. AMD Fusion Developers Summit, Bellevue, Washington, June 14, 2011

2 Two scales

3 CONTEXT: MOBO (MOVING BOUNDARIES) Citation: A. Rahimian, I. Lashuk, A. Chandramowlishwaran, D. Malhotra, L. Moon, R. Sampath, A. Shringarpure, S. Veerapaneni, J. Vetter, R. Vuduc, D. Zorin, and G. Biros. Petascale direct numerical simulation of blood flow on 200k cores and heterogeneous architectures. In SC '10. Winner, Gordon Bell Prize.

5 First principles: Balance. Kung (1986); Callahan & Kennedy (1988); McCalpin (1995)

6 Which swim lane will win for (3D) FFTs at exascale? [Figure: swim lane 1 vs. swim lane 2, each applied to a 3D FFT]

7 Year=2020: n=21000
[Plot: time (sec) for the two systems; legend: Communication, Network; bar annotations: 17 Pflop/s and 5.8 Pflop/s]
CPU: 1M sockets, Peak = 3.98 EF/s, Bisection = 273 TB/s
GPU: 131k sockets, Peak = 3.98 EF/s, Bisection = 70.2 TB/s
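The two machine configurations on slide 7 can be compared with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes "1M" and "131k" mean 1.0e6 and 131e3 sockets (they may instead be powers of two), with 1 EF/s = 1e18 flop/s and 1 TB/s = 1e12 bytes/s.

```c
#include <assert.h>

/* Peak flop/s each socket must deliver so that `sockets` of them
   together reach `machine_peak` (both arguments must be positive). */
double per_socket_peak(double machine_peak, double sockets)
{
    assert(machine_peak > 0.0 && sockets > 0.0);
    return machine_peak / sockets;
}

/* Bytes that can cross the bisection per flop executed at machine peak.
   A smaller value means the network is scarcer relative to compute. */
double bisection_bytes_per_flop(double bisection_bw, double machine_peak)
{
    assert(bisection_bw > 0.0 && machine_peak > 0.0);
    return bisection_bw / machine_peak;
}
```

Under these assumptions, per_socket_peak(3.98e18, 1.0e6) is roughly 4 Tflop/s for the CPU system and per_socket_peak(3.98e18, 131.0e3) is roughly 30 Tflop/s for the GPU system, while the GPU system has about a quarter of the bisection bytes per flop (70.2 TB/s against 273 TB/s at the same peak).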

8 Two costs: Tnetwork + Tmemory

11 Two costs: Tnetwork + Tmemory
For a fixed problem size and fixed machine peak, faster [sockets] mean:
  fewer [sockets]
  smaller network and lower bisection bandwidth
  larger local-problem size
Trend: Easier to scale network bandwidth than memory bandwidth
More time communicating, both on the [socket] and in the network

12 Two costs: Tnetwork + Tmemory
For a fixed problem size and fixed machine peak, faster [sockets] mean:
Let $B \equiv \dfrac{\text{flop/s}}{\text{memory bandwidth}}$. Then
$T_{\mathrm{mem}} \sim O(B)$
$T_{\mathrm{net}} \sim O\!\left(B^{(d-1)/d}\right)$ (d-dimensional torus)
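As a worked instance of slide 12, read its relations as $T_{\mathrm{mem}} \sim O(B)$ and $T_{\mathrm{net}} \sim O(B^{(d-1)/d})$, with $B$ the ratio of flop/s to memory bandwidth. The factor of 8 and the choice $d = 3$ below are illustrative assumptions, and constants are ignored.

```latex
\begin{align*}
\frac{T_{\mathrm{mem}}(8B)}{T_{\mathrm{mem}}(B)} &\sim \frac{8B}{B} = 8, \\
\frac{T_{\mathrm{net}}(8B)}{T_{\mathrm{net}}(B)} &\sim \frac{(8B)^{2/3}}{B^{2/3}} = 8^{2/3} = 4.
\end{align*}
```

In this model, an eightfold increase in balance on a 3D torus raises the memory term eightfold but the network term only fourfold, so the memory term grows faster.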

14 Compute-intensive I/O-intensive

18 Desiderata
For hardware:
  Balanced procs: APUs provide reconfigurable balance
  High-bandwidth to APU: GDDRx, stacked memory
  Integrated network I/O
  Shared cache or cache coherence within APU
For software:
  Unified address space
  Support for asynchronous parallel execution (task graphs)
  Less verbose programming models (next slide)
  Useful and interpretable counters

19
/* Host code: OpenCL C API plus the NVIDIA SDK helpers shrCheckError and oclLoadProgSource.
   h_a, h_b, h_c, mem_size_a, mem_size_b, mem_size_c, WA, and WC are defined elsewhere in the program. */
cl_int errcode;
size_t databytes, kernellength;
cl_context clgpucontext;
cl_command_queue clcommandque;
cl_program clprogram;
cl_kernel clkernel;
cl_mem d_a, d_b, d_c;

// Create a GPU context and find its devices
clgpucontext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, &errcode);
shrCheckError(errcode, CL_SUCCESS);
errcode = clGetContextInfo(clgpucontext, CL_CONTEXT_DEVICES, 0, NULL, &databytes);
cl_device_id *cldevices = (cl_device_id *)malloc(databytes);
errcode |= clGetContextInfo(clgpucontext, CL_CONTEXT_DEVICES, databytes, cldevices, NULL);
shrCheckError(errcode, CL_SUCCESS);

// Create a command queue on the first device
clcommandque = clCreateCommandQueue(clgpucontext, cldevices[0], 0, &errcode);
shrCheckError(errcode, CL_SUCCESS);

// Allocate device buffers; A and B are initialized from host memory
d_c = clCreateBuffer(clgpucontext, CL_MEM_READ_WRITE, mem_size_c, NULL, &errcode);
d_a = clCreateBuffer(clgpucontext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, mem_size_a, h_a, &errcode);
d_b = clCreateBuffer(clgpucontext, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, mem_size_b, h_b, &errcode);

// Load and build the program, then create the kernel
char *clmatrixmul = oclLoadProgSource("kernel.cl", "// My comment\n", &kernellength);
shrCheckError(clmatrixmul != NULL, shrTRUE);
clprogram = clCreateProgramWithSource(clgpucontext, 1, (const char **)&clmatrixmul, &kernellength, &errcode);
shrCheckError(errcode, CL_SUCCESS);
errcode = clBuildProgram(clprogram, 0, NULL, NULL, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);
clkernel = clCreateKernel(clprogram, "matrixmul", &errcode);
shrCheckError(errcode, CL_SUCCESS);

// Set the kernel arguments; the error codes are OR-ed so that no failure is overwritten
size_t localworksize[2], globalworksize[2];
int wa = WA;
int wc = WC;  // width of C, which equals the width of B (the kernel's wb)
errcode  = clSetKernelArg(clkernel, 0, sizeof(cl_mem), (void *)&d_c);
errcode |= clSetKernelArg(clkernel, 1, sizeof(cl_mem), (void *)&d_a);
errcode |= clSetKernelArg(clkernel, 2, sizeof(cl_mem), (void *)&d_b);
errcode |= clSetKernelArg(clkernel, 3, sizeof(int), (void *)&wa);
errcode |= clSetKernelArg(clkernel, 4, sizeof(int), (void *)&wc);
shrCheckError(errcode, CL_SUCCESS);

// Launch a 1024 x 1024 grid of work-items in 16 x 16 work-groups
localworksize[0] = 16;
localworksize[1] = 16;
globalworksize[0] = 1024;
globalworksize[1] = 1024;
errcode = clEnqueueNDRangeKernel(clcommandque, clkernel, 2, NULL, globalworksize, localworksize, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);

// 8. Retrieve result from device
errcode = clEnqueueReadBuffer(clcommandque, d_c, CL_TRUE, 0, mem_size_c, h_c, 0, NULL, NULL);
shrCheckError(errcode, CL_SUCCESS);

// kernel.cl
#define BLOCK_SIZE 16  // must match localworksize on the host

__kernel void matrixmul(__global float* C, __global float* A, __global float* B, int wa, int wb)
{
    // Tiles of A and B staged in local memory, declared at kernel scope as OpenCL C requires
    __local float As[BLOCK_SIZE][BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int bx = get_group_id(0);
    int by = get_group_id(1);
    int tx = get_local_id(0);
    int ty = get_local_id(1);

    int abegin = wa * BLOCK_SIZE * by;
    int aend = abegin + wa - 1;
    int astep = BLOCK_SIZE;
    int bbegin = BLOCK_SIZE * bx;
    int bstep = BLOCK_SIZE * wb;

    float Csub = 0.0f;  // partial result for this work-item's element of C
    for (int a = abegin, b = bbegin; a <= aend; a += astep, b += bstep) {
        As[ty][tx] = A[a + wa * ty + tx];
        Bs[ty][tx] = B[b + wb * ty + tx];
        barrier(CLK_LOCAL_MEM_FENCE);  // wait until both tiles are fully loaded
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];
        barrier(CLK_LOCAL_MEM_FENCE);  // wait before the tiles are overwritten
    }

    int c = wb * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wb * ty + tx] = Csub;
}
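The index arithmetic in the matrixmul kernel can be checked on a CPU. The sketch below is not part of the slides: matmul_tiled_ref and matmul_naive are hypothetical helper names, the block size is passed as a parameter bs instead of the BLOCK_SIZE macro, and all matrix dimensions are assumed to be multiples of bs. It replays each work-group (bx, by) and work-item (tx, ty) serially, using the same abegin, bbegin, and step expressions as the kernel.

```c
#include <assert.h>

/* Serial emulation of the tiled kernel. A is ha x wa, B is wa x wb,
   C is ha x wb, all row-major. */
void matmul_tiled_ref(float *C, const float *A, const float *B,
                      int ha, int wa, int wb, int bs)
{
    assert(bs > 0 && ha % bs == 0 && wa % bs == 0 && wb % bs == 0);
    for (int by = 0; by < ha / bs; ++by)
    for (int bx = 0; bx < wb / bs; ++bx)
    for (int ty = 0; ty < bs; ++ty)
    for (int tx = 0; tx < bs; ++tx) {
        int abegin = wa * bs * by;      /* first tile of A in block row by */
        int aend = abegin + wa - 1;     /* last element of that row of A */
        int bbegin = bs * bx;           /* first tile of B in block column bx */
        float csub = 0.0f;
        for (int a = abegin, b = bbegin; a <= aend; a += bs, b += bs * wb)
            for (int k = 0; k < bs; ++k)
                /* As[ty][k] is A[a + wa*ty + k]; Bs[k][tx] is B[b + wb*k + tx] */
                csub += A[a + wa * ty + k] * B[b + wb * k + tx];
        int c = wb * bs * by + bs * bx; /* top-left element of this C tile */
        C[c + wb * ty + tx] = csub;
    }
}

/* Straightforward triple loop, used as the reference result. */
void matmul_naive(float *C, const float *A, const float *B,
                  int ha, int wa, int wb)
{
    for (int i = 0; i < ha; ++i)
        for (int j = 0; j < wb; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < wa; ++k)
                sum += A[i * wa + k] * B[k * wb + j];
            C[i * wb + j] = sum;
        }
}
```

Calling both functions on the same small inputs and comparing the outputs element by element is a quick way to confirm that the tile offsets address the intended elements before debugging on a device.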
