Lecture 11: OpenCL and Altera OpenCL. James C. Hoe Department of ECE Carnegie Mellon University


1 Lecture 11: OpenCL and Altera OpenCL
James C. Hoe, Department of ECE, Carnegie Mellon University

2 Housekeeping
Your goal today: understand Altera's interpretation of OpenCL on FPGA
Notices
- Handout #4: lab 2, due noon, 10/6
- 3 weeks to project proposal
Readings
- skim Ch 10, Reconfigurable Computing
- for concrete reference: Altera SDK for OpenCL: Programming Guide

3 Project Proposal Planning
- Review Lecture 6; talk to TAs; talk to me
- Rolling start
- In-class proposal presentation (worth 25%) on wk 9
  - presentation slides (PDF) due 10/20 noon
- Peer grading of proposal: your peers will evaluate you too (10% them; 15% me)
- 15 minutes + 5 minutes Q&A to convince us it is
  - worth doing
  - 6 wks' worth of work (for a 12-unit load for 2 or 3)
  - reasonable chance of meeting milestones

4 Required Proposal Outline
1. Introduction: what is the problem and metric
2. Motivation: why is it important/interesting
3. Overview of approach: what do you do exactly
4. Overview of expected results
5. Weekly testable milestones
6. Risk assessment (what are the known unknowns? do you have everything you need to start? do you know how to do every step? do you know already what will happen?)

5 Khronos OpenCL

6 Two Parts to OpenCL
1. Platform model
- host (processor & memory)
- 1 or more accelerator devices + device-side mem hierarchy: global/local/private
- APIs for host thread to interact with devices
  - launch compute kernels to devices
  - prepare (load/unload) device memory
2. Kernel programming language
- perfect triply nested loops
- no loop-carried dependence
(OpenCL terms introduced in bold)

7 OpenCL Platform Model
[Figure: a host CPU with host mem connected to multiple compute devices, each with its own global mem]
What are these compute devices???

8 Basic Host Program Example
main ( ) {
   ...
   get device handle and queue
   allocate memory buf objects
   ...
   get kernel object
   ...
   while ( ) {
      ...
      initialize memory buf data
      bind buf objects to kernel arguments
      add kernel and buf objects to device queue
      wait for kernel to finish
      retrieve memory buf object for result
      ...
   }
}
What are these kernels???
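To make the pseudocode concrete, here is a minimal host-side sketch against the standard OpenCL C API. Error handling is omitted; the kernel name "mma", the (n, n, 1) NDRange, and building from a source string (which is assumed to define its own N) are illustrative assumptions, not from the slides:

#include <CL/cl.h>

// Minimal sketch: run one kernel over an NDRange; all error checks omitted.
void run_once(const char *src, size_t n, float *A, float *B, float *C) {
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);                       // get device handle and queue
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    // get kernel object (the Altera flow would load a precompiled binary instead)
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "mma", NULL);

    // allocate memory buf objects and initialize memory buf data
    size_t bytes = n * n * sizeof(float);
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
    clEnqueueWriteBuffer(q, bufA, CL_TRUE, 0, bytes, A, 0, NULL, NULL);
    clEnqueueWriteBuffer(q, bufB, CL_TRUE, 0, bytes, B, 0, NULL, NULL);

    // bind buf objects to kernel arguments; add kernel to device queue
    clSetKernelArg(k, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(k, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(k, 2, sizeof(cl_mem), &bufC);
    size_t ndrange[3] = { n, n, 1 };
    clEnqueueNDRangeKernel(q, k, 3, NULL, ndrange, NULL, 0, NULL, NULL);
    clFinish(q);                                            // wait for kernel to finish

    // retrieve memory buf object for result
    clEnqueueReadBuffer(q, bufC, CL_TRUE, 0, bytes, C, 0, NULL, NULL);
}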

9 What are these kernels?
Specifically talking about OpenCL C

10 Conceptually...
for (int i=0; i < R0; i++)
  for (int j=0; j < R1; j++)
    for (int k=0; k < R2; k++) {
      << local variable declarations >>
      << arbitrary C-code with access to global memory >>
    }
Loop body must be data parallel
- local variables limited to scope
- disallow loop-carried dependencies through global memory
==> statements from different iterations can interleave in any order (using disambiguated local variables)

11 Concretely...
Only specify the loop body as a kernel function
__kernel void foo(<< pointers to global mem buf >>) {
  int i=get_global_id(2), j=get_global_id(1), k=get_global_id(0);
  << local variable declarations >>
  << arbitrary C-code with access to global memory >>
}
- Triply nested loops hardwired as NDRange, specified as 3 integer constants, i.e., the loop bounds (R0, R1, R2)
- 1 execution of the kernel function is a work item
  - a work item has private memory for local var's
- 1 kernel execution is R0 x R1 x R2 work items

12 Example: N-by-N MMM
__kernel void mmm(__global float *A, __global float *B, __global float *C) {
  int i=get_global_id(1);
  int j=get_global_id(0);
  for(int k=0; k<N; k++)
    C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}
NDRange=(N, N, 1)
- kernel function executed by NxNx1 work items
- each work item sees a different combination of dimension-0 and dimension-1 global ids
- no assumption about work items' relative progress
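As written, N must be known when the kernel is compiled (e.g., passed via a -D build option). A minimal variant sketch that instead takes the dimension as a kernel argument and accumulates in private memory; the parameter name n and the private accumulator are illustrative assumptions, not the slide's code:

__kernel void mmm(__global float *A, __global float *B, __global float *C, int n) {
  int i = get_global_id(1);
  int j = get_global_id(0);
  float acc = 0.0f;            // private accumulator instead of n read-modify-writes of C
  for (int k = 0; k < n; k++)
    acc += A[i*n+k] * B[k*n+j];
  C[i*n+j] = acc;              // single global store per work item
}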

13 (For Your Reference: N-by-N MMM in C)
float A[N][N], B[N][N], C[N][N];
for(int i=0; i<N; i++)
  for(int j=0; j<N; j++) {
    for(int k=0; k<N; k++)
      C[i][j]=C[i][j]+A[i][k]*B[k][j];
  }
Note:
- Loop body of the innermost loop is not data-parallel: dependency through C[i][j]
- Loop body of the second innermost loop is

14 Work Group
Partition NDRange of R0 x R1 x R2 work items into 3D work groups of G0 x G1 x G2 work items
- G0/1/2 must divide R0/1/2 evenly
- get_local_id(dimen): id within group
- get_group_id(dimen): id of group
Example: NDRange=(9, 3, 6), Group Size=(3, 3, 3), Group Range=(3, 1, 2)
Work group signifies locality btwn work items
- executed together by a processing element
- can share per-group local memory
- can synchronize by barrier()
Why do we need this?
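Ignoring any global offset, the ids relate by get_global_id(d) == get_group_id(d) * get_local_size(d) + get_local_id(d). A minimal OpenCL C sketch of per-group local memory and barrier(); the reduction pattern and the fixed group size of 64 are illustrative assumptions:

__kernel void group_sum(__global const float *in, __global float *out) {
  __local float scratch[64];            // per-group local memory (assumes group size 64)
  int lid = get_local_id(0);
  scratch[lid] = in[get_global_id(0)];  // each work item deposits one value
  barrier(CLK_LOCAL_MEM_FENCE);         // every work item in the group reaches this point
  if (lid == 0) {                       // one work item combines the group's values
    float s = 0.0f;
    for (int t = 0; t < 64; t++) s += scratch[t];
    out[get_group_id(0)] = s;
  }
}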

15 OpenCL Kernels on GPGPUs
- A work item is a CUDA thread
- A work group executes as a thread block
  - broken down into 32-work-item SIMD warps
- Work groups from same and different kernels are interleaved on a Streaming Processor
  - 128 SIMD lanes, 1 INT + 1 FMA per lane, 1.73GHz
- 1 kernel can execute on up to 20 StrmProcs (as 1 compute device), peak 8,873 GFLOPS
- Global=GDDR; local=shared memory, 96KB SRAM; private=register file, 256KB SRAM
(Nvidia terms in italic bold; numbers for GTX 1080)

16 Aside: Barrel Processor [HEP, Smith]
- Each cycle, select a ready thread from scheduling pool
  - only one instruction per thread in flight at once
  - on a long-latency stall, remove the thread from scheduling
- Simpler and faster pipeline implementation
  - no data dependence between in-flight instructions, hence no stall or forwarding
  - no penalty in making pipeline deeper
[Figure: instructions from threads T1..T8 (a warp) enter the 8-stage pipeline (stages A..H) round-robin, one per cycle, so T1's next instruction issues only after its previous one has left the pipeline]

17 To get 8,873 GFLOPS...
- # work items >= 128 x StrmProc pipeline depth x 20
- Computation entirely of Fused Multiply-Add
- Interleaved warps so no worries about RAW stalls
- No if-then-else (branch divergence)
- BTW: ld's and st's take up inst. issue BW off the top
- 320 GB/sec DRAM BW ==> need AI > 108 SP FP ops per float
  - only certain access patterns can sustain peak BW
  - SIMD ld's and st's in a warp have to go to the same memory line (memory coalescing)
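To illustrate coalescing, a minimal sketch contrasting a unit-stride access (one memory line per warp load) with a strided one (many lines per warp load); the kernel and array names are illustrative assumptions:

// coalesced: work items gid, gid+1, ... read adjacent floats,
// so a warp's 32 loads fall in the same memory line(s)
__kernel void copy_unit(__global const float *A, __global float *B) {
  int gid = get_global_id(0);
  B[gid] = A[gid];
}

// uncoalesced: adjacent work items read addresses 4*stride bytes apart,
// so one warp load touches up to 32 memory lines and wastes DRAM BW
__kernel void copy_strided(__global const float *A, __global float *B, int stride) {
  int gid = get_global_id(0);
  B[gid] = A[gid * stride];
}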

18 Altera's Take on OpenCL

19 OpenCL FPGA Platform Model
[Figure: host CPU with host mem; the FPGA holds multiple custom kernel compute devices, each with its own global mem (DRAM)]
Compute devices synthesized from kernel functions

20 Example: N-by-N MM Add
__kernel void mma(__global float *A, __global float *B, __global float *C) {
  int i=get_global_id(1);
  int j=get_global_id(0);
  C[i*N+j]=A[i*N+j]+B[i*N+j];
}
NDRange=(N, N, 1)
Note in this example:
- data-parallel kernel function
- no loop in kernel function

21 Fully Pipelined Kernel Datapath
NDRange = (N,N,1)
(gid0,gid1) stream = (0,0), (0,1), ..., (0,N-1), (1,0), ..., (1,N-1), ..., (N-1,0), ..., (N-1,N-1)
[Figure: datapath where gid0/gid1 feed addr-calc units for loads of A[i*N+j] and B[i*N+j], an adder, and an addr-calc for the store to C[i*N+j]; a new work item enters the pipeline each cycle]
Fully unrolled loop in kernel fxn also okay

22 What about MMM?
__kernel void mmm(__global float *A, __global float *B, __global float *C) {
  int i=get_global_id(1);
  int j=get_global_id(0);
  for(int k=0; k<N; k++)
    C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}
NDRange=(N, N, 1)
- Can't pipeline work items like before; PE pipelines the k iterations
  - dependency on C[i*N+j]
- kernel function scope limits the tricks we can play

23 Single Work-Item Kernel
__kernel void mmm(__global float *A, __global float *B, __global float *C) {
  for(int i=0; i<N; i++)
    for(int j=0; j<N; j++)
      for(int k=0; k<N; k++)
        C[i*N+j]=C[i*N+j]+A[i*N+k]*B[k*N+j];
}
NDRange=(1, 1, 1) -- never do this on a GPU!!
- Arbitrary control flow (loops, ifs) and dependencies
- Becomes just regular C-to-HW synthesis
  - pipeline and parallelize loops
  - schedule for initiation interval, resource, etc.
- Only want OpenCL's platform model and API; work group & work item not too meaningful
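Because the compiler now sees the whole loop nest, regular HLS optimizations apply. A minimal single work-item sketch under stated assumptions: the private accumulator, restrict qualifiers, and unroll factor are illustrative, #pragma unroll is documented in the Altera SDK for OpenCL Programming Guide, and N is a compile-time constant:

__kernel void mmm_swi(__global const float * restrict A,
                      __global const float * restrict B,
                      __global float * restrict C) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      float acc = 0.0f;          // break the loop-carried dependence through C[i*N+j]
      #pragma unroll 8           // ask HLS for 8 multiply-adds per cycle, if memory BW allows
      for (int k = 0; k < N; k++)
        acc += A[i*N+k] * B[k*N+j];
      C[i*N+j] = acc;
    }
}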

24 Altera Kernel-Kernel Channels
GPU multi-kernel OpenCL program
- kernel computes from global mem to global mem
- next kernel continues from last kernel's output buf
- producer and consumer kernels fully serialized
For streaming processing on FPGA, connect kernels with streaming channels to bypass DRAM
- concurrent producer and consumer kernels
- reduce DRAM and PCIe bandwidth requirements
[Figure: host -> PCIe -> global mem (DRAM) -> load -> kernel A -> chnl -> kernel B -> chnl -> krnl C -> store -> global mem (DRAM) -> PCIe -> host]
Also supports OpenCL 2.0 Pipes
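A minimal producer/consumer sketch using the channels extension from the Altera SDK for OpenCL Programming Guide; the channel name, payload type, and arithmetic are illustrative assumptions (the later Intel FPGA SDK renames the calls to read/write_channel_intel):

#pragma OPENCL EXTENSION cl_altera_channels : enable

channel float ch;                            // on-chip FIFO between kernels; no DRAM round trip

__kernel void producer(__global const float *in, int n) {
  for (int i = 0; i < n; i++)
    write_channel_altera(ch, in[i] * 2.0f);  // stream each result directly downstream
}

__kernel void consumer(__global float *out, int n) {
  for (int i = 0; i < n; i++)
    out[i] = read_channel_altera(ch) + 1.0f; // runs concurrently with the producer
}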

25 Parting Thoughts
- OpenCL = platform model/API + SIMD language
  - kernel language forces regular and explicit parallelism
  - SIMD parallelism manifests differently on GPUs vs FPGAs
- Altera's OpenCL = platform model/API + regular HLS
  - single work-item kernel unlocks a different style of parallelism
  - kernel-kernel channels alleviate DRAM and PCIe bottleneck for streaming use cases
  - develop/debug/analysis tools integral to appeal
- GPUs great at SIMD; FPGAs good for more than SIMD
