CnC-HC. a programming model for CPU-GPU hybrid parallelism. Alina Sbîrlea, Zoran Budimlic, Vivek Sarkar Rice University



1 CnC-HC a programming model for CPU-GPU hybrid parallelism Alina Sbîrlea, Zoran Budimlic, Vivek Sarkar Rice University

2 Acknowledgements
- CnC-CUDA: Declarative Programming for GPUs, Max Grossman, Alina Simion-Sbîrlea, Zoran Budimlic, Vivek Sarkar, LCPC
- Scheduling Macro-Dataflow Programs on Task-Parallel Runtime Systems, Sagnak Tasirlar, Master's thesis
- Habanero C (HC) project and team members: Zoran Budimlic, Vincent Cave, Sanjay Chatterjee, Deepak Majeti, Vivek Sarkar, Yonghong Yan
- NSF Expeditions: Center for Domain-Specific Computing (CDSC): UCLA, Rice, OSU, UCSB
- The material in this talk is part of Alina Sbîrlea's MS thesis research at Rice University


3 Motivation
- The Concurrent Collections (CnC) programming model provides ease of use to the programmer
- Habanero-C (HC) implements an execution model for multicore processors
- CUDA provides maximum GPU resource utilization
- CnC-HC: a first step in providing a simple programming model for heterogeneous hardware
[Figure: heterogeneous hardware (FPGA, GPU, CPU)]

4 The CnC model
- Coordination language
- Dynamic, light-weight, task-based
- Single assignment
- Deterministic
- Race free
- Well suited for heterogeneous parallelism

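5 CnC flavors
[Table: feature comparison of three CnC flavors; the per-flavor support marks did not survive extraction]
- Flavors and languages: Intel CnC (C++), Rice CnC-HJ (Java), Rice CnC-HC (C)
- Features compared: single assignment enforced; Gets before Puts; data-dependent Gets; Tag FuncCons; auto-generated step stubs; auto-generated Gets; GPU extensions
- Annotation: "Focus of my work"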

6 Extending the specifications
- Tag = tuple of variables, e.g. t = (i, j, k)
- Tag functions = map from step tag to input/output item tags
- A tuple element can be:
  - an arithmetic expression (supported ops: +, -, *, /); future work: user-defined functions
  - a range: { Expr1 .. Expr2 }
- Step dependencies:
  - the step tag is a tuple of variables
  - input tags are defined relative to the step tag variables and constants
  - output tags are defined relative to the step tag variables, constants and inputs

7 Extending the specifications
Tag declarations:
  < int [dim] tag1 > ;
Item declarations:
  [ double ** C1 ];
  [ struct my_data * C2 ];
Step prescriptions:
  < tag1 > :: ( s1step );
Step dependencies:
  [ C1 : 0 ], [ C2 : j, i+j, i*j+1 ] -> ( s1step : i, j ) -> < tag : i, { i+1 .. C1[0] } >, [ C2 : j, i+j, i*j+2 ] ;
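The tag functions above can be read as small pure functions from a step tag to item tags. The following is a minimal C sketch of that reading, not the code the CnC-HC translator emits: the `Tag` struct, `c2_input_tag` and `expand_range` are hypothetical names introduced for illustration. It evaluates the input dependency `[ C2 : j, i+j, i*j+1 ]` of `( s1step : i, j )` and expands a range element, treated as half-open to match the generated loop `_index0_2 < j+1` on slide 15.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size tag: a tuple of up to three integers. */
typedef struct {
    int dim;   /* number of tuple elements in use */
    int v[3];  /* tuple elements */
} Tag;

/* Tag function for the input item [ C2 : j, i+j, i*j+1 ] of ( s1step : i, j ). */
Tag c2_input_tag(Tag step_tag) {
    assert(step_tag.dim == 2);
    int i = step_tag.v[0];
    int j = step_tag.v[1];
    Tag t = { 3, { j, i + j, i * j + 1 } };
    return t;
}

/* Expand a range element { lo .. hi } as the half-open interval [lo, hi).
   Writes at most cap values to out and returns how many values the range holds. */
size_t expand_range(int lo, int hi, int *out, size_t cap) {
    size_t n = 0;
    for (int x = lo; x < hi; x++) {
        if (n < cap) {
            out[n] = x;
        }
        n++;
    }
    return n;
}
```

For a step tag (i, j) = (2, 3), `c2_input_tag` yields the item tag (3, 5, 7).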

8 Concrete example: Cholesky Factorization
//Item collections
[ int numtiles ];
[ int tilesize ];
[ double** Lkji ];
//Tag collections
< int [1] singletontag > ;
< int [1] controls1tag > ;
< int [2] controls2tag > ;
< int [3] controls3tag > ;
//Step prescriptions
< singletontag > :: ( kcomputestep ) ;
< controls1tag > :: ( kjcomputestep ), ( s1computestep ) ;
< controls2tag > :: ( kjicomputestep ), ( s2computestep ) ;
< controls3tag > :: ( s3computestep ) ;

9 Concrete example: Cholesky Factorization
//Step dependencies
[ numtiles : 0 ] -> ( kcomputestep : k ) -> < controls1tag : { 0 .. numtiles[0] } > ;
[ numtiles : 0 ] -> ( kjcomputestep : k ) -> < controls2tag : k, { k+1 .. numtiles[0] } > ;
( kjicomputestep : k, j ) -> < controls3tag : k, j, { k+1 .. j+1 } > ;
[ tilesize : 0 ], [ Lkji : k, k, k ] -> ( s1computestep : k ) -> [ Lkji : k, k, k+1 ] ;
[ tilesize : 0 ], [ Lkji : j, k, k ], [ Lkji : k, k, k+1 ] -> ( s2computestep : k, j ) -> [ Lkji : j, k, k+1 ] ;
[ tilesize : 0 ], [ Lkji : j, i, k ], [ Lkji : j, k, k+1 ], [ Lkji : i, k, k+1 ] -> ( s3computestep : k, j, i ) -> [ Lkji : j, i, k+1 ] ;
//Data taken from environment and written back to it
env -> [ Lkji ], [ tilesize ], [ numtiles ], < singletontag > ;
[ Lkji ] -> env ;
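To see how many compute steps the three control tag collections prescribe, the ranges in the step dependencies can be walked with ordinary loops. This is a hedged sketch: `StepCounts` and `count_cholesky_steps` are hypothetical helpers, and each range is assumed to be half-open, following the generated loop `_index0_2 < j+1` shown on slide 15.

```c
#include <assert.h>

/* Number of compute-step instances prescribed per step collection. */
typedef struct {
    long s1;  /* s1computestep instances, one per controls1tag */
    long s2;  /* s2computestep instances, one per controls2tag */
    long s3;  /* s3computestep instances, one per controls3tag */
} StepCounts;

/* Enumerate the control tags produced for a matrix of numtiles x numtiles tiles. */
StepCounts count_cholesky_steps(int numtiles) {
    assert(numtiles >= 0);
    StepCounts c = { 0, 0, 0 };
    for (int k = 0; k < numtiles; k++) {             /* controls1tag : k       */
        c.s1++;
        for (int j = k + 1; j < numtiles; j++) {     /* controls2tag : k, j    */
            c.s2++;
            for (int i = k + 1; i < j + 1; i++) {    /* controls3tag : k, j, i */
                c.s3++;
            }
        }
    }
    return c;
}
```

Under these assumptions, a 4 x 4 tiling prescribes 4 s1, 6 s2 and 10 s3 steps.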

10 CnC-HC Build Model
[Figure: build model diagram]

11 CnC-HC: Runtimes
Motivation: data dependencies (Gets) need extra synchronization beyond the HC constructs async and finish
Data Driven:
- Steps start to execute when they are prescribed
- If a Get fails, the step is killed
- The step is restarted by the step doing a Put on the data that the Get failed on
Data Driven Await:
- Steps do not start to execute until all data is available
- Dependencies are filled in when the step is prescribed
- Once all dependencies are satisfied, the step executes => Gets are guaranteed to succeed
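The Data Driven Await policy can be pictured as a counter of unsatisfied dependencies per step. The sketch below is a single-threaded illustration under that assumption, not the CnC-HC runtime: `Step`, `Item`, `add_dependency`, `check_dependencies` and `put_item` are hypothetical names that only loosely mirror the generated `adddependency` and `checkdependencies` calls, and "running" a step is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WAITERS 8

typedef struct Step {
    int pending;  /* dependencies not yet satisfied */
    int ran;      /* set to 1 once the step has executed */
} Step;

typedef struct Item {
    int available;               /* 1 after the single Put */
    double value;
    Step *waiters[MAX_WAITERS];  /* steps registered before the Put */
    size_t nwaiters;
} Item;

/* Stands in for dispatching the step to a worker. */
static void run_if_ready(Step *s) {
    if (s->pending == 0 && !s->ran) {
        s->ran = 1;
    }
}

/* Register one dependency when the step is prescribed. */
void add_dependency(Step *s, Item *it) {
    if (it->available) {
        return;  /* already satisfied, nothing to wait for */
    }
    assert(it->nwaiters < MAX_WAITERS);
    it->waiters[it->nwaiters++] = s;
    s->pending++;
}

/* Called once after all dependencies of a step are registered. */
void check_dependencies(Step *s) {
    run_if_ready(s);
}

/* Put an item and release every step that was waiting on it. */
void put_item(Item *it, double v) {
    assert(!it->available);  /* single assignment: one Put per item */
    it->value = v;
    it->available = 1;
    for (size_t w = 0; w < it->nwaiters; w++) {
        Step *s = it->waiters[w];
        s->pending--;
        run_if_ready(s);
    }
    it->nwaiters = 0;
}
```

Because a step only runs once its counter reaches zero, every Get inside it finds its item present, which is the property the slide states.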

12 CnC-HC: DataDriven Runtime
[Figure: DataDriven runtime diagram]

13 CnC-HC: DataDrivenAwait Runtime
[Figure: DataDrivenAwait runtime diagram]

14 Cholesky auto-gen. code (no user code)
void prescribestep(char* stepname, char* steptag, Context* context){
  // Create step
  if(!strncmp(stepname, "s1computestep\0", 13)){
    step->stepid = Step_s1ComputeStep;
    s1computestep_dependencies(step->tag, (Context*)step->context, step);
    int status = checkdependencies(step);
    if(status == CNC_SUCCESS)
      dispatchstep(step);
    return;
  }
  ...
}

void s1computestep_dependencies(char * tag, Context * context, Step* step){
  int k = gettag(tag, 0);
  double** Lkji0;
  char* taglkji0 = createtag(3, k, k, k);
  adddependency((void**) & (Lkji0), taglkji0, context->lkji, step);
  int* tilesizetemp1;
  char* tagtilesize1 = createtag(1, 0);
  adddependency((void**) & (tilesizetemp1), tagtilesize1, context->tilesize, step);
}

void dispatchstep(Step* step){
  switch(step->stepid){
    case Step_s1ComputeStep:
      async IN(step){
        s1computestep_gets(step->tag, (Context*)step->context, step);
      };
      break;
  }
}

15 Cholesky auto-gen. code (with user code)
void* s1computestep_gets(char * tag, Context * context, Step* step){
  int k = gettag(tag, 0);
  double** Lkji0;
  char* taglkji0 = createtag(3, k, k, k);
  CNC_GET((void**) & (Lkji0), taglkji0, context->lkji, step);
  int* tilesizetemp1;
  int tilesize1;
  char* tagtilesize1 = createtag(1, 0);
  CNC_GET((void**) & (tilesizetemp1), tagtilesize1, context->tilesize, step);
  tilesize1 = tilesizetemp1[0];
  s1computestep( k, Lkji0, tilesize1, context );
  return 0;
}

void s1computestep( int k, double** Lkji0, int tilesize1, Context* context){
  double ** lBlock;
  //user adds memory allocation and computation
  char* taglkji2 = createtag(3, k, k, k+1);
  Put(lBlock, taglkji2, context->lkji);
}

void kjicomputestep( int k, int j, Context* context){
  int _index0_2;
  for(_index0_2 = k+1; _index0_2 < j+1; _index0_2++){
    char* tagcontrols3tag0 = createtag(3, k, j, _index0_2);
    prescribestep("s3computestep", tagcontrols3tag0, context);
  }
}

16 CUDA
- Data parallel programming architecture from NVIDIA
- Execute programmer-defined kernels on extremely parallel GPUs
- CUDA program flow:
  1. Push data on device
  2. Launch kernel
  3. Execute kernel and memory accesses in parallel
  4. Pull data off device
- Device threads are launched in batches: blocks of threads, grid of blocks
- Explicit device memory management: global, shared, constant, texture; cudaMalloc, cudaMemcpy, cudaFree, etc.
Figure source: Y. Yan et al., JCUDA: a Programmer Friendly Interface for Accelerating Java Programs with CUDA. Euro-Par 2009.

17 CnC-CUDA extension to Intel CnC
[Figure: CnC graph with <tag>, [in_item], (cpu_step), [out_item] and several {gpu_step} nodes]

18 CnC-HC CUDA Build Model
[Figure: CUDA build model diagram]

19 Places in Habanero C
[Figure: places PL0-PL8 and workers W0-W5]
Legend:
- PL: physical memory, cache, GPU memory, reconfigurable FPGA
- Implicit data movement; explicit data movement
- Wx: CPU compute worker; Wx: device agent worker
Slide credit: Habanero C team

20 CnC-CUDA-HC: DataDrivenAwait
[Figure: DataDrivenAwait runtime diagram with GPU steps]

21 Crypt auto-gen. code
void prescribestep(char* stepname, char* steptag, Context* context){
  // Create step
  if(!strncmp(stepname, "gpu_encrypt\0", 11)){
    step->stepid = Step_gpu_encrypt;
    gpu_encrypt_dependencies(step->tag, (Context*)step->context, step);
    int status = checkdependencies(step);
    if(status == CNC_SUCCESS)
      dispatchstep(step);
    return;
  }
}

void gpu_encrypt_dependencies(char * tag, Context * context, Step* step){
  adddependency( ... );
}

void dispatchstep(Step* step){
  switch(step->stepid){
    case Step_gpu_encrypt:
      async(gpu_pl) IN(step) {
        gpu_encrypt_gets(step->tag, (Context*)step->context, step);
      };
      break;
  }
}

22 Crypt auto-gen. & editable code
void* gpu_encrypt_gets(char * tag, Context * context, Step* step){
  CNC_GET( ... )
  gpu_encryptlaunch( ... )
  Put( ... )
}

void gpu_encryptlaunch( ... ){
  ... cudaMalloc ...
  ... cudaMemcpy ... //host to device
  gpu_encryptkernelcaller<<<blocks_per_grid, threads_per_block>>>( ... );
  cudaThreadSynchronize();
  ... cudaMemcpy ... //device to host
  ... cudaFree ...
}

__global__ void gpu_encryptkernelcaller( ... ){
  int tid = blockDim.x*blockIdx.x+threadIdx.x;
  if(tid < ... ) {
    gpu_encryptkernel( ... );
  }
}

__device__ void gpu_encryptkernel( ... ) {
  ...
}
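The kernel caller's index computation and bounds guard can be checked on the CPU by emulating the launch with two nested loops. This is a plain C sketch under stated assumptions, not CUDA and not the Crypt benchmark: `blocks_for`, `toy_kernel` and `emulate_launch` are hypothetical names, the threads run sequentially, and the per-element work is a one-byte XOR standing in for the real cipher.

```c
#include <assert.h>
#include <stddef.h>

/* Blocks needed so that blocks * threads_per_block covers n elements. */
size_t blocks_for(size_t n, size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Stand-in for the per-element device function: XOR with one key byte. */
static void toy_kernel(unsigned char *data, size_t tid, unsigned char key) {
    data[tid] ^= key;
}

/* Sequential emulation of kernel<<<blocks, threads_per_block>>>.
   Returns the number of emulated threads that passed the tid < n guard. */
size_t emulate_launch(unsigned char *data, size_t n,
                      size_t threads_per_block, unsigned char key) {
    size_t blocks = blocks_for(n, threads_per_block);
    size_t active = 0;
    for (size_t block_idx = 0; block_idx < blocks; block_idx++) {
        for (size_t thread_idx = 0; thread_idx < threads_per_block; thread_idx++) {
            /* Same formula as blockDim.x*blockIdx.x+threadIdx.x on the slide. */
            size_t tid = threads_per_block * block_idx + thread_idx;
            if (tid < n) {  /* threads in the last block may fall past the data */
                toy_kernel(data, tid, key);
                active++;
            }
        }
    }
    return active;
}
```

With 10 elements and 4 threads per block, three blocks are launched and the guard masks off the two surplus threads of the last block.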

23 Experimental setup
- Compare CPU performance: CnC-HJ (NonBlocking, DataDriven policies) vs CnC-HC (Work-first and Help-first)
  - CnC-HJ-NB, CnC-HJ-DD, CnC-HC-WF, CnC-HC-HF
  - Timing computation only
- Analyze hybrid execution
- Benchmarks:
  - CPU benchmarks:
    - Cholesky Factorization: Java (HJ), C (HC) steps
    - Black Scholes: Java (HJ), C (HC) steps
    - Heart Wall Tracking: C (HJ, HC) steps
  - GPU benchmarks:
    - Crypt: C and CUDA steps

24 CPU experiments
Intel(R) Xeon(R) E7330 @ 2.40GHz, 16 cores
[Plots: Time (s) for 2, 4, 8, 16 cores]

25 CPU & GPU experiments
Measuring time (s) for 0%-100% of steps executed on GPU
Hardware, as listed on the slide (model numbers and thread/core counts partly lost in extraction):
- NVIDIA Quadro FX: 32 CUDA cores, 512 MB memory
- Intel(R) Xeon(R) E
- NVIDIA Tesla C: 240 CUDA cores, 4 GB memory
- AMD Phenom(tm) 9850 Quad-Core Processor
[Two plots: Time (s) vs. % on GPU]

26 Conclusion & on-going work
Conclusion:
- Language extensions within the CnC model allow a hybrid execution model
- Using the GPU's computational power can provide significant performance improvement while allowing the user to focus on program development
On-going and future work:
- Extend the graph specification to support user-defined tag functions, e.g. [ in: f(i), g(j) ]
- Auto-generate more CPU-GPU linkage code
- Test on more benchmarks
