Supercomputing on your desktop:

Size: px

Start display at page:

Download "Supercomputing on your desktop:"

Paula Bell
5 years ago
Views:

1 IAP09 / Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA Lecture 07 Nicolas Pinto (MIT) CUDA - Advanced #2

2 During this course, adapted for we ll try to and use existing material ;-)

3 yey!! Today

4 Wanna Play with The Big Guys?

5 Here are the keys to High-Performance in CUDA

6 To optimize or not to optimize Warning! Hoare said (and Knuth restated) Premature optimization is the root of all evil. Applied Mathematics slide by Johan Seland 23/53

7 To optimize or not to optimize Warning! Hoare said (and Knuth restated) We should forget about small efficiencies, say about 97% of the time: Premature optimization is the root of all evil. 3% of the time we really should worry about small efficiencies (Every 33rd codeline) Applied Mathematics slide by Johan Seland 23/53

8 IAP09 / Strategy Memory Optimizations Execution Optimizations

9 IAP09 / CUDA Performance Strategies

10 Optimization goals Strategy We should strive to reach GPU performance We must know the GPU performance Vendor specifications Syntetic benchmarks Choose a performance metric Memory bandwidth or GFLOPS? Use clock() to measure Experiment and profile! slide by Johan Seland Applied Mathematics 25/53

Programming Model Threading A kernel is executed as a grid of thread blocks A thread block is a batch of threads that can cooperate with each other by: Sharing data through shared memory

11 Programming Model Threading A kernel is executed as a grid of thread blocks A thread block is a batch of threads that can cooperate with each other by: Sharing data through shared memory Synchronizing their execution Host Kernel 1 Kernel 2 Block (1, 1) Device Grid 1 Block (0, 0) Block (0, 1) Grid 2 Block (1, 0) Block (1, 1) Block (2, 0) Block (2, 1) Threads from different blocks cannot cooperate Thread (0, 0) Thread (0, 1) Thread (1, 0) Thread (1, 1) Thread (2, 0) Thread (2, 1) Thread (3, 0) Thread (3, 1) Thread (4, 0) Thread (4, 1) Thread (0, 2) Thread (1, 2) Thread (2, 2) Thread (3, 2) Thread (4, 2) NVIDIA Corporation

12 Data Movement in a CUDA Program Memory Host Memory Device Memory [Shared Memory] COMPUTATION [Shared Memory] Device Memory Host Memory NVIDIA Corporation

13 Perf!"#$%$&'()*+,-$#.%/(0,-(#.'( $%$&'($78'"'78'7#("5-5**'*$/% 123(/"'78/($#/(#-57/$/#,-/(,7()C3/D(7,#(%'%,-: E,(%,-'(9,%"B#5#$,7(,7(#.'(123(#,(5F,$8(9,/#*:( 85#5(#-57/0'-/ GF'7(*,>("5-5**'*$/%(9,%"B#5#$,7/(957(/,%'#$%'/(='( 05/#'-(#.57(#-57/0'--$7+(=59H(578(0,-#.(#,(.,/# 39

14 Perf!"#$%$&'()'%*+,(-*.'+'/0' -*12'30'4(536(7*/80*12'30'4(9(*+4'+(*:(%1;/$#<4' %'%*+, B/(3.1+'4(%'%*+,C(15*$4(.$;.84';+''(>1/D(0*/:2$0#3 40

15 Perf!"#$%&'(")*"+$%,-%./"0$'%1$2,03 45)'0$'6%,-%*72$6%-"6*$0%*/")%+8,9"8%2$2,03!/0$"'6%:")%:,,;$0"*$%(7"%6/"0$'%2$2,03 <6$%,)$%=%"%-$>%*/0$"'6%*,%8,"'%=%:,2;5*$%'"*"% 6/"0$'%93%"88%*/0$"'6 <6$%7*%*,%"(,7'%),)?:,"8$6:$'%"::$66.*"+$%8,"'6%")'%6*,0$6%7)%6/"0$'%2$2,03%*,%0$?,0'$0%),)? :,"8$6:$"98$%"''0$667)+ 41

16 Perf!"#$%&'&((#()"*$+,,)-)#./(0 %&'/)/)1.$012'$-1*32/&/)1.$/1$4##3$/5#$6%!$ *2(/)3'1-#""1'"$#72&((0$82"0 9&.0$/5'#&:";$*&.0$/5'#&:$8(1-4" <##3$'#"12'-#$2"&=#$(1>$#.12=5$/1$"2331'/$ 42

18 IAP09 / Memory Optimizations

19 !"#$%&'$()*#*+,)*$-. Memory /()*#*+*-0'#"#$%&')%,-.1"%. 2$,3".4*-0'03$5,3'#"#$%&',44"..". 6.*-0'.7,%"8'#"#$%&'"11"4)*9"3& 44

20 !"#"$%&"'()*&( Memory!*+,-*$.*./&0$#/$1/(#$.*./&0$2"'34,3#1$.5-1$ 6/4*&$#1"'$3*+,-*$.*./&0$#/$3*+,-*$2"'34,3#1 /'P$"'3$3*"66/-"#*3$4,#1/5#$*+*&$-/;0,'Q$#1*.$#/$1/(#$.*./&0 8&/5;$#&"'()*&( R'*$6"&Q*$#&"'()*&$.5-1$2*##*&$#1"'$."'0$(."66$/'*( 45

21 !"#$%&'()$*+,$-'./+0."123$.2 Memory (4*","55'(6'2789+"55':2+"55'("7;'1+'3+<"#$%5'()$*+ ='27+-$-'./ LM+CDE2+-$"24.$*+'1+1N'.($+KOP;+-'7=$.?'".*2+ 8'Q$.(5'()$*+!GH%$9 R$$+7=$+S?"1*:;*7=0$27T GUVW+RVX+2"-<5$ U2$+:;7=+("47;'1 W55'("7;1#+7''+-4(=+<"#$%5'()$*+-$-'./+("1+.$*4($+ 'Q$."55+2/27$-+<$.3'.-"1($ 0$27+/'4.+2/27$-2+"1*+"<<2+7'+5$".1+7=$;.+5;-;72 46

22 gmem!"#$%"&'()#*+&,(%-./0*12(. 47

23 Accessing global memory gmem 4 cycles to issue on memory fetch but cycles of latency The equivalent of 100 MADs Likely to be a performance bottleneck Order of magnitude speedups possible Coalesce memory access Use shared memory to re-order non-coalesced addressing Applied Mathematics slide by Johan Seland 32/53

24 gmem!"#$%&'()* +,'""-.()#/%.,-%#.,01,#,2#$345#-6,789 /2-%#.&: +,'")/(*;";&,-%*("),"3,*$"0#$,<%<"-1= 9> 01/%&,4 8AB 01/%&,4 AC9 01/%&,D +..(/(")#$,-%&/-('/(")&,"),FBGHFIG,#-'2(/%'/;-%= J/#-/()*,#..-%&&,3"-,#,-%*("),<;&/,0%,#,<;$/(6$%,"3,-%*("), &(K% L2%,k /2 /2-%#.,(),#,2#$345#-6,<;&/,#''%&&,/2% k /2 %$%<%)/,(),#, 0$"'M,0%()*,-%#. NO'%6/(")=,)"/,#$$,/2-%#.&,<;&/,0%,6#-/('(6#/()* P-%.('#/%.,#''%&&?,.(Q%-*%)'%,5(/2(),#,2#$35#-6 48

25 gmem!"#$%&'%()*''%&&+),%#(-./)0$"#1& :4 *$$)1>?%#(&)C#?1-'-C#1% :4 49

26 !"#$%&'(#')*+##'((,*-'%)."/*0&$%1( gmem B B4 C.(%&./"')*D1%;1."/*+));'((*E"$1*%*<=&1.F&'*$0*85G 50

27 gmem!"#$%&'()*+,-(.()*,/%&0$1& 234%5(.%)1,"),678+, -(.%&,#G%5#*%:,"G%5,C89,50)& 4%5.01%:Q.(&#$(*)%:,1J5%#:,#''%&& 51

28 gmem!"#$%&'()*+,-./'-/.%&0"10&(2%0! %& :&%0#0,-./'-/.%0"10;..#9&0<,";=0()&-%#>0"10;..#90"10,-./'-/.%&0 B".'%0&-./'-/.%0#$(*)C%)-+0DD#$(*)<E=40FG%.%0E0H ".067 x y z Point structure x y z x y z x y z AoS x x x y y y z z z SoA 58

gmem!"#$%&'()*+,-.//#01!"#$%&'()*,*0%#2$1,(/30"4%&,250".*53.2!0(2('#$,2",/%/"0167".)8,9%0)%$& :%#8()*,&20.'2.0%&,";,&(<%,"25%0,25#),=>,?

29 gmem!"#$%&'()*+,-.//#01!"#$%&'()*,*0%#2$1,(/30"4%&,250".*53.2!0(2('#$,2",/%/"0167".)8,9%0)%$& 712%&,B($$,70%#9,'"#$%&'()*+ C0%;%0,-20.'2.0%&,";,D00#1& "4%0,D"- E;,-"D,(&,)"2,4(#7$%>,0%#8FB0(2%,250".*5,-GHG D88(2(")#$,0%&".0'%&+ D$(*)%8,I13%&,-JK,-#/3$% 59

30 !"#"$$%$&'%()#*&+#,-./%,/0#% smem 12&"&3"#"$$%$&(",-.2%4&("2*&/-#%"56&",,%66&(%()#* 7-%#%8)#%4&(%()#*&.6&5.9.5%5&.2/)&:"2;6 <66%2/."$&/)&",-.%9%&-.=-&:"25>.5/- <",-&:"2;&,"2&6%#9.,%&)2%&"55#%66&3%#&,*,$% +&(%()#*&,"2&6%#9.,%&"6&("2*&6.(0$/"2%)06& ",,%66%6&"6&./&-"6&:"2;6 '0$/.3$%&6.(0$/"2%)06&",,%66%6&/)&"&:"2; Bank 0 Bank 1 Bank 2 Bank 3 Bank 4 Bank 5 Bank 6 Bank 7 Bank 15 64

31 !"#$%&''()**+#,%-."/01)* smem 23%!"#$%43#51+67* 8+#)"(%"''()**+#,% *7(+')%99%: 23%!"#$%43#51+67* ;"#'3/%:<:%=)(/>7"7+3# Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 Bank 0 Bank 1 Bank 2 Bank 3 Bank 4 Bank 5 Bank 6 Bank 7 Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 Bank 0 Bank 1 Bank 2 Bank 3 Bank 4 Bank 5 Bank 6 Bank 7 Thread 15 Bank 15 Thread 15 Bank 15 65

32 !"#$%&''()**+#,%-."/01)* smem 234"5%!"#$%67#81+9:* ;+#)"(%"''()**+#,% *:(+')%<<%2 =34"5%!"#$%67#81+9:* ;+#)"(%"''()**+#,% *:(+')%<<%= Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 8 Thread 9 Thread 10 Thread 11 Bank 0 Bank 1 Bank 2 Bank 3 Bank 4 Bank 5 Bank 6 Bank 7 Bank 15 Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Thread 6 Thread 7 Thread 15 x8 x8 Bank 0 Bank 1 Bank 2 Bank 7 Bank 8 Bank 9 Bank 15 66

33 smem!"#$%&&'())()$*%+$,"$-%./)$".$012 3%.&#4&,5$"6$(%75$-%./$4)$89$-4,)$+('$9$7:"7/$7;7:() -%./) 012$5%)$AB$-%./) <"$-%./$C$%&&'())$D$AB <%*($%)$,5($)4E($"6$%$5%:6?#%'+ 67

34 !"#$%&'(%()$*'+#,-'.),/01.23 smem!"#$%&'(%()$*'13'#3'/#32'#3'$%4132%$3'1/'2"%$%'#$%',)'+#,-'.),/ "%'/#32'.#3%6 7/'#00'2"$%#&3')/'#'"#0/89#$:'#..%33'&1//%$%,2'+#,-3;'2"%$%'13',)'+#,-'.),/01.2 7/'#00'2"$%#&3')/'#'"#0/89#$:'$%#&'2"%'1&%,21.#0'#&&$%33;' 2"%$%'13',)'+#,-'.),/01.2'<+$)#&.#32= 5"%'30)9'.#3%6 68

35 Use the right kind of memory Strategy Constant memory: Quite small, 20K As fast as register access if all threads in a warp access the same location Texture memory: Spatially cached Optimized for 2D locality Neighboring threads should read neighboring addresses No need to think about coalescing Constraint: These memories can only be updated from the CPU Applied Mathematics slide by Johan Seland 31/53

36 Memory optimizations roundup Strategy CUDA memory handling is complex And I have not covered all topics... Using memory correctly can lead to huge speedups At least CUDA expose the memory hierarchy, unlike CPUs Get your algorithm up an running first, then optimize Use shared memory to let threads cooperate Be wary of data ownership A thread does not have to read/write the data it calculate slide by Johan Seland Applied Mathematics 41/53

37 Conflicts, Coalescing, Warps... I hate growing up.

38 Example!"#$%$&'#$()*+,'%"-./*0'#1$,*21')3"(3.

39 !"#$%&'($")*+,*- Example./0'."1+2-'34#$")*+,* *#$"#-*9 :,"2-*;%)<

40 !"#$%&'(#')*+,%"(-$(' Example global void transpose_naive(float *odata, float *idata, int width, int height) { 1. unsigned int xindex = blockdim.x * blockidx.x + threadidx.x; 2. unsigned int yindex = blockdim.y * blockidx.y + threadidx.y; } if (xindex < width && yindex < height) { unsigned int index_in = xindex + width * yindex; unsigned int index_out = yindex + height * xindex; $)%.%/0")'12$3.4 = 0)%.%/0")'120"4; } 71

41 !"#$%&'(#')*+,%"(-$(' Example.'%)(*/"-01*2,$3*4565 <,/1'*$01-01*1$*4565 ;8; ;87 ;8: ;879 ;8; 78; :8; 798; 78; : 7879 ; : ; : ; : Stride = 1, coalesced Stride = 16, uncoalesced 72

42 !"#$%&'%()*+#,&-"&% Example.&&/0-12",3)0#1+24)2&)-#+1212",%()2,1")&5/#+%)12$%& *6+%#(7$"'8)974:)7;<3 =%#()16%)974:7;< "/1-/1)12$% *6+%#()914:1;<3 =%#(&)%$%0%,1)914:1;< C+"0)2,-/1)12$% A+21%&)%$%0%,1)914:1;< 2,1")"/1-/1)12$%!"#$%&'2,B)2&)#'62%D%()2C3 E$"'8F12$%)(20%,&2",&)#+%)0/$12-$%&)"C)GH 73

43 74!"#$%&'%()*+#,&-"&%.+/0%&)0")1232 4%#(&)5+"6) : 89; < <98: <9; <98 <9< 8:98: 8:9; 8:98 8:9<.+/0%&)0")7232 4%#(&)5+"6)1232 8:98 ; <98 8:9< ;9< 89< <9< 8:98: ;98: 898: <98: 898: 89; < <98: <9; <98 <9< 8:98: 8:9; 8:98 8:9< 898: 89; < <98: <9; <98 <9< 8:98: 8:9; 8:98 8:9< Example

44 !"#"$%&'()(*+'(,- Example =1+23$;0,)$!"#" A?A 6>?A./01+23$01+2$!"#"$4('/$3'0(21$5$67 A?6 6>?6 8+-9$:,-;<(:'3 A?6> 6>?6> A?A 6>?A!,<B'(,- A?6 6>?6 C<<,:+'1$+-$D1E'0+F :,<B)- =1+2$3'0(21$5$6G./01+23$01+2$;0,)$:,-31:B'(H1$I+-93 A?6> 6>?6> 75

45 !"#"$%&'()(*+'(,- Example =1+23$;0,)$!"#" A?A 6>?A./01+23$01+2$!"#"$4('/$3'0(21$5$67 A?6 6>?6 8+-9$:,-;<(:'3 A?6> 6>?6> A?A 6>?A!,<B'(,- A?6 6>?6 C<<,:+'1$+-$D1E'0+F :,<B)- =1+2$3'0(21$5$6G./01+23$01+2$;0,)$:,-31:B'(H1$I+-93 A?6> 6>?6> 75

46 !"#$%&'%()*+#,&-"&% Example global void transpose(float *odata, float *idata, int width, int height) { 1. shared float block[(block_dim./)*block_dim]; } unsigned int xblock = blockdim.x * blockidx.x; unsigned int yblock = blockdim.y * blockidx.y; unsigned int xindex = xblock + threadidx.x; unsigned int yindex = yblock + threadidx.y; unsigned int index_out, index_transpose; if (xindex < width && yindex < height) { unsigned int index_in = width * yindex + xindex; unsigned int index_block = threadidx.y * (BLOCK_DIM+1) + threadidx.x; block[index_block] = idata[index_in]; index_transpose = threadidx.x * (BLOCK_DIM+1) + threadidx.y; index_out = height * (xblock + threadidx.y) + yblock + threadidx.x; } syncthreads(); if (xindex < width && yindex < height) odata[index_out] = block[index_transpose];

47 Coalesced transpose: Source code Example global void transpose( float *out, float *in, int w, int h ) { shared float block[block_dim*block_dim]; unsigned int xblock = blockdim.x * blockidx.x; unsigned int yblock = blockdim.y * blockidx.y; unsigned int xindex = xblock + threadidx.x; unsigned int yindex = yblock + threadidx.y; unsigned int index_out, index_transpose; if ( xindex < width && yindex < height ) { unsigned int index_in = width * yindex + xindex; unsigned int index_block = threadidx.y * BLOCK_DIM + threadidx.x; block[index_block] = in[index_in]; index_transpose = threadidx.x * BLOCK_DIM + threadidx.y; index_out = height * (xblock + threadidx.y) + yblock + threadidx.x; } synchthreads(); } if ( xindex < width && yindex < height ) { out[index_out] = block[index_transpose]; } slide by Johan Seland Applied Mathematics 39/53

48 Coalesced transpose: Source code Example global void transpose( float *out, float *in, int w, int h ) { shared float block[block_dim*block_dim]; Allocate shared memory. unsigned int xblock = blockdim.x * blockidx.x; unsigned int yblock = blockdim.y * blockidx.y; unsigned int xindex = xblock + threadidx.x; unsigned int yindex = yblock + threadidx.y; unsigned int index_out, index_transpose; if ( xindex < width && yindex < height ) { unsigned int index_in = width * yindex + xindex; unsigned int index_block = threadidx.y * BLOCK_DIM + threadidx.x; block[index_block] = in[index_in]; index_transpose = threadidx.x * BLOCK_DIM + threadidx.y; index_out = height * (xblock + threadidx.y) + yblock + threadidx.x; } synchthreads(); } if ( xindex < width && yindex < height ) { out[index_out] = block[index_transpose]; } Applied Mathematics slide by Johan Seland 39/53

49 Coalesced transpose: Source code Example global void transpose( float *out, float *in, int w, int h ) { shared float block[block_dim*block_dim]; unsigned int xblock = blockdim.x * blockidx.x; unsigned int yblock = blockdim.y * blockidx.y; Allocate shared memory. Set up indexing unsigned int xindex = xblock + threadidx.x; unsigned int yindex = yblock + threadidx.y; unsigned int index_out, index_transpose; if ( xindex < width && yindex < height ) { unsigned int index_in = width * yindex + xindex; unsigned int index_block = threadidx.y * BLOCK_DIM + threadidx.x; block[index_block] = in[index_in]; index_transpose = threadidx.x * BLOCK_DIM + threadidx.y; index_out = height * (xblock + threadidx.y) + yblock + threadidx.x; } synchthreads(); } if ( xindex < width && yindex < height ) { out[index_out] = block[index_transpose]; } Applied Mathematics slide by Johan Seland 39/53

50 Coalesced transpose: Source code Example global void transpose( float *out, float *in, int w, int h ) { shared float block[block_dim*block_dim]; unsigned int xblock = blockdim.x * blockidx.x; unsigned int yblock = blockdim.y * blockidx.y; Allocate shared memory. Set up indexing unsigned int xindex = xblock + threadidx.x; unsigned int yindex = yblock + threadidx.y; unsigned int index_out, index_transpose; if ( xindex < width && yindex < height ) { unsigned int index_in = width * yindex + xindex; unsigned int index_block = threadidx.y * BLOCK_DIM + threadidx.x; Check that we are within domain, calculate more indices block[index_block] = in[index_in]; index_transpose = threadidx.x * BLOCK_DIM + threadidx.y; index_out = height * (xblock + threadidx.y) + yblock + threadidx.x; } synchthreads(); } if ( xindex < width && yindex < height ) { out[index_out] = block[index_transpose]; } Applied Mathematics slide by Johan Seland 39/53

51 Coalesced transpose: Source code Example global void transpose( float *out, float *in, int w, int h ) { shared float block[block_dim*block_dim]; unsigned int xblock = blockdim.x * blockidx.x; unsigned int yblock = blockdim.y * blockidx.y; Allocate shared memory. Set up indexing unsigned int xindex = xblock + threadidx.x; unsigned int yindex = yblock + threadidx.y; unsigned int index_out, index_transpose; if ( xindex < width && yindex < height ) { unsigned int index_in = width * yindex + xindex; unsigned int index_block = threadidx.y * BLOCK_DIM + threadidx.x; block[index_block] = in[index_in]; Check that we are within domain, calculate more indices Write to shared memory. index_transpose = threadidx.x * BLOCK_DIM + threadidx.y; index_out = height * (xblock + threadidx.y) + yblock + threadidx.x; } synchthreads(); } if ( xindex < width && yindex < height ) { out[index_out] = block[index_transpose]; } Applied Mathematics slide by Johan Seland 39/53

52 Coalesced transpose: Source code Example global void transpose( float *out, float *in, int w, int h ) { shared float block[block_dim*block_dim]; unsigned int xblock = blockdim.x * blockidx.x; unsigned int yblock = blockdim.y * blockidx.y; Allocate shared memory. Set up indexing unsigned int xindex = xblock + threadidx.x; unsigned int yindex = yblock + threadidx.y; unsigned int index_out, index_transpose; if ( xindex < width && yindex < height ) { unsigned int index_in = width * yindex + xindex; unsigned int index_block = threadidx.y * BLOCK_DIM + threadidx.x; block[index_block] = in[index_in]; index_transpose = threadidx.x * BLOCK_DIM + threadidx.y; index_out = height * (xblock + threadidx.y) + yblock + threadidx.x; } synchthreads(); Check that we are within domain, calculate more indices Write to shared memory. Calculate output indices. } if ( xindex < width && yindex < height ) { out[index_out] = block[index_transpose]; } Applied Mathematics slide by Johan Seland 39/53

53 Coalesced transpose: Source code Example global void transpose( float *out, float *in, int w, int h ) { shared float block[block_dim*block_dim]; unsigned int xblock = blockdim.x * blockidx.x; unsigned int yblock = blockdim.y * blockidx.y; Allocate shared memory. Set up indexing unsigned int xindex = xblock + threadidx.x; unsigned int yindex = yblock + threadidx.y; unsigned int index_out, index_transpose; if ( xindex < width && yindex < height ) { unsigned int index_in = width * yindex + xindex; unsigned int index_block = threadidx.y * BLOCK_DIM + threadidx.x; block[index_block] = in[index_in]; index_transpose = threadidx.x * BLOCK_DIM + threadidx.y; index_out = height * (xblock + threadidx.y) + yblock + threadidx.x; } synchthreads(); Check that we are within domain, calculate more indices Write to shared memory. Calculate output indices. Synchronize. NB:outside if-clause } if ( xindex < width && yindex < height ) { out[index_out] = block[index_transpose]; } Applied Mathematics slide by Johan Seland 39/53

54 Coalesced transpose: Source code Example global void transpose( float *out, float *in, int w, int h ) { shared float block[block_dim*block_dim]; unsigned int xblock = blockdim.x * blockidx.x; unsigned int yblock = blockdim.y * blockidx.y; Allocate shared memory. Set up indexing unsigned int xindex = xblock + threadidx.x; unsigned int yindex = yblock + threadidx.y; } unsigned int index_out, index_transpose; if ( xindex < width && yindex < height ) { unsigned int index_in = width * yindex + xindex; unsigned int index_block = threadidx.y * BLOCK_DIM + threadidx.x; block[index_block] = in[index_in]; index_transpose = threadidx.x * BLOCK_DIM + threadidx.y; index_out = height * (xblock + threadidx.y) + yblock + threadidx.x; } synchthreads(); if ( xindex < width && yindex < height ) { out[index_out] = block[index_transpose]; } Check that we are within domain, calculate more indices Write to shared memory. Calculate output indices. Synchronize. NB:outside if-clause Write to global mem. Different index Applied Mathematics slide by Johan Seland 39/53

55 Transpose timings Example Was it worth the trouble? Grid Size Coalesced Non-coalesced Speedup ms ms ms 0.33 ms ms 1.92 ms ms 6.6 ms 8.4 For me, this is a clear yes. Applied Mathematics slide by Johan Seland 40/53

57 IAP09 / Execution Optimizations

58 Know the arithmetic cost of operations Exec 4 clock cycles: Floating point: add, multiply, fused multiply-add Integer add, bitwise operations, compare, min, max 16 clock cycles: reciprocal, reciprocal square root, log(x), 32-bit integer multiplication 32 clock cycles: sin(x), cos(x) and exp(x) 36 clock cycles: Floating point division (24-bit version in 20 cycles) Particularly costly: Integer division, modulo Remedy: Replace with shifting whenever possible Double precision (when available) will perform at half the speed Applied Mathematics slide by Johan Seland 28/53

59 Exec!""#$%&"' ()*+%,-.&/0*#"0.1&/-%*+-+2+"#0+,-/+3#+&0.%44'5-/1- +2+"#0.&6-10)+*-7%*$/-./-0)+-1&4'-7%'-01-).,+- 4%0+&".+/-%&,-8++$-0)+-)%*,7%*+-9#/'!""#$%&"' :-;#<9+*-1=-7%*$/-*#&&.&6- "1&"#**+&04'-1&-%-<#40.$*1"+//1*-,.>.,+,-9'- <%2.<#<-&#<9+*-1=-7%*$/-0)%0-"%&-*#&- A+6./0+*/ B)%*+,-<+<1*' 79

60 Exec!"#$%&'()*+,#-.+/.0"#12#)1 3+(4+5'()*1+6+3+(4+70'2#8"().11("1,(+9''+70'2#8"().11("1+:9;.+92+'.912+(<.+5'()*+2(+.=.) (4+5'()*1+6+JKK+2(+1)9'.+2(+4020".+$.;#).1 &'()*1+.=.)02.$+#<+8#8.'#<.+491:#(< JKKK+5'()*1+8."+C"#$+B#''+1)9'.+9)"(11+70'2#8'.+C.<."92#(<1 80

61 Exec!"#$%&"'()"*"+,"+-.!"/,0/1&"'02'$&"('"#$%&"'(,"*"+,"+-. 3+%&'4-&$5+6%('"%47&(-/+(8"('"/,(9::(-.-7"%(7/&"' S T(.(U(JV W(T(S U(OV /,,N1O:(((P1OQ(P1EQ(P1: /,,N1O:(((P1JQ(P1OQ(P1R %[,/&/XYZ(UT(OV 7,N%D/'",N1O:((P1OQ(XP'OEUYZ( /,,N1O:(((((((((((P1OQ(P1OQ(P1R A5(-5C*7"&"7.(D$,"(&D"(7/&"+-.<(!4+(/&(7"/%&(EF: &D'"/,%(GH(2/'*%I(*"'(C47&$*'5-"%%5'?&(7"/%&(:JK 5--4*/+-. AD'"/,%(,5(+5&(D/L"(&5(8"75+#(&5(&D"(%/C"(&D'"/,(875-M 81

62 Exec!"#$%&"'()'"%%*'" +$,"(-.&"/01(21(*%$/#(34'"(&5'".,%(6"'(78 6"'(78C(6.'&$&$4/",(.34/#(04/0*''"/&(&5'".,% 6"'(78C(6.'&$&$4/",(.34/#(04/0*''"/&(&5'".,2-40>% <*32"'(4=('"#$%&"'%(6"'(>"'/"- H5"0>(I0*2$/(=$-"(=4'(J('"#$%&"'%(K(>"'/"- L%"(M3.N''"#04*/&O< =-.#(&4(<PHH < O(,"%$'",(3.N$3*3('"#$%&"'%(K(>"'/"- D&(%43"(64$/&(Q%6$--$/#R $/&4(98S8(3.1(400*'!",*0"%(6"'=4'3./0"(M 98S8($%(%-4T H5"0>(I0*2$/(=$-"(=4'(98S8(*%.#" 82

63 Exec!"#"$%&'&'()$"*+,$-"),*.(" /*")012#3+2#&+'* #&+')#+)'6-- ="#"$%&'")$"(&*#"$),*.("A #;")0-+="7 *"-#&+'A architecture {sm_10} abiversion {0} modname {cubin} code { name = BlackScholesGPU lmem = 0 smem = 68 reg = 20 bar = 0 bincode { 0xa x x40024c09 0x per thread local memory per thread block shared memory per thread registers 83

64 Exec!"#$%&''()*+',%!*-'(-*./0 84

65 Exec!"#$%$&$'()#*+,-./)",+) *22/,)#*+,-./)",+)01234)-/)-)%61#$"1,)27)8-+")/$&, 9:2$.)8-/#$'()32%"6#-#$2')2')6'.,+;"2"61-#,.)8-+"/ <2+,)#*+,-./)",+)01234)==)0,##,+)%,%2+>)1-#,'3>) #*+,-. A,+',1)$':23-#$2'/)3-')7-$1)$7)#22)%-'>)+,($/#,+/)-+,)6/,. B,6+$/#$3/ <$'$%6%C)DE)#*+,-./)",+)01234!'1>)$7)%61#$"1,)32'36++,'#)01234/) FGH)2+)HID)#*+,-./)-)0,##,+)3*2$3, J/6-11>)/#$11),'26(*)+,(/)#2)32%"$1,)-'.)$':24,)/633,//7611> 85

66 Exec!""#$%&"'()*(+,-./-0%&", 1&"-,%23&4(/""#$%&"'(5/,2(&/6(&,",22%-37'( 3&"-,%2,($,-./-0%&", BUT 8/9:/""#$%&"'(0#763$-/",22/-2("%&&/6(%5,;#%6,7'( $%-%77,7320A 86

67 Exec!"#"$%&%#'(%)*+,#)-../'0"&'+1!"#"$%&%#'("&'+1)2%/.3)"4".&"&'+1)&+)4'55%#%1&)6!73 6!73)8"#9)'1)$"19):"93 ;)+5)$,/&'.#+0%33+#3 <%$+#9)="14:'4&2 A2#%"43).%#)=/+0B -AG->H IJK.%#'$%1&L $+4%)4'30+8%#3)"14)3"8%3)+.&'$"/) 87

68 Loop unrolling Exec Sometimes we know some kernel parameters at compile time: # of loop iterations Degrees of polynomials Number of data elements If we could tell this to the compiler, it can unroll loops and optimize register usage We need to be generic Avoid code duplication, sizes unknown at compile time Templates to rescue The same trick can be used for regular C++ sources Applied Mathematics slide by Johan Seland 43/53

69 Example: de Casteljau algorithm Exec A standard algorithm for evaluating polynomials in Bernstein form f (x) =b d 00 Recursively defined: x 1 x f (x) =b d 00 b k i,j = xb k 1 i+1,j b 0 i,jare coefficients + (1 x)bk 1 i,j+1 x b10 d 1 b01 d 1 1 x x 1 x 2 b20 d 2 b11 d 2 b02 d 2 Applied Mathematics slide by Johan Seland 44/53

70 Implementation Exec The de Casteljau algorithm is usually implemented as nested for-loops Coefficients are overwritten for each iteration f l o a t d e C a s t e l j a u ( f l o a t c, float x, i n t d ) { f o r ( u i n t i = 1 ; i <= d ; ++i ) { } f o r ( u i n t j = 0 ; j <= d i ; ++j ) c [ j ] = ( 1. 0 f x) c [ j ] + x c [ j +1]; f (x) =c d 00 x 1 x c d 1 10 c d 1 01 } r e t u r n c [ 0 ] ; x 1 x x 1 x 2 c20 d 2 c11 d 2 c02 d 2 Applied Mathematics slide by Johan Seland 45/53

71 Template loop unrolling Exec We make d a template parameter template<int d> f l o a t d e C a s t e l j a u ( f l o a t c, float x, int d ) { f o r ( u i n t i = 1 ; i <= d ; ++i ) { f o r ( u i n t j = 0 ; j <= d i ; ++j ) c [ j ] = ( 1. 0 f x) c [ j ] + x c [ j +1]; } r e t u r n c [ 0 ] ; } Kernel is called as s w i t c h ( d ) { c a s e 1: d e C a s t e l j a u <1><<<dimGrid, dimblock >>>( c, x ) ; b r e a k ; c a s e 2: d e C a s t e l j a u <2><<<dimGrid, dimblock >>>( c, x ) ; b r e a k ;.. c a s e MAXD: d e C a s t e l j a u <MAXD><<<dimGrid, dimblock >>>( c, x ) ; b r e a k ; } Applied Mathematics slide by Johan Seland 46/53

72 Results Exec For the de Castelaju algorithm we see a relatively small speedup 1.2 (20%...) Very easy to implement Can lead to long compile times Conclusion: Probably worth it near end of development cycle Applied Mathematics slide by Johan Seland 47/53

73 Exec!"#$%&'("# )#*+,'-.#*/!)01/2+,3",4.#$+/$5.,.$-+,('-($' 6+4",7/$".%+'$(#8 0(9+,8+#-/:,.#$5(#8 ;.#</$"#3%($-' =.-+#$7/5(*(#8 )'+/2+.</2+,3",4.#$+/4+-,($'/-"/8&(*+/"2-(4(>.-("#/ +B8B/4+4",7C/$",+/$"42&-.-("#C/",/(#'-,&$-("#/"9+,5+.* D2-(4(>+/7"&,/.%8",(-54C/then &#,"%%/%""2' )'+/-+42%.-+/2.,.4+-+,'/-"/8+#+,.-+/"2-(4.%/$"*+ 88

74 !"#$%&'($)*+,-.$/012*.#0 Profiling 3#.4+$5#-+,0#$-67$2*67$418#68*-.$4# #$ 401:.#5 ;/&$-67$%/&$8*5*6<$210$-..$=#06#.$*6>19-8*16+$-67$ 5#594?+!*5#$

75 !"#$%&' Profiling ()*$+',%-*,+-%./*0,1"+2,2%-01%-*,.34$+*-',3$,'"#$%&',"$,+2*,.2"56 +"7*'+%75 #&08"$.32*-*$+ #&08.32*-*$+ #'+8"$.32*-*$+ #'+8.32*-*$+ &3.%&8&3%0 &3.%&8'+3-* 9-%$.2 0")*-#*$+89-%$.2 Global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent) Local loads/stores Total branches and divergent branches taken by threads "$'+-4.+"3$' : "$'+-4.+"3$,.34$+ 1%-58'*-"%&";* : +2-*%0,1%-5',+2%+,'*-"%&";*,3$,%00-*'',.3$<&".+',+3, '2%-*0,3-,.3$'+%$+,7*73-=.+%8&%4$.2*0 : *>*.4+*0,+2-*%0,9&3./' 62

76 !"#$%&%$#'"()&%*+',$%)-*."#$%/ Profiling 01,.$/)%$&%$/$"#)$2$"#/)3'#4'")1)#4%$15)31%& 6",7)#1%($#/)*"$)8.,#'&%*-$//*% 01,.$/)3',,)"*#)-*%%$/&*"5)#*)#4$)#*#1,)".89$%)*+)31%&/),1."-4$5)+*%)1)&1%#'-.,1%):$%"$,; <1."-4)$"*.(4)#4%$15)9,*-:/)#*)$"/.%$)#41#)#4$)#1%($#) 8.,#'&%*-$//*%)'/)('2$")1)-*"/'/#$"#)&$%-$"#1($)*+)#4$)#*#1,) 3*%:; 01,.$/)1%$)9$/#)./$5)#*)'5$"#'+7)%$,1#'2$)&$%+*%81"-$) 5'++$%$"-$/)9$#3$$")."*&#'8'=$5)1"5)*&#'8'=$5)-*5$!")*#4$%)3*%5/>)#%7)#*)%$5.-$)#4$)81("'#.5$/)*+) 63

77 84!"#$%#&'()"*$%#*+,*"-"&"(.*#"/0).1%(!"#$%#&'()"*$%#*+,*"-"&"(.*#"/0).1%( :43:72*;<=> +93??:*;<=> 92346?*;<=> *;<=>?37+2*;<=> 43869*;<=> : : :6*&> A"#("-*7B &0-.1C-"*"-"&"(.>*C"#*.D#"'/ 83962*&> A"#("-*:B )%&C-"."-E*0(#%--"/ 0(#%--*-'>.*F'#C A"#("-*+B $1#>.*'//*/0#1(G*G-%H'-*-%'/ 23744*&> A"#("-*9B >"I0"(.1'-*'//#">>1(G A"#("-*4B 1(."#-"'J"/*'//#">>1(G F1.D*H'(K*)%($-1).> A"#("-*2B* 1(."#-"'J"/*'//#">>1(G F1.D*/1J"#G"(.*H#'()D1(G A"#("-*7*%(*94,*"-"&"(.>B*74*;<=>L M."C MC""/0C <'(/F1/.D N1&"*O4 44* 1(.>P Q0&0-'.1J" MC""/0C Example

78 Build your own!

80 Thank you! slide by David 2008 NVIDIA Kirk Corpora

81 Back Pocket Slides slide by David Cox

83 IAP09 / Misc

84 Tesla C1060 Computing Processor Processor Core GHz Form factor On-board memory System I/O Memory I/O Display outputs Typical power 1x Tesla T10P 1.33 GHz Full ATX: (H) x 10.5 (L) Dual slot wide 4 GB PCIe x16 gen2 512-bit, 800MHz DDR 102 GB/s peak bandwidth None 160 W M02: High Performance Computing with CUDA 19

5 GHz 1U for an EIA 19 4-post rack 16 GB (4.

85 Tesla S1070 1U System Processors Core GHz Form factor Total 1U system memory System I/O Memory I/O per processor Display outputs Typical power Chassis dimensions 4 x Tesla T10P 1.5 GHz 1U for an EIA 19 4-post rack 16 GB (4.0GB per GPU) 2 PCIe x bit, 800MHz GDDR 102 GB/s peak bandwidth None 700 W 1.73 H! 17.5 W! 28.5 D M02: High Performance Computing with CUDA 20

86 Double Precision Floating Point M02: High Performance Computing with CUDA NVIDIA GPU SSE2 Cell SPE Precision IEEE 754 IEEE 754 IEEE 754 Rounding modes for FADD and FMUL Denormal handling All 4 IEEE, round to nearest, zero, inf, -inf Full speed All 4 IEEE, round to nearest, zero, inf, -inf Supported, costs 1000 s of cycles NaN support Yes Yes No Overflow and Infinity support Yes Flags No Yes Some FMA Yes No Yes Square root Division Reciprocal estimate accuracy Reciprocal sqrt estimate accuracy log2(x) and 2^x estimates accuracy Software with low-latency FMA-based convergence Software with low-latency FMA-based convergence Yes Hardware Hardware 24 bit 12 bit 12 bit 23 bit 12 bit 12 bit 23 bit No No Round to zero/truncate only Flush to zero No infinity, clamps to max norm Software only Software only 18

Lecture 2: CUDA Programming

CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first: