Supercomputing on your desktop:


1 IAP09 / Supercomputing on your desktop: Programming the next generation of cheap and massively parallel hardware using CUDA Lecture 07 Nicolas Pinto (MIT) CUDA - Advanced #2

2 During this course, we'll try to adapt and use existing material ;-)

3 yey!! Today

4 Wanna Play with The Big Guys?

5 Here are the keys to High-Performance in CUDA

6 To optimize or not to optimize Warning! Hoare said (and Knuth restated) Premature optimization is the root of all evil. Applied Mathematics slide by Johan Seland 23/53

7 To optimize or not to optimize Warning! Hoare said (and Knuth restated) We should forget about small efficiencies, say about 97% of the time: Premature optimization is the root of all evil. 3% of the time we really should worry about small efficiencies (Every 33rd codeline) Applied Mathematics slide by Johan Seland 23/53

8 IAP09 / Strategy Memory Optimizations Execution Optimizations

9 IAP09 / CUDA Performance Strategies

10 Optimization goals Strategy We should strive to reach GPU peak performance We must know the GPU performance Vendor specifications Synthetic benchmarks Choose a performance metric Memory bandwidth or GFLOPS? Use clock() to measure Experiment and profile! slide by Johan Seland Applied Mathematics 25/53

11 Programming Model Threading A kernel is executed as a grid of thread blocks A thread block is a batch of threads that can cooperate with each other by: Sharing data through shared memory Synchronizing their execution Threads from different blocks cannot cooperate (Diagram: Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid contains blocks (0,0)...(2,1); each block contains threads (0,0)...(4,2).) NVIDIA Corporation

12 Data Movement in a CUDA Program Memory Host Memory Device Memory [Shared Memory] COMPUTATION [Shared Memory] Device Memory Host Memory NVIDIA Corporation

13 Perf Optimize Algorithms for the GPU Maximize independent parallelism GPU spends its transistors on ALUs, not memory Do more computation on the GPU to avoid costly data transfers Even low parallelism computations can sometimes be faster than transferring back and forth to host 39

14 Perf Optimize Memory Coherence Coalesced vs. non-coalesced = order of magnitude (global/local device memory) In shared memory, avoid high-degree bank conflicts 40

15 Perf Take Advantage of Shared Memory Hundreds of times faster than global memory Threads can cooperate via shared memory Use one / a few threads to load / compute data shared by all threads Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing 41

16 Perf Use Parallelism Efficiently Partition your computation to keep the GPU multiprocessors equally busy Many threads, many thread blocks Keep resource usage low enough to support multiple active thread blocks per multiprocessor 42

17

18 IAP09 / Memory Optimizations

19 Memory Optimizations Optimizing memory transfers Coalescing global memory accesses Using shared memory effectively 44

20 Data Transfers Memory Device memory to host memory bandwidth much lower than device memory to device memory bandwidth Minimize transfers: intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory Group transfers: one large transfer much better than many small ones 45

21 Page-Locked Memory Transfers Memory cudamallochost() allows allocation of page-locked host memory Enables highest cudaMemcpy performance: ~4 GB/s measured on nForce 680i motherboards (overclocked PCIe) See the "bandwidthTest" CUDA SDK sample Use with caution Allocating too much page-locked memory can reduce overall system performance Test your systems and apps to learn their limits 46

22 gmem Global Memory Reads/Writes 47

23 Accessing global memory gmem 4 cycles to issue a memory fetch, but 400-600 cycles of latency The equivalent of 100 MADs Likely to be a performance bottleneck Order of magnitude speedups possible Coalesce memory access Use shared memory to re-order non-coalesced addressing Applied Mathematics slide by Johan Seland 32/53

24 gmem Coalescing A coordinated read by a half-warp (16 threads) A contiguous region of global memory: 64 bytes - each thread reads a word (int, float, ...) 128 bytes - each thread reads a double-word (int2, float2, ...) 256 bytes - each thread reads a quad-word (int4, float4, ...) Additional restrictions on G8x architecture: Starting address for a region must be a multiple of region size The k-th thread in a half-warp must access the k-th element in a block being read Exception: not all threads must be participating Predicated access, divergence within a half-warp 48

25 gmem Coalesced Access: Reading floats All threads participate Some threads do not participate (Access-pattern diagrams) 49

26 Uncoalesced Access: Reading floats gmem Permuted access by threads Misaligned starting address (not a multiple of 64) (Access-pattern diagrams) 50

27 gmem Coalescing: Timing Results Experiment on G80: kernel reads a float, increments, writes back Times averaged over 10K runs 12K blocks x 256 threads: 356 us - coalesced; 357 us - coalesced, some threads don't participate; 3,494 us - permuted/misaligned thread access 51

28 gmem Coalescing: Structures of size != 4, 8, or 16 bytes Use a Structure of Arrays (SoA) instead of Array of Structures (AoS) Force structure alignment: __align(X), where X = 4, 8, or 16 (Diagram: x y z Point structure; AoS layout x y z x y z x y z; SoA layout x x x y y y z z z) 58

29 gmem Coalescing: Summary Coalescing greatly improves throughput Critical to memory-bound kernels Reading structures of size other than 4, 8, or 16 bytes will break coalescing: Prefer Structures of Arrays over AoS If SoA is not viable, read/write through SMEM Additional resources: Aligned Types SDK Sample 59

30 Parallel Memory Architecture smem In a parallel machine, many threads access memory Therefore, memory is divided into banks Essential to achieve high bandwidth Each bank can service one address per cycle A memory can service as many simultaneous accesses as it has banks Multiple simultaneous accesses to a bank result in a bank conflict: conflicting accesses are serialized (Diagram: Bank 0 ... Bank 15) 64

31 Bank Addressing Examples smem No bank conflicts: linear addressing, stride == 1 No bank conflicts: random 1:1 permutation (Diagrams: Thread 0 ... Thread 15 mapped onto Bank 0 ... Bank 15) 65

32 Bank Addressing Examples smem 2-way bank conflicts: linear addressing, stride == 2 8-way bank conflicts: linear addressing, stride == 8 (Diagrams: Thread 0 ... Thread 15 mapped onto Bank 0 ... Bank 15) 66

33 smem!"#$%&&'())()$*%+$,"$-%./)$".$012 3%.&#4&,5$"6$(%75$-%./$4)$89$-4,)$+('$9$7:"7/$7;7:() -%./) 012$5%)$AB$-%./) <"$-%./$C$%&&'())$D$AB <%*($%)$,5($)4E($"6$%$5%:6?#%'+ 67

34 Shared memory bank conflicts smem Shared memory is as fast as registers if there are no bank conflicts The fast case: If all threads of a half-warp access different banks, there is no bank conflict If all threads of a half-warp read the identical address, there is no bank conflict (broadcast) The slow case: multiple threads in the same half-warp access the same bank, and the accesses must be serialized 68

35 Use the right kind of memory Strategy Constant memory: Quite small, 20K As fast as register access if all threads in a warp access the same location Texture memory: Spatially cached Optimized for 2D locality Neighboring threads should read neighboring addresses No need to think about coalescing Constraint: These memories can only be updated from the CPU Applied Mathematics slide by Johan Seland 31/53

36 Memory optimizations roundup Strategy CUDA memory handling is complex And I have not covered all topics... Using memory correctly can lead to huge speedups At least CUDA exposes the memory hierarchy, unlike CPUs Get your algorithm up and running first, then optimize Use shared memory to let threads cooperate Be wary of data ownership A thread does not have to read/write the data it calculates slide by Johan Seland Applied Mathematics 41/53

37 Conflicts, Coalescing, Warps... I hate growing up.

38 Example Optimization Examples: Matrix Transpose

39 Matrix Transpose Example SDK Sample ("transpose") Illustrates coalescing using shared memory

40 Uncoalesced transpose Example

__global__ void transpose_naive(float *odata, float *idata, int width, int height)
{
    unsigned int xindex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yindex = blockDim.y * blockIdx.y + threadIdx.y;

    if (xindex < width && yindex < height) {
        unsigned int index_in  = xindex + width * yindex;
        unsigned int index_out = yindex + height * xindex;
        odata[index_out] = idata[index_in];
    }
}

71

41 Uncoalesced transpose Example Reads input from GMEM: stride = 1, coalesced Writes output to GMEM: stride = 16, uncoalesced (Element-index diagrams) 72

42 Coalesced Transpose Example Assumption: matrix is partitioned into square tiles Threadblock (bx, by): Reads the (bx, by) input tile, stores into SMEM Writes the SMEM data to the (by, bx) output tile Transpose the indexing into SMEM Thread (tx, ty): Reads element (tx, ty) from input tile Writes element (tx, ty) into output tile Coalescing is achieved if: Block/tile dimensions are multiples of 16 73

43 Uncoalesced transpose Example Reads from GMEM, writes to GMEM (Tile diagrams with element indices) 74

44 SMEM Optimization Example Threads read SMEM with stride = 16: bank conflicts Solution: allocate an "extra" column Read stride = 17: threads read from consecutive banks 75


46 Coalesced transpose Example

__global__ void transpose(float *odata, float *idata, int width, int height)
{
    __shared__ float block[(BLOCK_DIM+1)*BLOCK_DIM];

    unsigned int xblock = blockDim.x * blockIdx.x;
    unsigned int yblock = blockDim.y * blockIdx.y;
    unsigned int xindex = xblock + threadIdx.x;
    unsigned int yindex = yblock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if (xindex < width && yindex < height) {
        unsigned int index_in = width * yindex + xindex;
        unsigned int index_block = threadIdx.y * (BLOCK_DIM+1) + threadIdx.x;
        block[index_block] = idata[index_in];
        index_transpose = threadIdx.x * (BLOCK_DIM+1) + threadIdx.y;
        index_out = height * (xblock + threadIdx.y) + yblock + threadIdx.x;
    }

    __syncthreads();

    if (xindex < width && yindex < height)
        odata[index_out] = block[index_transpose];
}

47-54 Coalesced transpose: Source code Example The listing is built up step by step over slides 47-54; the per-slide annotations are shown here as comments:

__global__ void transpose( float *out, float *in, int width, int height )
{
    __shared__ float block[BLOCK_DIM*BLOCK_DIM];   // Allocate shared memory.

    unsigned int xblock = blockDim.x * blockIdx.x; // Set up indexing.
    unsigned int yblock = blockDim.y * blockIdx.y;
    unsigned int xindex = xblock + threadIdx.x;
    unsigned int yindex = yblock + threadIdx.y;
    unsigned int index_out, index_transpose;

    if ( xindex < width && yindex < height ) {     // Check that we are within
        unsigned int index_in = width * yindex + xindex; // domain, calculate
        unsigned int index_block =                       // more indices.
            threadIdx.y * BLOCK_DIM + threadIdx.x;
        block[index_block] = in[index_in];         // Write to shared memory.
        index_transpose =                          // Calculate output indices.
            threadIdx.x * BLOCK_DIM + threadIdx.y;
        index_out = height * (xblock + threadIdx.y) + yblock + threadIdx.x;
    }

    __syncthreads();                               // Synchronize. NB: outside if-clause.

    if ( xindex < width && yindex < height ) {
        out[index_out] = block[index_transpose];   // Write to global mem. Different index.
    }
}

Applied Mathematics slide by Johan Seland 39/53

55 Transpose timings Example Was it worth the trouble? Speedup of the coalesced over the non-coalesced version: 0.33, 1.92, 6.6 and 8.4 for increasing grid sizes. For me, this is a clear yes. Applied Mathematics slide by Johan Seland 40/53

56

57 IAP09 / Execution Optimizations

58 Know the arithmetic cost of operations Exec 4 clock cycles: Floating point: add, multiply, fused multiply-add Integer add, bitwise operations, compare, min, max 16 clock cycles: reciprocal, reciprocal square root, log(x), 32-bit integer multiplication 32 clock cycles: sin(x), cos(x) and exp(x) 36 clock cycles: Floating point division (24-bit version in 20 cycles) Particularly costly: Integer division, modulo Remedy: Replace with shifting whenever possible Double precision (when available) will perform at half the speed Applied Mathematics slide by Johan Seland 28/53

59 Exec Occupancy Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy Occupancy = number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently Limited by resource usage: Registers Shared memory 79

60 Exec Grid/Block Size Heuristics # of blocks > # of multiprocessors So all multiprocessors have at least one block to execute # of blocks > 100 to scale to future devices Blocks executed in pipeline fashion 1000 blocks per grid will scale across multiple generations 80

61 Exec Register Dependency Read-after-write register dependency: an instruction's result can be read ~22 cycles later Scenarios:

x = y + 5;          add.f32       $f3, $f1, $f2
z = x + 3;          add.f32       $f5, $f3, $f4

s_data[0] += 3;     ld.shared.f32 $f3, [$r31+0]
                    add.f32       $f3, $f3, $f4

To completely hide the latency: run at least 192 threads (6 warps) per multiprocessor At least 25% occupancy Threads do not have to belong to the same thread block 81

62 Exec Register Pressure Hide latency by using more threads per SM Limiting factors: Number of registers per kernel: 8192 per SM, partitioned among concurrent threads Amount of shared memory: 16 KB per SM, partitioned among concurrent threadblocks Check the .cubin file for # registers / kernel Use the -maxrregcount=N flag to NVCC N = desired maximum registers / kernel At some point "spilling" into LMEM may occur Reduces performance (LMEM is slow) Check the .cubin file for LMEM usage 82

63 Exec Determining resource usage Compile the kernel code with the -cubin flag to determine register usage Open the .cubin file with a text editor and look for the "code" section:

architecture {sm_10}
abiversion {0}
modname {cubin}
code {
    name = BlackScholesGPU
    lmem = 0        // per thread local memory
    smem = 68       // per thread block shared memory
    reg = 20        // per thread registers
    bar = 0
    bincode {
        0xa... 0x... 0x40024c09 0x...

83

64 Exec CUDA Occupancy Calculator 84

65 Exec Optimizing threads per block Choose threads per block as a multiple of warp size Avoid wasting computation on under-populated warps More threads per block == better memory latency hiding But more threads per block == fewer registers per thread Kernel invocations can fail if too many registers are used Heuristics Minimum: 64 threads per block Only if multiple concurrent blocks 192 or 256 threads a better choice Usually still enough regs to compile and invoke successfully 85

66 Exec Occupancy != Performance Increasing occupancy does not necessarily increase performance BUT low-occupancy multiprocessors cannot adequately hide latency / exploit parallelism 86

67 Exec Parameterize Your Application Parameterization helps adaptation to different GPUs GPUs vary in many ways # of multiprocessors Memory bandwidth Shared memory size Register file size Threads per block You can even make apps self-tuning (like FFTW and ATLAS) "Experiment" mode discovers and saves optimal configuration 87

68 Loop unrolling Exec Sometimes we know some kernel parameters at compile time: # of loop iterations Degrees of polynomials Number of data elements If we could tell this to the compiler, it can unroll loops and optimize register usage We need to be generic Avoid code duplication, sizes unknown at compile time Templates to rescue The same trick can be used for regular C++ sources Applied Mathematics slide by Johan Seland 43/53

69 Example: de Casteljau algorithm Exec A standard algorithm for evaluating polynomials in Bernstein form Recursively defined: f(x) = b^d_{0,0}, with b^k_{i,j} = x b^{k-1}_{i+1,j} + (1-x) b^{k-1}_{i,j+1}, where the b^0_{i,j} are the coefficients (Triangular evaluation diagram: b^{d-1}_{1,0}, b^{d-1}_{0,1} on the first level; b^{d-2}_{2,0}, b^{d-2}_{1,1}, b^{d-2}_{0,2} on the second; edge weights x and 1-x) Applied Mathematics slide by Johan Seland 44/53

70 Implementation Exec The de Casteljau algorithm is usually implemented as nested for-loops Coefficients are overwritten for each iteration

float deCasteljau( float* c, float x, int d )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d - i; ++j )
            c[j] = (1.0f - x)*c[j] + x*c[j+1];
    }
    return c[0];
}

f(x) = c^d_{0,0} (same triangular scheme as on the previous slide, with coefficients c^k_{i,j} and weights x and 1-x) Applied Mathematics slide by Johan Seland 45/53

71 Template loop unrolling Exec We make d a template parameter

template<int d>
float deCasteljau( float* c, float x )
{
    for ( uint i = 1; i <= d; ++i ) {
        for ( uint j = 0; j <= d - i; ++j )
            c[j] = (1.0f - x)*c[j] + x*c[j+1];
    }
    return c[0];
}

Kernel is called as

switch ( d ) {
    case 1:    deCasteljau<1><<<dimGrid, dimBlock>>>( c, x );    break;
    case 2:    deCasteljau<2><<<dimGrid, dimBlock>>>( c, x );    break;
    ...
    case MAXD: deCasteljau<MAXD><<<dimGrid, dimBlock>>>( c, x ); break;
}

Applied Mathematics slide by Johan Seland 46/53

72 Results Exec For the de Casteljau algorithm we see a relatively small speedup: 1.2 (20%...) Very easy to implement Can lead to long compile times Conclusion: probably worth it near the end of the development cycle Applied Mathematics slide by Johan Seland 47/53

73 Exec Conclusion Understand CUDA performance characteristics Memory coalescing Divergent branching Bank conflicts Latency hiding Use peak performance metrics to guide optimization (e.g. memory, core computation, or instruction overhead) Optimize your algorithm, then unroll loops Use template parameters to generate optimal code 88

74 The CUDA Visual Profiler Profiling Helps measure and find potential performance problems GPU and CPU timing for all kernel invocations and memcpys Time stamps

75 Signals Profiling Events are tracked with hardware counters on signals in the chip: timestamp gld_incoherent / gld_coherent / gst_incoherent / gst_coherent: global memory loads/stores are coalesced (coherent) or non-coalesced (incoherent) local_load / local_store: local loads/stores branch / divergent_branch: total branches and divergent branches taken by threads instructions: instruction count warp_serialize: thread warps that serialize on address conflicts to shared or constant memory cta_launched: executed thread blocks 62

76 Interpreting profiler counters Profiling Values represent events within a thread warp Only targets one multiprocessor Values will not correspond to the total number of warps launched for a particular kernel Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work Values are best used to identify relative performance differences between unoptimized and optimized code In other words, try to reduce the magnitudes of gld/gst_incoherent, divergent_branch, and warp_serialize 63

77 Performance for 16M element reduction Example Kernel 1: interleaved addressing with divergent branching Kernel 2: interleaved addressing with bank conflicts Kernel 3: sequential addressing Kernel 4: first add during global load Kernel 5: unroll last warp Kernel 6: completely unrolled Kernel 7: multiple elements per thread (Table columns: Time (4M ints), Bandwidth, Step Speedup, Cumulative Speedup) Kernel 7 on 32M elements: 73 GB/s! 84

78 Build your own!

79

80 Thank you! slide by David Kirk, NVIDIA Corporation, 2008

81 Back Pocket Slides slide by David Cox

82

83 IAP09 / Misc

84 Tesla C1060 Computing Processor
Processor: 1x Tesla T10P
Core clock: 1.33 GHz
Form factor: Full ATX: (H) x 10.5" (L), dual slot wide
On-board memory: 4 GB
System I/O: PCIe x16 Gen2
Memory I/O: 512-bit, 800 MHz DDR, 102 GB/s peak bandwidth
Display outputs: None
Typical power: 160 W
M02: High Performance Computing with CUDA 19

85 Tesla S1070 1U System
Processors: 4 x Tesla T10P
Core clock: 1.5 GHz
Form factor: 1U for an EIA 19" 4-post rack
Total 1U system memory: 16 GB (4.0 GB per GPU)
System I/O: 2 PCIe x16
Memory I/O per processor: 512-bit, 800 MHz GDDR, 102 GB/s peak bandwidth
Display outputs: None
Typical power: 700 W
Chassis dimensions: 1.73" H x 17.5" W x 28.5" D
M02: High Performance Computing with CUDA 20

86 Double Precision Floating Point

                                  NVIDIA GPU               SSE2                              Cell SPE
Precision                         IEEE 754                 IEEE 754                          IEEE 754
Rounding modes for FADD and FMUL  All 4 IEEE (nearest,     All 4 IEEE (nearest,              Round to zero/truncate only
                                  zero, inf, -inf)         zero, inf, -inf)
Denormal handling                 Full speed               Supported, costs 1000s of cycles  Flush to zero
NaN support                       Yes                      Yes                               No
Overflow and Infinity support     Yes                      Yes                               No infinity, clamps to max norm
Flags                             No                       Yes                               Some
FMA                               Yes                      No                                Yes
Square root                       Software with            Hardware                          Software only
                                  low-latency FMA-based
                                  convergence
Division                          Software with            Hardware                          Software only
                                  low-latency FMA-based
                                  convergence
Reciprocal estimate accuracy      24 bit                   12 bit                            12 bit
Reciprocal sqrt estimate accuracy 23 bit                   12 bit                            12 bit
log2(x) and 2^x estimates accuracy 23 bit                  No                                No

M02: High Performance Computing with CUDA 18


More information

Optimizing Parallel Reduction in CUDA

Optimizing Parallel Reduction in CUDA Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf Parallel Reduction Tree-based approach used within each

More information

Programming on GPUs (CUDA and OpenCL) Gordon Erlebacher Nov. 30, 2012

Programming on GPUs (CUDA and OpenCL) Gordon Erlebacher Nov. 30, 2012 Programming on GPUs (CUDA and OpenCL) Gordon Erlebacher Nov. 30, 2012 EvoluEon GPU Languages Fermi, GTX480 900 GTX 280 (933 GFLOPs) Khronos OpenCL 800 Floating-Point Operations (FLOPs) per Second (In

More information

Dense Linear Algebra. HPC - Algorithms and Applications

Dense Linear Algebra. HPC - Algorithms and Applications Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:

More information

CUDA Performance Optimization. Patrick Legresley

CUDA Performance Optimization. Patrick Legresley CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions.

CUDA. Schedule API. Language extensions. nvcc. Function type qualifiers (1) CUDA compiler to handle the standard C extensions. Schedule CUDA Digging further into the programming manual Application Programming Interface (API) text only part, sorry Image utilities (simple CUDA examples) Performace considerations Matrix multiplication

More information

CS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose

CS 179: GPU Computing. Recitation 2: Synchronization, Shared memory, Matrix Transpose CS 179: GPU Computing Recitation 2: Synchronization, Shared memory, Matrix Transpose Synchronization Ideal case for parallelism: no resources shared between threads no communication between threads Many

More information

Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary

Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary Optimizing CUDA Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary NVIDIA Corporation 2009 2 Optimize Algorithms for the GPU Maximize

More information

Introduction to GPGPU and GPU-architectures

Introduction to GPGPU and GPU-architectures Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks

More information

Linear Algebra on the GPU. Pawel Pomorski, HPC Software Analyst SHARCNET, University of Waterloo

Linear Algebra on the GPU. Pawel Pomorski, HPC Software Analyst SHARCNET, University of Waterloo Linear Algebra on the GPU, HPC Software Analyst SHARCNET, University of Waterloo Overview Brief introduction to GPUs CUBLAS and MAGMA libraries Developing efficient linear algebra code for the GPU - case

More information

$ cd ex $ cd 1.Enumerate_GPUs $ make $./enum_gpu.x. Enter in the application folder Compile the source code Launch the application

$ cd ex $ cd 1.Enumerate_GPUs $ make $./enum_gpu.x. Enter in the application folder Compile the source code Launch the application $ cd ex $ cd 1.Enumerate_GPUs $ make $./enum_gpu.x Enter in the application folder Compile the source code Launch the application --- General Information for device 0 --- Name: xxxx Compute capability:

More information

Optimizing CUDA NVIDIA Corporation 2009 U. Melbourne GPU Computing Workshop 27/5/2009 1

Optimizing CUDA NVIDIA Corporation 2009 U. Melbourne GPU Computing Workshop 27/5/2009 1 Optimizing CUDA 1 Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Summary 2 Optimize Algorithms for the GPU Maximize independent parallelism

More information

GPU Computing with CUDA. Part 3: CUDA Performance Tips and Tricks

GPU Computing with CUDA. Part 3: CUDA Performance Tips and Tricks GPU Computing with CUDA Part 3: CUDA Performance Tips and Tricks Dortmund, June 4, 2009 SFB 708, AK "Modellierung und Simulation" Dominik Göddeke Angewandte Mathematik und Numerik TU Dortmund dominik.goeddeke@math.tu-dortmund.de

More information

Advanced CUDA Programming. Dr. Timo Stich

Advanced CUDA Programming. Dr. Timo Stich Advanced CUDA Programming Dr. Timo Stich (tstich@nvidia.com) Outline SIMT Architecture, Warps Kernel optimizations Global memory throughput Launch configuration Shared memory access Instruction throughput

More information

Josef Pelikán, Jan Horáček CGG MFF UK Praha

Josef Pelikán, Jan Horáček CGG MFF UK Praha GPGPU and CUDA 2012-2018 Josef Pelikán, Jan Horáček CGG MFF UK Praha pepca@cgg.mff.cuni.cz http://cgg.mff.cuni.cz/~pepca/ 1 / 41 Content advances in hardware multi-core vs. many-core general computing

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Review Secret behind GPU performance: simple cores but a large number of them; even more threads can exist live on the hardware (10k 20k threads live). Important performance

More information

Reductions and Low-Level Performance Considerations CME343 / ME May David Tarjan NVIDIA Research

Reductions and Low-Level Performance Considerations CME343 / ME May David Tarjan NVIDIA Research Reductions and Low-Level Performance Considerations CME343 / ME339 27 May 2011 David Tarjan [dtarjan@nvidia.com] NVIDIA Research REDUCTIONS Reduction! Reduce vector to a single value! Via an associative

More information

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix

More information

Lecture 5. Performance Programming with CUDA

Lecture 5. Performance Programming with CUDA Lecture 5 Performance Programming with CUDA Announcements 2011 Scott B. Baden / CSE 262 / Spring 2011 2 Today s lecture Matrix multiplication 2011 Scott B. Baden / CSE 262 / Spring 2011 3 Memory Hierarchy

More information

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM.

Graph Partitioning. Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Graph Partitioning Standard problem in parallelization, partitioning sparse matrix in nearly independent blocks or discretization grids in FEM. Partition given graph G=(V,E) in k subgraphs of nearly equal

More information

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1 Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

More information

Device Memories and Matrix Multiplication

Device Memories and Matrix Multiplication Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the

More information

Hands-on CUDA Optimization. CUDA Workshop

Hands-on CUDA Optimization. CUDA Workshop Hands-on CUDA Optimization CUDA Workshop Exercise Today we have a progressive exercise The exercise is broken into 5 steps If you get lost you can always catch up by grabbing the corresponding directory

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA

Implementation of Adaptive Coarsening Algorithm on GPU using CUDA Implementation of Adaptive Coarsening Algorithm on GPU using CUDA 1. Introduction , In scientific computing today, the high-performance computers grow

More information

CS 179 Lecture 4. GPU Compute Architecture

CS 179 Lecture 4. GPU Compute Architecture CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at

More information

CUDA as a Supporting Technology for Next- Generation AR Applications. Tutorial 4

CUDA as a Supporting Technology for Next- Generation AR Applications. Tutorial 4 CUDA as a Supporting Technology for Next- Generation AR Applications Tutorial 4 Agenda CUDA as a Supporting Technology for Next-Generation AR Applications Thiago Farias, João Marcelo Teixeira, Pedro Leite,

More information

CUDA Memory Model. Monday, 21 February Some material David Kirk, NVIDIA and Wen-mei W. Hwu, (used with permission)

CUDA Memory Model. Monday, 21 February Some material David Kirk, NVIDIA and Wen-mei W. Hwu, (used with permission) CUDA Memory Model Some material David Kirk, NVIDIA and Wen-mei W. Hwu, 2007-2009 (used with permission) 1 G80 Implementation of CUDA Memories Each thread can: Grid Read/write per-thread registers Read/write

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna14/ [ 10 ] GPU and CUDA Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture and Performance

More information

CUDA Exercises. CUDA Programming Model Lukas Cavigelli ETZ E 9 / ETZ D Integrated Systems Laboratory

CUDA Exercises. CUDA Programming Model Lukas Cavigelli ETZ E 9 / ETZ D Integrated Systems Laboratory CUDA Exercises CUDA Programming Model 05.05.2015 Lukas Cavigelli ETZ E 9 / ETZ D 61.1 Integrated Systems Laboratory Exercises 1. Enumerate GPUs 2. Hello World CUDA kernel 3. Vectors Addition Threads and

More information

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34 1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions

More information

Lab 1 Part 1: Introduction to CUDA

Lab 1 Part 1: Introduction to CUDA Lab 1 Part 1: Introduction to CUDA Code tarball: lab1.tgz In this hands-on lab, you will learn to use CUDA to program a GPU. The lab can be conducted on the SSSU Fermi Blade (M2050) or NCSA Forge using

More information

Hardware/Software Co-Design

Hardware/Software Co-Design 1 / 13 Hardware/Software Co-Design Review so far Miaoqing Huang University of Arkansas Fall 2011 2 / 13 Problem I A student mentioned that he was able to multiply two 1,024 1,024 matrices using a tiled

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Warps and Reduction Algorithms

Warps and Reduction Algorithms Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

HPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA

HPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA HPC COMPUTING WITH CUDA AND TESLA HARDWARE Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or High Throughput?

More information

Shared Memory. Table of Contents. Shared Memory Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Shared Memory.

Shared Memory. Table of Contents. Shared Memory Learning CUDA to Solve Scientific Problems. Objectives. Technical Issues Shared Memory. Table of Contents Shared Memory Learning CUDA to Solve Scientific Problems. 1 Objectives Miguel Cárdenas Montes Centro de Investigaciones Energéticas Medioambientales y Tecnológicas, Madrid, Spain miguel.cardenas@ciemat.es

More information

Sparse Linear Algebra in CUDA

Sparse Linear Algebra in CUDA Sparse Linear Algebra in CUDA HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 22 nd 2017 Table of Contents Homework - Worksheet 2

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

GPU programming basics. Prof. Marco Bertini

GPU programming basics. Prof. Marco Bertini GPU programming basics Prof. Marco Bertini CUDA: atomic operations, privatization, algorithms Atomic operations The basics atomic operation in hardware is something like a read-modify-write operation performed

More information

Introduction to GPU programming. Introduction to GPU programming p. 1/17

Introduction to GPU programming. Introduction to GPU programming p. 1/17 Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk

More information

Module Memory and Data Locality

Module Memory and Data Locality GPU Teaching Kit Accelerated Computing Module 4.4 - Memory and Data Locality Tiled Matrix Multiplication Kernel Objective To learn to write a tiled matrix-multiplication kernel Loading and using tiles

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN

CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school

More information

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization

More information

Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Multi-GPU Summary

Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Multi-GPU Summary Optimizing CUDA Outline Overview Hardware Memory Optimizations Execution Configuration Optimizations Instruction Optimizations Multi-GPU Summary NVIDIA Corporation 2009 2 Optimize Algorithms for the GPU

More information

Lecture 3: Introduction to CUDA

Lecture 3: Introduction to CUDA CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Introduction to CUDA Some slides here are adopted from: NVIDIA teaching kit Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

More information

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5)

CUDA programming model. N. Cardoso & P. Bicudo. Física Computacional (FC5) CUDA programming model N. Cardoso & P. Bicudo Física Computacional (FC5) N. Cardoso & P. Bicudo CUDA programming model 1/23 Outline 1 CUDA qualifiers 2 CUDA Kernel Thread hierarchy Kernel, configuration

More information

Parallel Computing. Lecture 19: CUDA - I

Parallel Computing. Lecture 19: CUDA - I CSCI-UA.0480-003 Parallel Computing Lecture 19: CUDA - I Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com GPU w/ local DRAM (device) Behind CUDA CPU (host) Source: http://hothardware.com/reviews/intel-core-i5-and-i7-processors-and-p55-chipset/?page=4

More information

Unrolling parallel loops

Unrolling parallel loops Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:

More information

CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. Stephen Jones, GTC 2017

CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES. Stephen Jones, GTC 2017 CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES Stephen Jones, GTC 2017 The art of doing more with less 2 Performance RULE #1: DON T TRY TOO HARD Peak Performance Time 3 Unrealistic Effort/Reward Performance

More information

Fundamental Optimizations

Fundamental Optimizations Fundamental Optimizations Paulius Micikevicius NVIDIA Supercomputing, Tutorial S03 New Orleans, Nov 14, 2010 Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access

More information

1/25/12. Administrative

1/25/12. Administrative Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:

More information

ECE 408 / CS 483 Final Exam, Fall 2014

ECE 408 / CS 483 Final Exam, Fall 2014 ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across

More information

GPU Performance Nuggets

GPU Performance Nuggets GPU Performance Nuggets Simon Garcia de Gonzalo & Carl Pearson PhD Students, IMPACT Research Group Advised by Professor Wen-mei Hwu Jun. 15, 2016 grcdgnz2@illinois.edu pearson@illinois.edu GPU Performance

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Advanced CUDA Optimizing to Get 20x Performance. Brent Oster

Advanced CUDA Optimizing to Get 20x Performance. Brent Oster Advanced CUDA Optimizing to Get 20x Performance Brent Oster Outline Motivation for optimizing in CUDA Demo performance increases Tesla 10-series architecture details Optimization case studies Particle

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION. Julien Demouth, NVIDIA Cliff Woolley, NVIDIA CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION Julien Demouth, NVIDIA Cliff Woolley, NVIDIA WHAT WILL YOU LEARN? An iterative method to optimize your GPU code A way to conduct that method with NVIDIA

More information

CS 677: Parallel Programming for Many-core Processors Lecture 6

CS 677: Parallel Programming for Many-core Processors Lecture 6 1 CS 677: Parallel Programming for Many-core Processors Lecture 6 Instructor: Philippos Mordohai Webpage: www.cs.stevens.edu/~mordohai E-mail: Philippos.Mordohai@stevens.edu Logistics Midterm: March 11

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

Plan. CS : High-Performance Computing with CUDA. Matrix Transpose Characteristics (1/2) Plan

Plan. CS : High-Performance Computing with CUDA. Matrix Transpose Characteristics (1/2) Plan Plan CS4402-9535: High-Performance Computing with CUDA Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) UWO-CS4402-CS9535 1 2 Performance Optimization 3 4 Parallel Scan 5 Exercises

More information

Data Parallel Execution Model

Data Parallel Execution Model CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling

More information

COSC 6374 Parallel Computations Introduction to CUDA

COSC 6374 Parallel Computations Introduction to CUDA COSC 6374 Parallel Computations Introduction to CUDA Edgar Gabriel Fall 2014 Disclaimer Material for this lecture has been adopted based on various sources Matt Heavener, CS, State Univ. of NY at Buffalo

More information

Image convolution with CUDA

Image convolution with CUDA Image convolution with CUDA Lecture Alexey Abramov abramov _at_ physik3.gwdg.de Georg-August University, Bernstein Center for Computational Neuroscience, III Physikalisches Institut, Göttingen, Germany

More information

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY

Information Coding / Computer Graphics, ISY, LiTH. Introduction to CUDA. Ingemar Ragnemalm Information Coding, ISY Introduction to CUDA Ingemar Ragnemalm Information Coding, ISY This lecture: Programming model and language Introduction to memory spaces and memory access Shared memory Matrix multiplication example Lecture

More information

CS : High-Performance Computing with CUDA

CS : High-Performance Computing with CUDA CS4402-9535: High-Performance Computing with CUDA Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) UWO-CS4402-CS9535 CS4402-9535: High-Performance Computing with CUDA UWO-CS4402-CS9535

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

Massively Parallel Architectures

Massively Parallel Architectures Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger

More information

CUDA Performance Considerations (2 of 2)

CUDA Performance Considerations (2 of 2) Administrivia CUDA Performance Considerations (2 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Friday 03/04, 11:59pm Assignment 4 due Presentation date change due via email Not bonus

More information

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. David Kirk/NVIDIA and Wen-mei Hwu, 2006-2008 This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC. Please send any comment to dkirk@nvidia.com

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel

More information

Advanced CUDA Optimizing to Get 20x Performance

Advanced CUDA Optimizing to Get 20x Performance Advanced CUDA Optimizing to Get 20x Performance Brent Oster Outline Motivation for optimizing in CUDA Demo performance increases Tesla 10-series architecture details Optimization case studies Particle

More information

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA 2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance

More information

Real-time Graphics 9. GPGPU

Real-time Graphics 9. GPGPU Real-time Graphics 9. GPGPU GPGPU GPU (Graphics Processing Unit) Flexible and powerful processor Programmability, precision, power Parallel processing CPU Increasing number of cores Parallel processing

More information

Lecture 7. Using Shared Memory Performance programming and the memory hierarchy

Lecture 7. Using Shared Memory Performance programming and the memory hierarchy Lecture 7 Using Shared Memory Performance programming and the memory hierarchy Announcements Scott B. Baden /CSE 260/ Winter 2014 2 Assignment #1 Blocking for cache will boost performance but a lot more

More information

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION

CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION CUDA OPTIMIZATION WITH NVIDIA NSIGHT ECLIPSE EDITION WHAT YOU WILL LEARN An iterative method to optimize your GPU code Some common bottlenecks to look out for Performance diagnostics with NVIDIA Nsight

More information