1 By: Tomer Morad
Based on: Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 28(2), 2008

2
- Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 28(2), 2008
- Par Lab Boot UC Berkeley - GPU, CUDA, OpenCL programming
- CUDA Programming Model Overview, part 1
- CUDA Programming Model Overview, part 2
- Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. C. Weiser, "Many-Core vs. Many-Thread Machines: Stay Away From the Valley", IEEE Computer Architecture Letters, April 2009

4
                            CPU                  GPU
Threads                     <100                 >10,000
Single-thread performance   High                 Low
On-chip cache               Large                Almost none
Memory bandwidth            i7e = 51.2 GB/sec    GTX 580 = ... GB/sec
Programming difficulty      Easy                 Very difficult

5 [Figure: performance vs. number of threads, with three regions marked: MC region, valley, MT region.]
Fig. 1. Performance of a unified many-core (MC) many-thread (MT) machine exhibits three performance regions, depending on the number of threads in the workload. (Guz et al., 2009)

6 "Fig. 3 shows that the more computation instructions per memory instructions are given, the steeper perform..." (excerpt truncated on the slide)
[Figure: GOPS vs. number of threads; arrow: rising compute/memory ratio. Curves: compute/mem = 5, 10, 20, 100, 200, and pure compute.]
Fig. 3. Performance for different compute/memory ratios. (Guz et al., 2009)

7 "When increasing ... or decreasing ..." (excerpt; the two parameter symbols were lost in extraction)
[Figure: GOPS vs. number of threads; arrow: rising workload locality. Curves, with parameter names lost in extraction: infinite cache; (8.0, 25); (7.0, 30); (7.0, 50); (6.0, 100); no cache.]
Fig. 2. Performance for different cache hit rate functions. (Guz et al., 2009)

8 "...plots differing in their cache size." (excerpt)
[Figure: GOPS vs. number of threads; arrow: rising cache capacity. Curves: no cache, 4 MB, 8 MB, 16 MB, 32 MB, 64 MB, 100 MB, infinite cache.]
Fig. 5. Performance for different cache sizes. (Guz et al., 2009)

9 "(given in number of cycles needed to access the data, i.e. ...) affects the shape of the performance plot. Fig. 4 presents the performance plot for several different off-chip memory latencies." (excerpt; the latency symbol was lost in extraction)
"Memory latency is important in the MC region, as ob..." (excerpt truncated on the slide)
[Figure: GOPS vs. number of threads; arrow: rising memory latency. Curves: zero latency, 50 cycles, 100 cycles, 200 cycles, 1000 cycles, 2000 cycles.]
Fig. 4. Performance for different off-chip latencies. (Guz et al., 2009)

10 "...precision calculations." (excerpt)
"At the rightmost side of the plot, where all accesses to..." (excerpt truncated on the slide)
[Figure: GOPS vs. number of threads; arrow: rising off-chip BW capacity. Curves: ... GB/sec, 100 GB/sec, 150 GB/sec, 200 GB/sec, 250 GB/sec, 300 GB/sec, unlimited BW.]
Fig. 6. Performance with limited off-chip bandwidth. (Guz et al., 2009)

11 Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operations processor.

12 Figure 2. Texture/processor cluster (TPC). [Diagram labels: streaming multiprocessor controller, constant cache, special function unit, 16k.]

13
- Threads are independent, lightweight, scalar execution contexts
  Each thread has its own PC, register file, call stack, etc.
- 32 simultaneous threads constitute a warp
  Warps are executed in lockstep (SIMD)
  Execution is similar to predicated execution
  When threads in a warp diverge, the warp executes each branch path in turn, with the threads not on that path disabled
  Memory accesses are coalesced
- Thread blocks are composed of one or more warps
  No overhead for different warps executing at different PCs
  Shared instruction cache for each block
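The predicated, lockstep execution described on this slide can be sketched on the host in plain C++. This is a minimal illustration, not NVIDIA's scheduler: the names `Warp`, `kWarpSize`, and `run_divergent_branch` are hypothetical, and the branch body (`3*x + 1` vs. `x / 2`) is an arbitrary example. Each branch path is issued once for the whole warp, and a per-lane predicate decides which lanes commit results.

```cpp
#include <array>
#include <cstddef>

// Hypothetical model: one warp of 32 lanes sharing a single program counter.
constexpr std::size_t kWarpSize = 32;
using Warp = std::array<int, kWarpSize>;

// Emulates `if (x is odd) x = 3 * x + 1; else x = x / 2;` for one warp.
// Returns the number of branch paths the warp had to issue (1 or 2).
inline int run_divergent_branch(Warp& x) {
    std::array<bool, kWarpSize> odd{};
    bool any_odd = false;
    bool any_even = false;
    // Evaluate the branch condition in every lane to build the predicate mask.
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        odd[lane] = (x[lane] % 2) != 0;
        if (odd[lane]) any_odd = true; else any_even = true;
    }
    int passes = 0;
    if (any_odd) {   // "then" path: lanes holding even values are masked off
        for (std::size_t lane = 0; lane < kWarpSize; ++lane)
            if (odd[lane]) x[lane] = 3 * x[lane] + 1;
        ++passes;
    }
    if (any_even) {  // "else" path: lanes holding odd values are masked off
        for (std::size_t lane = 0; lane < kWarpSize; ++lane)
            if (!odd[lane]) x[lane] = x[lane] / 2;
        ++passes;
    }
    return passes;
}
```

A warp whose lanes all take the same path issues one pass; a warp with mixed odd and even values issues two, which is the cost of divergence in this model.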

14 Figure 4. Single-instruction, multiple-thread (SIMT) warp scheduling.

15
- 256 KB register file shared among all threads (~12,288)
  Tradeoff: number of threads vs. local context per thread
- L1$/scratchpad
  Configurable scratchpad size (non-cached)
- Shared L2$
- Local DRAM memory
  Used to store application state
- Host memory
  Used for communication with the CPU
- Constant memory / texture memory
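The tradeoff between thread count and per-thread context can be made concrete with a little arithmetic. The sketch below takes the slide's figures (256 KB, ~12,288 threads) at face value; the 4-byte register width and the helper name `regs_per_thread` are assumptions of this example, not something the slide states.

```cpp
#include <cstdint>

// Registers available to each thread when a register file of `file_bytes`
// is divided evenly among `threads` resident threads.
// Assumption: registers are 32 bits (4 bytes) wide.
constexpr std::uint32_t regs_per_thread(std::uint32_t file_bytes,
                                        std::uint32_t threads) {
    return (file_bytes / 4u) / threads;
}

// With every thread resident, each one gets only a handful of registers...
static_assert(regs_per_thread(256u * 1024u, 12288u) == 5,
              "65,536 registers / 12,288 threads, rounded down");
// ...while running a third as many threads triples the context per thread.
static_assert(regs_per_thread(256u * 1024u, 4096u) == 16,
              "65,536 registers / 4,096 threads");
```

Under these assumptions, a kernel that needs more registers per thread forces fewer threads to be resident, which is the tradeoff the slide names.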

16
// Compute sum of length-N vectors: C = A + B
void vecadd(float* a, float* b, float* c, int N) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

int main() {
    int N = ...;
    float *a, *b, *c;
    a = new float[N];
    // ... allocate other arrays, fill with data
    vecadd(a, b, c, N);
}

17
// Compute sum of length-N vectors: C = A + B
__global__ void vecadd(float* a, float* b, float* c, int N) {
    // One thread per element: compute this thread's global index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)  // the last block may extend past the end of the arrays
        c[i] = a[i] + b[i];
}

int main() {
    int N = ...;
    float *a, *b, *c;
    cudaMalloc(&a, sizeof(float) * N);
    // ... allocate other arrays, fill with data
    // Use thread blocks with 256 threads each
    vecadd<<<(N + 255) / 256, 256>>>(a, b, c, N);
}
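The launch configuration `(N+255)/256` and the `i < N` guard are easier to see when the grid is emulated on the host. The following C++ sketch is only an illustration of the index arithmetic: the two nested loops stand in for `blockIdx.x` and `threadIdx.x`, and the names `kBlockDim`, `num_blocks`, and `vecadd_emulated` are hypothetical. It assumes `b` has at least as many elements as `a`.

```cpp
#include <vector>

// Threads per block, matching the 256 used in the kernel launch.
constexpr int kBlockDim = 256;

// Ceiling division: the smallest block count that covers n elements.
constexpr int num_blocks(int n) { return (n + kBlockDim - 1) / kBlockDim; }

// Sequential stand-in for the launch: every (block, thread) pair computes
// the same global index the kernel would, and out-of-range threads do nothing.
inline std::vector<float> vecadd_emulated(const std::vector<float>& a,
                                          const std::vector<float>& b) {
    const int n = static_cast<int>(a.size());
    std::vector<float> c(a.size());
    for (int block = 0; block < num_blocks(n); ++block) {
        for (int thread = 0; thread < kBlockDim; ++thread) {
            const int i = block * kBlockDim + thread;  // blockIdx.x * blockDim.x + threadIdx.x
            if (i < n)                                 // guard for the partially filled last block
                c[i] = a[i] + b[i];
        }
    }
    return c;
}
```

For N = 1000 this yields 4 blocks, so 1,024 index values are generated and the guard discards the last 24.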

19
- 681 million transistors, 470 mm^2
- TSMC 90-nm CMOS
- 128 SP cores in 16 SMs
- 12,288 processor threads
- 1.5-GHz processor clock rate
- peak 576 Gflops in processors
- 768-Mbyte GDDR3 DRAM
- 384-pin DRAM interface
- 1.08-GHz DRAM clock
- 104-Gbyte/s peak bandwidth
- typical power of 150 W at 1.3 V
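Several of these figures follow from one another, which a few lines of compile-time arithmetic can confirm. In the sketch below, the factor of 3 flops per SP per clock is inferred from the listed numbers (576 / (128 x 1.5)), and the factor of 2 transfers per DRAM clock reflects GDDR3 being double data rate; the function names are hypothetical.

```cpp
// Peak arithmetic rate in Gflops: cores x clock (GHz) x flops per core per clock.
constexpr double peak_gflops(int sp_cores, double clock_ghz, int flops_per_clock) {
    return sp_cores * clock_ghz * flops_per_clock;
}

// Peak DRAM bandwidth in Gbyte/s: data pins x clock (GHz) x transfers per clock / 8 bits.
constexpr double dram_gbytes_per_s(int data_pins, double clock_ghz, int transfers_per_clock) {
    return data_pins * clock_ghz * transfers_per_clock / 8.0;
}

// Resident threads per SM, and how many 32-thread warps that represents.
constexpr int threads_per_sm(int total_threads, int sms) { return total_threads / sms; }
constexpr int warps_per_sm(int total_threads, int sms) {
    return threads_per_sm(total_threads, sms) / 32;
}

static_assert(peak_gflops(128, 1.5, 3) == 576.0, "128 SPs at 1.5 GHz, 3 flops per clock");
static_assert(threads_per_sm(12288, 16) == 768, "12,288 threads over 16 SMs");
static_assert(warps_per_sm(12288, 16) == 24, "768 threads per SM / 32 threads per warp");
```

`dram_gbytes_per_s(384, 1.08, 2)` evaluates to roughly 103.68, which the slide rounds to 104 Gbyte/s.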
