State-of-the-art in Heterogeneous Computing

1 State-of-the-art in Heterogeneous Computing
Guest Lecture, NTNU
Trond Hagen, Research Manager, SINTEF, Department of Applied Mathematics

2 Overview
- Introduction
- GPU Programming Strategies
- Trends: Heterogeneous Processors
- Heterogeneous Computing at SINTEF

3 Hardware Evolution
- 1940s: Atanasoff-Berry Computer (1942): vacuum tubes, relays, complex wiring
- 1950s: Transistor (1947); 1956 Nobel Prize in Physics: Bardeen, Brattain, Shockley
- 1960s: Integrated circuit (1958), Noyce and Kilby (Kilby: Nobel Prize in Physics, 2000)
- 1970s: CPUs (1971), Intel; Intel founded (Moore, Noyce)

4 Hardware Evolution - Processors
- 1971: Intel 4004, 2,300 transistors, 740 kHz
- 1981: Intel 80286, 134 thousand transistors, 25 MHz
- 1993: Intel Pentium P5, 1.18 million transistors, 300 MHz
- 2000: Intel Pentium 4, 42 million transistors, 3.8 GHz
- 2010: Intel Nehalem, 2.3 billion transistors, 8 × 2.66 GHz

5 What happened?
Increasing the frequency of processors has several implications (three walls):
- Memory
- Instruction level parallelism
- Power
"The number of transistors on an integrated circuit for minimum component cost doubles every 24 months." (Gordon Moore)

6 Memory Wall
- Memory speeds have not increased as fast as core frequencies.
- A processor can wait through hundreds of clock cycles if it has to fetch data or instructions from main memory.
- Larger caches combined with instruction level parallelism can reduce the memory-wait time.

7 Instruction Level Parallelism Wall
It is difficult to find enough parallelism in the instruction stream of a single process to keep high-performance processor cores busy.
ILP techniques:
- Instruction pipelining (execution of multiple instructions can be partially overlapped)
- Superscalar execution (parallel instruction pipelines)
- Branch prediction (predict the outcome of decisions)
- Out-of-order execution

8 Power Consumption Wall
- When the frequency increases, the power consumption increases disproportionately.
- Power density is proportional to the cube of the frequency.
- Frequency is limited to around 4 GHz.

$P = C \rho f V_{dd}^{2}$

$P$ is the power density in watts per unit area, $C$ is the total capacitance, $\rho$ is the transistor density, $f$ is the processor frequency, and $V_{dd}$ is the supply voltage. Because the supply voltage has to rise roughly in step with the frequency, the power density grows approximately as $f^{3}$.

9 How can multi-core help?
Two cores running at 85% frequency vs. one core at 100% frequency:
- 100% power consumption
- 170% performance
South Africa

10 Accelerators
- Cell BE
- GPUs
- FPGAs
[Figure: learning curve for each accelerator type, skill versus repetitions]

11 GPU Programming Strategies

12 NVIDIA CUDA GPU Roadmap
[Figure: double-precision GFLOPS per watt by GPU generation: Tesla, Fermi, Kepler, Maxwell]

13 NVIDIA CUDA GPU Roadmap
- What about the increase in GFLOPS?
- A 15% decrease in frequency can give a 50% decrease in power consumption.
- Conclusion: future GPUs will be more power efficient (and faster as well).

15 NVIDIA Fermi Architecture
Sixteen streaming multiprocessors (SMs) are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).

16 Fermi Architecture Streaming Multiprocessor
- 32 cores per SM
- 64 KB of RAM for shared memory and L1 cache
- Double precision speed at 50% of peak single precision
- Concurrent kernel execution
- ECC support

17 GPU SPMD Grids
- Kernel execution is invoked by the CPU over a compute grid.
  - The grid is subdivided into a set of threadblocks.
  - Each threadblock contains a set of threads with access to shared memory.
- All threads in the grid run the same program, each with individual data and individual code flow (SPMD).
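The grid, threadblock, and thread hierarchy can be sketched in CUDA as follows. This is a minimal illustration rather than code from the lecture: the kernel `scale_add`, the host helper `run_scale_add`, and the block size of 256 are assumptions made for the example, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Every thread in the grid runs this same program on its own element (SPMD).
__global__ void scale_add(const float* x, const float* y, float* out, float a, int n)
{
    // Global index = threadblock offset within the grid + thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // the last block may be only partly filled
        out[i] = a * x[i] + y[i];
}

// Hypothetical host helper: copies inputs to the GPU, launches the grid, copies back.
std::vector<float> run_scale_add(const std::vector<float>& x,
                                 const std::vector<float>& y, float a)
{
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    const size_t bytes = n * sizeof(float);
    float *dx = nullptr, *dy = nullptr, *dout = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMalloc(&dout, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;                         // threads per threadblock (assumed)
    const int blocks = (n + threads - 1) / threads;  // enough blocks to cover n elements
    scale_add<<<blocks, threads>>>(dx, dy, dout, a, n);

    // This blocking copy waits for the kernel to finish before reading the result.
    cudaMemcpy(out.data(), dout, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dout);
    return out;
}
```

Calling `run_scale_add({1, 2, 3}, {10, 20, 30}, 2.0f)` from `main` launches one threadblock in which only the first three threads pass the bounds check.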

18 Multiprocessor Execution
- CPU scalar op: 1 thread processes 1 data element per op.
- CPU SSE op: 1 thread processes 2-4 data elements.
- GPU multiprocessor: 32 threads process 32 data elements.
These groups of 32 threads are called warps:
- Exposed as individual threads, but they run the same instruction.
- Divergence leads to serialization.

19 Warp Serialization and Masking
- Hardware serializes divergent code flow:
  - Execution time is the sum of the branches taken.
  - Worst case: all threads in the warp take individual branches (1/32 performance).
- Conclusion: minimize divergent code flow!
  - Move conditionals into data; use arithmetic operations, min/max, sorting, etc.
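As one illustration of replacing conditionals with min/max, the sketch below clamps values to a range in two ways. The kernel names and the host helper `run_clamp` are hypothetical, and error checking is omitted. For branch bodies this short the compiler may well emit predicated instructions on its own, so the pattern matters most when the branch bodies are long.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Branching version: threads of one warp may take different paths.
__global__ void clamp_branchy(float* v, float lo, float hi, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (v[i] < lo)      v[i] = lo;
    else if (v[i] > hi) v[i] = hi;
}

// Branch-free version: the conditional is expressed with min/max arithmetic,
// so every thread in the warp executes the same instruction sequence.
__global__ void clamp_minmax(float* v, float lo, float hi, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = fminf(fmaxf(v[i], lo), hi);
}

// Hypothetical host helper: runs one of the two kernels on a copy of v.
std::vector<float> run_clamp(std::vector<float> v, float lo, float hi, bool use_minmax)
{
    assert(lo <= hi);
    const int n = static_cast<int>(v.size());
    if (n == 0) return v;

    const size_t bytes = n * sizeof(float);
    float* d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 128;  // a multiple of the warp size (assumed)
    const int blocks = (n + threads - 1) / threads;
    if (use_minmax) clamp_minmax<<<blocks, threads>>>(d, lo, hi, n);
    else            clamp_branchy<<<blocks, threads>>>(d, lo, hi, n);

    cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return v;
}
```

Both kernels produce the same values for finite inputs; only the control flow inside a warp differs.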

20 Heterogeneous Program Execution
Two types of execution:
- The host program runs on the CPU:
  - Controls data movement between CPU and GPU memory.
  - Invokes kernel execution over compute grids.
  - Manages dependencies between kernels and data movement by sequencing invocations into streams/command queues.
- The SPMD kernels run on the device (= GPU):
  - Executed in parallel (typically instances).
  - Implicitly invoked by the host program (over SPMD compute grids).

21 Launch Configuration
How many threads/threadblocks to launch?
Key to understanding:
- Instructions are issued in order (the hardware traverses the assembly).
  - No ILP, branch prediction, or instruction reordering.
- A thread stalls when one of the operands isn't ready:
  - A memory read by itself doesn't stall execution.
- Latency is hidden by switching threads:
  - GMEM latency: clock cycles
  - Arithmetic latency: cycles (due to pipeline depth)
Conclusion: Need enough threads to hide latency!

22 Launch Configuration
Hiding arithmetic latency:
- Need ~18 warps (32 * 18 = 576 threads) per Fermi SM if each instruction depends on the one before it.
- Latency can also be hidden with independent instructions from the same warp.
  - For example, if an instruction never depends on the output of the preceding instruction, then only 9 warps are needed, etc.
Maximizing global memory throughput:
- Depends on the access pattern and word size.
- Need enough memory transactions in flight to saturate the bus:
  - Independent loads and stores from the same thread
  - Loads and stores from different threads
  - Larger word sizes can also help (float2 is twice the transactions of float, for example)
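The warp arithmetic and the word-size remark can be made concrete with a short sketch. The names `threads_for_warps`, `copy_float`, `copy_float2`, and `run_copy_float2` are assumptions for illustration, and error checking is omitted. The `float2` kernel moves 8 bytes per thread request instead of 4, so half as many threads cover the same data.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Threads per SM that correspond to a given number of resident warps,
// e.g. 18 warps * 32 = 576 threads.
constexpr int threads_for_warps(int warps) { return warps * kWarpSize; }

// One 4-byte load and one 4-byte store per thread (shown for comparison).
__global__ void copy_float(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// One 8-byte load and one 8-byte store per thread.
__global__ void copy_float2(const float2* in, float2* out, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) out[i] = in[i];
}

// Hypothetical helper: copies an even-length vector with the float2 kernel.
std::vector<float> run_copy_float2(const std::vector<float>& src)
{
    assert(src.size() % 2 == 0);
    const int n = static_cast<int>(src.size());
    std::vector<float> dst(n);
    if (n == 0) return dst;

    const size_t bytes = n * sizeof(float);
    float *din = nullptr, *dout = nullptr;
    cudaMalloc(&din, bytes);   // cudaMalloc returns memory aligned well beyond 8 bytes
    cudaMalloc(&dout, bytes);
    cudaMemcpy(din, src.data(), bytes, cudaMemcpyHostToDevice);

    const int n2 = n / 2;      // number of float2 elements
    const int threads = 256;
    const int blocks = (n2 + threads - 1) / threads;
    copy_float2<<<blocks, threads>>>(reinterpret_cast<const float2*>(din),
                                     reinterpret_cast<float2*>(dout), n2);

    cudaMemcpy(dst.data(), dout, bytes, cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dout);
    return dst;
}
```

Whether the wider accesses actually help depends on the GPU and on how close the narrower version already is to saturating the bus, so timing both variants is the reliable check.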

23 Maximizing Memory Throughput If Memory Bound
- Example: one access per thread at a time.
- Tesla C2050, ECC on: theoretical bandwidth is ~120 GB/s.
- Several independent smaller accesses have the same effect as one larger one: four 32-bit accesses ~= one 128-bit access.
Courtesy of NVIDIA

24 Launch Configuration Summary
- Need enough total threads to keep the GPU busy:
  - Typically, you'd like 512+ threads per SM.
  - More if processing one fp32 element per thread.
- Block configuration:
  - Threads per block should be a multiple of the warp size (32).
  - An SM can concurrently execute up to 8 threadblocks.
    - Really small threadblocks prevent achieving good occupancy (example: 32 threads * 8 = 256).
    - Really large blocks are less flexible (register spilling issues, etc.).
  - Generally use threads per threadblock, but use whatever is best for the application.
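The block-size rules above can be turned into a small helper. The sketch below is an assumption-laden illustration: `round_up_to_warp`, `blocks_for`, `touch`, `LaunchConfig`, and `suggest_launch` are hypothetical names, and `cudaOccupancyMaxPotentialBlockSize` is a CUDA runtime call that is newer than the Fermi-era material in these slides. The fallback of 256 threads is an arbitrary default used only when the query fails.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Round a desired block size up to a multiple of the warp size.
constexpr int round_up_to_warp(int threads)
{
    return ((threads + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Number of threadblocks needed to cover n elements, one element per thread.
constexpr int blocks_for(int n, int threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

// Placeholder kernel; the runtime inspects its register and shared memory usage.
__global__ void touch(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

struct LaunchConfig { int blocks; int threads; };

// Ask the runtime for a block size that maximizes occupancy for `touch`.
LaunchConfig suggest_launch(int n)
{
    assert(n > 0);
    int min_grid = 0;
    int threads = 0;
    cudaOccupancyMaxPotentialBlockSize(&min_grid, &threads, touch, 0, 0);
    if (threads <= 0) threads = 256;  // assumed default when the query fails
    return { blocks_for(n, threads), threads };
}
```

A launch would then read `touch<<<cfg.blocks, cfg.threads>>>(d, n)` with `cfg = suggest_launch(n)`. The suggested size is a starting point; as the slide says, whatever measures best for the application wins.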

25 Memory Hierarchy Review
- Local storage
  - Each thread has its own local storage.
  - Mostly registers (managed by the compiler).
- Shared memory / L1
  - Program configurable: 16 KB shared / 48 KB L1 or 48 KB shared / 16 KB L1.
  - Shared memory is accessible by the threads in the same threadblock.
  - Very low latency (same level as arithmetic instruction latency).
  - Very high throughput: 1+ TB/s aggregate.
- L2
  - All accesses to global memory go through L2, including copies to/from the CPU host.
- Global memory
  - Accessible by all threads as well as the CPU host.
  - High latency ( cycles).
  - Throughput: up to 177 GB/s.

26 GMEM Optimization Guidelines
- Strive for perfect coalescing:
  - Align the starting address (may require padding).
  - A warp should access within a contiguous region.
- Have enough concurrent accesses to saturate the bus:
  - Process several elements per thread:
    - Multiple loads get pipelined.
    - Indexing calculations can often be reused.
  - Launch enough threads to maximize throughput:
    - Latency is hidden by switching threads (warps).
- Try L1 and caching configurations to see which one works best:
  - Caching vs. non-caching loads (nvcc option: -Xptxas -dlcm=cg)
  - 16 KB vs. 48 KB L1 (CUDA call)
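Several of these guidelines fit in one short sketch: a grid-stride loop gives coalesced accesses and several elements per thread, and `cudaFuncSetCacheConfig` is one runtime call that selects the L1/shared memory split. The kernel `scale_inplace`, the helper `run_scale`, and the fixed grid of 64 blocks of 256 threads are assumptions for illustration; error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride loop: in every iteration, consecutive threads of a warp touch
// consecutive addresses (coalesced), and each thread handles several elements.
__global__ void scale_inplace(float* data, float a, int n)
{
    const int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] *= a;
}

// Hypothetical host helper: scales a copy of v on the GPU.
std::vector<float> run_scale(std::vector<float> v, float a)
{
    const int n = static_cast<int>(v.size());
    if (n == 0) return v;

    const size_t bytes = n * sizeof(float);
    float* d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice);

    // Request the larger L1 configuration for this kernel. This is a preference;
    // the runtime is free to pick a different split.
    cudaFuncSetCacheConfig(scale_inplace, cudaFuncCachePreferL1);

    const int threads = 256;  // assumed block size
    const int blocks = 64;    // fixed grid, so large inputs give several elements per thread
    scale_inplace<<<blocks, threads>>>(d, a, n);

    cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return v;
}
```

Swapping `cudaFuncCachePreferL1` for `cudaFuncCachePreferShared` and timing both runs is the experiment the last guideline suggests.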

27 Shared Memory
- Uses:
  - Inter-thread communication within a threadblock.
  - Cache data to reduce redundant global memory accesses (scratchpad memory).
  - Use it to improve global memory access patterns.
- Organization:
  - 32 banks, 4-byte wide banks.
  - Successive 4-byte words belong to different banks.
- Performance:
  - 4 bytes per bank per 2 clocks per multiprocessor.
  - SMEM accesses are issued per warp (per 16 threads for GPUs prior to Fermi).
  - Serialization: if n threads of 32 access different 4-byte words in the same bank, the n accesses are executed serially.
  - Multicast: n threads access the same word in one fetch.
    - Could be different bytes within the same word.
    - Prior to Fermi, only broadcast was available; sub-word accesses within the same bank caused serialization.
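A tiled matrix transpose is a common way to see both uses at once: shared memory as a scratchpad that fixes the global access pattern, and padding to avoid bank conflicts. The sketch below is not from the lecture; `transpose_tiled`, `run_transpose`, and the 32 x 32 tile are assumptions, and error checking is omitted. Without the extra column, the 32 threads of a warp reading one tile column would all hit the same bank.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;

// Transposes a row-major matrix with `height` rows and `width` columns.
__global__ void transpose_tiled(const float* in, float* out, int width, int height)
{
    // The +1 padding shifts each row by one bank, so a column read is conflict free.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;  // row in the input
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    x = blockIdx.y * TILE + threadIdx.x;      // column in the output
    y = blockIdx.x * TILE + threadIdx.y;      // row in the output
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

// Hypothetical host helper: returns the transpose as a row-major vector.
std::vector<float> run_transpose(const std::vector<float>& in, int width, int height)
{
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<size_t>(width) * height);
    std::vector<float> out(in.size());

    const size_t bytes = in.size() * sizeof(float);
    float *din = nullptr, *dout = nullptr;
    cudaMalloc(&din, bytes);
    cudaMalloc(&dout, bytes);
    cudaMemcpy(din, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(din, dout, width, height);

    cudaMemcpy(out.data(), dout, bytes, cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dout);
    return out;
}
```

Changing the declaration to `tile[TILE][TILE]` keeps the result correct but reintroduces the serialized column reads described above.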

28 Constant Memory
- Ideal for coefficients and other data that is read uniformly by warps.
- Data is stored in global memory and read through a constant cache:
  - __constant__ qualifier in declarations.
  - Can only be read by GPU kernels.
  - Limited to 64 KB.
- Fermi adds uniform accesses (automatically):
  - Kernel pointer argument qualified with const.
  - Compiler must determine that all threads in a block will dereference the same address.
  - No 64 KB limit: can use any global memory pointer.
- Constant cache throughput:
  - 32 bits per warp per 2 clocks per multiprocessor.
  - To be used when all threads in a warp read the same address; serializes otherwise.

29 Constant Memory
Example of Fermi uniform accesses through a const-qualified kernel pointer argument (no limit on array size; any global memory pointer can be used):

    __global__ void kernel(const float *g)
    {
        float x = g[5];              // Uniform cache load
        float y = g[blockIdx.x + 5]; // Uniform cache load
        float z = g[threadIdx.x];    // Non-uniform GMEM load
    }
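For the classic `__constant__` path, a minimal sketch is a polynomial evaluation in which every thread of a warp reads the same coefficient at the same time. The names `coeffs`, `poly_eval`, and `run_poly` are assumptions for illustration, the polynomial degree is fixed at three to keep the example short, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Coefficients c0..c3 of p(x) = c0 + c1*x + c2*x^2 + c3*x^3, stored in constant memory.
__constant__ float coeffs[4];

__global__ void poly_eval(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float v = x[i];
    // Horner's rule. All threads in a warp read the same coeffs[k] in the same
    // instruction, which is the uniform access pattern the constant cache serves well.
    float r = coeffs[3];
    r = r * v + coeffs[2];
    r = r * v + coeffs[1];
    r = r * v + coeffs[0];
    y[i] = r;
}

// Hypothetical host helper: uploads four coefficients and evaluates p at each x.
std::vector<float> run_poly(const std::vector<float>& c, const std::vector<float>& x)
{
    assert(c.size() == 4);
    const int n = static_cast<int>(x.size());
    std::vector<float> y(n);
    if (n == 0) return y;

    // Constant memory is written from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(coeffs, c.data(), 4 * sizeof(float));

    const size_t bytes = n * sizeof(float);
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    poly_eval<<<blocks, threads>>>(dx, dy, n);

    cudaMemcpy(y.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}
```

Indexing `coeffs` with a per-thread value instead, such as `coeffs[threadIdx.x % 4]`, would produce the serialized accesses the slide warns about.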

30 Texture Cache
- Separate hardware cache typically used for graphics (spatial caching).
- Provides "for free":
  - Interpolation: linear, bilinear, trilinear (1D/2D/3D textures), with 9-bit interpolation weights.
  - Format conversion: {int, short, char} -> float.
  - Out-of-bounds index handling: clamp, wrap.

31 Runtime Math Library and Intrinsics
- Two types of runtime math library functions:
  - __func(): may map directly to the hardware instruction set architecture.
    - Fast but lower accuracy (see the CUDA Programming Guide for full details).
    - Examples: __sinf(x), __expf(x), __powf(x, y)
  - func(): compiles to multiple instructions.
    - Slower but higher accuracy.
    - Examples: sin(x), exp(x), pow(x, y)
- A number of additional intrinsics:
  - __sincosf(), __frcp_rz(), ...
  - Explicit IEEE rounding modes (rz, rn, ru, rd)
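The accuracy trade-off can be observed directly by evaluating both variants on the same inputs. In the sketch below, `sin_both` and `max_sin_difference` are hypothetical names, the sample range of 0 to pi is an arbitrary choice, and error checking is omitted. Note that compiling with the nvcc flag `-use_fast_math` maps `sinf` to `__sinf`, in which case the measured difference collapses to zero.

```cuda
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Evaluates the library function and the hardware intrinsic on the same input.
__global__ void sin_both(const float* x, float* accurate, float* fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);   // compiles to multiple instructions
        fast[i]     = __sinf(x[i]); // maps to the special function unit
    }
}

// Hypothetical helper: largest |sinf(x) - __sinf(x)| over n samples in [0, pi].
float max_sin_difference(int n)
{
    assert(n >= 2);
    std::vector<float> x(n), acc(n, 0.0f), fast(n, 0.0f);
    const float pi = 3.14159265f;
    for (int i = 0; i < n; ++i)
        x[i] = pi * static_cast<float>(i) / static_cast<float>(n - 1);

    const size_t bytes = n * sizeof(float);
    float *dx = nullptr, *dacc = nullptr, *dfast = nullptr;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dacc, bytes);
    cudaMalloc(&dfast, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    sin_both<<<blocks, threads>>>(dx, dacc, dfast, n);

    cudaMemcpy(acc.data(), dacc, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(fast.data(), dfast, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dacc);
    cudaFree(dfast);

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        worst = std::max(worst, std::fabs(acc[i] - fast[i]));
    return worst;
}
```

On this range the difference is expected to be small; whether it is acceptable depends on the application, which is why the slide points to the CUDA Programming Guide for the error bounds.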

32 Synchronized Single-Stream Heterogeneous Program
[Figure: timeline of the host program and the GPU]
- Multiple CPU threads: sleep is no problem.
- But the GPU can do DMA and compute simultaneously (independently of the CPU), and Fermi can run different kernels simultaneously (when grids are small).

33 Asynchronous Multi-Stream Heterogeneous Program
[Figure: timeline of the host program, GPU stream 0, and GPU stream 1]

34 Asynchronous Multi-Stream Heterogeneous Program
Code (the copy and the kernel are potentially overlapped):

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaMemcpyAsync(dst, src, size, dir, stream1);
    kernel<<<grid, block, 0, stream2>>>( );
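A fuller sketch of the multi-stream pattern splits the data into two chunks and gives each chunk its own stream, so the copy of one chunk can overlap with the kernel of the other. The names `add_one` and `run_two_streams` and the chunking scheme are assumptions for illustration, and most error checking is omitted. Pinned host memory (`cudaMallocHost`) is used because asynchronous copies from pageable memory generally do not overlap with kernel execution.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

__global__ void add_one(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Hypothetical helper: fills 2 * chunk values with 0, 1, 2, ..., adds one to each
// on the GPU using two streams, and returns the result (empty on allocation failure).
std::vector<float> run_two_streams(int chunk)
{
    assert(chunk > 0);
    const int kStreams = 2;
    const int total = kStreams * chunk;
    const size_t chunk_bytes = chunk * sizeof(float);

    void* raw = nullptr;
    if (cudaMallocHost(&raw, kStreams * chunk_bytes) != cudaSuccess) return {};
    float* h = static_cast<float*>(raw);
    for (int i = 0; i < total; ++i) h[i] = static_cast<float>(i);

    float* d = nullptr;
    cudaMalloc(&d, kStreams * chunk_bytes);

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;
    for (int s = 0; s < kStreams; ++s) {
        float* hs = h + s * chunk;
        float* ds = d + s * chunk;
        // Operations within one stream run in order; different streams may overlap.
        cudaMemcpyAsync(ds, hs, chunk_bytes, cudaMemcpyHostToDevice, streams[s]);
        add_one<<<blocks, threads, 0, streams[s]>>>(ds, chunk);
        cudaMemcpyAsync(hs, ds, chunk_bytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for both streams before the host touches the results.
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }

    std::vector<float> result(h, h + total);
    cudaFree(d);
    cudaFreeHost(h);
    return result;
}
```

Whether the two streams actually overlap depends on the GPU's copy engines and on the chunk size; a profiler timeline is the usual way to confirm it.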

35 Trends: Heterogeneous Processors

36 AMD Fusion
- Single-die heterogeneous processors.
- Combination of CPU cores and GPU cores: Accelerated Processing Units (APUs).
- General purpose, programmable scalar and vector processing cores.
- Shared low-latency memory model.
- All next-generation AMD laptop/mobile processors will be APUs.

37 Intel Sandy Bridge
- Will also incorporate a GPU on the chip.
- CPU cores and GPU will share the same cache.
[Figure: die layout showing the GPU, CPU cores, northbridge, and L3 cache]

38 What about NVIDIA?
- Has no public plans to integrate its high-end GPUs with a CPU.
- Tegra (mobile market): ARM CPU and NVIDIA GPU.
- Future: CUDA GPUs and ARM CPUs for HPC?

39 Future Computational Systems
Image courtesy of AnandTech

40 Heterogeneous Computing at SINTEF

41 Heterogeneous Computing Group
The research group aims to combine the parallelism of traditional multi-core CPUs and accelerator cores, e.g., GPUs, to deliver an unprecedented level of performance.
[Figure: multi-core CPU and massively parallel GPU]
South Africa, September

42 NVIDIA Research Partner
NVIDIA CUDA Research Center award:
- Forefront of massively parallel computing.
- World-changing research in GPU computing.
- One of nine centers worldwide.
- Result of 7 years of research on GPU computing.

43 Focus Areas
- Parallel Computing
  - New numerical algorithms
  - State-of-the-art parallel architectures
  - HPC applications
- Visual Computing
  - Real-time rendering
  - Photorealistic rendering
  - Scientific visualization
- Cloud Computing
  - Auto-tuning
  - Web-based interfaces
  - Load balancing
[Figure: web-based clients, heterogeneous computers (CPUs & GPUs), workstations]

44 Application Areas
- Impact: copper bar
- Flood simulation, shallow water
- Computational fluid dynamics, SPH
- Isogeometric visualization
- Reservoir visualization
- Optimization (metaheuristics)
- Ultrasound processing
- Dam break (sloshing in a tank)
- Solving linear systems of equations
- Matlab interface for GPUs
- Image processing
- Photorealistic rendering
- Surgical simulation and visualization
- Signal processing

45 Faster than Real-Time Simulations
Saint-Venant equations

46 World's Fastest Marching Cubes Implementation

47 Code development
- Source code is parallelized for CPUs, GPUs, and cloud-based heterogeneous systems.
- Open source.

48 Thanks!
Contact information:
Main reference: Brodtkorb, Dyken, Hagen, Hjelmervik and Storaasli, "State-of-the-Art in Heterogeneous Computing", Journal of Scientific Programming, IOS Press, 18(1), 2010, pp
