PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort

1 PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort

2 Schedule
1. Introduction, performance metrics & analysis
2. Many-core hardware
3. CUDA class 1: basics
4. CUDA class 2: advanced
5. Case study: LOFAR telescope with many-cores

3 What are many-cores?

4 What are many-cores
From Wikipedia: "A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient, largely because of issues with congestion in supplying instructions and data to the many processors."

5 What are many-cores
- How many is many?
  - Several tens of cores
- How are they different from multi-core CPUs?
  - Non-uniform memory access (NUMA)
  - Private memories
  - Network-on-chip
- Examples
  - Multi-core CPUs (48-core AMD Magny-Cours)
  - Graphics Processing Units (GPUs)
    - GPGPU = general-purpose programming on GPUs
  - Cell processor (PlayStation 3)
  - Server processors (Sun Niagara)

6 Many-core questions
The search for performance:
- Build hardware: What architectures?
- Evaluate hardware: What metrics? How do we measure?
- Use it: What workloads? Expected performance?
- Program it: How to program? How to optimize?
- Benchmark: How to analyze performance?

7 Today's Topics
- Introduction
  - Why many-core programming?
  - History
  - Hardware introduction
- Performance model: Arithmetic Intensity and Roofline

8 Why do we need many-cores?

9 Why do we need many-cores?
[Figure: NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) compared with Intel CPUs (3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)]

10 Why do we need many-cores?

11 Why do we need many-cores?

12 China's Tianhe-1A
- #5 in top500 list
- 4.701 pflops peak
- 2. pflops max
- 14,336 Xeon X5670 processors
- 7168 Nvidia Tesla M2050 GPUs x 448 cores = 3,211,264 cores

13 Power efficiency

14 Graphics in 1980

15 Graphics in 2000

16 Graphics now: GPU movie

17 Realism of modern GPUs
http://www.youtube.com/watch?v=bjdeipvpjgq&feature=player_embedded#t=49s
Courtesy techradar.com

18 Why do we need many-cores?
- Performance
  - Large-scale parallelism
- Power efficiency
  - Use transistors more efficiently
- Price (GPUs)
  - Huge market, bigger than Hollywood
  - Mass production, economy of scale
  - Spotty teenagers pay for our HPC needs!

19 GPGPU history
[Figure: GPU transistor counts: RIVA 128 3M xtors, GeForce M xtors, GeForce 3 60M xtors, GeForce FX 125M xtors, GeForce M xtors, Fermi 3B xtors]

20 GPGPU History
- Use graphics primitives for HPC
  - Ikonas [England 1978]
  - Pixel Machine [Potmesil & Hoffert 1989]
  - Pixel-Planes 5 [Rhoades, et al. 1992]
- Programmable shaders, around 1998
  - DirectX / OpenGL
  - Map application onto graphics domain!
- GPGPU
  - Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...

21 CUDA C/C++ Continuous Innovation
- Timeline: July 07, Nov 07, April 08, Aug 08, July 09, Nov 09, Mar 10
- Releases: CUDA Toolkit 1.0, CUDA Toolkit 1.1, CUDA Visual Profiler 2.2, CUDA Toolkit 2.0, CUDA Toolkit 2.3, Parallel Nsight Beta, CUDA Toolkit 3.0
- Features: C Compiler, C Extensions, Single Precision, BLAS, FFT, SDK 40 examples, Win XP 64, Atomics support, Multi-GPU support, cuda-gdb, HW Debugger, Double Precision, Compiler Optimizations, Vista 32/64, Mac OSX, DP FFT, Conversion intrinsics, Performance enhancements, C++ inheritance, Fermi support, Tools updates, Driver / RT interop, 3D Textures, HW Interpolation

22 CUDA Tools
- Parallel Nsight: Visual Studio
- Visual Profiler: for Linux
- cuda-gdb: for Linux

23 Many-core hardware introduction

24 The search for performance

25 The search for performance
- We have M(o)ore transistors
  - Bigger cores?
  - We are hitting the walls: power, memory, instruction-level parallelism (ILP)
- How do we use them?
  - Large-scale parallelism
  - Many-cores!

26 Choices
- Core type(s):
  - Fat or slim?
  - Vectorized (SIMD)?
  - Homogeneous or heterogeneous?
- Number of cores:
  - Few or many?
- Memory
  - Shared-memory or distributed-memory?
- Parallelism
  - Instruction-level parallelism, threads, vectors, ...

27 A taxonomy
Based on "field-of-origin":
- General-purpose: Intel, AMD
- Graphics Processing Units (GPUs): NVIDIA, ATI
- Gaming/Entertainment: Sony/Toshiba/IBM
- Embedded systems: Philips/NXP, ARM
- Servers: Oracle, IBM, Intel
- High Performance Computing: Intel, IBM, ...

28 General Purpose Processors
- Architecture
  - Few fat cores
  - Vectorization (SSE, AVX)
  - Homogeneous
  - Stand-alone
- Memory
  - Shared, multi-layered
  - Per-core cache and shared cache
- Programming
  - Processes (OS scheduler)
  - Message passing
  - Multi-threading
  - Coarse-grained parallelism

29 Server-side
- General-purpose-like, with more hardware threads
  - Lower performance per thread, high throughput
- Examples
  - Sun Niagara II: 8 cores x 8 threads
  - IBM POWER7: 8 cores x 4 threads
  - Intel SCC: 48 cores, all can run their own OS

30 Graphics Processing Units
- Architecture
  - Hundreds/thousands of slim cores
  - Homogeneous
  - Accelerator
- Memory
  - Very complex hierarchy
  - Both shared and per-core
- Programming
  - Off-load model
  - Many fine-grained symmetrical threads
  - Hardware scheduler

31 Cell/B.E.
- Architecture
  - Heterogeneous
  - 8 vector-processors (SPEs) + 1 trimmed PowerPC (PPE)
- Memory
  - Per-core memory, network-on-chip
- Programming
  - User-controlled scheduling
  - 6 levels of parallelism, all under user control
  - Fine- and coarse-grain parallelism

32 Take home message
- Variety of platforms
  - Core types & counts
  - Memory architecture & sizes
  - Parallelism layers & types
  - Scheduling
- Open questions:
  - Why so many?
  - How many platforms do we need?
  - Can any application run on any platform?

33 Hardware performance metrics

34 Hardware performance metrics
- Clock frequency [GHz] = absolute hardware speed
  - Memories, CPUs, interconnects
- Operational speed [GFLOPs]
  - Operations per cycle
- Memory bandwidth [GB/s]
  - Differs a lot between different memories on chip
- Power [Watt]
- Derived metrics
  - FLOP/Byte, FLOP/Watt

35 Theoretical peak performance
Peak = chips * cores * vectorwidth * FLOPs/cycle * clockfrequency
Examples from DAS-4:
- Intel Core i7 CPU: 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs
- NVIDIA GTX 580 GPU: 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs
- ATI HD 6970: 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs
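The peak formula on this slide is a plain product, so it can be sketched as a small C helper. This is only an illustration: the function and parameter names are assumptions, not part of the original slides. For a GPU, the number of SMs times the cores per SM would go into `cores_per_chip`, with a vector width of 1.

```c
#include <assert.h>
#include <stdio.h>

/* Theoretical peak in GFLOP/s: the product of the five factors on the slide.
 * The clock is given in GHz, so the result is directly in GFLOP/s.
 * All names here are illustrative assumptions. */
double peak_gflops(int chips, int cores_per_chip, int vector_width,
                   int flops_per_cycle, double clock_ghz)
{
    assert(chips > 0 && cores_per_chip > 0 && vector_width > 0);
    assert(flops_per_cycle > 0 && clock_ghz > 0.0);
    return (double)chips * cores_per_chip * vector_width
         * flops_per_cycle * clock_ghz;
}

/* Print one line of a peak-performance overview (hypothetical helper). */
void print_peak(const char *name, double gflops)
{
    printf("%-16s %7.1f GFLOP/s\n", name, gflops);
}
```

Calling `print_peak("Intel Core i7", peak_gflops(2, 4, 4, 2, 2.4))` evaluates the i7 example from the slide: 2 * 4 * 4 * 2 * 2.4 is about 153.6, which the slide rounds to 154 GFLOPs.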

36 DRAM memory bandwidth
Throughput = memory bus frequency * bits per cycle * bus width
- Memory clock != CPU clock!
- The result is in bits; divide by 8 for GB/s
Examples (frequencies in GHz):
- Intel Core i7 DDR3: 1.333 * 2 * 64 = 21 GB/s
- NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 = 192 GB/s
- ATI HD 6970 GDDR5: 1.375 * 4 * 256 = 176 GB/s
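The throughput formula, including the divide-by-8 step, can be sketched in C as follows. The function name, parameter names, and the 1.0 GHz clock used in the usage note are assumptions made for illustration.

```c
#include <assert.h>

/* Peak DRAM throughput in GB/s, following the slide:
 * bus frequency [GHz] * bits per cycle (per pin) * bus width [bits]
 * gives Gbit/s; dividing by 8 converts bits to bytes.
 * Names are illustrative assumptions, not from the slides. */
double dram_bandwidth_gbs(double bus_clock_ghz, int bits_per_cycle,
                          int bus_width_bits)
{
    assert(bus_clock_ghz > 0.0 && bits_per_cycle > 0 && bus_width_bits > 0);
    double gbits_per_s = bus_clock_ghz * bits_per_cycle * bus_width_bits;
    return gbits_per_s / 8.0;
}
```

With an assumed memory clock of 1.0 GHz, `dram_bandwidth_gbs(1.0, 4, 384)` gives 192 GB/s, the figure quoted for the 384-bit GDDR5 interface of the GTX 580.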

37 Memory bandwidths
- On-chip memory can be orders of magnitude faster
  - Registers, shared memory, caches, ...
  - E.g., AMD HD 7970 L1 cache achieves 2 TB/s
- Other memories: depends on the interconnect
  - Intel's technology: QPI (QuickPath Interconnect), 25.6 GB/s
  - AMD's technology: HT3 (HyperTransport 3), 19.2 GB/s
  - Accelerators: PCI-e GB/s

38 Power
- Chip manufacturers specify Thermal Design Power (TDP)
- We can measure dissipated power
  - Whole system
  - Typically (much) lower than TDP
- Power efficiency: FLOPS / Watt
- Examples (with theoretical peak and TDP)
  - Intel Core i7: 154 / 160 = 1.0 GFLOPs/W
  - NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPs/W
  - ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W
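The two derived metrics named earlier, FLOP/Watt and FLOP/Byte, are simple ratios of the peak numbers. A minimal sketch in C, with function names chosen here for illustration only:

```c
#include <assert.h>

/* Power efficiency: theoretical peak [GFLOP/s] divided by TDP [W].
 * Function names are illustrative assumptions. */
double gflops_per_watt(double peak_gflops, double tdp_watt)
{
    assert(tdp_watt > 0.0);
    return peak_gflops / tdp_watt;
}

/* Machine balance: theoretical peak [GFLOP/s] divided by
 * peak memory bandwidth [GB/s], i.e. FLOPs per byte. */
double flops_per_byte(double peak_gflops, double bandwidth_gbs)
{
    assert(bandwidth_gbs > 0.0);
    return peak_gflops / bandwidth_gbs;
}
```

For the ATI HD 6970 numbers on the slides, `gflops_per_watt(2703.0, 250.0)` is about 10.8 GFLOPs/W, and `flops_per_byte(2703.0, 176.0)` is about 15.4 FLOPs per byte. Both use theoretical peaks, so they are upper bounds rather than measured values.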

39 Summary
[Table]
- Columns: Cores, Threads/ALUs, GFLOPS, Bandwidth, FLOPs/Byte
- Rows: Sun Niagara 2, IBM bg/p, IBM Power 7, Intel Core i7, AMD Barcelona, AMD Istanbul, AMD Magny-Cours, Cell/B.E., NVIDIA GTX 580, NVIDIA GTX 680, AMD HD 6970, AMD HD

40 Absolute hardware performance
- Only achieved in the optimal conditions:
  - Processing units 100% used
  - All parallelism 100% exploited
  - All data transfers at maximum bandwidth
- No application is like this
- Even difficult to write micro-benchmarks

41 Performance analysis: Operational Intensity and the Roofline model

42 Software performance metrics (3 P's)
- Performance
  - Execution time
  - Speed-up vs. best available sequential application
  - Achieved GFLOPs: computational efficiency
  - Achieved GB/s: memory efficiency
- Productivity and Portability
  - Programmability
  - Production costs
  - Maintenance costs

43 Arithmetic intensity
- The number of arithmetic (floating-point) operations per byte of memory that is accessed
  - Is the program compute intensive or data intensive on a particular architecture?
- Ignore overheads
  - Loop counters
  - Array index calculations
  - Etc.

44 RGB to gray
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            Pixel pixel = RGB[y][x];
            /* weighted sum of the three colour channels */
            gray[y][x] = 0.30 * pixel.r + 0.59 * pixel.g + 0.11 * pixel.b;
        }
    }

45 RGB to gray
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            Pixel pixel = RGB[y][x];
            /* weighted sum of the three colour channels */
            gray[y][x] = 0.30 * pixel.r + 0.59 * pixel.g + 0.11 * pixel.b;
        }
    }
- 2 additions, 3 multiplies = 5 operations
- 3 reads, 1 write = 4 memory accesses
- AI = 5/4 = 1.25
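A self-contained version of the loop, together with the operation count from the slide, might look like the sketch below. The `Pixel` layout, the flat row-major arrays, the rounding step, and the weights 0.59 and 0.11 (the common luma weights that match the "2 additions, 3 multiplies" count) are assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed pixel layout: one byte per colour channel. */
typedef struct { unsigned char r, g, b; } Pixel;

/* Convert a row-major RGB image to grayscale.
 * Per pixel: 3 multiplies + 2 additions, 3 reads + 1 write,
 * matching the count on the slide (index arithmetic is ignored). */
void rgb_to_gray(const Pixel *rgb, unsigned char *gray,
                 size_t width, size_t height)
{
    assert(rgb != NULL && gray != NULL);
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            Pixel p = rgb[y * width + x];
            double luma = 0.30 * p.r + 0.59 * p.g + 0.11 * p.b;
            /* Round to nearest; luma never exceeds 255. */
            gray[y * width + x] = (unsigned char)(luma + 0.5);
        }
    }
}

/* Arithmetic intensity as counted on the slide:
 * operations divided by memory accesses. */
double arithmetic_intensity(int operations, int memory_accesses)
{
    assert(memory_accesses > 0);
    return (double)operations / memory_accesses;
}
```

`arithmetic_intensity(5, 4)` reproduces the slide's AI of 1.25. Note that the slide counts memory accesses rather than bytes, so the per-byte figure depends on the element size chosen for the input and output arrays.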

46 Compute or memory intensive?
[Figure: RGB to Gray compared against Sun Niagara 2, IBM bg/p, IBM Power 7, Intel Core i7, AMD Barcelona, AMD Istanbul, AMD Magny-Cours, Cell/B.E., NVIDIA GTX 580, NVIDIA GTX 680, AMD HD 6970]

47 Applications AI
[Figure: arithmetic intensity axis, from O(1) through O(log(N)) to O(N)]
- O(1): SpMV, BLAS1,2; Stencils (PDEs); Lattice Methods
- O(log(N)): FFTs
- O(N): Dense Linear Algebra (BLAS3); Particle Methods

48 Operational intensity
- The number of operations per byte of DRAM traffic
- Difference with arithmetic intensity
  - Operations, not just arithmetic
  - Caches: memory accesses are counted after they have been filtered by the cache hierarchy
    - Not between processor and cache
    - But between cache and DRAM memory

49 Attainable performance
Attainable GFLOPs/sec = min(Peak Floating-Point Performance, Peak Memory Bandwidth * Operational Intensity)
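The min() expression above translates directly into code. The sketch below also computes the ridge point, the operational intensity at which the two roofs meet. All function names are assumptions introduced for illustration.

```c
#include <assert.h>

/* Roofline bound: the lower of the compute roof and the memory roof.
 * peak_gflops in GFLOP/s, peak_bandwidth_gbs in GB/s,
 * operational_intensity in operations per byte of DRAM traffic. */
double attainable_gflops(double peak_gflops, double peak_bandwidth_gbs,
                         double operational_intensity)
{
    assert(peak_gflops > 0.0 && peak_bandwidth_gbs > 0.0);
    assert(operational_intensity >= 0.0);
    double memory_roof = peak_bandwidth_gbs * operational_intensity;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Operational intensity at which a kernel stops being memory-bound. */
double ridge_point(double peak_gflops, double peak_bandwidth_gbs)
{
    assert(peak_bandwidth_gbs > 0.0);
    return peak_gflops / peak_bandwidth_gbs;
}

/* Returns 1 if the kernel sits under the slanted (bandwidth) roof. */
int is_memory_bound(double peak_gflops, double peak_bandwidth_gbs,
                    double operational_intensity)
{
    return operational_intensity
         < ridge_point(peak_gflops, peak_bandwidth_gbs);
}
```

Using the AMD Opteron X2 figures from the Roofline slides (17.6 GFLOPs, 15 GB/s), a kernel with 0.5 ops/byte is limited to `attainable_gflops(17.6, 15.0, 0.5)`, which is 7.5 GFLOPs, while a kernel with 4 ops/byte reaches the full 17.6. The ridge point is 17.6 / 15, about 1.17 ops/byte.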

50 The Roofline model
- AMD Opteron X2 (two cores): 17.6 GFLOPs, 15 GB/s, ops/byte = 1.17

51 Roofline: comparing architectures
- AMD Opteron X2: 17.6 GFLOPs, 15 GB/s, ops/byte = 1.17
- AMD Opteron X4: 73.6 GFLOPs, 15 GB/s, ops/byte = 4.9

52 Roofline: computational ceilings
- AMD Opteron X2 (two cores): 17.6 GFLOPs, 15 GB/s, ops/byte = 1.17

53 Roofline: bandwidth ceilings
- AMD Opteron X2 (two cores): 17.6 GFLOPs, 15 GB/s, ops/byte = 1.17

54 Roofline: optimization regions

55 Use the Roofline model
- Determine what to do first to gain performance
  - Increase memory streaming rate
  - Apply in-core optimizations
  - Increase arithmetic intensity
- Reader: Samuel Williams, Andrew Waterman, David Patterson, "Roofline: an insightful visual performance model for multicore architectures"
