MANY-CORE COMPUTING. 7-Oct-2013. Ana Lucia Varbanescu, UvA. Original slides: Rob van Nieuwpoort, eScience Center


1 MANY-CORE COMPUTING
7-Oct-2013
Ana Lucia Varbanescu, UvA
Original slides: Rob van Nieuwpoort, eScience Center

2 Schedule
1. Introduction, performance metrics & analysis
2. Programming: basics
3. Programming: advanced
4. Case study: LOFAR telescope with many-cores, by Rob van Nieuwpoort

3 What are many-cores?
From Wikipedia: "A many-core processor is a multicore processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient, largely because of issues with congestion in supplying instructions and data to the many processors."
In this course:
  Multi-core/many-core CPUs
  (GP)GPUs

4 What are many-cores?
How many is many?
  Several tens of cores
How are they different from multi-core CPUs?
  Non-uniform memory access (NUMA)
  Private memories
  Network-on-chip
Examples
  Multi-core CPUs (48-core AMD Magny-Cours)
  Graphics Processing Units (GPUs)
    GPGPU = general-purpose programming on GPUs
  Server processors (Sun Niagara)
  HPC processors
    Cell B.E. (PlayStation 3)
    Intel Xeon Phi (aka Intel MIC, formerly Larrabee)

5 Today's Topics
Why do many-cores exist?
History
Hardware introduction
Performance model: Arithmetic Intensity and Roofline

6 Why many-cores?
Moore's law
Many-cores in real life

7 Moore's Law
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every year; the figure is now commonly quoted as every 18 to 24 months.
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase..."
Electronics Magazine, 1965

8 Transistor Counts (Intel)

9 Impact of device shrinking
Assume transistor size shrinks by a factor of x.
Transistors per unit area: up by x*x
Die size? Assume the same
Clock rate? May go up by x because wires are shorter
Raw computing power? Programs could go x*x*x times faster
In reality? Power consumption, memory, and parallelism impose stricter bounds!
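
The idealized scaling on this slide can be written out as a few lines of arithmetic. The sketch below is only an illustration in Python; the function name and the sample factor x = 2 are assumptions, not part of the slides, and it deliberately ignores the power, memory, and parallelism limits the slide warns about.

```python
def shrink_scaling(x: float) -> dict:
    """Idealized gains when transistor feature size shrinks by a factor x."""
    density_gain = x * x              # transistors per unit area (same die size)
    clock_gain = x                    # shorter wires, best case
    raw_compute_gain = density_gain * clock_gain  # x*x*x upper bound
    return {
        "density": density_gain,
        "clock": clock_gain,
        "raw_compute": raw_compute_gain,
    }


# Hypothetical example: features shrink by a factor of 2.
scaling = shrink_scaling(2.0)
print(scaling)
```

The x*x*x figure is an upper bound that assumes both the extra transistors and the higher clock translate fully into useful work.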

10 Revolution in Processors
Chip density is continuing to increase about 2x every 2 years
BUT
  Clock speed is not
  ILP is not
  Power is not

11 New ways to use transistors
Parallelism on-chip: multi-core processors
Multicore revolution
  Every machine will soon be a parallel machine
What about performance?
  Can applications use this parallelism?
  Do they have to be rewritten from scratch?
  Will all programmers have to be parallel programmers?
New programming models are needed
  Try to hide complexity from most programmers

12 Top500 [1/4]
State of the art in HPC (top500.org)
Trial for all new HPC architectures
[Figure annotations: 195 cores/node! Accelerated! Accelerated!]

13 Top500 [2/4]
Performance is dominated by multi-/many-cores
  Multi-core CPUs
  Accelerators

14 Top500 [3/4]
Accelerators?
  Relatively low numbers
  High performance impact

15 China's Tianhe-1A
#10 in Top500 list June 2013 (#1 in Top500 in November 2010)
4.701 pflops peak
2.… pflops max
14,336 Xeon X5670 processors
7168 Nvidia Tesla M2050 GPUs x 448 cores = 3,211,264 cores

16 China's Tianhe-2
#1 in Top500 June 2013
54.902 pflops peak
33.862 pflops max
16.… nodes = … x (2 x Xeon IvyBridge + 3 x Xeon Phi) = … cores (=> 195 cores/node)

17 Top500: prediction

18 GPUs vs. Top500

19 Why do we need many-cores?
[Figure; data labels: T12, GT200, G80, NV30, NV40, G70; 3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad]

20 Why do we need many-cores?

21 Power efficiency

22 Graphics in 1980

23 Graphics in 2000

24 Realism of modern GPUs
http://www.youtube.com/watch?v=bjdeipvpjgq&feature=player_embedded#t=49s
Courtesy techradar.com

25 Why do we need many-cores?
Performance
  Large scale parallelism
Power Efficiency
  Use transistors more efficiently
Price (GPUs)
  Game market is huge, bigger than Hollywood
  Mass production, economy of scale
  "spotty teenagers" pay for our HPC needs!
Prestige
  Reach ExaFLOP by 2019

26 History

27 Intel

28 GPGPU History
[Timeline figure; labels: RIVA (3M xtors), GeForce 256 (…M xtors), GeForce 3 (60M xtors), GeForce FX (125M xtors), GeForce … (…M xtors), Fermi (3B xtors)]
Current generation: NVIDIA Kepler, 7.1B transistors
More cores, more parallelism, more performance

29 GPGPU History
Use graphics primitives for HPC
  Ikonas [England 1978]
  Pixel Machine [Potmesil & Hoffert 1989]
  Pixel-Planes 5 [Rhoades, et al. 1992]
Programmable shaders, around 1998
  DirectX / OpenGL
  Map application onto graphics domain!
GPGPU
  Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...

30 CUDA C/C++ Continuous Innovation
Releases (in timeline order):
  July 07: CUDA Toolkit 1.0
  Nov 07: CUDA Toolkit 1.1
  April 08: CUDA Visual Profiler 2.2
  Aug 08: CUDA Toolkit 2.0
  July 09: CUDA Toolkit 2.3
  Nov 09: Parallel Nsight Beta
  Mar 10: CUDA Toolkit 3.0
Features listed along the timeline:
  C Compiler, C Extensions, Single Precision, BLAS, FFT, SDK 40 examples
  Win XP 64, Atomics support, Multi-GPU support
  cuda-gdb HW Debugger
  Double Precision, Compiler Optimizations, Vista 32/64, Mac OSX
  DP FFT, Conversion intrinsics, Performance enhancements
  C++ inheritance, Fermi support, Tools updates, Driver / RT interop
  3D Textures, HW Interpolation

31 CUDA Tools
Parallel Nsight: for Visual Studio
Visual Profiler: for Linux
cuda-gdb: for Linux

32 Another GPGPU history

33 AMD

34 AMD

35 AMD

36 ARM

37 Many-core hardware

38 Choices
Core type(s):
  Fat or slim?
  Vectorized (SIMD)?
  Homogeneous or heterogeneous?
Number of cores:
  Few or many?
Memory
  Shared-memory or distributed-memory?
Parallelism
  Instruction-level parallelism, threads, vectors, ...

39 A taxonomy
Based on "field-of-origin":
  General-purpose
    Intel, AMD
  Graphics Processing Units (GPUs)
    NVIDIA, ATI
  Gaming/Entertainment
    Sony/Toshiba/IBM
  Embedded systems
    Philips/NXP, ARM
  Servers
    Oracle, IBM, Intel
  High Performance Computing
    Intel, IBM, ...

40 General Purpose Processors
Architecture
  Few fat cores
  Vectorization (SSE, AVX)
  Homogeneous
  Stand-alone
Memory
  Shared, multi-layered
  Per-core cache and shared cache
Programming
  Processes (OS Scheduler)
  Message passing
  Multi-threading
  Coarse-grained parallelism

41 Server-side
General-purpose-like with more hardware threads
  Lower performance per thread, high throughput
Examples
  Sun Niagara II
    8 cores x 8 threads
  IBM POWER7
    8 cores x 4 threads
  Intel SCC
    48 cores, all can run their own OS

42 Graphics Processing Units
Architecture
  Hundreds/thousands of slim cores
  Homogeneous
  Accelerator
Memory
  Very complex hierarchy
  Both shared and per-core
Programming
  Off-load model
  Many fine-grained symmetrical threads
  Hardware scheduler

43 Cell/B.E.
Architecture
  Heterogeneous
  8 vector-processors (SPEs) + 1 trimmed PowerPC (PPE)
Memory
  Per-core memory, network-on-chip
Programming
  User-controlled scheduling
  6 levels of parallelism, all under user control
  Fine- and coarse-grain parallelism

44 Xeon Phi
Architecture
  ~60 homogeneous cores
    4 threads per core
  x86 architecture
Memory
  Per-core caches (L1, L2)
    Coherence
  UMA [?]
Programming
  SPMD/MPMD
  Fine- and coarse-grain parallelism (vector processing and threads, respectively)

45 Take home message
Variety of platforms
  Core types & counts
  Memory architecture & sizes
  Parallelism layers & types
  Scheduling
Open questions:
  Why so many?
  How many platforms do we need?
  Can any application run on any platform?

46 Hardware performance metrics

47 Hardware performance metrics
Clock frequency [GHz] = absolute hardware speed
  Memories, CPUs, interconnects
Operational speed [GFLOPs]
  Operations per cycle
Memory bandwidth [GB/s]
  Differs a lot between different memories on chip
Power [Watt]
Derived metrics
  FLOP/Byte, FLOP/Watt

48 Theoretical peak performance
Peak = chips * cores * vectorwidth * FLOPs/cycle * clockfrequency
Examples from DAS-4:
  Intel Core i7 CPU
    2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs
  NVIDIA GTX 580 GPU
    1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * … GHz = 1581 GFLOPs
  ATI HD 6970
    1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * … GHz = 2703 GFLOPs
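
The peak formula on this slide is a plain product, so it is easy to check. The Python sketch below reproduces the Core i7 figure and, for the two GPUs whose clock values did not survive in this transcript, back-solves the clock implied by the stated peak. The function names are hypothetical, and the implied clocks are derived from the slide's own numbers rather than taken from vendor specifications.

```python
def peak_gflops(chips, cores, vector_width, flops_per_cycle, clock_ghz):
    """Peak = chips * cores * vectorwidth * FLOPs/cycle * clockfrequency."""
    return chips * cores * vector_width * flops_per_cycle * clock_ghz


def implied_clock_ghz(peak, chips, cores, vector_width, flops_per_cycle):
    """Invert the peak formula to recover the clock implied by a stated peak."""
    return peak / (chips * cores * vector_width * flops_per_cycle)


# Intel Core i7 (all factors given on the slide).
core_i7_peak = peak_gflops(2, 4, 4, 2, 2.4)

# NVIDIA GTX 580: 16 SMs * 32 cores, scalar cores (vector width 1).
gtx580_clock = implied_clock_ghz(1581, 1, 16 * 32, 1, 2)

# ATI HD 6970: 24 SIMD engines * 16 cores, 4-way vectors.
hd6970_clock = implied_clock_ghz(2703, 1, 24 * 16, 4, 2)

print(f"Core i7 peak:        {core_i7_peak:.1f} GFLOPs")
print(f"GTX 580 implied clk: {gtx580_clock:.3f} GHz")
print(f"HD 6970 implied clk: {hd6970_clock:.3f} GHz")
```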

49 DRAM memory bandwidth
Throughput = memory bus frequency * bits per cycle * bus width
  Memory clock != CPU clock!
  Result is in bits; divide by 8 for GB/s
Examples:
  Intel Core i7 DDR3: … * 2 * 64 = 21 GB/s
  NVIDIA GTX 580 GDDR5: … * 4 * 384 = 192 GB/s
  ATI HD 6970 GDDR5: … * 4 * 256 = 176 GB/s
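
As a rough check of the throughput formula, the Python sketch below implements it and back-solves the memory bus frequency implied by each stated bandwidth, since the frequency values are missing from this transcript. Function names are hypothetical; the implied frequencies follow from the rounded GB/s figures on the slide and are not vendor specifications.

```python
def dram_bandwidth_gbs(bus_freq_ghz, bits_per_cycle, bus_width_bits):
    """Throughput in GB/s: frequency * bits per cycle * bus width, divided by 8."""
    return bus_freq_ghz * bits_per_cycle * bus_width_bits / 8.0


def implied_bus_freq_ghz(bandwidth_gbs, bits_per_cycle, bus_width_bits):
    """Invert the formula to get the bus frequency implied by a stated bandwidth."""
    return bandwidth_gbs * 8.0 / (bits_per_cycle * bus_width_bits)


# (stated GB/s, bits per cycle, bus width in bits), as listed on the slide.
examples = {
    "Intel Core i7 DDR3": (21, 2, 64),
    "NVIDIA GTX 580 GDDR5": (192, 4, 384),
    "ATI HD 6970 GDDR5": (176, 4, 256),
}

for name, (gbs, bits, width) in examples.items():
    freq = implied_bus_freq_ghz(gbs, bits, width)
    print(f"{name}: implied bus frequency {freq:.4f} GHz")
```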

50 Memory bandwidths
On-chip memory can be orders of magnitude faster
  Registers, shared memory, caches, ...
  E.g., AMD HD 7970 L1 cache achieves 2 TB/s
Other memories: depends on the interconnect
  Intel's technology: QPI (QuickPath Interconnect)
    25.6 GB/s
  AMD's technology: HT3 (HyperTransport 3)
    19.2 GB/s
  Accelerators: PCI-e 2.0
    8 GB/s
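
To get a feel for what these interconnect numbers mean in practice, the Python sketch below estimates the best-case time to move a block of data over each link. The 1 GB payload is a hypothetical example, and the estimate assumes the full quoted bandwidth with no latency or protocol overhead.

```python
# Interconnect bandwidths in GB/s, as quoted on the slide.
LINK_BANDWIDTH_GBS = {
    "QPI": 25.6,
    "HT3": 19.2,
    "PCI-e 2.0": 8.0,
}


def transfer_time_s(size_gb, bandwidth_gbs):
    """Best-case transfer time: data size divided by link bandwidth."""
    return size_gb / bandwidth_gbs


payload_gb = 1.0  # hypothetical data set size
for link, bandwidth in LINK_BANDWIDTH_GBS.items():
    seconds = transfer_time_s(payload_gb, bandwidth)
    print(f"{link}: {seconds * 1000:.1f} ms for {payload_gb} GB")
```

For an accelerator, this transfer cost is paid in addition to the compute time, which is one reason off-loading small kernels often does not pay off.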

51 Power
Chip manufacturers specify Thermal Design Power (TDP)
We can measure dissipated power
  Whole system
  Typically (much) lower than TDP
Power efficiency
  FLOPS / Watt
Examples (with theoretical peak and TDP)
  Intel Core i7: 154 / 160 = 1.0 GFLOPs/W
  NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPs/W
  ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W
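
The power-efficiency examples are simple ratios of theoretical peak to TDP. A minimal Python sketch, using only the peak and TDP values listed on this slide (the function and dictionary names are hypothetical):

```python
def gflops_per_watt(peak_gflops, tdp_watt):
    """Power efficiency based on theoretical peak and TDP."""
    return peak_gflops / tdp_watt


# (theoretical peak in GFLOPs, TDP in Watt), as listed on the slide.
platforms = {
    "Intel Core i7": (154, 160),
    "NVIDIA GTX 580": (1581, 244),
    "ATI HD 6970": (2703, 250),
}

efficiency = {
    name: gflops_per_watt(peak, tdp) for name, (peak, tdp) in platforms.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.1f} GFLOPs/W")
```

Because both inputs are upper bounds (peak performance and TDP), the ratio is a nominal figure and not a measured efficiency.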

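52 Summary
[Table; numeric values and model numbers not preserved in this transcript]
Columns: Cores, Threads/ALUs, GFLOPS, Bandwidth, FLOPs/Byte
Rows: Sun Niagara, IBM bg/p, IBM Power, Intel Core i, AMD Barcelona, AMD Istanbul, AMD Magny-Cours, Cell/B.E, NVIDIA GTX, NVIDIA GTX, AMD HD, AMD HD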
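
The FLOPs/Byte column of such a summary is a derived metric: peak GFLOPS divided by memory bandwidth in GB/s. The Python sketch below computes it for the three platforms whose peak and bandwidth figures appear on slides 48 and 49; the function name is hypothetical, and the results are not the table's original values, which were lost.

```python
def flops_per_byte(peak_gflops, bandwidth_gbs):
    """Machine balance: FLOPs the chip can execute per byte it can fetch from DRAM."""
    return peak_gflops / bandwidth_gbs


# (peak GFLOPs from slide 48, DRAM bandwidth in GB/s from slide 49)
hardware = {
    "Intel Core i7": (154, 21),
    "NVIDIA GTX 580": (1581, 192),
    "ATI HD 6970": (2703, 176),
}

balance = {
    name: flops_per_byte(peak, bw) for name, (peak, bw) in hardware.items()
}

for name, value in balance.items():
    print(f"{name}: {value:.1f} FLOPs/Byte")
```

A higher value means an application must perform more arithmetic per byte loaded before the compute units, rather than the memory system, become the limit.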

53 Absolute hardware performance
Only achieved in the optimal conditions:
  Processing units 100% used
  All parallelism 100% exploited
  All data transfers at maximum bandwidth
In real life
  No application is like this
  Can we reason about real performance?

54 Performance analysis: Operational Intensity and the Roofline model
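
The roofline model is only named on this slide. In its usual formulation, attainable performance is the minimum of peak compute and memory bandwidth times operational intensity. The Python sketch below assumes that formulation and uses the GTX 580 peak (1581 GFLOPs) and bandwidth (192 GB/s) quoted on slides 48 and 49; the function name and the sample intensities are hypothetical.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Roofline bound: min(compute roof, memory roof at this intensity)."""
    memory_roof = bandwidth_gbs * intensity_flops_per_byte
    return min(peak_gflops, memory_roof)


PEAK_GFLOPS = 1581      # GTX 580 theoretical peak (slide 48)
BANDWIDTH_GBS = 192     # GTX 580 DRAM bandwidth (slide 49)

# Hypothetical kernels with different operational intensities (FLOPs/Byte).
for intensity in (0.25, 1.0, 4.0, 16.0):
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    limit = "memory-bound" if bound < PEAK_GFLOPS else "compute-bound"
    print(f"intensity {intensity:5.2f} FLOPs/Byte -> {bound:7.1f} GFLOPs ({limit})")
```

Under these assumptions, a kernel performing 0.25 FLOPs per byte cannot exceed 48 GFLOPs on this GPU, regardless of how well its arithmetic is optimized.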

55 An Example
I am the CEO of SmartSoftwareSolutions. I have an application that runs on my old Pentium laptop in 2.5 hours. I want to hire you to use many-cores to improve the performance.
Metrics I will judge candidates by:
  How fast can the application be?
    Execution time => what the users are interested in!
  How many times faster can you make it?
    Speed-up => use the best possible sequential performance as the baseline
  How do I know I should choose you?
    Achievable performance => reason about how far the achieved performance is from what is achievable
    Depends on application, hardware, and dataset!
  Is this architecture a good one to use?
    Utilization => did I really need this hardware?
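
Two of these metrics reduce to simple ratios. The Python sketch below shows speed-up and utilization; the 2.5-hour runtime comes from the slide, while the parallel runtime, the achieved GFLOPs, and the function names are hypothetical values chosen only for illustration.

```python
def speedup(t_sequential_s, t_parallel_s):
    """How many times faster the parallel version runs."""
    return t_sequential_s / t_parallel_s


def utilization(achieved_gflops, peak_gflops):
    """Fraction of the hardware's theoretical peak that is actually used."""
    return achieved_gflops / peak_gflops


# 2.5 hours from the slide. A fair speed-up should use the best possible
# sequential version as the baseline, not necessarily the old laptop run.
baseline_s = 2.5 * 3600

hypothetical_parallel_s = 90.0       # assumed many-core runtime
hypothetical_achieved = 395.25       # assumed achieved GFLOPs
gtx580_peak = 1581                   # theoretical peak from slide 48

print(f"speed-up:    {speedup(baseline_s, hypothetical_parallel_s):.0f}x")
print(f"utilization: {utilization(hypothetical_achieved, gtx580_peak):.0%}")
```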

56 Questions? Comments?
For questions, comments, suggestions, ...: A.L.Varbanescu@uva.nl
