MANY-CORE COMPUTING 7-Oct-2013 Ana Lucia Varbanescu, UvA Original slides: Rob van Nieuwpoort, eScience Center
Schedule 2 1. Introduction, performance metrics & analysis 2. Programming: basics (10-10-2013) 3. Programming: advanced (14-10-2013) 4. Case study: LOFAR telescope with many-cores by Rob van Nieuwpoort (17-10-2013)
What are many-cores? 3 From Wikipedia: A many-core processor is a multicore processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient largely because of issues with congestion in supplying instructions and data to the many processors. In this course: Multi-core/many-core CPUs (GP)GPUs
What are many-cores? 4 How many is many? Several tens of cores. How are they different from multi-core CPUs? Non-uniform memory access (NUMA), private memories, network-on-chip. Examples: multi-core CPUs (48-core AMD Magny-Cours), Graphics Processing Units (GPUs) (GPGPU = general-purpose programming on GPUs), server processors (Sun Niagara), HPC processors: Cell B.E. (PlayStation 3), Intel Xeon Phi (aka Intel MIC, formerly Larrabee)
Today's Topics 5 Why do many-cores exist? History Hardware introduction Performance model: Arithmetic Intensity and Roofline
6 Why many-cores? Moore's law Many-cores in real life
Moore's Law 7 Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every year (later revised to roughly every two years). "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year... Certainly over the short term this rate can be expected to continue, if not to increase..." Electronics Magazine, 1965
8 Transistor Counts (Intel)
Impact of device shrinking 9 Assume transistor size shrinks by a factor of x. Transistors per unit area: up by x*x. Die size? Assume the same. Clock rate? May go up by x, because wires are shorter. Raw computing power? Programs could run x*x*x times faster. In reality? Power consumption, memory, and parallelism impose stricter bounds!
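The scaling argument above can be made concrete with a tiny sketch; the shrink factor x = 1.4 per generation (roughly sqrt(2)) is an illustrative assumption, not a measured value:

```python
# Idealized impact of shrinking the feature size by a factor x.
# x = 1.4 (~sqrt(2)) is an illustrative assumption.
x = 1.4

transistors_per_area = x * x   # ~2x more transistors on the same die
clock_rate_gain = x            # wires are shorter, so clock may rise by ~x
ideal_speedup = x ** 3         # transistors * clock: ~2.7x per generation

print(transistors_per_area, clock_rate_gain, ideal_speedup)
# In practice, power, memory, and parallelism limits keep real
# speedups well below x**3.
```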
Revolution in Processors 10 Chip density is continuing to increase about 2x every 2 years, BUT clock speed is not, ILP is not, and the power budget is not.
New ways to use transistors 11 Parallelism on-chip: multi-core processors Multicore revolution Every machine will soon be a parallel machine What about performance? Can applications use this parallelism? Do they have to be rewritten from scratch? Will all programmers have to be parallel programmers? New programming models are needed Try to hide complexity from most programmers
Top500 [1/4] 12 State of the art in HPC (top500.org) Proving ground for all new HPC architectures; accelerated systems on the rise
Top500 [2/4] 13 Performance is dominated by multi-/many-cores Multi-core CPUs Accelerators
Top500 [3/4] 14 Accelerators? Relatively low numbers High performance impact
China's Tianhe-1A 15 #10 in the Top500 list of June 2013 (#1 in November 2010) 4.701 PFLOPS peak, 2.566 PFLOPS max 14,336 Xeon X5670 processors 7,168 NVIDIA Tesla M2050 GPUs x 448 cores = 3,211,264 GPU cores
China's Tianhe-2 16 #1 in Top500 June 2013 54.902 PFLOPS peak, 33.862 PFLOPS max 16,000 nodes = 16,000 x (2 x Xeon IvyBridge + 3 x Xeon Phi) = 3,120,000 cores (=> 195 cores/node)
17 Top500: prediction
18 GPUs vs. Top500
Why do we need many-cores? 19 [Chart: peak performance over time of NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz dual-core Pentium 4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon)]
20 Why do we need many-cores?
21 Power efficiency
22 Graphics in 1980
23 Graphics in 2000
Realism of modern GPUs 24 http://www.youtube.com/watch?v=bjdeipvpjgq&feature=player_embedded#t=49s Courtesy techradar.com
Why do we need many-cores? 25 Performance Large scale parallelism Power Efficiency Use transistors more efficiently Price (GPUs) Game market is huge, bigger than Hollywood Mass production, economy of scale spotty teenagers pay for our HPC needs! Prestige Reach ExaFLOP by 2019
26 History
27 Multi-core @ Intel
GPGPU History 28 Transistor counts, 1995-2010: RIVA 128 (3M xtors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), Fermi (3B). Current generation: NVIDIA Kepler, 7.1 billion transistors. More cores, more parallelism, more performance
GPGPU History 29 Use graphics primitives for HPC: Ikonas [England 1978], Pixel Machine [Potmesil & Hoffert 1989], Pixel-Planes 5 [Rhoades et al. 1992]. Programmable shaders, around 1998: DirectX / OpenGL; map the application onto the graphics domain! GPGPU: Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...
CUDA C/C++ Continuous Innovation 30 Timeline 2007-2010: CUDA Toolkit 1.0 (July 07: C compiler, C extensions, single-precision BLAS and FFT, SDK with 40 examples), CUDA Toolkit 1.1 (Nov 07: Win XP 64, atomics, multi-GPU support), CUDA Toolkit 2.0 (2008: cuda-gdb HW debugger, double precision, compiler optimizations, Vista 32/64, Mac OS X), CUDA Visual Profiler 2.2, CUDA Toolkit 2.3 (July 09: DP FFT, 16-32 conversion intrinsics, performance enhancements), Parallel Nsight Beta (Nov 09), CUDA Toolkit 3.0 (Mar 10: C++ inheritance, Fermi support, tools updates, driver/runtime interop, 3D textures, HW interpolation)
CUDA Tools 31 Parallel Nsight (Visual Studio), Visual Profiler (Linux), cuda-gdb (Linux)
32 Another GPGPU history
33 GPUs @ AMD
34 Multi-core @ AMD
35 Multi-core @ AMD
36 GPU @ ARM
37 Many-core hardware
Choices 38 Core type(s): fat or slim? Vectorized (SIMD)? Homogeneous or heterogeneous? Number of cores: few or many? Memory: shared-memory or distributed-memory? Parallelism: instruction-level parallelism, threads, vectors, ...
A taxonomy 39 Based on field of origin: General-purpose (Intel, AMD), Graphics Processing Units (NVIDIA, ATI), Gaming/Entertainment (Sony/Toshiba/IBM), Embedded systems (Philips/NXP, ARM), Servers (Oracle, IBM, Intel), High Performance Computing (Intel, IBM, ...)
General Purpose Processors 40 Architecture Few fat cores Vectorization (SSE, AVX) Homogeneous Stand-alone Memory Shared, multi-layered Per-core cache and shared cache Programming Processes (OS Scheduler) Message passing Multi-threading Coarse-grained parallelism
Server-side 41 General-purpose-like, with more hardware threads. Lower performance per thread, high throughput. Examples: Sun Niagara II (8 cores x 8 threads), IBM POWER7 (8 cores x 4 threads), Intel SCC (48 cores, all can run their own OS)
Graphics Processing Units 42 Architecture Hundreds/thousands of slim cores Homogeneous Accelerator Memory Very complex hierarchy Both shared and per-core Programming Off-load model Many fine-grained symmetrical threads Hardware scheduler
Cell/B.E. 43 Architecture Heterogeneous 8 vector-processors (SPEs) + 1 trimmed PowerPC (PPE) Memory Per-core memory, network-on-chip Programming User-controlled scheduling 6 levels of parallelism, all under user control Fine- and coarse-grain parallelism
Xeon Phi 44 Architecture: ~60 homogeneous cores (4 threads per core), x86 architecture. Memory: per-core caches (L1, L2) with coherence, UMA [?]. Programming: SPMD/MPMD; fine- and coarse-grain parallelism (vector processing and threads, respectively)
Take home message 45 Variety of platforms Core types & counts Memory architecture & sizes Parallelism layers & types Scheduling Open questions: Why so many? How many platforms do we need? Can any application run on any platform?
46 Hardware performance metrics
Hardware Performance metrics 47 Clock frequency [GHz] = absolute hardware speed (memories, CPUs, interconnects). Operational speed [GFLOPs] = operations per cycle * clock frequency. Memory bandwidth [GB/s]: differs a lot between the different memories on the chip. Power [Watt]. Derived metrics: FLOPs/Byte, FLOPs/Watt
Theoretical peak performance 48 Peak = chips * cores * vector width * FLOPs/cycle * clock frequency. Examples from DAS-4: Intel Core i7 CPU: 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs. NVIDIA GTX 580 GPU: 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs. ATI HD 6970: 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs
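The peak formula above is easy to evaluate mechanically; a minimal sketch, plugging in the DAS-4 numbers from this slide:

```python
def peak_gflops(chips, cores, vector_width, flops_per_cycle, clock_ghz):
    """Theoretical peak = chips * cores * vector width * FLOPs/cycle * clock (GHz)."""
    return chips * cores * vector_width * flops_per_cycle * clock_ghz

# Intel Core i7 (DAS-4): 2 chips, 4 cores, 4-way vectors, 2 FLOPs/cycle, 2.4 GHz
print(peak_gflops(2, 4, 4, 2, 2.4))          # 153.6 -> ~154 GFLOPs
# NVIDIA GTX 580: 16 SMs * 32 cores, scalar cores, 1.544 GHz
print(peak_gflops(1, 16 * 32, 1, 2, 1.544))  # ~1581 GFLOPs
# ATI HD 6970: 24 SIMD engines * 16 cores, 4-way VLIW, 0.880 GHz
print(peak_gflops(1, 24 * 16, 4, 2, 0.880))  # ~2703 GFLOPs
```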
DRAM Memory bandwidth 49 Throughput = memory bus frequency * transfers per cycle * bus width. Memory clock != CPU clock! Bus width is in bits; divide by 8 to get GB/s. Examples: Intel Core i7, DDR3: 1.333 * 2 * 64 / 8 = 21 GB/s. NVIDIA GTX 580, GDDR5: 1.002 * 4 * 384 / 8 = 192 GB/s. ATI HD 6970, GDDR5: 1.375 * 4 * 256 / 8 = 176 GB/s
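The same bandwidth arithmetic as a sketch, using the transfer rates implied by the examples (2 per cycle for DDR3, 4 for GDDR5):

```python
def bandwidth_gb_s(mem_clock_ghz, transfers_per_cycle, bus_width_bits):
    """Peak DRAM bandwidth; the bus width is in bits, hence the division by 8."""
    return mem_clock_ghz * transfers_per_cycle * bus_width_bits / 8

print(bandwidth_gb_s(1.333, 2, 64))   # DDR3 on the Core i7: ~21 GB/s
print(bandwidth_gb_s(1.002, 4, 384))  # GDDR5 on the GTX 580: ~192 GB/s
print(bandwidth_gb_s(1.375, 4, 256))  # GDDR5 on the HD 6970: 176 GB/s
```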
Memory bandwidths 50 On-chip memory can be orders of magnitude faster: registers, shared memory, caches, ... E.g., the AMD HD 7970's L1 cache achieves 2 TB/s. Other memories: depends on the interconnect. Intel's technology: QPI (QuickPath Interconnect), 25.6 GB/s. AMD's technology: HT3 (HyperTransport 3), 19.2 GB/s. Accelerators: PCIe 2.0, 8 GB/s
Power 51 Chip manufacturers specify the Thermal Design Power (TDP). We can measure dissipated power (whole system); typically (much) lower than TDP. Power efficiency = FLOPS / Watt. Examples (with theoretical peak and TDP): Intel Core i7: 154 / 160 = 1.0 GFLOPs/W. NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPs/W. ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W
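Plugging the theoretical peaks and TDPs from the previous slides into the FLOPS/Watt metric:

```python
def gflops_per_watt(peak_gflops, tdp_watt):
    """Power efficiency as theoretical peak over TDP."""
    return peak_gflops / tdp_watt

print(gflops_per_watt(154, 160))   # Intel Core i7:   ~1.0 GFLOPs/W
print(gflops_per_watt(1581, 244))  # NVIDIA GTX 580:  ~6.5 GFLOPs/W
print(gflops_per_watt(2703, 250))  # ATI HD 6970:    ~10.8 GFLOPs/W
```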
Summary 52

Platform         Cores  Threads/ALUs  GFLOPS  Bandwidth (GB/s)  FLOPs/Byte
Sun Niagara 2        8            64    11.2                76         0.1
IBM BG/P             4             8    13.6              13.6         1.0
IBM POWER7           8            32     265                68         3.9
Intel Core i7        4            16      85              25.6         3.3
AMD Barcelona        4             8      37              21.4         1.7
AMD Istanbul         6             6    62.4              25.6         2.4
AMD Magny-Cours     12            12     125              25.6         4.9
Cell/B.E.            8             8     205              25.6         8.0
NVIDIA GTX 580      16           512    1581               192         8.2
NVIDIA GTX 680       8          1536    3090               192        16.1
AMD HD 6970        384          1536    2703               176        15.4
AMD HD 7970         32          2048    3789               264        14.4
Absolute hardware performance 53 Only achieved under optimal conditions: processing units 100% used, all parallelism 100% exploited, all data transfers at maximum bandwidth. In real life, no application is like this. Can we reason about real performance?
54 Performance analysis Operational Intensity and the Roofline model
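The Roofline model bounds attainable performance by the minimum of the compute peak and memory bandwidth times operational intensity (FLOPs/Byte). A minimal sketch, using the GTX 580 numbers from the previous slides:

```python
def roofline_gflops(peak_gflops, bandwidth_gb_s, operational_intensity):
    """Attainable GFLOPs = min(compute peak, bandwidth * FLOPs/Byte)."""
    return min(peak_gflops, bandwidth_gb_s * operational_intensity)

# NVIDIA GTX 580: 1581 GFLOPs peak, 192 GB/s
print(roofline_gflops(1581, 192, 0.5))   # 96.0 -> memory-bound kernel
print(roofline_gflops(1581, 192, 16.0))  # 1581 -> compute-bound kernel
# The ridge point (peak / bandwidth) is ~8.2 FLOPs/Byte for this GPU:
# below it a kernel is memory-bound, above it compute-bound.
```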
An Example 55 I am the CEO of SmartSoftwareSolutions. I have an application that runs on my old Pentium laptop in 2.5 hours. I want to hire you to use many-cores to improve its performance. Metrics I will judge candidates by: How fast can the application be? Execution time => what the users are interested in! How many times faster can you make it? Speed-up => measured against the best possible sequential performance. How do I know I should choose you? Achievable performance => reason about how far off the performance is; depends on application, hardware, and dataset! Is this architecture a good one to use? Utilization => did I really need this hardware?
Questions? Comments? 56 For questions, comments, suggestions, : A.L.Varbanescu@uva.nl