Hybrid Architectures – Why Should I Bother?
CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering
Michael Bader
July 8–19, 2013
The Simulation Pipeline
[Diagram: phenomenon/process -> (modelling) -> mathematical model -> (numerical treatment) -> numerical algorithm -> (parallel implementation) -> simulation code -> (visualization) -> results to interpret -> (embedding) -> statement/tool; validation feeds back along the entire pipeline]
Parallel Computing – Faster, Bigger, More
Why parallel high performance computing?
- Response time: compute a problem in 1/p of the time
  - speed up engineering processes
  - real-time simulations (tsunami warning?)
- Problem size: compute a p-times bigger problem
  - simulation of large-/multi-scale phenomena
  - maximal problem size that fits into the machine
  - validation of smaller, operational models
- Throughput: compute p problems at once
  - case and parameter studies, statistical risk scenarios, etc. (hazard maps, database for tsunami warning, ...)
  - massively distributed computing (e.g., SETI@home)
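A brief aside (standard definitions, not on the original slide; the symbols T, S, E, f and p are my notation): the gains above are usually quantified as parallel speedup and efficiency, and Amdahl's law bounds the speedup once a serial fraction of the work remains.

  S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}, \qquad S(p) \le \frac{1}{f + (1-f)/p} \quad \text{(Amdahl's law, serial fraction } f\text{)}

The ideal "compute a problem in 1/p of the time" corresponds to S(p) = p, i.e. E(p) = 1.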
Part I: High Performance Computing in CSE – Past(?) and Present Trends
The Seven Dwarfs of HPC
dwarfs = key algorithmic kernels in many scientific computing applications (P. Colella, LBNL, 2004):
1. dense linear algebra
2. sparse linear algebra
3. spectral methods
4. N-body methods
5. structured grids
6. unstructured grids
7. Monte Carlo
Tsunami & storm-surge simulation: usually PDE solvers on structured or unstructured meshes.
SWE: a simple shallow water solver on Cartesian grids (see the kernel sketch below).
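As a rough illustration (my sketch, not from the slides): dwarf #5, structured grids, boils down to sweeps of the following shape. The concrete five-point update and all names are made up, but a Cartesian-grid solver such as SWE spends its time in loops of exactly this structure, with each cell updated from a few neighbours.

#include <vector>

// One Jacobi-style sweep over a 2D Cartesian grid (illustrative only):
// every interior cell is updated from its four neighbours, so the kernel
// is memory-bound and data-parallel -- typical for structured-grid dwarfs.
void sweep(const std::vector<double>& u, std::vector<double>& u_new,
           int nx, int ny, double alpha) {
    auto idx = [nx](int i, int j) { return j * nx + i; };
    for (int j = 1; j < ny - 1; ++j) {
        for (int i = 1; i < nx - 1; ++i) {
            u_new[idx(i, j)] = u[idx(i, j)]
                + alpha * (u[idx(i - 1, j)] + u[idx(i + 1, j)]
                         + u[idx(i, j - 1)] + u[idx(i, j + 1)]
                         - 4.0 * u[idx(i, j)]);
        }
    }
}

Such loops are bandwidth-bound and embarrassingly data-parallel, which is what makes them a natural target for the vectorisation and GPU techniques discussed later in the course.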
Computational Science Demands a New Paradigm
Computational simulation must meet three challenges to become a mature partner of theory and experiment (Post & Votta, 2005):
1. the performance challenge: exponential growth of performance, massively parallel architectures
2. the programming challenge: new (parallel) programming models
3. the prediction challenge: careful verification and validation of codes; towards reproducible simulation experiments
Four Horizons for Enhancing the Performance of Parallel Simulations Based on Partial Differential Equations (David Keyes, 2000)
1. Expanded Number of Processors: in 2000: 1,000 cores; in 2010: 200,000 cores
2. More Efficient Use of Faster Processors: PDE working sets, cache efficiency
3. More Architecture-Friendly Algorithms: improve temporal/spatial locality (see the loop-blocking sketch below)
4. Algorithms Delivering More Science per Flop: adaptivity (in space and time), higher-order methods, fast solvers
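To make horizon 3 concrete, here is a minimal sketch of the classic locality transformation, loop blocking/tiling (my example, not Keyes'); the function name, the transpose use case and the tile size BS = 32 are illustrative assumptions.

#include <vector>

// Blocked (tiled) matrix transpose: B = A^T for an n x n matrix stored
// row-major. Working tile by tile means each loaded cache line is reused
// before it is evicted; the result equals that of the naive double loop.
void transpose_blocked(const std::vector<double>& A, std::vector<double>& B, int n) {
    const int BS = 32;  // tile size -- would be tuned to the cache size in practice
    for (int jj = 0; jj < n; jj += BS)
        for (int ii = 0; ii < n; ii += BS)
            for (int j = jj; j < jj + BS && j < n; ++j)
                for (int i = ii; i < ii + BS && i < n; ++i)
                    B[i * n + j] = A[j * n + i];
}

The same blocking idea carries over to stencil sweeps and dense linear algebra, which is why it recurs throughout the "more architecture-friendly algorithms" literature.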
Performance Development in Supercomputing
[Figure: Top500 performance development chart] (source: www.top500.org)
Top 500 (www.top500.org) – June 2013
[Figure: leading entries of the June 2013 Top500 list]
Top 500 Spotlights – Tianhe-2 and Titan
Tianhe-2 / MilkyWay-2 – Intel Xeon Phi (NUDT):
- 3.1 million cores(!), Intel Ivy Bridge and Xeon Phi
- Linpack benchmark: 33.8 PFlop/s
- 17 MW power(!!)
- Knights Corner / Intel Xeon Phi / Intel MIC as accelerator: 61 cores, roughly 1.1–1.3 GHz
Titan – Cray XK7, NVIDIA K20x (ORNL):
- 18,688 compute nodes; 300,000 Opteron cores
- 18,688 NVIDIA Tesla K20x GPUs
- Linpack benchmark: 17.6 PFlop/s
- 8.2 MW power
Top 500 Spotlights – Sequoia and K Computer
Sequoia – IBM BlueGene/Q (LLNL):
- 98,304 compute nodes; 1.6 million cores
- Linpack benchmark: 17.1 PFlop/s
- 8 MW power
K Computer – SPARC64 (RIKEN, Kobe):
- 88,128 processors (SPARC64 VIIIfx, 2.0 GHz, 8-core CPU); 705,024 cores
- Linpack benchmark: 10.51 PFlop/s
- 12 MW power
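A back-of-the-envelope observation (my arithmetic, not stated on the slides): dividing Linpack performance by power consumption puts all four spotlight systems at only a few GFlop/s per watt, which is why power dominates the exascale discussion below.

  \frac{33.8\,\text{PFlop/s}}{17\,\text{MW}} \approx 2.0, \qquad \frac{17.6}{8.2} \approx 2.1, \qquad \frac{17.1}{8} \approx 2.1, \qquad \frac{10.51}{12} \approx 0.9 \quad \text{GFlop/s per watt}

An exaflop within 67.7 MW (see the strawman architecture below) requires roughly 15 GFlop/s per watt, i.e. close to an order of magnitude improvement.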
Performance Development in Supercomputing
[Figure: Top500 performance development chart, extrapolated towards exascale] (source: www.top500.org)
International Exascale Software Project Roadmap
Towards an Exa-Flop/s platform in 2018 (www.exascale.org):
1. technology trends: concurrency, reliability, power consumption, ...; blueprint of an exascale system: 10-billion-way concurrency, 100 million to 1 billion cores, 10-to-100-way concurrency per core, hundreds of cores per die, ...
2. science trends: climate, high-energy physics, nuclear physics, fusion energy sciences, materials science and chemistry, ...
3. X-stack (software stack for exascale): energy, resiliency, heterogeneity, I/O and memory
4. politico-economic trends: exascale systems run by government labs, used by CSE scientists
Exascale Roadmap – Aggressively Designed Strawman Architecture

Level       What                              Perform.    Power    RAM
FPU         FPU, regs, instr.-memory          1.5 GFlops  30 mW
Core        4 FPUs, L1                        6 GFlops    141 mW
Proc. chip  742 cores, L2/L3, interconnect    4.5 TFlops  214 W
Node        proc. chip, DRAM                  4.5 TFlops  230 W    16 GB
Group       12 proc. chips, routers           54 TFlops   3.5 kW   192 GB
Rack        32 groups                         1.7 PFlops  116 kW   6.1 TB
System      583 racks                         1 EFlops    67.7 MW  3.6 PB

approx. 285,000 cores per rack; 166 million cores in total
Source: ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems
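A quick consistency check of the strawman numbers (my arithmetic, not part of the slide):

  32 \times 12 \times 742 = 284{,}928 \approx 285{,}000 \text{ cores per rack}, \qquad 32 \times 54\,\text{TFlops} \approx 1.7\,\text{PFlops per rack},
  583 \times 1.7\,\text{PFlops} \approx 1\,\text{EFlops}, \qquad 583 \times 285{,}000 \approx 1.66 \times 10^{8} \text{ cores in total}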
Exascale Roadmap – Should You Bother?
Your department's compute cluster in 5 years? A Petaflop system!
- one rack of the exaflop system, using the same/similar hardware
- extrapolated example machine:
  - peak performance: 1.7 PFlop/s
  - 6 TB RAM, 60 GB cache memory
  - total concurrency: 1.1 x 10^6
  - number of cores: 280,000
  - number of chips: 384
Source: ExaScale Software Study: Software Challenges in Extreme Scale Systems
Your Department's PetaFlop/s Cluster in 5 Years?
Tianhe-1A (Tianjin, China; Top500 #10):
- 14,336 Xeon X5670 CPUs
- 7,168 NVIDIA Tesla M2050 GPUs
- Linpack benchmark: 2.6 PFlop/s
- 4 MW power
Stampede (Intel, Top500 #6):
- 102,400 cores (incl. Xeon Phi: MIC / "many integrated cores")
- Linpack benchmark: 5 PFlop/s
- Knights Corner / Intel Xeon Phi / Intel MIC as accelerator: 61 cores, roughly 1.1–1.3 GHz
- wider vector FP units: 64 bytes (i.e., 16 floats or 8 doubles per vector register)
- 4.5 MW power
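To connect the 64-byte vector registers to actual code (an illustrative sketch, not from the slide): with 512-bit SIMD, one vector instruction operates on 8 doubles or 16 floats, so a vectorised loop such as the following performs 8 double-precision updates per instruction; the "omp simd" hint is the portable OpenMP 4.0 way to request vectorisation.

// axpy-style loop: with 512-bit vectors, each fused multiply-add
// instruction updates 8 doubles (or 16 floats) at once.
void axpy(int n, double a, const double* x, double* y) {
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

Whether the compiler actually emits 512-bit instructions depends on the target architecture flags; the point here is only that the data parallelism must be expressed in the loop for the hardware to exploit it.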
Free Lunch is Over (*)
... actually already over for quite some time!
Speedup of your software can only come from parallelism:
- clock speed of CPUs has stalled
- instruction-level parallelism per core has stalled
- number of cores is growing
- size of vector units is growing
(*) Quote and image taken from: H. Sutter, The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb's Journal 30(3), March 2005.
Manycore CPU – Intel MIC Architecture
Intel MIC Architecture: an Intel co-processor architecture
[Block diagram: many vector IA cores, each with a coherent cache, connected by an interprocessor network to memory and I/O interfaces, plus fixed function logic]
- many cores and many, many more threads
- standard IA programming and memory model
(source: Intel/K. Skaugen, SC'10 keynote presentation)
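What "standard IA programming and memory model" means in practice (my illustration, not from the keynote slide): ordinary shared-memory code, e.g. plain OpenMP as below, can be compiled for and run on the coprocessor essentially unchanged; whether one builds it natively for the card or uses offload pragmas is a tooling detail outside this sketch.

#include <omp.h>

// Plain OpenMP reduction -- the same source runs on a host Xeon or,
// recompiled, on a Xeon Phi coprocessor with its ~60 cores / 240 threads.
double dot(int n, const double* x, const double* y) {
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}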
Manycore CPU – Intel MIC Architecture (2)
[Figure 4: diagram of a Knights Corner core]
(source: "An Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors")
Excerpt: "The cores are in-order dual issue x86 processor cores which trace some history to the original Pentium design, but with the addition of 64-bit support, four hardware threads per core, power ..."
GPGPU – NVIDIA Fermi
(source: NVIDIA Fermi Whitepaper)
[Figure: Fermi die/block diagram; 16 SMs positioned around a common L2 cache, each SM strip containing scheduler and dispatch, execution units, and register file / L1 cache]
Excerpt: "A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers."
GPGPU – NVIDIA Fermi (2)
Third Generation Streaming Multiprocessor (SM)
[Figure: Fermi Streaming Multiprocessor (SM) -- instruction cache, warp schedulers and dispatch units, register file (32,768 x 32-bit), 32 CUDA cores (each with dispatch port, operand collector, FP unit, INT unit, result queue), SFUs, interconnect network, 64 KB shared memory / L1 cache, uniform cache]
Excerpt (source: NVIDIA Fermi Whitepaper): "The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient. 512 High Performance CUDA cores: Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA. In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard ..."
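To spell out the rounding point in the excerpt (notation mine, not the whitepaper's): with rn(.) denoting the IEEE round-to-nearest operation, a separate multiply-add rounds twice, while FMA rounds only once, so the intermediate product enters the addition at full precision.

  \text{MAD: } \mathrm{rn}\bigl(\mathrm{rn}(a \cdot b) + c\bigr), \qquad \text{FMA: } \mathrm{rn}(a \cdot b + c)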
GPGPU – NVIDIA Fermi (3)
General Purpose Graphics Processing Unit:
- 512 CUDA cores
- memory subsystem innovations
- improved double precision performance
- shared vs. global program memory behaviour
- new: L1 and L2 cache (768 KB)
- trend from GPU towards CPU?
Excerpt (source: NVIDIA Fermi Whitepaper): "NVIDIA Parallel DataCache with Configurable L1 and Unified L2 Cache: Working with hundreds of GPU computing applications from various industries, we learned that while Shared memory benefits many problems, it is not appropriate for all problems. Some algorithms map naturally to Shared memory, others require a cache, while others require a combination of both. The optimal memory hierarchy should offer the benefits of both Shared memory and cache, and allow the programmer a choice over its partitioning. [...] Adding a true cache hierarchy for load / store operations presented significant challenges. Traditional GPU architectures support a read-only load path for texture operations and a write-only export path for pixel data output. However, this approach is poorly suited to executing general purpose C or C++ thread programs that expect reads and writes to be ordered. As one example: spilling a register operand to memory and then reading it back creates a read after write hazard; if the read and write paths are separate, it may be necessary to explicitly flush the entire write / export path before it is safe to issue the read, and any caches on the read path would not be coherent with respect to the write data. The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and unified L2 cache that services all operations (load, store and texture). [...]"
Parallel Computing Paradigms
Not exactly sure what the hardware will look like ... (CPU-style, GPU-style, something new?)
However: massively parallel programming is required
- revival of vector computing: several/many FPUs performing the same operation
- hybrid/heterogeneous architectures: different kinds of cores; dedicated accelerator hardware
- different access to memory: cache and cache coherency; small amount of memory per core
Our concern in this course: data parallelism, i.e. vectorisation (and a look into GPU computing)