Hybrid Architectures: Why Should I Bother?


Hybrid Architectures: Why Should I Bother?

CSCS-FoMICS-USI Summer School on Computer Simulations in Science and Engineering
Michael Bader
July 8-19, 2013

The Simulation Pipeline

[Diagram: simulation pipeline from phenomenon/process via mathematical model (modelling), numerical algorithm (numerical treatment), simulation code (parallel implementation), and results to interpret (visualization) to statement/tool (embedding); validation feeds back across these stages.]

Parallel Computing: Faster, Bigger, More

Why parallel high performance computing?
- Response time: compute a problem in 1/p of the time
  - speed up engineering processes
  - real-time simulations (tsunami warning?)
- Problem size: compute a p-times bigger problem
  - simulation of large-/multi-scale phenomena
  - maximal problem size that fits into the machine
  - validation of smaller, operational models
- Throughput: compute p problems at once
  - case and parameter studies, statistical risk scenarios, etc. (hazard maps, data base for tsunami warning, ...)
  - massively distributed computing (e.g., SETI@home)

Part I
High Performance Computing in CSE: Past(?) and Present Trends

The Seven Dwarfs of HPC

Dwarfs = key algorithmic kernels in many scientific computing applications (P. Colella, LBNL, 2004):
1. dense linear algebra
2. sparse linear algebra
3. spectral methods
4. N-body methods
5. structured grids
6. unstructured grids
7. Monte Carlo

Tsunami & storm-surge simulation: usually PDE solvers on structured or unstructured meshes.
SWE: a simple shallow water solver on Cartesian grids (see the sketch below).
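To make the "structured grids" dwarf concrete, here is a minimal C++ sketch of one explicit update sweep over a Cartesian grid with a 5-point stencil. It is an illustrative kernel only, not the actual SWE teaching code; the function and parameter names are hypothetical.

    // Illustrative "structured grids" kernel (not the SWE teaching code):
    // one explicit 5-point-stencil sweep over an nx-by-ny Cartesian grid.
    #include <vector>

    void stencil_sweep(std::vector<double>& u_new, const std::vector<double>& u,
                       int nx, int ny, double alpha)
    {
        // update interior points; boundary values are assumed to be set elsewhere
        for (int j = 1; j < ny - 1; ++j) {
            for (int i = 1; i < nx - 1; ++i) {
                const int idx = j * nx + i;
                u_new[idx] = u[idx] + alpha * (u[idx - 1] + u[idx + 1]
                                             + u[idx - nx] + u[idx + nx]
                                             - 4.0 * u[idx]);
            }
        }
    }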

Computational Science Demands a New Paradigm

Computational simulation must meet three challenges to become a mature partner of theory and experiment (Post & Votta, 2005):
1. The performance challenge: exponential growth of performance, massively parallel architectures
2. The programming challenge: new (parallel) programming models
3. The prediction challenge: careful verification and validation of codes; towards reproducible simulation experiments

Four Horizons for Enhancing the Performance of Parallel Simulations Based on Partial Differential Equations (David Keyes, 2000)

1. Expanded number of processors: in 2000: 1,000 cores; in 2010: 200,000 cores
2. More efficient use of faster processors: PDE working sets, cache efficiency
3. More architecture-friendly algorithms: improve temporal/spatial locality (see the loop-blocking sketch below)
4. Algorithms delivering more science per flop: adaptivity (in space and time), higher-order methods, fast solvers
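A standard way to improve temporal and spatial locality is loop blocking (tiling). The following C++ sketch, with a hypothetical block size B, transposes a matrix block by block so that the working set of each step fits into cache; it is a generic illustration, not taken from the slides.

    // Loop blocking (cache blocking) to improve locality: blocked matrix transpose.
    #include <algorithm>
    #include <vector>

    void transpose_blocked(std::vector<double>& b, const std::vector<double>& a, int n)
    {
        const int B = 64;  // assumed block size; tune to the target cache level
        for (int jj = 0; jj < n; jj += B)
            for (int ii = 0; ii < n; ii += B)
                // work on one B-by-B block at a time so both a and b stay cache-resident
                for (int j = jj; j < std::min(jj + B, n); ++j)
                    for (int i = ii; i < std::min(ii + B, n); ++i)
                        b[i * n + j] = a[j * n + i];
    }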

Performance Development in Supercomputing

[Chart: Top500 performance development over time (source: www.top500.org)]

Top 500 (www.top500.org), June 2013

[Table: top entries of the Top500 list, June 2013]

Top 500 Spotlights: Tianhe-2 and Titan

Tianhe-2 / MilkyWay-2, Intel Xeon Phi (NUDT)
- 3.1 million cores(!): Intel Ivy Bridge and Xeon Phi
- Linpack benchmark: 33.8 PFlop/s
- 17 MW power(!!)
- Knights Corner / Intel Xeon Phi / Intel MIC as accelerator: 61 cores, roughly 1.1-1.3 GHz

Titan, Cray XK7 with NVIDIA K20x (ORNL)
- 18,688 compute nodes; 300,000 Opteron cores
- 18,688 NVIDIA Tesla K20x GPUs
- Linpack benchmark: 17.6 PFlop/s
- 8.2 MW power

Top 500 Spotlights: Sequoia and K Computer

Sequoia, IBM BlueGene/Q (LLNL)
- 98,304 compute nodes; 1.6 million cores
- Linpack benchmark: 17.1 PFlop/s
- 8 MW power

K Computer, SPARC64 (RIKEN, Kobe)
- 88,128 processors (SPARC64 VIIIfx, 2.0 GHz, 8-core CPU); 705,024 cores
- Linpack benchmark: 10.51 PFlop/s
- 12 MW power

Performance Development in Supercomputing

[Chart: Top500 performance development over time (source: www.top500.org)]

International Exascale Software Project Roadmap

Towards an Exa-Flop/s platform in 2018 (www.exascale.org):
1. Technology trends: concurrency, reliability, power consumption, ...
   Blueprint of an exascale system: 10-billion-way concurrency, 100 million to 1 billion cores, 10-to-100-way concurrency per core, hundreds of cores per die, ...
2. Science trends: climate, high-energy physics, nuclear physics, fusion energy sciences, materials science and chemistry, ...
3. X-stack (software stack for exascale): energy, resiliency, heterogeneity, I/O and memory
4. Politico-economic trends: exascale systems run by government labs, used by CSE scientists

Exascale Roadmap: Aggressively Designed Strawman Architecture

Level       What                           Perform.     Power     RAM
FPU         FPU, regs, instr. memory       1.5 Gflops   30 mW
Core        4 FPUs, L1                     6 Gflops     141 mW
Proc. chip  742 cores, L2/L3, interconn.   4.5 Tflops   214 W
Node        proc. chip, DRAM               4.5 Tflops   230 W     16 GB
Group       12 proc. chips, routers        54 Tflops    3.5 kW    192 GB
Rack        32 groups                      1.7 Pflops   116 kW    6.1 TB
System      583 racks                      1 Eflops     67.7 MW   3.6 PB

Approx. 285,000 cores per rack; 166 million cores in total.
Source: ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems
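The quoted core counts follow directly from the table rows: 742 cores per chip x 12 chips per group x 32 groups per rack = 284,928, roughly 285,000 cores per rack; and 284,928 x 583 racks = 166,113,024, roughly 166 million cores in total.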

Exascale Roadmap: Should You Bother?

Your department's compute cluster in 5 years? A Petaflop system!
- one rack of the Exaflop system, using the same/similar hardware
- extrapolated example machine:
  - peak performance: 1.7 PFlop/s
  - 6 TB RAM, 60 GB cache memory
  - total concurrency: 1.1 x 10^6
  - number of cores: 280,000
  - number of chips: 384

Source: ExaScale Software Study: Software Challenges in Extreme Scale Systems

Your Department's PetaFlop/s Cluster in 5 Years?

Tianhe-1A (Tianjin, China; Top500 #10)
- 14,336 Xeon X5670 CPUs
- 7,168 NVIDIA Tesla M2050 GPUs
- Linpack benchmark: 2.6 PFlop/s
- 4 MW power

Stampede (Intel, Top500 #6)
- 102,400 cores (incl. Xeon Phi: MIC, "many integrated cores")
- Linpack benchmark: 5 PFlop/s
- Knights Corner / Intel Xeon Phi / Intel MIC as accelerator: 61 cores, roughly 1.1-1.3 GHz
- wider vector FP units: 64 bytes (i.e., 16 floats or 8 doubles; see the sketch below)
- 4.5 MW power
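To make the 64-byte vector units concrete: one vector instruction can process 8 doubles (or 16 floats) at once, provided the loop is vectorized. A minimal C++ sketch (names hypothetical; the `#pragma omp simd` hint assumes an OpenMP-4.0-capable compiler, otherwise a vendor-specific pragma or plain auto-vectorization serves the same purpose):

    // daxpy-style loop; on a 512-bit (64-byte) vector unit each vector
    // operation covers 8 doubles, so the loop advances 8 iterations per instruction.
    void daxpy(int n, double a, const double* x, double* y)
    {
        #pragma omp simd   // hint to the compiler that the loop is safe to vectorize
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }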

"Free Lunch is Over" (*)

... actually already over for quite some time!
Speedup of your software can only come from parallelism:
- clock speed of CPUs has stalled
- instruction-level parallelism per core has stalled
- number of cores is growing
- size of vector units is growing

(*) Quote and image taken from: H. Sutter, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software", Dr. Dobb's Journal 30(3), March 2005.

Manycore CPU: Intel MIC Architecture

Intel MIC Architecture: An Intel Co-Processor Architecture

[Block diagram: many vector IA cores, each with a coherent cache, connected by an interprocessor network to fixed-function logic and to memory and I/O interfaces.]

- many cores and many, many more threads
- standard IA programming and memory model

(source: Intel/K. Skaugen, SC'10 keynote presentation)
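Since the coprocessor keeps the standard IA programming model, existing C/C++ loops can be offloaded with compiler directives. A minimal sketch, assuming the Intel compiler's Language Extensions for Offload; the function and variable names are hypothetical:

    // Offload a parallel loop to the Xeon Phi coprocessor (Intel compiler only);
    // 'data' is copied to the device and the results are copied back.
    void scale_on_mic(double* data, int n, double factor)
    {
        #pragma offload target(mic) inout(data : length(n))
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            data[i] *= factor;
    }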

Manycore CPU: Intel MIC Architecture (2)

Diagram of a Knights Corner core:
[Figure 4: Knights Corner core (source: "An Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors")]

"The cores are in-order dual issue x86 processor cores which trace some history to the original Pentium design, but with the addition of 64-bit support, four hardware threads per core, power [...]"

GPGPU: NVIDIA Fermi

[Figure: Fermi GPU overview (source: NVIDIA Fermi Whitepaper)]

From the whitepaper: "[...] A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers. Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache)."

GPGPU: NVIDIA Fermi (2)

Third Generation Streaming Multiprocessor (SM)
[Figure: Fermi Streaming Multiprocessor, with instruction cache, warp schedulers and dispatch units, a 32,768 x 32-bit register file, 32 CUDA cores (each with dispatch port, operand collector, FP unit, INT unit, result queue), SFUs, interconnect network, 64 KB shared memory / L1 cache, and uniform cache (source: NVIDIA Fermi Whitepaper)]

From the whitepaper: "The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.
512 High Performance CUDA cores: Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard [...]"
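The single-rounding property of FMA can also be demonstrated on the host side. A small C++ sketch (illustrative, not from the slides) comparing std::fma with a separate multiply and add:

    // FMA rounds once; a separate multiply and add rounds twice, so the two
    // results can differ in the last bits.
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const double a = 1.0 + 1e-8, b = 1.0 - 1e-8, c = -1.0;
        const double mad   = a * b + c;         // two roundings (unless the compiler contracts to FMA)
        const double fused = std::fma(a, b, c); // single rounding (IEEE 754-2008)
        std::printf("mul+add: %.17e\nfma:     %.17e\n", mad, fused);
        return 0;
    }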

GPGPU: NVIDIA Fermi (3)

General Purpose Graphics Processing Unit: 512 CUDA cores
Memory subsystem innovations:
- improved double precision performance
- shared vs. global program memory behavior
- new: L1 and L2 cache (768 KB)
- trend from GPU towards CPU?

From the whitepaper ("NVIDIA Parallel DataCache with Configurable L1 and Unified L2 Cache"): "Working with hundreds of GPU computing applications from various industries, we learned that while Shared memory benefits many problems, it is not appropriate for all problems. Some algorithms map naturally to Shared memory, others require a cache, while others require a combination of both. The optimal memory hierarchy should offer the benefits of both Shared memory and cache, and allow the programmer a choice over its partitioning. The Fermi memory hierarchy adapts to both types of [...]
Adding a true cache hierarchy for load/store operations presented significant challenges. Traditional GPU architectures support a read-only load path for texture operations and a write-only export path for pixel data output. However, this approach is poorly suited to executing general purpose C or C++ thread programs that expect reads and writes to be ordered. As one example: spilling a register operand to memory and then reading it back creates a read-after-write hazard; if the read and write paths are separate, it may be necessary to explicitly flush the entire write/export path before it is safe to issue the read, and any caches on the read path would not be coherent with respect to the write data. The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and a unified L2 cache that services all operations (load, store and texture). The per-SM L1 cache is [...]"

(source: NVIDIA Fermi Whitepaper)

Parallel Computing Paradigms

Not exactly sure what the hardware will look like ... (CPU-style, GPU-style, something new?)

However: massively parallel programming is required
- revival of vector computing: several/many FPUs performing the same operation
- hybrid/heterogeneous architectures: different kinds of cores; dedicated accelerator hardware
- different access to memory: cache and cache coherency; small amount of memory per core

Our concern in this course: data parallelism and vectorisation (and a look into GPU computing), as in the sketch below.
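As a closing illustration of data parallelism on current hardware, here is a minimal C++ sketch (hypothetical names, assuming an OpenMP-4.0 compiler) that exploits both levels at once: threads across the cores and vector lanes within each core.

    // Two levels of data parallelism for a Cartesian-grid update:
    // rows are distributed across threads, columns across vector lanes.
    void axpy_grid(double* u, const double* v, double a, int nx, int ny)
    {
        #pragma omp parallel for          // thread-level parallelism across rows
        for (int j = 0; j < ny; ++j) {
            #pragma omp simd              // vectorize the inner loop
            for (int i = 0; i < nx; ++i)
                u[j * nx + i] += a * v[j * nx + i];
        }
    }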