Tutorial: Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers. Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld (dan/lars/bbarth/milfeld@tacc.utexas.edu). XSEDE'12, July 16, 2012
Agenda Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers 08:00-08:45 Stampede and MIC Overview 08:45-10:00 Vectorization 10:00-10:15 Break 10:15-11:15 OpenMP and Offloading 11:15-12:00 Lab (on a Sandy Bridge machine)
Stampede and MIC Overview The TACC Stampede System Overview of Intel's MIC ecosystem MIC: Many Integrated Cores Basic Design Roadmap Coprocessor vs. Accelerator Programming Models OpenMP, MPI MKL, TBB, and Cilk Plus Roadmap & Preparation (MIC is pronounced "Mike")
Stampede A new NSF-funded XSEDE resource at TACC Funded two components of a new system: A two-petaflop, 100,000+ core Xeon-based Dell cluster to be the new workhorse of the NSF open science community (>3x Ranger) Eight additional petaflops of Intel Many-Integrated-Core (MIC) processors to provide a revolutionary, innovative capability: a new generation of supercomputing. Change the power/performance curves of supercomputing without changing the programming model (terribly much).
Stampede - Headlines Up to 10PF peak performance in initial system (early 2013) 100,000+ conventional Intel processor cores Almost 500,000 total cores with Intel MIC Upgrade with future Knights processors in 2015 14PB+ disk 272TB+ RAM 56Gb FDR InfiniBand interconnect Nearly 200 racks of compute hardware (10,000 sq ft) >SIX MEGAWATTS total power
The Stampede Project A Complete Advanced Computing Ecosystem: HPC Compute, storage, interconnect Visualization subsystem (144 high-end GPUs) Large memory support (16 1TB nodes) Integration with archive, and (coming soon) TACC global work filesystem and other data systems People, training, and documentation to support computational science Hosted at an expanded building at TACC; massive power upgrades (12MW new total datacenter power) Thermal energy storage to reduce operating costs
Stampede Datacenter February 20th
Stampede Datacenter March 22nd
Stampede Datacenter May 16th
Stampede Datacenter June 20th
Facilities at TACC Stampede Footprint Stampede: 8,000 ft² 10 PF 6.5 MW Ranger: 3,000 ft² 0.6 PF 3 MW Capabilities: 20x Footprint: 2x Stampede InfiniBand (fat tree): ~75 miles of InfiniBand cables
Stampede and MIC Overview The TACC Stampede System Overview of Intel's MIC ecosystem MIC: Many Integrated Cores Basic Design Roadmap Coprocessor vs. Accelerator Programming Models OpenMP, MPI MKL, TBB, and Cilk Plus Roadmap & Preparation
An Inflection Point (or Two) in High Performance Computing Relatively boring, but rapidly improving, architectures for the last 15 years. Performance rising much faster than Moore's Law But power rising faster still And concurrency rising faster than that, with serial performance decreasing. Something had to give
Consider this Example Homogeneous vs. Heterogeneous Acquisition costs and power consumption Homogeneous system: K computer in Japan 10 PF unaccelerated Rumored $1.2 billion acquisition cost Large power bill and footprint Heterogeneous system: Stampede at TACC Near 10PF (2 SNB + up to 8 MIC) Modest footprint: 8,000 ft², 6.5MW $27.5 million
Accelerated Computing for Exascale Exascale systems, predicted for 2018, would have required 500MW on the old curves. Something new was clearly needed. The accelerated computing movement was reborn (this happens periodically, starting with 387 math coprocessors).
Key aspects of acceleration We have lots of transistors Moore's law is holding; this isn't necessarily the problem. We don't really need lower power per transistor, we need lower power per *operation*. How to do this? NVIDIA GPUs AMD Fusion FPGAs ARM cores
Intel's MIC approach Since the days of RISC vs. CISC, Intel has mastered the art of figuring out what is important about a new processing technology, and saying "why can't we do this in x86?" The Intel Many Integrated Core (MIC) architecture is about a large die, simpler circuits, and much more parallelism, in the x86 line.
What is MIC? MIC is Intel's solution to try to change the power curves Stage 1: Knights Ferry (KNF) SDP: Software Development Platform Intel has granted early access to KNF to several dozen institutions Training has started TACC has had access to KNF hardware since early March 2011 and has hosted a system since mid-March 2011 TACC continues to evaluate the hardware and programming models Stage 2: Knights Corner (KNC) -- now Xeon Phi Early silicon now at TACC for Stampede development First product expected early 2013 MIC ancestors: 80-core Terascale research program Larrabee many-core visual computing Single-chip Cloud Computer (SCC)
What is MIC? Basic Design Ideas Leverage x86 architecture (CPU with many cores) Leverage (Intel's) existing x86 programming models Dedicate much of the silicon to floating point ops., keep some cache(s) Keep cache-coherency protocol (was debated at the beginning) Increase floating-point throughput per core Implement as a separate device Pre-product Software Development Platform: Knights Ferry x86 cores that are simpler, but allow for more compute throughput Strip expensive features (out-of-order execution, branch prediction, etc.) Widen SIMD registers for more throughput Fast (GDDR5) memory on card GPU form factor: size, power, PCIe connection
MIC Architecture Many cores on the die L1 and L2 cache Bidirectional ring network Memory and PCIe connection MIC (KNF) architecture block diagram Knights Ferry SDP Up to 32 cores 1-2 GB of GDDR5 RAM 512-bit wide SIMD registers L1/L2 caches Multiple threads (up to 4) per core Slow operation in double precision Knights Corner (first product) 50+ cores Increased amount of RAM Details are under NDA Double precision half the speed of single precision (canonical ratio) 22 nm technology
MIC vs. Xeon

                      MIC (KNF)   Xeon
  Number of cores     30-32       4-8
  Clock speed (GHz)   ~1          ~3
  SIMD width (bits)   512         128-256

The Xeon chip is designed to work well for all workloads; MIC is designed to work well for number crunching. A single MIC core is slower than a Xeon core, but the number of cores on the card is much higher. What is Vectorization? Instruction set: Streaming SIMD Extensions (SSE) Single Instruction Multiple Data (SIMD) Multiple data elements are loaded into SSE registers One operation produces 2, 4, or 8 results in double precision Effective vectorization (by the compiler) is crucial for performance
What we at TACC like about MIC Intel's MIC is based on x86 technology x86 cores w/ caches and cache coherency SIMD instruction set Programming for MIC is similar to programming for CPUs Familiar languages: C/C++ and Fortran Familiar parallel programming models: OpenMP & MPI MPI on host and on the coprocessor Any code can run on MIC, not just kernels Optimizing for MIC is similar to optimizing for CPUs Optimize once, run anywhere Our early MIC porting efforts for codes in the field are frequently doubling performance on Sandy Bridge.
Coprocessor vs. Accelerator GPUs are often called Accelerators Intel calls MIC a Coprocessor Similarities Large die device, all silicon dedicated to compute Fast GDDR5 memory Connected through PCIe bus, two physical address spaces Number of MIC cores similar to number of GPU Streaming Multiprocessors
Coprocessor vs. Accelerator: Differences Architecture: x86 vs. streaming processors Memory/caches: coherent caches vs. shared memory and caches HPC programming model: extensions to C++/C/Fortran vs. CUDA/OpenCL (both offer OpenCL support) Threading: OpenMP and multithreading vs. threads in hardware MPI: on host and/or MIC vs. on host only Programming details: offloaded regions vs. kernels Support for any code (serial, scripting, etc.): yes vs. no Native mode: MIC-compiled code may be run as an independent process on the coprocessor.
The MIC Promise Competitive performance: peak and sustained Familiar programming model HPC: C/C++, Fortran, and Intel's TBB Parallel programming: OpenMP, MPI, Pthreads, Cilk Plus, OpenCL Intel's MKL library (later: third-party libraries) Serial and scripting, etc. (anything a CPU core can do) Easy transition for OpenMP code Pragmas/directives added to offload OMP parallel regions onto MIC See examples later Support for MPI MPI tasks on host, communication through offloading MPI tasks on MIC, communication through MPI
Stampede and MIC Overview The TACC Stampede System Overview of Intel's MIC ecosystem MIC: Many Integrated Cores Basic Design Roadmap Coprocessor vs. Accelerator Programming Models OpenMP, MPI MKL, TBB, and Cilk Plus Roadmap & Preparation
Adapting and Optimizing Code for MIC Today (any resource): Hybrid programming, MPI + threads (OpenMP, etc.); optimize for higher thread performance; minimize serial sections and synchronization Today/KNF/early KNC: Test different execution models (symmetric vs. offloaded); optimize for L1/L2 caches Stampede MIC: Test performance on MIC; optimize for the specific architecture 2013+: Start production
Levels of Parallel Execution Multi-core CPUs, coprocessors (MIC), and accelerators (GPUs) exploit hardware parallelism on 2 levels 1. Processing unit (CPU, MIC, GPU) with multiple functional units CPU/MIC: cores GPU: streaming multiprocessor/SIMD engine Functional units operate independently 2. Each functional unit executes in lock-step (in parallel) CPU/MIC: vector processing unit utilizes SIMD lanes GPU: warp/wavefront executes on CUDA or Stream cores
SIMD vs. CUDA/Stream Cores Each functional unit executes in lock-step (in parallel) CPU/MIC: vector processing unit utilizes SIMD lanes GPU: warp/wavefront executing on CUDA or Stream cores The GPU programming model exposes low-level parallelism All CUDA/Stream cores must be utilized for good performance; this is stating the obvious, and it is transparent to the programmer The programmer is forced to manually set up blocks and grids of threads Steep learning curve at the beginning, but it leads naturally to the (data-parallel) utilization of all cores The CPU programming model leaves vectorization to the compiler Programmers may be unaware of the parallel hardware Failing to write vectorizable code will lead to a significant performance penalty of 87.5% on MIC (50-75% on Xeon)
Early MIC Programming Experiences at TACC Codes port easily Minutes to days depending mostly on library dependencies Performance requires real work While the silicon continues to evolve Getting codes to run *at all* is almost too easy; really need to put in the effort to get what you expect Scalability is pretty good Multiple threads per core *really important* Getting your code to vectorize *really important*
Stampede Programming: Five Possible Models Host-only Offload Symmetric Reverse Offload MIC Native
Programming Intel MIC-based Systems MPI+Offload MPI ranks on Intel Xeon processors (only) All messages into/out of processors Offload models used to accelerate MPI ranks Intel Cilk Plus, OpenMP, Intel Threading Building Blocks, Pthreads within Intel MIC Programmed as a homogeneous network of hybrid nodes [Diagram: MPI messages travel over the network between Xeon hosts; each host offloads data to its MIC card]
Programming Intel MIC-based Systems Many-core Hosted MPI ranks on Intel MIC (only) All messages into/out of Intel MIC Intel Cilk Plus, OpenMP, Intel Threading Building Blocks, Pthreads used directly within MPI processes Programmed as a homogeneous network of many-core CPUs [Diagram: MPI messages travel over the network directly between MIC cards; the Xeon hosts are bypassed]
Programming Intel MIC-based Systems Symmetric MPI ranks on Intel MIC and Intel Xeon processors Messages to/from any core Intel Cilk Plus, OpenMP, Intel Threading Building Blocks, Pthreads used directly within MPI processes Programmed as a heterogeneous network of homogeneous nodes [Diagram: MPI messages travel between all Xeon hosts and all MIC cards]
MIC Programming with OpenMP A MIC-specific pragma precedes the OpenMP pragma Fortran: !dir$ omp offload target(mic) <...> C: #pragma offload target(mic) <...> All data transfer is handled by the compiler (optional keywords provide user control) The Offload and OpenMP section will go over examples for: Automatic data management Manual data management I/O from within an offloaded region Data can stream through the MIC; no need to leave MIC to fetch new data Also very helpful when debugging (print statements) Offloading a subroutine and using MKL
Programming Models Ready to use on day one! TBBs will be available to C++ programmers MKL will be available Automatic offloading by compiler for some MKL features (probably the most transparent mode) Cilk Plus Useful for task-parallel programming (add-on to OpenMP) May become available for Fortran users as well OpenMP TACC expects that OpenMP will be the most interesting programming model for our HPC users
Stampede and MIC Overview The TACC Stampede System Overview of Intel's MIC ecosystem MIC: Many Integrated Cores Basic Design Roadmap Coprocessor vs. Accelerator Programming Models OpenMP, MPI MKL, TBB, and Cilk Plus Roadmap & Preparation
General expectation: Roadmap What comes next? How to get prepared? Many of the upcoming large systems will be accelerated Stampede's base Sandy Bridge cluster is being constructed now Intel's MIC coprocessor is on track for 2013 Expect that (large) systems with MIC coprocessors will become available in 2013 at/after product launch MPI + OpenMP is the main HPC programming model If you are not using TBBs If you are not spending all your time in libraries (MKL, etc.) Early user access sometime after SC12
Roadmap What comes next? How to get prepared? Many HPC applications are pure-MPI codes Start thinking about upgrading to a hybrid scheme Adding OpenMP is a larger effort than adding MIC directives Special MIC/OpenMP considerations Many more threads will be needed: 50+ cores (production KNC) means 50+/100+/200+ threads Good OpenMP scaling will be (much) more important Vectorize, vectorize, vectorize There is no unified last-level cache (LLC) on MIC (on CPUs the LLC is shared among cores and is several MB in size) Stay tuned The coming years will be exciting