Tutorial. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers

Tutorial: Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld
dan/lars/bbarth/milfeld@tacc.utexas.edu
XSEDE 12, July 16, 2012

Agenda - Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers
- 08:00 - 08:45  Stampede and MIC Overview
- 08:45 - 10:00  Vectorization
- 10:00 - 10:15  Break
- 10:15 - 11:15  OpenMP and Offloading
- 11:15 - 12:00  Lab (on a Sandy Bridge machine)

Stampede and MIC Overview
- The TACC Stampede System
- Overview of Intel's MIC (Many Integrated Cores) ecosystem
  - Basic design
  - Roadmap
  - Coprocessor vs. Accelerator
  - Programming models: OpenMP, MPI, MKL, TBB, and Cilk Plus
- Roadmap & Preparation
(MIC is pronounced "Mike")

Stampede
- A new NSF-funded XSEDE resource at TACC
- Funded two components of a new system:
  - A two-petaflop, 100,000+ core Xeon-based Dell cluster to be the new workhorse of the NSF open science community (>3x Ranger)
  - Eight additional petaflops of Intel Many Integrated Core (MIC) processors to provide a revolutionary capability: a new generation of supercomputing
- Goal: change the power/performance curves of supercomputing without changing the programming model (terribly much)

Stampede - Headlines
- Up to 10 PF peak performance in the initial system (early 2013)
  - 100,000+ conventional Intel processor cores
  - Almost 500,000 total cores with Intel MIC
- Upgrade with future Knights processors in 2015
- 14 PB+ disk, 272 TB+ RAM
- 56 Gb/s FDR InfiniBand interconnect
- Nearly 200 racks of compute hardware (10,000 sq ft)
- More than six megawatts of total power

The Stampede Project
A complete advanced computing ecosystem:
- HPC: compute, storage, interconnect
- Visualization subsystem (144 high-end GPUs)
- Large-memory support (16 nodes with 1 TB each)
- Integration with archive and (coming soon) the TACC global work filesystem and other data systems
- People, training, and documentation to support computational science
- Hosted in an expanded building at TACC with massive power upgrades (12 MW new total datacenter power)
- Thermal energy storage to reduce operating costs

Stampede datacenter construction photos: February 20th, March 22nd, May 16th, June 20th.

Facilities at TACC - Stampede Footprint
- Stampede: 8,000 sq ft, 10 PF, 6.5 MW
- Ranger: 3,000 sq ft, 0.6 PF, 3 MW
- Capabilities: 20x; Footprint: 2x
- Stampede InfiniBand (fat tree): ~75 miles of InfiniBand cables

An Inflection Point (or Two) in High Performance Computing
- Relatively boring, but rapidly improving, architectures for the last 15 years
- Performance rising much faster than Moore's Law
- But power rising faster, and concurrency rising faster than that, with serial performance decreasing
- Something had to give

Consider This Example: Homogeneous vs. Heterogeneous
Acquisition cost and power consumption:
- Homogeneous system: the K computer in Japan
  - 10 PF, unaccelerated
  - Rumored $1.2 billion acquisition cost
  - Large power bill and footprint
- Heterogeneous system: Stampede at TACC
  - Nearly 10 PF (2 PF Sandy Bridge + up to 8 PF MIC)
  - Modest footprint: 8,000 sq ft, 6.5 MW
  - $27.5 million

Accelerated Computing for Exascale
- Exascale systems, predicted for 2018, would have required 500 MW on the old curves
- Something new was clearly needed
- The accelerated computing movement was reborn (this happens periodically, starting with 387 math coprocessors)

Key Aspects of Acceleration
- We have lots of transistors: Moore's Law is holding, so this isn't necessarily the problem
- We don't really need lower power per transistor; we need lower power per *operation*
- How to do this? NVIDIA GPUs, AMD Fusion, FPGAs, ARM cores

Intel's MIC Approach
- Since the days of RISC vs. CISC, Intel has mastered the art of figuring out what is important about a new processing technology and asking, "why can't we do this in x86?"
- The Intel Many Integrated Core (MIC) architecture is about a large die, simpler circuits, and much more parallelism, in the x86 line

What is MIC?
- MIC is Intel's solution to try to change the power curves
- Stage 1: Knights Ferry (KNF)
  - SDP: Software Development Platform
  - Intel has granted early access to KNF to several dozen institutions; training has started
  - TACC has had access to KNF hardware since early March 2011 and has hosted a system since mid-March 2011
  - TACC continues to evaluate the hardware and programming models
- Stage 2: Knights Corner (KNC) - now Xeon Phi
  - Early silicon now at TACC for Stampede development
  - First product expected early 2013
- MIC ancestors: the 80-core Terascale research program, Larrabee many-core visual computing, and the Single-chip Cloud Computer (SCC)

What is MIC? Basic Design Ideas
- Leverage the x86 architecture (a CPU with many cores)
- Leverage (Intel's) existing x86 programming models
- Dedicate much of the silicon to floating-point operations, keep some cache(s)
- Keep the cache-coherency protocol (debated at the beginning)
- Increase floating-point throughput per core
- Implement as a separate device
- Pre-product Software Development Platform: Knights Ferry
  - x86 cores that are simpler, but allow for more compute throughput
  - Strip expensive features (out-of-order execution, branch prediction, etc.)
  - Widen SIMD registers for more throughput
  - Fast (GDDR5) memory on the card
  - GPU form factor: size, power, PCIe connection

MIC Architecture
- Many cores on the die, each with L1 and L2 cache
- Bidirectional ring network
- Memory and PCIe connection
- MIC (KNF) architecture block diagram (shown on the slide)
- Knights Ferry SDP
  - Up to 32 cores, 1-2 GB of GDDR5 RAM
  - 512-bit wide SIMD registers, L1/L2 caches
  - Multiple threads (up to 4) per core
  - Slow operation in double precision
- Knights Corner (first product)
  - 50+ cores, increased amount of RAM (details are under NDA)
  - Double precision at half the speed of single precision (the canonical ratio)
  - 22 nm technology

MIC vs. Xeon
                        MIC (KNF)    Xeon
  Number of cores       30-32        4-8
  Clock speed (GHz)     1            ~3
  SIMD width (bits)     512          128-256
- The Xeon chip is designed to work well for all workloads; MIC is designed to work well for number crunching
- A single MIC core is slower than a Xeon core, but the number of cores on the card is much higher
What is Vectorization?
- Instruction set: Streaming SIMD Extensions (SSE); Single Instruction Multiple Data (SIMD)
- Multiple data elements are loaded into SSE registers
- One operation produces 2, 4, or 8 results in double precision
- Effective vectorization (by the compiler) is crucial for performance; a sketch of a vectorizable loop follows below
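To make the vectorization point concrete, here is a minimal, illustrative C sketch (not taken from the tutorial slides); the array names, sizes, and loop body are arbitrary, and the compiler-report flags mentioned in the comments vary by compiler version.

```c
/* Illustrative example: a simple loop a vectorizing compiler can turn
 * into SIMD instructions.  With 512-bit SIMD registers on MIC, one
 * vector instruction updates 8 double-precision elements at a time
 * (2 with SSE, 4 with AVX on Xeon). */
#include <stdio.h>

#define N 1024

int main(void)
{
    double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {   /* initialization */
        b[i] = 1.0 * i;
        c[i] = 2.0 * i;
    }

    /* Unit-stride, independent iterations: a good candidate for
     * compiler auto-vectorization (check the compiler's vectorization
     * report, e.g. -vec-report / -qopt-report, depending on version). */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```

With 512-bit registers on MIC, each vector instruction in the second loop can update 8 double-precision elements; on a Xeon with SSE or AVX it updates 2 or 4, which matches the "2, 4, or 8 results" figure above.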

What We at TACC Like About MIC
- Intel's MIC is based on x86 technology
  - x86 cores with caches and cache coherency
  - SIMD instruction set
- Programming for MIC is similar to programming for CPUs
  - Familiar languages: C/C++ and Fortran
  - Familiar parallel programming models: OpenMP & MPI
  - MPI on the host and on the coprocessor
  - Any code can run on MIC, not just kernels
- Optimizing for MIC is similar to optimizing for CPUs
  - Optimize once, run anywhere
  - Our early MIC porting efforts for codes in the field are frequently doubling performance on Sandy Bridge

Coprocessor vs. Accelerator
- GPUs are often called accelerators; Intel calls MIC a coprocessor
- Similarities
  - Large-die device, all silicon dedicated to compute
  - Fast GDDR5 memory
  - Connected through the PCIe bus; two physical address spaces
  - The number of MIC cores is similar to the number of GPU streaming multiprocessors

Coprocessor vs. Accelerator - Differences (MIC vs. GPU)
- Architecture: x86 vs. streaming processors; coherent caches vs. shared memory and caches
- HPC programming model: extensions to C++/C/Fortran vs. CUDA/OpenCL; OpenCL support on both
- Threading/MPI: OpenMP and multithreading vs. threads in hardware; MPI on host and/or MIC vs. MPI on host only
- Programming details: offloaded regions vs. kernels
- Support for any code (serial, scripting, etc.): yes vs. no
- Native mode: MIC-compiled code may be run as an independent process on the coprocessor

The MIC Promise
- Competitive performance: peak and sustained
- Familiar programming model
  - HPC: C/C++, Fortran, and Intel's TBB
  - Parallel programming: OpenMP, MPI, Pthreads, Cilk Plus, OpenCL
  - Intel's MKL library (later: third-party libraries)
  - Serial and scripting, etc. (anything a CPU core can do)
- Easy transition for OpenMP code
  - Pragmas/directives added to offload OpenMP parallel regions onto the MIC (see examples later)
- Support for MPI
  - MPI tasks on the host, communication through offloading
  - MPI tasks on the MIC, communication through MPI

Adapting and Optimizing Code for MIC
1. Hybrid programming: MPI + threads (OpenMP, etc.); optimize for higher thread performance; minimize serial sections and synchronization (today, on any resource; a minimal hybrid skeleton is sketched below)
2. Test different execution models (symmetric vs. offloaded); optimize for L1/L2 caches (today / KNF / early KNC)
3. Test performance on MIC; optimize for the specific architecture (Stampede MIC)
4. Start production (2013+)
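As a concrete starting point for step 1, here is a minimal hybrid MPI + OpenMP skeleton; it is an illustrative sketch only (the work decomposition and problem size are made up), not code from the tutorial.

```c
/* Hybrid skeleton: a few MPI ranks per node, OpenMP threads inside each
 * rank.  Keep serial sections and synchronization minimal. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request thread support suitable for OpenMP regions between MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0, global_sum = 0.0;

    /* Threaded compute section: iterations are independent and split
     * across ranks, then across threads within each rank. */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = rank; i < 1000000; i += nranks)
        local_sum += 1.0 / (1.0 + (double)i);

    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f (ranks = %d, threads/rank = %d)\n",
               global_sum, nranks, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```

The same binary can then be run with different rank/thread mixes (for example, one rank per node with many threads, or one rank per socket), without changing the source.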

Levels of Parallel Execution
Multi-core CPUs, coprocessors (MIC), and accelerators (GPUs) exploit hardware parallelism on two levels:
1. A processing unit (CPU, MIC, GPU) with multiple functional units
   - CPU/MIC: cores; GPU: streaming multiprocessors / SIMD engines
   - Functional units operate independently
2. Each functional unit executes in lock-step (in parallel)
   - CPU/MIC: the vector processing unit utilizes SIMD lanes
   - GPU: a warp/wavefront executes on CUDA or stream cores

SIMD vs. CUDA/Stream Cores
- Each functional unit executes in lock-step (in parallel)
  - CPU/MIC: the vector processing unit utilizes SIMD lanes; GPU: a warp/wavefront executes on CUDA or stream cores
- The GPU programming model exposes low-level parallelism
  - All CUDA/stream cores must be utilized for good performance; this is stating the obvious, and it is transparent to the programmer
  - The programmer is forced to manually set up blocks and grids of threads
  - Steep learning curve at the beginning, but it leads naturally to the (data-parallel) utilization of all cores
- The CPU programming model leaves vectorization to the compiler
  - Programmers may be unaware of the parallel hardware
  - Failing to write vectorizable code leads to a significant performance penalty: 87.5% on MIC (only 1 of 8 double-precision SIMD lanes used) and 50-75% on Xeon (1 of 2-4 lanes)

Early MIC Programming Experiences at TACC
- Codes port easily: minutes to days, depending mostly on library dependencies
- Performance requires real work, while the silicon continues to evolve
  - Getting codes to run *at all* is almost too easy; you really need to put in the effort to get what you expect
- Scalability is pretty good
  - Multiple threads per core are *really important*
  - Getting your code to vectorize is *really important*

Stampede Programming - Five Possible Models: Host-only, Offload, Symmetric, Reverse Offload, MIC Native

Programming Intel MIC-based Systems: MPI+Offload
- MPI ranks on Intel Xeon processors (only)
- All messages into/out of the processors
- Offload models used to accelerate MPI ranks
- Intel Cilk Plus, OpenMP, Intel Threading Building Blocks, and Pthreads used within the Intel MIC
- Programmed as a homogeneous network of hybrid nodes (diagram: Xeon hosts exchange MPI messages over the network, each offloading data and work to its MIC)

Programming Intel MIC-based Systems: Many-core Hosted
- MPI ranks on Intel MIC (only)
- All messages into/out of the Intel MIC
- Intel Cilk Plus, OpenMP, Intel Threading Building Blocks, and Pthreads used directly within MPI processes
- Programmed as a homogeneous network of many-core CPUs (diagram: MIC cards exchange MPI messages over the network through their Xeon hosts)

Programming Intel MIC-based Systems: Symmetric
- MPI ranks on both Intel MIC and Intel Xeon processors
- Messages to/from any core
- Intel Cilk Plus, OpenMP, Intel Threading Building Blocks, and Pthreads used directly within MPI processes
- Programmed as a heterogeneous network of homogeneous nodes (diagram: both the Xeon host and the MIC on each node send and receive MPI messages over the network)

MIC Programming with OpenMP
- A MIC-specific pragma precedes the OpenMP pragma
  - Fortran: !dir$ omp offload target(mic) <...>
  - C: #pragma offload target(mic) <...>
- All data transfer is handled by the compiler (optional keywords provide user control)
- The Offload and OpenMP section will go over examples for:
  - Automatic data management
  - Manual data management
  - I/O from within an offloaded region
    - Data can stream through the MIC; no need to leave the MIC to fetch new data
    - Also very helpful when debugging (print statements)
  - Offloading a subroutine and using MKL
- A small sketch of the C syntax follows below
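The following is a minimal sketch of the C offload syntax described above, assuming the Intel compiler's offload pragmas; the array names and sizes are illustrative, and the in/out clauses shown are the optional keywords for manual data management (for statically sized arrays the compiler can also manage the transfers automatically, and the exact clause requirements depend on the compiler version).

```c
/* Sketch: offloading an OpenMP parallel region to the MIC. */
#include <omp.h>
#include <stdio.h>

#define N 4096

int main(void)
{
    double a[N], b[N];

    for (int i = 0; i < N; i++)
        b[i] = (double)i;

    /* The offload pragma precedes the OpenMP pragma; the marked region
     * runs on the coprocessor (falling back to the host if no MIC is
     * present, unless offload is made mandatory). */
    #pragma offload target(mic) in(b) out(a)
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```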

Programming Models - Ready to Use on Day One!
- TBBs will be available to C++ programmers
- MKL will be available
  - Automatic offloading by the compiler for some MKL features (probably the most transparent mode); see the sketch below
- Cilk Plus
  - Useful for task-parallel programming (an add-on to OpenMP)
  - May become available for Fortran users as well
- OpenMP
  - TACC expects that OpenMP will be the most interesting programming model for our HPC users
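As a sketch of how transparent the MKL route can be, the code below is an ordinary host-side dgemm call; under MKL automatic offload the same source can have part of the work sent to the coprocessor. Enabling automatic offload is assumed to happen outside the code (for example via an environment variable such as MKL_MIC_ENABLE=1 - treat that as an assumption and check the MKL documentation for your version).

```c
/* Ordinary host-side MKL call.  With MKL automatic offload enabled in
 * the environment (assumed here), sufficiently large GEMMs may be split
 * between the host and the coprocessor transparently. */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    const int n = 2048;
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));

    for (int i = 0; i < n * n; i++) {
        A[i] = 1.0;
        B[i] = 2.0;
        C[i] = 0.0;
    }

    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}
```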

Roadmap - What Comes Next? How to Get Prepared?
General expectation:
- Many of the upcoming large systems will be accelerated
- Stampede's base Sandy Bridge cluster is being constructed now
- Intel's MIC coprocessor is on track for 2013
- Expect that (large) systems with MIC coprocessors will become available in 2013, at/after product launch
- MPI + OpenMP is the main HPC programming model, if you are not using TBBs and are not spending all your time in libraries (MKL, etc.)
- Early user access sometime after SC12

Roadmap - What Comes Next? How to Get Prepared? (continued)
- Many HPC applications are pure-MPI codes
  - Start thinking about upgrading to a hybrid scheme
  - Adding OpenMP is a larger effort than adding MIC directives
- Special MIC/OpenMP considerations
  - Many more threads will be needed: 50+ cores (production KNC) means 50+/100+/200+ threads
  - Good OpenMP scaling will be (much) more important
  - Vectorize, vectorize, vectorize
  - There is no unified last-level cache (LLC) on MIC; on CPUs, a cache of several MB is shared among cores
- Stay tuned: the coming years will be exciting