Parallel Programming on Ranger and Stampede


Parallel Programming on Ranger and Stampede
Steve Lantz, Senior Research Associate, Cornell CAC
Parallel Computing at TACC: Ranger to Stampede Transition
December 11, 2012

What is Stampede?

- NSF-funded XSEDE project at TACC
- New HPC system with two main components:
  - 2+ petaflop/s Intel Xeon E5-based Dell cluster to be the new workhorse for the NSF open science community (>3x Ranger)
  - 7+ additional petaflop/s of Intel Xeon Phi SE10P coprocessors to change the power/performance curves of supercomputing
- Complete ecosystem for advanced computing:
  - HPC cluster, storage, interconnect
  - Visualization subsystem (144 high-end GPUs)
  - Large-memory support (16 1 TB nodes)
  - Global work file system, archive, other data systems
  - People, training, and documentation to support computational science

Construction Under Way at TACC, May 2012

Left: water chiller plant; right: addition to main facility. Photo credit: Steve Lantz.

Stampede Specs from News Releases

- 6400 Dell C8220X nodes in initial system
  - 16 Xeon E5 (Sandy Bridge) cores per node, 102,400 total
  - 32 GB memory per node, 200 TB total
- At least 6400 Xeon Phi SE10P coprocessor cards
  - 61 cores, 4 hardware threads per core
  - 8 GB additional memory per card
- 14+ PB storage
  - Lustre parallel filesystem
  - 4864 3 TB drives

Photo by TACC, June 2012

Stampede Specs from News Releases (continued)

- 56 Gb/s InfiniBand, fat-tree interconnect
  - ~75 miles of cables
  - <1.2 μs latency
  - Compare Lonestar: 32 Gb/s (effective)
- 15+ PF after upgrade in 2015
- Nearly 200 racks
- Six megawatts total power
  - Thermal energy storage cuts costs
- Datacenter expansion of 10,000 sq. ft.

Photo by TACC, June 2012

Stampede Footprint vs. Ranger

- Ranger: 3,000 ft², 0.6 PF, 3 MW
- Stampede: 8,000 ft², 10 PF, 6.5 MW

Capabilities are 17x; footprint is 2.7x; power draw is 2.1x.

How Does Stampede Reach Petaflop/s?

- Hardware trend since around 2004: processors gain more cores (execution engines) rather than greater clock speed
  - IBM POWER4 (2001) became the first chip with 2 cores, 1.1-1.9 GHz; meanwhile, Intel's single-core Pentium 4 was a bust at >3.8 GHz
  - Top server and workstation chips in 2012 (Intel Xeon, AMD Opteron) now have 4, 8, even 16 cores, running at 1.6-3.2 GHz
- Does this mean Moore's Law is dead? No! Transistor densities are still doubling every 2 years
  - Clock rates have stalled at <4 GHz due to power consumption
  - The only way to increase flop/s/watt is through greater on-die parallelism

CPU Speed and Complexity Trends

Source: Committee on Sustaining Growth in Computing Performance, National Research Council. "What Is Computer Performance?" In The Future of Computing Performance: Game Over or Next Level? Washington, DC: The National Academies Press, 2011.

Implications for Petaflop/s Machines

- The only way to increase flop/s/watt is through greater on-die parallelism!
- These trends hold true for non-CPU devices too (e.g., processors for mobile devices)
- If 1 chip holds 10s of the best cores, why not 100s of weaker ones?
  - Around 2007-08, Cell chips had 1 main and 8 synergistic processors
- But then something else was recognized...

GPUs: Highly Parallel Hardware Is in PCs Already

- High-end graphics processing units (GPUs) contain 100s or 1000s of thread processors and enough RAM to rival CPUs in compute capability
- GPUs have been further tailored for HPC. Stampede example: NVIDIA Tesla K20
  - 2496 CUDA cores @ 732 MHz
  - 5 GB dedicated memory
  - 1.17 Tflop/s peak DP rate
  - 225 W power consumption
- Initially there were hardware obstacles to using GPUs for general calculations, but these have been overcome: ECC memory, double precision, and IEEE-compliant arithmetic are built in
- What about software?

[Photo: Tesla GPU]

General-Purpose Computing on GPUs (GPGPU)

- NVIDIA CUDA (2006) has been the forerunner in this area
  - SDK + API that permits programmers to use the C language to code algorithms for execution on NVIDIA GPUs (must be compiled with nvcc)
  - Stream processing: the GPU executes a code kernel on a stream of inputs
  - Works well if the kernel is multithreaded, vectorized (SIMD), and pipelined
- OpenCL (2008) is a more recent, open standard originated by Apple
  - C99-based language + API that enables data-parallel computation on GPUs as well as CPUs (e.g., ARM)
- Nontrivial (re)coding may be needed, based on a specialized API
  - Good performance depends on very specific tuning to cache sizes, etc.
  - Hard to keep thread processors busy over the slow PCIe interconnect
  - The resulting code is far less portable due to the API and special tuning

The Intel Approach: MIC

- MIC = Many Integrated Core = a coprocessor on a PCIe card that features >50 cores; released as Xeon Phi, used in Stampede
  - Represents Intel's response to GPGPU, especially NVIDIA's CUDA
  - Incorporates lessons learned from several internal development efforts: Larrabee, the 80-core Terascale chip, the Single-Chip Cloud (SCC)
  - Answers the question: if 8 modern Xeon cores fit on a die, how many early Pentiums would fit?
- Addresses the API problem: standard x86 instructions are supported
  - Includes 64-bit addressing
  - Other recent x86 extensions may not be available
  - Special instructions are added for an extra-wide (512-bit) vector register
- MIC supports general-purpose executables built using familiar Intel compilers, libraries, and analysis tools

Coprocessor vs. Accelerator

Coprocessor (MIC):
- Architecture: x86, multi-core, vector; coherent cache
- Programming model: extensions to C/Fortran
- Threading: OpenMP, explicit threading
- MPI: may use communication and synchronization, Host-Host or Host-MIC
- Programming constructs: serial, thread, vector

Accelerator (GPU):
- Architecture: streaming processors; shared memory, caches
- Programming model: CUDA, OpenCL
- Threading: hardware threads, little user control; mostly independent
- MPI: Host-Host only
- Programming constructs: stream

MIC Architecture

- SE10P is the first production version, used in Stampede
- Chip and memory reside on a PCIe card
- 61 cores, each containing:
  - 64 KB L1 cache
  - 512 KB L2 cache
  - 512-bit vector unit
  - 4 hardware threads
  - In-order instruction pipeline
- 31.5 MB total coherent L2 cache, connected by a ring bus
- 8 GB GDDR5 memory
  - Very fast: 352 GB/s, vs. 50 GB/s per socket for the E5

Diagram courtesy Intel

MIC vs. CPU

                       MIC (SE10P)   CPU (E5)   MIC is...
  Number of cores      61            8          much higher
  Clock speed (GHz)    1.01          2.7        lower
  SIMD width (bits)    512           256        higher
  DP GFLOPS/core       16+           21+        lower
  HW threads/core      4             1*         higher

- CPUs are designed for all workloads, with high single-thread performance
- MIC is also general purpose, though optimized for number crunching
  - Focus on high aggregate throughput via lots of weaker threads

Two Types of CPU/MIC Parallelism

- Threading (work-level parallelism)
  - OpenMP, Cilk Plus, TBB, Pthreads, etc.
  - It's all about sharing work and scheduling
- Vectorization (data-level parallelism)
  - Lock-step, instruction-level parallelism (SIMD)
  - Requires management of synchronized instruction execution
  - It's all about finding simultaneous operations
- To fully utilize MIC, both types of parallelism need to be identified and exploited (see the sketch after this list)
  - Need at least 2-4 threads to keep a MIC core busy (in-order execution stalls)
  - Vectorized loops gain 8x performance on MIC!
  - Important for CPUs as well: gain of 4x on Sandy Bridge
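
To make the two levels concrete, here is a minimal C/OpenMP sketch (not from the original slides; array sizes and names are illustrative): the outer loop is split among OpenMP threads, while the independent inner loop is left for the compiler to vectorize.

    /* threads_plus_vectors.c: work-level + data-level parallelism (illustrative sketch) */
    #include <omp.h>
    #include <stdio.h>
    #define NROWS 2048
    #define NCOLS 2048

    static double a[NROWS][NCOLS], x[NCOLS], y[NROWS];

    int main(void) {
        /* initialize the matrix and vector */
        for (int i = 0; i < NROWS; i++)
            for (int j = 0; j < NCOLS; j++) a[i][j] = 1.0e-3 * (i + j);
        for (int j = 0; j < NCOLS; j++) x[j] = 1.0;

        /* Threading: each OpenMP thread handles a block of rows */
        #pragma omp parallel for
        for (int i = 0; i < NROWS; i++) {
            double sum = 0.0;
            /* Vectorization: iterations are independent, so the compiler
               can issue SIMD instructions for this inner loop */
            for (int j = 0; j < NCOLS; j++)
                sum += a[i][j] * x[j];
            y[i] = sum;
        }

        printf("y[0] = %g, max threads = %d\n", y[0], omp_get_max_threads());
        return 0;
    }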

Threading

[Figure courtesy Intel]

Vectorization

- Instruction set: streaming SIMD instructions (SSE, AVX)
  - Single Instruction, Multiple Data (SIMD)
  - Wide vector registers hold multiple SP, DP, or integer values
  - One operation produces 2, 4, 8, or more results
- Compilers are good at vectorizing inner loops automatically as an optimization, but they need help:
  - Make sure each iteration is independent
  - Align data to good boundaries for registers and cache
- The Xeon Phi vector width is twice that of the E5 CPU
  - Twice as many calculations per vector operation
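
A hedged illustration of the two hints above (an assumed example, not from the talk): `restrict` asserts that iterations are independent, `_mm_malloc` provides 64-byte alignment, and `#pragma vector aligned` is an Intel-compiler-specific hint that other compilers may ignore.

    /* vec_hints.c: helping the compiler vectorize (illustrative sketch) */
    #include <stdio.h>
    #include <xmmintrin.h>   /* _mm_malloc, _mm_free */

    void triad(double * restrict c, const double * restrict a,
               const double * restrict b, int n) {
    #pragma vector aligned   /* Intel compiler hint: data is 64-byte aligned */
        for (int i = 0; i < n; i++)
            c[i] = a[i] * 2.0 + b[i];   /* independent iterations -> SIMD */
    }

    int main(void) {
        int n = 1 << 20;
        double *a = _mm_malloc(n * sizeof(double), 64);  /* 64-byte alignment */
        double *b = _mm_malloc(n * sizeof(double), 64);
        double *c = _mm_malloc(n * sizeof(double), 64);
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = n - 1 - i; }
        triad(c, a, b, n);
        printf("c[0] = %g\n", c[0]);
        _mm_free(a); _mm_free(b); _mm_free(c);
        return 0;
    }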

Parallelism and Performance on MIC and CPU

[Figure courtesy Intel]

Typical Configuration of a Future Stampede Node

- Host with dual Intel Xeon Sandy Bridge (CPU), running Linux OS, with an InfiniBand HCA
- PCIe card with Intel Xeon Phi (MIC), running a Linux micro OS (μOS), attached over PCIe
- Access from the network:
  - ssh <host> (OS)
  - ssh <coprocessor> (μOS)
- Virtual IP* service for the MIC

* Can't do this with a Lonestar GPU node, e.g., which is otherwise similar

Strategies for HPC Codes

- MPI code (could be hybrid with OpenMP):
  - No change: run on CPUs, MICs, or both
  - Expand existing hybrids; or add OpenMP offload
- Build on libraries like Intel MKL, PETSc, etc.
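
As a sketch of the library route (illustrative; the call is standard CBLAS, and the automatic-offload behavior via MKL_MIC_ENABLE is taken from Intel's MKL documentation of the period, not from this talk), a large DGEMM like the one below could be pushed to the Xeon Phi without source changes.

    /* mkl_dgemm.c: library-based strategy (illustrative sketch).
       With Intel MKL's automatic offload (e.g., MKL_MIC_ENABLE=1 in the
       environment), large DGEMMs could run on the coprocessor transparently;
       consult Intel's MKL documentation for the exact controls. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mkl.h>

    int main(void) {
        int n = 2048;
        double *A = malloc((size_t)n * n * sizeof(double));
        double *B = malloc((size_t)n * n * sizeof(double));
        double *C = malloc((size_t)n * n * sizeof(double));
        for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* C = 1.0 * A * B + 0.0 * C, row-major storage */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);

        printf("C[0] = %g (expect %g)\n", C[0], 2.0 * n);
        free(A); free(B); free(C);
        return 0;
    }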

Programming Models for Stampede 1: Offload Execution

- Directives indicate data and functions to send from CPU to MIC for execution
- Unified source code
- Code modifications required
- Compile once, with offload flags
  - Single executable includes instructions for MIC and CPU
- Run in parallel using MPI and/or scripting, if desired

Courtesy Scott McMillan, Intel

Programming Models for Stampede 2: Symmetric Execution

- Message passing (MPI) on CPUs and MICs alike
- Unified source code
- Code modifications optional
  - Assign different work to CPUs vs. MICs
  - Multithread with OpenMP for CPUs, MICs, or both
- Compile twice, producing 2 executables
  - One for MIC, one for host
- Run in parallel using MPI

Courtesy Scott McMillan, Intel
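
A minimal sketch of a symmetric MPI program (illustrative; the build commands in the comment, including the -mmic flag for the coprocessor-native executable, reflect Intel's tooling of the time and are an assumption, not part of the slides).

    /* symmetric_hello.c: one source, two builds (illustrative sketch).
       Assumed build steps, not taken from the talk:
         mpicc symmetric_hello.c -o hello.host          # host (Xeon) executable
         mpicc -mmic symmetric_hello.c -o hello.mic     # MIC-native executable
       The MPI launcher then starts host ranks with one binary and MIC ranks
       with the other. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, len;
        char node[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(node, &len);

        /* Each rank reports where it landed: a host CPU or a Xeon Phi card */
        printf("Rank %d of %d running on %s\n", rank, size, node);

        MPI_Finalize();
        return 0;
    }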

Pros and Cons of MIC Programming Models

- Offload engine: MIC acts as coprocessor for the host
  - Pros: distinct hardware gets a distinct role; programmable via simple calls to a library such as MKL, or via directives (we'll go into depth on this)
  - Cons: PCIe is the path for most work; difficult to retain data on the card
- Symmetric #1: everything is just an MPI core
  - Pros: MPI works for all cores (though 1 MIC core < 1 server core)
  - Cons: memory may be insufficient to support a μOS plus lots of data; fails to take good advantage of shared memory; PCIe is a bottleneck
- Symmetric #2: both MIC and host are just SMPs
  - Pros: MPI/OpenMP works for both host and MIC; more efficient use of limited PCIe bandwidth and limited MIC memory
  - Cons: hybrid programming is already tough on homogeneous SMPs; not much experience with OpenMP-based hybrids scaling to 50+ cores

Using Compiler Directives to Offload Work

- OpenMP's directives provide a natural model
  - 2010: the OpenMP working group starts to consider accelerator extensions
- Related efforts are launched to target specific types of accelerators
  - LEO (Language Extensions for Offload): Intel moves forward to support its processors and coprocessors, initially
  - OpenACC: PGI moves forward to support GPUs, initially
- Will OpenMP 4.0 produce a compromise among all the above?
  - Clearly desirable, but difficult
  - Other devices exist: network controllers, antenna A/D, cameras
  - Exactly what falls in the accelerator class? How diverse is it? Are coprocessors a distinct class?

OpenMP Offload Constructs: Base Program

Objective: offload foo to a device, using OpenMP to do the offload.

    #include <omp.h>
    #define N 10000

    void foo(double *, double *, double *, int);

    int main() {
        int i;
        double a[N], b[N], c[N];

        for (i = 0; i < N; i++) {
            a[i] = i;
            b[i] = N - 1 - i;
        }
        /* ... */
        foo(a, b, c, N);
    }

    void foo(double *a, double *b, double *c, int n) {
        int i;
        for (i = 0; i < n; i++) {
            c[i] = a[i] * 2.0e0 + b[i];
        }
    }

OpenMP Offload Constructs: Requirements

- Direct the compiler to offload the function or block
- Decorate the function and its prototype
- Ideally, familiar-looking OpenMP directives work on the device

    #include <omp.h>
    #define N 10000

    #pragma omp <offload_function_spec>
    void foo(double *, double *, double *, int);

    int main() {
        int i;
        double a[N], b[N], c[N];

        for (i = 0; i < N; i++) {
            a[i] = i;
            b[i] = N - 1 - i;
        }
        /* ... */
        #pragma omp <offload_this>
        foo(a, b, c, N);
    }

    #pragma omp <offload_function_spec>
    void foo(double *a, double *b, double *c, int n) {
        int i;
        #pragma omp parallel for
        for (i = 0; i < n; i++) {
            c[i] = a[i] * 2.0e0 + b[i];
        }
    }
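
For reference, one concrete realization of these placeholder pragmas at the time was Intel's LEO syntax; the sketch below follows Intel's compiler documentation rather than the slides, and OpenMP 4.0 later standardized similar target constructs.

    /* leo_offload.c: the same example with Intel LEO pragmas (illustrative,
       based on Intel compiler documentation of the time, not on the slides) */
    #include <omp.h>
    #define N 10000

    /* Decorate the function so a coprocessor version is generated too */
    __attribute__((target(mic)))
    void foo(double *a, double *b, double *c, int n) {
        int i;
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            c[i] = a[i] * 2.0e0 + b[i];
    }

    int main() {
        int i;
        double a[N], b[N], c[N];
        for (i = 0; i < N; i++) { a[i] = i; b[i] = N - 1 - i; }

        /* Ship a and b to the card, run foo there, bring c back */
        #pragma offload target(mic) in(a, b) out(c)
        foo(a, b, c, N);

        return 0;
    }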

Early MIC Programming Experiences

- Codes port easily
  - Minutes to days, depending mostly on library dependencies
- Performance requires real work
  - You really need to put in the effort to get what you expect
  - Optimizing for MIC is similar to optimizing for CPUs: optimize once, run anywhere
  - Early MIC ports of real codes can show 2x Sandy Bridge performance
- Scalability is pretty good
  - Forking multiple threads per core is really important
  - Getting your code to vectorize is also really important

Roadmap: What Comes Next?

- Expect many of the upcoming large systems to be accelerated
- MPI + OpenMP will be the main HPC programming model
  - If you are not using Intel TBB or Cilk Plus
  - If you are not spending all your time in libraries (MKL, etc.)
- Many HPC applications are pure-MPI codes
  - Start thinking about upgrading to a hybrid scheme (a minimal sketch follows below)
  - Adding OpenMP is a larger effort than adding MIC directives
- Special MIC/OpenMP considerations
  - Many more threads will be needed: 60+ cores on production Xeon Phi means 60+/120+/240+ threads
  - Good OpenMP scaling (and vectorization) are much more important
- Stampede to deploy January 2013
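
A minimal sketch of such a hybrid upgrade (illustrative, not from the talk): OpenMP threads do the local work inside each MPI rank, and only the master thread calls MPI.

    /* hybrid_sketch.c: a pure-MPI code upgraded to MPI + OpenMP (illustrative) */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, nranks;

        /* Ask for thread support; FUNNELED = only the master thread calls MPI */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;

        /* Each rank owns a slice of the iterations; OpenMP threads share it */
        #pragma omp parallel for reduction(+:local)
        for (int i = rank; i < 1000000; i += nranks)
            local += 1.0 / (1.0 + (double)i);

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %f using %d ranks x %d threads\n",
                   total, nranks, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }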

Reference

Much of the information in this talk was gathered from presentations at the TACC Intel Highly Parallel Computing Symposium, Austin, Texas, April 10-11, 2012: http://www.tacc.utexas.edu/ti-hpcs12

- Early draft Stampede User Guide: https://portal.tacc.utexas.edu/userguides/stampede
- Intel press materials: http://newsroom.intel.com/docs/doc-3126
- Intel MIC developer information: http://software.intel.com/micdeveloper