Preparing for Highly Parallel, Heterogeneous Coprocessing


Preparing for Highly Parallel, Heterogeneous Coprocessing
Steve Lantz, Senior Research Associate, Cornell CAC
Workshop: Parallel Computing on Ranger and Lonestar, May 17, 2012

What Are We Talking About Here?

Hardware trend since around 2004: processors gain more cores (execution engines) rather than greater clock speed.
- IBM's POWER4 (2001) became the first chip with 2 cores, running at 1.1-1.9 GHz; meanwhile, Intel's single-core Pentium 4 was a bust at >3.8 GHz
- Top server and workstation chips in 2012 (Intel Xeon, AMD Opteron) now have 4, 8, even 16 cores, running at 1.6-3.2 GHz

Does this mean Moore's Law is dead? No!
- Transistor densities are still doubling every 2 years
- Clock rates have stalled below 4 GHz due to power consumption
- The only way to increase flop/s/watt is through greater on-die parallelism

If 1 chip holds 10s of the best cores, why not 100s of weaker ones?
- Around 2007-8, Cell chips had 1 main and 8 synergistic processors; but then something else came along...

Highly Parallel Hardware Is in PCs Already

High-end graphics processing units (GPUs) contain 100s of thread processors and enough RAM to rival CPUs in compute capability. GPUs are being further tailored for HPC.

Lonestar example: NVIDIA Tesla M2070
- 448 CUDA cores @ 1.15 GHz
- 6 GB dedicated memory
- 1.03 Tflop/s peak SP rate
- 238 W power consumption

Initially there were hardware obstacles to using GPUs for general calculations, but these have been overcome: ECC memory, double precision, and IEEE-compliant arithmetic are built in. What about software?

[Photo: Tesla C2070 card]

General-Purpose Computing on GPUs (GPGPU)

Given the right software tools, developers can write code allowing the GPU to perform calculations usually handled by the CPU.

Stream processing: the GPU executes a code kernel on a stream of inputs.
- Exploits the GPU's rendering pipeline, designed to transform and shade a stream of vertices: highly parallel, very energy efficient
- Works well if the kernel is multithreaded, vectorized (SIMD), and pipelined

NVIDIA CUDA (2006) is the forerunner in this area.
- SDK + API that permits programmers to use the C language to code algorithms for execution on NVIDIA GPUs (must be compiled with nvcc)

OpenCL (2008) is a more recent, open standard originated by Apple.
- C99-based language + API that enables data-parallel computation on GPUs as well as CPUs
- Actively supported on Intel, AMD, NVIDIA, and ARM platforms

Not All Applications Are Suitable for GPGPU

The workload must be compute intensive, i.e., any data item fetched from main memory must take part in several GPU operations.
- Reason: to reach the GPU, data travel over a slow PCIe interconnect

The workload must be decomposed into many small, independent units.
- Reason: the GPU is only effective when all thread processors are kept busy

Nontrivial (re)coding may be needed, based on a specialized API.
- Good performance depends on very specific tuning to the hardware (cache sizes, etc.)
- The resulting code is far less portable due to the API and special tuning
- This may be avoided if a suitable kernel or library already exists

Is there a better way for numerically intensive applications to take advantage of hardware trends?

The Intel Approach: MIC

MIC = Many Integrated Core = a coprocessor on a PCIe card that features >50 compute cores.
- Represents Intel's response to GPGPU, especially NVIDIA's CUDA
- Incorporates lessons learned from the former Larrabee development effort (which never became a product)
- Answers the question: if 8 modern Xeon cores fit on a die, how many Pentium IIIs would fit?

Addresses the API problem: standard x86 instructions are supported.
- Includes 64-bit addressing
- Other recent x86 extensions may not be available
- Special instructions are added for an extra-wide (512-bit) vector register
- MIC executables are built using familiar Intel compilers, libraries, and analysis tools

MIC = A Teraflop/s System on a Chip!

- 1996: ASCI Red, the first system to achieve 1 Tflop/s sustained: 72 cabinets
- 2011: Knights Corner, the first Tflop/s system on a chip: 1 PCIe slot

The Knights Ferry (KNF) Coprocessor

KNF is the early development platform for the Intel MIC architecture: chip + memory on a PCI Express card.
- Up to 32 cores, 4 threads/core, < 1.2 GHz
- Up to 8 MB coherent shared L2 cache
- SIMD vector unit (8 DP floats wide)
- In-order instruction pipeline

A Linux Micro OS (μOS) runs on the MIC card.
- Small OS memory footprint
- Basic functionality (I/O, standard Unix commands, etc.)
- Users can telnet to the MIC
- RHEL 6.0, 6.1, 6.2 or SUSE 11 SP1 runs on the host

Details for Knights Corner (KNC) have not yet been disclosed.

Typical Configuration of a Future Stampede Node

[Diagram] A host with dual Intel Xeon (Sandy Bridge) processors runs a full Linux OS and connects to the network through an HCA; an attached PCIe card with an Intel Knights Corner coprocessor runs a Linux micro OS. A virtual IP* service for the MIC makes both sides reachable from the network:
- ssh 192.168.0.1 (host OS)
- ssh 192.168.0.2 (μOS)

* This can't be done with a Lonestar GPU node, e.g., which is otherwise similar.

First Large-Scale MIC System: TACC Stampede

System specs from the news release:
- 10 PF+ peak performance in the initial system (1Q 2013)
  - 2 PF from a conventional cluster (Sandy Bridge)
  - 8 PF from complementary coprocessors (KNC)
- 15 PF+ after upgrade
- 14 PB+ disk, 200 TB+ RAM
- 56 Gb/s FDR InfiniBand, fat-tree interconnect, ~75 miles of cables (compare Lonestar: 32 Gb/s QDR effective)
- Nearly 200 racks of compute hardware
- Integrated shared-memory and remote visualization subsystems
- Total concurrency approaching 500,000 cores

Construction Is Already Under Way at TACC

[Photos] Left: water chiller plant; right: addition to the main facility. Photo credit: Steve Lantz

Programming Models for Stampede

Offload Execution
- Directives indicate data and functions to send from the CPU to the MIC for execution
- Unified source code; code modifications required
- Compile once with offload flags; a single executable includes instructions for both MIC and CPU
- Run in parallel using MPI and/or scripting, if desired

Symmetric Execution
- Message passing (MPI) on CPUs and MICs alike
- Unified source code; code modifications optional (e.g., to assign different work to CPUs vs. MICs)
- Multithread with OpenMP for CPUs, MICs, or both
- Compile twice, producing 2 executables: one for the MIC, one for the host
- Run in parallel using MPI

Strategies for HPC Codes

- MPI code: no change; run on CPUs, MICs, or both
- Expand existing MPI/OpenMP hybrids; or add OpenMP offload regions
- Build on libraries like Intel MKL, PETSc, etc.
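As one concrete illustration of the hybrid route (not from the original slides), the sketch below combines MPI ranks with OpenMP threads in a simple dot product; the same source could in principle run on host CPUs, on MIC cards, or on both.

/* Minimal MPI+OpenMP hybrid sketch (illustrative only).
   Each MPI rank computes a partial dot product with an OpenMP-threaded loop. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv) {
    static double a[N], b[N];
    double local = 0.0, global = 0.0;
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block decomposition of the index range across MPI ranks */
    int chunk = N / size;
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? N : lo + chunk;

    for (i = lo; i < hi; i++) { a[i] = i; b[i] = N - 1 - i; }

    /* Threads share the work of this rank's block */
    #pragma omp parallel for reduction(+:local)
    for (i = lo; i < hi; i++)
        local += a[i] * b[i];

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("dot = %e\n", global);

    MPI_Finalize();
    return 0;
}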

Pros and Cons of MIC Programming Models

Offload engine: accelerator for the host
- Pros: distinct hardware gets a distinct role; programmable via simple calls to a library such as MKL, or via directives (we'll go into depth on this)
- Cons: most work travels over PCIe; it is difficult to retain data on the card

Symmetric #1: heterogeneous MPI cores
- Pros: MPI works for all cores (though 1 MIC core < 1 server core)
- Cons: memory is insufficient to give each core a μOS plus lots of data; fails to take good advantage of shared memory; PCIe is a bottleneck

Symmetric #2: heterogeneous SMPs (symmetric multiprocessors)
- Pros: MPI/OpenMP works for both host and MIC; efficient use of limited PCIe bandwidth and MIC memory due to a single message source/sink per device
- Cons: hybrid programming is already tough on homogeneous SMPs; it is not clear whether existing OpenMP-based hybrids scale to 50+ cores

Using Compiler Directives to Offload Work

OpenMP's directives provide a natural model.
- 2010: the OpenMP working group starts to consider accelerator extensions
- Related efforts are launched to target specific types of accelerators
  - LEO (Language Extensions for Offload): Intel moves forward to support its processors and coprocessors, initially
  - OpenACC: PGI moves forward to support GPUs, initially

Will OpenMP 4.0 produce a compromise among all the above? Clearly desirable, but it's difficult.
- Other devices exist: network controllers, antenna A/D converters, cameras
- Exactly what falls in the accelerator class? How diverse is it?

OpenMP Offload Constructs: Base Program

Objective: offload foo to a device, using OpenMP to do the offload.

#include <omp.h>
#define N 10000

void foo(double *, double *, double *, int);

int main() {
    int i;
    double a[N], b[N], c[N];

    for (i = 0; i < N; i++) { a[i] = i; b[i] = N - 1 - i; }
    ...
    foo(a, b, c, N);
}

void foo(double *a, double *b, double *c, int n) {
    int i;
    for (i = 0; i < n; i++) {
        c[i] = a[i] * 2.0e0 + b[i];
    }
}

OpenMP Offload Constructs: Requirements

- Direct the compiler to offload the function or block
- Decorate the function and its prototype
- The usual OpenMP directives work on the device
(A concrete sketch using Intel's LEO directives follows below.)

#include <omp.h>
#define N 10000

#pragma omp <offload_function_spec>
void foo(double *, double *, double *, int);

int main() {
    int i;
    double a[N], b[N], c[N];

    for (i = 0; i < N; i++) { a[i] = i; b[i] = N - 1 - i; }
    ...
    #pragma omp <offload_this>
    foo(a, b, c, N);
}

#pragma omp <offload_function_spec>
void foo(double *a, double *b, double *c, int n) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        c[i] = a[i] * 2.0e0 + b[i];
    }
}
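For concreteness, here is one way the placeholder pragmas above might be realized with Intel's LEO (Language Extensions for Offload) directives available at the time; this is a hedged sketch, and the clause names in the eventual OpenMP standard differ.

/* Hedged sketch: the base program written with Intel LEO directives
   (requires the Intel compiler). */
#include <omp.h>
#include <stdio.h>
#define N 10000

/* Decorate the function and its prototype so a MIC version is also built */
__attribute__((target(mic)))
void foo(double *, double *, double *, int);

int main() {
    int i;
    double a[N], b[N], c[N];
    for (i = 0; i < N; i++) { a[i] = i; b[i] = N - 1 - i; }

    /* Offload the call; copy a and b to the card, bring c back */
    #pragma offload target(mic) in(a, b : length(N)) out(c : length(N))
    foo(a, b, c, N);

    printf("c[0] = %f\n", c[0]);
    return 0;
}

__attribute__((target(mic)))
void foo(double *a, double *b, double *c, int n) {
    int i;
    #pragma omp parallel for   /* the usual OpenMP directives run on the device */
    for (i = 0; i < n; i++)
        c[i] = a[i] * 2.0e0 + b[i];
}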

OpenMP Offload Constructs: Data Requirements

- Data must be moved to and from the device
- Synchronous model: move data to the device at the dispatch of execution, move it back afterward
- Control of data locality is new to OpenMP (and OpenACC); see the persistence sketch below

#include <omp.h>
#define N 10000

#pragma omp <offload_function_spec>
void foo(double *, double *, double *, int);

int main() {
    int i;
    double a[N], b[N], c[N];

    for (i = 0; i < N; i++) { a[i] = i; b[i] = N - 1 - i; }
    ...
    #pragma omp <offload_this> <data_clause>
    foo(a, b, c, N);
}

#pragma omp <offload_function_spec>
void foo(double *a, double *b, double *c, int n) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        c[i] = a[i] * 2.0e0 + b[i];
    }
}
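To illustrate what controlling data locality can look like in practice, here is a hedged sketch using Intel LEO's data clauses; alloc_if/free_if and nocopy let an array stay resident on the card across offloads instead of crossing PCIe each time. The clause names are LEO-specific assumptions, not the eventual OpenMP syntax.

/* Hedged sketch (Intel LEO syntax; requires the Intel compiler):
   keep an array resident on the MIC card across multiple offloads. */
#include <stdio.h>
#include <stdlib.h>
#define N 10000

int main() {
    int i;
    double *a = (double *) malloc(N * sizeof(double));
    double *c = (double *) malloc(N * sizeof(double));
    for (i = 0; i < N; i++) a[i] = i;

    /* Copy a to the card once; allocate it there and do not free it yet */
    #pragma offload_transfer target(mic:0) in(a : length(N) alloc_if(1) free_if(0))

    /* Reuse the resident copy of a; only c travels back over PCIe */
    #pragma offload target(mic:0) nocopy(a : length(N) alloc_if(0) free_if(0)) \
                                  out(c : length(N) alloc_if(1) free_if(1))
    for (i = 0; i < N; i++) c[i] = 2.0 * a[i];

    /* Release the card-side copy of a when done */
    #pragma offload_transfer target(mic:0) nocopy(a : length(N) alloc_if(0) free_if(1))

    printf("c[N-1] = %f\n", c[N - 1]);
    free(a); free(c);
    return 0;
}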

OpenMP Offload Constructs: Asynchronicity

- The offloaded region can be executed asynchronously
- Moving data asynchronously is another important option
(A signal/wait sketch follows below.)

#include <omp.h>
#define N 10000

#pragma omp <offload_function_spec>
void foo(double *, double *, double *, int);

int main() {
    int i;
    double a[N], b[N], c[N];

    for (i = 0; i < N; i++) { a[i] = i; b[i] = N - 1 - i; }
    #pragma omp <offload_data> <async>
    ...
    #pragma omp <offload_this> <async> <data>
    foo(a, b, c, N);
}

#pragma omp <offload_function_spec>
void foo(double *a, double *b, double *c, int n) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        c[i] = a[i] * 2.0e0 + b[i];
    }
}
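As a hedged illustration of asynchronous offload, Intel's LEO directives provided signal/wait clauses for this purpose at the time; the sketch below launches the offload, overlaps it with host work, and then blocks until it completes. The exact clause names are LEO-specific assumptions.

/* Hedged sketch (Intel LEO syntax): asynchronous offload with signal()
   and offload_wait. */
#include <omp.h>
#include <stdio.h>
#define N 10000

__attribute__((target(mic)))
void foo(double *a, double *b, double *c, int n) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) c[i] = a[i] * 2.0e0 + b[i];
}

int main() {
    int i;
    static double a[N], b[N], c[N];
    char done;                       /* tag identifying the async offload */
    for (i = 0; i < N; i++) { a[i] = i; b[i] = N - 1 - i; }

    /* Dispatch the offload and return immediately to the host code */
    #pragma offload target(mic:0) signal(&done) \
            in(a, b : length(N)) out(c : length(N))
    foo(a, b, c, N);

    /* ... do useful host work here while the MIC computes ... */

    /* Block until the offload tagged by &done has completed */
    #pragma offload_wait target(mic:0) wait(&done)

    printf("c[0] = %f\n", c[0]);
    return 0;
}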

Two Types of MIC Parallelism (Both Are Needed)

OpenMP (offloading to MIC)
- Work-level parallelization (threads)
- Requires management of asynchronous processes
- It's all about sharing work and scheduling

Vectorization
- Lock-step, instruction-level parallelization (SIMD)
- Requires management of synchronized instruction execution
- It's all about finding simultaneous operations

To fully utilize the MIC, both types of parallelism should be identified and exploited (see the sketch below).
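To make the two levels concrete, here is a minimal, illustrative loop nest (not taken from the slides): OpenMP threads share the outer loop, while each thread's unit-stride inner loop is left in a simple form the compiler can vectorize.

/* Illustrative only: threads (work-level parallelism) across rows, SIMD
   (instruction-level parallelism) within each row's unit-stride inner loop. */
#include <omp.h>
#include <stdio.h>

#define NROWS 2048
#define NCOLS 2048

static double x[NROWS][NCOLS], y[NROWS][NCOLS], z[NROWS][NCOLS];

int main() {
    int i, j;

    /* Outer loop: shared among OpenMP threads */
    #pragma omp parallel for private(j)
    for (i = 0; i < NROWS; i++) {
        /* Inner loop: independent iterations with unit stride,
           a good candidate for the compiler's vectorizer */
        for (j = 0; j < NCOLS; j++)
            z[i][j] = 2.5 * x[i][j] + y[i][j];
    }

    printf("z[0][0] = %f\n", z[0][0]);
    return 0;
}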

Vectorization Matters Too

Vectorization, or SIMD processing, enables simultaneous, independent operation on multiple data operands with a single instruction. (Large arrays should provide a constant stream of data.)

A vector reawakening:
- Pre-Sandy Bridge DP* vector units are only 2 wide (SSE)
- Sandy Bridge DP* vector units are 4 wide (AVX)
- MIC DP* vector units are 8 wide! (VEX)

Unvectorized loops lose 4x performance on Sandy Bridge and 8x performance on MIC! Evaluate performance with vectorization compiler options turned on and off (-no-vec) to assess overall vectorization.

*DP = double precision (8 bytes, 64 bits)

Working with a Vectorizing Compiler

Compilers are good at vectorizing inner loops, but they need help.
- Make sure each iteration is independent (see the aliasing example below)
- Align data to match register sizes and cache-line boundaries

Compilers will look for vectorization opportunities starting at -O2.
- To apply the latest relevant vector instructions for the given architecture: -x<simd_instr_set>
- To examine the assembly code, myprog.s: -S
- To confirm with a vector report: -vec-report=<n>, where n sets the verbosity

% ifort -xhost -vec-report=4 prog.f90 -c
prog.f90(31): (col. 11) remark: loop was not vectorized: existence of vector dependence
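As an illustration of "make sure each iteration is independent", the hedged C sketch below shows a common reason a compiler may report a vector dependence, possible pointer aliasing, and the C99 restrict qualifier that removes it. Function names and the compile line are illustrative assumptions, not from the slides.

/* Hedged sketch: helping the compiler vectorize a C loop. With plain
   pointers the compiler may assume a, b, and c can overlap; restrict
   promises they do not, so the loop iterations are clearly independent.
   Compile with something like:  icc -O2 -xhost -vec-report=4 -c saxpy.c  */

/* May trigger "existence of vector dependence": c could alias a or b */
void saxpy_plain(double *a, double *b, double *c, int n) {
    int i;
    for (i = 0; i < n; i++)
        c[i] = 2.0 * a[i] + b[i];
}

/* Vectorizes readily: restrict guarantees the arrays do not overlap */
void saxpy_restrict(double * restrict a, double * restrict b,
                    double * restrict c, int n) {
    int i;
    for (i = 0; i < n; i++)
        c[i] = 2.0 * a[i] + b[i];
}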

Dreams for Future Software Convergence

What would application programmers find most desirable?
- Stay with standard languages, language extensions, and compilers; move away from CUDA and special compilers
- Operate at a high level; avoid intrinsics that resemble assembly language
- Express programs in terms of generic parallel tasks; de-emphasize APIs based on specific realizations, like threads

Maybe we're getting close.
- Compilers support vectorization and OpenMP (CL? ACC?)
- There's plenty of room for improvement in the emerging standards

What Does All This Mean for Me?

Next-generation algorithms and programs will need to run well on architectures like those just described.
- Power efficiency will inevitably drive hardware designs in this direction
- Mobile processors are just as power-constrained as HPC behemoths; the design goals align
- It seems likely that even workstations and laptops will soon come equipped with many-core coprocessors of some type

Therefore, finding and expressing parallelism in your computational workload will become increasingly important.
- Codes must be flexible enough to deal with heterogeneous resources
- Asynchronous, adaptable methods may eventually be favored

Reference

Much of the information in this talk was gathered from presentations at the TACC-Intel Highly Parallel Computing Symposium, Austin, Texas, April 10-11, 2012: http://www.tacc.utexas.edu/ti-hpcs12.