The Stampede is Coming: A New Petascale Resource for the Open Science Community


The Stampede is Coming: A New Petascale Resource for the Open Science Community
Jay Boisseau, Texas Advanced Computing Center (boisseau@tacc.utexas.edu)

Stampede: Solicitation
- US National Science Foundation (NSF) competition in 2011 for a large HPC system
  - To support open science across all domains of science and engineering
  - To enable simulation-based and data-driven science
- Solicitation requirements and constraints
  - $25M production system, plus up to $5M for an innovative component
  - Up to $6M/year for 4 years of operations and maintenance (O&M)
  - In production by January 2013 (fingers crossed)

Stampede: NSF HPC Context
- System must be integrated into XSEDE: allocations, integrated support, etc.
- Will effectively replace the Ranger (TACC) and Kraken (NICS) HPC systems
  - They expire in February 2013 (Ranger) and June 2014 (Kraken)
  - So we probably had to bid at least 1½-2 PF
- Does not come with separate funding for networking, storage, or visualization, but all are required!

TACC's Stampede Project
We wanted a comprehensive science system:
- Tremendous HPC capability, with powerful processors
- Support for HTC and big shared-memory applications as well
- Powerful remote viz subsystem: 128 high-end GPUs
- Integration with the archive and (coming soon) the TACC global work filesystem and other data systems
- Software complementing the hardware to support both simulation-based and data-driven science
- People, training, and documentation to support diverse users, applications, and modes of computational science

Stampede: Background
- TACC evaluation: Linux/x86 is the easiest path for XSEDE users
- Needed additional performance from the innovative component
  - The community really needs >> 2 PF due to the lack of recent large-scale NSF HPC systems
  - So we considered acceleration options for more PFs
- Evaluated GPU and MIC; decided on MIC
  - Easier porting
  - More programming options (x86), not just acceleration
  - Leverages x86 optimization experience
  - More innovative? (It didn't exist yet...)

Key Aspects of Acceleration
- We have lots of transistors: Moore's law is holding; this isn't necessarily the problem.
- We don't really need lower power per transistor; we need lower power per *operation*.
- How to do this?

Intel's MIC Approach: Coprocessor, Not Just Accelerator
- Since the days of RISC vs. CISC, Intel has mastered the art of figuring out what is important about a new processing technology and asking "why can't we do this in x86?"
- The Intel Many Integrated Core (MIC) architecture is (like GPUs) about a large die, simpler circuits, and much more parallelism, but in the x86 line
- Easier to use (even as an accelerator) with standard, familiar tools (e.g., MPI, OpenMP)
- More flexible programming modes and models

Basic Design Ideas: What is MIC?
- Leverage the x86 architecture (a CPU with many cores)
  - x86 cores that are simpler, but allow for more compute throughput
- Leverage (Intel's) existing x86 programming models
- Dedicate much of the silicon to floating-point operations, but keep some cache(s)
  - Keep the cache-coherency protocol
- Increase floating-point throughput per core
  - Strip expensive features (out-of-order execution, branch prediction, etc.)
  - Widen SIMD registers for more throughput
- Implement as a separate device
  - Fast (GDDR5) memory on the card

Coprocessor vs. Accelerator
- Architecture: x86 vs. streaming processors
- Memory: coherent caches vs. shared memory and caches
- HPC programming model: extension to C++/C/Fortran vs. CUDA/OpenCL; OpenCL support
- Threading/MPI: OpenMP and multithreading vs. threads in hardware; MPI on the host and/or MIC vs. MPI on the host only
- Programming details: offloaded regions vs. kernels
- Support for any code (serial, scripting, etc.): yes vs. no
- Native mode (any code may be offloaded as a whole to the coprocessor): yes vs. no

What We Like about MIC
- Intel's MIC is based on x86 technology
  - x86 cores with caches and cache coherency
  - SIMD instruction set
- Programming for MIC is similar to programming for CPUs
  - Familiar languages: C/C++ and Fortran
  - Familiar parallel programming models: OpenMP and MPI
  - MPI on the host and on the coprocessor
  - Any code can run on MIC, not just kernels
- Optimizing for MIC is similar to optimizing for CPUs
  - Optimize once, run anywhere
  - Our early MIC porting efforts for codes in the field are frequently doubling performance on Sandy Bridge

TACC's Stampede: The Big Picture
- Dell, Intel, and Mellanox are vendor partners
- Almost 10 PF peak performance in the initial system (2013)
  - 2+ PF of Intel Xeon E5 (6,400 dual-socket nodes)
  - 7+ PF of Intel Xeon Phi (MIC) coprocessors (at least 6,400); special edition/release for this project, 61 cores
- 14+ PB disk, 150+ GB/s I/O bandwidth
- 272+ TB RAM
- 56 Gb/s Mellanox FDR InfiniBand interconnect
- 16 x 1 TB large shared-memory nodes
- 128 NVIDIA Kepler K20 GPUs for remote viz
- 182 racks, 6 MW of power!

New Technologies/Milestones in the Stampede Project
- Density surpasses 40 kW per cabinet
  - New Dell node designs to support multiple 300 W expansion cards in a single ~1U node
- Total system power past 5 *megawatts*
  - Thermal storage technology incorporated
- Breakthrough price/performance and power/performance
- Inclusion of Intel MIC; but we must program it
- Application concurrency past 1 *million* threads per application

Stampede: How Will Users Use It?
- As a 2+ PF Xeon-only system (MPI, OpenMP)
  - Many users will use it as an extremely powerful Sandy Bridge cluster, and that's OK!
  - They may also use the shared-memory nodes and remote vis
- As a 7+ PF MIC-only system (MPI, OpenMP)
  - Homogeneous codes can be run entirely on the MICs!
- As a ~10 PF heterogeneous system (MPI, OpenMP)
  - Run separate MPI tasks on Xeon vs. MIC; use OpenMP extensions for offload for hybrid programs

Will My Code Run on MIC?
Yes. But that's the wrong question. Ask instead:
- Will your code run *best* on MIC?
- Will you get great MIC performance without additional work?

Stampede Programming: Five Possible Models
- Host-only
- Offload
- Symmetric
- Reverse offload
- MIC native

Key Questions
- Why do we need to use a thread-based programming model?
- MPI programming:
  - Where to place the MPI tasks?
  - How to communicate between host and MIC?

Early MIC Programming Experiences at TACC
- Codes port easily
  - Minutes to days, depending mostly on library dependencies
- Performance can require real work
  - The software environment continues to evolve
  - Getting codes to run *at all* is almost too easy; you really need to put in the effort to get the performance you expect
- Scalability is pretty good
- Multiple threads per core: *really important*
- Getting your code to vectorize: *really important* (see the sketch below)
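
The Xeon Phi's 512-bit vector units operate on eight double-precision values per instruction, so a loop the compiler cannot vectorize leaves most of a core's floating-point throughput unused. Below is a minimal, hypothetical sketch of a vectorization-friendly loop; the restrict qualifiers and the simple unit-stride form are illustrative choices, not code from these slides.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical kernel written so the compiler can vectorize it:
   'restrict' promises the arrays do not overlap, and the unit-stride,
   branch-free loop maps onto the MIC's 512-bit vector registers. */
void scale_add(double *restrict c, const double *restrict a,
               const double *restrict b, int n){
    for(int i = 0; i < n; i++)
        c[i] = 2.0*a[i] + b[i];
}

int main(void){
    int n = 10000;
    double *a = malloc(n*sizeof(double));
    double *b = malloc(n*sizeof(double));
    double *c = malloc(n*sizeof(double));
    for(int i = 0; i < n; i++){ a[i] = i; b[i] = n-1-i; }
    scale_add(c, a, b, n);
    printf("c[0] = %f\n", c[0]);
    free(a); free(b); free(c);
    return 0;
}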

Programming MIC with Threads
Key aspects of the MIC coprocessor:
- A lot of local memory, but even more cores: 100+ threads or tasks on a MIC
- One MPI task per core? Probably not
  - Severe limitation of the available memory per task
  - Some GB of memory and 100 tasks means some tens of MB per MPI task
- Threads (OpenMP, etc.) are a must on MIC (see the hybrid sketch below)
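
To make the memory arithmetic concrete: the Stampede Phi cards carry 8 GB of GDDR5, so ~100 MPI tasks would leave only about 80 MB each, whereas a few MPI tasks per card, each spawning many OpenMP threads, keeps the per-task footprint workable. Below is a minimal hybrid MPI+OpenMP sketch of that layout; it is an illustrative assumption, not code from the slides.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hypothetical hybrid layout: a few MPI tasks per coprocessor, each
   using many OpenMP threads, rather than one MPI task per core. */
int main(int argc, char **argv){
    int provided, rank, nthreads;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        #pragma omp single
        nthreads = omp_get_num_threads();
        /* ... threaded work here shares this task's memory ... */
    }

    printf("MPI task %d running %d OpenMP threads\n", rank, nthreads);
    MPI_Finalize();
    return 0;
}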

MIC Programming with OpenMP
- A MIC-specific pragma precedes the OpenMP pragma
  - Fortran: !dir$ omp offload target(mic) <...>
  - C: #pragma offload target(mic) <...>
- All data transfer is handled by the compiler (optional keywords provide user control)
- Options for:
  - Automatic data management
  - Manual data management
  - I/O from within the offloaded region
    - Data can stream through the MIC; no need to leave the MIC to fetch new data
    - Also very helpful when debugging (print statements)
  - Offloading a subroutine and using MKL
A sketch of an offloaded OpenMP region follows.
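
A minimal sketch of what such an offloaded OpenMP region can look like with the Intel compiler's offload pragma is shown below. The explicit in/out clauses are one way to take manual control of the transfers (automatic management of referenced variables is also possible); the array names and sizes are illustrative.

#include <omp.h>
#define N 10000

int main(void){
    double a[N], b[N], c[N];
    for(int i = 0; i < N; i++){ a[i] = i; b[i] = N-1-i; }

    /* Offload this block to the coprocessor.  The in/out clauses give
       manual control over data movement; omitting them lets the
       compiler manage transfers for the variables it sees used. */
    #pragma offload target(mic) in(a, b) out(c)
    {
        #pragma omp parallel for
        for(int i = 0; i < N; i++)
            c[i] = a[i]*2.0 + b[i];
    }
    return 0;
}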

Offloaded Execution Model
- MPI tasks execute on the host
- Directives offload OpenMP code sections to the MIC
- Communication between MPI tasks on hosts is through MPI
- Communication between host and coprocessor is through offload semantics
- Code modifications:
  - Offload directives inserted before OpenMP parallel regions
  - One executable (a.out) runs on the host and the coprocessor

MIC Training: Offloading a Basic Program
Objective: offload foo and do OpenMP on the device (the offload engine is a device).
Base program:

#include <omp.h>
#define N 10000

void foo(double *, double *, double *, int);

int main(){
    int i;
    double a[N], b[N], c[N];
    for(i=0;i<N;i++){ a[i]=i; b[i]=N-1-i; }
    foo(a,b,c,N);
}

void foo(double *a, double *b, double *c, int n){
    int i;
    for(i=0;i<n;i++) { c[i]=a[i]*2.0e0 + b[i]; }
}

MIC Training: Offloading a Function
Function offload requirements:
- Direct the compiler to offload the function or block
- Decorate the function and its prototype
- Usual OpenMP directives on the device

#include <omp.h>
#define N 10000

#pragma omp <offload_function_spec>
void foo(double *, double *, double *, int);

int main(){
    int i;
    double a[N], b[N], c[N];
    for(i=0;i<N;i++){ a[i]=i; b[i]=N-1-i; }
    #pragma omp <offload_this>
    foo(a,b,c,N);
}

#pragma omp <offload_function_spec>
void foo(double *a, double *b, double *c, int n){
    int i;
    #pragma omp parallel for
    for(i=0;i<n;i++) { c[i]=a[i]*2.0e0 + b[i]; }
}
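
The placeholders above stand in for directive spellings that were still settling when these slides were written. As a point of comparison only, a sketch of the same decoration using the Intel compiler's offload extensions (an assumption about the toolchain, not part of the slides) might look like this:

#include <omp.h>
#define N 10000

/* Mark foo so the compiler also builds a coprocessor version of it
   (both the prototype and the definition are decorated). */
__attribute__((target(mic))) void foo(double *, double *, double *, int);

int main(){
    int i;
    double a[N], b[N], c[N];
    for(i=0;i<N;i++){ a[i]=i; b[i]=N-1-i; }

    /* Offload the call: a and b are copied to the card, c is copied back. */
    #pragma offload target(mic) in(a, b) out(c)
    foo(a,b,c,N);
}

__attribute__((target(mic))) void foo(double *a, double *b, double *c, int n){
    int i;
    #pragma omp parallel for
    for(i=0;i<n;i++) { c[i]=a[i]*2.0e0 + b[i]; }
}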

The Stampede is Coming
- Stampede has a *lot* of new technologies, and as such carries a certain element of risk.
- However, as of now, we expect the early user/science period to begin in December.
- Full production by January 7th (base system, at least).
- You might hear some more about this machine at SC12 in November.
- Thanks to NSF, Intel, and Dell.

Datacenter Expansion
- Ranger: 3,000 ft²
- Stampede: 8,000 ft²

Power/Physical
- Stampede will physically use 182 48U cabinets.
- Power density (after the upgrade in 2015) will exceed 40 kW per rack.
- Estimated 2015 peak power is 6.2 MW.

Key Datacenter Features
- Thermal energy storage to reduce peak power charges
- Hot-aisle containment to boost efficiency (and simply provide enough cooling)
- Total IT power to 9.5 MW, total power ~12 MW
- Expand experiments in mineral oil cooling

Stampede: Powered by Intel Xeon E5 and Xeon Phi
- TACC and Intel have provided open science clusters since 2001
- The x86 architecture has become the dominant ISA in supercomputing
- Stampede needed to enable discoveries that require petaflops
  - Xeon E5 enables far more performance for general applications
  - Xeon Phi enables tremendous potential performance for highly parallel applications, with much easier porting than GPUs
- Thus, Stampede will preserve (and reward!) many years of x86 experience and expertise in scientific codes, while offering world-class performance

Stampede Will Enable New Scientific Discoveries Across Domains
1,000+ projects, by thousands of researchers