Piz Daint: Application-driven co-design of a supercomputer based on Cray's adaptive system design


Piz Daint: Application-driven co-design of a supercomputer based on Cray's adaptive system design. Sadaf Alam & Thomas Schulthess, CSCS & ETH Zürich, CUG 2014

[Timeline figure, 2009-2015 (timelines & releases are not precise), with tracks for:]
- Application investment & engagement: High Performance High Productivity Computing (HP2C), Platform for Advanced Scientific Computing (PASC), collaborative projects
- Training and workshops: HP2C training program, PASC conferences & workshops
- Prototypes & early-access parallel systems: prototypes with accelerator devices, GPU nodes in a viz cluster, GPU cluster, Cray XK6, Cray XK7
- HPC installation and operations: Cray XC30, Cray XC30 (adaptive), Top 500 and Green 500 entries

Reduce code prototyping and deployment time on HPC systems

[Timeline figure, 2009-2015 (timelines & releases are not precise), tracking programming-interface availability across systems, from requirements analysis through applications development and tuning:]
- Systems: x86 cluster with C2070, M2050, S1070 (2009); iDataPlex cluster with M2090 (2011); testbed with Kepler & Phi (2012); Cray XK6; Cray XK7; Cray XC30 & hybrid XC30
- Programming interfaces: CUDA 2.x through 6.x; OpenCL 1.0 through 2.0; OpenACC 1.0 and 2.0; GPU-enabled MPI & MPS; GPUDirect; GPUDirect-RDMA
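To illustrate what "GPU-enabled MPI" buys, here is a minimal sketch of mine (not from the slides): ranks pass device pointers straight to MPI calls instead of staging through host buffers. It assumes a CUDA-aware MPI, e.g. Cray MPICH with MPICH_RDMA_ENABLED_CUDA=1 set at run time, and at least two ranks.

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1 << 20;
        double *d_buf;                           /* device buffer */
        cudaMalloc(&d_buf, n * sizeof(double));

        if (size >= 2) {
            if (rank == 0) {
                cudaMemset(d_buf, 0, n * sizeof(double));
                /* device pointer handed directly to MPI */
                MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

With GPUDirect-RDMA, the later step on the timeline, the interconnect can source and sink such buffers without any detour through host memory.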

Algorithmic motifs and their arithmetic intensity

[Figure: motifs placed along an axis of increasing arithmetic density:]
- O(1): structured grids / stencils (COSMO, WRF, SPECFEM3D); sparse linear algebra; matrix-vector and vector-vector operations (BLAS1 & 2); rank-1 update in HF-QMC
- O(log N): fast Fourier transforms (FFTW & SPIRAL)
- O(N): dense matrix-matrix operations (BLAS3); rank-N update in DCA++; QMR in WL-LSMS; Linpack (Top500)
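To make the axis concrete (a worked example of mine, not from the slide): arithmetic intensity is the ratio of flops performed to bytes moved, counting 8 bytes per double.

    % arithmetic intensity I = W / Q, in flops per byte
    I = \frac{W}{Q}
    % BLAS1 daxpy, y \leftarrow \alpha x + y: 2n flops, 3n doubles moved
    I_{\text{daxpy}} = \frac{2n}{24n} = \frac{1}{12} \in O(1)
    % BLAS3 dgemm, C \leftarrow AB + C: 2n^3 flops, 4n^2 doubles with ideal reuse
    I_{\text{dgemm}} = \frac{2n^3}{32n^2} = \frac{n}{16} \in O(n)

This is why the O(N) motifs (BLAS3, Linpack) map well onto a GPU's flop rate, while the O(1) motifs are bound by memory bandwidth, the other axis of the requirements analysis below.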

Requirements Analysis
- Compute and memory bandwidth -> hybrid compute nodes
- Network bandwidth -> fully provisioned dragonfly on 28 cabinets (CP2K exposed network issues)
- GPU-enabled MPI & MPS
- Low jitter

[Figure: machine-room layout (8 x 10 x 10); a third row of cabinets was added for Phase II. Systems shown: Phase II Piz Daint (hybrid XC30), Phase I Piz Daint (multi-core XC30), and the TDS system (santis).]

Adaptive Cray XC30 Compute Node

[Figure: blade diagrams. Phase I: 48-port Aries router, NIC0-NIC3, and two QPDC (quad processor daughter cards). Phase II: the same 48-port router and NIC0-NIC3, with two H-PDC (hybrid processor daughter cards).]

Node performance characteristics

                                                Phase I           Phase II
  Number of compute nodes                       2,256             5,272
  Peak system DP performance                    0.75 PFlops       7.8 PFlops
  CPU cores/sockets per node (Intel E5-2670)    16/2              8/1
  DDR3-1600 memory per node                     32 GBytes         32 GBytes
  GPU SMX/devices per node (NVIDIA K20X)        --                14/1
  GPU memory per node                           --                6 GB (ECC off)
  DP Gflops per node                            332.8 (CPU)       166.4 (CPU) + 1311 (GPU)
  DDR3-1600 bandwidth per node                  102.4 GB/s        51.2 GB/s
  GPU memory bandwidth per node                 --                250 GB/s (ECC off)
  Network & I/O interface                       16x PCIe 3.0      16x PCIe 3.0
  Network injection bandwidth per node          ~10 GB/s          ~10 GB/s
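As a sanity check on the per-node peaks (my arithmetic, assuming the standard clocks: E5-2670 at 2.6 GHz with 8 DP flops/cycle under AVX, K20X at 732 MHz with 64 DP units per SMX and FMA):

    \text{CPU: } 8\ \text{cores} \times 8\ \text{flops/cycle} \times 2.6\ \text{GHz} = 166.4\ \text{Gflops/socket (two sockets in Phase I} \Rightarrow 332.8)
    \text{GPU: } 14\ \text{SMX} \times 64\ \text{DP units} \times 2\ (\text{FMA}) \times 0.732\ \text{GHz} \approx 1311\ \text{Gflops}
    \text{Memory: } 4\ \text{DDR3-1600 channels} \times 8\ \text{B} \times 1.6\ \text{GT/s} = 51.2\ \text{GB/s per socket}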

Network performance characteristics

                                            Phase I        Phase II
  Number of cabinets                        12             28
  Number of groups                          6              14
  Number of Aries network & router chips    576            1,344
  Number of optical ports (max)             1,440          3,360
  Number of optical ports (connected)       360            3,276
  Number of optical cables                  180            1,638
  Bandwidth of optical cables               6,750 GB/s     61,425 GB/s
  Bisection bandwidth                       4,050 GB/s     33,075 GB/s
  Point-to-point bandwidth                  8.5-10 GB/s    8.5-10 GB/s
  Global bandwidth per compute node         3 GB/s         11.6 GB/s
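The global-bandwidth figures are consistent with the optical-cable totals (my division, not from the slide):

    6750\ \text{GB/s} / 2256\ \text{nodes} \approx 3.0\ \text{GB/s per node (Phase I)}
    61425\ \text{GB/s} / 5272\ \text{nodes} \approx 11.65\ \text{GB/s per node (Phase II)}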

Co-designed Code Development Interfaces

Phase I: C, C++, Fortran; OpenMP; communication libraries: MPI 3.0, PGAS; scientific libraries: Libsci, MKL.
Phase II: C, C++, Fortran; OpenMP 4.0, OpenACC, CUDA, OpenCL; MPI 3.0, PGAS, and GPU-to-GPU (G2G) MPI; Libsci, MKL, Libsci_acc, and CUDA SDK libraries.
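A minimal sketch of the Phase II accelerated-library path, using standard cuBLAS from the CUDA SDK (matrix size and values are illustrative; Libsci_acc plays the analogous role on the Cray side):

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void) {
        const int n = 1024;
        const double alpha = 1.0, beta = 0.0;
        double *A, *B, *C;

        /* managed memory keeps the sketch short; codes of the era
           typically used cudaMalloc plus explicit host/device copies */
        cudaMallocManaged(&A, (size_t)n * n * sizeof(double));
        cudaMallocManaged(&B, (size_t)n * n * sizeof(double));
        cudaMallocManaged(&C, (size_t)n * n * sizeof(double));
        for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        cublasHandle_t h;
        cublasCreate(&h);
        /* C = alpha*A*B + beta*C, column-major n x n matrices */
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
        cudaDeviceSynchronize();

        printf("C[0] = %.1f (expect %.1f)\n", C[0], 2.0 * n);
        cublasDestroy(h);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }

This is the BLAS3 motif from the arithmetic-intensity slide: dense matrix-matrix products are exactly the workload the K20X's flop rate serves best.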

Co-designed Tools and Execution Environment

Phase I: debuggers and performance tools covering OpenMP and MPI; execution environment with task and thread affinity and CCM.
Phase II: debuggers and performance tools extended to OpenACC and CUDA, plus power measurement; execution environment adds GPU modes and RUR.

Co-designed System Interfaces and Tools

Phase I: resource management via SLURM; diagnostics & monitoring via NHC (CPU).
Phase II: SLURM extended with RUR for CPU and GPU and with GPU modes; NHC for CPU and GPU; PMDB for CPU, network, and GPU; TDK.

Adaptive Programming and Execution Models

[Figure: three hybrid-node diagrams (48-port router, NIC0-NIC3, H-PDC daughter cards), one per model:]
- Data Centric (I): offload model; data lives with the CPU and kernels are offloaded to the GPU
- Data Centric (II): CPU-resident data
- Data Centric (III): GPU-resident data (+ CPU*), with GPUDirect-RDMA and GPU-enabled I/O
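A minimal sketch of the offload model in Data Centric (I) (kernel and sizes are illustrative): the host owns the data, copies it to the device, launches a kernel, and copies the result back.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void scale(double *x, double a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(double);
        double *h = (double *)malloc(bytes), *d;
        for (int i = 0; i < n; ++i) h[i] = 1.0;

        /* offload: copy in, compute on the GPU, copy out */
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d, 2.0, n);
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

        printf("h[0] = %.1f (expect 2.0)\n", h[0]);
        cudaFree(d);
        free(h);
        return 0;
    }

Data Centric (III) inverts this picture: the data stays resident on the GPU and, with GPUDirect-RDMA and GPU-enabled I/O, the network and file system read and write it in place.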

Co-design potential for adaptive XC30
- GPU for visualization (CUG 2014 paper by Mark Klein, NCSA)
- Extensions to programming environments
- Modular development tools (e.g. Clang/LLVM)
- Improvements to GPU diagnostics and monitoring
- Collaboration with Cray & NVIDIA
- Beyond GPUDirect: letting the GPU access other resources directly

Thank you