CRAY XK6 REDEFINING SUPERCOMPUTING - Sanjana Rakhecha - Nishad Nerurkar

CONTENTS Introduction History Specifications Cray XK6 Architecture Performance Industry acceptance and applications Summary

INTRODUCTION The Cray XK6 supercomputer is a trifecta of scalar, network and many-core innovation. It is a hybrid supercomputer combining Cray's Gemini interconnect, AMD's leading multi-core scalar processors and NVIDIA's powerful many-core GPU processors. An enhanced version of the XE6, it uses the same blade architecture as the Cray XE6 and is capable of scaling to 500,000 scalar processors and 50 petaflops of hybrid peak performance.

HISTORY In 1988, Cray Research introduced the Cray Y-MP, the first supercomputer to sustain over 1 gigaflop on many applications. The Intel Paragon, with 1,000 to 4,000 Intel i860 processors, was ranked the fastest in the world in 1993. Fujitsu's Numerical Wind Tunnel supercomputer used 166 vector processors to take the top spot in 1994, with a peak speed of 1.7 gigaflops per processor. The Hitachi SR2201 reached a peak performance of 600 gigaflops in 1996 using 2,048 processors.

SUPERCOMPUTER STATISTICS

COMPARISON WITH THE PRESENT CRAY SUPERCOMPUTERS

CRAY XK6 - ARCHITECTURE Four nodes per blade; adaptive hybrid computing; scalable compute nodes and I/O; Gemini mezzanine card; plug compatible with the Cray XE6 blade; configurable processor, memory and SXM GPU. AMD Opteron 6200 Series processor: a highly associative on-chip data cache supports aggressive out-of-order execution, and the integrated memory controller gives a significant performance advantage to many algorithms. NVIDIA Tesla 20-series GPU: based on the next-generation CUDA GPU architecture codenamed Fermi.

NODE ARCHITECTURE

XK6 ACCELERATOR BLADE

GEMINI INTERCONNECTION NETWORK

GEMINI INTERCONNECTION NETWORK Each node acts as two nodes on a 3D torus. Each node is provided with a high-radix YARC router supporting up to 168 Gbps. Parallel electrical and optical paths give high bandwidth and lower latency for both long and short messages, at a low cost of integration. The Gemini mezzanine card avoids memory-to-interconnect bottlenecks.

NVIDIA TESLA X2090 A special embedded form-factor version of the Tesla M2090. Provides high-performance computing for highly parallel applications: 512 CUDA cores with 6 GB of GDDR5 memory, supporting 600+ GFLOPS. High bandwidth to the host gives quick master-slave (CPU-GPU) communication. CUDA capable for easy programmability.
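
To make "CUDA capable for easy programmability" concrete, here is a minimal, hypothetical CUDA sketch (not taken from the presentation) of the basic offload pattern a programmer uses on a device like the X2090: allocate GPU memory, copy input over the host link, launch a kernel across many threads, and copy the result back.

// Illustrative sketch only: vector addition with explicit host<->device copies.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   // host -> GPU
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vec_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);    // kernel launch over the GPU's cores
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);   // GPU -> host

    printf("hc[0] = %f\n", hc[0]);                       // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}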

CRAY XK6 CABINETS Each cabinet holds up to 96 processors, paired (one CPU and one GPU per node) on XE6-compatible blades. With 1,536 CPU cores, a cabinet can deliver 70+ TFLOPS of performance.
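
As a rough consistency check of those figures (assuming, as was typical for an XK6 node, one 16-core Opteron 6200 and one X2090-class GPU with a commonly quoted double-precision peak of about 665 GFLOPS; neither number appears on the slide):

\[
96 \;\text{nodes} \times 16 \;\text{cores/node} = 1536 \;\text{CPU cores},
\qquad
96 \times 0.665 \;\text{TFLOPS} \approx 63.8 \;\text{TFLOPS from the GPUs alone},
\]

so adding the CPUs' contribution brings a cabinet past the quoted 70+ TFLOPS.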

SPECIFICATIONS

PERFORMANCE - LUDWIG Run on 10 cabinets of Cray XK6 with 936 GPUs (one per node). Only 4% deviation from perfect scaling between 8 and 936 GPUs; the application sustains 40+ Tflop/s and is still scaling. Strong scaling is also very good, but physicists want to simulate larger systems.
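
Since the slide contrasts this result with strong scaling, the 8-to-936-GPU figure presumably refers to weak scaling (problem size grown with the number of GPUs). In that reading, a 4% deviation corresponds to a parallel efficiency of roughly 96%:

\[
E_{\text{weak}}(N) = \frac{T(8)}{T(N)} \approx 1 - 0.04 = 0.96 \quad\text{for } N = 936 .
\]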

PERFORMANCE - HIMENO A parallel 3D Poisson equation solver benchmark with an iterative loop evaluating a 19-point stencil. A Co-Array Fortran version of the code was fully ported to the accelerators using 27 directive pairs. Under strong scaling, data movement between host and GPU can become a bottleneck; asynchronous GPU data transfers and kernel launches help hide this overhead.
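
The remedy mentioned above can be sketched in CUDA. The Himeno port itself used Co-Array Fortran with compiler directives, so the code below is only an illustrative, hypothetical example of enqueuing asynchronous host-device transfers and a kernel launch in a stream so the host thread stays free for other work, with a toy 3-point stencil standing in for the 19-point one.

// Illustrative sketch only: overlapping async transfers and kernel launches via a CUDA stream.
#include <cuda_runtime.h>

__global__ void stencil_iteration(float *dst, const float *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        dst[i] = 0.5f * src[i] + 0.25f * (src[i - 1] + src[i + 1]);  // toy 1D 3-point stencil
}

int main(void) {
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);

    float *h_buf;                      // pinned host memory is required for truly async copies
    cudaMallocHost(&h_buf, bytes);
    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Enqueue copy-in, kernel and copy-out in one stream: the host thread returns
    // immediately and could meanwhile post MPI halo exchanges or other work.
    cudaMemcpyAsync(d_in, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    stencil_iteration<<<(n + 255) / 256, 256, 0, stream>>>(d_out, d_in, n);
    cudaMemcpyAsync(h_buf, d_out, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);     // wait only when the result is actually needed

    cudaStreamDestroy(stream);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_buf);
    return 0;
}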

INDUSTRIAL ACCEPTANCE Oak Ridge National Laboratory (Jaguar/Titan): high computation capacity for scientific research; 200 cabinets with more than 18,000 nodes; an estimated 10-20 PFLOPS. ORNL is currently upgrading from the XT5-based Jaguar system to the XK6-based Titan system for increased performance.

INDUSTRIAL ACCEPTANCE CSCS (Swiss National Supercomputing Centre): a Cray XE6 with 402 Tflops, 1,496 nodes and Gemini interconnects, plus a Cray XK6 with 176 nodes, each containing one AMD processor and one GPU.

SUMMARY Higher supercomputing potential with GPU-accelerated computing. Better inter-node communication with the Gemini interconnect. Backward compatible with XE6 cabinets and can be merged with XE6 systems. Highly suited to scientific research computations requiring computational power on the order of hundreds of TFLOPS.

REFERENCES http://www.cray.com/products/xk6/xk6.aspx ; Cray XK6 brochure (CrayXK6Brochure.pdf) ; http://en.wikipedia.org/wiki/supercomputer ; http://i.top500.org/stats ; Roberto Ansaloni, Applications on Cray XK6