Cori (2016) and Beyond: Ensuring NERSC Users Stay Productive


1 Cori (2016) and Beyond: Ensuring NERSC Users Stay Productive. Nicholas J. Wright, Advanced Technologies Group Lead. Heterogeneous Multi-Core 4 Workshop, 17 September.

2 NERSC Systems Today
- Edison (Cray XC30): 2.39 PF, 333 TB RAM; 5,192 nodes, 125K cores; 7.6 PB local scratch at 163 GB/s.
- Hopper (Cray XE6): 1.3 PF, 212 TB RAM; 6,384 nodes, 150K cores; 2.2 PB local scratch at 70 GB/s.
- Global scratch: 3.6 PB (5 x SFA12KE). /project: 5 PB (DDN9900 & NexSAN). /home: 250 TB (NetApp 5460).
- HPSS archive: 50 PB stored, 240 PB capacity, 20 years of community data.
- Data-intensive systems (Carver, PDSF, JGI, KBASE, HEP), vis & analytics, data transfer nodes, advanced architecture testbeds, and science gateways, attached via an Ethernet & IB fabric.
- Science-friendly security, production monitoring, power efficiency, software-defined networking; WAN connectivity at 2 x 10 Gb and 1 x 100 Gb.

3 NERSC-8 (Cori) Mission Need. The Department of Energy Office of Science requires an HPC system to support the rapidly increasing computational demands of the entire spectrum of DOE SC computational research. Provide a significant increase in computational capabilities: at least 10 times the sustained performance of the Hopper system on a set of representative DOE benchmarks. Delivery in the 2015/2016 time frame. Provide high bandwidth access to existing data stored by continuing research projects. The platform needs to begin to transition users to more energy-efficient many-core architectures.

4 Many Core Architectures: Why Now? [Figure: single-processor performance relative to the VAX-11/780, from Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 5th Ed., annotated with the power wall and ILP wall.] For decades, exponential performance increases meant you did not rewrite software, you just bought a new machine, on the assumption that Moore's Law would continue. That no longer holds: to sustain historic performance growth, the DOE community must prepare for new architectures.

5 Why can't we keep doing what we've been doing? The optimization target for hardware has evolved in a new direction, but programming models have not kept up. Old constraints versus new constraints:
- Performance limiter: peak clock frequency as the primary limiter for performance improvement → power is the primary design constraint for future HPC system design.
- Cost: FLOPs are the biggest cost for a system, so optimize for compute → data movement dominates, so optimize to minimize data movement.
- Concurrency: modest growth of parallelism by adding nodes → exponential growth of parallelism within chips.
- Scaling: maintain byte-per-flop capacity and bandwidth → compute growing 2x faster than capacity or bandwidth.
- Locality: MPI+X model with uniform costs within a node and between nodes → must reason about data locality and possibly topology.
- Uniformity: assume uniform system performance → heterogeneity: architectural and performance non-uniformity increase.
- Reliability: it's the hardware's problem → cannot count on hardware protection alone.
[Figure: picojoules per operation, from a DP FLOP and register access through on-chip/CMP, intranode/SMP, and internode/MPI communication; energy cost rises by orders of magnitude as data moves farther.]
This fundamentally breaks our current programming paradigm and computing ecosystem.
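As a concrete illustration of what "optimize to minimize data movement" means at the code level, here is a minimal sketch (not from the slides; BLOCK and the function names are illustrative) contrasting a naive and a cache-blocked matrix transpose in C:

```c
/* Illustrative sketch of locality-aware restructuring. The naive loop
 * writes dst with stride n, wasting most of each fetched cache line;
 * the blocked version works on BLOCK x BLOCK tiles that fit in cache,
 * so every line brought in is fully used before eviction. */
#include <stddef.h>

#define BLOCK 64 /* tile size; tune to the target cache (assumption) */

void transpose_naive(size_t n, const double *src, double *dst)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j]; /* strided writes */
}

void transpose_blocked(size_t n, const double *src, double *dst)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

Both versions do identical arithmetic; only the order of memory accesses changes, which is exactly the axis the new constraints reward.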

6 The Cori System. [System diagram.] Cray Cascade, 64 cabinets: 9,304 KNL compute nodes; 1,920 Haswell compute nodes; 384 Burst Buffer nodes; 32 DSL nodes; 28 MOM nodes; 2 boot and 2 SDB nodes; 10 RSIP nodes; 4 network nodes; 32 DVS server nodes; 68 LNET routers (32 FDR IB to NGF, 136 FDR IB to the PFS). Management and access: SMW, FC switch, boot RAID, boot cabinet with esLogin, redundant core IB switches; 14 esLogin nodes and 2 esMS nodes connected to the NERSC network via 40 GigE and GigE switches. Lustre file system: 96 OSS nodes on 12 DDN WolfCreek scalable units, 2 MGS nodes, 4 MDS nodes, 5 NetApp E2700 arrays.

7 Cori Configuration
- 64 cabinets of Cray XC system
- Over 9,300 Knights Landing compute nodes; self-hosted (not an accelerator); greater than 60 cores per node with four hardware threads each; GB memory per node; high bandwidth on-package memory
- Over 1,900 Haswell compute nodes (data partition)
- 14 external login nodes
- Aries interconnect (same as on Edison)
- 10x Hopper sustained performance using the NERSC SSP metric
- Lustre file system: 28 PB capacity, 432 GB/sec peak performance
- NVRAM Burst Buffer for I/O acceleration
- Significant Intel and Cray application transition support
- Delivery in mid-2016; installation in the new LBNL CRT

8 Stand-Alone Processor vs. Co-Processor. The older Knights Corner is a co-processor; the next-generation Knights Landing is a stand-alone processor, with no offloading bottlenecks.

9 Intel Knights Landing Processor
- Next-generation Xeon Phi, >3 TF peak
- Single socket processor: self-hosted, not a co-processor, not an accelerator
- Greater than 60 cores per processor with support for four hardware threads each; more cores than the current-generation Intel Xeon Phi
- Intel "Silvermont" architecture enhanced for high performance computing
- 512b vector units (32 flops/clock, AVX-512); see the vectorization sketch below
- 3X single-thread performance over the current-generation Xeon Phi co-processor
- High bandwidth on-package memory, up to 16 GB capacity, with bandwidth projected to be 5X that of DDR4 DRAM memory
- Higher performance per watt
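Wide vector units only pay off if the hot loops actually vectorize. A minimal sketch, assuming a compiler with OpenMP 4.0 SIMD support (build with e.g. -fopenmp or an equivalent SIMD flag), of annotating a kernel so the compiler can target 512-bit lanes:

```c
/* Illustrative sketch: with 64-bit doubles, a 512-bit vector register
 * holds 8 elements, so each vector iteration advances i by 8.
 * restrict tells the compiler x and y do not alias; the simd pragma
 * asserts the loop is safe to vectorize. */
#include <stddef.h>

void daxpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```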

10 Knights Landing Integrated On-Package Memory
- Cache model: let the hardware automatically manage the integrated on-package memory as an L3 cache between the KNL CPU and external DDR.
- Flat model: manually manage how your application uses the integrated on-package memory and external DDR for peak performance (see the sketch below).
- Hybrid model: harness the benefits of both cache and flat models by segmenting the integrated on-package memory.
[Diagram: near high-bandwidth in-package memory on the CPU package; far DDR on the PCB; top and side views.]
Maximum performance through higher memory bandwidth and flexibility.
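In the flat model the application places data explicitly. One common way to do this on KNL-era systems, not prescribed by the slides, was the memkind library's hbwmalloc interface; this sketch assumes memkind is installed (link with -lmemkind):

```c
/* Sketch using memkind's hbwmalloc API (hbw_malloc / hbw_free):
 * put only the bandwidth-critical array in MCDRAM, keep bulk
 * capacity data in ordinary DDR. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1 << 24;

    /* Bandwidth-critical working set: request high-bandwidth memory. */
    double *hot = hbw_malloc(n * sizeof *hot);
    /* Capacity data: ordinary DDR allocation. */
    double *cold = malloc(n * sizeof *cold);
    if (!hot || !cold) { perror("alloc"); return 1; }

    for (size_t i = 0; i < n; i++)
        hot[i] = cold[i] = (double)i;

    /* hbw_check_available() returns 0 when HBW memory is usable. */
    printf("hbw available: %s\n", hbw_check_available() == 0 ? "yes" : "no");

    hbw_free(hot);
    free(cold);
    return 0;
}
```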

11 Programming Model Considerations
- Knights Landing is a self-hosted part: users can focus on adding parallelism to their applications without concerning themselves with PCI-bus transfers.
- MPI + OpenMP is the preferred programming model and should enable NERSC users to make robust code changes (see the skeleton below).
- MPI-only will work, but performance may not be optimal.
- On-package MCDRAM: how to use it optimally, explicitly or implicitly?
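A minimal skeleton of the hybrid model this slide recommends (illustrative; only standard MPI and OpenMP calls are used; on a Cray system the cc compiler wrapper supplies MPI, add -fopenmp or equivalent for threading):

```c
/* Hybrid MPI + OpenMP sketch: typically one MPI rank per node or NUMA
 * domain, with OpenMP threads filling the cores within each rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        printf("rank %d/%d, thread %d/%d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```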

12 Beyond Cori

13 DOE Fast Forward and Design Forward Programs
Objective: accelerate the movement of innovative ideas from computer architecture research into products through public/private partnerships.
- Evaluate advanced research concepts and develop quantitative evidence of their benefit for DOE applications (using DOE Codesign Center apps)
- Engage DOE applications teams to understand technology trends and constraints (how they impact their code development)
- Understand how to program these new features
- Develop strong evidence to promote adoption by product teams
Fast Forward technology areas: Processor: AMD, Intel, NVIDIA; Memory: IBM, AMD; Storage: WhamCloud.
Design Forward: Interconnects architecture (review next week at LBNL, July 22-24); System architecture (under review).

14 Abstract Machine Model for Exascale. [Diagram: a node with a few fat cores and many thin cores/accelerators grouped into coherence domains; 3D-stacked memory (low capacity, high bandwidth) alongside DRAM and NVRAM (high capacity, low bandwidth); an integrated NIC for off-chip communication.]

15 Summary
- NERSC's goal is to provide usable exascale computing; Cori is the first step.
- Exciting technology: the processor and Burst Buffer represent first steps along the path to exascale, with extensive vendor and community support.
- Disruptive technology changes are coming; understand how they will affect you!
- Exascale-era architectures are starting to crystallize.
