HPC Architectures evolution: the case of Marconi, the new CINECA flagship system. Piero Lanucara
1 HPC Architectures evolution: the case of Marconi, the new CINECA flagship system Piero Lanucara
2 BG/Q (Fermi) as a Tier-0 resource. Many advantages as a supercomputing resource: low energy consumption, limited floor-space requirements, a fast internal network, and a homogeneous architecture with a simple usage model. But the low single-core performance and the I/O structure meant that very high parallelism was necessary (at least 1024 cores). For some applications the low memory per core (1 GB) and the I/O performance were also a problem, as were the limited capabilities of the OS on the compute nodes (e.g. no interactive access). Cross-compilation, because the login nodes differ from the compute nodes, can complicate some build procedures. FERMI was scheduled to be decommissioned in mid-to-late 2016.
3 Replacing Fermi at Cineca - considerations. A new procurement is a complicated process that weighs many factors, which must include (together with the price): minimum peak compute power, power consumption, floor space required, availability, disk space, internal network, etc. IBM no longer offers the BlueGene range for supercomputers, so it cannot be the solution. Many computing centres are instead adopting a heterogeneous model for their clusters.
4 Replacing Fermi: the Marconi solution (roadmap figure). Fermi: 2 PFlops, 0.8 MWatt. Galileo & PICO: 1.2 PFlops, 0.4 MWatt. Marconi A1: 2.1 PFlops, 0.7 MWatt. Marconi A2: 11 PFlops, 1.3 MWatt. Marconi A3: 7 PFlops, 1.0 MWatt. System A4 (?): 50 PFlops, 3.2 MWatt. The roadmap also gives, per phase, the total power (1.2 / 2.4 / 2.3 / 2.3 / 3.2 MWatt), rack count (50 / 120 / 120 / 120 / 150 racks) and floor space (100 / 240 / 240 / 240 / 300 m2), plus procurement milestone labels (Alpha, Volume, Beta, Commit, Wins).
5 Marconi high-level system characteristics (tender proposal).
  A1 - Broadwell (2.1 PFlops): installed April 2016, Intel Xeon E5 v4, 1512 nodes.
  A2 - Knights Landing (11 PFlops): installed September 2016, Intel KNL, 3600 nodes.
  A3 - Skylake (7 PFlops expected): installed June 2017, Intel Xeon E5 v5, >1500 nodes.
  Network: Intel Omni-Path.
6 System overview (diagram): 1 PFs non-conventional (KNL), 5 PFs conventional, 1 PFs conventional. A1 BRD: 2x18 cores, 2.3 GHz; 1500 nodes, 2 PFs. A2 KNL: 68 cores, 1.4 GHz; 3600 nodes, 11 PFs. A3 SKL: 2x20 cores, 2.3 GHz; >1500 nodes, 7 PFs.
7 Storage. GSS*: 5 PB scratch area (50 GB/s); by January 2017: 10 PB long-term storage (40 GB/s); 20 PB tape library. (Compute partitions as in the previous slide.)
8 Marconi - Compute. A1 (half reserved to EUROfusion): 1512 Lenovo NeXtScale servers -> 2 PFlops; Intel Xeon E5 v4 (Broadwell), 2.3 GHz, dual-socket nodes: 36 cores and 128 GByte per node. A2 (1 PFlops to EUROfusion): 3600 Intel Adams Pass servers -> 11 PFlops; Intel Xeon Phi, code name Knights Landing (KNL), 1.4 GHz, single-socket nodes: 96 GByte DDR4 + 16 GByte MCDRAM. A3 (greater part reserved to EUROfusion): >1500 Lenovo Stark servers -> 7 PFlops; Intel E5-2680 v5 (Skylake), 2.3 GHz, dual-socket nodes: 40 cores and 196 GByte per node.
11 Marconi - Network. Network type: new Intel Omni-Path - the largest Omni-Path cluster in the world. Network topology: fat tree with 2:1 oversubscription, tapering at the level of the core switches only. Core switches: 5 OPA core switches (Sawtooth Forest), 768 ports each. Edge switches: 216 OPA edge switches (Eldorado Forest), 48 ports each. Maximum system configuration: 5 (OPA) x 768 (ports) x 2 (tapering) -> 7680 servers.
12 Network diagram: 5 x 768-port core switches; 216 x 48-port edge switches; 32 downlinks per edge switch, giving fully interconnected 32-node islands; 6624 compute nodes in total.
13 A1 HPL. Full-system Linpack: 1 MPI task per node. Maximum performance (in the PFlops range) obtained with Turbo OFF; with Turbo ON the system throttles. June 2016: number 46 in the Top500 list.
14 A2 HPL. Full-system Linpack: 3556 nodes, 1 MPI task per node. Maximum performance obtained with Hyper-Threading OFF. November 2016: number 12 in the Top500 list. (HPL output log: run started Fri Nov 4 23:10, finished Sat Nov 5 06:33; residual check ||Ax-b||_oo / (eps * (||A||_oo * ||x||_oo + ||b||_oo) * N) PASSED; 1 test completed and passed residual checks, 0 failed, 0 skipped.)
15 Intel Xeon PHI KNC. An accelerator (like a GPU) but more similar to a conventional multicore CPU. The current version, Knights Corner (KNC), has 61 low-frequency cores, 8-16 GB of RAM and a 512-bit vector unit per core. Cores are connected in a ring topology and MPI is possible. No need to write CUDA or OpenCL, as the Intel compilers will compile Fortran or C code for the MIC. 1-2 TFlops, according to model.
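As a minimal illustration of that last point (a generic sketch, not code from the talk), the same C + OpenMP source runs on a standard Xeon and, with the Intel compiler, can also be built natively for the coprocessor, commonly via the -mmic flag, with no CUDA or OpenCL involved:

    #include <stdio.h>
    #include <omp.h>
    #define N 1000000

    /* Plain C + OpenMP: build for the host (icc -qopenmp) or
     * natively for KNC (icc -qopenmp -mmic) from the same source. */
    int main(void)
    {
        static double a[N], b[N], c[N];

        #pragma omp parallel for
        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[0] = %f using up to %d threads\n", c[0], omp_get_max_threads());
        return 0;
    }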
16 A2: Knights Landing (KNL). A big unknown, because very few people currently have access to KNL, but we know the architecture of KNL and its differences from, and similarities to, KNC. The main differences are: KNL will be a standalone processor, not an accelerator (unlike KNC); KNL has more powerful cores and a faster internal network; on-package high-performance memory (16 GB MCDRAM).
17 Intel Xeon Phi KNC-KNL comparison (KNC on Galileo vs KNL on Marconi):
  #cores: 61 (Pentium-based) vs 68 (Atom-based)
  Core frequency: ... vs 1.4 GHz
  Memory: 16 GB GDDR5 vs 96 GB DDR4 + 16 GB MCDRAM
  Internal network: bi-directional ring vs mesh
  Vectorisation: 512-bit unit per core vs 2x AVX-512 per core
  Usage: co-processor vs standalone
  Performance (GFlops): 1208 (DP) / 2416 (SP) vs ~3000 (DP)
  Power: ~300 W vs ~200 W
A KNC core can be 10x slower than a Haswell core; a KNL core is expected to be only 2-3x slower. There are also big differences in memory bandwidth.
20 Coming next: A3. Intel Skylake processors (mid-2017), the successor to the Haswell generation. Expect an increase in performance and power efficiency.
21 Coming next: A3
22 Marconi A1 and A2 exploitation at its best
23 Exploiting the parallel universe. Three levels of parallelism are supported by Intel hardware. Thread-level parallelism: multi-thread/task performance, exposed by programming models, executing tens/hundreds/thousands of tasks concurrently. Vector-level parallelism: single-thread performance, exposed by tools and programming models, operating on 4/8/16 elements at a time. Instruction-level parallelism: single-thread performance, automatically exposed by hardware/tools, effectively limited to a few instructions.
24 A1 exploitation. A1: Broadwell nodes, similar to the Haswell cores present on Galileo. Expect only a small difference in single-core performance wrt Galileo, but a big difference compared to Fermi. More cores per node (36) should mean better OpenMP performance, but MPI performance will also improve (faster network). Life is much easier for SPMD programming models. Cores/node: 36; memory/node: 128 GB. Use SIMD vectorization.
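A minimal sketch of the hybrid SPMD model mentioned above (generic code, not taken from the slides): for example one MPI rank per socket with OpenMP threads filling the remaining cores of a 36-core Broadwell node.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* Hybrid MPI + OpenMP skeleton, e.g. 2 ranks per node x 18 threads each. */
    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            #pragma omp single
            printf("rank %d is running %d OpenMP threads\n",
                   rank, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }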
25 Single Instruction Multiple Data (SIMD) vectorization. A technique for exploiting VLP on a single thread: operate on more than one element at a time, which might decrease instruction counts significantly. Elements are stored in SIMD registers (vectors). Code needs to be vectorized, usually on inner loops; a main loop and a remainder loop are generated.

  Scalar loop:
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

  SIMD loop (4 elements at a time, pseudo-notation):
    for (int i = 0; i < N; i += 4)
        c[i:4] = a[i:4] + b[i:4];
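In practice the SIMD loop is rarely written by hand; a portable way to request it (a generic sketch, assuming an OpenMP 4.0 compiler) is the omp simd directive, which maps to AVX2 lanes (4 doubles) on Broadwell and to AVX-512 lanes (8 doubles) on KNL:

    /* Explicitly ask the compiler to vectorise the inner loop. */
    void vec_add(double *restrict c, const double *restrict a,
                 const double *restrict b, int n)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }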
26 A2 exploitation. A2: KNL. (More) similar to the KNC coprocessors present on Galileo, but with remarkable differences: high-bandwidth MCDRAM available; AVX-512 ISA; binary compatible with standard Xeon. Mixing MPI and OpenMP is still the best choice (performance is less dependent on OpenMP). Cores/node: 68; memory/node: 96 GByte DDR4 + 16 GByte MCDRAM. Use SIMD vectorization.
27 A2: MCDRAM. Memory bandwidth is one of the most common bottlenecks for HPC performance. To meet this demand for memory bandwidth, KNL has an on-package high-bandwidth memory (HBM) based on multi-channel dynamic random access memory (MCDRAM). This memory can deliver up to 5x the performance (~400 GB/s) of the DDR4 memory on the same platform (~90 GB/s).
28 A2: MCDRAM. HBM on KNL can be used as: 1. a last-level cache (LLC), or 2. addressable memory. The configuration is determined at boot time by choosing one of three MCDRAM modes in the BIOS settings: 1. flat mode, 2. cache mode, 3. hybrid mode.
29 A2: MCDRAM The best mode to use will depend on the application.
30 A2: Using HBM as addressable memory. Two methods for this: the numactl tool (works best if the whole app can fit in MCDRAM) and the memkind library (using library calls or compiler directives; needs source modification).
31 A2: Using numactl to access MCDRAM. Run "numactl --hardware" to see the NUMA configuration of your system and look for the node that has memory but no CPUs: that is the MCDRAM. If the total memory footprint of your app is smaller than the size of MCDRAM (check the RSS value with "ps u -C myapp"), use numactl to allocate all of its memory from MCDRAM: numactl --membind=mcdram_id myapp, where mcdram_id is the ID of the MCDRAM "node". If the total memory footprint of your app is larger than the size of MCDRAM, you can still use numactl to allocate part of your app in MCDRAM: numactl --preferred=mcdram_id myapp (allocations that don't fit into MCDRAM spill over to DDR) or numactl --interleave=nodes myapp (allocations are interleaved across all nodes).
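A quick way to see the effect of the binding (a generic micro-benchmark, not from the talk) is to run a triad-style kernel twice, once bound to the DDR node and once to the MCDRAM node, e.g. numactl --membind=0 ./triad versus numactl --membind=1 ./triad, where the actual node IDs come from numactl --hardware:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* STREAM-triad-like kernel; the ~3 GiB working set fits in 16 GB MCDRAM. */
    int main(void)
    {
        const long n = 1L << 27;            /* 2^27 doubles = 1 GiB per array */
        double *a = malloc(n * sizeof(double));
        double *b = malloc(n * sizeof(double));
        double *c = malloc(n * sizeof(double));
        if (!a || !b || !c) return 1;

        #pragma omp parallel for
        for (long i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] = b[i] + 3.0 * c[i];
        double t1 = omp_get_wtime();

        printf("a[0] = %.1f, triad bandwidth: %.1f GB/s\n",
               a[0], 3.0 * n * sizeof(double) / (t1 - t0) / 1e9);
        free(a); free(b); free(c);
        return 0;
    }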
32 A2: Using Memkind to access MCDRAM. The memkind library is a user-extensible heap manager built on top of jemalloc, a C library for general-purpose memory allocation. The library is generalizable to any NUMA architecture, but on Knights Landing processors it is used primarily for manual allocation to HBM: it provides special allocators for C/C++ and has limited support for Fortran.
33 A2: Using Memkind - C case.
  Allocate 1000 floats from DDR:
    float *fv;
    fv = (float *)malloc(sizeof(float) * 1000);
  Allocate 1000 floats from MCDRAM:
    float *fv;
    fv = (float *)hbw_malloc(sizeof(float) * 1000);
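A complete, compilable version of the two snippets above (a sketch assuming memkind's hbwmalloc interface: include hbwmalloc.h and link with -lmemkind):

    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>      /* hbw_malloc / hbw_free / hbw_check_available */

    int main(void)
    {
        const size_t n = 1000;

        /* Warn if no MCDRAM is visible; hbw_malloc would then fall back to DDR. */
        if (hbw_check_available() != 0)
            fprintf(stderr, "warning: no high-bandwidth memory detected\n");

        float *ddr = (float *)malloc(n * sizeof(float));      /* regular DDR4 */
        float *hbm = (float *)hbw_malloc(n * sizeof(float));  /* MCDRAM */
        if (!ddr || !hbm) return 1;

        for (size_t i = 0; i < n; i++)
            hbm[i] = ddr[i] = (float)i;

        printf("hbm[42] = %f\n", hbm[42]);
        free(ddr);
        hbw_free(hbm);
        return 0;
    }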
34 A2: Using Memkind - Fortran case.
  c     Declare arrays to be dynamic
        REAL, ALLOCATABLE :: A(:), B(:), C(:)
  !DEC$ ATTRIBUTES FASTMEM :: A
        NSIZE = 1024
  c     Allocate array 'A' from MCDRAM
        ALLOCATE (A(1:NSIZE))
  c     Allocate arrays that will come from DDR
        ALLOCATE (B(NSIZE), C(NSIZE))
37 Porting Applications onto Marconi A1 and A2. Application developers must utilise vectorisation (SIMD) and/or MPI (OpenMP) processes (threads), and possibly the fast memory of KNL (MCDRAM). From the user perspective, success will depend on how well software tools and applications can exploit the KNL architecture. Key idea: use a proxy (KNC): SPECFEM3D_GLOBE. Already reasonable results with KNC in the framework of the activity: a good amount of vectorisation (FORCE_VECTORIZATION preprocessing enabled and SIMD optimisation) suitable for KNC and the future KNL, and scaling to a high number of OpenMP threads (up to more than 60 on KNC). Worth noting that up to now KNC hasn't been widely supported by geophysical application developers and users; a remarkable exception is the SPECFEM3D_GLOBE software (CIG repository), where the native version is maintained and tested. Again, this should be fine for its KNL startup.
38 Good practice: vectorization (SPECFEM3D_GLOBE KNC investigation). SPECFEM3D_GLOBE Regional_MiddleEast test case, forward simulation: elapsed times on Haswell (Galileo) and KNC (Galileo) with the speedup wrt Haswell; a comparison without vectorisation gives the slowdown factor wrt the vectorised runs (about 2x on KNC). Based on a 4-node Galileo partition (16 MPI processes; 4 and 60 OpenMP threads on Haswell and KNC respectively).
39 Good practice: choosing MCDRAM memory modes. MCDRAM cache mode should be a good choice for most applications, but applications with cache-unfriendly data are candidates for the other memory modes. In that case, try to identify what to put in MCDRAM: timings with selected data items allocated in fast/slow memory (intrusive), or memory profiling (VTune Amplifier tool).
40 Conclusions. Marconi A1: moderate improvements over recent years, but a big improvement compared to Fermi. High expectations for Marconi A2 (KNL) performance. KNC paves the way for increasing performance: try to manage domain parallelism, increase threading, exploit data parallelism (vectorisation) and improve data locality (a new opportunity: use the on-package memory).