Intel Architecture for HPC


1 Intel Architecture for HPC Georg Zitzlsberger 1st of March 2018

2 Agenda
- Salomon Architectures
  - Intel® Xeon® processors v3 (Haswell)
  - Intel® Xeon Phi™ coprocessor (KNC)
- Other Intel Architectures
  - Knights Landing (KNL)
  - Skylake Server (SKX/SKSP)
  - SIMD and Intel AVX512
  - Integrated Graphics (igfx)
- Where to get the Information from?
- General Performance Considerations
- How to Measure Performance?


3 Architectures Available (Images: IT4Innovations)
Host: Haswell
- Intel® Xeon® E5-2680v3 (2.5 GHz), 12 cores
- 2 sockets (24 cores total)
- RAM: 128 GB (5.3 GB per core)
- Nodes: 576 & 432 (w/o & w/ KNC)
- Interconnect: InfiniBand FDR56
- SKU description on ark.intel.com
Coprocessor: Knights Corner (KNC)
- First-generation Intel® Xeon Phi™ Coprocessor 7120P (1.24 GHz), 61 cores
- RAM: 16 GB (≈260 MB per core)
- 2 coprocessors per node
- Interconnect: PCIe 2.0 (via host)
- SKU description on ark.intel.com

4 Notes to Systems
Hosts:
- Hyper-Threading: off
- Turbo Boost: on, SpeedStep: off
- Cluster on Die (CoD): off (only one NUMA node per socket)
- Frequencies (2.5 GHz): (Images: Intel)
- Details can be found in the Intel Xeon Processor E5 v3 Specification Update

5 Excursion - Frequencies With Haswell and later (also includes KNL): (Image: Intel)

6 Notes to Systems
Coprocessors:
- Turbo Boost: off (cores run at the 1.24 GHz base frequency)
- ECC: on (limits memory BW by 12%)
- Use micsmc to query settings, e.g.:
  > micsmc --turbo --ecc
  mic0 (turbo): Turbo mode is disabled
  mic1 (turbo): Turbo mode is disabled
  mic0 (ecc): ECC is enabled
  mic1 (ecc): ECC is enabled
- Use micsmc --help for a full list.

7 Theoretical Peak Performance - Hosts
Throughput (FLOPS = # cores × frequency × SIMD × FMA; the stated results correspond to the 12 cores of one socket):
- FLOPS SP: 12 × 2.5 GHz × 16 (AVX) × 2 (AVX2) = 0.960 TFLOPS
- FLOPS DP: 12 × 2.5 GHz × 8 (AVX) × 2 (AVX2) = 0.480 TFLOPS
Memory bandwidth:
- BW DDR4 = # channels × frequency × bytes per cycle: 4 × 2133 MT/s × 8 byte ≈ 68 GB/s (1)
- BW QPI = # QPI links × frequency × bytes per cycle × # directions: 2 × 9.6 GT/s × 2 byte × 2 = 76.8 GB/s (2)
See ark.intel.com for full specifications.
(1) Note that the DDR BW was calculated for one socket only!
(2) Entire system QPI BW w/o overhead
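The peak-performance arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the inputs (12 cores per socket, 4 DDR4 channels at 2133 MT/s, 2 QPI links) are assumptions chosen because they reproduce the results stated on the slide.

```python
def peak_gflops(cores, ghz, simd_per_cycle, fma_factor=2):
    """Peak GFLOPS = cores x frequency (GHz) x SIMD results per cycle x FMA factor."""
    return cores * ghz * simd_per_cycle * fma_factor


def bandwidth_gb_per_s(links, mega_transfers_per_s, bytes_per_transfer, directions=1):
    """Peak bandwidth in GB/s from link count, transfer rate (MT/s) and transfer width."""
    return links * mega_transfers_per_s * bytes_per_transfer * directions / 1000.0


# Assumed inputs for one Haswell socket (E5-2680v3).
# SIMD factor 16 (SP) / 8 (DP): 256-bit AVX lanes times two FMA-capable units.
sp_gflops = peak_gflops(cores=12, ghz=2.5, simd_per_cycle=16)
dp_gflops = peak_gflops(cores=12, ghz=2.5, simd_per_cycle=8)

# Assumed memory setup: 4 DDR4 channels at 2133 MT/s, 8 bytes per transfer.
ddr4_gbs = bandwidth_gb_per_s(links=4, mega_transfers_per_s=2133, bytes_per_transfer=8)

# 2 QPI links at 9.6 GT/s, 2 bytes per transfer, both directions counted.
qpi_gbs = bandwidth_gb_per_s(links=2, mega_transfers_per_s=9600,
                             bytes_per_transfer=2, directions=2)

print(f"SP: {sp_gflops:.0f} GFLOPS, DP: {dp_gflops:.0f} GFLOPS")
print(f"DDR4: {ddr4_gbs:.1f} GB/s per socket, QPI: {qpi_gbs:.1f} GB/s")
```

Doubling the core count to 24 gives the corresponding figure for a full two-socket node.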

8 Theoretical Peak Performance - Coprocessors
Throughput (FLOPS = # cores × frequency × SIMD × FMA):
- FLOPS SP: 61 × 1.24 GHz × 16 (MIC) × 2 (MIC) = 2.42 TFLOPS
- FLOPS DP: 61 × 1.24 GHz × 8 (MIC) × 2 (MIC) = 1.21 TFLOPS
Memory bandwidth:
- BW GDDR5 = # channels × frequency × bytes per cycle: 16 × 5500 MT/s × 4 byte ≈ 350 GB/s (3)
- BW PCIe x16: 8 GB/s
See ark.intel.com for full specifications. See PDF for real numbers with Stream Triad.
(3) Note that max. 170 GB/s is realistic!
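One way to read the coprocessor numbers together is to ask how many floating-point operations a kernel must perform per byte of memory traffic before the FLOPS peak, rather than the memory bandwidth, becomes the limit. The sketch below does that arithmetic with the slide's figures; the function name is hypothetical, and treating peak FLOPS divided by bandwidth as the break-even point is a simplifying assumption.

```python
KNC_CORES = 61
KNC_GHZ = 1.24

# Peak throughput as on the slide: cores x frequency x SIMD x FMA.
knc_sp_gflops = KNC_CORES * KNC_GHZ * 16 * 2
knc_dp_gflops = KNC_CORES * KNC_GHZ * 8 * 2


def break_even_flops_per_byte(peak_gflops, bandwidth_gb_per_s):
    """FLOPs per byte needed so that compute, not memory bandwidth, is the bound."""
    return peak_gflops / bandwidth_gb_per_s


# 350 GB/s is the theoretical GDDR5 figure, 170 GB/s the realistic one from the slide.
for label, bw in (("theoretical", 350.0), ("realistic", 170.0)):
    ratio = break_even_flops_per_byte(knc_dp_gflops, bw)
    print(f"DP break-even at {label} {bw:.0f} GB/s: {ratio:.1f} FLOPs/byte")
```

Kernels that do fewer operations per byte than this ratio are limited by memory bandwidth under this simple model, which is why the realistic bandwidth figure matters more than the theoretical one.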

9 Intel Roadmap (Image: Intel)
- Intel® Xeon® processors are now called Intel® Xeon® Scalable processors
- The Intel® Xeon Phi™ (co)processor roadmap ended with KNL
- Intel Xeon & Xeon Phi processors are expected to merge with Icelake Server

10 Intel® Xeon® processors v3 (Haswell)
Execution:
- Out of Order (OOO): the order of instructions you see is not necessarily the order in which they are executed!
- Speculative execution: the branch predictor estimates the likely branch, which is then executed speculatively (needed because of the deep pipeline)
- FMA support with AVX2: peak performance only with 2x FMA per cycle!
Watch out for:
- The order of instructions can only be influenced through data flow changes at a higher level (i.e. C/C++, Fortran, ...).
- Branch prediction might be wrong, in which case the pipeline needs to be flushed.
- If FMA is not used, the theoretical peak performance drops to 50%!

11 cont'd...
NUMA:
- Every socket has its own local (contiguous) memory block
- Sockets are connected via QPI (Quick Path Interconnect)
- Hyper-Threading provides two HW threads per core (at best 30% more performance)
Watch out for:
- Local memory access is fast (see the bandwidth mentioned earlier); remote access to memory attached to the other socket is slow because it goes through QPI.
- With Hyper-Threading the HW threads share the same per-core resources (or get even fewer each); homogeneous HPC applications might not benefit unless latency hiding is needed.

12 cont'd...
Caches (inclusive):
- L1: 32 KB (each for data and instructions)
- L2: 256 KB
- L3 (LLC): shared 30 MB (so-called "SmartCache"), with 2.5 MB/core (slice)
- Cacheline size: 64 byte
- Each cache has a memory prefetcher that pre-reads data when it detects certain access patterns
Watch out for:
- Data that is loaded or stored has to be brought all the way into the L1 cache (except for non-temporal stores)!
- Cache sizes can change from generation to generation.
- Every datum accessed in memory requires the entire cacheline to be loaded into the cache.
- The memory prefetcher might cause bandwidth problems (pre-reading the wrong data for sparse accesses).
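The cost of loading whole cachelines can be made concrete by counting how many 64-byte lines a strided walk over an array touches. This is a minimal sketch under stated assumptions: the array starts on a cacheline boundary, elements are 8-byte doubles, and the function names are hypothetical.

```python
CACHELINE_BYTES = 64


def cachelines_touched(n_elements, element_bytes, stride_elements):
    """Count distinct cachelines touched when reading every stride-th element."""
    lines = set()
    for i in range(n_elements):
        byte_offset = i * stride_elements * element_bytes
        lines.add(byte_offset // CACHELINE_BYTES)
    return len(lines)


def useful_fraction(n_elements, element_bytes, stride_elements):
    """Fraction of the bytes moved through the cache that the program actually uses."""
    moved = cachelines_touched(n_elements, element_bytes, stride_elements) * CACHELINE_BYTES
    return n_elements * element_bytes / moved


for stride in (1, 2, 8):
    frac = useful_fraction(n_elements=1024, element_bytes=8, stride_elements=stride)
    print(f"stride {stride}: {frac:.3f} of the transferred bytes are used")
```

With a stride of 8 doubles every access lands on a new cacheline, so only one eighth of the transferred data is used.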

13 cont'd...
Paging:
Level              Page Size                       Entries        Associativity
Instruction        4KB                             128            4 ways
Instruction        2MB/4MB                         8 per thread
First Level Data   4KB                             64             4
First Level Data   2MB/4MB                         32             4
First Level Data   1GB                             4              4
Second Level       Shared by 4KB and 2/4MB pages   1024           8
Watch out for:
- Smaller page sizes can cause more page misses (TLB misses).
- Using more data streams than the TLB associativity can cause a very large number of TLB misses if the data streams lie on individual pages and their addresses modulo the page size are the same.
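Why smaller pages cause more TLB misses follows from the amount of memory the TLB can map at once (entries times page size). The sketch below computes this for the first-level data TLB entries listed in the table; the dictionary layout and function name are hypothetical, and the 2MB/4MB row is evaluated for 2 MB pages only.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# First-level data TLB on Haswell as listed on the slide: page size -> entries.
L1_DTLB_ENTRIES = {
    4 * KB: 64,
    2 * MB: 32,  # the 2MB/4MB row, evaluated here for 2 MB pages (assumption)
    1 * GB: 4,
}


def tlb_reach_bytes(entries, page_bytes):
    """Memory that can be addressed without a TLB miss: entries x page size."""
    return entries * page_bytes


for page_bytes, entries in L1_DTLB_ENTRIES.items():
    reach = tlb_reach_bytes(entries, page_bytes)
    print(f"page size {page_bytes // KB} KB: reach {reach // KB} KB")
```

With 4 KB pages the first-level data TLB covers only 256 KB, less than the L2 cache, so a working set that still fits in cache can already miss in the TLB.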

14 Ringbus (Image: Intel)
- Variants: low (LCC), medium (MCC) and high core count (HCC)
- MCC & HCC have Cluster on Die (CoD) support: 2 NUMA nodes instead of one

15 Pipeline Diagram (Image: Intel) Front-end: Fetches & decodes instructions to uops Back-end: Executes uops

16 Intel® Xeon Phi™ coprocessor (KNC)
Execution:
- In-order: the order of the instructions is the order in which they are executed. Hint: look at the assembly (Intel Compiler) to see the cycle counts at which the instructions are executed.
- FMA support: peak performance only with 1x FMA per cycle!
- HW-threading is a must (2 or more threads): loading (fetching & decoding) instructions takes 2 cycles per HW thread.
Watch out for:
- The architecture is sensitive to the order in which instructions are generated.
- If FMA is not used, the theoretical peak performance drops to 50%.
- Using only one HW thread also reduces the theoretical peak performance by 50%.
- Advantage though: in-order execution makes timing for benchmarking reproducible!
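The two 50% penalties named on this slide can be combined into a simple upper-bound model. This is a sketch only: the function name is hypothetical, and multiplying the two factors assumes the penalties are independent, which the slide does not state.

```python
KNC_PEAK_SP_GFLOPS = 61 * 1.24 * 16 * 2  # cores x GHz x SIMD x FMA


def knc_peak_fraction(uses_fma, hw_threads_per_core):
    """Upper bound on the reachable fraction of theoretical peak on KNC."""
    if not 1 <= hw_threads_per_core <= 4:
        raise ValueError("KNC supports 1 to 4 HW threads per core")
    fraction = 1.0
    if not uses_fma:
        fraction *= 0.5  # without FMA only half of the FLOPs per cycle are possible
    if hw_threads_per_core == 1:
        fraction *= 0.5  # a single HW thread can issue only every second cycle
    return fraction


for fma in (True, False):
    for threads in (1, 2, 4):
        bound = KNC_PEAK_SP_GFLOPS * knc_peak_fraction(fma, threads)
        print(f"FMA={fma}, {threads} thread(s)/core: at most {bound:.0f} SP GFLOPS")
```

Real kernels will land below these bounds; the model only shows why code without FMA running one thread per core cannot exceed a quarter of the peak.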

17 cont'd...
NUMA:
- Only one NUMA node per coprocessor card
- Every coprocessor (NUMA node) has 16 GB of fast GDDR5
- HW-threading provides four HW threads per core
Watch out for:
- GDDR5 memory is fast in theory (≈350 GB/s), but the architecture only allows max. 180 GB/s (ECC off).
- GDDR5 is also limited to 16 GB for native applications. Offload-enabled applications can mitigate that problem though.
- HW-threading is a must (use 2 or 4 HW threads per core); avoid 3 HW threads per core because resources are divided disproportionately in that mode.

18 cont'd...
Caches (inclusive):
- L1: 32 KB (each for data and instructions)
- L2 (LLC): 512 KB per core
- Cacheline size: 64 byte
- Memory prefetcher only for the L2 cache
Watch out for (in addition to the points for Haswell):
- The cache hierarchy is less deep and the overall size is much smaller.
- The memory prefetcher is less powerful than on big cores (e.g. Haswell). It only detects a subset of access patterns.
- KNC might require manual SW prefetching instructions (keep in mind that the Intel Compilers try to generate those by default, and the result is not always optimal).

19 cont'd...
Paging:
Level                Page Size   Entries   Associativity
L1 Data TLB          4K          64        4-way
L1 Data TLB          2M          8         4-way
L1 Instruction TLB   4K          32        4-way
L2 TLB               4K, 2M      64        4-way
Watch out for:
- Same as for Haswell because the principle is the same.

20 Pipeline Diagram (Image: Intel)

21 Other Intel Architectures

22 Knights Landing (KNL) (Image: Intel)

23 Knights Landing (KNL)
- Out-of-order, 2 instructions per cycle (per core)
- 2 VPUs (2x FMA)
- Instruction & Data L1: 32KB each; L2: 2x 64KB (no L3)
- Both L1 and L2 have memory prefetchers
- 2 cores per tile, shared L2 cache per tile (NUMA!)
- Up to 4 HW threads per core (unlike on KNC, running a single HW thread is also meaningful)
- 2D mesh (no ringbus like KNC)
- 3 different cluster modes
- 3 different MCDRAM configurations
- Throttling of frequency with SSE/AVX & AVX512 instructions
- Selfboot version (no coprocessor)
- Omnipath option (KNL-F)

24 KNL - Cluster Modes (Image: TACC)
- All-to-all: no NUMA awareness
- Sub-NUMA Clustering 4 (SNC-4): full NUMA awareness. Note: there is also an SNC-2 mode.
- Quadrant: partially NUMA-aware (best tradeoff)
- Changing modes requires a reboot

25 KNL - MCDRAM Configurations (Image: TACC)
- Cache mode: MCDRAM acts as an L3 cache (with high latency)
- Flat mode: MCDRAM can be allocated directly by the programmer. Note: requires libmemkind, and compiler directives for Fortran.
- Hybrid mode: mix of cache and flat mode (25/50%)

26 Skylake Server (SKX/SKSP) (Image: Intel)

27 Skylake Server (SKX/SKSP)
- Out-of-order, 4 instructions per cycle (per core)
- 2 VPUs (2x FMA)
- Instruction & Data L1: 32KB each; L2: 1MB; L3: 1.375MB/core (non-inclusive)
- Both L1 and L2 have memory prefetchers
- Up to 2 HW threads per core
- 2D mesh (no ringbus like HSW)
- Cluster on Die mode (SNC-2)
- Throttling of frequency with AVX & AVX512 instructions
- Omnipath option (SKSP-F)

28 Skylake Server (SKX/SKSP) (Image: Intel)
- L2 caches increased 4-fold: better CPI and more cache hits are expected
- L3 cache per core is smaller and non-inclusive
- Data reuse is more important now (L3 is more likely to miss, and DDR is slow)
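The shift in cache capacity between Haswell and Skylake Server can be illustrated by asking which cache level is the first to hold a given per-core working set. This is a rough sketch with hypothetical names and two loud assumptions: each core is given only its own L3 share (the L3 is really shared), and the non-inclusive Skylake L3 is treated as holding no copies of L2 lines, which is an upper bound.

```python
KB = 1024

# Per-core cache sizes taken from the slides (HSW: slide 12, SKX: slide 27).
CACHES = {
    "HSW": {"l1d": 32 * KB, "l2": 256 * KB, "l3_per_core": 2560 * KB, "inclusive": True},
    "SKX": {"l1d": 32 * KB, "l2": 1024 * KB, "l3_per_core": 1408 * KB, "inclusive": False},
}


def on_chip_capacity(arch):
    """Distinct data per core that fits on chip under the assumptions above."""
    cache = CACHES[arch]
    if cache["inclusive"]:
        return cache["l3_per_core"]  # L3 already contains copies of the L2 lines
    return cache["l2"] + cache["l3_per_core"]  # assumed fully exclusive (upper bound)


def smallest_fitting_level(arch, working_set_bytes):
    """Name of the first level that can hold the whole working set."""
    cache = CACHES[arch]
    if working_set_bytes <= cache["l1d"]:
        return "L1"
    if working_set_bytes <= cache["l2"]:
        return "L2"
    if working_set_bytes <= on_chip_capacity(arch):
        return "L3"
    return "DRAM"


for size_kb in (200, 512, 2450):
    hsw = smallest_fitting_level("HSW", size_kb * KB)
    skx = smallest_fitting_level("SKX", size_kb * KB)
    print(f"{size_kb} KB working set: HSW -> {hsw}, SKX -> {skx}")
```

Under this model a 512 KB working set moves from L3 on Haswell to L2 on Skylake Server, while a working set just under 2.5 MB per core no longer fits on chip on Skylake Server.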

29 Skylake Pipeline Diagram (SKL) (Image: Intel)

30 Skylake Server Pipeline Diagram (SKX/SKSP) (Image: Intel)

31 SIMD and Intel AVX512 (Image: Intel)

32 Integrated Graphics (igfx) (Image: Intel) Peak Shader 1 GHz

33 Integrated Graphics (igfx) (Image: Intel)

34 Programming Integrated Graphics (igfx)
- OpenMP based (recommended for HPC) or dedicated async. offload API
- Use target(gfx) or target(gfx_kernel) respectively
- Need to install dedicated binutils package
- Use -qopenmp -qopenmp-offload=gfx
- Mixing of MIC and igfx code is not possible
- Control via GFX_* environment variables (e.g. GFX_MAX_THREAD_COUNT)
- Depending on variant, up to 7 threads per EU
- More information can be found here

35 Where to get the Information from?
- Intel® Xeon® processors v3 (Haswell) (incl. all big cores + KNL): Intel® 64 and IA-32 Architectures Optimization Reference Manual. See section 2.2 "The Haswell Microarchitecture".
- Intel® Xeon Phi™ coprocessor (KNC): Intel® Xeon Phi™ Coprocessor System Software Developers Guide. See section 2 "Intel® Xeon Phi™ Coprocessor Architecture".
- Very good 3rd party source (w/ unofficial but empirical numbers): Agner Fog's Software optimization resources
Exercise: What (best-case) access latency does the L1 data cache have on HSW & KNC? What impact does it have on performance?

36 General Performance Considerations
Most properties cannot be changed or influenced (4):
- Memory setup (DDR4 DIMMs used)
- SKUs and properties
- Architecture limitations (e.g. KNC's memory BW)
- ECC on or off
- Thermal constraints
- CoD and caching strategies
- HW-threading or frequency setups
- ...
Separate between core and uncore when optimizing:
- Core: the microarchitecture itself. Can be considered invariant for a given generation.
- Uncore: the system in which the cores are embedded. Can vary within a given generation, hence parameterize here.
(4) Some can be changed in the BIOS, though that requires a reboot.

37 How to Measure Performance?
- Lock the frequency (e.g. in the BIOS or with the cpufreq tool) or profile it (e.g. with Intel VTune Amplifier XE).
- Avoid both frequency boosting (Turbo Boost) and throttling (SpeedStep).
- Measure the effects of Hyper-Threading: it does not need to be turned off in the BIOS, but make sure SW threads are pinned properly.
- Beware that high-power AVX instructions throttle the frequency to a well-documented AVX base frequency (from Haswell onwards, incl. KNL - but not KNC).
- Deterministic threading is required (ensure pinning).
- Are there other processes running, and which cores handle interrupts?
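The pinning and repetition advice can be sketched as a small timing harness. This is a minimal illustration, not a replacement for a profiler: the function names are hypothetical, pinning relies on os.sched_setaffinity (available on Linux only, hence the guard), and frequency locking still has to be done outside the script.

```python
import os
import statistics
import time


def pin_to_first_allowed_core():
    """Pin this process to one core; returns the core id, or None if unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # non-Linux platform: no pinning performed
    core = min(os.sched_getaffinity(0))  # pick a core we are allowed to run on
    os.sched_setaffinity(0, {core})
    return core


def measure(workload, repeats=5):
    """Run the workload several times; return (best, median) wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)


if __name__ == "__main__":
    core = pin_to_first_allowed_core()
    best, median = measure(lambda: sum(i * i for i in range(200_000)))
    print(f"pinned to core: {core}, best: {best:.6f} s, median: {median:.6f} s")
```

A large gap between the best and the median time hints at interference, for example from other processes, interrupts or frequency changes.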
