Introduction to tuning on KNL platforms

1 Introduction to tuning on KNL platforms Gilles Gouaillardet RIST 1

2 Agenda
- Why do we need many core platforms?
- KNL architecture
- Single-thread optimization
- Parallelization
- Common pitfalls
- Conclusion

3 Why do we need many core platforms? 3

4 CPU trends past 40 years Sources: AMD

5 The free lunch is over
- Back in the day, Moore's law meant the processor frequency doubled every two years
- Memory bandwidth increased too
- Upgrading hardware was enough to increase application performance
- Processor frequency cannot increase any more (physical, power and thermal constraints)
- More transistors now means more, and more capable, cores
- Total memory bandwidth and cache size keep increasing
- Memory bandwidth and cache size per core tend to decrease

6 The road to exascale
- An exascale supercomputer using evolved technology would be too expensive and too power hungry (> 100 MW, acceptable is < 20 MW)
- Disruptive technology is needed: FPGA? GPU? Many core?

7 Disruptive technology is needed for exascale
- Good compromise between absolute performance, performance per Watt, cost and programmability
- Many core is a popular option; easy programming is an important factor
  - Sunway TaihuLight (RISC, #1 on Top500)
  - Intel Xeon Phi KNC/KNL (x86_64, #2, #6, #7 on Top500)
  - Post K (ARMv8-A with SVE)

8 Challenges
- More, and more capable, but slower cores
- A good algorithm is critical
- Parallelism is no longer optional
  - Instruction level parallelism (vectorization)
  - Block level parallelism (MPI and/or OpenMP)
- Hierarchical memory
- Hyperthreading
- Code modernization is mandatory

9 Manycore tuning challenges
"With great computational power comes great algorithmic responsibility" (David E. Keyes, KAUST)

10 KNL architecture 10

11 KNL Tile
- 2 cores: 2-wide Out of Order (OoO), based on the Silvermont microarchitecture, on steroids for HPC
  - Deeper OoO
  - Higher bandwidth
  - Larger TLBs
  - 4 SMT threads (HyperThreading)
- 2 VPU: 2x AVX512 units, 32SP/16DP per unit
- L1/L2 prefetchers
- L2: 1 MB, 16-way; 1 line read and ½ line write per cycle; coherent across all tiles
- CHA: Caching/Home Agent; distributed tag directory to keep L2s coherent; MESIF protocol
- 2D-mesh connections for the tile

12 Instruction Set Architecture
- Backward compatible with Intel Xeon products: supports all previous ISA extensions (SSE, AVX and AVX2)
- Also supports the AVX-512 ISA
  - AVX512F: 512-bit vector extensions with mask support
  - AVX512PFI: new prefetch instructions
  - AVX512ERI: new exponential and reciprocal instructions
  - AVX512CDI: new conflict detection instructions

13 KNL tiles
- 36 tiles interconnected by a 2D mesh
- Tile: 2 cores + 2 VPU/core + 1 MB L2
- MCDRAM: 16 GB on-package, high bandwidth memory
- DDR4: 6 channels, up to 384 GB
- Vector perf: 3+ TF DP and 6+ TF SP
- Scalar perf: ~3x over KNC
- Streams Triad (GB/s) - MCDRAM : DDR:

14 KNL vs (1st generation) KNC
- Standalone vs accelerator only
- Binary compatibility with (non Phi) Xeon processors: you can run all your legacy apps without recompiling!
- New core: Atom based (now OoO capable), ~3x higher single-thread performance
- New AVX-512 ISA: 512-bit vector ISA with masks
- Gather/scatter engine: hardware support for gather and scatter
- New memory subsystem: MCDRAM + DDR
  - High bandwidth memory -> MCDRAM
  - Huge bulk memory -> DDR
- Cluster and memory modes settable at boot time

15 KNL memory system 15

16 High Bandwidth Memory
- MCDRAM is often referred to as High Bandwidth Memory
  - Most applications are bandwidth sensitive
- MCDRAM latency is worse than DDR latency
  - Very few applications are latency sensitive
- Do not call MCDRAM "fast memory" or "high speed memory"

17 KNL Memory modes
- MCDRAM is High Bandwidth Memory
- It can be configured at boot time
  - Cache mode (16 GB as an L3 cache)
  - Flat mode (16 GB in a dedicated NUMA node)
  - Hybrid mode (4 or 8 GB as an L3 cache, 12 or 8 GB in a dedicated NUMA node)

18 HBM usage
Command line
    # use only HBM
    $ numactl -m 1 a.out
    # try HBM first, fall back to standard memory
    $ numactl --preferred=1 a.out
Autohbw library
    $ export AUTO_HBW_SIZE=min_size[:max_size]
    $ LD_PRELOAD=libautohbw.so a.out
Fortran directive
    real(r8), allocatable :: a(:)
    !dir$ attributes fastmem, align:64 :: a
    allocate(a(n))
Memkind library
    #include <hbwmalloc.h>
    double *d = hbw_malloc(n * sizeof(double));

19 HBM usage
- Using HBM optimally in SNC modes is not trivial
  - You need to know which MCDRAM is the closest
  - CPU physical ids can change when a node is booted
  - Scripts depend on the memory mode
- RIST developed the mcdram-bind utility that, whenever possible, binds optimally even in SNC modes

20 KNL cluster modes
- The KNL cluster mode is selected at boot time (BIOS parameter)
- Cluster modes influence the cache coherency wiring and the topology presented to the application
- Commonly used modes are:
  - Alltoall
  - Quadrant
  - SNC-4 (Sub-NUMA Clustering)
  - SNC-2
- Using the most appropriate mode is critical to achieve optimal performance
- The best mode depends on the application and the parallelization model
- Fortunately, the KNL cluster mode can be selected on a per job basis

21 KNL Cluster modes All2all/Quadrant (1 socket, 68 cores) SNC2 (2 sockets, cores) SNC4 (4 sockets, cores) 21

22 L2 cache miss
[Figure: path of an L2 cache miss in quadrant, all2all and SNC-4 modes]
1. L2 cache miss
2. Send request to the distributed tag directory (CHA)
3. Miss in the directory, forward to memory
4. Memory sends data to requestor

23 KNL cluster modes: rules of thumb
- Alltoall: do not use it!
- SNC-4: 4 NUMA nodes, generally best with a multiple of 4 MPI tasks per node. Note that 34 tiles do not split evenly into 4 quadrants!
- SNC-2: 2 NUMA nodes, generally best with a multiple of 2 MPI tasks per node
- Quadrant: flat memory, to be used if SNC-4/2 is not a fit

24 Query a single node's modes
    $ cat /var/run/hwloc/knl_memoryside_cache
    version: 2
    cache_size:
    associativity: 1
    inclusiveness: 1
    line_size: 64
    cluster_mode: Quadrant
    memory_mode: Cache

25 Query modes in a SLURM multi-node job
    $ clush -w $SLURM_JOB_NODELIST grep mode: /var/run/hwloc/knl_memoryside_cache | clubak -c
    kusmcnode[01,04] (2)
    cluster_mode: Quadrant
    memory_mode: Cache
    kusmcnode
    cluster_mode: SNC4
    memory_mode: Flat
    kusmcnode
    cluster_mode: Quadrant
    memory_mode: Flat

26 Single-thread optimization 26

27 Single thread optimization
Maximum single thread performance can only be achieved with optimized code that fully exploits all the hardware features:
- Vectorization (instruction level parallelism)
- Use of all the floating-point units (FMA)
- Maximize the FLOP per byte ratio

28 Vectorization
- Scalar processing (one instruction produces one result)
- SIMD processing (one instruction can produce multiple results)

29 Compilers are conservative
- If there might be some dependences, the compiler will not vectorize
- Compiler generated vectorization reports are very valuable
- The developer knows best: if there is no dependence, then tell the compiler
  - Compiler specific pragma
  - Standard OpenMP 4 simd directive
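As a minimal sketch of "telling the compiler", the loop below combines the C99 `restrict` qualifier with the OpenMP 4 `simd` directive. The function name `axpy` is only an illustration. Both annotations are promises made by the developer: if the arrays overlap or the iterations are not independent, the vectorized loop may give wrong results.

```c
#include <stddef.h>

/* y[i] += alpha * x[i] for i in [0, n).
 * "restrict" promises that x and y do not overlap.
 * "#pragma omp simd" asserts that the iterations are independent,
 * so the compiler may vectorize without proving it. */
void axpy(size_t n, double alpha, const double *restrict x, double *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

The directive is ignored unless OpenMP SIMD support is enabled at compile time (for example `-fopenmp-simd` with GCC or `-qopenmp-simd` with the Intel compiler); the code stays correct either way.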

30 AoS vs SoA
As taught in Object Oriented Programming classes, the common data layout is Array of Structures (AoS)
    #define N 1024
    typedef struct { double x; double y; double z; } point;
    point p[N];
    /* x-translation */
    for (int i = 0; i < N; i++)
        p[i].x += 1.0;
- Strided access: 1 cache line contains 2 or 3 useful doubles out of 8
- A vector uses data from 3 cache lines

31 AoS vs SoA
The optimized data layout is Structure of Arrays (SoA)
    #define N 1024
    typedef struct { double x[N]; double y[N]; double z[N]; } points;
    points p;
    /* x-translation */
    for (int i = 0; i < N; i++)
        p.x[i] += 1.0;
- Streaming access: 1 cache line contains 8 useful doubles

32 Indirect access
    #define N 1024
    double *a;
    int *indexes;
    for (int i = 0; i < N; i++)
        a[indexes[i]] += 1.0;
- Data must be gathered/scattered from/to several cache lines
- Hardware support is available
- In the general case, vectorization is incorrect!

33 Indirect access and collisions
    #define N 1024
    double *a;
    int *indexes;
    for (int i = 0; i < N; i++)
        a[indexes[i]] += 1.0;
- If no conflict can occur, then the compiler must be informed that vectorization is safe
- Hardware support (AVX512-CD) improves indirect access performance
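To illustrate why such a loop cannot be vectorized blindly, here is a small scalar emulation (not real AVX-512 code) of a naive gather / add / scatter sequence. `LANES` and both function names are assumptions made for this sketch. When two lanes of the same vector hold the same index, all lanes read the old value before any lane writes, so updates are lost.

```c
#include <stddef.h>

#define LANES 8  /* doubles per 512-bit vector */

/* Reference scalar loop: always correct, even with repeated indices. */
void scatter_add_scalar(double *a, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[idx[i]] += 1.0;
}

/* Emulation of a naive vectorized loop: gather LANES values, add, scatter.
 * n is assumed to be a multiple of LANES.
 * Duplicate indices inside one vector lose updates. */
void scatter_add_naive_simd(double *a, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; i += LANES) {
        double tmp[LANES];
        for (int l = 0; l < LANES; l++)   /* gather */
            tmp[l] = a[idx[i + l]];
        for (int l = 0; l < LANES; l++)   /* vector add */
            tmp[l] += 1.0;
        for (int l = 0; l < LANES; l++)   /* scatter */
            a[idx[i + l]] = tmp[l];
    }
}
```

With 8 indices all equal to 0, the scalar loop adds 8.0 to `a[0]` while the naive emulation adds only 1.0. With 8 distinct indices both versions agree, which is the case where vectorization is safe.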

34 Operations with mask
    for (int i = 0; i < n; i++)
        if (c[i] > 0)
            a[i] = a[i] + b[i];
- Create a mask (Boolean array)
- Execute a masked vector instruction
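The two steps can be sketched in plain C for one block of 8 doubles. This is a scalar emulation of what a masked vector instruction does, not actual AVX-512 code; `LANES` and `masked_add_block` are names chosen for the example.

```c
#include <stdint.h>

#define LANES 8  /* doubles per 512-bit vector */

/* Process one block of LANES elements in two steps:
 * 1) build a bit mask, bit l is set when c[l] > 0
 * 2) perform the add only in the lanes whose mask bit is set
 * Returns the mask so the caller can inspect it. */
uint8_t masked_add_block(double *a, const double *b, const double *c)
{
    uint8_t mask = 0;
    for (int l = 0; l < LANES; l++)
        if (c[l] > 0)
            mask |= (uint8_t)(1u << l);
    for (int l = 0; l < LANES; l++)
        if (mask & (1u << l))
            a[l] += b[l];
    return mask;
}
```

The branch inside the original loop is replaced by a mask computed once per vector, so every lane follows the same instruction stream.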

35 Arithmetic intensity
- Arithmetic intensity is the ratio of FLOP performed per byte moved to/from memory
- Arithmetic intensity is determined by the algorithm, but it can be influenced by the dataset
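As a worked example, consider the STREAM triad kernel `a[i] = b[i] + s * c[i]` on doubles. The counting below is a simplification that ignores caches and write-allocate traffic, and the function names are illustrative only.

```c
/* Arithmetic intensity = FLOP performed / bytes moved to or from memory. */
double arithmetic_intensity(double flop_per_iter, double bytes_per_iter)
{
    return flop_per_iter / bytes_per_iter;
}

/* STREAM triad on doubles, a[i] = b[i] + s * c[i]:
 * 2 FLOP per iteration (one multiply, one add)
 * 2 loads + 1 store = 3 * 8 = 24 bytes per iteration */
double triad_intensity(void)
{
    return arithmetic_intensity(2.0, 3.0 * sizeof(double));
}
```

This gives 2 / 24, about 0.083 FLOP per byte, which is why the triad is a typical bandwidth-sensitive kernel.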

36 Roofline analysis
- Kernels with low arithmetic intensity are memory-bound
- Kernels with high arithmetic intensity are compute-bound
- Roofline analysis is a visual method to find how well the kernel is performing
- There are several roofs
  - Memory type (L1 / L2 / HBM / standard memory)
  - CPU features (no vectorization, vectorization, vectorization + FMA)
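The roofline bound itself is a one-line formula: attainable performance is the smaller of the compute roof and the memory roof (bandwidth times arithmetic intensity). A minimal sketch, with a hypothetical function name and units of GFLOP/s and GB/s:

```c
/* Attainable performance in GFLOP/s under the roofline model:
 *   min(peak_gflops, bandwidth_gbs * intensity)
 * where intensity is in FLOP per byte. */
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double intensity)
{
    double memory_roof = bandwidth_gbs * intensity;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}
```

With purely illustrative numbers, a 3000 GFLOP/s compute roof and a 400 GB/s memory roof, a kernel at 0.125 FLOP/byte is capped at 50 GFLOP/s (memory-bound), while a kernel at 100 FLOP/byte is capped at 3000 GFLOP/s (compute-bound). Each roof listed above (memory type, CPU feature) corresponds to a different pair of arguments.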

37 Roofline analysis w/ Intel Advisor 2017
- The roofline model has to be built once per processor
- Vendor knows best
- The process is fully automated from Advisor 2017 Update 2

38 Parallelization 38

39 Parallelization
- The free lunch is over: more cores are needed to keep improving application performance
- A lot more cores are available and must be effectively used to achieve greater performance
- Parallelization is now mandatory

40 Performance scaling
- Strong scaling: how does the solution time vary with the number of processors for a fixed total problem size? Ideally, adding processors decreases the time to solution.
- Weak scaling: how does the solution time vary with the number of processors for a fixed problem size per processor? Ideally, the time to solution remains constant.

41 Amdahl's law (strong scaling)
- S_latency is the theoretical speedup of the execution of the whole task
- s is the speedup of the part of the task that benefits from improved system resources
- p is the proportion of execution time that the part benefiting from improved resources originally occupied
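The formula itself did not survive in this transcript. With the symbols defined on this slide, the usual textbook statement of Amdahl's law reads:

```latex
S_{\text{latency}}(s) = \frac{1}{(1 - p) + \dfrac{p}{s}},
\qquad
\lim_{s \to \infty} S_{\text{latency}}(s) = \frac{1}{1 - p}
```

The limit shows that the serial fraction 1 - p bounds the speedup, however large s becomes.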

42 Gustafson's law (weak scaling)
- S_latency is the theoretical speedup in latency of the execution of the whole task
- s is the speedup in latency of the execution of the part of the task that benefits from the improvement of the resources of the system
- p is the percentage of the execution workload of the whole task concerning the part that benefits from the improvement of the resources of the system before the improvement
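The formula itself did not survive in this transcript. With the symbols defined on this slide, the usual textbook statement of Gustafson's law reads:

```latex
S_{\text{latency}}(s) = (1 - p) + s \, p
```

The speedup grows linearly with s, because the parallel part of the workload is assumed to grow with the available resources.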

43 At a glance
- Amdahl's law: we are fundamentally limited by the serial fraction
- Gustafson's law: we need larger problems for larger numbers of CPUs; whilst we are still limited by the serial fraction, it becomes less important
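A small numerical sketch makes the contrast concrete. It assumes the usual textbook forms of both laws, with p the parallel fraction and s the speedup of the parallel part (for example the number of cores); the function names are illustrative.

```c
/* Amdahl's law (fixed total problem size): 1 / ((1 - p) + p / s) */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Gustafson's law (problem size grows with resources): (1 - p) + s * p */
double gustafson_speedup(double p, double s)
{
    return (1.0 - p) + s * p;
}
```

For p = 0.95 and s = 64, Amdahl's law gives a speedup of roughly 15 while Gustafson's law gives roughly 61: the same 5% serial fraction hurts far less when the problem grows with the machine.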

44 Parallelization models
- MPI is the de facto standard for inter process communication
- Flat MPI is sometimes a good option, but beware of memory and wire-up overhead
- MPI+X is the general paradigm
- X is for intra node communications and can be
  - OpenMP
  - Pthreads
  - PGAS (OpenSHMEM, Co-arrays, ...)
  - MPI (!)

45 OpenMP limitations
- Most OpenMP parallelization focuses only on loops
- OpenMP has an overhead (thread creation, synchronization, reduction)
- OpenMP was designed when memory was flat; today NUMA is very common
- NUMA makes performance models hard to build
- OpenMP is
  - generally best within a NUMA node
  - natural when iterations are independent
- MPI communications within an OpenMP region are not natural

46 Hyperthreading
- A KNL core provides 4 hardware threads
- Hardware threads share resources (cache, FPU, ...)
- When a hardware thread is waiting for memory, another hardware thread can be scheduled to perform some computation
- 2 hardware threads are enough to achieve maximum performance (4 on KNC)
- Best is to experiment with 1 and 2 threads per core, and choose the fastest option

47 Intel Thread Advisor Suitability report gives a speed-up estimation 47

48 Intel Thread Advisor
Dependency analysis helps predict parallel data sharing problems (very slow, and only on annotated loops)

49 Problem sizing on KNL
- KNL has both DDR4 (standard) and MCDRAM (High Bandwidth) memory
- MCDRAM can be configured as cache or scratchpad
- The impact on performance can be significant
  - If your app is cache-friendly, weak scale using all available memory
  - If your app is not cache-friendly, it might be more effective to weak scale using only HBM

50 Is your app cache-friendly?
- In flat mode
  - Run with HBM only
  - Run with standard memory only
- In cache mode
  - Increase the dataset size to use all available memory
- Measure the drop in performance (if any) when using all the memory in cache mode

51 Common pitfalls 51

52 MCDRAM pitfall
- MCDRAM is not fast memory
- MCDRAM is High Bandwidth Memory (HBM)
- DDR4 has better (i.e. lower) latency than MCDRAM
- Most applications are limited by memory bandwidth and not memory latency

53 ISA pitfalls
- Legacy, SSE, AVX, AVX2 and AVX512 instructions are available
- The good point is that KNL can run any binary compiled on a non KNL node
- The bad point is that a binary compiled on the frontend does not use the AVX512 instruction set by default, and runs slowly without any warning on a KNL node
- Do not forget to compile your apps with -xMIC-AVX512 (KNL only) or -axMIC-AVX512 (fat binary, runs both on the login node and fast on KNL)
- Do not use -mmic with KNL!
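Since a binary built without AVX-512 runs silently, one hedged way to catch the mistake is to have the application report how it was compiled. The sketch below relies on the `__AVX512F__` macro, which GCC, Clang and the Intel compiler predefine when AVX-512 Foundation code generation is enabled; the function name is illustrative. Note that it reflects the baseline target of the translation unit, so a fat binary built with -axMIC-AVX512 may still report 0 for its default code path.

```c
/* Returns 1 if this translation unit was compiled with AVX-512F
 * code generation enabled, 0 otherwise. */
int built_with_avx512f(void)
{
#ifdef __AVX512F__
    return 1;
#else
    return 0;
#endif
}
```

Printing this value at start-up, next to the timing results, makes it obvious when a run used a binary that was compiled for the login node only.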

54 SNC-4 pitfalls KNL is split in 4 NUMA nodes This is tiles, e.g tiles It is generally best to balance the tiles, which means it is best to run only 64 MPI tasks (and hence sacrifice 4 cores) 54
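The arithmetic behind the 64-task recommendation can be sketched as follows. This assumes 34 active tiles with 2 cores each, split as evenly as possible over the 4 NUMA nodes; the exact per-node tile counts are an assumption of this sketch, and the function names are illustrative.

```c
/* Split `tiles` as evenly as possible over `nodes` NUMA nodes.
 * The first (tiles % nodes) nodes receive one extra tile. */
void split_tiles(int tiles, int nodes, int *tiles_per_node)
{
    for (int i = 0; i < nodes; i++)
        tiles_per_node[i] = tiles / nodes + (i < tiles % nodes ? 1 : 0);
}

/* Largest number of MPI tasks (one per core) such that every NUMA node
 * runs the same number of tasks. */
int balanced_tasks(int tiles, int nodes, int cores_per_tile)
{
    return (tiles / nodes) * cores_per_tile * nodes;
}
```

Under these assumptions, 34 tiles over 4 nodes gives 9, 9, 8 and 8 tiles, and `balanced_tasks(34, 4, 2)` returns 64, leaving 4 of the 68 cores idle.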

55 KNL modes pitfalls
- Use the same cluster and memory modes on all the nodes of a job
- Keep track of the cluster and memory modes used when measuring performance

56 KNL frequency pitfall
From Intel's footnote 3: beware when computing theoretical peak performance
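A minimal sketch of the peak computation shows how sensitive the result is to the frequency used. It assumes 2 VPUs per core, 8 double-precision lanes per 512-bit VPU and one FMA (2 FLOP) per lane per cycle; the function name and the frequencies in the usage note are illustrative assumptions, not values taken from Intel's footnote.

```c
/* Theoretical double-precision peak in GFLOP/s.
 * Assumptions: 2 VPUs per core, 8 DP lanes per 512-bit VPU,
 * 1 FMA per lane per cycle counted as 2 FLOP. */
double peak_dp_gflops(int cores, double ghz)
{
    const int vpus_per_core = 2;
    const int dp_lanes = 8;
    const int flop_per_fma = 2;
    return cores * ghz * vpus_per_core * dp_lanes * flop_per_fma;
}
```

For 68 cores, a hypothetical 1.4 GHz clock gives about 3046 GFLOP/s, while a hypothetical 1.2 GHz clock under heavy vector load gives about 2611 GFLOP/s, a difference of roughly 14% in the figure that measured performance is compared against.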

57 Conclusions 57

58 Conclusions
Intel Knights Landing (KNL)
- Is a many-core architecture
- Uses 512-bit vectors
- Offers several memory and cluster modes
Understanding KNL's unique architecture is necessary to get the best performance out of it
- Vectorization and parallelization are mandatory
- Tools are available to help
- We can help you too!

59 Questions? About this presentation Need support for your projects 59
