ERLANGEN REGIONAL COMPUTING CENTER [RRZE] MuCoSim Hands On. Thomas Röhl (HPC @ Uni Erlangen)


1 ERLANGEN REGIONAL COMPUTING CENTER [RRZE] MuCoSim Hands On. Thomas Röhl (HPC @ Uni Erlangen), Thomas.Roehl@fau.de

2 Agenda
HPC systems @ FAU
Login on cluster
Batch system
Modern architectures
LIKWID
Thread affinity
Hardware performance monitoring (end-to-end & Marker API)
CPU frequency

3 HPC systems @ FAU: Production systems
2-socket systems:
- Emmy: 560 nodes, IvyBridge, 20 phy. cores @ 2.2 GHz; 16 Xeon Phi, 16 Nvidia K20
- Lima: 500 nodes, Westmere, 24 phy. cores
- TinyBlue: 84 nodes, Nehalem, 8 phy. cores
1-socket systems:
- Woody: 40 nodes, SandyBridge, 4 phy. cores @ 3.5 GHz; 72 nodes, Haswell, 4 phy. cores @ 3.4 GHz; 64 nodes, Skylake, 4 phy. cores @ 3.5 GHz

4 Access to HPC systems @ FAU
Frontends (emmy, woody, lima) reachable from the FAU network via SSH (only for compilation, don't run applications there!)
From outside, connect to cshpc first
Console access: SSH; X access: NoMachine NX
Login: ssh <user>@<host>
Copy: scp (-r) <file/folder> <user>@<host>:<dest>
      scp (-r) <user>@<host>:<file/folder> <dest>

5 Further information on clusters

6 NOW YOU
Try an SSH login on the cluster frontend emmy
Copy the folder ~unrz139/mucosim to your home

7 $ ssh <user>@cshpc (only needed from outside the FAU network)
$ ssh <user>@emmy
$ cp -r ~unrz139/mucosim $HOME

8 Batch system
Get available nodes with their properties: pbsnodes
See the status of your jobs: qstat or qstat.<clustername>
Submit a job: qsub
-I : get an interactive job (console)
-l : set properties
  nodes=<nodecount> or nodes=<nodename1>,<nodename2>
  ppn=<40, but cluster specific> (SMT threads of the compute node(s))
  walltime=hh:mm:ss (runtime of the job)
  fx.y (set fixed frequency, e.g. f2.0)
  likwid (allow the user to measure hardware counters)

9 Batch system examples
Interactive job on 2 nodes for 3 hours:
$ qsub -I -l nodes=2:ppn=40 -l walltime=03:00:00
Interactive job on 2 nodes for 3 hours, each with 1 Nvidia K20:
$ qsub -I -l nodes=2:ppn=40:k20m1x,walltime=03:00:00
Non-interactive job on 2 nodes for 3 hours with fixed frequency:
$ qsub -l nodes=2:ppn=40:f2.0,walltime=03:00:00 xy.sh
Non-interactive job with properties in the batch script:
$ qsub xy.sh

10 Batch system scripts
#!/bin/csh or #!/bin/bash -l
#PBS -l nodes=2:ppn=40      (set job properties)
#PBS -l walltime=04:00:00   (set job runtime)
#PBS -N <jobname>           (set job name)
#PBS -l likwid              (enable LIKWID)
[ ... ]                     (copy, start, ...)
$ qsub test.batch           (pollux.rrze.uni-erlangen.de)
Outputs in <jobname>.o and <jobname>.e
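As a rough sketch (not the script from the hands-on material; job name and binary are placeholders), a complete batch script combining these directives could look like:

#!/bin/bash -l
#PBS -l nodes=1:ppn=40:likwid
#PBS -l walltime=00:30:00
#PBS -N tmv_example
# bash -l makes the module system available inside the job
module load intel64 likwid
# change to the directory the job was submitted from
cd $PBS_O_WORKDIR
# run the application on the allocated node
./matrix

Submit it with qsub; stdout and stderr end up in the <jobname>.o and <jobname>.e files as described above.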

11 Module system on all FAU systems
Automatically loaded for csh and bash (in batch scripts: bash -l)
Module system commands:
module avail                  (list available modules)
module show <mod> or <mod>/<version>
module load <mod>
module unload <mod>
Common modules: intel64, gcc, intelmpi, openmpi, likwid
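For instance, a typical sequence on a frontend or inside a job might be (module versions here are just examples):

$ module avail                 # list everything that is installed
$ module load intel64 likwid   # load default versions of compiler and LIKWID
$ module list                  # check what is currently loaded
$ module show likwid           # inspect the paths/variables the module sets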

12 Further information on software environment

13 MODERN COMPUTER ARCHITECTURES

14 Intel IvyBridge Architecture (die diagram, source: Intel): CPUs with attached L3 segments, memory controllers

15 Socket architecture
Non-uniform access to other L3 segments; one core can use all L3 segments
Non-uniform access to memory (NUMA in-socket)
Only one ring is attached to PCIe
All units are self-managing (system-on-chip principle)

16 In-core architecture
Example instruction stream:
vmovapd ymm1, [r8]
vaddpd ymm2, ymm1, ymm1
vmovapd [r10], ymm2
1) Load instruction(s) into L2
2) Load instruction(s) into L1I
3) Decode instructions
4) Load [r8] using port 2 (or 3)
5) Data arrives in L1D
6) Retire the load operation
7) Calculate y = 2x in port 1
8) Retire the add operation
9) Store y to [r10]
10) Data in L1D
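The three instructions above correspond to one AVX iteration of a simple doubling loop; a minimal C sketch (the function and array names are chosen here for illustration) would be:

/* y = 2*x, element-wise; with AVX enabled the compiler emits
   vmovapd (load), vaddpd (x + x) and vmovapd (store) per 4 doubles. */
void double_array(double *restrict y, const double *restrict x, long n)
{
    for (long i = 0; i < n; i++)
        y[i] = x[i] + x[i];
}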

17 In-core architecture
On x86: CISC outside, RISC inside; CISC instructions are decoded into RISC instructions and back
Additional buffers:
- in-order to out-of-order (reorder buffer)
- repeating instruction streams (loop buffer)
- out-of-order memory operations (load/store buffers)
Execution ports do the real work, out of order: calculation, data transfer and address ports
Retirement collects all RISC operations of a CISC instruction and commits them

18 Cache hierarchy (diagram): each core (Core 0 ... Core 3) has private L1D, L1I and L2 caches; the shared L3 connects via the intra-socket ring/mesh to the memory controller and QPI. Labeled transfer widths per cycle: 2x 16 bytes, 1x 32 bytes, 1x 32 bytes (between the register/L1, L1/L2 and L2/L3 levels).
Caches are often one-ported, thus only one transfer direction per cycle.

19 Cache hierarchy
HT threads of a core share L1 and L2; L3 is shared by a group of cores
Keep required data as high and as long as possible in the hierarchy (all-time advice!)
Use streaming access patterns if possible (helps the prefetchers)
If stored data is not needed again soon, use non-temporal stores (write directly to memory)
Allocate data on the socket that consumes it (QPI is slower)
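To illustrate the non-temporal store idea, here is a sketch using AVX intrinsics (the copy kernel and the alignment assumptions are made up for this example; with ICC the same effect can be requested globally via -qopt-streaming-stores=always):

#include <immintrin.h>

/* Copy n doubles (n a multiple of 4, both pointers 32-byte aligned)
   without polluting the caches with the destination data. */
void copy_nt(double *dst, const double *src, long n)
{
    for (long i = 0; i < n; i += 4) {
        __m256d v = _mm256_load_pd(src + i);  /* normal aligned load */
        _mm256_stream_pd(dst + i, v);         /* non-temporal store  */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}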

20 Available compilers
The mainly used compilers on the clusters are Intel ICC and GCC. Always test the performance of multiple compilers.
                      ICC                              GCC
OpenMP                -qopenmp                         -fopenmp
Optimization          -O1, -O2, -O3, -Ofast            -O1, -O2, -O3, -Ofast
Activate AVX          -xavx(2)                         -mavx(2) -ftree-vectorize
Non-temporal stores   -qopt-streaming-stores=always    N/A
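Typical compile lines built from this table might look like the following (source and binary names are placeholders):

$ icc -qopenmp -O3 -xAVX -o matrix matrix.c
$ gcc -fopenmp -O3 -mavx -ftree-vectorize -o matrix matrix.c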

21 NOW YOU
Go to folder 01_tmv and submit the job to the cluster
Run the matrix-vector multiplication interactively
- on all CPUs
- with other compile options
- with a different compiler version (module)

22 $ cd $HOME/mucosim/01_tmv
$ qsub matrix.batch
$ qstat
$ qsub -I -l nodes=1:ppn=40 -l walltime=00:10:00
$ make help
$ make run OMP_NUM_THREADS=40
$ module load intel64/x or gcc/y
$ make build CFLAGS_GCC="-O3 -mavx"
$ make build CFLAGS_ICC="-O3 -xhost"

23 Thread pinning & performance analysis

24 Importance of affinity
Bandwidth decreases with each level, latency increases with each level
Pin threads according to data locality
(Diagram: memory hierarchy Register - L1 - L2 - L3 - Memory - SSD - HDD, mapped onto Core - Socket - Node)

25 Importance of affinity: STREAM benchmark on a 16-core Sandy Bridge node
(Plot comparing two runs: pinning to physical cores first, first socket first, vs. no pinning)

26 LIKWID overview
"Like I Knew What I'm Doing"
Set of tools for: topology information, process/thread pinning, hardware performance monitoring, low-level benchmarking, CPU frequency manipulation, CPU feature manipulation (prefetchers)

27 System topology with LIKWID: likwid-topology (here on a Core i (Haswell) system)
Reports thread topology, cache topology, NUMA topology and a graphical topology
(Graphical output: a box per core of Socket 0 with its 32 kB L1 and 256 kB L2 caches plus the shared L3 in MB)
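A quick way to try it on an allocated node (the -g switch requests the additional graphical layout):

$ likwid-topology        # thread, cache and NUMA topology as tables
$ likwid-topology -g     # additionally print the graphical (ASCII) layout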

28 Affinity with LIKWID: likwid-pin
LIKWID defines affinity domains:
- Node (N:0-23)
- Last-level cache (C0:0-5)
- Socket (S1:0-11)
- NUMA domain (M0:0-5)

29 Affinity with LIKWID: likwid-pin
Broken in ...
Physical selection: 0,1,2,3 or 0-3
Logical selection: S0:0-3 or L:<domain>:0-3 (physical cores first)
Function-based selection: E:N:8 = 0,20,1,21,2,22,3,23; E:N:8:1:2 = 0,1,2,3,4,5,6,7
Scattered over affinity domains: M:scatter fills all memory domains, physical cores first: 0,10,1,11,2,12,3,13,...
Combine multiple selections with @: S0:0@S0:1 = 0,1

30 NOW YOU
Look at the system topology
Go to folder 02_tmv and run matrix interactively
- on all physical CPUs
- on the first 5 physical CPUs per socket
- don't use make run but likwid-pin directly

31 $ qsub -I -l nodes=1:ppn=40:likwid,walltime=00:10:00
$ module load likwid/4.2.0
$ likwid-topology
$ make run PINSTR="E:N:20:1:2"
$ make run PINSTR="0,1,2,3,4@E:S1:5:1:2"
$ likwid-pin -h
$ likwid-pin -c E:N:20:1:2 ./matrix
$ likwid-pin -c 0,1,2,3,4,10,11,12,13,14 ./matrix

32 Runtime profile
The Intel compiler provides a simple runtime profiling interface
Build with -profile-functions (and maybe -fno-inline)
No parallel execution!
Finds the hotspots in the code
Creates XML and tabular output files with fields: time and time share per function, call and exit count, file and line of the function
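A possible build-and-run sequence (file names and the exact output file names vary with the compiler version, so treat them as an assumption):

$ icc -O2 -profile-functions -fno-inline -o matrix matrix.c
$ ./matrix                      # writes loop_prof_*.xml and *.dump output files
$ less loop_prof_funcs_*.dump   # tabular per-function profile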

33 Runtime profile
Time(%)  Self(%)  Call count  Function    File:line
...      ...      ...         runloop     matrix.c:...
...      ...      ...         time_init   timer.c:...
...      ...      ...         fillmatrix  matrix.c:...
...      ...      ...         main        matrix.c:71
For GCC: compile with -pg and run gprof <exec> gmon.out
Flat profile (like ICC) and call graph (--graph)

34 Go to folder 03_runtime_profile and submit the job. Which is the hottest function?
Run the matrix example interactively. Which is the hottest function?

35 $ qsub stream.batch   (then qstat, ls runtime_profile*)
$ qsub -I -l nodes=1:ppn=40:likwid,walltime=00:10:00
$ make help (!)
$ make build_matrix / run_matrix
$ gprof --flat-profile matrix gmon.out
$ module load intel64
$ less *.dump

36 HPM - Hardware Performance Monitoring
An additional analysis method complementing software-based analysis (Vampir, TotalView, Intel Trace Analyzer/Collector)
Performance counters implemented in hardware
Low-level data of the CPU's functional units, caches and memory
Partly not accurate (e.g. FLOP/s on SandyBridge or IvyBridge)

37 Each unit has 2-4 counters and possibly a fixed-purpose counter
LIKWID uses different names for the uncore units: CBOX, MBOX, RBOX

38 LIKWID HPM - Hardware Performance Monitoring
Simple end-to-end measurements: likwid-perfctr
- sets up system topology and perfmon
- starts and stops HPM
- executes the application on the given CPU set
- evaluates counter values and derives metrics
$ likwid-perfctr -c E:S0:8:1:2 -g FLOPS_DP ./a.out
Measures CPUs 0 to 7 on socket 0 (-C to pin and measure) with the double-precision FLOP/s performance group
likwid-perfctr -a lists all available groups

39 LIKWID performance groups
Event names are not intuitive -> difficult selection
Performance groups combine an event set with derived metrics (bandwidths, ratios, ...)
Examples:
$ likwid-perfctr -c 0-3@E:S1:4:1:2 -g L3 ./a.out      (no pinning; L2/L3 traffic)
$ likwid-perfctr -C E:N:10:1:2 -g FLOPS_DP ./a.out    (pinning 10 threads, 1 out of 2; double-precision floating-point ops)

40 LIKWID performance groups on emmy
FLOPS_AVX: packed AVX MFlops/s
FLOPS_DP: double-precision MFlops/s
FLOPS_SP: single-precision MFlops/s
DATA: load-to-store ratio
L2: L2 cache bandwidth in MBytes/s
L3: L3 cache bandwidth in MBytes/s
MEM: main memory bandwidth in MBytes/s
ENERGY: power and energy consumption
MEM_DP: memory & DP FLOP/s & energy
MEM_SP: memory & SP FLOP/s & energy

41 NOW YOU
Go to folder 04_tmv and run interactively
- Why is the L3 evict data volume of core 0 larger?
- Measure the memory bandwidth running on both sockets
- Measure DP FLOP/s with different CPU selections
- Force vectorization and measure DP FLOP/s again

42 $ make run
$ make run PINSTR="..." PERFGRP="MEM"
$ make run PINSTR="E:N:20:1:2" PERFGRP="FLOPS_DP"
$ make build CFLAGS_GCC="-O3 -ffast-math"
$ make build CFLAGS_GCC="-O3 -ffast-math -mavx"
$ make build CFLAGS_ICC="-O3 -xavx"

43 likwid-perfctr Marker API mode
Until now, we measured the whole application; the Marker API measures only a code region of an application
The configuration is still done by likwid-perfctr
Multiple named regions can be measured (also nested)
Results of multiple region calls are accumulated

44 Marker API macros
#include <likwid.h>
LIKWID_MARKER_INIT;              // must be called from a serial region
LIKWID_MARKER_THREADINIT;        // must be called from a parallel region
LIKWID_MARKER_START("Compute");
<code>
LIKWID_MARKER_STOP("Compute");
LIKWID_MARKER_CLOSE;             // must be called from a serial region

45 Add the Marker API to code (restructure loops)
Before:
#pragma omp parallel for
<loop>
After:
#pragma omp parallel
{
  LIKWID_MARKER_START("Compute");
  #pragma omp for
  <loop>
  LIKWID_MARKER_STOP("Compute");
}
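Putting the macro list and the restructured loop together, a minimal self-contained example could look like the following sketch (the doubling loop and array size are made up for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <likwid.h>

#define N 10000000

int main(void)
{
    double *x = malloc(N * sizeof(double));
    double *y = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) x[i] = (double)i;

    LIKWID_MARKER_INIT;                    /* serial region   */
    #pragma omp parallel
    {
        LIKWID_MARKER_THREADINIT;          /* parallel region */
        LIKWID_MARKER_START("Compute");
        #pragma omp for
        for (long i = 0; i < N; i++)
            y[i] = x[i] + x[i];
        LIKWID_MARKER_STOP("Compute");
    }
    LIKWID_MARKER_CLOSE;                   /* serial region   */

    printf("y[42] = %f\n", y[42]);
    free(x); free(y);
    return 0;
}

Compiled as shown on slide 47 (with -DLIKWID_PERFMON and -llikwid); without that define the marker macros expand to nothing, so the instrumentation costs nothing in normal builds.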

46 Add the Marker API to code (closed-source library calls)
Before:
calc_some_func()
After:
#pragma omp parallel
{ LIKWID_MARKER_START("foo"); }
calc_some_func()
#pragma omp parallel
{ LIKWID_MARKER_STOP("foo"); }

47 Use it
Compile:
$CC -DLIKWID_PERFMON $LIKWID_INC $LIKWID_LIB code.c \
   -o code -llikwid
LIKWID_INC and LIKWID_LIB are defined by the module system
Run:
likwid-perfctr -C <cpustr> -g <group> -m ./a.out
Use capital -C: the Marker API requires pinned threads
-m tells likwid-perfctr to use Marker API mode

48 Measure a marked code region
$ likwid-perfctr -C 0,1,2 -g L2 -m ./a.out
===================== Region: Compute =====================
Region Info per core (core 0, core 1, core 2): RDTSC runtime [s] (region time of each thread) and call count (region calls of each thread)
[ raw counter results ]
Derived metrics for each thread (core 0, core 1, core 2): Runtime (RDTSC) [s], Runtime unhalted [s], Clock [MHz], CPI, L2 Load [MBytes/s], L2 Evict [MBytes/s], L2 bandwidth [MBytes/s], L2 data volume [GBytes]

49 NOW YOU
Go to folder 05_tmv and run interactively
- measure DP FLOP/s
- measure memory bandwidth
- what's wrong with the code?

50 $ make run
$ make run PINSTR="..." PERFGRP="MEM"
$ make run PINSTR="E:N:20:1:2" PERFGRP="FLOPS_DP"
$ make build CFLAGS_GCC="-O3 -ffast-math"
$ make build CFLAGS_GCC="-O3 -ffast-math -mavx"
$ make build CFLAGS_ICC="-O3 -xavx"
Load imbalance: parallelize the initialization and use smaller chunks for each thread
$ make build DEFINES="-DPARALLEL_CHUNK -DPARALLEL_CHUNK_INIT"

51 CPU frequency: likwid-setfrequencies
Changes the CPU frequency of affinity domains
Request only the likwid job property, no fixed-frequency property
See available frequencies: likwid-setfrequencies -l
See current frequency settings: likwid-setfrequencies -p
Set the frequency of socket 1 to 2.2 GHz: likwid-setfrequencies -c S1 -f 2.2
Set the scaling governor to performance on socket 0: likwid-setfrequencies -c S0 -g performance

52 ERLANGEN REGIONAL COMPUTING CENTER [RRZE] Thank you for your attention! Regionales RechenZentrum Erlangen [RRZE], Martensstraße 1, Erlangen, Thomas.Roehl@fau.de

53 Examples

54 Triangular matrix-vector multiplication
Parallelized with #pragma omp parallel
(Plot: lower is better.) What's happening here? The last thread executes its instructions faster than the first thread?

55 Triangular matrix-vector multiplication
Retired instructions are misleading: waiting in the implicit OpenMP barrier issues many but short instructions
We need to measure actual work
(Plot: higher is better)

56 Triangular matrix-vector multiplication
Floating-point instructions are a reliable metric for useful work
But the floating-point instruction counters since SandyBridge are only approximately correct
(Plot: higher is better)

57 Triangular matrix-vector multiplication
Changing the OpenMP schedule to static with chunk size 16 gives smaller work packages per thread
No imbalance anymore! Is it also faster?
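As an illustration of what such a kernel with the modified schedule can look like (a sketch only; the actual hands-on code, matrix layout and variable names may differ):

/* y = A * x for a lower-triangular N x N matrix A stored row-major.
   Row i touches only i+1 elements, so fixed chunks of 16 rows balance
   the work much better than the default static schedule. */
void tri_mvm(const double *A, const double *x, double *y, long N)
{
    #pragma omp parallel for schedule(static, 16)
    for (long i = 0; i < N; i++) {
        double sum = 0.0;
        for (long j = 0; j <= i; j++)
            sum += A[i * N + j] * x[j];
        y[i] = sum;
    }
}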

58 Triangular matrix-vector multiplication
Scaling run on an Intel SandyBridge node over both sockets (8 phy. cores per socket)

59 ERLANGEN REGIONAL COMPUTING CENTER [RRZE] Thank you for your attention! Regionales RechenZentrum Erlangen [RRZE], Martensstraße 1, Erlangen. LIKWID:
