Performance Tools (Paraver/Dimemas)


1 Performance Tools (Paraver/Dimemas). Jesús Labarta, Judit Gimenez (BSC). ENES workshop on exascale techs, Hamburg, March 18th 2014.

2 Our Tools. Since 1991. Based on traces. Open source. Core tools: Paraver (paramedir) for offline trace analysis, Dimemas as message passing simulator, Extrae for instrumentation. Focus: detail, flexibility, intelligence.

3 A different view point. Look at structure: of behavior, not syntax. Differentiated or repetitive patterns in time and space. Focus on computation regions (bursts).

4 A different view point ... and fundamental metrics. Efficiency table for NMMB with columns LB, Ser, Trf, Eff (values not recoverable). Remaining figure labels: Useful, user, adv2, (gather fft-scatter)*, mono. M. Casas et al., Automatic analysis of speedup of MPI applications, ICS.
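The efficiency factors named on this slide can be illustrated with a small sketch. It reads LB, Ser and Trf as load balance, serialization and transfer, and assumes the multiplicative decomposition Eff = LB × Ser × Trf; the slide gives only the abbreviations, so both readings are assumptions. The function name, its inputs (per-process useful computation time, measured run time, run time on an ideal network) and the sample numbers are hypothetical.

```python
def efficiency_factors(useful, runtime, ideal_runtime):
    """Split parallel efficiency into load balance, serialization and transfer.

    useful        -- per-process time spent in useful computation (seconds)
    runtime       -- measured elapsed time of the run (seconds)
    ideal_runtime -- elapsed time of the same run on an ideal network (seconds)
    """
    avg_useful = sum(useful) / len(useful)
    max_useful = max(useful)
    lb = avg_useful / max_useful      # how evenly the work is distributed
    ser = max_useful / ideal_runtime  # loss due to dependences, even on an ideal network
    trf = ideal_runtime / runtime     # loss due to actual data transfer
    return {"LB": lb, "Ser": ser, "Trf": trf, "Eff": lb * ser * trf}


# Hypothetical four-process run.
factors = efficiency_factors(
    useful=[8.0, 6.0, 7.0, 7.0], runtime=10.0, ideal_runtime=9.0
)
for name, value in factors.items():
    print(f"{name}: {value:.3f}")
```

With this decomposition the product collapses to average useful time divided by run time, so the three factors only attribute the loss to different causes.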

5 More on structure and concurrency? Scalability tradeoffs between processes at different phases.

6 More on structure and concurrency. How to find out: discussion with developer. Automatic? V. Subotic et al., Automatic exploration of potential parallelism in sequential applications, ISC.

7 More on structure and concurrency

8 More on structure and concurrency. Huge potential for concurrency and overlap to tolerate latencies and to spread load across resources (cores and network).

9 More on structure and concurrency. You may even want to constrain potential concurrency!

10 More on structure and concurrency ... and syntax. WIP: taskify with OmpSs. OpenMP 4.0 accelerator features in OmpSs.

11 Performance analytics

12 Using clustering to identify structure. Scatter plot axes: Completed Instructions, IPC. J. Gonzalez et al., Automatic Detection of Parallel Applications Computation Phases (IPDPS 2009).
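As a rough illustration of clustering computation bursts on the two axes named on this slide, the sketch below groups bursts by (completed instructions, IPC) with a minimal density-based clustering (DBSCAN). The slide does not state which algorithm or parameters are used, so the choice of DBSCAN, the log scaling, the `eps` and `min_pts` values, and the burst data are all assumptions for illustration.

```python
import math


def normalise(bursts):
    """Map (instructions, IPC) to the unit square, using log10 of instructions."""
    raw = [(math.log10(ins), ipc) for ins, ipc in bursts]
    cols = list(zip(*raw))
    low = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    return [tuple((v - l) / s for v, l, s in zip(p, low, span)) for p in raw]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep growing the cluster
    return labels


# Hypothetical bursts: (completed instructions, IPC).
bursts = [
    (2.0e8, 1.20), (2.1e8, 1.22), (1.9e8, 1.18), (2.0e8, 1.21),
    (5.0e6, 0.40), (5.2e6, 0.42), (4.9e6, 0.41), (5.1e6, 0.39),
    (3.0e7, 2.00),
]
labels = dbscan(normalise(bursts), eps=0.1, min_pts=3)
print(labels)
```

In this made-up data the two dense groups become two clusters and the isolated burst is left as noise, which is the kind of structure a clustered scatter plot is meant to expose.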

13 Projecting hardware counters based on clustering. Full per-region HWC characterization from a single run: instruction mix, miss ratios, stalls.

14 Tracking structural evolution. Frame sequence: clustered scatterplot as the core count increases. OpenMX, strong scaling. G. Llort et al., On the Usefulness of Object Tracking Techniques in Performance Analysis, SC.

15 Mixing instrumentation and sampling to get extreme detail with minimal overhead. Different roles: instrumentation delimits regions; sampling reports progress within a region. Figure labels: Iteration #1, Iteration #2, Iteration #3, Synthetic Iteration. Harald Servat et al., Detailed performance analysis using coarse grain sampling, 2009. Harald Servat et al., Unveiling Internal Evolution of Parallel Application Computation Phases, ICPP.

16 Folding hardware counters. Instructions evolution for routine copy_faces of NAS MPI BT.B. Red crosses represent the folded samples and show the completed instructions from the start of the routine. The green line is the curve fitting of the folded samples and is used to reintroduce the values into the tracefile. The blue line is the derivative of the curve fitting over time (counter rate).
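The folding idea on this slide can be sketched as follows: sparse samples taken in many instances of the same routine are projected onto one synthetic instance by normalising time and counter value, and a rate is then derived from the folded points. This sketch swaps the curve fitting mentioned on the slide for a plain per-bin average followed by finite differences, which is a simplification. All function names and sample values are hypothetical.

```python
def fold(instances):
    """Project samples from many instances onto one synthetic instance.

    Each instance is (duration, total_count, samples), where samples is a
    list of (time_since_start, count_since_start) pairs.
    Returns (relative_time, relative_count) pairs sorted by time, both in [0, 1].
    """
    folded = []
    for duration, total, samples in instances:
        for t, c in samples:
            folded.append((t / duration, c / total))
    return sorted(folded)


def binned_profile(folded, bins):
    """Average the folded samples per time bin (stand-in for curve fitting)."""
    sums = [0.0] * bins
    counts = [0] * bins
    for x, y in folded:
        b = min(int(x * bins), bins - 1)
        sums[b] += y
        counts[b] += 1
    return [((b + 0.5) / bins, sums[b] / counts[b]) for b in range(bins) if counts[b]]


def rates(profile):
    """Finite-difference slope between consecutive profile points."""
    return [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(profile, profile[1:])]


# Hypothetical data: four instances of one routine, one sample each.
instances = [
    (10.0, 1000.0, [(2.0, 300.0)]),
    (12.0, 1200.0, [(6.0, 720.0)]),
    (8.0, 800.0, [(6.4, 720.0)]),
    (10.0, 1000.0, [(1.0, 150.0)]),
]
folded = fold(instances)
profile = binned_profile(folded, bins=2)
slope = rates(profile)
print(folded, profile, slope)
```

A slope above 1 in relative units means the counter advances faster than average in that part of the routine; multiplying by total count over duration would turn it back into an absolute rate.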

17 Combined clustering + folding. Instantaneous values, all metrics, from a single run, no overhead. Figure annotations (CGPOP-1D): computation regions separated by MPI calls, running at ~1000, ~1100 and ~1200 MIPS (20 M instructions for the ~1000 MIPS region; the other instruction counts are not recoverable).

18 CESM v18, v19 trace. User functions not instrumented. 160 s. ATM: 384, LND: 16, ICE: 32, OCN: 10, CPL: (value lost). Remaining figure labels: GB, 11.5 MB, ms, 2.55 GB, 4.5 MB.

19 CESM CAM v18. Routines: Convect_shallow_tend, aer_rad_props_sw, rrtmg_sg, Microp_driver_tend, aer_rads_prop_lw, rad_rrtmg_lw.

20 CESM CAM v19. Routines: M_list_mp_init_, Vertical_diffusion, Aerosol_dryed_intr_, Convect_shallow_tend, Svp_water, Microp_driver_tend, aer_rad_props_sw, rrtmg_sw, rad_rrtmg_lw.

21 Dimemas

22 Dimemas: coarse grain, trace driven simulation. Simulation: highly non-linear model. Linear components: point-to-point communication; sequential processor performance (global CPU speed, or per block/subroutine). Non-linear components: synchronization semantics (blocking receives, rendezvous); resource contention (CPU; communication subsystem: links (half/full duplex), busses). Diagram: nodes of several CPUs sharing a Local Memory, connected through elements labeled L and B.
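The two linear components listed on this slide can be sketched in a few lines, assuming the usual latency-plus-size-over-bandwidth cost for a point-to-point message and a simple division by a relative CPU speed for computation bursts. The function and parameter names are hypothetical and are not Dimemas configuration keys; the non-linear parts (synchronization semantics, contention on links and busses) are deliberately left out.

```python
import math


def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear point-to-point cost: latency plus size over bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def compute_time(traced_time_s, cpu_ratio):
    """Rescale a traced computation burst by a relative CPU speed."""
    return traced_time_s / cpu_ratio


# Hypothetical 1 MB message on a 1 GB/s network with 5 microseconds latency.
real = transfer_time(1_000_000, latency_s=5e-6, bandwidth_bytes_per_s=1e9)

# Same message with zero latency and unbounded bandwidth.
ideal = transfer_time(1_000_000, latency_s=0.0, bandwidth_bytes_per_s=math.inf)

# A 2 s traced burst replayed on a CPU assumed to be 4 times faster.
faster = compute_time(2.0, cpu_ratio=4.0)

print(real, ideal, faster)
```

Setting latency to zero and bandwidth to infinity removes the transfer cost entirely, so whatever waiting remains in a simulated run comes from the application's own dependences and imbalance.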

23 Ideal machine. The impossible machine: BW = ∞, L = 0. Actually describes/characterizes intrinsic application behavior: load balance problems? Dependence problems? Timeline comparison (Nehalem cluster, 256 processes) of real run vs. ideal network; labeled phases: Allgather + sendrecv, allreduce, alltoall, sendrec, waitall. Impact on practical machines?

24 The potential of hybrid/accelerator parallelization. Hybrid parallelization: speed up SELECTED regions by the CPUratio factor. We do need to overcome the hybrid Amdahl's law: asynchrony + load balancing mechanisms! Plot (GADGET, 128 procs): % elapsed time per code region; marked values: 93.67%, 97.49%, 99.11%.
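The "hybrid Amdahl's law" remark can be made concrete with the standard Amdahl formula. The sketch below treats the three percentages on the slide as the fraction of elapsed time covered by the selected regions, which is an assumption about what the plot shows, and uses a hypothetical CPUratio of 16.

```python
def hybrid_speedup(fraction, cpu_ratio):
    """Amdahl-style speedup when `fraction` of elapsed time runs `cpu_ratio` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / cpu_ratio)


CPU_RATIO = 16.0  # hypothetical acceleration factor for the selected regions

speedups = {}
for fraction in (0.9367, 0.9749, 0.9911):
    speedups[fraction] = hybrid_speedup(fraction, CPU_RATIO)
    limit = 1.0 / (1.0 - fraction)  # bound as cpu_ratio grows without limit
    print(f"{fraction:.2%} accelerated: speedup {speedups[fraction]:.2f}, limit {limit:.1f}")
```

Even a few percent of non-accelerated time caps the achievable speedup well below the CPUratio, which is why the slide points to asynchrony and load balancing rather than faster kernels alone.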

25 Conclusion. BSC tools: extremely powerful visualization and analysis capabilities. Performance analytics: performance data is big data; management analytics; capturing knowledge and methodologies in algorithmic workflows. Useful insight for informed decisions on code refactoring.

26 THANKS

27 Insight. Observations / highly probable speculations / good questions about fundamental behavior, suggesting possibilities for optimization. Identification of specific sequential code with poor performance. Bimodal behavior in alternating iterations? Bimodal behavior in space: day-night imbalance; moving load imbalance; separate cause and potential solution. Repetitive fine grain structure within phase: 2 / 3 sub iterations? Parallelizable? Potential source for overlap of communication/computation?

28 A call for performance analytics. Data acquisition: a lot of data is captured. Presentation: profile, a few (or not so few) precomputed first order statistics, far too summarized; trace visualization, no summarization at all. Need for intelligent data processing to derive actual insight.

29 CESM CLM v

30 CESM POP v

31 NMMB

32 Measuring parallel efficiency
