Dimemas internals and details. BSC Performance Tools
1 Dimemas internals and details. BSC Performance Tools
2 CEPBA tools framework (overview figure): XML control; instrumentation and data acquisition (OMPITrace, Valgrind, Dyninst, PAPI, MRNET); Paraver traces (.prv + .pcf); time analysis and filters (.cfg); Dimemas traces (.trf); DIMEMAS and VENUS (IBM-ZRL) simulators; predictions/expectations; stats generation (how2gen.xml; .viz, .cube, .xls, .txt); machine description; instruction-level simulators; data display tools (PeekPerf).
3 Dimemas tracefile. Characterises the application: a sequence of resource demands (computation bursts) for each task plus a sequence of communication events. This is the application model. Format: SDDF, for historical reasons, with a fixed set of record definitions (a new format also exists).
4 Dimemas tracefile. Format: SDDF for historical reasons. Definition of records:

#1: "CPU burst" {
    int    "taskid";
    int    "thid";
    double "time";
};;

#2: "NX send" {
    int "taskid";
    int "thid";
    int "dest taskid";
    int "msg length";
    int "tag";
    int "commid";
    int "use_rendezvous";
};;

#40: "block begin" {
    int "taskid";
    int "thid";
    int "blockid";
};;

#41: "block end" {
    int "taskid";
    int "thid";
    int "blockid";
};;

#201: "global OP" {
    int "rank";
    int "thid";
    int "glop_id";
    int "comm_id";
    int "root_rank";
    int "root_thid";
    int "bytes_sent";
    int "bytes_recvd";
};;
5 Dimemas tracefile. ASCII records:

"block begin" { 35, 0, 73 };;
"NX recv"     { 35, 0, 39, 4160, 10003, 0, 1 };;
"block end"   { 35, 0, 73 };;
"CPU burst"   { 35, 0, };;
"block begin" { 35, 0, 73 };;
"NX recv"     { 35, 0, 31, 4160, 10004, 0, 1 };;
"block end"   { 35, 0, 73 };;
"CPU burst"   { 35, 0, };;
"block begin" { 35, 0, 75 };;
"NX send"     { 35, 0, 34, 1560, 10001, 0, 0 };;
"block end"   { 35, 0, 75 };;
"CPU burst"   { 35, 0, };;
"block begin" { 35, 0, 75 };;
"NX send"     { 35, 0, 31, 3640, 10003, 0, 0 };;
"block end"   { 35, 0, 75 };;
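The ASCII records above are simple enough to parse with a regular expression. A minimal sketch in Python (hypothetical helper, not part of the toolset; the full SDDF grammar also allows string and array fields, which this ignores):

```python
import re

# Matches records such as: "NX send" { 35, 0, 34, 1560, 10001, 0, 0 };;
RECORD = re.compile(r'\s*"([^"]+)"\s*\{([^}]*)\}\s*;;')

def parse_record(line):
    """Return (record_name, numeric_fields) for one Dimemas ASCII record."""
    m = RECORD.match(line)
    if m is None:
        raise ValueError("not a record: %r" % line)
    fields = [int(f) for f in m.group(2).split(",") if f.strip()]
    return m.group(1), fields
```

For example, parse_record('"NX send" { 35, 0, 34, 1560, 10001, 0, 0 };;') yields the record name 'NX send' and the field list [35, 0, 34, 1560, 10001, 0, 0].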
6 Dimemas trace generation.
Dimemas instrumentation: MPIDtrace, run the same way as OMPItrace.
Paraver trace to Dimemas trace generation: prv2trf original.prv dimemas.trf
Default: the duration of each computation region is taken from the .prv computation duration.

Usage: prv2trf [options] <paraver_trace> <dimemas_trace>
  -h                             This help
  -n                             Do not generate initial idle states (forces a synchronized start of all threads)
  -i <iprobe_miss_threshold>     Maximum MPI_Iprobe misses before an Iprobe-area burst is discarded
  -b <hw_counter_type>,<factor>  Hardware counter type and factor used to generate burst durations:
                                 the computation region duration is derived from hardware counters,
                                 assuming/modeling a given performance (<factor>)
7 Parallel machine model. Dimemas is a coarse-grain, trace-driven simulator. Abstract architecture: a network of SMP nodes (each with local memory, L links and B buses) running a multiprogrammed workload. Objective: capture the key factors influencing performance using basic MPI protocols, with no attempt to model the details of a specific implementation; simple/general, and fast to simulate.
8 Dimemas GUI: specify the trace to simulate (open the file chooser).
9 Parallel machines: highly non-linear systems.
Linear components:
  Point-to-point communication: T = L + MessageSize / BW
  Sequential processor performance: global speed, or per block/subroutine
Non-linear components:
  Synchronization semantics: blocking receives, rendezvous
  Resource contention in the communication subsystem: links (in/out, half-duplex), buses
10 Dimemas GUI: specify the target machine.
11 p2p communication model: early receiver. MPI_send pays the machine latency (uses CPU, independent of size) and computation proceeds. The logical transfer completes after Size/BW; the physical transfer simulates contention for machine resources (links & buses). The receiving process is blocked in MPI_recv, which also pays the machine latency (uses CPU, independent of size).
12 p2p communication model: late receiver. MPI_send pays the machine latency (uses CPU, independent of size) and computation proceeds. The physical transfer simulates contention for machine resources (links & buses), but the logical transfer (Size/BW) cannot complete before MPI_recv is posted, which pays the machine latency (uses CPU, independent of size).
13 p2p communication model: rendezvous. MPI_send pays the machine latency (uses CPU, independent of size) and the sending process blocks until the receiver reaches MPI_recv. Then the physical transfer (simulated contention for links & buses) and the logical transfer (Size/BW) take place. MPI_recv pays the machine latency (uses CPU, independent of size).
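Ignoring the simulated contention for links and buses, the three timelines above reduce to one simple cost function. A toy sketch (all names hypothetical; the real simulator additionally models resource occupation and per-resource queuing):

```python
def p2p_completion(size, latency, bandwidth, send_t, recv_t, rendezvous=False):
    """Time at which the receiver holds the message, given the times at
    which MPI_send (send_t) and MPI_recv (recv_t) are reached.
    The machine latency is paid per side and is independent of size."""
    if rendezvous:
        # Sender blocks until the matching receive has been posted.
        start = max(send_t + latency, recv_t + latency)
    else:
        # Early receiver: the transfer starts as soon as the sender has
        # paid the latency. Late receiver: completion still has to wait
        # for the receive to be posted.
        start = max(send_t + latency, recv_t)
    return start + size / bandwidth
```

With size 4160 bytes, latency 1 ms and bandwidth 1 MB/s, an early receiver completes at 5.16 ms, while a receiver arriving at 10 ms delays completion to 14.16 ms (15.16 ms under rendezvous).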
14 Collective communication model. Generic model: barrier / fan-in / fan-out phases. Cost of each communication phase: a generic or per-call model factor (lin / log / const) applied to the size of the message (min / avg / max over all processes). Per-process timeline: collective processor time, block time, communication time.
15 Collective communication model. Generic model, communication time:

  Time = Latency + MODEL_FACTOR * Size / Bandwidth

Model factor:
  Null         0
  Constant     1
  Linear       P
  Logarithmic  Factor = sum_{i=1..ceil(log2 P)} steps_i, where steps_i accounts for contention on the B buses
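A sketch of the generic model formula (assumption: the logarithmic factor is approximated here by ceil(log2 P), whereas the simulator derives the number of steps taking bus contention into account):

```python
import math

def model_factor(model, nprocs):
    """MODEL_FACTOR for the generic collective model."""
    factors = {
        "null": 0,                            # cost-free phase
        "constant": 1,                        # one transfer
        "linear": nprocs,                     # P sequential transfers
        "log": math.ceil(math.log2(nprocs)),  # tree-like, ~log2(P) steps
    }
    return factors[model]

def collective_time(model, nprocs, size, latency, bandwidth):
    """Time = Latency + MODEL_FACTOR * Size / Bandwidth."""
    return latency + model_factor(model, nprocs) * size / bandwidth
```

For example, with 128 processes a logarithmic phase pays 7 transfer steps while a linear phase pays 128.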
16 Collective communication model. Per-call model: the model factor (lin / log / const) and the size of the message used (min / mean / max over all processes) are specified in an input file.
17 Dimemas GRID: model extension. Several machines (each with its L links and B buses) connected through an external network with dedicated connections; variation of effective bandwidth due to traffic; collective communication extension. Not targeted by this tutorial.
18 Architecture description file. Configuration file, SDDF format for historical reasons. Definition of records:

#1: "environment information" {
    char "machine_name"[];
    int  "machine_id";
    // "instrumented_architecture": architecture used to instrument
    char "instrumented_architecture"[];
    // "number_of_nodes": number of nodes on the virtual machine
    int  "number_of_nodes";
    // "network_bandwidth": data transfer rate between nodes in Mbytes/s
    //  0 means instantaneous communication
    double "network_bandwidth";
    // "number_of_buses_on_network": maximum number of messages on the network
    //  0 means no limit, 1 means bus contention
    int  "number_of_buses_on_network";
    // 1 constant, 2 linear, 3 logarithmic
    int  "communication_group_model";
};;
19 Architecture description file. Configuration file (continued):

#2: "node information" {
    int  "machine_id";
    // "node_id": node number
    int  "node_id";
    // "simulated_architecture": architecture node name
    char "simulated_architecture"[];
    // "number_of_processors": number of processors within the node
    int  "number_of_processors";
    // "number_of_input_links": number of input links in the node
    int  "number_of_input_links";
    // "number_of_output_links": number of output links in the node
    int  "number_of_output_links";
    // "startup_on_local_communication": communication startup
    double "startup_on_local_communication";
    // "startup_on_remote_communication": communication startup
    double "startup_on_remote_communication";
    // "speed_ratio_instrumented_vs_simulated": relative processor speed
    double "speed_ratio_instrumented_vs_simulated";
    // "memory_bandwidth": data transfer rate within the node in Mbytes/s
    //  0 means instantaneous communication
    double "memory_bandwidth";
    double "external_net_startup";
};;
20 Architecture description file: example configuration (figure annotations: in/out links, BW, B buses, L latency).

"wide area network information" {"", 1, 0, 4, 0.0, 0.0, 1};;
"environment information" {"", 0, "", 128, 250.0, 0, 3};;
"node information" {0, 0, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 1, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 2, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 3, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"mapping information" {"WRF.MN.128p.chop2.trf", 128, [128] {0,1,2,3,4,5,6,7,8,9,10,11,...,125,126,127}};;
"configuration files" {"", "", "collectives.cfg", ""};;
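Because the configuration is plain text, it is easy to generate programmatically (the sed-based parametric script later in this tutorial does exactly this). A hypothetical sketch, with the field order taken from the record definitions above:

```python
def make_cfg(nodes, bandwidth, buses, startup, trace, ntasks):
    """Emit a minimal Dimemas configuration: one environment record,
    one record per node, and a round-robin task-to-node mapping."""
    lines = ['"environment information" {"", 0, "", %d, %.1f, %d, 3};;'
             % (nodes, bandwidth, buses)]
    for n in range(nodes):
        # 1 processor, 1 input link, 1 output link per node;
        # the same startup for local and remote communication.
        lines.append('"node information" {0, %d, "", 1, 1, 1, %.1f, %.1f,'
                     ' 1.0, 0.0, 0.0};;' % (n, startup, startup))
    mapping = ",".join(str(t % nodes) for t in range(ntasks))
    lines.append('"mapping information" {"%s", %d, [%d] {%s}};;'
                 % (trace, ntasks, ntasks, mapping))
    return "\n".join(lines)
```

make_cfg(4, 250.0, 0, 0.0, "WRF.MN.128p.chop2.trf", 8) produces four node records and the mapping {0,1,2,3,0,1,2,3}.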
21 Application analysis: what-if simulations.
  Ideal network?  BW = infinity, L = 0
  Group messages? Bandwidth problem?  compare BW = infinity against BW = target
  Concurrent communication problems?  BW = target, L = target, buses = 1, 2, ...
(Figure: real run vs. ideal network; phases: allgather + sendrecv, alltoall, allreduce, waitall, sendrecv.)
22 Hands-on session. The directory intro2dimemas contains a guidelines document that you can apply to the WRF.128p trace or to your own. A comparison of the original and simulated traces for the WRF.128p case: real MareNostrum run vs. Dimemas prediction for MareNostrum.
23 Hands-on session. Configurations: BW 5 MB/s, BW 10 MB/s, L 100 us, BW 250 MB/s. Sensitivity to the different factors (latency, BW, ...)? In different parts of the trace?
24 Hands-on session. Configurations: 2 buses, 2 links, BW 5 MB/s, BW 10 MB/s, BW 250 MB/s. Relationship between bandwidth, injectors and contention. Amount of contention? Endpoint contention?
25 Application analysis: endpoint contention. Simulation with Dimemas of the PEPC exchange phase: very low BW, 1 output link, infinite input links. Endpoint contention appears at low-ranked processes. Recommendation: it is important to schedule communications, with everybody sending in destination-rank order.
26 Speedup model.

  eff_i = T_i / T

  CommEff = max_i(eff_i)

  LB = sum_{i=1..P} eff_i / (P * max_i(eff_i))

Directly from real execution metrics:

  Sup = (P / P0) * (LB / LB0) * (CommEff / CommEff0) * (IPC / IPC0) * (#instr0 / #instr)

Refining load balance into migrating/local load imbalance (macroLB) and serialization (microLB, requires a Dimemas simulation):

  Sup = (P / P0) * (macroLB / macroLB0) * (microLB / microLB0) * (CommEff / CommEff0) * (IPC / IPC0) * (#instr0 / #instr)
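The factors of the model can be computed directly from the per-process useful computation times of a real run. A sketch (function and variable names are hypothetical):

```python
def efficiency_model(useful, elapsed):
    """useful[i]: useful computation time of process i; elapsed: total
    elapsed time T. Returns (LB, CommEff, parallel efficiency), where
    parallel efficiency = LB * CommEff."""
    P = len(useful)
    eff = [u / elapsed for u in useful]          # eff_i = T_i / T
    comm_eff = max(eff)                          # CommEff = max(eff_i)
    lb = sum(eff) / (P * max(eff))               # LB = sum(eff_i)/(P*max(eff_i))
    return lb, comm_eff, lb * comm_eff
```

For example, four processes with useful times [8, 8, 4, 4] seconds in a 10-second run give LB = 0.75, CommEff = 0.8 and a parallel efficiency of 0.6.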
27 Parametric studies: estimating the impact of different factors. GADGET: ideal speedup obtained by speeding up ALL the computation bursts by the ratio factor. The more processes, the lower the speedup (higher impact of bandwidth limitations)! (Figures: speedup vs. bandwidth (MB/s) for several ratio values, at increasing process counts.)
28 Parametric studies.

#!/bin/sh
echo bw time
for log_bw in $(seq 6 14)
do
    let i=2**log_bw
    # Instantiate the template configuration with this bandwidth
    sed s/BWREF/$i.0/g machine.ref.cfg > tmp.cfg
    # Run the simulation and keep the last field of the "Execution time" line
    echo $i `Dimemas -S 32K tmp.cfg | grep Execu | awk '{print $NF}'`
    rm tmp.cfg
done

machine.ref.cfg:
"environment information" {"", 0, "", 128, BWREF, 0, 3};;
"node information" {0, 0, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 1, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
29 Estimating impact: speeding up only SELECTED regions by the ratio factor. Hybrid GADGET (128 processes): profile of code regions as % of computation time (the dominant region accounts for 93.67% of elapsed time). We do need to overcome the hybrid Amdahl's law: asynchrony + load-balancing mechanisms! (Figures: speedup vs. bandwidth (MB/s) for several ratio values, per code region.)
30 Using block factors. Clusterize with option -b; convert to .trf; specify block performance factors (the time of a block is divided by its factor); simulate. Example: WRF.NM.128p.chop2.prv clusterized with Cluster.I.IPC.xml; prediction when speeding up cluster 2 by 100x.
31 Dimemas GUI: block factors.

"environment information" {"", 0, "", 128, 250.0, 0, 3};;
"node information" {0, 0, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 1, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"modules information" {1007, 100.0};;
32 Application analysis: serialization. Detected through Dimemas simulation with an ideal interconnect: relevant computation between communications in the sendrecv phases. Precise measurement and prediction result in early detection: runs at small core counts warn of potential problems at large core counts. (Figure: real run vs. ideal network for increasing process counts.)
33 CEPBA tools framework: interconnect evaluation environment (same overview figure as before, highlighting the path from Paraver/Dimemas traces through DIMEMAS to VENUS (IBM-ZRL)).
34 Interconnect simulation environment. Dimemas: MPI replay with a very fast, coarse-grain network model. Venus (IBM): detailed network simulator with routing and protocols. The Dimemas simulator (client) and the Venus ServerMod (server) interact over a socket, exchanging traces and routes; configuration files provide the mapping and topology; the output is traces and statistics.
35 Multiscale simulation.
36 Multiscale simulation: L2 cache size vs. network bandwidth. Left: cluster representatives' IPC with different L2 cache sizes (64KB to 4MB). Right: application execution time with different network bandwidths (125Mb/s to 500Mb/s). VAC and WRF are dominated by computation phases, so the impact of the network is negligible. For NAS BT the network bandwidth is more significant: an L2 size reduction can be compensated by an increase in network bandwidth.
Tree-Based Density Clustering using Graphics Processors A First Marriage of MRNet and GPUs Evan Samanas and Ben Welton Paradyn Project Paradyn / Dyninst Week College Park, Maryland March 26-28, 2012 The
More informationStandard promoted by main manufacturers Fortran. Structure: Directives, clauses and run time calls
OpenMP Introducción Directivas Regiones paralelas Worksharing sincronizaciones Visibilidad datos Implementación OpenMP: introduction Standard promoted by main manufacturers http://www.openmp.org, http://www.compunity.org
More informationTutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE
Tutorial: Analyzing MPI Applications Intel Trace Analyzer and Collector Intel VTune Amplifier XE Contents Legal Information... 3 1. Overview... 4 1.1. Prerequisites... 5 1.1.1. Required Software... 5 1.1.2.
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD
More informationOutline. Overview Theoretical background Parallel computing systems Parallel programming models MPI/OpenMP examples
Outline Overview Theoretical background Parallel computing systems Parallel programming models MPI/OpenMP examples OVERVIEW y What is Parallel Computing? Parallel computing: use of multiple processors
More informationCOMP Superscalar. COMPSs Tracing Manual
COMP Superscalar COMPSs Tracing Manual Version: 2.4 November 9, 2018 This manual only provides information about the COMPSs tracing system. Specifically, it illustrates how to run COMPSs applications with
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationPerformance Analysis of MPI-programs. 4. Characteristics and methods of debugging for parallel programs.
Performance Analysis of MPI-programs (Originally was written in Russian. Sample.) 4. Characteristics and methods of debugging for parallel programs. 4.1 Main performance characteristics. Possibility to
More informationExercises: April 11. Hermann Härtig, TU Dresden, Distributed OS, Load Balancing
Exercises: April 11 1 PARTITIONING IN MPI COMMUNICATION AND NOISE AS HPC BOTTLENECK LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2017 Hermann Härtig THIS LECTURE Partitioning: bulk synchronous
More informationComposite Metrics for System Throughput in HPC
Composite Metrics for System Throughput in HPC John D. McCalpin, Ph.D. IBM Corporation Austin, TX SuperComputing 2003 Phoenix, AZ November 18, 2003 Overview The HPC Challenge Benchmark was announced last
More information6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT
6.189 IAP 2007 Lecture 5 Parallel Programming Concepts 1 6.189 IAP 2007 MIT Recap Two primary patterns of multicore architecture design Shared memory Ex: Intel Core 2 Duo/Quad One copy of data shared among
More informationECE 574 Cluster Computing Lecture 13
ECE 574 Cluster Computing Lecture 13 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 March 2017 Announcements HW#5 Finally Graded Had right idea, but often result not an *exact*
More informationThree basic multiprocessing issues
Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico September 26, 2011 CPD
More informationA Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004
A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into
More informationBlocking SEND/RECEIVE
Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F
More informationPurity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language
Purity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language Richard B. Johnson and Jeffrey K. Hollingsworth Department of Computer Science, University of Maryland,
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationAssessment of LS-DYNA Scalability Performance on Cray XD1
5 th European LS-DYNA Users Conference Computing Technology (2) Assessment of LS-DYNA Scalability Performance on Cray Author: Ting-Ting Zhu, Cray Inc. Correspondence: Telephone: 651-65-987 Fax: 651-65-9123
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationEnabling Scalable Parallel Processing of Venus/OMNeT++ Network Models on the IBM Blue Gene/Q Supercomputer
Enabling Scalable Parallel Processing of Venus/OMNeT++ Network Models on the IBM Blue Gene/Q Supercomputer Chris Carothers, Elsa Gonsiorowski and Justin LaPre Center for Computational Innovations Rensselaer
More informationHPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser
HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,
More information