Trace-driven co-simulation of high-performance computing systems using OMNeT++
1 Trace-driven co-simulation of high-performance computing systems using OMNeT++
Cyriel Minkenberg, Germán Rodríguez Herrera
IBM Research GmbH, Zurich, Switzerland
2nd International Workshop on OMNeT++, March 6
IBM Corporation
2 Overview
Trace-driven Co-simulation of High-Performance Computing Systems using OMNeT++
Context: MareNostrum, MareIncognito; design of interconnection networks for massively parallel computers
Models & tools: computation (Dimemas), communication (Venus), visualization (Paraver)
Integrated tool chain
Venus architecture; co-simulation with Dimemas; Paraver tracing
Accuracy: topologies, routing, mapping, Myrinet/IBA/Ethernet models
Sample results
Conclusions & future work
3 Background: MareNostrum
Blade-based parallel computer at the Barcelona Supercomputing Center (BSC)
2,560 nodes: 10,240 IBM PowerPC 970MP processors at 2.3 GHz (2,560 JS21 blades)
Peak performance of Teraflops
20 TB of main memory, TB of disk storage
Interconnection networks: Myrinet and Gigabit Ethernet
Linux (SuSE distribution)
44 racks, 120 m² floor space
4 MareIncognito
Joint project between BSC and IBM to develop a follow-on for MareNostrum, codenamed MareIncognito: a 10+ PFLOP/s machine for the 2011 timeframe, comprising on the order of 10-20K 1+ TFLOP blades
Our focus: design a performance- and cost-optimized interconnection network for such a system
Gain a deeper understanding of HPC traffic patterns and of the impact of the interconnect on overall system performance
Use this understanding to optimize the design of the MareIncognito interconnect: reduce cost & power without sacrificing much performance
Aspects: topology, bisectional bandwidth, routing, contention (internal & external), task placement, collectives, adapter & switch implementation
5 Interconnect design
Ever-increasing levels of parallelism and distribution are causing a shift from processor-centric to interconnect-centric computer architecture
The interconnect represents a significant fraction of overall system cost: switches, cables, maintenance
We need tools to predict system performance with reasonable accuracy: absolute performance as well as parameter sensitivity and trend prediction ("what if?")
This requires an accurate model of application behavior (compute nodes) and an accurate model of communication behavior (interconnect)
Many tools do either of these things well; very few manage both simultaneously with sufficient accuracy
6 Compute node model
From the network perspective, a compute node acts as a traffic source and sink
In communication network design (telco, internet), typical approaches are:
Synthetic models based on some kind of stochastic process; these may (or may not) be reasonable if a sufficient level of statistical multiplexing is present, but their relation to reality is unclear at best
Replaying traces recorded on, e.g., some provider's backbone
In either case, the semantics of the traffic content are rarely considered: there are no causal dependencies between communications
Traffic in HPC systems has strong causal dependencies, stemming from the control and data dependencies inherent in a given parallel program
These dependencies can be captured by running a program on a given machine and recording them in a trace file
If the program has been written using the Message Passing Interface (MPI) library, this basically amounts to a per-process list of send/recv/wait calls
Such a trace can be replayed observing the MPI call semantics to correctly reproduce communication dependencies
7 Compute node model: Dimemas
Dimemas (BSC simulator):
Rapidly simulates MPI traces collected from real machines
Faithfully models MPI semantics & node architecture
Sensitivity analysis: helps identify coarse-grain factors and relevant communication phases
Leverages the CEPBA Tools trace generation, handling, and visualization
Models the interconnection network at a very high abstraction level
Flow: MPI application run → MPI trace → MPI replay with config file (#buses, bandwidth, latency, eager threshold, ...) → Dimemas interface (client) → Paraver node trace
Paraver (BSC visualization):
Visualizes MPI communication patterns at the node level
Allows debugging of inter-process communication, e.g., load imbalance, contention statistics
8 Interconnect model: Venus
Simulates the interconnect at flit level
Event-driven simulator using OMNeT++ (originally based on the MARS simulator)
Supports wormhole routing and segmentation for long messages
Provides various detailed switch and adapter implementations: Myrinet, 10G Ethernet, InfiniBand; generic input-, output-, and combined input-output-queued switches
Provides various topologies: Extended Generalized Fat Tree (XGFT), mesh, 2D/3D torus, hypercube, arbitrary Myrinet topologies
Supports various routing methods: source routing and table lookup; algorithmic (online) and Myrinet routes files (offline); static and dynamic
Highly configurable: topology, switch/adapter models, buffer sizes, link speed, flit size, segmentation, latencies, etc.
Supports MareNostrum/MareIncognito:
Server mode to co-simulate with Dimemas via a socket interface
Outputs Paraver-compatible trace files enabling detailed observation of network behavior
Detailed models of Myrinet switch and adapter hardware
Translation tool to convert a Myrinet map file into an OMNeT++ ned topology description
Import facility to load Myrinet route files at simulation runtime; supports multiple routes per flow (adaptivity)
Flexible task-to-node mapping mechanism
Tool to generate Myrinet map and routes files for arbitrary XGFTs
9 Tool Enhancement & Integration
Tool chain: an MPI application run produces an MPI trace; the xgft, routereader, and map2ned tools turn .routes and .map files into mapping, routes, and topology inputs
MPI replay (Dimemas) reads the MPI trace and a config file (#buses, bandwidth, latency, eager threshold, ...); the Dimemas interface (client) co-simulates over a socket with the Venus interface (server), which reads its own config file (adapter & switch architecture, bandwidth, delay, segmentation, buffer size, ...)
Both sides emit statistics plus Paraver node and network traces for Paraver visualization, analysis, and validation
Goal: understand the application's usage of physical communication resources and facilitate optimal communication subsystem design
10 Venus model structure
[Module/gate diagram: the toplevel module contains Host[0..H-1] and Network[0..N-1] modules, a dimemas interface module, a statistics module, and a control module, wired via statin/statout, datain/dataout, and control gates; each Host contains node[0..c-1] modules, adapter[0..a-1] modules, and a router connecting nodein/nodeout to adapin/adapout and the ingress/egress data gates.]
11 Hybrid Dimemas & Venus simulation
Based on the PDES null-message protocol; not really parallel: the simulators take turns simulating (blocking)
Earliest input time is set to 0 while any messages are in flight
Earliest output time is set to the timestamp of the next event
Client/server model: Venus acts as a server simulating the network; Dimemas acts as a client simulating the application
Each provides a minimal set of commands to the other
Dimemas commands: SEND timestamp source destination size extra-strings; STOP timestamp; END; FINISH; PROTO (OK_TO_SEND, READY_TO_RECV)
Venus responses: STOP REACHED timestamp; COMPLETED SEND [echo of the original control data + extra data]
12 D&V co-simulation: SEND example
Dimemas (VenusClient) side, event queue [1us] [10us] [20us]:
process_event(): the event at t = 1 us is a SEND → issue the SEND command (SEND 1us 0x001 0x002 size "any string to be sent back")
pop_event(): messages in flight > 0 and next event time (10 us) > 1 us → issue STOP 10us, then END; block in receive_commands()
Venus (ServerMod) side, event queue [1us] [2us] [10us]:
Blocked in pop_event(); END received (unblock)
Process the SEND command: send the message at t = 1 us; schedule a STOP at t = 10 us; continue simulation
Event: message received at t = 2 us → send the response COMPLETED SEND 2us 0x001 0x002 to Dimemas, then END; cancel the scheduled STOP; block in receive_commands()
Dimemas: END received (unblock); add a receive event into the queue at t = 2 us (updating the timestamp); process the next event
13 Network topologies
Fat tree (k-ary n-tree)
Extended Generalized Fat Tree (XGFT)
Mesh, torus, hypercube
Arbitrary (possibly irregular) topologies using Myrinet map files
[Figure: a k-ary n-tree with switch levels S1,* through S3,* (k² switches per level) and k³ adapters, alongside the system-level host/network gate diagram.]
14 Insights gained at multiple levels
Application & MPI library level: blocking vs. nonblocking calls
MPI protocol level: eager vs. rendez-vous
Topology & routing: static source-based routing; contention- and pattern-aware routings; slimmed networks; external vs. internal contention
Hardware: head-of-line blocking, switch arbitration policy, segmentation, deadlock, automatic communication-computation overlap
Putting it all together: validation with real traces on a real machine; identifying the sources of network performance losses; collectives variability
Interconnect layers: software (application & MPI library), protocol/middleware, network topology & routing, switch & adapter technology
15 At the protocol level: eager/rendez-vous
The rendez-vous protocol needs control messages
Their impact is relatively small if the control messages never wait behind a data segment to get sent
If control messages are sent after a long segment, the sender will be delayed; possible remedies are smaller segments or out-of-band protocol messages
The delay propagates to the other threads
Segment size trade-off: 16 KB, as in Myrinet for long messages, is too long to interleave urgent control messages
Alltoall example (no internal contention):
No imbalance, control messages always sent before data segments: eager ~4.3 ms, rendez-vous ~4.8 ms (waiting for protocol messages)
Very small imbalance (<10 us): time ~6.3 ms
16 Performance of WRF on two-level slimmed tree
[Charts: WRF, 256 nodes; execution time [s] and execution time relative to fat tree [%], plus relative increase in execution time [%], versus the number of top-level switches; output-queued switch vs. input-queued switch (Myrinet); slimmed-tree schematic with levels 2,1..2,M and 1,1..1,N and N/2 links per group.]
Dimemas settings: number of input links = number of output links = 256, latency = 8 us, cpu_ratio = 1.0, eager threshold = bytes
Venus settings: link speed = 2 Gb/s, flit size = 8 B (duration = 32 ns), Myrinet switch and adapter models, Myrinet segmentation, adapter buffer size = 1024 KB, switch buffer size per port = 8 KB
17 Conclusions & future work
Conclusions:
Created an OMNeT++-based interconnection network simulator that faithfully models real networks in terms of topology, routing, flow control, and switch and adapter architecture
Integrated the network simulator with a trace-based MPI simulator to capture reactive traffic behavior, thus obtaining high accuracy in simulating the interactions between computation and communication
Enabled entirely new insights into the effect of the interconnection network on overall system performance under realistic traffic patterns
Future work:
Allow multiple simulators to connect to Venus simultaneously
Modify the MPI library to directly redirect communications to Venus
Make Venus itself run in parallel
Extend Venus to produce power estimates
19 D&V co-simulation: The client

/* DIMEMAS CODE */
int main() {
    // DIMEMAS CODE
    vc_initialize();          // open the socket to Venus; create the event queue for pending events
    // EVENT MANAGER
    while (events() || venus_pending_events()) {
        pop_event();          // BLOCKING
        // DIMEMAS CODE
        case SEND:      vc_send();
        case RDVZ_RECV: vc_ready_recv();
        case RDVZ_SEND: vc_ready_send();
    }
    vc_finish();              // close the socket
    // DIMEMAS CODE
}

Notes:
The conditions to stop the simulation are modified: the normal queue can now be empty because some relevant events are still pending to be finished by Venus
pop_event() blocks if necessary (checking the number of messages in flight) and checks whether Venus communicated something; reception of messages in Venus is handled here
The relevant events are extracted from the events queue and put into the pending_events queue
vc_send() and friends send the commands to Venus through the socket and increase the number of messages in flight; this will cause pop_event() to relinquish control to Venus if necessary
20 D&V co-simulation: The server
ServerMod is interposed between the statistics module and the network; it acts as a generator and sink of user messages

/* SERVERMOD CODE */
int initialize() {
    // bind/listen/accept (server socket)
}
int receive_commands() {      // BLOCKING
    while (command = read_from_socket())
        execute(command);
}
int handlemessage() {
    case STOP:
        waitforcommands = true;
        send_through_socket(STOP REACHED);
    case USER_MESSAGE:
        waitforcommands = callback(usermessage);
    // [...]
    if (waitforcommands)
        receive_commands();
}

Notes:
ServerMod inserts user messages into the network as instructed by Dimemas, or schedules STOPs to relinquish control back to Dimemas
Commands are read until an END command is found; then the Venus simulation continues
If a STOP is reached, tell Dimemas and block (in receive_commands()) until Dimemas schedules the next safe earliest output time (STOP)
When ServerMod sees a user message it executes the associated callback: if it is a protocol message, new messages are injected into the network; if it is a data message that completed, inform Dimemas, cancel the scheduled STOP, and block (receive_commands())
21 Extended Generalized Fat Trees
XGFT(h; m_1, ..., m_h; w_1, ..., w_h)
h = height = number of levels - 1; levels are numbered 0 through h
Level 0: compute nodes; levels 1 through h: switch nodes
m_i = number of children per node at level i, 0 < i <= h
w_i = number of parents per node at level i-1, 0 < i <= h
Number of level-0 nodes = Π_i m_i; number of level-h nodes = Π_i w_i
Example: XGFT(3; 3, 2, 2; 2, 2, 3)
[Figure: the XGFT(3; 3,2,2; 2,2,3) topology, with nodes labeled by position tuples such as 0,0,0 through 2,1,1.]
22 Performance of 256-node WRF on various topologies
[Chart: execution time [s] per topology for several cpu ratios; topologies include Hypercube(8,9) down to Hypercube(0,256), Torus(16,16,1,7), Torus(16,8,2,7), Torus(16,4,4,7), Torus(8,8,4,7), Torus(8,4,4,8), Torus(4,4,4,10), Torus(4,4,2,14), MareNostrum(256), FatTree(2,32), and SlimTree(2,32,1) through SlimTree(2,32,16).]
23 Impact of scheduling policy
Interconnect layers: software (application & MPI library), protocol/middleware, network topology & routing, switch & adapter technology
The scheduling policy is the determining factor (infinite CPU, IQ switch)
Round-robin pointers are initialized (a) randomly, (b) to 0, (c) to N-1
When contention occurs (on output 15), the wrong selection (which depends on the position of the pointer) results in HOL blocking, detected in the L2 switches
The scheduler should first serve the input on which another message will arrive; unfortunately, it has no crystal ball
24 Backup
[Figure: k-ary n-tree with switch levels S1,* through S3,* and k³ adapters.]
25 Multi-rail system
[Figure: system module with Host[0..H-1] connected to multiple Network[0..N-1] instances (one rail each) via datain/dataout and statin/statout gates.]
26 MultiHost module
[Figure: Host module containing node[0..c-1] and adapter[0..a-1] modules joined by a router via nodein/nodeout and adapin/adapout gates.]
27 D&V co-simulation: Protocol (hints)
Human-readable ASCII text, sent through standard Unix sockets; few commands and responses; extensible
Commands:
SEND timestamp source destination size extra-strings — the extra strings encode the events that Dimemas will update upon reception of the message
STOP timestamp
END
FINISH
PROTO (OK_TO_SEND, READY_TO_RECV) — parameters as for SEND
Responses:
STOP REACHED timestamp
COMPLETED SEND [echo of the original data + extra data]
28 Impact of switch architecture
Infinite CPU speed, output-queued switch: ideal case; the traffic pattern completes in about two times the message duration (54 µs) plus startup latency; completion time = 63 µs
Infinite CPU speed, input-queued switch: three messages experience additional delay due to HOL blocking; completion time = 87 µs
Normal CPU speed, output-queued switch: one node (task 95) starts sending 26 µs later than all other tasks, adding 26 µs to the communication time; completion time = 135 µs
Normal CPU speed, input-queued switch: completion time = 157 µs
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationPerformance Evaluation of a New Routing Strategy for Irregular Networks with Source Routing
Performance Evaluation of a New Routing Strategy for Irregular Networks with Source Routing J. Flich, M. P. Malumbres, P. López and J. Duato Dpto. Informática de Sistemas y Computadores Universidad Politécnica
More informationLecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: ABCs of Networks Starting Point: Send bits between 2 computers Queue
More informationLecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance
Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,
More informationA PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers Maxime Martinasso, Grzegorz Kwasniewski, Sadaf R. Alam, Thomas C. Schulthess, Torsten Hoefler Swiss National Supercomputing
More informationSMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems
Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]
More informationStockholm Brain Institute Blue Gene/L
Stockholm Brain Institute Blue Gene/L 1 Stockholm Brain Institute Blue Gene/L 2 IBM Systems & Technology Group and IBM Research IBM Blue Gene /P - An Overview of a Petaflop Capable System Carl G. Tengwall
More informationMultiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
More information1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects
Overview of Interconnects Myrinet and Quadrics Leading Modern Interconnects Presentation Outline General Concepts of Interconnects Myrinet Latest Products Quadrics Latest Release Our Research Interconnects
More informationA Multiple LID Routing Scheme for Fat-Tree-Based InfiniBand Networks
A Multiple LID Routing Scheme for Fat-Tree-Based InfiniBand Networks Xuan-Yi Lin, Yeh-Ching Chung, and Tai-Yi Huang Department of Computer Science National Tsing-Hua University, Hsinchu, Taiwan 00, ROC
More informationTopology and affinity aware hierarchical and distributed load-balancing in Charm++
Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Emmanuel Jeannot, Guillaume Mercier, François Tessier Inria - IPB - LaBRI - University of Bordeaux - Argonne National
More informationMultiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed
Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking
More informationAdvanced Computer Networks. Datacenter TCP
Advanced Computer Networks 263 3501 00 Datacenter TCP Spring Semester 2017 1 Oriana Riva, Department of Computer Science ETH Zürich Today Problems with TCP in the Data Center TCP Incast TPC timeouts Improvements
More informationThe Future of High Performance Interconnects
The Future of High Performance Interconnects Ashrut Ambastha HPC Advisory Council Perth, Australia :: August 2017 When Algorithms Go Rogue 2017 Mellanox Technologies 2 When Algorithms Go Rogue 2017 Mellanox
More informationSoftware and Tools for HPE s The Machine Project
Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric
More informationLecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 26: Interconnects James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L26 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today get an overview of parallel
More informationGUARANTEED END-TO-END LATENCY THROUGH ETHERNET
GUARANTEED END-TO-END LATENCY THROUGH ETHERNET Øyvind Holmeide, OnTime Networks AS, Oslo, Norway oeyvind@ontimenet.com Markus Schmitz, OnTime Networks LLC, Texas, USA markus@ontimenet.com Abstract: Latency
More informationCMSC 611: Advanced. Interconnection Networks
CMSC 611: Advanced Computer Architecture Interconnection Networks Interconnection Networks Massively parallel processor networks (MPP) Thousands of nodes Short distance (
More informationDetermining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace
Determining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace James Southern, Jim Tuccillo SGI 25 October 2016 0 Motivation Trend in HPC continues to be towards more
More informationLarge-Scale Network Simulation Scalability and an FPGA-based Network Simulator
Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationThe way toward peta-flops
The way toward peta-flops ISC-2011 Dr. Pierre Lagier Chief Technology Officer Fujitsu Systems Europe Where things started from DESIGN CONCEPTS 2 New challenges and requirements! Optimal sustained flops
More informationInfiniband Fast Interconnect
Infiniband Fast Interconnect Yuan Liu Institute of Information and Mathematical Sciences Massey University May 2009 Abstract Infiniband is the new generation fast interconnect provides bandwidths both
More informationContention-based Congestion Management in Large-Scale Networks
Contention-based Congestion Management in Large-Scale Networks Gwangsun Kim, Changhyun Kim, Jiyun Jeong, Mike Parker, John Kim KAIST Intel Corp. {gskim, nangigs, cjy9037, jjk12}@kaist.ac.kr mike.a.parker@intel.com
More informationLessons learned from MPI
Lessons learned from MPI Patrick Geoffray Opinionated Senior Software Architect patrick@myri.com 1 GM design Written by hardware people, pre-date MPI. 2-sided and 1-sided operations: All asynchronous.
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationReceive Livelock. Robert Grimm New York University
Receive Livelock Robert Grimm New York University The Three Questions What is the problem? What is new or different? What are the contributions and limitations? Motivation Interrupts work well when I/O
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationWhite paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation
White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview
More informationxsim The Extreme-Scale Simulator
www.bsc.es xsim The Extreme-Scale Simulator Janko Strassburg Severo Ochoa Seminar @ BSC, 28 Feb 2014 Motivation Future exascale systems are predicted to have hundreds of thousands of nodes, thousands of
More informationInterconnection Networks
Lecture 18: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Credit: many of these slides were created by Michael Papamichael This lecture is partially
More informationParallel Architecture. Sathish Vadhiyar
Parallel Architecture Sathish Vadhiyar Motivations of Parallel Computing Faster execution times From days or months to hours or seconds E.g., climate modelling, bioinformatics Large amount of data dictate
More informationInfiniBand-based HPC Clusters
Boosting Scalability of InfiniBand-based HPC Clusters Asaf Wachtel, Senior Product Manager 2010 Voltaire Inc. InfiniBand-based HPC Clusters Scalability Challenges Cluster TCO Scalability Hardware costs
More informationECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts
ECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts School of Electrical and Computer Engineering Cornell University revision: 2017-10-17-12-26 1 Network/Roadway Analogy 3 1.1. Running
More informationTECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 11th CALL (T ier-0)
TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 11th CALL (T ier-0) Contributing sites and the corresponding computer systems for this call are: BSC, Spain IBM System X idataplex CINECA, Italy The site selection
More informationToward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies
Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies François Tessier, Venkatram Vishwanath, Paul Gressier Argonne National Laboratory, USA Wednesday
More informationFault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson
Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies Mohsin Y Ahmed Conlan Wesson Overview NoC: Future generation of many core processor on a single chip
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationBoosting the Performance of Myrinet Networks
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. Y, MONTH 22 1 Boosting the Performance of Myrinet Networks J. Flich, P. López, M. P. Malumbres, and J. Duato Abstract Networks of workstations
More informationEfficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip
ASP-DAC 2010 20 Jan 2010 Session 6C Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip Jonas Diemer, Rolf Ernst TU Braunschweig, Germany diemer@ida.ing.tu-bs.de Michael Kauschke Intel,
More informationFCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow
FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture
More informationMapping MPI+X Applications to Multi-GPU Architectures
Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under
More informationLecture 21: Congestion Control" CSE 123: Computer Networks Alex C. Snoeren
Lecture 21: Congestion Control" CSE 123: Computer Networks Alex C. Snoeren Lecture 21 Overview" How fast should a sending host transmit data? Not to fast, not to slow, just right Should not be faster than
More informationScalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
More informationAdvanced Computer Networks. Flow Control
Advanced Computer Networks 263 3501 00 Flow Control Patrick Stuedi, Qin Yin, Timothy Roscoe Spring Semester 2015 Oriana Riva, Department of Computer Science ETH Zürich 1 Today Flow Control Store-and-forward,
More information