Trace-driven co-simulation of high-performance computing systems using OMNeT++
1 Trace-driven co-simulation of high-performance computing systems using OMNeT++
Cyriel Minkenberg, Germán Rodríguez Herrera
IBM Research GmbH, Zurich, Switzerland
2nd International Workshop on OMNeT++, March 6
IBM Corporation
2 Overview
Trace-driven Co-simulation of High-Performance Computing Systems using OMNeT++
Context: MareNostrum, MareIncognito; design of interconnection networks for massively parallel computers
Models & tools: computation (Dimemas), communication (Venus), visualization (Paraver)
Integrated tool chain
Venus architecture; co-simulation with Dimemas; Paraver tracing
Accuracy: topologies, routing, mapping, Myrinet/IBA/Ethernet models
Sample results
Conclusions & future work
3 Background: MareNostrum
Blade-based parallel computer at the Barcelona Supercomputing Center (BSC)
2,560 nodes: 10,240 IBM PowerPC 970MP processors at 2.3 GHz (2,560 JS21 blades)
Peak performance of Teraflops
20 TB of main memory, TB of disk storage
Interconnection networks: Myrinet and Gigabit Ethernet
Linux (SuSE distribution)
44 racks, 120 m² floor space
4 MareIncognito
Joint project between BSC and IBM to develop a follow-on for MareNostrum, codenamed MareIncognito: a 10+ PFLOP/s machine for the 2011 timeframe, comprising on the order of 10-20K 1+ TFLOP blades
Our focus: design a performance- and cost-optimized interconnection network for such a system
Gain a deeper understanding of HPC traffic patterns and of the impact of the interconnect on overall system performance
Use this understanding to optimize the design of the MareIncognito interconnect: reduce cost & power without sacrificing much performance
Aspects: topology, bisectional bandwidth, routing, contention (internal & external), task placement, collectives, adapter & switch implementation
5 Interconnect design
Ever-increasing levels of parallelism and distribution are causing a shift from processor-centric to interconnect-centric computer architecture
The interconnect represents a significant fraction of overall system cost: switches, cables, maintenance
We need tools to predict system performance with reasonable accuracy: absolute performance as well as parameter sensitivity and trend prediction ("what if?")
This requires an accurate model of application behavior (compute nodes) and an accurate model of communication behavior (interconnect)
Many tools do either of these things well; very few manage both simultaneously with sufficient accuracy
6 Compute node model
From the network perspective, a compute node acts as a traffic source and sink
In communication network design (telco, internet), typical approaches are:
Synthetic models based on some kind of stochastic process; these may (or may not) be reasonable if a sufficient level of statistical multiplexing is present, but their relation to reality is unclear at best
Replaying traces recorded on, e.g., some provider's backbone
In either case, the semantics of the traffic content are rarely considered: there are no causal dependencies between communications
Traffic in HPC systems has strong causal dependencies, stemming from the control and data dependencies inherent in a given parallel program
These dependencies can be captured by running a program on a given machine and recording them in a trace file
If the program has been written using the Message Passing Interface (MPI) library, this basically amounts to a per-process list of send/recv/wait calls
Such a trace can be replayed observing the MPI call semantics to correctly reproduce communication dependencies
7 Compute node model: Dimemas
Dimemas (BSC simulator):
Rapidly simulates MPI traces collected from real machines
Faithfully models MPI semantics & node architecture
Sensitivity analysis: helps identify coarse-grain factors and relevant communication phases
Leverages the CEPBA Tools trace generation, handling, and visualization
Models the interconnection network at a very high abstraction level
Flow: MPI application run → MPI trace → MPI replay with config file (#buses, bandwidth, latency, eager threshold, ...) → Dimemas interface (client) → Paraver node trace
Paraver (BSC visualization):
Visualizes MPI communication patterns at the node level
Allows debugging of inter-process communication, e.g., load imbalance, contention statistics
8 Interconnect model: Venus
Simulates the interconnect at flit level
Event-driven simulator using OMNeT++ (originally based on the MARS simulator)
Supports wormhole routing and segmentation for long messages
Provides various detailed switch and adapter implementations: Myrinet, 10G Ethernet, InfiniBand; generic input-, output-, and combined input-output-queued switches
Provides various topologies: Extended Generalized Fat Tree (XGFT), mesh, 2D/3D torus, hypercube, arbitrary Myrinet topologies
Supports various routing methods: source routing and table lookup; algorithmic (online) and Myrinet routes files (offline); static and dynamic
Highly configurable: topology, switch/adapter models, buffer sizes, link speed, flit size, segmentation, latencies, etc.
Supports MareNostrum/MareIncognito:
Server mode to co-simulate with Dimemas via a socket interface
Outputs Paraver-compatible trace files enabling detailed observation of network behavior
Detailed models of Myrinet switch and adapter hardware
Translation tool to convert a Myrinet map file into an OMNeT++ ned topology description
Import facility to load Myrinet route files at simulation runtime; supports multiple routes per flow (adaptivity)
Flexible task-to-node mapping mechanism
Tool to generate Myrinet map and routes files for arbitrary XGFTs
9 Tool Enhancement & Integration
Tool chain: an MPI application run produces an MPI trace; the xgft, routereader, and map2ned tools turn .routes and .map files into mapping, routes, and topology inputs
MPI replay (Dimemas) reads the MPI trace and a config file (#buses, bandwidth, latency, eager threshold, ...); the Dimemas interface (client) co-simulates over a socket with the Venus interface (server), which reads its own config file (adapter & switch architecture, bandwidth, delay, segmentation, buffer size, ...)
Both sides emit statistics plus Paraver node and network traces for Paraver visualization, analysis, and validation
Goal: understand the application's usage of physical communication resources and facilitate optimal communication subsystem design
10 Venus model structure
[Module/gate diagram: the toplevel module contains Host[0..H-1] and Network[0..N-1] modules, a dimemas interface module, a statistics module, and a control module, wired via statin/statout, datain/dataout, and control gates; each Host contains node[0..c-1] modules, adapter[0..a-1] modules, and a router connecting nodein/nodeout to adapin/adapout and the ingress/egress data gates.]
11 Hybrid Dimemas & Venus simulation
Based on the PDES null-message protocol; not really parallel: the simulators take turns simulating (blocking)
Earliest input time is set to 0 while any messages are in flight
Earliest output time is set to the timestamp of the next event
Client/server model: Venus acts as a server simulating the network; Dimemas acts as a client simulating the application
Each provides a minimal set of commands to the other
Dimemas commands: SEND timestamp source destination size extra-strings; STOP timestamp; END; FINISH; PROTO (OK_TO_SEND, READY_TO_RECV)
Venus responses: STOP REACHED timestamp; COMPLETED SEND [echo of the original control data + extra data]
12 D&V co-simulation: SEND example
Dimemas (VenusClient) side, event queue [1us] [10us] [20us]:
process_event(): the event at t = 1 us is a SEND → issue the SEND command (SEND 1us 0x001 0x002 size "any string to be sent back")
pop_event(): messages in flight > 0 and next event time (10 us) > 1 us → issue STOP 10us, then END; block in receive_commands()
Venus (ServerMod) side, event queue [1us] [2us] [10us]:
Blocked in pop_event(); END received (unblock)
Process the SEND command: send the message at t = 1 us; schedule a STOP at t = 10 us; continue simulation
Event: message received at t = 2 us → send the response COMPLETED SEND 2us 0x001 0x002 to Dimemas, then END; cancel the scheduled STOP; block in receive_commands()
Dimemas: END received (unblock); add a receive event into the queue at t = 2 us (updating the timestamp); process the next event
13 Network topologies
Fat tree (k-ary n-tree)
Extended Generalized Fat Tree (XGFT)
Mesh, torus, hypercube
Arbitrary (possibly irregular) topologies using Myrinet map files
[Figure: a k-ary n-tree with switch levels S1,* through S3,* (k² switches per level) and k³ adapters, alongside the system-level host/network gate diagram.]
14 Insights gained at multiple levels
Application & MPI library level: blocking vs. nonblocking calls
MPI protocol level: eager vs. rendez-vous
Topology & routing: static source-based routing; contention- and pattern-aware routings; slimmed networks; external vs. internal contention
Hardware: head-of-line blocking, switch arbitration policy, segmentation, deadlock, automatic communication-computation overlap
Putting it all together: validation with real traces on a real machine; identifying the sources of network performance losses; collectives variability
Interconnect layers: software (application & MPI library), protocol/middleware, network topology & routing, switch & adapter technology
15 At the protocol level: eager/rendez-vous
The rendez-vous protocol needs control messages
Their impact is relatively small if the control messages never wait behind a data segment to get sent
If control messages are sent after a long segment, the sender will be delayed; possible remedies are smaller segments or out-of-band protocol messages
The delay propagates to the other threads
Segment size trade-off: 16 KB, as in Myrinet for long messages, is too long to interleave urgent control messages
Alltoall example (no internal contention):
No imbalance, control messages always sent before data segments: eager ~4.3 ms, rendez-vous ~4.8 ms (waiting for protocol messages)
Very small imbalance (<10 us): time ~6.3 ms
16 Performance of WRF on two-level slimmed tree
[Charts: WRF, 256 nodes; execution time [s] and execution time relative to fat tree [%], plus relative increase in execution time [%], versus the number of top-level switches; output-queued switch vs. input-queued switch (Myrinet); slimmed-tree schematic with levels 2,1..2,M and 1,1..1,N and N/2 links per group.]
Dimemas settings: number of input links = number of output links = 256, latency = 8 us, cpu_ratio = 1.0, eager threshold = bytes
Venus settings: link speed = 2 Gb/s, flit size = 8 B (duration = 32 ns), Myrinet switch and adapter models, Myrinet segmentation, adapter buffer size = 1024 KB, switch buffer size per port = 8 KB
17 Conclusions & future work
Conclusions:
Created an OMNeT++-based interconnection network simulator that faithfully models real networks in terms of topology, routing, flow control, and switch and adapter architecture
Integrated the network simulator with a trace-based MPI simulator to capture reactive traffic behavior, thus obtaining high accuracy in simulating the interactions between computation and communication
Enabled entirely new insights into the effect of the interconnection network on overall system performance under realistic traffic patterns
Future work:
Allow multiple simulators to connect to Venus simultaneously
Modify the MPI library to directly redirect communications to Venus
Make Venus itself run in parallel
Extend Venus to produce power estimates
19 D&V co-simulation: The client

/* DIMEMAS CODE */
int main() {
    // DIMEMAS CODE
    vc_initialize();          // open the socket to Venus; create the event queue for pending events
    // EVENT MANAGER
    while (events() || venus_pending_events()) {
        pop_event();          // BLOCKING
        // DIMEMAS CODE
        case SEND:      vc_send();
        case RDVZ_RECV: vc_ready_recv();
        case RDVZ_SEND: vc_ready_send();
    }
    vc_finish();              // close the socket
    // DIMEMAS CODE
}

Notes:
The conditions to stop the simulation are modified: the normal queue can now be empty because some relevant events are still pending to be finished by Venus
pop_event() blocks if necessary (checking the number of messages in flight) and checks whether Venus communicated something; reception of messages in Venus is handled here
The relevant events are extracted from the events queue and put into the pending_events queue
vc_send() and friends send the commands to Venus through the socket and increase the number of messages in flight; this will cause pop_event() to relinquish control to Venus if necessary
20 D&V co-simulation: The server
ServerMod is interposed between the statistics module and the network; it acts as a generator and sink of user messages

/* SERVERMOD CODE */
int initialize() {
    // bind/listen/accept (server socket)
}
int receive_commands() {      // BLOCKING
    while (command = read_from_socket())
        execute(command);
}
int handlemessage() {
    case STOP:
        waitforcommands = true;
        send_through_socket(STOP REACHED);
    case USER_MESSAGE:
        waitforcommands = callback(usermessage);
    // [...]
    if (waitforcommands)
        receive_commands();
}

Notes:
ServerMod inserts user messages into the network as instructed by Dimemas, or schedules STOPs to relinquish control back to Dimemas
Commands are read until an END command is found; then the Venus simulation continues
If a STOP is reached, tell Dimemas and block (in receive_commands()) until Dimemas schedules the next safe earliest output time (STOP)
When ServerMod sees a user message it executes the associated callback: if it is a protocol message, new messages are injected into the network; if it is a data message that completed, inform Dimemas, cancel the scheduled STOP, and block (receive_commands())
21 Extended Generalized Fat Trees
XGFT(h; m_1, ..., m_h; w_1, ..., w_h)
h = height = number of levels - 1; levels are numbered 0 through h
Level 0: compute nodes; levels 1 through h: switch nodes
m_i = number of children per node at level i, 0 < i <= h
w_i = number of parents per node at level i-1, 0 < i <= h
Number of level-0 nodes = Π_i m_i; number of level-h nodes = Π_i w_i
Example: XGFT(3; 3, 2, 2; 2, 2, 3)
[Figure: the XGFT(3; 3,2,2; 2,2,3) topology, with nodes labeled by position tuples such as 0,0,0 through 2,1,1.]
22 Performance of 256-node WRF on various topologies
[Chart: execution time [s] per topology for several cpu ratios; topologies include Hypercube(8,9) down to Hypercube(0,256), Torus(16,16,1,7), Torus(16,8,2,7), Torus(16,4,4,7), Torus(8,8,4,7), Torus(8,4,4,8), Torus(4,4,4,10), Torus(4,4,2,14), MareNostrum(256), FatTree(2,32), and SlimTree(2,32,1) through SlimTree(2,32,16).]
23 Impact of scheduling policy
Interconnect layers: software (application & MPI library), protocol/middleware, network topology & routing, switch & adapter technology
The scheduling policy is the determining factor (infinite CPU, IQ switch)
Round-robin pointers are initialized (a) randomly, (b) to 0, (c) to N-1
When contention occurs (on output 15), the wrong selection (which depends on the position of the pointer) results in HOL blocking, detected in the L2 switches
The scheduler should first serve the input on which another message will arrive; unfortunately, it has no crystal ball
24 Backup
[Figure: k-ary n-tree with switch levels S1,* through S3,* and k³ adapters.]
25 Multi-rail system
[Figure: system module with Host[0..H-1] connected to multiple Network[0..N-1] instances (one rail each) via datain/dataout and statin/statout gates.]
26 MultiHost module
[Figure: Host module containing node[0..c-1] and adapter[0..a-1] modules joined by a router via nodein/nodeout and adapin/adapout gates.]
27 D&V co-simulation: Protocol (hints)
Human-readable ASCII text, sent through standard Unix sockets; few commands and responses; extensible
Commands:
SEND timestamp source destination size extra-strings — the extra strings encode the events that Dimemas will update upon reception of the message
STOP timestamp
END
FINISH
PROTO (OK_TO_SEND, READY_TO_RECV) — parameters as for SEND
Responses:
STOP REACHED timestamp
COMPLETED SEND [echo of the original data + extra data]
28 Impact of switch architecture
Infinite CPU speed, output-queued switch: ideal case; the traffic pattern completes in about two times the message duration (54 µs) plus startup latency; completion time = 63 µs
Infinite CPU speed, input-queued switch: three messages experience additional delay due to HOL blocking; completion time = 87 µs
Normal CPU speed, output-queued switch: one node (task 95) starts sending 26 µs later than all other tasks, adding 26 µs to the communication time; completion time = 135 µs
Normal CPU speed, input-queued switch: completion time = 157 µs
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationPerformance Evaluation of a New Routing Strategy for Irregular Networks with Source Routing
Performance Evaluation of a New Routing Strategy for Irregular Networks with Source Routing J. Flich, M. P. Malumbres, P. López and J. Duato Dpto. Informática de Sistemas y Computadores Universidad Politécnica
More informationLecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 28: Networks & Interconnect Architectural Issues Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: ABCs of Networks Starting Point: Send bits between 2 computers Queue
More informationLecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance
Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,
More informationA PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers Maxime Martinasso, Grzegorz Kwasniewski, Sadaf R. Alam, Thomas C. Schulthess, Torsten Hoefler Swiss National Supercomputing
More informationSMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems
Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]
More informationStockholm Brain Institute Blue Gene/L
Stockholm Brain Institute Blue Gene/L 1 Stockholm Brain Institute Blue Gene/L 2 IBM Systems & Technology Group and IBM Research IBM Blue Gene /P - An Overview of a Petaflop Capable System Carl G. Tengwall
More informationMultiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
More information1/5/2012. Overview of Interconnects. Presentation Outline. Myrinet and Quadrics. Interconnects. Switch-Based Interconnects
Overview of Interconnects Myrinet and Quadrics Leading Modern Interconnects Presentation Outline General Concepts of Interconnects Myrinet Latest Products Quadrics Latest Release Our Research Interconnects
More informationA Multiple LID Routing Scheme for Fat-Tree-Based InfiniBand Networks
A Multiple LID Routing Scheme for Fat-Tree-Based InfiniBand Networks Xuan-Yi Lin, Yeh-Ching Chung, and Tai-Yi Huang Department of Computer Science National Tsing-Hua University, Hsinchu, Taiwan 00, ROC
More informationTopology and affinity aware hierarchical and distributed load-balancing in Charm++
Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Emmanuel Jeannot, Guillaume Mercier, François Tessier Inria - IPB - LaBRI - University of Bordeaux - Argonne National
More informationMultiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed
Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking
More informationAdvanced Computer Networks. Datacenter TCP
Advanced Computer Networks 263 3501 00 Datacenter TCP Spring Semester 2017 1 Oriana Riva, Department of Computer Science ETH Zürich Today Problems with TCP in the Data Center TCP Incast TPC timeouts Improvements
More informationThe Future of High Performance Interconnects
The Future of High Performance Interconnects Ashrut Ambastha HPC Advisory Council Perth, Australia :: August 2017 When Algorithms Go Rogue 2017 Mellanox Technologies 2 When Algorithms Go Rogue 2017 Mellanox
More informationSoftware and Tools for HPE s The Machine Project
Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric
More informationLecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 26: Interconnects James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L26 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today get an overview of parallel
More informationGUARANTEED END-TO-END LATENCY THROUGH ETHERNET
GUARANTEED END-TO-END LATENCY THROUGH ETHERNET Øyvind Holmeide, OnTime Networks AS, Oslo, Norway oeyvind@ontimenet.com Markus Schmitz, OnTime Networks LLC, Texas, USA markus@ontimenet.com Abstract: Latency
More informationCMSC 611: Advanced. Interconnection Networks
CMSC 611: Advanced Computer Architecture Interconnection Networks Interconnection Networks Massively parallel processor networks (MPP) Thousands of nodes Short distance (
More informationDetermining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace
Determining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace James Southern, Jim Tuccillo SGI 25 October 2016 0 Motivation Trend in HPC continues to be towards more
More informationLarge-Scale Network Simulation Scalability and an FPGA-based Network Simulator
Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationThe way toward peta-flops
The way toward peta-flops ISC-2011 Dr. Pierre Lagier Chief Technology Officer Fujitsu Systems Europe Where things started from DESIGN CONCEPTS 2 New challenges and requirements! Optimal sustained flops
More informationInfiniband Fast Interconnect
Infiniband Fast Interconnect Yuan Liu Institute of Information and Mathematical Sciences Massey University May 2009 Abstract Infiniband is the new generation fast interconnect provides bandwidths both
More informationContention-based Congestion Management in Large-Scale Networks
Contention-based Congestion Management in Large-Scale Networks Gwangsun Kim, Changhyun Kim, Jiyun Jeong, Mike Parker, John Kim KAIST Intel Corp. {gskim, nangigs, cjy9037, jjk12}@kaist.ac.kr mike.a.parker@intel.com
More informationLessons learned from MPI
Lessons learned from MPI Patrick Geoffray Opinionated Senior Software Architect patrick@myri.com 1 GM design Written by hardware people, pre-date MPI. 2-sided and 1-sided operations: All asynchronous.
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationReceive Livelock. Robert Grimm New York University
Receive Livelock Robert Grimm New York University The Three Questions What is the problem? What is new or different? What are the contributions and limitations? Motivation Interrupts work well when I/O
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationWhite paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation
White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview
More informationxsim The Extreme-Scale Simulator
www.bsc.es xsim The Extreme-Scale Simulator Janko Strassburg Severo Ochoa Seminar @ BSC, 28 Feb 2014 Motivation Future exascale systems are predicted to have hundreds of thousands of nodes, thousands of
More informationInterconnection Networks
Lecture 18: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Credit: many of these slides were created by Michael Papamichael This lecture is partially
More informationParallel Architecture. Sathish Vadhiyar
Parallel Architecture Sathish Vadhiyar Motivations of Parallel Computing Faster execution times From days or months to hours or seconds E.g., climate modelling, bioinformatics Large amount of data dictate
More informationInfiniBand-based HPC Clusters
Boosting Scalability of InfiniBand-based HPC Clusters Asaf Wachtel, Senior Product Manager 2010 Voltaire Inc. InfiniBand-based HPC Clusters Scalability Challenges Cluster TCO Scalability Hardware costs
More informationECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts
ECE 4750 Computer Architecture, Fall 2017 T06 Fundamental Network Concepts School of Electrical and Computer Engineering Cornell University revision: 2017-10-17-12-26 1 Network/Roadway Analogy 3 1.1. Running
More informationTECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 11th CALL (T ier-0)
TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 11th CALL (T ier-0) Contributing sites and the corresponding computer systems for this call are: BSC, Spain IBM System X idataplex CINECA, Italy The site selection
More informationToward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies
Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies François Tessier, Venkatram Vishwanath, Paul Gressier Argonne National Laboratory, USA Wednesday
More informationFault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies. Mohsin Y Ahmed Conlan Wesson
Fault Tolerant and Secure Architectures for On Chip Networks With Emerging Interconnect Technologies Mohsin Y Ahmed Conlan Wesson Overview NoC: Future generation of many core processor on a single chip
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationBoosting the Performance of Myrinet Networks
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. Y, MONTH 22 1 Boosting the Performance of Myrinet Networks J. Flich, P. López, M. P. Malumbres, and J. Duato Abstract Networks of workstations
More informationEfficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip
ASP-DAC 2010 20 Jan 2010 Session 6C Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip Jonas Diemer, Rolf Ernst TU Braunschweig, Germany diemer@ida.ing.tu-bs.de Michael Kauschke Intel,
More informationFCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow
FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture
More informationMapping MPI+X Applications to Multi-GPU Architectures
Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under
More informationLecture 21: Congestion Control" CSE 123: Computer Networks Alex C. Snoeren
Lecture 21: Congestion Control" CSE 123: Computer Networks Alex C. Snoeren Lecture 21 Overview" How fast should a sending host transmit data? Not to fast, not to slow, just right Should not be faster than
More informationScalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
More informationAdvanced Computer Networks. Flow Control
Advanced Computer Networks 263 3501 00 Flow Control Patrick Stuedi, Qin Yin, Timothy Roscoe Spring Semester 2015 Oriana Riva, Department of Computer Science ETH Zürich 1 Today Flow Control Store-and-forward,
More information