Trace-driven co-simulation of high-performance computing systems using OMNeT++


1 Trace-driven co-simulation of high-performance computing systems using OMNeT++
Cyriel Minkenberg, Germán Rodríguez Herrera
IBM Research GmbH, Zurich, Switzerland
2nd International Workshop on OMNeT++, March 6, 2009

2 Overview
- Context: MareNostrum, MareIncognito; design of interconnection networks for massively parallel computers
- Models & tools: computation (Dimemas), communication (Venus), visualization (Paraver)
- Integrated tool chain: Venus architecture, co-simulation with Dimemas, Paraver tracing
- Accuracy: topologies, routing, mapping, Myrinet/IBA/Ethernet models
- Sample results
- Conclusions & future work

3 Background: MareNostrum
- Blade-based parallel computer at the Barcelona Supercomputing Center (BSC)
- 2,560 nodes; 10,240 IBM PowerPC 970MP processors at 2.3 GHz (2,560 JS21 blades)
- Peak performance of Teraflops; 20 TB of main memory; TB of disk storage
- Interconnection networks: Myrinet and Gigabit Ethernet
- Linux (SuSE distribution)
- 44 racks, 120 m² of floor space

4 MareIncognito
- Joint project between BSC and IBM to develop a follow-on for MareNostrum, codenamed MareIncognito
- 10+ PFLOP/s machine for the 2011 timeframe, comprising on the order of 10-20K 1+ TFLOP blades
- Our focus: design a performance- and cost-optimized interconnection network for such a system
  - Gain a deeper understanding of HPC traffic patterns and the impact of the interconnect on overall system performance
  - Use this understanding to optimize the design of the MareIncognito interconnect: reduce cost & power without sacrificing much performance
  - Aspects: topology, bisectional bandwidth, routing, contention (internal & external), task placement, collectives, adapter & switch implementation

5 Interconnect design
- Ever-increasing levels of parallelism and distribution are causing a shift from processor-centric to interconnect-centric computer architecture
- The interconnect represents a significant fraction of overall system cost: switches, cables, maintenance
- We need tools to predict system performance with reasonable accuracy
  - Absolute performance
  - Parameter sensitivity; trend prediction ("what if?")
- Needed: an accurate model of application behavior (compute nodes) and an accurate model of communication behavior (interconnect)
- Many tools do either of these things well; very few manage both simultaneously with sufficient accuracy

6 Compute node model
- From the network perspective, the compute node acts as a traffic source and sink
- In communication network design (telco, internet), typical approaches are:
  - Synthetic models based on some kind of stochastic process: may (or may not) be reasonable if a sufficient level of statistical multiplexing is present; relation to reality unclear at best
  - Replaying traces recorded on, e.g., some provider's backbone
- In either case, the semantics of the traffic content are rarely considered: there are no causal dependencies between communications
- Traffic in HPC systems has strong causal dependencies
  - These stem from control and data dependencies inherent in a given parallel program
  - These dependencies can be captured by running the program on a given machine and recording them in a trace file
  - If the program has been written using the Message Passing Interface (MPI) library, this basically amounts to a per-process list of send/recv/wait calls
  - Such a trace can be replayed observing the MPI call semantics to correctly reproduce communication dependencies (a minimal replay sketch follows below)
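To make the replay idea concrete, here is a minimal sketch of such a dependency-preserving replay loop (illustrative C++; the record layout, the trivial latency/bandwidth network model, and all names are assumptions for this sketch, not the Dimemas trace format): a receive only completes once the matching send has been delivered, so the causal ordering captured in the trace is respected.

    #include <algorithm>
    #include <cstdio>
    #include <map>
    #include <queue>
    #include <utility>
    #include <vector>

    // Hypothetical per-rank trace records: compute, send, and receive events
    // replayed against a trivial latency/bandwidth model.
    enum class Op { Compute, Send, Recv };
    struct Event { Op op; double value; int peer; };  // value = seconds or bytes

    int main() {
        const double latency = 2e-6, bandwidth = 250e6;   // assumed: 2 us, 250 MB/s
        // Two ranks: rank 0 computes, sends 4 KB to rank 1, then waits for a reply.
        std::vector<std::vector<Event>> trace = {
            { {Op::Compute, 10e-6, 0}, {Op::Send, 4096, 1}, {Op::Recv, 0, 1} },
            { {Op::Recv, 0, 0}, {Op::Compute, 5e-6, 0}, {Op::Send, 64, 0} },
        };
        std::vector<double> clk(trace.size(), 0.0);
        std::vector<std::size_t> pc(trace.size(), 0);
        std::map<std::pair<int, int>, std::queue<double>> inflight;  // (src,dst) -> arrival times

        bool progress = true;
        while (progress) {                            // sweep until no rank can advance
            progress = false;
            for (int r = 0; r < static_cast<int>(trace.size()); ++r) {
                while (pc[r] < trace[r].size()) {
                    const Event& e = trace[r][pc[r]];
                    if (e.op == Op::Compute) {
                        clk[r] += e.value;            // computation advances local time
                    } else if (e.op == Op::Send) {
                        inflight[{r, e.peer}].push(clk[r] + latency + e.value / bandwidth);
                    } else {                          // Recv: blocked until the send arrived
                        auto& q = inflight[{e.peer, r}];
                        if (q.empty()) break;         // dependency not yet satisfied
                        clk[r] = std::max(clk[r], q.front());
                        q.pop();
                    }
                    ++pc[r];
                    progress = true;
                }
            }
        }
        for (std::size_t r = 0; r < clk.size(); ++r)
            std::printf("rank %zu finishes at %.2f us\n", r, clk[r] * 1e6);
        return 0;
    }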

7 Compute node model: Dimemas
Dimemas (BSC simulator)
- Rapidly simulates MPI traces collected from real machines
- Faithfully models MPI semantics & node architecture
- Sensitivity: helps identify coarse-grain factors and relevant communication phases
- Leverages CEPBA Tools trace generation, handling, and visualization
- Models the interconnection network at a very high abstraction level
Paraver (BSC visualization)
- Visualizes MPI communication patterns at the node level
- Allows debugging of inter-process communication, e.g., load imbalance, contention statistics
(Tool flow on the slide: MPI application run -> MPI trace -> Dimemas MPI replay, configured via a config file (#buses, bandwidth, latency, eager threshold, ...) and a Dimemas interface (client) -> Paraver node trace -> Paraver.)

8 Interconnect model: Venus
- Simulates the interconnect at flit level; event-driven simulator using OMNeT++ (originally based on the MARS simulator)
- Supports wormhole routing and segmentation for long messages
- Provides various detailed switch and adapter implementations: Myrinet, 10G Ethernet, InfiniBand; generic input-, output-, and combined input-output-queued switches
- Provides various topologies: Extended Generalized Fat Tree (XGFT), mesh, 2D/3D torus, hypercube, arbitrary Myrinet topologies
- Supports various routing methods: source routing and table lookup; algorithmic (online) and Myrinet routes files (offline); static and dynamic (see the sketch below)
- Highly configurable: topology, switch/adapter models, buffer sizes, link speed, flit size, segmentation, latencies, etc.
- Supports MareNostrum/MareIncognito
- Server mode to co-simulate with Dimemas via a socket interface
- Outputs Paraver-compatible trace files, enabling detailed observation of network behavior
- Detailed models of Myrinet switch and adapter hardware
- Translation tool (map2ned) to convert a Myrinet map file into an OMNeT++ ned topology description
- Import facility to load Myrinet route files at simulation runtime; supports multiple routes per flow (adaptivity)
- Flexible task-to-node mapping mechanism
- Tool to generate Myrinet map and routes files for arbitrary XGFTs
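Two of the routing modes listed above differ mainly in where the path is decided. The sketch below (illustrative C++ with assumed names and structures, not the Venus implementation) contrasts them: with source routing the full port sequence is computed at the sender and carried by the packet, whereas with table lookup each switch consults its own per-destination forwarding table.

    #include <cstdio>
    #include <map>
    #include <vector>

    // Source routing: the sender attaches the whole list of output ports;
    // each switch just consumes the next entry (as Myrinet does).
    struct Packet { std::vector<int> route; std::size_t hop; };
    int nextPortSourceRouted(Packet& p) { return p.route[p.hop++]; }

    // Table lookup: every switch keeps a destination -> output-port table
    // and decides locally at each hop.
    struct Switch { std::map<int, int> table; };
    int nextPortTableLookup(const Switch& sw, int destination) {
        return sw.table.at(destination);
    }

    int main() {
        Packet p{{2, 0, 5}, 0};          // pre-computed port sequence
        Switch sw; sw.table[42] = 3;     // this switch forwards destination 42 via port 3
        std::printf("source-routed first hop: port %d\n", nextPortSourceRouted(p));
        std::printf("table-lookup hop to 42 : port %d\n", nextPortTableLookup(sw, 42));
        return 0;
    }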

9 Tool enhancement & integration
Tool-chain (flattened diagram): an MPI application run produces an MPI trace together with Myrinet .map and .routes files; the xgft, routereader, and map2ned tools turn these into the topology, routes, and mapping inputs for Venus. Dimemas (MPI replay, client interface) and Venus (server interface) are coupled through a co-simulation socket, each with its own config file (Dimemas: #buses, bandwidth, latency, eager threshold, ...; Venus: adapter & switch architecture, bandwidth, delay, segmentation, buffer size, ...). Both sides emit statistics and Paraver node/network traces for visualization, analysis, and validation in Paraver.
Goal: understand the application's usage of physical communication resources and facilitate optimal communication subsystem design.

10 Venus model structure
(Module-structure diagram: the top level connects the dimemas interface and statistics modules to the system; the system contains Host[0..H-1] and Network[0..N-1] modules; each Host contains node[0..c-1] and adapter[0..a-1] submodules connected through a router, with gates such as statin/statout, datain/dataout, nodein/nodeout, and adapin/adapout linking the modules.)

11 Hybrid Dimemas & Venus simulation
- Based on the PDES null-message protocol; not really parallel: the simulators take turns simulating (blocking)
  - Earliest input time is set to 0 while any messages are in flight
  - Earliest output time is set to the timestamp of the next event
- Client/server model
  - Venus acts as a server simulating the network; Dimemas acts as a client simulating the application
  - Each provides a minimal set of commands to the other
- Dimemas commands: SEND timestamp source destination size extra-strings; STOP timestamp; END; FINISH; PROTO (OK_TO_SEND, READY_TO_RECV)
- Venus responses: STOP REACHED timestamp; COMPLETED SEND [echo of the original control data + extra data]

12 D&V co-simulation: SEND example
Dimemas / VenusClient side (event queue: [1 us] [10 us] [20 us]):
- process_event(): the event at t = 1 us is a SEND; issue the command SEND 1us 0x001 0x002 <size> <any string to be sent back>
- pop_event(): messages in flight > 0 and the next event (at t = 10 us) is later than 1 us, so issue STOP 10us followed by END, then block in pop_event()
Venus / ServerMod side (event queue: [1 us] [2 us] [10 us]):
- Blocked in receive_commands() until END is received (unblock)
- Process the SEND command: send the message at t = 1 us, schedule a STOP at t = 10 us, continue the simulation
- Event: message received at t = 2 us; send the response COMPLETED SEND 2us 0x001 0x002 followed by END, cancel the scheduled STOP, and block again in receive_commands()
Dimemas / VenusClient side:
- Inside pop_event(): END received (unblock); add a receive event into the queue at time 2 us (updating the timestamp) and process the next event
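A compact sketch of the hand-over rule this example illustrates (illustrative C++ with assumed helper names, not the real VenusClient code): before blocking on its next event, Dimemas checks whether any message is still in flight; if so, it tells Venus how far it may safely simulate (STOP at the next local event time), closes the command batch with END, and waits for Venus's responses.

    #include <iostream>
    #include <string>

    // Illustrative hand-over rule from the SEND example above (assumed names).
    struct CoSimLink {
        int messagesInFlight = 0;
        void send(const std::string& cmd) { std::cout << cmd << "\n"; } // stands in for the socket write
    };

    void handOverIfNeeded(CoSimLink& link, double nextLocalEventTime) {
        if (link.messagesInFlight > 0) {
            // Venus may simulate up to our next event; END closes the command batch.
            link.send("STOP " + std::to_string(nextLocalEventTime));
            link.send("END");
            // ...here the real client blocks reading responses until Venus either
            // reaches the STOP or reports a COMPLETED SEND (which cancels the STOP).
        }
    }

    int main() {
        CoSimLink link;
        link.messagesInFlight = 1;     // e.g. the SEND issued at t = 1 us is still in the network
        handOverIfNeeded(link, 10.0);  // next local Dimemas event is at t = 10 us
        return 0;
    }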

13 Network topologies
- Fat tree (k-ary n-tree)
- Extended Generalized Fat Tree (XGFT)
- Mesh, torus, hypercube
- Arbitrary (possibly irregular) topologies using Myrinet map files
(The slide also shows a three-level fat-tree diagram with switches S1,0 through S3,k^2-1 over k^3 adapters, and the system module with Host[0..H-1] and Network[0..N-1].)

14 Insights gained at multiple levels
Interconnect layers: software (application & MPI library), protocol/middleware, network topology & routing, switch & adapter technology
- Application & MPI library level: blocking vs. non-blocking calls
- MPI protocol level: eager vs. rendez-vous
- Topology & routing: static source-based routing, contention- and pattern-aware routings, slimmed networks, external vs. internal contention
- Hardware: head-of-line blocking, switch arbitration policy, segmentation, deadlock, automatic communication-computation overlap
- Putting it all together: validation with real traces on a real machine, identifying the sources of network performance losses, collectives variability

15 At the protocol level: eager vs. rendez-vous
- The rendez-vous protocol needs control messages
- Their impact is relatively small as long as a control message never has to wait behind a data segment before it can be sent
- If control messages are sent after a long segment, the sender is delayed, and the delay propagates to the other threads
- Possible remedies: smaller segments, or out-of-band protocol messages
- Segment size trade-off: 16 KB for long messages (as in Myrinet) is too long to interleave urgent control messages
Alltoall example (no internal contention):
- No imbalance, control messages always sent before data segments: eager ~4.3 ms, rendez-vous ~4.8 ms (waiting for protocol messages)
- Very small imbalance (<10 us): time ~6.3 ms
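A quick serialization-delay estimate shows why a 16 KB segment is too long to interleave urgent control messages. The snippet below is illustrative arithmetic only; it uses the 2 Gb/s link speed quoted in the Venus configuration on slide 16, and the smaller segment sizes are made-up comparison points.

    #include <cstdio>

    // Rough upper bound on how long an urgent control message may be stuck
    // behind one data segment that is already being transmitted on the link.
    int main() {
        const double link_bps = 2e9;                      // 2 Gb/s (from slide 16)
        const double seg_bytes[] = {1024, 4096, 16384};   // candidate segment sizes
        for (double s : seg_bytes)
            std::printf("%6.0f-byte segment -> up to %.1f us of blocking\n",
                        s, s * 8.0 / link_bps * 1e6);
        return 0;
    }

For the 16 KB case this gives roughly 65 us of potential blocking per segment, which is why smaller segments or out-of-band protocol messages help.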

16 Performance of WRF on a two-level slimmed tree (WRF, 256 nodes)
(Plots: execution time [s] and execution time relative to the full fat tree [%] versus the number of top-level switches, comparing an output-queued switch with an input-queued switch (Myrinet); the slide also sketches the two-level slimmed-tree topology.)
Dimemas settings: number of input links = number of output links = 256, latency = 8 us, cpu_ratio = 1.0, eager threshold (bytes)
Venus settings: link speed = 2 Gb/s, flit size = 8 B (duration = 32 ns), Myrinet switch and adapter models, Myrinet segmentation, adapter buffer size = 1024 KB, switch buffer size per port = 8 KB

17 Conclusions & future work
Conclusions
- Created an OMNeT++-based interconnection network simulator that faithfully models real networks in terms of topology, routing, flow control, and switch and adapter architecture
- Integrated the network simulator with a trace-based MPI simulator to capture reactive traffic behavior, thus obtaining high accuracy in simulating the interactions between computation and communication
- Enabled entirely new insights into the effect of the interconnection network on overall system performance under realistic traffic patterns
Future work
- Allow multiple simulators to connect to Venus simultaneously
- Modify the MPI library to redirect communications directly to Venus
- Make Venus itself run in parallel
- Extend Venus to produce power estimates


19 D&V co-simulation: The client
Client side: Dimemas / VenusClient opens a socket to Venus (ServerMod, the server) and creates an event queue for pending events. The replay loop (pseudocode):

    /* DIMEMAS CODE */
    int main() {
        // DIMEMAS CODE
        vc_initialize();                       // open the socket to Venus, create the pending-events queue
        // EVENT MANAGER
        while (events() || venus_pending_events()) {   // blocking
            pop_event();
            // DIMEMAS CODE: dispatch on the event type
            SEND:      vc_send();
            RDVZ_RECV: vc_ready_recv();
            RDVZ_SEND: vc_ready_send();
        }
        vc_finish();                           // close the socket
        // DIMEMAS CODE
    }

Notes
- The condition to stop the simulation is modified: the normal event queue may now be empty while some relevant events are still pending completion in Venus
- pop_event() blocks if necessary (it checks the number of messages in flight) and checks whether Venus communicated something; reception of messages simulated by Venus is handled here
- Relevant events are extracted from the events queue and moved into the pending_events queue
- vc_send(), vc_ready_recv(), and vc_ready_send() send the corresponding commands to Venus through the socket and increase the number of messages in flight; this will cause pop_event() to relinquish control to Venus if necessary

20 D&V co-simulation: The server
ServerMod is interposed between the statistics module and the network; it acts as a generator and sink of user messages.

    /* SERVERMOD CODE */
    int initialize() {
        bind/listen/accept (server socket);
    }
    int receive_commands() {                   // blocking
        while ((command = read_from_socket()))
            execute(command);
    }
    int handleMessage() {
        STOP:         waitforcommands = true; send_through_socket("STOP REACHED");
        USER_MESSAGE: waitforcommands = callback(usermessage);
        // [ ... ]
        if (waitforcommands)
            receive_commands();
    }

Notes
- ServerMod inserts user messages into the network as instructed by Dimemas, or schedules STOPs to relinquish control back to Dimemas
- Commands are read until an END command is found; then the Venus simulation continues
- If a STOP is reached, Venus tells Dimemas (STOP REACHED) and blocks in receive_commands() until Dimemas schedules the next safe earliest output time (STOP)
- When ServerMod sees a user message it executes the associated callback: for a protocol message, new messages are injected into the network; for a data message that completed, it informs Dimemas, cancels the scheduled STOP, and blocks in receive_commands()

21 Extended Generalized Fat Trees
XGFT(h; m_1, ..., m_h; w_1, ..., w_h)
- h = height = number of levels - 1; levels are numbered 0 through h
- Level 0: compute nodes; levels 1..h: switch nodes
- m_i = number of children per node at level i, 0 < i <= h
- w_i = number of parents per node at level i-1, 0 < i <= h
- Number of level-0 nodes = Π_i m_i
- Number of level-h nodes = Π_i w_i
Example: XGFT(3; 3, 2, 2; 2, 2, 3) (the slide shows the full example topology with per-level node labels such as 0,0,0 through 2,1,1)
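The level-0 and level-h formulas above generalize to every level. The sketch below (illustrative C++; the per-level expression is a straightforward derivation from the definition, not stated on the slide) computes the node count of each level for the slide's example, giving 12, 8, 8, 12 nodes on levels 0 through 3 of XGFT(3; 3,2,2; 2,2,3).

    #include <cstdio>
    #include <vector>

    // Number of nodes on each level of an XGFT(h; m_1..m_h; w_1..w_h):
    //   nodes(l) = w_1 * ... * w_l  *  m_(l+1) * ... * m_h
    // which reduces to prod(m_i) at level 0 and prod(w_i) at level h.
    std::vector<long> xgftLevelSizes(const std::vector<int>& m, const std::vector<int>& w) {
        const std::size_t h = m.size();           // assumes m.size() == w.size()
        std::vector<long> sizes(h + 1);
        for (std::size_t l = 0; l <= h; ++l) {
            long n = 1;
            for (std::size_t i = 0; i < l; ++i) n *= w[i];   // parent fan-out below level l
            for (std::size_t i = l; i < h; ++i) n *= m[i];   // child fan-out above level l
            sizes[l] = n;
        }
        return sizes;
    }

    int main() {
        // The example from the slide: XGFT(3; 3,2,2; 2,2,3)
        auto sizes = xgftLevelSizes({3, 2, 2}, {2, 2, 3});
        for (std::size_t l = 0; l < sizes.size(); ++l)
            std::printf("level %zu: %ld nodes\n", l, sizes[l]);
        return 0;
    }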

22 Performance of 256-node WRF on various topologies
(Bar chart: execution time [s] per topology for different cpu ratios; the topologies compared include various hypercubes and 3D tori, MareNostrum(256), FatTree(2,32), and SlimTree(2,32,1) through SlimTree(2,32,16).)

23 Impact of scheduling policy (switch & adapter technology layer)
- With infinite CPU speed and an input-queued switch, the scheduling policy is the determining factor
- (a) Round-robin pointers initialized randomly: when contention occurs (on output 15), a wrong selection (which depends on the position of the pointer) results in HOL blocking at the L2 switches; the scheduler should first serve the input on which another message will arrive, but unfortunately it has no crystal ball
- (b) Pointers initialized to 0; (c) pointers initialized to N-1
(A minimal round-robin arbitration sketch follows below.)
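To illustrate why the pointer position matters, here is a minimal, hypothetical round-robin selection routine (illustrative C++, not the Venus switch model): among the inputs currently requesting a given output, it grants the first one found at or after the pointer and then advances the pointer past the winner. With several inputs contending for the same output, whichever input the pointer happens to favor is served first, and a message queued behind the losers suffers head-of-line blocking even if its own output is free.

    #include <cstdio>
    #include <vector>

    // Grant the first requesting input at or after the pointer, then move the
    // pointer to one past the granted input.
    int roundRobinGrant(const std::vector<bool>& request, int& pointer) {
        const int n = static_cast<int>(request.size());
        for (int k = 0; k < n; ++k) {
            int i = (pointer + k) % n;
            if (request[i]) {
                pointer = (i + 1) % n;   // pointer update after a successful grant
                return i;                // granted input
            }
        }
        return -1;                       // no input is requesting
    }

    int main() {
        std::vector<bool> request = {false, true, false, true};  // inputs 1 and 3 contend
        int ptr = 0;                     // the initial pointer position decides who wins first
        int winner = roundRobinGrant(request, ptr);
        std::printf("granted input %d, next pointer %d\n", winner, ptr);
        return 0;
    }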

24 Backup
(Three-level fat-tree diagram: switches S1,0 through S3,k^2-1 over k^3 adapters.)

25 Multi-rail system
(Module diagram: each Host[0..H-1] connects through its datain/dataout gates to multiple Network[0..N-1] instances, i.e., one network per rail.)

26 MultiHost module
(Module diagram of a Host: node[0..c-1] and adapter[0..a-1] submodules connected through a router; the adapters' ingress/egress gates lead to the host's external datain/dataout gates.)

27 D&V co-simulation: Protocol (hints)
- Human-readable ASCII text, sent through standard Unix sockets
- Few commands and responses; extensible
Commands (Dimemas/VenusClient -> Venus/ServerMod):
- SEND timestamp source destination size extra-strings (the extra strings encode the events that Dimemas will update upon reception of the message)
- STOP timestamp
- END
- FINISH
- PROTO (OK_TO_SEND, READY_TO_RECV), with parameters as for SEND
Responses (Venus -> Dimemas):
- STOP REACHED timestamp
- COMPLETED SEND [echo of the original data + extra data]
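Because the protocol is plain line-oriented ASCII, each command can be handled with a very small tokenizer. The sketch below is illustrative only (assumed structure and names, not the actual VenusClient/ServerMod parser); the example line follows the SEND syntax above, with made-up size and extra-string values.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Split the first word off as the command name and keep the remaining
    // space-separated fields (timestamp, source, destination, size, extras).
    struct Command {
        std::string name;                 // e.g. "SEND", "STOP", "END"
        std::vector<std::string> fields;
    };

    Command parseLine(const std::string& line) {
        std::istringstream in(line);
        Command cmd;
        in >> cmd.name;
        for (std::string tok; in >> tok; )
            cmd.fields.push_back(tok);
        return cmd;
    }

    int main() {
        // In the spirit of the SEND exchange on slide 12 (size and extra string invented here).
        Command c = parseLine("SEND 1us 0x001 0x002 4096 event-tag");
        std::cout << c.name << " with " << c.fields.size() << " fields\n";
        return 0;
    }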

28 Impact of switch architecture
- Infinite CPU speed, output-queued switch: ideal case; the traffic pattern completes in about two times the message duration (54 µs) plus startup latency; completion time = 63 µs
- Infinite CPU speed, input-queued switch: three messages experience additional delay due to HOL blocking (why?); completion time = 87 µs
- Normal CPU speed, output-queued switch: one node (task 95) starts sending 26 µs later than all other tasks, adding 26 µs to the communication time; completion time = 135 µs
- Normal CPU speed, input-queued switch: completion time = 157 µs
