Topology Awareness in the Tofu Interconnect Series

Size: px
Start display at page:

Download "Topology Awareness in the Tofu Interconnect Series"

Transcription

1 Topology Awareness in the Tofu Interconnect Series Yuichiro Ajima Senior Architect Next Generation Technical Computing Unit Fujitsu Limited June 23rd, 2016, ExaComm2016 Workshop 0

2 Introduction Networks are getting larger Systems have tens of thousands of nodes Highly scalable network topologies e.g. multi-dimensional torus, dragonfly Channel bisection < 1/2 node count Bisection bandwidth < injection bandwidth Issue: communication algorithms Existing general algorithms will be inefficient (video) MPI_Bcast on the K computer Topology-aware optimization is required This talk presents the topologyawareness design of the Tofu interconnect series, and visualizes the achievements June 23rd, 2016, ExaComm2016 Workshop 1

3 Tofu Interconnect Series Highly scalable 6D mesh/torus network Tofu interconnect Developed for the K computer Tofu interconnect 2 SoC integration and optical transceiver Another version is being developed for the Post-K machine Tofu interconnect Tofu interconnect June 23rd, 2016, ExaComm2016 Workshop 2

4 Index Topology-aware task allocation Topology-aware optimization Tuned collective communication library Low-level features of the network interface Topology-aware algorithms (for long messages) June 23rd, 2016, ExaComm2016 Workshop 3

5 6D Mesh/Torus Network Dimension labels: XYZABC Lengths of A-, B-, and C-axes are fixed; 2, 3 and 2 Z B A C X Y Conceptual Model June 23rd, 2016, ExaComm2016 Workshop 4

6 Task Allocation and Rank Mapping A rectangular region in the physical 6D network for each task Contiguous in the XYZ-axes and not divided in the ABC-axes Virtual torus rank mapping Users defined the logical shape of the task as a virtual 1D/2D/3D torus The length for each dimension is defined in the batch script Example: using the full system of the K computer ( ) Virtual 1D torus Virtual 2D torus Virtual 3D torus #PJM -L "node=82944 #PJM -L "node=576x144 #PJM -L "node=54x48x32 A rank number reflects the logical coordinates of process Embedding a virtual torus into a physical rectangular region A nearest neighbor node in the virtual torus space is guaranteed to be a nearest neighbor node in the physical 6D network The task scheduler may add padding nodes and rotate the shape to increase the chance for allocation June 23rd, 2016, ExaComm2016 Workshop 5

7 Index Topology-aware task allocation Topology-aware optimization Tuned collective communication library Low-level features of the network interface Topology-aware algorithms (for long messages) June 23rd, 2016, ExaComm2016 Workshop 6

8 Manual Tuning with Profiling Dynamic profiling Enable profiling during the application s communication activity The profiler periodically samples performance analysis (PA) counters The profiling log is saved to storage after profiling PA counters of the Tofu interconnect Each counter is a hardware 64-bit register A set of PA counters is provided for each port of the router Bytes transferred, busy cycles, idle cycles, packet buffer depleted cycles, etc. Visualization Users find bottlenecks Manual performance tuning MPI and task allocation options Communication algorithms Screen shot of the Fujitsu Profiler June 23rd, 2016, ExaComm2016 Workshop 7

9 Automatic Tuning without Profiling Custom rank mapping order A default rank mapping order often affects the communication performance in a multi-dimensional torus One of the optimization candidates right after executing a vanilla code RMATT (Rank Mapping Automatic Tuning Tool) Requires no profiling log but execution statistics Calculates rank mapping order using the simulated annealing algorithm Users input the shape of torus and a list of communication pattern Each line of the list includes source and destination pair of processes and total amount of transferred data during a task MPI Application Configuration MPI option 32x32x16 RMATT (machine file) Input 0 (0,0,0) Output 1 (0,0,1) Comm. Pattern 2 (0,1,0) Rank Mapping Optimization 0 1 1kB 0 4 2kB 1 0 1kB MPI rank mapping file Execution environment June 23rd, 2016, ExaComm2016 Workshop 8

10 Evaluations of Improvement by RMATT NAS Parallel Benchmark (CG) Case 1: NPROCS=1024, CLASS=B, 2D Torus 32x32 Default(x-y order) Rank map optimized by RMATT Default RMATT Execution time 1.33 sec 1.24 sec 7% improved (includes calculation time) Case 2: NPROCS=8192, CLASS=D, 2D Torus 128x64 Default RMATT Execution time sec 9.98 sec 9% improved (includes calculation time) June 23rd, 2016, ExaComm2016 Workshop 9

11 Index Topology-aware task allocation Topology-aware performance optimization Tuned collective communication library Low-level features of the network interface Topology-aware algorithms (for long messages) June 23rd, 2016, ExaComm2016 Workshop 10

12 Low-Level Network Interface Fujitsu s FJMPI is developed based on Open MPI The tuned collective communication library bypasses the Open MPI stack and uses the low-level network interface directly MPI Interface Layer tuned COLL ob1 PML bypass tofu LLP r2 BML tofu BTL bypass tofu COMMON Tofu Library Tofu Interconnect June 23rd, 2016, ExaComm2016 Workshop 11

13 Simultaneous Communication Four RDMA engines (Tofu network interfaces) per node The peak injection bandwidth of each TNI is 5 GB/s for Tofu1 and 12.5 GB/s for Tofu2. The point-to-point messaging layer of the FJMPI uses four TNIs in a round-robin manner The tuned COLL identifies four TNIs to avoid a collision of the destination TNI CPU RDMA Engine 0 RDMA Engine 1 RDMA Engine 2 RDMA Engine 3 link 0 link 1 link 2 link 3 link 4 link 5 link 6 link 7 link 8 link 9 XYZ ABC June 23rd, 2016, ExaComm2016 Workshop 12

14 Injection Rate Control Latency (us) Contention depletes packet buffers and causes congestion Congestion can be avoided by reducing the injection rate Latencies of simultaneous 8-hop data transfer on a 32-node ring size Optimized packet gap Congestion Low injection rate Packet gap (multiples of MTU) June 23rd, 2016, ExaComm2016 Workshop 13

15 Index Topology-aware task allocation Topology-aware performance optimization Tuned collective communication library Low-level features of the network interface Topology-aware algorithms (for long messages) June 23rd, 2016, ExaComm2016 Workshop 14

16 Overview Assumed environment The shape of the communicator is a mesh or a torus One process per node participates in inter-process communication When there are multiple processes in a node, collective communication is fanned out through shared memory Optimization policies (for long messages only) Use multiple network interfaces Communicate with nearest neighbor nodes Control the injection rate for communication with far nodes Algorithms implemented in the FJMPI Triple trinary tree for broadcast and reduce Three-phase quad rings for gather Uniformly overlaid symmetrical pattern for all-to-all June 23rd, 2016, ExaComm2016 Workshop 15

17 Triple Trinary Tree Broadcasts data by dividing into three parts and simultaneously propagating each part via a different path Each path is a spanning trinary tree, and the three trees share no directed edges By reversing the direction of all edges, data can be reduced June 23rd, 2016, ExaComm2016 Workshop 16

18 MPI_Allreduce Two phases First phase reduce data using triple trinary trees Second phase broadcast the reduced data using the reversed trees (video) MPI_Allreduce on the K computer June 23rd, 2016, ExaComm2016 Workshop 17

19 Three-Phase Quad Rings The ring all-gather algorithm transfers data cyclically A C D A B D A B C B C D Divides data into four parts, and simultaneously transfers each part along a different direction Phase 1 Phase 2 Phase 3 June 23rd, 2016, ExaComm2016 Workshop 18

20 MPI_Allgather The three-phase quad ring algorithm (video) MPI_Allgather on the K computer June 23rd, 2016, ExaComm2016 Workshop 19

21 Uniformly Overlaid Symmetrical Pattern (1) A multi-phase all-to-all communication algorithm In each phase, each process transfers data to multiple processes that have symmetrical relative coordinates Each phase is divided into sub-phases June 23rd, 2016, ExaComm2016 Workshop 20

22 Uniformly Overlaid Symmetrical Pattern (2) For each phase, communication patterns of all processes are uniform For each sub-phase, the number of colliding transfers is the same as the hop count of a transfer Injection rate control for each sub-phase avoids congestion and increases effective throughput June 23rd, 2016, ExaComm2016 Workshop 21

23 MPI_Alltoall Uniformly overlaid symmetrical pattern algorithm (video) MPI_Alltoall on the K computer Left: the uniformly overlaid symmetrical pattern algorithm Right: default algorithm of the Open MPI June 23rd, 2016, ExaComm2016 Workshop 22

24 Summary Topology awareness design of the Tofu series Task allocation Virtual torus rank mapping Performance optimization Tofu PA counters for manual tuning with the Fujitsu Profiler Rank Mapping Automatic Tuning Tool (RMATT) Tuned collective communication library in the FJMPI Utilizes low-level network features Simultaneous communication Injection rate control Topology-aware algorithms for long messages Triple trinary tree algorithm for broadcast and reduce Three-phase quad rings algorithm for gather Uniformly overlaid symmetric pattern algorithm for all-to-all June 23rd, 2016, ExaComm2016 Workshop 23

25 24

Programming for Fujitsu Supercomputers

Programming for Fujitsu Supercomputers Programming for Fujitsu Supercomputers Koh Hotta The Next Generation Technical Computing Fujitsu Limited To Programmers who are busy on their own research, Fujitsu provides environments for Parallel Programming

More information

PRIMEHPC FX10: Advanced Software

PRIMEHPC FX10: Advanced Software PRIMEHPC FX10: Advanced Software Koh Hotta Fujitsu Limited System Software supports --- Stable/Robust & Low Overhead Execution of Large Scale Programs Operating System File System Program Development for

More information

The Tofu Interconnect D

The Tofu Interconnect D The Tofu Interconnect D 11 September 2018 Yuichiro Ajima, Takahiro Kawashima, Takayuki Okamoto, Naoyuki Shida, Kouichi Hirai, Toshiyuki Shimizu, Shinya Hiramoto, Yoshiro Ikeda, Takahide Yoshikawa, Kenji

More information

Current and Future Challenges of the Tofu Interconnect for Emerging Applications

Current and Future Challenges of the Tofu Interconnect for Emerging Applications Current and Future Challenges of the Tofu Interconnect for Emerging Applications Yuichiro Ajima Senior Architect Next Generation Technical Computing Unit Fujitsu Limited June 22, 2017, ExaComm 2017 Workshop

More information

The Tofu Interconnect 2

The Tofu Interconnect 2 The Tofu Interconnect 2 Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Shun Ando, Masahiro Maeda, Takahide Yoshikawa, Koji Hosoe, and Toshiyuki Shimizu Fujitsu Limited Introduction Tofu interconnect

More information

Technical Computing Suite supporting the hybrid system

Technical Computing Suite supporting the hybrid system Technical Computing Suite supporting the hybrid system Supercomputer PRIMEHPC FX10 PRIMERGY x86 cluster Hybrid System Configuration Supercomputer PRIMEHPC FX10 PRIMERGY x86 cluster 6D mesh/torus Interconnect

More information

Tofu Interconnect 2: System-on-Chip Integration of High-Performance Interconnect

Tofu Interconnect 2: System-on-Chip Integration of High-Performance Interconnect Tofu Interconnect 2: System-on-Chip Integration of High-Performance Interconnect Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Shunji Uno, Shinji Sumimoto, Kenichi Miura, Naoyuki Shida, Takahiro Kawashima,

More information

Tofu Interconnect 2: System-on-Chip Integration of High-Performance Interconnect

Tofu Interconnect 2: System-on-Chip Integration of High-Performance Interconnect Tofu Interconnect 2: System-on-Chip Integration of High-Performance Interconnect Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Shunji Uno, Shinji Sumimoto, Kenichi Miura, Naoyuki Shida, Takahiro Kawashima,

More information

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

Interconnection Networks: Topology. Prof. Natalie Enright Jerger Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design

More information

Advanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED

Advanced Software for the Supercomputer PRIMEHPC FX10. Copyright 2011 FUJITSU LIMITED Advanced Software for the Supercomputer PRIMEHPC FX10 System Configuration of PRIMEHPC FX10 nodes Login Compilation Job submission 6D mesh/torus Interconnect Local file system (Temporary area occupied

More information

ICC: An Interconnect Controller for the Tofu Interconnect Architecture

ICC: An Interconnect Controller for the Tofu Interconnect Architecture : An Interconnect Controller for the Tofu Interconnect Architecture August 24, 2010 Takashi Toyoshima Next Generation Technical Computing Unit Fujitsu Limited Background Requirements for Supercomputing

More information

White paper Advanced Technologies of the Supercomputer PRIMEHPC FX10

White paper Advanced Technologies of the Supercomputer PRIMEHPC FX10 White paper Advanced Technologies of the Supercomputer PRIMEHPC FX10 Next Generation Technical Computing Unit Fujitsu Limited Contents Overview of the PRIMEHPC FX10 Supercomputer 2 SPARC64 TM IXfx: Fujitsu-Developed

More information

Cray XC Scalability and the Aries Network Tony Ford

Cray XC Scalability and the Aries Network Tony Ford Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?

More information

Topologies. Maurizio Palesi. Maurizio Palesi 1

Topologies. Maurizio Palesi. Maurizio Palesi 1 Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

Fujitsu s Approach to Application Centric Petascale Computing

Fujitsu s Approach to Application Centric Petascale Computing Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview

More information

Lecture 2 Parallel Programming Platforms

Lecture 2 Parallel Programming Platforms Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple

More information

Topologies. Maurizio Palesi. Maurizio Palesi 1

Topologies. Maurizio Palesi. Maurizio Palesi 1 Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and

More information

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview

More information

The way toward peta-flops

The way toward peta-flops The way toward peta-flops ISC-2011 Dr. Pierre Lagier Chief Technology Officer Fujitsu Systems Europe Where things started from DESIGN CONCEPTS 2 New challenges and requirements! Optimal sustained flops

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

Phastlane: A Rapid Transit Optical Routing Network

Phastlane: A Rapid Transit Optical Routing Network Phastlane: A Rapid Transit Optical Routing Network Mark Cianchetti, Joseph Kerekes, and David Albonesi Computer Systems Laboratory Cornell University The Interconnect Bottleneck Future processors: tens

More information

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup Chapter 4 Routers with Tiny Buffers: Experiments This chapter describes two sets of experiments with tiny buffers in networks: one in a testbed and the other in a real network over the Internet2 1 backbone.

More information

BlueGene/L. Computer Science, University of Warwick. Source: IBM

BlueGene/L. Computer Science, University of Warwick. Source: IBM BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours

More information

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) Lecture 12: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) 1 Topologies Internet topologies are not very regular they grew

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 26: Interconnects James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L26 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today get an overview of parallel

More information

Networking for Data Acquisition Systems. Fabrice Le Goff - 14/02/ ISOTDAQ

Networking for Data Acquisition Systems. Fabrice Le Goff - 14/02/ ISOTDAQ Networking for Data Acquisition Systems Fabrice Le Goff - 14/02/2018 - ISOTDAQ Outline Generalities The OSI Model Ethernet and Local Area Networks IP and Routing TCP, UDP and Transport Efficiency Networking

More information

The Impact of Optics on HPC System Interconnects

The Impact of Optics on HPC System Interconnects The Impact of Optics on HPC System Interconnects Mike Parker and Steve Scott Hot Interconnects 2009 Manhattan, NYC Will cost-effective optics fundamentally change the landscape of networking? Yes. Changes

More information

Performance Evaluation of TOFU System Area Network Design for High- Performance Computer Systems

Performance Evaluation of TOFU System Area Network Design for High- Performance Computer Systems Performance Evaluation of TOFU System Area Network Design for High- Performance Computer Systems P. BOROVSKA, O. NAKOV, S. MARKOV, D. IVANOVA, F. FILIPOV Computer System Department Technical University

More information

Lecture 20: Distributed Memory Parallelism. William Gropp

Lecture 20: Distributed Memory Parallelism. William Gropp Lecture 20: Distributed Parallelism William Gropp www.cs.illinois.edu/~wgropp A Very Short, Very Introductory Introduction We start with a short introduction to parallel computing from scratch in order

More information

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware

More information

A closer look at network structure:

A closer look at network structure: T1: Introduction 1.1 What is computer network? Examples of computer network The Internet Network structure: edge and core 1.2 Why computer networks 1.3 The way networks work 1.4 Performance metrics: Delay,

More information

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters

Exploiting InfiniBand and GPUDirect Technology for High Performance Collectives on GPU Clusters Exploiting InfiniBand and Direct Technology for High Performance Collectives on Clusters Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth

More information

InfiniBand-based HPC Clusters

InfiniBand-based HPC Clusters Boosting Scalability of InfiniBand-based HPC Clusters Asaf Wachtel, Senior Product Manager 2010 Voltaire Inc. InfiniBand-based HPC Clusters Scalability Challenges Cluster TCO Scalability Hardware costs

More information

3D WiNoC Architectures

3D WiNoC Architectures Interconnect Enhances Architecture: Evolution of Wireless NoC from Planar to 3D 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan Sep 18th, 2014 Hiroki Matsutani, "3D WiNoC Architectures",

More information

Introduction to High-Speed InfiniBand Interconnect

Introduction to High-Speed InfiniBand Interconnect Introduction to High-Speed InfiniBand Interconnect 2 What is InfiniBand? Industry standard defined by the InfiniBand Trade Association Originated in 1999 InfiniBand specification defines an input/output

More information

Introduction of Fujitsu s next-generation supercomputer

Introduction of Fujitsu s next-generation supercomputer Introduction of Fujitsu s next-generation supercomputer MATSUMOTO Takayuki July 16, 2014 HPC Platform Solutions Fujitsu has a long history of supercomputing over 30 years Technologies and experience of

More information

xsim The Extreme-Scale Simulator

xsim The Extreme-Scale Simulator www.bsc.es xsim The Extreme-Scale Simulator Janko Strassburg Severo Ochoa Seminar @ BSC, 28 Feb 2014 Motivation Future exascale systems are predicted to have hundreds of thousands of nodes, thousands of

More information

Chapter 16 Networking

Chapter 16 Networking Chapter 16 Networking Outline 16.1 Introduction 16.2 Network Topology 16.3 Network Types 16.4 TCP/IP Protocol Stack 16.5 Application Layer 16.5.1 Hypertext Transfer Protocol (HTTP) 16.5.2 File Transfer

More information

Computer Networking. Introduction. Quintin jean-noël Grenoble university

Computer Networking. Introduction. Quintin jean-noël Grenoble university Computer Networking Introduction Quintin jean-noël Jean-noel.quintin@imag.fr Grenoble university Based on the presentation of Duda http://duda.imag.fr 1 Course organization Introduction Network and architecture

More information

Advanced Computer Networks. Flow Control

Advanced Computer Networks. Flow Control Advanced Computer Networks 263 3501 00 Flow Control Patrick Stuedi Spring Semester 2017 1 Oriana Riva, Department of Computer Science ETH Zürich Last week TCP in Datacenters Avoid incast problem - Reduce

More information

Extending the LAN. Context. Info 341 Networking and Distributed Applications. Building up the network. How to hook things together. Media NIC 10/18/10

Extending the LAN. Context. Info 341 Networking and Distributed Applications. Building up the network. How to hook things together. Media NIC 10/18/10 Extending the LAN Info 341 Networking and Distributed Applications Context Building up the network Media NIC Application How to hook things together Transport Internetwork Network Access Physical Internet

More information

Source-Route Bridging

Source-Route Bridging 25 CHAPTER Chapter Goals Describe when to use source-route bridging. Understand the difference between SRB and transparent bridging. Know the mechanism that end stations use to specify a source-route.

More information

CP2K Performance Benchmark and Profiling. April 2011

CP2K Performance Benchmark and Profiling. April 2011 CP2K Performance Benchmark and Profiling April 2011 Note The following research was performed under the HPC Advisory Council HPC works working group activities Participating vendors: HP, Intel, Mellanox

More information

Embedded Systems: Hardware Components (part II) Todor Stefanov

Embedded Systems: Hardware Components (part II) Todor Stefanov Embedded Systems: Hardware Components (part II) Todor Stefanov Leiden Embedded Research Center, Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded

More information

POLYMORPHIC ON-CHIP NETWORKS

POLYMORPHIC ON-CHIP NETWORKS POLYMORPHIC ON-CHIP NETWORKS Martha Mercaldi Kim, John D. Davis*, Mark Oskin, Todd Austin** University of Washington *Microsoft Research, Silicon Valley ** University of Michigan On-Chip Network Selection

More information

Scaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc

Scaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc Scaling to Petaflop Ola Torudbakken Distinguished Engineer Sun Microsystems, Inc HPC Market growth is strong CAGR increased from 9.2% (2006) to 15.5% (2007) Market in 2007 doubled from 2003 (Source: IDC

More information

Lecture: Interconnection Networks

Lecture: Interconnection Networks Lecture: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm 1 Packets/Flits A message is broken into multiple packets (each packet

More information

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013 NetSpeed ORION: A New Approach to Design On-chip Interconnects August 26 th, 2013 INTERCONNECTS BECOMING INCREASINGLY IMPORTANT Growing number of IP cores Average SoCs today have 100+ IPs Mixing and matching

More information

Scheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications

Scheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications Scheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications Sep 2009 Gilad Shainer, Tong Liu (Mellanox); Jeffrey Layton (Dell); Joshua Mora (AMD) High Performance Interconnects for

More information

CARRIER SENSE MULTIPLE ACCESS (CSMA):

CARRIER SENSE MULTIPLE ACCESS (CSMA): Lecture Handout Computer Networks Lecture No. 8 CARRIER SENSE MULTIPLE ACCESS (CSMA): There is no central control management when computers transmit on Ethernet. For this purpose the Ethernet employs CSMA

More information

Chelsio 10G Ethernet Open MPI OFED iwarp with Arista Switch

Chelsio 10G Ethernet Open MPI OFED iwarp with Arista Switch PERFORMANCE BENCHMARKS Chelsio 10G Ethernet Open MPI OFED iwarp with Arista Switch Chelsio Communications www.chelsio.com sales@chelsio.com +1-408-962-3600 Executive Summary Ethernet provides a reliable

More information

Topology basics. Constraints and measures. Butterfly networks.

Topology basics. Constraints and measures. Butterfly networks. EE48: Advanced Computer Organization Lecture # Interconnection Networks Architecture and Design Stanford University Topology basics. Constraints and measures. Butterfly networks. Lecture #: Monday, 7 April

More information

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University Material from: The Datacenter as a Computer: An Introduction to

More information

INTERCONNECTION NETWORKS LECTURE 4

INTERCONNECTION NETWORKS LECTURE 4 INTERCONNECTION NETWORKS LECTURE 4 DR. SAMMAN H. AMEEN 1 Topology Specifies way switches are wired Affects routing, reliability, throughput, latency, building ease Routing How does a message get from source

More information

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management Marina Garcia 22 August 2013 OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management M. Garcia, E. Vallejo, R. Beivide, M. Valero and G. Rodríguez Document number OFAR-CM: Efficient Dragonfly

More information

End-to-End Adaptive Packet Aggregation for High-Throughput I/O Bus Network Using Ethernet

End-to-End Adaptive Packet Aggregation for High-Throughput I/O Bus Network Using Ethernet Hot Interconnects 2014 End-to-End Adaptive Packet Aggregation for High-Throughput I/O Bus Network Using Ethernet Green Platform Research Laboratories, NEC, Japan J. Suzuki, Y. Hayashi, M. Kan, S. Miyakawa,

More information

Optimization of MPI Applications Rolf Rabenseifner

Optimization of MPI Applications Rolf Rabenseifner Optimization of MPI Applications Rolf Rabenseifner University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Optimization of MPI Applications Slide 1 Optimization and Standardization

More information

High-Performance Broadcast for Streaming and Deep Learning

High-Performance Broadcast for Streaming and Deep Learning High-Performance Broadcast for Streaming and Deep Learning Ching-Hsiang Chu chu.368@osu.edu Department of Computer Science and Engineering The Ohio State University OSU Booth - SC17 2 Outline Introduction

More information

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011

The Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011 The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities

More information

Lecture 2: Topology - I

Lecture 2: Topology - I ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 2: Topology - I Tushar Krishna Assistant Professor School of Electrical and

More information

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters

Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi

More information

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control

Lecture 24: Interconnection Networks. Topics: topologies, routing, deadlocks, flow control Lecture 24: Interconnection Networks Topics: topologies, routing, deadlocks, flow control 1 Topology Examples Grid Torus Hypercube Criteria Bus Ring 2Dtorus 6-cube Fully connected Performance Bisection

More information

APENet: LQCD clusters a la APE

APENet: LQCD clusters a la APE Overview Hardware/Software Benchmarks Conclusions APENet: LQCD clusters a la APE Concept, Development and Use Roberto Ammendola Istituto Nazionale di Fisica Nucleare, Sezione Roma Tor Vergata Centro Ricerce

More information

CS 6230: High-Performance Computing and Parallelization Introduction to MPI

CS 6230: High-Performance Computing and Parallelization Introduction to MPI CS 6230: High-Performance Computing and Parallelization Introduction to MPI Dr. Mike Kirby School of Computing and Scientific Computing and Imaging Institute University of Utah Salt Lake City, UT, USA

More information

Interconnection Network. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Interconnection Network. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Interconnection Network Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Topics Taxonomy Metric Topologies Characteristics Cost Performance 2 Interconnection

More information

Computer-to-Computer Networks (cont.)

Computer-to-Computer Networks (cont.) Computer-to-Computer Networks (cont.) 1 Connecting Mainframes - Option 1: Full Mesh / Direct Link Infrastructure Downsides: 1) cost - O(n 2 ) links and NICs 2) implementation complexity / scalability [

More information

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA Gilad Shainer 1, Tong Liu 1, Pak Lui 1, Todd Wilde 1 1 Mellanox Technologies Abstract From concept to engineering, and from design to

More information

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,

More information

Update on Topology Aware Scheduling (aka TAS)

Update on Topology Aware Scheduling (aka TAS) Update on Topology Aware Scheduling (aka TAS) Work done in collaboration with J. Enos, G. Bauer, R. Brunner, S. Islam R. Fiedler Adaptive Computing 2 When things look good Image credit: Dave Semeraro 3

More information

Physical Organization of Parallel Platforms. Alexandre David

Physical Organization of Parallel Platforms. Alexandre David Physical Organization of Parallel Platforms Alexandre David 1.2.05 1 Static vs. Dynamic Networks 13-02-2008 Alexandre David, MVP'08 2 Interconnection networks built using links and switches. How to connect:

More information

Interconnection Network

Interconnection Network Interconnection Network Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu) Topics

More information

CMSC 611: Advanced. Interconnection Networks

CMSC 611: Advanced. Interconnection Networks CMSC 611: Advanced Computer Architecture Interconnection Networks Interconnection Networks Massively parallel processor networks (MPP) Thousands of nodes Short distance (

More information

SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto

SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES. Natalie Enright Jerger University of Toronto SIGNET: NETWORK-ON-CHIP FILTERING FOR COARSE VECTOR DIRECTORIES University of Toronto Interaction of Coherence and Network 2 Cache coherence protocol drives network-on-chip traffic Scalable coherence protocols

More information

Adaptors Communicating. Link Layer: Introduction. Parity Checking. Error Detection. Multiple Access Links and Protocols

Adaptors Communicating. Link Layer: Introduction. Parity Checking. Error Detection. Multiple Access Links and Protocols Link Layer: Introduction daptors ommunicating hosts and routers are nodes links connect nodes wired links wireless links layer-2 packet is a frame, encapsulates datagram datagram controller sending host

More information

High Performance MPI on IBM 12x InfiniBand Architecture

High Performance MPI on IBM 12x InfiniBand Architecture High Performance MPI on IBM 12x InfiniBand Architecture Abhinav Vishnu, Brad Benton 1 and Dhabaleswar K. Panda {vishnu, panda} @ cse.ohio-state.edu {brad.benton}@us.ibm.com 1 1 Presentation Road-Map Introduction

More information

Real-Time Protocol (RTP)

Real-Time Protocol (RTP) Real-Time Protocol (RTP) Provides standard packet format for real-time application Typically runs over UDP Specifies header fields below Payload Type: 7 bits, providing 128 possible different types of

More information

Interconnect Your Future

Interconnect Your Future Interconnect Your Future Gilad Shainer 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014 Complete High-Performance Scalable Interconnect Infrastructure Comprehensive End-to-End Software Accelerators

More information

Flow Control can be viewed as a problem of

Flow Control can be viewed as a problem of NOC Flow Control 1 Flow Control Flow Control determines how the resources of a network, such as channel bandwidth and buffer capacity are allocated to packets traversing a network Goal is to use resources

More information

Part IV: 3D WiNoC Architectures

Part IV: 3D WiNoC Architectures Wireless NoC as Interconnection Backbone for Multicore Chips: Promises, Challenges, and Recent Developments Part IV: 3D WiNoC Architectures Hiroki Matsutani Keio University, Japan 1 Outline: 3D WiNoC Architectures

More information

Generic Topology Mapping Strategies for Large-scale Parallel Architectures

Generic Topology Mapping Strategies for Large-scale Parallel Architectures Generic Topology Mapping Strategies for Large-scale Parallel Architectures Torsten Hoefler and Marc Snir Scientific talk at ICS 11, Tucson, AZ, USA, June 1 st 2011, Hierarchical Sparse Networks are Ubiquitous

More information

Future of Interconnect Fabric A Contrarian View. Shekhar Borkar June 13, 2010 Intel Corp. 1

Future of Interconnect Fabric A Contrarian View. Shekhar Borkar June 13, 2010 Intel Corp. 1 Future of Interconnect Fabric A ontrarian View Shekhar Borkar June 13, 2010 Intel orp. 1 Outline Evolution of interconnect fabric On die network challenges Some simple contrarian proposals Evaluation and

More information

Lecture 9: Bridging. CSE 123: Computer Networks Alex C. Snoeren

Lecture 9: Bridging. CSE 123: Computer Networks Alex C. Snoeren Lecture 9: Bridging CSE 123: Computer Networks Alex C. Snoeren Lecture 9 Overview Finishing up media access Ethernet Contention-free methods (rings) Moving beyond one wire Link technologies have limits

More information

IV. PACKET SWITCH ARCHITECTURES

IV. PACKET SWITCH ARCHITECTURES IV. PACKET SWITCH ARCHITECTURES (a) General Concept - as packet arrives at switch, destination (and possibly source) field in packet header is used as index into routing tables specifying next switch in

More information

CS 421: COMPUTER NETWORKS SPRING FINAL May 24, minutes. Name: Student No: TOT

CS 421: COMPUTER NETWORKS SPRING FINAL May 24, minutes. Name: Student No: TOT CS 421: COMPUTER NETWORKS SPRING 2012 FINAL May 24, 2012 150 minutes Name: Student No: Show all your work very clearly. Partial credits will only be given if you carefully state your answer with a reasonable

More information

Lecture 3: Topology - II

Lecture 3: Topology - II ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 3: Topology - II Tushar Krishna Assistant Professor School of Electrical and

More information

Understanding the Routing Requirements for FPGA Array Computing Platform. Hayden So EE228a Project Presentation Dec 2 nd, 2003

Understanding the Routing Requirements for FPGA Array Computing Platform. Hayden So EE228a Project Presentation Dec 2 nd, 2003 Understanding the Routing Requirements for FPGA Array Computing Platform Hayden So EE228a Project Presentation Dec 2 nd, 2003 What is FPGA Array Computing? Aka: Reconfigurable Computing Aka: Spatial computing,

More information

In-Network Computing. Paving the Road to Exascale. 5th Annual MVAPICH User Group (MUG) Meeting, August 2017

In-Network Computing. Paving the Road to Exascale. 5th Annual MVAPICH User Group (MUG) Meeting, August 2017 In-Network Computing Paving the Road to Exascale 5th Annual MVAPICH User Group (MUG) Meeting, August 2017 Exponential Data Growth The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric

More information

Topology and affinity aware hierarchical and distributed load-balancing in Charm++

Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Emmanuel Jeannot, Guillaume Mercier, François Tessier Inria - IPB - LaBRI - University of Bordeaux - Argonne National

More information

Lecture 9: Bridging & Switching"

Lecture 9: Bridging & Switching Lecture 9: Bridging & Switching" CSE 123: Computer Networks Alex C. Snoeren HW 2 due Wednesday! Lecture 9 Overview" Finishing up media access Contention-free methods (rings) Moving beyond one wire Link

More information

Architecting Low Latency Cloud Networks

Architecting Low Latency Cloud Networks Architecting Low Latency Cloud Networks As data centers transition to next generation virtualized & elastic cloud architectures, high performance and resilient cloud networking has become a requirement

More information

Scalable Ethernet Clos-Switches. Norbert Eicker John von Neumann-Institute for Computing Ferdinand Geier ParTec Cluster Competence Center GmbH

Scalable Ethernet Clos-Switches. Norbert Eicker John von Neumann-Institute for Computing Ferdinand Geier ParTec Cluster Competence Center GmbH Scalable Ethernet Clos-Switches Norbert Eicker John von Neumann-Institute for Computing Ferdinand Geier ParTec Cluster Competence Center GmbH Outline Motivation Clos-Switches Ethernet Crossbar Switches

More information

Lecture 18: Communication Models and Architectures: Interconnection Networks

Lecture 18: Communication Models and Architectures: Interconnection Networks Design & Co-design of Embedded Systems Lecture 18: Communication Models and Architectures: Interconnection Networks Sharif University of Technology Computer Engineering g Dept. Winter-Spring 2008 Mehdi

More information

Parallel Computing Interconnection Networks

Parallel Computing Interconnection Networks Parallel Computing Interconnection Networks Readings: Hager s book (4.5) Pacheco s book (chapter 2.3.3) http://pages.cs.wisc.edu/~tvrdik/5/html/section5.html#aaaaatre e-based topologies Slides credit:

More information

Basic Communication Operations (Chapter 4)

Basic Communication Operations (Chapter 4) Basic Communication Operations (Chapter 4) Vivek Sarkar Department of Computer Science Rice University vsarkar@cs.rice.edu COMP 422 Lecture 17 13 March 2008 Review of Midterm Exam Outline MPI Example Program:

More information

IBM HPC Development MPI update

IBM HPC Development MPI update IBM HPC Development MPI update Chulho Kim Scicomp 2007 Enhancements in PE 4.3.0 & 4.3.1 MPI 1-sided improvements (October 2006) Selective collective enhancements (October 2006) AIX IB US enablement (July

More information

Multi-class Applications for Parallel Usage of a Guaranteed Rate and a Scavenger Service

Multi-class Applications for Parallel Usage of a Guaranteed Rate and a Scavenger Service Department of Computer Science 1/18 Multi-class Applications for Parallel Usage of a Guaranteed Rate and a Scavenger Service Markus Fidler fidler@informatik.rwth-aachen.de Volker Sander sander@fz.juelich.de

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information