
UNIVERSITY OF CASTILLA-LA MANCHA, Computing Systems Department. A case study on implementing virtual 5D torus networks using network components of lower dimensionality. HiPINEB 2017. Francisco José Andújar Muñoz et al.

Outline: Introduction. nDT torus topology. Building an nDT torus using EXTOLL cards. Performance evaluation. Conclusions and future work.

Supercomputer systems. Many scientific problems cannot be addressed in a laboratory: non-reproducible problems; experiments that are too dangerous or too expensive; different time constants for the system and the experimenter... Supercomputing is the key to addressing these problems! Supercomputers are used in several scientific areas: Medicine: modelling of proteins involved in diseases such as cancer or Alzheimer's. Meteorology: climate modelling and weather prediction. Physics: modelling of nuclear reactions, supernovae, black holes, etc. Automotive industry: aerodynamics modelling.

Interconnection networks. Supercomputers can have thousands of computing nodes, and the interconnection network is an essential component for communicating this huge number of nodes! The network topology has a significant impact on overall system performance. The torus topology is widely used in supercomputers: constant radix → easy implementation; low radix → simpler and cheaper hardware; scalable → linear cost of expansion; easy implementation of routing algorithms.

Building an n-dimensional torus. We can build a torus using: 2-port communication cards → 1D torus; 4-port communication cards → 2D torus; 6-port communication cards → 3D torus; ... 2n-port communication cards → nD torus.

nD torus performance. Torus performance depends on the number of dimensions: the higher the number of dimensions, the lower the distances among nodes, and hence the higher the network performance. Consider a 1024-PE network: 32 × 32 2D torus: d_avg = 16. 16 × 8 × 8 3D torus: d_avg = 8. 4 × 4 × 4 × 4 × 4 5D torus: d_avg = 5.
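The average distances quoted above follow from a standard property: in a ring of k nodes (k even), the average minimal distance to a uniformly random node is k/4, and since dimensions are independent, the torus average is the sum over dimensions. A minimal sketch (the function name is ours, not from the slides):

```python
# Average minimal-path hop count in a torus: each dimension is a ring of
# k nodes, whose average distance to a random node is k/4 (k even);
# dimensions are independent, so the torus average is the sum.
def torus_avg_distance(dims):
    return sum(k / 4 for k in dims)

print(torus_avg_distance([32, 32]))         # 2D torus, 1024 PEs -> 16.0
print(torus_avg_distance([16, 8, 8]))       # 3D torus           -> 8.0
print(torus_avg_distance([4, 4, 4, 4, 4]))  # 5D torus           -> 5.0
```

These reproduce exactly the three d_avg values given for the 1024-PE network.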

Increasing the torus performance. The higher the number of dimensions, the higher the number of ports per card, and the hardware complexity increases: more expensive chip production; some techniques (e.g. VOQ) become harder to implement; more expensive hardware. Is it possible to increase the number of dimensions without using communication cards with more ports? Idea: combine several cards into a single node's communication hardware. Simplest case: interconnect two cards through one port.


3DT torus topology. [Figure: a node built from two 4-port cards, Card 0 serving PE0 and Card 1 serving PE1, joined by an internal link; the remaining ports provide the positive and negative links in dimensions X, Y and Z.] Example: given 4-port cards, a 2D torus can be built... or we can use one port per card to interconnect the two cards. There are still 6 ports, forming a 6-port node. A 3D torus can be built using these nodes; it is called a 3D Twin (or simply 3DT) torus.

nDT torus topology. This idea can be generalized to n dimensions: the nD Twin (nDT) torus topology. The node communication hardware consists of two (n+1)-port cards, giving 2n+2 ports in total. One port of each card interconnects the two cards; we refer to this port as the internal link. The remaining 2n ports connect the node with its neighbour nodes. Example: 4-port cards → 2D torus or 3DT torus; 6-port cards → 3D torus or 5DT torus; 8-port cards → 4D torus or 7DT torus.
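The port arithmetic behind these examples can be checked directly. A small sketch under our own naming (not from the slides):

```python
# With p-port cards: a single card drives p/2 torus dimensions (one
# positive and one negative port per dimension), while a twin node has
# 2p ports, spends 2 on the internal link, and so keeps 2p - 2 external
# ports, i.e. p - 1 dimensions.
def single_card_dims(p):
    return p // 2

def twin_node_dims(p):
    return (2 * p - 2) // 2  # = p - 1

for p in (4, 6, 8):
    print(p, single_card_dims(p), twin_node_dims(p))
```

This reproduces the slide's pairs: 4-port → 2D/3DT, 6-port → 3D/5DT, 8-port → 4D/7DT.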

Optimal node configuration. [Figure: two alternative port assignments for a 3DT node built from Card 0 (PE0) and Card 1 (PE1).] Message latency depends on the node configuration: when a path has to cross between the two internal cards, latency increases! An nDT torus node admits (2n choose n)/2 configurations. The optimal configuration is the one that minimizes the number of paths that use the internal link.
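The configuration count comes from choosing which n of the 2n directional ports go to the first card, divided by two because swapping the two cards yields the same node. A quick check (function name is ours):

```python
import math

# Number of distinct nDT node configurations: choose n of the 2n
# directional ports for one card; halve to remove the card-swap symmetry.
def ndt_configurations(n):
    return math.comb(2 * n, n) // 2

print(ndt_configurations(3))  # 3DT node -> 10
print(ndt_configurations(5))  # 5DT node -> 126
```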

Optimal configuration for an nDT torus node. [Figure: optimal node configuration. (a) When n is even, each card takes the positive and negative ports of n/2 dimensions: d_0 .. d_{n/2-1} on Card 0 and d_{n/2} .. d_{n-1} on Card 1. (b) When n is odd, each card takes (n-1)/2 full dimensions and the two ports of the middle dimension d_{(n-1)/2} are split between the two cards.]

Optimal configuration for an nDT torus node. Example: given 6-port cards, we can build a 5DT torus, and the optimal configuration is: the ports of d_0 and d_1 are connected to the first card; the ports of d_3 and d_4 are connected to the second card; the ports of d_2 are split between the two cards.

DORT routing algorithm. DORT implements the DOR algorithm in the nDT torus. Each PE is identified by n+1 digits <o_0, o_1, ..., o_{n-1}, pe>, where o_0, ..., o_{n-1} are the coordinates in dimensions d_0, ..., d_{n-1} and pe is the processing-element identifier within the node. First, a packet is routed in order from dimension d_0 to d_{n-1}. If the output port does not belong to the current card, the packet is routed through the internal link. When the packet arrives at the destination node: destination PE = current PE? → route it to the NIC; destination PE = neighbour PE? → route it through the internal link. Virtual channels are used in the internal link to avoid deadlocks.
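The routing decision above can be sketched as follows. This is our illustrative reconstruction, not the authors' implementation: the node layout (`card_ports`), the return values, and all names are assumptions.

```python
# Hypothetical sketch of one DORT routing step in an nDT torus.
# card_ports[pe] is the set of (dimension, direction) external ports
# attached to the card that serves processing element `pe`.
def dort_next_hop(cur, dst, dims, card_ports):
    (c, cur_pe), (d, dst_pe) = cur, dst
    for i in range(len(dims)):                 # dimension order d0..d(n-1)
        if c[i] != d[i]:
            fwd = (d[i] - c[i]) % dims[i]      # hops in the + direction
            direction = 1 if fwd <= dims[i] - fwd else -1
            if (i, direction) in card_ports[cur_pe]:
                return ('LINK', i, direction)  # port is on this card
            return ('INTERNAL',)               # reach the twin card first
    # arrived at the destination node
    return ('NIC',) if cur_pe == dst_pe else ('INTERNAL',)
```

For a 3DT node with the split middle dimension, `card_ports` could be, e.g., `{0: {(0, 1), (0, -1), (1, 1)}, 1: {(2, 1), (2, -1), (1, -1)}}`.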


EXTOLL cards. EXTOLL (http://www.extoll.de/) is an interconnection network technology designed to achieve: very low latency; high bandwidth; a high sustained message rate; high availability in large-scale networks (up to 64k nodes). EXTOLL cards have 6 ports and support: 3D tori; arbitrary topologies; ... and therefore the 5DT torus!

EXTOLL architecture. [Figure: block diagram. The host (CPU cores, memory controller) connects through PCIe or HyperTransport to the card's host interface; the NIC units (VELO, RMA, ATU, Shared Memory Engine) communicate over an on-chip network with the control and status logic and with the network switch and its link ports.]

EXTOLL network. IQ switches: multiqueue-FIFO buffers; a VOQ switch to minimize HoL blocking; virtual cut-through switching; fine-grained credit-based flow control; iSLIP arbiter. Virtual channels: to avoid deadlocks and to provide adaptivity. Four Traffic Classes (TCs) to provide QoS. Table-based routing: allows implementing arbitrary topologies; each TC can have its own routing function.

TS-DOR routing algorithm. DORT is not implementable using EXTOLL cards: it requires 4 deterministic VCs in the internal link, but EXTOLL has only 2 deterministic VCs. A new deterministic routing algorithm is therefore proposed: Twin-Source Dimension Order Routing (TS-DOR). It combines TCs and VCs to avoid deadlocks: messages are routed following different dimension orders, and the dimension order is determined by the source PE.

TS-DOR routing algorithm. Messages generated by PE0 are injected in TC0 or TC2 and routed from d_0 to d_4; messages generated by PE1 are injected in TC1 or TC3 and routed from d_4 to d_0. Two VCs per TC are required to avoid deadlock. Additional advantages: shorter paths and better load balance than DORT.
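The source-dependent dimension order can be sketched as follows (our illustration; the function name and interface are assumptions, not the authors' code):

```python
# Hypothetical sketch of TS-DOR's dimension selection in a 5DT torus:
# traffic injected by PE0 corrects dimensions d0..d4 in ascending order,
# traffic injected by PE1 corrects them in descending order.
def ts_dor_next_dim(src_pe, cur, dst, n=5):
    order = range(n) if src_pe == 0 else range(n - 1, -1, -1)
    for i in order:
        if cur[i] != dst[i]:
            return i      # next dimension to route in
    return None           # all coordinates match: at the destination node
```

Because the two orders are opposite, packets from PE0 and PE1 traverse the dimensions in reverse sequences, which is what spreads load and shortens paths relative to DORT.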


Simulation model. EXTOLLsim model: models the main features of the EXTOLL crossbar. Routing algorithms: 3D torus: fully-adaptive routing + DOR; 5DT torus: TS-DOR algorithm. Uniform traffic pattern. Evaluated network sizes (5DT sizes give the node grid; each node hosts two PEs):

Number of PEs   3D torus   5DT torus
256             8x8x4      4x4x2x2x2
512             8x8x8      4x4x2x4x2
1024            16x8x8     4x4x2x4x4

Results: normalized throughput (cells/cycle/NIC) vs. normalized injection rate (cells/cycle/NIC); network cell latency vs. normalized injection rate (cells/cycle/NIC).

Normalized throughput. [Figure: normalized average throughput vs. normalized injection rate under the uniform traffic pattern, for the 3D tori 8x8x4, 8x8x8 and 16x8x8 and the 5DT tori 4x4x2x2x2, 4x4x2x4x2 and 4x4x2x4x4.]

Network cell latency. [Figure: average network cell latency (cycles) vs. normalized injection rate under the uniform traffic pattern, for the same six networks.]


Conclusions and future work. Conclusions: building an nDT torus is possible using commercial EXTOLL hardware. TS-DOR, a new deadlock-free routing algorithm, has been designed. We have shown that network performance increases: network latencies are reduced; the network accepts more traffic; throughput degradation is lower thanks to virtual channels. Future work: evaluation using other traffic patterns and MPI traffic; performance comparison of the DORT and TS-DOR algorithms; adaptive routing for the EXTOLL 5DT torus.

Contact: Francisco José Andújar Muñoz et al., fandujar@dsi.uclm.es