ICC: An Interconnect Controller for the Tofu Interconnect Architecture


ICC: An Interconnect Controller for the Tofu Interconnect Architecture
August 24, 2010
Takashi Toyoshima, Next Generation Technical Computing Unit, Fujitsu Limited

Background
Requirements for supercomputing systems:
  Low latency: communication latency limits the scalability of applications.
  High bandwidth: increasing calculation FLOPS requires network bandwidth that stays balanced with FLOPS.
  RAS (Reliability, Availability and Serviceability): the risk of hardware faults grows with the number of nodes in large systems.

Fujitsu's New Interconnect Architecture
6D Mesh/Torus interconnect architecture (*)
  Scalability
  Fault tolerance
LSI features:
  Ten network links: 3D-torus links x6 plus Tofu-unit links x4
  Four communication engines
  Router engine
(*) Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers, Yuichiro Ajima, Shinji Sumimoto, Toshiyuki Shimizu, IEEE Computer, vol. 42, no. 11.

Agenda (current section: Implementation)
  Implementation
  Features Overview
  Interface features for latency and throughput
  Network features for network utilization
  Conclusion

Specifications
  Process: Fujitsu's 65 nm CMOS technology
  Die size: 18.2 mm x 18.1 mm
  Transistors: 48M gates for logic, 12M-bit SRAM cells
  I/O: 5 GB/s ports x16 (6.25 Gb/s x 8 links per port)
  Misc.: ASIC design flow, 312.5 MHz / 625.0 MHz
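As a quick sanity check on the I/O figure above, a 5 GB/s port is consistent with eight 6.25 Gb/s links if 8b/10b line coding is assumed (the coding scheme is my assumption, not stated on the slide). A minimal C sketch of that arithmetic:

    #include <stdio.h>

    int main(void) {
        /* Figures from the specifications slide. */
        double link_rate_gbps = 6.25;       /* raw serial rate per link */
        double links_per_port = 8.0;
        /* Assumption: 8b/10b line coding (8 payload bits per 10 line bits). */
        double coding_eff     = 8.0 / 10.0;

        double payload_gbps = link_rate_gbps * links_per_port * coding_eff;
        double payload_gBps = payload_gbps / 8.0;    /* bits -> bytes */

        /* Prints: per-port payload bandwidth: 40.0 Gb/s = 5.0 GB/s */
        printf("per-port payload bandwidth: %.1f Gb/s = %.1f GB/s\n",
               payload_gbps, payload_gBps);
        return 0;
    }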

Floor Plan
Node Domain:
  Bus bridge: 20 GB/s in each direction
  Communication engines x4: 5 GB/s in each direction
  Barrier engine (Comm. #0 only)
  PCIe 2.0 root complexes x2
  Isolated power domain
Router Domain:
  Crossbar: 14 ports, 5 GB/s in each direction
  Link ports x10: 5 GB/s in each direction

RAS Features
Fault domain isolation: the router domain continues to work on node faults.
Error protection:
  Radiation-hardened FFs
  ECC protection: RAM / data path
  Parity error detection: control path
  CRC protection: data link / transaction layers

Agenda (current section: Features Overview)
  Implementation
  Features Overview
  Interface features for latency and throughput
  Network features for network utilization
  Conclusion

Features (by level and design goal)
System:
  Latency: Many Neighbors, Hop Reduction, 3D Torus View
  Throughput: Many Neighbors, Trunking
  RAS: Detour Path, Subnet Partitioning
Network Interface:
  Latency: RDMA (Quick Start, Piggyback, Strong Order), Stream Offload, Barrier Engine
  Throughput: GAP Control, Multi-Interfaces (User Thread x2, Kernel Thread)
  RAS: Radiation-hardened FF, ECC, Parity, CRC
Router Engine:
  Latency: Cut-through, Grant Prediction, Straight Bypath
  Throughput: Straight Bypath, New VC Scheduling
  RAS: Node Error Isolation, Radiation-hardened FF, ECC, Parity, CRC
(Unique features are marked in the original slide; today's topics are highlighted in red.)

Agenda (current section: Interface features for latency and throughput)
  Implementation
  Features Overview
  Interface features for latency and throughput
  Network features for network utilization
  Conclusion

RDMA: Remote Direct Memory Access
Features:
  Operation: READ (Get) / WRITE (Put)
  Length: up to 16 MB
  MTU: 256 B to 1,920 B
  Virtual address support (64K set)
Low latency and high throughput:
  Command supply throughput and latency
  Out-of-order I/O memory bus
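To make the parameters above concrete, here is a minimal C sketch of what a Put/Get command descriptor could carry, sized to the limits on this slide (operation, length up to 16 MB, MTU 256-1920 B, virtual addresses). The struct layout and field names are hypothetical, not the ICC's actual descriptor format.

    #include <stdint.h>

    /* Hypothetical RDMA command descriptor (illustration only). */
    enum rdma_op { RDMA_PUT, RDMA_GET };        /* WRITE (Put) / READ (Get) */

    struct rdma_cmd {
        uint8_t  op;             /* RDMA_PUT or RDMA_GET */
        uint8_t  flags;          /* e.g. strong-order / piggyback bits (assumed) */
        uint16_t mtu;            /* 256 ... 1920 bytes */
        uint32_t length;         /* transfer length, up to 16 MB */
        uint64_t local_vaddr;    /* local virtual address of the buffer */
        uint64_t remote_vaddr;   /* remote virtual address */
        uint32_t remote_node;    /* destination node identifier */
        uint32_t reserved;
    };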

RDMA Optimizations
Sender techniques:
  Direct descriptor: quick command supply (for command supply performance, PIO offers good latency while DMA offers good throughput).
  Piggyback: the command embeds the communication payload, so short messages are sent without any DMA.
Receiver techniques:
  Out-of-order I/O memory bus: high-throughput bus transactions.
  Strong-ordered store: in-order completion of DMA transactions enables buffer polling.

Direct Descriptor Feature
Normal command supply: commands are fetched by DMA, which gives high-throughput command supply.
Direct descriptor and DMA command supply: the DMA fetch latency is hidden by a block store of the first two commands.
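The point of this slide is that the CPU pushes the first couple of commands straight into the controller (a block store), while the rest remain in the in-memory queue and are fetched by DMA, so the DMA fetch latency is hidden behind the directly supplied commands. The sketch below only illustrates that split; the MMIO window, ring layout, and doorbell are assumptions.

    #include <stdint.h>

    struct cmd { uint8_t bytes[64]; };               /* opaque command, assumed 64 B */

    extern volatile struct cmd *direct_desc_window;  /* hypothetical MMIO block-store target */
    extern struct cmd *cmd_ring;                     /* hypothetical ring fetched by engine DMA */
    extern void ring_doorbell(uint32_t new_tail);

    #define RING_SIZE 4096u

    void post_commands(const struct cmd *cmds, uint32_t n, uint32_t tail)
    {
        uint32_t i = 0;

        /* Direct descriptor: the CPU block-stores the first two commands so the
         * engine can start working before any DMA fetch has completed. */
        for (; i < n && i < 2; i++)
            direct_desc_window[i] = cmds[i];

        /* The remaining commands go to the memory ring and are fetched by DMA. */
        for (; i < n; i++)
            cmd_ring[(tail + i) % RING_SIZE] = cmds[i];

        ring_doorbell(tail + n);
    }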

Piggyback Feature
Normal payload supply: user messages (payload) must be fetched by DMA.
Piggyback payload supply: user messages (payload) are embedded in the commands.
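A small C sketch of the piggyback decision: if the user message fits inside the descriptor's inline area, it is embedded in the command and no payload DMA is needed; otherwise the command carries a pointer and the payload is fetched normally. The 32-byte inline capacity and the field names are assumptions for illustration, not the real descriptor format.

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    #define INLINE_MAX 32u   /* assumed piggyback capacity per command */

    struct put_cmd {
        uint64_t remote_vaddr;
        uint64_t local_vaddr;              /* used only on the DMA path */
        uint32_t length;
        bool     piggyback;
        uint8_t  inline_data[INLINE_MAX];
    };

    /* Build a Put command; embed the payload when it is short enough so the
     * engine can send it without a separate DMA read of the payload. */
    void build_put(struct put_cmd *c, uint64_t rvaddr,
                   const void *payload, uint32_t len)
    {
        c->remote_vaddr = rvaddr;
        c->length = len;
        if (len <= INLINE_MAX) {
            c->piggyback = true;           /* payload rides inside the command */
            memcpy(c->inline_data, payload, len);
        } else {
            c->piggyback = false;          /* payload fetched by DMA */
            c->local_vaddr = (uint64_t)(uintptr_t)payload;
        }
    }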

Out-of-Order I/O Memory Bus
Completion notification polling: the notification arrives only after all out-of-order DMA stores have completed.
Buffer polling with strong-ordered indication: the memory controller guarantees the specified DMA ordering, so the receive buffer itself can be polled.
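The benefit of the strong-ordered indication is that the receiver can poll the tail of the data buffer itself rather than a separate notification word, because the memory controller completes the final DMA store only after all earlier stores. A minimal receive-side sketch under that assumption (the sentinel convention is hypothetical):

    #include <stdint.h>

    /* Wait for an incoming RDMA write of `len` bytes into `buf`, assuming the
     * sender's last store is issued with strong ordering so that once the final
     * byte is visible, all preceding DMA stores are visible too. */
    void wait_for_message(volatile uint8_t *buf, uint32_t len)
    {
        volatile uint8_t *last = &buf[len - 1];

        *last = 0;               /* clear the sentinel before the transfer starts */

        /* Buffer polling: spin on the last byte of the buffer itself instead of
         * a separate completion/notification queue. A real protocol would use an
         * explicit flag value; here we assume the final payload byte is nonzero. */
        while (*last == 0)
            ;

        /* buf[0 .. len-1] is now complete and safe to read. */
    }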

RDMA Performance (hardware-measured results)
  Piggyback achieves low latency for short messages: 48% latency reduction (dashed line in the figure).
  Strong-ordered packets make buffer polling possible: 0.37 points lower relative latency on average than notification polling.
(figure: relative latency versus message size [B], 8 B to 1,920 B)

Agenda (current section: Network features for network utilization)
  Implementation
  Features Overview
  Interface features for latency and throughput
  Network features for network utilization
  Conclusion

Network Utilization Problems
Global unfairness of throughput: arbitration with only local fairness causes global unfairness, so nodes at different distances from the hot spot end up with very different shares (100%, 50%, 25% in the example).
Non-uniform application traffic in time and space: the bandwidth of idle links needs to be used effectively (some links are bottlenecked while others sit idle, and the pattern changes between application phases A and B).

Global Unfairness of Throughput
Local fairness of arbitration causes global unfairness: the arbiter at every junction treats all of its incoming traffic fairly, so through traffic is halved at each merge point.
Example with three nodes sending toward one destination: after the Node #1 junction the shares are #0: 50%, #1: 50%, #2: 0%; after the Node #2 junction they are #0: 25%, #1: 25%, #2: 50%. The ideal throughput is #0: 33%, #1: 33%, #2: 33%.
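To make the arithmetic on this slide concrete, the toy loop below walks the same three-node merge chain: each junction's locally fair arbiter gives half of the output link to the through traffic and half to the local injection, which reproduces the 25% / 25% / 50% split instead of the ideal 33% each. This is only an illustration of the arithmetic, not a model of the actual router arbitration.

    #include <stdio.h>

    int main(void) {
        enum { NODES = 3 };
        double share[NODES] = { 0.0 };

        /* Node #0 starts at 100%; at each downstream junction a locally fair
         * arbiter gives half the outgoing link to the through traffic and half
         * to the locally injected traffic. */
        share[0] = 1.0;
        for (int n = 1; n < NODES; n++) {
            double through = 0.0;
            for (int i = 0; i < n; i++)
                through += share[i];
            for (int i = 0; i < n; i++)
                share[i] *= 0.5 / through;   /* upstream senders squeezed into 50% */
            share[n] = 0.5;                  /* local sender takes the other 50% */
        }

        for (int i = 0; i < NODES; i++)
            printf("node #%d: %.0f%%\n", i, 100.0 * share[i]);
        /* Prints 25%, 25%, 50% -- versus the ideal 33% each. */
        return 0;
    }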

Injection Rate Control
Software specifies an inter-packet GAP parameter; the communication engine uses it to control the injection rate.
At the transport layer, temporal gaps are inserted between transmitted packets; the interval can be specified by the user for each Tofu descriptor command / RDMA message.
With appropriate GAP settings, the example above reaches the ideal shares of #0: 33%, #1: 33%, #2: 33%.
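A sketch of how software might choose the GAP value for a target injection rate: insert enough idle time after each packet that the packet occupies only the desired fraction of the link. The formula and the clamp to an 8-bit range (the 0-255 range appears on the next slide) are illustrative assumptions; the hardware's actual GAP units are not described here.

    #include <stdint.h>

    /* Pick an inter-packet GAP so that a flow consumes roughly `target_fraction`
     * of the link. Units (cycles) and the 8-bit range are assumptions. */
    uint8_t pick_gap(uint32_t packet_bytes, double target_fraction,
                     double link_bytes_per_cycle)
    {
        /* cycles needed to put one packet on the link */
        double tx_cycles = packet_bytes / link_bytes_per_cycle;

        /* fraction = tx / (tx + gap)  =>  gap = tx * (1/fraction - 1) */
        double gap_cycles = tx_cycles * (1.0 / target_fraction - 1.0);

        if (gap_cycles < 0.0)   gap_cycles = 0.0;
        if (gap_cycles > 255.0) gap_cycles = 255.0;   /* 8-bit GAP field (0-255) */
        return (uint8_t)gap_cycles;
    }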

Throughput Performance (hardware-measured results)
  Software can specify fine-grained GAP parameters in the range 0-255.
  The GAP parameter controls throughput effectively.

Non-uniform Application Traffic
Non-uniform traffic in space: some links are bottlenecked while others sit idle.
Non-uniform traffic in time: links alternate between busy and idle as the application moves between phases (Phase A / Phase B).

Trunking Communication
Trunking over independent idle paths:
  Nodes have four neighbors within a Tofu unit, connected by independent links and independent 3D-torus networks.
  Each node has four communication engines.
  Up to 4x throughput.
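At the software level, trunking amounts to splitting one large transfer into up to four chunks and posting each chunk to a different communication engine, so each piece travels over an independent link. The post_put() call and the even split below are assumptions for illustration.

    #include <stdint.h>

    #define NUM_ENGINES 4   /* one per communication engine / independent path */

    /* Hypothetical per-engine submit call: engine `e` issues a Put of `len`
     * bytes starting at local/remote offset `off` to node `node`. */
    extern void post_put(int e, uint32_t node, uint64_t off, uint64_t len);

    void trunked_put(uint32_t node, uint64_t base, uint64_t total_len)
    {
        uint64_t chunk = (total_len + NUM_ENGINES - 1) / NUM_ENGINES;

        for (int e = 0; e < NUM_ENGINES; e++) {
            uint64_t off = (uint64_t)e * chunk;
            if (off >= total_len)
                break;
            uint64_t len = (off + chunk <= total_len) ? chunk : total_len - off;
            /* Each engine drives its own link, so the throughputs add up. */
            post_put(e, node, base + off, len);
        }
    }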

Trunking Performance Results
  A single communication engine achieves good performance.
  The trunking mechanism scales up to four engines.
(figures: single-engine throughput [GB/s] versus MTU [B] for WRITE and READ, axis up to 5.0 GB/s; trunking throughput [GB/s] for x1 to x4 engines, axis up to 20.0 GB/s)

Agenda (current section: Conclusion)
  Implementation
  Features Overview
  Interface features for latency and throughput
  Network features for network utilization
  Conclusion

Concluding Remarks
Tofu: a 6D mesh/torus interconnect architecture
  The interconnect for Fujitsu's peta/exascale computing systems
  Low latency, high bandwidth, and RAS
Features:
  High-throughput and low-latency RDMA: direct descriptor, piggyback, and the out-of-order I/O memory bus
  Network features for network utilization: network injection rate control, and trunking for up to four times the throughput

Thank You for Listening
Thanks to:
  Next Generation Technical Computing Unit: Aiichiro Inoue, Yuji Oinaga
  Tofu Architecture Team: Toshiyuki Shimizu, Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto
  Design Team: Takeo Asakawa, Akira Asato, Takumi Maruyama, Koichiro Takayama, Koichi Yoshimi, Osamu Moriyama, Masao Yoshikawa, Shinichi Iwasaki, Takekazu Tabata, Yoshiro Ikeda, Yuzo Takagi, Yoshihito Matsushita, Toshihiko Kodama, Satoshi Nakagawa, Masato Inokai, Shigekatsu Sagi, Ikuto Hosokawa, Yaroku Sugiyama, Takahide Yoshikawa
