A Single Chip Shared Memory Switch with Twelve 10Gb Ethernet Ports


Takeshi Shimizu, Yukihiro Nakagawa, Sridhar Pathi, Yasushi Umezawa, Takashi Miyoshi, Yoichi Koyanagi, Takeshi Horie, Akira Hattori
Fujitsu Laboratories of America, Inc., Advanced Interconnect Technology Department
Hot Chips 15 - August 19, 2003

Outline
- Background
- Overview and Features
- Switch Implementation
- Evaluation
- Summary

Background
- IP-based networks connect all the computing resources (All-IP).
- The Ethernet protocol is commonly used for IP networks.
- 10Gb Ethernet is a promising solution for a unified, fat pipe between servers and storage systems.
[Figure: network diagram with legend "10GE Switch Chip"; labels: Server, Router, Blade server, Switch, Storage Area Network, Clients, Web Server, Application Server, Database Server]

Primary Target Applications
- Our primary target is infrastructure for computing platforms, such as SANs, clusters, and blade servers.
- Those applications require short latency, low cost, and high density.
- This is the motivation to develop a dense 10Gb Ethernet switch.
- So, the design strategy is set as follows:
  - Focus on layer-2 switching.
  - High-throughput, low-latency switch core.
  - SerDes integration for a copper solution.
[Figure labels: Storage Area Network, Cluster Computing, Blade Server]

MB87Q3050 Overview
- Twelve 10 Gb Ethernet ports.
- Integrated SerDes for XAUI: 3.125Gbps x 4 lanes per direction.
- Layer-2 switching with 802.1Q VLAN.
- Output queue switching using shared memory.
- 240 Gbps shared memory bandwidth.
- Cut-through forwarding.
- Statistics counters for RMON.
[Block diagram; labels as extracted: XAUI, PHY, RMON MIB, Shared Memory Management Address VLAN Table, Output FIFO, Processor I/F, CPU Bus, EEPROM]
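The headline numbers on this slide can be cross-checked with simple arithmetic. The sketch below assumes that XAUI uses 8b/10b line coding (8 data bits per 10 line bits) and that the 240 Gbps figure counts one full-rate write plus one full-rate read per port; the function names are illustrative only.

```python
# Back-of-envelope check of the overview figures. A sketch, not vendor data.
PORTS = 12               # from the slide
PORT_RATE_GBPS = 10      # from the slide
XAUI_LANES = 4           # from the slide
XAUI_LANE_GBAUD = 3.125  # from the slide


def xaui_payload_gbps():
    """Payload rate of one XAUI direction after 8b/10b coding."""
    return XAUI_LANES * XAUI_LANE_GBAUD * 8 / 10


def shared_memory_gbps():
    """Assumption: every port writes and reads the shared memory at line rate."""
    return PORTS * PORT_RATE_GBPS * 2


if __name__ == "__main__":
    print(xaui_payload_gbps())    # payload per XAUI direction, in Gbps
    print(shared_memory_gbps())   # aggregate read + write bandwidth, in Gbps
```

Under these assumptions the four 3.125 Gbps lanes carry exactly one 10 Gbps port, and twelve ports reading and writing together account for the quoted shared memory bandwidth.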

Designing a Switch Core
- A protocol-independent switch core was developed.
- Credit-based flow control.
- Packet-by-packet operation, allowing variable lengths from 64B to 9KB. This is not a fixed-length cell switch.
- Four-level priority queue, with simple distributed arbiters per output port.
- Cut-through forwarding (fall-through latency of the core: 150ns).
- Multicast support with single copy, multiple read.
[Diagram labels: 10GbE, Credit (exported), Switch Core & Shared Memory, Output FIFO, Pause Flow Control, Credit Flow Control]
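Credit-based flow control can be pictured as a counter on the sending side. The following is a minimal sketch of the general technique, not the chip's actual logic: the class name, the credit unit, and the initial credit count are all hypothetical, since the slide does not specify them.

```python
class CreditedLink:
    """Sender-side view of credit-based flow control (illustrative only)."""

    def __init__(self, initial_credits):
        # Credits advertised by the receiver, e.g. free buffer units (assumed).
        self.credits = initial_credits

    def try_send(self, units):
        """Consume credits and return True, or return False without sending."""
        if units > self.credits:
            return False  # sender waits, so the receiver buffer never overflows
        self.credits -= units
        return True

    def on_credit_return(self, units):
        """The receiver freed buffer space and handed the credits back."""
        self.credits += units


link = CreditedLink(initial_credits=4)
sent = [link.try_send(3), link.try_send(3)]  # the second send lacks credits
link.on_credit_return(3)                     # receiver drains its buffer
sent.append(link.try_send(3))                # now the send succeeds
```

The point of the scheme is that data is never dropped for lack of buffer space: the sender stalls instead, which is where the exported credits and Ethernet pause frames in the diagram would come into play.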

Input Buffering for Cut-Through
- The switch core is designed for cut-through forwarding of variable-length frames.
- The switch core has NO speed-up in terms of the bandwidth per port. All the data paths in the switch run at the 10Gbps data rate. Popular implementations with crossbar switches may have increased bandwidth per port.
- The switch core assumes NO fragmentation. The entire packet should be transferred at the 10Gbps rate.
- These features are desired to handle variable-length frames in the Ethernet protocol and to minimize store-and-forward operation.
- Input buffering is simply for speed-matching between clock domains when cut-through mode is chosen.
[Diagram labels: 10Gbps XAUI, 10Gbps +/- 100ppm, PHY, Recovered Clock Domain, Core Clock Domain, Switch Core & Shared Memory]
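To get a feel for how little buffering pure speed-matching requires, one can estimate the clock drift that accumulates over a single frame. This is a rough sketch under stated assumptions: the +/- 100ppm tolerance comes from the diagram, while treating the core side as off by another 100ppm in the opposite direction is an assumption, and real designs add margin for synchronizers.

```python
def worst_case_drift_bits(frame_bytes, ppm_a=100, ppm_b=100):
    """Bits gained or lost over one frame when two clocks err in opposite directions."""
    frame_bits = frame_bytes * 8
    return frame_bits * (ppm_a + ppm_b) / 1_000_000


if __name__ == "__main__":
    for size in (64, 1518, 9216):
        print(size, worst_case_drift_bits(size))
```

Even for a 9216-byte frame the accumulated drift under these assumptions stays below two bytes, which is consistent with the slide's statement that the buffering serves only to match speeds between clock domains.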

On-chip Interconnect for Shared Memory
- An on-chip memory sub-system is designed for twelve 10Gbps reads AND twelve 10Gbps writes.
- Multi-port Stream Memory: deep interleaved memory banks are connected via a distributed, multi-stage interconnection network.
- Having no global arbitration is a good policy for chip integration and timing closure.
- Switching is rescheduled at every packet boundary for access path arbitration.
- Arbitration delay exists between packets (as shown later).
[Diagram labels: Switches and Distributed Arbiters; Interleaved Memory Banks; No blocking access in arbitrary length data; Arbitration delay between data streams]
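The idea behind interleaved banks can be illustrated with a toy model: if each stream steps through the banks in rotation and starts at a distinct offset, no two streams touch the same bank in the same cycle. This is a generic sketch of interleaving, not the chip's actual interconnect; the bank count, the rotation rule, and the function names are assumptions, and the model ignores the arbitration that happens at packet boundaries.

```python
def bank_for(stream_id, cycle, num_banks):
    """Bank visited by a stream at a given cycle under a rotating interleave."""
    return (stream_id + cycle) % num_banks


def conflict_free(num_streams, num_banks, cycles):
    """True if no two streams ever address the same bank in the same cycle."""
    for cycle in range(cycles):
        banks = [bank_for(s, cycle, num_banks) for s in range(num_streams)]
        if len(set(banks)) != num_streams:
            return False
    return True


NUM_STREAMS = 24  # 12 read streams + 12 write streams, from the slide
NUM_BANKS = 24    # hypothetical; the slide does not give the bank count

if __name__ == "__main__":
    print(conflict_free(NUM_STREAMS, NUM_BANKS, cycles=100))
```

In this model a stream of any length proceeds without blocking once it holds its slot in the rotation; the cost is the wait to obtain a slot, which loosely corresponds to the arbitration delay between data streams mentioned on the slide.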

Output Buffering for Short Latency
- Frames come out of the switch core without fragmentation, at the 10Gbps rate.
- However, arbitration delay (caused by the on-chip interconnect) exists between frames. The average value is 3.2 cycles (assuming random, fully loaded traffic).
- On the other hand, outgoing packets have an 8-byte preamble and a 12-byte IPG as defined in the Ethernet protocol. This corresponds to 5 cycles between frames in the switch core.
- Thus, with a small FIFO, it is possible to sustain wire-speed throughput.
- We do not need a large FIFO. The size is much smaller than the maximum packet size. It also helps to reduce the fall-through latency in a loaded condition.
[Diagram labels: Data, Delay, Switch Core, 10Gbps, FIFO, Frame, PA/IPG]
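The 5-cycle figure can be reproduced from numbers given elsewhere in the talk. The sketch below assumes that the core data path carries one port's 10Gbps at the 312.5MHz core logic frequency listed on the Chip Implementation slide; the variable names are illustrative.

```python
CORE_CLOCK_HZ = 312_500_000      # core logic frequency, from the talk
LINE_RATE_BPS = 10_000_000_000   # 10Gbps per port
PREAMBLE_BYTES = 8               # from the slide
IPG_BYTES = 12                   # from the slide
AVG_ARBITRATION_CYCLES = 3.2     # from the slide

# Bytes that one port moves per core clock cycle (assumed data-path width).
bytes_per_cycle = LINE_RATE_BPS / CORE_CLOCK_HZ / 8

# Idle time on the wire between frames, expressed in core cycles.
gap_cycles = (PREAMBLE_BYTES + IPG_BYTES) / bytes_per_cycle

# Average headroom left after the arbitration delay is absorbed.
average_slack_cycles = gap_cycles - AVG_ARBITRATION_CYCLES

if __name__ == "__main__":
    print(bytes_per_cycle, gap_cycles, average_slack_cycles)
```

Because the wire-side gap of preamble plus IPG is longer on average than the core-side arbitration delay, a small FIFO only has to smooth out the occasional longer delay rather than hold whole frames.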

Chip Implementation
MB87Q3050: 12-port, 10Gbps Ethernet Switch Chip

Items                 Specification
Technology            Fujitsu CS91: 0.11um CMOS ASIC
Logic                 6.3M gates (total)
SRAM                  897K Bytes (total)
Core Logic Frequency  312.5MHz
Package               FCBGA-728
Signals               336
High Speed            XAUI (3.125Gb/s x 4) x 12
Power Consumption     15.3 W (typical), full-loaded
Die Size              16mm x 16mm

Floorplan Image
[Figure: chip floorplan; labels as extracted: Statistics Counters, On-Chip Interconnect, Shared Memory, Tag Memory, Routing Table, Management Processor I/F Unit]

Evaluation Board
- For functional validation: hardware, firmware.
- Nine copper cable connectors. Tested with copper cables up to 5m.
- Three optical interfaces (XENPAK modules).
[Figure: Evaluation Board Top View]

Performance: Zero-loss Throughput
- Zero-loss throughput is measured using the first silicon.
- Optical links are used for the 2-port pairing configuration.
- Optical links and copper cables are used for the full mesh configuration. Slight packet loss is observed (to be fixed in future silicon).
[Figure: chart of throughput (percentage), axis ticks 80 to 100; plotted values as extracted alternate between 100 and 98 or 99]

Performance: Latency
Fall-through latency at 100% throughput workload is measured with the first silicon.

Latency (nsec)
Packet size (bytes)   Copper Cable   Optical Cable (w/ XENPAK modules)
64                    750            1160
128                   670            1090
256                   660            1190
512                   630            1230
1024                  670            1280
1280                  640            1140
1518                  630            1190
9216                  660            1220
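One way to read these measurements is to compare them with the time needed just to serialize a frame at 10Gbps, which is a lower bound on the latency of any store-and-forward switch. The sketch below uses the measured values from this slide; the helper name and the comparison itself are illustrative additions.

```python
MEASURED_NS = {  # packet size (bytes): (copper, optical), from the slide
    64: (750, 1160), 128: (670, 1090), 256: (660, 1190), 512: (630, 1230),
    1024: (670, 1280), 1280: (640, 1140), 1518: (630, 1190), 9216: (660, 1220),
}


def serialization_ns(frame_bytes, rate_gbps=10):
    """Time to clock a whole frame onto a link; 1 Gbps equals 1 bit per ns."""
    return frame_bytes * 8 / rate_gbps


# For the largest frame, compare measured latency with serialization time.
largest = max(MEASURED_NS)
cut_through_visible = MEASURED_NS[largest][0] < serialization_ns(largest)

if __name__ == "__main__":
    for size, (copper, optical) in MEASURED_NS.items():
        print(size, copper, optical, serialization_ns(size))
```

The measured latency stays roughly flat across packet sizes and, for the largest frames, is far below the time needed to receive the whole frame, which is the behavior expected from cut-through forwarding.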

Application Prototype: 1Gb Switch Box
- Layer-2 switch for cluster systems by Fujitsu Laboratories Ltd. (Japan).
- Flexible port configuration, e.g. 1G x 144 ports or 10G x 12 ports.
[Diagram labels: 10G Switch Board (MB87Q3050, XAUI); Twelve 10G ports on switch backplane; Switch, 10G x 12; 1G Board, 1G x 24; 10G Board; 6 boards: 1G or 10G board]

Summary
- MB87Q3050 was designed for 10Gb Ethernet-based interconnection for servers and storage equipment: low cost, dense integration, high throughput, low latency.
- The switch achieves wire-speed operation at each port with 240Gbps shared memory bandwidth.
- It also achieves 450ns latency with cut-through forwarding.

Acknowledgement: The chip was jointly developed by Fujitsu Laboratories of America, Inc., Fujitsu Laboratories Ltd., Japan, and Fujitsu Ltd., Japan. The backend design was supported by Fujitsu Microelectronics America, Inc. The development was partially funded by the New Energy and Industrial Technology Development Organization (NEDO), which is one of the Japanese governmental agencies.
