RHiNET-3/SW: an 0-Gbit/s high-speed network switch for distributed parallel computing

S. Nishimura (1), T. Kudoh (2), H. Nishi (2), J. Yamamoto (2), R. Ueno (3), K. Harasawa (4), S. Fukuda (4), Y. Shikichi (4), S. Akutsu (4), K. Tasho (5), and H. Amano (3)

(1) RWCP Optical Interconnection Hitachi Laboratory
(2) RWCP Tsukuba Research Center
(3) Keio University
(4) Hitachi Communication Systems, Inc.
(5) Synergetech, Inc.

E-mail: nisimura@crl.hitachi.co.jp

Contents

- RHiNET concept (RWCP high-performance network)
- Concept and architecture of RHiNET-3/SW
- Key components in RHiNET-3/SW: switch LSI, deskew LSI, parallel optical link, board
- Evaluation test results on the LSIs: bit error rate, deskew function

RHiNET concept

[Figure labels: RHiNET switch (x high-speed crossbar switch), RHiNET-3/SW, high-speed parallel optical link, PCI-bus based NIC]

Targets:
- Low-cost, high-performance parallel computing through the combined computational power of PCs
- Connecting computers distributed within one or more floors of a building

Features:
- Reliable low-latency communication, no upper layer
- Long links ( - 1 km), free topology design
- Large bi-section bandwidth (- Gbit/s)

Structure of RHiNET-3/SW

[Figure: schematic structure. An SW-LSI with ports P0 to P7 connects through electrical I/Os (-bit data, 1-bit clock) to DS-LSIs (DS-TX(0), DS-RX(0), DS-TX(1), DS-RX(1)), which connect to optical TX and RX modules through optical I/Os (-bit data, 1-bit clock)]

- Switch: -Gbit/s x -port
- Aggregate throughput: 0 Gbit/s
- BB encoded data with clock
- I/O: 1.25-Gbit/s x 12-channel optical links
- Transmission length: < 1 km
- DS-LSI: skew compensation for long transmission length
- Electrical I/O: CML or LVDS

Design concepts of RHiNET-3

Hop-by-hop retransmission
- Low-cost optical link module
- Retransmission: needed for error-free data transmission
- Simple procedures and compact circuits
- Retransmission unit: micro frame (160 bits)

Credit-based flow control
- For long transmission length
- Effective use of packet buffer

32 virtual channels (VCs)
- Virtual lane
- Deadlock-free and topology-free

Flow control and retransmission (layered)

[Figure: TX-switch, Tx flow controller, Tx-retrans. ctrl., network link, Rx-retrans. ctrl., Rx flow controller, 32 VC buffer, RX-switch. Flits (80 bits) pass between the switch and the flow controller; micro frames (2 flits, 160 bits) pass over the network link]

- Retransmission layer (unit: micro frame, 160 bits)
- Per-VC credit-based flow control layer (unit: flit, 80 bits)
- Small data size: reduces overhead (latency and bandwidth)

Format of micro frame (MF)

[Figure: micro frame layout made up of Flit 0 and Flit 1, with bit positions marked 0, 63, 69, 74, 79. Fields: micro frame type, payload (in each flit), MF sequence number, CRC, and a shared credit / acknowledge / retransmission request field]

- CRC and sequence-number based retransmission mechanism
- Retransmission unit: micro frame (12 bits payload / 160 bits)
- Acknowledge: sequence number of the successfully received MF
- Credit, acknowledge and retransmission request use the same field
- Small retransmission overhead

Retransmission mechanism (behavior)

[Figure: TX-switch, Tx-retrans. ctrl. with retrans. buffer, network link, Rx-retrans. ctrl. with CRC / seq. number check ("Error detected!!"), RX-switch]

- Hop-by-hop retransmission: error-free transmission and small overhead

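The CRC and sequence-number check described above can be sketched roughly as follows. This is a minimal illustration, not the SW-LSI logic: the slides do not give the CRC polynomial, the sequence-number width, or the resend policy, so `zlib.crc32`, `SEQ_MODULO`, the go-back-N style resend, and all class and function names here are assumptions made for the example.

```python
import zlib

SEQ_MODULO = 64  # hypothetical sequence-number space (width not given in the slides)


def make_frame(seq, payload):
    """Build a frame as (seq, payload, crc); the CRC covers seq and payload."""
    return (seq, payload, zlib.crc32(bytes([seq]) + payload))


def frame_ok(frame):
    """Recompute the CRC on the receiving side and compare."""
    seq, payload, crc = frame
    return zlib.crc32(bytes([seq]) + payload) == crc


class TxRetransCtrl:
    """Sender side: keeps every frame until the next hop acknowledges it."""

    def __init__(self):
        self.next_seq = 0
        self.retrans_buffer = {}  # seq -> frame, in send order

    def send(self, payload):
        frame = make_frame(self.next_seq, payload)
        self.retrans_buffer[self.next_seq] = frame
        self.next_seq = (self.next_seq + 1) % SEQ_MODULO
        return frame

    def on_ack(self, seq):
        """Release buffered frames up to and including seq."""
        if seq not in self.retrans_buffer:
            return
        while self.retrans_buffer:
            oldest = next(iter(self.retrans_buffer))
            del self.retrans_buffer[oldest]
            if oldest == seq:
                break

    def on_retrans_request(self, seq):
        """Resend seq and everything after it (go-back-N style, assumed)."""
        keys = list(self.retrans_buffer)
        if seq not in keys:
            return []
        return [self.retrans_buffer[k] for k in keys[keys.index(seq):]]


class RxRetransCtrl:
    """Receiver side: accepts only an intact frame with the expected number."""

    def __init__(self):
        self.expected_seq = 0
        self.delivered = []

    def receive(self, frame):
        seq = frame[0]
        if not frame_ok(frame) or seq != self.expected_seq:
            return ("retrans", self.expected_seq)
        self.delivered.append(frame[1])
        self.expected_seq = (seq + 1) % SEQ_MODULO
        return ("ack", seq)


def demo():
    tx, rx = TxRetransCtrl(), RxRetransCtrl()
    frames = [tx.send(p) for p in (b"A", b"B", b"C")]
    seq, _, crc = frames[1]
    frames[1] = (seq, b"X", crc)  # simulate a bit error on the link
    request = None
    for frame in frames:
        kind, number = rx.receive(frame)
        if kind == "ack":
            tx.on_ack(number)
        elif request is None:
            request = number
    for frame in tx.on_retrans_request(request):
        kind, number = rx.receive(frame)
        if kind == "ack":
            tx.on_ack(number)
    return rx.delivered, len(tx.retrans_buffer)
```

In `demo()`, the second frame is corrupted on the link, so the receiver rejects it and the frame that follows it, and the sender resends both from its retransmission buffer. Because the check happens at every hop, a damaged frame is repaired on the link where the error occurred instead of end to end.
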
Credit-based flow control

[Figure: TX-switch with credit counters, Tx flow controller, Tx-retrans. ctrl., network link, Rx-retrans. ctrl., Rx flow controller, RX-switch with VC buffer]

- Per-VC credit-based flow control
- VC buffer: 256 flits (2 Kbytes)
- The credit-based flow control mechanism enables long-distance data transmission and uses the VC buffer effectively

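A per-VC credit scheme of this kind can be sketched as below. The sketch assumes that the sender starts with one credit per free flit slot in the receiver's VC buffer and that the 256-flit figure applies to each VC; the slides do not state either point, and the class names and the `drain` helper are hypothetical.

```python
from collections import deque

NUM_VCS = 32           # number of virtual channels, from the slides
VC_BUFFER_FLITS = 256  # VC buffer size from the slides; per-VC use is an assumption


class TxFlowController:
    """Sender side: one credit counter per virtual channel (VC)."""

    def __init__(self, num_vcs=NUM_VCS, credits_per_vc=VC_BUFFER_FLITS):
        self.credits = [credits_per_vc] * num_vcs

    def can_send(self, vc):
        return self.credits[vc] > 0

    def send_flit(self, vc):
        """Spend one credit; a flit may only leave while credit remains."""
        if not self.can_send(vc):
            raise RuntimeError(f"no credit left on VC {vc}")
        self.credits[vc] -= 1

    def add_credits(self, vc, count):
        """Credits returned by the receiver over the reverse link."""
        self.credits[vc] += count


class RxFlowController:
    """Receiver side: one flit buffer per VC."""

    def __init__(self, num_vcs=NUM_VCS, capacity=VC_BUFFER_FLITS):
        self.capacity = capacity
        self.buffers = [deque() for _ in range(num_vcs)]

    def accept_flit(self, vc, flit):
        if len(self.buffers[vc]) >= self.capacity:
            raise OverflowError(f"VC {vc} buffer overflow: credit protocol violated")
        self.buffers[vc].append(flit)

    def drain(self, vc, count):
        """Forward up to count flits onward; return the number of freed slots."""
        freed = 0
        while freed < count and self.buffers[vc]:
            self.buffers[vc].popleft()
            freed += 1
        return freed


def demo():
    tx = TxFlowController(num_vcs=2, credits_per_vc=4)
    rx = RxFlowController(num_vcs=2, capacity=4)
    for i in range(4):
        tx.send_flit(0)
        rx.accept_flit(0, f"flit{i}")
    blocked = not tx.can_send(0)        # VC 0 has used all of its credits
    other_vc_open = tx.can_send(1)      # VC 1 is unaffected
    tx.add_credits(0, rx.drain(0, 2))   # receiver frees two slots, returns two credits
    return blocked, other_vc_open, tx.credits[0]
```

The sender never transmits a flit the receiver cannot store, so no flit is dropped for lack of buffer space regardless of link length. A longer link only needs enough credits to cover the round-trip time if it is to stay fully utilized.
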
Components of RHiNET-3/SW

[Figure: schematic structure showing the motherboard, the SW-LSI (ports P0 to P7), the DS-LSIs (DS-TX(0), DS-RX(0), DS-TX(1), DS-RX(1)), and the optical link modules]

Block diagram of SW-LSI

[Figure: input at 1.25 Gbit/s x bit per port; internal data path at 125 Mbit/s, 0 bit per port; output at 1.25 Gbit/s x bit per port. Blocks: demultiplexer, elastic buffer, Rx-retrans. ctrl., routing table, RT controller, VC controller, packet buffer, crossbar, retrans. buffer, Tx-retrans. ctrl., multiplexer]

Floor plan of SW-LSI (1st cut)

[Figure: chip floor plan showing the VC buffer memory and the PLL]

- 0.14-um CMOS ASIC
- Die size: 16.5 mm x 16.5 mm
- Number of gates: 1502 k
- Buffer memory: a total of 640 kbytes
- I/O: 1.25 Gbit/s per pin
- Package: 74-pin BGA

DS-LSI (LSI for skew compensation)

[Figure labels: from SW-LSI, to SW-LSI, TX0, RX0, RX1, TX1, optical TX, optical RX, 12-channel fiber ribbon (< 1 km), DC-coupled, AC-coupled]

- 1.25-Gbit/s x 12-channel AC-coupled optical modules
- DS-LSI has BB encoder and decoder
  - For high-speed (1.25 Gbit/s per pin) AC-coupled optical data transmission
- DS-LSI compensates skew between -bit data and 1-bit clock
  - Maximum skew: +/- 256 ns, larger than the skew of a 1-km MMF fiber ribbon (+/- 64 ns)
  - Initial data pattern consists of 64 BB special characters

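One common way to compensate lane-to-lane skew is to locate a known marker from the initial pattern in every lane and delay the early lanes until all markers line up. The sketch below illustrates that idea only; the slides do not describe the DS-LSI's actual algorithm, and `MARKER`, `measure_skew`, and `deskew` are hypothetical names introduced for the example.

```python
MARKER = "K"  # hypothetical stand-in for a special character in the initial pattern


def measure_skew(lanes):
    """Return the arrival position of the first marker in each lane."""
    return [lane.index(MARKER) for lane in lanes]


def deskew(lanes):
    """Delay early lanes so that every marker sits at the same position.

    Returns (aligned_lanes, delays); delays[i] is the number of symbol
    slots of padding inserted in front of lane i.
    """
    offsets = measure_skew(lanes)
    latest = max(offsets)
    delays = [latest - offset for offset in offsets]
    aligned = [[None] * delay + list(lane) for lane, delay in zip(lanes, delays)]
    common = min(len(lane) for lane in aligned)
    return [lane[:common] for lane in aligned], delays


def words_after_marker(aligned):
    """Read parallel words across the lanes, starting right after the marker."""
    start = aligned[0].index(MARKER) + 1
    return list(zip(*(lane[start:] for lane in aligned)))


# Two lanes carrying the same word stream; lane 1 arrives two symbols later.
lane0 = ["-", "K", "a0", "a1", "a2"]
lane1 = ["-", "-", "-", "K", "b0", "b1", "b2"]
aligned, delays = deskew([lane0, lane1])
words = words_after_marker(aligned)
```

Here lane 0 is delayed by two symbol slots, after which `words` pairs `a0` with `b0`, `a1` with `b1`, and so on. The delay such a scheme can insert bounds the skew it can absorb, which is why the maximum compensable skew matters relative to the skew of the fiber ribbon.
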
12-channel parallel optical link

[Figure: TX module and RX module]

- 12-channel parallel data transmission (products of ZARLINK(TM) Semiconductor)
- 50-nm VCSEL
- 12-channel CML interfaces
- 155 Mbit/s - 2.5 Gbit/s (AC-coupled)
- GI 50/125 12-channel MMF fiber
- Up to 300-m data transmission at 2.5 Gbit/s
- BER: -12

Structure of motherboard (1st test-bed)

[Figure: motherboard with the SW-LSI, four DS-LSIs, and the fiber ribbon]

- Designed to evaluate the switching function
- Size: 550 x 550 mm
- Multi-wire interconnection board(TM) (Hitachi Chemical, Ltd.)
  - To overcome crosstalk, skew, and propagation loss
  - Layout is optimized according to experimental results
- Eight pairs of 12-channel optical modules

Evaluation results (bit error rate)

[Figure: SW-LSI output from channel D0 of port 0]

- BER (bit error rate): < -11 at a data rate of 1.25 Gbit/s per pin
- Timing budget margin: about 400 ps

Evaluation results (deskew function)

[Figure: test setup. Port 0 and Port 1 of the SW-LSI (ports P0 to P7) connected through a DS-LSI (TX0, RX0, RX1, TX1) and optical TX/RX modules over a 12-channel ribbon fiber (300 m)]

- The deskew function works successfully.

Summary

A prototype network switch, RHiNET-3/SW, for a RHiNET high-performance distributed parallel computing environment

Specifications
- Gbit/s x ports
- Parallel optical data transmission over a distance of up to 1 km
- Aggregate throughput is 0 Gbit/s per board

Architecture
- Hop-by-hop retransmission mechanism
- Credit-based flow control
- Reliable and long-transmission-distance data communication

For -nodes parallel computing

RHiNET-3/SW: high-throughput, long-distance and flexible-flow-control, in a distributed parallel computer system using commercial PCs