Ting Wu, Chi-Ying Tsui, Mounir Hamdi Hong Kong University of Science & Technology Hong Kong SAR, China


CMOS Crossbar. Ting Wu, Chi-Ying Tsui, Mounir Hamdi. Hong Kong University of Science & Technology, Hong Kong SAR, China.

Outline: Motivations; Problems of Designing a Large Crossbar; Our Approach - Pipelined MUX Core; Interface Link and Clocking Design; Conclusions.

Motivations. Advances in fiber-optic link technology and WDM have made raw bandwidth abundant, so switches/routers, rather than the transmission links, are becoming the bottleneck of the network. Switches/routers with high line rates (OC-192, 10Gb/s) and a large number of I/O ports (128×128 or 256×256) are becoming a necessity. Key issues in designing a high-speed router: the switching fabric interconnect, queuing schemes, arbiters, and multicast.

Fabric Interconnects: Crossbar. The crossbar (crosspoint) fabric is becoming the preferred interconnect fabric for high-speed, scalable switching. It has been proven that a crossbar (even input-queued) can achieve as high a throughput as any switch. A crossbar inherently supports multicast efficiently, and QoS can be implemented reasonably easily. The key challenge is scalability to high line rates and large port counts; CMOS technology can achieve the required density at low cost.

Architecture of the CMOS Crossbar Switch. The crossbar switch core fulfills the switch/router function; the controller configures the crossbar core switching; high-speed data links connect the switch fabric to the line cards; and a PLL provides a precise on-chip clock. [Block diagram: 1GHz 256×256 crossbar switch core surrounded by four high-speed data links, the controller, and the PLL.]

Two Approaches to Build the Core. X-Y based crossbar: area scales as N²; speed is limited by the capacitance at both the input and output lines; control requires N² bits. MUX-based crossbar: area also scales as N²; speed is limited by the capacitance at the input line only; control requires N·log₂N bits. [Schematics: a 4×4 X-Y crossbar and a 4-input 2:1-MUX tree, each with its control word.]
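The control-storage difference between the two approaches can be checked with a short sketch; the formulas simply restate the counts above, and the function names are mine:

```python
# Control storage for an N x N crossbar, per the two approaches above.
import math

def xy_control_bits(n):
    # X-Y crossbar: one enable bit per crosspoint -> N^2 bits
    return n * n

def mux_control_bits(n):
    # MUX-based crossbar: a log2(N)-bit select word per output -> N*log2(N) bits
    return n * int(math.log2(n))

for n in (128, 256):
    print(f"N={n}: X-Y needs {xy_control_bits(n)} bits, "
          f"MUX tree needs {mux_control_bits(n)} bits")
```

For N = 256 the MUX-based core needs 256·8 = 2048 control bits versus 65536 for the X-Y crossbar, which is also why a one-byte control word per port suffices.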

Problems of Designing a Large Crossbar Switch. The switch core area scales as N², so design complexity increases, and the performance requirement grows much faster than what CMOS technology scaling can deliver. The throughput requirement can be met by using multiple bit-slices of the core (e.g. 8), but the core size then increases 8 times. Wire delay is also substantial in a high-performance chip.

Our Approach: Pipelined MUX Crossbar Digital Core. A digital MUX-tree-based design technique guarantees high performance as well as low design complexity. To integrate a large crossbar switch, only 2 bit-slices are embedded in the digital core instead of 8 (over 60% area saving). A 1GHz digital core is required for the 2Gb/s interface, and the MUX tree can be pipelined to meet this requirement. An additional pipeline stage is added to drive the long wires.

SDFF Embedded with MUX. A high-performance Semi-Dynamic Flip-Flop (SDFF) is used [Klass 98; Stojanovic et al. 99]. It is one of the fastest flip-flops due to its negative setup time, and embedding the MUX function into it incurs little overhead. [Schematic: SDFF with an embedded 4-to-1 MUX at its input stage.]

Pipeline Stage Partition. The pipeline of the 256-to-1 MUX can be partitioned in two ways: (a) natural, with a 16-to-1 MUX in the 1st stage and a 16-to-1 MUX in the 2nd stage; or (b) balanced, with an 8-to-1 MUX in the 1st stage and a 32-to-1 MUX in the 2nd stage. [Diagram: a static MUX followed by an SDFF-embedded MUX in each stage, for both partitions.]

Driving the Long Wire: Adding Repeaters Cannot Satisfy the 1GHz Requirement. The 1st stage is critical due to the large capacitance on the input line. A distributed RC wire model is employed, and repeaters can be inserted to reduce the wire delay. However, for 256 ports, even with the optimal repeater size and number, the delay is still larger than 1 ns. [Plot: driver delay and wire delay (ps) versus repeater number, for 64, 128, and 256 ports.]
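The repeater-insertion trade-off can be sketched with a first-order Elmore-delay model. All device and wire parameters below are illustrative assumptions for a 0.25µm-class process, not the values from the slides, so only the trend (delay grows with line length even with optimal repeaters) should be read from it:

```python
# Distributed RC wire driven through k repeatered segments: Elmore-delay sketch.
R_DRV = 2000.0    # assumed driver/repeater resistance at unit size (ohm)
C_G   = 5e-15     # assumed repeater input capacitance at unit size (F)
R_MM  = 100.0     # assumed wire resistance per mm (ohm/mm)
C_MM  = 200e-15   # assumed wire capacitance per mm (F/mm)

def wire_delay(length_mm, k, h):
    """Elmore delay of a wire split into k equal segments, each fronted
    by a repeater of size h (in unit-size multiples)."""
    rw = R_MM * length_mm / k       # per-segment wire resistance
    cw = C_MM * length_mm / k       # per-segment wire capacitance
    seg = 0.7 * (R_DRV / h) * (cw + h * C_G) + rw * (0.7 * h * C_G + 0.4 * cw)
    return k * seg

def best_repeatered_delay(length_mm, max_k=32, max_h=64):
    """Exhaustively pick the best repeater count and size."""
    return min(wire_delay(length_mm, k, h)
               for k in range(1, max_k + 1)
               for h in range(1, max_h + 1))

# A full-length input line vs the half-length line of a 128x128 sub-crossbar:
for mm in (14.0, 7.0):
    print(f"{mm:5.1f} mm line: best delay {best_repeatered_delay(mm) * 1e12:7.1f} ps")
```

Because the optimally repeatered delay still grows roughly linearly with line length, halving the number of cells a line must drive (as the sub-crossbar split on the next slide does) buys more than any amount of repeater tuning.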

Adding One Pipeline Stage to Drive the Long Wire. One more stage is added to drive the long wire by inserting a flip-flop: the whole 256×256 crossbar is divided into four 128×128 sub-crossbars, so each input line only needs to drive 128 cells instead of 256. For 128 ports, sub-ns delay is achievable. [Plot: driver delay and wire delay (ps) versus repeater number with the added flip-flop stage, for 64, 128, and 256 ports.]

3-Stage Pipelined MUX Crossbar: Floor Plan. The 256×256 crossbar consists of four 128×128 sub-crossbars running at 1GHz. There are 2 pipeline stages in each sub-crossbar, and 2 bit-slices are embedded to match the 2Gb/s data link. [Floor plan: sub-crossbars 0-3 with inputs IN[0:63]...IN[192:255] and outputs OUT[0:63]...OUT[192:255]; legend: "2:1" denotes an SDFF embedded with a 2-to-1 MUX, "R" a register.]

3-Stage Pipelined MUX Crossbar: Timing Diagram. In sub-crossbar 0, inputs [0:127] are switched in the 1st and 2nd stages, while in sub-crossbar 3, inputs [128:255] are switched in the 2nd and 3rd stages. Finally, the two groups of outputs are fed into an SDFF embedded with a 2-to-1 MUX to complete the 256-to-1 MUX operation. [Timing diagram: static MUX and SDFF stages of sub-crossbars 0 and 3 merging into the final 2-to-1 MUX.]
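The selection path described above can be sketched behaviorally. Pipeline registers are omitted (dataflow only), and the 8-to-1/16-to-1 split inside each 128-input half is an assumed partition for illustration:

```python
# Behavioral sketch of the pipelined 256-to-1 selection path: each
# 128-input half is reduced over two MUX stages, then a final 2-to-1
# MUX merges the halves.

def mux(inputs, sel):
    """Plain combinational N-to-1 multiplexer."""
    return inputs[sel]

def crossbar_256_to_1(inputs, sel):
    """Select inputs[sel] the way the pipelined core routes one output."""
    assert len(inputs) == 256 and 0 <= sel < 256
    lo = sel % 128  # select within a 128-input sub-crossbar

    def half_select(half):
        # stage 1: 128 -> 16 via sixteen 8-to-1 MUXes
        stage1 = [mux(half[i * 8:(i + 1) * 8], lo % 8) for i in range(16)]
        # stage 2: 16 -> 1 via one 16-to-1 MUX
        return mux(stage1, lo // 8)

    out_low = half_select(inputs[:128])    # inputs [0:127] path
    out_high = half_select(inputs[128:])   # inputs [128:255] path
    # stage 3: final SDFF-embedded 2-to-1 MUX between the two halves
    return mux([out_low, out_high], sel // 128)
```

As in the hardware, both halves compute in parallel every cycle and the top select bit picks the winner at the final 2-to-1 MUX.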

Sub-Crossbar Post-Layout Simulation Results. Technology: TSMC 0.25µm SCN5M Deep. Layout size: 5.9 mm × 3.1 mm. Stage delays: sub-crossbar 0 - 1st stage 959 ps, 2nd stage 845 ps, 3rd stage 866 ps; sub-crossbar 3 - 1st stage 959 ps, 2nd stage 236 ps, 3rd stage 866 ps.

Control Circuits. Control bits configure the corresponding MUXes in the crossbar pipeline at the correct timing stage. To save pin count, the control inputs are embedded within the data inputs: each incoming frame includes a one-byte control word and 64 bits of data. The timing constraints can be satisfied by carefully pipelining the control path.

Control Circuits (cont'd). The bang-bang PD samples the 2Gb/s input and converts it into 2 bits, each at 1Gb/s. The re-synchronization block synchronizes each input signal to the main clock. The DMUX demultiplexes the signal into data and control bits, and the counter counts 4/36 to control the DMUX. [Block diagram: interface clock, bang-bang PD, resynchronization, counter and pulse generator, and DMUX steering 8 control bits to the control path and 64 data bits to the data path and crossbar.]
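The counter-controlled DMUX can be sketched behaviorally. This is my interpretation of the "counts 4/36" note: at 2 bits per 1GHz cycle, a 72-bit frame (one-byte control word plus 64 data bits) arrives as 36 two-bit words, of which the counter steers the first 4 (8 control bits) to the control path and the remaining 32 (64 data bits) to the data path; the bit ordering is an assumption:

```python
# Behavioral sketch of the counter-controlled DMUX (assumed frame layout:
# control word first, data bits after; bit ordering is illustrative).
WORDS_PER_FRAME = 36  # 72 frame bits / 2 bits per 1GHz cycle
CONTROL_WORDS = 4     # 8 control bits / 2 bits per cycle

def dmux_frame(two_bit_words):
    """Split one frame of 2-bit words into (control_bits, data_bits)."""
    assert len(two_bit_words) == WORDS_PER_FRAME
    control = [b for w in two_bit_words[:CONTROL_WORDS] for b in w]
    data = [b for w in two_bit_words[CONTROL_WORDS:] for b in w]
    return control, data
```

In hardware the same steering is done cycle by cycle: the counter raises the control-path select for the first 4 of every 36 clock cycles.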

Full Crossbar Core Layout and Specification. Full 256×256 crossbar core with 2 bit-slices. Technology: TSMC 0.25µm SCN5M Deep, 5-layer metal. Layout size: 14 mm × 8 mm. Transistor count: 2000k. Supply voltage: 2.5 V. Clock frequency: 1GHz. Power: 40W.

Interface Link and Clocking Design. A dual-loop delay-locked loop (DLL) design technique is adopted in the data link for data and clock recovery. The main analog DLL generates multiple clock phases for interpolation in the all-digital periphery loops. A half-rate bang-bang phase detector in each periphery loop samples the 2Gb/s incoming signal using the 1GHz clock. A third loop, an analog PLL, provides the 1GHz on-chip clock.

Interface Link and Clocking Design: System Clock. The system clock is 250MHz; the PLL multiplies it to provide the precise 1GHz clock for the whole chip, and several periphery loops share one analog DLL. [Block diagram: per-channel bang-bang PDs, FSMs, and phase selectors/interpolators in the periphery loops; PFD, charge pump with loop filter, and 1/4 VCDL in the DLL; PD, CP+LF, and VCO in the PLL driving the clock distribution network.]

Conclusions. A 2Gb/s 256×256 CMOS crossbar switch core is achievable with current process technology. Significant area saving is obtained by using only 2 bit-slices in the crossbar switch core. A 3-stage pipelined MUX circuit is proposed to reduce the cycle time to less than 1 ns, and post-layout simulation shows that each stage can run at a clock rate higher than 1GHz. The full 256×256 crossbar core has been laid out to demonstrate the design, and PLL and dual-loop DLL circuits have been designed for the clocking and the high-speed link of the whole chip.