An Overview of Standard Cell Based Digital VLSI Design

Similar documents
An overview of standard cell based digital VLSI design

The Design of the KiloCore Chip

Implementing Tile-based Chip Multiprocessors with GALS Clocking Styles

KiloCore: A 32 nm 1000-Processor Array

FABRICATION TECHNOLOGIES

A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing

A 167-processor 65 nm Computational Platform with Per-Processor Dynamic Supply Voltage and Dynamic Clock Frequency Scaling

An Asynchronous Array of Simple Processors for DSP Applications

Lab. Course Goals. Topics. What is VLSI design? What is an integrated circuit? VLSI Design Cycle. VLSI Design Automation

VLSI Digital Signal Processing

Hardware and Applications of AsAP: An Asynchronous Array of Simple Processors

ASIC Physical Design Top-Level Chip Layout

EE 330 Laboratory Experiment Number 11

DIGITAL SANDBOX WORKSHOP Summer Digital Sandbox Mission

ESE 570 Cadence Lab Assignment 2: Introduction to Spectre, Manual Layout Drawing and Post Layout Simulation (PLS)

Logic Synthesis. Logic Synthesis. Gate-Level Optimization. Logic Synthesis Flow. Logic Synthesis. = Translation+ Optimization+ Mapping

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP

The Microprocessor as a Microcosm:

VLSI Design Automation

Laboratory 6. - Using Encounter for Automatic Place and Route. By Mulong Li, 2013

ECE 459/559 Secure & Trustworthy Computer Hardware Design

Evolution of CAD Tools & Verilog HDL Definition

VLSI Design Automation

Design Solutions in Foundry Environment. by Michael Rubin Agilent Technologies

Overview of Digital Design Methodologies

EE 330 Laboratory Experiment Number 11 Design, Simulation and Layout of Digital Circuits using Hardware Description Languages

TUTORIAL II ECE 555 / 755 Updated on September 11 th 2006 CADENCE LAYOUT AND PARASITIC EXTRACTION

VLSI Design Automation. Maurizio Palesi

The IIT standard cell library Version 2.1

TSBCD025 High Voltage 0.25 mm BCDMOS

VLSI Design Automation. Calcolatori Elettronici Ing. Informatica

Dynamic Voltage and Frequency Scaling Circuits with Two Supply Voltages

SYNTHESIS FOR ADVANCED NODES

More Course Information

Silicon Virtual Prototyping: The New Cockpit for Nanometer Chip Design

Cell-Based Design Flow. TA : 吳廸優

Tutorial for Cadence SOC Encounter Place & Route

Cell-Based IC Physical Design & Verification SOC Encounter. Advisor : 李昆忠 Presenter : 蕭智元

An Introduction to Programmable Logic

EE 330 Laboratory Experiment Number 11 Design and Simulation of Digital Circuits using Hardware Description Languages

A Hexagonal Shaped Processor and Interconnect Topology for Tightly-tiled Many-Core Architecture

CPE/EE 427, CPE 527, VLSI Design I: Tutorial #4, Standard cell design flow (from verilog to layout, 8-bit accumulator)

Brief Introduction of Cell-based Design. Ching-Da Chan CIC/DSD

ECE520 VLSI Design. Lecture 1: Introduction to VLSI Technology. Payman Zarkesh-Ha

Digital System Design Lecture 2: Design. Amir Masoud Gharehbaghi

MOSAID Semiconductor

Outline. SoC Encounter Flow. Typical Backend Design Flow. Digital IC-Project and Verification. Place and Route. Backend ASIC Design flow

Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool

ProASIC PLUS FPGA Family

Spiral 2-8. Cell Layout

CMOS VLSI Design Lab 3: Controller Design and Verification

Columbia Univerity Department of Electrical Engineering Fall, 2004

RISECREEK: From RISC-V Spec to 22FFL Silicon

Lecture 11 Logic Synthesis, Part 2

Digital IC- Project 1. Place and Route. Oskar Andersson. Oskar Andersson, EIT, LTH, Digital IC project and Verifica=on

EE 434 ASIC & Digital Systems

ECE425: Introduction to VLSI System Design Machine Problem 3 Due: 11:59pm Friday, Dec. 15 th 2017

Design and Analysis of Ultra Low Power Processors Using Sub/Near-Threshold 3D Stacked ICs

CMOS VLSI Design Lab 3: Controller Design and Verification

310/ ICTP-INFN Advanced Tranining Course on FPGA and VHDL for Hardware Simulation and Synthesis 27 November - 22 December 2006

Design Methodologies

Computational Platforms for Virtual Immersion Architectures: Integrating Specialized Processing Units Into A Homogeneous Array of Processors

DIGITAL DESIGN TECHNOLOGY & TECHNIQUES

Graphics: Alexandra Nolte, Gesine Marwedel, Universität Dortmund. RTL Synthesis

CMOS VLSI Design Lab 3: Controller Design and Verification

101-1 Under-Graduate Project Digital IC Design Flow

Synthesis and APR Tools Tutorial

23. Digital Baseband Design

Physical Placement with Cadence SoCEncounter 7.1

Design of a Low Density Parity Check Iterative Decoder

UCLA 3D research started in 2002 under DARPA with CFDRC

UNIVERSITY OF WATERLOO

System-on-Chip Design for Wireless Communications

Design Methodologies and Tools. Full-Custom Design

A Design Tradeoff Study with Monolithic 3D Integration

Standard Cell Design and Optimization Methodology for ASAP7 PDK

RTL Coding General Concepts

Serial Adapter for I 2 C / APFEL and 8 channel DAC ASIC

EE 466/586 VLSI Design. Partha Pande School of EECS Washington State University

Hardware Modeling using Verilog Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

GUIDELINES FOR USE SMIC 0.18 micron, 1.8 V high-density synchronous single port SRAM IP blocks compiler

Cluster-based approach eases clock tree synthesis

Tutorial for Encounter

Usable gates 600 1,250 2,500 5,000 10,000 Macrocells Logic array blocks Maximum user I/O

FPGA Implementation and Validation of the Asynchronous Array of simple Processors

AMchip architecture & design

Design rule illustrations for the AMI C5N process can be found at:

Cadence On-Line Document

Physical Implementation

Synopsys Design Platform

TABLE OF CONTENTS 1.0 PURPOSE INTRODUCTION ESD CHECKS THROUGHOUT IC DESIGN FLOW... 2

Problem Formulation. Specialized algorithms are required for clock (and power nets) due to strict specifications for routing such nets.

Overview. Design flow. Principles of logic synthesis. Logic Synthesis with the common tools. Conclusions

DRAM Fab partnership for Intelligent RAM (IRAM)

CHAPTER 1 INTRODUCTION

Digital Integrated Circuits A Design Perspective. Jan M. Rabaey

The Design of MCU's Communication Interface

EITF35: Introduction to Structured VLSI Design

Monolithic 3D IC Design for Deep Neural Networks

Lecture Content. 1 Adam Teman, 2018

Transcription:

An Overview of Standard Cell Based Digital VLSI Design With examples taken from the implementation of the 36-core AsAP1 chip and the 1000-core KiloCore chip Zhiyi Yu, Tinoosh Mohsenin, Aaron Stillmaker, Bevan Baas VCL Laboratory, UC Davis

Outline Overview of standard cellbased design Design of the AsAP1 and KiloCore chips including CAD Tool Flow

Standard cell vs. Full-custom IC design Standard-cell based IC design Design using standard cells Standard cells come from library provider Many different choices for cell size, delay, leakage power Many EDA tools to automate this flow Shorter design time Custom IC design (e.g., magic) Design all by yourself Higher performance Lower energy per workload (lower power) Smaller chip area

Standard cell based VLSI design flow Front end System specification and architecture HDL (verilog or VHDL) coding Behavioral simulations using RTL (HDL) Synthesis Gate-level simulations Back end Floorplanning, Power grid design Standard-cell Placement Interconnect routing DRC (Design Rule Check) LVS (Layout vs Schematic) Dynamic simulation and static timing analysis

Outline Overview of standard cell-based design Design of the AsAP1 and KiloCore chips including CAD Tool Flow

AsAP1 Block Diagram GALS array of identical processors Each processor is a reduced complexity programmable DSP with small memories Each processor can receive data from any two neighbors and send data to any of its four neighbors FIFO 0 32 words FIFO 1 32 words IMem 64 words ALU MAC Control DMem 128 words OSC Output Static config Dynamic config B. Baas 6

KiloCore Developed by the VLSI Computation Laboratory at UC Davis, with a similar architecture to AsAP (Asynchronous Array of Processors) A processing chip containing multiple uniform simple processor elements Globally Asynchronous Locally Synchronous (GALS) Each processor has its local clock generator Each processor can communicate with its neighbor processors using dual-clock FIFOs

KiloCore Design Contains 1,000 processors on one chip Fastest clock rate processor designed at a university 12 memories containing 64 KB each for 768 KB of shared memory 8 KiloCore Block Diagram Single Processor One 64 KB memory

Simple Diagram of the Front-End Design Flow System Specification RTL Coding Synthesis Gate level code Example: c =!a & b INV (.in(a),.out(a_inv)); AND (.in1 (a_inv),.in2(b),.out(c)); a b C

Simple diagram of the Back-end design flow gate level Verilog from synthesis Place & Route Final layout (go for fabrication) Gate level Verilog DRC LVS Design rule check Layout vs. schematic Timing information Gate level dynamic and/or static analysis

Back-End: Physical Design Flow

Back-end design of AsAP1 Technology: TSMC 0.18 μm CMOS Standard cell library: Artisan Tools Synthesis: Synopsis Design compiler Placement & Route: Cadence Encounter DRC & LVS: Mentor Calibre Static timing analysis: Primetime

Back-End Design of KiloCore Technology: IBM 32 nm PD-SOI CMOS Standard cell library: ARM-Artisan Tools Synthesis: Synopsis Design compiler Placement & Route: Cadence Encounter DRC & LVS: Mentor Calibre Static timing analysis: Primetime and UltraSim (Spice)

Flow of Placement and Routing Import needed files Floorplan Placement & in-place optimization Clock tree generation Routing

Import Needed Files Gate-level verilog (.v) Geometry information (.lef) Timing information (.lib) INV (.in (a),.out (a_inv)); AND (.in1 (a_inv),.in2 (b),.out (c)); INV: 1um width AND: 2 um width a INV b AND C INV: 1ns delay; AND: 2 ns delay Delay (a c): 1ns + 2ns = 3ns

Floorplan Size of chip Location of Pins Location of main blocks Power supply: give enough power for each gate Power supply (1.8 V) 1.75 V current 1.7 V Vdd (Metal) (need another power stripe) 1.65 V Gate 1 Gate 2 Gate 3 Gate 4 Gnd Voltage drop equation: V2 = V1 I * R

Floorplan of a single processor Inst Mem Data Mem ALU MAC Control Clock InFIFO 0 InFIFO 0

Floorplan of Single KiloCore Processor 18

Placement & In-Place Optimization Placement: place the gates (standard cells) Utilizes a long series of very complex optimizations to meet all design goals as best as possible. Design goals include: maximum delay of all paths, minimize length of all wires (to increase probability of a successful route), etc. In-place optimization Why: there will always be a timing difference between synthesis and layout (e.g., actual wire delay is different than predicted) How: change gate size, insert buffers May not change the circuit function!!

Placement of a single AsAP1 processor

Placement of Single KiloCore Processor 21

Placement of Single KiloCore Processor 22

Clock Tree Design Main parameters: skew, delay, transition time Clock Delay= x S SET Q Clock Skew= x -y R CLR SQS SET SET Q Q R R CLR Q Q CLR Original Clock S SET Q R CLR SQS SET SET Q Q R R CLR Q Q CLR S SET Q R CLR SQS SET SET Q Q Clock Delay = y R R CLR Q Q CLR

Clock tree of a single AsAP1 processor

Clock Tree of a Single KiloCore Processor Colors indicate clock phase which is the same as clock skew 25

Routing Routing is the second step in the Standard Cell Place & Route process and consists of the CAD P&R tool routing all necessary signal wires Local power and ground connections to standard cells are made during an earlier power striping step and are made by a much simpler process of simply laying down horizontal power stripes Routing consists of two main steps Connection of global signals (power) Connect other signals

Metal Layer Topology Routing

Layout of a Single AsAP1 Processor After Routing Area: 0.8mm x 0.8mm

Layout of the first generation 6x6 AsAP1 Area: 30 mm^2 in 180 nm CMOS - 36 processors - 114 PADs One processor

Routing on a Single KiloCore Processor 239 μm x 231.3 μm 574,733 transistors 30

KiloCore Chip Layout KiloCore Chip Entire chip takes up 8 mm x 8 mm or 64 mm 2 LVDS pairs used for high speed data I/O Drivers connected through C4 bump array I/O Drivers Single Processor 31 KiloCore Chip Layout SRAM Memories

Verification after layout DRC (design rule check) LVS (layout vs. schematic) GDS-II vs. (verilog + spice module) Gate level verilog dynamic simulation Mainly check the function Different with synthesis result

Useful tools Dynamic Simulation: Modelsim (Mentor), NC-verilog (Cadence), Active-HDL Synthesis: Design-compiler, design-analyzer (Synopsys) Placement & Routing Encounter & Virtuoso (Cadence) Astro (Synopsys) DRC & LVS Calibre (Mentor) Dracula (Cadence) Static Analysis Primetime (Synsopsys)

Chip Micrograph of the 36-Core AsAP1 IMem OSC Single Processor DMem FIFOs Flow: Standard-cell based Technology: TSMC 0.18 µm Transistors: 1 Proc 230,000 Chip 8.5 million Max speed: 610 MHz @ 2.0 V Area: 1 Proc 0.66 mm² Chip 32.1 mm² Power (1 Proc @ 1.8V, 475 MHz): Typical application 32 mw Typical 100% active 84 mw Power (1 Proc @ 0.9V, 116 MHz): Typical application 2.4 mw ISSCC 2006 HotChips 2006 IEEE Micro 2007 JSSC 2008 34

KiloCore Chip Technology 32nm IBM PDSOI CMOS 8 mm 7.82 mm Processors Indep. Mems Num. Oscs. 2012 1000 per chip 1.78 GHz @ 1.1 V 1.24 GHz, 18 mw @ 0.90 V 115 MHz, 0.7 mw @ 0.56 V 12 per chip Die Area 64 mm 2 Array Area 60 mm 2 7.67 mm 8 mm Transistors C4 Bumps Package 621 Million 564 (162 I/O) 676 Pad Flip-Chip BGA Slide 35 HotChips 2016

KiloCore Measurements A maximum of 1.78 trillion instructions per second Assuming a custom package design Processors achieve their optimal energy times time of 11.1 (pj x ns / instruction) at 0.9 V Supply Voltage Max Clock Freq (MHz) Total Chip Instructions / sec At minimum voltage, KiloCore can perform 115 billion instructions per second using 0.7 Watts low enough to be powered by a single AA battery! Total Chip Power 1.10 V 1782 1.78 Trillion 39 Watts 0.90 V 1237 1.24 Trillion 17 Watts 0.84 V 1000 1.00 Trillion 13 Watts 0.75 V 638 638 Billion 6.3 Watts 0.56 V 115 115 Billion 0.7 Watts 37 HotChips 2016