An FPGA Architecture Supporting Dynamically-Controlled Power Gating

Similar documents
Power Solutions for Leading-Edge FPGAs. Vaughn Betz & Paul Ekas

Reduce Your System Power Consumption with Altera FPGAs Altera Corporation Public

Research Challenges for FPGAs

Abbas El Gamal. Joint work with: Mingjie Lin, Yi-Chang Lu, Simon Wong Work partially supported by DARPA 3D-IC program. Stanford University

Vdd Programmability to Reduce FPGA Interconnect Power

FPGA Clock Network Architecture: Flexibility vs. Area and Power

FPGA Power Management and Modeling Techniques

Vdd Programmable and Variation Tolerant FPGA Circuits and Architectures

Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool

Device And Architecture Co-Optimization for FPGA Power Reduction

Floorplan and Power/Ground Network Co-Synthesis for Fast Design Convergence

SUBMITTED FOR PUBLICATION TO: IEEE TRANSACTIONS ON VLSI, DECEMBER 5, A Low-Power Field-Programmable Gate Array Routing Fabric.

CS310 Embedded Computer Systems. Maeng

DESIGN METHODS IN SUB-MICRON TECHNOLOGIES

Power Gate Optimization Method for In-Rush Current and Power Up Time

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

FPGA Power and Timing Optimization: Architecture, Process, and CAD

Jae Wook Lee. SIC R&D Lab. LG Electronics

FPGA Power Reduction Using Configurable Dual-Vdd

Low Power System-on-Chip Design Chapters 3-4

Architecture Evaluation for

Minimizing Power Dissipation during. University of Southern California Los Angeles CA August 28 th, 2007

Congestion-Aware Power Grid. and CMOS Decoupling Capacitors. Pingqiang Zhou Karthikk Sridharan Sachin S. Sapatnekar

Advanced FPGA Design Methodologies with Xilinx Vivado

A 50% Lower Power ARM Cortex CPU using DDC Technology with Body Bias. David Kidd August 26, 2013

How Much Logic Should Go in an FPGA Logic Block?

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning

LOW POWER DESIGN IMPLEMENTATION OF A SIGNAL ACQUISITION MODULE RAVI BHUSHAN THAKUR. B.Tech, Jawaharlal Nehru Technological University, INDIA 2007

Reducing Power in an FPGA via Computer-Aided Design

MTJ-Based Nonvolatile Logic-in-Memory Architecture

Hoplite-DSP Harnessing the Xilinx DSP48 Multiplexers to efficiently support NoCs on FPGAs. Chethan Kumar H B and Nachiket Kapre

The Impact of Pipelining on Energy per Operation in Field-Programmable Gate Arrays

An Introduction to Programmable Logic

ECE 636. Reconfigurable Computing. Lecture 2. Field Programmable Gate Arrays I

Leakage Mitigation Techniques in Smartphone SoCs

Low-Power Technology for Image-Processing LSIs

Hybrid LUT/Multiplexer FPGA Logic Architectures

Millimeter-Scale Nearly Perpetual Sensor System with Stacked Battery and Solar Cells

Implementing Tile-based Chip Multiprocessors with GALS Clocking Styles

FIELD programmable gate arrays (FPGAs) provide an attractive

CELL-BASED design technology has dominated

Design Methodologies and Tools. Full-Custom Design

On-chip Networks Enable the Dark Silicon Advantage. Drew Wingard CTO & Co-founder Sonics, Inc.

Reconfigurable Computing

EE219A Spring 2008 Special Topics in Circuits and Signal Processing. Lecture 9. FPGA Architecture. Ranier Yap, Mohamed Ali.

Three DIMENSIONAL-CHIPS

A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions

Memory Footprint Reduction for FPGA Routing Algorithms

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

Power Optimization in FPGA Designs

Design and Technology Trends

An Energy-Efficient Near/Sub-Threshold FPGA Interconnect Architecture Using Dynamic Voltage Scaling and Power-Gating

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

Sustainable Computing: Informatics and Systems

Mixed Signal Verification of an FPGA-Embedded DDR3 SDRAM Memory Controller using ADMS

The Design of the KiloCore Chip

ESE534: Computer Organization. Tabula. Previously. Today. How often is reuse of the same operation applicable?

Last Time. Making correct concurrent programs. Maintaining invariants Avoiding deadlocks

Cluster-based approach eases clock tree synthesis

Reducing Leakage Energy in FPGAs Using Region-Constrained Placement

A 167-processor 65 nm Computational Platform with Per-Processor Dynamic Supply Voltage and Dynamic Clock Frequency Scaling

Innovative Power Control for. Performance System LSIs. (Univ. of Electro-Communications) (Tokyo Univ. of Agriculture and Tech.)

A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing

Embedded Programmable Logic Core Enhancements for System Bus Interfaces

Designing for Low Power with Programmable System Solutions Dr. Yankin Tanurhan, Vice President, System Solutions and Advanced Applications

VLSI Test Technology and Reliability (ET4076)

INTRODUCTION TO FPGA ARCHITECTURE

How to get realistic C-states latency and residency? Vincent Guittot

CSE P567 - Winter 2010 Lab 1 Introduction to FGPA CAD Tools

Enhanced Multi-Threshold (MTCMOS) Circuits Using Variable Well Bias

Low-Power Programmable FPGA Routing Circuitry

Chapter 5: ASICs Vs. PLDs

SpiNNaker a Neuromorphic Supercomputer. Steve Temple University of Manchester, UK SOS21-21 Mar 2017

Monolithic 3D IC Design for Deep Neural Networks

COEN-4730 Computer Architecture Lecture 12. Testing and Design for Testability (focus: processors)

0912GN-120E/EL/EP Datasheet E-Series GaN Transistor

High performance, power-efficient DSPs based on the TI C64x

Introduction to Field Programmable Gate Arrays

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Spiral 2-8. Cell Layout

Designing Heterogeneous FPGAs with Multiple SBs *

Academic Clustering and Placement Tools for Modern Field-Programmable Gate Array Architectures

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

A Routing Approach to Reduce Glitches in Low Power FPGAs

ECE 5745 Complex Digital ASIC Design Topic 7: Packaging, Power Distribution, Clocking, and I/O

Low Power Design Techniques

Delay Modeling and Static Timing Analysis for MTCMOS Circuits

FABRICATION TECHNOLOGIES

1. NoCs: What s the point?

Power Consumption in 65 nm FPGAs

Handouts. FPGA-related documents

ECE 486/586. Computer Architecture. Lecture # 2

A Non-Volatile Microcontroller with Integrated Floating-Gate Transistors

Structured Datapaths. Preclass 1. Throughput Yield. Preclass 1

White Paper Performing Equivalent Timing Analysis Between Altera Classic Timing Analyzer and Xilinx Trace

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

Power Reduction Techniques in the Memory System. Typical Memory Hierarchy

Design of Adaptive Communication Channel Buffers for Low-Power Area- Efficient Network-on. on-chip Architecture

Transcription:

An FPGA Architecture Supporting Dynamically-Controlled Power Gating Altera Corporation March 16 th, 2012 Assem Bsoul and Steve Wilton {absoul, stevew}@ece.ubc.ca System-on-Chip Research Group Department of Electrical and Computer Engineering University of British Columbia Vancouver, B.C., Canada

What this talk is about An FPGA architecture supporting dynamic power gating: Turn off regions at run-time to save power, with on-chip control ASIC designers do this regularly Challenges for an FPGA: We don t know about application Routing for signals Handling wakeup in a programmable way control System-on-chip with 2 islands 2

Motivation for new FPGA Arch. FPGA leakage power is major component of total power Almost the same amount as dynamic power for 28nm Cyclone V: Static 32%, Dynamic 42%, I/O 26%* High-end FPGAs are power hungry Entering an era where we can t turn it all on at once! Large power leads to heat issues reliability, cooling cost Mobile hand-held applications Many applications have regions with long idle periods *Altera WP-01158-2.0: Meeting the Low Power Imperative at 28nm 3

Some Relevant Work 4

Statically-Controlled Power Gating Available FPGA power gating is statically-controlled Unused FPGA blocks are turned off at configuration time V DD SRAM Logic Cluster Useful if amount of unused resources in FPGA is large! However, designers select the smallest chip that fits their design Frequently idle blocks still dissipate leakage! 5

Commercial FPGAs Programmable Power Technology (Stratix III, IV, V) Change Vth of LABs according to location on critical path Not runtime-controlled! Power gating for embedded blocks Shut down unused I/O, memories, PLLs, transceivers, etc. Supported in both Altera and Xilinx Done at configuration time Flash*Freeze in IGLOO and ProASIC3L from Microsemi Allows sleep mode for the whole chip 6

We propose an FPGA architecture where regions of the chip can be turned on/off at runtime to control their power consumption 7

Outline Power gating architecture Complications: wakeup current System-level evaluation Current work Other issues 8

Outline Power gating architecture Complications: wakeup current System-level evaluation Current work Other issues 9

Our FPGA Architecture Big Picture Divide FPGA device into power-controlled regions Support dynamically-controlled sleep mode Use general-purpose routing fabric for control signals Power control signal Power domain Power gating region 10

Architecture Details 11

Logic Cluster LC power switch V DD 2 0 1 PG_CNTL1 Logic Cluster 12

Routing Channel Routing Channel Routing Channel Routing Channel Connection box In3 In4 Logic Cluster In1 In2 Track isolation buffer Routing channel A routing channel is shared between the two neighboring logic clusters In3 In4 Logic Cluster In1 In2 13

Routing Channel Routing Channel Routing Channel Routing Channel Routing channel power switch V DD 2 0 1 PG_CNTL1 PG_CNTL2 Turn off routing channel only when both neighbors are off. Routing Channel 14

15 Routing Channel Routing Channel Routing Channel Routing Channel We don t gate switch blocks right now focus of current work!

Control Signals Use general-purpose routing fabric to route control signals. Re-used free input pins. - Input pins are expensive! 16

Area Overhead: Region Power Gating Architecture granularity - Share sleep transistor among region of tiles. Control signal from bordering routing channels. RC RC RC RC From bordering connection blocks Interesting tradeoff: - Area vs. CAD difficulty RC RC Power gating region, size 2x2 17

Evaluation The Good: Potential leakage reduction The Bad: Area, Delay, Leakage power overhead 18

Experimental Setup Sweep architecture N (cluster) W (channel) Create HSPICE netlist (automatic LC buffer sizing) R (region) Run HSPICE to measure performance 45 nm PTM Create HSPICE netlist for power gating architecture Architecture parameters (N, I, W, Fcout,...) Three architectures Ungated Static-gating (SG) Resize sleep transistors No Meets performance constraints (10%)? Dynamic-gating (DG) Yes Collect results (leakage power and area) 19

Granularity Results: Leakage Power Compared to ungated, static-gating and dynamic-gating reduce leakage in sleep mode by more than 44%. Dynamic-gating has 0.8% more leakage than static-gating (R=4). Leakage reduction increase with increased R. 20

Granularity Results: Area overhead Area overhead compared to ungated: dynamic-gating 0.75%, static-gating 0.57% (R=4). Area overhead decrease as R increase. 21

Delay overhead is 10% by design: We chose sleep transistors with delay impact no more than 10% Tradeoff: delay overhead vs. area overhead 22

Outline Power gating architecture Complications: wakeup current System-level evaluation Current work Other issues 23

What is the inrush current problem? 24

Inrush Current in Power Gating VDD SLEEP OFF ON Power-gated functional block Current (ma) Inrush current Voltage droop on power grid lines malfunction of the design Voltage on power grid Voltage droop 25

In power-gated ASIC designs, how the inrush current problem is solved? 26

Related work ASIC Domain Daisy chaining: A chain of parallel power switches instead of one large switch Delay the wakeup of each stage to limit inrush current V DD Virtual V DD Power control T 1 T n-1 27

How the problem is different in our FPGA? 28

Inrush current in ASIC vs. FPGA In ASICs Power gating is well known Application is known at fabrication time Inrush current requirements are known at design time In FPGAs Application is not known at fabrication time Sizes and locations of power-gated blocks are not known Inrush current requirements are not known The solution needs to be configurable! 29

How serious the problem is in our FPGA? 30

FPGA Power Grid Modeling A model of the FPGA power grid to evaluate the effect of inrush current: Multiple metal layers that represent the power grid FPGA tiles modeled as a grid of current sources Obtain current values from power analysis of MCNC benchmarks (max average current per tile Imax_tile 400µA) 31

Effect of Inrush Current Voltage drop due to inrush current in our FPGA architecture. Voltage Drop (mv) Finer granularity Voltage drop on power grid is larger than 100mV! 32

What are the possible solutions for our power-gated FPGA? 33

A Possible FPGA Solution Ask the designer to take care of it!! How? Create a power controller that activates multiple signals in sequence to wakeup one functional block: Requires user experience and knowledge. Complicates design process. May result in power controller with large power consumption. 34

The Proposed Architecture A configurable architecture to limit inrush current. Has two levels 1) Fixed intra-region level: Ensures we can turn on individual regions safely. 2) Configurable inter-region level: Safely sequences the wakeup of a power-gated app. 35

Fixed Intra-region Architecture 36

Use parallel sleep transistors (STs) and sequence wakeup using fixed delay elements. Intra-region Architecture Sizes of STs and delay elements is based on maximum allowed current. A Tradeoff exists between area overhead of intra-region level and power grid metal area. 37

Configurable Inter-region Architecture 38

Inter-region Architecture Intra-region level should solve the problem (theoretically!) However In practice more current might be drawn due to: Switching of unrelated signals in routing. Switching of inputs to logic clusters partially turned on. Which is application specific! Instead of turning on a complete power-gated application at once, turn on one power gating region at a time. 39

Inter-region Architecture FPGA chip with power gating regions. Power gating region in FPGA 40

Inter-region Architecture A power-gated app. is mapped to one or more regions. 41

Inter-region Architecture During wakeup: turn on one region at a time. After 1 x T Delay is inserted before the turn on of next region. 42

Inter-region Architecture During wakeup: turn on one region at a time. After 2 x T Delay is inserted before the turn on of next region. 43

Inter-region Architecture During wakeup: turn on one region at a time. After 3 x T Delay is inserted before the turn on of next region. 44

Inter-region Architecture During wakeup: turn on one region at a time. After 4 x T 45

Inter-region Architecture We need to insert delays between the turn-on of regions. However we don t know: the shape, size, and location of a power-gated app. on FPGA. how many delays are required before turning on a region. Use programmable delay elements (PDEs). 46

Inter-region Architecture ST n V DD ST ST Virtual V 2 1 DD T n-1 T 1 0 1 Inter-region P D E MUX Power Gating Region SLEEP From bordering routing channels Programmable delay element (PDE) PDE size (number of inputs) the largest power-gated app. that can be turned on using one control signal. 47

Inter-region Architecture C1 C1 C1 C1 48

Inter-region Architecture C1 C1 49

Inter-region Architecture C1 C1 C2 C1 C1 C2 50

Results: Inrush Current Handling Settings: Region size = 4x4 Typical power grid (not oversized) Worst case temperature PDE size (# of regions controlled using one signal): 50 Leakage power savings (compared to ungated): With inrush handling: 91% Without inrush handling: 97% Area overhead (compared to ungated): Less than 2.2% 51

Outline Power gating architecture Complications: wakeup current System-level evaluation Current work Other issues 52

Potential Leakage Energy Reduction Use a model that relates: Number and size of idle regions in application Proportion of the time idle regions can be turned off Size of the power controller Potential slowdown of application to the energy savings of the architecture 53

Potential Leakage Energy Reduction Leakage energy reduction (%) compared to ungated. 54 Fraction of idle times Leakage reduction (%) Fraction of idle regions

Conditions to Achieve Energy Savings To achieve savings: Esleep < Eidle Use a model to find the minimum idle time required in order to achieve energy savings when in sleep mode. For clma (775 tiles PDE size 49), Tidle 900 ns. Equivalent to 450 cycles on 500 MHz clock! 55

Outline Power gating architecture Complications: wakeup current System-level evaluation Current work Other issues 56

Switch Boxes Power Gating SBs dissipate more than 50% of static power in FPGA tile! Challenges in power gating SBs: Used to route SLEEP signals Unrelated signals could be routed through SBs in power domain SLEEP 57

Switch Boxes Power Gating Solution: partial power gating for SBs Allows off mode for SBs inside power domain Allows partial off mode for shared SBs SLEEP 58

Outline Power gating architecture Complications: wakeup current System-level evaluation Current work Other issues 59

Applications Mapping to Architecture Mapping applications (high level): Automatic detection for power domains in app Automatic generation for power controller What about verification? 60

Evaluation and Benchmarks What we have used so far: HSPICE circuit models and simulations Architecture parameter sweep Application model to find potential savings, min Tidle, etc. What we really need: Real applications with power domains More advanced power model A way to map applications to architecture 61

Summary Summary of dynamically-controlled power gating FPGA: 1. Enables turning off unused logic at config time static power 2. Enables turning off idle logic at runtime static power 3. Control could come from off or on chip (from soft logic) 4. Programmable inrush current handling 5. Routing architecture is not changed (until now) Challenges: 1. Area overhead small though 2. Performance degradation we assume 10% for now 3. Mapping applications to architecture automated? 62