Innovative Power Control for Ultra Low-Power and High-Performance System LSIs

Innovative Power Control for Ultra Low-Power and High-Performance System LSIs
Hiroshi Nakamura (Univ. of Tokyo), Hideharu Amano (Keio Univ.), Masaaki Kondo (Univ. of Electro-Communications), Mitaro Namiki (Tokyo Univ. of Agriculture and Tech.), Kimiyoshi Usami (Shibaura Inst. of Tech.)

Objective and Strategy
Objective: drastic power reduction of high-performance system LSIs.
Strategy: innovative power control through tight co-optimization / co-design of system software, architecture, and circuit design.
Principle: performance is limited by a bottleneck, whereas power is the summation over the whole system. Unhurried or idle parts should therefore operate slowly and at low power.
[Figure: co-optimization / co-design spanning System Software, Compiler, Architecture, and Circuit Technology.]

Role of Design Hierarchy for Low Power
Circuit level (How?): provides the levers that throttle performance and power, such as clock gating, dual Vth, DVFS, power gating, and back-bias.
Architecture and OS level (When? Where?): finds the chances to set the levers.
Architecture: intra-task/process optimization.
OS: inter-task/process optimization.
[Figure: design hierarchy of OS, Architecture, Circuit, and Device, with the circuit level acting as the throttle lever of power/performance.]

Preferable Throttle Lever
Effectiveness of power reduction.
Low overhead in area, performance, and power: controlling the throttle lever itself takes time and consumes power.
Fine control granularity in both space and time: the locations of busy and idle parts are small and change frequently.
[Figure: a system LSI with Processor (int, fp, cache), Reconfig, Memory, and Network blocks, each shown as busy or idle over time.]

Example of Throttle Levers
For dynamic power: clock gating and DVFS. Both are effective, DVFS in particular (power ∝ Vdd^2).
Clock gating: very fine-grained control with little overhead; easily utilized within circuit-level design.
DVFS: takes tens of μs to change Vdd through the regulator; moderate granularity.
For leakage power: power gating and body biasing. Both are effective, but with large overhead in power and performance.
Body biasing: spatial granularity is fixed by statically defined regions, so it is not easy to use for fine-grained control.
[Figure: Power gating. A sleep transistor driven by a sleep signal sits between the circuit block's virtual ground (VGND) and GND; the circuit block is supplied from Vdd.]
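The "power ∝ Vdd^2" remark can be made concrete with the standard first-order model for CMOS switching power, P = a * C * Vdd^2 * f. The sketch below is illustrative only: the function name, the capacitance, and the two operating points are assumptions chosen for the arithmetic and are not taken from the slides.

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """First-order CMOS switching power: P = activity * C * Vdd^2 * f.

    c_eff    effective switched capacitance in farads (assumed value)
    vdd      supply voltage in volts
    freq     clock frequency in hertz
    activity fraction of cycles in which the capacitance switches;
             clock gating can be viewed as lowering this factor
    """
    return activity * c_eff * vdd ** 2 * freq


# Hypothetical operating points: halve both Vdd and frequency (DVFS).
base = dynamic_power(1e-9, 1.2, 200e6)
scaled = dynamic_power(1e-9, 0.6, 100e6)

# Vdd contributes a factor of 1/4 and frequency a factor of 1/2.
ratio = scaled / base
print(f"scaled / base = {ratio:.3f}")
```

Under this model, halving the frequency alone only halves the power, which is why lowering Vdd along with the frequency is the more effective lever.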

Role of Design Hierarchy for Low Power: The Ideal System
Spatial and temporal granularity is important.
Co-design of circuit, architecture, and OS for power.
Co-optimization of throttle lever control, especially of spatial and temporal granularity. Example: activity localization by the architecture/OS, to make full use of the characteristics of the throttle levers.
[Figure: OS, Architecture, Circuit, and Device levels, annotated with When? Where? How?]

Team Formation of our Research Project
Sub-theme (leader):
Co-operative System Software with Architecture (Prof. Namiki)
Ultra Low-Power Reconfigurable Architecture (Prof. Amano)
Data Resident Architecture (Prof. Nakamura)
Data Resident Compiler (Prof. Kondo)
Ultra Low-Power Circuit Design (Prof. Usami)
[Figure: co-optimization of system software and architecture, architecture/compiler co-optimization, and co-optimization of architecture and circuit design, arranged around a system LSI with processor (int, fp, cache), reconfigurable logic, network, and memory, using VddH and VddL.]

(Project 1) Geyser: Low Power Processor through Fine-grained Runtime Power Gating
Target: leakage power.
Background: leakage reduction techniques so far have been coarse-grained.
Standby time: power gating (coarse grain).
Runtime: cache decay, drowsy cache (coarse grain in time).
Leakage in logic parts (ALU, multiplier, etc.) is becoming serious because fast but leaky transistors are used. The active ratio of those parts is not necessarily high, but the active parts change frequently, that is, cycle by cycle.
Objective: reduce the runtime leakage power of logic parts.
Challenge: how to optimize the granularity of power gating.

Instruction Pipeline with Power-Gating
Geyser: a MIPS-compatible processor with a 5-stage pipeline (IF, ID, EX, MEM, WB; MIPS R3000 pipeline).
Straightforward PG (power gating): turn EX units into active mode only when necessary.
An EX unit becomes active when an instruction that uses it enters the IF stage.
The activated EX unit returns to sleep mode after execution.
[Figure: the fetched instruction is inspected to detect which unit (ALU, Shift, Mult, Div) will be used, and a wake-up signal is sent to that unit.]
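The wake-up step described above (detect which unit an instruction needs, then wake it) can be sketched in software as a small decoder. This is a simplified illustration, not Geyser's actual logic: the opcode and funct values are the standard MIPS I encodings, but the class and function names are hypothetical, and loads, stores, and branches are treated as needing no gated unit purely to keep the sketch short.

```python
# Standard MIPS I funct codes for R-type instructions (opcode 0).
SHIFT_FUNCTS = {0x00, 0x02, 0x03, 0x04, 0x06, 0x07}  # sll, srl, sra, sllv, srlv, srav
MULT_FUNCTS = {0x18, 0x19}                           # mult, multu
DIV_FUNCTS = {0x1A, 0x1B}                            # div, divu
ALU_FUNCTS = {0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x2A, 0x2B}
# I-type ALU opcodes: addi, addiu, slti, sltiu, andi, ori, xori, lui.
ALU_IMM_OPCODES = set(range(0x08, 0x10))


def unit_for(word):
    """Return the execution unit a 32-bit instruction word needs, or None."""
    if word == 0:                      # all-zero word is the canonical NOP
        return None
    opcode = (word >> 26) & 0x3F
    if opcode == 0:
        funct = word & 0x3F
        if funct in SHIFT_FUNCTS:
            return "SHIFT"
        if funct in MULT_FUNCTS:
            return "MULT"
        if funct in DIV_FUNCTS:
            return "DIV"
        if funct in ALU_FUNCTS:
            return "ALU"
        return None
    if opcode in ALU_IMM_OPCODES:
        return "ALU"
    return None                        # simplification: other opcodes not modeled


class ExUnits:
    """Tracks which power-gated units are awake (hypothetical model)."""

    def __init__(self):
        self.awake = {u: False for u in ("ALU", "SHIFT", "MULT", "DIV")}

    def on_fetch(self, word):
        """Send the wake-up signal as soon as the instruction is fetched."""
        unit = unit_for(word)
        if unit is not None:
            self.awake[unit] = True
        return unit

    def on_execute_done(self, unit):
        """Straightforward strategy: go back to sleep right after execution."""
        self.awake[unit] = False


units = ExUnits()
woken = units.on_fetch(0x01090018)     # mult $t0, $t1
```

Waking the unit at fetch rather than at execute gives it the intervening pipeline stages to power up before its operands arrive.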

Challenges for Run-Time Power-Gating: Energy Overhead
[Figure: power versus time across a sleep and wake-up sequence, relative to normal leakage, with regions labeled 1 to 4.]
Regions 1 + 3: energy overhead of going to sleep and waking up.
Region 2: the part of the leakage saving that offsets the overhead. The break-even time (BET) is the point at which 1 + 3 = 2.
Region 4: net energy saving.
The sleep period should be longer than the BET; otherwise, total energy consumption increases.
The BET gives the smallest granularity for power gating.
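The break-even argument can be worked through with a deliberately simple model. The sketch below assumes that a sleeping unit saves a constant amount of leakage energy per cycle and that the sleep/wake-up overhead is a fixed lump equal to BET cycles' worth of saving. The real curve in the figure is not linear, so this is an approximation for intuition only, and the function name and sample numbers are hypothetical.

```python
def net_energy_saving(sleep_cycles, bet_cycles, leak_per_cycle=1.0):
    """Net energy saved by one sleep period under a linear model.

    Assumptions (not from the slides):
      - leakage saved while asleep is leak_per_cycle for every cycle
      - the overhead (regions 1 + 3) equals bet_cycles * leak_per_cycle,
        which is what makes bet_cycles the break-even point
    A negative result means power gating cost more than it saved.
    """
    return leak_per_cycle * (sleep_cycles - bet_cycles)


# Hypothetical BET of 44 cycles, with two different idle lengths.
short_sleep = net_energy_saving(sleep_cycles=15, bet_cycles=44)   # loses energy
long_sleep = net_energy_saving(sleep_cycles=200, bet_cycles=44)   # saves energy
```

With these numbers the 15-cycle sleep yields -29 units and the 200-cycle sleep yields +156 units, which is the sense in which the BET sets the smallest useful granularity.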

Break Even Time of Each Functional Unit
[Figure: bar chart of BET in cycles at 200 MHz, 90 nm technology, for ALU, Shift, Mult, Div, and CP0 at 25, 65, 100, and 125 (presumably °C). The transcribed bar values, between 2 and 114 cycles, cannot be reliably assigned to individual bars.]
The BET shortens as the chip temperature climbs, because leakage current depends heavily on temperature.
We need novel PG strategies that take the BET into account.

Power Gating Strategies
Requirement: power off EX units for longer than the BET.
Static strategies:
straightforward: EX units always go to sleep after execution.
ideal compiler (ideal compiler-directed): the exact average idle time of the EX units after each instruction is known (for reference only).
Dynamic strategies:
L1 miss: EX units go to sleep only when an L1 cache miss is encountered (L1 miss penalty = 15 cycles).
L2 miss: EX units go to sleep only when an L2 cache miss is encountered (L2 miss penalty = 200 cycles).
Both static and dynamic strategies:
ideal compiler + L2 cache miss.
ideal (God): the ideal dynamic strategy, in which the exact idle time of the EX units is known at any time. This is the upper limit of PG (for reference only).
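To see how the strategies above trade off against the BET, the following sketch replays a short, made-up trace of idle periods under each policy. It reuses a linear energy model (each sleep saves `cycles - bet` units), treats an L2 miss as also being an L1 miss, and omits the compiler-directed variants; all of these are simplifying assumptions, and the names and trace values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdlePeriod:
    cycles: int   # how long the EX unit stays idle
    cause: str    # "none", "L1" or "L2" (what the pipeline is waiting on)


def goes_to_sleep(strategy, period, bet):
    """Decide whether the unit is power-gated for this idle period."""
    if strategy == "straightforward":
        return True                           # always sleep after execution
    if strategy == "L1":
        return period.cause in ("L1", "L2")   # an L2 miss also missed in L1
    if strategy == "L2":
        return period.cause == "L2"
    if strategy == "ideal":
        return period.cycles > bet            # oracle: sleep only when it pays
    raise ValueError(f"unknown strategy: {strategy}")


def net_saving(strategy, trace, bet):
    """Sum of (idle cycles - BET) over the periods in which the unit sleeps."""
    return sum(p.cycles - bet for p in trace if goes_to_sleep(strategy, p, bet))


# Hypothetical trace: two short gaps, one L1 miss, one L2 miss.
TRACE = [
    IdlePeriod(3, "none"),
    IdlePeriod(15, "L1"),
    IdlePeriod(200, "L2"),
    IdlePeriod(5, "none"),
]

for bet in (10, 20):
    results = {s: net_saving(s, TRACE, bet)
               for s in ("straightforward", "L1", "L2", "ideal")}
    print(bet, results)
```

In this toy trace the L1-miss policy matches the oracle when the BET is below the 15-cycle miss penalty, and the L2-miss policy matches it once the BET exceeds 15 cycles.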

Result for Frequently Used Execution Unit: FPADD for MGRID
[Figure: relative energy compared to non-PG versus BET (cycles), for the straightforward, ideal compiler, L1, L2, ideal comp. + L2, and ideal (God) strategies.]
straightforward: when the BET is longer than the sleep time, energy is wasted.
ideal compiler: fewer chances to sleep as the BET gets longer.
L1: the resulting sleep time is about 15 cycles. This is ideal for BET < 15, but wastes energy for longer BET.
L2: the resulting sleep time is 200 cycles. This is ideal for longer BET; for shorter BET, the compiler is effective.

Collaboration with Compiler / OS
Suggested power gating strategy: co-optimization of the control granularity of the PG lever.
The compiler gives its directions assuming a short BET, because compiler-directed PG is effective for shorter BET.
For shorter BET (high temperature), the compiler direction is put to use, taking the (compiler + L2-miss) strategy.
For longer BET (low temperature), take the L2-miss strategy and ignore the compiler direction.
The OS is expected to switch between strategies by observing changes in the BET.
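The switching rule on this slide can be written down as a small policy function that an OS might evaluate periodically. The slide does not say where "shorter" ends and "longer" begins, so the threshold below is a placeholder assumption, as are the function and strategy names.

```python
# Assumed boundary between "short" and "long" BET, in cycles.
# The slides give no value; 15 is only a placeholder for illustration.
SHORT_BET_THRESHOLD = 15


def select_pg_strategy(bet_cycles, threshold=SHORT_BET_THRESHOLD):
    """Pick the power gating strategy for the currently observed BET.

    Short BET (high temperature): honor compiler directions and also
    sleep on L2 misses.
    Long BET (low temperature): sleep on L2 misses only and ignore the
    compiler directions.
    """
    if bet_cycles < threshold:
        return "compiler+L2-miss"
    return "L2-miss"


hot_chip = select_pg_strategy(bet_cycles=8)     # short BET
cool_chip = select_pg_strategy(bet_cycles=90)   # long BET
```

A real implementation would probably add hysteresis around the threshold so that small BET fluctuations do not cause the OS to flip strategies repeatedly.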

Leakage Monitor [Koyama et al., ITC-CSCC 08] [Usami et al., ISLPED 2011 (poster 15)]
The BET depends on the dynamic environment, such as temperature, and on process variation.
An on-chip leakage monitoring circuit is used: more leakage results in faster charging of VGND, so leakage is estimated by measuring the rise time of VGND to VREF.
The OS can select the best PG strategy by observing this monitor.
[Figure: VGND voltage (V) versus sleep time (s) after the sleep transistor turns off. With more leakage, VGND reaches the reference (VREF) sooner than with less leakage.]
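The relation between rise time and leakage can be illustrated with a first-order estimate. The sketch assumes that the leakage current is roughly constant while it charges a lumped VGND capacitance from 0 V to VREF, so that I = C * VREF / t_rise. This is a textbook approximation chosen for illustration and is not the circuit or calibration method of the cited papers; the capacitance and voltage values are hypothetical.

```python
def estimate_leakage_current(c_vgnd, v_ref, rise_time):
    """Estimate average leakage current from the VGND rise time.

    Assumes a constant current charging a lumped capacitance:
        I = C * V_ref / t_rise
    c_vgnd     VGND node capacitance in farads (assumed known)
    v_ref      reference voltage in volts
    rise_time  measured time for VGND to reach v_ref, in seconds
    """
    if rise_time <= 0:
        raise ValueError("rise_time must be positive")
    return c_vgnd * v_ref / rise_time


# Hypothetical readings: the hotter chip charges VGND four times faster.
cool = estimate_leakage_current(c_vgnd=1e-12, v_ref=0.3, rise_time=4e-6)
hot = estimate_leakage_current(c_vgnd=1e-12, v_ref=0.3, rise_time=1e-6)
```

Under this model a shorter rise time maps directly to a larger leakage estimate, which the OS could then translate into a shorter BET.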

Co-Optimization of Throttle Lever Control in Fine-grained Runtime Power Gating
PG strategy: the best granularity changes dynamically (e.g. with temperature).
PG control through activity localization.
PG lever controlled in 10~100 cycles.
Who should be responsible for PG control depends on the granularity of control:
PG control granularity (BET): 10 ~ 100 cycles.
The best granularity of control changes every msec.
[Figure: OS, Architecture, and Circuit levels.]

Prototype CPU: Geyser-1 [Ikebuchi et al., ASSCC 09]
MIPS R3000, Fujitsu e-shuttle 65 nm, Vdd = 1.2 V.
Successfully in operation: the first successful cycle-by-cycle power gating.
[Figure: chip layout, 2.1 mm x 4.2 mm, showing the Shifter, MULT, DIV, ALU, and leakage monitor.]

Prototype CPU: Geyser-2
Geyser-2: the second prototype, with caches and TLBs on-chip.
Maximum working frequency: 210 MHz (wakeup latency is less than 5 ns).
Demonstration at ISLPED 2011, booth 4.
[Figure: leakage power (mW) versus temperature (°C).]

(Project 2) Cool Mega Array
A reconfigurable accelerator aimed not at performance but at power efficiency.
The PE array consists only of combinational logic, which reduces the power consumed by registers and clock distribution.
Low-voltage, low-power PE array operation, balanced with the data bandwidth of memory.
Localization of operations: Operation / Reg. access, Performance / Power.
[Figure: Architecture of CMA. A PE array of combinational circuits (PE, SE) in a DVS region, connected to DMEM banks.]

Prototype: CMA-1
Fujitsu 65 nm, 8x8 PE array, 12 KB data memory, control part at 1.2 V.
Maximum power efficiency: 223.2 MOPS/mW.
Demonstration at ISLPED 2011, booth 4.
[Figure: power efficiency (MOPS/mW) versus PE array voltage (V).]

Summary and Future Direction
Geyser: run-time power gating processor; the first cycle-by-cycle power gating processor.
Cool Mega Array: power-efficiency accelerator.
Other projects:
Fine grain power gating NoCs [Matsutani et al., NOCS 2010] [Matsutani et al., IEEE Trans. on CAD, 4/2011].
Linux-based evaluation platform (demonstration at ISLPED 2011, booth 4).
Towards integrated system LSIs: evaluation through real integration via 3D wireless NoCs.
[Figure: Geyser CPU, CMA units, main memory, and L2 cache integrated together.]

Selected Publications
1. N. Seki, et al., "A Fine Grain Dynamic Sleep Control Scheme in MIPS R3000," Proc. of ICCD-2008, pp. 612-617, 2008.
2. K. Usami, et al., "Design and Implementation of Fine-grain Power Gating with Ground Bounce Suppression," Proc. of VLSI Design 2009, pp. 381-386, 2009.
3. N. Takagi, et al., "Cooperative Shared Resource Access Control for Low Power Chip Multiprocessors," ISLPED-2009, pp. 177-182, 2009.
4. S. Saito, et al., "MuCCRA-Cube: A 3D Dynamically Reconfigurable Processor with Inductive Coupling Link," Proc. of FPL09, pp. 6-11, 2009.
5. D. Ikebuchi, et al., "Geyser-1: A MIPS R3000 CPU core with fine grain runtime power gating," Proc. of IEEE ASSCC-2009, pp. 281-284, 2009.
6. H. Matsutani, et al., "Ultra Fine-Grained Run-Time Power Gating of On-Chip Routers for CMPs," Proc. of NOCS'10, pp. 61-68, 2010.
7. H. Matsutani, et al., "Performance, Area, and Power Evaluations of Ultrafine-Grained Run-Time Power-Gating Routers for CMPs," IEEE Trans. on CAD (TCAD), Vol. 30, No. 4, pp. 520-533, Apr. 2011.
8. K. Usami, et al., "On-chip Detection Methodology for Break-Even Time of Power Gated Function Units," Proc. of ISLPED-2011 (to appear).