
Innovative Power Control for Ultra Low-Power and High-Performance System LSIs
Hiroshi Nakamura (Univ. of Tokyo), Hideharu Amano (Keio Univ.), Masaaki Kondo (Univ. of Electro-Communications), Mitaro Namiki (Tokyo Univ. of Agriculture and Tech.), Kimiyoshi Usami (Shibaura Inst. of Tech.)

Objective and Strategy
Objective: drastic power reduction of high-performance system LSIs.
Strategy: innovative power control through tight co-optimization / co-design of system software, architecture, and circuit design.
Principle: performance is limited by a bottleneck, whereas power is the sum over the whole system. Unhurried or idle parts should therefore operate slowly at low power.
[Figure: co-optimization / co-design across System Software, Compiler, Architecture, and Circuit Technology.]

Role of Design Hierarchy for Low Power
Circuit and device level (How?): provide the levers that throttle performance and power, e.g. clock gating, dual Vth, DVFS, power gating, back-bias.
Architecture and OS level (When? Where?): find the chances to set those levers.
  Architecture: intra-task/process optimization.
  OS: inter-task/process optimization.

Preferable Throttle Lever
Effectiveness of power reduction.
Low overhead in area, performance, and power: controlling the throttle lever itself takes time and consumes power.
Fine control granularity in both space and time: the busy and idle parts are small, and their locations change frequently.
[Figure: a system LSI made up of a processor (int, fp, cache), a reconfigurable unit, cache, memory, and network, with busy and idle parts changing over time.]

Example of Throttle Levers
For dynamic power: clock gating and DVFS. Both are effective, DVFS in particular (power ∝ Vdd^2).
  Clock gating: very fine-grained control with little overhead; easily used within circuit-level design.
  DVFS: changing Vdd through the regulator takes tens of μs; moderate granularity.
For leakage power: power gating and body biasing. Both are effective, but they carry a large overhead in power and performance.
  Body biasing: spatial granularity is fixed to statically defined regions.
  Fine-grained control is therefore not easy.
[Figure: power gating. A sleep signal drives a sleep transistor placed between the circuit block's virtual ground (VGND) and GND; the circuit block is supplied from Vdd.]
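The Vdd^2 dependence noted for DVFS can be illustrated with the standard first-order switching-power model P = C * Vdd^2 * f. The sketch below is not taken from the slides; the function names and the scaling factors are hypothetical and only show why lowering voltage together with frequency pays off more than lowering frequency alone.

```python
def dynamic_power(c_eff, vdd, freq):
    """First-order dynamic power model: P = C_eff * Vdd^2 * f."""
    return c_eff * vdd ** 2 * freq


def relative_power(vdd_scale, freq_scale):
    """Dynamic power after scaling Vdd and f, relative to the unscaled power."""
    return vdd_scale ** 2 * freq_scale


if __name__ == "__main__":
    # Hypothetical operating points, chosen only for illustration.
    print("frequency halved, Vdd unchanged:", relative_power(1.0, 0.5))
    print("frequency and Vdd both at 80%:  ", relative_power(0.8, 0.8))
```

Under this model, frequency scaling alone reduces power linearly, while the voltage term contributes quadratically.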

Role of Design Hierarchy for Low Power: The Ideal System
In the ideal system, every layer (OS, architecture, circuit, device) takes part in deciding when, where, and how the levers are set.
Spatial and temporal granularity is important.
Co-design of circuit, architecture, and OS for power.
Co-optimization of throttle lever control, especially of its spatial and temporal granularity. Example: activity localization by the architecture/OS, so that the characteristics of the throttle levers are fully used.

Team Formation of our Research Project
Sub-theme (leader):
  Co-operative System Software with Arch. (Prof. Namiki)
  Ultra Low-Power Reconf. Architecture (Prof. Amano)
  Data Resident Architecture (Prof. Nakamura)
  Data Resident Compiler (Prof. Kondo)
  Ultra Low-Power Circuit Design (Prof. Usami)
[Figure: a system LSI (processor with int, fp, and cache; reconfigurable unit; memory; network; logic blocks supplied from VddH and VddL) surrounded by three collaborations: co-optimization of system software and architecture, architecture/compiler co-optimization, and co-optimization of architecture and circuit design.]

(Project 1) Geyser: Low Power Processor through Fine-grained Runtime Power Gating
Target: leakage power.
Background: leakage reduction techniques so far
  Standby time: power gating (coarse grain).
  Runtime: cache decay, drowsy cache (coarse grain in time).
Leakage in logic parts (ALU, multiplier, etc.) is becoming serious, because fast but leaky transistors are used there.
The active ratio of those parts is not necessarily high, but which parts are active changes frequently, that is, cycle by cycle.
Objective: reduce the runtime leakage power of logic parts.
Challenge: how to optimize the granularity of power gating.

Instruction Pipeline with Power-Gating
Geyser: MIPS-compatible processor with a 5-stage pipeline (IF, ID, EX, MEM, WB; MIPS R3000 pipeline).
Straightforward PG (power gating): turn EX-units into active mode only when necessary.
  An EX-unit becomes active when an instruction that uses it enters the IF stage.
  The activated EX-unit returns to sleep mode after execution.
[Figure: the fetched instruction is examined to detect which unit (ALU, Shift, Mult, Div) will be used, and a wake-up signal is sent to that unit.]
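The straightforward policy described above can be sketched as a small behavioral model. This is only an illustration, not the Geyser implementation: the opcode-to-unit table, the function name, and the sample program are hypothetical, and opcodes missing from the table are assumed to use none of the gated units. A unit is assumed to stay awake across back-to-back instructions that use it.

```python
# Hypothetical mapping from opcode to the power-gated EX-unit it needs.
UNIT_FOR_OPCODE = {
    "add": "ALU",
    "sub": "ALU",
    "sll": "Shift",
    "mult": "Mult",
    "div": "Div",
}


def count_wakeups(program):
    """Count wake-ups per unit under straightforward power gating.

    A unit is woken when an instruction needs it and it is asleep, and it
    returns to sleep as soon as the following instruction does not use it.
    """
    wakeups = {unit: 0 for unit in sorted(set(UNIT_FOR_OPCODE.values()))}
    awake = None  # at most one EX-unit is awake in this simplified model
    for opcode in program:
        needed = UNIT_FOR_OPCODE.get(opcode)  # None: no gated unit needed
        if needed is not None and needed != awake:
            wakeups[needed] += 1
        awake = needed
    return wakeups


if __name__ == "__main__":
    print(count_wakeups(["add", "add", "sll", "j", "add", "mult"]))
```

Each counted wake-up carries an energy overhead, which is why the number of sleep/wake transitions matters as much as the total sleep time.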

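Challenges for Run-Time Power-Gating: Energy Overhead
Break-Even Time (BET)
[Figure: power versus time across a sleep and wake-up sequence, relative to normal leakage, divided into regions 1 to 4.]
  Regions 1 + 3: energy overhead.
  Region 2: the part of the leakage saving that only offsets the overhead (1 + 3 = 2). The BET is the time at which this point is reached.
  Region 4: net energy saving.
The sleep period should be longer than the BET; otherwise total energy consumption increases.
The BET gives the smallest useful granularity for power gating.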
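The break-even condition can be written out under a simplified linear model, assuming the full leakage saving applies for the whole sleep period and the sleep/wake overhead is a fixed energy. The function names, units, and numbers below are hypothetical and are not measurements from Geyser. With energy in pJ, power in μW, and frequency in MHz, pJ / μW gives μs, and μs times MHz gives cycles.

```python
import math


def break_even_cycles(overhead_pj, leakage_saved_uw, clock_mhz):
    """Smallest sleep length (in cycles) whose leakage saving covers the overhead."""
    return math.ceil(overhead_pj * clock_mhz / leakage_saved_uw)


def net_saving_pj(sleep_cycles, overhead_pj, leakage_saved_uw, clock_mhz):
    """Net energy saved by one sleep period; negative means energy was wasted."""
    return leakage_saved_uw * sleep_cycles / clock_mhz - overhead_pj


if __name__ == "__main__":
    # Hypothetical unit: 10 pJ overhead, 100 uW leakage saved, 200 MHz clock.
    bet = break_even_cycles(10, 100, 200)
    print("BET in cycles:", bet)
    for cycles in (10, bet, 40):
        print(cycles, "cycles asleep ->", net_saving_pj(cycles, 10, 100, 200), "pJ")
```

Sleeping for fewer cycles than the BET yields a negative net saving, which is the case the slide warns against.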

Break Even Time of Each Functional Unit
[Figure: BET in cycles at 200 MHz (90 nm technology) for the ALU, Shift, Mult, Div, and CP0 units.]
The BET becomes shorter as the chip temperature rises, because leakage current depends heavily on temperature.
Novel PG strategies that take the BET into account are needed.

Power Gating Strategies
Requirement: power off EX-units for longer than the BET.
Static strategies
  Straightforward: EX-units always go to sleep after execution.
  Ideal compiler (ideal compiler-directed): the exact average idle time of the EX-units after each instruction is known (for reference only).
Dynamic strategies
  L1 miss: EX-units go to sleep only on an L1 cache miss. L1 miss penalty = 15 cycles.
  L2 miss: EX-units go to sleep only on an L2 cache miss. L2 miss penalty = 200 cycles.
Combined static and dynamic strategy
  Ideal compiler + L2 cache miss.
Ideal (God): the ideal dynamic strategy. The exact idle time of the EX-units is known at any time; this is the upper limit of PG (for reference only).
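A toy comparison of the straightforward, L1-miss, L2-miss, and ideal strategies is sketched below. It assumes a linear model in which one sleep period of `idle` cycles saves `idle - bet` cycles' worth of leakage, and it treats an L2 miss as also being an L1 miss. The trace, the names, and the scoring unit are hypothetical; the compiler-directed strategies are left out because they need per-instruction idle-time statistics.

```python
# Each predicate decides whether the EX-unit sleeps during one idle interval.
STRATEGIES = {
    "straightforward": lambda idle, miss, bet: True,
    "l1_miss": lambda idle, miss, bet: miss in ("L1", "L2"),
    "l2_miss": lambda idle, miss, bet: miss == "L2",
    "ideal": lambda idle, miss, bet: idle > bet,
}


def net_saving_cycles(trace, bet, strategy):
    """Net saving, in cycles' worth of leakage, of a strategy over a trace.

    trace is a list of (idle_cycles, miss) pairs, where miss is None, "L1",
    or "L2" and names the cache miss (if any) that caused the idle interval.
    """
    should_sleep = STRATEGIES[strategy]
    return sum(idle - bet for idle, miss in trace if should_sleep(idle, miss, bet))


if __name__ == "__main__":
    # Hypothetical idle intervals; 15 and 200 mirror the miss penalties above.
    trace = [(3, None), (15, "L1"), (200, "L2"), (5, None), (15, "L1")]
    for bet in (10, 50):
        for name in STRATEGIES:
            print("BET", bet, name, net_saving_cycles(trace, bet, name))
```

In this toy trace the L1-miss policy matches the ideal one when the BET is below the L1 penalty, and the L2-miss policy matches it when the BET is longer.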

Result for Frequently Used Execution Unit: FPADD for MGRID
[Figure: energy relative to non-PG versus BET (cycles) for the straightforward, ideal compiler, L1, L2, ideal compiler + L2, and ideal (God) strategies.]
Straightforward: the BET is longer than the sleep time, so energy is wasted.
Ideal compiler: fewer chances to sleep when the BET is longer.
L1: the resulting sleep time is about 15 cycles. Ideal for BET < 15, but wastes energy for longer BET.
L2: the resulting sleep time is 200 cycles. Ideal for longer BET; for shorter BET, the compiler is effective.

Collaboration with Compiler / OS: Suggested Power Gating Strategy
Co-optimization of the control granularity of the PG lever.
The compiler generates its directions assuming a short BET, because compiler-directed PG is effective for shorter BET.
  For shorter BET (high temperature): use the compiler directions and take the (compiler + L2-miss) strategy.
  For longer BET (low temperature): take the L2-miss strategy and ignore the compiler directions.
The OS is expected to switch between strategies by observing changes in the BET.
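The switching rule suggested above can be sketched as a small OS-side policy function. The slides do not give the BET value at which the OS should switch, so the threshold below is a hypothetical placeholder, as are the function and field names.

```python
# Hypothetical switching point; the slides do not specify a value.
SHORT_BET_THRESHOLD_CYCLES = 15


def select_pg_strategy(estimated_bet_cycles, threshold=SHORT_BET_THRESHOLD_CYCLES):
    """Choose the PG strategy for the currently estimated BET.

    Short BET (high temperature): honor compiler directions and also sleep
    on L2 misses. Long BET (low temperature): sleep on L2 misses only.
    """
    use_compiler = estimated_bet_cycles <= threshold
    return {"use_compiler_directions": use_compiler, "sleep_on_l2_miss": True}


if __name__ == "__main__":
    for bet in (8, 60):
        print("BET", bet, "->", select_pg_strategy(bet))
```

In practice the OS would re-evaluate this choice whenever its BET estimate changes, for example after a temperature shift.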

Leakage Monitor [Koyama et al., ITC-CSCC 08] [Usami et al., ISLPED2011 (poster 15)]
The BET depends on the dynamic environment, such as temperature, and on process variation.
On-chip leakage monitoring circuit:
  More leakage results in faster charging of VGND.
  Leakage is estimated by measuring the rise time of VGND to VREF.
  The OS can select the best PG strategy by observing this monitor.
[Figure: VGND voltage (V) versus sleep time (s). With more leakage, VGND rises to the reference VREF sooner than with less leakage.]
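The relation between rise time and leakage can be illustrated with a first-order estimate. The sketch assumes that the VGND node behaves as a fixed capacitance charged by a constant leakage current, so that I is roughly C * VREF / t_rise; this simplification, the function name, and all numeric values are assumptions for illustration, not the circuit or calibration used in the cited papers.

```python
def estimate_leakage_current(rise_time_s, vref_v, vgnd_capacitance_f):
    """Estimate leakage current (A) from the VGND rise time to VREF.

    Assumes constant-current charging of a fixed capacitance:
    I = C * VREF / t_rise.
    """
    if rise_time_s <= 0:
        raise ValueError("rise time must be positive")
    return vgnd_capacitance_f * vref_v / rise_time_s


if __name__ == "__main__":
    # Hypothetical values: 1 pF VGND capacitance and a 0.3 V reference.
    hot = estimate_leakage_current(10e-9, 0.3, 1e-12)
    cool = estimate_leakage_current(40e-9, 0.3, 1e-12)
    print("shorter rise time means more leakage:", hot > cool)
```

A larger estimate means the leakage saved per sleeping cycle is larger, which in turn implies a shorter BET for the same switching overhead.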

Co-Optimization of Throttle Lever Control in Fine-grained Runtime Power Gating
Who should be responsible for PG control depends on the granularity of control.
  OS: PG strategy. The best granularity of control changes dynamically (e.g. with temperature), on the order of every millisecond.
  Architecture: PG control through activity localization.
  Circuit: PG lever, controlled at a granularity of 10 to 100 cycles (the BET).

Prototype CPU: Geyser-1 [Ikebuchi et al., ASSCC 09]
MIPS R3000, Fujitsu e-shuttle 65 nm, Vdd = 1.2 V.
Successfully in operation: the first successful cycle-by-cycle power gating.
[Figure: chip photograph, 2.1 mm x 4.2 mm, showing the Shifter, MULT, DIV, ALU, and leakage monitor.]

Prototype CPU: Geyser-2
Geyser-2: second prototype, with caches and TLBs on chip.
Maximum working frequency: 210 MHz (wake-up latency is less than 5 ns).
Demonstration at ISLPED2011, booth 4.
[Figure: leakage power [mW] versus temperature [C].]

(Project 2) Cool Mega Array
Reconfigurable accelerator: aimed not at performance but at power efficiency.
The PE array consists only of combinational logic, so the power consumed by registers and clock distribution is reduced.
Low-voltage, low-power operation of the PE array, balanced with the data bandwidth of the memory.
Localization of operations: operation / register access, performance / power.
[Figure: architecture of CMA. A PE array built from combinational circuits sits in a DVS region and is connected to the data memories (DMEM).]

Prototype: CMA-1
Fujitsu 65 nm, 8x8 PE array, 12 KB data memory, control part at 1.2 V.
Maximum power efficiency: 223.2 MOPS/mW.
Demonstration at ISLPED2011, booth 4.
[Figure: power efficiency [MOPS/mW] versus PE array voltage [V].]

Summary and Future Direction
Geyser: run-time power gating processor; the first cycle-by-cycle power gating processor.
Cool Mega Array: power-efficiency accelerator.
Other projects
  Fine-grain power gating NoCs [Matsutani et al., NOCS 2010] [Matsutani et al., IEEE Trans. on CAD, 4/2011].
  Linux-based evaluation platform.
Demonstration at ISLPED2011, booth 4.
Towards integrated system LSIs: evaluation through real integration via 3D wireless NoCs.
[Figure: Geyser CPU, CMAs, L2 cache, and main memory integrated into one system.]

Selected Publications
1. N. Seki, et al., "A Fine Grain Dynamic Sleep Control Scheme in MIPS R3000," Proc. of ICCD-2008, pp. 612-617, 2008.
2. K. Usami, et al., "Design and Implementation of Fine-grain Power Gating with Ground Bounce Suppression," Proc. of VLSI Design 2009, pp. 381-386, 2009.
3. N. Takagi, et al., "Cooperative Shared Resource Access Control for Low Power Chip Multiprocessors," ISLPED-2009, pp. 177-182, 2009.
4. S. Saito, et al., "MuCCRA-Cube: A 3D Dynamically Reconfigurable Processor with Inductive Coupling Link," Proc. of FPL09, pp. 6-11, 2009.
5. D. Ikebuchi, et al., "Geyser-1: A MIPS R3000 CPU core with fine grain runtime power gating," Proc. of IEEE ASSCC-2009, pp. 281-284, 2009.
6. H. Matsutani, et al., "Ultra Fine-Grained Run-Time Power Gating of On-Chip Routers for CMPs," Proc. of NOCS'10, pp. 61-68, 2010.
7. H. Matsutani, et al., "Performance, Area, and Power Evaluations of Ultrafine-Grained Run-Time Power-Gating Routers for CMPs," IEEE Trans. on CAD (TCAD), Vol. 30, No. 4, pp. 520-533, Apr. 2011.
8. K. Usami, et al., "On-chip Detection Methodology for Break-Even Time of Power Gated Function Units," Proc. of ISLPED-2011 (to appear).