Low-power Architecture. By: Jonathan Herbst Scott Duntley

1 Low-power Architecture By: Jonathan Herbst Scott Duntley

2 Why low power?
Has become necessary with new-age demands:
  o Increasing design complexity
  o Demands of and for portable equipment
      Communication
      Media
      Mobile computers
  o Most embedded systems run on batteries
      Objective: extend battery life as long as possible without sacrificing too much performance
  o Lower running costs ($$)
Go green!

3 Low power architecture
Memory techniques
  o Associativity
  o Low-power refresh
  o Drowsy cache
Bus techniques
  o Bus inversion
ISA
Branch prediction
Parallel processing vs. superpipelining
Clock gating/scaling
Voltage scaling
Cortex A8

4 Memory - Associativity
Direct-mapped cache: least power, since no block searching is needed
Conventional set associative
  o On a block read, both the tag and data arrays are read
  o Data is written to the bus, but only used if the tags match
  o As associativity increases, power consumption increases
Alternative: phased set associative
  o Tag and data are broken into sub-arrays
  o Only the tag array is read and compared at first
  o On a cache hit, the data sub-array is read or written through a buffer, and then to the bus
  o Advantage: less power consumption by avoiding unnecessary data reads
  o Disadvantage: takes 2 clock cycles rather than one
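
The difference between the two read policies can be sketched with a toy model that only counts array reads. This is a minimal illustration, not the presenters' benchmark: the class name, the FIFO replacement policy, and the 1-cycle cost charged for a phased miss are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SetAssocCache:
    """Toy N-way set-associative cache that counts array reads, not real energy."""
    num_sets: int = 4
    ways: int = 4
    phased: bool = False
    tags: list = field(default_factory=list)
    tag_reads: int = 0
    data_reads: int = 0
    cycles: int = 0

    def __post_init__(self):
        # tags[set] holds the resident tags, oldest first (assumed FIFO replacement)
        self.tags = [[] for _ in range(self.num_sets)]

    def read(self, block_addr: int) -> bool:
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        resident = self.tags[index]
        hit = tag in resident

        self.tag_reads += self.ways  # every way's tag is read and compared
        if self.phased:
            # Phase 1 reads tags only; phase 2 reads the one matching data way.
            self.data_reads += 1 if hit else 0
            self.cycles += 2 if hit else 1  # miss handling time is ignored here
        else:
            # Conventional: all data ways are read in parallel with the tags.
            self.data_reads += self.ways
            self.cycles += 1

        if not hit:
            if len(resident) == self.ways:
                resident.pop(0)
            resident.append(tag)
        return hit


trace = [0, 4, 0, 8, 4, 0]  # hypothetical block addresses, all mapping to set 0
conventional = SetAssocCache(phased=False)
phased = SetAssocCache(phased=True)
for addr in trace:
    conventional.read(addr)
    phased.read(addr)
```

On this trace both caches perform the same tag reads, but the phased cache reads a data way only on the three hits, at the cost of extra cycles.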

5 Memory - Phased set associative Phased Set Associative Cache

6 Memory - Associativity - Benchmark
Table: Cache power analysis
Columns: Cache Type | Miss Rate | Average Power | Increase from Direct-Mapped (%)
Rows: Direct-Mapped | Set Associative | 4-way Phased Set Associative

7 Power Management
Static
  o Power domains
  o Voltage domains
Dynamic
  o Clock scaling/gating
  o Voltage scaling
  o Wait-For-Interrupt

8 Memory - Drowsy cache
Modern processors -> growing cache size
  o Caches contribute a sizable fraction of a chip's power consumption
  o As transistor sizes decrease, a large amount of power is lost to leakage
Idea: put cold cache lines into a state-preserving low-power state to reduce leakage current
  o Low-power state = 25% of full-power energy
Disadvantage: slight performance loss due to the "wake-up" time required to access a drowsy line
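
The trade-off can be sketched with a small per-cycle simulation. This is only an illustration under stated assumptions: the 25% leakage figure comes from the slide, while the 1-cycle wake-up penalty, the "idle for `window` cycles" policy, and the function name are hypothetical choices for the example.

```python
DROWSY_LEAKAGE = 0.25  # from the slide: a drowsy line uses 25% of full-power energy
WAKEUP_CYCLES = 1      # assumed wake-up penalty; the slides do not give a number


def simulate_drowsy(trace, num_lines, window):
    """Return (leakage relative to an always-awake cache, wake-up stall cycles).

    trace[i] is the cache line accessed in cycle i, or None for no access.
    A line goes drowsy once it has been idle for `window` cycles.
    """
    last_used = [0] * num_lines
    awake = [True] * num_lines
    leakage = 0.0
    stalls = 0
    for cycle, line in enumerate(trace):
        # Put cold lines into the drowsy state.
        for i in range(num_lines):
            if awake[i] and cycle - last_used[i] >= window:
                awake[i] = False
        if line is not None:
            if not awake[line]:
                awake[line] = True  # contents were preserved; only a wake-up delay
                stalls += WAKEUP_CYCLES
            last_used[line] = cycle
        leakage += sum(1.0 if a else DROWSY_LEAKAGE for a in awake)
    return leakage / (len(trace) * num_lines), stalls


relative_leakage, stall_cycles = simulate_drowsy([0, 0, None, None, 1, 0], num_lines=2, window=2)
```

For this short hypothetical trace the model gives 75% of the always-awake leakage in exchange for two wake-up stalls.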

9 Drowsy cache - Benchmark Drowsy cache benchmark

10 Buses - Bus inversion
Bus lines are normally of high capacitance
  o Large amount of power consumption due to switching: $P = \alpha \cdot f \cdot C \cdot V^2$
    where \alpha = switching factor, f = clock frequency, C = capacitance, V = voltage
Want to: minimize the switching factor
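
A quick worked example of the standard dynamic power relation, P = alpha * f * C * V^2, shows why the switching factor and the voltage are the levers of interest. The component values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def dynamic_power(alpha: float, freq_hz: float, cap_farads: float, volts: float) -> float:
    """Dynamic switching power in watts: P = alpha * f * C * V^2."""
    return alpha * freq_hz * cap_farads * volts ** 2


# Hypothetical bus: 50% switching factor, 1 GHz clock, 1 nF total capacitance, 1.0 V
baseline = dynamic_power(0.5, 1e9, 1e-9, 1.0)
half_switching = dynamic_power(0.25, 1e9, 1e-9, 1.0)  # halve the switching factor
half_voltage = dynamic_power(0.5, 1e9, 1e-9, 0.5)     # halve the supply voltage
```

Halving the switching factor halves the power, while halving the voltage cuts it to a quarter because of the squared term.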

11 Buses - Bus inversion
Idea: if more than N/2 of the bits on an N-bit bus need to switch
  o Invert the entire line, and then switch the necessary bits back
Bus inversion
  o Advantage: less power consumed
  o Disadvantage: more hardware needed
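
The idea can be sketched as a small encoder. This is a minimal illustration that assumes one extra "invert" line tells the receiver whether the word on the bus is inverted (the extra hardware mentioned above); the function names and the 8-bit width are choices made for the example.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")


def bus_invert_encode(words, width=8):
    """Encode words as (bus_value, invert_bit) pairs using bus inversion."""
    mask = (1 << width) - 1
    prev_bus = 0  # assumed initial bus state
    encoded = []
    for word in words:
        toggles = popcount((word ^ prev_bus) & mask)
        if toggles > width / 2:
            bus, invert = ~word & mask, 1  # sending the complement toggles fewer lines
        else:
            bus, invert = word & mask, 0
        encoded.append((bus, invert))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded, width=8):
    mask = (1 << width) - 1
    return [(bus ^ mask) if invert else bus for bus, invert in encoded]


def count_transitions(values, start=0):
    """Total number of bit flips when `values` are driven in sequence."""
    total, prev = 0, start
    for v in values:
        total += popcount(v ^ prev)
        prev = v
    return total


words = [0x00, 0xFF, 0x0F, 0xF0]
encoded = bus_invert_encode(words)
raw_flips = count_transitions(words)
coded_flips = (count_transitions([bus for bus, _ in encoded])
               + count_transitions([inv for _, inv in encoded]))
```

For this sequence the raw bus makes 20 bit transitions, while the encoded bus makes 7 including the invert line, and decoding recovers the original words.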

12 Buses - Bus inversion

13 Parallel Processing and Pipelining
Parallel computations
  o Multiple cores
  o Multiple-issue pipelines
  o Linear power increase
Pipelining
  o Faster clock
  o Exponential power increase
  o Longer branch misprediction penalties

14 Low power & ISA
Single Instruction, Multiple Data (SIMD)
  o Reduces the number of instruction fetches/decodes -> reduces power
RISC vs. CISC
  o ASP embedded - CISC
      More specific hardware helps reduce overhead from general hardware -> less power
  o General embedded - RISC
      Fewer specific operations needed
      Reduced complexity helps with power consumption
  o The line is blurring: less and less need for ASP processors, since GPPs are rapidly becoming more powerful and lower-power

15 Branch prediction techniques
Accurately predict branches without too much complexity
  o Static branch prediction
      Simple, done at compile time by ISA
      Examination of program behavior
      Choose backward branches taken, forward branches not taken
  o Dynamic branch prediction
      More complex, more hardware
      Occurs during run-time
      Higher power consumption but much more accurate
      Branch Target Buffer (BTB)
      Pattern History Table (PHT)
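
The static rule above (backward taken, forward not taken) needs no run-time state, which is where its power advantage comes from. A minimal sketch, with a hypothetical function name and made-up addresses:

```python
def predict_taken_static(branch_pc: int, target_pc: int) -> bool:
    """Backward branches (typically loop back-edges) are predicted taken;
    forward branches are predicted not taken."""
    return target_pc < branch_pc


loop_back_edge = predict_taken_static(branch_pc=0x100, target_pc=0x080)
forward_skip = predict_taken_static(branch_pc=0x100, target_pc=0x140)
```

The backward loop branch is predicted taken and the forward branch is predicted not taken.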

16 Cortex A8 Die

17 Cortex A8 Architecture

18 Architecture Overview
< 300 mW to 1 W power consumption
600 MHz at 1.08 V, 1 GHz at 0.9 V configuration (up to 1.5 GHz, but suffers a significant power increase)
13-stage, 2-issue superscalar pipeline
Static scheduling scoreboard
Integrated NEON multimedia pipeline
Static and dynamic power management

19 Static Scheduling Scoreboard
Static instruction scheduling
In-order issue, in-order retire
Dynamic voltage and clock scaling
Pending queue: takes better advantage of the 2-issue pipeline
Replay queue: holds issue information only
Avoids long cache-miss stalls

20 Instruction Set Architecture
RISC architecture
2-issue instructions
Multicycle instructions
SIMD instructions for NEON
Shift-included instructions
32-bit instructions compressed to 16-bit for a 30% code reduction

21 Branch Prediction
95% accuracy
10-bit Global History Register (GHR)
4096-entry (256x16) Global History Buffer (GHB) with 2-bit saturating counters
  o Column indexed by first 8 bits of GHR
  o Row indexed by last two bits of GHR XORed with low 4 bits of PC
512-entry Branch Target Buffer (BTB)
  o Indexed by address
  o Stores branch address and branch type
1 stall cycle on branch taken
13-cycle penalty on misprediction
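
The indexing scheme described above can be sketched as a small software model. This is a simplified reading of the slide rather than the actual Cortex A8 logic: treating the "first 8 bits" as the upper 8 bits of the GHR, using the PC's low 4 bits literally, and starting the counters at "weakly not taken" are all assumptions, and the class name is hypothetical.

```python
class GlobalHistoryPredictor:
    """256x16 table of 2-bit saturating counters indexed by global history and PC."""

    def __init__(self):
        self.ghr = 0  # 10-bit global history register, newest outcome in bit 0
        # 16 rows x 256 columns = 4096 counters, each starting at 1 (weakly not taken)
        self.table = [[1] * 256 for _ in range(16)]

    def _index(self, pc: int):
        col = (self.ghr >> 2) & 0xFF           # assumed: upper 8 bits of the GHR
        row = ((self.ghr & 0x3) ^ pc) & 0xF    # last 2 GHR bits XOR low 4 PC bits
        return row, col

    def predict(self, pc: int) -> bool:
        row, col = self._index(pc)
        return self.table[row][col] >= 2       # counter values 2 and 3 mean "taken"

    def update(self, pc: int, taken: bool) -> None:
        row, col = self._index(pc)
        counter = self.table[row][col]
        self.table[row][col] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & 0x3FF


predictor = GlobalHistoryPredictor()
mispredictions = []
for i in range(60):
    outcome = (i % 2 == 0)  # hypothetical branch that alternates taken / not taken
    mispredictions.append(predictor.predict(0x8) != outcome)
    predictor.update(0x8, outcome)
```

Because the two alternating history patterns select different counters, the model stops mispredicting this branch once the history register has filled.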

22 Memory
L1 cache
  o 32 or 64 KB
  o Separate instruction and data caches
  o 1-cycle latency
  o 4-way set associative
  o Hash Virtual Address Buffer
  o Data cache
      3-entry 64-bit integer store buffer
      8-entry 128-bit NEON store buffer
L2 cache
  o Up to 1 MB
  o 8-cycle latency
  o 8-way set associative
  o Nonblocking NEON loads

23 Static Power Management Power Domains

24 Static Power Management Voltage Domains

25 Dynamic Power Management Wait-For-Interrupt Architecture Clock gating Voltage scaling

26 Future of ARM?
ARM chips currently offered at $10-20 apiece
  o Intel Atom -> $35+
ARM currently controls about 90% of the mobile phone processor market -> low price/power
  o Intel still needs more R&D to be able to compete with ARM power specs
Why not for laptops/netbooks?
  o Regular Windows cannot run on it (Linux/Android)
      Windows Mobile/CE (Embedded Compact)
  o Excludes the main part of the consumer PC market
  o Mainstream release of Windows -> supports ARM
      ARM could easily move into the market
Increasing parallelism
Increased performance-to-power ratio

27 Future/Theoretical: DRAM Refresh
Two ideas, not necessarily implemented yet:
  o Intelligent refresh
      Idea: a cell that has been read or written recently does not need to be refreshed
      Most effective power reduction during periods of heavy use
      Drawback: large overhead needed to keep track of which cells have been accessed recently
  o OS-controlled refresh
      Idea: unused memory does not need to be refreshed, so disable refresh for it
      The OS knows which memory has been used
      Instead of only swapping out pages when memory is full, swap out unused memory -> no refresh
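
The intelligent refresh idea can be sketched as a tick-based counter of refresh operations. This is a toy model under stated assumptions: tracking is done per row rather than per cell, an access is treated as fully restoring the row, and the function name, retention time, and access pattern are hypothetical.

```python
def count_refreshes(accesses, num_rows, retention, horizon):
    """Count refresh operations when recently accessed rows skip their refresh.

    accesses maps a tick to the list of rows read or written in that tick.
    A row is refreshed only when `retention` ticks have passed since it was
    last restored by an access or by a previous refresh.
    """
    last_restored = [0] * num_rows  # this table is the tracking overhead
    refreshes = 0
    for tick in range(1, horizon + 1):
        for row in accesses.get(tick, ()):
            last_restored[row] = tick  # an access recharges the row
        for row in range(num_rows):
            if tick - last_restored[row] >= retention:
                refreshes += 1
                last_restored[row] = tick
    return refreshes


idle = count_refreshes({}, num_rows=2, retention=4, horizon=8)
busy = count_refreshes({3: [0], 6: [0]}, num_rows=2, retention=4, horizon=8)
```

With no accesses both rows are refreshed twice each, while the frequently accessed row 0 in the second run never needs an explicit refresh, so the count drops from 4 to 2.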

28 Conclusion
Basic idea - reduce power
  o Trade-off -> lower performance and/or more complexity
Recent architecture and design trends
  o Static power becoming as important as dynamic
      Dynamic
      Static
  o Reduce any of these, reduce overall power
