Low-power Architecture
By: Jonathan Herbst, Scott Duntley

Why low power?
  Has become necessary with new-age demands:
    o Increasing design complexity
    o Growing demand for, and demands placed on, portable equipment
      Communication
      Media
      Mobile computers
    o Most embedded systems run on batteries
      Objective: extend battery life as long as possible without sacrificing too much performance
    o Lower running costs ($$)
  Go green!

Low power architecture
  Memory techniques
    o Associativity
    o Low-power refresh
    o Drowsy cache
  Bus techniques
    o Bus inversion
  ISA
  Branch prediction
  Parallel processing vs. superpipelining
  Clock gating/scaling
  Voltage scaling
  Cortex A8

Memory - Associativity
  Direct-mapped cache
    o Least power -> no block searching
  Conventional set associative
    o On a block read -> both the tag and data arrays are read
    o Data is written to the bus -> only used if the tags match
    o As associativity increases, power consumption increases
  Alternative: phased set associative
    o Tag and data are broken into sub-arrays
    o Only the tag array is read and compared first
    o On a cache hit, the matching data sub-array is read/written to a buffer, and then to the bus
    o Advantage: less power consumption by avoiding unnecessary data reads
    o Disadvantage: takes 2 clock cycles rather than one
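The difference between the two access schemes can be sketched by counting array reads per lookup. This is only an illustrative model: the function names, the dictionary layout, and the 1-cycle cost assumed for a phased miss are assumptions, not figures from the slides.

```python
def conventional_read(set_tags, wanted_tag):
    """Conventional set-associative lookup: every tag AND every data way
    is read in parallel; the data is only used if a tag matches."""
    ways = len(set_tags)
    return {
        "hit": wanted_tag in set_tags,
        "tag_reads": ways,
        "data_reads": ways,   # all data ways are read, even on a miss
        "cycles": 1,
    }


def phased_read(set_tags, wanted_tag):
    """Phased lookup: compare tags first, then read only the matching
    data sub-array on a hit."""
    ways = len(set_tags)
    hit = wanted_tag in set_tags
    return {
        "hit": hit,
        "tag_reads": ways,
        "data_reads": 1 if hit else 0,   # unnecessary data reads are avoided
        "cycles": 2 if hit else 1,       # assumption: a miss ends after the tag phase
    }
```

For a 4-way set, a hit costs four data-array reads in the conventional scheme and one in the phased scheme, at the price of the second cycle.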

Memory - Phased set associative
  [Figure: Phased Set Associative Cache]

Memory - Associativity - Benchmark

  Cache Type                      Miss Rate   Average Power Increase from Direct-Mapped
  Direct-Mapped                   0.046       -
  4-way Set Associative           0.035       85.6%
  4-way Phased Set Associative    0.035       68.5%

  Table: Cache power analysis

Power Management
  Static
    o Power domains
    o Voltage domains
  Dynamic
    o Clock scaling/gating
    o Voltage scaling
    o Wait-For-Interrupt

Memory - Drowsy cache
  Modern processors -> growing cache size
    o Caches contribute a sizable fraction of a chip's power consumption
    o As transistor sizes decrease -> a large amount of power is lost to leakage
  Idea: put cold cache lines into a state-preserving low-power state to reduce leakage current
    o Low-power state = 25% of full-power energy
  Disadvantage: slight performance loss due to the "wake-up" time required to access a drowsy line
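A rough estimate of the leakage saving follows from the 25% figure above. The sketch below assumes that the 25% applies per cache line and that every awake line leaks the same amount; the function names and the normalized energy unit are hypothetical.

```python
DROWSY_FRACTION = 0.25  # from the slide: low-power state = 25% of full-power energy


def leakage_energy(total_lines, drowsy_lines, energy_per_awake_line=1.0):
    """Leakage energy of a cache with some lines in the drowsy state
    (normalized units; per-line uniform leakage is an assumption)."""
    awake_lines = total_lines - drowsy_lines
    return (awake_lines * energy_per_awake_line
            + drowsy_lines * energy_per_awake_line * DROWSY_FRACTION)


def leakage_savings(total_lines, drowsy_lines):
    """Fraction of leakage energy saved relative to an all-awake cache."""
    baseline = leakage_energy(total_lines, 0)
    return 1.0 - leakage_energy(total_lines, drowsy_lines) / baseline
```

With 768 of 1024 lines drowsy, the model gives 256 + 768 * 0.25 = 448 units instead of 1024, a saving of about 56%. The wake-up penalty is not modeled here.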

Drowsy cache - Benchmark
  [Figure: Drowsy cache benchmark]

Buses - Bus inversion
  Bus lines normally have high capacitance
    o Large amount of power consumed by switching

  $P = \alpha \, f \, C \, V^2$

  where $\alpha$ = switching factor, $f$ = clock frequency, $C$ = capacitance, $V$ = voltage
  Want to: minimize the switching factor
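As a worked example, the sketch below assumes the standard dynamic power relation P = alpha * f * C * V^2, which uses exactly the four symbols listed above. The numeric values are made up for illustration.

```python
def dynamic_power(alpha, f_hz, c_farads, v_volts):
    """Dynamic (switching) power in watts: P = alpha * f * C * V^2."""
    return alpha * f_hz * c_farads * v_volts ** 2


# Hypothetical bus line: 10 pF, 100 MHz, 1.2 V, switching on half the cycles.
p_before = dynamic_power(alpha=0.5, f_hz=100e6, c_farads=10e-12, v_volts=1.2)

# Halving the switching factor halves the power; f, C and V are untouched.
p_after = dynamic_power(alpha=0.25, f_hz=100e6, c_farads=10e-12, v_volts=1.2)
```

Here p_before is 0.72 mW and p_after is 0.36 mW, which is why reducing the switching factor is attractive when the frequency and voltage are fixed.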

Buses - Bus inversion
  Idea: if more than N/2 of the bits on an N-bit bus need to switch
    o Invert the entire bus, and then switch the necessary bits back
  Bus inversion
    o Advantage: less power consumed
    o Disadvantage: more hardware needed
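The idea can be sketched as an encoder that sends either the word or its complement, whichever toggles fewer bus lines. The extra "invert" signal that tells the receiver which form was sent is an assumption taken from the common bus-invert scheme, not something the slides spell out; all function names are hypothetical.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width=8):
    """Return (bus_value, invert_flag) pairs for a sequence of data words."""
    mask = (1 << width) - 1
    bus = 0                      # assumed initial state of the bus lines
    encoded = []
    for word in words:
        word &= mask
        if 2 * hamming(bus, word) > width:   # more than N/2 lines would switch
            bus, invert = (~word) & mask, 1  # send the complement instead
        else:
            bus, invert = word, 0
        encoded.append((bus, invert))
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [(~value) & mask if invert else value for value, invert in encoded]


def count_transitions(values, start=0):
    """Total number of line toggles when the values are driven in order."""
    total, previous = 0, start
    for value in values:
        total += hamming(previous, value)
        previous = value
    return total
```

For the sequence 0x00, 0xFF, 0x00, 0xFE on an 8-bit bus, sending the words directly toggles 23 lines, while the encoded stream toggles 1 data line plus 3 transitions on the invert signal. The comparator and the extra line are the hardware cost.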

Parallel Processing and Pipelining
  Parallel computations
    o Multiple cores
    o Multiple-issue pipelines
    o Linear power increase
  Pipelining
    o Faster clock
    o Faster-than-linear power increase (a higher clock usually also needs a higher voltage)
    o Longer branch misprediction penalty
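The gap between the two approaches can be illustrated with a very rough model. It assumes dynamic power proportional to f * V^2 per core and, as a simplification, that the supply voltage has to scale linearly with the clock frequency; real parts only approximate this, and the function name is hypothetical.

```python
def relative_power(cores, freq_scale, volt_scale):
    """Dynamic power relative to one baseline core (P ~ cores * f * V^2)."""
    return cores * freq_scale * volt_scale ** 2


# Doubling throughput with a second core at the same clock and voltage:
two_cores = relative_power(cores=2, freq_scale=1, volt_scale=1)

# Doubling throughput with one core at twice the clock, assuming the
# voltage must also double to reach that frequency:
fast_clock = relative_power(cores=1, freq_scale=2, volt_scale=2)
```

Under these assumptions the two-core option costs 2x the baseline power and the faster clock costs 8x, before counting the deeper pipeline's larger misprediction penalty.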

Low power & ISA
  Single Instruction, Multiple Data (SIMD)
    o Reduces the number of instruction fetches/decodes -> reduces power
  RISC vs. CISC
    o Application-specific (ASP) embedded - CISC
      More specific hardware helps reduce overhead from general hardware -> less power
    o General embedded - RISC
      Less specific operations needed
      Reduced complexity helps with power consumption
    o The line is blurring: less and less need for ASP processors, since general-purpose processors (GPPs) are rapidly becoming more powerful and lower power

Branch prediction techniques
  Accurately predict branches without too much complexity
    o Static branch prediction
      Simple, done at compile time by ISA
      Examination of program behavior
      Choose backward branches taken, forward branches not taken
    o Dynamic branch prediction
      More complex, more hardware
      Occurs at run time
      Higher power consumption but much more accurate
      Branch Target Buffer (BTB)
      Pattern History Table (PHT)
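Both approaches can be sketched in a few lines. The static rule is the "backward taken, forward not taken" heuristic from the slide. The dynamic predictor is a minimal pattern history table of 2-bit saturating counters; the table size, the PC-based index, and the initial counter value are illustrative assumptions.

```python
def static_predict(branch_pc, target_pc):
    """Static rule: backward branches (loops) taken, forward branches not."""
    return target_pc < branch_pc


class PatternHistoryTable:
    """Table of 2-bit saturating counters indexed by low PC bits."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries   # assumption: start at weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # 2 or 3 means "taken"

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_correct(predict, update, outcomes):
    """Run a predictor over a list of actual outcomes and count the hits."""
    correct = 0
    for taken in outcomes:
        correct += predict() == taken
        update(taken)
    return correct
```

On a loop branch that is taken nine times and then falls through, the static rule is right nine times out of ten with no hardware state, while this cold-started table is right eight times because it spends one iteration warming up. The dynamic scheme pays off on branches whose behavior the static rule cannot capture.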

[Figure: Cortex A8 Die]

[Figure: Cortex A8 Architecture]

Architecture Overview
  < 300 mW to 1 W power consumption
  Configurations: 600 MHz at 1.08 V, 1 GHz at 0.9 V (up to 1.5 GHz, but with a significant power increase)
  13-cycle, 2-issue superscalar pipeline
  Static scheduling scoreboard
  Integrated NEON multimedia pipeline
  Static and dynamic power management

Static Scheduling Scoreboard
  Static instruction scheduling
  In-order issue, in-order retire
  Dynamic voltage and clock scaling
  Pending queue: takes better advantage of the 2-issue pipeline
  Replay queue: holds issue information only
  Avoids long cache-miss stalls

Instruction Set Architecture
  RISC architecture
  2-issue instructions
  Multicycle instructions
  SIMD instructions for NEON
  Instructions with an included shift
  32-bit instructions compressed to 16-bit for a 30% code size reduction

Branch Prediction
  95% accuracy
  10-bit Global History Register (GHR)
  4096-entry (256x16) Global History Buffer (GHB) with 2-bit saturating counters
    o Column indexed by first 8 bits of GHR
    o Row indexed by last two bits of GHR XORed with low 4 bits of PC
  512-entry Branch Target Buffer (BTB)
    o Indexed by address
    o Stores branch address and branch type
  1 stall cycle on branch taken
  13-cycle penalty on misprediction
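The GHB indexing described above can be sketched as follows. This is a reading of the slide, not the documented hardware: it assumes "first 8 bits" means the upper 8 bits of the 10-bit GHR and "last two bits" means the lower 2, and the class, its initial counter value, and its update order are hypothetical.

```python
GHR_MASK = 0x3FF  # 10-bit global history register


def ghb_index(ghr, pc):
    """Map (GHR, PC) to a (row, column) position in the 16 x 256 GHB."""
    ghr &= GHR_MASK
    column = ghr >> 2                  # assumed: upper 8 GHR bits, 0..255
    row = (ghr & 0x3) ^ (pc & 0xF)     # low 2 GHR bits XOR low 4 PC bits, 0..15
    return row, column


class GlobalHistoryBuffer:
    """4096 two-bit saturating counters selected by history and PC."""

    def __init__(self):
        self.table = [[1] * 256 for _ in range(16)]  # assumed initial state
        self.ghr = 0

    def predict(self, pc):
        row, column = ghb_index(self.ghr, pc)
        return self.table[row][column] >= 2

    def update(self, pc, taken):
        row, column = ghb_index(self.ghr, pc)
        counter = self.table[row][column]
        self.table[row][column] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the newest outcome into the global history.
        self.ghr = ((self.ghr << 1) | int(taken)) & GHR_MASK
```

Because the index mixes history with the PC, the same branch uses different counters depending on the outcomes that preceded it, which is what lets a global-history predictor learn correlated patterns.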

Memory
  L1 cache
    o 32 or 64 KB
    o Separate instruction and data caches
    o 1-cycle latency
    o 4-way set associative
    o Hash Virtual Address Buffer
  Data cache
    o 3-entry 64-bit integer store buffer
    o 8-entry 128-bit NEON store buffer
  L2 cache
    o Up to 1 MB
    o 8-cycle latency
    o 8-way set associative
    o Nonblocking NEON loads

Static Power Management
  Power domains

Static Power Management
  Voltage domains

Dynamic Power Management
  Wait-For-Interrupt architecture
  Clock gating
  Voltage scaling

Future of ARM?
  ARM chips currently offered at $10-20 apiece
    o Intel Atom -> $35+
  ARM currently controls about 90% of the mobile phone processor market -> low price/power
    o Intel still needs more R&D to be able to compete with ARM power specs
  Why not for laptops/netbooks?
    o Regular Windows cannot run on it (only Linux/Android and Windows Mobile/CE (Embedded Compact))
    o Excludes the main part of the consumer PC market
    o Once a mainstream release of Windows supports ARM -> ARM could easily move into the market
  Increasing parallelism
  Increased performance-to-power ratio

Future/Theoretical: DRAM Refresh
  Two ideas, not necessarily implemented yet:
    o Intelligent refresh
      Idea: a cell that has been read or written recently does not need to be refreshed
      Most effective power reduction during periods of heavy use
      Drawback: large overhead needed to keep track of which cells have been accessed recently
    o OS-controlled refresh
      Idea: unused memory does not need to be refreshed, so disable its refresh
      The OS knows what memory is in use
      Instead of only swapping out pages when memory is full, swap out unused memory -> no refresh

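Conclusion
  Basic idea - reduce power
    o Trade-off -> lower performance and/or more complexity
  Recent architecture and design trends
    o Static power is becoming as important as dynamic power
  Dynamic: $P_{dynamic} = \alpha \, f \, C \, V^2$
  Static: $P_{static} = V \, I_{leak}$, where $I_{leak}$ = leakage current
    o Reduce any of these terms, reduce overall power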