Low-power Architecture
By: Jonathan Herbst, Scott Duntley


Why low power?
Has become necessary with new-age demands:
o Increasing design complexity
o Demand for portable equipment
  Communication media
  Mobile computers
o Most embedded systems run on batteries
  Objective: extend battery life as long as possible without sacrificing too much performance
o Lower running costs $$
Go green!

Low power architecture
Memory techniques
o Associativity
o Low-power refresh
o Drowsy cache
Bus techniques
o Bus inversion
ISA
Branch prediction
Parallel processing vs. superpipelining
Clock gating/scaling
Voltage scaling
Cortex A8

Memory - Associativity
Direct-mapped cache - Least power -> no block searching
Conventional set associative
o As a block read occurs -> both the tag and data arrays are read
o Data written to bus -> only used if the tags match
o As associativity increases, power consumption increases
Alternative: Phased set associative
o Tag and data arrays are broken into sub-arrays
o Only the tag array is read and compared first
o The data sub-array is read/written to a buffer upon a cache hit, and then to the bus
o Advantage: Less power consumption by avoiding unnecessary data reads
o Disadvantage: Takes 2 clock cycles rather than one
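A minimal sketch of the two lookup disciplines (not the cited hardware): each way is a hypothetical (tag, data) pair, and the functions return (data, cycles, array_reads) so the power-vs-latency trade-off above can be counted directly.

```python
def conventional_lookup(ways, tag):
    """Conventional set-associative read: every tag AND every data
    sub-array is read in parallel in a single cycle."""
    reads = 2 * len(ways)  # all tag arrays + all data arrays
    for t, data in ways:
        if t == tag:
            return data, 1, reads
    return None, 1, reads

def phased_lookup(ways, tag):
    """Phased read: cycle 1 reads and compares tags only; cycle 2
    reads just the one matching data sub-array (none on a miss)."""
    reads = len(ways)  # tag arrays only, so far
    for t, data in ways:
        if t == tag:
            return data, 2, reads + 1  # one extra cycle, one data read
    return None, 1, reads
```

For a 4-way hit, the conventional lookup performs 8 array reads in one cycle while the phased lookup performs 5 reads in two cycles; a phased miss reads no data arrays at all.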

Memory - Phased set associative
Phased set associative cache (diagram)

Memory - Associativity - Benchmark

Cache Type                      Miss Rate   Average Power Increase from Direct-Mapped
Direct-Mapped                   .046        -
4-way Set Associative           .035        85.6%
4-way Phased Set Associative    .035        68.5%

Cache power analysis

Power Management
Static
o Power domains
o Voltage domains
Dynamic
o Clock scaling/gating
o Voltage scaling
o Wait-For-Interrupt

Memory - Drowsy cache
Modern processors -> growing cache sizes
o Contributes a sizable fraction of a chip's power consumption
o As transistor sizes decrease -> large amount of power lost to leakage
Idea: Put the cold cache lines into a state-preserving low-power state to prevent leakage current
o Low-power state = 25% of full-power energy
Disadvantage: Slight performance loss due to the "wake-up" time required to access a drowsy line
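A toy model of the policy, assuming the simple variant that periodically puts every line to sleep; the sweep interval and one-cycle wake penalty are illustrative choices, not figures from the slides.

```python
class DrowsyCache:
    WAKE_PENALTY = 1  # extra cycle(s) to restore full voltage to a line

    def __init__(self, num_lines, drowsy_interval):
        self.drowsy = [False] * num_lines
        self.interval = drowsy_interval  # accesses between sleep sweeps
        self.tick = 0

    def access(self, line):
        """Return the extra latency this access pays for wake-up."""
        self.tick += 1
        penalty = self.WAKE_PENALTY if self.drowsy[line] else 0
        self.drowsy[line] = False  # accessing a line wakes it
        if self.tick % self.interval == 0:
            # Periodically drop all lines into the state-preserving
            # low-power mode; cold lines then stay drowsy and leak less.
            self.drowsy = [True] * len(self.drowsy)
        return penalty
```

Hot lines pay the wake penalty once per sweep, while lines that stay cold remain in the low-leakage state indefinitely, which is where the savings come from.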

Drowsy cache - Benchmark (chart)

Buses - Bus inversion
Bus lines normally have high capacitance
o Large amount of power consumed by switching
P = alpha * f * C * V^2, where:
o alpha = switching factor
o f = clock frequency
o C = capacitance
o V = voltage
Goal: Minimize the switching factor
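The dependence on each term can be checked numerically; the values below are made-up examples, not measurements from the slides.

```python
def switching_power(alpha, f_hz, c_farads, v_volts):
    """Dynamic (switching) power: P = alpha * f * C * V^2."""
    return alpha * f_hz * c_farads * v_volts ** 2

# Halving the switching factor (what bus inversion targets) halves P;
# halving the voltage (what voltage scaling targets) cuts P by 4x.
base = switching_power(0.5, 1e9, 1e-12, 1.0)  # ~0.5 mW for these inputs
```

Note the quadratic voltage term: this is why voltage scaling (covered later) is the most effective dynamic-power lever.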

Buses - Bus inversion
Idea: If the number of bits on an N-bit line that need to switch is > N/2:
o Invert the entire line, then switch the necessary bits back
Bus inversion
o Advantage: Less power consumed
o Disadvantage: More hardware needed
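A sketch of the encoder, under the assumption that one extra "invert" line tells the receiver whether to complement the word:

```python
def bus_invert(prev_bits, next_bits):
    """If more than half the N lines would toggle, send the complement
    of the word and assert the invert line instead."""
    n = len(next_bits)
    toggles = sum(p != q for p, q in zip(prev_bits, next_bits))
    if toggles > n // 2:
        return [1 - b for b in next_bits], 1  # <= N/2 lines now switch
    return next_bits, 0

def bus_decode(received_bits, invert):
    """Receiver undoes the inversion using the invert line."""
    return [1 - b for b in received_bits] if invert else received_bits
```

For example, going from 0x00 to 0xFE would toggle 7 of 8 lines; transmitting the inverted word toggles only 1 line (plus the invert line), which is the "more hardware" trade mentioned above.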


Parallel Processing and Pipelining
Parallel computation
o Multiple cores
o Multiple-issue pipelines
o Linear power increase
Pipelining
o Faster clock
o Exponential power increase
o Longer branch misprediction penalties

Low power & ISA
Single Instruction, Multiple Data (SIMD)
o Reduces the number of instruction fetches/decodes -> reduces power
RISC vs. CISC
o ASP embedded - CISC
  More specific hardware helps reduce overhead from general hardware -> less power
o General embedded - RISC
  Less specific operations needed
  Reduced complexity helps with power consumption
o The line is blurring - less and less need for ASP processors, since GPPs are rapidly becoming more powerful and lower power

Branch prediction techniques
Goal: Accurately predict branches without too much complexity
o Static branch prediction
  Simple; done at compile time via the ISA
  Examination of program behavior
  Choose backward branches taken, forward branches not taken
o Dynamic branch prediction
  More complex, more hardware
  Occurs at run time
  Higher power consumption but much more accurate
  Branch Target Buffer (BTB)
  Pattern History Table (PHT)
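A minimal PHT-style dynamic predictor using 2-bit saturating counters (a standard scheme; the table size and initial counter state here are arbitrary choices for illustration):

```python
class TwoBitPredictor:
    """One 2-bit counter (0-3) per PC slot; 2 and 3 predict taken.
    Two wrong guesses are needed to flip a strongly held prediction,
    giving the hysteresis that makes this scheme accurate on loops."""

    def __init__(self, entries=16):
        self.table = [1] * entries  # start weakly not-taken
        self.mask = entries - 1     # entries must be a power of two

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

The table, counters, and update logic are the extra hardware (and power) that the slide contrasts with static prediction.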

Cortex A8 Die

Cortex A8 Architecture

Architecture Overview
< 300 mW to 1 W power consumption
600 MHz at 1.08 V, 1 GHz at 0.9 V configurations (up to 1.5 GHz, but with a significant power increase)
13-stage, 2-issue superscalar pipeline
Static-scheduling scoreboard
Integrated NEON multimedia pipeline
Static and dynamic power management

Static Scheduling Scoreboard
Static instruction scheduling
In-order issue, in-order retire
Dynamic voltage and clock scaling
Pending queue: Takes better advantage of the 2-issue pipeline
Replay queue: Holds issue information only
o Avoids long cache-miss stalls

Instruction Set Architecture
RISC architecture
2-issue instructions
Multicycle instructions
SIMD instructions for NEON
Shift-included instructions
32-bit instructions compressed to 16-bit for a 30% code reduction

Branch Prediction
95% accuracy
10-bit Global History Register (GHR)
4096-entry (256x16) Global History Buffer (GHB) with 2-bit saturating counters
o Column indexed by the first 8 bits of the GHR
o Row indexed by the last two bits of the GHR XORed with the low 4 bits of the PC
512-entry Branch Target Buffer (BTB)
o Indexed by address
o Stores branch address and branch type
1 stall cycle on a taken branch
13-cycle penalty on a misprediction
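One way to read the indexing scheme above as code; the exact bit positions are an assumed interpretation of the slide, not taken from ARM documentation, so treat this helper as illustrative only.

```python
def ghb_index(ghr, pc):
    """Map a 10-bit global history register and the PC to one of the
    4096 (256 columns x 16 rows) GHB counters, per the scheme above.
    The bit layout chosen here is an assumption."""
    column = (ghr >> 2) & 0xFF         # first 8 bits of the 10-bit GHR
    row = (ghr & 0x3) ^ (pc & 0xF)     # last 2 GHR bits XOR low 4 PC bits
    return column * 16 + row
```

Mixing PC bits into the row index spreads branches with the same history across counters, reducing destructive aliasing in the shared table.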

Memory
L1 cache
o 32 or 64 KB
o Separate instruction and data caches
o 1-cycle latency
o 4-way set associative
o Hashed virtual address buffer
Data cache
o 3-entry 64-bit integer store buffer
o 8-entry 128-bit NEON store buffer
L2 cache
o Up to 1 MB
o 8-cycle latency
o 8-way set associative
o Nonblocking NEON loads

Static Power Management: Power Domains

Static Power Management: Voltage Domains

Dynamic Power Management
Wait-For-Interrupt architecture
Clock gating
Voltage scaling

Future of ARM?
ARM chips currently offered at $10-20 apiece
o Intel Atom -> $35+
ARM currently controls about 90% of the mobile phone processor market -> low price/power
o Intel still needs more R&D to compete with ARM's power specs
Why not for laptops/netbooks?
o Regular Windows cannot run on it; only Linux/Android and Windows Mobile/CE (Embedded Compact) can
o Excludes the main part of the consumer PC market
o If a mainstream Windows release supported ARM -> ARM could easily move into the market
Increasing parallelism
Increased performance-to-power ratio

Future/Theoretical: DRAM Refresh
Two ideas, not necessarily implemented yet:
o Intelligent refresh
  Idea: A cell that has been written or read recently does not need to be refreshed
  Most effective power reduction during periods of heavy use
  Drawback: Large amount of overhead needed to track which cells have been accessed recently
o OS-controlled refresh
  Idea: It is not necessary to refresh unused memory, so disable it
  The OS knows what memory is in use
  Instead of only swapping out pages when memory is full, swap out unused memory -> no refresh needed
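The intelligent-refresh idea can be sketched with per-row access timestamps; tracking at row granularity with a tick-based retention window is a simplifying assumption (real DRAM retention is specified in milliseconds).

```python
class IntelligentRefresh:
    """Skip refresh for rows whose cells were recharged recently:
    a read or write restores the charge, so such rows can safely
    sit out the next refresh cycle."""

    def __init__(self, rows, retention_ticks):
        self.retention = retention_ticks
        self.last_access = [-retention_ticks] * rows  # "never accessed"
        self.now = 0

    def touch(self, row):
        """Record a read or write to this row."""
        self.now += 1
        self.last_access[row] = self.now

    def rows_to_refresh(self):
        """Only rows idle longer than the retention window need refresh."""
        return [r for r, t in enumerate(self.last_access)
                if self.now - t >= self.retention]
```

During heavy use most rows are touched inside the window, which is exactly when the slide says the savings are largest; the timestamp bookkeeping itself is the stated overhead.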

Conclusion
Basic idea - Reduce power
o Trade-off -> lower performance and/or more complexity
Recent architecture and design trends
o Static power is becoming as important as dynamic power
o Total power = dynamic + static -> reduce either, and overall power drops