L1 Data Cache Decomposition for Energy Efficiency


L1 Data Cache Decomposition for Energy Efficiency
Michael Huang, Jose Renau, Seung-Moon Yoo, Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu/flexram

Objective
- Reduce L1 data cache energy consumption
- No performance degradation
- Partition the cache in multiple ways
- Specialization for stack accesses
International Symposium on Low Power Electronics and Design, August 2001

Outline
- L1 D-Cache decomposition
- Specialized Stack Cache
- Pseudo Set-Associative Cache
- Simulation Environment
- Evaluation
- Conclusions

L1 D-Cache Decomposition
- A Specialized Stack Cache (SSC)
- A Pseudo Set-Associative Cache (PSAC)

Selection
- Selection is done in the decode stage to speed up the access
- Based on instruction address and opcode
- 2 Kbit table to predict the PSAC way
[Diagram: address and opcode steer each access to the PSAC or the SSC]
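The decode-stage selection above can be sketched as follows. This is a minimal model of the idea, not the authors' implementation: assuming the 2 Kbit table is organized as 1024 entries of 2 bits (one predicted way out of four) indexed by a hash of the load's instruction address, and that stack accesses bypass the predictor and go straight to the SSC.

```python
TABLE_ENTRIES = 1024          # 1024 entries x 2 bits = 2 Kbit (assumed layout)

class WayPredictor:
    """Per-instruction PSAC way predictor consulted in the decode stage."""

    def __init__(self):
        self.table = [0] * TABLE_ENTRIES   # predicted way (0..3) per entry

    def index(self, pc):
        # Word-aligned PC hash; the real indexing function is not given
        # on the slide, so this is an illustrative choice.
        return (pc >> 2) % TABLE_ENTRIES

    def predict(self, pc):
        return self.table[self.index(pc)]

    def update(self, pc, actual_way):
        self.table[self.index(pc)] = actual_way

def route(pc, is_stack_access, predictor):
    """Steer the access in decode: stack references go to the SSC,
    everything else probes the predicted PSAC way first."""
    if is_stack_access:
        return ("SSC", None)
    return ("PSAC", predictor.predict(pc))
```

A usage sketch: after a load at PC 0x400100 is seen to hit in way 3, the predictor is trained and subsequent executions of that load probe way 3 first.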

Stack Cache
- Small, direct-mapped cache
- Virtually tagged
- Software optimizations:
  - Very important to reduce stack cache size
  - Avoid thrashing: allocate large structs in the heap
- Easy to implement

SSC: Specialized Stack Cache
- Pointers to reduce traffic:
  - TOS: reduces the number of write-backs
  - SRB (safe-region-bottom): reduces unnecessary line fills on write misses
- Region between TOS and SRB is safe (missing lines are not initialized)
- Infrequent access
[Diagram: TOS and SRB pointers delimiting the safe region as the stack grows]
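One reading of the TOS/SRB filtering, sketched as code. This is our hedged interpretation of the slide, assuming a downward-growing stack where the region between SRB and TOS holds data from popped frames: dirty evictions there need no write-back, and write misses there need no line fill, since the contents are dead or uninitialized.

```python
def needs_writeback(line_addr, tos, srb, dirty):
    """A dirty line is written back only if it may hold live data;
    lines in the safe region [srb, tos) are assumed dead (popped frames)."""
    return dirty and not (srb <= line_addr < tos)

def needs_linefill(line_addr, tos, srb):
    """On a write miss, a line in the safe region is known to be
    uninitialized, so the fill from the next level can be skipped."""
    return not (srb <= line_addr < tos)
```

For example, with TOS = 0x8000 and SRB = 0x7000, a dirty eviction at 0x7F00 is silently dropped, while one at 0x9000 (live stack data) still goes to memory.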

Pseudo Set-Associative Cache
- Partition the cache in 4 ways
- Evaluated activation policies: Sequential, FallBackReg, Phased, FallBackPha, PredictPha
[Diagram: tag and data arrays of the four ways]

Sequential (Calder 96)
[Diagram: ways probed one per cycle over cycles 1-3]

Fallback-regular (Inoue 99)
[Diagram: probe sequence over cycles 1-2]

Phased Cache (Hasegawa 95)
[Diagram: tag comparison in cycle 1, data access in cycle 2]

Fallback-phased (ours)
- Emphasis on energy reduction
[Diagram: probe sequence over cycles 1-3]

Predictive Phased (ours)
- Emphasis on performance
[Diagram: probe sequence over cycles 1-2]
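The five activation policies above can be compared with a back-of-the-envelope cost model. The per-access costs below are illustrative simplifications of the slides' timing diagrams (not the paper's CACTI numbers): each policy trades data-array activations (energy) against probe cycles (delay) on a hit in a 4-way PSAC.

```python
WAYS = 4

def access_cost(policy, hit_way, predicted_way=0):
    """Return (data_ways_read, probe_cycles) for a load that hits in
    `hit_way` under each activation policy (illustrative model)."""
    if policy == "Sequential":      # probe one way per cycle until the hit
        return hit_way + 1, hit_way + 1
    if policy == "FallBackReg":     # way 0 first; on miss, all ways in parallel
        return (1, 1) if hit_way == 0 else (1 + WAYS, 2)
    if policy == "Phased":          # all tags in cycle 1, one data way in cycle 2
        return 1, 2
    if policy == "FallBackPha":     # way 0 first; on miss, fall back to a phased probe
        return (1, 1) if hit_way == 0 else (2, 3)
    if policy == "PredictPha":      # predicted way first; on mispredict, phased probe
        return (1, 1) if hit_way == predicted_way else (2, 2)
    raise ValueError(policy)
```

The model makes the slide's trade-off visible: Phased never reads more than one data way (low energy, fixed 2-cycle hits), while PredictPha matches Phased's worst case but collapses correctly predicted hits to a single cycle and a single data-way read.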

Simulation Environment
- Baseline configuration:
  - Processor: 1 GHz, R10000-like
  - L1: 32 KB, 2-way
  - L2: 512 KB, 8-way, phased cache
  - Memory: 1 Rambus channel
- Energy model: extended CACTI
- Energy is for the data memory hierarchy only

Applications
- Multimedia:
  - Mp3dec: MP3 decoder
  - Mp3enc: MP3 encoder
- SPECint:
  - Gzip: data compression
  - Crafty: chess game
  - MCF: traffic model
- Scientific:
  - Bsom: data mining
  - Blast: protein matching
  - Treeadd: Olden tree search

Adding a Stack Cache
Normalized to baseline:

Config      Delay  Energy  E*D
PLAIN 256B  1.01   0.83    0.84
SSC 256B    1.00   0.80    0.81
PLAIN 512B  0.99   0.78    0.77
SSC 512B    0.99   0.77    0.76
PLAIN 1KB   0.99   0.77    0.76
SSC 1KB     0.98   0.76    0.75

For the same size, the Specialized Stack Cache is always better.
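The metric behind the E*D bars is the energy-delay product: per application, normalized energy times normalized delay, then averaged across the suite (which is why the averaged E*D bar is close to, but not exactly, the product of the averaged energy and delay bars). A minimal illustration with made-up per-application numbers:

```python
# Energy-delay product per application, then averaged across the
# suite (illustrative numbers, not the paper's measurements).
apps = {"mp3dec": (0.99, 0.80),   # (normalized delay, normalized energy)
        "gzip":   (1.00, 0.76)}
ed = {name: d * e for name, (d, e) in apps.items()}
avg_ed = sum(ed.values()) / len(ed)
```

E*D is a common way to reward designs that save energy without giving the savings back as extra execution time.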

Pseudo Set-Associative Cache
Normalized to baseline:

Config             Delay  Energy  E*D
4-way Sequential   1.05   0.68    0.72
4-way FallBackReg  0.99   0.69    0.69
4-way Phased       1.05   0.74    0.78
4-way FallBackPha  1.01   0.67    0.68
4-way PredictPha   0.98   0.68    0.67

PredictPha has the best delay and energy-delay product.

PSAC: 2-way vs. 4-way
Normalized to baseline:

Config            Delay  Energy  E*D
2-way Sequential  0.99   0.78    0.77
2-way PredictPha  0.97   0.79    0.76
4-way PredictPha  0.98   0.68    0.67

For E*D, the 4-way PSAC is better than the 2-way.

Pseudo Set-Associative + Specialized Stack Cache
Normalized to baseline:

Config                      Delay  Energy  E*D
4-way PredictPha            0.98   0.68    0.67
4-way PredictPha + SSC256B  0.98   0.61    0.60
4-way PredictPha + SSC512B  0.97   0.58    0.56
4-way PredictPha + SSC1KB   0.96   0.57    0.55

Combining PSAC and SSC reduces E*D by 44% on average.
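The headline 44% figure follows directly from the slide's normalized E*D of 0.56 for the PredictPha + SSC512B configuration:

```python
# Normalized E*D for 4-way PredictPha + SSC512B (from the slide).
ed_combined = 0.56
reduction = 1.0 - ed_combined          # fraction of baseline E*D saved
print(f"{reduction:.0%}")              # prints "44%"
```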

Area Constrained: Small PSAC + SSC
Normalized to baseline:

Config                           Delay  Energy  E*D
24KB 3-way PredictPha            0.98   0.74    0.72
24KB 3-way PredictPha + SSC512B  0.98   0.61    0.60
32KB 4-way PredictPha + SSC512B  0.97   0.58    0.56

An SSC plus a small PSAC delivers a cost-effective E*D design.

Energy Breakdown
[Figure: energy broken down into SSC, L1, L2, and memory components for BLAST, MCF, and MP3D under the Baseline, 4-way PSAC, SSC512B, and combined configurations, normalized to baseline]

Conclusions
- Stack cache: important for energy efficiency
  - SW optimization required for stack caches
  - Effective Specialized Stack Cache extensions
- Pseudo Set-Associative Cache:
  - 4-way more effective than 2-way
  - Predictive Phased PSAC has the lowest E*D
- Effective to combine PSAC and SSC
  - E*D reduced by 44% on average

Backup Slides

Cache Energy
[Figure: per-access energy (pJ, up to 2000) vs. cache size from 4 KB to 64 KB for 1-way, 2-way, and 4-way caches]

Extended CACTI
- New sense amplifier
- 15% bit-line swing for reads
- Full bit-line swing for writes
- Different energy for reads, writes, line fills, and write-backs
- Multiple optimization parameters

SSC Energy Overhead
- Small energy consumption required to use TOS and SRB
- Registers updated at function call and return
- Registers checked on cache miss

Miss Rate
[Figure: miss rate (0-12%) vs. cache size from 4 KB to 64 KB for BLAST, BSOM, CRAFTY, GZIP, MCF, MP3D, MP3E, and TREE]

Overview