The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!!

Size: px
Start display at page:

Download "The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!!"

Transcription

1 The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!! 1 2 3

2 Modern CMPs" Intel e (2013)! SLLC" AMD Orochi (2012)! SLLC" Fujitsu SPARC VIIIfx (2011)! Intel Itanium 9500 (2012)! SLLC" 2! SLLC"

3 *100 multiprogrammed SPEC CPU 2006 workloads in an 8-core CMP! Inclusive Shared Last-Level Cache (SLLC)! 8MB Conventional! 1/8! size! Reuse Cache! Data 1MB! Tags! Tags! Data! 1/2! Same average performance* " with 84% area savings"

4 Outline"! Motivation!! The reuse cache! Idea! Organization! Coherence! Replacement!! Related work!! Evaluation!! Conclusions! 4!

5 Outline"! Motivation"! The reuse cache! Idea! Organization! Coherence! Replacement!! Related work!! Evaluation!! Conclusions! 5!

6 Motivation" SLLC data" Many hits! Zero hits! Reuse locality" " Used more than one time likely to be used many times! Recently reused lines more useful than lines reused before! Opportunity for a smaller SLLC which stores only reused data! 6!

7 Motivation" Live cache fraction evolution! LRU DRRIP NRR time (x100k cycles) 7!! The fraction of live SLLC lines is very small (10-30% LRU)! On average 78% of lines won t receive any additional access!! State-of-art replacement policies better!! 8MB Shared LLC, 16-way, LRU, 8-core CMP, mix #14!

8 Motivation" 100% 5% of all the loaded lines concentrate all the hits" 50% Total hits! 0.5% of all the loaded lines concentrate close to 50% of hits" 0.5% 5% Inserted lines! Most of the inserted lines will not experience any hit because hits concentrate in a few lines! 8! 8MB Shared LLC, 16-way, LRU, 8-core CMP, mix #14!

9 Outline"! Motivation!! The reuse cache" Idea! Organization! Coherence protocol! Replacement!! Related work!! Evaluation!! Conclusions! 9!

10 Idea" Data is stored only when reuse is detected! From main memory! tag! insertion! Tags! Data! From main memory! data! insertion! tag hit! Tags! Data! To private! caches! To private! caches! 10! First access: " Miss " only tag insertion! Second access:" Tag hit " data insertion! (reuse detected)!

11 Organization"! Decoupled tag and data arrays! More tag than data entries! Conventional cache! Reuse cache! Tag! Data! Tag! Data! 11!

12 Organization"! Tag array! NORMAL! Maintains coherence/inclusion! Set-associative! NEW! Each entry has associated data or not! Forward pointers: to indicate corresponding data array entry! Tag array! 16-way!..." tag! state! pointer! Data! array! 12!

13 Organization"! Tag array! Maintains coherence/inclusion! Set-associative! Each entry has associated data or not! Forward pointers: to indicate corresponding data array entry!! Data array! Only stores reused lines! Queue or set-associative! Reverse pointers: update tag array entry! Tag array! 16-way!..." tag! state! pointer! Data! array! 13! pointer! data! repl.!

14 Coherence"! Conventional coherence protocols can be used with small changes! Two types of states have to be considered! # Tag-only states! # Tag + data states! Transitions between both types of states! are triggered by data insertions or evictions! Data array! insertion! (reuse)! Tag-only! states! Tag + data! states! Data! eviction! 14!

15 Replacement"! Tag and data arrays have independent replacement policies! Tag array! # Not Recently-Reused (NRR) 1 [Albericio et al. TACO 13]!! reused and private cache contents are unlikely to be evicted! Data array! # Clock, for queue organization [Corbató MIT 68]! # Not Recently-Used (NRU), for set associativity! 15!

16 NRU vs NRR"! NRU! Exploiting temporal locality (approx. to LRU)! 1 bit per line! Used on intel i7 and SPARC T2! Used% Not%used%! NRR 1! Exploiting reuse locality! 1 bit per line! Private cache contents are protected! Insertion Reused% Not%reused% 16! Insertion

17 Outline"! Motivation!! The reuse cache! Idea! Organization! Coherence protocol! Replacement!! Related work"! Evaluation!! Conclusions! 17!

18 Related work"! Many proposals decouple tag/data arrays! Decoupled sectored caches, Seznec, ISCA 1994! The pool of subsectors, Rothman and A. Smith, ICS 1999! NuRAPID, Chishti et al., MICRO 2003! V-way, Qureshi et al., ISCA 2005!! Non-inclusive cache, inclusive directory (NCID), Zhao et al., Computer Frontiers 2010! Additional tags to maintain inclusion! Selective allocation policy based on set dueling! Tag+data" Tag" Tag" Data" Set" 18!

19 Outline"! Motivation!! The reuse cache! Idea! Organization! Coherence protocol! Replacement!! Related work!! Evaluation"! Conclusions! 19!

20 Methodology" Baseline system configuration! 8 in-order core CMP system! Bank 0! L2! L1! 1 DDR3 channel!! crossbar!! Bank 3! L2! L1! core 0! core 7! 8MB SLLC! LRU, 16-way! (inclusive)! 256KB L2! 16KB I$! 16KB D$!! Simulation! Simics full-system simulator! Ruby from GEMS Multifacet!! Area and latency! CACTI 6.5!! Workloads! Multi-programmed: SPEC CPU 2006 (100 random mixes from all the 29 applications)! Parallel: from PARSEC and SPLASH2!! 20!

21 Terminology"! RC-x/y = Reuse cache with tags equivalent to x MB,! y MB of data! Tags! RC-x/y! Data! Tags! Data! Conventional cache! of x MB! Tags! Data! Conventional cache! of y MB! 21!

22 Size exploration Performance over conventional 8MB! 22! 1.15! 1.1! 1.05! 1! 0.95! 0.9! Chosen configuration for! each data array size! RC-x/8 RC-x/4 RC-x/2 RC-x/1 RC-x/0.5

23 Size exploration Performance over conventional 8MB! 23! 1.15! 1.1! 1.05! 1! 0.95! 0.9! RC-16/8 Hardware cost RC-8/4 This configuration keeps average! performance while it reduces size! RC-8/2 Same avg. performance! with 84% area savings! Conv. 8MB LRU! RC-4/1 RC-4/ % 80% 60% 40% 20% 0% Hardware cost respect to conv. 8MB"

24 Better baseline Performance over conventional 8MB! 1.15! 1.1! 1.05! 1! 0.95! 0.9! RC-16/8 +1.1% perf.! -40% HW cost! RC-8/4-0.7% perf.! -60% HW cost! RC-8/2 RC-4/1 RC-4/0.5 Conv. DRRIP 8MB! Conv. LRU 8MB! 24!

25 Multiprogrammed 9 workloads over 5% of speedup Better in more than half of the mixes Performance over conventional 8MB! 6 workloads over 5% of slowdown 25!

26 NCID comparison 1.1! 7.0% Performance over conventional 8MB! 1.05! 6.4% 5.2% Reuse%cache% NCID% 1! 5.3% 0.95! 0.9! RC-8/4! RC-8/2! RC-4/1! RC-4/0.5! NCID-8/4 NCID-8/2 NCID-4/1 NCID-4/0.5 26!

27 Behavior insight" Percentage of data stored into the cache with respect to the number of tags! Conventional cache: 100%! RC-4/1: 5%! Our selection of contents policy is selecting lines that will receive hits 100% 5% of the loaded lines get all the hits" 50% Percentage of total hits! Percentage of inserted lines! 27! 0.5% 5%

28 Additional results! Performance analysis for more sizes!! Set-associative data array!! Per-application study!! Parallel workloads! 28!

29 Outline"! Motivation!! The reuse cache! Idea! Organization! Coherence protocol! Replacement!! Related work!! Evaluation!! Conclusions" 29!

30 Conclusions"! A big fraction of the SLLC contents is dead! A selective allocation policy is needed!! A SLLC organization is proposed: the reuse cache! Very selective line allocation policy, based on reuse! Same avg. performance with 84% storage savings! Alternative applications:! # Further reduction of energy! # More computing elements with same area! 30!

31 The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!! 1 2 3

Evaluating the Reuse Cache for mobile processors. Lokesh Jindal, Urmish Thakker, Swapnil Haria CS-752 Fall 2014 University of Wisconsin-Madison

Evaluating the Reuse Cache for mobile processors. Lokesh Jindal, Urmish Thakker, Swapnil Haria CS-752 Fall 2014 University of Wisconsin-Madison Evaluating the Reuse Cache for mobile processors Lokesh Jindal, Urmish Thakker, Swapnil Haria CS-752 Fall 2014 University of Wisconsin-Madison 1 Executive Summary Problem : Mobile SOCs Area is money! Cache

More information

Optimizing Replication, Communication, and Capacity Allocation in CMPs

Optimizing Replication, Communication, and Capacity Allocation in CMPs Optimizing Replication, Communication, and Capacity Allocation in CMPs Zeshan Chishti, Michael D Powell, and T. N. Vijaykumar School of ECE Purdue University Motivation CMP becoming increasingly important

More information

Lecture 11: Large Cache Design

Lecture 11: Large Cache Design Lecture 11: Large Cache Design Topics: large cache basics and An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al., ASPLOS 02 Distance Associativity for High-Performance

More information

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti and David A. Wood University of Wisconsin-Madison 1 Please find the power point presentation

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1)

Lecture: Cache Hierarchies. Topics: cache innovations (Sections B.1-B.3, 2.1) Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1) 1 Types of Cache Misses Compulsory misses: happens the first time a memory word is accessed the misses for an infinite cache

More information

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 More Cache Basics caches are split as instruction and data; L2 and L3 are unified The /L2 hierarchy can be inclusive,

More information

SEESAW: Set Enhanced Superpage Aware caching

SEESAW: Set Enhanced Superpage Aware caching SEESAW: Set Enhanced Superpage Aware caching http://synergy.ece.gatech.edu/ Set Associativity Mayank Parasar, Abhishek Bhattacharjee Ω, Tushar Krishna School of Electrical and Computer Engineering Georgia

More information

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT

SWAP: EFFECTIVE FINE-GRAIN MANAGEMENT : EFFECTIVE FINE-GRAIN MANAGEMENT OF SHARED LAST-LEVEL CACHES WITH MINIMUM HARDWARE SUPPORT Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez Computer Systems Lab Cornell University Page 1

More information

An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors

An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors ACM IEEE 37 th International Symposium on Computer Architecture Elastic Cooperative Caching: An Autonomous Dynamically Adaptive Memory Hierarchy for Chip Multiprocessors Enric Herrero¹, José González²,

More information

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Executive Summary Different memory technologies have different

More information

Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore

Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore By Dan Stafford Background Heterogeneous Architectures Performance Modeling Single Core Performance Profiling Multicore Performance Estimation Test Cases Multicore Design Space Results & Observations General

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016

Caches and Memory Hierarchy: Review. UCSB CS240A, Winter 2016 Caches and Memory Hierarchy: Review UCSB CS240A, Winter 2016 1 Motivation Most applications in a single processor runs at only 10-20% of the processor peak Most of the single processor performance loss

More information

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES NATHAN BECKMANN AND DANIEL SANCHEZ MIT CSAIL PACT 13 - EDINBURGH, SCOTLAND SEP 11, 2013 Summary NUCA is giving us more capacity, but further away 40 Applications

More information

FILTERING DIRECTORY LOOKUPS IN CMPS

FILTERING DIRECTORY LOOKUPS IN CMPS Departamento de Informática e Ingeniería de Sistemas Memoria de Tesis Doctoral FILTERING DIRECTORY LOOKUPS IN CMPS Autor: Ana Bosque Arbiol Directores: Pablo Ibáñez José M. Llabería Víctor Viñals Zaragoza

More information

PageVault: Securing Off-Chip Memory Using Page-Based Authen?ca?on. Blaise-Pascal Tine Sudhakar Yalamanchili

PageVault: Securing Off-Chip Memory Using Page-Based Authen?ca?on. Blaise-Pascal Tine Sudhakar Yalamanchili PageVault: Securing Off-Chip Memory Using Page-Based Authen?ca?on Blaise-Pascal Tine Sudhakar Yalamanchili Outline Background: Memory Security Motivation Proposed Solution Implementation Evaluation Conclusion

More information

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed Survey results CS 6354: Memory Hierarchy I 29 August 2016 1 2 Processor/Memory Gap Variety in memory technologies SRAM approx. 4 6 transitors/bit optimized for speed DRAM approx. 1 transitor + capacitor/bit

More information

A Row Buffer Locality-Aware Caching Policy for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

A Row Buffer Locality-Aware Caching Policy for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu A Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Overview Emerging memories such as PCM offer higher density than

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

Efficiency by exploiting reuse locality and adjusting prefetch

Efficiency by exploiting reuse locality and adjusting prefetch 2013 74 Jorge Albericio Latorre Improving the SLLC Efficiency by exploiting reuse locality and adjusting prefetch Departamento Informática e Ingeniería de Sistemas Director/es Ibáñez Marín, Pablo Enrique

More information

CS 6354: Memory Hierarchy I. 29 August 2016

CS 6354: Memory Hierarchy I. 29 August 2016 1 CS 6354: Memory Hierarchy I 29 August 2016 Survey results 2 Processor/Memory Gap Figure 2.2 Starting with 1980 performance as a baseline, the gap in performance, measured as the difference in the time

More information

FreshCache: Statically and Dynamically Exploiting Dataless Ways

FreshCache: Statically and Dynamically Exploiting Dataless Ways FreshCache: Statically and Dynamically Exploiting Dataless Ways Arkaprava Basu 1 Derek R. Hower 2 * Mark D. Hill 1 Michael M. Swift 1 1 Department of Computer Sciences University of Wisconsin-Madison {basu,

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required

More information

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand

Spring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications

More information

WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems

WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems : A Writeback-Aware LLC Management for PCM-based Main Memory Systems Bahareh Pourshirazi *, Majed Valad Beigi, Zhichun Zhu *, and Gokhan Memik * University of Illinois at Chicago Northwestern University

More information

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks : Defending Against Cache-Based Side Channel Attacks Mengjia Yan, Bhargava Gopireddy, Thomas Shull, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Presented by Mengjia

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)

Lecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5) Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 Techniques to Reduce Cache Misses Victim caches Better replacement policies pseudo-lru, NRU Prefetching, cache

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers

Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers Silent Shredder: Zero-Cost Shredding For Secure Non-Volatile Main Memory Controllers 1 ASPLOS 2016 2-6 th April Amro Awad (NC State University) Pratyusa Manadhata (Hewlett Packard Labs) Yan Solihin (NC

More information

Lecture-16 (Cache Replacement Policies) CS422-Spring

Lecture-16 (Cache Replacement Policies) CS422-Spring Lecture-16 (Cache Replacement Policies) CS422-Spring 2018 Biswa@CSE-IITK 1 2 4 8 16 32 64 128 From SPEC92 Miss rate: Still Applicable Today 0.14 0.12 0.1 0.08 0.06 0.04 1-way 2-way 4-way 8-way Capacity

More information

1. Creates the illusion of an address space much larger than the physical memory

1. Creates the illusion of an address space much larger than the physical memory Virtual memory Main Memory Disk I P D L1 L2 M Goals Physical address space Virtual address space 1. Creates the illusion of an address space much larger than the physical memory 2. Make provisions for

More information

Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service

Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service * Kshitij Sudan* Sadagopan Srinivasan Rajeev Balasubramonian* Ravi Iyer Executive Summary Goal: Co-schedule N applications

More information

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): Motivation for The Memory Hierarchy: { CPU/Memory Performance Gap The Principle Of Locality Cache $$$$$ Cache Basics:

More information

Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache

Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache SOMAYEH SARDASHTI, University of Wisconsin-Madison ANDRE SEZNEC, IRISA/INRIA DAVID A WOOD, University of Wisconsin-Madison Cache

More information

Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors

Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors Sandro Bartolini* Department of Information Engineering, University of Siena, Italy bartolini@dii.unisi.it

More information

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching

Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching Somayeh Sardashti Computer Sciences Department University of Wisconsin-Madison somayeh@cs.wisc.edu David

More information

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question: Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Flexible Cache Error Protection using an ECC FIFO

Flexible Cache Error Protection using an ECC FIFO Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead

More information

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1 Why? High-Performance Multicores for Real-Time Systems

More information

Doppelgänger: A Cache for Approximate Computing

Doppelgänger: A Cache for Approximate Computing Doppelgänger: A Cache for Approximate Computing ABSTRACT Joshua San Miguel University of Toronto joshua.sanmiguel@mail.utoronto.ca Andreas Moshovos University of Toronto moshovos@eecg.toronto.edu Modern

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches

Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches Nikos Hardavellas Michael Ferdman, Babak Falsafi, Anastasia Ailamaki Carnegie Mellon and EPFL Data Placement in Distributed

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

An Accurate and Detailed Prefetching. Simulation Framework for gem5

An Accurate and Detailed Prefetching. Simulation Framework for gem5 An Accurate and Detailed ing Simulation Framework for gem5 Martí Torrents, Raúl Martínez, and Carlos Molina martit@ac.upc.edu Computer Architecture Department UPC BarcelonaTech ing Reduce memory latency

More information

Stash Directory: A Scalable Directory for Many- Core Coherence! Socrates Demetriades and Sangyeun Cho

Stash Directory: A Scalable Directory for Many- Core Coherence! Socrates Demetriades and Sangyeun Cho Stash Directory: A Scalable Directory for Many- Core Coherence! Socrates Demetriades and Sangyeun Cho 20 th Interna+onal Symposium On High Performance Computer Architecture (HPCA). Orlando, FL, February

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

Mo Money, No Problems: Caches #2...

Mo Money, No Problems: Caches #2... Mo Money, No Problems: Caches #2... 1 Reminder: Cache Terms... Cache: A small and fast memory used to increase the performance of accessing a big and slow memory Uses temporal locality: The tendency to

More information

SLIP: Reducing Wire Energy in the Memory Hierarchy

SLIP: Reducing Wire Energy in the Memory Hierarchy SLIP: Reducing Wire Energy in the Memory Hierarchy Subhasis Das Tor M. Aamodt William J. Dally Stanford University, University of British Columbia, NVIDIA subhasis@stanford.edu, aamodt@ece.ubc.ca, dally@stanford.edu

More information

Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes

Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes Energy-Efficient Cache Design Using Variable-Strength Error-Correcting Codes Alaa R. Alameldeen Ilya Wagner Zeshan Chishti Wei Wu Chris Wilkerson Shih-Lien Lu Intel Labs Overview Large caches and memories

More information

Memory Hierarchy Motivation, Definitions, Four Questions about Memory Hierarchy

Memory Hierarchy Motivation, Definitions, Four Questions about Memory Hierarchy Memory Hierarchy Motivation, Definitions, Four Questions about Memory Hierarchy Soner Onder Michigan Technological University Randy Katz & David A. Patterson University of California, Berkeley Levels in

More information

L2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary

L2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary HY425 Lecture 13: Improving Cache Performance Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 25, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 13: Improving Cache Performance 1 / 40

More information

Assisting Cache Replacement by Helper-Threading for MPSoCs

Assisting Cache Replacement by Helper-Threading for MPSoCs Assisting Cache Replacement by Helper-Threading for MPSoCs Masaaki Kondo Graduate School of Information Science and Technology, The University of Tokyo MPSoC2015 1 Background Increasing number of cores

More information

Sharing-aware Efficient Private Caching in Many-core Server Processors

Sharing-aware Efficient Private Caching in Many-core Server Processors 17 IEEE 35th International Conference on Computer Design Sharing-aware Efficient Private Caching in Many-core Server Processors Sudhanshu Shukla Mainak Chaudhuri Department of Computer Science and Engineering,

More information

Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns

Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns March 12, 2018 Speeding Up Crossbar Resistive Memory by Exploiting In-memory Data Patterns Wen Wen Lei Zhao, Youtao Zhang, Jun Yang Executive Summary Problems: performance and reliability of write operations

More information

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance

Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance Rachata Ausavarungnirun Saugata Ghose, Onur Kayiran, Gabriel H. Loh Chita Das, Mahmut Kandemir, Onur Mutlu Overview of This Talk Problem:

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems

Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems Matthew D. Sinclair *, Johnathan Alsop^, Sarita V. Adve + * University of Wisconsin-Madison ^ AMD Research + University

More information

Lecture 14: Large Cache Design II. Topics: Cache partitioning and replacement policies

Lecture 14: Large Cache Design II. Topics: Cache partitioning and replacement policies Lecture 14: Large Cache Design II Topics: Cache partitioning and replacement policies 1 Page Coloring CACHE VIEW Bank number with Page-to-Bank Tag Set Index Bank number with Set-interleaving Block offset

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,

More information

SHARDS & Talus: Online MRC estimation and optimization for very large caches

SHARDS & Talus: Online MRC estimation and optimization for very large caches SHARDS & Talus: Online MRC estimation and optimization for very large caches Nohhyun Park CloudPhysics, Inc. Introduction Efficient MRC Construction with SHARDS FAST 15 Waldspurger at al. Talus: A simple

More information

Spring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand

Spring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand Cache Design Basics Nima Honarmand Storage Hierarchy Make common case fast: Common: temporal & spatial locality Fast: smaller, more expensive memory Bigger Transfers Registers More Bandwidth Controlled

More information

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:

More information

Memory hierarchy review. ECE 154B Dmitri Strukov

Memory hierarchy review. ECE 154B Dmitri Strukov Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Six basic optimizations Virtual memory Cache performance Opteron example Processor-DRAM gap in latency Q1. How to deal

More information

Replacement policies for shared caches on symmetric multicores : a programmer-centric point of view

Replacement policies for shared caches on symmetric multicores : a programmer-centric point of view 1 Replacement policies for shared caches on symmetric multicores : a programmer-centric point of view Pierre Michaud INRIA HiPEAC 11, January 26, 2011 2 Outline Self-performance contract Proposition for

More information

Lecture 15: Large Cache Design III. Topics: Replacement policies, prefetch, dead blocks, associativity, cache networks

Lecture 15: Large Cache Design III. Topics: Replacement policies, prefetch, dead blocks, associativity, cache networks Lecture 15: Large Cache Design III Topics: Replacement policies, prefetch, dead blocks, associativity, cache networks 1 LIN Qureshi et al., ISCA 06 Memory level parallelism (MLP): number of misses that

More information

BIBIM: A Prototype Multi-Partition Aware Heterogeneous New Memory

BIBIM: A Prototype Multi-Partition Aware Heterogeneous New Memory HotStorage 18 BIBIM: A Prototype Multi-Partition Aware Heterogeneous New Memory Gyuyoung Park 1, Miryeong Kwon 1, Pratyush Mahapatra 2, Michael Swift 2, and Myoungsoo Jung 1 Yonsei University Computer

More information

TDT 4260 lecture 3 spring semester 2015

TDT 4260 lecture 3 spring semester 2015 1 TDT 4260 lecture 3 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU http://research.idi.ntnu.no/multicore 2 Lecture overview Repetition Chap.1: Performance,

More information

Exploration of Cache Coherent CPU- FPGA Heterogeneous System

Exploration of Cache Coherent CPU- FPGA Heterogeneous System Exploration of Cache Coherent CPU- FPGA Heterogeneous System Wei Zhang Department of Electronic and Computer Engineering Hong Kong University of Science and Technology 1 Outline ointroduction to FPGA-based

More information

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics Lecture 14: Large Cache Design III Topics: Replacement policies, associativity, cache networks, networking basics 1 LIN Qureshi et al., ISCA 06 Memory level parallelism (MLP): number of misses that simultaneously

More information

STORING DATA: DISK AND FILES

STORING DATA: DISK AND FILES STORING DATA: DISK AND FILES CS 564- Spring 2018 ACKs: Dan Suciu, Jignesh Patel, AnHai Doan WHAT IS THIS LECTURE ABOUT? How does a DBMS store data? disk, SSD, main memory The Buffer manager controls how

More information

ECE 30 Introduction to Computer Engineering

ECE 30 Introduction to Computer Engineering ECE 0 Introduction to Computer Engineering Study Problems, Set #9 Spring 01 1. Given the following series of address references given as word addresses:,,, 1, 1, 1,, 8, 19,,,,, 7,, and. Assuming a direct-mapped

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed 5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid

More information

LECTURE 12. Virtual Memory

LECTURE 12. Virtual Memory LECTURE 12 Virtual Memory VIRTUAL MEMORY Just as a cache can provide fast, easy access to recently-used code and data, main memory acts as a cache for magnetic disk. The mechanism by which this is accomplished

More information

Memshare: a Dynamic Multi-tenant Key-value Cache

Memshare: a Dynamic Multi-tenant Key-value Cache Memshare: a Dynamic Multi-tenant Key-value Cache ASAF CIDON*, DANIEL RUSHTON, STEPHEN M. RUMBLE, RYAN STUTSMAN *STANFORD UNIVERSITY, UNIVERSITY OF UTAH, GOOGLE INC. 1 Cache is 100X Faster Than Database

More information

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Motivation Memory is a shared resource Core Core Core Core

More information

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering

More information

Memory Hierarchy. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Memory Hierarchy. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Memory Hierarchy Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection

More information

MEMORY HIERARCHY BASICS. B649 Parallel Architectures and Programming

MEMORY HIERARCHY BASICS. B649 Parallel Architectures and Programming MEMORY HIERARCHY BASICS B649 Parallel Architectures and Programming BASICS Why Do We Need Caches? 3 Overview 4 Terminology cache virtual memory memory stall cycles direct mapped valid bit block address

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

Two-Level Address Storage and Address Prediction

Two-Level Address Storage and Address Prediction Two-Level Address Storage and Address Prediction Enric Morancho, José María Llabería and Àngel Olivé Computer Architecture Department - Universitat Politècnica de Catalunya (Spain) 1 Abstract. : The amount

More information

Lecture 2: Memory Systems

Lecture 2: Memory Systems Lecture 2: Memory Systems Basic components Memory hierarchy Cache memory Virtual Memory Zebo Peng, IDA, LiTH Many Different Technologies Zebo Peng, IDA, LiTH 2 Internal and External Memories CPU Date transfer

More information

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o

Memory hier ar hier ch ar y ch rev re i v e i w e ECE 154B Dmitri Struko Struk v o Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Opteron example Cache performance Six basic optimizations Virtual memory Processor DRAM gap (latency) Four issue superscalar

More information

Virtual Memory. Motivation:

Virtual Memory. Motivation: Virtual Memory Motivation:! Each process would like to see its own, full, address space! Clearly impossible to provide full physical memory for all processes! Processes may define a large address space

More information

Probabilistic Directed Writebacks for Exclusive Caches

Probabilistic Directed Writebacks for Exclusive Caches Probabilistic Directed Writebacks for Exclusive Caches Lena E. Olson University of Wisconsin-Madison lena@cs.wisc.edu Abstract Energy is an increasingly important consideration in memory system design.

More information

Portland State University ECE 587/687. Caches and Memory-Level Parallelism

Portland State University ECE 587/687. Caches and Memory-Level Parallelism Portland State University ECE 587/687 Caches and Memory-Level Parallelism Revisiting Processor Performance Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time For each

More information

Use-Based Register Caching with Decoupled Indexing

Use-Based Register Caching with Decoupled Indexing Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major

More information

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!

More information

Architecting HBM as a High Bandwidth, High Capacity, Self-Managed Last-Level Cache

Architecting HBM as a High Bandwidth, High Capacity, Self-Managed Last-Level Cache Architecting HBM as a High Bandwidth, High Capacity, Self-Managed Last-Level Cache Tyler Stocksdale Advisor: Frank Mueller Mentor: Mu-Tien Chang Manager: Hongzhong Zheng 11/13/2017 Background Commodity

More information