The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!!

Size: px

Start display at page:

Download "The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!!"

Chester Johnson
6 years ago
Views:

1 The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!! 1 2 3

2 Modern CMPs" Intel e (2013)! SLLC" AMD Orochi (2012)! SLLC" Fujitsu SPARC VIIIfx (2011)! Intel Itanium 9500 (2012)! SLLC" 2! SLLC"

3 *100 multiprogrammed SPEC CPU 2006 workloads in an 8-core CMP! Inclusive Shared Last-Level Cache (SLLC)! 8MB Conventional! 1/8! size! Reuse Cache! Data 1MB! Tags! Tags! Data! 1/2! Same average performance* " with 84% area savings"

4 Outline"! Motivation!! The reuse cache! Idea! Organization! Coherence! Replacement!! Related work!! Evaluation!! Conclusions! 4!

5 Outline"! Motivation"! The reuse cache! Idea! Organization! Coherence! Replacement!! Related work!! Evaluation!! Conclusions! 5!

6 Motivation" SLLC data" Many hits! Zero hits! Reuse locality" " Used more than one time likely to be used many times! Recently reused lines more useful than lines reused before! Opportunity for a smaller SLLC which stores only reused data! 6!

7 Motivation" Live cache fraction evolution! LRU DRRIP NRR time (x100k cycles) 7!! The fraction of live SLLC lines is very small (10-30% LRU)! On average 78% of lines won t receive any additional access!! State-of-art replacement policies better!! 8MB Shared LLC, 16-way, LRU, 8-core CMP, mix #14!

8 Motivation" 100% 5% of all the loaded lines concentrate all the hits" 50% Total hits! 0.5% of all the loaded lines concentrate close to 50% of hits" 0.5% 5% Inserted lines! Most of the inserted lines will not experience any hit because hits concentrate in a few lines! 8! 8MB Shared LLC, 16-way, LRU, 8-core CMP, mix #14!

9 Outline"! Motivation!! The reuse cache" Idea! Organization! Coherence protocol! Replacement!! Related work!! Evaluation!! Conclusions! 9!

10 Idea" Data is stored only when reuse is detected! From main memory! tag! insertion! Tags! Data! From main memory! data! insertion! tag hit! Tags! Data! To private! caches! To private! caches! 10! First access: " Miss " only tag insertion! Second access:" Tag hit " data insertion! (reuse detected)!

11 Organization"! Decoupled tag and data arrays! More tag than data entries! Conventional cache! Reuse cache! Tag! Data! Tag! Data! 11!

12 Organization"! Tag array! NORMAL! Maintains coherence/inclusion! Set-associative! NEW! Each entry has associated data or not! Forward pointers: to indicate corresponding data array entry! Tag array! 16-way!..." tag! state! pointer! Data! array! 12!

13 Organization"! Tag array! Maintains coherence/inclusion! Set-associative! Each entry has associated data or not! Forward pointers: to indicate corresponding data array entry!! Data array! Only stores reused lines! Queue or set-associative! Reverse pointers: update tag array entry! Tag array! 16-way!..." tag! state! pointer! Data! array! 13! pointer! data! repl.!

14 Coherence"! Conventional coherence protocols can be used with small changes! Two types of states have to be considered! # Tag-only states! # Tag + data states! Transitions between both types of states! are triggered by data insertions or evictions! Data array! insertion! (reuse)! Tag-only! states! Tag + data! states! Data! eviction! 14!

15 Replacement"! Tag and data arrays have independent replacement policies! Tag array! # Not Recently-Reused (NRR) 1 [Albericio et al. TACO 13]!! reused and private cache contents are unlikely to be evicted! Data array! # Clock, for queue organization [Corbató MIT 68]! # Not Recently-Used (NRU), for set associativity! 15!

16 NRU vs NRR"! NRU! Exploiting temporal locality (approx. to LRU)! 1 bit per line! Used on intel i7 and SPARC T2! Used% Not%used%! NRR 1! Exploiting reuse locality! 1 bit per line! Private cache contents are protected! Insertion Reused% Not%reused% 16! Insertion

17 Outline"! Motivation!! The reuse cache! Idea! Organization! Coherence protocol! Replacement!! Related work"! Evaluation!! Conclusions! 17!

18 Related work"! Many proposals decouple tag/data arrays! Decoupled sectored caches, Seznec, ISCA 1994! The pool of subsectors, Rothman and A. Smith, ICS 1999! NuRAPID, Chishti et al., MICRO 2003! V-way, Qureshi et al., ISCA 2005!! Non-inclusive cache, inclusive directory (NCID), Zhao et al., Computer Frontiers 2010! Additional tags to maintain inclusion! Selective allocation policy based on set dueling! Tag+data" Tag" Tag" Data" Set" 18!

19 Outline"! Motivation!! The reuse cache! Idea! Organization! Coherence protocol! Replacement!! Related work!! Evaluation"! Conclusions! 19!

20 Methodology" Baseline system configuration! 8 in-order core CMP system! Bank 0! L2! L1! 1 DDR3 channel!! crossbar!! Bank 3! L2! L1! core 0! core 7! 8MB SLLC! LRU, 16-way! (inclusive)! 256KB L2! 16KB I$! 16KB D$!! Simulation! Simics full-system simulator! Ruby from GEMS Multifacet!! Area and latency! CACTI 6.5!! Workloads! Multi-programmed: SPEC CPU 2006 (100 random mixes from all the 29 applications)! Parallel: from PARSEC and SPLASH2!! 20!

21 Terminology"! RC-x/y = Reuse cache with tags equivalent to x MB,! y MB of data! Tags! RC-x/y! Data! Tags! Data! Conventional cache! of x MB! Tags! Data! Conventional cache! of y MB! 21!

22 Size exploration Performance over conventional 8MB! 22! 1.15! 1.1! 1.05! 1! 0.95! 0.9! Chosen configuration for! each data array size! RC-x/8 RC-x/4 RC-x/2 RC-x/1 RC-x/0.5

23 Size exploration Performance over conventional 8MB! 23! 1.15! 1.1! 1.05! 1! 0.95! 0.9! RC-16/8 Hardware cost RC-8/4 This configuration keeps average! performance while it reduces size! RC-8/2 Same avg. performance! with 84% area savings! Conv. 8MB LRU! RC-4/1 RC-4/ % 80% 60% 40% 20% 0% Hardware cost respect to conv. 8MB"

24 Better baseline Performance over conventional 8MB! 1.15! 1.1! 1.05! 1! 0.95! 0.9! RC-16/8 +1.1% perf.! -40% HW cost! RC-8/4-0.7% perf.! -60% HW cost! RC-8/2 RC-4/1 RC-4/0.5 Conv. DRRIP 8MB! Conv. LRU 8MB! 24!

25 Multiprogrammed 9 workloads over 5% of speedup Better in more than half of the mixes Performance over conventional 8MB! 6 workloads over 5% of slowdown 25!

26 NCID comparison 1.1! 7.0% Performance over conventional 8MB! 1.05! 6.4% 5.2% Reuse%cache% NCID% 1! 5.3% 0.95! 0.9! RC-8/4! RC-8/2! RC-4/1! RC-4/0.5! NCID-8/4 NCID-8/2 NCID-4/1 NCID-4/0.5 26!

27 Behavior insight" Percentage of data stored into the cache with respect to the number of tags! Conventional cache: 100%! RC-4/1: 5%! Our selection of contents policy is selecting lines that will receive hits 100% 5% of the loaded lines get all the hits" 50% Percentage of total hits! Percentage of inserted lines! 27! 0.5% 5%

28 Additional results! Performance analysis for more sizes!! Set-associative data array!! Per-application study!! Parallel workloads! 28!

29 Outline"! Motivation!! The reuse cache! Idea! Organization! Coherence protocol! Replacement!! Related work!! Evaluation!! Conclusions" 29!

30 Conclusions"! A big fraction of the SLLC contents is dead! A selective allocation policy is needed!! A SLLC organization is proposed: the reuse cache! Very selective line allocation policy, based on reuse! Same avg. performance with 84% storage savings! Alternative applications:! # Further reduction of energy! # More computing elements with same area! 30!

31 The Reuse Cache Downsizing the Shared Last-Level Cache! Jorge Albericio 1, Pablo Ibáñez 2, Víctor Viñals 2, and José M. Llabería 3!!! 1 2 3

Evaluating the Reuse Cache for mobile processors. Lokesh Jindal, Urmish Thakker, Swapnil Haria CS-752 Fall 2014 University of Wisconsin-Madison

Evaluating the Reuse Cache for mobile processors Lokesh Jindal, Urmish Thakker, Swapnil Haria CS-752 Fall 2014 University of Wisconsin-Madison 1 Executive Summary Problem : Mobile SOCs Area is money! Cache