JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES
1 JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES. Nathan Beckmann and Daniel Sanchez, MIT CSAIL. PACT'13, Edinburgh, Scotland, Sep 11, 2013.
2 Summary. NUCA is giving us more capacity, but further away. Applications have widely varying cache behavior. Cache organization should adapt to the application. Jigsaw uses physical cache resources as building blocks of virtual caches, or shares. [Figure: miss curves (MPKI, 0-40) vs. cache size (0-16MB) for libquantum, zeusmp, and sphinx3]
3 Approach. Jigsaw uses physical cache resources as building blocks of virtual caches, or shares. [Figures: miss curves (MPKI, 0-40) vs. cache size (0-16MB) for libquantum, zeusmp, and sphinx3; banks of a tiled multicore]
4 Agenda 4 Introduction Background Goals Existing Approaches Jigsaw Design Evaluation
5 Goals. Make effective use of cache capacity. Place data for low latency. Provide capacity isolation for performance. Have a simple implementation.
6 Existing Approaches: S-NUCA. Spread lines evenly across banks. High capacity; high latency; no isolation; simple.
7 Existing Approaches: Partitioning. Isolate regions of cache between applications. High capacity; high latency; isolation; simple. Jigsaw needs partitioning; it uses Vantage to get strong guarantees with no loss in associativity.
8 Existing Approaches: Private. Place lines in the local bank. Low capacity; low latency; isolation; complex (requires an LLC directory).
9 Existing Approaches: D-NUCA. Placement, migration, and replication heuristics. High capacity, but beware of over-replication and restrictive mappings. Low latency, but these schemes don't fully exploit the capacity vs. latency tradeoff. No isolation. Complexity varies: private-baseline schemes require an LLC directory.
10 Existing Approaches: Summary

                S-NUCA   Partitioning   Private   D-NUCA
High Capacity   Yes      Yes            No        Yes
Low Latency     No       No             Yes       Yes
Isolation       No       Yes            Yes       No
Simple          Yes      Yes            No        Depends
11 Jigsaw. High capacity: any share can take full capacity, no replication. Low latency: shares are allocated near the cores that use them. Isolation: partitions within each bank. Simple: low-overhead hardware, no LLC directory, software-managed.
12 Agenda 12 Introduction Background Jigsaw Design Operation Monitoring Configuration Evaluation
13 Jigsaw Components. [Diagram: Operation handles accesses; Monitoring produces miss curves; Configuration decides size & placement]
14 Jigsaw Components 14 Operation Configuration Monitoring
15 Agenda 15 Introduction Background Jigsaw Design Operation Monitoring Configuration Evaluation
16 Operation: Access. Each piece of data maps to exactly one share, so no LLC coherence is required. [Figure: on an access (LD 0x5CA1AB1E), the core's TLB-based LLC classifier and STB steer the L1I/L1D/L2 miss to one of shares 1..N]
17 Data Classification. Jigsaw classifies data based on access pattern: Thread, Process, Global, and Kernel. Data is lazily re-classified on a TLB miss. Similar to R-NUCA, but: R-NUCA maps classification → location, while Jigsaw maps classification → share (sized and placed dynamically). Negligible overhead. [Figure: an operating system with 6 thread shares, 2 process shares, 1 global share, and 1 kernel share]
18 Operation: Share-Bank Translation Buffer (STB). The STB gives the unique location of a line in the LLC: (Address, Share) → (Bank, Partition). Lines are hashed across banks proportionally to the share's per-bank allocation. The STB is about 400 bytes; low overhead. 4 entries, associative, exception on miss; each entry holds a share's configuration (hash function H plus 64 bank/partition descriptors). [Figure: the share id (from the TLB) selects an STB entry and the hashed address (from an L1 miss) selects a bank/partition; e.g., address 0x5CA1AB1E maps to bank 3, partition 5]
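The proportional hashing the STB performs in hardware can be sketched in software. This is a minimal illustration, not the hardware design: the `share_config` layout and the use of `hashlib` stand in for the STB's hash function and bank/partition descriptors.

```python
import hashlib

def stb_lookup(share_config, address):
    """Map (share, address) -> (bank, partition).

    share_config: list of (bank, partition, weight) entries; weight is the
    fraction of the share's capacity held in that bank (weights sum to 1).
    The real STB uses a hardware hash; hashlib is purely illustrative.
    """
    h = hashlib.sha256(address.to_bytes(8, "little")).digest()
    x = int.from_bytes(h[:4], "little") / 2**32   # uniform in [0, 1)
    cumulative = 0.0
    for bank, partition, weight in share_config:
        cumulative += weight
        if x < cumulative:
            return bank, partition
    return share_config[-1][:2]                   # guard against rounding
```

Because the hash is a pure function of the address, every core resolves a line to the same bank and partition, which is what makes an LLC directory unnecessary.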
19 Agenda 19 Introduction Background Jigsaw Design Operation Monitoring Configuration Evaluation
20 Monitoring. Software requires miss curves for each share. Add utility monitors (UMONs) per tile to produce miss curves. Dynamic sampling models the full LLC at each bank; see paper. [Figure: UMON tag array (ways 0..N-1) with per-way hit counters and a miss counter, yielding a misses vs. cache size curve]
21 Configuration. Software decides the share configuration. Approach: size → place. Solving the two problems independently is simple; sizing is hard, placing is easy. [Figure: miss curves feed the SIZE step, whose allocations feed the PLACE step]
22 Configuration: Sizing. Partitioning problem: divide cache capacity S among P partitions/shares to maximize hits. Use miss curves to describe partition behavior. NP-complete in general. Existing approaches: hill climbing is fast but gets stuck in local optima; UCP Lookahead is good but scales quadratically, O(P × S²). [Utility-Based Cache Partitioning, Qureshi and Patt, MICRO'06] Can we scale Lookahead?
23 Configuration: Lookahead. UCP Lookahead: scan the miss curves to find the allocation that maximizes average cache utility (hits per byte). [Animation, slides 23-39: repeatedly scanning each partition's miss curve (Misses vs. Size, up to LLC size) and granting the allocation with maximum cache utility]
40 Configuration: Lookahead. Observation: Lookahead traces the convex hull of the miss curve. [Figure: miss curve with the maximum-cache-utility points lying on its convex hull]
41 Convex Hulls. The convex hull of a curve is the smallest convex set containing all lines between any two points on the curve; its lower boundary is the curve connecting the points along the bottom, which is what Lookahead traces. [Figure: two miss curves with their lower convex hulls]
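Since a miss curve's x-values (allocations) are already sorted, its lower hull falls out of one linear pass with the standard monotone-chain technique. A sketch of the kind of linear-time hull computation Peekahead builds on, not the paper's exact code:

```python
def lower_convex_hull(misses):
    """Indices of the lower convex hull of a miss curve.

    misses[i] = misses at an allocation of i units. Points that lie on or
    above a chord between their neighbors are popped, so the result is the
    curve connecting the points along the bottom. Runs in O(S): each index
    is pushed and popped at most once.
    """
    hull = []  # indices (allocations) of hull points
    for x in range(len(misses)):
        while len(hull) >= 2:
            xo, xa = hull[-2], hull[-1]
            # Cross product of (xo -> xa) and (xo -> x); <= 0 means xa is
            # not strictly below the chord, so it is not on the lower hull.
            cross = ((xa - xo) * (misses[x] - misses[xo])
                     - (misses[xa] - misses[xo]) * (x - xo))
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(x)
    return hull
```

For example, a miss curve [10, 9, 2] has point (1, 9) above the chord from (0, 10) to (2, 2), so only allocations 0 and 2 lie on the hull.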
42 Configuration: Peekahead. There are well-known linear-time algorithms to compute convex hulls. The Peekahead algorithm is an exact, linear-time implementation of UCP Lookahead. [Figure: miss curves and their convex hulls]
43 Configuration: Peekahead. Peekahead computes all convex hulls encountered during allocation in linear time: starting from every possible allocation, up to any remaining cache capacity. [Figure: miss curves and their convex hulls]
44 Configuration: Peekahead. Knowing the convex hull, each allocation step is O(log P). Convex hulls have decreasing slope → decreasing average cache utility → only the next point on each hull need be considered. A max-heap compares the best step across partitions. [Figure: candidate best steps on each partition's hull, from the current allocation]
45 Configuration: Peekahead. [Animation, slides 45-51: from the current allocation, the best step along each partition's convex hull is selected via the max-heap and granted, and the process repeats]
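The hull-walking allocation loop can be sketched as follows. This is a simplified illustration under assumed inputs (precomputed hulls per partition); Peekahead's bookkeeping for re-deriving hulls from arbitrary starting allocations is omitted.

```python
import heapq

def allocate(hulls, misses, capacity):
    """Greedy allocation along precomputed convex hulls.

    hulls[p]  = hull point allocations for partition p (ascending, from 0)
    misses[p] = miss curve for partition p
    capacity  = total cache units to hand out
    Returns the per-partition allocation.
    """
    alloc = [hulls[p][0] for p in range(len(hulls))]
    pos = [0] * len(hulls)  # current hull point per partition
    heap = []

    def push(p):
        # Offer partition p's next hull step, keyed by utility (hits/byte).
        if pos[p] + 1 < len(hulls[p]):
            cur, nxt = hulls[p][pos[p]], hulls[p][pos[p] + 1]
            utility = (misses[p][cur] - misses[p][nxt]) / (nxt - cur)
            heapq.heappush(heap, (-utility, p))  # max-heap via negation

    for p in range(len(hulls)):
        push(p)
    remaining = capacity - sum(alloc)
    while heap and remaining > 0:
        _, p = heapq.heappop(heap)           # best step: O(log P)
        cur, nxt = hulls[p][pos[p]], hulls[p][pos[p] + 1]
        step = nxt - cur
        if step > remaining:
            continue  # this step no longer fits; drop it
        alloc[p] = nxt
        remaining -= step
        pos[p] += 1
        push(p)
    return alloc
```

Because hull slopes decrease monotonically, each partition's best next step is always its next hull point, so a single heap entry per partition suffices.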
52 Configuration: Peekahead. Full runtime is O(P × S), where P is the number of partitions and S the cache size. See the paper for additional examples, the algorithm, and corner cases; see the technical report for additional detail, proofs, and run-time analysis. [Jigsaw: Scalable Software-Defined Caches (Extended Version), Nathan Beckmann and Daniel Sanchez, Technical Report MIT-CSAIL-TR, Massachusetts Institute of Technology, July 2013]
53 Re-configuration. When the STB changes, some addresses hash to different banks. [Figure: address 0x5CA1AB1E hashed under old and new STB configurations; one bank's allocation changes from 1/3 to 4/9, moving the line] Selective invalidation hardware walks the LLC and invalidates lines that have moved. Heavy-handed, but infrequent, and avoids a directory. Maximum of 300K cycles every 50M cycles = 0.6% overhead.
54 Design: Hardware Summary. Operation: the share-bank translation buffer (STB) handles accesses; the TLB is augmented with a share id. Bank partitioning hardware (Vantage). Monitoring hardware produces miss curves. Configuration: invalidation hardware. [Figure: tile organization with core, L1I/L1D, L2, NoC router, and Jigsaw L3 bank containing partitioning HW (Vantage), monitoring HW, and invalidation HW; STB and TLBs at the core; modified vs. new/added structures highlighted]
55 Agenda 55 Introduction Background Jigsaw Design Evaluation Methodology Performance Energy
56 Methodology. Execution-driven simulation using zsim. Workloads: 16-core single-threaded mixes of SPEC CPU2006; 64-core multithreaded (4x16-thread) mixes of PARSEC. Cache organizations: LRU (shared S-NUCA cache with LRU replacement; the baseline); Vantage (S-NUCA with Vantage and UCP Lookahead); R-NUCA (state-of-the-art shared-baseline D-NUCA organization); IdealSPD ("shared-private D-NUCA": private L3 + shared L4, with 2x the capacity of the other schemes, an upper bound for private-baseline D-NUCA organizations); and Jigsaw.
57 Evaluation: Performance. 16-core multiprogrammed mixes of SPEC CPU2006. Jigsaw achieves the best performance: up to 50% improved throughput and 2.2x improved weighted speedup; gmean +14% throughput, +18% weighted speedup. Jigsaw does even better on the most memory-intensive mixes (top 20% of LRU MPKI): gmean +21% throughput, +29% weighted speedup.
58 Evaluation: Performance. 64-core multithreaded mixes of PARSEC. Jigsaw achieves the best performance: gmean +9% throughput, +9% weighted speedup. Remember that IdealSPD is an upper bound with 2x capacity.
59 Evaluation: Performance Breakdown. 16-core multiprogrammed mixes of SPEC CPU2006. Memory stalls are broken down into network and DRAM components, normalized to LRU. R-NUCA is limited by capacity in these workloads (private data → local bank). Vantage only benefits DRAM. IdealSPD acts as either a private organization (benefiting network) or a shared organization (benefiting DRAM). Jigsaw is the only scheme to simultaneously benefit network and DRAM latency. [Chart: per-scheme stall breakdown, including an Optimum reference]
60 Evaluation: Energy. 16-core multiprogrammed mixes. McPAT models of full-system energy (chip + DRAM). Jigsaw achieves the best energy reduction: up to 72%, gmean of 11%. It reduces both network and DRAM energy.
61 Conclusion. NUCA is giving us more capacity, but further away. Applications have widely varying cache behavior; cache organization should adapt to meet application needs. [Figure: MPKI (0-40) vs. cache size (0-16MB)] Jigsaw uses physical cache resources as building blocks of virtual caches, or shares: sized to fit the working set, and placed near the application for low latency. Jigsaw improves performance by up to 2.2x and reduces energy by up to 72%.
62 QUESTIONS
63 Placement. Greedy algorithm: each share is allocated a budget; shares take turns grabbing space in nearby banks; banks are ordered by distance from the center of mass of the cores accessing the share; repeat until budgets and banks are exhausted.
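A minimal sketch of this greedy round-robin placement, with assumed inputs (per-share budgets from the sizing step, core coordinates, and a uniform bank capacity); the per-turn grain and tie-breaking are illustrative choices, not the paper's exact policy.

```python
import math

def place_shares(budgets, accessors, banks, bank_capacity):
    """Greedy round-robin placement sketch.

    budgets[s]   = capacity granted to share s by the sizing step
    accessors[s] = (x, y) positions of cores accessing share s
    banks        = list of (x, y) bank positions
    Returns placement[s][b] = capacity of share s placed in bank b.
    """
    def center_of_mass(points):
        return (sum(x for x, _ in points) / len(points),
                sum(y for _, y in points) / len(points))

    # Each share visits banks nearest its accessors' center of mass first.
    order = []
    for s in range(len(budgets)):
        cx, cy = center_of_mass(accessors[s])
        order.append(sorted(range(len(banks)),
                            key=lambda b: math.hypot(banks[b][0] - cx,
                                                     banks[b][1] - cy)))
    free = [bank_capacity] * len(banks)
    left = list(budgets)
    placement = [[0] * len(banks) for _ in budgets]
    grain = 1  # units claimed per turn (illustrative)
    while any(left) and any(free):
        for s in range(len(budgets)):       # shares take turns
            for b in order[s]:              # nearest bank with space
                take = min(grain, left[s], free[b])
                if take > 0:
                    placement[s][b] += take
                    left[s] -= take
                    free[b] -= take
                    break
    return placement
```

With two shares pinned to opposite corners of a 2-bank layout, each share ends up entirely in its local bank, which is the low-latency outcome the ordering is designed to produce.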
64 Monitoring. Software requires miss curves for each share. Add UMONs per tile: a small tag array that models LRU on a sample of accesses, tracking hits per way and misses → a miss curve. Changing the sampling rate models a larger cache: Sampling Rate = UMON Lines / Modeled Cache Lines. The STB spreads lines proportionally to partition size, so the sampling rate must compensate: Sampling Rate = (Share Size / Partition Size) × (UMON Lines / Modeled Cache Lines).
65 Monitoring. The STB spreads addresses unevenly → change the sampling rate to compensate. Augment the UMON with a hash (shared with the STB) and a 32-bit limit register that gives fine control over the sampling rate. [Figure: the hashed address is compared against the limit register before indexing the UMON tag array and hit counters] The UMON now models full LLC capacity exactly. Shares require only one UMON. A maximum of four shares per bank → four UMONs per bank → 1.4% overhead.
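The sampling-rate compensation can be expressed as the value loaded into the 32-bit limit register. A hypothetical helper illustrating the computation; the encoding (sample when hash < limit, with the limit as a fraction of the 32-bit hash space) is an assumption, not the paper's register format.

```python
def umon_limit(umon_lines, modeled_cache_lines, share_size, partition_size):
    """Derive a 32-bit limit register value for an UMON (illustrative).

    The base rate models a cache of modeled_cache_lines with umon_lines
    tags; it is scaled up by share_size / partition_size because this
    bank's partition sees only that fraction of the share's accesses.
    """
    rate = (share_size / partition_size) * (umon_lines / modeled_cache_lines)
    rate = min(rate, 1.0)  # cannot sample more than every access
    return int(rate * (2**32 - 1))
```

For instance, a 1024-line UMON modeling a 2^20-line LLC for a share spread over four equal partitions would sample 1/256 of this bank's stream.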
66 Evaluation: Extra See paper for: Out-of-order results Execution time breakdown Peekahead performance Sensitivity studies 66
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationCS370 Operating Systems
CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 20 Main Memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Pages Pages and frames Page
More informationLecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University
Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:
More informationAnalysis and Optimization. Carl Waldspurger Irfan Ahmad CloudPhysics, Inc.
PRESENTATION Practical Online TITLE GOES Cache HERE Analysis and Optimization Carl Waldspurger Irfan Ahmad CloudPhysics, Inc. SNIA Legal Notice The material contained in this tutorial is copyrighted by
More informationA Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems
A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems Dimitris Kaseridis, Jeffrey Stuecheli, Jian Chen and Lizy K. John Department of Electrical
More informationPOPS: Coherence Protocol Optimization for both Private and Shared Data
POPS: Coherence Protocol Optimization for both Private and Shared Data Hemayet Hossain, Sandhya Dwarkadas, and Michael C. Huang University of Rochester Rochester, NY 14627, USA {hossain@cs, sandhya@cs,
More informationPerformance of coherence protocols
Performance of coherence protocols Cache misses have traditionally been classified into four categories: Cold misses (or compulsory misses ) occur the first time that a block is referenced. Conflict misses
More informationLast Level Cache Size Flexible Heterogeneity in Embedded Systems
Last Level Cache Size Flexible Heterogeneity in Embedded Systems Mario D. Marino, Kuan-Ching Li Leeds Beckett University, m.d.marino@leedsbeckett.ac.uk Corresponding Author, Providence University, kuancli@gm.pu.edu.tw
More informationCANDY: Enabling Coherent DRAM Caches for Multi-Node Systems
CANDY: Enabling Coherent DRAM Caches for Multi-Node Systems Chiachen Chou Aamer Jaleel Moinuddin K. Qureshi School of Electrical and Computer Engineering Georgia Institute of Technology {cc.chou, moin}@ece.gatech.edu
More informationI, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.
5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000
More informationAddress Translation. Tore Larsen Material developed by: Kai Li, Princeton University
Address Translation Tore Larsen Material developed by: Kai Li, Princeton University Topics Virtual memory Virtualization Protection Address translation Base and bound Segmentation Paging Translation look-ahead
More informationChapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.
More informationDo we need a crystal ball for task migration?
Do we need a crystal ball for task migration? Brandon {Myers,Holt} University of Washington bdmyers@cs.washington.edu 1 Large data sets Data 2 Spread data Data.1 Data.2 Data.3 Data.4 Data.0 Data.1 Data.2
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required
More informationChapter 18 - Multicore Computers
Chapter 18 - Multicore Computers Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 18 - Multicore Computers 1 / 28 Table of Contents I 1 2 Where to focus your study Luis Tarrataca
More informationOptimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service
Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service * Kshitij Sudan* Sadagopan Srinivasan Rajeev Balasubramonian* Ravi Iyer Executive Summary Goal: Co-schedule N applications
More informationLecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)
Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 Intel Montecito Cache Two cores, each with a private 12 MB L3 cache and 1 MB L2 Naffziger et al., Journal of Solid-State
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)
1 MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1) Chapter 5 Appendix F Appendix I OUTLINE Introduction (5.1) Multiprocessor Architecture Challenges in Parallel Processing Centralized Shared Memory
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationMultiprocessors and Locking
Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access
More informationSpring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand
Cache Design Basics Nima Honarmand Storage Hierarchy Make common case fast: Common: temporal & spatial locality Fast: smaller, more expensive memory Bigger Transfers Registers More Bandwidth Controlled
More informationCSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]
CSF Cache Introduction [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user with as much
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1)
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1) 1 Problem 3 Consider the following LSQ and when operands are available. Estimate
More informationAddressing End-to-End Memory Access Latency in NoC-Based Multicores
Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu
More informationExploration of Cache Coherent CPU- FPGA Heterogeneous System
Exploration of Cache Coherent CPU- FPGA Heterogeneous System Wei Zhang Department of Electronic and Computer Engineering Hong Kong University of Science and Technology 1 Outline ointroduction to FPGA-based
More informationvirtual memory Page 1 CSE 361S Disk Disk
CSE 36S Motivations for Use DRAM a for the Address space of a process can exceed physical memory size Sum of address spaces of multiple processes can exceed physical memory Simplify Management 2 Multiple
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationI/O CHARACTERIZATION AND ATTRIBUTE CACHE DATA FOR ELEVEN MEASURED WORKLOADS
I/O CHARACTERIZATION AND ATTRIBUTE CACHE DATA FOR ELEVEN MEASURED WORKLOADS Kathy J. Richardson Technical Report No. CSL-TR-94-66 Dec 1994 Supported by NASA under NAG2-248 and Digital Western Research
More informationTIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS. Xiangyao Yu, Srini Devadas CSAIL, MIT
TIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS Xiangyao Yu, Srini Devadas CSAIL, MIT FOR FIFTY YEARS, WE HAVE RIDDEN MOORE S LAW Moore s Law and the scaling of clock frequency = printing press for the currency
More information