SCALING HARDWARE AND SOFTWARE
1 SCALING HARDWARE AND SOFTWARE FOR THOUSAND-CORE SYSTEMS Daniel Sanchez Electrical Engineering Stanford University
2 Multicore Scalability 2 [Plot: transistors (millions) and aggregate multicore performance (MIPS) over time, log scale] Multicore is key to the future of computing. Scaling performance is hard, even with a lot of parallelism.
3 Memory is Critical 3 Memory limits performance and energy efficiency. Basic indicators: 64-bit FP op: ~1ns latency, ~20pJ energy; shared cache access: ~10ns latency, ~1nJ energy; DRAM access: ~100ns latency, ~20nJ energy. HW & SW must optimize memory performance.
4 Multicore Memory Hierarchy 4 [Diagram: cores with private caches, shared last-level cache, coherence directory, main memory] Per-core private caches: fast access to the critical working set; should satisfy most accesses. Shared last-level cache: increases utilization; accelerates communication; can be partitioned for isolation. Coherence protocol: makes caches transparent to SW; uses a directory to track sharers.
5-9 Memory Hierarchy Challenges at 1K Cores The memory hierarchy is hard to scale: 1. Directories scale poorly. 2. Conflicts in caches & directory are more frequent. 3. The shared cache cannot be partitioned efficiently. 4. No isolation or QoS due to the shared cache and directory.
10 Scaling Parallel Runtimes 10 The parallel runtime maps the application to hardware: resource management and scheduling. The runtime is fundamental to scaling with manageable complexity. [Stack: Parallel Application → Parallel Runtime → Operating System → Hardware]
11-12 Scheduling Parallel Applications The application is divided into parallel tasks with different requirements, which may have dependences. The scheduler assigns tasks to cores.
13-19 Runtime & Scheduling Challenges Constrained parallelism: coarser tasks, unneeded serialization. Increased cache misses. Load imbalance. Scheduling overheads. Excessive memory footprint (crash!). These are conflicting issues: we need smart algorithms!
20 Contributions 20 Scalable cache hierarchies: Efficient highly-associative caches [MICRO'10]. Scalable cache partitioning [ISCA'11, Top Picks'12]. Scalable coherence directories [HPCA'12]. Scalable scheduling: Efficient dynamic scheduling by leveraging programming-model information [PACT'11]. Hardware-accelerated scheduling [ASPLOS'10].
21 This Talk 21 Scalable cache hierarchies: Efficient highly-associative caches [MICRO'10]. Scalable cache partitioning [ISCA'11, Top Picks'12]. Scalable coherence directories [HPCA'12]. Scalable scheduling: Efficient dynamic scheduling by leveraging programming-model information [PACT'11]. Hardware-accelerated scheduling [ASPLOS'10].
22 Rethinking Common-Case Design 22 Conventional approach: make the common case fast, based on patterns of past and current workloads, and overprovision to mitigate the worst case or for future workloads. Multicore demands going beyond the common case: resources are shared, so we need guarantees in all cases, and overprovisioning alone is insufficient and wasteful (though some overprovisioning simplifies design). We must provide guarantees with minimal overprovisioning. Root cause: empirical design, with limited understanding of system behavior.
23 Solution: Analytical Design Approach 23 Design basic components that are easily analyzable: simple, accurate, workload-independent analytical models that make behavior easy to understand and reason about. Use the models to design systems that work well in all cases: scalability and QoS guaranteed in all scenarios, while outperforming conventional techniques in the common case. This requires revisiting fundamental aspects of our systems (associativity, coherence, ...).
24 Set-Associative Caches 24 Basic building block of caches and directories. [Diagram: line address → hash function H → set index; 4 ways × 8 sets] Problems: reducing conflicts (higher associativity) requires more ways, which means higher energy, latency, and area; and conflicts depend on the workload's access patterns.
25 ZCache 25 One hash function per way. [Diagram: line address hashed by H1, H2, H3 to index ways 1-3] Hits require a single lookup → low hit energy and latency. Misses exploit the multiple hash functions to obtain an arbitrarily large number of replacement candidates: a multi-step process that draws on prior research on cuckoo hashing, and that happens infrequently (on misses) and off the critical path.
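The miss path can be sketched as a breadth-first expansion over the per-way hash functions. This is an illustrative model only: the hash functions, array sizes, and 16-candidate cap below are assumptions, not the actual MICRO'10 design.

```python
import random

class ZCacheSketch:
    """Toy zcache array: W ways, one hash function per way (modeled here as
    random multiplicative hashes), cuckoo-style candidate expansion on a miss."""

    def __init__(self, ways=3, sets=64, seed=0):
        rnd = random.Random(seed)
        # One (hypothetical) hash function per way.
        self.mults = [rnd.randrange(1, 1 << 30) | 1 for _ in range(ways)]
        self.ways, self.sets = ways, sets
        self.array = [[None] * sets for _ in range(ways)]  # array[way][set]

    def locations(self, addr):
        # One candidate location per way -- hits need a single lookup.
        return [(w, (addr * self.mults[w]) % self.sets) for w in range(self.ways)]

    def lookup(self, addr):
        return any(self.array[w][idx] == addr for w, idx in self.locations(addr))

    def candidates(self, addr, max_candidates=16):
        """On a miss, expand replacement candidates breadth-first: each
        resident candidate could be moved to its positions under the other
        hash functions, so the lines there become candidates too."""
        frontier = self.locations(addr)
        seen, cands = set(frontier), []
        while frontier and len(cands) < max_candidates:
            w, idx = frontier.pop(0)
            victim = self.array[w][idx]
            cands.append((w, idx, victim))
            if victim is not None:
                for loc in self.locations(victim):
                    if loc not in seen:
                        seen.add(loc)
                        frontier.append(loc)
        return cands
```

Only `candidates()` does multi-step work, and it runs on misses, off the critical path; `lookup()` always probes exactly one location per way.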
26-31 ZCache Replacement (example) A miss starts with one replacement candidate per way: the lines at the locations given by each hash function (e.g., A, K, X). Instead of evicting A, A can be moved to its location under another hash function, which makes the lines there (K or X) candidates as well; moving K or X in turn yields even more candidates. The replacement policy (LRU/LFU/RRIP, ...) chooses the best candidate among all of them, that line is evicted, and the chain of moves that frees a slot for the incoming line is performed.
32 ZCache Replacement 32 Hits always take a single lookup. [Diagram: hit on line Y via H1-H3] Replacements do not affect hit latency and are simple to implement.
33 Methodology 33 zsim: a fast, 1000-core, microarchitectural x86 simulator. Fast: parallel, leverages dynamic binary translation (Pin); ~50 Minstrs/s per host core, 600 Minstrs/s on a 12-core Xeon. Scalable: phase-based synchronization, simulates thousands of cores. Validated: within 10% of Atom and Nehalem systems. Simple: ~20 KLoC, used in research and courses at Stanford. zsim is integrated with existing area, energy, and latency models (McPAT, CACTI).
34 ZCache Benefits 34 [Plots: hit latency (ns), hit energy (nJ), performance, and energy efficiency for an 8MB shared LLC optimized for area, latency, and energy at 32nm; bars for 4-way SA, 32-way SA, and Z4/52] The zcache provides scalable associativity at low cost: the cost of a 4-way cache with associativity better than a 64-way cache, improving performance and energy efficiency over set-associative designs.
35 ZCache Associativity 35 The problem in defining associativity: it emerges from the array and the replacement policy together. Insight 1: with a zcache, replacement candidates are very close to uniformly distributed over the array. Insight 2: all replacement policies do the same thing: rank cache lines. Eviction priority: the rank of a line normalized to [0,1]. Example: with LRU, the LRU line has priority 1.0 and the MRU line 0.0. ZCache associativity depends only on the number of replacement candidates (R), independent of ways, workload, and replacement policy.
36 ZCache Associativity 36 Associativity: the probability distribution of the eviction priorities of evicted lines. ZCache associativity depends only on the number of replacement candidates (R): F_A(x) = Pr(A <= x) = x^R, x in [0,1]. With R=8, only ~2% of evictions fall in the 60% least evictable lines; with R=64, only ~10^-6 of evictions fall in the 80% least evictable lines.
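This distribution can be checked numerically: if the candidates' eviction priorities are uniform on [0,1] and the highest-priority candidate is evicted, the evicted priority is the maximum of R uniforms, whose CDF is x^R. A minimal Monte Carlo check (illustrative, not from the talk):

```python
import random

def evicted_priority_cdf(R, x, trials=200_000, seed=1):
    """Estimate Pr(evicted line's priority <= x) when each of R replacement
    candidates has a uniform [0,1] priority and the highest is evicted."""
    rnd = random.Random(seed)
    below = sum(max(rnd.random() for _ in range(R)) <= x
                for _ in range(trials))
    return below / trials

# With R=8, roughly 0.6**8 ~ 1.7% of evictions fall in the 60% least
# evictable lines, matching F_A(x) = x**R.
```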
37 ZCache Analytical Models 37 Analytical models are accurate in practice (14 workloads, 1024 cores). Theory: 1 in a million. Practice: 1 in 5
38 Partitioning 38 [Diagram: tiled CMP with per-core private caches, shared cache, directory, and main memory]
39 Partitioning 39 [Diagram: shared cache divided among VM1-VM6, each with private L2s] Cache partitioning techniques divide cache space explicitly. Isolation: virtualize the cache among applications and VMs. Efficiency: improve performance and fairness. Configurability: SW-controlled buffers (performance, security).
40 Partitioning Techniques 40 Strict partitioning schemes restrict line placement. Way partitioning: restrict insertions to specific ways; strict, but supports few partitions and degrades associativity. Soft partitioning schemes tweak the replacement policy. PIPP: insert and promote lines in the LRU chain depending on their partition; simple, but partitioning is approximate and replacement performance degrades.
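Way partitioning's associativity loss is easy to see in a sketch (the one-LRU-timestamp-per-way layout below is a hypothetical simplification):

```python
def way_partition_victim(lru_time, allowed_ways):
    """Strict way partitioning: a miss in a partition may only evict from
    that partition's assigned ways, so the partition's effective
    associativity is len(allowed_ways), not the full set associativity."""
    # Victim = least-recently-used line among the allowed ways only.
    return min(allowed_ways, key=lambda w: lru_time[w])

# In an 8-way set, a partition assigned ways {0, 1} effectively behaves
# like a 2-way cache, even though 8 ways physically exist.
```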
41 Partitioning with Vantage 41 Previous partitioning techniques have major drawbacks Not scalable, support few partitions Degrade performance Vantage solves deficiencies of previous techniques Scalable: Supports hundreds of fine-grain partitions Maintains high associativity and strict isolation among partitions (QoS)
42 Vantage Design 42 Vantage partitions most of the cache logically, by modifying the replacement process: no restrictions on line placement. [Diagram: cache split into a managed region and an unmanaged region]
43 Vantage Design 43 Vantage partitions the managed region. Incoming lines (misses) are inserted into their partition; each partition demotes its least wanted lines to the unmanaged region; evictions happen only from the unmanaged region → no interference. [Flow: insertions → partitions 0-3 → demotions → unmanaged region → evictions]
44 Controlling Demotions 44 Access B (partition 0): MISS. Get the replacement candidates (16): 5 in P1, 2 in P2, 6 in P3, 3 unmanaged. Evict from the unmanaged region and insert the new line (in partition 0). Demote? Always demoting from the inserting partition does not scale with the number of partitions. Instead, maintain sizes by matching each partition's demotion rate to its miss rate.
45 Demoting with Apertures 45 Aperture: the portion of candidates to demote from each partition (e.g., P0: 23%, P1: 15%, P2: 5%, P3: 11%). On a miss, each replacement candidate is checked against its partition's aperture: candidates in the top-aperture fraction of their partition (e.g., the top 11% of P3) are demoted, and the evicted line comes from the unmanaged region. On some misses nothing is demoted (no candidate falls within its partition's aperture); demotion rates match miss rates on average.
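The per-miss demotion check might look like the following sketch. The data layout and the [0,1] priority encoding are assumptions for illustration; the real design uses coarse timestamps (slide 48).

```python
def process_candidates(candidates, apertures):
    """candidates: list of (partition, eviction_priority) pairs, where
    partition is None for unmanaged lines and priority is in [0,1]
    (1.0 = least wanted). Demote managed candidates that fall within
    their partition's aperture; evict only from the unmanaged region."""
    demotions, evict = [], None
    for part, prio in candidates:
        if part is None:
            # Any unmanaged candidate is evictable; keep the least wanted.
            if evict is None or prio > evict:
                evict = prio
        elif prio >= 1.0 - apertures.get(part, 0.0):
            demotions.append((part, prio))  # within top-aperture fraction
    return demotions, evict
```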
46 Managing Apertures 46 Partition apertures can be derived analytically: A_i = (M_i / Σ_{k=1}^P M_k) · (Σ_{k=1}^P S_k / S_i) · (1/R). Intuition: aperture ~ miss rate (M_i) / size (S_i). Apertures are also capped at A_max: a higher aperture means lower partition associativity, so A_max ensures a high minimum associativity (e.g., A_max = 40% gives ~R=16 associativity). We just let partitions that need A_i > A_max grow.
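The formula and the property it enforces can be checked directly in code: without the cap, the expected number of demotions per miss across all partitions is Σ_i A_i · R · S_i / Σ S = Σ_i M_i / Σ M = 1, so partition sizes stay constant on average. A minimal sketch:

```python
def apertures(misses, sizes, R, a_max=0.4):
    """A_i = (M_i / sum(M)) * (sum(S) / S_i) * (1/R), capped at a_max.
    Partitions that would need a larger aperture are allowed to grow."""
    tot_m, tot_s = sum(misses), sum(sizes)
    raw = [(m / tot_m) * (tot_s / s) / R for m, s in zip(misses, sizes)]
    return [min(a, a_max) for a in raw]
```

Setting `a_max=1.0` disables the cap, which makes the one-demotion-per-miss invariant exact.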
47 Bounds on Size and Interference 47 The worst-case total growth of all partitions over their target sizes is bounded and small: Δ = 1/(A_max · R). Intuition: a Δ-sized partition is always stable, and multiple unstable partitions help each other demote. The bound is independent of the number of partitions! Assign an extra Δ to the unmanaged region: with R=52 and A_max=0.4, Δ ≈ 5% of the cache → bounded worst-case sizes and interference.
48 A Simple Vantage Controller 48 Use a negative feedback loop to derive apertures, and timestamps to determine which lines fall within the aperture: a practical implementation that maintains the analytical guarantees. Tags get an extra partition-ID field: each tag holds the partition (6b), a timestamp (8b), and coherence/valid bits, and the controller keeps 256 bits of state per partition. ~1% extra storage, growing with log(partitions); simple logic (~10 adders and comparators), not on the critical path.
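The feedback loop can be sketched as a simple proportional controller. The gain value and update rule here are illustrative assumptions, not the actual controller:

```python
def aperture_step(aperture, size, target, gain=0.05, a_max=0.4):
    """One step of the negative-feedback loop: a partition over its target
    size gets a wider aperture (demotes more); a partition under its
    target gets a narrower aperture (demotes less)."""
    error = (size - target) / target
    return max(0.0, min(a_max, aperture + gain * error))
```

Clamping at `a_max` preserves the minimum-associativity guarantee; partitions that saturate the cap simply grow, as in the analytical derivation.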
49 Vantage Evaluation 49 [Plot: throughput vs. an unpartitioned 64-way baseline; curves above 1.0 are better than unpartitioned, below 1.0 worse] 350 mixes on a 32-core CMP with a shared LLC (32 partitions); partitions sized to maximize throughput (utility-based partitioning). Each line shows throughput vs. the unpartitioned 64-way baseline. Way partitioning and PIPP degrade throughput for most workloads.
50 Vantage Evaluation 50 [Plot: same axes as slide 49] Vantage improves throughput for most workloads using a 4-way/52-candidate zcache; the other schemes cannot scale beyond a few cores.
51 Scaling Directories 51 [Diagram: tiled CMP with private caches, shared cache, directory, main memory] Scaling directories is hard: they incur excessive latency, energy, or area overheads, or become too complex; they introduce invalidations; and they cause interference.
52 Scalable Coherence Directory 52 Insights: Flexible sharer-set encoding: lines with few sharers use one entry, widely shared lines use multiple entries → scalability. Use zcache arrays: efficient high associativity and analytical models → negligible invalidations with minimal overprovisioning (~10%). SCD achieves scalability and performance guarantees: area and energy grow with log(cores), latency stays constant, and it is simple, requiring no modifications to the coherence protocol. At 1024 cores, SCD is 13x smaller than a sparse directory, and 2x smaller, faster, and simpler than a hierarchical directory.
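The flexible-encoding idea can be illustrated with a worst-case entry count. The field widths and two-level layout below are hypothetical; the actual SCD formats are in the HPCA'12 paper.

```python
import math

def scd_entries(num_sharers, cores=1024, ptrs_per_entry=3, bits_per_leaf=32):
    """Lines with few sharers fit in one limited-pointer entry; widely
    shared lines switch to a multi-entry bit-vector: one leaf entry per
    group of cores that has sharers, plus one root entry tracking groups."""
    if num_sharers <= ptrs_per_entry:
        return 1  # common case: a single limited-pointer entry
    # Worst case: sharers spread out, one leaf per occupied group.
    leaves = min(num_sharers, math.ceil(cores / bits_per_leaf))
    return 1 + leaves
```

Most lines have few sharers, so most use a single entry; storage per entry grows with log(cores) through the pointer and group-ID widths.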
53 Scalable Scheduling 53 Scheduling requirements: expose enough parallelism, locality awareness, load balancing, low overheads, and a bounded memory footprint. Dynamic vs. static schedulers: dynamic schedulers have poor locality and an unbounded footprint with non-trivial dependences; static schedulers produce great compile-time schedules, but do no load balancing and handle only regular apps.
54 Insight: Leverage the Programming Model 54 Solution: dynamic fine-grain scheduling techniques that leverage programming-model information to satisfy the requirements: expose all parallelism through fine-grain tasks, with locality-aware task queuing and load balancing and a bounded footprint. This makes dynamic scheduling practical in rich programming models (StreamIt, GRAMPS, Delite). Significant improvements over state-of-the-art schedulers on an existing 12-core, 24-thread Xeon SMP: up to 17x over dynamic schedulers (more parallelism, locality-aware, bounded footprint) and up to 5.3x over static schedulers (no load imbalance). The scheduler choice becomes more critical as we scale up!
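One ingredient, locality-aware task queuing with load balancing, can be sketched as per-core deques with nearest-neighbor stealing. This is an illustrative textbook pattern, not the PACT'11 scheduler:

```python
from collections import deque

def make_queues(n_cores):
    return [deque() for _ in range(n_cores)]

def push(queues, core, task):
    queues[core].append(task)  # enqueue locally: the task's data stays warm

def pop(queues, core):
    """Pop from the local deque; if it is empty, steal from the nearest
    non-empty neighbor so that stolen work is as local as possible."""
    if queues[core]:
        return queues[core].pop()  # LIFO locally: reuse warm cache lines
    for dist in range(1, len(queues)):
        for victim in ((core + dist) % len(queues),
                       (core - dist) % len(queues)):
            if queues[victim]:
                return queues[victim].popleft()  # FIFO steal: take cold work
    return None
```

LIFO local pops favor cache locality, while FIFO steals take the oldest (coldest) tasks; distance-ordered victim selection keeps stolen data close in the on-chip hierarchy.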
55 Hardware-Accelerated Schedulers 55 Fine-grain scheduling with 100+ threads is slow in software; hardware schedulers (e.g., GPUs) are fast but inflexible. Insight: software schedulers are dominated by communication. Solution: accelerate communication with simple hardware. ADM: asynchronous, register-to-register messages between threads, with small and scalable costs (~1KB buffers per core), virtualizable. ADM-accelerated fine-grain schedulers achieve the speed and scalability of HW plus the flexibility of SW: at 512 threads, 6.4x faster than SW and 70% faster than HW schedulers. ADM can accelerate other primitives as well (e.g., barriers, IPC).
56 Contributions 56 Scalable cache hierarchies: Efficient highly-associative caches [MICRO'10]. Scalable cache partitioning [ISCA'11, Top Picks'12]. Scalable coherence directories [HPCA'12]. Scalable scheduling: Efficient dynamic scheduling by leveraging programming-model information [PACT'11]. Hardware-accelerated scheduling [ASPLOS'10].
57 Conclusions 57 Scaling to 1000 cores requires HW and SW techniques: scale hardware with highly efficient caches that support scalable partitioning and coherence, and scale software with dynamic, fine-grain, HW-accelerated scheduling.
58 Acknowledgements Christos Research group: Jacob, David, Richard, Christina, Woongki, Austen, Mike, Hari PPL faculty: Kunle, Bill, Mark, Pat, Mendel, John, Alex PPL students: George, Jeremy, Defense committee: Bill, Kunle, Nick Family & friends Borja, Gemma, Idoia, Carlos, Felix, Dani, Manuel, Gonzalo, Adrian, Christina, George, Yiannis, Sotiria, Alexandros, Nadine, Martin, Elliot, Nick, Steph, Olivier, Leen, John, Sam, Mario, Nicole, Cristina, Kshipra, Robert, Erik, 58
59 THANK YOU FOR YOUR ATTENTION QUESTIONS?
More informationScalable Enterprise Networks with Inexpensive Switches
Scalable Enterprise Networks with Inexpensive Switches Minlan Yu minlanyu@cs.princeton.edu Princeton University Joint work with Alex Fabrikant, Mike Freedman, Jennifer Rexford and Jia Wang 1 Enterprises
More informationWrite only as much as necessary. Be brief!
1 CIS371 Computer Organization and Design Final Exam Prof. Martin Wednesday, May 2nd, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached (with
More informationChapter 5B. Large and Fast: Exploiting Memory Hierarchy
Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,
More informationShared Symmetric Memory Systems
Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationBandwidth Adaptive Snooping
Two classes of multiprocessors Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet Project Computer Sciences Department University of Wisconsin
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More informationImproving Virtual Machine Scheduling in NUMA Multicore Systems
Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore
More informationTowards Energy-Proportional Datacenter Memory with Mobile DRAM
Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University
More informationCHAPTER 8 - MEMORY MANAGEMENT STRATEGIES
CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide
More informationNo Tradeoff Low Latency + High Efficiency
No Tradeoff Low Latency + High Efficiency Christos Kozyrakis http://mast.stanford.edu Latency-critical Applications A growing class of online workloads Search, social networking, software-as-service (SaaS),
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationChapter 8: Main Memory. Operating System Concepts 9 th Edition
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationTDT 4260 lecture 3 spring semester 2015
1 TDT 4260 lecture 3 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU http://research.idi.ntnu.no/multicore 2 Lecture overview Repetition Chap.1: Performance,
More informationCaches. Samira Khan March 23, 2017
Caches Samira Khan March 23, 2017 Agenda Review from last lecture Data flow model Memory hierarchy More Caches The Dataflow Model (of a Computer) Von Neumann model: An instruction is fetched and executed
More informationVirtualization and memory hierarchy
Virtualization and memory hierarchy Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department
More informationHardware Support for NVM Programming
Hardware Support for NVM Programming 1 Outline Ordering Transactions Write endurance 2 Volatile Memory Ordering Write-back caching Improves performance Reorders writes to DRAM STORE A STORE B CPU CPU B
More informationSimultaneous Multithreading on Pentium 4
Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on
More informationCOEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory
1 COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationChapter 8: Main Memory
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationChapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs
Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple
More informationData-Centric Execution of Speculative Parallel Programs MARK JEFFREY, SUVINAY SUBRAMANIAN, MALEEN ABEYDEERA, JOEL EMER, DANIEL SANCHEZ MICRO 2016
Data-entric Execution of Speculative Parallel Programs MARK JEFFREY, SUVINAY SUBRAMANIAN, MALEEN ABEYDEERA, JOEL EMER, DANIEL SANHEZ MIRO 26 Executive summary Many-cores must exploit cache locality to
More informationLecture notes for CS Chapter 2, part 1 10/23/18
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationA Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps
A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,
More informationDesign and Implementation of Signatures in Transactional Memory Systems
Design and Implementation of Signatures in Transactional Memory Systems Daniel Sanchez August 2007 University of Wisconsin-Madison Outline Introduction and motivation Bloom filters Bloom signatures Area
More informationI, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.
5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000
More informationChapter 8. Virtual Memory
Operating System Chapter 8. Virtual Memory Lynn Choi School of Electrical Engineering Motivated by Memory Hierarchy Principles of Locality Speed vs. size vs. cost tradeoff Locality principle Spatial Locality:
More informationSpring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand
Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications
More informationPostprint. This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden.
http://www.diva-portal.org Postprint This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden. Citation for the original published paper: Ceballos, G., Black-Schaffer,
More informationWALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems
: A Writeback-Aware LLC Management for PCM-based Main Memory Systems Bahareh Pourshirazi *, Majed Valad Beigi, Zhichun Zhu *, and Gokhan Memik * University of Illinois at Chicago Northwestern University
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More information12 Cache-Organization 1
12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required
More informationTowards Energy Proportionality for Large-Scale Latency-Critical Workloads
Towards Energy Proportionality for Large-Scale Latency-Critical Workloads David Lo *, Liqun Cheng *, Rama Govindaraju *, Luiz André Barroso *, Christos Kozyrakis Stanford University * Google Inc. 2012
More informationLecture-16 (Cache Replacement Policies) CS422-Spring
Lecture-16 (Cache Replacement Policies) CS422-Spring 2018 Biswa@CSE-IITK 1 2 4 8 16 32 64 128 From SPEC92 Miss rate: Still Applicable Today 0.14 0.12 0.1 0.08 0.06 0.04 1-way 2-way 4-way 8-way Capacity
More informationRethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization
Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,
More informationComputer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015
18-447 Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 Assignments Lab 7 out Due April 17 HW 6 Due Friday (April 10) Midterm II April
More informationMemory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts
Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of
More informationTales of the Tail Hardware, OS, and Application-level Sources of Tail Latency
Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports and Steven D. Gribble February 2, 2015 1 Introduction What is Tail Latency? What
More informationBasics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
More information