SCALING HARDWARE AND SOFTWARE
1 SCALING HARDWARE AND SOFTWARE FOR THOUSAND-CORE SYSTEMS Daniel Sanchez Electrical Engineering Stanford University
2 Multicore Scalability 2 [Plot: transistors (millions) and aggregate multicore performance (MIPS) over time, log scale] Multicore is key to the future of computing. Scaling performance is hard, even with a lot of parallelism.
3 Memory is Critical 3 Memory limits performance and energy efficiency. Basic indicators: 64-bit FP op: ~1ns latency, ~20pJ energy; shared cache access: ~10ns latency, ~1nJ energy; DRAM access: ~100ns latency, ~20nJ energy. HW & SW must optimize memory performance.
4 Multicore Memory Hierarchy 4 [Diagram: cores with private caches, shared last-level cache, coherence directory, main memory] Per-core private caches: fast access to the critical working set; should satisfy most accesses. Shared last-level cache: increases utilization; accelerates communication; can be partitioned for isolation. Coherence protocol: makes caches transparent to SW; uses a directory to track sharers.
5-9 Memory Hierarchy Challenges at 1K Cores The memory hierarchy is hard to scale: 1. Directories scale poorly. 2. Conflicts in caches & directory are more frequent. 3. The shared cache cannot be partitioned efficiently. 4. No isolation or QoS due to the shared cache and directory.
10 Scaling Parallel Runtimes 10 The parallel runtime maps the application to hardware: resource management and scheduling. The runtime is fundamental to scaling with manageable complexity. [Stack: Parallel Application → Parallel Runtime → Operating System → Hardware]
11-12 Scheduling Parallel Applications The application is divided into parallel tasks with different requirements, which may have dependences. The scheduler assigns tasks to cores.
13-19 Runtime & Scheduling Challenges Constrained parallelism: coarser tasks, unneeded serialization. Increased cache misses. Load imbalance. Scheduling overheads. Excessive memory footprint (crash!). These are conflicting issues: we need smart algorithms!
20 Contributions 20 Scalable cache hierarchies: Efficient highly-associative caches [MICRO'10]. Scalable cache partitioning [ISCA'11, Top Picks'12]. Scalable coherence directories [HPCA'12]. Scalable scheduling: Efficient dynamic scheduling by leveraging programming-model information [PACT'11]. Hardware-accelerated scheduling [ASPLOS'10].
21 This Talk 21 Scalable cache hierarchies: Efficient highly-associative caches [MICRO'10]. Scalable cache partitioning [ISCA'11, Top Picks'12]. Scalable coherence directories [HPCA'12]. Scalable scheduling: Efficient dynamic scheduling by leveraging programming-model information [PACT'11]. Hardware-accelerated scheduling [ASPLOS'10].
22 Rethinking Common-Case Design 22 Conventional approach: make the common case fast, based on patterns of past and current workloads, and overprovision to mitigate the worst case or for future workloads. Multicore demands going beyond the common case: resources are shared, so we need guarantees in all cases, and overprovisioning alone is insufficient and wasteful (though some overprovisioning simplifies design). We must provide guarantees with minimal overprovisioning. Root cause: empirical design, with limited understanding of system behavior.
23 Solution: Analytical Design Approach 23 Design basic components that are easily analyzable: simple, accurate, workload-independent analytical models that make behavior easy to understand and reason about. Use the models to design systems that work well in all cases: scalability and QoS guaranteed in all scenarios, while outperforming conventional techniques in the common case. This requires revisiting fundamental aspects of our systems (associativity, coherence, ...).
24 Set-Associative Caches 24 Basic building block of caches and directories. [Diagram: line address → hash function H → set index; 4 ways × 8 sets] Problems: reducing conflicts (higher associativity) requires more ways, which means higher energy, latency, and area; and conflicts depend on the workload's access patterns.
25 ZCache 25 One hash function per way. [Diagram: line address hashed by H1, H2, H3 to index ways 1-3] Hits require a single lookup → low hit energy and latency. Misses exploit the multiple hash functions to obtain an arbitrarily large number of replacement candidates: a multi-step process that draws on prior research on cuckoo hashing, and that happens infrequently (on misses) and off the critical path.
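The miss path can be sketched as a breadth-first expansion over the per-way hash functions. This is an illustrative model only: the hash functions, array sizes, and 16-candidate cap below are assumptions, not the actual MICRO'10 design.

```python
import random

class ZCacheSketch:
    """Toy zcache array: W ways, one hash function per way (modeled here as
    random multiplicative hashes), cuckoo-style candidate expansion on a miss."""

    def __init__(self, ways=3, sets=64, seed=0):
        rnd = random.Random(seed)
        # One (hypothetical) hash function per way.
        self.mults = [rnd.randrange(1, 1 << 30) | 1 for _ in range(ways)]
        self.ways, self.sets = ways, sets
        self.array = [[None] * sets for _ in range(ways)]  # array[way][set]

    def locations(self, addr):
        # One candidate location per way -- hits need a single lookup.
        return [(w, (addr * self.mults[w]) % self.sets) for w in range(self.ways)]

    def lookup(self, addr):
        return any(self.array[w][idx] == addr for w, idx in self.locations(addr))

    def candidates(self, addr, max_candidates=16):
        """On a miss, expand replacement candidates breadth-first: each
        resident candidate could be moved to its positions under the other
        hash functions, so the lines there become candidates too."""
        frontier = self.locations(addr)
        seen, cands = set(frontier), []
        while frontier and len(cands) < max_candidates:
            w, idx = frontier.pop(0)
            victim = self.array[w][idx]
            cands.append((w, idx, victim))
            if victim is not None:
                for loc in self.locations(victim):
                    if loc not in seen:
                        seen.add(loc)
                        frontier.append(loc)
        return cands
```

Only `candidates()` does multi-step work, and it runs on misses, off the critical path; `lookup()` always probes exactly one location per way.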
26-31 ZCache Replacement (example) A miss starts with one replacement candidate per way: the lines at the locations given by each hash function (e.g., A, K, X). Instead of evicting A, A can be moved to its location under another hash function, which makes the lines there (K or X) candidates as well; moving K or X in turn yields even more candidates. The replacement policy (LRU/LFU/RRIP, ...) chooses the best candidate among all of them, that line is evicted, and the chain of moves that frees a slot for the incoming line is performed.
32 ZCache Replacement 32 Hits always take a single lookup. [Diagram: hit on line Y via H1-H3] Replacements do not affect hit latency and are simple to implement.
33 Methodology 33 zsim: a fast, 1000-core, microarchitectural x86 simulator. Fast: parallel, leverages dynamic binary translation (Pin); ~50 Minstrs/s per host core, 600 Minstrs/s on a 12-core Xeon. Scalable: phase-based synchronization, simulates thousands of cores. Validated: within 10% of Atom and Nehalem systems. Simple: ~20 KLoC, used in research and courses at Stanford. zsim is integrated with existing area, energy, and latency models (McPAT, CACTI).
34 ZCache Benefits 34 [Plots: hit latency (ns), hit energy (nJ), performance, and energy efficiency for an 8MB shared LLC optimized for area, latency, and energy at 32nm; bars for 4-way SA, 32-way SA, and Z4/52] The zcache provides scalable associativity at low cost: the cost of a 4-way cache with associativity better than a 64-way cache, improving performance and energy efficiency over set-associative designs.
35 ZCache Associativity 35 The problem in defining associativity: it emerges from the array and the replacement policy together. Insight 1: with a zcache, replacement candidates are very close to uniformly distributed over the array. Insight 2: all replacement policies do the same thing: rank cache lines. Eviction priority: the rank of a line normalized to [0,1]. Example: with LRU, the LRU line has priority 1.0 and the MRU line 0.0. ZCache associativity depends only on the number of replacement candidates (R), independent of ways, workload, and replacement policy.
36 ZCache Associativity 36 Associativity: the probability distribution of the eviction priorities of evicted lines. ZCache associativity depends only on the number of replacement candidates (R): F_A(x) = Pr(A <= x) = x^R, x in [0,1]. With R=8, only ~2% of evictions fall in the 60% least evictable lines; with R=64, only ~10^-6 of evictions fall in the 80% least evictable lines.
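This distribution can be checked numerically: if the candidates' eviction priorities are uniform on [0,1] and the highest-priority candidate is evicted, the evicted priority is the maximum of R uniforms, whose CDF is x^R. A minimal Monte Carlo check (illustrative, not from the talk):

```python
import random

def evicted_priority_cdf(R, x, trials=200_000, seed=1):
    """Estimate Pr(evicted line's priority <= x) when each of R replacement
    candidates has a uniform [0,1] priority and the highest is evicted."""
    rnd = random.Random(seed)
    below = sum(max(rnd.random() for _ in range(R)) <= x
                for _ in range(trials))
    return below / trials

# With R=8, roughly 0.6**8 ~ 1.7% of evictions fall in the 60% least
# evictable lines, matching F_A(x) = x**R.
```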
37 ZCache Analytical Models 37 Analytical models are accurate in practice (14 workloads, 1024 cores). Theory: 1 in a million. Practice: 1 in 5
38 Partitioning 38 [Diagram: tiled CMP with per-core private caches, shared cache, directory, and main memory]
39 Partitioning 39 [Diagram: shared cache divided among VM1-VM6, each with private L2s] Cache partitioning techniques divide cache space explicitly. Isolation: virtualize the cache among applications and VMs. Efficiency: improve performance and fairness. Configurability: SW-controlled buffers (performance, security).
40 Partitioning Techniques 40 Strict partitioning schemes restrict line placement. Way partitioning: restrict insertions to specific ways; strict, but supports few partitions and degrades associativity. Soft partitioning schemes tweak the replacement policy. PIPP: insert and promote lines in the LRU chain depending on their partition; simple, but partitioning is approximate and replacement performance degrades.
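Way partitioning's associativity loss is easy to see in a sketch (the one-LRU-timestamp-per-way layout below is a hypothetical simplification):

```python
def way_partition_victim(lru_time, allowed_ways):
    """Strict way partitioning: a miss in a partition may only evict from
    that partition's assigned ways, so the partition's effective
    associativity is len(allowed_ways), not the full set associativity."""
    # Victim = least-recently-used line among the allowed ways only.
    return min(allowed_ways, key=lambda w: lru_time[w])

# In an 8-way set, a partition assigned ways {0, 1} effectively behaves
# like a 2-way cache, even though 8 ways physically exist.
```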
41 Partitioning with Vantage 41 Previous partitioning techniques have major drawbacks Not scalable, support few partitions Degrade performance Vantage solves deficiencies of previous techniques Scalable: Supports hundreds of fine-grain partitions Maintains high associativity and strict isolation among partitions (QoS)
42 Vantage Design 42 Vantage partitions most of the cache logically, by modifying the replacement process: no restrictions on line placement. [Diagram: cache split into a managed region and an unmanaged region]
43 Vantage Design 43 Vantage partitions the managed region. Incoming lines (misses) are inserted into their partition; each partition demotes its least wanted lines to the unmanaged region; evictions happen only from the unmanaged region → no interference. [Flow: insertions → partitions 0-3 → demotions → unmanaged region → evictions]
44 Controlling Demotions 44 Access B (partition 0): MISS. Get the replacement candidates (16): 5 in P1, 2 in P2, 6 in P3, 3 unmanaged. Evict from the unmanaged region and insert the new line (in partition 0). Demote? Always demoting from the inserting partition does not scale with the number of partitions. Instead, maintain sizes by matching each partition's demotion rate to its miss rate.
45 Demoting with Apertures 45 Aperture: the portion of candidates to demote from each partition (e.g., P0: 23%, P1: 15%, P2: 5%, P3: 11%). On a miss, each replacement candidate is checked against its partition's aperture: candidates in the top-aperture fraction of their partition (e.g., the top 11% of P3) are demoted, and the evicted line comes from the unmanaged region. On some misses nothing is demoted (no candidate falls within its partition's aperture); demotion rates match miss rates on average.
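The per-miss demotion check might look like the following sketch. The data layout and the [0,1] priority encoding are assumptions for illustration; the real design uses coarse timestamps (slide 48).

```python
def process_candidates(candidates, apertures):
    """candidates: list of (partition, eviction_priority) pairs, where
    partition is None for unmanaged lines and priority is in [0,1]
    (1.0 = least wanted). Demote managed candidates that fall within
    their partition's aperture; evict only from the unmanaged region."""
    demotions, evict = [], None
    for part, prio in candidates:
        if part is None:
            # Any unmanaged candidate is evictable; keep the least wanted.
            if evict is None or prio > evict:
                evict = prio
        elif prio >= 1.0 - apertures.get(part, 0.0):
            demotions.append((part, prio))  # within top-aperture fraction
    return demotions, evict
```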
46 Managing Apertures 46 Partition apertures can be derived analytically: A_i = (M_i / Σ_{k=1}^P M_k) · (Σ_{k=1}^P S_k / S_i) · (1/R). Intuition: aperture ~ miss rate (M_i) / size (S_i). Apertures are also capped at A_max: a higher aperture means lower partition associativity, so A_max ensures a high minimum associativity (e.g., A_max = 40% gives ~R=16 associativity). We just let partitions that need A_i > A_max grow.
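The formula and the property it enforces can be checked directly in code: without the cap, the expected number of demotions per miss across all partitions is Σ_i A_i · R · S_i / Σ S = Σ_i M_i / Σ M = 1, so partition sizes stay constant on average. A minimal sketch:

```python
def apertures(misses, sizes, R, a_max=0.4):
    """A_i = (M_i / sum(M)) * (sum(S) / S_i) * (1/R), capped at a_max.
    Partitions that would need a larger aperture are allowed to grow."""
    tot_m, tot_s = sum(misses), sum(sizes)
    raw = [(m / tot_m) * (tot_s / s) / R for m, s in zip(misses, sizes)]
    return [min(a, a_max) for a in raw]
```

Setting `a_max=1.0` disables the cap, which makes the one-demotion-per-miss invariant exact.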
47 Bounds on Size and Interference 47 The worst-case total growth of all partitions over their target sizes is bounded and small: Δ = 1/(A_max · R). Intuition: a Δ-sized partition is always stable, and multiple unstable partitions help each other demote. The bound is independent of the number of partitions! Assign an extra Δ to the unmanaged region: with R=52 and A_max=0.4, Δ ≈ 5% of the cache → bounded worst-case sizes and interference.
48 A Simple Vantage Controller 48 Use a negative feedback loop to derive apertures, and timestamps to determine which lines fall within the aperture: a practical implementation that maintains the analytical guarantees. Tags get an extra partition-ID field: each tag holds the partition (6b), a timestamp (8b), and coherence/valid bits, and the controller keeps 256 bits of state per partition. ~1% extra storage, growing with log(partitions); simple logic (~10 adders and comparators), not on the critical path.
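The feedback loop can be sketched as a simple proportional controller. The gain value and update rule here are illustrative assumptions, not the actual controller:

```python
def aperture_step(aperture, size, target, gain=0.05, a_max=0.4):
    """One step of the negative-feedback loop: a partition over its target
    size gets a wider aperture (demotes more); a partition under its
    target gets a narrower aperture (demotes less)."""
    error = (size - target) / target
    return max(0.0, min(a_max, aperture + gain * error))
```

Clamping at `a_max` preserves the minimum-associativity guarantee; partitions that saturate the cap simply grow, as in the analytical derivation.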
49 Vantage Evaluation 49 [Plot: throughput vs. an unpartitioned 64-way baseline; curves above 1.0 are better than unpartitioned, below 1.0 worse] 350 mixes on a 32-core CMP with a shared LLC (32 partitions); partitions sized to maximize throughput (utility-based partitioning). Each line shows throughput vs. the unpartitioned 64-way baseline. Way partitioning and PIPP degrade throughput for most workloads.
50 Vantage Evaluation 50 [Plot: same axes as slide 49] Vantage improves throughput for most workloads using a 4-way/52-candidate zcache; the other schemes cannot scale beyond a few cores.
51 Scaling Directories 51 [Diagram: tiled CMP with private caches, shared cache, directory, main memory] Scaling directories is hard: they incur excessive latency, energy, or area overheads, or become too complex; they introduce invalidations; and they cause interference.
52 Scalable Coherence Directory 52 Insights: Flexible sharer-set encoding: lines with few sharers use one entry, widely shared lines use multiple entries → scalability. Use zcache arrays: efficient high associativity and analytical models → negligible invalidations with minimal overprovisioning (~10%). SCD achieves scalability and performance guarantees: area and energy grow with log(cores), latency stays constant, and it is simple, requiring no modifications to the coherence protocol. At 1024 cores, SCD is 13x smaller than a sparse directory, and 2x smaller, faster, and simpler than a hierarchical directory.
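The flexible-encoding idea can be illustrated with a worst-case entry count. The field widths and two-level layout below are hypothetical; the actual SCD formats are in the HPCA'12 paper.

```python
import math

def scd_entries(num_sharers, cores=1024, ptrs_per_entry=3, bits_per_leaf=32):
    """Lines with few sharers fit in one limited-pointer entry; widely
    shared lines switch to a multi-entry bit-vector: one leaf entry per
    group of cores that has sharers, plus one root entry tracking groups."""
    if num_sharers <= ptrs_per_entry:
        return 1  # common case: a single limited-pointer entry
    # Worst case: sharers spread out, one leaf per occupied group.
    leaves = min(num_sharers, math.ceil(cores / bits_per_leaf))
    return 1 + leaves
```

Most lines have few sharers, so most use a single entry; storage per entry grows with log(cores) through the pointer and group-ID widths.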
53 Scalable Scheduling 53 Scheduling requirements: expose enough parallelism, locality awareness, load balancing, low overheads, and a bounded memory footprint. Dynamic vs. static schedulers: dynamic schedulers have poor locality and an unbounded footprint with non-trivial dependences; static schedulers produce great compile-time schedules, but do no load balancing and handle only regular apps.
54 Insight: Leverage the Programming Model 54 Solution: dynamic fine-grain scheduling techniques that leverage programming-model information to satisfy the requirements: expose all parallelism through fine-grain tasks, with locality-aware task queuing and load balancing and a bounded footprint. This makes dynamic scheduling practical in rich programming models (StreamIt, GRAMPS, Delite). Significant improvements over state-of-the-art schedulers on an existing 12-core, 24-thread Xeon SMP: up to 17x over dynamic schedulers (more parallelism, locality-aware, bounded footprint) and up to 5.3x over static schedulers (no load imbalance). The scheduler choice becomes more critical as we scale up!
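One ingredient, locality-aware task queuing with load balancing, can be sketched as per-core deques with nearest-neighbor stealing. This is an illustrative textbook pattern, not the PACT'11 scheduler:

```python
from collections import deque

def make_queues(n_cores):
    return [deque() for _ in range(n_cores)]

def push(queues, core, task):
    queues[core].append(task)  # enqueue locally: the task's data stays warm

def pop(queues, core):
    """Pop from the local deque; if it is empty, steal from the nearest
    non-empty neighbor so that stolen work is as local as possible."""
    if queues[core]:
        return queues[core].pop()  # LIFO locally: reuse warm cache lines
    for dist in range(1, len(queues)):
        for victim in ((core + dist) % len(queues),
                       (core - dist) % len(queues)):
            if queues[victim]:
                return queues[victim].popleft()  # FIFO steal: take cold work
    return None
```

LIFO local pops favor cache locality, while FIFO steals take the oldest (coldest) tasks; distance-ordered victim selection keeps stolen data close in the on-chip hierarchy.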
55 Hardware-Accelerated Schedulers 55 Fine-grain scheduling with 100+ threads is slow in software; hardware schedulers (e.g., GPUs) are fast but inflexible. Insight: software schedulers are dominated by communication. Solution: accelerate communication with simple hardware. ADM: asynchronous, register-to-register messages between threads, with small and scalable costs (~1KB buffers per core), virtualizable. ADM-accelerated fine-grain schedulers achieve the speed and scalability of HW plus the flexibility of SW: at 512 threads, 6.4x faster than SW and 70% faster than HW schedulers. ADM can accelerate other primitives as well (e.g., barriers, IPC).
56 Contributions 56 Scalable cache hierarchies: Efficient highly-associative caches [MICRO'10]. Scalable cache partitioning [ISCA'11, Top Picks'12]. Scalable coherence directories [HPCA'12]. Scalable scheduling: Efficient dynamic scheduling by leveraging programming-model information [PACT'11]. Hardware-accelerated scheduling [ASPLOS'10].
57 Conclusions 57 Scaling to 1000 cores requires HW and SW techniques: scale hardware with highly efficient caches that support scalable partitioning and coherence, and scale software with dynamic, fine-grain, HW-accelerated scheduling.
58 Acknowledgements Christos Research group: Jacob, David, Richard, Christina, Woongki, Austen, Mike, Hari PPL faculty: Kunle, Bill, Mark, Pat, Mendel, John, Alex PPL students: George, Jeremy, Defense committee: Bill, Kunle, Nick Family & friends Borja, Gemma, Idoia, Carlos, Felix, Dani, Manuel, Gonzalo, Adrian, Christina, George, Yiannis, Sotiria, Alexandros, Nadine, Martin, Elliot, Nick, Steph, Olivier, Leen, John, Sam, Mario, Nicole, Cristina, Kshipra, Robert, Erik, 58
59 THANK YOU FOR YOUR ATTENTION QUESTIONS?
More informationScalable Enterprise Networks with Inexpensive Switches
Scalable Enterprise Networks with Inexpensive Switches Minlan Yu minlanyu@cs.princeton.edu Princeton University Joint work with Alex Fabrikant, Mike Freedman, Jennifer Rexford and Jia Wang 1 Enterprises
More informationWrite only as much as necessary. Be brief!
1 CIS371 Computer Organization and Design Final Exam Prof. Martin Wednesday, May 2nd, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached (with
More informationChapter 5B. Large and Fast: Exploiting Memory Hierarchy
Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,
More informationShared Symmetric Memory Systems
Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationBandwidth Adaptive Snooping
Two classes of multiprocessors Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet Project Computer Sciences Department University of Wisconsin
More informationIntroducing the Cray XMT. Petr Konecny May 4 th 2007
Introducing the Cray XMT Petr Konecny May 4 th 2007 Agenda Origins of the Cray XMT Cray XMT system architecture Cray XT infrastructure Cray Threadstorm processor Shared memory programming model Benefits/drawbacks/solutions
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More informationImproving Virtual Machine Scheduling in NUMA Multicore Systems
Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore
More informationTowards Energy-Proportional Datacenter Memory with Mobile DRAM
Towards Energy-Proportional Datacenter Memory with Mobile DRAM Krishna Malladi 1 Frank Nothaft 1 Karthika Periyathambi Benjamin Lee 2 Christos Kozyrakis 1 Mark Horowitz 1 Stanford University 1 Duke University
More informationCHAPTER 8 - MEMORY MANAGEMENT STRATEGIES
CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide
More informationNo Tradeoff Low Latency + High Efficiency
No Tradeoff Low Latency + High Efficiency Christos Kozyrakis http://mast.stanford.edu Latency-critical Applications A growing class of online workloads Search, social networking, software-as-service (SaaS),
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationChapter 8: Main Memory. Operating System Concepts 9 th Edition
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationTDT 4260 lecture 3 spring semester 2015
1 TDT 4260 lecture 3 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU http://research.idi.ntnu.no/multicore 2 Lecture overview Repetition Chap.1: Performance,
More informationCaches. Samira Khan March 23, 2017
Caches Samira Khan March 23, 2017 Agenda Review from last lecture Data flow model Memory hierarchy More Caches The Dataflow Model (of a Computer) Von Neumann model: An instruction is fetched and executed
More informationVirtualization and memory hierarchy
Virtualization and memory hierarchy Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department
More informationHardware Support for NVM Programming
Hardware Support for NVM Programming 1 Outline Ordering Transactions Write endurance 2 Volatile Memory Ordering Write-back caching Improves performance Reorders writes to DRAM STORE A STORE B CPU CPU B
More informationSimultaneous Multithreading on Pentium 4
Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on
More informationCOEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory
1 COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationChapter 8: Main Memory
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationChapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs
Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple
More informationData-Centric Execution of Speculative Parallel Programs MARK JEFFREY, SUVINAY SUBRAMANIAN, MALEEN ABEYDEERA, JOEL EMER, DANIEL SANCHEZ MICRO 2016
Data-entric Execution of Speculative Parallel Programs MARK JEFFREY, SUVINAY SUBRAMANIAN, MALEEN ABEYDEERA, JOEL EMER, DANIEL SANHEZ MIRO 26 Executive summary Many-cores must exploit cache locality to
More informationLecture notes for CS Chapter 2, part 1 10/23/18
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationA Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps
A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,
More informationDesign and Implementation of Signatures in Transactional Memory Systems
Design and Implementation of Signatures in Transactional Memory Systems Daniel Sanchez August 2007 University of Wisconsin-Madison Outline Introduction and motivation Bloom filters Bloom signatures Area
More informationI, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.
5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000
More informationChapter 8. Virtual Memory
Operating System Chapter 8. Virtual Memory Lynn Choi School of Electrical Engineering Motivated by Memory Hierarchy Principles of Locality Speed vs. size vs. cost tradeoff Locality principle Spatial Locality:
More informationSpring 2016 :: CSE 502 Computer Architecture. Caches. Nima Honarmand
Caches Nima Honarmand Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required by all of the running applications
More informationPostprint. This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden.
http://www.diva-portal.org Postprint This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden. Citation for the original published paper: Ceballos, G., Black-Schaffer,
More informationWALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems
: A Writeback-Aware LLC Management for PCM-based Main Memory Systems Bahareh Pourshirazi *, Majed Valad Beigi, Zhichun Zhu *, and Gokhan Memik * University of Illinois at Chicago Northwestern University
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More information12 Cache-Organization 1
12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required
More informationTowards Energy Proportionality for Large-Scale Latency-Critical Workloads
Towards Energy Proportionality for Large-Scale Latency-Critical Workloads David Lo *, Liqun Cheng *, Rama Govindaraju *, Luiz André Barroso *, Christos Kozyrakis Stanford University * Google Inc. 2012
More informationLecture-16 (Cache Replacement Policies) CS422-Spring
Lecture-16 (Cache Replacement Policies) CS422-Spring 2018 Biswa@CSE-IITK 1 2 4 8 16 32 64 128 From SPEC92 Miss rate: Still Applicable Today 0.14 0.12 0.1 0.08 0.06 0.04 1-way 2-way 4-way 8-way Capacity
More informationRethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization
Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,
More informationComputer Architecture Lecture 27: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015
18-447 Computer Architecture Lecture 27: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 4/6/2015 Assignments Lab 7 out Due April 17 HW 6 Due Friday (April 10) Midterm II April
More informationMemory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts
Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of
More informationTales of the Tail Hardware, OS, and Application-level Sources of Tail Latency
Tales of the Tail Hardware, OS, and Application-level Sources of Tail Latency Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports and Steven D. Gribble February 2, 2015 1 Introduction What is Tail Latency? What
More informationBasics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
More information