SCALING HARDWARE AND SOFTWARE


1 SCALING HARDWARE AND SOFTWARE FOR THOUSAND-CORE SYSTEMS Daniel Sanchez Electrical Engineering Stanford University

2 Multicore Scalability [Chart: multicore performance (MIPS) and transistors (millions) over time] Multicore is key to the future of computing. Scaling performance is hard, even with a lot of parallelism.

3 Memory is Critical 3 Memory limits performance and energy efficiency Basic indicators: 64-bit FP op: ~1ns latency, ~20pJ energy Shared cache access: ~10ns latency, ~1nJ energy DRAM access: ~100ns latency, ~20nJ energy HW & SW must optimize memory performance

4 Multicore Memory Hierarchy [Diagram: cores with private caches, a shared cache, a coherence directory, and main memory] Per-core private caches: fast access to the critical working set; should satisfy most accesses. Shared last-level cache: increases utilization, accelerates communication, and can be partitioned for isolation. Coherence protocol: makes caches transparent to SW; uses a directory to track sharers.

5-9 Memory Hierarchy Challenges at 1K Cores [Diagram: cores with private caches, shared cache, coherence directory, main memory] The memory hierarchy is hard to scale: 1. Directories scale poorly 2. Conflicts in caches & directory are more frequent 3. The shared cache cannot be partitioned efficiently 4. No isolation or QoS due to the shared cache and directory

10 Scaling Parallel Runtimes The parallel runtime maps the application to hardware: resource management and scheduling. The runtime is fundamental to scaling with manageable complexity. [Stack: Parallel Application / Parallel Runtime / Operating System / Hardware]

11-12 Scheduling Parallel Applications [Diagram: application tasks mapped onto the memory hierarchy] The application is split into parallel tasks with different requirements, which may have dependences. The scheduler assigns tasks to cores.

13-19 Runtime & Scheduling Challenges [Diagram: tasks mapped onto the cores of the memory hierarchy] Constrained parallelism (coarser tasks, unneeded serialization), increased cache misses, load imbalance, scheduling overheads, and excessive memory footprint (crash!). These are conflicting issues; we need smart algorithms!

20 Contributions 20 Scalable cache hierarchies: Efficient highly-associative caches [MICRO 10] Scalable cache partitioning [ISCA 11, Top Picks 12] Scalable coherence directories [HPCA 12] Scalable scheduling: Efficient dynamic scheduling by leveraging programming model information [PACT 11] Hardware-accelerated scheduling [ASPLOS 10]

21 This Talk 21 Scalable cache hierarchies: Efficient highly-associative caches [MICRO 10] Scalable cache partitioning [ISCA 11, Top Picks 12] Scalable coherence directories [HPCA 12] Scalable scheduling: Efficient dynamic scheduling by leveraging programming model information [PACT 11] Hardware-accelerated scheduling [ASPLOS 10]

22 Rethinking Common-Case Design Conventional approach: make the common case fast, based on patterns of past and current workloads, and overprovision to mitigate the worst case or to cover future workloads. Multicore demands going beyond the common case: shared resources need guarantees in all cases, and overprovisioning alone is insufficient and wasteful (though some overprovisioning simplifies design), so we must provide guarantees with minimal overprovisioning. Root cause: empirical design and a limited understanding of system behavior.

23 Solution: Analytical Design Approach Design basic components that are easily analyzable, with simple, accurate, workload-independent analytical models that make their behavior easy to understand and reason about. Use the models to design systems that work well in all cases: scalability and QoS are guaranteed in all scenarios, and the designs outperform conventional techniques in the common case. This requires revisiting fundamental aspects of our systems (associativity, coherence, ...).

24 Set-Associative Caches The basic building block of caches and directories. [Diagram: the line address is hashed by a single hash function H to index one set across Ways 1-4 of Sets 1-8] Problems: reducing conflicts (higher associativity) requires more ways, which costs energy, latency, and area; and conflicts depend on the workload's access patterns.

25 ZCache One hash function per way. [Diagram: the line address is hashed by H1, H2, and H3 to index Ways 1-3] Hits require a single lookup, giving low hit energy and latency. Misses exploit the multiple hash functions to obtain an arbitrarily large number of replacement candidates: a multi-step process that draws on prior research on cuckoo hashing, and that happens infrequently (on misses) and off the critical path.

26-32 ZCache Replacement [Figure walkthrough: on a miss, H1-H3 give one replacement candidate per way] Instead of evicting one of the initial candidates, the ZCache can move a candidate to its position under a different hash function and consider the lines it would displace, yielding more candidates; repeating this expansion produces an arbitrarily large candidate pool. The victim is chosen by the replacement policy (LRU/LFU/RRIP, ...), evicted, and the lines along its relocation path are moved to make room. Hits always take a single lookup; replacements do not affect hit latency and are simple to implement.
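To make the multi-step candidate expansion concrete, here is a minimal Python sketch of the idea (my own illustration, not the ZCache paper's code); the hash functions, array sizes, and the 52-candidate limit are placeholder assumptions.

from collections import deque

NUM_WAYS = 3
SETS_PER_WAY = 1024

def hashes(addr):
    # One hypothetical hash function per way (a real design uses stronger hashes, e.g., H3).
    return [(addr * (2 * way + 1) + 0x9E3779B9) % SETS_PER_WAY for way in range(NUM_WAYS)]

def replacement_candidates(ways, addr, max_candidates=52):
    # ways: a NUM_WAYS x SETS_PER_WAY array of resident line addresses (None if empty).
    # Breadth-first expansion over relocation moves: every candidate line could itself be
    # moved to its positions under the other hash functions, exposing more candidates.
    candidates = []            # (way, index, resident line address)
    frontier = deque([addr])   # addresses whose alternative positions remain to be expanded
    seen = set()
    while frontier and len(candidates) < max_candidates:
        a = frontier.popleft()
        for way, idx in enumerate(hashes(a)):
            victim = ways[way][idx]
            if victim is not None and (way, idx) not in seen:
                seen.add((way, idx))
                candidates.append((way, idx, victim))
                frontier.append(victim)   # this resident line could be relocated too
    return candidates

The replacement policy then ranks these candidates, the best one is evicted, and the lines along the path from the incoming address to the victim are relocated; all of this happens off the critical path, so hits are unaffected.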

33 Methodology zsim: a fast, 1000-core, microarchitectural x86 simulator. Fast: parallel, leverages dynamic binary translation (Pin); 600 Minstrs/s on a 12-core Xeon. Scalable: phase-based synchronization, simulates thousands of cores. Validated: within 10% of Atom and Nehalem systems. Simple: ~20 KLoC, used in research and courses at Stanford. zsim is integrated with existing area, energy, and latency models (McPAT, CACTI).

34 ZCache Benefits [Charts: latency (ns), hit energy (nJ), performance, and energy efficiency for an 8MB shared LLC optimized for area, latency, and energy at 32nm; configurations compared: 4-way set-associative (SA-4), 32-way set-associative (SA-32), and a ZCache with 4 ways / 52 candidates (Z 4/52)] ZCache gives scalable associativity at low cost: roughly the cost of a 4-way cache with associativity above that of a 64-way cache.

35 ZCache Associativity ZCache associativity depends only on the number of replacement candidates (R); it is independent of the number of ways, the workload, and the replacement policy. The problem in defining associativity is that it involves both the array and the replacement policy. Insight 1: with the ZCache, replacement candidates are very close to uniformly distributed over the array. Insight 2: all policies do the same thing, rank cache lines. Eviction priority: the rank of a line normalized to [0,1]; for example, with LRU the LRU line has priority 1.0 and the MRU line has priority 0.0.

36 ZCache Associativity Associativity: the probability distribution of the eviction priorities of evicted lines. ZCache associativity depends only on the number of replacement candidates (R): F_{A_R}(x) = Pr(A_R <= x) = x^R, for x in [0,1]. With R=8, only 2% of evictions fall within the 60% least evictable lines; with R=64, only 10^-6 of evictions fall within the 80% least evictable lines.
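As a quick check of these numbers (my own arithmetic, not from the slides): if the R candidates' eviction priorities are roughly i.i.d. uniform on [0,1] and the highest-priority candidate is evicted, the evicted priority is the maximum of R uniforms, so

F_{A_R}(x) = \Pr(\max(U_1,\ldots,U_R) \le x) = x^R, \qquad 0.6^{8} \approx 1.7\times 10^{-2}, \qquad 0.8^{64} \approx 6\times 10^{-7}.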

37 ZCache Analytical Models [Chart: predicted vs. measured eviction-priority distributions] Analytical models are accurate in practice (theory: 1 in a million; practice: 1 in 5); 14 workloads, 1024 cores.

38-39 Cache Partitioning [Diagram: cores with private L2 caches sharing a last-level cache that is divided among VMs 1-6] Cache partitioning techniques divide cache space explicitly. Isolation: virtualize the cache among applications and VMs. Efficiency: improve performance and fairness. Configurability: SW-controlled buffers (performance, security).

40 Partitioning Techniques Strict partitioning schemes are based on restricting line placement. Way partitioning restricts insertions to specific ways: strict, but it supports few partitions and degrades associativity. Soft partitioning schemes are based on tweaking the replacement policy. PIPP inserts and promotes lines in the LRU chain depending on their partition: simple, but partitioning is only approximate and replacement performance degrades.
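As a small illustration of why way partitioning degrades associativity, here is a hedged Python sketch (not taken from any of the cited papers; the way assignment and the age-based policy are placeholders):

WAYS = 8
way_owner = [0, 0, 0, 1, 1, 2, 2, 3]   # hypothetical static assignment of ways to partitions

def victim_way(partition, ages):
    # Insertions for a partition may only evict from the ways it owns,
    # so a partition that owns two ways behaves like a 2-way associative cache.
    allowed = [w for w in range(WAYS) if way_owner[w] == partition]
    return max(allowed, key=lambda w: ages[w])   # evict the oldest allowed line

print(victim_way(1, ages=[9, 1, 4, 7, 2, 8, 3, 5]))   # only ways 3 and 4 are considered

With only eight ways there can be at most eight partitions, which is one reason these schemes do not scale to many cores.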

41 Partitioning with Vantage Previous partitioning techniques have major drawbacks: they are not scalable, support few partitions, and degrade performance. Vantage solves these deficiencies: it is scalable, supports hundreds of fine-grain partitions, and maintains high associativity and strict isolation among partitions (QoS).

42 Vantage Design Vantage partitions most of the cache logically, by modifying the replacement process; it places no restrictions on line placement. [Diagram: the cache is split into a large managed region and a small unmanaged region]

43 Vantage Design Vantage partitions the managed region. Incoming lines (misses) are inserted into their partition; each partition demotes its least-wanted lines to the unmanaged region; and evictions happen only from the unmanaged region, so there is no interference. [Diagram: insertions into Partitions 0-3, demotions into the unmanaged region, evictions from the unmanaged region]

44 Controlling Demotions [Example: an access to B (partition 0) misses; the 16 replacement candidates comprise 5 lines from P1, 2 from P2, 6 from P3, and 3 unmanaged lines; a line is evicted from the unmanaged region and the new line is inserted into partition 0. Which partition should demote a line?] Always demoting from the inserting partition does not scale with the number of partitions; instead, Vantage maintains partition sizes by matching each partition's demotion rate to its miss rate.

45 Demoting with Apertures Aperture: the portion of candidates to demote from each partition. [Example: apertures of 23%, 15%, 5%, and 11% for Partitions 0-3; on a partition 0 miss, each replacement candidate is checked against its partition's aperture: candidates over the aperture (e.g., in the top 11% of P3, or the top 23% of P0 and top 15% of P1) are demoted, a line is evicted from the unmanaged region, and if no candidate is over its aperture, nothing is demoted.]

46 Managing Apertures Partition apertures can be derived analytically: A_i = (M_i / Σ_{k=1..P} M_k) · (Σ_{k=1..P} S_k / S_i) · (1 / R_m). Intuition: aperture ~ miss rate (M_i) / size (S_i). Apertures are also capped at A_max: a higher aperture means lower partition associativity, and A_max ensures a high minimum associativity (e.g., A_max = 40% gives ~R=16 associativity). Partitions that need A_i > A_max are simply allowed to grow.
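A minimal Python sketch of how apertures could be computed and applied, assuming the formula reconstructed above (an illustration, not the Vantage implementation; R_m, the cap, and the counters are placeholder assumptions):

R_M = 16       # replacement candidates drawn from the managed region (assumed value)
A_MAX = 0.4    # aperture cap that bounds the worst-case loss of associativity

def apertures(miss_rates, sizes):
    # A_i = (miss share of partition i) / (size share of partition i) / R_m, capped at A_MAX.
    total_misses, total_size = sum(miss_rates), sum(sizes)
    result = []
    for m, s in zip(miss_rates, sizes):
        a = (m / total_misses) * (total_size / s) / R_M
        result.append(min(a, A_MAX))   # partitions that would need more simply grow
    return result

def demotions(candidates, aperture):
    # Demote candidates whose eviction priority lies within their partition's aperture,
    # i.e., among the top-A_i fraction of that partition's lines.
    return [(part, prio) for part, prio in candidates if prio >= 1.0 - aperture[part]]

print(apertures(miss_rates=[100, 200], sizes=[512, 512]))   # the missier partition gets ~2x the aperture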

47 Bounds on Size and Interference The worst-case total growth of all partitions over their target sizes is bounded and small: Δ = 1 / (A_max · R). Intuition: a Δ-sized partition is always stable, and multiple unstable partitions help each other demote. The bound is independent of the number of partitions! Assign an extra Δ to the unmanaged region: with R=52 and A_max=0.4, Δ = 5% of the cache. This bounds worst-case sizes and interference.

48 A Simple Vantage Controller Use a negative feedback loop to derive apertures and per-line timestamps to determine which lines fall within the aperture: a practical implementation that maintains the analytical guarantees. [Diagram: the tag array is extended with a partition ID field (6b) and a timestamp (8b) next to the coherence/valid bits; the controller keeps 256 bits of state per partition and drives the Vantage replacement logic] ~1% extra storage, growing with log(partitions); simple logic (~10 adders and comparators), not on the critical path.
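A hedged sketch of the negative-feedback idea (not the actual controller; the gain, update period, and size units are made up): rather than evaluating the analytical formula directly, each partition's aperture is nudged up when the partition is over its target size and down otherwise.

def update_aperture(aperture, actual_size, target_size, gain=0.05, a_max=0.4):
    # One control step: demote more aggressively when the partition is too large.
    error = (actual_size - target_size) / target_size
    return max(0.0, min(a_max, aperture + gain * error))

ap = 0.10
for size in [1100, 1150, 1080, 1020, 990]:   # observed sizes vs. a 1000-line target
    ap = update_aperture(ap, size, target_size=1000)
    print(round(ap, 3))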

49 Vantage Evaluation [Chart: per-workload-mix throughput relative to the unpartitioned baseline, with regions marked better/worse than unpartitioned] 350 mixes on a 32-core CMP with a shared LLC (32 partitions); partitions are sized to maximize throughput (utility-based partitioning); each line shows throughput vs. an unpartitioned 64-way baseline. Way partitioning and PIPP degrade throughput for most workloads.

50 Vantage Evaluation [Chart, continued] Vantage improves throughput for most workloads using a 4-way/52-candidate ZCache; the other schemes cannot scale beyond a few cores.

51 Scaling Directories [Diagram: cores with private caches, shared cache, coherence directory, main memory] Scaling directories is hard: existing schemes have excessive latency, energy, and area overheads or are too complex, introduce invalidations, and cause interference.

52 Scalable Coherence Directory (SCD) Insights: a flexible sharer-set encoding, where lines with few sharers use one entry and widely shared lines use multiple entries, provides scalability; using ZCache arrays provides efficient high associativity and analytical models, so invalidations are negligible with minimal overprovisioning (~10%). SCD achieves scalability and performance guarantees: area and energy grow with log(cores) and latency is constant. It is simple: no modifications to the coherence protocol. At 1024 cores, SCD is 13x smaller than a sparse directory, and 2x smaller, faster, and simpler than a hierarchical directory.
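A rough Python sketch of the flexible sharer-set encoding idea (my illustration of the concept, not SCD's actual entry format; the pointer count and chunk size are placeholders):

CORES = 1024
PTRS_PER_ENTRY = 3        # sharer pointers that fit in one directory entry
CHUNK = 64                # cores covered by each bit-vector entry

def entries_needed(sharers):
    # Directory entries used by a line with the given set of sharer core IDs.
    if len(sharers) <= PTRS_PER_ENTRY:
        return 1                                    # common case: one entry
    chunks = {core // CHUNK for core in sharers}    # root entry plus one bit-vector entry per chunk
    return 1 + len(chunks)

print(entries_needed({5}))                  # private line: 1 entry
print(entries_needed(set(range(CORES))))    # widely shared: 1 + 1024/64 entries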

53 Scalable Scheduling [Diagram: application tasks mapped onto the cores of the memory hierarchy] Scheduling requirements: expose enough parallelism, locality awareness, load balancing, low overheads, and a bounded memory footprint. Dynamic vs. static schedulers: dynamic schedulers have poor locality and an unbounded footprint when there are non-trivial dependences; static schedulers produce great compile-time schedules but do no load balancing and handle only regular apps.

54 Insight: Leverage the Programming Model Solution: dynamic fine-grain scheduling techniques that leverage programming-model information to satisfy the requirements: expose all parallelism through fine-grain tasks, locality-aware task queuing and load balancing, and a bounded footprint. This makes dynamic scheduling practical in rich programming models (StreamIt, GRAMPS, Delite). Significant improvements over state-of-the-art schedulers on an existing 12-core, 24-thread Xeon SMP: up to 17x over dynamic schedulers (more parallelism, locality awareness, bounded footprint) and up to 5.3x over static schedulers (no load imbalance). Scheduler choice becomes more critical as we scale up!
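A minimal Python sketch of the kind of scheduler these techniques enable (illustrative only; the per-stage queues, the LIFO/FIFO choices, and the footprint threshold are my assumptions, not the published design):

import collections

class Worker:
    def __init__(self, stages, footprint_limit=1000):
        self.stages = list(stages)                               # pipeline stages, upstream to downstream
        self.queues = {s: collections.deque() for s in stages}   # per-stage task queues
        self.footprint_limit = footprint_limit

    def push(self, stage, task):
        self.queues[stage].append(task)

    def pick(self):
        # Footprint control: when too many tasks are queued, run downstream stages
        # first to drain buffered data instead of producing more of it.
        total = sum(len(q) for q in self.queues.values())
        order = reversed(self.stages) if total > self.footprint_limit else self.stages
        for stage in order:
            if self.queues[stage]:
                return stage, self.queues[stage].pop()            # LIFO: recently produced, hot in cache
        return None

    def steal_from(self, victim):
        for stage in victim.stages:
            if victim.queues[stage]:
                return stage, victim.queues[stage].popleft()      # FIFO: steal old, cold tasks
        return None

LIFO local execution keeps producer-consumer data in the private caches, FIFO stealing moves only cold work, and the drain-first rule keeps the number of queued intermediate results bounded.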

55 Hardware-Accelerated Schedulers Fine-grain scheduling with 100+ threads is slow in software, while hardware schedulers (e.g., on GPUs) are fast but inflexible. Insight: software schedulers are dominated by communication. Solution: accelerate communication with simple hardware. ADM: asynchronous, register-register messages between threads, with small and scalable costs (~1KB of buffers per core), and virtualizable. ADM-accelerated fine-grain schedulers achieve the speed and scalability of HW plus the flexibility of SW: at 512 threads, 6.4x faster than SW and 70% faster than HW. ADM can also accelerate other primitives (e.g., barriers, IPC).
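To illustrate why moving scheduler communication into short asynchronous messages helps, here is a toy Python sketch (this is not ADM's interface; send/receive are assumed hardware primitives, modeled here with ordinary queues): a steal becomes a short request/reply exchange instead of locking and probing shared-memory task queues.

import queue

NUM_THREADS = 4
mailboxes = [queue.Queue() for _ in range(NUM_THREADS)]   # per-thread receive buffer

def send(dst, msg):
    # Asynchronous message to thread dst; no shared locks, no cache-line ping-pong.
    mailboxes[dst].put(msg)

def serve_steals(my_id, my_tasks):
    # Victim side: answer pending steal requests from its private task list.
    while not mailboxes[my_id].empty():
        thief = mailboxes[my_id].get()
        send(thief, my_tasks.pop() if my_tasks else None)

send(0, 1)                        # thread 1 asks thread 0 for work (request carries the thief's ID)
serve_steals(0, my_tasks=[17])    # thread 0 hands over task 17
print(mailboxes[1].get())         # thread 1 receives the stolen task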

56 Contributions 56 Scalable cache hierarchies: Efficient highly-associative caches [MICRO 10] Scalable cache partitioning [ISCA 11, Top Picks 12] Scalable coherence directories [HPCA 12] Scalable scheduling: Efficient dynamic scheduling by leveraging programming model information [PACT 11] Hardware-accelerated scheduling [ASPLOS 10]

57 Conclusions Scaling to 1000 cores requires HW and SW techniques: Scale hardware with highly efficient caches with scalable partitioning and coherence Scale software with dynamic, fine-grain, HW-accelerated scheduling 57

58 Acknowledgements Christos Research group: Jacob, David, Richard, Christina, Woongki, Austen, Mike, Hari PPL faculty: Kunle, Bill, Mark, Pat, Mendel, John, Alex PPL students: George, Jeremy, Defense committee: Bill, Kunle, Nick Family & friends Borja, Gemma, Idoia, Carlos, Felix, Dani, Manuel, Gonzalo, Adrian, Christina, George, Yiannis, Sotiria, Alexandros, Nadine, Martin, Elliot, Nick, Steph, Olivier, Leen, John, Sam, Mario, Nicole, Cristina, Kshipra, Robert, Erik, 58

59 THANK YOU FOR YOUR ATTENTION QUESTIONS?
