JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES
1 JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES. Nathan Beckmann and Daniel Sanchez, MIT CSAIL. PACT'13, Edinburgh, Scotland, Sep 11, 2013.
2 Summary. NUCA is giving us more capacity, but further away. Applications have widely varying cache behavior. Cache organization should adapt to the application. Jigsaw uses physical cache resources as building blocks of virtual caches, or shares. [Figure: miss curves (MPKI, 0-40) vs. cache size (0-16MB) for libquantum, zeusmp, and sphinx3]
3 Approach. Jigsaw uses physical cache resources as building blocks of virtual caches, or shares. [Figures: miss curves (MPKI, 0-40) vs. cache size (0-16MB) for libquantum, zeusmp, and sphinx3; banks of a tiled multicore]
4 Agenda 4 Introduction Background Goals Existing Approaches Jigsaw Design Evaluation
5 Goals. Make effective use of cache capacity. Place data for low latency. Provide capacity isolation for performance. Have a simple implementation.
6 Existing Approaches: S-NUCA. Spread lines evenly across banks. High capacity; high latency; no isolation; simple.
7 Existing Approaches: Partitioning. Isolate regions of cache between applications. High capacity; high latency; isolation; simple. Jigsaw needs partitioning; it uses Vantage to get strong guarantees with no loss in associativity.
8 Existing Approaches: Private. Place lines in the local bank. Low capacity; low latency; isolation; complex (requires an LLC directory).
9 Existing Approaches: D-NUCA. Placement, migration, and replication heuristics. High capacity, but beware of over-replication and restrictive mappings. Low latency, but these schemes don't fully exploit the capacity vs. latency tradeoff. No isolation. Complexity varies: private-baseline schemes require an LLC directory.
10 Existing Approaches: Summary

                S-NUCA   Partitioning   Private   D-NUCA
High Capacity   Yes      Yes            No        Yes
Low Latency     No       No             Yes       Yes
Isolation       No       Yes            Yes       No
Simple          Yes      Yes            No        Depends
11 Jigsaw. High capacity: any share can take full capacity, no replication. Low latency: shares are allocated near the cores that use them. Isolation: partitions within each bank. Simple: low-overhead hardware, no LLC directory, software-managed.
12 Agenda 12 Introduction Background Jigsaw Design Operation Monitoring Configuration Evaluation
13 Jigsaw Components. [Diagram: Operation handles accesses; Monitoring produces miss curves; Configuration decides size & placement]
14 Jigsaw Components 14 Operation Configuration Monitoring
15 Agenda 15 Introduction Background Jigsaw Design Operation Monitoring Configuration Evaluation
16 Operation: Access. Each piece of data maps to exactly one share, so no LLC coherence is required. [Figure: on an access (LD 0x5CA1AB1E), the core's TLB-based LLC classifier and STB steer the L1I/L1D/L2 miss to one of shares 1..N]
17 Data Classification. Jigsaw classifies data based on access pattern: Thread, Process, Global, and Kernel. Data is lazily re-classified on a TLB miss. Similar to R-NUCA, but: R-NUCA maps classification → location, while Jigsaw maps classification → share (sized and placed dynamically). Negligible overhead. [Figure: an operating system with 6 thread shares, 2 process shares, 1 global share, and 1 kernel share]
18 Operation: Share-Bank Translation Buffer (STB). The STB gives the unique location of a line in the LLC: (Address, Share) → (Bank, Partition). Lines are hashed across banks proportionally to the share's per-bank allocation. The STB is about 400 bytes; low overhead. 4 entries, associative, exception on miss; each entry holds a share's configuration (hash function H plus 64 bank/partition descriptors). [Figure: the share id (from the TLB) selects an STB entry and the hashed address (from an L1 miss) selects a bank/partition; e.g., address 0x5CA1AB1E maps to bank 3, partition 5]
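The proportional hashing the STB performs in hardware can be sketched in software. This is a minimal illustration, not the hardware design: the `share_config` layout and the use of `hashlib` stand in for the STB's hash function and bank/partition descriptors.

```python
import hashlib

def stb_lookup(share_config, address):
    """Map (share, address) -> (bank, partition).

    share_config: list of (bank, partition, weight) entries; weight is the
    fraction of the share's capacity held in that bank (weights sum to 1).
    The real STB uses a hardware hash; hashlib is purely illustrative.
    """
    h = hashlib.sha256(address.to_bytes(8, "little")).digest()
    x = int.from_bytes(h[:4], "little") / 2**32   # uniform in [0, 1)
    cumulative = 0.0
    for bank, partition, weight in share_config:
        cumulative += weight
        if x < cumulative:
            return bank, partition
    return share_config[-1][:2]                   # guard against rounding
```

Because the hash is a pure function of the address, every core resolves a line to the same bank and partition, which is what makes an LLC directory unnecessary.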
19 Agenda 19 Introduction Background Jigsaw Design Operation Monitoring Configuration Evaluation
20 Monitoring. Software requires miss curves for each share. Add utility monitors (UMONs) per tile to produce miss curves. Dynamic sampling models the full LLC at each bank; see paper. [Figure: UMON tag array (ways 0..N-1) with per-way hit counters and a miss counter, yielding a misses vs. cache size curve]
21 Configuration. Software decides the share configuration. Approach: size → place. Solving the two problems independently is simple; sizing is hard, placing is easy. [Figure: miss curves feed the SIZE step, whose allocations feed the PLACE step]
22 Configuration: Sizing. Partitioning problem: divide cache capacity S among P partitions/shares to maximize hits. Use miss curves to describe partition behavior. NP-complete in general. Existing approaches: hill climbing is fast but gets stuck in local optima; UCP Lookahead is good but scales quadratically, O(P × S²). [Utility-Based Cache Partitioning, Qureshi and Patt, MICRO'06] Can we scale Lookahead?
23 Configuration: Lookahead. UCP Lookahead: scan the miss curves to find the allocation that maximizes average cache utility (hits per byte). [Animation, slides 23-39: repeatedly scanning each partition's miss curve (Misses vs. Size, up to LLC size) and granting the allocation with maximum cache utility]
40 Configuration: Lookahead. Observation: Lookahead traces the convex hull of the miss curve. [Figure: miss curve with the maximum-cache-utility points lying on its convex hull]
41 Convex Hulls. The convex hull of a curve is the smallest convex set containing all lines between any two points on the curve; its lower boundary is the curve connecting the points along the bottom, which is what Lookahead traces. [Figure: two miss curves with their lower convex hulls]
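Since a miss curve's x-values (allocations) are already sorted, its lower hull falls out of one linear pass with the standard monotone-chain technique. A sketch of the kind of linear-time hull computation Peekahead builds on, not the paper's exact code:

```python
def lower_convex_hull(misses):
    """Indices of the lower convex hull of a miss curve.

    misses[i] = misses at an allocation of i units. Points that lie on or
    above a chord between their neighbors are popped, so the result is the
    curve connecting the points along the bottom. Runs in O(S): each index
    is pushed and popped at most once.
    """
    hull = []  # indices (allocations) of hull points
    for x in range(len(misses)):
        while len(hull) >= 2:
            xo, xa = hull[-2], hull[-1]
            # Cross product of (xo -> xa) and (xo -> x); <= 0 means xa is
            # not strictly below the chord, so it is not on the lower hull.
            cross = ((xa - xo) * (misses[x] - misses[xo])
                     - (misses[xa] - misses[xo]) * (x - xo))
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(x)
    return hull
```

For example, a miss curve [10, 9, 2] has point (1, 9) above the chord from (0, 10) to (2, 2), so only allocations 0 and 2 lie on the hull.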
42 Configuration: Peekahead. There are well-known linear-time algorithms to compute convex hulls. The Peekahead algorithm is an exact, linear-time implementation of UCP Lookahead. [Figure: miss curves and their convex hulls]
43 Configuration: Peekahead. Peekahead computes all convex hulls encountered during allocation in linear time: starting from every possible allocation, up to any remaining cache capacity. [Figure: miss curves and their convex hulls]
44 Configuration: Peekahead. Knowing the convex hull, each allocation step is O(log P). Convex hulls have decreasing slope → decreasing average cache utility → only the next point on each hull need be considered. A max-heap compares the best step across partitions. [Figure: candidate best steps on each partition's hull, from the current allocation]
45 Configuration: Peekahead. [Animation, slides 45-51: from the current allocation, the best step along each partition's convex hull is selected via the max-heap and granted, and the process repeats]
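The hull-walking allocation loop can be sketched as follows. This is a simplified illustration under assumed inputs (precomputed hulls per partition); Peekahead's bookkeeping for re-deriving hulls from arbitrary starting allocations is omitted.

```python
import heapq

def allocate(hulls, misses, capacity):
    """Greedy allocation along precomputed convex hulls.

    hulls[p]  = hull point allocations for partition p (ascending, from 0)
    misses[p] = miss curve for partition p
    capacity  = total cache units to hand out
    Returns the per-partition allocation.
    """
    alloc = [hulls[p][0] for p in range(len(hulls))]
    pos = [0] * len(hulls)  # current hull point per partition
    heap = []

    def push(p):
        # Offer partition p's next hull step, keyed by utility (hits/byte).
        if pos[p] + 1 < len(hulls[p]):
            cur, nxt = hulls[p][pos[p]], hulls[p][pos[p] + 1]
            utility = (misses[p][cur] - misses[p][nxt]) / (nxt - cur)
            heapq.heappush(heap, (-utility, p))  # max-heap via negation

    for p in range(len(hulls)):
        push(p)
    remaining = capacity - sum(alloc)
    while heap and remaining > 0:
        _, p = heapq.heappop(heap)           # best step: O(log P)
        cur, nxt = hulls[p][pos[p]], hulls[p][pos[p] + 1]
        step = nxt - cur
        if step > remaining:
            continue  # this step no longer fits; drop it
        alloc[p] = nxt
        remaining -= step
        pos[p] += 1
        push(p)
    return alloc
```

Because hull slopes decrease monotonically, each partition's best next step is always its next hull point, so a single heap entry per partition suffices.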
52 Configuration: Peekahead. Full runtime is O(P × S), where P is the number of partitions and S the cache size. See the paper for additional examples, the algorithm, and corner cases; see the technical report for additional detail, proofs, and run-time analysis. [Jigsaw: Scalable Software-Defined Caches (Extended Version), Nathan Beckmann and Daniel Sanchez, Technical Report MIT-CSAIL-TR, Massachusetts Institute of Technology, July 2013]
53 Re-configuration. When the STB changes, some addresses hash to different banks. [Figure: address 0x5CA1AB1E hashed under old and new STB configurations; one bank's allocation changes from 1/3 to 4/9, moving the line] Selective invalidation hardware walks the LLC and invalidates lines that have moved. Heavy-handed, but infrequent, and avoids a directory. Maximum of 300K cycles every 50M cycles = 0.6% overhead.
54 Design: Hardware Summary. Operation: the share-bank translation buffer (STB) handles accesses; the TLB is augmented with a share id. Bank partitioning hardware (Vantage). Monitoring hardware produces miss curves. Configuration: invalidation hardware. [Figure: tile organization with core, L1I/L1D, L2, NoC router, and Jigsaw L3 bank containing partitioning HW (Vantage), monitoring HW, and invalidation HW; STB and TLBs at the core; modified vs. new/added structures highlighted]
55 Agenda 55 Introduction Background Jigsaw Design Evaluation Methodology Performance Energy
56 Methodology. Execution-driven simulation using zsim. Workloads: 16-core single-threaded mixes of SPEC CPU2006; 64-core multithreaded (4x16-thread) mixes of PARSEC. Cache organizations: LRU (shared S-NUCA cache with LRU replacement; the baseline); Vantage (S-NUCA with Vantage and UCP Lookahead); R-NUCA (state-of-the-art shared-baseline D-NUCA organization); IdealSPD ("shared-private D-NUCA": private L3 + shared L4, with 2x the capacity of the other schemes, an upper bound for private-baseline D-NUCA organizations); and Jigsaw.
57 Evaluation: Performance. 16-core multiprogrammed mixes of SPEC CPU2006. Jigsaw achieves the best performance: up to 50% improved throughput and 2.2x improved weighted speedup; gmean +14% throughput, +18% weighted speedup. Jigsaw does even better on the most memory-intensive mixes (top 20% of LRU MPKI): gmean +21% throughput, +29% weighted speedup.
58 Evaluation: Performance. 64-core multithreaded mixes of PARSEC. Jigsaw achieves the best performance: gmean +9% throughput, +9% weighted speedup. Remember that IdealSPD is an upper bound with 2x capacity.
59 Evaluation: Performance Breakdown. 16-core multiprogrammed mixes of SPEC CPU2006. Memory stalls are broken down into network and DRAM components, normalized to LRU. R-NUCA is limited by capacity in these workloads (private data → local bank). Vantage only benefits DRAM. IdealSPD acts as either a private organization (benefiting network) or a shared organization (benefiting DRAM). Jigsaw is the only scheme to simultaneously benefit network and DRAM latency. [Chart: per-scheme stall breakdown, including an Optimum reference]
60 Evaluation: Energy. 16-core multiprogrammed mixes. McPAT models of full-system energy (chip + DRAM). Jigsaw achieves the best energy reduction: up to 72%, gmean of 11%. It reduces both network and DRAM energy.
61 Conclusion. NUCA is giving us more capacity, but further away. Applications have widely varying cache behavior; cache organization should adapt to meet application needs. [Figure: MPKI (0-40) vs. cache size (0-16MB)] Jigsaw uses physical cache resources as building blocks of virtual caches, or shares: sized to fit the working set, and placed near the application for low latency. Jigsaw improves performance by up to 2.2x and reduces energy by up to 72%.
62 QUESTIONS
63 Placement. Greedy algorithm: each share is allocated a budget; shares take turns grabbing space in nearby banks; banks are ordered by distance from the center of mass of the cores accessing the share; repeat until budgets and banks are exhausted.
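A minimal sketch of this greedy round-robin placement, with assumed inputs (per-share budgets from the sizing step, core coordinates, and a uniform bank capacity); the per-turn grain and tie-breaking are illustrative choices, not the paper's exact policy.

```python
import math

def place_shares(budgets, accessors, banks, bank_capacity):
    """Greedy round-robin placement sketch.

    budgets[s]   = capacity granted to share s by the sizing step
    accessors[s] = (x, y) positions of cores accessing share s
    banks        = list of (x, y) bank positions
    Returns placement[s][b] = capacity of share s placed in bank b.
    """
    def center_of_mass(points):
        return (sum(x for x, _ in points) / len(points),
                sum(y for _, y in points) / len(points))

    # Each share visits banks nearest its accessors' center of mass first.
    order = []
    for s in range(len(budgets)):
        cx, cy = center_of_mass(accessors[s])
        order.append(sorted(range(len(banks)),
                            key=lambda b: math.hypot(banks[b][0] - cx,
                                                     banks[b][1] - cy)))
    free = [bank_capacity] * len(banks)
    left = list(budgets)
    placement = [[0] * len(banks) for _ in budgets]
    grain = 1  # units claimed per turn (illustrative)
    while any(left) and any(free):
        for s in range(len(budgets)):       # shares take turns
            for b in order[s]:              # nearest bank with space
                take = min(grain, left[s], free[b])
                if take > 0:
                    placement[s][b] += take
                    left[s] -= take
                    free[b] -= take
                    break
    return placement
```

With two shares pinned to opposite corners of a 2-bank layout, each share ends up entirely in its local bank, which is the low-latency outcome the ordering is designed to produce.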
64 Monitoring. Software requires miss curves for each share. Add UMONs per tile: a small tag array that models LRU on a sample of accesses, tracking hits per way and misses → a miss curve. Changing the sampling rate models a larger cache: Sampling Rate = UMON Lines / Modeled Cache Lines. The STB spreads lines proportionally to partition size, so the sampling rate must compensate: Sampling Rate = (Share Size / Partition Size) × (UMON Lines / Modeled Cache Lines).
65 Monitoring. The STB spreads addresses unevenly → change the sampling rate to compensate. Augment the UMON with a hash (shared with the STB) and a 32-bit limit register that gives fine control over the sampling rate. [Figure: the hashed address is compared against the limit register before indexing the UMON tag array and hit counters] The UMON now models full LLC capacity exactly. Shares require only one UMON. A maximum of four shares per bank → four UMONs per bank → 1.4% overhead.
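The sampling-rate compensation can be expressed as the value loaded into the 32-bit limit register. A hypothetical helper illustrating the computation; the encoding (sample when hash < limit, with the limit as a fraction of the 32-bit hash space) is an assumption, not the paper's register format.

```python
def umon_limit(umon_lines, modeled_cache_lines, share_size, partition_size):
    """Derive a 32-bit limit register value for an UMON (illustrative).

    The base rate models a cache of modeled_cache_lines with umon_lines
    tags; it is scaled up by share_size / partition_size because this
    bank's partition sees only that fraction of the share's accesses.
    """
    rate = (share_size / partition_size) * (umon_lines / modeled_cache_lines)
    rate = min(rate, 1.0)  # cannot sample more than every access
    return int(rate * (2**32 - 1))
```

For instance, a 1024-line UMON modeling a 2^20-line LLC for a share spread over four equal partitions would sample 1/256 of this bank's stream.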
66 Evaluation: Extra See paper for: Out-of-order results Execution time breakdown Peekahead performance Sensitivity studies 66
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationCS370 Operating Systems
CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 20 Main Memory Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 Pages Pages and frames Page
More informationLecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University
Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:
More informationAnalysis and Optimization. Carl Waldspurger Irfan Ahmad CloudPhysics, Inc.
PRESENTATION Practical Online TITLE GOES Cache HERE Analysis and Optimization Carl Waldspurger Irfan Ahmad CloudPhysics, Inc. SNIA Legal Notice The material contained in this tutorial is copyrighted by
More informationA Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems
A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems Dimitris Kaseridis, Jeffrey Stuecheli, Jian Chen and Lizy K. John Department of Electrical
More informationPOPS: Coherence Protocol Optimization for both Private and Shared Data
POPS: Coherence Protocol Optimization for both Private and Shared Data Hemayet Hossain, Sandhya Dwarkadas, and Michael C. Huang University of Rochester Rochester, NY 14627, USA {hossain@cs, sandhya@cs,
More informationPerformance of coherence protocols
Performance of coherence protocols Cache misses have traditionally been classified into four categories: Cold misses (or compulsory misses ) occur the first time that a block is referenced. Conflict misses
More informationLast Level Cache Size Flexible Heterogeneity in Embedded Systems
Last Level Cache Size Flexible Heterogeneity in Embedded Systems Mario D. Marino, Kuan-Ching Li Leeds Beckett University, m.d.marino@leedsbeckett.ac.uk Corresponding Author, Providence University, kuancli@gm.pu.edu.tw
More informationCANDY: Enabling Coherent DRAM Caches for Multi-Node Systems
CANDY: Enabling Coherent DRAM Caches for Multi-Node Systems Chiachen Chou Aamer Jaleel Moinuddin K. Qureshi School of Electrical and Computer Engineering Georgia Institute of Technology {cc.chou, moin}@ece.gatech.edu
More informationI, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.
5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000
More informationAddress Translation. Tore Larsen Material developed by: Kai Li, Princeton University
Address Translation Tore Larsen Material developed by: Kai Li, Princeton University Topics Virtual memory Virtualization Protection Address translation Base and bound Segmentation Paging Translation look-ahead
More informationChapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.
More informationDo we need a crystal ball for task migration?
Do we need a crystal ball for task migration? Brandon {Myers,Holt} University of Washington bdmyers@cs.washington.edu 1 Large data sets Data 2 Spread data Data.1 Data.2 Data.3 Data.4 Data.0 Data.1 Data.2
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Memory Hierarchy & Caches Motivation 10000 Performance 1000 100 10 Processor Memory 1 1985 1990 1995 2000 2005 2010 Want memory to appear: As fast as CPU As large as required
More informationChapter 18 - Multicore Computers
Chapter 18 - Multicore Computers Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ Luis Tarrataca Chapter 18 - Multicore Computers 1 / 28 Table of Contents I 1 2 Where to focus your study Luis Tarrataca
More informationOptimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service
Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service * Kshitij Sudan* Sadagopan Srinivasan Rajeev Balasubramonian* Ravi Iyer Executive Summary Goal: Co-schedule N applications
More informationLecture: Large Caches, Virtual Memory. Topics: cache innovations (Sections 2.4, B.4, B.5)
Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5) 1 Intel Montecito Cache Two cores, each with a private 12 MB L3 cache and 1 MB L2 Naffziger et al., Journal of Solid-State
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1)
1 MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM (PART 1) Chapter 5 Appendix F Appendix I OUTLINE Introduction (5.1) Multiprocessor Architecture Challenges in Parallel Processing Centralized Shared Memory
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationMultiprocessors and Locking
Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access
More informationSpring 2018 :: CSE 502. Cache Design Basics. Nima Honarmand
Cache Design Basics Nima Honarmand Storage Hierarchy Make common case fast: Common: temporal & spatial locality Fast: smaller, more expensive memory Bigger Transfers Registers More Bandwidth Controlled
More informationCSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]
CSF Cache Introduction [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user with as much
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1)
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1) 1 Problem 3 Consider the following LSQ and when operands are available. Estimate
More informationAddressing End-to-End Memory Access Latency in NoC-Based Multicores
Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu
More informationExploration of Cache Coherent CPU- FPGA Heterogeneous System
Exploration of Cache Coherent CPU- FPGA Heterogeneous System Wei Zhang Department of Electronic and Computer Engineering Hong Kong University of Science and Technology 1 Outline ointroduction to FPGA-based
More informationvirtual memory Page 1 CSE 361S Disk Disk
CSE 36S Motivations for Use DRAM a for the Address space of a process can exceed physical memory size Sum of address spaces of multiple processes can exceed physical memory Simplify Management 2 Multiple
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationI/O CHARACTERIZATION AND ATTRIBUTE CACHE DATA FOR ELEVEN MEASURED WORKLOADS
I/O CHARACTERIZATION AND ATTRIBUTE CACHE DATA FOR ELEVEN MEASURED WORKLOADS Kathy J. Richardson Technical Report No. CSL-TR-94-66 Dec 1994 Supported by NASA under NAG2-248 and Digital Western Research
More informationTIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS. Xiangyao Yu, Srini Devadas CSAIL, MIT
TIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS Xiangyao Yu, Srini Devadas CSAIL, MIT FOR FIFTY YEARS, WE HAVE RIDDEN MOORE S LAW Moore s Law and the scaling of clock frequency = printing press for the currency
More information