JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES


1 JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES. NATHAN BECKMANN AND DANIEL SANCHEZ, MIT CSAIL. PACT '13, EDINBURGH, SCOTLAND, SEP 11, 2013.

2 Summary: NUCA is giving us more capacity, but further away. Applications have widely varying cache behavior. [Figure: MPKI vs. cache size (0 to 16 MB) for libquantum, zeusmp, and sphinx3.] Cache organization should adapt to the application. Jigsaw uses physical cache resources as building blocks of virtual caches, or shares.

3 Approach: Jigsaw uses physical cache resources, the banks of a tiled multicore, as building blocks of virtual caches, or shares. [Figure: MPKI vs. cache size (0 to 16 MB) for libquantum, zeusmp, and sphinx3, mapped onto the banks of a tiled multicore.]

4 Agenda: Introduction, Background (Goals, Existing Approaches), Jigsaw Design, Evaluation.

5 Goals: Make effective use of cache capacity. Place data for low latency. Provide capacity isolation for performance. Have a simple implementation.

6 Existing Approaches: S-NUCA. Spread lines evenly across banks. High capacity, high latency, no isolation, simple.

7 Existing Approaches: Partitioning. Isolate regions of the cache between applications. High capacity, high latency, isolation, simple. Jigsaw needs partitioning; it uses Vantage to get strong guarantees with no loss in associativity.

8 Existing Approaches: Private. Place lines in the local bank. Low capacity, low latency, isolation, complex (requires an LLC directory).

9 Existing Approaches: D-NUCA. Placement, migration, and replication heuristics. High capacity, but beware of over-replication and restrictive mappings. Low latency, but these schemes don't fully exploit the capacity vs. latency tradeoff. No isolation. Complexity varies; private-baseline schemes require an LLC directory.

10 Existing Approaches: Summary

                 S-NUCA   Partitioning   Private   D-NUCA
  High Capacity  Yes      Yes            No        Yes
  Low Latency    No       No             Yes       Yes
  Isolation      No       Yes            Yes       No
  Simple         Yes      Yes            No        Depends

11 Jigsaw. High capacity: any share can take the full capacity, with no replication. Low latency: shares are allocated near the cores that use them. Isolation: partitions within each bank. Simple: low-overhead hardware, no LLC directory, software-managed.

12 Agenda: Introduction, Background, Jigsaw Design (Operation, Monitoring, Configuration), Evaluation.

13 Jigsaw Components. [Diagram: Configuration sets each share's size & placement for Operation; Operation's accesses feed Monitoring; Monitoring's miss curves feed Configuration.]

14 Jigsaw Components: Operation, Configuration, and Monitoring.

15 Agenda: Introduction, Background, Jigsaw Design (Operation, Monitoring, Configuration), Evaluation.

16 Operation: Access. Data maps to exactly one share (data → shares), so no LLC coherence is required. [Diagram: the core issues LD 0x5CA1AB1E; the TLB acts as the share classifier; the STB maps the address into one of shares 1..N in the LLC; L1I/L1D and L2 are unchanged.]

17 Data Classification. Jigsaw classifies data based on access pattern: Thread, Process, Global, and Kernel. Data is lazily re-classified on TLB misses, with negligible overhead. This is similar to R-NUCA, but: in R-NUCA, classification → location; in Jigsaw, classification → share (sized & placed dynamically). [Example: an operating system with 6 thread shares, 2 process shares, 1 global share, and 1 kernel share.]

18 Operation: Share-Bank Translation Buffer (STB). The STB gives the unique location of a line in the LLC: (Address, Share) → (Bank, Partition). Lines are hashed across banks proportionally to the share's partition sizes. The share id comes from the TLB and the address from the L1 miss; a hash of the address selects one of 64 bank/partition descriptors in the share's STB entry. The STB has 4 associative entries (exception on miss) and takes about 400 bytes, so overhead is low. [Example: address 0x5CA1AB1E hashes to the descriptor for bank 3, partition 5.]
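
The mapping is simple enough to sketch in software. Below is a minimal model of an STB entry; the class and field names, the 64-byte line size, and the hash function are illustrative assumptions, not the actual hardware design:

```python
import hashlib

class STBEntry:
    """Software model of one STB entry (a sketch, not the real hardware).

    descriptors holds 64 (bank, partition) pairs. A bank's descriptor is
    replicated in proportion to its partition size, so hashing addresses
    into the table spreads a share's lines proportionally across banks.
    """
    def __init__(self, descriptors):
        assert len(descriptors) == 64
        self.descriptors = descriptors

    def lookup(self, addr):
        # Hash the 64-byte line address and pick one of the 64 descriptors.
        line = addr >> 6
        h = hashlib.blake2b(line.to_bytes(8, "little"), digest_size=4)
        idx = int.from_bytes(h.digest(), "little") % 64
        return self.descriptors[idx]

# Example: a share split 1:3 between (bank 1, partition 3) and
# (bank 3, partition 5); the slide's address lands in one of them.
entry = STBEntry([(1, 3)] * 16 + [(3, 5)] * 48)
bank, part = entry.lookup(0x5CA1AB1E)
```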

19 Agenda: Introduction, Background, Jigsaw Design (Operation, Monitoring, Configuration), Evaluation.

20 Monitoring. Software requires miss curves for each share. Add utility monitors (UMONs) per tile to produce miss curves, with dynamic sampling to model the full LLC at each bank; see the paper. [Diagram: UMON tag array (Way 0 to Way N-1) with per-way hit counters and a miss counter, yielding a miss curve over cache size.]

21 Configuration. Software decides the share configuration. Approach: SIZE, then PLACE. Solving the two steps independently is simple; sizing is hard, placing is easy. [Figure: miss curves (misses vs. size, up to the LLC size) drive sizing; placement then maps shares onto banks.]

22 Configuration: Sizing. Partitioning problem: divide cache capacity S among P partitions/shares to maximize hits. Miss curves describe each partition's behavior. The problem is NP-complete in general. Existing approaches: hill climbing is fast but gets stuck in local optima; UCP Lookahead is good but scales quadratically, O(P × S²) [Utility-based Cache Partitioning, Qureshi and Patt, MICRO '06]. Can we scale Lookahead?
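
For reference, here is a minimal sketch of the baseline UCP Lookahead loop being scaled (my restatement with illustrative names, not the UCP paper's code). Each round it scans every partition's entire miss curve for the step with the best marginal utility, which is what makes it O(P × S²):

```python
def ucp_lookahead(miss_curves, capacity):
    """Baseline UCP Lookahead (sketch). O(P * S^2): up to S rounds,
    each scanning all P curves over up to S candidate step sizes.

    miss_curves[p][s] = misses of partition p when given s units.
    Returns the per-partition allocation.
    """
    P = len(miss_curves)
    alloc = [0] * P
    while capacity > 0:
        best = None  # (utility, partition, step size)
        for p in range(P):
            curve = miss_curves[p]
            max_step = min(capacity, len(curve) - 1 - alloc[p])
            for step in range(1, max_step + 1):
                # Marginal utility: hits gained per unit of capacity.
                util = (curve[alloc[p]] - curve[alloc[p] + step]) / step
                if best is None or util > best[0]:
                    best = (util, p, step)
        if best is None:
            break  # all curves exhausted
        _, p, step = best
        alloc[p] += step
        capacity -= step
    return alloc
```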

23-39 Configuration: Lookahead. UCP Lookahead scans the miss curves to find the allocation that maximizes average cache utility (hits per byte). [Animation across slides 23-39: capacity is granted step by step along the partitions' miss curves; the final frames mark the point of maximum cache utility.]

40 Configuration: Lookahead. Observation: Lookahead traces the convex hull of the miss curve. [Figure: a miss curve with its convex hull; the maximum-cache-utility point lies on the hull.]

41 Convex Hulls. The convex hull of a curve is the boundary of the set containing all line segments between any two points on the curve; for miss curves, the relevant part is the lower hull, the curve connecting the points along the bottom. [Figures: a miss curve and its convex hull over the LLC size.]

42 Configuration: Peekahead. There are well-known linear-time algorithms to compute convex hulls. The Peekahead algorithm is an exact, linear-time implementation of UCP Lookahead. [Figures: a miss curve and its convex hull.]
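
Since a miss curve's points are already sorted by size, a single monotone-chain scan builds its lower hull in O(S). This is a sketch of that textbook algorithm, not the exact data structure Peekahead uses:

```python
def lower_hull(miss_curve):
    """Lower convex hull of a miss curve in one O(S) pass.

    miss_curve[s] = misses at an allocation of s units; since sizes are
    already in order, no sorting is needed. Returns the hull as a list
    of (size, misses) points.
    """
    hull = []
    for s, m in enumerate(miss_curve):
        # Pop the last hull point while it lies on or above the chord
        # from hull[-2] to (s, m): a non-positive cross product means
        # it is not strictly below the new segment.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (m - y1) - (s - x1) * (y2 - y1) <= 0:
                hull.pop()
            else:
                break
        hull.append((s, m))
    return hull
```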

43 Configuration: Peekahead. Peekahead computes all the convex hulls encountered during allocation in linear time: starting from every possible allocation, up to any remaining cache capacity. [Figures: hulls starting from different allocations.]

44 Configuration: Peekahead. Knowing the convex hull, each allocation step is O(log P): convex hulls have decreasing slope → decreasing average cache utility → only the next point on each hull needs to be considered. A max-heap compares the best step across partitions; a sketch of this loop follows slide 52. [Animation across slides 44-51: from the current allocation, the heap repeatedly picks the best next step among the partitions.]

52 Configuration: Peekahead. Full runtime is O(P × S), where P is the number of partitions and S the cache size. See the paper for additional examples, the algorithm, and corner cases, and the technical report for additional detail, proofs, and run-time analysis [Jigsaw: Scalable Software-Defined Caches (Extended Version), Nathan Beckmann and Daniel Sanchez, Technical Report MIT-CSAIL-TR, Massachusetts Institute of Technology, July 2013].
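
Putting the pieces together, here is a minimal sketch of the heap-driven allocation loop from slides 44-51. It assumes precomputed hulls and skips the hull recomputation and partial-step corner cases the paper covers:

```python
import heapq

def allocate(hulls, capacity):
    """Hull-walking allocation loop (sketch of Peekahead's outer loop).

    hulls[p] is partition p's lower hull as (size, misses) points,
    starting at size 0. Only the next hull point of each partition can
    be the best step, so a heap ordered by utility (hits per unit,
    negated for Python's min-heap) picks the global winner in O(log P).
    """
    def utility(h, i):
        (s0, m0), (s1, m1) = h[i], h[i + 1]
        return (m0 - m1) / (s1 - s0)

    alloc = [h[0][0] for h in hulls]
    pos = [0] * len(hulls)
    heap = [(-utility(h, 0), p) for p, h in enumerate(hulls) if len(h) > 1]
    heapq.heapify(heap)
    while capacity > 0 and heap:
        _, p = heapq.heappop(heap)
        h = hulls[p]
        step = h[pos[p] + 1][0] - h[pos[p]][0]
        if step > capacity:
            continue  # sketch only: the real algorithm handles partial steps
        pos[p] += 1
        alloc[p] = h[pos[p]][0]
        capacity -= step
        if pos[p] + 1 < len(h):  # push this partition's next hull step
            heapq.heappush(heap, (-utility(h, pos[p]), p))
    return alloc
```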

53 Re-configuration. When the STB configuration changes, some addresses hash to different banks. [Diagram: address 0x5CA1AB1E hashes to a different bank/partition descriptor after one STB entry changes.] Selective invalidation hardware walks the LLC and invalidates the lines that have moved. This is heavy-handed but infrequent, and it avoids a directory: at most 300K cycles per 50M-cycle reconfiguration interval = 0.6% overhead.

54 Design: Hardware Summary. Operation: the share-bank translation buffer (STB) handles accesses, and the TLB is augmented with a share id. Monitoring: UMON hardware produces miss curves. Configuration: invalidation hardware, plus bank partitioning hardware (Vantage). [Diagram: tile organization with core, L1I/L1D, L2, NoC router, and a Jigsaw L3 bank; the TLBs are modified structures, while the STB, partitioning, monitoring, and invalidation hardware are new.]

55 Agenda: Introduction, Background, Jigsaw Design, Evaluation (Methodology, Performance, Energy).

56 Methodology. Execution-driven simulation using zsim. Workloads: 16-core single-threaded mixes of SPEC CPU2006, and 64-core multithreaded (4x16-thread) mixes of PARSEC. Cache organizations: LRU (shared S-NUCA cache with LRU replacement; the baseline), Vantage (S-NUCA with Vantage partitioning and UCP Lookahead), R-NUCA (a state-of-the-art shared-baseline D-NUCA organization), IdealSPD (an idealized shared-private D-NUCA: private L3 + shared L4 with 2x the capacity of the other schemes, an upper bound for private-baseline D-NUCA organizations), and Jigsaw.

57 Evaluation: Performance. On 16-core multiprogrammed mixes of SPEC CPU2006, Jigsaw achieves the best performance: up to 50% higher throughput and up to 2.2x higher weighted speedup, with gmean +14% throughput and +18% weighted speedup. Jigsaw does even better on the most memory-intensive mixes (top 20% by LRU MPKI): gmean +21% throughput and +29% weighted speedup.

58 Evaluation: Performance. On 64-core multithreaded mixes of PARSEC, Jigsaw achieves the best performance: gmean +9% throughput and +9% weighted speedup. Remember that IdealSPD is an upper bound with 2x the capacity.

59 Evaluation: Performance Breakdown. On 16-core multiprogrammed mixes of SPEC CPU2006, memory stalls are broken down into network and DRAM components, normalized to LRU. R-NUCA is limited by capacity in these workloads (private data → local bank). Vantage only benefits DRAM. IdealSPD acts as either a private organization (benefiting the network) or a shared organization (benefiting DRAM). Jigsaw is the only scheme to simultaneously benefit network and DRAM latency.

60 Evaluation: Energy. On 16-core multiprogrammed mixes, using McPAT models of full-system energy (chip + DRAM), Jigsaw achieves the best energy reduction: up to 72%, with a gmean of 11%. It reduces both network and DRAM energy.

61 Conclusion. NUCA is giving us more capacity, but further away, and applications have widely varying cache behavior [figure: MPKI vs. cache size, 0 to 16 MB], so the cache organization should adapt to meet application needs. Jigsaw uses physical cache resources as building blocks of virtual caches, or shares, sized to fit each working set and placed near the application for low latency. Jigsaw improves performance by up to 2.2x and reduces energy by up to 72%.

62 QUESTIONS [Backup montage: a miss curve with its convex hull, and a UMON tag array with hash, limit register, and hit counters.]

63 Placement. Greedy algorithm: each share is allocated a budget; shares take turns grabbing space in nearby banks, with banks ordered by distance from the center of mass of the cores accessing the share; repeat until budgets and banks are exhausted (see the sketch below).
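
A minimal sketch of this greedy loop, under assumed interfaces (bank coordinates, a fixed per-turn chunk size); the real system allocates bank partitions, not raw byte counts:

```python
from math import hypot

CHUNK = 64 * 1024  # space grabbed per turn (illustrative granularity)

def place(shares, banks, bank_capacity):
    """Greedy round-robin placement (sketch of the slide's algorithm).

    shares: list of (budget, center) pairs, where budget is the share's
    size allocation and center is the (x, y) center of mass of the
    cores that access it. banks: list of (x, y) bank coordinates.
    """
    free = [bank_capacity] * len(banks)
    remaining = [budget for budget, _ in shares]
    # Each share visits banks nearest its center of mass first.
    order = [sorted(range(len(banks)),
                    key=lambda b, c=center: hypot(banks[b][0] - c[0],
                                                  banks[b][1] - c[1]))
             for _, center in shares]
    placement = [{} for _ in shares]  # share -> {bank: bytes}
    while any(remaining):
        progress = False
        for s in range(len(shares)):          # shares take turns
            while order[s] and free[order[s][0]] == 0:
                order[s].pop(0)               # nearest bank full; move on
            if remaining[s] and order[s]:
                b = order[s][0]
                grab = min(CHUNK, remaining[s], free[b])
                placement[s][b] = placement[s].get(b, 0) + grab
                free[b] -= grab
                remaining[s] -= grab
                progress = True
        if not progress:
            break                             # banks exhausted
    return placement
```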

64 Monitoring. Software requires miss curves for each share. Add UMONs per tile: a small tag array that models LRU on a sample of accesses, tracking hits per way and misses → a miss curve. Changing the sampling rate models a larger cache:

    Sampling Rate = UMON Lines / Modeled Cache Lines

The STB spreads lines proportionally to partition size, so the sampling rate must compensate:

    Sampling Rate = (Share Size / Partition Size) × (UMON Lines / Modeled Cache Lines)

65 Monitoring. The STB spreads addresses unevenly → change the sampling rate to compensate. Augment the UMON with a hash function (shared with the STB) and a 32-bit limit register that gives fine control over the sampling rate: an access is sampled only if its hash falls below the limit. [Diagram: the address is hashed and compared against the limit; sampled accesses update the UMON tag array and hit counters.] The UMON now models the full LLC capacity exactly, and each share requires only one UMON. At most four shares per bank → four UMONs per bank → 1.4% overhead.
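
To make the limit-register arithmetic concrete, a small sketch (the function and parameter names are assumptions): the limit simply encodes the compensated sampling rate as a fraction of the 32-bit hash range.

```python
def umon_limit(share_size, partition_size, umon_lines, modeled_lines):
    """Compute the 32-bit limit register value for a bank's UMON (sketch).

    Sampling rate = (share size / partition size)
                  * (UMON lines / modeled cache lines),
    scaled to the 2^32 hash range; an access is sampled iff
    hash(address) < limit.
    """
    rate = (share_size / partition_size) * (umon_lines / modeled_lines)
    return min(int(rate * 2**32), 2**32 - 1)

# Example: a 2 MB share with a 512 KB partition in this bank, and a
# 256-line UMON modeling a 32K-line LLC -> rate = 4 * (1/128) = 1/32.
limit = umon_limit(2 << 20, 512 << 10, 256, 32 << 10)
```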

66 Evaluation: Extra. See the paper for: out-of-order results, execution time breakdown, Peekahead performance, and sensitivity studies.
