Are we ready for high-MLP?

1 Are we ready for high-MLP? James Tuck, Josep Torrellas

2 Why MLP?
Overlapping long-latency misses is a very effective way of tolerating memory latency.
It hides latency at the cost of bandwidth.

6 Motivation
Many recent high-MLP innovations.
Many rely on processor checkpointing and speculative execution: Runahead, CPR, CAVA, Clear, CFP, ...
But what happens in the memory system? Misses pile up quickly, very quickly.

10 Miss Handling Architectures (MHAs)
The logic and resources needed to support outstanding misses in a cache.
Consolidate misses to the same line (primary/secondary misses).
May perform data forwarding.
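
The consolidation of primary and secondary misses can be illustrated with a minimal sketch. It is not the design from the slides: the class names (`MissEntry`, `MissHandler`), the 64-byte line size and the `(offset, dest)` waiter format are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MissEntry:
    """One outstanding cache line and the requests waiting on it."""
    line_addr: int
    waiters: list = field(default_factory=list)  # (offset, dest) pairs


class MissHandler:
    """Toy MHA (hypothetical): one entry per outstanding line."""

    def __init__(self, line_bytes=64):  # line size is an assumption
        self.line_bytes = line_bytes
        self.entries = {}

    def on_miss(self, addr, dest):
        """Record a miss; return 'primary' or 'secondary'."""
        line_addr = addr // self.line_bytes
        offset = addr % self.line_bytes
        entry = self.entries.get(line_addr)
        if entry is None:
            # First miss to this line: allocate an entry (the fetch would start here).
            entry = MissEntry(line_addr)
            self.entries[line_addr] = entry
            kind = "primary"
        else:
            # Line already requested: merge instead of issuing a new request.
            kind = "secondary"
        entry.waiters.append((offset, dest))
        return kind

    def on_fill(self, line_addr, line_data):
        """Line arrived: forward the requested byte to every waiter."""
        entry = self.entries.pop(line_addr)
        return [(dest, line_data[offset]) for offset, dest in entry.waiters]


mha = MissHandler()
kinds = [mha.on_miss(0x1000, "r1"), mha.on_miss(0x1008, "r2"), mha.on_miss(0x2000, "r3")]
forwarded = mha.on_fill(0x1000 // 64, bytes(range(64)))
```

Here the second request falls in the same line as the first, so it is merged as a secondary miss, and both waiters are served by a single fill.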

19 MSHR
Key structure in the MHA, proposed by [Kroft 81].
As described in [Farkas 94], an MSHR can be:
Implicitly addressed: as many subentries as words in a cache line, e.g. one [dest,data] subentry each for word 0, word 1, word 2 and word 3.
Explicitly addressed: an arbitrary number of subentries, each holding off,[dest,data].
Limited information on existing designs.
MSHRs take significant chip area.
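
The difference between the two addressing schemes can be sketched as follows. This is a simplified illustration, not a hardware description: the class names are hypothetical, the data field of each subentry is omitted, and the four-word line simply mirrors the word 0 to word 3 example above.

```python
class ImplicitMSHR:
    """One subentry slot per word of the line; the slot index is the word."""

    def __init__(self, words_per_line=4):
        self.slots = [None] * words_per_line  # slot i holds the dest for word i

    def add(self, word, dest):
        if self.slots[word] is not None:
            return False  # a second miss to the same word cannot be recorded
        self.slots[word] = dest
        return True


class ExplicitMSHR:
    """A fixed number of subentries, each storing its own word offset."""

    def __init__(self, num_subentries=4):
        self.num_subentries = num_subentries
        self.subentries = []  # (offset, dest) pairs

    def add(self, word, dest):
        if len(self.subentries) == self.num_subentries:
            return False  # out of subentries
        self.subentries.append((word, dest))
        return True


implicit, explicit = ImplicitMSHR(), ExplicitMSHR()
# Two misses to word 1, then one miss to word 3.
implicit_results = [implicit.add(1, "r1"), implicit.add(1, "r2"), implicit.add(3, "r3")]
explicit_results = [explicit.add(1, "r1"), explicit.add(1, "r2"), explicit.add(3, "r3")]
```

In this sketch the implicit scheme rejects the repeated miss to word 1, while the explicit scheme accepts all three requests at the cost of storing an offset per subentry.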

20 Processors Considered
Superscalar
LargeWindow: as above but with a 512-entry i-window and a 2048-entry ROB; hides latency with independent work.
Checkpointed: checkpoint-assisted value prediction (CAVA [Ceze 04]).
All in a two-context SMT organization.

24 Outstanding Misses Distribution
[Figure: cumulative % of time vs. number of outstanding read misses, one panel per processor]
Superscalar: 90% of the time < 16.
Checkpointed: 50% of the time > 40.
LargeWindow: 50% of the time > 100.

25 MHA Design Space
Capacity: number of lines, number of secondary read misses per line, number of secondary write misses per line.
Associativity.

28 Capacity - Entries
Unlimited secondary misses.
[Figure: speedup over Unlimited for entry counts up to 32e and for Unlimited; Int.GM, FP.GM, Mix.GM; one panel per processor]
Superscalar: 20% impact.
Checkpointed: 50% impact.
LargeWindow: 50% impact.

32 Subentries
Checkpointed processor only.
[Figure: speedup over Unlimited with 4r, 8r, 16r, 32r and Unlimited read subentries; Int.GM, FP.GM, Mix.GM] Read subentries: 40% impact.
[Figure: speedup over Unlimited with 4w, 8w, 16w, 32w and Unlimited write subentries; Int.GM, FP.GM, Mix.GM] Write subentries: 17% impact.
[Figure: distribution of used MSHRs (%) for S, L, C in Int.GM, FP.GM, Mix.GM, split into Read+Write, Write Sub Only and Read Sub Only] MSHRs rarely hold both reads and writes.

34 Associativity
32 entries total.
[Figure: speedup over FullyAssoc for FullyAssoc, 16-way, 8-way, 4-way and 2-way; Int.GM, FP.GM, Mix.GM]
Checkpointed: 27% impact.

38 Two High-MLP Designs
[Diagram: Banked, with bank 0 through bank 7 (diagram label: 32); each entry has explicitly addressed read subentries (off, dest) and implicit writes]
[Diagram: Unified (diagram label: 32), with set0 and set1, each 8-way; entries have the same read-subentry (off, dest) and implicit-write layout]
[Diagram: Current, with off, dest and data subentries; 8 misses of any type]
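
One consequence of banking (and of limited associativity in general) is that a miss can be rejected while the structure as a whole still has free entries. The sketch below illustrates this under stated assumptions: the class `BankedMHA`, the use of low-order line-address bits to pick a bank, and the four entries per bank are all hypothetical and are not taken from the slides. Only the count of eight banks follows the "bank 0 ... bank 7" labels.

```python
class BankedMHA:
    """Entries split into banks; a line can only use its own bank."""

    def __init__(self, num_banks=8, entries_per_bank=4):
        self.entries_per_bank = entries_per_bank  # hypothetical capacity
        self.banks = [set() for _ in range(num_banks)]

    def bank_of(self, line_addr):
        # Assumption: low-order line-address bits select the bank.
        return line_addr % len(self.banks)

    def allocate(self, line_addr):
        bank = self.banks[self.bank_of(line_addr)]
        if line_addr in bank:
            return True  # already outstanding; nothing new to allocate
        if len(bank) == self.entries_per_bank:
            return False  # this bank is full even if others have room
        bank.add(line_addr)
        return True


mha = BankedMHA()
# Lines 0, 8, 16, 24, 32 all map to bank 0 (the stride equals the bank count).
same_bank = [mha.allocate(line) for line in (0, 8, 16, 24, 32)]
# Line 1 maps to bank 1, which is still empty.
other_bank = mha.allocate(1)
```

The fifth line mapping to bank 0 is rejected although 28 of the 32 entries are free, which is the kind of conflict that shows up as a slowdown relative to a fully associative organization.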

41 Performance
[Figure: speedup over Current for Current, Unified, Banked and Unlimited; Int.GM, FP.GM, Mix.GM; one panel per processor]
Superscalar: 15% in Mix.
Checkpointed: 70% in Mix.
LargeWindow: 65% in Mix.

42 Additional Issues
Bus bandwidth concerns.
Initial experiments with bus prioritization in CAVA: assign high priority to requests with low-confidence predictions; up to 20% speedup (swim).
Relationship with the load-store queue.

48 Conclusion
New MHAs are needed for high-MLP processors: they need to support orders of magnitude more misses.
It may be time to start rethinking miss handling.
Bus prioritization might be a good idea, especially in a CMP world.

50 Bus Concerns
Lots of misses, a lot of bus traffic.
[Figure: Fig. 9. Bus contention normalized to Current in the Checkpointed processor, for Current, Unified and Banked. Benchmarks: bzip2, gap, mcf, perlbmk, ammp, applu, art, equake, mesa, mgrid, swim, wupwise. Mixes: artequake, artgap, artperlbmk, equakeperlbmk, mesaart, mgridmcf, swimmcf, wupwiseperlbmk. Means: Int.GM, FP.GM, Mix.GM]

52 Bus Request Prioritization?
Initial experiments in bus prioritization in CAVA.
Two priorities: high/low.
Assign high priority to requests with low-confidence predictions.
[Figure: speedup over No-Prio for equake, swim, artequake, artgap, artperlbmk, mesaart, mgridmcf, swimmcf, Int.GM, FP.GM, Mix.GM]
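
The two-level policy above can be sketched with a small priority queue. This is only an illustration: the class `PrioritizedBus`, the 0.5 confidence cutoff and the FIFO ordering within a priority level are assumptions, not details given on the slides.

```python
import heapq
from itertools import count

HIGH, LOW = 0, 1  # the smaller value is served first


class PrioritizedBus:
    """Two-level priority queue; FIFO order within a level (assumed)."""

    def __init__(self, confidence_threshold=0.5):
        # Hypothetical cutoff separating low- from high-confidence predictions.
        self.confidence_threshold = confidence_threshold
        self._queue = []
        self._seq = count()  # tie-breaker that preserves arrival order

    def enqueue(self, line_addr, confidence):
        # Requests whose value prediction has low confidence get high priority.
        priority = HIGH if confidence < self.confidence_threshold else LOW
        heapq.heappush(self._queue, (priority, next(self._seq), line_addr))

    def next_request(self):
        return heapq.heappop(self._queue)[2]


bus = PrioritizedBus()
for addr, conf in [(0xA0, 0.9), (0xB0, 0.2), (0xC0, 0.8), (0xD0, 0.1)]:
    bus.enqueue(addr, conf)
order = [bus.next_request() for _ in range(4)]
```

The two low-confidence requests (0xB0 and 0xD0) are served first, in arrival order, followed by the high-confidence ones.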

62 A Few Words on LSQs
Why have subentries?
Make the MHA keep the primary miss only; the LSQ entries keep an entry-id that is searched when the miss completes. Similar to AMD's Opteron?
May not be a good idea in new high-MLP processors:
scalable LSQ proposals are optimized for accesses from the processor side;
cache-side searches would possibly require global searches;
speculative retirement of instructions causes LSQ entries to be recycled.
Bottom line: decouple things as much as possible and try to leave the LSQ alone; it has enough of its own problems.
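
The alternative discussed on this slide, where the MHA tracks only the primary miss and waiting loads carry an entry-id, can be sketched as below. The names `LSQEntry` and `on_miss_complete`, and the use of -1 for "not waiting", are hypothetical. The point of the sketch is that completion requires a search over the whole queue.

```python
from dataclasses import dataclass


@dataclass
class LSQEntry:
    dest: str
    mha_entry_id: int = -1  # -1 means not waiting on a miss (hypothetical encoding)
    ready: bool = False


def on_miss_complete(lsq, entry_id):
    """Wake every load waiting on `entry_id`; scans the whole queue."""
    woken = []
    for load in lsq:  # the cache-side search touches every LSQ entry
        if load.mha_entry_id == entry_id:
            load.ready = True
            woken.append(load.dest)
    return woken


lsq = [LSQEntry("r1", 3), LSQEntry("r2", 7), LSQEntry("r3", 3), LSQEntry("r4")]
woken = on_miss_complete(lsq, 3)
```

With a queue of hundreds of entries, as in the larger configurations, this scan is the global search the slide warns about.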

63 Experimental Setup
Processor (Conventional, Checkpointed, LargeWindow):
Fetch/Issue/Comm: 6/5/5
SMT contexts: 2
I-window/ROB size: 92/ /2048
Int/FP regs: 192/ /2048
Ld/St Q entries: 60/50, 768/768
Memory system:
I-L1: 32KB/2-way, round-trip latency 2 cyc
D-L1: 32KB/2-way, round-trip latency 3 cyc
L2: 2MB/8-way, round-trip latency 15 cyc
16-stream stride prefetcher (between L2 and memory)
Bus BW: 10GB/s
Mem RT: 650 cyc

64 Assumptions
Decoupled processor, cache and MHA interaction.
All requests sent to the cache are sure to be fulfilled.
When the MHA is full, the cache locks up.
The MHA is considered full when it is possible that a request won't be accepted.
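
One possible reading of the last assumption is a conservative fullness check: the MHA reports full as soon as some next miss could be rejected, whether for lack of a free entry or because an existing entry has run out of subentries. The function below is a sketch of that reading only. Its name, its parameters and the single per-entry subentry limit are assumptions.

```python
def mha_is_full(entries, max_entries, max_subentries):
    """Conservative check: True if some possible next miss could be rejected.

    `entries` maps line address -> number of subentries in use.
    """
    # A miss to a new line needs a free entry.
    no_free_entry = len(entries) >= max_entries
    # A miss to an outstanding line needs a free subentry in that entry.
    some_entry_exhausted = any(used >= max_subentries for used in entries.values())
    return no_free_entry or some_entry_exhausted


roomy = mha_is_full({0x10: 1, 0x20: 2}, max_entries=4, max_subentries=4)
entries_gone = mha_is_full({0x10: 1, 0x20: 1}, max_entries=2, max_subentries=4)
subentries_gone = mha_is_full({0x10: 4}, max_entries=4, max_subentries=4)
```

Under this reading the cache locks up before any request is actually refused, which keeps the promise that every request sent to the cache is fulfilled.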


More information

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor Boost Sequential Program Performance Using A Virtual Large Instruction Window on Chip Multicore Processor Liqiang He Inner Mongolia University Huhhot, Inner Mongolia 010021 P.R.China liqiang@imu.edu.cn

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache optimization Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache

More information

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)

Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J Lilja lilja@eceumnedu Acknowledgements! Graduate students

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Introducing Multi-core Computing / Hyperthreading

Introducing Multi-core Computing / Hyperthreading Introducing Multi-core Computing / Hyperthreading Clock Frequency with Time 3/9/2017 2 Why multi-core/hyperthreading? Difficult to make single-core clock frequencies even higher Deeply pipelined circuits:

More information

The Sluice Gate Theory: Have we found a solution for memory wall?

The Sluice Gate Theory: Have we found a solution for memory wall? The Sluice Gate Theory: Have we found a solution for memory wall? Xian-He Sun Illinois Institute of Technology Chicago, Illinois sun@iit.edu Keynote, HPC China, Nov. 2, 205 Scalable Computing Software

More information

Improving Adaptability and Per-Core Performance of Many-Core Processors Through Reconfiguration

Improving Adaptability and Per-Core Performance of Many-Core Processors Through Reconfiguration Int J Parallel Prog (2010) 38:203 224 DOI 10.1007/s10766-010-0128-3 Improving Adaptability and Per-Core Performance of Many-Core Processors Through Reconfiguration Tameesh Suri Aneesh Aggarwal Received:

More information

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science CS 2002 03 A Large, Fast Instruction Window for Tolerating Cache Misses 1 Tong Li Jinson Koppanalil Alvin R. Lebeck Jaidev Patwardhan Eric Rotenberg Department of Computer Science Duke University Durham,

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Exploiting Criticality to Reduce Bottlenecks in Distributed Uniprocessors

Exploiting Criticality to Reduce Bottlenecks in Distributed Uniprocessors Exploiting Criticality to Reduce Bottlenecks in Distributed Uniprocessors Behnam Robatmili Sibi Govindan Doug Burger Stephen W. Keckler beroy@cs.utexas.edu sibi@cs.utexas.edu dburger@microsoft.com skeckler@nvidia.com

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding

More information

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James

Computer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James Computer Systems Architecture I CSE 560M Lecture 17 Guest Lecturer: Shakir James Plan for Today Announcements and Reminders Project demos in three weeks (Nov. 23 rd ) Questions Today s discussion: Improving

More information

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas 78712-24 TR-HPS-26-3

More information

The University of Texas at Austin

The University of Texas at Austin EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin

More information

Prefetch-Aware DRAM Controllers

Prefetch-Aware DRAM Controllers Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin

More information

MARACAS: A Real-Time Multicore VCPU Scheduling Framework

MARACAS: A Real-Time Multicore VCPU Scheduling Framework : A Real-Time Framework Computer Science Department Boston University Overview 1 2 3 4 5 6 7 Motivation platforms are gaining popularity in embedded and real-time systems concurrent workload support less

More information

Accelerating and Adapting Precomputation Threads for Efficient Prefetching

Accelerating and Adapting Precomputation Threads for Efficient Prefetching In Proceedings of the 13th International Symposium on High Performance Computer Architecture (HPCA 2007). Accelerating and Adapting Precomputation Threads for Efficient Prefetching Weifeng Zhang Dean M.

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

Dynamic Memory Dependence Predication

Dynamic Memory Dependence Predication Dynamic Memory Dependence Predication Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles Background 1. Store instructions do not update the cache until they are retired (too late). 2. Store queue is

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Dynamic Speculative Precomputation

Dynamic Speculative Precomputation In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Dynamic Speculative Precomputation Jamison D. Collins y, Dean M. Tullsen y, Hong Wang z, John P. Shen z y Department

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah ABSTRACT The growing dominance of wire delays at future technology

More information

Dynamically Controlled Resource Allocation in SMT Processors

Dynamically Controlled Resource Allocation in SMT Processors Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt Department of Electrical and Computer Engineering The University

More information

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011

15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 15-740/18-740 Computer Architecture Lecture 14: Runahead Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 Reviews Due Today Chrysos and Emer, Memory Dependence Prediction Using

More information

SEVERAL studies have proposed methods to exploit more

SEVERAL studies have proposed methods to exploit more IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying

More information

Continual Flow Pipelines

Continual Flow Pipelines Continual Flow Pipelines Srikanth T. Srinivasan Ravi Rajwar Haitham Akkary Amit Gandhi Mike Upton Microarchitecture Research Labs Intel Corporation {srikanth.t.srinivasan, ravi.rajwar, haitham.h.akkary,

More information

Virtual Memory. Virtual Memory

Virtual Memory. Virtual Memory Virtual Memory Virtual Memory Main memory is cache for secondary storage Secondary storage (disk) holds the complete virtual address space Only a portion of the virtual address space lives in the physical

More information

Speculative Parallelization in Decoupled Look-ahead

Speculative Parallelization in Decoupled Look-ahead International Conference on Parallel Architectures and Compilation Techniques Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

Many Cores, One Thread: Dean Tullsen University of California, San Diego

Many Cores, One Thread: Dean Tullsen University of California, San Diego Many Cores, One Thread: The Search for Nontraditional Parallelism University of California, San Diego There are some domains that feature nearly unlimited parallelism. Others, not so much Moore s Law and

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

Zero-Value Caches: Cancelling Loads that Return Zero

Zero-Value Caches: Cancelling Loads that Return Zero 2 th International Conference on Parallel Architectures and Compilation Techniques Zero-Value Caches: Cancelling Loads that Return Zero Mafijul Md. Islam and Per Stenstrom Department of Computer Science

More information

Instruction Based Memory Distance Analysis and its Application to Optimization

Instruction Based Memory Distance Analysis and its Application to Optimization Instruction Based Memory Distance Analysis and its Application to Optimization Changpeng Fang cfang@mtu.edu Steve Carr carr@mtu.edu Soner Önder soner@mtu.edu Department of Computer Science Michigan Technological

More information

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University

Lecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads

Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads Davy Genbrugge Lieven Eeckhout ELIS Depment, Ghent University, Belgium Email: {dgenbrug,leeckhou}@elis.ugent.be Abstract This

More information