Are We Ready for High-MLP?
Transcription
Slide 1: Are We Ready for High-MLP?
Luis Ceze, James Tuck, and Josep Torrellas
Slide 2: Why MLP?
- Overlapping long-latency misses is a very effective way of tolerating memory latency.
- It hides latency at the cost of bandwidth.
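The latency-for-bandwidth trade can be made concrete with a back-of-the-envelope calculation. The 650-cycle memory round trip comes from the experimental setup slide; the 64-byte line size and the count of 16 misses are assumptions for illustration.

```python
MISS_LATENCY = 650   # round-trip memory latency in cycles (from the setup slide)
LINE_BYTES = 64      # cache line size (assumed, not stated in the slides)
NUM_MISSES = 16

# One miss at a time: latencies add up.
serial_cycles = NUM_MISSES * MISS_LATENCY        # 16 * 650 = 10400 cycles

# All misses overlapped: roughly one miss latency covers them all.
overlapped_cycles = MISS_LATENCY                 # 650 cycles

# The same bytes must now be delivered in 1/16th of the time, so the
# required memory bandwidth goes up by the overlap factor.
serial_bw = NUM_MISSES * LINE_BYTES / serial_cycles       # bytes per cycle
overlap_bw = NUM_MISSES * LINE_BYTES / overlapped_cycles  # 16x higher

print(serial_cycles, overlapped_cycles, overlap_bw / serial_bw)
```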
Slide 3: Motivation
- There have been many recent high-MLP innovations.
- Many rely on processor checkpointing and speculative execution: Runahead, CPR, CAVA, Clear, CFP, ...
- But what happens in the memory system? Misses pile up quickly, very quickly.
Slide 4: Miss Handling Architectures
- An MHA is the logic and resources needed to support outstanding misses in a cache.
- It consolidates misses to the same line (primary/secondary misses).
- It may perform data forwarding.
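A minimal sketch of the consolidation described above, using a plain dictionary rather than any real hardware organization: the first miss to a line is primary and allocates an entry (and would issue the memory request); later misses to the same line are secondary and merge into the existing entry.

```python
class MissHandler:
    """Toy MHA: consolidates misses to the same cache line."""

    def __init__(self):
        self.pending = {}          # line address -> destinations waiting on it

    def access(self, line_addr, dest_reg):
        if line_addr in self.pending:
            self.pending[line_addr].append(dest_reg)   # secondary miss: merge
            return "secondary"
        self.pending[line_addr] = [dest_reg]           # primary miss: new entry,
        return "primary"                               # would go to memory

    def fill(self, line_addr):
        # Data returns from memory: forward to every merged requester
        # and free the entry.
        return self.pending.pop(line_addr)

mha = MissHandler()
print(mha.access(0x100, "r1"))   # primary
print(mha.access(0x100, "r2"))   # secondary
print(mha.fill(0x100))           # ['r1', 'r2']
```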
Slide 5: MSHRs
- The Miss Status Holding Register (MSHR) is the key structure in the MHA, proposed by [Kroft 81].
- As described in [Farkas 94], subentries can be:
  - Implicitly addressed: one [dest, data] subentry per word in the cache line (word 0 .. word 3), so there are as many subentries as words.
  - Explicitly addressed: an arbitrary number of subentries, each of the form [off, dest, data].
- There is limited information on existing designs.
- MSHRs take significant chip area.
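The two subentry organizations can be sketched as a toy model, assuming a 4-word line (field names are illustrative): an implicitly addressed entry dedicates one slot per word, so position encodes the offset but a second miss to the same word cannot merge; an explicitly addressed entry records the offset inside each subentry, so several misses to the same word can coexist until the subentries run out.

```python
WORDS_PER_LINE = 4

def make_implicit_entry():
    return [None] * WORDS_PER_LINE          # slot i serves word i of the line

def add_implicit(entry, word, dest):
    if entry[word] is not None:
        return False                        # word's slot taken: cannot merge
    entry[word] = dest                      # offset is implied by position
    return True

def make_explicit_entry(max_subentries):
    return {"max": max_subentries, "subs": []}

def add_explicit(entry, word, dest):
    if len(entry["subs"]) >= entry["max"]:
        return False                        # out of subentries
    entry["subs"].append((word, dest))      # offset stored explicitly
    return True
```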
Slide 6: Processors Considered
- Superscalar.
- LargeWindow: as above, but with a 512-entry i-window and 2048-entry ROB; hides latency with independent work.
- Checkpointed: checkpoint-assisted value prediction (CAVA [Ceze 04]).
- All in a two-context SMT organization.
Slide 7: Outstanding Misses Distribution
[Figure: cumulative % of time vs. number of outstanding read misses, one plot per processor]
- Superscalar: 90% of the time there are fewer than 16 outstanding read misses.
- Checkpointed: 50% of the time there are more than 40.
- LargeWindow: 50% of the time there are more than 100.
Slide 8: MHA Design Space
- Capacity (number of entries/lines).
- Number of secondary read misses per line.
- Number of secondary write misses per line.
- Associativity.
Slide 9: Capacity (Entries)
Unlimited secondary misses assumed.
[Figure: speedup over Unlimited for 4, 8, 16, and 32 entries; Int.GM, FP.GM, Mix.GM]
- Superscalar: 20% impact.
- Checkpointed: 50% impact.
- LargeWindow: 50% impact.
Slide 10: Subentries (Checkpointed Processor Only)
[Figures: speedup over Unlimited for 4, 8, 16, and 32 read or write subentries; distribution of used MSHRs holding read-only, write-only, or read+write subentries]
- Read subentries: 40% impact.
- Write subentries: 17% impact.
- Used MSHRs rarely hold both read and write subentries.
Slide 11: Associativity (32 Entries Total)
[Figure: speedup over FullyAssoc for 16-, 8-, 4-, and 2-way organizations]
- Checkpointed: 27% impact.
Slide 12: Two High-MLP Designs
[Figure: three MHA organizations]
- Banked: 32 entries spread across 8 banks (bank 0 .. bank 7); each entry holds explicitly addressed read subentries [off, dest] plus implicitly addressed write subentries.
- Unified: 32 entries in one set-associative structure (8-way sets: set 0, set 1, ...), with the same entry format.
- Current: 8 entries for misses of any type, each with per-word [off, dest] and data fields.
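The allocation behavior that distinguishes a banked organization can be sketched with a toy model (the modulo bank-selection hash and the 4-entries-per-bank split are assumptions for illustration, not the paper's exact parameters): a miss can be rejected because its bank is full even while other banks still have free entries.

```python
NUM_BANKS = 8
ENTRIES_PER_BANK = 4          # 8 banks x 4 entries = 32 entries (assumed split)

class BankedMHA:
    """Toy banked MHA: entries are partitioned across address-hashed banks."""

    def __init__(self):
        self.banks = [set() for _ in range(NUM_BANKS)]

    def allocate(self, line_addr):
        bank = self.banks[line_addr % NUM_BANKS]   # simple bank-selection hash
        if line_addr in bank:
            return True                            # secondary miss: merges
        if len(bank) >= ENTRIES_PER_BANK:
            return False                           # this bank is full: reject
        bank.add(line_addr)                        # primary miss: new entry
        return True
```

A fully unified structure would only reject once all 32 entries were in use; the banked one trades that flexibility for cheaper, narrower structures.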
Slide 13: Performance
[Figure: speedup over Current for Current, Unified, Banked, and Unlimited; Int.GM, FP.GM, Mix.GM]
- Superscalar: 15% in Mix.
- Checkpointed: 70% in Mix.
- LargeWindow: 65% in Mix.
Slide 14: Additional Issues
- Bus bandwidth concerns.
- Initial experiments with bus prioritization in CAVA: assign high priority to requests with low-confidence predictions; up to 20% speedup (swim).
- Relationship with the load-store queue.
Slide 15: Conclusion
- New MHAs are needed for high-MLP processors: they must support orders of magnitude more misses.
- It may be time to start rethinking miss handling.
- Bus prioritization might be a good idea, especially in a CMP world.
Slide 16: Bus Concerns
- Lots of misses means a lot of bus traffic.
- Fig. 9: bus contention normalized to Current in the Checkpointed processor (Current, Unified, Banked; SPEC applications and mixes).
Slide 17: Bus Request Prioritization?
- Initial experiments with bus prioritization in CAVA.
- Two priorities (high/low); assign high priority to requests with low-confidence predictions.
[Figure: speedup over No-Prio across applications and mixes]
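The prioritization idea can be sketched as a two-level arbiter (a hypothetical model, not the actual bus implementation): requests carrying low-confidence value predictions are the ones most likely to be on the true critical path, so they always win arbitration over the rest.

```python
from collections import deque

class PrioBus:
    """Toy two-level bus arbiter: high-priority requests always go first."""

    def __init__(self):
        self.high = deque()
        self.low = deque()

    def request(self, req_id, low_confidence_prediction):
        # A low-confidence prediction is likely wrong, so the processor will
        # probably need this data soon: tag the request high priority.
        (self.high if low_confidence_prediction else self.low).append(req_id)

    def grant(self):
        if self.high:
            return self.high.popleft()     # high priority always wins
        return self.low.popleft() if self.low else None
```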
Slide 18: A Few Words on LSQs
- Why have subentries at all? Alternative: make the MHA keep only the primary miss, and have each LSQ entry keep an entry-id that is searched when the miss completes (similar to AMD's Opteron?).
- This may not be a good idea in new high-MLP processors:
  - Scalable LSQ proposals are optimized for accesses from the processor side; cache-side searches would possibly require global searches.
  - With speculative retirement, instructions cause LSQ entries to be recycled.
- Bottom line: decouple things as much as possible and try to leave the LSQ alone; it has enough problems of its own.
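The subentry-free alternative above can be sketched as a toy model (hypothetical structure; the Opteron comparison on the slide is posed as a question, not a confirmed design): the MHA keeps only the primary miss, each LSQ entry records the MHA entry-id it waits on, and miss completion triggers exactly the cache-side global scan the slide warns about.

```python
class LSQ:
    """Toy LSQ where entries tag themselves with an MHA entry-id."""

    def __init__(self):
        self.entries = []                  # (instruction, mha_entry_id or None)

    def add(self, inst, mha_entry_id):
        self.entries.append((inst, mha_entry_id))

    def wakeup(self, mha_entry_id):
        # Cache-side search on miss completion: scan the WHOLE queue for
        # entries waiting on this MHA entry. Scalable LSQ designs optimized
        # for processor-side access are not built for this global search.
        woken = [inst for inst, eid in self.entries if eid == mha_entry_id]
        self.entries = [(i, e) for i, e in self.entries if e != mha_entry_id]
        return woken
```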
Slide 19: Experimental Setup

Processor (Conventional / Checkpointed, with LargeWindow values where they differ):
- Fetch/Issue/Commit width: 6/5/5
- SMT contexts: 2
- I-window/ROB size: 92/... (LargeWindow: 512/2048)
- Int/FP registers: 192/... (LargeWindow: .../2048)
- Ld/St queue entries: 60/50 (LargeWindow: 768/768)

Memory system:
- I-L1: 32 KB, 2-way; round-trip latency 2 cycles
- D-L1: 32 KB, 2-way; round-trip latency 3 cycles
- L2: 2 MB, 8-way; round-trip latency 15 cycles
- 16-stream stride prefetcher (between L2 and memory)
- Bus bandwidth: 10 GB/s; memory round-trip: 650 cycles
Slide 20: Assumptions
- Decoupled processor, cache, and MHA interaction: all requests sent to the cache are sure to be fulfilled.
- When the MHA is full, the cache locks up.
- The MHA is considered full when it is possible that a request won't be accepted.
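The conservative "full" rule above can be sketched like this (a toy model under the stated assumptions, with made-up entry and subentry counts): the MHA declares itself full as soon as some request could be rejected, either because a primary miss would find no free entry or because a secondary miss could land on an entry with no free subentry, so the cache locks up earlier than strictly necessary.

```python
class ConservativeMHA:
    """Toy MHA that locks up whenever a request MIGHT not be accepted."""

    def __init__(self, num_entries, subs_per_entry):
        self.num_entries = num_entries
        self.subs_per_entry = subs_per_entry
        self.entries = {}                    # line address -> subentries used

    def possibly_full(self):
        no_free_entry = len(self.entries) >= self.num_entries
        some_entry_out_of_subs = any(
            used >= self.subs_per_entry for used in self.entries.values()
        )
        # Conservative: either condition means SOME request could bounce.
        return no_free_entry or some_entry_out_of_subs

    def accept(self, line_addr):
        if self.possibly_full():
            return False                     # cache locks up instead
        self.entries[line_addr] = self.entries.get(line_addr, 0) + 1
        return True
```

Note the conservatism: once any one entry runs out of subentries, even a miss to a different line is refused, because the MHA cannot guarantee acceptance of an arbitrary next request.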
More informationTechniques for Efficient Processing in Runahead Execution Engines
Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu
More informationSpeculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution
Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution Ravi Rajwar and Jim Goodman University of Wisconsin-Madison International Symposium on Microarchitecture, Dec. 2001 Funding
More informationComputer Systems Architecture I. CSE 560M Lecture 17 Guest Lecturer: Shakir James
Computer Systems Architecture I CSE 560M Lecture 17 Guest Lecturer: Shakir James Plan for Today Announcements and Reminders Project demos in three weeks (Nov. 23 rd ) Questions Today s discussion: Improving
More informationA Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt
Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas 78712-24 TR-HPS-26-3
More informationThe University of Texas at Austin
EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin
More informationPrefetch-Aware DRAM Controllers
Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin
More informationMARACAS: A Real-Time Multicore VCPU Scheduling Framework
: A Real-Time Framework Computer Science Department Boston University Overview 1 2 3 4 5 6 7 Motivation platforms are gaining popularity in embedded and real-time systems concurrent workload support less
More informationAccelerating and Adapting Precomputation Threads for Efficient Prefetching
In Proceedings of the 13th International Symposium on High Performance Computer Architecture (HPCA 2007). Accelerating and Adapting Precomputation Threads for Efficient Prefetching Weifeng Zhang Dean M.
More informationThreshold-Based Markov Prefetchers
Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this
More informationCS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics
CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically
More informationDynamic Memory Dependence Predication
Dynamic Memory Dependence Predication Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles Background 1. Store instructions do not update the cache until they are retired (too late). 2. Store queue is
More informationStaged Memory Scheduling
Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:
More informationDynamic Speculative Precomputation
In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Dynamic Speculative Precomputation Jamison D. Collins y, Dean M. Tullsen y, Hong Wang z, John P. Shen z y Department
More information15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses
More informationCluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah ABSTRACT The growing dominance of wire delays at future technology
More informationDynamically Controlled Resource Allocation in SMT Processors
Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona
More informationMicroarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors
Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt Department of Electrical and Computer Engineering The University
More information15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011
15-740/18-740 Computer Architecture Lecture 14: Runahead Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 Reviews Due Today Chrysos and Emer, Memory Dependence Prediction Using
More informationSEVERAL studies have proposed methods to exploit more
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 16, NO. 4, APRIL 2005 1 The Impact of Incorrectly Speculated Memory Operations in a Multithreaded Architecture Resit Sendag, Member, IEEE, Ying
More informationContinual Flow Pipelines
Continual Flow Pipelines Srikanth T. Srinivasan Ravi Rajwar Haitham Akkary Amit Gandhi Mike Upton Microarchitecture Research Labs Intel Corporation {srikanth.t.srinivasan, ravi.rajwar, haitham.h.akkary,
More informationVirtual Memory. Virtual Memory
Virtual Memory Virtual Memory Main memory is cache for secondary storage Secondary storage (disk) holds the complete virtual address space Only a portion of the virtual address space lives in the physical
More informationSpeculative Parallelization in Decoupled Look-ahead
International Conference on Parallel Architectures and Compilation Techniques Speculative Parallelization in Decoupled Look-ahead Alok Garg, Raj Parihar, and Michael C. Huang Dept. of Electrical & Computer
More informationLecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015
Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working
More informationMany Cores, One Thread: Dean Tullsen University of California, San Diego
Many Cores, One Thread: The Search for Nontraditional Parallelism University of California, San Diego There are some domains that feature nearly unlimited parallelism. Others, not so much Moore s Law and
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationZero-Value Caches: Cancelling Loads that Return Zero
2 th International Conference on Parallel Architectures and Compilation Techniques Zero-Value Caches: Cancelling Loads that Return Zero Mafijul Md. Islam and Per Stenstrom Department of Computer Science
More informationInstruction Based Memory Distance Analysis and its Application to Optimization
Instruction Based Memory Distance Analysis and its Application to Optimization Changpeng Fang cfang@mtu.edu Steve Carr carr@mtu.edu Soner Önder soner@mtu.edu Department of Computer Science Michigan Technological
More informationLecture 16: Checkpointed Processors. Department of Electrical Engineering Stanford University
Lecture 16: Checkpointed Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 18-1 Announcements Reading for today: class notes Your main focus:
More informationComputer Architecture Lecture 24: Memory Scheduling
18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM
More informationStatistical Simulation of Chip Multiprocessors Running Multi-Program Workloads
Statistical Simulation of Chip Multiprocessors Running Multi-Program Workloads Davy Genbrugge Lieven Eeckhout ELIS Depment, Ghent University, Belgium Email: {dgenbrug,leeckhou}@elis.ugent.be Abstract This
More information