Store-Load Forwarding: Conventional. NoSQ: Store-Load Communication without a Store Queue. Store-Load Forwarding: Hmmm
|
|
- Elaine Watts
- 6 years ago
- Views:
Transcription
1 Store-Load Forwarding: Conventional No: Store-Load Communication without a Store Tingting Sha, Milo M.K. Martin, Amir Roth University of Pennsylvania {shatingt, milom, amir}@cis.upenn.edu addr predictor addr addr (CAM) data addr data (RAM) (RAM) Associatively-searched queue () Complex On execution path Doesn t scale well Exists for benefit of 10% of s Pentium III (1999) MICRO-06 :: Dec 12, 2006 [ 2 ] Store-Load Forwarding: Proposals Reduce range and frequency of search Bloom-filtered [Sethumadhavan 03] Pipelined/chained [Park 03] Hierarchical/filtered [Srinivasan 04, Ghandi 05, Torres 05] Decomposed [Roth 05; Baugh 04] Replace full associativity with set associativity Address-indexed forwarding cache [Stone 05] Replace search with speculative indexed access Speculative indexed [Sha 05], Fire-and-Forget [Subramaniam 06] Some forwarding structure still there in the middle of the datapath [ 3 ] Store-Load Forwarding: Hmmm Observe I: can predict the exact forwarding accurately Memory dependence prediction [Moshovos 97, Chrysos 98, ] No need to search to find it Observe II: data already exists in register file No need to copy input register!! output register Should be able to eliminate if Can connect s input data register to s consumers Can verify in some way that doesn t require we have the technology Speculative memory bypassing (SMB) [Moshovos 97] Filtered in-order re-execution [Cain 04, Roth 05] [ 4 ]
2 Store-Load Forwarding : No distinguishes bypassing/non-bypassing s Non-bypassing s just access the data cache s skip out-of-order execution (SMB) Store vulnerability window (SVW) verifies and trains Stores skip out-of-order execution too Don t participate in forwarding anymore Commit pipeline extended to execute s Road Overview The road to No (prior work) Conventional - forwarding Load verification with filtered re-execution Speculative memory bypassing No Eliminating out-of-order s and the queue Eliminating the queue Store- bypassing predictor Evaluation No Store (or any forwarding structure)! No Load! [ 5 ] [ 6 ] Conventional Design Conventional Design Loads Execute: search, write address into Loads Execute: search, write address into Stores Execute: write address/data to, search for early s [ 7 ] [ 8 ]
3 Conventional Design In-Order Load Re-Execution [Cain 04] Loads Execute: search, write address into Stores Execute: write address/data to, search for early s Commit: use data/address from to write [ 9 ] [ 10 ] In-Order Load Re-Execution [Cain 04] Store Vulnerability Window (SVW) [Roth 05] flush? Replace search with re-execution prior to commit Squash if in-order value! out-of-order value Moves queue out of core datapath Can verify any form of speculation Consumes a lot of cache bandwidth if all s are speculative [ 11 ] Don t re-execute if no to s address in long time Store Sequence Numbers (Ns): formalize time N Bloom Filter (): N of youngest to address Store commit: update Load commit: read, skip re-execution if entry is older than Forwarding? forwarding Non-forwarding? youngest committed at time of execute [ 12 ]
4 Speculative Memory [Moshovos 97] Speculative Memory Raw insns: add R1, 4! R2 R2! A A! R3 sub R3, 4! R4 Renamed insns: add P1, 4! P2 P2! A A! P3 sub P3, 4! P4 SMB renamed insns: add P1, 4! P2 P2! A A! P2 Convert DEF---USE to DEF-USE Extend register renaming -table[.input] : -table[def.output] Predict - dependence -table[.output] : -table[.input] Store- removed from dataflow graph sub P2, 4! P4 [ 13 ] ( register queue): maps to input data register Originally: verify bypassing s by executing out-of-order Modification: verify using SVW-filtered re-execution [Petric 05] s skip out-of-order execution s (~10%) never access data cache [ 14 ] No: A New Use of SMB No 1: Remove Store from Load Path Load re-execution / SVW filtering: target design simplification Eliminate associative queue search Traditional SMB: targets performance improvement SMB as opportunistic complement to queue Only 10% of s forward! only 4% gains Not worth the effort [Loh 02] No SMB: targets design simplification SMB as exclusive replacement for queue Loads don t need to access queue during execution s: skip out-of-order execution (SMB) Non-bypassing s: get values from the data cache [ 15 ] [ 16 ]
5 No 1: Remove Store from Load Path No 2: Remove Store Loads don t need to access queue during execution s: skip out-of-order execution (SMB) Non-bypassing s: get values from the data cache Remove queue from the the path [ 17 ] [ 18 ] No 2: Remove Store No 3: Remove Load Move execution from out-of-order to commit Extend ROB to remember registers, offsets and data sizes Elongate commit pipeline to read register file, calculate address No additional regfile ports, adders: out-of-order ports! in-order ports Don t dispatch s to Eliminate queue [ 19 ] [ 20 ]
6 No 3: Remove Load No 3: Remove Load Generate addresses for bypassed s at commit (to verify) ROB is already extended, pipeline already elongated Generate addresses for bypassed s at commit (to verify) ROB is already extended, pipeline already elongated Re-generate addresses for non-bypassed s at commit May need additional register read port [ 21 ] [ 22 ] No 3: Remove Load Once Again: Conventional Design Generate addresses for bypassed s at commit (to verify) ROB is already extended, pipeline already elongated Re-generate addresses for non-bypassed s at commit May need additional register read port Eliminate queue [ 23 ] [ 24 ]
7 No Road Overview From conventional to No Conventional - forwarding Load verification with filtered re-execution Speculative memory bypassing No Eliminating out-of-order s and the queue Eliminating the queue No s - bypassing predictor Evaluation [ 25 ] [ 26 ] No s Store-Load Prediction Similar to previous - prediction, but more difficult Load scheduling prediction: [Chrysos 98, ] Can predict conservatively, predicts only violating s Speculative forwarding prediction [Sha 05, ] Must predict all s, but benefits from - address check Traditional bypassing prediction [Moshovos 97, ] No address check, but can decline to predict difficult s No s bypassing prediction Must predict all s, precise, no address check Distance-Based Dependence Prediction interface: PC! dynamic Load PC! PC(s)! dynamic E.g., Store Sets [Chrysos 98], Speculative indexed [Sha 05] Store PC! dynamic requires table Can only (easily) represent most recent instance of each PC Load PC! distance (in s) to! dynamic E.g., [Lipasti 97, Yoaz 98] Can represent any instance Dovetails with SVW: just compare/subtract distances/ns Predict:.N bypass N rename.distance bypass Verify:.N bypass [.address] Train:.distance bypass N commit [.address] [ 27 ] [ 28 ]
8 PC branchcall history Design xor tag tag dist dist Each entry includes tag, distance Explicitly path-sensitive design Two tables: path-insensitive path-sensitive Both set-associative Predict: prefer path-sensitive prediction Train: update both tables on every commit distance to forwarding Evaluation Goal!No queue or queue Same (or better) IPC as conventional design mis-predictions Deeper commit pipeline Latency benefit of SMB Reduced consumption of issue bandwidth and queue slots Simulation environment SPECint2000, SPECfp2000, Mediabench (only show 9) Dynamically scheduled 4-way superscalar 128-ROB, 40-entry issue queue, 11-stage front-end/core Base: 24/48-entry /, 2K-entry Store Sets, 6 stage commit No: No /, 2K-entry predictor, 8 stage commit [ 29 ] [ 30 ] Relative execution time No Performance No Avoid Bypass Mis-Predictions with Delay Two kinds of bypassing mis-predictions Difficult to predict Long signature, data dependent, predictor conflicts, etc. Simply cannot be bypassed Narrow- to wide- No can do wide-/narrow- and narrow-/narrow- See paper On average: slightly outperforms conventional design Prediction accuracy is ~99.8% (15 mis-predictions per 10,000 s) In a few cases: slowdowns 10 of 47 benchmarks >1% slowdown (worst case 7%) Catch-all: convert bypassing to delayed non-bypassing Inject it to the But delay it until the predicted bypassing commits Load gets value from data cache Attach a confidence counter to each predictor entry [ 31 ] [ 32 ]
9 No with Delay No No delay No with Perfect No No delay No perfect predictor Relative execution time Relative execution time Robust performance: only 1 of 47 benchmarks > 1% Overdelay hurts sometimes, but not too bad As fast or faster than conventional in almost all cases Only 2% between realistic and perfect predictor See paper for scalability, data cache accesses, etc. [ 33 ] [ 34 ] Conclusions Conventional queue / queue Complex, non scalable Exist for the benefit of 10% forwarding s No Exploits synergy of previously proposed mechanisms Re-execution with SVW filter, speculative memory bypassing New highly-accurate bypassing predictor Simple, clean data path with no queue/ queue More scalable Fits well with distributed, partitioned cores (e.g., SMT) Outperforms conventional design Thank you. Tingting Sha, Milo M.K. Martin, Amir Roth University of Pennsylvania {shatingt, milom, amir}@cis.upenn.edu Store-Load Forwarding: Just Say No! [ 35 ]
HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding
HANDLING MEMORY OPS 9 Dynamically Scheduling Memory Ops Compilers must schedule memory ops conservatively Options for hardware: Hold loads until all prior stores execute (conservative) Execute loads as
More informationStore Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load/Store Optimization
Department of Computer & Information Science Technical Reports (CIS) University of Pennsylvania Year 24 Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load/Store Optimization Amir
More informationFire-and-Forget: Load/Store Scheduling with No Store Queue at All
Fire-and-Forget: Load/Store Scheduling with No Store Queue at All Samantika Subramaniam Gabriel H. Loh Georgia Institute of Technology College of Computing {samantik,loh}@cc.gatech.edu Abstract Modern
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 14: Speculation II Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CS 246, Harvard University] Tomasulo+ROB Add
More informationEXPLOITING AGGRESSIVE MEMORY DEPENDENCE SPECULATION TO SIMPLIFY THE STORE-LOAD DATAPATH. Tingting Sha. Computer and Information Science
EXPLOITING AGGRESSIVE MEMORY DEPENDENCE SPECULATION TO SIMPLIFY THE STORE-LOAD DATAPATH Tingting Sha A DISSERTATION in Computer and Information Science Presented to the Faculties of the University of Pennsylvania
More informationActive Store Window: Enabling Far Store-Load Forwarding with Scalability and Complexity-Efficiency
Zhang ZH, Wang XY, Tong Det al. Active store window: Enabling far store-load forwarding with scalability and complexityefficiency. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 27(4): 769 780 July 2012. DOI
More informationStatic & Dynamic Instruction Scheduling
CS3014: Concurrent Systems Static & Dynamic Instruction Scheduling Slides originally developed by Drew Hilton, Amir Roth, Milo Martin and Joe Devietti at University of Pennsylvania 1 Instruction Scheduling
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More informationSimultaneous Multithreading Processor
Simultaneous Multithreading Processor Paper presented: Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor James Lue Some slides are modified from http://hassan.shojania.com/pdf/smt_presentation.pdf
More informationEECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont
Lecture 18 Simultaneous Multithreading Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi,
More informationSAMIE-LSQ: Set-Associative Multiple-Instruction Entry Load/Store Queue
SAMIE-LSQ: Set-Associative Multiple-Instruction Entry Load/Store Queue Jaume Abella and Antonio González Intel Barcelona Research Center Intel Labs, Universitat Politècnica de Catalunya Barcelona (Spain)
More informationMemory Dependence Prediction and Access Ordering for Memory Disambiguation and Renaming
Memory Dependence Prediction and Access Ordering for Memory Disambiguation and Renaming 1. Introduction The objective of modern superscalar processors is to maximize the instructionlevel parallelism (ILP)
More informationDynamic Memory Scheduling
Dynamic : Computer Architecture Dynamic Scheduling Overview Lecture Overview The Problem Conventional Solutions In-order Load/Store Scheduling Blind Dependence Scheduling Conservative Dataflow Scheduling
More informationSuperscalar Organization
Superscalar Organization Nima Honarmand Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks available ILP is a measure of inter-dependencies between insns. Average
More informationBloom Filtering Cache Misses for Accurate Data Speculation and Prefetching
Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching Jih-Kwon Peir, Shih-Chang Lai, Shih-Lien Lu, Jared Stark, Konrad Lai peir@cise.ufl.edu Computer & Information Science and Engineering
More informationSpectre and Meltdown. Clifford Wolf q/talk
Spectre and Meltdown Clifford Wolf q/talk 2018-01-30 Spectre and Meltdown Spectre (CVE-2017-5753 and CVE-2017-5715) Is an architectural security bug that effects most modern processors with speculative
More informationLate-Binding: Enabling Unordered Load-Store Queues
Late-Binding: Enabling Unordered Load-Store Queues Simha Sethumadhavan Franziska Roesner Joel S. Emer Doug Burger Stephen W. Keckler Department of Computer Sciences VSSAD, Intel The University of Texas
More informationLecture 12 Branch Prediction and Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars Krste Asanovic Electrical Engineering and Computer
More informationUG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects
Announcements UG4 Honours project selection: Talk to Vijay or Boris if interested in computer architecture projects Inf3 Computer Architecture - 2017-2018 1 Last time: Tomasulo s Algorithm Inf3 Computer
More informationE0-243: Computer Architecture
E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation
More information15-740/ Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 23: Superscalar Processing (III) Prof. Onur Mutlu Carnegie Mellon University Announcements Homework 4 Out today Due November 15 Midterm II November 22 Project
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationHardware-based Speculation
Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions
More informationDynamic Memory Dependence Predication
Dynamic Memory Dependence Predication Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles Background 1. Store instructions do not update the cache until they are retired (too late). 2. Store queue is
More informationLimitations of Scalar Pipelines
Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline
More informationLecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)
Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized
More informationHigh Performance SMIPS Processor
High Performance SMIPS Processor Jonathan Eastep 6.884 Final Project Report May 11, 2005 1 Introduction 1.1 Description This project will focus on producing a high-performance, single-issue, in-order,
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationSPECULATIVE MULTITHREADED ARCHITECTURES
2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions
More informationCS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25
CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem
More informationEITF20: Computer Architecture Part3.2.1: Pipeline - 3
EITF20: Computer Architecture Part3.2.1: Pipeline - 3 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Dynamic scheduling - Tomasulo Superscalar, VLIW Speculation ILP limitations What we have done
More informationWide Instruction Fetch
Wide Instruction Fetch Fall 2007 Prof. Thomas Wenisch http://www.eecs.umich.edu/courses/eecs470 edu/courses/eecs470 block_ids Trace Table pre-collapse trace_id History Br. Hash hist. Rename Fill Table
More informationBeyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji
Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More informationUse-Based Register Caching with Decoupled Indexing
Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register
More informationPortland State University ECE 587/687. Superscalar Issue Logic
Portland State University ECE 587/687 Superscalar Issue Logic Copyright by Alaa Alameldeen, Zeshan Chishti and Haitham Akkary 2017 Instruction Issue Logic (Sohi & Vajapeyam, 1987) After instructions are
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationLecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue
Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue 1 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationComputer Architecture EE 4720 Final Examination
Name Computer Architecture EE 4720 Final Examination Primary: 6 December 1999, Alternate: 7 December 1999, 10:00 12:00 CST 15:00 17:00 CST Alias Problem 1 Problem 2 Problem 3 Problem 4 Exam Total (25 pts)
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 Mhz R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA 1 Some Highlights z Continued Alpha performance leadership y 600 Mhz operation in
More informationHardware Speculation Support
Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification
More informationEECS 470 Midterm Exam Winter 2009
EECS 70 Midterm Exam Winter 2009 Name: unique name: Sign the honor code: I have neither given nor received aid on this exam nor observed anyone else doing so. Scores: # Points 1 / 18 2 / 12 3 / 29 / 21
More informationECE 550D Fundamentals of Computer Systems and Engineering. Fall 2016
ECE 550D Fundamentals of Computer ystems and Engineering Fall 2016 Pipelines Tyler letsch Duke University lides are derived from work by Andrew Hilton (Duke) and Amir Roth (Penn) Clock Period and CPI ingle-cycle
More informationStore Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization
Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced oad Optimization Amir Roth Department of Computer and Information Science, University of Pennsylvania amir@cis.upenn.edu Abstract The
More informationComputer Architecture Lecture 15: Load/Store Handling and Data Flow. Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014
18-447 Computer Architecture Lecture 15: Load/Store Handling and Data Flow Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/21/2014 Lab 4 Heads Up Lab 4a out Branch handling and branch predictors
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Out-of-Order Memory Access Dynamic Scheduling Summary Out-of-order execution: a performance technique Feature I: Dynamic scheduling (io OoO) Performance piece: re-arrange
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More informationLast lecture. Some misc. stuff An older real processor Class review/overview.
Last lecture Some misc. stuff An older real processor Class review/overview. HW5 Misc. Status issues Answers posted Returned on Wednesday (next week) Project presentation signup at http://tinyurl.com/470w14talks
More informationDMDC: Delayed Memory Dependence Checking through Age-Based Filtering
DMDC: Delayed Memory Dependence Checking through Age-Based Filtering Fernando Castro, Luis Pinuel, Daniel Chaver, Manuel Prieto, Michael Huang, Francisco Tirado University Complutense of Madrid fcastror@fis.ucm.es,
More informationECE 4750 Computer Architecture, Fall 2018 T12 Advanced Processors: Memory Disambiguation
ECE 4750 Computer Architecture, Fall 208 T2 Advanced Processors: Memory Disambiguation School of Electrical and Computer Engineering Cornell University revision: 208--09-4-58 Adding Memory Instructions
More informationEECS 470. Branches: Address prediction and recovery (And interrupt recovery too.) Lecture 6 Winter 2018
EECS 470 Branches: Address prediction and recovery (And interrupt recovery too.) Lecture 6 Winter 2018 Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen,
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationComputer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)
18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures
More information15-740/ Computer Architecture Lecture 14: Runahead Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011
15-740/18-740 Computer Architecture Lecture 14: Runahead Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/12/2011 Reviews Due Today Chrysos and Emer, Memory Dependence Prediction Using
More informationChapter. Out of order Execution
Chapter Long EX Instruction stages We have assumed that all stages. There is a problem with the EX stage multiply (MUL) takes more time than ADD MUL ADD We can clearly delay the execution of the ADD until
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Instruction Commit The End of the Road (um Pipe) Commit is typically the last stage of the pipeline Anything an insn. does at this point is irrevocable Only actions following
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1)
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1) 1 Problem 3 Consider the following LSQ and when operands are available. Estimate
More informationA Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines
A Key Theme of CIS 371: arallelism CIS 371 Computer Organization and Design Unit 10: Superscalar ipelines reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are
More informationHandout 2 ILP: Part B
Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationCS 152, Spring 2011 Section 8
CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction Intel Core 2 Duo (Penryn) Vs. NVidia
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are
More informationArchitectures for Instruction-Level Parallelism
Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software
More information" # " $ % & ' ( ) * + $ " % '* + * ' "
! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File
More informationThe Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights
The Alpha 21264 Microprocessor: Out-of-Order ution at 600 MHz R. E. Kessler Compaq Computer Corporation Shrewsbury, MA 1 Some Highlights Continued Alpha performance leadership 600 MHz operation in 0.35u
More information15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture
More informationData Speculation. Architecture. Carnegie Mellon School of Computer Science
Data Speculation Adam Wierman Daniel Neill Lipasti and Shen. Exceeding the dataflow limit, 1996. Sodani and Sohi. Understanding the differences between value prediction and instruction reuse, 1998. 1 A
More informationComputer Architecture: Out-of-Order Execution II. Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Out-of-Order Execution II Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15 Video
More informationAnnouncements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)
Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis
More informationCMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)
CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer
More informationUnderstanding Scheduling Replay Schemes
Appears in Proceedings of the 0th International Symposium on High Performance Computer Architecture (HPCA-0), February 2004. Understanding Scheduling Replay Schemes Ilhyun Kim and Mikko H. Lipasti Department
More informationAdvanced Computer Architecture
Advanced Computer Architecture 1 L E C T U R E 4: D A T A S T R E A M S I N S T R U C T I O N E X E C U T I O N I N S T R U C T I O N C O M P L E T I O N & R E T I R E M E N T D A T A F L O W & R E G I
More informationHigh Performance SMIPS Processor. Jonathan Eastep May 8, 2005
High Performance SMIPS Processor Jonathan Eastep May 8, 2005 Objectives: Build a baseline implementation: Single-issue, in-order, 6-stage pipeline Full bypassing ICache: blocking, direct mapped, 16KByte,
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationLecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ
Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)
More informationCS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II
CS252 Spring 2017 Graduate Computer Architecture Lecture 8: Advanced Out-of-Order Superscalar Designs Part II Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time
More informationControl Hazards. Branch Recovery. Control Hazard Pipeline Diagram. Branch Performance
Control Hazards ranch Recovery D/
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti
ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real
More informationLecture 7: Static ILP and branch prediction. Topics: static speculation and branch prediction (Appendix G, Section 2.3)
Lecture 7: Static ILP and branch prediction Topics: static speculation and branch prediction (Appendix G, Section 2.3) 1 Support for Speculation In general, when we re-order instructions, register renaming
More informationece4750-t11-ooo-execution-notes.txt ========================================================================== ece4750-l12-ooo-execution-notes.txt ==========================================================================
More informationComputer Architecture and Engineering CS152 Quiz #3 March 22nd, 2012 Professor Krste Asanović
Computer Architecture and Engineering CS52 Quiz #3 March 22nd, 202 Professor Krste Asanović Name: This is a closed book, closed notes exam. 80 Minutes 0 Pages Notes: Not all questions are
More informationMEMORY ORDERING: A VALUE-BASED APPROACH
MEMORY ORDERING: A VALUE-BASED APPROACH VALUE-BASED REPLAY ELIMINATES THE NEED FOR CONTENT-ADDRESSABLE MEMORIES IN THE LOAD QUEUE, REMOVING ONE BARRIER TO SCALABLE OUT- OF-ORDER INSTRUCTION WINDOWS. INSTEAD,
More informationUnit 8: Superscalar Pipelines
A Key Theme: arallelism reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode of next CIS 501: Computer Architecture Unit 8: Superscalar ipelines Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'ennsylvania'
More informationOutline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??
Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationECE/CS 552: Introduction to Superscalar Processors
ECE/CS 552: Introduction to Superscalar Processors Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Limitations of Scalar Pipelines
More informationCISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions
CISC 662 Graduate Computer Architecture Lecture 11 - Hardware Speculation Branch Predictions Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis6627 Powerpoint Lecture Notes from John Hennessy
More informationChia-Jung Hsu Institute of Computer and Communication Engineering, National Cheng Kung University Tainan, Taiwan
Power-efficient and Scalable Load/Store Queue Design via Address Compression Yi-Ying Tsai Department of Electrical Engineering, National Cheng Kung University, magi@casmail.ee.ncku.edu.tw Chia-Jung Hsu
More informationLecture 8: Branch Prediction, Dynamic ILP. Topics: static speculation and branch prediction (Sections )
Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections 2.3-2.6) 1 Correlating Predictors Basic branch prediction: maintain a 2-bit saturating counter for each
More informationOutline. In-Order vs. Out-of-Order. Project Goal 5/14/2007. Design and implement an out-of-ordering superscalar. Introduction
Outline Group IV Wei-Yin Chen Myong Hyon Cho Introduction In-Order vs. Out-of-Order Register Renaming Re-Ordering Od Buffer Superscalar Architecture Architectural Design Bluespec Implementation Results
More informationCS152 Computer Architecture and Engineering. Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12
CS152 Computer Architecture and Engineering Assigned 2/28/2018 Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12 http://inst.eecs.berkeley.edu/~cs152/sp18 The problem
More informationInstruction scheduling. Based on slides by Ilhyun Kim and Mikko Lipasti
Instruction scheduling Based on slides by Ilhyun Kim and Mikko Lipasti Schedule of things to do HW5 posted later today. I ve pushed it to be due on the last day of class And will accept it without penalty
More informationAddress-Indexed Memory Disambiguation and Store-to-Load Forwarding
Copyright c 2005 IEEE. Reprinted from 38th Intl Symp Microarchitecture, Nov 2005, pp. 171-182. This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted.
More informationECE 552: Introduction To Computer Architecture 1. Scalar upper bound on throughput. Instructor: Mikko H Lipasti. University of Wisconsin-Madison
ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes partially based on notes by John P. Shen Limitations of Scalar Pipelines
More informationComplex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units
6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd
More information