STATS AND CONFIGURATION
|
|
- Lester Harmon
- 6 years ago
- Views:
Transcription
1 ZSim Tutorial MICRO 2015 STATS AND CONFIGURATION Core Models Po-An Tsai
2 Outline 2
3 Outline ZSim core simulation techniques 2
4 Outline 2 ZSim core simulation techniques ZSim core structure Simple IPC 1 core Timing core OOO core I1D Core I1I
5 Outline 2 ZSim core simulation techniques ZSim core structure Simple IPC 1 core Timing core OOO core I1D Core I1I Coding examples with demo Branch predictor Westmere to Silvermont
6 Core Simulation Techniques 3
7 Core Simulation Techniques 3 ZSim simulates the system using Pin Leverages dynamic binary translation
8 Core Simulation Techniques 3 ZSim simulates the system using Pin Leverages dynamic binary translation ZSim mainly uses 4 types of analysis routine Basic block Load and Store Branch to cover the simulated program
9 Core Simulation Technique 4
10 Core Simulation Technique A basic block (BBL) from Pin 4
11 Core Simulation Technique A basic block (BBL) from Pin mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) ja 40530a 4
12 Core Simulation Technique A basic block (BBL) from Pin mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) ja 40530a 5
13 Core Simulation Technique A basic block (BBL) from Pin mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) ja 40530a 5 1. Simulate core activities with a BBL descriptor that contains most of the static information
14 Core Simulation Technique A basic block (BBL) from Pin mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) ja 40530a 5 1. Simulate core activities with a BBL descriptor that contains most of the static information mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) ja 40530a
15 Core Simulation Technique A basic block (BBL) from Pin mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) ja 40530a 5 1. Simulate core activities with a BBL descriptor that contains most of the static information mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) ja 40530a Decode BBL into BBL descriptor
16 Core Simulation Technique A basic block (BBL) from Pin mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) ja 40530a 5 1. Simulate core activities with a BBL descriptor that contains most of the static information mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) ja 40530a Decode BBL into BBL descriptor BblDescriptor: numinstructions = 4 numbytes = 4 uop[]
17 Core Simulation Technique A basic block (BBL) from Pin mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) ja 40530a 5 1. Simulate core activities with a BBL descriptor that contains most of the static information mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) BasicBlock(BblDescriptor) ja 40530a Decode BBL into BBL descriptor BblDescriptor: numinstructions = 4 numbytes = 4 uop[]
18 Core Simulation Technique 6 Decode x86 instructions into uops With different latencies, src/dst pair, function unit ports
19 Core Simulation Technique 7
20 Core Simulation Technique 2. Simulate memory system operations with addresses 7
21 Core Simulation Technique 2. Simulate memory system operations with addresses 7 mov (%rbp),%rcx add %rax,%rbx mov %rdx,(%rbp) BasicBlock(BblDescriptor) ja 40530a
22 Core Simulation Technique 2. Simulate memory system operations with addresses 7 mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) BasicBlock(BblDescriptor) ja 40530a
23 Core Simulation Technique 2. Simulate memory system operations with addresses 7 mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) BasicBlock(BblDescriptor) ja 40530a Load(Address addr) { L1D->load(addr); } Store(Address addr) { L1D->Store(addr); }
24 Core Simulation Technique 8
25 Core Simulation Technique 8 Instruction-driven core activity (basic block) simulation Simulates multiple stages for single instruction at once Each stage maintains a separate clock
26 Core Simulation Technique 8 Instruction-driven core activity (basic block) simulation Simulates multiple stages for single instruction at once Each stage maintains a separate clock BasicBlock(BblDescriptor) { foreach uop { simulatefetch(uop); simulatedecode(uop); simulateissue(uop); simulateexecute(uop); simulatecommit(uop); } }
27 Core Simulation Technique 8 Instruction-driven core activity (basic block) simulation Simulates multiple stages for single instruction at once Each stage maintains a separate clock BasicBlock(BblDescriptor) { foreach uop { simulatefetch(uop); simulatedecode(uop); simulateissue(uop); simulateexecute(uop); simulatecommit(uop); } } simulateissue(uop) { adduoptorob(currobcycle, uop); if(rob.isfull()){ nextrobavailcycle = rob.advance(); } }
28 Core Simulation Technique 9 Event-driven uncore activity simulation
29 Core Simulation Technique 9 Event-driven uncore activity simulation Request from core
30 Core Simulation Technique 9 Event-driven uncore activity simulation Request from core Cache Tag
31 Core Simulation Technique 9 Event-driven uncore activity simulation Request from core Cache Tag Cache Miss
32 Core Simulation Technique 9 Event-driven uncore activity simulation Request from core Cache Tag Cache Miss Mem Data
33 Core Simulation Technique 9 Event-driven uncore activity simulation Request from core Cache Tag Cache Miss Mem Data Cache Data
34 Core Simulation Technique 9 Event-driven uncore activity simulation Request from core Cache Tag Cache Miss Mem Data Cache Data Response to core
35 Core Simulation Technique 9 Event-driven uncore activity simulation Request from core Cache Tag Cache Miss Mem Data Weave phase Cache Response to core
36 General Core Structure 10
37 General Core Structure 10 ZSim simulates a core with 4 functions using Pin s APIs BblFunc LoadFunc StoreFunc BranchFunc
38 General Core Structure 10 ZSim simulates a core with 4 functions using Pin s APIs BblFunc LoadFunc StoreFunc BranchFunc Current supported core type Simple IPC1 core Timing core OOO core (Westmere-like)
39 Simple IPC1 Core 11 Current cycle = 0 I1D I1I mov (%rbp),%rcx IPC1 Core add %rax,%rbx mov %rdx,(%rbp) ja 40530a
40 Simple IPC1 Core 11 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) ja 40530a
41 Simple IPC1 Core 11 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) ja 40530a Current cycle = l1d->load(curcycle)
42 Simple IPC1 Core 11 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) ja 40530a Current cycle = l1d->load(curcycle)
43 Simple IPC1 Core 11 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) ja 40530a Current cycle = l1d->load(curcycle) Current cycle = l1d->store(curcycle)
44 Simple IPC1 Core 11 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) BasicBlock(BblDescriptor) ja 40530a Current cycle = l1d->load(curcycle) Current cycle = l1d->store(curcycle)
45 Simple IPC1 Core 11 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) BasicBlock(BblDescriptor) ja 40530a Current cycle = l1d->load(curcycle) Current cycle = l1d->store(curcycle) Current cycle += 4
46 Timing Core 12 Current cycle = 0 I1D I1I mov (%rbp),%rcx IPC1 Core add %rax,%rbx mov %rdx,(%rbp) ja 40530a
47 Timing Core 12 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) ja 40530a
48 Timing Core 12 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) ja 40530a Current cycle = l1d->load(curcycle)
49 Timing Core 12 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) ja 40530a Current cycle = l1d->load(curcycle) Miss Write back Request from core Tag Acc Mem Data Read Data Write Response to core
50 Timing Core 12 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) ja 40530a Current cycle = l1d->load(curcycle) Miss Write back Request from core Tag Acc Mem Data Read Data Write Response to core
51 Timing Core 12 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) ja 40530a Current cycle = l1d->load(curcycle) Current cycle = l1d->store(curcycle) Miss Write back Request from core Tag Acc Mem Data Read Data Write Response to core
52 Timing Core 12 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) ja 40530a Current cycle = l1d->load(curcycle) Current cycle = l1d->store(curcycle) Miss Write back Request from core Tag Acc Mem Data Read Data Write Response to core
53 Timing Core 12 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) ja 40530a Current cycle = l1d->load(curcycle) Current cycle = l1d->store(curcycle) Request from core Tag Acc Miss Write back Request from core Tag Acc Data Write Mem Data Read Data Write Response to core
54 Timing Core 12 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) BasicBlock(BblDescriptor) ja 40530a Current cycle = l1d->load(curcycle) Current cycle = l1d->store(curcycle) Request from core Tag Acc Miss Write back Request from core Tag Acc Data Write Mem Data Read Data Write Response to core
55 Timing Core 12 Current cycle = 0 I1D I1I IPC1 Core mov (%rbp),%rcx Load(%rbp) add %rax,%rbx mov %rdx,(%rbp) Store(%rbp) BasicBlock(BblDescriptor) ja 40530a Current cycle = l1d->load(curcycle) Current cycle = l1d->store(curcycle) Current cycle += 4 Request from core Tag Acc Miss Write back Request from core Tag Acc Data Write Mem Data Read Data Write Response to core
56 OOO Core - BBL 13 Simulate all stages at once Load A Exec Store A Exec
57 OOO Core - BBL 13 Simulate all stages at once Load A Exec Store A Exec Fetch Decode Issue OOO Execute Commit
58 OOO Core - BBL 14 Simulate all stages at once Load A Exec Store A Exec Fetch Decode Issue OOO Execute Commit
59 OOO Core - BBL 15 Simulate all stages at once Load A Exec Store A Exec Fetch Decode Issue OOO Execute Commit
60 OOO Core - BBL 15 Simulate all stages at once Exec Store A Exec Load A Fetch Decode Issue OOO Execute Commit
61 OOO Core - BBL 15 Simulate all stages at once Exec Store A Exec Load A Fetch Decode Issue OOO Execute Commit
62 OOO Core - BBL 15 Simulate all stages at once Exec Store A Exec Load A Fetch Decode Issue OOO Execute Commit
63 OOO Core - BBL 15 Simulate all stages at once Exec Store A Exec Load A Fetch Decode Issue OOO Execute Commit
64 OOO Core - BBL 15 Simulate all stages at once Exec Store A Exec Fetch Decode Issue OOO Execute Commit
65 OOO Core - BBL 15 Simulate all stages at once Store A Exec Fetch Exec Decode Issue OOO Execute Commit
66 OOO Core - BBL 15 Simulate all stages at once Store A Exec Fetch Exec Decode Issue OOO Execute Commit
67 OOO Core - BBL 15 Simulate all stages at once Store A Exec Fetch Exec Decode Issue OOO Execute Commit
68 OOO Core - BBL 15 Simulate all stages at once Store A Exec Fetch Decode Issue OOO Execute Exec Commit
69 OOO Core - BBL 15 Simulate all stages at once Store A Exec Fetch Decode Issue OOO Execute Commit
70 OOO Core - BBL 16 Simulate all stages at once Load A Fetch Decode Issue OOO Execute Commit
71 OOO Core - BBL 17 Simulate all stages at once Load A Fetch cycle Fetch Miss prediction Fetch wrong ins Ins Fetch Fetch whole bbl Adjust Fetch clock Decode Issue OOO Execute Commit
72 OOO Core - BBL 18 Simulate all stages at once Load A Decode cycle Fetch Miss prediction Fetch wrong ins Ins Fetch Fetch whole bbl Adjust Fetch clock Decode uop Queue Issue OOO Execute Check next available cycle Adjust Decode clock Commit
73 OOO Core - BBL 19 Simulate all stages at once Load A Dispatch cycle Fetch Miss prediction Fetch wrong ins Ins Fetch Fetch whole bbl Adjust Fetch clock Decode uop Queue Issue Reg Scoreboard OOO Execute Check next Check src available available cycle cycle Rob Adjust Decode clock Check next avail cycle Issue width RegFile width Adjust issue clock Commit
74 OOO Core - BBL 20 Simulate all stages at once Load A Commit cycle Fetch Miss prediction Fetch wrong ins Ins Fetch Fetch whole bbl Adjust Fetch clock Decode uop Queue Issue Reg Scoreboard OOO Execute Check next Check src Ins Window available available cycle cycle Adjust Decode clock Rob Check next avail cycle Issue width RegFile width Adjust issue clock Schedule uop in the next cycle that needed ports avail LS Unit* Issue Load/Store *Only for load/store Commit
75 OOO Core - BBL 21 Simulate all stages at once Load A Fetch Miss prediction Fetch wrong ins Ins Fetch Fetch whole bbl Adjust Fetch clock Decode uop Queue Issue Reg Scoreboard OOO Execute Check next Check src Ins Window available available cycle cycle Adjust Decode clock Rob Check next avail cycle Issue width RegFile width Adjust issue clock Schedule uop in the next cycle that needed ports avail LS Unit* Issue Load/Store Commit Reg Scoreboard Set dst available cycle Rob Retire uop considering rob width Adjust retire clock
76 OOO Core Load/Store 22 Simulate MLP Load A Load B
77 OOO Core Load/Store 22 Simulate MLP Load A Load B Cache Issue 30 70
78 OOO Core Load/Store 23 Simulate MLP Mem 90 Mem 110 Load A Load B Cache 50 Cache 70 Cache Issue 30 Issue
79 OOO Core Load/Store 23 Simulate MLP Mem 90 Mem 110 Load A Load B Cache 50 Cache 70 Cache Issue 30 Issue In weave phase, request B will not be delayed due to contentions for A
80 Simulation Speed for Different Core Type SPECCPU 2006 suite 24
81 Simulation Speed for Different Core Type SPECCPU 2006 suite 24 ~3X difference between IPC1 and OOO-C in Hmean
82 Not Modeled Core Behaviors 25
83 Not Modeled Core Behaviors 25 Wrong path execution Hard to simulate for Pin Okay to skip for Westmere
84 Not Modeled Core Behaviors 25 Wrong path execution Hard to simulate for Pin Okay to skip for Westmere Fine-grained message-passing Need significant changes
85 Not Modeled Core Behaviors 25 Wrong path execution Hard to simulate for Pin Okay to skip for Westmere Fine-grained message-passing Need significant changes TLBs and SMT Not supported yet
86 Coding Examples 26
87 Coding Examples Implement a branch predictor for OOO core 26
88 Coding Examples 26 Implement a branch predictor for OOO core Change OOO core type From Westmere to Silvermont
89 Implement Branch Predictors 27
90 Implement Branch Predictors Have a new branch predictor class 27
91 Implement Branch Predictors Have a new branch predictor class class GShareBranchPredictor { 27
92 Implement Branch Predictors Have a new branch predictor class class GShareBranchPredictor { private: bool lastseen; } 27
93 Implement Branch Predictors Have a new branch predictor class class GShareBranchPredictor { private: bool lastseen; } Implement the predict method 27
94 Implement Branch Predictors Have a new branch predictor class class GShareBranchPredictor { 27 private: bool lastseen; } Implement the predict method public: // Predicts and updates; returns false if mispredicted inline bool predict(address branchpc, bool taken) {
95 Implement Branch Predictors Have a new branch predictor class class GShareBranchPredictor { 27 private: bool lastseen; } Implement the predict method public: // Predicts and updates; returns false if mispredicted inline bool predict(address branchpc, bool taken) { bool prediction = (taken == lastseen);
96 Implement Branch Predictors Have a new branch predictor class class GShareBranchPredictor { 27 private: bool lastseen; } Implement the predict method public: // Predicts and updates; returns false if mispredicted inline bool predict(address branchpc, bool taken) { bool prediction = (taken == lastseen); lastseen = taken;
97 Implement Branch Predictors Have a new branch predictor class class GShareBranchPredictor { 27 private: bool lastseen; } Implement the predict method public: // Predicts and updates; returns false if mispredicted inline bool predict(address branchpc, bool taken) { bool prediction = (taken == lastseen); lastseen = taken; return prediction; // always predict taken }
98 Implement Branch Predictors Have a new branch predictor class class GShareBranchPredictor { 27 private: bool lastseen; } Implement the predict method public: // Predicts and updates; returns false if mispredicted inline bool predict(address branchpc, bool taken) { bool prediction = (taken == lastseen); lastseen = taken; return prediction; // always predict taken } Replace the branch predictor in ooo_core.h
99 Implement Branch Predictors Have a new branch predictor class class GShareBranchPredictor { 27 private: bool lastseen; } Implement the predict method public: // Predicts and updates; returns false if mispredicted inline bool predict(address branchpc, bool taken) { bool prediction = (taken == lastseen); lastseen = taken; return prediction; // always predict taken } Replace the branch predictor in ooo_core.h //BranchPredictorPAg<11, 18, 14> branchpred; GSharePredictor branchpred;
100 Demo 28
101 Different OOO Micro-architecture 29
102 Different OOO Micro-architecture The original zsim assumes Westmere OOO core, but what if I want to simulate a Silvermont/Haswell OOO core? 29
103 Different OOO Micro-architecture The original zsim assumes Westmere OOO core, but what if I want to simulate a Silvermont/Haswell OOO core? 29 Step 1: obtain the important ooo core parameters
104 Different OOO Micro-architecture The original zsim assumes Westmere OOO core, but what if I want to simulate a Silvermont/Haswell OOO core? 29 Step 1: obtain the important ooo core parameters Step 2: change the core parameters in ooo_core.h/cpp
105 Different OOO Micro-architecture The original zsim assumes Westmere OOO core, but what if I want to simulate a Silvermont/Haswell OOO core? 29 Step 1: obtain the important ooo core parameters Step 2: change the core parameters in ooo_core.h/cpp Step 3: verify it against real system
106 Obtain Important Core Parameters 30 Westmere[1] Silvermont[2] Issue width 4 2 F/D/I/E stages 1/4/7/13 1/3/5/8 Fetch width 16B 8B RF read width 3 2 ROB size Ins window 1K * 36 1K * 16 Issue queue 28 8 [1] [2]
107 Change OOO Core Parameters 31
108 Change OOO Core Parameters Change sizes of hardware structures in ooo_core.h 31
109 Change OOO Core Parameters 31 Change sizes of hardware structures in ooo_core.h CycleQueue<28> uopqueue -> <8> ReorderBuffer<128, 4> rob -> <32, 2>
110 Change OOO Core Parameters 31 Change sizes of hardware structures in ooo_core.h CycleQueue<28> uopqueue -> <8> ReorderBuffer<128, 4> rob -> <32, 2> Change the ooo core parameter in ooo_core.cpp
111 Change OOO Core Parameters Change sizes of hardware structures in ooo_core.h CycleQueue<28> uopqueue -> <8> ReorderBuffer<128, 4> rob -> <32, 2> Change the ooo core parameter in ooo_core.cpp #define FETCH_STAGE 1 -> 1 #define DECODE_STAGE 4 -> 3 #define ISSUE_STAGE 7 -> 5 #define DISPATCH_STAGE 13 -> 8 #define FETCH_BYTES_PER_CYCLE 16 -> 8 #define ISSUES_PER_CYCLE 4 -> 2 #define RF_READS_PER_CYCLE 3 -> 2 31
112 Demo 32
113 Verify It Against Real System 33 IPC traces for Westmere and Silvermont Westmere (6% performance difference) Silvermont (9% performance difference)
114 Summary 34
115 Summary ZSim uses instruction-driven simulation for core activities and event-driven simulation for uncore activities 34
116 Summary 34 ZSim uses instruction-driven simulation for core activities and event-driven simulation for uncore activities ZSim currently supports 3 types of core Simple IPC1 core (simple_core.h) Timing core (timing_core.h) Westmere-like OOO core (ooo_core.h)
117 Summary 34 ZSim uses instruction-driven simulation for core activities and event-driven simulation for uncore activities ZSim currently supports 3 types of core Simple IPC1 core (simple_core.h) Timing core (timing_core.h) Westmere-like OOO core (ooo_core.h) Extending zsim core model is straightforward Modify 4 basic analysis routines Substitute the hardware structure with your implementation Change the parameters in OOO
118 Hacking Advice 35 As common Pin programming, functions in the core are very frequently called in zsim You should be aware of performance when coding It s the main reason why zsim statically allocates hardware structures and set ooo parameters
119 Thank you! Any questions? 36
120 Break / Q&A Try zsim now! 37
ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS
ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200
More informationZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS
ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200
More informationFast and Accurate Microarchitectural Simulation with ZSim
TUNING SLIDE Fast and Accurate Microarchitectural Simulation with ZSim Daniel Sanchez, Nathan Beckmann, Anurag Mukkara, Po-An Tsai MIT CSAIL MICRO-48 Tutorial December 5, 2015 Welcome! Agenda 4 8:30 9:10
More information" # " $ % & ' ( ) * + $ " % '* + * ' "
! )! # & ) * + * + * & *,+,- Update Instruction Address IA Instruction Fetch IF Instruction Decode ID Execute EX Memory Access ME Writeback Results WB Program Counter Instruction Register Register File
More information250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019
250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Out-of-Order Memory Access Dynamic Scheduling Summary Out-of-order execution: a performance technique Feature I: Dynamic scheduling (io OoO) Performance piece: re-arrange
More informationHandout 2 ILP: Part B
Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP
More informationE0-243: Computer Architecture
E0-243: Computer Architecture L1 ILP Processors RG:E0243:L1-ILP Processors 1 ILP Architectures Superscalar Architecture VLIW Architecture EPIC, Subword Parallelism, RG:E0243:L1-ILP Processors 2 Motivation
More informationCS 2410 Mid term (fall 2015) Indicate which of the following statements is true and which is false.
CS 2410 Mid term (fall 2015) Name: Question 1 (10 points) Indicate which of the following statements is true and which is false. (1) SMT architectures reduces the thread context switch time by saving in
More informationLimitations of Scalar Pipelines
Limitations of Scalar Pipelines Superscalar Organization Modern Processor Design: Fundamentals of Superscalar Processors Scalar upper bound on throughput IPC = 1 Inefficient unified pipeline
More informationCS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25
CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem
More informationLecture: Out-of-order Processors. Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ
Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB)
More informationCS 152, Spring 2012 Section 8
CS 152, Spring 2012 Section 8 Christopher Celio University of California, Berkeley Agenda More Out- of- Order Intel Core 2 Duo (Penryn) Vs. NVidia GTX 280 Intel Core 2 Duo (Penryn) dual- core 2007+ 45nm
More informationSuperscalar Processor
Superscalar Processor Design Superscalar Architecture Virendra Singh Indian Institute of Science Bangalore virendra@computer.orgorg Lecture 20 SE-273: Processor Design Superscalar Pipelines IF ID RD ALU
More informationPentium 4 Processor Block Diagram
FP FP Pentium 4 Processor Block Diagram FP move FP store FMul FAdd MMX SSE 3.2 GB/s 3.2 GB/s L D-Cache and D-TLB Store Load edulers Integer Integer & I-TLB ucode Netburst TM Micro-architecture Pipeline
More informationUse-Based Register Caching with Decoupled Indexing
Use-Based Register Caching with Decoupled Indexing J. Adam Butts and Guri Sohi University of Wisconsin Madison {butts,sohi}@cs.wisc.edu ISCA-31 München, Germany June 23, 2004 Motivation Need large register
More informationCS152 Computer Architecture and Engineering. Complex Pipelines
CS152 Computer Architecture and Engineering Complex Pipelines Assigned March 6 Problem Set #3 Due March 20 http://inst.eecs.berkeley.edu/~cs152/sp12 The problem sets are intended to help you learn the
More informationLecture 16: Core Design. Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue
Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue 1 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Instruction Commit The End of the Road (um Pipe) Commit is typically the last stage of the pipeline Anything an insn. does at this point is irrevocable Only actions following
More informationLecture-13 (ROB and Multi-threading) CS422-Spring
Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Instruction Commit The End of the Road (um Pipe) Commit is typically the last stage of the pipeline Anything an insn. does at this point is irrevocable Only actions following
More informationLecture 11: Out-of-order Processors. Topics: more ooo design details, timing, load-store queue
Lecture 11: Out-of-order Processors Topics: more ooo design details, timing, load-store queue 1 Problem 0 Show the renamed version of the following code: Assume that you have 36 physical registers and
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are
More informationCS 252 Graduate Computer Architecture. Lecture 4: Instruction-Level Parallelism
CS 252 Graduate Computer Architecture Lecture 4: Instruction-Level Parallelism Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://wwweecsberkeleyedu/~krste
More informationComplex Pipelining: Out-of-order Execution & Register Renaming. Multiple Function Units
6823, L14--1 Complex Pipelining: Out-of-order Execution & Register Renaming Laboratory for Computer Science MIT http://wwwcsglcsmitedu/6823 Multiple Function Units 6823, L14--2 ALU Mem IF ID Issue WB Fadd
More informationCS 152, Spring 2011 Section 8
CS 152, Spring 2011 Section 8 Christopher Celio University of California, Berkeley Agenda Grades Upcoming Quiz 3 What it covers OOO processors VLIW Branch Prediction Intel Core 2 Duo (Penryn) Vs. NVidia
More informationece4750-t11-ooo-execution-notes.txt ========================================================================== ece4750-l12-ooo-execution-notes.txt ==========================================================================
More informationLecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)
Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1)
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics (Sections B.1-B.3, 2.1) 1 Problem 3 Consider the following LSQ and when operands are available. Estimate
More informationLecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques,
Lecture 18: Instruction Level Parallelism -- Dynamic Superscalar, Advanced Techniques, ARM Cortex-A53, and Intel Core i7 CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong
More informationThe Pentium II/III Processor Compiler on a Chip
The Pentium II/III Processor Compiler on a Chip Ronny Ronen Senior Principal Engineer Director of Architecture Research Intel Labs - Haifa Intel Corporation Tel Aviv University January 20, 2004 1 Agenda
More informationSuperscalar Processors
Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input
More informationComplex Pipelines and Branch Prediction
Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle
More informationAnnouncements. ECE4750/CS4420 Computer Architecture L11: Speculative Execution I. Edward Suh Computer Systems Laboratory
ECE4750/CS4420 Computer Architecture L11: Speculative Execution I Edward Suh Computer Systems Laboratory suh@csl.cornell.edu Announcements Lab3 due today 2 1 Overview Branch penalties limit performance
More informationOut of Order Processing
Out of Order Processing Manu Awasthi July 3 rd 2018 Computer Architecture Summer School 2018 Slide deck acknowledgements : Rajeev Balasubramonian (University of Utah), Computer Architecture: A Quantitative
More informationComplex Pipelining COE 501. Computer Architecture Prof. Muhamed Mudawar
Complex Pipelining COE 501 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Diversified Pipeline Detecting
More informationLecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)
Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling
More information15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 5: Precise Exceptions Prof. Onur Mutlu Carnegie Mellon University Last Time Performance Metrics Amdahl s Law Single-cycle, multi-cycle machines Pipelining Stalls
More information15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011
5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer
More informationThe Tomasulo Algorithm Implementation
2162 Term Project The Tomasulo Algorithm Implementation Assigned: 11/3/2015 Due: 12/15/2015 In this project, you will implement the Tomasulo algorithm with register renaming, ROB, speculative execution
More informationECE 571 Advanced Microprocessor-Based Design Lecture 4
ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted
More informationComputer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ.
Computer Architectures Chapter 4 Tien-Fu Chen National Chung Cheng Univ. chap4-0 Advance Pipelining! Static Scheduling Have compiler to minimize the effect of structural, data, and control dependence "
More informationOutline. In-Order vs. Out-of-Order. Project Goal 5/14/2007. Design and implement an out-of-ordering superscalar. Introduction
Outline Group IV Wei-Yin Chen Myong Hyon Cho Introduction In-Order vs. Out-of-Order Register Renaming Re-Ordering Od Buffer Superscalar Architecture Architectural Design Bluespec Implementation Results
More informationHigh-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs
High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:
More informationArchitectures for Instruction-Level Parallelism
Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software
More informationHANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding
HANDLING MEMORY OPS 9 Dynamically Scheduling Memory Ops Compilers must schedule memory ops conservatively Options for hardware: Hold loads until all prior stores execute (conservative) Execute loads as
More informationCS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars
CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory
More informationProcessor Architecture
Processor Architecture Advanced Dynamic Scheduling Techniques M. Schölzel Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading Content Tomasulo
More informationZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems
ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems Daniel Sanchez Massachusetts Institute of Technology sanchez@csail.mit.edu Christos Kozyrakis Stanford University christos@ee.stanford.edu
More informationCS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming
CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue and Register Renaming Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://wwweecsberkeleyedu/~krste
More informationLecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )
Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections 2.3-2.6) 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3
More informationCS 654 Computer Architecture Summary. Peter Kemper
CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining
More informationHARDWARE SPECULATION. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah
HARDWARE SPECULATION Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Mid-term exam: Mar. 5 th No homework till after
More informationEfficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero
Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15
More informationPipelining to Superscalar
Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel
More informationSecurity-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat
Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More information15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15
More informationCMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago
CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office
More informationBranch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines
6.823, L15--1 Branch Prediction & Speculative Execution Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 6.823, L15--2 Branch Penalties in Modern Pipelines UltraSPARC-III
More informationECE 552: Introduction To Computer Architecture 1. Scalar upper bound on throughput. Instructor: Mikko H Lipasti. University of Wisconsin-Madison
ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes partially based on notes by John P. Shen Limitations of Scalar Pipelines
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationChapter. Out of order Execution
Chapter Long EX Instruction stages We have assumed that all stages. There is a problem with the EX stage multiply (MUL) takes more time than ADD MUL ADD We can clearly delay the execution of the ADD until
More informationConstructive Computer Architecture Tutorial 6: Discussion for lab6. October 7, 2013 T05-1
Constructive Computer Architecture Tutorial 6: Discussion for lab6 October 7, 2013 T05-1 Introduction Lab 6 involves creating a 6 stage pipelined processor from a 2 stage pipeline This requires a lot of
More informationComputer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types
More informationLecture 18: Core Design, Parallel Algos
Lecture 18: Core Design, Parallel Algos Today: Innovations for ILP, TLP, power and parallel algos Sign up for class presentations 1 SMT Pipeline Structure Front End Front End Front End Front End Private/
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti
ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real
More information15-740/ Computer Architecture Lecture 4: Pipelining. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 4: Pipelining Prof. Onur Mutlu Carnegie Mellon University Last Time Addressing modes Other ISA-level tradeoffs Programmer vs. microarchitect Virtual memory Unaligned
More informationEECS 470 Lecture 6. Branches: Address prediction and recovery (And interrupt recovery too.)
EECS 470 Lecture 6 Branches: Address prediction and recovery (And interrupt recovery too.) Announcements: P3 posted, due a week from Sunday HW2 due Monday Reading Book: 3.1, 3.3-3.6, 3.8 Combining Branch
More informationChapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1
Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed
More informationIntel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
More informationCSE 490/590 Computer Architecture Homework 2
CSE 490/590 Computer Architecture Homework 2 1. Suppose that you have the following out-of-order datapath with 1-cycle ALU, 2-cycle Mem, 3-cycle Fadd, 5-cycle Fmul, no branch prediction, and in-order fetch
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationEECS 470 Lecture 7. Branches: Address prediction and recovery (And interrupt recovery too.)
EECS 470 Lecture 7 Branches: Address prediction and recovery (And interrupt recovery too.) Warning: Crazy times coming Project handout and group formation today Help me to end class 12 minutes early P3
More informationAnnouncements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)
Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis
More informationVisualizing the out-of-order CPU model. Ryota Shioya Nagoya University
Visualizing the out-of-order CPU model Ryota Shioya Nagoya University Introduction This presentation introduces the visualization of the out-of-order CPU model in gem5 2 Introduction Let's suppose you
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationDynamic Memory Dependence Predication
Dynamic Memory Dependence Predication Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles Background 1. Store instructions do not update the cache until they are retired (too late). 2. Store queue is
More informationChapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences
Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal pipeline CPI + stalls due to hazards invisible to programmer (unlike process level parallelism) ILP: overlap execution
More informationHW1 Solutions. Type Old Mix New Mix Cost CPI
HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by
More informationCPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo
More informationCS 423 Computer Architecture Spring Lecture 04: A Superscalar Pipeline
CS 423 Computer Architecture Spring 2012 Lecture 04: A Superscalar Pipeline Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs423/ [Adapted from Computer Organization and Design, Patterson & Hennessy,
More informationCS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming
CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at
More informationAdvanced issues in pipelining
Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one
More informationLoad1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1
Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]
More informationThe University of Texas at Austin
EE382 (20): Computer Architecture - Parallelism and Locality Lecture 4 Parallelism in Hardware Mattan Erez The University of Texas at Austin EE38(20) (c) Mattan Erez 1 Outline 2 Principles of parallel
More information15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University
15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture
More informationECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories
ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories School of Electrical and Computer Engineering Cornell University revision: 2017-10-17-12-06 1 Processor and L1 Cache Interface
More informationMetodologie di Progettazione Hardware-Software
Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism
More informationAchieving Out-of-Order Performance with Almost In-Order Complexity
Achieving Out-of-Order Performance with Almost In-Order Complexity Comprehensive Examination Part II By Raj Parihar Background Info: About the Paper Title Achieving Out-of-Order Performance with Almost
More informationCS425 Computer Systems Architecture
CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls
More informationComputer Architecture Lecture 13: State Maintenance and Recovery. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013
18-447 Computer Architecture Lecture 13: State Maintenance and Recovery Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed
More informationDonn Morrison Department of Computer Science. TDT4255 ILP and speculation
TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple
More informationInherently Lower Complexity Architectures using Dynamic Optimization. Michael Gschwind Erik Altman
Inherently Lower Complexity Architectures using Dynamic Optimization Michael Gschwind Erik Altman ÿþýüûúùúüø öõôóüòñõñ ðïîüíñóöñð What is the Problem? Out of order superscalars achieve high performance....butatthecostofhighhigh
More informationSISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:
SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture
More information