STATS AND CONFIGURATION


ZSim Tutorial, MICRO 2015
Core Models
Po-An Tsai

Outline
  ZSim core simulation techniques
  ZSim core structure
    Simple IPC1 core
    Timing core
    OOO core
  Coding examples with demo
    Branch predictor
    Westmere to Silvermont

Core Simulation Techniques
  ZSim simulates the system using Pin
    Leverages dynamic binary translation
  ZSim mainly uses 4 types of analysis routines to cover the simulated program
    Basic block
    Load and Store
    Branch

Core Simulation Technique
  A basic block (BBL) from Pin
    mov (%rbp),%rcx
    add %rax,%rbx
    mov %rdx,(%rbp)
    BasicBlock(BblDescriptor)
    ja 40530a
  1. Simulate core activities with a BBL descriptor that contains most of the static information
    Decode BBL into BBL descriptor
      BblDescriptor:
        numinstructions = 4
        numbytes = 4
        uop[]

Core Simulation Technique
  Decode x86 instructions into uops
    Each uop has its own latency, src/dst register pair, and functional unit ports
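A minimal sketch of what a decoded BBL descriptor could hold for the example block. The names `Uop`, `BblDescriptor`, and `decodeExampleBbl`, the register numbering, the latencies, and the port masks are all illustrative assumptions, not zsim's actual definitions. The store is kept as a single uop for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical uop record: static per-uop information produced at decode time.
struct Uop {
    uint8_t srcRegs[2];  // source register ids (0 = unused)
    uint8_t dstReg;      // destination register id (0 = unused)
    uint8_t latency;     // execution latency in cycles (made-up values)
    uint8_t portMask;    // bit i set => uop may use functional-unit port i
    bool isLoad;
    bool isStore;
};

// Hypothetical BBL descriptor: built once per basic block, reused on every execution.
struct BblDescriptor {
    uint32_t numInstructions;
    uint32_t numBytes;
    std::vector<Uop> uops;
};

// Hand-decoded descriptor for the four-instruction example block.
// Arbitrary register ids: 1=rax, 2=rbx, 3=rcx, 4=rdx, 5=rbp, 6=rflags.
inline BblDescriptor decodeExampleBbl() {
    BblDescriptor d;
    d.numInstructions = 4;
    d.numBytes = 4;  // value shown on the slide
    d.uops = {
        Uop{{5, 0}, 3, 4, 0x04, true,  false},  // mov (%rbp),%rcx : load
        Uop{{1, 2}, 2, 1, 0x23, false, false},  // add %rax,%rbx   : ALU
        Uop{{4, 5}, 0, 1, 0x18, false, true},   // mov %rdx,(%rbp) : store
        Uop{{6, 0}, 0, 1, 0x20, false, false},  // ja 40530a       : branch
    };
    // In this sketch every instruction maps to exactly one uop.
    assert(d.uops.size() == d.numInstructions);
    return d;
}
```

Because the descriptor holds only static information, it can be built the first time the block is seen and then handed to the basic-block routine on every later execution.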

Core Simulation Technique
  2. Simulate memory system operations with addresses
    mov (%rbp),%rcx    Load(%rbp)
    add %rax,%rbx
    mov %rdx,(%rbp)    Store(%rbp)
    BasicBlock(BblDescriptor)
    ja 40530a

    Load(Address addr) { L1D->load(addr); }
    Store(Address addr) { L1D->store(addr); }
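The forwarding of run-time addresses to the L1D can be sketched as follows. `SimpleCache` is a hypothetical direct-mapped, tag-only model and `Core` a hypothetical wrapper; neither is zsim's cache or core class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef uint64_t Address;

// Hypothetical direct-mapped cache model: tracks tags only, counts hits and misses.
class SimpleCache {
  public:
    SimpleCache(uint32_t numLines, uint32_t lineBytes)
        : lineBytes_(lineBytes), tags_(numLines, 0), valid_(numLines, false),
          hits_(0), misses_(0) {
        assert(numLines > 0 && lineBytes > 0);
    }

    void load(Address addr)  { access(addr); }
    void store(Address addr) { access(addr); }

    uint64_t hits() const   { return hits_; }
    uint64_t misses() const { return misses_; }

  private:
    void access(Address addr) {
        Address lineAddr = addr / lineBytes_;
        size_t idx = static_cast<size_t>(lineAddr % tags_.size());
        if (valid_[idx] && tags_[idx] == lineAddr) {
            hits_++;
        } else {
            misses_++;           // fill the line on a miss
            valid_[idx] = true;
            tags_[idx] = lineAddr;
        }
    }

    uint32_t lineBytes_;
    std::vector<Address> tags_;
    std::vector<bool> valid_;
    uint64_t hits_;
    uint64_t misses_;
};

// Analysis routines: the instrumentation passes in the effective address
// that the simulated instruction computed at run time.
struct Core {
    SimpleCache* l1d;
    void Load(Address addr)  { l1d->load(addr); }
    void Store(Address addr) { l1d->store(addr); }
};
```

For the example block, the load and the store both use the address in %rbp, so the first access misses and the second hits the same line.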

Core Simulation Technique
  Instruction-driven core activity (basic block) simulation
    Simulates multiple stages for a single instruction at once
    Each stage maintains a separate clock

    BasicBlock(BblDescriptor) {
      foreach uop {
        simulatefetch(uop);
        simulatedecode(uop);
        simulateissue(uop);
        simulateexecute(uop);
        simulatecommit(uop);
      }
    }

    simulateissue(uop) {
      adduoptorob(currobcycle, uop);
      if (rob.isfull()) {
        nextrobavailcycle = rob.advance();
      }
    }
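The per-stage clocks can be illustrated with a small runnable sketch. `StagedCore`, `SimpleUop`, the one-uop-per-cycle fetch, and the ROB policy are simplifying assumptions made for this example and are much coarser than the OOO model in the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct SimpleUop { uint32_t latency; };  // hypothetical: execution latency only

class StagedCore {
  public:
    explicit StagedCore(size_t robSize)
        : robSize_(robSize), fetchCycle_(0), issueCycle_(0), commitCycle_(0) {
        assert(robSize > 0);
    }

    // Simulate every stage of each uop before moving on to the next uop.
    void basicBlock(const std::vector<SimpleUop>& uops) {
        for (const SimpleUop& uop : uops) {
            // Fetch: one uop per cycle in this sketch.
            fetchCycle_ += 1;
            // Issue: at most one per cycle, never before the uop is fetched.
            issueCycle_ = std::max(issueCycle_ + 1, fetchCycle_);
            // A full ROB stalls issue until the oldest entry has committed.
            if (rob_.size() == robSize_) {
                issueCycle_ = std::max(issueCycle_, rob_.front());
                rob_.pop_front();
            }
            // Execute: result is ready `latency` cycles after issue.
            uint64_t doneCycle = issueCycle_ + uop.latency;
            // Commit: in program order, so never before the previous commit.
            commitCycle_ = std::max(commitCycle_, doneCycle);
            rob_.push_back(commitCycle_);
        }
    }

    uint64_t commitCycle() const { return commitCycle_; }

  private:
    size_t robSize_;
    uint64_t fetchCycle_;
    uint64_t issueCycle_;
    uint64_t commitCycle_;
    std::deque<uint64_t> rob_;  // commit cycle of each in-flight uop, oldest first
};
```

No global cycle loop is needed: each stage clock only moves forward as uops pass through it, and a long-latency uop delays later uops only through the structures they share, here the ROB and in-order commit.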

Core Simulation Technique
  Event-driven uncore activity simulation (weave phase)
    Request from core -> Cache Tag -> Cache Miss -> Mem Data -> Cache Data -> Response to core
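A minimal sketch of event-driven simulation for one cache miss, using a priority queue ordered by cycle. `EventQueue`, `scheduleMiss`, and every latency value are hypothetical placeholders chosen for the example; they do not reflect zsim's event classes or any real machine.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

struct Event {
    uint64_t cycle;
    std::string name;
};

// Orders the priority queue so that the earliest event is on top.
struct LaterFirst {
    bool operator()(const Event& a, const Event& b) const { return a.cycle > b.cycle; }
};

class EventQueue {
  public:
    void schedule(uint64_t cycle, const std::string& name) {
        q_.push(Event{cycle, name});
    }

    // Pops events in cycle order, appends their names to `trace`,
    // and returns the cycle of the last event (0 if the queue was empty).
    uint64_t run(std::vector<std::string>* trace) {
        assert(trace != nullptr);
        uint64_t now = 0;
        while (!q_.empty()) {
            Event e = q_.top();
            q_.pop();
            now = e.cycle;
            trace->push_back(e.name);
        }
        return now;
    }

  private:
    std::priority_queue<Event, std::vector<Event>, LaterFirst> q_;
};

// Schedules the chain of events for one cache miss. Latencies are made up.
inline void scheduleMiss(EventQueue& eq, uint64_t reqCycle) {
    const uint64_t tagLat = 4, memLat = 100, fillLat = 2, respLat = 1;
    eq.schedule(reqCycle, "request from core");
    eq.schedule(reqCycle + tagLat, "cache tag: miss");
    eq.schedule(reqCycle + tagLat + memLat, "mem data");
    eq.schedule(reqCycle + tagLat + memLat + fillLat, "cache data");
    eq.schedule(reqCycle + tagLat + memLat + fillLat + respLat, "response to core");
}
```

Simulated time jumps from one event to the next, so idle cycles between the tag lookup and the memory response cost nothing to simulate.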

General Core Structure
  ZSim simulates a core with 4 functions using Pin's APIs
    BblFunc
    LoadFunc
    StoreFunc
    BranchFunc
  Currently supported core types
    Simple IPC1 core
    Timing core
    OOO core (Westmere-like)

Simple IPC1 Core
  (Diagram: IPC1 core with L1I and L1D caches)
  Current cycle = 0
    mov (%rbp),%rcx    Load(%rbp)     Current cycle = l1d->load(curcycle)
    add %rax,%rbx
    mov %rdx,(%rbp)    Store(%rbp)    Current cycle = l1d->store(curcycle)
    BasicBlock(BblDescriptor)         Current cycle += 4
    ja 40530a
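The IPC1 walkthrough can be written as a small core class with the four analysis functions. This is a sketch under stated assumptions: `FixedLatencyCache` is a hypothetical stand-in for the L1D, the address parameter on `load`/`store` is added here for illustration (the slide shows only the cycle argument), and the class and method names are not zsim's.

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t Address;

// Hypothetical L1D stand-in: every access completes after a fixed latency.
struct FixedLatencyCache {
    uint64_t latency;
    // Both calls return the cycle at which the access completes.
    uint64_t load(Address, uint64_t curCycle)  { return curCycle + latency; }
    uint64_t store(Address, uint64_t curCycle) { return curCycle + latency; }
};

class Ipc1Core {
  public:
    explicit Ipc1Core(FixedLatencyCache* l1d) : l1d_(l1d), curCycle_(0), instrs_(0) {
        assert(l1d != nullptr);
    }

    // Memory accesses block the core: the clock jumps to the completion cycle.
    void load(Address addr)  { curCycle_ = l1d_->load(addr, curCycle_); }
    void store(Address addr) { curCycle_ = l1d_->store(addr, curCycle_); }

    // IPC = 1: each instruction in the block costs one cycle.
    void bbl(uint32_t numInstructions) {
        curCycle_ += numInstructions;
        instrs_ += numInstructions;
    }

    // Branches carry no extra cost in this model.
    void branch(Address, bool) {}

    uint64_t curCycle() const { return curCycle_; }
    uint64_t instrs() const   { return instrs_; }

  private:
    FixedLatencyCache* l1d_;
    uint64_t curCycle_;
    uint64_t instrs_;
};
```

With a 3-cycle cache, the example block advances the clock by 3 for the load, 3 for the store, and 4 for the four instructions.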

Timing Core
  (Diagram: core with L1I and L1D caches)
  Current cycle = 0
    mov (%rbp),%rcx    Load(%rbp)     Current cycle = l1d->load(curcycle)
    add %rax,%rbx
    mov %rdx,(%rbp)    Store(%rbp)    Current cycle = l1d->store(curcycle)
    BasicBlock(BblDescriptor)         Current cycle += 4
    ja 40530a
  Events recorded for the load
    Request from core, Tag Acc, Miss, Write back, Mem Data Read, Data Write, Response to core
  Events recorded for the store
    Request from core, Tag Acc, Data Write

OOO Core - BBL
  Simulate all stages at once
  (Animation: the uops Load A, Exec, Store A, Exec pass through Fetch, Decode, Issue, OOO Execute, Commit)
  Fetch
    Misprediction: fetch wrong-path instructions
    Ins Fetch: fetch whole BBL
    Adjust fetch clock
  Decode
    uop Queue: check next available cycle
    Adjust decode clock
  Issue
    Reg Scoreboard: check src available cycle
    Rob: check next available cycle
    Issue width, RegFile width
    Adjust issue clock
  OOO Execute
    Ins Window: schedule uop in the next cycle in which the needed ports are available
    LS Unit*: issue Load/Store (*only for load/store)
  Commit
    Reg Scoreboard: set dst available cycle
    Rob: retire uop considering rob width
    Adjust retire clock

OOO Core Load/Store
  Simulate MLP
  (Timeline diagram: Load A and Load B, each with Issue, Cache, and Mem events; cycle labels 30, 50, 70, 90, 110)
  In the weave phase, request B will not be delayed due to contention for A
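The benefit of overlapping independent misses can be shown with a small calculation. The two functions below are hypothetical helpers written for this example, and the issue cycles and miss latency in the usage note are made-up numbers rather than values read from the timeline.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Completion cycle of the last load when independent misses may overlap (MLP):
// each load finishes `missLatency` cycles after its own issue.
inline uint64_t finishOverlapped(const std::vector<uint64_t>& issueCycles,
                                 uint64_t missLatency) {
    assert(!issueCycles.empty());
    uint64_t done = 0;
    for (uint64_t issue : issueCycles) {
        done = std::max(done, issue + missLatency);
    }
    return done;
}

// Completion cycle when each load must wait for the previous one to finish,
// as in a core that blocks on every miss.
inline uint64_t finishSerialized(const std::vector<uint64_t>& issueCycles,
                                 uint64_t missLatency) {
    assert(!issueCycles.empty());
    uint64_t done = 0;
    for (uint64_t issue : issueCycles) {
        done = std::max(done, issue) + missLatency;
    }
    return done;
}
```

For two loads issued at cycles 30 and 50 with a 60-cycle miss latency, the overlapped case finishes at cycle 110 and the serialized case at cycle 150.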

Simulation Speed for Different Core Types
  SPEC CPU2006 suite
  ~3X difference between IPC1 and OOO-C in Hmean

Not Modeled Core Behaviors
  Wrong path execution
    Hard to simulate for Pin
    Okay to skip for Westmere
  Fine-grained message-passing
    Need significant changes
  TLBs and SMT
    Not supported yet

Coding Examples
  Implement a branch predictor for OOO core
  Change OOO core type
    From Westmere to Silvermont

Implement Branch Predictors
  Have a new branch predictor class
  Implement the predict method
    (the demo body below is a simple last-outcome predictor placed under the GShare class name)

    class GShareBranchPredictor {
      private:
        bool lastseen;
      public:
        // Predicts and updates; returns false if mispredicted
        inline bool predict(Address branchpc, bool taken) {
          bool prediction = (taken == lastseen);
          lastseen = taken;
          return prediction; // true if this outcome matches the last one seen
        }
    };

  Replace the branch predictor in ooo_core.h

    //BranchPredictorPAg<11, 18, 14> branchpred;
    GShareBranchPredictor branchpred;

Demo

Different OOO Micro-architecture
  The original zsim assumes a Westmere OOO core, but what if I want to simulate a Silvermont/Haswell OOO core?
    Step 1: obtain the important OOO core parameters
    Step 2: change the core parameters in ooo_core.h/cpp
    Step 3: verify it against a real system

Obtain Important Core Parameters
                      Westmere[1]    Silvermont[2]
  Issue width         4              2
  F/D/I/E stages      1/4/7/13       1/3/5/8
  Fetch width         16B            8B
  RF read width       3              2
  ROB size            128            32
  Ins window          1K * 36        1K * 16
  Issue queue         28             8
  [1] [2]

Change OOO Core Parameters
  Change sizes of hardware structures in ooo_core.h
    CycleQueue<28> uopqueue      ->  <8>
    ReorderBuffer<128, 4> rob    ->  <32, 2>
  Change the OOO core parameters in ooo_core.cpp
    #define FETCH_STAGE 1              ->  1
    #define DECODE_STAGE 4             ->  3
    #define ISSUE_STAGE 7              ->  5
    #define DISPATCH_STAGE 13          ->  8
    #define FETCH_BYTES_PER_CYCLE 16   ->  8
    #define ISSUES_PER_CYCLE 4         ->  2
    #define RF_READS_PER_CYCLE 3       ->  2
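The reason these changes are source edits rather than config options is that the sizes are compile-time constants. The sketch below shows the idea with a hypothetical `FixedCycleQueue<N>` and two parameter structs; the names are invented for this example and the numbers are the ones listed above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-capacity queue whose size is a template parameter, so the
// storage is a plain array and the wrap-around uses a compile-time constant.
template <size_t N>
class FixedCycleQueue {
    static_assert(N > 0, "queue needs at least one slot");
  public:
    FixedCycleQueue() : head_(0) { buf_.fill(0); }

    // Cycle at which the oldest slot becomes free.
    uint64_t minAllocCycle() const { return buf_[head_]; }

    // Reuse the oldest slot; it stays occupied until `freeCycle`.
    void markLeave(uint64_t freeCycle) {
        assert(freeCycle >= buf_[head_]);
        buf_[head_] = freeCycle;
        head_ = (head_ + 1) % N;
    }

    static constexpr size_t capacity() { return N; }

  private:
    std::array<uint64_t, N> buf_;
    size_t head_;
};

// Per-microarchitecture constants gathered in one place (values from the table).
struct WestmereParams {
    static constexpr uint32_t kIssuesPerCycle = 4;
    static constexpr uint32_t kFetchBytesPerCycle = 16;
    static constexpr uint32_t kRfReadsPerCycle = 3;
    typedef FixedCycleQueue<28> UopQueue;
};

struct SilvermontParams {
    static constexpr uint32_t kIssuesPerCycle = 2;
    static constexpr uint32_t kFetchBytesPerCycle = 8;
    static constexpr uint32_t kRfReadsPerCycle = 2;
    typedef FixedCycleQueue<8> UopQueue;
};
```

Switching between `WestmereParams::UopQueue` and `SilvermontParams::UopQueue` changes the structure size without adding any run-time indirection to the hot path.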

Demo

Verify It Against Real System
  IPC traces for Westmere and Silvermont
    Westmere (6% performance difference)
    Silvermont (9% performance difference)

Summary
  ZSim uses instruction-driven simulation for core activities and event-driven simulation for uncore activities
  ZSim currently supports 3 types of core
    Simple IPC1 core (simple_core.h)
    Timing core (timing_core.h)
    Westmere-like OOO core (ooo_core.h)
  Extending the zsim core model is straightforward
    Modify the 4 basic analysis routines
    Substitute the hardware structures with your implementation
    Change the parameters in the OOO core

Hacking Advice
  As is common in Pin programming, functions in the core are called very frequently in zsim
  You should be aware of performance when coding
  It's the main reason why zsim statically allocates hardware structures and statically sets the OOO parameters

Thank you! Any questions?

Break / Q&A
  Try zsim now!


Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 5: Precise Exceptions. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 5: Precise Exceptions Prof. Onur Mutlu Carnegie Mellon University Last Time Performance Metrics Amdahl s Law Single-cycle, multi-cycle machines Pipelining Stalls

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

The Tomasulo Algorithm Implementation

The Tomasulo Algorithm Implementation 2162 Term Project The Tomasulo Algorithm Implementation Assigned: 11/3/2015 Due: 12/15/2015 In this project, you will implement the Tomasulo algorithm with register renaming, ROB, speculative execution

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 4

ECE 571 Advanced Microprocessor-Based Design Lecture 4 ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted

More information

Computer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ.

Computer Architectures. Chapter 4. Tien-Fu Chen. National Chung Cheng Univ. Computer Architectures Chapter 4 Tien-Fu Chen National Chung Cheng Univ. chap4-0 Advance Pipelining! Static Scheduling Have compiler to minimize the effect of structural, data, and control dependence "

More information

Outline. In-Order vs. Out-of-Order. Project Goal 5/14/2007. Design and implement an out-of-ordering superscalar. Introduction

Outline. In-Order vs. Out-of-Order. Project Goal 5/14/2007. Design and implement an out-of-ordering superscalar. Introduction Outline Group IV Wei-Yin Chen Myong Hyon Cho Introduction In-Order vs. Out-of-Order Register Renaming Re-Ordering Od Buffer Superscalar Architecture Architectural Design Bluespec Implementation Results

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

Architectures for Instruction-Level Parallelism

Architectures for Instruction-Level Parallelism Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software

More information

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding HANDLING MEMORY OPS 9 Dynamically Scheduling Memory Ops Compilers must schedule memory ops conservatively Options for hardware: Hold loads until all prior stores execute (conservative) Execute loads as

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

Processor Architecture

Processor Architecture Processor Architecture Advanced Dynamic Scheduling Techniques M. Schölzel Content Tomasulo with speculative execution Introducing superscalarity into the instruction pipeline Multithreading Content Tomasulo

More information

ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems

ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems Daniel Sanchez Massachusetts Institute of Technology sanchez@csail.mit.edu Christos Kozyrakis Stanford University christos@ee.stanford.edu

More information

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 13 - Out-of-Order Issue and Register Renaming CS 152 Computer Architecture and Engineering Lecture 13 - Out-of-Order Issue and Register Renaming Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://wwweecsberkeleyedu/~krste

More information

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections )

Lecture 9: Dynamic ILP. Topics: out-of-order processors (Sections ) Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections 2.3-2.6) 1 An Out-of-Order Processor Implementation Reorder Buffer (ROB) Branch prediction and instr fetch R1 R1+R2 R2 R1+R3 BEQZ R2 R3

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

HARDWARE SPECULATION. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

HARDWARE SPECULATION. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah HARDWARE SPECULATION Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Mid-term exam: Mar. 5 th No homework till after

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

Pipelining to Superscalar

Pipelining to Superscalar Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 22: Superscalar Processing (II) Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due Today Homework 4 Out today Due November 15

More information

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 8: Out-of-Order Execution Prof. Yanjing Li University of Chicago Administrative Stuff! Lab2 due tomorrow " 2 free late days! Lab3 is out " Start early!! My office

More information

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines

Branch Prediction & Speculative Execution. Branch Penalties in Modern Pipelines 6.823, L15--1 Branch Prediction & Speculative Execution Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 6.823, L15--2 Branch Penalties in Modern Pipelines UltraSPARC-III

More information

ECE 552: Introduction To Computer Architecture 1. Scalar upper bound on throughput. Instructor: Mikko H Lipasti. University of Wisconsin-Madison

ECE 552: Introduction To Computer Architecture 1. Scalar upper bound on throughput. Instructor: Mikko H Lipasti. University of Wisconsin-Madison ECE/CS 552: Introduction to Superscalar Processors Instructor: Mikko H Lipasti Fall 2010 University of Wisconsin-Madison Lecture notes partially based on notes by John P. Shen Limitations of Scalar Pipelines

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

Chapter. Out of order Execution

Chapter. Out of order Execution Chapter Long EX Instruction stages We have assumed that all stages. There is a problem with the EX stage multiply (MUL) takes more time than ADD MUL ADD We can clearly delay the execution of the ADD until

More information

Constructive Computer Architecture Tutorial 6: Discussion for lab6. October 7, 2013 T05-1

Constructive Computer Architecture Tutorial 6: Discussion for lab6. October 7, 2013 T05-1 Constructive Computer Architecture Tutorial 6: Discussion for lab6 October 7, 2013 T05-1 Introduction Lab 6 involves creating a 6 stage pipelined processor from a 2 stage pipeline This requires a lot of

More information

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types

More information

Lecture 18: Core Design, Parallel Algos

Lecture 18: Core Design, Parallel Algos Lecture 18: Core Design, Parallel Algos Today: Innovations for ILP, TLP, power and parallel algos Sign up for class presentations 1 SMT Pipeline Structure Front End Front End Front End Front End Private/

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real

More information

15-740/ Computer Architecture Lecture 4: Pipelining. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 4: Pipelining. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 4: Pipelining Prof. Onur Mutlu Carnegie Mellon University Last Time Addressing modes Other ISA-level tradeoffs Programmer vs. microarchitect Virtual memory Unaligned

More information

EECS 470 Lecture 6. Branches: Address prediction and recovery (And interrupt recovery too.)

EECS 470 Lecture 6. Branches: Address prediction and recovery (And interrupt recovery too.) EECS 470 Lecture 6 Branches: Address prediction and recovery (And interrupt recovery too.) Announcements: P3 posted, due a week from Sunday HW2 due Monday Reading Book: 3.1, 3.3-3.6, 3.8 Combining Branch

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Intel Architecture for Software Developers

Intel Architecture for Software Developers Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software

More information

CSE 490/590 Computer Architecture Homework 2

CSE 490/590 Computer Architecture Homework 2 CSE 490/590 Computer Architecture Homework 2 1. Suppose that you have the following out-of-order datapath with 1-cycle ALU, 2-cycle Mem, 3-cycle Fadd, 5-cycle Fmul, no branch prediction, and in-order fetch

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

EECS 470 Lecture 7. Branches: Address prediction and recovery (And interrupt recovery too.)

EECS 470 Lecture 7. Branches: Address prediction and recovery (And interrupt recovery too.) EECS 470 Lecture 7 Branches: Address prediction and recovery (And interrupt recovery too.) Warning: Crazy times coming Project handout and group formation today Help me to end class 12 minutes early P3

More information

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog)

Announcements. EE382A Lecture 6: Register Renaming. Lecture 6 Outline. Dynamic Branch Prediction Using History. 1. Branch Prediction (epilog) Announcements EE382A Lecture 6: Register Renaming Project proposal due on Wed 10/14 2-3 pages submitted through email List the group members Describe the topic including why it is important and your thesis

More information

Visualizing the out-of-order CPU model. Ryota Shioya Nagoya University

Visualizing the out-of-order CPU model. Ryota Shioya Nagoya University Visualizing the out-of-order CPU model Ryota Shioya Nagoya University Introduction This presentation introduces the visualization of the out-of-order CPU model in gem5 2 Introduction Let's suppose you

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

Dynamic Memory Dependence Predication

Dynamic Memory Dependence Predication Dynamic Memory Dependence Predication Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles Background 1. Store instructions do not update the cache until they are retired (too late). 2. Store queue is

More information

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation. Types of dependences Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal pipeline CPI + stalls due to hazards invisible to programmer (unlike process level parallelism) ILP: overlap execution

More information

HW1 Solutions. Type Old Mix New Mix Cost CPI

HW1 Solutions. Type Old Mix New Mix Cost CPI HW1 Solutions Problem 1 TABLE 1 1. Given the parameters of Problem 6 (note that int =35% and shift=5% to fix typo in book problem), consider a strength-reducing optimization that converts multiplies by

More information

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Tomasulo

More information

CS 423 Computer Architecture Spring Lecture 04: A Superscalar Pipeline

CS 423 Computer Architecture Spring Lecture 04: A Superscalar Pipeline CS 423 Computer Architecture Spring 2012 Lecture 04: A Superscalar Pipeline Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs423/ [Adapted from Computer Organization and Design, Patterson & Hennessy,

More information

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming John Wawrzynek Electrical Engineering and Computer Sciences University of California at

More information

Advanced issues in pipelining

Advanced issues in pipelining Advanced issues in pipelining 1 Outline Handling exceptions Supporting multi-cycle operations Pipeline evolution Examples of real pipelines 2 Handling exceptions 3 Exceptions In pipelined execution, one

More information

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1

Load1 no Load2 no Add1 Y Sub Reg[F2] Reg[F6] Add2 Y Add Reg[F2] Add1 Add3 no Mult1 Y Mul Reg[F2] Reg[F4] Mult2 Y Div Reg[F6] Mult1 Instruction Issue Execute Write result L.D F6, 34(R2) L.D F2, 45(R3) MUL.D F0, F2, F4 SUB.D F8, F2, F6 DIV.D F10, F0, F6 ADD.D F6, F8, F2 Name Busy Op Vj Vk Qj Qk A Load1 no Load2 no Add1 Y Sub Reg[F2]

More information

The University of Texas at Austin

The University of Texas at Austin EE382 (20): Computer Architecture - Parallelism and Locality Lecture 4 Parallelism in Hardware Mattan Erez The University of Texas at Austin EE38(20) (c) Mattan Erez 1 Outline 2 Principles of parallel

More information

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 8: Issues in Out-of-order Execution. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 8: Issues in Out-of-order Execution Prof. Onur Mutlu Carnegie Mellon University Readings General introduction and basic concepts Smith and Sohi, The Microarchitecture

More information

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories

ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories ECE 4750 Computer Architecture, Fall 2017 T05 Integrating Processors and Memories School of Electrical and Computer Engineering Cornell University revision: 2017-10-17-12-06 1 Processor and L1 Cache Interface

More information

Metodologie di Progettazione Hardware-Software

Metodologie di Progettazione Hardware-Software Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism

More information

Achieving Out-of-Order Performance with Almost In-Order Complexity

Achieving Out-of-Order Performance with Almost In-Order Complexity Achieving Out-of-Order Performance with Almost In-Order Complexity Comprehensive Examination Part II By Raj Parihar Background Info: About the Paper Title Achieving Out-of-Order Performance with Almost

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

Computer Architecture Lecture 13: State Maintenance and Recovery. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013

Computer Architecture Lecture 13: State Maintenance and Recovery. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013 18-447 Computer Architecture Lecture 13: State Maintenance and Recovery Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/15/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed

More information

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation

Donn Morrison Department of Computer Science. TDT4255 ILP and speculation TDT4255 Lecture 9: ILP and speculation Donn Morrison Department of Computer Science 2 Outline Textbook: Computer Architecture: A Quantitative Approach, 4th ed Section 2.6: Speculation Section 2.7: Multiple

More information

Inherently Lower Complexity Architectures using Dynamic Optimization. Michael Gschwind Erik Altman

Inherently Lower Complexity Architectures using Dynamic Optimization. Michael Gschwind Erik Altman Inherently Lower Complexity Architectures using Dynamic Optimization Michael Gschwind Erik Altman ÿþýüûúùúüø öõôóüòñõñ ðïîüíñóöñð What is the Problem? Out of order superscalars achieve high performance....butatthecostofhighhigh

More information

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version: SISTEMI EMBEDDED Computer Organization Pipelining Federico Baronti Last version: 20160518 Basic Concept of Pipelining Circuit technology and hardware arrangement influence the speed of execution for programs

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture

More information