Lectures 14 & 15: Instruction Scheduling
Simple Machine Model

6.035, Fall 2005. Lectures 14 & 15: Instruction Scheduling.

Instructions are executed in sequence: fetch, decode, execute, store results, one instruction at a time. For branch instructions, start fetching from a different location if needed: check the branch condition; the next instruction may come from a new location given by the branch instruction.

Simple Execution Model

A 5-stage pipeline: fetch, decode, execute, memory, writeback.
- Fetch: get the next instruction
- Decode: figure out what that instruction is
- Execute: perform the ALU operation (the address calculation, in a memory op)
- Memory: do the memory access, in a memory op
- Write Back: write the results back

(Pipeline diagram: instructions 1 through 5 overlap in the IF, DE, EX, MEM, WB stages over time, one instruction entering the pipeline per cycle.)

From a Simple Machine Model to a Real Machine Model

Many pipeline stages: Pentium 5, Pentium Pro 10, Pentium IV (130nm) 20, Pentium IV (90nm) 31. Different instructions take different amounts of time to execute.

Real Machine Model (cont.)

Most modern processors have multiple execution units (superscalar). If the instruction sequence is right, multiple operations happen in the same cycles, so having the right instruction sequence is even more important. Hardware stalls the pipeline if an instruction uses a result that is not ready.

Saman Amarasinghe, 6.035, MIT Fall 1998
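The payoff of the pipelined model above can be stated as a formula: an ideal stall-free k-stage pipeline finishes N instructions in k + (N - 1) cycles, since only the first instruction pays the full pipeline depth. A minimal sketch (the function name is ours, not from the slides):

```python
def pipelined_cycles(n_instructions, stages=5):
    """Cycles for n instructions on an ideal, stall-free pipeline:
    the first instruction takes `stages` cycles to drain through,
    and every later one retires one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)
```

Five instructions on the 5-stage pipeline finish in 9 cycles, versus 5 * 5 = 25 cycles if each instruction ran to completion before the next started.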
Constraints on Scheduling

Data dependencies, control dependencies, and resource constraints.

Data Dependency Between Instructions

If two instructions access the same variable, they can be dependent. Kinds of dependencies:
- True: write then read
- Anti: read then write
- Output: write then write

What to do if two instructions are dependent: the order of execution cannot be reversed, which reduces the possibilities for scheduling.

Computing Dependencies

For basic blocks, compute dependencies by walking through the instructions. Identifying register dependencies is simple: is it the same register? For memory accesses it is harder:
- simple: base + offset1 ?= base + offset2
- data dependence analysis: a[i] ?= a[i+1]
- interprocedural analysis: global ?= parameter
- pointer alias analysis: p1 ?= p2

Representing Dependencies

Use a dependence DAG, one per basic block: nodes are instructions, edges represent dependencies.

    1: r2 = *(r1 + 4)
    2: r3 = *(r1 + 8)
    3: r4 = r2 + r3
    4: r5 = r2 - 1

Each edge is labeled with a latency: v(i -> j) is the delay required between the initiation times of i and j, minus the execution time required by i.
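The three dependence kinds above can be found with a pairwise scan of the block. A sketch, assuming a hypothetical tuple encoding (dest, src1, src2, ...) for each instruction; the encoding and function name are ours:

```python
def dependences(insts):
    """Return edges (i, j, kind) for register dependences between
    instructions, where i comes before j in the block.
    insts[k] = (dest, *sources); dest may be None for a pure read."""
    edges = []
    for j, (dj, *uses_j) in enumerate(insts):
        for i in range(j):
            di, *uses_i = insts[i]
            if di is not None and di in uses_j:
                edges.append((i, j, "true"))    # i writes, j reads
            if dj is not None and dj in uses_i:
                edges.append((i, j, "anti"))    # i reads, j writes
            if di is not None and di == dj:
                edges.append((i, j, "output"))  # both write
    return edges
```

On the four-instruction example above (r2 = *(r1+4); r3 = *(r1+8); r4 = r2 + r3; r5 = r2 - 1), this reports the three true dependences that form the DAG.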
Another Example: Control Dependencies and Resource Constraints

    1: r2 = *(r1 + 4)
    2: *(r1 + 4) = r3
    3: r4 = r2 + r3
    4: r5 = r2 - 1

For now, let's only worry about basic blocks, and let's look only at simple pipelines.

List Scheduling Algorithm

Idea: do a topological sort of the dependence DAG. Consider when an instruction can be scheduled without causing a stall: schedule it if it causes no stall and all its predecessors are already scheduled. Optimal list scheduling is NP-complete, so use heuristics when necessary.

Results in:

    1: lea var_a, %rax        1 cycle
    2: add $4, %rax           1 cycle
    3: inc %r11               1 cycle
    4: mov (%rsp), %r10       multiple cycles
    5: add %r10, 8(%rsp)
    6: and 16(%rsp), %rbx     multiple cycles

(Schedule diagram: issued in program order, the sequence spends several cycles stalled waiting for the memory operations.)

List Scheduling Algorithm

- Create a dependence DAG of the basic block
- Topological sort: READY = nodes with no predecessors
- Loop until READY is empty:
  - schedule each node in READY when it causes no stall
  - update READY

Heuristics for Selection

Heuristics for selecting from the READY list:
- pick the node with the longest path to a leaf in the dependence graph
- pick a node with the most immediate successors
- pick a node that can go to a less busy pipeline (in a superscalar)
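The loop above can be sketched in a few lines of Python. This is a single-issue sketch under assumed integer latencies; all names are ours:

```python
def list_schedule(n, edges, latency):
    """Greedy list scheduling of one basic block.
    edges: (i, j) pairs meaning j depends on i; latency[i]: cycles
    until i's result is ready. Returns the issue cycle of each
    instruction, inserting stall cycles when nothing is ready."""
    preds = {j: [] for j in range(n)}
    for i, j in edges:
        preds[j].append(i)
    start = [None] * n
    cycle = 0
    done = 0
    while done < n:
        # READY = unscheduled nodes whose predecessors' results are available.
        ready = [i for i in range(n) if start[i] is None
                 and all(start[p] is not None and start[p] + latency[p] <= cycle
                         for p in preds[i])]
        if ready:
            i = ready[0]        # a selection heuristic plugs in here
            start[i] = cycle
            done += 1
        cycle += 1              # if nothing was ready, this cycle is a stall
    return start
```

On the earlier DAG (two 2-cycle loads feeding an add, plus a subtract that needs only the first load), the scheduler fills the would-be stall slot with the independent subtract.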
Heuristics for Selection (cont.)

Pick the node with the longest path to a leaf in the dependence graph. Algorithm (for node x):
- if x has no successors: d_x = 0
- otherwise: d_x = MAX(d_y + c_xy) over all successors y of x
- computed in reverse breadth-first visitation order

Pick a node with the most immediate successors. Algorithm (for node x): f_x = number of successors of x.

Results in:

    1: lea var_a, %rax
    2: add $4, %rax
    3: inc %r11
    4: mov (%rsp), %r10
    5: add %r10, 8(%rsp)
    6: and 16(%rsp), %rbx
    8: mov %rbx, 16(%rsp)
    9: lea var_b, %rax

(Dependence DAG annotated with each node's d and f values; READY starts with the root nodes. The reordered schedule eliminates most of the stall slots: 9 cycles instead of the longer in-order schedule.)

Resource Constraints

Modern machines have many resource constraints. Superscalar architectures can run a few operations in parallel, but they have constraints.
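The two priorities above, d_x (longest path to a leaf) and f_x (number of immediate successors), can be computed in one reverse sweep. A sketch assuming nodes are numbered in topological order; names are ours:

```python
def priorities(n, edges, delay):
    """d[x] = MAX(d[y] + c_xy) over successors y (0 at a leaf),
    f[x] = number of immediate successors.
    delay[(x, y)] is the edge weight c_xy. Assumes nodes are numbered
    in topological order, so a reverse sweep sees successors first."""
    succs = {i: [] for i in range(n)}
    for i, j in edges:
        succs[i].append(j)
    f = [len(succs[i]) for i in range(n)]
    d = [0] * n
    for x in reversed(range(n)):
        if succs[x]:
            d[x] = max(d[y] + delay[(x, y)] for y in succs[x])
    return d, f
```

A READY-list scheduler would then prefer the node with the largest d (and break ties with f), keeping the critical path moving.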
Resource Constraints of a Superscalar Processor

Example machine:
- one fully pipelined reg-to-reg unit; all integer operations take one cycle
- in parallel with one fully pipelined memory-to/from-reg unit; data loads take two cycles, data stores take one cycle

List Scheduling Algorithm with Resource Constraints

Represent the superscalar architecture as multiple pipelines; each pipeline represents some resource. Here: one single-cycle reg-to-reg ALU unit, and one two-cycle pipelined reg-to/from-memory unit (issue slots ALUop, MEM1, MEM2).

- Create a dependence DAG of the basic block
- Topological sort: READY = nodes with no predecessors
- Loop until READY is empty:
  - let n in READY be the node with the highest priority
  - schedule n in the earliest slot that satisfies precedence + resource constraints
  - update READY

Example:

    1: lea var_a, %rax
    2: add (%rsp), %rax
    3: inc %r11
    4: mov (%rsp), %r10
    5: mov %r10, 8(%rsp)
    ...
    9: mov %rbx, 16(%rsp)

(Worked example: the dependence DAG is annotated with d and f priorities; starting from a READY set of four nodes, each step places the highest-priority ready node into the earliest free ALUop, MEM1, or MEM2 slot.)
(The trace continues step by step: node 6 is placed, then as results become available nodes 5, 8, and 9 enter READY and are scheduled in turn, each into the earliest slot of the ALUop, MEM1, or MEM2 pipeline that satisfies its precedence and resource constraints.)
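The trace above follows the resource-constrained algorithm. Its core loop can be sketched in Python; the unit names, priority scheme, and one-issue-per-unit-per-cycle model are our simplifying assumptions:

```python
def schedule_with_resources(unit, edges, latency, priority):
    """List scheduling with resource constraints.
    unit[i] names the pipeline instruction i needs (e.g. 'ALU', 'MEM'),
    and each pipeline can issue at most one instruction per cycle.
    Returns the issue cycle chosen for each instruction."""
    n = len(unit)
    preds = {j: [] for j in range(n)}
    for i, j in edges:
        preds[j].append(i)
    start = [None] * n
    taken = set()                                  # (unit, cycle) slots in use
    for _ in range(n):
        # READY = unscheduled nodes whose predecessors are all scheduled.
        ready = [i for i in range(n) if start[i] is None
                 and all(start[p] is not None for p in preds[i])]
        i = max(ready, key=lambda x: priority[x])  # highest-priority node
        # Earliest cycle satisfying precedence, then the first free slot.
        c = max((start[p] + latency[p] for p in preds[i]), default=0)
        while (unit[i], c) in taken:
            c += 1
        start[i] = c
        taken.add((unit[i], c))
    return start
```

With two loads feeding an ALU op, the second load is pushed to the next MEM slot, and the ALU op issues once both results are ready.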
(Final step: READY empties and the whole block has been placed across the ALUop, MEM1, and MEM2 pipelines.)

Scheduling Across Basic Blocks

The number of instructions in a basic block is small. We cannot keep multiple units with long pipelines busy by scheduling only within a basic block, so we need to handle control dependence: scheduling constraints across basic blocks, and a scheduling policy.

Moving Across Basic Blocks

- Downward, to an adjacent basic block: is there a path to the destination block that does not execute the instruction's original block?
- Upward, to an adjacent basic block: is there a path from the instruction's original block that does not reach the destination?

Control Dependencies

Constraints in moving instructions across basic blocks:

    if (...)
        a = b op c

    if (valid address?)
        d = *(a)

In the first case, a = b op c must not be moved above the condition if that changes observable behavior; in the second, the load must not be hoisted above the address check.
Trace Scheduling

Find the most common trace of basic blocks, using profile information. Combine the basic blocks in the trace and schedule them as one block. Create clean-up code if the execution goes off-trace.

(CFG sketches: the hot trace of blocks is merged and scheduled as a single large block, with compensation code on the off-trace entries and exits.)

Large Basic Blocks via Code Duplication

Create large extended basic blocks by duplication, then schedule the larger blocks.

Scheduling Loops

Loop bodies are small, but a lot of time is spent in loops because of their large number of iterations, so we need better ways to schedule loops.
Loop Example

Machine model:
- one load/store unit; loads and stores take multiple cycles
- two arithmetic units; add, branch, and multiply each have their own latency
- both units are pipelined (initiate one operation each cycle)

Source Code

    for i = 1 to N
        a[i] = a[i] * b

Assembly Code

    loop:   mov (%rdi,%rax), %r10
            mul %r11, %r10
            mov %r10, (%rdi,%rax)
            sub $4, %rax
            bge loop

(Schedule: 9 cycles per iteration, dominated by the load, multiply, store chain.)

Loop Unrolling

Unroll the loop body a few times.
- Pros: creates a much larger basic block for the body; eliminates a few loop-bounds checks.
- Cons: a much larger program; setup code is needed when the number of iterations is smaller than the unroll factor; the beginning and end of the schedule can still have unused slots.

Unrolled twice:

    loop:   mov (%rdi,%rax), %r10
            mul %r11, %r10
            mov %r10, (%rdi,%rax)
            sub $4, %rax
            mov (%rdi,%rax), %r10
            mul %r11, %r10
            mov %r10, (%rdi,%rax)
            sub $4, %rax
            bge loop

(Schedule: 8 cycles per iteration.)

Loop Unrolling: Rename Registers

Use different registers in different iterations.
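The combined effect of unrolling and breaking dependent chains can be seen even at the source level: splitting one accumulator into several gives the scheduler independent operations to interleave. A Python illustration (our example, not from the slides):

```python
def dot_unrolled(a, b):
    """Dot product unrolled by 4 with four separate accumulators,
    turning one long dependent chain of adds into four short,
    independent chains that a scheduler can overlap."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0
    i = 0
    while i + 4 <= n:
        s0 += a[i]     * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    total = s0 + s1 + s2 + s3
    for j in range(i, n):          # cleanup when n % 4 != 0
        total += a[j] * b[j]
    return total
```

The cleanup loop is exactly the "setup code" cost listed among the cons above: it handles the iterations left over when the trip count is not a multiple of the unroll factor.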
Loop Unrolled, with Renamed Registers

    loop:   mov (%rdi,%rax), %r10
            mul %r11, %r10
            mov %r10, (%rdi,%rax)
            sub $4, %rax
            mov (%rdi,%rax), %rcx
            mul %r11, %rcx
            mov %rcx, (%rdi,%rax)
            sub $4, %rax
            bge loop

Loop Unrolling: Eliminate Unnecessary Dependencies

Rename registers: use different registers in different iterations. Again, use more registers to eliminate true, anti, and output dependencies, and eliminate dependent chains of calculations when possible. With two independent induction registers (%rax and %rbx), each stepped by 8, the two halves of the body no longer depend on each other:

    loop:   mov (%rdi,%rax), %r10
            mul %r11, %r10
            mov %r10, (%rdi,%rax)
            sub $8, %rax
            mov (%rdi,%rbx), %rcx
            mul %r11, %rcx
            mov %rcx, (%rdi,%rbx)
            sub $8, %rbx
            bge loop

(Schedule: 4.5 cycles per iteration.)

Software Pipelining

Try to overlap multiple iterations so that the slots will be filled. Find the steady-state window so that all the instructions of the loop body are executed, but from different iterations.

(Schedule diagram: iterations are overlapped; %r11 never changes; separate registers hold the (%rdi,%rax) addresses and the %r10 values of the in-flight iterations. The same registers can be reused after a fixed number of these kernel blocks, so the code generator emits that many copies; otherwise register copies would be needed.)
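The prologue/kernel/epilogue shape of a software-pipelined loop can be mimicked in Python for the a[i] = a[i] * b loop above: in the steady state, each kernel pass stores iteration i-2, multiplies iteration i-1, and loads iteration i. This is an illustration of the overlap, not generated code:

```python
def pipelined_scale(a, b):
    """a[i] = a[i] * b with load, multiply, and store drawn from
    three different iterations in the steady-state kernel."""
    n = len(a)
    if n < 2:                      # too short to pipeline
        for i in range(n):
            a[i] *= b
        return
    # Prologue: fill the pipeline.
    loaded = a[0]                  # load  iteration 0
    prod = loaded * b              # mul   iteration 0
    loaded = a[1]                  # load  iteration 1
    # Kernel (steady state): each pass touches three iterations.
    for i in range(2, n):
        a[i - 2] = prod            # store iteration i-2
        prod = loaded * b          # mul   iteration i-1
        loaded = a[i]              # load  iteration i
    # Epilogue: drain the pipeline.
    a[n - 2] = prod
    a[n - 1] = loaded * b
```

The prologue and epilogue are the pre-amble and post-amble code mentioned on the next slide; only the kernel runs at the steady-state rate.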
Software Pipelining: Issues

- Optimal use of resources, but it needs a lot of registers: values from multiple iterations must be kept live.
- Dependence issues: a store from one iteration may execute before the branch of a previous iteration has executed (writing when it should not have); loads and stores are issued out of order (dependences must be resolved before doing this).
- Code generation issues: generate pre-amble and post-amble code, and multiple copies of the kernel block so that no register copy is needed.

Register Allocation and Instruction Scheduling

If register allocation is done before instruction scheduling, it restricts the choices for scheduling:

    1: mov (%rbp), %rax         1: mov (%rbp), %rax
    2: add %rax, %rbx           2: add %rax, %rbx
    3: mov 8(%rbp), %rax        3: mov 8(%rbp), %r10
    4: add %rax, %rcx           4: add %r10, %rcx

On the left, reusing %rax creates an anti dependence from instruction 2 to instruction 3; on the right, renaming the second load to %r10 removes it, so the two load/add pairs can overlap in the ALUop and MEM pipelines.

If instruction scheduling is done before register allocation, the allocator may spill registers, and the spill code will change the carefully constructed schedule!
Superscalar: Where Have All the Transistors Gone?

Out-of-order execution: if an instruction stalls, go beyond it and start executing non-dependent instructions.
- Pros: hardware scheduling; tolerates unpredictable latencies.
- Cons: the instruction window is small.

Register renaming: if an anti or output dependency on a register stalls the pipeline, use a different hardware register.
- Pros: avoids anti and output dependencies.
- Cons: cannot do the more complex transformations that eliminate dependencies.

Hardware vs. Compiler

In a superscalar, hardware and compiler scheduling can work hand in hand. Hardware can reduce the burden when latencies are not predictable by the compiler, but the compiler can still greatly enhance performance: it has a large instruction window for scheduling and many program transformations that increase parallelism. The compiler is even more critical when there is no hardware support, as in VLIW machines (Itanium, DSPs).
More information15-740/ Computer Architecture Lecture 12: Issues in OoO Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011
15-740/18-740 Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011 Reviews Due next Monday Mutlu et al., Runahead Execution: An Alternative
More informationlast time out-of-order execution and instruction queues the data flow model idea
1 last time 2 out-of-order execution and instruction queues the data flow model idea graph of operations linked by depedencies latency bound need to finish longest dependency chain multiple accumulators
More informationCS 33. Architecture and Optimization (2) CS33 Intro to Computer Systems XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.
CS 33 Architecture and Optimization (2) CS33 Intro to Computer Systems XVI 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Modern CPU Design Instruction Control Retirement Unit Register File
More information15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011
5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer
More information(Basic) Processor Pipeline
(Basic) Processor Pipeline Nima Honarmand Generic Instruction Life Cycle Logical steps in processing an instruction: Instruction Fetch (IF_STEP) Instruction Decode (ID_STEP) Operand Fetch (OF_STEP) Might
More informationCS 406/534 Compiler Construction Instruction Scheduling
CS 406/534 Compiler Construction Instruction Scheduling Prof. Li Xu Dept. of Computer Science UMass Lowell Fall 2004 Part of the course lecture notes are based on Prof. Keith Cooper, Prof. Ken Kennedy
More informationCS252 Graduate Computer Architecture Midterm 1 Solutions
CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate
More informationCode Generation. CS 540 George Mason University
Code Generation CS 540 George Mason University Compiler Architecture Intermediate Language Intermediate Language Source language Scanner (lexical analysis) tokens Parser (syntax analysis) Syntactic structure
More informationLecture 19: Instruction Level Parallelism
Lecture 19: Instruction Level Parallelism Administrative: Homework #5 due Homework #6 handed out today Last Time: DRAM organization and implementation Today Static and Dynamic ILP Instruction windows Register
More informationCourse on Advanced Computer Architectures
Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, July 9, 2018 Course on Advanced Computer Architectures Prof. D. Sciuto, Prof. C. Silvano EX1 EX2 EX3 Q1
More informationComputer Architecture Spring 2016
Computer rchitecture Spring 2016 Lecture 10: Out-of-Order Execution & Register Renaming Shuai Wang Department of Computer Science and Technology Nanjing University In Search of Parallelism Trivial Parallelism
More informationRecall from Pipelining Review. Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: Ideas to Reduce Stalls
CS252 Graduate Computer Architecture Recall from Pipelining Review Lecture 16: Instruction Level Parallelism and Dynamic Execution #1: March 16, 2001 Prof. David A. Patterson Computer Science 252 Spring
More informationWilliam Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors
William Stallings Computer Organization and Architecture 8 th Edition Chapter 14 Instruction Level Parallelism and Superscalar Processors What is Superscalar? Common instructions (arithmetic, load/store,
More informationComputer Science 146. Computer Architecture
Computer rchitecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 11: Software Pipelining and Global Scheduling Lecture Outline Review of Loop Unrolling Software Pipelining
More informationLecture 18 List Scheduling & Global Scheduling Carnegie Mellon
Lecture 18 List Scheduling & Global Scheduling Reading: Chapter 10.3-10.4 1 Review: The Ideal Scheduling Outcome What prevents us from achieving this ideal? Before After Time 1 cycle N cycles 2 Review:
More informationPipeline Architecture RISC
Pipeline Architecture RISC Independent tasks with independent hardware serial No repetitions during the process pipelined Pipelined vs Serial Processing Instruction Machine Cycle Every instruction must
More informationCOMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: C Multiple Issue Based on P&H Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationvoid twiddle1(int *xp, int *yp) { void twiddle2(int *xp, int *yp) {
Optimization void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; void twiddle2(int *xp, int *yp) { *xp += 2* *yp; void main() { int x = 3; int y = 3; twiddle1(&x, &y); x = 3; y = 3; twiddle2(&x,
More informationCSC 631: High-Performance Computer Architecture
CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 4: Pipelining Last Time in Lecture 3 icrocoding, an effective technique to manage control unit complexity, invented in era when logic
More informationCSE 401/M501 Compilers
CSE 401/M501 Compilers Code Shape I Basic Constructs Hal Perkins Autumn 2018 UW CSE 401/M501 Autumn 2018 K-1 Administrivia Semantics/type check due next Thur. 11/15 How s it going? Be sure to (re-)read
More informationLecture: Pipeline Wrap-Up and Static ILP
Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) 1 Multicycle
More informationLecture: Static ILP. Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)
Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) 1 Static vs Dynamic Scheduling Arguments against dynamic scheduling: requires complex structures
More informationOrganisasi Sistem Komputer
LOGO Organisasi Sistem Komputer OSK 11 Superscalar Pendidikan Teknik Elektronika FT UNY What is Superscalar? Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed
More informationTECH. 9. Code Scheduling for ILP-Processors. Levels of static scheduling. -Eligible Instructions are
9. Code Scheduling for ILP-Processors Typical layout of compiler: traditional, optimizing, pre-pass parallel, post-pass parallel {Software! compilers optimizing code for ILP-processors, including VLIW}
More informationHow to efficiently use the address register? Address register = contains the address of the operand to fetch from memory.
Lesson 13 Storage Assignment Optimizations Sequence of accesses is very important Simple Offset Assignment This lesson will focus on: Code size and data segment size How to efficiently use the address
More informationSOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name:
SOLUTION Notes: CS 152 Computer Architecture and Engineering CS 252 Graduate Computer Architecture Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: I am taking CS152 / CS252 This is a closed
More informationComputer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)
18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationCS 152, Spring 2011 Section 10
CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel
More informationAccessing Variables. How can we generate code for x?
S-322 Register llocation ccessing Variables How can we generate code for x? a := x + y The variable may be in a register:,r x, The variable may be in a static memory location: ST x,r w work register L
More informationCPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation
Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction
More informationPipelining. Pipeline performance
Pipelining Basic concept of assembly line Split a job A into n sequential subjobs (A 1,A 2,,A n ) with each A i taking approximately the same time Each subjob is processed by a different substation (or
More information