CS356 Unit 12: Processor Hardware Organization and Pipelining


1 12.1 CS356 Unit 12 Processor Hardware Organization Pipelining

2 BASIC HW 12.2

3 12.3 Logic Circuits
Combinational logic: performs a specific function (a mapping of 2^n input combinations to desired output combinations), usually operations like +, -, *, /, &, |, <<. No internal state or feedback: given a set of inputs, we will always get the same output after some time (propagation) delay. Outputs depend only on the current inputs.
Sequential logic (storage devices): registers are the fundamental building blocks. A register remembers a set of bits for later use and acts like a variable in software; it is controlled by a "clock" signal. The stored values feed back and provide "memory," so outputs depend on current inputs and on previous inputs (with the previous inputs summarized via the stored state).

4 12.4 Combinational Logic Gates
Main Idea: circuits called "gates" perform logic operations to produce desired outputs from some digital inputs.
(Figure: a small example circuit built from AND, OR, and NOT gates.)

5 12.5 Propagation Delay
Main Idea: all digital logic circuits have propagation delay, the time it takes for the output to change after the inputs change.
(Figure: it takes a few gate delays for an input change to propagate to the outputs.)

6 12.6 Combinational Logic Functions
Map input combinations of n bits to the desired m-bit output. We can describe the function with a truth table and then find its circuit implementation.
(Figure: a 3-input truth table with inputs IN0-IN2 and its realization as a logic circuit.)

7 12.7 ALUs
Perform a selected operation on two input numbers; FS[5:0] selects the desired operation.

Func. Code  Op.            Func. Code  Op.
00_0000     A SHL B        10_0000     A + B
00_0010     A SHR B        10_0010     A - B
00_0011     A SAR B        10_0100     A AND B
01_1000     A * B          10_0101     A OR B
01_1001     A * B (uns.)   10_0110     A XOR B
01_1010     A / B          10_0111     A NOR B
01_1011     A / B (uns.)   10_1010     A < B

(Figure: a 32-bit ALU with inputs X[31:0], Y[31:0], and FS[5:0], producing RES[31:0] and the flags ZF, CF, OF.)
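To make the FS[5:0] selection concrete, here is a minimal C sketch of a combinational ALU that switches on a function-select code. The opcode values come from the table above; the function name aluOp, the 32-bit width, and the omission of the flag outputs are assumptions for illustration, not the actual hardware.

#include <stdint.h>

/* Minimal sketch of a combinational ALU: the FS code selects the
   operation. Flags and the remaining opcodes are omitted. */
uint32_t aluOp(uint32_t a, uint32_t b, uint8_t fs) {
    switch (fs) {
        case 0x20: return a + b;                    /* 10_0000: A + B   */
        case 0x22: return a - b;                    /* 10_0010: A - B   */
        case 0x24: return a & b;                    /* 10_0100: A AND B */
        case 0x25: return a | b;                    /* 10_0101: A OR B  */
        case 0x26: return a ^ b;                    /* 10_0110: A XOR B */
        case 0x27: return ~(a | b);                 /* 10_0111: A NOR B */
        case 0x00: return a << (b & 31);            /* 00_0000: A SHL B */
        case 0x02: return a >> (b & 31);            /* 00_0010: A SHR B */
        case 0x2a: return (int32_t)a < (int32_t)b;  /* 10_1010: A < B   */
        default:   return 0;                        /* not modeled here */
    }
}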

8 12.8 Sequential Devices (Registers)
Registers capture the D input value when a control input (aka the clock signal) transitions from 0 to 1 (the clock edge) and hold that value at the Q output until the next clock edge.
A register is like a variable in software: it stores a value for later use. We can choose to clock the register only at certain times when we want it to capture a new value (i.e., when it is the destination of an instruction).
Key Idea: registers store data while we operate on those values.
(Figure: block diagram of a register with D input, Q output, and CP clock input. A changing data input d(t) is sampled only on each clock pulse and held until the next pulse; for example, %rax holds the ALU sum of add %rbx,%rax until the result of add %rcx,%rax is clocked in.)
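As a software analogy only (this is not how the hardware is built), the sampling behavior of a positive-edge-triggered register can be sketched in a few lines of C: q takes on d only when the clock input goes from 0 to 1, and holds it otherwise. The names reg_t and reg_tick are made up for this sketch.

typedef struct { int d, q, last_clk; } reg_t;

/* Copy d to q only on a rising clock edge (0 -> 1); at all other
   times q simply holds the previously captured value. */
void reg_tick(reg_t *r, int clk) {
    if (clk && !r->last_clk)   /* rising edge detected */
        r->q = r->d;
    r->last_clk = clk;
}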

9 12.9 Clock Signal
An alternating high/low voltage pulse train that controls the ordering and timing of operations performed in the processor. One cycle is usually measured from rising edge to rising edge.
Clock frequency = number of cycles per second (e.g., 2.8 GHz = 2.8 * 10^9 cycles per second, or about 0.357 ns/cycle).
(Figure: clock waveform alternating between 1 (5V) and 0 (0V), with one operation performed per cycle.)
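A quick check of the frequency-to-period arithmetic (a throwaway sketch; the 2.8 GHz value is just the example above):

#include <stdio.h>

int main(void) {
    double freq_hz   = 2.8e9;             /* 2.8 GHz clock       */
    double period_ns = 1e9 / freq_hz;     /* ~0.357 ns per cycle */
    printf("%.3f ns/cycle\n", period_ns); /* prints 0.357        */
    return 0;
}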

10 12.10 Basic HW organization for a simplified instruction set FROM X86 TO RISC

11 12.11 From CISC to RISC
Complex Instruction Set Computers often have instructions that vary widely in how much work they perform and how much time they take to execute. The benefit is that fewer instructions are needed to accomplish a task.
Reduced Instruction Set Computers favor instructions that take roughly the same time to execute and follow a common sequence of steps. It often requires more instructions to describe the overall task (larger code size); see the example below.
RISC makes the hardware design easier, so let's tweak our x86 instructions to be more RISC-like.

// CISC instruction
movq 0x40(%rdi, %rsi, 4), %rax

// RISC equivalent w/ 1 mem. or ALU op. per instruction
mov %rsi, %rbx    # use %rbx as a temp.
shl $2, %rbx      # %rsi * 4
add %rdi, %rbx    # %rdi + (%rsi*4)
add $0x40, %rbx   # 0x40 + %rdi + (%rsi*4)
mov (%rbx), %rax  # %rax = *%rbx

12 12.12 A RISC Subset of x86
Split mov instructions that access memory into separate instructions:
  ld = Load/Read from memory
  st = Store/Write to memory
Limit ld & st instructions to at most indirect-with-displacement addressing:
  No ld 0x40(%rdi, %rsi, 4), %rax (too much work for one instruction)
  At most ld 0x40(%rdi), %rax or st %rax, 0x40(%rdi)
Limit arithmetic & logic instructions to operate only on registers:
  No add (%rsp), %rax since this implicitly accesses (dereferences) memory
  Only add %reg1, %reg2

// 3 x86 memory read instructions
mov (%rdi), %rax           // 1
mov 0x40(%rdi), %rax       // 2
mov 0x40(%rdi,%rsi), %rax  // 3

// Equiv. load sequences
ld 0x0(%rdi), %rax         // 1
ld 0x40(%rdi), %rax        // 2
mov %rsi, %rbx             // 3a
add %rdi, %rbx             // 3b
ld 0x40(%rbx), %rax        // 3c

// 3 x86 memory write instructions
mov %rax, (%rdi)           // 1
mov %rax, 0x40(%rdi)       // 2
mov %rax, 0x40(%rdi,%rsi)  // 3

// Equiv. store sequences
st %rax, 0x0(%rdi)         // 1
st %rax, 0x40(%rdi)        // 2
mov %rsi, %rbx             // 3a
add %rdi, %rbx             // 3b
st %rax, 0x40(%rbx)        // 3c

// CISC instruction
add %rax, (%rsp)

// Equiv. RISC sequence (w/ ld and st)
ld 0(%rsp), %rbx
add %rax, %rbx
st %rbx, 0(%rsp)

13 12.13 Developing a Processor Organization
Identify which hardware components each instruction type would use, and in what order. Components: PC, I-Cache / I-MEM, Registers (%rax, %rbx, etc., aka the RegFile), ALU (result, zero, condition codes), D-Cache / D-MEM.
ALU-Type (add %rax,%rbx): 1. PC supplies the addr. of the instruction. 2. I-Cache fetches the instruction. 3. Registers supply %rax and %rbx. 4. ALU sums %rax+%rbx. 5. Registers save the result to %rbx.
LD (ld 8(%rax),%rbx): 1. PC supplies the addr. 2. I-Cache fetches the instruction. 3. Registers supply %rax. 4. ALU sums %rax+8. 5. D-Cache reads data at that address. 6. Registers save the data to %rbx.
ST (st %rbx, 8(%rax)): 1. PC supplies the addr. 2. I-Cache fetches the instruction. 3. Registers supply %rax (and the %rbx data). 4. ALU sums %rax+8. 5. D-Cache writes the %rbx data at that address.
JE (je label/displacement): 1. PC supplies the addr. 2. I-Cache fetches the instruction. 3. The condition codes are read. 4. ALU: if cond = true, PC = PC + disp.

14 12.14 Processor Block Diagram
Sections: Fetch, Decode, Exec., Mem, WB.
Datapath: PC -> I-Cache (supplies the instruction / machine code) -> Registers (aka RegFile, supplying operands) -> ALU (output is an address or result, plus condition codes ZF/OF/CF/SF) -> D-Cache (Addr/Data) -> data written back to the destination register. Control signals (e.g., ALU operation, Read/Write D-Cache, etc.) steer each section.
With each section taking about 10 ns, the clock cycle time = sum of the delays through the worst-case pathway = 5 x 10 ns = 50 ns.

15 12.15 Processor Execution (add)
Instruction: add %rax,%rdx [Machine Code: 48 01 c2]
Fetch: the PC addresses the I-Cache and the instruction is fetched. Decode: %rax and %rdx are read from the RegFile as operands and control signals are generated. Exec: the ALU computes %rax+%rdx (setting ZF/OF/CF/SF). Mem: no D-Cache access. WB: the result is written to the destination register, %rdx = %rax+%rdx.

16 12.16 Processor Execution (load)
Instruction: ld 0x40(%rbx),%rax [Machine Code: 48 8b 43 40]
Fetch: the instruction is fetched from the I-Cache. Decode: %rbx and the displacement 0x40 are produced as operands. Exec: the ALU computes the address %rbx+0x40. Mem: the D-Cache is read at that address. WB: the data is written to the destination register, %rax = data.

17 12.17 Processor Execution (store)
Instruction: st %rax,0x40(%rbx)
Fetch: the instruction is fetched from the I-Cache. Decode: %rbx, the displacement 0x40, and the data register %rax are read. Exec: the ALU computes the address %rbx+0x40. Mem: the %rax data is written to the D-Cache at that address. WB: nothing to write back.

18 12.18 Processor Execution (branch/jump)
Instruction: je L1 (disp. = 0x08) [Machine Code: 74 08]
Fetch: the instruction is fetched from the I-Cache. Decode: the displacement 0x08 is produced and the condition codes are read (here ZF=1, so the branch will be taken). Exec: the ALU computes the target PC + 0x08; if the condition is true, the PC is updated to that target. Mem/WB: nothing to do.

19 PIPELINING 12.19

20 Example
for(i=0; i < 100; i++) C[i] = (A[i] + B[i]) / 4;
Without pipelining, each input set takes 10 ns: 10 ns per input set * 100 iterations = 1000 ns total.
(Figure: a counter i indexes arrays A, B, and C in memory, feeding A[i] and B[i] to the add/divide logic.)

21 12.21 Pipelining Example
for(i=0; i < 100; i++) C[i] = (A[i] + B[i]) / 4;
Split the work into two stages: Stage 1 computes the sum, Stage 2 divides by 4.
  Clock 0:  Stage 1: A[0] + B[0]
  Clock 1:  Stage 1: A[1] + B[1]    Stage 2: (A[0] + B[0]) / 4
  Clock 2:  Stage 1: A[2] + B[2]    Stage 2: (A[1] + B[1]) / 4
Pipelining refers to the insertion of registers to split combinational logic into smaller stages that can be overlapped in time (i.e., creating an assembly line).
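Here is a minimal C sketch of that two-stage assembly line: stage 1 adds the current A[i] and B[i] while stage 2 divides the previous iteration's sum by 4, with one extra clock at the end to drain the pipeline. The array contents and the assumption that each stage takes about 5 ns are illustrative only.

#include <stdio.h>
#define N 100

int main(void) {
    int A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

    int stage1 = 0;                              /* pipeline register between stages */
    for (int clk = 0; clk <= N; clk++) {
        if (clk > 0) C[clk - 1] = stage1 / 4;    /* stage 2: divide last cycle's sum */
        if (clk < N) stage1 = A[clk] + B[clk];   /* stage 1: add the current inputs  */
    }
    /* 100 input sets finish after 101 overlapped clocks; at ~5 ns per stage
       that is ~505 ns versus 100 x 10 ns = 1000 ns without pipelining. */
    printf("C[0]=%d C[99]=%d\n", C[0], C[99]);   /* prints C[0]=0 C[99]=74 */
    return 0;
}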

22 12.22 Need for Registers
Registers provide separation between combinational functions. Without registers, fast signals could catch up to data values in the next operation's stage.
Performing an operation yields signals with different paths and delays (e.g., one 5 ns path and one 2 ns path). We don't want signals from two different data values mixing, so we must collect and synchronize the values from the previous operation in a register before passing them on to the next.

23 12.23 Processors & Pipelines
Pipelining overlaps the execution of multiple instructions. There is a natural breakdown into stages: Fetch, Decode, Execute. We can fetch one instruction while decoding another and executing yet another.
Pipelining (instruction view): each instruction still passes through Fetch, then Decode, then Execute, one stage per clock, with a new instruction starting every clock.
Pipelining (stage view):
  Clk 1: Fetch = Inst. 1
  Clk 2: Fetch = Inst. 2, Decode = Inst. 1
  Clk 3: Fetch = Inst. 3, Decode = Inst. 2, Exec. = Inst. 1
  Clk 4: Fetch = Inst. 4, Decode = Inst. 3, Exec. = Inst. 2
  Clk 5: Fetch = Inst. 5, Decode = Inst. 4, Exec. = Inst. 3

24 Balancing Pipeline Stages
The clock period must equal the LONGEST delay from register to register.
  Fig. 1: one 20 ns block of processor logic (Fetch + Decode + Execute) => 50 MHz. Throughput: 1 instruc. / 20 ns.
  Fig. 2: unbalanced stage delays (5 ns, 5 ns, 10 ns) limit the clock speed to the slowest stage (worst case). Throughput: 1 instruc. / 10 ns => 100 MHz.
  Fig. 3: better to split into more, balanced stages (four stages of 5 ns). Throughput: 1 instruc. / 5 ns => 200 MHz.
  Fig. 4: are more stages better? Ideally 2x stages => 2x throughput (eight stages of 2.5 ns). Throughput: 1 instruc. / 2.5 ns => 400 MHz.
Each pipeline register adds extra delay, so at some point deeper pipelines don't pay off.
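The figures' numbers can be reproduced with a short calculation: the clock period is the longest register-to-register (stage) delay, the frequency is its reciprocal, and the steady-state throughput is one instruction per cycle. A sketch that ignores the extra per-register delay mentioned above:

#include <stdio.h>

static void report(const char *name, const double d[], int n) {
    double period = 0.0;                       /* ns */
    for (int i = 0; i < n; i++)
        if (d[i] > period) period = d[i];      /* clock = slowest stage */
    printf("%s: period %.1f ns, %.0f MHz, 1 instruc. / %.1f ns\n",
           name, period, 1000.0 / period, period);
}

int main(void) {
    double fig1[] = {20};                      /* unpipelined          */
    double fig2[] = {5, 5, 10};                /* unbalanced stages    */
    double fig3[] = {5, 5, 5, 5};              /* balanced, 4 stages   */
    double fig4[] = {2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5};
    report("Fig. 1", fig1, 1);                 /* 20 ns  ->  50 MHz    */
    report("Fig. 2", fig2, 3);                 /* 10 ns  -> 100 MHz    */
    report("Fig. 3", fig3, 4);                 /*  5 ns  -> 200 MHz    */
    report("Fig. 4", fig4, 8);                 /* 2.5 ns -> 400 MHz    */
    return 0;
}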

25 Balancing Pipeline Stages
Main points:
  The latency of any single instruction is unaffected (the non-pipelined design, a 4-stage pipeline, and an 8-stage pipeline all have a 20 ns latency here).
  Throughput, and thus overall program performance, can be dramatically improved: the 4-stage pipeline gives 4x throughput and the 8-stage pipeline 8x relative to the non-pipelined design.
  Ideally a K-stage pipeline leads to a throughput increase by a factor of K. In reality, splitting stages adds some delay and thus hits a point of diminishing returns.

26 5-Stage Pipeline
Stages: Fetch, Decode, Exec., Mem, WB.
(Figure: the pipeline registers between stages carry the instruction machine code, the operands & control signals, the ALU output (addr. or result), and the result of the instruction. The datapath contains the PC, I-Cache, Reg. File, ALU, condition codes ZF/OF/CF/SF, and D-Cache.)

27 12.27 Pipelining
Let's see how a sequence of instructions can be executed:
  ld 0x8(%rbx), %rax
  add %rcx,%rdx
  je L1

28 Sample Sequence - 1
Stage occupancy: Fetch (LD), Decode (-), Exec. (-), Mem (-), WB (-).
Action: fetch the LD instruction from the I-Cache.

29 Sample Sequence - 2
Stage occupancy: Fetch (ADD), Decode (LD), Exec. (-), Mem (-), WB (-).
Actions: fetch the ADD instruction; decode ld 0x8(%rbx), %rax and fetch its operands.

30 Sample Sequence - 3
Stage occupancy: Fetch (JE), Decode (ADD), Exec. (LD), Mem (-), WB (-).
Actions: fetch the JE instruction; decode the ADD and fetch its operands; Exec: add the displacement 0x8 to %rbx to form the load address.

31 Sample Sequence - 4
Stage occupancy: Fetch (i+1), Decode (JE), Exec. (ADD), Mem (LD), WB (-).
Actions: fetch instruction i+1; decode the JE and fetch its operands (displacement and condition codes); Exec: add %rcx+%rdx; Mem: read the word from memory at %rbx+0x8.

32 Sample Sequence - 5
Stage occupancy: Fetch (i+2), Decode (i+1), Exec. (JE), Mem (ADD), WB (LD).
Actions: fetch instruction i+2; decode instruction i+1 and fetch its operands; Exec: check whether the JE condition is true and compute PC + displacement; Mem: just pass the ADD sum to the next stage; WB: write the loaded word to %rax.

33 Sample Sequence - 6
Stage occupancy: Fetch (i+3), Decode (i+2), Exec. (i+1), Mem (JE), WB (ADD).
Actions: fetch instruction i+3; decode instruction i+2 and fetch its operands; Exec: instruction i+1 uses the ALU; Mem: the JE updates the PC (the branch is taken); WB: write the sum to %rdx.

34 Sample Sequence - 7
Stage occupancy: Fetch (target), Decode (i+3, deleted), Exec. (i+2, deleted), Mem (i+1, deleted), WB (JE).
Actions: fetch the branch-target instruction; delete (flush) the wrongly fetched i+3, i+2, and i+1; WB: the JE has nothing to write back.

35 12.35 Problems from overlapping instruction execution HAZARDS

36 12.36 Hazards
Hazards prevent parallel or overlapped execution!
Control Hazards. Problem: we don't know what instruction to fetch next, but we need to fetch something. Examples: jumps (branches) and calls.
Data Hazards / Data Dependencies. Problem: a later instruction needs data from a previous instruction. Example:
  sub %rdx,%rax
  add %rax,%rcx
Structural Hazards. Problem: due to limited resources, the HW doesn't support overlapping a certain sequence of instructions. Examples: see the next slides.

37 12.37 Structural Hazards
Example structural hazard: a single cache rather than separate instruction & data caches. A structural hazard occurs any time an instruction needs to perform a data access (i.e., ld or st), since we always want to fetch a new instruction each clock cycle and both accesses would need the one cache in the same cycle.
(Figure: an LD performing its memory access and instruction i+3 being fetched both contend for the single cache: hazard!)

38 Data Hazard - 1
Program order: sub %rdx,%rax followed by add %rax,%rcx.
Stage occupancy: Fetch (i+1), Decode (ADD), Exec. (SUB), Mem (-), WB (-).
Actions: fetch i+1; decode the ADD and read its register operands (do we get the desired %rax value?); Exec: perform %rax-%rdx.

39 Data Hazard - 2
Stage occupancy: Fetch (i+2), Decode (i+1), Exec. (ADD), Mem (SUB), WB (-).
Actions: fetch i+2; decode i+1; Exec: perform %rax+%rcx using the wrong value! The new value of %rax (from the SUB) has not been written back yet.

40 Stalling
Solution 1: halt/stall the ADD instruction in the Decode stage and insert nops into the pipeline until the new value of the needed register is present, at the cost of lower performance.
Stage occupancy: Fetch (i+1), Decode (ADD, stalled), Exec. (nop), Mem (nop), WB (SUB writing the new %rax).

41 Forwarding
Solution 2: create new hardware paths to hand off (forward) the data from the producing instruction in the pipeline to the consuming instruction.
Stage occupancy: Fetch (i+2), Decode (i+1), Exec. (ADD), Mem (SUB), WB (-); the new %rax value is forwarded from the SUB back to the ADD so the ADD uses the correct value without stalling.

42 12.42 Solving Data Hazards
Key Point: data dependencies (i.e., instructions needing values produced by earlier ones) limit performance.
Forwarding solves many of the data hazards (data dependencies) that exist. It allows instructions to continue to flow through the pipeline without the need to stall and waste time. The cost is additional hardware and added complexity.
Even forwarding cannot solve all the issues: a hazard still exists when a LD produces a value needed by the very next instruction (the load-use case).
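To make the rules above concrete, here is a hedged C sketch of the checks a decode stage might perform: forward from the EX/MEM or MEM/WB pipeline register when its destination matches a source of the instruction being decoded, and stall exactly when the producer in EX is a load (the load-use case). The struct fields, register numbering, and function names are invented for this sketch and are not the actual control logic.

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  dest;      /* destination register number, -1 if none              */
    bool is_load;   /* true for ld (value only available after the Mem stage) */
} InFlight;

/* Which path should supply source register src to the instruction
   currently in Decode? (Sketch only.) */
const char *forward_from(InFlight ex, InFlight mem, int src) {
    if (src == ex.dest && !ex.is_load) return "EX/MEM forward";
    if (src == mem.dest)               return "MEM/WB forward";
    return "register file";
}

/* One stall (nop) is needed exactly when a load in EX produces a
   register that the instruction in Decode reads. */
bool needs_stall(InFlight ex, int src1, int src2) {
    return ex.is_load && (ex.dest == src1 || ex.dest == src2);
}

int main(void) {
    /* Arbitrary numbering for the sketch: %rax = 0, %rcx = 2 */
    InFlight alu_rax = { 0, false };   /* e.g. sub %rdx,%rax in EX   */
    InFlight ld_rax  = { 0, true  };   /* e.g. ld 8(%rdx),%rax in EX */
    InFlight none    = { -1, false };

    /* add %rax,%rcx sitting in Decode in each case: */
    printf("%s\n", forward_from(alu_rax, none, 0));   /* EX/MEM forward */
    printf("stall=%d\n", needs_stall(ld_rax, 0, 2));  /* 1: load-use    */
    return 0;
}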

43 LD + Dependent Instruction Hazard
Even forwarding cannot prevent the need to stall when a load instruction produces a value needed by the instruction right behind it; forwarding the loaded value in time would require performing 2 cycles' worth of work in only a single cycle.
Program order: ld 8(%rdx),%rax followed by add %rax,%rcx.
Stage occupancy: Fetch (i+2), Decode (i+1), Exec. (ADD, holding the old %rax), Mem (LD, producing the new %rax), WB (-).

44 LD + Dependent Instruction Hazard
We need to introduce 1 stall cycle (nop) into the pipeline to get the timing correct: Fetch (i+2), Decode (i+1), Exec. (ADD), Mem (nop), WB (LD). The new %rax value can then be forwarded from the LD to the ADD. Keep this in mind as we move through the next slides.

45 12.45 Control Hazards
Branches/jumps require us to know:
  Where we want to jump to (aka the branch/jump target location), which is really just the new value of the PC.
  Whether we should branch or not (checking the jump condition).
Problem: we often don't know those values until deep in the pipeline and thus are not sure what instructions should be fetched in the interim. This requires us to flush unwanted instructions and waste time.

46 Control Hazard - 1
Stage occupancy: Fetch (i+3), Decode (i+2), Exec. (i+1), Mem (JE), WB (ADD).
Actions: fetch instruction i+3; decode instruction i+2 and fetch its operands; Exec: i+1 uses the ALU; Mem: the JE resolves and updates the PC; WB: write the sum to %rdx.

47 Control Hazard - 2
Stage occupancy: Fetch (target), Decode (i+3, deleted), Exec. (i+2, deleted), Mem (i+1, deleted), WB (JE).
Actions: fetch the branch-target instruction; "flush" (delete) the wrongly fetched i+3, i+2, and i+1; WB: nothing to do for the JE.

48 12.48 Enlisting the help of the compiler A FIRST LOOK: CODE REORDERING

49 12.49 Two Sides of the Coin
If the hardware has some problems it just can't solve, can software (i.e., the compiler) help? Yes!! Compilers can re-order instructions to take best advantage of the processor (pipeline) organization: identify the dependencies that will incur stalls and slow performance, such as a load followed by a dependent add, or jump instructions.

// C code
void sum(int* data, int n, int x) {
  for(int i=0; i < n; i++){
    data[i] += x;
  }
}

// Assembly translation
sum:  mov $0x0,%ecx
L1:   cmp %esi,%ecx
      jge L2
      ld  0(%rdi), %eax
      add %edx, %eax
      st  %eax, 0(%rdi)
      add $4, %rdi
      add $1, %ecx
      j   L1
L2:   retq

50 12.50 How Can the Compiler Help
Compilers are written with general parsing and semantic-representation front ends but architecture-specific back ends that generate code optimized for a particular processor.
Q: How could the compiler help improve pipelined performance while still maintaining the external behavior that the high-level code indicates?
A: By finding independent instructions and reordering the code.

// Original code (incurring 1 stall cycle)
sum:  mov $0x0,%ecx
L1:   cmp %esi,%ecx
      jge L2
      ld  0(%rdi), %eax
      (stall/nop)
      add %edx, %eax
      st  %eax, 0(%rdi)
      add $4, %rdi
      add $1, %ecx
      j   L1
L2:   retq

// Updated code (w/ compiler reordering): add $1, %ecx fills the load-use slot
sum:  mov $0x0,%ecx
L1:   cmp %esi,%ecx
      jge L2
      ld  0(%rdi), %eax
      add $1, %ecx
      add %edx, %eax
      st  %eax, 0(%rdi)
      add $4, %rdi
      j   L1
L2:   retq

Could we have moved any other instruction into that slot? No!

51 12.51 Taken or Not Taken: Branch Behavior
When a conditional jump/branch condition is True, we say the branch is Taken; when False, Not Taken.
Currently our pipeline will fetch sequentially and then potentially flush if the branch is taken. Effectively, our pipeline "predicts" that each branch is Not Taken.
In the loop below, the j L1 instruction is always taken and thus will incur wasted clock cycles each time it is executed. Most of the time the jge L2 will be not taken and perform well.

sum:  mov $0x0,%ecx
L1:   cmp %esi,%ecx
      jge L2           # usually Not Taken
      ld  0(%rdi), %eax
      add $1, %ecx
      add %edx, %eax
      st  %eax, 0(%rdi)
      add $4, %rdi
      j   L1           # always Taken
L2:   retq

52 12.52 Branch Delay Slots Problem: After a jump/branch we fetch instructions that we are not sure should be executed Idea: Find an instruction(s) that should ALWAYS be executed (independent of whether branch is taken or not), move those instructions to directly after the branch, and have HW just let them be executed (not flushed) no matter what the branch outcome is Branch delay slot(s) = # of instructions that the HW will always execute (not flush) after a jump/branch instruction

53 12.53 Branch Delay Slot Example
Assume a single-instruction delay slot.

// Before
  ld  0(%rdi), %rcx
  add %rbx, %rax
  cmp %rcx, %rdx
  je  NEXT
  (delay slot instruc.)
  NOT TAKEN CODE
NEXT:
  TAKEN CODE

// After: move an ALWAYS-executed instruction down into the delay slot
// and let it execute no matter what
  ld  0(%rdi), %rcx
  cmp %rcx, %rdx
  je  NEXT
  add %rbx, %rax    # delay slot
  NOT TAKEN CODE
NEXT:
  TAKEN CODE

(Figure: flowchart perspective of the delay slot; the delay-slot instruction executes on both the taken and not-taken paths.)

54 12.54 Implementing Branch Delay Slots
The HW defines the number of branch delay slots (usually a small number, 1 or 2). The compiler is responsible for arranging instructions to fill the delay slots: it must find instructions that the branch does NOT depend on. If no instructions can be rearranged, it can always insert a 'nop' and just waste those cycles.

// Cannot move the ld into the delay slot because the branch (via cmp)
// needs the %rcx value it generates
  ld  0(%rdi), %rcx
  add %rbx, %rax
  cmp %rcx, %rdx
  je  NEXT
  (delay slot instruc.)

// Here the branch depends on both %rcx and %rax, so no instruction can
// be found and the compiler inserts a 'nop' in the delay slot
  ld  0(%rdi), %rcx
  add %rbx, %rax
  cmp %rcx, %rax
  je  NEXT
  (delay slot instruc.)

55 12.55 A Look Ahead: Branch Prediction
Currently our pipeline assumes Not Taken and fetches down the sequential path after a jump/branch.
Could we build a pipeline that could predict Taken? Not yet: the location to jump to (the branch target) is not known until later stages. But suppose we could overcome those problems; would we even know how to predict the outcome of a jump/branch before actually looking at the condition codes deeper in the pipeline?
  We could allow a static prediction per instruction (give a hint with the branch that indicates T or NT).
  We could allow dynamic prediction per instruction (use its runtime history).
Loops: the loop-closing branch has a high probability of being Taken, so the prediction can be static. If statements: may exhibit data-dependent behavior, so the prediction may need to be dynamic.
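One common form of the dynamic prediction mentioned above is a 2-bit saturating counter kept per branch: predict Taken when the counter is 2 or 3, and nudge the counter toward each actual outcome. This is a generic textbook sketch, not the predictor of any particular processor; note how the loop-closing branch below is mispredicted only once.

#include <stdbool.h>
#include <stdio.h>

/* 2-bit saturating counter: 0,1 -> predict Not Taken; 2,3 -> predict Taken */
typedef struct { unsigned ctr; } Predictor;

bool predict(const Predictor *p) { return p->ctr >= 2; }

void update(Predictor *p, bool taken) {
    if (taken  && p->ctr < 3) p->ctr++;   /* strengthen toward Taken     */
    if (!taken && p->ctr > 0) p->ctr--;   /* strengthen toward Not Taken */
}

int main(void) {
    Predictor p = { 2 };                  /* start "weakly Taken" */
    /* A loop-closing branch like "j L1": taken 99 times, then not taken. */
    int correct = 0;
    for (int i = 0; i < 100; i++) {
        bool actual = (i < 99);           /* taken until the final iteration */
        if (predict(&p) == actual) correct++;
        update(&p, actual);
    }
    printf("correct predictions: %d / 100\n", correct);  /* 99 / 100 */
    return 0;
}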

56 Demo 12.56

57 12.57 Summary 1 Pipelining is an effective and important technique to improve the throughput of a processor Overlapping execution creates hazards which lead to stalls or wasted cycles Data, Control, Structural More hardware can be investigated to attempt to mitigate the stalls (e.g. forwarding) The compiler can help reorder code to avoid stalls and perform useful work (e.g. delay slots)
