Computer Hardware Engineering



IS2, spring 25
Lecture 6: Pipelined Processors
Associate Professor, KTH Royal Institute of Technology
Assistant Research Engineer, University of California, Berkeley
Slides version.

Course Structure
[Course structure diagram. Modules: Logic Design; C and Assembly Programming (Module 2); Processor Design; I/O Systems; Memory Hierarchy (Module 5); Parallel Processors and Programs (Module 6). Labs: dicom, nios2time, nios2io, nios2int. Home labs: C, cache.]


Abstractions in Computer Systems
Networked Systems and Systems of Systems
Computer System
Software: Application Software, Operating System
Hardware/Software Interface: Instruction Set Architecture
Digital Hardware Design: Microarchitecture, Logic and Building Blocks, Digital Circuits
Analog Design and Physics: Analog Circuits, Devices and Physics

Agenda
Image by Open Grid Scheduler / Grid Engine - Own work. Licensed under Creative Commons Zero, Public Domain

Acknowledgement: The structure and several of the good examples are derived from the book Digital Design and Computer Architecture (2) by D. M. Harris and S. L. Harris.

Jump Path (Revisited)
[Figure: single-cycle datapath showing PC, Inst, Sign Extend, and ALU, with the signals RegWrite, RegDst, ALUSrc, ALUControl, Branch, Zero, MemWrite, and MemToReg.]

Control Unit (Revisited)
Decoding the instruction: the Main Decoder takes op and produces RegWrite, RegDst, ALUSrc, Branch, MemWrite, MemToReg, Jump, and ALUOp. The ALU Decoder takes ALUOp and funct and produces ALUControl. These are the control signals to the data path.

Performance Analysis (Revisited)
Execution time (in seconds) = # instructions × (clock cycles / instruction) × (seconds / clock cycle)
Number of instructions in a program (# = number of): determined by the programmer or the compiler or both.
Average cycles per instruction (CPI): determined by the microarchitecture implementation.
Seconds per cycle = clock period T_C: determined by the critical path in the logic.
For the single-cycle processor, each instruction takes one clock cycle. That is, CPI = 1.
The main problem with the single-cycle processor design (last lecture) is the long critical path.
Solution: Pipelining
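The execution-time formula above can be tried out with a few numbers. The following is a minimal sketch; the instruction count, clock periods, and the CPI of 1.2 are hypothetical values chosen only for illustration, not figures from the lecture.

```python
def execution_time(instructions, cpi, clock_period_s):
    """Return execution time in seconds.

    execution time = # instructions * (cycles / instruction) * (seconds / cycle)
    """
    return instructions * cpi * clock_period_s


# Hypothetical numbers, chosen only to illustrate the formula.
N_INSTRUCTIONS = 1_000_000

# Single-cycle design: CPI = 1, but a long critical path (assumed 1 ns).
single_cycle = execution_time(N_INSTRUCTIONS, cpi=1.0, clock_period_s=1.0e-9)

# A design with a shorter critical path (assumed 0.25 ns) and a slightly
# worse CPI (assumed 1.2), which is what pipelining aims for.
shorter_path = execution_time(N_INSTRUCTIONS, cpi=1.2, clock_period_s=0.25e-9)

speedup = single_cycle / shorter_path
print(f"single-cycle: {single_cycle * 1e3:.3f} ms")
print(f"shorter path: {shorter_path * 1e3:.3f} ms")
print(f"speedup:      {speedup:.2f}x")
```

The point of the sketch is that a small increase in CPI can be outweighed by a large decrease in the clock period, since the three factors multiply.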

Parallelism and Pipelining (1/6): Definitions
Processing System: A system that takes input and produces outputs.
Token: An input that is processed by the processing system and results in an output.
Latency: The time it takes for the system to process one token.
Throughput: The number of tokens that can be processed per time unit.

Parallelism and Pipelining (2/6): Sequential Processing
Example: Assume we have a Christmas card factory with two machines (M1 and M2).
M1: Prints out the card (takes 6 s)
M2: Puts on a stamp (takes 4 s)
Approach 1. Process tokens sequentially. In this case a token is a card.
The latency is 6 + 4 = 10 s.
The throughput is 1/10 = 0.1 tokens per second, or 6 tokens per minute.

Parallelism and Pipelining (3/6): Parallel Processing (Spatial Parallelism)
Example: Assume we have a Christmas card factory with four machines.
M1: Prints out the card (takes 6 s)
M2: Puts on a stamp (takes 4 s)
M3: Prints out the card (takes 6 s)
M4: Puts on a stamp (takes 4 s)
Approach 2. Process tokens in parallel using more machines.
The latency is 6 + 4 = 10 s.
The throughput is 2 × 1/10 = 0.2 tokens per second, or 12 tokens per minute.

Parallelism and Pipelining (4/6): Pipelining (Temporal Parallelism)
Example: Assume we have a Christmas card factory with two machines.
M1: Prints out the card (takes 6 s)
M2: Puts on a stamp (takes 4 s)
Approach 3. Process tokens by pipelining using only two machines.
The latency is still 6 + 4 = 10 s.
The throughput is 1/6 ≈ 0.1666 tokens per second, or 10 tokens per minute.
The factory starts the production of a new card every 6 seconds.
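The latency and throughput of the three card-factory approaches can be checked with a short calculation. This is a sketch under the assumption that printing takes 6 s and stamping takes 4 s, which is consistent with the stated 6 tokens per minute for sequential processing; the function names are made up for this example.

```python
def sequential(stage_times_s):
    """One token at a time: a new token starts only when the previous one is done."""
    latency = sum(stage_times_s)
    return latency, 60 / latency  # (latency in s, tokens per minute)


def spatial_parallel(stage_times_s, lines):
    """Several identical production lines working side by side."""
    latency = sum(stage_times_s)
    return latency, lines * 60 / latency


def pipelined(stage_times_s):
    """A new token starts as soon as the slowest machine is free again."""
    latency = sum(stage_times_s)
    return latency, 60 / max(stage_times_s)


CARD_FACTORY = [6, 4]  # assumed: print 6 s, stamp 4 s

results = {
    "sequential": sequential(CARD_FACTORY),
    "parallel": spatial_parallel(CARD_FACTORY, lines=2),
    "pipelined": pipelined(CARD_FACTORY),
}
for name, (latency, per_minute) in results.items():
    print(f"{name:<10} latency {latency} s, throughput {per_minute:g} tokens/min")
```

Note that the pipelined throughput depends only on the slowest machine, while the latency is the same in all three cases.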

Parallelism and Pipelining (5/6): Summary
Approach 1. Process tokens sequentially using two machines. Latency: 10 s. Throughput: 6 tokens/min.
Approach 2. Process tokens in parallel using four machines. Latency: 10 s. Throughput: 12 tokens/min.
Approach 3. Process tokens by pipelining using only two machines. Latency: 10 s. Throughput: 10 tokens/min.
We improve throughput, but not latency.
Spatial parallelism adds extra machines, but pipelining does not.
Throughput improvements are limited by the slowest machine (in this case M1).

Parallelism and Pipelining (6/6): Performance Analysis for Pipelining
Idea: We introduce a pipeline in the processor. How does this affect the execution time?
Execution time (in seconds) = # instructions × (clock cycles / instruction) × (seconds / clock cycle)
Pipelining does not change the number of instructions.
Pipelining will not improve the CPI (it actually makes it slightly worse).
Pipelining will improve the cycle period (it makes the critical path shorter).


Towards a Pipelined (1/8)
Recall the single-cycle data path (the logic for the j and beq instructions is hidden).
[Figure: single-cycle datapath.]

Towards a Pipelined (2/8): Fetch Stage
A register splits the datapath into stages, forming a pipeline.
First, we introduce an instruction fetch stage.
[Figure: datapath with the Fetch (F) stage marked.]

Towards a Pipelined (3/8): Decode Stage
A decode stage decodes an instruction and reads out values from the register file.
[Figure: datapath with stages Fetch (F) and Decode (D).]

Towards a Pipelined (4/8): Execute Stage
An execute stage performs the computation using the ALU.
[Figure: datapath with stages Fetch (F), Decode (D), and Execute (E).]

Towards a Pipelined (5/8): Memory Stage
Reading and writing to memory is done in the memory stage.
[Figure: datapath with stages Fetch (F), Decode (D), Execute (E), and Memory (M).]

Towards a Pipelined (6/8): Writeback Stage
The results are written back to the register file in the writeback stage.
Can you see a problem with the writeback?
[Figure: datapath with stages Fetch (F), Decode (D), Execute (E), Memory (M), and Writeback (W).]


Towards a Pipelined (7/8): Writeback Stage
Note that the register file is read in the decode stage, but written to in the writeback stage.
The address must be forwarded to the correct stage!
[Figure: datapath with stages Fetch (F), Decode (D), Execute (E), Memory (M), and Writeback (W).]

Towards a Pipelined (8/8): Another Issue
Can you see another issue?
The program counter can be updated in the wrong stage (PC increment by 4 or when branching). Solution not shown in the slides.
[Figure: datapath with stages Fetch (F), Decode (D), Execute (E), Memory (M), and Writeback (W).]

Five-Stage Pipeline
In each cycle, a new instruction is fetched, but it takes 5 cycles to complete the instruction.
In each cycle, all stages are handling different instructions in parallel.

    add  $s, $s, $s
    sub  $t, $t, $t2
    addi $t, $0, 55
    xori $t, $t5,
    and  $t6, $s, $s

Example. In cycle 6, the result of the sub instruction is written back to register $t.
We can fill the pipeline because there are no dependencies between instructions.
Exercise: What is the ALU doing in cycle 5?
Answer: Adding together values 0 and 55.
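The cycle-by-cycle behaviour of an ideal five-stage pipeline can be reproduced with a few lines of code. This is a minimal sketch, assuming one instruction is fetched per cycle and no hazards occur; the stage letters F, D, E, M, W follow the slides, and the helper names are made up for this example.

```python
STAGES = ["F", "D", "E", "M", "W"]


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by instruction instr_index (0-based) in a cycle (1-based).

    Returns None if the instruction is not in the pipeline in that cycle.
    """
    position = cycle - 1 - instr_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


def total_cycles(n_instructions):
    """Cycles until the last instruction has left the writeback stage."""
    return n_instructions + len(STAGES) - 1


def render(names):
    """Return a text pipeline diagram, one row per instruction."""
    cycles = range(1, total_cycles(len(names)) + 1)
    lines = ["cycle " + " ".join(f"{c:>2}" for c in cycles)]
    for i, name in enumerate(names):
        cells = " ".join(f"{stage_in_cycle(i, c) or '.':>2}" for c in cycles)
        lines.append(f"{name:<5} {cells}")
    return "\n".join(lines)


PROGRAM = ["add", "sub", "addi", "xori", "and"]
print(render(PROGRAM))
```

In this model the second instruction (sub) is in W in cycle 6 and the third instruction (addi) is in E in cycle 5, which matches the example and the exercise.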

Hazards (1/4): Read after Write (RAW)

    add  $s, $s, $s2
    sub  $t, $s, $t
    and  $t2, $t, $s
    xori $t, $s, 2

The add instruction writes back the value of $s in cycle 5, but $s is used by sub in the decode phase in cycle 3.
A data hazard occurs when an instruction reads a register that has not yet been written to. This kind of data hazard is called a read after write (RAW) hazard.
The and instruction will also use the wrong value for $s.
Exercise: For MIPS, will the xori instruction result in a hazard? Stand for yes, sleep for no.
Answer: No. xori is OK for MIPS, because the register file is written in the first part of the cycle (falling edge) and read in the second part (rising edge).

Hazards (2/4): Solution 1: Forwarding

    add  $s, $s, $s
    sub  $t, $s, $t
    and  $t2, $t, $s
    xori $t, $s, 2

The result from the execute stage for add can be forwarded (also called bypassing) to the execute stage for sub.
The hazard for the and instruction is solved by forwarding as well.
Hazard detection is implemented using a hazard detection unit that gives control signals to the datapath if data should be forwarded.
Can all data hazards be solved using forwarding?
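The RAW check described above can be sketched in software. This is an illustrative model, not the hardware hazard detection unit: it assumes a five-stage pipeline in which the register file is written in the first half of a cycle and read in the second half, so only the two instructions directly after a producer can read a stale value. The register numbers in the example program are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    name: str
    dest: Optional[str]    # register written, or None
    srcs: Tuple[str, ...]  # registers read


def raw_hazards(program):
    """List (producer, consumer, register, distance) for each RAW hazard.

    A consumer at distance 1 or 2 behind the producer decodes before the
    producer has written back. At distance 3 the write and the read share a
    cycle, which is safe under the write-first, read-second assumption.
    """
    hazards = []
    for j, consumer in enumerate(program):
        for src in consumer.srcs:
            # Only the nearest earlier writer of src matters.
            for i in range(j - 1, max(j - 3, -1), -1):
                if program[i].dest == src:
                    hazards.append((program[i].name, consumer.name, src, j - i))
                    break
    return hazards


# Hypothetical register numbers; the dependency pattern follows the slide.
PROGRAM = [
    Instr("add", "$s0", ("$s1", "$s2")),
    Instr("sub", "$t0", ("$s0", "$t1")),
    Instr("and", "$t2", ("$t3", "$s0")),
    Instr("xori", "$t4", ("$s0",)),
]

for producer, consumer, reg, distance in raw_hazards(PROGRAM):
    print(f"{consumer} reads {reg} {distance} instruction(s) after {producer} writes it")
```

For this program the sketch reports hazards for sub and and, but not for xori, in line with the exercise. With forwarding, both reported hazards can be resolved without stalling because the producer is an ALU instruction.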

Hazards (3/4): Solution 1: Forwarding (partially)

    lw   $s, 2($s2)
    sub  $t, $s, $t
    and  $t2, $t, $s
    xori $t, $s, 2

Exercise: Which of the instructions sub, and, and xori have data hazards? Which can be solved using forwarding?
Answer: Hazards: sub and and. Can use forwarding: and.
The hazard for the sub instruction cannot be solved using forwarding, because the result of the memory access is available at the end of cycle 4, but is needed at the beginning of cycle 4.
For the and instruction, the memory result can be forwarded after the memory stage to the execute stage.
xori can read the data from the writeback stage (the register file is written in the first part of the cycle and read in the second part).

Hazards (4/4): Solution 2: Stalling

    lw   $s, 2($s2)
    sub  $t, $s, $t
    and  $t2, $t, $s
    xori $t, $s, 2

Solution when forwarding does not work: stalling.
We need to stall the pipeline. Stages are repeated (sub stays in decode and and stays in fetch for one extra cycle) and the fetch of xori is delayed.
After stalling, the result can be forwarded to the execute stage.
Stalling results in more than one cycle per instruction.
The unused stage is called a bubble.
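The cost of stalling can be estimated by counting load-use pairs. This is a rough sketch, assuming a five-stage pipeline with full forwarding where the only remaining stall is one bubble when an instruction reads the destination of the lw directly before it. The register numbers are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    name: str
    dest: Optional[str]    # register written, or None
    srcs: Tuple[str, ...]  # registers read


def load_use_stalls(program):
    """Count bubbles: one per instruction that uses the result of the lw just before it."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        if previous.name == "lw" and previous.dest in current.srcs:
            stalls += 1
    return stalls


def total_cycles(program, n_stages=5):
    """Cycles to run the program: fill the pipeline, then one per instruction, plus bubbles."""
    return len(program) + (n_stages - 1) + load_use_stalls(program)


# Hypothetical register numbers; the dependency pattern follows the slide.
PROGRAM = [
    Instr("lw", "$s0", ("$s2",)),
    Instr("sub", "$t0", ("$s0", "$t1")),
    Instr("and", "$t2", ("$t3", "$s0")),
    Instr("xori", "$t4", ("$s0",)),
]

stalls = load_use_stalls(PROGRAM)
cycles = total_cycles(PROGRAM)
print(f"bubbles: {stalls}, total cycles: {cycles}")
```

For the four-instruction example this gives one bubble and nine cycles instead of eight, which is the "more than one cycle per instruction" effect in miniature.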

(1/5) Assume Branch Not Taken

    20: beq  $s, $s2,
    24: sub  $t, $s, $t
    28: and  $t2, $t, $s
    2C: xori $t, $s,
      : addi $t, $s,

The branch target address is computed and the operands are compared for equality in the execute (E) stage.
If the branch is taken, the PC is updated in the memory (M) stage.
If the branch is taken, we need to flush the pipeline.
We have a branch misprediction penalty of 3 cycles. Can we improve this?

(2/5) Improving the Pipeline
Right now, branch comparison is done in the execute stage.
Add an equality comparison for beq in the decode phase (not shown here).
Move the branch address calculation to the decode stage.
[Figure: pipelined datapath with stages Fetch (F), Decode (D), Execute (E), Memory (M), and Writeback (W).]

(3/5) Assume Branch Not Taken

    20: beq  $s, $s2,
    24: sub  $t, $s, $t
    28: and  $t2, $t, $s
    2C: xori $t, $s,
      : addi $t, $s,

The decode phase can change the next PC, so that the instruction at the branch taken address is fetched.
Branch misprediction penalty is now reduced to 1 cycle.
Note that we may now introduce another data hazard (if operands are not available in the decode stage). This can be solved with forwarding or stalling.

(4/5) Deeper Pipelines
Why do we sometimes want more stages than 5?
The critical path can be shorter with less logic in the slowest stage.
The processor can have a higher clock frequency.
For instance, Intel's Core 2 Duo has more than 10 pipeline stages.
Why not always have more pipeline stages?
It adds hardware (registers).
The branch misprediction penalty increases!
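The effect of the misprediction penalty on the average CPI can be estimated with a simple model. This is a sketch under stated assumptions: the base CPI is 1, branches are predicted not taken so only taken branches pay the penalty, the penalty is 3 cycles when the branch is resolved late and 1 cycle when it is resolved in decode, and the branch statistics (20% branches, half of them taken) are hypothetical.

```python
def cpi_with_branches(branch_fraction, taken_fraction, penalty_cycles, base_cpi=1.0):
    """Average CPI when every taken branch costs penalty_cycles extra cycles."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


# Hypothetical workload: 20% of instructions are branches, half are taken.
BRANCH_FRACTION = 0.2
TAKEN_FRACTION = 0.5

cpi_late = cpi_with_branches(BRANCH_FRACTION, TAKEN_FRACTION, penalty_cycles=3)
cpi_early = cpi_with_branches(BRANCH_FRACTION, TAKEN_FRACTION, penalty_cycles=1)
print(f"CPI with 3-cycle penalty: {cpi_late:.2f}")
print(f"CPI with 1-cycle penalty: {cpi_early:.2f}")
```

The same model shows why deeper pipelines are not free: a larger penalty_cycles raises the CPI even though the clock period shrinks.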


Deeper Pipelines
How can we handle deep pipelines and minimize misprediction?

Static Branch Predictors
Statically (at compile time) determine if a branch is taken or not. For instance, predict branch not taken.

Dynamic Branch Predictors
Dynamically (at runtime) predict if a branch will be taken or not.
Operates in the fetch stage.
Maintains a table, called the branch target buffer, that contains hundreds or thousands of executed branch instructions, their destinations, and information on whether the branches were taken or not.
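A dynamic predictor of the kind described above can be modelled with a small table. This is a simplified sketch, not the design used in any particular processor: it assumes an unbounded table, 4-byte instructions, and a one-bit history per branch (predict the same outcome as last time). The class and function names, and the addresses in the example, are made up for illustration.

```python
class BranchTargetBuffer:
    """Maps a branch address to its destination and its most recent outcome."""

    def __init__(self):
        self.table = {}  # branch PC -> (target PC, taken last time?)

    def predict(self, pc):
        """Return the predicted next PC for the branch at pc."""
        target, taken_last_time = self.table.get(pc, (None, False))
        return target if taken_last_time else pc + 4

    def update(self, pc, target, taken):
        """Record the actual outcome once the branch has been resolved."""
        self.table[pc] = (target, taken)


def count_mispredictions(btb, pc, target, outcomes):
    """Run one branch through a sequence of actual outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        actual_next_pc = target if taken else pc + 4
        if btb.predict(pc) != actual_next_pc:
            misses += 1
        btb.update(pc, target, taken)
    return misses


# Hypothetical loop branch: taken 9 times, then falls through once.
LOOP_OUTCOMES = [True] * 9 + [False]
dynamic_misses = count_mispredictions(BranchTargetBuffer(), 0x20, 0x10, LOOP_OUTCOMES)
static_not_taken_misses = sum(LOOP_OUTCOMES)  # static "not taken" misses every taken branch
print(f"dynamic: {dynamic_misses} misses, static not-taken: {static_not_taken_misses} misses")
```

On this loop the one-bit scheme mispredicts only the first and the last iteration, whereas the static "predict not taken" rule mispredicts every taken iteration.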


ARMv7
The most popular ISA for embedded devices (9 billion devices in 2011, growth 2 billion a year).
More complex addressing modes than MIPS (can do shift and add of addresses in registers in one instruction).
Condition results are saved in special flags: negative, zero, carry, overflow.
16 registers, each 32-bit (integers).
Instruction size 32-bit (Thumb mode: 16-bit encoding).
Conditional execution of instructions, depending on condition code.
Example: ARM Cortex-A8, a processor at 1 GHz, 13-stage pipeline, with branch predictor.

x86
Standard in laptops, PCs, and in the cloud.
CISC instructions are more powerful than those of ARM and MIPS, but require more complex hardware.
The architecture has evolved over the last 35 years. There are 16-, 32-, and 64-bit variants.
8 general purpose registers (eax, ebx, ecx, edx, esp, ebp, esi, edi).
Variable length of instruction encoding (between 1 and 15 bytes).
Arithmetic operations allow the destination operand to be in memory.
Major manufacturers are Intel and AMD.


Summary
Some key takeaway points:
Pipelining is a temporal way of achieving parallelism.
Pipelined processors improve performance by reducing the clock period (shorter critical path).
Pipelining introduces pipeline hazards. There are two main kinds of hazards: data hazards and control hazards.
Data hazards are solved by forwarding or stalling.
Control hazards are solved by flushing the pipeline and improved by branch prediction.

Thanks for listening!
