Vorlesung / Course IN2075: Mikroprozessoren / Microprocessors



1 Vorlesung / Course IN2075: Mikroprozessoren / Microprocessors Pipelining 20 Nov 2017 Carsten Trinitis (LRR)

2 µP Techniques: Pipelining etc.

3 Processor Organisation
Tasks of a processor
  Input: stream of instructions
    Instructions taken from the ISA
  Execute instructions
  Modify the stream in case of control instructions
Continuous loop: the instruction cycle

4 Simple Instruction Cycle
Fetch Instruction (FI): read the instruction from main memory
Decode Instruction (DI): decode it to determine the requested action
Fetch Data (FD): get the data required for the requested action
Process Data (PD): perform the requested data processing
Write Data (WD): store the result of the processed data

5 Comment
The instruction cycle shown is simplified; some concepts are missing
Main point: interrupts / exceptions
Interrupts
  External, asynchronous interruption of the processor
  Processor stops executing the current instruction stream
  Starts the interrupt routine
  Asynchronous change in control flow
The exact number and tasks of the steps vary (see also later)

6 Running a Processor
A true stream of instructions (let's assume no branches for now)
[Diagram: three instructions executed back to back along the time axis, each passing through FI DI FD PD WD, one step per cycle]
Each step is carried out by a different unit
During one cycle only one unit is active; the others are idle

7 Optimisation
[Diagram: seven instructions along the time axis, each passing through the pipeline stages FI DI FD PD WD and each starting one cycle after the previous one]
Make use of otherwise unused units
The concept is called pipelining
Think of assembly lines
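The staggered diagram can be regenerated with a few lines of Python. This is only a sketch: the function name `pipeline_diagram` is made up for illustration, and it assumes an ideal pipeline in which every instruction enters exactly one cycle after its predecessor.

```python
STAGES = ["FI", "DI", "FD", "PD", "WD"]

def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one text row per instruction; instruction i enters FI in cycle i."""
    rows = []
    for i in range(num_instructions):
        # Every earlier instruction shifts this one to the right by one cycle.
        cells = ["  "] * i + list(stages)
        rows.append(" ".join(cells))
    return rows

for row in pipeline_diagram(3):
    print(row)
```

Reading the output column by column shows which units are busy in each cycle.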

8 Basic Principle
Task: D operation steps have to be executed on each of N items.
Sequential approach: D · N time steps required
Pipelining approach (overlapped execution):
  D workers: each of the D steps is performed by a different worker
  approx. N time steps required (N + D - 1 when filling the pipeline is counted)
Prerequisites (simplification):
  Each step needs the same amount of time
  Each item needs the same types of operation (at least in the beginning)

9 Instruction Pipelining
Problems:
  Not all phases are of equal length
  Some instructions need additional phases
Hence, pipelining is very well suited for RISC systems
  Fast instructions with little variance
However, CISC systems also use pipelining
Saving in execution time (in the optimal case)
  Example: 5 stages / 10 instructions
  Without pipelining: 5 * 10 = 50 cycles
  With pipelining: 10 + 4 = 14 cycles
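The cycle counts of the example follow from two simple formulas. The sketch below assumes the optimal case only (one cycle per stage, no conflicts); the function names are chosen for illustration.

```python
def cycles_sequential(stages, instructions):
    """Each instruction runs through all stages before the next one starts."""
    return stages * instructions

def cycles_pipelined(stages, instructions):
    """One instruction finishes per cycle once the pipeline has been filled."""
    if instructions == 0:
        return 0
    return instructions + stages - 1

def speedup(stages, instructions):
    return cycles_sequential(stages, instructions) / cycles_pipelined(stages, instructions)

print(cycles_sequential(5, 10))  # 50
print(cycles_pipelined(5, 10))   # 14
```

For a long instruction stream the speedup approaches the number of stages, but never reaches it.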

10 Comparison
Pipelining (overlapped execution)
  D workers: each of the D steps is performed by a different worker
  approx. N time steps required
Parallel execution:
  M workers: each worker performs all D steps on N/M items
  (N · D) / M time steps required
Parallel execution and pipelining can be (and usually are) combined.
[Figure labels: N, M]

11 Pipelining in Computer Science
Basic operations:
  Arithmetic (e.g. FMUL, FDIV)
  Instruction/data load
Instruction stream
  Fetch, Decode, Execute, Writeback, ...
Vector operations
  c = a + αb: load, mult, load, add, store
  Example: Cray-1, vector computer

12 Pipelining in Computer Science
Communicating processes
  Parallel threads on different cores, CPUs, machines
  Output of one thread serves as input for the next thread
  Example: IBM Cell
Special-purpose devices
  Digital signalling systems
  Network switches (routers)
  Graphics accelerators: rendering pipeline
Our focus: pipelining on basic operations and the instruction stream.

13 [Figure: combinatorial logic between two registers]
CL: combinatorial logic
R: register
Max. clock frequency depends on the longest path in CL

14 CL is divided into CL1 and CL2
Longest path in each CL is now shorter (half as long?)
Clock frequency can be increased (doubled?)

15 Effects of Pipelining
Pipeline depth: D
Throughput: increased by a factor of D (at best, usually less)
Latency: unchanged (at best, usually worse)
Performance depends on both throughput and latency!
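Why the factor D is only an upper bound can be seen from a simple timing model. The sketch below is an assumption-laden illustration, not part of the lecture: it assumes the logic splits into equally long parts and that every pipeline register adds a fixed delay; the 10 ns and 1 ns values are made up.

```python
def stage_delay_s(logic_delay_s, register_delay_s, depth):
    """Delay of one stage if the logic is split into `depth` equal parts."""
    return logic_delay_s / depth + register_delay_s

def max_clock_hz(logic_delay_s, register_delay_s, depth):
    """The clock period cannot be shorter than the delay of one stage."""
    return 1.0 / stage_delay_s(logic_delay_s, register_delay_s, depth)

def latency_s(logic_delay_s, register_delay_s, depth):
    """One item has to pass through all `depth` stages."""
    return depth * stage_delay_s(logic_delay_s, register_delay_s, depth)

LOGIC = 10e-9     # hypothetical: 10 ns of combinatorial logic
REGISTER = 1e-9   # hypothetical: 1 ns register overhead per stage

gain = max_clock_hz(LOGIC, REGISTER, 2) / max_clock_hz(LOGIC, REGISTER, 1)
print(f"clock gain for depth 2: {gain:.2f}")
```

With these numbers the clock gain stays below 2, and the latency grows from 11 ns to 12 ns.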

16 Increasing the Pipeline
Growing benefit with growing pipeline length
  System with more units
  More units work on active instructions
Approach: split phases further
  Add more pipeline stages
  Example: Fetch Data is split into Calculate Operands and Fetch Operands
  Similar things are possible with other stages

17 Instruction Stream Pipelining

18 Different Types of Instructions
Most instructions need 1 cycle for execution (ADD, AND, MUL, ...)
Exceptions:
  FPU instructions, e.g. Pentium Pro (cycles): FADD 3, FMUL 5, FDIV 18, FSQRT 29
  Load: at least two cycles (generate address, read data)
    Cache miss: up to thousands!
  Store:
    Stores are put into a queue
    Memory access is carried out by a dedicated unit

19 Problems with Pipelining
So far: optimal pipelining
  Each instruction is independent
  Units can operate concurrently
In reality: three types of conflicts can occur
  Data conflicts
    One instruction needs data from a previous one
  Resource conflicts
    Two instructions need the same resources at the same time
    Occur in more complex hardware
  Control conflicts
    Wrong instructions are executed in the pipeline

20 Data Conflicts
Again three types:
  Read after write (RAW)
  Write after read (WAR)
  Write after write (WAW)

21 Data Conflicts (RAW)
One instruction works on the data written by the previous instruction
Example:
  ADD R2,R1,R2 // R2=R1+R2
  MUL R3,R2,R3 // R3=R2*R3
Without pipelining, no problem
[Diagram: ADD R2,R1,R2 runs through FI DI FD PD WD and writes register 2 before MUL R3,R2,R3 starts its own FI DI FD PD WD]

22 Data Conflicts (RAW)
Scenario changes with pipelining
[Diagram: MUL R3,R2,R3 starts one cycle after ADD R2,R1,R2, so MUL fetches register 2 (FD) before ADD has written it (WD)]
Produces a wrong result
  Based on the old value in R2
  Result of a data dependency

23 Data Conflicts (WAR)
Example:
  1  MOV AND DEC r0 <- [r1], r2   // (post-decrement of r2)
  2  ADD r2 <- r5, r6
Instruction 1 reads r2 in a late pipeline stage (after the move).
Instruction 2 changes r2 in an early pipeline stage.
Instruction 1 then uses the new value of r2 instead of the old one.

24 Data Conflicts (WAW)
Example:
  1  FSQRT r0 <- r1
  2  ADD r0 <- r1, r2
Execution of FSQRT takes much longer than ADD.
Hence, the square root ends up in r0 instead of the sum!
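The three conflict types can be told apart purely from the registers two instructions read and write. The following sketch assumes a simplified instruction model (one destination register, a set of source registers); the function name and the tuple layout are made up for illustration. It only reports register overlaps; whether an overlap actually produces a wrong result depends on the pipeline timing.

```python
def data_conflicts(first, second):
    """Classify the register dependencies between two instructions.

    Each instruction is a pair (destination_register, set_of_source_registers);
    `first` comes before `second` in program order.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    found = set()
    if dest1 in sources2:
        found.add("RAW")  # second reads what first writes
    if dest2 in sources1:
        found.add("WAR")  # second overwrites what first still has to read
    if dest1 == dest2:
        found.add("WAW")  # both write the same register
    return found

# ADD R2,R1,R2 followed by MUL R3,R2,R3
print(data_conflicts(("R2", {"R1", "R2"}), ("R3", {"R2", "R3"})))
# The move (simplified: writes r0, reads r1 and r2) followed by ADD r2 <- r5, r6
print(data_conflicts(("r0", {"r1", "r2"}), ("r2", {"r5", "r6"})))
# FSQRT r0 <- r1 followed by ADD r0 <- r1, r2
print(data_conflicts(("r0", {"r1"}), ("r0", {"r1", "r2"})))
```

The model ignores that the post-decrement also writes r2; a fuller version would allow several destination registers per instruction.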

25 Solving Data Conflicts
Software solutions
  Detection of data conflicts by the compiler
  Appropriate code structuring to avoid conflicts
Hardware solutions (automatic reordering)
  Hardware detects dependencies dynamically
  Treatment of conflicts: pipeline interlocking, bypassing
The latter solution is more complex, but
  No code change required (old codes still run)
  No complex compiler work

26 User/Compiler-based Re-ordering
Example (re-ordering):
  Before:               After:
  ADD r0 <- r1, r2      ADD r0 <- r1, r2
  SUB r3 <- r0, r5      AND r6 <- r7, r8
  AND r6 <- r7, r8      SUB r3 <- r0, r5
Insert NOPs
Example:
  Before:               After:
  ADD r0 <- r1, r2      ADD r0 <- r1, r2
  SUB r3 <- r0, r5      NOP
                        SUB r3 <- r0, r5
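A compiler pass that inserts NOPs could look roughly like the sketch below. It assumes that a consumer must be separated from its producer by `gap` other instructions (one in the example above); `gap`, `insert_nops` and the tuple layout `(text, destination, sources)` are made up for illustration.

```python
NOP = ("NOP", None, ())

def insert_nops(program, gap=1):
    """Insert NOPs so that `gap` instructions separate a producer from its consumer."""
    if gap <= 0:
        return list(program)
    out = []
    for instr in program:
        sources = instr[2]
        needed = 0
        # Look back over the last `gap` slots for a producer of one of our sources.
        for distance, earlier in enumerate(reversed(out[-gap:]), start=1):
            if earlier[1] is not None and earlier[1] in sources:
                needed = max(needed, gap - distance + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out

ADD = ("ADD r0 <- r1, r2", "r0", ("r1", "r2"))
SUB = ("SUB r3 <- r0, r5", "r3", ("r0", "r5"))
AND = ("AND r6 <- r7, r8", "r6", ("r7", "r8"))

for text, _, _ in insert_nops([ADD, SUB]):
    print(text)
```

Running the same function on the re-ordered sequence ADD, AND, SUB leaves it unchanged, which is why re-ordering is preferable whenever an independent instruction is available.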

27 Pipeline Interlocking
Detect conflict (hardware comparator for operand fields)
Execution of early pipeline stages is stopped (pipeline stall)
Wait until the late stages have been executed
Continue operation
In effect, NOPs are inserted automatically

28 Pipeline Interlocking

29 Pipeline Interlocking

30 Bypassing
Detect conflict (hardware comparator for operand fields)
Directly pass the output of a late pipeline stage to an earlier stage
+ Pipeline can run at full speed!
Usually implemented for all instructions requiring only one execution phase

31 Bypassing

32 Bypassing
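The difference between interlocking and bypassing can be estimated with a small timing model of the five-stage pipeline (FI DI FD PD WD). The sketch rests on assumptions that are not stated on the slides: registers are read in FD and written at the end of WD, every instruction spends one cycle in PD, and the bypass feeds the PD output straight back into the PD input. All names are illustrative.

```python
FD, PD, WD = 2, 3, 4  # stage offsets relative to the cycle in which FI happens

def total_cycles(program, bypassing):
    """program: list of (destination, sources). Returns cycles until the last WD."""
    ready = {}        # register -> first cycle in which a consumer may be in FD
    next_start = 0    # earliest cycle in which the next instruction may enter FI
    last_start = -1
    for dest, sources in program:
        start = next_start
        for reg in sources:
            if reg in ready:
                # Stall until the consumer's FD no longer comes too early.
                start = max(start, ready[reg] - FD)
        # With bypassing the value is usable right after the producer's PD,
        # otherwise only after its WD has completed.
        ready[dest] = start + PD if bypassing else start + WD + 1
        next_start = start + 1
        last_start = start
    return last_start + WD + 1 if program else 0

program = [("R2", ("R1", "R2")),   # ADD R2,R1,R2
           ("R3", ("R2", "R3"))]   # MUL R3,R2,R3

print(total_cycles(program, bypassing=False))  # 8: two stall cycles
print(total_cycles(program, bypassing=True))   # 6: same as the ideal pipeline
```

Under these assumptions interlocking costs two stall cycles for the RAW example, while bypassing lets the pipeline run at full speed.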

33 Structural Conflicts
Definition: two pipeline stages want to use the same circuit at the same time.
Example:
  LOAD r0 <- [r1]
  ADD r5 <- r2, r3
If LOAD requires one more cycle than ADD, both operations want to write back within the same cycle.

34 Structural Conflicts
Solutions:
  By programmer / compiler
  Pipeline interlocking
  Resource multiplication (e.g. virtual registers, see superscalarity)

35 Control Conflicts
Pipelining requires a known instruction stream
  Instructions need to be started before the previous one has finished
Problem: control flow instructions
  E.g. conditional branches
  The target, i.e. the next instruction, is not known until after the PD phase
  Until then, many other instructions may already have been started and partially executed

36 Control Conflicts
[Diagram: JNZ target passes through FI DI FD PD WD; the instructions started behind it are marked "May be useless"; the new instruction stream starts only after the branch has been executed]
After the branch has been executed
  Pipeline must be restarted
  Intermediate instructions must be aborted
  Their results must be dropped

37 Control Conflicts
Problem:
  Execution of a branch/jump: change the program counter
  Assumption: this is done in the n-th pipeline stage
Consequence:
  n - 1 instructions after the jump are already in the pipeline
  n - 1 instructions must be cancelled (pipeline flush)
  Every fourth to sixth instruction is a jump!
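The cost of these flushes can be estimated as an average number of cycles per instruction (CPI). The sketch assumes an otherwise ideal pipeline (one instruction per cycle) and that every jump flushes n - 1 instructions; the choice n = 4 in the example is hypothetical.

```python
def effective_cpi(jump_fraction, resolve_stage):
    """Average cycles per instruction if every jump costs a full flush.

    jump_fraction: share of instructions that are jumps (e.g. 1/5)
    resolve_stage: pipeline stage n in which the program counter is changed
    """
    flush_penalty = resolve_stage - 1   # cancelled instructions per jump
    return 1.0 + jump_fraction * flush_penalty

# Every fifth instruction is a jump, resolved in stage n = 4 (hypothetical).
print(effective_cpi(1 / 5, 4))
```

Even in this small example the pipeline loses a large part of its throughput, which motivates the solutions on the next slide.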

38 Control Conflicts: Solutions
Reduce n: execute the jump early in the pipeline
Use delay slots
Branch prediction
Avoid branches

39 Avoiding Branches by Predication
Example: ARM architecture
  Every instruction has a 4-bit condition code
    Equal, not equal, unsigned higher or same, unsigned lower, ..., always
  Conditional execution: an instruction is only executed if its condition holds true
Avoids many jumps resulting from if-statements
An instruction requires time even if its condition is not true
Cannot be used for:
  long if-blocks
  avoiding jumps belonging to loops

40 ARM: Condition Flags & Codes
Consider a simple fragment of C code:
  for (i = 10; i != 0; i--) {
      do_something();
  }
A standard compiler would yield:
      mov r4, #10
  loop_label:
      bl do_something
      sub r4, r4, #1
      cmp r4, #0
      bne loop_label

41 ARM: Condition Flags & Codes
On an ARM architecture:
      mov r4, #10
  loop_label:
      bl do_something
      subs r4, r4, #1
      bne loop_label
The s suffix causes the instruction (in this case sub) to update the flags itself based on its result, so the separate cmp is no longer needed.
Example will be provided with exercises!

42 Conditional Execution on ARM
Implemented with a 4-bit condition code selector.
One of the four-bit codes is reserved as an "escape code" to specify certain unconditional instructions.
However, nearly all common instructions are conditional.
Example: compute the greatest common divisor (GCD) of two integers with the Euclidean algorithm

43 Euclidean Algorithm
Find the greatest common divisor (GCD) of two given numbers a and b.
Basic algorithm:
  if (a == 0)
      return b;
  else
      while (b != 0)
          if (a > b)
              a = a - b;
          else
              b = b - a;
  return a;

44 Euclidean Algorithm on ARM
C code:
  while (b != 0)
      if (a > b)
          a = a - b;
      else
          b = b - a;
In ARM assembly language:
  loop: CMP   Ra, Rb       ; set condition:
                           ;   "NE" if (a != b),
                           ;   "GT" if (a > b),
                           ;   "LE" if (a <= b)
        SUBGT Ra, Ra, Rb   ; if "GT", a = a - b
        SUBLE Rb, Rb, Ra   ; if "LE", b = b - a
        CMP   Rb, #0
        BNE   loop         ; if "NE" then loop
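The behaviour of the predicated loop can be mimicked in Python to check it against a library GCD. This is a model, not ARM code: the comparison result is stored once (like the flags set by CMP), and both subtractions are "issued" in every iteration, but only the one whose condition holds takes effect. The guard for a == 0 is taken from the basic algorithm, because the loop alone would not terminate in that case.

```python
import math

def gcd_predicated(a, b):
    """Model of the CMP / SUBGT / SUBLE / CMP / BNE loop for non-negative integers."""
    if a == 0:
        return b
    while True:
        greater = a > b        # CMP Ra, Rb: the flags are set once
        if greater:            # SUBGT Ra, Ra, Rb (does not change the flags)
            a = a - b
        if not greater:        # SUBLE Rb, Rb, Ra (still uses the flags of CMP)
            b = b - a
        if b == 0:             # CMP Rb, #0 / BNE loop
            break
    return a

print(gcd_predicated(48, 18), math.gcd(48, 18))
```

The only jump left in the loop body is the backward branch that closes the loop.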

45 Versions of ADD on ARM
Unconditional versions of ADD:
  ADD    ADDS    (or ADDAL ADDALS)
Conditional versions of ADD:
  ADDEQ  ADDEQS
  ADDNE  ADDNES
  ADDCS  ADDCSS
  ADDCC  ADDCCS
  ADDMI  ADDMIS
  ADDPL  ADDPLS
  ADDVS  ADDVSS
  ADDVC  ADDVCS
  ADDHI  ADDHIS
  ADDLS  ADDLSS
  ADDGE  ADDGES
  ADDLT  ADDLTS
  ADDGT  ADDGTS
  ADDLE  ADDLES

46 Versions of ADD on ARM
Code          Meaning (for cmp or subs)                 Flags tested
EQ            Equal.                                    Z==1
NE            Not equal.                                Z==0
CS or HS      Unsigned higher or same (or carry set).   C==1
CC or LO      Unsigned lower (or carry clear).          C==0
MI            Negative ("minus").                       N==1
PL            Positive or zero ("plus").                N==0
VS            Signed overflow ("V set").                V==1
VC            No signed overflow ("V clear").           V==0
HI            Unsigned higher.                          (C==1) && (Z==0)
LS            Unsigned lower or same.                   (C==0) || (Z==1)
GE            Signed greater than or equal.             N==V
LT            Signed less than.                         N!=V
GT            Signed greater than.                      (Z==0) && (N==V)
LE            Signed less than or equal.                (Z==1) || (N!=V)
AL (omitted)  Always.
Technische Universität München
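The table can be turned into executable form to see how one comparison answers both the signed and the unsigned question. The sketch models the flags of a 32-bit `cmp a, b` (which computes a - b); the helper names `flags_after_cmp` and `condition_holds` are made up for illustration.

```python
MASK = 0xFFFFFFFF  # 32-bit registers

def flags_after_cmp(a, b):
    """Return the flags (N, Z, C, V) that cmp a, b would produce."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    n = result >> 31                       # sign bit of the result
    z = int(result == 0)
    c = int(a >= b)                        # carry set means "no borrow"
    sign_a, sign_b = a >> 31, b >> 31
    v = int(sign_a != sign_b and n != sign_a)  # signed overflow of a - b
    return n, z, c, v

CONDITIONS = {
    "EQ": lambda n, z, c, v: z == 1,
    "NE": lambda n, z, c, v: z == 0,
    "CS": lambda n, z, c, v: c == 1,
    "CC": lambda n, z, c, v: c == 0,
    "MI": lambda n, z, c, v: n == 1,
    "PL": lambda n, z, c, v: n == 0,
    "VS": lambda n, z, c, v: v == 1,
    "VC": lambda n, z, c, v: v == 0,
    "HI": lambda n, z, c, v: c == 1 and z == 0,
    "LS": lambda n, z, c, v: c == 0 or z == 1,
    "GE": lambda n, z, c, v: n == v,
    "LT": lambda n, z, c, v: n != v,
    "GT": lambda n, z, c, v: z == 0 and n == v,
    "LE": lambda n, z, c, v: z == 1 or n != v,
    "AL": lambda n, z, c, v: True,
}
CONDITIONS["HS"] = CONDITIONS["CS"]
CONDITIONS["LO"] = CONDITIONS["CC"]

def condition_holds(code, flags):
    return CONDITIONS[code](*flags)

flags = flags_after_cmp(-1, 1)   # -1 is the bit pattern 0xFFFFFFFF
print(condition_holds("LT", flags), condition_holds("HI", flags))
```

After comparing -1 with 1, LT holds (signed view) and HI holds as well (unsigned view, where the first operand is 0xFFFFFFFF).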

47 Delay Slots
Delay slots are the n - 1 instructions following a branch
On some architectures these are executed even if the jump is taken
Example (assuming n == 2):
  Conventional ISA:           ISA with delay slots:
  loop: LOAD r2 <- [r1]       loop: LOAD r2 <- [r1]
        ADD r0 <- r0, r2            DEC r1
        DEC r1                      JZ loop
        JZ loop                     ADD r0 <- r0, r2
If no independent instruction is found, a NOP must be inserted
Used e.g. in AM29000 microprocessors (only works for small n)

48 Branch Prediction
Problem: conditional jumps are executed at a late stage in the pipeline
Solution: predict whether the branch is taken
  The pipeline must be flushed if the prediction was wrong
Two approaches:
  Static branch prediction (does not depend on earlier branches)
  Dynamic branch prediction (depends on branch history)

49 Branch Prediction
Simplest case: always assume no branch
Better: always assume branch (about 2/3 of all branches are taken!)
Even better:
  Backward branch: assume branch
  Forward branch: assume no branch
  Unconditional branch: assume branch
Alternative: the compiler gives a hint whether the branch is taken or not
  Every conditional branch needs two opcodes

50 Dynamic Branch Prediction
Idea: remember whether the branch at address XXX was taken in the past or not.
  loop: LOAD r2 <- [r1]
        ADD r0 <- r0, r2
        DEC r1
  XXX:  JZ loop
Branch only mispredicted in the last iteration!

51 Dynamic Branch Prediction
Branch prediction cache:
  N entries
  Tag: address of jump
  Entry: one bit (taken or not taken)
  Bits needed for LRU
  Stores the N most recently encountered jumps

52 Dynamic Branch Prediction
More sophisticated approach:
  Assume the branch sequence TNTNTNTNTNT
  With one bit per entry, no branch is predicted correctly!
  Solution (2 bits per entry required): [figure]
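Both schemes are easy to simulate for a single branch. The slide's own 2-bit scheme is not reproduced in the text, so the sketch below assumes a common variant, a 2-bit saturating counter (states 0 and 1 predict "not taken", 2 and 3 predict "taken"); the initial states are assumptions as well.

```python
def one_bit_hits(outcomes, last_taken=False):
    """Predict that the branch behaves as it did the last time."""
    hits = 0
    for taken in outcomes:
        hits += int(last_taken == taken)
        last_taken = taken
    return hits

def two_bit_hits(outcomes, counter=0):
    """2-bit saturating counter: 0, 1 predict not taken; 2, 3 predict taken."""
    hits = 0
    for taken in outcomes:
        hits += int((counter >= 2) == taken)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return hits

alternating = [ch == "T" for ch in "TNTNTNTNTNT"]
print(one_bit_hits(alternating), two_bit_hits(alternating))   # 0 and 5

# A loop branch: taken nine times, then not taken, executed three times.
loop = ([True] * 9 + [False]) * 3
print(one_bit_hits(loop), two_bit_hits(loop))                 # 24 and 25
```

For the loop pattern the one-bit scheme mispredicts twice per loop execution (the exit and the following re-entry), whereas the counter, once warmed up, only mispredicts the exit.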


54 Branch Target Cache
Motivation: branch prediction alone is not sufficient.
  Determining the branch target (e.g. load from memory) takes too long!
  Even worse for indirect jumps and returns.
Solution: branch target cache
  Maps locations of branches to branch targets
  IFETCH stage:
    if <current IP matches cache tag>
        load cache entry into pipeline
    else
        load next (IP+1) into pipeline
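The IFETCH rule can be sketched as a small lookup structure. The class and method names are made up for illustration, and the LRU replacement is an assumption borrowed from the branch prediction cache; a real implementation is a hardware table, not a dictionary.

```python
from collections import OrderedDict

class BranchTargetCache:
    """Maps the address of a branch to the target it jumped to last time."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # branch address -> target, oldest first

    def next_fetch_address(self, ip):
        """IFETCH: use the cached target on a tag match, else fall through."""
        if ip in self.entries:
            self.entries.move_to_end(ip)   # mark as most recently used
            return self.entries[ip]
        return ip + 1

    def record_taken_branch(self, ip, target):
        """Called once a taken branch at `ip` has been resolved."""
        self.entries[ip] = target
        self.entries.move_to_end(ip)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used

btc = BranchTargetCache(capacity=2)
print(btc.next_fetch_address(7))   # 8: unknown branch, fetch the next instruction
btc.record_taken_branch(7, 3)
print(btc.next_fetch_address(7))   # 3: the target comes from the cache
```

The lookup needs only the current IP, so it can happen before the instruction has even been decoded.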

55 Branch Target Cache

56 Branch Target Cache

57 Branch Target Cache
Problem: the branch target cache does not work well for RET instructions.
Solution (as introduced by Cyrix):
  CALL stores the return address not only on the stack ...
  ... but also in a branch stack (within the control unit)
  As long as the branch stack does not overflow ...
  ... all returns can be correctly predicted.
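A branch stack of this kind behaves like a small, bounded stack of return addresses. The sketch below is a hypothetical model (names and the overflow policy of dropping the oldest entry are assumptions), meant only to show why returns are predicted correctly until the stack overflows.

```python
class ReturnStack:
    """Bounded stack of predicted return addresses inside the control unit."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = []

    def on_call(self, return_address):
        if len(self.slots) == self.depth:
            self.slots.pop(0)          # overflow: the oldest entry is lost
        self.slots.append(return_address)

    def predict_return(self):
        """Prediction for the next RET, or None if nothing is known."""
        return self.slots.pop() if self.slots else None

stack = ReturnStack(depth=2)
stack.on_call(100)
stack.on_call(200)
print(stack.predict_return(), stack.predict_return())   # 200 100
```

With three nested calls and a depth of two, the outermost return can no longer be predicted.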
