INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing



UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.

Lecture 07
Title: Pipelining - Data and Control Hazards
Summary: Minimization of data hazards (revision): by software, read-write in opposite clock edges, data forwarding. Minimization of control hazards: anticipation of branch resolution, delayed branches, static and dynamic branch prediction.
2010/2011
Nuno.Roma@ist.utl.pt


Architectures for Embedded Computing
Pipelining: Data and Control Hazards
Prof. Nuno Roma, ACE 2010/11 - DEI-IST

Previous Class
In the previous class:
- Analysis of a program execution in a pipeline;
- Hazards in the pipeline: structural hazards; data hazards (types of hazards);
- Overcoming the stalls: by software; by writing in opposite edges of the clock cycle; by data forwarding.


Road Map

Summary
Today:
- Data hazards, stall minimization (revision): by software, read-write in opposite clock edges, data forwarding.
- Control hazards, stall minimization: anticipation of branch resolution, delayed branches, branch prediction (static, dynamic).

Bibliography: Computer Architecture: a Quantitative Approach, Sections A.2 to A.3; 2.1 to 2.3.


Data Hazards
Considering an instruction i that precedes an instruction j:
- RAW (read after write): j tries to read an operand before i has written it, so j incorrectly gets the old value. This is a true data dependency and the most common hazard in pipelines.
- WAW (write after write): j tries to write an operand before i writes it, so the value left in the register is the one written by i instead of the one written by j. This is an output dependency. It does not happen in the basic version of the pipeline.
- WAR (write after read): j tries to write an operand before i reads it, so i incorrectly reads the updated value instead of the previous one. This is an anti-dependency.
- RAR (read after read)? Not a hazard: neither instruction modifies the operand, so the order of the two reads does not matter.

Data Forwarding
[Figure: Data Forwarding]
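The classification above depends only on which registers each instruction reads and writes, so it can be checked mechanically. The following Python sketch uses a hypothetical, simplified Instr record (the set of registers written and the set of registers read; a store writes memory, not a register) and reports the dependencies of a later instruction j on an earlier instruction i, using the loop instructions from these slides. Whether a dependency actually becomes a hazard still depends on the pipeline timing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical, simplified instruction: registers written and read."""
    text: str
    writes: frozenset
    reads: frozenset


def instr(text, writes=(), reads=()):
    return Instr(text, frozenset(writes), frozenset(reads))


def dependencies(i, j):
    """Kinds of dependency of instruction j on an earlier instruction i."""
    kinds = set()
    if i.writes & j.reads:
        kinds.add("RAW")  # true dependency: j reads what i writes
    if i.writes & j.writes:
        kinds.add("WAW")  # output dependency: both write the same register
    if i.reads & j.writes:
        kinds.add("WAR")  # anti-dependency: j overwrites what i reads
    return kinds  # two reads of the same register (RAR) are never reported


ld = instr("LD R8,0(R1)", writes={"R8"}, reads={"R1"})
dadd = instr("DADD R9,R9,R8", writes={"R9"}, reads={"R9", "R8"})
sd = instr("SD R9,0(R1)", reads={"R9", "R1"})  # writes memory, not a register
daddui = instr("DADDUI R1,R1,#-8", writes={"R1"}, reads={"R1"})

print(dependencies(ld, dadd))    # {'RAW'}
print(dependencies(sd, daddui))  # {'WAR'}
print(dependencies(ld, ld))      # {'WAW'} (LD R8 in two consecutive iterations)
```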

[Figure: Pipeline Structure]

[Figure: Implementation of Data Forwarding]
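As a hedged illustration of the decision a forwarding unit makes, the following Python sketch follows the classic textbook formulation for a 5-stage pipeline, which may differ in detail from the circuit in the slides. For each source register of the instruction in X, it checks whether an older instruction, now held in the EX/MEM or MEM/WB pipeline register, will write that register, giving priority to the most recent one. The PipeReg record and the returned labels are hypothetical names; register R0 is never forwarded because it always reads as zero in MIPS.

```python
from typing import NamedTuple, Optional


class PipeReg(NamedTuple):
    """Hypothetical view of a pipeline register (EX/MEM or MEM/WB)."""
    reg_write: bool       # does the instruction held here write a register?
    rd: Optional[int]     # its destination register number


def forward_select(src: int, ex_mem: PipeReg, mem_wb: PipeReg) -> str:
    """Choose where the X stage takes the value of operand register src from."""
    if src == 0:
        return "REGFILE"   # R0 always reads as zero: never forward
    if ex_mem.reg_write and ex_mem.rd == src:
        return "EX/MEM"    # result of the immediately preceding instruction
    if mem_wb.reg_write and mem_wb.rd == src:
        return "MEM/WB"    # result from two instructions earlier (ALU or load)
    return "REGFILE"       # no pending write: use the value read in D


# DADDUI R1,R1,#-8 is now in M; an older instruction writing R9 is in W.
ex_mem, mem_wb = PipeReg(True, 1), PipeReg(True, 9)
print(forward_select(1, ex_mem, mem_wb))  # EX/MEM
print(forward_select(9, ex_mem, mem_wb))  # MEM/WB
print(forward_select(8, ex_mem, mem_wb))  # REGFILE
```

A loaded value only reaches the MEM/WB register after the M stage, which is why an instruction that needs it in the very next cycle must still stall once.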

Data Forwarding
Data forwarding does not always prevent the introduction of stalls: for example, an instruction that uses the result of the immediately preceding load must still wait, since the loaded value is only available at the end of the M stage.

Data Forwarding: Example
Original:
LD R8,0(R1)       F D X M W
DADD R9,R9,R8     F D X M W
SD R9,0(R1)       F D X M W
DADDUI R1,R1,#-8  F D X M W
LD R8,0(R1)       F

With data forwarding:
LD R8,0(R1)       F D X M W
DADD R9,R9,R8     F D X M W
SD R9,0(R1)       F D X M W
DADDUI R1,R1,#-8  F D X M W
LD R8,0(R1)       F
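The number of stall cycles in each case can be counted with a small timing model. The following Python sketch assumes an in-order F D X M W pipeline that writes the register file in the first half of a cycle and reads it in the second half (read-write in opposite clock edges), and it ignores the branch. It computes the cycle in which each instruction performs its D stage. Under these assumptions it gives 4 stall cycles for the loop body without forwarding and 1 (the load-use stall after LD) with forwarding; the original slides may assume slightly different timing.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dst: Optional[str]      # register written (None for a store)
    srcs: Tuple[str, ...]   # registers read
    is_load: bool = False


def decode_cycles(program, forwarding):
    """Cycle of each instruction's (last) D stage; the first F is in cycle 1."""
    d = []
    for k, ins in enumerate(program):
        # In order, one instruction per stage: at least one cycle after the previous one.
        earliest = 2 if k == 0 else d[-1] + 1
        pending = set(ins.srcs)
        for p in range(k - 1, -1, -1):       # nearest older producer first
            prod = program[p]
            if prod.dst not in pending:
                continue
            pending.discard(prod.dst)
            if not forwarding:
                gap = 3   # operand read in D during the producer's W stage
            elif prod.is_load:
                gap = 2   # loaded value exists only after M: one bubble
            else:
                gap = 1   # ALU result forwarded to the next X: no bubble
            earliest = max(earliest, d[p] + gap)
        d.append(earliest)
    return d


def stall_cycles(program, forwarding):
    # With no stalls, instruction k (0-based) would be in D at cycle k + 2.
    return decode_cycles(program, forwarding)[-1] - (len(program) + 1)


loop_body = [
    Instr("LD R8,0(R1)", "R8", ("R1",), is_load=True),
    Instr("DADD R9,R9,R8", "R9", ("R9", "R8")),
    Instr("SD R9,0(R1)", None, ("R9", "R1")),
    Instr("DADDUI R1,R1,#-8", "R1", ("R1",)),
]

print(stall_cycles(loop_body, forwarding=False))  # 4
print(stall_cycles(loop_body, forwarding=True))   # 1
```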

Control Hazards
Control hazards happen as a consequence of an interruption in the sequential flow of instruction execution that the pipeline structure assumes:
LD R8,0(R1)       F D X M W
DADD R9,R9,R8     F D X M W
SD R9,0(R1)       F D X M W
DADDUI R1,R1,#-8  F D X M W
LD R8,0(R1)       ????

[Figure: Control Hazards: pipeline structure]

The next instruction to execute (here, the LD at the start of the loop) is stalled (S) until the branch is resolved:
LD R8,0(R1)       F D X M W
DADD R9,R9,R8     F D X M W
SD R9,0(R1)       F D X M W
DADDUI R1,R1,#-8  F D X M W
LD R8,0(R1)       F S F S F S F D...

Control Hazards: optimizations
Optimizations:
- Anticipation of branch resolution;
- Delayed branches;
- Branch prediction: static, dynamic.

Anticipation of Branch Resolution
Both the evaluation of the branch condition and the computation of the target address are moved earlier, to the ID stage of the pipeline.

Anticipation of Branch Resolution
Without anticipation of branch resolution:
LD R8,0(R1)       F D X M W
DADD R9,R9,R8     F D X M W
SD R9,0(R1)       F D X M W
DADDUI R1,R1,#-8  F D X M W
LD R8,0(R1)       F S F S F S F

With anticipation of branch resolution:
LD R8,0(R1)       F D X M W
DADD R9,R9,R8     F D X M W
SD R9,0(R1)       F D X M W
DADDUI R1,R1,#-8  F D X M W
LD R8,0(R1)       F S F D X

Resolving the branch in ID reduces the penalty from three stall cycles (S) to one.
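The benefit of anticipating branch resolution can be estimated with the usual pipeline formula CPI = 1 + (fraction of branches) × (branch penalty). The following Python sketch assumes an ideal base CPI of 1, a branch penalty equal to the stall cycles shown in the diagrams above (3 without anticipation, 1 with it), and a hypothetical workload in which 20% of the executed instructions are branches.

```python
def pipeline_cpi(branch_fraction, branch_penalty, base_cpi=1.0):
    """Average CPI when every branch stalls the pipeline for branch_penalty cycles."""
    return base_cpi + branch_fraction * branch_penalty


# Hypothetical workload: 20% of the executed instructions are branches.
without_anticipation = pipeline_cpi(0.20, 3)  # 3 stall cycles per branch
with_anticipation = pipeline_cpi(0.20, 1)     # 1 stall cycle, branch resolved in ID
print(f"CPI {without_anticipation:.2f} -> {with_anticipation:.2f}, "
      f"speedup {without_anticipation / with_anticipation:.2f}x")
# CPI 1.60 -> 1.20, speedup 1.33x
```

The speedup grows with the fraction of branches, which is why branch handling matters most in tight loops like the one in these examples.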

Delayed Branches
Use the time required to resolve the branch to execute useful instructions:
- These instructions execute completely, whether the branch is taken or not.
- How? Select an instruction from before the branch and place it in the delay slot, provided that the program's data dependencies are preserved.
Example:

Delayed Branches
Without using the delay slot:
LD R8,0(R1)       F D X M W
DADD R9,R9,R8     F S F D X M W
SD R9,0(R1)       F D X M W
DADDUI R1,R1,#-8  F D X M W
LD R8,0(R1)       F S F D X

Using the delay slot:
LD R8,0(R1)       F D X M W
DADD R9,R9,R8     F S F D X M W
DADDUI R1,R1,#-8  F D X M W
SD R9,8(R1)       F D X M W
LD R8,0(R1)       F D X M

Since DADDUI now executes before SD, the SD offset changes from 0 to 8 so that it still stores to the original address. The stall before the next LD disappears because the delay slot is filled with useful work.

Delayed Branches
Compiler: reorders the instructions to take maximum advantage of the delay slot, while respecting all the program's data dependencies.
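Moving SD into the delay slot is only correct because its offset is adjusted. The following Python sketch checks this by interpreting both versions of the loop body on a small memory image. Assumptions: a hypothetical mini-interpreter for the four instructions used here, and a loop-closing branch modelled as "repeat while R1 != 0" (that condition is itself an assumption). With a delayed branch the delay-slot instruction executes on every iteration, taken or not, so each version can be run sequentially. A third version, which moves SD without correcting its offset, shows what goes wrong.

```python
def run_loop(body, regs, mem):
    """Hypothetical interpreter: run body repeatedly while R1 != 0 (assumed branch)."""
    regs, mem = dict(regs), dict(mem)   # work on copies
    while True:
        for op, a, b, c in body:
            if op == "LD":          # LD a, b(c)
                regs[a] = mem[regs[c] + b]
            elif op == "SD":        # SD a, b(c)
                mem[regs[c] + b] = regs[a]
            elif op == "DADD":      # DADD a, b, c
                regs[a] = regs[b] + regs[c]
            elif op == "DADDUI":    # DADDUI a, b, #c
                regs[a] = regs[b] + c
        if regs["R1"] == 0:
            return regs, mem


original = [("LD", "R8", 0, "R1"), ("DADD", "R9", "R9", "R8"),
            ("SD", "R9", 0, "R1"), ("DADDUI", "R1", "R1", -8)]
# DADDUI moved up; SD placed in the delay slot with its offset corrected to 8.
scheduled = [("LD", "R8", 0, "R1"), ("DADD", "R9", "R9", "R8"),
             ("DADDUI", "R1", "R1", -8), ("SD", "R9", 8, "R1")]
# Same move without correcting the offset (wrong).
naive = scheduled[:3] + [("SD", "R9", 0, "R1")]

start_regs = {"R1": 24, "R8": 0, "R9": 0}
start_mem = {8: 1, 16: 2, 24: 3}

reference = run_loop(original, start_regs, start_mem)
print(reference == run_loop(scheduled, start_regs, start_mem))  # True
print(reference == run_loop(naive, start_regs, start_mem))      # False
```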

Branch Prediction: Static
Implemented by the compiler. Upon each branch instruction, instead of stalling the pipeline until the new value of the PC register is known:
1. Assume a decision beforehand: take or not take the branch;
2. Continue executing instructions according to that prediction;
3. When the branch is resolved, check whether the prediction was correct:
- Yes: continue the normal execution;
- No: discard the instructions that are in the pipeline and restart execution at the new value of the PC register.

Branch Prediction: Static
Example: Predict Taken, in a pipeline WITHOUT anticipation of the branch resolution.

Case 1: Branch condition is TRUE. Correct prediction!
DADDUI R1,R1,#-8  F D X M W
LD R8,0(R1)       F D X M W
DADD R9,R9,R8     F S F D X M W
SD R9,0(R1)       F D X M W
DADDUI R1,R1,#-8  F D X M W

Case 2: Branch condition is FALSE. Wrong prediction!
DADDUI R1,R1,#-8  F D X M W
LD R8,0(R1)       F D
DADD R9,R9,R8     F
...               F D X M W
...               F D X M W

In case 2, the LD and DADD fetched along the predicted path are discarded, and execution continues with the instructions that follow the branch.

Branch Prediction: Static
The compiler may use information collected from previous executions of the same program on its input data (profile-based prediction).
In practice:
- The Predict Not Taken approach is easier to implement;
- The compiler may choose the type of conditional branch so that the prediction matches the most frequent path;
- Usual rule: forward branches are predicted Not Taken; backward branches are predicted Taken.
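The usual rule (backward taken, forward not taken) needs only the branch address and its target, both known at compile time. The following Python sketch applies it to a hypothetical branch trace and measures the fraction of correct predictions; the addresses and outcomes are made up for illustration.

```python
def predict_taken(branch_pc, target_pc):
    """Backward branches (target before the branch) predicted taken, forward ones not."""
    return target_pc < branch_pc


def accuracy(trace):
    """trace: list of (branch_pc, target_pc, actually_taken) tuples."""
    hits = sum(predict_taken(pc, target) == taken for pc, target, taken in trace)
    return hits / len(trace)


# Hypothetical trace: a backward loop branch at 0x40 (target 0x20) taken 9 times
# and then not taken, plus a forward branch at 0x10 (target 0x30) taken once in three.
trace = [(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]
trace += [(0x10, 0x30, True), (0x10, 0x30, False), (0x10, 0x30, False)]
print(f"{accuracy(trace):.0%}")  # 85% (11 of 13 correct)
```

Backward branches usually close loops, which are taken far more often than not, which is why this simple rule works well in practice.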

Branch Prediction: Dynamic
The objective is to adapt the branch prediction to the execution pattern observed so far in the program.
The simplest implementation uses a one-bit memory, addressed with the least significant bits of the address of the branch instruction: the Branch-Prediction Buffer, or Branch History Table.

Branch Prediction: Dynamic
Each position of this memory holds:
- the address of the branch instruction (current PC);
- the address of the target instruction, in case the branch is taken;
- one bit that indicates whether the branch was taken or not taken the last time it executed.
Typically, these tables have 4096 or more entries.
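A table with the fields listed above can be sketched as follows in Python. Assumptions: 4096 entries (12 index bits), 4-byte aligned instructions (so the two lowest address bits are skipped when indexing), and a "not taken" prediction whenever the entry is empty or belongs to a different branch. The BranchHistoryTable name and its methods are hypothetical. Because only the least significant bits select the entry, two branches whose addresses differ only in the upper bits compete for the same position; storing the branch address lets the table detect this.

```python
from typing import List, Optional, Tuple

Entry = Tuple[int, int, bool]  # (branch address, target address, last outcome)


class BranchHistoryTable:
    def __init__(self, index_bits: int = 12):   # 2**12 = 4096 entries
        self.mask = (1 << index_bits) - 1
        self.entries: List[Optional[Entry]] = [None] * (1 << index_bits)

    def _index(self, pc: int) -> int:
        # Least significant bits of the branch address, skipping the two lowest
        # bits (assumes 4-byte aligned instructions).
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> Tuple[bool, Optional[int]]:
        """Return (predict taken?, target address) for the branch at pc."""
        entry = self.entries[self._index(pc)]
        if entry is None or entry[0] != pc:   # empty, or another branch's entry
            return False, None                # fall back to 'not taken'
        _, target, taken = entry
        return taken, (target if taken else None)

    def update(self, pc: int, target: int, taken: bool) -> None:
        """Record the resolved outcome; one bit keeps only the latest outcome."""
        self.entries[self._index(pc)] = (pc, target, taken)


bht = BranchHistoryTable()
bht.update(0x1040, 0x1020, taken=True)
print(bht.predict(0x1040))  # (True, 4128)
print(bht.predict(0x5040))  # (False, None): same entry, different branch
```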

Branch Prediction: Dynamic
Problem: in a program loop, this predictor mispredicts twice per execution of the loop: at the last iteration, when the branch is unexpectedly not taken, and at the first iteration of the next execution of the loop, because the bit was then left at "not taken".
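The double misprediction can be reproduced by simulating a single one-bit entry for a loop-closing branch that is taken 9 times and then falls through. The following Python sketch (hypothetical function name, initial prediction "not taken") counts mispredictions over three executions of the loop: it misses twice per execution, once when leaving the loop and once when entering it again.

```python
def one_bit_mispredictions(outcomes, prediction=False):
    """Count mispredictions of a single 1-bit entry (initially 'not taken')."""
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken        # the bit only remembers the last outcome
    return misses


# Loop-closing branch: taken 9 times, then not taken when the loop exits.
one_loop_execution = [True] * 9 + [False]
print(one_bit_mispredictions(one_loop_execution * 3))  # 6, i.e. two per loop execution
```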

Branch Prediction: Dynamic
Optimization: use an n-bit counter:
- Increment it whenever the branch is taken;
- Decrement it whenever the branch is not taken;
- Predict the branch as taken whenever the counter is greater than one half of its maximum value.
In practice, n = 2 works almost as well as any larger value of n.
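The counter-based scheme can be sketched in Python for a single table entry. Assumptions: the counter starts at 0 and saturates at 0 and 2^n - 1, and "greater than one half of its maximum value" is read as counter > (2^n - 1) // 2. With n = 1 the scheme reduces to the one-bit predictor. For a hypothetical loop branch taken 9 times and then not taken, with the loop executed 10 times, n = 1 gives 20 mispredictions while n = 2 gives 12: two during warm-up, then only one per loop execution (at the exit).

```python
def nbit_mispredictions(outcomes, n=2):
    """Mispredictions of a single n-bit saturating counter (initial value 0)."""
    max_value = (1 << n) - 1          # counter range: 0 .. 2**n - 1
    counter = 0
    misses = 0
    for taken in outcomes:
        predict_taken = counter > max_value // 2   # upper half of the range
        if predict_taken != taken:
            misses += 1
        if taken:
            counter = min(counter + 1, max_value)  # increment, saturating at the top
        else:
            counter = max(counter - 1, 0)          # decrement, saturating at 0
    return misses


# Loop-closing branch: taken 9 times, then not taken; the loop is executed 10 times.
trace = ([True] * 9 + [False]) * 10
print(nbit_mispredictions(trace, n=1))  # 20: two misses per loop execution
print(nbit_mispredictions(trace, n=2))  # 12: two warm-up misses, then one per execution
```

A single not-taken outcome only moves a saturated 2-bit counter from 3 to 2, which is still in the "taken" half, so the prediction survives the loop exit.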

Pipelining - implementation problems:
- Multi-cycle instructions;
- Super-pipelining;
- Interruptions;
- Code optimization for pipelined execution.
