TDT4255 Computer Design. Review Lecture. Magnus Jahre.


1 TDT4255 Computer Design Review Lecture Magnus Jahre

2 ABOUT THE EXAM

3 About the Exam. The exam will cover a large part of the curriculum (reading list). Exam properties that we seek: Comprehensible and unambiguous. Correct. Reasonable (e.g. not too easy, not too difficult, and asking about principles and understanding rather than unimportant details). Relevant (same as above). Differentiating (NTNU has decided that an 'A' should be an outstanding result, and we need some difficult questions to identify potential A-candidates and to get a reasonable distribution of the students among the possible marks). Unpredictable (we should not give out information, or answers to questions, that would let smart or persistent students work out what the exam will or will not include. We want to influence the students so that they prepare for the exam by trying to maximize their learning of the course material rather than by speculation :-) ).

4 How to Answer an Exam Question. Only answer what is asked for: no points are awarded for answers that are beside the point. Only answer what you are reasonably sure is correct. Norwegian saying: It's better to keep your mouth shut and let people think you are stupid than to open your mouth and remove all doubt. There is a limited amount of space available to answer the questions. Prioritize: good priorities indicate good understanding.

5 Example Assignment (1/2). Explain the difference between a write-through and a write-back strategy for caches. Good answer: A write-through strategy updates main memory on all cache writes. A write-back strategy writes back dirty data when the block is evicted from the cache. Why is this good? It answers the question, and only answers the question.

6 Example Assignment (2/2). Explain the difference between a write-through and a write-back strategy for caches. Poor answer: A write-through strategy updates main memory on all cache writes. A write-back strategy writes back dirty data when the block is evicted from the cache. Set associative caches are common in current processors [not asked for!]. Fully associative caches are popular because they give the lowest miss rates [imprecise!]. (The answer continues with all possible irrelevant facts about caches, where some are correct and others are wrong or at least imprecise.)

7 Other Practicalities. The exam will have multiple choice questions. Trade-off: hard to write vs. easy to grade. A MIPS fact sheet will be provided. Last year's exam is available on It's Learning... but without solutions. The questions can have many correct answers.

8 Chapter 1 Review. Acknowledgement: Slides are adapted from Morgan Kaufmann companion material

9 Defining Performance. Which airplane has the best performance? [Charts compare the Boeing 777, Boeing 747, BAC/Sud Concorde and Douglas DC-8-50 on passenger capacity, cruising range (miles), cruising speed (mph), and passenger throughput (passengers x mph); which airplane is "best" depends on which metric is chosen.]

10 Response Time and Throughput. Book definition: time from issuing a command to its completion. This is often referred to as the turn-around time. A more common definition of response time: time from issue to first response. Execution time is the time the processor is busy executing the program. Turn-around time includes the time the process waits to be executed; execution time does not. Also: user execution time vs. system execution time. Throughput is the total work done per unit time.

11 CPI in More Detail. If different instruction classes take different numbers of cycles:

$$\text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right)$$

Weighted average CPI:

$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$

The last factor is the relative frequency of instruction class i.
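
As a quick illustration, here is a minimal C sketch of the weighted-average CPI computation (the instruction classes, CPIs and counts are made-up values, not from the slides):

    #include <stdio.h>

    int main(void) {
        double cpi[]   = {1.0, 2.0, 3.0};  /* hypothetical: ALU, load/store, branch */
        long   count[] = {500, 300, 200};  /* hypothetical instruction counts */
        double cycles = 0.0;
        long   instructions = 0;
        for (int i = 0; i < 3; i++) {
            cycles       += cpi[i] * count[i];  /* Clock Cycles = sum of CPI_i * IC_i */
            instructions += count[i];
        }
        printf("Weighted average CPI = %.2f\n", cycles / instructions);  /* 1.70 */
        return 0;
    }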

12 Appendix D Review. Acknowledgement: Slides are adapted from Morgan Kaufmann companion material

13 Combinatorial Logic. Combinatorial logic depends only on the current inputs, so we don't need a clock! There might be inputs that are irrelevant to our circuit: don't cares, which leave room for optimization.

14 32-Bit ALU. Exploit the 1-bit ALU abstraction to create a wide ALU, called a ripple carry adder. Ripple carry adders are slow: carry propagation through the circuit is the critical path.
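
To make the critical path concrete, here is a behavioral C sketch of a ripple carry adder (our own illustration, not from the slides); the carry computed in each iteration feeds the next one, which is exactly the rippling the slide describes:

    unsigned ripple_add(unsigned a, unsigned b) {
        unsigned sum = 0, carry = 0;
        for (int i = 0; i < 32; i++) {
            unsigned ai = (a >> i) & 1u, bi = (b >> i) & 1u;
            sum  |= (ai ^ bi ^ carry) << i;                   /* 1-bit sum */
            carry = (ai & bi) | (ai & carry) | (bi & carry);  /* 1-bit carry out */
        }
        return sum;  /* the final carry out is discarded */
    }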

15 Carry Lookahead. Idea: we can use more logic to shorten the critical path of a ripple carry adder. Each carry bit depends on all previous carries and inputs, so we can compute each carry directly by applying the formulas recursively. But: the logic overhead grows quickly. Two-bit carry lookahead example:

$c_1 = b_0 c_0 + a_0 c_0 + a_0 b_0$
$c_2 = b_1 c_1 + a_1 c_1 + a_1 b_1 = b_1 [b_0 c_0 + a_0 c_0 + a_0 b_0] + a_1 [b_0 c_0 + a_0 c_0 + a_0 b_0] + a_1 b_1$
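
In generate/propagate form (standard textbook notation, not spelled out on the slide: $g_i = a_i b_i$, $p_i = a_i + b_i$), the same carries can be computed directly from the inputs. A C sketch for two single-bit positions:

    void lookahead2(unsigned a0, unsigned b0, unsigned a1, unsigned b1,
                    unsigned c0, unsigned *c1, unsigned *c2) {
        unsigned g0 = a0 & b0, p0 = a0 | b0;    /* generate/propagate, bit 0 */
        unsigned g1 = a1 & b1, p1 = a1 | b1;    /* generate/propagate, bit 1 */
        *c1 = g0 | (p0 & c0);
        *c2 = g1 | (p1 & g0) | (p1 & p0 & c0);  /* computed without waiting for c1 */
    }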

16 Sequential Systems. Clocking methodologies: Edge triggered: state elements are updated on clock transitions. Level triggered: state elements are updated continuously while the clock is either 1 or 0. Choose one or the other; different methodologies may be appropriate for different production technologies.

17 Register. A collection of flip-flops or latches that stores multi-bit values. Register files contain multiple registers and access logic.

    reg: process(clk)
    begin
      if rising_edge(clk) then
        data_out <= data_in_1;
      end if;
    end process reg;

The VHDL code is identical to that of a latch/flip-flop, except that the signals are vectors rather than scalars.

18 Register File Example. 2-port read logic, 1-port write logic.

19 Finite State Machines. Commonly synchronous: the machine changes state on the clock tick. Two types: Moore: the output depends only on the current state. Mealy: the output depends on the current state and the inputs. Moore or Mealy? Almost all electronic systems contain a number of state machines.
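
A minimal C contrast of the two machine types (the states, transition rule and outputs are our own illustration):

    typedef enum { S0, S1 } state_t;

    int moore_output(state_t s)         { return s == S1; }        /* state only */
    int mealy_output(state_t s, int in) { return s == S1 && in; }  /* state + input */

    /* In both machine types the next state is a function of the current
       state and the inputs; this toy machine latches into S1 once the
       input has been 1. */
    state_t next_state(state_t s, int in) { return (s == S1 || in) ? S1 : S0; }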

20 Chapter 2 Review. Acknowledgement: Slides are adapted from Morgan Kaufmann companion material

21 Instruction Set Design. DP1: Simplicity favors regularity. Regularity makes implementation simpler; simplicity enables higher performance at lower cost. DP2: Smaller is faster. DP3: Make the common case fast. Small constants are common; an immediate operand avoids a load instruction. DP4: Good design demands good compromises. Different formats complicate decoding but allow uniformly 32-bit instructions; keep formats as similar as possible.

22 MIPS R-format Instructions. Layout: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits). Instruction fields: op: operation code (opcode). rs: first source register number. rt: second source register number. rd: destination register number. shamt: shift amount (00000 for now). funct: function code (extends the opcode).
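
As a worked example (the standard textbook one): add $t0, $s1, $s2 encodes as op = 0, rs = 17 ($s1), rt = 18 ($s2), rd = 8 ($t0), shamt = 0, funct = 32, i.e.

    000000 10001 10010 01000 00000 100000 = 0x02324020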

23 MIPS I-format Instructions. Layout: op (6 bits) | rs (5 bits) | rt (5 bits) | constant or address (16 bits). Used for immediate arithmetic and load/store instructions. rt: destination or source register number. Constant: $-2^{15}$ to $2^{15} - 1$. Address: offset added to the base address in rs.

24 Branch Addressing. Branch instructions specify: opcode, two registers, and a target address. Most branch targets are near the branch, forward or backward. Layout: op (6 bits) | rs (5 bits) | rt (5 bits) | constant or address (16 bits). PC-relative addressing: Target address = PC + offset x 4, where PC has already been incremented by 4 by this time.
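
A hypothetical example (the values are ours, not from the slide): a beq at address 0x1000 whose 16-bit offset field holds 7 branches to (0x1000 + 4) + 7 x 4 = 0x1020.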

25 Jump Addressing. Jump (j and jal) targets could be anywhere in the text segment, so the full address is encoded in the instruction. Layout: op (6 bits) | address (26 bits). (Pseudo)direct jump addressing: Target address = PC[31:28] : (address x 4).

26 Local Data on the Stack. Local data is allocated by the callee, e.g. C automatic variables. The procedure frame (activation record) is used by some compilers to manage stack storage.

27 Memory Layout. Text: program code. Static data: global variables, e.g. static variables in C, constant arrays and strings; $gp is initialized to an address allowing +/- offsets into this segment. Dynamic data: the heap, e.g. malloc in C and new in Java. Stack: automatic storage.

28 Translation and Startup. Many compilers produce object modules directly. Static linking.

29 Chapter 3 Review. Acknowledgement: Slides are adapted from Morgan Kaufmann companion material

30 Integer Addition. Overflow occurs if the result is out of range. Adding a positive and a negative operand: no overflow possible. Adding two positive operands: overflow if the result sign is 1. Adding two negative operands: overflow if the result sign is 0.
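
A C sketch of this rule for 32-bit signed addition (our own illustration; the unsigned cast avoids undefined behavior on overflow):

    #include <stdbool.h>
    #include <stdint.h>

    bool add_overflows(int32_t a, int32_t b) {
        int32_t r = (int32_t)((uint32_t)a + (uint32_t)b);  /* wraparound add */
        /* overflow only if the operands have the same sign and the
           result's sign differs from theirs */
        return ((a < 0) == (b < 0)) && ((r < 0) != (a < 0));
    }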

31 Multiplication. Start with the long-multiplication approach: multiplicand x multiplier = product. The length of the product is the sum of the operand lengths.

32 Optimized Multiplier. Perform the steps in parallel: add/shift. One cycle per partial-product addition. That's OK if the frequency of multiplications is low.

33 Division. Dividend / Divisor = Quotient (plus a remainder). n-bit operands yield an n-bit quotient and remainder. Check for a 0 divisor. Long division approach: if the divisor fits into the current dividend bits, put a 1 bit in the quotient and subtract; otherwise put a 0 bit in the quotient and bring down the next dividend bit. Restoring division: do the subtract, and if the remainder goes < 0, add the divisor back. Signed division: divide using absolute values, then adjust the signs of the quotient and remainder as required.
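
A minimal C sketch of restoring division for unsigned 32-bit operands (our own illustration; a real divider works on a remainder register in hardware, and the zero-divisor check is left to the caller as the slide suggests):

    #include <stdint.h>

    void divide(uint32_t dividend, uint32_t divisor,
                uint32_t *quotient, uint32_t *remainder) {
        uint64_t rem = 0;
        uint32_t quo = 0;
        for (int i = 31; i >= 0; i--) {
            rem = (rem << 1) | ((dividend >> i) & 1);  /* bring down next bit */
            if (rem >= divisor) {   /* subtraction stays >= 0: 1 bit in quotient */
                rem -= divisor;
                quo |= 1u << i;
            }                       /* otherwise restore: 0 bit in quotient */
        }
        *quotient  = quo;
        *remainder = (uint32_t)rem;
    }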

34 Representable Floating Point Numbers

35 IEEE Floating-Point Format. A sign bit S, an exponent field (single: 8 bits; double: 11 bits) and a fraction field (single: 23 bits; double: 52 bits):

$$x = (-1)^S \times (1 + \text{Fraction}) \times 2^{\text{Exponent} - \text{Bias}}$$

S: sign bit (0 = non-negative, 1 = negative). Normalize the significand: 1.0 <= significand < 2.0. It always has a leading pre-binary-point 1 bit, so there is no need to represent it explicitly (hidden bit); the significand is the Fraction with the leading "1." restored. Exponent: excess representation: actual exponent + Bias, which ensures the stored exponent is unsigned. Single: Bias = 127; double: Bias = 1023.
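
A small C program that pulls the three fields out of a single-precision value (our own illustration; -0.75 = -1.1 (binary) x 2^-1, so S = 1, biased exponent = 126, fraction = 0x400000):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float f = -0.75f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);           /* reinterpret the bit pattern */
        unsigned sign     = bits >> 31;
        unsigned exponent = (bits >> 23) & 0xFFu; /* biased exponent */
        unsigned fraction = bits & 0x7FFFFFu;
        /* value = (-1)^S * (1 + fraction/2^23) * 2^(exponent - 127) */
        printf("S=%u exp=%u (actual %d) frac=0x%06X\n",
               sign, exponent, (int)exponent - 127, fraction);
        return 0;
    }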

36 Chapter 4 Review. Acknowledgement: Slides are adapted from Morgan Kaufmann companion material

37 Single Cycle Datapath

38 R-Type Instruction

39 Load Instruction

40 Branch-on-Equal Instruction

41 Datapath With Jumps Added

42 Multi-cycle Datapath (1/2). Idea: add registers at strategic points in the datapath, and activate only the needed functional units with control signals.

43 Multi-cycle Datapath (2/2). Area savings are possible (but not necessary): only one memory and only one ALU.

44 Chapter 4 Review (continued)

45 Pipeline Hazards. Structural hazards: an occurrence in which a planned instruction cannot execute in the proper clock cycle because the hardware cannot support the combination of instructions that are set to execute in the given clock cycle. Data hazards: an occurrence in which a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available. Control hazards: an occurrence in which the proper instruction cannot execute in the proper clock cycle because the instruction that was fetched is not the one that is needed.

46 Structural Hazards. For the pipelined instructions below we would have a structural hazard in clock cycle (cc) 5 if we only had one memory: the second lw is reading from memory while the 5th lw instruction is being fetched. The MIPS instruction set is designed to avoid structural hazards.

    lw  IF ID EX MEM WB
    lw     IF ID EX MEM WB
    lw        IF ID EX MEM WB
    lw           IF ID EX MEM WB
    lw              IF ID EX MEM WB

47 Structural Hazards. For the pipelined instructions below we would have a structural hazard in clock cycle (cc) 5 if we only had one memory: the second lw is reading from memory while the 5th lw instruction is being fetched. The MIPS instruction set is designed to avoid structural hazards. What about the register file?

    lw  IF ID EX MEM WB
    lw     IF ID EX MEM WB
    lw        IF ID EX MEM WB
    lw           IF ID EX MEM WB
    lw              IF ID EX MEM WB

48 Data Hazards.

    add $s0, $t0, $t1   IF ID EX MEM WB
    sub $t2, $s0, $t3      IF ID EX MEM WB

An example of a data hazard: the sub instruction reads register $s0 in cc 3. The value that the sub is expecting is the value that is to be written into $s0 by the preceding add instruction, but the add will not update the $s0 register until cc 5.

49 Data Hazards and Forwarding. The figure above shows that the new value for $s0 is already calculated in the clock cycle before it is needed by the sub instruction. Making this value available to the sub instruction is called forwarding. Forwarding: a method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory.

50 Data Hazards and Stalls.

    lw  $s0, 20($t1)    IF ID EX MEM WB
    sub $t2, $s0, $t3      IF ID EX MEM WB

With the add as the first instruction, the new value for $s0 was available after cc 3 (in cc 4). Now it is not available until after cc 4. Even with forwarding we would need to stall the pipeline.

51 6.3 Pipelined Control. The pipelined datapath with control signals, simplified in chapter 4.6. As much as possible is borrowed from the single-cycle datapath. The functions of these signals are defined in the three following figures (5.12, 5.16, 5.18).

52 Chapter 2 HP07 Review

53 Introduction. Pipelining became a universal technique in 1985. It overlaps the execution of instructions and exploits Instruction Level Parallelism (ILP). Beyond this, there are two main approaches: hardware-based dynamic approaches, used in server and desktop processors, and compiler-based static approaches, used in the scientific and embedded markets.

54 Instruction-Level Parallelism. When exploiting instruction-level parallelism, the goal is to minimize CPI: Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls. Parallelism within a basic block is limited: the typical size of a basic block is 3-6 instructions, so we must optimize across branches.

55 Data Dependence. Why can't we just execute all instructions in parallel? Challenge: data dependence (true dependence). Instruction j is data dependent on instruction i if instruction i produces a result that may be used by instruction j, or if instruction j is data dependent on instruction k and instruction k is data dependent on instruction i. This includes registers AND memory! Examples? Dependent instructions cannot be executed simultaneously.

56 Data Dependence. Dependences are a property of programs (the amount of parallelism is HIGHLY dependent on the type of program). The pipeline MUST satisfy dependences. Data dependences specify: the order in which instructions MUST be executed, and an upper bound on the amount of parallelism. Data dependences that flow through memory locations are difficult to detect (example?).

57 Name Dependence. Two instructions use the same name, but with no flow of information (false dependence). Not a true data dependence, but a problem when reordering instructions. Antidependence: instruction j writes a register or memory location that instruction i reads; the initial ordering (i before j) must be preserved. Output dependence: instruction i and instruction j write the same register or memory location; the ordering must be preserved. To resolve, use renaming techniques.

58 Other Factors. Data hazards: read after write (RAW) corresponds to a true dependence; write after write (WAW) and write after read (WAR) correspond to false dependences. Control dependence: the ordering of instruction i with respect to a branch instruction. An instruction that is control dependent on a branch cannot be moved before the branch (so that its execution is no longer controlled by the branch), and an instruction that is not control dependent on a branch cannot be moved after the branch (so that its execution becomes controlled by the branch).

59 Compiler Techniques for Exposing ILP. Pipeline scheduling: separate a dependent instruction from the source instruction by the pipeline latency of the source instruction. Example:

    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;

60 Pipeline Stalls.

    Loop: L.D    F0,0(R1)
          stall
          ADD.D  F4,F0,F2
          stall
          stall
          S.D    F4,0(R1)
          DADDUI R1,R1,#-8
          stall
          BNE    R1,R2,Loop

61 Pipeline Scheduling. Scheduled code:

    Loop: L.D    F0,0(R1)
          DADDUI R1,R1,#-8
          ADD.D  F4,F0,F2
          stall
          stall
          S.D    F4,8(R1)
          BNE    R1,R2,Loop

62 Loop Unrolling. Unroll by a factor of 4 (assume the number of elements is divisible by 4) and eliminate unnecessary instructions:

    Loop: L.D    F0,0(R1)
          ADD.D  F4,F0,F2
          S.D    F4,0(R1)    ;drop DADDUI & BNE
          L.D    F6,-8(R1)
          ADD.D  F8,F6,F2
          S.D    F8,-8(R1)   ;drop DADDUI & BNE
          L.D    F10,-16(R1)
          ADD.D  F12,F10,F2
          S.D    F12,-16(R1) ;drop DADDUI & BNE
          L.D    F14,-24(R1)
          ADD.D  F16,F14,F2
          S.D    F16,-24(R1)
          DADDUI R1,R1,#-32
          BNE    R1,R2,Loop

Note the number of live registers vs. the original loop.
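
At the C level the transformation corresponds to the following sketch (our own illustration of the unrolled loop, with the slide's assumption that the element count is divisible by 4):

    void add_scalar_unrolled(double *x, double s) {
        for (int i = 999; i >= 0; i = i - 4) {
            x[i]   = x[i]   + s;
            x[i-1] = x[i-1] + s;
            x[i-2] = x[i-2] + s;
            x[i-3] = x[i-3] + s;  /* one set of loop overhead per 4 elements */
        }
    }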

63 Loop Unrolling/Pipeline Scheduling. Pipeline schedule the unrolled loop:

    Loop: L.D    F0,0(R1)
          L.D    F6,-8(R1)
          L.D    F10,-16(R1)
          L.D    F14,-24(R1)
          ADD.D  F4,F0,F2
          ADD.D  F8,F6,F2
          ADD.D  F12,F10,F2
          ADD.D  F16,F14,F2
          S.D    F4,0(R1)
          S.D    F8,-8(R1)
          DADDUI R1,R1,#-32
          S.D    F12,16(R1)
          S.D    F16,8(R1)
          BNE    R1,R2,Loop

64 Dynamic Scheduling. Rearrange the order of instructions to reduce stalls while maintaining data flow. Advantages: the compiler doesn't need knowledge of the microarchitecture, and it handles cases where dependences are unknown at compile time. Disadvantages: a substantial increase in hardware complexity, and it complicates exceptions.

65 Dynamic Scheduling. Dynamic scheduling implies out-of-order execution and out-of-order completion (but usually not out-of-order commit), which creates the possibility of WAR and WAW hazards. Tomasulo's approach: tracks when operands are available and introduces register renaming in hardware, minimizing WAW and WAR hazards.

66 Getting CPI below 1. CPI >= 1 if we issue only 1 instruction every clock cycle. Multiple-issue processors come in 3 flavors: 1. Statically-scheduled superscalar processors: in-order execution, varying number of instructions issued (decided by the compiler). 2. Dynamically-scheduled superscalar processors: out-of-order execution, varying number of instructions issued (decided by the CPU). 3. VLIW (very long instruction word) processors: in-order execution, fixed number of instructions issued.

67 VLIW: Very Long Instruction Word (1/2). Each VLIW instruction has explicit coding for multiple operations: several instructions are combined into packets, possibly with the parallelism indicated. This trades instruction space for simple decoding. Room for many operations: independent operations => execute in parallel. E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch.

68 VLIW: Very Long Instruction Word (2/2). Assume 2 load/store, 2 FP, 1 int/branch: a VLIW with 0-5 operations. Why 0? It is important to avoid empty instruction slots: loop unrolling, local scheduling, and global scheduling (scheduling across branches) help fill them. It is difficult to find all dependences in advance. Solution 1: block on memory accesses. Solution 2: the CPU detects some dependences.

69 IA-64 and EPIC. A 64-bit instruction set architecture: not a CPU, but an architecture. Itanium and Itanium 2 are CPUs based on IA-64, made by Intel and Hewlett-Packard (Itanium 2 and 3 were designed in Colorado). Uses EPIC: Explicitly Parallel Instruction Computing. A departure from the x86 architecture, meant to achieve out-of-order performance with in-order hardware plus compiler smarts. Stop bits to help with code density. Support for control speculation (moving loads above branches). Support for data speculation (moving loads above stores).

70 EPIC Conclusions. The goal of EPIC was to maintain the advantages of VLIW but achieve the performance of out-of-order. Results: complicated bundling rules save some space, but make the hardware more complicated. Special hardware and instructions were added for scheduling loads above stores and branches (new, complicated hardware), and special hardware was added to remove branch penalties (predication). The end result is a machine as complicated as an out-of-order design, but now also requiring a super-sophisticated compiler.

71 Multiple Issue and Static Scheduling

72 Phases of Instruction Execution. PC -> I-cache -> Fetch Buffer -> Issue Buffer -> Func. Units -> Result Buffer -> Arch. State. Fetch: instruction bits are retrieved from the cache. Decode: instructions are placed in the appropriate issue (aka "dispatch") stage buffer. Execute: instructions and operands are sent to the execution units; when execution completes, all results and exception flags are available. Commit: the instruction irrevocably updates the architectural state (aka "graduation" or "completion"). (Arvind & Emer)

73 Dataflow Execution. A reorder buffer holds instruction slots with the fields Ins#, use, exec, op, p1, src1, p2, src2; one pointer marks the next entry to deallocate and another the next available entry. An instruction slot is a candidate for execution when: it holds a valid instruction (the "use" bit is set), it has not already started execution (the "exec" bit is clear), and both operands are available (p1 and p2 are set). (Arvind & Emer)
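
A minimal C sketch of one such slot and the readiness test (the types and field widths are our own illustration):

    #include <stdbool.h>

    typedef struct {
        int  ins;         /* instruction number */
        bool use;         /* slot holds a valid instruction */
        bool exec;        /* execution has already started */
        int  op;          /* operation to perform */
        bool p1, p2;      /* operand-present bits */
        int  src1, src2;  /* operand values (or tags while pending) */
    } rob_entry;

    /* Candidate for execution: exactly the slide's three conditions. */
    bool ready(const rob_entry *e) {
        return e->use && !e->exec && e->p1 && e->p2;
    }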

74 Data-Driven Execution. The renaming table and register file feed a reorder buffer whose entries hold Ins#, use, exec, op, p1, src1, p2, src2; the load unit, functional units and store unit broadcast <tag t, result> pairs back to it. An instruction template (i.e., tag t) is allocated by the decode stage, which also stores the tag in the register file. When an instruction completes, its tag is deallocated. Replacing a tag by its value is an expensive operation. (Arvind & Emer)

75 O-o-O Execution with ROB. A rename table maps registers (R1, R2, ...) to tags with valid bits, the register file holds committed values, and reorder buffer entries carry Ins#, use, exec, op, p1, src1, p2, src2, pd, dest and data; the load unit, functional units and store unit broadcast <tag t, result> pairs, and a commit stage drains the buffer. Basic operation: enter the op and a tag or data (if known) for each source; replace a tag with data as it becomes available; issue the instruction when all sources are available; save the dest data when the operation finishes; commit the saved dest data when the instruction commits. (Arvind & Emer)

76 Reorder Buffer Holds Active Instruction Window. The window, from older to newer instructions: ld r1,(r3); add r3,r1,r2; sub r6,r7,r9; add r3,r3,r6; ld r6,(r1); add r6,r6,r3; st r6,(r1); ld r6,(r1). Between cycle t and cycle t+1 the commit, execute and fetch pointers each advance through this window. (Arvind & Emer)

77 Recovering the ROB/Renaming Table. The rename table (r1, r2, ... mapped to tag/valid pairs) is backed by rename snapshots; in the reorder buffer, one pointer marks the next entry to commit, another the next available entry, and rollback moves back on a misprediction. Take a snapshot of the register rename table at each predicted branch, and recover the earlier snapshot if the branch is mispredicted. (Arvind & Emer)

78 Speculative & Out-of-Order Execution. In-order front end: PC -> Fetch -> Decode & Rename -> Reorder Buffer; out-of-order execution: branch unit, ALU and MEM reading a physical register file, with a store buffer in front of the D$; in-order commit. Branch prediction steers fetch, branch resolution kills wrong-path instructions, and the predictors are updated at commit. (Arvind & Emer)

79 Physical Register Files. Reorder buffers are space inefficient: a data value may be stored in multiple places in the reorder buffer. Idea: keep all data values in a physical register file. A tag represents the name of the data value and the name of the physical register that holds it, and the reorder buffer contains only tags. Thus, 64-bit data values may be replaced by 8-bit tags for a 256-element physical register file. (Arvind & Emer)

80 Chapter 5 Review

81 Principle of Temporal Locality. If you read an address once, you are likely to touch it again (variables). If you execute an instruction once, you are likely to execute it again (loops). Temporal locality: addresses recently referenced will tend to be referenced again soon. Caches exploit temporal locality!

82 Principle of Spatial Locality. If you read an address once, you are likely to also read neighbouring addresses (arrays). If you execute an instruction once, you are likely to access neighbouring instructions. Spatial locality: if you access address X, you are likely to access an address close to X. Caches exploit spatial locality!
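
Both kinds of locality show up in the simplest array loop (our own example):

    long sum_array(const long *a, int n) {
        long sum = 0;             /* sum and i are reused every iteration: temporal */
        for (int i = 0; i < n; i++)
            sum += a[i];          /* consecutive addresses: spatial */
        return sum;
    }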

83 Implementation of 4-way set-associative cache
