Slide Set 7 for ENCM 501 in Winter Term, 2017. Steve Norman, PhD, PEng


1 Slide Set 7 for ENCM 501 in Winter Term, 2017
Steve Norman, PhD, PEng
Electrical & Computer Engineering, Schulich School of Engineering, University of Calgary
Winter Term, 2017

2 ENCM 501 W17 Lectures: Slide Set 7 slide 2/56
Contents
- ILP: Instruction-Level Parallelism
- Review of simple pipelining
- Pipeline Hazards
- Dynamic branch prediction
- Review of floating-point numbers
- Fitting FP operations into the 5-stage pipeline
- In-order versus out-of-order

4 ILP: Instruction-Level Parallelism
ILP is a general term for enhancing instruction throughput within a single processor core by having multiple instructions in flight at any given time. Two important forms of ILP are:
- pipelining: each instruction takes several clock cycles to complete, but instructions are started one per clock cycle;
- multiple issue: two or more instructions are started in the same clock cycle.
Modern processors use both pipelining and multiple issue, and use complex sets of related features to try to maximize instruction throughput.
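To make the throughput benefit of pipelining concrete, here is a minimal sketch that counts clock cycles for an ideal pipeline with no stalls. The function names and the comparison against a non-pipelined multi-cycle design are assumptions made for illustration; they do not come from the slides.

```python
def unpipelined_cycles(n_instr, n_stages):
    """Multi-cycle design: each instruction uses all n_stages cycles
    before the next instruction may start."""
    return n_instr * n_stages


def pipelined_cycles(n_instr, n_stages):
    """Ideal pipeline: the first instruction needs n_stages cycles,
    then one more instruction completes in every following cycle."""
    if n_instr == 0:
        return 0
    return n_stages + (n_instr - 1)


if __name__ == "__main__":
    n, k = 1000, 5
    print(unpipelined_cycles(n, k))  # 5000
    print(pipelined_cycles(n, k))    # 1004
```

For long instruction sequences the cycle count of the ideal pipeline approaches one cycle per instruction, which is why hazards, introduced later, are measured as lost cycles relative to that ideal.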

6 Review of simple pipelining
Before diving into microarchitectures with multiple pipelines, let's review the design challenges of getting a single pipeline to work fast and correctly. The basic organization of a pipeline involves:
- pipeline stages: A stage performs some small, simple step as part of handling an instruction. For example, one stage might be responsible for reading GPR values used in an instruction, and another stage might compute memory addresses to be used in loads and stores.
- pipeline registers: At the end of each clock cycle, a pipeline register captures the results produced by a stage, making those results available for the next stage in the next cycle.

7 First stage of a simple pipeline: IF (instruction fetch)
We'll look at an example pipeline that can handle a few different kinds of MIPS instructions. The IF stage is responsible for:
- updating the PC register as appropriate;
- reading an instruction from memory and copying the instruction into a pipeline register, so the instruction is available to the next stage, called the ID stage.
Despite what we've just learned about memory, we'll pretend that instruction memory is a simple functional unit that can be read within a single clock cycle!

8 [Figure: schematic of the IF stage. Labels: PC, instruction memory (address in, 32-bit instruction out), add (usual PC update), branch target address, branch decision, IF/ID pipeline register, CLK.]
In every single clock cycle, the IF stage will dump a new instruction into the IF/ID pipeline register.

9 More stages
This lecture will follow the 5-stage design presented in Section C.3 of the course textbook. The stages are:
- IF, which we've just seen
- ID: instruction decode and GPR read
- EX: execute; perform computation in the ALU (arithmetic/logic unit)
- MEM: access to data memory for load or store
- WB: writeback; write the result of a load or an instruction like DADD to a GPR
Let's sketch out what each of these stages does...

10 IF ID EX MEM WB
Attention: This slide and others like it will not attempt to describe every detail of a pipeline stage. Instead it will just explain the general role of a stage.
The ID stage:
- decodes the instruction: finds out what kind of instruction it is, and what its operands are;
- copies two GPR values into the ID/EX register;
- copies an offset into the ID/EX register, in case the offset is needed for load, store, or branch;
- copies some instruction address information into the ID/EX register, in case that is needed to generate a branch target address.

11 R-type instructions
R-type is MIPS jargon for instructions such as DADDU, DSUBU, OR, AND, etc. An R-type instruction involves performing some simple ALU computation involving two GPR values, and writing the result to a GPR.

12 IF ID EX MEM WB
The EX stage performs a computation in the ALU.
- For an R-type instruction, the ALU performs whatever operation is appropriate (add, subtract, AND, OR, etc.), and writes the result into the EX/MEM register.
- For a load or store, the ALU computes a memory address, and writes the address into the EX/MEM register.
- For a branch, the ALU computes a branch target address and makes a branch decision. Both of those results get written into the EX/MEM register.
Attention: The branch instruction handling described on this slide is specific to textbook Figure C.22! We'll look at problems related to that design in the next lecture.

13 IF ID EX MEM WB
The MEM stage is mostly for data memory access by loads and stores. Again we pretend that memory is really simple!
- For an R-type instruction, not much happens. Results are copied from the EX/MEM register to the MEM/WB register.
- For a load, data read from memory gets copied into the MEM/WB register.
- For a store, data memory is updated using an address and data found in the EX/MEM register.
- For a branch, if the decision in EX was to take the branch, the PC gets updated with the branch target address.
Attention, again: The branch instruction handling described on this slide is specific to textbook Figure C.22!

14 IF ID EX MEM WB
The WB stage is used to update a GPR with the result of an R-type or load instruction.
- For an R-type or load instruction, a GPR is updated, using the appropriate result from the MEM/WB register. It wasn't mentioned before, but the 5-bit number specifying the destination register had to be passed from ID through EX and MEM to get to WB at the same time as the ALU or load result.
- For a store or a branch, nothing happens in WB. Those instructions finish in MEM.

15 A rough sketch of the 5-stage pipeline
[Figure: the five stages IF, ID, EX, MEM, WB, separated by the clocked pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Labels: PC, add, I-mem, instr. decode, GPRs, ALU, D-mem.]
A lot of detail has been left out, but there's enough here for us to trace processing of LW followed by DSUBU followed by SW.
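The trace of LW, DSUBU, and SW can be tabulated as a pipeline diagram. The following is a minimal sketch, assuming an ideal pipeline with no hazards, so that each instruction simply enters IF one cycle after its predecessor. The helper name `pipeline_diagram` is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return (name, cells) rows for an ideal, hazard-free pipeline.

    cells[c] holds the stage the instruction occupies in clock cycle c + 1,
    or an empty string if the instruction is not in the pipeline then.
    """
    n_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # instruction i starts IF in cycle i + 1
        rows.append((name, cells))
    return rows


if __name__ == "__main__":
    for name, cells in pipeline_diagram(["LW", "DSUBU", "SW"]):
        print(f"{name:6}" + "".join(f"{c:5}" for c in cells))
```

Each column of the printed table is one clock cycle, so reading down a column shows which instruction occupies which stage in that cycle.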

17 Pipeline Hazards
If a certain sequence of instructions prevents the usual throughput of one instruction per clock cycle in a simple pipeline, the situation is called a pipeline hazard. Hazards can be categorized into three main types: structural hazards, data hazards, and control hazards.

18 Structural hazards
These occur when two instructions want to use the same physical resource at the same time, in incompatible ways. For example, if the simple 5-stage pipeline had a single memory unit, instead of split instruction and data memories, MEM of an LW or SW instruction would interfere with IF of a later instruction.
Why is access to three GPRs by two different instructions, one in WB and a later one in ID, not a structural hazard?

19 Structural hazards: solutions
The best solution is to design hardware to avoid structural hazards wherever possible. For example:
- in the simple, 5-stage pipeline, use separate instruction and data memories;
- in real pipelines, have separate I-TLBs and D-TLBs, and separate L1 I-caches and D-caches.
For complex pipelines, it may be practically impossible to avoid all structural hazards, so stalls may be required: if two instructions are contending for a resource, one or the other will be delayed by one or more clock cycles.

20 Data hazards
(We'll use MIPS32 instructions as examples, to match the 32-bit system depicted in textbook Figures C.21 and C.22.)
The most common kind of data hazard is called a RAW hazard: RAW stands for Read-After-Write.
    ADD R8, R9, R10
    SUB R11, R12, R8
For correct processing, SUB must work as if R8 is read by SUB after R8 is written by ADD. (This is where the term RAW comes from.) Let's draw a pipeline diagram to get a precise understanding of the problem.

21 More examples of RAW hazards
For the simple 5-stage pipeline, let's find all the RAW hazards in this sequence...
    LW  R8, 0(R4)
    AND R9, R8, R5
    OR  R10, R6, R8
    SLT R11, R8, R7
Remark: The deeper a pipeline is (the more stages it has), the greater will be the number and complexity of potential RAW hazards.
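Finding RAW hazards by hand amounts to checking, for each source register, whether a recent earlier instruction writes it. Here is a minimal sketch of that check. The tuple encoding `(opcode, destination, sources)` and the `window` parameter are assumptions made for illustration. The default `window=2` assumes that a GPR written in WB can be read in ID in the same clock cycle, so only producers one or two instructions ahead cause trouble when there is no forwarding.

```python
def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) triples.

    program: list of (opcode, dest_register_or_None, [source_registers]).
    A hazard is reported when the most recent writer of a source register
    is at most `window` instructions before the reader.
    """
    hazards = []
    for j, (_, _, sources) in enumerate(program):
        for reg in sources:
            if reg == "R0":  # R0 is hard-wired to zero, never a real dependence
                continue
            # Search backwards so that only the most recent writer counts.
            for i in range(j - 1, max(-1, j - window - 1), -1):
                if program[i][1] == reg:
                    hazards.append((i, j, reg))
                    break
    return hazards


if __name__ == "__main__":
    program = [
        ("LW", "R8", ["R4"]),
        ("AND", "R9", ["R8", "R5"]),
        ("OR", "R10", ["R6", "R8"]),
        ("SLT", "R11", ["R8", "R7"]),
    ]
    for producer, consumer, reg in raw_hazards(program):
        print(program[producer][0], "->", program[consumer][0], "via", reg)
```

Calling `raw_hazards(program, window=3)` models a register file that cannot be written and read in the same cycle; under that assumption the SLT instruction is also affected.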

22 Forwarding
Forwarding is the name given to a technique that can often solve RAW data hazards without loss of clock cycles to stalls. (Another name for forwarding is bypassing.) The essential idea is that if Instruction B depends on the result of Instruction A, Instruction B should not wait for Instruction A to write that result to its destination, but instead grab that result as soon as it is available.
Let's look at how forwarding helps with this sequence...
    LW  R8, 0(R4)
    AND R9, R8, R5
    OR  R10, R6, R8
    SLT R11, R8, R7

23 Sketch of forwarding hardware for 5-stage MIPS32
Here is an incomplete schematic for the EX stage...
[Figure: the ID/EX pipeline register supplies two GPR values and the LW/SW offset. A forward control block drives two selector signals, FwdA and FwdB, which choose ALU inputs A and B from the ID/EX values, the ALU result from the EX/MEM register, or the LW or ALU result from the MEM/WB register. The figure also shows the path for SW data.]

24 Q1: What should the values of the forward control outputs be in the case where no forwarding is needed?
Consider this sequence:
    LW  R8, 0(R4)
    AND R9, R10, R11
    SUB R12, R8, R9
Q2: What should the values of the forward control outputs be when SUB is in the EX stage?
Q3: What are the inputs to forward control and how does the forwarding logic work? (We'll give an example or two, not completely specify the logic!)
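One common way to organize forward control is to compare the source register numbers of the instruction in EX against the destination register numbers held in the EX/MEM and MEM/WB pipeline registers. The sketch below illustrates that idea only; the dictionary fields `reg_write` and `dest` and the symbolic return values are hypothetical and are not the signal names or encodings used in the lecture schematic.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where one ALU operand should come from.

    src_reg: number of the source GPR needed by the instruction in EX.
    ex_mem, mem_wb: dicts with 'reg_write' (bool) and 'dest' (GPR number)
    describing the instructions currently one and two stages ahead.
    """
    # The instruction one stage ahead is the newer writer, so it has priority.
    if ex_mem["reg_write"] and ex_mem["dest"] != 0 and ex_mem["dest"] == src_reg:
        return "EX/MEM"
    if mem_wb["reg_write"] and mem_wb["dest"] != 0 and mem_wb["dest"] == src_reg:
        return "MEM/WB"
    return "ID/EX"  # no forwarding: use the value read from the GPRs in ID


if __name__ == "__main__":
    # SUB R12, R8, R9 is in EX; AND (writes R9) is in MEM; LW (writes R8) is in WB.
    ex_mem = {"reg_write": True, "dest": 9}
    mem_wb = {"reg_write": True, "dest": 8}
    print("A:", forward_select(8, ex_mem, mem_wb))  # A: MEM/WB
    print("B:", forward_select(9, ex_mem, mem_wb))  # B: EX/MEM
```

This sketch does not handle the case where the EX/MEM register holds a load: the value there is an address, not the loaded data, so separate hazard detection would be needed.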

25 Can forwarding solve all RAW hazards?
Consider this sequence:
    LW  R15, 0(R14)
    ADD R16, R17, R15
Is it possible to solve the hazard by forwarding? If not, what is the most time-efficient way to solve the hazard?
Let's make some general remarks about optimal solutions of RAW data hazards.
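In the 5-stage design, loaded data first becomes available at the end of MEM, which is too late for an instruction that needs it at the start of EX in that same cycle. A typical remedy is a one-cycle stall followed by forwarding. The sketch below shows a conservative check for that situation; the function name and the tuple encoding `(opcode, destination, sources)` are assumptions for illustration.

```python
def needs_load_use_stall(producer, consumer):
    """Return True if `consumer` immediately follows a load whose result it reads.

    Both arguments are (opcode, dest_register_or_None, [source_registers]).
    The check is conservative: it does not try to recognize cases, such as
    store data, where a cleverer design might avoid the stall.
    """
    opcode, dest, _ = producer
    _, _, sources = consumer
    return opcode == "LW" and dest != "R0" and dest in sources


if __name__ == "__main__":
    lw = ("LW", "R15", ["R14"])
    add = ("ADD", "R16", ["R17", "R15"])
    print(needs_load_use_stall(lw, add))  # True
```

A compiler can sometimes remove the stall entirely by moving an independent instruction between the load and its consumer.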

26 Control hazards: Introduction
In a simple pipeline, a control hazard is a difficulty in determining the address to use for the next Instruction Fetch.
Look at this example, and assume a version of MIPS32 in which the delay slot instruction is not supposed to be completed if the branch is taken:
    L1: LW  R9, 0(R5)
        (instructions in loop body)
        BEQ R8, R0, L1
        OR  R16, R10, R0
In the clock cycle after IF for the BEQ instruction, why is doing IF difficult? (There is more than one reason.)

27 Control hazards: Not just for conditional branches!
In a conditional branch, there is an obvious motivation to wait for the decision about whether or not to take the branch. But consider the following unconditional updates to the PC:
- jump within a procedure;
- procedure call;
- procedure return.
Why do these kinds of instructions generate control hazards? How many cycles might be lost due to such a hazard in a 5-stage pipeline like the one we've been looking at?

28 Old school solutions to control hazards
(1) Stall as long as necessary to ensure that instruction results are correct. This obviously makes CPI worse (higher) if programs have lots of conditional branches and unconditional jumps.

29 Old school solutions to control hazards
(2) Delayed jumps and branches. Because it is very difficult to do IF properly in the cycle immediately following a jump or a taken branch, many ISA designs decreed that the successor to a jump or branch would always be completed before the jump or branch target instruction...
         BEQ R12, R0, L99
         ADD R13, R14, R15   # successor
         (more instructions)
    L99: SUB R8, R9, R10     # branch target
         OR  R16, R8, R0
Real MIPS ISAs (as opposed to some hypothetical MIPS-like ISAs in textbooks and lecture slides) have delayed branches and jumps.

31 Dynamic branch prediction
Dynamic branch prediction is the most important current technology for management of control hazards.
- A branch prediction circuit is a memory array comparable in size to an L1 I-cache, and somewhat more complex.
- A branch prediction circuit records the locations of thousands of recently-encountered branches and jumps, along with the addresses of their targets.
- For each conditional branch, a branch prediction circuit maintains a few bits of information that can be used to predict whether the branch will be taken or untaken.

32 Branch prediction code example
p and past_last are of type int*. count is an int.
    do {
        if (*p < 0)
            count++;
        p++;
    } while (p != past_last);
p walks through an array of int elements, and count records how many of those elements are negative.

33 Branch prediction code example, continued
Assembly language for a MIPS32-like ISA that does not have delayed branch...
    L1: LW    R8, 0(R4)
        SLT   R9, R8, R0      # R9 = (*p < 0)
        BEQ   R9, R0, L2      # branch if !(*p < 0)
        ADDIU R25, R25, 1     # count++
    L2: ADDIU R4, R4, 4       # p++
        BNE   R4, R24, L1     # branch if p != past_last
Let's suppose that there are a lot of array elements, and most of them are negative. As the processor runs the loop, what predictions will it learn to make about the BEQ and BNE instructions?
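To experiment with what a predictor might learn for the BEQ and BNE instructions in this loop, here is a minimal simulation sketch. It assumes a 2-bit saturating counter per branch, which is one standard scheme; the slides say only that "a few bits" are kept per branch, so the counter design, the initial state, and all names below are assumptions.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict untaken, 2 and 3 predict taken."""

    def __init__(self, state=1):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def simulate_loop(data):
    """Count mispredictions for the BEQ and BNE in the counting loop."""
    predictors = {"BEQ": TwoBitPredictor(), "BNE": TwoBitPredictor()}
    misses = {"BEQ": 0, "BNE": 0}
    for i, x in enumerate(data):
        outcomes = {
            "BEQ": not (x < 0),           # taken when count++ is skipped
            "BNE": i != len(data) - 1,    # taken on every pass except the last
        }
        for name, taken in outcomes.items():
            if predictors[name].predict() != taken:
                misses[name] += 1
            predictors[name].update(taken)
    return misses


if __name__ == "__main__":
    # Hypothetical data: 100 elements, every tenth one non-negative.
    data = [5 if i % 10 == 9 else -1 for i in range(100)]
    print(simulate_loop(data))  # {'BEQ': 10, 'BNE': 2}
```

With this data the BEQ is mispredicted only on the occasional non-negative element, and the BNE is mispredicted once while warming up and once at loop exit.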

34 Scalar versus Superscalar
It seems like the right moment to introduce these terms.
A scalar processor core starts no more than one instruction per clock cycle. In some cycles it can't start an instruction, due to a stall caused by a pipeline hazard. All of the pipeline examples so far have been for scalar cores.
A superscalar processor core tries to start two or more instructions per clock cycle. When I start talking about superscalar cores, I will let you know.

35 A 5-stage pipeline with dynamic branch prediction
Let's review our previous sketch of the 5-stage pipeline, then show how it would be modified to support dynamic branch prediction. An instruction fetch unit encapsulates a PC, an L1 I-cache, and a branch prediction circuit. Both sketches are for scalar systems.

36 A rough sketch of the 5-stage pipeline
These are the pieces we saw previously...
[Figure: the five stages IF, ID, EX, MEM, WB, separated by the clocked pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Labels: PC, add, I-mem, instr. decode, GPRs, ALU, D-mem.]

37 5-stage pipeline with dynamic branch prediction
Note that a monster has moved into the IF stage...
[Figure: the same five-stage sketch, with the PC, adder, and I-mem in the IF stage replaced by a single clocked instruction fetch unit. Other labels: instr. decode, GPRs, ALU, D-mem, IF/ID, ID/EX, EX/MEM, MEM/WB.]

38 Scalar performance with dynamic branch prediction
If the branch predictor does a good job, CPI will be very close to 1. What are two reasons why, for most programs, CPI will be somewhat greater than 1?
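One possible way to reason about CPI for a scalar pipeline is to start from the ideal value of 1 and add the average number of stall cycles per instruction from each source. The sketch below does this for two sources, branch mispredictions and load-use stalls. The choice of those two sources, the function name, and every number in the usage example are hypothetical illustrations, not measurements or values from the course.

```python
def effective_cpi(branch_fraction, mispredict_rate, mispredict_penalty,
                  load_use_fraction, load_use_penalty=1):
    """Ideal CPI of 1 plus average stall cycles per instruction.

    branch_fraction: fraction of instructions that are branches or jumps.
    mispredict_rate: fraction of those that are mispredicted.
    mispredict_penalty: cycles lost per misprediction.
    load_use_fraction: fraction of instructions that stall behind a load.
    load_use_penalty: cycles lost per load-use stall.
    """
    branch_stalls = branch_fraction * mispredict_rate * mispredict_penalty
    load_stalls = load_use_fraction * load_use_penalty
    return 1.0 + branch_stalls + load_stalls


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, 5% of them mispredicted at a cost
    # of 3 cycles each, and 10% of instructions suffer a 1-cycle load-use stall.
    print(round(effective_cpi(0.20, 0.05, 3, 0.10), 3))  # 1.13
```

The same additive pattern extends to other stall sources, such as cache misses, by adding one more term per source.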

40 A quick, incomplete review of floating-point numbers
A lot of textbook examples use floating-point instructions, so a brief review might be a good idea. Essentially, floating-point is a base two version of scientific notation.
Here's an example of scientific notation: The mass of the earth is about 5,970,000,000,000,000,000,000,000 kg, more conveniently written as 5.97 × 10^24 kg.

41 Any nonzero real number can be written as
    $\text{sign} \times 2^{\text{exponent}} \times (1 + \text{fraction})$,
where the exponent is an integer and $0 \le \text{fraction} < 1.0$.
If we have a finite number of exponent bits, that will limit the magnitude range of the numbers we can represent. With a finite number of fraction bits, most real numbers can only be approximated: floating-point representation involves rounding error.
For a computer to work with floating-point numbers, we need a way to organize sign, exponent, and fraction bits into fixed-size chunks...

42 Bit fields in 64-bit floating-point
[Figure: 64-bit layout, from left to right: sign bit, 11 exponent bits, 52 fraction bits.]
Sign bit: 0 for positive, 1 for negative.
Exponent: Uses a bias of 01111111111 (base two) = 1023 (base ten). Example bit patterns:
- 01111111111 means the exponent is zero;
- 01111111110 means the exponent is -1;
- 10000000000 means the exponent is +1.

43 [Figure: 64-bit layout, from left to right: sign bit, 11 exponent bits, 52 fraction bits.]
Fraction bits: Only bits from the right side of the binary point are recorded. It is assumed that there is a single 1 bit to the left of the binary point, so that bit need not be recorded.
Example: How is ten represented? = = two sign, exponent, and fraction are:
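The three bit fields can be inspected directly on a machine that uses IEEE 754 doubles. The following sketch uses Python's standard `struct` module to split a double into sign, biased exponent, and fraction, and then rebuilds the value from the formula on slide 41. The function names and the example value 9.75 are chosen here for illustration; 9.75 is not the number used in the original slide's example.

```python
import struct


def fp64_fields(x):
    """Split a 64-bit float into (sign bit, biased exponent, 52-bit fraction)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    biased_exp = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, biased_exp, fraction


def fp64_value(sign, biased_exp, fraction):
    """Rebuild a normal number: sign * 2**exponent * (1 + fraction)."""
    return (-1.0) ** sign * 2.0 ** (biased_exp - 1023) * (1 + fraction / 2 ** 52)


if __name__ == "__main__":
    # 9.75 (base ten) = 1001.11 (base two) = 1.00111 (base two) * 2**3
    sign, biased_exp, fraction = fp64_fields(9.75)
    print(sign, biased_exp - 1023, format(fraction, "052b")[:8])
    print(fp64_value(sign, biased_exp, fraction))
```

Note that the leading 1 of 1.00111 does not appear in the fraction field; only the bits 00111 followed by zeros are stored.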

44 In IEEE 754 floating-point formats there are some special bit patterns:
- zero
- +∞ and -∞
- NaN: not a number.
For example in IEEE 754, the result of 1.0/0.0 is +∞, but the result of 0.0/0.0 is NaN.
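In the 64-bit format, the special values are recognizable from their exponent and fraction fields: an all-ones exponent marks infinity (fraction zero) or NaN (fraction nonzero), and an all-zeros exponent with a zero fraction marks zero. Here is a small sketch that classifies a double on that basis; the function names and category strings are hypothetical. Python itself raises `ZeroDivisionError` for `1.0/0.0` instead of returning +∞, so the example takes its special values from the `math` module.

```python
import math
import struct


def fp64_fields(x):
    """Split a 64-bit float into (sign bit, biased exponent, 52-bit fraction)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)


def classify(x):
    """Name the special bit pattern of x, or return 'other'."""
    sign, biased_exp, fraction = fp64_fields(x)
    if biased_exp == 0x7FF:  # exponent field is all ones
        if fraction:
            return "NaN"
        return "-inf" if sign else "+inf"
    if biased_exp == 0 and fraction == 0:  # all-zero exponent and fraction
        return "-zero" if sign else "+zero"
    return "other"


if __name__ == "__main__":
    for value in (0.0, -0.0, math.inf, -math.inf, math.nan, 1.0):
        print(repr(value), classify(value))
```

The sketch lumps subnormal numbers in with ordinary numbers under "other", since they are outside the scope of this quick review.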

45 FP multiplication
If A and B are nonzero, then $A \times B$ is
    $\text{sign}_A \times \text{sign}_B \times 2^{(\text{exponent}_A + \text{exponent}_B)} \times (1 + \text{fraction}_A) \times (1 + \text{fraction}_B)$
To do an FP multiplication, a logic circuit first has to check that operands are not zero or other special bit patterns. If the operands aren't special, the step that costs the most time (and energy) is the 53-bit-by-53-bit integer multiplication for $(1 + \text{fraction}_A) \times (1 + \text{fraction}_B)$. At the end, there must be rounding, exponent adjustment, and a check for underflow or overflow.
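The multiplication steps can be followed in software using integer arithmetic on the bit fields. The sketch below handles only the simple case: both operands are normal numbers and the result neither overflows nor underflows. It rounds to nearest, ties to even, which is the default IEEE 754 rounding mode. All function names are hypothetical, and real hardware would organize this work quite differently.

```python
import struct

FRAC_BITS = 52
BIAS = 1023


def to_fields(x):
    """Split a double into (sign, biased exponent, fraction)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> FRAC_BITS) & 0x7FF, bits & ((1 << FRAC_BITS) - 1)


def from_fields(sign, biased_exp, fraction):
    """Assemble a double from its three bit fields."""
    bits = (sign << 63) | (biased_exp << FRAC_BITS) | fraction
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def fp_multiply(a, b):
    """Multiply two normal doubles; special cases are NOT handled."""
    sign_a, exp_a, frac_a = to_fields(a)
    sign_b, exp_b, frac_b = to_fields(b)
    sign = sign_a ^ sign_b
    biased_exp = exp_a + exp_b - BIAS  # adding biased exponents counts the bias twice
    # The costly step: 53-bit by 53-bit integer multiplication.
    product = ((1 << FRAC_BITS) | frac_a) * ((1 << FRAC_BITS) | frac_b)
    # product lies in [2**104, 2**106); normalize so one bit is left of the point.
    shift = FRAC_BITS
    if product >> (2 * FRAC_BITS + 1):
        shift += 1
        biased_exp += 1
    significand = product >> shift
    remainder = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    # Round to nearest, ties to even.
    if remainder > half or (remainder == half and significand & 1):
        significand += 1
        if significand >> (FRAC_BITS + 1):  # rounding carried out of 53 bits
            significand >>= 1
            biased_exp += 1
    return from_fields(sign, biased_exp, significand & ((1 << FRAC_BITS) - 1))


if __name__ == "__main__":
    print(fp_multiply(1.5, 2.5))
    print(fp_multiply(0.1, 0.2) == 0.1 * 0.2)
```

Comparing the result against the machine's own multiplication, as in the last line, is a convenient way to check the rounding logic.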

46 Will FP multiplication fit into a single clock cycle?
No! An example in textbook Section C.5 suggests a latency of 7 clock cycles for FP multiplication. The same example suggests a latency of 4 clock cycles for FP addition or subtraction, which are easier than FP multiplication, but much more complicated than integer addition or subtraction.
Those numbers are examples. Together, Moore's Law and the ingenuity of circuit designers imply that the latencies of FP arithmetic operations vary from year to year and from one design to another.

48 Fitting FP operations into the 5-stage pipeline
Actually, this applies to fitting in integer multiplication and integer division as well. Let's follow the textbook example:
- 7-cycle latency for FP or integer multiplication
- 4-cycle latency for FP addition
- 24-cycle latency for FP or integer division (Note: Division is notoriously hard to do fast in digital logic!)
We are going to have to give up on our nice, easy 1-cycle EX stage in the middle of the 5-stage pipeline!

49 Let's make some notes about this picture...
[Figure: between ID and MEM, an instruction passes through one of four functional units: the integer unit (EX), the FP/integer multiplier (M1 to M7), the FP adder (A1 to A4), or the FP/integer divider (DIV). IF and ID come before the units; MEM and WB come after.]
Image is Figure C.35 from Hennessy J. L. and Patterson D. A., Computer Architecture: A Quantitative Approach, 5th ed., © 2012, Elsevier, Inc.

50 Attention...
The picture on slide 49 makes it clear that the simple 5-stage model used to introduce pipelining hides some important real-world difficulties! But the picture is still hiding one of the major difficulties in modern computer design. (Well, actually, it's hiding more than one such difficulty.)
What is the most glaring oversimplification in the picture?

51 Quick overview of MIPS FP instructions
Many versions of the MIPS ISA have 16 64-bit floating-point registers: F0, F2, F4, ..., F30. Note the use of even numbers only for FPRs. (Newer ISA versions have 32 64-bit FPRs.)
F0 is not special. Unlike the GPR R0, F0 is not hard-wired to have a value of 0.0.

52 Loads, stores and arithmetic are easy to understand. Here is a very short example:
    L.D   F2, 0(R4)     # load
    L.D   F4, 0(R5)     # load
    MUL.D F6, F2, F4    # multiply
    S.D   F6, 0(R7)     # store
Note the use of GPRs for addresses. Remember, memory addresses are integers!
The suffix .D is for double precision. Use .S instead to work with 32-bit single precision FP numbers.
To understand examples in ENCM 501, we do not need to know the details of instructions for FP comparison, branching on FP comparison results, or converting between integer and FP formats.

54 In-order versus out-of-order
In-order execution of instructions implies that instructions are processed in the same order that they would be in a hypothetical computer that always completes one instruction before starting the next.
The simple 5-stage pipeline is in-order, even though there are usually 5 instructions in flight within the pipeline. (What about instructions that get into the 5-stage pipeline but get cancelled due to a branch?)
Out-of-order execution implies that start and completion of instructions is often but not always in-order.

55 5-stage pipeline with variable-length EX stage
This pipeline always starts instructions in-order. This is known as in-order issue of instructions.
However, there is a design choice to be made: Should we allow instructions to complete out-of-order?
What are the advantages and disadvantages of forcing instruction completion to be in-order? What are some challenges created by allowing out-of-order completion?
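To see how out-of-order completion arises from in-order issue, the sketch below computes the clock cycle in which each instruction would reach WB, using the example latencies from slide 48 plus a 1-cycle integer EX. It assumes one instruction is issued per cycle, ignores data hazards, and never stalls, so it describes an idealized machine rather than a working design. The names `LATENCY`, `wb_cycles`, and `completion_report` are hypothetical.

```python
# EX-stage latencies in clock cycles, following the textbook example.
LATENCY = {"int": 1, "fp_add": 4, "mul": 7, "div": 24}


def wb_cycles(kinds):
    """Return the WB cycle of each instruction, given its functional unit kind.

    Instruction i (0-based) does IF in cycle i + 1 and ID in cycle i + 2,
    spends LATENCY[kind] cycles in EX, then one cycle each in MEM and WB.
    """
    return [i + 4 + LATENCY[kind] for i, kind in enumerate(kinds)]


def completion_report(kinds):
    """Return (wb cycle list, completes out of order?, two WBs in one cycle?)."""
    wb = wb_cycles(kinds)
    out_of_order = any(later < earlier for earlier, later in zip(wb, wb[1:]))
    wb_conflict = len(wb) != len(set(wb))
    return wb, out_of_order, wb_conflict


if __name__ == "__main__":
    print(completion_report(["mul", "fp_add", "int"]))
    # ([11, 9, 7], True, False)
    print(completion_report(["fp_add", "int", "int", "int"]))
    # ([8, 6, 7, 8], True, True)
```

The second example shows two instructions arriving at WB in the same cycle, which is one kind of difficulty a real design has to resolve, for instance by stalling one of them.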

56 More about out-of-order execution...
In the next slide set, we'll look in detail at hazards related to out-of-order execution.
Then we'll look at an organization for out-of-order execution called Tomasulo's algorithm, which:
- solves RAW hazards, in a way that is interesting to compare to forwarding;
- solves so-called WAW and WAR hazards, which can't happen with in-order execution;
- deals effectively with variable latencies related to different kinds of arithmetic, and with variable latencies in memory access due to TLB and cache misses.
