MC9211 Computer Organization. Unit 4 Lesson 1 Processor Design

Sylvia Russell
5 years ago

1 MC9211 Computer Organization, Unit 4 Lesson 1: Processor Design

2 Basic Processing Unit

3 Connection Between the Processor and the Memory
Figure 1.2. Connections between the processor and the memory (blocks shown: Memory, MAR, MDR, PC, IR, Control, ALU, and n general-purpose registers R0 to Rn-1).

4 Add LOCA, R0
1. Transfer the contents of register PC to register MAR.
2. Issue a Read command to memory, and wait until it has transferred the requested word into register MDR.
3. Transfer the instruction from MDR into IR and decode it.
4. Transfer the address LOCA from IR to MAR.
5. Issue a Read command and wait until MDR is loaded.
6. Transfer the contents of MDR to the ALU.
7. Transfer the contents of R0 to the ALU.
8. Perform the addition of the two operands in the ALU and transfer the result into R0.
9. Transfer the contents of PC to the ALU.
10. Add 4 to the operand in the ALU and transfer the incremented address to PC.

5 Overview
Instruction Set Processor (ISP), also called the Central Processing Unit (CPU).
A typical computing task consists of a series of steps specified by a sequence of machine instructions that constitute a program.
An instruction is executed by carrying out a sequence of more rudimentary operations.

6 Some Fundamental Concepts

7 Fundamental Concepts
The processor fetches one instruction at a time and performs the operation specified.
Instructions are fetched from successive memory locations until a branch or a jump instruction is encountered.
The processor keeps track of the address of the memory location containing the next instruction to be fetched using the Program Counter (PC).
The Instruction Register (IR) holds the instruction that is currently being executed.

8 Executing an Instruction
1. Fetch the contents of the memory location pointed to by the PC. The contents of this location are loaded into the IR (fetch phase).
   IR ← [[PC]]
2. Assuming that the memory is byte addressable, increment the contents of the PC by 4 (fetch phase).
   PC ← [PC] + 4
3. Carry out the actions specified by the instruction in the IR (execution phase).
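
The three steps above can be sketched as a small interpreter loop. This is only an illustration: the tuple instruction format, the opcode names, and the dictionary used as memory are assumptions made for the example, not part of the lesson.

```python
# Minimal sketch of the fetch/execute cycle. The instruction encoding
# (tuples such as ("add", src, dst)) is hypothetical.

def run(memory, registers, pc=0):
    """Fetch and execute instructions until ("halt",) is reached; return the final PC."""
    while True:
        ir = memory[pc]            # fetch phase: IR <- [[PC]]
        pc += 4                    # fetch phase: PC <- [PC] + 4
        op = ir[0]                 # execution phase: act on the instruction in IR
        if op == "halt":
            return pc
        if op == "add":            # ("add", src, dst): dst <- [src] + [dst]
            _, src, dst = ir
            registers[dst] += registers[src]
        elif op == "branch":       # ("branch", offset): PC <- [PC] + offset
            pc += ir[1]
        else:
            raise ValueError(f"unknown opcode {op!r}")

program = {
    0: ("add", "R1", "R2"),
    4: ("branch", 4),              # skips the instruction at address 8
    8: ("add", "R1", "R2"),
    12: ("halt",),
}
regs = {"R1": 5, "R2": 10}
final_pc = run(program, regs)
```

Note that the branch offset is applied to the already incremented PC, which matches the order of the fetch-phase steps.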

9 Processor Organization
Figure 7.1. Single-bus organization of the datapath inside a processor (Textbook Page 43). Components on the internal processor bus: PC, MAR (address lines of the memory bus), MDR (data lines of the memory bus), IR with the instruction decoder and control logic, Y, Constant 4, the Select MUX, the ALU (inputs A and B, carry-in, control lines Add, Sub, ..., XOR), Z, registers R0 to R(n-1), and TEMP.

10 Executing an Instruction
An instruction is executed by performing one or more of the following operations:
- Transfer a word of data from one processor register to another or to the ALU.
- Perform an arithmetic or a logic operation and store the result in a processor register.
- Fetch the contents of a given memory location and load them into a processor register.
- Store a word of data from a processor register into a given memory location.

11 Register Transfers
Figure 7.2. Input and output gating for the registers in Figure 7.1 (Textbook Page 46). Each register Ri has an input control Riin and an output control Riout on the internal processor bus; Y has Yin, and Z has Zin and Zout.

12 Register Transfers
All operations and data transfers are controlled by the processor clock.
Figure 7.3. Input and output gating for one register bit (a D flip-flop with Riin, Riout, and Clock).

13 Performing an Arithmetic or Logic Operation
The ALU is a combinational circuit that has no internal storage.
The ALU gets its two operands from the MUX and the bus.
The result is temporarily stored in register Z.
What is the sequence of operations to add the contents of register R1 to those of R2 and store the result in R3?
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
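
The three control steps for R3 ← [R1] + [R2] can be traced in a few lines. This is a sketch under simplifying assumptions: registers are dictionary entries, and each step moves exactly one value over the single bus; the function name is made up for the example.

```python
# Hypothetical trace of the single-bus sequence: one bus transfer per step.

def add_r1_r2_into_r3(regs):
    state = dict(regs, Y=0, Z=0)
    # Step 1: R1out, Yin (R1 drives the bus, Y latches it)
    bus = state["R1"]
    state["Y"] = bus
    # Step 2: R2out, SelectY, Add, Zin (ALU adds Y and the bus, Z latches the sum)
    bus = state["R2"]
    state["Z"] = state["Y"] + bus
    # Step 3: Zout, R3in (Z drives the bus, R3 latches it)
    bus = state["Z"]
    state["R3"] = bus
    return state

result = add_r1_r2_into_r3({"R1": 7, "R2": 35, "R3": 0})
```

Three steps are needed because only one register can drive the bus at a time, so one operand has to be parked in Y and the result in Z.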

14 Fetching a Word from Memory
Address into MAR; issue Read operation; data into MDR.
Figure 7.4. Connection and control signals for register MDR (MDRin and MDRout on the internal processor bus side, MDRinE and MDRoutE on the memory-bus data lines side).

15 Fetching a Word from Memory
The response time of each memory access varies (cache miss, memory-mapped I/O, and so on).
To accommodate this, the processor waits until it receives an indication that the requested operation has been completed (Memory-Function-Completed, MFC).
Move (R1), R2
1. MAR ← [R1]
2. Start a Read operation on the memory bus
3. Wait for the MFC response from the memory
4. Load MDR from the memory bus
5. R2 ← [MDR]

16 Timing
Assume MAR is always available on the address lines of the memory bus.
Move (R1), R2
1. R1out, MARin, Read
2. MDRinE, WMFC
3. MDRout, R2in
Figure 7.5. Timing of a memory Read operation (signals shown over steps 1 to 3: Clock, MARin, Address, Read, MR, MDRinE, Data, MFC, MDRout).
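
A rough way to see the effect of WMFC is to count clock cycles for Move (R1), R2 under different memory response times. The sketch below assumes that step 2 simply lasts as many cycles as the memory needs before it asserts MFC; the `latency` parameter and the function name are illustrative, not from the textbook.

```python
# Hypothetical cycle count for Move (R1), R2 with a variable memory latency.

def move_indirect(regs, memory, latency):
    """Execute Move (R1), R2 and return the number of clock cycles used."""
    if latency < 1:
        raise ValueError("memory needs at least one cycle to respond")
    cycles = 0
    # Step 1: R1out, MARin, Read
    mar = regs["R1"]
    cycles += 1
    # Step 2: MDRinE, WMFC (the step is stretched until MFC arrives)
    cycles += latency
    mdr = memory[mar]
    # Step 3: MDRout, R2in
    regs["R2"] = mdr
    cycles += 1
    return cycles

regs = {"R1": 100, "R2": 0}
memory = {100: 99}
fast_cycles = move_indirect(regs, memory, latency=1)   # e.g. a cache hit
slow_cycles = move_indirect(regs, memory, latency=5)   # e.g. a cache miss
```

The control sequence is identical in both cases; only the duration of the WMFC step changes.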

17 Execution of a Complete Instruction
Add (R3), R1
1. Fetch the instruction.
2. Fetch the first operand (the contents of the memory location pointed to by R3).
3. Perform the addition.
4. Load the result into R1.

18 Architecture
Figure 7.2. Input and output gating for the registers in Figure 7.1.

19 Execution of a Complete Instruction
Add (R3), R1
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     R3out, MARin, Read
5     R1out, Yin, WMFC
6     MDRout, SelectY, Add, Zin
7     Zout, R1in, End
Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.
Question: Add R2, R1?
Figure 7.1. Single-bus organization of the datapath inside a processor.

20 Execution of a Complete Instruction
Add R2, R1
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     R3out, MARin, Read
5     R1out, Yin, WMFC
6     MDRout, SelectY, Add, Zin (slide annotation at this step: R2out)
7     Zout, R1in, End
Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.
Figure 7.1. Single-bus organization of the datapath inside a processor.

21 Execution of Branch Instructions
A branch instruction replaces the contents of PC with the branch target address, which is usually obtained by adding an offset X, given in the branch instruction, to the updated contents of PC.
The offset X is usually the difference between the branch target address and the address immediately following the branch instruction.
Conditional branch: the condition codes must be checked before the new value is loaded into PC.

22 Execution of Branch Instructions
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     Offset-field-of-IRout, Add, Zin
5     Zout, PCin, End
Figure 7.7. Control sequence for an unconditional branch instruction.
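
The relation between the offset X, the branch address, and the target can be checked with a little arithmetic. This sketch assumes 4-byte instructions, as in the PC ← [PC] + 4 step; the function names and the sample addresses are made up for the example.

```python
# Offset arithmetic for a PC-relative branch, assuming 4-byte instructions.

def branch_target(branch_address, offset, instruction_size=4):
    """Target = address immediately following the branch + offset X."""
    updated_pc = branch_address + instruction_size   # PC was already incremented during fetch
    return updated_pc + offset

def branch_offset(branch_address, target, instruction_size=4):
    """Offset X = target - address immediately following the branch."""
    return target - (branch_address + instruction_size)

forward = branch_target(2000, 48)       # branch at 2000, X = 48
backward = branch_offset(2000, 1900)    # negative offset for a backward branch
```

Because step 2 of the control sequence loads the incremented PC into Y, step 4 adds the offset to the updated PC rather than to the address of the branch itself.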

23 Multiple-Bus Organization (Textbook Page 424)
- Allows the contents of two different registers to be accessed simultaneously and placed on buses A and B.
- Allows the data on bus C to be loaded into a third register during the same clock cycle.
- An incrementer unit updates the PC.
- The ALU can simply pass one of its two input operands unmodified to bus C (control signal: R=A or R=B).
Figure 7.8. Three-bus organization of the datapath (buses A, B, and C; register file; PC with incrementer; Constant 4 and MUX; ALU; instruction decoder; IR; MDR on the memory bus data lines; MAR on the address lines).

24 Multiple-Bus Organization
Add R4, R5, R6
Step  Action
1     PCout, R=B, MARin, Read, IncPC
2     WMFC
3     MDRoutB, R=B, IRin
4     R4outA, R5outB, SelectA, Add, R6in, End
Figure 7.9. Control sequence for the instruction Add R4,R5,R6 for the three-bus organization in Figure 7.8.

25 Exercise
What is the control sequence for execution of the instruction Add R1, R2, including the instruction fetch phase? (Assume the single-bus architecture.)
Figure 7.1. Single-bus organization of the datapath inside a processor.

26 Hardwired Control

27 Overview
To execute instructions, the processor must have some means of generating the control signals needed in the proper sequence.
Two categories: hardwired control and microprogrammed control.
A hardwired system can operate at high speed, but offers little flexibility.

28 Control Unit Organization
Figure 7.10. Control unit organization (the clock CLK drives the control step counter; the decoder/encoder combines the step count, IR, external inputs, and condition codes to produce the control signals).

29 Detailed Block Description
Figure 7.11. Separation of the decoding and encoding functions (the control step counter with Reset feeds a step decoder producing T1, T2, ..., Tn; the instruction decoder produces INS1, INS2, ..., INSm from IR; the encoder combines these with external inputs and condition codes to generate the control signals, including Run and End).

30 Generating Zin
Zin = T1 + T6·ADD + T4·BR + ...
Figure 7.12. Generation of the Zin control signal for the processor in Figure 7.1.

31 Generating End
End = T7·ADD + T5·BR + (T5·N + T4·N')·BRN + ...
(N' denotes the complement of N; ADD, BR, and BRN are the decoder outputs for Add, Branch, and Branch<0.)
Figure 7.13. Generation of the End control signal.
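
The encoder logic for the two signals can be written directly as Boolean functions, taking Zin = T1 + T6·ADD + T4·BR + ... and End = T7·ADD + T5·BR + (T5·N + T4·N')·BRN + ... as the starting point. The sketch covers only the terms that are written out; the "..." terms for other instructions are omitted, and the function names are illustrative.

```python
# Boolean sketch of two hardwired control signals.
# `step` is the active time slot (T1 -> 1, T2 -> 2, ...), `instr` is the
# decoded instruction ("ADD", "BR", or "BRN" for Branch<0), and `n_flag`
# is the N condition code.

def z_in(step, instr):
    return step == 1 or (step == 6 and instr == "ADD") or (step == 4 and instr == "BR")

def end(step, instr, n_flag):
    return (
        (step == 7 and instr == "ADD")
        or (step == 5 and instr == "BR")
        or (instr == "BRN" and ((step == 5 and n_flag) or (step == 4 and not n_flag)))
    )

# Branch<0 with N = 0 ends early at T4; with N = 1 it runs on to T5.
ends_early = end(4, "BRN", n_flag=False)
ends_late = end(5, "BRN", n_flag=True)
```

In hardware each `and` corresponds to an AND gate fed by a step-decoder output and an instruction-decoder output, and the `or` chain is the final OR gate.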

32 A Complete Processor
Figure 7.14. Block diagram of a complete processor (instruction unit, integer unit, floating-point unit, instruction cache, data cache, and bus interface inside the processor; system bus connecting main memory and input/output).

33 Microprogrammed Control

34 Microprogrammed Control
Control signals are generated by a program similar to machine language programs.
Terms: Control Word (CW); microroutine; microinstruction (Textbook page43).
Figure 7.15. An example of microinstructions for Figure 7.6 (one bit column per control signal: PCin, PCout, MARin, Read, MDRout, IRin, Yin, Select, Add, Zin, Zout, R1out, R1in, R3out, WMFC, End).

35 Overview (Textbook page 42)
Step  Action
1     PCout, MARin, Read, Select4, Add, Zin
2     Zout, PCin, Yin, WMFC
3     MDRout, IRin
4     R3out, MARin, Read
5     R1out, Yin, WMFC
6     MDRout, SelectY, Add, Zin
7     Zout, R1in, End
Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.

36 Basic organization of a microprogrammed control unit
One function cannot be carried out by this simple organization.
Figure 7.16. Basic organization of a microprogrammed control unit (IR feeds the starting address generator, which loads the µPC; the clock advances the µPC; the µPC addresses the control store, which outputs the CW).

37 Conditional branch
The previous organization cannot handle the situation when the control unit is required to check the status of the condition codes or external inputs to choose between alternative courses of action.
Use a conditional branch microinstruction.
Address  Microinstruction
0        PCout, MARin, Read, Select4, Add, Zin
1        Zout, PCin, Yin, WMFC
2        MDRout, IRin
3        Branch to starting address of appropriate microroutine
...
25       If N=0, then branch to microinstruction 0
26       Offset-field-of-IRout, SelectY, Add, Zin
27       Zout, PCin, End
Figure 7.17. Microroutine for the instruction Branch<0.
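
How a µPC walks through a control store that contains a conditional branch microinstruction can be simulated in a few lines. The sketch assumes the Branch<0 microroutine lives at addresses 25 to 27 with the common fetch sequence at 0 to 2; the two marker strings and the function name are invented for the example.

```python
# Hypothetical control store for Branch<0. Lists are control words; the two
# string markers stand for the special branch microinstructions.

CONTROL_STORE = {
    0: ["PCout", "MARin", "Read", "Select4", "Add", "Zin"],
    1: ["Zout", "PCin", "Yin", "WMFC"],
    2: ["MDRout", "IRin"],
    3: "BRANCH_TO_MICROROUTINE",
    25: "BRANCH_TO_0_IF_N_CLEAR",
    26: ["Offset-field-of-IRout", "SelectY", "Add", "Zin"],
    27: ["Zout", "PCin", "End"],
}

def run_microroutine(control_store, start_address, n_flag):
    """Return the control words issued while executing one machine instruction."""
    upc = 0
    issued = []
    while True:
        word = control_store[upc]
        if word == "BRANCH_TO_MICROROUTINE":
            upc = start_address            # supplied by the starting address generator
            continue
        if word == "BRANCH_TO_0_IF_N_CLEAR":
            if not n_flag:
                return issued              # back to address 0: fetch the next instruction
            upc += 1
            continue
        issued.append(word)
        if "End" in word:
            return issued
        upc += 1

taken = run_microroutine(CONTROL_STORE, start_address=25, n_flag=True)
not_taken = run_microroutine(CONTROL_STORE, start_address=25, n_flag=False)
```

When N is set, the two microinstructions that compute the target are issued after the fetch; when N is clear, the routine stops after the fetch and the PC keeps its incremented value.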

38 Microprogrammed Control
Figure 7.18. Organization of the control unit to allow conditional branching in the microprogram (IR, external inputs, and condition codes feed the starting and branch address generator, which loads the µPC; the µPC addresses the control store, which outputs the CW).

39 Microinstructions
A straightforward way to structure microinstructions is to assign one bit position to each control signal.
However, this is very inefficient, because the microinstructions become long and most bits are 0 in any given one.
The length can be reduced: most signals are not needed simultaneously, and many signals are mutually exclusive.
Mutually exclusive signals are placed in the same group, and each group is binary coded.

40 Partial Format for the Microinstructions
Microinstruction fields: F1 F2 F3 F4 F5 F6 F7 F8
F1 (4 bits): No transfer, PCout, MDRout, Zout, R0out, R1out, R2out, R3out, TEMPout, Offsetout
F2 (3 bits): No transfer, PCin, IRin, Zin, R0in, R1in, R2in, R3in
F3 (3 bits): No transfer, MARin, MDRin, TEMPin, Yin
F4 (4 bits): Add, Sub, ..., XOR (16 ALU functions)
F5 (2 bits): No action, Read, Write
F6 (1 bit): SelectY, Select4
F7 (1 bit): No action, WMFC
F8 (1 bit): Continue, End
What is the price paid for this scheme? It requires a little more hardware (decoders for the encoded fields).
Figure 7.19. An example of a partial format for field-encoded microinstructions.
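
The field widths in the partial format follow from the number of alternatives each group has to distinguish. The sketch below recomputes them; the code counts per field are read off the slide (F1 to F8, including any "no transfer" or "no action" entry), and the helper name is an assumption.

```python
# Width of a binary-coded field that must distinguish `num_codes` alternatives.

def field_width(num_codes):
    """Smallest number of bits with 2**bits >= num_codes (at least 1)."""
    return max(1, (num_codes - 1).bit_length())

# Number of codes in each field of the partial format.
FIELD_CODES = {"F1": 10, "F2": 8, "F3": 5, "F4": 16, "F5": 3, "F6": 2, "F7": 2, "F8": 2}

widths = {name: field_width(n) for name, n in FIELD_CODES.items()}
total_bits = sum(widths.values())
```

For comparison, the nine F1 signals alone would need nine bits with one bit per signal, against four bits when encoded; the saving is paid for with a decoder per field.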

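41 Further Improvement
Enumerate the patterns of required signals in all possible microinstructions.
Each meaningful combination of active control signals can then be assigned a distinct code.
Vertical organization: highly encoded microinstructions that specify only a few control functions each.
Horizontal organization: minimally encoded microinstructions that can control many resources in parallel.
Textbook page 434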

42 Microprogram Sequencing
If all microprograms required only straightforward sequential execution of microinstructions except for branches, letting a µPC govern the sequencing would be efficient.
However, this approach has two disadvantages:
- Having a separate microroutine for each machine instruction results in a large total number of microinstructions and a large control store.
- Execution time is longer because it takes more time to carry out the required branches.
Example: Add src, Rdst
Four addressing modes: register, autoincrement, autodecrement, and indexed (with indirect

43 Textbook page: Bit-ORing; Wide-Branch Addressing; WMFC

44 Microroutine for Add (Rsrc)+,Rdst (Textbook page 439)
Contents of IR: OP code, Mode, Rsrc, Rdst
Address (octal)  Microinstruction
000   PCout, MARin, Read, Select4, Add, Zin
001   Zout, PCin, Yin, WMFC
002   MDRout, IRin
003   µBranch {µPC ← 101 (from Instruction decoder); µPC5,4 ← [IR10,9]; µPC3 ← [IR10]'·[IR9]'·[IR8]}
121   Rsrcout, MARin, Read, Select4, Add, Zin
122   Zout, Rsrcin
123   µBranch {µPC ← 170; µPC0 ← [IR8]'}, WMFC
170   MDRout, MARin, Read, WMFC
171   MDRout, Yin
172   Rdstout, SelectY, Add, Zin
173   Zout, Rdstin, End
(A prime denotes the complement of the bit.)
Figure 7.21. Microinstruction for Add (Rsrc)+,Rdst.
Note: Microinstruction at location 170 is not executed for this addressing mode.

45 Microinstructions with Next-Address Field
The microprogram we discussed requires several branch microinstructions, which perform no useful operation in the datapath.
A powerful alternative approach is to include an address field as a part of every microinstruction to indicate the location of the next microinstruction to be fetched.
Pros: separate branch microinstructions are virtually eliminated; there are few limitations in assigning addresses to microinstructions.
Cons: additional bits are needed for the address field (around 1/6 of the control store).

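46 Microinstructions with Next-Address Field
Figure. Microinstruction-sequencing organization (IR, external inputs, and condition codes feed the decoding circuits, which load the µAR; the µAR addresses the control store; the fetched microinstruction is held in the µIR, whose next-address field returns to the decoding circuits and whose remaining fields go to the microinstruction decoder that produces the control signals).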

47 Microinstruction format
Fields: F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10
F0 (8 bits): Address of next microinstruction
F1 (3 bits): No transfer, PCout, MDRout, Zout, Rsrcout, Rdstout, TEMPout
F2 (3 bits): No transfer, PCin, IRin, Zin, Rsrcin, Rdstin
F3 (3 bits): No transfer, MARin, MDRin, TEMPin, Yin
F4 (4 bits): Add, Sub, ..., XOR
F5 (2 bits): No action, Read, Write
F6 (1 bit): SelectY, Select4
F7 (1 bit): No action, WMFC
F8 (1 bit): NextAdrs, InstDec
F9 (1 bit): No action, ORmode
F10 (1 bit): No action, ORindsrc
Figure. Format for microinstructions in the example of Section 7

48 Implementation of the Microroutine
Figure. Implementation of the microroutine of Figure 7.21 using a next-microinstruction address field (see Figure 7.23 for encoded signals). Table columns: octal address, F0, F1, F2, F3, F4, F5, F6, F7, F8, F9, F10.

49 Control-signal generation
Figure. Some details of the control-signal-generating circuitry (the Rsrc and Rdst fields of IR drive decoders that produce R0in to R15in and R0out to R15out, gated by Rsrcin, Rsrcout, Rdstin, and Rdstout from the microinstruction decoder; F8, F9, and F10 select InstDecout, ORmode, and ORindsrc at the decoding circuits that load the µAR).

50 Bit-ORing

51 Pipelining

52 Overview Pipelining is widely used in modern processors. Pipelining improves system performance in terms of throughput. Pipelined organization requires sophisticated compilation techniques.

53 Basic Concepts

54 Making the Execution of Programs Faster Use faster circuit technology to build the processor and the main memory. Arrange the hardware so that more than one operation can be performed at the same time. In the latter way, the number of operations performed per second is increased even though the elapsed time needed to perform any one operation is not changed.

55 Traditional Pipeline Concept
Laundry example: Ann, Brian, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold.
Washer takes 30 minutes.
Dryer takes 40 minutes.
Folder takes 20 minutes.

56 Traditional Pipeline Concept
Sequential laundry takes 6 hours for 4 loads (6 PM to midnight, loads A, B, C, D one after another).
If they learned pipelining, how long would laundry take?

57 Traditional Pipeline Concept
Pipelined laundry takes 3.5 hours for 4 loads (starting at 6 PM, loads A, B, C, D overlapped in task order).
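
Both laundry figures can be reproduced with a short calculation, assuming stage times of 30, 40, and 20 minutes for washing, drying, and folding and that each machine handles one load at a time. The function names are made up for the example.

```python
# Sequential versus pipelined completion time for the laundry example.

def sequential_minutes(stage_minutes, loads):
    """Each load goes through every stage before the next load starts."""
    return loads * sum(stage_minutes)

def pipelined_minutes(stage_minutes, loads):
    """Each stage starts a load as soon as both the load and the stage are free."""
    stage_free_at = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0                                   # when this load leaves the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + minutes
            ready = stage_free_at[s]
    return stage_free_at[-1]

STAGES = [30, 40, 20]                               # washer, dryer, folder
sequential = sequential_minutes(STAGES, loads=4)    # minutes
pipelined = pipelined_minutes(STAGES, loads=4)      # minutes
```

The pipelined time is 30 + 4 × 40 + 20 minutes: once the pipeline is full, a load finishes every 40 minutes, the time of the slowest stage.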

58 Traditional Pipeline Concept
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.
- The pipeline rate is limited by the slowest pipeline stage.
- Multiple tasks operate simultaneously using different resources.
- Potential speedup = number of pipe stages.
- Unbalanced lengths of pipe stages reduce speedup.
- Time to fill the pipeline and time to drain it reduce speedup.

59 Use the Idea of Pipelining in a Fetch + Execution Computer
Figure 8.1. Basic idea of instruction pipelining: (a) sequential execution of I1, I2, I3 as F1 E1 F2 E2 F3 E3; (b) hardware organization with an instruction fetch unit, interstage buffer B1, and an execution unit; (c) pipelined execution, in which F2 overlaps E1 and F3 overlaps E2.

60 Use the Idea of Pipelining in a Computer
Fetch + Decode + Execution + Write
F: Fetch instruction
D: Decode instruction and fetch operands
E: Execute operation
W: Write results
Figure 8.2. A 4-stage pipeline (Textbook page: 457): (a) instruction execution divided into four steps, with I1 to I4 each starting one clock cycle after the previous instruction; (b) hardware organization with interstage buffers B1, B2, and B3.

61 Role of Cache Memory
Each pipeline stage is expected to complete in one clock cycle.
The clock period should be long enough to let the slowest pipeline stage complete.
Faster stages can only wait for the slowest one to complete.
Since main memory is very slow compared to execution, the pipeline would be almost useless if each instruction had to be fetched from main memory.
Fortunately, we have cache.

62 Pipeline Performance
The potential increase in performance resulting from pipelining is proportional to the number of pipeline stages.
However, this increase would be achieved only if all pipeline stages required the same time to complete and there were no interruptions throughout program execution.
Unfortunately, neither condition holds in practice.

63 Pipeline Performance
Figure 8.3. Effect of an execution operation taking more than one clock cycle (I1 to I5 in a 4-stage pipeline; the Execute step of I2 occupies three clock cycles, so the later instructions are held back).

64 Pipeline Performance
The previous pipeline is said to have been stalled for two clock cycles.
Any condition that causes a pipeline to stall is called a hazard.
- Data hazard: any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline. Some operation has to be delayed, and the pipeline stalls.
- Instruction (control) hazard: a delay in the availability of an instruction causes the pipeline to stall.
- Structural hazard: the situation when two instructions require the use of a given hardware resource at the same time.
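
The cost of a stall can be estimated with a small timing model of a 4-stage, in-order pipeline in which an instruction may spend more than one cycle in the Execute stage. This is a sketch under simplifying assumptions (one instruction per stage, no buffering beyond the interstage registers); the function and parameter names are illustrative.

```python
# Cycle count for an in-order pipeline where only the Execute stage may take
# more than one cycle. exec_cycles[i] is the Execute time of instruction i.

def pipeline_cycles(exec_cycles, stages=4, exec_stage=2):
    prev_leave = [0] * stages                 # cycle in which the previous instruction left each stage
    for cycles_in_e in exec_cycles:
        leave = [0] * stages
        entered = prev_leave[0]               # Fetch is free once the previous instruction has left it
        for s in range(stages):
            done = entered + (cycles_in_e if s == exec_stage else 1)
            if s + 1 < stages:
                # The instruction is held in this stage until the next stage is vacated.
                done = max(done, prev_leave[s + 1])
            leave[s] = done
            entered = done
        prev_leave = leave
    return prev_leave[-1]

ideal = pipeline_cycles([1, 1, 1, 1])         # no stalls: k + n - 1 cycles
stalled = pipeline_cycles([1, 3, 1, 1])       # second instruction needs 3 Execute cycles
```

With no stalls, four instructions finish in 4 + 4 - 1 cycles; stretching one Execute step by two cycles delays every later instruction by the same two cycles.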

65 Pipeline Performance
Instruction hazard
Figure 8.4. Pipeline stall caused by a cache miss in F2: (a) instruction execution steps in successive clock cycles for I1, I2, I3; (b) function performed by each processor stage (F: Fetch, D: Decode, E: Execute, W: Write) in successive clock cycles. While F2 is repeated, the D, E, and W stages are idle; the idle periods are called stalls or bubbles.

66 Pipeline Performance
Structural hazard
Load X(R1), R2
Figure 8.5. Effect of a Load instruction on pipeline timing (I2 is the Load and needs an extra memory access step M2 between E2 and W2, which delays W3 and the instructions that follow).

67 Pipeline Performance Again, pipelining does not result in individual instructions being executed faster; rather, it is the throughput that increases. Throughput is measured by the rate at which instruction execution is completed. Pipeline stall causes degradation in pipeline performance. We need to identify all hazards that may cause the pipeline to stall and to find ways to minimize their impact.

68 Quiz
Four instructions are executed, and I2 takes two clock cycles for execution. Draw the timing diagram for a 4-stage pipeline and determine the total number of cycles needed for the four instructions to complete.

69 Data Hazards

70 Data Hazards
We must ensure that the results obtained when instructions are executed in a pipelined processor are identical to those obtained when the same instructions are executed sequentially.
Hazard occurs:
A ← 3 + A
B ← 4 × A
No hazard:
A ← 5 × C
B ← 20 + C
When two operations depend on each other, they must be executed sequentially in the correct order.
Another example:
Mul R2, R3, R4
Add R5, R4, R6

71 Data Hazards
Figure 8.6. Pipeline stalled by data dependency between D2 and W1 (I1 is the Mul and I2 the Add; the Decode step of I2 is held until W1 has written the result, and I3 and I4 are delayed accordingly).

72 Operand Forwarding
Instead of reading it from the register file, the second instruction can get the data directly from the output of the ALU as soon as the previous instruction has produced it.
A special arrangement needs to be made to forward the output of the ALU to the input of the ALU.

73 Operand Forwarding
Figure 8.7. Operand forwarding in a pipelined processor: (a) datapath with the register file, source registers SRC1 and SRC2, the ALU, and result register RSLT; (b) position of the source and result registers in the processor pipeline, with a forwarding path from RSLT in the E (Execute, ALU) stage back to the ALU input, ahead of the W (Write, register file) stage.

74 Handling Data Hazards in Software
Let the compiler detect and handle the hazard:
I1: Mul R2, R3, R4
    NOP
    NOP
I2: Add R5, R4, R6
The compiler can reorder the instructions to perform some useful work during the NOP slots.
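
A compiler pass of this kind can be sketched as follows. The example assumes three-operand instructions written as (opcode, source 1, source 2, destination) and a fixed requirement of two slots between a write and a dependent read, as in the Mul/Add pair; the tuple format, the `NOP` marker, and the function name are assumptions for illustration.

```python
# Hypothetical NOP-insertion pass for read-after-write dependencies.

NOP = ("NOP", None, None, None)

def insert_nops(program, delay=2):
    """Ensure at least `delay` slots separate a register write from a later read of it."""
    out = []
    for instr in program:
        _, src1, src2, _ = instr
        needed = 0
        # Look back over the last `delay` slots already emitted.
        for distance, earlier in enumerate(reversed(out[-delay:]), start=1):
            if earlier != NOP and earlier[3] in (src1, src2):
                needed = max(needed, delay - distance + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out

program = [
    ("Mul", "R2", "R3", "R4"),   # R4 <- [R2] * [R3]
    ("Add", "R5", "R4", "R6"),   # reads R4 immediately afterwards
]
padded = insert_nops(program)
```

If an independent instruction already sits between the two, the pass inserts only one NOP, which is the effect a reordering compiler aims for.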

75 Side Effects
The previous example is explicit and easily detected.
Sometimes an instruction changes the contents of a register other than the one named as the destination.
When a location other than one explicitly named in an instruction as a destination operand is affected, the instruction is said to have a side effect.
Example: condition code flags. The second instruction below implicitly depends on the carry flag set by the first:
Add R1, R3
AddWithCarry R2, R4
Instructions designed for execution on pipelined hardware should have few side effects.

76 Instruction Hazards

77 Overview
Whenever the stream of instructions supplied by the instruction fetch unit is interrupted, the pipeline stalls.
Causes: cache miss; branch.

78 Unconditional Branches
Figure 8.8. An idle cycle caused by a branch instruction (I2 is the Branch; I3 is fetched and then discarded, leaving the execution unit idle for one cycle before Ik and Ik+1 proceed from the branch target).

79 Branch Timing
Branch penalty; reducing the penalty.
Figure 8.9. Branch timing: (a) branch address computed in the Execute stage, so I3 and I4 are fetched and discarded; (b) branch address computed in the Decode stage, so only I3 is discarded.

80 Instruction Queue and Prefetching
F: Fetch instruction
D: Dispatch/Decode unit
E: Execute instruction
W: Write results
Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b (the instruction fetch unit fills an instruction queue, from which the dispatch/decode unit takes instructions).

81 Conditional Branches
A conditional branch instruction introduces the added hazard caused by the dependency of the branch condition on the result of a preceding instruction.
The decision to branch cannot be made until the execution of that instruction has been completed.
Branch instructions represent about 20% of the dynamic instruction count of most programs.

82 Superscalar Architecture
Figure: F (instruction fetch unit) fills an instruction queue; the dispatch unit issues instructions to the floating-point unit and the integer unit; W writes the results.

83 Superscalar Operation
- Equip the processor with multiple processing units.
- Several instructions can be executed in the same clock cycle (multiple-issue processor).
- Throughput can be greater than 1 instruction per cycle.
- The compiler should interleave floating-point and integer instructions.
- Out-of-order execution must be handled carefully.
