EC 413 Computer Organization - Fall 2017 Problem Set 3 Solution


Important guidelines: Always state your assumptions and clearly explain your answers. Please upload your solution document (PDF or TXT) to Blackboard. Your submission should be a single file titled [your name]_pset3.pdf (example: JohnDoe_pset3.pdf) or [your name]_pset3.txt. 100/100 points possible. Due Wednesday, November 1st by 11:59 PM through Blackboard. Notes between the square brackets are simply to help you find the equivalent of the problem or question in the textbook.

Part 1: CPU Datapath and Control Logic

[Figure 1: The basic implementation of the MIPS subset, including the necessary multiplexors and control lines.]

Exercise 1: [4.1] Consider the following instruction:

Instruction: AND Rd,Rs,Rt
Interpretation: Reg[Rd] = Reg[Rs] AND Reg[Rt]
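As a side note, the register-transfer interpretation above can be sketched as a tiny simulation. This is only an illustration: the register file is modeled as a plain Python list and the register numbers are made up; instruction encoding is ignored.

```python
# Minimal sketch of the R-type AND semantics: Reg[Rd] = Reg[Rs] AND Reg[Rt].
# The register file is a plain list; encodings and control are omitted.
def execute_and(reg, rd, rs, rt):
    """Perform the bitwise-AND register transfer."""
    reg[rd] = reg[rs] & reg[rt]

reg = [0] * 32                     # MIPS has 32 general-purpose registers
reg[2], reg[3] = 0b1100, 0b1010    # illustrative source values
execute_and(reg, 1, 2, 3)
print(reg[1])                      # -> 8 (0b1000)
```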

a. The values of the control signals are as follows:

RegWrite  MemRead  ALUMux  MemWrite  ALUop  RegMux  Branch
1         0        0       0         10     0       0

ALUMux is the control signal that controls the Mux at the ALU input; 0 (Reg) selects the output of the register file, and 1 (Imm) selects the immediate from the instruction word as the second input to the ALU. RegMux is the control signal that controls the Mux at the Data input to the register file; 0 (ALU) selects the output of the ALU, and 1 (Mem) selects the output of memory. A value of X is a "don't care" (it does not matter whether the signal is 0 or 1).

b. All except the branch Add unit and the write port of the Registers.

c. Outputs that are not used: branch Add, write port of Registers. No outputs: None (all units produce outputs).

Exercise 2: [4.2] The basic single-cycle MIPS implementation in Figure 1 can only implement some instructions. New instructions can be added to an existing Instruction Set Architecture (ISA), but the decision whether or not to do so depends, among other things, on the cost and complexity the proposed addition introduces into the processor datapath and control. The first three problems in this exercise refer to the new instruction:

Instruction: LWI Rt,Rd(Rs)
Interpretation: Reg[Rt] = Mem[Reg[Rd]+Reg[Rs]]

a. This instruction uses the instruction memory, both register read ports, the ALU (to add Rd and Rs together), the data memory, and the write port in Registers.

b. None. This instruction can be implemented using existing blocks.

c. None. This instruction can be implemented without adding new control signals. It only requires changes in the Control logic.

Exercise 3: [4.3] When processor designers consider a possible improvement to the processor datapath, the decision usually depends on the cost/performance trade-off. In the following three problems, assume that we are starting with the datapath from Figure 1, where the I-Mem, Add, Mux, ALU, Regs, D-Mem, and Control blocks have latencies of 400 ps, 100 ps, 30 ps, 120 ps, 200 ps, 350 ps, and 100 ps, respectively, and costs of 1000, 30, 10, 100, 200, 2000, and 500, respectively.
Consider the addition of a multiplier to the ALU. This addition will add 300 ps to the latency of the ALU and will add a cost of 600 to the ALU. The result will be 5% fewer instructions executed, since we will no longer need to emulate the MUL instruction.
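These trade-offs can be sanity-checked numerically. The sketch below uses the unit latencies (ps) and costs given above; the load instruction's critical path (I-Mem, Regs, Mux, ALU, D-Mem, Mux) sets the clock cycle time.

```python
# Sanity check of the Exercise 3 cost/performance trade-off.
latency = {"I-Mem": 400, "Add": 100, "Mux": 30, "ALU": 120,
           "Regs": 200, "D-Mem": 350, "Control": 100}
cost = {"I-Mem": 1000, "Add": 30, "Mux": 10, "ALU": 100,
        "Regs": 200, "D-Mem": 2000, "Control": 500}

# Critical path: fetch, register read, ALU-input Mux, ALU, memory, WB Mux.
cycle = sum(latency[u] for u in ["I-Mem", "Regs", "Mux", "ALU", "D-Mem", "Mux"])
cycle_mul = cycle + 300                      # multiplier adds 300 ps to the ALU
speedup = (1 / 0.95) * (cycle / cycle_mul)   # 5% fewer instructions, longer cycle

# Total cost counts every block: the datapath has 2 Add units and 3 Muxes.
base_cost = sum(cost.values()) + cost["Add"] + 2 * cost["Mux"]
rel_cost = (base_cost + 600) / base_cost
print(cycle, round(speedup, 2), base_cost, round(rel_cost / speedup, 2))
```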

a. The clock cycle time is determined by the critical path, which for the given latencies is the path that gets the data value for the load instruction: I-Mem (read instruction), Regs (takes longer than Control), Mux (select ALU input), ALU, Data Memory, and Mux (select the value from memory to be written into Registers). The latency of this path is 400 ps + 200 ps + 30 ps + 120 ps + 350 ps + 30 ps = 1130 ps. With the multiplier, the cycle time becomes 1430 ps (1130 ps + 300 ps, because the ALU is on the critical path).

b. The speedup comes from the change in clock cycle time and the change in the number of clock cycles the program needs. We need 5% fewer cycles for a program, but the cycle time is 1430 ps instead of 1130 ps, so we have a speedup of (1/0.95)*(1130/1430) = 0.83, which means we actually have a slowdown.

c. The cost is always the total cost of all components (not just those on the critical path), so the original processor has a cost of I-Mem, Regs, Control, ALU, D-Mem, 2 Add units, and 3 Mux units, for a total of 1000 + 200 + 500 + 100 + 2000 + 2*30 + 3*10 = 3890. We will compute cost relative to this baseline. The performance relative to this baseline is the speedup we previously computed, and the cost/performance relative to the baseline is as follows: New cost: 3890 + 600 = 4490. Relative cost: 4490/3890 = 1.15. Cost/performance: 1.15/0.83 = 1.39. We are paying significantly more for significantly worse performance; the cost/performance is much worse than with the unmodified processor.

Part 2: Support Considerations in CPU Design

[Figure 2: A portion of the datapath used for fetching instructions and incrementing the program counter.]

[Figure 3: The simple datapath for the core MIPS architecture combines the elements required by different instruction classes.]

Exercise 1: [4.4] Problems in this exercise assume that the logic blocks needed to implement a processor's datapath have the following latencies:

I-Mem   Add    Mux    ALU    Regs   D-Mem   Sign-Extend   Shift-Left-2
200 ps  70 ps  20 ps  90 ps  90 ps  250 ps  15 ps         10 ps

a. I-Mem takes longer than the Add unit, so the clock cycle time is equal to the latency of the I-Mem: 200 ps.

b. The critical path for this instruction goes through the instruction memory, Sign-extend and Shift-left-2 to get the offset, the Add unit to compute the new PC, and the Mux to select that value instead of PC+4. Note that the path through the other Add unit is shorter, because the latency of I-Mem is longer than the latency of the Add unit. We have: 200 ps + 15 ps + 10 ps + 70 ps + 20 ps = 315 ps.

c. Conditional branches have the same long-latency path that computes the branch address as unconditional branches do. Additionally, they have a long-latency path that goes through Registers, Mux, and ALU to compute the PCSrc condition. The critical path is the longer of the two, and the path through PCSrc is longer for these latencies: 200 ps + 90 ps + 20 ps + 90 ps + 20 ps = 420 ps.

d. PC-relative branches.

e. PC-relative unconditional branch instructions. We saw in part c that this is not on the critical path of conditional branches, and it is only needed for PC-relative

branches. Note that MIPS does not have actual unconditional branches (beq $zero,$zero,Label plays that role, so there is no need for unconditional branch opcodes), so for MIPS the answer to this question is actually None.

f. Of the two instructions (BNE and ADD), BNE has the longer critical path, so it determines the clock cycle time. Note that every path for ADD is shorter than or equal to the corresponding path for BNE, so changes in this unit's latency will not affect that. As a result, we focus on how the unit's latency affects the critical path of BNE. This unit (Shift-left-2) is not on the critical path, so the only way for it to become critical is to increase its latency until the branch-address path through Sign-extend, Shift-left-2, and the branch Add becomes longer than the PCSrc path through Registers, Mux, and ALU. The latency of Regs, Mux, and ALU is 200 ps, and the latency of Sign-extend, Shift-left-2, and Add is 95 ps, so the latency of Shift-left-2 must be increased by 105 ps or more for it to affect the clock cycle time.

Exercise 2: [4.5] For the problems in this exercise, assume that there are no pipeline stalls and that the breakdown of executed instructions is as follows:

add    addi   not    beq    lw     sw
20%    20%    0%     25%    25%    10%

a. The data memory is used by LW and SW instructions, so the answer is: 25% + 10% = 35%.

b. The sign-extend circuit actually computes a result in every cycle, but its output is ignored for ADD and NOT instructions. The output of the sign-extend circuit is needed for ADDI (to provide the immediate ALU operand), BEQ (to provide the PC-relative offset), and LW and SW (to provide the offset used in addressing memory), so the answer is: 20% + 25% + 25% + 10% = 80%.

Part 3: Efficiency Considerations in CPU Design

Exercise 1: [4.10] In this exercise, we examine how resource hazards, control hazards, and Instruction Set Architecture (ISA) design can affect pipelined execution.
Problems in this exercise refer to the following fragment of MIPS code:

sw  r16,12(r6)
lw  r16,8(r6)
beq r5,r4,Label   # Assume r5!=r4
add r5,r1,r4
slt r5,r15,r4

Assume that individual pipeline stages have the following latencies:

IF      ID      EX      MEM     WB
200 ps  120 ps  150 ps  190 ps  100 ps

a. In the pipelined execution shown below, *** represents a stall when an instruction cannot be fetched because a load or store instruction is using the memory in that cycle. Cycles run from left to right, and for each instruction we show the pipeline stage it is in during that cycle:

Cycle:          1    2    3    4    5    6    7    8    9    10   11
SW R16,12(R6)   IF   ID   EX   MEM  WB
LW R16,8(R6)         IF   ID   EX   MEM  WB
BEQ R5,R4,Lbl             IF   ID   EX   MEM  WB
ADD R5,R1,R4                   ***  ***  IF   ID   EX   MEM  WB
SLT R5,R15,R4                       ***  ***  IF   ID   EX   MEM  WB

We cannot eliminate this hazard by adding NOPs to the code: NOPs need to be fetched just like any other instructions, so this hazard must instead be addressed with a hardware hazard detection unit in the processor.

b. This change only saves one cycle in an entire execution without data hazards (such as the one given). That cycle is saved because the last instruction finishes one cycle earlier (one less stage to go through). If there were data hazards from loads to other instructions, the change would help eliminate some stall cycles.

Instructions executed   Cycles with 5 stages   Cycles with 4 stages   Speedup
5                       4 + 5 = 9              3 + 5 = 8              9/8 = 1.13

c. Stall-on-branch delays the fetch of the next instruction until the branch is executed. When branches execute in the EXE stage, each branch causes two stall cycles. When branches execute in the ID stage, each branch causes only one stall cycle. Without branch stalls (e.g., with perfect branch prediction) there are no stalls, and the execution time in cycles is 4 plus the number of executed instructions. We have:

Instructions executed   Branches executed   Cycles with branch in EXE   Cycles with branch in ID   Speedup
5                       1                   4 + 5 + 1*2 = 11            4 + 5 + 1*1 = 10           11/10 = 1.10

d. The numbers of cycles for the (normal) 5-stage and the (combined EX/MEM) 4-stage pipelines were already computed in part b. The clock cycle time is equal to the latency of the longest-latency stage.
Combining the EX and MEM stages affects the clock cycle time only if the combined EX/MEM stage becomes the longest-latency stage:

Cycle time with 5 stages   Cycle time with 4 stages   Speedup
200 ps (IF)                210 ps (MEM + 20 ps)       (9*200)/(8*210) = 1.07

e.

New ID latency   New EX latency   New cycle time   Old cycle time   Speedup
180 ps           140 ps           200 ps (IF)      200 ps (IF)      (11*200)/(10*200) = 1.10

f. The cycle time remains unchanged: a 20 ps reduction in EX latency has no effect on the clock cycle time because EX is not the longest-latency stage. The change does affect execution time, however, because it adds one additional stall cycle to each branch. Because the clock cycle time does not improve but the number of cycles increases, the speedup from this change is below 1 (a slowdown). In part c we already computed the number of cycles when the branch is in the EX stage. We have:

Cycles with branch in EX   Execution time (branch in EX)   Cycles with branch in MEM   Execution time (branch in MEM)   Speedup
4 + 5 + 1*2 = 11           11*200 ps = 2200 ps             4 + 5 + 1*3 = 12            12*200 ps = 2400 ps              0.92

Exercise 2: [4.19] This exercise explores energy efficiency and its relationship with performance. Problems in this exercise assume the following energy consumption for activity in the instruction memory, Registers, and data memory. You can assume that the other components of the datapath consume a negligible amount of energy.

I-Mem Read   Register Read   Register Write   D-Mem Read   D-Mem Write
140 pJ       70 pJ           60 pJ            140 pJ       120 pJ

Assume that components in the datapath have the following latencies. You can assume that the other components of the datapath have negligible latencies.

I-Mem    Control   Register Read or Write   ALU     D-Mem Read or Write
200 ps   150 ps    90 ps                    90 ps   250 ps

a. The energy for the two designs is the same: the instruction memory is read, two registers are read, and one register is written. We have: 140 pJ + 2*70 pJ + 60 pJ = 340 pJ.

b. The instruction memory is read for all instructions. Every instruction also results in two register reads (even if only one of those values is actually used). A load instruction results in a memory read and a register write, a store instruction results in a memory write, and all other instructions result in either no register write (e.g., BEQ) or a register write.
Because the sum of memory-read and register-write energy is larger than memory-write energy, the worst-case instruction is a load. For the energy spent by a load, we have:

140 pJ + 2*70 pJ + 60 pJ + 140 pJ = 480 pJ

c. The instruction memory must be read for every instruction. However, we can avoid reading registers whose values are not going to be used. To do this, we must add Reg1 and Reg2 control inputs to the Registers unit to enable or disable each register read. We must generate these control signals quickly to avoid lengthening the clock cycle time. With these new control signals, a LW instruction results in only one register read (we still must read the register used to generate the address), so we have:

Energy before change                          Energy saved by change   % Savings
140 pJ + 2*70 pJ + 60 pJ + 140 pJ = 480 pJ    70 pJ                    70/480 = 14.6%

d. Before the change, the Control unit decodes the instruction while the register reads are happening. After the change, the latencies of Control and register read cannot be overlapped. This increases the latency of the ID stage and could affect the processor's clock cycle time if the ID stage becomes the longest-latency stage. We have:

Clock cycle time before change   Clock cycle time after change
250 ps (D-Mem in MEM stage)      No change (150 ps + 90 ps < 250 ps)

e. If the memory is read in every cycle, the value is either needed (for a load instruction), does not get past the WB Mux (for a non-load instruction that writes to a register), or does not get written to any register (all other instructions, including stalls). This change does not affect the clock cycle time, because the clock cycle must already allow enough time for the memory to be read in the MEM stage. It does affect energy: a memory read now occurs in every cycle instead of only in cycles when a load instruction is in the MEM stage.

I-Mem active energy   I-Mem latency   Clock cycle time   Total I-Mem energy                              Idle energy %
140 pJ                200 ps          250 ps             140 pJ + 50 ps*0.1*140 pJ/200 ps = 143.5 pJ     3.5 pJ/143.5 pJ = 2.44%
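The energy arithmetic in this exercise can be checked with a short sketch. The per-access energies and latencies are the ones given above; the dictionary keys and variable names are illustrative.

```python
# Energy sketch for Exercise 2 [4.19], using per-access energies in pJ.
E = {"imem": 140, "reg_read": 70, "reg_write": 60,
     "dmem_read": 140, "dmem_write": 120}

add_energy = E["imem"] + 2 * E["reg_read"] + E["reg_write"]   # part a: 340 pJ
load_energy = add_energy + E["dmem_read"]                     # part b: 480 pJ
saved = E["reg_read"]            # part c: LW skips one register read
print(add_energy, load_energy, round(100 * saved / load_energy, 1))

# Idle-energy table: I-Mem is busy 200 ps of a 250 ps cycle, and while idle
# it draws 10% of its active power (140 pJ over 200 ps).
idle = (250 - 200) * 0.10 * (E["imem"] / 200)   # 3.5 pJ per cycle
total = E["imem"] + idle                        # 143.5 pJ per cycle
print(round(total, 1), round(100 * idle / total, 2))
```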