1 ( 23 points) 15 min.


ee57_mt_sp2.fm
Spring 2 EE57 Instructor: Gandhi Puvvada
Midterm Exam (2%) Date: //2, Friday Time: :M - 2:2PM in THH2
Name:
Total points: 28 Perfect score: 9 / 28

1 ( 23 points) 15 min. Pipelining

1.1 I.F.R.F (Internally Forwarding Register File):
I.F.R.F is (useful / useless but harmless / harmful) in a pipelined CPU.
I.F.R.F is (useful / useless but harmless / harmful) in a multicycle CPU (1st ed.).
I.F.R.F is (useful / useless but harmless / harmful) in a multicycle CPU (2nd ed.).
I.F.R.F is (useful / useless but harmless / harmful) in a single cycle CPU.
If you marked harmful for any one, explain why it is harmful.

1.2 If all the above four designs of the CPU (the single cycle CPU, the two multicycle CPUs, and the 5-stage pipelined CPU) are operated at the same (clock) frequency, best performance is provided by ______. Explain.

1.3 Branch Delay Slot (is always advantageous / is always disadvantageous / depends on the compiler's ability to fill the slot for it to be advantageous). Explain.
If the original 5-stage CPU of the first edition did not have a delay slot for the load-word instruction, it means we (need / do not need) to have a Hazard Detection Unit (HDU) to stall an instruction in the delay slot, which is dependent on the load word.

April, 2 2:9 pm EE57 Midterm Exam - Spring 2 Page - 1 / 9 (C) Copyright 2 Gandhi Puvvada

2 ( = 8 points) 7 min. Pipelining (Modified Lab 7 part 3):

In the Fall 2 midterm exam, we have an ADD unit in each of the two EX stages, EX and EX2. Here, those two stages, EX and EX2, are merged into EX2. So ADD8 needs an extra clock in EX2 as it has to go through the second ADD also.

Instruction Operation Opcode MSD 32-bit instruction in hex BZ ADD ADD8 (D=Destination, S=Source)
NOP DS
$R, $X; ($R) <= ($X) 8 8DS
SUB3 $R, $X; ($R) <= ($X) - 3 DS
BZ $X, JJJJ; (PC) <= JJJJ if ($X) = 2 JJJJDS
ADD $R, $X; ($R) <= ($X) + 2 2DS
ADD8 $R, $X; ($R) <= ($X) + 8 DS

Further we introduced a BZ (Branch if Zero) instruction. It uses the opcode previously allocated to the SUB3 instruction. The instructions are 32 bits in size, but the addresses are only -bit. PC is -bit wide and is incremented by a "". The JJJJ in BZ $X, JJJJ stands for a -bit (-digit hex) absolute branch address. If the source register $X is zero, then we branch to JJJJ [ (PC) <= JJJJ if ($X) = ]. The "D" in "JJJJDS" is a random hex digit and should not be treated as a valid destination, similar to the "DS" in "DS" for a NOP instruction. BZ executes from the ID stage. You need to complete the early branch mechanism (dependency stalls, branch execution by causing PC to be changed to JJJJ and flushing the IF stage instruction, avoiding spurious branch execution during stalling, etc.). Complete the design on the page next to next (on page ).

2.2 In your Lab 7 Part 3 Subpart 2 (EX and EX2 merged case), you used the left side circuit to stall for clock. Complete the design to show the STALL signal. Suppose you are given a flip-flop with an asynchronous set as shown on the right side below (instead of the FF with an asynchronous clear as shown on the left). Redesign your stall circuit with this FF and show the STALL signal.
[Figure: two D flip-flops, one with an asynchronous CLR (left, labeled EX2_ADD) and one with an asynchronous SET (right)]

2.3 When STALL_ADD8 is active, you stall the entire pipeline. True / False
When STALL_BR is active, you stall the entire pipeline. True / False
The IF_Flush mechanism here is (the same as / different from) the wrist-band mechanism used in our pipelined CPU design.

2.4 In this design we have implemented an early branch. Would a medium branch from EX2 be better? Yes / No / It depends. Explain.
Is it possible to postpone executing the BZ instruction all the way into the WB stage (WB!, not EX2)? Not possible / possible but undesirable / possible and desirable. Explain.

2.5 Combining EX and EX2 into one EX2 stage (as done here) is (always better / always worse / depends on the instruction sequence in the program). Explain.

2.6 How come we carried (PC + ) to the ID stage in the textbook design, but we do not carry (PC + ) to the ID stage here?

2.7 Complete the following "Single Cycle CPU" kind of a design for the pipelined design on the next page. Complete the control unit also.
[Figure: Single Cycle CPU datapath to be completed. Legible labels: PCSource, PC_EN, PC, I-MEM, Branch Address JJJJ, CU, Reg. File, RegWrite, SKIP, SKIP2, ADD, ADD8, BZ, "complete this Branch"]

[Figure: Modified LAB 7 Part 3 Block Diagram, Q#2. Pipeline stages IF, ID, EX2, WB with PC, I-MEM, Reg. File, HDU_BR, FU, and a Comp Station in the ID stage. Legible signal labels: PCSource, IF_Flush, STALL_BR, STALL_ADD8, FORW, SKIP, SKIP2, ID_ADD, ID_ADD8, ID_BZ, EX2_ADD, EX2_ADD8, EX2_Write, WB_RD, WB_Write, Branch Address JJJJ, EN]

1. Complete all missing connections to
2. Complete the STALL_ADD8 logic in EX2 and the STALL_BR logic in the ID stage.
3. Complete all four enable (EN) controls on the pipeline registers (including PC).
4. Draw the logic to produce PCSource, IF_Flush, FORW, SKIP, SKIP2 on this page itself.

2.8 Now let us try to build a multi-cycle version of the design on the previous page. It is proposed that we go for a single ALU, which can add a selected constant ( for PC and for ADD or ADD8). ADD8 uses the ALU twice to add two times. This multi-cycle datapath is similar to the 1st edition design except that here the ALU is built using dynamic logic (like in the 2nd edition). There is an ALUOut register like in the 2nd edition. You need to carefully decide when to take (tap) data from upstream of the ALUOut register and when to take (tap) data from downstream of the ALUOut register.
We have an IR register (Instruction Register) to hold the instruction at the end of the first state(s). IR is needed as PC is incremented using the ALU in the very first state. We need to support a NOP instruction here besides ADD, ADD8, and BZ.
Complete the datapath and the state diagram for the control unit on the next two pages. To some extent, our state diagram resembles the 2nd edition state diagram reproduced below for your reference. We are doing a MOORE kind of state diagram and may be wasting a few clocks.
[Figure: 2nd edition state diagram, reproduced just for reference]

We (need / do not need) an ALUOut_write control signal.
[Figure: Multi Cycle CPU datapath to be completed. Legible labels: PC, PC_EN, PCSource, PCWrite, PCWriteCond, I-MEM, IRWrite, IR, Branch Address JJJJ, Reg. File, RegWrite, ONE/FOUR, ALU, ALUOut, SKIP, Concatenate, ADD, ADD8, BZ, CU]

[Figure: state diagram for the multi-cycle control unit, to be completed. Legible labels: states S2 and S3; transitions on NOP, ADD or ADD8, ADD8, ADD, BZ; signals PCSource, ONE/FOUR. Note on the diagram: "No RTL needed. No signal list."]

Mr. Trojan says that we can easily improve the above state machine by combining states S, S2, and S into one Mealy state S2. Complete the S2 state on the side and also write the new state transition condition from S2 to S.
[Figure: blank Mealy state S2 with an outgoing transition "To S"]

3 ( 3 points) 2 min. Cache mapping techniques:

Fill in all missing information in the table below based on the information provided. In all four cases, it is the same amount of cache differently organized.

Byte addressable processor, -bit Data, -bit address
Addr Space Size: GBytes
Cache Size: use this info.!
Block Size: KB Words ( Bytes)
Mapping Technique: Direct | Fully Associative | Set Associative 2 Blocks/Set | Set Associative Blocks/Set
TAG FIELD: use this info.!
BLOCK OR SET FIELD (as appropriate):
WORD FIELD: use this info.!
BYTE FIELD: - (BE3-BE) | - (BE3-BE) | - (BE3-BE) | - (BE3-BE)

TAG M(s) and their size(s) and comparators to compare TAG(s) and their size.
In the case of Direct Mapping above, we use ______ (state a number) TAG M(s) of size ______ together with ______ (state a number) comparator(s) each of ______-bit wide.
In the case of Set Associative Mapping with 2 Blocks/Set above, we use ______ (state a number) TAG M(s) of size ______ together with ______ (state a number) comparator(s) each of ______-bit wide.
In the case of Set Associative Mapping with Blocks/Set above, we use ______ (state a number) TAG M(s) of size ______ together with ______ (state a number) comparator(s) each of ______-bit wide.
The Fully Associative Mapping is prohibitively expensive because you would need ______ (state a number) comparator(s) each of ______-bit wide.

In the first case of direct mapping, the main memory shall be organized in a ______-way lower-order interleaving to facilitate efficient ______.
The main memory organization is (same / different) in the above cache organizations, because ______.

In general, a set can potentially have a set-associativity equal to any number (not necessarily a power of 2). (However, here / Here also) a set (needs to be / does not need to be) a power of 2 in size, because ______.

If the total number of block frames in a cache is N, we can bring into that cache, if it is initially empty, any consecutive N blocks from the main memory (without causing any collision), in the case of a cache using the mapping technique (circle all correct answers): Fully-associative / Set-associative / Direct
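The field widths asked for in the table follow mechanically from the cache geometry. The sketch below is only an illustration: the function name `address_fields` and the numbers in the usage note (32-bit address, 64 KB cache, 32-byte blocks, 4-byte words) are assumptions chosen for the example, not the values printed on the exam. It also counts the byte field as address bits, whereas the exam table shows the byte lanes as BE signals.

```python
def log2_exact(n: int) -> int:
    """Return log2(n) for a positive power of two, else raise ValueError."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def address_fields(addr_bits: int, cache_bytes: int, block_bytes: int,
                   word_bytes: int, ways: int) -> tuple[int, int, int, int]:
    """Split an address into (tag, index, word, byte) field widths in bits.

    ways = 1 models direct mapping; ways = number of block frames models
    fully associative mapping (the index field then has width 0).
    """
    byte_bits = log2_exact(word_bytes)                 # byte within a word
    word_bits = log2_exact(block_bytes // word_bytes)  # word within a block
    num_blocks = cache_bytes // block_bytes            # block frames in the cache
    index_bits = log2_exact(num_blocks // ways)        # selects a block or a set
    tag_bits = addr_bits - index_bits - word_bits - byte_bits
    return tag_bits, index_bits, word_bits, byte_bits
```

With the assumed numbers, `address_fields(32, 64 * 1024, 32, 4, 1)` gives a 16-bit tag and an 11-bit block field, and each doubling of the associativity moves one bit from the index field to the tag field. The same amount of cache always leaves the word and byte fields unchanged.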

4 ( 3 points) 2 min. Virtual Memory:

4.1 PTBR stands for ______. It is initiated by (hardware / operating system) and is utilized by (MMU / CCU) (i.e. memory management unit or cache control unit) to look up (TLB / Page Table / Cache Tag M).

4.2 Page Table: Number of A, B, C Tables built by the OS:
PQRST on the side represents a 20-bit (5-digit hex) VPN in a 3-level page table with the upper 8 bits (PQ) indexing the A-level table, the next 8 bits (RS) indexing the B-level tables, and the last 4 bits (T) indexing the C-level tables. Suppose the first 8 distinct virtual pages accessed by the application program had the VPNs as stated in TABLE-I (in sorted order). How many tables of what size are built by the OS by this time?
A-level: ______ B-level: ______ C-level: ______
TABLE-I: P Q R S T

4.3 The advantage of VIPT over PIPT comes from the fact that ______.

4.4 Memory addresses:
In a 32-bit virtual address system using KB pages, state any two consecutive 32-bit word addresses (in hex) which do not fall in the same virtual page.
I am evicting a page containing the byte with virtual address h. What is its virtual page number (in hex)? What is the range of byte addresses residing in that page (lowest virtual byte address to highest virtual byte address)?
The physical page frame number in the main memory is 2 (just 2). What is the range of byte addresses residing in that page (lowest physical byte address to highest physical byte address)?

4.5 Since we use write-back only for virtual memory, we need to maintain a dirty bit associated with (the entire TLB / the entire Page Table / each entry of the page table / each entry of the TLB / each entry in both TLB and Page Table).

4.6 Fully associative mapping may not be prohibitively expensive in the case of a (TLB / L2 cache) because ______.

The next few weeks are very important as we will be covering a lot of material in weeks. Please, please do attend every lecture and discussion. And use our office hours. Thanks.
-- The EE57 Teaching Team
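For readers reviewing the virtual memory questions, the address arithmetic can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the helper names (`vpn_and_range`, `frame_range`, `count_tables`), the 4 KB page size, the sample address, and the sample VPN list are hypothetical and are not the values from the exam or from TABLE-I. The 8/8/4 split of the 20-bit VPN follows the PQ / RS / T description in the question, and the sketch assumes one A-level table exists as soon as any page has been touched.

```python
def vpn_and_range(vaddr: int, page_bytes: int) -> tuple[int, int, int]:
    """Return (VPN, lowest byte address, highest byte address) of the page."""
    offset_bits = page_bytes.bit_length() - 1  # page_bytes must be a power of two
    vpn = vaddr >> offset_bits
    low = vpn << offset_bits
    return vpn, low, low + page_bytes - 1


def frame_range(frame_number: int, page_bytes: int) -> tuple[int, int]:
    """Return (lowest, highest) physical byte address of a page frame."""
    low = frame_number * page_bytes
    return low, low + page_bytes - 1


def count_tables(vpns: list[int]) -> tuple[int, int, int]:
    """Count (A-level, B-level, C-level) tables for 20-bit VPNs PQRST.

    PQ (upper 8 bits) indexes the single A-level table, so one B-level
    table exists per distinct PQ and one C-level table per distinct PQRS.
    """
    b_tables = {v >> 12 for v in vpns}  # distinct PQ values
    c_tables = {v >> 4 for v in vpns}   # distinct PQRS values
    return (1 if vpns else 0), len(b_tables), len(c_tables)
```

With an assumed 4 KB page, `vpn_and_range(0x12345678, 4096)` yields VPN `0x12345` covering `0x12345000` through `0x12345FFF`, and `frame_range(2, 4096)` yields `0x2000` through `0x2FFF`. Under the 8/8/4 split, an A-level or B-level table has 256 entries and a C-level table has 16.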
