CPSC 313, 04w Term 2 Midterm Exam 2 Solutions

Date: March 11, 2005; Instructor: Mike Feeley

1. (10 marks) Short answers.

1a. Give an example of one important CISC feature that is normally not part of a RISC instruction set architecture. Very briefly explain why by referring back to the y86 pipeline implementation.

In a RISC, arithmetic instructions can't access memory. The reason is that the ALU is needed to calculate the effective address of memory instructions and so isn't available to do anything else. Also acceptable: in a RISC, instructions are of uniform size, and conditional branches don't use condition codes.

1b. Write a sequence of two C statements that are causally dependent on each other and show what causes the dependency.

Any pair of statements with a true (read-after-write) dependency works, for example: x = a + b; y = x * 2;. The second statement reads x, which the first statement writes, so the dependency forces their order.

1c. Define temporal locality and say why it is important for caching.

Temporal locality exists when the same memory address is accessed multiple times within a short interval of time. It is important for caching because, if a cache stores recently accessed items, subsequent accesses to those same items will hit in the cache.

1d. Define instruction-level parallelism. Is it important for pipelining or caching? Why?

Instruction-level parallelism exists when instructions can run in parallel or out of order because the dependencies among them do not impose an order. It is important for pipelining: for a pipeline to be effective, it must work on multiple instructions at once, partially in parallel.

1e. Give one advantage and one disadvantage of a direct-mapped cache compared to an associative cache (i.e., fully associative or set-associative).

Advantage: faster. Disadvantage: more conflict misses.

2. (6 marks) Pipelines.

2a. Carefully explain why one would expect that changing a processor implementation from sequential to pipelined would improve the processor's performance.

The circuitry in each pipeline stage has less gate-propagation delay than the original sequential circuit, so transistors are used more effectively and the clock can run faster. In a sequential circuit with a large propagation delay, the transistors close to the inputs spend most of their time just holding a value, waiting for the signal to get through the rest of the circuit. Similarly, transistors distant from the inputs spend most of their time waiting for the signal to arrive. The time the transistors spend waiting is wasted, and the more waste there is, the lower the overall throughput.

2b. List all of the reasons you can think of why adding more stages to a pipelined implementation might not necessarily improve performance.

There are two main reasons. First, adding a pipeline stage adds a fixed propagation delay for the registers that sit between stages; when the delay of each stage is close to this register delay, adding more stages provides little benefit. Second, deeper pipelines are harder to keep filled. They may require more instruction-level parallelism than exists in most programs, so the number of pipeline stalls increases with pipeline depth (i.e., the CPI increases).
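To make the tradeoff in 2a and 2b concrete, here is a small C sketch. The delay and stall numbers are invented for illustration, not measurements of any real processor: it charges a fixed pipeline-register delay per stage and lets stalls per instruction grow with depth.

    #include <stdio.h>

    /* Toy model of pipeline depth vs. throughput. Assumed numbers
     * (illustrative only): 300 ps of total combinational logic and a
     * 20 ps pipeline-register delay per stage. */
    int main(void) {
        const double logic_ps = 300.0; /* unpipelined logic delay      */
        const double reg_ps   = 20.0;  /* register overhead per stage  */

        for (int stages = 1; stages <= 12; stages++) {
            double period = logic_ps / stages + reg_ps; /* clock period */
            /* Assume CPI grows with depth as hazards get harder to hide. */
            double cpi = 1.0 + 0.15 * (stages - 1);
            printf("%2d stages: period %5.1f ps, CPI %.2f, %.3f insn/ns\n",
                   stages, period, cpi, 1000.0 / (period * cpi));
        }
        return 0;
    }

With these made-up numbers, throughput climbs quickly at first, peaks around nine or ten stages, and then declines: exactly the effect described in 2b, and the reason a single throughput number is suspect, as 2c discusses next.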

2c. Does it make sense for Intel to report a single throughput measurement for the P4 Prescott? Why or why not?

No, because the actual throughput of a pipelined processor depends on the workload. Some workloads have little instruction-level parallelism and thus cause many pipeline stalls; others do better. Throughput depends not just on the cycle time but also on the number of pipeline stalls per instruction.

3. (3 marks) Object files and linking.

3a. What is the benefit of separate compilation and what problem does it cause that linking solves?

The benefit is that it makes it easier to write large programs. It allows you to compile just the modules that change during the software-development cycle, not everything, and is thus faster. It also allows you to use modules for which you do not have the source code and thus aids software reuse (e.g., libraries). The problem it creates is that the compiler is unable to resolve references to symbols defined in other modules.

3b. What is the difference between a relocatable and an executable object file? How do they relate to the first part of this question?

The compiler generates relocatable object files. These files have unresolved references and do not yet have runtime virtual-memory addresses assigned. The linker generates executable object files from collections of relocatable files. Executable files have no unresolved references and have runtime virtual-memory addresses assigned to all code and data; they are thus ready to be loaded into memory and executed by the CPU.
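As a concrete illustration of 3a and 3b, consider this two-module C program (the file and function names are invented for the example):

    /* util.c: defines the symbol scale. */
    int scale(int x) { return 2 * x; }

    /* main.c: references a symbol defined in another module. */
    extern int scale(int x);   /* the compiler cannot resolve this */
    int main(void) { return scale(21); }

Compiling each file separately (cc -c main.c; cc -c util.c) produces relocatable object files in which main.o's call to scale is an unresolved reference. Linking them (cc -o prog main.o util.o) resolves the reference and assigns runtime addresses, producing the executable described in 3b.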

4. (8 marks) You are responsible for designing the next-generation y86 CPU. The design challenge you are confronted with is that the execute phase has roughly twice the delay of any of the other pipeline phases. You decide to split this stage in two, so that the pipeline now has six stages and none of the results of the execute phase are available until the end of the second execute stage. Answer the following questions.

4a. What problem did the slow execute phase cause and how did splitting it in two solve this problem?

The problem is that the clock can only be as fast as the slowest pipeline stage, so the slow execute stage was constraining throughput. By splitting it into two stages, the clock speed can be doubled and throughput may thus double.

4b. Recall that a data hazard exists in the following scenario (among others):

    rrmovl %eax, %ebx
    rrmovl %ebx, %ecx

Briefly explain how the current, 5-stage y86 implementation handles this particular hazard and state whether doing so requires any pipeline stalls or bubbles.

Register values are forwarded from the output of the execute phase to the decode phase. In this case, the second instruction reads the value of %ebx from the forwarding signals instead of from the register file. No pipeline stall is required in the 5-stage pipeline to handle this hazard.

4c. Your 6-stage pipeline cannot handle the hazard illustrated by this example in the same way as the 5-stage pipeline. Carefully explain why not. Are additional pipeline stalls or bubbles needed to correctly handle the two-instruction sequence above (assuming no other change to the design)? How many?

In the 6-stage pipeline, the new value of %ebx isn't known until the end of the second execute stage, which is two stages after the decode stage. Thus the second instruction will be in the first execute stage, not the decode stage, when the new value becomes available on the forwarding path. If we do nothing else, a single-cycle stall, and a corresponding bubble, is required between the two instructions to keep the second instruction in the decode phase until the first has finished the second execute phase.

4d. Can you modify the design to improve the situation for the two-instruction sequence above? Give a brief, high-level outline of your solution (implementation details are not necessary).

The rrmovl instruction doesn't actually need the value of %ebx until the write-back stage, at which point the first instruction has been retired. So the solution is to add forwarding signals from the end of the second execute phase back to the end of the first execute phase. The second instruction will thus read the wrong value from the register file in the decode phase, but then get the right value in the execute phase, in plenty of time to write the correct value back to %ecx in the register file.

4e. Can you use the same technique for the following two-instruction sequence?

    addl %eax, %ebx
    addl %ebx, %ecx

Carefully explain why or why not.

In this case, the second instruction needs the value of %ebx at the beginning of the first execute phase, but the value isn't generated until the first instruction finishes the second execute stage (i.e., when the second instruction is at the end of the first execute stage: too late to be useful). A stall is thus required in this case.
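The idea behind the forwarding fix in 4d can be modeled in a few lines of C. This is a toy sketch, not the textbook's PIPE signal names: when a source register has a pending write somewhere later in the pipeline, a mux selects that in-flight value in preference to the stale copy in the register file.

    #include <stdio.h>
    #include <stdbool.h>

    /* One pending register write somewhere later in the pipeline. */
    typedef struct {
        bool valid;   /* does this stage hold a pending write?  */
        int  dst;     /* destination register number            */
        int  val;     /* value being written                    */
    } pending_write;

    /* Toy forwarding mux: return the freshest copy of register src. */
    int read_source(int src, int regfile_val,
                    const pending_write *fwd, int nfwd) {
        /* Scan forwarding sources from youngest to oldest; the first
         * match is the freshest value for src. */
        for (int i = 0; i < nfwd; i++)
            if (fwd[i].valid && fwd[i].dst == src)
                return fwd[i].val;
        return regfile_val;  /* no pending write: register file is current */
    }

    int main(void) {
        /* Suppose %ebx is register 3: the register file still holds 7,
         * but a later stage is about to write 42 to it. */
        pending_write fwd[] = { { true, 3, 42 } };
        printf("read %%ebx -> %d\n", read_source(3, 7, fwd, 1)); /* 42 */
        return 0;
    }

In the 6-stage design of 4d, the pending write becomes visible at the end of the second execute stage, so the mux that consumes it sits at the end of the first execute stage rather than in decode.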

5. (7 marks) Consider the following y86 program and the standard y86 pipeline implementation (i.e., PIPE) described in class and in the textbook.

5a. Annotate this code to indicate every place that a pipeline stall occurs and explain why it occurs in each case.

1. A load-use hazard stalls this instruction in the decode phase for one cycle, introducing one bubble.
2. Possibly two bubbles if the branch is not taken. However, the branch is probably usually taken (it is not taken only when %ecx is zero), so there is probably no problem here.
3. A return hazard stalls the fetch phase for three cycles, waiting to read the return address from memory.

5b. Modify this piece of code to eliminate as many pipeline stalls as you can, but without changing its semantics (i.e., its meaning). You can view this code in complete isolation from any other code; that is, you can assume that no other code calls into this code. Write your modified code below, using comments to highlight and explain every change you have made.

    start:
        rrmovl %eax, %edx        # [1]
        mrmovl 12(%ebp), %ebx
        rmmovl %edx, 8(%ebp)     # [2]
        addl   %ebx, %ecx        # [3]
        irmovl $100, %eax        # [4]
        addl   %eax, %ecx
        jne    notzero           # [5]
        irmovl $1, %ecx
    notzero:
        rmmovl %ecx, 12(%ebp)
        ...
        hlt

1. Renamed %ebx to %edx to avoid an anti-dependency between the next two instructions so that they could be swapped.
2. Swapped with the previous instruction to fill the load-use delay slot, eliminating the pipeline stall that used to occur here.
3. Forwarding logic can now avoid a stall here, because this use of %ebx is two instructions away from the load of %ebx from memory.
4. Inlined the procedure foo to avoid the return hazard.
5. Nothing to do here, assuming %ecx is usually not zero.

6. (4 marks) When caches help.

6a. What feature of a cache is designed to exploit temporal locality?

The fact that the cache stores recently accessed items for a while.

6b. What feature of a cache is designed to exploit spatial locality?

The fact that blocks are bigger than 4 bytes.

6c. Can a cache provide any benefit to a program that has neither temporal nor spatial locality? How?

Yes, if the hardware provides a mechanism for prefetching and either the workload has a predictable access pattern or the application program can predict its future accesses far enough in advance to hide the memory-access latency.
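To make 6a and 6b concrete, here is a small hypothetical C example (the functions and sizes are invented for illustration): the first loop has spatial locality, the second has temporal locality, and the third has neither.

    #include <stddef.h>

    #define N (1 << 20)

    /* Spatial locality: consecutive elements share a cache block, so
     * one miss per block services several accesses. */
    long sum_stride1(const int *a) {
        long s = 0;
        for (size_t i = 0; i < N; i++) s += a[i];
        return s;
    }

    /* Temporal locality: the same 256 table entries are touched over
     * and over, so after the first pass they all hit in the cache. */
    long sum_table(const int *a, const int *table) {
        long s = 0;
        for (size_t i = 0; i < N; i++) s += table[a[i] & 255];
        return s;
    }

    /* Neither: a random walk over a huge array revisits nothing soon
     * and skips across blocks, so most accesses miss (unless the
     * hardware can prefetch, as in 6c). */
    long sum_random(const int *a, const size_t *idx) {
        long s = 0;
        for (size_t i = 0; i < N; i++) s += a[idx[i]];
        return s;
    }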

7. (6 marks) You work for Intel's cache design group and are having a bad day. You've just finished the design of the cache you are responsible for. Now your boss is pushing you to make one more change, and you don't want to. For each of the following suggested changes, give a good reason why the change might make things worse, assuming that the total size of the cache (i.e., the total amount of data it can store) must remain the same.

7a. Increase the block size.

This decreases the cache's ability to capture temporal locality, because the cache stores fewer blocks in total.

7b. Decrease the block size.

This decreases the cache's ability to capture spatial locality.

7c. Decrease the set associativity (e.g., from 8-way to 2-way).

This increases the potential for conflict misses.

7d. Increase the set associativity (e.g., from 8-way to 32-way).

This slows the cache-hit time.

7e. Change from write-through to write-back.

This slows write-miss handling, because the write must be delayed until the accessed block is fetched from memory into the cache.

8. (6 marks) Consider a small 64-B, 2-way set-associative, write-through, write-allocate cache with 8-B blocks, using LRU replacement. The highest-order address bits are assigned to the tag.

8a. If addresses are 32 bits long, how many address bits are needed for each of the following?

    block offset (b) = 3
    set index (s) = 2
    tag (t) = 27

8b. The cache is initially empty and then sees the following sequence of accesses. For each access, indicate whether it is a hit or a miss; if it is a miss, give the type of miss. (Each address is shown decomposed as tag, set index, block offset.)

    read  0x10 = 0 10 000: cold miss
    read  0x04 = 0 00 100: cold miss
    read  0x14 = 0 10 100: hit
    write 0x08 = 0 01 000: cold miss
    read  0x0c = 0 01 100: hit
    read  0x24 = 1 00 100: cold miss
    read  0x10 = 0 10 000: hit
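The decomposition and hit/miss results in 8b can be checked mechanically. Here is a short C sketch (written for this solution, not part of the exam) that decodes each address using b = 3 and s = 2 and simulates the 2-way LRU cache; because the cache is write-allocate, reads and writes are handled identically for hit/miss purposes.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define B    3   /* block-offset bits: 8-B blocks */
    #define S    2   /* set-index bits: 4 sets        */
    #define WAYS 2   /* associativity                 */

    typedef struct { bool valid; uint32_t tag; int last_used; } line_t;

    int main(void) {
        line_t cache[1 << S][WAYS] = {0};
        const uint32_t trace[] = { 0x10, 0x04, 0x14, 0x08, 0x0c, 0x24, 0x10 };
        const int n = sizeof trace / sizeof trace[0];

        for (int t = 0; t < n; t++) {
            uint32_t addr = trace[t];
            uint32_t set  = (addr >> B) & ((1u << S) - 1);
            uint32_t tag  = addr >> (B + S);
            bool hit = false;
            int victim = 0;

            for (int w = 0; w < WAYS; w++) {
                if (cache[set][w].valid && cache[set][w].tag == tag) {
                    hit = true;
                    cache[set][w].last_used = t;  /* refresh LRU stamp */
                    break;
                }
                /* Prefer an invalid way, else the least recently used. */
                if (!cache[set][w].valid ||
                    (cache[set][victim].valid &&
                     cache[set][w].last_used < cache[set][victim].last_used))
                    victim = w;
            }
            if (!hit)  /* miss: allocate the block (write-allocate) */
                cache[set][victim] = (line_t){ true, tag, t };

            printf("0x%02x: tag=%u set=%u -> %s\n",
                   (unsigned)addr, (unsigned)tag, (unsigned)set,
                   hit ? "hit" : "miss");
        }
        return 0;
    }

Running it prints one line per access and reproduces the answer above; changing WAYS or B is an easy way to explore the tradeoffs from question 7.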