Write only as much as necessary. Be brief!

CIS371 Computer Organization and Design
Final Exam
Prof. Martin
Wednesday, May 2nd, 2012

This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached (with staple) if necessary. Please ensure that your answers are concise and legible. Read and follow the directions of each question carefully. Please attempt to answer all the questions (don't allow yourself to get stuck on a single question). You have 120 minutes to complete the exam. Write only as much as necessary. Be brief!

Name:

Problem   Page   Possible   Score
Total            78

1. [ 12 Points ] Datapath & Pipelining.

(a) Consider a simple in-order five-stage pipeline with a two-cycle branch misprediction penalty and a single-cycle load-use delay penalty. For a specific program, 30% of the instructions are loads, 20% are branches, and the remaining 50% of instructions are simple single-cycle ALU operations. Half of the load instructions are followed immediately by a dependent instruction, and 75% of branches are predicted correctly. What is the average CPI of this program on this processor?

(b) The scalar pipeline in the lab assignment had no structural hazards. What two aspects of the design were responsible for avoiding such structural hazards?

(c) The superscalar pipeline from the lab did have a structural hazard. What was the cause of the hazard and how was it handled in the design?

(d) The maximum speedup achievable by pipelining a single-cycle datapath into five stages is 5x. Give three distinct reasons why this ideal speedup is generally not achieved in practice:

(e) The maximum speedup achievable by converting a scalar single-cycle datapath into a two-issue superscalar single-cycle processor is 2x. Give two distinct reasons why this ideal speedup is generally not achieved in practice:
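The CPI arithmetic in part (a) can be sketched as a base CPI plus the average stall cycles per instruction. This is a minimal illustration, assuming a base CPI of 1 and that load-use and misprediction stalls never overlap; the function and parameter names are hypothetical, not part of the exam.

```python
def average_cpi(base_cpi, load_frac, load_use_frac, load_use_penalty,
                branch_frac, mispredict_frac, mispredict_penalty):
    """Base CPI plus the average stall cycles contributed per instruction."""
    # Assumption: the two stall sources are independent and never overlap.
    load_stalls = load_frac * load_use_frac * load_use_penalty
    branch_stalls = branch_frac * mispredict_frac * mispredict_penalty
    return base_cpi + load_stalls + branch_stalls


# Numbers from part (a): 30% loads (half followed by a dependent instruction,
# one-cycle penalty), 20% branches (25% mispredicted, two-cycle penalty).
cpi = average_cpi(1.0, 0.30, 0.50, 1, 0.20, 0.25, 2)
print(f"average CPI = {cpi:.2f}")  # average CPI = 1.25
```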

2. [ 8 Points ] Branch Prediction Conflicts and Tagged Predictors.

You're a microprocessor designer, and your simulations of an important workload indicate that branch instructions at the following two 32-bit addresses (in binary) are executed frequently:

Address A:
Address B:

(a) The above two instructions are likely to conflict (hash to the same entry) in a branch predictor. For a simple bimodal predictor of two-bit saturating counters, how many entries must the predictor have to prevent these two branches from interfering (conflicting) with each other? How many total bytes must the predictor be?

(b) You have a brilliant idea: Why not create a set-associative branch direction predictor? Your simulations indicate that a predictor with just 2048 entries (512 bytes) would be sufficient if it wasn't for these two trouble branches. Consider a two-way set-associative predictor that uses a straightforward tagging strategy (one tag for each two-bit counter, instructions have 32-bit addresses). How large (in KBs) is a two-way set-associative 2048-entry tagged predictor?

(c) You have another brilliant idea: Because this is just a predictor, it can be wrong, so it doesn't actually need the full tags, just enough of a tag to avoid this particular conflict. With this new insight, (1) how large should the tags be and (2) what is the total size in KBs of this predictor?

(d) How might a predictor capture both the conflict-mitigating benefits of a tagged set-associative predictor and most of the area efficiency of a tag-less predictor?
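The sizing questions above come down to counting bits. The sketch below is one way to set up that arithmetic. It rests on assumptions the exam text here does not state: the predictor is indexed by the low-order address bits just above the alignment bits, instructions are 4-byte aligned (`align_bits=2`), and valid/LRU bits are ignored. The two addresses are hypothetical stand-ins, since the exam's Address A and Address B are not shown.

```python
def bimodal_bytes(entries, counter_bits=2):
    """Storage for a tag-less table of saturating counters."""
    return entries * counter_bits // 8


def min_entries_to_separate(addr_a, addr_b, align_bits=2):
    """Smallest power-of-two table in which two addresses use different entries.

    Assumption: the index is taken from the address bits just above the
    alignment bits, so the table must cover the lowest bit where they differ.
    """
    diff = (addr_a ^ addr_b) >> align_bits
    if diff == 0:
        raise ValueError("addresses index the same entry at every table size")
    lowest_differing_bit = (diff & -diff).bit_length() - 1
    return 1 << (lowest_differing_bit + 1)


def full_tag_bits(entries, ways, addr_bits=32, align_bits=2):
    """Tag width when every non-index, non-alignment address bit is stored."""
    sets = entries // ways
    index_bits = sets.bit_length() - 1
    return addr_bits - align_bits - index_bits


def tagged_predictor_bytes(entries, tag_bits, counter_bits=2):
    """Storage when each counter carries its own tag (no valid/LRU bits)."""
    return entries * (counter_bits + tag_bits) / 8


# Hypothetical addresses that differ only in bit 16.
entries_needed = min_entries_to_separate(0x00400010, 0x00410010)
print(entries_needed, bimodal_bytes(entries_needed))   # 32768 8192

tag_bits = full_tag_bits(2048, ways=2)
print(tag_bits, tagged_predictor_bytes(2048, tag_bits) / 1024)  # 20 5.5
```

With a different alignment assumption (for example `align_bits=0`), the tag width and the total size change accordingly.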

3. [ 10 Points ] Caching.

Consider two different cache configurations for an 8-bit processor. Both caches have two 16-byte blocks (for a total capacity of 32 bytes), but one is direct-mapped and the other is two-way set associative and uses the least-recently used (LRU) replacement algorithm. All caches begin empty.

(a) For the direct-mapped cache, how many bits are in the tag, index, and offset?

tag:    index:    offset:

(b) For the two-way set associative cache, how many bits are in the tag, index, and offset?

tag:    index:    offset:

(c) Give a short sequence (4 or fewer) of addresses in which the set-associative cache has a better hit rate than the direct-mapped cache. (Give the addresses as an 8-digit binary number.) Also give the miss rate for each.

Sequence:
Miss rate on direct-mapped:
Miss rate on set-associative:

(d) Give a short sequence (4 or fewer) of addresses in which the direct-mapped cache has a better hit rate than the set-associative cache. (Give the addresses as an 8-digit binary number.) Also give the miss rate for each.

Sequence:
Miss rate on direct-mapped:
Miss rate on set-associative:
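A small simulator makes it easy to check candidate address sequences for parts (c) and (d). The sketch below models the two configurations described above (two 16-byte blocks, 8-bit addresses, LRU replacement). The helper names and the two example sequences are illustrative choices, not the exam's official answers.

```python
from collections import OrderedDict

BLOCK_BYTES = 16
NUM_BLOCKS = 2


def field_widths(ways, addr_bits=8):
    """Return (tag, index, offset) bit counts for the given associativity."""
    sets = NUM_BLOCKS // ways
    offset_bits = BLOCK_BYTES.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return addr_bits - index_bits - offset_bits, index_bits, offset_bits


def miss_rate(addresses, ways):
    """Simulate an initially empty cache with LRU replacement in each set."""
    sets = NUM_BLOCKS // ways
    cache = [OrderedDict() for _ in range(sets)]  # per-set tags, oldest first
    misses = 0
    for addr in addresses:
        block = addr // BLOCK_BYTES
        index, tag = block % sets, block // sets
        lines = cache[index]
        if tag in lines:
            lines.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used tag
            lines[tag] = True
    return misses / len(addresses)


print(field_widths(ways=1))  # (3, 1, 4)
print(field_widths(ways=2))  # (4, 0, 4)

# Two blocks that share a direct-mapped index, accessed alternately.
ping_pong = [0b00000000, 0b00100000, 0b00000000, 0b00100000]
print(miss_rate(ping_pong, ways=1), miss_rate(ping_pong, ways=2))  # 1.0 0.5

# Three blocks where LRU evicts the block that is about to be reused.
lru_victim = [0b00010000, 0b00000000, 0b00100000, 0b00010000]
print(miss_rate(lru_victim, ways=1), miss_rate(lru_victim, ways=2))  # 0.75 1.0
```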

4. [ 11 Points ] Memory Hierarchy Calculations.

(a) Consider a simple core with a single level of cache that achieves 1 CPI when every memory operation hits in the cache. The miss penalty to the main memory is 150 cycles. For a specific workload, the miss rate is 1 miss every 50 instructions (or 0.02 misses per instruction). What is the CPI of this chip for this workload?

(b) Consider adding a second-level cache in between the first-level cache and memory. For this workload, a miss in the second-level cache occurs only once every 250 instructions (0.004 misses per instruction) and has the same latency as accessing memory as above (150 cycles). The first-level cache is unchanged. Calculate under what conditions adding this cache will at least double the performance (halve the CPI).

Now let's revisit this design in the context of energy. Assume that when all memory operations hit in the cache, the core consumes 1 nano-joule per instruction (njpi). At 1 CPI at 1 GHz, that is equal to expending 1 Watt of power. That is, njpi * IPC * (clock frequency in GHz) = Watts. Hint: apply what you have learned about calculating CPIs.

(c) Of course, the memory hierarchy also consumes energy. Using the same miss rates (1 miss every 50 instructions), if a main memory access consumes 100 nJ, calculate the nano-joules per instruction of the single-level cache hierarchy.

(d) Perform a similar calculation (to part (c)) for the two-level hierarchy, assuming the energy to access the second-level cache is 30 nJ and it incurs one miss every 250 instructions.

(e) Did adding the second level of cache increase or decrease the energy per instruction?

(f) Now consider energy per second, also known as power and measured in Watts. Assuming that adding the second level of cache does halve the CPI, does it increase or decrease the power dissipation of the chip? Why?

(g) Based on these two metrics, would you expect battery life of a mobile device using this chip to increase or decrease? Explain.
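The CPI and energy questions above share one formula shape: a base cost per instruction plus (events per instruction) times (cost per event). The sketch below applies it to the numbers in parts (a) through (f). It assumes that every first-level miss pays the second-level access cost and that every second-level miss additionally pays the full memory cost; the function names are hypothetical.

```python
def one_level(base, l1_mpi, mem_cost):
    """Per-instruction cost (cycles or nJ) with a single cache level."""
    return base + l1_mpi * mem_cost


def two_level(base, l1_mpi, l2_cost, l2_mpi, mem_cost):
    """Per-instruction cost when L1 misses go to L2 and L2 misses go to memory.

    Assumption: the L2 cost is paid on every L1 miss, and the memory cost is
    paid on top of it on every L2 miss.
    """
    return base + l1_mpi * l2_cost + l2_mpi * mem_cost


def max_l2_latency(target_cpi, base, l1_mpi, l2_mpi, mem_cost):
    """Largest L2 latency (cycles) that still meets the target CPI."""
    return (target_cpi - base - l2_mpi * mem_cost) / l1_mpi


def watts(njpi, cpi, ghz=1.0):
    """Power from the exam's relation: njpi * IPC * GHz."""
    return njpi * (1.0 / cpi) * ghz


cpi_1 = one_level(1.0, 0.02, 150)                               # (a)
l2_limit = max_l2_latency(cpi_1 / 2, 1.0, 0.02, 0.004, 150)     # (b)
njpi_1 = one_level(1.0, 0.02, 100)                              # (c)
njpi_2 = two_level(1.0, 0.02, 30, 0.004, 100)                   # (d)

print(f"CPI, one level: {cpi_1:.1f}")                           # 4.0
print(f"L2 latency limit: {l2_limit:.1f} cycles")               # 20.0 cycles
print(f"njpi: {njpi_1:.1f} -> {njpi_2:.1f}")                    # 3.0 -> 2.0
print(f"Watts: {watts(njpi_1, cpi_1):.2f} -> {watts(njpi_2, cpi_1 / 2):.2f}")  # 0.75 -> 1.00
```

Under these assumptions the energy per instruction falls while the power rises, because the same work finishes in half the time.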

5. [ 11 Points ] Scheduling.

When the code below is executed, assume the loop iterates thousands of times, and thus you should ignore any startup or initialization effects.

    LOOP: MUL V <- X * X
          ADD Y <- Y + 1
          STORE 1 -> [Y+V]
          BRANCH-IF (V == 0), EXIT
          LOAD X <- [X]
          BRANCH LOOP

LOAD and MUL have a latency of more than one cycle; all other operations have a latency of one cycle. The pipeline is fully bypassed, has no structural hazards, and all execution units are fully pipelined. The hardware's branch direction and branch target prediction are both perfect. The pipeline is non-superscalar, so it can execute at most one instruction per cycle. Initially, consider an in-order pipeline.

(a) Assuming the latency of MUL and LOAD are both two cycles (one cycle use penalty), approximately how many cycles would it take to execute this loop 1000 times (to the nearest 1000 cycles)? Why?

(b) Assuming the latency of MUL and LOAD are both increased to four cycles (three cycle use penalty), approximately how many cycles would it take to execute this loop 1000 times (to the nearest 1000 cycles)? Why?

(c) Static scheduling does not improve the performance of this loop. Why not? Give two specific limitations that combine to prevent static scheduling from being effective in this example.

(d) Now consider a dynamically scheduled (out-of-order) pipeline that supports a large number of in-flight instructions (100+). The pipeline is still non-superscalar, so it can execute a maximum of one instruction per cycle. Assuming the same four-cycle latency (three cycle use penalty) for MUL and LOAD instructions, approximately how many cycles would it take to execute this loop 1000 times (to the nearest 1000 cycles)? Why?
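The steady-state cost of the loop can be estimated with a very simple issue model. The sketch below is a simplified illustration, not the course's simulator: it assumes one instruction issues per cycle in program order, full bypassing, and perfect branch prediction, so an instruction waits only for its source registers. The out-of-order figure is an idealized bound (the larger of the issue-bandwidth limit and the longest loop-carried dependence), not a simulation. All names are hypothetical.

```python
def loop_body(long_latency):
    """The exam's loop as (name, destination, sources, latency) tuples."""
    return [
        ("MUL", "V", ["X"], long_latency),
        ("ADD", "Y", ["Y"], 1),
        ("STORE", None, ["Y", "V"], 1),
        ("BRANCH-IF", None, ["V"], 1),
        ("LOAD", "X", ["X"], long_latency),
        ("BRANCH", None, [], 1),
    ]


def in_order_cycles_per_iteration(long_latency, iterations=50):
    """Steady-state cycles between successive iterations, in-order issue."""
    ready = {}    # register -> first cycle its value can be consumed
    cycle = 0     # earliest cycle the next instruction may issue
    starts = []   # issue cycle of the first instruction of each iteration
    for _ in range(iterations):
        for position, (name, dest, srcs, latency) in enumerate(loop_body(long_latency)):
            issue = max([cycle] + [ready.get(src, 0) for src in srcs])
            if position == 0:
                starts.append(issue)
            if dest is not None:
                ready[dest] = issue + latency
            cycle = issue + 1
    return starts[-1] - starts[-2]


def out_of_order_bound(long_latency):
    """Idealized bound: issue bandwidth versus the loop-carried chain.

    The only multi-cycle loop-carried dependence is LOAD X <- [X] feeding the
    next iteration's LOAD; V is recomputed each iteration and not carried.
    """
    return max(len(loop_body(long_latency)), long_latency)


print(in_order_cycles_per_iteration(2))  # 6
print(in_order_cycles_per_iteration(4))  # 10
print(out_of_order_bound(4))             # 6
```

Multiplying each figure by 1000 gives the approximate cycle counts the questions ask for under this model.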

6. [ 10 Points ] Thread-Level Parallelism and Multicore.

(a) What is the primary disadvantage of coarse-grained locking?

(b) What are two disadvantages of employing fine-grained locking?

(c) What new mechanism has recently been proposed to help ameliorate the locking granularity problem?

(d) What is the primary advantage of the MSI protocol over the simpler VI cache coherence protocol?

(e) What is the primary advantage of the MESI protocol over the simpler MSI cache coherence protocol?

(f) What is false sharing? How can software help avoid false sharing?

(g) What is a memory fence (also known as a memory barrier)? Where are they typically used?

7. [ 16 Points ] Parallelism At All Levels.

Throughout the semester we've explored parallelism at many different levels (from very fine-grained parallelism within an ALU to multiple cores on a chip). Give an example use of parallelism at five different levels of granularity. These should be big-picture major approaches for using parallelism to extract significant performance (for example, each could provide a 4x or more performance improvement). For each of the examples, give: (i) the name or term, (ii) the specific reasons for (or benefits of) employing parallelism, and (iii) the primary disadvantage or challenge of exploiting parallelism at that level of granularity.

Give a specific concrete instance of a system that exploits all of these forms of parallelism:

For reference:

[Table: hexadecimal digits 0x0 through 0xF with their binary and decimal equivalents]

[Table: powers of two with the corresponding sizes in bytes, kilobytes, megabytes, and gigabytes, up to 64 GB]
