CS146: Computer Architecture
Spring 2004, Homework #2
Due March 10, 2004 (Wednesday) Evening

Please state clearly any assumptions you make in solving the following problems.

1. Pipelining

Consider the code segment below. Assume that full bypassing/forwarding has been implemented. Assume that the initial value of register R23 is much bigger than the initial value of register R20. Assume that all memory references take a single cycle. Assume that both load-use hazards and branch delays are hidden using delay slots. You may not reorder instructions to fill such slots, but if a subsequent instruction is independent and is properly positioned, you may assume that it fills the slot. Otherwise, fill slots with additional no-ops as needed.

LOOP: lw    R10, X(R20)
      lw    R11, Y(R20)
      subu  R10, R10, R11
      sw    Z(R20), R10
      addiu R20, R20, 4
      subu  R5, R23, R20
      bnez  R5, LOOP
      nop              ; 1 delay slot

a. (5 points) Draw a pipeline diagram of 2 iterations of its execution on a standard 5-stage MIPS pipeline (for clarity, use graph paper or a computer). Assume that the branch is resolved using an ID control point. In the box below, write the total number of cycles required to complete 2 iterations of the loop.

Cycles =
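For reference, a pipeline diagram puts one instruction per row and one clock cycle per column, marking the stage each instruction occupies in every cycle. As an illustration only (this dependent pair is hypothetical and not part of the loop above), two instructions on the standard 5-stage pipeline with full forwarding would be drawn as:

    cycle:            1    2    3    4    5    6
    add R1, R2, R3    IF   ID   EX   MEM  WB
    sub R4, R1, R5         IF   ID   EX   MEM  WB

The sub reads R1 in its EX stage (cycle 4) via forwarding from the add's EX/MEM latch, so no stall is needed; a load-use pair would instead require the delay slot described above.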

b. (15 points) On the second grid page that follows, draw a pipeline diagram of 2 iterations of the loop on the pipeline below. Note that the loop has a single branch-delay-slot nop included; you may need to add more. You cannot assume anything about the program's register usage before or after this code segment. Fill in the boxes below.

Pipeline Branch Delay =
Pipeline Load Delay =
Cycles =

Pipeline:
IF1: Begin instruction fetch.
IF2: Complete instruction fetch.
ID:  Instruction decode.
RF:  Register fetch.
EX1: ALU operation execution begins. Branch target calculation finishes. Memory address calculation. Branch condition resolution calculation begins.
EX2: Branch condition resolution finishes. ALU operations finish. (Branch target and memory address calculations finish in a single cycle.)
M1:  First part of memory access; TLB access.
M2:  Second part of memory access; data sent to memory for stores or returned from memory for loads.
WB:  Write results back to the register file.

2. Scoreboarding

For the code sequence shown below, draw a pipeline diagram of how instructions would issue in a machine using scoreboarding as discussed in class. Use the execution mix and scoreboard structure as given in the class example. Assume that the FP Add unit has 4 EX phases, the FP Multiply unit has 7 EX phases, and the Divide unit has 24 EX phases. FP adds, subtracts, and multiplies are fully pipelined, while divide operations are NOT pipelined.

LD   F6, 12(R2)
LD   F2, 16(R3)
ADDD F0, F2, F4
DIVD F10, F0, F6
SUBD F8, F6, F2
ADDI R2, R2, 8
ADDI R3, R3, 16
ADDD F6, F8, F2

3. Scoreboarding vs. Tomasulo's Algorithm

A shortcoming of the scoreboard approach occurs when multiple functional units that share input buses are waiting for a single result. The units cannot start simultaneously, but must serialize. This is not true in Tomasulo's algorithm. Give a code sequence that uses no more than 10 instructions and shows this problem. Assume the hardware configuration from Figure A.51 for the scoreboard and Figure 3.2 for Tomasulo's scheme. Indicate where Tomasulo's algorithm can continue but the scoreboard approach must stall. Assume the following latencies:

Instruction Producing Result   Instruction Using Result   Latency in Clock Cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               1

4. SimpleScalar Problem

In this problem, we will begin to use the sim-bpred simulator, which allows the user to explore various branch prediction algorithms via functional simulation. This simulator can study basic branch predictors like the ones that we studied in class: predict not-taken, predict taken, bimodal (2-bit counters), two-level, or a combination of bimodal and two-level. Here is the list of SPEC2000 benchmarks that we will run for this assignment. Pick at least three to report your results for.

Benchmark   Instructions   Sim-bpred Time (sec)
perlbmk     205927571      80
ammp        45838888       21
gcc         96761516       39
mcf         187806891      91
parser      269103374      108

a) How to run the benchmarks

I have made a few scripts that will help you with this assignment. They are in ~cs146/run_scripts/ on the FAS system. There are two scripts that you will find there, run_spec2k.pl and extract_results.pl. These two Perl scripts have all of the information that you will need to run the benchmarks. You will just need to set up your account and configure the script for this assignment.

Steps:
1) Go to your home directory and make two directories, run_scripts and results (i.e., mkdir run_scripts). These can be in the root of your home directory or somewhere else; for example, you may have a subdirectory cs146 where you put all of your CS146 coursework. You should also have your simplesim-3.0 directory in here.
2) Copy the scripts from ~cs146/run_scripts to this directory.
3) First edit run_spec2k.pl. There are a few things to note in this script. First, the third definition of @BENCH_LIST lists the benchmarks that the script will run; edit this to add or remove benchmarks that you would like to run. Now skip to near the end of the script, where it says $HOME_DIR =, and change this directory to where you did mkdir run_scripts above. You can also change the binary that you are simulating with by changing the $SIM_BIN variable; in the base script it is set to sim-fast, and you will want to change it to sim-bpred for this assignment. Now you can just execute ./run_spec2k.pl to run the simulations for this assignment. All of the results will be stored in your results directory, in a subdirectory named after each benchmark.
4) Make similar changes to the extract_spec2k.pl file. This file will help you automatically extract the result data from your results directory. Change @STAT_LIST to print out additional stats (for example, the branch predictor hit rates). Just enter the new SimpleScalar stat name into the list.
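Part (b) below suggests using a foreach statement to cycle through your predictor configurations. Here is a minimal sketch of that idea. The @CONFIGS strings, the result-file naming, and the direct sim-bpred invocation are illustrative assumptions, not the shipped script; the real run_spec2k.pl wires $CONFIG and $SIM_BIN up its own way, so adapt the loop to that script rather than copying this verbatim.

#!/usr/bin/perl
# Hedged sketch: sweep several -bpred configurations, one result file each.
use strict;
use warnings;

my @CONFIGS = (
    "-bpred bimod -bpred:bimod 2048",       # bimodal: 2048-entry counter table
    "-bpred 2lev -bpred:2lev 1 1024 10 0",  # one 2-level variant
    "-bpred 2lev -bpred:2lev 4 1024 8 0",   # another 2-level variant
);

foreach my $config (@CONFIGS) {
    # Build a filename tag from the config string so each run gets its
    # own result file (the assignment warns about this).
    (my $tag = $config) =~ s/[^\w]+/_/g;
    my $cmd = "./sim-bpred $config -redir:sim results/$tag.out ./benchmark";
    print "Running: $cmd\n";
    system($cmd) == 0 or warn "simulation failed for: $config\n";
}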

b) Problem Statement

This problem involves a design-space exploration study. Recall that 2-bit saturating counters are preferable to 1-bit counters because they provide some hysteresis for the branch decisions. However, using 2-bit counters requires more space, thus leaving less room for entries in the predictor. In this problem, we want to explore varying the number of bits in these counters (1 bit or 2 bits) vs. the number of entries in the branch predictor. The point of this assignment is to determine whether the extra storage space is better spent on more entries or on more bits per entry, and whether there is a crossover point where one option becomes more efficient than the other.

This will require some minor changes to the bpred.c code to support the 1-bit counter option. The 2-bit counter is the default, so this runs out of the box. The changes are pretty minimal once you find where to make them. (A sketch of the counter logic appears at the end of this handout.)

Choose three of the benchmarks to run with sim-bpred and then simulate with various values of -bpred. This can be changed by adding parameters to the $CONFIG variable in the script; better yet, use a foreach statement to cycle through all of your combinations, as sketched at the end of part (a) above, and make sure that you generate a different result file for each configuration. Use the table in part (a) to make sure that you are correctly running the benchmarks. Try at least the bimodal predictor and at least two 2-level predictors (say, GAp and PAg).

c) What to report?

The above will require a fair number of simulations, with about 1-2 minutes required per simulation. You will have three benchmarks, three base predictor configurations (bimodal and two 2-level predictors), and the 1-bit vs. 2-bit counter simulations. Then you will have to perform enough sizing analysis to determine which decision makes the most sense. After you have finished debugging, you may want to make clean and then change the compiler optimization level to -O3 to help speed up your simulations. Even though each simulation only takes a couple of minutes, you will obviously want to automate this process with the provided scripts. Let me or Kim know if you need any help with the Perl scripts; again, the changes required should be minimal.

Generate charts that graphically depict which configuration gives the best predictor accuracy for the least amount of hardware, and comment on your results. Clearly state which predictors you used (you may want to draw the diagrams) and how many bits are used in the various places.
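Finally, for the bpred.c change in part (b): the standalone C sketch below illustrates the saturating-counter logic you will be modifying. It is NOT the actual SimpleScalar source (bpred.c stores its counters in packed prediction tables); it is only a hedged model of the update and prediction rules for 1-bit vs. 2-bit counters.

#include <stdio.h>

/* Saturating-counter model: an n-bit counter counts up on taken, down on
 * not-taken, and saturates at 0 and 2^n - 1. It predicts taken when it is
 * in the upper half of its range. */
static unsigned update_counter(unsigned ctr, int taken, int bits)
{
    unsigned max = (1u << bits) - 1;   /* 1 for 1-bit, 3 for 2-bit */
    if (taken  && ctr < max) ctr++;
    if (!taken && ctr > 0)   ctr--;
    return ctr;
}

static int predict_taken(unsigned ctr, int bits)
{
    return ctr >= (1u << (bits - 1));  /* 2-bit: >= 2; 1-bit: == 1 */
}

int main(void)
{
    int pattern[] = { 1, 1, 1, 0, 1 };  /* branch outcomes: T T T N T */
    for (int bits = 1; bits <= 2; bits++) {
        unsigned ctr = 0;
        printf("%d-bit:", bits);
        for (int i = 0; i < 5; i++) {
            printf(" pred=%c/actual=%c",
                   predict_taken(ctr, bits) ? 'T' : 'N',
                   pattern[i] ? 'T' : 'N');
            ctr = update_counter(ctr, pattern[i], bits);
        }
        printf("\n");
    }
    return 0;
}

Note how on the final branch the 2-bit counter still predicts taken after the single not-taken outcome, while the 1-bit counter has flipped and mispredicts; that hysteresis is exactly the benefit being traded against table entries in this assignment.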