ECE 463/521: Fall 2002
Project #3: Dynamic Instruction Scheduling
Due: Wednesday, Dec. 4, 11:00 PM

Project rules

1. All students are encouraged to work in teams of two, using pair programming. Pair programming means programming where two people sit at the same workstation, writing the code collaboratively.
2. You may not work with the same partner on more than one project this semester.
3. You must register your partnership by posting on the Pair-Programming Partners message board, under the topic "Project 3".
4. Sharing of code between teams will be considered cheating, and will be penalized in accordance with the Academic Integrity policy.
5. It is acceptable for you to compare your results with other groups to help debug your program. It is not acceptable to collaborate on the final experiments.
6. You must do all your work in the C/C++ or Java languages. C++ (or Java) is encouraged because it enables straightforward code reuse and division of labor.
7. Homework will be submitted over the Wolfware Submit system and run in the Eos/Unity environment.

Project Description:

In this project, you will construct a simulator for an out-of-order superscalar processor based on Tomasulo's algorithm that fetches/issues N instructions per cycle. (ECE 463 students may assume that N = 4.) Only the dynamic scheduling mechanism will be modeled in detail, i.e., perfect caches are assumed and there are no control hazards.

1. Input format

The input traces will be given in the form:

    <address> <operation type> <dest reg #> <src1 reg #> <src2 reg #>
    <address> <operation type> <dest reg #> <src1 reg #> <src2 reg #>
    ...

where
    <address> is the address of the instruction (in hex),
    <operation type> is either "0", "1" or "2", and
    <dest reg #>, <src1 reg #> and <src2 reg #> are integers in the range [0..127].

Note: If any reg # is -1, then there is no register for that part of the instruction (e.g., a branch instruction has -1 for its <dest reg #>).

For example,

    ab...       means "operation type 0"  R1, R2, R3
    ab...       means "operation type 1"  R4, R1, R3
    ab12002c    means "operation type 2",     R4, R7   (No destination register!)

Traces will be placed in the directory /afs/eos/courses/ece/ece521/common/www/homework/projects/3/traces.
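For reference, a trace line can be read with a few lines of C++; the sketch below is only illustrative (the names TraceOp and read_trace are not part of the assignment, and an equivalent reader in C or Java is equally acceptable):

    #include <cstdio>
    #include <vector>

    // One trace line: <address> <operation type> <dest reg #> <src1 reg #> <src2 reg #>.
    // A register number of -1 means "no register" for that field.
    struct TraceOp {
        unsigned long addr;    // instruction address (given in hex)
        int op_type;           // 0, 1, or 2
        int dest, src1, src2;  // register numbers in [0..127], or -1
    };

    // Read an entire trace file into memory. Returns false if the file cannot be opened.
    bool read_trace(const char *filename, std::vector<TraceOp> &trace) {
        FILE *fp = std::fopen(filename, "r");
        if (!fp) return false;
        TraceOp op;
        while (std::fscanf(fp, "%lx %d %d %d %d",
                           &op.addr, &op.op_type, &op.dest, &op.src1, &op.src2) == 5) {
            trace.push_back(op);
        }
        std::fclose(fp);
        return true;
    }

Whether you read the whole trace up front or incrementally inside your fetch stage is up to you; either way, fetch must still respect the fetch bandwidth and dispatch-queue constraints described below.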

2. Simulator description

2.1 Functional units

There are 6 functional units (one for type-0 instructions, two for type-1 instructions, and three for type-2 instructions). Each functional unit can begin just one instruction on each cycle. The operation type of an instruction indicates the execution latency of the instruction:

    Type 0 has a latency of 1 cycle.
    Type 1 has a latency of 2 cycles.
    Type 2 has a latency of 4 cycles.

2.2 Pipeline description

Assume the following pipeline structure:

    Stage  Name                            Number of cycles per instruction
    1      IF: Instruction fetch/decode    1
    2      ID: Dispatch                    Variable; depends on scheduling queue availability.
    3      IS: Issue (scheduling)          Variable; depends on data dependences, available issue bandwidth, and available FUs.
    4      EX: Execute                     Variable; depends on operation type (pipelined latency).
    5      WB: Writeback (state update)    1 [see detail 12 below]
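The functional-unit counts and latencies above fit naturally into two small tables indexed by operation type, for example (the array names are illustrative):

    // Indexed by operation type (0, 1, or 2).
    const int NUM_FU[3]       = {1, 2, 3};   // one type-0 FU, two type-1 FUs, three type-2 FUs
    const int EXEC_LATENCY[3] = {1, 2, 4};   // cycles an instruction of each type spends in EX

    // Since each FU can begin only one instruction per cycle, at most NUM_FU[t]
    // instructions of type t may start executing in any one cycle, in addition
    // to the overall peak issue limit of N instructions per cycle.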

Details

1. Tags are generated sequentially for every instruction, beginning with 0. The first instruction in a trace will use a tag value of 0, the second will use 1, etc. The traces are sufficiently short so that you should not need to reuse tags.
2. The fetch/decode rate is N instructions per cycle. Each cycle, the fetch/decode unit can always supply N instructions to the dispatch unit, provided there is room for these instructions in the dispatch queue.
3. You should explicitly model the dispatch and scheduling queues.
4. The dispatch queue has 2N entries. Note that the dispatch queue is scanned from head to tail (in program order). Scanning is stopped as soon as the scheduling queue is full.
5. There is one centralized scheduling queue that issues instructions to functional units. So, instead of having dedicated reservation stations in front of each functional unit, there is a shared pool of reservation stations. However, the reservation stations feed separate pipelines for each type of instruction. Only one instruction may be issued per cycle to each pipeline, so, for example, only three type-2 instructions may be issued at a time.
6. If there are multiple independent instructions ready to issue during the same cycle in the scheduling queue, service them in tag order (i.e., lowest tag value to highest).
7. An issued instruction is removed from the scheduling queue when it issues to a functional unit.
8. When the scheduling queue becomes full, stage 2 (dispatch) stalls.
9. When the dispatch queue becomes full, stage 1 (fetch/decode) stalls.
10. The peak dispatch rate is N instructions per cycle. The dispatch rate is the rate at which instructions can be moved from the dispatch queue to the scheduling queue.
11. The peak issue rate is N instructions per cycle. The actual issue rate in a given cycle is constrained by the peak issue rate, available functional units, and data dependences.
12. There are unlimited result buses (common data buses, or CDBs). This means that instruction completion never stalls due to contention for a result bus.
13. Assume that instructions retire in the same order that they complete. Instruction retirement is unconstrained (imprecise interrupts are possible).
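Note that the two queues can be modeled with nothing more than occupancy counters; the pseudocode comments in section 2.3 below suggest exactly this (they increment and decrement counts rather than moving data around). A minimal sketch in C++, where S and N come from the command line and all variable names are illustrative:

    int N = 4;    // peak fetch/dispatch/issue rate (set from the command line)
    int S = 8;    // scheduling queue size (set from the command line)

    int dispatch_count = 0;   // instructions currently in the dispatch queue (capacity 2*N)
    int sched_count    = 0;   // instructions currently in the scheduling queue (capacity S)

    // Detail 9: fetch/decode (stage 1) stalls when the dispatch queue is full.
    bool dispatch_queue_full() { return dispatch_count >= 2 * N; }

    // Detail 8: dispatch (stage 2) stalls when the scheduling queue is full.
    bool sched_queue_full() { return sched_count >= S; }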

2.3 Simulator structure

1. Define 5 states that an instruction can be in (e.g., use an enumerated type): IF, ID, IS, EX, WB.
2. Define a circular FIFO that holds all active instructions in their program order. Conceptually, this is like a ROB, although we don't need to maintain precise interrupts. Each entry in the FIFO should be a data structure containing per-instruction information, e.g., state of the instruction, operation type, operand state, sequence number (tag), etc. An instruction is added to the FIFO when it is fetched from the trace and removed when it has reached the WB state and all prior instructions have been removed from the FIFO. (Note: The FIFO is useful for printing out information in program order for use by the scope tool, described below; we aren't really simulating a ROB, so make the FIFO large enough that it never overflows. 1024 entries is suggested.)
3. Define 3 lists:
   a. dispatch_list: This contains a list of instructions in either the IF or ID state. By including the IF state, you don't need to explicitly model the fetch latch. The dispatch_list models the dispatch queue.
   b. issue_list: This contains a list of instructions in the IS state (waiting for operands, available issue bandwidth, or an available FU).
   c. execute_list: This contains a list of instructions in the EX state (waiting for the execution latency of the operation).
4. Call each processing stage in reverse order in the main simulator loop, as follows. The detailed comments indicate tasks to be performed; the order is obviously important.

do {
    Retire();
    // Remove instructions from the head of the ROB until an instruction
    // is reached that is not in the WB state.

    Execute();
    // From the execute_list, check for instructions that are finishing
    // execution this cycle, and:
    //   1) Remove the instruction from the execute_list.
    //   2) Transition from the EX state to the WB state.
    //   3) Update the register file state (e.g., ready flag) and wake up
    //      dependent instructions (set their operand ready flags).

    Issue();
    // From the issue_list, construct a temp list of instructions whose
    // operands are ready; these are the READY instructions. Scan the
    // READY instructions in ascending order of tags and, if issue
    // bandwidth is available and an FU is available, then:
    //   1) Remove the instruction from the issue_list and add it to the
    //      execute_list.
    //   2) Transition from the IS state to the EX state.
    //   3) Free up the scheduling queue entry (e.g., decrement a count
    //      of the number of instructions in the scheduling queue).
    //   4) Set a timer in the instruction's data structure that will allow
    //      you to model the execution latency.

    Dispatch();
    // From the dispatch_list, construct a temp list of instructions
    // in the ID state (don't include those in the IF state; you must
    // model the 1-cycle fetch latency). Scan the temp list in ascending
    // order of tags and, if the scheduling queue is not full, then:
    //   1) Remove the instruction from the dispatch_list and add it to
    //      the issue_list. Reserve a scheduling queue entry (e.g., increment
    //      a count of the number of instructions in the scheduling queue)
    //      and free a dispatch-queue entry (e.g., decrement a count of the
    //      number of instructions in the dispatch queue).
    //   2) Transition from the ID state to the IS state.
    //   3) Rename source operands by looking up state in the register file;
    //      rename destination operands by updating state in the register file.
    // For instructions in the dispatch_list that are in the IF state,
    // unconditionally transition to the ID state (this models the 1-cycle
    // latency for instruction fetch).

    Fetch();
    // Read new instructions from the trace as long as 1) you have not
    // reached the end of file, 2) the fetch bandwidth is not exceeded,
    // and 3) the dispatch queue is not full. Then, for each incoming
    // instruction:
    //   1) Push the new instruction onto the ROB. Initialize the
    //      instruction's data structure, including setting its state to IF.
    //   2) Add the instruction to the dispatch_list and reserve a
    //      dispatch-queue entry (e.g., increment a count of the number
    //      of instructions in the dispatch queue).

} while (Advance_Cycle());
// When it becomes known that the ROB is empty AND the trace is depleted,
// the Advance_Cycle function returns false to terminate the loop.
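For example, the data structures in items 1-3 might be declared as follows in C++, together with an Advance_Cycle() that matches the loop above. Only the names taken from the assignment (the five states, dispatch_list, issue_list, execute_list) are fixed; everything else (Instruction, RegState, the individual fields) is illustrative and may be organized however you like:

    #include <deque>
    #include <list>

    enum State { IF, ID, IS, EX, WB };          // item 1: the five instruction states

    struct Instruction {
        int   tag;                              // sequence number (trace line number, starting at 0)
        int   op_type;                          // 0, 1, or 2
        int   dest, src1, src2;                 // register numbers, or -1
        State state;                            // which stage the instruction is currently in
        bool  src1_ready, src2_ready;           // operand state, set by wakeup in Execute()
        int   exec_cycles_left;                 // timer used to model the execution latency
        int   begin[5], duration[5];            // per-state timing, indexed by State, for the scope dump
    };

    // Item 2: a FIFO of all active instructions in program order (the "fake ROB").
    // 1024 entries is plenty; std::deque simply grows as needed.
    std::deque<Instruction> rob;

    // Item 3: the three lists. Entries point into the ROB above; std::deque does not
    // move surviving elements on push_back/pop_front, so the pointers stay valid.
    std::list<Instruction*> dispatch_list;      // instructions in the IF or ID state
    std::list<Instruction*> issue_list;         // instructions in the IS state
    std::list<Instruction*> execute_list;       // instructions in the EX state

    // Register file state used for renaming in Dispatch(): for each architectural
    // register, is its value ready, and if not, which tag will produce it?
    struct RegState { bool ready = true; int producer_tag = -1; };
    RegState reg_file[128];

    bool trace_depleted = false;                // set by Fetch() at the end of the trace file
    long current_cycle  = 0;

    // Advance simulated time; return false once the ROB is empty and the trace
    // is depleted, which terminates the main do/while loop.
    bool Advance_Cycle() {
        current_cycle++;
        return !(rob.empty() && trace_depleted);
    }

With declarations along these lines, Retire() pops WB-state entries from the head of rob (a convenient point to record each instruction's scope-tool line, since retirement proceeds in program order), and the remaining stages walk their lists exactly as described in the comments above.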

3. Helping you debug: A scope tool

There is a tool that allows you to display pipeline timing diagrams.

  o Location of tool: /afs/eos.ncsu.edu/courses/ece/ece521/common/www/homework/projects/3/scope
  o The tool is a Solaris binary (it can be run on the Suns).
  o Usage: scope <input-file> <output-file>
  o The tool has quite a bit of error checking to make sure you comply with formatting and usage; however, beware, it is not error-proof. Warning messages will sometimes direct you in the right direction (often it just spits out my address).

You must provide an input file that encodes the timing of each instruction in the program. Your simulator dumps this timing information. There should be one line for each instruction in the program, and instructions must be dumped in program order. The format of each line is as follows. Note: you must substitute an integer everywhere there is a <> pair.

    <seq_no> fu{<op_type>} src{<src1>,<src2>} dst{<dst>} IF{<begin-cycle>,<duration>} ID{ } IS{ } EX{ } WB{ }

<seq_no> is the unique tag of the instruction (line number in the trace, starting at 0). Substitute 0, 1, or 2 for <op_type>. <src1>, <src2>, and <dst> are register numbers (include -1 if that is the case). For each of the IF, ID, IS, EX, and WB states, indicate the first cycle that the instruction was in that state, followed by the number of cycles the instruction was in that state. The tool automatically does some consistency checks and dies if there is a problem, e.g., the begin cycle of ID should equal the begin cycle of IF plus the duration of IF.

Two output files are created:

  o <output-file>: This contains the timing diagram.
  o <output-file>.html: This is a very brief html shell that you can edit or keep as is. To view the html file, in Netscape type: file:<full-path>/<output-file>.html
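Inside your simulator, the easiest approach is to record, for every instruction, the first cycle it entered each state and how long it stayed there, and then print one line per instruction (in program order) at the end of the run. A small C++ sketch of the formatting, with illustrative function and parameter names:

    #include <cstdio>

    // Print one line of scope-tool input:
    //   <seq_no> fu{<op_type>} src{<src1>,<src2>} dst{<dst>} IF{b,d} ID{b,d} IS{b,d} EX{b,d} WB{b,d}
    // begin[k] is the first cycle the instruction spent in state k (IF=0 ... WB=4)
    // and duration[k] is the number of cycles it stayed in that state.
    void print_scope_line(int seq_no, int op_type, int src1, int src2, int dst,
                          const int begin[5], const int duration[5]) {
        std::printf("%d fu{%d} src{%d,%d} dst{%d} "
                    "IF{%d,%d} ID{%d,%d} IS{%d,%d} EX{%d,%d} WB{%d,%d}\n",
                    seq_no, op_type, src1, src2, dst,
                    begin[0], duration[0], begin[1], duration[1],
                    begin[2], duration[2], begin[3], duration[3],
                    begin[4], duration[4]);
    }

The consistency checks the tool performs (e.g., begin of ID equals begin of IF plus duration of IF) are also worth asserting in your own code; they catch many timing bugs before you ever run scope.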

4. Statistics (output)

The simulator outputs the following statistics after completion of the experimental run:

1. Total number of instructions in the trace.
2. Total simulated run time for the input: the run time is the total number of cycles from when the first instruction entered the instruction fetch/decode unit until when the last instruction retired.
3. Average number of instructions completed per cycle (IPC).
4. The cycle-by-cycle behavior for the first 10 and last 10 instructions in the trace (use output from the scope tool).

5. Experiments

5.1 Validation

Your simulator must match the validation output that we will place on-line.

5.2 Runs

For each trace on the website, use your simulator as follows:

1. For each benchmark, make a graph with IPC on the y-axis and S (schedule queue size) on the x-axis. Use S = 8, 16, 32, 64, 128, and 256. ECE 521 students should plot 4 different curves (lines) on the graph: one curve for each of N=1, N=2, N=4, and N=8. ECE 463 students should plot a curve for N = 4 only.
2. Discussion: (a) Discuss trends for each benchmark individually. Give possible explanations for the trends. (b) Compare and contrast results from different benchmarks. Give possible explanations for differences among benchmarks. Sample discussion topics:
   o What happens to IPC as schedule queue size (S) increases? (Does the answer depend on N?)
   o What happens to IPC as peak issue rate (N) increases? (Does the answer depend on S?)
   o What is the relationship between S and N? Explain.

5.3 Hints

  o Work out by hand what should occur in each unit at each cycle for an example set of instructions (e.g., the first 10 instructions in one of the traces). Do this before you begin writing your program!
  o Keep a counter for each line in the trace file. Use this for the Tomasulo tags. (You could also use ROB entry numbers as tags; this is how a real processor does it.)
  o As described above, make each pipeline stage into a C/C++/Java function and call the functions in reverse order from the flow of instructions.

5.4 Command-Line Format

    sim <trace file> <S (schedule queue size)> <N (issue rate)>
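A top-level driver matching this format might look like the following C++ sketch; only the argument order is prescribed above, and all other names are illustrative:

    #include <cstdio>
    #include <cstdlib>

    // Command-line format (section 5.4):  sim <trace file> <S> <N>
    int main(int argc, char *argv[]) {
        if (argc != 4) {
            std::fprintf(stderr, "usage: %s <trace file> <S: schedule queue size> <N: issue rate>\n",
                         argv[0]);
            return 1;
        }
        const char *trace_file = argv[1];
        int S = std::atoi(argv[2]);    // scheduling queue size
        int N = std::atoi(argv[3]);    // peak fetch/dispatch/issue rate

        // ... construct the simulator with parameters S and N, run the main do/while
        //     loop over trace_file, dump the scope-tool lines in program order, and
        //     print the section 4 statistics; note that
        //     IPC = (total instructions) / (total simulated cycles).
        (void)trace_file; (void)S; (void)N;    // placeholders in this sketch
        return 0;
    }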

6. Grading

  0%: You do not hand in anything by the due date.
  +20%: Makefile and program interface.
  +50%: Your simulator works.
  +30%: Report and analysis.
