COMP4690 Assignment 1 Due: November 4, 2008


Objectives

To reinforce your understanding of basic issues related to pipelining, simple superscalar execution and caches.
To introduce you to doing independent study in computer architecture.
To introduce you to working with something close to real data.
To reinforce your understanding of some techniques for supporting aggressive ILP.
To clarify the distinction between synchronization and consistency maintenance.

Questions

[10] 1. Given the basic 5 stage (IF, ID, EX, MEM, WB) pipeline defined in Patterson and Hennessy (and shown below) for their MIPS-like load/store architecture, draw a Gantt chart (a plot of activity in the five stages [on the y axis] against time [on the x axis]) for the following instruction sequence. Mark with the word "stall" any pipeline time slots corresponding to stalls, and be sure to explain what hazards cause any pipeline stalls that occur. You should assume that all instructions execute in a single cycle. If you are not familiar with the pipeline structure, please contact your instructor and an appropriate introductory review will be provided.

[Figure: Pipeline Structure]

Instruction Sequence

I1  LW  R1, 0(R1)
I2  SUB R4, R1, R5    // R4=R1-R5
I3  ADD R6, R1, R7
I4  AND R8, R1, R9
I5  OR  R8, R2, R5
I6  ADD R5, R5, R8
I7  SUB R5, R2, R3

[15] 2. Given a superscalar machine with two integer add/subtract units, one integer multiply/divide unit, and single floating point add/subtract, multiply, and divide units, each of which has a single virtual unit, and assuming the instruction execution times in the table below, give a schedule (i.e., the start and end times) for each instruction in the sequence shown below. Assume that 4 instructions can be fetched and decoded in each cycle and that sufficient buses are available to forward data between any number of pairs of functional units and/or virtual functional units concurrently. Once all fetched instructions are scheduled, the next group of four may be fetched. A Gantt chart might help in solving this problem. When could the next machine instruction (assuming freedom from dependences) be started?

Table of Instruction Durations

Instruction Type                   Duration
Integer add (IADD)                 1
Integer subtract (ISUB)            1
Integer multiply (IMUL)            2
Integer divide (IDIV)              4
Floating point add (FPADD)         2
Floating point subtract (FPSUB)    2
Floating point multiply (FPMUL)    4
Floating point divide (FPDIV)      12

Instruction Sequence

I1  IADD R1, R2, R3
I2  ISUB R4, R5, R6
I3  IADD R7, R8, R9
I4  ISUB R1, R4, R7
I5  IDIV R3, R4, R8
I6  IADD R4, R2, R6
I7  FADD R1, R9, R4
I8  FDIV R5, R2, R6
I9  FMUL R7, R2, R4
I10 FMUL R8, R6, R4

I11 FSUB R9, R2, R3
I12 IADD R1, R2, R3
I13 IADD R2, R2, R1
I14 IMUL R4, R2, R9
I15 IADD R6, R9, R5
I16 FADD R9, R2, R1
I17 IADD R7, R9, R6
I18 ISUB R8, R2, R3
I19 IDIV R1, R3, R4
I20 IADD R6, R2, R3

[5] 3. Given a three level cache structure (i.e., L1, L2, and L3 caches) with the characteristics given in the table shown below, compute the average effective access time for memory references, showing your work. What is the maximum apparent data transfer rate into the CPU?

Cache/Memory Level    Access Time    Hit Rate
L1                    3 nsec         90%
L2                    15 nsec        95%
L3                    40 nsec        99%
Mp                    120 nsec       100%

[25] 4. Write a program (C, C++ or Java) which, given a sequence of memory addresses referenced by a program, a line size in bytes and a number of cache lines, will simulate a direct mapped cache to determine the hit rate of the cache. A trace file of the addresses generated from a real executing program (taken from the tracebase at NMSU) is available on the course homepage, and this file is to be used as input data for your program. The format of the file is simple: a decimal address followed by either the letter R or W (for Read or Write). For this assignment, all you need is the addresses. Warning: the file is close to 90MB in size, so download will be an issue. Run your program for each of the pairs of cache parameters shown in the table below and record the hit rates obtained. Graph your results to illustrate the effects of changing cache capacity and of changing cache line size for a given cache size.

Cache Line Size    Number of Cache Lines
4 bytes bytes bytes bytes bytes bytes bytes bytes 4096

[10] 5. Given the code shown below (in MIPS-like assembly language), rewrite the loop using unrolling (of degree 4) and register renaming. Graphically illustrate the dependences between the statements in a single iteration of the unrolled loop. Rewrite the same loop using software pipelining and, again, graphically illustrate the dependences between the statements in a single iteration of the resulting loop. You may assume you have floating point registers F0 through F63 and that doubles are stored in even-odd pairs of floating point registers.

Loop: LD   F0,0(R1)
      LD   F8,0(R2)
      ADDD F8,F8,F2
      MULD F4,F0,F8
      SD   0(R2),F4
      SUBI R2,R2,#8
      SUBI R1,R1,#8
      BNEZ R1,Loop

[10] 6. Explain, using an example, why synchronization code (e.g., lock acquisition and release) is still required even though cache consistency is provided in the hardware. What does this tell you about the implementation of locking operations on a multiprocessor? (Hint: think about what happens if processes on two processors are concurrently attempting to acquire the same lock.) What is false sharing? Illustrate the concept using a simple example on a machine with two processors.

[10] 7. Given the following sequence of MIPS-like instructions (all of single cycle duration, including writeback) for the two specified threads, illustrate, using a Gantt chart, the scheduling of the operations onto hardware consisting of two integer add/subtract units, one integer multiply unit and one integer divide unit when using a) simultaneous multithreading (SMT), and b) strict round robin hardware multithreading. You should assume that instructions from each thread are fetched and decoded in groups of 4 under both multithreading schemes and that new groups of instructions cannot be fetched until all previous instructions are under way. For SMT, assume that instructions from multiple threads will be picked whenever possible. Be sure to clearly mark which thread each instruction comes from (e.g., A1 means the first instruction from thread A). Remember that each thread has its own set of registers, so there are no dependences between threads on the registers accessed.

Instrn#    Thread A            Thread B
1          addu R1,R2,R3       mulu R4,R3,R2
2          subu R4,R5,R6       divu R5,R6,R2
3          mulu R7,R8,R9       addu R3,R1,R10
4          divu R10,R11,R12    addu R4,R1,R1
5          subu R1,R2,R3       addu R9,R7,R2
6          addu R1,R2,R3       divu R5,R4,R7
7          divu R6,R8,R9       mulu R9,R8,R7
8          add R10,R10,R4      subu R1,R5,R2

[10] 8. We briefly discussed the idea of using a profiling run of a program to analyze its behaviour and then, based on the results, modifying the code generated by a compiler to suit the hardware the program will be run on. What are two fundamental disadvantages/limitations of such profile driven optimization? Do a little online research to discover three different uses of profile driven optimization and briefly summarize them.

[10] 9. Explain how the dcbx type instructions in the PowerPC ISA might be used. Some Googling will be required. Give a brief, realistic example of their use.
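For Question 3, one common way to model a multi-level hierarchy is to treat each hit rate as local to its level (the fraction of references reaching that level that hit there) and to charge every level a reference visits its full access time. The sketch below computes the average access time under exactly that assumption; it is an illustration in Python, not the expected answer, and the names `effective_access_time` and `LEVELS` are hypothetical. If the access times are instead read as total times to obtain data from a level, the weighting changes and so does the result.

```python
def effective_access_time(levels):
    """Average access time for a list of (access_time_ns, local_hit_rate).

    Assumptions (not stated in the handout): levels are probed in order,
    each probe costs that level's full access time, and each hit rate
    applies only to the references that reach that level.
    """
    total = 0.0
    reach = 1.0  # fraction of all references that get as far as this level
    for access_ns, hit_rate in levels:
        total += reach * access_ns      # every reference reaching here pays the probe
        reach *= (1.0 - hit_rate)       # only the misses continue downward
    return total


# Values copied from the Question 3 table; main memory always hits.
LEVELS = [(3, 0.90), (15, 0.95), (40, 0.99), (120, 1.00)]

if __name__ == "__main__":
    print(f"{effective_access_time(LEVELS):.3f} ns")
```

Keeping the running `reach` factor makes the contribution of each level explicit, which is useful when showing work.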
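For Question 4, the core of a direct mapped cache simulation is the mapping from an address to a line index and a tag. The following is a minimal sketch of that mapping in Python, for illustration only (the assignment itself asks for C, C++ or Java). The function names `read_trace` and `simulate_direct_mapped` are hypothetical, and the sketch assumes that reads and writes are treated identically and that every miss loads the line.

```python
def read_trace(lines):
    """Yield the decimal address from each 'address R|W' trace line."""
    for line in lines:
        fields = line.split()
        if fields:                      # skip blank lines
            yield int(fields[0])        # the R/W letter is ignored


def simulate_direct_mapped(addresses, line_size, num_lines):
    """Return the hit rate of a direct mapped cache over the address stream."""
    tags = [None] * num_lines           # None marks an empty (invalid) line
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // line_size       # which memory block holds this byte
        index = block % num_lines       # the only line that block may occupy
        tag = block // num_lines        # distinguishes blocks sharing that line
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag           # miss: replace whatever was there
        total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    sample = ["0 R", "4 R", "0 W", "64 R", "0 R"]   # made-up trace, not the NMSU data
    print(simulate_direct_mapped(read_trace(sample), line_size=4, num_lines=16))
```

Because `read_trace` is a generator, a large trace file can be streamed line by line (for example by passing an open file object) rather than loaded into memory at once.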

Total: 105 marks

Be sure to include your name, student number, and userid in each file handed in. For written questions, PostScript, PDF, Word (.doc), Excel (.xls) and PowerPoint (.ppt) format files are acceptable. For programming questions, please hand in all source files and results (no binaries or temporary files, please!). You must also include a text file named README for each programming question that briefly describes how to both compile and run your program.

While your final programs must run on the Linux machines in the Linux lab, you are free to develop your code on other platforms if you like. You alone, however, are responsible for dealing with any differences in the compilers, tools, etc. that you may encounter. Please remember to leave plenty of time to port your code to Linux. Normally this takes less than 30 minutes, but sometimes it may take several hours if your code is still a little buggy. For example, the C compiler on Solaris sets all uninitialized memory to zeroes, so bad pointers are treated as NULLs and incorrect code sometimes works there but not with other compilers and operating systems.
