Net Speedup. Percentage of Vectorization


Beatrix Andrews

Question 1 (Amdahl's Law): In this exercise, we consider enhancing a machine by adding vector hardware to it. When a computation is run in vector mode on the vector hardware, it is 2 times faster than the normal mode of execution. We call the percentage of time that could be spent using vector mode the percentage of vectorization.

a) Draw a graph that plots the speedup as a function of the percentage of the computation performed in vector mode.
b) What percentage of vectorization is needed to achieve a speedup of 2?
c) What percentage of the execution time is spent in vector mode when the speedup of 2 is achieved?
d) What percentage of vectorization is needed to achieve one half the maximum possible speedup attainable from using vector mode?
e) Suppose you have measured the typical percentage of vectorization to be 70%. The hardware design group estimates that it can speed up the vector hardware to 20 (instead of 2) with significant additional investment. You wonder if the compiler group could increase the percentage of vectorization instead. What percentage of vectorization would the compiler group need to achieve in order to equal the improvement proposed by the hardware group?

a) [Graph: net speedup (vertical axis) plotted against percentage of vectorization (horizontal axis).]

b) 2 2;
c) % %
d) Maximum speedup is 2. One half of the maximum speedup is %
e) 70% vectorization and 20 times faster hardware yield a net speedup of 1 / (0.3 + 0.7/20). To achieve this net speedup by improving the compiler, we must increase the percentage of vectorization. We have ; %
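The relationships behind parts b) through e) all follow from Amdahl's Law, and a minimal Python sketch may make them easier to check. The function names `net_speedup`, `required_fraction` and `vector_time_share` are illustrative and do not come from the original solution; `f` is the fraction of vectorization and `s` is the vector-mode speedup. The only numbers used are the 70% and 20 from part e).

```python
def net_speedup(f: float, s: float) -> float:
    """Amdahl's Law: overall speedup when a fraction f of the original
    execution time runs s times faster."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


def required_fraction(target: float, s: float) -> float:
    """Invert Amdahl's Law: the fraction f needed to reach the speedup
    `target` on hardware whose vector mode is s times faster."""
    if s <= 1.0:
        raise ValueError("s must be greater than 1")
    if not 1.0 <= target <= s:
        raise ValueError("target must lie between 1 and s")
    # From 1/target = 1 - f * (1 - 1/s).
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / s)


def vector_time_share(f: float, s: float) -> float:
    """Share of the enhanced run time that is spent in vector mode."""
    vector_time = f / s
    return vector_time / ((1.0 - f) + vector_time)


# Part e) inputs taken from the text: 70% vectorization, 20x vector hardware.
hardware_option = net_speedup(0.70, 20.0)
print(f"net speedup with 20x hardware at 70% vectorization: {hardware_option:.3f}")
```

Calling `required_fraction(hardware_option, s)` with the original vector-mode speedup `s` then gives the percentage of vectorization the compiler group would need to match the hardware proposal.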

Question 2: Assume that you are designing an instruction set with a fixed 32-bit instruction length for a machine with 64 general-purpose registers, and that you want to provide three different instruction formats:

    op code R1, R2, R3      ; the instruction requires three register addresses
    op code R1, R2, I(R3)   ; the instruction requires three register addresses and a constant I
    op code R2, I(R3)       ; the instruction requires two register addresses and a constant I

Assuming that you are using variable-length op codes and that you want to use 12 bits for the constant I, what is the maximum number of instructions that you can have for each of the three instruction formats? Show the formats of your instructions.

The major point of this question is op code extension. For type 2 instructions (e.g., op code R1, R2, I(R3)), we need 6 bits for each of the three register addresses and 12 bits for the constant I, so 32 - 18 - 12 = 2 bits are left for encoding the operation, which can support at most 2^2 = 4 different op codes. Since we also need to provide type 1 (e.g., op code R1, R2, R3) and type 3 (e.g., op code R2, I(R3)) instructions, we can use one of the op codes to signal an op code extension for type 3 and type 1 instructions, which leaves 3 different type 2 instructions. For type 3 instructions, we can use the 6-bit op ext field to provide 63 different type 3 instructions, while reserving one of the op ext codes to signal an op code extension for type 1 instructions. For type 1 instructions, we can use the 6-bit op ext2 field to provide 64 different type 1 instructions. As a result, we can support at most 64 type 1 instructions, 3 type 2 instructions, and 63 type 3 instructions. The formats of the three variations of instructions are listed as follows.
(1) Type 2 instructions (op code R1, R2, I(R3))

    op       rs       rt       rd       constant
    2 bits   6 bits   6 bits   6 bits   12 bits

(2) Type 3 instructions (op code R2, I(R3))

    op       op ext   rt       rd       constant
    2 bits   6 bits   6 bits   6 bits   12 bits

(3) Type 1 instructions (op code R1, R2, R3)

    op       op ext   op ext2  rs       rt       rd
    2 bits   6 bits   6 bits   6 bits   6 bits   6 bits

Note that we could reserve two op codes in type 3 instructions to provide 128 type 1 instructions while having only 62 type 3 instructions. The above design opts to balance the number of type 1 and type 3 instructions. We could also reserve two op codes in type 2 instructions for extensions. However, this would leave us with only 2 type 2 instructions, which again increases the imbalance in the number of instructions of the different types.
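The counting argument for op code extension can be sketched in Python as a quick check. The helper name `expanding_opcode_counts` and the `FORMATS` table are illustrative assumptions, not part of the original solution; each level's escape codes become the prefixes available to the next level.

```python
def expanding_opcode_counts(field_bits, escapes):
    """Count usable op codes per level of an expanding op code scheme.

    field_bits[k] is the width of the op code field introduced at level k;
    escapes[k] is how many codes at level k are reserved to signal an
    extension to level k + 1 (use 0 for the last level).
    """
    if len(field_bits) != len(escapes):
        raise ValueError("field_bits and escapes must have the same length")
    counts = []
    prefixes = 1  # number of escape codes leading into the current level
    for bits, reserved in zip(field_bits, escapes):
        codes = prefixes * (2 ** bits)
        if not 0 <= reserved <= codes:
            raise ValueError("cannot reserve more codes than exist")
        counts.append(codes - reserved)
        prefixes = reserved
    return counts


# Field widths per format, in the order the fields appear in the instruction.
FORMATS = {
    "type 2": [2, 6, 6, 6, 12],    # op, rs, rt, rd, constant
    "type 3": [2, 6, 6, 6, 12],    # op, op ext, rt, rd, constant
    "type 1": [2, 6, 6, 6, 6, 6],  # op, op ext, op ext2, rs, rt, rd
}
for name, widths in FORMATS.items():
    assert sum(widths) == 32, name

# Levels in order: op (type 2), op ext (type 3), op ext2 (type 1).
type2, type3, type1 = expanding_opcode_counts([2, 6, 6], [1, 1, 0])
print(type2, type3, type1)
```

Changing the second argument to `[1, 2, 0]` models the alternative mentioned in the note, where two op ext codes are reserved for type 1 instructions.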

Question 3 (MIPS ISA): Consider the following fragment of C code:

    for (i = 0; i < N; i++)
        A[i] = B[i] + C;

Assume that A and B are arrays of 64-bit integers stored in memory starting at address locations 1000 and 3000, respectively. Assume also that the integer constant C and the number of iterations, N, are stored in memory at addresses 5000 and 5008, respectively.

a) Write the code for the above fragment in MIPS.
b) What is the MIPS code size in bytes?
c) When the MIPS code is executed, how many instructions will be dynamically executed?
d) How many memory/cache accesses are needed to execute the MIPS code (differentiate between accesses to fetch instructions and accesses to fetch data)?

A list of MIPS instructions that are useful for question 3:

    LW    R1, C(R2)      // load the content of memory address (R2)+C into R1
    SW    R1, C(R2)      // store (R1) to memory address (R2)+C
    DADD  R1, R2, R3     // (R2) + (R3) -> R1
    DADDI R1, R2, C      // (R2) + C -> R1
    DSUB  R1, R2, R3     // (R2) - (R3) -> R1
    BEQZ  R1, Label      // go to Label if (R1) = 0

a)
          R0 <- 0              // (R0) holds constant 0
          DADDI R1, R0, 0      // R1 holds the byte offset of counter i
          LW    R3, 5000(R0)   // load C into R3
          LW    R4, 5008(R0)   // load N into R4
    LOOP: LW    R5, 3000(R1)   // load B[i] into R5
          DADD  R6, R5, R3     // add B[i] + C
          SW    R6, 1000(R1)   // store A[i]
          DADDI R4, R4, -1     // decrement N
          DADDI R1, R1, 8      // increment the byte offset of i
          BEQZ  R4, EXIT       // leave the loop when N is zero
          BEQZ  R0, LOOP       // go to the label LOOP
    EXIT: RET                  // return or exit from the routine

b) 11 * 4 = 44 bytes

c) S + P * N = 3 + 7 * 100 = 703, that is, 3 setup instructions plus 7 instructions per iteration, taking N = 100. In the last iteration, RET takes the place of the backward branch.

d) Instruction cache accesses = 703, data cache accesses = 2 + 2 * 100 = 202

Question 4 (Pipelined design): Modify the processor data path shown in this figure to add the capability of executing a new type of instruction, Rtype m r1, r2, r3, which is equivalent to Rtype r1, r2, r3 except that the operands are in memory at the addresses stored in registers r2 and r3, instead of being in registers r2 and r3, and that the result is to be stored in memory at the address stored in register r1. You can add new stages as well as new functional units. You may also change the specification of the functional units in the 5-stage pipeline.
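As a cross-check on the Question 3 counts, the following Python sketch mirrors the control flow of the MIPS loop and tallies executed instructions and data accesses. The function name `run_fragment` and the test data are hypothetical; N = 100 is an assumption taken from the totals quoted in the solution, since the question itself only says that N is stored in memory.

```python
def run_fragment(b, c):
    """Model the MIPS loop: returns (A, instructions executed, data accesses).

    Like the assembly, the loop body runs before N is tested, so at least
    one element is required.
    """
    n = len(b)
    if n < 1:
        raise ValueError("the assembly loop body always runs at least once")

    instructions = 0
    data_accesses = 0

    # Setup: DADDI R1, R0, 0; LW R3 (C); LW R4 (N).
    instructions += 3
    data_accesses += 2

    a = []
    remaining = n
    i = 0
    while True:
        # LW B[i]; DADD; SW A[i]; DADDI R4; DADDI R1; BEQZ R4, EXIT.
        a.append(b[i] + c)
        instructions += 6
        data_accesses += 2  # one load of B[i], one store of A[i]
        remaining -= 1
        i += 1
        if remaining == 0:
            break
        instructions += 1  # BEQZ R0, LOOP (backward branch)

    instructions += 1  # RET
    return a, instructions, data_accesses


STATIC_INSTRUCTIONS = 11                   # instructions listed in part a)
CODE_SIZE_BYTES = STATIC_INSTRUCTIONS * 4  # fixed 4-byte MIPS encoding

a, executed, data = run_fragment(list(range(100)), 5)
print(CODE_SIZE_BYTES, executed, data)
```

Every executed instruction is fetched once, so in this model the instruction fetch count equals `executed`.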


1) For Rtype we need to read only r2 and r3, but for Rtype m we need to read all of r1, r2, and r3, which may be accomplished by adding a third read port to the register file.

2) The memory module in the original design can either read or write one memory location at a time (call the original address lines A1 and the original input and output data lines D and O1). However, Rtype m instructions need to read two memory locations. This can be accomplished by adding a second read port (address lines A2 and output data lines O2).

3) To read the two operands of the Rtype m instruction, we should provide paths from the first two register outputs (A and B in the figure) to the address lines A1 and A2 of the memory. The path from B to A2 is straightforward. The path from A to A1 can go through the ALU, assuming that one of the ALU operations passes one of its inputs unchanged to the output. If we remove this assumption, then a separate path from A to A1 has to be provided.

4) To operate on the operands after they are read from memory, we may add a new ALU stage which operates on O1 and O2 and produces a result R, which is directed to the input data port, D, of the memory. To store the result in memory, the memory address, C, which was read from the third register in the Rtype m instruction, has to be routed with the instruction through the pipeline stages and directed to the input memory address port, A1.

5) All the original data paths in the architecture have to be preserved. Multiplexers are used wherever needed to add the new paths to the original ones.
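The difference between the two instruction types can be illustrated with a small functional model in Python. This is only a sketch of the instruction semantics described in the question, not of the pipeline itself; the function names, the dictionary-based register file and memory, and the returned resource tallies are all hypothetical.

```python
import operator


def exec_rtype(regs, op, r1, r2, r3):
    """Rtype r1, r2, r3: operands and result all live in registers."""
    regs[r1] = op(regs[r2], regs[r3])
    return {"reg_reads": 2, "mem_reads": 0, "mem_writes": 0}


def exec_rtype_m(regs, mem, op, r1, r2, r3):
    """Rtype m r1, r2, r3: registers hold the addresses of the operands
    and of the result, which all live in memory."""
    dest_addr = regs[r1]     # third register read (extra read port)
    left = mem[regs[r2]]     # first memory read
    right = mem[regs[r3]]    # second memory read (extra memory read port)
    mem[dest_addr] = op(left, right)  # result is written back to memory
    return {"reg_reads": 3, "mem_reads": 2, "mem_writes": 1}


# Example: the value at address 100 plus the value at address 200 is
# stored at address 300.
regs = {1: 300, 2: 100, 3: 200}
mem = {100: 7, 200: 5, 300: 0}
usage = exec_rtype_m(regs, mem, operator.add, 1, 2, 3)
print(mem[300], usage)
```

The tallies line up with points 1) and 2): three register reads and two memory reads per instruction are what motivate the extra register read port and the second memory read port.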
