Pipelining Exercises, Continued

Spot all data dependencies (including ones that do not lead to stalls). Draw arrows from the stages where data is made available, directed to where it is needed. Circle the involved registers in the instructions. Assume no forwarding. One dependency has been drawn for you.

    time ->
    addi $t0, $t1, 100
    lw   $t2, 4($t0)
    add  $t3, $t1, $t2
    sw   $t3, 8($t0)
    lw   $t4, 0($t6)
    or   $t5, $t0, $t3

Without forwarding, the register values become available in the write-back phase, and are needed in the decode phase.

Redraw the arrows for the above question assuming that our hardware provides forwarding.

    time ->
    addi $t0, $t1, 100
    lw   $t2, 4($t0)
    add  $t3, $t1, $t2
    sw   $t3, 8($t0)
    lw   $t4, 0($t6)
    or   $t5, $t0, $t3

With forwarding, the register values become available as soon as they are computed/retrieved, and are needed as late as possible in the computation. Notice that arithmetic operations with forwarding do not cause stalls, but load word still does.
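The stall behavior described above can be sanity-checked with a toy model. The sketch below is a hypothetical single-issue 5-stage model, not the worksheet's hardware: it assumes that without forwarding a register written in WB can be read by a decode in that same cycle, and that with forwarding ALU results forward EX-to-EX and load data MEM-to-EX. The `$t4`/`$t5` register numbers are guesses, since those digits were lost in the handout.

```python
# Toy model of a single-issue 5-stage pipeline (IF ID EX MEM WB).
# Counts the stall cycles inserted by data hazards, with and
# without forwarding.

def stalls(instrs, forwarding):
    """instrs: (dest_reg or None, source_regs, kind) in program order.
    Returns the total number of stall cycles inserted."""
    ready = {}   # reg -> earliest cycle a dependent instruction may decode
    cycle = 0    # cycle in which the next instruction would decode
    total = 0
    for dest, srcs, kind in instrs:
        decode = max([cycle] + [ready[s] for s in srcs if s in ready])
        total += decode - cycle          # stalls inserted before this instr
        cycle = decode + 1
        if dest is not None:
            if forwarding:
                # ALU results usable 1 cycle later, load data 2 cycles later
                ready[dest] = decode + (2 if kind == "load" else 1)
            else:
                ready[dest] = decode + 3  # value not readable until WB
    return total

# The worksheet's sequence ($t4/$t5 destinations are assumptions):
prog = [
    ("$t0", ["$t1"],        "alu"),    # addi $t0, $t1, 100
    ("$t2", ["$t0"],        "load"),   # lw   $t2, 4($t0)
    ("$t3", ["$t1", "$t2"], "alu"),    # add  $t3, $t1, $t2
    (None,  ["$t3", "$t0"], "store"),  # sw   $t3, 8($t0)
    ("$t4", ["$t6"],        "load"),   # lw   $t4, 0($t6)
    ("$t5", ["$t0", "$t3"], "alu"),    # or   $t5, $t0, $t3
]
print(stalls(prog, forwarding=False))  # stalls without forwarding
print(stalls(prog, forwarding=True))   # only the load-use stall remains
```

With forwarding the model reports a single stall (the lw/add load-use pair), matching the note above that arithmetic no longer stalls but load word still does.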
Instruction Scheduling

Suppose we have an array of structs of this form:

    struct point { int x; int y; };

We wish to square each member of point and add them to another array:

    sum[i] = p[i].x*p[i].x + p[i].y*p[i].y;

Suppose the number of points in p is in $a0, the base of p is in $a1, and the base of sum is in $a2. Then we can perform the operation with this MIPS code:

    compiledata: beq  $a0, $0, exit
                 lw   $t0, 0($a1)
                 lw   $t1, 4($a1)
                 mul  $t2, $t0, $t0
                 mul  $t3, $t1, $t1
                 add  $t3, $t2, $t3
                 addi $a1, $a1, 8
                 addi $a0, $a0, -1
                 sw   $t3, 0($a2)
                 addi $a2, $a2, 4
                 j    compiledata
    exit:

Exercises: Assume that you have a dual-issue machine wherein one ALU/branch operation can be scheduled in parallel with a load/store operation. Can you schedule the instructions in the above loop to improve performance?

There are 3 load/stores and 8 other instructions. Even with forwarding, loads cannot provide the data from memory in time for the next instruction. Therefore the data dependencies occur in the following places:

- mul $t2, $t0, $t0 must happen at least 2 cycles after lw $t0, 0($a1)
- mul $t3, $t1, $t1 must happen at least 2 cycles after lw $t1, 4($a1)

In addition, we must also preserve the following orderings:

- add $t3, $t2, $t3 must happen after the two multiplies
- sw $t3, 0($a2) must happen after the add and before addi $a2, $a2, 4
- addi $a1, $a1, 8 must happen after the loads

Below is one potential fastest ordering:

      ALU/branch slot        Load/store slot
    1 beq  $a0, $0, exit     lw $t0, 0($a1)
    2 addi $a1, $a1, 8       lw $t1, 4($a1)    (this avoids load-use data hazards)
    3 mul  $t2, $t0, $t0
    4 mul  $t3, $t1, $t1
    5 add  $t3, $t2, $t3
    6 addi $a0, $a0, -1      sw $t3, 0($a2)
    7 addi $a2, $a2, 4
    8 j    compiledata
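To see concretely what the loop computes, here is a minimal Python mirror of the MIPS code above. Memory is modeled as a dict of word values keyed by byte address; the function name and test addresses are illustrative, not part of the worksheet.

```python
# Python mirror of the MIPS loop: p's structs are 8 bytes apart
# (x at offset 0, y at offset 4) and sum's ints are 4 bytes apart.

def compiledata(mem, a0, a1, a2):
    """a0: number of points, a1: byte address of p, a2: byte address of sum."""
    while a0 != 0:            # beq  $a0, $0, exit
        t0 = mem[a1]          # lw   $t0, 0($a1)   -- p[i].x
        t1 = mem[a1 + 4]      # lw   $t1, 4($a1)   -- p[i].y
        t2 = t0 * t0          # mul  $t2, $t0, $t0
        t3 = t1 * t1          # mul  $t3, $t1, $t1
        t3 = t2 + t3          # add  $t3, $t2, $t3
        a1 += 8               # addi $a1, $a1, 8
        a0 -= 1               # addi $a0, $a0, -1
        mem[a2] = t3          # sw   $t3, 0($a2)
        a2 += 4               # addi $a2, $a2, 4

# Two points (1, 2) and (3, 4) at address 0; sum[] at address 64:
mem = {0: 1, 4: 2, 8: 3, 12: 4}
compiledata(mem, 2, 0, 64)
print(mem[64], mem[68])   # 1*1 + 2*2 and 3*3 + 4*4
```

The comments pair each Python statement with the MIPS instruction it mirrors, which makes the byte offsets (0 and 4 within a struct, stride 8 for p, stride 4 for sum) easy to check.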
Here is another potential ordering:

      ALU/branch slot        Load/store slot
    1 beq  $a0, $0, exit
    2 addi $a2, $a2, 4       lw $t0, 0($a1)    (this avoids load-use data hazards)
    3 addi $a1, $a1, 8       lw $t1, 4($a1)    (this avoids load-use data hazards)
    4 mul  $t2, $t0, $t0
    5 mul  $t3, $t1, $t1
    6 add  $t3, $t2, $t3
    7 addi $a0, $a0, -1
    8 j    compiledata       sw $t3, -4($a2)

It is okay to be executing addi $a1, $a1, 8 and lw $t1, 4($a1) at the same time because the register write of the addi is done later in the pipeline, so both will read the correct value of $a1.

Unroll the loop by a factor of 2, apply register renaming, and schedule again (you may assume $a0 is even). How much improvement can be obtained?

When we unroll the loop, we only need to double the instructions that do the real work (i.e. not the incrementing or the loop comparison instructions). This means that we now have 6 load/stores and 11 other instructions. By using register renaming, we now have many new instructions that don't have data dependencies on each other, so we can more easily schedule full issue packets. One potential ordering is below:

       ALU/branch slot        Load/store slot
     1 beq  $a0, $0, exit     lw $t0, 0($a1)
     2 addi $a1, $a1, 16      lw $t1, 4($a1)
     3 mul  $t2, $t0, $t0     lw $t4, -8($a1)
     4 mul  $t3, $t1, $t1     lw $t5, -4($a1)
     5 add  $t3, $t2, $t3
     6 mul  $t4, $t4, $t4
     7 mul  $t5, $t5, $t5
     8 add  $t6, $t4, $t5     sw $t3, 0($a2)
     9 addi $a0, $a0, -2
    10 addi $a2, $a2, 8       sw $t6, 4($a2)
    11 j    compiledata

The original code took 11 single-issue packets per loop iteration (22 per 2 iterations). With a dual-issue machine and no loop unrolling, the code takes 8 packets per iteration (16 per 2 iterations). With loop unrolling, the code takes 11 packets per 2 iterations!
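A schedule like the one above can be machine-checked. The sketch below (the encoding and helper name are mine, not the worksheet's) writes the unrolled 11-packet schedule as data and verifies the load-use constraint from the exercise: every loaded value is first used at least 2 cycles after its load.

```python
# Each packet is (alu_op, mem_op); an op is (kind, dest, source_regs)
# or None for an empty slot. kind is "alu", "lw", or "sw".

def check(packets):
    """Assert every reg loaded in cycle c is used no earlier than c+2;
    returns the number of issue packets."""
    loaded = {}  # reg -> cycle in which it was loaded
    for cycle, packet in enumerate(packets, start=1):
        for op in packet:
            if op is None:
                continue
            kind, dest, srcs = op
            for s in srcs:
                if s in loaded:
                    assert cycle >= loaded[s] + 2, (s, cycle)
            if kind == "lw":
                loaded[dest] = cycle
    return len(packets)

# The unrolled schedule above, cycle by cycle:
packets = [
    (("alu", None,  ["$a0"]),        ("lw", "$t0", ["$a1"])),       # 1
    (("alu", "$a1", ["$a1"]),        ("lw", "$t1", ["$a1"])),       # 2
    (("alu", "$t2", ["$t0"]),        ("lw", "$t4", ["$a1"])),       # 3
    (("alu", "$t3", ["$t1"]),        ("lw", "$t5", ["$a1"])),       # 4
    (("alu", "$t3", ["$t2", "$t3"]), None),                         # 5
    (("alu", "$t4", ["$t4"]),        None),                         # 6
    (("alu", "$t5", ["$t5"]),        None),                         # 7
    (("alu", "$t6", ["$t4", "$t5"]), ("sw", None, ["$t3", "$a2"])), # 8
    (("alu", "$a0", ["$a0"]),        None),                         # 9
    (("alu", "$a2", ["$a2"]),        ("sw", None, ["$t6", "$a2"])), # 10
    (("alu", None,  []),             None),                         # 11
]

n = check(packets)
print(n, "packets per 2 iterations; speedup vs single-issue =", 22 / n)
```

The speedup over the original single-issue code is 22/11 = 2x per pair of iterations.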
Virtual Memory Overview

Virtual address (VA): what your program uses

    | Virtual Page Number | Page Offset |

Physical address (PA): what actually determines where in memory to go

    | Physical Page Number | Page Offset |

With 4 KiB pages and byte addresses, 2^(page offset bits) = 4096, so page offset bits = 12.

The Big Picture: Logical Flow

Translate the VA to a PA using the TLB and page table. Then use the PA to access memory as the program intended.

Pages

A chunk of memory or disk with a set size. Addresses in the same virtual page get mapped to addresses in the same physical page. The page table determines the mapping.

The Page Table

    Index = Virtual Page Number (not stored)

    Index | Page Valid | Page Dirty | Permission Bits (read, write, ...) | Physical Page Number
    0     |            |            |                                    |
    1     |            |            |                                    |
    2     |            |            |                                    |
    ...   |            |            |                                    |
    (max virtual page number)

Each stored row of the page table is called a page table entry (the row for index 0 is the first page table entry). The page table is stored in memory; the OS sets a register telling the hardware the address of the first entry of the page table. The processor updates the page dirty bit in the page table: page dirty bits are used by the OS to know whether updating a page on disk is necessary. Each process gets its own page table.

Protection Fault -- The page table entry for a virtual page has permission bits that prohibit the requested operation.
Page Fault -- The page table entry for a virtual page has its valid bit set to false. The page is not in memory.
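The logical flow — split the VA, index the page table by VPN, splice the PPN onto the unchanged offset — can be sketched in a few lines of Python. The entry layout `(valid, ppn)` and function name are illustrative, assuming the 4 KiB pages from the example above.

```python
# VA -> PA translation through a page table, for 4 KiB pages
# (12 page offset bits, as computed above).

PAGE_OFFSET_BITS = 12

def translate(va, page_table):
    vpn = va >> PAGE_OFFSET_BITS                 # index into the page table
    offset = va & ((1 << PAGE_OFFSET_BITS) - 1)  # unchanged by translation
    valid, ppn = page_table[vpn]
    if not valid:
        # page fault: the OS must bring the page into memory
        raise MemoryError(f"page fault on VPN {vpn}")
    return (ppn << PAGE_OFFSET_BITS) | offset

page_table = [(True, 0x7), (False, 0x0)]   # VPN 0 -> PPN 7; VPN 1 not resident
print(hex(translate(0x0ABC, page_table)))  # offset 0xABC lands inside PPN 7
```

Note that permission bits are omitted here; checking them before the access is what distinguishes a protection fault from the page fault modeled above.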
The Translation Lookaside Buffer (TLB)

A cache for the page table. Each block is a single page table entry. If an entry is not in the TLB, it's a TLB miss. Assuming fully associative:

    TLB Entry: | Valid | Tag = Virtual Page Number | Page Table Entry: Page Dirty, Permission Bits, Physical Page Number |

The Big Picture Revisited

Exercises

What are three specific benefits of using virtual memory? [there are many]

- Bridges memory and disk in the memory hierarchy.
- Simulates a full address space for each process.
- Enforces protection between processes.

What should happen to the TLB when a new value is loaded into the page table address register?

The valid bits of the TLB should all be set to 0. The page table entries in the TLB corresponded to the old page table, so none of them are valid once the page table address register points to a different page table.

x86 has an "accessed" bit in each page table entry, which is like the dirty bit but set whenever a page is used (load or store). Why is this helpful when using memory as a cache for disk?

It allows smarter replacements. We naturally want fewer misses (page faults), so if possible, we would want to replace a page that hasn't been used. The accessed bit is one way of giving us enough information to implement this.

Fill this table out!

    Virtual Address Bits | Physical Address Bits | Page Size | VPN Bits | PPN Bits | Bits per row of PT (4 extra bits)
    32                   | 32                    | 16 KB     | 18       | 18       | 22
    32                   | 26                    | 8 KB      | 19       | 13       | 17
    36                   | 32                    | 32 KB     | 21       | 17       | 21
    40                   | 36                    | 32 KB     | 25       | 21       | 25
    64                   | 40                    | 64 KB     | 48       | 24       | 28
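Each row of the table follows from three relations: page offset bits = log2(page size in bytes), VPN bits = VA bits - offset bits, PPN bits = PA bits - offset bits, and bits per page table row = PPN bits + the 4 extra bits (valid, dirty, permissions). A quick check in Python (the helper name is mine):

```python
import math

EXTRA_BITS = 4  # valid + dirty + permission bits, per the table header

def pt_row(va_bits, pa_bits, page_bytes):
    """Returns (VPN bits, PPN bits, bits per page table row)."""
    offset_bits = int(math.log2(page_bytes))
    vpn_bits = va_bits - offset_bits
    ppn_bits = pa_bits - offset_bits
    return vpn_bits, ppn_bits, ppn_bits + EXTRA_BITS

KB = 1024
for row in [(32, 32, 16 * KB), (32, 26, 8 * KB), (36, 32, 32 * KB),
            (40, 36, 32 * KB), (64, 40, 64 * KB)]:
    print(row, "->", pt_row(*row))
```

Running this reproduces every answer row of the table above.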