$S = 32 \cdot 2^{d}$ kB (1), $L = 32 \cdot 2^{D}$ B (2), $A = 2 \cdot 2^{m \bmod 4}$ (3), $W = 16 \cdot 2^{y \bmod 4}$ b (4)



1 Cache Design

You have already written your civic registration number (personnummer) on the cover page in the format YyMmDd-XXXX. Use the following formulas to calculate the parameters of your caches:

$S = 32 \cdot 2^{d}$ kB (1)
$L = 32 \cdot 2^{D}$ B (2)
$A = 2 \cdot 2^{m \bmod 4}$ (3)
$W = 16 \cdot 2^{y \bmod 4}$ b (4)

1. Assume that you have a direct-mapped cache of size S (Equation 1) with line size L (Equation 2). The cache is physically indexed and tagged. The physical address is 50 bits, numbered from 0 to 49 (with 0 being the least significant bit). The machine has a word size of W bits (Equation 4) and the memory is byte-addressable.
   a. Write your personal values of the parameters S, L, A and W (along with the parameters y, d, D and m).
   b. Make a schematic drawing of the cache. (2)
   c. Describe which bits are used to index the cache, i.e. used to select the row in the cache.
   d. Which bits are compared with the address tag?
   e. Which bits are used to select a word from the selected cache line?

1.1 Associative Cache

2. Now, assume that you have an A-way (Equation 3) set-associative cache of size S (Equation 1) and cache line size L (Equation 2).
   a. Make a schematic drawing of the cache. (2)
   b. Describe which bits are used to index the cache, i.e. used to select the row in the cache.
   c. Which bits are compared with the address tag?
   d. Which bits are used to select the word to read from the selected cache line?

2 More on Caches

3. Describe how the LRU and RANDOM replacement policies work. (2)
4. Why would a computer architect choose to implement:
   a. LRU instead of RANDOM?

Autumn 2012 Final :00:25Z r75

   b. RANDOM instead of LRU?
5. Explain when the following miss types occur and how they can be avoided:
   a. Compulsory misses
   b. Capacity misses
   c. Conflict misses

3 Virtual Memory

6. Describe two reasons for implementing virtual memory. (2)
7. Give at least one reason why a computer system with virtual memory should have a TLB.
8. What is the reach of a TLB with a page size of 4 kB and 512 entries?
9. What is the benefit of using a virtually indexed and physically tagged cache? Are there any problems? (2)
10. What is the benefit of using a virtually indexed and virtually tagged cache? Are there any problems? (2)
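The address arithmetic behind the cache and TLB questions can be sketched in C. This is only an illustration under stated assumptions: it takes the parameter formulas to be S = 32·2^d kB, L = 32·2^D B and A = 2·2^(m mod 4), assumes all sizes are powers of two, and uses made-up names (`split_address`, `tlb_reach_bytes`, and so on) that do not come from the exam.

```c
#include <stdint.h>

/* Integer log2 for exact powers of two (all sizes here are assumed
 * to be powers of two). */
static unsigned log2_u64(uint64_t x)
{
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Equations 1-3, with d, D and m taken from the YyMmDd digits. */
static uint64_t cache_size_bytes(unsigned d) { return (UINT64_C(32) << d) * 1024; }
static uint64_t line_size_bytes(unsigned D)  { return UINT64_C(32) << D; }
static unsigned associativity(unsigned m)    { return 2u << (m % 4); }

/* Field widths of a physical address. */
struct addr_fields {
    unsigned offset_bits; /* byte offset within a cache line */
    unsigned index_bits;  /* selects the set (row) */
    unsigned tag_bits;    /* compared with the stored tag */
};

/* Split an addr_bits-wide address for a cache with `ways` ways;
 * ways = 1 gives the direct-mapped case. */
static struct addr_fields split_address(uint64_t size, uint64_t line,
                                        unsigned ways, unsigned addr_bits)
{
    struct addr_fields f;
    uint64_t sets = size / (line * ways);

    f.offset_bits = log2_u64(line);
    f.index_bits = log2_u64(sets);
    f.tag_bits = addr_bits - f.index_bits - f.offset_bits;
    return f;
}

/* TLB reach: the amount of memory mapped by all entries together. */
static uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_bytes)
{
    return entries * page_bytes;
}
```

For the hypothetical digits d = 3, D = 1 and m = 2, `split_address(cache_size_bytes(3), line_size_bytes(1), 1, 50)` yields 6 offset bits, 12 index bits and 32 tag bits, while passing `associativity(2)` as the number of ways moves three bits from the index to the tag.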

1 ISAs

1. Over the years, several different classes of instruction sets have evolved (and died). Three common strategies for handling operands are a stack-based ISA, an accumulator-based ISA, or a register-based ISA. All of them have been implemented in hardware at some point. Stack-based instruction sets were once used by HP in their HP 3000 series. Today they are mainly used in virtual machines, e.g. the Java Virtual Machine (JVM). Accumulator machines have been popular since the dawn of time; the iconic PDP-8 is one example. They are still common in microcontrollers, such as the PIC series. The x86 architecture evolved from this class, but is today (mostly) a register-based architecture. Almost all modern machines are register-based and generally do not use fixed-function registers. A good example in this class is the MIPS.
   a. One of the problems with stack machines is that it is generally hard to make efficient hardware implementations. However, they have one large benefit that has made them successful in the JVM. What is the main benefit of stack machines over register machines?
   b. Most operations require 3 different operands: 2 input operands and 1 destination operand. In accumulator-based architectures, one of the input operands is always the accumulator. What is normally the source of the 2nd operand?

2 Hazards

2. There are three different classes of hazards: structural hazards, data hazards and control hazards.
   a. What is a structural hazard?
   b. Describe how a control hazard can be transformed into a data hazard. Give at least one example.
3. Data hazards can be further divided into three different types:
   RAW  Read After Write
   WAR  Write After Read
   WAW  Write After Write
   $Op_i$ and $Op_{i+1}$ are two consecutive instructions in program order.
   a. A RAW hazard occurs when $Op_i$ modifies A and $Op_{i+1}$ reads A before $Op_i$ has committed its new value. Describe how this situation can occur in a simple 5-stage pipeline and a simple hardware solution.
   b. Why can't WAR and WAW hazards occur in a simple in-order 5-stage pipeline?

Autumn 2012 Final :14:43Z r89
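As a rough illustration of the RAW situation in question 3a, the sketch below models the operand-select decision of a forwarding (bypass) unit in a classic MIPS-style 5-stage pipeline. Forwarding is only one possible hardware approach, and the pipeline-register names (EX/MEM, MEM/WB), the function name `forward_select` and its parameters are assumptions made for this example rather than anything defined in the exam.

```c
#include <stdbool.h>

/* Where the ALU operand should be taken from. */
enum fwd_src { FWD_NONE, FWD_FROM_EX_MEM, FWD_FROM_MEM_WB };

/* Decide whether source register `rs` of the instruction in EX must be
 * bypassed from an older instruction that has not yet written back.
 * Register 0 is assumed to be hard-wired to zero, as on MIPS. */
static enum fwd_src forward_select(bool ex_mem_reg_write, unsigned ex_mem_rd,
                                   bool mem_wb_reg_write, unsigned mem_wb_rd,
                                   unsigned rs)
{
    /* The instruction one step ahead holds the newest value, so it
     * takes priority over the one two steps ahead. */
    if (ex_mem_reg_write && ex_mem_rd != 0 && ex_mem_rd == rs)
        return FWD_FROM_EX_MEM;
    if (mem_wb_reg_write && mem_wb_rd != 0 && mem_wb_rd == rs)
        return FWD_FROM_MEM_WB;
    return FWD_NONE; /* the register file value is up to date */
}
```

The sketch deliberately ignores loads, whose result is not available until after the memory stage, so a real design would still need a stall for a load followed directly by a dependent instruction.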

3 Instruction scheduling

4. There are two main strategies to exploit instruction-level parallelism (ILP) and feed multiple parallel execution units. What is the main difference between a VLIW and a superscalar processor? Think about how functional units are scheduled.
5. Tomasulo's algorithm introduces several new hardware structures to support out-of-order execution.
   a. When instructions are issued, they are put in a reservation station. What are reservation stations used for?
   b. What is the reorder buffer used for and what does it guarantee when instructions commit?
6. A highly desirable feature in a processor is precise exceptions, which guarantee that all side effects of instructions before the exception are visible and that no side effects from later instructions are visible. How can precise exceptions be implemented in a CPU that implements out-of-order execution using Tomasulo's algorithm?

4 Branch Prediction

7. A very simple branch predictor is the 1-bit branch history table. Describe how it works.
8. What is the difference between the 1-bit and 2-bit branch prediction schemes? What does the latter try to optimize?
9. The branch target buffer (BTB) allows something that is known as branch folding. Describe how the BTB works and how it can improve performance by folding branches.
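The 2-bit scheme mentioned in question 8 is commonly built from saturating counters. The following is a minimal sketch of such a table, assuming 4-byte instructions, a hypothetical table size of 1024 entries and invented names (`bht_predict`, `bht_update`); it is meant to make the mechanism concrete, not to describe any particular processor.

```c
#include <stdbool.h>
#include <stdint.h>

#define BHT_ENTRIES 1024u /* hypothetical table size, a power of two */

/* One 2-bit saturating counter per entry:
 * 0 and 1 predict not taken, 2 and 3 predict taken. */
struct bht {
    uint8_t counter[BHT_ENTRIES];
};

/* Index with the low-order PC bits, skipping the 2-bit byte offset
 * of 4-byte instructions. */
static unsigned bht_index(uint32_t pc)
{
    return (pc >> 2) & (BHT_ENTRIES - 1);
}

static bool bht_predict(const struct bht *t, uint32_t pc)
{
    return t->counter[bht_index(pc)] >= 2;
}

/* Move the counter one step toward the actual outcome, saturating
 * at 0 and 3. */
static void bht_update(struct bht *t, uint32_t pc, bool taken)
{
    uint8_t *c = &t->counter[bht_index(pc)];

    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}
```

Starting from a saturated counter, a single not-taken outcome leaves the prediction unchanged, whereas a 1-bit entry would flip immediately.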

1 Scalable Multiprocessors

1. What is the difference between NUMA and UMA architectures?
2. What is the main difference between a bus-based cache coherence protocol and a directory-based one? Why does the latter provide better scalability? (2)
3. In a directory-based coherence protocol, a special data structure, the directory, is used to keep track of where a particular cache line is stored. In a naïve implementation, the directory has to be large enough to store book-keeping for every cache line in every cache in the system. Instead of using a bit vector with a bit representing each processor in the system, the limited pointers approach stores a limited number of pointers to processors that store a specific cache line. This allows the directory to scale to a larger number of processors than the bit vector. The number of pointers that can be stored per cache line might not be enough to represent all copies of that line if the line is shared between a large number of processors. In this case, we have an overflow condition. Describe two different methods to handle overflows. (2)
4. The directory normally holds a pointer to all copies of a cache line. The Scalable Coherence Interface (SCI) has a clever way of keeping track of multiple copies of a cache line that makes the directory size independent of the number of nodes in a system. What kind of data structure is used to keep track of cache lines in SCI? Where is this structure stored? (2)

2 Programming Multiprocessors

5. What's the main difference when programming a message passing system compared to when programming a shared memory system?
6. A process in most operating systems contains at least the following resources:
   Program counter
   Stack pointer
   File descriptors
   Virtual to physical memory mappings
   Which of the above resources are not shared between threads?
7. In which of the following frameworks does false sharing normally not occur:
   MPI
   Pthreads
   OpenMP

Autumn 2012 Draft :02:18Z r77
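To make the limited pointers idea from question 3 concrete, here is a small sketch of a directory entry that can track a fixed number of sharers and flags an overflow when one more arrives. The pointer count of 4, the type and field names, and the choice to merely record the overflow (leaving the handling policy to the caller) are all assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_POINTERS 4 /* hypothetical number of pointers per entry */

/* Directory state for one cache line. */
struct dir_entry {
    uint16_t sharer[MAX_POINTERS]; /* ids of processors holding a copy */
    uint8_t count;                 /* number of valid pointers */
    bool overflow;                 /* more sharers than pointers */
};

/* Record `cpu` as a sharer of the line. Returns false if the entry has
 * no free pointer left, in which case the overflow flag is set and the
 * caller must apply some overflow policy. */
static bool dir_add_sharer(struct dir_entry *e, uint16_t cpu)
{
    for (unsigned i = 0; i < e->count; i++)
        if (e->sharer[i] == cpu)
            return true; /* already tracked */

    if (e->count == MAX_POINTERS) {
        e->overflow = true;
        return false;
    }
    e->sharer[e->count++] = cpu;
    return true;
}
```

The storage per entry grows with the number of pointers and the width of a processor id, not with one bit per processor as in the bit-vector scheme.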

(a) Red-Black (b) Vertical blocks

Figure 1: Two different approaches to parallelizing the Gauss-Seidel algorithm. The red-black implementation provides parallelism by breaking up the dependencies between elements, i.e. all the elements of one color can be processed in parallel. The block-wise parallelization uses large chunks of elements which are computed in parallel.

8. There are multiple ways to implement a Gauss-Seidel sweep. Two of them were discussed in Sverker's lecture. The red-black version breaks the dependencies between elements by coloring even elements and odd elements with different colors (Figure 1a). The algorithm only updates one color per iteration and alternates between the colors. This allows all the elements in one iteration (i.e. all the elements of one color) to be updated in parallel. The striped implementation (Figure 1b) divides the matrix into vertical blocks, where each block can be updated in parallel. There are several problems with the red-black version, one of them being a lower convergence rate. Mention one problem that is specific to the red-black version and that is due to architectural issues when implemented on a shared memory system.
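A sequential sketch of one red-black sweep may help when reasoning about question 8. It assumes a 5-point Laplace stencil on an n x n row-major grid with fixed boundary values; the exam does not specify the stencil, and the function name `red_black_sweep` is invented here.

```c
#include <stddef.h>

/* One red-black Gauss-Seidel sweep over the interior of an n x n grid
 * stored row-major in u. Assumes a 5-point Laplace stencil and fixed
 * boundary values. */
static void red_black_sweep(double *u, size_t n)
{
    for (int color = 0; color < 2; color++) {
        /* Points of one color only read points of the other color,
         * so the iterations of this loop nest are independent and
         * could be distributed over threads. */
        for (size_t i = 1; i + 1 < n; i++) {
            for (size_t j = 1; j + 1 < n; j++) {
                if ((i + j) % 2 != (size_t)color)
                    continue; /* skip points of the other color */
                u[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                       u[i * n + j - 1] + u[i * n + j + 1]);
            }
        }
    }
}
```

Note how each pass touches only every second element of every row, which is worth keeping in mind when thinking about cache lines.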

1 Cache Coherence

1.1 False-sharing

1. What is a false-sharing miss?
2. How can false-sharing be avoided? Give two examples, one using only software techniques and one where the hardware is changed. (2)

1.2 MSI-coherence

A simple snooping-based cache coherence protocol is described on page 214 (4th edition) or page 360 (5th edition) in Computer Architecture: A Quantitative Approach. The protocol uses three states: Exclusive, Shared and Invalid. The Exclusive state in the book is normally called Modified, so we'll call it that! Hence, the protocol will be called MSI.

A cache line enters the Modified state whenever a write occurs on that cache line. If the cache line was in the Invalid state, data have to be fetched and all other caches have to invalidate their copies. If the cache line was in the Shared state, no data have to be fetched, but all other caches still have to invalidate their copies. We call the transition from Shared to Modified an upgrade.

step CPU0 CPU1 CPU2 CPU3 on bus data source
0 wr 0 RTW MEM
1 rd 0
2 wr 0
3 wr 0
4 wr 0
5 rd 0
6 wr 0

Table 1: MSI transaction table

3. Table 1 contains a set of serialized transactions such that step n happens before step n + 1. Each memory access in the table references the same cache line. Fill in the "on bus" column with the bus request caused by each access. Use RTS for read to share, RTW for read to write, INV for invalidate, or none if no bus activity is required. Also, indicate the source of data. (2)

1.3 MOSI-coherence

The MOSI protocol, as discussed in class, is similar to the MSI protocol, but contains an additional Owner state.

4. Describe what the MOSI protocol tries to optimize compared to the MSI protocol.

Autumn 2012 Final :02:18Z r77
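The MSI transitions described in section 1.2 can be written down as a small state machine. The sketch below follows the text's rules for local accesses (RTS, RTW, INV, none) and adds the usual snooping reactions; the enum and function names are made up for this example, and data movement is left out entirely.

```c
/* Cache line states and bus requests of the MSI protocol. */
enum msi_state { MSI_INVALID, MSI_SHARED, MSI_MODIFIED };
enum bus_req { BUS_NONE, BUS_RTS, BUS_RTW, BUS_INV };
enum cpu_op { OP_READ, OP_WRITE };

/* A local access by the owning CPU: update *state and return the bus
 * request that the access causes. */
static enum bus_req msi_local_access(enum msi_state *state, enum cpu_op op)
{
    if (op == OP_READ) {
        if (*state == MSI_INVALID) {
            *state = MSI_SHARED;
            return BUS_RTS; /* read miss: read to share */
        }
        return BUS_NONE; /* read hit in Shared or Modified */
    }

    switch (*state) {
    case MSI_INVALID:
        *state = MSI_MODIFIED;
        return BUS_RTW; /* write miss: fetch data, invalidate others */
    case MSI_SHARED:
        *state = MSI_MODIFIED;
        return BUS_INV; /* upgrade: no data needed, invalidate others */
    default:
        return BUS_NONE; /* write hit in Modified */
    }
}

/* A request observed on the bus from another cache. */
static void msi_snoop(enum msi_state *state, enum bus_req req)
{
    if (req == BUS_RTS && *state == MSI_MODIFIED)
        *state = MSI_SHARED; /* supply the data and keep a shared copy */
    else if (req == BUS_RTW || req == BUS_INV)
        *state = MSI_INVALID;
}
```

Driving one `enum msi_state` variable per CPU through a sequence of accesses, and calling `msi_snoop` on every other CPU for each returned request, reproduces a transaction trace of the kind asked for in Table 1.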

2 Memory Consistency

The code below will be used in the following questions. Prior to executing the code below, a=1 and flag=0.

Listing 1: CPU0's code

    a = 2;
    flag = 1;

Listing 2: CPU1's code

    while (flag != 1)
        ; /* Wait until flag is 1 */
    printf("%i\n", a);

5. What value will CPU1 print when executed on...
   a. ... a sequentially consistent machine?
   b. ... a machine implementing Total Store Order?
   c. ... a machine implementing Release Consistency?
   Note: Only the answer 1, 2 or timing dependent is correct.
6. Give at least one reason why a multiprocessor machine should implement sequential consistency instead of release consistency.
7. Give at least one reason why a multiprocessor machine should implement total store order instead of sequential consistency.
8. Why would a computer architect choose to implement any of the weaker memory models?
9. Does a weaker memory order affect the correctness of correctly synchronized applications? Assume that the application uses pthreads. Motivate.
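For comparison with Listings 1 and 2, the same flag pattern can be expressed with explicit ordering annotations using C11 atomics. This is a sketch under the assumption that a C11 compiler with `<stdatomic.h>` is available; the function names `writer` and `reader` are invented, and the variant is not part of the exam's code, which uses plain variables.

```c
#include <stdatomic.h>

static int a = 1;        /* ordinary data, initially 1 */
static atomic_int flag;  /* static storage: starts out as 0 */

/* CPU0's side: write the data, then publish it with a release store,
 * so the write to a cannot be reordered after the write to flag. */
static void writer(void)
{
    a = 2;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* CPU1's side: spin with acquire loads; once the value 1 is seen, all
 * writes made before the matching release store are visible. */
static int reader(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
        ; /* wait until flag is 1 */
    return a;
}
```

In a real program `writer` and `reader` would run on different threads, for example ones created with `pthread_create`. With the release and acquire pair in place, the C11 memory model guarantees that `reader` returns 2 regardless of how weak the underlying hardware ordering is.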
