ECE 463/521: Spring 2005 Project 1: Data-Cache System Design Due: Wednesday, February 23, 2005, 11:00 PM

Project rules

1. All students are encouraged to work in teams of two, using pair programming. Pair programming means two people sit at the same workstation and write the code collaboratively. To find a partner, post a message on the message board at http://wolfware.ncsu.edu/wrap-bin/mesgboard/ece:521::001:1:2005.
2. ECE 521 students will have additional parts to do. If a 463 student pairs with a 521 student, the team will have to meet the 521 requirements.
3. You may not work with the same partner on more than one project this semester.
4. You must register your partnership by posting on the Pair-Programming Partners message board, under the topic Project 1.
5. Sharing of code between teams will be considered cheating, and will be penalized in accordance with the Academic Integrity policy.
6. It is acceptable to compare your results with other groups to help debug your program. It is not acceptable to collaborate on the final experiments.
7. You must do all your work in C, C++, or Java. C++ and Java are encouraged because they enable straightforward code reuse and division of labor.
8. Homework will be submitted over the Wolfware Submit system and run in the Eos/Unity environment.

Project description

This project will study instruction caches, and the performance impact of varying the cache line size and of using different compilation parameters for the code run through the cache. ECE 521 students will also simulate the translation-lookaside buffer (TLB) for the system. You will write a trace-driven simulator, which inputs a trace (a sequence of references) from a dynamic instruction stream and uses it to simulate hardware operations.

[Figure: trace input -> I-cache simulator -> output]

Input: Trace file

The simulator reads a trace file that records instructions in the following format:

<start address>: <instruction length 1> <instruction length 2> ... <instruction length m>
<start address>: <instruction length 1> <instruction length 2> ... <instruction length m>

All input is in hex. The start address is the address of the first instruction in a sequence, in hexadecimal. The instruction length is a hex digit that tells how many bytes an instruction occupies. A single line of the trace file represents instructions that are executed sequentially, without any jumps. Every time there is a jump (an unconditional jump or a taken branch), a new line is used in the trace file. This format allows us to represent a large number of instructions in a relatively small trace file.

Example:

00abcdef:8476A58
00abcfee:4853C4B84
00abcdf0:5
00b03540:8C3D4
...

Simulator: Your task

Specification of simulator

Cache simulation capabilities

a. The simulator models a memory hierarchy with an L1 instruction cache and an optional victim cache (the system can be configured with or without a victim cache). Tip: If you are using C++ or Java, note that the victim cache code is quite different from the L1 cache code, so it is best to implement a separate class for each.

b. L1 cache description:
o SIZE: total bytes of data storage
o ASSOC: the associativity of the cache (ASSOC = 1 is a direct-mapped cache)
o BLOCKSIZE: the number of bytes in a block
o LRU replacement policy

There are a few constraints on the above parameters: (i) BLOCKSIZE is a power of two, and (ii) the number of sets is a power of two. Note that ASSOC (and, therefore, SIZE) need not be a power of two. The number of sets is determined by the following equation:

# sets = SIZE / (BLOCKSIZE x ASSOC)

You may assume that a miss to the L1 cache that has to be satisfied from memory takes as long to process as MEM_DELAY hits (where MEM_DELAY is a constant derived as explained in the section on AAT calculation, below).
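To make the trace format and the set-index equation above concrete, here is a minimal C++ sketch of reading one trace line and splitting an address into a set index and tag. The names (Decoded, decode, and so on) and the overall structure are illustrative only; they are not required by this specification.

#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>

struct Decoded { uint32_t set; uint32_t tag; };

// # sets = SIZE / (BLOCKSIZE x ASSOC), per the equation above.
Decoded decode(uint32_t addr, uint32_t size, uint32_t blocksize, uint32_t assoc) {
    uint32_t numSets = size / (blocksize * assoc);  // a power of two, by the constraints above
    uint32_t blockAddr = addr / blocksize;          // strip the byte offset within the block
    return { blockAddr % numSets, blockAddr / numSets };
}

int main() {
    std::string line;
    while (std::getline(std::cin, line)) {          // one trace line = one sequential run
        std::istringstream in(line);
        std::string addrStr, lengths;
        std::getline(in, addrStr, ':');
        in >> lengths;
        uint32_t addr = std::stoul(addrStr, nullptr, 16);
        for (char c : lengths) {                    // each hex digit is one instruction's length
            uint32_t len = std::stoul(std::string(1, c), nullptr, 16);
            // ...reference bytes addr through addr+len-1 in the simulated cache here...
            addr += len;
        }
    }
    return 0;
}

Such a sketch could be run as, e.g., ./a.out < trace1.txt; how the per-byte references are then fed into the cache model is up to you.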

c. Victim cache description:
o The victim cache is fully associative and uses the LRU replacement policy.
o The number of victim cache entries, V, is adjustable.
o L1 cache miss / victim cache hit: If there is a miss in the L1 cache and a hit in the victim cache (say, for block X), block X is placed in the L1 cache. Then the block that X replaced in the L1 cache (say block Y, which we call the victim block) is placed in the victim cache. Block Y goes into the victim cache entry where block X was found, replacing block X in the victim cache. That is, the two caches swap blocks X and Y. This also means that a given block will never reside in both caches at the same time. A special case is when there is no victim block from the L1 cache (which occurs when there are invalid entries in the L1 cache). In this case, block X is simply invalidated in the victim cache, instead of being replaced by a victim block.
o L1 cache miss / victim cache miss: If there is a miss in the L1 cache and a miss in the victim cache (say, for block X), block X is placed in the L1 cache. Then the block that X replaced in the L1 cache (block Y, the victim block) is placed in the victim cache. We cannot perform a swap, as we did above, because block X was not found in the victim cache. Instead, block Y replaces the LRU block in the victim cache. In the special case that there is no victim block from the L1 cache, the victim cache does nothing.
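Below is a small, self-contained C++ sketch of the L1-miss handling just described, with the victim cache kept as an MRU-to-LRU ordered list. The class and member names are our own. Note also that the spec does not say whether a block swapped into the victim cache becomes its MRU entry or keeps block X's recency position; this sketch leaves the position unchanged, which is one reasonable reading.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <list>
#include <optional>

struct VictimCache {
    std::list<uint32_t> blocks;   // block addresses, front = MRU, back = LRU
    std::size_t vEntries;         // the configured number of entries, V

    // Called on an L1 miss for block x.  'victim' is the block evicted from the
    // L1 cache to make room for x, or empty if an invalid L1 entry was used.
    // Returns true if x hit in the victim cache.
    bool onL1Miss(uint32_t x, std::optional<uint32_t> victim) {
        auto it = std::find(blocks.begin(), blocks.end(), x);
        if (it != blocks.end()) {                             // L1 miss / victim cache hit
            if (victim) *it = *victim;                        // swap: Y takes X's entry
            else        blocks.erase(it);                     // no victim block: invalidate X
            return true;
        }
        if (victim) {                                         // L1 miss / victim cache miss
            if (blocks.size() == vEntries) blocks.pop_back(); // Y replaces the VC's LRU block
            blocks.push_front(*victim);
        }                                                     // no victim block: do nothing
        return false;
    }
};

int main() {
    VictimCache vc{{}, 4};
    std::cout << vc.onL1Miss(0x100, 0x200u) << "\n";  // 0 (VC miss); block 0x200 enters the VC
    std::cout << vc.onL1Miss(0x200, 0x300u) << "\n";  // 1 (VC hit); 0x200 and 0x300 are swapped
    return 0;
}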

e. Your simulator should be capable of prefetching. If prefetching is turned on for a particular run, when a reference to a block, say block i, causes a miss, block i is fetched, but immediately after block i is in the cache, block i+1 is also fetched. This means that another line needs to be replaced in the cache. Note that this line will always come from the next set after the set into which line i was fetched. The prefetched line then becomes the MRU line of its set. Since it takes as long to process a miss as to process MEM_DELAY hits, when a block is prefetched, you should schedule the prefetch to begin as soon as the cache miss that triggered the prefetch has been processed. That is, if the current time is t when a miss is encountered, the processor will be stalled until time t+MEM_DELAY waiting for the miss to be serviced. Then a prefetch will begin at time t+MEM_DELAY and finish at time t+2*MEM_DELAY. Once a prefetch has begun, the block that is being prefetched is neither valid nor invalid, but in transition. That is, if it is referenced, it does not cause another miss, but the processor cannot continue right away either. Rather, the processor is stalled until the prefetch has finished.

Here are some thoughts on how to implement this. Instead of a valid/invalid bit for each cache line, we now need a variable that can take on the values VALID, INVALID, and IN TRANSITION. When a prefetch begins, the line into which the block is being prefetched changes state to IN TRANSITION.

What happens if a block is referenced while it is IN TRANSITION? The processor must stall until the prefetch finishes. So the simulator can have a variable called, e.g., time_prefetch_done. Whenever a prefetch occurs, this variable is set to the time when it will be finished (e.g., if MEM_DELAY = 20 in the example above, it will finish at t+20). Now, if a block that is IN TRANSITION is referenced, a stall must occur until that time. So in the simulator, we can simply set current_time = time_prefetch_done. This takes care of handling the stall. (Note that we can get by with a single time_prefetch_done variable, since only one block can be being prefetched at a time.)

Of course, before we handle a prefetch-induced stall, we need to be careful that the block is still IN TRANSITION. If the current time is after the time that the prefetch finishes, we don't need to stall (indeed, we had better not stall!); we just need to set the block's status to VALID. An easy way to implement references to blocks that are IN TRANSITION is to set current_time = max(current_time+1, time_prefetch_done) and set the block's status to VALID.

There's another special case we need to consider: what happens when a miss occurs while a prefetch is in progress? We can't start to process the miss until the prefetch is finished. (Actually, high-performance processors use split-transaction buses, which allow transfers to memory to be overlapped, but in our first project we will opt for the simpler approach of assuming that only one transfer is active at a time.) This means that if the current time is t, the miss will not finish being processed at time t+20, but rather at time_prefetch_done+20. Thus, prefetching makes it possible to have stalls that are almost twice as long as the stalls without prefetching. We hope that this doesn't happen very often, and that the cost is outweighed by having prefetched blocks available without any processor stalls.
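A minimal C++ sketch of this timing bookkeeping follows. Only current_time and time_prefetch_done come from the discussion above; the type and function names are illustrative.

#include <algorithm>
#include <cstdint>

enum class LineState { INVALID, VALID, IN_TRANSITION };

struct Timing {
    uint64_t current_time = 0;
    uint64_t time_prefetch_done = 0;   // only one prefetch can be outstanding at a time
};

// A reference touches a line that is IN TRANSITION: stall (if necessary) until
// the prefetch completes, then mark the line VALID.
void referenceInTransition(Timing& t, LineState& line) {
    t.current_time = std::max(t.current_time + 1, t.time_prefetch_done);
    line = LineState::VALID;
}

// A miss that occurs while a prefetch is still in progress cannot begin until
// the prefetch finishes, so it completes at time_prefetch_done + MEM_DELAY
// rather than at current_time + MEM_DELAY.
uint64_t missCompletionTime(const Timing& t, uint64_t mem_delay) {
    return std::max(t.current_time, t.time_prefetch_done) + mem_delay;
}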

f. Assume that no instruction is ever written to during the course of our simulations. Thus, you will not have to implement a dirty bit.

i. Five parameters completely specify the system: SIZE, ASSOC, BLOCKSIZE, PREFETCH, and V. The variable V is the size of the victim cache. If V = 0, the victim cache is disabled.

j. The size of an address is 32 bits.

TLB simulation capabilities

ECE 521 students will also simulate the TLB of the system. The page size of the system is a parameter that, for a particular run, will be set to either 4K or 8K bytes. You may assume that physical memory consists of 256 MB (how large does this make the page-frame number?).

k. TLB description:
o ENTRIES: number of TLB entries
o T_ASSOC: the associativity of the TLB (T_ASSOC = 1 is a direct-mapped TLB)
o LRU replacement policy

Each TLB entry consists of a valid bit, a page number, and a page-frame number. A real architecture would also maintain a write-protect bit for each page, but that is not necessary in our simulation.

l. Address translation. Assume that the addresses in the trace file are virtual addresses. They must be translated to physical addresses before the instructions are placed in the I-cache. Translation proceeds as follows. The address is separated into a page number and a displacement. The page number is looked up in the TLB. If it is present, it is translated to a page-frame number, which is concatenated with the displacement to form the physical address. Then the physical address is looked up in the cache. If the page number is not present in the TLB, an entry must be made for it. In a real architecture, this would require looking it up in the page table. However, to simplify matters, we will simply assign the page-frame number by applying the following function to the page number: discard the 8 most significant bits of the address, as well as the bits that constitute the displacement, and then take the ones-complement of the remaining bits. This becomes the page-frame number. An entry is then made for this page in the TLB. The entry consists of a valid bit (set to true), the page number, and the page-frame number. This entry is placed in the proper place in the TLB, and may replace a pre-existing TLB entry. The cost of the TLB miss is 2 eight-byte memory reads.
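Here is a minimal C++ sketch of one reading of this page-frame function, for a 32-bit address and a 4K or 8K page. The function names are our own, and a real run would, of course, consult the TLB first and apply this function only on a TLB miss.

#include <cstdint>

// Page-frame assignment from item l: drop the top 8 address bits and the
// displacement bits, then take the ones-complement of what remains.
uint32_t pageFrameNumber(uint32_t vaddr, uint32_t pageSize /* 4096 or 8192 */) {
    uint32_t offsetBits = (pageSize == 4096) ? 12 : 13;
    uint32_t middleBits = 24 - offsetBits;                 // 32 bits minus top 8 minus displacement
    uint32_t middle = (vaddr >> offsetBits) & ((1u << middleBits) - 1);
    return (~middle) & ((1u << middleBits) - 1);           // ones-complement, masked to field width
}

// Physical address = page-frame number concatenated with the displacement.
uint32_t translate(uint32_t vaddr, uint32_t pageSize) {
    uint32_t offsetBits = (pageSize == 4096) ? 12 : 13;
    uint32_t displacement = vaddr & ((1u << offsetBits) - 1);
    return (pageFrameNumber(vaddr, pageSize) << offsetBits) | displacement;
}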

Project Specification

Model 1: L1 cache (both 463 and 521 students should complete this part)

Requirements
1. The parameters, such as cache size, block size, associativity (n >= 1), and prefetching, are adjustable in your simulator.
2. LRU is the only replacement policy.

Output of your simulator

In overview, this is the kind of information you will be collecting for each of the cache systems. For more complete information, refer to the specific lists in the descriptions of each type of cache system, below.
1. Number of lines fetched into the cache.
2. Average number of bytes from each line that are referenced by the processor while in the cache (see below).
3. Number of cache reads.
4. Number of read misses.
5. Cache miss rate = cache read misses / cache reads.

For item 2 above, keep a bit-vector for each line brought into the cache. The bit-vector contains one bit for each byte in the cache line. Each time a line is brought into the cache, the bit-vector is initialized to all 0s. Each time a byte is referenced by the processor, the corresponding bit in the bit-vector is turned on. (Note that a single instruction is usually several bytes in length, so in simulating the execution of an instruction, it may be necessary to turn on several bits of the bit-vector.) When a line is replaced in the cache, record the number of 1-bits in the bit-vector; call this number the number of active bytes. At the end of the simulation, go through each valid line in the cache and record the number of active bytes (it might be easier to achieve this just by invalidating all the lines in the cache when the simulation ends). Then calculate

FBU = (total # active bytes) / (# cache misses x BLOCKSIZE)
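A minimal C++ sketch of this per-line bit-vector bookkeeping and the FBU calculation follows; the names (LineUsage, onAccess, and so on) are illustrative only.

#include <algorithm>
#include <cstddef>
#include <vector>

// One flag per byte of the cache line; reset every time the line is filled.
struct LineUsage {
    std::vector<bool> touched;
    explicit LineUsage(std::size_t blocksize) : touched(blocksize, false) {}

    void onFill() { std::fill(touched.begin(), touched.end(), false); }

    // Mark the bytes of one instruction (which may span several bytes) as referenced.
    void onAccess(std::size_t offsetInLine, std::size_t length) {
        for (std::size_t i = 0; i < length && offsetInLine + i < touched.size(); ++i)
            touched[offsetInLine + i] = true;
    }

    // Called when the line is replaced, or invalidated at the end of the simulation.
    std::size_t activeBytes() const {
        std::size_t n = 0;
        for (bool b : touched) n += b ? 1 : 0;
        return n;
    }
};

// FBU = (total # active bytes) / (# cache misses x BLOCKSIZE)
double fbu(std::size_t totalActiveBytes, std::size_t misses, std::size_t blocksize) {
    return static_cast<double>(totalActiveBytes) / (misses * blocksize);
}

Note that an instruction can straddle a line boundary; in that case the reference must be split across two lines (the bounds check in onAccess only clips the overflow, it does not carry it to the next line).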

Data analysis: Use the data collected from your experiments to analyze the relationship among the miss rate, AAT (average access time; refer to the AAT calculation and CACTI tool sections), cache size, block size, and the other cache parameters (preferably with tables and graphs). Find the two data-cache systems with the best AAT for each trace file.

Model 2: L1 cache + victim cache (for both 463 and 521 students)

Requirements
1. The cache size, block size, cache associativity, and prefetching policy of the L1 and victim cache (fully associative) are adjustable in your simulator.
2. LRU is the only replacement policy.

Output of your simulator:
1. Number of lines fetched into the L1 cache.
2. Average number of bytes from each line that are referenced by the processor while in the cache.
3. Number of L1 cache reads.
4. Number of L1 read misses.
5. L1 cache miss rate = L1 read misses / L1 cache reads.
6. Number of VC reads.
7. Number of VC read misses.
9. VC miss rate = VC read misses / L1 read misses.

Data analysis: Take the two best combinations of parameters from Model 1, and vary the size of the victim cache. Find the two data-cache systems with the best AAT for each trace file. Describe why you think the best-performing system outperforms the others.

Model 3: All of the above + TLB

Requirements
1. The cache size, block size, cache associativity, and prefetching policy of the L1 and victim cache (fully associative) are adjustable in your simulator.
2. The number of entries and the associativity of the TLB are adjustable in your simulator.
3. LRU is the only replacement policy.

Output of your simulator: All of the outputs listed in Model 1 and Model 2, whichever applies to a particular run, as well as
1. Number of TLB reads (should be the same as the number of L1 cache reads).
2. Number of TLB misses.
3. TLB miss rate = TLB misses / TLB reads.

Data analysis: Use the data collected from your experiments to analyze the relationship among miss rate, AAT (average access time), cache size, block size, number of TLB entries, and the other cache parameters (ideally with tables and graphs). Find the two data-cache systems with the best AAT for each trace file.

Overall analysis

From an architectural perspective, analyze and compare the characteristics of the three models using the results from your experiments.

CACTI tool

The CACTI tool is used to calculate the access time of a cache based on its configuration (size, associativity, etc.).

Installing CACTI

Log into your Unity home directory in Unix (Solaris, etc.). CACTI does not run on Linux.

Using CACTI

We've compiled cacti; just type its name to run it. The executable is in /afs/eos/courses/ece/ece521/common/www/homework/projects/1/cacti (or http://courses.ncsu.edu/ece521/common/homework/projects/1/cacti).

Using our function (simplest)

The simplest way to use CACTI is to call the function that we have provided. Simply copy the function cacti_get_at from the file callcacti.c in the same directory, and paste it into your code. It has three parameters: cache size in bytes, block size in bytes, and associativity. (For a TLB, use 8 bytes as the block size.) In calling this function, use an associativity of 1 to denote a direct-mapped cache and an associativity of -1 to denote a fully associative cache. For example, the function might be called like this:

cacti_get_at(16384, 128, 4)

You are not required to use the cacti_get_at function. You may call CACTI directly, as follows.

Calling CACTI directly (more flexibility)

In case you need to use your own code to call cacti, here's what you need to know.

CACTI syntax:

cacti <csize> <bsize> <assoc> <tech>

csize - size of cache in bytes (e.g., 16384)
bsize - block size of cache in bytes (e.g., 32)
assoc - associativity of cache (e.g., 2, 4, DM, or FA)
        For direct-mapped caches, use 1 or DM
        For set-associative caches, use a number n for an n-way associative cache
        For fully associative caches, use FA
tech  - technology size in microns (we use 0.18um)

Example:

eos% cacti 16384 32 2 0.18um

will return the following timing and power analysis for a 16K 2-way set-associative cache with a 32-byte block size at a 0.18 um feature (technology) size:

Technology Size: 0.18um
Vdd: 1.7V
Access Time (ns): 1.20222      <---------------------------- this is what we need
Cycle Time (wave pipelined) (ns): 0.400739
Power (nJ): 1.49542
Wire scale from mux drivers to data output: 0.10
...

CACTI doesn't allow the number of sets to be too small. To simulate a fully associative cache, you have to use the FA option. Example:

eos% cacti 16384 32 FA 0.18um

will then return

...
Access Time (ns): 2.89345
...
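If you use the provided function, the calls might look like the following C++ sketch. The spec fixes cacti_get_at's three parameters; the return type (assumed here to be a double holding the access time in ns) and the sizes passed for the victim cache and TLB (V x BLOCKSIZE and ENTRIES x 8 bytes) are our own assumptions, so check callcacti.c for the actual signature.

#include <algorithm>

double cacti_get_at(int size_bytes, int block_bytes, int assoc);  // pasted in from callcacti.c (assumed signature)

// Look up the three access times and derive the processor cycle time
// (= max(T_TLB, T_L1, T_V), per the AAT calculation section below).
double cycleTimeNs(int SIZE, int BLOCKSIZE, int ASSOC, int V, int ENTRIES, int T_ASSOC) {
    double t_L1  = cacti_get_at(SIZE, BLOCKSIZE, ASSOC);
    double t_V   = (V > 0) ? cacti_get_at(V * BLOCKSIZE, BLOCKSIZE, -1) : 0.0;   // victim cache is fully associative
    double t_TLB = (ENTRIES > 0) ? cacti_get_at(ENTRIES * 8, 8, T_ASSOC) : 0.0;  // TLB: block size 8 bytes
    return std::max({t_L1, t_V, t_TLB});
}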

Limits of Parameters

No matter whether you use our function or your own code to call CACTI, there are some limits on the parameters:

                         Minimum   Maximum
Cache size (bytes)       2^6       2^30
Block size* (bytes)      2^3       2^17
Associativity            1         2^5, or 2^25 (using option FA)
Number of sets           2^3       2^27

* Block size must be a power of 2.

AAT calculation

The access times of the L1 cache, victim cache, and TLB can be determined using the CACTI tool. Below, T_L1, T_TLB, and T_V refer to the L1 cache, TLB, and victim cache access times (in ns), respectively. Set the processor cycle time to max(T_TLB, T_L1, T_V).

For the L1 cache:

AAT = T_L1 + Miss Rate_L1 x Miss Penalty_L1

It takes 20 cycles to fetch the first 8 bytes of a block from memory, and the remaining bytes are transferred at a rate of 8 bytes/cycle. Assuming the L1 cache has the longest access time of any cache in the system (and thus the clock period, in ns, is equal to T_L1), and its line length is 64 bytes,

Memory-to-L1 transfer time = 20 + 7 = 27 cycles
Miss Penalty_L1 = MEM_DELAY = 27 cycles

For the L1 + victim cache:

AAT = T_L1 + Miss Rate_L1 x V_DELAY + Miss Rate_V x MEM_DELAY

Transferring a block from the victim cache to the L1 cache occurs at a rate of 8 bytes/cycle, and the block must be completely transferred before the processor resumes. The V-to-L1 transfer time is

V_DELAY = (BLOCKSIZE / 8) cycles
Miss Penalty_L1 = T_V + (1 - Miss Rate_V) x (BLOCKSIZE / 8) x T_L1 + Miss Rate_V x Miss Penalty_V

For the L1 cache + TLB: Assume that the TLB lookup is done in parallel with the cache access, so there is no impact on access time except in the case of a TLB miss. Take the TLB miss rate and multiply it by the time to service a TLB miss (which is the time to fetch 8 bytes from main memory). Then proceed as before with the AAT calculation.
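The formulas above mix access times in ns with penalties in cycles. One consistent reading (convert cycle counts to ns using the clock period, and use the Miss Penalty_L1 form of the victim-cache formula) is sketched in C++ below; the names and the unit handling are our own interpretation, which you should state in your report.

#include <algorithm>

struct AatInputs {
    double t_L1, t_V, t_TLB;        // CACTI access times (ns)
    double missRateL1, missRateV;   // measured by the simulator
    int    blocksize;               // BLOCKSIZE in bytes
};

// MEM_DELAY generalized from the 64-byte example above: 20 cycles for the
// first 8 bytes, then 8 bytes per cycle (20 + 7 = 27 cycles for 64 bytes).
static double memDelayCycles(int blocksize) { return 20.0 + (blocksize - 8) / 8.0; }

double aatL1Only(const AatInputs& in) {
    double clock = std::max({in.t_L1, in.t_V, in.t_TLB});       // processor cycle time (ns)
    double missPenalty = memDelayCycles(in.blocksize) * clock;  // MEM_DELAY expressed in ns
    return in.t_L1 + in.missRateL1 * missPenalty;
}

double aatWithVictimCache(const AatInputs& in, double missPenaltyV) {
    double missPenaltyL1 = in.t_V
        + (1.0 - in.missRateV) * (in.blocksize / 8.0) * in.t_L1  // V-to-L1 transfer on a VC hit
        + in.missRateV * missPenaltyV;                           // go to memory on a VC miss
    return in.t_L1 + in.missRateL1 * missPenaltyL1;
}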

Program Interface Requirements

To assure that your code can be run and graded easily:
1. Your code must be able to run in the Unix environment on the Eos/Unity system.
2. A makefile must be provided.
3. Running make will create an executable file named icache.
4. The program must be executable from the command line, e.g., for a run of the simulator without a TLB,

eos% icache <tracefile> <SIZE> <ASSOC> <BLOCKSIZE> <PREFETCH> <V>

and, for a run with a TLB,

eos% icache <tracefile> <SIZE> <ASSOC> <BLOCKSIZE> <PREFETCH> <V> <ENTRIES> <T_ASSOC>

Example for no TLB:

eos% icache trace1.txt 16384 2 128 0 4 <return>

After the trace file, the first parameter is the cache size, in bytes (16384); the second parameter is the associativity (2); the third parameter is the block size (128); and the fourth parameter is whether prefetching is in use (0 = no, 1 = yes). For a victim cache, there is a fifth parameter, V, the size of the victim cache in number of lines.

What to hand in

1. Makefile
2. Source code
3. Project report (Microsoft Word .doc file recommended)

* No executable file (including cacti) is needed; just submit your source code and makefile. Make sure you use zip to compress all your source code and the report into a single zip file named project1.zip. Inform us if your file is larger than 1 MB.

Grading

0%   You do not hand in (submit electronically) anything by the due date.
+10% Your Makefile works, and creates three simulators (executable files).
+10% Your simulator can read trace files from the command line and has the proper interface for parameter settings.
+50% Your simulator produces the correct output.
+30% You have done a good analysis of the experiments.
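Finally, a minimal C++ sketch of parsing the command-line interface described under Program Interface Requirements above. The variable names mirror the specification's parameter names; the structure is not prescribed.

#include <cstdlib>
#include <iostream>
#include <string>

// usage: icache <tracefile> <SIZE> <ASSOC> <BLOCKSIZE> <PREFETCH> <V> [<ENTRIES> <T_ASSOC>]
int main(int argc, char* argv[]) {
    if (argc != 7 && argc != 9) {
        std::cerr << "usage: icache <tracefile> <SIZE> <ASSOC> <BLOCKSIZE> <PREFETCH> <V>"
                     " [<ENTRIES> <T_ASSOC>]\n";
        return 1;
    }
    std::string tracefile = argv[1];
    int SIZE      = std::atoi(argv[2]);
    int ASSOC     = std::atoi(argv[3]);
    int BLOCKSIZE = std::atoi(argv[4]);
    int PREFETCH  = std::atoi(argv[5]);                    // 0 = no prefetching, 1 = prefetching
    int V         = std::atoi(argv[6]);                    // 0 disables the victim cache
    int ENTRIES   = (argc == 9) ? std::atoi(argv[7]) : 0;  // TLB parameters (521 runs only)
    int T_ASSOC   = (argc == 9) ? std::atoi(argv[8]) : 0;
    // ...construct the caches (and TLB, if requested), run the trace, print the statistics...
    (void)tracefile; (void)SIZE; (void)ASSOC; (void)BLOCKSIZE;
    (void)PREFETCH; (void)V; (void)ENTRIES; (void)T_ASSOC;
    return 0;
}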