
University of Nevada, Reno

CacheVisual: A Visualization Tool for Cache Simulation

A professional paper submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering

by Howard Silva

Dr. Angkul Kongmunvattana, Advisor

August, 2004

Copyright 2004 Howard Silva. All rights reserved.

Abstract

This professional paper describes CacheVisual, a cache visualization project to develop a graphics-based program that demonstrates some of the basic principles of cache design through computer simulation. The three basic (simple) cache organizations (direct mapped, fully associative, and set associative) are explained and illustrated. In addition, sector and sector pool caches are also explained and illustrated. The design and implementation of the project is covered, including detailed illustrations of program input/output and of program demonstration. Besides its inherent ability to assist teachers and students with cache design principles through a visual, interactive process, CacheVisual is also a trace-driven research tool for generating cache performance data. The overall intent of this paper is to show how CacheVisual provides an alternative to the typical simulation tools used in computer architecture research.

1 Introduction

This project is intended to satisfy a requirement for attaining a Master's degree in Computer Engineering from the University of Nevada, Reno. The specific reasons why this project was selected are explained in the remainder of this section.

The importance of computer simulations (tools) for analyzing and verifying computer architecture designs has been well established. With the advent of superscalar CPUs, pipelined CPUs, and varying cache organizations, this importance has become even greater. While these tools have provided much help for those doing research, the task of understanding the ever-increasing complexity of these designs has typically been left to normal publishing methods. This same complexity can mean that these tools require a fair amount of knowledge on the part of the user just to operate them. A need has arisen to (1) combine the proven research aspects of these types of tools with a user-friendly interactive interface, thereby simplifying their use, and (2) incorporate their use into the learning process.

The goals of this project, then, are to develop an interactive tool that allows a user to visualize cache organization and operation, while also providing for research, thus expanding the overall usefulness of these types of tools. The approach taken provides a graphics-based interactive interface that accepts various combinations of user input and immediately displays the results of that input. It is hoped that this hands-on visual approach will help the user understand the concepts involved more effectively than other methods of learning.

This project attempts to contribute to the study of computer microarchitecture design in three distinct areas: one, to expand the scope of research tools to include an easy-to-use interface and the educational aspects of the technology being studied; two, to provide for modeling of sector cache and sector pool cache designs; and three, to support basic research on cache design utilizing trace files.

2 Related Work

The SimpleScalar tool set [BA97] performs fast and accurate simulations of modern processors. The tool set can simulate superscalar processors, pipelines, branch predictors, and cache organizations, among many other features too numerous to list. The product is essentially a research tool that collects and reports an extensive array of performance data. It has very little, if any, interactive capability, and does not provide any form of graphical user interface. The product is UNIX based and requires the user to understand the many optional command-line parameters used in its execution, and therefore, by necessity, to have some overall knowledge of the technical aspects SimpleScalar covers. In addition, while the SimpleScalar source code is available for software modifications, the source code is large, comprising over four million bytes. The SimpleScalar product also includes binary utilities (recommended); a compiler, assembler, and libraries (all optional); and other precompiled binary files (optional). Currently at Version 2.0, SimpleScalar was developed over a period of years starting in the mid-1980s and spanning most of the 1990s, with updates and support provided up to the present time. The introductory PDF document for SimpleScalar is 21 pages and includes sections on installation and product description. An online overview and tutorial are also available over the Internet.

Similar to the SimpleScalar tool set, the Dinero IV Trace-Driven Uniprocessor Cache Simulator [EH] is a UNIX-based, parameter-driven product, but it reports only hit and miss data. It also does not have a graphical user interface. Besides its trace-driven operation, some features cited on its Web page are:

- subroutine-callable interface in addition to a trace-reading program
- simulation of multi-level caches
- simulation of dissimilar I and D caches
- better performance, especially for highly associative caches
- classification of compulsory, capacity, and conflict misses
- support for multiple input formats

As the above two descriptions indicate, research tools typically require a fair amount of knowledge and effort just to use them. Besides installation and compatibility issues, there may be a learning curve involved in their use, and running these tools does not involve much ease of use; that is, they are not what we would normally describe as user friendly. They also do not typically address the educational aspects of the basic principles of the technology they study. The CacheVisual project is designed to address these issues.

3 Cache Organization

One of the essential uses of a computer is to complete a task in the least amount of time. To reach this goal, a storage-medium hierarchy evolved during the growth of computer technology. This hierarchy came about due to the inherent design factors and cost factors in building faster and larger storage media. A typical storage hierarchy consists of the CPU registers, cache, memory, the I/O devices, and the interconnections between each level. The purpose of a cache is to provide a high-speed buffer close to the CPU in order to reduce the delay when accessing memory. A cache can be thought of as having physical areas of storage called lines, which hold a block, or blocks, of bytes; each line has the means (a tag) of identifying where its blocks reside in memory.
The basic process for a cache access is searching for the requested memory reference (program counter or memory address) in cache and accessing the item if found; if it is not found, the item is fetched from memory and placed in cache. A cache access in which the requested item (data or instruction) is found in cache is a cache hit, while an access in which the item is not found is a cache miss. The rates of each (hit rate and miss rate, respectively) are found by dividing the number of hits or misses by the total number of cache accesses. Cache misses carry the additional penalty of having to fetch the missing item from slower memory. In general, smaller caches result in faster access but higher miss rates; as expected, larger caches result in slower access but lower miss rates. The three aspects of cache design that this project is concerned with are block placement (mapping of memory block addresses to cache lines), block identification (tag comparison), and block replacement policy in the event of a miss [HP03]. Block placement schemes fall into three basic types or organizations: direct mapped, fully associative,

and set associative. In addition, two other placement schemes, sector cache and sector pool cache, involve variations of these three prior types. As one might expect, these schemes involve performance tradeoffs when compared to each other, and also involve tradeoffs in how the internals of each organization are implemented.

3.1 Direct Mapped Cache

Direct-mapped caches result in a memory block being placed in only one location (a 1-to-1 mapping of one block to one cache line) in cache. Figure 1 depicts an example of a direct-mapped cache organization with 256 kilobytes of total cache (32 bytes per cache line yields 8K cache lines), 128 megabytes of total memory (32 bytes per block yields 4M block frames), and a breakdown of the program counter (PC) showing the number of bits used for the tag, for the block/line, and for the byte. This type of cache is the easiest to implement in hardware, as no cache search is involved and a block replacement strategy is not required. As might be expected, these factors also result in faster cache access, a smaller cache tag, and a lower cost. The tradeoffs are a higher miss rate due to the constrained cache placement, and lower utilization of cache space. The cache miss rate can be higher if many addressed blocks map to the same cache location. An additional bit (not shown) is used for each block to indicate if it is valid or invalid (empty). The valid bit is also utilized in the set-associative and fully-associative cache organizations described below. This placement scheme (mapping) typically uses a modulo method to determine the cache line value,

(block address) MOD (number of blocks in cache).    (2.1)

This mapping method results in a cache line value that ranges from zero to the number of blocks in cache minus one. As shown in Figure 1, using Eq. 2.1 a PC value of 16382 maps to cache line 8190 (16382 MOD 8192), where 16382 is the block address and 8192 is the number of blocks in cache (2^18 bytes divided by 2^5 bytes per block). As noted at the beginning of section three, a block in cache is identified by its tag value (location in memory). To determine the tag value, the PC (16382) is first partitioned as follows (note that here the PC is block addressable, so the byte offset is omitted):

PC (block address) = tag (9 bits) | cache line (13 bits)

Then the PC value is shifted to the right 13 bits (divided by 8192), yielding the nine-bit tag integer value. Thus a PC value of 16382 yields a tag value of one (16382/8192).

[Figure 1 — Example of Direct Mapped Cache Organization: a 256 KB cache (2^18 bytes) with 32 bytes per cache line (2^13 or 8K cache lines, C0 through C8191) and 128 MB of memory (2^27 bytes) with 32 bytes per block frame (2^22 or 4M block frames); blocks B0, B8192, B16384, ... map to cache tag 0 (tag values range from 0 to 511), and the 27-bit program counter is divided into a 9-bit Tag, a 13-bit Block/Line field, and a 5-bit Byte field.]

3.2 Fully Associative Cache

Fully-associative caches are essentially the opposite of direct-mapped caches in that a fully associative mapping results in a memory block being placed anywhere (a 1-to-any mapping of one block to any cache line) in cache. Figure 2 depicts an example of a fully associative cache organization with 256 kilobytes of total cache (32 bytes per cache line yields 8K cache lines), 128 megabytes of total memory (32 bytes per block yields 4M block frames), and a breakdown of the program counter showing the number of bits used for the tag and for the byte. This type of cache results in the highest utilization of cache space and a higher hit rate. The tradeoffs are that it is the hardest placement scheme to implement in hardware, since every cache tag must be searched, which results in slower cache access, the largest cache tag, and a higher implementation cost. Because of this higher cost, fully-associative caches have typically been limited to smaller-size applications, such as microprocessors.

Again, a block in cache is identified by its tag value (location in memory). To determine the tag value, the PC (16382) is first partitioned as follows (note that here the PC is block addressable, so the byte offset is omitted):

PC (block address) = tag (22 bits)

As the above partitioning of the PC indicates, the tag value is the PC value, so a PC value of 16382 would have a tag value of 16382. As noted at the beginning of section three, there are three basic replacement policies: random, least recently used (LRU), and first-in first-out (FIFO). A random replacement policy attempts to produce a uniform cache distribution. For a fully-associative cache, the random replacement policy would result in any cache line being replaced in the event of a miss. The LRU replacement policy records accesses to cache blocks, so the block that was accessed furthest in the past will be replaced. The idea here is to take advantage of the principle of locality: blocks recently accessed, or blocks close in memory to recently accessed blocks, tend to be accessed more frequently. For a fully-associative cache, the LRU replacement policy would result in all of the cache lines being checked for the least recently used line. Besides the additional hardware (some sort of counter for each cache line) involved in implementing an LRU policy, additional time would also be required to check the counters on all of the cache lines. The FIFO replacement policy simply treats each cache line as a member of a queue, replacing the oldest member of the queue. Similar to the previous two methods, the FIFO replacement policy for a fully-associative cache involves finding the oldest member among all of the cache lines.
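To make the counter-based LRU search concrete, the fragment below is a minimal C++ sketch written for this discussion (it is not part of CacheVisual, which offers only FIFO and second-chance FIFO): each cache line carries a last-used counter, and the victim is the line whose counter is smallest.

    #include <vector>
    #include <cstddef>

    // Find the LRU victim by scanning one last-used counter per cache line.
    // lastUsed[i] holds the access time of line i (larger = more recent).
    size_t lruVictim(const std::vector<unsigned long>& lastUsed) {
        size_t victim = 0;
        for (size_t i = 1; i < lastUsed.size(); ++i)   // every line is checked,
            if (lastUsed[i] < lastUsed[victim])        // which is the time cost
                victim = i;                            // noted in the text
        return victim;
    }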

[Figure 2 — Example of Fully Associative Cache Organization: a 256 KB cache (2^18 bytes) with 32 bytes per cache line (2^13 or 8K cache lines) and 128 MB of memory (2^27 bytes) with 32 bytes per block frame (2^22 or 4M block frames); any block may be placed in any cache line, and the 27-bit program counter is divided into a 22-bit Tag and a 5-bit Byte field.]

3.3 Set Associative Cache

As might be expected, a middle ground exists between direct-mapped and fully-associative cache organizations. This middle ground results in a memory block being mapped anywhere in a set (a 1-to-set mapping of one block to one cache set) of cache lines. The set is made up of n blocks, and the placement is called n-way set associative. Figure 3 depicts an example of an 8-way set associative cache organization with 256 kilobytes of total cache (32 bytes per cache line yields 8K cache lines and 1K sets), 128 megabytes of total memory (32 bytes per block yields 4M block frames), and a breakdown of the program counter showing the number of bits used for the tag, for the block/set, and for the byte. The search and block replacement schemes are simpler than fully associative due to the smaller size of a set. The other advantages and tradeoffs of this cache organization fall between those of direct-mapped and fully-associative cache organizations. Internally, given a fixed cache size, tradeoffs exist between the set size and the number of sets, which can lead to a higher hit rate.

This placement scheme (mapping) typically uses a modulo method to determine the cache set value,

(block address) MOD (number of sets in cache).    (2.2)

This mapping method results in a cache set value that ranges from zero to the number of sets in cache minus one. As shown in Figure 3, using Eq. 2.2 a PC value of 16382 maps to cache set 1022 (16382 MOD 1024), where 16382 is the block address and 1024 is the number of sets in cache (2^18 bytes divided by 2^5 bytes per block divided by 2^3 cache lines per set). Again, a block in cache is identified by its tag value (location in memory). To determine the tag value, the PC (16382) is first partitioned as follows (note that here the PC is block addressable, so the byte offset is omitted):

PC (block address) = tag (12 bits) | cache set (10 bits)

Then the PC value is shifted to the right 10 bits (divided by 1024), yielding the twelve-bit tag integer value. Thus a PC value of 16382 yields a tag value of fifteen (16382/1024).

For a set-associative cache, the random replacement policy would result in any cache line in the set being replaced in the event of a miss. The LRU replacement policy would result in all of the cache lines in the set being checked for the least recently used line. Similar to the previous two policies, the FIFO replacement policy for a set-associative cache involves finding the oldest member of the set.

As a note, set-associative caches can be generalized to include direct-mapped and fully-associative caches. Direct-mapped caches can be thought of as having a cache set size of one cache line (1-way set associative), so the number of sets equals the number of cache lines. Fully-associative caches can be thought of as having a cache set size equaling the number of cache lines (c-way set associative, where c is the number of cache lines), so there is only one set. While each replacement policy offers advantages over the others, the FIFO policy and a close relative, second-chance FIFO (blocks that have been hit are not replaced), were selected as options for this project. As a note, the FIFO replacement policy is considered to closely approximate the LRU replacement policy.
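This generalization can be checked with a few lines of C++. The sketch below was written for this paper's worked examples (it is not taken from CacheVisual): it maps block address 16382 into an 8192-line cache at three associativities, where 1-way reproduces the direct-mapped result of section 3.1, 8-way the set-associative result above, and 8192-way the fully-associative result of section 3.2.

    #include <cstdio>

    // n-way set-associative mapping; direct mapped is 1-way and
    // fully associative is cacheLines-way, as described in the text.
    void mapBlock(long blockAddr, long cacheLines, long assoc) {
        long sets = cacheLines / assoc;
        long set  = blockAddr % sets;   // Eq. 2.1 / Eq. 2.2
        long tag  = blockAddr / sets;   // same as shifting right log2(sets) bits
        printf("%4ld-way: set %ld, tag %ld\n", assoc, set, tag);
    }

    int main() {
        mapBlock(16382, 8192, 1);     // direct mapped: set (line) 8190, tag 1
        mapBlock(16382, 8192, 8);     // 8-way: set 1022, tag 15
        mapBlock(16382, 8192, 8192);  // fully associative: set 0, tag 16382
        return 0;
    }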

[Figure 3 — Example of 8-Way Set Associative Cache Organization: a 256 KB cache (2^18 bytes) with 32 bytes per cache line (2^13 or 8K cache lines) and 8 cache lines per set (2^10 sets, Set 0 through Set 1023), plus 128 MB of memory (2^27 bytes) with 32 bytes per block frame (2^22 or 4M block frames); blocks B0, B1024, ... map to cache tag 0 (tag values range from 0 to 4095), and the 27-bit program counter is divided into a 12-bit Tag, a 10-bit Block/Set field, and a 5-bit Byte field.]

3.4 Sector Cache

Given the finite resources (physical size and transistors) of a CPU and cache, the amount of resources used to store tag values is an essential consideration of resource utilization. As computer memories expanded, the tag values associated with a block of data became larger. While increasing the block size helps to alleviate this problem, larger block sizes require higher bandwidth for data transfer and lead to higher miss rates due to poor utilization of cache storage. To deal with this problem, sectors (groups of blocks/subsectors) of memory became a fundamental unit of storage. This allows more data to be associated with each cache tag. To help alleviate the impact of the higher bandwidths and higher miss rates associated with larger units of storage, only the referenced subsector is placed into its sector area (offset) in cache. On a cache miss, only the missing subsector has to be fetched if the relevant sector is already in cache. Even if a sector has to be evicted due to a cache miss, only the referenced subsector is placed in cache along with the new tag value for the new sector. Two additional bits (not shown) are used for each subsector to indicate if it is valid/invalid (empty) and clean/dirty (modified while in cache). Figure 4 depicts an example of a direct-mapped sector cache organization with 256 kilobytes of total cache (32 bytes per subsector and 4 subsectors per sector yields 2K cache lines), 128 megabytes of total memory (32 bytes per subsector yields 4M subsector block frames), and a breakdown of the program counter showing the number of bits used for the tag, for the sector, for the block/subsector, and for the byte.
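The Figure 4 address breakdown can be expressed as shifts and masks. The following C++ sketch is illustrative only (the struct and function names are invented here, not taken from CacheVisual); it splits a 27-bit byte address into the four fields, using block 16382 from the earlier examples as a byte address.

    #include <cstdint>
    #include <cstdio>

    // Split a 27-bit byte address per Figure 4: 9-bit tag, 11-bit sector,
    // 2-bit block/subsector, 5-bit byte. Names are illustrative.
    struct SectorFields { unsigned tag, sector, subsector, byte; };

    SectorFields split(uint32_t addr) {
        SectorFields f;
        f.byte      = addr & 0x1F;           // low 5 bits: byte within subsector
        f.subsector = (addr >> 5) & 0x3;     // next 2 bits: subsector within sector
        f.sector    = (addr >> 7) & 0x7FF;   // next 11 bits: cache line (sector)
        f.tag       = addr >> 18;            // top 9 bits: tag
        return f;
    }

    int main() {
        SectorFields f = split(16382u << 5);  // block 16382 as a byte address
        // prints: tag=1 sector=2047 subsector=2 byte=0
        printf("tag=%u sector=%u subsector=%u byte=%u\n",
               f.tag, f.sector, f.subsector, f.byte);
        return 0;
    }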

[Figure 4 — Example of Direct Mapped Sector Cache Organization: a 256 KB cache (2^18 bytes) with 32 bytes per subsector and 4 subsectors per sector (2^11 or 2K cache lines, each holding a tag and four subsectors) and 128 MB of memory (2^27 bytes) with 32 bytes per Subsector Block Frame (4M SBFs); sectors map to tag values 0 through 511, and the 27-bit program counter is divided into a 9-bit Tag, an 11-bit Sector field, a 2-bit Block/Subsector field, and a 5-bit Byte field.]

3.5 Sector Pool Cache

While sector caches do prove useful in some cases, a major problem can occur with their use: a sector is typically evicted before all of its subsectors are used, resulting in an inefficient use of cache storage. To deal with this problem, Rothman and Smith [RS99] proposed that a pool of subsectors be shared among a set of sectors. Figure 5a depicts an example of an 8-way set associative sector pool cache organization with 256 kilobytes of total cache (32 bytes per subsector, 4 subsectors per sector, and 8 sectors per set yields 256 sets), 128 megabytes of total memory (32 bytes per subsector block frame yields 4M subsector block frames), and a breakdown of the program counter showing the number of bits used for the tag/sector, for the set, for the block/subsector, and for the byte. The total number of subsectors in the pool is less than that of a sector cache with the same set size and number of subsectors per sector. Associated with a pool of subsectors are the concepts of a pool depth, the number of subsectors reserved for each subsector offset in the set, and of a pool list, one for each subsector offset, which contains the actual subsectors. As a subsector is fetched into cache, its offset position in the cache line (sector) will hold a pointer value indicating its position in the pool depth for its relevant pool list. A pointer value of zero indicates the subsector is not in the pool. Figure 5b depicts a cache line with a tag field and four pointer values (3, 1,

0, 5) pointing to a pool of subsectors with a pool depth of five. An additional bit (not shown) is used for each subsector to indicate if it is clean or dirty (modified while in cache).

All of the aspects described above in sections 3.2 and 3.3 (such as advantages and tradeoffs, mapping and identification, and replacement policies) for fully associative and set associative caches apply to sector pool caches, except that the basic storage unit is a sector, not a block. So, instead of a block, a sector is mapped to cache, a sector has a tag, a sector is replaced, and there are sets of sectors. An additional overhead of a sector pool cache is the maintenance of the pointers and of the subsector pool in the event of a sector miss (the sector is not in cache). In addition to the usual tag value being updated, all of the nonzero pointers for the evicted sector must be cleared, along with the subsectors they pointed to in the subsector pool. Associated with a sector miss is the case where the subsector pool is full (there is no room in the pool for the referenced subsector). In this case the replacement policy must be used to select a sector to be evicted in the event of a sector miss, or to select a sector to release a subsector in the subsector pool in the event of a sector hit (the subsector is not in cache, but its associated sector is in cache).
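To make this bookkeeping concrete, the following C++ sketch outlines the per-line and per-set data a sector pool cache must maintain, following the organization of Figure 5 (shown below). The structure names are invented for this illustration and are not CacheVisual's actual types.

    #include <vector>

    // One cache line (sector): a tag plus one pool pointer per subsector offset.
    // A pointer of 0 means the subsector is not in the pool; k > 0 means slot k
    // of the pool list for that offset (see Figure 5b).
    struct SectorLine {
        long tag = -1;                 // -1 marks an empty line
        std::vector<int> ptr;
        explicit SectorLine(int subsectorsPerSector) : ptr(subsectorsPerSector, 0) {}
    };

    // One sector set: its lines, the occupancy of each pool list, and the
    // set's FIFO replacement pointer.
    struct SectorSet {
        std::vector<SectorLine> lines;            // one entry per sector in the set
        std::vector<std::vector<bool>> poolUsed;  // poolUsed[offset][slot]: occupied?
        int fifo = 0;
        SectorSet(int sectorsPerSet, int subsectorsPerSector, int poolDepth)
            : lines(sectorsPerSet, SectorLine(subsectorsPerSector)),
              poolUsed(subsectorsPerSector, std::vector<bool>(poolDepth, false)) {}
    };

With the Figure 5b parameters (4 subsectors per sector, pool depth five), the pointer values 3, 1, 0, 5 would mean that offsets 0, 1, and 3 occupy slots of pool lists 0, 1, and 3, while the offset-2 subsector is not resident.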

[Figure 5 — Example of 8-Way Set Associative Sector Pool Cache, Cache Line, and Subsector Pool. (a) 8-way set associative cache: a 256 KB cache (2^18 bytes) with 32 bytes per subsector, 4 subsectors per sector, 8 sectors per set, and 256 sets (of pointers), plus 128 MB of memory (2^27 bytes) with 32 bytes per Subsector Block Frame (4M SBFs); each cache line holds a tag and four pool pointers, sectors map to tag values 0 through 4095, and the 27-bit program counter is divided into a 12-bit Tag/Sector field, an 8-bit Set field, a 2-bit Block/Subsector field, and a 5-bit Byte field. (b) A cache line with a tag and pointers (011, 001, 000, 101) into Pool Lists 0 through 3 of a subsector pool with a pool depth of five.]

4 Design and Implementation

The overall design of the CacheVisual project was driven by the need to provide the inputs necessary for the user to operate the tool, and to provide meaningful output to help the user understand the principles of cache design. The features of the simple cache include three cache organizations (i.e., direct-mapped, set-associative, and fully associative) with cache sizes ranging from 1 to 8 kilobytes. The output includes the number of cache accesses, the hit rate, and the miss rate. The Microsoft Visual C++ programming environment was chosen to meet these needs. The project design was divided into two parts. First, a working tool was developed for a simple cache simulation; then, building upon what was learned from developing this first tool, a sector pool cache simulation was developed.

4.1 Simple Cache

The simple cache design involves a user interface and program outputs to demonstrate the concepts behind direct mapped, fully associative, and set associative cache organizations utilizing a block of bytes for each cache line.

User Interface

The inputs required to allow the user to configure a simple cache are:

1. Cache Size in kilobytes (1024 bytes); range of 1, 2, 4 and 8
2. Cache Line Size in bytes; range of 1, 2, 4, 8, 16 and 32; also sets memory block size
3. Cache Type (organization) of direct mapped, fully associative or set associative (2-way to 128-way)
4. Replacement Policy choice of FIFO or SC-FIFO (second chance FIFO)
5. Memory Size in kilobytes (1024 bytes); range of 8, 16, 32, 64, 128, 256, 512 and 1024

Note that a change in any of the above inputs (except the Replacement Policy) will reset the Cache and Memory lists, and will clear (zero) all of the fields in the Results group box (the lists and the Results group are explained in the Program Output section below). Also, relevant message boxes are displayed if any of the above inputs are omitted. A Random PC (6) check box (a pseudo-random program counter generated with each click of the Execute button) is also available; if this box is not checked, the user-editable Memory Block # (7) will be used as the PC. The user can also enter a file name in the Memory Block # box, or click on the Browse Files (8) button (unchecks the Random PC check box) to search for a file (the file name and path are placed in the Memory Block # box). The file is used to enter a series of program counter values (traces). An unchecked Random PC box and a blank Memory Block # will generate a relevant message box. In addition to the above, the following buttons are available:

9. Reset Lists (resets all three list boxes and zeroes the Results fields)
10. Clear Blocks (clears the Memory Block # combo box)
11. Execute (executes the program using the Random PC check box, Memory Block #, or file)
12. Step (step through a file while displaying graphics); Execute button renamed
13. Cancel (cancels the Simple Cache simulation)
14. Step Cancel (cancels stepping through a file); Cancel button renamed

See Figure 6 for a screen display of all of the above numbered input fields.
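The configuration checks implied by these inputs can be summarized in a small sketch. The record below is hypothetical (CacheVisual is an MFC dialog application, not this struct), but it captures the derived quantities and the checks suggested by the error messages listed in the Program Output section.

    #include <string>

    // Illustrative configuration record with the checks the text implies.
    struct SimpleCacheConfig {
        int cacheKB    = 0;     // 1, 2, 4 or 8
        int lineBytes  = 0;     // 1, 2, 4, 8, 16 or 32
        int setSize    = 0;     // 1 = direct mapped; equal to lines = fully associative
        int memoryKB   = 0;     // 8 through 1024; no sanity check needed (minimum 8 KB)
        bool haveRepl  = false; // replacement policy selected?

        // Returns an error string, or "" if the configuration is usable.
        std::string check() const {
            if (cacheKB == 0 || lineBytes == 0 || setSize == 0 || memoryKB == 0)
                return "cache configuration incomplete";
            int cacheLines = cacheKB * 1024 / lineBytes;
            if (setSize > cacheLines)
                return "set size exceeds the number of cache lines available";
            if (!haveRepl && setSize > 1)   // direct mapped needs no policy
                return "replacement policy not selected";
            return "";
        }
    };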

[Figure 6 — Simple Cache Screen Example (user input in green, program output in pink). The screen shows the numbered input controls: Cache Size spin control (1), Cache Line Size spin control (2), Cache Type combo box (3), Replacement Policy buttons (4), Memory Size spin control (5), Random PC check box (6), Memory Block # combo box (7), Browse Files button (8), Reset Lists button (9), Clear Blocks button (10), Execute/Step button (11)(12), and Cancel/Step Cancel button (13)(14); and the output fields: Memory Block # (15), Tag (16), Set starts at line (17), Cache Data (18), Line Tag (19), Set FIFO Values, Old and New (20), Hits (21), Misses (22), Total (23), the Cache list box (24), in which C is short for cache line, T is short for tag (initialized to -1), and B is short for block (initialized to Empty), and the Memory list box (25), in which B is short for block.]

Program Flow

For each execution of the program, the user inputs are checked (unless a file is being processed in step mode) to see if any of them have changed since the previous execution. If any user input has changed, a group of sanity checks is performed on the cache configuration. If any sanity check detects a problem, a message box displays the information to the user and that execution terminates. Note that a sanity check for the memory is not required since it is set at a minimum value of eight kilobytes. After passing the sanity checks, the list boxes used for the cache and memory displays are reset, and a number of dynamically allocated arrays are reset and initialized. These arrays hold the tags, the cache lines, the valid bits, the FIFO value for each set, and the hits.

For the program to continue, a PC value or a file name must be processed. Currently, the Random PC check box takes precedence in this determination. If it is checked, a user-entered PC value or file name is ignored, and the program-generated PC is range checked (a user-entered PC is also range checked) and rejected (a message box is displayed) if it is out of range. If the user manually enters a file name in the Memory Block # combo box, or a file name is placed there by the user selecting the Browse Files button (with the Random PC box unchecked), a message box is displayed that warns the user of the required file format. If the user opts to continue, another message box is displayed (with the file size in bytes) asking whether to process the file in step mode (one datum at a time). If step mode is selected, each file datum is processed as described in the paragraphs below. If step mode is not selected, each file datum is processed with only the hits and misses being tabulated, without the graphic aspects described in the Program Output section below, and the whole file is processed until the end-of-file marker, so the user has no ability to pause or stop the processing of the file. There is no range checking when processing file data, since each file datum is reduced modulo the memory size to produce a memory block number (PC) that ranges from zero to the memory size minus one. Preliminary results with file data that exceed the memory size indicate that the hit ratio is increased due to this compression of a larger data set into a smaller one.

After computing the tag, the cache set, and the first line of the cache set, each cache line in the set that has its valid bit set is checked to see if its tag array value is equal to the calculated tag value. A hit or miss is determined by testing whether the search goes beyond the last line of the cache set. Note that for a direct-mapped cache type the set size is simply one. In processing a miss, the FIFO array entry for the set is used to determine which cache line to replace. If the user has selected the second-chance FIFO replacement policy, the set FIFO value is not used if it points to a line that has been hit; instead, a line is searched for that has not been hit. The tag, cache line, hit, FIFO, and valid bit arrays are then updated. Since we are not dealing with any actual data, each cache line array element simply holds the memory block value. The relevant cache list box element is updated with the new tag and memory block values. Depending upon whether the PC generated a hit or a miss, the appropriate counters are updated and the appropriate graphics effects (described previously) are displayed.
Figure 7 shows the pseudo-code for three of the functions that implement most of the flow described above. Note that in reality many more functions are involved in the Simple Cache simulation, but these three do most of the work in simulating a simple cache organization. The three functions are responsible for checking the cache configuration for any problems, for handling program execution when the user clicks a button, and for processing the datum to determine whether a cache hit or miss occurs and performing the appropriate actions in each case.

Check Cache Configuration
    If any change in cache configuration since last execution or the Reset Lists button pushed
        Perform sanity checks (Error message and terminate execution if a problem);
        Reset list boxes (Warning message for large list boxes);
        Reset arrays (Error message and terminate if any reset fails);
        Initialize arrays;
        Save current cache configuration;
        Clear all output fields;

Execute
    If processing file in step mode (Execute button = "Step")
        Read file datum;
        If not EOF
            Convert file datum (datum modulo memory size);
            Process Datum;
        Else (EOF)
            Close file;
            Execute button = "Execute"; Cancel button = "Cancel";
    Else (not processing file in step mode)
        If Check Cache Configuration fails terminate execution;
        If Random PC box unchecked and Memory Block # empty
            Information msg. and terminate execution;
        If Random PC box checked
            If PC datum out of range Error msg. and terminate execution;
            Process Datum;
        Else (user-entered file name or datum)
            If file name entered
                Warning msg. of file format requirements; If user cancels Terminate execution;
                Open file;
                Question msg. with file size; If user cancels Terminate execution;
                If user selected file step mode
                    Execute button = "Step"; Cancel button = "Step Cancel";
                    Return;
                Else
                    Process entire file (Process Datum loop) with no graphics, displaying progress bar;
                    Close file;
                    Display file run time; Hide progress bar;
                    Execute button = "Execute"; Cancel button = "Cancel";
            Else (assume user entered datum)
                If user datum out of range Error msg. and terminate execution;
                Process Datum;

Process Datum
    Determine if cache hit or miss;
    If cache miss
        Increment miss counter;
        Use replacement policy (unless direct mapped) to find block to replace;
        If not processing an entire file
            Update output fields;
            Update cache list box;
            Display miss graphics;
        Update arrays;
    Else (a cache hit)
        Increment hit counter;
        If not processing an entire file
            Update output fields;
            Display hit graphics;
        If SC-FIFO and not direct mapped Set hit array element;
    Update result group fields;

Figure 7 Pseudo-code of three functions for Simple Cache simulation.
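The following C++ sketch condenses the Process Datum logic of Figure 7 into a runnable form. It is a re-creation for illustration only, not the CacheVisual source; the class name and layout are invented, and the list-box and graphics updates are omitted.

    #include <vector>

    // Condensed simple-cache simulator: tag search over a set, then
    // FIFO or second-chance FIFO replacement on a miss (per Figure 7).
    class SimpleCacheSim {
    public:
        long hits = 0, misses = 0;

        SimpleCacheSim(int numLines, int setSize, bool scFifo)
            : lines(numLines), setSize(setSize), scFifo(scFifo),
              tag(numLines, -1), valid(numLines, false), hitBit(numLines, false),
              fifo(numLines / setSize, 0) {}

        void processDatum(long pc) {                 // pc = memory block number
            int sets  = lines / setSize;
            int set   = static_cast<int>(pc % sets); // Eq. 2.2 mapping
            long t    = pc / sets;                   // tag = shifted PC
            int first = set * setSize;               // first line of the set

            for (int i = first; i < first + setSize; ++i)
                if (valid[i] && tag[i] == t) {       // cache hit
                    ++hits;
                    if (scFifo) hitBit[i] = true;    // SC-FIFO: mark line as hit
                    return;
                }

            ++misses;                                // cache miss: replace a line
            int victim = first + fifo[set];
            if (scFifo)                              // skip lines that have been hit
                for (int n = 0; n < setSize && hitBit[victim]; ++n)
                    victim = first + (victim - first + 1) % setSize;
            tag[victim]   = t;
            valid[victim] = true;
            fifo[set]     = (fifo[set] + 1) % setSize;  // advance set FIFO pointer
        }

    private:
        int lines, setSize;
        bool scFifo;
        std::vector<long> tag;
        std::vector<bool> valid, hitBit;
        std::vector<int> fifo;
    };

Configured to match Snapshot 1 below (1 KB cache, 32-byte lines, 4-way, hence 32 lines), processDatum(2566) computes set 6 (the set starting at line 24) and tag 320, matching the Snapshot 2 output.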

Program Output

The displayed data consists of the following:

15. Memory Block # generated by the program
16. Tag value calculated by the program from the PC
17. Set starts at line (cache line number where the set starts; for set associative only)
18. Cache Data (Block #) stored in the cache line
19. Line Tag of the cache line
20. Set FIFO Values, Old and New
21. The number of accumulated Hits
22. The number of accumulated Misses
23. The Total number of Hits and Misses
24. Cache list box
25. Memory list box

See Figure 6 for a screen display of all of the above numbered data fields. In addition to the above, when processing a file the following are displayed:

- Original (in parentheses) and decimal values of each file datum, when in step mode
- File processing time in seconds (in the Memory Block # field), when not in step mode
- Progress bar, when not in step mode

Besides the user input and program output, the primary graphics aspect of the program design consists of list boxes that display the areas of both cache and memory that are affected. This is accomplished by enclosing the blocks (Cache list box and Memory list box) in a colored box. If the program execution results in a hit, the box is green. If the result is a miss, the box is red. For cache types involving sets, a blue box encloses the set if the entire set can be displayed in the cache list box. The set number is displayed at the bottom of each set. See Figure 6 for an example of the graphics.

During the course of executing the program, message boxes may be displayed due to questionable user input, to query the user for additional information, or to provide additional information to the user. These message boxes have four forms: error, warning, question, and informational. A list of the general descriptions for the message boxes follows:

- Error; cache configuration incomplete
- Error; replacement policy not selected (unless direct mapped cache type)
- Error; set size exceeds the number of cache lines available
- Warning; memory list box size may exceed computer resources
- Warning; cache list box size may exceed computer resources
- Error; resizing of an array failed
- Informational; the Random PC box was unchecked and the Memory Block # was blank
- Error; random PC value out of range
- Warning; file format requirements
- Question; process file in step mode (graphics displayed) or without interruption
- Error; entered PC value out of range

Program Demonstration

To help illustrate the User Interface and Program Output sections, a simple demonstration follows, using six screen snapshots. Snapshot one involves configuring a simple cache simulation, and snapshots two through six show the results of five executions of that configuration.

Snapshot 1 inputs:
Cache Size: 1
Cache Line Size: 32
Cache Type: 4-way
Memory Size: 256
Replacement Policy: FIFO
Random PC box checked

[Snapshot 1]

Snapshot 2 outputs:
Memory Block #: 2566
Tag: 320
Set starts at line: 24
Cache Data (Block #): -1
Line Tag: -1
Set FIFO Values: Old 0 New 1
Hits: 0
Misses: 1
Total: 1

[Snapshot 2]

Snapshot 3 outputs:
Memory Block #: 1222
Tag: 152
Set starts at line: 24
Cache Data (Block #): -1
Line Tag: -1
Set FIFO Values: Old 1 New 2
Hits: 0
Misses: 2
Total: 2

[Snapshot 3]

Snapshot 4 outputs:
Memory Block #: 7020
Tag: 816
Set starts at line: 16
Cache Data (Block #): -1
Line Tag: -1
Set FIFO Values: Old 0 New 1
Hits: 0
Misses: 3
Total: 3

[Snapshot 4]

Snapshot 5 outputs:
Memory Block #: 807
Tag: 100
Set starts at line: 28
Cache Data (Block #): -1
Line Tag: -1
Set FIFO Values: Old 0 New 1
Hits: 0
Misses: 4
Total: 4

[Snapshot 5]

Snapshot 6 outputs (Random PC box unchecked to force a hit):
Memory Block #: 807
Tag: 100
Set starts at line: 28
Cache Data (Block #): 807
Line Tag: 100
Set FIFO Values: Old 0 New 1
Hits: 1
Misses: 4
Total: 5

[Snapshot 6]

4.2 Sector Pool Cache

The sector pool cache design involves a user interface and program outputs to demonstrate the concepts behind fully associative and set associative cache organizations utilizing a sector of blocks (subsectors) for each cache line.

User Interface

The inputs required to allow the user to configure a sector pool cache organization are:

1. Cache Size in kilobytes (1024 bytes); range of 1, 2, 4 and 8
2. Subsector Size in bytes; range of 1, 2, 4, 8, 16 and 32
3. Subsectors Per Sector; range of 1, 2, 4, 8, 16 and 32
4. Pool Depth; range of 1 to 100, inclusive
5. Cache Type (organization) of fully associative or set associative (2-way to 128-way)
6. Replacement Policy choice of FIFO or SC-FIFO (second chance FIFO)
7. Memory Size in kilobytes (1024 bytes); range of 8, 16, 32, 64, 128, 256, 512 and 1024

Note that a change in any of the above inputs (except the Replacement Policy) will reset the Cache, Memory, and Subsector Pools lists, and will clear (zero) all of the fields in the Results group box (the lists and the Results group are explained in the Program Output section below). Also, relevant message boxes are displayed if any of the above inputs are omitted. A Random PC (8) check box (a pseudo-random program counter generated with each click of the Execute button) is also available; if this box is not checked, the user-editable Memory Block # (9) will be used as the PC. The user can also enter a file name in the Memory Block # box, or click on the Browse Files (10) button (unchecks the Random PC check box) to search for a file (the file name and path are placed in the Memory Block # box). The file is used to enter a series of program counter values (traces). An unchecked Random PC box and a blank Memory Block # will generate a relevant message box. In addition to the above, the following buttons are available:

11. Reset Lists (resets all three list boxes and zeroes the Results fields)
12. Clear Blocks (clears the Memory Block # combo box)
13. Execute (executes the program using the Random PC check box, Memory Block #, or file)
14. Step (step through a file while displaying graphics); Execute button renamed
15. Cancel (cancels the Sector Pool Cache simulation)
16. Step Cancel (cancels stepping through a file); Cancel button renamed

See Figure 8 for a screen display of all of the above numbered input fields.

[Figure 8 — Sector Pool Cache Screen Example (user input in green, program output in pink). The screen shows the numbered input controls: Cache Size spin control (1), Subsector Size spin control (2), Subsectors Per Sector spin control (3), Pool Depth spin control (4), Cache Type combo box (5), Replacement Policy buttons (6), Memory Size spin control (7), Random PC check box (8), Memory Block # combo box (9), Browse Files button (10), Reset Lists button (11), Clear Blocks button (12), Execute/Step button (13)(14), and Cancel/Step Cancel button (15)(16); and the output fields: Memory Block # (17), Sector Tag (18), Pool List # (19), Pool Occupancy % (20), Set FIFO Values, Old and New (21), Hits (22), Sector Hits (23), Misses (24), Total (25), the Cache list box (26), in which C is short for cache line, T is short for tag (initialized to -1), and P is short for pointer (multiple pointer values follow), the Memory list box (27), in which B is short for block, and the Subsector Pools list box (28), in which a zero (0) means vacant and a one (1) means occupied.]

Program Flow

The initial phase of program flow for the sector pool cache simulation is nearly identical to that of the simple cache design. The major differences involve the additional data structures needed to implement the sector pool design, such as the new user inputs, the increased complexity of the cache, and the subsector pools list, just to name a few. Due to this similarity of the two designs in the initial program flow, we will skip to the most important aspect of the program flow for the sector pool cache simulation: executing the program with the PC value.

The main aspects involved in executing the PC are to determine whether there is a cache hit or a cache miss, and then to perform the necessary operations in each event. In order to determine a hit or a miss, the tag, sector set, sector-set cache line, and subsector offset values are calculated. The cache data structure is an array of sector sets, and each set is a 2-dimensional array of pointer values. The sector set and subsector-offset values determine which set and which column in the set, respectively. Starting with the sector-set cache line (the cache line number of the first sector in the set), a while-loop is used to search a tag array (one tag value for each cache line) to find a tag match. A tag match (sector hit) in the sector set determines the row value of the set, so using the set, row, and column values the cache data structure can be checked for a nonzero pointer value; if one is found, it is a cache hit, and a cache miss otherwise.

If a cache hit occurs, the hit accumulator is updated, and the cache-hit graphics are displayed. The hit array (a hit value of zero or one for each cache line) is updated if the SC-FIFO replacement policy is being used. If a cache miss occurs, the valid bits array (a value of zero or one for each subsector in all of the subsector pools) is used to determine if the respective subsector pool list is full, similar to the approach used in [SMT00]. Note that valid bits are used in other cache designs, but not in a sector pool cache. They are used in this program for the previously stated reason, and to visualize the state of each subsector pool.

A full subsector pool list adds complexity to the replacement problem because a sector hit means that a subsector from another sector in the set must be evicted from the pool list, and if not a sector hit, a sector must be replaced that has a subsector in the pool list. In either case, the replacement policy is used to search for a sector that has a subsector in the full subsector pool list. Note that during the search the set FIFO pointer is only incremented if there is not a sector hit. If a sector hit occurs, the sector-hit accumulator is incremented, the Cache and Subsector Pools list boxes are updated, and the sector-hit graphics are displayed. If there is not a sector hit, the cache and valid bit array elements are cleared for the evicted sector and updated for the fetched sector, the hit and tag arrays are updated, and the Subsector Pools list box is updated.

A subsector pool list that is not full is easier to handle than the above full-pool-list scenario. If a sector hit occurs, the sector-hit accumulator is incremented, the cache and valid bit arrays are updated, the Cache and Subsector Pools list boxes are updated, and the sector-hit graphics are displayed.
If there is not a sector hit, the replacement policy is used to find a sector to replace, the cache and valid bit array elements are cleared for the evicted sector and updated for the fetched sector, the hit and tag arrays are updated, and the Subsector Pools list box is updated. The handling of a cache miss (not a sector hit) ends with the updating of the Cache list box, the displaying of the cache-miss graphics, and the incrementing and displaying of the new set FIFO pointer. The relevant cache list box element is updated with the new tag and subsector pool pointer values.

Figure 9 shows the pseudo-code for the function that implements most of the flow described above. Note that in reality many more functions are involved in the Sector Pool Cache simulation, but this function does most of the work in simulating a sector pool cache organization. It is responsible for processing the datum to determine whether a cache hit or miss occurs and performing the appropriate actions in each case.

Process Datum
    Determine if cache hit or miss, and if a sector hit;
    If cache miss
        Increment miss counter;
        Determine if the subsector pool is full;
        If the pool is full
            Use replacement policy to find subsector to replace;
            If sector hit
                Increment sector hit counter;
                Replace subsector;
                If not processing entire file
                    Display sector pool;
                    Update cache list box;
                Pool pointer cleanup;
                If not processing entire file
                    Update cache list box;
                    Display sector hit graphics;
            Else (not a sector hit)
                Save pool pointer of sector to be replaced;
                Replace sector;
                Clear valid bits of evicted subsectors;
                Clear pool pointers of evicted subsectors;
                Set subsector pointer of new sector;
                Update arrays;
                If not processing an entire file
                    Display subsector pool;
        Else (pool is not full)
            If sector hit
                Increment sector hit counter;
                Update arrays;
                If not processing entire file
                    Update cache list box;
                    Display subsector pool;
                    Display sector hit graphics;
            Else (not a sector hit)
                Use replacement policy to find sector to replace;
                Clear valid bits of evicted subsectors;
                Clear pool pointers of evicted subsectors;
                Set subsector pointer of new sector;
                Update arrays;
                If not processing an entire file
                    Display subsector pool;
        If not a sector hit
            If not processing an entire file
                Update cache list box;
                Display miss graphics;
            Increment set FIFO pointer;
            If not processing an entire file
                Display new FIFO value;
    Else (a cache hit)
        Increment hit counter;
        If not processing an entire file
            Display subsector pool;
            Display hit graphics;
        If SC-FIFO
            Set hit array element;
    Update result group fields;

Figure 9 Pseudo-code of one function for Sector Pool Cache simulation.
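The hit / sector-hit / miss determination at the top of Figure 9 can be sketched in C++ as follows, reusing the hypothetical SectorSet structure from section 3.5; this is an illustration of the search, not the actual CacheVisual code.

    // Classify one access against a sector set (first step of Figure 9).
    // Reports which row (sector) matched, if any, through rowOut.
    enum class Outcome { Hit, SectorHit, Miss };

    Outcome lookup(const SectorSet& set, long tag, int offset, int& rowOut) {
        for (int row = 0; row < static_cast<int>(set.lines.size()); ++row)
            if (set.lines[row].tag == tag) {          // sector is resident
                rowOut = row;
                return set.lines[row].ptr[offset] != 0
                    ? Outcome::Hit                    // subsector is in its pool list
                    : Outcome::SectorHit;             // fetch only the subsector
            }
        return Outcome::Miss;                         // sector miss: replace a sector
    }

Note that in Figure 9 a SectorHit outcome still increments the miss counter; only the separate sector-hit counter distinguishes it from a full miss.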

Program Output

The displayed data consists of the following:

17. Memory Block # generated by the program
18. Sector Tag value calculated by the program
19. Pool List # (subsector offset) calculated by the program; starts at zero
20. Pool Occupancy % calculated by the program
21. Set FIFO Values, Old and New
22. The number of accumulated Hits
23. The number of accumulated Sector Hits
24. The number of accumulated Misses
25. The Total number of Hits and Misses
26. Cache list box
27. Memory list box
28. Subsector Pools list box

See Figure 8 for a screen display of all of the above numbered data fields. In addition to the above, when processing a file the following are displayed:

- Original (in parentheses) and decimal values of each file datum, when in step mode
- File processing time in seconds (in the Memory Block # field), when not in step mode
- Progress bar, when not in step mode

Besides the user input and program output, the primary graphics aspect of the program design consists of list boxes that display the areas of cache, memory, and subsector pools that are affected. This is accomplished by enclosing the sector (Cache list box), block/subsector (Memory list box), and pool row (Subsector Pools list box) in a colored box. If the program execution results in a hit, the box is green. If the result is a miss, the box is red. For sector hits (a cache miss, but the sector is in cache) the box is violet. A blue box encloses the sector set (Cache list box) and its subsector pool (Subsector Pools list box) if each can be displayed entirely in its respective list box. The set number is displayed at the bottom of each set and pool. See Figure 8 for an example of the graphics.

During the course of executing the program, message boxes may be displayed due to questionable user input, to query the user for additional information, or to provide additional information to the user. These message boxes have four forms: error, warning, question, and informational. A list of the general descriptions for the message boxes follows:

- Error; cache configuration incomplete
- Error; pool depth cannot be less than the set size
- Error; replacement policy not selected
- Error; sector set size exceeds the number of cache lines available
- Error; sector set size exceeds the number of subsectors in a pool
- Warning; sector set size equaling the pool depth is a sector cache
- Warning; memory list box size may exceed computer resources
- Warning; cache list box size may exceed computer resources
- Error; resizing of an array failed
- Informational; the Random PC box was unchecked and the Memory Block # was blank
- Error; random PC value out of range
- Warning; file format requirements
- Question; process file in step mode (graphics displayed) or without interruption
- Error; entered PC value out of range

Program Demonstration

To help illustrate the User Interface and Program Output sections, a simple demonstration follows, using six screen snapshots. Snapshot one involves configuring a sector pool cache simulation, and snapshots two through six show the results of five executions of that configuration.

Snapshot 1 inputs:
Cache Size: 4
Subsector Size: 32
Subsectors Per Sector: 4
Pool Depth: 3
Cache Type: 4-way
Memory Size: 256
Replacement Policy: FIFO
Random PC box checked

[Snapshot 1]

Snapshot 2 outputs:
Memory Block #: 5569
Sector Tag: 174
Pool List #: 1
Pool Occupancy %: 8.33
Set FIFO Values: Old 0 New 1
Hits: 0
Sector Hits: 0
Misses: 1
Total: 1

[Snapshot 2]

Snapshot 3 outputs:
Memory Block #: 7491
Sector Tag: 234
Pool List #: 3
Pool Occupancy %: 16.67
Set FIFO Values: Old 1 New 2
Hits: 0
Sector Hits: 0
Misses: 2
Total: 2

[Snapshot 3]

Snapshot 4 outputs:
Memory Block #: 3531
Sector Tag: 110
Pool List #: 3
Pool Occupancy %: 8.33
Set FIFO Values: Old 0 New 1
Hits: 0
Sector Hits: 0
Misses: 3
Total: 3

[Snapshot 4]

Snapshot 5 outputs (Random PC box unchecked to force a hit):
Memory Block #: 3531
Sector Tag: 110
Pool List #: 3
Pool Occupancy %: 8.33
Set FIFO Values: Old 1 New 1
Hits: 1
Sector Hits: 0
Misses: 3
Total: 4

[Snapshot 5]

Snapshot 6 outputs (Random PC box unchecked to force a sector hit):
Memory Block #: 3528 (manually entered datum)
Sector Tag: 110
Pool List #: 0
Pool Occupancy %: 16.67
Set FIFO Values: Old 1 New 1
Hits: 1
Sector Hits: 1
Misses: 4
Total: 5

[Snapshot 6]

5 Verification and Testing

We verified the accuracy of the simple cache module of our CacheVisual tool by comparing its results with the outputs obtained from the SimpleScalar tool set. The first test is on the direct-mapped cache configuration. We set up SimpleScalar to model a level one (L1) cache as a direct-mapped cache with 32 bytes per line for a total of 32 cache lines, which translates into a cache size of 1 KB. The output display from SimpleScalar is shown below with the cache configuration and important outputs highlighted in bold font. The result shows that the test-fmath program generates 1,748 memory accesses, causing 803 hits and 945 misses.

    [angkul@u-35:~/ss3]$ sim-cache -cache:il1 il1:32:32:1:f tests-pisa/bin.little/test-fmath
    sim-cache: SimpleScalar/PISA Tool Set version 3.0 of August, Copyright (c) by Todd M. Austin, Ph.D.
    and SimpleScalar, LLC. All Rights Reserved.

    This version of SimpleScalar is licensed for academic non-commercial use. No portion of this work
    may be used by any commercial entity, or for any commercial purpose, without the prior written
    permission of SimpleScalar, LLC (info@simplescalar.com).

    sim: command line: sim-cache -cache:il1 il1:32:32:1:f tests-pisa/bin.little/test-fmath
    sim: simulation Tue Aug 3 13:35: , options follow:

    sim-cache: This simulator implements a functional cache simulator. Cache statistics are generated
    for a user-selected cache and TLB configuration, which may include up to two levels of instruction
    and data cache (with any levels unified), and one level of instruction and data TLBs. No timing
    information is generated.

    # -config                  # load configuration from a file
    # -dumpconfig              # dump configuration to a file
    # -h false                 # print help message
    # -v false                 # verbose operation
    # -d false                 # enable debug message
    # -i false                 # start in Dlite debugger
    -seed 1                    # random number generator seed (0 for timer seed)
    # -q false                 # initialize and terminate immediately
    # -chkpt <null>            # restore EIO trace execution from <fname>
    # -redir:sim <null>        # redirect simulator output to file (non-interactive only)
    # -redir:prog <null>       # redirect simulated program output to file
    -nice 0                    # simulator scheduling priority
    -max:inst 0                # maximum number of inst's to execute
    -cache:dl1 dl1:256:32:1:l  # l1 data cache config, i.e., {<config>|none}
    -cache:dl2 ul2:1024:64:4:l # l2 data cache config, i.e., {<config>|none}
    -cache:il1 il1:32:32:1:f   # l1 inst cache config, i.e., {<config>|dl1|dl2|none}
    -cache:il2 dl2             # l2 instruction cache config, i.e., {<config>|dl2|none}
    -tlb:itlb itlb:16:4096:4:l # instruction TLB config, i.e., {<config>|none}
    -tlb:dtlb dtlb:32:4096:4:l # data TLB config, i.e., {<config>|none}
    -flush false               # flush caches on system calls
    -cache:icompress false     # convert 64-bit inst addresses to 32-bit inst equivalents
    # -pcstat <null>           # profile stat(s) against text addr's (mult uses ok)

    The cache config parameter <config> has the following format:

      <name>:<nsets>:<bsize>:<assoc>:<repl>

      <name>  - name of the cache being defined
      <nsets> - number of sets in the cache
      <bsize> - block size of the cache

      <assoc> - associativity of the cache
      <repl>  - block replacement strategy, 'l'-lru, 'f'-fifo, 'r'-random

    Examples:   -cache:dl1 dl1:4096:32:1:l
                -dtlb dtlb:128:4096:32:r

    Cache levels can be unified by pointing a level of the instruction cache hierarchy at the data
    cache hiearchy using the "dl1" and "dl2" cache configuration arguments. Most sensible
    combinations are supported, e.g.,

    A unified l2 cache (il2 is pointed at dl2):
      -cache:il1 il1:128:64:1:l
      -cache:il2 dl2
      -cache:dl1 dl1:256:32:1:l
      -cache:dl2 ul2:1024:64:2:l

    Or, a fully unified cache hierarchy (il1 pointed at dl1):
      -cache:il1 dl1
      -cache:dl1 ul1:256:32:1:l
      -cache:dl2 ul2:1024:64:2:l

    sim: ** starting functional simulation w/ caches **

    sim: ** simulation statistics **
    TRUNCATED
    il1.accesses 1748 # total number of accesses
    il1.hits 803 # total number of hits
    il1.misses 945 # total number of misses

We ran the memory address trace collected from the SimpleScalar tool set through the simple cache module of our CacheVisual tool. The output from CacheVisual is shown in the screen dump below; it shows the total number of memory access hits, misses, and total accesses as 803, 945, and 1748, respectively, confirming the results obtained from the SimpleScalar tool set.

[CacheVisual screen dump: Simple Cache, direct mapped, showing Hits 803, Misses 945, Total 1748]

The second test is on the fully associative cache configuration. We set up SimpleScalar to model a level one (L1) cache as a fully associative cache with 32 bytes per line for a total of 32 cache lines, which translates into a cache size of 1 KB. The output display from SimpleScalar is shown below with the cache configuration and important outputs highlighted in bold font. The result shows that the test-fmath program generates 1,748 memory accesses, causing 914 hits and 834 misses.

    [angkul@u-35:~/ss3]$ sim-cache -cache:il1 il1:32:32:32:f tests-pisa/bin.little/test-fmath
    sim-cache: SimpleScalar/PISA Tool Set version 3.0 of August, Copyright (c) by Todd M. Austin, Ph.D.
    and SimpleScalar, LLC. All Rights Reserved.

    This version of SimpleScalar is licensed for academic non-commercial use. No portion of this work
    may be used by any commercial entity, or for any commercial purpose, without the prior written
    permission of SimpleScalar, LLC (info@simplescalar.com).

    sim: command line: sim-cache -cache:il1 il1:32:32:32:f tests-pisa/bin.little/test-fmath
    sim: simulation Tue Aug 3 13:57: , options follow:

    sim-cache: This simulator implements a functional cache simulator. Cache statistics are generated
    for a user-selected cache and TLB configuration, which may include up to two levels of instruction
    and data cache (with any levels unified), and one level of instruction and data TLBs. No timing
    information is generated.

    # -config                  # load configuration from a file
    # -dumpconfig              # dump configuration to a file
    # -h false                 # print help message
    # -v false                 # verbose operation
    # -d false                 # enable debug message
    # -i false                 # start in Dlite debugger
    -seed 1                    # random number generator seed (0 for timer seed)
    # -q false                 # initialize and terminate immediately
    # -chkpt <null>            # restore EIO trace execution from <fname>
    # -redir:sim <null>        # redirect simulator output to file (non-interactive only)
    # -redir:prog <null>       # redirect simulated program output to file
    -nice 0                    # simulator scheduling priority
    -max:inst 0                # maximum number of inst's to execute
    -cache:dl1 dl1:256:32:1:l  # l1 data cache config, i.e., {<config>|none}
    -cache:dl2 ul2:1024:64:4:l # l2 data cache config, i.e., {<config>|none}
    -cache:il1 il1:32:32:32:f  # l1 inst cache config, i.e., {<config>|dl1|dl2|none}
    -cache:il2 dl2             # l2 instruction cache config, i.e., {<config>|dl2|none}
    -tlb:itlb itlb:16:4096:4:l # instruction TLB config, i.e., {<config>|none}
    -tlb:dtlb dtlb:32:4096:4:l # data TLB config, i.e., {<config>|none}
    -flush false               # flush caches on system calls
    -cache:icompress false     # convert 64-bit inst addresses to 32-bit inst equivalents
    # -pcstat <null>           # profile stat(s) against text addr's (mult uses ok)

    The cache config parameter <config> has the following format:

      <name>:<nsets>:<bsize>:<assoc>:<repl>

      <name>  - name of the cache being defined
      <nsets> - number of sets in the cache
      <bsize> - block size of the cache
      <assoc> - associativity of the cache
      <repl>  - block replacement strategy, 'l'-lru, 'f'-fifo, 'r'-random

    <assoc> - associativity of the cache
    <repl>  - block replacement strategy, 'l'-LRU, 'f'-FIFO, 'r'-random

  Examples:   -cache:dl1 dl1:4096:32:1:l
              -dtlb dtlb:128:4096:32:r

Cache levels can be unified by pointing a level of the instruction cache
hierarchy at the data cache hierarchy using the "dl1" and "dl2" cache
configuration arguments. Most sensible combinations are supported, e.g.,

  A unified l2 cache (il2 is pointed at dl2):
    -cache:il1 il1:128:64:1:l
    -cache:il2 dl2
    -cache:dl1 dl1:256:32:1:l
    -cache:dl2 ul2:1024:64:2:l

  Or, a fully unified cache hierarchy (il1 pointed at dl1):
    -cache:il1 dl1
    -cache:dl1 ul1:256:32:1:l
    -cache:dl2 ul2:1024:64:2:l

sim: ** starting functional simulation w/ caches **

sim: ** simulation statistics **
[... output truncated ...]
il1.accesses    1748 # total number of accesses
il1.hits         914 # total number of hits
il1.misses       834 # total number of misses

We ran the memory address trace collected from the SimpleScalar toolset through
the simple cache module of our CacheVisual tool. The output from CacheVisual,
shown in the accompanying screen dump, reports 914 hits, 834 misses, and a
total of 1,748 accesses, confirming the results obtained from the SimpleScalar
toolset.
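The fully associative organization has no index bits: every line is a
candidate for every address, so a lookup compares the tag (the address with
the block offset removed) against all valid lines, and a miss fills the line
chosen by the replacement policy. The C sketch below is our own minimal
illustration of such a lookup, assuming the 32-line, 32-bytes-per-line cache
with FIFO replacement described in the text; it is not the implementation used
by CacheVisual or SimpleScalar.

    #include <stdio.h>
    #include <stdint.h>

    #define NLINES 32   /* total cache lines */
    #define BSIZE  32   /* bytes per line */

    static uint32_t tags[NLINES];
    static int      valid[NLINES];
    static int      fifo_next;   /* index of the next line to replace */

    /* Returns 1 on hit, 0 on miss. Fully associative: the entire address
       above the block offset is the tag; every line is checked. */
    static int access_cache(uint32_t addr)
    {
        uint32_t tag = addr / BSIZE;
        for (int i = 0; i < NLINES; i++)
            if (valid[i] && tags[i] == tag)
                return 1;                     /* hit: tag matched some line */
        tags[fifo_next] = tag;                /* miss: fill in FIFO order */
        valid[fifo_next] = 1;
        fifo_next = (fifo_next + 1) % NLINES;
        return 0;
    }

    int main(void)
    {
        /* Tiny illustrative trace, not the test-fmath trace used above. */
        uint32_t trace[] = { 0x1000, 0x1004, 0x2000, 0x1008 };
        int n = sizeof trace / sizeof trace[0], hits = 0;
        for (int i = 0; i < n; i++)
            hits += access_cache(trace[i]);
        printf("%d accesses, %d hits, %d misses\n", n, hits, n - hits);
        return 0;
    }

Note that FIFO, unlike LRU, does not reorder lines on a hit; the victim is
always the oldest fill.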

The third test is on the 4-way set associative cache configuration. We set up
SimpleScalar to model the level-one (L1) instruction cache as a 4-way set
associative cache with 32 bytes per line for a total of 32 cache lines, which
translates into a cache size of 1KB. The output display from SimpleScalar is
shown below with the cache configuration and important outputs highlighted in
bold font. The result shows that the test-fmath program generates 1,748 memory
accesses, causing 877 hits and 871 misses.

[angkul@u-35:~/ss3]$ sim-cache -cache:il1 il1:32:32:4:f tests-pisa/bin.little/test-fmath
sim-cache: SimpleScalar/PISA Tool Set version 3.0 of August,
Copyright (c) by Todd M. Austin, Ph.D. and SimpleScalar, LLC.
All Rights Reserved.

This version of SimpleScalar is licensed for academic non-commercial use. No
portion of this work may be used by any commercial entity, or for any
commercial purpose, without the prior written permission of SimpleScalar, LLC
(info@simplescalar.com).

sim: command line: sim-cache -cache:il1 il1:32:32:4:f tests-pisa/bin.little/test-fmath

sim: simulation Tue Aug 3 14:01: , options follow:

sim-cache: This simulator implements a functional cache simulator. Cache
statistics are generated for a user-selected cache and TLB configuration,
which may include up to two levels of instruction and data cache (with any
levels unified), and one level of instruction and data TLBs. No timing
information is generated.

# -config                      # load configuration from a file
# -dumpconfig                  # dump configuration to a file
# -h                false      # print help message
# -v                false      # verbose operation
# -d                false      # enable debug message
# -i                false      # start in Dlite debugger
-seed                   1      # random number generator seed (0 for timer seed)
# -q                false      # initialize and terminate immediately
# -chkpt            <null>     # restore EIO trace execution from <fname>
# -redir:sim        <null>     # redirect simulator output to file (non-interactive only)
# -redir:prog       <null>     # redirect simulated program output to file
-nice                   0      # simulator scheduling priority
-max:inst               0      # maximum number of inst's to execute
-cache:dl1   dl1:256:32:1:l    # l1 data cache config, i.e., {<config>|none}
-cache:dl2   ul2:1024:64:4:l   # l2 data cache config, i.e., {<config>|none}
-cache:il1   il1:32:32:4:f     # l1 inst cache config, i.e., {<config>|dl1|dl2|none}
-cache:il2   dl2               # l2 instruction cache config, i.e., {<config>|dl2|none}
-tlb:itlb    itlb:16:4096:4:l  # instruction TLB config, i.e., {<config>|none}
-tlb:dtlb    dtlb:32:4096:4:l  # data TLB config, i.e., {<config>|none}
-flush              false      # flush caches on system calls
-cache:icompress    false      # convert 64-bit inst addresses to 32-bit inst equivalents
# -pcstat           <null>     # profile stat(s) against text addr's (mult uses ok)

The cache config parameter <config> has the following format:

    <name>:<nsets>:<bsize>:<assoc>:<repl>

    <name>  - name of the cache being defined
    <nsets> - number of sets in the cache
    <bsize> - block size of the cache

    <assoc> - associativity of the cache
    <repl>  - block replacement strategy, 'l'-LRU, 'f'-FIFO, 'r'-random

  Examples:   -cache:dl1 dl1:4096:32:1:l
              -dtlb dtlb:128:4096:32:r

Cache levels can be unified by pointing a level of the instruction cache
hierarchy at the data cache hierarchy using the "dl1" and "dl2" cache
configuration arguments. Most sensible combinations are supported, e.g.,

  A unified l2 cache (il2 is pointed at dl2):
    -cache:il1 il1:128:64:1:l
    -cache:il2 dl2
    -cache:dl1 dl1:256:32:1:l
    -cache:dl2 ul2:1024:64:2:l

  Or, a fully unified cache hierarchy (il1 pointed at dl1):
    -cache:il1 dl1
    -cache:dl1 ul1:256:32:1:l
    -cache:dl2 ul2:1024:64:2:l

sim: ** starting functional simulation w/ caches **

sim: ** simulation statistics **
[... output truncated ...]
il1.accesses    1748 # total number of accesses
il1.hits         877 # total number of hits
il1.misses       871 # total number of misses

We ran the memory address trace collected from the SimpleScalar toolset through
the simple cache module of our CacheVisual tool. The output from CacheVisual,
shown in the accompanying screen dump, reports 877 hits, 871 misses, and a
total of 1,748 accesses, confirming the results obtained from the SimpleScalar
toolset.
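Taken together, the three runs show the expected effect of associativity on
this trace: out of the same 1,748 accesses, the direct-mapped cache hits 803
times, the 4-way set associative cache 877 times, and the fully associative
cache 914 times. The short C program below (the labels are ours) turns the
counts reported above into hit rates:

    #include <stdio.h>

    /* Hit rates for the three validation runs, using the hit/miss counts
       reported by both sim-cache and CacheVisual above. */
    int main(void)
    {
        struct { const char *org; int hits, misses; } runs[] = {
            { "direct mapped",     803, 945 },
            { "4-way set assoc.",  877, 871 },
            { "fully associative", 914, 834 },
        };
        for (int i = 0; i < 3; i++) {
            int total = runs[i].hits + runs[i].misses;
            printf("%-18s %4d/%4d hits = %.1f%%\n",
                   runs[i].org, runs[i].hits, total,
                   100.0 * runs[i].hits / total);
        }
        return 0;
    }

This prints hit rates of roughly 45.9%, 50.2%, and 52.3%, respectively.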
