
University of Nevada, Reno

CacheVisual: A Visualization Tool for Cache Simulation

A professional paper submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering

by Howard Silva

Dr. Angkul Kongmunvattana, Advisor

August, 2004

Copyright 2004 Howard Silva. All rights reserved.

Abstract

This professional paper describes CacheVisual, a cache visualization project to develop a graphics-based program that demonstrates some of the basic principles of cache design through computer simulation. The three basic (simple) cache organizations (direct mapped, fully associative, and set associative) are explained and illustrated. In addition, sector and sector pool caches are also explained and illustrated. The design and implementation of the project is covered, including detailed illustrations of program input/output and of program demonstration. Besides its inherent ability to assist teachers and students with cache design principles through a visual, interactive process, CacheVisual is also a trace-driven research tool for generating cache performance data. The overall intent of this paper is to show how CacheVisual provides an alternative to the typical simulation tools used in computer architecture research.

1 Introduction

This project is intended to satisfy a requirement for attaining a Master's degree in Computer Engineering from the University of Nevada, Reno. The specific reasons why this project was selected are explained in the remainder of this section.

The importance of computer simulations (tools) for analyzing and verifying computer architecture designs has been well established. With the advent of superscalar CPUs, pipelined CPUs, and varying cache organizations, this importance has become even greater. While these tools have provided much help for those doing research, the task of understanding the ever-increasing complexity of these designs has typically been left to normal publishing methods. This same complexity can mean that these tools require a fair amount of knowledge on the part of the user just to operate them. A need has arisen to (1) combine the proven research aspects of these types of tools with a user-friendly interactive interface, thereby simplifying their use, and (2) incorporate their use into the learning process.

The goals of this project, then, are to develop an interactive tool that allows a user to visualize cache organization and operation, while also providing for research, thus expanding the overall usefulness of these types of tools. The approach taken provides a graphics-based interactive interface that accepts various combinations of user input and immediately displays the results of that input. It is hoped that this hands-on visual approach will help the user understand the concepts involved more effectively than other methods of learning.

This project attempts to contribute to the study of computer microarchitecture design in three distinct areas: one, to expand the scope of research tools to include an easy-to-use interface and the educational aspects of the technology being studied; two, to provide for modeling of sector cache and sector pool cache designs; and three, to support basic research on cache design utilizing trace files.

2 Related Work

The SimpleScalar tool set [BA97] performs fast and accurate simulations of modern processors. The tool set can simulate superscalar processors, pipelines, branch predictors, and cache organizations, among many other features too numerous to list. The product is essentially a research tool that collects and reports an extensive array of performance data. It has very little, if any, interactive capability, and does not provide any form of graphical user interface. The product is UNIX based and requires the user to understand the many optional command-line parameters used in its execution, and therefore, by necessity, to have some overall knowledge of the technical aspects SimpleScalar covers. In addition, while the SimpleScalar source code is available for software modifications, the source code is large, comprising over four million bytes. The SimpleScalar product also includes binary utilities (recommended); a compiler, assembler, and libraries (all optional); and other precompiled binary files (optional). Currently at Version 2.0, SimpleScalar was developed over a period of years starting in the mid-1980s and spanning most of the 1990s, with updates and support provided up to the present time. The introductory PDF document for SimpleScalar is 21 pages and includes sections on installation and product description. An online overview and tutorial are also available over the Internet.

Similar to the SimpleScalar tool set, the Dinero IV Trace-Driven Uniprocessor Cache Simulator [EH] is a UNIX-based, parameter-driven product, but it reports only hit and miss data. It also does not have a graphical user interface. Besides its trace-driven operation, some features cited on its Web page are:

- subroutine-callable interface in addition to a trace-reading program
- simulation of multi-level caches
- simulation of dissimilar I and D caches
- better performance, especially for highly associative caches
- classification of compulsory, capacity, and conflict misses
- support for multiple input formats

As the above two descriptions indicate, research tools typically require a fair amount of knowledge and effort just to use them. Besides installation and compatibility issues, there may be a learning curve involved in their use, and running these tools does not involve much ease of use; that is, they are not what we would normally describe as user friendly. They also do not typically address the educational aspects of the basic principles of the technology they study. The CacheVisual project is designed to address these issues.

3 Cache Organization

One of the essential uses of a computer is to complete a task in the least amount of time. To reach this goal, a storage-medium hierarchy evolved during the growth of computer technology. This hierarchy came about due to the inherent design factors and cost factors in building faster and larger storage media. A typical storage hierarchy consists of the CPU registers, cache, memory, the I/O devices, and the interconnections between each level. The purpose of a cache is to provide a high-speed buffer close to the CPU in order to reduce the delay when accessing memory. A cache can be thought of as having physical areas of storage called lines, which hold a block, or blocks, of bytes; each line has the means (a tag) of identifying where its blocks reside in memory.
The basic process for a cache access is searching for the requested memory reference (program counter or memory address) in cache and accessing the item if found; if it is not found, the item is fetched from memory and placed in cache. A cache access in which the requested item (data or instruction) is found in cache is a cache hit, while an access in which the item is not found is a cache miss. The rates of each (hit rate and miss rate, respectively) are found by dividing the number of hits or misses by the total number of cache accesses. Cache misses carry the additional penalty of having to fetch the missing item from slower memory. In general, smaller caches result in faster access but higher miss rates; as expected, larger caches result in slower access but lower miss rates. The three aspects of cache design that this project is concerned with are block placement (mapping of memory block addresses to cache lines), block identification (tag comparison), and block replacement policy in the event of a miss [HP03]. Block placement schemes fall into three basic types or organizations: direct mapped, fully associative,

and set associative. In addition, two other placement schemes, sector cache and sector pool cache, involve variations of these three prior types. As one might expect, these schemes involve performance tradeoffs when compared to each other, and also involve tradeoffs in how the internals of each organization are implemented.

3.1 Direct Mapped Cache

Direct-mapped caches result in a memory block being placed in only one location (a 1-to-1 mapping of one block to one cache line) in cache. Figure 1 depicts an example of a direct-mapped cache organization with 256 kilobytes of total cache (32 bytes per cache line yields 8K cache lines), 128 megabytes of total memory (32 bytes per block yields 4M block frames), and a breakdown of the program counter (PC) showing the number of bits used for the tag, for the block/line, and for the byte. This type of cache is the easiest to implement in hardware, as no cache search is involved and a block replacement strategy is not required. As might be expected, these factors also result in faster cache access, a smaller cache tag, and a lower cost. The tradeoffs are a higher miss rate due to the constrained cache placement, and lower utilization of cache space. The cache miss rate can be higher if many addressed blocks map to the same cache location. An additional bit (not shown) is used for each block to indicate if it is valid or invalid (empty). The valid bit is also utilized in the set-associative and fully-associative cache organizations described below. This placement scheme (mapping) typically uses a modulo method to determine the cache line value,

(block address) MOD (number of blocks in cache).    (2.1)

This mapping method results in a cache line value that ranges from zero to the number of blocks in cache minus one. As shown in Figure 1, using Eq. 2.1 a PC value of 16382 maps to cache line 8190 (16382 MOD 8192), where 16382 is the block address and 8192 is the number of blocks in cache (2^18 bytes divided by 2^5 bytes per block). As noted at the beginning of section three, a block in cache is identified by its tag value (location in memory). To determine the tag value, the PC (16382) is first partitioned as follows (note that here the PC is block addressable, so the byte offset is omitted):

PC (block address) = tag (9 bits) | cache line (13 bits)

Then the PC value is shifted to the right 13 bits (divided by 8192), yielding the nine-bit tag integer value. Thus a PC value of 16382 yields a tag value of one (16382/8192).

[Figure 1 — Example of Direct Mapped Cache Organization: a 256 KB cache (2^18 bytes) with 32 bytes per cache line (2^13 or 8K cache lines, C0 through C8191) and 128 MB of memory (2^27 bytes) with 32 bytes per block frame (2^22 or 4M block frames); blocks B0, B8192, B16384, ... map to cache tag 0 (tag values range from 0 to 511), and the 27-bit program counter is divided into a 9-bit Tag, a 13-bit Block/Line field, and a 5-bit Byte field.]

3.2 Fully Associative Cache

Fully-associative caches are essentially the opposite of direct-mapped caches in that a fully associative mapping results in a memory block being placed anywhere (a 1-to-any mapping of one block to any cache line) in cache. Figure 2 depicts an example of a fully associative cache organization with 256 kilobytes of total cache (32 bytes per cache line yields 8K cache lines), 128 megabytes of total memory (32 bytes per block yields 4M block frames), and a breakdown of the program counter showing the number of bits used for the tag and for the byte. This type of cache results in the highest utilization of cache space and a higher hit rate. The tradeoffs are that it is the hardest placement scheme to implement in hardware, since every cache tag must be searched, which results in slower cache access, the largest cache tag, and a higher implementation cost. Because of this higher cost, fully-associative caches have typically been limited to smaller-size applications, such as microprocessors.

Again, a block in cache is identified by its tag value (location in memory). To determine the tag value, the PC (16382) is first partitioned as follows (note that here the PC is block addressable, so the byte offset is omitted):

PC (block address) = tag (22 bits)

As the above partitioning of the PC indicates, the tag value is the PC value, so a PC value of 16382 would have a tag value of 16382. As noted at the beginning of section three, there are three basic replacement policies: random, least recently used (LRU), and first-in first-out (FIFO). A random replacement policy attempts to produce a uniform cache distribution. For a fully-associative cache, the random replacement policy would result in any cache line being replaced in the event of a miss. The LRU replacement policy records accesses to cache blocks, so the block that was accessed furthest in the past will be replaced. The idea here is to take advantage of the principle of locality: blocks recently accessed, or blocks close in memory to recently accessed blocks, tend to be accessed more frequently. For a fully-associative cache, the LRU replacement policy would result in all of the cache lines being checked for the least recently used line. Besides the additional hardware (some sort of counter for each cache line) involved in implementing an LRU policy, additional time would also be required to check the counters on all of the cache lines. The FIFO replacement policy simply treats each cache line as a member of a queue, replacing the oldest member of the queue. Similar to the previous two methods, the FIFO replacement policy for a fully-associative cache involves finding the oldest member among all of the cache lines.
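To make the counter-based LRU search concrete, the fragment below is a minimal C++ sketch written for this discussion (it is not part of CacheVisual, which offers only FIFO and second-chance FIFO): each cache line carries a last-used counter, and the victim is the line whose counter is smallest.

    #include <vector>
    #include <cstddef>

    // Find the LRU victim by scanning one last-used counter per cache line.
    // lastUsed[i] holds the access time of line i (larger = more recent).
    size_t lruVictim(const std::vector<unsigned long>& lastUsed) {
        size_t victim = 0;
        for (size_t i = 1; i < lastUsed.size(); ++i)   // every line is checked,
            if (lastUsed[i] < lastUsed[victim])        // which is the time cost
                victim = i;                            // noted in the text
        return victim;
    }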

[Figure 2 — Example of Fully Associative Cache Organization: a 256 KB cache (2^18 bytes) with 32 bytes per cache line (2^13 or 8K cache lines) and 128 MB of memory (2^27 bytes) with 32 bytes per block frame (2^22 or 4M block frames); any block may be placed in any cache line, and the 27-bit program counter is divided into a 22-bit Tag and a 5-bit Byte field.]

3.3 Set Associative Cache

As might be expected, a middle ground exists between direct-mapped and fully-associative cache organizations. This middle ground results in a memory block being mapped anywhere in a set (a 1-to-set mapping of one block to one cache set) of cache lines. The set is made up of n blocks, and the placement is called n-way set associative. Figure 3 depicts an example of an 8-way set associative cache organization with 256 kilobytes of total cache (32 bytes per cache line yields 8K cache lines and 1K sets), 128 megabytes of total memory (32 bytes per block yields 4M block frames), and a breakdown of the program counter showing the number of bits used for the tag, for the block/set, and for the byte. The search and block replacement schemes are simpler than fully associative due to the smaller size of a set. The other advantages and tradeoffs of this cache organization fall between those of direct-mapped and fully-associative cache organizations. Internally, given a fixed cache size, tradeoffs exist between the set size and the number of sets, which can lead to a higher hit rate.

This placement scheme (mapping) typically uses a modulo method to determine the cache set value,

(block address) MOD (number of sets in cache).    (2.2)

This mapping method results in a cache set value that ranges from zero to the number of sets in cache minus one. As shown in Figure 3, using Eq. 2.2 a PC value of 16382 maps to cache set 1022 (16382 MOD 1024), where 16382 is the block address and 1024 is the number of sets in cache (2^18 bytes divided by 2^5 bytes per block divided by 2^3 cache lines per set). Again, a block in cache is identified by its tag value (location in memory). To determine the tag value, the PC (16382) is first partitioned as follows (note that here the PC is block addressable, so the byte offset is omitted):

PC (block address) = tag (12 bits) | cache set (10 bits)

Then the PC value is shifted to the right 10 bits (divided by 1024), yielding the twelve-bit tag integer value. Thus a PC value of 16382 yields a tag value of fifteen (16382/1024).

For a set-associative cache, the random replacement policy would result in any cache line in the set being replaced in the event of a miss. The LRU replacement policy would result in all of the cache lines in the set being checked for the least recently used line. Similar to the previous two policies, the FIFO replacement policy for a set-associative cache involves finding the oldest member of the set.

As a note, set-associative caches can be generalized to include direct-mapped and fully-associative caches. Direct-mapped caches can be thought of as having a cache set size of one cache line (1-way set associative), so the number of sets equals the number of cache lines. Fully-associative caches can be thought of as having a cache set size equaling the number of cache lines (c-way set associative, where c is the number of cache lines), so there is only one set. While each replacement policy offers advantages over the others, the FIFO policy and a close relative, second-chance FIFO (blocks that have been hit are not replaced), were selected as options for this project. As a note, the FIFO replacement policy is considered to closely approximate the LRU replacement policy.
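This generalization can be checked with a few lines of C++. The sketch below was written for this paper's worked examples (it is not taken from CacheVisual): it maps block address 16382 into an 8192-line cache at three associativities, where 1-way reproduces the direct-mapped result of section 3.1, 8-way the set-associative result above, and 8192-way the fully-associative result of section 3.2.

    #include <cstdio>

    // n-way set-associative mapping; direct mapped is 1-way and
    // fully associative is cacheLines-way, as described in the text.
    void mapBlock(long blockAddr, long cacheLines, long assoc) {
        long sets = cacheLines / assoc;
        long set  = blockAddr % sets;   // Eq. 2.1 / Eq. 2.2
        long tag  = blockAddr / sets;   // same as shifting right log2(sets) bits
        printf("%4ld-way: set %ld, tag %ld\n", assoc, set, tag);
    }

    int main() {
        mapBlock(16382, 8192, 1);     // direct mapped: set (line) 8190, tag 1
        mapBlock(16382, 8192, 8);     // 8-way: set 1022, tag 15
        mapBlock(16382, 8192, 8192);  // fully associative: set 0, tag 16382
        return 0;
    }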

[Figure 3 — Example of 8-Way Set Associative Cache Organization: a 256 KB cache (2^18 bytes) with 32 bytes per cache line (2^13 or 8K cache lines) and 8 cache lines per set (2^10 sets, Set 0 through Set 1023), plus 128 MB of memory (2^27 bytes) with 32 bytes per block frame (2^22 or 4M block frames); blocks B0, B1024, ... map to cache tag 0 (tag values range from 0 to 4095), and the 27-bit program counter is divided into a 12-bit Tag, a 10-bit Block/Set field, and a 5-bit Byte field.]

3.4 Sector Cache

Given the finite resources (physical size and transistors) of a CPU and cache, the amount of resources used to store tag values is an essential consideration of resource utilization. As computer memories expanded, the tag values associated with a block of data became larger. While increasing the block size helps to alleviate this problem, larger block sizes require higher bandwidth for data transfer and lead to higher miss rates due to poor utilization of cache storage. To deal with this problem, sectors (groups of blocks/subsectors) of memory became a fundamental unit of storage. This allows more data to be associated with each cache tag. To help alleviate the impact of the higher bandwidths and higher miss rates associated with larger units of storage, only the referenced subsector is placed into its sector area (offset) in cache. On a cache miss, only the missing subsector has to be fetched if the relevant sector is already in cache. Even if a sector has to be evicted due to a cache miss, only the referenced subsector is placed in cache along with the new tag value for the new sector. Two additional bits (not shown) are used for each subsector to indicate if it is valid/invalid (empty) and clean/dirty (modified while in cache). Figure 4 depicts an example of a direct-mapped sector cache organization with 256 kilobytes of total cache (32 bytes per subsector and 4 subsectors per sector yields 2K cache lines), 128 megabytes of total memory (32 bytes per subsector yields 4M subsector block frames), and a breakdown of the program counter showing the number of bits used for the tag, for the sector, for the block/subsector, and for the byte.
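The Figure 4 address breakdown can be expressed as shifts and masks. The following C++ sketch is illustrative only (the struct and function names are invented here, not taken from CacheVisual); it splits a 27-bit byte address into the four fields, using block 16382 from the earlier examples as a byte address.

    #include <cstdint>
    #include <cstdio>

    // Split a 27-bit byte address per Figure 4: 9-bit tag, 11-bit sector,
    // 2-bit block/subsector, 5-bit byte. Names are illustrative.
    struct SectorFields { unsigned tag, sector, subsector, byte; };

    SectorFields split(uint32_t addr) {
        SectorFields f;
        f.byte      = addr & 0x1F;           // low 5 bits: byte within subsector
        f.subsector = (addr >> 5) & 0x3;     // next 2 bits: subsector within sector
        f.sector    = (addr >> 7) & 0x7FF;   // next 11 bits: cache line (sector)
        f.tag       = addr >> 18;            // top 9 bits: tag
        return f;
    }

    int main() {
        SectorFields f = split(16382u << 5);  // block 16382 as a byte address
        // prints: tag=1 sector=2047 subsector=2 byte=0
        printf("tag=%u sector=%u subsector=%u byte=%u\n",
               f.tag, f.sector, f.subsector, f.byte);
        return 0;
    }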

[Figure 4 — Example of Direct Mapped Sector Cache Organization: a 256 KB cache (2^18 bytes) with 32 bytes per subsector and 4 subsectors per sector (2^11 or 2K cache lines, each holding a tag and four subsectors) and 128 MB of memory (2^27 bytes) with 32 bytes per Subsector Block Frame (4M SBFs); sectors map to tag values 0 through 511, and the 27-bit program counter is divided into a 9-bit Tag, an 11-bit Sector field, a 2-bit Block/Subsector field, and a 5-bit Byte field.]

3.5 Sector Pool Cache

While sector caches do prove useful in some cases, a major problem can occur with their use: a sector is typically evicted before all of its subsectors are used, resulting in an inefficient use of cache storage. To deal with this problem, Rothman and Smith [RS99] proposed that a pool of subsectors be shared among a set of sectors. Figure 5a depicts an example of an 8-way set associative sector pool cache organization with 256 kilobytes of total cache (32 bytes per subsector, 4 subsectors per sector, and 8 sectors per set yields 256 sets), 128 megabytes of total memory (32 bytes per subsector block frame yields 4M subsector block frames), and a breakdown of the program counter showing the number of bits used for the tag/sector, for the set, for the block/subsector, and for the byte. The total number of subsectors in the pool is less than that of a sector cache with the same set size and number of subsectors per sector. Associated with a pool of subsectors are the concepts of a pool depth, the number of subsectors reserved for each subsector offset in the set, and of a pool list, one for each subsector offset, which contains the actual subsectors. As a subsector is fetched into cache, its offset position in the cache line (sector) will hold a pointer value indicating its position in the pool depth for its relevant pool list. A pointer value of zero indicates the subsector is not in the pool. Figure 5b depicts a cache line with a tag field and four pointer values (3, 1,

0, 5) pointing to a pool of subsectors with a pool depth of five. An additional bit (not shown) is used for each subsector to indicate if it is clean or dirty (modified while in cache).

All of the aspects described above in sections 3.2 and 3.3 (such as advantages and tradeoffs, mapping and identification, and replacement policies) for fully associative and set associative caches apply to sector pool caches, except that the basic storage unit is a sector, not a block. So, instead of a block, a sector is mapped to cache, a sector has a tag, a sector is replaced, and there are sets of sectors. An additional overhead of a sector pool cache is the maintenance of the pointers and of the subsector pool in the event of a sector miss (the sector is not in cache). In addition to the usual tag value being updated, all of the nonzero pointers for the evicted sector must be cleared, along with the subsectors they pointed to in the subsector pool. Associated with a sector miss is the case where the subsector pool is full (there is no room in the pool for the referenced subsector). In this case the replacement policy must be used to select a sector to be evicted in the event of a sector miss, or to select a sector to release a subsector in the subsector pool in the event of a sector hit (the subsector is not in cache, but its associated sector is in cache).
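To make this bookkeeping concrete, the following C++ sketch outlines the per-line and per-set data a sector pool cache must maintain, following the organization of Figure 5 (shown below). The structure names are invented for this illustration and are not CacheVisual's actual types.

    #include <vector>

    // One cache line (sector): a tag plus one pool pointer per subsector offset.
    // A pointer of 0 means the subsector is not in the pool; k > 0 means slot k
    // of the pool list for that offset (see Figure 5b).
    struct SectorLine {
        long tag = -1;                 // -1 marks an empty line
        std::vector<int> ptr;
        explicit SectorLine(int subsectorsPerSector) : ptr(subsectorsPerSector, 0) {}
    };

    // One sector set: its lines, the occupancy of each pool list, and the
    // set's FIFO replacement pointer.
    struct SectorSet {
        std::vector<SectorLine> lines;            // one entry per sector in the set
        std::vector<std::vector<bool>> poolUsed;  // poolUsed[offset][slot]: occupied?
        int fifo = 0;
        SectorSet(int sectorsPerSet, int subsectorsPerSector, int poolDepth)
            : lines(sectorsPerSet, SectorLine(subsectorsPerSector)),
              poolUsed(subsectorsPerSector, std::vector<bool>(poolDepth, false)) {}
    };

With the Figure 5b parameters (4 subsectors per sector, pool depth five), the pointer values 3, 1, 0, 5 would mean that offsets 0, 1, and 3 occupy slots of pool lists 0, 1, and 3, while the offset-2 subsector is not resident.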

[Figure 5 — Example of 8-Way Set Associative Sector Pool Cache, Cache Line, and Subsector Pool. (a) 8-way set associative cache: a 256 KB cache (2^18 bytes) with 32 bytes per subsector, 4 subsectors per sector, 8 sectors per set, and 256 sets (of pointers), plus 128 MB of memory (2^27 bytes) with 32 bytes per Subsector Block Frame (4M SBFs); each cache line holds a tag and four pool pointers, sectors map to tag values 0 through 4095, and the 27-bit program counter is divided into a 12-bit Tag/Sector field, an 8-bit Set field, a 2-bit Block/Subsector field, and a 5-bit Byte field. (b) A cache line with a tag and pointers (011, 001, 000, 101) into Pool Lists 0 through 3 of a subsector pool with a pool depth of five.]

4 Design and Implementation

The overall design of the CacheVisual project was driven by the need to provide the inputs necessary for the user to operate the tool, and to provide meaningful output to help the user understand the principles of cache design. The features of the simple cache include three cache organizations (i.e., direct-mapped, set-associative, and fully associative) with cache sizes ranging from 1 to 8 kilobytes. The output includes the number of cache accesses, the hit rate, and the miss rate. The Microsoft Visual C++ programming environment was chosen to meet these needs. The project design was divided into two parts. First, a working tool was developed for a simple cache simulation; then, building upon what was learned from developing this first tool, a sector pool cache simulation was developed.

4.1 Simple Cache

The simple cache design involves a user interface and program outputs to demonstrate the concepts behind direct mapped, fully associative, and set associative cache organizations utilizing a block of bytes for each cache line.

User Interface

The inputs required to allow the user to configure a simple cache are:

1. Cache Size in kilobytes (1024 bytes); range of 1, 2, 4 and 8
2. Cache Line Size in bytes; range of 1, 2, 4, 8, 16 and 32; also sets memory block size
3. Cache Type (organization) of direct mapped, fully associative or set associative (2-way to 128-way)
4. Replacement Policy choice of FIFO or SC-FIFO (second chance FIFO)
5. Memory Size in kilobytes (1024 bytes); range of 8, 16, 32, 64, 128, 256, 512 and 1024

Note that a change in any of the above inputs (except the Replacement Policy) will reset the Cache and Memory lists, and will clear (zero) all of the fields in the Results group box (the lists and the Results group are explained in the Program Output section below). Also, relevant message boxes are displayed if any of the above inputs are omitted. A Random PC (6) check box (a pseudo-random program counter generated with each click of the Execute button) is also available; if this box is not checked, the user-editable Memory Block # (7) will be used as the PC. The user can also enter a file name in the Memory Block # box, or click on the Browse Files (8) button (unchecks the Random PC check box) to search for a file (the file name and path are placed in the Memory Block # box). The file is used to enter a series of program counter values (traces). An unchecked Random PC box and a blank Memory Block # will generate a relevant message box. In addition to the above, the following buttons are available:

9. Reset Lists (resets all three list boxes and zeroes the Results fields)
10. Clear Blocks (clears the Memory Block # combo box)
11. Execute (executes the program using the Random PC check box, Memory Block #, or file)
12. Step (step through a file while displaying graphics); Execute button renamed
13. Cancel (cancels the Simple Cache simulation)
14. Step Cancel (cancels stepping through a file); Cancel button renamed

See Figure 6 for a screen display of all of the above numbered input fields.
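The configuration checks implied by these inputs can be summarized in a small sketch. The record below is hypothetical (CacheVisual is an MFC dialog application, not this struct), but it captures the derived quantities and the checks suggested by the error messages listed in the Program Output section.

    #include <string>

    // Illustrative configuration record with the checks the text implies.
    struct SimpleCacheConfig {
        int cacheKB    = 0;     // 1, 2, 4 or 8
        int lineBytes  = 0;     // 1, 2, 4, 8, 16 or 32
        int setSize    = 0;     // 1 = direct mapped; equal to lines = fully associative
        int memoryKB   = 0;     // 8 through 1024; no sanity check needed (minimum 8 KB)
        bool haveRepl  = false; // replacement policy selected?

        // Returns an error string, or "" if the configuration is usable.
        std::string check() const {
            if (cacheKB == 0 || lineBytes == 0 || setSize == 0 || memoryKB == 0)
                return "cache configuration incomplete";
            int cacheLines = cacheKB * 1024 / lineBytes;
            if (setSize > cacheLines)
                return "set size exceeds the number of cache lines available";
            if (!haveRepl && setSize > 1)   // direct mapped needs no policy
                return "replacement policy not selected";
            return "";
        }
    };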

[Figure 6 — Simple Cache Screen Example (user input in green, program output in pink). The screen shows the numbered input controls: Cache Size spin control (1), Cache Line Size spin control (2), Cache Type combo box (3), Replacement Policy buttons (4), Memory Size spin control (5), Random PC check box (6), Memory Block # combo box (7), Browse Files button (8), Reset Lists button (9), Clear Blocks button (10), Execute/Step button (11)(12), and Cancel/Step Cancel button (13)(14); and the output fields: Memory Block # (15), Tag (16), Set starts at line (17), Cache Data (18), Line Tag (19), Set FIFO Values, Old and New (20), Hits (21), Misses (22), Total (23), the Cache list box (24), in which C is short for cache line, T is short for tag (initialized to -1), and B is short for block (initialized to Empty), and the Memory list box (25), in which B is short for block.]

Program Flow

For each execution of the program, the user inputs are checked (unless a file is being processed in step mode) to see if any of them have changed since the previous execution. If any user input has changed, a group of sanity checks is performed on the cache configuration. If any sanity check detects a problem, a message box displays the information to the user and that execution terminates. Note that a sanity check for the memory is not required since it is set at a minimum value of eight kilobytes. After passing the sanity checks, the list boxes used for the cache and memory displays are reset, and a number of dynamically allocated arrays are reset and initialized. These arrays hold the tags, the cache lines, the valid bits, the FIFO value for each set, and the hits.

For the program to continue, a PC value or a file name must be processed. Currently, the Random PC check box takes precedence in this determination. If it is checked, a user-entered PC value or file name is ignored, and the program-generated PC is range checked (a user-entered PC is also range checked) and rejected (a message box is displayed) if it is out of range. If the user manually enters a file name in the Memory Block # combo box, or a file name is placed there by the user selecting the Browse Files button (with the Random PC box unchecked), a message box is displayed that warns the user of the required file format. If the user opts to continue, another message box is displayed (with the file size in bytes) asking whether to process the file in step mode (one datum at a time). If step mode is selected, each file datum is processed as described in the paragraphs below. If step mode is not selected, each file datum is processed with only the hits and misses being tabulated, without the graphic aspects described in the Program Output section below, and the whole file is processed until the end-of-file marker, so the user has no ability to pause or stop the processing of the file. There is no range checking when processing file data, since each file datum is reduced modulo the memory size to produce a memory block number (PC) that ranges from zero to the memory size minus one. Preliminary results with file data that exceed the memory size indicate that the hit ratio is increased due to this compression of a larger data set into a smaller one.

After computing the tag, the cache set, and the first line of the cache set, each cache line in the set that has its valid bit set is checked to see if its tag array value is equal to the calculated tag value. A hit or miss is determined by testing whether the search goes beyond the last line of the cache set. Note that for a direct-mapped cache type the set size is simply one. In processing a miss, the FIFO array entry for the set is used to determine which cache line to replace. If the user has selected the second-chance FIFO replacement policy, the set FIFO value is not used if it points to a line that has been hit; instead, a line is searched for that has not been hit. The tag, cache line, hit, FIFO, and valid bit arrays are then updated. Since we are not dealing with any actual data, each cache line array element simply holds the memory block value. The relevant cache list box element is updated with the new tag and memory block values. Depending upon whether the PC generated a hit or a miss, the appropriate counters are updated and the appropriate graphics effects (described previously) are displayed.
Figure 7 shows the pseudo-code for three of the functions that implement most of the flow described above. Note that in reality many more functions are involved in the Simple Cache simulation, but these three do most of the work in simulating a simple cache organization. The three functions are responsible for checking the cache configuration for any problems, for handling program execution when the user clicks a button, and for processing the datum to determine whether a cache hit or miss occurs and performing the appropriate actions in each case.

Check Cache Configuration
    If any change in cache configuration since last execution or the Reset Lists button pushed
        Perform sanity checks (Error message and terminate execution if a problem);
        Reset list boxes (Warning message for large list boxes);
        Reset arrays (Error message and terminate if any reset fails);
        Initialize arrays;
        Save current cache configuration;
        Clear all output fields;

Execute
    If processing file in step mode (Execute button = "Step")
        Read file datum;
        If not EOF
            Convert file datum (datum modulo memory size);
            Process Datum;
        Else (EOF)
            Close file;
            Execute button = "Execute"; Cancel button = "Cancel";
    Else (not processing file in step mode)
        If Check Cache Configuration fails terminate execution;
        If Random PC box unchecked and Memory Block # empty
            Information msg. and terminate execution;
        If Random PC box checked
            If PC datum out of range Error msg. and terminate execution;
            Process Datum;
        Else (user-entered file name or datum)
            If file name entered
                Warning msg. of file format requirements; If user cancels Terminate execution;
                Open file;
                Question msg. with file size; If user cancels Terminate execution;
                If user selected file step mode
                    Execute button = "Step"; Cancel button = "Step Cancel";
                    Return;
                Else
                    Process entire file (Process Datum loop) with no graphics, displaying progress bar;
                    Close file;
                    Display file run time; Hide progress bar;
                    Execute button = "Execute"; Cancel button = "Cancel";
            Else (assume user entered datum)
                If user datum out of range Error msg. and terminate execution;
                Process Datum;

Process Datum
    Determine if cache hit or miss;
    If cache miss
        Increment miss counter;
        Use replacement policy (unless direct mapped) to find block to replace;
        If not processing an entire file
            Update output fields;
            Update cache list box;
            Display miss graphics;
        Update arrays;
    Else (a cache hit)
        Increment hit counter;
        If not processing an entire file
            Update output fields;
            Display hit graphics;
        If SC-FIFO and not direct mapped Set hit array element;
    Update result group fields;

Figure 7 Pseudo-code of three functions for Simple Cache simulation.
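The following C++ sketch condenses the Process Datum logic of Figure 7 into a runnable form. It is a re-creation for illustration only, not the CacheVisual source; the class name and layout are invented, and the list-box and graphics updates are omitted.

    #include <vector>

    // Condensed simple-cache simulator: tag search over a set, then
    // FIFO or second-chance FIFO replacement on a miss (per Figure 7).
    class SimpleCacheSim {
    public:
        long hits = 0, misses = 0;

        SimpleCacheSim(int numLines, int setSize, bool scFifo)
            : lines(numLines), setSize(setSize), scFifo(scFifo),
              tag(numLines, -1), valid(numLines, false), hitBit(numLines, false),
              fifo(numLines / setSize, 0) {}

        void processDatum(long pc) {                 // pc = memory block number
            int sets  = lines / setSize;
            int set   = static_cast<int>(pc % sets); // Eq. 2.2 mapping
            long t    = pc / sets;                   // tag = shifted PC
            int first = set * setSize;               // first line of the set

            for (int i = first; i < first + setSize; ++i)
                if (valid[i] && tag[i] == t) {       // cache hit
                    ++hits;
                    if (scFifo) hitBit[i] = true;    // SC-FIFO: mark line as hit
                    return;
                }

            ++misses;                                // cache miss: replace a line
            int victim = first + fifo[set];
            if (scFifo)                              // skip lines that have been hit
                for (int n = 0; n < setSize && hitBit[victim]; ++n)
                    victim = first + (victim - first + 1) % setSize;
            tag[victim]   = t;
            valid[victim] = true;
            fifo[set]     = (fifo[set] + 1) % setSize;  // advance set FIFO pointer
        }

    private:
        int lines, setSize;
        bool scFifo;
        std::vector<long> tag;
        std::vector<bool> valid, hitBit;
        std::vector<int> fifo;
    };

Configured to match Snapshot 1 below (1 KB cache, 32-byte lines, 4-way, hence 32 lines), processDatum(2566) computes set 6 (the set starting at line 24) and tag 320, matching the Snapshot 2 output.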

Program Output

The displayed data consists of the following:

15. Memory Block # generated by the program
16. Tag value calculated by the program from the PC
17. Set starts at line (cache line number where the set starts; for set associative only)
18. Cache Data (Block #) stored in the cache line
19. Line Tag of the cache line
20. Set FIFO Values, Old and New
21. The number of accumulated Hits
22. The number of accumulated Misses
23. The Total number of Hits and Misses
24. Cache list box
25. Memory list box

See Figure 6 for a screen display of all of the above numbered data fields. In addition to the above, when processing a file the following are displayed:

- Original (in parentheses) and decimal values of each file datum, when in step mode
- File processing time in seconds (in the Memory Block # field), when not in step mode
- Progress bar, when not in step mode

Besides the user input and program output, the primary graphics aspect of the program design consists of list boxes that display the areas of both cache and memory that are affected. This is accomplished by enclosing the blocks (Cache list box and Memory list box) in a colored box. If the program execution results in a hit, the box is green. If the result is a miss, the box is red. For cache types involving sets, a blue box encloses the set if the entire set can be displayed in the cache list box. The set number is displayed at the bottom of each set. See Figure 6 for an example of the graphics.

During the course of executing the program, message boxes may be displayed due to questionable user input, to query the user for additional information, or to provide additional information to the user. These message boxes have four forms: error, warning, question, and informational. A list of the general descriptions for the message boxes follows:

- Error; cache configuration incomplete
- Error; replacement policy not selected (unless direct mapped cache type)
- Error; set size exceeds the number of cache lines available
- Warning; memory list box size may exceed computer resources
- Warning; cache list box size may exceed computer resources
- Error; resizing of an array failed
- Informational; the Random PC box was unchecked and the Memory Block # was blank
- Error; random PC value out of range
- Warning; file format requirements
- Question; process file in step mode (graphics displayed) or without interruption
- Error; entered PC value out of range

Program Demonstration

To help illustrate the User Interface and Program Output sections, a simple demonstration follows, using six screen snapshots. Snapshot one involves configuring a simple cache simulation, and snapshots two through six show the results of five executions of that configuration.

Snapshot 1 inputs:
Cache Size: 1
Cache Line Size: 32
Cache Type: 4-way
Memory Size: 256
Replacement Policy: FIFO
Random PC box checked

[Snapshot 1]

Snapshot 2 outputs:
Memory Block #: 2566
Tag: 320
Set starts at line: 24
Cache Data (Block #): -1
Line Tag: -1
Set FIFO Values: Old 0 New 1
Hits: 0
Misses: 1
Total: 1

[Snapshot 2]

Snapshot 3 outputs:
Memory Block #: 1222
Tag: 152
Set starts at line: 24
Cache Data (Block #): -1
Line Tag: -1
Set FIFO Values: Old 1 New 2
Hits: 0
Misses: 2
Total: 2

[Snapshot 3]

Snapshot 4 outputs:
Memory Block #: 7020
Tag: 816
Set starts at line: 16
Cache Data (Block #): -1
Line Tag: -1
Set FIFO Values: Old 0 New 1
Hits: 0
Misses: 3
Total: 3

[Snapshot 4]

Snapshot 5 outputs:
Memory Block #: 807
Tag: 100
Set starts at line: 28
Cache Data (Block #): -1
Line Tag: -1
Set FIFO Values: Old 0 New 1
Hits: 0
Misses: 4
Total: 4

[Snapshot 5]

Snapshot 6 outputs (Random PC box unchecked to force a hit):
Memory Block #: 807
Tag: 100
Set starts at line: 28
Cache Data (Block #): 807
Line Tag: 100
Set FIFO Values: Old 0 New 1
Hits: 1
Misses: 4
Total: 5

[Snapshot 6]

4.2 Sector Pool Cache

The sector pool cache design involves a user interface and program outputs to demonstrate the concepts behind fully associative and set associative cache organizations utilizing a sector of blocks (subsectors) for each cache line.

User Interface

The inputs required to allow the user to configure a sector pool cache organization are:

1. Cache Size in kilobytes (1024 bytes); range of 1, 2, 4 and 8
2. Subsector Size in bytes; range of 1, 2, 4, 8, 16 and 32
3. Subsectors Per Sector; range of 1, 2, 4, 8, 16 and 32
4. Pool Depth; range of 1 to 100, inclusive
5. Cache Type (organization) of fully associative or set associative (2-way to 128-way)
6. Replacement Policy choice of FIFO or SC-FIFO (second chance FIFO)
7. Memory Size in kilobytes (1024 bytes); range of 8, 16, 32, 64, 128, 256, 512 and 1024

Note that a change in any of the above inputs (except the Replacement Policy) will reset the Cache, Memory, and Subsector Pools lists, and will clear (zero) all of the fields in the Results group box (the lists and the Results group are explained in the Program Output section below). Also, relevant message boxes are displayed if any of the above inputs are omitted. A Random PC (8) check box (a pseudo-random program counter generated with each click of the Execute button) is also available; if this box is not checked, the user-editable Memory Block # (9) will be used as the PC. The user can also enter a file name in the Memory Block # box, or click on the Browse Files (10) button (unchecks the Random PC check box) to search for a file (the file name and path are placed in the Memory Block # box). The file is used to enter a series of program counter values (traces). An unchecked Random PC box and a blank Memory Block # will generate a relevant message box. In addition to the above, the following buttons are available:

11. Reset Lists (resets all three list boxes and zeroes the Results fields)
12. Clear Blocks (clears the Memory Block # combo box)
13. Execute (executes the program using the Random PC check box, Memory Block #, or file)
14. Step (step through a file while displaying graphics); Execute button renamed
15. Cancel (cancels the Sector Pool Cache simulation)
16. Step Cancel (cancels stepping through a file); Cancel button renamed

See Figure 8 for a screen display of all of the above numbered input fields.

[Figure 8 — Sector Pool Cache Screen Example (user input in green, program output in pink). The screen shows the numbered input controls: Cache Size spin control (1), Subsector Size spin control (2), Subsectors Per Sector spin control (3), Pool Depth spin control (4), Cache Type combo box (5), Replacement Policy buttons (6), Memory Size spin control (7), Random PC check box (8), Memory Block # combo box (9), Browse Files button (10), Reset Lists button (11), Clear Blocks button (12), Execute/Step button (13)(14), and Cancel/Step Cancel button (15)(16); and the output fields: Memory Block # (17), Sector Tag (18), Pool List # (19), Pool Occupancy % (20), Set FIFO Values, Old and New (21), Hits (22), Sector Hits (23), Misses (24), Total (25), the Cache list box (26), in which C is short for cache line, T is short for tag (initialized to -1), and P is short for pointer (multiple pointer values follow), the Memory list box (27), in which B is short for block, and the Subsector Pools list box (28), in which a zero (0) means vacant and a one (1) means occupied.]

Program Flow

The initial phase of program flow for the sector pool cache simulation is nearly identical to that of the simple cache design. The major differences involve the additional data structures needed to implement the sector pool design, such as the new user inputs, the increased complexity of the cache, and the subsector pools list, just to name a few. Due to this similarity of the two designs in the initial program flow, we will skip to the most important aspect of the program flow for the sector pool cache simulation: executing the program with the PC value.

The main aspects involved in executing the PC are to determine whether there is a cache hit or a cache miss, and then to perform the necessary operations in each event. In order to determine a hit or a miss, the tag, sector set, sector-set cache line, and subsector offset values are calculated. The cache data structure is an array of sector sets, and each set is a 2-dimensional array of pointer values. The sector set and subsector-offset values determine which set and which column in the set, respectively. Starting with the sector-set cache line (the cache line number of the first sector in the set), a while-loop is used to search a tag array (one tag value for each cache line) to find a tag match. A tag match (sector hit) in the sector set determines the row value of the set, so using the set, row, and column values the cache data structure can be checked for a nonzero pointer value; if one is found, it is a cache hit, and a cache miss otherwise.

If a cache hit occurs, the hit accumulator is updated, and the cache-hit graphics are displayed. The hit array (a hit value of zero or one for each cache line) is updated if the SC-FIFO replacement policy is being used. If a cache miss occurs, the valid bits array (a value of zero or one for each subsector in all of the subsector pools) is used to determine if the respective subsector pool list is full, similar to the approach used in [SMT00]. Note that valid bits are used in other cache designs, but not in a sector pool cache. They are used in this program for the previously stated reason, and to visualize the state of each subsector pool.

A full subsector pool list adds complexity to the replacement problem because a sector hit means that a subsector from another sector in the set must be evicted from the pool list, and if not a sector hit, a sector must be replaced that has a subsector in the pool list. In either case, the replacement policy is used to search for a sector that has a subsector in the full subsector pool list. Note that during the search the set FIFO pointer is only incremented if there is not a sector hit. If a sector hit occurs, the sector-hit accumulator is incremented, the Cache and Subsector Pools list boxes are updated, and the sector-hit graphics are displayed. If there is not a sector hit, the cache and valid bit array elements are cleared for the evicted sector and updated for the fetched sector, the hit and tag arrays are updated, and the Subsector Pools list box is updated.

A subsector pool list that is not full is easier to handle than the above full-pool-list scenario. If a sector hit occurs, the sector-hit accumulator is incremented, the cache and valid bit arrays are updated, the Cache and Subsector Pools list boxes are updated, and the sector-hit graphics are displayed.
If there is not a sector hit, the replacement policy is used to find a sector to replace, the cache and valid bit array elements are cleared for the evicted sector and updated for the fetched sector, the hit and tag arrays are updated, and the Subsector Pools list box is updated. The handling of a cache miss (not a sector hit) ends with the updating of the Cache list box, the displaying of the cache-miss graphics, and the incrementing and displaying of the new set FIFO pointer. The relevant cache list box element is updated with the new tag and subsector pool pointer values.

Figure 9 shows the pseudo-code for the function that implements most of the flow described above. Note that in reality many more functions are involved in the Sector Pool Cache simulation, but this function does most of the work in simulating a sector pool cache organization. It is responsible for processing the datum to determine whether a cache hit or miss occurs and performing the appropriate actions in each case.

Process Datum
    Determine if cache hit or miss, and if a sector hit;
    If cache miss
        Increment miss counter;
        Determine if the subsector pool is full;
        If the pool is full
            Use replacement policy to find subsector to replace;
            If sector hit
                Increment sector hit counter;
                Replace subsector;
                If not processing entire file
                    Display sector pool;
                    Update cache list box;
                Pool pointer cleanup;
                If not processing entire file
                    Update cache list box;
                    Display sector hit graphics;
            Else (not a sector hit)
                Save pool pointer of sector to be replaced;
                Replace sector;
                Clear valid bits of evicted subsectors;
                Clear pool pointers of evicted subsectors;
                Set subsector pointer of new sector;
                Update arrays;
                If not processing an entire file
                    Display subsector pool;
        Else (pool is not full)
            If sector hit
                Increment sector hit counter;
                Update arrays;
                If not processing entire file
                    Update cache list box;
                    Display subsector pool;
                    Display sector hit graphics;
            Else (not a sector hit)
                Use replacement policy to find sector to replace;
                Clear valid bits of evicted subsectors;
                Clear pool pointers of evicted subsectors;
                Set subsector pointer of new sector;
                Update arrays;
                If not processing an entire file
                    Display subsector pool;
        If not a sector hit
            If not processing an entire file
                Update cache list box;
                Display miss graphics;
            Increment set FIFO pointer;
            If not processing an entire file
                Display new FIFO value;
    Else (a cache hit)
        Increment hit counter;
        If not processing an entire file
            Display subsector pool;
            Display hit graphics;
        If SC-FIFO
            Set hit array element;
    Update result group fields;

Figure 9 Pseudo-code of one function for Sector Pool Cache simulation.
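The hit / sector-hit / miss determination at the top of Figure 9 can be sketched in C++ as follows, reusing the hypothetical SectorSet structure from section 3.5; this is an illustration of the search, not the actual CacheVisual code.

    // Classify one access against a sector set (first step of Figure 9).
    // Reports which row (sector) matched, if any, through rowOut.
    enum class Outcome { Hit, SectorHit, Miss };

    Outcome lookup(const SectorSet& set, long tag, int offset, int& rowOut) {
        for (int row = 0; row < static_cast<int>(set.lines.size()); ++row)
            if (set.lines[row].tag == tag) {          // sector is resident
                rowOut = row;
                return set.lines[row].ptr[offset] != 0
                    ? Outcome::Hit                    // subsector is in its pool list
                    : Outcome::SectorHit;             // fetch only the subsector
            }
        return Outcome::Miss;                         // sector miss: replace a sector
    }

Note that in Figure 9 a SectorHit outcome still increments the miss counter; only the separate sector-hit counter distinguishes it from a full miss.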

Program Output

The displayed data consists of the following:

17. Memory Block # generated by the program
18. Sector Tag value calculated by the program
19. Pool List # (subsector offset) calculated by the program; starts at zero
20. Pool Occupancy % calculated by the program
21. Set FIFO Values, Old and New
22. The number of accumulated Hits
23. The number of accumulated Sector Hits
24. The number of accumulated Misses
25. The Total number of Hits and Misses
26. Cache list box
27. Memory list box
28. Subsector Pools list box

See Figure 8 for a screen display of all of the above numbered data fields. In addition to the above, when processing a file the following are displayed:

- Original (in parentheses) and decimal values of each file datum, when in step mode
- File processing time in seconds (in the Memory Block # field), when not in step mode
- Progress bar, when not in step mode

Besides the user input and program output, the primary graphics aspect of the program design consists of list boxes that display the areas of cache, memory, and subsector pools that are affected. This is accomplished by enclosing the sector (Cache list box), block/subsector (Memory list box), and pool row (Subsector Pools list box) in a colored box. If the program execution results in a hit, the box is green. If the result is a miss, the box is red. For sector hits (a cache miss, but the sector is in cache) the box is violet. A blue box encloses the sector set (Cache list box) and its subsector pool (Subsector Pools list box) if each can be displayed entirely in its respective list box. The set number is displayed at the bottom of each set and pool. See Figure 8 for an example of the graphics.

During the course of executing the program, message boxes may be displayed due to questionable user input, to query the user for additional information, or to provide additional information to the user. These message boxes have four forms: error, warning, question, and informational. A list of the general descriptions for the message boxes follows:

- Error; cache configuration incomplete
- Error; pool depth cannot be less than the set size
- Error; replacement policy not selected
- Error; sector set size exceeds the number of cache lines available
- Error; sector set size exceeds the number of subsectors in a pool
- Warning; sector set size equaling the pool depth is a sector cache
- Warning; memory list box size may exceed computer resources
- Warning; cache list box size may exceed computer resources
- Error; resizing of an array failed
- Informational; the Random PC box was unchecked and the Memory Block # was blank
- Error; random PC value out of range
- Warning; file format requirements
- Question; process file in step mode (graphics displayed) or without interruption
- Error; entered PC value out of range

Program Demonstration

To help illustrate the User Interface and Program Output sections, a simple demonstration follows, using six screen snapshots. Snapshot one involves configuring a sector pool cache simulation, and snapshots two through six show the results of five executions of that configuration.

Snapshot 1 inputs:
Cache Size: 4
Subsector Size: 32
Subsectors Per Sector: 4
Pool Depth: 3
Cache Type: 4-way
Memory Size: 256
Replacement Policy: FIFO
Random PC box checked

[Snapshot 1]

Snapshot 2 outputs:
Memory Block #: 5569
Sector Tag: 174
Pool List #: 1
Pool Occupancy %: 8.33
Set FIFO Values: Old 0 New 1
Hits: 0
Sector Hits: 0
Misses: 1
Total: 1

[Snapshot 2]

Snapshot 3 outputs:
Memory Block #: 7491
Sector Tag: 234
Pool List #: 3
Pool Occupancy %: 16.67
Set FIFO Values: Old 1 New 2
Hits: 0
Sector Hits: 0
Misses: 2
Total: 2

[Snapshot 3]

Snapshot 4 outputs:
Memory Block #: 3531
Sector Tag: 110
Pool List #: 3
Pool Occupancy %: 8.33
Set FIFO Values: Old 0 New 1
Hits: 0
Sector Hits: 0
Misses: 3
Total: 3

[Snapshot 4]

Snapshot 5 outputs (Random PC box unchecked to force a hit):
Memory Block #: 3531
Sector Tag: 110
Pool List #: 3
Pool Occupancy %: 8.33
Set FIFO Values: Old 1 New 1
Hits: 1
Sector Hits: 0
Misses: 3
Total: 4

[Snapshot 5]

Snapshot 6 outputs (Random PC box unchecked to force a sector hit):
Memory Block #: 3528 (manually entered datum)
Sector Tag: 110
Pool List #: 0
Pool Occupancy %: 16.67
Set FIFO Values: Old 1 New 1
Hits: 1
Sector Hits: 1
Misses: 4
Total: 5

[Snapshot 6]

5 Verification and Testing

We verified the accuracy of the simple cache module of our CacheVisual tool by comparing its results with the outputs obtained from the SimpleScalar tool set. The first test is on the direct-mapped cache configuration. We set up SimpleScalar to model a level one (L1) cache as a direct-mapped cache with 32 bytes per line for a total of 32 cache lines, which translates into a cache size of 1 KB. The output display from SimpleScalar is shown below with the cache configuration and important outputs highlighted in bold font. The result shows that the test-fmath program generates 1,748 memory accesses, causing 803 hits and 945 misses.

    [angkul@u-35:~/ss3]$ sim-cache -cache:il1 il1:32:32:1:f tests-pisa/bin.little/test-fmath
    sim-cache: SimpleScalar/PISA Tool Set version 3.0 of August, Copyright (c) by Todd M. Austin, Ph.D.
    and SimpleScalar, LLC. All Rights Reserved.

    This version of SimpleScalar is licensed for academic non-commercial use. No portion of this work
    may be used by any commercial entity, or for any commercial purpose, without the prior written
    permission of SimpleScalar, LLC (info@simplescalar.com).

    sim: command line: sim-cache -cache:il1 il1:32:32:1:f tests-pisa/bin.little/test-fmath
    sim: simulation Tue Aug 3 13:35: , options follow:

    sim-cache: This simulator implements a functional cache simulator. Cache statistics are generated
    for a user-selected cache and TLB configuration, which may include up to two levels of instruction
    and data cache (with any levels unified), and one level of instruction and data TLBs. No timing
    information is generated.

    # -config                  # load configuration from a file
    # -dumpconfig              # dump configuration to a file
    # -h false                 # print help message
    # -v false                 # verbose operation
    # -d false                 # enable debug message
    # -i false                 # start in Dlite debugger
    -seed 1                    # random number generator seed (0 for timer seed)
    # -q false                 # initialize and terminate immediately
    # -chkpt <null>            # restore EIO trace execution from <fname>
    # -redir:sim <null>        # redirect simulator output to file (non-interactive only)
    # -redir:prog <null>       # redirect simulated program output to file
    -nice 0                    # simulator scheduling priority
    -max:inst 0                # maximum number of inst's to execute
    -cache:dl1 dl1:256:32:1:l  # l1 data cache config, i.e., {<config>|none}
    -cache:dl2 ul2:1024:64:4:l # l2 data cache config, i.e., {<config>|none}
    -cache:il1 il1:32:32:1:f   # l1 inst cache config, i.e., {<config>|dl1|dl2|none}
    -cache:il2 dl2             # l2 instruction cache config, i.e., {<config>|dl2|none}
    -tlb:itlb itlb:16:4096:4:l # instruction TLB config, i.e., {<config>|none}
    -tlb:dtlb dtlb:32:4096:4:l # data TLB config, i.e., {<config>|none}
    -flush false               # flush caches on system calls
    -cache:icompress false     # convert 64-bit inst addresses to 32-bit inst equivalents
    # -pcstat <null>           # profile stat(s) against text addr's (mult uses ok)

    The cache config parameter <config> has the following format:

      <name>:<nsets>:<bsize>:<assoc>:<repl>

      <name>  - name of the cache being defined
      <nsets> - number of sets in the cache
      <bsize> - block size of the cache

      <assoc> - associativity of the cache
      <repl>  - block replacement strategy, 'l'-lru, 'f'-fifo, 'r'-random

    Examples:   -cache:dl1 dl1:4096:32:1:l
                -dtlb dtlb:128:4096:32:r

    Cache levels can be unified by pointing a level of the instruction cache hierarchy at the data
    cache hiearchy using the "dl1" and "dl2" cache configuration arguments. Most sensible
    combinations are supported, e.g.,

    A unified l2 cache (il2 is pointed at dl2):
      -cache:il1 il1:128:64:1:l
      -cache:il2 dl2
      -cache:dl1 dl1:256:32:1:l
      -cache:dl2 ul2:1024:64:2:l

    Or, a fully unified cache hierarchy (il1 pointed at dl1):
      -cache:il1 dl1
      -cache:dl1 ul1:256:32:1:l
      -cache:dl2 ul2:1024:64:2:l

    sim: ** starting functional simulation w/ caches **

    sim: ** simulation statistics **
    TRUNCATED
    il1.accesses 1748 # total number of accesses
    il1.hits 803 # total number of hits
    il1.misses 945 # total number of misses

We ran the memory address trace collected from the SimpleScalar tool set through the simple cache module of our CacheVisual tool. The output from CacheVisual is shown in the screen dump below; it shows the total number of memory access hits, misses, and total accesses as 803, 945, and 1748, respectively, confirming the results obtained from the SimpleScalar tool set.

[CacheVisual screen dump: Simple Cache, direct mapped, showing Hits 803, Misses 945, Total 1748]

The second test is on the fully associative cache configuration. We set up SimpleScalar to model a level one (L1) cache as a fully associative cache with 32 bytes per line for a total of 32 cache lines, which translates into a cache size of 1 KB. The output display from SimpleScalar is shown below with the cache configuration and important outputs highlighted in bold font. The result shows that the test-fmath program generates 1,748 memory accesses, causing 914 hits and 834 misses.

    [angkul@u-35:~/ss3]$ sim-cache -cache:il1 il1:32:32:32:f tests-pisa/bin.little/test-fmath
    sim-cache: SimpleScalar/PISA Tool Set version 3.0 of August, Copyright (c) by Todd M. Austin, Ph.D.
    and SimpleScalar, LLC. All Rights Reserved.

    This version of SimpleScalar is licensed for academic non-commercial use. No portion of this work
    may be used by any commercial entity, or for any commercial purpose, without the prior written
    permission of SimpleScalar, LLC (info@simplescalar.com).

    sim: command line: sim-cache -cache:il1 il1:32:32:32:f tests-pisa/bin.little/test-fmath
    sim: simulation Tue Aug 3 13:57: , options follow:

    sim-cache: This simulator implements a functional cache simulator. Cache statistics are generated
    for a user-selected cache and TLB configuration, which may include up to two levels of instruction
    and data cache (with any levels unified), and one level of instruction and data TLBs. No timing
    information is generated.

    # -config                  # load configuration from a file
    # -dumpconfig              # dump configuration to a file
    # -h false                 # print help message
    # -v false                 # verbose operation
    # -d false                 # enable debug message
    # -i false                 # start in Dlite debugger
    -seed 1                    # random number generator seed (0 for timer seed)
    # -q false                 # initialize and terminate immediately
    # -chkpt <null>            # restore EIO trace execution from <fname>
    # -redir:sim <null>        # redirect simulator output to file (non-interactive only)
    # -redir:prog <null>       # redirect simulated program output to file
    -nice 0                    # simulator scheduling priority
    -max:inst 0                # maximum number of inst's to execute
    -cache:dl1 dl1:256:32:1:l  # l1 data cache config, i.e., {<config>|none}
    -cache:dl2 ul2:1024:64:4:l # l2 data cache config, i.e., {<config>|none}
    -cache:il1 il1:32:32:32:f  # l1 inst cache config, i.e., {<config>|dl1|dl2|none}
    -cache:il2 dl2             # l2 instruction cache config, i.e., {<config>|dl2|none}
    -tlb:itlb itlb:16:4096:4:l # instruction TLB config, i.e., {<config>|none}
    -tlb:dtlb dtlb:32:4096:4:l # data TLB config, i.e., {<config>|none}
    -flush false               # flush caches on system calls
    -cache:icompress false     # convert 64-bit inst addresses to 32-bit inst equivalents
    # -pcstat <null>           # profile stat(s) against text addr's (mult uses ok)

    The cache config parameter <config> has the following format:

      <name>:<nsets>:<bsize>:<assoc>:<repl>

      <name>  - name of the cache being defined
      <nsets> - number of sets in the cache
      <bsize> - block size of the cache
      <assoc> - associativity of the cache
      <repl>  - block replacement strategy, 'l'-lru, 'f'-fifo, 'r'-random

    <assoc> - associativity of the cache
    <repl>  - block replacement strategy, 'l'-LRU, 'f'-FIFO, 'r'-random

  Examples:   -cache:dl1 dl1:4096:32:1:l
              -dtlb dtlb:128:4096:32:r

Cache levels can be unified by pointing a level of the instruction cache
hierarchy at the data cache hierarchy using the "dl1" and "dl2" cache
configuration arguments. Most sensible combinations are supported, e.g.,

  A unified l2 cache (il2 is pointed at dl2):
    -cache:il1 il1:128:64:1:l
    -cache:il2 dl2
    -cache:dl1 dl1:256:32:1:l
    -cache:dl2 ul2:1024:64:2:l

  Or, a fully unified cache hierarchy (il1 pointed at dl1):
    -cache:il1 dl1
    -cache:dl1 ul1:256:32:1:l
    -cache:dl2 ul2:1024:64:2:l

sim: ** starting functional simulation w/ caches **

sim: ** simulation statistics **
[... output truncated ...]
il1.accesses    1748 # total number of accesses
il1.hits         914 # total number of hits
il1.misses       834 # total number of misses

We ran the memory address trace collected from the SimpleScalar toolset through
the simple cache module of our CacheVisual tool. The output from CacheVisual,
shown in the accompanying screen dump, reports 914 hits, 834 misses, and a
total of 1,748 accesses, confirming the results obtained from the SimpleScalar
toolset.
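The fully associative organization has no index bits: every line is a
candidate for every address, so a lookup compares the tag (the address with
the block offset removed) against all valid lines, and a miss fills the line
chosen by the replacement policy. The C sketch below is our own minimal
illustration of such a lookup, assuming the 32-line, 32-bytes-per-line cache
with FIFO replacement described in the text; it is not the implementation used
by CacheVisual or SimpleScalar.

    #include <stdio.h>
    #include <stdint.h>

    #define NLINES 32   /* total cache lines */
    #define BSIZE  32   /* bytes per line */

    static uint32_t tags[NLINES];
    static int      valid[NLINES];
    static int      fifo_next;   /* index of the next line to replace */

    /* Returns 1 on hit, 0 on miss. Fully associative: the entire address
       above the block offset is the tag; every line is checked. */
    static int access_cache(uint32_t addr)
    {
        uint32_t tag = addr / BSIZE;
        for (int i = 0; i < NLINES; i++)
            if (valid[i] && tags[i] == tag)
                return 1;                     /* hit: tag matched some line */
        tags[fifo_next] = tag;                /* miss: fill in FIFO order */
        valid[fifo_next] = 1;
        fifo_next = (fifo_next + 1) % NLINES;
        return 0;
    }

    int main(void)
    {
        /* Tiny illustrative trace, not the test-fmath trace used above. */
        uint32_t trace[] = { 0x1000, 0x1004, 0x2000, 0x1008 };
        int n = sizeof trace / sizeof trace[0], hits = 0;
        for (int i = 0; i < n; i++)
            hits += access_cache(trace[i]);
        printf("%d accesses, %d hits, %d misses\n", n, hits, n - hits);
        return 0;
    }

Note that FIFO, unlike LRU, does not reorder lines on a hit; the victim is
always the oldest fill.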

The third test is on the 4-way set associative cache configuration. We set up
SimpleScalar to model the level-one (L1) instruction cache as a 4-way set
associative cache with 32 bytes per line for a total of 32 cache lines, which
translates into a cache size of 1KB. The output display from SimpleScalar is
shown below with the cache configuration and important outputs highlighted in
bold font. The result shows that the test-fmath program generates 1,748 memory
accesses, causing 877 hits and 871 misses.

[angkul@u-35:~/ss3]$ sim-cache -cache:il1 il1:32:32:4:f tests-pisa/bin.little/test-fmath
sim-cache: SimpleScalar/PISA Tool Set version 3.0 of August,
Copyright (c) by Todd M. Austin, Ph.D. and SimpleScalar, LLC.
All Rights Reserved.

This version of SimpleScalar is licensed for academic non-commercial use. No
portion of this work may be used by any commercial entity, or for any
commercial purpose, without the prior written permission of SimpleScalar, LLC
(info@simplescalar.com).

sim: command line: sim-cache -cache:il1 il1:32:32:4:f tests-pisa/bin.little/test-fmath

sim: simulation Tue Aug 3 14:01: , options follow:

sim-cache: This simulator implements a functional cache simulator. Cache
statistics are generated for a user-selected cache and TLB configuration,
which may include up to two levels of instruction and data cache (with any
levels unified), and one level of instruction and data TLBs. No timing
information is generated.

# -config                      # load configuration from a file
# -dumpconfig                  # dump configuration to a file
# -h                false      # print help message
# -v                false      # verbose operation
# -d                false      # enable debug message
# -i                false      # start in Dlite debugger
-seed                   1      # random number generator seed (0 for timer seed)
# -q                false      # initialize and terminate immediately
# -chkpt            <null>     # restore EIO trace execution from <fname>
# -redir:sim        <null>     # redirect simulator output to file (non-interactive only)
# -redir:prog       <null>     # redirect simulated program output to file
-nice                   0      # simulator scheduling priority
-max:inst               0      # maximum number of inst's to execute
-cache:dl1   dl1:256:32:1:l    # l1 data cache config, i.e., {<config>|none}
-cache:dl2   ul2:1024:64:4:l   # l2 data cache config, i.e., {<config>|none}
-cache:il1   il1:32:32:4:f     # l1 inst cache config, i.e., {<config>|dl1|dl2|none}
-cache:il2   dl2               # l2 instruction cache config, i.e., {<config>|dl2|none}
-tlb:itlb    itlb:16:4096:4:l  # instruction TLB config, i.e., {<config>|none}
-tlb:dtlb    dtlb:32:4096:4:l  # data TLB config, i.e., {<config>|none}
-flush              false      # flush caches on system calls
-cache:icompress    false      # convert 64-bit inst addresses to 32-bit inst equivalents
# -pcstat           <null>     # profile stat(s) against text addr's (mult uses ok)

The cache config parameter <config> has the following format:

    <name>:<nsets>:<bsize>:<assoc>:<repl>

    <name>  - name of the cache being defined
    <nsets> - number of sets in the cache
    <bsize> - block size of the cache

    <assoc> - associativity of the cache
    <repl>  - block replacement strategy, 'l'-LRU, 'f'-FIFO, 'r'-random

  Examples:   -cache:dl1 dl1:4096:32:1:l
              -dtlb dtlb:128:4096:32:r

Cache levels can be unified by pointing a level of the instruction cache
hierarchy at the data cache hierarchy using the "dl1" and "dl2" cache
configuration arguments. Most sensible combinations are supported, e.g.,

  A unified l2 cache (il2 is pointed at dl2):
    -cache:il1 il1:128:64:1:l
    -cache:il2 dl2
    -cache:dl1 dl1:256:32:1:l
    -cache:dl2 ul2:1024:64:2:l

  Or, a fully unified cache hierarchy (il1 pointed at dl1):
    -cache:il1 dl1
    -cache:dl1 ul1:256:32:1:l
    -cache:dl2 ul2:1024:64:2:l

sim: ** starting functional simulation w/ caches **

sim: ** simulation statistics **
[... output truncated ...]
il1.accesses    1748 # total number of accesses
il1.hits         877 # total number of hits
il1.misses       871 # total number of misses

We ran the memory address trace collected from the SimpleScalar toolset through
the simple cache module of our CacheVisual tool. The output from CacheVisual,
shown in the accompanying screen dump, reports 877 hits, 871 misses, and a
total of 1,748 accesses, confirming the results obtained from the SimpleScalar
toolset.
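Taken together, the three runs show the expected effect of associativity on
this trace: out of the same 1,748 accesses, the direct-mapped cache hits 803
times, the 4-way set associative cache 877 times, and the fully associative
cache 914 times. The short C program below (the labels are ours) turns the
counts reported above into hit rates:

    #include <stdio.h>

    /* Hit rates for the three validation runs, using the hit/miss counts
       reported by both sim-cache and CacheVisual above. */
    int main(void)
    {
        struct { const char *org; int hits, misses; } runs[] = {
            { "direct mapped",     803, 945 },
            { "4-way set assoc.",  877, 871 },
            { "fully associative", 914, 834 },
        };
        for (int i = 0; i < 3; i++) {
            int total = runs[i].hits + runs[i].misses;
            printf("%-18s %4d/%4d hits = %.1f%%\n",
                   runs[i].org, runs[i].hits, total,
                   100.0 * runs[i].hits / total);
        }
        return 0;
    }

This prints hit rates of roughly 45.9%, 50.2%, and 52.3%, respectively.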
