
1 Textbook: Burdea and Coiffet, Virtual Reality Technology, 2nd Edition, Wiley, 2003. Textbook web site:

2 Textbook web site: Laboratory Hardware

3 Topics: 14:332:331 - The Memory Hierarchy

4 A Typical Memory Hierarchy
Architectures need to create the illusion of unlimited and fast memory. To do so, we recall that applications do not access all of their program or data memory at once; rather, programs access a relatively small portion of their address space at any moment in time.
- Temporal locality: a referenced item in memory tends to be referenced again soon (loops).
- Spatial locality: if an item is referenced, items adjacent to it tend to be referenced soon (arrays).
These principles allow memory to be organized as a hierarchy of multiple levels with different access speeds and sizes. (A small code illustration of both kinds of locality follows this slide.)
On-chip components: Control, Datapath, RegFile, ITLB, DTLB, instruction cache, data cache; then a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk).
Speed: 0.5 ns / 1 ns / 5 ns / 50 ns / 5,000,000 ns. Size: 128 B / 64 KB / 256 KB / 4 GB / TBs. Cost/GB: highest ($10,000) down through $100 and $1 to lowest.
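Both kinds of locality show up in even the simplest code. A minimal C sketch (the array name and size are arbitrary illustrations, not from the slides):

```c
#include <stdio.h>

#define N 1024

int main(void) {
    int a[N];
    int sum = 0;

    /* Spatial locality: a[i] and a[i+1] sit in adjacent words,
       so they usually land in the same cache block. */
    for (int i = 0; i < N; i++)
        a[i] = i;

    /* Temporal locality: the loop variables i and sum (and the
       loop's own instructions) are re-referenced every iteration. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %d\n", sum);
    return 0;
}
```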

5 A Typical Memory Hierarchy
Since main memory is much slower, to avoid pipeline stalls the data and instructions that the CPU will need soon should be moved into the cache(s).
Even though memory consists of multiple levels, data is copied only between two adjacent levels at a time. Within each level, the unit of information that is either present or not is called a block. Blocks can be single-word (32 bits wide) or multiple-word.
How is the Hierarchy Managed?
- When the data that the CPU needs is present in the cache, it is a hit. When the needed data is not present, it is a miss, and misses have penalties: on a miss the pipeline is frozen until the data is fetched from main memory, which hurts performance.
- Data must always be present in the lowest level of the hierarchy.
- Temporal locality: keep the most recently accessed data items closer to the processor.
- Spatial locality: move blocks consisting of contiguous words to the upper levels.

6 How do we apply the principle of locality?
Data moves between an upper level of memory (closer to the processor) and a lower level one block at a time (Blk X in the upper level, Blk Y in the lower level). Data cannot be present in the upper level if it is not present in the lower level.
- Hit: the data needed by the CPU appears in some block in the upper level (Blk X).
- Hit Rate: the fraction of accesses found in the upper level.
- Hit Time: time to access the upper level = RAM access time + time to determine whether the access is a hit or a miss.
- Miss: the data must be retrieved from a block in the lower level (Blk Y).
- Miss Rate = 1 - Hit Rate.
- Miss Penalty: time to replace a block in the upper level with a block from the lower level, plus the time to deliver that block to the processor. Hit Time << Miss Penalty.
In general:
Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty
Recall that CPU time = CPU execution time (including hits) + memory-stall time, where
Memory-stall cycles = read-stall cycles + write-stall cycles
= Reads/program x read miss rate x miss penalty + Writes/program x write miss rate x miss penalty (ignoring write-buffer stalls)
or, combined: Memory-stall cycles = (Misses/program) x miss penalty.
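These formulas are easy to mechanize. A small C sketch; all the numeric inputs below are made-up placeholders, not figures from the slides:

```c
#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;    /* cycles to access the upper level      */
    double miss_rate    = 0.05;   /* fraction of accesses that miss        */
    double miss_penalty = 100.0;  /* cycles to refill from the lower level */

    /* Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty */
    double amat = hit_time + miss_rate * miss_penalty;

    /* Memory-stall cycles, split into read and write contributions
       (write-buffer stalls ignored, as on the slide). */
    double reads = 1e6, writes = 4e5;          /* placeholder counts */
    double read_miss_rate = 0.02, write_miss_rate = 0.04;
    double stalls = reads  * read_miss_rate  * miss_penalty
                  + writes * write_miss_rate * miss_penalty;

    printf("AMAT = %.1f cycles, stall cycles = %.0f\n", amat, stalls);
    return 0;
}
```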

7 Miss Penalty
What is the degrading influence of misses (stalls)?
Performance with perfect cache / Performance with stalls = CPU time with stalls / CPU time without stalls
= (#Instr x CPI_with_stall x clock cycle time) / (#Instr x CPI_perfect x clock cycle time) = CPI_with_stall / CPI_perfect
CPI_with_stall depends on the application:
CPI_with_stall = CPI_perfect + CPI_misses = CPI_perfect + CPI_miss_instr + CPI_miss_data
If the miss penalty is 100 cycles, the instruction miss rate is 2%, the data-cache miss rate is 4%, the data-memory access rate is 36% (the gcc example), and CPI_perfect is 2, then
CPI_misses = (2% + 36% x 4%) x 100 = 3.44
Performance perfect cache / Performance with stalls = (2 + 3.44) / 2 = 2.72, i.e., a 172% degradation.
Increasing the clock rate will not solve the problem: doubling the clock rate, for example, means that the miss penalty goes from 100 cycles to 200 cycles. In that case
Performance fast clock with stalls / Performance slow clock with stalls = (IC x CPI_stall,slow x T) / (IC x CPI_stall,fast x T/2)
= [(2% + 36% x 4%) x 100 + 2] x 2 / [(2% + 36% x 4%) x 200 + 2] = 10.88 / 8.88 = 1.23
Performance grows by 23%, not 100%!
We need ways to reduce both the miss rate (%) and the miss penalty (cycles).
- Reducing the miss penalty is done with a multi-level cache: a miss in the primary cache means data is retrieved from the secondary cache (faster, fewer cycles) instead of main memory.
- Reducing the miss rate depends on the cache architecture, as we will see shortly.
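The arithmetic on this slide can be checked directly. A short C sketch using the slide's gcc numbers:

```c
#include <stdio.h>

int main(void) {
    double cpi_perfect = 2.0;
    double instr_mr    = 0.02;   /* instruction miss rate              */
    double data_mr     = 0.04;   /* data-cache miss rate               */
    double data_rate   = 0.36;   /* fraction of instrs accessing data  */
    double penalty     = 100.0;  /* miss penalty in cycles             */

    double cpi_misses = (instr_mr + data_rate * data_mr) * penalty;  /* 3.44 */
    double cpi_stall  = cpi_perfect + cpi_misses;                    /* 5.44 */
    printf("slowdown vs. perfect cache: %.2f\n", cpi_stall / cpi_perfect); /* 2.72 */

    /* Doubling the clock doubles the penalty measured in cycles. */
    double cpi_fast = cpi_perfect + (instr_mr + data_rate * data_mr) * 2 * penalty;
    /* speedup = (cpi_stall x T) / (cpi_fast x T/2) = 10.88 / 8.88 */
    printf("speedup from 2x clock: %.2f\n", cpi_stall * 2.0 / cpi_fast);   /* 1.23 */
    return 0;
}
```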

8 The Simplest Cache
The simplest cache has blocks of one-word size.
Direct-mapped cache: each memory location (32-bit memory address) maps to exactly one cache location. But main memory is much larger than the cache, so several memory addresses map to the same cache location.
- Question 1: How do we know whether the requested word is in the cache or not?
- Question 2: How do we know whether the data in the cache corresponds to the requested word?
- Question 3: How do we know whether the data found in the cache is valid?
The first bit in each cache block is a valid bit, which tells the cache controller whether the data in that block is valid.
Each block in the cache is indexed, allowing each block to be addressed when the CPU looks for instructions or data stored in the cache.
A tag identifies which memory location corresponds to a particular block in the cache: the tag holds the upper portion of the address, while the lower portion is used as the index. Bits 0 and 1 (the byte offset within a word) are not used.
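The tag/index/byte-offset split is just bit slicing. A C sketch; the 10-bit index and the example address are illustrative assumptions, not from the slide:

```c
#include <stdio.h>
#include <stdint.h>

/* Decompose a 32-bit byte address for a direct-mapped cache with
   2^INDEX_BITS one-word (4-byte) blocks.  Bits 1:0 are the byte
   offset within the word and are not used to locate the block. */
#define INDEX_BITS 10

int main(void) {
    uint32_t addr  = 0x12345678;                       /* example address */
    uint32_t index = (addr >> 2) & ((1u << INDEX_BITS) - 1);
    uint32_t tag   =  addr >> (2 + INDEX_BITS);

    printf("addr 0x%08x -> index %u, tag 0x%x\n", addr, index, tag);
    return 0;
}
```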

9 Cache - example
The cache is initially empty; all valid bits are off.
- The CPU requests an address: a miss. The CPU requests another address: another miss.
- After that, the CPU re-requests those two addresses: hit, then hit.
- The CPU requests two new addresses: miss, miss.
- Then the CPU requests a previously loaded address: hit, and one more new address: miss.
Temporal locality: recently accessed words replace less recently accessed words.

10 MIPS Direct Mapped Cache Example
[Figure: memory address split into a 20-bit tag, index, and 2-bit byte offset; each cache entry holds a valid bit, the tag, and 32 bits of data; Hit is asserted when the stored tag matches and the entry is valid.]
If there is an instruction miss, the Control block stalls, PC-4 is sent to memory, memory performs the read, the data is placed in the cache slot selected by the lower (index) bits, the tag field is written with the upper address bits, the valid bit is set to 1, and the instruction is re-fetched.
Example: The series of memory references, given as word addresses, is 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17. Assume a direct-mapped cache with 16 one-word blocks that is initially empty. Label each reference in the list as a hit or a miss and show the final contents of the cache. (The block index is the word address mod 16; a simulation sketch follows this slide.)
HIT/MISS: Miss, Miss, Miss, Miss, Miss, Miss, Miss, Miss, Miss, Miss, Miss, Miss, Hit, Miss, Hit, Hit.
Final contents (block: word address): 0001 (1): 17; 0011 (3): 19; 0100 (4): 4; 0101 (5): 5; 0110 (6): 6; 1000 (8): 56; 1001 (9): 9; 1011 (11): 43; all other blocks empty.
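A quick way to verify the hit/miss labels is to replay the trace in software. A C sketch of the 16-block direct-mapped cache described above:

```c
#include <stdio.h>

/* Replay the slide's word-address trace through a direct-mapped
   cache with 16 one-word blocks (block index = address mod 16). */
int main(void) {
    int refs[] = {1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17};
    int n = sizeof refs / sizeof refs[0];
    int tagmem[16], valid[16] = {0};

    for (int i = 0; i < n; i++) {
        int block = refs[i] % 16;   /* index = low 4 bits of the word address */
        int tag   = refs[i] / 16;   /* tag = the remaining upper bits         */
        int hit   = valid[block] && tagmem[block] == tag;
        printf("%2d -> block %2d: %s\n", refs[i], block, hit ? "Hit" : "Miss");
        valid[block]  = 1;          /* refill the block on a miss */
        tagmem[block] = tag;
    }
    return 0;
}
```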

11 Cache size
How large does the cache need to be? Each row in the cache consists of the valid bit + tag bits + data bits. Assuming there are n index bits, and thus 2^n rows in the cache, the cache size is
2^n x [32 + (32 - n - 2) + 1] = 2^n x (63 - n) bits
For 64 KB of data in a cache with one-word blocks: since 1 word = 4 bytes, the cache must hold 16K words = 16K blocks = 16,384 blocks, which means n = 14. Thus the cache size is 2^14 x (63 - 14) = 16,384 x 49 = 802,816 bits, roughly 100 KB.
For 256 KB of data with one-word blocks, we have 64K words, n = 16, and the total size is 2^16 x (63 - 16) = 65,536 x 47 = 3,080,192 bits, roughly 385 KB.
[Figure: a 64 KB cache with 16-byte blocks - 16 tag bits, 12 index bits, 2-bit block offset.]
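The 2^n x (63 - n) formula can be tabulated directly. A C sketch reproducing the two cases worked above:

```c
#include <stdio.h>

/* Total bits for a direct-mapped cache of one-word blocks with a
   32-bit address: 2^n rows of [valid(1) + tag(32-n-2) + data(32)]
   = 2^n * (63 - n) bits. */
long cache_bits(int n) {
    return (1L << n) * (63 - n);
}

int main(void) {
    printf("64 KB data  (n=14): %ld bits\n", cache_bits(14)); /* 802,816   */
    printf("256 KB data (n=16): %ld bits\n", cache_bits(16)); /* 3,080,192 */
    return 0;
}
```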

12 Cache size
[Figure: a 16 KB cache with 16-word blocks - 18 tag bits, 8 index bits, 4-bit block offset.]
The same quantity of data can be accommodated by a cache that has larger blocks (e.g., 4 words each) and fewer index bits. Using several words per block improves the miss rate versus single-word blocks - spatial locality. For the same application, gcc's 5.4% miss rate (1-word blocks) goes down to 1.9% (4-word blocks). For spice, the miss rate goes from 1.2% (1-word blocks) to 0.4% (4-word blocks).
However, the miss penalty increases, since now 4 words must be loaded at once (more cycles spent in a stall).
Early restart: fetch the requested word first (works for the instruction cache if memory can deliver one instruction per cycle).

13 Cache size
If blocks become too large relative to the cache size, the number of blocks for a given cache size becomes too small and the miss rate goes back up.
[Figure: miss rate vs. block size for 4 KB, 16 KB, 64 KB, and 256 KB caches, based on SPEC92.]
Handling writes
When the CPU writes data to the cache, the data should also be written to main memory to keep the two consistent (called write-through).
Writing to main memory is slow and reduces performance (e.g., if a write to main memory takes 100 cycles and 10% of instructions are sw, CPI becomes 1 + 10% x 100 = 11 cycles/instruction vs. 1). Thus we need a write buffer, which should be several words deep to absorb write bursts.
To avoid buffer overflow (which corrupts data), the rate at which main memory drains the buffer must be larger than the rate at which the CPU fills it.
An alternative is write-back: write to memory only when a cache block is being replaced.

14 Memory systems that support cache
DRAMs are designed to increase density, not to reduce access time. To reduce the miss penalty we need to change the memory access design to increase throughput. Three organizations:
- One-word-wide memory: sequential access, one word at a time.
- Wide memory: parallel access to all words in a block.
- Interleaved memory: simultaneous memory reads across banks.
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways. In the one-word-wide organization (one-word-wide bus and one-word-wide memory), the on-chip CPU and cache connect to DRAM over a bus carrying 32 bits of data and a 32-bit address per cycle.
Assume 1 clock cycle (2 ns) to send the address, 25 clock cycles (50 ns) for the DRAM cycle time, and 1 clock cycle (2 ns) to return a word of data.
Memory-bus-to-cache bandwidth = the number of bytes accessed from memory and transferred to the cache/CPU per clock cycle.

15 One Word Wide Memory Organization
If the block size is one word, then on a cache miss the pipeline stalls for the number of cycles required to return one data word from memory:
1 cycle to send the address + 25 cycles to read the DRAM + 1 cycle to return the data = 27 total clock cycles of miss penalty.
The number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/27 = 0.148 bytes per clock cycle.
What if the block size were four words?
1 cycle to send the 1st address + 100 cycles (4 x 25) to read the DRAM + 1 cycle to return the last data word = 102 total clock cycles of miss penalty.
The bandwidth for a single miss is (4 x 4)/102 = 0.157 bytes per clock cycle.

16 Interleaved Memory Organization
For a block size of four words, with memory split into four banks (bank 0 through bank 3) that are read in parallel (25 cycles each, overlapped):
1 cycle to send the 1st address + 25 + 3 = 28 cycles to read the DRAM banks + 1 cycle to return the last data word = 30 total clock cycles of miss penalty.
The bandwidth for a single miss is (4 x 4)/30 = 0.53 bytes per clock cycle = 4.27 bits per clock cycle. (A comparison sketch follows this slide.)
Further Improvements to Memory Organization (DDR SDRAMs)
An external clock (e.g., 300 MHz) synchronizes memory addresses. Example: a 4M DRAM outputs one bit from the array through 2048 column latches and a multiplexor.
An SDRAM is given the starting address and the burst length (2/4/8); it need not be provided successive addresses.
DDR (double data rate) transfers data on both the rising and the falling edge of the external clock.
In 1980, DRAMs were 64 Kbit with a 150 ns column access to an existing row; 2004 DRAMs were 1024 Mbit with a 3 ns column access to an existing row.
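The miss penalties and bandwidths of the narrow and interleaved organizations follow from the stated timing. A C sketch using the slides' numbers (1 cycle for the address, 25 cycles per DRAM access, 1 cycle per word on the bus):

```c
#include <stdio.h>

int main(void) {
    int block_words = 4;
    int block_bytes = 4 * block_words;

    /* One-word-wide memory: the four DRAM reads are serialized. */
    int narrow      = 1 + block_words * 25 + 1;   /* 102 cycles */
    /* Four interleaved banks: reads overlap, words return one per cycle. */
    int interleaved = 1 + 25 + 3 + 1;             /* 30 cycles  */

    printf("one-word-wide: %3d cycles, %.3f bytes/cycle\n",
           narrow, (double)block_bytes / narrow);
    printf("interleaved:   %3d cycles, %.3f bytes/cycle\n",
           interleaved, (double)block_bytes / interleaved);
    return 0;
}
```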

17 Further Improvements - two-level cache
The figure shows the AMD Athlon and Duron processor architecture.
Two-level caches allow the L1 cache to be smaller, which improves the hit time, since smaller caches are faster. The L2 cache is larger - its access time is less critical - and has a larger block size. L2 is accessed whenever a miss occurs in L1, which reduces the L1 miss penalty dramatically. L2 is also used to store the contents of the victim buffer - data ejected from the L1 cache when an L1 miss occurs.
Reducing Cache Misses through Associativity
Recall that a direct-mapped cache allows a memory location to map to only one block in the cache (using tags) - it needs only one comparator.
In a fully associative cache, a block in memory can map to any block in the cache. Thus all entries in the cache must be searched. This is done in parallel, with one comparator for each cache block. It is expensive from a hardware point of view and works only for small caches.
In between the two extremes are set-associative caches: a block in memory maps to only one set of blocks, but can occupy any position within that set.

18 Reducing Cache Misses through Associativity
An n-way set-associative cache has sets with n blocks each. All blocks in a set have to be searched, which reduces the number of comparators to n.
- One-way set-associative: the same as direct-mapped.
- Two-way set-associative.
- Four-way set-associative.
- Eight-way set-associative: for an eight-block cache, the same as fully associative.
As associativity increases, the miss rate decreases (1-way 10.3%, 8-way 8.1% data miss rate), but the hit time increases.
For set-associative caches, each doubling of associativity decreases the number of index bits by one and increases the number of tag bits by one. For a fully associative cache there are no index bits, since there is only one set. (An index-computation sketch follows this slide.)
[Figure: a 4-way set-associative cache - four blocks per set, 22 tag bits, four comparators, 32-bit data.]
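Computing which set an address maps to is a divide and a modulo. A C sketch; the block size, cache size, and example address are illustrative only:

```c
#include <stdio.h>
#include <stdint.h>

/* For an n-way set-associative cache, the index selects a set rather
   than a single block; doubling the associativity halves the number
   of sets, so the index shrinks by one bit and the tag grows by one. */
uint32_t set_index(uint32_t addr, int block_bytes, int num_sets) {
    return (addr / block_bytes) % num_sets;
}

int main(void) {
    uint32_t addr = 0x12345678;   /* example address */
    /* 1024 blocks of 4 bytes: 1-way = 1024 sets, 2-way = 512, 4-way = 256 */
    for (int ways = 1; ways <= 4; ways *= 2)
        printf("%d-way: set %u\n", ways, set_index(addr, 4, 1024 / ways));
    return 0;
}
```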

19 Recall the Direct Mapped Cache
[Figure: the direct-mapped cache - 20 tag bits, 10 index bits, byte offset, valid bit, 32-bit data.]
It had 20 tag bits vs. 22 for the 4-way set-associative cache, and 10 index bits vs. 8 for the 4-way set-associative cache. How many tag and index bits for an 8-way set-associative cache? 23 tag bits and 7 index bits.
Which block to replace in an associative cache?
The basic principle is Least Recently Used (LRU): replace the block that is oldest. Keeping track of a block's age is done in hardware. This is practical for small set-associativity (2-way or 4-way); for higher associativity, LRU is either approximated or replacement is random.
For a 2-way set-associative cache, random replacement has about a 10% higher miss rate than LRU. As caches become larger, the miss rates for both strategies fall, and the difference between the two shrinks.

20 Exercise
Associativity usually improves the miss ratio, but not always. Give a short series of address references for which a 2-way set-associative cache with LRU replacement would experience more misses than a direct-mapped cache of the same size.
A 2-way cache has half the number of sets of a direct-mapped cache of the same size. Choose three addresses A, B, C that all map to the same set of the 2-way cache, with A and C mapping to one block of the direct-mapped cache and B to another. The sequence A, B, C, A, B, C then generates in the direct-mapped cache: miss, miss, miss, miss, hit, miss. The same sequence in the 2-way LRU cache generates: miss, miss, miss, miss, miss, miss - each new reference evicts exactly the block that is needed next. (A simulation of this sequence follows this slide.)
Exercise
Suppose a computer's address size is k bits (using byte addressing), the cache size is S bytes, the block size is B bytes, and the cache is A-way set-associative. Assume that B is a power of 2, so B = 2^b. Figure out what the following quantities are: the number of sets in the cache, the number of index bits in the address, and the number of bits needed to implement the cache.
Address size = k bits; cache size = S bytes/cache; block size = B = 2^b bytes/block; associativity = A blocks/set.
Number of sets per cache = (bytes/cache) / [(blocks/set) x (bytes/block)] = S / (A x B).
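The adversarial sequence above can be simulated. A C sketch with A, B, C encoded as word addresses 0, 1, 2, a two-block direct-mapped cache, and a single two-entry LRU set:

```c
#include <stdio.h>

#define A 0
#define B 1
#define C 2

int main(void) {
    int seq[] = {A, B, C, A, B, C};

    /* Direct-mapped, two one-word blocks: block = addr mod 2,
       so A and C collide while B has its own block. */
    int dm[2] = {-1, -1}, dm_miss = 0;
    /* 2-way set-associative, one set of two blocks, LRU order:
       sa[0] is always the LRU entry, sa[1] the MRU entry. */
    int sa[2] = {-1, -1}, sa_miss = 0;

    for (int i = 0; i < 6; i++) {
        int x = seq[i];
        if (dm[x % 2] != x) { dm_miss++; dm[x % 2] = x; }

        if (sa[0] == x)      { sa[0] = sa[1]; sa[1] = x; }            /* hit on LRU */
        else if (sa[1] == x) { /* hit on MRU: order unchanged */ }
        else                 { sa_miss++; sa[0] = sa[1]; sa[1] = x; } /* evict LRU  */
    }
    printf("direct-mapped misses: %d\n", dm_miss);  /* 5 */
    printf("2-way LRU misses:     %d\n", sa_miss);  /* 6 */
    return 0;
}
```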

21 Exercise - continued
Index bits: 2^(#index bits) = sets/cache = S / (A x B), so
#index bits = log2(S / (A x B)) = log2(S/A) - log2(2^b) = log2(S/A) - b
Tag address bits = total address bits - index bits - block offset bits = k - [log2(S/A) - b] - b = k - log2(S/A)
Bits in tag memory/cache = (tag address bits/block) x (blocks/set) x (sets/cache) = [k - log2(S/A)] x A x S/(A x B) = (S/B) x [k - log2(S/A)]
(A parameter-computation sketch follows this slide.)
Virtual Memory
When multiple applications (processes) run at the same time, main memory (DRAM) becomes too small. Virtual memory extends the memory hierarchy to the hard disk and treats the RAM as a cache for the hard disk. This way each process is allocated a portion of the RAM, and each program has its own range of physical memory addresses.
Virtual memory translates (maps) the virtual addresses of each program to physical addresses in main memory. Protections have to be in place in case of data sharing.
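These formulas are easy to evaluate for concrete parameters. A C sketch; the values of k, S, B, and A below are example inputs, not from the slide:

```c
#include <stdio.h>
#include <math.h>

/* Cache parameters from the exercise: address size k bits, cache
   size S bytes, block size B = 2^b bytes, associativity A. */
int main(void) {
    int  k  = 32;
    long S  = 64 * 1024;
    long B  = 16;
    int  A_ = 4;

    long sets        = S / (A_ * B);
    int  index_bits  = (int)log2((double)sets);
    int  offset_bits = (int)log2((double)B);
    int  tag_bits    = k - index_bits - offset_bits;  /* = k - log2(S/A) */

    /* tag storage = tag bits per block x blocks per cache = tag_bits x S/B */
    long tag_store = (long)tag_bits * (S / B);

    printf("sets=%ld index=%d tag=%d tag-store=%ld bits\n",
           sets, index_bits, tag_bits, tag_store);
    return 0;
}
```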

22 Virtual Memory
The CPU generates virtual addresses, while memory is accessed with physical addresses. Memory is treated as a fully associative cache and divided into pages; a portion of DRAM can be shared between processes (shared memory).
Translation eliminates the need to find a contiguous block of memory to allocate to a program.
Virtual Memory - continued
The translation mechanism maps the CPU's 32-bit address to the real physical address using a virtual page number and a page offset. The virtual address space is much larger than the physical address space (e.g., 4 GB of virtual space vs. 1 GB of RAM), creating the illusion of infinite memory.

23 Virtual Memory - continued
The number of page offset bits determines the page size (typically 4 KB to 16 KB), which should be large enough to reduce the chance of a page fault.
When there is a page fault - millions of clock cycles of penalty - it is handled in software through the exception mechanism. Software can reduce page faults by cleverly deciding which pages to replace in DRAM (older pages).
Pages always exist on the hard disk but are loaded into DRAM only when needed. A write-back mechanism ensures that pages that were altered (written to while in RAM) are saved to disk before being discarded.
Virtual Memory - continued
The translation mechanism is provided by a page table. Each program has its own page table, which contains the physical addresses of its pages and is indexed by the virtual page number.
When a program has possession of the CPU, the OS loads its pointer to the page table, and that page table is read. Since each process has its own page table, programs can share the same virtual address space, because the page tables hold different mappings for different programs (protection).

24 Page Table
[Figure: a register points to the first address of the page table of the active process; a valid bit in each entry indicates whether the page is in DRAM.]
The size of the page table has to be limited, so that no one process gobbles up the whole physical memory. The page table of a process is not fixed: the OS alters it to ensure that different processes do not collide and to handle page faults.
Page Faults
If the valid bit is 0, there is a page fault: the address points to a page on the hard disk. The page must be loaded into DRAM by the OS, and the page table is updated to map the page to its new address in physical memory and to turn the valid bit to 1.
If physical memory is full, an existing page must be discarded before a new page is loaded from disk. The OS uses a least-recently-used scheme: a reference bit for each physical page is set whenever that page is accessed; the OS inspects these bits and also periodically resets them (a statistical LRU).
A dirty bit is added to the page table to indicate whether the page was altered - if so, it must be saved to disk before being discarded. (A toy page-table walk follows this slide.)
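A page-table walk as described above, reduced to a toy C sketch (flat table, hypothetical sizes and names; a real OS uses multi-level tables and hardware assistance):

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS 12     /* 4 KB pages - an illustrative choice */
#define NUM_PAGES 1024   /* toy table size                      */

typedef struct {
    unsigned valid : 1;  /* page is resident in DRAM              */
    unsigned dirty : 1;  /* page was written since being loaded   */
    unsigned ref   : 1;  /* set on access; cleared by the OS      */
    uint32_t ppn;        /* physical page number                  */
} pte_t;

pte_t page_table[NUM_PAGES];

/* Returns 0 and fills *paddr on success; -1 signals a page fault,
   which the OS would service by loading the page from disk. */
int translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

    if (vpn >= NUM_PAGES || !page_table[vpn].valid)
        return -1;                        /* page fault: OS takes over   */
    page_table[vpn].ref = 1;              /* statistical-LRU bookkeeping */
    *paddr = (page_table[vpn].ppn << PAGE_BITS) | offset;
    return 0;
}

int main(void) {
    page_table[3] = (pte_t){ .valid = 1, .ppn = 7 };
    uint32_t pa;
    if (translate(0x3abc, &pa) == 0)
        printf("0x3abc -> 0x%x\n", pa);   /* vpn 3 -> ppn 7 => 0x7abc */
    return 0;
}
```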

25 Translation-Lookaside Buffer (TLB)
To optimize the translation process and reduce memory access time, the TLB is a cache that holds recently used page-table mappings. Each TLB tag holds a virtual page number, and its data holds the corresponding physical page number. The TLB also holds the reference bit, valid bit, and dirty bit. (A toy TLB lookup follows this slide.)
On a TLB miss, either the page is in the page table and the mapping is loaded by the CPU (much more frequent), or the page is not in the page table, which is a page fault exception. On a miss, the CPU selects which entry in the TLB to replace; its reference and dirty bits are then written back into the page table.
Miss rates for the TLB are small (on the order of 0.01% to 1%), and the penalty is 10-100 clock cycles - much smaller than a page fault!
Example: Consider a virtual memory system with a 40-bit virtual byte address, 16 KB pages, and a 36-bit physical byte address. What is the total size of the page table for each process on this machine, assuming that the valid, protection, dirty, and use bits take a total of 4 bits and that all the virtual pages are in use? Assume that disk addresses are not stored in the page table.
Page table size = #entries x entry size. The #entries = # of virtual pages = 2^40 bytes / 2^14 bytes/page = 2^26 entries. The width of each entry is 36 + 4 = 40 bits. Thus the size of the page table is 2^26 x 40 bits = 335 MB.
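A fully associative TLB lookup, reduced to a toy C sketch (the sequential loop stands in for the parallel comparators of real hardware; all sizes and field widths are illustrative):

```c
#include <stdio.h>
#include <stdint.h>

#define TLB_ENTRIES 16   /* fully associative: every entry is compared */
#define PAGE_BITS   12

typedef struct {
    unsigned valid : 1, dirty : 1, ref : 1;
    uint32_t vpn, ppn;   /* tag = virtual page #, data = physical page # */
} tlb_entry_t;

tlb_entry_t tlb[TLB_ENTRIES];

/* Returns 1 on a TLB hit and fills *ppn; 0 means a TLB miss, which the
   handler services from the page table (or raises a page fault). */
int tlb_lookup(uint32_t vpn, uint32_t *ppn) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            tlb[i].ref = 1;          /* reference bit for replacement */
            *ppn = tlb[i].ppn;
            return 1;
        }
    return 0;
}

int main(void) {
    tlb[5] = (tlb_entry_t){ .valid = 1, .vpn = 0x12345, .ppn = 0x00042 };
    uint32_t ppn;
    printf("vpn 0x12345: %s\n", tlb_lookup(0x12345, &ppn) ? "TLB hit" : "TLB miss");
    printf("vpn 0x99999: %s\n", tlb_lookup(0x99999, &ppn) ? "TLB hit" : "TLB miss");
    return 0;
}
```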

26 TLB and cache working together (Intrinsity FastMATH processor)
4 KB pages; the TLB has 16 entries and is fully associative, so all entries must be compared. Each entry is 64 bits: 20 tag bits (virtual page #), 20 data bits (physical page #), plus valid, reference, and dirty bits, etc.
One of the extra bits is a write-access bit. It prevents programs from writing into pages for which they have only read access - part of the protection mechanism.
There can be three kinds of misses: a cache miss, a TLB miss, and a page fault. A TLB miss in this case takes 16 cycles on average. On a page fault, the CPU saves the process state, gives control of the CPU to another process, and then brings the page in from disk.

27 How are TLB misses and Page Faults handled?
A TLB miss means no entry in the TLB matches the virtual address. In that case, if the page is in memory (as indicated by the page table), then its translation is placed in the TLB; the TLB miss is handled by the OS in software. Once the TLB holds the translation, the instruction that caused the TLB miss is re-executed.
If the valid bit of the page entry retrieved into the TLB is 0, then there is a page fault. When a page fault occurs, the OS takes control and stores the state of the process that caused the fault, as well as the address of the offending instruction in the EPC.
The OS then finds a place for the page by discarding an old one (if it was dirty, it first has to be saved to disk). After that, the OS starts the transfer of the needed page from the hard disk and gives control of the CPU to another process (millions of cycles). Once the page has been transferred, the OS reads the EPC and returns control to the offending process so it can complete.
Also, if the instruction that caused the page fault was a sw, the write control line of data memory is deasserted to prevent the sw from completing. When an exception occurs, the processor sets a bit that disables exceptions, so that a subsequent exception cannot overwrite the EPC.

28 The influence of Block size
In general, a larger block size takes advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty - it takes longer to fill the block.
- If the block size is too big relative to the cache size, the miss rate goes up - too few cache blocks (fewer blocks compromises temporal locality).
In general, Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate.
[Figure: miss penalty, miss rate, and average access time vs. block size - the miss penalty grows with block size; the miss rate first falls (exploiting spatial locality) and then rises; the average access time reflects the increased miss penalty and miss rate.]
The Influence of Associativity
Every change that improves the miss rate can also negatively affect overall performance. For example, we can reduce the miss rate by increasing associativity (about a 30% gain for small caches when going from direct-mapped to two-way set-associative). But large associativity does not make sense for modern caches, which are large, since the hardware costs more (more comparators) and the access time is larger.
While full associativity does not pay for caches, it is good for paged memory, because misses (page faults) are very expensive. A large page size means the page table is small.

29 The influence of associativity (SPEC2000)
[Figure: miss rate vs. associativity for small and large caches, SPEC2000.]
Memory write options
There are two options: write-through (for caches) and write-back (for paged memory). With write-back, pages are written to disk only if they were modified prior to being replaced.
The advantages of write-back are that multiple writes to a given page require only one write to the disk, and that the write can use high bandwidth rather than going one word at a time. Individual words can be written into a page much faster (at cache rate) than if they were written through to disk.
The advantage of write-through is that misses are simpler to handle and easier to implement (using a write buffer). In the future more caches will use write-back because of the CPU-memory gap.

30 Processor-DRAM Memory Gap (latency)
DRAM performance improves by only about 7% per year, so the processor-memory gap keeps widening. Solutions to reduce the gap: add an L3 cache, and have the L2 and L3 caches do something useful while idle.
Sources of (Cache) Misses
- Compulsory (cold start or process migration; first reference): the first access to a block. A cold fact of life - not a whole lot you can do about it. Note: if you are going to run billions of instructions, compulsory misses are insignificant.
- Conflict (collision): multiple memory locations (blocks) map to the same cache location. Solution 1: increase the cache size. Solution 2: increase associativity.
- Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase the cache size.
- Invalidation: another process (e.g., I/O) updates memory.

31 Total Miss Rate vs. Cache Type and Size
[Figure: additional conflict misses appear when going from two-way to one-way set-associative, and from four-way to two-way; capacity misses shrink as caches grow.]
Design alternatives
- Increase cache size: decreases capacity misses; may increase access time.
- Increase associativity: decreases the conflict miss rate; may increase access time.
- Increase block size: decreases the miss rate due to spatial locality, but increases the miss penalty; very large blocks may increase the miss rate for small caches.
So the design of memory hierarchies is interesting.

32 Processor-DRAM Memory Gap for Multi-cores
With multiple cores sharing the memory system, memory-intensive applications suffer performance degradation as the number of cores grows. One solution is 3-D chips, which place the DRAM on top of the processor chip.
