Lecture 2: Memory Systems
- Basic components
- Memory hierarchy
- Cache memory
- Virtual Memory
Zebo Peng, IDA, LiTH

Many Different Technologies
Internal and External Memories
CPU <-> (data transfer, control) <-> Main Memory <-> (data transfer, control) <-> Secondary Memory

Main Memory Model
[Figure: a memory array of words (8, 16, 32, or 64 bits each); the address from the MAR (in the CPU) drives the address selection logic, a read/write control line selects the operation, and data passes through the MBR (in the CPU).]
Memory Characteristics
The most important characteristics of a memory are:
- speed: as fast as possible;
- size: as large as possible;
- cost: a reasonable price.
They are determined by the technology used for implementation. (Analogy: your personal library.)

Memory Access Bottleneck
The data path between the CPU and the memory is a bottleneck. A quantitative measure of the capacity of this bottleneck is the memory bandwidth.
Memory Bandwidth
Memory bandwidth denotes the amount of data that can be accessed from a memory per second:

M-Bandwidth = amount of data per access / memory cycle time

Ex. MCT = 100 nanoseconds and 4 bytes (a word) per access:
M-Bandwidth = 4 bytes / 100 ns = 40 megabytes per second.

There are two basic techniques to increase the bandwidth of a given memory:
- Reduce the memory cycle time: expensive, and limits the memory size.
- Divide the memory into several banks, each of which has its own control unit (using parallelism).

Memory Banks
[Figure: interleaved placement of program and data across four banks, each with its own control unit connected to the CPU; consecutive addresses are spread across consecutive banks so they can be accessed in parallel.]
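The bandwidth arithmetic and the interleaved-bank idea can be sketched as follows (a minimal illustration; the four-bank count matches the figure, but the function names are hypothetical):

```python
# Memory bandwidth: data per access divided by memory cycle time.
def bandwidth_bytes_per_sec(bytes_per_access, cycle_time_ns):
    return bytes_per_access / cycle_time_ns * 1e9

# 4 bytes per access, 100 ns cycle time -> about 40 MB/s, as in the example.
print(bandwidth_bytes_per_sec(4, 100))

# Interleaved banks: consecutive word addresses map to consecutive banks,
# so sequential accesses can proceed in parallel.
NUM_BANKS = 4

def bank_of(address):
    return address % NUM_BANKS      # which bank holds this word

def offset_in_bank(address):
    return address // NUM_BANKS     # position inside that bank

for addr in range(8):
    print(addr, "-> bank", bank_of(addr), "offset", offset_in_bank(addr))
```

With this mapping, a sequential sweep over addresses 0, 1, 2, 3 touches all four banks once each, which is exactly what lets the banks overlap their cycle times.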
Lecture 2: Memory Systems
- Basic components
- Memory hierarchy
- Cache memory
- Virtual Memory

Motivation
What do we need? A memory to store very large programs/data and to work at a speed comparable to that of the CPU.
The reality is:
- The larger a memory, the slower it will be;
- The faster a memory, the greater the cost per bit.
A solution: build a composite memory system which combines a small and fast memory with a large and slow memory, and behaves, most of the time, like a large and fast memory. This two-level principle can be extended to a hierarchy of many levels.
Memory Hierarchy
CPU Registers -> Cache -> Main Memory -> Secondary Memory of direct access type -> Secondary Memory of archive type

Level                       Access time example     Capacity example
CPU registers               0.25 ns                 1 KB
Cache                       1 ns                    4 MB
Main memory                 8 ns                    8 GB
Secondary (direct access)   2 ms (for 4 KB)         2 TB
Secondary (archive)         5 s (for 8 KB)          (100 MB/tape)

As one goes down the hierarchy, the following occur:
- Decreasing cost/bit.
- Increasing capacity.
- Increasing access time.
- Decreasing frequency of access by the CPU.
Lecture 2: Memory Systems
- Basic components
- Memory hierarchy
- Cache memory
- Virtual Memory

Mismatch of CPU and MM Speeds
[Figure: cycle time in nanoseconds (log scale, 1 to 10^4) of CPUs and main memories plotted from 1955 to 2015; the curves diverge, leaving a speed gap of about one order of magnitude, i.e., 10 times.]
Cache Memory
A cache is a very fast memory which is put between the main memory and the CPU, and is used to hold segments of program and data of the main memory. On a cache hit, the CPU gets its instructions and data directly from the cache; otherwise the addresses go on to the main memory.

Zebo's Cache Memory Model
A personal library for a high-speed reader: storage cells plus a memory controller. A computer is a predictable and iterative reader. A high cache hit ratio, e.g., 96%, is achievable even with a relatively small cache (e.g., 0.1% of the memory size).
Cache Memory Features
- It is transparent to the programmers.
- Only a very small part of the program/data in the main memory has its copy in the cache (e.g., a 4 MB cache with 8 GB memory).
- If the CPU wants to access program/data not in the cache (called a cache miss), the relevant block of the main memory will be copied into the cache.
- Memory accesses in the near future will usually refer to the same word or words in its neighborhood, and will not have to involve the main memory. This property of program executions is denoted as locality of reference.

Locality of Reference
Programs access a small proportion of their address space in any short period of time.
- Temporal locality: if an item is accessed, it will tend to be accessed again soon.
- Spatial locality: if an item is accessed, items whose addresses are close by will tend to be accessed soon.
This access pattern is an intrinsic feature of the von Neumann architecture:
- Sequential instruction storage and execution.
- Loops and iterations (e.g., subroutine calls).
- Sequential data storage (e.g., arrays).
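Both kinds of locality show up in the most ordinary code; a hypothetical illustration (not from the slides):

```python
# Summing an array exhibits both kinds of locality:
# - spatial: data[i] and data[i+1] sit at adjacent addresses, so fetching
#   one block of the array into the cache serves several later accesses;
# - temporal: total and i are reused on every iteration of the loop.
def sum_array(data):
    total = 0
    for i in range(len(data)):   # sequential access pattern
        total += data[i]
    return total

print(sum_array([1, 2, 3, 4]))  # 10
```

This is why copying a whole block (not just one word) on a miss pays off: the neighboring words are very likely to be needed next.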
Layered Memory Performance
Average Access Time:

AAT = P_hit × T_cache_access + (1 − P_hit) × (T_mm_access + T_cache_access) × Block_size + T_checking

where
- P_hit = the probability of a cache hit, the cache hit ratio;
- T_cache_access = cache access time;
- T_mm_access = main memory access time;
- Block_size = number of words in a cache block; and
- T_checking = the time needed to check for cache hit or miss.

Ex. A computer has an 8 MB MM with 100 ns access time, an 8 KB cache with 10 ns access time, Block_size = 4, T_checking = 2.1 ns, and P_hit = 0.97; the AAT will be 25 ns.

Cache Design
The size and nature of the copied block must be carefully designed, as well as the algorithm that decides which block to remove from the cache when it is full:
- Cache block size (line size).
- Total cache size.
- Mapping function.
- Replacement method.
- Write policy.
Numbers of caches:
- Single, two-level, or three-level cache.
- Unified vs. split cache.
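The worked example can be checked directly from the formula (a one-line transcription of the AAT equation above):

```python
# Average access time for a two-level (cache + main memory) system,
# following AAT = P_hit*T_cache + (1 - P_hit)*(T_mm + T_cache)*Block_size + T_check.
def average_access_time(p_hit, t_cache, t_mm, block_size, t_check):
    return p_hit * t_cache + (1 - p_hit) * (t_mm + t_cache) * block_size + t_check

# The slide's example: 10 ns cache, 100 ns MM, block size 4,
# 2.1 ns checking time, 97% hit ratio.
aat = average_access_time(0.97, 10, 100, 4, 2.1)
print(round(aat, 1))  # 25.0
```

Note how the miss term dominates: the 3% of misses contribute 0.03 × 110 × 4 = 13.2 ns of the 25 ns total, which is why even small improvements in P_hit matter.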
Split Data and Instruction Caches?
Split caches (Harvard architectures):
+ Competition for the cache between instruction processing and execution units is eliminated.
+ Instruction fetch can proceed in parallel with memory access from the CPU for operands.
− One cache may be overloaded while the other is underutilized.
Unified caches:
+ Better balances the load between instruction and data fetches, depending on the dynamics of the program execution.
+ Design and implementation are cheaper.
− Lower performance.

Direct Mapping Cache
Direct mapping: each block of the main memory is mapped into a fixed cache slot.
Direct Mapping Cache Example
We have a 10,000-word MM and a 100-word cache; 10 memory cells are grouped into a block, so the cache has 10 slots.
A four-digit decimal memory address is split as:

Memory address = Tag (2 digits) | Slot (1 digit) | Word (1 digit)

[Figure: each block of 10 memory words (0000–0009, 0010–0019, ..., 9990–9999) maps to the cache slot given by its slot digit; the cache stores the 2-digit tag with each slot to identify which of the 100 possible blocks currently occupies it.]

Direct Mapping Pros & Cons
+ Simple to implement and therefore inexpensive.
+ Very fast checking time for cache hit or miss.
− Fixed location for each block: if a program repeatedly accesses two blocks that map to the same cache slot, the cache miss rate is very high.
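The address split in the example can be sketched as follows (a minimal model of the 10,000-word memory / 100-word cache case; the function name is hypothetical):

```python
# Direct mapping for the decimal example: 10 words per block, 10 cache slots.
WORDS_PER_BLOCK = 10
NUM_SLOTS = 10

def split_address(addr):
    """Split a 4-digit memory address into (tag, slot, word)."""
    word = addr % WORDS_PER_BLOCK        # last digit: word within the block
    block = addr // WORDS_PER_BLOCK      # block number 0..999
    slot = block % NUM_SLOTS             # third digit: the block's fixed slot
    tag = block // NUM_SLOTS             # two high digits: identifies the block
    return tag, slot, word

print(split_address(9995))  # (99, 9, 5)
print(split_address(123))   # (1, 2, 3)
```

Because the slot is a fixed function of the address, a hit check needs only one tag comparison, which is what makes direct mapping cheap and fast.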
Associative Mapping
A main memory block can be loaded into any slot of the cache. To determine whether a block is in the cache, a mechanism is needed to examine every slot's tag simultaneously (an associative memory).
[Figure: the 10,000-word memory seen as 1,000 blocks; the 100-word cache can hold any 10 of them, each slot storing a 3-digit tag, e.g., tags 006 and 007 for blocks 0060–0069 and 0070–0079.]

Fully Associative Organization
[Figure: fully associative cache organization.]
Set Associative Organization
- The cache is divided into a number of sets (K).
- Each set contains a number of slots (W).
- A given block maps to any slot in a given set, e.g., block i can be in any slot of set j.
- For example, with 2 slots per set (W = 2): 2-way associative mapping; a given block can be in one of 2 slots.
- Direct mapping: W = 1 (no alternative).
- Fully associative: K = 1 (W = the total number of slots in the cache; all mappings possible).
W is the most important parameter (typically 2–16).

Replacement Algorithms
With direct mapping, no replacement algorithm is needed. With associative mapping, a replacement algorithm is needed in order to determine which block to replace:
- First-in-first-out (FIFO).
- Least-recently used (LRU): replace the block that has been in the cache longest with no reference to it.
- Least-frequently used (LFU): replace the block that has experienced the fewest references.
- Random.
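The LRU policy for one fully associative set can be sketched as follows (a simplified software model; real caches implement this in hardware with use bits or counters):

```python
from collections import OrderedDict

# LRU-managed set of cache slots: on a hit the block becomes most recent;
# on a miss with a full set, the least-recently-used block is evicted.
class LRUSet:
    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = OrderedDict()           # block -> None, oldest first

    def access(self, block):
        """Return True on a cache hit, False on a miss (loading the block)."""
        if block in self.slots:              # cache hit
            self.slots.move_to_end(block)    # mark as most recently used
            return True
        if len(self.slots) >= self.num_slots:
            self.slots.popitem(last=False)   # evict the LRU block
        self.slots[block] = None             # load the new block
        return False

s = LRUSet(2)                                # a 2-way set
hits = [s.access(b) for b in [1, 2, 1, 3, 2]]
print(hits)  # [False, False, True, False, False]
```

In the trace, accessing block 3 evicts block 2 (the least recently used), so the final access to 2 misses again; with FIFO, block 1 would have been evicted instead.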
Write Policy
The problem: how to keep the cache content and the main memory content consistent without losing too much performance?
Write through:
- All write operations are passed to main memory. If the addressed location is currently in the cache, the cache is updated so that it is coherent with the main memory.
- For writes, the processor always slows down to main memory speed.
- Since the percentage of writes is small (ca. 15%), this scheme doesn't lead to a large performance reduction.

Write Policy (Cont'd)
Write through with buffered write:
- The same as write through, but instead of slowing the processor down by writing directly to main memory, the write address and data are stored in a high-speed write buffer; the write buffer transfers data to main memory while the processor continues its task.
- Higher speed, but more complex hardware.
Write back:
- Write operations update only the cache memory, which is not kept coherent with main memory. When a slot is replaced from the cache, its content has to be copied back to memory.
- Good performance (usually several writes are performed on a cache block before it is replaced), but more complex hardware is needed.
Cache coherence problems are very complex and difficult to solve in multiprocessor systems (to be discussed later)!
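The write-back behavior can be sketched with a toy one-block cache (a hypothetical model to show the idea; the "dirty" flag marks a block that must be copied back on eviction):

```python
# Toy write-back cache holding a single block, to show the dirty-bit idea.
class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory       # backing store: block -> value
        self.block = None          # which block is currently cached
        self.value = None
        self.dirty = False

    def write(self, block, value):
        if self.block != block:    # replacing the cached block
            self.evict()
            self.block = block
        self.value = value
        self.dirty = True          # main memory is now stale; no MM write yet

    def evict(self):
        if self.dirty:             # copy back only when the slot is replaced
            self.memory[self.block] = self.value
        self.block, self.value, self.dirty = None, None, False

mem = {0: 0, 1: 0}
c = WriteBackCache(mem)
c.write(0, 42)
print(mem[0])   # 0  -- main memory not yet updated (write back)
c.write(1, 7)   # replacing block 0 forces the copy-back
print(mem[0])   # 42
```

Under write through, the first `write` would have updated `mem[0]` immediately; write back defers that cost until eviction, so several writes to the same block cost only one memory transfer.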
Cache Architecture Examples
Intel Pentium (introduced 1993):
- Two on-chip caches, one for data and one for instructions.
- Each cache: 8 KB. Line size: 32 bytes.
- 2-way set associative organization.
AMD Opteron 140 (introduced 2003):
- Two L1 caches: one for instructions and one for data; 64 KB each, 2-way associative organization.
- L2 cache: 1 MB, 16-way associative organization.
ARM Cortex-A15 (introduced 2012):
- Each core has separate L1 data and instruction caches: 64 KB (32 KB I-cache, 32 KB D-cache) per core.
- L2 cache, unified and common for all cores, up to 4 MB.

3-Level Cache Example
Intel Itanium 2 (introduced 2002):

                 L1              L2              L3
Contents         Split D and I   Unified D + I   Unified D + I
Size             16 KB each      256 KB          3 MB
Line size        64 bytes        128 bytes       128 bytes
Associativity    4-way           8-way           12-way
Access time      1 cycle         5–7 cycles      14–17 cycles
Store policy     Write-through   Write-back      Write-back
Lecture 2: Memory Systems
- Basic components
- Memory hierarchy
- Cache memory
- Virtual Memory

Motivation for Virtual Memory
The physical main memory (RAM) is relatively limited in capacity:
- It may not be big enough to store all the executing programs at the same time.
- A program may need more memory than the main memory provides, but the whole program doesn't need to be kept in the main memory at the same time.
Virtual memory takes advantage of the fact that at any given instant of time, an executing program needs only a fraction of the memory that the whole program occupies.
The basic idea: load only the pieces of each executing program that are currently needed.
Paging of Memory
- Divide programs (processes) into equal-sized, small blocks, called pages.
- Divide the primary memory into equal-sized, small blocks, called page frames.
- Allocate the required number of page frames to a program. A program does not require contiguous page frames!
The operating system (OS) is responsible for:
- Maintaining a list of free frames.
- Using a page table to keep track of the mapping between pages and page frames.

Logical and Physical Addresses
Implementation of the page tables:
- Main memory: slow, since an extra memory access is needed for each translation.
- Separate registers: fast but expensive.
- Cache.
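A page-table lookup splits a logical address into a page number and an offset; a minimal sketch (assuming a hypothetical 4 KB page size and a made-up page table):

```python
PAGE_SIZE = 4096  # assumed page size (4 KB)

# page_table[page_number] = frame_number, maintained by the OS.
# Note the frames need not be contiguous or in page order.
page_table = {0: 5, 1: 2, 2: 7}

def translate(logical_addr):
    """Translate a logical address to a physical one via the page table."""
    page = logical_addr // PAGE_SIZE
    offset = logical_addr % PAGE_SIZE
    if page not in page_table:
        # The page is not in main memory: in a real system this
        # triggers a page fault and the OS loads the page.
        raise LookupError("page fault: page %d not resident" % page)
    return page_table[page] * PAGE_SIZE + offset

print(translate(4100))  # page 1, offset 4 -> frame 2 -> 8196
```

The extra lookup per access is why implementing the page table purely in main memory is slow, and why hardware keeps translations in fast storage.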
Objective of Virtual Memory
To give the programmer a much bigger memory space than the main memory, with the help of the operating system. The virtual memory size is very much bigger than the main memory size.
[Figure: program (virtual) addresses mapped partly onto main memory addresses and partly onto the secondary memory.]

Page Fault
When accessing a VM page which is not in the main memory, a page fault occurs. The page must then be loaded from the secondary memory into the main memory by the OS.
Virtual address = Page number | Offset. The page map translates the page number to a page frame in MM, or signals a page fault (an interrupt to the OS) if the page is not present.
Page Replacement
When a page fault occurs and all page frames are occupied, one of them must be replaced. If the replaced page has been modified during the time it resided in the main memory, the updated version should be written back to the secondary memory.
Our wish is to replace the page which will not be accessed for the longest time in the future.
- Problem: we don't know exactly what will happen in the future.
- Solution: we predict the future by studying the access patterns up till now ("learn from history").

Replacement Algorithms
- FIFO (First-In First-Out): replace the page that has been in MM the longest time.
- LRU (Least Recently Used): replace the page that has not been accessed for the longest time.
- LFU (Least Frequently Used): replace the page with the smallest number of accesses during the latest time period.
Random replacement (used for caches) is not used for VM!
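How much the policy matters can be seen by counting page faults for the same reference string under FIFO and LRU (a small illustrative simulation; the reference string and function name are made up):

```python
def count_faults(refs, frames, policy):
    """Count page faults for a reference string with `frames` page frames.
    policy: 'fifo' evicts the oldest-loaded page,
            'lru'  evicts the least recently used page."""
    resident = []                      # resident pages, eviction candidate first
    faults = 0
    for page in refs:
        if page in resident:
            if policy == "lru":        # a hit refreshes recency under LRU
                resident.remove(page)
                resident.append(page)
            continue
        faults += 1                    # page fault: page must be loaded
        if len(resident) >= frames:
            resident.pop(0)            # evict according to the list order
        resident.append(page)
    return faults

refs = [1, 2, 3, 1, 4, 1, 2]
print(count_faults(refs, 3, "fifo"))  # 6
print(count_faults(refs, 3, "lru"))   # 5
```

Here LRU's "learn from history" heuristic pays off: it keeps the recently reused page 1 resident when page 4 arrives, while FIFO evicts it and faults again.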
Summary
- A memory system has to store very large programs and a lot of data and still provide fast access. No single type of memory can satisfy all these needs of a computer system.
- Therefore, several different storage mechanisms are organized in a layered hierarchy. The layered structure works very well due to the locality-of-reference principle.
- Cache is a hardware solution to improve memory access which is transparent to the programmers.
- Virtual memory provides a much larger address space than the available physical space, with the help of the OS (a software solution).