CSEE W4824 Computer Architecture, Fall 2012
Lecture 8: Memory Hierarchy Design: Memory Technologies and the Basics of Caches
Luca Carloni, Department of Computer Science, Columbia University in the City of New York
http://www.cs.columbia.edu/~cs4824/
CSEE 4824 Fall 2012 - Lecture 8

Announcements: Class Pre-Taping
- Wednesday 10/3: Lecture #8, regular class
- Monday 10/8: Lecture #9 (pre-taped this Wednesday 10/3 at 4:15pm in Mudd 1127)
- Wednesday 10/10: Lecture #10, guest lecturer, regular class
- Reason: the instructor is traveling to attend Embedded Systems Week 2012
- Pre-taped lectures will be shown as videos from the class PC during regular class time in Mudd 535
- The instructor's office hours are canceled for the week of October 8
Announcement: Homework #1 Results
- Average score: 31.59 / 35
- Std. deviation: 2.71

The Processor-Memory Performance Gap (log scale)
- CPU speed assumes a 25% improvement per year until 1986, 52% until 2000, 20% until 2005, and no change (on a per-core basis) until 2010
- Memory baseline: 64KB DRAM with 150-250 ns latency in 1980, and a 7% per year latency improvement
- Architects must attempt to work around this gap to minimize the memory bottleneck
How Many Memory References?
A modern high-end multi-core processor (e.g., an Intel Core i7) can generate two data-memory references per core each clock cycle. With 4 cores and a 3.2 GHz clock rate, this leads to a peak of 25.6 billion 64-bit data-memory references per second, in addition to a peak of about 12.8 billion 128-bit instruction references. How can a total peak bandwidth of 409.6 GB/sec be supported?
The memory hierarchy:
- by multiporting and pipelining the caches
- by using multiple levels of caches
- by using separate first- and sometimes second-level caches per core
- by using a Harvard architecture (split instruction/data caches) for the first level
In contrast, the peak bandwidth to DRAM main memory is only 6% of this (25 GB/sec).

Typical PC Organization
Source: B. Jacob et al., Memory Systems
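The peak-bandwidth arithmetic above can be checked directly. A quick sketch; the variable names are mine, and the core count, clock rate, and reference widths are the figures from the slide:

```python
# Peak memory-reference bandwidth for a 4-core, 3.2 GHz processor that
# issues 2 data references per core per cycle plus 1 instruction fetch.
CORES = 4
CLOCK_HZ = 3.2e9

data_refs_per_sec = CORES * 2 * CLOCK_HZ    # 25.6 billion 64-bit refs/sec
instr_refs_per_sec = CORES * 1 * CLOCK_HZ   # 12.8 billion 128-bit refs/sec

data_bw = data_refs_per_sec * 8             # 64 bits = 8 bytes per ref
instr_bw = instr_refs_per_sec * 16          # 128 bits = 16 bytes per ref
total_gb_per_sec = (data_bw + instr_bw) / 1e9
```

Each stream contributes 204.8 GB/sec, giving the 409.6 GB/sec total quoted on the slide.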
DSP-Style Memory System: Example based on the TI TMS320C3x DSP family
- dual tag-less on-chip SRAMs (visible to the programmer)
- off-chip programmable ROM (or PROM or FLASH) that holds the executable image
- off-chip DRAM used for computation
Source: B. Jacob et al., Memory Systems

Memory Technology
- at the core of the success of computers
- the most common types of memory: Dynamic Random-Access Memory (DRAM), Static Random-Access Memory (SRAM), Read-Only Memory (ROM), and Flash memory

Memory Latency Metrics
- Access time: the time between when a read is requested and when the desired word arrives
- Cycle time (>= access time): the minimum time between two requests to memory; the memory needs the address lines to be stable between accesses
A 64M-bit DRAM: Logical Organization
- highest memory cell density: only 1 transistor is used to store 1 bit
- to prevent data loss, each bit must be refreshed periodically
- a DRAM access periodically touches all bits in every row (refresh); a DRAM is unavailable due to refreshing about 5% of the time
- to limit package costs, the address lines are multiplexed: e.g., first send a 14-bit row address (Row Access Strobe), then a 14-bit column address (Column Access Strobe)

Logical Organization of Wide Data-Out DRAMs
- to output more than one bit at a time, the DRAM is organized internally with multiple arrays, each providing one bit towards the aggregate output
- wider-output DRAMs have appeared in the last two decades: DRAM parts with x16 and x32 data widths are now common, used primarily in high-performance applications
Source: B. Jacob et al., Memory Systems
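The row/column multiplexing can be sketched as follows. This is a toy model, not a description of any specific part; the 14-bit field widths follow the slide's example, and the function name is mine:

```python
ROW_BITS = 14
COL_BITS = 14

def multiplexed_address(cell):
    """Split a flat cell address into the two 14-bit halves that are sent
    in sequence over the shared address pins: first the row address
    (latched on RAS), then the column address (latched on CAS)."""
    row = (cell >> COL_BITS) & ((1 << ROW_BITS) - 1)
    col = cell & ((1 << COL_BITS) - 1)
    return row, col
```

Multiplexing halves the number of address pins at the cost of sending the address in two steps, which is why package cost motivates the scheme.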
DIMMs, Ranks, Banks, and Arrays
- a memory system may have many DIMMs, each of which may contain one or more ranks
- each rank is a set of engaged DRAM devices, each of which may have many banks
- each bank may have many constituent arrays, depending on the part's data width
Source: B. Jacob et al., Memory Systems

DRAM Generations

Year of Introd. | Chip Size | $ per GB   | Access time (new row/column) | Access time (existing row)
1980            | 64 Kbit   | $1,500,000 | 250 ns                       | 150 ns
1983            | 256 Kbit  | $500,000   | 185 ns                       | 100 ns
1985            | 1 Mbit    | $200,000   | 135 ns                       | 40 ns
1989            | 4 Mbit    | $50,000    | 110 ns                       | 40 ns
1992            | 16 Mbit   | $15,000    | 90 ns                        | 30 ns
1996            | 64 Mbit   | $10,000    | 60 ns                        | 12 ns
1998            | 128 Mbit  | $4,000     | 60 ns                        | 10 ns
2000            | 256 Mbit  | $1,000     | 55 ns                        | 7 ns
2004            | 512 Mbit  | $250       | 50 ns                        | 5 ns
2007            | 1 Gbit    | $50        | 40 ns                        | 1.25 ns
SRAMs
- an SRAM memory cell is bigger than a DRAM cell: typically 6 transistors per bit
- better for low-power applications thanks to stand-by mode: only minimal power is necessary to retain the stored state in stand-by mode
- Access time = Cycle time
- address lines are not multiplexed (for speed)
In comparable technologies:
- SRAM has only 1/4-1/8 of DRAM capacity
- SRAM cycle time is 8-16 times faster than DRAM
- SRAM cost-per-bit is 8-16 times more expensive than DRAM

ROM and Flash Memory
ROM:
- programmed once and for all at manufacture time; cannot be rewritten by the microprocessor
- 1 transistor per bit
- good for storing code and data constants in embedded applications
- replaces magnetic disks in providing nonvolatile storage, and adds a level of protection for embedded software
Flash memories:
- floating-gate technology
- read access time comparable to DRAM: 50-100 us depending on size (16M-128M)
- writes are 10-100 times slower than DRAM (plus an erasing time of 1-2 ms)
- price is cheaper than DRAM but more expensive than magnetic disks: Flash $2/GB, DRAM $40/GB, disk $0.09/GB
- initially mostly used for low-power/embedded applications, but now also as a solid-state replacement for disks, or as efficient intermediate storage between DRAM and disks
Flash Storage: Increasingly an Alternative to Magnetic Disks
- nonvolatile like disks, but with 100-1000x smaller latency; also smaller, more power efficient, and more shock resistant
- critical for mobile electronics; high volumes lead to technology improvements
- cost per GB is falling 50% per year: $2-4 per GB (in 2011), which is 2-40x higher than disk and 5-10x lower than DRAM
- unlike DRAM, flash memory bits wear out: an on-chip controller is necessary to spread the writes by remapping blocks that have been written multiple times (wear leveling)
- write limits are delaying adoption in desktops/servers, but flash is now commonly used in laptops instead of hard disks to offer faster boot times, smaller size, and longer battery life

FLASH Storage Memories: Price Decrease and Relative Performance/Power
Source: A. Leventhal, Flash Storage Memories
DRAM vs SDRAM vs DDR SDRAM
- Conventional DRAM: asynchronous interface to the memory controller; every transfer involves additional synchronization overhead
- Synchronous DRAM (SDRAM): adds a clock signal so that repeated transfers do not bear that overhead; SDRAMs typically have a programmable register that holds the number of bytes requested, so many bytes can be sent over several cycles per request
- Double Data Rate (DDR) SDRAM: doubles the peak bandwidth by transferring data on both clock edges; to supply data at these high rates, DDR SDRAMs activate multiple banks internally

Clock Rate, Bandwidth, and Names of DDR DRAMs and DIMMs in 2010

Standard | Clock Rate (MHz) | Transfers (M/sec) | DRAM name | MB/sec/DIMM | DIMM name
DDR      | 133              | 266               | DDR266    | 2128        | PC2100
DDR      | 150              | 300               | DDR300    | 2400        | PC2400
DDR      | 200              | 400               | DDR400    | 3200        | PC3200
DDR2     | 266              | 533               | DDR2-533  | 4264        | PC4300
DDR2     | 333              | 667               | DDR2-667  | 5336        | PC5300
DDR2     | 400              | 800               | DDR2-800  | 6400        | PC6400
DDR3     | 533              | 1066              | DDR3-1066 | 8528        | PC8500
DDR3     | 666              | 1333              | DDR3-1333 | 10664       | PC10700
DDR3     | 800              | 1600              | DDR3-1600 | 12800       | PC12800
DDR4     | 1066-1600        | 2133-3200         | DDR4-3200 | 17056-25600 | PC25600
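The MB/sec/DIMM column follows from the 8-byte width of a standard DIMM. A quick sanity check; the helper name is mine, and rows whose true clock rates are repeating decimals (e.g., 266 MHz standing for 266.67 MHz) can be off by a few MB/sec depending on which rounded column you start from:

```python
def dimm_mb_per_sec(mtransfers_per_sec):
    """A DDR DIMM datapath is 8 bytes wide, so the module bandwidth in
    MB/sec is (millions of transfers per second) x 8.  Transfers happen
    on both clock edges, so transfers/sec = 2 x clock rate."""
    return mtransfers_per_sec * 8
```

For example, DDR266 at 266 M transfers/sec gives 2128 MB/sec, matching the PC2100 row.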
Giving the Illusion of Unlimited, Fast Memory: Exploiting Memory Hierarchy

Technology comparison (2008 figures):
- SRAM: $2000-$5000 per GB, 0.5-2.5 ns access time
- DRAM: $20-$75 per GB, 50-70 ns access time, 4-16 GB typical capacity
- Magnetic disk: $0.2-$2 per GB, 5-10 ms access time, 4-16 TB typical capacity

Key ideas:
- the Principle of Locality
- smaller HW is typically faster
- all data in one level are usually also found in the level below

Hierarchy levels:
- registers: managed by the compiler, backed by the cache; bandwidth 20-100 GB/sec
- cache: managed by hardware, backed by main memory; bandwidth 5-10 GB/sec; energy per access about 1 nJ
- main memory: managed by the operating system, backed by disk; bandwidth 1-5 GB/sec; energy per access 1-10 nJ (per device)
- disk: managed by the operating system / operator, backed by CD or tape; bandwidth 20-150 MB/sec; energy per access 100-1000 mJ

Typical Memory Hierarchies: Servers vs. Personal Mobile Devices
Review: Principle of Locality
- Temporal locality: a resource that is referenced at one point in time will be referenced again sometime in the near future
- Spatial locality: the likelihood of referencing a resource is higher if a resource near it was just referenced
- 90/10 locality rule of thumb: a program spends 90% of its execution time in only 10% of its code
- this is a consequence of how we write programs and how we store data in memory
- hence, it is possible to predict with reasonable accuracy what instructions and data a program will use in the near future, based on its accesses in the recent past

Cache Concepts
- the term "cache": the first level of the memory hierarchy encountered from the CPU; often used to refer to any buffering technique that exploits the principle of locality
- a cache directly exploits temporal locality, providing faster access to a smaller subset of the main memory which contains a copy of recently used data
- the data in the cache are not necessarily data that are spatially close in the main memory; still, when a cache miss occurs, a fixed-size block of contiguous memory cells is retrieved from the main memory, based on the principle of spatial locality
Cache Concepts (cont.)
- Cache hit: the CPU finds the requested data item in the cache
- Cache miss: the CPU does not find the requested data item in the cache
- Miss penalty: the time to replace a block in the cache (plus the time to deliver the data item to the CPU)
  - this time depends on both latency and bandwidth: latency determines the time to retrieve the first word, bandwidth the time to retrieve the rest of the block
  - misses are handled by hardware that stalls the memory unit (and therefore the whole instruction processing, in the case of a simple single-issue processor)

Cache : Main Memory = Main Memory : Disk
- Virtual memory makes it possible to increase the amount of memory that a program can use by temporarily storing some objects on disk
- the program address space is divided into pages (fixed-size blocks), which reside either in cache/main memory or on disk
- virtual memory also provides a better way to organize the address space across programs, and the protection scheme necessary to control page access
- when the CPU references an item within a page that is not present in cache/main memory, a page fault occurs and the entire page is moved from the disk to main memory
- page faults have a long penalty time; they are handled in SW without stalling the CPU, which switches to other tasks
Caching the Address Space
- programs today are written to run on no particular HW configuration
- processes execute in imaginary address spaces that are mapped onto the memory system (including DRAM and disk) by the OS
- every HW memory structure between the CPU and the permanent store is a cache for the instructions and data in the process's address space
Source: B. Jacob et al., Memory Systems

Cache Schemes: Placing a Memory Block into a Cache Block Frame
- Block: the unit of memory transferred across hierarchy levels
- Set: a group of block frames
- the range of cache designs is really a continuum of levels of set associativity: 1-way (direct mapped), 2-way, ..., 8-way set associative, up to fully associative
- modern processors commonly use direct-mapped, 2-way set-associative, and 4-way set-associative caches
- modern memories contain millions of blocks; modern caches contain thousands of block frames
- Set Index = (Block Address) MOD (Number of Sets in the Cache)
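The set-index formula at the end of the slide covers the whole associativity continuum, since the number of sets shrinks as the number of ways grows. A minimal sketch; the function name is mine:

```python
def set_index(block_address, num_frames, ways):
    """For a cache with num_frames block frames organized into n-way sets:
    number of sets = num_frames / ways, and
    set index = (block address) mod (number of sets)."""
    num_sets = num_frames // ways
    return block_address % num_sets
```

With 8 frames, block address 26 maps to set 2 whether the cache is direct mapped (8 sets) or 2-way (4 sets); in the fully associative extreme (1 set of 8 ways) every block maps to set 0 and only the tag search distinguishes blocks.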
Example: Direct Mapped Cache with 8 Block Frames
- each memory block is mapped to one cache entry: cache index = (block address) mod (# of cache blocks)
- e.g., with 8 blocks, the 3 low-order address bits are sufficient: log2(8) = 3
- Is a block present in the cache? We must check the cache block tag, i.e., the upper bits of the block address
- the block offset addresses the bytes within a block; since block = word here, the offset is 2 bits
- How do we know if the data in a block is valid? Add a valid bit to each entry
- the tag/index boundary moves to the right as associativity increases (there is no index field in fully associative caches)

Example: Direct Mapped Cache with 1024 Block Frames and Block Size of 1 Word, for MIPS-32
- Block offset: just a byte offset, because each block of this cache contains 1 word
- Byte offset: the least significant 2 bits, because in MIPS-32 memory words are aligned to multiples of 4 bytes
- Block index: the next 10 low-order address bits, because this cache has 1024 block frames
- Block tag: the remaining 20 address bits, used to check that the address of the requested word matches the cache entry
- the index is for addressing; the tag is for checking/searching
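The MIPS-32 field breakdown above (2-bit byte offset, 10-bit index, 20-bit tag) can be sketched directly. The helper name is mine:

```python
def split_address(addr):
    """Decompose a 32-bit byte address for a direct-mapped cache with
    1024 one-word block frames:
      bits [1:0]   byte offset (words aligned to 4 bytes)
      bits [11:2]  block index (2^10 = 1024 frames)
      bits [31:12] block tag   (the remaining 20 bits)"""
    byte_offset = addr & 0x3
    index = (addr >> 2) & 0x3FF
    tag = addr >> 12
    return tag, index, byte_offset
```

On a lookup, the index selects the frame; the cache then compares the stored tag against the address tag and checks the valid bit.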
Example: 16KB Direct Mapped Cache with 256 Block Frames (of 16 Words Each)
- with a 32-bit address: 18-bit tag, 8-bit index, 4-bit block offset, 2-bit byte offset
- a single tag comparator is needed

Example: Accessing a Direct Mapped Cache with 8 Blocks and Block Size of 1 Word
Assumptions: 8 block frames, block size = 1 word, main memory of 32 words (a toy example); we consider ten subsequent accesses to memory. Initially all entries are invalid (V = N). The slides step through the accesses one per cycle; the full trace is:

Cycle | Address (binary) | Address (decimal) | Cache event
1     | 10110            | 22                | miss (Mem[10110] loaded into frame 110, tag 10)
2     | 11010            | 26                | miss (Mem[11010] loaded into frame 010, tag 11)
3     | 11010            | 26                | hit
4     | 10110            | 22                | hit
5     | 10000            | 16                | miss (Mem[10000] loaded into frame 000, tag 10)
6     | 00011            | 3                 | miss (Mem[00011] loaded into frame 011, tag 00)
7     | 10000            | 16                | hit
8     | 10010            | 18                | miss (Mem[10010] replaces Mem[11010] in frame 010, tag 10)
9     | 11010            | 26                | miss (Mem[11010] replaces Mem[10010] in frame 010, tag 11)
10    | 11010            | 26                | hit

Final cache state:

Index | V | Tag | Data
000   | Y | 10  | Mem[10000]
001   | N |     |
010   | Y | 11  | Mem[11010]
011   | Y | 00  | Mem[00011]
100   | N |     |
101   | N |     |
110   | Y | 10  | Mem[10110]
111   | N |     |
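The ten-access walkthrough above can be reproduced with a few lines of simulation. This is a sketch under the slide's assumptions (direct mapped, 8 frames, block size of 1 word); the names are mine:

```python
def simulate(addresses, num_frames=8):
    """Direct-mapped cache with one-word blocks:
    index = address mod num_frames, tag = address // num_frames.
    Returns the hit/miss outcome of each access in order."""
    cache = [None] * num_frames          # each frame holds a tag (None = invalid)
    events = []
    for addr in addresses:
        index, tag = addr % num_frames, addr // num_frames
        if cache[index] == tag:
            events.append("hit")
        else:
            events.append("miss")
            cache[index] = tag           # fetch the block on a miss
    return events

trace = [22, 26, 26, 22, 16, 3, 16, 18, 26, 26]
```

Running `simulate(trace)` yields the same six misses and four hits as the slides, including the conflict misses in cycles 8 and 9, where addresses 18 and 26 evict each other from frame 010.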
Example: Measuring Cache Size
How many total bits are required for a direct-mapped cache with 16KB of data and 4-word block frames, assuming a 32-bit address?
- 16KB of data = 4K words = 2^12 words
- block size of 4 (= 2^2) words ⇒ 2^10 blocks
- address breakdown: TAG = 18 bits, INDEX = 10 bits, block OFFSET = 2 bits, byte offset = 2 bits
- # bits in a tag = 32 - (10 + 2 + 2) = 18
- # bits in a block frame = # tag bits + # data bits + valid bit = 18 + (4 x 32) + 1 = 147
- cache size = # blocks x # bits in a block frame = 2^10 x 147 = 147 Kbits
- cache overhead = 147 Kbits / 16KB = 147 / 128 ≈ 1.15

Performance Metrics for Caches
- Miss rate (misses per memory reference): the fraction of cache accesses that result in a miss
- Misses per instruction: often reported as misses per 1000 instructions; for speculative processors we only count the instructions that commit
  misses per instruction = miss rate x (memory accesses / instruction count)
- Miss penalty: the additional clock cycles necessary to retrieve the block containing the missing word from main memory
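The bit-counting in the cache-size example above generalizes to any power-of-two direct-mapped cache. A sketch; the helper name is mine, and the defaults match the example's parameters (16KB of data, 4-word blocks, 32-bit addresses):

```python
def cache_bits(data_bytes=16 * 1024, words_per_block=4, addr_bits=32):
    """Total storage bits for a direct-mapped cache, counting tag bits
    and one valid bit per frame on top of the data bits."""
    num_blocks = data_bytes // (words_per_block * 4)            # 1024 frames
    offset_bits = (words_per_block.bit_length() - 1) + 2        # block + byte offset
    index_bits = num_blocks.bit_length() - 1                    # 10
    tag_bits = addr_bits - index_bits - offset_bits             # 18
    bits_per_frame = tag_bits + words_per_block * 32 + 1        # tag + data + valid
    return num_blocks * bits_per_frame
```

With the defaults this returns 1024 x 147 = 150,528 bits = 147 Kbits, matching the slide's overhead ratio of 147/128 ≈ 1.15 over the 16KB (128 Kbit) of data.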
Performance Metrics for Caches (continued)
- Average Memory Access Time (AMAT): AMAT = hit time + miss rate x miss penalty
- AMAT is a better estimate of cache performance, but still not a substitute for execution time
- Impact on CPU time, including the hit clock cycles in the CPU execution clock cycles:
  CPU time = (CPU execution cycles + memory stall cycles) x CCT

Performance Metrics for Caches (continued)
Breaking down the memory stall cycles:
  CPU time = IC x (CPI_exec + miss rate x memory accesses per instruction x miss penalty) x CCT
- the lower the CPI_exec, the higher the relative impact of a fixed number of cache-miss clock cycles
- the faster the CPU (i.e., the lower the CCT), the higher the number of clock cycles per miss
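The AMAT formula above is trivial to express as code, which makes it handy for quick what-if comparisons; the function name is mine:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate x miss penalty.
    All times in clock cycles (or any single consistent time unit)."""
    return hit_time + miss_rate * miss_penalty
```

For instance, a 1-cycle hit time with a 2% miss rate and a 200-cycle miss penalty gives an AMAT of 5 cycles, showing how a small miss rate still multiplies into a large average cost.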
Example: The Impact of Cache on Performance
Assumptions:
- CPI_exec = 1 clock cycle (ignoring memory stalls)
- miss rate = 2%
- miss penalty = 200 clock cycles
- average memory references per instruction = 1.5

(CPI)_no_cache = 1 + 1.5 x 200 = 301
(CPI)_with_cache = 1 + (1.5 x 0.02 x 200) = 7

The impact of the cache on CPU time is greater:
- the lower the CPI of the other instructions, for a fixed number of cache-miss clock cycles
- the lower the clock cycle time of the CPU, because the CPU then spends a larger number of clock cycles per miss (i.e., the memory portion of the CPI is higher)

Assigned Readings
- Computer Architecture: A Quantitative Approach by John Hennessy (Stanford University) and Dave Patterson (UC Berkeley), Fifth Edition, 2012, Morgan Kaufmann (Elsevier): Sections 2.1 and 2.3, and Appendix B.1
- For review purposes: see Chapter 7 of Hennessy & Patterson, Computer Organization & Design
- Assigned paper: A. Leventhal, Flash Storage Memories
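The CPI comparison in the example above can be checked with the memory-stall formula from the previous slide. A sketch; the function name is mine, and a miss rate of 1.0 models the no-cache case where every reference pays the full memory latency:

```python
def effective_cpi(cpi_exec, miss_rate, mem_refs_per_instr, miss_penalty):
    """Effective CPI once memory-stall cycles are folded in:
    CPI = CPI_exec + (memory accesses per instruction) x miss rate x miss penalty."""
    return cpi_exec + mem_refs_per_instr * miss_rate * miss_penalty
```

With the example's parameters this gives 301 without a cache and 7 with a 2%-miss-rate cache, a 43x difference in instruction throughput from the memory system alone.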