CPS101 Computer Organization and Programming Lecture 13: The Memory System. Outline of Today's Lecture. The Big Picture: Where are We Now?


CPS101 Computer Organization and Programming
Lecture 13: The Memory System
Robert Wagner

Outline of Today's Lecture
- The Memory System: the BIG Picture
- Memory Technology: SRAM
- Memory Technology: DRAM
- A Real Life Example: the SPARCstation 20's Memory System
- Summary: The Memory Hierarchy
- Direct Mapped Cache
- Reading: Chapter 7; Appendix B, pages 26-33

The Big Picture: Where are We Now?
The five classic components of a computer: Processor (Control and Datapath), Memory, Input, and Output.
Today's topic: the Memory System.

Memory Technologies

Random Access:
- "Random" is good: access time is the same for all locations.
- DRAM: Dynamic Random Access Memory
  - High density, low power, cheap, slow
  - Dynamic: content must be refreshed regularly
- SRAM: Static Random Access Memory
  - Low density, high power, expensive, fast
  - Static: content lasts "forever" (until power is lost)

"Not-so-random" Access Technology:
- Access time varies from location to location and from time to time
- Examples: disk, CD-ROM

Sequential Access Technology: access time is linear in location (e.g., tape)

Memory Hierarchy
The technologies above vary widely in speed and cost:

                  Registers   SRAM (cache)   DRAM     Disk
  Nsec/access     3           5-25           50-80    5-10 million
  Cost per MByte  $25,000     $100-250       $5-10    $0.10-0.20

Programs are known to spend most of their time accessing a small part of their address space -- the Principle of Locality. This motivates designing memory as a HIERARCHY, with small, fast levels holding recently accessed data (which you hope is likely to be accessed again soon), while larger, slower levels hold larger and larger pieces of the total memory the program needs -- with disk holding it all. The scheme gives the ILLUSION that all program memory can be accessed equally quickly: the OS and hardware move data items from level to level based on the access pattern of the program. Tape is generally so slow that it is reserved for daily or monthly backups, to protect users from disk-drive failures and viruses -- some applications may also need it to store huge databases.

Random Access (RAM) Technology
Why do computer professionals need to know about RAM technology? Because processor performance is usually limited by memory latency and bandwidth:
- Latency: the time it takes to access a word in memory.
- Bandwidth: the average speed of access to memory (words/sec).
As IC densities increase, lots of memory will fit on the processor chip, and on-chip memory can be tailored to specific needs:
- Instruction cache
- Data cache
- Write buffer
What makes RAM different from a bunch of flip-flops?
- Density: RAM is much denser.
- Speed: RAM access is slower than flip-flop (register) access.
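A quick way to feel the Principle of Locality is to time two walks over the same array: a stride-1 walk that enjoys spatial locality, and a scattered walk that defeats it. A minimal C sketch (mine, not from the lecture; N and STRIDE are arbitrary illustrative choices) -- on most machines the first loop runs several times faster even though both perform the same number of additions:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)      /* 16M ints: far larger than any cache */
    #define STRIDE 4096      /* large stride: a new cache block on almost every access */

    int main(void) {
        int *a = calloc(N, sizeof(int));
        if (!a) return 1;
        long sum = 0;

        clock_t t0 = clock();
        for (int i = 0; i < N; i++)          /* stride-1: consecutive addresses */
            sum += a[i];
        clock_t t1 = clock();
        for (int j = 0; j < STRIDE; j++)     /* same N accesses, but scattered */
            for (int i = j; i < N; i += STRIDE)
                sum += a[i];
        clock_t t2 = clock();

        printf("stride-1: %ld ticks, stride-%d: %ld ticks (sum=%ld)\n",
               (long)(t1 - t0), STRIDE, (long)(t2 - t1), sum);
        free(a);
        return 0;
    }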

Technology Trends

          Capacity         Speed
  Logic   2x in 3 years    2x in 3 years
  DRAM    4x in 3 years    1.4x in 10 years
  Disk    2x in 3 years    1.4x in 10 years

DRAM generations. At 4x every 3 years, capacity grew 4^5, roughly 1000:1, over these 15 years, while cycle time improved only about 2:1:

  Year   Size     Cycle Time
  1980   64 Kb    250 ns
  1983   256 Kb   220 ns
  1986   1 Mb     190 ns
  1989   4 Mb     165 ns
  1992   16 Mb    145 ns
  1995   64 Mb    120 ns

Static RAM: the 6-Transistor Cell
One word (row select) line gates the cell onto a complementary pair of bit lines, bit and bit'.
Write:
1. Drive the bit lines (bit = 1, bit' = 0).
2. Select the row.
3. The strength of the bit-line signal overrides the latch.
Read:
1. Precharge bit and bit' to Vdd.
2. Select the row.
3. The cell pulls one of the two lines low.
4. The sense amp on the column detects the difference between bit and bit'.

Typical SRAM Organization: 16-word x 4-bit
[Figure: an address decoder on A0-A3 drives word lines 0 through 15; each data column Din 0-Din 3 has a write driver (gated by WrEn) and a precharger at the top, and a sense amp producing Dout 0-Dout 3 at the bottom.]
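As a behavioral picture of the 16-word x 4-bit organization, the C sketch below (my illustration, not the slides') models the decoder as a one-hot word-line vector: exactly one row drives the data columns on each access:

    #include <stdio.h>
    #include <stdint.h>

    #define WORDS 16

    static uint8_t cells[WORDS];          /* each word holds 4 bits (low nibble) */

    /* Address decoder: 4-bit address -> one-hot 16-bit word-line vector. */
    static uint16_t decode(uint8_t addr) { return (uint16_t)1 << (addr & 0xF); }

    static uint8_t sram_read(uint8_t addr) {
        uint16_t word_line = decode(addr);
        for (int row = 0; row < WORDS; row++)
            if (word_line & (1u << row))  /* the selected row drives the columns */
                return cells[row] & 0xF;  /* sense amps recover the 4 data bits */
        return 0;                         /* unreachable: decode always selects a row */
    }

    static void sram_write(uint8_t addr, uint8_t din) {
        uint16_t word_line = decode(addr);
        for (int row = 0; row < WORDS; row++)
            if (word_line & (1u << row))
                cells[row] = din & 0xF;   /* write drivers overpower the selected cells */
    }

    int main(void) {
        sram_write(5, 0xA);
        printf("word 5 = 0x%X\n", sram_read(5));  /* prints 0xA */
        return 0;
    }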

The 1-Transistor DRAM Cell
A row select line gates a single transistor between the storage capacitor and the bit line.
Write:
1. Drive the bit line.
2. Select the row.
Read:
1. Precharge the bit line to Vdd.
2. Select the row.
3. The cell and the bit line share charge: only a very small voltage change appears on the bit line.
4. Sense (with a fancy sense amp) -- it can detect changes of about 1 million electrons.
5. Write: restore the value (the read destroys the stored charge).
Refresh:
1. Just do a dummy read to every cell.

Introduction to DRAM
Dynamic RAM (DRAM):
- Refresh required
- Very high density
- Low power (0.1-0.5 W active, 0.25-1 mW standby)
- Low cost per bit
- Pin sensitive (few pins); control signals include:
  - Output Enable (OE_L)
  - Write Enable (WE_L)
  - Row Address Strobe (RAS)
  - Column Address Strobe (CAS)
[Figure: an N-bit cell array; the address pins carry half the address bits at a time, first the row half, then the column half; sense amps and a column selector drive the data pin D, controlled by OE_L and WE_L.]

Classical DRAM Organization (square)
[Figure: a square RAM array in which each intersection of a bit (data) line and a word (row) select line is a 1-T DRAM cell; a row decoder takes the row address, and sense amps, a column selector, and I/O circuits take the column address.]
Row and column address together select 1 bit at a time.
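Because DRAM multiplexes its few address pins, a memory controller presents the row half of the address under RAS and then the column half under CAS. A small C sketch of the split, mine rather than the slides'; the 4096 x 4096 geometry is a hypothetical example:

    #include <stdio.h>
    #include <stdint.h>

    /* A hypothetical 16 Mb (4096 x 4096) square DRAM array: 24-bit cell address,
       presented as 12 row bits under RAS, then 12 column bits under CAS. */
    #define ROW_BITS 12
    #define COL_BITS 12

    typedef struct { uint16_t row, col; } dram_addr;

    static dram_addr split(uint32_t cell) {
        dram_addr a;
        a.row = (cell >> COL_BITS) & ((1u << ROW_BITS) - 1); /* upper half: strobed with RAS */
        a.col = cell & ((1u << COL_BITS) - 1);               /* lower half: strobed with CAS */
        return a;
    }

    int main(void) {
        dram_addr a = split(0xABCDE);
        printf("row 0x%03X (RAS), col 0x%03X (CAS)\n",
               (unsigned)a.row, (unsigned)a.col);   /* row 0x0AB, col 0xCDE */
        return 0;
    }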

Typical DRAM Organization
Typical DRAMs access multiple bits in parallel.
Example: a 2 Mb DRAM = 256K x 8 = 512 rows x 512 cols x 8 bits.
- The chip contains 8 planes, each one a 256 Kb (512 x 512) array.
- Row and column addresses are applied to all 8 planes in parallel; plane i supplies data bit D<i> (D<0> through D<7>).

Increasing Bandwidth: Interleaving
- Access pattern without interleaving: the CPU starts the access for D1, waits until D1 is available, and only then starts the access for D2.
- Access pattern with 4-way interleaving: consecutive words live in different banks (Bank 0 through Bank 3). The CPU starts an access in Bank 0, then Bank 1, Bank 2, and Bank 3 on successive cycles; by the time Bank 3 has been started, Bank 0 can be accessed again.

SPARCstation 20's Memory System Overview
[Figure: a Memory Controller connects the Memory Bus (SIMM Bus, 128-bit wide datapath) holding Memory Modules 0 through 7 to the Processor Bus (MBus, 64-bit wide); the Processor Module (MBus Module) contains the SuperSPARC processor with its on-chip Instruction Cache and Data Cache, the Register File, and an External Cache module.]
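Returning to interleaving: with word-interleaved banks, the bank number is just the low bits of the word address, so sequential words land in successive banks. A C sketch of the mapping (mine, not from the slides):

    #include <stdio.h>

    #define BANKS 4   /* 4-way interleaving, as in the slide */

    int main(void) {
        /* Sequential word addresses map to banks 0,1,2,3,0,1,... so a new
           access can start every cycle even if each bank is busy for
           BANKS cycles per access. */
        for (int word = 0; word < 8; word++)
            printf("word %d -> bank %d, index %d within bank\n",
                   word, word % BANKS, word / BANKS);
        return 0;
    }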

The Principle of Locality: A Summary
Two different types of locality are observed:
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
- Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
By taking advantage of the principle of locality, we can:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time.

The Motivation for Caches
[Figure: Processor -- Cache -- DRAM main memory.]
Motivation:
- Large memories (DRAM) are slow; small memories (SRAM) are fast.
- Make the average access time small by servicing most accesses from a small, fast memory.
- Reduce the bandwidth required of the large memory.

Levels of the Memory Hierarchy

  Level           Capacity     Access Time       Cost           Staging/Xfer Unit (managed by)
  CPU Registers   100s Bytes   <10s ns           --             instr. operands, 1-8 bytes (prog./compiler)
  Cache           K Bytes      10-100 ns         ~$200/MByte    blocks, 8-128 bytes (cache controller)
  Main Memory     M Bytes      100 ns-1000 ns    ~$6/MByte      pages, 512-4K bytes (OS)
  Disk            G Bytes      ~10,000,000 ns    ~$100/GByte    files, Mbytes (user/operator)
  Tape            infinite     sec-min           ~$5/GByte      --

The upper levels are faster; the lower levels are larger.

The Principle of Locality
[Figure: probability of reference plotted across the address space 0 to 2^n, with references concentrated in a few small regions.]
Programs access a relatively small portion of their address space during any period of a few hundred instructions.
Example: 90% of the time is spent in 10% of the code.
Two different types of locality:
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
- Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.

Memory Hierarchy: Principles of Operation
- At any given time, data is copied between only 2 adjacent levels.
- Upper level (cache): the level closer to the processor -- smaller, faster, and built from more expensive technology.
- Lower level (memory): the level further from the processor -- bigger, slower, and built from less expensive technology.
- Block: the minimum unit of information that can either be present or not present in the upper level of the hierarchy.
[Figure: the processor exchanges words with the upper level (cache); blocks Blk X and Blk Y move between the upper level and the lower level (memory).]

Memory Hierarchy: Terminology
- Hit: the data appears in some block in the upper level (e.g., Block X).
  - Hit Rate: the fraction of memory accesses found in the upper level.
  - Hit Time: the time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss.
- Miss: the data must be retrieved from a block in the lower level (e.g., Block Y).
  - Miss Rate = 1 - Hit Rate.
  - Miss Penalty = time to replace a block in the upper level + time to deliver the block to the processor.
- Hit Time << Miss Penalty.
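These definitions combine into an average memory access time. A tiny C calculator, mine, using the formula stated on the block-size-tradeoff slide at the end of this lecture and illustrative numbers drawn from the "Typical Values" table that follows:

    #include <stdio.h>

    int main(void) {
        double hit_time = 2.0;       /* cycles; typical value 1-4   */
        double miss_penalty = 50.0;  /* cycles; typical value 10-100 */
        double miss_rate = 0.02;     /* 2%; typical value 1%-20%    */

        /* Hits cost hit_time; misses cost miss_penalty. */
        double amat = hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
        printf("AMAT = %.2f cycles\n", amat);  /* 2*0.98 + 50*0.02 = 2.96 */
        return 0;
    }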

Basic Cache Terminology: Typical Values

  Block (line) size   16-128 bytes
  Hit time            1-4 cycles
  Miss penalty        10-100 cycles (and increasing)
    (access time)     (10-100 cycles)
    (transfer time)   (5-22 cycles)
  Miss rate           1% - 20%
  Cache size          8 KB - 4 MB

How Does a Cache Work?
- Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon -- so keep more recently accessed data items closer to the processor.
- Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon -- so move blocks consisting of contiguous words into the cache.

Cache Implementation Principles
A memory address must be examined quickly to determine whether the addressed location IS or IS NOT in the cache: the cache is a look-up table. BUT the hardware has time for only ONE reference to the memory that implements the cache. The hardware approximation:
- Use SOME bits of the address to determine WHERE in the cache the data at that address MUST be stored. These are the cache index bits.
- Since many addresses have the SAME index bits, the cache also stores the rest of the address. These are the cache tag bits.
- The lowest-order bits of the address select the BYTE within a cache BLOCK.

  31            10 | 9       5 | 4             0
         Tag          Index       Byte Select
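For the field boundaries just drawn (they match the 1 KB direct mapped cache with 32-byte blocks used as an example later in this lecture), the three fields fall out with shifts and masks. A minimal C sketch, mine rather than the slides'; the sample address is chosen to reproduce the later example's Tag = 0x50, Index = 0x01, Byte Select = 0x00:

    #include <stdio.h>
    #include <stdint.h>

    #define BYTE_SEL_BITS 5   /* 32-byte blocks         -> bits 4-0 */
    #define INDEX_BITS    5   /* 32 blocks in 1 KB      -> bits 9-5 */

    static uint32_t byte_sel(uint32_t a) { return a & ((1u << BYTE_SEL_BITS) - 1); }
    static uint32_t index_of(uint32_t a) { return (a >> BYTE_SEL_BITS) & ((1u << INDEX_BITS) - 1); }
    static uint32_t tag_of(uint32_t a)   { return a >> (BYTE_SEL_BITS + INDEX_BITS); }

    int main(void) {
        uint32_t a = 0x00014020;  /* hypothetical address for illustration */
        printf("tag=0x%X index=0x%02X byte=0x%02X\n",
               tag_of(a), index_of(a), byte_sel(a));  /* tag=0x50 index=0x01 byte=0x00 */
        return 0;
    }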

Direct Mapped Cache
- A direct mapped cache is an array of fixed-size blocks; each block holds consecutive bytes of main-memory data.
- The tag array holds the block addresses. A valid bit associated with each cache block tells whether the data is valid. (The data is not valid if it hasn't been loaded from memory.)
- Index: the location of a block (and its tag) in the cache.
- Byte Select: the byte location within the cache block.

  Cache-Index = (<Address> mod Cache_Size) / Block_Size
  Byte-Select = <Address> mod Block_Size
  Tag         = <Address> / Cache_Size

The Simplest Cache: Direct Mapped
Example: a 16-location memory (byte addresses 0x0-0xF) and a 4-byte direct mapped cache with 1-byte blocks (cache indices 0-3).
- Cache location 0 can be occupied by data from memory location 0, 4, 8, ... etc. -- in general, any memory location whose 2 LSBs are 0s: Address<1:0> => cache index.
- Which one should we place in the cache? How can we tell which one is in the cache?

Cache Tag and Index
Assume a 32-bit memory (byte) address and a 2^N-byte direct mapped cache with 1-byte blocks:
- Index: the lower N bits of the memory address.
- Tag: the upper (32 - N) bits of the memory address.
- The valid bit is stored as part of the cache state; the data array holds Byte 0 through Byte 2^N - 1 at indices 0 through 2^N - 1.
Example (N = 8): the cache holds 2^8 = 256 bytes, and address 0x503 has Tag = 0x5 and Index = 0x03.
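Since the cache size and block size are powers of two, the mod/div formulas above reduce to exactly the shift-and-mask extraction of the previous sketch. A quick C check, mine, for the 256-byte cache with 1-byte blocks and the example address 0x503:

    #include <stdio.h>
    #include <stdint.h>

    #define CACHE_SIZE 256u   /* 2^8 bytes  */
    #define BLOCK_SIZE 1u     /* 1-byte blocks */

    int main(void) {
        uint32_t addr  = 0x503;
        uint32_t index = (addr % CACHE_SIZE) / BLOCK_SIZE;  /* = lower 8 bits      */
        uint32_t byte  = addr % BLOCK_SIZE;                 /* = 0 (1-byte blocks) */
        uint32_t tag   = addr / CACHE_SIZE;                 /* = upper 24 bits     */
        printf("tag=0x%X index=0x%02X byte=%u\n", tag, index, byte);  /* 0x5, 0x03, 0 */
        return 0;
    }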

Cache Access Example (addresses shown as tag and index bits)
- Start up: all valid bits V are 0.
- Access 000 01 (miss). Miss handling: load the data, write the tag, and set V; the cache now holds M[00001] at index 01.
- Access 000 01 (HIT).
- Access 010 10 (miss). Load the data, write the tag, and set V; the cache now holds M[00001] and M[01010].
- Access 000 01 (HIT); access 010 10 (HIT).
Sad fact of life: a lot of misses at start up. These are compulsory misses:
- Cold-start misses
- Process migration

Definition of a Block
Block: the cache data that has its own cache tag.
Our previous extreme example -- a 4-byte direct mapped cache with Block Size = 1 byte:
- It takes advantage of temporal locality: if a byte is referenced, it will tend to be referenced again soon.
- It does NOT take advantage of spatial locality: if a byte is referenced, its adjacent bytes will be referenced soon.
To take advantage of spatial locality: increase the block size.

Example: 1 KB Direct Mapped Cache with 32-Byte Blocks
For a 2^N-byte cache with Block Size = 2^M bytes:
- The uppermost (32 - N) bits are always the Tag.
- The lowest M bits are the Byte Select bits.
- The Index is the N - M bits just higher-order than the Byte Select.
Example: Tag = 0x50 (bits 31-10, stored as part of the cache state alongside the valid bit), Index = 0x01 (bits 9-5), Byte Select = 0x00 (bits 4-0); the selected block holds Byte 0 through Byte 31.
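Putting the pieces together, the C sketch below (my illustration, not from the slides) simulates the 1 KB direct mapped cache with 32-byte blocks: the valid-bit check and tag compare decide hit or miss, and miss handling loads the block, writes the tag, and sets V, as in the access example above. The trace addresses are hypothetical, chosen to show temporal locality, spatial locality, and a conflict:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    #define NBLOCKS 32            /* 1 KB cache / 32-byte blocks */
    #define BLOCK   32

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK];
    } line_t;

    static line_t cache[NBLOCKS];  /* start up: all valid bits are 0 */

    /* Returns true on a hit; on a miss, "loads" the block and sets V. */
    static bool access_cache(uint32_t addr) {
        uint32_t index = (addr >> 5) & 0x1F;   /* bits 9-5   */
        uint32_t tag   = addr >> 10;           /* bits 31-10 */
        line_t *l = &cache[index];
        if (l->valid && l->tag == tag)
            return true;                       /* HIT */
        memset(l->data, 0, BLOCK);             /* miss handling: fetch block (stubbed), */
        l->tag   = tag;                        /* write the tag,                        */
        l->valid = true;                       /* and set V                             */
        return false;
    }

    int main(void) {
        /* same word; same word again (temporal); same block, different byte
           (spatial); different tag, same index (conflict) */
        uint32_t trace[] = { 0x14020, 0x14020, 0x14028, 0x24020 };
        for (int i = 0; i < 4; i++)
            printf("0x%05X -> %s\n", trace[i],
                   access_cache(trace[i]) ? "HIT" : "miss");
        return 0;   /* prints: miss, HIT, HIT, miss */
    }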

Block Size Tradeoff
In general, larger block sizes take advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty: it takes longer to fill up the block.
- If the block size is too big relative to the cache size, the miss rate will go up: there are too few cache blocks.
In general,

  Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate

[Figure: three curves plotted against block size. Miss rate first falls as larger blocks exploit spatial locality, then rises when too few blocks compromise temporal locality; miss penalty grows steadily with block size; average access time therefore dips and then climbs, driven by the increased miss penalty and miss rate.]
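To see the tradeoff numerically, the C sketch below sweeps the block size through the average-access-time formula. The miss-rate and miss-penalty models are toy assumptions invented only to reproduce the shape of the curves, not measured data:

    #include <stdio.h>

    int main(void) {
        double hit_time = 1.0;                       /* cycles */
        for (int bs = 4; bs <= 256; bs *= 2) {
            /* Toy model: spatial locality cuts misses as blocks grow, but a
               fixed-size cache has fewer blocks as block size grows, so very
               large blocks start to thrash. Both terms are illustrative. */
            double miss_rate    = 0.01 * (64.0 / bs + bs / 64.0);
            double miss_penalty = 20.0 + bs / 4.0;   /* latency + transfer cycles */
            double amat = hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
            printf("block %3d B: miss rate %.3f, penalty %5.1f, AMAT %.2f\n",
                   bs, miss_rate, miss_penalty, amat);
        }
        return 0;   /* AMAT dips near 32-64 B blocks, then climbs again */
    }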