The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output
Today's Topics:
- Memory technologies
- Technology trends
- Impact on performance
- Memory hierarchy
- The principle of locality
- Memory hierarchy terminology

Memory Hierarchy Technology
Random Access: "random" is good: access time is the same for all locations
- DRAM: Dynamic Random Access Memory
  - High density, low power, cheap, slow
  - Dynamic: needs to be refreshed regularly
- SRAM: Static Random Access Memory
  - Low density, high power, expensive, fast
  - Static: content lasts "forever" (until power is lost)
Not-so-random Access Technology: access time varies from location to location and from time to time
- Examples: disk, CDROM
Sequential Access Technology: access time linear in location (e.g., tape)
Main memory: DRAMs; Caches: SRAMs
cs 152 L1 6.1 / 6.2
Technology Trends
          Capacity         Speed (latency)
Logic:    2x in 3 years    2x in 3 years
DRAM:     4x in 3 years    2x in 10 years
Disk:     4x in 3 years    2x in 10 years

DRAM generations (capacity grew 1000:1; cycle time improved only 2:1):
Year    Size     Cycle Time
1980    64 Kb    250 ns
1983    256 Kb   220 ns
1986    1 Mb     190 ns
1989    4 Mb     165 ns
1992    16 Mb    145 ns
1995    64 Mb    120 ns

Who Cares About the Memory Hierarchy?
[Figure: Processor-DRAM latency gap, 1980-2000. Processor performance grows at ~60%/yr ("Moore's Law", 2x/1.5 yr); DRAM performance grows at ~9%/yr (2x/10 yrs); the processor-memory performance gap grows ~50% per year.]
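The two per-year rates in the gap figure compound dramatically over its 1980-2000 window. A quick sketch, using only the growth rates stated on the slide:

```python
# Compound the slide's annual improvement rates over the figure's 20-year span
# to see how far processor and DRAM performance diverge.
def growth(rate_per_year: float, years: int) -> float:
    """Performance relative to year 0, improving at rate_per_year (e.g. 0.60)."""
    return (1 + rate_per_year) ** years

cpu = growth(0.60, 20)    # processor: ~60%/yr
dram = growth(0.09, 20)   # DRAM latency: ~9%/yr
print(f"CPU: {cpu:.0f}x, DRAM: {dram:.1f}x, gap: {cpu / dram:.0f}x")
```

Over 20 years the processor improves roughly 12,000x while DRAM improves under 6x, so the gap opens by a factor in the low thousands: this is why the hierarchy matters.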
Impact on Performance
Suppose a processor executes with:
- Clock rate = 200 MHz (5 ns per cycle)
- Ideal CPI = 1.1
- Instruction mix: 50% arith/logic, 30% load/store, 20% control
Suppose that 10% of memory operations incur a 50-cycle miss penalty:
CPI = ideal CPI + average stalls per instruction
    = 1.1 + (0.30 memory accesses/inst x 0.10 misses/memory access x 50 cycles/miss)
    = 1.1 cycles + 1.5 cycles = 2.6
58% of the time the processor is stalled waiting for memory!
A 1% instruction miss rate would add another 0.5 cycles to the CPI.
[Pie chart: with that 1% instruction miss rate included, CPI = 3.1: ideal CPI 1.1 (35%), instruction miss stalls 0.5 (16%), data miss stalls 1.5 (49%)]

The Goal: Illusion of Large, Fast, Cheap Memory
Fact: large memories are slow; fast memories are small.
How do we create a memory that is large, cheap, and fast (most of the time)?
- Hierarchy
- Parallelism
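The slide's arithmetic can be checked directly. A minimal sketch of the formula, using only the numbers given above:

```python
# Effective CPI = ideal CPI + average memory-stall cycles per instruction.
def effective_cpi(ideal_cpi, mem_refs_per_inst, miss_rate, miss_penalty):
    stalls = mem_refs_per_inst * miss_rate * miss_penalty
    return ideal_cpi + stalls

cpi = effective_cpi(1.1, 0.30, 0.10, 50)   # data accesses only
print(round(cpi, 2))                        # 2.6
print(round((cpi - 1.1) / cpi, 2))          # 0.58: fraction of time stalled
# Adding a 1% instruction miss rate (one fetch per instruction):
print(round(cpi + 1.0 * 0.01 * 50, 2))      # 3.1
```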
An Expanded View of the Memory System
[Figure: processor (control + datapath) connected to a chain of memories. Moving away from the processor: Speed: fastest -> slowest; Size: smallest -> biggest; Cost: highest -> lowest.]

Why Hierarchy Works
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference vs. address, from 0 to 2^n - 1, sharply peaked over a small region of the address space.]
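A toy illustration of the principle (the loop, names, and sizes here are made up for illustration, not from the slide): trace the locations a simple summation loop touches and see how small its footprint is.

```python
# Toy reference trace for: for i in range(N): total += a[i]
# Spatial locality: the a[i] references hit contiguous locations.
# Temporal locality: the scalar `total` is re-referenced every iteration.
N = 1024
trace = []
for i in range(N):
    trace.append(("a", i))       # sequential array element
    trace.append(("total", 0))   # same location, touched N times

distinct = len(set(trace))
print(distinct, "distinct locations serve", len(trace), "references")
# 1025 distinct locations serve 2048 references: a small, dense footprint.
```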
Memory Hierarchy: How Does it Work?
Temporal Locality (locality in time):
=> Keep the most recently accessed data items closer to the processor.
Spatial Locality (locality in space):
=> Move blocks consisting of contiguous words to the upper levels.
[Figure: processor exchanging blocks (Blk X, Blk Y) between an upper level and a lower level of the hierarchy.]

Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (example: Block X)
- Hit Rate: the fraction of memory accesses found in the upper level
- Hit Time: time to access the upper level = RAM access time + time to determine hit/miss
Miss: data must be retrieved from a block in the lower level (example: Block Y)
- Miss Rate = 1 - Hit Rate
- Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty
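The terms above combine into the standard average-memory-access-time formula, AMAT = Hit Time + Miss Rate x Miss Penalty. A sketch with illustrative numbers (the specific values are assumptions, not from the slide):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same time units as the inputs."""
    return hit_time + miss_rate * miss_penalty

# Assumed example: 1-cycle hit, 5% miss rate, 50-cycle miss penalty.
print(amat(1, 0.05, 50))   # 3.5 cycles on average
```

Because Hit Time << Miss Penalty, even a few percent of misses dominates the average, which is why reducing miss rate is so valuable.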
Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
Levels (processor outward): Registers -> On-Chip Cache -> Second Level Cache (SRAM) -> Main Memory (DRAM) -> Secondary Storage (Disk) -> Tertiary Storage (Disk)
Speed (ns):   1s     10s   100s   10,000,000s (10s ms)   10,000,000,000s (10s sec)
Size (bytes): 100s   Ks    Ms     Gs                     Ts

How is the hierarchy managed?
- Registers <-> Memory: by the compiler (programmer?)
- Cache <-> Memory: by the hardware
- Memory <-> Disks: by the hardware and operating system (virtual memory); by the programmer (files)
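The average-access-time idea applies recursively down the hierarchy: each level's miss penalty is the next level's access time. A sketch with assumed latencies and miss rates roughly matching the slide's speed row (1s/10s/100s of ns):

```python
def access_time(hit_time, miss_rate, miss_penalty):
    """Average access time of one level backed by the next."""
    return hit_time + miss_rate * miss_penalty

dram = 100                              # ~100 ns main memory (from speed row)
l2 = access_time(10, 0.10, dram)        # ~10 ns SRAM L2, assumed 10% miss rate
l1 = access_time(1, 0.05, l2)           # ~1 ns on-chip cache, assumed 5% miss rate
print(l1)   # 2.0 ns average: near L1 speed, despite DRAM at the bottom
```

This is the "illusion" the slide describes: the common case runs at cache speed while capacity comes from DRAM and disk.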
Static RAM (SRAM)
[Figure: a 4 x 2 SRAM built from D latches: a 2-to-4 decoder driven by the address selects one of four 2-bit words (Select 0..3); write enable gates Din[1], Din[0] into the selected latches, and three-state buffers drive Dout[1], Dout[0].]
[Figure: a 32K x 8 SRAM as a block: 15 address lines, Din[7:0] and Dout[7:0], with chip select, output enable, and write enable controls.]
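A behavioral sketch of the 4 x 2 SRAM in the figure (a toy model for intuition only: the word-line decoder and three-state output buffers are collapsed into simple indexing):

```python
class ToySRAM:
    """4 words x 2 bits, mirroring the figure's decoder + latch array."""

    def __init__(self, words=4):
        self.latches = [0] * words           # one pair of D latches per word

    def access(self, address, din=0, write_enable=False):
        # The 2-to-4 decoder selects exactly one word line.
        if write_enable:
            self.latches[address] = din & 0b11   # latches capture Din
            return None
        return self.latches[address]             # buffers drive Dout

ram = ToySRAM()
ram.access(2, din=0b10, write_enable=True)
print(ram.access(2))   # 2
```

Unlike DRAM below, nothing here decays: a latch holds its value as long as power is applied, which is the "static" in SRAM.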
Dynamic RAM (DRAM)
[Figure: a DRAM cell: a single pass transistor plus a storage capacitor at the intersection of a word line and a bit line.]
[Figure: a 32K x 8 DRAM: Address[14:6] drives a 9-to-512 row decoder; Address[5:0] selects among 64 columns per output bit to produce Dout7..Dout0.]
[Figure: a 4M x 1 DRAM organized as a 2048 x 2048 array: an 11-to-2048 row decoder selects a row, the row is captured in 2048 column latches, and Address[10:0] selects one bit for Dout.]
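The "dynamic" property from the first slide (a capacitor leaks, so every cell must be refreshed regularly) can be sketched behaviorally; the retention time used here is an arbitrary assumption, not a figure from the slides:

```python
class ToyDRAMCell:
    """One pass-transistor + capacitor cell; charge decays unless refreshed."""

    RETENTION = 64          # assumed retention, in arbitrary time ticks

    def __init__(self, value=0):
        self.value = value
        self.age = 0        # ticks since the capacitor was last charged

    def tick(self, n=1):
        self.age += n       # time passes; charge leaks away

    def valid(self):
        return self.age < self.RETENTION

    def refresh(self):
        if self.valid():
            self.age = 0    # read and rewrite the charge before it decays

cell = ToyDRAMCell(1)
cell.tick(60); cell.refresh()   # refreshed in time
cell.tick(60)
print(cell.valid())             # True: the refresh preserved the bit
cell.tick(10)
print(cell.valid())             # False: 70 ticks with no refresh
```

This refresh overhead, paid across every row on a schedule, is part of why DRAM is slower than SRAM despite its density advantage.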