CS 152 Lecture 16: Locality and Memory Technology

- Florence Rice
- 5 years ago

Transcription
CS 152 Lec16.1: Recap
CS152: Computer Architecture and Engineering
Lecture 16: Locality and Memory Technology
- MIPS-I instruction set architecture made the pipeline visible (delayed branch, delayed load)
- More performance from deeper pipelines and parallelism
- Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency
- SW pipelining: symbolic loop unrolling gets the most from the pipeline with little code expansion and little overhead
- Dynamic branch prediction + early branch address for speculative execution
- Superscalar and VLIW: CPI < 1

CS 152 Lec16.2: Recap (continued)
- Dynamic issue vs. static issue
- The more instructions that issue at the same time, the larger the penalty of hazards
- Intel EPIC in IA-64 is a hybrid: compact VLIW + data hazard check

CS 152 Lec16.3: The Big Picture: Where are We Now?
The five classic components of a computer: Processor (Control + Datapath), Memory, Input, Output.
Today's topics:
- Locality and memory hierarchy
- SRAM memory technology
- DRAM memory technology
- Memory organization

CS 152 Lec16.4: Technology Trends (from 1st lecture)

          Capacity        Speed (latency)
Logic:    2x in 3 years   2x in 3 years
DRAM:     4x in 3 years   2x in 10 years
Disk:     4x in 3 years   2x in 10 years

DRAM generations (capacity improved 1000:1, cycle time only 2:1):

Year   Size     Cycle time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns

CS 152 Lec16.5: Who Cares About the Memory Hierarchy?
[Figure: processor-DRAM memory gap (latency), log scale. CPU performance grows 60%/yr (Moore's Law, 2X/1.5 yr); DRAM performance grows only 9%/yr (2X/10 yrs); the processor-memory performance gap grows about 50%/year.]

CS 152 Lec16.6: Today's Situation: Microprocessor
- Rely on caches to bridge the microprocessor-DRAM performance gap
- Time of a full cache miss, in instructions executed:
  1st Alpha (7000):   340 ns / 5.0 ns =  68 clks x 2, or 136 instructions
  2nd Alpha (8400):   266 ns / 3.3 ns =  80 clks x 4, or 320 instructions
  3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6, or 648 instructions
- 1/2X latency x 3X clock rate x 3X instructions/clock => ~5X
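The miss-cost arithmetic for the three Alpha generations above can be sketched directly. This is a minimal Python sketch; the function name and the rounding are my own, and the slide's own figures are approximate:

```python
# Cost of a full cache miss in lost instruction slots:
# (miss latency / cycle time) clocks, times the number of
# instructions the CPU could have issued per clock.
def miss_cost(miss_ns, cycle_ns, issue_width):
    clocks = miss_ns / cycle_ns
    return clocks, clocks * issue_width

for name, miss, cyc, width in [("1st Alpha (7000)", 340, 5.0, 2),
                               ("2nd Alpha (8400)", 266, 3.3, 4),
                               ("3rd Alpha",        180, 1.7, 6)]:
    clocks, instrs = miss_cost(miss, cyc, width)
    print(f"{name}: {clocks:.0f} clks, {instrs:.0f} instruction slots")
```

The point of the loop is the trend: even though latency halves, the faster clock and wider issue make each miss cost more instructions.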
CS 152 Lec16.7: Memory Impact on Performance
Suppose a processor executes at:
- Clock rate = 200 MHz (5 ns per cycle)
- Ideal CPI = 1.1
- Instruction mix: 50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get a 50-cycle miss penalty:
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles) + (0.30 (data mops/instr) x 0.10 (miss/data mop) x 50 (cycles/miss))
    = 1.1 cycles + 1.5 cycles = 2.6
58% of the time the processor is stalled waiting for memory!
A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
[Pie chart: Ideal CPI (1.1) 35%, Data miss (1.5) 49%, Inst miss (0.5) 16%]

CS 152 Lec16.8: The Goal: Illusion of Large, Fast, Cheap Memory
Fact: large memories are slow; fast memories are small.
How do we create a memory that is large, cheap, and fast (most of the time)?
- Hierarchy
- Parallelism

CS 152 Lec16.9: An Expanded View of the Memory System
[Figure: Processor (Control + Datapath) backed by a chain of memories. Speed: fastest to slowest; Size: smallest to biggest; Cost: highest to lowest.]

CS 152 Lec16.10: Why Hierarchy Works
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference vs. address, over the address space 0 to 2^n - 1]

CS 152 Lec16.11: Memory Hierarchy: How Does it Work?
- Temporal locality (locality in time): keep the most recently accessed data items closer to the processor.
- Spatial locality (locality in space): move blocks consisting of contiguous words to the upper levels.
[Figure: processor exchanging block Blk X with the upper level and Blk Y with the lower level]

CS 152 Lec16.12: Memory Hierarchy: Terminology
- Hit: data appears in some block in the upper level (example: Block X)
  - Hit rate: the fraction of memory accesses found in the upper level
  - Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
- Miss: data needs to be retrieved from a block in the lower level (Block Y)
  - Miss rate = 1 - (hit rate)
  - Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor
- Hit time << miss penalty
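The CPI example above, including the extra instruction-miss term, works out as follows. This is just the slide's arithmetic transcribed into Python (variable names are my own):

```python
# CPI = ideal CPI + average memory stalls per instruction, for the example:
# ideal CPI 1.1, 30% loads/stores, 10% data-miss rate, 50-cycle miss
# penalty, plus a 1% instruction miss rate.
ideal_cpi   = 1.1
data_stalls = 0.30 * 0.10 * 50   # stalls per instruction from data misses
inst_stalls = 0.01 * 50          # stalls per instruction from inst misses
cpi = ideal_cpi + data_stalls + inst_stalls

print(f"CPI = {cpi:.1f}")
print(f"time stalled on memory = {(data_stalls + inst_stalls) / cpi:.0%}")
```

With the instruction misses included, total CPI is 3.1 and the three components split roughly 35% / 49% / 16%, which is where the pie-chart percentages come from.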
CS 152 Lec16.13: Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
Levels: Registers -> On-chip cache -> Second-level cache (SRAM) -> Main memory (DRAM) -> Secondary storage (disk) -> Tertiary storage (disk/tape).
Speed (ns): 1s, 10s, 100s, 10,000,000s (10s of ms), 10,000,000,000s (10s of sec)
Size (bytes): 100s, Ks, Ms, Gs, Ts

CS 152 Lec16.14: How is the Hierarchy Managed?
- Registers <-> memory: by the compiler (programmer?)
- Cache <-> memory: by the hardware
- Memory <-> disks: by the hardware and operating system (virtual memory), and by the programmer (files)

CS 152 Lec16.15: Memory Hierarchy Technology
- Random access: "random" is good: access time is the same for all locations
  - DRAM: Dynamic Random Access Memory: high density, low power, cheap, slow; dynamic: needs to be refreshed regularly
  - SRAM: Static Random Access Memory: low density, high power, expensive, fast; static: content will last "forever" (until it loses power)
- "Not-so-random" access technology: access time varies from location to location and from time to time; examples: disk, CDROM
- Sequential access technology: access time linear in location (e.g., tape)
- The next two lectures will concentrate on random access technology

CS 152 Lec16.16: Main Memory Background
Performance of main memory:
- Latency: cache miss penalty
  - Access time: time between the request and when the word arrives
  - Cycle time: time between requests
- Bandwidth: I/O and large-block miss penalty (L2)
Main memory is DRAM: Dynamic Random Access Memory
- Dynamic, since it needs to be refreshed periodically (8 ms)
- Addresses divided into 2 halves (memory as a 2D matrix):
  - RAS or Row Access Strobe
  - CAS or Column Access Strobe
Cache uses SRAM: Static Random Access Memory
- No refresh (6 transistors/bit vs. 1 transistor)
- Size: DRAM/SRAM 4-8; cost & cycle time: SRAM/DRAM 8-16

CS 152 Lec16.17: Random Access Memory (RAM) Technology
Why do computer designers need to know about RAM technology?
- Processor performance is usually limited by memory bandwidth
- As IC densities increase, lots of memory will fit on the processor chip
  - Tailor on-chip memory to specific needs: instruction cache, data cache, write buffer
- What makes RAM different from a bunch of flip-flops? Density: RAM is much denser

CS 152 Lec16.18: Static RAM Cell
6-transistor SRAM cell (bit and bit-bar lines; the word line does row select).
Write:
1. Drive the bit lines (bit = 1, bit-bar = 0)
2. Select the row (word line)
Read:
1. Precharge bit and bit-bar to Vdd
2. Select the row
3. The cell pulls one bit line low
4. A sense amp on the column detects the difference between bit and bit-bar
(Loads are replaced with pull-ups to save area.)
CS 152 Lec16.19: Typical SRAM Organization: 16-word x 4-bit
[Figure: a 16 x 4 array of SRAM cells. Each bit column has a write driver, a bit-line precharger, and a sense amp; an address decoder drives row lines Word 0 through Word 15; control signals are WrEn and Precharge, with Din 0-3 / Dout 0-3 per bit.]

CS 152 Lec16.20: Logic Diagram of a Typical SRAM
- Write Enable is usually active low (WE_L)
- Din and Dout are combined (D) to save pins; a new control signal, output enable (OE_L), is needed
  - WE_L asserted (low), OE_L deasserted (high): D serves as the data input pin
  - WE_L deasserted (high), OE_L asserted (low): D is the data output pin
  - Both WE_L and OE_L asserted: the result is unknown. Don't do that!
- Although we could change the VHDL to do whatever we desire, we must do the best with what we've got (vs. what we need)

CS 152 Lec16.21: Typical SRAM Timing (2^N words x M bits)
[Timing diagram: a write drives Data In and the write address with WE_L asserted, observing the write setup and write hold times; a read drives the read address with OE_L asserted, and Data Out becomes valid after the read access time (high-Z otherwise).]

CS 152 Lec16.22: Problems with SRAM
- Six transistors use up a lot of area
- Consider: a zero is stored in the cell
  - Transistor N1 will try to pull bit to 0
  - Transistor P2 will try to pull bit-bar to 1
- But the bit lines are precharged high: are P1 and P2 necessary?

CS 152 Lec16.23: 1-Transistor Memory Cell (DRAM)
Write:
1. Drive the bit line
2. Select the row
Read:
1. Precharge the bit line to Vdd
2. Select the row
3. The cell and bit line share charge: only a very small voltage change on the bit line
4. Sense (fancy sense amp): can detect changes of ~10-100k electrons
5. Write: restore the value
Refresh:
1. Just do a dummy read to every cell

CS 152 Lec16.24: Classical DRAM Organization (square)
[Figure: a row decoder takes the row address and drives row (word) select lines into the DRAM array; a column selector and I/O circuits take the column address and select among the bit (data) lines. Each intersection represents a 1-T DRAM cell. The row and column address together select 1 bit at a time.]
CS 152 Lec16.25: DRAM Logical Organization (4 Mbit)
[Figure: an 11-bit column address feeds a column decoder through sense amps and I/O; the memory array is 2,048 x 2,048; the row address selects a word line. Roughly the square root of the bits are touched per RAS/CAS.]

CS 152 Lec16.26: DRAM Physical Organization (4 Mbit)
[Figure: the column address fans out to 8 I/Os; the array is split into blocks (Block 0 through Block 3), each with a block row decoder (9 : 512) driving word lines into the storage cells.]

CS 152 Lec16.27: Memory Systems
[Figure: a DRAM controller with a timing controller drives an n-bit address, RAS_L, and CAS_L to an array of 2^n x 1 DRAM chips, with bus drivers on the data path. Tc = Tcycle + Tcontroller + Tdriver.]

CS 152 Lec16.28: Logic Diagram of a Typical DRAM (256K x 8: 9 address pins, 8 data pins)
- Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
- Din and Dout are combined (D):
  - WE_L asserted (low), OE_L deasserted (high): D serves as the data input pin
  - WE_L deasserted (high), OE_L asserted (low): D is the data output pin
- Row and column addresses share the same pins (A):
  - RAS_L goes low: pins A are latched in as the row address
  - CAS_L goes low: pins A are latched in as the column address
  - RAS/CAS are edge-sensitive

CS 152 Lec16.29: Key DRAM Timing Parameters
- t_RAC: minimum time from RAS falling to valid data output. Quoted as the speed of a DRAM; a fast 4 Mb DRAM has t_RAC = 60 ns.
- t_RC: minimum time from the start of one row access to the start of the next. t_RC = 110 ns for a 4 Mb DRAM with a t_RAC of 60 ns.
- t_CAC: minimum time from CAS falling to valid data output. 15 ns for a 4 Mb DRAM with a t_RAC of 60 ns.
- t_PC: minimum time from the start of one column access to the start of the next. 35 ns for a 4 Mb DRAM with a t_RAC of 60 ns.

CS 152 Lec16.30: DRAM Performance
- A 60 ns (t_RAC) DRAM can perform a row access only every 110 ns (t_RC)
- It can perform a column access (t_CAC) in 15 ns, but the time between column accesses is at least 35 ns (t_PC)
  - In practice, external address delays and turning around buses make it 40 to 50 ns
- These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead
  - Driving parallel DRAMs, the external memory controller, bus turnaround, the SIMM module, pins...
  - 180 ns to 250 ns latency from processor to memory is good for a 60 ns (t_RAC) DRAM
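The timing parameters above bound how fast a single chip can deliver bits. A quick back-of-the-envelope in Python (my own sketch, assuming a x1-wide part and the slide's 4 Mb figures):

```python
# Peak per-chip transfer rates implied by the 4 Mb DRAM timings:
# a full row cycle every t_RC = 110 ns, but successive column
# accesses to an already-open row every t_PC = 35 ns.
t_rc = 110e-9   # row cycle time, seconds
t_pc = 35e-9    # column (page-mode) cycle time, seconds

row_rate  = 1 / t_rc   # bits/s if every access opens a new row
page_rate = 1 / t_pc   # bits/s streaming columns within one row
print(f"row-cycle rate: {row_rate / 1e6:.1f} Mbit/s per chip")
print(f"page-mode rate: {page_rate / 1e6:.1f} Mbit/s per chip")
```

The roughly 3x ratio between the two rates is the motivation for the fast-page-mode organization described later in the lecture.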
CS 152 Lec16.31: DRAM Write Timing
Every DRAM access begins with the assertion of RAS_L. There are two ways to write: early or late relative to CAS.
[Timing diagram for a 256K x 8 DRAM (9 address pins, 8 data pins): RAS_L falls with the row address on A, then CAS_L falls with the column address. Early write cycle: WE_L asserted before CAS_L; late write cycle: WE_L asserted after CAS_L.]

CS 152 Lec16.32: DRAM Read Timing
Every DRAM access begins with the assertion of RAS_L. There are two ways to read: early or late relative to CAS.
[Timing diagram: early read cycle: OE_L asserted before CAS_L; late read cycle: OE_L asserted after CAS_L. Data Out is valid after the output-enable delay and high-Z otherwise.]

CS 152 Lec16.33: Main Memory Performance
- Simple: CPU, cache, bus, and memory are all the same width (32 bits)
- Wide: CPU/mux 1 word; mux/cache, bus, and memory N words (Alpha: 64 bits & 256 bits)
- Interleaved: CPU, cache, and bus 1 word; memory has N modules (4 modules here); the example is word-interleaved

CS 152 Lec16.34: Cycle Time versus Access Time
- DRAM (read/write) cycle time >> DRAM (read/write) access time (2:1; why?)
- DRAM (read/write) cycle time: how frequently can you initiate an access?
- DRAM (read/write) access time: how quickly will you get what you want once you initiate an access?
- DRAM bandwidth limitation: how much data can you get from the memory?

CS 152 Lec16.35: Increasing Bandwidth: Interleaving
Access pattern without interleaving: the CPU must wait a full memory cycle between starting the access for D1 and starting the access for D2.
Access pattern with 4-way interleaving: access Bank 0, then Bank 1, then Bank 2, then Bank 3; by then we can access Bank 0 again.

CS 152 Lec16.36: Main Memory Performance
Timing model to access one 4-word cache block: 1 cycle to send the address, 6 cycles access time, 1 cycle to send data.
- Simple memory: miss penalty = 4 x (1 + 6 + 1) = 32
- Wide memory: miss penalty = 1 + 6 + 1 = 8
- Interleaved memory: miss penalty = 1 + 6 + 4 x 1 = 11
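The three miss penalties in the timing model above fall out of one-line formulas. A minimal Python sketch of the slide's model (the variable names are my own):

```python
# Miss penalty for a 4-word cache block under the three organizations:
# 1 cycle to send the address, 6 cycles of DRAM access time,
# 1 cycle to transfer each word (or block) on the bus.
addr, access, xfer, words = 1, 6, 1, 4

simple      = words * (addr + access + xfer)   # one narrow access per word
wide        = addr + access + xfer             # whole block in one access
interleaved = addr + access + words * xfer     # banks overlap their accesses
print(simple, wide, interleaved)               # 32 8 11
```

Interleaving gets close to the wide organization's penalty while keeping a one-word bus, which is why it was the common compromise.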
CS 152 Lec16.37: Independent Memory Banks
How many banks?
- Number of banks >= number of clocks to access a word in a bank
- This is for sequential accesses; otherwise you will return to the original bank before it has the next word ready
- Increasing DRAM size => fewer chips => harder to have enough banks

CS 152 Lec16.38: DRAM Growth and Minimum PC Memory Size
- Bits/chip of DRAM grow 50%-60%/yr
- Nathan Myhrvold (Microsoft): mature software grows 33%/yr (for NT)
- MB/$ of DRAM grows 25%-30%/yr (from Pete MacWilliams, Intel)
[Table, partially garbled: minimum PC memory size (4 MB up through 256 MB) against DRAM generation (1 Mb, 4 Mb, 16 Mb, 64 Mb, 256 Mb, 1 Gb). Because DRAM bits/chip grow ~60%/year while memory per system grows only 25%-30%/year, the number of DRAMs per system keeps falling (e.g., 32 chips, then 8, then 2).]

CS 152 Lec16.39: Page Mode DRAM: Motivation
Regular DRAM organization:
- N rows x N columns x M bits
- Read and write M bits at a time
- Each M-bit access requires a RAS/CAS cycle
Fast page mode DRAM:
- An N x M register saves a whole row

CS 152 Lec16.40: Fast Page Mode Operation
- After a row is read into the N x M register, only CAS is needed to access other M-bit blocks on that row
- RAS_L remains asserted while CAS_L is toggled
[Timing: a regular DRAM needs a full RAS_L cycle for each M-bit access; in page mode, one RAS_L assertion covers the 1st, 2nd, 3rd, and 4th M-bit accesses with successive column addresses.]

CS 152 Lec16.41: DRAM vs. Desktop Microprocessor Cultures

                   DRAM                      Microprocessor
Standards          pinout, package,          binary compatibility,
                   refresh rate, capacity    IEEE 754, I/O bus
Sources            Multiple                  Single
Figures of merit   1) capacity, 2) $/bit,    1) SPEC speed,
                   3) BW, 4) latency         2) cost
Improve rate/year  1) 60%, 2) 25%,           1) 60%,
                   3) 20%, 4) 7%             2) little change

CS 152 Lec16.42: DRAM Design Goals
- Reduce cell size 2.5x, increase die size 1.5x
- Sell 10% of a single DRAM generation: 6.25 billion DRAMs sold
- Phases: engineering samples, first customer ship (FCS), mass production
- Fastest to FCS and mass production wins share
- Die size, testing time, and yield => profit; yield >> 60% (redundant rows/columns to repair flaws)
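The bank-count rule above ("number of banks >= clocks to access a word") can be checked with a tiny round-robin model. This is my own sketch, assuming one address is issued per clock to bank i mod B and each bank stays busy for A clocks after an access:

```python
# Stream n_words sequential words across `banks` interleaved banks,
# each needing `access_clocks` to deliver a word. Returns the clock at
# which the last word's data is ready.
def stream_time(n_words, banks, access_clocks):
    free_at = [0] * banks                  # clock when each bank is next free
    t = 0
    for i in range(n_words):
        b = i % banks
        t = max(t + 1, free_at[b])         # issue one access per clock, or stall
        free_at[b] = t + access_clocks
    return free_at[(n_words - 1) % banks]

# 6-clock banks: 6 banks keep the stream stall-free, 4 banks do not.
print(stream_time(8, 6, 6), stream_time(8, 4, 6))
```

With 6 banks the addresses issue one per clock and the last of 8 words finishes at clock 14; with only 4 banks the stream returns to a still-busy bank and finishes later, which is exactly the hazard the slide warns about.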
CS 152 Lec16.43: DRAM History
- DRAMs: capacity +60%/yr, cost -30%/yr
- 2.5X cells/area, 1.5X die size in 3 years
- A '97 DRAM fab line costs $1B to $2B
- DRAM-only fabs: optimized for density, leakage vs. speed
- Rely on an increasing number of computers and more memory per computer (60% of the market)
  - The SIMM or DIMM is the replaceable unit => computers can use any generation of DRAM
- Commodity, second-source industry => high volume, low profit, conservative
  - Little organization innovation in 20 years: page mode, EDO, synchronous DRAM
- Order of importance: 1) cost/bit, 2) capacity
  - RAMBUS: 10X BW, but at what cost increase?

CS 152 Lec16.44: Summary
Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time.

CS 152 Lec16.45: Summary: Processor-Memory Performance Gap "Tax"

Processor         % Area (~cost)   % Transistors (~power)
Alpha 21164       37%              77%
StrongArm SA110   61%              94%
Pentium Pro       64%              88%

(Pentium Pro uses 2 dies per package: Proc/I$/D$ + L2$.)
Caches have no inherent value; they only try to close the performance gap.
More informationCPE 631 Lecture 04: CPU Caches
Lecture 04 CPU Caches Electrical and Computer Engineering University of Alabama in Huntsville Outline Memory Hierarchy Four Questions for Memory Hierarchy Cache Performance 26/01/2004 UAH- 2 1 Processor-DR
More informationLecture-14 (Memory Hierarchy) CS422-Spring
Lecture-14 (Memory Hierarchy) CS422-Spring 2018 Biswa@CSE-IITK The Ideal World Instruction Supply Pipeline (Instruction execution) Data Supply - Zero-cycle latency - Infinite capacity - Zero cost - Perfect
More informationUNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568/668
UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Computer Architecture ECE 568/668 Part 11 Memory Hierarchy - I Israel Koren ECE568/Koren Part.11.1 ECE568/Koren Part.11.2 Ideal Memory
More informationChapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)
Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance
More informationCPU issues address (and data for write) Memory returns data (or acknowledgment for write)
The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives
More informationTextbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, Textbook web site:
Textbook: Burdea and Coiffet, Virtual Reality Technology, 2 nd Edition, Wiley, 2003 Textbook web site: www.vrtechnology.org 1 Textbook web site: www.vrtechnology.org Laboratory Hardware 2 Topics 14:332:331
More informationChapter Seven. Large & Fast: Exploring Memory Hierarchy
Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM
More informationMultilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationLRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.
LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E
More informationCOEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory
1 COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationLecture 18: Memory Hierarchy Main Memory and Enhancing its Performance Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 18: Memory Hierarchy Main Memory and Enhancing its Performance Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Reducing Miss Penalty Summary Five techniques Read priority
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationChapter 7 Large and Fast: Exploiting Memory Hierarchy. Memory Hierarchy. Locality. Memories: Review
Memories: Review Chapter 7 Large and Fast: Exploiting Hierarchy DRAM (Dynamic Random Access ): value is stored as a charge on capacitor that must be periodically refreshed, which is why it is called dynamic
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationMainstream Computer System Components
Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-2800 (DDR3-600) 200 MHz (internal base chip clock) 8-way interleaved
More informationWhere We Are in This Course Right Now. ECE 152 Introduction to Computer Architecture. This Unit: Caches and Memory Hierarchies.
Introduction to Computer Architecture Caches and emory Hierarchies Copyright 2012 Daniel J. Sorin Duke University Slides are derived from work by Amir Roth (Penn) and Alvin Lebeck (Duke) Spring 2012 Where
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationEEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality
Administrative EEC 7 Computer Architecture Fall 5 Improving Cache Performance Problem #6 is posted Last set of homework You should be able to answer each of them in -5 min Quiz on Wednesday (/7) Chapter
More informationMemory Hierarchy and Caches
Memory Hierarchy and Caches COE 301 / ICS 233 Computer Organization Dr. Muhamed Mudawar College of Computer Sciences and Engineering King Fahd University of Petroleum and Minerals Presentation Outline
More informationEITF20: Computer Architecture Part4.1.1: Cache - 2
EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss
More informationCSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]
CSF Cache Introduction [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user with as much
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationMemory Hierarchy Basics. Ten Advanced Optimizations. Small and Simple
Memory Hierarchy Basics Six basic cache optimizations: Larger block size Reduces compulsory misses Increases capacity and conflict misses, increases miss penalty Larger total cache capacity to reduce miss
More informationLocality. Cache. Direct Mapped Cache. Direct Mapped Cache
Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality: it will tend to be referenced again soon spatial locality: nearby items will tend to be
More informationLecture 17 Introduction to Memory Hierarchies" Why it s important " Fundamental lesson(s)" Suggested reading:" (HP Chapter
Processor components" Multicore processors and programming" Processor comparison" vs." Lecture 17 Introduction to Memory Hierarchies" CSE 30321" Suggested reading:" (HP Chapter 5.1-5.2)" Writing more "
More informationQuestion?! Processor comparison!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 5.1-5.2!! (Over the next 2 lectures)! Lecture 18" Introduction to Memory Hierarchies! 3! Processor components! Multicore processors and programming! Question?!
More informationSpeicherarchitektur. Who Cares About the Memory Hierarchy? Technologie-Trends. Speicher-Hierarchie. Referenz-Lokalität. Caches
11 Speicherarchitektur Speicher-Hierarchie Referenz-Lokalität Caches 1 Technologie-Trends Kapazitäts- Geschwindigkeitssteigerung Logik 2 fach in 3 Jahren 2 fach in 3 Jahren DRAM 4 fach in 3 Jahren 1.4
More information