CSE 141 Computer Architecture Summer Session I, Lectures 10 Advanced Topics, Memory Hierarchy and Cache. Pramod V. Argade


1 CSE 141 Computer Architecture Summer Session I, 2004 Lectures 10 Advanced Topics, Memory Hierarchy and Cache Pramod V. Argade

2 CSE141: Introduction to Computer Architecture
Instructor: Pramod V. Argade. Office Hours: Tue. 7:30-8:30 PM (AP&M 4141), Wed. 4:30-5:30 PM (AP&M 4141)
TA: Anjum Gupta (a3gupta@cs.ucsd.edu). Office Hour: Mon/Wed 12-1 PM
TA: Chengmo Yang (c5yang@cs.ucsd.edu). Office Hour: Mon/Thu 2-3 PM
Lecture: Mon/Wed. 6-8:50 PM, Center 109
Textbook: Computer Organization & Design: The Hardware/Software Interface, 2nd Edition. Authors: Patterson and Hennessy
Web-page: 2

3 Announcements
Reading Assignment: Advanced Topics, Sections 6.8 (Monday); Caches, Sections (Monday); Virtual Memory, Section (Wednesday)
Homework 6: Due Fri., July 30 during discussion. Cache: 7.7, 7.8, 7.9, 7.15, 7.16, 7.18, 7.20, 7.21; Virtual Memory: 7.32, 7.33
Quiz 6: When: Wednesday, July 28, first 10 minutes of the class. Topic: Caches, Chapter 7. Need: paper, pen
Final Exam: When: Sat., July 31, 7-10 PM, Center 101 (Note room change!) 3

4 CSE141 Course Schedule
1. Mon. 6/28, 6-8:50 PM, Center 109: Introduction, Ch. 1; ISA, Ch. 3
2. Wed. 6/30, 6-8:50 PM, Center 109: Performance, Ch. 2; Arithmetic, Ch. 4 (Quiz: ISA Ch. 3)
Mon. 7/5: No class, July 4th holiday
3. Wed. 7/7, 6-8:50 PM, Center 109: Arithmetic, Ch. 4 cont.; Single-cycle CPU, Ch. 5 (Quiz: Performance Ch. 2; Homework #1)
4. Mon. 7/12, 6-8:50 PM, Center 109: Single-cycle CPU, Ch. 5 cont.; Multi-cycle CPU, Ch. 5 (Quiz: Arithmetic Ch. 4; Homework #2)
5. Tue. 7/13, 7:30-8:50 PM, Center 109 (July 5th make-up class): Multi-cycle CPU, Ch. 5 cont.
6. Wed. 7/14, 6-8:50 PM, Center 109: Single- and multicycle CPU examples and review for midterm (Homework #3)
7. Mon. 7/19, 6-8:50 PM, Center 109: Mid-term Exam; Exceptions (Quiz: Single-cycle CPU Ch. 5)
8. Tue. 7/20, 7:30-8:50 PM, Center 109 (July 5th make-up class): Pipelining, Ch. 6 (Homework #4)
9. Wed. 7/21, 6-8:50 PM, Center 109: Hazards, Ch. 6
10. Mon. 7/26, 6-8:50 PM, Center 109: Memory Hierarchy & Caches, Ch. 7 (Quiz: Hazards Ch. 6; Homework #5)
11. Wed. 7/28, 6-8:50 PM, Center 109: Virtual Memory, Ch. 7; Course Review (Quiz: Cache Ch. 7; Homework #6)
Sat. 7/31, 7-10 PM: Final Exam 4

5 Advanced Techniques 5

6 Advanced Techniques
CPU time = Seconds/Program = (Instructions/Program) * (Cycles/Instruction) * (1/Clock Frequency)
Superpipelining: More pipeline stages. Operand forwarding becomes complicated. Branch penalty is high, so must use a branch prediction scheme. Enables running the clock at a higher frequency.
Superscalar: Multiple pipelines executing in parallel. Each pipeline may be dedicated to a particular task (integer, float, mem). Challenge is finding instructions to execute in parallel. Decreases CPI. 6
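The CPU time equation above can be checked with a quick sketch. The instruction count, CPI, and clock values below are hypothetical, chosen only to illustrate how a superscalar design that lowers CPI lowers CPU time at the same clock frequency:

```python
def cpu_time(instructions, cpi, clock_hz):
    """CPU time = (Instructions/Program) * (Cycles/Instruction) * (1/Clock Freq)."""
    return instructions * cpi / clock_hz

# Hypothetical program: 1 billion instructions on a 1 GHz clock.
base = cpu_time(1e9, 2.0, 1e9)         # scalar pipeline, CPI = 2 -> 2.0 s
superscalar = cpu_time(1e9, 1.0, 1e9)  # superscalar halves CPI -> 1.0 s
print(base, superscalar)
```

Halving CPI halves CPU time here because the other two factors are held fixed, which is exactly the lever superscalar issue pulls.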

7 Superscalar MIPS Datapath
[datapath figure: PC, instruction memory, registers, two ALUs, sign extend units, data memory]
Up to two instructions issued per clock: one integer ALU instruction and one LD/ST 7

8 Superscalar Issues
Two instructions have to be fetched and decoded: 64 bits fetched at a given PC.
Additional ports are needed in the register file: total 4 read ports, 2 write ports in our example. Determined by the number of instructions issued in parallel.
Hardware resources have to be replicated: one ALU for the arithmetic operation, another for the MEM address. Additional data forwarding paths, control logic, ...
Problems: How to find multiple instructions to issue at run time? Compiler technology is needed to statically schedule instructions, which breaks binary compatibility.
How to deal with a stall between LD and an arithmetic instruction? The dependency on a load instruction cannot be resolved for multiple clocks; in this case, the next two instructions cannot use the load result w/o stalling. 8

9 Advanced Techniques: Dynamic Pipeline Scheduling
Dynamic pipelining: Execute instructions out-of-order to avoid pipeline hazards/stalls. A stalled instruction should not hold up other instructions. Retire instructions in program order (i.e., commit results in order). Decreases CPI.
Three major sections:
Instruction fetch and issue.
Execute units: each unit has a reservation station to hold operands and operations. Instructions are held in the reservation station until ready to execute.
Commit unit: the common approach is in-order completion. Must discard instructions that follow a mis-predicted branch. 9

10 Dynamically Scheduled Pipeline
Instruction fetch and decode unit (in-order issue) -> reservation stations -> functional units (Integer, Integer, Floating point, Load/Store; out-of-order execute) -> commit unit (in-order commit)
Very complex to design and verify 10

11 Memory Hierarchy 11

12 Memory Systems
Computer: Control, Datapath, Memory, Input, Output 12

13 Pipelined Design: Datapath and Control
[pipelined datapath figure: hazard detection unit, forwarding unit, control, pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB, instruction memory (I-MEM), data memory (D-MEM)]
Can an arbitrarily large amount of I-MEM and D-MEM be accessed in a single cycle? 13

14 Memory Hierarchy in Computer Systems
Processor (Datapath, Control, Registers) -> On-Chip Cache -> Second Level Cache (SRAM) -> Main Memory (DRAM) -> Secondary Storage (Disk) -> Tertiary Storage
Speed: 1 ns, 10s ns, 100s ns, 10s ms
Size (bytes): 100s, ~KBytes, ~MBytes, ~GBytes, ~Tera Bytes
Cycles (3 GHz): ... 10s of millions 14

15 Memory Subsystem Challenge
Conflicting goals to provide: the largest possible memory, at the fastest access time, with the lowest cost.
Processor speeds now exceed 3 GHz (0.3 ns cycle time). DRAM access times are still ~10s of ns. Serious memory access gap.
Every instruction has to be accessed from memory. ~15% of the instructions are load/store. 15

16 Static RAM Cell and Data Access
6-Transistor SRAM Cell: word (row select), bit, bit'
Write: 1. Drive bit lines to data. 2. Select row.
Read: 1. Precharge bit and bit' to Vdd. 2. Select row. 3. Cell pulls one line low. 4. Sense amp on the column detects the difference between bit and bit'.
Fast access, large area (6 transistors per cell) 16

17 Dynamic RAM (DRAM) Cell and Data Access
1-Transistor DRAM Cell: bit line, row select
Write: 1. Drive bit line to data. 2. Select row.
Read: 1. Precharge bit line to Vdd. 2. Select row. 3. Cell and bit line share charges (very small voltage changes on the bit line). 4. Sense the voltage difference (can detect changes of ~1 million electrons). 5. Write: restore the value.
Refresh: 1. Just do a dummy read to every cell.
Slow access, small area (1 transistor per cell). Needs periodic refresh. 17

18 Magnetic Disk
[figure: platters, tracks, sectors]
Average access time = Average seek time + Average rotational delay + Data transfer time + Disk controller overhead
Slow access (~ms), very large capacity (100s GB) 18
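The access-time sum above can be made concrete with a small calculation. All drive parameters below (seek time, RPM, transfer rate, overhead) are hypothetical illustration values, not from the slides:

```python
def disk_access_ms(seek_ms, rpm, transfer_mb_s, kb, overhead_ms):
    """Average disk access time in ms: seek + half a rotation + transfer + overhead."""
    rotational_ms = 0.5 * (60_000 / rpm)          # on average, wait half a rotation
    transfer_ms = kb / 1024 / transfer_mb_s * 1000
    return seek_ms + rotational_ms + transfer_ms + overhead_ms

# Hypothetical drive: 9 ms avg seek, 7200 RPM, 40 MB/s, 4 KB read, 1 ms controller overhead
print(round(disk_access_ms(9.0, 7200, 40.0, 4, 1.0), 2))  # 14.26 (ms)
```

Note how the mechanical terms (seek and rotation) dominate; the 4 KB transfer itself contributes under 0.1 ms.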

19 Caches 19

20 Who Cares about Memory Hierarchy?
[plot: processor vs. memory performance over time; the CPU-DRAM performance gap grows toward ~1000x]
Memory technology has not kept pace with processor performance. Memory access time is the performance bottleneck.

21 Memory Hierarchy and Locality
Memory locality is the principle that future memory accesses are near past accesses. There are two types of locality:
Temporal locality -- near in time: we will often access the same data again very soon.
Spatial locality -- near in space/distance: our next access is often very close in address to a recent access.
Type(s) of locality in the following address sequence? 1,2,3,4,7,8,9,10,8,8,4,8,9,8,10,8,8
The memory hierarchy is designed to take advantage of memory locality. Cache is implemented with SRAM (fast, expensive). Main memory is implemented with DRAM (cheap, slower). Storage is disk and tape (very slow, cheap, vast). 21
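One rough way to answer the question about the trace above is to count, per access, whether the address was seen before (temporal locality) or merely lies next to one that was (spatial locality). The +/-1 "nearby" threshold and the counting scheme are arbitrary choices for illustration, not from the slides:

```python
trace = [1, 2, 3, 4, 7, 8, 9, 10, 8, 8, 4, 8, 9, 8, 10, 8, 8]

seen = set()
temporal = spatial = 0
for addr in trace:
    if addr in seen:                                # exact repeat: temporal locality
        temporal += 1
    elif any(abs(addr - s) <= 1 for s in seen):     # adjacent address: spatial locality
        spatial += 1
    seen.add(addr)

print(temporal, spatial)  # 9 6
```

The trace shows both kinds: the sequential runs (1-4, 7-10) are spatial, and the repeated references to 8 are temporal.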

22 Memory Hierarchy (faster at the top, larger at the bottom)
Registers -- Operands (program/compiler, 1-8 bytes)
Cache -- Blocks (cache controller, bytes)
Memory -- Pages (OS, 512-4K bytes)
Disk -- Files (OS, bytes)
Tape 22

23 What is a Cache?
Dictionary meaning: a hiding place used especially for storing provisions.
A cache is a small amount of fast memory. Memory hierarchies exploit locality by caching (keeping close to the processor) data likely to be used again. It is impractical to build large, fast memories. Caches give an illusion of the fast access time (of an SRAM) with very large capacity (provided by a disk). 23

24 Locality and Caching
A cache is a small amount of fast memory. Memory hierarchies exploit locality by caching (keeping close to the processor) data likely to be used again. This is done because we can build large, slow memories and small, fast memories, but we can't build large, fast memories. If it works, we get the illusion of SRAM access time with disk capacity.
SRAM (static RAM) -- ns access time, very expensive
DRAM (dynamic RAM) -- ns, cheaper
disk -- access time measured in milliseconds, very cheap 24

25 Cache Terminology
Instruction cache: a cache that only holds instructions.
Data cache: a cache that only holds data.
Split cache: instruction and data caches are separate. Provides increased bandwidth from the cache. Hit rate is lower (than a unified cache). Wins over a unified cache due to higher bandwidth.
Unified cache: a cache that holds both instructions and data. Hit rate is higher. Bandwidth is lower (than that of the split cache). 25

26 Cache Terminology
Cache hit: an access where the data is found in the cache.
Cache miss: an access where the data is not found in the cache.
Hit time: time to access the cache.
Miss penalty: time to process a cache miss; move data from lower-level memory to the cache and CPU.
Hit ratio: % of the time the data is found in the cache.
Miss ratio: (1 - hit ratio).
Cache block size or cache line size: the amount of data that gets transferred on a cache miss.
Effective access time: (Hit Ratio * Hit Time) + (Miss Ratio * Miss Time) 26
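The effective access time formula above is worth plugging numbers into. The hit time, miss time, and hit ratio below are hypothetical illustration values:

```python
def effective_access_time(hit_ratio, hit_time_ns, miss_time_ns):
    """Effective access time = hit_ratio * hit_time + (1 - hit_ratio) * miss_time."""
    return hit_ratio * hit_time_ns + (1 - hit_ratio) * miss_time_ns

# Hypothetical numbers: 1 ns cache hit, 60 ns on a miss, 95% hit ratio.
print(round(effective_access_time(0.95, 1.0, 60.0), 2))  # 3.95 (ns)
```

Even a 5% miss ratio nearly quadruples the average access time over the pure hit time, which is why miss-rate reductions matter so much.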

27 Pipelined Design: I-Cache & D-Cache
[pipelined datapath figure, now with an I-Cache in front of the instruction memory and a D-Cache in front of the data memory]
How is the cache organized and managed? 27

28 How are Cache Entries Made?
[figure: cache contents a. before the reference to Xn and b. after the reference to Xn]
How is it determined whether the data for a given address is in the cache?
In case of a miss, where is the data corresponding to the new address stored? 28

29 A Direct-mapped Cache
If a data item is in the cache, how do we find it?
Cache location = (block address) modulo (Number of cache blocks in the cache)
In the following case: Number of cache blocks in the cache = 8
[figure: 8-block cache with memory blocks mapped onto cache blocks] 29
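The modulo mapping above is a one-liner, sketched here for the 8-block cache from the slide (the sample block addresses are hypothetical):

```python
NUM_BLOCKS = 8  # number of cache blocks, as on the slide

def cache_location(block_address):
    """Direct-mapped placement: block address modulo number of cache blocks."""
    return block_address % NUM_BLOCKS

# Memory blocks 1 and 9 collide (both map to cache block 1), as do 5 and 13.
print([cache_location(b) for b in (1, 5, 9, 13)])  # [1, 5, 1, 5]
```

The collisions shown are exactly the conflict behavior the later slides on associativity address.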

30 A Direct-mapped Cache, contd.
Address trace:
An index is used to determine which line an address might be found in within the cache. The tag identifies a portion of the address of the cached data. The valid bit indicates that the entry is valid.
[figure: tag | v | data]
4 entries, each block holds one word, each word in memory maps to exactly one cache location. A cache that can put a line of data in exactly one place is called a direct-mapped cache. 30

31 How is a Block Found in the Cache?
[figure: address (showing bit positions) split into tag, index, and byte offset; valid | tag | data array; tag compare produces hit]
A 4 Kbyte direct-mapped cache with 1-word (4-byte) blocks:
Number of blocks = Cache Size/(Block Size) = 1 K
Index bits = log2(Number of blocks) = 10 bits
Tag bits = (Total address bits - Index bits - Byte offset bits) = 32 - 10 - 2 = 20 31

32 Handling a Cache Read Miss
[pipelined datapath figure with I-Cache and D-Cache, I-Cache miss logic, I/O controller]
A mismatch on the tag and/or valid bit indicates a miss. Stall the CPU. Make a read request to memory (via the memory controller). When memory returns the data, write it into the cache. Return the data to the CPU. 32

33 Handling a Cache Write Miss
[pipelined datapath figure with I-Cache and D-Cache, I-Cache miss logic, I/O controller]
A mismatch on the tag and/or valid bit indicates a miss. Write the tag, valid bit, and data into the cache. Works only if block size = word size. Should the data be written to memory also? 33

34 Dealing with Stores
Stores must be handled differently than loads, because...
They don't necessarily require the CPU to stall: stores don't produce register values used by other instructions.
They change the content of cache/memory (creating memory consistency issues).
Policy decisions for stores:
write-through => all writes go to both cache and main memory
write-back => writes go only to cache. Modified cache lines are written back to memory when the line is replaced.
write-allocate => on a store miss, bring the written line into the cache
no-write-allocate => write to main memory, and ignore the cache 34

35 How to Ensure Memory Consistency?
Write-through Cache: Write data to the cache as well as to the lower-level cache/memory. This incurs a performance penalty. Use a write buffer so the CPU can proceed with the following instructions. What about burst writes? Use multiple entries in the write buffer.
Write-back Cache: Write cache data to memory when it is about to be overwritten for another address.
Write-allocate: On a write miss, bring the written line into the cache.
No-write-allocate: On a write miss, write to main memory, and ignore the cache. 35

36 Handling a Cache Write Miss
Write-through Cache: Write data to the cache as well as to memory. Don't need to consider whether the write hits or misses the cache. Disadvantage: every write causes the data to be written to main memory. Use a write buffer so the CPU can proceed with the following instructions.
Write-back Cache: When a write occurs, write the new value only to the block in the cache. Write cache data to memory when it is about to be overwritten for another address. Improves performance over a write-through cache. More complex to implement. 36

37 Summary for Stores
On a store hit, write the new data to the cache. In a write-through cache, write the data immediately to memory. In a write-back cache, mark the line as dirty.
On a store miss, initiate a cache block load from memory for a write-allocate cache. Write directly to memory for a no-write-allocate cache.
On any kind of cache miss in a write-back cache, if the line to be replaced in the cache is dirty, write it back to memory. 37
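The store policies summarized above can be sketched with a toy cache model. This is a hypothetical illustration class (dict-based "memories", single-word blocks, no replacement or dirty write-back of evicted lines), not the design from the slides:

```python
class Cache:
    """Toy model of store handling under the four write policies."""

    def __init__(self, write_back, write_allocate):
        self.write_back = write_back
        self.write_allocate = write_allocate
        self.lines = {}   # address -> (value, dirty bit)
        self.memory = {}  # backing store

    def store(self, addr, value):
        if addr in self.lines or self.write_allocate:
            # Store hit (or allocate-on-miss): update the cache line.
            dirty = self.write_back            # write-back marks the line dirty
            self.lines[addr] = (value, dirty)
            if not self.write_back:            # write-through also updates memory
                self.memory[addr] = value
        else:
            # no-write-allocate miss: bypass the cache, write memory directly.
            self.memory[addr] = value

wt = Cache(write_back=False, write_allocate=False)
wt.store(0x100, 42)
print(wt.memory[0x100], 0x100 in wt.lines)  # 42 False  (memory updated, cache bypassed)

wb = Cache(write_back=True, write_allocate=True)
wb.store(0x100, 42)
print(wb.lines[0x100])                       # (42, True)  (dirty line, memory untouched)
```

The dirty bit is what defers the memory update in the write-back case: memory is only made consistent when the line is eventually evicted.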

38 Taking Advantage of Spatial Locality
Consider the following address trace: 0,1,2,3,17,8,9,10,11,17,4,5,6,7. Notice that the addresses lie in the vicinity of each other.
Instructions show a high degree of spatial locality: typically accessed sequentially; generally, code consists of a lot of loops.
Data also shows spatial locality: typically less than that of instructions; different elements of a structure may be accessed.
Why not bring multiple words into the cache on a miss, instead of bringing a single one? 38

39 Spatial Locality: Larger Cache Blocks
address string:
[figure: tag | data, two words per block]
4 entries, each block holds two words, each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches).
Large cache blocks take advantage of spatial locality. Too large a block size can waste cache space. Larger cache blocks require less tag space. 39

40 A 64 KB Cache using 16-byte Blocks
[figure: address (showing bit positions) split into tag (16 bits), index, block offset, and byte offset; V | Tag | Data (128 bits) array of 4K entries; tag compare produces hit]

41 Complication with Larger Blocks
Write-through cache: can't write to the cache while performing a tag comparison.
OK if there is a hit in the cache.
Not OK if there is a cache miss: the block has to be fetched from memory and placed in the cache, then the word that caused the miss is rewritten into the cache. 41

42 Impact of Block Size on Miss Rate
[plot: miss rate (0%-40%) vs. block size (bytes), one curve per cache size: 1 KB, 8 KB, 16 KB, 64 KB, 256 KB]
In general, a larger block decreases the miss rate; however, a larger block size means a larger miss penalty (it takes longer to fill up the block), and miss rates go up if the block size is too big relative to the cache size, since there are then too few cache blocks. 42

43 Cache Performance
64 KB each instruction cache and data cache (direct mapped):

Program | Block size (words) | Instruction miss rate | Data miss rate | Effective combined miss rate
gcc     | 1 | 6.1% | 2.1% | 5.4%
gcc     | 4 | 2.0% | 1.7% | 1.9%
spice   | 1 | 1.2% | 1.3% | 1.2%
spice   | 4 | 0.3% | 0.6% | 0.4%

In general, Average Access Time = [Hit Time * (1 - Miss Rate)] + [Miss Penalty * Miss Rate]
Limitation of a direct mapped cache: a block can go in exactly one place in the cache. 43

44 Flexible Placement of Blocks
Direct mapped cache: A block can go in exactly one place in the cache. Leads to collisions among blocks.
Fully associative cache: A block can go in any place in the cache. All addresses have to be compared simultaneously. Slow and expensive.
N-way set-associative cache: Consists of a number of sets; each set consists of N blocks. Each block in memory maps to a unique set; a block can be placed in any element of the set.
Set containing a memory block = (block number) modulo (Number of sets in the cache)
Number of sets in the cache = Cache size/[(Block size)*(Associativity)] 44
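The two formulas above translate directly to code. This sketch uses the 4 KB, 4-byte-block, 4-way configuration worked on a later slide; the sample block number is hypothetical:

```python
def num_sets(cache_size, block_size, associativity):
    """Number of sets = cache size / (block size * associativity)."""
    return cache_size // (block_size * associativity)

def set_index(block_number, sets):
    """Set containing a memory block = block number modulo number of sets."""
    return block_number % sets

# 4 KB cache, 4-byte blocks, 4-way set-associative -> 256 sets.
sets = num_sets(4096, 4, 4)
print(sets, set_index(1000, sets))  # 256 232
```

A direct-mapped cache is just the associativity-1 case, and a fully associative cache the one-set case, so the same two formulas cover the whole placement spectrum.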

45 Locating a Block in a Cache
[figure: block address split for direct mapped (block #), set-associative (set #), and fully associative placement; the tag/data search covers one block, one set, or all blocks respectively] 45

46 Cache Configurations
An eight-block cache with various configurations:
One-way set associative (direct mapped): 8 blocks, each with a Tag and Data
Two-way set associative: 4 sets, each with two (Tag, Data) pairs
Four-way set associative: 2 sets, each with four (Tag, Data) pairs
Eight-way set associative (fully associative): one set of eight (Tag, Data) pairs 46

47 Accessing a 4-way Set-associative Cache
Number of blocks = Cache Size/Block Size = 4 Kbytes/4 Bytes = 1 K blocks
Number of sets = (# Blocks)/Associativity = 1 K/4 = 256
Index bits = log2(# Sets) = log2(256) = 8
[figure: index selects one entry (V | Tag | Data) from each of the four ways; a 4-to-1 multiplexor selects the hit data]
4 Kbyte 4-way set-associative cache, with a block size of 4 bytes 47

48 Accessing a Direct Mapped Cache
64 KB cache, direct-mapped, 32-byte cache block size
64 KB / 32 bytes = 2 K cache blocks/sets
[figure: address split into tag, index (11 bits), and word offset; valid | tag | data (256 bits) array of 2K entries; tag compare (=) produces hit/miss] 48

49 Accessing a Set-associative Cache
32 KB cache, 2-way set-associative, 16-byte block size
32 KB / 16 bytes / 2 = 1 K cache sets
[figure: address split into tag, index (10 bits), and word offset; two ways of valid | tag | data; two comparators (=) produce hit/miss] 49

50 A Fully-associative Cache
address string: ?
The tag identifies the address of the cached data. The valid bit indicates that the entry is valid.
[figure: tag | v | data]
4 entries, each block holds one word, any block can hold any word. A cache that can put a block of data anywhere is called fully associative. To access the cache, the address must be compared with all the entries in the cache. 50

51 Cache Organization
A typical cache has three dimensions:
Number of sets (cache size)
Blocks/set (associativity)
Bytes/block (block size)
[figure: grid of tag | data entries, sets down, ways across]
Address split: tag | index | block offset 51
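The tag | index | block offset split above follows directly from two of the three dimensions. This is a sketch for 32-bit addresses; the helper name and the sample address are hypothetical, and the set count and block size are assumed powers of two:

```python
def split_address(addr, sets, block_size):
    """Split an address into (tag, index, block offset) for a given cache shape."""
    offset_bits = block_size.bit_length() - 1  # log2(bytes per block)
    index_bits = sets.bit_length() - 1         # log2(number of sets)
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# Hypothetical address, with 1K sets and 16-byte blocks (the 32 KB 2-way shape).
print(split_address(0x12345678, 1024, 16))  # (18641, 359, 8)
```

Note that associativity never appears in the split: it only changes how many ways are searched once the index selects a set.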

52 The Three Cs
Compulsory misses: Caused by the first access to a block that has never been in the cache. Also called cold-start misses. Can be reduced by increasing the block size.
Capacity misses: Caused when the cache cannot contain all the blocks needed. Occur because of blocks being replaced and later retrieved. Can be reduced by enlarging the cache.
Conflict misses: Occur in direct mapped and set-associative caches when multiple blocks compete for the same set. Can be eliminated by using a fully associative cache. 52

53 Which Block Should be Replaced on a Miss?
Direct mapped is easy. Set associative or fully associative: Random (large associativities) or LRU (smaller associativities).
Miss rates for the two schemes:

Size   | 2-way LRU | 2-way Random | 4-way LRU | 4-way Random | 8-way LRU | 8-way Random
16 KB  | 5.18% | 5.69% | 4.67% | 5.29% | 4.39% | 4.96%
64 KB  | 1.88% | 2.01% | 1.54% | 1.66% | 1.39% | 1.53%
256 KB | 1.15% | 1.17% | 1.13% | 1.13% | 1.12% | 1.12%

LRU is the preferred scheme for a small cache. 53
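LRU replacement within one set can be sketched with an ordered dictionary that keeps the most recently used tag at the end. This is a hypothetical illustration class, not hardware from the slides (real caches approximate LRU with a few status bits per set):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement among its ways."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()  # insertion/access order tracks recency

    def access(self, tag):
        """Return True on a hit; on a miss, evict the least recently used tag."""
        if tag in self.tags:
            self.tags.move_to_end(tag)       # refresh recency on a hit
            return True
        if len(self.tags) >= self.ways:
            self.tags.popitem(last=False)    # evict the LRU tag (front of dict)
        self.tags[tag] = None
        return False

s = LRUSet(ways=2)
print([s.access(t) for t in ("A", "B", "A", "C", "B")])
# [False, False, True, False, False] -- C evicts B (LRU), then B evicts A
```

The final miss on B is a conflict miss that a third way would have avoided, which is the associativity/miss-rate trade-off shown in the table above.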

54 Associative Caches: Higher Hit Rates, but...
Longer access time (longer to determine hit/miss, more muxing of outputs).
More space (longer tags).
16 KB, 16-byte blocks, direct mapped, tag = ?
16 KB, 16-byte blocks, 4-way, tag = ? 54
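The two "tag = ?" questions above can be worked out directly, assuming 32-bit addresses (a hypothetical helper, following the bit-count formulas from the earlier slides):

```python
from math import log2

def tag_bits(addr_bits, cache_size, block_size, ways):
    """Tag bits = address bits - index bits - block offset bits."""
    sets = cache_size // (block_size * ways)
    index_bits = int(log2(sets))
    offset_bits = int(log2(block_size))
    return addr_bits - index_bits - offset_bits

# 16 KB, 16-byte blocks, direct mapped: 1024 sets -> 10 index + 4 offset bits.
print(tag_bits(32, 16 * 1024, 16, 1))  # 18
# 16 KB, 16-byte blocks, 4-way: 256 sets -> 8 index + 4 offset bits.
print(tag_bits(32, 16 * 1024, 16, 4))  # 20
```

Going 4-way shrinks the index by 2 bits and grows every tag by 2 bits, which is the "more space (longer tags)" cost listed above.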

55 Summary
The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time. Temporal Locality: locality in time. Spatial Locality: locality in space.
Three Major Categories of Cache Misses:
Compulsory Misses: sad facts of life. Example: cold start misses.
Conflict Misses: increase cache size and/or associativity. Nightmare scenario: ping-pong effect!
Capacity Misses: increase cache size.
Cache Design Space: total size, block size, associativity; replacement policy; write-hit policy (write-through, write-back).
Caches give an illusion of a large, cheap memory with the access time of a fast, expensive memory. 55

EXAMINATIONS 2003 END-YEAR COMP 203. Computer Organisation

EXAMINATIONS 2003 END-YEAR COMP 203. Computer Organisation EXAINATIONS 2003 COP203 END-YEAR Compter Organisation Time Allowed: 3 Hors (180 mintes) Instrctions: Answer all qestions. There are 180 possible marks on the eam. Calclators and foreign langage dictionaries

More information

TDT4255 Friday the 21st of October. Real world examples of pipelining? How does pipelining influence instruction

TDT4255 Friday the 21st of October. Real world examples of pipelining? How does pipelining influence instruction Review Friday the 2st of October Real world eamples of pipelining? How does pipelining pp inflence instrction latency? How does pipelining inflence instrction throghpt? What are the three types of hazard

More information

Pipelining. Chapter 4

Pipelining. Chapter 4 Pipelining Chapter 4 ake processor rns faster Pipelining is an implementation techniqe in which mltiple instrctions are overlapped in eection Key of making processor fast Pipelining Single cycle path we

More information

EXAMINATIONS 2010 END OF YEAR NWEN 242 COMPUTER ORGANIZATION

EXAMINATIONS 2010 END OF YEAR NWEN 242 COMPUTER ORGANIZATION EXAINATIONS 2010 END OF YEAR COPUTER ORGANIZATION Time Allowed: 3 Hors (180 mintes) Instrctions: Answer all qestions. ake sre yor answers are clear and to the point. Calclators and paper foreign langage

More information

Chapter 3 & Appendix C Pipelining Part A: Basic and Intermediate Concepts

Chapter 3 & Appendix C Pipelining Part A: Basic and Intermediate Concepts CS359: Compter Architectre Chapter 3 & Appendi C Pipelining Part A: Basic and Intermediate Concepts Yanyan Shen Department of Compter Science and Engineering Shanghai Jiao Tong University 1 Otline Introdction

More information

What do we have so far? Multi-Cycle Datapath

What do we have so far? Multi-Cycle Datapath What do we have so far? lti-cycle Datapath CPI: R-Type = 4, Load = 5, Store 4, Branch = 3 Only one instrction being processed in datapath How to lower CPI frther? #1 Lec # 8 Spring2 4-11-2 Pipelining pipelining

More information

The single-cycle design from last time

The single-cycle design from last time lticycle path Last time we saw a single-cycle path and control nit for or simple IPS-based instrction set. A mlticycle processor fies some shortcomings in the single-cycle CPU. Faster instrctions are not

More information

The final datapath. M u x. Add. 4 Add. Shift left 2. PCSrc. RegWrite. MemToR. MemWrite. Read data 1 I [25-21] Instruction. Read. register 1 Read.

The final datapath. M u x. Add. 4 Add. Shift left 2. PCSrc. RegWrite. MemToR. MemWrite. Read data 1 I [25-21] Instruction. Read. register 1 Read. The final path PC 4 Add Reg Shift left 2 Add PCSrc Instrction [3-] Instrction I [25-2] I [2-6] I [5 - ] register register 2 register 2 Registers ALU Zero Reslt ALUOp em Data emtor RegDst ALUSrc em I [5

More information

Enhanced Performance with Pipelining

Enhanced Performance with Pipelining Chapter 6 Enhanced Performance with Pipelining Note: The slides being presented represent a mi. Some are created by ark Franklin, Washington University in St. Lois, Dept. of CSE. any are taken from the

More information

Review Multicycle: What is Happening. Controlling The Multicycle Design

Review Multicycle: What is Happening. Controlling The Multicycle Design Review lticycle: What is Happening Reslt Zero Op SrcA SrcB Registers Reg Address emory em Data Sign etend Shift left Sorce A B Ot [-6] [5-] [-6] [5-] [5-] Instrction emory IR RegDst emtoreg IorD em em

More information

Exceptions and interrupts

Exceptions and interrupts Eceptions and interrpts An eception or interrpt is an nepected event that reqires the CPU to pase or stop the crrent program. Eception handling is the hardware analog of error handling in software. Classes

More information

Chapter 6: Pipelining

Chapter 6: Pipelining CSE 322 COPUTER ARCHITECTURE II Chapter 6: Pipelining Chapter 6: Pipelining Febrary 10, 2000 1 Clothes Washing CSE 322 COPUTER ARCHITECTURE II The Assembly Line Accmlate dirty clothes in hamper Place in

More information

Review: Computer Organization

Review: Computer Organization Review: Compter Organization Pipelining Chans Y Landry Eample Landry Eample Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 3 mintes A B C D Dryer takes 3 mintes

More information

The extra single-cycle adders

The extra single-cycle adders lticycle Datapath As an added bons, we can eliminate some of the etra hardware from the single-cycle path. We will restrict orselves to sing each fnctional nit once per cycle, jst like before. Bt since

More information

The multicycle datapath. Lecture 10 (Wed 10/15/2008) Finite-state machine for the control unit. Implementing the FSM

The multicycle datapath. Lecture 10 (Wed 10/15/2008) Finite-state machine for the control unit. Implementing the FSM Lectre (Wed /5/28) Lab # Hardware De Fri Oct 7 HW #2 IPS programming, de Wed Oct 22 idterm Fri Oct 2 IorD The mlticycle path SrcA Today s objectives: icroprogramming Etending the mlti-cycle path lti-cycle

More information

Prof. Kozyrakis. 1. (10 points) Consider the following fragment of Java code:

Prof. Kozyrakis. 1. (10 points) Consider the following fragment of Java code: EE8 Winter 25 Homework #2 Soltions De Thrsday, Feb 2, 5 P. ( points) Consider the following fragment of Java code: for (i=; i

More information

CSE 141 Computer Architecture Spring Lectures 17 Virtual Memory. Announcements Office Hour

CSE 141 Computer Architecture Spring Lectures 17 Virtual Memory. Announcements Office Hour CSE 4 Computer Architecture Spring 25 Lectures 7 Virtual Memory Pramod V. Argade May 25, 25 Announcements Office Hour Monday, June 6th: 6:3-8 PM, AP&M 528 Instead of regular Monday office hour 5-6 PM Reading

More information

1048: Computer Organization

1048: Computer Organization 8: Compter Organization Lectre 6 Pipelining Lectre6 - pipelining (cwli@twins.ee.nct.ed.tw) 6- Otline An overview of pipelining A pipelined path Pipelined control Data hazards and forwarding Data hazards

More information

Review. A single-cycle MIPS processor

Review. A single-cycle MIPS processor Review If three instrctions have opcodes, 7 and 5 are they all of the same type? If we were to add an instrction to IPS of the form OD $t, $t2, $t3, which performs $t = $t2 OD $t3, what wold be its opcode?

More information

CS 251, Winter 2018, Assignment % of course mark

CS 251, Winter 2018, Assignment % of course mark CS 25, Winter 28, Assignment 4.. 3% of corse mark De Wednesday, arch 7th, 4:3P Lates accepted ntil Thrsday arch 8th, am with a 5% penalty. (6 points) In the diagram below, the mlticycle compter from the

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

The Memory Hierarchy & Cache

The Memory Hierarchy & Cache Removing The Ideal Memory Assumption: The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory

More information

CS 251, Winter 2019, Assignment % of course mark

CS 251, Winter 2019, Assignment % of course mark CS 25, Winter 29, Assignment.. 3% of corse mark De Wednesday, arch 3th, 5:3P Lates accepted ntil Thrsday arch th, pm with a 5% penalty. (7 points) In the diagram below, the mlticycle compter from the corse

More information

Chapter Seven. SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors)

Chapter Seven. SRAM: value is stored on a pair of inverting gates very fast but takes up more space than DRAM (4 to 6 transistors) Chapter Seven emories: Review SRA: value is stored on a pair of inverting gates very fast but takes up more space than DRA (4 to transistors) DRA: value is stored as a charge on capacitor (must be refreshed)

More information

Instruction fetch. MemRead. IRWrite ALUSrcB = 01. ALUOp = 00. PCWrite. PCSource = 00. ALUSrcB = 00. R-type completion

Instruction fetch. MemRead. IRWrite ALUSrcB = 01. ALUOp = 00. PCWrite. PCSource = 00. ALUSrcB = 00. R-type completion . (Chapter 5) Fill in the vales for SrcA, SrcB, IorD, Dst and emto to complete the Finite State achine for the mlti-cycle datapath shown below. emory address comptation 2 SrcA = SrcB = Op = fetch em SrcA

More information

Chapter 6: Pipelining

Chapter 6: Pipelining Chapter 6: Pipelining Otline An overview of pipelining A pipelined path Pipelined control Data hazards and forwarding Data hazards and stalls Branch hazards Eceptions Sperscalar and dynamic pipelining

More information

Overview of Pipelining

Overview of Pipelining EEC 58 Compter Architectre Pipelining Department of Electrical Engineering and Compter Science Cleveland State University Fndamental Principles Overview of Pipelining Pipelined Design otivation: Increase

More information

EEC 483 Computer Organization

EEC 483 Computer Organization EEC 483 Compter Organization Chapter 4.4 A Simple Implementation Scheme Chans Y The Big Pictre The Five Classic Components of a Compter Processor Control emory Inpt path Otpt path & Control 2 path and

More information

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Cache Introduction. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Cache Introduction [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user with as much

More information

Chapter 6 Enhancing Performance with. Pipelining. Pipelining. Pipelined vs. Single-Cycle Instruction Execution: the Plan. Pipelining: Keep in Mind

Chapter 6 Enhancing Performance with. Pipelining. Pipelining. Pipelined vs. Single-Cycle Instruction Execution: the Plan. Pipelining: Keep in Mind Pipelining hink of sing machines in landry services Chapter 6 nhancing Performance with Pipelining 6 P 7 8 9 A ime ask A B C ot pipelined Assme 3 min. each task wash, dry, fold, store and that separate

More information

CS 153 Design of Operating Systems

CS 153 Design of Operating Systems CS 153 Design of Operating Systems Spring 18 Lectre 18: Memory Hierarchy Instrctor: Chengy Song Slide contribtions from Nael Ab-Ghazaleh, Harsha Madhyvasta and Zhiyn Qian Some slides modified from originals

More information

CPE 631 Lecture 04: CPU Caches

CPE 631 Lecture 04: CPU Caches Lecture 04 CPU Caches Electrical and Computer Engineering University of Alabama in Huntsville Outline Memory Hierarchy Four Questions for Memory Hierarchy Cache Performance 26/01/2004 UAH- 2 1 Processor-DR

More information

Caches. Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University. See P&H 5.1, 5.2 (except writes)

Caches. Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University. See P&H 5.1, 5.2 (except writes) Caches Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University See P&H 5.1, 5.2 (except writes) Big Picture: Memory Memory: big & slow vs Caches: small &

More information

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350):

The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): The Memory Hierarchy & Cache Review of Memory Hierarchy & Cache Basics (from 350): Motivation for The Memory Hierarchy: { CPU/Memory Performance Gap The Principle Of Locality Cache $$$$$ Cache Basics:

More information

ECE ECE4680

ECE ECE4680 ECE468. -4-7 The otivation for s System ECE468 Computer Organization and Architecture DRA Hierarchy System otivation Large memories (DRA) are slow Small memories (SRA) are fast ake the average access time

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major

More information

The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs.

The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. The Hierarchical Memory System The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory Hierarchy:

More information

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141 EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role

More information

Caches and Memory. Anne Bracy CS 3410 Computer Science Cornell University. See P&H Chapter: , 5.8, 5.10, 5.13, 5.15, 5.17

Caches and Memory. Anne Bracy CS 3410 Computer Science Cornell University. See P&H Chapter: , 5.8, 5.10, 5.13, 5.15, 5.17 Caches and emory Anne Bracy CS 34 Computer Science Cornell University Slides by Anne Bracy with 34 slides by Professors Weatherspoon, Bala, ckee, and Sirer. See P&H Chapter: 5.-5.4, 5.8, 5., 5.3, 5.5,

More information

CS 153 Design of Operating Systems Spring 18

CS 153 Design of Operating Systems Spring 18 CS 153 Design of Operating Systems Spring 18 Lectre 15: Virtal Address Space Instrctor: Chengy Song Slide contribtions from Nael Ab-Ghazaleh, Harsha Madhyvasta and Zhiyn Qian OS Abstractions Applications

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Caches. See P&H 5.1, 5.2 (except writes) Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University

Caches. See P&H 5.1, 5.2 (except writes) Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University s See P&.,. (except writes) akim Weatherspoon CS, Spring Computer Science Cornell University What will you do over Spring Break? A) Relax B) ead home C) ead to a warm destination D) Stay in (frigid) Ithaca

More information

EECS 322 Computer Architecture Improving Memory Access: the Cache

EECS 322 Computer Architecture Improving Memory Access: the Cache EECS 322 Computer Architecture Improving emory Access: the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow

More information

CS 153 Design of Operating Systems

CS 153 Design of Operating Systems CS 53 Design of Operating Systems Spring 8 Lectre 6: Paging Instrctor: Chengy Song Slide contribtions from Nael Ab-Ghazaleh, Harsha Madhyvasta and Zhiyn Qian Some slides modified from originals by Dave

More information

14:332:331. Week 13 Basics of Cache

14:332:331. Week 13 Basics of Cache 14:332:331 Computer Architecture and Assembly Language Spring 2006 Week 13 Basics of Cache [Adapted from Dave Patterson s UCB CS152 slides and Mary Jane Irwin s PSU CSE331 slides] 331 Week131 Spring 2006

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address space at any time Temporal locality Items accessed recently are likely to

More information

EEC 483 Computer Organization. Branch (Control) Hazards

EEC 483 Computer Organization. Branch (Control) Hazards EEC 483 Compter Organization Section 4.8 Branch Hazards Section 4.9 Exceptions Chans Y Branch (Control) Hazards While execting a previos branch, next instrction address might not yet be known. s n i o

More information

PS Midterm 2. Pipelining

PS Midterm 2. Pipelining PS idterm 2 Pipelining Seqential Landry 6 P 7 8 9 idnight Time T a s k O r d e r A B C D 3 4 2 3 4 2 3 4 2 3 4 2 Seqential landry takes 6 hors for 4 loads If they learned pipelining, how long wold landry

More information
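The laundry excerpt above states that sequential laundry takes 6 hours for 4 loads and asks how long a pipelined version would take. A minimal sketch of that arithmetic, assuming the 30/40/20-minute wash/dry/fold stage times of the classic Patterson & Hennessy example (the stage times are an assumption, not taken verbatim from the listing):

```python
# Hypothetical stage times (minutes), per the classic laundry example.
WASH, DRY, FOLD = 30, 40, 20

def sequential_minutes(loads):
    """Each load runs start-to-finish before the next begins."""
    return loads * (WASH + DRY + FOLD)

def pipelined_minutes(loads):
    # First load takes the full 90 min; each later load finishes one
    # bottleneck-stage time (the slowest stage, 40 min) after the previous.
    return (WASH + DRY + FOLD) + (loads - 1) * max(WASH, DRY, FOLD)

print(sequential_minutes(4) / 60)  # 6.0 hours, matching the excerpt
print(pipelined_minutes(4) / 60)   # 3.5 hours
```

The takeaway mirrors processor pipelining: throughput is set by the slowest stage, while the latency of a single load is unchanged.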

CS 153 Design of Operating Systems

CS 153 Design of Operating Systems CS 53 Design of Operating Systems Spring 8 Lectre 9: Locality and Cache Instrctor: Chengy Song Slide contribtions from Nael Ab-Ghazaleh, Harsha Madhyvasta and Zhiyn Qian Some slides modified from originals

More information

14:332:331. Week 13 Basics of Cache

14:332:331. Week 13 Basics of Cache 14:332:331 Computer Architecture and Assembly Language Fall 2003 Week 13 Basics of Cache [Adapted from Dave Patterson s UCB CS152 slides and Mary Jane Irwin s PSU CSE331 slides] 331 Lec20.1 Fall 2003 Head

More information

CS 153 Design of Operating Systems Spring 18

CS 153 Design of Operating Systems Spring 18 CS 53 Design of Operating Systems Spring 8 Lectre 2: Virtal Memory Instrctor: Chengy Song Slide contribtions from Nael Ab-Ghazaleh, Harsha Madhyvasta and Zhiyn Qian Recap: cache Well-written programs exhibit

More information

Course Administration

Course Administration Spring 207 EE 363: Computer Organization Chapter 5: Large and Fast: Exploiting Memory Hierarchy - Avinash Kodi Department of Electrical Engineering & Computer Science Ohio University, Athens, Ohio 4570

More information

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology

Memory Hierarchies. Instructor: Dmitri A. Gusev. Fall Lecture 10, October 8, CS 502: Computers and Communications Technology Memory Hierarchies Instructor: Dmitri A. Gusev Fall 2007 CS 502: Computers and Communications Technology Lecture 10, October 8, 2007 Memories SRAM: value is stored on a pair of inverting gates very fast

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

EEC 483 Computer Organization

EEC 483 Computer Organization EEC 83 Compter Organization Chapter.6 A Pipelined path Chans Y Pipelined Approach 2 - Cycle time, No. stages - Resorce conflict E E A B C D 3 E E 5 E 2 3 5 2 6 7 8 9 c.y9@csohio.ed Resorces sed in 5 Stages

More information

Memory Hierarchy. ENG3380 Computer Organization and Architecture Cache Memory Part II. Topics. References. Memory Hierarchy

Memory Hierarchy. ENG3380 Computer Organization and Architecture Cache Memory Part II. Topics. References. Memory Hierarchy ENG338 Computer Organization and Architecture Part II Winter 217 S. Areibi School of Engineering University of Guelph Hierarchy Topics Hierarchy Locality Motivation Principles Elements of Design: Addresses

More information

Computer Architecture Chapter 5. Fall 2005 Department of Computer Science Kent State University

Computer Architecture Chapter 5. Fall 2005 Department of Computer Science Kent State University Compter Architectre Chapter 5 Fall 25 Department of Compter Science Kent State University The Processor: Datapath & Control Or implementation of the MIPS is simplified memory-reference instrctions: lw,

More information

LECTURE 10: Improving Memory Access: Direct and Spatial caches

LECTURE 10: Improving Memory Access: Direct and Spatial caches EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

EEC 170 Computer Architecture Fall Cache Introduction Review. Review: The Memory Hierarchy. The Memory Hierarchy: Why Does it Work?

EEC 170 Computer Architecture Fall Cache Introduction Review. Review: The Memory Hierarchy. The Memory Hierarchy: Why Does it Work? EEC 17 Computer Architecture Fall 25 Introduction Review Review: The Hierarchy Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology

More information

3Introduction. Memory Hierarchy. Chapter 2. Memory Hierarchy Design. Computer Architecture A Quantitative Approach, Fifth Edition

3Introduction. Memory Hierarchy. Chapter 2. Memory Hierarchy Design. Computer Architecture A Quantitative Approach, Fifth Edition Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory

COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory 1 COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

Lecture 7. Building A Simple Processor

Lecture 7. Building A Simple Processor Lectre 7 Bilding A Simple Processor Christos Kozyrakis Stanford University http://eeclass.stanford.ed/ee8b C. Kozyrakis EE8b Lectre 7 Annoncements Upcoming deadlines Lab is de today Demo by 5pm, report

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

Quiz #1 EEC 483, Spring 2019

Quiz #1 EEC 483, Spring 2019 Qiz # EEC 483, Spring 29 Date: Jan 22 Name: Eercise #: Translate the following instrction in C into IPS code. Eercise #2: Translate the following instrction in C into IPS code. Hint: operand C is stored

More information

1048: Computer Organization

1048: Computer Organization 48: Compter Organization Lectre 5 Datapath and Control Lectre5A - simple implementation (cwli@twins.ee.nct.ed.tw) 5A- Introdction In this lectre, we will try to implement simplified IPS which contain emory

More information

Multi-cycle Datapath (Our Version)

Multi-cycle Datapath (Our Version) ulti-cycle Datapath (Our Version) npc_sel Next PC PC Instruction Fetch IR File Operand Fetch A B ExtOp ALUSrc ALUctr Ext ALU R emrd emwr em Access emto Data em Dst Wr. File isters added: IR: Instruction

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1)

Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) Department of Electr rical Eng ineering, Chapter 5 Large and Fast: Exploiting Memory Hierarchy (Part 1) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering,

More information

Caches and Memory Deniz Altinbuken CS 3410, Spring 2015

Caches and Memory Deniz Altinbuken CS 3410, Spring 2015 s and emory Deniz Altinbuken CS, Spring Computer Science Cornell University See P& Chapter:.-. (except writes) Big Picture: emory Code Stored in emory (also, data and stack) compute jump/branch targets

More information

Memory Hierarchy: Caches, Virtual Memory

Memory Hierarchy: Caches, Virtual Memory Memory Hierarchy: Caches, Virtual Memory Readings: 5.1-5.4, 5.8 Big memories are slow Computer Fast memories are small Processor Memory Devices Control Input Datapath Output Need to get fast, big memories

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

COSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University

COSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University COSC4201 Chapter 5 Memory Hierarchy Design Prof. Mokhtar Aboelaze York University 1 Memory Hierarchy The gap between CPU performance and main memory has been widening with higher performance CPUs creating

More information

Lecture 13: Exceptions and Interrupts

Lecture 13: Exceptions and Interrupts 18 447 Lectre 13: Eceptions and Interrpts S 10 L13 1 James C. Hoe Dept of ECE, CU arch 1, 2010 Annoncements: Handots: Spring break is almost here Check grades on Blackboard idterm 1 graded Handot #9: Lab

More information

Memory Hierarchy: The motivation

Memory Hierarchy: The motivation Memory Hierarchy: The motivation The gap between CPU performance and main memory has been widening with higher performance CPUs creating performance bottlenecks for memory access instructions. The memory

More information

Computer Architecture

Computer Architecture Compter Architectre Lectre 4: Intro to icroarchitectre: Single- Cycle Dr. Ahmed Sallam Sez Canal University Based on original slides by Prof. Onr tl Review Compter Architectre Today and Basics (Lectres

More information

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 5 Large and Fast: Exploiting Memory Hierarchy 5 th Edition Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic

More information

CPU issues address (and data for write) Memory returns data (or acknowledgment for write)

CPU issues address (and data for write) Memory returns data (or acknowledgment for write) The Main Memory Unit CPU and memory unit interface Address Data Control CPU Memory CPU issues address (and data for write) Memory returns data (or acknowledgment for write) Memories: Design Objectives

More information

Computer Architecture. Lecture 6: Pipelining

Computer Architecture. Lecture 6: Pipelining Compter Architectre Lectre 6: Pipelining Dr. Ahmed Sallam Based on original slides by Prof. Onr tl Agenda for Today & Net Few Lectres Single-cycle icroarchitectres lti-cycle and icroprogrammed icroarchitectres

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

Memory Hierarchy: Motivation

Memory Hierarchy: Motivation Memory Hierarchy: Motivation The gap between CPU performance and main memory speed has been widening with higher performance CPUs creating performance bottlenecks for memory access instructions. The memory

More information

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches

CS 61C: Great Ideas in Computer Architecture. Direct Mapped Caches CS 61C: Great Ideas in Computer Architecture Direct Mapped Caches Instructor: Justin Hsia 7/05/2012 Summer 2012 Lecture #11 1 Review of Last Lecture Floating point (single and double precision) approximates

More information

Solutions for Chapter 6 Exercises

Solutions for Chapter 6 Exercises Soltions for Chapter 6 Eercises Soltions for Chapter 6 Eercises 6. 6.2 a. Shortening the ALU operation will not affect the speedp obtained from pipelining. It wold not affect the clock cycle. b. If the

More information

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1

Memory Hierarchy. Maurizio Palesi. Maurizio Palesi 1 Memory Hierarchy Maurizio Palesi Maurizio Palesi 1 References John L. Hennessy and David A. Patterson, Computer Architecture a Quantitative Approach, second edition, Morgan Kaufmann Chapter 5 Maurizio

More information

PART I: Adding Instructions to the Datapath. (2 nd Edition):

PART I: Adding Instructions to the Datapath. (2 nd Edition): EE57 Instrctor: G. Pvvada ===================================================================== Homework #5b De: check on the blackboard =====================================================================

More information

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu

CENG 3420 Computer Organization and Design. Lecture 08: Memory - I. Bei Yu CENG 3420 Computer Organization and Design Lecture 08: Memory - I Bei Yu CEG3420 L08.1 Spring 2016 Outline q Why Memory Hierarchy q How Memory Hierarchy? SRAM (Cache) & DRAM (main memory) Memory System

More information

ECE7995 (6) Improving Cache Performance. [Adapted from Mary Jane Irwin s slides (PSU)]

ECE7995 (6) Improving Cache Performance. [Adapted from Mary Jane Irwin s slides (PSU)] ECE7995 (6) Improving Cache Performance [Adapted from Mary Jane Irwin s slides (PSU)] Measuring Cache Performance Assuming cache hit costs are included as part of the normal CPU execution cycle, then CPU

More information
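The "Measuring Cache Performance" excerpt above breaks off mid-formula. As a minimal sketch of the standard relations it alludes to (all numbers below are illustrative, not taken from the listed slides): AMAT = hit time + miss rate × miss penalty, and CPU time = (CPU execution cycles + memory-stall cycles) × cycle time:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty

def cpu_time(instr_count, base_cpi, mem_refs_per_instr,
             miss_rate, miss_penalty, cycle_time):
    """CPU time including memory-stall cycles (write stalls ignored)."""
    stall_cycles = instr_count * mem_refs_per_instr * miss_rate * miss_penalty
    return (instr_count * base_cpi + stall_cycles) * cycle_time

# e.g. a 1-cycle hit, 5% miss rate, 100-cycle miss penalty:
print(amat(1, 0.05, 100))  # 6.0 cycles
```

Note how a small miss rate still dominates: 95% of accesses cost 1 cycle, yet the average is 6.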

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface COEN-4710 Computer Hardware Lecture 7 Large and Fast: Exploiting Memory Hierarchy (Chapter 5) Cristinel Ababei Marquette University Department

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches

CS 61C: Great Ideas in Computer Architecture. The Memory Hierarchy, Fully Associative Caches CS 61C: Great Ideas in Computer Architecture The Memory Hierarchy, Fully Associative Caches Instructor: Alan Christopher 7/09/2014 Summer 2014 -- Lecture #10 1 Review of Last Lecture Floating point (single

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address

More information

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache

Locality. Cache. Direct Mapped Cache. Direct Mapped Cache Locality A principle that makes having a memory hierarchy a good idea If an item is referenced, temporal locality: it will tend to be referenced again soon spatial locality: nearby items will tend to be

More information
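The locality excerpt above can be made concrete with a small sketch of how a direct-mapped cache splits a byte address into tag, index, and offset. The geometry here (16-byte blocks, 64 blocks, i.e. a 1 KiB cache) is hypothetical, chosen only for illustration:

```python
# Hypothetical direct-mapped cache geometry.
BLOCK_SIZE = 16   # bytes per block -> 4 offset bits
NUM_BLOCKS = 64   # blocks in cache -> 6 index bits (1 KiB total)

def split_address(addr):
    """Return (tag, index, offset) for a byte address."""
    offset = addr % BLOCK_SIZE
    index = (addr // BLOCK_SIZE) % NUM_BLOCKS
    tag = addr // (BLOCK_SIZE * NUM_BLOCKS)
    return tag, index, offset

# Spatial locality: neighbouring addresses share a block (same tag+index),
# so one miss brings in data that nearby future references will hit on.
# Temporal locality: re-referencing an address hits the same cache line.
print(split_address(0x12345))  # (72, 52, 5)
```

The index picks the one cache line an address can live in; the stored tag is compared on each access to decide hit or miss.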

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Review: Major Components of a Computer Processor Devices Control Memory Input Datapath Output Secondary Memory (Disk) Main Memory Cache Performance

More information

Modern Computer Architecture

Modern Computer Architecture Modern Computer Architecture Lecture3 Review of Memory Hierarchy Hongbin Sun 国家集成电路人才培养基地 Xi an Jiaotong University Performance 1000 Recap: Who Cares About the Memory Hierarchy? Processor-DRAM Memory Gap

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information