CPU Pipelining Issues

CPU Pipelining Issues

"What have you been beating your head against?"  "This pipe stuff makes my head hurt!"

Pipelining

Improve performance by increasing instruction throughput.

[Figure: three lw instructions executed back-to-back on the non-pipelined machine (one every 8 ns) versus pipelined (five 2 ns stages overlapped, a new instruction starting every 2 ns).]

Ideal speedup is the number of stages in the pipeline. Do we achieve this?
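Working the figure's numbers makes the answer concrete (a sketch assuming the figure's values: 8 ns per unpipelined instruction, five balanced-looking stages run off a 2 ns pipelined clock):

    T_unpipelined(n) = 8n ns
    T_pipelined(n)   = 10 + 2(n - 1) ns      (the first instruction must fill the pipe)

    n = 3:      speedup = 24 / 14 ≈ 1.7
    n -> ∞:     speedup -> 8n / 2n = 4, not 5

So no: the 2 ns clock is set by the slowest stage (a balanced split would be 8/5 = 1.6 ns), which caps the speedup below the stage count, and short programs pay the fill/drain overhead on top.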

Pipelining

What makes it easy?
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores

What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction

Individual instructions still take the same number of cycles, but we've improved throughput by increasing the number of simultaneously executing instructions.

Structural Hazards

[Figure: four instructions in flight, each stepping through Inst Fetch, Reg Read, ALU, Data Access, Reg Write; with only one memory, an instruction fetch collides with a data access in the same cycle.]

Data Hazards

Problem with starting the next instruction before the first is finished: dependencies that go backward in time are data hazards.

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

[Figure: pipeline diagram of the sequence above across clock cycles CC 1-CC 9; register $2 holds 10 until sub writes it (10/20 during CC 5), yet the and and or read $2 before then.]

Software Solution

Have the compiler guarantee no hazards. Where do we insert the nops?

sub $2, $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

(Two nops after the sub suffice here, assuming the register file writes in the first half of a cycle and reads in the second; the and and or are the only instructions that would otherwise read $2 too early.)

Problem: this really slows us down!

Forwarding

Use temporary results; don't wait for them to be written:
- register file forwarding to handle a read and write to the same register in one cycle
- ALU forwarding for back-to-back dependences

[Figure: the same sub/and/or/add/sw sequence with forwarding paths drawn. Register $2 becomes 10/20 only during CC 5, but the EX/MEM pipeline register already holds 20 in CC 4 and MEM/WB holds it in CC 5, so dependent instructions pick the value up from there.]
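A minimal sketch of the forwarding decision in C (field and register names are illustrative, following the usual EX/MEM and MEM/WB conventions rather than any particular design):

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical pipeline-register fields. */
    struct ExMem { bool reg_write; int rd; };
    struct MemWb { bool reg_write; int rd; };

    /* Mux select for one ALU input: 0 = value from the register file,
       2 = forward from EX/MEM, 1 = forward from MEM/WB. */
    int forward_select(int src_reg, struct ExMem ex_mem, struct MemWb mem_wb)
    {
        /* EX hazard: the result was computed last cycle and sits in
           EX/MEM, not yet written back to the register file. */
        if (ex_mem.reg_write && ex_mem.rd != 0 && ex_mem.rd == src_reg)
            return 2;
        /* MEM hazard: the result is two cycles old, in MEM/WB. */
        if (mem_wb.reg_write && mem_wb.rd != 0 && mem_wb.rd == src_reg)
            return 1;
        return 0;   /* no hazard: just read the register file */
    }

    int main(void)
    {
        /* sub $2,$1,$3 is in MEM while and $12,$2,$5 is in EX: */
        struct ExMem ex_mem = { true, 2 };
        struct MemWb mem_wb = { false, 0 };
        printf("forwardA = %d\n", forward_select(2, ex_mem, mem_wb)); /* prints 2 */
        return 0;
    }

The $0 check matters: register zero is hardwired, so a "result" destined for it must never be forwarded.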

Can't Always Forward

Load word can still cause a hazard: an instruction tries to read a register in the cycle right after a load instruction writes that same register.

lw  $2, 20($1)
and $4, $2, $5
or  $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

[Figure: pipeline diagram of the sequence above; the loaded value exists only after lw's Data Access stage, one cycle too late for the and's ALU stage even with forwarding.]

Thus, we need a hazard detection unit to stall the instruction.
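A sketch of that detection in C (again with illustrative field names; this is the standard load-use test, not a specific design):

    #include <stdbool.h>

    /* Hypothetical pipeline-register fields for the instruction in EX
       (ID/EX) and the instruction being decoded (IF/ID). */
    struct IdEx { bool mem_read; int rt; };
    struct IfId { int rs, rt; };

    /* Load-use hazard: the instruction in EX is a load whose destination
       is read by the instruction now in decode. When this returns true,
       freeze the PC and IF/ID for one cycle and send a bubble into EX. */
    bool must_stall(struct IdEx id_ex, struct IfId if_id)
    {
        return id_ex.mem_read &&
               (id_ex.rt == if_id.rs || id_ex.rt == if_id.rt);
    }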

Stalling

We can stall the pipeline by keeping an instruction in the same stage.

[Figure: the lw/and/or/add/slt sequence across CC 1-CC 10; the and is held for one extra cycle while a bubble moves down the pipe, and everything behind it slips by one cycle.]

Branch Hazards

When we decide to branch, other instructions are already in the pipeline!

40 beq $1, $3, 7
44 and $12, $2, $5
48 or  $13, $6, $2
52 add $14, $2, $2
72 lw  $4, 50($7)

[Figure: pipeline diagram of the sequence above; the instructions after the beq are fetched before the branch outcome is known.]

We are predicting "branch not taken"; we need to add hardware for flushing instructions if we are wrong.

Improving Performance

Try to avoid stalls! E.g., reorder these instructions:

lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)

(Swapping the two stores separates lw $t2 from the sw that needs its result, eliminating the load-use stall; it is safe because both loads complete before either store writes.)

Add a branch delay slot:
- the next instruction after a branch is always executed
- rely on the compiler to fill the slot with something useful

Superscalar: start more than one instruction in the same cycle.

Dynamic Scheduling

The hardware performs the scheduling:
- hardware tries to find instructions to execute
- out-of-order execution is possible
- speculative execution and dynamic branch prediction

All modern processors are very complicated:
- Pentium 4: 20-stage pipeline, 6 simultaneous instructions
- PowerPC and Pentium: branch history table
- compiler technology remains important

Pipeline Summary (I)

Started with an unpipelined implementation:
- direct execute, 1 cycle/instruction
- it had a long cycle time: mem + regs + alu + mem + wb

We ended up with a 5-stage pipelined implementation:
- increased throughput (3x???)
- delayed branch decision (1 cycle): choose to execute the instruction after the branch
- delayed register writeback (3 cycles): add bypass paths (6 x 2 = 12) to forward the correct value
- memory data available only in the WB stage: introduce NOPs at IR_ALU to stall the IF and RF stages until the LD result was ready

Pipeline Summary (II)

Fallacy #1: Pipelining is easy. Smart people get it wrong all of the time!

Fallacy #2: Pipelining is independent of ISA. Many ISA decisions impact how easy or costly pipelining is to implement (e.g., branch semantics, addressing modes).

Fallacy #3: Increasing pipeline stages improves performance. Diminishing returns; increasing complexity.

RISC = Simplicity???

The P.T. Barnum "World's Tallest Dwarf" competition.

[Figure: a complexity spectrum. At the simple end, primitive machines with direct implementations; RISCs sit just above them; then pipelines, bypasses, annulment, ...; then addressing features (e.g. index registers); then complex instructions, addressing modes, and generalization of registers and operand coding; at the far end, VLIWs and super-scalars, the "World's Most Complex RISC?"]

Memory Hierarchy

"Why are you dressed like that? Halloween was weeks ago!"  "It makes me look faster, don't you think?"

- Memory Flavors
- Principle of Locality
- Program Traces
- Memory Hierarchies
- Associativity

(Study Chapter 5)

What Do We Want in a Memory?

[Figure: the miniMIPS datapath (PC, INST, MADDR, MDATA, Wr) attached to a single ADDR/DOUT/DATA/R-W MEMORY port.]

              Capacity         Latency   Cost
Register      1000's of bits   10 ps     $$$$
SRAM          1's of Mbytes    0.2 ns    $$$
DRAM          10's of Gbytes   10 ns     $
Hard disk*    10's of Tbytes   10 ms
Want?         ~100 Gbytes      0.2 ns    cheap

* non-volatile

Tricks for Increasing Throughput

[Figure: a DRAM array with a multiplexed address feeding a row address decoder that drives the word lines; bit lines from each one-bit memory cell feed a column multiplexer/shifter. Below, a double-clocked SDRAM timing diagram clocking data out at t1-t4.]

The first thing that should pop into your mind when asked to speed up a digital design: PIPELINING.

- Synchronous DRAM (SDRAM), about $25 per Gbyte
- Double-clocked Synchronous DRAM: data clocked out on both edges

Hard Disk Drives

Typical drive:
- Average latency = 4 ms (7200 rpm)
- Average seek time = 8.5 ms
- Transfer rate = 140 Mbytes/s (SATA)
- Capacity = 1.5 Tbytes
- Cost = $149 (about 10 cents per Gbyte)

(figures from www.pctechguide.com)

Quantity vs Quality

Your memory system can be BIG and SLOW... or SMALL and FAST.

[Figure: cost ($/GB, log scale) versus access time (log scale): SRAM ($5000/GB, 0.2 ns), DRAM ($25/GB, 5 ns), DISK ($0.10/GB, 10 ms), DVD burner ($0.06/GB, 150 ms).]

We've explored a range of device-design trade-offs. Is there an ARCHITECTURAL solution to this DILEMMA?

Managing Memory via Programming

In reality, systems are built with a mixture of all these various memory types. How do we make the most effective use of each one?

We could push all of these issues off to programmers:
- keep the most frequently used variables and the stack in SRAM
- keep large data structures (arrays, lists, etc.) in DRAM
- keep the biggest data structures (databases) on DISK

But it is harder than you think: data usage evolves over a program's execution.

Best of Both Worlds

What we REALLY want: a BIG, FAST memory! (Keep everything within instant access.)

We'd like a memory system that
- PERFORMS like 10 Gbytes of SRAM, but
- COSTS like 1-4 Gbytes of slow memory.

SURPRISE: We can (nearly) get our wish!

KEY: Use a hierarchy of memory technologies. [Figure: CPU next to a small SRAM, backed by MAIN MEM.]

Key IDEA

- Keep the most often-used data in a small, fast SRAM (often local to the CPU chip).
- Refer to main memory only rarely, for the remaining data.

The reason this strategy works: LOCALITY.

Locality of Reference: a reference to location X at time t implies that a reference to location X+ΔX at time t+Δt becomes more probable as ΔX and Δt approach zero.

Cache

cache (kash) n.
1. A hiding place used especially for storing provisions.
2. A place for concealment and safekeeping, as of valuables.
3. The store of goods or valuables concealed in a hiding place.
4. Computer Science. A fast storage buffer in the central processing unit of a computer. In this sense, also called cache memory.
v. tr. cached, cach·ing, cach·es. To hide or store in a cache.

Cache Analogy

You are writing a term paper at a table in the library. As you work you realize you need a book; you stop writing, fetch the reference, and continue writing. You don't immediately return the book: maybe you'll need it again. Soon you have a few books at your table and no longer have to fetch more.

The table is a CACHE for the rest of the library.

Typical Memory Reference Patterns

[Figure: a memory trace plotted as address versus time, with separate bands of references for the stack, data, and program regions.]

MEMORY TRACE: a temporal sequence of memory references (addresses) from a real program.

TEMPORAL LOCALITY: if an item is referenced, it will tend to be referenced again soon.

SPATIAL LOCALITY: if an item is referenced, nearby items will tend to be referenced soon.

Working Set

[Figure: the memory trace again, with a window of width Δt over the stack, data, and program bands, and a curve of working-set size |S| versus Δt.]

S is the set of locations accessed during Δt.
Working set: a set S which changes slowly with respect to access time.

Exploiting the Memory Hierarchy

Approach 1 (Cray, others): Expose the Hierarchy
- Registers, main memory, and disk are each available as storage alternatives.
- Tell programmers: "Use them cleverly."

Approach 2: Hide the Hierarchy
- Programming model: a SINGLE kind of memory, a single address space.
- The machine AUTOMATICALLY assigns locations to fast or slow memory, depending on usage patterns.

[Figure: CPU, small static RAM, dynamic RAM main memory, and hard disk arranged as a hierarchy.]

Why We Care

CPU performance is dominated by memory performance: more significant than ISA, circuit optimization, pipelining, super-scalar execution, etc.

[Figure: CPU, small static CACHE, dynamic RAM MAIN MEMORY, and HARD DISK used as swap space.]

TRICK #1: How to make slow MAIN MEMORY appear faster than it is. Technique: CACHING.

TRICK #2: How to make a small MAIN MEMORY appear bigger than it is. Technique: VIRTUAL MEMORY.

The Cache Idea: Program-Transparent Memory Hierarchy

[Figure: CPU connected to a small "CACHE" backed by dynamic RAM "MAIN MEMORY"; the cache holds a temporary copy of, e.g., Mem[100] = 37.]

The cache contains TEMPORARY COPIES of selected main memory locations.

GOALS:
1) Improve the average access time. With HIT RATIO α (the fraction of references found in the cache) and MISS RATIO (1 − α):

   t_ave = α·t_c + (1 − α)·(t_c + t_m) = t_c + (1 − α)·t_m

   Challenge: make the hit ratio as high as possible.
2) Transparency (compatibility, programming ease).

How High of a Hit Ratio?

Suppose we can easily build an on-chip static memory with a 0.8 ns access time, but the fastest dynamic memories we can buy for main memory have an average access time of 10 ns. How high a hit ratio do we need to sustain an average access time of 1 ns?

   α = 1 − (t_ave − t_c)/t_m = 1 − (1 − 0.8)/10 = 98%

WOW, a cache really needs to be good!

Cache

Sits between the CPU and main memory. A very fast table that stores a TAG and DATA:
- TAG is the memory address
- DATA is a copy of memory at the address given by TAG

Cache              Memory
Tag    Data        1000: 17    1024: 44
1000   17          1004: 23    1028: 99
1040   1           1008: 11    1032: 97
1032   97          1012: 5     1036: 25
1008   11          1016: 29    1040: 1
                   1020: 38    1044: 4

Cache Access

On a load we look in the TAG entries for the address we're loading:
- Found: a HIT, return the DATA
- Not found: a MISS, go to memory for the data and put it and the address (TAG) in the cache

[Table: the same cache (tags 1000, 1040, 1032, 1008) and memory contents as the previous slide.]
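A toy version of this lookup in C (a minimal sketch: a tiny fully associative cache with naive FIFO replacement, sizes and names invented for illustration):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    /* Each entry is a (TAG, DATA) pair; a lookup compares the
       address against every TAG. */
    #define NLINES 4
    struct Line { bool valid; uint32_t tag; uint32_t data; };
    static struct Line cache[NLINES];
    static int next_victim;                 /* naive FIFO replacement */

    static uint32_t main_mem[2048];         /* stand-in for main memory */

    uint32_t load(uint32_t addr)
    {
        for (int i = 0; i < NLINES; i++)
            if (cache[i].valid && cache[i].tag == addr)
                return cache[i].data;                   /* HIT */
        uint32_t v = main_mem[addr % 2048];             /* MISS: go to memory */
        cache[next_victim] = (struct Line){ true, addr, v };  /* install it */
        next_victim = (next_victim + 1) % NLINES;
        return v;
    }

    int main(void)
    {
        main_mem[1000] = 17;                /* mirror the slide's Mem[1000] */
        printf("%u\n", load(1000));         /* MISS, fetched from memory */
        printf("%u\n", load(1000));         /* HIT */
        return 0;
    }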

Cache Lines

Usually we get more data than requested. (Why?)
- a LINE is the unit of memory stored in the cache
- usually much bigger than 1 word; 32 bytes per line is common
- a bigger LINE means fewer misses, because of spatial locality
- but a bigger LINE means a longer time on a miss

Cache                Memory
Tag    Data          1000: 17    1024: 44
1000   17  23        1004: 23    1028: 99
1040   1   4         1008: 11    1032: 97
1032   97  25        1012: 5     1036: 25
1008   11  5         1016: 29    1040: 1
                     1020: 38    1044: 4

Finding the TAG in the Cache

A 1 Mbyte cache may have 32k different lines, each of 32 bytes. We can't afford to sequentially search the 32k different tags.

ASSOCIATIVE memory uses hardware to compare the address to all the tags in parallel, but it is expensive, so a 1 Mbyte associative cache is unlikely.

[Figure: the incoming address fanned out to a "=?" comparator on every TAG entry; any match raises HIT and selects the matching entry's Data onto Data Out.]

Finding the TAG in the Cache

Associative search over 32k tags is too expensive; instead, a DIRECT MAPPED CACHE computes the cache entry from the address:
- multiple addresses map to the same cache line
- use the TAG to determine if it's the right one

Choose some bits from the address to determine the cache line:
- the low 5 bits determine which byte within the line
- we need 15 bits to determine which of the 32k different lines has the data
- which of the 32 − 5 = 27 remaining bits should we use?
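The split for this 1 Mbyte example, sketched in C (assuming 32-bit byte addresses and the common choice of indexing with the lowest available bits; which 15 bits to use is exactly the question the slide poses):

    #include <stdio.h>
    #include <stdint.h>

    /* 1 Mbyte direct-mapped cache, 32-byte lines -> 32k lines.
       Bits [4:0] pick the byte, bits [19:5] pick the line, and
       whatever remains (bits [31:20]) must be stored as the TAG. */
    uint32_t byte_in_line(uint32_t addr) { return addr & 0x1F; }
    uint32_t line_index(uint32_t addr)   { return (addr >> 5) & 0x7FFF; }
    uint32_t tag_bits(uint32_t addr)     { return addr >> 20; }

    int main(void)
    {
        uint32_t a = 0x00123456;   /* arbitrary example address */
        printf("byte %u, line %u, tag 0x%x\n",
               byte_in_line(a), line_index(a), tag_bits(a));
        return 0;
    }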

Direct-Mapping Example

With 8-byte lines, the bottom 3 bits determine the byte within the line. With 4 cache lines, the next 2 bits determine which line to use:

1024d = 10000000000b -> line = 00b = 0d
1000d = 01111101000b -> line = 01b = 1d
1040d = 10000010000b -> line = 10b = 2d

Line   Tag    Data        Memory: 1000: 17    1024: 44
0      1024   44  99              1004: 23    1028: 99
1      1000   17  23              1008: 11    1032: 97
2      1040   1   4               1012: 5     1036: 25
3      1016   29  38              1016: 29    1040: 1
                                  1020: 38    1044: 4

Direct Mapping Miss

What happens when we now ask for address 1008?

1008d = 01111110000b -> line = 10b = 2d

but earlier we put 1040d there:

1040d = 10000010000b -> line = 10b = 2d

so line 2 is evicted and refilled:

Line   Tag    Data
0      1024   44  99
1      1000   17  23
2      1008   11  5
3      1016   29  38
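A quick C check of the line numbers in the last two slides (same toy geometry: 8-byte lines, 4 lines):

    #include <stdio.h>
    #include <stdint.h>

    /* 8-byte lines: byte offset = bits [2:0]; 4 lines: index = bits [4:3]. */
    static uint32_t line_of(uint32_t addr) { return (addr >> 3) & 0x3; }

    int main(void)
    {
        uint32_t addrs[] = { 1024, 1000, 1040, 1008 };
        for (int i = 0; i < 4; i++)
            printf("%4u -> line %u\n", addrs[i], line_of(addrs[i]));
        /* 1024 -> 0, 1000 -> 1, 1040 -> 2, 1008 -> 2: the conflict above */
        return 0;
    }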

Miss Penalty and Rate

The MISS PENALTY is the time it takes to read memory when the data isn't in the cache; 50 to 100 cycles is common.
The MISS RATE is the fraction of accesses which MISS.
The HIT RATE is the fraction of accesses which HIT.
MISS RATE + HIT RATE = 1

Suppose a particular cache has a MISS PENALTY of 100 cycles and a HIT RATE of 95%. The CPI for a load is 5 on a HIT but 105 on a MISS. What is the average CPI for loads?

   Average CPI = 5 × 0.95 + 105 × 0.05 = 10

Suppose the MISS PENALTY were 120 cycles?

   Average CPI = 5 × 0.95 + 125 × 0.05 = 11   (slower memory doesn't hurt as much as you might fear)

Some Associativity Can Help

Direct-mapped caches are very common but can cause problems... SET ASSOCIATIVITY can help:
- multiple direct-mapped caches, then compare multiple TAGs
- 2-way set associative = 2 direct-mapped caches + 2 TAG comparisons
- 4-way set associative = 4 direct-mapped caches + 4 TAG comparisons

Now power-of-2 array sizes don't get us in trouble (they no longer force conflicting data into a single line). But it is slower, and fits less memory in the same area, so maybe direct-mapped wins...
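A sketch of a 2-way lookup in C (two direct-mapped arrays probed with the same index, then two TAG comparisons; the geometry here, 8k sets of two 32-byte ways, is invented for illustration):

    #include <stdint.h>
    #include <stdbool.h>

    #define NSETS 8192
    struct Way { bool valid; uint32_t tag; uint32_t data[8]; };
    static struct Way way0[NSETS], way1[NSETS];

    bool lookup(uint32_t addr, uint32_t *out)
    {
        uint32_t set  = (addr >> 5) & (NSETS - 1);   /* bits [17:5]  */
        uint32_t tag  = addr >> 18;                  /* bits [31:18] */
        uint32_t word = (addr >> 2) & 0x7;           /* bits [4:2]   */
        if (way0[set].valid && way0[set].tag == tag) { *out = way0[set].data[word]; return true; }
        if (way1[set].valid && way1[set].tag == tag) { *out = way1[set].data[word]; return true; }
        return false;   /* MISS: the caller fetches the line and picks a victim way */
    }

Two addresses that collide in a direct-mapped cache can now coexist, one per way, at the price of a second comparator and a way-select mux on the read path.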

What About Store?

What happens in the cache on a store?
- WRITE-BACK CACHE: put it in the cache; write memory only on replacement
- WRITE-THROUGH CACHE: put it in the cache and in memory

What happens on a store that MISSes?
- WRITE-BACK will fetch the line into the cache
- WRITE-THROUGH might just put it in memory
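The two policies side by side in C (a minimal sketch with a one-line cache; write-allocate for write-back and no-allocate for write-through, matching the slide's miss behavior):

    #include <stdint.h>
    #include <stdbool.h>

    struct CLine { bool valid, dirty; uint32_t tag, data; };
    static struct CLine line;                 /* one-line cache, for brevity */
    static uint32_t main_mem[2048];

    void store(uint32_t addr, uint32_t val, bool write_through)
    {
        bool hit = line.valid && line.tag == addr;
        if (write_through) {
            main_mem[addr % 2048] = val;      /* memory updated on every store */
            if (hit) line.data = val;         /* keep the cached copy consistent */
            /* on a miss, this version just puts it in memory */
        } else {                              /* write-back */
            if (!hit) {                       /* miss: fetch the line into the cache */
                if (line.valid && line.dirty) /* write the old line on replacement */
                    main_mem[line.tag % 2048] = line.data;
                line = (struct CLine){ true, false, addr, main_mem[addr % 2048] };
            }
            line.data = val;
            line.dirty = true;                /* memory updated only at eviction */
        }
    }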