Lecture #22 Pipelining II, Cache I



inst.eecs.berkeley.edu/~cs61c
CS61C: Machine Structures
Lecture #22: Pipelining II, Cache I
2008-7-29
Albert Chae, Instructor
Wireworld circuits: http://www.maa. html
CS61C L22 Pipelining II, Cache I (1)

Review: Processor Pipelining (1/2)
Pipeline registers are added to the datapath/controller to neatly divide the single-cycle processor into pipeline stages.
Optimal pipeline:
- Each stage is executing part of an instruction each clock cycle.
- One instruction finishes during each clock cycle.
- On average, instructions execute far more quickly.
What makes this work well?
- Similarities between instructions allow us to use the same stages for all instructions (generally).
- Each stage takes about the same amount of time as all the others: little wasted time.

A pipelined datapath
[Figure: pipelined datapath, from P&H.]

Review: Pipeline (2/2)
Pipelining is a BIG IDEA: a widely used concept.
What makes it less than perfect?
- Structural hazards: conflicts for resources. Suppose we had only one cache? Need more HW resources.
- Control hazards: branch instructions affect which instructions come next. Fix: delayed branch.
- Data hazards: data flow between instructions. Fix: forwarding.

Review
Some fixes to hazards:
- Illusion of two memories
- Register file convention
- Forwarding
- Load delay slot
- All else fails, bubble/stall
Latency vs. throughput: what prevents us from getting n-times speedup, where n is the number of pipeline stages?

Graphical Pipeline Representation
(In Reg, right half highlighted means read, left half means write.)
[Figure: pipeline diagram. Time (clock cycles) runs across; instruction order runs down (Load, Add, Store, Sub, Or). Each instruction passes through I$, Reg, D$, Reg.]
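The review question about n-times speedup can be explored with a little arithmetic. The following is a minimal sketch, not something from the slides: the function name and the assumption that one single-cycle clock period equals `num_stages` pipeline cycles (perfectly balanced stages) are illustrative choices.

```python
def pipeline_speedup(num_instructions: int, num_stages: int, stall_cycles: int = 0) -> float:
    """Speedup of a pipelined design over a single-cycle design.

    Assumption (idealized): one single-cycle clock period lasts as long as
    num_stages pipeline clock cycles, i.e. the stages are perfectly balanced.
    """
    # Single-cycle design: every instruction takes num_stages pipeline-cycle units.
    single_cycle_time = num_instructions * num_stages
    # Pipelined design: (num_stages - 1) cycles to fill the pipeline, then one
    # instruction completes per cycle, plus any bubbles caused by hazards.
    pipelined_time = (num_stages - 1) + num_instructions + stall_cycles
    return single_cycle_time / pipelined_time


if __name__ == "__main__":
    print(pipeline_speedup(1, 5))          # a lone instruction gains nothing
    print(pipeline_speedup(1000, 5))       # approaches 5 but never reaches it
    print(pipeline_speedup(1000, 5, 200))  # stalls pull the speedup down further
```

Under these assumptions the fill time and every stall cycle keep the speedup below n, which is one way to read the question on the slide.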

Control Hazard: Branching (1/8)
[Figure: pipeline diagram for beq followed by Inst 1, Inst 2, Inst 3, Inst 4. Time (clock cycles) runs across; instruction order runs down. Each instruction passes through I$, Reg, D$, Reg.]
Where do we do the compare for the branch?

Control Hazard: Branching (2/8)
We had put branch decision-making hardware in the ALU stage; therefore two more instructions after the branch will always be fetched, whether or not the branch is taken.
Desired functionality of a branch:
- if we do not take the branch, don't waste any time and continue executing normally
- if we take the branch, don't execute any instructions after the branch, just go to the desired label

Control Hazard: Branching (3/8)
Initial solution: stall until the decision is made.
- Insert "no-op" instructions (those that accomplish nothing, just take time), or hold up the fetch of the next instruction (for 2 cycles).
- Drawback: branches take 3 clock cycles each (assuming the comparator is put in the ALU stage).

Control Hazard: Branching (4/8)
Optimization #1: insert a special branch comparator in Stage 2.
- As soon as the instruction is decoded (the opcode identifies it as a branch), immediately make a decision and set the new value of the PC.
- Benefit: since the branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed.
- Side note: this means that branches are idle in Stages 3, 4 and 5.

Control Hazard: Branching (5/8)
[Figure: pipeline diagram for beq followed by Inst 1, Inst 2, Inst 3, Inst 4. Time (clock cycles) runs across; instruction order runs down.]
Branch comparator moved to Decode stage.

Control Hazard: Branching (6a/8)
User inserting no-op instruction
[Figure: pipeline diagram for add, beq, nop, lw; the nop travels down the pipeline as bubbles. Time (clock cycles) runs across; instruction order runs down.]
Impact: 2 clock cycles per branch instruction, which is slow.

Control Hazard: Branching (6b/8)
Controller inserting a single bubble
[Figure: pipeline diagram for add, beq, lw with one bubble after the beq. Time (clock cycles) runs across; instruction order runs down.]
Impact: 2 clock cycles per branch instruction, which is slow.

Control Hazard: Branching (7/8)
Optimization #2: redefine branches.
- Old definition: if we take the branch, none of the instructions after the branch get executed by accident.
- New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot).
The term "Delayed Branch" means we always execute the instruction after the branch.
This optimization is used on the MIPS.

Control Hazard: Branching (8/8)
Notes on the branch-delay slot:
- Worst-case scenario: can always put a no-op in the branch-delay slot.
- Better case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting the flow of the program.
  - re-ordering instructions is a common method of speeding up programs
  - the compiler must be very smart in order to find instructions to do this
  - usually can find such an instruction at least 50% of the time
- Jumps also have a delay slot.

Example: Nondelayed vs. Delayed Branch

Nondelayed Branch          Delayed Branch
or   $8, $9, $10           add  $1, $2, $3
add  $1, $2, $3            sub  $4, $5, $6
sub  $4, $5, $6            beq  $1, $4, Exit
beq  $1, $4, Exit          or   $8, $9, $10
xor  $10, $1, $11          xor  $10, $1, $11
Exit:                      Exit:
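The branch costs from the Control Hazard slides can be compared with a rough CPI estimate. This is only a sketch: the function name is hypothetical, and the 20% branch frequency used in the example is an assumed figure for illustration, not a number from the lecture. The 50% fill rate is the "at least 50% of the time" figure from the delay-slot notes.

```python
def cpi_with_branches(branch_fraction: float, wasted_cycles_per_branch: float,
                      base_cpi: float = 1.0) -> float:
    """Average cycles per instruction when each branch wastes some cycles."""
    return base_cpi + branch_fraction * wasted_cycles_per_branch


if __name__ == "__main__":
    branch_fraction = 0.20  # ASSUMPTION: one instruction in five is a branch

    # Comparator in the ALU stage: two instructions fetched needlessly per branch.
    print(cpi_with_branches(branch_fraction, 2))
    # Comparator moved to Stage 2 (decode): one wasted cycle per branch.
    print(cpi_with_branches(branch_fraction, 1))
    # Delayed branch: the slot is wasted only when it holds a no-op.
    slot_fill_rate = 0.50   # delay slot filled with useful work half the time
    print(cpi_with_branches(branch_fraction, 1 - slot_fill_rate))
```

With these assumed inputs the three schemes come out at roughly 1.4, 1.2 and 1.1 cycles per instruction; the ordering, not the exact values, is the point.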

Out-of-Order Laundry: Don't Wait
[Figure: laundry timeline. Time runs across from 6 PM to 2 AM; task order runs down (A, B, C, D, E, F), with a bubble.]
A depends on D; the rest continue; need more resources to allow out-of-order.

Superscalar Laundry: Parallel per stage
[Figure: laundry timeline. Time runs across from 6 PM to 2 AM; task order runs down (A, B, C, D, E, F); loads are labeled (light clothing), (dark clothing), (very dirty clothing).]
More resources, HW to match the mix of parallel tasks?


Superscalar Laundry: Mismatch Mix
[Figure: laundry timeline. Time runs across from 6 PM to 2 AM; task order runs down (A, B, C, D); loads are labeled (light clothing), (light clothing), (dark clothing), (light clothing).]
Task mix underutilizes extra resources.

Real-world pipelining problem
You're the manager of a HUGE assembly plant to build computers.
- Main pipeline: 10 minutes/pipeline stage, 60 stages
- Latency: 10h
Problem: need to run a 2 h test before done... help!


Real-world pipelining problem solution 1
You remember: a pipeline's frequency is limited by its slowest stage, so...
- Main pipeline: 2 hours/pipeline stage (instead of 10 minutes), 60 stages
- Latency: 120h (instead of 10h)
Problem: need to run a 2 h test before done... help!

Real-world pipelining problem solution 2
Create a sub-pipeline!
- Main pipeline: 10 minutes/pipeline stage, 60 stages
- 2h test (12 CPUs in this pipeline)
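The numbers in the assembly-plant example follow from two short calculations, sketched here with hypothetical helper names. Latency is stage time multiplied by the number of stages, and the sub-pipeline must hold as many units as main-pipeline stage times fit into the long test.

```python
import math


def pipeline_latency_hours(stage_minutes: float, num_stages: int) -> float:
    """Time for one unit to travel through every stage, in hours."""
    return stage_minutes * num_stages / 60


def units_in_sub_pipeline(test_minutes: float, stage_minutes: float) -> int:
    """Units under test at once so that one finishes every stage_minutes."""
    return math.ceil(test_minutes / stage_minutes)


if __name__ == "__main__":
    print(pipeline_latency_hours(10, 60))   # original plant
    print(pipeline_latency_hours(120, 60))  # solution 1: every stage slowed to 2 hours
    print(units_in_sub_pipeline(120, 10))   # solution 2: CPUs inside the 2h test
```

Solution 1 keeps the plant simple but stretches latency from 10h to 120h and cuts throughput to one unit every 2 hours. Solution 2 keeps one unit leaving every 10 minutes by holding 12 units in the test at a time.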

Peer Instruction (1/2)
Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 10^3 loops, so pipeline full).

Loop: lw    $t0, 0($s1)
      addu  $t0, $t0, $s2
      sw    $t0, 0($s1)
      addiu $s1, $s1, -4
      bne   $s1, $zero, Loop
      nop

How many pipeline stages (clock cycles) per loop iteration to execute this code?

Peer Instruction Answer (1/2)
Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 10^3 iterations, so pipeline full).

Loop: 1. lw    $t0, 0($s1)
      2. (data hazard so stall)
      3. addu  $t0, $t0, $s2
      4. sw    $t0, 0($s1)
      5. addiu $s1, $s1, -4
      6. (!= in DCD)
      7. bne   $s1, $zero, Loop
      8. nop   (delayed branch so exec. nop)

How many pipeline stages (clock cycles) per loop iteration to execute this code?

Peer Instruction (2/2)
Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 10^3 loops, so pipeline full). Rewrite this code to reduce pipeline stages (clock cycles) per loop to as few as possible.

Loop: lw    $t0, 0($s1)
      addu  $t0, $t0, $s2
      sw    $t0, 0($s1)
      addiu $s1, $s1, -4
      bne   $s1, $zero, Loop
      nop

How many pipeline stages (clock cycles) per loop iteration to execute this code?

Peer Instruction (2/2): How long to execute?
Rewrite this code to reduce clock cycles per loop to as few as possible:

Loop: 1. lw    $t0, 0($s1)
      2. addiu $s1, $s1, -4
      3. addu  $t0, $t0, $s2     (no hazard since extra cycle)
      4. bne   $s1, $zero, Loop
      5. sw    $t0, +4($s1)      (modified sw to put past addiu)

How many pipeline stages (clock cycles) per loop iteration to execute your revised code? (assume pipeline is full)
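The cycle counts in the two Peer Instruction answers can be reproduced with a small counter. This is a simplified sketch, not a pipeline simulator. It assumes a full pipeline and models only hazards between adjacent instructions: one stall when an instruction uses the result of the load directly before it, and one stall when a branch (compared in the decode stage) uses the result of the instruction directly before it. The `Instr` type and function name are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def cycles_per_iteration(loop: List[Instr]) -> int:
    """Issue cycles for one loop iteration, counting adjacent-instruction stalls only."""
    cycles = 0
    prev: Optional[Instr] = None
    for instr in loop:
        cycles += 1  # one issue slot per instruction, including the delay slot
        if prev is not None and prev.dest is not None and prev.dest in instr.srcs:
            if prev.op == "lw":
                cycles += 1  # load-use interlock: loaded value arrives a cycle late
            if instr.op in ("beq", "bne"):
                cycles += 1  # branch compares in decode, one cycle before forwarding helps
        prev = instr
    return cycles


original = [
    Instr("lw", "$t0", ("$s1",)),
    Instr("addu", "$t0", ("$t0", "$s2")),
    Instr("sw", None, ("$t0", "$s1")),
    Instr("addiu", "$s1", ("$s1",)),
    Instr("bne", None, ("$s1", "$zero")),
    Instr("nop", None, ()),
]

rewritten = [
    Instr("lw", "$t0", ("$s1",)),
    Instr("addiu", "$s1", ("$s1",)),
    Instr("addu", "$t0", ("$t0", "$s2")),
    Instr("bne", None, ("$s1", "$zero")),
    Instr("sw", None, ("$t0", "$s1")),  # in the delay slot, offset changed to +4
]

if __name__ == "__main__":
    print(cycles_per_iteration(original))   # matches the 8 numbered cycles
    print(cycles_per_iteration(rewritten))  # matches the 5 numbered cycles
```

The reordering removes both stalls by placing an independent instruction between each producer and its consumer, and it fills the delay slot with the store instead of a nop.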

Administrivia
- HW5 due TODAY 7/29
- Quiz9 due Wednesday 7/30
- HW6 due Friday 8/1
- Proj3 out soon, due next Tuesday 8/5
  - Will be hand graded in person, signups will be posted soon
- Midterm regrades due TODAY 7/29
- Proj1 grades out, proj2 hopefully soon; appeals due 7/31

Administrivia
- Lab on polling/interrupts is cancelled
  - We will give everyone 4 pts on that lab
- Drop or grading option deadline August 1
  - summer.berkeley.edu for more details


The Big Picture
[Figure: the components of a computer.]
- Processor (active): Control ("brain"), Datapath ("brawn")
- Memory (passive): where programs and data live when running
- Devices: Input, Output (keyboard, mouse; disk, network; display, printer)

Memory Hierarchy
Storage in computer systems:
- Processor
  - holds data in register file (~100 Bytes)
  - registers accessed on nanosecond timescale
- Memory (we'll call "main memory")
  - more capacity than registers (~Gbytes)
  - access time ~ ns
  - hundreds of clock cycles per memory access?!
- Disk
  - HUGE capacity (virtually limitless)
  - VERY slow: runs ~milliseconds

Motivation: Why We Use Caches (written $)
[Figure: performance over time for CPU and DRAM. µProc improves 60%/yr; DRAM improves 7%/yr; the Processor-Memory Performance Gap grows 50%/year. Annotations: first Intel CPU with cache on chip; 1998: Pentium III has two levels of cache on chip.]

Memory Caching
- Mismatch between processor and memory speeds leads us to add a new level: a memory cache.
- Implemented with the same IC processing technology as the CPU (usually integrated on the same chip): faster but more expensive than DRAM memory.
- Cache is a copy of a subset of main memory.
- Most processors have separate caches for instructions and data.
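The "grows 50% / year" figure for the processor-memory gap is consistent with the two improvement rates on the same slide, as this quick check shows. The function name is hypothetical and the rates are the ones quoted on the slide.

```python
def gap_growth_per_year(cpu_rate: float = 0.60, dram_rate: float = 0.07) -> float:
    """Yearly growth of the CPU/DRAM performance ratio, as a fraction."""
    # If CPU performance multiplies by 1.60 and DRAM by 1.07 each year,
    # their ratio multiplies by 1.60 / 1.07.
    return (1 + cpu_rate) / (1 + dram_rate) - 1


if __name__ == "__main__":
    print(f"{gap_growth_per_year():.1%}")  # close to 50% per year
```

Because the gap compounds, the ratio between the two grows exponentially over time, which is the motivation for adding a cache level.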

Memory Hierarchy
[Figure: memory hierarchy pyramid. The Processor sits at the top, above Level 1, Level 2, Level 3, ..., Level n. Higher levels are at the top, lower levels at the bottom, with increasing distance from the processor and decreasing speed. The width of each level shows the size of memory at that level.]
As we move to deeper levels the latency goes up and the price per bit goes down.

Memory Hierarchy
- If a level is closer to the Processor, it is:
  - smaller
  - faster
  - a subset of lower levels (contains most recently used data)
- Lowest level (usually disk) contains all available data (or does it go beyond the disk?)
- Memory Hierarchy presents the processor with the illusion of a very large, very fast memory.

Memory Hierarchy Analogy: Library (1/2)
You're writing a term paper (Processor) at a table in Doe.
- Doe Library is equivalent to disk
  - essentially limitless capacity
  - very slow to retrieve a book
- Table is main memory
  - smaller capacity: means you must return a book when the table fills up
  - easier and faster to find a book there once you've already retrieved it

Memory Hierarchy Analogy: Library (2/2)
- Open books on the table are cache
  - smaller capacity: can have very few open books fit on the table; again, when the table fills up, you must close a book
  - much, much faster to retrieve data
- Illusion created: whole library open on the tabletop
  - Keep as many recently used books open on the table as possible, since likely to use again
  - Also keep as many books on the table as possible, since faster than going to the library

Memory Hierarchy Basis
- Cache contains copies of data in memory that are being used.
- Memory contains copies of data on disk that are being used.
- Caches work on the principles of temporal and spatial locality.
  - Temporal locality: if we use it now, chances are we'll want to use it again soon.
  - Spatial locality: if we use a piece of memory, chances are we'll use the neighboring pieces soon.

Cache Design
- How do we organize the cache?
- Where does each memory address map to? (Remember that the cache is a subset of memory, so multiple memory addresses map to the same cache location.)
- How do we know which elements are in the cache?
- How do we quickly locate them?

Direct-Mapped Cache (1/4)
- In a direct-mapped cache, each memory address is associated with one possible block within the cache.
  - Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache.
  - Block is the unit of transfer between cache and memory.

Direct-Mapped Cache (2/4)
[Figure: memory addresses (hexadecimal, up to F) mapped onto the cache indices of a 4-byte direct-mapped cache. Block size = 1 byte.]
- Cache location 0 can be occupied by data from:
  - memory location 0, 4, 8, ...
  - 4 blocks: any memory location that is a multiple of 4
What if we wanted a block to be bigger than one byte?
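For the 4-byte cache with 1-byte blocks described above, the mapping from memory address to cache location is a remainder operation. A minimal sketch, with a hypothetical function name:

```python
def cache_location(address: int, num_blocks: int = 4) -> int:
    """Cache index for a direct-mapped cache with 1-byte blocks."""
    return address % num_blocks


if __name__ == "__main__":
    # Every memory address from 0x0 to 0xF that lands in cache location 0.
    print([addr for addr in range(16) if cache_location(addr) == 0])
```

Because `num_blocks` is a power of two, the remainder is just the low-order bits of the address, which is why hardware can compute it for free.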

Direct-Mapped Cache (3/4)
[Figure: memory addresses in steps of 2 (..., A, C, E, ..., 1A, 1C, 1E, etc.) mapped onto the cache indices of an 8-byte direct-mapped cache. Block size = 2 bytes.]
- When we ask for a byte, the system finds out the right block, and loads it all!
  - How does it know the right block?
  - How do we select the byte?
- E.g., Mem address 11101?
- How does it know WHICH colored block it originated from?
  - What do you do at baggage claim?

Direct-Mapped Cache (4/4)
[Figure: 8-byte direct-mapped cache w/tag! (Block size = 2 bytes). Memory is drawn with addresses shown (..., A, C, E, ..., 1A, 1C, 1E, etc.) and labeled by Cache#; each cache index holds a Tag and Data.]
- What should go in the tag?
  - Do we need the entire address?
  - What do all these tags have in common?
  - What did we do with the immediate when we were branch addressing, always count by bytes?
  - Why not count by cache #?
- It's useful to draw memory with the same width as the block size.

Issues with Direct-Mapped
- Since multiple memory addresses map to the same cache index, how do we tell which one is in there?
- What if we have a block size > 1 byte?
- Answer: divide the memory address into three fields

  ttttttttttttttttt iiiiiiiiii oooo
  tag: to check if we have the correct block
  index: to select the block
  byte offset: within the block

Direct-Mapped Cache Terminology
All fields are read as unsigned integers.
- Index: specifies the cache index (which "row"/block of the cache we should look in)
- Offset: once we've found the correct block, specifies which byte within the block we want
- Tag: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location
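The three-field split described above can be sketched with shifts and masks. The function name and parameter names are hypothetical. The example uses the 8-byte cache with 2-byte blocks from the earlier slide (1 offset bit, 2 index bits) and the address 11101 that the slide asks about.

```python
from typing import Tuple


def split_address(address: int, index_bits: int, offset_bits: int) -> Tuple[int, int, int]:
    """Split an address into (tag, index, offset), all read as unsigned integers."""
    offset = address & ((1 << offset_bits) - 1)                 # lowest bits
    index = (address >> offset_bits) & ((1 << index_bits) - 1)  # middle bits
    tag = address >> (offset_bits + index_bits)                 # everything left over
    return tag, index, offset


if __name__ == "__main__":
    tag, index, offset = split_address(0b11101, index_bits=2, offset_bits=1)
    print(f"tag={tag:b} index={index:b} offset={offset:b}")
```

For address 11101 this gives tag 11, index 10 and offset 1: look in cache row 2, check that the stored tag is 3, then take byte 1 of the block.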

TIO: Dan's great cache mnemonic
[Figure: an address split into Tag | Index | Offset, next to a rectangle whose WIDTH is the size of one block (B/block), whose HEIGHT is the # of blocks, and whose AREA is the cache size (B).]
AREA (cache size, B) = HEIGHT (# of blocks) * WIDTH (size of one block, B/block)
2^(H+W) = 2^H * 2^W, with H index bits and W offset bits

Direct-Mapped Cache Example (1/3)
- Suppose we have 16KB of data in a direct-mapped cache with 4-word blocks.
- Determine the size of the tag, index and offset fields if we're using a 32-bit architecture.
- Offset
  - need to specify the correct byte within a block
  - block contains 4 words = 16 bytes = 2^4 bytes
  - need 4 bits to specify the correct byte

Direct-Mapped Cache Example (2/3)
- Index: (~index into an "array of blocks")
  - need to specify the correct block in the cache
  - cache contains 16 KB = 2^14 bytes
  - block contains 2^4 bytes (4 words)
  - # blocks/cache = (bytes/cache) / (bytes/block) = (2^14 bytes/cache) / (2^4 bytes/block) = 2^10 blocks/cache
  - need 10 bits to specify this many blocks

Direct-Mapped Cache Example (3/3)
- Tag: use the remaining bits as tag
  - tag length = addr length - offset - index = 32 - 4 - 10 bits = 18 bits
  - so the tag is the leftmost 18 bits of the memory address
- Why not the full 32-bit address as tag?
  - All bytes within a block need the same address (4b)
  - Index must be the same for every address within a block, so it's redundant in the tag check, thus can leave off to save memory (here 10 bits)
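The worked example can be checked mechanically. The sketch below uses hypothetical helper names and assumes the cache and block sizes are powers of two. It derives the three field widths from the cache size, block size and address width.

```python
from typing import Tuple


def log2_exact(n: int) -> int:
    """Base-2 logarithm of a power of two; rejects anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def field_widths(cache_bytes: int, block_bytes: int, address_bits: int = 32) -> Tuple[int, int, int]:
    """Return (tag_bits, index_bits, offset_bits) for a direct-mapped cache."""
    offset_bits = log2_exact(block_bytes)                # WIDTH of a block
    index_bits = log2_exact(cache_bytes // block_bytes)  # HEIGHT: number of blocks
    tag_bits = address_bits - index_bits - offset_bits   # whatever remains
    return tag_bits, index_bits, offset_bits


if __name__ == "__main__":
    # 16 KB of data, 4-word (16-byte) blocks, 32-bit addresses.
    print(field_widths(16 * 1024, 4 * 4))
```

For the 16 KB cache with 4-word blocks this yields an 18-bit tag, a 10-bit index and a 4-bit offset, matching the slides.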

Caching Terminology
When we try to read memory, 3 things can happen:
1. cache hit: cache block is valid and contains the proper address, so read the desired word
2. cache miss: nothing in the cache in the appropriate block, so fetch from memory
3. cache miss, block replacement: wrong data is in the cache at the appropriate block, so discard it and fetch the desired data from memory (cache always copy)

Peer Instruction
Consider an address split into fields for cache access as follows:

  ttttttttttttttttttttttt iiiiii oooo

- How big are the cache blocks in words?
- How many entries does the cache have?
- How big is a cache entry?
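The three read outcomes listed under Caching Terminology can be illustrated with a toy direct-mapped cache that tracks only tags, not data. This is a sketch under stated assumptions: the class and method names are hypothetical, and an empty entry is represented by `None` rather than an explicit valid bit.

```python
from typing import List, Optional


class DirectMappedCache:
    """Toy direct-mapped cache that records which tag occupies each block."""

    def __init__(self, num_blocks: int, block_bytes: int) -> None:
        self.num_blocks = num_blocks
        self.block_bytes = block_bytes
        # None plays the role of a cleared valid bit (ASSUMPTION for brevity).
        self.tags: List[Optional[int]] = [None] * num_blocks

    def access(self, address: int) -> str:
        block_number = address // self.block_bytes  # drop the offset bits
        index = block_number % self.num_blocks      # which row to look in
        tag = block_number // self.num_blocks       # identifies the block in that row
        stored = self.tags[index]
        if stored == tag:
            return "hit"
        self.tags[index] = tag                      # fetch the block from memory
        if stored is None:
            return "miss"
        return "miss, block replacement"


if __name__ == "__main__":
    cache = DirectMappedCache(num_blocks=4, block_bytes=2)  # the 8-byte cache
    for addr in (0b11101, 0b11100, 0b00101, 0b11101):
        print(f"{addr:05b}: {cache.access(addr)}")
```

In the example, addresses 11101 and 11100 share a block, so the second access hits. Address 00101 maps to the same row with a different tag and evicts it, so the final access to 11101 is a miss with replacement as well.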

In Conclusion
- Pipeline challenge is hazards
  - Forwarding helps w/many data hazards
  - Delayed branch helps with control hazard in 5 stage pipeline
  - Load delay slot / interlock necessary
- More aggressive performance:
  - Superscalar
  - Out-of-order execution
- Use caches to simulate fast large memory
