Multi-cycle Datapath (Our Version)

Size: px

Start display at page:

Download "Multi-cycle Datapath (Our Version)"

Jessie Mitchell
5 years ago
Views:

1 ulti-cycle Datapath (Our Version) npc_sel Next PC PC Instruction Fetch IR File Operand Fetch A B ExtOp ALUSrc ALUctr Ext ALU R emrd emwr em Access emto Data em Dst Wr. File isters added: IR: Instruction register A, B: Two registers to hold operands read from register file. R: or ALUOut, holds the output of the ALU : or emory data register (DR) to hold data read from data memory Result Store Equal #1 Final Review Spring

2 Operations In Each Cycle R-Type Logic Immediate Load Store Branch IF Instruction Fetch IR em[pc] IR em[pc] IR em[pc] IR em[pc] IR em[pc] ID Instruction Decode A R[rs] B R[rt] A R[rs] A R[rs] A R[rs] B R[rt] A R[rs] B R[rt] If Equal = 1 EX Execution R A + B R A OR ZeroExt[imm16] R A + SignEx(Im16) R A + SignEx(Im16) PC PC (SignExt(imm16) x4) else PC PC + 4 E emory em[r] em[r] B PC PC + 4 WB Write Back R[rd] R PC PC + 4 R[rt] R PC PC + 4 R[rd] PC PC + 4 #2 Final Review Spring

3 ulti-cycle Datapath Instruction CPI R-Type/Immediate: Require four cycles, CPI =4 IF, ID, EX, WB Loads: Require five cycles, CPI = 5 IF, ID, EX, E, WB Stores: Require four cycles, CPI = 4 IF, ID, EX, E Branches: Require three cycles, CPI = 3 IF, ID, EX Average program 3 CPI 5 depending on program profile (instruction mix). #3 Final Review Spring

4 IPS ulti-cycle Datapath Performance Evaluation What is the average CPI? State diagram gives CPI for each instruction type. Workload (program) below gives frequency of each type. Type CPI i for type Frequency CPI i x freqi i Arith/Logic 4 40% 1.6 Load 5 30% 1.5 Store 4 10% 0.4 branch 3 20% 0.6 Average CPI: 4.1 Better than CPI = 5 if all instructions took the same number of clock cycles (5). #4 Final Review Spring

5 Instruction Pipelining Instruction pipelining is a CPU implementation technique where multiple operations on a number of instructions are overlapped. The next instruction is fetched in the next cycle without waiting for the current instruction to complete. An instruction execution pipeline involves a number of steps, where each step completes one part of an instruction. Each step is called a pipeline stage or a pipeline segment. The stages or steps are connected one to the next to form a pipeline -- instructions enter at one end and progress through the stages and exit at the other end when completed. Instruction Pipeline Throughput : The instruction completion rate of the pipeline and is determined by how often an instruction exists the pipeline. The time to move an instruction one step down the line is is equal to the machine cycle and is determined by the stage with the longest processing delay. Instruction Pipeline Latency: The time required to complete an instruction: Cycle time x Number of pipeline stages. #5 Final Review Spring

6 Pipelining: Design Goals The length of the machine clock cycle is determined by the time required for the slowest pipeline stage. An important pipeline design consideration is to balance the length of each pipeline stage. If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls): Time per instruction on unpipelined machine Number of pipe stages Under these ideal conditions: Speedup from pipelining = the number of pipeline stages = k One instruction is completed every cycle: CPI = 1. #6 Final Review Spring

7 From IPS ulti-cycle Datapath: Five Stages of Load Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Load IF ID EX E WB 1- Instruction Fetch (IF) Instruction Fetch Fetch the instruction from the Instruction emory. 2- Instruction Decode (ID): isters Fetch and InstructionDecode. 3- Execute (EX): Calculate the memory address. 4- emory (E): Read the data from the Data emory. 5- Write Back (WB): Write the data back to the register file. #7 Final Review Spring

8 Ideal Pipelined Instruction Processing Representation Clock cycle Number Time in clock cycles Instruction Number Instruction I IF ID EX E WB Instruction I+1 IF ID EX E WB Instruction I+2 IF ID EX E WB Instruction I+3 IF ID EX E WB Instruction I +4 IF ID EX E WB Time to fill the pipeline 5 Pipeline Stages: IF = Instruction Fetch ID = Instruction Decode EX = Execution E = emory Access WB = Write Back First instruction, I Completed Last instruction, I+4 completed Ideal without any stall cycles #8 Final Review Spring

9 Pipelined Instruction Processing Time IF ID EX E WB Representation IF ID EX E WB IF ID EX E WB IF ID EX E WB Program Flow IF ID EX E WB IF ID EX E WB #9 Final Review Spring

10 Clk Single Cycle, ulti-cycle, Vs. Pipeline Single Cycle Implementation: Cycle 1 Cycle 2 8 ns Load Store Waste 2ns Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk ultiple Cycle Implementation: Load IF ID EX E WB Store IF ID EX E R-type IF Pipeline Implementation: Load IF ID EX E WB Store IF ID EX E WB R-type IF ID EX E WB #10 Final Review Spring

11 Single Cycle, ulti-cycle, Pipeline: Performance Comparison Example For 1000 instructions, execution time: Single Cycle achine: 8 ns/cycle x 1 CPI x 1000 inst = 8000 ns ulti-cycle achine: 2 ns/cycle x 4.6 CPI (due to inst mix) x 1000 inst = 9200 ns Ideal pipelined machine, 5-stages: 2 ns/cycle x (1 CPI x 1000 inst + 4 cycle fill) = 2008 ns #11 Final Review Spring

12 Basic Pipelined CPU Design Steps 1. Analyze instruction set operations using independent RTN => datapath requirements. 2. Select required datapath components and connections. 3. Assemble an initial datapath meeting the ISA requirements. 4. Identify pipeline stages based on operation, balancing stage delays, and ensuring no hardware conflicts exist when common hardware is used by two or more stages simultaneously in the same cycle. 5. Divide the datapath into the stages identified above by adding buffers between the stages of sufficient width to hold: Instruction fields. Remaining control lines needed for remaining pipeline stages. All results produced by a stage and any unused results of previous stages. 6. Analyze implementation of each instruction to determine setting of control points that effects the register transfer taking pipeline hazard conditions into account. 7. Assemble the control logic. #12 Final Review Spring

13 A Pipelined Datapath 0 u x 1 IF/ID ID/EX EX/E E/WB Add 4 Shift left 2 Add result Add PC Address Instruction memory Instruction Read register 1 Read data 1 Read register 2 isters Read data 2 Write register Write data 0 ux 1 Zero ALU ALU result Address Write data Data memory Read data 1 ux 0 16 Sign extend 32 IF ID EX E WB Instruction Fetch Instruction Decode Execution emory Write Back #13 Final Review Spring

14 Representing Pipelines Graphically Time (in clock cycles) Program execution order (in instructions) lw $10, 20($1) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 I ALU D sub $11, $2, $3 I ALU D Can help with answering questions like: How many cycles does it take to execute this code? What is the ALU doing during cycle 4? Use this representation to help understand datapaths #14 Final Review Spring

15 Adding Pipeline Control Points PCSrc 0 u x 1 IF/ID ID/EX EX/E E/WB Add 4 Write Shift left 2 Add Add result Branch PC Address Instruction memory Instruction Read register 1 Read Read data 1 register 2 isters Read Write data 2 register Write data ALUSrc 0 u x 1 Zero ALU ALU result Address Write emwrite Data memory Read data emto 1 u x 0 Instruction [15 0] 16 Sign 32 extend 6 ALU control data emread Instruction [20 16] Instruction [15 11] 0 u x 1 ALUOp Dst #15 Final Review Spring

16 Pipeline Control Pass needed control signals along from one stage to the next as the instruction travels through the pipeline just like the data Execution/Address Calculation stage control lines emory access stage control lines Write-back stage control lines Instruction Dst ALU Op1 ALU Op0 ALU Src Branch em Read em Write write em to R-format lw sw X X beq X X WB Instruction Control WB EX WB IF/ID ID/EX EX/E E/WB #16 Final Review Spring

17 Pipeline Control The ain Control generates the control signals during /Dec Control signals for Exec (ExtOp, ALUSrc,...) are used 1 cycle later Control signals for em (emwr Branch) are used 2 cycles later Control signals for Wr (emto emwr) are used 3 cycles later ID EX em WB ExtOp ExtOp ALUSrc ALUSrc IF/ID ister ain Control ALUOp Dst emwr Branch emto ID/Ex ister ALUOp Dst emwr Branch emto Ex/em ister emwr Branch emto em/wb ister emto Wr Wr Wr Wr #17 Final Review Spring

18 Pipelined Datapath with Control Added PCSrc 0 u x 1 Control ID/EX WB EX/E WB E/WB IF/ID EX WB Add PC 4 Address Instruction memory Instruction Read register 1 Read data 1 Read register 2 isters Read Write data 2 register Write data R egwrite Shift left 2 0 u x 1 Add Add result ALUSrc Zero ALU ALU result Branch Write data emwrite Address Data memory Read data emto 1 u x 0 Instruction [15 0] 16 Sign 32 extend 6 ALU control emread Instruction [20 16] Instruction [15 11] 0 u x 1 Dst ALUOp Target address of branch determined in E #18 Final Review Spring

19 Basic Performance Issues In Pipelining Pipelining increases the CPU instruction throughput: The number of instructions completed per unit time. Under ideal condition instruction throughput is one instruction per machine cycle, or CPI = 1 Pipelining does not reduce the execution time of an individual instruction: The time needed to complete all processing steps of an instruction (also called instruction completion latency). It usually slightly increases the execution time of each instruction over unpipelined implementations due to the increased control overhead of the pipeline and pipeline stage registers delays. #19 Final Review Spring

20 Pipelining Performance Example Example: For an unpipelined machine: Clock cycle = 10ns, 4 cycles for ALU operations and branches and 5 cycles for memory operations with instruction frequencies of 40%, 20% and 40%, respectively. If pipelining adds 1ns to the machine clock cycle then the speedup in instruction execution from pipelining is: Non-pipelined Average instruction execution time = Clock cycle x Average CPI = 10 ns x ((40% + 20%) x %x 5) = 10 ns x 4.4 = 44 ns In the pipelined five implementation five stages are used with an average instruction execution time of: 10 ns + 1 ns = 11 ns Speedup from pipelining = Instruction time unpipelined Instruction time pipelined = 44 ns / 11 ns = 4 times #20 Final Review Spring

21 Pipeline Hazards Hazards are situations in pipelining which prevent the next instruction in the instruction stream from executing during the designated clock cycle resulting in one or more stall cycles. Hazards reduce the ideal speedup gained from pipelining and are classified into three classes: Structural hazards: Arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions. Data hazards: Arise when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline. Control hazards: Arise from the pipelining of conditional branches and other instructions that change the PC. #21 Final Review Spring

22 Structural Hazards In pipelined machines overlapped instruction execution requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline. If a resource conflict arises due to a hardware resource being required by more than one instruction in a single cycle, and one or more such instructions cannot be accommodated, then a structural hazard has occurred, for example: when a machine has only one register file write port or when a pipelined machine has a shared single-memory pipeline for data and instructions. stall the pipeline for one cycle for register writes or memory data access #22 Final Review Spring

23 Structural hazard Example: Single emory For Instructions & Data Time (clock cycles) I n s t r. O r d e r Load Instr 1 Instr 2 Instr 3 Instr 4 ALU em em em em ALU em ALU em em ALU em ALU em em Detection is easy in this case (right half highlight means read, left half write) #23 Final Review Spring

24 Data Hazards Example Problem with starting next instruction before first is finished Data dependencies here that go backward in time create data hazards. sub $2, $1, $3 and $12, $2, $5 or $13, $6, $2 add $14, $2, $2 sw $15, 100($2) Time (in clock cycles) Value of register $2: Program execution order (in instructions) sub $2, $1, $3 CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 I CC 7 CC 8 CC / D and $12, $2, $5 I D or $13, $6, $2 I D add $14, $2, $2 I D sw $15, 100($2) I D #24 Final Review Spring

25 Data Hazard Resolution: Stall Cycles Stall the pipeline by a number of cycles. The control unit must detect the need to insert stall cycles. In this case two stall cycles are needed. Time (in clock cycles) Value of register $2: Program execution order (in instructions) sub $2, $1, $3 CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 I CC 7 CC / D CC 9 20 CC CC and $12, $2, $5 I STALL STALL D or $13, $6, $2 STALL STALL I D add $14, $2, $2 I D sw $15, 100($2) I D #25 Final Review Spring

26 Performance of Pipelines with Stalls Hazards in pipelines may make it necessary to stall the pipeline by one or more cycles and thus degrading performance from the ideal CPI of 1. CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction If pipelining overhead is ignored and we assume that the stages are perfectly balanced then: Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction) When all instructions take the same number of cycles and is equal to the number of pipeline stages then: Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction) #26 Final Review Spring

27 Data Hazard Resolution: Forwarding Observation: Why not use temporary results produced by memory/alu and not wait for them to be written back in the register bank. Forwarding is a hardware-based technique (also called register bypassing or short-circuiting) used to eliminate or minimize data hazard stalls that makes use of this observation. Using forwarding hardware, the result of an instruction is copied directly from where it is produced (ALU, memory read port etc.), to where subsequent instructions need it (ALU input register, memory write port etc.) #27 Final Review Spring

28 Data Hazard Resolution: Forwarding ister file forwarding to handle read/write to same register ALU forwarding #28 Final Review Spring

29 Pipelined Datapath With Forwarding ID/EX WB EX/E Control WB E/WB IF/ID EX WB PC Instruction memory Instruction isters u x u x ALU Data memory u x IF/ID.isterRs Rs IF/ID.isterRt Rt IF/ID.isterRt IF/ID.isterRd Rt Rd u x EX/E.isterRd Forwarding unit E/WB.isterRd #29 Final Review Spring

30 Data Hazard Example With Forwarding Value of register $2 : Value of EX/E : Value of E/WB : Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC / X X X 20 X X X X X X X X X 20 X X X X Program execution order (in instructions) sub $2, $1, $3 I D and $12, $2, $5 I D or $13, $6, $2 I D add $14, $2, $2 I D sw $15, 100($2) I D #30 Final Review Spring

31 A Data Hazard Requiring A Stall A load followed by an R-type instruction that uses the loaded value Program execution order (in instructions) lw $2, 20($1) Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 I D CC 7 CC 8 CC 9 and $4, $2, $5 I D or $8, $2, $6 I D add $9, $4, $2 I D slt $1, $6, $7 I D Even with forwarding in place a stall cycle is needed This condition must be detected by hardware #31 Final Review Spring

32 A Data Hazard Requiring A Stall A load followed by an R-type instruction that uses the loaded value Program execution order (in instructions) Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 CC 10 lw $2, 20($1) I D and $4, $2, $5 I D or $8, $2, $6 add $9, $4, $2 I I D bubble I D slt $1, $6, $7 I D We can stall the pipeline by keeping an instruction in the same stage #32 Final Review Spring

33 Compiler Scheduling Example Reorder the instructions to avoid as many pipeline stalls as possible: lw $15, 0($2) lw $16, 4($2) add $14, $5, $16 sw $16, 4($2) The data hazard occurs on register $16 between the second lw and the add resulting in a stall cycle With forwarding we need to find only one independent instructions to place between them, swapping the lw instructions works: lw $16, 4($2) lw $15, 0($2) add $14, $5, $16 sw $16, 4($2) Without forwarding we need two independent instructions to place between them, so in addition a nop is added. lw $16, 4($2) lw $15, 0($2) nop add $14, $5, $16 sw $16, 4($2) #33 Final Review Spring

34 Datapath With Hazard Detection Unit A load followed by an instruction that uses the loaded value is detected and a stall cycle is inserted. Hazard detection unit ID/EX.emRead ID/EX IF/IDWrite IF/ID Control 0 ux WB EX EX/E WB E/WB WB PCWrite PC Instruction memory Instruction isters ux ALU Data memory ux ux IF/ID.isterRs IF/ID.isterRt IF/ID.isterRt IF/ID.isterRd Rt Rd ux EX/E.isterRd ID/EX.isterRt Rs Rt Forwarding unit E/WB.isterRd #34 Final Review Spring

35 Control Hazards: Example Three other instructions are in the pipeline before branch instruction target decision is made when BEQ is in E stage. Program execution order (in instructions) Time (in clock cycles) CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 40 beq $1, $3, 7 I D 44 and $12, $2, $5 I D 48 or $13, $6, $2 I D 52 add $14, $2, $2 I D 72 lw $4, 50($7) I D In the above diagram, we are predicting branch not taken Need to add hardware for flushing the three following instructions if we are wrong losing three cycles. #35 Final Review Spring

36 Reducing Delay of Taken Branches Next PC of a branch known in E stage: Costs three lost cycles if taken. If next PC is known in EX stage, one cycle is saved. Branch address calculation can be moved to ID stage using a register comparator, costing only one cycle if branch is taken. IF.Flush Hazard detection unit u x ID/EX WB EX/E Control 0 u x WB E/WB IF/ID EX WB PC 4 Instruction memory Shift left 2 isters = u x u x ALU Data memory u x Sign extend u x Forwarding unit #36 Final Review Spring

37 Pipeline Performance Example Assume the following IPS instruction mix: Type Frequency Arith/Logic 40% Load 30% of which 25% are followed immediately by an instruction using the loaded value Store 10% branch 20% of which 45% are taken What is the resulting CPI for the pipelined IPS with forwarding and branch address calculation in ID stage? CPI = Ideal CPI + Pipeline stall clock cycles per instruction = 1 + stalls by loads + stalls by branches = 1 +.3x.25x1 +.2 x.45x1 = = #37 Final Review Spring

38 emory Hierarchy: otivation The gap between CPU performance and realistic (non-ideal) main memory speed has been widening with higher performance CPUs creating performance bottlenecks for memory access instructions. The memory hierarchy is organized into several levels of memory or storage with the smaller, more expensive, and faster levels closer to the CPU: registers, then primary or Level 1 (L 1 ) SRA Cache, then possibly one or more secondary cache levels (L 2, L 3 ), then DRA main memory, then mass storage (virtual memory). Each level of the hierarchy is a subset of the level below: data found in a level is also found in the level below but at a lower speed. Each level maps addresses from a larger physical memory/storage level to a smaller level above it. This concept is greatly aided by the principal of locality both temporal and spatial which indicates that programs tend to reuse data and instructions that they have used recently or those stored in their vicinity leading to working set of a program. #38 Final Review Spring

39 Processor-DRA Performance Gap Impact: Example To illustrate the performance impact, assume a pipelined RISC CPU with CPI = 1 using non-ideal memory. Over a sixteen year period, ignoring other factors, the cost of a full memory access in terms of number of wasted instructions: CPU CPU emory inimum CPU cycles or Year speed cycle Access instructions wasted HZ ns ns 1986: /125 = : /30 = : /13.3 = : /5 = : /3.33 = : /1 = 60 #39 Final Review Spring

40 Levels of The emory Hierarchy Part of The On-chip CPU Datapath isters One or more levels (Static RA): Level 1: On-chip 16-64K Level 2: On or Off-chip K Level 3: Off-chip 128K-8 Dynamic RA (DRA) 16-16G Interface: SCSI, RAID, IDE, G-100G isters Cache ain emory agnetic Disc Farther away from The CPU Lower Cost/Bit Higher Capacity Increased Access Time/Latency Lower Throughput Optical Disk or agnetic Tape #40 Final Review Spring

41 A Typical emory Hierarchy (With Two Levels of Cache) Processor Faster Larger Capacity Datapath Control isters On-Chip Level One Cache L 1 Second Level Cache (SRA) L 2 ain emory (DRA) Virtual emory, Secondary Storage (Disk) Tertiary Storage (Tape) Speed (ns): < 1 10s 100s 10,000,000s (10s ms) Size (bytes): 100s KBs Bs GBs On or off Chip 10,000,000,000s (10s sec) TBs #41 Final Review Spring

42 emory Hierarchy Operation If an instruction or operand is required by the CPU, the levels of the memory hierarchy are searched for the item starting with the level closest to the CPU (Level 1, L 1 cache): If the item is found, it s delivered to the CPU resulting in a cache hit without searching lower levels. If the item is missing from an upper level, resulting in a cache miss, then the level just below is searched. For systems with several levels of cache, the search continues with cache level 2, 3 etc. If all levels of cache report a miss then main memory is accessed for the item. CPU cache memory: anaged by hardware. If the item is not found in main memory resulting in a page fault, then disk (virtual memory) is accessed for the item. emory disk: ainly managed by the operating system with hardware support. #42 Final Review Spring

43 emory Hierarchy: Terminology A Block: The smallest unit of information transferred between two levels. Hit: Item is found in some block in the upper level (example: Block X) Hit Rate: The fraction of memory access found in the upper level. Hit Time: Time to access the upper level which consists of RA access time + Time to determine hit/miss iss: Item needs to be retrieved from a block in the lower level (Block Y) iss Rate = 1 - (Hit Rate) iss Penalty: Time to replace a block in the upper level + Time to deliver the block the processor Hit Time << iss Penalty To Processor Upper Level emory Lower Level emory From Processor Blk X Blk Y #43 Final Review Spring

44 Cache Concepts Cache is the first level of the memory hierarchy once the address leaves the CPU and is searched first for the requested data. If the data requested by the CPU is present in the cache, it is retrieved from cache and the data access is a cache hit otherwise a cache miss and data must be read from main memory. On a cache miss a block of data must be brought in from main memory to into a cache block frame to possibly replace an existing cache block. The allowed block addresses where blocks can be mapped into cache from main memory is determined by cache placement strategy. Locating a block of data in cache is handled by cache block identification mechanism. On a cache miss the cache block being removed is handled by the block replacement strategy in place. When a write to cache is requested, a number of main memory update strategies exist as part of the cache write policy. #44 Final Review Spring

45 Cache Organization & Placement Strategies Placement strategies or mapping of a main memory data block onto cache block frame addresses divide cache into three organizations: 1 Direct mapped cache: A block can be placed in one location only, given by: (Block address) OD (Number of blocks in cache) 2 Fully associative cache: A block can be placed anywhere in cache. 3 Set associative cache: A block can be placed in a restricted set of places, or cache block frames. A set is a group of block frames in the cache. A block is first mapped onto the set and then it can be placed anywhere within the set. The set in this case is chosen by: (Block address) OD (Number of sets in cache) If there are n blocks in a set the cache placement is called n-way set-associative. #45 Final Review Spring

46 Cache Organization: Direct apped Cache A block can be placed in one location only, given by: (Block address) OD (Number of blocks in cache) In this case: (Block address) OD (8) C a ch e cache block frames 32 memory blocks cacheable (11101) OD (1000) = e m o ry #46 Final Review Spring

47 4KB Direct apped Cache Example Tag field A d d re ss (s h o w i n g b it p o s i tio n s ) B y te o ffse t Index field H i t T a g D a t a 1K = 1024 Blocks Each block = one word In d e x In d e x V a l id T ag D a ta Can cache up to 2 32 bytes = 4 GB of memory apping function: Cache Block frame number = (Block address) OD (1024) Block Address = 30 bits Tag = 20 bits Index = 10 bits Block offset = 2 bits #47 Final Review Spring

48 64KB Direct apped Cache Example Tag field 4K= 4096 blocks Each block = four words = 16 bytes Can cache up to 2 32 bytes = 4 GB of memory H it Ta g A d d ress (sho w in g b it po sition s) In d ex B yte o ffset 1 6 bi ts 12 8 b its V T a g D ata Index field Word select B loc k o ffset D a ta 4K en trie s u x 32 Block Address = 28 bits Tag = 16 bits Index = 12 bits Block offset = 4 bits apping Function: Cache Block frame number = (Block address) OD (4096) Larger blocks take better advantage of spatial locality #48 Final Review Spring

49 Cache Organization: Set Associative Cache O n e - w a y se t a ss o ci a tiv e (d i re c t m a p p e d ) B lo c k T a g D a ta T w o -w a y se t a ss o cia t iv e S e t T a g D a ta T a g D a ta 7 F o u r-w a y s e t a s s oc ia tiv e S e t T a g D a ta T a g D a ta T a g D a ta T a g D a ta 0 1 E ig h t -w a y se t a s so c ia t ive (fu ll y a sso c i a tive ) T a g D a ta T a g D a ta T a g D a ta T a g D a ta T a g D a ta T a g D a ta T a g D a ta T a g D a ta #49 Final Review Spring

50 Cache Organization Example #50 Final Review Spring

51 Locating A Data Block in Cache Each block frame in cache has an address tag. The tags of every cache block that might contain the required data are checked or searched in parallel. A valid bit is added to the tag to indicate whether this entry contains a valid address. The address from the CPU to cache is divided into: A block address, further divided into: An index field to choose a block set or frame in cache. (no index field when fully associative). A tag field to search and match addresses in the selected set. A block offset to select the data from the block. Tag Block Address Index Block Offset #51 Final Review Spring

52 Address Field Sizes Physical Address Generated by CPU Tag Block Address Index Block Offset Block offset size = log2(block size) Index size = log2(total number of blocks/associativity) Tag size = address size - index size - offset size Number of Sets apping function: Cache set or block frame number = Index = = (Block Address) OD (Number of Sets) #52 Final Review Spring

53 1024 block frames Each block = one word 4-way set associative 256 sets 4K Four-Way Set Associative Cache: IPS Implementation Example Tag Field Index V Tag A ddress Index Field Data V Tag D ata V Tag Data V Tag Data Can cache up to 2 32 bytes = 4 GB of memory Block Address = 30 bits Tag = 22 bits Index = 8 bits Block offset = 2 bits 4-to-1 m ultiplexo r apping Function: Cache Set Number = (Block address) OD (256) H it Da ta #53 Final Review Spring

54 Cache Organization/Addressing Example Given the following: A single-level L 1 cache with 128 cache block frames Each block frame contains four words (16 bytes) 16-bit memory addresses to be cached (64K bytes main memory or 4096 memory blocks) Show the cache organization/mapping and cache address fields for: Fully Associative cache. Direct mapped cache. 2-way set-associative cache. #54 Final Review Spring

55 Cache Example: Fully Associative Case V V All 128 tags must be checked in parallel by hardware to locate a data block V Valid bit Block Address = 12 bits Tag = 12 bits apping Function = none Block offset = 4 bits #55 Final Review Spring

56 Cache Example: Direct apped Case V V V Only a single tag must be checked in parallel to locate a data block V Valid bit Block Address = 12 bits Tag = 5 bits Index = 7 bits Block offset = 4 bits apping Function: Cache Block frame number = Index = (Block address) OD (128) ain emory #56 Final Review Spring

57 Cache Example: 2-Way Set-Associative Two tags in a set must be checked in parallel to locate a data block Valid bits not shown Block Address = 12 bits Tag = 6 bits Index = 6 bits Block offset = 4 bits apping Function: Cache Set Number = Index = (Block address) OD (64) ain emory #57 Final Review Spring

58 Calculating Number of Cache Bits Needed How many total bits are needed for a direct- mapped cache with 64 KBytes of data and one word blocks, assuming a 32-bit address? 64 Kbytes = 16 K words = 2 14 words = 2 14 blocks Block size = 4 bytes => offset size = 2 bits, #sets = #blocks = 2 14 => index size = 14 bits Tag size = address size - index size - offset size = = 16 bits Bits/block = data bits + tag bits + valid bit = = 49 Bits in cache = #blocks x bits/block = 2 14 x 49 = 98 Kbytes How many total bits would be needed for a 4-way set associative cache to store the same amount of data? Block size and #blocks does not change. #sets = #blocks/4 = (2 14 )/4 = 2 12 => index size = 12 bits Tag size = address size - index size - offset = = 18 bits Bits/block = data bits + tag bits + valid bit = = 51 Bits in cache = #blocks x bits/block = 2 14 x 51 = 102 Kbytes Increase associativity => increase bits in cache #58 Final Review Spring

59 Calculating Cache Bits Needed How many total bits are needed for a direct- mapped cache with 64 KBytes of data and 8 word blocks, assuming a 32-bit address (it can cache 2 32 bytes in memory)? 64 Kbytes = 2 14 words = (2 14 )/8 = 2 11 blocks block size = 32 bytes => offset size = block offset + byte offset = 5 bits, #sets = #blocks = 2 11 => index size = 11 bits tag size = address size - index size - offset size = = 16 bits bits/block = data bits + tag bits + valid bit = 8 x = 273 bits bits in cache = #blocks x bits/block = 2 11 x 273 = Kbytes Increase block size => decrease bits in cache. #59 Final Review Spring

60 Cache Replacement Policy When a cache miss occurs the cache controller may have to select a block of cache data to be removed from a cache block frame and replaced with the requested data, such a block is selected by one of two methods: Random: Any block is randomly selected for replacement providing uniform allocation. Simple to build in hardware. The most widely used cache replacement strategy. Least-recently used (LRU): Accesses to blocks are recorded and and the block replaced is the one that was not used for the longest period of time. LRU is expensive to implement, as the number of blocks to be tracked increases, and is usually approximated. #60 Final Review Spring

61 iss Rates for Caches with Different Size, Associativity & Replacement Algorithm Sample Data Associativity: 2-way 4-way 8-way Size LRU Random LRU Random LRU Random 16 KB 5.18% 5.69% 4.67% 5.29% 4.39% 4.96% 64 KB 1.88% 2.01% 1.54% 1.66% 1.39% 1.53% 256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12% #61 Final Review Spring

62 Unified vs.. Separate Level 1 Cache Unified Level 1 Cache (Princeton emory Architecture). A single level 1 cache is used for both instructions and data. Separate instruction/data Level 1 caches (Harvard emory Architecture): The level 1 (L 1 ) cache is split into two caches, one for instructions (instruction cache, L 1 I-cache) and the other for data (data cache, L 1 D-cache). Processor Processor Control Control Datapath isters Unified Level One Cache L 1 Datapath isters L 1 I-cache L 1 D-cache Unified Level 1 Cache (Princeton emory Architecture) Separate Level 1 Caches (Harvard emory Architecture) #62 Final Review Spring

63 Cache Performance Princeton (Unified( L1) emory Architecture For a CPU with a single level (L1) of cache for both instructions and data (Princeton memory architecture) and no stalls for cache hits: CPU time = (CPU execution clock cycles + emory stall clock cycles) x clock cycle time emory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty) If write and read miss penalties are the same: With ideal memory emory stall clock cycles = emory accesses x iss rate x iss penalty #63 Final Review Spring

64 Cache Performance Princeton emory Architecture CPUtime = Instruction count x CPI x Clock cycle time CPI execution = CPI with ideal memory CPI = CPI execution + em Stall cycles per instruction CPUtime = Instruction Count x (CPI execution + em Stall cycles per instruction) x Clock cycle time em Stall cycles per instruction = em accesses per instruction x iss rate x iss penalty CPUtime = IC x (CPI execution + em accesses per instruction x iss rate x iss penalty) x Clock cycle time isses per instruction = emory accesses per instruction x iss rate CPUtime = IC x (CPI execution + isses per instruction x iss penalty) x Clock cycle time #64 Final Review Spring

65 Cache Performance Example Suppose a CPU executes at Clock Rate = 200 Hz (5 ns per cycle) with a single level of cache (unified). CPI execution = 1.1 Instruction mix: 50% arith/logic, 30% load/store, 20% control Assume a cache miss rate of 1.5% and a miss penalty of 50 cycles. CPI = CPI execution + mem stalls per instruction em Stalls per instruction = em accesses per instruction x iss rate x iss penalty em accesses per instruction = = 1.3 Instruction fetch Load/store em Stalls per instruction = 1.3 x.015 x 50 = CPI = = The CPU with ideal cache (no misses) is 2.075/1.1 = 1.88 times faster #65 Final Review Spring

66 Cache Performance Example Suppose for the previous example we double the clock rate to 400 HZ, how much faster is this machine, assuming similar miss rate, instruction mix? Since memory speed is not changed, the miss penalty takes more CPU cycles: iss penalty = 50 x 2 = 100 cycles. CPI = x.015 x 100 = = 3.05 Speedup = (CPI old x C old )/ (CPI new x C new ) = x 2 / 3.05 = 1.36 The new machine is only 1.36 times faster rather than 2 times faster due to the increased effect of cache misses. CPUs with higher clock rate, have more cycles per cache miss and more memory impact on CPI. #66 Final Review Spring

67 Cache Performance Harvard emory Architecture For a CPU with separate level one (L1) caches for instructions and data (Harvard memory architecture) and no stalls for cache hits: CPUtime = Instruction count x CPI x Clock cycle time CPI = CPI execution + em Stall cycles per instruction CPUtime = Instruction Count x (CPI execution + em Stall cycles per instruction) x Clock cycle time em Stall cycles per instruction = Instruction Fetch iss rate x iss Penalty + Data emory Accesses Per Instruction x Data iss Rate x iss Penalty #67 Final Review Spring

68 Cache Performance Example Suppose a CPU uses separate level one (L1) caches for instructions and data (Harvard memory architecture) with different miss rates for instruction and data access: A cache hit incurs no stall cycles while a cache miss incurs 200 stall cycles for both memory reads and writes. CPI execution = 1.1 Instruction mix: 50% arith/logic, 30% load/store, 20% control Assume a cache miss rate of 0.5% for instruction fetch and a cache data miss rate of 6%. A cache hit incurs no stall cycles while a cache miss incurs 200 stall cycles for both memory reads and writes. Find the resulting CPI using this cache? How much faster is the CPU with ideal memory? CPI = CPI execution + mem stalls per instruction em Stall cycles per instruction = Instruction Fetch iss rate x iss Penalty + Data emory Accesses Per Instruction x Data iss Rate x iss Penalty em Stall cycles per instruction = 0.5/100 x x 6/100 x 200 = = 4.6 CPI = CPI execution + mem stalls per instruction = = 5.7 The CPU with ideal cache (no misses) is 5.7/1.1 = 5.18 times faster With no cache the CPI would have been = X 200 = 261.1!! #68 Final Review Spring

What do we have so far? Multi-Cycle Datapath (Textbook Version)

What do we have so far? Multi-Cycle Datapath (Textbook Version) What do we have so far? ulti-cycle Datapath (Textbook Version) CPI: R-Type = 4, Load = 5, Store 4, Branch = 3 Only one instruction being processed in datapath How to lower CPI further? #1 Lec # 8 Summer2001