Multi-cycle Datapath (Our Version)
- Jessie Mitchell
- 5 years ago
1 Multi-cycle Datapath (Our Version)
[Datapath diagram: npc_sel, Next PC, PC, Instruction Fetch, IR, Reg File, Operand Fetch, A, B, ExtOp, ALUSrc, ALUctr, Ext, ALU, R, MemRd, MemWr, Mem Access, MemtoReg, Data Mem, M, RegDst, RegWr, Reg. File, Result Store, Equal]
Registers added:
IR: Instruction register.
A, B: Two registers to hold operands read from the register file.
R: or ALUOut, holds the output of the ALU.
M: or memory data register (MDR), holds data read from data memory.
#1 Final Review Spring
2 Operations In Each Cycle
IF Instruction Fetch (all types): IR <- Mem[PC]
ID Instruction Decode: R-Type: A <- R[rs], B <- R[rt]; Logic Immediate: A <- R[rs]; Load: A <- R[rs]; Store: A <- R[rs], B <- R[rt]; Branch: A <- R[rs], B <- R[rt]
EX Execution: R-Type: R <- A + B; Logic Immediate: R <- A OR ZeroExt[imm16]; Load: R <- A + SignExt(imm16); Store: R <- A + SignExt(imm16); Branch: if Equal = 1 then PC <- PC + (SignExt(imm16) x 4) else PC <- PC + 4
MEM Memory: Load: M <- Mem[R]; Store: Mem[R] <- B, PC <- PC + 4
WB Write Back: R-Type: R[rd] <- R, PC <- PC + 4; Logic Immediate: R[rt] <- R, PC <- PC + 4; Load: R[rd] <- M, PC <- PC + 4
3 Multi-cycle Datapath Instruction CPI
R-Type/Immediate: require four cycles, CPI = 4 (IF, ID, EX, WB)
Loads: require five cycles, CPI = 5 (IF, ID, EX, MEM, WB)
Stores: require four cycles, CPI = 4 (IF, ID, EX, MEM)
Branches: require three cycles, CPI = 3 (IF, ID, EX)
Average program CPI: 3 <= CPI <= 5, depending on program profile (instruction mix).
4 MIPS Multi-cycle Datapath Performance Evaluation
What is the average CPI? The state diagram gives the CPI for each instruction type; the workload (program) below gives the frequency of each type.
Type | CPI_i for type | Frequency | CPI_i x freq_i
Arith/Logic | 4 | 40% | 1.6
Load | 5 | 30% | 1.5
Store | 4 | 10% | 0.4
Branch | 3 | 20% | 0.6
Average CPI = 4.1
Better than CPI = 5, which would apply if all instructions took the same number of clock cycles (5).
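The weighted-average CPI above can be checked with a short calculation (a sketch in Python; the mix and per-type CPIs are exactly the ones in the table):

```python
# Average CPI for the multi-cycle datapath, weighted by instruction mix.
mix = {
    "arith/logic": (4, 0.40),  # (CPI, frequency) from the table above
    "load":        (5, 0.30),
    "store":       (4, 0.10),
    "branch":      (3, 0.20),
}

avg_cpi = sum(cpi * freq for cpi, freq in mix.values())
print(round(avg_cpi, 2))  # 4.1
```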
5 Instruction Pipelining
Instruction pipelining is a CPU implementation technique in which the execution of multiple instructions is overlapped: the next instruction is fetched in the next cycle without waiting for the current instruction to complete.
An instruction execution pipeline involves a number of steps, where each step completes one part of an instruction. Each step is called a pipeline stage or a pipeline segment. The stages are connected one to the next to form a pipeline: instructions enter at one end, progress through the stages, and exit at the other end when completed.
Instruction Pipeline Throughput: The instruction completion rate of the pipeline, determined by how often an instruction exits the pipeline. The time to move an instruction one step down the pipeline is equal to the machine cycle and is determined by the stage with the longest processing delay.
Instruction Pipeline Latency: The time required to complete an instruction: Cycle time x Number of pipeline stages.
6 Pipelining: Design Goals
The length of the machine clock cycle is determined by the time required for the slowest pipeline stage, so an important pipeline design consideration is to balance the length of each pipeline stage.
If all stages are perfectly balanced, then the time per instruction on a pipelined machine (assuming ideal conditions with no stalls) is:
Time per instruction on unpipelined machine / Number of pipe stages
Under these ideal conditions:
Speedup from pipelining = the number of pipeline stages = k
One instruction is completed every cycle: CPI = 1.
7 From MIPS Multi-cycle Datapath: Five Stages of Load
Cycle 1: IF, Cycle 2: ID, Cycle 3: EX, Cycle 4: MEM, Cycle 5: WB
1- Instruction Fetch (IF): Fetch the instruction from the Instruction Memory.
2- Instruction Decode (ID): Registers fetch and instruction decode.
3- Execute (EX): Calculate the memory address.
4- Memory (MEM): Read the data from the Data Memory.
5- Write Back (WB): Write the data back to the register file.
8 Ideal Pipelined Instruction Processing Representation
Clock cycle number (time in clock cycles), one instruction starting per cycle:
Instruction I:   IF ID EX MEM WB
Instruction I+1:    IF ID EX MEM WB
Instruction I+2:       IF ID EX MEM WB
Instruction I+3:          IF ID EX MEM WB
Instruction I+4:             IF ID EX MEM WB
5 Pipeline Stages: IF = Instruction Fetch, ID = Instruction Decode, EX = Execution, MEM = Memory Access, WB = Write Back
The first four cycles fill the pipeline; after instruction I completes, one instruction completes per cycle until I+4. Ideal case, without any stall cycles.
9 Pipelined Instruction Processing Representation
[Diagram: six instructions in program-flow order, each passing through IF ID EX MEM WB, staggered by one cycle]
10 Single Cycle, Multi-cycle, vs. Pipeline
Single Cycle Implementation (8 ns clock): Load in cycle 1, Store in cycle 2; the Store wastes about 2 ns of the 8 ns cycle.
Multiple Cycle Implementation (2 ns clock, cycles 1-10): Load: IF ID EX MEM WB; then Store: IF ID EX MEM; then R-type: IF ...
Pipeline Implementation (2 ns clock): Load: IF ID EX MEM WB; Store: IF ID EX MEM WB; R-type: IF ID EX MEM WB, each starting one cycle after the previous.
11 Single Cycle, Multi-cycle, Pipeline: Performance Comparison Example
For 1000 instructions, execution time:
Single Cycle Machine: 8 ns/cycle x 1 CPI x 1000 inst = 8000 ns
Multi-cycle Machine: 2 ns/cycle x 4.6 CPI (due to inst mix) x 1000 inst = 9200 ns
Ideal pipelined machine, 5 stages: 2 ns/cycle x (1 CPI x 1000 inst + 4 cycle fill) = 2008 ns
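The three execution times can be reproduced directly from the stated cycle times and CPIs (a sketch; nothing beyond the slide's parameters is assumed):

```python
# Execution time of 1000 instructions on each implementation.
n = 1000

single_cycle = 8 * 1 * n         # 8 ns/cycle, CPI = 1
multi_cycle = 2 * 4.6 * n        # 2 ns/cycle, CPI = 4.6 from the instruction mix
pipelined = 2 * (1 * n + 4)      # 2 ns/cycle, CPI = 1 plus 4 fill cycles

print(single_cycle, round(multi_cycle), pipelined)  # 8000 9200 2008
```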
12 Basic Pipelined CPU Design Steps
1. Analyze instruction set operations using independent RTN => datapath requirements.
2. Select the required datapath components and connections.
3. Assemble an initial datapath meeting the ISA requirements.
4. Identify pipeline stages based on operation, balancing stage delays and ensuring that no hardware conflicts exist when common hardware is used by two or more stages simultaneously in the same cycle.
5. Divide the datapath into the stages identified above by adding buffers between the stages of sufficient width to hold: instruction fields, the control lines needed for the remaining pipeline stages, and all results produced by a stage plus any unused results of previous stages.
6. Analyze the implementation of each instruction to determine the setting of control points that effects the register transfer, taking pipeline hazard conditions into account.
7. Assemble the control logic.
13 A Pipelined Datapath
[Diagram: the single-cycle datapath divided by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers into five stages: IF Instruction Fetch, ID Instruction Decode, EX Execution, MEM Memory, WB Write Back]
14 Representing Pipelines Graphically
Time (in clock cycles) CC 1 through CC 6, program execution order (in instructions):
lw $10, 20($1): IM Reg ALU DM Reg
sub $11, $2, $3: IM Reg ALU DM Reg (one cycle later)
Can help with answering questions like: How many cycles does it take to execute this code? What is the ALU doing during cycle 4?
Use this representation to help understand datapaths.
15 Adding Pipeline Control Points
[Diagram: the pipelined datapath of slide 13 with control points added: PCSrc, RegWrite, Branch, ALUSrc, Zero, MemWrite, MemRead, MemtoReg, RegDst, ALUOp, and the ALU control block driven by Instruction [15-0]; the RegDst mux selects between Instruction [20-16] and Instruction [15-11]]
16 Pipeline Control
Pass the needed control signals along from one stage to the next as the instruction travels through the pipeline, just like the data.
Execution/Address Calculation stage control lines: RegDst, ALUOp1, ALUOp0, ALUSrc. Memory access stage control lines: Branch, MemRead, MemWrite. Write-back stage control lines: RegWrite, MemtoReg.
Instruction | RegDst ALUOp1 ALUOp0 ALUSrc | Branch MemRead MemWrite | RegWrite MemtoReg
R-format | 1 1 0 0 | 0 0 0 | 1 0
lw | 0 0 0 1 | 0 1 0 | 1 1
sw | X 0 0 1 | 0 0 1 | 0 X
beq | X 0 1 0 | 1 0 0 | 0 X
[Diagram: the Control unit's output is split into EX, M, and WB fields and carried through the IF/ID, ID/EX, EX/MEM, and MEM/WB registers]
17 Pipeline Control
The Main Control generates the control signals during ID.
Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later.
Control signals for Mem (MemWr, Branch) are used 2 cycles later.
Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later.
[Diagram: ExtOp, ALUSrc, ALUOp, RegDst are consumed in EX; MemWr and Branch are carried through the ID/Ex and Ex/Mem registers to MEM; MemtoReg and RegWr are carried through the Mem/WB register to WB]
18 Pipelined Datapath with Control Added
[Diagram: the pipelined datapath with the Control unit writing WB, M, and EX control fields into the ID/EX register and passing them down through EX/MEM and MEM/WB; control points shown: PCSrc, RegWrite, ALUSrc, Zero, Branch, MemWrite, MemRead, MemtoReg, RegDst, ALUOp]
Target address of branch determined in MEM.
19 Basic Performance Issues In Pipelining
Pipelining increases the CPU instruction throughput: the number of instructions completed per unit time. Under ideal conditions, instruction throughput is one instruction per machine cycle, or CPI = 1.
Pipelining does not reduce the execution time of an individual instruction: the time needed to complete all processing steps of an instruction (also called instruction completion latency).
It usually slightly increases the execution time of each instruction over unpipelined implementations, due to the increased control overhead of the pipeline and the delays of the pipeline stage registers.
20 Pipelining Performance Example
Example: For an unpipelined machine: clock cycle = 10 ns; 4 cycles for ALU operations and branches and 5 cycles for memory operations, with instruction frequencies of 40%, 20%, and 40%, respectively.
If pipelining adds 1 ns to the machine clock cycle, then the speedup in instruction execution from pipelining is:
Non-pipelined average instruction execution time = Clock cycle x Average CPI = 10 ns x ((40% + 20%) x 4 + 40% x 5) = 10 ns x 4.4 = 44 ns
In the pipelined implementation, five stages are used with an average instruction execution time of: 10 ns + 1 ns = 11 ns
Speedup from pipelining = Instruction time unpipelined / Instruction time pipelined = 44 ns / 11 ns = 4 times
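The example's arithmetic can be replayed step by step (a sketch using only the frequencies and cycle counts stated above):

```python
# Speedup from pipelining (numbers from the example).
cycle = 10      # ns, unpipelined clock cycle
overhead = 1    # ns added to the cycle by pipelining

avg_cpi = (0.40 + 0.20) * 4 + 0.40 * 5   # ALU ops/branches: 4 cycles, memory ops: 5
t_unpipelined = cycle * avg_cpi          # average instruction time, unpipelined
t_pipelined = cycle + overhead           # one instruction completes per lengthened cycle

print(round(t_unpipelined), t_pipelined, round(t_unpipelined / t_pipelined))  # 44 11 4
```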
21 Pipeline Hazards
Hazards are situations in pipelining that prevent the next instruction in the instruction stream from executing during its designated clock cycle, resulting in one or more stall cycles.
Hazards reduce the ideal speedup gained from pipelining and are classified into three classes:
Structural hazards: Arise from hardware resource conflicts when the available hardware cannot support all possible combinations of instructions.
Data hazards: Arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
Control hazards: Arise from the pipelining of conditional branches and other instructions that change the PC.
22 Structural Hazards
In pipelined machines, overlapped instruction execution requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.
If a resource conflict arises because a hardware resource is required by more than one instruction in a single cycle, and one or more such instructions cannot be accommodated, then a structural hazard has occurred; for example, when a machine has only one register file write port, or when a pipelined machine has a shared single-memory pipeline for data and instructions.
The resolution is to stall the pipeline for one cycle for register writes or memory data accesses.
23 Structural Hazard Example: Single Memory For Instructions & Data
[Diagram: a Load followed by Instr 1-4, each shown as Mem Reg ALU Mem Reg over clock cycles; the Load's data-memory access conflicts with a later instruction's fetch from the single shared memory in the same cycle]
Detection is easy in this case (right half highlight means read, left half write).
24 Data Hazards Example
Problem with starting the next instruction before the first is finished: data dependencies that go backward in time create data hazards.
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Diagram: the five instructions over clock cycles CC 1-9, each as IM Reg ALU DM Reg; $2 is written by sub in CC 5 but read by and in CC 3 and by or in CC 4]
25 Data Hazard Resolution: Stall Cycles
Stall the pipeline by a number of cycles; the control unit must detect the need to insert stall cycles. In this case two stall cycles are needed.
sub $2, $1, $3
and $12, $2, $5 (after two STALL cycles)
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Diagram: the same code sequence over clock cycles; and and or each wait through two stall cycles so that they read $2 after sub has written it]
26 Performance of Pipelines with Stalls
Hazards in pipelines may make it necessary to stall the pipeline by one or more cycles, degrading performance from the ideal CPI of 1.
CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
If pipelining overhead is ignored and we assume that the stages are perfectly balanced, then:
Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
When all instructions take the same number of cycles, equal to the number of pipeline stages, then:
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
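The last formula is easy to experiment with. The sketch below assumes a 5-stage pipeline; the 0.5 stalls/instruction figure is purely illustrative, not from the slides:

```python
def pipeline_speedup(depth, stalls_per_instr):
    """Speedup of a perfectly balanced pipeline, degraded by stall cycles."""
    return depth / (1 + stalls_per_instr)

print(pipeline_speedup(5, 0))    # 5.0: the ideal 5-stage speedup
print(pipeline_speedup(5, 0.5))  # hypothetical half a stall per instruction
```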
27 Data Hazard Resolution: Forwarding
Observation: why not use the temporary results produced by the memory/ALU rather than wait for them to be written back to the register bank?
Forwarding is a hardware-based technique (also called register bypassing or short-circuiting) that uses this observation to eliminate or minimize data hazard stalls.
Using forwarding hardware, the result of an instruction is copied directly from where it is produced (ALU, memory read port, etc.) to where subsequent instructions need it (ALU input register, memory write port, etc.).
28 Data Hazard Resolution: Forwarding
Register file forwarding to handle read/write to the same register.
ALU forwarding.
29 Pipelined Datapath With Forwarding
[Diagram: the pipelined datapath with a forwarding unit that compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against the Rs and Rt fields carried from IF/ID into ID/EX, and steers muxes at the ALU inputs accordingly]
30 Data Hazard Example With Forwarding
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
[Diagram: the code sequence over clock cycles CC 1-9, showing per cycle the value of register $2 and of the EX/MEM and MEM/WB registers; forwarding paths deliver sub's result to and, or, and add without stalls]
31 A Data Hazard Requiring A Stall
A load followed by an R-type instruction that uses the loaded value:
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Even with forwarding in place, a stall cycle is needed. This condition must be detected by hardware.
32 A Data Hazard Requiring A Stall
A load followed by an R-type instruction that uses the loaded value:
lw $2, 20($1)
and $4, $2, $5 (a bubble is inserted before and)
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
We can stall the pipeline by keeping an instruction in the same stage.
33 Compiler Scheduling Example
Reorder the instructions to avoid as many pipeline stalls as possible:
lw $15, 0($2)
lw $16, 4($2)
add $14, $5, $16
sw $16, 4($2)
The data hazard occurs on register $16 between the second lw and the add, resulting in a stall cycle.
With forwarding we need to find only one independent instruction to place between them; swapping the lw instructions works:
lw $16, 4($2)
lw $15, 0($2)
add $14, $5, $16
sw $16, 4($2)
Without forwarding we need two independent instructions to place between them, so in addition a nop is added:
lw $16, 4($2)
lw $15, 0($2)
nop
add $14, $5, $16
sw $16, 4($2)
34 Datapath With Hazard Detection Unit
A load followed by an instruction that uses the loaded value is detected and a stall cycle is inserted.
[Diagram: a hazard detection unit checks ID/EX.MemRead and ID/EX.RegisterRt against IF/ID.RegisterRs and IF/ID.RegisterRt; on a match it deasserts PCWrite and IF/IDWrite and selects zeroed control signals through a mux, inserting a bubble]
35 Control Hazards: Example
Three other instructions are in the pipeline before the branch target decision is made, when BEQ is in the MEM stage.
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
In the above diagram we are predicting branch not taken; we need to add hardware for flushing the three following instructions if we are wrong, losing three cycles.
36 Reducing Delay of Taken Branches
The next PC of a branch is known in the MEM stage: this costs three lost cycles if the branch is taken.
If the next PC is known in the EX stage, one cycle is saved.
Branch address calculation can be moved to the ID stage using a register comparator, costing only one cycle if the branch is taken.
[Diagram: the pipelined datapath with an IF.Flush signal, an equality comparator on the register file outputs in ID, and the branch target adder moved to ID]
37 Pipeline Performance Example
Assume the following MIPS instruction mix:
Type | Frequency
Arith/Logic | 40%
Load | 30%, of which 25% are followed immediately by an instruction using the loaded value
Store | 10%
Branch | 20%, of which 45% are taken
What is the resulting CPI for the pipelined MIPS with forwarding and branch address calculation in the ID stage?
CPI = Ideal CPI + Pipeline stall clock cycles per instruction = 1 + stalls by loads + stalls by branches = 1 + 0.3 x 0.25 x 1 + 0.2 x 0.45 x 1 = 1 + 0.075 + 0.09 = 1.165
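The CPI computation above can be sketched directly from the stated frequencies (each hazardous load and each taken branch costs one stall cycle):

```python
# CPI of the pipelined MIPS with forwarding and branch decision in ID.
load_freq, dependent_frac = 0.30, 0.25   # 25% of loads stall one cycle
branch_freq, taken_frac = 0.20, 0.45     # taken branches cost one cycle

cpi = 1 + load_freq * dependent_frac * 1 + branch_freq * taken_frac * 1
print(round(cpi, 3))  # 1.165
```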
38 Memory Hierarchy: Motivation
The gap between CPU performance and realistic (non-ideal) main memory speed has been widening as higher-performance CPUs create performance bottlenecks for memory access instructions.
The memory hierarchy is organized into several levels of memory or storage, with the smaller, more expensive, and faster levels closer to the CPU: registers, then primary or Level 1 (L1) SRAM cache, then possibly one or more secondary cache levels (L2, L3), then DRAM main memory, then mass storage (virtual memory).
Each level of the hierarchy is a subset of the level below: data found in a level is also found in the level below, but at a lower speed. Each level maps addresses from a larger physical memory/storage level to the smaller level above it.
This concept is greatly aided by the principle of locality, both temporal and spatial, which indicates that programs tend to reuse data and instructions that they have used recently or those stored in their vicinity, leading to the working set of a program.
39 Processor-DRAM Performance Gap Impact: Example
To illustrate the performance impact, assume a pipelined RISC CPU with CPI = 1 using non-ideal memory. Over a sixteen-year period, ignoring other factors, the cost of a full memory access in terms of number of wasted instructions:
[Table: for each year from 1986 to 2000, CPU speed (MHz), CPU cycle (ns), memory access time (ns), and the minimum CPU cycles or instructions wasted per access, computed as access time / cycle time; the CPU cycle shrinks from 125 ns to 1 ns over the period, and by 2000 a single memory access wastes 60 instructions]
40 Levels of The Memory Hierarchy
Registers: part of the on-chip CPU datapath.
Cache, one or more levels (Static RAM): Level 1: on-chip, 16-64K; Level 2: on- or off-chip; Level 3: off-chip, 128K-8M.
Main Memory: Dynamic RAM (DRAM), 16M-16G.
Magnetic Disk (interface: SCSI, RAID, IDE), G-100G.
Optical Disk or Magnetic Tape.
Farther away from the CPU: lower cost/bit, higher capacity, increased access time/latency, lower throughput.
41 A Typical Memory Hierarchy (With Two Levels of Cache)
Processor (Datapath, Control, Registers) -> On-Chip Level One Cache L1 -> Second Level Cache (SRAM) L2, on or off chip -> Main Memory (DRAM) -> Virtual Memory, Secondary Storage (Disk) -> Tertiary Storage (Tape)
Speed (ns): < 1, 10s, 100s, 10,000,000s (10s ms), 10,000,000,000s (10s sec)
Size (bytes): 100s, KBs, MBs, GBs, TBs
Faster and smaller near the processor; larger capacity farther away.
42 Memory Hierarchy Operation
If an instruction or operand is required by the CPU, the levels of the memory hierarchy are searched for the item starting with the level closest to the CPU (Level 1, L1 cache):
If the item is found, it is delivered to the CPU, resulting in a cache hit, without searching lower levels.
If the item is missing from an upper level, resulting in a cache miss, the level just below is searched.
For systems with several levels of cache, the search continues with cache level 2, 3, etc.
If all levels of cache report a miss, then main memory is accessed for the item. The CPU-cache-memory traffic is managed by hardware.
If the item is not found in main memory, resulting in a page fault, then disk (virtual memory) is accessed for the item. Memory-disk traffic is mainly managed by the operating system, with hardware support.
43 Memory Hierarchy: Terminology
A Block: The smallest unit of information transferred between two levels.
Hit: The item is found in some block in the upper level (example: Block X).
Hit Rate: The fraction of memory accesses found in the upper level.
Hit Time: Time to access the upper level, which consists of RAM access time + time to determine hit/miss.
Miss: The item needs to be retrieved from a block in the lower level (Block Y).
Miss Rate = 1 - (Hit Rate)
Miss Penalty: Time to replace a block in the upper level + time to deliver the block to the processor.
Hit Time << Miss Penalty
[Diagram: Blk X moves between the processor and the upper-level memory; Blk Y sits in the lower-level memory]
44 Cache Concepts
Cache is the first level of the memory hierarchy encountered once the address leaves the CPU, and it is searched first for the requested data.
If the data requested by the CPU is present in the cache, it is retrieved from the cache and the data access is a cache hit; otherwise it is a cache miss and the data must be read from main memory.
On a cache miss a block of data must be brought in from main memory into a cache block frame, possibly replacing an existing cache block.
The allowed block addresses where blocks can be mapped into the cache from main memory are determined by the cache placement strategy.
Locating a block of data in the cache is handled by the cache block identification mechanism.
On a cache miss, the choice of which cache block to remove is handled by the block replacement strategy in place.
When a write to the cache is requested, a number of main memory update strategies exist as part of the cache write policy.
45 Cache Organization & Placement Strategies
Placement strategies, or the mapping of a main memory data block onto cache block frame addresses, divide caches into three organizations:
1 Direct mapped cache: A block can be placed in one location only, given by: (Block address) MOD (Number of blocks in cache)
2 Fully associative cache: A block can be placed anywhere in the cache.
3 Set associative cache: A block can be placed in a restricted set of places, or cache block frames. A set is a group of block frames in the cache. A block is first mapped onto the set, and then it can be placed anywhere within the set. The set in this case is chosen by: (Block address) MOD (Number of sets in cache)
If there are n blocks in a set, the cache placement is called n-way set-associative.
46 Cache Organization: Direct Mapped Cache
A block can be placed in one location only, given by: (Block address) MOD (Number of blocks in cache). In this case: (Block address) MOD (8)
[Diagram: a cache of 8 block frames and 32 cacheable memory blocks; e.g. memory block 11101 maps to cache frame (11101) MOD (1000) = 101]
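The MOD placement rule is a one-liner in software; this sketch just replays the example above (block 11101 in an 8-block cache):

```python
# Direct-mapped placement: block address MOD number of cache blocks.
def direct_mapped_frame(block_addr, num_blocks):
    # In hardware this is free: with a power-of-two block count, the
    # modulo is simply the low-order index bits of the block address.
    return block_addr % num_blocks

frame = direct_mapped_frame(0b11101, 0b1000)  # memory block 29, 8-frame cache
print(bin(frame))  # 0b101
```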
47 4KB Direct Mapped Cache Example
1K = 1024 blocks, each block = one word. Can cache up to 2^32 bytes = 4 GB of memory.
Address fields: Block Address = 30 bits (Tag field = 20 bits, Index field = 10 bits), Block offset = 2 bits (byte offset).
Mapping function: Cache block frame number = (Block address) MOD (1024)
[Diagram: the index selects one of 1024 entries, each holding a valid bit, a 20-bit tag, and a 32-bit data word; a tag comparison produces the Hit signal]
48 64KB Direct Mapped Cache Example
4K = 4096 blocks, each block = four words = 16 bytes. Can cache up to 2^32 bytes = 4 GB of memory.
Address fields: Block Address = 28 bits (Tag field = 16 bits, Index field = 12 bits), Block offset = 4 bits (word select + byte offset).
Mapping function: Cache block frame number = (Block address) MOD (4096)
[Diagram: the index selects one of 4K entries, each with a valid bit, a 16-bit tag, and 128 bits of data; a 4-to-1 mux driven by the word-select bits picks the 32-bit word]
Larger blocks take better advantage of spatial locality.
49 Cache Organization: Set Associative Cache
[Diagram, for an 8-block cache: one-way set associative (direct mapped): 8 blocks, each with one tag/data entry; two-way set associative: 4 sets of 2 tag/data pairs; four-way set associative: 2 sets of 4 tag/data pairs; eight-way set associative (fully associative): one set of 8 tag/data pairs]
50 Cache Organization Example
51 Locating A Data Block in Cache
Each block frame in the cache has an address tag. The tags of every cache block that might contain the required data are checked or searched in parallel. A valid bit is added to the tag to indicate whether the entry contains a valid address.
The address from the CPU to the cache is divided into:
A block address, further divided into: an index field to choose a block set or frame in the cache (no index field when fully associative), and a tag field to search and match addresses in the selected set.
A block offset to select the data from the block.
Address layout: Tag | Index | Block Offset (tag and index together form the block address)
52 Address Field Sizes
Physical address generated by the CPU: Tag | Index | Block Offset (tag and index together form the block address)
Block offset size = log2(block size)
Index size = log2(total number of blocks / associativity) = log2(number of sets)
Tag size = address size - index size - offset size
Mapping function: Cache set or block frame number = Index = (Block Address) MOD (Number of Sets)
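These three field-size formulas can be bundled into one helper and checked against the 4KB direct-mapped cache of slide 47 (a sketch; the function name is ours):

```python
from math import log2

def cache_fields(addr_bits, block_bytes, total_blocks, assoc):
    """Split a physical address into (tag, index, block-offset) widths."""
    offset = int(log2(block_bytes))                # log2(block size)
    index = int(log2(total_blocks // assoc))       # log2(number of sets)
    tag = addr_bits - index - offset
    return tag, index, offset

# Slide 47: 32-bit address, 4-byte blocks, 1024 blocks, direct mapped.
print(cache_fields(32, 4, 1024, 1))  # (20, 10, 2)
```

The same call reproduces slide 48 as well: `cache_fields(32, 16, 4096, 1)` gives a 16-bit tag, 12-bit index, and 4-bit offset.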
53 4K Four-Way Set Associative Cache: MIPS Implementation Example
1024 block frames, each block = one word, 4-way set associative, 256 sets. Can cache up to 2^32 bytes = 4 GB of memory.
Address fields: Block Address = 30 bits (Tag field = 22 bits, Index field = 8 bits), Block offset = 2 bits.
Mapping function: Cache Set Number = (Block address) MOD (256)
[Diagram: the index selects one of 256 sets; the four valid bits and tags in the set are compared in parallel and a 4-to-1 multiplexor selects the hit data]
54 Cache Organization/Addressing Example
Given the following:
A single-level L1 cache with 128 cache block frames.
Each block frame contains four words (16 bytes).
16-bit memory addresses to be cached (64K bytes of main memory, or 4096 memory blocks).
Show the cache organization/mapping and cache address fields for: a fully associative cache, a direct mapped cache, and a 2-way set-associative cache.
55 Cache Example: Fully Associative Case
All 128 tags must be checked in parallel by hardware to locate a data block (each frame has a valid bit V).
Block Address = 12 bits = Tag = 12 bits; Block offset = 4 bits.
Mapping function: none.
56 Cache Example: Direct Mapped Case
Only a single tag must be checked in parallel to locate a data block (each frame has a valid bit V).
Block Address = 12 bits (Tag = 5 bits, Index = 7 bits); Block offset = 4 bits.
Mapping function: Cache block frame number = Index = (Main memory block address) MOD (128)
57 Cache Example: 2-Way Set-Associative Case
Two tags in a set must be checked in parallel to locate a data block (valid bits not shown).
Block Address = 12 bits (Tag = 6 bits, Index = 6 bits); Block offset = 4 bits.
Mapping function: Cache Set Number = Index = (Main memory block address) MOD (64)
58 Calculating Number of Cache Bits Needed
How many total bits are needed for a direct-mapped cache with 64 KBytes of data and one-word blocks, assuming a 32-bit address?
64 KBytes = 16 K words = 2^14 words = 2^14 blocks
Block size = 4 bytes => offset size = 2 bits; #sets = #blocks = 2^14 => index size = 14 bits
Tag size = address size - index size - offset size = 32 - 14 - 2 = 16 bits
Bits/block = data bits + tag bits + valid bit = 32 + 16 + 1 = 49
Bits in cache = #blocks x bits/block = 2^14 x 49 = 98 KBytes
How many total bits would be needed for a 4-way set associative cache to store the same amount of data?
Block size and #blocks do not change.
#sets = #blocks/4 = (2^14)/4 = 2^12 => index size = 12 bits
Tag size = address size - index size - offset = 32 - 12 - 2 = 18 bits
Bits/block = data bits + tag bits + valid bit = 32 + 18 + 1 = 51
Bits in cache = #blocks x bits/block = 2^14 x 51 = 102 KBytes
Increasing associativity => increases the bits in the cache.
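Both totals follow from one formula: total bits = #blocks x (data + tag + valid). A sketch (the helper name is ours; word size fixed at 32 bits):

```python
from math import log2

def cache_bits(addr_bits, block_words, num_blocks, assoc):
    """Total storage bits for a cache: data + tag + valid bit per block."""
    offset = int(log2(block_words * 4))        # byte offset within a block
    index = int(log2(num_blocks // assoc))     # bits to select a set
    tag = addr_bits - index - offset
    bits_per_block = block_words * 32 + tag + 1
    return num_blocks * bits_per_block

# 64 KB of data, one-word blocks, 32-bit address:
print(cache_bits(32, 1, 2**14, 1) // 8 // 1024)  # 98 (KBytes, direct mapped)
print(cache_bits(32, 1, 2**14, 4) // 8 // 1024)  # 102 (KBytes, 4-way)
```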
59 Calculating Cache Bits Needed
How many total bits are needed for a direct-mapped cache with 64 KBytes of data and 8-word blocks, assuming a 32-bit address (it can cache 2^32 bytes of memory)?
64 KBytes = 2^14 words = (2^14)/8 = 2^11 blocks
Block size = 32 bytes => offset size = block offset + byte offset = 5 bits; #sets = #blocks = 2^11 => index size = 11 bits
Tag size = address size - index size - offset size = 32 - 11 - 5 = 16 bits
Bits/block = data bits + tag bits + valid bit = 8 x 32 + 16 + 1 = 273 bits
Bits in cache = #blocks x bits/block = 2^11 x 273 = 68.25 KBytes
Increasing block size => decreases the bits in the cache.
60 Cache Replacement Policy
When a cache miss occurs, the cache controller may have to select a block of cache data to be removed from a cache block frame and replaced with the requested data. Such a block is selected by one of two methods:
Random: Any block is randomly selected for replacement, providing uniform allocation. Simple to build in hardware.
Least-recently used (LRU): The most widely used cache replacement strategy. Accesses to blocks are recorded, and the block replaced is the one that was unused for the longest period of time. LRU is expensive to implement as the number of blocks to be tracked increases, and it is usually approximated.
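LRU bookkeeping can be sketched in software to see the recording that makes it expensive in hardware (a toy model of a single set; the `LRUSet` class is ours and is not how real hardware implements it):

```python
from collections import OrderedDict

class LRUSet:
    """Toy LRU replacement for one cache set of `ways` block frames."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()            # tag -> data, oldest first

    def access(self, tag):
        if tag in self.blocks:                 # hit: mark most recently used
            self.blocks.move_to_end(tag)
            return True
        if len(self.blocks) >= self.ways:      # miss in a full set:
            self.blocks.popitem(last=False)    # evict the least recently used
        self.blocks[tag] = None                # bring the new block in
        return False

s = LRUSet(2)                # a 2-way set
s.access(1); s.access(2); s.access(1)   # tag 1 is now most recent
s.access(3)                  # evicts tag 2, the least recently used
print(list(s.blocks))        # [1, 3]
```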
61 Miss Rates for Caches with Different Size, Associativity & Replacement Algorithm (Sample Data)
Associativity: | 2-way | 4-way | 8-way
Size | LRU / Random | LRU / Random | LRU / Random
16 KB | 5.18% / 5.69% | 4.67% / 5.29% | 4.39% / 4.96%
64 KB | 1.88% / 2.01% | 1.54% / 1.66% | 1.39% / 1.53%
256 KB | 1.15% / 1.17% | 1.13% / 1.13% | 1.12% / 1.12%
62 Unified vs. Separate Level 1 Cache
Unified Level 1 Cache (Princeton Memory Architecture): A single level 1 cache is used for both instructions and data.
Separate instruction/data Level 1 caches (Harvard Memory Architecture): The level 1 (L1) cache is split into two caches, one for instructions (instruction cache, L1 I-cache) and the other for data (data cache, L1 D-cache).
[Diagram: two processors, each with Control, Datapath, and Registers; one feeds a Unified Level One Cache L1, the other feeds separate L1 I-cache and L1 D-cache]
63 Cache Performance: Princeton (Unified L1) Memory Architecture
For a CPU with a single level (L1) of cache for both instructions and data (Princeton memory architecture) and no stalls for cache hits:
CPU time = (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time
(CPU execution clock cycles are those with ideal memory.)
Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty)
If write and read miss penalties are the same:
Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty
64 Cache Performance: Princeton Memory Architecture
CPU time = Instruction count x CPI x Clock cycle time
CPI execution = CPI with ideal memory
CPI = CPI execution + Mem stall cycles per instruction
CPU time = Instruction count x (CPI execution + Mem stall cycles per instruction) x Clock cycle time
Mem stall cycles per instruction = Mem accesses per instruction x Miss rate x Miss penalty
CPU time = IC x (CPI execution + Mem accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
Misses per instruction = Memory accesses per instruction x Miss rate
CPU time = IC x (CPI execution + Misses per instruction x Miss penalty) x Clock cycle time
65 Cache Performance Example
Suppose a CPU executes at Clock Rate = 200 MHz (5 ns per cycle) with a single level of cache (unified).
CPI execution = 1.1
Instruction mix: 50% arith/logic, 30% load/store, 20% control
Assume a cache miss rate of 1.5% and a miss penalty of 50 cycles.
CPI = CPI execution + Mem stalls per instruction
Mem stalls per instruction = Mem accesses per instruction x Miss rate x Miss penalty
Mem accesses per instruction = 1 + 0.3 = 1.3 (instruction fetch + load/store)
Mem stalls per instruction = 1.3 x .015 x 50 = 0.975
CPI = 1.1 + 0.975 = 2.075
The CPU with ideal cache (no misses) is 2.075/1.1 = 1.88 times faster
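The arithmetic in this example can be reproduced directly; the helper name and structure below are my own, but the formula is the one from the slide.

```python
def cpi_with_cache(cpi_execution, accesses_per_instr, miss_rate, miss_penalty):
    """Effective CPI = base CPI + memory stall cycles per instruction."""
    return cpi_execution + accesses_per_instr * miss_rate * miss_penalty

# Slide example: 200 MHz, CPI_execution = 1.1, 30% load/store,
# 1.5% miss rate, 50-cycle miss penalty.
accesses_per_instr = 1 + 0.3   # one instruction fetch + data access per load/store
cpi = cpi_with_cache(1.1, accesses_per_instr, 0.015, 50)
print(cpi)        # 1.1 + 0.975 ≈ 2.075
print(cpi / 1.1)  # slowdown vs. an ideal cache, ≈ 1.88
```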
66 Cache Performance Example
Suppose for the previous example we double the clock rate to 400 MHz. How much faster is this machine, assuming a similar miss rate and instruction mix?
Since memory speed is not changed, the miss penalty takes more CPU cycles: Miss penalty = 50 x 2 = 100 cycles.
CPI = 1.1 + 1.3 x .015 x 100 = 1.1 + 1.95 = 3.05
Speedup = (CPI old x C old) / (CPI new x C new) = 2.075 x 2 / 3.05 = 1.36
The new machine is only 1.36 times faster rather than 2 times faster due to the increased effect of cache misses.
CPUs with higher clock rates have more cycles per cache miss and more memory impact on CPI.
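The doubled-clock comparison follows the same pattern; this sketch (names are my own) recomputes the speedup from the slide's CPU performance equation.

```python
def speedup(cpi_old, cycle_old, cpi_new, cycle_new):
    """Speedup of new machine = (CPI_old x C_old) / (CPI_new x C_new)."""
    return (cpi_old * cycle_old) / (cpi_new * cycle_new)

# Old machine: 200 MHz (5 ns cycle), CPI = 2.075.
# New machine: 400 MHz (2.5 ns cycle); the miss penalty doubles to 100 cycles.
cpi_new = 1.1 + 1.3 * 0.015 * 100   # 1.1 + 1.95 ≈ 3.05
print(cpi_new)
print(speedup(2.075, 5.0, cpi_new, 2.5))  # ≈ 1.36, well short of the ideal 2x
```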
67 Cache Performance: Harvard Memory Architecture
For a CPU with separate level one (L1) caches for instructions and data (Harvard memory architecture) and no stalls for cache hits:
CPU time = Instruction count x CPI x Clock cycle time
CPI = CPI execution + Mem stall cycles per instruction
CPU time = Instruction count x (CPI execution + Mem stall cycles per instruction) x Clock cycle time
Mem stall cycles per instruction = Instruction fetch miss rate x Miss penalty + Data memory accesses per instruction x Data miss rate x Miss penalty
68 Cache Performance Example
Suppose a CPU uses separate level one (L1) caches for instructions and data (Harvard memory architecture) with different miss rates for instruction and data access:
CPI execution = 1.1
Instruction mix: 50% arith/logic, 30% load/store, 20% control
Assume a cache miss rate of 0.5% for instruction fetch and a cache data miss rate of 6%.
A cache hit incurs no stall cycles while a cache miss incurs 200 stall cycles for both memory reads and writes.
Find the resulting CPI using this cache. How much faster is the CPU with ideal memory?
CPI = CPI execution + Mem stalls per instruction
Mem stall cycles per instruction = Instruction fetch miss rate x Miss penalty + Data memory accesses per instruction x Data miss rate x Miss penalty
Mem stall cycles per instruction = 0.5/100 x 200 + 0.3 x 6/100 x 200 = 1 + 3.6 = 4.6
CPI = CPI execution + Mem stalls per instruction = 1.1 + 4.6 = 5.7
The CPU with ideal cache (no misses) is 5.7/1.1 = 5.18 times faster
With no cache the CPI would have been = 1.1 + 1.3 x 200 = 261.1 !!
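The Harvard-architecture calculation above, reproduced numerically; the function name and parameter names are my own labels for the quantities in the slide's formula.

```python
def harvard_cpi(cpi_execution, ifetch_miss_rate, data_accesses_per_instr,
                data_miss_rate, miss_penalty):
    """CPI with split L1 caches: instruction-fetch stalls + data-access stalls."""
    stalls = (ifetch_miss_rate * miss_penalty
              + data_accesses_per_instr * data_miss_rate * miss_penalty)
    return cpi_execution + stalls

# Slide example: 0.5% I-fetch miss rate, 30% load/store, 6% data miss rate,
# 200-cycle miss penalty, CPI_execution = 1.1.
cpi = harvard_cpi(1.1, 0.005, 0.3, 0.06, 200)
print(cpi)               # 1.1 + (1 + 3.6) ≈ 5.7
print(cpi / 1.1)         # ≈ 5.18x slower than the ideal-cache CPU
print(1.1 + 1.3 * 200)   # with no cache at all: CPI ≈ 261.1
```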