Lecture 29 Review: CPU time, the best metric; be sure you understand CC and clock period; common (and good) performance metrics

Lecture 29 Review

Be sure you understand CC (clock cycles) and the clock period.
Suggested reading: everything.

Q1: D[8] = D[8] + RF[1] + RF[4]
  I[15]: Add R2, R1, R4     (given: RF[1] = 4)
  I[16]: MOV R3, 8          (given: RF[4] = 5)
  I[17]: Add R2, R2, R3     (given: D[8] = 7)
Cycle-by-cycle trace:
  CLK n+1: Fetch    PC=15  IR=xxxx
  CLK n+2: Decode   PC=16  IR=2214h
  CLK n+3: Execute  PC=16  IR=2214h  RF[2]=xxxxh
  CLK n+4: Fetch    PC=16  IR=2214h  RF[2]=0009h
  CLK n+5: Decode   PC=17  IR=0308h
  CLK n+6: Execute  PC=17  IR=0308h  RF[3]=xxxxh
  CLK n+7: Fetch    PC=17  IR=0308h  RF[3]=0007h

Common (and good) performance metrics

Latency: response time, execution time.
  A good metric for a fixed amount of work (minimize time).
Throughput: work per unit time.
  Equal to 1/latency when there is NO overlap; greater than 1/latency when there is overlap, and in real processors there is always overlap.
  A good metric for a fixed amount of time (maximize work).
Comparing performance:
  A is N times faster than B if and only if time(B) / time(A) = N.
  A is X% faster than B if and only if time(B) / time(A) = 1 + X/100.

CPU time: the best metric

CPU performance depends on clock rate, CPI, and instruction count, and CPU time is directly proportional to all three. Therefore an x% improvement in any one variable leads to an x% improvement in CPU performance.
But everything usually affects everything:
  Hardware technology: clock cycle time.
  Organization: CPI.
  ISA: instruction count.
  Compiler technology: instruction count.
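The relationship on this last slide is the usual CPU-time product (CPU time = instruction count x CPI x clock period). A minimal C sketch that plugs in made-up numbers; the instruction count, CPI, and clock rate below are assumptions for illustration, not values from the lecture:

#include <stdio.h>

/* CPU time = instruction count * CPI * clock period.
 * All numbers below are hypothetical, for illustration only. */
int main(void) {
    double instr_count = 2.0e9;   /* 2 billion instructions (assumed) */
    double cpi         = 1.5;     /* cycles per instruction (assumed) */
    double clock_rate  = 2.0e9;   /* 2 GHz, so clock period = 1/clock_rate */

    double cpu_time_a = instr_count * cpi / clock_rate;

    /* Improving CPI by 20% improves CPU time by the same factor. */
    double cpu_time_b = instr_count * (cpi / 1.2) / clock_rate;

    printf("CPU time A = %.3f s, CPU time B = %.3f s\n", cpu_time_a, cpu_time_b);
    printf("B is %.2f times faster than A: time(A)/time(B) = %.2f\n",
           cpu_time_a / cpu_time_b, cpu_time_a / cpu_time_b);
    return 0;
}

Scaling any one of the three factors scales CPU time by the same amount, which is the sense in which an x% improvement in one variable gives an x% improvement in performance.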

PIPELINES

Pipelining improves throughput

Program execution order (in instructions) vs. clock number:
  Clock number:  1    2    3    4    5    6    7    8
  Inst. i        IF   ID   EX   MEM  WB
  Inst. i+1           IF   ID   EX   MEM  WB
  Inst. i+2                IF   ID   EX   MEM  WB
  Inst. i+3                     IF   ID   EX   MEM  WB
The goal is to use all available hardware at all times.

Time for N instructions in a pipeline

If the times for all S stages are equal to T:
  The time for one initiation to complete is still S*T.
  The time between two initiations is T, not S*T, so initiations per second = 1/T.
  Time for N initiations: N*T + (S-1)*T.
  Throughput: time per initiation = T + (S-1)*T/N, which approaches T as N grows (assumes no stalls).
The key to improving performance is parallel execution. Pipelining overlaps multiple executions of the same sequence of stages: it improves THROUGHPUT, not the time to perform a single operation.

Stalls and performance

Stalls impede the progress of a pipeline and result in deviation from 1 instruction executing per clock cycle:
  CPI_pipelined = ideal CPI + pipeline stall cycles per instruction
                = 1 + pipeline stall cycles per instruction
Ignoring overhead and assuming the stages are balanced, the ideal speedup is equal to the number of pipeline stages.
Stalls occur because of hazards!
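These timing and stall relations are easy to sanity-check numerically. A minimal C sketch; the stage count, stage time, instruction count, and stall rate below are assumed values, not from the lecture:

#include <stdio.h>

int main(void) {
    int    S = 5;          /* pipeline stages (assumed) */
    double T = 1.0;        /* time per stage, in ns (assumed) */
    long   N = 1000000;    /* number of instructions (assumed) */

    double unpipelined = (double)N * S * T;    /* each instruction takes S*T */
    double pipelined   = N * T + (S - 1) * T;  /* N*T plus (S-1)*T to fill   */

    printf("time per initiation = %.6f ns (approaches T = %.1f)\n",
           pipelined / N, T);
    printf("ideal speedup       = %.3f (number of stages = %d)\n",
           unpipelined / pipelined, S);

    /* With stalls: CPI_pipelined = 1 + stall cycles per instruction. */
    double stalls_per_instr = 0.3;   /* assumed */
    double cpi = 1.0 + stalls_per_instr;
    printf("with %.1f stalls/instr, CPI = %.1f and time = %.0f ns\n",
           stalls_per_instr, cpi, N * cpi * T);
    return 0;
}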

Pipelining hazards

Pipeline hazards prevent the next instruction from executing during its designated clock cycle. There are 3 classes of hazards:
  Structural hazards: arise from resource conflicts; the hardware cannot support all possible combinations of instructions.
  Data hazards: occur when an instruction depends on data from an instruction ahead of it in the pipeline.
  Control hazards: result from branches and other instructions that change the flow of the program (i.e., change the PC).

Structural hazards (why)

One way to avoid structural hazards is to duplicate resources, e.g., an ALU to perform an arithmetic operation and a separate adder to increment the PC.
If not all possible combinations of instructions can be executed, structural hazards occur.
The most common instances of structural hazards arise when some resource is not duplicated enough.

Data hazards (why)

Data hazards exist because of pipelining. Why? Pipelining changes the order of read/write accesses to operands, so the order differs from the order seen by sequentially executing instructions on an un-pipelined machine.
Consider this example:
  ADD R1, R2, R3
  SUB R4, R1, R5
  AND R6, R1, R7
  OR  R8, R1, R9
  XOR R10, R1, R11
All instructions after the ADD use the result of the ADD. The ADD writes the register in WB, but the SUB needs it in ID. This is a data hazard.

Data hazards (avoiding with forwarding)

The problem illustrated above can actually be solved relatively easily with forwarding.
In this example, the result of the ADD instruction is not really needed until after the ADD actually produces it.
Can we move the result from the EX/MEM register to the beginning of the ALU (where the SUB needs it)? Yes!
Generally speaking: forwarding occurs when a result is passed directly to the functional unit that requires it; the result goes from the output of one unit to the input of another.
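The forwarding condition just described can be sketched as a simple comparison of pipeline-register fields. Everything below (struct and field names, the single-input check) is an illustrative assumption, not the lecture's datapath:

#include <stdbool.h>
#include <stdio.h>

/* A minimal sketch of an EX-stage forwarding check, in the spirit of the
 * ADD/SUB example above. Struct and field names are made up. */
typedef struct {
    int  rd;          /* destination register */
    int  rs, rt;      /* source registers     */
    bool reg_write;   /* does this instruction write a register? */
} PipeRegs;

/* Forward from EX/MEM to the ALU input if the older instruction writes a
 * register that the current instruction reads. */
bool forward_a(const PipeRegs *ex_mem, const PipeRegs *id_ex) {
    return ex_mem->reg_write && ex_mem->rd != 0 && ex_mem->rd == id_ex->rs;
}

int main(void) {
    PipeRegs add = { .rd = 1, .rs = 2, .rt = 3, .reg_write = true };  /* ADD R1,R2,R3 */
    PipeRegs sub = { .rd = 4, .rs = 1, .rt = 5, .reg_write = true };  /* SUB R4,R1,R5 */
    printf("forward ADD result to SUB's ALU input? %s\n",
           forward_a(&add, &sub) ? "yes" : "no");
    return 0;
}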

Data hazards (avoiding with forwarding), continued

Idea: send the result just calculated back around to the ALU inputs.
Send the output of memory back to the ALU too.

Data hazards (forwarding sometimes fails)

  LW  R1, 0(R2)
  SUB R4, R1, R5
  AND R6, R1, R7
  OR  R8, R1, R9
A load has a latency that forwarding can't solve: we can't get the data to the SUB because the result is needed at the beginning of CC #4 but is not produced until the end of CC #4.
The pipeline must stall until the hazard is cleared, starting with the instruction that wants to use the data and continuing until the source produces it. (A small detection sketch appears at the end of this section.)

Hazards vs. dependencies

Dependence: a fixed property of the instruction stream (i.e., the program).
Hazard: a property of both the program and the processor organization.
  It implies the potential for executing things in the wrong order; the potential only exists if instructions can be simultaneously in flight.
  It is a property of the dynamic distance between instructions vs. the pipeline depth.
For example, a RAW dependence can exist with or without a hazard; it depends on the pipeline.

Branch / control hazards

So far we've limited the discussion of hazards to arithmetic/logic operations and data transfers. We also need to consider hazards involving branches. Example:
  40: beq $1, $3, $28    # ($28 gives address 72)
  44: and $12, $2, $5
  48: or  $13, $6, $2
  52: add $14, $2, $2
  ...
  72: lw  $4, 50($7)
How long will it take before the branch decision is known? What happens in the meantime?
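As referenced above, the load-use case is detected by comparing the load's destination with the next instruction's sources; on a match the pipeline must stall for a cycle. A minimal C sketch with made-up struct and field names:

#include <stdbool.h>
#include <stdio.h>

/* Load-use hazard check: if the instruction in ID/EX is a load and its
 * destination matches a source of the instruction in IF/ID, stall one
 * cycle (forwarding alone cannot help). */
typedef struct {
    bool is_load;
    int  rt;        /* load destination */
} IdExStage;

typedef struct {
    int rs, rt;     /* source registers of the following instruction */
} IfIdStage;

bool must_stall(const IdExStage *id_ex, const IfIdStage *if_id) {
    return id_ex->is_load &&
           (id_ex->rt == if_id->rs || id_ex->rt == if_id->rt);
}

int main(void) {
    IdExStage lw  = { .is_load = true, .rt = 1 };   /* LW  R1, 0(R2)  */
    IfIdStage sub = { .rs = 1, .rt = 5 };           /* SUB R4, R1, R5 */
    printf("stall one cycle? %s\n", must_stall(&lw, &sub) ? "yes" : "no");
    return 0;
}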

A branch predictor: the Branch Target Buffer (BTB)

A Branch Target Buffer (BTB) is indexed by the address of the branch to get both a prediction AND the branch target address (if taken).
Note: we must check for a branch match now, since we can't use the target address of the wrong branch.
Each entry holds the branch PC (the tag), the predicted target PC, and extra prediction state bits; a BTB is often combined with a BHT (branch history table).
At FETCH, the PC of the instruction is compared against the branch PCs stored in the BTB:
  No match: the instruction is not predicted to be a branch; proceed normally (next PC = PC + 4).
  Match: the instruction is a branch; use the predicted PC as the next PC.
(A lookup sketch appears at the end of this section.)

MEMORY HIERARCHIES

Is there a problem with DRAM?

The processor-DRAM memory gap (latency): processor performance has grown at roughly 60% per year (2x every 1.5 years, Moore's Law), while DRAM latency has improved at roughly 9% per year (2x every 10 years). The processor-memory performance gap therefore grows about 50% per year. Why is this a problem?
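The BTB lookup referenced above can be sketched as a small direct-mapped table; the table size, field names, and 2-bit predictor encoding are assumptions for illustration:

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BTB_ENTRIES 64   /* assumed size, power of two */

typedef struct {
    bool     valid;
    uint32_t branch_pc;     /* tag: PC of the branch        */
    uint32_t predicted_pc;  /* target to fetch if taken     */
    uint8_t  state;         /* 2-bit prediction state bits  */
} BtbEntry;

static BtbEntry btb[BTB_ENTRIES];

/* Returns the next PC to fetch: the predicted target on a BTB hit that
 * predicts taken, otherwise the fall-through PC + 4. */
uint32_t next_fetch_pc(uint32_t pc) {
    BtbEntry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->branch_pc == pc && e->state >= 2)   /* 2,3 = predict taken */
        return e->predicted_pc;
    return pc + 4;
}

int main(void) {
    btb[(40 >> 2) % BTB_ENTRIES] =
        (BtbEntry){ .valid = true, .branch_pc = 40, .predicted_pc = 72, .state = 3 };
    printf("next PC after fetching 40: %u\n", next_fetch_pc(40));  /* 72 */
    printf("next PC after fetching 44: %u\n", next_fetch_pc(44));  /* 48 */
    return 0;
}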

A common memory hierarchy

From the upper (faster, smaller) level to the lower (larger, slower) level:
  Registers:    100s of bytes,  < 10s of ns
  Cache:        KBytes,         10-100 ns,               1-0.1 cents/bit
  Main memory:  MBytes,         200-500 ns,              0.0001-0.00001 cents/bit
  Disk:         GBytes,         ~10 ms (10,000,000 ns),  10^-5 - 10^-6 cents/bit
  Tape:         "infinite",     seconds to minutes,      10^-8 cents/bit

Average Memory Access Time (AMAT)

AMAT = Hit time + (1 - h) x Miss penalty
  Hit time: the basic time of every access.
  Hit rate (h): the fraction of accesses that hit.
  Miss penalty: the extra time to fetch a block from the lower level, including the time to replace a block in the CPU's cache.
Caches are introduced to improve hit time.

Where can a block be placed in a cache?

Three schemes for block placement in a cache:
  Direct mapped: a block can go to only one place in the cache, usually (block address) MOD (# of blocks in the cache).
  Fully associative: a block can be placed anywhere in the cache.
  Set associative: a set is a group of blocks in the cache; a block is mapped onto a set and can then be placed anywhere within that set, usually (block address) MOD (# of sets in the cache). With n blocks per set, we call it n-way set associative.
Example with block address 12 and an 8-block cache:
  Fully associative: block 12 can go anywhere.
  Direct mapped: block 12 can go only into block frame 4 (12 mod 8).
  2-way set associative (4 sets): block 12 can go anywhere in set 0 (12 mod 4).
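The placement example above is just modular arithmetic, and AMAT is a one-line formula. A short C sketch reproducing the block-12 example; the AMAT numbers at the end are assumed, not from the lecture:

#include <stdio.h>

/* Direct mapped: (block address) MOD (number of blocks in the cache). */
unsigned dm_frame(unsigned block_addr, unsigned num_blocks) {
    return block_addr % num_blocks;
}

/* Set associative: (block address) MOD (number of sets in the cache). */
unsigned sa_set(unsigned block_addr, unsigned num_sets) {
    return block_addr % num_sets;
}

int main(void) {
    unsigned block = 12;
    printf("direct mapped, 8 blocks : frame %u\n", dm_frame(block, 8));  /* 4 */
    printf("2-way set assoc, 4 sets : set %u\n",   sa_set(block, 4));    /* 0 */

    /* AMAT = hit time + (1 - h) * miss penalty, with assumed numbers (cycles). */
    double hit_time = 1.0, hit_rate = 0.95, miss_penalty = 50.0;
    printf("AMAT = %.2f cycles\n", hit_time + (1.0 - hit_rate) * miss_penalty);
    return 0;
}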

How is a block found in the cache?

Caches have an address tag on each block frame that provides the block address.
The tag of every cache block that might hold the entry is examined against the CPU address, in parallel (why?).
Each entry usually has a valid bit that tells us whether the cached data is useful or garbage; if the bit is not set, there can't be a match.
The block address is divided into a tag and an index, followed by a block offset:
  Block offset: selects the desired data within the block.
  Index: selects a specific set.
  Tag: compared against the stored tag for a hit.
(A small address-decomposition sketch appears at the end of this section.)
If the block's data is updated, a dirty bit is set; the block must then be written back to the next level of the memory hierarchy upon replacement.

Reducing cache misses

We want data accesses to result in cache hits, not misses, for best performance. Start by looking at ways to increase the percentage of hits, but first look at the 3 kinds of misses:
  Compulsory misses: the first access to a cache block cannot be a hit; the data is not there yet.
  Capacity misses: the cache is only so big. It can't store every block accessed in a program, so some must be swapped out.
  Conflict misses: result from set-associative or direct-mapped caches; blocks are discarded and later retrieved when too many map to the same location.

Impact of larger cache blocks

Miss rate can be reduced by increasing the cache block size. (Which kind of misses does this help eliminate?)
It helps because of the principle of locality:
  Temporal locality: if something is accessed once, it will probably be accessed again soon.
  Spatial locality: if something is accessed, something nearby will probably be accessed too.
Larger block sizes help with spatial locality. But be careful: larger blocks can increase the miss penalty, and for a fixed cache size they reduce the total number of blocks in the cache.
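The tag/index/offset split referenced above, written out for an assumed cache with 64-byte blocks and 128 sets (both numbers are illustrative):

#include <stdint.h>
#include <stdio.h>

/* Decompose a byte address into block offset, set index, and tag,
 * assuming power-of-two block size and set count. */
#define BLOCK_SIZE 64u    /* bytes per block (assumed) */
#define NUM_SETS   128u   /* sets in the cache (assumed) */

int main(void) {
    uint32_t addr = 0x12345678u;   /* arbitrary example address */

    uint32_t offset = addr % BLOCK_SIZE;                /* low bits   */
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_SETS;   /* middle bits */
    uint32_t tag    = (addr / BLOCK_SIZE) / NUM_SETS;   /* high bits  */

    printf("addr   = 0x%08x\n", addr);
    printf("offset = %u, index = %u, tag = 0x%x\n", offset, index, tag);
    return 0;
}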

Impact of larger cache blocks: miss rate vs. block size

(Figure: miss rate vs. block size for total cache capacities of 1K, 4K, 16K, 64K, and 256K and block sizes from 16 to 256 bytes, assuming the total cache size stays constant for each curve. Very small blocks suffer more compulsory misses; very large blocks suffer more conflict misses.)

Second-level caches

A second-level cache introduces a new definition of AMAT:
  AMAT = Hit time(L1) + Miss rate(L1) x Miss penalty(L1)
  where Miss penalty(L1) = Hit time(L2) + Miss rate(L2) x Miss penalty(L2)
So the 2nd-level miss rate is measured on the references that miss in the 1st-level cache.
(A short numeric sketch appears at the end of this section.)

VIRTUAL MEMORY

Virtual memory

Computers run lots of processes simultaneously, and there is no full address space of memory for each process. Physical memory is expensive and not dense, and thus too small, so we share smaller amounts of physical memory among many processes.
Virtual memory is the answer:
  Divide physical memory into blocks and assign them to different processes.
  The compiler assigns data to a virtual address; the VA is translated to a real/physical address somewhere in memory.
  This allows a program to run anywhere; where it runs is determined by the particular machine and OS.
  Business advantage: common software across a wide product line (without VM, software is sensitive to the actual physical memory size).
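The two-level AMAT expression referenced above, evaluated with assumed hit times and miss rates (illustrative numbers only):

#include <stdio.h>

int main(void) {
    /* Assumed parameters, in CPU cycles and miss fractions. */
    double l1_hit = 1.0,  l1_miss_rate = 0.05;
    double l2_hit = 10.0, l2_miss_rate = 0.20;
    double mem_penalty = 100.0;

    double l1_miss_penalty = l2_hit + l2_miss_rate * mem_penalty;
    double amat = l1_hit + l1_miss_rate * l1_miss_penalty;

    printf("L1 miss penalty = %.1f cycles\n", l1_miss_penalty);
    printf("AMAT            = %.2f cycles\n", amat);
    return 0;
}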

Translating VA to PA

Translating a virtual address to a physical address is sort of like finding the right cache entry, using a division of the address into fields.

Review: address translation

A virtual address is split into a page # and an offset. The page # indexes the page table (located via a page-table pointer register maintained by the operating system); the matching entry supplies a frame #, which is combined with the unchanged offset to form the physical address in main memory (or the entry indicates that the page is out on disk).
(Figure: a program's virtual address with page # 42 and offset 356 is translated through the page table to frame # i with the same offset 356 in physical memory; paging between main memory and disk is managed by the OS.)

Page table lookups = memory references

Each page table lookup is itself a memory reference, so we use a special-purpose cache for translations, historically called the TLB: Translation Lookaside Buffer.
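A minimal sketch of the page-number/offset arithmetic behind the figure above, using an assumed 4 KB page size and a tiny flat page table (all values are illustrative):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u          /* assumed 4 KB pages */
#define NUM_PAGES 64u            /* tiny example address space */

static uint32_t page_table[NUM_PAGES];   /* virtual page # -> physical frame # */

uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr / PAGE_SIZE;   /* virtual page number           */
    uint32_t offset = vaddr % PAGE_SIZE;   /* unchanged by translation      */
    uint32_t frame  = page_table[vpn];     /* page table lookup: a memory reference! */
    return frame * PAGE_SIZE + offset;
}

int main(void) {
    page_table[42] = 7;                         /* map virtual page 42 -> frame 7 */
    uint32_t va = 42 * PAGE_SIZE + 356;
    printf("VA 0x%x -> PA 0x%x\n", va, translate(va));
    return 0;
}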

TLBs speed up VA translations

A way to speed up translation is to use a special cache of recently used page table entries. This has many names, but the most frequently used is Translation Lookaside Buffer, or TLB.
A TLB entry typically holds: virtual page # (the tag), physical frame #, dirty, reference, valid, and access bits. (A tiny lookup sketch appears at the end of this section.)
Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
TLBs are usually small, typically not more than 128-256 entries even on high-end machines. This permits a fully associative lookup on those machines; most mid-range machines use small n-way set-associative organizations.
A TLB is really just a cache (a special-purpose cache) on the page table mappings; its access time is comparable to the cache access time and much less than the main memory access time.

Critical path of an address translation (1): translation with a TLB

The CPU issues a VA to the TLB. On a TLB hit, the resulting PA goes to the cache lookup; on a TLB miss, the full translation is performed and then the cache is looked up. A cache hit returns data to the CPU; a cache miss goes to main memory for the data.

Critical path of an address translation (2)

Flow of a memory access:
  Virtual address, then TLB access.
  TLB miss: if the page is not in memory, take a page fault and replace a page from disk; otherwise read the translation from the page table (TLB miss stall) and set it in the TLB.
  TLB hit: try to read from (or write to) the cache.
  Cache hit: deliver the data to the CPU (for a write, update the cache/write buffer and memory).
  Cache miss: take a cache miss stall, then deliver the data.
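The TLB lookup referenced above, sketched as a tiny fully associative table; the size and field names are assumptions:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16   /* small, fully associative (assumed size) */

typedef struct {
    bool     valid;
    uint32_t vpn;    /* virtual page number (the tag) */
    uint32_t pfn;    /* physical frame number         */
} TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and writes the frame number to *pfn. */
bool tlb_lookup(uint32_t vpn, uint32_t *pfn) {
    for (int i = 0; i < TLB_ENTRIES; i++) {   /* done "in parallel" in hardware */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;
        }
    }
    return false;   /* TLB miss: walk the page table, then refill the TLB */
}

int main(void) {
    tlb[0] = (TlbEntry){ .valid = true, .vpn = 42, .pfn = 7 };
    uint32_t pfn;
    printf("vpn 42: %s\n", tlb_lookup(42, &pfn) ? "hit" : "miss");
    printf("vpn 13: %s\n", tlb_lookup(13, &pfn) ? "hit" : "miss");
    return 0;
}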

PARALLEL PROCESSING

The many ways of parallel computing

Processors can be arranged in any of these configurations:
  Uniprocessor (SISD): a single processor and memory, possibly pipelined or superscalar.
  Centralized shared memory: multiple processors sharing a single memory.
  Distributed shared memory: processors, each with local memory, connected by a network.
This brings us to MIMD configurations.

MIMD multiprocessors

Centralized shared memory: note there is just one memory. This is actually quite representative of a quad-core chip, with "main memory" being a shared L2 cache.
Distributed memory: multiple, distributed memories, one per node.

Speedup from parallel processing

Speedup is the metric for performance on latency-sensitive applications:
  Speedup = Time(1) / Time(P) for P processors.
Note: Time(1) must use the best sequential algorithm; the parallel algorithm may be different.
(Figure: speedup vs. number of processors, 1 to 64. Linear speedup is the ideal; typically speedup rolls off beyond some number of processors; occasionally we see superlinear speedup... why?)

Impediments to parallel performance

Latency is already a major source of performance degradation. The architecture is charged with hiding local latency (that's why we talked about registers and caches); hiding global latency is also the task of the programmer (i.e., manual resource allocation).
All of the items below also limit the achievable speedup and add overhead, i.e., time where no actual processing is done:
  Reliability: we want to achieve high up-time, especially in non-CMPs.
  Contention for access to shared resources: e.g., multiple accesses to a limited number of memory banks may dominate system scalability.
  Programming languages, environments, and methods: we need simple semantics that can expose computational properties to be exploited by large-scale architectures. (What if you write good code for a 4-core chip and then get an 8-core chip?)
  Algorithms: not all problems are parallelizable. Amdahl's law (a small numeric sketch follows after this section):
    Speedup = 1 / [ (1 - Fraction_parallelizable) + Fraction_parallelizable / N ]
  Cache coherency: P1 writes, P2 can read; protocols can enable cache coherency but add overhead.

Maintaining cache coherence

Hardware schemes:
  Shared caches: trivially enforce coherence, but not scalable (the shared L1 cache quickly becomes a bottleneck).
  Snooping: needs a broadcast network (like a bus) to enforce coherence; each cache that has a block tracks the block's sharing state on its own.
  Directory: can enforce coherence even with a point-to-point network; a block has just one place where its full sharing state is kept.
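The speedup bound referenced above can be checked numerically; a minimal C sketch with an assumed parallelizable fraction of 0.9 (not a number from the lecture):

#include <stdio.h>

/* Speedup = 1 / ((1 - f) + f / n), where f is the parallelizable fraction
 * and n is the number of processors. */
double speedup(double f, int n) {
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void) {
    double f = 0.9;   /* assume 90% of the work is parallelizable */
    for (int n = 1; n <= 64; n *= 2)
        printf("n = %2d  speedup = %5.2f\n", n, speedup(f, n));
    /* Even with infinitely many processors, speedup is bounded by 1/(1-f) = 10. */
    return 0;
}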

How snooping works

Each cache keeps state and a tag alongside each block's data, and watches the bus; there are often two sets of tags (why?).
CPU references check the cache tags (as usual), and cache misses are filled from memory (as usual). In addition, every other read/write seen on the bus must check the tags too, and possibly invalidate a block.

What happens on write conflicts? (invalidate protocol)

Processor activity     | Bus activity       | CPU A's cache | CPU B's cache | Memory location X
CPU A reads X          | Cache miss for X   | 0             |               | 0
CPU B reads X          | Cache miss for X   | 0             | 0             | 0
CPU A writes a 1 to X  | Invalidation for X | 1             |               | 0
CPU B reads X          | Cache miss for X   | 1             | 1             | 1

This assumes neither cache held X at the start. When the second miss by B occurs, CPU A responds with the value, canceling the response from memory; B's cache and the memory contents of X are both updated. This protocol is typical and simple.

M(E)SI snoopy protocols for cache coherency

The state of block B in cache C can be (a compact state-transition sketch appears at the end of this section):
  Invalid: B is not cached in C. To read or write, C must make a request on the bus.
  Modified: B is dirty in C. C has the block, no other cache has it, and C must update memory when it displaces B. C can read or write B without going to the bus.
  Shared: B is clean in C. C has the block, other caches may also have it, and C need not update memory when it displaces B. C can read B without going to the bus; to write, it must send an upgrade request on the bus.
  Exclusive: B is exclusive to cache C. The E state can help eliminate bus traffic, but it is not absolutely necessary.

Cache-to-cache transfers

Problem: P1 has block B in the M state; P2 wants to read B and puts a read request on the bus. If P1 does nothing, memory will supply the (stale) data to P2. What does P1 do?
  Solution 1: abort/retry. P1 cancels P2's request and issues a write-back; P2 later retries the read request and gets the data from memory. Too slow (two memory latencies to move the data from P1 to P2).
  Solution 2: intervention. P1 indicates it will supply the data (an "intervention" bus signal); memory sees this, does not supply the data, and waits for P1's data. P1 sends the data on the bus, memory is updated, and P2 snoops the transfer during the write-back and gets the block.
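The MSI states referenced above can be sketched as a small state-transition function for one block in one cache; the event names and the simplifications (no E state, no transient states, data movement omitted) are mine:

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } MsiState;
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } Event;

/* Next state of a block in THIS cache, given a local access or a snooped
 * bus request from another cache (plain MSI). */
MsiState msi_next(MsiState s, Event e) {
    switch (e) {
    case LOCAL_READ:  return (s == INVALID) ? SHARED : s;    /* miss: fetch shared      */
    case LOCAL_WRITE: return MODIFIED;                       /* upgrade/invalidate others */
    case BUS_READ:    return (s == MODIFIED) ? SHARED : s;   /* supply data, downgrade  */
    case BUS_WRITE:   return INVALID;                        /* another cache writes    */
    }
    return s;
}

int main(void) {
    MsiState s = INVALID;
    s = msi_next(s, LOCAL_READ);   /* I -> S */
    s = msi_next(s, LOCAL_WRITE);  /* S -> M (upgrade request on the bus) */
    s = msi_next(s, BUS_READ);     /* M -> S (another CPU reads; we supply the data) */
    printf("final state = %d (0=I, 1=S, 2=M)\n", s);
    return 0;
}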

Latency and network topologies

Network topologies facilitate communication among different processing nodes (and this is just the physics; no router overhead, etc., included). [From Balfour, Dally, Supercomputing.]
But topologies add additional overhead (a short numeric sketch follows the next slide):
  Time to message = (# of hops (routers)) x (time per router)
                  + (# of links) x (time per link)
                  + serialization latency, i.e., ceiling(message size / link BW)

Load balancing

Load balancing: keep all cores busy at all times (i.e., minimize idle time).
Example: if all tasks are subject to barrier synchronization, the slowest task determines overall performance.
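The message-latency expression referenced above, evaluated with assumed per-router and per-link times and link bandwidth (illustrative numbers only):

#include <math.h>
#include <stdio.h>

/* time = hops * t_router + links * t_link + ceil(msg_bytes / link_bw_bytes_per_ns) */
double message_time_ns(int hops, int links, double t_router_ns, double t_link_ns,
                       double msg_bytes, double link_bw_bytes_per_ns) {
    double serialization = ceil(msg_bytes / link_bw_bytes_per_ns);
    return hops * t_router_ns + links * t_link_ns + serialization;
}

int main(void) {
    /* Assumed: 4 routers, 5 links, 20 ns per router, 5 ns per link,
     * a 256-byte message, and 16 bytes/ns of link bandwidth. */
    double t = message_time_ns(4, 5, 20.0, 5.0, 256.0, 16.0);
    printf("message latency = %.1f ns\n", t);
    return 0;
}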