Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISAs (RISC, CISC, EPIC). Basic Pipeline Model.

CSE 586 Computer Architecture
Review
Jean-Loup Baer
CSE 586 Spring 00

Performance of Computer Systems
- Performance metrics
  - Use (weighted) arithmetic means for execution times
  - Use (weighted) harmonic means for rates
- $\text{CPU exec. time} = \text{Instruction count} \times \left(\sum_i \text{CPI}_i \times f_i\right) \times \text{clock cycle time}$
- We talk about contributions to the CPI from, e.g.,
  - Hazards in the pipeline
  - Cache misses
  - Branch mispredictions
  - etc.

ISAs (RISC, CISC, EPIC)
- In RISC, R stands for:
  - Restricted (relatively small number of opcodes)
  - Regular (all instructions have the same length)
  - And also, few instruction formats and addressing modes
  - RISC and load-store architectures are synonymous
- CISC
  - Fewer instructions executed but CPI/instruction is larger
  - More complex to design
- VLIW-EPIC
  - Compiler-based exploitation of ILP

Basic Pipeline Model
- Basic pipelining (e.g., DLX in the book; MIPS 3000)
  - 5 stages: IF, ID, EX, MEM, WB
  - Pipeline registers between stages to keep data/control info needed in subsequent stages
- Hazards
  - Structural (won't happen in basic pipeline but will in multiple pipeline machines)
  - Data dependencies
    - Most can be removed via forwarding
    - Otherwise stall (insert bubbles)
  - Control

[Figure: five-stage pipeline datapath with stages IF, ID/RR, EXE, Mem, WB and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; labeled blocks: PC, Inst. mem., Regs., Data mem., Forwarding unit, Control unit, Stall unit]

Branch Prediction
- Modern processors use dynamic branch prediction
- Becomes increasingly important because of deep pipes and multiple issue of instructions
- BPT (Branch prediction table)
  - Prediction occurs during ID cycle
  - BPT either indexed by some bits of the PC or organized cache-like
  - BPT: either separate table or part of the metadata of the I-cache
  - Use of 2-bit saturating counters for the prediction per se
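The CPU execution time formula and the two kinds of means from the performance metrics slide can be sketched as follows. This is a minimal illustration, not part of the original slides: the function names and the example instruction mix are assumptions chosen for the example.

```python
def cpu_time(instruction_count, mix, cycle_time):
    """CPU time = instruction count * sum(CPI_i * f_i) * clock cycle time.

    mix is a list of (cpi_i, f_i) pairs, one per instruction class;
    the frequencies f_i are expected to sum to 1.
    """
    total_f = sum(f for _, f in mix)
    if abs(total_f - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    average_cpi = sum(cpi * f for cpi, f in mix)
    return instruction_count * average_cpi * cycle_time


def weighted_arithmetic_mean(times, weights):
    """Appropriate for execution times."""
    return sum(t * w for t, w in zip(times, weights)) / sum(weights)


def weighted_harmonic_mean(rates, weights):
    """Appropriate for rates (e.g., instructions or operations per second)."""
    return sum(weights) / sum(w / r for r, w in zip(rates, weights))


# Hypothetical mix: 50% of instructions at CPI 1, 30% at CPI 2, 20% at CPI 3.
example_mix = [(1, 0.5), (2, 0.3), (3, 0.2)]
example_time = cpu_time(1_000_000_000, example_mix, 1e-9)  # about 1.7 seconds
```

With equal weights, rates of 10 and 40 have a harmonic mean of 16, not the arithmetic mean of 25, which is why rates and times need different means.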

2-bit Saturating Counter Scheme
- Property: takes two wrong predictions before it changes T to NT (and vice-versa)

[Figure: state diagram of the 2-bit saturating counter with predict-taken and predict-not-taken states; one state is annotated "Generally, this is the initial state"]

More Elaborate Branch Prediction
- BTB (Branch Target Buffer) = BPT + Target address
  - Prediction and target address computation occur during IF cycle
  - Possibility of decoupling a (large) BPT and a (smaller) BTB
- Correlated -- or 2-level branch prediction
  - Relies on history of outcome of previous branches to predict current branch
  - Many variations on the number of (shift) registers recording branch history and the number of Pattern History Tables (PHT) storing the 2-bit saturating counters

Decoupled BTB
[Figure: decoupled BPT and BTB, both accessed with the PC; field labels: Tag, Hist, Next address]
- (1) Access BPT
- (2) If predict T then access BTB
- (3) If match then have target address
- Note: the BPT does not require a tag, so could be much larger

Extensions to Single Pipe Model
- Basic pipelining
  - How to handle precise exceptions
- Single issue processor with multiple pipes
  - How to handle sharing the WB stage
  - How to avoid WAW hazards

[Figure: IF and ID feeding several pipes that share Me and WB: EX (e.g., integer; latency 0), F-p mul M1 to M7 (latency 7), F-p add A1 to A4 (latency 3), and Div (e.g., not pipelined; latency 25). Annotation: needed at beginning of cycle and ready at end of cycle]

Exploiting Instruction Level Parallelism
- ILP: where can the compiler optimize
  - Loop unrolling and software pipelining
  - Speculative execution
- ILP: Dynamic scheduling in a single issue machine
  - Scoreboard -- Centralized control unit
  - Tomasulo's algorithm -- Decentralized control

5/31/00
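The 2-bit saturating counter property stated above can be checked with a small simulation. This is a sketch under assumptions: it uses the common encoding in which states 0 and 1 predict not taken and states 2 and 3 predict taken, and the starting state is a parameter because the slide's diagram (and its choice of initial state) did not survive extraction. The class and function names are illustrative only.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0,1 predict NT; states 2,3 predict T."""

    def __init__(self, state=0):
        if state not in (0, 1, 2, 3):
            raise ValueError("state must be 0..3")
        self.state = state  # starting state is an assumption of this sketch

    def predict(self):
        """Return True for 'predict taken'."""
        return self.state >= 2

    def update(self, taken):
        """Move one step toward the actual outcome, saturating at 0 and 3."""
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)


def count_mispredictions(outcomes, state=0):
    """Run one counter over a sequence of branch outcomes (True = taken)."""
    counter = TwoBitCounter(state)
    wrong = 0
    for taken in outcomes:
        if counter.predict() != taken:
            wrong += 1
        counter.update(taken)
    return wrong


# A loop branch taken 9 times and then not taken, executed 3 times over.
loop_outcomes = ([True] * 9 + [False]) * 3
loop_mispredictions = count_mispredictions(loop_outcomes, state=3)
```

Starting from the saturated taken state, the single not-taken outcome at each loop exit costs one misprediction but does not flip the prediction for the next loop entry, which is the point of using two bits rather than one.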

Scoreboard -- The example machine
[Figure: registers connected to the functional units (pipes) by data buses; the scoreboard is connected to both by control lines/status]

Scoreboard
- The scoreboard keeps a record of all data dependencies
- The scoreboard keeps a record of all functional unit occupancies
- The scoreboard decides if an instruction can be issued
- The scoreboard decides if an instruction can store its result
- Implementation-wise, scoreboard keeps track of which registers are used as sources and destinations and which functional units use them

Example Machine using Tomasulo Algorithm
[Figure: load buffers (from memory), store buffers (to memory), F-p registers and reservation stations (from I-unit), F-p units, common data bus]

Tomasulo's algorithm
- Decentralized control
- Use of reservation stations to buffer and/or rename registers (hence gets rid of WAW and WAR hazards)
- Results and their names are broadcast to reservation stations and register file
- Instructions are issued in order but can be dispatched, executed and completed out-of-order
- Issue, Execute, Write stages

Register Renaming
- Goal: avoid WAW and WAR hazards
- Is performed at decode time to rename the result register
- Two basic implementation schemes
  - Have a separate physical register file
  - Use of reorder buffer (to preserve in-order completion) and reservation stations
  - Often a mix of the two

Example Machine (Tomasulo-like) Revisited
[Figure: reservation stations (from I-unit), reorder buffer, F-p registers, inputs from memory and CDB, output to memory, F-p units feeding the CDB]
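The register renaming idea above (rename the result register at decode time so that WAW and WAR hazards disappear) can be sketched with a map table and a free list of physical registers. This is a simplified illustration of the "separate physical register file" scheme, not the slides' own design: the names `Instr` and `rename`, the register naming, and the absence of any register freeing at commit are all assumptions of the sketch.

```python
from collections import namedtuple

# One three-operand instruction: dest = src1 op src2 (the opcode is omitted).
Instr = namedtuple("Instr", "dest src1 src2")


def rename(instrs, num_arch_regs, num_phys_regs):
    """Rename architectural registers R0.. to physical registers P0..

    Sources are looked up in the current map; every destination gets a
    fresh physical register, so no two in-flight writes share a name.
    """
    map_table = {f"R{i}": f"P{i}" for i in range(num_arch_regs)}
    free_list = [f"P{i}" for i in range(num_arch_regs, num_phys_regs)]
    renamed = []
    for ins in instrs:
        # Read the sources first: they must see the mapping before this write.
        src1 = map_table[ins.src1]
        src2 = map_table[ins.src2]
        if not free_list:
            # A real machine would stall here until a register is freed.
            raise RuntimeError("no free physical register")
        dest = free_list.pop(0)
        map_table[ins.dest] = dest
        renamed.append(Instr(dest, src1, src2))
    return renamed


program = [
    Instr("R1", "R2", "R3"),  # I1
    Instr("R2", "R1", "R3"),  # I2: RAW on R1 with I1, WAR on R2 with I1
    Instr("R1", "R4", "R4"),  # I3: WAW on R1 with I1, WAR on R1 with I2
]
renamed_program = rename(program, num_arch_regs=5, num_phys_regs=8)
```

After renaming, the three destinations are distinct physical registers, so only the true (RAW) dependence of I2 on I1 remains.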

The Commit Step (in-order completion)
- A fourth stage: Commit
- Need of a mechanism (reorder buffer) to:
  - Complete instructions in order. This commits the instruction. Since this is a multiple issue machine, it should be able to commit (retire) several instructions per cycle
  - Know when an instruction has completed non-speculatively (head of the buffer)
  - Know whether the result of an instruction is correct, i.e., flush reorder buffer when there are incorrectly predicted branches and exceptions

Multiple Issue Implications
- Will increase throughput
- The Instruction Fetch step requires buffering and can become a critical point in the design
- The Commit stage must be able to retire multiple instructions in a given cycle
- Decoding, issuing, dispatching can encounter more structural hazards

VLIW-EPIC
- Compiler plays a major role in scheduling operations
- Merced/Itanium implementation
  - Bundles of predicated instructions
  - Large register files with rotating registers to facilitate loop unrolling, software pipelining, and call/return paradigms
  - Predication and sophisticated branch prediction
  - Powerful floating-point units
  - SIMD instructions for 3D Graphics and Multimedia

Predication
- Partial predication (Conditional Moves)
- Full predication (predicate definitions); Unconditional and OR predicates
- Used extensively in Merced/Itanium

Memory Hierarchy
- Memory hierarchies work because of the principle of locality
  - Temporal and spatial locality
- Two main interfaces in the memory hierarchy
  - Caches <-> Main memory
  - Main memory <-> disk (secondary memory)
- Same questions arise at both interfaces:
  - Size, placement, retrieval, replacement, and timing of the information being transferred

Caches
- Cache organizations
  - Direct-mapped, fully-associative, set-associative
  - Decomposition of the address for hit/miss detection
  - Write-through vs. write-back; write-around and write-allocate
- The 3 C's
- Cache performance
  - Metrics: CPI_c, Average memory access time
  - Examples of naïve analysis
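The decomposition of the address for hit/miss detection and the average memory access time metric can be illustrated as follows. This is a minimal sketch: the function names and the example cache geometry (32-byte blocks, 128 sets) are assumptions, and the AMAT expression is the usual single-level form, hit time plus miss rate times miss penalty.

```python
def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def split_address(addr, block_size, num_sets):
    """Split a byte address into (tag, set index, block offset).

    block_size is in bytes; num_sets is 1 for a fully-associative cache
    and equals the number of blocks for a direct-mapped cache.
    """
    if not (is_power_of_two(block_size) and is_power_of_two(num_sets)):
        raise ValueError("block_size and num_sets must be powers of two")
    offset_bits = block_size.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (all times in cycles)."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical cache: 32-byte blocks, 128 sets.
tag, index, offset = split_address(0x12345, block_size=32, num_sets=128)
# A 1-cycle hit, 5% miss rate and 40-cycle miss penalty give about 3 cycles.
amat = average_memory_access_time(1, 0.05, 40)
```

A hit is detected by comparing the tag against the tags stored in the selected set; with more associativity, the index field shrinks and the tag field grows.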

Cache Performance
- Improving performance by giving more associativity
  - Victim caches; column-associative caches; skewed-associative caches
- Reducing conflict misses
  - Interaction with the O.S.: page coloring
  - Interaction with the compiler: code placement
- Improving performance by tolerating memory latency
  - Prefetching
  - Write buffers
  - Critical word first
  - Sector caches
  - Lock-up free caches

Main Memory
- DRAM basics
- Interleaving
  - Low order bits for reading consecutive words in parallel
  - Middle bits for banks of banks allowing concurrent access by several devices
- Page-mode and SDRAMs
- Processor In Memory paradigm (IRAM, Active Pages)
- Rambus

Virtual memory
- Paging and segmentation
- Page tables
- TLBs
- Address translation

From Virtual Address to Memory Location (highly abstracted)
[Figure: virtual address -> TLB (hit or miss) -> physical address -> cache (hit or miss) -> main memory]

Hardware-software interactions for paging systems
- TLBs
  - Misses handled either in hardware or software
- Page fault: detection and termination
  - Context-switch (exception)
  - I/O interrupt
- Choice of a (or several) page size(s)
- Virtually addressed caches - Synonyms
- Protection
- I/O and caches (software and hardware solutions)
- Cache coherence

I/O
- I/O architecture (CPU-memory and I/O buses)
- Disks (access time components)
- Buses (arbitration, transactions, split-transactions)
- I/O hardware-software interface
- DMA
- Disk arrays (RAID)
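The path from virtual address to physical address (TLB lookup, then page table on a TLB miss, then page fault if the page is not mapped) can be sketched in a few lines. This is a highly abstracted model in the spirit of the slide: the 4 KB page size, the dictionary-based TLB and page table, and the `translate` name are assumptions of the example, and TLB capacity and replacement are ignored.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def translate(vaddr, tlb, page_table):
    """Return (physical address, event) for a virtual address.

    tlb and page_table are dicts mapping virtual page number -> frame number.
    On a TLB miss the page table is consulted and the TLB is refilled;
    an unmapped page raises LookupError to stand in for a page fault.
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:
        frame = tlb[vpn]
        event = "TLB hit"
    elif vpn in page_table:
        frame = page_table[vpn]
        tlb[vpn] = frame  # refill, done in hardware or by a software handler
        event = "TLB miss"
    else:
        raise LookupError(f"page fault on virtual page {vpn}")
    return frame * PAGE_SIZE + offset, event


page_table = {2: 7}   # virtual page 2 lives in physical frame 7
tlb = {}
first = translate(2 * PAGE_SIZE + 100, tlb, page_table)
second = translate(2 * PAGE_SIZE + 200, tlb, page_table)
```

The first access misses in the TLB and walks the page table; the second access to the same page hits in the TLB. The physical address would then be presented to the cache, which is not modeled here.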

Parallel Processing
- Flynn's taxonomy
  - {Single Instr., Multiple Instr.} X {Single Data, Multiple Data}
- MIMD machines -- Shared-memory multiprocessors
  - UMA
  - NUMA-cc
  - DSM
- MIMD machines -- Message passing systems
  - Multicomputers
  - Synchronous vs. asynchronous message passing

Shared-bus Systems
- SMPs
- Cache coherence using snoopy protocols
  - Write-update protocols (Dragon)
  - Write-invalidate protocols (Illinois)
- Cache coherence misses
- Impact of capacity and block sizes
- Multilevel inclusion property

NUMA Machines
- Interconnection networks for tightly-coupled systems
  - Centralized vs. decentralized switches
- Centralized switches
  - Crossbar
  - Perfect shuffle
  - Omega and Butterfly networks
- Decentralized switches
  - Meshes and tori
- Performance metrics
  - Bandwidth; Bisection bandwidth; latency
- Routing and flow control

Directory-based Cache Coherence
- Full directory
- Partial directory
- 2-bit
- Coarse directories
- Basic protocols
- SCI
- Directory in the caches
- COMA architecture

Synchronization
- Locking and barriers
- Primitives for implementation of locking
  - Test-and-Set
  - Fetch-and-Φ
  - Full/empty bits
  - Load locked and Store conditional
- Spin locks
  - Test and Test-and-Set
  - Queuing locks

Models of Memory Consistency
- Sequential consistency
- Relaxed models
  - Weak Ordering
  - Release consistency
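The Test and Test-and-Set spin lock listed under Synchronization can be sketched as below. This is only a software model: Python has no hardware Test-and-Set, so the sketch emulates the atomic primitive with a `threading.Lock`, and the class and function names are assumptions made for the example. The point it illustrates is the structure of the lock: spin on an ordinary read (which would hit in the local cache) and attempt the atomic Test-and-Set only when the lock appears free.

```python
import threading
import time


class TTASLock:
    """Test and Test-and-Set spin lock (software model)."""

    def __init__(self):
        self._flag = False
        # Stands in for the atomicity that hardware gives Test-and-Set.
        self._atomic = threading.Lock()

    def _test_and_set(self):
        """Atomically set the flag and return its previous value."""
        with self._atomic:
            old = self._flag
            self._flag = True
            return old

    def acquire(self):
        while True:
            # Test: spin on a plain read while the lock is held.
            while self._flag:
                time.sleep(0)  # yield so the lock holder can make progress
            # Test-and-Set: only attempted once the lock looked free.
            if not self._test_and_set():
                return

    def release(self):
        self._flag = False


def run_demo(num_threads=4, increments=200):
    """Increment a shared counter under the lock from several threads."""
    lock = TTASLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Calling `run_demo(4, 200)` returns 800 when mutual exclusion holds. On a snoopy-protocol machine, the inner read-only loop is what avoids the bus traffic that a plain Test-and-Set spin would generate.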
