Wrap-Up. CS:APP Chapter 4 Computer Architecture. Overview. Performance Metrics. CPI for PIPE. Randal E. Bryant. Carnegie Mellon University



CS:APP Chapter 4: Computer Architecture, Wrap-Up
Randal E. Bryant, Carnegie Mellon University

Overview

Wrap-Up of PIPE Design
    Performance analysis
    Fetch stage design
    Exceptional conditions
Modern High-Performance Processors
    Out-of-order execution

Performance Metrics

Clock rate
    Measured in Megahertz or Gigahertz
    Function of stage partitioning and circuit design
        Keep amount of work per stage small
Rate at which instructions executed
    CPI: cycles per instruction
    On average, how many clock cycles does each instruction require?
    Function of pipeline design and benchmark programs
        E.g., how frequently are branches mispredicted?

CPI for PIPE

CPI ≈ 1.0
    Fetch instruction each clock cycle
    Effectively process new instruction almost every cycle
        Although each individual instruction has latency of 5 cycles
CPI > 1.0
    Sometimes must stall or cancel branches
Computing CPI
    C clock cycles
    I instructions executed to completion
    B bubbles injected (C = I + B)
    CPI = C/I = (I+B)/I = 1.0 + B/I
    Factor B/I represents average penalty due to bubbles
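The CPI definition above can be checked with a few lines of Python. This is a minimal sketch: the function name and the instruction and bubble counts are made up for illustration and do not come from the slides.

```python
def cpi_from_counts(instructions: int, bubbles: int) -> float:
    """Return CPI = C / I, where C = I + B (one cycle per instruction or bubble)."""
    if instructions <= 0:
        raise ValueError("need at least one completed instruction")
    cycles = instructions + bubbles          # C = I + B
    return cycles / instructions             # equals 1.0 + B / I


# Hypothetical run: one million instructions, 150,000 injected bubbles.
example_cpi = cpi_from_counts(1_000_000, 150_000)
print(f"CPI = {example_cpi:.2f}")
```

With no bubbles the function returns exactly 1.0, which matches the ideal case where a new instruction completes every cycle.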

CPI for PIPE (Cont.)

B/I = LP + MP + RP
                                                        Typical Values
LP: Penalty due to load/use hazard stalling
    Fraction of instructions that are loads             0.25
    Fraction of load instructions requiring stall       0.20
    Number of bubbles injected each time                1
    LP = 0.25 * 0.20 * 1 = 0.05
MP: Penalty due to mispredicted branches
    Fraction of instructions that are cond. jumps       0.20
    Fraction of cond. jumps mispredicted                0.40
    Number of bubbles injected each time                2
    MP = 0.20 * 0.40 * 2 = 0.16
RP: Penalty due to ret instructions
    Fraction of instructions that are returns           0.02
    Number of bubbles injected each time                3
    RP = 0.02 * 3 = 0.06
Net effect of penalties: 0.05 + 0.16 + 0.06 = 0.27
    CPI = 1.27 (Not bad!)

Fetch Logic Revisited

During Fetch Cycle
    1. Select PC
    2. Read bytes from instruction memory
    3. Examine icode to determine instruction length
    4. Increment PC
Timing
    Steps 2 & 4 require significant amount of time

[Figure: fetch stage block diagram. Labels: Select PC (predPC, M_icode, M_Bch, M_valA, W_icode, W_valM), Instruction memory, Split (Byte 0: icode, ifun), Align (Bytes 1-5: rA, rB, valC), Instr valid, Need regids, Need valC, PC increment (valP), Predict PC]

Standard Fetch Timing

[Figure: within 1 clock cycle, in sequence: Select PC; Mem. Read; need_regids, need_valC; Increment]

Must Perform Everything in Sequence
Can't compute incremented PC until know how much to increment it by

A Fast PC Increment Circuit

[Figure: PC split into high-order 29 bits and low-order 3 bits. Slow path: 29-bit incrementer on the high-order bits. Fast path: 3-bit adder on the low-order bits with inputs need_regids, 0, need_valC. The adder's carry drives a MUX (0/1) that selects the high-order 29 bits of incrPC.]
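The fast PC increment circuit can be modelled in software to see why the split works. The sketch below assumes the 32-bit Y86 encoding, where an instruction is 1 byte plus 1 register byte (if need_regids) plus a 4-byte constant (if need_valC); the function names are illustrative and not taken from the slides.

```python
MASK29 = (1 << 29) - 1
MASK32 = (1 << 32) - 1


def instr_length(need_regids: bool, need_valc: bool) -> int:
    """Assumed Y86 instruction length in bytes: 1, 2, 5, or 6."""
    return 1 + int(need_regids) + 4 * int(need_valc)


def fast_increment(pc: int, need_regids: bool, need_valc: bool) -> int:
    """Model the split incrementer: slow 29-bit path plus fast 3-bit path."""
    high = (pc >> 3) & MASK29        # high-order 29 bits
    low = pc & 0b111                 # low-order 3 bits

    # Slow path: depends only on the PC, so it can start before the
    # instruction bytes have been read.
    high_plus_one = (high + 1) & MASK29

    # Fast path: small add once the instruction length is known.
    # low <= 7 and length <= 6, so the sum is at most 13 and the
    # carry out of the low 3 bits is 0 or 1.
    total = low + instr_length(need_regids, need_valc)
    carry = total >> 3
    new_low = total & 0b111

    # Final MUX: the carry selects between high and high + 1.
    new_high = high_plus_one if carry else high
    return (new_high << 3) | new_low


# Example: a 6-byte instruction at 0x006 puts the next PC at 0x00c.
print(hex(fast_increment(0x006, True, True)))
```

Because the carry out of the low-order bits can never exceed 1, the 29-bit incrementer only ever needs to compute "high + 1", which is what lets it run in parallel with the memory read.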

Modified Fetch Timing

[Figure: within 1 clock cycle: Select PC; Mem. Read; need_regids, need_valC; 3-bit add; MUX. The 29-bit incrementer runs alongside these steps. Shown against the standard cycle.]

29-Bit Incrementer
    Acts as soon as PC selected
    Output not needed until final MUX
    Works in parallel with memory read

More Realistic Fetch Logic

[Figure: Fetch Box with Fetch Control, Instr. Length, Other PC Controls, Current Block, Next Block; outputs Byte 0 and Bytes 1-5 of the current instruction]

Fetch Box
    Integrated into instruction cache
    Fetches entire cache block (16 or 32 bytes)
    Selects current instruction from current block
    Works ahead to fetch next block
        As reaches end of current block
        At branch target

Exceptions

Conditions under which pipeline cannot continue normal operation
Causes
    Halt instruction                        (Current)
    Bad address for instruction or data     (Previous)
    Invalid instruction                     (Previous)
    Pipeline control error                  (Previous)
Desired Action
    Complete some instructions
        Either current or previous (depends on exception type)
    Discard others
    Call exception handler
        Like an unexpected procedure call

Exception Examples

Detect in Fetch Stage
    jmp $-1                         # Invalid jump target
    .byte 0x                        # Invalid instruction code
    halt                            # Halt instruction
Detect in Memory Stage
    irmovl $100,%eax
    rmmovl %eax,0x10000(%eax)       # invalid address

Exceptions in Pipeline Processor #1

# demo-exc1.ys
    irmovl $100,%eax
    rmmovl %eax,0x10000(%eax)       # Invalid address
    nop
    .byte 0x                        # Invalid instruction code

[Pipeline diagram:
    0x000: irmovl $100,%eax
    0x006: rmmovl %eax,0x10000(%eax)
    0x00c: nop
    0x00d: .byte 0x
 Exception detected when .byte is fetched; exception detected when rmmovl reaches the memory stage (cycle 5)]

Desired Behavior
    rmmovl should cause exception

Exceptions in Pipeline Processor #2

# demo-exc2.ys
    0x000: xorl %eax,%eax           # Set condition codes
    0x002: jne t                    # Not taken
    0x007: irmovl $1,%eax
    0x00d: irmovl $2,%edx
    0x013: halt
    0x014: t: .byte 0x              # Target

[Pipeline diagram:
    0x000: xorl %eax,%eax
    0x002: jne t
    0x014: t: .byte 0x
    0x???: (I'm lost!)
    0x007: irmovl $1,%eax
 Exception detected when .byte at the predicted branch target is fetched]

Desired Behavior
    No exception should occur

Maintaining Exception Ordering

[Figure: pipeline registers with an added exc field:
    W: exc icode valE valM dstE dstM
    M: exc icode Bch valE valA dstE dstM
    E: exc icode ifun valC valA valB dstE dstM srcA srcB
    D: exc icode ifun rA rB valC valP
    F: predPC]

Add exception status field to pipeline registers
Fetch stage sets to either "AOK," "ADR" (when bad fetch address), or "INS" (illegal instruction)
Decode & execute pass values through
Memory either passes through or sets to "ADR"
Exception triggered only when instruction hits write back

Side Effects in Pipeline Processor

# demo-exc3.ys
    irmovl $100,%eax
    rmmovl %eax,0x10000(%eax)       # invalid address
    addl %eax,%eax                  # Sets condition codes

[Pipeline diagram:
    0x000: irmovl $100,%eax
    0x006: rmmovl %eax,0x10000(%eax)
    0x00c: addl %eax,%eax
 In cycle 5: exception detected (rmmovl in memory stage), condition code set (addl in execute stage)]

Desired Behavior
    rmmovl should cause exception
    No following instruction should have any effect
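The exception-ordering rule (status travels with the instruction and is acted on only at write back) can be sketched as a small Python model. This is a simplification, not the PIPE simulator: it assumes instructions reach write back in program order and that cancelled instructions become bubbles that never get there. The Instr type and first_exception function are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class Instr(NamedTuple):
    text: str
    exc: str = "AOK"          # status carried along in the pipeline registers
    cancelled: bool = False   # True if squashed, e.g. after a misprediction


def first_exception(fetch_order: Sequence[Instr]) -> Optional[Instr]:
    """Return the instruction whose exception is triggered at write back."""
    for ins in fetch_order:
        if ins.cancelled:
            continue          # turned into a bubble; never reaches write back
        if ins.exc != "AOK":
            return ins        # first non-AOK status to reach write back
    return None


# demo-exc1: the invalid instruction is detected earlier in time (fetch),
# but rmmovl reaches write back first, so its exception is the one reported.
demo_exc1 = [
    Instr("irmovl $100,%eax"),
    Instr("rmmovl %eax,0x10000(%eax)", exc="ADR"),
    Instr("nop"),
    Instr(".byte (invalid)", exc="INS"),
]

# demo-exc2 (up to the second irmovl): the invalid byte at the branch
# target is fetched, then cancelled when the branch resolves as not taken.
demo_exc2 = [
    Instr("xorl %eax,%eax"),
    Instr("jne t"),
    Instr("t: .byte (invalid)", exc="INS", cancelled=True),
    Instr("irmovl $1,%eax"),
    Instr("irmovl $2,%edx"),
]

print(first_exception(demo_exc1))
print(first_exception(demo_exc2))
```

Deferring the decision to write back is what gives both desired behaviors: the earlier instruction in program order wins in demo-exc1, and the cancelled instruction in demo-exc2 never raises anything.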

Avoiding Side Effects

Presence of Exception Should Disable State Update
    When detect exception in memory stage
        Disable condition code setting in execute
        Must happen in same clock cycle
    When exception passes to write-back stage
        Disable memory write in memory stage
        Disable condition code setting in execute stage
Implementation
    Hardwired into the design of the PIPE simulator
    You have no control over this

Rest of Exception Handling

Calling Exception Handler
    Push PC onto stack
        Either PC of faulting instruction or of next instruction
        Usually pass through pipeline along with exception status
    Jump to handler address
        Usually fixed address
        Defined as part of ISA
Implementation
    Haven't tried it yet!

Modern CPU Design

[Figure: block diagram. Instruction Control: Fetch Control, Instruction Decode, Instruction Cache (Address, Instrs.), Retirement Unit, Register File. Execution: Functional Units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store). Signals: Operations, Operation Results, Register Updates, Prediction OK?, Addr.]

Instruction Control

Grabs Instruction Bytes From Memory
    Based on Current PC + Predicted Targets for Predicted Branches
    Hardware dynamically guesses whether branches taken/not taken and (possibly) branch target
Translates Instructions Into Operations
    Primitive steps required to perform instruction
    Typical instruction requires 1-3 operations
Converts Register References Into Tags
    Abstract identifier linking destination of one operation with sources of later operations


Execution Unit

[Figure: Execution block with Functional Units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store); signals Operations, Operation Results, Register Updates, Prediction OK?, Addr.]

Multiple functional units
    Each can operate independently
Operations performed as soon as operands available
    Not necessarily in program order
    Within limits of functional units
Control logic
    Ensures behavior equivalent to sequential program execution

CPU Capabilities of Pentium III

Multiple Instructions Can Execute in Parallel
    1 load
    1 store
    2 integer (one may be branch)
    1 FP Addition
    1 FP Multiplication or Division
Some Instructions Take > 1 Cycle, but Can be Pipelined
    Instruction                     Latency     Cycles/Issue
    Load / Store                    3           1
    Integer Multiply                4           1
    Integer Divide
    Double/Single FP Multiply       5           2
    Double/Single FP Add            3           1
    Double/Single FP Divide

PentiumPro Block Diagram

P6 Microarchitecture
    PentiumPro
    Pentium II
    Pentium III

[Figure: PentiumPro block diagram. Source: Microprocessor Report 2/16/95]

PentiumPro Operation

Translates instructions dynamically into "Uops"
    118 bits wide
    Holds operation, two sources, and destination
Executes Uops with "Out of Order" engine
    Uop executed when
        Operands available
        Functional unit available
    Execution controlled by "Reservation Stations"
        Keeps track of data dependencies between uops
        Allocates resources
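To see what the Latency and Cycles/Issue columns mean for throughput, the sketch below estimates how long a run of operations takes on one pipelined functional unit. It rests on assumptions not stated on the slides: the operations are mutually independent, they all go to a single unit, and nothing else competes for it. Only the table rows with values are included, and the names are illustrative.

```python
# (latency, cycles per issue), taken from the rows of the table that have values
UNITS = {
    "load_store": (3, 1),
    "integer_multiply": (4, 1),
    "fp_multiply": (5, 2),
    "fp_add": (3, 1),
}


def cycles_for_independent_ops(n: int, latency: int, issue: int) -> int:
    """Cycles until the last of n independent operations finishes.

    A new operation can start every `issue` cycles, and each one
    takes `latency` cycles from start to result.
    """
    if n <= 0:
        return 0
    return latency + (n - 1) * issue


for name, (latency, issue) in UNITS.items():
    total = cycles_for_independent_ops(10, latency, issue)
    print(f"{name}: 10 independent ops finish after {total} cycles")
```

If each operation instead depended on the previous result, the estimate would be n times the latency, which is why out-of-order execution tries to find independent operations to overlap.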


PentiumPro Branch Prediction

Critical to Performance
    11-15 cycle penalty for misprediction
Branch Target Buffer
    512 entries
    4 bits of history
    Adaptive algorithm
        Can recognize repeated patterns, e.g., alternating taken / not taken
Handling BTB misses
    Detect in cycle 6
    Predict taken for negative offset, not taken for positive
        Loops vs. conditionals

Example Branch Prediction

Branch History
    Encode information about prior history of branch instructions
    Predict whether or not branch will be taken
State Machine
    [Figure: four states, left to right: No!, No?, Yes?, Yes!; T edges point right, NT edges point left]
    Each time branch taken, transition to right
    When not taken, transition to left
    Predict branch taken when in state Yes! or Yes?

Pentium 4 Block Diagram

[Figure: L2 Cache, IA32 Instrs., Instruct. Decoder, uops, Trace Cache, Operations. Source: Intel Tech. Journal Q1]

Next generation microarchitecture

Pentium 4 Features

Trace Cache
    Replaces traditional instruction cache
    Caches instructions in decoded form
    Reduces required rate for instruction decoder
Double-Pumped ALUs
    Simple instructions (add) run at 2X clock rate
Very Deep Pipeline
    20+ cycle branch penalty
    Enables very high clock rates
    Slower than Pentium III for a given clock rate
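The four-state predictor from the Example Branch Prediction slide can be written as a short Python sketch. It assumes the end states saturate (a taken branch in Yes! stays in Yes!, a not-taken branch in No! stays in No!) and that the predictor starts in No!; neither detail is spelled out on the slide, and the function names are illustrative.

```python
STATES = ["No!", "No?", "Yes?", "Yes!"]   # left to right, as in the state machine


def predict_taken(state: int) -> bool:
    """Predict taken when in state Yes? or Yes!."""
    return state >= 2


def next_state(state: int, taken: bool) -> int:
    """Move right when taken, left when not taken, saturating at the ends."""
    if taken:
        return min(state + 1, len(STATES) - 1)
    return max(state - 1, 0)


def run_predictor(outcomes, state: int = 0):
    """Return (number of correct predictions, name of final state)."""
    correct = 0
    for taken in outcomes:
        if predict_taken(state) == taken:
            correct += 1
        state = next_state(state, taken)
    return correct, STATES[state]


# A loop branch: taken 9 times, then falls through once.
loop_branch = [True] * 9 + [False]
print(run_predictor(loop_branch))   # (7, 'Yes?')
```

Starting from No!, the predictor misses the first two taken branches while it warms up and misses the final loop exit, so 7 of the 10 predictions are correct. Because it ends in Yes? rather than a No state, it would still predict taken the next time the loop is entered.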

Processor Summary

Design Technique
    Create uniform framework for all instructions
        Want to share hardware among instructions
    Connect standard logic blocks with bits of control logic
Operation
    State held in memories and clocked registers
    Computation done by combinational logic
    Clocking of registers/memories sufficient to control overall behavior
Enhancing Performance
    Pipelining increases throughput and improves resource utilization
    Must make sure maintains ISA behavior
