Wrap-Up. Lecture 10 Computer Architecture V. Performance Metrics. Overview. CPI for PIPE (Cont.) CPI for PIPE. Clock rate


Martha Taylor
5 years ago

Lecture 10: Computer Architecture V
Datorarkitektur 2007

Overview
  Wrap-up of PIPE design
    Performance analysis
    Fetch stage design
    Exceptional conditions
  Modern high-performance processors
    Out-of-order execution

Performance Metrics
  Clock rate
    Measured in megahertz or gigahertz
    Function of stage partitioning and circuit design
      Keep the amount of work per stage small
  Rate at which instructions are executed
    CPI: cycles per instruction
    On average, how many clock cycles does each instruction require?
    Function of pipeline design and benchmark programs
      E.g., how frequently are branches mispredicted?

CPI for PIPE
  CPI ≈ 1.0
    Fetch an instruction each clock cycle
    Effectively process a new instruction almost every cycle
      Although each individual instruction has a latency of 5 cycles
  CPI > 1.0
    Sometimes must stall or cancel branches
  Computing CPI
    C clock cycles
    I instructions executed to completion
    B bubbles injected (C = I + B)
    CPI = C/I = (I + B)/I = 1.0 + B/I
    Factor B/I represents the average penalty due to bubbles

CPI for PIPE (Cont.)
  B/I = LP + MP + RP
  LP: penalty due to load/use hazard stalling (typical values)
    Fraction of instructions that are loads           0.25
    Fraction of load instructions requiring a stall   0.20
    Number of bubbles injected each time              1
    LP = 0.25 * 0.20 * 1 = 0.05
  MP: penalty due to mispredicted branches
    Fraction of instructions that are cond. jumps     0.20
    Fraction of cond. jumps mispredicted              0.40
    Number of bubbles injected each time              2
    MP = 0.20 * 0.40 * 2 = 0.16
  RP: penalty due to ret instructions
    Fraction of instructions that are returns         0.02
    Number of bubbles injected each time              3
    RP = 0.02 * 3 = 0.06
  Net effect of penalties: 0.05 + 0.16 + 0.06 = 0.27
    CPI = 1.27 (Not bad!)
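The CPI arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `pipe_cpi` and its parameter names are made up for illustration, and the numbers passed in are the typical values quoted on the slide.

```python
def pipe_cpi(load_frac, load_stall_frac, load_bubbles,
             jump_frac, mispredict_frac, jump_bubbles,
             ret_frac, ret_bubbles):
    """Return (CPI, LP, MP, RP) using CPI = 1.0 + B/I with B/I = LP + MP + RP.

    All names here are illustrative; they are not part of the PIPE tools.
    """
    lp = load_frac * load_stall_frac * load_bubbles   # load/use hazard penalty
    mp = jump_frac * mispredict_frac * jump_bubbles   # mispredicted branch penalty
    rp = ret_frac * ret_bubbles                       # ret penalty
    return 1.0 + lp + mp + rp, lp, mp, rp


# Typical values from the slide.
cpi, lp, mp, rp = pipe_cpi(0.25, 0.20, 1, 0.20, 0.40, 2, 0.02, 3)
print(f"LP={lp:.2f} MP={mp:.2f} RP={rp:.2f} CPI={cpi:.2f}")
```

Changing a single argument (for example, lowering the misprediction fraction) shows directly how much each hazard contributes to the total.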

Fetch Logic Revisited
  During the fetch cycle
    1. Select PC
    2. Read bytes from instruction memory
    3. Examine icode to determine instruction length
    4. Increment PC
  Timing
    Steps 2 and 4 require a significant amount of time

  [Figure: fetch stage logic. Blocks: Select PC, Instruction memory, Split, Align, PC increment, Predict PC. Signals: Instr valid, icode, ifun, rA, rB, valC, valP, Need valC, Need regids, Byte 0, Bytes 1-5, M_icode, M_Bch, M_valA, W_icode, W_valM.]

Standard Fetch Timing
  [Figure: within 1 clock cycle, Select PC, then Mem. Read, then need_regids and need_valC become available, then Increment.]
  Must perform everything in sequence
  Can't compute the incremented PC until we know how much to increment it by

A Fast PC Increment Circuit
  [Figure: the PC is split into its high-order 29 bits and low-order 3 bits. A 3-bit adder (fast) adds the increment, derived from need_regids and need_valC, to the low-order 3 bits and produces a carry. A 29-bit incrementer (slow) computes high-order bits + 1. A MUX controlled by the carry selects between the original and the incremented high-order 29 bits.]

Modified Fetch Timing
  [Figure: within 1 clock cycle (standard cycle), Select PC, then Mem. Read in parallel with the 29-bit incrementer, then need_regids and need_valC, then 3-bit add, then MUX.]
  29-bit incrementer
    Acts as soon as the PC is selected
    Output not needed until the final MUX
    Works in parallel with the memory read
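A behavioral sketch of the fast PC increment circuit may make the split into a 29-bit incrementer and a 3-bit adder easier to follow. It assumes a 32-bit PC and the usual Y86 instruction length of 1 + need_regids + 4 * need_valC bytes; the function names are hypothetical, and Python integers stand in for the hardware signals.

```python
def instr_length(need_regids, need_valc):
    # Assumed Y86 encoding: 1 byte of icode/ifun, an optional register byte,
    # and an optional 4-byte constant.
    return 1 + int(need_regids) + 4 * int(need_valc)


def fast_pc_increment(pc, need_regids, need_valc):
    """Model of the split incrementer for a 32-bit PC (illustrative only)."""
    high = (pc >> 3) & 0x1FFFFFFF      # high-order 29 bits
    low = pc & 0x7                     # low-order 3 bits

    # Slow path: can start as soon as the PC is selected, because it does
    # not depend on need_regids / need_valC.
    high_plus_1 = (high + 1) & 0x1FFFFFFF

    # Fast path: small adder on the low-order bits, once the length is known.
    total = low + instr_length(need_regids, need_valc)   # at most 7 + 6 = 13
    carry = total >> 3                                   # 0 or 1
    new_low = total & 0x7

    # Final MUX: the carry picks between the two precomputed high halves.
    new_high = high_plus_1 if carry else high
    return (new_high << 3) | new_low


print(hex(fast_pc_increment(0x006, True, True)))   # 6-byte instruction at 0x006
```

The point of the design is visible in the data dependencies: `high_plus_1` never reads the instruction bytes, so in hardware it can be computed while the memory read is still in progress.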

More Realistic Fetch Logic
  [Figure: Fetch Box containing Fetch Control, Instruction Cache (Current Block, Next Block), Instr. Length, Other PC Controls; outputs Byte 0 and Bytes 1-5 of the current instruction.]
  Fetch box
    Integrated into the instruction cache
    Fetches an entire cache block (16 or 32 bytes)
    Selects the current instruction from the current block
    Works ahead to fetch the next block
      As it reaches the end of the current block
      At a branch target

Exceptions
  Conditions under which the pipeline cannot continue normal operation
  Causes
    Halt instruction                          (Current)
    Bad address for instruction or data       (Previous)
    Invalid instruction                       (Previous)
    Pipeline control error                    (Previous)
  Desired action
    Complete some instructions
      Either current or previous (depends on exception type)
    Discard others
    Call exception handler
      Like an unexpected procedure call

Exception Examples
  Detect in fetch stage
    jmp $-1                      # Invalid jump target
    .byte 0xFF                   # Invalid instruction code
    halt                         # Halt instruction
  Detect in memory stage
    irmovl $100,%eax
    rmmovl %eax,0x10000(%eax)    # Invalid address

Exceptions in Pipeline Processor #1
  # demo-exc1.ys
    irmovl $100,%eax
    rmmovl %eax,0x10000(%eax)    # Invalid address
    nop
    .byte 0xFF                   # Invalid instruction code

  [Pipeline diagram: 0x000: irmovl $100,%eax; 0x006: rmmovl %eax,0x1000(%eax); 0x00c: nop; 0x00d: .byte 0xFF. Two points are marked "Exception detected": the invalid instruction code at 0x00d in the fetch stage, and the invalid address of rmmovl in the memory stage.]
  Desired behavior
    rmmovl should cause the exception

Exceptions in Pipeline Processor #2
  # demo-exc2.ys
    0x000: xorl %eax,%eax        # Set condition codes
    0x002: jne t                 # Not taken
    0x007: irmovl $1,%eax
    0x00d: irmovl $2,%edx
    0x013: halt
    0x014: t: .byte 0xFF         # Target

  [Pipeline diagram, in fetch order: 0x000: xorl %eax,%eax; 0x002: jne t; 0x014: t: .byte 0xFF (Exception detected); 0x???: (I'm lost!); 0x007: irmovl $1,%eax.]
  Desired behavior
    No exception should occur

Maintaining Exception Ordering
  [Figure: pipeline registers, each extended with an exc field.
    W: exc icode valE valM dstE dstM
    M: exc icode Bch valE valA dstE dstM
    E: exc icode ifun valC valA valB dstE dstM srcA srcB
    D: exc icode ifun rA rB valC valP
    F: predPC]
  Add an exception status field to the pipeline registers
  Fetch stage sets it to either "AOK", "ADR" (when bad fetch address), or "INS" (illegal instruction)
  Decode and execute pass the value through
  Memory either passes it through or sets it to "ADR"
  Exception is triggered only when the instruction hits write-back

Side Effects in Pipeline Processor
  # demo-exc3.ys
    irmovl $100,%eax
    rmmovl %eax,0x10000(%eax)    # Invalid address
    addl %eax,%eax               # Sets condition codes

  [Pipeline diagram: 0x000: irmovl $100,%eax; 0x006: rmmovl %eax,0x1000(%eax); 0x00c: addl %eax,%eax. In cycle 5 the exception is detected for rmmovl in the memory stage while addl sets the condition codes in the execute stage.]
  Desired behavior
    rmmovl should cause the exception
    No following instruction should have any effect

Avoiding Side Effects
  Presence of an exception should disable state update
    When an exception is detected in the memory stage
      Disable condition code setting in execute
      Must happen in the same clock cycle
    When the exception passes to the write-back stage
      Disable memory write in the memory stage
      Disable condition code setting in the execute stage
  Implementation
    Hardwired into the design of the PIPE simulator
    You have no control over this
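The rules for avoiding side effects can be written as two small predicates. This is a sketch, not the PIPE simulator's actual control logic: the function and parameter names are hypothetical, and only the three status codes named on the slide are modeled. Treating a store whose own status is not AOK as disabled is an added assumption.

```python
AOK, ADR, INS = "AOK", "ADR", "INS"   # status codes named on the slide


def may_set_cc(e_sets_cc, m_exc, w_exc):
    """Condition codes are updated in execute only if no earlier instruction
    (now in the memory or write-back stage) carries an exception.

    m_exc is the status of the instruction in the memory stage as determined
    in this same cycle, including an ADR detected by the memory access.
    """
    return e_sets_cc and m_exc == AOK and w_exc == AOK


def may_write_memory(m_is_store, m_exc, w_exc):
    """Memory is written only if the store itself is AOK (assumption) and the
    instruction ahead of it in write-back has not raised an exception."""
    return m_is_store and m_exc == AOK and w_exc == AOK


# demo-exc3.ys, cycle 5: rmmovl is in the memory stage with a bad address,
# addl is in the execute stage and would normally set the condition codes.
print(may_set_cc(e_sets_cc=True, m_exc=ADR, w_exc=AOK))   # False
```

Because `m_exc` feeds `may_set_cc` combinationally, the disable takes effect in the same clock cycle in which the memory stage detects the bad address, which is the timing requirement stated above.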

Rest of Exception Handling
  Calling the exception handler
    Push PC onto the stack
      Either the PC of the faulting instruction or of the next instruction
      Usually passed through the pipeline along with the exception status
    Jump to the handler address
      Usually a fixed address
      Defined as part of the ISA

Modern CPU Design
  [Figure: block diagram. Instruction Control: Fetch Control, Instruction Cache, Instruction Decode, Retirement Unit, Register File. Execution: Functional Units (Integer/Branch, General Integer, FP Add, FP Mult/Div, Load, Store) and Data Cache. Connections: Address, Instrs., Operations, Operation Results, Register Updates, Prediction OK?, Addr., Data.]

Instruction Control
  Grabs instruction bytes from memory
    Based on the current PC plus predicted targets for predicted branches
    Hardware dynamically guesses whether branches are taken or not taken and (possibly) the branch target
  Translates instructions into operations
    Primitive steps required to perform the instruction
    A typical instruction requires 1-3 operations
  Converts register references into tags
    Abstract identifier linking the destination of one operation with the sources of later operations

Execution Unit
  Multiple functional units
    Each can operate independently
  Operations performed as soon as operands are available
    Not necessarily in program order
    Within the limits of the functional units
  Control logic
    Ensures behavior equivalent to sequential program execution
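The idea that operations run as soon as their operands are available, linked by tags rather than register names, can be sketched as a tiny dataflow model. Everything here is a simplification for illustration: the `schedule` function, the tag names, and the latencies are hypothetical, and the model assumes an unlimited number of functional units.

```python
def schedule(ops):
    """ops: list of (dest_tag, source_tags, latency) in program order.

    Returns {tag: cycle at which the result becomes available}.
    Each operation starts when all of its source tags are ready; there is
    no limit on functional units in this simplified model.
    """
    ready = {}
    for dest, sources, latency in ops:
        start = max((ready[s] for s in sources), default=0)
        ready[dest] = start + latency
    return ready


# Hypothetical operation stream with made-up latencies.
ops = [
    ("t1", [], 3),             # load
    ("t2", ["t1", "t1"], 4),   # multiply, depends on t1
    ("t3", [], 3),             # independent load, can overlap with the above
    ("t4", ["t2", "t3"], 1),   # add, waits for both
]
ready = schedule(ops)
print(ready["t4"], sum(latency for _, _, latency in ops))   # 8 vs. 11 if run one at a time
```

The second load finishes at cycle 3 even though it comes after the multiply in program order, which is the out-of-order behavior the slide describes; the control logic's job is to make the final results indistinguishable from sequential execution.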


CPU Capabilities of Pentium III
  Multiple instructions can execute in parallel
    1 load
    1 store
    2 integer (one may be branch)
    1 FP addition
    1 FP multiplication or division
  Some instructions take > 1 cycle, but can be pipelined

    Instruction                  Latency   Cycles/Issue
    Load / Store                    3           1
    Integer Multiply                4           1
    Integer Divide                 ...         ...
    Double/Single FP Multiply       5           2
    Double/Single FP Add            3           1
    Double/Single FP Divide        ...         ...

PentiumPro Block Diagram
  P6 microarchitecture
    PentiumPro
    Pentium II
    Pentium III
  [Figure: PentiumPro block diagram. Source: Microprocessor Report, 2/16/]

PentiumPro Operation
  Translates instructions dynamically into "Uops"
    118 bits wide
    Holds operation, two sources, and destination
  Executes Uops with an "Out of Order" engine
    Uop executed when
      Operands available
      Functional unit available
    Execution controlled by "Reservation Stations"
      Keeps track of data dependencies between uops
      Allocates resources

PentiumPro Branch Prediction
  Critical to performance
    ... cycle penalty for misprediction
  Branch Target Buffer
    512 entries
    4 bits of history
    Adaptive algorithm
      Can recognize repeated patterns, e.g., alternating taken / not taken
  Handling BTB misses
    Detect in cycle 6
    Predict taken for negative offset, not taken for positive
      Loops vs. conditionals
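The difference between latency and cycles per issue in the table above matters most for runs of independent operations. As a rough sketch, assuming a single functional unit, fully independent operations, and no other stalls, the total time follows directly from the two columns; the function name is hypothetical.

```python
def cycles_for_independent_ops(n, latency, issue_interval):
    """Cycles to complete n independent operations on one pipelined unit.

    The first result appears after `latency` cycles; each further operation
    can start `issue_interval` cycles after the previous one.
    """
    if n == 0:
        return 0
    return latency + (n - 1) * issue_interval


# Values taken from the table: (latency, cycles/issue).
print(cycles_for_independent_ops(10, 5, 2))   # FP multiply: 5 + 9 * 2 = 23
print(cycles_for_independent_ops(10, 3, 1))   # Load / Store or FP add: 3 + 9 * 1 = 12
```

For a chain of dependent operations the picture changes: each one must wait for the full latency of its predecessor, so ten dependent FP multiplies would take 10 * 5 = 50 cycles under the same assumptions.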


Example Branch Prediction
  Branch history
    Encode information about the prior history of branch instructions
    Predict whether or not the branch will be taken
  State machine
    [Figure: four states, left to right: No!, No?, Yes?, Yes!, with taken (T) and not-taken (NT) transitions between neighbors.]
    Each time the branch is taken, transition to the right
    When not taken, transition to the left
    Predict branch taken when in state Yes! or Yes?

Pentium 4 Block Diagram
  Next generation microarchitecture
  [Figure: Pentium 4 block diagram. Source: Intel Tech. Journal Q1, 2001]

Pentium 4 Features
  [Figure: L2 Cache supplies IA32 Instrs. to the Instruct. Decoder, which produces uops for the Trace Cache, which supplies Operations.]
  Trace cache
    Replaces the traditional instruction cache
    Caches instructions in decoded form
    Reduces the required rate for the instruction decoder
  Double-pumped ALUs
    Simple instructions (add) run at 2X clock rate
  Very deep pipeline
    20+ cycle branch penalty
    Enables very high clock rates
    Slower than Pentium III for a given clock rate

Processor Summary
  Design technique
    Create a uniform framework for all instructions
      Want to share hardware among instructions
    Connect standard logic blocks with bits of control logic
  Operation
    State held in memories and clocked registers
    Computation done by combinational logic
    Clocking of registers/memories is sufficient to control overall behavior
  Enhancing performance
    Pipelining increases throughput and improves resource utilization
    Must make sure it maintains ISA behavior
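The four-state branch predictor from the Example Branch Prediction slide (No!, No?, Yes?, Yes!) is small enough to sketch in code. The class and function names are hypothetical, the starting state is an arbitrary choice, and the sketch assumes the state saturates at both ends, which the slide does not spell out.

```python
STATES = ["No!", "No?", "Yes?", "Yes!"]   # left to right, as on the slide


class TwoBitPredictor:
    def __init__(self, state=1):          # start in "No?" (arbitrary assumption)
        self.state = state

    def predict(self):
        # Predict taken when in Yes? or Yes!
        return self.state >= 2

    def update(self, taken):
        # Taken: move right; not taken: move left; saturate at the ends.
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)


def count_mispredictions(outcomes, start_state=1):
    predictor = TwoBitPredictor(start_state)
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch taken 9 times and then not taken, executed twice.
loop = ([True] * 9 + [False]) * 2
print(count_mispredictions(loop))   # 3
```

In this model a single loop exit only moves the state from Yes! to Yes?, so the next execution of the loop is still predicted correctly from its first iteration. A strictly alternating taken / not-taken pattern starting from No? is mispredicted every time, which is the kind of case a history-based adaptive scheme is meant to handle.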
