CS/ECE 552: Introduction To Computer Architecture 1

Size: px

Start display at page:

Download "CS/ECE 552: Introduction To Computer Architecture 1"

Nancy Farmer
6 years ago
Views:

1 ECE/CS 552: idterm Review Instructor: ikko H Lipasti Fall 2 University i of Wisconsin-adison i Lecture notes based on notes by ark Hill and John P Shen Updated by ikko Lipasti Computer Architecture Exercise in engineering tradeoff analysis Find the fastest/cheapest/power-efficient/etc solution Optimization problem with s of variables All the variables are changing At non-uniform rates With inflection points Only one guarantee: Today s right answer will be wrong tomorrow Two high-level effects: Technology push Application Pull Abstraction Difference between interface and implementation Interface: WHAT something does Implementation: HOW it does so What s the Big Deal? Tower of abstraction Complex interfaces implemented by layers below Abstraction hides detail Hundreds of engineers build one product Complexity unmanageable otherwise Application Program CS32 Operating System Compiler CS537 CS536 achine Language (ISA) CS354 Digital Logic ECE352 Electronic circuits ECE34 Semiconductor devices ECE335 Performance vs Design Time Time to market is critically important Eg, a new design may take 3 years It will be 3 times faster But if technology improves 5%/year In 3 years 5 3 = 338 So the new design is worse! (unless it also employs new technology) Bottom Line Designers must know BOTH software and hardware Both contribute to layers of abstraction IC costs and performance Compilers and Operating Systems Architecture

2 Performance Time and performance: achine A n times faster than achine B Iff Time(B)/Time(A) = n Iron Law: Performance = Time/program = Instructions Cycles = Time X X Program Instruction Cycle (code size) (CPI) (cycle time) Performance cont d Other etrics: IPS and FLOPS Beware of peak and omitted details Benchmarks: SPEC2 (95 in text) Summarize performance: A for time H for rate G for ratio Speedup Amdahl s Law: f f s Ch 2 Summary Basics Registers and ALU ops emory and load/store Branches and jumps Addressing odes Summary: Instruction Formats R: opcode rs rt rd shamt function I: opcode rs rt address/immediate J: opcode addr 6 26 Instruction decode: instruction bits Activate control signals Conclusions Simple and regular Constant length instructions, fields in same place Small and fast Small number of operands in registers Compromises inevitable Pipelining should not be hindered ake common case fast! Backwards compatibility! Basic Arithmetic and the ALU Number representations: 2 s complement, unsigned Addition/Subtraction Add/Sub ALU Full adder, ripple carry, subtraction Carry-lookahead addition Logical operations and, or, xor, nor, shifts Overflow Architecture 2

3 Unsigned Integers f(b3b) = b3 x b x 2 + b x 2 Treat as normal binary number Eg = x x x x x x 2 + x 2 = = 23 ax f( ) = 2 32 = 4,294,967,295 in f( ) = Range [,2 32 -] => # values (2 32 -) + = 2 32 Signed Integers 2 s complement f(b3 b b) = -b3 x b x 2 + b x 2 ax f( ) = 2 3 = in f( ) = -2 3 = (asymmetric) Range[-2 3,2 3 -] => # values( _ ) = 2 32 Eg 6 => + => Full Adder Full adder (a,b,c in ) => (c out, s) c out = two or more of (a, b, c in ) s = exactly one or three of (a,b,c in ) a b c in c out s Combined Ripple-carry Adder/Subtractor Control = => subtract XOR B with control and set c in to control Full Full Full Full Add Add Add Add er er er er a b a b a 2 b 2 C out a 3 b 3 operation c 4 4-bit Carry Lookahead Adder Carry Lookahead Block c Hierarchical Carry Lookahead for 6 bits c c 5 Carry Lookahead Block g 3 p 3 a 3 b 3 g 2 p 2 a 2 b 2 g p a b g p a b P G a,b 2-5 P G a,b 8- P G a 4-7 b 4-7 G P a -3 b -3 c 3 c 2 c c c 2 c 8 c 4 c s 3 s 2 s s s 2-5 s 8- s 4-7 s -3 Architecture 3

4 CLA: Compute G s and P s G 2,5 G 8, G 4,7 G,3 P 2,5 P 8, P 4,7 P,3 G 8,5 P 8,5 G,5 P,5 G,7 P,7 CLA: Compute Carries g 2 -g 5 g 8 -g g 4 -g 7 g -g 3 p 2 -p 5 p 8 -p p 4 -p 7 p -p 3 c4 c c 2 c 8 G 8 8, P 8, c 8 c G,7 P,7 c G 3,3 P,3 All Together Addition Overflow a invert carry in operation ux result = 5 > 4: + = =? 3 < X is f(2) : + = > Y is ~f(2) Overflow = f(2) * ~(a2)*~(b2) + ~f(2) * a(2) * b(2) b ux Add Subtraction Overflow No overflow on a-b if signs are the same Neg pos => neg ;; overflow otherwise Pos neg => pos ;; overflow otherwise Overflow = f(2) * ~(a2)*(b2) + ~f(2) * a(2) * ~b(2) What to do on Overflow? Ignore! (C language semantics) What about Java? (try/catch?) Flag condition code Sticky flag eg for floating point Otherwise gets in the way of fast hardware Trap possibly maskable IPS has eg add that traps, addu that does not Architecture 4

5 Ch 3 Summary Binary representations, signed/unsigned Arithmetic Full adder, ripple-carry, carry lookahead Carry-select, Carry-save Overflow, negative ore (multiply/divide/fp) later Logical Shift, and, or Ch 4 Processor Implementation Heart of 552 key to project Sequential logic design review (brief) Clock methodology (FSD) Datapath CPI Single instruction, 2 s complement, unsigned Control ultiple cycle implementation (information only) icroprogramming (information only) Exceptions Clocking ethology otivation Design data and control without considering clock Use Fully Synchronous Design (FSD) Just a convention to simplify design process Restricts design freedom Eliminates complexity, can guarantee timing correctness Not really feasible in real designs Even in 554 you will violate FSD Our ethodology Only flip-flops All on the same edge (eg falling) All with same clock No need to draw clock signals All logic finishes in one cycle Logic FFs Logic FFs Our ethodology, cont d No clock gating! Book has bad examples Correct design: new new state write AND clock write state current current Delayed Clocks (Gating) Clock Gated clock X Y X Clock D Delay Delay Problem: Some flip-flops receive gated clock late Data signal may violate setup & hold req t Y D Architecture 5

6 FSD Clocking Rules Clock Y T clock = cycle time T setup = FF setup time requirement T hold = FF hold time requirement T FF = FF combinational delay T comb = Combinational delay FSD Rules: T clock > T FF + T comb + T setup T FF + T comb > T hold Clock D Delay Y D All Together Register File? DFF Bit Slice CE D DFF Control Signals w/jumps Data_C(i) C_Adx 4 C Adx Decoder A_Adx 4 A Adx Decoder 5 i DFF DFF DFF 5 B_Adx 4 B Adx Decoder C = Write Port A,B = Ports Data_A(i) Data_B(i) ulti-cycle Implementation ulti-cycle Ctrl Signals Clock cycle = max(i-mem,reg-read+reg-write, ALU, d-mem) Reuse combination logic on different cycles One memory One ALU without other adders But Control is more complex (later) Need new registers to save values (eg IR) Used again on later cycles Logic that computes signals is reused Architecture 6

7 ulti-cycle Steps Step Description Sample Actions IF Fetch IR=E[PC] PC=PC+4 ID Decode A=RF(IR[25:2]) B=RF(IR[2:6]) Target=PC+SE(IR[5:] << 2) EX Execute ALUout = A + SE(IR[5:]) # lw/sw ALUout = A op B # rrr if (A==B) PC = target # beq em emory E[ALUout] = B # sw DR = E[ALUout] #lw RF(IR[5:]) = ALUout # rrr WB Writeback Reg(IR[2:6]) = DR # lw ulti-cycle Example (lw) EX E ALUSrcA = ALUSrcB = ALUOp = em IorD = WB LW Start LW SW SW RegDst = RegWrite emtoreg = IF em ALUSrcA= IorD = IRWrite ALUSrcB = ALUOp = PCWrite PCSrc = ALUSrcA = ALUSrcB = ALUOp = emwrite IorD = RRR BEQ ALUSrcA = ALUSrcB = ALUOp = PCWriteCond PCSource = WB RegDst = RegWrite emtoreg = ID ALUSrcA = ALUSrcB = ALUOp = J PCWrite PCSource = ulti-cycle Example (lw) icroprogramming Alternative way of specifying control FS State bubble Control signals in bubble Next state given by signals on arc Not a great language for specifying complex events Instead, treat as a programming problem icroprogramming Datapath remains the same Control is specified differently but does the same Each cycle a microprogram field specifies required control signals Label Alu Src Src2 Reg emory Pcwrite Next? Fetch Add Pc 4 pc Alu Alu + Add Pc Extshft Dispatch em Add A Extend Dispatch 2 Lw2 alu + Write mdr fetch Exceptions: Big Picture Two types: Interrupt (asynchronous) or Trap p( (synchronous) Hardware handles initial reaction Then invokes a software exception handler By convention, at eg xc O/S kernel provides code at the handler address Architecture 7

8 Exceptions: Hardware Sets state that identifies cause of exception IPS: in exception_code field of Cause register Changes to kernel mode for dangerous work ahead Disables interrupts IPS: recorded in status register Saves current PC (IPS: exception PC) Jumps to specific address (IPS: x88) Like a surprise JAL so can t clobber $3 Exceptions: Software Exception handler: IPS: ktext at x88 Set flag to detect incorrect entry Nested exception while in handler Save some registers Find exception type Eg I/O interrupt or syscall Jump to specific exception handler Exceptions: Software, cont d Handle specific exception Jump to clean-up to resume user program Restore registers Reset flag that detects incorrect entry Atomically Restore previous mode Enable interrupts Jump back to program (using EPC) Implementing Exceptions We worry only about hardware, not s/w IntCause undefined instruction arithmetic overflow Changes to the datapath New states in control FS FS With Exceptions Review Type Control Datapath Time (CPI, cycle time) Singlecycle ulticycle We want? Comb + end update No reuse cycle, (imem + reg + ALU + dmem) Comb + FS Reuse [3,5] cycles, update ax(imem, reg, ALU, dmem)?? ~ cycle, ax(imem, reg, ALU, dmem We will use pipelining to achieve last row Architecture 8

9 Pipelining (45-49) 49) Summary Big Picture Datapath Control Data Hazards Stalls Forwarding Control Hazards Exceptions Ideal Pipelining L L L n -- Gate 2 Delay n -- Gate 3 Delay L Comb Logic n Gate Delay L n -- Gate 3 Delay BW = ~(/n) n -- Gate 2 Delay BW = ~(2/n) n -- Gate 3 Delay BW = ~(3/n) Bandwidth increases linearly with pipeline depth Latency increases by latch delays L Ideal Pipelining Cycle: Instr: i F D X W i+ F D X W i+2 F D X W i+3 F D X W i+4 F D X W 2 3 Pipelining Idealisms Uniform subcomputations Can pipeline into stages with equal delay Identical computations Can fill pipeline with identical work Independent computations No relationships between work units Are these practical? No, but can get close enough to get significant speedup Complications Datapath Five (or more) instructions in flight Control ust correspond to multiple instructions Instructions may have data and control flow dependences Ie units of work are not independent One may have to stall and wait for another Program Data Dependences True dependence (RAW) j cannot execute until i produces its result Anti-dependence d (WAR) j cannot write its result until i has read its sources Output dependence (WAW) j cannot write its result until i has written its result D( i) R( j) R ( i ) D ( j ) D( i) D( j) Architecture 9

10 Control Dependences Resolution of Pipeline Hazards Conditional branches Branch must execute to determine which instruction to fetch next Instructions following a conditional branch are control dependent on the branch instruction Pipeline hazards Potential violations of program dependences ust ensure program dependences are not violated Hazard resolution Static: compiler/programmer guarantees correctness Dynamic: hardware performs checks at runtime Pipeline interlock Hardware mechanism for dynamic hazard resolution ust detect and enforce dependences at runtime Pipeline Hazards Pipelined Datapath Necessary conditions: WAR: write stage earlier than read stage Is this possible in IF-RD-EX-E-WB? WAW: write stage earlier than write stage Is this possible in IF-RD-EX-E-WB? RAW: read stage earlier than write stage Is this possible in IF-RD-EX-E-WB? If conditions not met, no need to resolve Check for both register and memory PC u x Add IF/ID ID/EX EX/E E/WB 4 Add Address Instruction memory Instruction register data register 2 Registers Write data 2 register Write data 6 32 Sign extend Shift left 2 u x Add result Zero ALU ALU result Address Write data Data memory data u x Pipelined Control PCSrc Pipelined Control PC u x Add IF/ID Add 4 Add result Shift left 2 ALUSrc Address Instruction memory Instruction register register 2 RegWr rite Registers Write register Write data Control data data 2 ID/EX WB EX u x Zero ALU ALU result EX/E WB E/WB WB Branch emwrite emtoreg Address data Data memory u x Write data Controlled by different instructions Decode instructions and pass the signals down the pipe pp Control sequencing is embedded in the pipeline Instruction 6 32 [5 ] Sign extend Instruction [2 6] Instruction [5 ] 6 ALU control ALUOp u x RegDst em Architecture

11 Data Hazards ust first detect hazards ID/EXWriteRegister = IF/IDRegister ID/EXWriteRegister = IF/IDRegister2 EX/EWriteRegister = IF/IDRegister EX/EWriteRegister = IF/IDRegister2 E/WBWriteRegister = IF/IDRegister E/WBWriteRegister = IF/IDRegister2 Forwarding Paths (ALU instructions) c b a ALU FORWARDING PATHS IF ID RD i+: R i+2: R i+3: R ALU E WB i: R (i i+) Forwarding via Path a i+: i: R R (i i+2) Forwarding via Path b i+2: i+: i: R R (i i+3) i writes R before i+3 reads R Implementation of ALU Forwarding Comp Comp Comp Comp Register File ALU Control Flow Hazards What to do? Always stall Easy to implement Performs poorly /6 th instructions are branches, each branch takes 3 cycles CPI = + 3 x /6 = 5 (lower bound) Control Flow Hazards Predict branch not taken Send sequential instructions down pipeline Kill instructions later if incorrect ust stop memory accesses and RF writes Including loads (why?) Late flush of instructions on misprediction Complex Global signal (wire delay) Exceptions Even worse: in one cycle I/O interrupt User trap to OS (EX) Illegal instruction (ID) Arithmetic overflow Hardware error Etc Interrupt priorities must be supported Architecture

12 Review Big Picture Datapath Control Data hazards Stalls Forwarding or bypassing Control flow hazards Branch prediction Exceptions IB RISC Experience [Agerwala and Cocke 987] Internal IB study: Limits of a scalar pipeline? emory Bandwidth Fetch instr/cycle from I-cache 4% of instructions are load/store (D-cache) Code characteristics (dynamic) Loads 25% Stores 5% ALU/RR 4% Branches 2% /3 unconditional (always taken /3 conditional taken, /3 conditional not taken Simplify Branches Assume 9% can be PC-relative 5% Overhead No register indirect, no register access from program Separate adder (like IPS R3) dependences Branch penalty reduced Total CPI: = 5 CPI = 87 IPC PC-relative Schedulable Penalty Yes (9%) Yes (5%) cycle Yes (9%) No (5%) cycle No (%) Yes (5%) cycle No (%) No (5%) 2 cycles Processor Performance Time Processor Performance = Program Instructions Cycles = Time X X Program Instruction Cycle (code size) (CPI) (cycle time) In the 98 s (decade of pipelining): CPI: 5 => 5 In the 99 s (decade of superscalar): CPI: 5 => 5 (best case) Revisit Amdahl s Law Sequential bottleneck lim v f f Even if v is infinite v Performance limited by nonvectorizable portion (-f) N No of Processors h - h - f f Time f Pipelined Performance odel N Pipeline Depth -g g g = fraction of time pipeline is filled -g = fraction of time pipeline is not filled (stalled) Architecture 2

13 Pipelined Performance odel Pipelined Performance odel N Pipeline Depth -g g g = fraction of time pipeline is filled -g = fraction of time pipeline is not filled (stalled) N Pipeline Depth -g g Tyranny of Amdahl s Law [Bob Colwell] When g is even slightly below %, a big performance hit will result Stalled cycles are the key adversary and must be minimized as much as possible otivation for Superscalar [Agerwala and Cocke] p Speedup p Speedup jumps from 3 to 43 for N=6, f=8, but s =2 instead of s= (scalar) n=6,s=2 n= n=2 n=6 n=4 Typical Range Vectorizability f Superscalar Proposal oderate tyranny of Amdahl s Law Ease sequential bottleneck ore generally applicable Robust (less sensitive to f) Revised Amdahl s Law: Speedup f s f v Limits on Instruction Level Parallelism (ILP) Weiss and Smith [984] 58 Sohi and Vajapeyam [987] 8 Tjaden and Flynn [97] 86 (Flynn s bottleneck) Tjaden and Flynn [973] 96 Uht [986] 2 Smith et al [989] 2 Jouppi and Wall [988] 24 Johnson [99] 25 Acosta et al [986] 279 Wedig [982] 3 Butler et al [99] 58 elvin and Patt [99] 6 Wall [99] 7 (Jouppi disagreed) Kuck et al [972] 8 Riseman and Foster [972] 5 (no control dependences) Nicolau and Fisher [984] 9 (Fisher s optimism) Superscalar Proposal Go beyond single instruction pipeline, achieve IPC > Dispatch multiple instructions per cycle Provide more generally applicable form of concurrency (not just vectors) Geared for sequential code that is hard to parallelize otherwise Exploit fine-grained or instruction-level parallelism (ILP) Architecture 3

14 Classifying ILP achines [Jouppi, DECWRL 99] Scalar pipelined Superpipelined Superscalar VLIW Superpipelined superscalar Review Summary Ch : Intro & performance Ch 2: Instruction Sets Ch 3: Arithmetic I Ch 4: Data path, control,,pipeliningp Details Fri /29 2:25-3:3 ( hour) in EH237 Closed books/notes/homeworks One page handwritten cheatsheet for quick reference A mix of short answer, design, analysis problems Architecture 4

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real