CSE 141 Computer Architecture, Summer Session I, 2005
Lecture 3: Performance and Single Cycle CPU, Part 1
Pramod V. Argade

CSE141: Introduction to Computer Architecture
Instructor: Pramod V. Argade (p2argade@cs.ucsd.edu)
Office Hour: Tue. 7:30-9:00 (Center 105); Wed. 5:00-6:00 PM (HSS 1330); By Appointment
TAs: Chengmo Yang (c5yang@cs.ucsd.edu), Wenjing Rao (wrao@cs.ucsd.edu)
Lecture: Mon/Wed. 6:00-8:50 PM, HSS 1330
Textbook: Computer Organization & Design: The Hardware/Software Interface, 3rd Edition. Authors: Patterson and Hennessy
Web-page: http://www.cse.ucsd.edu/classes/su05/cse141

Summer Session I, 2005: CSE141 Course Schedule

Lecture # | Date | Time | Room | Topic | Quiz topic | Homework Due
1 | Mon. 6/27 | 6-8:50 PM | HSS 1330 | Introduction, Ch. 1; ISA, Ch. 2 | - | -
2 | Wed. 6/29 | 6-8:50 PM | HSS 1330 | Arithmetic, Ch. 3 | ISA, Ch. 2 | #1
- | Mon. 7/4 | No Class | - | July 4th Holiday | - | -
3 | Wed. 7/6 | 6-8:50 PM | HSS 1330 | Performance, Ch. 4; Single-cycle CPU, Ch. 5 | Arithmetic, Ch. 3 | #2
4 | Mon. 7/11 | 6-8:50 PM | HSS 1330 | Single-cycle CPU, Ch. 5 Cont.; Multi-cycle CPU, Ch. 5 | Performance, Ch. 4 | #3
5 | Tue. 7/12 | 7:30-8:50 PM | HSS 1330 | Multi-cycle CPU, Ch. 5 Cont. (July 4th make up class) | - | -
6 | Wed. 7/13 | 6-8:50 PM | HSS 1330 | Single and Multicycle CPU Examples and Review for Midterm | Single-cycle CPU, Ch. 5 | -
7 | Mon. 7/18 | 6-8:50 PM | HSS 1330 | Mid-term Exam; Exceptions | - | #4
8 | Tue. 7/19 | 7:30-8:50 PM | HSS 1330 | Pipelining, Ch. 6 (July 4th make up class) | - | -
9 | Wed. 7/20 | 6-8:50 PM | HSS 1330 | Hazards, Ch. 6 | - | -
10 | Mon. 7/25 | 6-8:50 PM | HSS 1330 | Memory Hierarchy & Caches, Ch. 7 | Hazards, Ch. 6 | #5
11 | Wed. 7/27 | 6-8:50 PM | HSS 1330 | Virtual Memory, Ch. 7; Course Review | Cache, Ch. 7 | #6
12 | Sat. 7/30 | TBD | TBD | Final Exam | - | -

Announcements
Reading Assignment
  Chapter 4. Assessing and Understanding Performance
  Sections 4.1-4.6
Homework 3: Due Mon., July 11th in class
  4.1, 4.2, 4.6, 4.7, 4.8, 4.9, 4.12, 4.19, 4.22, 4.45
Quiz
  When: Mon., July 11th, first 10 minutes of the class
  Topic: Performance, Chapter 4
  Need: Paper, pen
Discussion Section for 141: Thursday, 7:30-8:30, WLH-2205
Office Hour: Tue. 7:30-9:00 (Center 105); Wed. 5:00-6:00 PM (HSS 1330); By Appointment

Performance Assessment

Airplane | Passenger Capacity | Cruising Range (miles) | Cruising Speed (m.p.h.) | Passenger Throughput (passengers x m.p.h.)
Boeing 777 | 375 | 4630 | 610 | 228,750
Boeing 747 | 470 | 4150 | 610 | 286,700
Douglas DC-8-50 | 146 | 8720 | 544 | 79,424
Airbus A380 | 555 | 8000 | 645 | 357,975

Example definitions of performance:
  Passengers * MPH
  Cruising speed
  Passenger capacity
  (Passengers * MPH)/$ (need more data!)
Clearly formulate your definition of performance

Why worry about performance?
Learn to measure, report, and summarize
Make intelligent choices
See through the marketing hype
As a designer or purchaser, assess:
  Which system has the best performance?
  Which system has the lowest cost?
  Which system has the highest performance/cost?
Formulate metrics for performance measurement
How to report relative performance?
Understand the impact of architectural choices on performance

Computer Performance: TIME, TIME, TIME
Response Time (latency)
  How long does it take for my job to run?
  How long does it take to execute a job?
  How long must I wait for the database query?
Throughput
  How many jobs can the machine run at once?
  What is the average execution rate?
  How much work is getting done?
If we upgrade a machine with a new processor, what do we decrease?
If we add a new machine to the lab, what do we increase?

Execution Time
  % time program
  ... program results ...
  90.7u 12.9s 2:39 65%
  %
Elapsed Time (2:39)
  Counts everything (disk and memory accesses, I/O, etc.)
  A useful number, but often not good for comparison purposes
Total CPU time (103.6 s)
  Doesn't count I/O or time spent running other programs
  Made up of system time (12.9 s) and user time (90.7 s)
Our focus: user CPU time
  Time spent executing the lines of code that are "in" our program
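The figures in the sample `time` output are tied together by simple arithmetic. The following is a minimal Python sketch (the variable names are illustrative and not part of the lecture) that recomputes the total CPU time and checks that the reported 65% is CPU time as a share of elapsed time.

```python
# Values taken from the sample output: 90.7u 12.9s 2:39 65%
user_s = 90.7                  # user CPU time, seconds
system_s = 12.9                # system CPU time, seconds
elapsed_s = 2 * 60 + 39        # elapsed (wall-clock) time, 2:39 = 159 s

cpu_s = user_s + system_s      # total CPU time
cpu_share = cpu_s / elapsed_s  # fraction of elapsed time spent on the CPU

print(f"total CPU time = {cpu_s:.1f} s")
print(f"CPU share of elapsed time = {cpu_share:.0%}")
```

The remaining share of the elapsed time covers I/O waits and time spent running other programs.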

A Definition of Performance
For some program running on machine X:
  Performance_X = 1 / Execution time_X
Machine X is n times faster than Y:
  n = Performance_X / Performance_Y = (Execution time_Y) / (Execution time_X)
Problem: Execution time: Machine A: 12 seconds, B: 20 seconds
  A/B = 0.6, so A is 40% faster, or 1.4X faster, or B is 40% slower
  B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower
Need a precise definition
  X is n times faster than Y, n > 1
  A is 1.67 times faster than B
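The precise definition can be checked with a few lines of Python. This is a small sketch; the function name `times_faster` is chosen here for illustration.

```python
def times_faster(time_x: float, time_y: float) -> float:
    """Return n such that X is n times faster than Y: n = time_Y / time_X."""
    return time_y / time_x

# Machine A runs the program in 12 s, machine B in 20 s.
n = times_faster(12.0, 20.0)
print(f"A is {n:.2f} times faster than B")  # prints: A is 1.67 times faster than B
```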

Clock Cycles
Operation of a conventional computer is controlled by clock ticks
Clock rate (frequency) = cycles per second (Hertz)
  MHz = Million cycles per second
  GHz = Giga (Billion) cycles per second
Cycle time = time between ticks = seconds per cycle
  A 2 GHz clock => Cycle time = 1/(2 x 10^9) s = 0.5 nanoseconds
Instead of reporting execution time in seconds, we often use clock cycles:
  seconds/program = (cycles/program) x (seconds/cycle)
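The relation between clock rate, cycle time, and execution time can be sketched in Python as follows. The function names and the 10-billion-cycle program are hypothetical values chosen only to illustrate the formula.

```python
def cycle_time_s(clock_rate_hz: float) -> float:
    """Cycle time is the reciprocal of the clock rate."""
    return 1.0 / clock_rate_hz

def execution_time_s(cycles: float, clock_rate_hz: float) -> float:
    """seconds/program = (cycles/program) x (seconds/cycle)."""
    return cycles * cycle_time_s(clock_rate_hz)

two_ghz = 2e9
print(f"cycle time = {cycle_time_s(two_ghz) * 1e9:.1f} ns")

# Hypothetical program that needs 10 billion cycles on the 2 GHz machine.
print(f"execution time = {execution_time_s(10e9, two_ghz):.1f} s")
```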

How many cycles are required for a program?
Assumption: # of cycles = # of instructions
  [Figure: timeline with the 1st, 2nd, 3rd, 4th, 5th, 6th... instructions each occupying one cycle]
Assumption is incorrect
  Typically, different instructions take different numbers of clock cycles
  Integer multiply instructions are multi-cycle
  Floating point instructions are multi-cycle
  [Figure: timeline with the 1st through 5th instructions occupying differing numbers of cycles]

Now that we understand cycles
A given program will require
  some number of instructions (machine instructions)
  some number of cycles
  some number of seconds
We have a vocabulary that relates these quantities:
  cycle time (seconds per cycle)
  clock rate (cycles per second)
  CPI (cycles per instruction)

CPI Example 2.1
Suppose we have two implementations of the same instruction set architecture (ISA). For some program:
  Machine A has a clock cycle time of 10 ns and a CPI of 2.0
  Machine B has a clock cycle time of 20 ns and a CPI of 1.2
Which machine is faster for this program, and by how much?
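One way to work Example 2.1 is sketched below in Python. Because both machines implement the same ISA and run the same program, the instruction count is assumed to be identical and cancels out, so only the time per instruction (CPI x cycle time) needs comparing. The function name is illustrative.

```python
def time_per_instruction_ns(cpi: float, cycle_time_ns: float) -> float:
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time_ns

a_ns = time_per_instruction_ns(cpi=2.0, cycle_time_ns=10.0)  # machine A
b_ns = time_per_instruction_ns(cpi=1.2, cycle_time_ns=20.0)  # machine B

# Same instruction count on both machines, so the execution-time ratio
# equals the ratio of the per-instruction times.
ratio = b_ns / a_ns
print(f"A: {a_ns:.0f} ns/instruction, B: {b_ns:.0f} ns/instruction")
print(f"A is {ratio:.1f} times faster than B")
```

Under these assumptions A needs 20 ns per instruction and B needs 24 ns, so A is 1.2 times faster despite its higher CPI.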

Compiler Optimization: Example 2.2
A compiler designer is trying to decide between two code sequences for a particular machine with 3 instruction classes.
  CPI: Class A = 1, Class B = 2, and Class C = 3
  Code sequence X has 5 instructions: 2 of A, 1 of B, and 2 of C
  Code sequence Y has 6 instructions: 4 of A, 1 of B, and 1 of C
Which sequence will be faster? By how much?
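Example 2.2 reduces to counting cycles per sequence. The Python sketch below does that count; the dictionary layout and function name are illustrative choices, and both sequences are assumed to run at the same clock rate.

```python
# Cycles per instruction for each class, as given in the example.
CLASS_CPI = {"A": 1, "B": 2, "C": 3}

def total_cycles(counts: dict) -> int:
    """Sum over classes of (instruction count x class CPI)."""
    return sum(n * CLASS_CPI[cls] for cls, n in counts.items())

seq_x = {"A": 2, "B": 1, "C": 2}   # 5 instructions
seq_y = {"A": 4, "B": 1, "C": 1}   # 6 instructions

cycles_x = total_cycles(seq_x)     # 2*1 + 1*2 + 2*3 = 10
cycles_y = total_cycles(seq_y)     # 4*1 + 1*2 + 1*3 = 9

print(f"X: {cycles_x} cycles, CPI = {cycles_x / sum(seq_x.values()):.2f}")
print(f"Y: {cycles_y} cycles, CPI = {cycles_y / sum(seq_y.values()):.2f}")
print(f"Y is {cycles_x / cycles_y:.2f} times faster than X")
```

Sequence Y executes more instructions but fewer cycles, which is why instruction count alone is not a performance measure.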

Execution time: Contributing Factors
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

Factor | Instruction Count | CPI | Clock Rate
Program | X | |
Compiler | X | (X) |
ISA | X | X |
Organization | | X | X
Technology | | | X

Improve performance => reduce execution time
  Reduce instruction count (Programmer, Compiler)
  Reduce cycles per instruction (ISA, Machine designer)
  Reduce clock cycle time (Hardware designer, Process engineer)

Performance
Performance is determined by program execution time
Do any of the other variables equal performance?
  # of cycles to execute program?
  # of instructions in program?
  # of cycles per second?
  average # of cycles per instruction?
  average # of instructions per second?
Common pitfall: thinking one of the variables is indicative of performance when it really isn't.

Performance Variation
CPU Execution Time = Instruction Count x CPI x Clock Cycle Time

Scenario | Number of instructions | CPI | Clock Cycle Time
Same machine, different programs | different | similar | same
Same programs, different machines, same ISA | same | different | different
Same programs, different machines | somewhat different | different | different

Amdahl's Law
The impact of a performance improvement is limited by the percentage of execution time affected by the improvement
  Execution time after improvement = (Execution Time Affected / Amount of Improvement) + Execution Time Unaffected
Make the common case fast!
Amdahl's law sets a limit on how much improvement can be made

Amdahl's Law: Example 2.3
A program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time.
  How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?
  Is it possible to get the program to run 5 times faster?
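Example 2.3 can be worked by solving the Amdahl's Law equation for the improvement factor. The Python sketch below is one way to do that; the function name and its `None` return convention for unreachable targets are choices made here, not part of the lecture.

```python
def required_improvement(total_s, affected_s, target_speedup):
    """Solve total/target = affected/n + unaffected for n.

    Returns None when the target cannot be reached, i.e. when the
    unaffected time alone already uses up the whole time budget.
    """
    unaffected_s = total_s - affected_s
    budget_s = total_s / target_speedup - unaffected_s
    if budget_s <= 0:
        return None
    return affected_s / budget_s

print(required_improvement(100.0, 80.0, 4))  # 16.0: multiply must be 16x faster
print(required_improvement(100.0, 80.0, 5))  # None: 20 s of other work remains
```

For the 5x target the program would have to finish in 20 seconds, which is exactly the time spent outside multiplication, so no finite multiply speedup is enough.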

Metrics of Performance

Level | Metric
Application | Answers per month; Useful operations per second
Programming Language, Compiler |
ISA | (millions of) instructions per second: MIPS; (millions of) (F.P.) operations per second: MFLOP/s
Datapath, Control | Megabytes per second
Function Units; Transistors, Wires, Pins | Cycles per second (clock rate)

Each metric has a place and a purpose, and each can be misused

Benchmarks
Performance is best determined by running a real application
  Use programs typical of the expected workload
  Or, typical of the expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.
Small benchmarks
  nice for architects and designers
  easy to standardize
  can be abused
SPEC (System Performance Evaluation Cooperative)
  companies have agreed on a set of real programs and inputs
  can still be abused (Intel's other bug)
  valuable indicator of performance (and compiler technology)

SPEC 89: Compiler enhancements and performance
[Figure: bar chart of SPEC performance ratio (axis 0 to 800) per benchmark (gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv), comparing Compiler and Enhanced compiler]

SPEC 95*

Benchmark | Description
go | Artificial intelligence; plays the game of Go
m88ksim | Motorola 88k chip simulator; runs test program
gcc | The Gnu C compiler generating SPARC code
compress | Compresses and decompresses file in memory
li | Lisp interpreter
ijpeg | Graphic compression and decompression
perl | Manipulates strings and prime numbers in the special-purpose programming language Perl
vortex | A database program
tomcatv | A mesh generation program
swim | Shallow water model with 513 x 513 grid
su2cor | Quantum physics; Monte Carlo simulation
hydro2d | Astrophysics; hydrodynamic Navier-Stokes equations
mgrid | Multigrid solver in 3-D potential field
applu | Parabolic/elliptic partial differential equations
turb3d | Simulates isotropic, homogeneous turbulence in a cube
apsi | Solves problems regarding temperature, wind velocity, and distribution of pollutant
fpppp | Quantum chemistry
wave5 | Plasma physics; electromagnetic particle simulation

* SPEC CPU2000 consists of 12 integer and 14 floating point benchmarks

SPEC 95
Does doubling the clock rate double the performance?
Can a machine with a slower clock rate have better performance?
[Figure: two plots, SPECint and SPECfp (axis 0 to 10) versus Clock rate (MHz, 50 to 250), each comparing Pentium and Pentium Pro]

MIPS, MFLOPS etc.
MIPS - million instructions per second
  MIPS = (number of instructions executed in program) / (execution time in seconds * 10^6) = Clock rate / (CPI * 10^6)
MFLOPS - million floating point operations per second
  MFLOPS = (number of floating point operations executed in program) / (execution time in seconds * 10^6)
Hard to relate MIPS/MFLOPS to execution time
  CPU Execution Time = Instruction Count x CPI x Clock Cycle Time
Highly program-dependent metrics
Deceptive (see pages 78-79 in the text)
  MIPS would be higher for a program using simple instructions (e.g. a loop of NOPs!)
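The two forms of the MIPS formula agree because execution time = instruction count x CPI / clock rate. The Python sketch below checks this on made-up numbers: the 500 MHz clock, the CPI of 2.0, and the instruction count are hypothetical, and the function names are illustrative.

```python
def mips_from_counts(instruction_count: float, exec_time_s: float) -> float:
    """MIPS = instructions executed / (execution time in seconds * 10^6)."""
    return instruction_count / (exec_time_s * 1e6)

def mips_from_clock(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)

# Hypothetical machine and program, for illustration only.
clock_rate_hz = 500e6
cpi = 2.0
instruction_count = 1e9
exec_time_s = instruction_count * cpi / clock_rate_hz  # 4 seconds

print(mips_from_counts(instruction_count, exec_time_s))
print(mips_from_clock(clock_rate_hz, cpi))
```

Both calls give the same rating, and neither says how much useful work each instruction performs, which is the reason MIPS can mislead.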

RISC Processor: Example 2.4
Base Machine (Reg / Reg)

Op | Freq | Cycles | CPI(i) | % Time
ALU | 50% | 1 | |
Load | 20% | 5 | |
Store | 10% | 3 | |
Branch | 20% | 2 | |

What is the average CPI?
What percentage of time is spent in each instruction class?
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
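A Python sketch for working Example 2.4 follows. It assumes the instruction count and clock rate stay fixed, so speedup equals the ratio of average CPIs, and it models "two ALU instructions at once" as an effective ALU cost of 0.5 cycles; that modeling choice and all function names are assumptions made here.

```python
# Instruction mix from the example: class -> (frequency, cycles)
BASE_MIX = {"ALU": (0.50, 1), "Load": (0.20, 5),
            "Store": (0.10, 3), "Branch": (0.20, 2)}

def average_cpi(mix):
    """Weighted sum of cycles by instruction frequency."""
    return sum(freq * cycles for freq, cycles in mix.values())

def time_shares(mix):
    """Fraction of execution time spent in each instruction class."""
    cpi = average_cpi(mix)
    return {op: freq * cycles / cpi for op, (freq, cycles) in mix.items()}

def with_cycles(mix, op, cycles):
    """Copy of the mix with one class given a new cycle count."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed

def speedup(mix, changed):
    """Same instruction count and clock, so speedup is the CPI ratio."""
    return average_cpi(mix) / average_cpi(changed)

print(f"average CPI = {average_cpi(BASE_MIX):.2f}")
for op, share in time_shares(BASE_MIX).items():
    print(f"  {op}: {share:.1%} of time")
print(f"load in 2 cycles:   {speedup(BASE_MIX, with_cycles(BASE_MIX, 'Load', 2)):.3f}x")
print(f"branch in 1 cycle:  {speedup(BASE_MIX, with_cycles(BASE_MIX, 'Branch', 1)):.3f}x")
print(f"two ALU ops at once: {speedup(BASE_MIX, with_cycles(BASE_MIX, 'ALU', 0.5)):.3f}x")
```

Loads are only 20% of the instructions but account for the largest share of time, so improving them pays off most.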

Performance assessment is tricky
Performance is specific to particular program(s)
  Total execution time is a consistent summary of performance
For a given architecture, performance increases come from:
  increases in clock rate (without adverse CPI effects)
  improvements in processor organization that lower CPI
  compiler enhancements that lower CPI and/or instruction count
Pitfall: expecting an improvement in one aspect of a machine's performance to improve total performance by a proportional amount
You should not always believe everything you read! Read carefully!

Datapath and Control Design
The Five Classic Components of a Computer
  Processor
    Control
    Datapath
  Memory
  Input
  Output

Two Types of Logic Components
Combinational Elements
  Operate on data values
  Produce the same output if given the same inputs
State Elements
  Contain internal storage
  May produce different output if given the same inputs
  Clock is used to determine when a state element should be written
[Figure: a State Element with inputs A, B and clk, output C = f(a,b,state); a Combinational Logic block with inputs A, B, output C = f(a,b)]

Clock
Clock is a free running signal
  Fixed cycle time (period)
  Frequency = 1/(cycle time)
  Duty Cycle: (% high)/(% low), e.g. the 50/50 duty cycle shown below
  Jitter: uncertainty/wobble in the rising or falling edge
[Figure: clock waveform labeling the Rising Edge, Falling Edge, and Clock Cycle (Period)]

Storage Elements
D Latch
  Two inputs:
    the data value to be stored (D)
    the clock signal (C) indicating when to read and store D
  Two outputs:
    the value of the internal state (Q) and its complement
  [Figure: D latch symbol with inputs D and C and outputs Q and its complement]
Falling edge triggered D flip-flop
  Output changes only on the clock edge
  [Figure: D flip-flop built from two D latches, with inputs D and C and outputs Q and its complement]

Edge-triggered Clocking
Values stored in the machine are updated on a clock edge
  The clock edge can be either rising or falling
[Figure: State element 1 -> Combinational logic -> State element 2 within one clock cycle; and State element 1 -> Combinational logic -> State element 1 within one clock cycle]
By default a state element is written every clock edge
  An explicit write control signal is required otherwise
Edge triggered methodology allows, in the same clock cycle, to:
  read the contents of a register
  send the value through some combinational logic, and
  write the contents of the same or another register
Possible to have the same state element as input and output

CPU: Clocking
[Figure: timing diagram of Clk showing Setup and Hold windows around each clock edge, with Don't Care intervals in between]
All storage elements are clocked by the same clock edge

Register: A Storage Element
Similar to the D Flip Flop except
  N-bit input and output
  Write Enable input
Write Enable:
  0: Data Out will not change
  1: Data Out will become Data In (on the clock edge)
[Figure: register with N-bit Data In, N-bit Data Out, Write Enable, and Clk]

Register File
Register File consists of (32) registers:
  Two 32-bit output busses: busa and busb
  One 32-bit input bus: busw
Register is selected by:
  RA selects the register to put on busa
  RB selects the register to put on busb
  RW selects the register to be written via busw when Write Enable is 1
Clock input (CLK)
[Figure: block of 32 32-bit registers with 5-bit RW, RA, RB selects, 32-bit busw input, 32-bit busa and busb outputs, Write Enable, and Clk]
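The register file's behavior can be modeled in a few lines of Python. This is a behavioral sketch only: the class and method names are invented here, reads are treated as combinational, and writes happen when `clock_edge` is called. It does not model a hardwired zero register, since the slide does not mention one.

```python
class RegisterFile:
    """Behavioral model: 32 registers of 32 bits, two read ports, one write port."""

    def __init__(self, num_regs: int = 32, width: int = 32):
        self.regs = [0] * num_regs
        self.mask = (1 << width) - 1

    def read(self, ra: int, rb: int):
        """RA and RB select the registers driven onto busa and busb."""
        return self.regs[ra], self.regs[rb]

    def clock_edge(self, rw: int, busw: int, write_enable: bool) -> None:
        """On the clock edge, register RW takes busw only if Write Enable is 1."""
        if write_enable:
            self.regs[rw] = busw & self.mask


rf = RegisterFile()
rf.clock_edge(rw=8, busw=0x12345678, write_enable=True)
rf.clock_edge(rw=9, busw=0xFFFFFFFF, write_enable=False)  # ignored: enable is 0
busa, busb = rf.read(ra=8, rb=9)
print(hex(busa), hex(busb))  # prints: 0x12345678 0x0
```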

Memory
Memory
  One input bus: Data In
  One output bus: Data Out
Memory word is selected by:
  Address selects the word to put on Data Out
  Write Enable = 1: Address selects the memory word to be written via the Data In bus
Clock input (CLK)
  The CLK input is a factor ONLY during a write operation
  During a read operation, memory behaves as a combinational logic block:
    Address valid => Data Out valid after access time
[Figure: memory block with Address, 32-bit Data In, 32-bit DataOut, Write Enable, and Clk]

Basic 4 x 2 Static RAM
[Figure: 4 x 2 static RAM built from D latches. Inputs: Din[1], Din[0], Write enable, and Address; a 2-to-4 decoder selects one of rows 0 to 3, each row holding two D latches with Enable; outputs: Dout[1], Dout[0]]

Register Transfer Language (RTL)
Mechanism for describing the movement and manipulation of data between storage elements.
Example:
  R[rd] <- R[rs] + R[rt]
  PC <- PC + 4
  R[rt] <- Mem[R[rs] + immediate]

A Simple Implementation of MIPS CPU
Simplified to contain only:
  Memory-reference instructions: lw, sw
  Arithmetic-logical instructions: add, sub, and, or, slt
  Control flow instructions: beq, j
Execution Time = Instructions * CPI * Cycle Time
Processor design (datapath and control) will determine:
  Clock cycle time
  Clock cycles per instruction
We will design a single cycle processor:
  Advantage: one clock cycle per instruction
  Disadvantage: long cycle time

Arithmetic Instructions (R-Type)
ADD, SUB, AND, OR, SLT
Example: add rd, rs, rt

Field | op | rs | rt | rd | shamt | funct
Bits | 31-26 | 25-21 | 20-16 | 15-11 | 10-6 | 5-0
Width | 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits

e.g. add $t3, $s0, $s5 (rs = $s0, rt = $s5, rd = $t3)
  REG[$t3] = REG[$s0] + REG[$s5]
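Packing the R-type fields into a 32-bit word is a matter of shifts and ORs. The Python sketch below encodes the slide's `add $t3, $s0, $s5`; the register numbers ($t3 = 11, $s0 = 16, $s5 = 21) and the `add` funct code 0x20 come from the standard MIPS conventions rather than from this slide, and the function name is illustrative.

```python
def encode_r_type(rs: int, rt: int, rd: int, shamt: int, funct: int, op: int = 0) -> int:
    """Pack op(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6) into 32 bits."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# Standard MIPS register numbers and funct code (assumed, not given on the slide).
T3, S0, S5 = 11, 16, 21
FUNCT_ADD = 0x20

word = encode_r_type(rs=S0, rt=S5, rd=T3, shamt=0, funct=FUNCT_ADD)
print(f"0x{word:08X}")  # prints: 0x02155820
```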

Load/Store Instructions (I-Type)
LW, SW
Examples:
  lw rt, rs, imm16
  sw rt, rs, imm16

Field | op | rs | rt | immediate
Bits | 31-26 | 25-21 | 20-16 | 15-0
Width | 6 bits | 5 bits | 5 bits | 16 bits

e.g. lw $s3, -4($s2) (rs = $s2, rt = $s3, immediate = -4)
  REG[$s3] = D-MEM[ REG[$s2] - 4 ]
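The I-type example can be traced in Python: encode the word, pull the 16-bit immediate back out, sign-extend it, and form the effective address. The `lw` opcode 0x23 and the register numbers ($s2 = 18, $s3 = 19) are standard MIPS values assumed here; the base register contents 0x1000 are a made-up value for illustration.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def encode_i_type(op: int, rs: int, rt: int, imm16: int) -> int:
    """Pack op(6) | rs(5) | rt(5) | immediate(16) into 32 bits."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm16 & 0xFFFF)

LW_OPCODE = 0x23   # standard MIPS opcode for lw (assumed)
S2, S3 = 18, 19    # standard MIPS register numbers (assumed)

word = encode_i_type(LW_OPCODE, rs=S2, rt=S3, imm16=-4)
print(f"0x{word:08X}")  # prints: 0x8E53FFFC

reg_s2 = 0x1000    # hypothetical contents of $s2
address = (reg_s2 + sign_extend16(word & 0xFFFF)) & 0xFFFFFFFF
print(f"0x{address:X}")  # prints: 0xFFC
```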

Branch (I-Type)
Beq
Example: beq rs, rt, imm16

Field | op | rs | rt | displacement
Bits | 31-26 | 25-21 | 20-16 | 15-0
Width | 6 bits | 5 bits | 5 bits | 16 bits

e.g. at address 0x4c: beq $s1, $t3, -12 (rs = $s1, rt = $t3, displacement = -3)
  if( REG[$s1] == REG[$t3] ) {
    new_pc = old_pc + 4 - 12    # new_pc = 0x44
  } else {
    new_pc = old_pc + 4         # new_pc = 0x50
  }
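The branch example can be checked with a short Python sketch. The displacement field holds a word offset (-3), which is sign-extended and multiplied by 4 before being added to PC + 4. The function names are illustrative.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def next_pc(old_pc: int, displacement: int, regs_equal: bool) -> int:
    """beq: PC + 4 when not taken, PC + 4 + 4*displacement when taken."""
    fall_through = (old_pc + 4) & 0xFFFFFFFF
    if regs_equal:
        return (fall_through + sign_extend16(displacement) * 4) & 0xFFFFFFFF
    return fall_through

displacement_field = -3 & 0xFFFF  # 0xFFFD as stored in the instruction
print(hex(next_pc(0x4C, displacement_field, regs_equal=True)))   # prints: 0x44
print(hex(next_pc(0x4C, displacement_field, regs_equal=False)))  # prints: 0x50
```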

Jump (J-Type)
J
Example: j Label

Field | op | target address
Bits | 31-26 | 25-0
Width | 6 bits | 26 bits

e.g. target address field = 0x111 1111
  0x8000 0000: j Label
  Label: 0x8444 4444: ADD $s3, $s5, $t1
  new_pc = 0x8444 4444
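The jump target in this example follows from the usual MIPS rule, which the slide shows only by its result: the 26-bit target field is shifted left by 2 and combined with the upper 4 bits of PC + 4. A minimal Python sketch, with an illustrative function name:

```python
def jump_target(pc: int, target26: int) -> int:
    """new_pc = upper 4 bits of (PC + 4), concatenated with target26 and 00."""
    upper = (pc + 4) & 0xF0000000
    return upper | ((target26 & 0x03FFFFFF) << 2)

new_pc = jump_target(pc=0x80000000, target26=0x1111111)
print(f"0x{new_pc:08X}")  # prints: 0x84444444
```

Because only 26 bits of target are stored, a jump can reach any instruction within the same 256 MB region as the instruction that follows it.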

Single Cycle Implementation
Datapath and Control
  Instruction Fetch
  Instruction Decode
  Operand Fetch
  Execute
  Result Store
  Next PC
[Figure: one Clock Cycle containing I. Fetch, Decode, Op. Fetch, Execute, Store, and Next PC, marking the complete execution of a single instruction, followed by the next instruction]

Components Required to Implement the ISA
Next PC generation
  Add 4 or extended 16-bit immediate to PC
Memory
  Instruction read
  Data read/write
Registers (32 x 32-bit)
  Read register rs
  Read register rt
  Write register rt or rd
Sign extend immediate operand
ALU to operate on the operands

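Datapath
Abstract / Simplified View:
[Figure: PC supplies the Address to the Instruction memory; the Instruction supplies Register # inputs to the Registers; register Data feeds the ALU; the ALU supplies the Address to the Data memory, which exchanges Data with the Registers]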

CPU: Instruction Fetch
RTL version of the instruction fetch step:
  Fetch the instruction: mem[PC]
  Update the program counter:
    Sequential code: PC <- PC + 4
    Branch and Jump: PC <- something else
[Figure: PC (clocked by Clk) drives the Address of the Instruction Memory, which outputs the 32-bit Instruction Word; an Adder adds 4 to the PC]

CPU: Register-Register Operations (Add, Subtract etc.)
R[rd] <- R[rs] op R[rt]
Example: addu rd, rs, rt
  Ra, Rb, and Rw come from the instruction's rs, rt, and rd fields
  ALUctr and RegWr: control logic after decoding the instruction
Format: op (6 bits, 31-26) | rs (5 bits, 25-21) | rt (5 bits, 20-16) | rd (5 bits, 15-11) | shamt (5 bits, 10-6) | funct (6 bits, 5-0)
[Figure: register file of 32 32-bit registers with Rw = Rd, Ra = Rs, Rb = Rt, RegWr and Clk; busa and busb feed the ALU (controlled by ALUctr), whose 32-bit Result returns on busw]

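CPU: Load Operations
R[rt] <- Mem[R[rs] + SignExt[imm16]]
Example: lw rt, rs, imm16
Format: op (6 bits, 31-26) | rs (5 bits, 25-21) | rt (5 bits, 20-16) | immediate (16 bits, 15-0)
[Figure: load datapath. A RegDst mux selects Rd or Rt as Rw; the register file (RegWr, Clk) outputs busa and busb; imm16 passes through the Extender (ExtOp); an ALUSrc mux selects busb or the extended immediate as the second ALU input (ALUctr); the ALU output is the Adr of the Data Memory (Data In, WrEn = MemWr, Clk); a W_Src mux selects the ALU output or the memory output for busw]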

CPU: Store Operations
Mem[R[rs] + SignExt[imm16]] <- R[rt]
Example: sw rt, rs, imm16
Format: op (6 bits, 31-26) | rs (5 bits, 25-21) | rt (5 bits, 20-16) | immediate (16 bits, 15-0)
[Figure: store datapath. Same structure as the load datapath: RegDst mux, register file (RegWr, Clk), Extender (ExtOp), ALUSrc mux, ALU (ALUctr), Data Memory (Data In from busb, WrEn = MemWr, Adr from the ALU, Clk), and W_Src mux]

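CPU: Datapath for Branching
beq rs, rt, imm16
Datapath generates condition (equal)
Format: op (6 bits, 31-26) | rs (5 bits, 25-21) | rt (5 bits, 20-16) | immediate (16 bits, 15-0)
[Figure: branch datapath. One Adder computes PC + 4; imm16 passes through PC Ext (sign extend to 32 bits and left shift by 2) into a second Adder; the npc_sel mux, driven by Cond, selects the next PC; the PC (Clk) supplies the Instruction Address with low bits 00; the register file (Rs, Rt, RegWr, Clk) outputs busa and busb to the Equal? comparator]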

CPU: Binary arithmetic for PC
In theory, the PC is a 32-bit byte address into the instruction memory:
  Sequential operation: PC<31:0> = PC<31:0> + 4
  Branch operation: PC<31:0> = PC<31:0> + 4 + SignExt[Imm16] * 4
The magic number 4 always comes up because:
  The 32-bit PC is a byte address
  And all our instructions are 4 bytes (32 bits) long
In other words:
  The 2 LSBs of the 32-bit PC are always zeros
  There is no reason to have hardware to keep the 2 LSBs
In practice, we can simplify the hardware by using a 30-bit PC<31:2>:
  Sequential operation: PC<31:2> = PC<31:2> + 1
  Branch operation: PC<31:2> = PC<31:2> + 1 + SignExt[Imm16]
  In either case: Instruction Memory Address = PC<31:2> concat 00
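The equivalence of the 32-bit and 30-bit PC schemes can be checked numerically. The Python sketch below implements both update rules and compares the resulting instruction addresses; the function names are illustrative.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def next_pc_32bit(pc: int, imm16: int, branch_taken: bool) -> int:
    """PC<31:0> = PC + 4 (+ SignExt[Imm16] * 4 when the branch is taken)."""
    nxt = pc + 4
    if branch_taken:
        nxt += sign_extend16(imm16) * 4
    return nxt & 0xFFFFFFFF

def next_pc_30bit(pc_hi: int, imm16: int, branch_taken: bool) -> int:
    """PC<31:2> = PC<31:2> + 1 (+ SignExt[Imm16] when the branch is taken)."""
    nxt = pc_hi + 1
    if branch_taken:
        nxt += sign_extend16(imm16)
    return nxt & 0x3FFFFFFF

def instruction_address(pc_hi: int) -> int:
    """Instruction Memory Address = PC<31:2> concat 00."""
    return pc_hi << 2

# Compare both schemes for a few word-aligned PCs and immediates.
for pc in (0x0, 0x4C, 0x80000000, 0xFFFFFFFC):
    for imm in (0, 1, -3 & 0xFFFF, 0x7FFF, 0x8000):
        for taken in (False, True):
            wide = next_pc_32bit(pc, imm, taken)
            narrow = instruction_address(next_pc_30bit(pc >> 2, imm, taken))
            assert wide == narrow
print("30-bit and 32-bit PC updates agree")
```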

Single Cycle Implementation
Putting it all together
Supports Arithmetic, Load/Store and Branch instructions
[Figure: complete single-cycle datapath. PC -> Instruction memory (Read address, Instruction [31-0]); Instruction [25-21] and [20-16] select Read register 1 and 2; a RegDst mux selects Instruction [20-16] or [15-11] as Write register; Instruction [15-0] is sign extended from 16 to 32 bits; an ALUSrc mux selects Read data 2 or the extended immediate; ALU control takes Instruction [5-0] and ALUOp; the ALU produces Zero and ALU result; Data memory (Address, Write data, Read data, MemWrite, MemRead); a MemtoReg mux selects the register Write data; one adder computes PC + 4 and a second adds the immediate shifted left by 2; a PCSrc mux selects the next PC. Control signals: PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, MemWrite, MemRead, MemtoReg]