Lecture 3 MIPS ISA and Performance

Peng Liu, liupeng@zju.edu.cn

MIPS Instruction Encoding
MIPS instructions are encoded in different forms, depending on their arguments: R-format, I-format, J-format.
The MIPS architecture has three instruction formats, all 32 bits in length.
  Regularity is simpler and improves performance.
A 6-bit opcode appears at the beginning of each instruction.
  Control logic is based on decoding the instruction type.

Instruction Formats
I-format: used for instructions with immediates, for lw and sw (since the offset counts as an immediate), and for the branches beq and bne (but not for the shift instructions; more on this later).
J-format: used for j and jal.
R-format: used for all other instructions.
It will soon become clear why the instructions have been partitioned in this way.

I-Format

I-Format Example

I-Format Example: Load/Store

J-Format (1/2)
Define fields of the following number of bits each:
  6 bits | 26 bits
As usual, each field has a name:
  opcode | target address
Key concepts:
  Keep the opcode field identical to R-format and I-format for consistency.
  Combine all other fields to make room for a large target address.

J-Format (2/2)
Summary: New PC = { PC[31..28], target address, 00 }
Understand where each part came from!
Note: in Verilog, { , , } means concatenation.
  { 4 bits, 26 bits, 2 bits } = 32-bit address
  { 1010, 11111111111111111111111111, 00 } = 10101111111111111111111111111100
We use Verilog in this class.
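As a rough illustration of the concatenation above, the following Python sketch rebuilds the slide's example address. The function name `jump_target` and the sample PC value are assumptions made for this example; they are not part of the lecture material.

```python
def jump_target(pc: int, target26: int) -> int:
    """Pseudo-direct jump address: { PC[31:28], 26-bit target, 2'b00 }."""
    upper = pc & 0xF0000000                  # keep the top 4 bits of the PC
    middle = (target26 & 0x03FFFFFF) << 2    # 26-bit field, shifted left by 2
    return upper | middle

# Slide example: upper bits 1010, a target field of 26 ones, then 00.
new_pc = jump_target(0xA0000000, 0x03FFFFFF)
print(f"{new_pc:032b}")   # 10101111111111111111111111111100
```

The shift by two reflects that instructions are word aligned, so the two low address bits are always zero and need not be stored in the instruction.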

R-Format (1/2)
Define fields of the following number of bits each: 6 + 5 + 5 + 5 + 5 + 6 = 32
  6 | 5 | 5 | 5 | 5 | 6
For simplicity, each field has a name:
  opcode | rs | rt | rd | shamt | funct

R-Format (2/2)
More fields:
  rs (Source Register): generally used to specify the register containing the first operand
  rt (Target Register): generally used to specify the register containing the second operand (note that the name is misleading)
  rd (Destination Register): generally used to specify the register that will receive the result of the computation

R-Format Example
MIPS instruction: add $8,$9,$10
Decimal number per field: 0 9 10 8 0 32
Binary number per field: 000000 01001 01010 01000 00000 100000
Hex representation: 012A 4020 hex
Decimal representation: 19,546,144 ten
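The field packing in this example can be checked with a few shifts. The sketch below is only an illustration; the helper name `encode_r` is an assumption introduced here.

```python
def encode_r(opcode: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack the six R-format fields (6+5+5+5+5+6 bits) into one 32-bit word."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $8,$9,$10  ->  rd = 8, rs = 9, rt = 10, funct = 32
word = encode_r(opcode=0, rs=9, rt=10, rd=8, shamt=0, funct=32)
print(f"{word:08X}", word)   # 012A4020 19546144
```

Note that the assembly operand order (rd, rs, rt) differs from the field order in the encoded word (rs, rt, rd).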

MIPS-32 Operation Overview
Arithmetic/Logical:
  Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU
  AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
  SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory Access:
  LB, LBU, LH, LHU, LW, LWL, LWR
  SB, SH, SW, SWL, SWR

MIPS Logical Instructions
Instruction | Example | Meaning | Comment
and | and $1,$2,$3 | $1 = $2 & $3 | 3 reg. operands; logical AND
or | or $1,$2,$3 | $1 = $2 | $3 | 3 reg. operands; logical OR
xor | xor $1,$2,$3 | $1 = $2 ^ $3 | 3 reg. operands; logical XOR
nor | nor $1,$2,$3 | $1 = ~($2 | $3) | 3 reg. operands; logical NOR
and immediate | andi $1,$2,10 | $1 = $2 & 10 | logical AND reg, constant
or immediate | ori $1,$2,10 | $1 = $2 | 10 | logical OR reg, constant
xor immediate | xori $1,$2,10 | $1 = $2 ^ 10 | logical XOR reg, constant
shift left logical | sll $1,$2,10 | $1 = $2 << 10 | shift left by constant
shift right logical | srl $1,$2,10 | $1 = $2 >> 10 | shift right by constant
shift right arithm. | sra $1,$2,10 | $1 = $2 >> 10 | shift right (sign extend)
shift left logical | sllv $1,$2,$3 | $1 = $2 << $3 | shift left by variable
shift right logical | srlv $1,$2,$3 | $1 = $2 >> $3 | shift right by variable
shift right arithm. | srav $1,$2,$3 | $1 = $2 >> $3 | shift right arith. by variable
Q: Can some of these multiply by 2^i? Divide by 2^i? Invert?

MIPS Reference Data: CORE INSTRUCTION SET (1)
NAME | MNEMONIC | FORMAT | OPERATION (in Verilog) | OPCODE/FUNCT (hex)
Add | add | R | R[rd] = R[rs] + R[rt] (1) | 0 / 20 hex
Add Immediate | addi | I | R[rt] = R[rs] + SignExtImm (1)(2) | 8 hex
Branch On Equal | beq | I | if (R[rs] == R[rt]) PC = PC + 4 + BranchAddr (4) | 4 hex
(1) May cause overflow exception
(2) SignExtImm = { 16{immediate[15]}, immediate }
(3) ZeroExtImm = { 16{1'b0}, immediate }
(4) BranchAddr = { 14{immediate[15]}, immediate, 2'b0 }
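As a minimal sketch, footnotes (2) to (4) can be mirrored in Python to see what the Verilog replication and concatenation produce. The function names are illustrative assumptions, not part of the reference card.

```python
MASK32 = 0xFFFFFFFF

def sign_ext_imm(imm16: int) -> int:
    """{ 16{immediate[15]}, immediate }: copy bit 15 into the upper 16 bits."""
    imm16 &= 0xFFFF
    return imm16 | 0xFFFF0000 if imm16 & 0x8000 else imm16

def zero_ext_imm(imm16: int) -> int:
    """{ 16{1'b0}, immediate }: upper 16 bits are zero."""
    return imm16 & 0xFFFF

def branch_addr(imm16: int) -> int:
    """{ 14{immediate[15]}, immediate, 2'b0 }: sign-extended word offset times 4."""
    return (sign_ext_imm(imm16) << 2) & MASK32

def beq_next_pc(pc: int, rs_val: int, rt_val: int, imm16: int) -> int:
    """Next PC for beq, ignoring the delay slot: PC + 4 (+ BranchAddr if taken)."""
    next_pc = pc + 4
    if rs_val == rt_val:
        next_pc += branch_addr(imm16)
    return next_pc & MASK32

print(hex(sign_ext_imm(0xFFFC)))               # 0xfffffffc
print(hex(beq_next_pc(0x1000, 7, 7, 0xFFFF)))  # 0x1000 (branch back by one word)
```

The 32-bit mask stands in for the fixed register width that Verilog gives for free; Python integers are unbounded, so the wrap-around has to be made explicit.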

MIPS Data Transfer Instructions
Instruction | Comment
sw $3, 500($4) | Store word
sh $3, 502($2) | Store halfword
sb $2, 41($3) | Store byte
lw $1, 30($2) | Load word
lh $1, 40($3) | Load halfword
lhu $1, 40($3) | Load halfword unsigned
lb $1, 40($3) | Load byte
lbu $1, 40($3) | Load byte unsigned
lui $1, 40 | Load Upper Immediate (16 bits shifted left by 16)
Q: Why do we need lui?
  LUI R5: the upper 16 bits of R5 receive the immediate; the lower 16 bits become 0000 ... 0000.
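One common answer to the lui question is that a 16-bit immediate cannot hold a full 32-bit constant, so the constant is built in two steps. The sketch below models that idea in Python; the helper names and the sample constant 0x12345678 are assumptions chosen for illustration.

```python
def lui(imm16: int) -> int:
    """Load upper immediate: immediate in bits 31..16, zeros in bits 15..0."""
    return (imm16 & 0xFFFF) << 16

def ori(reg: int, imm16: int) -> int:
    """OR with a zero-extended 16-bit immediate."""
    return reg | (imm16 & 0xFFFF)

value = 0x12345678
r5 = lui(value >> 16)          # r5 = 0x12340000
r5 = ori(r5, value & 0xFFFF)   # r5 = 0x12345678
print(hex(r5))
```

Using ori rather than addi for the lower half matters because ori zero-extends its immediate, so a lower half with bit 15 set does not disturb the upper half.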

Multiply / Divide
Start multiply, divide:
  MULT rs, rt
  MULTU rs, rt
  DIV rs, rt
  DIVU rs, rt
Move result from multiply, divide:
  MFHI rd
  MFLO rd
Move to HI or LO:
  MTHI rs
  MTLO rs
Registers: HI, LO

MIPS Arithmetic Instructions
Instruction | Example | Meaning | Comments
add | add $1,$2,$3 | $1 = $2 + $3 | 3 operands; exception possible
subtract | sub $1,$2,$3 | $1 = $2 - $3 | 3 operands; exception possible
add immediate | addi $1,$2,100 | $1 = $2 + 100 | + constant; exception possible
add unsigned | addu $1,$2,$3 | $1 = $2 + $3 | 3 operands; no exceptions
subtract unsigned | subu $1,$2,$3 | $1 = $2 - $3 | 3 operands; no exceptions
add imm. unsign. | addiu $1,$2,100 | $1 = $2 + 100 | + constant; no exceptions
multiply | mult $2,$3 | Hi, Lo = $2 x $3 | 64-bit signed product
multiply unsigned | multu $2,$3 | Hi, Lo = $2 x $3 | 64-bit unsigned product
divide | div $2,$3 | Lo = $2 / $3, Hi = $2 mod $3 | Lo = quotient, Hi = remainder
divide unsigned | divu $2,$3 | Lo = $2 / $3, Hi = $2 mod $3 | unsigned quotient & remainder
move from Hi | mfhi $1 | $1 = Hi | used to get a copy of Hi
move from Lo | mflo $1 | $1 = Lo | used to get a copy of Lo
Q: Which add for address arithmetic? Which add for integers?
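To make the Hi/Lo split concrete, here is a small sketch of the unsigned variants only. The function names are assumptions for illustration; the signed variants are left out on purpose because Python's floor division rounds differently from a truncating hardware divide.

```python
MASK32 = 0xFFFFFFFF

def multu(a: int, b: int) -> tuple[int, int]:
    """Unsigned multiply: return (Hi, Lo) halves of the 64-bit product."""
    product = (a & MASK32) * (b & MASK32)
    return (product >> 32) & MASK32, product & MASK32

def divu(a: int, b: int) -> tuple[int, int]:
    """Unsigned divide: return (Hi, Lo) = (remainder, quotient)."""
    a &= MASK32
    b &= MASK32
    return a % b, a // b

hi, lo = multu(0xFFFFFFFF, 2)
print(hex(hi), hex(lo))   # 0x1 0xfffffffe
print(divu(7, 2))         # (1, 3)
```

A program would then read the halves with mfhi and mflo, as listed in the table above.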

Green Card: ARITHMETIC CORE INSTRUCTION SET (2)
NAME | MNEMONIC | FORMAT | OPERATION (in Verilog) | OPCODE/FMT/FT/FUNCT (hex)
Branch On FP True | bc1t | FI | if (FPcond) PC = PC + 4 + BranchAddr (4) | 11/8/1/--
Load FP Single | lwc1 | I | F[rt] = M[R[rs] + SignExtImm] (2) | 31/--/--/--
Divide | div | R | Lo = R[rs]/R[rt]; Hi = R[rs]%R[rt] | 0/--/--/1a

When Does MIPS Sign Extend?
When a value is sign extended, the upper bit is copied to fill the full value. Examples of sign extending 8 bits to 16 bits:
  00001010 -> 00000000 00001010
  10001100 -> 11111111 10001100
When is an immediate operand sign extended?
  Arithmetic instructions (add, sub, etc.) always sign extend immediates, even for the unsigned versions of the instructions!
  Logical instructions do not sign extend immediates (they are zero extended).
  Load/store address computations always sign extend immediates.
Multiply/divide have no immediate operands; however, the "unsigned" versions treat their operands as unsigned.
The data loaded by lb and lh is extended as follows ("unsigned" versions don't sign extend):
  lbu, lhu are zero extended
  lb, lh are sign extended
Q: Then what does add unsigned (addu) mean, since it has no immediate?
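The 8-bit to 16-bit examples, and the lb versus lbu distinction, can be reproduced with a generic helper. This is a sketch only; `sign_extend`, `load_byte` and `load_byte_unsigned` are hypothetical names that model the register value after a byte load.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Copy the top bit of a from_bits-wide value into the added upper bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        upper_mask = ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
        value |= upper_mask
    return value

def load_byte(byte: int) -> int:
    """Model of lb: the byte is sign extended to 32 bits."""
    return sign_extend(byte, 8, 32)

def load_byte_unsigned(byte: int) -> int:
    """Model of lbu: the byte is zero extended to 32 bits."""
    return byte & 0xFF

print(f"{sign_extend(0b00001010, 8, 16):016b}")   # 0000000000001010
print(f"{sign_extend(0b10001100, 8, 16):016b}")   # 1111111110001100
print(hex(load_byte(0x8C)), hex(load_byte_unsigned(0x8C)))   # 0xffffff8c 0x8c
```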

MIPS Compare and Branch
Compare and branch:
  BEQ rs, rt, offset | if R[rs] == R[rt] then PC-relative branch
  BNE rs, rt, offset | <>
Compare to zero and branch:
  BLEZ rs, offset | if R[rs] <= 0 then PC-relative branch
  BGTZ rs, offset | >
  BLTZ rs, offset | <
  BGEZ rs, offset | >=
  BLTZAL rs, offset | if R[rs] < 0 then branch and link (into R31)
  BGEZAL rs, offset | >=
The remaining compare-and-branch operations take two instructions.
Almost all comparisons are against zero!

MIPS Jump, Branch, Compare Instructions
Instruction | Example | Meaning | Comment
branch on equal | beq $1,$2,100 | if ($1 == $2) go to PC+4+100 | equal test; PC-relative branch
branch on not eq. | bne $1,$2,100 | if ($1 != $2) go to PC+4+100 | not-equal test; PC-relative
set on less than | slt $1,$2,$3 | if ($2 < $3) $1 = 1; else $1 = 0 | compare less than; 2's comp.
set less than imm. | slti $1,$2,100 | if ($2 < 100) $1 = 1; else $1 = 0 | compare < constant; 2's comp.
set less than uns. | sltu $1,$2,$3 | if ($2 < $3) $1 = 1; else $1 = 0 | compare less than; natural numbers
set l. t. imm. uns. | sltiu $1,$2,100 | if ($2 < 100) $1 = 1; else $1 = 0 | compare < constant; natural numbers
jump | j 10000 | go to 10000 | jump to target address
jump register | jr $31 | go to $31 | for switch, procedure return
jump and link | jal 10000 | $31 = PC + 4; go to 10000 | for procedure call

MIPS Assembler Register Convention
Name | Number | Usage | Preserved across a call?
$zero | 0 | the value 0 | n/a
$v0-$v1 | 2-3 | return values | no
$a0-$a3 | 4-7 | arguments | no
$t0-$t7 | 8-15 | temporaries | no
$s0-$s7 | 16-23 | saved | yes
$t8-$t9 | 24-25 | temporaries | no
$sp | 29 | stack pointer | yes
$ra | 31 | return address | yes
Registers marked "no" are caller saved; registers marked "yes" are callee saved.

Green Card: OPCODES, BASE CONVERSION, ASCII (3)
MIPS opcode (31:26) | (1) MIPS funct (5:0) | (2) MIPS funct (5:0) | Binary | Decimal | Hexadecimal | ASCII
(1) | sll | add.f | 00 0000 | 0 | 0 | NUL
j | srl | mul.f | 00 0010 | 2 | 2 | STX
lui | sync | floor.w.f | 00 1111 | 15 | f | SI
lbu | and | cvt.w.f | 10 0100 | 36 | 24 | $
(1) opcode(31:26) == 0
(2) opcode(31:26) == 17 ten (11 hex); if fmt(25:21) == 16 ten (10 hex), f = s (single); if fmt(25:21) == 17 ten (11 hex), f = d (double)
Note: 3-in-1: opcodes, base conversion, ASCII!

Branch & Pipelines
      li $3, #7
      sub $4, $4, 1
      bz $4, LL           (branch)
      addi $5, $3, 1      (delay slot)
  LL: slt $1, $3, $5      (branch target)
[Figure: timeline in which each instruction's ifetch overlaps the execute of the instruction before it.]
By the end of the branch instruction, the CPU knows whether or not the branch will take place. However, it will have fetched the next instruction by then, regardless of whether or not the branch will be taken. Why not execute it?

Delayed Branches
      li $3, #7
      sub $4, $4, 1
      bz $4, LL
      addi $5, $3, 1      (delay slot instruction)
      subi $6, $6, 2
  LL: slt $1, $3, $5
In the "Raw" MIPS, the instruction after the branch is executed even when the branch is taken.
  This is hidden by the assembler for the MIPS "virtual machine".
  It allows the compiler to better utilize the instruction pipeline (???).
Jump and link (jal instruction):
  Put the return address into the link register ($31):
    PC+4 in the logical architecture
    PC+8 in the physical ("Raw") architecture, because the delay slot is executed
  Then jump to the destination address.

Filling Delayed Branches
[Figure: pipeline stages Inst Fetch, Dcd & Op Fetch, Execute for the branch and for its successor. The successor is executed even if the branch is taken; then the branch target or the fall-through path continues.]
A single delay slot impacts the critical path.
The compiler can fill a single delay slot with a useful instruction about 50% of the time:
  try to move an instruction down from above the jump
  move an instruction up from the target, if safe
      add $3, $1, $2
      sub $4, $4, 1
      bz $4, LL
      NOP
      ...
  LL: add rd, ...
Is this violating the ISA abstraction?

Summary: Salient Features of MIPS I
32-bit fixed-format instructions (3 formats)
32 32-bit GPRs (R0 contains zero) and 32 FP registers (plus HI and LO)
  partitioned by software convention
3-address, reg-reg arithmetic instructions
Single addressing mode for load/store: base+displacement
  no indirection or scaled addressing
16-bit immediate plus LUI
Simple branch conditions
  compare against zero, or compare two registers for =, !=
  no integer condition codes
Delayed branch
  execute the instruction after a branch (or jump) even if the branch is taken (the compiler can fill a delayed branch with useful work about 50% of the time)

And in Conclusion...
Continued rapid improvement in computing:
  2X every 1.5 years in processor speed; every 2.0 years in memory size; every 1.0 year in disk capacity
  Moore's Law enables processor and memory growth (2X transistors/chip every ~1.5 to 2.0 years)
5 classic components of all computers:
  Control and Datapath (together, the Processor), Memory, Input, Output

MIPS Machine Instruction Review: Instruction Format Summary

Addressing Modes Summary
Register addressing (e.g. ALU)
  Operand is a register
Base/displacement addressing (e.g. load/store)
  Operand is at the memory location whose address is the sum of a base register and a constant
Immediate addressing (e.g. constants)
  Operand is a constant within the instruction itself
PC-relative addressing (e.g. branch)
  Address is the sum of the PC and a constant in the instruction
Pseudo-direct addressing (e.g. jump)
  Target address is the concatenation of a field in the instruction and the PC

Addressing Modes Summary

Computer Performance Metrics
Response time (latency)
  How long does it take for my job to run?
  How long does it take to execute a job?
  How long must I wait for the database query?
Throughput
  How many jobs can the machine run at once?
  What is the average execution rate?
  How many queries per minute?
If we upgrade a machine with a new processor, what do we increase?
If we add a new machine to the lab, what do we increase?

Performance = Execution Time
Elapsed time
  Counts everything (disk and memory accesses, I/O, etc.)
  A useful number, but often not good for comparison purposes
    E.g., OS and multiprogramming time make it difficult to compare CPUs
CPU time (CPU = Central Processing Unit = processor)
  Doesn't count I/O or time spent running other programs
  Can be broken up into system time and user time
Our focus: user CPU time
  Time spent executing the lines of code that are in our program
  Includes arithmetic, memory, and control instructions

Clock Cycles
Instead of reporting execution time in seconds, we often use cycles.
Clock ticks indicate when to start activities.
Cycle time = time between ticks = seconds per cycle
Clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec)
A 2 GHz clock has a cycle time of 500 picoseconds (ps).

Performance and Execution Time
The program should be something real people care about:
  Desktop: MS Office, edit, compile
  Server: web, e-commerce, database
  Scientific: physics, weather forecasting

Measuring Clock Cycles
Clock cycles/program is not an intuitive or easily determined value, so
  Clock Cycles = Instructions x Clock Cycles Per Instruction
Cycles Per Instruction (CPI) is used often.
CPI is an average, since the number of cycles per instruction varies from instruction to instruction.
  The average depends on the instruction mix, the latency of each instruction type, etc.
CPI can be used to compare two implementations of the same ISA, but it is not useful alone for comparing different ISAs.
  An x86 add is different from a MIPS add.

Using CPI
Drawing on the previous equation:
  ExecutionTime = Instructions x CPI x ClockCycleTime
  ExecutionTime = Instructions x CPI / ClockRate
To improve performance (i.e. reduce execution time):
  Increase the clock rate (decrease clock cycle time), OR
  Decrease CPI, OR
  Reduce the number of instructions
Designers balance cycle time against the number of cycles required.
  Improving one factor may make the other one worse.
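The execution-time equation is easy to experiment with in code. The sketch below uses made-up workload numbers (one billion instructions, CPI of 1.5, 2 GHz clock) purely as an assumption for illustration.

```python
def execution_time(instructions: float, cpi: float, clock_rate_hz: float) -> float:
    """ExecutionTime = Instructions x CPI / ClockRate, in seconds."""
    return instructions * cpi / clock_rate_hz

base = execution_time(1e9, 1.5, 2e9)          # 0.75 s
faster_clock = execution_time(1e9, 1.5, 3e9)  # higher clock rate, same CPI
lower_cpi = execution_time(1e9, 1.0, 2e9)     # same clock rate, lower CPI
print(base, faster_clock, lower_cpi)
```

In practice the three factors are not independent, which is the point of the last bullet above: a design change that raises the clock rate may also raise the CPI.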

Clock Rate vs. Performance
Mobile Intel Pentium 4 vs. Intel Pentium M
  2.4 GHz vs. 1.6 GHz: is the P4 50% faster?
Performance on MobileMark with the same memory and disk
  Word, Excel, Photoshop, PowerPoint, etc.
  The Mobile Pentium 4 is only 15% faster.
What is the relative CPI?
  ExecTime = IC x CPI / ClockRate
  IC x CPI_M / 1.6 = 1.15 x IC x CPI_4 / 2.4
  CPI_4 / CPI_M = 2.4 / (1.15 x 1.6) = 1.3
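The arithmetic in this comparison can be checked directly. The sketch below assumes, as the slide does, that both processors execute the same instruction count; the function name `cpi_ratio` is an assumption for this example.

```python
def cpi_ratio(clock_fast_ghz: float, clock_slow_ghz: float, measured_speedup: float) -> float:
    """CPI of the higher-clocked chip divided by CPI of the lower-clocked chip.

    From T = IC * CPI / f with equal IC:
    T_slow / T_fast = measured_speedup  =>  CPI_fast / CPI_slow = (f_fast / f_slow) / speedup
    """
    return (clock_fast_ghz / clock_slow_ghz) / measured_speedup

ratio = cpi_ratio(2.4, 1.6, 1.15)
print(round(ratio, 2))   # about 1.3: the Pentium 4 needs roughly 30% more cycles per instruction
```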

CPI Calculation
Different instruction types require different numbers of cycles.
CPI is often reported for types of instructions:
  ClockCycles = \sum_{i=1}^{n} (CPI_i \times IC_i)
where CPI_i is the CPI for instruction type i and IC_i is the count of instructions of that type.

CPI Calculation
To compute the overall average CPI, use
  CPI = \frac{ClockCycles}{InstructionCount} = \sum_{i=1}^{n} \left( CPI_i \times \frac{IC_i}{InstructionCount} \right)
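A short sketch of the weighted-average computation follows. The instruction mix below is entirely hypothetical and chosen only to keep the numbers round; it is not the machine used in the lecture's example.

```python
def total_cycles(mix: dict[str, tuple[float, int]]) -> float:
    """Sum of CPI_i * IC_i over all instruction types."""
    return sum(cpi * count for cpi, count in mix.values())

def average_cpi(mix: dict[str, tuple[float, int]]) -> float:
    """Overall CPI = ClockCycles / InstructionCount."""
    instruction_count = sum(count for _, count in mix.values())
    return total_cycles(mix) / instruction_count

# Hypothetical mix: type -> (CPI for that type, number of instructions)
mix = {"alu": (1, 500), "load": (2, 300), "branch": (3, 200)}
print(total_cycles(mix))   # 1700
print(average_cpi(mix))    # 1.7
```

Dividing each count by the total gives the per-type frequency, so the same result comes from summing CPI_i times frequency_i.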

Computing CPI Example
[Table: instruction mix with per-type CPI and frequency.]
Given this machine, the CPI is the sum of CPI x Frequency.
Average CPI is 0.5 + 0.4 + 0.4 + 0.2 = 1.5
What fraction of the time is spent on data transfer?

What is the Impact of Displacement-Based Memory Addressing Mode?
Assume 50% of MIPS loads and stores have a zero displacement.

Speedup
Speedup allows us to compare different CPUs or optimizations:
  Speedup = \frac{CPUtimeOld}{CPUtimeNew}
Example
  The original CPU takes 2 sec to run a program.
  The new CPU takes 1.5 sec to run the same program.
  Speedup = 2 / 1.5 = 1.333, i.e. a speedup of 33%

Amdahl's Law
If an optimization improves a fraction f of execution time by a factor of a:
  Speedup = \frac{T_{old}}{[(1 - f) + f/a] \times T_{old}} = \frac{1}{(1 - f) + f/a}
This formula is known as Amdahl's Law.
Lessons from it:
  If f -> 100%, then speedup = a
  If a -> infinity, then speedup = 1/(1 - f)
Summary:
  Make the common case fast.
  Watch out for the non-optimized component.
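The two limiting cases are easy to see numerically. The following is a minimal sketch; the function name and the sample values of f and a are assumptions chosen for illustration.

```python
def amdahl_speedup(f: float, a: float) -> float:
    """Overall speedup when a fraction f of execution time is sped up by a factor a."""
    return 1.0 / ((1.0 - f) + f / a)

print(amdahl_speedup(0.5, 2))     # half the time made 2x faster
print(amdahl_speedup(0.5, 1e9))   # approaches 1 / (1 - f) = 2
print(amdahl_speedup(1.0, 4))     # f = 100% gives speedup = a
```

Even an enormous improvement factor cannot push the overall speedup past 1/(1 - f), which is why the non-optimized component deserves attention.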

Evaluating Performance
Performance is best determined by running a real application.
  Use programs typical of the expected workload
  e.g. compilers/editors, scientific applications, graphics, etc.
Microbenchmarks
  Small programs, synthetic or kernels from larger applications
  Nice for architects and designers
  Can be misleading
Benchmarks
  Collection of real programs that companies have agreed on
  Components: programs, inputs & outputs, measurement rules, metrics
  Can still be abused

The SPEC CPU Benchmark Suite (System Performance Evaluation Cooperative)

Benchmarks
System Performance Evaluation Cooperative (SPEC)
Scientific computing: Linpack, SpecOMP, SpecHPC, ...
Embedded benchmarks: EEMBC, Dhrystone, ...
Enterprise computing:
  TPC-C, TPC-W, TPC-H
  SpecJbb, SpecSFS, SpecMail, Streams, ...
Multiprocessor: PARSEC, SPLASH-2, EEMBC (multicore)
Other: 3Dmark, ScienceMark, Winstone, iBench, AquaMark, ...
Watch out: your results will only be as good as your benchmarks.
  Make sure you know what the benchmark is designed to measure.
  Performance is not the only metric for computing systems: cost, power consumption, reliability, real-time performance, ...

Summarizing Performance
Combining results from multiple programs into 1 benchmark score
  Sometimes misleading, always controversial, and inevitable
  We all like quoting a single number
3 types of means:
  Arithmetic, for times: AM = \frac{1}{n} \sum_{i=1}^{n} Weight_i \times Time_i
  Harmonic, for rates: HM = \frac{n}{\sum_{i=1}^{n} Weight_i / Rate_i}
  Geometric, for ratios: GM = \left( \prod_{i=1}^{n} Ratio_i \right)^{1/n}

Using the Means
Arithmetic mean:
  When you have individual performance scores in latency
Harmonic mean:
  When you have individual performance scores in throughput
Geometric mean:
  Nice property: GM(X/Y) = GM(X)/GM(Y)
  But difficult to relate back to execution times
Note: always look at the results for individual programs.
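The three means can be sketched in a few lines. This version assumes equal weights (Weight_i = 1) for simplicity, and the sample scores are made up for illustration; the function names are likewise assumptions.

```python
import math

def arithmetic_mean(times: list[float]) -> float:
    """For latencies (times), with equal weights."""
    return sum(times) / len(times)

def harmonic_mean(rates: list[float]) -> float:
    """For throughputs (rates), with equal weights."""
    return len(rates) / sum(1.0 / r for r in rates)

def geometric_mean(ratios: list[float]) -> float:
    """For normalized ratios."""
    return math.prod(ratios) ** (1.0 / len(ratios))

x = [2.0, 8.0]
y = [1.0, 2.0]
ratios = [xi / yi for xi, yi in zip(x, y)]

print(arithmetic_mean(x), harmonic_mean(x), geometric_mean(x))
# The "nice property": GM(X/Y) equals GM(X) / GM(Y)
print(math.isclose(geometric_mean(ratios), geometric_mean(x) / geometric_mean(y)))
```

The geometric-mean property means the ranking of two machines does not depend on which reference machine the ratios are normalized to, although the resulting number no longer corresponds to any total execution time.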

Performance Summary
Performance is specific to a particular program.
  Total execution time is a consistent summary of performance.
For a given architecture, performance increases come from:
  Increases in clock rate (without adverse CPI effects)
  Improvements in processor organization that lower CPI
  Compiler enhancements that lower CPI and/or instruction count
  Algorithm/language choices that affect instruction count
Pitfall: expecting an improvement in one aspect of a machine's performance to affect the total performance.

Translation Hierarchy
High-level -> Assembly -> Machine

Compiler
Converts High-level Language to Machine Code

Optimizing Compilers
Provide efficient mapping of program to machine
  Code selection and ordering
  Eliminating minor inefficiencies
  Register allocation
Don't (usually) improve asymptotic efficiency
  Up to the programmer to select the best overall algorithm
  Big-O savings are (often) more important than constant factors
  But constant factors also matter
Optimization types
  Local: inside basic blocks
  Global: across basic blocks, e.g. loops

Limitations of Optimizing Compilers
Operate under fundamental constraints
  Improved code has the same output
  Improved code has the same side effects
  Improved code is as correct as the original code
Most analysis is performed only within procedures
  Whole-program analysis is too expensive in most cases
Most analysis is based only on static information
  Compiler has difficulty anticipating run-time inputs
  Profile-driven dynamic compilation is becoming more prevalent
  Dynamic compilation (e.g. Java JIT)
When in doubt, the compiler must be conservative.

Preview: Code Optimization
Goal: improve performance by:
  Removing redundant work
    Unreachable code
    Common-subexpression elimination
    Induction variable elimination
  Using simpler and faster operations
    Dealing with constants in the compiler
    Strength reduction
  Managing registers well
ExecutionTime = Instructions x CPI x ClockCycleTime

Constant Folding
Detect and combine values that will be constant.
Respect the laws of arithmetic on the target machine.
  a = 10 * 5 + 6 - b;
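A toy constant folder can be written with Python's standard `ast` module. This is a sketch under stated assumptions: it assumes the slide's expression reads `10 * 5 + 6 - b` (the operator before `b` is a guess), it handles only +, - and *, and it uses Python's unbounded integers, so it does not model the overflow rules of a real target machine.

```python
import ast
import operator

# Only these operators are folded in this sketch.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

class ConstantFolder(ast.NodeTransformer):
    def visit_BinOp(self, node: ast.BinOp) -> ast.AST:
        self.generic_visit(node)  # fold the operands first (bottom-up)
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and type(node.op) in OPS):
            value = OPS[type(node.op)](node.left.value, node.right.value)
            return ast.copy_location(ast.Constant(value), node)
        return node  # an operand is not constant: leave the node alone

def fold(expression: str) -> str:
    """Return the expression source with constant subexpressions combined."""
    tree = ConstantFolder().visit(ast.parse(expression, mode="eval"))
    return ast.unparse(ast.fix_missing_locations(tree))

print(fold("10 * 5 + 6 - b"))   # 56 - b
```

The subtraction involving `b` survives because its value is unknown at compile time; only the `10 * 5 + 6` part is combined.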

Register Allocation & Assignment
Registers provide the fastest data access in the processor.
Two terms:
  What variables belong in registers? (allocation)
  What register will hold this value? (assignment)
Once a value is in a register, we'd like to avoid spilling and restoring it in order to use the same value again.

Assembly Code Generation

Assembler
Expands macros and pseudo-instructions, and converts constants.
Primary purpose is to produce an object file containing:
  Machine language instructions
  Application data
  Information for memory organization

Object File

Linker
The linker combines multiple object modules:
  Identifies where code/data will be placed in memory
  Resolves code/data cross references
  Produces an executable if all references are found
Steps:
  Place code and data modules in memory
  Determine the addresses of data and instruction labels
  Patch both the internal and external references
The separation between compiler and linker makes standard libraries an efficient solution for maintaining modular code.

Loader
Used at run-time. The loader:
  Reads the executable file header for the size of the text/data segments
  Creates an address space that is sufficiently large
  Copies the program from the executable on disk into memory
  Copies arguments to the main program's stack
  Initializes machine registers and sets the stack pointer
  Jumps to the start-up routine
  Terminates the program when execution completes

SPIM System Calls
SPIM provides a small number of OS system calls.
Service | Code | Argument | Result
print_int | 1 | $a0 = integer |
print_float | 2 | $f12 = float |
print_double | 3 | $f12 = double |
print_string | 4 | $a0 = string |
read_int | 5 | | Integer in $v0
read_float | 6 | | Float in $f0
read_double | 7 | | Double in $f0
read_string | 8 | $a0 = buffer, $a1 = length | Data in buffer
sbrk | 9 | $a0 = amount | Address in $v0
exit | 10 | |
Set the system call code in $v0 and pass arguments as needed:
  li $v0, 1     # set system call to print_int
  li $a0, 12    # load constant 12
  syscall
See web site for details and examples.

MIPS Assembler Directives
SPIM supports a subset of the MIPS assembler directives. Some of the directives include:
  .asciiz   Store a null-terminated string in memory
  .data     Start of data segment
  .globl    Identify an exported symbol
  .text     Start of text segment
  .word     Store words in memory

HW1
Homework: http://mypage.zju.edu.cn/liupeng/ (Teaching section)
Check the following website: http://10.13.71.82/

Acknowledgements
These slides contain material from courses:
  UCB CS152
  Stanford EE108B