Introduction To Computer Architecture

Introduction To Computer Architecture. Virendra Singh Computer Design and Test Lab. Supercomputer Education and Research Centre Indian Institute of Science Bangalore http://www.serc.iisc.ernet.in/~viren

Computer Architecture. Instruction Set Architecture (IBM 360): "the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." -- Amdahl, Blaauw, & Brooks, 1964. Machine Organization (microarchitecture): ALUs, buses, caches, memories, etc. Machine Implementation (realization): gates, cells, transistors, wires.

Computer Architecture. An exercise in engineering tradeoff analysis: find the fastest/cheapest/most power-efficient/etc. solution. An optimization problem with 100s of variables, all of which are changing, at non-uniform rates, with inflection points. Only one guarantee: today's right answer will be wrong tomorrow. Two high-level effects: technology push and application pull.

Technology Push. What do these two intervals have in common? 1776-1999 (224 years) and 2000-2001 (2 years). Answer: equal progress in processor speed! The power of exponential growth! Driven by Moore's Law: devices per chip double every 18-24 months. Computer architects work to turn the additional resources into speed, power savings, and functionality!

Some History
Date  Event                   Comments
1939  First digital computer  John Atanasoff (UW PhD '30)
1947  1st transistor          Bell Labs
1958  1st IC                  Jack Kilby (MSEE '50) @ TI; winner of 2000 Nobel Prize
1971  1st microprocessor      Intel
1974  Intel 4004              2300 transistors
1978  Intel 8086              29K transistors
1989  Intel 80486             1.2M transistors, pipelined
1995  Intel Pentium Pro       5.5M transistors
2005  Intel Montecito         1B transistors

Technology Push. Technology advances at varying rates. E.g. DRAM capacity increases at 60%/year, but DRAM speed improves only 10%/year, creating a gap with processor frequency! Inflection points: a crossover causes rapid change, e.g. enough devices for a multicore processor (2001). Current issues causing an inflection point: power consumption, reliability, variability.

Application Pull. Corollary to Moore's Law: cost halves every two years. "In a decade you can buy a computer for less than its sales tax today." - Jim Gray. Computers become cost-effective for: national security (weapons design), enterprise computing (banking), departmental computing (computer-aided design), the personal computer (spreadsheets, email, web), pervasive computing (prescription drug labels).

Application Pull. What about the future? We must dream up applications that are not cost-effective today: virtual reality, telepresence, mobile applications, sensing/analyzing/actuating in real-world environments. This is your job!

Building Computer Chips. A complex multi-step process: slice silicon ingots into wafers; process wafers into patterned wafers; dice patterned wafers into dies; test dies and select the good dies; bond to package; test parts; ship to customers and make money.

Building Computer Chips

Performance vs. Design Time. Time to market is critically important. E.g., a new design may take 3 years; it will be 3 times faster. But if technology improves 50%/year, in 3 years 1.5^3 = 3.38, so the new design is worse! (unless it also employs new technology)
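A quick numeric check of this claim in C (a minimal sketch; the 3x design speedup and 50%/year growth rate are the figures from the slide above):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double design_speedup = 3.0;                    /* new design is 3x faster today */
        double tech_growth    = 1.5;                    /* technology improves 50% per year */
        double years          = 3.0;                    /* time to market */
        double tech_speedup   = pow(tech_growth, years); /* old design on tomorrow's technology */
        printf("new design: %.2fx   old design + new tech: %.2fx\n",
               design_speedup, tech_speedup);           /* prints 3.00x vs 3.38x */
        return 0;
    }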

Bottom Line. Designers must know BOTH software and hardware. Both contribute to the layers of abstraction: IC costs and performance, compilers and operating systems.

Performance and Cost. Which of the following airplanes has the best performance?
Airplane           Passengers   Range (mi)   Speed (mph)
Boeing 737-100        101          630           598
Boeing 747            470         4150           610
BAC/Sud Concorde      132         4000          1350
Douglas DC-8-50       146         8720           544
How much faster is the Concorde vs. the 747? How much bigger is the 747 vs. the DC-8?

Performance and Cost. Which computer is fastest? Not so simple: scientific simulation needs FP performance; program development needs integer performance; database workloads need memory and I/O.

Performance of Computers. Want to buy the fastest computer for what you want to do? The workload is all-important, as are correct measurement and analysis. Want to design the fastest computer for what the customer wants to pay? Cost is an important criterion.

Forecast. Time and performance; the Iron Law; MIPS and MFLOPS; which programs to run and how to average; Amdahl's law.

Defining Performance. What is important to whom? Computer system user: minimize elapsed time for a program = time_end - time_start; called response time. Computer center manager: maximize completion rate = #jobs/second; called throughput.

Response Time vs. Throughput. Is throughput = 1 / average response time? Only if there is NO overlap; otherwise, throughput > 1 / average response time. E.g. a lunch buffet: assume 5 entrees, and each person takes 2 minutes per entree. Throughput is 1 person every 2 minutes, BUT the time to fill up a tray is 10 minutes. Why, and what would the throughput be otherwise? 5 people are simultaneously filling trays (overlap); without overlap, throughput = 1/10.

What is Performance for Us? For computer architects, CPU time = time spent running a program. Intuitively, bigger should be faster, so: performance = 1/X time, where X is response, CPU execution, etc. Elapsed time = CPU time + I/O wait. We will concentrate on CPU time.

Improve Performance. Improve (a) response time or (b) throughput? A faster CPU helps both (a) and (b). Adding more CPUs helps (b), and perhaps (a) due to less queueing.

Performance Comparison. Machine A is n times faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = n. Machine A is x% faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = 1 + x/100. E.g. time(A) = 10s, time(B) = 15s: 15/10 = 1.5 => A is 1.5 times faster than B; 15/10 = 1.5 => A is 50% faster than B.
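As a small sketch, the two definitions above in C (the 10 s and 15 s times are the ones from the example):

    #include <stdio.h>

    int main(void) {
        double time_a = 10.0, time_b = 15.0;                   /* seconds */
        double times_faster   = time_b / time_a;               /* perf(A)/perf(B) = time(B)/time(A) */
        double percent_faster = (times_faster - 1.0) * 100.0;  /* solve 1 + x/100 = ratio for x */
        printf("A is %.1f times faster, i.e. %.0f%% faster, than B\n",
               times_faster, percent_faster);                  /* 1.5 times, 50% */
        return 0;
    }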

Breaking Down Performance. A program is broken into instructions; H/W is aware of instructions, not programs. At a lower level, H/W breaks instructions into cycles; lower-level state machines change state every cycle. For example: a 1 GHz Snapdragon runs 1000M cycles/sec, so 1 cycle = 1 ns; a 2.5 GHz Core i7 runs 2.5G cycles/sec, so 1 cycle = 0.4 ns.

Iron Law
Processor Performance = Time / Program
  = Instructions/Program  x  Cycles/Instruction  x  Time/Cycle
        (code size)               (CPI)              (cycle time)
Architecture --> Implementation --> Realization
Compiler Designer --> Processor Designer --> Chip Designer

Iron Law Instructions/Program Instructions executed, not static code size Determined by algorithm, compiler, ISA Cycles/Instruction Determined by ISA and CPU organization Overlap among instructions reduces this term Time/cycle Determined by technology, organization, clever circuit design
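A minimal sketch of the Iron Law as C code (the numbers are hypothetical; any program with these parameters behaves the same way):

    #include <stdio.h>

    int main(void) {
        double instructions = 1.0e9;   /* dynamic instruction count of the program */
        double cpi          = 2.0;     /* average cycles per instruction */
        double cycle_time   = 1.0e-9;  /* seconds per cycle (1 GHz clock) */
        double time = instructions * cpi * cycle_time;  /* Iron Law: product of the three terms */
        printf("execution time = %.2f s\n", time);      /* 2.00 s */
        return 0;
    }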

Our Goal. Minimize time, which is the product, NOT isolated terms. A common error is to miss terms while devising optimizations: e.g. an ISA change decreases instruction count BUT leads to a CPU organization that makes the clock slower. Bottom line: the terms are inter-related.

Other Metrics. MIPS and MFLOPS. MIPS = instruction count / (execution time x 10^6) = clock rate / (CPI x 10^6). But MIPS has serious shortcomings.

Problems with MIPS. For instance, without FP hardware, an FP op may take 50 single-cycle instructions; with FP hardware, only one 2-cycle instruction. Thus, adding FP hardware: CPI increases (why?), instructions/program decreases (why?), and total execution time decreases, BUT MIPS gets worse!
cycles/instructions: 50/50 => 2/1
instructions:        50 => 1
cycles:              50 => 2
MIPS:                50 MIPS => 2 MIPS
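A hedged numeric sketch of the effect (the 100 MHz clock is assumed purely for illustration, so the absolute MIPS values differ from the slide; the instruction and cycle counts follow it):

    #include <stdio.h>

    int main(void) {
        double cycle_time = 1.0e-8;                   /* assumed 100 MHz clock */
        double sw_instr = 50.0, sw_cycles = 50.0;     /* FP op emulated by 50 single-cycle instructions */
        double hw_instr = 1.0,  hw_cycles = 2.0;      /* FP op as one 2-cycle instruction */
        double sw_time = sw_cycles * cycle_time, hw_time = hw_cycles * cycle_time;
        double sw_mips = sw_instr / (sw_time * 1.0e6);
        double hw_mips = hw_instr / (hw_time * 1.0e6);
        printf("time: %.0f ns -> %.0f ns (much better)\n", sw_time * 1e9, hw_time * 1e9);
        printf("MIPS: %.0f -> %.0f (looks worse)\n", sw_mips, hw_mips);
        return 0;
    }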

Problems with MIPS. MIPS ignores the program; it is usually used to quote peak performance (ideal conditions => guaranteed not to exceed!). When is MIPS OK? Same compiler, same ISA, e.g. the same binary running on an AMD Phenom and an Intel Core i7. Why? Instructions/program is constant and can be ignored.

Other Metrics. MFLOPS = FP ops in program / (execution time x 10^6). Assumes FP ops are independent of compiler and ISA; often safe for numeric codes, where the matrix size determines the number of FP ops per program. However, it is not always safe: missing instructions (e.g. FP divide), optimizing compilers. Relative MIPS and normalized MFLOPS add to the confusion.

Rules. Use ONLY time. Beware when reading, especially if details are omitted. Beware of "peak": guaranteed not to exceed.

Iron Law Example Machine A: clock 1ns, CPI 2.0, for program x Machine B: clock 2ns, CPI 1.2, for program x Which is faster and how much? Time/Program = instr/program x cycles/instr x sec/cycle Time(A) = N x 2.0 x 1 = 2N Time(B) = N x 1.2 x 2 = 2.4N Compare: Time(B)/Time(A) = 2.4N/2N = 1.2 So, Machine A is 20% faster than Machine B for this program

Iron Law Example. Keep clock(A) at 1 ns and clock(B) at 2 ns. For equal performance, if CPI(B) = 1.2, what is CPI(A)? Time(B)/Time(A) = 1 = (N x 1.2 x 2) / (N x CPI(A) x 1), so CPI(A) = 2.4.

Iron Law Example. Keep CPI(A) = 2.0 and CPI(B) = 1.2. For equal performance, if clock(B) = 2 ns, what is clock(A)? Time(B)/Time(A) = 1 = (N x 1.2 x 2) / (N x 2.0 x clock(A)), so clock(A) = 1.2 ns.
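The three exercises above, checked with a short C sketch (N cancels out, so it is set to 1):

    #include <stdio.h>

    int main(void) {
        double N = 1.0;                                  /* instruction count cancels */
        /* Example 1: Time = N x CPI x cycle time */
        double time_a = N * 2.0 * 1.0;                   /* CPI 2.0, 1 ns cycle */
        double time_b = N * 1.2 * 2.0;                   /* CPI 1.2, 2 ns cycle */
        printf("Time(B)/Time(A) = %.2f\n", time_b / time_a);  /* 1.20: A is 20 percent faster */
        /* Example 2: clocks fixed, solve for CPI(A) */
        printf("CPI(A)   = %.2f\n", (1.2 * 2.0) / 1.0);       /* 2.40 */
        /* Example 3: CPIs fixed, solve for clock(A) */
        printf("clock(A) = %.2f ns\n", (1.2 * 2.0) / 2.0);    /* 1.20 ns */
        return 0;
    }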

Which Programs? Execution time of what program? Best case: you always run the same set of programs; port them and time the whole workload. In reality, use benchmarks: programs chosen to measure performance and predict the performance of the actual workload, saving effort and money. Representative? Honest? Benchmarketing...

How to Average
            Machine A   Machine B
Program 1        1          10
Program 2     1000         100
Total         1001         110
One answer: for total execution time, how much faster is B? 9.1x

How to Average
Another answer: the arithmetic mean (same result).
Arithmetic mean of times: AM = (1/n) x sum_{i=1..n} time(i)
AM(A) = 1001/2 = 500.5
AM(B) = 110/2 = 55
500.5/55 = 9.1x
Valid only if programs run equally often; otherwise use the weighted arithmetic mean: sum_{i=1..n} weight(i) x time(i)

Other Averages. E.g., 30 mph for the first 10 miles, then 90 mph for the next 10 miles; what is the average speed? Average speed = (30+90)/2 is WRONG. Average speed = total distance / total time = 20 / (10/30 + 10/90) = 45 mph.

Harmonic Mean
Harmonic mean of rates: HM = n / sum_{i=1..n} (1/rate(i))
Use HM if forced to start and end with rates (e.g. reporting MIPS or MFLOPS). Why? A rate has time in the denominator, and the mean should be proportional to the inverse of the sum of times (not to the sum of inverses). See: J.E. Smith, "Characterizing computer performance with a single number," CACM 31(10), October 1988, pp. 1202-1206.

Dealing with Ratios
            Machine A   Machine B
Program 1        1          10
Program 2     1000         100
Total         1001         110
If we take ratios with respect to machine A:
            Machine A   Machine B
Program 1        1          10
Program 2        1           0.1

Dealing with Ratios
The average for machine A is 1; the average for machine B is 5.05. If we take ratios with respect to machine B:
            Machine A   Machine B
Program 1      0.1           1
Program 2     10             1
Average        5.05          1
Can't both be true!!! Don't use the arithmetic mean on ratios!

Geometric Mean
Use the geometric mean for ratios: GM = (prod_{i=1..n} ratio(i))^(1/n)
Independent of the reference machine: in the example, GM for machine A is 1 and GM for machine B is also 1, normalized with respect to either machine.

But the GM of ratios is not proportional to total time. The AM in the example says machine B is 9.1 times faster; the GM says they are equal. If we took total execution time, A and B are equal only if Program 1 is run 100 times more often than Program 2. Generally, GM will mispredict for three or more machines.

Summary Use AM for times Use HM if forced to use rates Use GM if forced to use ratios Best of all, use unnormalized numbers to compute time
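The three means on the two-program example above, as a C sketch (the times and ratios are the ones from the tables; the function names are ad hoc):

    #include <stdio.h>
    #include <math.h>

    static double am(const double *x, int n) {   /* arithmetic mean: use for times */
        double s = 0.0;
        for (int i = 0; i < n; i++) s += x[i];
        return s / n;
    }

    static double hm(const double *x, int n) {   /* harmonic mean: use for rates */
        double s = 0.0;
        for (int i = 0; i < n; i++) s += 1.0 / x[i];
        return n / s;
    }

    static double gm(const double *x, int n) {   /* geometric mean: use for ratios */
        double p = 1.0;
        for (int i = 0; i < n; i++) p *= x[i];
        return pow(p, 1.0 / n);
    }

    int main(void) {
        double time_a[]   = {1.0, 1000.0}, time_b[] = {10.0, 100.0};
        double ratio_ab[] = {1.0 / 10.0, 1000.0 / 100.0};   /* A's times relative to B's */
        double rate_b[]   = {1.0 / 10.0, 1.0 / 100.0};      /* B's rates = 1/time */
        printf("AM(A)=%.1f  AM(B)=%.1f  AM(A)/AM(B)=%.1f\n",
               am(time_a, 2), am(time_b, 2), am(time_a, 2) / am(time_b, 2)); /* 500.5, 55, 9.1 */
        printf("GM of A/B ratios = %.2f\n", gm(ratio_ab, 2));   /* 1.00: GM calls them equal */
        printf("HM of B's rates  = %.4f\n", hm(rate_b, 2));     /* 0.0182 = 1/AM(B), as desired */
        return 0;
    }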

Benchmarks: SPEC2000. SPEC (System Performance Evaluation Cooperative) was formed in the 80s to combat benchmarketing: SPEC89, SPEC92, SPEC95, SPEC2000. 12 integer and 14 floating-point programs. A Sun Ultra-5 300MHz reference machine has a score of 100; report the GM of ratios to the reference machine.

Benchmarks: SPEC CINT2000
Benchmark     Description
164.gzip      Compression
175.vpr       FPGA place and route
176.gcc       C compiler
181.mcf       Combinatorial optimization
186.crafty    Chess
197.parser    Word processing, grammatical analysis
252.eon       Visualization (ray tracing)
253.perlbmk   PERL script execution
254.gap       Group theory interpreter
255.vortex    Object-oriented database
256.bzip2     Compression
300.twolf     Place and route simulator

Benchmarks: SPEC CFP2000
Benchmark      Description
168.wupwise    Physics/quantum chromodynamics
171.swim       Shallow water modeling
172.mgrid      Multi-grid solver: 3D potential field
173.applu      Parabolic/elliptic PDE
177.mesa       3-D graphics library
178.galgel     Computational fluid dynamics
179.art        Image recognition/neural networks
183.equake     Seismic wave propagation simulation
187.facerec    Image processing: face recognition
188.ammp       Computational chemistry
189.lucas      Number theory/primality testing
191.fma3d      Finite-element crash simulation
200.sixtrack   High energy nuclear physics accelerator design
301.apsi       Meteorology: pollutant distribution

Benchmark Pitfalls. The benchmark may not be representative: if your workload is I/O bound, SPEC is useless. The benchmark may be too old: benchmarks age poorly, and benchmarketing pressure causes vendors to optimize compilers/hardware/software to the benchmarks, so they need to be periodically refreshed.

Amdahl's Law
Motivation for optimizing the common case.
Speedup = old time / new time = new rate / old rate.
Let an optimization speed up a fraction f of the time by a factor of s:
new time = (1-f) x oldtime + (f/s) x oldtime
Speedup = ((1-f) x oldtime + f x oldtime) / ((1-f) x oldtime + (f/s) x oldtime) = 1 / ((1-f) + f/s)

Amdahl's Law Example
Your boss asks you to improve performance by either:
improving the ALU, used 95% of the time, by 10%, or
improving the memory pipeline, used 5% of the time, by 10x.
Let f = fraction sped up and s = speedup on that fraction.
New_time = (1-f) x old_time + (f/s) x old_time
Speedup = old_time / new_time = old_time / ((1-f) x old_time + (f/s) x old_time)
Amdahl's Law: Speedup = 1 / ((1-f) + f/s)

Amdahl's Law Example, cont'd
f      s         Speedup
95%    1.10      1.094
5%     10        1.047
5%     infinity  1.052
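The same numbers reproduced by a minimal C sketch of Amdahl's Law (the last line uses the 1/(1-f) limit for the infinite-speedup row):

    #include <stdio.h>

    static double amdahl(double f, double s) {       /* speedup = 1 / ((1-f) + f/s) */
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        printf("f=95%%, s=1.1 : %.3f\n", amdahl(0.95, 1.1));   /* 1.094 */
        printf("f=5%%,  s=10  : %.3f\n", amdahl(0.05, 10.0));  /* 1.047 */
        printf("f=5%%,  s->inf: %.3f\n", 1.0 / (1.0 - 0.05));  /* 1.053 (limit) */
        return 0;
    }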

Amdahl's Law: Limit
Make the common case fast: as s -> infinity, Speedup = 1 / ((1-f) + f/s) -> 1 / (1-f)
[Figure: limiting speedup 1/(1-f) plotted against f from 0 to 1; the speedup axis runs from 0 to 10]

Amdahl's Law: Limit
Consider the uncommon case! If (1-f) is nontrivial, speedup is limited!
lim_{s -> infinity} 1 / ((1-f) + f/s) = 1 / (1-f)
Particularly true for exploiting parallelism in the large, where large s is not cheap. GPU with e.g. 1024 processors (shader cores): the parallel portion speeds up by s (1024x), but the serial portion of the code (1-f) limits speedup. E.g. 10% serial limits speedup to 10x!

Instructions. Instructions are the words of a computer; the instruction set architecture (ISA) is its vocabulary. This defines most of the interface to the processor (not quite everything). Implementations can and do vary: Intel 486 -> Pentium -> P6 -> Core Duo -> Core i7.

Instructions cont'd. MIPS ISA: simple, sensible, regular, widely used. Most common: x86 (IA-32): Intel Pentium/Core i7, AMD Athlon, etc. Others: PowerPC (Mac, IBM servers), SPARC (Sun), ARM (cell phones, embedded systems).

Forecast Basics Registers and ALU ops Memory and load/store Branches and jumps Etc.

Basics
C statement: f = (g + h) - (i + j)
MIPS instructions:
add t0, g, h
add t1, i, j
sub f, t0, t1
Opcode/mnemonic, operands, source/destination

Basics Opcode: specifies the kind of operation (mnemonic) Operands: input and output data (source/destination) Operands t0 & t1 are temporaries One operation, two inputs, one output Multiple instructions for one C statement

Why not bigger instructions? Why not f = (g + h) - (i + j) as one instruction? Church's thesis: a very primitive computer can compute anything that a fancy computer can compute; you need only logical functions, the ability to read and write memory, and data-dependent decisions. Therefore, the ISA is selected for practical reasons: performance and cost, not computability. Regularity tends to improve both; e.g. hardware to handle an arbitrary number of operands is complex, slow, and UNNECESSARY.

Registers and ALU ops
Operands must be registers, not variables:
add $8, $17, $18
add $9, $19, $20
sub $16, $8, $9
MIPS has 32 registers, $0-$31. $8 and $9 are temps; $16 is f, $17 is g, $18 is h, $19 is i, and $20 is j. MIPS also allows one constant operand, called an immediate. Later we will see the immediate is restricted to 16 bits.

Registers and ALU
[Figure: processor containing registers $0-$31 and an ALU]

ALU ops. Some ALU ops: add, addi, addu, addiu (immediate, unsigned); sub; mul, div (wider result: 32b x 32b = 64b product; 32b / 32b = 32b quotient and 32b remainder); and, andi; or, ori; sll, srl. Why registers? A short name fits in the instruction word: log2(32) = 5 bits. But are registers enough?

Memory and Load/Store
Need more than 32 words of storage: an array of locations M[j] indexed by j.
Data movement (on words or integers):
Load word: register <= memory
lw $17, 1002   # get input g
Store word: register => memory
sw $16, 1001   # save output f

Memory and load/store
[Figure: processor with registers $0-$31 and ALU, connected to memory locations 0 through maxmem; f is at address 1001 and g at 1002]

Memory and load/store
Important for arrays: A[i] = A[i] + h
# $8 is a temp, $18 is h, $21 holds i x 4
# Astart == &A[0] == 0x8000
lw  $8, Astart($21)   # or 8000($21)
add $8, $18, $8
sw  $8, Astart($21)
MIPS has other loads/stores for bytes and halfwords.

Memory and load/store
[Figure: processor with registers $0-$31 and ALU; memory holds f at 4004, g at 4008, and A[0], A[1], A[2] at 8000, 8004, 8008]

Aside on Endian Big endian: MSB at address xxxxxx00 E.g. IBM, SPARC Little endian: MSB at address xxxxxx11 E.g. Intel x86 Mode selectable E.g. PowerPC, MIPS
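A small C sketch that reports which convention the machine running it uses (a standard byte-inspection idiom, not from the slides):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t word = 0x11223344;
        unsigned char *bytes = (unsigned char *)&word;  /* look at the byte stored at the lowest address */
        if (bytes[0] == 0x11)
            printf("big endian: MSB (0x11) is at the lowest address\n");
        else
            printf("little endian: LSB (0x44) is at the lowest address\n");
        return 0;
    }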

Branches and Jumps
while (i != j) { j = j + i; i = i + 1; }
# $8 is i, $9 is j, $10 is k
Loop: beq  $8, $9, Exit
      add  $9, $9, $8
      addi $8, $8, 1
      j    Loop
Exit:

Branches and Jumps
# better:
      beq  $8, $9, Exit   # not !=
Loop: add  $9, $9, $8
      addi $8, $8, 1
      bne  $8, $9, Loop
Exit:
Best to let compilers worry about such optimizations.

Branches and Jumps
What does bne really do? Read $8, read $9, compare. Set PC = PC + 4 or PC = Target.
To do compares other than == or !=, e.g.:
blt $8, $9, Target   # pseudoinstruction
expands to:
slt $1, $8, $9       # $1 = ($8 < $9), i.e. ($8 - $9) < 0
bne $1, $0, Target   # $0 is always 0

Branches and Jumps
Other MIPS branches/jumps:
beq $8, $9, imm   # if ($8 == $9) PC = PC + (imm << 2) else PC += 4
bne
slt, sle, sgt, sge (with immediate and unsigned forms)
j   addr   # PC = addr
jr  $12    # PC = $12
jal addr   # $31 = PC + 4; PC = addr; used for???

MIPS Machine Language
All instructions are 32 bits wide.
Assembly: add $1, $2, $3
Machine language (bit positions 31..0):
00000000010000110000100000010000
000000  00010  00011  00001  00000  010000
alu-rr    2      3      1    zero   add/signed

Instruction Format
R-format:
Opc  rs  rt  rd  shamt  function
 6    5   5   5    5       6
Digression: how do you store the number 4,392,976? The same way as add $1, $2, $3.
Stored program: instructions are represented as numbers, so programs can be read and written in memory like numbers.
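A sketch of the stored-program point in C: packing the R-format fields above into a 32-bit word yields an ordinary number (the field values are the ones from the add $1, $2, $3 encoding on the previous slide):

    #include <stdio.h>

    static unsigned r_format(unsigned opc, unsigned rs, unsigned rt,
                             unsigned rd, unsigned shamt, unsigned funct) {
        /* opcode:6 | rs:5 | rt:5 | rd:5 | shamt:5 | function:6 */
        return (opc << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct;
    }

    int main(void) {
        unsigned word = r_format(0, 2, 3, 1, 0, 0x10);  /* add $1, $2, $3 with funct 010000, as on the slide */
        printf("%u\n", word);                           /* prints 4392976 */
        return 0;
    }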

Instruction Format
Other R-format ops: addu, sub, subu, etc.
Assembly: lw $1, 100($2)
Machine:  100011 00010 00001 0000000001100100
          lw       2     1    100 (in binary)
I-format:
Opc  rs  rt  address/immediate
 6    5   5         16

Instruction Format
I-format is also used for ALU ops with immediates:
addi $1, $2, 100
001000 00010 00001 0000000001100100
What about constants larger than 16 bits, i.e. outside the range [-32768, 32767]? E.g. 1100 0000 0000 0000 1111:
lui $4, 12      # $4 == 0000 0000 0000 1100 0000 0000 0000 0000
ori $4, $4, 15  # $4 == 0000 0000 0000 1100 0000 0000 0000 1111
All loads and stores use the I-format.
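The same constant construction sketched in C (the shift and OR mirror what lui and ori do to $4):

    #include <stdio.h>

    int main(void) {
        unsigned r4 = 12u << 16;            /* lui $4, 12 : upper 16 bits = 12 */
        r4 = r4 | 15u;                      /* ori $4, $4, 15 : lower 16 bits = 15 */
        printf("0x%08X\n", r4);             /* 0x000C000F */
        return 0;
    }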

Instruction Format
beq $1, $2, 7
000100 00001 00010 0000000000000111
PC = PC + (0000 0111 << 2)   # word offset
Finally, the J-format:
j address
Opcode  addr
  6      26
addr is weird in MIPS: the target is the 4 MSBs of the PC // addr // 00

Summary: Instruction Formats
R: opcode rs rt rd shamt function
      6    5  5  5    5      6
I: opcode rs rt address/immediate
      6    5  5        16
J: opcode addr
      6    26
Instruction decode: read the instruction bits, activate control signals.

Procedure Calls
Caller: save registers, set up parameters, call procedure, get results, restore registers.
Callee: save more registers, do some work, set up result, restore registers, return.
jal is special; otherwise this is all just software convention.

Procedure Calls
The stack is all-important. The stack grows from larger to smaller addresses (an arbitrary choice). $29 is the stack pointer; it points just beyond the valid data.
Push $2:
addi $29, $29, -4
sw   $2, 4($29)
Pop $2:
lw   $2, 4($29)
addi $29, $29, 4
Cannot change the order. Why? Interrupts.

Procedure Example
Swap(int v[], int k) { int temp = v[k]; v[k] = v[k+1]; v[k+1] = temp; }
# $4 is v[] and $5 is k -- the 1st and 2nd incoming arguments
# $8, $9, and $10 are temporaries that the callee can use without saving
swap: add $9, $5, $5    # $9 = k+k
      add $9, $9, $9    # $9 = k*4
      add $9, $4, $9    # $9 = v + k*4 = &(v[k])
      lw  $8, 0($9)     # $8 = temp = v[k]
      lw  $10, 4($9)    # $10 = v[k+1]
      sw  $10, 0($9)    # v[k] = v[k+1]
      sw  $8, 4($9)     # v[k+1] = temp
      jr  $31           # return

Addressing Modes There are many ways of accessing operands Register addressing: add $1, $2, $3 op rs rt rd... funct register

Addressing Modes
Base addressing (aka displacement):
lw $1, 100($2)   # $2 == 400, M[500] == 42
[Figure: op/rs/rt/offset fields; effective address = base register (400) + displacement (100) = 500, selecting the memory word 42]

Addressing Modes Immediate addressing addi $1, $2, 100 op rs rt immediate

Addressing Modes
PC-relative addressing:
beq $1, $2, 100   # if ($1 == $2) PC = PC + 100
[Figure: op/rs/rt/address fields; effective address = PC + address field]

Addressing Modes Not found in MIPS: Indexed: add two registers base + index Indirect: M[M[addr]] two memory references Autoincrement/decrement: add operand size Autoupdate found in PowerPC, PA-RISC Like displacement, but update base register

Addressing Modes
Autoupdate:
lwupdate $1, 24($2)   # $1 = M[$2+24]; $2 = $2 + 24
[Figure: base register plus offset forms the effective address, which is also written back to the base register]

Addressing Modes
for (i = 0; i < N; i += 1) sum += A[i];
# $7 is sum, $8 is &A[i], $9 is N, $2 is tmp, $3 is i*4
Inner loop:
lw   $2, 0($8)
addi $8, $8, 4
add  $7, $7, $2
Or:
lwupdate $2, 4($8)
add      $7, $7, $2
Where's the bug? Before the loop: sub $8, $8, 4

How to Choose ISA Minimize what? Instrs/prog x cycles/instr x sec/cycle!!! In 1985-1995 technology, simple modes like MIPS were great As technology changes, computer design options change If memory is limited, dense instructions are important For high speed, pipelining and ease of pipelining is important

Some Intel x86 (IA-32) History
Year  CPU          Comment
1978  8086         16-bit, with 8-bit bus, derived from the 8080; selected for the IBM PC
1980  8087         Floating-point unit
1982  80286        24-bit addresses, memory map, protection
1985  80386        32-bit registers, flat memory addressing, paging
1989  80486        Pipelining
1992  Pentium      Superscalar
1995  Pentium Pro  Out-of-order execution
1997  Pentium MMX  Multimedia (MMX) instructions
1999  Pentium III  SSE streaming SIMD

Intel 386 Registers & Memory
Registers: 8 32b registers (with backward-compatible 16b and 8b names: EAX, AX, AH, AL); 4 special registers, including the stack pointer (ESP) and frame pointer (EBP); condition codes: overflow, sign, zero, parity, carry. Floating point uses an 8-element stack.
Memory: flat 32b or segmented (rarely used).
Effective address = base_reg + (index_reg x scaling_factor) + displacement
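A small C sketch of the 386 effective-address calculation above (the register and displacement values are made up; the scaling factor must be 1, 2, 4, or 8):

    #include <stdio.h>

    int main(void) {
        unsigned base  = 0x1000;   /* e.g. contents of EBX */
        unsigned index = 5;        /* e.g. contents of ESI */
        unsigned scale = 4;        /* scaling factor: 1, 2, 4, or 8 */
        int      disp  = 16;       /* displacement encoded in the instruction */
        unsigned ea = base + index * scale + disp;
        printf("effective address = 0x%X\n", ea);   /* 0x1024 */
        return 0;
    }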

Intel 386 ISA
Two-register instructions: src1/dst, src2
Combinations: reg/reg, reg/immed, reg/mem, mem/reg, mem/imm
Examples:
mov EAX, 23       # 32b 2's-complement immediate 23 into EAX
neg [EAX+4]       # M[EAX+4] = -M[EAX+4]
faddp ST(7), ST   # ST = ST + ST(7)
jle label         # PC = label if sign or zero flag set

Intel 386 ISA cont'd. A decoding nightmare: instructions are 1 to 17 bytes; optional prefixes and postfixes alter semantics; the AMD64 64-bit extension adds a 64b prefix byte; crazy formats, e.g. register specifiers move around. But the key 32b 386 instructions are not terrible. Yet the entire ISA has to be correctly implemented.

Current Approach Current technique used by Intel and AMD Decode logic translates to RISC uops Execution units run RISC uops Backward compatible Very complex decoder Execution unit has simpler (manageable) control logic, data paths We use MIPS/DLX to keep it simple and clean

Complex Instructions. More powerful instructions are not necessarily faster. E.g. string copy. Option 1: a move with a repeat prefix for a memory-to-memory copy (special-purpose). Option 2: use loads/stores to/from registers (generic instructions). Option 2 is faster on the same machine! (But which code is denser?)

Conclusions Simple and regular Constant length instructions, fields in same place Small and fast Small number of operands in registers Compromises inevitable Pipelining should not be hindered Make common case fast! Backwards compatibility!