Computer Architecture, Spring 2010, Harvard University. Instructor: Prof. David Brooks (dbrooks@eecs.harvard.edu)
Lecture Outline: Performance Metrics and Averaging; Amdahl's Law; Benchmarks; The CPU Performance Equation; Optimal Pipelining Case Study; Modern Processor Analysis; Begin ISAs
Performance Metrics. "Better" execution time is often what we target. Throughput (tasks/sec) vs. latency (sec/task). How do we decide on the tasks? Benchmarks: what the customer cares about, real applications. Representative programs (SPEC, SYSmark, etc.). Kernels: code fragments from real programs (Linpack). Toy programs: quicksort, Sieve of Eratosthenes. Synthetic programs: just a representative instruction mix (Whetstone, Dhrystone).
Measuring Performance. Total execution time: this is the arithmetic mean, $\frac{1}{n}\sum_{i=1}^{n}\mathrm{Time}_i$. It should be used when measuring performance in execution times (e.g., CPI).
Measuring Performance. Weighted execution time: $\sum_{i=1}^{n}\mathrm{Weight}_i \times \mathrm{Time}_i$. Use weights when programs P1 and P2 are not run equally often.
Measuring Performance. Normalized execution time: normalize to a reference machine. Only the geometric mean, $\sqrt[n]{\prod_{i=1}^{n}\mathrm{ExecutionTimeRatio}_i}$, can be used (the arithmetic mean can vary depending on the reference machine). Problem: the result is a ratio, not an execution time.
Harmonic Mean: Motivation. Drive 30 mph for the first 10 miles, then 90 mph for the next 10 miles. Average speed? (30+90)/2 = 60 mph. WRONG! Average speed = total distance / total time = 20/(10/30 + 10/90) = 45 mph.
Harmonic Mean. Each program has O operations, so n programs execute nO operations in $\sum T_i$ total time. The execution rate is then $nO/\sum T_i = n/\sum(T_i/O) = n/\sum(1/P_i)$, where $P_i = O/T_i$ is the rate of execution of program i. The harmonic mean should be used when measuring performance in execution rates (e.g., IPC).
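To make the contrast concrete, here is a minimal sketch in plain Python (helper names are my own, and the numbers reuse the two-leg speed example above): the arithmetic mean of the two rates gives the wrong 60 mph, while the harmonic mean recovers the true 45 mph.

    import math

    def arithmetic_mean(xs):
        return sum(xs) / len(xs)

    def weighted_mean(xs, weights):          # weights should sum to 1
        return sum(w * x for w, x in zip(weights, xs))

    def geometric_mean(ratios):              # for normalized execution-time ratios
        return math.prod(ratios) ** (1 / len(ratios))

    def harmonic_mean(rates):                # for rates (IPC, mph, ...)
        return len(rates) / sum(1 / r for r in rates)

    speeds = [30.0, 90.0]                    # two equal 10-mile legs
    print(arithmetic_mean(speeds))           # 60.0 -- the wrong average speed
    print(harmonic_mean(speeds))             # 45.0 -- total distance / total time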
Amdahl's Law (Law of Diminishing Returns). Very intuitive: make the common case fast. Speedup = (execution time for the task without the enhancement) / (execution time for the task using the enhancement). $\mathrm{Time}_{\mathrm{new}} = \mathrm{Time}_{\mathrm{old}} \times \left[(1 - \mathrm{Fraction}_{\mathrm{enhanced}}) + \frac{\mathrm{Fraction}_{\mathrm{enhanced}}}{\mathrm{Speedup}_{\mathrm{enhanced}}}\right]$
Amdahl's Law Corollary. $\mathrm{Speedup}_{\mathrm{overall}} = \frac{1}{(1 - \mathrm{Fraction}_{\mathrm{enhanced}}) + \mathrm{Fraction}_{\mathrm{enhanced}}/\mathrm{Speedup}_{\mathrm{enhanced}}}$. As $\mathrm{Speedup}_{\mathrm{enhanced}} \to \infty$, $\mathrm{Speedup}_{\mathrm{overall}} \to \frac{1}{1 - \mathrm{Fraction}_{\mathrm{enhanced}}}$.
Amdahl's Law Example.
Fraction_enhanced = 95%, Speedup_enhanced = 1.1x: Speedup_overall = 1/((1-.95)+(.95/1.1)) = 1.094
Fraction_enhanced = 5%, Speedup_enhanced = 10x: Speedup_overall = 1/((1-.05)+(.05/10)) = 1.047
Fraction_enhanced = 5%, Speedup_enhanced = infinity: Speedup_overall = 1/(1-.05) = 1.052
Make the common case fast!
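A small sketch (plain Python; the function name is mine) that reproduces the three cases above directly from the corollary:

    def amdahl(fraction, speedup):
        # Overall speedup when `fraction` of execution time is sped up by `speedup`
        return 1.0 / ((1.0 - fraction) + fraction / speedup)

    print(amdahl(0.95, 1.1))           # ~1.094: big fraction, tiny speedup
    print(amdahl(0.05, 10.0))          # ~1.047: small fraction, big speedup
    print(amdahl(0.05, float("inf")))  # ~1.052: the 1/(1 - Fraction) limit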
MIPS. MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6). Problems: ISAs are not equivalent (e.g., RISC vs. CISC: one CISC instruction may equal many RISC instructions!), and programs use different instruction mixes. May be OK when comparing the same benchmarks, same ISA, same compiler, and same OS.
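Both forms of the formula agree, as a quick sketch shows (plain Python; all the numbers are made up):

    insts = 2_000_000_000        # dynamic instruction count (hypothetical)
    clock_hz = 1_000_000_000     # 1 GHz clock (hypothetical)
    cpi = 2.0                    # cycles per instruction (hypothetical)
    exec_time = insts * cpi / clock_hz           # 4.0 seconds

    mips_from_time = insts / (exec_time * 1e6)   # 500.0
    mips_from_rate = clock_hz / (cpi * 1e6)      # 500.0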
MFLOPS. Same as MIPS, just counting FP ops. Not useful either: it needs FP-intensive apps, and while traditionally FP ops were slow enough that integer ops could be ignored, now memory ops can be the slowest! Peak MFLOPS is a common marketing fallacy; it basically just says #FP-pipes × clock rate.
GHz. Is this a metric? Maybe as good as the others: one number, no benchmarks, what can be better? Many designs are frequency driven. Yet frequency alone does not predict performance:

Processor         Clock Rate   SPEC FP2000
IBM POWER3        450 MHz      434
Intel PIII        1.4 GHz      456
Intel Pentium 4   2.4 GHz      833
Itanium-2         1.0 GHz      1356
Benchmark Suites. Desktop/server: SPEC CPU2000 (int and float). Embedded: EEMBC (pronounced "embassy"), SPECjvm. Server: TPC-C, TPC-H, SPECjbb, ECperf.
SPEC CPU2000: Integer Benchmarks
164.gzip     C    Compression
175.vpr      C    FPGA Circuit Placement and Routing
176.gcc      C    C Programming Language Compiler
181.mcf      C    Combinatorial Optimization
186.crafty   C    Game Playing: Chess
197.parser   C    Word Processing
252.eon      C++  Computer Visualization
253.perlbmk  C    PERL Programming Language
254.gap      C    Group Theory, Interpreter
255.vortex   C    Object-oriented Database
256.bzip2    C    Compression
300.twolf    C    Place and Route Simulator
SPEC CPU2000: Floating Point Benchmarks
168.wupwise   Fortran 77  Physics / Quantum Chromodynamics
171.swim      Fortran 77  Shallow Water Modeling
172.mgrid     Fortran 77  Multi-grid Solver: 3D Potential Field
173.applu     Fortran 77  Parabolic / Elliptic Partial Differential Equations
177.mesa      C           3-D Graphics Library
178.galgel    Fortran 90  Computational Fluid Dynamics
179.art       C           Image Recognition / Neural Networks
183.equake    C           Seismic Wave Propagation Simulation
187.facerec   Fortran 90  Image Processing: Face Recognition
188.ammp      C           Computational Chemistry
189.lucas     Fortran 90  Number Theory / Primality Testing
191.fma3d     Fortran 90  Finite-element Crash Simulation
200.sixtrack  Fortran 77  High Energy Nuclear Physics Accelerator Design
301.apsi      Fortran 77  Meteorology: Pollutant Distribution
Server Benchmarks. TPC-C (Online Transaction Processing, OLTP): models a simple order-entry application with thousands of concurrent database accesses.

System               # x Processor           tpm    $/tpm
Fujitsu PrimePower   128 x 563 MHz SPARC64   455K   $28.58
HP SuperDome         64 x 875 MHz PA8700     423K   $15.64
IBM p690             32 x 1300 MHz POWER4    403K   $17.80

TPC-H (ad-hoc, decision support): data warehouse, back-end analysis tools for data.
CPU Performance Equation.
Execution Time = seconds/program = (instructions/program) x (cycles/instruction) x (seconds/cycle)
instructions/program: determined by the program, the compiler (scheduling), and the architecture (ISA)
cycles/instruction: determined by the architecture (ISA) and the organization (uarch)
seconds/cycle: determined by the organization, physical design, and technology
The people behind each factor: compiler writers, microarchitects, and circuit designers.
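A sketch of the equation in use (plain Python, all numbers hypothetical): multiply the three factors, and note that an "optimization" that trades one factor against another can be a wash.

    insts = 1_000_000                 # instructions/program (hypothetical)
    cpi = 1.5                         # cycles/instruction (hypothetical)
    cycle_s = 1 / 2.4e9               # seconds/cycle, i.e. a 2.4 GHz clock

    exec_time = insts * cpi * cycle_s             # 6.25e-4 s = 625 us

    # Cutting path length 20% while raising CPI 25% changes nothing:
    exec_time2 = (insts * 0.8) * (cpi * 1.25) * cycle_s
    assert abs(exec_time - exec_time2) < 1e-12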
Common Architecture Trick. Assume instructions/program (path length) is constant: same benchmark, same compiler. Usually OK, but for some ideas the compiler may change. Assume seconds/cycle (cycle time) is constant: "my tweak won't impact cycle time." Often a bad assumption, since current designs are only ~12-20 FO4 inverter delays per cycle. Then just focus on cycles/instruction (CPI, or IPC). Many architecture studies do just this!
Instruction Set Architecture. "Instruction set architecture is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine." -- IBM, introducing the IBM 360 (1964). The ISA defines: the operations the processor can execute; data transfer mechanisms and how to access data; control mechanisms (branch, jump, etc.). It is the contract between the programmer/compiler and the hardware.
Classifying ISAs
Stack. Architectures with an implicit stack: it acts as source(s) and/or destination, and the top of stack (TOS) is implicit. Push and Pop operations have one explicit operand. Example, C = A + B:
Push A   // S[++TOS] = Mem[A]
Push B   // S[++TOS] = Mem[B]
Add      // Temp1 = S[TOS--]; Temp2 = S[TOS--]; S[++TOS] = Temp1 + Temp2
Pop C    // Mem[C] = S[TOS--]
x86 FP uses a stack (complicates pipelining).
Accumulator. Architectures with one implicit register: it acts as source and/or destination, and one other source is explicit. Example, C = A + B:
Load A    // (Acc)umulator <= A
Add B     // Acc <= Acc + B
Store C   // C <= Acc
The accumulator is implicit; is it a bottleneck? x86 uses accumulator concepts for integer.
Register. The most common approach: fast, temporary storage (small), with explicit operands (register IDs). Example, C = A + B, in the two register styles:
Register-memory:   Load/store:
Load R1, A         Load R1, A
Add R3, R1, B      Load R2, B
Store R3, C        Add R3, R1, R2
                   Store R3, C
All RISC ISAs are load/store; IBM 360, Intel x86, and Motorola 68K are register-memory.
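To make the three classes concrete, here is a toy Python model (dict-based memory and registers, purely illustrative, not any real ISA) that runs C = A + B in each style:

    mem = {"A": 3, "B": 4, "C": 0}

    # Stack: operands are implicit on the stack
    stack = []
    stack.append(mem["A"])                   # Push A
    stack.append(mem["B"])                   # Push B
    stack.append(stack.pop() + stack.pop())  # Add
    mem["C"] = stack.pop()                   # Pop C

    # Accumulator: one implicit register
    acc = mem["A"]                           # Load A
    acc += mem["B"]                          # Add B
    mem["C"] = acc                           # Store C

    # Load/store registers: explicit register IDs, memory touched only by loads/stores
    regs = {}
    regs["R1"] = mem["A"]                    # Load R1, A
    regs["R2"] = mem["B"]                    # Load R2, B
    regs["R3"] = regs["R1"] + regs["R2"]     # Add R3, R1, R2
    mem["C"] = regs["R3"]                    # Store R3, C

    assert mem["C"] == 7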
Common Addressing Modes
Base/Displacement   Load R4, 100(R1)
Register Indirect   Load R4, (R1)
Indexed             Load R4, (R1+R2)
Direct              Load R4, (1001)
Memory Indirect     Load R4, @(R3)
Autoincrement       Load R4, (R2)+
Scaled              Load R4, 100(R2)[R3]
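A sketch of how each mode forms its effective address (toy Python model with a flat memory array and a register dict; the register values and the 4-byte scale factor are assumptions, not from any real ISA):

    mem = [0] * 2048
    regs = {"R1": 100, "R2": 200, "R3": 300}
    mem[300] = 1500                            # a pointer in memory, for memory indirect

    ea_base_disp    = 100 + regs["R1"]         # Load R4, 100(R1)     -> 200
    ea_reg_indirect = regs["R1"]               # Load R4, (R1)        -> 100
    ea_indexed      = regs["R1"] + regs["R2"]  # Load R4, (R1+R2)     -> 300
    ea_direct       = 1001                     # Load R4, (1001)      -> 1001
    ea_mem_indirect = mem[regs["R3"]]          # Load R4, @(R3)       -> mem[300] = 1500
    ea_scaled = 100 + regs["R2"] + regs["R3"] * 4   # Load R4, 100(R2)[R3] -> 1500

    # Autoincrement uses the address, then bumps the register by the access size
    ea_autoinc = regs["R2"]                    # Load R4, (R2)+       -> 200
    regs["R2"] += 4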
What leads to a good/bad ISA? Ease of implementation (the job of the architect/designer): does the ISA lend itself to efficient implementations? Ease of programming (the job of the programmer/compiler): can the compiler use the ISA effectively? Future compatibility: ISAs may last 30+ years, so special features, address range, etc. need to be thought out.
Implementation Concerns. Simple decoding (fixed length) vs. compactness (variable length). Simple instructions (no load/update); such things get microcoded these days. Deterministic latencies are key! Instructions with multiple exceptions are difficult. More or fewer registers? More registers mean slower register files and decoding, but better compilers can exploit them. Condition codes/flags (scheduling!).
Programmability. 1960s and early 70s: code was mostly hand-coded. Late 70s and early 80s: most code was compiled, but hand-coded code was better. Mid-80s to present: most code is compiled, and it is almost as good as assembly. Why?
Programmability: 70s, Early 80s. "Closing the semantic gap": make the assembly language match the high-level languages. Efforts for computers to execute HLLs directly, e.g. the LISP machine: hardware type checking (special type bits let the type be checked efficiently at run time), hardware garbage collection, fast function calls, efficient representation of lists. It never worked out; there was a "semantic clash" instead. Too many HLLs? C was more popular? Is this coming back with Java? (Sun's picoJava)
Programmability: 1980s-2000s. "In the compiler we trust." Wulf: primitives, not solutions; compilers cannot effectively use complex instructions, so synthesize programs from primitives. Regularity: same behavior in all contexts, no odd cases; things should be intuitive. Orthogonality: data type independent of addressing mode, and addressing mode independent of the operation performed.
ISA Compatibility. "In computer architecture, no good idea ever goes unpunished." -- Marty Hopkins, IBM Fellow. Never abandon an existing code base; it is extremely difficult to introduce a new ISA. Alpha failed, IA-64 is struggling; the best solution may not win (x86 is the most popular and the least liked!). It is hard to think ahead: an ISA tweak may buy 5-10% today, but 10 years later it may buy nothing and still must be implemented (register windows, delayed branches).
CISC vs. RISC. The debate raged from the early 80s through the 90s; now it is fairly irrelevant. Despite this, Intel (x86 => Itanium) and DEC/Compaq (VAX => Alpha) have tried to switch. Research in the late 70s/early 80s led to RISC: the IBM 801 (John Cocke, mid-70s), Berkeley RISC-1 (Patterson), and Stanford MIPS (Hennessy).
VAX. A 32-bit ISA; instructions could be huge (up to 321 bytes); 16 GPRs. Operated on data types from 8 to 128 bits, plus decimals and strings. Orthogonal and memory-to-memory, with all operand modes supported. Hundreds of special instructions. The compiler was simple; hand-coding was common. CPI was over 10!
x86. A variable-length ISA (1-16 bytes). FP uses an operand stack. Two-operand instructions (extended accumulator). Register-register and register-memory support. Scaled addressing modes. It has been extended many times (as AMD did with x86-64). Intel initially went to IA-64, then backtracked (EM64T).
RISC vs. CISC Arguments.
RISC: simple implementation (load/store, fixed-format 32-bit instructions, efficient pipelines); lower CPI; compilers do a lot of the hard work. MIPS = Microprocessor without Interlocked Pipeline Stages.
CISC: simple compilers (assists hand-coding; many addressing modes; many instructions); code density.
MIPS/VAX Comparison (comparison table from the original slide not reproduced)
After the dust settled... it turns out it doesn't matter much. You can decode CISC instructions into an internal micro-ISA. This takes a couple of extra cycles (a PLA implementation) and a few hundred thousand transistors; in 20-stage-pipeline, 55M-transistor processors this is minimal. The Pentium 4 caches these micro-ops. It may actually have some advantages: the external ISA is kept for compatibility, while the internal ISA can be tweaked each generation (Transmeta).
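A hedged sketch of the idea in Python (toy mnemonics and tuple format of my own invention, not Intel's actual micro-ops): the decoder cracks a register-memory CISC add into load/store-style primitives that the internal pipeline executes.

    def crack(inst):
        # Split a toy register-memory instruction into RISC-like micro-ops.
        op, dst, src = inst
        if op == "ADD" and src.startswith("["):       # ADD reg, [mem]
            return [("LOAD", "tmp0", src),            # tmp0 <= Mem[src]
                    ("ADD", dst, dst, "tmp0")]        # dst  <= dst + tmp0
        return [inst]                                 # register-register passes through

    print(crack(("ADD", "EAX", "[0x1000]")))
    # [('LOAD', 'tmp0', '[0x1000]'), ('ADD', 'EAX', 'EAX', 'tmp0')]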