Instructor Information


Instructor Information

55:132/22C:160 High Performance Computer Architecture, Spring 2008

Instructor: Jon Kuhl (That's me)
- Office: 406A SC
- Office Hours: 10:30-11:30 a.m. MWF (other times by appointment)
- E-mail: kuhl@engineering.uiowa.edu
- Phone: (319)

TA: Prasidha Mohandas
- Office: 33 SC
- Office hours: t.b.d.

Class Info
- Website:

Texts:
- Required: Shen and Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw Hill.
- Supplemental: Thomas and Moorby, The Verilog Hardware Description Language, Third Edition, Kluwer Academic Publishers, 1996.
- Additional reference: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition, Morgan Kaufmann, 2007.

Course Objectives
- Understand quantitative measures for assessing and comparing processor performance
- Understand modern processor design techniques, including:
  - pipelining
  - instruction-level parallelism
  - multi-threading
  - high-performance memory architecture
- Master the use of modern design tools (HDLs) to design and analyze processors
- Do case studies of contemporary processors
- Discuss future trends in processor design

Expected Background

A previous course in computer architecture/organization covering:
- Instruction set architecture (ISA): addressing modes, assembly language
- Basic computer organization
- Memory system organization: cache, virtual memory, etc.
22C:060 or 55:035 or equivalent.

Course Organization
- Homework assignments: several
- Two projects (design/analysis exercises using the Verilog HDL and the ModelSim simulation environment)
- Two exams:
  - Midterm: Wed. March 12, in class
  - Final: Tues. May 13, 2:15-4:15 p.m.

Course Organization (continued)
Grading:
- Better of midterm/final exam scores: 35%
- Poorer of midterm/final exam scores: 25%
- Homework: 10%
- Projects: 30%

Historical Perspectives
The decade of the 1970s: Birth of Microprocessors
- Programmable controllers
- Single-chip microprocessors
- Personal computers (PC)
The decade of the 1980s: Quantitative Architecture
- Instruction pipelining
- Fast cache memories
- Compiler considerations
- Workstations
The decade of the 1990s: Instruction-Level Parallelism
- Superscalar, speculative microarchitectures
- Aggressive compiler optimizations
- Low-cost desktop supercomputing

Moore's Law

Moore's Law (1965): the number of devices that can be integrated on a single piece of silicon will double roughly every 18-24 months.
Moore's law has held true for 40 years and will continue to hold for at least another decade.

[Chart: Intel microprocessor transistor counts by year]

[Chart: processor performance by year, 1982-1997, for DEC Alpha (21264, 21164, 21064, AXP/500), SUN-4, MIPS M/120, MIPS M2000, IBM RS6000, IBM POWER, and HP 9000/750 machines]
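As a back-of-the-envelope illustration (my own sketch, not from the slides), the growth factor implied by a fixed doubling period follows directly from the slide's statement; the 40-year span matches the claim above, while the doubling period is a parameter:

```python
# Illustrative only: growth factor implied by Moore's Law for an assumed
# doubling period. 24 months over 40 years gives 2**20, about a million-fold.

def moore_growth(years: float, doubling_months: float) -> float:
    """Multiplicative growth in device count over `years`."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings

for months in (18, 24):
    print(f"doubling every {months} months for 40 years: "
          f"{moore_growth(40, months):,.0f}x")
```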

Evolution of Single-Chip Micros

                     1970s       1980s      1990s          2000s
Transistor count:    10K-100K    100K-1M    1M-100M        100M-1B
Clock frequency:     0.2-2 MHz   2-20 MHz   20 MHz-1 GHz   1-10 GHz
Instructions/cycle:  < 0.1       0.1-0.9    0.9-1.9        1.9-2.9 (?)
MIPS/MFLOPS:         < 1         1-20       20-1,000       1,000-100,000

Performance Growth in Perspective
Doubling every 24 months (1971-2007): a total of roughly 260,000X. At that rate:
- Cars would travel at 25 million MPH and get 5 million miles/gallon
- Air travel: L.A. to N.Y. in 0.1 seconds
- Corn yield: 50 million bushels per acre

A Quote from Robert Cringely
"If the automobile had followed the same development as the computer, a Rolls-Royce would today cost $100, get a million miles per gallon, and explode once a year, killing everyone inside."

Convergence of Key Enabling Technologies
VLSI:
- Submicron CMOS feature sizes: Intel is shipping 45nm chips and has demonstrated 32nm (2x increase in density every 2 years)
- Metal layers: 3 -> 4 -> 5 -> 6 -> 9 (copper)
- Power supply voltage: 5V -> 3.3V -> 2.4V -> 1.8V -> 0.8V
CAD tools:
- Interconnect simulation and critical path analysis
- Clock signal propagation analysis
- Process simulation and yield analysis/learning
Microarchitecture:
- Superpipelined and superscalar machines
- Speculative and dynamic microarchitectures
- Simulation tools and emulation systems
Compilers:
- Extraction of instruction-level parallelism
- Aggressive and speculative code scheduling
- Object code translation and optimization

Instruction Set Processing

ARCHITECTURE (ISA): the programmer/compiler view
- Functional appearance (interface) to the user/system programmer
- Opcodes, addressing modes, architected registers, IEEE floating point
- Serves as the specification for processor design

IMPLEMENTATION (microarchitecture): the processor designer view
- Logical structure or organization that performs the architecture
- Pipelining, functional units, caches, physical registers

REALIZATION (chip): the chip/system designer view
- Physical structure that embodies the implementation
- Gates, cells, transistors, wires

Iron Law

  Time      Instructions       Cycles           Time
  ------- = ------------  x  -----------  x  -------
  Program   Program           Instruction      Cycle
            (code size)       (CPI)            (cycle time)

Processor performance is the reciprocal of Time/Program.

Architecture --> Implementation --> Realization
(Compiler designer --> processor designer --> chip designer)

Iron Law (continued)
- Instructions/Program: instructions executed, not static code size; determined by algorithm, compiler, ISA
- Cycles/Instruction: determined by ISA and CPU organization; overlap among instructions reduces this term
- Time/Cycle: determined by technology, organization, clever circuit design

Overall Goal
- Minimize time, which is the product, NOT isolated terms
- A common error is to miss terms while devising optimizations, e.g., an ISA change that decreases instruction count BUT leads to a CPU organization that makes the clock slower
- Bottom line: the terms are inter-related
- This is the crux of the RISC vs. CISC argument
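A minimal sketch of the Iron Law as code (my example values, not from the course):

```python
def exec_time_s(instructions: float, cpi: float, cycle_time_ns: float) -> float:
    """Iron Law: time/program = instruction count x CPI x cycle time."""
    return instructions * cpi * cycle_time_ns * 1e-9

# 1e9 dynamic instructions at CPI 1.5 on a 1 ns (1 GHz) clock -> 1.5 seconds.
print(exec_time_s(1e9, 1.5, 1.0))
```

Halving any one factor halves the product, which is why optimizations must be judged on all three terms at once.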

Instruction Set Architecture

ISA: the boundary between software and hardware
- Specifies the logical machine that is visible to the programmer
- Also serves as a functional spec for the processor designers

What needs to be specified by an ISA:
- Operations: what to perform and what to perform next
- Temporary operand storage in the CPU: accumulator, stacks, registers
- Number of operands per instruction
- Operand location: where and how to specify the operands
- Type and size of operands
- Instruction-to-binary encoding

Operand Storage
- Registers (in processor) vs. memory?
  - faster access
  - shorter address
- Accumulator:
  - less hardware
  - high memory traffic, a likely bottleneck
- Stack, LIFO (1960s-70s):
  - simple addressing (top of stack is implicit)
  - a bottleneck while pipelining (why?)
  - note: the JAVA VM is stack-based
- Registers, 8 to 256 words:
  - flexible: temporaries and variables
  - registers must be named: code density and a second name space

Caches vs. Registers
Registers are:
- faster (no addressing modes, no tags)
- deterministic (no misses)
- replicable for more ports
- named by a short identifier
But:
- they must be saved/restored on procedure calls
- you can't take the address of a register (distinct from memory)
- they are fixed size (FP, strings, structures)
- compilers must manage them (an advantage?)

Registers vs. Caches (continued)
How many registers? More registers means:
- operands are held longer (reducing memory traffic and run time)
- longer register specifiers (except with register windows)
- slower registers
- more state, which slows context switches

Operands for ALU Instructions
ALU instructions require operands.
- Number of explicit operands:
  - two: Ri := Ri op Rj
  - three: Ri := Rj op Rk
- Operands in registers or memory:
  - any combination: VAX (variable-length instructions)
  - at least one register: IBM 360/370
  - all registers: Cray, RISCs (separate load/store instructions)

VAX Addressing Modes
- register: Ri
- immediate: #n
- displacement: M[Ri + #n]
- register indirect: M[Ri]
- indexed: M[Ri + Rj]
- memory indirect: M[M[Ri]]
- auto-increment: M[Ri]; Ri += d
- auto-decrement: M[Ri]; Ri -= d
- scaled: M[Ri + #n + Rj * d]
- update: M[Ri = Ri + #n]
- absolute: M[#n]
Modes 1-4 account for 93% of all VAX operands [Clark and Emer].

Operations
- arithmetic and logical: and, add
- data transfer: move, load, store
- control: branch, jump, call
- system: system call, traps
- floating point: add, mul, div, sqrt
- decimal: addd, convert
- string: move, compare
- multimedia? 2D, 3D? e.g., Intel MMX/SSE and Sun VIS
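To make the mode semantics concrete, here is a hedged sketch (mine, not the lecture's) of how a few of these modes compute an effective address over toy register and memory structures:

```python
# Toy machine state: 16 GPRs and a sparse word-addressed memory.
regs = [0] * 16
mem = {}

def displacement(ri: int, n: int) -> int:       # M[Ri + #n]
    return regs[ri] + n

def register_indirect(ri: int) -> int:          # M[Ri]
    return regs[ri]

def indexed(ri: int, rj: int) -> int:           # M[Ri + Rj]
    return regs[ri] + regs[rj]

def memory_indirect(ri: int) -> int:            # M[M[Ri]]: one extra memory read
    return mem.get(regs[ri], 0)

def autoincrement(ri: int, d: int = 4) -> int:  # M[Ri]; Ri += d (side effect!)
    addr = regs[ri]
    regs[ri] += d
    return addr
```

The side effect in autoincrement, and the extra memory access in memory indirect, hint at why complex modes complicate pipelining.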

Control Instructions (Branches)
1. Types of branches:
   A. Conditional or unconditional
   B. Save PC?
   C. How is the target computed?
      - Single target (immediate, PC+immediate)
      - Multiple targets (register)
2. Branch architectures:
   A. Condition code or condition registers
   B. Register

Save or Restore State
What state?
- function calls: registers (CISC)
- system calls: registers, flags, PC, PSW, etc.
Hardware need not save registers:
- the caller can save registers in use
- the callee can save registers it will use
Hardware register save:
- IBM STM, VAX CALLS
- faster?
Most recent architectures do no register saving, or do implicit register saving with register windows (SPARC).

VAX
DEC, 1977: VAX-11/780
- upward compatible from the PDP-11
- 32-bit words and addresses
- virtual memory
- 16 GPRs (r15 = PC, r14 = SP), CCs
- extremely orthogonal, and memory-memory
- decoded as a byte stream; variable in length
- opcode: operation, # of operands, operand types

Data types:
- 8, 16, 32, 64, 128 bit
- char string: 8 bits/char
- decimal: 4 bits/digit
- numeric string: 8 bits/digit

Addressing modes:
- literal: 6 bits
- 8, 16, 32 bit immediates
- register, register deferred
- 8, 16, 32 bit displacements
- 8, 16, 32 bit displacements deferred
- indexed (scaled)
- autoincrement, autodecrement
- autoincrement deferred

VAX Operations
- data transfer, including string move
- arithmetic and logical (2- and 3-operand)
- control (branch, jump, etc.), AOBLEQ
- function calls: save state
- bit manipulation
- floating point: add, sub, mul, div, polyf
- system: exceptions, VM
- other: crc (cyclic redundancy check), insque (insert in queue)

VAX Encoding Example: addl3 r1, 737(r2), #456
- byte 1: addl3 opcode
- byte 2: mode, r1
- byte 3: mode, r2
- bytes 4-5: 737
- byte 6: mode (immediate)
- bytes 7-10: 456
The VAX has too many modes and formats. The big deal with RISC is not fewer instructions: few modes/formats means fast decoding, which facilitates pipelining.

VAX-11/780
- First implementation of the VAX ISA
- 84% of instructions simple, 9% branches
- loop branches 91% taken, other branches 41% taken
- Operands: register mode 41%, complex addressing 6%
- Implementation: 200 ns cycle => 0.5 MIPS
- 50% of time decoding, simple instructions only
- 10% of time memory stalls
- 2.1 CPI (<< 10.6)

Anatomy of a Modern ISA
- Operations: simple ALU ops, data movement, control transfer
- Temporary operand storage in the CPU: a large general-purpose register (GPR) file
- Number of operands per instruction: triadic, A <- B op C
- Operand location: load-store architecture with register indirect addressing
- Type and size of operands: 32/64-bit integers, IEEE floats
- Instruction-to-binary encoding: fixed width, regular fields
- Exceptions: Intel x86, IBM 390 (aka z900)
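To see why this encoding resists pipelined decoding, here is a small illustrative sketch (mine, with made-up mode names and simplified specifier lengths, not the real VAX encoding): each operand specifier's length depends on its mode byte, so the specifiers must be walked serially before the next instruction can even be located.

```python
# Simplified specifier sizes: mode byte plus payload bytes (assumed values).
SPEC_LEN = {"reg": 1, "disp16": 3, "imm32": 5}

def instr_length(opcode_bytes, specifiers):
    """Walk specifiers in order; each one's length gates finding the next."""
    length = opcode_bytes
    for mode in specifiers:
        length += SPEC_LEN[mode]   # unknown until the mode byte is decoded
    return length

# addl3 r1, 737(r2), #456: 1 opcode byte + 1 + 3 + 5 operand bytes = 10 bytes
print(instr_length(1, ["reg", "disp16", "imm32"]))
```

A fixed-width RISC encoding makes this a constant, so many instructions can be fetched and decoded in parallel.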

Dynamic-Static Interface

Program (Software) <--> Architecture <--> Machine (Hardware)
- Compiler complexity vs. hardware complexity
- Exposed to software vs. hidden in hardware
- The DSI sits at the semantic gap between software and hardware, dividing static (compile-time) from dynamic (run-time) work
- Placement of the DSI determines how the gap is bridged

Dynamic-Static Interface (continued)
HLL Program -> DSI-1 (~DEL) -> DSI-2 (~CISC/VLIW) -> DSI-3 (~RISC) -> Hardware
- A low-level DSI exposes more knowledge of the hardware through the ISA
- It places a greater burden on the compiler/programmer
- Optimized code becomes specific to the implementation
- In fact, this happens for higher-level DSIs also

The Role of the Compiler
Phases to manage complexity:
- Parsing --> intermediate representation
- Procedure inlining
- Loop optimizations
- Common sub-expression elimination
- Jump optimization
- Constant propagation
- Register allocation
- Strength reduction
- Pipeline scheduling
- Code generation --> assembly code

Performance and Cost
Which computer is fastest? Not so simple:
- Scientific simulation: FP performance
- Program development: integer performance
- Commercial workload: memory, I/O

Performance of Computers
Want to buy the fastest computer for what you want to do?
- Workload is all-important
- Correct measurement and analysis
Want to design the fastest computer for what the customer wants to pay?
- Cost is always an important criterion
Speed is not always the only performance criterion:
- Power
- Area

Defining Performance
What is important to whom?
- Computer system user: minimize elapsed time for a program, time_end - time_start; called response time
- Computer center manager: maximize completion rate, # of jobs/second; called throughput

Improve Performance
Improve (a) response time or (b) throughput?
- Faster CPU: helps both (a) and (b)
- Add more CPUs: helps (b), and perhaps (a) due to less queuing

Performance Comparison
- Machine A is n times faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = n
- Machine A is x% faster than machine B iff perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
E.g., time(A) = 10s, time(B) = 15s:
- 15/10 = 1.5 => A is 1.5 times faster than B
- 15/10 = 1.5 => A is 50% faster than B
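The slide's example, checked in a couple of lines (values from the slide):

```python
time_a, time_b = 10.0, 15.0           # seconds
n_times = time_b / time_a             # 1.5 -> A is 1.5 times faster than B
percent = (time_b / time_a - 1) * 100 # 50.0 -> A is 50% faster than B
print(n_times, percent)
```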

Other Metrics: MIPS and MFLOPS

MIPS = instruction count / (execution time x 10^6) = clock rate / (CPI x 10^6)
But MIPS has serious shortcomings.

Problems with MIPS
E.g., without FP hardware, an FP op may take 50 single-cycle instructions; with FP hardware, only one 2-cycle instruction. Thus, adding FP hardware:
- CPI increases (why?)
- Instructions/program decreases (why?)
- Total execution time decreases
- BUT, MIPS gets worse! Cycles/instructions per FP op: 50/50 => 2/1, so CPI goes from 1 to 2 and the MIPS rating is cut in half even as the program speeds up.

Problems with MIPS (continued)
- Ignores the program
- Usually used to quote peak performance: ideal conditions, so it is a "guaranteed not to exceed" figure!
When is MIPS ok?
- Same compiler, same ISA; e.g., the same binary running on a Pentium III and a Pentium 4
- Why? Instructions/program is constant and can be ignored

Other Metrics: MFLOPS
MFLOPS = FP ops in program / (execution time x 10^6)
- Assumes FP ops are independent of compiler and ISA
- Often safe for numeric codes: matrix size determines the # of FP ops/program
- However, not always safe:
  - Missing instructions (e.g., FP divide, sqrt/sin/cos)
  - Optimizing compilers
Relative MIPS and normalized MFLOPS:
- Normalized to some common baseline machine
- E.g., VAX MIPS in the 1980s
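The FP-hardware paradox above can be reproduced directly; the 100 MHz clock here is an assumed value just to make the numbers concrete:

```python
clock_hz = 100e6   # assumed clock rate, not from the slides

# One FP operation: 50 single-cycle instructions vs. 1 two-cycle instruction.
for label, instrs, cycles in (("software FP (50 x 1-cycle)", 50, 50),
                              ("hardware FP (1 x 2-cycle)", 1, 2)):
    t = cycles / clock_hz
    print(f"{label}: time = {t * 1e9:.0f} ns, MIPS = {instrs / (t * 1e6):.0f}")
```

Execution time improves 25x (500 ns to 20 ns) while the MIPS rating falls from 100 to 50: the metric moves opposite to actual performance.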

Iron Law Example
Machine A: clock 1 ns, CPI 2.0, for program X.
Machine B: clock 2 ns, CPI 1.2, for program X.
Which is faster, and by how much?
- Time/Program = instr/program x cycles/instr x sec/cycle
- Time(A) = N x 2.0 x 1 = 2N
- Time(B) = N x 1.2 x 2 = 2.4N
- Compare: Time(B)/Time(A) = 2.4N/2N = 1.2
So, Machine A is 20% faster than Machine B for this program.

Iron Law Example (continued)
Keep clock(A) = 1 ns. For equal performance, if CPI(B) = 1.2, what is CPI(A)?
- Time(B)/Time(A) = 1 = (N x 1.2 x 2)/(N x 1 x CPI(A))
- CPI(A) = 2.4

Iron Law Example (continued)
Keep CPI(A) = 2.0 and CPI(B) = 1.2. For equal performance, if clock(B) = 2 ns, what is clock(A)?
- Time(B)/Time(A) = 1 = (N x 2.0 x clock(A))/(N x 1.2 x 2)
- clock(A) = 1.2 ns

Another Example
OP      Freq   Cycles
ALU     43%    1
Load    21%    1
Store   12%    2
Branch  24%    2
Assume stores can execute in 1 cycle by slowing the clock 15%. Should this be implemented?
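A quick numeric check of the three worked examples above (the instruction count N cancels out):

```python
def time_per_prog(n_instr, cpi, cycle_ns):
    """Iron Law, with time in ns."""
    return n_instr * cpi * cycle_ns

n = 1.0  # cancels in every ratio below
print(time_per_prog(n, 1.2, 2.0) / time_per_prog(n, 2.0, 1.0))  # 1.2 -> A 20% faster

# Equal-performance variants:
print(1.2 * 2.0 / 1.0)   # CPI(A) = 2.4 when clock(A) stays 1 ns
print(1.2 * 2.0 / 2.0)   # clock(A) = 1.2 ns when CPI(A) stays 2.0
```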

Example (continued): let's do the math.
OP      Freq   Cycles
ALU     43%    1
Load    21%    1
Store   12%    2
Branch  24%    2
- Old CPI = 0.43 x 1 + 0.21 x 1 + 0.12 x 2 + 0.24 x 2 = 1.36
- New CPI = 0.43 x 1 + 0.21 x 1 + 0.12 x 1 + 0.24 x 2 = 1.24
- Speedup = old time / new time = {P x old CPI x T} / {P x new CPI x 1.15T} = 1.36/(1.24 x 1.15) = 0.95
Answer: don't make the change.

Which Programs?
Execution time of what program?
- Best case: you always run the same set of programs; port them and time the whole workload
- In reality, use benchmarks:
  - Programs chosen to measure performance
  - Predict performance of the actual workload
  - Saves effort and money
  - Representative? Honest? "Benchmarketing"...

Types of Benchmarks
- Real programs: representative of a real workload; the only accurate way to characterize performance; requires considerable work
- Kernels or microbenchmarks: representative program fragments; good for focusing on individual features, not the big picture
- Instruction mixes: instruction frequency of occurrence; used to calculate CPI

Benchmarks: SPEC2000
- System Performance Evaluation Cooperative: formed in the 1980s to combat benchmarketing
- SPEC89, SPEC92, SPEC95, now SPEC2000
- 12 integer and 14 floating-point programs
- A 300 MHz Sun Ultra-5 reference machine has a score of 100
- Report the geometric mean of ratios to the reference machine
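The SPEC scoring rule above, sketched in code; the benchmark times are hypothetical, chosen only to show the mechanics:

```python
from math import prod

def spec_ratio(ref_time, run_time):
    """Per-benchmark ratio; the reference machine scores 100 by construction."""
    return ref_time / run_time * 100

def spec_score(ratios):
    """Overall score: geometric mean of the per-benchmark ratios."""
    return prod(ratios) ** (1 / len(ratios))

# Hypothetical (reference_time, measured_time) pairs in seconds:
pairs = [(1400, 700), (1200, 400), (1800, 900), (1600, 400)]
ratios = [spec_ratio(r, t) for r, t in pairs]
print(round(spec_score(ratios)))   # about 263
```

The geometric mean is used so that no single benchmark's ratio dominates the summary the way it would in an arithmetic mean.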

Benchmarks: SPEC CINT2000

Benchmark     Description
164.gzip      Compression
175.vpr       FPGA place and route
176.gcc       C compiler
181.mcf       Combinatorial optimization
186.crafty    Chess
197.parser    Word processing, grammatical analysis
252.eon       Visualization (ray tracing)
253.perlbmk   PERL script execution
254.gap       Group theory interpreter
255.vortex    Object-oriented database
256.bzip2     Compression
300.twolf     Place and route simulator

Benchmarks: SPEC CFP2000

Benchmark     Description
168.wupwise   Physics/quantum chromodynamics
171.swim      Shallow water modeling
172.mgrid     Multi-grid solver: 3D potential field
173.applu     Parabolic/elliptic PDE
177.mesa      3-D graphics library
178.galgel    Computational fluid dynamics
179.art       Image recognition/neural networks
183.equake    Seismic wave propagation simulation
187.facerec   Image processing: face recognition
188.ammp      Computational chemistry
189.lucas     Number theory/primality testing
191.fma3d     Finite-element crash simulation
200.sixtrack  High energy nuclear physics accelerator design
301.apsi      Meteorology: pollutant distribution

Benchmark Pitfalls
- Benchmark not representative: if your workload is I/O bound, SPECint is useless
- Benchmark is too old: benchmarks age poorly; benchmarketing pressure causes vendors to optimize compilers/hardware/software to the benchmarks, so they need to be refreshed periodically

Benchmark Pitfalls (continued)
- Choosing a benchmark from the wrong application space: e.g., in a realtime environment, choosing gcc
- Choosing benchmarks from no application space: e.g., synthetic workloads, especially unvalidated ones
- Using toy benchmarks (dhrystone, whetstone): e.g., used to "prove" the value of RISC in the early 1980s
- Mismatch of benchmark properties with the scale of features studied: e.g., using SPECint for large cache studies

Benchmark Pitfalls (continued)
- Carelessly scaling benchmarks:
  - Truncating benchmarks
  - Using only the first few million instructions
  - Reducing program data size
- Too many easy cases: may not show the value of a feature
- Too few easy cases: may exaggerate the importance of a feature

Scalar to Superscalar
- Scalar processor: fetches and issues at most one instruction per machine cycle
- Superscalar processor: fetches and issues multiple instructions per machine cycle
- Can also define superscalar in terms of how many instructions can complete execution in a given machine cycle
- Note that only a superscalar architecture can achieve a CPI of less than 1

Processor Performance

  Time      Instructions       Cycles           Time
  ------- = ------------  x  -----------  x  -------
  Program   Program           Instruction      Cycle
            (code size)       (CPI)            (cycle time)

- In the 1980s (decade of pipelining): CPI 5.0 => 1.5
- In the 1990s (decade of superscalar): CPI 1.5 => 0.5 (best case)

Amdahl's Law
(Originally formulated for vector processing.)
- f = fraction of the program that is vectorizable
- (1-f) = fraction that is serial
- N = speedup for the vectorizable portion
Overall speedup:

  Speedup = 1 / ((1 - f) + f/N)

[Diagram: N processors active for fraction f of the time, one processor for fraction (1-f)]
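Amdahl's Law as a one-liner, for experimenting with f and N (my sketch of the formula above):

```python
def amdahl(f, n):
    """Overall speedup when fraction f of the work gets speedup n."""
    return 1.0 / ((1.0 - f) + f / n)

print(round(amdahl(0.9, 10), 2))    # 5.26: 10-way parallelism, 90% vectorizable
print(round(amdahl(0.9, 1e12), 2))  # 10.0: the N -> infinity limit is 1/(1-f)
```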

Amdahl's Law (continued)
Sequential bottleneck: even if N is infinite, performance is limited by the nonvectorizable portion (1-f):

  lim (N -> infinity) 1 / ((1 - f) + f/N) = 1/(1 - f)

Ramifications of Amdahl's Law
- Consider f = 0.9, (1-f) = 0.1: for N -> infinity, Speedup -> 10
- Consider f = 0.5, (1-f) = 0.5: for N -> infinity, Speedup -> 2
- Consider f = 0.1, (1-f) = 0.9: for N -> infinity, Speedup -> 1.1

[Plot: maximum achievable speedup vs. parallelizable fraction f]

Pipelining
Unpipelined operation: inputs I1, I2, I3, ... produce outputs O1, O2, ..., each requiring time T.
- Time required to process K inputs = KT
Perfect pipeline of N stages, T/N per stage: Stage 1 | Stage 2 | Stage 3 | ... | Stage N
- Time required to process K inputs = (K + N - 1)(T/N)
- Note: for K >> N, the processing time approaches KT/N, i.e., a speedup of N
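The fill/drain effect in the pipeline timing above is easy to tabulate (my sketch, using the slide's formula with T = 1):

```python
def pipeline_time(k, n, t=1.0):
    """Time for k inputs on an n-stage pipeline: fill (n - 1) then one per stage."""
    return (k + n - 1) * (t / n)

N = 10
for k in (1, 10, 100, 1000):
    print(k, round(k * 1.0 / pipeline_time(k, N), 2))  # speedup -> N as k grows
```

For k = 1 there is no speedup at all; only long runs of back-to-back inputs approach the ideal factor of N.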

Pipelined Performance Model (Amdahl's Law Applied to Pipelining)
- g = fraction of time the pipeline is full
- (1-g) = fraction of time the pipeline is not full (stalled)
- N = pipeline depth (number of stages)
Overall speedup:

  Speedup = 1 / ((1 - g) + g/N)

[Diagram: N-deep pipeline, full for fraction g of the time, stalled for fraction (1-g)]

Tyranny of Amdahl's Law [Bob Colwell]
- When g is even slightly below 100%, a big performance hit results
- Stalled cycles are the key adversary and must be minimized as much as possible
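A few values make Colwell's point vivid (my sketch of the model above, for an assumed 10-stage pipeline):

```python
def pipeline_speedup(g, n):
    """Speedup of an n-deep pipeline that is full a fraction g of the time."""
    return 1.0 / ((1.0 - g) + g / n)

N = 10
for g in (1.0, 0.99, 0.9, 0.5):
    print(g, round(pipeline_speedup(g, N), 2))  # 10.0, 9.17, 5.26, 1.82
```

Dropping from g = 1.0 to g = 0.9 forfeits nearly half of the 10-stage pipeline's ideal speedup.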

Superscalar Proposal
Moderate the tyranny of Amdahl's Law:
- Ease the sequential bottleneck
- More generally applicable
- Robust (less sensitive to f)
Revised Amdahl's Law, with s = amount of parallelism for nonvectorizable instructions:

  Speedup = 1 / ((1 - f)/s + f/N)

Motivation for Superscalar [Agerwala and Cocke]
- Speedup jumps from 3 to 4.3 for N = 6 and f = 0.8 when s = 2 instead of s = 1 (scalar)

[Plot: speedup vs. vectorizability f for several machine sizes, including n = 6 with s = 2; the typical range of f is marked]

Limits on Instruction-Level Parallelism (ILP)
Weiss and Smith [1984]        1.58
Sohi and Vajapeyam [1987]     1.8
Tjaden and Flynn [1970]       1.86  (Flynn's bottleneck)
Tjaden and Flynn [1973]       1.96
Uht [1986]                    2.00
Smith et al. [1989]           2.00
Jouppi and Wall [1988]        2.40
Johnson [1991]                2.50
Acosta et al. [1986]          2.79
Wedig [1982]                  3.00
Butler et al. [1991]          5.8
Melvin and Patt [1991]        6
Wall [1991]                   7   (Jouppi disagreed)
Kuck et al. [1972]            8
Riseman and Foster [1972]     51  (no control dependences)
Nicolau and Fisher [1984]     90  (Fisher's optimism)

Superscalar Proposal (continued)
- Go beyond the single-instruction pipeline; achieve IPC > 1
- Dispatch multiple instructions per cycle
- Provide a more generally applicable form of concurrency (not just vectors)
- Geared for sequential code that is hard to parallelize otherwise
- Exploit fine-grained or instruction-level parallelism (ILP)
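Checking the Agerwala-Cocke numbers against the revised law (my sketch):

```python
def revised_amdahl(f, n, s):
    """Speedup with s-way parallelism even in the nonvectorizable portion."""
    return 1.0 / ((1.0 - f) / s + f / n)

print(round(revised_amdahl(0.8, 6, 1), 2))  # 3.0  -- scalar baseline, s = 1
print(round(revised_amdahl(0.8, 6, 2), 2))  # 4.29 -- s = 2, the quoted jump
```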

Classifying ILP Machines [Jouppi, DECWRL 1991]
Baseline scalar RISC:
- Issue parallelism = IP = 1
- Operation latency = OP = 1
- Peak IPC = 1

Superpipelined (cycle time = 1/m of baseline):
- Issue parallelism = IP = 1 inst / minor cycle
- Operation latency = OP = m minor cycles
- Peak IPC = m instr / major cycle (m x speedup?)
[Diagram: IF-DE-EX-WB stages of successive instructions vs. time in baseline-machine cycles]

Superscalar:
- Issue parallelism = IP = n inst / cycle
- Operation latency = OP = 1 cycle
- Peak IPC = n instr / cycle (n x speedup?)

VLIW (Very Long Instruction Word):
- Issue parallelism = IP = n inst / cycle
- Operation latency = OP = 1 cycle
- Peak IPC = n instr / cycle = 1 VLIW / cycle

Classifying ILP Machines [Jouppi, DECWRL 1991] (continued)
Superpipelined-Superscalar:
- Issue parallelism = IP = n inst / minor cycle
- Operation latency = OP = m minor cycles
- Peak IPC = n x m instr / major cycle

Superscalar vs. Superpipelined
- Roughly equivalent performance: if n = m, then both have about the same IPC
- Parallelism is exposed in space vs. time
[Diagram: superscalar vs. superpipelined instruction timing (IFetch, Dcode, Execute, Writeback) in base-machine cycles]
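Summarizing the taxonomy in one expression (my sketch; n and m values below are arbitrary examples):

```python
def peak_ipc(n, m):
    """Issue width n per (minor) cycle times superpipelining degree m,
    measured per baseline major cycle."""
    return n * m

print(peak_ipc(1, 1))  # baseline scalar RISC
print(peak_ipc(1, 3))  # superpipelined, m = 3
print(peak_ipc(3, 1))  # superscalar (or VLIW), n = 3
print(peak_ipc(3, 3))  # superpipelined-superscalar
```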
