Performance, Cost and Amdahl's Law. Arquitectura de Computadoras

Performance, Cost and Amdahl's Law
Arquitectura de Computadoras
Arturo Díaz Pérez
Centro de Investigación y de Estudios Avanzados del IPN
adiaz@cinvestav.mx
Arquitectura de Computadoras Performance- 1

Performance
Purchasing perspective: given a collection of machines, which has the
» best performance?
» least cost?
» best performance / cost?
Design perspective: faced with design options, which has the
» best performance improvement?
» least cost?
» best performance / cost?
Both require
» a basis for comparison
» a metric for evaluation
Our goal is to understand the cost and performance implications of architectural choices

Two notions of performance
Plane        DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747   6.5 hours     610 mph    470          286,700
Concorde     3 hours       1350 mph   132          178,200
Which has higher performance?
Time to do the task (Execution Time)
» execution time, response time, latency
Tasks per day, hour, week, sec, ns... (Performance)
» throughput, bandwidth
Response time and throughput are often in opposition

What is Performance?
KEY: a measure of speed (rate)
» Car: miles driven per hour
» Car wash: cars washed per day
» Auto plant: cars built per year
Two metrics:
Latency (response or execution time)
» time from start to finish of a task
Throughput (bandwidth)
» rate of task completion = rate of task initiation = 1 / (time between task completions)
Deterministic vs. average

Definitions
Performance is in units of things-per-second
» bigger is better
If we are primarily concerned with response time:
\[ \text{performance}(x) = \frac{1}{\text{execution\_time}(x)} \]
"X is n times faster than Y" means
\[ n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} \]

Example
Time of Concorde vs. Boeing 747?
» Concorde is 1350 mph / 610 mph ≈ 6.5 hours / 3 hours ≈ 2.2 times faster
Throughput of Concorde vs. Boeing 747?
» Concorde is 178,200 pmph / 286,700 pmph = 0.62 "times faster"
» Boeing is 286,700 pmph / 178,200 pmph = 1.60 times faster
Boeing is 1.6 times (60%) faster in terms of throughput
Concorde is 2.2 times (120%) faster in terms of flying time
We will focus primarily on execution time for a single job
Lots of instructions in a program => instruction throughput is important!
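The two comparisons on this slide can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the figures are the ones from the Boeing 747 / Concorde table in these slides.

```python
# Figures taken from the Boeing 747 / Concorde table in these slides.
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}

def throughput_pmph(plane):
    """Passenger-miles per hour: passengers carried times cruising speed."""
    return plane["passengers"] * plane["mph"]

def times_faster(perf_x, perf_y):
    """'X is n times faster than Y' means n = Performance(X) / Performance(Y)."""
    return perf_x / perf_y

boeing, concorde = planes["Boeing 747"], planes["Concorde"]

# Latency view: performance is 1 / execution time.
latency_ratio = times_faster(1 / concorde["hours"], 1 / boeing["hours"])
# Throughput view: performance is passenger-miles per hour.
throughput_ratio = times_faster(throughput_pmph(boeing), throughput_pmph(concorde))

print(f"Concorde vs. 747 on flight time: {latency_ratio:.2f}x")
print(f"747 vs. Concorde on throughput:  {throughput_ratio:.2f}x")
```

The same `times_faster` helper serves both metrics; only the definition of "performance" passed to it changes, which is why the two planes swap places.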

Relative Performance
Definition: X is n% faster than Y if
\[ \frac{\text{execution time}_Y}{\text{execution time}_X} = \frac{\text{execution rate}_X}{\text{execution rate}_Y} = 1 + \frac{n}{100} \]
Example: X = 1 minute, Y = 2 minutes
\[ \frac{2\ \text{minutes}}{1\ \text{minute}} = 1 + \frac{100}{100} \]
Thus, X is 100% faster than Y
Example: car wash that starts one car per minute and holds four cars
» Latency = four minutes per car
» Throughput = one car per minute
» Throughput > 1/Latency due to overlap
Key idea: pipelining
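A minimal sketch of the "n% faster" definition and of the car wash example follows. The helper name `percent_faster` and the variable names are illustrative, not part of the slides.

```python
def percent_faster(time_x, time_y):
    """Return n such that time_y / time_x = 1 + n/100 (X is n% faster than Y)."""
    return (time_y / time_x - 1) * 100

# Slide example: X takes 1 minute, Y takes 2 minutes.
n = percent_faster(1, 2)

# Car wash: one car starts per minute, and each car spends four minutes inside.
latency_min = 4                       # minutes per car, start to finish
throughput = 1.0                      # cars per minute once the wash is full
unpipelined_rate = 1 / latency_min    # rate if only one car were inside at a time

print(f"X is {n:.0f}% faster than Y")
print(f"pipelined: {throughput} cars/min, unpipelined: {unpipelined_rate} cars/min")
```

The gap between `throughput` and `unpipelined_rate` is the overlap the slide attributes to pipelining: four cars are in progress at once, so completions arrive four times as often as the latency alone would suggest.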

Basis of Evaluation
Actual Target Workload
» Pros: representative
» Cons: very specific; non-portable; difficult to run or measure; hard to identify cause
Full Application Benchmarks
» Pros: portable; widely used; improvements useful in reality
» Cons: less representative
Small "Kernel" Benchmarks
» Pros: easy to run, early in design cycle
» Cons: easy to "fool"
Microbenchmarks
» Pros: identify peak capability and potential bottlenecks
» Cons: "peak" may be a long way from application performance

Metrics of performance
System levels, from application down to hardware, and the metrics used at each:
Application
» answers per month
» useful operations per second
Programming language, compiler, ISA
» (millions of) instructions per second: MIPS
» (millions of) floating-point operations per second: MFLOP/s
Datapath, control, function units, transistors, wires, pins
» megabytes per second
» cycles per second (clock rate)
Each metric has a place and a purpose, and each can be misused

Aspects of CPU Performance
\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
               instr count   CPI   clock rate
Program             X
Compiler            X         X
Instr. Set          X         X        X
Organization                  X        X
Technology                             X

CPI: Average cycles per instruction
\[ \text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}} \]
\[ \text{CPU Time} = \text{Cycle Time} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i \]
\[ \text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where } F_i = \frac{I_i}{\text{Instruction Count}} \]
Invest resources where time is spent!
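The sketch below computes CPI both ways (total cycles over instruction count, and the frequency-weighted sum) and shows that they agree. The instruction mix and the cycle time are hypothetical values chosen only for illustration.

```python
# Hypothetical instruction mix: class -> (instruction count I_i, cycles CPI_i).
mix = {
    "alu":    (600, 1),
    "load":   (200, 2),
    "store":  (100, 2),
    "branch": (100, 3),
}
cycle_time_ns = 2.0  # hypothetical clock period

instruction_count = sum(count for count, _ in mix.values())
total_cycles = sum(count * cpi for count, cpi in mix.values())

# CPI as total cycles divided by instruction count.
cpi_from_cycles = total_cycles / instruction_count

# CPI as a frequency-weighted sum of CPI_i * F_i, with F_i = I_i / instruction count.
cpi_from_freqs = sum(cpi * (count / instruction_count) for count, cpi in mix.values())

# CPU time = cycle time * sum of CPI_i * I_i.
cpu_time_ns = cycle_time_ns * total_cycles

# Share of execution time per class: this is where optimization effort pays off.
time_share = {name: count * cpi / total_cycles for name, (count, cpi) in mix.items()}

print(f"CPI = {cpi_from_cycles:.2f}, CPU time = {cpu_time_ns:.0f} ns")
for name, share in time_share.items():
    print(f"{name:>6}: {share:.0%} of cycles")
```

In this made-up mix, ALU operations are 60% of the instructions but only 40% of the cycles, which is the point of "invest resources where time is spent".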

Controversial Example
\[ \text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]
Some have argued:
» CISC CPU Time = P x 8 x T = 8PT
» RISC CPU Time = 2P x 2 x T = 4PT
» RISC CPU Time = (CISC CPU Time)/2
DISCLAIMER: The truth is much, much more complex

Amdahl's Law
Speedup due to enhancement E:
\[ \text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E} \]
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
\[ \text{ExTime}(\text{with } E) = \left( (1 - F) + \frac{F}{S} \right) \times \text{ExTime}(\text{without } E) \]
\[ \text{Speedup}(\text{with } E) = \frac{1}{(1 - F) + F/S} \]

Amdahl's Law
Let
\[ \text{Speedup} = \frac{\text{new rate}}{\text{old rate}} = \frac{\text{old latency}}{\text{new latency}} \]
Consider an enhancement x that speeds up fraction f_x of a task by S_x
\[ \text{Speedup}_{\text{overall}} = \frac{\text{old latency}}{\text{new latency}} = \frac{[(1 - f_x) + f_x] \times \text{old latency}}{(1 - f_x) \times \text{old latency} + (f_x / S_x) \times \text{old latency}} \]
Amdahl's Law gives:
\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - f_x) + (f_x / S_x)} \]

Amdahl's Law, cont.
Example: f_x = 95% and S_x = 1.10
\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.95) + (0.95 / 1.10)} \approx 1.094 \]
Example: f_x = 5% and S_x = 10
\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.05) + (0.05 / 10)} \approx 1.047 \]
Example: f_x = 5% and S_x → ∞
\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - 0.05) + (0.05 / \infty)} \approx 1.052 \]
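The three examples on this slide can be checked with a small function. This is a sketch; the name `amdahl_speedup` and the argument validation are illustrative choices, and the unbounded enhancement is modeled with `math.inf`.

```python
import math

def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the task is sped up by `factor`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# (f_x, S_x) pairs from the slide; the last one lets S_x grow without bound.
examples = [(0.95, 1.10), (0.05, 10.0), (0.05, math.inf)]
for f_x, s_x in examples:
    print(f"f_x = {f_x:.2f}, S_x = {s_x}: speedup = {amdahl_speedup(f_x, s_x):.4f}")
```

Note how a modest 10% improvement applied to 95% of the task beats a tenfold (or even unbounded) improvement applied to only 5% of it.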

Amdahl's Law Corollary
Since S_x → ∞ implies
\[ \text{Speedup}_{\text{overall}} \to \frac{1}{1 - f_x} \]
for real speedups:
\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1 - f_x) + (f_x / S_x)} < \frac{1}{1 - f_x} \]
Example:
f_x     1 / (1 - f_x)
 1%     1.01
 2%     1.02
 5%     1.05
10%     1.11
20%     1.25
50%     2.00
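The table of upper bounds can be regenerated as follows. This is a sketch with illustrative function names; the factor of 1000 is an arbitrary stand-in for "a very large but finite enhancement".

```python
def speedup_limit(fraction):
    """Upper bound 1 / (1 - f_x), approached only as S_x grows without bound."""
    return 1.0 / (1.0 - fraction)

def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the task is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

for percent in (1, 2, 5, 10, 20, 50):
    f_x = percent / 100
    finite = amdahl_speedup(f_x, 1000.0)  # arbitrary large factor, still below the bound
    print(f"{percent:>3}%  limit = {speedup_limit(f_x):.2f}  S_x = 1000 gives {finite:.4f}")
```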

Standard Example: Load/Store Machine
Operation   Frequency   Cycle Count
ALU Ops     43%         1
Loads       21%         1
Stores      12%         2
Branches    24%         2
Suppose we could make stores execute in 1 cycle by slowing down the cycle time by 15%. Should we make this optimization?
» Old CPI = 0.43 + 0.21 + (0.12 + 0.24) x 2 = 1.36
» New CPI = 0.43 + 0.21 + 0.12 + 0.24 x 2 = 1.24
\[ \frac{\text{New CPU time}}{\text{Old CPU time}} = \frac{P \times \text{New CPI} \times 1.15\,T}{P \times \text{Old CPI} \times T} \approx 1.05 \]
Conclusion: Don't make the change
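The arithmetic behind this conclusion can be sketched in Python using the frequencies and cycle counts from the table. The dictionary layout and the helper name `cpi` are illustrative.

```python
# Instruction mix from the slide: operation -> (frequency, cycles).
old_mix = {"alu": (0.43, 1), "load": (0.21, 1), "store": (0.12, 2), "branch": (0.24, 2)}
# Proposed design: stores take 1 cycle, but the cycle time grows by 15%.
new_mix = dict(old_mix, store=(0.12, 1))
cycle_time_penalty = 1.15

def cpi(mix):
    """Average cycles per instruction as a frequency-weighted sum."""
    return sum(freq * cycles for freq, cycles in mix.values())

old_cpi = cpi(old_mix)
new_cpi = cpi(new_mix)

# Instruction count P and base cycle time T cancel in the ratio.
time_ratio = (new_cpi * cycle_time_penalty) / old_cpi
print(f"old CPI = {old_cpi:.2f}, new CPI = {new_cpi:.2f}, new/old time = {time_ratio:.3f}")
```

A ratio above 1 means the modified machine is slower: the 15% longer cycle is paid by every instruction, while only the 12% of instructions that are stores benefit.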

Example (RISC processor)
Base Machine (Reg / Reg), typical mix
Op       Freq   Cycles   CPI(i)   % Time
ALU      50%    1        0.5      23%
Load     20%    5        1.0      45%
Store    10%    3        0.3      14%
Branch   20%    2        0.4      18%
Total                    2.2
How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
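One way to explore the three questions is to recompute the CPI for each scenario, assuming the instruction count and clock rate stay the same. This is a sketch, not the slide's worked answer; in particular, modeling "two ALU instructions at once" as halving the effective ALU cycle count is an assumption.

```python
# Base machine from the slide: op -> (frequency, cycles).
base = {"alu": (0.5, 1), "load": (0.2, 5), "store": (0.1, 3), "branch": (0.2, 2)}

def cpi(mix):
    """Average cycles per instruction as a frequency-weighted sum."""
    return sum(freq * cycles for freq, cycles in mix.values())

def with_cycles(mix, op, cycles):
    """Copy of `mix` with the cycle count of one operation class replaced."""
    changed = dict(mix)
    changed[op] = (mix[op][0], cycles)
    return changed

scenarios = {
    "loads take 2 cycles": with_cycles(base, "load", 2),
    "branches take 1 cycle": with_cycles(base, "branch", 1),
    # Assumption: issuing two ALU instructions at once halves their effective cycles.
    "two ALU ops at once": with_cycles(base, "alu", 0.5),
}

base_cpi = cpi(base)
# With instruction count and clock rate unchanged, speedup = old CPI / new CPI.
speedups = {name: base_cpi / cpi(mix) for name, mix in scenarios.items()}
for name, speedup in speedups.items():
    print(f"{name}: CPI = {base_cpi / speedup:.2f}, speedup = {speedup:.3f}")
```

Under these assumptions the load improvement helps most, which matches the % Time column: loads account for the largest share of execution time.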

Evaluating Instruction Sets?
Design-time metrics:
» Can it be implemented, in how long, at what cost?
» Can it be programmed? Ease of compilation?
Static metrics:
» How many bytes does the program occupy in memory?
Dynamic metrics:
» How many instructions are executed?
» How many bytes does the processor fetch to execute the program?
» How many clocks are required per instruction?
» How "lean" a clock is practical?
Best metric: time to execute the program!
» determined by Inst. Count, CPI and Cycle Time
NOTE: this depends on the instruction set, processor organization, and compilation techniques.

Corollary: Make The Common Case Fast
All instructions require an instruction fetch; only a fraction require a data fetch/store.
» Optimize instruction access over data access
Programs exhibit locality
» spatial locality
» temporal locality

Corollary: Make The Common Case Fast
Access to small memories is faster
» provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories
Regs. -> Cache -> Memory -> Disk/Tape

Marketing Metrics
Clock frequency: is 3 GHz better than 2 GHz?
Only relevant for comparing processors from the same family
» the same architecture
» the same ISA
Machines with different instruction sets?
» Intel Pentium vs. PowerPC
Programs with different instruction mixes?
» dynamic frequency of instructions
Uncorrelated to performance

Marketing Metrics
\[ \text{MIPS} = \frac{\text{Instruction Count}}{\text{Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6} \]
» machines with different instruction sets?
» programs with different instruction mixes?
» dynamic frequency of instructions
» uncorrelated to performance
\[ \text{MFLOPS} = \frac{\text{FP Operations}}{\text{Time} \times 10^6} \]
» machine dependent
» often not where time is spent
Normalized:
add, sub, compare, mult   1
divide, sqrt              4
exp, sin, ...             8
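To see why MIPS can be uncorrelated with performance, the sketch below reuses the CISC/RISC figures from the "Controversial Example" slide (P instructions at CPI 8 versus 2P instructions at CPI 2). The clock rate and the value of P are hypothetical, and the function names are illustrative.

```python
def mips(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_rate_hz / (cpi * 1e6)

def cpu_time_s(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

clock_rate_hz = 100e6   # hypothetical: both machines run at 100 MHz
p = 1_000_000           # hypothetical instruction count P for the CISC program

cisc = {"instructions": p, "cpi": 8}
risc = {"instructions": 2 * p, "cpi": 2}

cisc_mips = mips(clock_rate_hz, cisc["cpi"])
risc_mips = mips(clock_rate_hz, risc["cpi"])
cisc_time = cpu_time_s(cisc["instructions"], cisc["cpi"], clock_rate_hz)
risc_time = cpu_time_s(risc["instructions"], risc["cpi"], clock_rate_hz)

print(f"MIPS ratio (RISC/CISC): {risc_mips / cisc_mips:.1f}")
print(f"Time ratio (CISC/RISC): {cisc_time / risc_time:.1f}")
```

With these numbers the MIPS rating differs by a factor of four while the execution time differs by only a factor of two, because the two machines execute different numbers of instructions for the same program.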

Normalized MFLOPS
Not all machines implement the same FP operations
» Cray-1 does not implement Divide
» Motorola 68882 does SQRT, SIN, and COS
Not all FP operations are the same
» ADD is much faster than Divide
Normalized MFLOPS
» assign a canonical number of FP operations to a program
\[ \text{Normalized MFLOPS} = \frac{\text{Canonical FP operations}}{\text{time} \times 10^6} \]
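A minimal sketch of the normalization, using the canonical weights listed under "Normalized" in these slides (1 for add, sub, compare, mult; 4 for divide, sqrt; 8 for exp, sin). The operation counts and the run time are hypothetical, and the function names are illustrative.

```python
# Canonical weights from the "Normalized" list in these slides.
WEIGHTS = {"add": 1, "sub": 1, "compare": 1, "mult": 1,
           "divide": 4, "sqrt": 4, "exp": 8, "sin": 8}

def normalized_mflops(op_counts, seconds):
    """Canonical FP operations / (time * 10^6)."""
    canonical_ops = sum(WEIGHTS[op] * count for op, count in op_counts.items())
    return canonical_ops / (seconds * 1e6)

def native_mflops(op_counts, seconds):
    """Raw FP operations / (time * 10^6), counting every operation as one."""
    return sum(op_counts.values()) / (seconds * 1e6)

# Hypothetical operation counts and run time for one program.
counts = {"add": 4_000_000, "mult": 3_000_000, "divide": 500_000, "sqrt": 250_000}
seconds = 2.0

print(f"native MFLOPS:     {native_mflops(counts, seconds):.3f}")
print(f"normalized MFLOPS: {normalized_mflops(counts, seconds):.3f}")
```

Because the canonical count is fixed per program, two machines running it can be compared on time alone, whichever FP operations each one implements in hardware.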

Benchmarks
Real programs
» representative of real workload
» the only accurate way to characterize performance
» e.g., gcc, spice, ...
Kernels
» representative program fragments
» time-critical excerpts of real programs
» e.g., Livermore loops
Toy benchmarks
» 10-100 lines
» e.g., Sieve, Puzzle, Towers
Synthetic benchmarks
» attempt to match average frequencies of real workloads
» e.g., Whetstone, Dhrystone

Benchmarking
Reproducible results
» must control outside factors
Important factors
» program input
» version of program
» version of compiler
» optimization level
» version of operating system
» amount of memory
» number and type of disks
» version of CPU
» cache configuration

Benchmarking: SPEC
Limitations of de facto benchmarks
Dhrystone
» synthetic integer benchmark
» heavy string emphasis
» optimizing compilers cause MAJOR problems
Whetstone
» synthetic floating-point benchmark
» designed to thwart optimization
Linpack
» floating-point kernel
» DAXPY(): A(I) = B(I) + C * D(I)

SPEC95
Standard Performance Evaluation Corporation
Eighteen application benchmarks (with inputs) reflecting a technical computing workload
Eight integer
» go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
Ten floating-point intensive
» tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
Must run with standard compiler flags
» eliminates special undocumented incantations that may not even generate working code for real programs

Benchmarking: SPEC2000
Integer
Gzip       Compression
Vpr        FPGA circuit placement and routing
Gcc        C programming language compiler
Mcf        Combinatorial optimization
Crafty     Game playing: chess
Parser     Word processing
Eon        Computer visualization
Perlbmk    Perl programming language
Gap        Group theory
Vortex     Object-oriented database
Bzip2      Compression
Floating Point
Wupwise    Physics: quantum chromodynamics
Swim       Shallow water modelling
Mgrid      Multigrid solver: 3D potential field
Applu      Partial differential equations
Mesa       3D graphics library
Galgel     Computational fluid dynamics
Art        Image recognition: neural networks
Equake     Seismic wave propagation simulation
Facerec    Image processing: face recognition
Ammp       Computational chemistry
Lucas      Number theory/primality testing
Fma3d      Finite-element crash simulation
Sixtrack   Nuclear physics accelerator design
Apsi       Meteorology: pollutant distribution

Summarizing Results: A Counter-Example
A car goes 30 MPH for the first ten miles and 90 MPH for the second ten miles. What is the car's average speed over the twenty miles?
Wrong answer:
\[ \text{Avg Speed} = \frac{30\ \text{MPH} + 90\ \text{MPH}}{2} = 60\ \text{MPH} \]
Correct answer:
\[ \text{Avg Speed} = \frac{\text{total distance}}{\text{total time}} = \frac{10\ \text{miles} + 10\ \text{miles}}{(10\ \text{miles} / 30\ \text{MPH}) + (10\ \text{miles} / 90\ \text{MPH})} = \frac{20\ \text{miles}}{(1/3)\ \text{hour} + (1/9)\ \text{hour}} = 45\ \text{MPH} \]
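The difference between the two answers is easy to check numerically. The sketch below uses the distances and speeds from the slide; the function name `average_speed` is illustrative.

```python
def average_speed(segments):
    """Total distance / total time for (distance_miles, speed_mph) segments."""
    total_distance = sum(distance for distance, _ in segments)
    total_time = sum(distance / speed for distance, speed in segments)
    return total_distance / total_time

trip = [(10, 30), (10, 90)]   # ten miles at 30 MPH, then ten miles at 90 MPH

naive = sum(speed for _, speed in trip) / len(trip)   # arithmetic mean of the speeds
correct = average_speed(trip)

print(f"arithmetic mean of speeds: {naive:.0f} MPH")
print(f"total distance / total time: {correct:.0f} MPH")
```

The car spends three times as long in the slow segment as in the fast one, so the slow speed has to weigh more heavily than a plain average allows.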

Summarizing Results: Averages
Use the ARITHMETIC mean for times (cycles per instruction):
\[ \frac{1}{n} \sum_{i=1}^{n} \text{time}_i \]
Use the HARMONIC mean for rates (MIPS, MFLOPS):
\[ \frac{n}{\sum_{i=1}^{n} \frac{1}{\text{rate}_i}} \]
Use the GEOMETRIC mean for ratios (normalized numbers):
\[ \left( \prod_{i=1}^{n} \text{ratio}_i \right)^{1/n} \]
Better yet: don't average normalized numbers
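The sketch below illustrates why the geometric mean is the one suggested for normalized numbers, and why the slide still advises against averaging them. The execution times are hypothetical; `mean` and `geometric_mean` come from Python's standard `statistics` module (Python 3.8 or later).

```python
from statistics import mean, geometric_mean

# Hypothetical execution times (seconds) of three benchmarks on machines A and B.
times_a = [2.0, 10.0, 40.0]
times_b = [4.0, 5.0, 40.0]

a_over_b = [a / b for a, b in zip(times_a, times_b)]   # A normalized to B
b_over_a = [b / a for a, b in zip(times_a, times_b)]   # B normalized to A

# Arithmetic means of the two normalizations disagree about which machine is slower.
arith_ab, arith_ba = mean(a_over_b), mean(b_over_a)

# Geometric means are reciprocal, so the choice of reference machine does not matter.
geo_ab, geo_ba = geometric_mean(a_over_b), geometric_mean(b_over_a)

print(f"arithmetic: A/B = {arith_ab:.3f}, B/A = {arith_ba:.3f}")
print(f"geometric:  A/B = {geo_ab:.3f}, B/A = {geo_ba:.3f}")
print(f"total time: A = {sum(times_a)} s, B = {sum(times_b)} s")
```

In this made-up data the geometric mean rates the machines as equal even though their total times differ, which is one reason to prefer total (weighted) time over any average of normalized numbers.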

Summarizing Results: A Measure of Time
Property 1: A single-number performance measure for a set of benchmarks expressed in units of time should be directly proportional to the total (weighted) time consumed by the benchmarks.
Property 2: A single-number performance measure for a set of benchmarks expressed as a rate should be inversely proportional to the total (weighted) time consumed by the benchmarks.

Summarizing Results: Which Mean?
T_i = execution time for benchmark i
F_i = FP operations for benchmark i
R_i = F_i / T_i = rate of benchmark i
Average time:
\[ A_{\text{mean}} = \frac{1}{n} \sum_{i=1}^{n} T_i \]
Average rate:
\[ A_{\text{mean}} = \frac{1}{n} \sum_{i=1}^{n} R_i \]
» violates Property 2: not proportional to the inverse of time
Use the harmonic mean:
\[ H_{\text{mean}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{R_i}} = \frac{n}{\sum_{i=1}^{n} \frac{T_i}{F_i}} \]
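A small numerical check of this argument, assuming every benchmark performs the same number of FP operations (the case in which the harmonic mean of the rates equals total operations over total time). All numbers are hypothetical; `mean` and `harmonic_mean` are from the standard `statistics` module.

```python
from statistics import harmonic_mean, mean

# Hypothetical benchmarks, each performing the same number of FP operations.
flops_per_benchmark = 100e6
times = [1.0, 4.0]                                        # T_i in seconds
rates = [flops_per_benchmark / t / 1e6 for t in times]    # R_i = F_i / T_i, in MFLOPS

arithmetic_rate = mean(rates)
harmonic_rate = harmonic_mean(rates)

# Rate implied by total work over total time.
overall_rate = len(times) * flops_per_benchmark / sum(times) / 1e6

print(f"arithmetic mean of rates: {arithmetic_rate:.1f} MFLOPS")
print(f"harmonic mean of rates:   {harmonic_rate:.1f} MFLOPS")
print(f"total ops / total time:   {overall_rate:.1f} MFLOPS")
```

The arithmetic mean overstates the rate because it gives the fast benchmark the same weight as the slow one, even though the machine spends most of its time on the slow one.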

Homework 2
Choose a program to evaluate the performance of a PC
» it can be for Linux or Windows
Choose performance metrics for:
» speed of CPU
» speed of main memory
» speed of graphics applications
» speed of hard disk
Run the performance program on two different computers
» your assigned PC at the lab
» your home computer
Compare the results for the two computers and state whether one is faster than the other according to each metric
Three-page report
» describe the performance program (one page)
» describe the performance tests and metrics (one page)
» describe the characteristics of both computers, compare results and draw conclusions (one page)

Homework 2
                 Computer A   Computer B   Comparison
CPU speed
Memory speed
Graphics speed
HD speed
Due date: September 19th, 2008.