Orange Coast College, Business Division, Computer Science Department
CS 116: Computer Architecture
The Role of Performance
What is performance?
A set of metrics that allow us to compare two different hardware platforms. Facts about hardware: measured, reported, summarized.
Why should we care?
Besides the obvious question of what to buy (the informed consumer), we are concerned from an architectural standpoint. Again, the why of the machine: Why is one instruction faster? Why does some hardware feature affect the speed? But also the how: How can we make the computer faster? And finally, what are the trade-offs?
Performance Definition
What does it mean to say "Computer A has better performance than B"? Well, it depends.
Response time (latency): the amount of time to finish your job. Called execution time.
Finishing more jobs, faster (bandwidth): the amount of time to finish many jobs. Called throughput.
Keep it simple...
For now, we focus on execution time. Thus we say:
$\text{Performance} = \frac{1}{\text{Execution time}}$
or, by substitution:
$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X}$
When we say X is n times faster than Y:
$\frac{\text{Performance}_X}{\text{Performance}_Y} = n$
Relative Performance
Two machines, X and Y. Machine X runs a program in 10 seconds. Machine Y runs the program in 15 seconds. Which has better performance?
Relative performance:
$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{15}{10} = 1.5$
X is 1.5 times faster than Y. Or is Y 1.5 times slower than X?
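The ratio on this slide can be checked with a few lines of Python. This is only a sketch; the function name `relative_performance` is made up for illustration.

```python
def relative_performance(exec_time_x: float, exec_time_y: float) -> float:
    """Return Performance_X / Performance_Y, which equals time_Y / time_X."""
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Values from the slide: X takes 10 seconds, Y takes 15 seconds.
n = relative_performance(10.0, 15.0)
print(f"X is {n} times faster than Y")
```

A result above 1 means X is the faster machine; a result below 1 means Y is.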
Actually measuring performance
Everything is based on time: the time to finish a job, or the time to finish all of the jobs. The easiest measure is total time (elapsed time), but this is not totally accurate, because it includes both CPU time and I/O time.
CPU Time
Two categories:
User time: time spent on MY job.
System time: time spent on operating system overhead.
On Unix: the time command. On Windows: can't do it easily.
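As a rough illustration of the user/system split, Python's standard `os.times()` reports both values for the current process. The helper `cpu_times_for` and the `busy_loop` workload are hypothetical names used only for this sketch, and the numbers printed will vary from run to run.

```python
import os


def cpu_times_for(work):
    """Run work() and return (user_seconds, system_seconds) it consumed."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def busy_loop():
    """A purely CPU-bound placeholder workload."""
    total = 0
    for i in range(200_000):
        total += i * i
    return total


user_s, system_s = cpu_times_for(busy_loop)
print(f"user time:   {user_s:.3f} s")
print(f"system time: {system_s:.3f} s")
```

A CPU-bound loop like this one should show mostly user time; a workload that makes many system calls would shift more of the total into system time.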
System Time
Includes all operating system overhead: switching tasks, managing interrupts.
Should this be included in performance? Yes? Different OS, different overhead. No? The OS is part of the execution time.
Other Metrics
Performance is not just time. Think about measures based on computer work. All computers have a clock, based on an oscillator that vibrates at a constant rate. It provides distinct time intervals, called clock cycles, ticks, or clock periods.
Clocks
Length of the clock period: called the clock cycle (e.g. 2 ns), measured in seconds/cycle.
Or clock rate (e.g. 500 MHz): the inverse of the clock cycle, measured in cycles/second.
See: 500 MHz = 500,000,000 cycles/second and 2 ns = 2/1,000,000,000 second, so
$\frac{500 \times 10^6\ \text{cycles}}{\text{second}} \times \frac{2}{1{,}000 \times 10^6}\ \frac{\text{seconds}}{\text{cycle}} = 1$
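The inverse relationship between clock rate and clock cycle time can be sketched as follows. The function names are illustrative, not from the slides.

```python
def cycle_time_seconds(clock_rate_hz: float) -> float:
    """Clock cycle time (seconds/cycle) is the inverse of the clock rate."""
    return 1.0 / clock_rate_hz


def clock_rate_hertz(cycle_time_s: float) -> float:
    """Clock rate (cycles/second) is the inverse of the clock cycle time."""
    return 1.0 / cycle_time_s


rate_hz = 500e6                       # 500 MHz, as on the slide
period_s = cycle_time_seconds(rate_hz)
print(f"cycle time: {period_s * 1e9:.1f} ns")
print(f"rate x cycle time = {rate_hz * period_s:.1f}")
```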
Ponder
Consider: a 1 ns clock cycle means 1 GHz (1,000,000,000 cycles/second). A nanosecond is $1 \times 10^{-9}$ second; in that time, light travels about 1 foot.
3 GHz is 3,000,000,000 cycles in a second ($3 \times 10^9$ cycles/second), a cycle time of 333 picoseconds or 0.333 ns. That is the amount of time it takes light to travel about 4 inches.
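The distances on this slide can be checked with a short calculation. The sketch below assumes the vacuum speed of light (299,792,458 m/s) and 0.0254 m per inch; the function name is made up for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458   # vacuum speed of light
METERS_PER_INCH = 0.0254


def light_inches_per_cycle(clock_rate_hz: float) -> float:
    """Distance light travels (in inches) during one clock cycle."""
    cycle_time_s = 1.0 / clock_rate_hz
    return SPEED_OF_LIGHT_M_PER_S * cycle_time_s / METERS_PER_INCH


for rate_hz in (1e9, 3e9):
    inches = light_inches_per_cycle(rate_hz)
    print(f"{rate_hz / 1e9:.0f} GHz: light travels about {inches:.1f} inches per cycle")
```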
Relating Metrics
We are only concerned with CPU time. Users are concerned with time; designers with clock cycles. Relate the two:
$\text{CPU execution time} = \text{Clock cycles for a program} \times \text{Clock cycle time}$
We could also say:
$\text{CPU execution time} = \frac{\text{Clock cycles for a program}}{\text{Clock rate}}$
Look at an example
A program runs in 10 seconds. We want to improve the clock rate so that it takes 6 seconds. It is currently running at 400 MHz.
$10\ \text{seconds} = \frac{\text{Clock cycles for a program}}{400 \times 10^6\ \text{cycles/second}}$
$\text{Clock cycles for a program} = 10\ \text{seconds} \times 400 \times 10^6\ \text{cycles/second} = 4000 \times 10^6\ \text{cycles}$
The Solution
Changing the clock rate would... So: make the processor use 1.2 times as many clock cycles.
$\text{CPU time} = \frac{1.2 \times \text{Old cycle count}}{\text{New clock rate}}$
$6\ \text{seconds} = \frac{1.2 \times 4000 \times 10^6\ \text{cycles}}{\text{New clock rate}}$
$\text{New clock rate} = \frac{1.2 \times 4000 \times 10^6\ \text{cycles}}{6\ \text{seconds}} = 800 \times 10^6\ \text{cycles/second} = 800\ \text{MHz}$
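The worked example (10 seconds at 400 MHz, a 6 second target, 1.2 times as many cycles) can be reproduced numerically. The variable names below are illustrative only.

```python
old_clock_rate_hz = 400e6    # 400 MHz
old_time_s = 10.0            # current run time
target_time_s = 6.0          # desired run time
cycle_factor = 1.2           # the new design needs 1.2x as many cycles

# Clock cycles = CPU time x clock rate
old_cycles = old_time_s * old_clock_rate_hz
new_cycles = cycle_factor * old_cycles

# Clock rate = clock cycles / CPU time
new_clock_rate_hz = new_cycles / target_time_s

print(f"old cycle count: {old_cycles / 1e6:.0f} x 10^6 cycles")
print(f"new clock rate:  {new_clock_rate_hz / 1e6:.0f} MHz")
```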
Program Clock Cycles
Where do they come from? Every instruction takes time to execute, and that time is measured in clock cycles. Refine our equations:
$\text{CPU clock cycles} = \text{Instructions executed} \times \text{Average clock cycles per instruction}$
CPI
CPI is Clock cycles Per Instruction: the average number of clock cycles each instruction takes to execute. Add takes less time than multiply or load, so CPI is averaged over an entire program. It can be used to compare two different implementations of the same architecture.
More CPI
CPI depends on different factors: the memory system, the processor structure, and the ISA implementation.
Example: a program that performs many load/store operations will have a high CPI, because it takes a certain number of cycles to access memory.
MIPS (no, not the one you think)
Millions of Instructions Per Second. Inversely related to CPI. Not constant: faster machines have a higher MIPS rate, and it also depends on the instruction mix.
$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6}$
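A small sketch of the MIPS formula follows. Substituting Execution time = Instruction count x CPI / Clock rate gives the equivalent form MIPS = Clock rate / (CPI x 10^6), which is why MIPS moves inversely with CPI. The instruction count, CPI, and times below are hypothetical numbers chosen for illustration.

```python
def mips(instruction_count: float, execution_time_s: float) -> float:
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


def mips_from_clock(clock_rate_hz: float, cpi: float) -> float:
    """Equivalent form: MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# Hypothetical program: 2 x 10^9 instructions, CPI of 2, on a 400 MHz clock,
# which takes 2e9 * 2 / 400e6 = 10 seconds.
print(mips(2e9, 10.0))
print(mips_from_clock(400e6, 2.0))
```

Both forms give the same rating for the same program, and the rating changes as soon as the instruction mix (and therefore the CPI) changes.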
FLOPS
Floating-point operations per second. The rate depends on the machine implementation. Options: no FPU, microprogrammed FPU, hardwired FPU. Supports more functions (sine, cosine, ...).
MFLOPS (mega-FLOPS): $10^6$ FLOPS.
Operations: add, subtract, multiply, divide, ... Precision: single or double.
Finally, the equations
This all works out to the following metrics:
$\text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time}$
or
$\text{CPU time} = \frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}}$
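Both forms of the CPU time equation can be written as small functions. The sketch below uses hypothetical inputs (2 x 10^9 instructions, a CPI of 2, a 400 MHz clock) and illustrative function names.

```python
def cpu_time_from_rate(instruction_count: float, cpi: float,
                       clock_rate_hz: float) -> float:
    """CPU time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def cpu_time_from_cycle(instruction_count: float, cpi: float,
                        cycle_time_s: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical program: 2 x 10^9 instructions at CPI 2 on a 400 MHz clock
# (a 400 MHz clock has a 2.5 ns cycle time).
print(f"{cpu_time_from_rate(2e9, 2.0, 400e6):.1f} s")
print(f"{cpu_time_from_cycle(2e9, 2.0, 2.5e-9):.1f} s")
```

Holding two of the three factors fixed and varying the third shows how each part trades off against the others.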
That is great
We can play with the distinct parts to find a balance. But how do we get the numbers? Look at the parts:
CPU time: run the program.
Clock cycle time: provided by the manufacturer.
Instruction count: ????
CPI: ????
Instruction count
There are a couple of ways to get it:
Use a simulator, such as SPIM.
A software profiler, which counts instructions as they execute.
Hardware counters, included on newer processors.
Instruction count is implementation agnostic.
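But CPI...
CPI, however, depends on a lot more, like the memory system. It can be found with:
Detailed simulation (more detailed than SPIM).
Calculating the number of instructions and their individual cycle counts:
$\text{CPU clock cycles} = \sum_i (\text{CPI}_i \times C_i)$
where $C_i$ is the number of instructions of class $i$ executed and $\text{CPI}_i$ is the cycles per instruction for that class.
CPI is implementation specific.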
Instruction Frequencies

Rank  Instruction              Average time executed
1     load                     22%
2     conditional branch       20%
3     compare                  16%
4     store                    12%
5     add                      8%
6     and                      6%
7     sub                      5%
8     move register-register   4%
9     call                     1%
10    return                   1%
      Total                    96%
Let's be compiler writers
Given the following facts, supplied by the HW designer:

Instruction class   CPI for this instruction class
A                   1
B                   2
C                   3

Two code sequences for a particular machine:

Code sequence   Instruction counts
                A   B   C
1               2   1   2
2               4   1   1
Finding the solution
Required: which code sequence executes the most instructions?
Solution:
Sequence 1 executes 2 + 1 + 2 = 5 instructions. CPU clock cycles for sequence 1: (2 x 1) + (1 x 2) + (2 x 3) = 10 cycles.
Sequence 2 executes 4 + 1 + 1 = 6 instructions. CPU clock cycles for sequence 2: (4 x 1) + (1 x 2) + (1 x 3) = 9 cycles.
The Example Completed
CPI = CPU clock cycles / Instruction count
CPI for sequence 1: 10/5 = 2
CPI for sequence 2: 9/6 = 1.5
Code sequence 2 is faster: it has more instructions but requires fewer cycles.
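The compiler-writer example can be reproduced in a few lines. The per-class CPI values and instruction counts are the ones given in the slides; the names `CPI_BY_CLASS`, `SEQUENCES`, and `summarize` are illustrative.

```python
# CPI for each instruction class, as supplied by the hardware designer.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the two code sequences.
SEQUENCES = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}


def summarize(counts):
    """Return (instruction count, clock cycles, average CPI) for a sequence."""
    instructions = sum(counts.values())
    # CPU clock cycles = sum over classes of CPI_i x C_i
    cycles = sum(CPI_BY_CLASS[cls] * n for cls, n in counts.items())
    return instructions, cycles, cycles / instructions


for seq_id, counts in SEQUENCES.items():
    instructions, cycles, cpi = summarize(counts)
    print(f"sequence {seq_id}: {instructions} instructions, "
          f"{cycles} cycles, CPI = {cpi}")
```

Sequence 2 executes more instructions yet finishes in fewer cycles, which is why instruction count alone is not a reliable guide to speed.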
Software for Comparison
How do we select software for comparison? It needs to accurately emulate the daily workload, and it needs to be run on different machines. We want real-world applications.
Actual real-world target workload:
Pros: represents real-world problems.
Cons: difficult to run and measure; very specific; not portable to other situations; difficult to speed up execution.
Which Benchmarks?
We use benchmarks: programs selected to measure performance. They are more specific and portable.
They should be the class of apps users use most: engineers run math-intensive applications; developers run compilers and document processors.
Large suites prevent trivial optimizations that negate the benchmark, and are more likely to represent real usage.
Types of Benchmarks
Full application, like compilers, games, streaming media.
Pros: portable, widely used.
Cons: less representative.
Types of Benchmarks
Kernel: short loops with specific instructions.
Pros: easy to run; good in the design phase.
Cons: tailored to a specific task; can't compare across machines.
Types of Benchmarks
Micro benchmarks.
Pros: good for beginning design; can compile and simulate easily.
Cons: can lead to misleading results; might not give actual performance.
Types of Applications
Suites of applications: programs with specific input.
Pros: good indicator of compiler technology and performance.
Cons: needs updating to match current SW applications.
Problems with Benchmarks
Compiler optimizations: vendors optimize specifically for benchmarks; compilers optimized for the benchmark only; special switches.
Little code, lots of execution: modern compilers can optimize loops to almost nothing. Further, a small amount of code will reside in cache.
What IBM Did
PowerPC 550 in 1991
[Figure: bar chart of SPEC performance ratio (0 to 800) by benchmark (gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv), comparing "Compiler" with "Enhanced compiler".]
Select a suite
Industry has settled on SPEC: the System Performance Evaluation Cooperative, created by a group of companies in 1989.
SPEC95: the latest release of the SPEC benchmark. Eighteen application benchmarks (with inputs) reflecting a technical computing workload: eight integer and ten floating-point intensive. Must be run with standard compiler flags.
Comparing results
SPEC ratio: normalized results. Divide the execution time on a SPARCstation by the time on the measured machine.
[Figure: SPECint (0 to 10) versus clock rate in MHz (50 to 250) for Pentium and Pentium Pro.]
Now, to comparing performance
Not quite. We have benchmarks, and we decide on response time or throughput. How do we summarize performance?
$\text{Performance} = \frac{1}{\text{Execution time}}$
Run the suite a number of times and take the arithmetic mean of all of the runs. The ratio of the two machines' means is the relative performance.
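The summarizing step described on this slide (average the runs, then take the ratio) might look like the following. The run times are hypothetical and the function name is illustrative.

```python
from statistics import mean


def relative_performance_from_runs(times_x, times_y):
    """Performance_X / Performance_Y using the arithmetic mean of each machine's runs."""
    return mean(times_y) / mean(times_x)


# Hypothetical execution times (seconds) for repeated runs of the same suite.
runs_on_x = [9.0, 10.0, 11.0]
runs_on_y = [14.0, 15.0, 16.0]

ratio = relative_performance_from_runs(runs_on_x, runs_on_y)
print(f"mean X = {mean(runs_on_x)} s, mean Y = {mean(runs_on_y)} s")
print(f"X is {ratio} times faster than Y")
```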
Increasing Performance
How do we increase performance? Make improvements to the implementation, make improvements to the architecture, or start from scratch.
Increasing Performance
Implementation improvements: a faster clock with an unchanged architecture.
Advantage: old programs can run on the new machine, which is a major selling point.
Architectural improvements: add new instructions and new registers.
Advantage: old programs should continue to run.
Disadvantage: software must be recompiled to take advantage of the new features.
Starting over
RISC architecture (1980s). IA-64 (now), which is really just RISC.
Advantage: freedom to change and to design for current needs.
Disadvantage: everything must be done from scratch; old programs can't be used.
Increasing Performance: Pitfalls
Amdahl's Law: improving one aspect by a percentage does not improve the entire machine by that percentage.
"MIPS (not that one) can be used": instructions are implemented differently, and MIPS can vary inversely with performance.
"The arithmetic mean predicts performance": normalization causes skewing of information.
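Amdahl's Law is commonly written as overall speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time affected by an improvement and s is the speedup of that fraction. That formula is the standard statement of the law rather than something shown on the slide, and the numbers below are hypothetical.

```python
def amdahl_speedup(fraction_improved: float, improvement_factor: float) -> float:
    """Overall speedup when only part of the execution time is improved."""
    if not 0.0 <= fraction_improved <= 1.0:
        raise ValueError("fraction_improved must be between 0 and 1")
    if improvement_factor <= 0:
        raise ValueError("improvement_factor must be positive")
    unaffected = 1.0 - fraction_improved
    return 1.0 / (unaffected + fraction_improved / improvement_factor)


# Hypothetical: half of the execution time is made twice as fast.
print(f"{amdahl_speedup(0.5, 2.0):.2f}x overall")

# Even a huge improvement to that half cannot reach 2x overall.
print(f"{amdahl_speedup(0.5, 1000.0):.2f}x overall")
```

Doubling the speed of half the work yields well under a 2x overall gain, which is the pitfall the slide warns about.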
Final thoughts
Execution time is the only valid metric, and performance measurements should reflect execution time.
You can't design hardware for performance without considering cost. Exception: high-performance computers for scientific computing, such as Crays or the Virginia Tech computer cluster.
Virginia Technology G5
Final thoughts
The other extreme is low-cost design, where cost takes precedence over performance: the IBM PC, embedded computers.
In between is cost/performance design, in which the designer balances cost against performance.
Cost is determined by components, labor, research and development, sales and marketing, and profit margin.
Final thoughts
Performance increases come from:
Increases in clock rate, without adverse CPI effects.
Improvements in processor organization that lower CPI.
Compiler enhancements that lower CPI and/or instruction count.