Designing for Performance
Patrick Happ, Raul Feitosa


Objective
In this section we examine the most common approach to assessing processor and computer system performance (W. Stallings).

Which one would you choose?

| Attribute         | Intel Core i7 4770K | AMD FX 9590 |
|-------------------|---------------------|-------------|
| Number of cores   | 4                   | 8           |
| Number of threads | 8                   | 8           |
| Frequency         | 3.5 GHz             | 4.7 GHz     |
| Turbo frequency   | 3.9 GHz             | 5 GHz       |
| Data width        | 64-bit              | 64-bit      |
| TDP               | 84 W                | 220 W       |
| Release           | June 2013           | July 2013   |

Outline
- Performance Assessment
- Amdahl's Law

Designing new systems
- Cost
- Size
- Reliability
- Security
- Power consumption
- Performance

CPU operations
- Fetch and decode instructions
- Load and store data
- Logic and arithmetic operations
- Clock pulse

Performance factors
- Clock speed or clock rate (f): expressed in multiples of Hz.
- Clock cycle or clock tick: one increment, or pulse, of the clock.
- Clock time (τ): time between consecutive pulses, $\tau = 1/f$.
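
The relation between clock rate and clock time can be checked numerically. This is a minimal sketch: the helper name `cycle_time` is an illustration and not part of the slides, and the frequencies are the base clock rates listed in the processor comparison near the start of the deck.

```python
def cycle_time(frequency_hz: float) -> float:
    """Return the clock time tau = 1 / f in seconds (hypothetical helper)."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1.0 / frequency_hz


# Base clock rates from the processor comparison slide.
tau_i7 = cycle_time(3.5e9)  # Intel Core i7 4770K at 3.5 GHz
tau_fx = cycle_time(4.7e9)  # AMD FX 9590 at 4.7 GHz

print(f"i7 4770K: {tau_i7 * 1e9:.3f} ns per cycle")
print(f"FX 9590:  {tau_fx * 1e9:.3f} ns per cycle")
```

A shorter clock time alone does not settle which processor is faster, since the number of cycles each instruction needs also matters.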

Performance factors: clock speed
- Usually, multiple clock cycles are required per instruction.
- The amount of work implied by one instruction varies considerably.
- Pipelining allows several instructions to be in execution simultaneously.
- So clock speed is not the whole story!

Performance factors
- CPI: average number of cycles per instruction.
- $I_i$: number of machine instructions of type $i$ executed by a program.
- $CPI_i$: number of cycles per instruction of type $i$.
- $I_c$: number of machine instructions executed by a program.

$I_c = \sum_{i=1}^{n} I_i$

$CPI = \frac{\sum_{i=1}^{n} CPI_i \cdot I_i}{I_c}$
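
The average CPI is a weighted mean of the per-type CPI values, weighted by instruction counts. The sketch below illustrates that definition; the function name `average_cpi` and the instruction counts are hypothetical.

```python
def average_cpi(counts, cpis):
    """Average CPI = sum(CPI_i * I_i) / I_c, with I_c = sum(I_i)."""
    if len(counts) != len(cpis):
        raise ValueError("counts and cpis must have the same length")
    i_c = sum(counts)  # total instruction count I_c
    if i_c == 0:
        raise ValueError("at least one instruction is required")
    total_cycles = sum(cpi * n for cpi, n in zip(cpis, counts))
    return total_cycles / i_c


# Hypothetical program: 3 million instructions needing 1 cycle each
# and 1 million instructions needing 3 cycles each.
cpi = average_cpi(counts=[3_000_000, 1_000_000], cpis=[1, 3])
print(cpi)  # (3e6 * 1 + 1e6 * 3) / 4e6 = 1.5
```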

Performance factors
- $T$: processor time needed to execute a program.

$T = I_c \times CPI \times \tau$

A refinement yields

$T = I_c \times [p + (m \times k)] \times \tau$

where
- $p$ is the number of processor cycles needed to decode and execute the instruction,
- $m$ is the number of memory references needed,
- $k$ is the ratio between memory cycle time and processor cycle time.
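
The refined formula can be read as an effective CPI of p + m × k. The following sketch evaluates it for one set of values; the function name and all numbers are assumptions chosen only for illustration.

```python
def execution_time(i_c, p, m, k, f_hz):
    """T = I_c * (p + m * k) * tau, with tau = 1 / f."""
    tau = 1.0 / f_hz              # clock time in seconds
    effective_cpi = p + m * k     # processor cycles plus memory cycles
    return i_c * effective_cpi * tau


# Hypothetical values: 1 million instructions, 2 processor cycles each,
# 0.5 memory references per instruction, memory 4 times slower than the
# processor cycle, 1 GHz clock.
t = execution_time(i_c=1_000_000, p=2, m=0.5, k=4, f_hz=1e9)
print(f"{t * 1e3:.1f} ms")  # effective CPI = 2 + 0.5 * 4 = 4
```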

Performance factors: instruction execution rate
- Expressed in millions of instructions per second (MIPS) or millions of floating-point operations per second (MFLOPS).
- Heavily dependent on instruction set, compiler design, processor implementation, and cache and memory hierarchy.

$\text{MIPS rate} = \frac{I_c}{T \times 10^6} = \frac{f}{CPI \times 10^6}$
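
Both forms of the MIPS rate should agree, because T = I_c × CPI / f. The sketch below checks this for one hypothetical configuration; the function names and values are not from the slides.

```python
def mips_from_time(i_c, t_seconds):
    """MIPS rate = I_c / (T * 10^6)."""
    return i_c / (t_seconds * 1e6)


def mips_from_clock(f_hz, cpi):
    """MIPS rate = f / (CPI * 10^6)."""
    return f_hz / (cpi * 1e6)


# Hypothetical program and machine.
i_c, cpi, f_hz = 1_000_000, 2.5, 500e6
t = i_c * cpi / f_hz  # execution time in seconds

rate_from_time = mips_from_time(i_c, t)
rate_from_clock = mips_from_clock(f_hz, cpi)
print(rate_from_time, rate_from_clock)
```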

Performance factors
System attributes affecting the performance factors ($I_c$, $p$, $m$, $k$, $\tau$):
- Instruction set architecture!
- Compiler technology
- VLSI technology
- Processor implementation
- Cache and memory hierarchy!

Exercise 1
A program involves the execution of 2 million instructions on a 400 MHz processor. The CPI and the proportion of the four instruction types are given below. Compute the average CPI.

| Instruction type           | CPI | Instruction mix |
|----------------------------|-----|-----------------|
| Arithmetic and logic       | 1   | 60%             |
| Load/store with cache hit  | 2   | 18%             |
| Branch                     | 4   | 12%             |
| Load/store with cache miss | 8   | 10%             |

The average CPI is

$CPI = (1 \times 0.6) + (2 \times 0.18) + (4 \times 0.12) + (8 \times 0.1) = 2.24$
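
The result of Exercise 1 can be reproduced in a few lines. The MIPS rate and execution time computed at the end are not asked for in the exercise; they are included here as an assumption-free application of the formulas from the earlier slides to the same data.

```python
# (instruction type, CPI, fraction of the instruction mix) from Exercise 1
mix = [
    ("Arithmetic and logic", 1, 0.60),
    ("Load/store with cache hit", 2, 0.18),
    ("Branch", 4, 0.12),
    ("Load/store with cache miss", 8, 0.10),
]

avg_cpi = sum(cpi * fraction for _, cpi, fraction in mix)

f_hz = 400e6       # 400 MHz clock
i_c = 2_000_000    # 2 million executed instructions

mips = f_hz / (avg_cpi * 1e6)   # MIPS rate = f / (CPI * 10^6)
t = i_c * avg_cpi / f_hz        # T = I_c * CPI * tau

print(f"CPI = {avg_cpi:.2f}, MIPS = {mips:.1f}, T = {t * 1e3:.1f} ms")
```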

Exercise 2
Consider two hardware implementations, M1 and M2, of the same instruction set. There are three instruction classes: F, I and N. The M1 clock rate is 600 MHz. The clock cycle of M2 is 2 ns. The average CPI values for these three instruction classes are:

| Class | CPI of M1 | CPI of M2 | Comments       |
|-------|-----------|-----------|----------------|
| F     | 5.0       | 4.0       | floating-point |
| I     | 2.0       | 3.8       | integer        |
| N     | 2.4       | 2.0       | non-arithmetic |

a) Compute the peak performance of M1 and M2 in MIPS.
b) If 50% of the instructions executed in a given program belong to class N and the others are equally distributed between F and I, which machine is faster, and by what factor?

Exercise 2 (continued)
c) A designer of M1 plans to change the design to improve performance. Assuming the instruction mix in (b), which of the options below would be the most beneficial?
   1. Use an FPU twice as fast (CPI = 2.5 for class F).
   2. Add a second ALU to reduce the CPI for integer operations to 1.20.
   3. Use faster logic that allows a clock rate of 750 MHz while keeping the same CPI values.
d) The CPI values given above include cache misses, which occur 5 times per 100 executed instructions. Each cache miss incurs a 10-cycle penalty. The fourth redesign option consists of using a larger instruction cache so as to reduce the miss ratio from 5% to 3%. Compare this alternative with the previous options.
e) Characterize the application programs that execute faster on M1 than on M2, i.e., discuss the instruction composition of such applications. Hint: let x, y and 1 - x - y be the fractions of instructions belonging to classes F, I and N, respectively.

Exercise 3
Consider two codes produced by two compilers for the same source program. The instructions of the machine that will execute these codes can be divided into class A (CPI = 1) and class B (CPI = 2). The number of executed instructions of each class is given below.

| Class | Compiler 1 | Compiler 2 | Comments |
|-------|------------|------------|----------|
| A     | 600M       | 400M       | CPI = 1  |
| B     | 400M       | 400M       | CPI = 2  |

a) Compute the execution time for both codes assuming a clock rate of 1 GHz.
b) Which compiler produces the most efficient code, and by what factor?
c) Which code executes at the highest MIPS rate?

Benchmarks: motivation
A high-level language statement:
    A = B + C   /* assume all quantities in main memory */
Compiled code on CISC:
    add mem(b),mem(c),mem(a)
Compiled code on RISC:
    load mem(b),reg(1);
    load mem(c),reg(2);
    add reg(1),reg(2),reg(3);
    store reg(3),mem(a);

Benchmarks: definition
- Programs designed to test performance.
- Written in a high-level language, hence portable.
- Represent a style of task (systems, numerical, commercial).
- Easily measured and widely distributed.
- E.g. Standard Performance Evaluation Corporation (SPEC) CPU2006, for computation-bound applications:
  - 17 floating-point programs in C, C++ and Fortran
  - 12 integer programs in C and C++
  - 3 million lines of code
- Other suites: graphics, high performance, web, servers

Averaging results
- Running m different benchmarks yields a reliable comparison.
- The overall instruction execution rate may be expressed by the arithmetic or the harmonic mean, where $R_i$ is the instruction execution rate of the i-th benchmark:

$R_A = \frac{1}{m} \sum_{i=1}^{m} R_i$

$R_H = \frac{m}{\sum_{i=1}^{m} \frac{1}{R_i}}$
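
The two means can give noticeably different summaries of the same measurements. The sketch below computes both for a set of hypothetical MIPS rates; the function names and data are illustrative assumptions.

```python
def arithmetic_mean(rates):
    """R_A = (1/m) * sum(R_i)."""
    return sum(rates) / len(rates)


def harmonic_mean(rates):
    """R_H = m / sum(1 / R_i)."""
    return len(rates) / sum(1.0 / r for r in rates)


# Hypothetical instruction execution rates (MIPS) for m = 3 benchmarks.
rates = [100.0, 200.0, 400.0]

r_a = arithmetic_mean(rates)
r_h = harmonic_mean(rates)
print(f"arithmetic mean = {r_a:.2f}, harmonic mean = {r_h:.2f}")
```

For positive rates the harmonic mean never exceeds the arithmetic mean, and it gives more weight to the slowest benchmarks.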

SPEC speed metric (slide 21)
- SPEC benchmarks are not concerned with instruction execution rates.
- A base runtime is defined for each benchmark using a reference machine.
- The speed metric is the ratio of reference time to system run time:

$r_i = \frac{Tref_i}{Tsut_i}$

- $Tref_i$: execution time for benchmark $i$ on the reference machine.
- $Tsut_i$: execution time of benchmark $i$ on the test system.
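
As a small illustration of the speed metric, the sketch below divides a reference time by a measured time. The function name `speed_ratio` and the two run times are hypothetical and do not come from any published SPEC result.

```python
def speed_ratio(t_ref, t_sut):
    """Speed metric: reference run time divided by test-system run time."""
    if t_sut <= 0:
        raise ValueError("test-system run time must be positive")
    return t_ref / t_sut


# Hypothetical run times in seconds for one benchmark.
ratio = speed_ratio(t_ref=1000.0, t_sut=250.0)
print(ratio)  # the test system is 4 times faster than the reference
```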

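Averaging SPEC metrics (slide 22)
- Overall performance is calculated by averaging the ratios for all 12 integer benchmarks.
- Use the geometric mean, where $r_i$ is the speed ratio of benchmark $i$ and $n$ is the number of benchmarks:

$r_G = \left( \prod_{i=1}^{n} r_i \right)^{1/n}$

- The geometric mean is appropriate for normalized numbers such as ratios.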
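
One way to see why the geometric mean suits normalized ratios is to change the reference machine and watch what happens to each kind of mean. The sketch below does this for two hypothetical machines, A and B; the run times and helper names are assumptions made only for this illustration.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


# Hypothetical run times in seconds for two benchmarks on two machines.
times = {"A": [10.0, 100.0], "B": [20.0, 50.0]}


def normalized(machine, reference):
    """Speed ratios of `machine` relative to `reference` (higher is faster)."""
    return [t_ref / t for t_ref, t in zip(times[reference], times[machine])]


for ref in ("A", "B"):
    for machine in ("A", "B"):
        ratios = normalized(machine, ref)
        print(
            f"reference {ref}, machine {machine}: "
            f"arithmetic = {arithmetic_mean(ratios):.2f}, "
            f"geometric = {geometric_mean(ratios):.2f}"
        )
```

With these numbers the arithmetic mean favours whichever machine is not the reference, while the geometric mean rates the two machines equally under either reference.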

SPEC rate metric
- Measures the throughput, or rate, of a machine carrying out a number of tasks.
- Multiple copies of the benchmarks are run simultaneously.
- Typically, the number of copies is the same as the number of processors.
- The ratio is calculated as follows:

$rate_i = \frac{N \times Tref_i}{Tsut_i}$

- $Tref_i$: reference execution time for benchmark $i$.
- $N$: number of copies running simultaneously.
- $Tsut_i$: elapsed time from the start of execution of all $N$ programs until the completion of all copies of the program.
- Again, a geometric mean is calculated.
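
The sketch below evaluates the rate metric under the assumption that it takes the form N × Tref / Tsut, which follows the symbol definitions on this slide. The function name and the numbers are hypothetical.

```python
def rate_ratio(n_copies, t_ref, t_sut):
    """Rate metric, assumed here to be N * Tref / Tsut."""
    if t_sut <= 0:
        raise ValueError("elapsed time must be positive")
    return n_copies * t_ref / t_sut


# Hypothetical run: 4 copies finish in 500 s; reference time is 1000 s.
rate = rate_ratio(n_copies=4, t_ref=1000.0, t_sut=500.0)
print(rate)
```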

Exercise 4
The table below shows the execution times, in seconds, for 3 different processors.

| Benchmark | Processor X | Processor Y | Processor Z |
|-----------|-------------|-------------|-------------|
| 1         | 20          | 10          | 40          |
| 2         | 40          | 80          | 20          |

a) Compute the arithmetic mean value for each system using X as the reference machine and then using Y as the reference machine.
b) Compute the geometric mean value for each system using X as the reference machine and then using Y as the reference machine. Which is the most realistic result?

Ref: CPUBoss


Amdahl's Law
- Named after Gene Amdahl.
- Estimates the potential speedup of a program using multiple processors.
- Fraction p of the code is parallelizable with no scheduling overhead.
- Fraction (1 - p) of the code is inherently serial.
- T is the total execution time of the program on a single processor.
- N is the number of processors that fully exploit the parallel portions of the code.

$\text{Speedup} = \frac{\text{time to execute program on a single processor}}{\text{time to execute program on } N \text{ parallel processors}} = \frac{T(1-p) + Tp}{T(1-p) + \frac{Tp}{N}} = \frac{1}{(1-p) + \frac{p}{N}}$
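
Amdahl's Law translates directly into a one-line function. The sketch below is an illustration only; the function name and the sample values of p and N are assumptions.

```python
def amdahl_speedup(p, n):
    """Speedup = 1 / ((1 - p) + p / N) for parallel fraction p on N processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# Hypothetical program: 90% parallelizable, run on 4 processors.
s = amdahl_speedup(p=0.9, n=4)
print(f"speedup = {s:.2f}")
```

Even with 90% of the code parallelized, four processors give a speedup of only about 3, because the serial 10% is not accelerated.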

Amdahl's Law: conclusions
- Code needs to be parallelizable and parallelized!
- When p is small, using parallel processors has little effect.
- As N → ∞, the speedup is bounded by 1/(1 - p).
- The speedup is bounded, so adding more processors gives diminishing returns.

$\text{Speedup} = \frac{\text{time to execute program on a single processor}}{\text{time to execute program on } N \text{ parallel processors}} = \frac{T(1-p) + Tp}{T(1-p) + \frac{Tp}{N}} = \frac{1}{(1-p) + \frac{p}{N}}$
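
The diminishing returns can be made visible by tabulating the speedup for a growing number of processors. The sketch below does so for a hypothetical parallel fraction p = 0.9; the function name and the chosen processor counts are illustrative assumptions.

```python
def amdahl_speedup(p, n):
    """Speedup = 1 / ((1 - p) + p / N)."""
    return 1.0 / ((1.0 - p) + p / n)


p = 0.9                      # hypothetical parallel fraction
bound = 1.0 / (1.0 - p)      # limit of the speedup as N grows without bound

processor_counts = (1, 2, 4, 8, 16, 64, 1024)
speedups = [amdahl_speedup(p, n) for n in processor_counts]

for n, s in zip(processor_counts, speedups):
    print(f"N = {n:5d}: speedup = {s:6.3f} ({100 * s / bound:5.1f}% of the bound)")
print(f"bound 1/(1 - p) = {bound:.3f}")
```

Each doubling of N buys less than the previous one, and no number of processors pushes the speedup past the bound.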

Amdahl's Law: Exercise 5
A program spends 60% of its execution time on floating-point operations, and 90% of them are executed in parallelizable loops. When the code is parallelized, coordination and synchronization between parts make the part not involving floating-point operations 10% longer.
a) Find the improvement in execution time achieved by doubling the speed of the floating-point unit.
b) Find the improvement in execution time achieved by using two processors with the same speed and structure as the original one.
c) What would be the improvement if both changes were implemented?

Amdahl's Law: generalization for any design improvement

$\text{Speedup} = \frac{\text{Execution time before enhancement}}{\text{Execution time after enhancement}}$

Suppose that the enhancement affects a fraction $p$ of the total runtime before enhancement, and that the speedup brought by this enhancement to that fraction is $SU_p$. Then

$\text{Speedup} = \frac{1}{(1-p) + \frac{p}{SU_p}}$
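
The step from the definition to the closed form can be written out as follows. This is a sketch of the derivation, where $T$ denotes the runtime before enhancement (the same symbol used for single-processor time on the multiprocessor slide) and $T_{\text{after}}$ is a name introduced here for the runtime after enhancement.

```latex
\begin{align*}
T_{\text{after}} &= \underbrace{T(1-p)}_{\text{unaffected part}}
                  + \underbrace{\frac{Tp}{SU_p}}_{\text{enhanced part}} \\[4pt]
\text{Speedup}   &= \frac{T}{T_{\text{after}}}
                  = \frac{T}{T(1-p) + \dfrac{Tp}{SU_p}}
                  = \frac{1}{(1-p) + \dfrac{p}{SU_p}}
\end{align*}
```

Setting $SU_p = N$ recovers the multiprocessor form of the law.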

Amdahl's Law: generalized Amdahl's Law example
Suppose that a task spends 40% of its time on floating-point operations and that a new FPU has speedup K. Then the overall speedup is

$\text{Speedup} = \frac{1}{(1 - 0.4) + \frac{0.4}{K}}$

So the maximum speedup, approached as K grows without bound, is $1/0.6 \approx 1.67$.
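
The example can be evaluated for a few values of K to see how quickly the overall speedup saturates. In the sketch below the function name and the sampled K values are assumptions; the 40% fraction comes from the example.

```python
def overall_speedup(p, su_p):
    """Generalized Amdahl's Law: 1 / ((1 - p) + p / SU_p)."""
    return 1.0 / ((1.0 - p) + p / su_p)


p = 0.4                      # fraction of time in floating-point operations
limit = 1.0 / (1.0 - p)      # speedup approached as K grows without bound

for k in (2, 4, 10, 100):    # hypothetical FPU speedups
    print(f"K = {k:3d}: overall speedup = {overall_speedup(p, k):.3f}")
print(f"limit = {limit:.2f}")
```

An FPU that is 100 times faster still yields an overall speedup below 1.67, because 60% of the runtime is untouched.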

Homework: Exercise 6
A processor is used for an application where 30%, 25% and 10% of the processing time is spent on floating-point addition, multiplication and division, respectively. For a new processor version, 3 alternatives are being considered, all of them involving nearly the same design and implementation cost. Which one should be selected?
a) Redesign the adder, making it twice as fast as the older one.
b) Redesign the multiplier, making it three times as fast as the older one.
c) Redesign the divider, making it ten times as fast as the older one.

Homework: Exercise 7
T is the average processing time of a computer operating at frequency f. Instructions are grouped into 3 types, as shown below.

| Instruction type          | CPI |
|---------------------------|-----|
| Floating-point arithmetic | 10  |
| Integer arithmetic        | 5   |
| Non-arithmetic            | 2   |

Typically, a program executes the same proportion of instructions from all three types. Compute the MIPS rate and the new execution time if the FPU becomes twice as fast.

Homework: Exercise 8
Let f1 and f2 be the operating frequencies of processors P1 and P2, respectively. Assume that two compilers generate different executable codes for the same source program, which may be executed by P1 as well as by P2. The codes have the characteristics given below.

| Instruction type          | CPI | Proportion, compiler 1 | Proportion, compiler 2 |
|---------------------------|-----|------------------------|------------------------|
| Floating-point arithmetic | 10  | 20%                    | 30%                    |
| Integer arithmetic        | 5   | 30%                    | 10%                    |
| Non-arithmetic            | 2   | 50%                    | 60%                    |

Compute the ratio f1/f2 for which the processing time of P1 executing code 1 equals the processing time of P2 executing code 2.

Homework: Exercise 9
The code of an application can be separated into a sequential part (S) and a parallelizable part (P). When the application runs on a single processor, the number of executed instructions of type P is twice the number of type S. When the application runs on multiple processors, the number of instructions of type S increases by 10%. Consider the following two configurations:
A) Single-processor machine operating at frequency 2f.
B) Four-processor machine operating at frequency f.
a) Determine the limit ratio r between the CPI of instructions of type P and of type S (r = CPI_P / CPI_S) for which configuration A is faster than configuration B.
b) Compute the upper limit for the speedup that can be achieved using multiple processors without changing the operating frequency.

Homework: Exercise 10
The following table shows the execution times, in seconds, for five different benchmark programs on three machines.

| Benchmark | Processor R | Processor M | Processor Z |
|-----------|-------------|-------------|-------------|
| E         | 417         | 244         | 134         |
| F         | 83          | 70          | 70          |
| H         | 66          | 153         | 135         |
| I         | 39449       | 35527       | 66000       |
| K         | 772         | 368         | 369         |

a) Compute the speed metric for each processor for each benchmark, normalized to machine R, using the equation given in slide 21. Then compute the arithmetic mean value.
b) Repeat (a) using M as the reference machine. Which machine is the slowest based on each of the preceding two calculations?
c) Repeat the calculations of parts (a) and (b) using the geometric mean, defined in slide 22. Which machine is the slowest based on the two calculations?

Textbook references
The topics are covered in:
- Stallings: sections 2.2, 2.3 and 2.5
- Tanenbaum: section 8.4
- Parhami: chapter 4
