Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier


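$$\text{Performance} = \frac{1}{\text{Execution time}}$$

$$\text{Speedup} = \frac{\text{Performance}(B)}{\text{Performance}(A)} = \frac{\text{Time}(A)}{\text{Time}(B)}$$

$$\text{CPU time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

$$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{CPI}_i \times \text{Instructions}_i$$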
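The formulas above can be sketched in a few lines of Python. The function names and the sample numbers (10^9 instructions, CPI of 2, 500 MHz clock) are illustrative assumptions, not values from the slides.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instructions x (cycles / instruction) x (seconds / cycle)."""
    return instruction_count * cpi / clock_rate_hz


def total_clock_cycles(classes):
    """Sum CPI_i x Instructions_i over (cpi, instruction_count) pairs."""
    return sum(cpi * count for cpi, count in classes)


def speedup(time_a, time_b):
    """Speedup of B relative to A = Time(A) / Time(B)."""
    return time_a / time_b


# Hypothetical program: 1e9 instructions, average CPI of 2, 500 MHz clock.
seconds = cpu_time(1e9, 2, 500e6)
print(f"CPU time: {seconds} s")
```

Because performance is the reciprocal of execution time, `speedup(time_a, time_b)` returns the same value as the ratio Performance(B) / Performance(A).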

The performance enhancement possible with a given improvement is limited by the amount that the improved feature is used.

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by the improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected}$$

A common theme in hardware design is to make the common case fast.
Increasing the clock rate would not affect memory access time.
Using a floating-point processing unit does not speed up integer ALU operations.

Example: Floating-point instructions are improved to run 2X faster, but only 10% of the instructions actually executed are floating point.

$$\text{Exec-Time}_{new} = \text{Exec-Time}_{old} \times (0.9 + 0.1/2) = 0.95 \times \text{Exec-Time}_{old}$$

$$\text{Speedup}_{overall} = \frac{\text{Exec-Time}_{old}}{\text{Exec-Time}_{new}} = \frac{1}{0.95} = 1.053$$

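$$\text{Time}_{old} = \text{Time}_{old} \times \left(\text{Fraction}_{unchanged} + \text{Fraction}_{enhanced}\right)$$

$$\text{Time}_{new} = \text{Time}_{old} \times \left(\text{Fraction}_{unchanged} + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}\right)$$

$$\text{Speedup}_{overall} = \frac{\text{Time}_{old}}{\text{Time}_{new}} = \frac{\text{Time}_{old}}{\text{Time}_{old} \times \left(\text{Fraction}_{unchanged} + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}\right)} = \frac{1}{\text{Fraction}_{unchanged} + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

$$\text{Speedup}_{overall} = \frac{1}{\left(1 - \text{Fraction}_{enhanced}\right) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$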
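As a minimal sketch, the final speedup formula can be written as a small Python function. The name `amdahl_speedup` and the argument checks are assumptions added for illustration. The sample call uses the floating-point case from the slides (10% of instructions, 2X enhancement).

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - F_enhanced) + F_enhanced / S_enhanced)."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    fraction_unchanged = 1.0 - fraction_enhanced
    return 1.0 / (fraction_unchanged + fraction_enhanced / speedup_enhanced)


# Floating-point example: 10% of instructions sped up by 2X.
overall = amdahl_speedup(0.1, 2.0)
print(round(overall, 3))  # 1 / 0.95, about 1.053

# Even an enormous enhancement is bounded by 1 / (1 - F_enhanced).
print(round(amdahl_speedup(0.1, 1e12), 3))  # approaches 1 / 0.9, about 1.111
```

The second call illustrates why the common case matters: when only 10% of the work is enhanced, no enhancement factor can push the overall speedup past about 1.11.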

[Figure: time, instructions executed, code size in instructions, and code size in bits (scale 1 to 12) for the KDF9, B5500, ICL 1907 (annotated 1.1 μs), ATLAS, CDC 6600, and NU 1108]

The Burroughs B5500 machine is designed specifically for Algol 60 programs.
Although the CDC 6600's programs are over 3 times as big as those of the B5500, the CDC machine runs them almost 6 times faster.
Code size cannot be used as an indication of performance.

                        Computer A    Computer B
Program 1 (seconds)              1            10
Program 2 (seconds)           1000           100
Total time (seconds)          1001           110

A wrong summary can present a confusing picture:
A is 10 times faster than B for program 1.
B is 10 times faster than A for program 2.

Total execution time is a consistent summary measure.
Relative execution times for the same workload, assuming that programs 1 and 2 are executed the same number of times on computers A and B:

$$\frac{\text{CPU Performance}(B)}{\text{CPU Performance}(A)} = \frac{\text{Total execution time}(A)}{\text{Total execution time}(B)} = \frac{1001}{110} = 9.1$$

Execution time is the only valid and unimpeachable measure of performance.

$$\text{Arithmetic Mean (AM)} = \frac{1}{n} \sum_{i=1}^{n} \text{Execution\_Time}_i$$

$$\text{Weighted Arithmetic Mean (WAM)} = \sum_{i=1}^{n} w_i \times \text{Execution\_Time}_i$$

Where:
n is the number of programs executed
w_i is a weighting factor that indicates the frequency of executing program i, with $\sum_{i=1}^{n} w_i = 1$ and $0 \le w_i \le 1$

Weighted arithmetic means summarize performance while tracking execution time.
Never use AM for normalizing time relative to a reference machine.

                        Time on A   Time on B   Norm. to A      Norm. to B
                                                A       B       A       B
Program 1                       1          10   1       10      0.1     1
Program 2                    1000         100   1       0.1     10      1
AM of normalized time                           1       5.05    5.05    1
AM of time                  500.5          55   1       0.11    9.1     1
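The table's warning can be reproduced with a short Python sketch. The helper names and the equal weights in the weighted-mean call are illustrative assumptions. The execution times are the ones in the table.

```python
import math


def arithmetic_mean(values):
    """AM = (1 / n) * sum of the values."""
    return sum(values) / len(values)


def weighted_arithmetic_mean(values, weights):
    """WAM = sum of w_i * value_i, where the weights must sum to 1."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * v for w, v in zip(weights, values))


def normalize(times, reference):
    """Divide each time by the reference machine's time for that program."""
    return [t / r for t, r in zip(times, reference)]


time_a = [1, 1000]   # programs 1 and 2 on computer A (seconds)
time_b = [10, 100]   # programs 1 and 2 on computer B (seconds)

# AM of raw times tracks total execution time: A = 500.5, B = 55.
print(arithmetic_mean(time_a), arithmetic_mean(time_b))

# AM of normalized times depends on the reference machine:
# B looks about 5 times slower when normalized to A ...
print(arithmetic_mean(normalize(time_b, time_a)))
# ... and A looks about 5 times slower when normalized to B.
print(arithmetic_mean(normalize(time_a, time_b)))

# Equal weights (an assumption) reduce the WAM to the plain AM.
print(weighted_arithmetic_mean(time_a, [0.5, 0.5]))
```

Each machine appears slower than the other depending on which one is chosen as the reference, which is the contradiction the table illustrates.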

$$\text{Geometric Mean (GM)} = \sqrt[n]{\prod_{i=1}^{n} \text{Execution\_Time\_ratio}_i}$$

Where: n is the number of programs executed

With

$$\frac{\text{Geometric Mean}(X_i)}{\text{Geometric Mean}(Y_i)} = \text{Geometric Mean}\left(\frac{X_i}{Y_i}\right)$$

Geometric mean is suitable for reporting average normalized execution time.

                               Time on A   Time on B   Norm. to A      Norm. to B
                                                       A       B       A       B
Program 1                              1          10   1       10      0.1     1
Program 2                           1000         100   1       0.1     10      1
GM of time or normalized time      31.62       31.62   1       1       1       1
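A short Python sketch can check the geometric mean values in the table and the ratio property stated above. The function name and the log-based implementation are assumptions chosen for illustration. Multiplying the values and taking the n-th root would give the same result.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


time_a = [1, 1000]   # programs 1 and 2 on computer A (seconds)
time_b = [10, 100]   # programs 1 and 2 on computer B (seconds)

gm_a = geometric_mean(time_a)
gm_b = geometric_mean(time_b)
print(round(gm_a, 2), round(gm_b, 2))  # both are sqrt(1000), about 31.62

# GM(X) / GM(Y) equals GM(X / Y), so the reference machine does not matter.
ratios_b_over_a = [b / a for a, b in zip(time_a, time_b)]
print(round(geometric_mean(ratios_b_over_a), 2))  # about 1.0
print(round(gm_b / gm_a, 2))                      # same value
```

Note that the geometric mean rates A and B as equal here, while total execution time (1001 versus 110 seconds) does not. It gives a consistent ranking of normalized results but does not track execution time.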

Many widely used benchmarks are small programs that have significant locality of instruction and data reference.
Universal benchmarks can be misleading, since hardware and compiler vendors optimize their designs for these programs.
The best benchmarks are real applications, since they reflect the end user's interest.
Architectures might perform well for some applications and poorly for others.
Compilation can boost performance by taking advantage of architecture-specific features.
Application-specific compiler optimizations are becoming more popular.

[Figure: bar chart (scale 0 to 800) comparing "Compiler" and "Enhanced compiler" for the benchmarks gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, and tomcatv]

Application- and architecture-specific optimization can dramatically impact performance.

SPEC stands for the System Performance Evaluation Cooperative suite of benchmarks.
It was created by a set of companies to improve the measurement and reporting of CPU performance.
SPEC2000 is the latest suite; it consists of 12 integer programs (written in C) and 14 floating-point programs (in Fortran 77).
Customized SPEC suites have recently been introduced to assess the performance of graphics and transaction systems.
Since SPEC requires running applications on real hardware, the memory system has a significant effect on performance.

Hardware
  Model number            Powerstation 550
  CPU                     41.67-MHz POWER 4164
  FPU (floating point)    Integrated
  Number of CPU           1
  Cache size per CPU      64K data/8k instruction
  Memory                  64 MB
  Disk subsystem          2 400-MB SCSI
  Network interface       N/A

Software
  OS type and revision    AIX Ver. 3.1.5
  Compiler revision       AIX XL C/6000 Ver. 1.1.5
                          AIX XL Fortran Ver. 2.2
  Other software          None
  File system type        AIX
  Firmware level          N/A

System
  Tuning parameters       None
  Background load         None
  System state            Multi-user (single-user login)

The guiding principle is reproducibility (report the environment and experimental setup).

$$\text{SPEC ratio} = \frac{\text{Execution time on SUN SPARCstation 10/40}}{\text{Execution time on the measured machine}}$$

Bigger numeric values of the SPEC ratio indicate a faster machine.

[Figure: two plots of performance (scale 0 to 10) versus clock rate (MHz, 0 to 250) for the Pentium and Pentium Pro]

The performance measured may be different on other Pentium-based hardware with a different memory system and different compilers.
At the same clock rate, the SPECint95 measure shows that the Pentium Pro is 1.4-1.5 times faster, while SPECfp95 shows that it is 1.7-1.8 times faster.
When the clock rate is increased by a certain factor, the processor performance increases by a lower factor.

[Figure: SPECbase CINT2000 and SPEC CINT2000 per $1000 in price; prices reflect those of July 2001]

Different results are obtained for other benchmarks, e.g. SPEC CFP2000.
With the exception of the Sunblade, price-performance metrics were consistent with performance.

In early computers, most instructions of a machine took the same execution time.
The measure of performance for old machines was the time required to perform an individual operation (e.g. addition).
New computers have a diverse set of instructions with different execution times.
The relative frequency of instructions across many programs was calculated.
The average instruction execution time was computed by weighting the time of each instruction by its frequency.
From the average instruction execution time it was a small step to MIPS, which grew in popularity.

MIPS = Millions of Instructions Per Second
One of the simplest metrics, valid only in a limited context.

$$\text{MIPS (native MIPS)} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}}$$

MIPS is simple and intuitive to use: faster machines have bigger MIPS ratings.

There are three problems with MIPS:
MIPS specifies the instruction execution rate but not the capabilities of the instructions.
MIPS varies between programs on the same computer.
MIPS can vary inversely with performance (see next example).

Consider the machine with the following three instruction classes and CPI:

Instruction class    CPI for this instruction class
A                    1
B                    2
C                    3

Now suppose we measure the code for the same program from two different compilers and obtain the following data:

                Instruction count (in billions) for each instruction class
Code from       A       B       C
Compiler 1      5       1       1
Compiler 2      10      1       1

Assume that the machine's clock rate is 500 MHz. Which code sequence will execute faster according to MIPS? According to execution time?

Answer:

Using the formula:

$$\text{CPU clock cycles} = \sum_{i=1}^{n} \text{CPI}_i \times C_i$$

Sequence 1: CPU clock cycles = $(5 \times 1 + 1 \times 2 + 1 \times 3) \times 10^{9} = 10 \times 10^{9}$ cycles
Sequence 2: CPU clock cycles = $(10 \times 1 + 1 \times 2 + 1 \times 3) \times 10^{9} = 15 \times 10^{9}$ cycles

Using the formula:

$$\text{Execution time} = \frac{\text{CPU clock cycles}}{\text{Clock rate}}$$

Sequence 1: Execution time = $(10 \times 10^{9}) / (500 \times 10^{6}) = 20$ seconds
Sequence 2: Execution time = $(15 \times 10^{9}) / (500 \times 10^{6}) = 30$ seconds

Therefore compiler 1 generates a faster program.

Using the formula:

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}}$$

Sequence 1: MIPS = $\frac{(5 + 1 + 1) \times 10^{9}}{20 \times 10^{6}} = 350$
Sequence 2: MIPS = $\frac{(10 + 1 + 1) \times 10^{9}}{30 \times 10^{6}} = 400$

Although compiler 2 has a higher MIPS rating, the code generated by compiler 1 runs faster.
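The whole example can be recomputed with a short Python sketch. The dictionary layout and function names are assumptions made for illustration. The CPI values, instruction counts, and 500 MHz clock rate come from the example itself.

```python
CLOCK_RATE_HZ = 500e6  # 500 MHz, as stated in the example

# CPI for each instruction class.
CPI = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the code from each compiler.
COUNTS = {
    "compiler 1": {"A": 5e9, "B": 1e9, "C": 1e9},
    "compiler 2": {"A": 10e9, "B": 1e9, "C": 1e9},
}


def clock_cycles(counts):
    """CPU clock cycles = sum over classes of CPI_i x C_i."""
    return sum(CPI[cls] * n for cls, n in counts.items())


def execution_time(counts):
    """Execution time = CPU clock cycles / clock rate."""
    return clock_cycles(counts) / CLOCK_RATE_HZ


def mips(counts):
    """MIPS = instruction count / (execution time x 10^6)."""
    return sum(counts.values()) / (execution_time(counts) * 1e6)


for name, counts in COUNTS.items():
    print(f"{name}: {execution_time(counts):.0f} s, {mips(counts):.0f} MIPS")
```

Running it gives 20 seconds and 350 MIPS for compiler 1 against 30 seconds and 400 MIPS for compiler 2, so the higher MIPS rating belongs to the slower code.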