EE 457 Unit 4: Computer System Performance

Motivation

Computer System Performance
- An individual user wants to:
  - Minimize single program execution time
- A datacenter owner wants to:
  - Maximize number of ____
  - Minimize ____ ( ____ )
Image sources: http://e-tellgentnternetmarketng.com/webste/frustrated-computer-user-2/ http://www.ntomoble.com/2010/11/02/opera-celand-clean/

Performance Depends on View Point?!
- What's faster:
  - A 747 Jumbo Airliner
  - An F-22 fighter jet
- If you are an individual interested in getting from point A to point B, then the F-22
  - This is known as latency
  - Time from the start of an operation until it completes
- If you are trying to move a large number of people, the 747
  - This is known as throughput

Throughput vs. Latency
- If Latency is the Time it takes for a Job to complete, and Throughput = Jobs / Time, is Throughput = 1 / Latency?
  - Latency is from the perspective of a ____
  - Throughput is from the perspective of ____
- ____ is the great friend of throughput!
- We will see many times in this course (pipelining, memory org., etc.) that there is often not much we can do about latency, but there are lots of ways to improve throughput
  - Hopefully without hurting latency too much, if at all
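To make the "Is Throughput = 1 / Latency?" question concrete, here is a minimal sketch. The job latency and the number of overlapping jobs are hypothetical values chosen only for illustration, and the model assumes steady state with perfectly overlapped jobs.

```python
def throughput(jobs_in_flight: int, latency_s: float) -> float:
    """Jobs completed per second when `jobs_in_flight` jobs overlap,
    each taking `latency_s` seconds from start to finish (steady state)."""
    return jobs_in_flight / latency_s


# Hypothetical numbers, not from the slides.
latency_s = 2.0                          # every job still takes 2 s
one_at_a_time = throughput(1, latency_s)  # throughput == 1 / latency
four_at_once = throughput(4, latency_s)   # same latency, 4x the throughput

print(f"1 job in flight : {one_at_a_time} jobs/s")
print(f"4 jobs in flight: {four_at_once} jobs/s")
```

Throughput equals 1 / latency only when one job is processed at a time. Overlapping jobs raises throughput while the latency each job sees stays the same.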

What are the metrics?

Execution Time
- Key Point: When comparing different systems, execution time is the ultimate criterion (metric)
- Using a rate as a metric can often be misleading
  - Often not comparing apples to apples
  - Often not normalized

What's Wrong with Rates
- Two trains take two different routes from City A to City B and leave at the same time. Train 1 travels at 60 MPH, while train 2 travels at 75 MPH. Which one arrives first?
- Example 1 (MIPS): You may hear that Computer 1 executes 500 MIPS while Computer 2 executes 750 MIPS. Which one executes a given program faster?
  - Train speed = MIPS & Routes = Program (how many instructions)
  - MIPS is only useful for the same ____
- Example 2 (Clock Rate): You may hear that CPU1 runs at 2 GHz and CPU2 runs at 3 GHz. Which one executes a program faster (assume same instruction set)?
  - CPU1 may have ____ while CPU2 has ____
  - CPU1 Time < CPU2 Time

Wall Clock Time vs. CPU Time
- Even execution time can be hard to measure accurately because the OS may allocate a percentage of compute cycles to other programs (also, part of a program's execution is spent in OS calls or waiting for I/O, etc.)
- Wall Clock Time: Real time it took from when the user submitted the job until it was completed
- CPU Time (User Time + System Time): Actual time the program used the CPU, either in the application code (User Time) or in the OS (System Time)
  - Doesn't include I/O time
- Linux/Unix: %
    real 0m16.019s
    user 0m12.840s
    sys  0m0.180s
  - Here user + sys = 13.020 s of CPU time out of 16.019 s of wall clock time; the remainder was spent waiting (I/O, other programs)
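The wall clock vs. CPU time distinction can be observed directly. The sketch below uses Python's standard `time` module. The function names `measure` and `job` are hypothetical, and `time.sleep` stands in for an I/O wait.

```python
import time


def measure(fn):
    """Return (wall_seconds, cpu_seconds) for one call of fn()."""
    wall_start = time.perf_counter()   # wall clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def job():
    sum(i * i for i in range(200_000))  # compute: counts toward CPU time
    time.sleep(0.2)                     # stand-in for I/O: wall time only


wall, cpu = measure(job)
print(f"wall {wall:.3f} s, cpu {cpu:.3f} s")
```

The sleep adds to wall clock time but not to CPU time, which is the same effect as the gap between `real` and `user + sys` in the Linux/Unix output.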

Performance
- Performance is defined as the inverse of execution time

$$\text{Performance} = \frac{1}{\text{Execution Time}}$$

- Often want to compare relative performance or speedup (how many times faster is a new system than an old one)

$$\text{Speedup} = \frac{\text{Performance}_{\text{New}}}{\text{Performance}_{\text{Old}}} = \frac{\text{Execution Time}_{\text{Old}}}{\text{Execution Time}_{\text{New}}}$$

Performance Equation
- Execution time can be modeled using three components
  - IC: Dynamic Instruction Count, not Static Instruction Count
  - CPI: Average number of clock cycles to execute each instruction
  - Clock cycle time (clock period)

$$\text{Exec. Time} = \text{IC} \times \text{CPI} \times \text{Clock Period}$$

Dynamic vs. Static Instruction Count
- Static instruction count is the number of written instructions
- Dynamic instruction count (or trace count) is how many instructions were executed at run time
- Would you prefer either:
  - Small Static IC & Large Dynamic IC, or
  - Large Static IC & Small Dynamic IC
[Figure: code listing with labels LP:, BNE LP, THN:, ELS: contrasting Static IC and Dynamic IC]

What Affects Performance

Component            | SW/HW | Affects                          | Description
Algorithm            | SW    | Instruc. Count & CPI             | Determines how many instructions & which kind are executed
Programming Language | SW    | Instruc. Count & CPI             | Determines constructs that need to be translated and the kind of instructions
Compiler             | SW    | Instruc. Count & CPI             | Efficiency of translation affects how many and which instructions are used
Instruction Set      | HW    | Instruc. Count, CPI, Clock Cycle | Determines what instructions are available and what work each instruction performs
Microarchitecture    | HW    | CPI, Clock Cycle                 | Determines how each instruction is executed (CPI, clock period)

Source: H&P, Computer Organization & Design, 3rd Ed.
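A minimal sketch of the performance equation and the speedup definition follows. The instruction count, CPI values, and clock rate are hypothetical numbers chosen only for illustration.

```python
def exec_time(ic: float, cpi: float, clock_hz: float) -> float:
    """Execution time = IC * CPI * clock period, with period = 1 / clock_hz."""
    return ic * cpi / clock_hz


def speedup(time_old: float, time_new: float) -> float:
    """How many times faster the new system is than the old one."""
    return time_old / time_new


# Hypothetical program: 1 billion dynamic instructions on a 1 GHz clock.
t_old = exec_time(ic=1_000_000_000, cpi=2.0, clock_hz=1e9)  # old design
t_new = exec_time(ic=1_000_000_000, cpi=1.5, clock_hz=1e9)  # lower CPI

print(f"old {t_old} s, new {t_new} s, speedup {speedup(t_old, t_new):.3f}")
```

Any of the three components (IC, CPI, clock period) can be traded against the others. Only the product determines execution time.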

CPI & Microarchitecture
[Figure: three datapaths (Single Bus, Two-Bus, Three Bus), each with registers R0, R1, ..., Rn, a Y Reg., an ALU, and a Z Reg.]
- Single Bus
  - Clock 1: Y <- Rsrc1
  - Clock 2: Z <- Rsrc2 + Y
  - Clock 3: Rdst <- Z
- Two-Bus
  - Clock 1: Z <- Rsrc1 + Rsrc2
  - Clock 2: Rdst <- Z
- Three Bus
  - Clock 1: Rdst <- Rsrc1 + Rsrc2
- General Implications: Less Resources => More Clock Cycles (Time)

Example
- Processor A runs at 200 MHz and executes a 40 million instruction program at a sustained 50 MIPS
- Processor B runs at 400 MHz and executes the same program (w/ a different compiler), which yields a count of 60 million instructions and a CPI of 6
- What is the CPI of the program on Proc. A?
- Which processor executes the program faster and by what factor?
- What is the MIPS rate of Proc. B?

Calculating CPI
- CPI can be found by taking the expected value (weighted average) of each instruction type's CPI [i.e. CPI for each type * frequency (probability) of that type of instruction]

$$\text{CPI} = \sum_{\text{Type}} \text{CPI}_{\text{Type}} \cdot P(\text{InstructionType})$$

- In practice, CPI is often hard to find analytically because in modern processors instruction execution is dependent on earlier instructions
  - Instead we run benchmark applications on simulators to measure average CPI

Example
- If CLK = 1 MHz, what is PEAK Inst./Sec.?
- Average CPI = ____

Instruction type | CPI on P1
A                | 1
B                | 2
C                | 3
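The Processor A vs. Processor B questions can be worked with the performance equation. The sketch below uses only the numbers given in the problem statement, plus the identity MIPS = clock rate / (CPI * 10^6), which follows from MIPS = IC / (Exec. Time * 10^6). The variable names are illustrative.

```python
MILLION = 1e6

# Processor A: 200 MHz, 40 million instructions, sustained 50 MIPS.
clock_a, ic_a, mips_a = 200 * MILLION, 40 * MILLION, 50.0
cpi_a = clock_a / (mips_a * MILLION)   # cycles per second / instructions per second
time_a = ic_a * cpi_a / clock_a        # IC * CPI * clock period

# Processor B: 400 MHz, 60 million instructions, CPI of 6.
clock_b, ic_b, cpi_b = 400 * MILLION, 60 * MILLION, 6.0
time_b = ic_b * cpi_b / clock_b
mips_b = clock_b / (cpi_b * MILLION)

factor = time_b / time_a               # how many times faster A is than B

print(f"CPI of A      : {cpi_a}")
print(f"time A, time B: {time_a:.2f} s, {time_b:.2f} s")
print(f"A is faster by: {factor:.3f}x")
print(f"MIPS of B     : {mips_b:.2f}")
```

Processor B has both the higher clock rate and the higher MIPS rating, yet Processor A finishes the program first. This is the reason rates are misleading when the instruction counts differ.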

Example
- Calculate the CPI of this snippet of code using the following CPIs for each instruction type

          add  $s0,$zero,$zero
          addi $t1,$zero,4
    loop: lw   $t2,0($t0)
          add  $t2,$t2,$t1
          addi $t0,$t0,4
          addi $t1,$t1,-1
          bne  $t1,$zero,loop
          sw   $t2,0($t2)

Instruction type | CPI | Dynamic count
add / addi       | 1   | ____
lw / sw          | 4   | ____
bne              | 2   | ____

Dynamic Instruction Count = ____

$$\text{CPI} = \sum_{\text{Type}} \text{CPI}_{\text{Type}} \cdot P(\text{InstructionType})$$

Other Performance Measures
- OPS/FLOPS = (Floating-Point) Operations/Sec.
  - Maximum number of arithmetic operations per second the processor can achieve
  - Example: 4 FP ALUs on a processor running @ 2 GHz => 8 GFLOPS
- Memory Bandwidth (Bytes/Sec.)
  - Maximum bytes of memory per second that can be read/written
- Programs are either memory bound or computationally bound

Energy Proportional Computing
[Figure: Desired Power vs. Utilization Relationship]
Source: The Case for Energy-Proportional Computing, Luiz André Barroso, Urs Hölzle, IEEE Computer, vol. 40 (2007).

What should I optimize?
AMDAHL'S LAW
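One way to work the CPI exercise for the loop snippet is to build the dynamic instruction trace and take the weighted average. The sketch makes two assumptions. First, the loop body runs 4 times, since $t1 starts at 4 and is decremented until it reaches 0. Second, instructions with an immediate operand fall in the add / addi class. The type labels "add", "mem", and "bne" are illustrative names.

```python
from collections import Counter

CPI_BY_TYPE = {"add": 1, "mem": 4, "bne": 2}  # add/addi, lw/sw, bne

ITERATIONS = 4                                 # $t1 counts down 4, 3, 2, 1
before_loop = ["add", "add"]                   # two setup instructions
loop_body = ["mem", "add", "add", "add", "bne"]  # lw, add, addi, addi, bne
after_loop = ["mem"]                           # final sw

static_ic = len(before_loop) + len(loop_body) + len(after_loop)
trace = before_loop + loop_body * ITERATIONS + after_loop
dynamic_ic = len(trace)

counts = Counter(trace)                        # dynamic count per type
total_cycles = sum(CPI_BY_TYPE[t] * n for t, n in counts.items())
average_cpi = total_cycles / dynamic_ic        # same as sum of CPI_type * P(type)

print(f"static IC  = {static_ic}")
print(f"dynamic IC = {dynamic_ic}, by type {dict(counts)}")
print(f"cycles     = {total_cycles}, average CPI = {average_cpi:.3f}")
```

The static count is 8 written instructions, while the dynamic count is 23 executed instructions. CPI is always computed from the dynamic count.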

Amdahl's Law
- Where should we put our effort when trying to enhance performance of a program?
- Amdahl's Law => How much performance gain do we get by improving only a part of the whole

$$\text{ExecTime}_{\text{New}} = \text{ExecTime}_{\text{Unaffected}} + \frac{\text{ExecTime}_{\text{Affected}}}{\text{ImprovementFactor}}$$

$$\text{Speedup} = \frac{\text{ExecTime}_{\text{Old}}}{\text{ExecTime}_{\text{New}}} = \frac{1}{\text{Percent}_{\text{Unaffected}} + \frac{\text{Percent}_{\text{Affected}}}{\text{ImprovementFactor}}}$$

- Holds for both HW and SW
  - HW: Which instructions should we make fast? The most used (executed) ones
  - SW: Which portions of our program should we work to optimize
- Holds for parallelization of algorithms (converting code to run on multiple processors)
[Figure: Original Sequential Program vs. Parallelized Program]

Parallelization
- A programmer parallelizes a function in his program to be run on 8 cores. The function accounted for 40% of the runtime of the overall program. What is the speedup of the enhancement?

What if we improve only class B instrucs.?

Instruction type | CPI on P1 | Freq.
A                | 1         | 10%
B                | 2 -> 1    | 40%
C                | 3         | 50%

- Original average CPI = 0.1(1) + 0.4(2) + 0.5(3) = 2.4, so class B accounts for 0.8 / 2.4 = 1/3 of the execution time and is improved by a factor of 2

$$\text{Speedup} = \frac{1}{2/3 + (1/3)/2} = \frac{1}{5/6} = \frac{6}{5} = 1.2$$
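Both Amdahl's Law questions above can be checked numerically. The sketch assumes the parallelized function speeds up by an ideal factor of 8 on 8 cores (no overhead), which the problem statement does not guarantee. The function name `amdahl_speedup` is illustrative.

```python
def amdahl_speedup(fraction_affected: float, improvement_factor: float) -> float:
    """Overall speedup when only `fraction_affected` of the original
    execution time is improved by `improvement_factor`."""
    unaffected = 1.0 - fraction_affected
    return 1.0 / (unaffected + fraction_affected / improvement_factor)


# 40% of runtime parallelized across 8 cores (ideal 8x on that part).
parallel = amdahl_speedup(0.40, 8)

# Class B improvement: CPI 2 -> 1 for 40% of instructions.
freq = {"A": 0.10, "B": 0.40, "C": 0.50}
cpi_old = {"A": 1, "B": 2, "C": 3}
cpi_new = {"A": 1, "B": 1, "C": 3}
avg_old = sum(freq[t] * cpi_old[t] for t in freq)      # 2.4
avg_new = sum(freq[t] * cpi_new[t] for t in freq)      # 2.0
time_fraction_b = freq["B"] * cpi_old["B"] / avg_old   # 1/3 of the time
class_b = amdahl_speedup(time_fraction_b, 2)

print(f"8-core speedup  : {parallel:.3f}")
print(f"class B speedup : {class_b:.3f} (CPI ratio {avg_old / avg_new:.3f})")
```

Even with 8 cores the overall gain is only about 1.54x, because the untouched 60% of the runtime dominates. Note also that the fraction that matters is the fraction of time (1/3 for class B), not the fraction of instructions (40%).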

Profiling
- How do you know where time is being spent?
- From a software (programming for performance) perspective, profilers are handy tools
  - Instrument your code to take statistics as it runs, and then show you what percentage of time each function or even line of code was responsible for
  - Common profilers
    - 'gprof' (usually standard with Unix / Linux installs) and 'g++'
    - Intel VTune
    - MS Visual Studio Profiling Tools
- From a hardware perspective, simulators can help
  - SimpleScalar
  - Simics
  - Your own simulation model developed in Verilog/SystemC/etc.

gprof Output

  %    cumulative   self                 self     total
 time   seconds    seconds      calls   s/call   s/call  name
42.96      4.48       4.48   56091649     0.00     0.00  Board::operator<(Board const&) const
 6.43      5.15       0.67    2209524     0.00     0.00  std::_rb_tree<...>::_m_lower_bound(...)
 5.08      5.68       0.53  108211500     0.00     0.00  __gnu_cxx::__normal_iterator<...>::operator+(...)
 4.51      6.15       0.47    4419052     0.00     0.00  Board::Board(Board const&)
 4.32      6.60       0.45    1500793     0.00     0.00  void std::__adjust_heap<...>(...)
 3.84      7.00       0.40   28553646     0.00     0.00  PuzzleMove::operator>(PuzzleMove const&) const

Credits
- These slides were derived from Gandhi Puvvada's EE 457 Class Notes
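A profile like the gprof output above feeds directly into Amdahl's Law. The sketch below takes the top row (Board::operator<, 42.96% of time, 4.48 self seconds) from that output. The 2x improvement factor is a hypothetical target, not a measured result.

```python
PERCENT_TIME = 42.96   # "% time" column for Board::operator<
SELF_SECONDS = 4.48    # "self seconds" column for the same row

fraction = PERCENT_TIME / 100.0
total_seconds = SELF_SECONDS / fraction  # implied total profiled run time

HYPOTHETICAL_FACTOR = 2.0                # assume the function is made 2x faster
speedup_2x = 1.0 / ((1.0 - fraction) + fraction / HYPOTHETICAL_FACTOR)
speedup_limit = 1.0 / (1.0 - fraction)   # bound if the function cost nothing

print(f"implied total run time : {total_seconds:.2f} s")
print(f"speedup if 2x faster   : {speedup_2x:.3f}")
print(f"upper bound on speedup : {speedup_limit:.3f}")
```

The upper bound shows why the profiler's top entry is the place to start: no amount of work on the smaller entries can beat what this one function offers.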