Lecture 1: Fundamental Concepts and Performance Analysis
CENG 332
[Lecture slides are adapted from the reference book: Computer Organization and Design, Patterson & Hennessy, 2011, MKP]

Understanding Performance
- Algorithm: determines the number of operations executed
- Programming language, compiler, architecture: determine the number of machine instructions executed per operation
- Processor and memory system: determine how fast instructions are executed
- I/O system (including OS): determines how fast I/O operations are executed

Below Your Program
- Application software
  - Written in a high-level language (HLL)
- System software
  - Compiler: translates HLL code to machine code
  - Operating system: service code
    - Handling input/output
    - Managing memory and storage
    - Scheduling tasks and sharing resources
- Hardware
  - Processor, memory, I/O controllers

Levels of Program Code
- High-level language
  - Level of abstraction closer to the problem domain
  - Provides for productivity and portability
- Assembly language
  - Textual representation of instructions
- Hardware representation
  - Binary digits (bits)
  - Encoded instructions and data
Components of a Computer (The BIG Picture)
- Same components for all kinds of computer
  - Desktop, server, embedded
- Input/output includes
  - User-interface devices: display, keyboard, mouse
  - Storage devices: hard disk, CD/DVD, flash
  - Network adapters: for communicating with other computers

Inside the Processor (CPU)
- Datapath: performs operations on data
- Control: sequences datapath, memory, ...
- Cache memory: small, fast SRAM memory for immediate access to data
- Example: AMD Barcelona, 4 processor cores

Relative Performance
$$\text{Performance} = \frac{1}{\text{Execution Time}}$$
"X is n times faster than Y" means:
$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = n$$
Example: a program takes 10 s on machine A and 15 s on machine B.
$$\frac{ET_B}{ET_A} = \frac{15\,\text{s}}{10\,\text{s}} = 1.5$$
So A is 1.5 times faster than B.
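The relative performance ratio can be checked with a few lines of Python. This is a minimal sketch; the function name `relative_performance` is an illustrative choice, not something from the slides, and the inputs are execution times of 10 s and 15 s for machines A and B.

```python
def relative_performance(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times faster than Y (n = ET_Y / ET_X)."""
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Machine A runs the program in 10 s, machine B in 15 s.
n = relative_performance(exec_time_x=10.0, exec_time_y=15.0)
print(f"A is {n:.1f} times faster than B")
```

Note that the ratio is taken with the slower machine's time in the numerator, because performance is the reciprocal of execution time.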
Measuring Time
- Elapsed time
  - Total response time, including all aspects: processing, I/O, OS overhead, idle time
  - Determines system performance
- CPU time
  - Time spent processing a given job
  - Discounts I/O time and other jobs' shares
  - Different programs are affected differently by CPU and system performance
- Throughput
  - Total work done per unit time, e.g., instructions executed per second, tasks per hour

CPU Time
$$\text{CPU Time} = \text{Number of Clock Cycles} \times \text{Clock Cycle Time}$$
$$\text{Number of Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction (CPI)}$$
$$\text{Clock Cycle Time} = \frac{1}{\text{Clock Rate}}$$
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

Example
A program runs on two different computers, A and B. If both computers have the same ISA, which one is faster, and by how much?
- Computer A: Cycle Time = 250 ps, CPI = 2.0
- Computer B: Cycle Time = 500 ps, CPI = 1.2
Because the ISA is the same, both computers execute the same instruction count IC.
$$\text{CPU Time}_A = IC \times CPI_A \times CCT_A = IC \times 2.0 \times 250\,\text{ps} = IC \times 500\,\text{ps}$$
$$\text{CPU Time}_B = IC \times CPI_B \times CCT_B = IC \times 1.2 \times 500\,\text{ps} = IC \times 600\,\text{ps}$$
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{IC \times 600\,\text{ps}}{IC \times 500\,\text{ps}} = 1.2$$
A is 1.2 times faster than B.

Instruction Count and CPI
- Performance is improved by
  - Reducing the number of clock cycles
  - Increasing the clock rate
- Instruction count for a program
  - Determined by the program, ISA and compiler
- Average cycles per instruction (CPI)
  - Determined by CPU hardware
  - If different instructions have different CPI, the average CPI is affected by the instruction mix
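The CPU time equation and the two-computer example can be reproduced as a short sketch. The instruction count below is a hypothetical value chosen only for illustration; it cancels in the ratio because both computers share the same ISA.

```python
def cpu_time_ps(instruction_count: float, cpi: float, cycle_time_ps: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time (result in ps)."""
    return instruction_count * cpi * cycle_time_ps


# Hypothetical instruction count (assumption, not from the slides).
IC = 1_000_000

time_a = cpu_time_ps(IC, cpi=2.0, cycle_time_ps=250)  # Computer A
time_b = cpu_time_ps(IC, cpi=1.2, cycle_time_ps=500)  # Computer B
ratio = time_b / time_a
print(f"A is {ratio:.1f} times faster than B")
```

Computer A wins despite its higher CPI because its cycle time is half that of B; neither CPI nor clock rate alone decides the comparison.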
CPI in More Detail
If different instruction classes take different numbers of cycles:
$$\text{Number of Clock Cycles} = \sum_{i=1}^{n} (CPI_i \times IC_i)$$
Weighted average CPI:
$$CPI = \frac{\text{Number of Clock Cycles}}{IC} = \sum_{i=1}^{n} \left( CPI_i \times \frac{IC_i}{IC} \right)$$
where $IC_i / IC$ is the relative frequency of instruction class $i$.

CPI Example
A high-level program is compiled by two different compilers. Each code sequence has instructions from three instruction classes: A, B, C. Given the CPI value of each instruction class, find the average CPI.

Class               A   B   C
CPI for class       1   2   3
IC in sequence 1    2   1   2
IC in sequence 2    4   1   1

- Sequence 1: IC = 5
  - Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  - Avg. CPI = 10/5 = 2.0
- Sequence 2: IC = 6
  - Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  - Avg. CPI = 9/6 = 1.5

MIPS
- MIPS: Million Instructions Per Second
- Doesn't account for differences in ISAs between computers or differences in complexity between instructions
- So it is not a good performance metric
$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times CPI}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{CPI \times 10^6}$$
- CPI varies between programs on a given CPU

Uniprocessor Performance
- Constrained by power, instruction-level parallelism, memory latency
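Returning to the CPI example, the weighted-average computation can be sketched as follows. The function name `average_cpi` and the dictionary layout are illustrative assumptions; the per-class CPI values (1, 2, 3) and the instruction counts of the two sequences are those of the example.

```python
def average_cpi(cpi_per_class: dict, count_per_class: dict) -> tuple:
    """Return (clock cycles, instruction count, average CPI) for one code sequence."""
    clock_cycles = sum(cpi_per_class[c] * n for c, n in count_per_class.items())
    instruction_count = sum(count_per_class.values())
    return clock_cycles, instruction_count, clock_cycles / instruction_count


CPI_PER_CLASS = {"A": 1, "B": 2, "C": 3}
sequence_1 = {"A": 2, "B": 1, "C": 2}
sequence_2 = {"A": 4, "B": 1, "C": 1}

for name, seq in (("Sequence 1", sequence_1), ("Sequence 2", sequence_2)):
    cycles, ic, cpi = average_cpi(CPI_PER_CLASS, seq)
    print(f"{name}: IC = {ic}, clock cycles = {cycles}, average CPI = {cpi:.1f}")
```

Sequence 2 executes more instructions but fewer clock cycles, so with the same clock it finishes first; instruction count alone does not predict execution time.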
Multiprocessors
- Multicore microprocessors
  - More than one processor per chip
- Requires explicitly parallel programming
  - Compare with instruction-level parallelism
    - Hardware executes multiple instructions at once
    - Hidden from the programmer
  - Hard to do
    - Programming for performance
    - Load balancing
    - Optimizing communication and synchronization

SPEC CPU Benchmark
- Programs used to measure performance
  - Supposedly typical of actual workload
- Standard Performance Evaluation Corp (SPEC)
  - Develops benchmarks for CPU, I/O, Web, ...
- SPEC CPU2006
  - Elapsed time to execute a selection of programs
    - Negligible I/O, so focuses on CPU performance
  - Normalize relative to a reference machine
  - Summarize as the geometric mean of performance ratios
    - CINT2006 (integer) and CFP2006 (floating-point)
$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$

CINT2006 for Opteron X4 2356

Name        Description                    IC ×10^9   CPI     Tc (ns)   Exec time (s)   Ref time (s)   SPECratio
perl        Interpreted string processing  2,118      0.75    0.40      637             9,777          15.3
bzip2       Block-sorting compression      2,389      0.85    0.40      817             9,650          11.8
gcc         GNU C Compiler                 1,050      1.72    0.40      724             8,050          11.1
mcf         Combinatorial optimization     336        10.00   0.40      1,345           9,120          6.8
go          Go game (AI)                   1,658      1.09    0.40      721             10,490         14.6
hmmer       Search gene sequence           2,783      0.80    0.40      890             9,330          10.5
sjeng       Chess game (AI)                2,176      0.96    0.40      837             12,100         14.5
libquantum  Quantum computer simulation    1,623      1.61    0.40      1,047           20,720         19.8
h264avc     Video compression              3,102      0.80    0.40      993             22,130         22.3
omnetpp     Discrete event simulation      587        2.94    0.40      690             6,250          9.1
astar       Games/path finding             1,082      1.79    0.40      773             7,020          9.1
xalancbmk   XML parsing                    1,058      2.70    0.40      1,143           6,900          6.0
Geometric mean                                                                                         11.7

Slide annotation on the table: high cache miss rates.

Fundamental Concepts
1) Taking advantage of parallelism
2) The principle of locality
3) Focus on the common case
4) Amdahl's Law
5) Processor performance equation
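As a check on the SPEC summary, the sketch below computes one SPECratio as reference time divided by execution time and then the geometric mean of the twelve CINT2006 SPECratios for the Opteron X4 2356. The numeric values are the table entries as reconstructed here (perl: reference time 9,777 s, execution time 637 s); the function name `spec_ratio` is an illustrative choice. It relies on `statistics.geometric_mean`, available in Python 3.8 and later.

```python
from statistics import geometric_mean


def spec_ratio(ref_time_s: float, exec_time_s: float) -> float:
    """SPECratio: execution time normalized to the reference machine."""
    return ref_time_s / exec_time_s


# SPECratio column of the CINT2006 table, in row order (perl ... xalancbmk).
spec_ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8, 22.3, 9.1, 9.1, 6.0]

summary = geometric_mean(spec_ratios)
print(f"perl SPECratio: {spec_ratio(9777, 637):.1f}")
print(f"Geometric mean of SPECratios: {summary:.1f}")
```

The geometric mean is used rather than the arithmetic mean so that the summary does not depend on which machine is chosen as the reference.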
1) Taking Advantage of Parallelism
- Increasing throughput of a server computer via multiple processors or multiple disks
- Detailed HW design
  - Carry-lookahead adders use parallelism to speed up addition
  - Multiple memory banks are searched in parallel
- Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence
  - Classic 5-stage pipeline: 1) Instruction Fetch (Ifetch), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)

Pipelined Instruction Execution
[Figure: pipeline diagram. Vertical axis: instruction order. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Each instruction passes through the stages Ifetch, Reg, ALU, DMem, Reg.]

Limits to Pipelining
Hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: an attempt to use the same hardware to do two different things at once
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

2) The Principle of Locality
- The principle of locality: programs access a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
- For the last 30 years, HW has relied on locality for memory performance.
[Figure: P (processor), $ (cache), MEM (memory)]
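The effect of temporal and spatial locality can be illustrated with a toy cache model. This is a sketch under stated assumptions, not a model of any real processor: a direct-mapped cache with 64 lines of 16 bytes each (both parameters hypothetical), fed with three synthetic byte-address traces.

```python
def hit_rate(addresses, num_lines: int = 64, block_size: int = 16) -> float:
    """Simulate a direct-mapped cache and return the fraction of accesses that hit."""
    lines = [None] * num_lines          # block number currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // block_size      # which memory block the byte belongs to
        index = block % num_lines       # the only line this block may occupy
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block        # miss: fetch the block, evicting the old one
    return hits / len(addresses)


sequential = list(range(4096))                 # spatial locality: neighbouring bytes
loop = list(range(256)) * 16                   # temporal locality: same 256 bytes reused
strided = [i * 1024 for i in range(4096)]      # no reuse and no nearby accesses

for name, trace in (("sequential", sequential), ("loop", loop), ("strided", strided)):
    print(f"{name:10s} hit rate = {hit_rate(trace):.3f}")
```

In this model the sequential trace misses only on the first byte of each block, the loop misses only during its first pass, and the strided trace never hits because every access maps a new block onto the same line.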
Levels of the Memory Hierarchy

Level             Capacity           Access time               Cost
CPU registers     100s bytes         300-500 ps (0.3-0.5 ns)
L1 and L2 cache   10s-100s KBytes    ~1 ns to ~10 ns           $1000s/GByte
Main memory       GBytes             80 ns-200 ns              ~$100/GByte
Disk              10s TBytes         10 ms (10,000,000 ns)     ~$1/GByte
Tape              infinite           sec-min                   ~$1/GByte

Staging/transfer unit between levels:
- Registers and L1 cache: instructions and operands, managed by program/compiler, 1-8 bytes
- L1 cache and L2 cache: blocks, managed by cache controller, 32-64 bytes
- L2 cache and memory: blocks, managed by cache controller, 64-128 bytes
- Memory and disk: pages, managed by OS, 4K-8K bytes
- Disk and tape: files, managed by user/operator, Mbytes
Upper levels are faster; lower levels are larger.

3) Focus on the Common Case
- In making a design trade-off, favor the frequent case over the infrequent case
  - E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first
- The frequent case is often simpler and can be done faster than the infrequent case
  - E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the common case of no overflow
- What is the frequent case, and how much is performance improved by making it faster? => Amdahl's Law

4) Amdahl's Law
The speedup gained from some faster mode of execution is limited by the fraction of the time during which the faster mode is used.
$$\text{Speedup}_{overall} = \frac{ET_{old}}{ET_{new}}$$
$$ET_{new} = \frac{ET_{affected}}{\text{improvement factor}} + ET_{unaffected}$$
With $\text{Fraction} = ET_{affected} / ET_{old}$:
$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}) + \frac{\text{Fraction}}{\text{improvement factor}} \right]$$
$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}) + \dfrac{\text{Fraction}}{\text{improvement factor}}}$$
Theoretical maximum:
$$\text{Speedup}_{maximum} = \frac{1}{1 - \text{Fraction}}$$
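Amdahl's Law translates directly into code. The sketch below is illustrative: the function names and the inputs (half of the execution time affected, improved by a factor of 2) are hypothetical values chosen for easy arithmetic, not taken from the slides.

```python
def amdahl_speedup(fraction: float, improvement_factor: float) -> float:
    """Overall speedup when `fraction` of the old time is sped up by `improvement_factor`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if improvement_factor <= 0:
        raise ValueError("improvement factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / improvement_factor)


def max_speedup(fraction: float) -> float:
    """Theoretical limit as the improvement factor grows without bound."""
    if fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction)


# Hypothetical inputs: 50% of the time is affected.
print(f"factor 2:    {amdahl_speedup(0.5, 2):.3f}")
print(f"factor 1000: {amdahl_speedup(0.5, 1000):.3f}")
print(f"upper bound: {max_speedup(0.5):.3f}")
```

However large the improvement factor becomes, the overall speedup stays below the bound set by the unaffected fraction.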
Amdahl's Law Example
A new 10X faster CPU is placed in a computing system where 40% of the time is for CPU and 60% of the time is for I/O. What is the overall speedup?
$$\text{Speedup}_{overall} = \frac{1}{(1 - \text{Fraction}) + \dfrac{\text{Fraction}}{\text{improvement factor}}} = \frac{1}{(1 - 0.4) + \dfrac{0.4}{10}} = \frac{1}{0.64} = 1.56$$
The CPU is 10X faster, yet the system is only about 1.6X faster.

5) Processor Performance Equation
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
The three factors are the instruction count, the CPI and the cycle time. What affects each factor:

                Inst. Count   CPI   Clock Rate
Program         X
Compiler        X             (X)
Inst. Set       X             X
Organization                  X     X
Technology                          X

Pitfall: Amdahl's Law
Improving an aspect of a computer and expecting a proportional improvement in overall performance.
$$T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}$$
Example: multiply accounts for 80 s out of 100 s. How much improvement in multiply performance is needed to get a 5X overall speedup?
$$20 = \frac{80}{n} + 20$$
This would require $80/n = 0$. Can't be done!
Corollary: make the common case fast.

Fallacy: Low Power at Idle
- Look back at the X4 power benchmark
  - At 100% load: 295 W
  - At 50% load: 246 W (83%)
  - At 10% load: 180 W (61%)
- Google data center
  - Mostly operates at 10%-50% load
  - At 100% load less than 1% of the time
- Consider designing processors to make power proportional to load
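The Amdahl's Law pitfall on this page (multiply takes 80 s of a 100 s run) can also be posed as an inverse problem: given a target overall speedup, what improvement factor is required? The sketch below is one way to solve it; the function name and the choice to return `None` for unreachable targets are illustrative assumptions.

```python
from typing import Optional


def required_improvement_factor(t_affected: float, t_unaffected: float,
                                target_speedup: float) -> Optional[float]:
    """Solve T_improved = T_affected / n + T_unaffected for n.

    Returns None when the target cannot be reached by any finite factor.
    """
    t_old = t_affected + t_unaffected
    t_target = t_old / target_speedup      # time the improved system must achieve
    budget = t_target - t_unaffected       # time left for the affected part
    if budget <= 0:
        return None                        # the unaffected part alone is too slow
    return t_affected / budget


for target in (2, 4, 5):
    n = required_improvement_factor(80.0, 20.0, target)
    answer = "cannot be done" if n is None else f"multiply must be {n:.2f}X faster"
    print(f"{target}X overall: {answer}")
```

A 4X overall speedup already demands a 16X faster multiplier, and 5X is out of reach because the 20 s that multiply does not touch equals the entire time budget.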
Pitfall: MIPS as a Performance Metric
- MIPS: Millions of Instructions Per Second
- Doesn't account for
  - Differences in ISAs between computers
  - Differences in complexity between instructions
$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} = \frac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times CPI}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{CPI \times 10^6}$$
- CPI varies between programs on a given CPU

Concluding Remarks
- Cost/performance is improving
  - Due to underlying technology development
- Hierarchical layers of abstraction
  - In both hardware and software
- Instruction set architecture
  - The hardware/software interface
- Execution time: the best performance measure
- Power is a limiting factor
  - Use parallelism to improve performance
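To make the MIPS pitfall concrete, the sketch below compares two hypothetical implementations of the same program on a 1 GHz clock. All numbers are invented for illustration: version 1 uses many simple instructions, version 2 uses fewer, more complex ones.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


def exec_time_s(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """Execution time = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


CLOCK_HZ = 1e9  # hypothetical 1 GHz clock for both versions

# (instruction count, CPI): hypothetical values for the same program.
versions = {"simple instructions": (10e9, 1.0), "complex instructions": (4e9, 2.0)}

for name, (ic, cpi) in versions.items():
    print(f"{name}: {mips(CLOCK_HZ, cpi):.0f} MIPS, "
          f"{exec_time_s(ic, cpi, CLOCK_HZ):.1f} s")
```

The version with the higher MIPS rating takes longer to run, which is why execution time, not MIPS, is the measure to compare.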