Performance Evaluation [Ch. 1]

What is performance?
  - of a car?
  - of a car wash?
  - of a TV?

How should we measure the performance of a computer?
  - The response time (or wall-clock time) it takes to complete a task? Why isn't this a good measure?
  - The CPU time it takes to complete a task?
  - The user CPU time it takes? How is this different from CPU time?

H&P use "system performance" to refer to elapsed time on an unloaded system, and "CPU performance" to refer to user CPU time on an unloaded system.

Amdahl's Law

The improvement in performance ("speedup") is limited by the part you cannot improve.

In the last class, we looked at improving performance by pipelining operations. What was the part we could not improve?

© 2002 Edward F. Gehringer, ECE 463/521 Lecture Notes, Fall 2002
Based on notes from Drs. Tom Conte & Eric Rotenberg of NCSU
Portions adapted from notes by Drs. Mark Hill, David Wood, Guri Sohi, and Jim Smith of U. of Wisconsin
Lecture 2, Advanced Microprocessor Design

Speedup

$$\text{Speedup} = \frac{\text{Performance of task with gizmo}}{\text{Performance of task without gizmo}}$$

or

$$\text{Speedup} = \frac{\text{Execution time of task without gizmo}}{\text{Execution time of task with gizmo}}$$

However, usually we can't speed up (or "enhance") the whole task. So we have to speed up only a part.

  - Speedup_enhanced = best-case speedup from the gizmo alone
  - Fraction_enhanced = fraction of the task that the gizmo can enhance

$$\text{Speedup}_{\text{overall}} = \frac{\text{Execution time of task without gizmo}}{\text{Execution time of task with gizmo}}$$

$$= \frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{old}} \times (1 - \text{Fraction}_{\text{enhanced}}) + \text{Execution time}_{\text{old}} \times \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$

$$= \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} = \frac{1}{(1 - f) + \dfrac{f}{s}}$$

Amdahl's Law example

You do simulation of jet plane wings. One run takes one week on your fastest computer.

You get this ad in your mailbox: The Acme Hyperbole is the largest supercomputer ever built. It has 100,000 processors (great!)
It costs $1 bazillion (not so great).

Now, 1 week is about 600,000 sec, so you could run a simulation in 6 seconds, right?

Well, not all of a program can be done at the same time.
  - Data dependences: a statement that computes x, followed by one that uses it (e.g., x * y)
  - Control dependences: if xxx then yyy else zzz

Say 80% of your program is parallelizable (pretty good). How fast would your program finish?

  - Speedup_enhanced = 100,000
  - Fraction_enhanced = 0.8

$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} = \frac{1}{(1 - 0.8) + \dfrac{0.8}{100000}} \approx \frac{1}{0.2} = 5$$

So the program runs approximately 5 times faster, finishing in about 1.4 days (one week divided by 5). Not quite as great as one would hope. Worth a bazillion dollars?

Let's take another look at Amdahl's law, from the perspective that not all work is parallelizable. Recall:

  - Speedup is limited by the part you cannot improve.
  - The common case matters most.
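The jet-wing calculation above can be checked with a few lines of Python. This is only a sketch: the function names `amdahl_speedup` and `amdahl_limit` are illustrative and do not come from the notes, and the run time uses the exact 604,800 seconds in a week rather than the rounded 600,000.

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speedup when a fraction f of the task is sped up by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


def amdahl_limit(f: float) -> float:
    """Upper bound on overall speedup as s grows without bound: 1 / (1 - f)."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


if __name__ == "__main__":
    week_sec = 7 * 24 * 3600  # 604,800 seconds in one week
    overall = amdahl_speedup(f=0.8, s=100_000)
    print(f"overall speedup:  {overall:.4f}")
    print(f"best possible:    {amdahl_limit(0.8):.4f}")
    print(f"new run time:     {week_sec / overall / 86_400:.2f} days")
```

Even with 100,000 processors the result sits just under the bound 1 / (1 - f) = 5, which is set entirely by the 20% of the program that cannot be parallelized.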
Case 1: Suppose f = 0.95 and s = 1.1. What kind of speedup can we get?

$$s_{\text{overall}} = \frac{1}{(1 - 0.95) + \dfrac{0.95}{1.10}} \approx 1.09$$

Case 2: Suppose f = 0.05 and s = 10.

$$s_{\text{overall}} = \frac{1}{(1 - 0.05) + \dfrac{0.05}{10}} \approx 1.05$$

Case 3: Suppose f = 0.05, s = ∞.

$$s_{\text{overall}} = \frac{1}{(1 - 0.05) + \varepsilon} \approx 1.05$$

A 10% improvement to the common case (Case 1) beats even an unbounded improvement to a rare case (Case 3).

Workload selection

What workloads do we use to evaluate performance?

Observations
  - A database search does different things from an FFT.
  - Hardware good for searching databases isn't good for an FFT.

Solutions
  - Ask the users
  - Guess
  - Standards: benchmark suites
      - SPEC89, SPEC92, SPEC95, SPEC2K (for workstations)
          - Includes code and inputs
      - Transaction Processing Performance Council: TPC-A, B, C, D (web/database servers)
          - No code, just a specification
      - Perfect Club (for supercomputers)
          - Code, but you can rewrite it as long as the results are the same

Graveyard of failed metrics

MIPS (million instructions per second)

$$\text{MIPS} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{program}}{\text{time}} \times 10^{-6}$$

  - Instruction sets are not the same across different vendors' machines.
  - MIPS can be inversely proportional to performance! (Consider FP hardware vs. software emulation.)

MFLOPS (million floating-point operations per second)
  - The set of FP instructions is not consistent across machines. (A Pentium has a divide; a Cray C90 supercomputer does not.)
  - Integer-only code (e.g., a compiler) has a zero MFLOPS rating.

Peak performance (maximum performance for a synthetic string of instructions)
  - Example: The DEC Alpha microprocessor has a peak performance of 1.2 BIPS.
  - When compared using benchmarks, the actual rate of the DEC Alpha is closer to 360 to 750 MIPS.

So, what metric should we use?

Run time

Run time is the only unimpeachable measure of performance for processors. But there are many ways to interpret run time.
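To see why MIPS landed in the graveyard, here is a small sketch of the FP-hardware vs. software-emulation case mentioned above. Every number in it (clock rate, CPI values, 100 integer instructions per emulated FP operation) is a made-up assumption chosen for illustration, not a measurement of any real machine.

```python
def run_time(instruction_count: float, cpi: float, clock_hz: float) -> float:
    """Seconds to execute instruction_count instructions at the given CPI and clock."""
    return instruction_count * cpi / clock_hz


def mips(instruction_count: float, seconds: float) -> float:
    """Millions of instructions executed per second."""
    return instruction_count / (seconds * 1e6)


# Hypothetical workload: one million floating-point operations on a 100 MHz clock.
FP_OPS = 1_000_000
CLOCK_HZ = 100e6

# With FP hardware (assumed): one slow instruction per FP operation.
hw_instructions = FP_OPS * 1
hw_time = run_time(hw_instructions, cpi=10.0, clock_hz=CLOCK_HZ)

# With software emulation (assumed): 100 fast integer instructions per FP operation.
sw_instructions = FP_OPS * 100
sw_time = run_time(sw_instructions, cpi=1.5, clock_hz=CLOCK_HZ)

if __name__ == "__main__":
    print(f"FP hardware: {hw_time:.2f} s, {mips(hw_instructions, hw_time):.1f} MIPS")
    print(f"emulation:   {sw_time:.2f} s, {mips(sw_instructions, sw_time):.1f} MIPS")
```

Under these assumptions the emulating machine reports the higher MIPS rating while taking 15 times longer to finish, which is the sense in which MIPS can be inversely related to performance.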
[Figure: timeline of the wall-clock time the user sees. System activity, Program A (the benchmark), and Program B (something else) alternate between I/O and compute phases along the time axis.]

What we care about: CPU benchmarking cares about these two:
  - Single program compute time
  - Fraction of system time due to single program

Compute time = CPU time

$$\text{CPU time} = \text{clock cycle count} \times \text{cycle time} = \frac{\text{clock cycle count}}{\text{clock rate}}$$

$$\text{Cycles per instruction} = \text{CPI} = \frac{\text{clock cycle count}}{\text{instruction count}}$$

So,

$$\text{clock cycle count} = \text{CPI} \times \text{instruction count}$$

And,

$$\text{CPU time} = \text{instruction count} \times \text{CPI} \times \text{cycle time}$$

We can improve CPU time in three ways.
  - Decrease instruction count (IC)
      - Good compiler
      - Better software algorithms
  - Decrease CPI (increase IPC, a.k.a. instruction-level parallelism, or ILP)
      - Fancy hardware (e.g., caches, branch prediction, pipelining, superscalar)
      - Good compiler
  - Decrease cycle time (CT)
      - Deeper pipelining & really good circuit design
      - Technology scaling
      - Simple ISA; less aggressive ILP (simple microarchitecture)

The real story on RISC vs. CISC

[Figure: CISC and RISC placed along three axes: IC, CT, and CPI.]

RISC: Simple instructions
  - Takes a lot of them to do anything: Increases IC
  - Easier to build hardware: Decreases CT
  - Easier to parallelize: Decreases CPI
  - Net effect:
      - Need a lot of memory to hold the program, but
      - Runs faster if Inc(IC) < Dec(CT) and Dec(CPI)

CISC: Big honking complex instructions
  - Takes very few to do anything: Decreases IC
  - Easier to program by hand
  - Harder to build fast hardware: Increases CT
  - Harder to parallelize: Increases CPI
  - Net effect:
      - Memory efficient
      - Runs faster if Dec(IC) > Inc(CT) and Inc(CPI)

Example, adding two memory operands:

  CISC:
      ADD (C), (A), (B)

  RISC:
      LD  r1, (A)
      LD  r2, (B)
      ADD r3, r1, r2
      ST  r3, (C)
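The RISC vs. CISC tradeoff above comes down to the product of instruction count, CPI, and cycle time. The sketch below evaluates that product for two hypothetical designs. The specific figures (30% more instructions for RISC, CPI of 1.5 vs. 4, cycle time of 8 ns vs. 10 ns) are assumptions picked only to illustrate the inequality, not data about real processors.

```python
def cpu_time(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time in seconds: instruction count x cycles per instruction x cycle time."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical CISC design: fewer instructions, but slower and harder to pipeline.
cisc = cpu_time(instruction_count=1.0e9, cpi=4.0, cycle_time_s=10e-9)

# Hypothetical RISC design: 30% more instructions, lower CPI, shorter cycle.
risc = cpu_time(instruction_count=1.3e9, cpi=1.5, cycle_time_s=8e-9)

if __name__ == "__main__":
    print(f"CISC CPU time: {cisc:.1f} s")
    print(f"RISC CPU time: {risc:.1f} s")
    print(f"RISC speedup over CISC: {cisc / risc:.2f}x")
```

With these numbers the 1.3x growth in IC is outweighed by the combined drop in CPI and CT, so RISC wins. Reverse the assumptions (a large IC increase for a small CPI and CT gain) and CISC comes out ahead.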
Retrospective

  - If memory is expensive, people hand code machines, and compilers are terrible: Use CISC
  - If memory is inexpensive, no one hand codes, and compilers are terrific: Use RISC

Which computer is faster?

                      Computer A   Computer B   Computer C
  Program P1 (sec)             1           10           20
  Program P2 (sec)          1000          100           20
  Total time (sec)          1001          110           40

  - A is 10x faster than B for P1
  - B is 10x faster than A for P2
  - A is 20x faster than C for P1
  - C is 50x faster than A for P2
  - etc.

Total execution time gives the clearest picture:
  - B is 1001/110 = 9.1x faster than A for both programs
  - C is 1001/40 = 25x faster than A for both programs
  - C is 110/40 = 2.75x faster than B for both programs

Which would you buy? (Answer: C is fastest, overall.)

The arithmetic mean of times is a good measure too.

Example of means

                       A      B
  Prog 1               4      2
  Prog 2               4      7
  Harmonic mean        4      3.1
  Arithmetic mean      4      4.5

  (Rates given in instructions per second)

Which is faster, A or B?
Consider running an average instruction from Prog. 1 followed by one from Prog. 2:

  - for A: (1/4 + 1/4) = 1/2
  - for B: (1/2 + 1/7) = 9/14

A runs the two instructions faster (1/2 < 9/14), thus A is better.

Now look at the harmonic mean (Hmean) vs. the arithmetic mean (Amean).
  - Hmean says A has a higher rate than B (4 vs. 3.1), so A is better, which agrees with the times.
  - Amean says B has a higher rate than A (4.5 vs. 4), so B is better, but that's wrong!

If you used the wrong method to combine the numbers, you would buy the slower machine!

Note also that the definition of harmonic mean is just the average of the rates converted to times, then converted back to rates.

Rules

Use arithmetic mean to combine run times.

$$\bar{x}_{\text{time}} = \sum_{i} \text{weight}_i \times \text{time}_i$$

Use harmonic mean to combine rates (e.g., IPC), because it actually combines them as times, then converts back to a rate.

$$H_{\text{rate}} = \frac{1}{\sum_{i} \dfrac{\text{weight}_i}{\text{rate}_i}}$$

In both formulas the weights sum to 1; with N equally weighted programs, each weight is 1/N.
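The two rules can be tried on the A and B rates from the example above. The following is a minimal sketch: the helper names are illustrative, and the weights default to equal weighting when none are given.

```python
from typing import Optional, Sequence


def _normalize(weights: Optional[Sequence[float]], n: int) -> list:
    """Return weights that sum to 1 (equal weights if none are supplied)."""
    if weights is None:
        return [1.0 / n] * n
    total = sum(weights)
    return [w / total for w in weights]


def arithmetic_mean(times: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    """Weighted arithmetic mean: the right way to combine run times."""
    w = _normalize(weights, len(times))
    return sum(wi * t for wi, t in zip(w, times))


def harmonic_mean(rates: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    """Weighted harmonic mean: the right way to combine rates."""
    w = _normalize(weights, len(rates))
    return 1.0 / sum(wi / r for wi, r in zip(w, rates))


def time_for_one_of_each(rates: Sequence[float]) -> float:
    """Seconds to run one instruction at each of the given rates."""
    return sum(1.0 / r for r in rates)


rates_a = [4.0, 4.0]  # instructions per second on Prog 1 and Prog 2
rates_b = [2.0, 7.0]

if __name__ == "__main__":
    for name, rates in (("A", rates_a), ("B", rates_b)):
        print(
            f"{name}: time={time_for_one_of_each(rates):.3f} s, "
            f"Hmean={harmonic_mean(rates):.2f}, "
            f"Amean of rates={arithmetic_mean(rates):.2f}"
        )
```

The harmonic mean ranks the machines the same way the actual times do, while the arithmetic mean of the rates ranks them the other way around.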