Optimizing for Speed. What is the potential gain? What can go Wrong? A Simple Example. Erik Hagersten Uppsala University, Sweden

Erik Hagersten, eh@it.uu.se

What is the potential gain?
  - Latency difference between L$ and memory: ~5x
  - Bandwidth difference between L$ and memory: ~[…]x
  - Repeated TLB misses add a factor of ~[…]-3x
  - Execute from L$ instead of from memory ==> 5-5x improvement
  - At least a factor of […]-[…]x is within reach

Optimizing for cache performance
  - Keep the active footprint small
  - Use the entire cache line once it has been brought into the cache
  - Fetch a cache line prior to its usage
  - Let the CPU that already has the data in its cache do the job
  - ...

What can go wrong? A simple example
  Perform a diagonal copy […] times on an N x N array.

Example: Loop order

  //Optimized Example A
  for (i=1; i<N; i++) {
    for (j=1; j<N; j++) {
      A[i][j] = A[i-1][j-1];
    }
  }

  //Unoptimized Example A
  for (j=1; j<N; j++) {
    for (i=1; i<N; i++) {
      A[i][j] = A[i-1][j-1];
    }
  }

Performance Difference: Loop order
  [Figure: speedup vs. unoptimized as a function of array side, for Athlon 64 X2, Pentium D and Core Duo.]

Example: Sparse data

  //Optimized Example A
  for (i=1; i<N; i++) {
    for (j=1; j<N; j++) {
      A_data[i][j] = A_data[i-1][j-1];
    }
  }

  //Unoptimized Example A
  for (i=1; i<N; i++) {
    for (j=1; j<N; j++) {
      A[i][j].data = A[i-1][j-1].data;
    }
  }

Performance Difference: Sparse Data
  [Figure: speedup vs. unoptimized as a function of array side, for Athlon 64 X2, Pentium D and Core Duo.]
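The loop-order example can be tried out with a sketch like the following. It is a minimal illustration, not the code behind the measured curves: the array side N = 512, the fill pattern and the helper names (fill, check_same_result) are assumptions made here. Both loop orders compute the same result; only the memory access pattern differs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { N = 512 };                 /* hypothetical array side */
static double A[N][N];
static double saved[N][N];

/* Fill A with a known pattern: A[i][j] = i*N + j. */
void fill(void) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            A[i][j] = (double)(i * N + j);
}

/* Cache-friendly: the inner loop walks along a row, so consecutive
   iterations touch consecutive addresses (C arrays are row-major). */
void diagonal_copy_row_order(void) {
    for (size_t i = 1; i < N; i++)
        for (size_t j = 1; j < N; j++)
            A[i][j] = A[i - 1][j - 1];
}

/* Cache-unfriendly: the inner loop walks down a column, so consecutive
   iterations are N * sizeof(double) bytes apart. */
void diagonal_copy_column_order(void) {
    for (size_t j = 1; j < N; j++)
        for (size_t i = 1; i < N; i++)
            A[i][j] = A[i - 1][j - 1];
}

/* Sanity check: both loop orders must produce identical arrays. */
void check_same_result(void) {
    fill();
    diagonal_copy_row_order();
    memcpy(saved, A, sizeof A);
    fill();
    diagonal_copy_column_order();
    assert(memcmp(saved, A, sizeof A) == 0);
}
```

Timing the two copy functions for growing N is one way to reproduce the shape of the speedup curves on a given machine.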

Loop Merging

  /* Unoptimized */
  for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1)
      a[i][j] = 2 * b[i][j];
  for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1)
      c[i][j] = K * b[i][j] + d[i][j]/2;

  /* Optimized */
  for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
      a[i][j] = 2 * b[i][j];
      c[i][j] = K * b[i][j] + d[i][j]/2;
    }

Padding of data structures
  [Figure: cache lookup (index, tag compare, hit logic, way-select multiplexer, data). Which cache line do the addresses A, A+256*8 and A+256*2*8 map to?]

Padding of data structures
  [Figure: the same cache lookup with padded rows: A, A+256*8+padding and A+256*2*8+2*padding.]
  Allocate more memory than needed.

Blocking

  /* Unoptimized ARRAY: x = y * z */
  for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1)
      {r = 0;
       for (k = 0; k < N; k = k + 1)
         r = r + y[i][k] * z[k][j];
       x[i][j] = r;
      };

  [Figure: access patterns in the arrays X, Y and Z.]
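A minimal sketch of the loop-merging transformation follows. The flat row-major arrays, the function names and the constant factors (2 and k) are assumptions chosen for illustration. The merged version reads each element of b once while it is in the cache, instead of traversing b twice.

```c
#include <assert.h>
#include <stddef.h>

/* Two separate passes: b is traversed twice. */
void separate_loops(size_t n, double k, const double *b, const double *d,
                    double *a, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j] = 2.0 * b[i * n + j];
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            c[i * n + j] = k * b[i * n + j] + d[i * n + j] / 2.0;
}

/* Merged pass: each b element is reused while it is still cached. */
void merged_loop(size_t n, double k, const double *b, const double *d,
                 double *a, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            a[i * n + j] = 2.0 * b[i * n + j];
            c[i * n + j] = k * b[i * n + j] + d[i * n + j] / 2.0;
        }
}

/* Sanity check on a small 8 x 8 case: both versions must agree. */
void check_merged_matches_separate(void) {
    enum { SIDE = 8, COUNT = SIDE * SIDE };
    double b[COUNT], d[COUNT], a1[COUNT], c1[COUNT], a2[COUNT], c2[COUNT];
    for (size_t i = 0; i < COUNT; i++) {
        b[i] = (double)i;
        d[i] = 2.0 * (double)i;
    }
    separate_loops(SIDE, 3.0, b, d, a1, c1);
    merged_loop(SIDE, 3.0, b, d, a2, c2);
    for (size_t i = 0; i < COUNT; i++) {
        assert(a1[i] == a2[i]);
        assert(c1[i] == c2[i]);
    }
}
```

Merging is only legal when the second loop does not depend on values the first loop has not yet produced; here the two statements touch independent outputs, so the order within an iteration does not matter.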

Blocking

  /* Optimized ARRAY: X = Y * Z */
  for (jj = 0; jj < N; jj = jj + B)                    /* Loop 5 */
    for (kk = 0; kk < N; kk = kk + B)                  /* Loop 4 */
      for (i = 0; i < N; i = i + 1)                    /* Loop 3 */
        for (j = jj; j < min(jj+B,N); j = j + 1)       /* Loop 2 */
          {r = 0;
           for (k = kk; k < min(kk+B,N); k = k + 1)    /* Loop 1 */
             r = r + y[i][k] * z[k][j];
           x[i][j] += r;
          };

  [Figure: X holds a partial solution; Y and Z are traversed one block at a time (first block, second block).]

Blocking: the Movie!
  [Animation: the same code, stepping through Loops 1-5 and showing how the B-wide blocks of Y and Z (kk to kk+B, jj to jj+B) contribute to the partial solution in X.]

Prefetching

  /* Unoptimized */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      x[i][j] = 2 * x[i][j];

  /* Optimized */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      PREFETCH x[i+1][j]
      x[i][j] = 2 * x[i][j];
    }

  (Typically, the HW prefetcher will successfully prefetch sequential streams)

Cache Affinity
  - Schedule the process on the processor it last ran on
  - Allocate and free data buffers in LIFO order
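The blocked matrix multiplication from the Blocking slides can be written as a self-contained function along these lines. This is a sketch under stated assumptions: matrices are stored as flat row-major arrays, the block size is passed as a parameter bs, and the helper min_sz is introduced here. Because the blocked version accumulates partial sums with +=, x is cleared first.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Reference version: x = y * z for n x n row-major matrices. */
void matmul_naive(size_t n, const double *y, const double *z, double *x) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double r = 0.0;
            for (size_t k = 0; k < n; k++)
                r += y[i * n + k] * z[k * n + j];
            x[i * n + j] = r;
        }
}

/* Blocked version: works on a bs x bs block of z at a time so that the
   block stays in the cache while it is reused for every row i. */
void matmul_blocked(size_t n, size_t bs, const double *y, const double *z,
                    double *x) {
    assert(bs > 0);
    for (size_t i = 0; i < n * n; i++)
        x[i] = 0.0;                        /* partial sums start at zero */
    for (size_t jj = 0; jj < n; jj += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t i = 0; i < n; i++)
                for (size_t j = jj; j < min_sz(jj + bs, n); j++) {
                    double r = 0.0;
                    for (size_t k = kk; k < min_sz(kk + bs, n); k++)
                        r += y[i * n + k] * z[k * n + j];
                    x[i * n + j] += r;     /* add this block's contribution */
                }
}
```

A reasonable starting point for bs is a value for which a bs x bs block of doubles fits comfortably in the cache level being optimized for; the best value is machine dependent and worth measuring.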

Optimize for other caches
  TLB...
  - Avoid random accesses to huge data structures (e.g., a huge hash table)
  - Avoid few accesses per page (very sparse data)

Commercial Break: Acumem's Multicore Tools

Acumem SlowSpotter
  [Figure: tool flow. Source code (C, C++, Fortran, OpenMP; the unoptimized array multiplication x = y * z is shown as the example) -> any compiler -> binary -> sampler on the host system -> fingerprint (~[…] MB) -> analysis using target system parameters -> advice.]

  Mission:
  - Find the SlowSpots (Where?)
  - Assess their importance (What?)
  - Enable non-experts to fix them (How? Help!)
  - Improve the productivity of performance experts

A One-Click Report Generation
  Fill in the following fields:
  - Application to run
  - Input arguments
  - Working dir (where to run the app)
  - (Limit, if you like, the data gathered here, e.g., start gathering after […] sec. and stop after […] sec.)
  - Cache size of the target system for optimization (e.g., L1 or L2 size)
  Click this button to create a report.

  [Figure: report graph with miss rate, fetch rate and predicted fetch rate (if utilization 100%) as a function of cache size; cache utilization is the fraction of cache data utilized.]

Loop Focus Tab
  - Spotting the crime
  - List of bad loops
  - Cache size to optimize for
  - Explaining what to do

Bandwidth Focus Tab
  - Spotting the crime
  - List of Bandwidth SlowSpots
  - Explaining what to do

Resource Sharing Example: Libquantum
  - A quantum computer simulation
  - Widely used in research (download from: http://www.libquantum.de/ )
  - […]+ lines of C, fairly complex code
  - Runs an experiment in ~3 min
  Throughput improvement:
  [Figure: relative throughput vs. number of cores used.]

Utilization Analysis: Libquantum
  [Figure: fetch rate and predicted fetch rate (if utilization = 100%) as a function of cache size for the original code. Cache utilization (fraction of cache data utilized): […].3%.]
  [Figure: memory layout of the records: data 0, status 0, data 1, status 1, data 2, status 2, data 3, status 3, ...]
  - Only accessing status data in main loop
  - Need 3 MB per thread!

Utilization Optimization

  /* Original */
  for (i=0; i<MAX; i++) {
    ... = huge_data[i].status + ...
  }

  /* Optimized */
  for (i=0; i<MAX; i++) {
    ... = huge_data_status[i] + ...
  }

SlowSpotter's First Advice: Improve Utilization
  - Change one data structure
  - Involves ~[…] lines of code
  - Takes a non-expert 3 min
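The utilization optimization amounts to moving the one field the main loop touches into its own dense array. The sketch below illustrates the idea with a made-up record layout (seven doubles of payload plus an int status); it is not libquantum's actual data structure, and the names record, split_status and the sum functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { PAYLOAD_DOUBLES = 7 };      /* hypothetical payload size */

/* Original layout: the hot status field is interleaved with cold data,
   so every cache line fetched for status is mostly unused payload. */
struct record {
    double data[PAYLOAD_DOUBLES];
    int status;
};

/* Main loop on the original layout: one useful int per record fetched. */
long sum_status_aos(const struct record *recs, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += recs[i].status;
    return sum;
}

/* Copy the hot field into its own dense array (done once). */
void split_status(const struct record *recs, int *status, size_t n) {
    assert(recs != NULL && status != NULL);
    for (size_t i = 0; i < n; i++)
        status[i] = recs[i].status;
}

/* Main loop on the split layout: every byte of each fetched cache line
   is status data, so far fewer lines are fetched for the same work. */
long sum_status_soa(const int *status, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += status[i];
    return sum;
}
```

In a real code base the split array would replace the field rather than mirror it, so that updates do not have to be kept in sync in two places.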

After Utilization Optimization: Libquantum
  [Figure: fetch rate vs. cache size: old fetch rate (original code), predicted fetch rate, and new fetch rate (utilization optimization). Cache utilization: 95%.]
  Two positive effects from better utilization:
  1. Each fetch brings in more useful data -> lower fetch rate
  2. The same amount of useful data can fit in a smaller cache -> shift left

Reuse Analysis: Libquantum

  /* Original */
  ...
  toffoli(huge_data, ...)
  cnot(huge_data, ...)
  ...

  /* Fused */
  ...
  fused_toffoli_cnot(huge_data, ...)
  ...

  Second-Fifth SlowSpotter Advice: Improve reuse of data
  - Fuse functions traversing the same data
  - Here: four fused functions created
  - Takes a non-expert < […] h

Effect: Reuse Optimization (SPEC CPU2006 462.libquantum)
  [Figure: fetch rate vs. cache size: old fetch rate, utilization optimization, and new fetch rate with utilization + fusion optimization.]
  - The miss in the second loop goes away
  - Still need the same amount of cache to fit all data
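Function fusion can be illustrated with toy stand-ins for the two gate functions. The versions below operate on an array of 64-bit basis-state bit strings and are assumptions made for this sketch, not libquantum's real implementation or signatures. The fused function applies both gates to each element in a single traversal, which is valid here because each element is updated independently of the others.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy Toffoli gate: flip bit t of every element whose bits c1 and c2 are set. */
void toffoli(uint64_t *state, size_t n, unsigned c1, unsigned c2, unsigned t) {
    assert(c1 < 64 && c2 < 64 && t < 64);
    for (size_t i = 0; i < n; i++)
        if (((state[i] >> c1) & 1u) && ((state[i] >> c2) & 1u))
            state[i] ^= (uint64_t)1 << t;
}

/* Toy CNOT gate: flip bit t of every element whose bit c is set. */
void cnot(uint64_t *state, size_t n, unsigned c, unsigned t) {
    assert(c < 64 && t < 64);
    for (size_t i = 0; i < n; i++)
        if ((state[i] >> c) & 1u)
            state[i] ^= (uint64_t)1 << t;
}

/* Fused version: one traversal applies the Toffoli and then the CNOT to
   each element, so the array is fetched from memory once instead of twice. */
void fused_toffoli_cnot(uint64_t *state, size_t n,
                        unsigned c1, unsigned c2, unsigned t1,
                        unsigned c, unsigned t2) {
    assert(c1 < 64 && c2 < 64 && t1 < 64 && c < 64 && t2 < 64);
    for (size_t i = 0; i < n; i++) {
        uint64_t s = state[i];
        if (((s >> c1) & 1u) && ((s >> c2) & 1u))
            s ^= (uint64_t)1 << t1;
        if ((s >> c) & 1u)
            s ^= (uint64_t)1 << t2;
        state[i] = s;
    }
}
```

The fused call fused_toffoli_cnot(state, n, c1, c2, t1, c, t2) is meant to replace the pair toffoli(state, n, c1, c2, t1); cnot(state, n, c, t2); on the same array.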

Utilization + Reuse Optimization: Libquantum
  [Figure: fetch rate vs. cache size: old fetch rate, utilization optimization, and new fetch rate with utilization + fusion optimization.]
  - Fetch rate down to […].3% for a […] MB cache
  - Same as a 3 MB cache originally

Summary: Libquantum
  [Figure: throughput vs. number of cores used for the original code, utilization optimization, and utilization + fusion; annotated speedup: […].7x.]

Demo: Original Cigar Throughput
  [Figure: throughput vs. number of cores.]
  Throughput scalability is a different way to look at the performance of an application. Here, several single-threaded instances of the application are run at the same time. Even though the instances do not explicitly depend on each other, they will nevertheless fight over the shared resources; e.g., running four threads on four cores implies that each thread gets one quarter of the shared cache. A system using four cores to run four instances of Cigar will actually achieve a lower throughput than if only three cores were used.

Demo Time!
  - Libquantum: original code, spatial optimization, spatial + loop fusion performance
  - Edit-compile-analysis cycle: […] min

Throughput Performance: Intel Core (Intel Xeon E535)
  [Figure: throughput, original vs. optimized; annotated speedup: 33x.]
  The optimization puts much lower pressure on the shared cache, resulting in a 33x better throughput for four cores.

Throughput Performance (AMD's Istanbul)
  [Figure: throughput, original vs. optimized; annotated speedup: 7x.]
  AMD's new six-core Istanbul processor can enjoy a 7x better throughput on six cores due to the optimization.

Throughput Performance (Intel i7)
  [Figure: normalized throughput vs. number of threads (up to 8), original vs. optimized; annotated speedup: 3x.]
  Intel's new four-core i7 (Nehalem) processor enjoys a 3x better throughput on four cores due to the optimization. Note that each core can run up to two threads.

Cache sharing issues

Fighting for shared resources
  [Figure: miss rate vs. cache size for one instance and for four instances; with four instances each one sees only a fraction of the actual cache size.]
  - The larger the cache, the better

Performance Problems
  Additional multicore issues:
  - Even less cache resources per application
  - Sharing of cache resources
  - Wasted cache usage

Example: Hints to avoid cache pollution (non-temporal prefetches)
  [Figure: binary running on a core; data that will not be reused before eviction wastes cache space. Miss rate vs. cache size for the original binary and for a binary limited to .7 MB.]
  - Hint: Don't allocate!
  - Throughput: […]% faster

Example: Hints for mixed workloads (non-temporal prefetches)
  [Figure: miss rate vs. cache size for Libquantum, LBM and bzip2; curve shapes labeled "streaming", "bigger is better" and "tiny"; the actual cache size is marked.]
  [Figure: performance on an AMD Opteron for bzip2, Libquantum, LBM and their geometric mean: individually, in mix, and in mix with patched binaries; annotated gain: 5%.]

Some performance tools
  Free licenses:
  - Oprofile
  - GNU: gprof
  - AMD: code analyst
  - Google performance tools
  - Virtual Inst: High Productivity Supercomputing (http://www.vi-hps.org/tools/)
  - Sun Studio
  Not free:
  - Intel: Vtune and many more
  - Allinea, TotalView (for MPI)
  - Acumem (of course)
  - HP: Multicore toolkit (some free, some not)
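One way to express a "don't keep this in the cache" hint from C is a software prefetch marked as having no temporal locality. The sketch below assumes GCC or Clang, whose __builtin_prefetch takes a locality argument where 0 means the data need not stay cached after the access; on other compilers the macro expands to nothing. The prefetch distance of 64 elements and the names PREFETCH_NTA and stream_sum are illustrative assumptions, and this is not necessarily how the patched binaries in the slides were produced.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
/* rw = 0 (read), locality = 0 (no temporal locality). */
#define PREFETCH_NTA(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_NTA(p) ((void)0)
#endif

enum { PREFETCH_DISTANCE = 64 };   /* elements ahead; hypothetical tuning value */

/* Sum a streaming array that is read once and not reused. The hint asks the
   hardware to fetch upcoming data without treating it as worth retaining,
   which can leave more of a shared cache to co-running applications. */
double stream_sum(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_NTA(&x[i + PREFETCH_DISTANCE]);
        sum += x[i];
    }
    return sum;
}
```

Whether the hint helps depends on the processor and on what else shares the cache, so it is worth measuring both the hinted application and its neighbors.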