Optimization and Parallelization of Sequential Programs


DF Advanced Compiler Construction, TDDC86 Compiler Optimizations and Code Generation
Optimization and Parallelization of Sequential Programs
Lecture 7
Christoph Kessler, IDA / PELAB, Linköping University, Sweden

Outline
Towards (semi-)automatic parallelization of sequential programs:
- Data dependence analysis for loops
- Some loop transformations: loop invariant code hoisting, loop unrolling, loop fusion, loop interchange, loop blocking and tiling
- Static loop parallelization
- Run-time loop parallelization: doacross parallelization, inspector-executor method
- Speculative parallelization (as time permits)
- Auto-tuning (later, if time)

Foundations: Control and Data Dependence

Consider statements S, T in a sequential program (S = T is possible). The scope of the analysis is typically a function, i.e. intra-procedural analysis. Assume that a control flow path S -> T is possible. The analysis can be done at arbitrary granularity (instructions, operations, statements, compound statements, program regions). Relevant are only the read and write effects on memory (i.e. on program variables) by each operation, and the effect on control flow.

Control dependence S -> T holds if whether T is executed may depend on S (e.g. a condition):

  S: if (...) {
  T:   ...
     }

It implies that the relative execution order S before T must be preserved when restructuring the program. Control dependence is mostly obvious from the nesting structure in well-structured programs, but more tricky in arbitrary branching code (e.g. assembler code).

Data dependence S -> T holds if S may execute (dynamically) before T, both may access the same memory location, and at least one of these accesses is a write. It means that the execution order S before T must be preserved when restructuring the program. In general, only a conservative over-estimation can be determined statically.
- Flow dependence (RAW, read-after-write): S may write a location z that T may read, e.g. S: z = ...; T: ... = ..z..;
- Anti dependence (WAR, write-after-read): S may read a location x that T may overwrite
- Output dependence (WAW, write-after-write): both S and T may write the same memory location

Dependence Graph

(Data, control, program) dependence graph: a directed graph consisting of all statements as vertices and all (data, control, any) dependences as edges.

Why Loop Optimization and Parallelization?

Loops are a promising object for program optimizations, including automatic parallelization:
- High execution frequency: most computation is done in (inner) loops, so even small optimizations can have a large impact (cf. Amdahl's Law)
- Regular, repetitive behavior: compact description, relatively simple to analyze statically
- Well researched
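To make the three dependence kinds concrete, here is a minimal C fragment (my own example, not from the slides; z and a are assumed to be previously declared):

  z = a[0] + 1;   /* S1: writes z, reads a[0]                          */
  a[0] = z * 2;   /* S2: reads z    -> flow (RAW) dependence S1 -> S2;
                         writes a[0] -> anti (WAR) dependence S1 -> S2 */
  z = 0;          /* S3: writes z   -> output (WAW) dependence S1 -> S3
                         and anti (WAR) dependence S2 -> S3            */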

Loop Optimizations: General Issues

Move loop invariant computations out of loops; modify the order of iterations or of parts thereof.

Goals:
- Improve data access locality
- Faster execution
- Reduce loop control overhead
- Enhance possibilities for loop parallelization or vectorization

Only transformations that preserve the program semantics (its input/output behavior) are admissible. Conservative (static) criterion: preserve data dependences. This requires data dependence analysis for loops.

Data Dependence Analysis for Loops: A More Formal Introduction

Data Dependence Analysis: Overview

Data dependence is a precedence relation between statements. It is important for loop optimizations, vectorization and parallelization, instruction scheduling, and data cache optimizations. Dependence analysis computes conservative approximations to the disjointness of pairs of memory accesses; it is weaker than data-flow analysis but generalizes nicely to the level of individual array elements. Topics: loops and loop nests, the iteration space, array subscripts in loops, the index space, dependence testing methods, the data dependence graph, the data + control dependence graph, the program dependence graph.

Data Dependence Graph and Loop Iteration Space

The data dependence graph for straight-line code (a basic block, no branching) is always acyclic, because the relative execution order of statements is forward only.

In the data dependence graph for a loop, there is a dependence edge S -> T if a dependence may exist for some pair of instances (iterations) of S and T. Cycles are possible. We distinguish loop-independent from loop-carried dependences. (The slide example assumes we know statically that arrays a and b do not intersect.)
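A minimal illustration of that distinction (my example, not from the slides):

  for (i = 1; i < n; i++) {
      a[i] = b[i] + 1;        /* S -> T on a[i] within the same iteration: */
      c[i] = a[i] * 2;        /*   loop-independent flow dependence        */
      d[i] = d[i-1] + c[i];   /* U -> U across iterations (distance 1):    */
  }                           /*   loop-carried flow dependence            */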

Example; Loop Normalization

(The example assumes that we statically know that arrays A, X, Y, Z do not intersect; otherwise there might be further dependences. The slide shows the iterations unrolled and the resulting data dependence graph.)

Dependence Distance and Direction

Linear Diophantine Equations

Dependence Equation System

Dependence Testing, 1: GCD-Test

(The bodies of these slides, mostly formulas and diagrams, were not transcribed.)
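As a reminder of the standard GCD test (my summary, since the slide's formulas were not transcribed): a dependence between accesses A[x*i + c1] and A[y*i' + c2] requires an integer solution of the linear Diophantine equation x*i - y*i' = c2 - c1, and such a solution exists only if gcd(x, y) divides c2 - c1. A minimal sketch in C:

  #include <stdio.h>

  static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

  /* GCD test for a potential dependence between A[x*i + c1] (write)
     and A[y*i' + c2] (read): a dependence requires integers i, i' with
     x*i + c1 == y*i' + c2. If gcd(x,y) does not divide c2 - c1, no
     integer solution exists, hence no dependence. (If it does divide,
     the test is inconclusive, since loop bounds are ignored.) */
  static int gcd_test_may_depend(int x, int y, int c1, int c2) {
      return (c2 - c1) % gcd(x, y) == 0;
  }

  int main(void) {
      /* for (i...) A[2*i] = ... A[2*i + 1] ...;
         equation 2*i - 2*i' = 1; gcd(2,2) = 2 does not divide 1,
         so the accesses are independent */
      printf("%d\n", gcd_test_may_depend(2, 2, 0, 1));  /* prints 0 */
      return 0;
  }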

For Multidimensional Arrays?

Survey of Dependence Tests

(Slide bodies not transcribed.)

Loop Transformations and Parallelization

Loop Optimizations: General Issues (recap)

Move loop invariant computations out of loops; modify the order of iterations or of parts thereof. Goals: improve data access locality, faster execution, reduce loop control overhead, enhance possibilities for loop parallelization or vectorization. Only transformations that preserve the program semantics (its input/output behavior) are admissible; the conservative (static) criterion is to preserve data dependences, which requires data dependence analysis for loops.

Some important loop transformations:
- Loop normalization (a sketch follows below)
- Loop parallelization
- Loop invariant code hoisting
- Loop interchange
- Loop fusion vs. loop distribution / fission
- Strip-mining / loop tiling / blocking vs. loop linearization
- Loop unrolling, unroll-and-jam
- Loop peeling
- Index set splitting, loop unswitching
- Scalar replacement, scalar expansion
- Later: software pipelining
- More: cycle shrinking, loop skewing, ...

Loop Invariant Code Hoisting

Move loop invariant code out of the loop. Compilers can do this automatically if they can statically find out what code is loop invariant:

  // before:
  for (i=0; i<n; i++)
      a[i] = b[i] + c / d;

  // after:
  tmp = c / d;
  for (i=0; i<n; i++)
      a[i] = b[i] + tmp;
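The loop normalization slide itself was not transcribed; a minimal sketch of the idea (my own example): normalization rewrites a loop so that its index starts at 0 and steps by 1, which simplifies dependence testing and other transformations.

  /* original loop, lower bound 4, step 3 */
  for (i = 4; i < n; i += 3)
      a[i] = b[i];

  /* normalized: ii = 0, 1, 2, ...; i is reconstructed as 4 + 3*ii */
  for (ii = 0; ii < (n - 4 + 2) / 3; ii++)
      a[4 + 3*ii] = b[4 + 3*ii];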

Loop Unrolling

Unrolling replicates the loop body and reduces the loop overhead (total number of comparisons, branches, and increments). The longer loop body may enable further local optimizations (e.g. common subexpression elimination, register allocation, instruction scheduling, use of SIMD instructions), at the price of longer code. Unrolling can be enforced with compiler options, e.g. -funroll=2 (statements in the innermost loop body only):

  // original:
  for (i=0; i<50; i++) {
      a[i] = b[i];
  }

  // unrolled by 2:
  for (i=0; i<50; i+=2) {
      a[i]   = b[i];
      a[i+1] = b[i+1];
  }

Exercise: formulate the unrolling rule for a statically unknown upper loop limit.

Loop Interchange (1)

For properly nested loops. Example 1:

  // before (column-wise traversal):
  for (j=0; j<M; j++)
      for (i=0; i<N; i++)
          a[i][j] = ...;

  // after interchange (row-wise traversal):
  for (i=0; i<N; i++)
      for (j=0; j<M; j++)
          a[i][j] = ...;

With row-wise storage of 2D arrays (as in C and Java), the new iteration order can improve data access locality in the memory hierarchy (fewer cache misses / page faults).

Foundations: Loop-Carried Data Dependences

Recall: a data dependence S -> T holds if operation S may execute (dynamically) before operation T, both may access the same memory location, and at least one of these accesses is a write. In general, only a conservative over-estimation can be determined statically.

A data dependence S -> T is called loop carried by a loop L if the dependence S -> T may exist for instances of S and T in different iterations of L:

  L: for (i=1; i<N; i++) {
  T:     ... = x[i-1];
  S:     x[i] = ...;
     }

The loop-carried dependences define a partial order between the operation instances resp. iterations.

Loop Interchange (2)

Be careful with loop-carried data dependences! Example 2:

  for (i=1; i<N; i++)
      for (j=1; j<M; j++)
          a[i][j] = a[i+1][j-1] ...;

Iteration (i,j) reads location a[i+1][j-1], which will be overwritten in a later iteration (i+1,j-1). Interchanging the loop headers would violate the partial iteration order given by the data dependences.

Loop Interchange (3)

Example 3:

  for (i=1; i<N; i++)
      for (j=1; j<M; j++)
          a[i][j] = a[i-1][j-1] ...;

Iteration (i,j) reads location a[i-1][j-1], which was written in the earlier iteration (i-1,j-1). Here interchange is OK. Generally: interchanging loop headers is only admissible if the loop-carried dependences have the same direction for all loops in the loop nest (all directed along, or all against, the iteration order).

Loop Fusion

Merge subsequent loops with the same header. This is safe if neither loop carries a (backward) dependence:

  // before:
  for (i=0; i<N; i++)
      a[i] = ...;
  for (i=0; i<N; i++)
      ... = a[i];

  // after fusion:
  for (i=0; i<N; i++) {
      a[i] = ...;
      ... = a[i];
  }

The read of a[i] is still after the write of a[i], for every i. Without fusion, for N sufficiently large, a[i] will no longer be in the cache by the time the second loop reads it. Fusion can improve data access locality and reduces the number of branches.
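A self-contained program to observe the interchange effect (my code, not from the slides; the timings and the size N = 2048 are illustrative, and the column-wise version typically runs noticeably slower on a cached machine):

  #include <stdio.h>
  #include <time.h>

  #define N 2048
  static double a[N][N];   /* zero-initialized static array */

  int main(void) {
      double s = 0.0;
      clock_t t0 = clock();
      for (int j = 0; j < N; j++)       /* column-wise: strided accesses    */
          for (int i = 0; i < N; i++)
              s += a[i][j];
      clock_t t1 = clock();
      for (int i = 0; i < N; i++)       /* row-wise: contiguous accesses    */
          for (int j = 0; j < N; j++)
              s += a[i][j];
      clock_t t2 = clock();
      printf("col-wise %.3fs, row-wise %.3fs (s=%g)\n",
             (double)(t1 - t0) / CLOCKS_PER_SEC,
             (double)(t2 - t1) / CLOCKS_PER_SEC, s);
      return 0;
  }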

Loop Iteration Reordering and Loop Parallelization

If the i-loop carries a dependence, its iteration order must be preserved; likewise, if the j-loop carries a dependence, its iteration order must be preserved. Loops that carry no dependence can be parallelized. (The slide diagrams were not transcribed.)

Remark on Loop Parallelization

Introducing temporary copies of arrays can remove some anti-dependences and thereby enable automatic loop parallelization. Example:

  for (i=0; i<n; i++)
      a[i] = a[i] + a[i+1];   // loop-carried anti-dependence

The loop-carried dependence can be eliminated:

  for (i=0; i<n; i++)         // parallelizable loop
      aold[i+1] = a[i+1];
  for (i=0; i<n; i++)         // parallelizable loop
      a[i] = a[i] + aold[i+1];

Strip Mining / Loop Blocking / Tiling

Tiled Matrix-Matrix Multiplication (1)

Matrix-matrix multiplication C = A x B, here for square (n x n) matrices C, A, B with n large (on the order of 10^3): C[i][j] = sum over k of A[i][k] * B[k][j], for all i, j = 0..n-1. Standard algorithm (here without the initialization of the C entries to 0):

  for (i=0; i<n; i++)
      for (j=0; j<n; j++)
          for (k=0; k<n; k++)
              C[i][j] += A[i][k] * B[k][j];

Good spatial locality on A and C, but bad spatial locality on B (many capacity misses).

Tiled Matrix-Matrix Multiplication (2)

Block each loop by a block size S (choose S so that a block each of A, B and C fit in the cache together), then interchange loops. Code after tiling:

  for (ii=0; ii<n; ii+=S)
      for (jj=0; jj<n; jj+=S)
          for (kk=0; kk<n; kk+=S)
              for (i=ii; i<ii+S; i++)
                  for (j=jj; j<jj+S; j++)
                      for (k=kk; k<kk+S; k++)
                          C[i][j] += A[i][k] * B[k][j];

Good spatial locality for A, B and C.
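A self-contained version of the two variants for experimentation (my code, not from the slides; S = 32 is an assumed tile size and N is kept moderate so the program runs quickly):

  #include <stdio.h>
  #include <string.h>

  #define N 512
  #define S 32   /* assumed tile size: tune so three S x S blocks fit in cache */
  static double A[N][N], B[N][N], C[N][N];

  static void mm_naive(void) {
      memset(C, 0, sizeof C);
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              for (int k = 0; k < N; k++)
                  C[i][j] += A[i][k] * B[k][j];
  }

  static void mm_tiled(void) {
      memset(C, 0, sizeof C);
      for (int ii = 0; ii < N; ii += S)
          for (int jj = 0; jj < N; jj += S)
              for (int kk = 0; kk < N; kk += S)
                  for (int i = ii; i < ii + S; i++)
                      for (int j = jj; j < jj + S; j++)
                          for (int k = kk; k < kk + S; k++)
                              C[i][j] += A[i][k] * B[k][j];
  }

  int main(void) {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }
      mm_naive();  printf("naive: C[0][0] = %g\n", C[0][0]);
      mm_tiled();  printf("tiled: C[0][0] = %g\n", C[0][0]);
      return 0;
  }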

Remark on Locality Transformations

An alternative can be to change the data layout rather than the control structure of the program:
- Store matrix B in transposed form, or, if necessary, consider transposing it first, which may pay off over several subsequent computations.
- Finding the best layout for all multidimensional arrays is an NP-complete optimization problem [Mace, 1988].
- Recursive array layouts that preserve locality: Morton order, hierarchically laid-out tiled arrays. In the best case, these can make computations cache-oblivious, i.e. performance is largely independent of the cache size.

Loop Distribution (a.k.a. Loop Fission)

Loop Fusion

Loop Nest Flattening / Linearization

Loop Unrolling

Loop Unrolling with Unknown Upper Bound

(The bodies of these slides were not transcribed; a sketch of loop distribution follows.)
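Since the loop distribution slide body is missing, here is a minimal sketch of the transformation (my own example): distribution splits one loop into several loops over the same index range, which is legal if it does not reverse any dependence between the separated statements; it is the inverse of loop fusion and can enable vectorization or parallelization of one of the resulting loops.

  /* before: one loop with two statements */
  for (i = 1; i < n; i++) {
      a[i] = a[i-1] + 1;    /* S1: carries a dependence (stays sequential) */
      b[i] = c[i] * 2;      /* S2: independent iterations                  */
  }

  /* after distribution: S2's loop is now parallelizable/vectorizable */
  for (i = 1; i < n; i++)
      a[i] = a[i-1] + 1;
  for (i = 1; i < n; i++)
      b[i] = c[i] * 2;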

Loop Unroll-and-Jam

Loop Peeling

Index Set Splitting

Loop Unswitching

Scalar Replacement

Scalar Expansion / Array Privatization

(The bodies of these slides were not transcribed; a sketch of scalar expansion follows.)
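Since these slide bodies are missing, here is a minimal sketch of scalar expansion (my own example): a scalar temporary that is written and read in every iteration serializes the loop through anti- and output dependences; expanding it into an array (privatizing it per iteration) removes those dependences.

  /* before: the scalar t carries anti/output dependences across iterations */
  for (i = 0; i < n; i++) {
      t = a[i] + b[i];
      c[i] = t * t;
  }

  /* after scalar expansion: each iteration uses its own t_exp[i]; the
     loop is now parallelizable */
  for (i = 0; i < n; i++) {
      t_exp[i] = a[i] + b[i];
      c[i] = t_exp[i] * t_exp[i];
  }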

Idiom Recognition and Algorithm Replacement

C. Kessler: Pattern-driven automatic parallelization. Scientific Programming, 1996.
A. Shafiee-Sarvestani, E. Hansson, C. Kessler: Extensible recognition of algorithmic patterns in DSP programs for automatic parallelization. Int. J. on Parallel Programming, 2013.

Concluding Remarks: Limits of Static Analyzability; Outlook: Runtime Analysis and Parallelization

Remark on static analyzability (1)

Static dependence information is always a (safe) over-approximation of the real (run-time) dependences. Finding out the real dependences exactly is statically undecidable! If in doubt, a dependence must be assumed, which may prevent some optimizations or parallelization.

One main reason for imprecision is aliasing, i.e. the program may have several ways to refer to the same memory location. Pointer aliasing example:

  void mergesort(int* a, int n)
  {
      ...
      mergesort(a, n/2);
      mergesort(a + n/2, n - n/2);
      ...
  }

How could a static analysis tool (e.g., a compiler) know that the two recursive calls read and write disjoint subarrays of a?

Remark on static analyzability (2)

Another reason for imprecision are statically unknown values that determine whether a dependence exists or not. Unknown dependence distance:

  // value of K statically unknown
  for (i = 0; i < N; i++) {
  S:    a[i] = a[i] + a[K];
  }

There is a loop-carried dependence among the instances of S if K < N; otherwise, the loop is parallelizable.
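One standard way out, shown here as my own sketch (not on the slides), is run-time versioning: emit a guarded parallel version that is selected only when the run-time value of K proves independence. The example assumes an OpenMP-style annotation and that a has more than K elements:

  if (K >= N || K < 0) {
      /* a[K] is never written by the loop: no loop-carried dependence */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          a[i] = a[i] + a[K];
  } else {
      /* possible dependence: fall back to the sequential loop */
      for (int i = 0; i < N; i++)
          a[i] = a[i] + a[K];
  }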

Run-Time Parallelization

Goal of Run-Time Parallelization

Typical target: irregular loops

  for (i=0; i<n; i++)
      a[i] = f ( a[ g(i) ], a[ h(i) ], ... );

The array index expressions g, h, ... depend on run-time data, so the iterations can neither be statically proved independent nor dependent with distance +1. Principle: at run time, inspect g, h, ... to find out the real dependences and compute a schedule for partially parallel execution. This can also be combined with speculative parallelization.

Overview

Run-time parallelization of irregular loops:
- DOACROSS parallelization
- Inspector-executor technique (shared memory)
- Inspector-executor technique (message passing) *
- Privatizing DOALL test *
Speculative run-time parallelization of irregular loops:
- LRPD test *
- General thread-level speculation
- Hardware support *
(* = not covered in this course; see the references.)

DOACROSS Parallelization

Useful if the loop-carried dependence distances are unknown, but often > 1: independent subsequent loop iterations are allowed to overlap, with bilateral synchronization between really-dependent iterations.

  sh float aold[n];      // shared copy of a
  sh flag done[n];       // flag (semaphore) array
  forall i in 0..n-1 {   // spawn n threads, one per iteration
      done[i] = 0;
      aold[i] = a[i];    // create a copy
  }
  forall i in 0..n-1 {   // spawn n threads, one per iteration
      if (g(i) < i) {    // reads a value computed within this loop:
          wait until done[ g(i) ] is set;
          a[i] = f ( a[ g(i) ], ... );
      } else {           // reads a value from before the loop:
          a[i] = f ( aold[ g(i) ], ... );
      }
      set done[i];
  }

Inspector-Executor Technique (1)

The compiler generates pieces of customized code for such loops. The inspector calculates the values of the index expressions by simulating the whole loop execution; it is typically based on the sequential version of the source loop (some computations could be left out). It thereby implicitly computes the real iteration dependence graph, and it computes a parallel schedule as a (greedy) wavefront traversal of the iteration dependence graph in topological order: all iterations in the same wavefront are independent, and the schedule depth = number of wavefronts = critical path length. The executor then follows this schedule to execute the loop.

Inspector-Executor Technique (2)

Source loop:

  for (i=0; i<n; i++)
      a[i] = ... a[ g(i) ] ...;

Inspector:

  int wf[n];                // wavefront indices
  int depth = 0;
  for (i=0; i<n; i++)
      wf[i] = 0;            // initialization
  for (i=0; i<n; i++) {
      wf[i] = max ( wf[ g(i) ], wf[ h(i) ], ... ) + 1;
      depth = max ( depth, wf[i] );
  }

The inspector considers only flow dependences (RAW); anti- and output dependences are to be preserved by the executor.

Inspector-Executor Technique (3)

Executor:

  float aold[n];            // buffer array
  aold[0:n] = a[0:n];       // copy the old values
  for (w=0; w<depth; w++)
      forall (i, 0, n, #processors)
          if (wf[i] == w) {
              a1 = (g(i) < i) ? a[ g(i) ] : aold[ g(i) ];
              ...           // similarly, a2 for h, etc.
              a[i] = f ( a1, a2, ... );
          }

(The slide also tabulates, for a small example, the values g(i) and wf[i] and whether g(i) < i, together with the resulting iteration (flow) dependence graph; the figure was not transcribed.)
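A compact, self-contained sketch of the inspector and executor in C with OpenMP (my code, not from the slides), for the special case a[i] = f(a[g(i)]) with one index function; the contents of g stand in for run-time data:

  #include <stdio.h>
  #include <string.h>

  #define N 8
  static int   g[N] = {0, 0, 1, 1, 3, 2, 5, 4};  /* known only at run time */
  static float a[N] = {1, 1, 1, 1, 1, 1, 1, 1};
  static float aold[N];
  static int   wf[N];

  static float f(float x) { return x + 1.0f; }

  int main(void) {
      /* Inspector: iteration i has a flow dependence only on iteration
         g[i], and only when g[i] < i (otherwise it reads the old value) */
      int depth = 0;
      for (int i = 0; i < N; i++) {
          wf[i] = (g[i] < i) ? wf[g[i]] + 1 : 0;
          if (wf[i] + 1 > depth) depth = wf[i] + 1;
      }
      /* Executor: process one wavefront at a time; iterations within a
         wavefront are independent and may run in parallel */
      memcpy(aold, a, sizeof a);
      for (int w = 0; w < depth; w++) {
          #pragma omp parallel for
          for (int i = 0; i < N; i++)
              if (wf[i] == w)
                  a[i] = f(g[i] < i ? a[g[i]] : aold[g[i]]);
      }
      for (int i = 0; i < N; i++) printf("%g ", a[i]);
      printf("(depth %d)\n", depth);
      return 0;
  }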

Inspector-Executor Technique (4)

Problem: the inspector remains sequential, so there is no speedup. Solution approaches:
- Re-use the schedule over subsequent iterations of an outer loop if the access pattern does not change; this amortizes the inspector overhead across repeated executions.
- Parallelize the inspector using doacross parallelization [Saltz, Mirchandaney 1991].
- Parallelize the inspector using sectioning [Leung, Zahorjan 1991]: compute processor-local wavefronts in parallel, then concatenate them; this trades schedule quality (depth) against inspector speed.
- Parallelize the inspector using bootstrapping [Leung, Zahorjan 1991]: start with a suboptimal schedule obtained by sectioning and use it to execute the inspector, yielding a refined schedule.

Thread-Level Speculation

Speculatively Parallel Execution

For automatic parallelization of sequential code where dependences are hard to analyze statically. Works on a task graph constructed implicitly and dynamically. Speculate on: control flow, data independence, synchronization, values. We focus on thread-level speculation (TLS) for CMP/MT processors; speculative instruction-level parallelism is not considered here.

A task is
- statically: a connected, single-entry subgraph of the control-flow graph (basic blocks, loop bodies, loops, or entire functions);
- dynamically: a contiguous fragment of the dynamic instruction stream within a static task region, entered at the static task entry.

TLS Example

Exploiting module-level speculative parallelism (across function calls). (Figure from F. Warg: Techniques for Reducing Thread-Level Speculation Overhead in Chip Multiprocessors. PhD thesis, Chalmers TH, Gothenburg, June 2006.)

Data Dependence Problem in TLS

(Figure from F. Warg, PhD thesis, 2006; not transcribed.)

Speculatively Parallel Execution of Tasks

Speculation on inter-task control flow: after having assigned a task, predict its successor task and start it speculatively. Speculation on data independence: for inter-task memory data (flow) dependences, one can proceed conservatively (await the write: memory synchronization, message) or speculatively (hope for independence and continue, i.e. execute the load). Mis-speculation requires a roll-back of the speculative results (expensive): when starting speculation, the state must be buffered, and an offending task and all its successors are squashed and restarted. Speculative results are committed once the speculation is resolved to be correct: the task is then retired.

Selecting Tasks for Speculation

Small tasks: too much overhead (task startup, task retirement) and a low degree of parallelism. Large tasks: higher misspeculation probability and higher rollback cost; many speculations ongoing in parallel may saturate the resources. Load balancing issues: avoid large variation in task sizes. Task selection traverses the program's control flow graph (CFG) and applies heuristics for task size, control speculation and data dependence speculation.

TLS Implementations

Software-only speculation: typically for loops [Rauchwerger, Padua 1994, 1995]. Hardware-based speculation: typically integrated into cache coherence protocols; used with multithreaded processors / chip multiprocessors for automatic parallelization of sequential legacy code. If the source code is available, the compiler may help, e.g. with identifying suitable threads.

Some References on Dependence Analysis, Loop Optimizations and Transformations

H. Zima, B. Chapman: Supercompilers for Parallel and Vector Computers. Addison-Wesley / ACM Press, 1990.
M. Wolfe: High-Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2001.

Questions?

Some References on Idiom Recognition and Algorithm Replacement

C. Kessler: Pattern-driven automatic parallelization. Scientific Programming 5, 1996.
A. Shafiee-Sarvestani, E. Hansson, C. Kessler: Extensible recognition of algorithmic patterns in DSP programs for automatic parallelization. Int. J. on Parallel Programming, 2013.

Some References on Run-Time Parallelization

R. Cytron: Doacross: Beyond vectorization for multiprocessors. Proc. ICPP-1986.
D. Chen, J. Torrellas, P. Yew: An Efficient Algorithm for the Run-time Parallelization of DOACROSS Loops. Proc. IEEE Supercomputing Conf., Nov. 1994, IEEE CS Press.
R. Mirchandaney, J. Saltz, R. M. Smith, D. M. Nicol, K. Crowley: Principles of run-time support for parallel processors. Proc. ACM Int. Conf. on Supercomputing, July 1988.
J. Saltz, K. Crowley, R. Mirchandaney, H. Berryman: Run-time Scheduling and Execution of Loops on Message Passing Machines. Journal of Parallel and Distributed Computing 8 (1990).
J. Saltz, R. Mirchandaney: The preprocessed doacross loop. Proc. ICPP-1991 Int. Conf. on Parallel Processing.
S. Leung, J. Zahorjan: Improving the performance of run-time parallelization. Proc. ACM PPoPP-1993.
L. Rauchwerger, D. Padua: The Privatizing DOALL Test: A Run-Time Technique for DOALL Loop Identification and Array Privatization. Proc. ACM Int. Conf. on Supercomputing, July 1994.
L. Rauchwerger, D. Padua: The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. Proc. ACM SIGPLAN PLDI-95, 1995.

Some References on Speculative Execution / Parallelization

J. Martinez, J. Torrellas: Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors. Proc. WMPI at ISCA, 2001.
F. Warg, P. Stenström: Limits on speculative module-level parallelism in imperative and object-oriented programs on CMP platforms. Proc. IEEE PACT 2001.
P. Marcuello, A. Gonzalez: Thread-spawning schemes for speculative multithreading. Proc. HPCA-8, 2002.
J. Steffan et al.: Improving value communication for thread-level speculation. Proc. HPCA-8, 2002.
M. Cintra, J. Torrellas: Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors. Proc. HPCA-8, 2002.
T. Vijaykumar, G. Sohi: Task Selection for a Multiscalar Processor. Proc. MICRO-31, Dec. 1998.
F. Warg, P. Stenström: Improving speculative thread-level parallelism through module run-length prediction. Proc. IPDPS 2003.
F. Warg: Techniques for Reducing Thread-Level Speculation Overhead in Chip Multiprocessors. PhD thesis, Chalmers TH, Gothenburg, June 2006.
T. Ohsawa et al.: Pinot: Speculative multi-threading processor architecture exploiting parallelism over a wide range of granularities. Proc. MICRO-38, 2005.