Optimization and Parallelization of Sequential Programs


DF Advanced Compiler Construction, TDDC86 Compiler Optimizations and Code Generation
Optimization and Parallelization of Sequential Programs
Lecture 7
Christoph Kessler, IDA / PELAB, Linköping University, Sweden

Outline
Towards (semi-)automatic parallelization of sequential programs:
- Data dependence analysis for loops
- Some loop transformations: loop invariant code hoisting, loop unrolling, loop fusion, loop interchange, loop blocking and tiling
- Static loop parallelization
- Run-time loop parallelization: doacross parallelization, inspector-executor method
- Speculative parallelization (as time permits)
- Auto-tuning (later, if time)

Foundations: Control and Data Dependence

Consider statements S, T in a sequential program (S = T is possible). The scope of the analysis is typically a function, i.e. intra-procedural analysis. Assume that a control flow path S -> T is possible. The analysis can be done at arbitrary granularity (instructions, operations, statements, compound statements, program regions). Relevant are only the read and write effects on memory (i.e. on program variables) by each operation, and the effect on control flow.

Control dependence S -> T holds if whether T is executed may depend on S (e.g. a condition):

  S: if (...) {
  T:   ...
     }

It implies that the relative execution order S before T must be preserved when restructuring the program. Control dependence is mostly obvious from the nesting structure in well-structured programs, but more tricky in arbitrary branching code (e.g. assembler code).

Data dependence S -> T holds if S may execute (dynamically) before T, both may access the same memory location, and at least one of these accesses is a write. It means that the execution order S before T must be preserved when restructuring the program. In general, only a conservative over-estimation can be determined statically.
- Flow dependence (RAW, read-after-write): S may write a location z that T may read, e.g. S: z = ...; T: ... = ..z..;
- Anti dependence (WAR, write-after-read): S may read a location x that T may overwrite
- Output dependence (WAW, write-after-write): both S and T may write the same memory location

Dependence Graph

(Data, control, program) dependence graph: a directed graph consisting of all statements as vertices and all (data, control, any) dependences as edges.

Why Loop Optimization and Parallelization?

Loops are a promising object for program optimizations, including automatic parallelization:
- High execution frequency: most computation is done in (inner) loops, so even small optimizations can have a large impact (cf. Amdahl's Law)
- Regular, repetitive behavior: compact description, relatively simple to analyze statically
- Well researched
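To make the three dependence kinds concrete, here is a minimal C fragment (my own example, not from the slides; z and a are assumed to be previously declared):

  z = a[0] + 1;   /* S1: writes z, reads a[0]                          */
  a[0] = z * 2;   /* S2: reads z    -> flow (RAW) dependence S1 -> S2;
                         writes a[0] -> anti (WAR) dependence S1 -> S2 */
  z = 0;          /* S3: writes z   -> output (WAW) dependence S1 -> S3
                         and anti (WAR) dependence S2 -> S3            */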

Loop Optimizations: General Issues

Move loop invariant computations out of loops; modify the order of iterations or of parts thereof.

Goals:
- Improve data access locality
- Faster execution
- Reduce loop control overhead
- Enhance possibilities for loop parallelization or vectorization

Only transformations that preserve the program semantics (its input/output behavior) are admissible. Conservative (static) criterion: preserve data dependences. This requires data dependence analysis for loops.

Data Dependence Analysis for Loops: A More Formal Introduction

Data Dependence Analysis: Overview

Data dependence is a precedence relation between statements. It is important for loop optimizations, vectorization and parallelization, instruction scheduling, and data cache optimizations. Dependence analysis computes conservative approximations to the disjointness of pairs of memory accesses; it is weaker than data-flow analysis but generalizes nicely to the level of individual array elements. Topics: loops and loop nests, the iteration space, array subscripts in loops, the index space, dependence testing methods, the data dependence graph, the data + control dependence graph, the program dependence graph.

Data Dependence Graph and Loop Iteration Space

The data dependence graph for straight-line code (a basic block, no branching) is always acyclic, because the relative execution order of statements is forward only.

In the data dependence graph for a loop, there is a dependence edge S -> T if a dependence may exist for some pair of instances (iterations) of S and T. Cycles are possible. We distinguish loop-independent from loop-carried dependences. (The slide example assumes we know statically that arrays a and b do not intersect.)
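A minimal illustration of that distinction (my example, not from the slides):

  for (i = 1; i < n; i++) {
      a[i] = b[i] + 1;        /* S -> T on a[i] within the same iteration: */
      c[i] = a[i] * 2;        /*   loop-independent flow dependence        */
      d[i] = d[i-1] + c[i];   /* U -> U across iterations (distance 1):    */
  }                           /*   loop-carried flow dependence            */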

Example; Loop Normalization

(The example assumes that we statically know that arrays A, X, Y, Z do not intersect; otherwise there might be further dependences. The slide shows the iterations unrolled and the resulting data dependence graph.)

Dependence Distance and Direction

Linear Diophantine Equations

Dependence Equation System

Dependence Testing, 1: GCD-Test

(The bodies of these slides, mostly formulas and diagrams, were not transcribed.)
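As a reminder of the standard GCD test (my summary, since the slide's formulas were not transcribed): a dependence between accesses A[x*i + c1] and A[y*i' + c2] requires an integer solution of the linear Diophantine equation x*i - y*i' = c2 - c1, and such a solution exists only if gcd(x, y) divides c2 - c1. A minimal sketch in C:

  #include <stdio.h>

  static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

  /* GCD test for a potential dependence between A[x*i + c1] (write)
     and A[y*i' + c2] (read): a dependence requires integers i, i' with
     x*i + c1 == y*i' + c2. If gcd(x,y) does not divide c2 - c1, no
     integer solution exists, hence no dependence. (If it does divide,
     the test is inconclusive, since loop bounds are ignored.) */
  static int gcd_test_may_depend(int x, int y, int c1, int c2) {
      return (c2 - c1) % gcd(x, y) == 0;
  }

  int main(void) {
      /* for (i...) A[2*i] = ... A[2*i + 1] ...;
         equation 2*i - 2*i' = 1; gcd(2,2) = 2 does not divide 1,
         so the accesses are independent */
      printf("%d\n", gcd_test_may_depend(2, 2, 0, 1));  /* prints 0 */
      return 0;
  }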

For Multidimensional Arrays?

Survey of Dependence Tests

(Slide bodies not transcribed.)

Loop Transformations and Parallelization

Loop Optimizations: General Issues (recap)

Move loop invariant computations out of loops; modify the order of iterations or of parts thereof. Goals: improve data access locality, faster execution, reduce loop control overhead, enhance possibilities for loop parallelization or vectorization. Only transformations that preserve the program semantics (its input/output behavior) are admissible; the conservative (static) criterion is to preserve data dependences, which requires data dependence analysis for loops.

Some important loop transformations:
- Loop normalization (a sketch follows below)
- Loop parallelization
- Loop invariant code hoisting
- Loop interchange
- Loop fusion vs. loop distribution / fission
- Strip-mining / loop tiling / blocking vs. loop linearization
- Loop unrolling, unroll-and-jam
- Loop peeling
- Index set splitting, loop unswitching
- Scalar replacement, scalar expansion
- Later: software pipelining
- More: cycle shrinking, loop skewing, ...

Loop Invariant Code Hoisting

Move loop invariant code out of the loop. Compilers can do this automatically if they can statically find out what code is loop invariant:

  // before:
  for (i=0; i<n; i++)
      a[i] = b[i] + c / d;

  // after:
  tmp = c / d;
  for (i=0; i<n; i++)
      a[i] = b[i] + tmp;
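The loop normalization slide itself was not transcribed; a minimal sketch of the idea (my own example): normalization rewrites a loop so that its index starts at 0 and steps by 1, which simplifies dependence testing and other transformations.

  /* original loop, lower bound 4, step 3 */
  for (i = 4; i < n; i += 3)
      a[i] = b[i];

  /* normalized: ii = 0, 1, 2, ...; i is reconstructed as 4 + 3*ii */
  for (ii = 0; ii < (n - 4 + 2) / 3; ii++)
      a[4 + 3*ii] = b[4 + 3*ii];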

Loop Unrolling

Unrolling replicates the loop body and reduces the loop overhead (total number of comparisons, branches, and increments). The longer loop body may enable further local optimizations (e.g. common subexpression elimination, register allocation, instruction scheduling, use of SIMD instructions), at the price of longer code. Unrolling can be enforced with compiler options, e.g. -funroll=2 (statements in the innermost loop body only):

  // original:
  for (i=0; i<50; i++) {
      a[i] = b[i];
  }

  // unrolled by 2:
  for (i=0; i<50; i+=2) {
      a[i]   = b[i];
      a[i+1] = b[i+1];
  }

Exercise: formulate the unrolling rule for a statically unknown upper loop limit.

Loop Interchange (1)

For properly nested loops. Example 1:

  // before (column-wise traversal):
  for (j=0; j<M; j++)
      for (i=0; i<N; i++)
          a[i][j] = ...;

  // after interchange (row-wise traversal):
  for (i=0; i<N; i++)
      for (j=0; j<M; j++)
          a[i][j] = ...;

With row-wise storage of 2D arrays (as in C and Java), the new iteration order can improve data access locality in the memory hierarchy (fewer cache misses / page faults).

Foundations: Loop-Carried Data Dependences

Recall: a data dependence S -> T holds if operation S may execute (dynamically) before operation T, both may access the same memory location, and at least one of these accesses is a write. In general, only a conservative over-estimation can be determined statically.

A data dependence S -> T is called loop carried by a loop L if the dependence S -> T may exist for instances of S and T in different iterations of L:

  L: for (i=1; i<N; i++) {
  T:     ... = x[i-1];
  S:     x[i] = ...;
     }

The loop-carried dependences define a partial order between the operation instances resp. iterations.

Loop Interchange (2)

Be careful with loop-carried data dependences! Example 2:

  for (i=1; i<N; i++)
      for (j=1; j<M; j++)
          a[i][j] = a[i+1][j-1] ...;

Iteration (i,j) reads location a[i+1][j-1], which will be overwritten in a later iteration (i+1,j-1). Interchanging the loop headers would violate the partial iteration order given by the data dependences.

Loop Interchange (3)

Example 3:

  for (i=1; i<N; i++)
      for (j=1; j<M; j++)
          a[i][j] = a[i-1][j-1] ...;

Iteration (i,j) reads location a[i-1][j-1], which was written in the earlier iteration (i-1,j-1). Here interchange is OK. Generally: interchanging loop headers is only admissible if the loop-carried dependences have the same direction for all loops in the loop nest (all directed along, or all against, the iteration order).

Loop Fusion

Merge subsequent loops with the same header. This is safe if neither loop carries a (backward) dependence:

  // before:
  for (i=0; i<N; i++)
      a[i] = ...;
  for (i=0; i<N; i++)
      ... = a[i];

  // after fusion:
  for (i=0; i<N; i++) {
      a[i] = ...;
      ... = a[i];
  }

The read of a[i] is still after the write of a[i], for every i. Without fusion, for N sufficiently large, a[i] will no longer be in the cache by the time the second loop reads it. Fusion can improve data access locality and reduces the number of branches.
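A self-contained program to observe the interchange effect (my code, not from the slides; the timings and the size N = 2048 are illustrative, and the column-wise version typically runs noticeably slower on a cached machine):

  #include <stdio.h>
  #include <time.h>

  #define N 2048
  static double a[N][N];   /* zero-initialized static array */

  int main(void) {
      double s = 0.0;
      clock_t t0 = clock();
      for (int j = 0; j < N; j++)       /* column-wise: strided accesses    */
          for (int i = 0; i < N; i++)
              s += a[i][j];
      clock_t t1 = clock();
      for (int i = 0; i < N; i++)       /* row-wise: contiguous accesses    */
          for (int j = 0; j < N; j++)
              s += a[i][j];
      clock_t t2 = clock();
      printf("col-wise %.3fs, row-wise %.3fs (s=%g)\n",
             (double)(t1 - t0) / CLOCKS_PER_SEC,
             (double)(t2 - t1) / CLOCKS_PER_SEC, s);
      return 0;
  }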

Loop Iteration Reordering and Loop Parallelization

If the i-loop carries a dependence, its iteration order must be preserved; likewise, if the j-loop carries a dependence, its iteration order must be preserved. Loops that carry no dependence can be parallelized. (The slide diagrams were not transcribed.)

Remark on Loop Parallelization

Introducing temporary copies of arrays can remove some anti-dependences and thereby enable automatic loop parallelization. Example:

  for (i=0; i<n; i++)
      a[i] = a[i] + a[i+1];   // loop-carried anti-dependence

The loop-carried dependence can be eliminated:

  for (i=0; i<n; i++)         // parallelizable loop
      aold[i+1] = a[i+1];
  for (i=0; i<n; i++)         // parallelizable loop
      a[i] = a[i] + aold[i+1];

Strip Mining / Loop Blocking / Tiling

Tiled Matrix-Matrix Multiplication (1)

Matrix-matrix multiplication C = A x B, here for square (n x n) matrices C, A, B with n large (on the order of 10^3): C[i][j] = sum over k of A[i][k] * B[k][j], for all i, j = 0..n-1. Standard algorithm (here without the initialization of the C entries to 0):

  for (i=0; i<n; i++)
      for (j=0; j<n; j++)
          for (k=0; k<n; k++)
              C[i][j] += A[i][k] * B[k][j];

Good spatial locality on A and C, but bad spatial locality on B (many capacity misses).

Tiled Matrix-Matrix Multiplication (2)

Block each loop by a block size S (choose S so that a block each of A, B and C fit in the cache together), then interchange loops. Code after tiling:

  for (ii=0; ii<n; ii+=S)
      for (jj=0; jj<n; jj+=S)
          for (kk=0; kk<n; kk+=S)
              for (i=ii; i<ii+S; i++)
                  for (j=jj; j<jj+S; j++)
                      for (k=kk; k<kk+S; k++)
                          C[i][j] += A[i][k] * B[k][j];

Good spatial locality for A, B and C.
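A self-contained version of the two variants for experimentation (my code, not from the slides; S = 32 is an assumed tile size and N is kept moderate so the program runs quickly):

  #include <stdio.h>
  #include <string.h>

  #define N 512
  #define S 32   /* assumed tile size: tune so three S x S blocks fit in cache */
  static double A[N][N], B[N][N], C[N][N];

  static void mm_naive(void) {
      memset(C, 0, sizeof C);
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              for (int k = 0; k < N; k++)
                  C[i][j] += A[i][k] * B[k][j];
  }

  static void mm_tiled(void) {
      memset(C, 0, sizeof C);
      for (int ii = 0; ii < N; ii += S)
          for (int jj = 0; jj < N; jj += S)
              for (int kk = 0; kk < N; kk += S)
                  for (int i = ii; i < ii + S; i++)
                      for (int j = jj; j < jj + S; j++)
                          for (int k = kk; k < kk + S; k++)
                              C[i][j] += A[i][k] * B[k][j];
  }

  int main(void) {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }
      mm_naive();  printf("naive: C[0][0] = %g\n", C[0][0]);
      mm_tiled();  printf("tiled: C[0][0] = %g\n", C[0][0]);
      return 0;
  }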

Remark on Locality Transformations

An alternative can be to change the data layout rather than the control structure of the program:
- Store matrix B in transposed form, or, if necessary, consider transposing it first, which may pay off over several subsequent computations.
- Finding the best layout for all multidimensional arrays is an NP-complete optimization problem [Mace, 1988].
- Recursive array layouts that preserve locality: Morton order, hierarchically laid-out tiled arrays. In the best case, these can make computations cache-oblivious, i.e. performance is largely independent of the cache size.

Loop Distribution (a.k.a. Loop Fission)

Loop Fusion

Loop Nest Flattening / Linearization

Loop Unrolling

Loop Unrolling with Unknown Upper Bound

(The bodies of these slides were not transcribed; a sketch of loop distribution follows.)
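Since the loop distribution slide body is missing, here is a minimal sketch of the transformation (my own example): distribution splits one loop into several loops over the same index range, which is legal if it does not reverse any dependence between the separated statements; it is the inverse of loop fusion and can enable vectorization or parallelization of one of the resulting loops.

  /* before: one loop with two statements */
  for (i = 1; i < n; i++) {
      a[i] = a[i-1] + 1;    /* S1: carries a dependence (stays sequential) */
      b[i] = c[i] * 2;      /* S2: independent iterations                  */
  }

  /* after distribution: S2's loop is now parallelizable/vectorizable */
  for (i = 1; i < n; i++)
      a[i] = a[i-1] + 1;
  for (i = 1; i < n; i++)
      b[i] = c[i] * 2;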

Loop Unroll-and-Jam

Loop Peeling

Index Set Splitting

Loop Unswitching

Scalar Replacement

Scalar Expansion / Array Privatization

(The bodies of these slides were not transcribed; a sketch of scalar expansion follows.)
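Since these slide bodies are missing, here is a minimal sketch of scalar expansion (my own example): a scalar temporary that is written and read in every iteration serializes the loop through anti- and output dependences; expanding it into an array (privatizing it per iteration) removes those dependences.

  /* before: the scalar t carries anti/output dependences across iterations */
  for (i = 0; i < n; i++) {
      t = a[i] + b[i];
      c[i] = t * t;
  }

  /* after scalar expansion: each iteration uses its own t_exp[i]; the
     loop is now parallelizable */
  for (i = 0; i < n; i++) {
      t_exp[i] = a[i] + b[i];
      c[i] = t_exp[i] * t_exp[i];
  }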

Idiom Recognition and Algorithm Replacement

C. Kessler: Pattern-driven automatic parallelization. Scientific Programming, 1996.
A. Shafiee-Sarvestani, E. Hansson, C. Kessler: Extensible recognition of algorithmic patterns in DSP programs for automatic parallelization. Int. J. on Parallel Programming, 2013.

Concluding Remarks: Limits of Static Analyzability; Outlook: Runtime Analysis and Parallelization

Remark on static analyzability (1)

Static dependence information is always a (safe) over-approximation of the real (run-time) dependences. Finding out the real dependences exactly is statically undecidable! If in doubt, a dependence must be assumed, which may prevent some optimizations or parallelization.

One main reason for imprecision is aliasing, i.e. the program may have several ways to refer to the same memory location. Pointer aliasing example:

  void mergesort(int* a, int n)
  {
      ...
      mergesort(a, n/2);
      mergesort(a + n/2, n - n/2);
      ...
  }

How could a static analysis tool (e.g., a compiler) know that the two recursive calls read and write disjoint subarrays of a?

Remark on static analyzability (2)

Another reason for imprecision are statically unknown values that determine whether a dependence exists or not. Unknown dependence distance:

  // value of K statically unknown
  for (i = 0; i < N; i++) {
  S:    a[i] = a[i] + a[K];
  }

There is a loop-carried dependence among the instances of S if K < N; otherwise, the loop is parallelizable.
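One standard way out, shown here as my own sketch (not on the slides), is run-time versioning: emit a guarded parallel version that is selected only when the run-time value of K proves independence. The example assumes an OpenMP-style annotation and that a has more than K elements:

  if (K >= N || K < 0) {
      /* a[K] is never written by the loop: no loop-carried dependence */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          a[i] = a[i] + a[K];
  } else {
      /* possible dependence: fall back to the sequential loop */
      for (int i = 0; i < N; i++)
          a[i] = a[i] + a[K];
  }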

Run-Time Parallelization

Goal of Run-Time Parallelization

Typical target: irregular loops

  for (i=0; i<n; i++)
      a[i] = f ( a[ g(i) ], a[ h(i) ], ... );

The array index expressions g, h, ... depend on run-time data, so the iterations can neither be statically proved independent nor dependent with distance +1. Principle: at run time, inspect g, h, ... to find out the real dependences and compute a schedule for partially parallel execution. This can also be combined with speculative parallelization.

Overview

Run-time parallelization of irregular loops:
- DOACROSS parallelization
- Inspector-executor technique (shared memory)
- Inspector-executor technique (message passing) *
- Privatizing DOALL test *
Speculative run-time parallelization of irregular loops:
- LRPD test *
- General thread-level speculation
- Hardware support *
(* = not covered in this course; see the references.)

DOACROSS Parallelization

Useful if the loop-carried dependence distances are unknown, but often > 1: independent subsequent loop iterations are allowed to overlap, with bilateral synchronization between really-dependent iterations.

  sh float aold[n];      // shared copy of a
  sh flag done[n];       // flag (semaphore) array
  forall i in 0..n-1 {   // spawn n threads, one per iteration
      done[i] = 0;
      aold[i] = a[i];    // create a copy
  }
  forall i in 0..n-1 {   // spawn n threads, one per iteration
      if (g(i) < i) {    // reads a value computed within this loop:
          wait until done[ g(i) ] is set;
          a[i] = f ( a[ g(i) ], ... );
      } else {           // reads a value from before the loop:
          a[i] = f ( aold[ g(i) ], ... );
      }
      set done[i];
  }

Inspector-Executor Technique (1)

The compiler generates pieces of customized code for such loops. The inspector calculates the values of the index expressions by simulating the whole loop execution; it is typically based on the sequential version of the source loop (some computations could be left out). It thereby implicitly computes the real iteration dependence graph, and it computes a parallel schedule as a (greedy) wavefront traversal of the iteration dependence graph in topological order: all iterations in the same wavefront are independent, and the schedule depth = number of wavefronts = critical path length. The executor then follows this schedule to execute the loop.

Inspector-Executor Technique (2)

Source loop:

  for (i=0; i<n; i++)
      a[i] = ... a[ g(i) ] ...;

Inspector:

  int wf[n];                // wavefront indices
  int depth = 0;
  for (i=0; i<n; i++)
      wf[i] = 0;            // initialization
  for (i=0; i<n; i++) {
      wf[i] = max ( wf[ g(i) ], wf[ h(i) ], ... ) + 1;
      depth = max ( depth, wf[i] );
  }

The inspector considers only flow dependences (RAW); anti- and output dependences are to be preserved by the executor.

Inspector-Executor Technique (3)

Executor:

  float aold[n];            // buffer array
  aold[0:n] = a[0:n];       // copy the old values
  for (w=0; w<depth; w++)
      forall (i, 0, n, #processors)
          if (wf[i] == w) {
              a1 = (g(i) < i) ? a[ g(i) ] : aold[ g(i) ];
              ...           // similarly, a2 for h, etc.
              a[i] = f ( a1, a2, ... );
          }

(The slide also tabulates, for a small example, the values g(i) and wf[i] and whether g(i) < i, together with the resulting iteration (flow) dependence graph; the figure was not transcribed.)
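A compact, self-contained sketch of the inspector and executor in C with OpenMP (my code, not from the slides), for the special case a[i] = f(a[g(i)]) with one index function; the contents of g stand in for run-time data:

  #include <stdio.h>
  #include <string.h>

  #define N 8
  static int   g[N] = {0, 0, 1, 1, 3, 2, 5, 4};  /* known only at run time */
  static float a[N] = {1, 1, 1, 1, 1, 1, 1, 1};
  static float aold[N];
  static int   wf[N];

  static float f(float x) { return x + 1.0f; }

  int main(void) {
      /* Inspector: iteration i has a flow dependence only on iteration
         g[i], and only when g[i] < i (otherwise it reads the old value) */
      int depth = 0;
      for (int i = 0; i < N; i++) {
          wf[i] = (g[i] < i) ? wf[g[i]] + 1 : 0;
          if (wf[i] + 1 > depth) depth = wf[i] + 1;
      }
      /* Executor: process one wavefront at a time; iterations within a
         wavefront are independent and may run in parallel */
      memcpy(aold, a, sizeof a);
      for (int w = 0; w < depth; w++) {
          #pragma omp parallel for
          for (int i = 0; i < N; i++)
              if (wf[i] == w)
                  a[i] = f(g[i] < i ? a[g[i]] : aold[g[i]]);
      }
      for (int i = 0; i < N; i++) printf("%g ", a[i]);
      printf("(depth %d)\n", depth);
      return 0;
  }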

Inspector-Executor Technique (4)

Problem: the inspector remains sequential, so there is no speedup. Solution approaches:
- Re-use the schedule over subsequent iterations of an outer loop if the access pattern does not change; this amortizes the inspector overhead across repeated executions.
- Parallelize the inspector using doacross parallelization [Saltz, Mirchandaney 1991].
- Parallelize the inspector using sectioning [Leung, Zahorjan 1991]: compute processor-local wavefronts in parallel, then concatenate them; this trades schedule quality (depth) against inspector speed.
- Parallelize the inspector using bootstrapping [Leung, Zahorjan 1991]: start with a suboptimal schedule obtained by sectioning and use it to execute the inspector, yielding a refined schedule.

Thread-Level Speculation

Speculatively Parallel Execution

For automatic parallelization of sequential code where dependences are hard to analyze statically. Works on a task graph constructed implicitly and dynamically. Speculate on: control flow, data independence, synchronization, values. We focus on thread-level speculation (TLS) for CMP/MT processors; speculative instruction-level parallelism is not considered here.

A task is
- statically: a connected, single-entry subgraph of the control-flow graph (basic blocks, loop bodies, loops, or entire functions);
- dynamically: a contiguous fragment of the dynamic instruction stream within a static task region, entered at the static task entry.

TLS Example

Exploiting module-level speculative parallelism (across function calls). (Figure from F. Warg: Techniques for Reducing Thread-Level Speculation Overhead in Chip Multiprocessors. PhD thesis, Chalmers TH, Gothenburg, June 2006.)

Data Dependence Problem in TLS

(Figure from F. Warg, PhD thesis, 2006; not transcribed.)

Speculatively Parallel Execution of Tasks

Speculation on inter-task control flow: after having assigned a task, predict its successor task and start it speculatively. Speculation on data independence: for inter-task memory data (flow) dependences, one can proceed conservatively (await the write: memory synchronization, message) or speculatively (hope for independence and continue, i.e. execute the load). Mis-speculation requires a roll-back of the speculative results (expensive): when starting speculation, the state must be buffered, and an offending task and all its successors are squashed and restarted. Speculative results are committed once the speculation is resolved to be correct: the task is then retired.

Selecting Tasks for Speculation

Small tasks: too much overhead (task startup, task retirement) and a low degree of parallelism. Large tasks: higher misspeculation probability and higher rollback cost; many speculations ongoing in parallel may saturate the resources. Load balancing issues: avoid large variation in task sizes. Task selection traverses the program's control flow graph (CFG) and applies heuristics for task size, control speculation and data dependence speculation.

TLS Implementations

Software-only speculation: typically for loops [Rauchwerger, Padua 1994, 1995]. Hardware-based speculation: typically integrated into cache coherence protocols; used with multithreaded processors / chip multiprocessors for automatic parallelization of sequential legacy code. If the source code is available, the compiler may help, e.g. with identifying suitable threads.

Some References on Dependence Analysis, Loop Optimizations and Transformations

H. Zima, B. Chapman: Supercompilers for Parallel and Vector Computers. Addison-Wesley / ACM Press, 1990.
M. Wolfe: High-Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2001.

Questions?

Some References on Idiom Recognition and Algorithm Replacement

C. Kessler: Pattern-driven automatic parallelization. Scientific Programming 5, 1996.
A. Shafiee-Sarvestani, E. Hansson, C. Kessler: Extensible recognition of algorithmic patterns in DSP programs for automatic parallelization. Int. J. on Parallel Programming, 2013.

Some References on Run-Time Parallelization

R. Cytron: Doacross: Beyond vectorization for multiprocessors. Proc. ICPP-1986.
D. Chen, J. Torrellas, P. Yew: An Efficient Algorithm for the Run-time Parallelization of DOACROSS Loops. Proc. IEEE Supercomputing Conf., Nov. 1994, IEEE CS Press.
R. Mirchandaney, J. Saltz, R. M. Smith, D. M. Nicol, K. Crowley: Principles of run-time support for parallel processors. Proc. ACM Int. Conf. on Supercomputing, July 1988.
J. Saltz, K. Crowley, R. Mirchandaney, H. Berryman: Run-time Scheduling and Execution of Loops on Message Passing Machines. Journal of Parallel and Distributed Computing 8 (1990).
J. Saltz, R. Mirchandaney: The preprocessed doacross loop. Proc. ICPP-1991 Int. Conf. on Parallel Processing.
S. Leung, J. Zahorjan: Improving the performance of run-time parallelization. Proc. ACM PPoPP-1993.
L. Rauchwerger, D. Padua: The Privatizing DOALL Test: A Run-Time Technique for DOALL Loop Identification and Array Privatization. Proc. ACM Int. Conf. on Supercomputing, July 1994.
L. Rauchwerger, D. Padua: The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. Proc. ACM SIGPLAN PLDI-95, 1995.

Some References on Speculative Execution / Parallelization

J. Martinez, J. Torrellas: Speculative Locks for Concurrent Execution of Critical Sections in Shared-Memory Multiprocessors. Proc. WMPI at ISCA, 2001.
F. Warg, P. Stenström: Limits on speculative module-level parallelism in imperative and object-oriented programs on CMP platforms. Proc. IEEE PACT 2001.
P. Marcuello, A. Gonzalez: Thread-spawning schemes for speculative multithreading. Proc. HPCA-8, 2002.
J. Steffan et al.: Improving value communication for thread-level speculation. Proc. HPCA-8, 2002.
M. Cintra, J. Torrellas: Eliminating squashes through learning cross-thread violations in speculative parallelization for multiprocessors. Proc. HPCA-8, 2002.
T. Vijaykumar, G. Sohi: Task Selection for a Multiscalar Processor. Proc. MICRO-31, Dec. 1998.
F. Warg, P. Stenström: Improving speculative thread-level parallelism through module run-length prediction. Proc. IPDPS 2003.
F. Warg: Techniques for Reducing Thread-Level Speculation Overhead in Chip Multiprocessors. PhD thesis, Chalmers TH, Gothenburg, June 2006.
T. Ohsawa et al.: Pinot: Speculative multi-threading processor architecture exploiting parallelism over a wide range of granularities. Proc. MICRO-38, 2005.