Hybrid Analysis and its Application to Thread Level Parallelization. Lawrence Rauchwerger


1 Hybrid Analysis and its Application to Thread Level Parallelization Lawrence Rauchwerger

2 Thread (Loop) Level Parallelization
- Thread-level parallelization: extracting parallel threads from a sequential program.
- Manual: very expensive for sequential legacy code (MPI, OpenMP).
- Automatic: so far not very successful, due to weak static analysis and decisions that depend on dynamic (input-dependent) values.

3 What is a parallel loop?
- Parallelizable loop (each iteration touches disjoint elements):
      DO i = 1, N
         a(i) = a(i) + b(i)
      ENDDO
- Sequential loop (flow dependence from iteration i to iteration i+1):
      DO i = 1, N
         a(i+1) = a(i) + b(i)
      ENDDO
- (Figure: per-iteration read/write traces for both loops.)

4 Static Data Dependence Analysis: Linear Reference Patterns
- No cross-iteration dependences:
      DO j = 1, 10
         a(j) = a(j+40)
      ENDDO
- The dependence system has no integer solutions:
      1 <= j_w <= 10,  1 <= j_r <= 10,  j_w = j_r + 40
  (since j_r + 40 >= 41 > 10 >= j_w).
- Geometric view: polytope model; some convex body contains no integral points.
- Simplified solutions: GCD Test, Banerjee Test, etc. Potentially overly conservative.
- General solution: Presburger formula decidability. The Omega Test is precise but potentially slow.
- Restricted to linear addressing and control: mostly small kernels.

5 Static Data Dependence Analysis: Nonlinear Reference Patterns
      DO j = 1, N
         IF (x(j) > 0) THEN
            a(f(j)) = ...
         ENDIF
      ENDDO
- No known solution in general. Common cases: indirect access (subscripted subscripts), control dependence on data values, recurrences without closed form.
- Linear approximation: Maslov 94; Creusillet, Irigoin 96.
- Range Test (Blume, Eigenmann 94).
- Monotonicity (Wu, Cohen, Padua 01).
- Symbolic unknowns: Uninterpreted Function Symbols (Pugh, Wonnacott 95); Fuzzy Array Dataflow (Barthou, Collard, Feautrier 95).
- User assistance: Uninterpreted Function Symbols (Pugh, Wonnacott 95); SUIF Explorer (Liao 00).
- Most nonlinear cases are not solvable statically: they need run-time analysis.

6 Alternative: Run-time Analysis
      READ *, N
      DO j = 1, N
         a(j) = a(j+40)
      ENDDO
- Linear and very simple, but not decidable statically!
- Solution: the LRPD run-time test (Rauchwerger and Padua 95).
- Instrument all relevant memory references, then analyze the resulting trace at run time.
- (Figure: marked WRITE/READ ranges; N = 5 yields independent, N = 45 yields dependent.)
- Accurate, but overhead is proportional to the dynamic memory reference count.
- Minimum necessary information for parallelization: N <= 40.
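As a rough illustration of what the LRPD instrumentation does for this loop, here is a minimal marking-and-analysis sketch (a simplification: the real LRPD test runs speculatively in parallel and also handles privatization and reductions; MAXM is an assumed shadow-array bound with MAXM >= N+40):

      LOGICAL FUNCTION LRPDOK(N)
C     Minimal sketch of LRPD-style marking for the loop
C        DO j = 1, N:  a(j) = a(j+40)
C     Marks every dynamic read and write of a() in shadow
C     arrays, then reports independence when no element is
C     both written and read across the loop.
      INTEGER N, J, MAXM
      PARAMETER (MAXM = 1000)
      LOGICAL WR(MAXM), RD(MAXM)
      DO J = 1, MAXM
         WR(J) = .FALSE.
         RD(J) = .FALSE.
      ENDDO
C     Marking phase: one mark per dynamic reference.
      DO J = 1, N
         RD(J+40) = .TRUE.
         WR(J)    = .TRUE.
      ENDDO
C     Analysis phase: an element that is both written and
C     read signals a cross-iteration dependence.
      LRPDOK = .TRUE.
      DO J = 1, MAXM
         IF (WR(J) .AND. RD(J)) LRPDOK = .FALSE.
      ENDDO
      END

For N = 5 the marked ranges WRITE [1:5] and READ [41:45] are disjoint (independent); for N = 45, elements 41 through 45 are both written and read (dependent), matching the slide. The cost is one mark per dynamic reference, which is exactly the overhead the rest of the talk works to avoid.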

7 Compile-time vs. Run-time
Compile time:
- PRO: no run-time overhead.
- CON: too conservative in the presence of input or computed values, indirection, data-dependent control, weak symbolic analysis, and complex recurrences; full symbolic analysis is impractical (combinatorial explosion).
Run time, reference by reference:
- PRO: always finds answers.
- CON: run-time overhead proportional to the dynamic memory reference count, plus unnecessary work, since it ignores partial compile-time results.

8 Why Did We Fail?
- Static analysis cannot be sufficient: it is weak and/or input sensitive.
- Dynamic analysis (often misunderstood as only speculation) is not a substitute for static analysis: it is too costly.
- There was no program-level representation suitable for representing statically unanalyzable code.

9 Hybrid Analysis of Memory Reference Patterns (Rus ICS 02, 07)
- (Figure: a spectrum from STATIC to DYNAMIC: symbolic analysis; extract conditions; evaluate conditions; full reference-by-reference analysis.)
- Framework: hybrid memory reference analysis.
- Application: automatic parallelization.

10 Hybrid Analysis
      DO j = 1, N
         a(j) = a(j+40)
      ENDDO
- 4.a) If we can prove 10 <= N <= 30 at compile time, generate a parallel loop:
      DO PARALLEL j = 1, N
         a(j) = a(j+40)
      ENDDO
- 4.b) If N is unknown, extract a run-time test (N <= 40):
      IF (N .LE. 40) THEN
         DO PARALLEL j = 1, N
            a(j) = a(j+40)
         ENDDO
      ELSE
         DO j = 1, N
            a(j) = a(j+40)
         ENDDO
      ENDIF
- No run-time tests are performed if they are not necessary!
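DO PARALLEL is slide pseudocode. Since the pipeline on slide 24 emits OpenMP and compiles with ifort, the extracted run-time test plausibly materializes as a guard of the following shape (a sketch, not the literal Polaris output; the caller must size A with at least N+40 elements):

      SUBROUTINE HYBRID(A, N)
      INTEGER N, J
      REAL A(*)
      IF (N .LE. 40) THEN
C        Run-time test passed: no cross-iteration dependences.
!$OMP    PARALLEL DO
         DO J = 1, N
            A(J) = A(J+40)
         ENDDO
      ELSE
C        Test failed: fall back to the original sequential loop.
         DO J = 1, N
            A(J) = A(J+40)
         ENDDO
      ENDIF
      END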

11 Hybrid Analysis: Compile-time Phase
      DO j = 1, N
         a(j) = a(j+40)
      ENDDO
Under what conditions can the loop be executed in parallel?
1. Collect and classify memory references: READ a(j+40) and WRITE a(j), for j = 1, N.
2. Aggregate them symbolically: READ [41:40+N], WRITE [1:N].
3. Formulate the independence test: is [41:40+N] intersected with [1:N] empty? The intervals are disjoint exactly when 41 > N, i.e., N <= 40.
4.a) If we can prove 10 <= N <= 30, declare the loop parallel.
4.b) If N is unknown, extract the run-time test N <= 40.

12 Aggregation of Linear References Across an Iteration Space
      DO j = 1, 100
         A(j) = ...
      ENDDO
- WF (write-first) pattern for A: [1:100].
- LMAD = Linear Memory Access Descriptor (Hoeflinger 98): multidimensional, strided intervals.

13 Gate Operator: Postpone Analysis Failure
      IF (x = 0) THEN
         DO j = 1, 100
            A(j) = ...
         ENDDO
      ENDIF
- The truth value of (x = 0) is not known at compile time.
- The write set of A becomes the gated descriptor Gate(x = 0, [1:100]): it is [1:100] when the gate holds and empty otherwise.

14 Recurrence Operator: Postpone Static Analysis Failure due to a Nonlinear Reference Pattern
      DO j = 1, 3
         x = f(x)
         IF (x = 0) THEN
            DO i = 1, 100
               A(i) = ...
            ENDDO
         ENDIF
      ENDDO
- x follows a recurrence with no closed form, so the gate x = 0 cannot be resolved statically; the USR records the gated descriptor Gate(x = 0, [1:100]) under a recurrence over j = 1, 3.
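Because x = f(x) has no closed form, the gate cannot be decided statically, but the postponed USR is cheap to evaluate at run time. A minimal sketch (F is a hypothetical external function standing in for the unanalyzable update):

      LOGICAL FUNCTION WSETNE(X0)
C     Run-time evaluation of the postponed USR above: the
C     aggregate write set of A is [1:100] if the gate x = 0
C     holds in some outer iteration j = 1, 3, and empty
C     otherwise.
      INTEGER X0, X, J, F
      EXTERNAL F
      X = X0
      WSETNE = .FALSE.
      DO J = 1, 3
         X = F(X)
         IF (X .EQ. 0) WSETNE = .TRUE.
      ENDDO
      END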

15 Uniform Set of References (USR)
- Closed under composition at program level.
- Grammar:
      T = { LMAD, union, intersection, difference, (, ), #, Theta, Gate, Recurrence, Call Site }
      N = { USR }
      S = USR
      P = { USR -> LMAD | ( USR )
            USR -> USR union USR | USR intersection USR | USR difference USR
            USR -> Gate # USR
            USR -> Recurrence USR
            USR -> Call Site Theta USR }
- LMAD = Start + [Stride_1:Span_1, Stride_2:Span_2, ...]
- (Figure: example USR tree: Gate(x = 0, [1:100]) expanded over the recurrence j = 1, 3.)
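To make the descriptor concrete, the sketch below is a conservative overlap test for two one-dimensional LMADs Start + [Stride:Span], combining the extreme-value and GCD reasoning from slide 4 (an illustration, not the Polaris implementation; spans are assumed to be non-negative multiples of the strides):

      LOGICAL FUNCTION LMADOV(S1, D1, P1, S2, D2, P2)
C     Accesses of LMAD (S, D, P) are S, S+D, ..., S+P.
C     Returns .FALSE. only when the two access sets are
C     provably disjoint; .TRUE. means "may overlap".
      INTEGER S1, D1, P1, S2, D2, P2
      INTEGER A, B, R
C     Extreme-value (Banerjee-style) range check.
      IF (S1 + P1 .LT. S2 .OR. S2 + P2 .LT. S1) THEN
         LMADOV = .FALSE.
         RETURN
      ENDIF
C     GCD test: S1 + k1*D1 = S2 + k2*D2 is solvable only if
C     gcd(D1, D2) divides S2 - S1.
      A = ABS(D1)
      B = ABS(D2)
   10 IF (B .NE. 0) THEN
         R = MOD(A, B)
         A = B
         B = R
         GOTO 10
      ENDIF
      IF (A .GT. 0 .AND. MOD(S2 - S1, A) .NE. 0) THEN
         LMADOV = .FALSE.
         RETURN
      ENDIF
      LMADOV = .TRUE.
      END

On the running example, the write descriptor 1 + [1:N-1] (elements 1..N) and the read descriptor 41 + [1:N-1] (elements 41..40+N) are already separated by the range check exactly when N < 41, which is the N <= 40 test again.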

16
      READ *, N
      DO j = 1, N
         a(j) = a(j+40)
      ENDDO
- Are there any cross-iteration dependences?
1. Collect references: READ a(j+40) and WRITE a(j), for j = 1, N.
2. Aggregate them symbolically and classify reference types: READ [41:40+N], WRITE [1:N].
3. Formulate the independence test: [41:40+N] intersected with [1:N] = empty? Extract the lowest-cost run-time test: N <= 40.

17 Data Dependences
- Given: a loop j = 1, N and per-iteration aggregated descriptors RO_j, WF_j, RW_j.
- Solve the equation: (union over j of RO_j) intersected with (union over j of WF_j) = empty?
- At compile time: if RO intersect WF evaluates to empty, the loop is independent; if it evaluates to a set that is provably not empty, it is dependent.
- All other cases: generate a run-time dependence test.
- Similar equations are used for privatization and reduction recognition.

18
      READ *, N
      DO j = 1, N
         a(j) = a(j+40)
      ENDDO
- Are there any cross-iteration dependences?
1. Collect references: READ a(j+40) and WRITE a(j), for j = 1, N.
2. Aggregate them symbolically and classify reference types: READ [41:40+N], WRITE [1:N].
3. Formulate the independence test: [41:40+N] intersected with [1:N] = empty? Extract the lowest-cost run-time test: N <= 40.

19 Algorithm: Recursive Descent
      DO j = 1, n
         A(j) = A(j+40)
         IF (x > 0) THEN
            A(j) = A(j) + A(j+20)
         ENDIF
      ENDDO
- Start from the USR equation: ([41:40+n] union Gate(x > 0, [21:20+n])) intersected with [1:n] = empty?
1. Distribute the intersection: [41:40+n] intersect [1:n] = empty, AND Gate(x > 0, [21:20+n]) intersect [1:n] = empty.
2. Reverse the first term into the predicate n <= 40.
3. Reverse the gated term: either the gate is false (x <= 0), or [21:20+n] intersect [1:n] = empty, i.e., n <= 20.
4. Resulting predicate: (n <= 20 OR x <= 0) AND n <= 40.
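The extracted predicate can then be emitted directly as scalar code guarding the loop. A plausible shape of the generated code (a sketch; declarations of N, X, J and A assumed):

      LOGICAL INDEP
C     PDAG from the recursive descent above:
C        (n <= 20  OR  x <= 0)  AND  n <= 40
      INDEP = (N .LE. 20 .OR. X .LE. 0) .AND. N .LE. 40
      IF (INDEP) THEN
!$OMP    PARALLEL DO
         DO J = 1, N
            A(J) = A(J+40)
            IF (X .GT. 0) A(J) = A(J) + A(J+20)
         ENDDO
      ELSE
         DO J = 1, N
            A(J) = A(J+40)
            IF (X .GT. 0) A(J) = A(J) + A(J+20)
         ENDDO
      ENDIF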

20 Novel Static/Dynamic Interface: Predicate DAG (PDAG)
- Represents any dynamic condition for loop parallelization.
- Grammar:
      T = { Logical Expression, and, or, Theta, Recurrence, Call Site, Library Routine }
      N = { PDAG }
      S = PDAG
      P = { PDAG -> Logical Expression
            PDAG -> PDAG and PDAG | PDAG or PDAG
            PDAG -> Recurrence PDAG
            PDAG -> Call Site Theta PDAG
            PDAG -> Library Routine }
- (Figure: example PDAG over the leaf predicates n <= 100, x <= 0, n <= 200.)
- Expressive: can encode arbitrary dependence questions. Inexpensive: evaluates quickly at run time.

21 PDAG Extraction
- Input: a USR equation D = empty?
- Output: a PDAG P.
- Optimistic, sufficient predicate: P => (D = empty).
- Pessimistic, necessary predicate: (D = empty) => P.

22 Fallback
      DO j = 1, 100
         A(ind1(j)) = A(ind2(j))
      ENDDO
- Not all equations can be reversed into a scalar predicate: here the test is
  (union over j of {ind1(j)}) intersected with (union over j of {ind2(j)}) = empty?
- Fallback solutions: trace-based tests or speculation, possibly on aggregated references.

23 Hybrid Analysis: Run Time
- A cascade of tests, cheapest first:
  1. O(1) scalar comparison; on failure,
  2. O(n/k) comparisons on aggregated descriptors; on failure,
  3. a reference-based test.
- Any stage that passes proves the loop independent; if all stages fail, the loop is dependent.
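Schematically, each stage runs only if the cheaper one failed to prove independence. A sketch with hypothetical evaluators PDAGOK (the O(1) scalar PDAG), INTVOK (the O(n/k) aggregated-interval comparisons) and LRPDOK (the reference-by-reference fallback):

      LOGICAL PDAGOK, INTVOK, LRPDOK, INDEP
      EXTERNAL PDAGOK, INTVOK, LRPDOK
C     Stage 1: O(1) scalar comparison extracted at compile time.
      INDEP = PDAGOK()
C     Stage 2: O(n/k) comparisons on aggregated descriptors.
      IF (.NOT. INDEP) INDEP = INTVOK()
C     Stage 3: reference-by-reference test.
      IF (.NOT. INDEP) INDEP = LRPDOK()
      IF (INDEP) THEN
C        ... execute the parallel version of the loop
      ELSE
C        ... execute the sequential version of the loop
      ENDIF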

24 Evaluation Methodology
- Automatic parallelization using Hybrid Analysis in Polaris.
- Analyses/transformations: dependence analysis, privatization, reduction, pushback.
- Candidate loop selection: profiling. No scheduling or memory locality optimization. A simple dynamic mechanism rules out very small loops.
- Pipeline: Fortran 77 -> Polaris -> Fortran 77 + HA (OpenMP) -> Intel ifort 9.0 -> executable code.
- Experiment setup: SPEC (89/92/2000/2006), PERFECT, and other benchmarks; dual-core Intel Core Duo (Lenovo X60s); dual-core AMD, quad socket (Sun).

25 Polaris-HA vs. Ifort Coverage: PERFECT/SPEC 89/92 (figure)

26 Polaris-HA vs. Ifort Coverage: PERFECT Benchmarks (figure)

27 Polaris-HA vs. Ifort Coverage: SPEC 89/92 (figure)

28 Polaris-HA vs. Ifort Coverage: SPEC 2000/06 (figure)

29 Speedups, SPEC 2000/06: AMD dual core, quad socket (figure)

30 Speedups, PERFECT/SPEC 89/92: Intel Core Duo (figure)

31 Speedups, PERFECT: Intel Core Duo (figure)

32 Speedups, SPEC 89/92: Intel Core Duo (figure)

33 Conclusions: HA + Autopar
- Hybrid memory reference analysis is a general framework for optimization.
- Representation is crucial: the USR is a closed-form representation that tolerates analysis failure.
- The PDAG captures the input sensitivity of optimization decisions, giving a continuum of compile-time to run-time solutions.
- Result: efficient automatic parallelization, with good speedups on FP benchmark applications.
