Run-time Reordering Transformations
Run-time Reordering Transformations
Michelle Mills Strout
December 5, 2005

Important Irregular Science & Engineering Applications
- Molecular dynamics
- Finite element analysis
- Sparse matrix computations

Lackluster Performance in Irregular Applications
[Figure: percent of peak Mflops vs. problem size, true vs. effective Mflops, for a BiCGSTAB solver on an SGI Origin 2000 [Mahinthakumar & Saied 99]]
- 10% of peak is typical for irregular applications
- Performance degrades upon falling out of L2 cache
- The processor-memory performance gap is increasing

Ideal Compiler/System for Irregular Applications
- Challenge: unable to effectively reorder data and computation at compile-time in irregular applications
- Approach: run-time reordering transformations
- Goal: a framework in which the compiler takes transformations 1, 2, ..., N and generates inspector/executor code plus decision code
Contributions
- Full Sparse Tiling (FST)
  - Data locality in Gauss-Seidel (ICCS01) (JHPCA03)
  - Parallelism for Gauss-Seidel (LCPC02)
- Framework for run-time reordering transformations and their compositions (PLDI03)

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work

Reordering Transformations Example: Loop with an Irregular Memory Reference
- Goal: improve data locality and exploit parallelism
- How? Reordering data at runtime; reordering iterations at runtime
- Result: performance improvements for irregular applications

for i=0,7
  Y[i] = Z[x[i]]

- Z is the data array; x is the index array
- Simple cache model: 2 elements of Z in each cache line, 2 cache lines of Z in cache at any one time
[Figure: cache-miss trace — under this model, every access Z[x[i]] misses (M)]
Data Locality
- Spatial locality occurs when memory locations mapped to the same cache line are used before the cache line is evicted.
- Temporal locality occurs when the same memory location is reused before its cache line is evicted.

Data Reordering
for i=0,7
  Y[i] = Z'[x'[i]]
- Reorder items in data array Z at run-time
- Update index array x with the new locations
- Increases spatial locality
[Figure: after data reordering, some accesses become spatial-locality hits (S) in the cache-miss trace]

Iteration Reordering
for i'=0,7
  Y'[i'] = Z[x'[i']]
- Reorder iterations of the i loop
- Remap index array x according to the new i loop ordering
- Increases temporal and spatial locality
[Figure: after iteration reordering, the trace shows temporal (T) and spatial (S) hits]

Inspector/Executor [Saltz et al. 1994]
- Inspector: traverses through index arrays; creates data and/or iteration reordering functions; reorders data and updates any affected index arrays by applying the reordering functions
- Executor: a transformed version of the original code which uses the reordered data arrays and/or executes the new iteration order
Inspector and Executor for Data Reordering
The inspector traverses the index array, generates a data reordering function sigma, reorders the data, and updates the index array.

Original code:
for i=0,7
  Y[i] = Z[x[i]]

Inspector:
for i=0,7
  ... x[i] ...
for k=0,7
  sigma[k] = ...
for k=0,7
  Z'[sigma[k]] = Z[k]
  x'[k] = sigma[x[k]]

Executor:
for i=0,7
  Y[i] = Z'[x'[i]]

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work

Effect of Composing Reordering Transformations
Sample Irregular Benchmarks [Han & Tseng 2000]
- Hand-generated inspector and executor code for three benchmarks
- Used existing data and iteration reordering heuristics within one loop
- Composed these with full sparse tiling to reorder iterations between loops
- Measured performance improvements and overhead

for s=1,T
  for i=1,n
    ... = ... Z[i]
  for k=1,m
    Z[l[k]] = ...
    Z[r[k]] = ...
  for k=1,n
    ... += Z[k]

- IRREG: partial differential equation solver
- Non-bonded force calculations in molecular dynamics: MOLDYN abstracted from CHARMM; NBF abstracted from GROMOS (different data structure)
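As a concrete illustration of the inspector/executor pattern above, here is a minimal runnable sketch (not the talk's actual code; function names are hypothetical). The inspector builds sigma in first-touch order over the index array, in the spirit of consecutive packing, and the executor then runs against the reordered data.

```python
# Sketch of a data-reordering inspector/executor for "for i: Y[i] = Z[x[i]]".
# sigma maps old data locations to new ones, chosen by first touch in x.

def inspector(x, n_data):
    """Build sigma: old location -> new location, in first-touch order of x."""
    sigma = [None] * n_data
    next_loc = 0
    for i in range(len(x)):          # traverse the index array
        if sigma[x[i]] is None:
            sigma[x[i]] = next_loc
            next_loc += 1
    for old in range(n_data):        # pack never-touched elements at the end
        if sigma[old] is None:
            sigma[old] = next_loc
            next_loc += 1
    return sigma

def apply_reordering(Z, x, sigma):
    """Z'[sigma[k]] = Z[k]; x'[k] = sigma[x[k]]."""
    Zp = [None] * len(Z)
    for k in range(len(Z)):
        Zp[sigma[k]] = Z[k]
    xp = [sigma[x[i]] for i in range(len(x))]
    return Zp, xp

def executor(Zp, xp):
    """Transformed loop: Y[i] = Z'[x'[i]]."""
    return [Zp[xp[i]] for i in range(len(xp))]
```

Because the executor dereferences the remapped index array, it computes the same Y as the original loop while touching Z' in nearly consecutive order.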
Data Mapping for the k Loop
Data Reordering: CPACK [Ding & Kennedy 99]

for s=1,T
  for i=1,n
    ... = ... Z[i]
  for k=1,m
    Z[l[k]] = ...
    Z[r[k]] = ...
  for k=1,n
    ... += Z[k]

[Figure: CPACK packs the elements of Z in the order the k loop first touches them, producing a reordered array Z' (e.g. Z4, Z5, Z2, Z3, Z6, ...)]

Iteration Reordering: lexgroup [Ding & Kennedy 99]
[Figure: lexgroup reorders the k iterations so that iterations touching nearby elements of the reordered array Z' execute consecutively]

Dependences Between Loops
[Figure: dependences between the i loop and the k loops, carried through the reordered array Z']
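The two single-loop heuristics named on these slides can be sketched roughly as follows — a simplified reading of [Ding & Kennedy 99] with hypothetical helper names, not the benchmark code: CPACK packs data in first-touch order of the k loop's index arrays, and lexgroup then reorders the k iterations by the new locations of the data each iteration touches.

```python
# Simplified sketch of CPACK (data reordering) and lexgroup (iteration
# reordering) for a loop "for k: ... Z[l[k]] ... Z[r[k]] ...".

def cpack(index_arrays, n_data):
    """First-touch packing over interleaved accesses l[0], r[0], l[1], r[1], ..."""
    sigma, next_loc = {}, 0
    for accesses in zip(*index_arrays):   # one tuple of accesses per iteration k
        for a in accesses:
            if a not in sigma:
                sigma[a] = next_loc
                next_loc += 1
    for d in range(n_data):               # pack untouched data at the end
        if d not in sigma:
            sigma[d] = next_loc
            next_loc += 1
    return sigma

def lexgroup(l, r, sigma):
    """Reorder the k iterations by the new locations of the data they touch."""
    key = lambda k: (min(sigma[l[k]], sigma[r[k]]), max(sigma[l[k]], sigma[r[k]]))
    order = sorted(range(len(l)), key=key)
    delta = [None] * len(l)               # delta: old iteration -> new position
    for new_k, old_k in enumerate(order):
        delta[old_k] = new_k
    return delta
```

After both run, consecutive iterations of the transformed k loop tend to access consecutive locations of Z', which is exactly the locality effect the composition slides measure.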
Iteration Reordering: Full Sparse Tiling (FST)
Iteration & Data Reordering: Tile Packing (tilepack)

Executor for the Composed Transformations

Original:
for s=1,T
  for i=1,n
    ... = ... Z[i]
  for k=1,m
    Z[l[k]] = ...
    Z[r[k]] = ...
  for k=1,n
    ... += Z[k]

Transformed:
for s=1,T
  for t=1,ntiles
    for i' in sched(t, i)
      ... = ... Z'[i']
    for k' in sched(t, k)
      Z'[l'[k']] = ...
    for k' in sched(t, k)
      ... += Z'[k']

Effect of FST in Composition
[Figure: normalized executor execution time on an IBM Power3 (375 MHz, 64 KB L1 cache) for IRREG, NBF, and MOLDYN at several data-set sizes, comparing CPACK+lexgroup, CPACK+lexgroup+FST, Gpart+lexgroup, and Gpart+lexgroup+FST]
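The control structure of the composed executor above can be sketched generically. This is only a skeleton with a hypothetical schedule representation — in the talk, the real schedule sched(t, loop) is produced by the full sparse tiling inspector.

```python
# Skeleton of the composed executor: visit tiles in order and, within each
# tile, run that tile's scheduled iterations of each inner loop in turn.
# sched maps (tile, loop_name) -> list of iterations (hypothetical encoding).

def composed_executor(T, ntiles, sched, body):
    """body(loop_name, iteration) executes one iteration of one inner loop."""
    for s in range(T):                      # time steps
        for t in range(ntiles):             # tiles
            for name in ('i', 'k1', 'k2'):  # the three inner loops
                for it in sched.get((t, name), []):
                    body(name, it)
```

The schedule must respect the dependences between the inner loops, which is what the inspector checks; given that, each tile's working set stays small enough to remain in cache across all three loops.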
Effect of FST in Composition
[Figure: normalized executor execution time on an Intel Pentium 4 (1.7 GHz, 8 KB L1 cache) for IRREG, NBF, and MOLDYN, same four compositions: CPACK+lexgroup, CPACK+lexgroup+FST, Gpart+lexgroup, Gpart+lexgroup+FST]

Inspector for FST Compositions
IBM Power3 and Intel Pentium 4
[Table: break-even points, in timesteps, for IRREG, NBF, and MOLDYN; e.g. NBF breaks even after 5 to 8 timesteps]

Parallelism with Full Sparse Tiling
- The inspector determines dependences between tiles
[Figure: full sparse tiled iteration space across processors P1, P2 and time, with the resulting tile dependence graph]

Experimental Results
[Figure: speedup without overhead on an IBM Power3 (375 MHz, 8 MB L2 cache), owner-computes vs. full sparse tiling at p=2, 4, 8, on three meshes: Sphere (N=54,938, NZ=1,508,390), Pipe (N=38,120, NZ=5,300,288), Wing (N=924,672, NZ=38,360,266)]
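The tile dependence graph that the inspector builds can be executed in parallel "waves" via a topological traversal: tiles in the same wave have no dependences on one another, so they may run concurrently. A minimal sketch (not the talk's implementation):

```python
# Sketch: group tiles into wavefronts of a tile dependence graph.
# deps is a set of (src_tile, dst_tile) edges; src must finish before dst.
from collections import defaultdict

def wavefront_schedule(ntiles, deps):
    """Return a list of waves; tiles within one wave can run in parallel."""
    indeg = [0] * ntiles
    succ = defaultdict(list)
    for a, b in deps:
        succ[a].append(b)
        indeg[b] += 1
    wave = [t for t in range(ntiles) if indeg[t] == 0]
    waves = []
    while wave:
        waves.append(wave)
        nxt = []
        for t in wave:                 # retire this wave's tiles
            for u in succ[t]:
                indeg[u] -= 1
                if indeg[u] == 0:      # all predecessors done
                    nxt.append(u)
        wave = nxt
    return waves
```

Owner-computes parallelization, by contrast, partitions by data ownership and synchronizes every loop; wavefront execution of tiles is what lets full sparse tiling overlap work across the loops.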
Experimental Results
[Figure: speedup without overhead on a Sun HPC10000 (36 UltraSPARC II, 400 MHz, 4 MB L2 cache), owner-computes vs. full sparse tiling at p=2, 4, 8, 16, 32, on the Sphere, Pipe, and Wing meshes]

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work

Framework Needed for Compositions
Current options:
- Analyze and generate inspectors one at a time
- Prepare every interesting composition by hand
The framework should provide a uniform representation for manipulating descriptions of any run-time reordering transformation at compile-time.

Key Insights for Composing Reordering Transformations
- The inspectors traverse the data mappings and/or the data dependences
- We can express how the data mappings and data dependences will change
- Subsequent inspectors traverse the new data mappings and data dependences
Framework Elements [Kelly & Pugh 95] [Pugh & Wonnacott 95]

Data mapping (compile-time):
  M_{I->Z} = {[i] -> [x(i)] : 0 <= i <= 7}

Data dependences (compile-time):
  D_{I->I} = {[i] -> [k] : (0 <= i < k <= 7) and ((i = b(k)) or (k = b(i)))}

Reordering Transformations (PLDI 03)
- Data reordering: R_{Z->Z'} = {[i] -> [sigma(i)]}
- Iteration reordering: T_{I->I'} = {[i] -> [delta(i)]}

Data Mappings and Data Dependences for MOLDYN

for ts=1,T
  for i=1,n
    Z[i] = ...
  for k=1,M
    ... = Z[l[k]]
    ... = Z[r[k]]

- Data mapping for the i loop: M_{I0->Z0} = {[i] -> [i]}
- Data mapping for the k loop: M_{J0->Z0} = {[k] -> [i] : (i = l(k)) or (i = r(k))}
- Data dependences between the i and k loops: D_{I0->J0} = {[i] -> [k] : (i = l(k)) or (i = r(k))}

Data Reordering: CPACK
- R_{Z0->Z1} = T_{I0->I1} = {[i] -> [sigma(i)]}
- Updated data mapping: M_{J0->Z1} = {[k] -> [sigma(i)] : (i = l(k)) or (i = r(k))}
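The update rule on this slide is a relation composition: at compile time sigma is an uninterpreted function, and once the inspector has computed it, the new mapping is M_{J0->Z1} = R_{Z0->Z1} o M_{J0->Z0}. A small sketch, modeling the relations as finite sets of pairs (an illustration, not the framework's actual representation):

```python
# Sketch: compose a data reordering R (z0 -> z1) with a data mapping M
# (iteration j -> data z0) to get the updated mapping (j -> z1).

def compose(R, M):
    """Return {(j, z1)} where M relates j -> z0 and R relates z0 -> z1."""
    Rmap = dict(R)                      # R is a function, so a dict suffices
    return {(j, Rmap[z0]) for (j, z0) in M}
```

Because every transformation is expressed as such a relation, the compiler can symbolically push any reordering through the remaining mappings and dependences before the run-time values of sigma or delta exist.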
Iteration Reordering: lexgroup [Ding & Kennedy 99]
- T_{J0->J1} = {[k] -> [delta(k)]}
- Updated data mapping: M_{J1->Z1} = {[delta(k)] -> [sigma(i)] : (i = l(k)) or (i = r(k))}
- Updated data dependences: D_{I1->J1} = {[sigma(i)] -> [delta(k)] : (i = l(k)) or (i = r(k))}

Iteration Reordering: Full Sparse Tiling (FST)
- T_{I1->I2} = {[i] -> [theta(1,i), 1, i]}
- T_{J1->J2} = {[k] -> [theta(2,k), 2, k]}

Compile-time Composition of Data and Iteration Reorderings (PLDI 03)
- Uniform representation using uninterpreted functions to represent run-time information
- Describes how to compose run-time reordering transformations
- Determining legality constraints for run-time reordering functions helps reduce overhead
- A possible optimization is to remap data only once
- Describes the space of possible transformations

Talk Outline
I. Overview
II. Introduction to run-time reordering transformations
III. Motivation for composing such transformations
IV. Framework for creating compositions
V. Conclusions and Future Work
Conclusions
- The framework gives the compiler the ability to manipulate run-time reordering transformations
- FST composed with other run-time reordering transformations can improve serial performance

Future Work
- Automatic generation of optimized inspectors
- Compile-time transformation guidance
- Incorporating domain-specific knowledge into transformation guidance
- Separation of algorithmic and performance interfaces through generative techniques
Lecture 1 An Overview of High-Performance Computer Architecture ECE 463/521 Fall 2002 Edward F. Gehringer Automobile Factory (note: non-animated version) Automobile Factory (note: non-animated version)
More informationAdministration. Prerequisites. Website. CSE 392/CS 378: High-performance Computing: Principles and Practice
CSE 392/CS 378: High-performance Computing: Principles and Practice Administration Professors: Keshav Pingali 4.126 ACES Email: pingali@cs.utexas.edu Jim Browne Email: browne@cs.utexas.edu Robert van de
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More informationAutomatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee.
Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee Outline Pre-intro: BLAS Motivation What is ATLAS Present release How ATLAS works
More informationCache Performance! ! Memory system and processor performance:! ! Improving memory hierarchy performance:! CPU time = IC x CPI x Clock time
Cache Performance!! Memory system and processor performance:! CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st = Pipeline time +
More informationCS 261 Fall Caching. Mike Lam, Professor. (get it??)
CS 261 Fall 2017 Mike Lam, Professor Caching (get it??) Topics Caching Cache policies and implementations Performance impact General strategies Caching A cache is a small, fast memory that acts as a buffer
More informationImproving Performance of Sparse Matrix-Vector Multiplication
Improving Performance of Sparse Matrix-Vector Multiplication Ali Pınar Michael T. Heath Department of Computer Science and Center of Simulation of Advanced Rockets University of Illinois at Urbana-Champaign
More informationAnnouncements. ! Previous lecture. Caches. Inf3 Computer Architecture
Announcements! Previous lecture Caches Inf3 Computer Architecture - 2016-2017 1 Recap: Memory Hierarchy Issues! Block size: smallest unit that is managed at each level E.g., 64B for cache lines, 4KB for
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationExecuting Optimized Irregular Applications Using Task Graphs Within Existing Parallel Models
Executing Optimized Irregular Applications Using Task Graphs Within Existing Parallel Models Christopher D. Krieger, Michelle Mills Strout, Jonathan Roelofs Colorado State University Fort Collins, CO 8526
More informationCSE Computer Architecture I Fall 2009 Homework 08 Pipelined Processors and Multi-core Programming Assigned: Due: Problem 1: (10 points)
CSE 30321 Computer Architecture I Fall 2009 Homework 08 Pipelined Processors and Multi-core Programming Assigned: November 17, 2009 Due: December 1, 2009 This assignment can be done in groups of 1, 2,
More informationGPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran. G. Ruetsch, M. Fatica, E. Phillips, N.
GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran G. Ruetsch, M. Fatica, E. Phillips, N. Juffa Outline WRF and RRTM Previous Work CUDA Fortran Features RRTM in CUDA
More informationEvaluating the Impact of RDMA on Storage I/O over InfiniBand
Evaluating the Impact of RDMA on Storage I/O over InfiniBand J Liu, DK Panda and M Banikazemi Computer and Information Science IBM T J Watson Research Center The Ohio State University Presentation Outline
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationSPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning
SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale Machine Learning Tarek Elgamal 2, Shangyu Luo 3, Matthias Boehm 1, Alexandre V. Evfimievski 1, Shirish Tatikonda 4, Berthold Reinwald
More informationMULTI-CORE PROGRAMMING. Dongrui She December 9, 2010 ASSIGNMENT
MULTI-CORE PROGRAMMING Dongrui She December 9, 2010 ASSIGNMENT Goal of the Assignment 1 The purpose of this assignment is to Have in-depth understanding of the architectures of real-world multi-core CPUs
More informationA Characterization of Shared Data Access Patterns in UPC Programs
IBM T.J. Watson Research Center A Characterization of Shared Data Access Patterns in UPC Programs Christopher Barton, Calin Cascaval, Jose Nelson Amaral LCPC `06 November 2, 2006 Outline Motivation Overview
More information