An Auto-Tuning Framework for Parallel Multicore Stencil Computations

1 Software Engineering Seminar, Sebastian Hafen. An Auto-Tuning Framework for Parallel Multicore Stencil Computations. Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams

2 Stencils

3 What is a Stencil Computation? Nearest-neighbor computations, e.g. finite differences between data points. Sweeps over a structured grid (like an n-dimensional array). Iterative: i → i+1 → i+2.
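The "iterative" point can be illustrated with a minimal Python sketch. It assumes a 1D grid and a simple two-neighbor average as the stencil; neither choice comes from the slides, and the names `sweep_1d` and `iterate` are hypothetical. Each sweep reads only the grid from iteration i and produces the grid for iteration i+1.

```python
def sweep_1d(u):
    """One sweep: each interior point becomes the average of its two neighbors."""
    new = u[:]  # boundary points are kept unchanged (an assumption of this sketch)
    for x in range(1, len(u) - 1):
        new[x] = (u[x - 1] + u[x + 1]) / 2
    return new


def iterate(u, steps):
    """Apply the sweep repeatedly; the output of iteration i feeds iteration i+1."""
    for _ in range(steps):
        u = sweep_1d(u)
    return u


state = [0.0, 0.0, 8.0, 0.0, 0.0]
after_one = iterate(state, 1)
after_two = iterate(state, 2)
print(after_one)
print(after_two)
```

Writing into a fresh list rather than updating in place keeps every read within one sweep on the old iteration's values.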

4 Example: 2D 5-Point Stencil

    // Stencil loop: visit every interior point of the grid
    do k = 2, xlength-1, 1
      do i = 2, ylength-1, 1
        writearray[k][i] = usestencil(k, i)
      enddo
    enddo

    // Stencil function: average of the point and its four nearest neighbors
    function usestencil(k, i)
      int result = readarray[k][i] + readarray[k+1][i] + readarray[k-1][i]
                 + readarray[k][i+1] + readarray[k][i-1]
      result = result/5
      return result
    endfunction

    Stencil points:
              (k-1,i)
    (k,i-1)   (k,i)   (k,i+1)
              (k+1,i)
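A runnable version of the 5-point pseudocode might look like the following Python sketch. It assumes 0-based indexing, floating-point division, and that border cells are simply copied; the slides do not specify the border handling, and the name `five_point_sweep` is hypothetical.

```python
def five_point_sweep(read_array):
    """Apply one sweep of a 2D 5-point averaging stencil to the interior points."""
    rows, cols = len(read_array), len(read_array[0])
    # Border cells lack a full neighborhood; this sketch copies them unchanged.
    write_array = [row[:] for row in read_array]
    for k in range(1, rows - 1):
        for i in range(1, cols - 1):
            write_array[k][i] = (
                read_array[k][i]
                + read_array[k + 1][i]
                + read_array[k - 1][i]
                + read_array[k][i + 1]
                + read_array[k][i - 1]
            ) / 5
    return write_array


grid = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
result = five_point_sweep(grid)
print(result[1][1])  # (5 + 8 + 2 + 6 + 4) / 5
```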

5 Example (readarray → writearray; operands lost in extraction are shown as …): (… + 1 + … + … + 8)/5 = 5

6 Example (readarray → writearray): (… + … + … + 7 + 7)/5 = 6

7 Example (readarray → writearray): (…)/5 = 7

8 Example from the paper: Gradient (picture from paper)

9 Why? Solving partial differential equations. Used by many branches of science: heat equations, wave equations, automatic beam path analysis of laser wakefield particle acceleration data, ...

10 Characteristics of stencil computations: High memory traffic. Low arithmetic intensity; the CPU can easily handle the arithmetic.

    // Stencil function
    function usestencil(k, i)
      int result = readarray[k][i] + readarray[k+1][i] + readarray[k-1][i]
                 + readarray[k][i+1] + readarray[k][i-1]
      result = result/5
      return result
    endfunction

Computations are memory bound. Auto-tuning for better memory access management.
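A back-of-the-envelope estimate shows why such a kernel tends to be memory bound. The Python sketch below counts 5 floating-point operations per point (4 additions, 1 division) and assumes 8-byte values. The machine figures `PEAK_GFLOPS` and `BANDWIDTH_GBS` are hypothetical and describe no specific system from the paper.

```python
FLOPS_PER_POINT = 5   # 4 additions + 1 division in the 5-point example
BYTES_PER_VALUE = 8   # assumption: double-precision values


def arithmetic_intensity(flops, values_read, values_written,
                         bytes_per_value=BYTES_PER_VALUE):
    """Floating-point operations per byte moved to or from main memory."""
    return flops / ((values_read + values_written) * bytes_per_value)


# Best case: neighbors are reused from cache, so one read and one write per point.
best_case = arithmetic_intensity(FLOPS_PER_POINT, 1, 1)
# Worst case: all five inputs come from main memory.
worst_case = arithmetic_intensity(FLOPS_PER_POINT, 5, 1)

# Hypothetical machine: flops it can execute per byte it can stream.
PEAK_GFLOPS = 50.0
BANDWIDTH_GBS = 20.0
machine_balance = PEAK_GFLOPS / BANDWIDTH_GBS

memory_bound = best_case < machine_balance
print(best_case, worst_case, machine_balance, memory_bound)
```

Under these assumptions even the best case stays well below the machine balance, so memory access rather than arithmetic limits performance.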

11 The Framework

12 Overview: Not the first auto-tuning framework for stencils, but earlier work targeted static, single-kernel instantiations. Proof of concept. Supports a broad range of stencil kernels. Fully generalized framework. Auto-parallelization. Multiple back-end architectures, even a GPU.

13 Framework flow (inspired by a picture in the paper): reference implementation → parse as AST → myriad of equivalent, optimized implementations → best-performing implementation and configuration parameters.
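The "parse as AST" step can be illustrated with Python's standard `ast` module as a stand-in; the framework's own parser and input language are not shown on the slide. The sketch assumes Python 3.9 or later (where `Subscript.slice` is the index expression itself). The helper names `index_offset` and `stencil_offsets` are hypothetical. It recovers the neighbor offsets of a 5-point stencil from its source expression.

```python
import ast


def index_offset(node):
    """Return the constant offset of an index such as k, k + 1 or i - 1."""
    if isinstance(node, ast.Name):
        return 0
    if (isinstance(node, ast.BinOp) and isinstance(node.left, ast.Name)
            and isinstance(node.right, ast.Constant)):
        if isinstance(node.op, ast.Add):
            return node.right.value
        if isinstance(node.op, ast.Sub):
            return -node.right.value
    raise ValueError("unsupported index expression")


def stencil_offsets(expression):
    """Collect the (dk, di) offsets of every 2D array access in the expression."""
    tree = ast.parse(expression, mode="eval")
    offsets = []
    for node in ast.walk(tree):
        # A 2D access a[..][..] is a Subscript whose value is itself a Subscript.
        if isinstance(node, ast.Subscript) and isinstance(node.value, ast.Subscript):
            inner = node.value
            offsets.append((index_offset(inner.slice), index_offset(node.slice)))
    return sorted(offsets)


source = "(a[k][i] + a[k+1][i] + a[k-1][i] + a[k][i+1] + a[k][i-1]) / 5"
offsets = stencil_offsets(source)
print(offsets)
```

Once the access pattern is available as data, later stages can reason about halo width or reorder loops without re-reading the source text.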

14 Strategy Engine: The parameter space is massive. Decides on an appropriate subset of parameter combinations (strategies). Combined serial and parallel optimizations. Based on the underlying architecture. Knows about correlations between different optimizations. Chooses only legal combinations.
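The idea of pruning a large parameter space down to legal strategies can be sketched as follows. The parameters (block extents, unroll factor), their candidate values, and the legality rules are all illustrative assumptions, not the rules used by the framework.

```python
from itertools import product

GRID = (256, 256)                # hypothetical 2D problem size
BLOCK_SIZES = [4, 16, 64, 512]   # candidate block extents per dimension
UNROLL_FACTORS = [1, 2, 4, 8]    # candidate inner-loop unroll factors


def is_legal(block_x, block_y, unroll):
    """Illustrative legality rules for one parameter combination."""
    fits = block_x <= GRID[0] and block_y <= GRID[1]
    divides = GRID[0] % block_x == 0 and GRID[1] % block_y == 0
    unroll_ok = block_y % unroll == 0  # unrolled loop must tile the block exactly
    return fits and divides and unroll_ok


all_combinations = list(product(BLOCK_SIZES, BLOCK_SIZES, UNROLL_FACTORS))
strategies = [combo for combo in all_combinations if is_legal(*combo)]
print(len(all_combinations), len(strategies))
```

Encoding the correlations between parameters in one predicate means illegal variants are never generated or timed.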

15 Transformation Engine: Transforms the AST. First applies auto-parallelization, then auto-tuning. Has domain knowledge. Can do transformations a compiler cannot.

16 Auto-parallelization: Essentially divides the problem space into blocks: core blocks, thread blocks and register blocks. Creates new loops for every block. Non-Uniform Memory Access (NUMA)-aware. Separate stencil for the border cases.
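The block decomposition can be sketched in Python with a thread pool. This illustrates only the loop structure (one set of loops per block), not the framework's generated code; thread blocks, register blocks and NUMA placement are left out. The names `core_blocks`, `sweep_block` and `parallel_sweep` are hypothetical, and Python threads will not show the speedup that native threads would.

```python
from concurrent.futures import ThreadPoolExecutor


def core_blocks(n_rows, n_cols, block_rows, block_cols):
    """Yield (row_start, row_end, col_start, col_end) blocks covering the interior."""
    for r0 in range(1, n_rows - 1, block_rows):
        for c0 in range(1, n_cols - 1, block_cols):
            yield (r0, min(r0 + block_rows, n_rows - 1),
                   c0, min(c0 + block_cols, n_cols - 1))


def sweep_block(src, dst, block):
    """Run the 5-point stencil over one block only."""
    r0, r1, c0, c1 = block
    for k in range(r0, r1):
        for i in range(c0, c1):
            dst[k][i] = (src[k][i] + src[k + 1][i] + src[k - 1][i]
                         + src[k][i + 1] + src[k][i - 1]) / 5


def parallel_sweep(src, block_rows=2, block_cols=2, workers=4):
    dst = [row[:] for row in src]
    blocks = list(core_blocks(len(src), len(src[0]), block_rows, block_cols))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Blocks write disjoint regions of dst and only read src, so no locking is needed.
        list(pool.map(lambda b: sweep_block(src, dst, b), blocks))
    return dst


grid = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
serial = [row[:] for row in grid]
sweep_block(grid, serial, (1, 5, 1, 5))  # one block spanning the whole interior
blocked = parallel_sweep(grid)
print(blocked == serial)
```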

17 Auto-parallelization (picture from paper)

18 Auto-tuning: Loop unrolling and register blocking improve innermost-loop efficiency. Cache blocking exposes temporal locality and increases cache reuse. Arithmetic simplifications. Many more are possible; it is a proof of concept. Example for cache blocking: (image not preserved).
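Loop unrolling, one of the listed optimizations, can be shown on a 1D 3-point stencil. The sketch below illustrates the shape of the transformation (two updates per iteration plus a cleanup loop) and checks that it matches the plain loop. In interpreted Python it brings no speedup, and the function names are hypothetical.

```python
def sweep_plain(src):
    """Reference loop: one stencil update per iteration."""
    dst = src[:]
    for x in range(1, len(src) - 1):
        dst[x] = src[x - 1] + src[x] + src[x + 1]
    return dst


def sweep_unrolled_by_2(src):
    """Same computation with the loop body unrolled by a factor of two."""
    dst = src[:]
    n = len(src)
    x = 1
    # Main loop: two stencil updates per iteration.
    while x + 1 < n - 1:
        dst[x] = src[x - 1] + src[x] + src[x + 1]
        dst[x + 1] = src[x] + src[x + 1] + src[x + 2]
        x += 2
    # Cleanup loop: handles a leftover point when the interior size is odd.
    while x < n - 1:
        dst[x] = src[x - 1] + src[x] + src[x + 1]
        x += 1
    return dst


data = [3, 1, 4, 1, 5, 9, 2]
print(sweep_unrolled_by_2(data) == sweep_plain(data))
```

The unroll factor is exactly the kind of parameter an auto-tuner varies, since the best value depends on the register count and pipeline of the target machine.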

19 Search Engine: Runs all the differently tuned versions of the stencil kernel. 256³ grids (16'777'216 elements) initialized with random values. The user can replace the original kernel with the fastest one.
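The search step amounts to timing every candidate on the same randomly initialized input and keeping the fastest. The sketch below assumes two hand-written variants of a 1D 3-point sweep and a much smaller grid than the one on the slide. The names `find_fastest`, `sweep_loops` and `sweep_comprehension` are hypothetical, and the winner depends on the machine.

```python
import random
import time


def sweep_loops(src):
    dst = src[:]
    for x in range(1, len(src) - 1):
        dst[x] = src[x - 1] + src[x] + src[x + 1]
    return dst


def sweep_comprehension(src):
    interior = [src[x - 1] + src[x] + src[x + 1] for x in range(1, len(src) - 1)]
    return [src[0]] + interior + [src[-1]]


def find_fastest(variants, grid, repeats=3):
    """Time each variant on the same grid and return the best name and all timings."""
    timings = {}
    for name, kernel in variants.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(grid)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return min(timings, key=timings.get), timings


random.seed(0)
grid = [random.random() for _ in range(4096)]  # random initial values
variants = {"loops": sweep_loops, "comprehension": sweep_comprehension}
fastest, timings = find_fastest(variants, grid)
print(fastest, timings)
```

Taking the minimum over several repeats reduces the influence of one-off disturbances such as cold caches.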

20 Limitations: Only 2D or 3D. Only arrays, no sophisticated data structures. Only arithmetic stencils. The authors want to change that in future work.

21 Code Generator: Creates code from the modified ASTs. For the CPUs: pthreads. For the GPU: CUDA thread blocks. Serial Fortran and C code are also possible.
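A toy code generator gives a feel for this stage. The Python sketch below emits a serial C kernel from a list of stencil offsets. It assumes offsets of magnitude at most 1 (so a one-cell border suffices) and an averaging stencil; the emitted signature, the array names `in` and `out`, and the function names are hypothetical rather than the framework's actual output.

```python
def index_text(var, offset):
    """Render an index such as 'k', 'k + 1' or 'i - 1'."""
    if offset == 0:
        return var
    sign = "+" if offset > 0 else "-"
    return f"{var} {sign} {abs(offset)}"


def generate_c_kernel(offsets, name="stencil_sweep"):
    """Return C source text for an averaging stencil over the given offsets."""
    terms = " + ".join(
        f"in[{index_text('k', dk)}][{index_text('i', di)}]" for dk, di in offsets
    )
    return "\n".join([
        f"void {name}(int n, int m, double in[n][m], double out[n][m]) {{",
        "    for (int k = 1; k < n - 1; k++)",
        "        for (int i = 1; i < m - 1; i++)",
        f"            out[k][i] = ({terms}) / {len(offsets)}.0;",
        "}",
    ])


five_point = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
c_source = generate_c_kernel(five_point)
print(c_source)
```

Because the kernel is produced from data, the same description could in principle feed different back ends, for example a threaded or a GPU variant.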

22 Tested Stencils and Architectures

23 Used Stencils: divergence stencil, gradient stencil, Laplacian stencil (picture from paper)

24 Used Architectures (picture from paper)

25 Results

26 One Result: Laplacian (pictures from paper)

27 Results (pictures from paper)

28 Conclusion. Pro: It does work; the concept is proven. Performance comparable to hand-optimized code. Programmer productivity benefits. Fully general. A few minutes to annotate code. Contra: OpenMP works well, too. A new architecture means new coding. Peak not yet reached.

29 End of Presentation
