Petalisp. A Common Lisp Library for Data Parallel Programming. Marco Heisig Chair for System Simulation FAU Erlangen-Nürnberg

Size: px

Start display at page:

Download "Petalisp. A Common Lisp Library for Data Parallel Programming. Marco Heisig Chair for System Simulation FAU Erlangen-Nürnberg"

Patrick Flynn
5 years ago
Views:

1 Petalisp A Common Lisp Library for Data Parallel Programming Marco Heisig Chair for System Simulation FAU Erlangen-Nürnberg

2 Petalisp The library Petalisp 1 is a new approach to data parallel computing. The Goal: Elegant High Performance Computing Programs that are beautiful and fast A programming model that is safe and productive Drawbacks: Limited to operations on structured data Significant run-time overhead 1 1

3 Table of contents 1. Using Petalisp 2. Implementation 3. Performance 4. Conclusions 2

4 Using Petalisp

5 Strided Arrays A strided array in n dimensions is a function from elements of the cartesian product of n ranges to a set of Common Lisp objects. A range with the lower bound x L, the step size s and the upper bound x U, with x L, s, x U Z, is the set of integers { x Z x L x x U ( k Z) [x = x L + ks] }. stride 1 stride 3 stride 2 Objects of type cl:simple-array are a special case of strided arrays. 3

6 API 1/5 First Class Index Spaces To introduce parallelism, Petalisp always operates on index spaces, not on individual array elements. Notation: (σ (START [STEP] END)...) (σ (0 5) (0 3)) (σ (1 2 3) (1 2 3) (1 2 3)) Implementation detail: Petalisp can compute the union, difference and intersection of arbitrary index spaces. 4

7 API 2/5 Transformations A transformation is an affine-linear mapping from indices to indices. Notation: (τ (INDEX...) (EXPRESSION...)) (τ (i j) (0 j (- i 4))) (τ (0 j k) ((+ k 4) j)) Implementation detail: Petalisp can compute the inverse and composition of arbitrary transformations. 5

8 API 3/5 Moving Data The -> operator allows to select, transform or broadcast data. (-> 0 (σ (0 9) (0 9))) ; a array of zeros (-> #(2 3) (σ (0 0))) ; the first element only (-> A (τ (i j) (j i))) ; transposing A Admittedly, -> is a mediocre function name. Better suggestions are most welcome! 6

9 API 4/5 Combining Data The fuse and fuse* operator combine multiple arrays into one. For fuse, the arguments must be non-overlapping. For fuse*, the value of the rightmost array takes precedence on overlap. (defvar B (-> #(2) (τ (i) ((1+ i))))) (fuse #(1) B) ; equivalent to (-> #(1 2)) (fuse #(1 3) B) ; an error! (fuse* #(1 3) B) ; equivalent to (-> #(1 2)) 7

10 API 5/5 Parallel Operations The final piece: Application of Common Lisp functions to strided arrays. The function α is basically cl:map for strided arrays. The function β is basically cl:reduce applied to the last dimension of a strided array. (α # + 2 3) ; adding two numbers (α # + A B C) ; adding three arrays element-wise (β # + #(2 3)) ; adding two numbers (defvar B #2A((1 2 3) (4 5 6))) (β # + B) ; summing the rows of B Remark: No guarantees are made about when and how often the functions passed to α and β are invoked. 8

11 API Summary All core functions at a glance: Index spaces, e.g. (σ (a b)) Transformations, e.g. (τ (x y) (y (- x))) Data motion, e.g. (-> A (σ (2 5))) Data combination, e.g. (fuse* A B C) Parallel map, e.g. (α # * A B) Parallel reduce, e.g. (β # + A) This API is purely functional and declarative. But how do we obtain values? 9

12 API 6/5 Triggering Evaluation Petalisp provides two functions to trigger evaluation. The compute function converts strided arrays into regular Common Lisp arrays. (compute (-> 0.0 (σ (0 1)))) => #( ) (defvar A #(1 2 3)) (compute (β # + A)) => 6 (compute (-> A (τ (i) ((- i))))) => #(3 2 1) Remark: There is also a schedule function for asynchronous evaluation. 10

13 Example: Matrix Multiplication The mathematical definition C ij = n A ip B pj p=1 The corresponding Petalisp code (β # + (α # * (-> A (τ (m n) (m 1 n))) (-> B (τ (n k) (1 k n))))) n k C A B m R E D U C T I O N 11

14 Implementation

15 Lazy Arrays are Data Flow Graphs (jacobi u 1) => #<strided-array-fusion t (σ (0 9) (0 9))> Internal representation: (σ (0 9) (0 9)) (σ) #0A0.25 (τ (i j) ((- i 1) j)) (σ (1 8) (1 8)) (τ (i j) ((1+ i) j)) (σ (1 8) (1 8)) (τ (i j) (i (- j 1))) (σ (1 8) (1 8)) (τ (i j) (i (1+ j))) (σ (1 8) (1 8)) (σ (0 9) (0 9)) (τ (i j) ()) (σ (1 8) (1 8)) + (σ (1 8) (1 8)) (τ (i j) (i j)) (σ (0 9 9) (0 9)) (τ (i j) (i j)) (σ (1 8) (0 9 9)) * (σ (1 8) (1 8)) fusion (σ (0 1 9) (0 1 9)) 12

16 From Arrays to Executable Code How do we execute a graph? fusion compute 13

17 From Arrays to Executable Code How do we execute a graph? 1. Determine critical nodes fusion compute 13

18 From Arrays to Executable Code How do we execute a graph? 1. Determine critical nodes fusion compute 13

19 From Arrays to Executable Code How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees fusion compute 13

20 From Arrays to Executable Code How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees fusion compute 13

21 From Arrays to Executable Code How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees 3. Lift s fusion compute 13

22 From Arrays to Executable Code How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees 3. Lift s fusion compute 13

23 From Arrays to Executable Code How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees 3. Lift s 4. Eliminate fusions fusion compute 13

24 From Arrays to Executable Code How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees 3. Lift s 4. Eliminate fusions compute 13

25 From Arrays to Executable Code How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees 3. Lift s 4. Eliminate fusions 5. Construct kernels compute 13

26 From Arrays to Executable Code a0 How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees 3. Lift s 4. Eliminate fusions 5. Construct kernels [f0 [f1 [a0]] [f2 [a0]] a2 a1 [f0 [a2] [a2]] [f0 [a2] [a1]] a3 13

27 From Arrays to Executable Code a0 How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees 3. Lift s 4. Eliminate fusions 5. Construct kernels 6. Schedule & allocate [f0 [a2] [a2]] [f0 [f1 [a0]] [f2 [a0]] a2 a1 [f0 [a2] [a1]] a3 13

28 From Arrays to Executable Code a0 a1 How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees [f0 [f1 [a0]] [f2 [a0]] a2 3. Lift s 4. Eliminate fusions time [f0 [a2] [a1]] a3 5. Construct kernels 6. Schedule & allocate [f0 [a2] [a2]] 13

29 From Arrays to Executable Code a0 a1 How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees [f0 [f1 [a0]] [f2 [a0]] a2 3. Lift s 4. Eliminate fusions time [f0 [a2] [a1]] a3 5. Construct kernels 6. Schedule & allocate [f0 [a2] [a2]] 7. Compile & execute 13

30 From Arrays to Executable Code How do we execute a graph? 1. Determine critical nodes 2. Determine subtrees 3. Lift s 4. Eliminate fusions 5. Construct kernels 6. Schedule & allocate 7. Compile & execute 8. Done! fusion compute 13

31 Performance

32 On Performance The long-term goal of Petalisp is to provide a programming model for Petascale (10 15 operations per second) systems. The constant, high overhead of analysis and JIT-compilation seems to be at odds with this goal. However: Do not underestimate the power of memoization, hash-consing and CLOS wizardry. Scheduling can often be done asynchronously. Petalisp s analysis is independent of the problem size. 14

33 Jacobi s method: Python vs. C++ vs. Petalisp Hardware: Intel Xeon E CPU 3.6GHz 15

34 Conclusions

35 Conclusions Main Result: Our compilation strategy is feasible, with just about microseconds overhead when calling compute. Benefits: Clean separation between notation and execution. Unprecedented potential for optimization. Already faster than NumPy.... all in just about 5000 lines of maintainable code. 16

36 Future Work My preliminary roadmap for the next years: More s (simulations, image processing, machine learning) API finalization Sophisticated Scheduling Better Shared-Memory Parallelization Auto-Vectorization Distributed Parallelization... Make this a PhD thesis 17

37 Thank you! Questions or remarks? 18

NumbaPro CUDA Python. Square matrix multiplication

NumbaPro CUDA Python. Square matrix multiplication NumbaPro Enables parallel programming in Python Support various entry points: Low-level (CUDA-C like) programming language High-level array oriented interface CUDA library bindings Also support multicore