Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules

Size: px

Start display at page:

Download "Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules"

Bridget Morgan
5 years ago
Views:

1 Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules Toomas Remmelg Michel Steuwer Christophe Dubach The 4th South of England Regional Programming Language Seminar 27th September 2016

2 Heterogeneity Everywhere GPU Mobile devices Game consoles FPGA PCs CPU/GPU Data centres Supercomputers

3 Top 500 with parallel accelerators

4 Top 500 with parallel accelerators increasing

5 Top 500 with parallel accelerators new ones appearing regularly

6 Top 500 with parallel accelerators Difficult to program Difficult to achieve high performance Moving target

7 But also domain-specific accelerators E.g. Google neural network accelerators TPU (Tensor Processor Units) E.g. Movidius vision accelerators VPU (Vision Processing Units)

8 Current situation Applications Device-specific interface Parallel Hardware OpenCL GPUs OpenMP Multicore CPUs VHDL C library... FPGAs NN Accelerator... Programmers have to be expert in high performance computing!

9 + Performance is not portable! hand-written implementations for each device parametric auto-tuner

10 + Performance is not portable even on same device class Matrix-matrix multiplication auto-tuner

11 + Performance is not portable even on same device class Matrix-matrix multiplication auto-tuner best implementation

12 + Performance is not portable even on same device class Matrix-matrix multiplication auto-tuner best implementation

13 + Performance is not portable even on same device class Matrix-matrix multiplication auto-tuner best implementation Auto-tuning alone fails to achieve portable performance

14 Need for high-level programming + Code generation

15 High-Level programming approaches for heterogeneous devices StreamIt (MIT) Dataflow programming, architecture Independent Multiple backend (tile architecture, C, FPGA, GPU) LiquidMetal (IBM) Java + dataflow programming + (map / reduce) C, OpenCL and FPGA backend Green-Marl (Stanford / Oracle) DSL for graph analysis operations described on graph's node or edgesl Halide (MIT) DSL for imagine processing, functional in nature GPU code generator TensorFlow (Google) DSL library for AI applications data flow model

16 Data flow & functional approaches: Hardware complexity is hidden away code portability Empowers the compiler no need for complicated analysis explicit data movement, no global state can be mapped on various hardware scheduling is easier

17 General purpose framework for data parallelism Delite (Stanford + EPFL) framework for developing high-performance DSLs provide built-in parallel primitives Multicore + GPU code generation Lift (Edinburgh) intermediate data-parallel language compiler optimisations expressed as rewrite rules Multicore + GPU code generation + more to come (MPI, OpenMP, VHDL)

18 What we have Applications Device-specific interface Parallel Hardware OpenCL GPUs OpenMP Multicore CPUs VHDL C library... FPGAs NN Accelerator...

19 What we need Applications OpenCL GPUs OpenMP Multicore CPUs VHDL C library... FPGAs NN Accelerator...

20 What we need Applications Domain Specific Languages (DSLs) Image Processing Neural Networks Graph Analytics Language for Parallelism Data-Parallel Intermediate Language Compiler Technology Performance Portable Code Generator OpenCL GPUs OpenMP Multicore CPUs... VHDL C library... FPGAs NN Accelerator...

21 Different starting point Applications Delite (top-down) Image Processing Neural Networks Graph Analytics... Data-Parallel Intermediate Language Performance Portable Code Generator Lift (bottom-up) OpenCL GPUs OpenMP Multicore CPUs VHDL C library... FPGAs NN Accelerator...

22 What this talk is about Applications Image Processing Neural Networks Graph Analytics... Lift Data-Parallel Intermediate Language Lift rewrite-bases system + hardware specific language OpenCL GPUs OpenMP Multicore CPUs VHDL C library... FPGAs NN Accelerator...

23 Rest of the talk Part I: Lift data parallel language Part II: Optimisations as rewrite rules Part III: Code generation details

24 Part I Lift: a Functional Data Parallel Language

25 Lift Intermediate Language map(f) : zip: reduce(0,+): split(n): join: iterate(f, n): reorder(σ):

26 Matrix Multiplication in Lift

27 Lift high-level matrix multiplication Naive OpenCL version

28 Lift high-level matrix multiplication Naive OpenCL version? How to achieve high performance?

29 ARM Mali optimised version Lift high-level matrix multiplication Naive OpenCL version? How to achieve high performance?

30 Part II Encoding optimisations choices as Rewrite Rules

31 Algorithmic Rewrite Rules Provably correct rewrite rules Express algorithmic and optimisations choices Split-join rule: Map fusion rule: Reduce rules:...

34 map

35 map split (m) map join map

36 map split (m) map join map /m for for (int l = 0 ; l < m ; l++ ) {

37 Map interchanged

38 Split-join rule

39 Map interchanged

40 A few rewrite steps later...

41 Tiled version

42 Low-Level OpenCL-specific Extensions

43 Lift OpenCL-Specific Primitives map-global(f) map-workgroup(f) OpenCL thread hierarchy map-local(f) map-sequential(f) Sequential implementations reduce-sequential(z,f) tolocal(f) Memory address spaces toglobal(f) vect-n(f) asscalar asvector-n Vectorisation

44 OpenCL-specific rewrites (examples) OpenCL thread hierarchy: OpenCL memory hierarchy: OpenCL vector types and operations:

45 Let's start applying OpenCL-specific rules

46 Vectorisation

47 Mapping parallelism to global threads

48 Accumulating in private memory

49 Result: high performance

50 Now just need to search the space ,000 runs later...

51 Performance Portability Achieved Compiler input: Search result:

52 Desktop GPUs

53 Easily Extensible New rules can be added E.g. dot built-in reduceseq(z, add4) o mapseq(mult4) o zip(x,y) => dot(x, y) Mali GPU

54 Easily Extensible New rules can be added E.g. dot built-in reduceseq(z, add4) o mapseq(mult4) o zip(x,y) => dot(x, y) Mali GPU

55 Part III Efficient Code Generation

56 Dot-product example

57 Dot-product example after rewrite:

58 Dot-product example after rewrite: copy to local memory

59 Dot-product example after rewrite: iterative reduction in local memory

60 Dot-product example after rewrite: write back to global mem.

61 How to generate efficient OpenCL code?

62 Pattern based code generator map-global (f,input) for (uint i=get_global_id; i<n; i+= get_global_size) { output[i] = f(input[i]); parallel sum } reduce-sequential (f,z,input) T acc = z; for (uint i=0; i<n; i++) { acc = f(acc, input[i]); }... map-sequential (f,input) for (uint i=0; i<n; i++) { output[i] = f(input[i]); } parallel sum

63 What about split, zip, join,? Need to avoid temporary results split, zip, merely just change data layout lazy evaluation

64 dot-product (a tiny bit of it): desired OpenCL output: int int... acc for } wg_id = get_group_id(0) ; l_id = get_local_id(0) ; = 0.0f; (int i = 0 ; i < 2 ; I += 1) { acc = acc + x [ 2 * l_id * wg_id + I ] * y [ 2 * l_id * wg_id + I ] ;

65 Construct a representation of the effects of data layout functions: Views Consumes the views to generate correct array indices

66 View Construction

67 View Construction

68 View Construction

69 View Construction

70 View Construction

71 View Construction

72 View Construction

73 View Construction

74 View Consumption

75 View Consumption

76 View Consumption

77 View Consumption

78 View Consumption

79 View Consumption

80 View Consumption

81 View Consumption

82 View Consumption

83 dot-product (a tiny bit of it): produced OpenCL output: int int... acc for } wg_id = get_group_id(0) ; l_id = get_local_id(0) ; = 0.0f; (int i = 0 ; i < 2 ; I += 1) { acc = acc + x [ 2 * l_id * wg_id + I ] * y [ 2 * l_id * wg_id + I ] ;

84 Array indices not so simple

85 Array indices not so simple Can use well-known algebraic rules

86 Array indices not so simple Can use well-known algebraic rules and simplify array index expression

87 Putting it all together rewriting code generation

88 Code Generator vs Human no simplification arithmetic simplification AMD Nvidia

The Lift Approach: Summary Hardware-agnostic data-parallel functional intermediate language Low-level functional language for each hardware type hides hardware complexity, can be targeted by DSLs

89 The Lift Approach: Summary Hardware-agnostic data-parallel functional intermediate language Low-level functional language for each hardware type hides hardware complexity, can be targeted by DSLs OpenCL language (this talk), in the future: MPI, OpenMP Rewrite rules based optimisation maps the hardware-agnostic language to the hardware-specific language extensible Possible to achieve high performance through exploration opportunity for smarter technique (e.g. predictive modelling) if you want to know more: supported by:

Generating Performance Portable Code using Rewrite Rules

Generating Performance Portable Code using Rewrite Rules From High-Level Functional Expressions to High-Performance OpenCL Code Michel Steuwer Christian Fensch Sam Lindley Christophe Dubach The Problem(s)