Scalarization of Index Vectors in Compiled APL
Robert Bernecky
Snake Island Research Inc
18 Fifth Street, Ward's Island
Toronto, Canada
bernecky@snakeisland.com
September 30, 2011
Abstract

High-performance compilation of array languages offers several unique challenges to the compiler writer, including fusion of loops over large arrays, detection and elimination of scalars masquerading as arbitrary arrays, and eliminating or minimizing the run-time creation of index vectors. We introduce one of those challenges in the context of SAC, a functional array language, and give preliminary results on the performance of a compiler that eliminates index vectors by scalarizing them within the optimization cycle.
The Question

How much faster is compiled APL than interpreted APL? The answer is NOT a scalar.
Environment

Dyalog APL 13.0 vs. APEX/SAC. The current SAC compiler:
- a functional array language
- data-parallel nested loops: the With-Loop
- array-based optimizations
- functional loops and conditionals as functions

Goal: compiled APL performance competitive with hand-coded C.
Some Reasons Why APL is Slow

- Fixed per-primitive overheads: syntax analysis, conformability checks, function dispatch, memory management
- Variable per-primitive overheads
- Index vector materialization

[Chart: APL vs. APEX CPU time performance, shown as speedup (APL/APEX with AWLF) across about 40 benchmarks; higher is better for APEX. APL: Dyalog APL 13.0; SAC: 17,654:MODIFIED.]
Why APL is Slow: Fixed Per-Primitive Overheads

[Chart: APL primitive overhead, time per element (microseconds) for intvec+intvec, plotted against the number of elements in the array.]

Who suffers? Apps dominated by operations on scalars: CRC, loopy histograms, dynamic programming, RNG.
Why APL is Slow: Fixed Per-Primitive Overheads

[Chart: APL vs. APEX CPU time speedup (higher is better for APEX) for crcaks, histlpaks, lltopaks, loopisaks, scsaks, testforaks. APL: Dyalog APL 13.0; SAC: 17,654:MODIFIED.]
Why APL is Slow: Fixed Per-Primitive Overheads

Scalar-dominated apps have good serial speedup... but poor parallel speedup.

[Chart: APEX/SAC parallel performance, real time, 1-6 threads (mt1-mt6) on a 6-core AMD Phenom II X6 1075T, for crc, histlp, lltop, loopis, scs, testfor. SAC: 17,654:MODIFIED.]
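The fixed-overhead effect can be sketched with a toy cost model (the constants here are invented for illustration, not measured): time per element is dominated by the fixed per-primitive cost for small arrays, and amortizes toward the raw per-element cost for large ones.

```python
def time_per_element(n, fixed_us=50.0, per_elem_us=0.01):
    """Toy cost model for one interpreted primitive (e.g. intvec+intvec):
    a fixed per-primitive overhead plus a per-element cost, reported as
    microseconds per element. The constants are invented for illustration."""
    return (fixed_us + n * per_elem_us) / n

# A 1-element array pays the whole fixed cost per element; a
# million-element array amortizes it to nearly the per-element cost.
assert time_per_element(1) > 1000 * time_per_element(1_000_000)
```

This is why scalar-dominated apps (CRC, loopy histograms) suffer most: they live on the left side of the curve, where the fixed cost is never amortized.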
Why is APL Slow? Variable Per-Primitive Overheads

- Naive execution: limited function composition, e.g. sum( iota( N))
- Array-valued intermediate results: memory madness

Who suffers? Apps dominated by operations on large arrays: signal processing, convolution, normal move-out.

[Chart: APL vs. APEX CPU time speedup (higher is better for APEX) for iotanaks, ipapeaks, ipbdaks, ipopneaks, logd3aks, logd4aks, upgradecharaks. APL: Dyalog APL 13.0; SAC: 17,654:MODIFIED.]
Why is APL Slow? Variable Per-Primitive Overheads

[Chart: APEX/SAC parallel performance, real time, 1-6 threads (mt1-mt6) on a 6-core AMD Phenom II X6 1075T, for iotan, ipape, ipbb, ipbd, ipopne, logd3, logd4, upgradechar. SAC: 17,654:MODIFIED.]
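The composition problem can be illustrated with sum( iota( N)): naive execution materializes the iota( N) intermediate before reducing it, while a fused loop never builds it. A Python sketch of the idea (not SAC code):

```python
def sum_iota_naive(n):
    # Naive: materialize the intermediate array produced by iota(n)...
    intermediate = list(range(n))
    # ...then reduce it. The resulting memory traffic is the
    # "memory madness" of array-valued intermediate results.
    return sum(intermediate)

def sum_iota_fused(n):
    # Fused: producer and reduction share one loop; no intermediate array.
    total = 0
    for i in range(n):
        total += i
    return total

assert sum_iota_naive(1000) == sum_iota_fused(1000) == 499500
```

The fused form is what With-Loop folding produces: one traversal, constant extra memory, regardless of N.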
Why is APL Slow? Materialized Index Vectors

Mike Jenkins' matrix divide model contains:

    a[;i,pi] = a[;pi,i]

[i,pi] and [pi,i] are materialized index vectors. A few simple changes scalarize them:

    tmp = a[;i]
    a[;i] = a[;pi]
    a[;pi] = tmp

The matrix divide model now runs twice as fast!
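The two forms of the column swap can be sketched in Python over a list-of-rows matrix (my own illustrative helpers, not the model's actual code): the first mirrors the indexed assign through materialized index vectors, the second mirrors the three-line scalarized rewrite.

```python
def swap_columns_materialized(a, i, pi):
    # Mirrors a[;i,pi] = a[;pi,i]: builds both index vectors explicitly.
    src, dst = [pi, i], [i, pi]
    for row in a:
        vals = [row[j] for j in src]   # gather via materialized [pi, i]
        for j, v in zip(dst, vals):    # scatter via materialized [i, pi]
            row[j] = v

def swap_columns_scalarized(a, i, pi):
    # Mirrors: tmp = a[;i]; a[;i] = a[;pi]; a[;pi] = tmp
    # No index vector is ever built.
    for row in a:
        row[i], row[pi] = row[pi], row[i]

m1 = [[1, 2, 3], [4, 5, 6]]
m2 = [[1, 2, 3], [4, 5, 6]]
swap_columns_materialized(m1, 0, 2)
swap_columns_scalarized(m2, 0, 2)
assert m1 == m2 == [[3, 2, 1], [6, 5, 4]]
```

Both produce the same result; the scalarized form simply skips all the temp-vector bookkeeping described next.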
Why Materialized Index Vectors are Expensive

Materialization of [i,pi] and [pi,i] requires all of the following steps (* marks the indexing proper; the rest is materialization overhead):

- *Increment reference counts on i and pi
- Allocate 2-element temp vector
- Initialize temp vector descriptor
- Initialize temp vector elements
- *Perform indexing
- Deallocate 2-element temp vector
- *Decrement reference counts on i and pi
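The step list can be made concrete with a toy interpreter model (the bookkeeping here is entirely hypothetical, invented to mirror the list above): the materialized path logs every step it performs, while the scalarized path skips all of the temp-vector traffic.

```python
def index_materialized(a, i, pi, log):
    # Models a[;pi,i]-style selection through a materialized index vector.
    log.append("incref i, pi")         # * part of the indexing proper
    temp = [pi, i]                     # allocate 2-element temp vector
    log.append("alloc temp vector")
    log.append("init descriptor")
    log.append("init elements")
    result = [a[j] for j in temp]
    log.append("perform indexing")     # *
    del temp
    log.append("dealloc temp vector")
    log.append("decref i, pi")         # *
    return result

def index_scalarized(a, i, pi, log):
    # Scalarized form: two direct accesses, no temp vector at all.
    log.append("perform indexing")     # *
    return [a[pi], a[i]]

mat_log, sca_log = [], []
r1 = index_materialized([10, 20, 30], 0, 2, mat_log)
r2 = index_scalarized([10, 20, 30], 0, 2, sca_log)
assert r1 == r2 == [30, 10]
assert len(mat_log) == 7 and len(sca_log) == 1
```

Seven steps per access versus one; in an inner loop, that ratio is the whole story.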
Why Materialized Index Vectors are Expensive

Who suffers?
- Apps using explicit array indexing, e.g., many apps dominated by indexed assign
- Matrix divide, compress, deal, dynamic programming
- Inner products that use the CDC STAR-100 algorithm

The 800x800 ipplusandakd benchmark: CPU time of 45 minutes!
Eliminating Materialized Index Vectors With IVE

IFL2006 paper: Index Vector Elimination (IVE), by Bernecky, Grelck, Herhut, Scholz, Trojahner, and Shafarenko. A post-optimization transformation.

Start with:

    IV = [ i, j, k]
    z = sel( IV, M)

IVE replaces:

    z = M[ IV]

by:

    IV = [ i, j, k]
    offset = vect2offset( shape(M), IV)
    z = idxsel( offset, M)

Implications:
- IV is a materialized index vector!
- The offset calculation may be liftable (LIR).
Eliminating Materialized Index Vectors With IVE

IVE: if IV can be represented as scalars, eliminate vect2offset.

Before:

    IV = [ i, j, k]
    offset = vect2offset( shape(M), IV)
    z = idxsel( offset, M)

After:

    IV = [ i, j, k]
    offset = idxs2offset( shape(M), i, j, k)
    z = idxsel( offset, M)

IV is now dead code, and can be eliminated!
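vect2offset and idxs2offset compute the same row-major offset; the only difference is whether the indices arrive packaged in a vector or as scalars. A sketch of the equivalence (my reconstruction of the semantics, not SAC's actual implementation):

```python
def vect2offset(shape, iv):
    # Row-major offset of index vector iv into an array of shape `shape`:
    # offset = ((i * shape[1]) + j) * shape[2] + k, generalized by Horner's rule.
    offset = 0
    for extent, index in zip(shape, iv):
        offset = offset * extent + index
    return offset

def idxs2offset(shape, *indices):
    # Identical computation over scalar indices: no index vector is built,
    # so the vector argument (and its allocation) disappears.
    offset = 0
    for extent, index in zip(shape, indices):
        offset = offset * extent + index
    return offset

shape = (4, 5, 6)
# (2*5 + 3)*6 + 4 = 82
assert vect2offset(shape, [2, 3, 4]) == idxs2offset(shape, 2, 3, 4) == 82
```

Once the offset is computed from scalars, the [ i, j, k] vector has no remaining consumers, which is exactly why dead-code elimination can remove it.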
Eliminating Materialized Index Vectors With IVE

IVE: if IV is NOT built from scalars, scalarize it.

Before:

    offset = vect2offset( shape(M), IV)
    z = idxsel( offset, M)

After:

    I = IV[0]
    J = IV[1]
    K = IV[2]
    JV = [ I, J, K]
    offset = vect2offset( shape(M), JV)

IV is now dead code, and can be eliminated! The earlier substitution by idxs2offset is now feasible.

Unfortunately...
Dueling Optimizations: IVE vs. LIR

- IVE introduces JV = [ I, J, K]
- Loop-Invariant Removal (LIR) lifts I, J, K, and JV out of the function
- Constant Folding (CF) replaces JV = [ I, J, K] by JV = IV
- Common-Subexpression Elimination (CSE) replaces JV by IV

This is where we came in! So, what can we do? A kludged LIR to deal with this was deemed tasteless.
Biting the Bullet

Another approach: move IVE into the optimization cycle.

- Having IVE in the optimization cycle enables other optimizations!
- But it also breaks many optimizations.
- To prevent dueling, we must scalarize ALL index vector operations!
  - e.g., unroll IV+1, and the guard and extrema functions
  - e.g., extend existing optimizations, such as CF

The Good News: I had already scalarized many index vectors for Algebraic With-Loop Folding!
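Unrolling an index-vector operation such as IV+1 means replacing the vector add with per-component scalar adds, so no vector result is ever materialized for LIR and CF to fight over. A Python sketch of the transformation, assuming a rank-3 index:

```python
def iv_plus_one_vector(iv):
    # Before: a genuine vector operation that materializes its result.
    return [x + 1 for x in iv]

def iv_plus_one_unrolled(i, j, k):
    # After: three scalar adds. The optimizer now sees only scalars,
    # so there is no vector temp for IVE, CF, and LIR to duel over.
    return i + 1, j + 1, k + 1

assert tuple(iv_plus_one_vector([2, 3, 4])) == iv_plus_one_unrolled(2, 3, 4)
```

The same unrolling has to be applied uniformly (guards, extrema, and friends), or any surviving vector op re-creates the materialization the cycle just removed.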
Current Status

Moving IVE into the optimization cycle worked, sort of:

- The Good News: ipplusandakd CPU time is now 8 seconds, instead of 45 minutes
- The Bad News: new primitives in the AST broke some optimizations!
- More Bad News: I am still fixing them!
- The Good News: performance is improving daily
- More Good News: many opportunities exist for further optimization
- Even More Good News: more parallelism is coming
How Fast Is Compiled APL?

- Array sizes affect performance
- Iteration counts affect performance
- Indexed assigns affect performance

Characterize your application, then we can provide an answer.
More informationSAC A Functional Array Language for Efficient Multi-threaded Execution
International Journal of Parallel Programming, Vol. 34, No. 4, August 2006 ( 2006) DOI: 10.1007/s10766-006-0018-x SAC A Functional Array Language for Efficient Multi-threaded Execution Clemens Grelck 1,3
More informationBuilding a Runnable Program and Code Improvement. Dario Marasco, Greg Klepic, Tess DiStefano
Building a Runnable Program and Code Improvement Dario Marasco, Greg Klepic, Tess DiStefano Building a Runnable Program Review Front end code Source code analysis Syntax tree Back end code Target code
More informationCSE341: Programming Languages Lecture 17 Implementing Languages Including Closures. Dan Grossman Autumn 2018
CSE341: Programming Languages Lecture 17 Implementing Languages Including Closures Dan Grossman Autumn 2018 Typical workflow concrete syntax (string) "(fn x => x + x) 4" Parsing Possible errors / warnings
More informationIntroduction to Compilers
Introduction to Compilers Compilers are language translators input: program in one language output: equivalent program in another language Introduction to Compilers Two types Compilers offline Data Program
More informationIR Optimization. May 15th, Tuesday, May 14, 13
IR Optimization May 15th, 2013 Tuesday, May 14, 13 But before we talk about IR optimization... Wrapping up IR generation Tuesday, May 14, 13 Three-Address Code Or TAC The IR that you will be using for
More informationCSE 504: Compiler Design. Intermediate Representations Symbol Table
Intermediate Representations Symbol Table Pradipta De pradipta.de@sunykorea.ac.kr Current Topic Intermediate Representations Graphical IRs Linear IRs Symbol Table Information in a Program Compiler manages
More informationCOMP Parallel Computing. SMM (2) OpenMP Programming Model
COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel
More informationCS 360 Programming Languages Interpreters
CS 360 Programming Languages Interpreters Implementing PLs Most of the course is learning fundamental concepts for using and understanding PLs. Syntax vs. semantics vs. idioms. Powerful constructs like
More informationThe View from 35,000 Feet
The View from 35,000 Feet This lecture is taken directly from the Engineering a Compiler web site with only minor adaptations for EECS 6083 at University of Cincinnati Copyright 2003, Keith D. Cooper,
More informationLecture Notes on Cache Iteration & Data Dependencies
Lecture Notes on Cache Iteration & Data Dependencies 15-411: Compiler Design André Platzer Lecture 23 1 Introduction Cache optimization can have a huge impact on program execution speed. It can accelerate
More informationComputer Architecture. R. Poss
Computer Architecture R. Poss 1 lecture-7-24 september 2015 Virtual Memory cf. Henessy & Patterson, App. C4 2 lecture-7-24 september 2015 Virtual Memory It is easier for the programmer to have a large
More informationA Near Memory Processor for Vector, Streaming and Bit Manipulations Workloads
A Near Memory Processor for Vector, Streaming and Bit Manipulations Workloads Mingliang Wei, Marc Snir, Josep Torrellas (UIUC) Brett Tremaine (IBM) Work supported by HPCS/PERCS Motivation Many important
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationEITF20: Computer Architecture Part2.1.1: Instruction Set Architecture
EITF20: Computer Architecture Part2.1.1: Instruction Set Architecture Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Instruction Set Principles The Role of Compilers MIPS 2 Main Content Computer
More informationCompilers. Lecture 2 Overview. (original slides by Sam
Compilers Lecture 2 Overview Yannis Smaragdakis, U. Athens Yannis Smaragdakis, U. Athens (original slides by Sam Guyer@Tufts) Last time The compilation problem Source language High-level abstractions Easy
More informationBentley Rules for Optimizing Work
6.172 Performance Engineering of Software Systems SPEED LIMIT PER ORDER OF 6.172 LECTURE 2 Bentley Rules for Optimizing Work Charles E. Leiserson September 11, 2012 2012 Charles E. Leiserson and I-Ting
More informationPetalisp. A Common Lisp Library for Data Parallel Programming. Marco Heisig Chair for System Simulation FAU Erlangen-Nürnberg
Petalisp A Common Lisp Library for Data Parallel Programming Marco Heisig 16.04.2018 Chair for System Simulation FAU Erlangen-Nürnberg Petalisp The library Petalisp 1 is a new approach to data parallel
More informationRunning class Timing on Java HotSpot VM, 1
Compiler construction 2009 Lecture 3. A first look at optimization: Peephole optimization. A simple example A Java class public class A { public static int f (int x) { int r = 3; int s = r + 5; return
More informationOpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system
OpenMP A parallel language standard that support both data and functional Parallelism on a shared memory system Use by system programmers more than application programmers Considered a low level primitives
More informationOverview of a Compiler
Overview of a Compiler Copyright 2015, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California have explicit permission to make copies of
More informationA Short History of Array Computing in Python. Wolf Vollprecht, PyParis 2018
A Short History of Array Computing in Python Wolf Vollprecht, PyParis 2018 TOC - - Array computing in general History up to NumPy Libraries after NumPy - Pure Python libraries - JIT / AOT compilers - Deep
More informationPerformance. frontend. iratrecon - rational reconstruction. sprem - sparse pseudo division
Performance frontend The frontend command is used extensively by Maple to map expressions to the domain of rational functions. It was rewritten for Maple 2017 to reduce time and memory usage. The typical
More informationWith-Loop Scalarization Merging Nested Array Operations
With-Loop Scalarization Merging Nested Array Operations Clemens Grelck 1, Sven-Bodo Scholz 2, and Kai Trojahner 1 1 University of Lübeck Institute of Software Technology and Programming Languages {grelck,trojahne}@ispuni-luebeckde
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationOptimization Prof. James L. Frankel Harvard University
Optimization Prof. James L. Frankel Harvard University Version of 4:24 PM 1-May-2018 Copyright 2018, 2016, 2015 James L. Frankel. All rights reserved. Reasons to Optimize Reduce execution time Reduce memory
More informationCS 2461: Computer Architecture 1
Next.. : Computer Architecture 1 Performance Optimization CODE OPTIMIZATION Code optimization for performance A quick look at some techniques that can improve the performance of your code Rewrite code
More informationCS122 Lecture 4 Winter Term,
CS122 Lecture 4 Winter Term, 2014-2015 2 SQL Query Transla.on Last time, introduced query evaluation pipeline SQL query SQL parser abstract syntax tree SQL translator relational algebra plan query plan
More informationChapter 7 The Potential of Special-Purpose Hardware
Chapter 7 The Potential of Special-Purpose Hardware The preceding chapters have described various implementation methods and performance data for TIGRE. This chapter uses those data points to propose architecture
More informationStatic Single Assignment Form in the COINS Compiler Infrastructure - Current Status and Background -
Static Single Assignment Form in the COINS Compiler Infrastructure - Current Status and Background - Masataka Sassa, Toshiharu Nakaya, Masaki Kohama (Tokyo Institute of Technology) Takeaki Fukuoka, Masahito
More informationEECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141
EECS 151/251A Fall 2017 Digital Design and Integrated Circuits Instructor: John Wawrzynek and Nicholas Weaver Lecture 14 EE141 Outline Parallelism EE141 2 Parallelism Parallelism is the act of doing more
More informationECE 498AL. Lecture 12: Application Lessons. When the tires hit the road
ECE 498AL Lecture 12: Application Lessons When the tires hit the road 1 Objective Putting the CUDA performance knowledge to work Plausible strategies may or may not lead to performance enhancement Different
More informationAdvances in Memory Management and Symbol Lookup in pqr
Advances in Memory Management and Symbol Lookup in pqr Radford M. Neal, University of Toronto Dept. of Statistical Sciences and Dept. of Computer Science http://www.cs.utoronto.ca/ radford http://radfordneal.wordpress.com
More informationDomain Specific Languages for Financial Payoffs. Matthew Leslie Bank of America Merrill Lynch
Domain Specific Languages for Financial Payoffs Matthew Leslie Bank of America Merrill Lynch Outline Introduction What, How, and Why do we use DSLs in Finance? Implementation Interpreting, Compiling Performance
More informationCompiler Optimizations. Chapter 8, Section 8.5 Chapter 9, Section 9.1.7
Compiler Optimizations Chapter 8, Section 8.5 Chapter 9, Section 9.1.7 2 Local vs. Global Optimizations Local: inside a single basic block Simple forms of common subexpression elimination, dead code elimination,
More informationCompiler Code Generation COMP360
Compiler Code Generation COMP360 Students who acquire large debts putting themselves through school are unlikely to think about changing society. When you trap people in a system of debt, they can t afford
More informationTechnology for a better society. hetcomp.com
Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction
More informationUnder the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world.
Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world. Supercharge your PS3 game code Part 1: Compiler internals.
More information