AUTOMATIC VECTORIZATION OF TREE TRAVERSALS

Size: px

Start display at page:

Download "AUTOMATIC VECTORIZATION OF TREE TRAVERSALS"

Christine Pierce
6 years ago
Views:

1 AUTOMATIC VECTORIZATION OF TREE TRAVERSALS Youngjoon Jo, Michael Goldfarb and Milind Kulkarni PACT, Edinburgh, U.K. September 11 th, 2013

2 Youngjoon Jo 2 Commodity processors and SIMD Commodity processors support SIMD (Single Instruction Multiple Data) instructions MMX (1996), SSE (1999), SSE2 (2001), SSE3 (2004), SSE4 (2006), AVX (2011), AVX2 (2013) SIMD width getting wider AVX is 256bit Upcoming AVX-512 to be 512bit (2015) Using SIMD is an excellent way to improve performance

3 Youngjoon Jo 3 SIMD works great for regular loops for (int i = 0; i < 4; i++) { c[i] = a[i] + b[i]; }

4 Youngjoon Jo 4 SIMD works great for regular loops for (int i = 0; i < 4; i++) { c[i] = a[i] + b[i]; } m128 vec_a = _mm_load_ps(a); m128 vec_b = _mm_load_ps(b); m128 vec_c = _mm_add_ps(vec_a, vec_b); _mm_store_ps(c, vec_c);

5 Youngjoon Jo 5 But not so well on irregular codes void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } }

6 Youngjoon Jo 6 But not so well on irregular codes void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } }

7 Youngjoon Jo 7 Automatic vectorization desired void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } } Automatic vectorization techniques for irregular codes highly desired

Youngjoon Jo 8 Automatic vectorization desired void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray

8 Youngjoon Jo 8 Automatic vectorization desired void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } } Automatic vectorization techniques for tree traversals highly desired

9 Tree codes are important Youngjoon Jo 9

10 Tree codes are important Youngjoon Jo 10

11 Tree codes are important Youngjoon Jo 11

12 Youngjoon Jo 12 Tree vectorization challenges Non trivial to find vectorizable computation Difficult to keep vectorizable computation together

13 Youngjoon Jo 13 Previous tree vectorization work Non trivial to find vectorizable computation Manually transform code to packetize traversals Process multiple traversals in packet simultaneously Wald et. al. [Computer Graphics Forum 2001] Difficult to keep vectorizable computation together

Youngjoon Jo 14 Previous tree vectorization work Non trivial to find vectorizable computation In situations like physical simulation, collision detection or raytracing in scenes, where rays bounce

14 Youngjoon Jo 14 Previous tree vectorization work Non trivial to find vectorizable computation In situations like physical simulation, collision detection or raytracing in scenes, where rays bounce into multiple directions (spherical or bumpmapped surfaces), coherent ray packets break down very quickly to single rays or do not exist at all. In the above mentioned tasks, packet oriented SIMD computations is much less useful. Manually transform code to packetize traversals Process multiple points in packet simultaneously Wald et. al. [Computer Graphics Forum 2001] Kim et. al. [SIGMOD 2010] Havel and Herout [IEEE Transactions on Visualization and Computer Graphics 2010] Difficult to keep vectorizable computation together

15 Youngjoon Jo 15 Previous tree vectorization work Non trivial to find vectorizable computation Manually transform code to packetize traversals Process multiple points in packet simultaneously Wald et. al. [Computer Graphics Forum 2001] Difficult to keep vectorizable computation together Look to alternative sources of vectorization Pixar [RT 2006] Dammertz et. al. [EGSR 2008] Kim et. al. [SIGMOD 2010] Chhugani et. al. [SC 2012]

16 Youngjoon Jo 16 Our previous tree locality work Point blocking Jo and Kulkarni [OOPSLA 2011] Traversal splicing Jo and Kulkarni [OOPSLA 2012]

17 Youngjoon Jo 17 Our solution Non trivial to find vectorizable computation Manually transform code to packetize traversals Automatically packetize traversals with point blocking and a novel layout transformation Difficult to keep vectorizable computation together Look to alternative sources of vectorization Exploit dynamic sorting of traversal splicing to dramatically enhance utilization

18 Youngjoon Jo 18 Contributions Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Propose a novel layout transformation for efficient vectorization of tree codes Present a framework for automatically restructuring traversals and data layouts to enable vectorization

Youngjoon Jo 19 Contributions Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Spoiler alert Propose a novel layout transformation for

19 Youngjoon Jo 19 Contributions Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Spoiler alert Propose a novel layout transformation for efficient vectorization of tree codes SIMTree can deliver speedups of Present a framework for automatically restructuring traversals and data layouts to enable vectorization up to 6.59, and 2.78 on average

20 Youngjoon Jo 20 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion

21 Youngjoon Jo 21 Tree traversals void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } }

22 Youngjoon Jo 22 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

23 Youngjoon Jo 23 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

24 Youngjoon Jo 24 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

25 Youngjoon Jo 25 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

26 Youngjoon Jo 26 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

27 Youngjoon Jo 27 An abstract model void main() { foreach(point p : points) { foreach(node n : p.oraclenodes()) { update(p, n); } } }

28 Youngjoon Jo 28 Iteration space of traversal void main() { foreach(point p : points) { foreach(node n : p.oraclenodes()) { update(p, n); } } Nodes } Points

29 Youngjoon Jo 29 Iteration space of traversal Points A B C D E F G H Nodes

30 Youngjoon Jo 30 Iteration space of traversal Points A B C D E F G H Nodes

31 Youngjoon Jo 31 How to vectorize? Points A B C D E F G H Nodes

32 Youngjoon Jo 32 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion

33 Youngjoon Jo 33 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes

34 Youngjoon Jo 34 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes

35 Youngjoon Jo 35 Point blocked code void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

36 Youngjoon Jo 36 Point blocked code void recurse(block *block, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

37 Youngjoon Jo 37 Point blocked code void recurse(block *block, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { Function update(p, n); body } else { recurse(p, n->left); recurse(p, n->right); } }

38 Youngjoon Jo 38 Point blocked code Loop over points in block void recurse(block *block, Node *n) { for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } } Function body

39 Youngjoon Jo 39 Point blocked code Loop over points in block void recurse(block *block, Node *n) { for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } } Function body

40 Youngjoon Jo 40 Point blocked code void recurse(block *block, Node *n) { Block *nextblock = // next level block for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { nextblock->add(p); } } } Loop over points in block Function body

Youngjoon Jo 41 Point blocked code void recurse(block *block, Node *n) { Block *nextblock = // next level block for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n))

41 Youngjoon Jo 41 Point blocked code void recurse(block *block, Node *n) { Block *nextblock = // next level block for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { nextblock->add(p); } } if (nextblock->size > 0) { recurse(nextblock, n->left); recurse(nextblock, n->right); } } Loop over points in block Function body Next block recurses children

42 Youngjoon Jo 42 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes

43 Youngjoon Jo 43 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes

44 Youngjoon Jo 44 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes

45 Youngjoon Jo 45 Analogous to packet SIMD Points A B C D E F G H Nodes

46 Youngjoon Jo 46 Analogous to packet SIMD Points A B C D E F G H Nodes Breaks down when points diverge

Youngjoon Jo 47 Packet SIMD has poor utilization Points A B C D E F G H Nodes 1 2 4 8

47 Youngjoon Jo 47 Packet SIMD has poor utilization Points A B C D E F G H Nodes Full SIMD Partial SIMD

48 Youngjoon Jo 48 Packet SIMD has poor utilization Points A B C D E F G H Nodes Full SIMD Partial SIMD

Youngjoon Jo 49 SIMD utilization Points A B C D E F G H Nodes 1 2 4 8 9 5 10 11 3 6 12 13 7 14 15 Full

49 Youngjoon Jo 49 SIMD utilization Points A B C D E F G H Nodes Full SIMD SIMD utilization Partial SIMD = Work in full SIMD / Total work

Youngjoon Jo 50 SIMD utilization Points A B C D E F G H Nodes 1 2 4 8 9 5 10 11 3 6 12 13 7 14 15 Work in full SIMD /

50 Youngjoon Jo 50 SIMD utilization Points A B C D E F G H Nodes Work in full SIMD / Total work Full SIMD Partial SIMD = Circles in blue / Total circles = 24 / 74 = 0.32

Youngjoon Jo 51 Use larger block size Points A B C D E F G H Nodes 1 2 4 8 9 5 10 11 3 6 12 13 7 14 15 Full

51 Youngjoon Jo 51 Use larger block size Points A B C D E F G H Nodes Full SIMD Partial SIMD Use block size larger than SIMD width and compact points

52 Youngjoon Jo 52 Use larger block size Points A B C D E F G H Nodes

Youngjoon Jo 53 Better utilization with larger block size Points A B C D E F G H Nodes 1 2

53 Youngjoon Jo 53 Better utilization with larger block size Points A B C D E F G H Nodes Full SIMD Partial SIMD

Youngjoon Jo 54 Better utilization with larger block size Points A B C D E F G H Nodes 1 2 4 8 9 5 10 11 3 6 12 13

54 Youngjoon Jo 54 Better utilization with larger block size Points A B C D E F G H Nodes Full SIMD Partial SIMD Circles in blue / Total circles = 64 / 74 = 0.86

55 Youngjoon Jo 55 SIMD utilization Block size 1 SIMD Utilization Block Size Barnes-Hut Point Correlation Nearest Neighbor Vantage Point Photon Mapping

56 Youngjoon Jo 56 Ideal utilization Ideal Utilization 1 SIMD Utilization 0.8 Barnes-Hut 0.6 Point Correlation 0.4 Nearest Neighbor Vantage Point Photon Mapping Block Size Block size equal to total points yields ideal SIMD utilization

57 Youngjoon Jo 57 Use max block Problem solved? 1 SIMD Utilization Block Size Barnes-Hut Point Correlation Nearest Neighbor Vantage Point Photon Mapping

58 Youngjoon Jo 58 Large block has poor locality 1 SIMD Utilization Block Size Barnes-Hut Point Correlation Nearest Neighbor Vantage Point Photon Mapping

Youngjoon Jo 59 Large block has poor locality SIMD Utilization 1 0.8 Barnes-Hut 0.6 Point Correlation 0.4 Nearest Neighbor 0.

59 Youngjoon Jo 59 Large block has poor locality SIMD Utilization Barnes-Hut 0.6 Point Correlation 0.4 Nearest Neighbor Vantage Point Photon Mapping Block Size Need schedule with good utilization and good locality

60 Youngjoon Jo 60 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion

61 Youngjoon Jo 61 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes

62 Youngjoon Jo 62 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes

63 Youngjoon Jo 63 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node

64 Youngjoon Jo 64 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node

65 Youngjoon Jo 65 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node

66 Youngjoon Jo 66 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node

67 Youngjoon Jo 67 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished

68 Youngjoon Jo 68 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished

69 Youngjoon Jo 69 Can change order of points Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished We can change the order of paused points, but how?

Youngjoon Jo 70 Dynamic sorting Points 4 A B C D E F G H 8 9 Nodes 1 2 4 8 9 5 10 11 3 6 12 13 7 14 15 Insight: points which reach same nodes are likely to have similar 2 5

70 Youngjoon Jo 70 Dynamic sorting Points 4 A B C D E F G H 8 9 Nodes Insight: points which reach same nodes are likely to have similar 2 5 traversals in future Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished Dynamic sorting on traversal history

71 Youngjoon Jo 71 Dynamic sorting Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node

72 Youngjoon Jo 72 Dynamic sorting Points A B C D E F G H Nodes A B C D E F G H Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node

73 Youngjoon Jo 73 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node

74 Youngjoon Jo 74 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G A C E F H B D G Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node 4. Resume at next node

75 Youngjoon Jo 75 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node 4. Resume at next node 5. Repeat 2-4 until finished

76 Youngjoon Jo 76 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node 4. Resume at next node 5. Repeat 2-4 until finished

77 Youngjoon Jo 77 Dynamic sorting enhances utilization Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Partial SIMD Full SIMD

78 Youngjoon Jo 78 Dynamic sorting enhances utilization Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Partial SIMD Full SIMD Circles in blue / Total circles = 48 / 74 =

79 Youngjoon Jo 79 SIMD utilization splice depth SIMD Utilization Block Size N/A Nearest Neighbor

80 Youngjoon Jo 80 SIMD utilization splice depth Block size: 512 Splice depth: 10 SIMD Utilization Block Size N/A 2 4 Block 6 size: Nearest Neighbor

81 Youngjoon Jo 81 SIMD utilization Utilization Baseline A Priori Sort Dynamic Sort Ideal

Youngjoon Jo 82 SIMD utilization Utilization 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.

82 Youngjoon Jo 82 SIMD utilization Utilization Baseline A Priori Sort Dynamic Sort Ideal Dynamic sorting can automatically extract almost the maximum amount of SIMD utilization

83 Youngjoon Jo 83 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion

84 Youngjoon Jo 84 Automatic transformation Point blocking Jo and Kulkarni [OOPSLA 2011] Traversal splicing Jo and Kulkarni [OOPSLA 2012]

85 Youngjoon Jo 85 Automatic transformation Our key addition for SIMD: Layout transformation from AoS (array of structures) to SoA (structure of arrays) + Allows vector load/stores + Packed data has better spatial locality - More overhead in moving data AoS (array of structures) x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 SoA (structure of arrays) x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4

86 Youngjoon Jo 86 AoS to SoA layout Whole program AoS to SoA layout transformation difficult to automate with aliasing Limit scope to traversal code only Copy in to SoA before traversal Copy out to AoS after traversal Inter-procedural, flow-insensitive analysis Determine which point fields should be SoA Conservatively ensure correctness

87 Youngjoon Jo 87 AoS to SoA layout void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

88 Youngjoon Jo 88 AoS to SoA layout struct Point { float f1, f2, f3; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

89 Youngjoon Jo 89 AoS to SoA layout struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

90 Youngjoon Jo 90 AoS to SoA layout struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

91 Youngjoon Jo 91 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

92 Youngjoon Jo 92 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } f1 f2 f3 Point-access Non-point-access Read Write Read Write bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

93 Youngjoon Jo 93 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } f1 f2 f3 Point-access Non-point-access Read Write Read Write bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

94 Youngjoon Jo 94 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

95 Youngjoon Jo 95 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

96 Youngjoon Jo 96 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

97 Youngjoon Jo 97 Transforming SoA fields struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

98 Youngjoon Jo 98 Transforming SoA fields struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(block *block, int bi, Node *n) { return block->f1[bi] == n->point->f1; } void update(block *block, int bi, Node *n) { block->f2[bi] += n->point->f3; }

99 Youngjoon Jo 99 Correctness violation example struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(block *block, int bi, Node *n) { return block->f1[bi] == n->point->f1; } void update(block *block, int bi, Node *n) { block->f2[bi] += n->point->f2; }

Youngjoon Jo 100 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool

100 Youngjoon Jo 100 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; Sound analysis conservatively proves SoA transformation correct. Suffices to transform all of our } p->f3 = 1; benchmarks.

101 Youngjoon Jo 101 SIMTree Implementation of analysis and transformation in a source to source C++ compiler Based on ROSE compiler infrastructure Transforms code to apply point blocking, traversal splicing, and SoA layout Does not perform the vectorization itself

102 Youngjoon Jo 102 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion

103 Youngjoon Jo 103 Evaluation Five benchmarks Barnes-Hut, Point Correlation, Nearest Neighbor, Vantage Point, Photon Mapping Real and random inputs form 17 benchmark/inputs Two machines Intel Xeon E AMD Opteron 6282 Automatic transformation with SIMTree Manual vectorization of transformed code with 4-way SIMD intrinsics for best performance Auto vectorization of transformed code with icc gets 84% of best performance

104 Youngjoon Jo 104 Speedup on Xeon 7 6 Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] Block+SIMD+Splice

105 Youngjoon Jo Speedup on Xeon Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] PacketSIMD Block Block+SIMD+Splice Block +SIMD Block +Splice Xeon Block+SIMD +Splice

106 Youngjoon Jo Dynamic sorting makes SIMD profitable Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] PacketSIMD Block Block+SIMD+Splice Block +SIMD Block +Splice Xeon Opteron Block+SIMD +Splice

Youngjoon Jo 107 Dynamic sorting makes SIMD profitable 7 6 Speedup 5 4 3

Block+Splice [OOPSLA 2012] PacketSIMD Block Block+SIMD+Splice Block +SIMD Block

107 Youngjoon Jo 107 Dynamic sorting makes SIMD profitable 7 6 Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] PacketSIMD Block Block+SIMD+Splice Block +SIMD Block +Splice Xeon Opteron Block+SIMD +Splice

108 Youngjoon Jo 108 Instruction counts: Opteron 2.5 Insturction Ratio Block BlockSoA BlockSoA+SIMD Block+Splice BlockSoA+Splice BlockSoA+SIMD+Splice Block BlockSoA BlockSoA +SIMD Block +Splice BlockSoA +Splice BlockSoA +SIMD+Splice

109 Youngjoon Jo 109 Cycles per instruction: Opteron Cycles Per Instruction Base Block BlockSoA Block+Splice BlockSoA+Splice Base Block BlockSoA Block+Splice BlockSoA+Splice

110 Youngjoon Jo 110 Conclusion Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Propose a novel layout transformation for efficient vectorization of tree codes Present a framework for automatically restructuring traversals and data layouts to enable vectorization

Youngjoon Jo 111 Conclusion Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Propose a novel layout transformation for efficient

111 Youngjoon Jo 111 Conclusion Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Propose a novel layout transformation for efficient vectorization of tree codes Present a framework for automatically restructuring traversals and data layouts to enable vectorization SIMTree is open source simtree/

112 AUTOMATIC VECTORIZATION OF TREE TRAVERSALS Youngjoon Jo, Michael Goldfarb and Milind Kulkarni PACT, Edinburgh, U.K. September 11 th, 2013

General Transformations for GPU Execution of Tree Traversals

General Transformations for GPU Execution of Tree Traversals eneral Transformations for PU Execution of Tree Traversals Michael oldfarb*, Youngjoon Jo**, Milind Kulkarni School of Electrical and Computer Engineering * Now at Qualcomm; ** Now at oogle PU execution