AUTOMATIC VECTORIZATION OF TREE TRAVERSALS

Size: px
Start display at page:

Download "AUTOMATIC VECTORIZATION OF TREE TRAVERSALS"

Transcription

1 AUTOMATIC VECTORIZATION OF TREE TRAVERSALS Youngjoon Jo, Michael Goldfarb and Milind Kulkarni PACT, Edinburgh, U.K. September 11 th, 2013

2 Youngjoon Jo 2 Commodity processors and SIMD Commodity processors support SIMD (Single Instruction Multiple Data) instructions MMX (1996), SSE (1999), SSE2 (2001), SSE3 (2004), SSE4 (2006), AVX (2011), AVX2 (2013) SIMD width getting wider AVX is 256bit Upcoming AVX-512 to be 512bit (2015) Using SIMD is an excellent way to improve performance

3 Youngjoon Jo 3 SIMD works great for regular loops for (int i = 0; i < 4; i++) { c[i] = a[i] + b[i]; }

4 Youngjoon Jo 4 SIMD works great for regular loops for (int i = 0; i < 4; i++) { c[i] = a[i] + b[i]; } m128 vec_a = _mm_load_ps(a); m128 vec_b = _mm_load_ps(b); m128 vec_c = _mm_add_ps(vec_a, vec_b); _mm_store_ps(c, vec_c);

5 Youngjoon Jo 5 But not so well on irregular codes void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } }

6 Youngjoon Jo 6 But not so well on irregular codes void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } }

7 Youngjoon Jo 7 Automatic vectorization desired void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } } Automatic vectorization techniques for irregular codes highly desired

8 Youngjoon Jo 8 Automatic vectorization desired void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } } Automatic vectorization techniques for tree traversals highly desired

9 Tree codes are important Youngjoon Jo 9

10 Tree codes are important Youngjoon Jo 10

11 Tree codes are important Youngjoon Jo 11

12 Youngjoon Jo 12 Tree vectorization challenges Non trivial to find vectorizable computation Difficult to keep vectorizable computation together

13 Youngjoon Jo 13 Previous tree vectorization work Non trivial to find vectorizable computation Manually transform code to packetize traversals Process multiple traversals in packet simultaneously Wald et. al. [Computer Graphics Forum 2001] Difficult to keep vectorizable computation together

14 Youngjoon Jo 14 Previous tree vectorization work Non trivial to find vectorizable computation In situations like physical simulation, collision detection or raytracing in scenes, where rays bounce into multiple directions (spherical or bumpmapped surfaces), coherent ray packets break down very quickly to single rays or do not exist at all. In the above mentioned tasks, packet oriented SIMD computations is much less useful. Manually transform code to packetize traversals Process multiple points in packet simultaneously Wald et. al. [Computer Graphics Forum 2001] Kim et. al. [SIGMOD 2010] Havel and Herout [IEEE Transactions on Visualization and Computer Graphics 2010] Difficult to keep vectorizable computation together

15 Youngjoon Jo 15 Previous tree vectorization work Non trivial to find vectorizable computation Manually transform code to packetize traversals Process multiple points in packet simultaneously Wald et. al. [Computer Graphics Forum 2001] Difficult to keep vectorizable computation together Look to alternative sources of vectorization Pixar [RT 2006] Dammertz et. al. [EGSR 2008] Kim et. al. [SIGMOD 2010] Chhugani et. al. [SC 2012]

16 Youngjoon Jo 16 Our previous tree locality work Point blocking Jo and Kulkarni [OOPSLA 2011] Traversal splicing Jo and Kulkarni [OOPSLA 2012]

17 Youngjoon Jo 17 Our solution Non trivial to find vectorizable computation Manually transform code to packetize traversals Automatically packetize traversals with point blocking and a novel layout transformation Difficult to keep vectorizable computation together Look to alternative sources of vectorization Exploit dynamic sorting of traversal splicing to dramatically enhance utilization

18 Youngjoon Jo 18 Contributions Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Propose a novel layout transformation for efficient vectorization of tree codes Present a framework for automatically restructuring traversals and data layouts to enable vectorization

19 Youngjoon Jo 19 Contributions Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Spoiler alert Propose a novel layout transformation for efficient vectorization of tree codes SIMTree can deliver speedups of Present a framework for automatically restructuring traversals and data layouts to enable vectorization up to 6.59, and 2.78 on average

20 Youngjoon Jo 20 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion

21 Youngjoon Jo 21 Tree traversals void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } }

22 Youngjoon Jo 22 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

23 Youngjoon Jo 23 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

24 Youngjoon Jo 24 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

25 Youngjoon Jo 25 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

26 Youngjoon Jo 26 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

27 Youngjoon Jo 27 An abstract model void main() { foreach(point p : points) { foreach(node n : p.oraclenodes()) { update(p, n); } } }

28 Youngjoon Jo 28 Iteration space of traversal void main() { foreach(point p : points) { foreach(node n : p.oraclenodes()) { update(p, n); } } Nodes } Points

29 Youngjoon Jo 29 Iteration space of traversal Points A B C D E F G H Nodes

30 Youngjoon Jo 30 Iteration space of traversal Points A B C D E F G H Nodes

31 Youngjoon Jo 31 How to vectorize? Points A B C D E F G H Nodes

32 Youngjoon Jo 32 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion

33 Youngjoon Jo 33 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes

34 Youngjoon Jo 34 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes

35 Youngjoon Jo 35 Point blocked code void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

36 Youngjoon Jo 36 Point blocked code void recurse(block *block, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

37 Youngjoon Jo 37 Point blocked code void recurse(block *block, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { Function update(p, n); body } else { recurse(p, n->left); recurse(p, n->right); } }

38 Youngjoon Jo 38 Point blocked code Loop over points in block void recurse(block *block, Node *n) { for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } } Function body

39 Youngjoon Jo 39 Point blocked code Loop over points in block void recurse(block *block, Node *n) { for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } } Function body

40 Youngjoon Jo 40 Point blocked code void recurse(block *block, Node *n) { Block *nextblock = // next level block for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { nextblock->add(p); } } } Loop over points in block Function body

41 Youngjoon Jo 41 Point blocked code void recurse(block *block, Node *n) { Block *nextblock = // next level block for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { nextblock->add(p); } } if (nextblock->size > 0) { recurse(nextblock, n->left); recurse(nextblock, n->right); } } Loop over points in block Function body Next block recurses children

42 Youngjoon Jo 42 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes

43 Youngjoon Jo 43 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes

44 Youngjoon Jo 44 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes

45 Youngjoon Jo 45 Analogous to packet SIMD Points A B C D E F G H Nodes

46 Youngjoon Jo 46 Analogous to packet SIMD Points A B C D E F G H Nodes Breaks down when points diverge

47 Youngjoon Jo 47 Packet SIMD has poor utilization Points A B C D E F G H Nodes Full SIMD Partial SIMD

48 Youngjoon Jo 48 Packet SIMD has poor utilization Points A B C D E F G H Nodes Full SIMD Partial SIMD

49 Youngjoon Jo 49 SIMD utilization Points A B C D E F G H Nodes Full SIMD SIMD utilization Partial SIMD = Work in full SIMD / Total work

50 Youngjoon Jo 50 SIMD utilization Points A B C D E F G H Nodes Work in full SIMD / Total work Full SIMD Partial SIMD = Circles in blue / Total circles = 24 / 74 = 0.32

51 Youngjoon Jo 51 Use larger block size Points A B C D E F G H Nodes Full SIMD Partial SIMD Use block size larger than SIMD width and compact points

52 Youngjoon Jo 52 Use larger block size Points A B C D E F G H Nodes

53 Youngjoon Jo 53 Better utilization with larger block size Points A B C D E F G H Nodes Full SIMD Partial SIMD

54 Youngjoon Jo 54 Better utilization with larger block size Points A B C D E F G H Nodes Full SIMD Partial SIMD Circles in blue / Total circles = 64 / 74 = 0.86

55 Youngjoon Jo 55 SIMD utilization Block size 1 SIMD Utilization Block Size Barnes-Hut Point Correlation Nearest Neighbor Vantage Point Photon Mapping

56 Youngjoon Jo 56 Ideal utilization Ideal Utilization 1 SIMD Utilization 0.8 Barnes-Hut 0.6 Point Correlation 0.4 Nearest Neighbor Vantage Point Photon Mapping Block Size Block size equal to total points yields ideal SIMD utilization

57 Youngjoon Jo 57 Use max block Problem solved? 1 SIMD Utilization Block Size Barnes-Hut Point Correlation Nearest Neighbor Vantage Point Photon Mapping

58 Youngjoon Jo 58 Large block has poor locality 1 SIMD Utilization Block Size Barnes-Hut Point Correlation Nearest Neighbor Vantage Point Photon Mapping

59 Youngjoon Jo 59 Large block has poor locality SIMD Utilization Barnes-Hut 0.6 Point Correlation 0.4 Nearest Neighbor Vantage Point Photon Mapping Block Size Need schedule with good utilization and good locality

60 Youngjoon Jo 60 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion

61 Youngjoon Jo 61 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes

62 Youngjoon Jo 62 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes

63 Youngjoon Jo 63 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node

64 Youngjoon Jo 64 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node

65 Youngjoon Jo 65 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node

66 Youngjoon Jo 66 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node

67 Youngjoon Jo 67 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished

68 Youngjoon Jo 68 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished

69 Youngjoon Jo 69 Can change order of points Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished We can change the order of paused points, but how?

70 Youngjoon Jo 70 Dynamic sorting Points 4 A B C D E F G H 8 9 Nodes Insight: points which reach same nodes are likely to have similar 2 5 traversals in future Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished Dynamic sorting on traversal history

71 Youngjoon Jo 71 Dynamic sorting Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node

72 Youngjoon Jo 72 Dynamic sorting Points A B C D E F G H Nodes A B C D E F G H Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node

73 Youngjoon Jo 73 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node

74 Youngjoon Jo 74 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G A C E F H B D G Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node 4. Resume at next node

75 Youngjoon Jo 75 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node 4. Resume at next node 5. Repeat 2-4 until finished

76 Youngjoon Jo 76 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node 4. Resume at next node 5. Repeat 2-4 until finished

77 Youngjoon Jo 77 Dynamic sorting enhances utilization Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Partial SIMD Full SIMD

78 Youngjoon Jo 78 Dynamic sorting enhances utilization Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Partial SIMD Full SIMD Circles in blue / Total circles = 48 / 74 =

79 Youngjoon Jo 79 SIMD utilization splice depth SIMD Utilization Block Size N/A Nearest Neighbor

80 Youngjoon Jo 80 SIMD utilization splice depth Block size: 512 Splice depth: 10 SIMD Utilization Block Size N/A 2 4 Block 6 size: Nearest Neighbor

81 Youngjoon Jo 81 SIMD utilization Utilization Baseline A Priori Sort Dynamic Sort Ideal

82 Youngjoon Jo 82 SIMD utilization Utilization Baseline A Priori Sort Dynamic Sort Ideal Dynamic sorting can automatically extract almost the maximum amount of SIMD utilization

83 Youngjoon Jo 83 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion

84 Youngjoon Jo 84 Automatic transformation Point blocking Jo and Kulkarni [OOPSLA 2011] Traversal splicing Jo and Kulkarni [OOPSLA 2012]

85 Youngjoon Jo 85 Automatic transformation Our key addition for SIMD: Layout transformation from AoS (array of structures) to SoA (structure of arrays) + Allows vector load/stores + Packed data has better spatial locality - More overhead in moving data AoS (array of structures) x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 SoA (structure of arrays) x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4

86 Youngjoon Jo 86 AoS to SoA layout Whole program AoS to SoA layout transformation difficult to automate with aliasing Limit scope to traversal code only Copy in to SoA before traversal Copy out to AoS after traversal Inter-procedural, flow-insensitive analysis Determine which point fields should be SoA Conservatively ensure correctness

87 Youngjoon Jo 87 AoS to SoA layout void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

88 Youngjoon Jo 88 AoS to SoA layout struct Point { float f1, f2, f3; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

89 Youngjoon Jo 89 AoS to SoA layout struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }

90 Youngjoon Jo 90 AoS to SoA layout struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

91 Youngjoon Jo 91 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

92 Youngjoon Jo 92 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } f1 f2 f3 Point-access Non-point-access Read Write Read Write bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

93 Youngjoon Jo 93 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } f1 f2 f3 Point-access Non-point-access Read Write Read Write bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

94 Youngjoon Jo 94 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

95 Youngjoon Jo 95 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

96 Youngjoon Jo 96 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

97 Youngjoon Jo 97 Transforming SoA fields struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }

98 Youngjoon Jo 98 Transforming SoA fields struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(block *block, int bi, Node *n) { return block->f1[bi] == n->point->f1; } void update(block *block, int bi, Node *n) { block->f2[bi] += n->point->f3; }

99 Youngjoon Jo 99 Correctness violation example struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(block *block, int bi, Node *n) { return block->f1[bi] == n->point->f1; } void update(block *block, int bi, Node *n) { block->f2[bi] += n->point->f2; }

100 Youngjoon Jo 100 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; Sound analysis conservatively proves SoA transformation correct. Suffices to transform all of our } p->f3 = 1; benchmarks.

101 Youngjoon Jo 101 SIMTree Implementation of analysis and transformation in a source to source C++ compiler Based on ROSE compiler infrastructure Transforms code to apply point blocking, traversal splicing, and SoA layout Does not perform the vectorization itself

102 Youngjoon Jo 102 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion

103 Youngjoon Jo 103 Evaluation Five benchmarks Barnes-Hut, Point Correlation, Nearest Neighbor, Vantage Point, Photon Mapping Real and random inputs form 17 benchmark/inputs Two machines Intel Xeon E AMD Opteron 6282 Automatic transformation with SIMTree Manual vectorization of transformed code with 4-way SIMD intrinsics for best performance Auto vectorization of transformed code with icc gets 84% of best performance

104 Youngjoon Jo 104 Speedup on Xeon 7 6 Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] Block+SIMD+Splice

105 Youngjoon Jo Speedup on Xeon Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] PacketSIMD Block Block+SIMD+Splice Block +SIMD Block +Splice Xeon Block+SIMD +Splice

106 Youngjoon Jo Dynamic sorting makes SIMD profitable Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] PacketSIMD Block Block+SIMD+Splice Block +SIMD Block +Splice Xeon Opteron Block+SIMD +Splice

107 Youngjoon Jo 107 Dynamic sorting makes SIMD profitable 7 6 Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] PacketSIMD Block Block+SIMD+Splice Block +SIMD Block +Splice Xeon Opteron Block+SIMD +Splice

108 Youngjoon Jo 108 Instruction counts: Opteron 2.5 Insturction Ratio Block BlockSoA BlockSoA+SIMD Block+Splice BlockSoA+Splice BlockSoA+SIMD+Splice Block BlockSoA BlockSoA +SIMD Block +Splice BlockSoA +Splice BlockSoA +SIMD+Splice

109 Youngjoon Jo 109 Cycles per instruction: Opteron Cycles Per Instruction Base Block BlockSoA Block+Splice BlockSoA+Splice Base Block BlockSoA Block+Splice BlockSoA+Splice

110 Youngjoon Jo 110 Conclusion Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Propose a novel layout transformation for efficient vectorization of tree codes Present a framework for automatically restructuring traversals and data layouts to enable vectorization

111 Youngjoon Jo 111 Conclusion Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Propose a novel layout transformation for efficient vectorization of tree codes Present a framework for automatically restructuring traversals and data layouts to enable vectorization SIMTree is open source simtree/

112 AUTOMATIC VECTORIZATION OF TREE TRAVERSALS Youngjoon Jo, Michael Goldfarb and Milind Kulkarni PACT, Edinburgh, U.K. September 11 th, 2013

General Transformations for GPU Execution of Tree Traversals

General Transformations for GPU Execution of Tree Traversals eneral Transformations for PU Execution of Tree Traversals Michael oldfarb*, Youngjoon Jo**, Milind Kulkarni School of Electrical and Computer Engineering * Now at Qualcomm; ** Now at oogle PU execution

More information

Automatically Enhancing Locality for Tree Traversals with Traversal Splicing

Automatically Enhancing Locality for Tree Traversals with Traversal Splicing Purdue University Purdue e-pubs ECE Technical Reports Electrical and Computer Engineering 2-15-2012 Automatically Enhancing Locality for Tree Traversals with Traversal Splicing Youngjoon Jo Electrical

More information

Automatically Optimizing Tree Traversal Algorithms

Automatically Optimizing Tree Traversal Algorithms Purdue University Purdue e-pubs Open Access Dissertations Theses and Dissertations Fall 2013 Automatically Optimizing Tree Traversal Algorithms Youngjoon Jo Purdue University Follow this and additional

More information

Enhancing Locality for Recursive Traversals of Recursive Structures

Enhancing Locality for Recursive Traversals of Recursive Structures Enhancing Locality for Recursive Traversals of Recursive Structures Youngjoon Jo and Milind Kulkarni School of Electrical and Computer Engineering Purdue University {yjo,milind}@purdue.edu Abstract While

More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 8 Processor-level SIMD SIMD instructions can perform

More information

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016 Ray Tracing Computer Graphics CMU 15-462/15-662, Fall 2016 Primitive-partitioning vs. space-partitioning acceleration structures Primitive partitioning (bounding volume hierarchy): partitions node s primitives

More information

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 5: SIMD (1) Welcome!

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 5: SIMD (1) Welcome! /INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2018 - Lecture 5: SIMD (1) Welcome! INFOMOV Lecture 5 SIMD (1) 2 Meanwhile, on ars technica INFOMOV Lecture 5 SIMD (1) 3 Meanwhile, the job market

More information

General Transformations for GPU Execution of Tree Traversals

General Transformations for GPU Execution of Tree Traversals General Transformations for GPU Execution of Tree Traversals Michael Goldfarb, Youngjoon Jo and Milind Kulkarni School of Electrical and Computer Engineering Purdue University West Lafayette, IN {mgoldfar,

More information

Treelogy: A Benchmark Suite for Tree Traversals

Treelogy: A Benchmark Suite for Tree Traversals Purdue University Programming Languages Group Treelogy: A Benchmark Suite for Tree Traversals Nikhil Hegde, Jianqiao Liu, Kirshanthan Sundararajah, and Milind Kulkarni School of Electrical and Computer

More information

Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization

Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization Yulei Sui, Xiaokang Fan, Hao Zhou and Jingling Xue School of Computer Science and Engineering The University of

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Milind Kulkarni Research Statement

Milind Kulkarni Research Statement Milind Kulkarni Research Statement With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers

More information

Bandwidth Avoiding Stencil Computations

Bandwidth Avoiding Stencil Computations Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu

More information

INFOMAGR Advanced Graphics. Jacco Bikker - February April Welcome!

INFOMAGR Advanced Graphics. Jacco Bikker - February April Welcome! INFOMAGR Advanced Graphics Jacco Bikker - February April 2016 Welcome! I x, x = g(x, x ) ε x, x + S ρ x, x, x I x, x dx Today s Agenda: Introduction Ray Distributions The Top-level BVH Real-time Ray Tracing

More information

Lecture 4 - Real-time Ray Tracing

Lecture 4 - Real-time Ray Tracing INFOMAGR Advanced Graphics Jacco Bikker - November 2017 - February 2018 Lecture 4 - Real-time Ray Tracing Welcome! I x, x = g(x, x ) ε x, x + න S ρ x, x, x I x, x dx Today s Agenda: Introduction Ray Distributions

More information

INFOGR Computer Graphics. J. Bikker - April-July Lecture 11: Acceleration. Welcome!

INFOGR Computer Graphics. J. Bikker - April-July Lecture 11: Acceleration. Welcome! INFOGR Computer Graphics J. Bikker - April-July 2015 - Lecture 11: Acceleration Welcome! Today s Agenda: High-speed Ray Tracing Acceleration Structures The Bounding Volume Hierarchy BVH Construction BVH

More information

SSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals

SSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals SSE and SSE2 Timothy A. Chagnon 18 September 2007 All images from Intel 64 and IA 32 Architectures Software Developer's Manuals Overview SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions

More information

Object Support in an Array-based GPGPU Extension for Ruby

Object Support in an Array-based GPGPU Extension for Ruby Object Support in an Array-based GPGPU Extension for Ruby ARRAY 16 Matthias Springer, Hidehiko Masuhara Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology June 14, 2016 Object

More information

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows

More information

SIMD Exploitation in (JIT) Compilers

SIMD Exploitation in (JIT) Compilers SIMD Exploitation in (JIT) Compilers Hiroshi Inoue, IBM Research - Tokyo 1 What s SIMD? Single Instruction Multiple Data Same operations applied for multiple elements in a vector register input 1 A0 input

More information

Exploiting Local Orientation Similarity for Efficient Ray Traversal of Hair and Fur

Exploiting Local Orientation Similarity for Efficient Ray Traversal of Hair and Fur 1 Exploiting Local Orientation Similarity for Efficient Ray Traversal of Hair and Fur Sven Woop, Carsten Benthin, Ingo Wald, Gregory S. Johnson Intel Corporation Eric Tabellion DreamWorks Animation 2 Legal

More information

A Static Cut-off for Task Parallel Programs

A Static Cut-off for Task Parallel Programs A Static Cut-off for Task Parallel Programs Shintaro Iwasaki, Kenjiro Taura Graduate School of Information Science and Technology The University of Tokyo September 12, 2016 @ PACT '16 1 Short Summary We

More information

An Evaluation of Multi-Hit Ray Traversal in a BVH Using Existing First-Hit/Any-Hit Kernels: Algorithm Listings and Performance Visualizations

An Evaluation of Multi-Hit Ray Traversal in a BVH Using Existing First-Hit/Any-Hit Kernels: Algorithm Listings and Performance Visualizations Journal of Computer Graphics Techniques in a BVH Using Existing First-Hit/Any-Hit Kernels: Algorithm Listings and Performance Visualizations This document provides algorithm listings for multi-hit ray

More information

Vectorization on KNL

Vectorization on KNL Vectorization on KNL Steve Lantz Senior Research Associate Cornell University Center for Advanced Computing (CAC) steve.lantz@cornell.edu High Performance Computing on Stampede 2, with KNL, Jan. 23, 2017

More information

Guy Blank Intel Corporation, Israel March 27-28, 2017 European LLVM Developers Meeting Saarland Informatics Campus, Saarbrücken, Germany

Guy Blank Intel Corporation, Israel March 27-28, 2017 European LLVM Developers Meeting Saarland Informatics Campus, Saarbrücken, Germany Guy Blank Intel Corporation, Israel March 27-28, 2017 European LLVM Developers Meeting Saarland Informatics Campus, Saarbrücken, Germany Motivation C AVX2 AVX512 New instructions utilized! Scalar performance

More information

Kismet: Parallel Speedup Estimates for Serial Programs

Kismet: Parallel Speedup Estimates for Serial Programs Kismet: Parallel Speedup Estimates for Serial Programs Donghwan Jeon, Saturnino Garcia, Chris Louie, and Michael Bedford Taylor Computer Science and Engineering University of California, San Diego 1 Questions

More information

EDAN30 Photorealistic Computer Graphics. Seminar 2, Bounding Volume Hierarchy. Magnus Andersson, PhD student

EDAN30 Photorealistic Computer Graphics. Seminar 2, Bounding Volume Hierarchy. Magnus Andersson, PhD student EDAN30 Photorealistic Computer Graphics Seminar 2, 2012 Bounding Volume Hierarchy Magnus Andersson, PhD student (magnusa@cs.lth.se) This seminar We want to go from hundreds of triangles to thousands (or

More information

Warped parallel nearest neighbor searches using kd-trees

Warped parallel nearest neighbor searches using kd-trees Warped parallel nearest neighbor searches using kd-trees Roman Sokolov, Andrei Tchouprakov D4D Technologies Kd-trees Binary space partitioning tree Used for nearest-neighbor search, range search Application:

More information

Lectures Parallelism

Lectures Parallelism Lectures 24-25 Parallelism 1 Pipelining vs. Parallel processing In both cases, multiple things processed by multiple functional units Pipelining: each thing is broken into a sequence of pieces, where each

More information

Image Processing Acceleration Techniques using Intel Streaming SIMD Extensions and Intel Advanced Vector Extensions

Image Processing Acceleration Techniques using Intel Streaming SIMD Extensions and Intel Advanced Vector Extensions Image Processing Acceleration Techniques using Intel Streaming SIMD Extensions and Intel Advanced Vector Extensions September 4, 2009 Authors: Petter Larsson & Eric Palmer INFORMATION IN THIS DOCUMENT

More information

Locality-enhancing loop transformations for parallel tree traversal algorithms

Locality-enhancing loop transformations for parallel tree traversal algorithms Purdue University Purdue e-pubs ECE Technical Reports Electrical and Computer Engineering -15-11 Locality-enhancing loop transformations for parallel tree traversal algorithms Youngjoon Jo Electrical and

More information

Introduction to tuning on many core platforms. Gilles Gouaillardet RIST

Introduction to tuning on many core platforms. Gilles Gouaillardet RIST Introduction to tuning on many core platforms Gilles Gouaillardet RIST gilles@rist.or.jp Agenda Why do we need many core platforms? Single-thread optimization Parallelization Conclusions Why do we need

More information

Programmazione Avanzata

Programmazione Avanzata Programmazione Avanzata Vittorio Ruggiero (v.ruggiero@cineca.it) Roma, Marzo 2017 Pipeline Outline CPU: internal parallelism? CPU are entirely parallel pipelining superscalar execution units SIMD MMX,

More information

CSC766: Project - Fall 2010 Vivek Deshpande and Kishor Kharbas Auto-vectorizating compiler for pseudo language

CSC766: Project - Fall 2010 Vivek Deshpande and Kishor Kharbas Auto-vectorizating compiler for pseudo language CSC766: Project - Fall 2010 Vivek Deshpande and Kishor Kharbas Auto-vectorizating compiler for pseudo language Table of contents 1. Introduction 2. Implementation details 2.1. High level design 2.2. Data

More information

Adaptive Assignment for Real-Time Raytracing

Adaptive Assignment for Real-Time Raytracing Adaptive Assignment for Real-Time Raytracing Paul Aluri [paluri] and Jacob Slone [jslone] Carnegie Mellon University 15-418/618 Spring 2015 Summary We implemented a CUDA raytracer accelerated by a non-recursive

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Getting Started with Intel Cilk Plus SIMD Vectorization and SIMD-enabled functions

Getting Started with Intel Cilk Plus SIMD Vectorization and SIMD-enabled functions Getting Started with Intel Cilk Plus SIMD Vectorization and SIMD-enabled functions Introduction SIMD Vectorization and SIMD-enabled Functions are a part of Intel Cilk Plus feature supported by the Intel

More information

Embree Ray Tracing Kernels for the Intel Xeon and Intel Xeon Phi Architectures. Sven Woop, Carsten Benthin, Ingo Wald Intel

Embree Ray Tracing Kernels for the Intel Xeon and Intel Xeon Phi Architectures. Sven Woop, Carsten Benthin, Ingo Wald Intel Embree Ray Tracing Kernels for the Intel Xeon and Intel Xeon Phi Architectures Sven Woop, Carsten Benthin, Ingo Wald Intel 1 2 Legal Disclaimer and Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED

More information

ARE WE OPTIMIZING HARDWARE FOR

ARE WE OPTIMIZING HARDWARE FOR ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? PARSEC S VECTORIZATION EFFECTS ON ENERGY EFFICIENCY AND ARCHITECTURAL REQUIREMENTS Juan M. Cebrián 1 1 Depart. of Computer and Information Science

More information

Ray Tracing Performance

Ray Tracing Performance Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel Ray Tracing Performance Zero to Millions in 45 Minutes?! Gordon Stoll, Intel Goals for this talk Goals point you toward the current

More information

Multi Agent Navigation on GPU. Avi Bleiweiss

Multi Agent Navigation on GPU. Avi Bleiweiss Multi Agent Navigation on GPU Avi Bleiweiss Reasoning Explicit Implicit Script, storytelling State machine, serial Compute intensive Fits SIMT architecture well Navigation planning Collision avoidance

More information

Intersection Acceleration

Intersection Acceleration Advanced Computer Graphics Intersection Acceleration Matthias Teschner Computer Science Department University of Freiburg Outline introduction bounding volume hierarchies uniform grids kd-trees octrees

More information

Tiling: A Data Locality Optimizing Algorithm

Tiling: A Data Locality Optimizing Algorithm Tiling: A Data Locality Optimizing Algorithm Announcements Monday November 28th, Dr. Sanjay Rajopadhye is talking at BMAC Friday December 2nd, Dr. Sanjay Rajopadhye will be leading CS553 Last Monday Kelly

More information

Efficient Image-Based Methods for Rendering Soft Shadows. Hard vs. Soft Shadows. IBR good for soft shadows. Shadow maps

Efficient Image-Based Methods for Rendering Soft Shadows. Hard vs. Soft Shadows. IBR good for soft shadows. Shadow maps Efficient Image-Based Methods for Rendering Soft Shadows Hard vs. Soft Shadows Maneesh Agrawala Ravi Ramamoorthi Alan Heirich Laurent Moll Pixar Animation Studios Stanford University Compaq Computer Corporation

More information

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT

Autotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT Autotuning John Cavazos University of Delaware What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. Search can be Quasi-intelligent: genetic

More information

Geometric Structures 2. Quadtree, k-d stromy

Geometric Structures 2. Quadtree, k-d stromy Geometric Structures 2. Quadtree, k-d stromy Martin Samuelčík samuelcik@sccg.sk, www.sccg.sk/~samuelcik, I4 Window and Point query From given set of points, find all that are inside of given d-dimensional

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

Parallel Programming Case Studies

Parallel Programming Case Studies Lecture 8: Parallel Programming Case Studies Parallel Computer Architecture and Programming CMU 5-48/5-68, Spring 207 Tunes Beyoncé (featuring Jack White) Don t Hurt Yourself (Lemonade) I suggest 48 students

More information

ARMv8-A Scalable Vector Extension for Post-K. Copyright 2016 FUJITSU LIMITED

ARMv8-A Scalable Vector Extension for Post-K. Copyright 2016 FUJITSU LIMITED ARMv8-A Scalable Vector Extension for Post-K 0 Post-K Supports New SIMD Extension The SIMD extension is a 512-bit wide implementation of SVE SVE is an HPC-focused SIMD instruction extension in AArch64

More information

Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm

Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm Department of Computer Science and Engineering Sogang University, Korea Improving Memory

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

RTfact: Concepts for Generic and High Performance Ray Tracing

RTfact: Concepts for Generic and High Performance Ray Tracing RTfact: Concepts for Generic and High Performance Ray Tracing Ray tracers are used in RT08 papers Change packet size? Change data structures? No common software base No tools for writing composable software

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC MIKE GOWANLOCK NORTHERN ARIZONA UNIVERSITY SCHOOL OF INFORMATICS, COMPUTING & CYBER SYSTEMS BEN KARSIN UNIVERSITY OF HAWAII AT MANOA DEPARTMENT

More information

Case Study: Matrix Multiplication. 6.S898: Advanced Performance Engineering for Multicore Applications February 22, 2017

Case Study: Matrix Multiplication. 6.S898: Advanced Performance Engineering for Multicore Applications February 22, 2017 Case Study: Matrix Multiplication 6.S898: Advanced Performance Engineering for Multicore Applications February 22, 2017 1 4k-by-4k Matrix Multiplication Version Implementation Running time (s) GFLOPS Absolute

More information

EE/CSCI 451 Midterm 1

EE/CSCI 451 Midterm 1 EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming

More information

Dynamic SIMD Scheduling

Dynamic SIMD Scheduling Dynamic SIMD Scheduling Florian Wende SC15 MIC Tuning BoF November 18 th, 2015 Zuse Institute Berlin time Dynamic Work Assignment: The Idea Irregular SIMD execution Caused by branching: control flow varies

More information

SIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13

SIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13 SIMD Divergence Optimization through Intra-Warp Compaction Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13 Problem GPU: wide SIMD lanes 16 lanes per warp in this work SIMD

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

Introduction to OpenMP

Introduction to OpenMP 1 / 7 Introduction to OpenMP: Exercises and Handout Introduction to OpenMP Christian Terboven Center for Computing and Communication, RWTH Aachen University Seffenter Weg 23, 52074 Aachen, Germany Abstract

More information

Preparing seismic codes for GPUs and other

Preparing seismic codes for GPUs and other Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of

More information

Implicit and Explicit Optimizations for Stencil Computations

Implicit and Explicit Optimizations for Stencil Computations Implicit and Explicit Optimizations for Stencil Computations By Shoaib Kamil 1,2, Kaushik Datta 1, Samuel Williams 1,2, Leonid Oliker 2, John Shalf 2 and Katherine A. Yelick 1,2 1 BeBOP Project, U.C. Berkeley

More information

MAQAO Hands-on exercises LRZ Cluster

MAQAO Hands-on exercises LRZ Cluster MAQAO Hands-on exercises LRZ Cluster LProf: lightweight generic profiler LProf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Copy handson material > cp /home/hpc/a2c06/lu23bud/lrz-vihpstw21/tools/maqao/maqao_handson_lrz.tar.xz

More information

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture

More information

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization

CSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization CSE 160 Lecture 10 Instruction level parallelism (ILP) Vectorization Announcements Quiz on Friday Signup for Friday labs sessions in APM 2013 Scott B. Baden / CSE 160 / Winter 2013 2 Particle simulation

More information

OpenCL Vectorising Features. Andreas Beckmann

OpenCL Vectorising Features. Andreas Beckmann Mitglied der Helmholtz-Gemeinschaft OpenCL Vectorising Features Andreas Beckmann Levels of Vectorisation vector units, SIMD devices width, instructions SMX, SP cores Cus, PEs vector operations within kernels

More information

Improving graphics processing performance using Intel Cilk Plus

Improving graphics processing performance using Intel Cilk Plus Improving graphics processing performance using Intel Cilk Plus Introduction Intel Cilk Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords

More information

Optimisation Myths and Facts as Seen in Statistical Physics

Optimisation Myths and Facts as Seen in Statistical Physics Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY

More information

The Polyhedral Model Is More Widely Applicable Than You Think

The Polyhedral Model Is More Widely Applicable Than You Think The Polyhedral Model Is More Widely Applicable Than You Think Mohamed-Walid Benabderrahmane 1 Louis-Noël Pouchet 1,2 Albert Cohen 1 Cédric Bastoul 1 1 ALCHEMY group, INRIA Saclay / University of Paris-Sud

More information

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2

Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting

More information

A Framework for Safe Automatic Data Reorganization

A Framework for Safe Automatic Data Reorganization Compiler Technology A Framework for Safe Automatic Data Reorganization Shimin Cui (Speaker), Yaoqing Gao, Roch Archambault, Raul Silvera IBM Toronto Software Lab Peng Peers Zhao, Jose Nelson Amaral University

More information

Hidden Surface Elimination: BSP trees

Hidden Surface Elimination: BSP trees Hidden Surface Elimination: BSP trees Outline Binary space partition (BSP) trees Polygon-aligned 1 BSP Trees Basic idea: Preprocess geometric primitives in scene to build a spatial data structure such

More information

Advanced Parallel Programming II

Advanced Parallel Programming II Advanced Parallel Programming II Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Introduction to Vectorization RISC Software GmbH Johannes Kepler

More information

Optimizations of BLIS Library for AMD ZEN Core

Optimizations of BLIS Library for AMD ZEN Core Optimizations of BLIS Library for AMD ZEN Core 1 Introduction BLIS [1] is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries [2] The framework was

More information

A distributed rendering architecture for ray tracing large scenes on commodity hardware. FlexRender. Bob Somers Zoe J.

A distributed rendering architecture for ray tracing large scenes on commodity hardware. FlexRender. Bob Somers Zoe J. FlexRender A distributed rendering architecture for ray tracing large scenes on commodity hardware. GRAPP 2013 Bob Somers Zoe J. Wood Increasing Geometric Complexity Normal Maps artifacts on silhouette

More information

High-Performance Ray Tracing

High-Performance Ray Tracing Lecture 16: High-Performance Ray Tracing Computer Graphics CMU 15-462/15-662, Fall 2015 Ray tracing is a mechanism for answering visibility queries v(x 1,x 2 ) = 1 if x 1 is visible from x 2, 0 otherwise

More information

Embree Ray Tracing Kernels: Overview and New Features

Embree Ray Tracing Kernels: Overview and New Features Embree Ray Tracing Kernels: Overview and New Features Attila Áfra, Ingo Wald, Carsten Benthin, Sven Woop Intel Corporation Intel, the Intel logo, Intel Xeon Phi, Intel Xeon Processor are trademarks of

More information

Accelerating Geometric Queries. Computer Graphics CMU /15-662, Fall 2016

Accelerating Geometric Queries. Computer Graphics CMU /15-662, Fall 2016 Accelerating Geometric Queries Computer Graphics CMU 15-462/15-662, Fall 2016 Geometric modeling and geometric queries p What point on the mesh is closest to p? What point on the mesh is closest to p?

More information

Towards a Holistic Approach to Auto-Parallelization

Towards a Holistic Approach to Auto-Parallelization Towards a Holistic Approach to Auto-Parallelization Integrating Profile-Driven Parallelism Detection and Machine-Learning Based Mapping Georgios Tournavitis, Zheng Wang, Björn Franke and Michael F.P. O

More information

Program Optimization Through Loop Vectorization

Program Optimization Through Loop Vectorization Program Optimization Through Loop Vectorization María Garzarán, Saeed Maleki William Gropp and David Padua Department of Computer Science University of Illinois at Urbana-Champaign Simple Example Loop

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information

MAQAO hands-on exercises

MAQAO hands-on exercises MAQAO hands-on exercises Perf: generic profiler Perf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Recompile NPB-MZ with dynamic if using cray compiler #---------------------------------------------------------------------------

More information

Kevin O Leary, Intel Technical Consulting Engineer

Kevin O Leary, Intel Technical Consulting Engineer Kevin O Leary, Intel Technical Consulting Engineer Moore s Law Is Going Strong Hardware performance continues to grow exponentially We think we can continue Moore's Law for at least another 10 years."

More information

Efficient Clustered BVH Update Algorithm for Highly-Dynamic Models. Kirill Garanzha

Efficient Clustered BVH Update Algorithm for Highly-Dynamic Models. Kirill Garanzha Symposium on Interactive Ray Tracing 2008 Los Angeles, California Efficient Clustered BVH Update Algorithm for Highly-Dynamic Models Kirill Garanzha Department of Software for Computers Bauman Moscow State

More information

Sarah Knepper. Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018

Sarah Knepper. Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018 Sarah Knepper Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018 Outline Motivation Problem statement and solutions Simple example Performance comparison 2 Motivation Partial differential equations

More information

Parallel Programming Case Studies

Parallel Programming Case Studies Lecture 8: Parallel Programming Case Studies Parallel Computer Architecture and Programming CMU 5-48/5-68, Spring 205 Tunes The Lumineers Ho Hey (The Lumineers) Parallelism always seems to dominate the

More information

Memory access patterns. 5KK73 Cedric Nugteren

Memory access patterns. 5KK73 Cedric Nugteren Memory access patterns 5KK73 Cedric Nugteren Today s topics 1. The importance of memory access patterns a. Vectorisation and access patterns b. Strided accesses on GPUs c. Data re-use on GPUs and FPGA

More information

High Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization

High Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization High Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization 1 Introduction The purpose of this lab assignment is to give some experience in using SIMD instructions on x86 and getting compiler

More information

Line Segment Intersection Dmitriy V'jukov

Line Segment Intersection Dmitriy V'jukov Line Segment Intersection Dmitriy V'jukov 1. Problem Statement Write a threaded code to find pairs of input line segments that intersect within three-dimensional space. Line segments are defined by 6 integers

More information

COMP371 COMPUTER GRAPHICS

COMP371 COMPUTER GRAPHICS COMP371 COMPUTER GRAPHICS LECTURE 14 RASTERIZATION 1 Lecture Overview Review of last class Line Scan conversion Polygon Scan conversion Antialiasing 2 Rasterization The raster display is a matrix of picture

More information

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines

CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome!

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome! /INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2018 - Lecture 10: GPGPU (3) Welcome! Today s Agenda: Don t Trust the Template The Prefix Sum Parallel Sorting Stream Filtering Optimizing GPU

More information

Visible Surface Detection Methods

Visible Surface Detection Methods Visible urface Detection Methods Visible-urface Detection identifying visible parts of a scene (also hidden- elimination) type of algorithm depends on: complexity of scene type of objects available equipment

More information

CS4961 Parallel Programming. Lecture 7: Introduction to SIMD 09/14/2010. Homework 2, Due Friday, Sept. 10, 11:59 PM. Mary Hall September 14, 2010

CS4961 Parallel Programming. Lecture 7: Introduction to SIMD 09/14/2010. Homework 2, Due Friday, Sept. 10, 11:59 PM. Mary Hall September 14, 2010 Parallel Programming Lecture 7: Introduction to SIMD Mary Hall September 14, 2010 Homework 2, Due Friday, Sept. 10, 11:59 PM To submit your homework: - Submit a PDF file - Use the handin program on the

More information

Lightcuts. Jeff Hui. Advanced Computer Graphics Rensselaer Polytechnic Institute

Lightcuts. Jeff Hui. Advanced Computer Graphics Rensselaer Polytechnic Institute Lightcuts Jeff Hui Advanced Computer Graphics 2010 Rensselaer Polytechnic Institute Fig 1. Lightcuts version on the left and naïve ray tracer on the right. The lightcuts took 433,580,000 clock ticks and

More information

Cost Modelling for Vectorization on ARM

Cost Modelling for Vectorization on ARM Cost Modelling for Vectorization on ARM Angela Pohl, Biagio Cosenza and Ben Juurlink ARM Research Summit 2018 Challenges of Auto-Vectorization in Compilers 1. Is it possible to vectorize the code? Passes:

More information

Computer Graphics. - Rasterization - Philipp Slusallek

Computer Graphics. - Rasterization - Philipp Slusallek Computer Graphics - Rasterization - Philipp Slusallek Rasterization Definition Given some geometry (point, 2D line, circle, triangle, polygon, ), specify which pixels of a raster display each primitive

More information

Lecture 4. Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy

Lecture 4. Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy Lecture 4 Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy Partners? Announcements Scott B. Baden / CSE 160 / Winter 2011 2 Today s lecture Why multicore? Instruction

More information

Introduction to Runtime Systems

Introduction to Runtime Systems Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents

More information