AUTOMATIC VECTORIZATION OF TREE TRAVERSALS
|
|
- Christine Pierce
- 6 years ago
- Views:
Transcription
1 AUTOMATIC VECTORIZATION OF TREE TRAVERSALS Youngjoon Jo, Michael Goldfarb and Milind Kulkarni PACT, Edinburgh, U.K. September 11 th, 2013
2 Youngjoon Jo 2 Commodity processors and SIMD Commodity processors support SIMD (Single Instruction Multiple Data) instructions MMX (1996), SSE (1999), SSE2 (2001), SSE3 (2004), SSE4 (2006), AVX (2011), AVX2 (2013) SIMD width getting wider AVX is 256bit Upcoming AVX-512 to be 512bit (2015) Using SIMD is an excellent way to improve performance
3 Youngjoon Jo 3 SIMD works great for regular loops for (int i = 0; i < 4; i++) { c[i] = a[i] + b[i]; }
4 Youngjoon Jo 4 SIMD works great for regular loops for (int i = 0; i < 4; i++) { c[i] = a[i] + b[i]; } m128 vec_a = _mm_load_ps(a); m128 vec_b = _mm_load_ps(b); m128 vec_c = _mm_add_ps(vec_a, vec_b); _mm_store_ps(c, vec_c);
5 Youngjoon Jo 5 But not so well on irregular codes void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } }
6 Youngjoon Jo 6 But not so well on irregular codes void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } }
7 Youngjoon Jo 7 Automatic vectorization desired void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } } Automatic vectorization techniques for irregular codes highly desired
8 Youngjoon Jo 8 Automatic vectorization desired void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } } Automatic vectorization techniques for tree traversals highly desired
9 Tree codes are important Youngjoon Jo 9
10 Tree codes are important Youngjoon Jo 10
11 Tree codes are important Youngjoon Jo 11
12 Youngjoon Jo 12 Tree vectorization challenges Non trivial to find vectorizable computation Difficult to keep vectorizable computation together
13 Youngjoon Jo 13 Previous tree vectorization work Non trivial to find vectorizable computation Manually transform code to packetize traversals Process multiple traversals in packet simultaneously Wald et. al. [Computer Graphics Forum 2001] Difficult to keep vectorizable computation together
14 Youngjoon Jo 14 Previous tree vectorization work Non trivial to find vectorizable computation In situations like physical simulation, collision detection or raytracing in scenes, where rays bounce into multiple directions (spherical or bumpmapped surfaces), coherent ray packets break down very quickly to single rays or do not exist at all. In the above mentioned tasks, packet oriented SIMD computations is much less useful. Manually transform code to packetize traversals Process multiple points in packet simultaneously Wald et. al. [Computer Graphics Forum 2001] Kim et. al. [SIGMOD 2010] Havel and Herout [IEEE Transactions on Visualization and Computer Graphics 2010] Difficult to keep vectorizable computation together
15 Youngjoon Jo 15 Previous tree vectorization work Non trivial to find vectorizable computation Manually transform code to packetize traversals Process multiple points in packet simultaneously Wald et. al. [Computer Graphics Forum 2001] Difficult to keep vectorizable computation together Look to alternative sources of vectorization Pixar [RT 2006] Dammertz et. al. [EGSR 2008] Kim et. al. [SIGMOD 2010] Chhugani et. al. [SC 2012]
16 Youngjoon Jo 16 Our previous tree locality work Point blocking Jo and Kulkarni [OOPSLA 2011] Traversal splicing Jo and Kulkarni [OOPSLA 2012]
17 Youngjoon Jo 17 Our solution Non trivial to find vectorizable computation Manually transform code to packetize traversals Automatically packetize traversals with point blocking and a novel layout transformation Difficult to keep vectorizable computation together Look to alternative sources of vectorization Exploit dynamic sorting of traversal splicing to dramatically enhance utilization
18 Youngjoon Jo 18 Contributions Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Propose a novel layout transformation for efficient vectorization of tree codes Present a framework for automatically restructuring traversals and data layouts to enable vectorization
19 Youngjoon Jo 19 Contributions Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Spoiler alert Propose a novel layout transformation for efficient vectorization of tree codes SIMTree can deliver speedups of Present a framework for automatically restructuring traversals and data layouts to enable vectorization up to 6.59, and 2.78 on average
20 Youngjoon Jo 20 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion
21 Youngjoon Jo 21 Tree traversals void main() { Ray *rays[n] = // rays to trace Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(rays[i], root); } } void recurse(ray *r, Node *n) { if (truncate(r, n)) return; if (n->isleaf()) { update(r, n); } else { recurse(r, n->left); recurse(r, n->right); } }
22 Youngjoon Jo 22 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }
23 Youngjoon Jo 23 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }
24 Youngjoon Jo 24 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }
25 Youngjoon Jo 25 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }
26 Youngjoon Jo 26 Tree traversals void main() { Point *points[n] = // entities to traverse tree Node *root = // root of tree for (int i = 0; i < N; i++) { recurse(points[i], root); } } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }
27 Youngjoon Jo 27 An abstract model void main() { foreach(point p : points) { foreach(node n : p.oraclenodes()) { update(p, n); } } }
28 Youngjoon Jo 28 Iteration space of traversal void main() { foreach(point p : points) { foreach(node n : p.oraclenodes()) { update(p, n); } } Nodes } Points
29 Youngjoon Jo 29 Iteration space of traversal Points A B C D E F G H Nodes
30 Youngjoon Jo 30 Iteration space of traversal Points A B C D E F G H Nodes
31 Youngjoon Jo 31 How to vectorize? Points A B C D E F G H Nodes
32 Youngjoon Jo 32 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion
33 Youngjoon Jo 33 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes
34 Youngjoon Jo 34 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes
35 Youngjoon Jo 35 Point blocked code void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }
36 Youngjoon Jo 36 Point blocked code void recurse(block *block, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }
37 Youngjoon Jo 37 Point blocked code void recurse(block *block, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { Function update(p, n); body } else { recurse(p, n->left); recurse(p, n->right); } }
38 Youngjoon Jo 38 Point blocked code Loop over points in block void recurse(block *block, Node *n) { for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } } Function body
39 Youngjoon Jo 39 Point blocked code Loop over points in block void recurse(block *block, Node *n) { for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } } Function body
40 Youngjoon Jo 40 Point blocked code void recurse(block *block, Node *n) { Block *nextblock = // next level block for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { nextblock->add(p); } } } Loop over points in block Function body
41 Youngjoon Jo 41 Point blocked code void recurse(block *block, Node *n) { Block *nextblock = // next level block for (int i = 0; i = block->size; i++) { Point *p = block->p[i]; if (truncate(p, n)) continue; if (n->isleaf()) { update(p, n); } else { nextblock->add(p); } } if (nextblock->size > 0) { recurse(nextblock, n->left); recurse(nextblock, n->right); } } Loop over points in block Function body Next block recurses children
42 Youngjoon Jo 42 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes
43 Youngjoon Jo 43 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes
44 Youngjoon Jo 44 Point blocking [OOPSLA 2011] Points A B C D E F G H Nodes
45 Youngjoon Jo 45 Analogous to packet SIMD Points A B C D E F G H Nodes
46 Youngjoon Jo 46 Analogous to packet SIMD Points A B C D E F G H Nodes Breaks down when points diverge
47 Youngjoon Jo 47 Packet SIMD has poor utilization Points A B C D E F G H Nodes Full SIMD Partial SIMD
48 Youngjoon Jo 48 Packet SIMD has poor utilization Points A B C D E F G H Nodes Full SIMD Partial SIMD
49 Youngjoon Jo 49 SIMD utilization Points A B C D E F G H Nodes Full SIMD SIMD utilization Partial SIMD = Work in full SIMD / Total work
50 Youngjoon Jo 50 SIMD utilization Points A B C D E F G H Nodes Work in full SIMD / Total work Full SIMD Partial SIMD = Circles in blue / Total circles = 24 / 74 = 0.32
51 Youngjoon Jo 51 Use larger block size Points A B C D E F G H Nodes Full SIMD Partial SIMD Use block size larger than SIMD width and compact points
52 Youngjoon Jo 52 Use larger block size Points A B C D E F G H Nodes
53 Youngjoon Jo 53 Better utilization with larger block size Points A B C D E F G H Nodes Full SIMD Partial SIMD
54 Youngjoon Jo 54 Better utilization with larger block size Points A B C D E F G H Nodes Full SIMD Partial SIMD Circles in blue / Total circles = 64 / 74 = 0.86
55 Youngjoon Jo 55 SIMD utilization Block size 1 SIMD Utilization Block Size Barnes-Hut Point Correlation Nearest Neighbor Vantage Point Photon Mapping
56 Youngjoon Jo 56 Ideal utilization Ideal Utilization 1 SIMD Utilization 0.8 Barnes-Hut 0.6 Point Correlation 0.4 Nearest Neighbor Vantage Point Photon Mapping Block Size Block size equal to total points yields ideal SIMD utilization
57 Youngjoon Jo 57 Use max block Problem solved? 1 SIMD Utilization Block Size Barnes-Hut Point Correlation Nearest Neighbor Vantage Point Photon Mapping
58 Youngjoon Jo 58 Large block has poor locality 1 SIMD Utilization Block Size Barnes-Hut Point Correlation Nearest Neighbor Vantage Point Photon Mapping
59 Youngjoon Jo 59 Large block has poor locality SIMD Utilization Barnes-Hut 0.6 Point Correlation 0.4 Nearest Neighbor Vantage Point Photon Mapping Block Size Need schedule with good utilization and good locality
60 Youngjoon Jo 60 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion
61 Youngjoon Jo 61 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes
62 Youngjoon Jo 62 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes
63 Youngjoon Jo 63 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node
64 Youngjoon Jo 64 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node
65 Youngjoon Jo 65 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node
66 Youngjoon Jo 66 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node
67 Youngjoon Jo 67 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished
68 Youngjoon Jo 68 Traversal splicing [OOPSLA 2012] Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished
69 Youngjoon Jo 69 Can change order of points Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished We can change the order of paused points, but how?
70 Youngjoon Jo 70 Dynamic sorting Points 4 A B C D E F G H 8 9 Nodes Insight: points which reach same nodes are likely to have similar 2 5 traversals in future Designate splice nodes 2. Traverse up to splice node 3. Resume at next node 4. Repeat 2-3 until finished Dynamic sorting on traversal history
71 Youngjoon Jo 71 Dynamic sorting Points A B C D E F G H Nodes Designate splice nodes 2. Traverse up to splice node
72 Youngjoon Jo 72 Dynamic sorting Points A B C D E F G H Nodes A B C D E F G H Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node
73 Youngjoon Jo 73 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node
74 Youngjoon Jo 74 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G A C E F H B D G Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node 4. Resume at next node
75 Youngjoon Jo 75 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node 4. Resume at next node 5. Repeat 2-4 until finished
76 Youngjoon Jo 76 Dynamic sorting Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Designate splice nodes 2. Traverse up to splice node 3. Reorder points at splice node 4. Resume at next node 5. Repeat 2-4 until finished
77 Youngjoon Jo 77 Dynamic sorting enhances utilization Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Partial SIMD Full SIMD
78 Youngjoon Jo 78 Dynamic sorting enhances utilization Points A B C D E F G H Nodes A C E F H B D G E B D G A C F H Partial SIMD Full SIMD Circles in blue / Total circles = 48 / 74 =
79 Youngjoon Jo 79 SIMD utilization splice depth SIMD Utilization Block Size N/A Nearest Neighbor
80 Youngjoon Jo 80 SIMD utilization splice depth Block size: 512 Splice depth: 10 SIMD Utilization Block Size N/A 2 4 Block 6 size: Nearest Neighbor
81 Youngjoon Jo 81 SIMD utilization Utilization Baseline A Priori Sort Dynamic Sort Ideal
82 Youngjoon Jo 82 SIMD utilization Utilization Baseline A Priori Sort Dynamic Sort Ideal Dynamic sorting can automatically extract almost the maximum amount of SIMD utilization
83 Youngjoon Jo 83 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion
84 Youngjoon Jo 84 Automatic transformation Point blocking Jo and Kulkarni [OOPSLA 2011] Traversal splicing Jo and Kulkarni [OOPSLA 2012]
85 Youngjoon Jo 85 Automatic transformation Our key addition for SIMD: Layout transformation from AoS (array of structures) to SoA (structure of arrays) + Allows vector load/stores + Packed data has better spatial locality - More overhead in moving data AoS (array of structures) x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 SoA (structure of arrays) x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
86 Youngjoon Jo 86 AoS to SoA layout Whole program AoS to SoA layout transformation difficult to automate with aliasing Limit scope to traversal code only Copy in to SoA before traversal Copy out to AoS after traversal Inter-procedural, flow-insensitive analysis Determine which point fields should be SoA Conservatively ensure correctness
87 Youngjoon Jo 87 AoS to SoA layout void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }
88 Youngjoon Jo 88 AoS to SoA layout struct Point { float f1, f2, f3; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }
89 Youngjoon Jo 89 AoS to SoA layout struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } }
90 Youngjoon Jo 90 AoS to SoA layout struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }
91 Youngjoon Jo 91 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } void recurse(point *p, Node *n) { if (truncate(p, n)) return; if (n->isleaf()) { update(p, n); } else { recurse(p, n->left); recurse(p, n->right); } } bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }
92 Youngjoon Jo 92 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } f1 f2 f3 Point-access Non-point-access Read Write Read Write bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }
93 Youngjoon Jo 93 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } f1 f2 f3 Point-access Non-point-access Read Write Read Write bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }
94 Youngjoon Jo 94 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }
95 Youngjoon Jo 95 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }
96 Youngjoon Jo 96 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }
97 Youngjoon Jo 97 Transforming SoA fields struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; }
98 Youngjoon Jo 98 Transforming SoA fields struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(block *block, int bi, Node *n) { return block->f1[bi] == n->point->f1; } void update(block *block, int bi, Node *n) { block->f2[bi] += n->point->f3; }
99 Youngjoon Jo 99 Correctness violation example struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(block *block, int bi, Node *n) { return block->f1[bi] == n->point->f1; } void update(block *block, int bi, Node *n) { block->f2[bi] += n->point->f2; }
100 Youngjoon Jo 100 Ensuring correctness struct Point { float f1, f2, f3; } struct Node { Node *left, *right; Point *point; } Point-access Non-point-access Read Write Read Write f1 f2 f3 bool truncate(point *p, Node *n) { return p->f1 == n->point->f1; } void update(point *p, Node *n) { p->f2 += n->point->f3; Sound analysis conservatively proves SoA transformation correct. Suffices to transform all of our } p->f3 = 1; benchmarks.
101 Youngjoon Jo 101 SIMTree Implementation of analysis and transformation in a source to source C++ compiler Based on ROSE compiler infrastructure Transforms code to apply point blocking, traversal splicing, and SoA layout Does not perform the vectorization itself
102 Youngjoon Jo 102 Outline Example & Abstract Model Point Blocking to Enable SIMD Traversal Splicing to Enhance Utilization Automatic Transformation Evaluation and Conclusion
103 Youngjoon Jo 103 Evaluation Five benchmarks Barnes-Hut, Point Correlation, Nearest Neighbor, Vantage Point, Photon Mapping Real and random inputs form 17 benchmark/inputs Two machines Intel Xeon E AMD Opteron 6282 Automatic transformation with SIMTree Manual vectorization of transformed code with 4-way SIMD intrinsics for best performance Auto vectorization of transformed code with icc gets 84% of best performance
104 Youngjoon Jo 104 Speedup on Xeon 7 6 Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] Block+SIMD+Splice
105 Youngjoon Jo Speedup on Xeon Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] PacketSIMD Block Block+SIMD+Splice Block +SIMD Block +Splice Xeon Block+SIMD +Splice
106 Youngjoon Jo Dynamic sorting makes SIMD profitable Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] PacketSIMD Block Block+SIMD+Splice Block +SIMD Block +Splice Xeon Opteron Block+SIMD +Splice
107 Youngjoon Jo 107 Dynamic sorting makes SIMD profitable 7 6 Speedup Geometric means Packet SIMD [CGF 2001] Block [OOPSLA 2011] Block+SIMD Block+Splice [OOPSLA 2012] PacketSIMD Block Block+SIMD+Splice Block +SIMD Block +Splice Xeon Opteron Block+SIMD +Splice
108 Youngjoon Jo 108 Instruction counts: Opteron 2.5 Insturction Ratio Block BlockSoA BlockSoA+SIMD Block+Splice BlockSoA+Splice BlockSoA+SIMD+Splice Block BlockSoA BlockSoA +SIMD Block +Splice BlockSoA +Splice BlockSoA +SIMD+Splice
109 Youngjoon Jo 109 Cycles per instruction: Opteron Cycles Per Instruction Base Block BlockSoA Block+Splice BlockSoA+Splice Base Block BlockSoA Block+Splice BlockSoA+Splice
110 Youngjoon Jo 110 Conclusion Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Propose a novel layout transformation for efficient vectorization of tree codes Present a framework for automatically restructuring traversals and data layouts to enable vectorization
111 Youngjoon Jo 111 Conclusion Show how tree traversal codes can be systematically transformed to Expose SIMD opportunities Enhance utilization Propose a novel layout transformation for efficient vectorization of tree codes Present a framework for automatically restructuring traversals and data layouts to enable vectorization SIMTree is open source simtree/
112 AUTOMATIC VECTORIZATION OF TREE TRAVERSALS Youngjoon Jo, Michael Goldfarb and Milind Kulkarni PACT, Edinburgh, U.K. September 11 th, 2013
General Transformations for GPU Execution of Tree Traversals
eneral Transformations for PU Execution of Tree Traversals Michael oldfarb*, Youngjoon Jo**, Milind Kulkarni School of Electrical and Computer Engineering * Now at Qualcomm; ** Now at oogle PU execution
More informationAutomatically Enhancing Locality for Tree Traversals with Traversal Splicing
Purdue University Purdue e-pubs ECE Technical Reports Electrical and Computer Engineering 2-15-2012 Automatically Enhancing Locality for Tree Traversals with Traversal Splicing Youngjoon Jo Electrical
More informationAutomatically Optimizing Tree Traversal Algorithms
Purdue University Purdue e-pubs Open Access Dissertations Theses and Dissertations Fall 2013 Automatically Optimizing Tree Traversal Algorithms Youngjoon Jo Purdue University Follow this and additional
More informationEnhancing Locality for Recursive Traversals of Recursive Structures
Enhancing Locality for Recursive Traversals of Recursive Structures Youngjoon Jo and Milind Kulkarni School of Electrical and Computer Engineering Purdue University {yjo,milind}@purdue.edu Abstract While
More informationHigh Performance Computing: Tools and Applications
High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 8 Processor-level SIMD SIMD instructions can perform
More informationRay Tracing. Computer Graphics CMU /15-662, Fall 2016
Ray Tracing Computer Graphics CMU 15-462/15-662, Fall 2016 Primitive-partitioning vs. space-partitioning acceleration structures Primitive partitioning (bounding volume hierarchy): partitions node s primitives
More information/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 5: SIMD (1) Welcome!
/INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2018 - Lecture 5: SIMD (1) Welcome! INFOMOV Lecture 5 SIMD (1) 2 Meanwhile, on ars technica INFOMOV Lecture 5 SIMD (1) 3 Meanwhile, the job market
More informationGeneral Transformations for GPU Execution of Tree Traversals
General Transformations for GPU Execution of Tree Traversals Michael Goldfarb, Youngjoon Jo and Milind Kulkarni School of Electrical and Computer Engineering Purdue University West Lafayette, IN {mgoldfar,
More informationTreelogy: A Benchmark Suite for Tree Traversals
Purdue University Programming Languages Group Treelogy: A Benchmark Suite for Tree Traversals Nikhil Hegde, Jianqiao Liu, Kirshanthan Sundararajah, and Milind Kulkarni School of Electrical and Computer
More informationLoop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization
Loop-Oriented Array- and Field-Sensitive Pointer Analysis for Automatic SIMD Vectorization Yulei Sui, Xiaokang Fan, Hao Zhou and Jingling Xue School of Computer Science and Engineering The University of
More informationUsing Intel Streaming SIMD Extensions for 3D Geometry Processing
Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,
More informationMilind Kulkarni Research Statement
Milind Kulkarni Research Statement With the increasing ubiquity of multicore processors, interest in parallel programming is again on the upswing. Over the past three decades, languages and compilers researchers
More informationBandwidth Avoiding Stencil Computations
Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu
More informationINFOMAGR Advanced Graphics. Jacco Bikker - February April Welcome!
INFOMAGR Advanced Graphics Jacco Bikker - February April 2016 Welcome! I x, x = g(x, x ) ε x, x + S ρ x, x, x I x, x dx Today s Agenda: Introduction Ray Distributions The Top-level BVH Real-time Ray Tracing
More informationLecture 4 - Real-time Ray Tracing
INFOMAGR Advanced Graphics Jacco Bikker - November 2017 - February 2018 Lecture 4 - Real-time Ray Tracing Welcome! I x, x = g(x, x ) ε x, x + න S ρ x, x, x I x, x dx Today s Agenda: Introduction Ray Distributions
More informationINFOGR Computer Graphics. J. Bikker - April-July Lecture 11: Acceleration. Welcome!
INFOGR Computer Graphics J. Bikker - April-July 2015 - Lecture 11: Acceleration Welcome! Today s Agenda: High-speed Ray Tracing Acceleration Structures The Bounding Volume Hierarchy BVH Construction BVH
More informationSSE and SSE2. Timothy A. Chagnon 18 September All images from Intel 64 and IA 32 Architectures Software Developer's Manuals
SSE and SSE2 Timothy A. Chagnon 18 September 2007 All images from Intel 64 and IA 32 Architectures Software Developer's Manuals Overview SSE: Streaming SIMD (Single Instruction Multiple Data) Extensions
More informationObject Support in an Array-based GPGPU Extension for Ruby
Object Support in an Array-based GPGPU Extension for Ruby ARRAY 16 Matthias Springer, Hidehiko Masuhara Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology June 14, 2016 Object
More informationProgramming Models for Multi- Threading. Brian Marshall, Advanced Research Computing
Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows
More informationSIMD Exploitation in (JIT) Compilers
SIMD Exploitation in (JIT) Compilers Hiroshi Inoue, IBM Research - Tokyo 1 What s SIMD? Single Instruction Multiple Data Same operations applied for multiple elements in a vector register input 1 A0 input
More informationExploiting Local Orientation Similarity for Efficient Ray Traversal of Hair and Fur
1 Exploiting Local Orientation Similarity for Efficient Ray Traversal of Hair and Fur Sven Woop, Carsten Benthin, Ingo Wald, Gregory S. Johnson Intel Corporation Eric Tabellion DreamWorks Animation 2 Legal
More informationA Static Cut-off for Task Parallel Programs
A Static Cut-off for Task Parallel Programs Shintaro Iwasaki, Kenjiro Taura Graduate School of Information Science and Technology The University of Tokyo September 12, 2016 @ PACT '16 1 Short Summary We
More informationAn Evaluation of Multi-Hit Ray Traversal in a BVH Using Existing First-Hit/Any-Hit Kernels: Algorithm Listings and Performance Visualizations
Journal of Computer Graphics Techniques in a BVH Using Existing First-Hit/Any-Hit Kernels: Algorithm Listings and Performance Visualizations This document provides algorithm listings for multi-hit ray
More informationVectorization on KNL
Vectorization on KNL Steve Lantz Senior Research Associate Cornell University Center for Advanced Computing (CAC) steve.lantz@cornell.edu High Performance Computing on Stampede 2, with KNL, Jan. 23, 2017
More informationGuy Blank Intel Corporation, Israel March 27-28, 2017 European LLVM Developers Meeting Saarland Informatics Campus, Saarbrücken, Germany
Guy Blank Intel Corporation, Israel March 27-28, 2017 European LLVM Developers Meeting Saarland Informatics Campus, Saarbrücken, Germany Motivation C AVX2 AVX512 New instructions utilized! Scalar performance
More informationKismet: Parallel Speedup Estimates for Serial Programs
Kismet: Parallel Speedup Estimates for Serial Programs Donghwan Jeon, Saturnino Garcia, Chris Louie, and Michael Bedford Taylor Computer Science and Engineering University of California, San Diego 1 Questions
More informationEDAN30 Photorealistic Computer Graphics. Seminar 2, Bounding Volume Hierarchy. Magnus Andersson, PhD student
EDAN30 Photorealistic Computer Graphics Seminar 2, 2012 Bounding Volume Hierarchy Magnus Andersson, PhD student (magnusa@cs.lth.se) This seminar We want to go from hundreds of triangles to thousands (or
More informationWarped parallel nearest neighbor searches using kd-trees
Warped parallel nearest neighbor searches using kd-trees Roman Sokolov, Andrei Tchouprakov D4D Technologies Kd-trees Binary space partitioning tree Used for nearest-neighbor search, range search Application:
More informationLectures Parallelism
Lectures 24-25 Parallelism 1 Pipelining vs. Parallel processing In both cases, multiple things processed by multiple functional units Pipelining: each thing is broken into a sequence of pieces, where each
More informationImage Processing Acceleration Techniques using Intel Streaming SIMD Extensions and Intel Advanced Vector Extensions
Image Processing Acceleration Techniques using Intel Streaming SIMD Extensions and Intel Advanced Vector Extensions September 4, 2009 Authors: Petter Larsson & Eric Palmer INFORMATION IN THIS DOCUMENT
More informationLocality-enhancing loop transformations for parallel tree traversal algorithms
Purdue University Purdue e-pubs ECE Technical Reports Electrical and Computer Engineering -15-11 Locality-enhancing loop transformations for parallel tree traversal algorithms Youngjoon Jo Electrical and
More informationIntroduction to tuning on many core platforms. Gilles Gouaillardet RIST
Introduction to tuning on many core platforms Gilles Gouaillardet RIST gilles@rist.or.jp Agenda Why do we need many core platforms? Single-thread optimization Parallelization Conclusions Why do we need
More informationProgrammazione Avanzata
Programmazione Avanzata Vittorio Ruggiero (v.ruggiero@cineca.it) Roma, Marzo 2017 Pipeline Outline CPU: internal parallelism? CPU are entirely parallel pipelining superscalar execution units SIMD MMX,
More informationCSC766: Project - Fall 2010 Vivek Deshpande and Kishor Kharbas Auto-vectorizating compiler for pseudo language
CSC766: Project - Fall 2010 Vivek Deshpande and Kishor Kharbas Auto-vectorizating compiler for pseudo language Table of contents 1. Introduction 2. Implementation details 2.1. High level design 2.2. Data
More informationAdaptive Assignment for Real-Time Raytracing
Adaptive Assignment for Real-Time Raytracing Paul Aluri [paluri] and Jacob Slone [jslone] Carnegie Mellon University 15-418/618 Spring 2015 Summary We implemented a CUDA raytracer accelerated by a non-recursive
More informationHARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA
HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh
More informationGetting Started with Intel Cilk Plus SIMD Vectorization and SIMD-enabled functions
Getting Started with Intel Cilk Plus SIMD Vectorization and SIMD-enabled functions Introduction SIMD Vectorization and SIMD-enabled Functions are a part of Intel Cilk Plus feature supported by the Intel
More informationEmbree Ray Tracing Kernels for the Intel Xeon and Intel Xeon Phi Architectures. Sven Woop, Carsten Benthin, Ingo Wald Intel
Embree Ray Tracing Kernels for the Intel Xeon and Intel Xeon Phi Architectures Sven Woop, Carsten Benthin, Ingo Wald Intel 1 2 Legal Disclaimer and Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED
More informationARE WE OPTIMIZING HARDWARE FOR
ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? PARSEC S VECTORIZATION EFFECTS ON ENERGY EFFICIENCY AND ARCHITECTURAL REQUIREMENTS Juan M. Cebrián 1 1 Depart. of Computer and Information Science
More informationRay Tracing Performance
Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel Ray Tracing Performance Zero to Millions in 45 Minutes?! Gordon Stoll, Intel Goals for this talk Goals point you toward the current
More informationMulti Agent Navigation on GPU. Avi Bleiweiss
Multi Agent Navigation on GPU Avi Bleiweiss Reasoning Explicit Implicit Script, storytelling State machine, serial Compute intensive Fits SIMT architecture well Navigation planning Collision avoidance
More informationIntersection Acceleration
Advanced Computer Graphics Intersection Acceleration Matthias Teschner Computer Science Department University of Freiburg Outline introduction bounding volume hierarchies uniform grids kd-trees octrees
More informationTiling: A Data Locality Optimizing Algorithm
Tiling: A Data Locality Optimizing Algorithm Announcements Monday November 28th, Dr. Sanjay Rajopadhye is talking at BMAC Friday December 2nd, Dr. Sanjay Rajopadhye will be leading CS553 Last Monday Kelly
More informationEfficient Image-Based Methods for Rendering Soft Shadows. Hard vs. Soft Shadows. IBR good for soft shadows. Shadow maps
Efficient Image-Based Methods for Rendering Soft Shadows Hard vs. Soft Shadows Maneesh Agrawala Ravi Ramamoorthi Alan Heirich Laurent Moll Pixar Animation Studios Stanford University Compaq Computer Corporation
More informationAutotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT
Autotuning John Cavazos University of Delaware What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. Search can be Quasi-intelligent: genetic
More informationGeometric Structures 2. Quadtree, k-d stromy
Geometric Structures 2. Quadtree, k-d stromy Martin Samuelčík samuelcik@sccg.sk, www.sccg.sk/~samuelcik, I4 Window and Point query From given set of points, find all that are inside of given d-dimensional
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate
More informationParallel Programming Case Studies
Lecture 8: Parallel Programming Case Studies Parallel Computer Architecture and Programming CMU 5-48/5-68, Spring 207 Tunes Beyoncé (featuring Jack White) Don t Hurt Yourself (Lemonade) I suggest 48 students
More informationARMv8-A Scalable Vector Extension for Post-K. Copyright 2016 FUJITSU LIMITED
ARMv8-A Scalable Vector Extension for Post-K 0 Post-K Supports New SIMD Extension The SIMD extension is a 512-bit wide implementation of SVE SVE is an HPC-focused SIMD instruction extension in AArch64
More informationImproving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm
Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm Department of Computer Science and Engineering Sogang University, Korea Improving Memory
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationRTfact: Concepts for Generic and High Performance Ray Tracing
RTfact: Concepts for Generic and High Performance Ray Tracing Ray tracers are used in RT08 papers Change packet size? Change data structures? No common software base No tools for writing composable software
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationGPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC
GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC MIKE GOWANLOCK NORTHERN ARIZONA UNIVERSITY SCHOOL OF INFORMATICS, COMPUTING & CYBER SYSTEMS BEN KARSIN UNIVERSITY OF HAWAII AT MANOA DEPARTMENT
More informationCase Study: Matrix Multiplication. 6.S898: Advanced Performance Engineering for Multicore Applications February 22, 2017
Case Study: Matrix Multiplication 6.S898: Advanced Performance Engineering for Multicore Applications February 22, 2017 1 4k-by-4k Matrix Multiplication Version Implementation Running time (s) GFLOPS Absolute
More informationEE/CSCI 451 Midterm 1
EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming
More informationDynamic SIMD Scheduling
Dynamic SIMD Scheduling Florian Wende SC15 MIC Tuning BoF November 18 th, 2015 Zuse Institute Berlin time Dynamic Work Assignment: The Idea Irregular SIMD execution Caused by branching: control flow varies
More informationSIMD Divergence Optimization through Intra-Warp Compaction. Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13
SIMD Divergence Optimization through Intra-Warp Compaction Aniruddha Vaidya Anahita Shayesteh Dong Hyuk Woo Roy Saharoy Mani Azimi ISCA 13 Problem GPU: wide SIMD lanes 16 lanes per warp in this work SIMD
More information( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture
( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline
More informationIntroduction to OpenMP
1 / 7 Introduction to OpenMP: Exercises and Handout Introduction to OpenMP Christian Terboven Center for Computing and Communication, RWTH Aachen University Seffenter Weg 23, 52074 Aachen, Germany Abstract
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationImplicit and Explicit Optimizations for Stencil Computations
Implicit and Explicit Optimizations for Stencil Computations By Shoaib Kamil 1,2, Kaushik Datta 1, Samuel Williams 1,2, Leonid Oliker 2, John Shalf 2 and Katherine A. Yelick 1,2 1 BeBOP Project, U.C. Berkeley
More informationMAQAO Hands-on exercises LRZ Cluster
MAQAO Hands-on exercises LRZ Cluster LProf: lightweight generic profiler LProf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Copy handson material > cp /home/hpc/a2c06/lu23bud/lrz-vihpstw21/tools/maqao/maqao_handson_lrz.tar.xz
More informationExecution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures
Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures Xin Huo Advisor: Gagan Agrawal Motivation - Architecture Challenges on GPU architecture
More informationCSE 160 Lecture 10. Instruction level parallelism (ILP) Vectorization
CSE 160 Lecture 10 Instruction level parallelism (ILP) Vectorization Announcements Quiz on Friday Signup for Friday labs sessions in APM 2013 Scott B. Baden / CSE 160 / Winter 2013 2 Particle simulation
More informationOpenCL Vectorising Features. Andreas Beckmann
Mitglied der Helmholtz-Gemeinschaft OpenCL Vectorising Features Andreas Beckmann Levels of Vectorisation vector units, SIMD devices width, instructions SMX, SP cores Cus, PEs vector operations within kernels
More informationImproving graphics processing performance using Intel Cilk Plus
Improving graphics processing performance using Intel Cilk Plus Introduction Intel Cilk Plus is an extension to the C and C++ languages to support data and task parallelism. It provides three new keywords
More informationOptimisation Myths and Facts as Seen in Statistical Physics
Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY
More informationThe Polyhedral Model Is More Widely Applicable Than You Think
The Polyhedral Model Is More Widely Applicable Than You Think Mohamed-Walid Benabderrahmane 1 Louis-Noël Pouchet 1,2 Albert Cohen 1 Cédric Bastoul 1 1 ALCHEMY group, INRIA Saclay / University of Paris-Sud
More informationIntel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2
Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting
More informationA Framework for Safe Automatic Data Reorganization
Compiler Technology A Framework for Safe Automatic Data Reorganization Shimin Cui (Speaker), Yaoqing Gao, Roch Archambault, Raul Silvera IBM Toronto Software Lab Peng Peers Zhao, Jose Nelson Amaral University
More informationHidden Surface Elimination: BSP trees
Hidden Surface Elimination: BSP trees Outline Binary space partition (BSP) trees Polygon-aligned 1 BSP Trees Basic idea: Preprocess geometric primitives in scene to build a spatial data structure such
More informationAdvanced Parallel Programming II
Advanced Parallel Programming II Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Introduction to Vectorization RISC Software GmbH Johannes Kepler
More informationOptimizations of BLIS Library for AMD ZEN Core
Optimizations of BLIS Library for AMD ZEN Core 1 Introduction BLIS [1] is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries [2] The framework was
More informationA distributed rendering architecture for ray tracing large scenes on commodity hardware. FlexRender. Bob Somers Zoe J.
FlexRender A distributed rendering architecture for ray tracing large scenes on commodity hardware. GRAPP 2013 Bob Somers Zoe J. Wood Increasing Geometric Complexity Normal Maps artifacts on silhouette
More informationHigh-Performance Ray Tracing
Lecture 16: High-Performance Ray Tracing Computer Graphics CMU 15-462/15-662, Fall 2015 Ray tracing is a mechanism for answering visibility queries v(x 1,x 2 ) = 1 if x 1 is visible from x 2, 0 otherwise
More informationEmbree Ray Tracing Kernels: Overview and New Features
Embree Ray Tracing Kernels: Overview and New Features Attila Áfra, Ingo Wald, Carsten Benthin, Sven Woop Intel Corporation Intel, the Intel logo, Intel Xeon Phi, Intel Xeon Processor are trademarks of
More informationAccelerating Geometric Queries. Computer Graphics CMU /15-662, Fall 2016
Accelerating Geometric Queries Computer Graphics CMU 15-462/15-662, Fall 2016 Geometric modeling and geometric queries p What point on the mesh is closest to p? What point on the mesh is closest to p?
More informationTowards a Holistic Approach to Auto-Parallelization
Towards a Holistic Approach to Auto-Parallelization Integrating Profile-Driven Parallelism Detection and Machine-Learning Based Mapping Georgios Tournavitis, Zheng Wang, Björn Franke and Michael F.P. O
More informationProgram Optimization Through Loop Vectorization
Program Optimization Through Loop Vectorization María Garzarán, Saeed Maleki William Gropp and David Padua Department of Computer Science University of Illinois at Urbana-Champaign Simple Example Loop
More informationBasics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
More informationMAQAO hands-on exercises
MAQAO hands-on exercises Perf: generic profiler Perf/MPI: Lightweight MPI oriented profiler CQA: code quality analyzer Setup Recompile NPB-MZ with dynamic if using cray compiler #---------------------------------------------------------------------------
More informationKevin O Leary, Intel Technical Consulting Engineer
Kevin O Leary, Intel Technical Consulting Engineer Moore s Law Is Going Strong Hardware performance continues to grow exponentially We think we can continue Moore's Law for at least another 10 years."
More informationEfficient Clustered BVH Update Algorithm for Highly-Dynamic Models. Kirill Garanzha
Symposium on Interactive Ray Tracing 2008 Los Angeles, California Efficient Clustered BVH Update Algorithm for Highly-Dynamic Models Kirill Garanzha Department of Software for Computers Bauman Moscow State
More informationSarah Knepper. Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018
Sarah Knepper Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018 Outline Motivation Problem statement and solutions Simple example Performance comparison 2 Motivation Partial differential equations
More informationParallel Programming Case Studies
Lecture 8: Parallel Programming Case Studies Parallel Computer Architecture and Programming CMU 5-48/5-68, Spring 205 Tunes The Lumineers Ho Hey (The Lumineers) Parallelism always seems to dominate the
More informationMemory access patterns. 5KK73 Cedric Nugteren
Memory access patterns 5KK73 Cedric Nugteren Today s topics 1. The importance of memory access patterns a. Vectorisation and access patterns b. Strided accesses on GPUs c. Data re-use on GPUs and FPGA
More informationHigh Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization
High Performance Computing and Programming 2015 Lab 6 SIMD and Vectorization 1 Introduction The purpose of this lab assignment is to give some experience in using SIMD instructions on x86 and getting compiler
More informationLine Segment Intersection Dmitriy V'jukov
Line Segment Intersection Dmitriy V'jukov 1. Problem Statement Write a threaded code to find pairs of input line segments that intersect within three-dimensional space. Line segments are defined by 6 integers
More informationCOMP371 COMPUTER GRAPHICS
COMP371 COMPUTER GRAPHICS LECTURE 14 RASTERIZATION 1 Lecture Overview Review of last class Line Scan conversion Polygon Scan conversion Antialiasing 2 Rasterization The raster display is a matrix of picture
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More information/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 10: GPGPU (3) Welcome!
/INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2018 - Lecture 10: GPGPU (3) Welcome! Today s Agenda: Don t Trust the Template The Prefix Sum Parallel Sorting Stream Filtering Optimizing GPU
More informationVisible Surface Detection Methods
Visible urface Detection Methods Visible-urface Detection identifying visible parts of a scene (also hidden- elimination) type of algorithm depends on: complexity of scene type of objects available equipment
More informationCS4961 Parallel Programming. Lecture 7: Introduction to SIMD 09/14/2010. Homework 2, Due Friday, Sept. 10, 11:59 PM. Mary Hall September 14, 2010
Parallel Programming Lecture 7: Introduction to SIMD Mary Hall September 14, 2010 Homework 2, Due Friday, Sept. 10, 11:59 PM To submit your homework: - Submit a PDF file - Use the handin program on the
More informationLightcuts. Jeff Hui. Advanced Computer Graphics Rensselaer Polytechnic Institute
Lightcuts Jeff Hui Advanced Computer Graphics 2010 Rensselaer Polytechnic Institute Fig 1. Lightcuts version on the left and naïve ray tracer on the right. The lightcuts took 433,580,000 clock ticks and
More informationCost Modelling for Vectorization on ARM
Cost Modelling for Vectorization on ARM Angela Pohl, Biagio Cosenza and Ben Juurlink ARM Research Summit 2018 Challenges of Auto-Vectorization in Compilers 1. Is it possible to vectorize the code? Passes:
More informationComputer Graphics. - Rasterization - Philipp Slusallek
Computer Graphics - Rasterization - Philipp Slusallek Rasterization Definition Given some geometry (point, 2D line, circle, triangle, polygon, ), specify which pixels of a raster display each primitive
More informationLecture 4. Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy
Lecture 4 Instruction Level Parallelism Vectorization, SSE Optimizing for the memory hierarchy Partners? Announcements Scott B. Baden / CSE 160 / Winter 2011 2 Today s lecture Why multicore? Instruction
More informationIntroduction to Runtime Systems
Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents
More information