Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures


Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures. Xin Huo. Advisor: Gagan Agrawal

Motivation - Architecture

Challenges on the GPU architecture:
- Parallelization strategies for different computation patterns
- Utilizing the fast but small shared memory
- Efficient and deadlock-free locking support
- Exploiting the SIMT execution manner

Heterogeneous CPU+GPU architecture:
- The number of CPU+GPU systems in the Top 500 list increased fivefold from 2010 to 2012
- Emergence of integrated CPU-GPU architectures (AMD Fusion APU and Intel Sandy Bridge)
- Task scheduling must consider computation patterns, data transmission overhead, command-launch overhead, synchronization overhead, and load imbalance

Motivation - Application

Irregular / Unstructured Reduction:
- A dwarf in the Berkeley view on parallel computing (e.g., Molecular Dynamics and Euler)
- Challenge for parallelism: heavy data dependencies
- Challenge for memory performance: indirect memory accesses result in poor data locality

Recursive Control Flow:
- Conflict between SIMD architectures and control dependencies
- Recursion support: SSE (no), OpenCL (no), CUDA (yes)

Thesis statement: new software and hardware scheduling frameworks can help map irregular and recursive applications to these new architectures.

Thesis Work

Different strategies for generalized reductions on GPUs
- Approaches for Parallelizing Reductions on GPUs (HiPC 2010)
Strategy and runtime support for irregular reductions
- An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs (ICS 2011)
Task scheduling frameworks for heterogeneous architectures
- Decoupled GPU + CPU: Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations (HiPC 2011)
- Coupled GPU + CPU: Runtime Support for Accelerating Applications on an Integrated CPU-GPU Architecture (SC 2012)
Improved SIMD parallelism for recursion on GPUs
- Efficient Scheduling of Recursive Control Flow on GPUs (ICS 2013)
Further extending recursion support
- Recursion support for vectorization
- Task scheduling of recursive applications on GPUs

Outline

Current Work
- Strategy and runtime support for irregular reductions
- Improved SIMD parallelism for recursion on GPUs
- Different strategies for generalized reductions on GPUs
- Task scheduling frameworks for heterogeneous architectures: decoupled GPU + CPU; coupled GPU + CPU
Proposed Work
- Further extending recursion support: recursion support for vectorization; task scheduling of recursive applications on GPUs
Conclusion

Irregular Reduction

A dwarf in the Berkeley view on parallel computing
- Unstructured grid pattern
- More random and irregular accesses
- Indirect memory references through an indirection array (IA)
- The reduction loop iterates over elements e (the computation space)
- The reduction objects RObj are accessed through the indirection array (the reduction space)

    {* Outer Sequence Loop *}
    while( ) {
        {* Reduction Loop *}
        Foreach(element e) {
            (IA(e,0), val1) = Process(IA(e,0));
            (IA(e,1), val2) = Process(IA(e,1));
            RObj(IA(e,0)) = Reduce(RObj(IA(e,0)), val1);
            RObj(IA(e,1)) = Reduce(RObj(IA(e,1)), val2);
        }
        Global Reduction to Combine RObj
    }

Application Context: Molecular Dynamics
- Indirection array -> edges (interactions)
- Reduction objects -> molecules (attributes)
- Computation space -> interactions between molecules
- Reduction space -> attributes of molecules
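As a concrete illustration of this access pattern, here is a minimal C sketch (my example, not from the talk) of edge-based force accumulation; the names edge, pos, and force are hypothetical:

    /* Hypothetical edge-based irregular reduction: edge e connects
       molecules edge[2*e] and edge[2*e+1] (the indirection array).
       The loop iterates over the computation space (edges) and
       reduces into the reduction space (per-molecule forces). */
    void accumulate_forces(long num_edges, const int *edge,
                           const float *pos, float *force)
    {
        for (long e = 0; e < num_edges; e++) {
            int i = edge[2*e];              /* IA(e,0) */
            int j = edge[2*e+1];            /* IA(e,1) */
            float f = pos[i] - pos[j];      /* stand-in for a pair potential */
            force[i] += f;                  /* RObj(IA(e,0)) = Reduce(...) */
            force[j] -= f;                  /* RObj(IA(e,1)) = Reduce(...) */
        }
    }

Because i and j come from the indirection array, two different edges may update the same molecule; that is exactly the data dependence that makes this pattern hard to parallelize.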

Main Issues

Traditional strategies are not effective:
- Full Replication (private copy per thread): large memory overhead; requires both intra-block and inter-block combination; shared memory usage is unlikely
- Locking Scheme (private copy per block): heavy conflicts within a block; avoids intra-block combination but not inter-block combination; shared memory is only usable for small data sets

A partitioning strategy must be chosen:
- Ensure the data fits in shared memory
- Choice of partitioning space (computation vs. reduction)
- Tradeoff between partitioning overhead and execution efficiency

Contributions

A novel Partitioning-based Locking strategy
- Efficient shared memory utilization
- Eliminates both intra- and inter-block combination
Optimized runtime support
- Multi-dimensional partitioning method
- Reordering and updating components for correctness and memory performance
Significant performance improvements
- Exhaustive evaluation
- Up to 3.3x improvement over traditional strategies

Data Structures & Access Pattern

    {* Outer Sequence Loop *}
    while( ) {
        {* Reduction Loop *}
        Foreach(element e) {
            (IA(e,0), val1) = Process(IA(e,0));
            (IA(e,1), val2) = Process(IA(e,1));
            RObj(IA(e,0)) = Reduce(RObj(IA(e,0)), val1);
            RObj(IA(e,1)) = Reduce(RObj(IA(e,1)), val2);
        }
        Global Reduction to Combine RObj
    }

Goal: utilize shared memory
- IA: no reuse, so no benefit from shared memory
- RObj: reuse is possible, so shared memory offers the most benefit

Choice of Partitioning Space

Two partitioning choices:
- Computation space: partition on edges
- Reduction space: partition on nodes

Computation Space Partitioning

Partitioning on the iterations of the computation loop.

[Figure: an example 16-node mesh whose edges are divided into four partitions]

Pros:
- Load balance on computation
Cons:
- Unequal reduction size in each partition
- Replicated reduction elements (4 out of 16 nodes are replicated)
- Combination cost
- Shared memory is infeasible

Reduction Space Partitioning

Partitioning on the reduction elements.

[Figure: the same 16-node mesh with its nodes divided into four partitions]

Pros:
- Balanced reduction space
- Partitions are independent of each other
- Avoids combination cost
- Shared memory is feasible
Cons:
- Imbalance in the computation space
- Replicated work caused by the crossing edges
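To make the scheme concrete, here is a hedged CUDA sketch (my illustration, not the thesis implementation) of a partitioning-based locking kernel: each thread block owns one reduction-space partition, stages that partition's reduction objects in shared memory, processes its (possibly replicated) edges, and writes back with no inter-block combination. All names (Edge, edge_off, owned, MAX_PART_NODES) are assumptions:

    #define MAX_PART_NODES 1024  /* assumed partition size that fits in shared memory */

    /* Per-partition edge entry: global ids (gi, gj) for reading positions,
       local ids (li, lj) for updating the staged partition; li/lj is -1 when
       that endpoint belongs to another partition (a crossing edge, whose
       replicated copy in the other partition updates it instead). */
    struct Edge { int gi, gj, li, lj; };

    __global__ void pbl_reduce(const Edge *edges, const int *edge_off,
                               const int *owned, const int *num_owned,
                               const float *pos, float *force)
    {
        __shared__ float s_force[MAX_PART_NODES];
        int b = blockIdx.x;                       /* block b owns partition b */
        int n = num_owned[b];

        for (int k = threadIdx.x; k < n; k += blockDim.x)
            s_force[k] = 0.0f;                    /* stage partition in shared memory */
        __syncthreads();

        for (int e = edge_off[b] + threadIdx.x; e < edge_off[b + 1]; e += blockDim.x) {
            Edge ed = edges[e];
            float f = pos[ed.gi] - pos[ed.gj];    /* stand-in for a pair potential */
            if (ed.li >= 0) atomicAdd(&s_force[ed.li], f);   /* intra-block locking only */
            if (ed.lj >= 0) atomicAdd(&s_force[ed.lj], -f);
        }
        __syncthreads();

        for (int k = threadIdx.x; k < n; k += blockDim.x)
            force[owned[b * MAX_PART_NODES + k]] = s_force[k];  /* no combination step */
    }

The atomics never leave shared memory and never cross blocks, which is why the scheme eliminates both intra- and inter-block combination at the price of replicated crossing-edge work.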

Reduction Space Partitioning - Challenges

Unbalanced and replicated computation
- The partitioning method must balance cost and efficiency
  Cost: execution time of the partitioning method
  Efficiency: reducing the number of crossing edges (replicated work)
Maintaining correctness on the GPU
- Reorder the reduction space
- Update/reorder the computation space

Runtime Partitioning Approaches

Metis Partitioning (multi-level k-way partitioning)
- Executes sequentially on the CPU
- Minimizes crossing edges
- Cons: large overhead for data initialization (high cost)
GPU-based (trivial) Partitioning
- Parallel execution on the GPU
- Minimizes partitioning time
- Cons: large number of crossing edges among partitions (low efficiency)
Multi-dimensional Partitioning (uses coordinate information)
- Executes sequentially on the CPU
- Balances cost and efficiency
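One plausible reading of coordinate-based partitioning is spatial binning; the C++ sketch below (an assumption on my part, not the thesis code) buckets nodes along one coordinate axis, which needs no graph initialization, unlike Metis:

    #include <vector>
    #include <algorithm>

    /* Hypothetical coordinate-based partitioner: split nodes into num_parts
       buckets along one dimension. It never builds a graph structure, so
       initialization cost is near zero; it produces more crossing edges
       than Metis but far fewer than an arbitrary (trivial) split. */
    std::vector<int> partition_1d(const std::vector<float> &x, int num_parts)
    {
        float lo = *std::min_element(x.begin(), x.end());
        float hi = *std::max_element(x.begin(), x.end());
        float width = (hi - lo) / num_parts;

        std::vector<int> part(x.size());
        for (size_t i = 0; i < x.size(); i++) {
            int p = (width > 0) ? (int)((x[i] - lo) / width) : 0;
            part[i] = std::min(p, num_parts - 1);  /* clamp the max coordinate */
        }
        return part;
    }

A multi-dimensional variant would apply the same binning recursively along further axes; nearby nodes land in the same bucket, which is what keeps crossing edges low.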

Experiment Evaluation

Platform
- NVIDIA Tesla C2050 (Fermi, 14 x 32 = 448 cores)
- 2.86 GB device memory
- 64 KB configurable shared memory: 48 KB shared memory + 16 KB L1 cache, or 16 KB shared memory + 48 KB L1 cache
- Intel 2.27 GHz quad-core Xeon E5520 with 48 GB memory
Applications
- Euler (computational fluid dynamics): 20K nodes, 120K edges, and 12K faces
- MD (molecular dynamics): 37K molecules, 4.6 million interactions

Euler - Performance Gains

Euler: comparison between Partitioning-based Locking (PBL), Locking, Full Replication, and sequential CPU time.

[Bar chart: execution time in seconds for PBL, Locking, Full Replication, and CPU; labeled speedups over the sequential CPU are 32.2x (PBL), 9.7x (Locking), and 7.5x (Full Replication)]

Molecular Dynamics - Performance Gains

Molecular Dynamics: comparison between Partitioning-based Locking (PBL), Locking, Full Replication, and sequential CPU time.

[Bar chart: execution time in seconds for PBL, Locking, Full Replication, and CPU; labeled speedups over the sequential CPU are 17.6x (PBL), 5.7x (Locking), and 2.1x (Full Replication)]

Comparison of Different Partitioning Schemes - Cost

Euler: comparison of the Metis Partitioner (MP), GPU Partitioner (GP), and Multi-dimensional Partitioner (MD) on 14, 28, and 42 partitions. Shows only partitioning time (init time + running time + reordering time).

[Bar chart: log-scale partitioning time, broken into init, running, and reordering time, for MP, GP, and MD at 14, 28, and 42 partitions]

- Init time: MP has the largest; MD needs no initialization
- Running time: GP is the shortest; MD is similar to MP
- Reordering time: similar across the three strategies

Comparison of Different Partitioning Schemes - Efficiency

Euler: comparison of MP, GP, and MD on 14, 28, and 42 partitions. Shows the per-partition workload, including the redundant workload caused by crossing edges.

[Chart: per-partition workload in iterations for MP, GP, and MD at 14, 28, and 42 partitions]

- GP involves the most replicated workload and load imbalance
- MD is very close to MP

End-to-End Execution Time with Different Partitioners

Euler: end-to-end execution time for the Multi-dimensional Partitioner (MD), GPU Partitioner (GP), and Metis Partitioner (MP) on 28 partitions, with thread block sizes of 64, 128, 256, and 512.

[Stacked bar chart: reordering, partitioning, copy, and computation time for the PBL scheme with each partitioner on 28 partitions]

- MP: partitioning time is even larger than computation time
- GP: too much redundant work slows down the execution

Summary

- Systematic study of parallelizing irregular reductions on modern GPUs
- A Partitioning-based Locking scheme on the reduction space
- Optimized runtime support: three partitioning schemes; reordering and updating components
- Multi-dimensional partitioning balances cost and efficiency
- Significant performance improvement over traditional methods

Outline

Current Work
- Strategy and runtime support for irregular reductions
- Improved SIMD parallelism for recursion on GPUs
- Different strategies for generalized reductions on GPUs
- Task scheduling frameworks for heterogeneous architectures: decoupled GPU + CPU; coupled GPU + CPU
Proposed Work
- Further extending recursion support: recursion support for vectorization; task scheduling of recursive applications on GPUs
Conclusion

Limited Recursion Support on Modern GPUs

- No support in OpenCL or on AMD GPUs
- NVIDIA supports recursion from compute capability 2.0 and SDK 3.1
- How does it perform?
- Focus on intra-warp thread scheduling, where each thread in a warp executes a recursive computation
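For reference, device-side recursion on an NVIDIA GPU looks like the following minimal sketch (my example, not from the talk); each thread computes one Fibonacci task, so under SIMT the warp's runtime is bounded by its largest task:

    __device__ int fib(int n)            /* recursive device function (sm_20+) */
    {
        if (n < 2) return 1;
        return fib(n - 1) + fib(n - 2);
    }

    __global__ void fib_kernel(const int *task, int *out)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        out[t] = fib(task[t]);           /* divergent threads serialize */
    }

    /* Host side: the default per-thread stack is small, so deep recursion
       needs cudaDeviceSetLimit(cudaLimitStackSize, bytes) before launch. */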

A Recursion Example on an NVIDIA GPU

Three configurations of Fibonacci tasks: Serial (one thread runs all the Fib(23) and Fib(24) tasks in sequence), Matched (threads 1 and 2 execute identical task sequences, so they always run the same task at the same time), and Unmatched (the two threads' sequences are misaligned, so Fib(23) on one thread runs alongside Fib(24) on the other).

[Bar chart: execution time in seconds for Serial, Matched (2x speedup), and Unmatched (1.43x speedup)]

Execution time is bounded by the largest task.

Threads Re-convergence in a General Branch

    /* General Branch */
    Fib(n) {
        if((A || B) && C) {
            ...
        } else {
            ...
        }
    }

[Control flow graph: Entry -> BB1 (branch on condition A), BB2 (condition B), BB3 (condition C), BB4 (else branch) -> Exit; five paths reach the exit, which is the immediate post-dominator]

The post-dominator is the node through which all paths (across different branches) must pass. The immediate post-dominator is the post-dominator that is not dominated by any other post-dominator.

Immediate Post-dominator Re-convergence in Recursion

    /* Recursive Fibonacci */
    Fib(n) {
        if(n < 2) {
            return 1;
        } else {
            x = Fib(n-1);
            y = Fib(n-2);
            return x+y;
        }
    }

[Control flow graph for threads T0 and T1: each recursive call expands the else-branch block into a next-level CFG, so the re-convergence point stays at the current level's exit]

- Re-convergence can only happen at the same recursion level
- Threads on the short branch cannot return until the threads on the long branch come back to the re-convergence point

Re-convergence Methods

Immediate post-dominator re-convergence
- Execution time is bounded by the longest branch. For $N$ threads, each executing a recursive task with $M$ branches, let $T_{t,i}$ be the execution time of thread $t$ on branch $i$; then

  $\text{Total time} = \sum_{i=1}^{M} \max_{1 \le t \le N} T_{t,i}$

  For example, with two threads whose branch times are (3, 1) and (1, 3), post-dominator re-convergence takes 3 + 3 = 6 units even though each thread only has 4 units of work.
Dynamic re-convergence
- Re-convergence can happen before or after the immediate post-dominator

Dynamic Re-convergence Mechanisms

Remove the static re-convergence at the immediate post-dominator.

Dynamic re-convergence implementations:
- Frontier-based re-convergence
  Frontier: the group of threads that have the same PC (program counter) address as the currently active threads
  Re-convergence happens within the same frontier and can cross recursion levels
- Majority-based re-convergence
  Schedules the majority group of threads with the same PC
  Tends to maximize IPC (instructions per cycle)
  Threads in the minority have a chance to join a future majority
- Other dynamic re-convergence mechanisms
  Frontier-ret: schedules return instructions first
  Majority-threshold: prevents starvation of threads in the minority group

Dynamic Re-convergence Mechanisms

[Diagram: threads T0 and T1 diverge at the if/else of one recursion level; under dynamic re-convergence they can re-join at matching blocks of different recursion levels instead of waiting at the current level's exit]

Implementation of Dynamic Re-convergence

GPGPU-Sim simulator
- A cycle-level GPU performance simulator for general-purpose computation on GPUs
- High simulation accuracy (98.3% for GT200, 97.3% for Fermi)
- Models the Fermi micro-architecture
Stack-based re-convergence mechanism
- Stack entry structure
  PC: address of the next scheduled instruction
  Active mask: bitset indicating which threads are active for that PC (1 = active)
- Stack updating function: updates the PC and active mask differently for each implementation
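To illustrate the mechanism, here is a hedged C++ sketch (my reconstruction, not GPGPU-Sim code) of a SIMT re-convergence stack entry and a majority-based update step that picks the PC shared by the most pending threads:

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct StackEntry {
        uint64_t pc;     // next instruction address for this group
        uint32_t mask;   // active mask: bit t set => thread t executes at pc
    };

    /* Majority-based scheduling step: given each pending thread's next PC,
       build one group per distinct PC and schedule the largest group first. */
    StackEntry majority_schedule(const std::vector<uint64_t> &next_pc,
                                 uint32_t pending_mask)
    {
        std::unordered_map<uint64_t, uint32_t> groups;
        for (int t = 0; t < 32; t++)
            if (pending_mask & (1u << t))
                groups[next_pc[t]] |= (1u << t);

        StackEntry best{0, 0};
        for (const auto &g : groups)
            if (__builtin_popcount(g.second) > __builtin_popcount(best.mask))
                best = {g.first, g.second};
        return best;  // minority groups stay pending and may join a later majority
    }

Frontier-based scheduling would instead key the choice purely on PC equality with the current frontier; the data structure is the same, only the update policy differs.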

Frontier-based Dynamic Re-convergence

[Step-by-step stack diagram for threads T0-T2: execution starts with one entry (PC0, mask 111); a divergence splits it into (PC1, mask 100) and (PC2, mask 011); threads that reach the same PC, even at different recursion levels, are merged back into one frontier (PC1, mask 111); finally all threads re-converge at the exit (PC4, mask 111)]

Majority-based Dynamic Re-convergence

[Step-by-step stack diagram for threads T0-T2: after a divergence, the scheduler picks the majority group (PC2, mask 011) and executes it first, leaving the minority thread pending until it can join a later majority]

Experiment Evaluation

GPGPU-Sim simulator, modeling the Fermi architecture.

Recursive benchmarks
- Small number of recursive branches: Fibonacci, Binomial Coefficients
- Large number of recursive branches: Graph Coloring, NQueens
- Dependency between branches: Tak Function
- Only one branch: Mandelbrot Fractals

Performance with Increasing Divergence

Each of eight threads in a warp executes the same list of eight tasks (sizes 10, 8, 6, 9, 7, 5, 2, 4); rotating each thread's list by a per-thread offset controls the amount of intra-warp divergence:
- 0-offset rotation: all threads execute identical task sequences (no divergence)
- 4-offset rotation: two distinct sequences alternate across the warp
- 2-offset rotation: four distinct sequences
- 1-offset rotation: every thread runs a different sequence (maximum divergence)

Performance with Increasing Divergence

[Charts: IPC of Fibonacci and Binomial Coefficients for Serial, Post-dom, Frontier, Frontier-ret, Majority, and Majority-threshold, under task rotations of increasing divergence (0, 4, 2, 1 offset)]

Fibonacci and Binomial Coefficients
- Perfect speedup at 0-offset
- With increasing divergence, Post-dom IPC decreases by 2.5, 4, and 5 times, while Majority decreases by only 1.8, 2.1, and 2.2 times

Performance with Increasing Divergence

[Charts: IPC of NQueens, Graph Coloring, the Tak Function, and Mandelbrot under task rotations of increasing divergence (0, 4, 2, 1 offset), comparing the same six schemes]

Scalability with Warp Width

[Charts: IPC of Fibonacci and Binomial Coefficients for warp widths 4, 8, 16, and 32, comparing Serial, Post-dom, Frontier, Frontier-ret, Majority, and Majority-threshold]

Fibonacci and Binomial Coefficients
- At warp width 4, all versions have similar IPC
- With increasing warp width, Majority scales better than both Frontier and Post-dom

Scalability with Warp Width

[Charts: IPC of NQueens, Graph Coloring, the Tak Function, and Mandelbrot for warp widths 4, 8, 16, and 32, comparing the same six schemes]

Summary

- Current recursion support is limited by the static re-convergence method
- Dynamic re-convergence mechanisms allow re-convergence before or after the immediate post-dominator
- Two implementations: the frontier-based method and the majority-based method
- Kepler GPUs: Dynamic Parallelism blocks the executing kernel when it calls a new kernel, and it is not related to intra-warp scheduling

Different Strategies for Generalized Reductions on GPUs

Tradeoffs between the Full Replication and Locking schemes.

Hybrid Scheme
- Balances Full Replication and the Locking scheme
- Introduces an intermediate scheduling layer, the group, beneath the thread block: intra-group, the Locking scheme; inter-group, Full Replication (see the sketch below)
- Benefits vary with group size: reduced memory overhead, better use of shared memory, reduced combination cost and conflicts

Extensive evaluation of different applications with different parameters.
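A hedged CUDA sketch of the hybrid idea (my illustration; GROUP_SIZE, NUM_BINS, and the histogram-style reduction are assumptions): threads within a group share one copy of the reduction object and update it with atomics, while each group keeps its own private copy, so only inter-group combination remains:

    #define GROUP_SIZE 8    /* assumed threads per group */
    #define NUM_BINS   16   /* assumed reduction object size */

    __global__ void hybrid_reduce(const int *data, int n, int *global_bins)
    {
        int groups_per_block = blockDim.x / GROUP_SIZE;
        int group = threadIdx.x / GROUP_SIZE;

        extern __shared__ int s_bins[];       /* one private copy per group */
        for (int i = threadIdx.x; i < groups_per_block * NUM_BINS; i += blockDim.x)
            s_bins[i] = 0;
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)     /* locking within the group only */
            atomicAdd(&s_bins[group * NUM_BINS + data[i] % NUM_BINS], 1);
        __syncthreads();

        /* Inter-group (and inter-block) combination. */
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
            int sum = 0;
            for (int g = 0; g < groups_per_block; g++)
                sum += s_bins[g * NUM_BINS + b];
            atomicAdd(&global_bins[b], sum);
        }
    }

Launched with blockDim.x a multiple of GROUP_SIZE and groups_per_block * NUM_BINS * sizeof(int) bytes of dynamic shared memory. A group size of 1 degenerates to Full Replication and a group size of blockDim.x to the Locking scheme; the hybrid scheme tunes this knob between atomic contention and memory overhead.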

Porting Irregular Reductions to Heterogeneous CPU-GPU Architectures

A multi-level partitioning framework
- Parallelizes irregular reductions on a heterogeneous architecture: the coarse-grained level partitions tasks between the CPU and GPU; the fine-grained level partitions tasks between thread blocks or threads
- Both levels use reduction space partitioning
- Eliminates the device memory limitation on the GPU
Runtime support
- Pipelining scheme: overlaps partitioning and computation on the GPU
- Work-stealing-based scheduling: provides load balance and increases the pipeline length
Significant performance improvements
- 11% and 22% improvement for Euler and Molecular Dynamics, respectively

Accelerating Applications on Integrated CPU-GPU Architectures

Thread-block-level scheduling framework on the Fusion APU
- The scheduling targets are thread blocks, not devices
- Only one kernel launch at the beginning, so command-launch overhead is small
- No synchronization between devices or thread blocks
- Inter- and intra-device load balance achieved by fine-grained and factoring scheduling policies
Lock-free implementations
- Master-worker scheduling
- Token scheduling
Applications with different communication patterns
- Stencil computation (Jacobi): 1.6x
- Generalized reduction (K-means): 1.92x
- Irregular reduction (Molecular Dynamics): 1.15x
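One way to picture thread-block-level scheduling is a persistent kernel in which each block repeatedly claims the next chunk of work from a shared counter instead of receiving a fixed assignment at launch. This is my simplification (the thesis uses master-worker and token schemes); next_chunk and the chunk sizing are hypothetical:

    __global__ void persistent_worker(int *next_chunk, int num_chunks,
                                      int chunk_size, const float *in, float *out)
    {
        __shared__ int my_chunk;
        while (true) {
            if (threadIdx.x == 0)                 /* one claim per block */
                my_chunk = atomicAdd(next_chunk, 1);
            __syncthreads();
            if (my_chunk >= num_chunks) break;    /* no work left anywhere */

            int base = my_chunk * chunk_size;
            for (int i = threadIdx.x; i < chunk_size; i += blockDim.x)
                out[base + i] = in[base + i] * 2.0f;  /* stand-in computation */
            __syncthreads();                      /* my_chunk is safe to reuse */
        }
    }

    /* Host side: CPU worker threads can claim chunks from the same counter
       in zero-copy memory, giving inter-device load balance without
       relaunching kernels or synchronizing between devices. */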

Outline

Current Work
- Strategy and runtime support for irregular reductions
- Improved SIMD parallelism for recursion on GPUs
- Different strategies for generalized reductions on GPUs
- Task scheduling frameworks for heterogeneous architectures: decoupled GPU + CPU; coupled GPU + CPU
Proposed Work
- Further extending recursion support: recursion support for vectorization; task scheduling of recursive applications on GPUs
Conclusion

SIMD Extensions

Supported in popular processors
- SIMD lane width has increased from 128 bits (SSE) and 256 bits (AVX) to 512 bits (MIC)
Exhaustively studied in compiler and runtime support
- Memory alignment
- Irregular accesses
- Control flow
No recursion support
- Dynamic function calls
- Extensive divergence

Overall Idea

Stack-based recursion support for SIMD extensions
- A software function-call stack for each SIMD lane
- SIMD operations: contiguous memory accesses
- Divergence: a re-convergence method for SIMD lanes; the strategies used on GPUs cannot be ported to SIMD extensions directly

Structure of the Stack Frame

    Fibonacci(n) {
        if(n <= 1) return 1;        // Check End Case
        int x = Fibonacci(n-1);     // Branch Case
        int y = Fibonacci(n-2);     // Branch Case
        return x+y;                 // Return Case
    }

Stack frame fields:
- PC: the case number within the recursion (check-end, branch, or return)
- n[]: input values
- ret[]: return values

Stack Driver

    Algorithm 5: stack_driver(stack_t stack)
    while not stack.finish() do
        StackFrame &ff = stack.getTop();
        pc = ff.pc; ff.pc++;                 // read the current PC, advance it to the next case
        if pc == 0 then                      // Check End Case
            if ff.n == end_value then
                stack.sf[stack.top - 2].ret += final_value;
                stack.pop();
            end
        else if pc <= branch_num then        // Branch Case
            stack.push(input(ff.n, branch));
        else                                 // Return Case
            stack.sf[stack.top - 2].ret += ff.ret;
            stack.pop();
        end
    end
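As a sanity check of this control structure, here is a runnable scalar C++ driver for Fibonacci (my reconstruction under the slide's conventions; a vectorized version would run one such stack per SIMD lane):

    #include <cstdio>
    #include <vector>

    struct Frame { int pc; int n; int ret; };  // pc: 0 = check end, 1..2 = branches, 3 = return

    int fib_driver(int n0)
    {
        std::vector<Frame> stack{{0, 0, 0},    // dummy root frame receives the result
                                 {0, n0, 0}};
        while (stack.size() > 1) {
            Frame &ff = stack.back();
            int pc = ff.pc++;                  // read the current case, advance to the next
            if (pc == 0) {                     // Check End Case
                if (ff.n <= 1) {
                    stack[stack.size() - 2].ret += 1;
                    stack.pop_back();
                }
            } else if (pc <= 2) {              // Branch Case: push Fib(n-1), then Fib(n-2)
                stack.push_back({0, ff.n - pc, 0});
            } else {                           // Return Case: pass the sum to the parent
                int r = ff.ret;
                stack.pop_back();
                stack.back().ret += r;
            }
        }
        return stack[0].ret;
    }

    int main() { printf("%d\n", fib_driver(10)); }  // prints 89 under base case Fib(0)=Fib(1)=1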

Memory Support

Contiguous memory accesses
- Stacks are stored in column-major order (structure of arrays): the same field of the same stack level across all SIMD lanes occupies contiguous addresses

[Diagram: for four SIMD lanes, each stack level stores PC0-PC3, then N0-N3, then RET0-RET3 contiguously, so one vector load/store touches a single contiguous block per field]
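A minimal sketch of this layout (names and sizes assumed), where updating one field at one stack level for all lanes is a single contiguous, vectorizable access:

    #define LANES     8    /* assumed SIMD width */
    #define MAX_DEPTH 64   /* assumed maximum recursion depth */

    /* Structure-of-arrays stacks: field-major, level-major, lane-minor.
       pc[d][l] is the PC of lane l's frame at depth d, so pc[d][0..LANES-1]
       is contiguous and can be read or written with one vector operation. */
    struct SoaStacks {
        int pc [MAX_DEPTH][LANES];
        int n  [MAX_DEPTH][LANES];
        int ret[MAX_DEPTH][LANES];
        int top[LANES];            /* per-lane stack depth */
    };

    /* Example: advance the PC of every active lane at depth d in one sweep;
       with an array-of-structures layout this would be a strided
       gather/scatter access instead. */
    static inline void advance_pc(SoaStacks *s, int d, unsigned active_mask)
    {
        for (int l = 0; l < LANES; l++)    /* contiguous, vectorizable loop */
            if (active_mask & (1u << l))
                s->pc[d][l]++;
    }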

Other Supports

Divergence support
- Keep all lanes reading/writing the same stack level (insert ghost frames for lanes on short branches)
- Mask operations (supported on MIC)
Re-convergence for SIMD lanes
- Immediate post-dominator re-convergence
- Open question: how to implement efficient dynamic re-convergence? Updating stacks at different levels causes non-contiguous accesses but exposes potential parallelism

Tree-based Recursive Scheduling Framework

Challenges of scheduling recursion
- Dynamic task creation
- Load imbalance
Tree-based scheduling framework
- Hierarchical structure: block tree (global memory), warp tree (shared memory), thread tree (shared memory), and per-thread stacks
- Task stealing within the same level; nodes are divided into public and private, and only public nodes are available for stealing (reducing locking overhead)
- Task stealing across levels: locking-based stealing for warps and blocks; task redistribution for threads in the same warp when imbalance exceeds a threshold

[Diagram: per-thread stacks feed thread trees, which feed warp trees in shared memory and block trees in global memory]

Conclusions

- Different strategies for generalized reductions on GPUs
- Strategy and runtime support for irregular reductions
- Task scheduling frameworks for heterogeneous architectures: decoupled GPU + CPU; coupled GPU + CPU
- Improved SIMD parallelism for recursion on GPUs
- Proposed: further extending recursion support, with recursion support for vectorization and task scheduling of recursive applications on GPUs

Thanks for your attention! Q & A