Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures


Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures. Xin Huo. Advisor: Gagan Agrawal

Motivation - Architecture

Challenges on the GPU architecture:
- Parallelization strategies for different computation patterns
- Utilizing the fast but small shared memory
- Efficient and deadlock-free locking support
- Exploiting the SIMT execution manner

Heterogeneous CPU+GPU architecture:
- The number of CPU+GPU systems in the Top 500 list increased fivefold from 2010 to 2012
- Emergence of integrated CPU-GPU architectures (AMD Fusion APU and Intel Sandy Bridge)
- Task scheduling must consider computation patterns, data transmission overhead, command-launch overhead, synchronization overhead, and load imbalance

Motivation - Application

Irregular / Unstructured Reduction:
- A dwarf in the Berkeley view on parallel computing (e.g., Molecular Dynamics and Euler)
- Challenge for parallelism: heavy data dependencies
- Challenge for memory performance: indirect memory accesses result in poor data locality

Recursive Control Flow:
- Conflict between SIMD architectures and control dependencies
- Recursion support: SSE (no), OpenCL (no), CUDA (yes)

Thesis statement: new software and hardware scheduling frameworks can help map irregular and recursive applications to these new architectures.

Thesis Work

Different strategies for generalized reductions on GPUs
- Approaches for Parallelizing Reductions on GPUs (HiPC 2010)
Strategy and runtime support for irregular reductions
- An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs (ICS 2011)
Task scheduling frameworks for heterogeneous architectures
- Decoupled GPU + CPU: Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations (HiPC 2011)
- Coupled GPU + CPU: Runtime Support for Accelerating Applications on an Integrated CPU-GPU Architecture (SC 2012)
Improved SIMD parallelism for recursion on GPUs
- Efficient Scheduling of Recursive Control Flow on GPUs (ICS 2013)
Further extending recursion support
- Recursion support for vectorization
- Task scheduling of recursive applications on GPUs

Outline

Current Work
- Strategy and runtime support for irregular reductions
- Improved SIMD parallelism for recursion on GPUs
- Different strategies for generalized reductions on GPUs
- Task scheduling frameworks for heterogeneous architectures: decoupled GPU + CPU; coupled GPU + CPU
Proposed Work
- Further extending recursion support: recursion support for vectorization; task scheduling of recursive applications on GPUs
Conclusion

Irregular Reduction

A dwarf in the Berkeley view on parallel computing
- Unstructured grid pattern
- More random and irregular accesses
- Indirect memory references through an indirection array (IA)
- The reduction loop iterates over elements e (the computation space)
- The reduction objects RObj are accessed through the indirection array (the reduction space)

    {* Outer Sequence Loop *}
    while( ) {
        {* Reduction Loop *}
        Foreach(element e) {
            (IA(e,0), val1) = Process(IA(e,0));
            (IA(e,1), val2) = Process(IA(e,1));
            RObj(IA(e,0)) = Reduce(RObj(IA(e,0)), val1);
            RObj(IA(e,1)) = Reduce(RObj(IA(e,1)), val2);
        }
        Global Reduction to Combine RObj
    }

Application Context: Molecular Dynamics
- Indirection array -> edges (interactions)
- Reduction objects -> molecules (attributes)
- Computation space -> interactions between molecules
- Reduction space -> attributes of molecules
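As a concrete illustration of this access pattern, here is a minimal C sketch (my example, not from the talk) of edge-based force accumulation; the names edge, pos, and force are hypothetical:

    /* Hypothetical edge-based irregular reduction: edge e connects
       molecules edge[2*e] and edge[2*e+1] (the indirection array).
       The loop iterates over the computation space (edges) and
       reduces into the reduction space (per-molecule forces). */
    void accumulate_forces(long num_edges, const int *edge,
                           const float *pos, float *force)
    {
        for (long e = 0; e < num_edges; e++) {
            int i = edge[2*e];              /* IA(e,0) */
            int j = edge[2*e+1];            /* IA(e,1) */
            float f = pos[i] - pos[j];      /* stand-in for a pair potential */
            force[i] += f;                  /* RObj(IA(e,0)) = Reduce(...) */
            force[j] -= f;                  /* RObj(IA(e,1)) = Reduce(...) */
        }
    }

Because i and j come from the indirection array, two different edges may update the same molecule; that is exactly the data dependence that makes this pattern hard to parallelize.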

Main Issues

Traditional strategies are not effective:
- Full Replication (private copy per thread): large memory overhead; requires both intra-block and inter-block combination; shared memory usage is unlikely
- Locking Scheme (private copy per block): heavy conflicts within a block; avoids intra-block combination but not inter-block combination; shared memory is only usable for small data sets

A partitioning strategy must be chosen:
- Ensure the data fits in shared memory
- Choice of partitioning space (computation vs. reduction)
- Tradeoff between partitioning overhead and execution efficiency

Contributions

A novel Partitioning-based Locking strategy
- Efficient shared memory utilization
- Eliminates both intra- and inter-block combination
Optimized runtime support
- Multi-dimensional partitioning method
- Reordering and updating components for correctness and memory performance
Significant performance improvements
- Exhaustive evaluation
- Up to 3.3x improvement over traditional strategies

Data Structures & Access Pattern

    {* Outer Sequence Loop *}
    while( ) {
        {* Reduction Loop *}
        Foreach(element e) {
            (IA(e,0), val1) = Process(IA(e,0));
            (IA(e,1), val2) = Process(IA(e,1));
            RObj(IA(e,0)) = Reduce(RObj(IA(e,0)), val1);
            RObj(IA(e,1)) = Reduce(RObj(IA(e,1)), val2);
        }
        Global Reduction to Combine RObj
    }

Goal: utilize shared memory
- IA: no reuse, so no benefit from shared memory
- RObj: reuse is possible, so shared memory offers the most benefit

Choice of Partitioning Space

Two partitioning choices:
- Computation space: partition on edges
- Reduction space: partition on nodes

Computation Space Partitioning

Partitioning on the iterations of the computation loop.

[Figure: an example 16-node mesh whose edges are divided into four partitions]

Pros:
- Load balance on computation
Cons:
- Unequal reduction size in each partition
- Replicated reduction elements (4 out of 16 nodes are replicated)
- Combination cost
- Shared memory is infeasible

Reduction Space Partitioning

Partitioning on the reduction elements.

[Figure: the same 16-node mesh with its nodes divided into four partitions]

Pros:
- Balanced reduction space
- Partitions are independent of each other
- Avoids combination cost
- Shared memory is feasible
Cons:
- Imbalance in the computation space
- Replicated work caused by the crossing edges
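To make the scheme concrete, here is a hedged CUDA sketch (my illustration, not the thesis implementation) of a partitioning-based locking kernel: each thread block owns one reduction-space partition, stages that partition's reduction objects in shared memory, processes its (possibly replicated) edges, and writes back with no inter-block combination. All names (Edge, edge_off, owned, MAX_PART_NODES) are assumptions:

    #define MAX_PART_NODES 1024  /* assumed partition size that fits in shared memory */

    /* Per-partition edge entry: global ids (gi, gj) for reading positions,
       local ids (li, lj) for updating the staged partition; li/lj is -1 when
       that endpoint belongs to another partition (a crossing edge, whose
       replicated copy in the other partition updates it instead). */
    struct Edge { int gi, gj, li, lj; };

    __global__ void pbl_reduce(const Edge *edges, const int *edge_off,
                               const int *owned, const int *num_owned,
                               const float *pos, float *force)
    {
        __shared__ float s_force[MAX_PART_NODES];
        int b = blockIdx.x;                       /* block b owns partition b */
        int n = num_owned[b];

        for (int k = threadIdx.x; k < n; k += blockDim.x)
            s_force[k] = 0.0f;                    /* stage partition in shared memory */
        __syncthreads();

        for (int e = edge_off[b] + threadIdx.x; e < edge_off[b + 1]; e += blockDim.x) {
            Edge ed = edges[e];
            float f = pos[ed.gi] - pos[ed.gj];    /* stand-in for a pair potential */
            if (ed.li >= 0) atomicAdd(&s_force[ed.li], f);   /* intra-block locking only */
            if (ed.lj >= 0) atomicAdd(&s_force[ed.lj], -f);
        }
        __syncthreads();

        for (int k = threadIdx.x; k < n; k += blockDim.x)
            force[owned[b * MAX_PART_NODES + k]] = s_force[k];  /* no combination step */
    }

The atomics never leave shared memory and never cross blocks, which is why the scheme eliminates both intra- and inter-block combination at the price of replicated crossing-edge work.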

Reduction Space Partitioning - Challenges

Unbalanced and replicated computation
- The partitioning method must balance cost and efficiency
  Cost: execution time of the partitioning method
  Efficiency: reducing the number of crossing edges (replicated work)
Maintaining correctness on the GPU
- Reorder the reduction space
- Update/reorder the computation space

Runtime Partitioning Approaches

Metis Partitioning (multi-level k-way partitioning)
- Executes sequentially on the CPU
- Minimizes crossing edges
- Cons: large overhead for data initialization (high cost)
GPU-based (trivial) Partitioning
- Parallel execution on the GPU
- Minimizes partitioning time
- Cons: large number of crossing edges among partitions (low efficiency)
Multi-dimensional Partitioning (uses coordinate information)
- Executes sequentially on the CPU
- Balances cost and efficiency
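One plausible reading of coordinate-based partitioning is spatial binning; the C++ sketch below (an assumption on my part, not the thesis code) buckets nodes along one coordinate axis, which needs no graph initialization, unlike Metis:

    #include <vector>
    #include <algorithm>

    /* Hypothetical coordinate-based partitioner: split nodes into num_parts
       buckets along one dimension. It never builds a graph structure, so
       initialization cost is near zero; it produces more crossing edges
       than Metis but far fewer than an arbitrary (trivial) split. */
    std::vector<int> partition_1d(const std::vector<float> &x, int num_parts)
    {
        float lo = *std::min_element(x.begin(), x.end());
        float hi = *std::max_element(x.begin(), x.end());
        float width = (hi - lo) / num_parts;

        std::vector<int> part(x.size());
        for (size_t i = 0; i < x.size(); i++) {
            int p = (width > 0) ? (int)((x[i] - lo) / width) : 0;
            part[i] = std::min(p, num_parts - 1);  /* clamp the max coordinate */
        }
        return part;
    }

A multi-dimensional variant would apply the same binning recursively along further axes; nearby nodes land in the same bucket, which is what keeps crossing edges low.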

Experiment Evaluation

Platform
- NVIDIA Tesla C2050 (Fermi, 14 x 32 = 448 cores)
- 2.86 GB device memory
- 64 KB configurable shared memory: 48 KB shared memory + 16 KB L1 cache, or 16 KB shared memory + 48 KB L1 cache
- Intel 2.27 GHz quad-core Xeon E5520 with 48 GB memory
Applications
- Euler (computational fluid dynamics): 20K nodes, 120K edges, and 12K faces
- MD (molecular dynamics): 37K molecules, 4.6 million interactions

Euler - Performance Gains

Euler: comparison between Partitioning-based Locking (PBL), Locking, Full Replication, and sequential CPU time.

[Bar chart: execution time in seconds for PBL, Locking, Full Replication, and CPU; labeled speedups over the sequential CPU are 32.2x (PBL), 9.7x (Locking), and 7.5x (Full Replication)]

Molecular Dynamics - Performance Gains

Molecular Dynamics: comparison between Partitioning-based Locking (PBL), Locking, Full Replication, and sequential CPU time.

[Bar chart: execution time in seconds for PBL, Locking, Full Replication, and CPU; labeled speedups over the sequential CPU are 17.6x (PBL), 5.7x (Locking), and 2.1x (Full Replication)]

Comparison of Different Partitioning Schemes - Cost

Euler: comparison of the Metis Partitioner (MP), GPU Partitioner (GP), and Multi-dimensional Partitioner (MD) on 14, 28, and 42 partitions. Shows only partitioning time (init time + running time + reordering time).

[Bar chart: log-scale partitioning time, broken into init, running, and reordering time, for MP, GP, and MD at 14, 28, and 42 partitions]

- Init time: MP has the largest; MD needs no initialization
- Running time: GP is the shortest; MD is similar to MP
- Reordering time: similar across the three strategies

Comparison of Different Partitioning Schemes - Efficiency

Euler: comparison of MP, GP, and MD on 14, 28, and 42 partitions. Shows the per-partition workload, including the redundant workload caused by crossing edges.

[Chart: per-partition workload in iterations for MP, GP, and MD at 14, 28, and 42 partitions]

- GP involves the most replicated workload and load imbalance
- MD is very close to MP

End-to-End Execution Time with Different Partitioners

Euler: end-to-end execution time for the Multi-dimensional Partitioner (MD), GPU Partitioner (GP), and Metis Partitioner (MP) on 28 partitions, with thread block sizes of 64, 128, 256, and 512.

[Stacked bar chart: reordering, partitioning, copy, and computation time for the PBL scheme with each partitioner on 28 partitions]

- MP: partitioning time is even larger than computation time
- GP: too much redundant work slows down the execution

Summary

- Systematic study of parallelizing irregular reductions on modern GPUs
- A Partitioning-based Locking scheme on the reduction space
- Optimized runtime support: three partitioning schemes; reordering and updating components
- Multi-dimensional partitioning balances cost and efficiency
- Significant performance improvement over traditional methods

Outline

Current Work
- Strategy and runtime support for irregular reductions
- Improved SIMD parallelism for recursion on GPUs
- Different strategies for generalized reductions on GPUs
- Task scheduling frameworks for heterogeneous architectures: decoupled GPU + CPU; coupled GPU + CPU
Proposed Work
- Further extending recursion support: recursion support for vectorization; task scheduling of recursive applications on GPUs
Conclusion

Limited Recursion Support on Modern GPUs

- No support in OpenCL or on AMD GPUs
- NVIDIA supports recursion from compute capability 2.0 and SDK 3.1
- How does it perform?
- Focus on intra-warp thread scheduling, where each thread in a warp executes a recursive computation
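For reference, device-side recursion on an NVIDIA GPU looks like the following minimal sketch (my example, not from the talk); each thread computes one Fibonacci task, so under SIMT the warp's runtime is bounded by its largest task:

    __device__ int fib(int n)            /* recursive device function (sm_20+) */
    {
        if (n < 2) return 1;
        return fib(n - 1) + fib(n - 2);
    }

    __global__ void fib_kernel(const int *task, int *out)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        out[t] = fib(task[t]);           /* divergent threads serialize */
    }

    /* Host side: the default per-thread stack is small, so deep recursion
       needs cudaDeviceSetLimit(cudaLimitStackSize, bytes) before launch. */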

A Recursion Example on an NVIDIA GPU

Three configurations of Fibonacci tasks: Serial (one thread runs all the Fib(23) and Fib(24) tasks in sequence), Matched (threads 1 and 2 execute identical task sequences, so they always run the same task at the same time), and Unmatched (the two threads' sequences are misaligned, so Fib(23) on one thread runs alongside Fib(24) on the other).

[Bar chart: execution time in seconds for Serial, Matched (2x speedup), and Unmatched (1.43x speedup)]

Execution time is bounded by the largest task.

Threads Re-convergence in a General Branch

    /* General Branch */
    Fib(n) {
        if((A || B) && C) {
            ...
        } else {
            ...
        }
    }

[Control flow graph: Entry -> BB1 (branch on condition A), BB2 (condition B), BB3 (condition C), BB4 (else branch) -> Exit; five paths reach the exit, which is the immediate post-dominator]

The post-dominator is the node through which all paths (across different branches) must pass. The immediate post-dominator is the post-dominator that is not dominated by any other post-dominator.

Immediate Post-dominator Re-convergence in Recursion

    /* Recursive Fibonacci */
    Fib(n) {
        if(n < 2) {
            return 1;
        } else {
            x = Fib(n-1);
            y = Fib(n-2);
            return x+y;
        }
    }

[Control flow graph for threads T0 and T1: each recursive call expands the else-branch block into a next-level CFG, so the re-convergence point stays at the current level's exit]

- Re-convergence can only happen at the same recursion level
- Threads on the short branch cannot return until the threads on the long branch come back to the re-convergence point

Re-convergence Methods

Immediate post-dominator re-convergence
- Execution time is bounded by the longest branch. For $N$ threads, each executing a recursive task with $M$ branches, let $T_{t,i}$ be the execution time of thread $t$ on branch $i$; then

  $\text{Total time} = \sum_{i=1}^{M} \max_{1 \le t \le N} T_{t,i}$

  For example, with two threads whose branch times are (3, 1) and (1, 3), post-dominator re-convergence takes 3 + 3 = 6 units even though each thread only has 4 units of work.
Dynamic re-convergence
- Re-convergence can happen before or after the immediate post-dominator

Dynamic Re-convergence Mechanisms

Remove the static re-convergence at the immediate post-dominator.

Dynamic re-convergence implementations:
- Frontier-based re-convergence
  Frontier: the group of threads that have the same PC (program counter) address as the currently active threads
  Re-convergence happens within the same frontier and can cross recursion levels
- Majority-based re-convergence
  Schedules the majority group of threads with the same PC
  Tends to maximize IPC (instructions per cycle)
  Threads in the minority have a chance to join a future majority
- Other dynamic re-convergence mechanisms
  Frontier-ret: schedules return instructions first
  Majority-threshold: prevents starvation of threads in the minority group

Dynamic Re-convergence Mechanisms

[Diagram: threads T0 and T1 diverge at the if/else of one recursion level; under dynamic re-convergence they can re-join at matching blocks of different recursion levels instead of waiting at the current level's exit]

Implementation of Dynamic Re-convergence

GPGPU-Sim simulator
- A cycle-level GPU performance simulator for general-purpose computation on GPUs
- High simulation accuracy (98.3% for GT200, 97.3% for Fermi)
- Models the Fermi micro-architecture
Stack-based re-convergence mechanism
- Stack entry structure
  PC: address of the next scheduled instruction
  Active mask: bitset indicating which threads are active for that PC (1 = active)
- Stack updating function: updates the PC and active mask differently for each implementation
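To illustrate the mechanism, here is a hedged C++ sketch (my reconstruction, not GPGPU-Sim code) of a SIMT re-convergence stack entry and a majority-based update step that picks the PC shared by the most pending threads:

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct StackEntry {
        uint64_t pc;     // next instruction address for this group
        uint32_t mask;   // active mask: bit t set => thread t executes at pc
    };

    /* Majority-based scheduling step: given each pending thread's next PC,
       build one group per distinct PC and schedule the largest group first. */
    StackEntry majority_schedule(const std::vector<uint64_t> &next_pc,
                                 uint32_t pending_mask)
    {
        std::unordered_map<uint64_t, uint32_t> groups;
        for (int t = 0; t < 32; t++)
            if (pending_mask & (1u << t))
                groups[next_pc[t]] |= (1u << t);

        StackEntry best{0, 0};
        for (const auto &g : groups)
            if (__builtin_popcount(g.second) > __builtin_popcount(best.mask))
                best = {g.first, g.second};
        return best;  // minority groups stay pending and may join a later majority
    }

Frontier-based scheduling would instead key the choice purely on PC equality with the current frontier; the data structure is the same, only the update policy differs.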

Frontier-based Dynamic Re-convergence

[Step-by-step stack diagram for threads T0-T2: execution starts with one entry (PC0, mask 111); a divergence splits it into (PC1, mask 100) and (PC2, mask 011); threads that reach the same PC, even at different recursion levels, are merged back into one frontier (PC1, mask 111); finally all threads re-converge at the exit (PC4, mask 111)]

Majority-based Dynamic Re-convergence

[Step-by-step stack diagram for threads T0-T2: after a divergence, the scheduler picks the majority group (PC2, mask 011) and executes it first, leaving the minority thread pending until it can join a later majority]

Experiment Evaluation

GPGPU-Sim simulator, modeling the Fermi architecture.

Recursive benchmarks
- Small number of recursive branches: Fibonacci, Binomial Coefficients
- Large number of recursive branches: Graph Coloring, NQueens
- Dependency between branches: Tak Function
- Only one branch: Mandelbrot Fractals

Performance with Increasing Divergence

Each of eight threads in a warp executes the same list of eight tasks (sizes 10, 8, 6, 9, 7, 5, 2, 4); rotating each thread's list by a per-thread offset controls the amount of intra-warp divergence:
- 0-offset rotation: all threads execute identical task sequences (no divergence)
- 4-offset rotation: two distinct sequences alternate across the warp
- 2-offset rotation: four distinct sequences
- 1-offset rotation: every thread runs a different sequence (maximum divergence)

Performance with Increasing Divergence

[Charts: IPC of Fibonacci and Binomial Coefficients for Serial, Post-dom, Frontier, Frontier-ret, Majority, and Majority-threshold, under task rotations of increasing divergence (0, 4, 2, 1 offset)]

Fibonacci and Binomial Coefficients
- Perfect speedup at 0-offset
- With increasing divergence, Post-dom IPC decreases by 2.5, 4, and 5 times, while Majority decreases by only 1.8, 2.1, and 2.2 times

Performance with Increasing Divergence

[Charts: IPC of NQueens, Graph Coloring, the Tak Function, and Mandelbrot under task rotations of increasing divergence (0, 4, 2, 1 offset), comparing the same six schemes]

Scalability with Warp Width

[Charts: IPC of Fibonacci and Binomial Coefficients for warp widths 4, 8, 16, and 32, comparing Serial, Post-dom, Frontier, Frontier-ret, Majority, and Majority-threshold]

Fibonacci and Binomial Coefficients
- At warp width 4, all versions have similar IPC
- With increasing warp width, Majority scales better than both Frontier and Post-dom

Scalability with Warp Width

[Charts: IPC of NQueens, Graph Coloring, the Tak Function, and Mandelbrot for warp widths 4, 8, 16, and 32, comparing the same six schemes]

Summary

- Current recursion support is limited by the static re-convergence method
- Dynamic re-convergence mechanisms allow re-convergence before or after the immediate post-dominator
- Two implementations: the frontier-based method and the majority-based method
- Kepler GPUs: Dynamic Parallelism blocks the executing kernel when it calls a new kernel, and it is not related to intra-warp scheduling

Different Strategies for Generalized Reductions on GPUs

Tradeoffs between the Full Replication and Locking schemes.

Hybrid Scheme
- Balances Full Replication and the Locking scheme
- Introduces an intermediate scheduling layer, the group, beneath the thread block: intra-group, the Locking scheme; inter-group, Full Replication (see the sketch below)
- Benefits vary with group size: reduced memory overhead, better use of shared memory, reduced combination cost and conflicts

Extensive evaluation of different applications with different parameters.
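A hedged CUDA sketch of the hybrid idea (my illustration; GROUP_SIZE, NUM_BINS, and the histogram-style reduction are assumptions): threads within a group share one copy of the reduction object and update it with atomics, while each group keeps its own private copy, so only inter-group combination remains:

    #define GROUP_SIZE 8    /* assumed threads per group */
    #define NUM_BINS   16   /* assumed reduction object size */

    __global__ void hybrid_reduce(const int *data, int n, int *global_bins)
    {
        int groups_per_block = blockDim.x / GROUP_SIZE;
        int group = threadIdx.x / GROUP_SIZE;

        extern __shared__ int s_bins[];       /* one private copy per group */
        for (int i = threadIdx.x; i < groups_per_block * NUM_BINS; i += blockDim.x)
            s_bins[i] = 0;
        __syncthreads();

        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)     /* locking within the group only */
            atomicAdd(&s_bins[group * NUM_BINS + data[i] % NUM_BINS], 1);
        __syncthreads();

        /* Inter-group (and inter-block) combination. */
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
            int sum = 0;
            for (int g = 0; g < groups_per_block; g++)
                sum += s_bins[g * NUM_BINS + b];
            atomicAdd(&global_bins[b], sum);
        }
    }

Launched with blockDim.x a multiple of GROUP_SIZE and groups_per_block * NUM_BINS * sizeof(int) bytes of dynamic shared memory. A group size of 1 degenerates to Full Replication and a group size of blockDim.x to the Locking scheme; the hybrid scheme tunes this knob between atomic contention and memory overhead.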

Porting Irregular Reductions to Heterogeneous CPU-GPU Architectures

A multi-level partitioning framework
- Parallelizes irregular reductions on a heterogeneous architecture: the coarse-grained level partitions tasks between the CPU and GPU; the fine-grained level partitions tasks between thread blocks or threads
- Both levels use reduction space partitioning
- Eliminates the device memory limitation on the GPU
Runtime support
- Pipelining scheme: overlaps partitioning and computation on the GPU
- Work-stealing-based scheduling: provides load balance and increases the pipeline length
Significant performance improvements
- 11% and 22% improvement for Euler and Molecular Dynamics, respectively

Accelerating Applications on Integrated CPU-GPU Architectures

Thread-block-level scheduling framework on the Fusion APU
- The scheduling targets are thread blocks, not devices
- Only one kernel launch at the beginning, so command-launch overhead is small
- No synchronization between devices or thread blocks
- Inter- and intra-device load balance achieved by fine-grained and factoring scheduling policies
Lock-free implementations
- Master-worker scheduling
- Token scheduling
Applications with different communication patterns
- Stencil computation (Jacobi): 1.6x
- Generalized reduction (K-means): 1.92x
- Irregular reduction (Molecular Dynamics): 1.15x
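One way to picture thread-block-level scheduling is a persistent kernel in which each block repeatedly claims the next chunk of work from a shared counter instead of receiving a fixed assignment at launch. This is my simplification (the thesis uses master-worker and token schemes); next_chunk and the chunk sizing are hypothetical:

    __global__ void persistent_worker(int *next_chunk, int num_chunks,
                                      int chunk_size, const float *in, float *out)
    {
        __shared__ int my_chunk;
        while (true) {
            if (threadIdx.x == 0)                 /* one claim per block */
                my_chunk = atomicAdd(next_chunk, 1);
            __syncthreads();
            if (my_chunk >= num_chunks) break;    /* no work left anywhere */

            int base = my_chunk * chunk_size;
            for (int i = threadIdx.x; i < chunk_size; i += blockDim.x)
                out[base + i] = in[base + i] * 2.0f;  /* stand-in computation */
            __syncthreads();                      /* my_chunk is safe to reuse */
        }
    }

    /* Host side: CPU worker threads can claim chunks from the same counter
       in zero-copy memory, giving inter-device load balance without
       relaunching kernels or synchronizing between devices. */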

Outline

Current Work
- Strategy and runtime support for irregular reductions
- Improved SIMD parallelism for recursion on GPUs
- Different strategies for generalized reductions on GPUs
- Task scheduling frameworks for heterogeneous architectures: decoupled GPU + CPU; coupled GPU + CPU
Proposed Work
- Further extending recursion support: recursion support for vectorization; task scheduling of recursive applications on GPUs
Conclusion

SIMD Extensions

Supported in popular processors
- SIMD lane width has increased from 128 bits (SSE) and 256 bits (AVX) to 512 bits (MIC)
Exhaustively studied in compiler and runtime support
- Memory alignment
- Irregular accesses
- Control flow
No recursion support
- Dynamic function calls
- Extensive divergence

Overall Idea

Stack-based recursion support for SIMD extensions
- A software function-call stack for each SIMD lane
- SIMD operations: contiguous memory accesses
- Divergence: a re-convergence method for SIMD lanes; the strategies used on GPUs cannot be ported to SIMD extensions directly

Structure of the Stack Frame

    Fibonacci(n) {
        if(n <= 1) return 1;        // Check End Case
        int x = Fibonacci(n-1);     // Branch Case
        int y = Fibonacci(n-2);     // Branch Case
        return x+y;                 // Return Case
    }

Stack frame fields:
- PC: the case number within the recursion (check-end, branch, or return)
- n[]: input values
- ret[]: return values

Stack Driver

    Algorithm 5: stack_driver(stack_t stack)
    while not stack.finish() do
        StackFrame &ff = stack.getTop();
        pc = ff.pc; ff.pc++;                 // read the current PC, advance it to the next case
        if pc == 0 then                      // Check End Case
            if ff.n == end_value then
                stack.sf[stack.top - 2].ret += final_value;
                stack.pop();
            end
        else if pc <= branch_num then        // Branch Case
            stack.push(input(ff.n, branch));
        else                                 // Return Case
            stack.sf[stack.top - 2].ret += ff.ret;
            stack.pop();
        end
    end
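As a sanity check of this control structure, here is a runnable scalar C++ driver for Fibonacci (my reconstruction under the slide's conventions; a vectorized version would run one such stack per SIMD lane):

    #include <cstdio>
    #include <vector>

    struct Frame { int pc; int n; int ret; };  // pc: 0 = check end, 1..2 = branches, 3 = return

    int fib_driver(int n0)
    {
        std::vector<Frame> stack{{0, 0, 0},    // dummy root frame receives the result
                                 {0, n0, 0}};
        while (stack.size() > 1) {
            Frame &ff = stack.back();
            int pc = ff.pc++;                  // read the current case, advance to the next
            if (pc == 0) {                     // Check End Case
                if (ff.n <= 1) {
                    stack[stack.size() - 2].ret += 1;
                    stack.pop_back();
                }
            } else if (pc <= 2) {              // Branch Case: push Fib(n-1), then Fib(n-2)
                stack.push_back({0, ff.n - pc, 0});
            } else {                           // Return Case: pass the sum to the parent
                int r = ff.ret;
                stack.pop_back();
                stack.back().ret += r;
            }
        }
        return stack[0].ret;
    }

    int main() { printf("%d\n", fib_driver(10)); }  // prints 89 under base case Fib(0)=Fib(1)=1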

Memory Support

Contiguous memory accesses
- Stacks are stored in column-major order (structure of arrays): the same field of the same stack level across all SIMD lanes occupies contiguous addresses

[Diagram: for four SIMD lanes, each stack level stores PC0-PC3, then N0-N3, then RET0-RET3 contiguously, so one vector load/store touches a single contiguous block per field]
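A minimal sketch of this layout (names and sizes assumed), where updating one field at one stack level for all lanes is a single contiguous, vectorizable access:

    #define LANES     8    /* assumed SIMD width */
    #define MAX_DEPTH 64   /* assumed maximum recursion depth */

    /* Structure-of-arrays stacks: field-major, level-major, lane-minor.
       pc[d][l] is the PC of lane l's frame at depth d, so pc[d][0..LANES-1]
       is contiguous and can be read or written with one vector operation. */
    struct SoaStacks {
        int pc [MAX_DEPTH][LANES];
        int n  [MAX_DEPTH][LANES];
        int ret[MAX_DEPTH][LANES];
        int top[LANES];            /* per-lane stack depth */
    };

    /* Example: advance the PC of every active lane at depth d in one sweep;
       with an array-of-structures layout this would be a strided
       gather/scatter access instead. */
    static inline void advance_pc(SoaStacks *s, int d, unsigned active_mask)
    {
        for (int l = 0; l < LANES; l++)    /* contiguous, vectorizable loop */
            if (active_mask & (1u << l))
                s->pc[d][l]++;
    }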

Other Supports

Divergence support
- Keep all lanes reading/writing the same stack level (insert ghost frames for lanes on short branches)
- Mask operations (supported on MIC)
Re-convergence for SIMD lanes
- Immediate post-dominator re-convergence
- Open question: how to implement efficient dynamic re-convergence? Updating stacks at different levels causes non-contiguous accesses but exposes potential parallelism

Tree-based Recursive Scheduling Framework

Challenges of scheduling recursion
- Dynamic task creation
- Load imbalance
Tree-based scheduling framework
- Hierarchical structure: block tree (global memory), warp tree (shared memory), thread tree (shared memory), and per-thread stacks
- Task stealing within the same level; nodes are divided into public and private, and only public nodes are available for stealing (reducing locking overhead)
- Task stealing across levels: locking-based stealing for warps and blocks; task redistribution for threads in the same warp when imbalance exceeds a threshold

[Diagram: per-thread stacks feed thread trees, which feed warp trees in shared memory and block trees in global memory]

Conclusions

- Different strategies for generalized reductions on GPUs
- Strategy and runtime support for irregular reductions
- Task scheduling frameworks for heterogeneous architectures: decoupled GPU + CPU; coupled GPU + CPU
- Improved SIMD parallelism for recursion on GPUs
- Proposed: further extending recursion support, with recursion support for vectorization and task scheduling of recursive applications on GPUs

Thanks for your attention! Q & A