An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs

1 An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs
Xin Huo, Vignesh T. Ravi, Wenjing Ma and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University

2 Irregular Reduction - Context
A dwarf in the Berkeley view on parallel computing
- Unstructured Grid pattern
- More random and irregular accesses
- Indirect memory references
Previous efforts in porting to different architectures
- Distributed memory machines
- Distributed shared memory machines
- Shared memory machines
- Cache performance improvement on uniprocessors
- No systematic study on modern GPUs

3 WHY GPU? - A Glimpse
Favorable price/performance and performance/watt ratios
GPUs are pervasive
- Mobiles, Notebooks, Desktops, Supercomputers, Clouds...
- NVIDIA's Tegra brings GPUs to mobiles
- Fusion proposed by AMD, Sandy Bridge provided by Intel
- NVIDIA's Project Denver combines ARM and GPU
- 3 out of the 4 fastest supercomputers are based on GPUs
- Amazon GPU instances
Latest NVIDIA GPU - Fermi or T20 series
- Larger shared memory (64 KB compared to 16 KB)
- Configurable shared memory
  - 16 KB shared memory + 48 KB L1 cache (L1 Cache Preferred)
  - 48 KB shared memory + 16 KB L1 cache (Shared Memory Preferred)

4 Outline
Background
- Irregular Reduction Structure
- Main Issues
Contributions
Partitioning-Based Locking
Runtime Support
Experiments
Conclusions

5 Background

6 Irregular Reduction Structure
Regular Reduction
    {* Outer Sequence Loop *}
    while( ) {
        {* Reduction Loop *}
        Foreach(element e) {
            (i,val) = Process(e);
            RObj(i) = Reduce(RObj(i),val);
        }
        Global Reduction to Combine RObj
    }
RObj: Reduction object
e: Iterations of computation loop
e and RObj: Direct Accesses

7 Irregular Reduction Structure
Regular Reduction
    {* Outer Sequence Loop *}
    while( ) {
        {* Reduction Loop *}
        Foreach(element e) {
            (i,val) = Process(e);
            RObj(i) = Reduce(RObj(i),val);
        }
        Global Reduction to Combine RObj
    }
Irregular Reduction
    {* Outer Sequence Loop *}
    while( ) {
        {* Reduction Loop *}
        Foreach(element e) {
            (IA(e,0),val1) = Process(IA(e,0));
            (IA(e,1),val2) = Process(IA(e,1));
            RObj(IA(e,0)) = Reduce(RObj(IA(e,0)),val1);
            RObj(IA(e,1)) = Reduce(RObj(IA(e,1)),val2);
        }
        Global Reduction to Combine RObj
    }
IA(e,x): Indirection Array
IA: Iterators over e (Computation Space)
RObj: Accessed by Indirection Array (Reduction Space)
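
The irregular reduction loop above can be sketched sequentially in Python. This is a minimal illustration, not the authors' implementation: the function names, the additive `Reduce`, and the `equal_and_opposite` interaction are all assumptions made for the example.

```python
def irregular_reduction(num_nodes, edges, process):
    """One pass of the reduction loop over all edges.

    edges[e] plays the role of (IA(e,0), IA(e,1)); `process` is a
    hypothetical stand-in for Process and returns (val1, val2).
    """
    robj = [0.0] * num_nodes              # reduction object, one slot per node
    for u, v in edges:                    # computation space: one iteration per edge
        val1, val2 = process(u, v)
        robj[u] += val1                   # RObj(IA(e,0)) = Reduce(RObj(IA(e,0)), val1)
        robj[v] += val2                   # RObj(IA(e,1)) = Reduce(RObj(IA(e,1)), val2)
    return robj


def equal_and_opposite(u, v):
    # Hypothetical interaction: +1 on the first endpoint, -1 on the second.
    return 1.0, -1.0


edges = [(0, 1), (1, 2), (2, 0), (0, 3)]
result = irregular_reduction(4, edges, equal_and_opposite)
print(result)
```

The writes to `robj` go through the edge endpoints rather than the loop index, which is the indirect access that makes the loop hard to parallelize.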

8 Application Context
Molecular Dynamics
- Indirection Array -> Edges (Interactions)
- Reduction Objects -> Molecules (Attributes)
- Computation Space -> Interactions b/w molecules
- Reduction Space -> Attributes of Molecules

9 Main Issues
Traditional reduction strategies are not effective
- Full Replication (Private copy per thread)
  - Large memory overhead
  - Both intra-block and inter-block combination
  - Shared memory usage unlikely
- Locking Scheme (Private copy per block)
  - Heavy conflicts within a block
  - Avoids intra-block combination, but not inter-block combination
  - Shared memory is only available for small data sets
Need to choose Partitioning Strategy
- Choice of partitioning space (Computation vs. Reduction)
- Tradeoffs: Partitioning overhead & Execution efficiency
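
The memory overhead and combination step of Full Replication can be seen in a small CPU-side simulation. This is only a sketch under assumptions: threads are simulated by a loop, edges are assigned round-robin, and each edge contributes +1 and -1 to its endpoints.

```python
def full_replication(num_nodes, edges, num_threads):
    """Simulate Full Replication: one private reduction object per thread."""
    private = [[0.0] * num_nodes for _ in range(num_threads)]
    for e, (u, v) in enumerate(edges):
        t = e % num_threads               # round-robin edge assignment (assumption)
        private[t][u] += 1.0              # no locking needed: the copy is private
        private[t][v] -= 1.0
    # Combination step: every private copy must be merged into the final result.
    combined = [sum(copy[i] for copy in private) for i in range(num_nodes)]
    slots_stored = num_threads * num_nodes  # memory grows with the thread count
    return combined, slots_stored


edges = [(0, 1), (1, 2), (2, 0), (0, 3)]
combined, slots_stored = full_replication(4, edges, num_threads=2)
print(combined, slots_stored)
```

With thousands of GPU threads, `slots_stored` is thousands of times the size of the reduction object, which is why the slide considers shared memory usage unlikely for this scheme.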

10 Contributions
A Novel Partitioning-based Locking Strategy
- Efficient shared memory utilization
- Eliminates both intra-block and inter-block combination
Optimized Runtime Support
- Multi-Dimensional Partitioning Scheme
- Reordering & Updating components (Maintain Correctness)
Significant Performance Improvements
- Exhaustive evaluation
- Up to 3.3x improvement over traditional strategies

11 Partitioning-based Locking Strategy

12 Partitioning-based Locking Strategy
[Figure: Host and GPU. Partitions 1 to N of the Reduction Object on the host; each partition is placed in the Shared Memory of one SM (SM1, SM2, ..., SM N)]

13 Data Structures & Access Pattern
Irregular Reduction
    {* Outer Sequence Loop *}
    while( ) {
        {* Reduction Loop *}
        Foreach(element e) {
            (IA(e,0),val1) = Process(IA(e,0));
            (IA(e,1),val2) = Process(IA(e,1));
            RObj(IA(e,0)) = Reduce(RObj(IA(e,0)),val1);
            RObj(IA(e,1)) = Reduce(RObj(IA(e,1)),val2);
        }
        Global Reduction to Combine RObj
    }
Goal: Utilize shared memory
- IA: No reuse, no benefit from shared memory
- RObj: Reuse is possible, more benefits from shared memory
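
The reuse argument can be checked by counting accesses. The sketch below is an illustration with made-up data; `access_counts` is a hypothetical helper, not part of the presented runtime.

```python
from collections import Counter


def access_counts(edges):
    """Count how often each structure is touched in one reduction pass."""
    ia_reads = 2 * len(edges)             # each IA entry is read exactly once
    robj_updates = Counter()              # updates per reduction element
    for u, v in edges:
        robj_updates[u] += 1
        robj_updates[v] += 1
    return ia_reads, robj_updates


edges = [(0, 1), (1, 2), (2, 0), (0, 3)]
ia_reads, robj_updates = access_counts(edges)
avg_reuse = ia_reads / len(robj_updates)  # average updates per reduction element
print(ia_reads, dict(robj_updates), avg_reuse)
```

Every indirection-array entry is read once, so caching it gains nothing, while each reduction element is updated once per incident edge and therefore benefits from staying in fast memory.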

14 Choice of Partitioning Space
Two partitioning choices:
- Computation Space: partition on edges
- Reduction Space: partition on nodes

15 Computation Space Partitioning
Partitioning on the iterations of computation loop
[Figure: example graph of 16 nodes divided into partitions]
Pros:
- Load Balance on Computation
Cons:
- Unequal reduction size in each partition
- Replicated reduction elements (4 out of 16 nodes are replicated)
- Combination cost
  - Shared memory cannot be used on GPU
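
A small sketch of computation space partitioning shows where the replicated reduction elements come from. The chain graph and the equal-chunk split are assumptions chosen for illustration; the slide's own example graph is not reproduced here.

```python
def computation_space_partition(edges, num_parts):
    """Split the edge list into contiguous chunks of (almost) equal size."""
    chunk = -(-len(edges) // num_parts)   # ceiling division
    return [edges[i:i + chunk] for i in range(0, len(edges), chunk)]


def replicated_nodes(parts):
    """Return the nodes that are updated by more than one partition."""
    owners = {}
    for p, part in enumerate(parts):
        for u, v in part:
            owners.setdefault(u, set()).add(p)
            owners.setdefault(v, set()).add(p)
    return sorted(n for n, ps in owners.items() if len(ps) > 1)


edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
parts = computation_space_partition(edges, 2)
shared = replicated_nodes(parts)
print([len(p) for p in parts], shared)
```

The edge counts are balanced, but any node listed in `shared` needs a copy in each partition that touches it, and those copies have to be combined afterwards.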

16 Reduction Space Partitioning
Partitioning on the Reduction Elements
[Figure: example graph divided into partitions]
Pros:
- Balanced reduction space
- Partitions are independent of each other
  - Avoids combination cost
Cons:
- Imbalance on computation space
- Replicated work caused by the crossing edges
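
Reduction space partitioning can be sketched as follows. The `owner` array, the +1/-1 contributions, and the function names are assumptions for the example. A crossing edge is handed to both partitions, and each partition updates only the endpoints it owns.

```python
def reduction_space_partition(edges, owner, num_parts):
    """Assign each edge to every partition that owns one of its endpoints."""
    work = [[] for _ in range(num_parts)]
    crossing = 0
    for u, v in edges:
        work[owner[u]].append((u, v))
        if owner[v] != owner[u]:          # crossing edge: replicated work
            work[owner[v]].append((u, v))
            crossing += 1
    return work, crossing


def reduce_partition(part_id, part_edges, owner, robj):
    """Process one partition; only nodes owned by part_id are written."""
    for u, v in part_edges:
        if owner[u] == part_id:
            robj[u] += 1.0
        if owner[v] == part_id:
            robj[v] -= 1.0


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
owner = [0, 0, 1, 1]                      # nodes 0,1 -> partition 0; nodes 2,3 -> partition 1
work, crossing = reduction_space_partition(edges, owner, 2)
robj = [0.0] * 4
for p in range(2):
    reduce_partition(p, work[p], owner, robj)
print(crossing, [len(w) for w in work], robj)
```

No partition writes to another partition's nodes, so no combination step is needed; the price is that the three crossing edges are each processed twice.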

17 Reduction Space Partitioning - Challenges
Trade-offs between Cost and Efficiency
- The cost of partitioning: execution time
- The efficiency of partitioning: crossing edges
Maintain Correctness on GPU
- Reorder reduction space
- Update/Reorder computation space

18 Runtime Support

19 Runtime Partitioning Approaches
Metis Partitioning (Multi-level k-way Partitioning)
- Executes sequentially on CPU
- Minimizes crossing edges
- Cons: Large overhead for data initialization
GPU-based (Trivial) Partitioning
- Parallel execution on GPU
- Minimizes execution time
- Cons: Large number of crossing edges among partitions
Multi-dimensional Partitioning
- Executes sequentially on CPU
- Balances both crossing edges and execution time

20 Multi-dimensional Partitioning
Partitioning based on co-ordinate information (x, y, z dimensions)
- Use Find-Kth-smallest-number
[Figure: reduction objects partitioned successively on the X dimension, the Y dimension, and the Z dimension, with the resulting partition numbers]
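
One way to read the coordinate-based scheme is as successive equal-count splits along x, then y, then z. The sketch below follows that reading and is an assumption, not the authors' code: it sorts each group as a simple stand-in for the find-k-th-smallest selection, and the `parts_per_dim` parameter is hypothetical.

```python
def split_equal(ids, key, k):
    """Split ids into k groups of (almost) equal size, ordered by key.

    Sorting is used for brevity; a linear-time k-th smallest selection
    could find the same split points.
    """
    ordered = sorted(ids, key=key)
    n = len(ordered)
    bounds = [n * j // k for j in range(k + 1)]
    return [ordered[bounds[j]:bounds[j + 1]] for j in range(k)]


def multi_dim_partition(coords, parts_per_dim):
    """Return a partition id per node; coords[i] = (x, y, z)."""
    groups = [list(range(len(coords)))]
    for dim, k in enumerate(parts_per_dim):      # x, then y, then z
        groups = [chunk for g in groups
                  for chunk in split_equal(g, lambda i, d=dim: coords[i][d], k)]
    part = [0] * len(coords)
    for p, g in enumerate(groups):
        for i in g:
            part[i] = p
    return part


# Eight nodes on the corners of a unit cube.
coords = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
part = multi_dim_partition(coords, (2, 2, 2))    # 8 partitions, one node each
halves = multi_dim_partition(coords, (2, 1, 1))  # split on x only
print(part, halves)
```

Splitting by node count keeps the reduction space balanced, and because nearby nodes land in the same partition, fewer edges are expected to cross than with a split that ignores coordinates.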

21 Runtime Reordering & Updating
[Figure: the Reduction Space Partitioning Module assigns a partition to each node; the Runtime Reordering Module, with its Reordering Component and Updating Component, takes the nodes, edges, and partition assignments and produces the reordered reduction and computation space]
Two Components
- Reordering Component
- Updating Component
Nodes:
- Reordering: helps transfers b/w shared and host memory
Edges:
- Updating: maintain correctness
- Reordering: coalesce access
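
The two components can be sketched on the CPU as a renumbering of nodes followed by a rewrite of the edge list. The names below are hypothetical, and sorting the renamed edges is only one plausible way to group them for coalesced access.

```python
def reorder_nodes(part):
    """Reordering component: place nodes of the same partition contiguously.

    Returns `order` (new position -> old node id) and `new_id`
    (old node id -> new position).
    """
    order = sorted(range(len(part)), key=lambda i: part[i])  # stable sort
    new_id = [0] * len(part)
    for new, old in enumerate(order):
        new_id[old] = new
    return order, new_id


def update_edges(edges, new_id):
    """Updating component: rewrite endpoints with the new node ids,
    then reorder the edges so that neighbouring edges touch nearby nodes."""
    renamed = [(new_id[u], new_id[v]) for u, v in edges]
    return sorted(renamed)


part = [1, 0, 1, 0]                       # partition id of each node
attrs = ["a", "b", "c", "d"]              # per-node attributes
edges = [(0, 1), (1, 2), (2, 3)]

order, new_id = reorder_nodes(part)
reordered_attrs = [attrs[old] for old in order]
updated = update_edges(edges, new_id)
print(order, new_id, reordered_attrs, updated)
```

After reordering, each partition is one contiguous slice of the node array and can be copied to shared memory in a single block; the updated edges still describe the same graph, which is what keeps the result correct.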

22 Experiments

23 Experiment Setup
Platform
- NVIDIA Tesla C2050 Fermi (14x32=448 cores)
- GB device memory
- 64 KB configurable shared memory
  - 48 KB shared memory and 16 KB L1 cache
  - 16 KB shared memory and 48 KB L1 cache
- Intel 2.27 GHz Quad core Xeon E5520 with 48 GB memory
Applications
- Euler (Computational Fluid Dynamics)
  - 20K nodes, 120K edges, and 12K faces
- MD (Molecular Dynamics)
  - 37K molecules, 4.6 Million interactions

24 Euler - Performance Gains
[Figure: Euler: comparison between Partitioning-based Locking (PBL), Locking, Full Replication, and sequential CPU time. Y-axis: execution time (sec)]

25 Molecular Dynamics - Performance Gains
[Figure: Molecular Dynamics: comparison between Partitioning-based Locking (PBL), Locking, Full Replication, and sequential CPU time. Y-axis: execution time (sec)]

26 Comparison of Different Partitioning Schemes
[Figure: Euler: Metis Partitioner (MP), GPU Partitioner (GP), and Multi-dimensional Partitioner (MD) compared on 14, 28 and 42 partitions. Y-axis: log(time (us)); X-axis: number of partitions in Euler; series: Init Time, Running Time, Reordering Time]
Shows only partitioning time (Init Time + Running Time + Reordering Time)
Initialization Time
- MP: largest
- MD: no initialization
Running Time
- GP: shortest
- MD: similar to MP
Reordering Time
- Similar on three strategies

27 End-to-End Execution Time with Different Partitioners
[Figure: Euler: end-to-end execution time (sec) of the PBL scheme with the Multi-dimensional Partitioner (MD), GPU Partitioner (GP), and Metis Partitioner (MP) on 28 partitions; components: Computation, Copy, Partitioning, Reordering]
MP
- Partitioning time is even larger than computation time
GP
- Much more redundant work slows down the execution

28 Conclusions
Systematic study to parallelize irregular reductions on modern GPUs
A novel Partitioning-based Locking Strategy
Optimized Runtime support
- Three Partitioning Schemes
- Reordering and updating components
Multi-Dimensional Partitioning can balance cost and efficiency
Achieves significant performance improvement over traditional methods

29 Thank you
Questions?
Contacts: Xin Huo, Vignesh T. Ravi, Wenjing Ma, Gagan Agrawal
