LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

1 LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs
Jin Wang*, Norm Rubin, Albert Sidelnik, Sudhakar Yalamanchili*
*Computer Architecture and System Lab, Georgia Institute of Technology
NVIDIA Research

2 Executive Summary
Motivation: New memory reference locality relationship in dynamic parallelism
Problem: State-of-the-art GPU schedulers are unaware of this new relationship
  Designed for non-dynamic parallelism settings
  Do not exploit parent-child locality in L1 and L2 cache
Proposed: LaPerm
  Locality-aware thread block scheduler
  Three scheduling decisions

3 Dynamic Parallelism on GPU
Launch workload on demand from GPU
  CUDA Dynamic Parallelism (CDP)
  OpenCL device-side enqueue
  Dynamic Thread Block Launch (DTBL) [1]
Benefits
  Apply to fine-grained parallelism in irregular applications
  Increase execution efficiency and productivity
[Figure: graph traversal example. A parent kernel with workload imbalance launches new kernels or new thread blocks (child kernels/TBs, launched by threads t1 to t4) when there is sufficient parallelism (e.g. 4 threads). Result: reduced workload imbalance, uniform control flow and more memory coalescing.]
[1] Wang, Rubin, Sidelnik and Yalamanchili, "Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs", ISCA

4 Memory Locality in Dynamic Parallelism
New data reference locality relationship in parent-child launching
Potential locality: parent-child and child-sibling data sharing
  L1/L2 locality
Average shared footprint ratio for 8 benchmarks:
  Parent-child: 38.4%
  Child-sibling: 30.5%
[Figure: the parent kernel reads/writes graph data and launches child kernels/TBs, which reuse the graph data through the L1 caches, the L2 cache and GPU global memory.]
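To make the "shared footprint ratio" concrete, here is a minimal Python sketch. The slide does not define the metric, so everything here is an assumption for illustration: a footprint is taken to be the set of cache lines a TB touches, the ratio is taken as shared lines over all lines touched by either TB, and the 128-byte line size and the helper names are hypothetical.

```python
LINE_BYTES = 128  # assumed cache-line size, not taken from the slides


def footprint(addresses, line_bytes=LINE_BYTES):
    """Set of cache-line indices touched by a list of byte addresses."""
    return {addr // line_bytes for addr in addresses}


def shared_footprint_ratio(parent_addrs, child_addrs):
    """Assumed definition: lines touched by both / lines touched by either."""
    parent, child = footprint(parent_addrs), footprint(child_addrs)
    union = parent | child
    return len(parent & child) / len(union) if union else 0.0


# Toy address traces for one parent TB and one child TB.
parent_addrs = [0, 64, 128, 256, 512]
child_addrs = [128, 200, 512, 1024]
ratio = shared_footprint_ratio(parent_addrs, child_addrs)
print(f"shared footprint ratio: {ratio:.2f}")
```

In this toy trace the parent touches 4 lines, the child touches 3, and 2 of the 5 distinct lines are shared.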

6 Current GPU Scheduler
First-Come-First-Serve (FCFS) kernel scheduler
Round-Robin (RR) TB scheduler
[Figure: Kernel 1 and Kernel 2, with thread blocks TB0 to TB5, pass through the FCFS Kernel Distributor; the Round Robin Scheduler then dispatches TB0, TB1, TB2, TB3, TB4, TB5 in order.]
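The baseline policy can be sketched in a few lines of Python. This is an illustrative model only, not the hardware scheduler: the function name, the split of TB0 to TB5 between the two kernels, and the choice of 4 SMXs are assumptions.

```python
from collections import deque


def fcfs_round_robin(kernels, num_smx):
    """Dispatch kernels in arrival order and spread their TBs round-robin."""
    queue = deque(kernels)      # FCFS: kernels leave in arrival order
    placement = []              # (tb_name, smx_id) in dispatch order
    next_smx = 0
    while queue:
        _name, tbs = queue.popleft()
        for tb in tbs:
            placement.append((tb, next_smx))
            next_smx = (next_smx + 1) % num_smx   # round-robin over SMXs
    return placement


# Assumed split of the six TBs between the two kernels.
kernels = [("Kernel 1", ["TB0", "TB1", "TB2", "TB3"]),
           ("Kernel 2", ["TB4", "TB5"])]
placement = fcfs_round_robin(kernels, num_smx=4)
for tb, smx in placement:
    print(tb, "-> SMX", smx)
```

Nothing in this policy looks at which TB launched which, which is why it cannot exploit parent-child locality.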

7 Current GPU Scheduler (continued)
For the non-dynamic parallelism scenario
  Fairness and efficiency
For the dynamic parallelism scenario
  Child TBs are scheduled after parent TBs
[Figure: parent TBs P0 to P7 and child TBs C0 to C5 pass through the FCFS Kernel Distributor and the Round Robin Scheduler; dispatch order is P0 P1 P2 P3 P4 P5 P6 P7 C0 C1 C2 C3 C4 C5.]

8 Current GPU Scheduler: Issues
Issue 1: Child TBs are executed far later than the parent TBs, decreasing L2 locality
[Figure: with the FCFS Kernel Distributor and the Round Robin Scheduler, P0 to P7 run before C0 to C5. Child TBs are far away from parent TBs, with potential L2 pollution in between.]

9 Current GPU Scheduler: Issues (continued)
Issue 1: Child TBs are executed far later than the parent TBs, decreasing L2 locality
Issue 2: Child TBs are executed on a different SMX than their parents, decreasing L1 locality
Fails to exploit parent-child or child-sibling locality
Exacerbated in real applications, which generally have many TBs
[Figure: P0 to P7 and C0 to C5 placed across SMXs below the L2 cache and GPU global memory. Child TBs are not able to utilize the L1 cache on the parent's SMX.]

11 LaPerm: Locality-Aware Scheduler for Dynamic Parallelism
Leverage locality between parent and child TBs
Three scheduling decisions
  Accommodate different forms of locality
Goal: improve memory efficiency and overall performance

12 Scheduling Decision 1: TB Prioritizing
Prioritize child TBs to be executed immediately after their direct parent TBs
  Reuse parent data and avoid L2 cache pollution
[Figure: parent TBs P0 to P7 and child TBs C0 to C5 below the L2 cache and GPU global memory; child TBs are dispatched right after their parents instead of after all parent TBs. Still no L1 locality!]
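A minimal Python sketch of TB prioritizing, under stated assumptions: it models the scheduler as a single priority queue in which a child TB sits one level above its parent, and it runs TBs in waves of 4 slots. The function name, the parent-to-child mapping and the wave model are hypothetical, not taken from the slides.

```python
import heapq
from itertools import count


def tb_pri_order(parents, children_of, slots):
    """Dispatch order when child TBs outrank pending parent TBs."""
    seq = count()   # tie-breaker: FIFO order within a priority level
    heap = [(0, next(seq), p) for p in parents]   # level 0 = parent TBs
    heapq.heapify(heap)
    order = []
    while heap:
        # One wave: take the highest-priority TBs that fit in the slots.
        wave = [heapq.heappop(heap) for _ in range(min(slots, len(heap)))]
        for neg_level, _, tb in wave:
            order.append(tb)
            for child in children_of.get(tb, []):
                # A child gets a higher priority (smaller key) than its parent.
                heapq.heappush(heap, (neg_level - 1, next(seq), child))
    return order


parents = [f"P{i}" for i in range(8)]
# Assumed launch pattern: which parent launches which child TBs.
children_of = {"P0": ["C0"], "P1": ["C1"], "P2": ["C2", "C3"],
               "P4": ["C4"], "P5": ["C5"]}
order = tb_pri_order(parents, children_of, slots=4)
print(" ".join(order))
```

With this toy launch pattern, C0 to C3 run in the wave right after P0 to P3 instead of waiting behind P4 to P7, which is the reordering that keeps parent data in L2 for the children.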

13 Scheduling Decision 2: Prioritized SMX Binding
Bind the prioritized child TBs to the same SMX as their parent TBs
  Utilize the L1 cache for data reuse
[Figure: parent TBs P0 to P7 and child TBs C0 to C5; each child TB is placed on its parent's SMX, so some SMXs receive more TBs than others. Load balancing issue!]
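The load balancing issue can be illustrated with a small Python sketch. It assumes parents are spread round-robin over 4 SMXs and that every child is placed on its parent's SMX; the launch pattern and helper names are hypothetical and chosen to make the skew visible.

```python
def bind_children(parent_smx, children_of):
    """Place every child TB on the SMX that ran its parent."""
    placement = dict(parent_smx)
    for parent, kids in children_of.items():
        for kid in kids:
            placement[kid] = parent_smx[parent]
    return placement


def load_per_smx(placement, num_smx):
    """Number of TBs assigned to each SMX."""
    load = [0] * num_smx
    for smx in placement.values():
        load[smx] += 1
    return load


# Parents P0..P7 spread round-robin over 4 SMXs (assumed).
parent_smx = {f"P{i}": i % 4 for i in range(8)}
# Assumed irregular launch pattern: most children come from SMX 0's parents.
children_of = {"P0": ["C0", "C1", "C2"], "P1": ["C3"], "P4": ["C4", "C5"]}
placement = bind_children(parent_smx, children_of)
load = load_per_smx(placement, 4)
print("TBs per SMX:", load)
```

Every child shares an L1 cache with its parent, but in this example SMX 0 ends up with 7 TBs while SMX 2 and SMX 3 get 2 each.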

14 Scheduling Decision 3: Adaptive Prioritized SMX Binding
Adaptively bind child TBs to available SMXs
  Avoid the load imbalance caused by binding
[Figure: parent TBs P0 to P7 and child TBs C0 to C5; child TBs stay on their parent's SMX when possible and are otherwise placed on SMXs that have free capacity.]
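One way to picture adaptive binding is a per-SMX queue where an idle SMX takes work from the most backlogged one. The Python sketch below is a toy model under that assumption; the step-based timing, the choice of victim and the queue contents are hypothetical and not the mechanism described in the slides.

```python
from collections import deque


def run(queues, adaptive):
    """Each step every SMX runs one TB; returns (steps, migrated TB count)."""
    queues = [deque(q) for q in queues]
    steps = migrated = 0
    while any(queues):
        steps += 1
        for q in queues:
            if q:
                q.popleft()                     # run a TB bound to this SMX
            elif adaptive:
                victim = max(queues, key=len)   # most backlogged SMX
                if victim:
                    victim.pop()                # take its last (lowest-priority) TB
                    migrated += 1
    return steps, migrated


# TBs bound to each of 4 SMXs after strict binding (assumed skew).
bound = [["C0", "C1", "C2", "C3", "C4", "C5"], ["C6", "C7"], ["C8"], ["C9"]]
strict = run(bound, adaptive=False)
adaptive = run(bound, adaptive=True)
print("strict binding  :", strict)
print("adaptive binding:", adaptive)
```

In this example strict binding needs 6 steps, while the adaptive variant finishes in 3 steps at the cost of 3 TBs running away from their parent's SMX, which is the locality versus load balance trade-off the slide describes.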

15 Architecture Support
Multi-level priority queues
  Used to store new TB/kernel information
  Ordered using priority value
  Divided among multiple SMXs
  Off-chip overflow priority queues
[Figure: Kernel Management Unit, FCFS Kernel Distributor and SMX Scheduler with new scheduler logic and on-chip priority queues; overflow priority queues reside in DRAM.]
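A software analogy for one SMX's priority queue with off-chip overflow, as a hedged Python sketch. The class name, the capacity of 2 entries and the spill policy (keep the highest-priority entries on chip, spill the rest) are assumptions made for illustration; the slides do not specify these details.

```python
import heapq
from itertools import count


class SmxPriorityQueue:
    """Bounded on-chip heap plus an unbounded overflow heap (stand-in for DRAM)."""

    def __init__(self, on_chip_capacity):
        self.capacity = on_chip_capacity   # assumed to be at least 1
        self.on_chip = []                  # heap of (-priority, seq, tb)
        self.overflow = []                 # same layout, models the off-chip queue
        self._seq = count()

    def push(self, tb, priority):
        entry = (-priority, next(self._seq), tb)
        if len(self.on_chip) < self.capacity:
            heapq.heappush(self.on_chip, entry)
            return
        worst = max(self.on_chip)          # lowest-priority on-chip entry
        if entry < worst:                  # new entry outranks it: swap them
            self.on_chip.remove(worst)
            heapq.heapify(self.on_chip)
            heapq.heappush(self.on_chip, entry)
            entry = worst
        heapq.heappush(self.overflow, entry)

    def pop(self):
        if not self.on_chip:
            return None
        _, _, tb = heapq.heappop(self.on_chip)
        if self.overflow:                  # refill the freed on-chip slot
            heapq.heappush(self.on_chip, heapq.heappop(self.overflow))
        return tb


q = SmxPriorityQueue(on_chip_capacity=2)
for tb, prio in [("P4", 0), ("P5", 0), ("C0", 1), ("C1", 1)]:
    q.push(tb, prio)
drained = [q.pop() for _ in range(4)]
print(drained)
```

The two child TBs displace the pending parents from the on-chip heap and are popped first; the parents come back from the overflow heap as slots free up.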

16 Benchmark and Experimental Environment
Benchmarks implemented with dynamic parallelism
Simulated on GPGPU-Sim

Benchmark                      Input                             Notation
Adaptive Mesh Refinement       Energy data set                   amr
Barnes Hut Tree                Random data set                   bht
Breadth-First Search           Citation network                  bfs_citation
                               USA road network                  bfs_usa_road
                               Cage15 sparse matrix              bfs_cage15
Graph Coloring                 Citation network                  clr_citation
                               Graph 500                         clr_graph500
                               Cage15 sparse matrix              clr_cage15
Regular Expression Match       Darpa network                     regx_darpa
                               Random string collection          regx_string
Product Recommendation         MovieLens                         pre
Relational Join                Uniform distribution synthetic    join_uniform
                               Gaussian distribution synthetic   join_gaussian
Single-Source Shortest Path    Citation network                  sssp_citation
                               Flight network                    sssp_flight
                               Cage15 sparse matrix              sssp_cage15

Application categories shown alongside the table: Physics Simulation, Tree and Graph Applications, Machine Learning, Relational Database

19 Performance of LaPerm: L2 Cache
[Figure: L2 cache hit rate per benchmark for RR, TB-Pri, SMX-Bind and Adaptive-Bind.]
L2 cache hit rate benefits mainly from prioritizing child TBs
Graph applications with cage15 input have more parent-child data reuse

22 Performance of LaPerm: L1 Cache
[Figure: L1 cache hit rate per benchmark for RR, TB-Pri, SMX-Bind and Adaptive-Bind.]
L1 cache hit rate benefits mainly from binding child TBs to their parents' SMXs
Product recommendation (pre) has good child-sibling data locality

24 Performance of LaPerm: IPC
[Figure: IPC per benchmark for TB-Pri, SMX-Bind and Adaptive-Bind, normalized to RR.]
TB-Pri: IPC increases (1.13x) due to higher L2 cache hit rate
SMX-Bind: IPC decreases (to 1.08x) relative to TB-Pri; the L1 cache hit rate is higher, but load balancing suffers
Adaptive-Bind: overall IPC increases (1.27x) because of both memory efficiency and load balancing
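The three figures are all normalized to RR, so the steps between policies follow by division. The short Python sketch below does only that arithmetic on the numbers quoted on the slide; the variable names are illustrative.

```python
# IPC normalized to the round-robin (RR) baseline, as quoted on the slide.
ipc = {"RR": 1.00, "TB-Pri": 1.13, "SMX-Bind": 1.08, "Adaptive-Bind": 1.27}

# Relative change between successive policies.
bind_vs_pri = ipc["SMX-Bind"] / ipc["TB-Pri"]            # below 1: a slowdown
adaptive_vs_pri = ipc["Adaptive-Bind"] / ipc["TB-Pri"]
adaptive_vs_bind = ipc["Adaptive-Bind"] / ipc["SMX-Bind"]

print(f"SMX-Bind vs TB-Pri:        {bind_vs_pri:.3f}x")
print(f"Adaptive-Bind vs TB-Pri:   {adaptive_vs_pri:.3f}x")
print(f"Adaptive-Bind vs SMX-Bind: {adaptive_vs_bind:.3f}x")
```

Strict binding gives back roughly 4% of the TB-Pri gain, while adaptive binding is roughly 12% above TB-Pri and 18% above strict binding.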

25 Conclusion
LaPerm: Locality-aware thread block scheduler for dynamic parallelism
  Exploit new memory reference locality in parent-child launching
  Three scheduling decisions
  Increase L1/L2 cache locality while maintaining load balance
  Achieve overall memory system efficiency and performance improvement

26 THANK YOU! QUESTIONS?

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA Research Email: {jin.wang,sudha}@gatech.edu,