Orchestrated Scheduling and Prefetching for GPGPUs. Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das
1 Orchestrated Scheduling and Prefetching for GPGPUs Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das
2 Parallelize your code! Launch more threads! Multi-threading Caching Main Memory Prefetching Improve Replacement Policies Is the Warp Scheduler aware of these techniques? Improve Memory Scheduling Policies Improve Prefetcher (look deep in the future, if you can!)
3 [Diagram: a warp scheduler aware of multi-threading (Two-level Scheduling, MICRO), caching (Cache-Conscious Scheduling, MICRO), and main memory (Thread-Block-Aware Scheduling, OWL, ASPLOS). What about prefetching?]
4 Our Proposal
- Prefetch-Aware Warp Scheduler
- Goals:
  - Make a simple prefetcher more capable
  - Improve system performance by orchestrating scheduling and prefetching mechanisms
- % average IPC improvement over Prefetching + Conventional Warp Scheduling Policy
- % average IPC improvement over Prefetching + Best Previous Warp Scheduling Policy
5 Outline
- Proposal
- Background and Motivation
- Prefetch-aware Scheduling
- Evaluation
- Conclusions
6 High-Level View of a GPU: [Diagram: Streaming Multiprocessors (SMs) run threads grouped into Cooperative Thread Arrays (CTAs), also called thread blocks; each SM has a warp scheduler, L1 caches, a prefetcher, and ALUs, and connects through the interconnect to the L2 cache and DRAM]
7 Warp Scheduling Policy
- Equal scheduling priority
  - Round-Robin (RR) execution
- Problem: warps stall roughly at the same time
[Timeline: all warps finish a compute phase, then all issue DRAM requests (D) at the same time, so the SIMT core stalls until the data returns and the next compute phase can start]
8 TWO-LEVEL (TL) SCHEDULING [Timeline: warps are split into groups; while one group waits on its DRAM requests (D), the other group computes, so part of the memory latency is hidden and cycles are saved compared to RR, where every warp stalls at once]
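To make the contrast concrete, here is a minimal sketch in Python (a model, not the simulator's actual code) of how a round-robin selector differs from a two-level selector that keeps issuing from one fetch group until the whole group stalls; the group size and the ready/stalled bookkeeping are simplifying assumptions:

    # Sketch: round-robin vs. two-level warp selection (not the real hardware logic).
    def rr_select(warps, stalled, last):
        """Round-robin: rotate over all warps, skipping the stalled ones."""
        n = len(warps)
        for i in range(1, n + 1):
            w = warps[(last + i) % n]
            if not stalled[w]:
                return w
        return None  # every warp is stalled at once -> the SIMT core idles

    def tl_select(groups, stalled):
        """Two-level: pick from the highest-priority group that still has a
        ready warp; other groups are touched only when this one is stalled."""
        for group in groups:                 # groups ordered by priority
            for w in group:
                if not stalled[w]:
                    return w
        return None

    # TL groups consecutive warps: while group 0 waits on DRAM, group 1 computes.
    warps = list(range(8))
    groups = [warps[i:i + 4] for i in range(0, len(warps), 4)]

Because one group reaches its memory phase while another group still has computation left, part of the DRAM latency is hidden, which is exactly the "saved cycles" in the timeline above.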
9 Accessing DRAM [Figure: memory addresses X, X+1, X+2, X+3 map to one bank and Y, Y+1, Y+2, Y+3 to another. With RR, both address streams arrive together: high bank-level parallelism and high row buffer locality. With TL, one group's requests arrive at a time, leaving a bank idle for a period: low bank-level parallelism, high row buffer locality]
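The bank-level parallelism vs. row buffer locality trade-off can be illustrated with a toy address mapping (the mapping below, two banks with four blocks per row, is an assumption for illustration only, not the DRAM configuration used in the paper):

    # Toy DRAM mapping: consecutive blocks share a row; rows interleave across banks.
    NUM_BANKS = 2
    BLOCKS_PER_ROW = 4

    def bank_of(block):
        return (block // BLOCKS_PER_ROW) % NUM_BANKS

    def row_of(block):
        return block // (BLOCKS_PER_ROW * NUM_BANKS)

    def blp_and_rbl(requests):
        """Bank-level parallelism = distinct banks touched by the request burst;
        row buffer locality = fraction of requests hitting the open row."""
        open_row, hits = {}, 0
        for addr in requests:
            b, r = bank_of(addr), row_of(addr)
            if open_row.get(b) == r:
                hits += 1
            open_row[b] = r
        return len({bank_of(a) for a in requests}), hits / len(requests)

    group_x, group_y = [0, 1, 2, 3], [4, 5, 6, 7]        # "X..X+3" and "Y..Y+3"
    print(blp_and_rbl(group_x + group_y))  # RR: both groups together -> 2 banks busy
    print(blp_and_rbl(group_x))            # TL: one group at a time  -> 1 bank busy

Either way most requests hit the open row, but serving one group at a time leaves the other group's bank idle, which is the TL drawback that prefetching will later address.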
10 Warp Scheduler Perspective (Summary)
Warp Scheduler | Forms Multiple Warp Groups? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR) | No | High | High
Two-Level (TL) | Yes | Low | High
11 Evaluating RR and TL schedulers [Bar chart: IPC improvement factor with a perfect L1 cache for Round-Robin (RR) and Two-Level (TL) across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FT, JPEG, and GMEAN] Can we further reduce this gap via prefetching?
12 (1) Prefetching: Saves more cycles [Timeline: under TL, group 2's compute phase starts only after its DRAM requests (D) return. With prefetching, prefetch requests (P) for group 2 are issued while group 1 computes, so group 2's compute phase can start earlier and more cycles are saved than with TL alone]
13 (2) Prefetching: Improves DRAM Bandwidth Utilization [Figure: prefetch requests to the other group's addresses (Y, Y+1, Y+2, Y+3) keep both banks busy: no idle period, high bank-level parallelism, and high row buffer locality]
14 Challenge: Designing a Prefetcher [Figure: the demanded addresses (X, X+1, ...) and the addresses the other group will need (Y, Y+1, ...) sit in different banks and are not contiguous, so predicting Y from X requires a sophisticated prefetcher]
15 Our Goal
- Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher. To this end, we will design a prefetch-aware warp scheduling policy.
- Why? Simple prefetching does not improve performance with the existing scheduling policies.
16 Simple Prefetching + RR scheduling [Timeline: with RR, the prefetch requests (P) are issued alongside the demands (D) they are meant to cover, so they overlap with those demands (late prefetches) and no cycles are saved]
17 Simple Prefetching + TL scheduling [Timeline: within a group, consecutive warps are scheduled together, so the simple prefetches again overlap with the demands they target (late prefetches); no cycles are saved over TL]
18 Let's Try a Simple Prefetcher: on a demand to address X, prefetch X + 1.
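A minimal sketch of such a next-line prefetcher (the issue/in-flight interface is hypothetical and only stands in for the memory system; the next-line rule itself is the point):

    def next_line(block):
        """Simple prefetcher: on a demand to block X, propose block X + 1."""
        return block + 1

    def on_demand_miss(block, cache, memory):
        # Issue the demand, then the next-line prefetch if it is not already
        # cached or in flight (memory.issue / memory.in_flight are assumed hooks).
        memory.issue(block, is_prefetch=False)
        candidate = next_line(block)
        if candidate not in cache and not memory.in_flight(candidate):
            memory.issue(candidate, is_prefetch=True)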
19 Simple Prefetching with TL scheduling [Figure: group 1 demands X, X+1, X+2, X+3; the next-line prefetcher then prefetches an address that may not be equal to Y, the address group 2 actually needs, so the prefetches are useless (UP) and a bank still sits idle for a period]
20 Simple Prefetching with TL scheduling [Timeline: the useless prefetches (U) consume DRAM bandwidth without helping the next group, so no cycles are saved over TL]
21 Warp Scheduler Perspective (Summary)
Warp Scheduler | Forms Multiple Warp Groups? | Simple Prefetcher Friendly? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR) | No | No | High | High
Two-Level (TL) | Yes | No | Low | High
22 Our Goal
- Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher. To this end, we will design a prefetch-aware warp scheduling policy.
- Simple prefetching does not improve performance with the existing scheduling policies.
23 Prefetch-Aware (PA) Warp Scheduler + Simple Prefetcher: together they deliver what a sophisticated prefetcher would.
24 Prefetch-aware (PA) warp scheduling [Figure: warps access consecutive addresses X, X+1, X+2, X+3, Y, Y+1, Y+2, Y+3. Round-robin scheduling forms no groups; two-level scheduling puts consecutive warps in the same group; prefetch-aware scheduling associates non-consecutive warps with one group, e.g. one group covers X, X+2, Y, Y+2 and the other covers X+1, X+3, Y+1, Y+3] See paper for the generalized algorithm of the PA scheduler.
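A minimal sketch of the grouping step (group formation only; the exact interleaved assignment below is a simplification of the paper's generalized PA algorithm, which also governs when each group is scheduled):

    def tl_groups(warps, group_size):
        """Two-level: consecutive warps land in the same fetch group."""
        return [warps[i:i + group_size] for i in range(0, len(warps), group_size)]

    def pa_groups(warps, num_groups):
        """Prefetch-aware: consecutive warps are spread over different groups, so
        one group's demands (X, X+2, ...) let a simple next-line prefetcher fetch
        the other group's blocks (X+1, X+3, ...) before that group is scheduled."""
        groups = [[] for _ in range(num_groups)]
        for i, w in enumerate(warps):
            groups[i % num_groups].append(w)
        return groups

    warps = list(range(8))        # warps touching consecutive blocks X, X+1, ..., X+7
    print(tl_groups(warps, 4))    # [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(pa_groups(warps, 2))    # [[0, 2, 4, 6], [1, 3, 5, 7]]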
25 Simple Prefetching with PA scheduling [Figure: addresses X, X+1, X+2, X+3 and Y, Y+1, Y+2, Y+3 spread over two banks] The reasoning behind the non-consecutive warp grouping is that the groups can prefetch for each other using the simple prefetcher: the demands of one group's warps trigger next-line prefetches that cover the other group's addresses.
26 Simple Prefetching with PA scheduling [Figure: when the second group is scheduled, its lines have already been prefetched, so its accesses are cache hits]
27 Simple Prefetching with PA scheduling [Timeline: prefetch requests (P) for the second group are issued while the first group computes, so the second group's compute phase starts earlier; cycles are saved over TL, matching the benefit of a sophisticated prefetcher]
28 DRAM Bandwidth Utilization [Figure: with PA scheduling plus the simple prefetcher, both banks stay busy: % increase in bank-level parallelism and % decrease in row buffer locality relative to TL, i.e. high bank-level parallelism while row buffer locality remains high]
29 Warp Scheduler Perspective (Summary)
Warp Scheduler | Forms Multiple Warp Groups? | Simple Prefetcher Friendly? | Bank-Level Parallelism | Row Buffer Locality
Round-Robin (RR) | No | No | High | High
Two-Level (TL) | Yes | No | Low | High
Prefetch-Aware (PA), with prefetching | Yes | Yes | High | High (slightly reduced)
30 Outline
- Proposal
- Background and Motivation
- Prefetch-aware Scheduling
- Evaluation
- Conclusions
31 Evaluation Methodology
- Evaluated on GPGPU-Sim, a cycle-accurate GPU simulator
- Baseline Architecture
  - SMs and memory controllers connected by a crossbar
  - SIMT width and maximum threads per core as in the baseline configuration
  - L1 data cache, texture and constant caches per SM
  - L1 data cache prefetcher, GDDR DRAM
- Applications chosen from:
  - MapReduce applications
  - Rodinia heterogeneous applications
  - Parboil throughput-computing-focused applications
  - NVIDIA CUDA SDK GPGPU applications
32 Spatial Locality Detector based Prefetching [Figure: a macro-block groups consecutive cache lines X, X+1, X+2, X+3; when some of its lines are demanded (D), the prefetcher issues prefetches (P) for the lines of the same macro-block that were not accessed. D = Demand, P = Prefetch] The prefetch-aware scheduler improves the effectiveness of this simple prefetcher. See paper for more details.
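A minimal sketch of the macro-block rule (the four-line macro-block size and the accessed-line bookkeeping are illustrative assumptions; see the paper for the actual spatial locality detector):

    MACRO_BLOCK_LINES = 4   # assumed number of consecutive cache lines per macro-block

    def macro_block_of(line):
        return line // MACRO_BLOCK_LINES

    def sld_prefetch_candidates(demanded_line, accessed_lines):
        """On a demand (D) to one line, prefetch (P) the lines of the same
        macro-block that have not been demanded yet."""
        base = macro_block_of(demanded_line) * MACRO_BLOCK_LINES
        block = range(base, base + MACRO_BLOCK_LINES)
        return [l for l in block if l != demanded_line and l not in accessed_lines]

    # Demand to X+1 when X was already demanded -> prefetch X+2 and X+3.
    print(sld_prefetch_candidates(5, accessed_lines={4}))   # [6, 7]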
33 Improving Prefetching Effectiveness [Charts comparing RR+Prefetching, TL+Prefetching, and PA+Prefetching on three metrics: fraction of late prefetches, prefetch accuracy, and reduction in L1D miss rates; PA+Prefetching improves all three]
34 Performance Evaluation [Bar chart, normalized to RR scheduling: RR+Prefetching, TL, TL+Prefetching, Prefetch-Aware (PA), and PA+Prefetching across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FT, JPEG, and GMEAN] % IPC improvement over Prefetching + RR Warp Scheduling Policy (commonly used); % IPC improvement over Prefetching + TL Warp Scheduling Policy (best previous). See paper for additional results.
35 Conclusions
- Existing warp schedulers in GPGPUs cannot take advantage of simple prefetchers
  - Consecutive warps have good spatial locality and can prefetch well for each other
  - But existing schedulers schedule consecutive warps close by in time, so prefetches are too late
- We proposed prefetch-aware (PA) warp scheduling
  - Key idea: group consecutive warps into different groups
  - Enables a simple prefetcher to be timely, since warps in different groups are scheduled at separate times
- Evaluations show that PA warp scheduling improves performance over combinations of conventional (RR) and the best previous (TL) warp scheduling and prefetching policies
  - Better orchestrates warp scheduling and prefetching decisions
36 THANKS! QUESTIONS?
37 BACKUP
38 Effect of Prefetch-aware Scheduling [Chart: percentage of DRAM requests (averaged over a group) grouped by how many misses fall in the same macro-block; requests with many misses to a macro-block are high-spatial-locality requests, and under prefetch-aware scheduling a large share of them is recovered by prefetching compared to two-level scheduling]
39 Working (With Two-Level Scheduling) [Figure: two macro-blocks, X..X+3 and Y..Y+3, are filled entirely by demand requests (D): all the high-spatial-locality requests go to DRAM as demands]
40 Working (With Prefetch-Aware Scheduling) [Figure: in each macro-block, the demands (D) of the currently scheduled group are interleaved with prefetches (P) that cover the other group's lines]
41 Working (With Prefetch-Aware Scheduling) [Figure: when the other group is scheduled, its lines in both macro-blocks have already been prefetched, so its accesses are cache hits]
42 Effect on Row Buffer Locality [Bar chart: row buffer locality for TL, TL+Prefetching, PA, and PA+Prefetching across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FT, JPEG, and AVG] % decrease in row buffer locality over TL.
43 Effect on Bank-Level Parallelism [Bar chart: bank-level parallelism for RR, TL, and PA across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FT, JPEG, and AVG] % increase in bank-level parallelism over TL.
44 Simple Prefetching + RR scheduling [Figure: memory addresses X, X+1, X+2, X+3 and Y, Y+1, Y+2, Y+3 spread over two banks; with RR, requests to both banks arrive together, so neither bank idles]
45 Simple Prefetching with TL scheduling [Figure: with TL, one group's addresses are requested at a time, so one bank is idle for a period while the other group's bank is busy]
46 CTA-Assignment Policy (Example) [Figure: the CTAs of a multi-threaded CUDA kernel are distributed across SIMT cores; each core receives some of the CTAs and has its own warp scheduler, L1 caches, and ALUs]
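A minimal sketch of spreading a kernel's CTAs over SIMT cores (the round-robin order is an assumption for this example; real hardware dispatches CTAs to cores as their resources free up):

    def assign_ctas(num_ctas, num_cores):
        """Distribute the kernel's CTAs across SIMT cores in round-robin order;
        each core then runs its CTAs' warps under its own warp scheduler."""
        assignment = {core: [] for core in range(num_cores)}
        for cta in range(num_ctas):
            assignment[cta % num_cores].append(cta)
        return assignment

    print(assign_ctas(num_ctas=4, num_cores=2))   # {0: [0, 2], 1: [1, 3]}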