PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS

1 PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS
Neha Agarwal*, David Nellans, Mark Stephenson, Mike O'Connor, Stephen W. Keckler
NVIDIA, *University of Michigan
ASPLOS 2015

2 EVOLVING GPU MEMORY SYSTEM
[Diagram: GPU + GDDR5 (200 GB/s) linked by PCI-E (15.8 GB/s) to CPU + DDR4 (80 GB/s)]
Roadmap
- Current (CUDA 1-7): cudaMemcpy / unified virtual memory over a low-BW interconnect

3 EVOLVING GPU MEMORY SYSTEM
[Diagram: GPU + GDDR5 (200 GB/s) linked by cache-coherent NVLink (80 GB/s) to CPU + DDR4 (80 GB/s)]
Roadmap
- Current (CUDA 1-7): cudaMemcpy / unified virtual memory over a low-BW interconnect
- Future: cache-coherent, high-BW CPU-GPU interconnect
How can we best exploit the bandwidth of both GPU memory and the cache-coherent link?

4 GPU WORKLOAD CHARACTERISTICS
Highly sensitive to memory bandwidth
[Figure: relative throughput of Rodinia, Parboil, and DoE mini apps vs. aggregate DRAM bandwidth, from 0.125x to 3x, where 1x = 200 GB/sec]
Applications perform up to 2.8x better with 3x more bandwidth.

5 GPU WORKLOAD CHARACTERISTICS
Memory latency tolerant
[Figure: relative throughput of Rodinia, Parboil, and DoE mini apps vs. DRAM latency in cycles; baseline is 100 cycles]
Applications are more sensitive to memory bandwidth than to latency.

6 CPU-GPU MEMORY ORGANIZATION
Variation in bandwidth and capacity
- BW-optimized memory → GPU
  - Higher BW, limited capacity
  - HBM, GDDR5, WIO2
- Capacity-optimized memory → CPU
  - Higher capacity, lower BW
  - DDR4, LPDDR4

System    GPU memory (BW-optimized)    CPU memory (capacity-optimized)    BW ratio
HPC       HBM: 1000 GB/sec, 16 GB      DDR4: 120 GB/sec, 256 GB           8.3x
Desktop   GDDR5: 200 GB/sec, 4 GB      DDR4: 80 GB/sec, 32 GB             2.5x
Mobile    WIO2: 51 GB/sec, 1 GB        LPDDR4: 24 GB/sec, 4 GB            2.1x

How should data be placed to exploit the bandwidth of all levels of memory?

7 CONTRIBUTIONS
BW-maximizing page placement policies
- The BW-AWARE policy places pages in the ratio of the memory bandwidths
- The application-aware policy selectively places hot pages in GPU memory
- Data-structure-based annotation can identify hot pages
BW-AWARE page placement is 35% better than Linux INTERLEAVE and 18% better than Linux LOCAL.
The application-aware policy achieves 90% of the static oracle's performance.

8 OUTLINE
- Page placement policies
- Linux NUMA page placement
- BW-AWARE placement
- Application memory access patterns
- Application-aware placement
- Results and conclusions

9 NUMA PAGE PLACEMENT
Linux's LOCAL placement for CPU
[Diagram: GPU + GDDR5 (200 GB/s) linked by NVLink (80 GB/s) to CPU + DDR4 (80 GB/s)]
[Figure: time (s) to serve 100 GB from GPU memory (200 GB/s) and CPU memory (80 GB/s), compared with the minimum time to serve 100 GB; lower is better]
- All data in capacity-optimized CPU memory
- Wastes high-BW GPU memory

10 NUMA PAGE PLACEMENT
Linux's LOCAL placement for GPU
[Diagram: GPU + GDDR5 (200 GB/s) linked by NVLink (80 GB/s) to CPU + DDR4 (80 GB/s)]
[Figure: time (s) to serve 100 GB from GPU memory (200 GB/s) and CPU memory (80 GB/s), compared with the minimum time to serve 100 GB; lower is better]
- All data in bandwidth-optimized GDDR memory
- Wastes cache-coherent link bandwidth

11 NUMA PAGE PLACEMENT
Linux's INTERLEAVE placement
[Diagram: GPU + GDDR5 (200 GB/s) linked by NVLink (80 GB/s) to CPU + DDR4 (80 GB/s)]
[Figure: time (s) to serve 100 GB from GPU memory (200 GB/s) and CPU memory (80 GB/s), compared with the minimum time to serve 100 GB; lower is better]
- 50% of data in GDDR memory, 50% in DDR memory
- Exploits the bandwidth of both GDDR and DDR memories

12 NUMA PAGE PLACEMENT
Bandwidth-optimal placement?
[Diagram: GPU + GDDR5 (200 GB/s) linked by NVLink (80 GB/s) to CPU + DDR4 (80 GB/s)]
[Figure: time (s) to serve 100 GB from GPU memory (200 GB/s) and CPU memory (80 GB/s), compared with the minimum time to serve 100 GB; lower is better]
- Goal: minimize the time to access the data
- What is the optimal data placement ratio?

13 NUMA PAGE PLACEMENT
Bandwidth optimal → BW-AWARE placement
[Diagram: GPU + GDDR5 (200 GB/s) serving 70% of accesses, linked by NVLink (80 GB/s) to CPU + DDR4 (80 GB/s) serving 30% of accesses]
[Figure: time (s) to serve 100 GB from GPU memory (200 GB/s) and CPU memory (80 GB/s), compared with the minimum time to serve 100 GB; lower is better]
- Place data in the ratio of the memory bandwidths
- Achieves the goal of minimizing the time to access the data
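
The comparison across slides 9 to 13 can be reproduced with a small host-side model. The sketch below is illustrative only and is not code from the talk: it assumes uniformly accessed data and that both memories are read fully in parallel, so the slower side determines the total time. The function names `optimal_gpu_fraction` and `time_to_serve` are made up for this example.

```cpp
#include <algorithm>
#include <cassert>

// Fraction of data to place in the bandwidth-optimized (GPU) memory so that
// both memories finish serving their share at the same time.
double optimal_gpu_fraction(double gpu_bw, double cpu_bw) {
    assert(gpu_bw > 0.0 && cpu_bw > 0.0);
    return gpu_bw / (gpu_bw + cpu_bw);
}

// Time (seconds) to serve `total_gb` of uniformly accessed data when a
// fraction `gpu_fraction` lives in GPU memory and the rest in CPU memory.
// Assumption: the two memories are read in parallel, so the slower side
// sets the total time. Bandwidths are in GB/s.
double time_to_serve(double total_gb, double gpu_fraction,
                     double gpu_bw, double cpu_bw) {
    assert(gpu_fraction >= 0.0 && gpu_fraction <= 1.0);
    double gpu_time = total_gb * gpu_fraction / gpu_bw;
    double cpu_time = total_gb * (1.0 - gpu_fraction) / cpu_bw;
    return std::max(gpu_time, cpu_time);
}
```

With the deck's numbers (100 GB, 200 GB/s GDDR5, 80 GB/s DDR4), this model gives 1.25 s with all data in CPU memory, 0.5 s with all data in GPU memory, 0.625 s for a 50:50 interleave, and about 0.357 s at the bandwidth ratio of roughly 71:29, which the slides round to 70:30.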

14 BW-AWARE IMPLEMENTATION
Physical page allocation:
1. Generate a random number N ∈ [0, 99]
2. If N ≥ 30: page → GDDR
3. Else: page → DDR
Random placement distributes data accesses in the optimal ratio.
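
The allocation rule on this slide can be sketched in a few lines of host-side C++. This is an illustration, not the authors' allocator: the names `Zone`, `zone_for_draw`, and `place_page` are hypothetical, and the threshold of 30 is chosen because it sends 70 of every 100 possible draws to GDDR, matching the 70:30 split stated in the deck.

```cpp
#include <cassert>
#include <random>

enum class Zone { GDDR, DDR };

// Map a draw n in [0, 99] to a memory zone. Draws below `ddr_percent` go to
// DDR and the rest go to GDDR, so ddr_percent = 30 yields a 70:30 split.
Zone zone_for_draw(int n, int ddr_percent) {
    assert(n >= 0 && n <= 99);
    assert(ddr_percent >= 0 && ddr_percent <= 100);
    return (n >= ddr_percent) ? Zone::GDDR : Zone::DDR;
}

// Pick a zone for one newly allocated physical page (hypothetical helper).
Zone place_page(std::mt19937& rng, int ddr_percent = 30) {
    std::uniform_int_distribution<int> draw(0, 99);  // bounds are inclusive
    return zone_for_draw(draw(rng), ddr_percent);
}
```

Separating the threshold test from the random draw keeps the placement rule deterministic and easy to check, while `place_page` supplies the randomness for each allocation.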

15 BW-AWARE PERFORMANCE
- BW-AWARE performs
  - 35% better than INTERLEAVE
  - 18% better than LOCAL
- BW-AWARE policy places
  - 70% of accesses → GDDR (70G)
  - 30% of accesses → DDR (30D)
[Figure: throughput relative to Linux INTERLEAVE for LOCAL, BW-AWARE, and INTERLEAVE across ratios of pages placed in GDDR:DDR memories. Benchmarks: backprop, bfs, cfd, gaussian, kmeans, mummergpu, needle, nn, pathfinder, srad_v1, cns, comd, minife, xsbench, histo, lbm, sad, sgemm, stencil]
What if GDDR memory doesn't have enough capacity?

16 LIMITED MEMORY CAPACITY
- Performance degrades at low GDDR memory capacity
- At low GDDR capacity (10%): up to 4x slowdown
[Figure: throughput relative to optimal vs. available GDDR memory capacity (%). Benchmarks: backprop, bfs, cfd, gaussian, kmeans, mummergpu, needle, nn, pathfinder, srad_v1, cns, comd, minife, xsbench, histo, lbm, sad, sgemm, stencil]
Data access ratio ≠ optimal

17 APPLICATION CHARACTERISTICS
- Bandwidth CDF
  - At OS page granularity (4 KB)
  - Hotness: number of accesses to a page served from DRAM
- Application pages are accessed non-uniformly
[Figure: fraction of memory bandwidth (%) vs. fraction of application memory footprint (%), one curve per benchmark, with hot-page and cold-page regions marked. Benchmarks: backprop, bfs, cfd, gaussian, kmeans, mummergpu, needle, nn, pathfinder, srad_v1, cns, comd, minife, xsbench, histo, lbm, sad, sgemm, stencil]
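
A bandwidth CDF of this kind can be computed from per-page access counts. The sketch below is one plausible way to do it and is not taken from the talk: it assumes the counts have already been collected per 4 KB page, and `bandwidth_cdf` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Given per-page DRAM access counts, return the cumulative fraction of all
// accesses covered by the hottest 1, 2, ..., n pages.
std::vector<double> bandwidth_cdf(std::vector<std::uint64_t> accesses) {
    // Hottest pages first.
    std::sort(accesses.begin(), accesses.end(), std::greater<std::uint64_t>());
    const std::uint64_t total =
        std::accumulate(accesses.begin(), accesses.end(), std::uint64_t{0});
    std::vector<double> cdf;
    if (total == 0) {
        // No accesses recorded: define the CDF as all zeros.
        cdf.assign(accesses.size(), 0.0);
        return cdf;
    }
    cdf.reserve(accesses.size());
    std::uint64_t running = 0;
    for (std::uint64_t count : accesses) {
        running += count;
        cdf.push_back(static_cast<double>(running) / static_cast<double>(total));
    }
    assert(running == total);  // sanity check: every access was counted once
    return cdf;
}
```

A curve that rises steeply at the start indicates that a small share of the footprint accounts for most of the DRAM traffic, which is the non-uniformity this slide points out.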

18 VISUALIZING PAGE ACCESS PATTERN
- Application CDF of data footprint vs. virtual address layout
- Intuition: hot pages are clustered together
- Place hot data structures in limited-capacity GDDR memory
[Figure: BFS application pages vs. virtual page address (0 to 20000), annotated with application data structures (d_graph_nodes, d_graph_mask, d_graph_visited, d_over, d_graph_edges, d_updating_graph_mask, d_cost) and with hot-page and cold-page regions marked]

19 APPLICATION-AWARE PAGE PLACEMENT
Selectively place hot pages in GDDR memory
- Goal: achieve a data access ratio close to the optimal
- Compiler-based profiling to identify hot data structures [Stephenson ISCA2015]
  - Augmented nvcc and ptxas to support data structure access profiling
  - Compiler inserts memory instrumentation code for all loads and stores
- Program annotation to hint hot data structures for placement
  - Runtime uses these hints to place hot pages in GDDR memory

20 PROGRAM ANNOTATION
- Annotate cudaMalloc calls with memory placement hints
- The CUDA runtime uses these hints to make data placement decisions

Original code:
    cudaMalloc(devptr0, size[0]);
    cudaMalloc(devptr1, size[1]);

Annotated code (hotness comes from profiling or an expert programmer):
    hotness[0] = 2;
    hotness[1] = 1;
    hint[] = GetAllocation(size[], hotness[]);
    cudaMalloc(devptr0, size[0], hint[0]);
    cudaMalloc(devptr1, size[1], hint[1]);

[Diagram: devptr0 and devptr1 in the virtual address space for the original and annotated code, with hotness shown for the annotated version]
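
The slide does not say how GetAllocation turns sizes and hotness values into hints. The sketch below shows one possible policy and should not be read as the authors' algorithm: it greedily assigns the hottest allocations to GDDR until a capacity budget is used up. The function name `get_allocation_sketch`, the `Hint` type, the `gddr_capacity` parameter, and the convention that a larger hotness value means hotter are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

enum class Hint { GDDR, DDR };

// Hypothetical stand-in for the slide's GetAllocation(): hottest allocations
// are assigned to GDDR first until `gddr_capacity` bytes are used up, and
// anything that does not fit goes to DDR.
// Assumption: a larger hotness value means a hotter data structure.
std::vector<Hint> get_allocation_sketch(const std::vector<std::size_t>& size,
                                        const std::vector<int>& hotness,
                                        std::size_t gddr_capacity) {
    assert(size.size() == hotness.size());
    std::vector<std::size_t> order(size.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // Visit allocations from hottest to coldest; ties keep program order.
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return hotness[a] > hotness[b];
                     });
    std::vector<Hint> hint(size.size(), Hint::DDR);
    std::size_t used = 0;  // invariant: used <= gddr_capacity
    for (std::size_t i : order) {
        if (size[i] <= gddr_capacity - used) {
            hint[i] = Hint::GDDR;
            used += size[i];
        }
    }
    return hint;
}
```

This version places whole allocations and only addresses the capacity-constrained case. A fuller policy would also stop filling GDDR once the share of accesses it serves reaches the bandwidth-optimal ratio.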

21 SIMULATION ENVIRONMENT
- Simulator: GPGPU-Sim 3.x
- Heterogeneous 2-level memory
  - GDDR5 (200 GB/s, 8 channels)
  - DDR4 (80 GB/s, 4 channels)
- GPU-CPU interconnect
  - Latency: 100 GPU core cycles
- Workloads
  - Rodinia [Che IISWC2009], Parboil [Stratton TR2012]
  - DoE mini apps [Villa SC2014]
[Diagram: GPU + GDDR5 (200 GB/s) linked to CPU + DDR4 (80 GB/s) over an 80 GB/s interconnect with 100 clocks of additional latency]

22 PROGRAM ANNOTATION PERFORMANCE
[Figure: throughput relative to Linux INTERLEAVE for BW-AWARE, Program Annotation, and Oracle]
- At 10% footprint in GDDR
  - 19% better than INTERLEAVE, 90% of oracle performance
Program annotation correctly identifies hot data structures.

23 SUMMARY
- BW-AWARE page placement is an application-agnostic policy
  - Directly implementable as a default OS policy
  - Near-optimal performance when applications fit in GDDR memory
- Application-aware page placement
  - Governed by page-access frequency
  - Program annotation achieves 90% of oracle performance
  - Dynamic migration may be required in applications with phases
These 2 policies effectively exploit full system BW (code available at

24 THANK YOU

25 BW-AWARE ADAPTABILITY TO BW-RATIOS
- Bandwidth ratios vary
- BW-AWARE adapts to bandwidth variation
[Figure: throughput relative to 200 GB/sec for the INTERLEAVE, BW-AWARE, and LOCAL policies vs. BO + CO memory bandwidth (GB/sec)]

26 SENSITIVITY TO DATA SETS
[Figure: throughput relative to Linux INTERLEAVE for BW-AWARE, Program Annotation, and Oracle on bfs, minife, mummergpu, and xsbench, each shown for the training set and data sets 1 to 3]
Feedback-driven optimization is not sensitive to data sets or parameters.

27 ORACLE: CONSTRAINED CAPACITY
[Figure: throughput relative to unconstrained INTERLEAVE for four configurations: Oracle with unconstrained capacity, Oracle with 10% GDDR memory capacity, BW-AWARE with unconstrained capacity, and BW-AWARE with 10% GDDR memory capacity]
At 10% GDDR capacity, Oracle performance drops relative to 100% capacity.

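28 BW-AWARE PAGE PLACEMENT
Place pages in the BW ratio of the NUMA zones
- Bandwidth-optimized (BO) and capacity-optimized (CO) memory bandwidths: $b_B$ and $b_C$
- $N$ uniformly accessed pages
- Page fraction in BO memory: $f_B$
- $T = \max\left( N f_B / b_B,\; N (1 - f_B) / b_C \right)$
- $f_B^{opt} = b_B / (b_B + b_C)$
- Application-agnostic policy
[Figure: time $T$ vs. fraction of pages in BO memory $f_B$ from 0 to 1; $T = N / b_C$ at $f_B = 0$, and the minimum is at $(f_B^{opt}, T_{min})$]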
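
The optimal fraction on this slide follows from a short argument that the slide leaves implicit. The BO term of the maximum grows with $f_B$ while the CO term shrinks, so the maximum is smallest where the two terms are equal. The derivation below is a reconstruction under the slide's assumption of uniformly accessed pages; the closed form for $T_{min}$ is derived here and is not stated on the slide.

```latex
\begin{align*}
T(f_B) &= \max\!\left( \frac{N f_B}{b_B},\; \frac{N (1 - f_B)}{b_C} \right) \\
\frac{N f_B^{opt}}{b_B} &= \frac{N \left(1 - f_B^{opt}\right)}{b_C}
  \;\Longrightarrow\;
  f_B^{opt} = \frac{b_B}{b_B + b_C} \\
T_{min} &= \frac{N f_B^{opt}}{b_B} = \frac{N}{b_B + b_C}
\end{align*}
```

For the simulated desktop system ($b_B = 200$ GB/s, $b_C = 80$ GB/s), $f_B^{opt} = 200/280 \approx 0.71$, consistent with the 70:30 placement used in the deck.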
