UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS

Size: px

Start display at page:

Download "UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS"

Jessie Nicholson
6 years ago
Views:

1 UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS Neha Agarwal* David Nellans Mike O Connor Stephen W. Keckler Thomas F. Wenisch* NVIDIA University of Michigan* (Major part of this work was done when Neha Agarwal was an intern at NVIDIA)

EVOLVING GPU MEMORY SYSTEM NVLink (Cache PCI-E

8 8 GB/s 8 GB/s DDR4 Roadmap CUDA 1.-5.x CUDA 6.

GPU memory Run-time controlled copying à Better

2 EVOLVING GPU MEMORY SYSTEM NVLink (Cache PCI-E Coherent) GDDR5 GPU CPU 2 GB/s GB/s 8 GB/s DDR4 Roadmap CUDA 1.-5.x CUDA 6.+: Current Future cudamemcpy Unified virtual memory CPU-GPU cache-coherent high BW interconnect Programmer controlled copying to GPU memory Run-time controlled copying à Better productivity How to best exploit full BW while maintaining programmability? 2

3 DESIGN GOALS & OPPORTUNITIES! Simple programming model:! No need for explicit data copying! Exploit full DDR + GDDR BW! Additional 3% BW via NVLink! Crucial to BW sensitive GPU apps Total Bandwidth (GB/s) Legacy CUDA UVM PCI-E NVLink CC-NUMA BW Target Design intelligent dynamic page migration policies to achieve both these goals 3

4 BANDWIDTH UTILIZATION NVLink GPU 8 GB/s CPU 2 GB/s 8 GB/s GDDR5 DDR4 Total Bandwidth (GB/s) Total System Memory BW GPU Memory BW BW from CC % of Migrated Pages to GPU Memory 1! Coherence-based accesses, no page migration! Wastes GPU memory BW 4

5 BANDWIDTH UTILIZATION NVLink GPU 8 GB/s CPU 2 GB/s 8 GB/s GDDR5 DDR4 Total Bandwidth (GB/s) Total System Memory BW GPU Memory BW Additional BW from CC % of Migrated Pages to GPU Memory 1! Static Oracle: Place data in the ratio of memory bandwidths [ASPLOS 15] Dynamic migration can exploit the full system memory BW 5

BANDWIDTH UTILIZATION NVLink GPU 8 GB/s CPU 2 GB/s 8 GB/s GDDR5 DDR4 Total Bandwidth (GB/s) 28 2 8 Total System Memory BW GPU Memory BW Additional

6 BANDWIDTH UTILIZATION NVLink GPU 8 GB/s CPU 2 GB/s 8 GB/s GDDR5 DDR4 Total Bandwidth (GB/s) Total System Memory BW GPU Memory BW Additional BW from CC % of Migrated Pages to GPU Memory 1! Excessive migration leads to under-utilization of DDR BW Migrate pages for optimal BW utilization 6

7 CONTRIBUTIONS Intelligent Dynamic Page Migration! Aggressively migrate pages upon First-Touch to GDDR memory! Pre-fetch neighbors of touched pages to reduce TLB shootdowns! Throttle page migrations when nearing peak BW Dynamic page migration performs 1.95x better than no migration Comes within 28% of the static oracle performance 6% better than Legacy CUDA 7

8 OUTLINE! Page Migration Techniques! First-Touch page migration! Range-Expansion to save TLB shootdowns! BW balancing to stop excessive migrations! Results & Conclusions 8

9 FIRST-TOUCH PAGE MIGRATION! Naive: Migrate pages that are touched Throughput Rela.ve to No Migra.on CPU memory GPU memory! First-Touch migration approaches Legacy CUDA First-Touch migration is cheap, no hardware counters required 9

10 PROBLEMS WITH FIRST-TOUCH MIGRATION VA PA Send Ack VA PA V1 P2 P1 Invalidate V1 P2 P1 CPU TLB GPU TLB Migrate V1 CPU GPU Throughput Rela.ve To No Migra.on No overhead TLB shootdown backprop needle pathfinder! TLB shootdowns may negate benefits of page migration How to migrate pages without incurring shootdown cost? 1

11 PRE-FETCH TO AVOID TLB SHOOTDOWNS Hot pages, consume 8% BW! Intuition: Hot virtual addresses are clustered! Pre-fetch pages before access by the GPU Max Min Min No shootdown cost for pre-fetched pages 11

12 MIGRATION USING RANGE-EXPANSION CPU memory GPU memory X X X X X Accessed, shootdown required Pre-fetched, shootdown not required Throughput Relative to No Migration No overhead No range expansion TLB shootdown No overhead TLB shootdown No overhead backprop needle pathfinder TLB shootdown! Pre-fetch pages in spatially contiguous range 12

MIGRATION USING RANGE-EXPANSION CPU memory GPU memory X X X Accessed, shootdown required Pre-fetched, shootdown not required Throughput Relative to No Migration No range expansion Range:16 Range:64

13 MIGRATION USING RANGE-EXPANSION CPU memory GPU memory X X X Accessed, shootdown required Pre-fetched, shootdown not required Throughput Relative to No Migration No range expansion Range:16 Range:64 Range: No overhead TLB shootdown No overhead TLB shootdown No overhead backprop needle pathfinder TLB shootdown! Pre-fetch pages in spatially contiguous range Range-Expansion hides TLB shootdown overhead 13

14 REVISITING BANDWIDTH UTILIZATION Total Bandwidth (GB/s) Total System Memory BW GPU Memory BW First-Touch migration BW from CC Excessive migration % of Migrated Pages to GPU Memory 1! First-Touch + Range-Expansion aggressively unlocks GDDR BW How to avoid excessive page migrations? 14

15 BANDWIDTH BALANCING First- Touch + Range Exp % of Migrated Pages to GPU Memory 1 7 Excessive migrations First-Touch migrations Total Bandwidth (GB/s) Target Start DDR under- subscribed GDDR under- subscribed Time End % of Accesses from GPU Memory Throttle migrations when nearing peak BW 15

16 BANDWIDTH BALANCING First- Touch + Range Exp First- Touch + Range Exp + BW Balancing % of Migrated Pages to GPU Memory 1 7 Excessive migrations First-Touch migrations Total Bandwidth (GB/s) Target Start DDR under- subscribed GDDR under- subscribed Time End % of Accesses from GPU Memory Throttle migrations when nearing peak BW 16

17 SIMULATION ENVIRONMENT! Simulator: GPGPU-Sim 3.x! Heterogeneous 2-level memory! GDDR5 (2GB/s, 8-channels)! DDR4 (8GB/s, 4-channels) GPU 1 clks additional latency 8 GB/s CPU! GPU-CPU interconnect! Latency: 1 GPU core cycles! Workloads:! Rodinia applications [Che IISWC29]! DoE mini apps [Villa SC214] GDDR5 2 GB/s 8 GB/s DDR4 17

18 Throughput Rela.ve to No Migra.on RESULTS Legacy CUDA First- Touch + Range Exp + BW Balancing ORACLE Static Oracle 18

19 Throughput Rela.ve to No Migra.on RESULTS Legacy CUDA First- Touch + Range Exp + BW Balancing ORACLE Static Oracle First-Touch + Range-Expansion + BW Balancing outperforms Legacy CUDA 19

20 Throughput Rela.ve to No Migra.on RESULTS Legacy CUDA First- Touch + Range Exp + BW Balancing ORACLE Static Oracle Streaming accesses, no-reuse after migration 2

21 Throughput Rela.ve to No Migra.on RESULTS Legacy CUDA First- Touch + Range Exp + BW Balancing ORACLE Static Oracle Dynamic page migration performs 1.95x better than no migration Comes within 28% of the static oracle performance 6% better than Legacy CUDA 21

22 CONCLUSIONS! Developed migration policies without any programmer involvement! First-Touch migration is cheap but has high TLB shootdowns! First-Touch + Range-Expansion technique unlocks GDDR memory BW! BW balancing maximizes BW utilization, throttles excessive migrations These 3 complementary techniques effectively unlock full system BW 22

23 THANK YOU 23

24 EFFECTIVENESS OF RANGE-EXPANSION TLB Shootdowns Execution % Migrations Shootdown Execution Saved Benchmark Execution % Migrations Exececution backpropbenchmark 29.1% Overhead 26% 7.6% TLB Shootdowns of Shootdown Without Saved Runtime backprop 29.1% 26% 7.6% bfs TLB 6.7% Shootdown Shootdown 12% Saved.8% cns 2.4% 2%.5% comd 2.2% 89% 1.8% cns Backprop 2.4% 29.1% 26% 2% 7.6%.5% minife 3.6% 36% 1.3% mummer 21.15% 13% 2.75% comd Pathfinder 2.2% 25.9% 1% 89% 2.6% 1.8% pathfinder 25.9% 1% 2.6% srad v1.5% 27%.14% kmeans Needle 4.1% 24.9% 55% 79% 2.75% 3.17% Average 11.13% 33.5% 3.72% minife Mummer TABLE 3.6% 21.15% II: Effectiveness of range prefetching 13% at36% avoiding 2.75% 1.3% mummer Bfs 21.15% 6.7% 12% 13%.8% 2.75% needle 24.9% 55% 13.7% pathfinder Range-Expansion 25.9% can save up to 45% 1% TLB shootdowns 2.6% srad v1.5% 27%.14% 24

25 DATA TRANSFER RATIO Performance is low when GDDR/DDR ratio is away from optimal 25

26 GPU WORKLOADS: BW SENSITIVITY backprop" bfs" btree" gaussian" heartwell" hotspot" kmeans" lava_md" lud_cuda" mummergpu" needle" pathfinder" sc_gpu" srad_v1" cns" comd" minife" xsbench" 3" Rela%ve'Throughput'' 2.5" 2" 1.5" 1".5" Rela%ve' Throughput' " 1.2$.8$.12x".14x".17x".2x".25x".33x".5x" 1x" 2x" 3x" 1$ Aggregate'DRAM'Bandwidth,'1x=2GB/sec' '99$ '8$ '6$ '4$ '2$ $ 2$ 4$ 6$ 8$ 1$ Additonal'DRAM'delay'in'GPU'core'cycles' GPU workloads are highly BW sensitive 26

27 Throughput Rela.ve to No Migra.on RESULTS Legacy CUDA First- Touch First- Touch + Range Exp First- Touch + Range Exp + BW Balancing ORACLE Oracle 6 Dynamic page migration performs 1.95x better than no migration Comes within 28% of the static oracle performance 6% better than Legacy CUDA 27

PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS

PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS Neha Agarwal* David Nellans Mark Stephenson Mike O Connor Stephen W. Keckler NVIDIA University of Michigan* ASPLOS 2015 EVOLVING GPU