Managing GPU Concurrency in Heterogeneous Architectures

1 Managing GPU Concurrency in Heterogeneous Architectures
Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das

2 Era of Heterogeneous Architectures: Intel Haswell, AMD Fusion, NVIDIA Denver, NVIDIA Echelon

3 Executive Summary
When sharing the memory hierarchy, CPU and GPU applications interfere with each other
- GPU applications significantly affect CPU applications due to multi-threading
Existing GPU Thread-Level Parallelism (TLP) management techniques (MICRO 2012, PACT 2013)
- Unaware of CPUs
- Not effective in heterogeneous systems
Our Proposal: Warp scheduling strategies that adjust GPU TLP to improve CPU and/or GPU performance

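9 Executive Summary
CPU-centric Strategy: lower memory congestion -> higher CPU performance
- IF memory congestion is high, reduce GPU TLP
- Results summary: +24% CPU and -11% GPU performance
CPU-GPU Balanced Strategy: higher GPU TLP -> higher GPU latency tolerance
- IF memory congestion is high, reduce GPU TLP
- IF GPU latency tolerance is low, increase GPU TLP
- Results summary: +7% for both CPU and GPU performance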

10 Outline: Summary, Background, Motivation, Analysis of TLP, Our Proposal, Evaluation, Conclusions

11 Many-core Architecture
[Figure: latency-optimized CPU cores (ROB, ALUs, L1 and L2 caches) and throughput-optimized GPU SIMT cores (CTA scheduler, warp scheduler, ALUs, L1 caches) share the interconnect, the LLC, and DRAM]

13 Application Interference
[Figure: normalized GPU IPC of KM, MM, and PVR when run alone (noCPU) and alongside mcf, omnetpp, or perlbench; normalized CPU IPC of mcf, omnetpp, and perlbench when run alone (noGPU) and alongside KM, MM, or PVR]
- GPU applications are affected moderately by CPU interference (up to 20%)
- CPU applications are affected significantly by GPU interference (up to 85%)

14 Latency Tolerance in CPUs vs. GPUs
[Figure: normalized GPU IPC and CPU IPC versus the number of GPU warps; labels: 6 warps, 16 warps, DYNCTA (PACT 2013)]
- High GPU TLP -> memory system congestion
- High GPU TLP -> low CPU performance
- GPU cores can tolerate latencies due to multi-threading
- Higher performance potential at low GPU TLP
Problem: TLP management strategies for GPUs are not aware of the latency tolerance disparity between CPU and GPU applications

16 Effect of GPU Concurrency on GPU Performance
[Figure: GPU performance as GPU TLP is reduced]

17 Effect of GPU Concurrency on CPU Performance
[Figure: CPU performance as GPU TLP is reduced]

18 Effect of GPU Concurrency on CPU Performance
How does a change in GPU TLP lead to a change in CPU performance?
- 2 metrics: memory congestion and network congestion
- Lower congestion -> higher CPU performance

20 Our Approach
- Existing works: improved GPU performance
- CPU-centric Strategy: improved CPU performance
- CPU-GPU Balanced Strategy: improved GPU and CPU performance, plus control of the trade-off

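21 CM-CPU: CPU-centric Strategy
- Categorize congestion: memory congestion and network congestion are each classified as low (L), medium (M), or high (H)
- TLP management: depending on the (memory, network) congestion levels, increase the number of warps, leave it unchanged, or decrease it
- Drawbacks: GPU-unaware, which can leave GPU cores with insufficient latency tolerance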
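The CM-CPU decision on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' hardware: the thresholds passed to `categorize`, the warp bounds, the step size of one warp, and above all the exact mapping from (memory, network) levels to actions are assumptions, since the slide text only names the three levels and the three actions.

```python
from enum import Enum


class Level(Enum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


def categorize(value, low_threshold, high_threshold):
    """Map a raw congestion measurement to LOW, MEDIUM, or HIGH.

    Both thresholds are hypothetical tuning parameters.
    """
    if value < low_threshold:
        return Level.LOW
    if value < high_threshold:
        return Level.MEDIUM
    return Level.HIGH


def cm_cpu_step(warp_limit, mem_level, net_level, min_warps=1, max_warps=48):
    """Return the warp limit after one CM-CPU-style decision.

    Assumed policy (not taken from the slide): throttle when either
    congestion level is HIGH, add a warp only when both are LOW,
    and otherwise leave the limit unchanged.
    """
    if Level.HIGH in (mem_level, net_level):
        return max(min_warps, warp_limit - 1)   # decrease # of warps
    if mem_level is Level.LOW and net_level is Level.LOW:
        return min(max_warps, warp_limit + 1)   # increase # of warps
    return warp_limit                           # no change in # of warps
```

Because the rule looks only at congestion, it never checks whether the GPU cores still have enough warps to hide latency, which is the "GPU-unaware" drawback the slide notes.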

25 CM-BAL: CPU-GPU Balanced Strategy
- Latency tolerance of GPU cores is tracked with stall_GPU: scheduler stalls at the GPU cores
- Low latency tolerance: overrides CM-CPU; this path can only increase GPU TLP
- Otherwise: same strategy as CM-CPU (high memory congestion -> lower GPU TLP)
- Controlling when the low-latency-tolerance condition triggers controls the trade-off between CPU and GPU benefits
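One way to picture the CM-BAL override is the sketch below. It is an illustration under assumptions: the threshold `k`, the argument names, the warp bound, and the step of one warp are hypothetical, and `cm_cpu_limit` stands for whatever warp limit the CM-CPU rule would have chosen for the same interval.

```python
def cm_bal_step(warp_limit, cm_cpu_limit, stall_gpu, k, max_warps=48):
    """Return the warp limit after one CM-BAL-style decision.

    warp_limit   -- current per-core warp limit
    cm_cpu_limit -- limit the CM-CPU rule would pick this interval
    stall_gpu    -- scheduler stalls observed at the GPU cores
    k            -- hypothetical stall threshold that triggers the override
    """
    if stall_gpu > k:
        # Low latency tolerance: override CM-CPU. This path only raises TLP.
        return min(max_warps, warp_limit + 1)
    # Enough latency tolerance: behave exactly like CM-CPU.
    return cm_cpu_limit
```

In this sketch a larger `k` makes the override fire less often and so favors the CPU side, while a smaller `k` favors the GPU side, which is one reading of how the trigger condition controls the trade-off.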

27 Evaluated Architecture
[Figure: tile-based design with GPU core tiles, 14 CPU core tiles, and 8 LLC/MC tiles]

28 Evaluation Methodology
Evaluated on an integrated platform with an in-house x86 CPU simulator and GPGPU-Sim
Baseline Architecture
- 28 GPU cores, 14 CPU cores, 8 memory controllers, 2D mesh
- GPU: 1400 MHz, SIMT width = 16*2, max threads/core, GTO scheduler
- CPU: 2000 MHz, OoO, 128-entry instruction window, max. 3 instructions/cycle
- LLC: 8 MB, 128 B line, 16-way, 700 MHz
- GDDR5, 800 MHz
Workloads
- 13 GPU applications
- 34 CPU applications, 6 CPU application mixes
- 36 diverse workloads: 1 GPU application + 1 CPU mix

29 GPU Performance Results
[Figure: normalized GPU IPC across all 36 workloads; data labels: 2%, -11%, 7%, -11%]

30 CPU Performance Results
[Figure: normalized CPU weighted speedup (WS) across all 36 workloads; data labels include 24%, 7%, and 19%]

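31 System Performance
$\mathrm{OSS} = (1 - \alpha)\,\mathrm{WS}_{\mathrm{CPU}} + \alpha\,\mathrm{SU}_{\mathrm{GPU}}$ (ISCA 2012)
- α is between 0 and 1
- Higher α -> higher GPU importance
[Figure: normalized OSS versus α (0-1); legend entries: warps, DYNCTA, CM-CPU, CM-BAL; annotations: Obj. 1, Obj. 2 (Balanced)]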
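As a small worked example of the OSS metric on this slide, the sketch below evaluates the weighted sum for a few values of α. The function name `oss` and the input values 1.2 and 0.9 are made up for illustration and are not results from the talk.

```python
def oss(ws_cpu, su_gpu, alpha):
    """Weighted system metric: (1 - alpha) * WS_CPU + alpha * SU_GPU."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (1.0 - alpha) * ws_cpu + alpha * su_gpu


# Illustrative inputs only: CPU weighted speedup 1.2, GPU speedup 0.9.
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha:.1f}  OSS={oss(1.2, 0.9, alpha):.3f}")
```

With α = 0 the metric equals the CPU term and with α = 1 it equals the GPU term, so sweeping α shows how a configuration that helps the CPU at the GPU's expense looks progressively worse as GPU importance grows.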

32 More in the Paper
Motivation
- Analysis of the metrics used by our algorithm
Scheme
- Detailed hardware walkthrough of our scheme
Results
- Analysis over time
- Change in GPU TLP
- Change in the metrics used by our algorithm
- Comparison against static approaches
- Lower number of LLC accesses

34 Conclusions
- Sharing the memory hierarchy causes CPU and GPU applications to interfere with each other
- Existing GPU TLP management techniques are not well-suited for heterogeneous architectures
- We propose two GPU TLP management techniques for heterogeneous architectures
- CM-CPU reduces GPU TLP to improve CPU performance
- CM-BAL is similar to CM-CPU, but increases GPU TLP when it detects low latency tolerance in GPU cores
- TLP can be tuned based on the user's preference for higher CPU or GPU performance

35 THANKS!
