Leveraging Mobile GPUs for Flexible High-speed Wireless Communication

Size: px

Start display at page:

Download "Leveraging Mobile GPUs for Flexible High-speed Wireless Communication"

Richard Robinson
6 years ago
Views:

1 0 Leveraging Mobile GPUs for Flexible High-speed Wireless Communication Qi Zheng, Cao Gao, Trevor Mudge, Ronald Dreslinski *, Ann Arbor The 3 rd International Workshop on Parallelism in Mobile Platforms (PRISM-3) Jun 14,

2 Popular Mobile Device & Applications 1 1 1

3 Fast Technology Evolution 2 1.0E E E E E

4 Support of Multiple Standards 3 3 3

5 Wireless Processor in Mobile Devices 4! Hardware accelerator based processors! ASIC! DSP with many ASICs! FPGA customized as ASICs " High performance " High energy efficiency Long time to market Hard and costly to maintain compatibility Software-defined Radio! 4 4

6 Software-defined Radio (SDR) 5! Wireless system is implemented entirely in software! General-purpose programming language " Very short time to market Only implement new software for new technologies " Easy to maintain compatibility Different programs for different standards? Performance? Energy efficiency 5 5

7 Processor for SDR GPU! 6! High computing throughput! High energy efficiency! Explore DLP and TLP! Good programmability 6 6

8 Graphics Processing Unit (GPU)! High throughput:! GOPS-level peak throughput! SIMT execution model! Make use of abundant DLP and TLP! Good programming support! CUDA/OpenCL 7 Desktop GPU Power 100W Mobile GPU Power ~ 1W 7 7

9 Our Contribution 8! Parallel implementations of LTE baseband kernels on a mobile GPGPU! Performance of running LTE baseband! Meet the specification peak data rate for physical layer kernels! Get close to 10Mpbs typical data rate with for the Turbo decoder! Meet 20ms latency budget! Power and energy efficiency of a mobile GPGPU! High power compared with application-specific processors! Still in an acceptable range! Implications of GPU architecture to further reduce the power 8 8

10 Outline 9! Motivation! LTE baseband kernels! Running LTE baseband on a mobile GPGPU! Implications for future mobile GPGPUs! Conclusion 9 9

Radio PHY Layer kernel Antenna MAC Layer kernel RF# OFDM# Demodula.

11 LTE Baseband Downlink Receiver 10! Downlink Receiver has the most computations in a mobile device Radio PHY Layer kernel Antenna MAC Layer kernel RF# OFDM# Demodula.on# Channel# Es.mate# MIMO## Detect# Turbo# Decoder# Rate# Match# Descrambling# Modula.on# Demapper# 10 10

12 Parallelism in Each Kernel 11! The PHY layer kernels! Antenna-level! Symbol-level! Subcarrier-level! Algorithm-level! The Turbo Decoder! Trellis-level! Subblock-level! For all kernels! Batch multiple packets 11 11

13 Outline 12! Motivation! LTE baseband kernels! Running LTE baseband on a mobile GPGPU! Methodology! PHY layer! Turbo decoder! Power and energy efficiency! Implications for future mobile GPGPUs! Conclusion 12 12

192 CUDA cores! 852MHz! 256 KB RF / 64 KB L1 memory /!

14 Experimental Setup! 13 Jetson TK1 Board! NVIDIA 4-Plus-1 quad-core ARM Cortex-A15 CPU! NVIDIA Kepler GK20a GPU! 192 CUDA cores! 852MHz! 256 KB RF / 64 KB L1 memory /!! 128 KB L2 cache 2G DDR3L RAM (shared by CPU and GPU) Wattsup.net Power Meter! 33% board power consumed by SoC 13 13

15 Outline 14! Motivation! LTE baseband kernels! Running LTE baseband on a mobile GPGPU! Methodology! PHY layer! Turbo decoder! Power and energy efficiency! Implications for future mobile GPGPUs! Conclusion 14 14

16 PHY Layer Throughput 15 Achieved Throughputs (Mbps) The number of packets 1x1+QPSK 2x2+QPSK 1x1+16QAM 2x2+16QAM Packet batching increases throughput Saturate at

17 Compare with Specification Peak 16 Throughput#Normalized#to# Specifica>on#Peak# 7.0# 6.0# 5.0# 4.0# 3.0# 2.0# 1.0# 0.0# 0# 1# 2# 3# 4# 5# 6# 7# The#number#of#packets# 1x1+QPSK# 2x2+QPSK# 1x1+16QAM# 2x2+16QAM# Fulfill Overall, but 2 packets one specification -- a good peak batching data size rates 16 16

18 Compare with 10Mbps Typical Rate 17 Throughput#Normalized#to# Average/Typical#User#Rate# 18.0# 16.0# 14.0# 12.0# 10.0# 8.0# 6.0# 4.0# 2.0# 0.0# 0# 1# 2# 3# 4# 5# 6# 7# 8# 9# The#number#of#packets# 1x1+QPSK# 2x2+QPSK# 1x1+16QAM# 2x2+16QAM# Fulfill all typical user data rates 17 17

19 PHY Layer Latency 18 Worst Case Latency (ms) The number of packets 1x1+QPSK 2x2+QPSK 1x1+16QAM 2x2+16QAM Packet batching also increases latency Packet batching also increases latency, can batch 8 packets given 20ms budgets 18 18

20 Outline 19! Motivation! LTE baseband kernels! Running LTE baseband on a mobile GPGPU! Methodology! PHY layer! Turbo decoder! Power and energy efficiency! Implications for future mobile GPGPUs! Conclusion 19 19

21 Turbo Throughput 20 Turbo"Decoding" Throughput"(Mbps)" 10" 8" 6" 4" 2" 0" 0" 2" 4" 6" 8" 10" 12" 14" 16" 18" The"number"of"packets" #Subblock=16" #Subblock=32" #Subblock=64" Throughput increases with #subblocks an batching sizes 32 subblocks + 2 packets: 8.4 Mpbs throughput, 4.8 ms latency <<peak rate, but ~10Mbp typical rate *Qi Zheng, et al, Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors, ISCAS 2013, Beijing 20 20

22 Outline 21! Motivation! LTE baseband kernels! Running LTE baseband on a mobile GPGPU! Methodology! PHY layer! Turbo decoder! Power and energy efficiency! Implications for future mobile GPGPUs! Conclusion 21 21

Energy Efficiency 22 Energy Efficiency (Mb/J) 30.0 25.0 20.0 15.0 10.0 5.0 0.

23 Energy Efficiency 22 Energy Efficiency (Mb/J) GPU Frequency (MHz) Higher frequency achieves better energy efficiency Run to finish 22 22

24 Power Consumption 23 Processor Standards Details ASIC (LeoCore) FPGA (MAGALI) DSP (Tomahawk) CPU (SoftLTE) WiMAX LTE LTE, WiMAX LTE 70MHz Mobile GPU LTE 23 23

25 Outline 24! Motivation! LTE baseband kernels! Running LTE baseband on a mobile GPGPU! Implications for future mobile GPGPUs! Conclusion 24 24

26 Reduce Energy from Unused Register! Register file is underutilized running PHY layers! Due to max_thread/sm and max_thread_block/sm! Power gate unused registers Normalized#RF#U:liza:on# 100%# 90%# 80%# 70%# 60%# 50%# 40%# 30%# 20%# 10%# 0%# 1# 2# 4# 8# 16# The#number#of#packets# 25 1x1+QPSK# 2x2+QPSK# 1x1+16QAM# 2x2+16QAM# 25 25

27 GPU Support for Trellis Algorithm 26! Trellis Algorithm! Turbo algorithm! Viterbi algorithm! Baum-Welch algorithm! Coding theory! Speech recognition! Data compression! Architecture and ISA support! Similar to AES instruction set in Intel CPU 26 26

28 Outline 27! Motivation! LTE baseband kernels! Running LTE baseband on a mobile GPGPU! Implications for future mobile GPGPUs! Conclusion 27 27

29 Conclusion 28! Mobile GPUs can be used for wireless communication in a mobile device! Meet specification peaks for 3 out of 4 PHY configurations! Meet the typical rate for all PHY configurations! Turbo is close to the typical data rate! Need special supports for the Turbo decoder in a mobile GPU! Energy is high compared with application-specific solutions! But in the accepting range! Need further optimizations 28 28

30 29 Thanks! Any questions? 29 29

31 30 Backup Slides 30 30

Quick Subscription Growth 31 7000 World mobile-cellular subscriptions (Millions)* 6000 5000 4000 3000 2000 1000 0 1998 1999

32 Quick Subscription Growth World mobile-cellular subscriptions (Millions)* Year *ITU report -- The World in 2011: ICT Facts and Figures 31 31

33 Our Contribution 32! Parallel implementations of LTE baseband kernels on a mobile GPGPU! Throughputs and latency of different configurations! Energy efficiency of a mobile GPGPU 32 32

34 Parallelism in PHY Layer Kernels 33 Kernel Parallelism Number of threads OFDM Modulation (FFT) Channel Estimation MIMO detector Antenna-level Symbol-level Algorithm-level Antenna-level Subcarrier-level Symbol-level Subcarrier-level N ant N sym N FFT N ant N sub N sym N sub Modulation demapper Antenna-level Symbol-level Subcarrier-level Algorithm-level N ant N sym N sub N Mod N ant N sym N sub log 2 (N Mod ) Descrambling Antenna-level Symbol-level Subcarrier-level N ant N sym N sub 33 33

35 Parallelism in Turbo Decoder* 34! Total number of threads N thread = N packet N subblock Thread trellis! Implementation performance tradeoff Parallelism Scheme Throughput Latency Bit Error Rate Packet-level Better Worse No Change Subblock-level Better No Change Worse Trellis-level Better No Change No Change Subblock+NII Worse No Change Better Subblock+TS Worse No Change Better *Qi Zheng, et al, Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors, ISCAS 2013, Beijing 34 34

36 Mobile GPU Architecture 35! GPU on NVIDIA Jetson Tegra K1 board! Kepler architecture 192 CUDA MHz! 64KB L1 memory / 128KB L2 cache / Share 2GB DDR3L with CPU 35 35

37 Performance Metrics 36! Collected stats for a variety of important events! CUDA Event API! Throughput! Latency! NVIDIA Profiler! Achieved occupancy! Memory utilization! Dynamic instruction count 36 36

38 GPU Occupancy 37 Achieved#Occupancy# 100%# 90%# 80%# 70%# 60%# 50%# 40%# 30%# 20%# 10%# 0%# Achieved"Throughputs"(Mbps)" 200" 180" 160" 140" 120" 100" 80" 60" 40" 20" 0" 0# 2# 4# 6# 8# 10# 12# 14# 16# 18# The#number#of#packets# 0" 5" 10" 15" 20" 25" 30" 35" The"number"of"packets" 1x1+QPSK" 2x2+QPSK" 1x1+16QAM" 2x2+16QAM" Packet batching increases throughput which saturates at 8 packets due to full GPU utilization 37 37

2x2+16QAM# Rate#Matching# Descrambling# Demodula9on# MIMO#Detec9on#

39 Hotspot Analysis 38 Execu9on#Time#Breakdown# 100%# 90%# 80%# 70%# 60%# 50%# 40%# 30%# 20%# 10%# 0%# 1x1+QPSK# 2x2+QPSK# 1x1+16QAM# 2x2+16QAM# Rate#Matching# Descrambling# Demodula9on# MIMO#Detec9on# Channel#Es9ma9on# FFT# Demodulation dominates due to its heavy computation 38 38

40 Hotspot Analysis -- #instructions %# Executed(Instruc7ons( 90%# 80%# 70%# 60%# 50%# 40%# 30%# 20%# 10%# 0%# Rate(Matching( Descrambling( Demodula7on( MIMO(Detec7on( Channel(Es7ma7on( FFT( 1x1+QPSK( 2x2+QPSK( 1x1+16QAM( 2x2+16QAM( Demodulation dominates due to its heavy computation 39 39

41 Turbo Throughput 40 Turbo"Decoding" Throughput"(Mbps)" 10" 8" 6" 4" 2" 0" 0" 2" 4" 6" 8" 10" 12" 14" 16" 18" The"number"of"packets" #Subblock=16" #Subblock=32" #Subblock=64" Throughput increases with the number of subblocks and packets, and starts saturating at 64 subblocks 40 40

42 Frequency Scaling -- PHY 41 Achieved Throughput (Mbps) GPU Frequency (MHz) 1x1+QPSK 2x2+QPSK 1x1+16QAM 2x2+16QAM Throughputs have almost linear scaling with the frequency 41 41

43 Frequency Scaling -- Turbo 42 Achieved#Throughput#(Mbps)# 9.0# 8.0# 7.0# 6.0# 5.0# 4.0# 3.0# 2.0# 1.0# 0.0# 0# 100# 200# 300# 400# 500# 600# 700# 800# 900# GPU#Frequency#(MHz)# Throughputs have almost linear scaling with the frequency 42 42

Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors

1 Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors Qi Zheng*, Yajing Chen*, Ronald Dreslinski*, Chaitali Chakrabarti +, Achilleas Anastasopoulos*, Scott Mahlke*, Trevor