Leveraging Mobile GPUs for Flexible High-speed Wireless Communication
Qi Zheng, Cao Gao, Trevor Mudge, Ronald Dreslinski
Ann Arbor
The 3rd International Workshop on Parallelism in Mobile Platforms (PRISM-3)
Jun 14, 2015
Popular Mobile Devices & Applications
Fast Technology Evolution
[Chart: log-scale trend over the years 2000-2010]
Support of Multiple Standards
Wireless Processor in Mobile Devices
- Hardware accelerator based processors: ASIC; DSP with many ASICs; FPGA customized as ASICs
  Pros: high performance, high energy efficiency
  Cons: long time to market; hard and costly to maintain compatibility
- Alternative: Software-defined Radio
Software-defined Radio (SDR)
- Wireless system is implemented entirely in software, in a general-purpose programming language
  Pros: very short time to market (only new software is needed for new technologies); easy to maintain compatibility (different programs for different standards)
  Open questions: performance? energy efficiency?
Processor for SDR: GPU
- High computing throughput
- High energy efficiency
- Exploits DLP and TLP
- Good programmability
Graphics Processing Unit (GPU)
- High throughput: GOPS-level peak throughput
- SIMT execution model: makes use of abundant DLP and TLP
- Good programming support: CUDA/OpenCL
- Desktop GPU power: ~100 W; mobile GPU power: ~1 W
Our Contribution
- Parallel implementations of LTE baseband kernels on a mobile GPGPU
- Performance of running LTE baseband
  - Meets the specification peak data rates for the physical layer kernels
  - Gets close to the 10 Mbps typical data rate for the Turbo decoder
  - Meets the 20 ms latency budget
- Power and energy efficiency of a mobile GPGPU
  - High power compared with application-specific processors, but still in an acceptable range
- Implications for GPU architecture to further reduce power
Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
- Implications for future mobile GPGPUs
- Conclusion
LTE Baseband Downlink Receiver
- The downlink receiver has the most computation in a mobile device
- PHY layer kernel pipeline, between the radio antenna/RF front end and the MAC layer:
  RF -> OFDM Demodulation -> Channel Estimate -> MIMO Detect -> Modulation Demapper -> Descrambling -> Rate Match -> Turbo Decoder
Parallelism in Each Kernel
- The PHY layer kernels: antenna-level, symbol-level, subcarrier-level, algorithm-level
- The Turbo decoder: trellis-level, subblock-level
- For all kernels: batch multiple packets
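The "batch multiple packets" idea can be sketched outside CUDA: map the (packet, element) pairs of a whole batch onto one flat index space, so a single launch covers every packet. A minimal pure-Python sketch with a descrambling-style elementwise kernel; the sizes, data, and function names are illustrative only, not the talk's implementation:

```python
# Sketch of packet batching: instead of one launch per packet, flatten
# (packet, element) pairs into a single flat index space, the way a batched
# CUDA grid would cover them. Sizes and the kernel are illustrative only.

N_PACKETS, N_BITS = 8, 1200

def descramble_one(soft_bits, code):
    # elementwise multiply by the +/-1 scrambling sequence (one packet)
    return [b * c for b, c in zip(soft_bits, code)]

def descramble_batched(batch, codes):
    # one flat loop over packet*bit indices stands in for one big GPU launch
    out = [[0.0] * N_BITS for _ in range(N_PACKETS)]
    for idx in range(N_PACKETS * N_BITS):
        p, i = divmod(idx, N_BITS)
        out[p][i] = batch[p][i] * codes[p][i]
    return out

# toy data: constant soft bits, alternating-sign scrambling codes
batch = [[1.0] * N_BITS for _ in range(N_PACKETS)]
codes = [[1.0 if (p + i) % 2 == 0 else -1.0 for i in range(N_BITS)]
         for p in range(N_PACKETS)]
out = descramble_batched(batch, codes)
```

The batched loop does exactly the per-packet work, only over a larger index space; that is why batching raises GPU utilization without changing the kernel's logic.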
Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion
Experimental Setup
- Jetson TK1 board
  - NVIDIA 4-Plus-1 quad-core ARM Cortex-A15 CPU
  - NVIDIA Kepler GK20a GPU: 192 CUDA cores @ 852 MHz; 256 KB RF / 64 KB L1 memory / 128 KB L2 cache
  - 2 GB DDR3L RAM (shared by CPU and GPU)
- Wattsup.net power meter
  - 33% of board power is consumed by the SoC
Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion
PHY Layer Throughput
[Chart: achieved throughput (Mbps) vs. number of packets (0-32), for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Packet batching increases throughput; it saturates at 8 packets
Compare with Specification Peak
[Chart: throughput normalized to specification peak vs. number of packets (0-7), for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Fulfills all but one specification peak data rate
- Overall, 2 packets is a good batching size
Compare with 10 Mbps Typical Rate
[Chart: throughput normalized to the average/typical user rate vs. number of packets (0-9), for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Fulfills all typical user data rates
PHY Layer Latency
[Chart: worst-case latency (ms) vs. number of packets (0-32), for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Packet batching also increases latency; 8 packets can be batched within the 20 ms latency budget
Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion
Turbo Throughput
[Chart: Turbo decoding throughput (Mbps) vs. number of packets (0-18), for #Subblock = 16, 32, 64]
- Throughput increases with the number of subblocks and the batching size
- 32 subblocks + 2 packets: 8.4 Mbps throughput, 4.8 ms latency
- Far below the peak rate, but close to the ~10 Mbps typical rate
*Qi Zheng, et al., Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors, ISCAS 2013, Beijing
Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion
Energy Efficiency
[Chart: energy efficiency (Mb/J) vs. GPU frequency (72, 252, 468, 684, 852 MHz)]
- A higher frequency achieves better energy efficiency (run-to-finish)
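Energy efficiency in Mb/J is delivered throughput divided by power draw. A back-of-envelope sketch combining the 8.4 Mbps Turbo throughput and the 1390 mW GPU power at 852 MHz quoted on other slides; combining these two particular numbers is our illustration, not a measurement reported in the talk:

```python
# Run-to-finish energy efficiency in Mb/J. The sample numbers (8.4 Mbps
# Turbo throughput, 1390 mW at 852 MHz) appear on other slides; pairing
# them here is illustrative, not a reported result.

def mb_per_joule(throughput_mbps, power_w):
    # Mb/s divided by J/s (= W) leaves Mb/J
    return throughput_mbps / power_w

eff = mb_per_joule(8.4, 1.39)  # Turbo decoding at 852 MHz
```

The same division explains the slide's trend: raising the frequency raises throughput almost linearly, so as long as power grows more slowly, Mb/J improves.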
Power Consumption
Processor        | Standards  | Details
ASIC (LeoCore)   | WiMAX      | 70 mW @ 70 MHz
FPGA (MAGALI)    | LTE        | 234 mW @ 400 MHz
DSP (Tomahawk)   | LTE, WiMAX | 904 mW @ 170 MHz
CPU (SoftLTE)    | LTE        | 130 W @ 2.66 GHz
Mobile GPU       | LTE        | 1390 mW @ 852 MHz
Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
- Implications for future mobile GPGPUs
- Conclusion
Reduce Energy from Unused Registers
- The register file is underutilized when running the PHY layer kernels
  - Due to the max_thread/SM and max_thread_block/SM limits
- Power gate the unused registers
[Chart: normalized RF utilization vs. number of packets (1, 2, 4, 8, 16), for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
GPU Support for Trellis Algorithms
- Trellis algorithms: Turbo, Viterbi, Baum-Welch
- Used in coding theory, speech recognition, and data compression
- Architecture and ISA support, similar to the AES instruction set in Intel CPUs
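What such trellis support would accelerate is the add-compare-select recursion shared by Turbo, Viterbi, and Baum-Welch. A minimal Viterbi sketch over a toy hidden Markov model; the two-state model and its probabilities are the standard textbook toy, not anything from LTE:

```python
# Minimal Viterbi decoder over a toy HMM -- illustrates the add-compare-
# select recursion that trellis hardware support would accelerate.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # add-compare-select: pick the best predecessor state
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = prob
            back[t][s] = prev
    # trace back the surviving path
    prob, last = max((V[-1][s], s) for s in states)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return prob, path[::-1]

# Toy two-state model (classic textbook example, not an LTE trellis)
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
prob, path = viterbi(("walk", "shop", "clean"),
                     states, start_p, trans_p, emit_p)
```

The inner `max(...)` is the add-compare-select step: every state at every trellis stage runs it, which is exactly the fine-grained, regular work an ISA extension could fuse into one instruction.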
Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
- Implications for future mobile GPGPUs
- Conclusion
Conclusion
- Mobile GPUs can be used for wireless communication in a mobile device
  - Meet the specification peaks for 3 out of 4 PHY configurations
  - Meet the typical rate for all PHY configurations
  - Turbo decoding is close to the typical data rate
- The Turbo decoder needs special support in a mobile GPU
- Energy is high compared with application-specific solutions, but in an acceptable range; further optimizations are needed
Thanks! Any questions?
Backup Slides
Quick Subscription Growth
[Chart: world mobile-cellular subscriptions (millions), 1998-2011, growing from near zero to ~6000 million]
*ITU report -- The World in 2011: ICT Facts and Figures
Our Contribution
- Parallel implementations of LTE baseband kernels on a mobile GPGPU
- Throughput and latency of different configurations
- Energy efficiency of a mobile GPGPU
Parallelism in PHY Layer Kernels
Kernel                | Parallelism                                                    | Number of threads
OFDM Modulation (FFT) | Antenna-level, symbol-level, algorithm-level                   | N_ant x N_sym x N_FFT
Channel Estimation    | Antenna-level, subcarrier-level                                | N_ant x N_sub
MIMO detector         | Symbol-level, subcarrier-level                                 | N_sym x N_sub
Modulation demapper   | Antenna-level, symbol-level, subcarrier-level, algorithm-level | N_ant x N_sym x N_sub x log2(N_Mod)
Descrambling          | Antenna-level, symbol-level, subcarrier-level                  | N_ant x N_sym x N_sub
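The thread counts in the table are simple products of the configuration parameters. A sketch with assumed values for a 20 MHz 2x2 16QAM downlink; N_ant = 2, N_sym = 14, N_FFT = 2048, N_sub = 1200, N_Mod = 16 are our assumed numbers, not necessarily the talk's measured setup:

```python
# Thread counts per PHY kernel, following the parallelism table above.
# The LTE configuration values below are assumed for illustration.
from math import log2

N_ant, N_sym, N_FFT, N_sub, N_Mod = 2, 14, 2048, 1200, 16

threads = {
    "OFDM demodulation (FFT)": N_ant * N_sym * N_FFT,
    "Channel estimation":      N_ant * N_sub,
    "MIMO detection":          N_sym * N_sub,
    "Modulation demapper":     N_ant * N_sym * N_sub * int(log2(N_Mod)),
    "Descrambling":            N_ant * N_sym * N_sub,
}

for kernel, n in threads.items():
    print(f"{kernel:26s} {n:8d} threads")
```

Even this single-subframe configuration yields tens of thousands of independent threads per kernel, which is why the PHY layer maps well onto a 192-core GPU.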
Parallelism in Turbo Decoder*
- Total number of threads: N_thread = N_packet x N_subblock x N_trellis
- Implementation performance tradeoffs:
Parallelism Scheme | Throughput | Latency   | Bit Error Rate
Packet-level       | Better     | Worse     | No change
Subblock-level     | Better     | No change | Worse
Trellis-level      | Better     | No change | No change
Subblock+NII       | Worse      | No change | Better
Subblock+TS        | Worse      | No change | Better
*Qi Zheng, et al., Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors, ISCAS 2013, Beijing
Mobile GPU Architecture
- GPU on the NVIDIA Jetson Tegra K1 board
- Kepler architecture: 192 CUDA cores @ 852 MHz
- 64 KB L1 memory / 128 KB L2 cache / shares 2 GB DDR3L with the CPU
Performance Metrics
- Collected stats for a variety of important events
- CUDA Event API: throughput, latency
- NVIDIA Profiler: achieved occupancy, memory utilization, dynamic instruction count
GPU Occupancy
[Charts: achieved occupancy (%) and achieved throughput (Mbps) vs. number of packets, for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Packet batching increases throughput, which saturates at 8 packets due to full GPU utilization
Hotspot Analysis
[Chart: execution time breakdown across FFT, Channel Estimation, MIMO Detection, Demodulation, Descrambling, Rate Matching, for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Demodulation dominates due to its heavy computation
Hotspot Analysis -- #Instructions
[Chart: executed instruction breakdown across FFT, Channel Estimation, MIMO Detection, Demodulation, Descrambling, Rate Matching, for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Demodulation dominates due to its heavy computation
Turbo Throughput
[Chart: Turbo decoding throughput (Mbps) vs. number of packets (0-18), for #Subblock = 16, 32, 64]
- Throughput increases with the number of subblocks and packets, and starts saturating at 64 subblocks
Frequency Scaling -- PHY
[Chart: achieved throughput (Mbps) vs. GPU frequency (MHz), for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Throughput scales almost linearly with frequency
Frequency Scaling -- Turbo
[Chart: achieved throughput (Mbps) vs. GPU frequency (MHz)]
- Throughput scales almost linearly with frequency