Leveraging Mobile GPUs for Flexible High-speed Wireless Communication

Leveraging Mobile GPUs for Flexible High-speed Wireless Communication
Qi Zheng, Cao Gao, Trevor Mudge, Ronald Dreslinski (University of Michigan, Ann Arbor)
The 3rd International Workshop on Parallelism in Mobile Platforms (PRISM-3), June 14, 2015

Popular Mobile Devices & Applications

Fast Technology Evolution
[Chart: log-scale growth from 2000 to 2010, values spanning 1.0E-01 to 1.0E+03]

Support of Multiple Standards

Wireless Processors in Mobile Devices
- Hardware-accelerator-based processors: ASICs, DSPs with many ASICs, FPGAs customized as ASICs
- Pros: high performance, high energy efficiency
- Cons: long time to market; hard and costly to maintain compatibility
- Alternative: software-defined radio

Software-defined Radio (SDR)
- The wireless system is implemented entirely in software, in a general-purpose programming language
- Pros: very short time to market (only new software is needed for new technologies); easy to maintain compatibility (different programs for different standards)
- Open questions: performance? energy efficiency?

Processor for SDR: GPU
- High computing throughput
- High energy efficiency
- Exploits DLP and TLP
- Good programmability

Graphics Processing Unit (GPU)
- High throughput: GOPS-level peak throughput; the SIMT execution model makes use of abundant DLP and TLP
- Good programming support: CUDA/OpenCL
- Desktop GPU power: 100 W; mobile GPU power: ~1 W
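
As a rough illustration of the SIMT model the slides rely on (a minimal sketch, not taken from the talk; the kernel and its names are hypothetical), every thread runs the same instruction stream on a different data element, which is exactly how DLP in a stream of baseband samples is exposed:

```cuda
// Minimal SIMT sketch: one thread per complex sample; all threads in a warp
// execute the same instructions on different data.
#include <cuComplex.h>

__global__ void scale_samples(cuFloatComplex* x, float gain, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) {
        x[i] = make_cuFloatComplex(gain * cuCrealf(x[i]),
                                   gain * cuCimagf(x[i]));
    }
}

// Host launch: scale_samples<<<(n + 255) / 256, 256>>>(d_x, 0.5f, n);
```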

Our Contribution
- Parallel implementations of LTE baseband kernels on a mobile GPGPU
- Performance of running the LTE baseband:
  - Meets the specification peak data rate for the physical-layer kernels
  - Gets close to the 10 Mbps typical data rate for the Turbo decoder
  - Meets the 20 ms latency budget
- Power and energy efficiency of a mobile GPGPU:
  - High power compared with application-specific processors, but still in an acceptable range
- Implications for the GPU architecture to further reduce power

Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
- Implications for future mobile GPGPUs
- Conclusion

LTE Baseband Downlink Receiver
- The downlink receiver accounts for most of the computation in a mobile device
- PHY-layer kernels between the radio (antenna/RF) and the MAC layer: OFDM demodulation, channel estimation, MIMO detection, modulation demapping, descrambling, rate matching, Turbo decoding

Parallelism in Each Kernel
- PHY-layer kernels: antenna-level, symbol-level, subcarrier-level, and algorithm-level parallelism
- Turbo decoder: trellis-level and subblock-level parallelism
- For all kernels: batch multiple packets (see the mapping sketch below)
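
A minimal sketch of how these parallelism dimensions might map onto a CUDA launch (an assumed mapping, not the authors' code): the grid covers antennas and symbols, threads cover subcarriers, so every (antenna, symbol, subcarrier) element gets its own thread. The sign-flip formulation of descrambling and the data layout are assumptions for illustration.

```cuda
// Sketch: descrambling expressed with antenna-, symbol-, and subcarrier-level
// parallelism. One thread per (antenna, symbol, subcarrier) element.
__global__ void descramble(float* llr, const unsigned char* scramble_bits,
                           int n_ant, int n_sym, int n_sub) {
    int sub = blockIdx.x * blockDim.x + threadIdx.x;  // subcarrier-level
    int sym = blockIdx.y;                             // symbol-level
    int ant = blockIdx.z;                             // antenna-level
    if (sub >= n_sub) return;

    int idx = (ant * n_sym + sym) * n_sub + sub;
    // Descrambling of soft bits: flip the LLR sign where the scrambling bit is 1.
    if (scramble_bits[idx]) llr[idx] = -llr[idx];
}

// Host launch:
//   dim3 block(256);
//   dim3 grid((n_sub + 255) / 256, n_sym, n_ant);
//   descramble<<<grid, block>>>(d_llr, d_scramble, n_ant, n_sym, n_sub);
```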

Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion

Experimental Setup
- Jetson TK1 board:
  - NVIDIA 4-Plus-1 quad-core ARM Cortex-A15 CPU
  - NVIDIA Kepler GK20a GPU: 192 CUDA cores at 852 MHz; 256 KB register file / 64 KB L1 memory / 128 KB L2 cache
  - 2 GB DDR3L RAM (shared by CPU and GPU)
- Wattsup.net power meter; about 33% of the board power is consumed by the SoC

Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion

PHY Layer Throughput
[Plot: achieved throughput (Mbps) vs. number of batched packets, for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, and 2x2+16QAM]
- Packet batching increases throughput; throughput saturates at 8 packets (see the batching sketch below)
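
One way such batching could be expressed (a hedged sketch; the actual batching scheme is not shown in the slides) is to add a packet dimension to the launch, so several packets are processed in a single kernel call:

```cuda
// Sketch: batching n_packets into one launch. Each packet's samples are laid
// out contiguously; blockIdx.y selects the packet.
__global__ void demap_batched(float* llr, const float* symbols,
                              int samples_per_packet) {
    int s   = blockIdx.x * blockDim.x + threadIdx.x;  // sample within packet
    int pkt = blockIdx.y;                             // packet within batch
    if (s >= samples_per_packet) return;

    int idx = pkt * samples_per_packet + s;
    llr[idx] = symbols[idx];   // placeholder for the real demapping math
}

// Host launch:
//   dim3 block(256);
//   dim3 grid((samples_per_packet + 255) / 256, n_packets);
//   demap_batched<<<grid, block>>>(d_llr, d_symbols, samples_per_packet);
```

More packets means more thread blocks in flight and higher GPU utilization, until the GPU saturates (here, around 8 packets).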

Compare with Specification Peak
[Plot: throughput normalized to the specification peak vs. number of packets, for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, and 2x2+16QAM]
- Overall, the specification peak data rates are fulfilled for all but one configuration
- 2 packets is a good batching size

Compare with the 10 Mbps Typical Rate
[Plot: throughput normalized to the average/typical user rate vs. number of packets, for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, and 2x2+16QAM]
- Fulfills all typical user data rates

PHY Layer Latency
[Plot: worst-case latency (ms) vs. number of packets, for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, and 2x2+16QAM]
- Packet batching also increases latency; up to 8 packets can be batched within the 20 ms latency budget

Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion

Turbo Throughput
[Plot: Turbo decoding throughput (Mbps) vs. number of packets, for 16, 32, and 64 subblocks]
- Throughput increases with the number of subblocks and the batching size
- 32 subblocks + 2 packets: 8.4 Mbps throughput, 4.8 ms latency
- Far below the peak rate, but close to the ~10 Mbps typical rate
*Qi Zheng, et al., "Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors," ISCAS 2013, Beijing
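
A hedged sketch of how subblock- and packet-level parallelism could be laid out on the GPU (the decoder math itself is omitted; decode_segment is a placeholder, not the authors' implementation): a code block of block_len trellis steps is split into n_subblock segments, each decoded by its own thread block.

```cuda
// Sketch: one thread block per subblock, one grid.y slot per packet.
// decode_segment() stands in for the max-log-MAP forward/backward recursions.
__device__ void decode_segment(const float* llr_in, float* llr_out,
                               int first, int last) {
    // Placeholder body so the sketch compiles; the real recursions go here.
    for (int i = first + threadIdx.x; i < last; i += blockDim.x)
        llr_out[i] = llr_in[i];
}

__global__ void turbo_subblocks(const float* llr_in, float* llr_out,
                                int block_len, int n_subblock) {
    int pkt = blockIdx.y;                              // packet-level parallelism
    int sb  = blockIdx.x;                              // subblock-level parallelism
    int len = (block_len + n_subblock - 1) / n_subblock;
    int first = sb * len;
    int last  = min(first + len, block_len);

    const float* in  = llr_in  + pkt * block_len;
    float*       out = llr_out + pkt * block_len;
    decode_segment(in, out, first, last);
}

// Launch:
//   turbo_subblocks<<<dim3(n_subblock, n_packets), 128>>>(d_in, d_out, block_len, n_subblock);
```

More subblocks means more concurrent thread blocks (higher throughput) but shorter segments, which is the source of the BER penalty noted in the backup slides.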

Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion

Energy Efficiency
[Plot: energy efficiency (Mb/J) at GPU frequencies of 72, 252, 468, 684, and 852 MHz]
- Higher frequency achieves better energy efficiency (run to finish)
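
A sketch of the reasoning behind the run-to-finish result (my reading, not a derivation from the talk's data): the workload is fixed, so running faster shrinks the time over which the always-on baseline power of the SoC and board is paid.

```latex
\text{Efficiency (Mb/J)} = \frac{B}{E(f)}, \qquad
E(f) = \bigl(P_{\text{base}} + P_{\text{dyn}}(f)\bigr)\, t(f), \qquad
t(f) \approx \frac{W}{f}
```

With a fixed number of bits B and fixed work W, the baseline term P_base * t falls roughly as 1/f, so as long as baseline power is a sizeable share of the total, raising the frequency improves Mb/J, which is consistent with the measured trend.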

Power Consumption

Processor          Standards    Details
ASIC (LeoCore)     WiMAX        70 mW @ 70 MHz
FPGA (MAGALI)      LTE          234 mW @ 400 MHz
DSP (Tomahawk)     LTE, WiMAX   904 mW @ 170 MHz
CPU (SoftLTE)      LTE          130 W @ 2.66 GHz
Mobile GPU         LTE          1390 mW @ 852 MHz

Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
- Implications for future mobile GPGPUs
- Conclusion

Reduce Energy from Unused Registers
- The register file is underutilized when running the PHY-layer kernels, due to the max-threads-per-SM and max-thread-blocks-per-SM limits
- Proposal: power-gate unused registers (a utilization-estimate sketch follows)
[Plot: normalized RF utilization vs. number of packets (1, 2, 4, 8, 16), for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, and 2x2+16QAM]
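
A sketch (not from the talk; demapper_kernel is a placeholder) of how the register-file utilization behind this plot could be estimated with standard CUDA runtime queries: registers held by the resident threads divided by the SM's register file size.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void demapper_kernel(float* llr, const float* sym, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) llr[i] = sym[i];   // stand-in for the real demapping math
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, demapper_kernel);   // registers per thread

    int block = 256;
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM,
                                                  demapper_kernel, block, 0);

    // Registers actually held by resident threads vs. the SM's register file.
    long long usedRegs = (long long)blocksPerSM * block * attr.numRegs;
    printf("RF utilization: %.1f%% (%lld of %d registers)\n",
           100.0 * usedRegs / prop.regsPerMultiprocessor,
           usedRegs, prop.regsPerMultiprocessor);
    return 0;
}
```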

GPU Support for Trellis Algorithms
- Trellis algorithms: Turbo, Viterbi, Baum-Welch
- Used in coding theory, speech recognition, and data compression
- Proposal: architecture and ISA support, similar to the AES instruction set in Intel CPUs (see the add-compare-select sketch below)
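
The core operation such an instruction would accelerate is the add-compare-select (ACS) step of a trellis stage. A minimal sketch (not from the talk; max-log approximation, illustrative names) of what each thread has to compute today in plain arithmetic, and which a fused instruction could collapse into one operation, much as AES-NI collapses an AES round:

```cuda
// Sketch: add-compare-select (ACS) for one trellis state, max-log form.
// alpha_prev* are path metrics from the previous stage, gamma* the branch
// metrics of the two incoming edges.
__device__ __forceinline__
float acs(float alpha_prev0, float gamma0,
          float alpha_prev1, float gamma1) {
    float m0 = alpha_prev0 + gamma0;   // add
    float m1 = alpha_prev1 + gamma1;   // add
    return fmaxf(m0, m1);              // compare-select (max-log approximation)
}
```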

Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
- Implications for future mobile GPGPUs
- Conclusion

Conclusion
- Mobile GPUs can be used for wireless communication in a mobile device
  - Meet the specification peak for 3 out of 4 PHY configurations
  - Meet the typical rate for all PHY configurations
  - The Turbo decoder comes close to the typical data rate, but needs special support in a mobile GPU
- Energy is high compared with application-specific solutions, but within an acceptable range; further optimization is needed

Thanks! Any questions?

Backup Slides

Quick Subscription Growth
[Plot: world mobile-cellular subscriptions (millions), 1998 to 2011]
*ITU report, "The World in 2011: ICT Facts and Figures"

Our Contribution
- Parallel implementations of LTE baseband kernels on a mobile GPGPU
- Throughput and latency of different configurations
- Energy efficiency of a mobile GPGPU

Parallelism in PHY Layer Kernels

Kernel                    Parallelism                                       Number of threads
OFDM modulation (FFT)     Antenna-, symbol-, algorithm-level                N_ant x N_sym x N_FFT
Channel estimation        Antenna-, subcarrier-level                        N_ant x N_sub
MIMO detector             Symbol-, subcarrier-level                         N_sym x N_sub
Modulation demapper       Antenna-, symbol-, subcarrier-, algorithm-level   N_ant x N_sym x N_sub x N_Mod; N_ant x N_sym x N_sub x log2(N_Mod)
Descrambling              Antenna-, symbol-, subcarrier-level               N_ant x N_sym x N_sub

Parallelism in Turbo Decoder*
- Total number of threads: N_thread = N_packet x N_subblock x N_trellis, where N_trellis is the number of threads per trellis
- Implementation performance tradeoff:

Parallelism scheme   Throughput   Latency     Bit error rate
Packet-level         Better       Worse       No change
Subblock-level       Better       No change   Worse
Trellis-level        Better       No change   No change
Subblock + NII       Worse        No change   Better
Subblock + TS        Worse        No change   Better

*Qi Zheng, et al., "Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors," ISCAS 2013, Beijing

Mobile GPU Architecture
- GPU on the NVIDIA Jetson Tegra K1 board
- Kepler architecture, 192 CUDA cores @ 852 MHz
- 64 KB L1 memory / 128 KB L2 cache / 2 GB DDR3L shared with the CPU

Performance Metrics
- Collected statistics for a variety of important events
- CUDA Event API: throughput, latency (see the timing sketch below)
- NVIDIA Profiler: achieved occupancy, memory utilization, dynamic instruction count
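
A minimal sketch of this kind of measurement setup (assumed, not the authors' code; phy_kernel and the bit count are placeholders), using the CUDA Event API to derive latency and throughput for a kernel launch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void phy_kernel(float* out, const float* in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;   // placeholder for real baseband work
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    phy_kernel<<<(n + 255) / 256, 256>>>(d_out, d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);        // latency in milliseconds
    // Illustrative throughput: treat each 32-bit sample as 32 processed bits.
    double mbps = (n * 32.0 / 1e6) / (ms / 1e3);
    printf("latency = %.3f ms, throughput = %.1f Mbps\n", ms, mbps);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```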

GPU Occupancy
[Plots: achieved occupancy and achieved throughput (Mbps) vs. number of packets, for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, and 2x2+16QAM]
- Packet batching increases throughput, which saturates at 8 packets due to full GPU utilization

Hotspot Analysis -- Execution Time Breakdown
[Stacked bar chart: execution-time breakdown across FFT, channel estimation, MIMO detection, demodulation, descrambling, and rate matching, for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, and 2x2+16QAM]
- Demodulation dominates due to its heavy computation

Hotspot Analysis -- Executed Instructions
[Stacked bar chart: executed-instruction breakdown across FFT, channel estimation, MIMO detection, demodulation, descrambling, and rate matching, for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, and 2x2+16QAM]
- Demodulation dominates due to its heavy computation

Turbo Throughput
[Plot: Turbo decoding throughput (Mbps) vs. number of packets, for 16, 32, and 64 subblocks]
- Throughput increases with the number of subblocks and packets, and starts saturating at 64 subblocks

Frequency Scaling -- PHY
[Plot: achieved throughput (Mbps) vs. GPU frequency (MHz), for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, and 2x2+16QAM]
- Throughput scales almost linearly with frequency

Frequency Scaling -- Turbo
[Plot: achieved Turbo decoding throughput (Mbps) vs. GPU frequency (MHz)]
- Throughput scales almost linearly with frequency