Leveraging Mobile GPUs for Flexible High-speed Wireless Communication
1 Leveraging Mobile GPUs for Flexible High-speed Wireless Communication
Qi Zheng, Cao Gao, Trevor Mudge, Ronald Dreslinski*, Ann Arbor
The 3rd International Workshop on Parallelism in Mobile Platforms (PRISM-3), Jun 14,
2 Popular Mobile Device & Applications
3 Fast Technology Evolution
[Figure: chart on a logarithmic axis; values not recoverable]
4 Support of Multiple Standards
5 Wireless Processor in Mobile Devices
- Hardware accelerator based processors
  - ASIC
  - DSP with many ASICs
  - FPGA customized as ASICs
- Strengths: high performance; high energy efficiency
- Weaknesses: long time to market; hard and costly to maintain compatibility
- Alternative: Software-defined Radio
6 Software-defined Radio (SDR)
- Wireless system is implemented entirely in software
- General-purpose programming language
- Strengths:
  - Very short time to market: only implement new software for new technologies
  - Easy to maintain compatibility: different programs for different standards
- Open questions: Performance? Energy efficiency?
7 Processor for SDR: GPU
- High computing throughput
- High energy efficiency
- Exploits data-level parallelism (DLP) and thread-level parallelism (TLP)
- Good programmability
8 Graphics Processing Unit (GPU)
- High throughput: GOPS-level peak throughput
- SIMT execution model: makes use of abundant DLP and TLP
- Good programming support: CUDA/OpenCL
- Desktop GPU power: 100 W; mobile GPU power: ~1 W
9 Our Contribution
- Parallel implementations of LTE baseband kernels on a mobile GPGPU
- Performance of running LTE baseband
  - Meets the specification peak data rate for physical layer kernels
  - Gets close to the 10 Mbps typical data rate for the Turbo decoder
  - Meets the 20 ms latency budget
- Power and energy efficiency of a mobile GPGPU
  - High power compared with application-specific processors
  - Still in an acceptable range
- Implications for GPU architecture to further reduce the power
10 Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
- Implications for future mobile GPGPUs
- Conclusion
11 LTE Baseband Downlink Receiver
- The downlink receiver has the most computations in a mobile device
[Figure: receiver block diagram from the antenna. Blocks: RF, OFDM Demodulation, Channel Estimate, MIMO Detect, Modulation Demapper, Descrambling, Rate Match, Turbo Decoder. Legend: PHY layer kernel, MAC layer kernel]
12 Parallelism in Each Kernel
- The PHY layer kernels
  - Antenna-level
  - Symbol-level
  - Subcarrier-level
  - Algorithm-level
- The Turbo decoder
  - Trellis-level
  - Subblock-level
- For all kernels
  - Batch multiple packets
13 Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion
14 Experimental Setup
- Jetson TK1 board
  - NVIDIA 4-Plus-1 quad-core ARM Cortex-A15 CPU
  - NVIDIA Kepler GK20a GPU: 192 CUDA cores, 852 MHz
  - 256 KB RF / 64 KB L1 memory / 128 KB L2 cache
  - 2 GB DDR3L RAM (shared by CPU and GPU)
- Wattsup.net power meter
  - 33% of board power consumed by the SoC
15 Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion
16 PHY Layer Throughput
[Figure: achieved throughput (Mbps) vs. the number of packets for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Packet batching increases throughput
- Saturate at
17 Compare with Specification Peak
[Figure: throughput normalized to specification peak vs. the number of packets for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Fulfill all but one specification peak data rates
- Overall, 2 packets -- a good batching size
18 Compare with 10Mbps Typical Rate
[Figure: throughput normalized to average/typical user rate vs. the number of packets for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Fulfill all typical user data rates
19 PHY Layer Latency
[Figure: worst-case latency (ms) vs. the number of packets for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Packet batching also increases latency
- Can batch 8 packets given the 20 ms budget
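The batching trade-off on this slide can be sketched in Python as a simple selection rule: take the largest batch whose worst-case latency still fits the 20 ms budget. The function name and the latency table below are hypothetical placeholders chosen for illustration; they are not measurements from the talk.

```python
from typing import Dict, Optional


def largest_batch_within_budget(
    latency_ms_by_batch: Dict[int, float], budget_ms: float = 20.0
) -> Optional[int]:
    """Return the largest batch size whose worst-case latency fits the budget.

    Returns None when no batch size meets the budget.
    """
    feasible = [n for n, lat in latency_ms_by_batch.items() if lat <= budget_ms]
    return max(feasible) if feasible else None


# Hypothetical worst-case latencies (ms) per batch size, NOT data from the slides.
example_latency_ms = {1: 3.0, 2: 5.5, 4: 10.5, 8: 19.0, 16: 37.0}

best = largest_batch_within_budget(example_latency_ms, budget_ms=20.0)
print(f"largest batch within budget: {best}")
```

With real profiler or CUDA event timings substituted for the placeholder table, the same rule picks the batch size for each antenna and modulation configuration.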
20 Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion
21 Turbo Throughput
[Figure: Turbo decoding throughput (Mbps) vs. the number of packets for 16, 32, and 64 subblocks]
- Throughput increases with the number of subblocks and the batching size
- 32 subblocks + 2 packets: 8.4 Mbps throughput, 4.8 ms latency
- Far below the peak rate, but close to the 10 Mbps typical rate
*Qi Zheng, et al., Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors, ISCAS 2013, Beijing
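As a small worked check of the numbers on this slide, the Python sketch below normalizes the reported Turbo throughput to the 10 Mbps typical rate and compares the reported latency with the 20 ms budget mentioned earlier in the talk. The helper name `normalize` and the constant names are illustrative assumptions.

```python
def normalize(achieved_mbps: float, reference_mbps: float) -> float:
    """Return achieved throughput as a fraction of a reference rate."""
    if reference_mbps <= 0:
        raise ValueError("reference rate must be positive")
    return achieved_mbps / reference_mbps


TYPICAL_RATE_MBPS = 10.0   # typical user data rate quoted in the talk
LATENCY_BUDGET_MS = 20.0   # latency budget quoted in the talk

turbo_mbps = 8.4           # 32 subblocks + 2 packets (from the slide)
turbo_latency_ms = 4.8     # same configuration (from the slide)

ratio = normalize(turbo_mbps, TYPICAL_RATE_MBPS)
within_budget = turbo_latency_ms <= LATENCY_BUDGET_MS
print(f"{ratio:.2f} of the typical rate, within latency budget: {within_budget}")
```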
22 Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
  - Methodology
  - PHY layer
  - Turbo decoder
  - Power and energy efficiency
- Implications for future mobile GPGPUs
- Conclusion
23 Energy Efficiency
[Figure: energy efficiency (Mb/J) vs. GPU frequency (MHz)]
- Higher frequency achieves better energy efficiency
- Run to finish
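One way to see why a run-to-finish policy can favor higher frequency is a toy power model, sketched below in Python. Everything in it is an assumption made for illustration: a fixed static power, dynamic power growing linearly with frequency, no voltage scaling, and run time inversely proportional to frequency. The constants are hypothetical and are not measurements from the talk.

```python
def energy_efficiency_mb_per_j(bits_mb: float, power_w: float, time_s: float) -> float:
    """Energy efficiency in Mb/J: megabits processed divided by energy (W * s)."""
    return bits_mb / (power_w * time_s)


def modeled_efficiency(
    freq_mhz: float,
    static_w: float = 1.0,          # hypothetical frequency-independent power
    dyn_w_per_mhz: float = 0.002,   # hypothetical dynamic power slope
    work_mcycles: float = 8.52,     # hypothetical cycles (millions) per batch
    bits_mb: float = 1.0,           # hypothetical payload per batch
) -> float:
    """Toy model: time shrinks as 1/f while static power is paid for the whole run."""
    time_s = work_mcycles / freq_mhz
    power_w = static_w + dyn_w_per_mhz * freq_mhz
    return energy_efficiency_mb_per_j(bits_mb, power_w, time_s)


for f in (72.0, 396.0, 852.0):
    print(f"{f:6.0f} MHz -> {modeled_efficiency(f):.2f} Mb/J (toy model)")
```

Under these assumptions the static energy per batch falls as the run gets shorter, so efficiency rises with frequency. A model that included voltage scaling could behave differently.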
24 Power Consumption
Processor: standards (the Details column survives only as the fragment "70MHz")
- ASIC (LeoCore): WiMAX
- FPGA (MAGALI): LTE
- DSP (Tomahawk): LTE, WiMAX
- CPU (SoftLTE): LTE
- Mobile GPU: LTE
25 Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
- Implications for future mobile GPGPUs
- Conclusion
26 Reduce Energy from Unused Register
- Register file is underutilized when running the PHY layer
  - Due to max_thread/sm and max_thread_block/sm
- Power gate unused registers
[Figure: normalized RF utilization (0% to 100%) vs. the number of packets (1, 2, 4, 8, 16) for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
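The underutilization argument can be sketched in Python with a rough occupancy-style estimate. The 256 KB register file size is the figure given in the experimental setup; the per-SM limits of 2048 resident threads and 16 resident blocks are assumed typical Kepler values, and the kernel parameters in the example are hypothetical. The sketch ignores register allocation granularity and shared memory limits.

```python
def rf_utilization(
    threads_per_block: int,
    regs_per_thread: int,
    blocks_launched: int,
    rf_bytes: int = 256 * 1024,      # register file size from the setup slide
    max_threads_per_sm: int = 2048,  # assumed Kepler limit
    max_blocks_per_sm: int = 16,     # assumed Kepler limit
) -> float:
    """Estimate the fraction of the register file held by resident threads."""
    total_regs = rf_bytes // 4  # 32-bit registers
    limit_by_threads = max_threads_per_sm // threads_per_block
    limit_by_regs = total_regs // (regs_per_thread * threads_per_block)
    resident_blocks = min(
        blocks_launched, max_blocks_per_sm, limit_by_threads, limit_by_regs
    )
    used_regs = resident_blocks * threads_per_block * regs_per_thread
    return used_regs / total_regs


# Hypothetical small-block kernel: the 16-block limit caps residency,
# so most of the register file stays idle.
print(f"{rf_utilization(64, 20, 100):.1%}")
```

In this example the block limit, not the register file, bounds residency, which is the situation where power gating the unused registers would save energy.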
27 GPU Support for Trellis Algorithm
- Trellis algorithm
  - Turbo algorithm
  - Viterbi algorithm
  - Baum-Welch algorithm
  - Coding theory
  - Speech recognition
  - Data compression
- Architecture and ISA support
  - Similar to the AES instruction set in Intel CPUs
28 Outline
- Motivation
- LTE baseband kernels
- Running LTE baseband on a mobile GPGPU
- Implications for future mobile GPGPUs
- Conclusion
29 Conclusion
- Mobile GPUs can be used for wireless communication in a mobile device
  - Meet specification peaks for 3 out of 4 PHY configurations
  - Meet the typical rate for all PHY configurations
  - Turbo is close to the typical data rate
- The Turbo decoder needs special support in a mobile GPU
- Energy is high compared with application-specific solutions
  - But within an acceptable range
  - Needs further optimization
30 Thanks! Any questions?
31 Backup Slides
32 Quick Subscription Growth
[Figure: world mobile-cellular subscriptions (millions) by year]
*ITU report -- The World in 2011: ICT Facts and Figures
33 Our Contribution
- Parallel implementations of LTE baseband kernels on a mobile GPGPU
- Throughputs and latency of different configurations
- Energy efficiency of a mobile GPGPU
34 Parallelism in PHY Layer Kernels
Kernel: parallelism; number of threads
- OFDM Modulation (FFT): antenna-level, symbol-level, algorithm-level; N_ant × N_sym × N_FFT
- Channel Estimation: antenna-level, subcarrier-level; N_ant × N_sub
- MIMO detector: symbol-level, subcarrier-level; N_sym × N_sub
- Modulation demapper: antenna-level, symbol-level, subcarrier-level, algorithm-level; N_ant × N_sym × N_sub × N_Mod, N_ant × N_sym × N_sub × log2(N_Mod)
- Descrambling: antenna-level, symbol-level, subcarrier-level; N_ant × N_sym × N_sub
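The thread counts in this table can be evaluated with a short Python sketch. It reads each entry as a product of its parallelism dimensions, which is an interpretation of the flattened slide, and it reports both demapper expressions separately. The function name, the dictionary keys, and the example parameter values are assumptions for illustration only.

```python
def phy_thread_counts(n_ant: int, n_sym: int, n_fft: int, n_sub: int, n_mod: int) -> dict:
    """Thread counts per PHY kernel, read as products of the parallelism levels."""
    if n_mod < 2 or n_mod & (n_mod - 1):
        raise ValueError("n_mod must be a power of two (e.g. 4 for QPSK, 16 for 16QAM)")
    bits_per_symbol = n_mod.bit_length() - 1  # exact log2 for powers of two
    return {
        "fft": n_ant * n_sym * n_fft,
        "channel_estimation": n_ant * n_sub,
        "mimo_detector": n_sym * n_sub,
        "demapper_per_constellation_point": n_ant * n_sym * n_sub * n_mod,
        "demapper_per_bit": n_ant * n_sym * n_sub * bits_per_symbol,
        "descrambling": n_ant * n_sym * n_sub,
    }


# Hypothetical configuration: 2 antennas, 14 symbols, 2048-point FFT,
# 1200 subcarriers, 16QAM.
counts = phy_thread_counts(n_ant=2, n_sym=14, n_fft=2048, n_sub=1200, n_mod=16)
for kernel, threads in counts.items():
    print(f"{kernel}: {threads}")
```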
35 Parallelism in Turbo Decoder*
- Total number of threads: N_thread = N_packet × N_subblock × Thread_trellis
- Implementation performance tradeoff
Parallelism scheme: throughput; latency; bit error rate
- Packet-level: better; worse; no change
- Subblock-level: better; no change; worse
- Trellis-level: better; no change; no change
- Subblock+NII: worse; no change; better
- Subblock+TS: worse; no change; better
*Qi Zheng, et al., Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors, ISCAS 2013, Beijing
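The total thread count on this slide is a product of three factors, which the Python sketch below evaluates. The default of 8 threads per trellis is an assumption (one thread per state of the 8-state LTE turbo trellis), not a value stated on the slide, and the function name is illustrative.

```python
def turbo_thread_count(n_packet: int, n_subblock: int, threads_per_trellis: int = 8) -> int:
    """Total threads = packets x subblocks per packet x threads per trellis.

    threads_per_trellis=8 assumes one thread per trellis state (assumption).
    """
    if min(n_packet, n_subblock, threads_per_trellis) < 1:
        raise ValueError("all factors must be positive")
    return n_packet * n_subblock * threads_per_trellis


# Configuration highlighted in the talk: 2 packets, 32 subblocks.
total = turbo_thread_count(n_packet=2, n_subblock=32)
print(f"total threads: {total}")
```

Raising the packet or subblock count increases the thread count in the same way, but per the tradeoff table the two choices cost latency and bit error rate respectively.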
36 Mobile GPU Architecture
- GPU on the NVIDIA Jetson Tegra K1 board
- Kepler architecture: 192 CUDA cores, 852 MHz
- 64 KB L1 memory / 128 KB L2 cache / shares 2 GB DDR3L with the CPU
37 Performance Metrics
- Collected stats for a variety of important events
- CUDA Event API
  - Throughput
  - Latency
- NVIDIA Profiler
  - Achieved occupancy
  - Memory utilization
  - Dynamic instruction count
38 GPU Occupancy
[Figure: achieved occupancy (0% to 100%) vs. the number of packets]
[Figure: achieved throughput (Mbps) vs. the number of packets for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Packet batching increases throughput, which saturates at 8 packets due to full GPU utilization
39 Hotspot Analysis
[Figure: execution time breakdown for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM across Rate Matching, Descrambling, Demodulation, MIMO Detection, Channel Estimation, FFT]
- Demodulation dominates due to its heavy computation
40 Hotspot Analysis -- #instructions
[Figure: share of executed instructions per kernel (Rate Matching, Descrambling, Demodulation, MIMO Detection, Channel Estimation, FFT) for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Demodulation dominates due to its heavy computation
41 Turbo Throughput
[Figure: Turbo decoding throughput (Mbps) vs. the number of packets for 16, 32, and 64 subblocks]
- Throughput increases with the number of subblocks and packets, and starts saturating at 64 subblocks
42 Frequency Scaling -- PHY
[Figure: achieved throughput (Mbps) vs. GPU frequency (MHz) for 1x1+QPSK, 2x2+QPSK, 1x1+16QAM, 2x2+16QAM]
- Throughput scales almost linearly with frequency
43 Frequency Scaling -- Turbo
[Figure: achieved throughput (Mbps) vs. GPU frequency (MHz)]
- Throughput scales almost linearly with frequency
More informationMPSOC 2011 BEAUNE, FRANCE
MPSOC 2011 BEAUNE, FRANCE BOADRES: A SCALABLE BASEBAND PROCESSOR TEMPLATE FOR Gbps RADIOS VICE PRESIDENT, CHAIRMAN OF THE TECHNOLOGY OFFICE PROFESSOR AT THE KATHOLIEKE UNIVERSITEIT LEUVEN STATUS SDR BASEBAND
More informationAccelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies
Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia
More informationAugust, 2010 Enabling Software Defined Radio with the Modem Vector Signal Processor ENT-F0766
August, 2010 Enabling Software Defined Radio with the Modem Vector Signal Processor ENT-F0766 Kevin Traylor Freescale Fellow Reg. U.S. Pat. & Tm. Off. BeeKit, BeeStack, CoreNet, the Energy Efficient Solutions
More informationCompilation of Parametric Dataflow Applications for Software-Defined-Radio-Dedicated MPSoCs
Compilation of Parametric Dataflow Applications for Software-Defined-Radio-Dedicated MPSoCs PhD work of Mickael Dardaillon Mickaël Dardaillon, Kevin Marquet (Citi), Tanguy Risset (Citi), Jérôme Martin
More informationPowering Real-time Radio Astronomy Signal Processing with latest GPU architectures
Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures Harshavardhan Reddy Suda NCRA, India Vinay Deshpande NVIDIA, India Bharat Kumar NVIDIA, India What signals we are processing?
More informationThird Genera+on USRP Devices and the RF Network- On- Chip. Leif Johansson Market Development RF, Comm and SDR
Third Genera+on USRP Devices and the RF Network- On- Chip Leif Johansson Market Development RF, Comm and SDR About Ettus Research Leader in soeware defined radio and signals intelligence Maker of USRP
More informationBuilding heterogeneous networks of green base stations on TI s KeyStone II architecture
WHITE PAPER Introduction Zhihong Lin, Strategic Marketing Manager, Wireless Base Station Infrastructure Texas Instruments Fueled by the ability to connect to any device anytime and anywhere, both interpersonal
More informationImplementation of a Fully-Parallel Turbo Decoder on a General-Purpose Graphics Processing Unit
Received June 7, 2016, accepted June 28, 2016, date of publication June 29, 2016, date of current version October 6, 2016. Digital Object Identifier 10.1109/ACCESS.2016.2586309 Implementation of a Fully-Parallel
More informationA Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors
A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors Brent Bohnenstiehl and Bevan Baas Department of Electrical and Computer Engineering University of California, Davis {bvbohnen,
More informationBenchmarking Real-World In-Vehicle Applications
Benchmarking Real-World In-Vehicle Applications NVIDIA GTC 2015-03-18 m y c a b l e GmbH Michael Carstens-Behrens Gartenstraße 10 24534 Neumuenster, Germany +49 4321 559 56-55 +49 4321 559 56-10 mcb@mycable.de
More informationUNIVERSITAT POLITÈCNICA DE CATALUNYA
Department of Signal Theory and Communications UNIVERSITAT POLITÈCNICA DE CATALUNYA RESOURCE MANAGEMENT STRATEGIES FOR SDR CLOUDS Vuk Marojevic Ismael Gomez Gabriel Montoro Antoni Gelonch Pere Gilabert
More informationReminder. Course project team forming deadline. Course project ideas. Friday 9/8 11:59pm You will be randomly assigned to a team after the deadline
Reminder Course project team forming deadline Friday 9/8 11:59pm You will be randomly assigned to a team after the deadline Course project ideas If you have difficulty in finding team mates, send your
More informationFast packet processing in the cloud. Dániel Géhberger Ericsson Research
Fast packet processing in the cloud Dániel Géhberger Ericsson Research Outline Motivation Service chains Hardware related topics, acceleration Virtualization basics Software performance and acceleration
More information5G Technology update. Dr. David Hammarwall Head of Product Line 5G, Ericsson
5G Technology update Dr. David Hammarwall Head of Product Line 5G, Ericsson 5g Vision One network supporting multiple use cases and industries Massive MTC Critical MTC Enhanced MBB Fixed Wireless Access
More informationCSC573: TSHA Introduction to Accelerators
CSC573: TSHA Introduction to Accelerators Sreepathi Pai September 5, 2017 URCS Outline Introduction to Accelerators GPU Architectures GPU Programming Models Outline Introduction to Accelerators GPU Architectures
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationA PROGRAMMABLE BASEBAND PLATFORM FOR SOFTWARE-DEFINED RADIO
A PROGRAMMABLE BASEBAND PLATFORM FOR SOFTWARE-DEFINED RADIO Hans-Martin Bluethgen, Cyprian Grassmann, Wolfgang Raab, Ulrich Ramacher, Josef Hausner, Infineon Technologies AG, 81609 Munich, Germany, Hans-Martin.Bluethgen@infineon.com
More informationCarlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain)
Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain) 4th IEEE International Workshop of High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB
More information