Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors

Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors

Qi Zheng*, Yajing Chen*, Ronald Dreslinski*, Chaitali Chakrabarti+, Achilleas Anastasopoulos*, Scott Mahlke*, Trevor Mudge*
*University of Michigan, Ann Arbor   +Arizona State University, Tempe
ISCAS '13, May 21, 2013

Trellis algorithm

- A trellis is widely used in coding theory
  - Progression of symbols within a code
  - Representation of the state transitions of a finite state machine
- Trellis algorithm: the processing described by the value propagation in a trellis (sketch below)

[Figure: trellis over states 0-7 across stages k-1, k, k+1, with branches labeled input=0 and input=1]
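To make the value-propagation idea concrete, here is a minimal sequential sketch of one trellis stage, written as Viterbi-style add-compare-select. It is an illustration only, not the talk's code; the connectivity table `next_state` and the metric arrays are hypothetical.

```cuda
#include <cfloat>

const int NUM_STATES = 8;

// One stage of value propagation: each state's metric at stage k flows
// along its two outgoing branches (input = 0 and input = 1), and every
// state at stage k+1 keeps the best incoming candidate.
void trellis_stage(const float old_metric[NUM_STATES],    // stage k values
                   float new_metric[NUM_STATES],          // stage k+1 values
                   const int next_state[NUM_STATES][2],   // hypothetical connectivity
                   const float branch_metric[NUM_STATES][2]) {
    for (int s = 0; s < NUM_STATES; ++s)
        new_metric[s] = -FLT_MAX;

    for (int s = 0; s < NUM_STATES; ++s) {
        for (int b = 0; b < 2; ++b) {                        // b = input bit
            int   ns   = next_state[s][b];
            float cand = old_metric[s] + branch_metric[s][b];  // add
            if (cand > new_metric[ns])                         // compare
                new_metric[ns] = cand;                         // select
        }
    }
}
```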

Trellis algorithm

- Broad scope of uses
  - Viterbi / BCJR / Baum-Welch
  - Communication systems / data compression / speech recognition
- Plays an important role in these domains

[Figure: breakdown of the baseband processing in the LTE uplink, comparing the Turbo decoder with the other blocks]

GPU (graphics processing unit)

- High throughput: GFLOPS/TFLOPS-level peak throughput
- High efficiency:

  Processor                GFLOP/dollar   GFLOP/watt
  Nvidia GeForce GTX680    6.192          15.848
  Intel Xeon E7-8837       0.037          0.656
  Intel Itanium 8350       0.007          0.150

- Programming support: OpenCL, CUDA (sketch below)
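As a reminder of what the CUDA programming model looks like, here is a generic kernel-launch sketch; the kernel, sizes, and data are illustrative and not from the talk. The parallelization schemes that follow differ mainly in what one thread is made responsible for.

```cuda
#include <cstdio>

__global__ void scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= alpha;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);    // grid of 256-thread blocks
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```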

GPU performance challenge

- Sources of GPU underutilization
  - Thread inadequacy: a commercial GPU has on the order of 1000 cores, so the workload must expose at least that many threads
  - Pipeline stalls: long memory access latency (L2 cache / external memory)
- Multithreading is used to hide pipeline stalls

Contribution of the paper

- Previous work
  - Mapped the Turbo decoder onto a GPU
  - Studied the throughput and BER of the implementation
- Our work
  - Generalizes the parallelization schemes to the implementation of trellis algorithms on a GPU
  - Explores additional schemes not covered in previous work: forward-backward and branch-metric parallelism
  - Studies the implementation tradeoffs among throughput, processing latency, and BER
  - Shows that different combinations of parallelization schemes suit different system requirements

Outline

- Motivation and background
  - Trellis algorithm
  - GPU
- Parallelization schemes
  - Packet-level
  - Subblock-level
  - Trellis-level
- Implementation tradeoffs
- Conclusion

Packet-level parallelism

- Process multiple packets concurrently: #threads = #packets (see the sketch below)
- Long processing latency, especially for the 1st packet, which waits while the buffer fills

[Figure: buffer of packets feeding the trellis algorithm]
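A minimal sketch of the packet-to-thread mapping, with hypothetical names and a stand-in copy loop in place of the full recursion; the point is only that the thread index selects a packet.

```cuda
// One thread per buffered packet: #threads = #packets. Latency is long
// because decoding starts only after the packet buffer has been filled.
__global__ void decode_packets(const float *rx, float *out,
                               int packet_len, int num_packets) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;  // this thread owns packet p
    if (p >= num_packets) return;
    const float *in_p  = rx  + p * packet_len;
    float       *out_p = out + p * packet_len;
    for (int i = 0; i < packet_len; ++i)
        out_p[i] = in_p[i];     // stand-in for the full trellis recursion
}
```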

Subblock-level parallelism

- A packet is divided into subblocks: #threads = #subblocks (see the sketch below)
- Increases the output error rate, since each subblock starts from an unknown trellis state
- Recovery schemes fix the performance loss, at the cost of additional computation:
  - Training sequence (TS)
  - Next iteration initialization (NII)

[Figure: trellis over states 0-7 split into subblocks, with branches labeled input=0 and input=1]
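A hedged sketch of the subblock mapping with a training-sequence (TS) warm-up; all names and the data layout are assumptions. Each thread first runs `ts_len` stages of the previous subblock to recover a reliable starting state, then processes its own subblock.

```cuda
__global__ void decode_subblocks(const float *packet, float *out,
                                 int packet_len, int subblock_len, int ts_len) {
    int sb    = blockIdx.x * blockDim.x + threadIdx.x;  // one subblock per thread
    int begin = sb * subblock_len;
    if (begin >= packet_len) return;

    // Training sequence: overlap with the tail of the previous subblock,
    // advancing the state metrics but discarding the outputs.
    for (int i = max(begin - ts_len, 0); i < begin; ++i) {
        // (placeholder: state-metric update only, no output)
    }

    int end = min(begin + subblock_len, packet_len);
    for (int i = begin; i < end; ++i)
        out[i] = packet[i];     // stand-in for the real per-stage recursion
}
```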

Trellis-level parallelism

- State-level parallelism: #threads = #states of a stage (see the sketch below)
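A sketch of state-level parallelism under assumed names and data layout: one thread block per trellis, one thread per state, metrics kept in shared memory, and a barrier between consecutive stages.

```cuda
const int NUM_STATES = 8;

// Hypothetical connectivity: prev_state[s][b] is the predecessor of state s
// along the branch taken with input bit b.
__constant__ int prev_state[NUM_STATES][2];

__global__ void state_level(const float *gamma,  // branch metrics [num_stages][NUM_STATES][2]
                            float *alpha,        // state metrics  [num_stages+1][NUM_STATES]
                            int num_stages) {
    __shared__ float cur[NUM_STATES];
    int s = threadIdx.x;                         // this thread owns state s
    cur[s] = (s == 0) ? 0.0f : -1e30f;           // trellis assumed to start in state 0
    alpha[s] = cur[s];
    __syncthreads();

    for (int k = 0; k < num_stages; ++k) {
        const float *g = gamma + (k * NUM_STATES + s) * 2;
        float best = fmaxf(cur[prev_state[s][0]] + g[0],   // add-compare-select
                           cur[prev_state[s][1]] + g[1]);
        __syncthreads();                         // everyone finished reading cur[]
        cur[s] = best;
        alpha[(k + 1) * NUM_STATES + s] = best;
        __syncthreads();                         // stage k+1 metrics now visible
    }
}
// Launched as state_level<<<num_trellises, NUM_STATES>>>(...): 8 threads per stage.
```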

Trellis-level parallelism

- State-level parallelism
- Branch-metric parallelism: #threads = #branches (see the sketch below)

[Figure: one trellis stage (stage k to stage k+1) with its 16 branches assigned to threads 0-15]
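A hedged sketch of branch-metric parallelism; the connectivity tables and layout are assumptions. Each of the 16 threads does the "add" for its own branch; threads 0-7 then do the compare-select for one destination state each.

```cuda
const int NUM_STATES   = 8;
const int NUM_BRANCHES = 2 * NUM_STATES;        // 16 branches per stage

__constant__ int src_state[NUM_BRANCHES];       // branch -> source state
__constant__ int dst_state[NUM_BRANCHES];       // branch -> destination state

// One stage with one thread per branch: #threads = #branches.
__global__ void branch_metric_stage(const float *gamma,       // per-branch metrics
                                    float *state_metric) {    // [NUM_STATES], in/out
    __shared__ float cand[NUM_BRANCHES];
    int b = threadIdx.x;                        // this thread owns branch b

    cand[b] = state_metric[src_state[b]] + gamma[b];   // add, 16-wide
    __syncthreads();

    if (b < NUM_STATES) {                       // compare-select, one state each
        float best = -1e30f;
        for (int j = 0; j < NUM_BRANCHES; ++j)
            if (dst_state[j] == b)
                best = fmaxf(best, cand[j]);
        state_metric[b] = best;
    }
}
```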

Trellis-level parallelism

- State-level parallelism (SL)
- Branch-metric parallelism (BM)
- Forward-backward parallelism (FB): #threads = 2, one thread for the forward recursion and one for the backward recursion (see the sketch below)

[Figure: trellis over states 0-7 across stages k-1, k, k+1; the forward and backward recursions proceed in opposite directions]
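A minimal sketch of forward-backward parallelism; the one-value-per-stage updates below are stand-ins for the real state-metric recursions, and all names are assumptions.

```cuda
// Forward-backward parallelism: exactly two threads per trellis, one running
// the forward recursion from stage 0 and one the backward recursion from the
// last stage.
__global__ void forward_backward(float *alpha, float *beta,   // length num_stages+1
                                 const float *gamma, int num_stages) {
    if (threadIdx.x == 0) {                      // forward recursion
        for (int k = 0; k < num_stages; ++k)
            alpha[k + 1] = alpha[k] + gamma[k];  // stand-in for the real step
    } else if (threadIdx.x == 1) {               // backward recursion
        for (int k = num_stages - 1; k >= 0; --k)
            beta[k] = beta[k + 1] + gamma[k];    // stand-in for the real step
    }
}
// Launched as forward_backward<<<num_trellises, 2>>>(...): #threads = 2 per trellis.
```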

Summary

- Total number of threads: N_thread = N_packet × N_subblock × Thread_trellis

  Scheme           Throughput   Latency     Bit Error Rate
  Packet-level     Better       Worse       No change
  Subblock-level   Better       No change   Worse
  Trellis-level    Better       No change   No change
  Subblock+NII     Worse        No change   Better
  Subblock+TS      Worse        No change   Better

A sketch of how the schemes compose into one launch configuration follows.
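This host-side sketch shows one plausible way the three levels compose; the kernel name and the mapping (one block per packet-subblock pair, Thread_trellis threads per block) are assumptions, not the talk's exact code.

```cuda
#include <cstdio>

int main() {
    int n_packet       = 2;     // packet-level parallelism
    int n_subblock     = 256;   // subblock-level parallelism
    int thread_trellis = 8;     // trellis-level parallelism (e.g., state-level)

    // N_thread = N_packet x N_subblock x Thread_trellis
    dim3 grid(n_subblock, n_packet);   // one block per (subblock, packet) pair
    dim3 block(thread_trellis);        // trellis-level threads within a block
    printf("total threads: %d\n", n_packet * n_subblock * thread_trellis);
    // decode_kernel<<<grid, block>>>(...);   // hypothetical decoder kernel
    return 0;
}
```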


Experiment setup

- Nvidia GeForce GTX470
  - 14 streaming multiprocessors (SMs), 448 streaming processors (SPs)
  - 64 KB L1 cache + shared memory per SM
  - 768 KB L2 cache per GPU
  - 2 GB DRAM
- LTE Turbo decoder
  - Codeword size: 6144
  - Code rate: 1/3
  - Number of iterations: 5

Throughput vs. latency

[Plot: throughput (higher is better) vs. worst-case latency; #subblocks = 1; BER = 10^-5 at SNR = 1 dB]

- More packets → higher throughput, but longer latency
- Trellis-level parallelism improves throughput without affecting latency

Throughput vs. BER

[Plot: throughput (higher is better) vs. SNR requirement (lower is better); one packet. The SNR requirements shown are the lowest values that achieve a 10^-5 BER.]

- More subblocks → higher throughput, but a higher SNR requirement
- Longer TS → lower SNR requirement, but lower throughput
- NII+TS-4 achieves the best tradeoff

Implementation tradeoff

  Trellis-level schemes+   Subblocks   Packets   TH* (Mbps)   WPL* (ms)   SNR* (dB)   BER*
  -                        512         1         4.26         1.44        1.7         1.6x10^-3
  SL                       512         1         20.49        0.55        1.7         1.6x10^-3
  SL                       256         2         21.09        1.07        1.3         4.1x10^-4
  SL, FB                   256         1         19.65        0.56        1.3         4.1x10^-4
  SL, FB                   128         10        29.00        4.58        1.1         2.0x10^-4

  + SL = state-level parallelism, FB = forward-backward parallelism
  * TH = throughput; WPL = worst-case packet latency; SNR = lowest value that achieves a 10^-5 BER; BER = bit error rate at SNR = 1 dB

- Different combinations of parallelization schemes can satisfy different system requirements




Conclusion

- Implemented different parallelization schemes for trellis algorithms on a GPU
- Discussed the implementation tradeoffs among throughput, processing latency, and BER
- Different combinations of parallelization schemes can satisfy different system requirements

Thanks! Any questions?

Backup

Next iteration initialization

Turbo decoder performance on a GPGPU

- GPGPU utilization: $N_{thread} = N_{codeword} \times N_{subblock} \times Thread_{subblock}$
- Throughput: $THR_{Dec} = \frac{N_{codeword} \cdot K}{T_{decoding}}$
- Decoding latency: $t = t_{buf} + t_{decode} = \frac{N_{codeword} \cdot K}{R \cdot THR_{phy}} + T_{decoding}$

(K = codeword size, R = code rate, THR_phy = physical-layer throughput)

Trellis algorithm

- Example: Turbo codes in LTE

[Fig. 1. Trellis structure of the Turbo codes used in LTE: states 0-7 across stages k-1, k, k+1, with branches labeled input=0 and input=1]

Subblock-level parallelism

[Figure: a packet a_0 ... a_{ki-1} divided into k subblocks of length i: (a_0 ... a_{i-1}), (a_i ... a_{2i-1}), ..., (a_{(k-1)i} ... a_{ki-1})]

- Increases the output error rate
- Recovery schemes fix the performance loss, at the cost of additional computation:
  - Training sequence (TS)
  - Next iteration initialization (NII)