Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors

1 Parallelization Techniques for Implementing Trellis Algorithms on Graphics Processors
Qi Zheng*, Yajing Chen*, Ronald Dreslinski*, Chaitali Chakrabarti+, Achilleas Anastasopoulos*, Scott Mahlke*, Trevor Mudge*
*, Ann Arbor
+ Arizona State University, Tempe
ISCAS 13, May 21

3 Trellis algorithm
- Trellis is widely used in coding theory
  - Progression of symbols within a code
  - Representation of the state transitions of a finite state machine
- Trellis algorithm
  - The processing described by the value propagation in a trellis
[Figure: 8-state trellis (state 0 to state 7) across stages k-1, k and k+1, with branches for input=0 and input=1]
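The "value propagation" above can be sketched as a forward recursion over a small trellis. This is a minimal illustration only: the 8-state shift-register transition rule, the unit branch costs and the names `next_state`, `forward_step` and `run_trellis` are assumptions made for the example, not the LTE Turbo code trellis or the authors' implementation.

```python
# Toy 8-state, 2-input trellis: a 3-bit shift register (hypothetical, not the LTE code).
NUM_STATES = 8
INF = float("inf")


def next_state(state, bit):
    """Shift the input bit into a 3-bit register to get the next state."""
    return ((state << 1) | bit) & (NUM_STATES - 1)


def forward_step(metrics, received_bit):
    """Propagate path metrics from stage k-1 to stage k (add-compare-select)."""
    new_metrics = [INF] * NUM_STATES
    for state in range(NUM_STATES):
        for bit in (0, 1):
            target = next_state(state, bit)
            # Assumed branch cost: 1 if the branch input differs from the received bit.
            candidate = metrics[state] + (0.0 if bit == received_bit else 1.0)
            if candidate < new_metrics[target]:
                new_metrics[target] = candidate
    return new_metrics


def run_trellis(received_bits):
    """Min-sum forward recursion starting from state 0."""
    metrics = [0.0] + [INF] * (NUM_STATES - 1)
    for received_bit in received_bits:
        metrics = forward_step(metrics, received_bit)
    return metrics


final_metrics = run_trellis([1, 0, 1])
```

With this toy rule, the state reached by the received sequence itself ends with metric 0, and every other state carries the number of input bits that would have had to differ.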

4 Trellis algorithm
- Broad scope of uses
  - Viterbi/BCJR/Baum-Welch
  - Communication system/data compression/speech recognition
- Play important roles

5 GPU: Graphics processing unit
- High throughput
  - GFLOPS/TFLOPS-level peak throughput
- High efficiency
  [Table: GFLOP/dollar and GFLOP/watt by processor (Nvidia GeForce GTX, Intel Xeon E, Intel Itanium); values missing]
- Programming support
  - OpenCL
  - CUDA

6 GPU Performance Challenge
- Sources of GPU underutilization
  - Thread inadequacy
  - Pipeline stall
- Thread inadequacy
  - 1000 cores on a commercial GPU
- Pipeline stall
  - Long memory access latency
    - L2 cache/external memory
  - Using multithreading to hide pipeline stall

7 Contribution of the paper
- Previous work
  - Mapped the Turbo decoder on a GPU
  - Studied the throughput and BER of the implementation
- Our work
  - Generalize the parallelization schemes to the implementation of trellis algorithms on a GPU
  - Explore additional schemes not in previous works: forward-backward and branch-metric parallelism
  - Study the implementation tradeoffs between throughput, processing latency and BER
  - Show different combinations of parallelization schemes for different system requirements

8 Outline
- Motivation and background
  - Trellis algorithm
  - GPU
- Parallelization schemes
  - Packet-level
  - Subblock-level
  - Trellis-level
- Implementation tradeoffs
- Conclusion

9 Packet-level Parallelism
- Process multiple packets
  - #Threads = #packets
- Long processing latency
  - Especially for the 1st packet
[Figure: packets are collected in a buffer, then passed to the trellis algorithm]
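A rough sketch of why packet-level parallelism lengthens latency for the first packet: the batch is decoded only after all packets are buffered. Everything named here (`decode_packet`, `decode_batch`, `first_packet_latency`, the fixed arrival interval) is a hypothetical stand-in, and Python threads only mirror the one-thread-per-packet structure, not GPU performance.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_packet(packet):
    """Stand-in for a trellis decoder: hard-decide the sign of each soft value."""
    return [1 if value < 0 else 0 for value in packet]


def decode_batch(packets):
    """One worker per buffered packet (#threads = #packets)."""
    with ThreadPoolExecutor(max_workers=len(packets)) as pool:
        return list(pool.map(decode_packet, packets))


def first_packet_latency(num_packets, arrival_interval, decode_time):
    """Assumed model: the first packet waits for the remaining arrivals, then for the batch decode."""
    return (num_packets - 1) * arrival_interval + decode_time


decoded = decode_batch([[0.5, -1.0], [-2.0, 3.0]])
latency = first_packet_latency(num_packets=4, arrival_interval=1.0, decode_time=2.0)
```

Under this simple model, buffering more packets raises the number of threads but adds one arrival interval of waiting per extra packet to the first packet's latency.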

10 Subblock-level Parallelism
[Figure: 8-state trellis (input=0, input=1) divided into subblocks]
- #Threads = #subblocks
- Increases the output error rate
- Recovery scheme to fix performance loss
  - Training sequence (TS)
  - Next iteration initialization (NII)
  - Need additional computations
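The subblock split with a training sequence can be sketched as index arithmetic. This assumes that a training sequence of length `ts_len` means each subblock starts its recursion `ts_len` stages early to warm up its boundary metrics; the function name `subblock_ranges` and that interpretation are assumptions for illustration.

```python
def subblock_ranges(num_stages, num_subblocks, ts_len):
    """Return (warmup_start, start, end) stage indices for each subblock.

    Stages in [warmup_start, start) are the assumed training sequence:
    they are recomputed only to initialize the subblock's boundary metrics.
    """
    if num_stages % num_subblocks != 0:
        raise ValueError("num_stages must be divisible by num_subblocks")
    size = num_stages // num_subblocks
    ranges = []
    for index in range(num_subblocks):
        start = index * size
        end = start + size
        warmup_start = max(0, start - ts_len)  # the first subblock has nothing before it
        ranges.append((warmup_start, start, end))
    return ranges


# Codeword size 6144 as in the experiment setup; 4 subblocks and ts_len=4 are example values.
ranges = subblock_ranges(num_stages=6144, num_subblocks=4, ts_len=4)
extra_stages = sum(start - warmup_start for warmup_start, start, _ in ranges)
```

The `extra_stages` count is the additional computation that the training sequence costs: it grows with both the number of subblocks and the training length.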

11 Trellis-level Parallelism
- State-level parallelism
  - #Threads = #states of a stage

12 Trellis-level Parallelism
- State-level parallelism
- Branch-metric parallelization
  - #Threads = #branches
[Figure: threads 0 to 15, one per branch of a trellis stage]

13 Trellis-level Parallelism
- State-level parallelism (SL)
- Branch-metric parallelization (BM)
- Forward-backward parallelization (FB)
  - #Threads = 2
[Figure: forward recursion and backward recursion over the 8-state trellis (input=0, input=1)]
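A minimal sketch of two of the trellis-level schemes on a toy trellis: each state's update depends only on the previous stage (state-level), and the forward and backward recursions are independent of each other (forward-backward). The 3-bit shift-register trellis, the unit branch costs and all function names are assumptions; Python threads show the structure only, not a GPU speedup.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_STATES = 8  # toy 3-bit shift-register trellis (hypothetical, not the LTE code)
INF = float("inf")


def cost(bit, received_bit):
    """Assumed branch cost: 1 if the branch input differs from the received bit."""
    return 0.0 if bit == received_bit else 1.0


def state_update(metrics, target, received_bit):
    """Work of one state-level thread: new metric of one state at stage k."""
    bit = target & 1  # the input bit that leads into this state
    predecessors = (target >> 1, (target >> 1) | 4)
    return min(metrics[s] + cost(bit, received_bit) for s in predecessors)


def forward(bits):
    alpha = [0.0] + [INF] * (NUM_STATES - 1)  # start in state 0
    for r in bits:
        # Each entry is independent of the others: one thread per state on a GPU.
        alpha = [state_update(alpha, t, r) for t in range(NUM_STATES)]
    return alpha


def backward(bits):
    beta = [0.0] * NUM_STATES  # unknown end state
    for r in reversed(bits):
        beta = [min(beta[((s << 1) | b) & 7] + cost(b, r) for b in (0, 1))
                for s in range(NUM_STATES)]
    return beta


def forward_backward(bits):
    """Run both recursions concurrently (#threads = 2)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        alpha_future = pool.submit(forward, bits)
        beta_future = pool.submit(backward, bits)
        return alpha_future.result(), beta_future.result()


alpha, beta = forward_backward([1, 0, 1])
```

Neither scheme changes the computed metrics, which is consistent with trellis-level parallelism leaving the bit error rate unchanged.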

14 Summary
- Total number of threads: N_thread = N_packet × N_subblock × Thread_trellis

Scheme           Throughput   Latency     Bit Error Rate
Packet-level     Better       Worse       No Change
Subblock-level   Better       No Change   Worse
Trellis-level    Better       No Change   No Change
Subblock+NII     Worse        No Change   Better
Subblock+TS      Worse        No Change   Better
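The thread-count product can be worked through with a small helper. How the trellis-level schemes combine (branch-metric replacing the per-state count with a per-branch count, forward-backward doubling it) is an assumption made for this sketch, as are the function names and the example values.

```python
def threads_per_trellis(num_states, branch_metric=False, forward_backward=False,
                        branches_per_state=2):
    """Threads used inside one trellis (assumed combination rule)."""
    count = num_states * branches_per_state if branch_metric else num_states
    return count * 2 if forward_backward else count


def total_threads(num_packets, num_subblocks, per_trellis):
    """N_thread = N_packet x N_subblock x Thread_trellis."""
    return num_packets * num_subblocks * per_trellis


state_level_only = threads_per_trellis(8)
with_branch_metric = threads_per_trellis(8, branch_metric=True)
example_total = total_threads(num_packets=4, num_subblocks=32,
                              per_trellis=threads_per_trellis(8, forward_backward=True))
```

For an 8-state trellis with two branches per state, the branch-metric case gives 16 threads, matching the thread 0 to thread 15 labels on the earlier slide.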

16 Experiment setup
- Nvidia GeForce GTX470
  - 14 Streaming Multiprocessors (SM)
  - 448 Streaming Processors (SP)
  - 64KB L1 cache + shared memory per SM
  - 768KB L2 cache per GPU
  - 2GB DRAM
- LTE Turbo decoder
  - Codeword size: 6144
  - Code rate: 1/3
  - Iteration num:

17 Throughput vs. Latency
[Figure: throughput (higher is better) vs. latency; #subblock = 1; BER 10^-5 with SNR = 1 dB]
- More packets → higher throughput and longer latency.
- Trellis-level parallelism improves throughput without affecting the latency.

18 Throughput vs. BER
[Figure: throughput (higher is better) and SNR requirement (lower is better); one packet]
The SNR requirements presented are the lowest values that achieve 10^-5 BER.
- More subblocks → higher throughput, but higher SNR requirement.
- Longer TS → lower SNR requirement, but lower throughput.
- NII+TS-4 achieves the best tradeoff.

19 Implementation tradeoff
[Table: numeric entries missing. Columns: Trellis-level Parallelization Schemes, Subblock Num, Packet Num, TH* (Mbps), WPL* (ms), SNR* (dB), BER*. Surviving scheme entries: SL; SL; SL,FB; SL,FB. Surviving BER exponents: x10^-3 and x10^-4.]
SL = State-level parallelism, FB = Forward-backward parallelism
*TH = Throughput, WPL = Worst-case Packet Latency
*The SNR requirements presented are the lowest values that achieve 10^-5 BER.
*BER is the bit error rate when SNR = 1 dB.
- Different combinations of parallelization schemes can satisfy different system requirements.

23 Conclusion
- Implement different parallelization schemes of trellis algorithms on a GPU
- Discuss the implementation tradeoffs between throughput, processing latency and BER
- Different combinations of parallelization schemes can satisfy different system requirements

24 Thanks! Any questions?

25 Backup

26 Next iteration initialization

27 Turbo decoder performance on GPGPU
- GPGPU utilization: N_thread = N_codeword × N_subblock × Thread_subblock
- Throughput: THR_Dec, expressed in terms of N_codeword and T_decoding
- Decoding latency: t = t_buf + t_decode, where t_buf is expressed in terms of N_codeword, R, K and THR_phy, and t_decode in terms of K and T_decoding

28 Trellis algorithm
- Example: Turbo codes in LTE
[Figure: 8-state trellis (state 0 to state 7) across stages k-1, k and k+1, with branches for input=0 and input=1]
Fig. 1. Trellis structure of Turbo codes used in LTE

29 Subblock-level Parallelism
[Figure: sequence a_0 ... a_{ki-1} split into k subblocks of i elements each: a_0 ... a_{i-1}, a_i ... a_{2i-1}, ..., a_{(k-1)i} ... a_{ki-1}]
- Increases the output error rate
- Recovery scheme to fix performance loss
  - Training sequence (TS)
  - Next iteration initialization (NII)
  - Additional computation
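Next iteration initialization can be sketched as follows: within one iteration all subblocks run independently, and each subblock's starting metrics are taken from the end of its left neighbor in the previous iteration. The toy trellis, fixed branch costs and the names `forward` and `nii_forward` are assumptions for illustration. In a real Turbo decoder the branch metrics change between iterations, so the recovered boundary values are approximations rather than exact.

```python
NUM_STATES = 8  # toy 3-bit shift-register trellis (hypothetical, not the LTE code)
INF = float("inf")


def forward(metrics, bits):
    """Min-sum forward recursion over one subblock."""
    for r in bits:
        new_metrics = [INF] * NUM_STATES
        for s in range(NUM_STATES):
            for bit in (0, 1):
                t = ((s << 1) | bit) & (NUM_STATES - 1)
                new_metrics[t] = min(new_metrics[t],
                                     metrics[s] + (0.0 if bit == r else 1.0))
        metrics = new_metrics
    return metrics


def nii_forward(bits, num_subblocks, iterations):
    """Forward metrics at the end of the block, using NII between iterations."""
    size = len(bits) // num_subblocks
    blocks = [bits[i * size:(i + 1) * size] for i in range(num_subblocks)]
    known_start = [0.0] + [INF] * (NUM_STATES - 1)
    # First iteration: interior subblocks start from uninformative (all-zero) metrics.
    init = [known_start] + [[0.0] * NUM_STATES for _ in range(num_subblocks - 1)]
    finals = []
    for _ in range(iterations):
        # These calls do not depend on each other: one thread per subblock on a GPU.
        finals = [forward(init[i], blocks[i]) for i in range(num_subblocks)]
        # NII: subblock i starts the next iteration where subblock i-1 ended.
        init = [known_start] + finals[:-1]
    return finals[-1]


bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
serial = forward([0.0] + [INF] * (NUM_STATES - 1), bits)
with_nii = nii_forward(bits, num_subblocks=4, iterations=4)
```

With fixed branch costs, as assumed here, the boundary metrics become exact after as many iterations as there are subblocks, so `with_nii` matches the serial recursion.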
