AnySP: Anytime Anywhere Anyway Signal Processing

1 AnySP: Anytime Anywhere Anyway Signal Processing
Mark Woh (1), Sangwon Seo (1), Scott Mahlke (1), Trevor Mudge (1), Chaitali Chakrabarti (2), Krisztian Flautner (3)
(1) University of Michigan, ACAL; (2) Arizona State University; (3) ARM, Ltd.

2 The Old Mobile Phone vs. the Modern Mobile Phone
Future phones are becoming more complex. Richer applications place much greater demands on the platform: video recording, video editing, higher data rates, 3D rendering, advanced image processing. How do phones handle this now?

3 Inside Today's Smart Phones
Inside the OMAP3430 application processor. Inside the X-Gold 608 (representation of QCOM).

4 Power/Performance Requirements for Multiple Systems
[Figure: performance (Gops) versus power (Watts) for Mobile HD Video, 3G Wireless, 4G Wireless, SODA (65 nm), SODA (90 nm), TI C6X, Imagine, VIRAM, Pentium M, and IBM Cell.]
Different applications have different power/performance characteristics! We need to design keeping each application in mind! (Not a GPP but a domain-specific processor.)

5 The Applications
Is there anything we can learn from the applications themselves?

6 H.264 Basics
[Figure: decoder block diagram. Standard basis patterns and basis coefficients feed the inverse transform to produce residual data; past and future frames form the prediction for the current frame through motion compensation; intra-prediction predicts the current uncoded block from its neighbors; the reconstructed macroblock passes through the deblocking filter to give the decoded block.]

Modules and their algorithms:
Entropy decoder: context-adaptive VLD
Inverse Transform: 4x4 integer kernel
Motion Compensation: half-pel 6-tap filter; quarter-pel 6-tap / bilinear filter; eighth-pel 4-tap filter
Intra Prediction: directional spatial prediction
Deblocking Filter: in-loop adaptive filter

[Figure: power profiling* of the decoder kernels. Surviving slice values: 29%, 10%, 3%, and Misc Kernels 23%.]

* T.-A. Liu, T.-M. Lin, S.-Z. Wang, et al., "A low-power dual-mode video decoder for mobile applications," IEEE Communications Magazine, issue 8, Aug.

7 4G Wireless Basics
[Figure: transceiver block diagram.]
Transmitter blocks: transmitted data, channel encoder (LDPC), modulator, S/P, IFFT, guard insertion, D/A, antenna.
Receiver blocks: antenna, A/D, symbol timing / guard detect, FFT, channel estimator, MIMO decoder (STBC), demodulator, channel decoder (LDPC), recovered data.
Three kernels make up the majority of the work:
FFT: extract data from signals
STBC: combine data into a more reliable stream
LDPC: error correction on the data stream

8 Mobile Signal Processing Algorithm Characteristics
Table columns: algorithm, SIMD workload (%), scalar workload (%), overhead workload (%), SIMD width (elements), amount of TLP.
4G: FFT (TLP low); STBC (TLP high); LDPC (TLP low).
H.264: Deblocking Filter (TLP medium); Intra-Prediction (SIMD 85%, scalar 5%, TLP medium); Inverse Transform (SIMD 80%, scalar 5%, overhead 15%, SIMD width 8, TLP high); Motion Compensation (TLP high).
SIMD comes at a cost! Register file power and data movement/alignment cost: SIMD architectures have to deal with this!
Algorithms have different SIMD widths, from very large to very small. Though SIMD width varies, all algorithms can exploit it. A large percentage of the work can be SIMDized. Larger SIMD widths tend to have less TLP.

9 Pentium III SIMD code for Discrete Cosine Transform (DCT)
Only the instructions labeled "true computation" (shown in red on the slide) are MMX computations. All other instructions are simply supporting these computations.

        lea     ebx, DWORD PTR [ebp+128]        ; load/address overhead
        mov     DWORD PTR [esp+28], ebx         ; load/address overhead
$B1$2:
        xor     eax, eax                        ; address overhead
        mov     edx, ecx                        ; address overhead
        lea     edi, DWORD PTR [ecx+16]         ; load/address overhead
        mov     DWORD PTR [esp+24], ecx         ; load/address overhead
$B1$3:
        movq    mm1, MMWORD PTR [ebp]           ; load overhead
        pxor    mm0, mm0                        ; initialization overhead
        pmaddwd mm1, MMWORD PTR [eax+esi]       ; true computation
        movq    mm2, MMWORD PTR [ebp+8]         ; load overhead
        pmaddwd mm2, MMWORD PTR [eax+esi+8]     ; true computation
        add     eax, 16                         ; address overhead
        paddw   mm1, mm0                        ; true computation
        paddw   mm2, mm1                        ; true computation
        movq    mm0, mm2                        ; load related overhead
        psrlq   mm2, 32                         ; SIMD reduction overhead
        movd    ecx, mm0                        ; SIMD load overhead
        movd    ebx, mm2                        ; SIMD load overhead
        add     ecx, ebx                        ; SIMD conversion overhead
        mov     WORD PTR [edx], cx              ; store overhead
        add     edx, 2                          ; address overhead
        cmp     edi, edx                        ; branch related overhead
        jg      $B1$3                           ; loop branch overhead
$B1$4:
        mov     ecx, DWORD PTR [esp+24]         ; load/address overhead
        add     ebp, 16                         ; address overhead
        add     ecx, 16                         ; address overhead
        mov     eax, DWORD PTR [esp+28]         ; load/address overhead
        cmp     eax, ebp                        ; branch related overhead
        jg      $B1$2                           ; loop branch overhead

Feb 15, 200
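To put a number on the overhead in the inner loop of this listing, the per-instruction labels from the slide can be tallied. This is only a counting sketch: the labels are copied from the slide, and instruction count is used as a rough stand-in for cost, which is an assumption (it ignores latency and issue width).

```python
# Inner loop ($B1$3) of the slide's DCT listing: (mnemonic, label from the slide).
INNER_LOOP = [
    ("movq", "load overhead"),
    ("pxor", "initialization overhead"),
    ("pmaddwd", "true computation"),
    ("movq", "load overhead"),
    ("pmaddwd", "true computation"),
    ("add", "address overhead"),
    ("paddw", "true computation"),
    ("paddw", "true computation"),
    ("movq", "load related overhead"),
    ("psrlq", "SIMD reduction overhead"),
    ("movd", "SIMD load overhead"),
    ("movd", "SIMD load overhead"),
    ("add", "SIMD conversion overhead"),
    ("mov", "store overhead"),
    ("add", "address overhead"),
    ("cmp", "branch related overhead"),
    ("jg", "loop branch overhead"),
]


def true_computation_count(instructions):
    """Count the instructions the slide labels as true computation."""
    return sum(1 for _, label in instructions if label == "true computation")


useful = true_computation_count(INNER_LOOP)
total = len(INNER_LOOP)
print(f"{useful} of {total} inner-loop instructions are true computation "
      f"({100 * useful / total:.1f}%)")
```

By this count, 4 of the 17 inner-loop instructions do the arithmetic; the rest load, align, reduce, store, or branch.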

13 Traditional SIMD Power Breakdown
[Figure: pie chart of SODA power running WCDMA. REG 37%, CONTROL 38%, ALU+MULT 11%, SCALAR PIPE 9%, MEM 3%, INTERCONNECT 2%.]
The register file consumes a lot of power in a traditional 32-wide SIMD architecture.

14 Register File Access
[Figure: per-kernel breakdown (0% to 100%) of register accesses into register access, bypass read, and bypass write, for FFT, STBC, LDPC, Deblocking Filter, Intra-Prediction, and Inverse Transform.]
Lots of power is wasted on unneeded register file accesses! Many of the register file accesses do not have to go back to the main register file.
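One way to see why so many accesses can avoid the main register file is to classify the reads and writes of a small instruction trace. The sketch below is a simplified model, not the methodology behind the slide's chart: the trace format, the `window` parameter, and the rule that a write can be skipped only when the value is dead and all its readers were within the bypass window are all assumptions made for illustration.

```python
def classify_accesses(trace, window=1):
    """Classify register accesses in a trace of (dest, sources) tuples.

    A read counts as a bypass read when its producer is at most `window`
    instructions earlier. A write counts as a bypass write when the value
    is later overwritten and every one of its reads was a bypass read.
    """
    last_def = {}      # register name -> index of the instruction that wrote it
    read_dists = {}    # defining index -> distances to each reader
    dead = set()       # defining indices whose value was later overwritten
    bypass_reads = rf_reads = 0

    for i, (dest, sources) in enumerate(trace):
        for reg in sources:
            d = last_def.get(reg)
            if d is not None and i - d <= window:
                bypass_reads += 1
            else:
                rf_reads += 1
            if d is not None:
                read_dists[d].append(i - d)
        if dest in last_def:
            dead.add(last_def[dest])
        last_def[dest] = i
        read_dists[i] = []

    bypass_writes = sum(
        1 for d in dead if read_dists[d] and max(read_dists[d]) <= window
    )
    return {
        "bypass_reads": bypass_reads,
        "rf_reads": rf_reads,
        "bypass_writes": bypass_writes,
        "rf_writes": len(trace) - bypass_writes,
    }


# Hypothetical four-instruction trace with short-lived temporaries t0 and t1.
trace = [
    ("t0", ["a", "b"]),   # t0 = a * b
    ("t1", ["t0", "c"]),  # t1 = t0 + c   (t0 arrives by bypass)
    ("t0", ["t1", "d"]),  # t0 = t1 >> d  (t1 arrives by bypass, old t0 is dead)
    ("t1", ["t0", "a"]),  # t1 = t0 - a   (t0 arrives by bypass, old t1 is dead)
]
stats = classify_accesses(trace)
print(stats)
```

In this toy trace, three of the eight reads and two of the four writes never need the main register file, which is the kind of short-lived value the slide is pointing at.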

15 Instruction Pair Frequency

a) Intra-prediction and Deblocking Filter Combined
   Rank  Instruction Pair        Frequency
   1     multiply-add            26.71%
   2     add-add                 13.7%
   3     shuffle-add             8.5%
   4     shift right-add         6.90%
   5     subtract-add            6.9%
   6     add-shift right         5.76%
   7     multiply-subtract       4.00%
   8     shift right-subtract    3.75%
   9     add-subtract            3.07%
   10    Others                  20.5%

b) LDPC
   Rank  Instruction Pair        Frequency
   1     shuffle-move            32.07%
   2     abs-subtract            8.5%
   3     move-subtract           8.5%
   4     shuffle-subtract        3.5%
   5     add-shuffle             3.5%
   6     Others                  3.77%

c) FFT
   Rank  Instruction Pair        Frequency
   1     shuffle-shuffle         16.67%
   2     add-multiply            16.67%
   3     multiply-subtract       16.67%
   4     multiply-add            16.67%
   5     subtract-multiply       16.67%
   6     shuffle-add             16.67%

As with the multiply-accumulate (MAC) instruction, there is opportunity to fuse other instructions. A few instruction pairs (3-5) make up the majority of all instruction pairs!
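Pair frequencies like these can be gathered by walking an instruction trace and counting producer/consumer opcode pairs. The sketch below assumes that a "pair" means an instruction together with the instruction that produced one of its operands; the slide does not state its exact definition, and the trace and its register names are hypothetical.

```python
from collections import Counter


def pair_frequencies(trace):
    """Return {(producer_op, consumer_op): fraction} for a trace.

    trace is a list of (opcode, dest, sources). Operands with no producer
    in the trace (for example loop inputs) are ignored.
    """
    producer = {}      # register name -> opcode that last wrote it
    pairs = Counter()
    for opcode, dest, sources in trace:
        for reg in sources:
            if reg in producer:
                pairs[(producer[reg], opcode)] += 1
        producer[dest] = opcode
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()} if total else {}


# Hypothetical trace: two multiply-add chains feeding a final shift.
trace = [
    ("multiply", "t0", ["a", "b"]),
    ("add", "t1", ["t0", "c"]),
    ("multiply", "t2", ["d", "e"]),
    ("add", "t3", ["t2", "t1"]),
    ("shift_right", "t4", ["t3", "k"]),
]
freqs = pair_frequencies(trace)
for (first, second), share in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{first}-{second}: {100 * share:.2f}%")
```

Here multiply-add accounts for half of the pairs, with add-add and add-shift_right sharing the rest.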

16 Data Alignment Problem! Intra-Prediction
[Figure: 16x16 luma MB, prediction modes for each 4x4 block. A 4x4 block of pixels a to p is predicted from its neighboring pixels A to L and X; each mode arranges the neighbors differently, for example rows of A B C D repeated, or shifted diagonals such as A B C D / B C D E / C D E F / D E F G.]
Traditional SIMD machines take too long or cost too much to do this.
Good news: there is a small, fixed number of patterns per kernel.
H.264 intra-prediction has 9 different prediction modes. Each prediction mode requires a specific permutation.
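Because each prediction mode is a fixed permutation, it can be expressed as a constant index vector applied by a gather. The sketch below shows this for the two simplest 4x4 modes, vertical and horizontal; the `swizzle` helper and the pattern names are illustrative and are not AnySP's swizzle network interface.

```python
def swizzle(values, pattern):
    """Gather: output element i is values[pattern[i]]."""
    return [values[index] for index in pattern]


# Neighbors in the slide's lettering: A-D sit above the block, I-L to its left.
top = ["A", "B", "C", "D"]
left = ["I", "J", "K", "L"]

# Vertical mode: every row of the 4x4 block copies the four pixels above it.
VERTICAL = [0, 1, 2, 3] * 4
# Horizontal mode: every row repeats the single pixel to its left.
HORIZONTAL = [row for row in range(4) for _ in range(4)]

vertical_block = swizzle(top, VERTICAL)
horizontal_block = swizzle(left, HORIZONTAL)

for name, block in (("vertical", vertical_block), ("horizontal", horizontal_block)):
    print(name)
    for row in range(4):
        print("  " + " ".join(block[4 * row: 4 * row + 4]))
```

Each of the remaining modes would add one more constant pattern, which is why a small table of patterns per kernel is enough.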

17 Summary
Conclusions about 4G and H.264:
Lots of different-sized parallelism, from 4-wide to 96-wide to 1024-wide SIMD, which means many different SIMD widths need to be supported.
Very short-lived values.
Lots of potential for instruction fusing.
A limited set of shuffle patterns is required for each kernel.

18 AnySP Design

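19 Traditional SIMD Architectures
[Figure: 32-wide SIMD with simple shuffle network (SSN). A SIMD pipeline (SIMD data memory, SIMD register file, 32-lane SIMD ALU in the EX stage, 32-lane SSN, write-back) sits beside a scalar pipeline (scalar data memory, scalar register file, 16-bit ALU in the EX stage, write-back). STV and VTS (SIMD to scalar) units connect the two, and an AGU generates addresses.]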

20 AnySP Architecture: High Level
[Figure: one AnySP PE, with L1 program memory, controller, and DMA, connected to the inter-PE bus.]
Multi-Bank Local Memory: 16-banked memory (Bank 0 to Bank 15) with an SRAM-based crossbar.
Multi-SIMD Datapath: 8 groups of 8-wide SIMD (64 total lanes). Each group has a 16-bit 8-wide 16-entry SIMD RF (RF-0 to RF-7), an 8-wide SIMD flexible function unit (FFU-0 to FFU-7), and a 16-bit 8-wide 4-entry buffer (Buffer-0 to Buffer-7).
Swizzle Network: 128x128 16-bit swizzle network.
Multiple output adder tree.
Temporary buffer and bypass network.
Datapath AGU and Scalar Pipeline: scalar memory buffer, scalar pipeline, and the AGU Group 0 to AGU Group 7 pipelines.

21 Multi-Width Support
[Figure: multi-bank local memory (Bank 0 to Bank 15) connected through the crossbar to the multi-SIMD datapath. AGU Group 0 to AGU Group 7 each drive one 8-wide group made of a 16-bit 8-wide 16-entry SIMD RF, an 8-wide SIMD FFU, and a 16-bit 8-wide buffer, followed by the swizzle network and multi-output adder tree.]
Normal 64-wide SIMD mode: all lanes share one AGU.
AGU offsets: each 8-wide SIMD group works on different memory locations of the same 8-wide code.
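The two addressing modes can be modeled as a function from a base address to the element address each lane touches. This is a toy model under stated assumptions: element-granular addresses, contiguous lanes within a group, and a hypothetical `group_offsets` argument standing in for the per-group AGU offsets; the real AGU and bank mapping are not described on the slide.

```python
GROUPS = 8
LANES_PER_GROUP = 8


def lane_addresses(base, group_offsets=None):
    """Element addresses accessed by the 64 lanes.

    group_offsets=None models the normal 64-wide mode: one AGU, so the
    lanes read 64 consecutive elements starting at base.
    Otherwise each of the 8 groups reads 8 consecutive elements starting
    at base + its own offset.
    """
    if group_offsets is None:
        return [base + lane for lane in range(GROUPS * LANES_PER_GROUP)]
    if len(group_offsets) != GROUPS:
        raise ValueError("expected one offset per SIMD group")
    return [
        base + offset + lane
        for offset in group_offsets
        for lane in range(LANES_PER_GROUP)
    ]


wide = lane_addresses(base=0)
# Hypothetical case: eight independent 8-wide rows spaced 32 elements apart.
narrow = lane_addresses(base=100, group_offsets=[32 * g for g in range(GROUPS)])
print(wide[:8], wide[-1])
print(narrow[:8], narrow[8:16])
```

With offsets, the same 8-wide code runs in all groups while each group walks its own region of memory.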

22 AnySP FFU Datapath
The Flexible Functional Unit allows us to:
1. Exploit pipeline parallelism by joining two lanes together
2. Handle register bypass and the temporary buffer
3. Join multiple pipelines to process deeper subgraphs
4. Fuse instruction pairs
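Fusing instruction pairs can be pictured as a peephole pass over an instruction list. The sketch below is a software illustration of the idea, not the FFU hardware: it assumes a complete trace with uniquely named temporaries, and it only fuses a producer with the instruction directly after it when that instruction is the sole reader of the intermediate value. The `fusable` set is a hypothetical choice of supported pairs.

```python
from collections import Counter


def fuse_pairs(trace, fusable):
    """Fuse adjacent producer/consumer instructions.

    trace is a list of (opcode, dest, sources); fusable is a set of
    (producer_opcode, consumer_opcode) pairs that may be merged.
    """
    use_count = Counter(reg for _, _, sources in trace for reg in sources)
    fused = []
    i = 0
    while i < len(trace):
        opcode, dest, sources = trace[i]
        if i + 1 < len(trace):
            next_op, next_dest, next_sources = trace[i + 1]
            single_use = dest in next_sources and use_count[dest] == 1
            if (opcode, next_op) in fusable and single_use:
                # The intermediate value never reaches the register file.
                merged = list(sources) + [r for r in next_sources if r != dest]
                fused.append((f"{opcode}-{next_op}", next_dest, merged))
                i += 2
                continue
        fused.append((opcode, dest, list(sources)))
        i += 1
    return fused


trace = [
    ("multiply", "t0", ["a", "b"]),
    ("add", "t1", ["t0", "c"]),
    ("shuffle", "t2", ["t1"]),
    ("add", "t3", ["t2", "t1"]),
]
fused = fuse_pairs(trace, fusable={("multiply", "add"), ("shuffle", "add")})
for instruction in fused:
    print(instruction)
```

The four instructions collapse into a multiply-add and a shuffle-add, and the temporaries t0 and t2 disappear from the instruction stream.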

23 AnySP Results

24 Simulation Environment
Traditional SIMD architecture comparison: SODA at 90nm technology.
AnySP: synthesized at 90nm TSMC; power, timing, and area numbers were extracted.
Performance and power for each kernel were generated using synthesized data on an in-house simulator.
4G: based on an NTT DoCoMo 4G test setup.
H.264: CIF@30fps.

25 AnySP Speedup vs SIMD-based Architecture
[Figure: normalized speedup per benchmark, stacked by contribution: baseline, 64-wide multi-SIMD, swizzle network, flexible functional unit, buffer + bypass. Benchmarks: FFT 1024 pt radix-2, FFT 1024 pt radix-4, STBC, LDPC, H.264 intra prediction, H.264 deblocking filter, H.264 inverse transform, H.264 motion compensation.]
For all benchmarks we perform more than 2x better than a SIMD-based architecture.

26 AnySP Energy-Delay vs SIMD-based Architecture
[Figure: normalized energy-delay of the SIMD-based architecture and AnySP for FFT 1024 pt radix-2, FFT 1024 pt radix-4, STBC, LDPC, H.264 intra prediction, H.264 deblocking filter, H.264 inverse transform, and H.264 motion compensation.]
More importantly, energy efficiency is much better!
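For reference, the energy-delay product is energy multiplied by execution time, so with average power P and time t it is P * t * t. The sketch below shows how a normalized value like those in the chart would be computed; the power and time figures are made-up placeholders, not measurements from the slides.

```python
def energy_delay(power_watts, time_seconds):
    """Energy-delay product: (power * time) * time."""
    energy_joules = power_watts * time_seconds
    return energy_joules * time_seconds


def normalized_energy_delay(candidate, baseline):
    """Candidate energy-delay divided by the baseline's (lower is better)."""
    return energy_delay(*candidate) / energy_delay(*baseline)


# Placeholder numbers only: (average power in W, kernel time in s).
baseline = (1.0, 2.0)    # hypothetical SIMD-based design
candidate = (1.2, 1.0)   # hypothetical design: more power, half the time
ratio = normalized_energy_delay(candidate, baseline)
print(f"normalized energy-delay: {ratio:.2f}")
```

Because delay enters twice, a design that halves the run time comes out well ahead here even though it draws 20% more power.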

27 AnySP Power Breakdown
[Table: per-component area (mm2 and %) and 4G + H.264 decoder power (mW and %) at 90 nm, with total estimates at 65 nm and 45 nm.
PE components: SIMD Data Mem (32 KB); SIMD Register File (16 x 1024 bit); SIMD ALUs, Multipliers, and SSN; SIMD Pipeline + Clock + Routing; SIMD Buffer (128 B); SIMD Adder Tree; Intra-processor Interconnect; Scalar/AGU Pipeline & Misc.
System components: ARM (Cortex-M3); Global Scratchpad Memory (128 KB); Inter-processor Bus with DMA.]
We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm.

28 Conclusion & Future Work
Conclusion: We have presented an example architecture that could possibly meet the requirements of 100Mbps 4G and HD video on the same platform, staying under the power budget and meeting the performance target at 45nm.
Future and Ongoing Work:
Application-specific language
Larger class of algorithms for AnySP
Better utilization of resources for non-parallel kernels
Speed up sequential parts
University of Michigan - ACAL
