Low Power DSP Architectures

Trevor Mudge, Bredt Professor of Engineering, The University of Michigan, Ann Arbor
1st tubs.city Symposium, July 1-3, 2009, Braunschweig

Mark Woh (1), Sangwon Seo (1), Ron Dreslinski, Geoff Blake, Scott Mahlke (1), Chaitali Chakrabarti (2), Krisztian Flautner (3)
(1) University of Michigan, ACAL; (2) Arizona State University; (3) ARM, Ltd.

Overview
- AnySP (SODA++): increase the application domain of the wireless baseband architecture
- Diet-SODA (SODA--): taking the sugar out of SODA

The Old Mobile Phone / The Modern Mobile Phone

Future phones are becoming more complex
- Richer applications bring far more demanding requirements: video recording, video editing, higher data rates, 3D rendering, advanced image processing
- How do phones handle this now?

Inside Today's Smart Phones
- Modern phones are looking like Frankenchips!
- Inside the OMAP3430 application processor
- Inside the X-Gold 608 (representation of QCOM)
- Some cores are unused and functionality is duplicated

Cost for Multi-System Support
- Supporting multiple systems is reserved for the most expensive phones
- The cost lies in supporting all the systems, which may or may not be in use at the same time

Programmable unified architectures provide
- Lower cost
- Faster time to market
- Support for multiple applications (current and future)
- Bug fixes after manufacturing

So where do we start?

Data gathered from
- Ramacher, U. 2007. Software-Defined Radio Prospects for Multistandard Mobile Phones. Computer 40, 10 (Oct. 2007)
- Finchelstein, D.F.; Sze, V.; Sinangil, M.E.; Koken, Y.; Chandrakasan, A.P., "A low-power 0.7-V H.264 720p video decoder," Solid-State Circuits Conference, A-SSCC '08

View of the Unified Architecture World
[Figure labels: 3G, ISP, Video, WiFi, GSM, 2D/3D]

Power/Performance Requirements for Multiple Systems
- Different applications have different power/performance characteristics!
- We need to design with each application in mind! (Not a GPP but a domain-specific processor)

The Applications
- Is there anything we can learn from the applications themselves?

H.264 Basics
[Figure from: T.-A. Liu, T.-M. Lin, S.-Z. Wang, et al. A low-power dual-mode video decoder for mobile applications, IEEE Communications Magazine, volume 44, issue 8, pp. 119-126, Aug. 2006.]

4G Wireless Basics
Three kernels make up the majority of the work:
- FFT: extract data from signals
- STBC: combine data into a more reliable stream
- LDPC: error correction on the data stream

Mobile Signal Processing Algorithm Characteristics

Table columns: SIMD workload (%), scalar workload (%), overhead workload (%), SIMD width (elements), amount of TLP

4G
- FFT: TLP low
- STBC: TLP high
- LDPC: TLP low
H.264
- Deblocking Filter: TLP medium
- Intra Prediction: SIMD 85%, scalar 5%, TLP medium
- Inverse Transform: SIMD 80%, scalar 5%, overhead 15%, SIMD width 8, TLP high
- Motion Compensation: TLP high

Observations
- Algorithms have different SIMD widths, from very large to very small
- Though SIMD width varies, all algorithms can exploit it
- A large percentage of the work can be SIMDized
- Larger SIMD widths tend to have less TLP

SIMD comes at a cost!
- Register file power
- Data movement/alignment cost
- SIMD architectures have to deal with this!

Traditional SIMD Power Breakdown
- The register file consumes a lot of power in a traditional 32-wide SIMD architecture

Register File Access
[Chart: percentage of accesses classed as register access, bypass read, and bypass write for FFT, STBC, LDPC, Deblocking Filter, Intra Prediction, and Inverse Transform]
- Lots of power is wasted on unneeded register file accesses!
- Many of the register file accesses do not have to go back to the main register file

Instruction Pair Frequency
- As with the multiply-accumulate (MAC) instruction, there is opportunity to fuse other instructions
- A few instruction pairs (3-5) make up the majority of all instruction pairs!

Data Alignment Problem! Intra Prediction
- Traditional SIMD machines take too long or cost too much to do this
- Good news: there is a small, fixed number of patterns per kernel
- H.264 intra-prediction has 9 different prediction modes
- Each prediction mode requires a specific permutation

More Data Alignment Problems! Inverse Transform
- An adder tree can accelerate more than matrix operations!
- Many different video kernels can be accelerated too!
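The point that each prediction mode needs one fixed permutation can be illustrated with a small table-driven shuffle. This is a minimal sketch in C, not the AnySP swizzle network: the names `apply_pattern`, `PATTERN_VERTICAL` and `PATTERN_HORIZONTAL` are hypothetical, and only two of the nine 4x4 modes are shown (vertical, which copies the four pixels above the block down each column, and horizontal, which copies the four pixels to the left across each row).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_PIXELS 16  /* one 4x4 block, stored row-major */

/* Hypothetical pattern tables: entry i names the source element that is
   copied to output pixel i. Vertical mode repeats the row of pixels above
   the block; horizontal mode repeats the column of pixels to its left. */
static const uint8_t PATTERN_VERTICAL[BLOCK_PIXELS] = {
    0, 1, 2, 3,  0, 1, 2, 3,  0, 1, 2, 3,  0, 1, 2, 3
};
static const uint8_t PATTERN_HORIZONTAL[BLOCK_PIXELS] = {
    0, 0, 0, 0,  1, 1, 1, 1,  2, 2, 2, 2,  3, 3, 3, 3
};

/* Apply one fixed permutation: dst[i] = src[pattern[i]] for all 16 pixels.
   In hardware this would be a single pass through a shuffle network. */
void apply_pattern(const uint8_t *src, size_t src_len,
                   const uint8_t *pattern, uint8_t *dst)
{
    for (size_t i = 0; i < BLOCK_PIXELS; i++) {
        assert(pattern[i] < src_len);  /* pattern must stay inside src */
        dst[i] = src[pattern[i]];
    }
}
```

Because the set of patterns per kernel is small and known in advance, the tables can be fixed at design or configuration time instead of being computed per block.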

Even More Data Alignment!
- Block-based decoding requires us to access different locations of memory for each task
- We cannot just rely on fetching contiguous sets of data
- Techniques like 2D-Wave and 3D-Wave decoding for H.264 help increase the amount of parallelism, but we have to be able to access different macroblocks for each parallel computation
[Figure from: C.H. Meenderinck, A. Azevedo, B.H.H. Juurlink, M. Alvarez, A. Ramirez, Parallel Scalability of Video Decoders, Journal of Signal Processing Systems, August 2008]

Summary
Conclusions about 4G and H.264:
- Lots of differently sized parallelism, from 4-wide to 96-wide to 1024-wide SIMD, which means many different SIMD widths need to be supported
- Very short-lived values
- Lots of potential for instruction fusion
- Limited set of shuffle patterns required for each kernel
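The 2D-Wave idea can be sketched with a simple scheduling model. The code below assumes the commonly described H.264 dependency pattern, in which a macroblock needs its left, upper-left, upper and upper-right neighbours, so macroblock (x, y) can start in time slot x + 2y. The function names are hypothetical and the model ignores differences in decode time between macroblocks.

```c
#include <assert.h>

/* Earliest time slot for macroblock (x, y), assuming it depends on its
   left, upper-left, upper and upper-right neighbours. Each of those
   neighbours maps to a strictly smaller slot under this formula. */
int wave_slot(int x, int y)
{
    return x + 2 * y;
}

/* Number of macroblocks that share slot t in a frame of mb_w x mb_h
   macroblocks, i.e. the parallelism available in that slot. */
int wave_parallelism(int mb_w, int mb_h, int t)
{
    assert(mb_w > 0 && mb_h > 0);
    int count = 0;
    for (int y = 0; y < mb_h; y++) {
        int x = t - 2 * y;  /* the only column in row y that lands in slot t */
        if (x >= 0 && x < mb_w)
            count++;
    }
    return count;
}

/* Total number of slots needed to cover the whole frame. */
int wave_total_slots(int mb_w, int mb_h)
{
    return wave_slot(mb_w - 1, mb_h - 1) + 1;
}
```

Under this model a 4CIF frame (704x576 pixels, 44x36 macroblocks) never has more than 22 macroblocks in the same slot, and those macroblocks sit in different rows, which is why each parallel computation needs its own memory location.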

AnySP Design

SODA SIMD Architecture
- Wide SIMD with a simple shuffle network

AnySP Architecture: High Level
- Banked memory with SRAM-based crossbar
- 8 groups of 8-wide Flexible Function Units
- Multiple-output adder tree
- 128x128 16-bit swizzle network
- Temporary buffer and bypass network
- Datapath
- AGU and scalar pipeline

Multi-Width Support
- Normal 64-wide SIMD mode: all lanes share one AGU
- AGU offsets: each 8-wide SIMD group works on different memory locations of the same 8-wide code
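The difference between the two addressing modes can be sketched as follows. This is an illustrative model only: the function names, the element-granular addresses and the `group_offset` array are assumptions made for the example, not details of the actual AnySP AGU.

```c
#include <assert.h>
#include <stdint.h>

#define LANES        64
#define GROUPS       8
#define GROUP_WIDTH  (LANES / GROUPS)

/* 64-wide mode: a single shared AGU, so lane i reads element base + i. */
void addresses_wide(uint32_t base, uint32_t addr[LANES])
{
    assert(addr);
    for (int lane = 0; lane < LANES; lane++)
        addr[lane] = base + (uint32_t)lane;
}

/* Grouped mode: every group runs the same 8-wide code, but group g adds
   its own offset, so the groups touch different regions of memory. */
void addresses_grouped(uint32_t base, const uint32_t group_offset[GROUPS],
                       uint32_t addr[LANES])
{
    assert(group_offset && addr);
    for (int lane = 0; lane < LANES; lane++) {
        int group = lane / GROUP_WIDTH;
        int lane_in_group = lane % GROUP_WIDTH;
        addr[lane] = base + group_offset[group] + (uint32_t)lane_in_group;
    }
}
```

With offsets such as {0, 100, 200, ...}, the eight groups could each process a separate 8-wide block (for example a different macroblock) while executing one instruction stream.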

AnySP FFU Datapath
The Flexible Functional Unit allows us to:
1. Exploit pipeline parallelism by joining two lanes together
2. Handle register bypass and the temporary buffer
3. Join multiple pipelines to process deeper subgraphs
4. Fuse instruction pairs

AnySP Results

Simulation Environment
- Traditional SIMD architecture for comparison: SODA at 90nm technology
- AnySP: synthesized at 90nm TSMC; power, timing, and area numbers were extracted
- Kernels were hand-written and optimized
- 4G based on an NTT DoCoMo 4G test setup
- H.264 4CIF@30fps

AnySP Speedup vs SIMD-based Architecture
- For all benchmarks we perform more than 2x better than a SIMD-based architecture

AnySP Energy-Delay vs SIMD-based Architecture
- More importantly, energy efficiency is much better!

AnySP Power Breakdown
- We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm

Conclusion & Future Work

Conclusion
- We have presented an example architecture that could possibly meet the requirements of 100Mbps 4G and HD video on the same platform
- It stays under the power budget while meeting the performance target at 45nm

Future and Ongoing Work
- Application-specific language
- Larger class of algorithms for AnySP
- Better utilization of resources for non-parallel kernels
- Speedup of sequential parts

Diet-SODA

Diet SODA
- SODA, Ardbeg, and AnySP may be too powerful for the application
  - Simple image processing for cameras
  - Audio processing for voice
- Give up the flexibility and generality of Ardbeg and AnySP in exchange for performance with fewer gates
- Build a modular design to which people can add SIMD groups and special function blocks to increase performance, at the cost of area but with room for voltage scaling

Histogram Equalization
- Spreads out an unevenly distributed histogram
- Increases contrast
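As a concrete reference for the kernel, here is a minimal sketch of histogram equalization in C. It assumes an 8-bit greyscale image stored as a flat array. The function name `equalize_histogram`, the 256-level constant, and the scaling of the cumulative histogram by 255 / n_pixels are illustrative assumptions rather than details given in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEVELS 256  /* assumed: 8-bit grey levels */

/* Equalize an 8-bit greyscale image in place. */
void equalize_histogram(uint8_t *image, size_t n_pixels)
{
    assert(image != NULL && n_pixels > 0);

    /* Pass 1: count how many pixels fall on each grey level. The
       data-dependent index is what makes this pass hard to vectorize. */
    uint32_t histogram[LEVELS] = {0};
    for (size_t i = 0; i < n_pixels; i++)
        histogram[image[i]]++;

    /* Running sum of the histogram (cumulative distribution). */
    uint32_t sum_of_h[LEVELS];
    uint32_t running = 0;
    for (int k = 0; k < LEVELS; k++) {
        running += histogram[k];
        sum_of_h[k] = running;
    }

    /* Pass 2: remap every pixel through the scaled cumulative sum so the
       output levels spread over the full 0..255 range. */
    for (size_t i = 0; i < n_pixels; i++) {
        uint64_t scaled = (uint64_t)sum_of_h[image[i]] * (LEVELS - 1);
        image[i] = (uint8_t)(scaled / n_pixels);
    }
}
```

For a four-pixel image {50, 50, 100, 100}, the two levels are pushed apart to {127, 127, 255, 255}, which is the contrast increase the slide describes.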

Histogram Equalization

    /* Pass 1: build the histogram */
    for(i=0; i<length; i++) {
        for(j=0; j<width; j++){
            k = image[i][j];
            histogram[k] = histogram[k] + 1;
        }
    }

    /* Pass 2: remap each pixel through the cumulative histogram */
    for(i=0; i<length; i++){
        for(j=0; j<width; j++){
            k = image[i][j];
            image[i][j] = sum_of_h[k] * constant;
        }
    }

- Hard to parallelize because of histogram[k]
- We can parallelize the load, but we may suffer on the SIMD-to-scalar transfer

Basic Edge Detection
- detect_edges: 8 directional masks (Sobel) + thresholding

Basic Edge Detection

    /* Apply one 3x3 mask around pixel (i, j) */
    for(a=-1; a<2; a++){
        for(b=-1; b<2; b++){
            sum = sum + image[i+a][j+b] * mask_0[a+1][b+1];
        }
    }

    /* Threshold the result */
    for(i=0; i<rows; i++){
        for(j=0; j<cols; j++){
            if(out_image[i][j] > high) {
                out_image[i][j] = new_hi;
            } else {
                out_image[i][j] = new_low;
            }
        }
    }

- Can be easily parallelized across multiple masks or across multiple images

Basic Edge Detection Operations
Operations are applied on 3x3 matrices:
1) Reduction tree (needs a function of the sum of 9 values)
2) Select-max function
3) Threshold operation: if (out_image > threshold) out_image = hi else out_image = low
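The three operations above can be combined into one self-contained routine. The sketch below reuses the slide's names `image`, `out_image`, `new_hi` and `new_low`. The flat row-major image layout, the packing of masks as 9 consecutive integers each, and the choice to mark border pixels as non-edges are assumptions made for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* masks holds n_masks consecutive 3x3 masks, row-major, 9 ints each. */
void detect_edges(const uint8_t *image, uint8_t *out_image,
                  int rows, int cols,
                  const int *masks, int n_masks,
                  int threshold, uint8_t new_hi, uint8_t new_low)
{
    assert(rows >= 3 && cols >= 3 && n_masks > 0);

    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            /* Assumed policy: border pixels lack a full 3x3 neighbourhood
               and are treated as non-edges. */
            if (i == 0 || j == 0 || i == rows - 1 || j == cols - 1) {
                out_image[i * cols + j] = new_low;
                continue;
            }
            int best = INT_MIN;
            for (int m = 0; m < n_masks; m++) {
                int sum = 0;  /* 1) reduction over the 9 products */
                for (int a = -1; a < 2; a++)
                    for (int b = -1; b < 2; b++)
                        sum += image[(i + a) * cols + (j + b)]
                             * masks[m * 9 + (a + 1) * 3 + (b + 1)];
                if (sum > best)  /* 2) select the maximum mask response */
                    best = sum;
            }
            /* 3) threshold */
            out_image[i * cols + j] = (best > threshold) ? new_hi : new_low;
        }
    }
}
```

The inner 9-element reduction is the part a 9-wide (3x3) datapath with an adder tree would take over, and the loop over masks is one of the two axes the slide suggests parallelizing.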

Performance Estimate
[Table: sequential (Mcycle) vs SIMDized (Mcycle) for Histogram Equalization, Edge Detection, Homogeneity Edge Detection, Filter]
- Estimate of how many cycles are needed for 2048x1536 (3.1 megapixel) images
- Unoptimized SIMDized kernels, assuming 16-wide SIMD
- Most work is done on 9-wide (3x3) data elements
- Adding multiple SIMD groups can increase performance in most kernels

Modular Design Consideration
- Build a highly application-specific modular core
- Depending on which applications are supported, the user can add or remove specialized hardware

Design requirement tradeoff
- Smallest area: use the minimum number of SIMD groups, or run the SIMD groups faster
- Faster performance: add more SIMD groups
- Lower power: add more SIMD groups, then lower frequency and VDD for the same throughput
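The lower-power option can be made concrete with the standard first-order model of CMOS dynamic power, P proportional to C * VDD^2 * f. The sketch below is an estimate under stated assumptions, not a result from the slides: it assumes throughput scales linearly with groups times frequency, switched capacitance scales linearly with the number of groups, and leakage is ignored. The type and function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    double freq_scale;   /* new clock frequency / baseline clock frequency */
    double power_scale;  /* new dynamic power / baseline dynamic power */
} ScalingEstimate;

/* Estimate the effect of going from base_groups to new_groups SIMD groups
   at constant throughput, with the supply scaled by vdd_scale
   (new VDD / baseline VDD). */
ScalingEstimate scale_for_same_throughput(int base_groups, int new_groups,
                                          double vdd_scale)
{
    assert(base_groups > 0 && new_groups > 0 && vdd_scale > 0.0);

    ScalingEstimate e;
    /* Same throughput: more groups allow a proportionally slower clock. */
    e.freq_scale = (double)base_groups / (double)new_groups;
    /* Assumed: switched capacitance grows with the number of groups. */
    double capacitance_scale = (double)new_groups / (double)base_groups;
    /* P ~ C * VDD^2 * f */
    e.power_scale = capacitance_scale * vdd_scale * vdd_scale * e.freq_scale;
    return e;
}
```

Under these assumptions, doubling the groups halves the clock, and if the slower clock permits VDD to drop to 0.8 of its baseline, dynamic power falls to roughly 0.64 of the original. Whether a given voltage is reachable at the lower frequency depends on the process and is outside this model.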

The End
