ISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1

22.1 A 125µW, Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications

Tsu-Ming Liu 1, Ting-An Lin 2, Sheng-Zen Wang 2, Wen-Ping Lee 1, Kang-Cheng Hou 1, Jiun-Yan Yang 1, Chen-Yi Lee 1

1 National Chiao Tung University, Hsinchu, Taiwan
2 Mediatek, Hsinchu, Taiwan

A single-chip MPEG-2 SP@ML and H.264/AVC BL@L4 video decoder is fabricated in a 0.18µm 1P6M CMOS technology with an area of 15.21mm². The chip contains 19.2kb and 3.55kb of embedded SRAM for storing neighboring pixels and control tags, respectively, and uses two external memories for further system integration. It operates at a power level about one order of magnitude lower than comparable decoders. This power saving is attained through both throughput and bandwidth improvements, while incorporating scalable features. For mobile applications, MPEG-2 and H.264/AVC video decoding of QCIF sequences at 15 frames per second is achieved at a clock frequency of 1.15MHz and requires 18µW and 125µW, respectively, at a 1V supply voltage. CIF, D1 and HD resolutions are also supported. The chip features are summarized in Fig. 22.1.1.

H.264/AVC provides a high compression ratio, but it is not backward compatible with the prevalent MPEG-x and H.26x video coding standards. MPEG-2 [1] and H.264/AVC [3] processors have been reported at ISSCC. However, these solutions use separate modules, each of which processes only a single type of video content. To support different system requirements such as DVB-H or HD-DVD, a scalable pipeline is exploited to integrate both MPEG-2 and H.264/AVC efficiently in a single chip.

Figure 22.1.2 shows the system block diagram of the proposed dual-video decoder chip. The interface is used to access the external memory. Read and write accesses are issued by the motion compensation unit and the deblocking filter, respectively.
Furthermore, to reduce the bandwidth between the external memory and the deblocking filter, a separate data bus and display engine are used for on-screen display (OSD) through a direct display interface. The syntax parser, entropy decoder, inverse transformation and deblocking filter for the MPEG-2 and H.264/AVC standards are tightly combined, based on the characteristics of pipeline scalability. To achieve low power consumption, a simple prediction circuit is interfaced to the embedded SRAM to make a better trade-off between memory cost and transmission bandwidth. In addition, a high-throughput motion compensation and deblocking filter is developed to reduce the clock frequency and lower the power requirements.

Figure 22.1.3 shows the scalable pipeline for the dual-standard architecture. In H.264/AVC, a 4×4 sub-block is the smallest element to be processed, whereas MPEG-2 adopts an 8×8 block size. To integrate them efficiently, the buffer size is kept at 8×8 and data for both standards are transferred to a 4×4 processing unit, since every block can be treated as a superset of 4×4 sub-blocks. Compared to a macroblock-level pipeline, the power consumption of the pipeline registers is reduced by 75% in MPEG-2 and 93.75% in H.264/AVC through a clock gating scheme. The input data of the IDCT module is partitioned into even and odd components, which are processed as a 4×4 IDCT through a recursive IDCT algorithm.

The in-loop filter is defined by H.264/AVC, and the post-loop filter follows the prevalent MPEG-x standard. However, the performance improvement is very small (only .4dB) when the in-loop filter is applied as a post filter in the prevalent MPEG standard. In Fig. 22.1.3, an H.264-like algorithm for post filtering is used to retain the filtering performance and reduce the integration cost. As a result, a PSNR gain of .2dB is achieved compared to an unfiltered design, with an additional gate count of only 2% of the in-loop filter requirement.
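Two points from the pipeline description above can be illustrated with a short sketch. The Python below is not the chip's logic. It assumes that the quoted 75% and 93.75% register savings follow from the ratio of an 8×8 or 4×4 pipeline unit to a 16×16 macroblock, and it takes the three filter modes and their conditions (bS for H.264/AVC, eq_cnt compared against thresholds T2 and T3 for MPEG-2) from the labels of Fig. 22.1.3. The function names and the threshold values used in the usage note are hypothetical; the text does not state T2 or T3.

```python
from fractions import Fraction

MACROBLOCK_EDGE = 16  # a luma macroblock is 16x16 pixels


def register_saving(block_edge: int) -> Fraction:
    """Share of macroblock-level pipeline register storage that is avoided
    when a pipeline stage holds only one block_edge x block_edge block.
    Assumption: the saving scales with the number of pixels buffered."""
    return 1 - Fraction(block_edge ** 2, MACROBLOCK_EDGE ** 2)


def h264_filter_mode(bs: int) -> str:
    """Mode selection on boundary strength bS, as labelled in Fig. 22.1.3."""
    if not 0 <= bs <= 4:
        raise ValueError("bS must lie in the range 0..4")
    if bs == 0:
        return "skip"
    if bs == 4:
        return "strong"
    return "weak"  # 0 < bS < 4


def mpeg2_filter_mode(eq_cnt: int, t2: int, t3: int) -> str:
    """Mode selection on eq_cnt for the MPEG-2 post-loop filter.
    t2 and t3 are caller-supplied thresholds (hypothetical values), t3 < t2."""
    if t3 >= t2:
        raise ValueError("expected t3 < t2")
    if eq_cnt < t3:
        return "skip"
    if eq_cnt >= t2:
        return "strong"
    return "weak"  # t3 <= eq_cnt < t2


if __name__ == "__main__":
    print("MPEG-2 (8x8) saving:", float(register_saving(8)))
    print("H.264/AVC (4x4) saving:", float(register_saving(4)))
    print("bS = 2 ->", h264_filter_mode(2))
    print("eq_cnt = 5, T2 = 6, T3 = 2 ->", mpeg2_filter_mode(5, t2=6, t3=2))
```

With these definitions, `register_saving(8)` evaluates to 3/4 and `register_saving(4)` to 15/16, which agrees with the 75% and 93.75% figures quoted above.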
Figure 22.1.4 shows the proposed bandwidth scalability of the prediction circuit. H.264/AVC achieves a high compression ratio because it uses neighboring pixels to obtain a reliable predictor, which reduces the prediction errors. However, this high data correlation also creates a design challenge in terms of transmission bandwidth and internal memory cost. In this design, a simple prediction circuit, in which a 19.2kb pixel SRAM caches the pixels of upper neighbors, improves the external bandwidth. The key idea is that not all neighboring pixels need to be stored in the internal memory. In certain sequences, most edges are coded with horizontal prediction in intra prediction or with SKIP mode in the deblocking filter, so there is no need to keep the corresponding pixels for subsequent decoding. The proposed prediction circuit generates a TAG signal to predict whether the pixel data of the next row of a macroblock should be kept; however, a prediction miss may occur. We therefore provide a flexible solution at the architectural level, in which a compromise is made between external bandwidth and memory cost. Compared to the intra-prediction and deblocking filter in [3], the proposed prediction circuit saves 33% of the bandwidth and 4% of the internal memory size on average.

Figure 22.1.5 shows the processing cycle breakdown of different architectural stages. Several high-throughput architectures have been implemented on this chip [4]. A 1×4 decoding order (Fig. 22.1.3), a context switch buffer and efficient access scheduling are exploited to achieve a 39% cycle reduction in MC. A novel prediction and hybrid schedule reduce the processing cycles by a further 35%. As a result, a maximal decoding capability of 11.4Mpixels/s is achieved. Compared to the initial power consumption for the 66.21 and 4.43Mpixels/s rates, the savings are 33% and 59%, respectively, for real-time decoding at QCIF resolution.

Figure 22.1.6 presents a comparison of power consumption with existing designs [1][2].
Under an identical design specification, the proposed techniques lead to a lower system clock rate and supply voltage, and thus to lower power dissipation. For mobile applications, the power consumption of this chip is about one order of magnitude lower than that of existing decoders and could be reduced further through voltage scaling. Figure 22.1.7 shows a chip micrograph of this dual-video decoder design.

Acknowledgements:
The authors thank Jeng-Bin Chen, Ching-Che Chung, Wen-Hsiao Peng, and Wei-Chin Lee for insightful discussions on this work. The authors also thank the Chip Implementation Center (CIC) for testing services.

References:
[1] H. Yamauchi, et al., "A 0.8W HDTV Video Processor with Simultaneous Decoding of Two MPEG-2-MP@HL Streams and Capable of 30frames/s Reverse Playback," ISSCC Dig. Tech. Papers, pp. 372-474, Feb., 2002.
[2] Hae-Yong Kang, et al., "MPEG4 AVC/H.264 Decoder with Scalable Bus Architecture and Dual Memory Controller," IEEE Intl Symp. on Circuits and Systems, pp. II-145 - II-148, May, 2004.
[3] Yu-Wen Huang, et al., "A 1.3TOPS H.264/AVC Single-Chip Encoder for HDTV Applications," ISSCC Dig. Tech. Papers, pp. 128-129, Feb., 2005.
[4] Tsu-Ming Liu, et al., "An 865µW H.264/AVC Video Decoder for Mobile Applications," IEEE Asian Solid-State Circuits Conference, pp. 31-34, Nov., 2005.

2006 IEEE International Solid-State Circuits Conference 1-4244-0079-1/06 ©2006 IEEE
ISSCC 2006 / February 8, 2006 / 8:30 AM

Figure 22.1.1: Chip summary.

  Specification                  Dual MPEG-2 SP@ML and H.264/AVC BL@L4
  Technology                     Standard 0.18µm 1P6M CMOS, 1.8V core, 3.3V I/O
  Die Size                       3.9mm × 3.9mm
  Package                        28-pin CQFP
  Logic Gates                    33.78K
  Memory (Internal)              22.75Kb, 2 SRAMs
  Memory (External)              2 external memories
  Max. System Clock              1MHz
  Max. Processing Throughput     11.4 Mpixels/sec
  Core Power, MPEG-2             18µW (1.15MHz@1V, QCIF@15fps)
                                 1.4mW (16.6MHz@1.2V, D1@30fps)
  Core Power, H.264/AVC          125µW (1.15MHz@1V, QCIF@15fps)
                                 12.4mW (16.6MHz@1.2V, D1@30fps)

Figure 22.1.2: System block diagram. (Blocks shown: host processor, peripherals and system bus; stream input and stream buffer; H.264/MPEG-2 syntax parser; CAVLC/VLC, I-ZZ, I-Q and 4x4/8x8 IDCT on the residual path; compensation unit on the predicted path; sync. adder; in/post-loop de-blocking filter; slice pixel SRAM and MB pixel SRAM; display engine with YUV/CCIR656 display interface.)

Figure 22.1.3: Pipeline scalability with dual-standard architecture. (Pipelined level: 8x8 for MPEG-2 decoding, 4x4 for H.264/AVC decoding, with even/odd partition feeding a 1-D IDCT under a central controller. De-blocking mode decision uses bS for the H.264 in-loop filter on 4x4 edges and eq_cnt for the MPEG-2 post-loop filter on 8x8 edges: skip when bS = 0 or eq_cnt < T3; H.264/MPEG-2 weak edge filtering when 0 < bS < 4 or T3 <= eq_cnt < T2; H.264's strong edge filtering when bS = 4 or eq_cnt >= T2.)

Figure 22.1.4: Bandwidth scalability with prediction circuit. (Plots of external bandwidth (Mbytes/sec) versus memory size (bits) for Stefan@1080HD and Mother & Daughter@CIF at three bit rates, and a table comparing Design 1, Design 2, [3] and the proposed design in bandwidth (bytes/sec) and luma memory size (bits) for the predictor, the deblocking filter and their total at 1080HD@30fps.)

Figure 22.1.5: Throughput improvement of proposed design. (Power consumption (mW) versus throughput (Mpixels/sec), with operating points at 0.57 (QCIF), 4.56 (CIF), 15.5 (D1), 41.5 (720HD) and 94 (1080HD) and the maximal decoding capability marked; power consumption (mW) at QCIF@15fps, 1.8V for designs rated at 4.43, 26.95, 44.14, 66.21 and 67.36 Mpixels/sec, including this work.)

Figure 22.1.6: Power consumption. (Core power consumption versus decoding throughput (Mpixels/sec) for this work in H.264 and MPEG-2 decoding, compared with an H.264 decoder at CIF, an MPEG-2 decoder at 1080HD and an H.264 decoder at 1080HD; inset: power (µW) versus supply voltage (V) from 1.8V down to 1V for H.264/AVC and MPEG-2.)
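The throughput labels on the axis of Figure 22.1.5 can be cross-checked with a little arithmetic. The sketch below assumes 4:2:0 sampling (1.5 samples per luma pixel), conventional frame sizes with 1080HD taken as the macroblock-aligned 1920×1088, and 30 frames per second for every format except QCIF. The frame-size table and the function names are illustrative assumptions, not values stated in the paper.

```python
# Frame width, height and rate per format. These are assumptions chosen
# for illustration; only QCIF@15fps and D1/1080HD at 30fps appear in the text.
FORMATS = {
    "QCIF": (176, 144, 15),
    "CIF": (352, 288, 30),
    "D1": (720, 480, 30),
    "720HD": (1280, 720, 30),
    "1080HD": (1920, 1088, 30),
}

SAMPLES_PER_PIXEL = 1.5  # 4:2:0 sampling: one luma plus half a chroma sample


def throughput_mpixels(name: str) -> float:
    """Decoded samples per second for a format, in Mpixels/sec."""
    width, height, fps = FORMATS[name]
    return width * height * SAMPLES_PER_PIXEL * fps / 1e6


def cycles_per_pixel(clock_hz: float, name: str) -> float:
    """Average clock cycles available per decoded sample at a given clock."""
    return clock_hz / (throughput_mpixels(name) * 1e6)


if __name__ == "__main__":
    for name in FORMATS:
        print(f"{name:7s} {throughput_mpixels(name):6.2f} Mpixels/sec")
    qcif_budget = cycles_per_pixel(1.15e6, "QCIF")
    print(f"QCIF at 1.15MHz: {qcif_budget:.2f} cycles per pixel")
```

Under these assumptions the results come out at about 0.57, 4.56, 15.55, 41.47 and 94.00 Mpixels/sec, close to the figure labels, and the QCIF operating point of 1.15MHz corresponds to roughly two clock cycles per pixel.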
Figure 22.1.7: Chip micrograph.