46 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 45, NO. 1, JANUARY 2010


A 212 MPixels/s 4096×2160p Multiview Video Encoder Chip for 3D/Quad Full HDTV Applications

Li-Fu Ding, Wei-Yin Chen, Pei-Kuei Tsung, Tzu-Der Chuang, Pai-Heng Hsiao, Yu-Han Chen, Hsu-Kuang Chiu, Shao-Yi Chien, Member, IEEE, and Liang-Gee Chen, Fellow, IEEE

Abstract: Multiview video coding (MVC) plays an important role in a 3-D video system. In addition, the resolution of HDTV is increasing to present more vivid perception for users. To realize real-time processing of dozens of TOPS, a VLSI solution is necessary. However, ultra-high computational complexity, a large amount of external memory bandwidth and on-chip SRAM, and complex MVC prediction structures are the three main design challenges in implementing an MVC hardware architecture. In this paper, an MVC single-chip encoder is proposed for H.264/AVC Multiview High Profile and High Profile for 3-D and quad full high definition (QFHD) TV applications, respectively. The 4096×2160p multiview video encoder chip is implemented with a 3.95 mm × 2.90 mm core in 90 nm CMOS technology. An eight-stage macroblock pipelined architecture with the proposed system scheduling and cache-based prediction core supports real-time processing from one-view 4096×2160p to seven-view 720p videos. The 212 Mpixels/s throughput is 3.4 to 7.7 times higher than that of previous work. A power efficiency of 407 Mpixels/W is achieved, and 94% of on-chip SRAM size and 79% of external memory bandwidth are saved by the proposed techniques.

Index Terms: H.264/AVC, MVC, QFHD, video encoder, VLSI.

I. INTRODUCTION

To provide more vivid perception, TV resolution keeps increasing. In addition, 3-D video is emerging because it can present immersive and complete scenes. With the technologies of 3DTV [1], [2] and free viewpoint TV (FTV) [3]-[5] maturing, multiview video coding (MVC) is drawing more and more attention. Therefore, MVC is currently being developed as an extension profile of H.264/AVC [6].
The block diagram of H.264/AVC Multiview High Profile [7] is illustrated in Fig. 1. H.264/AVC High Profile is adopted as the base layer. The most significant difference from the original H.264/AVC standard is inter-view prediction, which is also called disparity estimation (DE). In this profile, bidirectional prediction in the temporal and inter-view domains is exploited by motion estimation (ME) and DE, respectively. DE can effectively exploit the inter-view redundancy and saves 20% to 30% of the bit rate [8]. The output bitstream of each view is assembled and then transmitted. The bitstream format is compatible with H.264/AVC, so a single-view H.264/AVC decoder can decode the base layer.

However, DE and ME require ultra-high computation and memory access. To encode a 3-view 1080p video, 82.4 TOPS of computing power and 54.6 TB/s of memory access are required with a full search algorithm [9]. Moreover, view scalability is important for dealing with the various prediction structures of 3-D video.

Several H.264/AVC encoder chips have been proposed [10]-[15]. The previous work has progressed from 720p/Baseline Profile [10] to the recent full 1080p/High Profile [15]. Fig. 2 shows the conventional three- or four-stage macroblock (MB) pipelined architecture. The encoding task is split into integer ME, fractional ME, intra prediction, and entropy coding/deblocking.

Manuscript received May 01, 2009; revised July 20, 2009 and August 29. Current version published December 23. This paper was approved by Guest Editor Kazutami Arimoto. This work was supported in part by the National Science Council, Taiwan, R.O.C., under Grant NSC E CC3. Chip fabrication was supported by the University Shuttle Program of Taiwan Semiconductor Manufacturing Company (TSMC). The authors are with the Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan (e-mail: lifu@video.ee.ntu.edu.tw). Digital Object Identifier /JSSC
For MVC and quad full high definition (QFHD) video encoding, there are only 350 cycles in an MB pipeline stage at the highest required specification (4096×2160p/24 fps/1 view at 280 MHz) [9], so the conventional three- or four-stage MB pipelining, which needs 600 to 1000 cycles per pipeline stage, is not feasible. In addition, if the conventional architectures are directly scaled up to support our target specification, a huge amount of on-chip SRAM area and external memory bandwidth is required. In summary, three challenges have to be overcome to design an efficient MVC encoder chip.

1) Encoding HD multiview video requires high processing capability.
2) Conventional MB pipelining and scheduling cannot deal with the various MVC prediction structures.
3) With 3-D and QFHD TV specifications, conventional ME architectures [10] require 2.9 Mb of on-chip SRAM and 13.8 GB/s of external memory bandwidth, which is far beyond the 6.4 GB/s supported by DDR2-800 at 100% utilization.

In this paper, a 4096×2160p multiview video encoder chip for 3-D and QFHD TV applications is proposed. The proposed MVC encoder chip is characterized as follows:

1) View-parallel MB-interleaved (VPMBI) scheduling with eight-stage MB pipelining is introduced to overcome the first two design challenges. With this technique, the processing capability is 212 Mpixels/s, at least 3.4 times better than the previous work [10]-[12]. In addition, view scalability is achieved, supporting real-time processing from single-view 4096×2160p to seven-view 720p videos.
2) The cache-based prediction core with a search window (SW) prefetching scheme, together with a predictor-centered ME/DE algorithm, reduces on-chip memory size by 94% and external memory bandwidth by 79%.

These techniques realize the design of an H.264/AVC Multiview High Profile encoder. This paper is organized as follows. The top-level system architecture and scheduling are introduced in


Section II. Section III presents the architecture design of important modules. The measured chip features and architectural comparison are shown in Section IV. Finally, Section V concludes this work.

Fig. 1. Block diagram and data flow of an MVC encoder system.

Fig. 2. Conventional three- or four-stage MB pipelined architecture.

II. SYSTEM ARCHITECTURE

A. Eight-Stage MB Pipelined System Architecture

Fig. 3 shows the system architecture. The encoder contains seven kinds of computation cores: integer ME/DE (IMDE), fractional ME/DE (FMDE), intra prediction (IP), motion and disparity compensation (MDC), reconstruction (REC), entropy coding (EC), and deblocking filter (DB). Eight-stage

MB pipelining is proposed instead of simply raising the degree of parallelism. In the proposed system, the inter-prediction part is split into five MB pipeline stages, and the rest is split into three MB pipeline stages. The cache-based prediction core is adopted as the inter-prediction part. The two prefetch stages for IMDE and FMDE load the required SWs into on-chip SRAM prior to the IMDE and FMDE stages. They not only relieve the pipeline-cycle budget but also enhance the hardware utilization of the IMDE and FMDE cores. The purpose of the NOP stage is introduced later. In the sixth MB pipeline stage, IP and MDC are performed in parallel, followed by the reconstruction of the MB in the next pipeline stage. EC and DB are processed simultaneously in the eighth MB pipeline stage. To provide a sufficient symbol encoding rate for highly textured images, the EC cores are doubled.

Fig. 3. Block diagram of the proposed MVC encoder system.

B. View-Parallel MB-Interleaved Scheduling

Directly increasing the number of MB pipeline stages causes data dependency conflicts and makes resource sharing between computation cores difficult. Two critical issues are shown in Fig. 4. First, before the SW prefetch stages, the initial guess of the motion vectors (MVs) and disparity vectors (DVs) should be derived in advance. If conventional MB pipelining is applied, IMDE for the current MB and IMDE prefetch for the following MB are performed simultaneously. A data dependency conflict occurs because the following MB requires the MV predictors provided by the current MB. Second, another data hazard occurs between the IP and REC pipeline stages. In the H.264/AVC standard, if an MB is intra-coded, it is predicted from the reconstructed boundary pixels around each sub-block. A data dependency conflict occurs when IP and REC are split into two pipeline stages. Therefore, VPMBI scheduling is proposed to overcome the above issues. Fig. 5 shows the operation and features of the VPMBI scheduling.
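The two hazards described above, and the way interleaving MBs from two views removes them, can be illustrated with a small occupancy model. This is a minimal sketch under stated assumptions, not the chip's scheduler: the order of the five inter-prediction stages in `STAGES` is assumed for illustration (the text fixes only that five inter-prediction stages precede IP/MDC, REC, and EC/DB), and all function names are hypothetical.

```python
# Assumed stage order, for illustration only.
STAGES = ["IMDE_PREFETCH", "IMDE", "FMDE_PREFETCH", "NOP", "FMDE",
          "IP_MDC", "REC", "EC_DB"]


def conventional_stream(num_mbs, view="A"):
    """MBs of a single view enter the pipeline back to back: A0, A1, A2, ..."""
    return [(view, n) for n in range(num_mbs)]


def interleaved_stream(num_mbs, views=("A", "B")):
    """MBs of two views alternate: A0, B0, A1, B1, ..."""
    return [(v, n) for n in range(num_mbs) for v in views]


def occupant(stream, stage, slot):
    """MB held by `stage` during time slot `slot`, or None if the stage is idle."""
    idx = slot - STAGES.index(stage)
    return stream[idx] if 0 <= idx < len(stream) else None


def has_hazard(stream, consumer, producer):
    """True if `consumer` ever holds MB n of a view while `producer`
    is still working on MB n-1 of the same view."""
    for slot in range(len(stream) + len(STAGES)):
        c = occupant(stream, consumer, slot)
        p = occupant(stream, producer, slot)
        if c and p and c[0] == p[0] and c[1] == p[1] + 1:
            return True
    return False


if __name__ == "__main__":
    for name, stream in [("conventional", conventional_stream(4)),
                         ("interleaved", interleaved_stream(4))]:
        print(name,
              "prefetch/IMDE hazard:", has_hazard(stream, "IMDE_PREFETCH", "IMDE"),
              "IP/REC hazard:", has_hazard(stream, "IP_MDC", "REC"))
```

In this model, adjacent stages of the interleaved stream always hold MBs from different views, so an MB never meets its same-view predecessor in the neighbouring stage.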
A stereo-view video prediction structure is taken as an example. In this case, two views are processed in parallel, and MBs are processed in an interleaved manner. Each capsule unit represents the cycle budget for an MB pipeline. The VPMBI scheduling is characterized as follows:

1) Cache-based prediction with SW prefetching, composed of five pipeline stages, is proposed. SW speculation and prefetching are used to lower the cache miss rate. The NOP stage is inserted to prevent IMDE and FMDE from competing for the same cache SRAM read/write port.
2) Hybrid open-closed loop IP and pixel-forwarding REC are decomposed into two pipeline stages without any data hazard. Reconstructed pixels on neighboring MB boundaries are forwarded to IP and adopted as intra predictors, while intra predictors inside the current MB use original pixels instead of reconstructed pixels. DCT-based rate-distortion optimization (RDO) is also adopted to avoid quality degradation.
3) To achieve the symbol encoding rate of 4096×2160p videos, the EC cores are doubled to perform frame-parallel pipeline-doubled dual (FPPDD) context-based adaptive binary arithmetic coding (CABAC). Each EC core encodes


two symbols per cycle. The cycle budget of this pipeline stage is doubled, and two EC cores operate in a ping-pong manner to connect with the REC stage.

Fig. 4. Conflict of data dependency occurs in the conventional MB pipelining.

Fig. 5. Features of the VPMBI scheduling.

With the VPMBI scheduling, the proposed system can process nine MBs simultaneously, so the throughput is enhanced to support 4096×2160p videos. If single-view video is processed, the VPMBI scheduling can cooperate with a bidirectional prediction structure, such as the IBBBP structure. Two B-pictures in a GOP can be processed simultaneously without data dependency. On the other hand, for coding structures composed of more than two views, the master CPU in the SoC controls the processing order of views in the same time slot. According to the data dependency of the coding structure, two views are processed at a time. The detailed architectures of these main modules are introduced in the next section.

III. MODULE ARCHITECTURE DESIGN

A. Cache-Based Prediction With Search Window Prefetching

To overcome the design challenges of large SRAM silicon area and high external memory bandwidth, a cache-based SW buffer is proposed for temporal and inter-view prediction.

1) Cache Controller Architecture: The cache architecture for reference frames replaces the traditional SW buffer. For better locality, the internal addressing in the cache keeps the intrinsic 2-D nature of frames. The address-resolving flow and bank assignment are shown in Fig. 6. The three-tuple vector (x, y, frame-index) is translated to the tag address and the tag. A tag-set is located by the tag address, and the tag is compared against that set. Upon a cache hit, the word address locates the word in a five-banked on-chip SRAM. To avoid bank conflict between


the concurrently accessed words, the ladder bank assignment is adopted. The tag modulus has a torus topology because the address is the modulus of a continuous frame field. The bank ID should comply with this topology. Moreover, the tag modulus is composed of a 2-D array of cache lines, so the size of the tag modulus should be a multiple of the cache line size. Combining these three constraints, the final bank ID assignment can be determined. The designer is free to choose these parameters as long as all the constraints are satisfied. In this work, we adopt five banks and choose a fixed number of words as the size of the tag modulus. Fig. 7 shows the architecture of the cache controller. To meet the throughput of the prediction core, the proposed architecture supports a sustained rate of matching four cache lines, reading five words, and refilling four words per cycle without a cache line split penalty.

Fig. 6. (a) Address-resolving and translation between different addressing schemes. (b) Bank ID assignment in cache data SRAM.

Fig. 7. Architecture of the cache controller.

Fig. 8. Timing diagram of cache read/lock/prefetch operations.

2) Search Window Prefetching: IMDE and FMDE cause considerable cache misses because of their irregular search patterns, which lowers the hardware utilization. Therefore, concurrent SW prefetching and reading is proposed. When data prefetching can be processed in the same cycle as reading, prefetching can start as early as the MB pipeline stage starts. Nevertheless, if data prefetching runs concurrently with normal data reading, the prefetched data must be stored somewhere within the cache, which may evict an existing cache line. When the evicted cache line is required by a future read request, cache pollution happens. Therefore, a replacement policy that guarantees no cache pollution helps reduce the cache miss rate of reading.
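The pollution problem, and the effect of protecting the lines that upcoming reads depend on, can be illustrated with a toy model. This is a minimal sketch and not the chip's cache controller: it assumes a fully associative cache that evicts the oldest unprotected line, whereas the paper's cache uses tag sets and a five-banked data SRAM. The class `LockingCache` and its methods are hypothetical names.

```python
class LockingCache:
    """Toy fully associative cache; each line carries a lock (priority) bit."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = {}  # address -> locked flag, kept in insertion order

    def _fill(self, addr, locked):
        """Bring `addr` in, evicting the oldest unlocked line if the cache is full."""
        if addr in self.lines:
            self.lines[addr] = self.lines[addr] or locked
            return True
        if len(self.lines) >= self.num_lines:
            victim = next((a for a, is_locked in self.lines.items()
                           if not is_locked), None)
            if victim is None:
                return False  # every line is protected, so the request is dropped
            del self.lines[victim]
        self.lines[addr] = locked
        return True

    def lock(self, addrs):
        """Load and protect the data needed by reads in the current stage."""
        return all([self._fill(a, locked=True) for a in addrs])

    def prefetch(self, addrs):
        """Load data for a later stage; may only evict unprotected lines."""
        for a in addrs:
            self._fill(a, locked=False)

    def read(self, addr):
        """Return True on a hit, False on a miss."""
        return addr in self.lines

    def end_of_stage(self):
        """Clear all priority bits when the MB pipeline stage finishes."""
        for a in self.lines:
            self.lines[a] = False


if __name__ == "__main__":
    protected = LockingCache(4)
    protected.lock([0, 1])                 # data for the current stage's reads
    protected.prefetch([10, 11, 12, 13])   # speculative data for a later stage
    print("with locking:", protected.read(0), protected.read(1))

    unprotected = LockingCache(4)
    unprotected.prefetch([0, 1])           # same data, but never protected
    unprotected.prefetch([10, 11, 12, 13])
    print("without locking:", unprotected.read(0), unprotected.read(1))
```

In this model the protected lines always survive the prefetch burst, while without protection the same burst evicts them and the later reads miss.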
To eliminate the cache misses caused by cache pollution from data prefetching, a priority-based replacement policy is proposed. A locking mechanism is introduced, as shown in Fig. 8. Before data prefetching, all the data needed by the subsequent reads within this MB pipeline stage are locked first. Data locking is done in the same way as data prefetching. The only difference is that data locking has higher priority, which protects the data from being overwritten. After the locking is done, all the cache lines touched by locking are labeled with a priority bit. When an eviction happens due to data prefetching, only a cache line with lower priority can be chosen. As a result, the cache lines needed by future read requests are protected from eviction, and cache pollution cannot occur. Fig. 9 shows the integration with the VPMBI scheduling. With IMDE and FMDE prefetching, the cache miss penalty is reduced by 93%, as shown in Fig. 10. Therefore, the length of each MB pipeline stage is greatly reduced. Fig. 11 shows the evaluation of cache profiling. The cache profiling is conducted on numerous video sequences in 1080p and 2160p resolution. The refill bandwidth is expressed in units of frame size per reference frame. It represents the equivalent number of frames loaded from the external memory when doing ME/DE


on a reference frame. The required bandwidth (MB/s) is proportional to the frame resolution. With a 128-bit bus, the proposed architecture avoids a considerable cache miss rate when processing various types of video content.

Fig. 9. Timing diagram of cache operations in cooperation with the VPMBI scheduling.

Fig. 10. (a) Cycle reduction of cache miss penalty and (b) reduction of pipeline cycles after applying the proposed prefetching scheme.

Fig. 11. Cache refill bandwidth of (a) 1080p and (b) 2160p video sequences.

3) IMDE Algorithm and Architecture: In the IMDE stage, the predictor-centered fast ME/DE algorithm is used [16], as shown in Fig. 12. First, several predictors are classified as intra-frame or inter-frame predictors, including the MVs of the left, top-left, top, and top-right MBs. They come from highly correlated sources of MVs such as neighboring and best-matching MBs. These MV predictors are set as the refining centers and evaluated by their sum of absolute differences (SAD) cost. Then a search range is applied around the best predictor. Fig. 13 illustrates the computation reduction of the proposed predictor-centered algorithm. The computational complexity of the proposed algorithm is three orders of magnitude lower than that of full search and 95% lower than that of hierarchical search. Fig. 14 illustrates the IMDE architecture. Refining range speculation and data prefetching and locking are proposed to support the predictor-centered fast ME/DE algorithm and effectively reduce the on-chip memory requirement. In addition, to support variable-block-size (VBS) ME/DE, a reconfigurable processing element (PE) array is applied to provide various computing throughputs. Fig. 15 shows the datapath of the reconfigurable 256-PE array. Sixteen reconfigurable 256-PE arrays compute sixteen search candidates per cycle.
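The predictor-centered search described above (evaluate a few MV predictors by SAD, then refine around the best one) can be sketched in a few lines. This is an illustrative sketch, not the algorithm of [16]: the block size, the refinement range `refine`, the frame layout as nested lists, and the function names are all assumptions made for the example.

```python
def sad(cur, ref, bx, by, mvx, mvy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (mvx, mvy).
    Returns None if the displaced block leaves the reference frame."""
    h, w = len(ref), len(ref[0])
    x0, y0 = bx + mvx, by + mvy
    if x0 < 0 or y0 < 0 or x0 + n > w or y0 + n > h:
        return None
    return sum(abs(cur[by + j][bx + i] - ref[y0 + j][x0 + i])
               for j in range(n) for i in range(n))


def predictor_centered_search(cur, ref, bx, by, predictors, n=4, refine=1):
    """Pick the predictor with the lowest SAD, then search +/-`refine`
    pixels around it and return (best_mv, best_cost)."""
    def cost(mv):
        c = sad(cur, ref, bx, by, mv[0], mv[1], n)
        return float("inf") if c is None else c

    center = min(predictors, key=cost)
    candidates = [(center[0] + dx, center[1] + dy)
                  for dy in range(-refine, refine + 1)
                  for dx in range(-refine, refine + 1)]
    best = min(candidates, key=cost)
    return best, cost(best)


if __name__ == "__main__":
    size = 12
    # Synthetic reference frame; the current frame is the reference shifted by (3, 2).
    ref = [[x * x + 3 * y * y + x * y for x in range(size)] for y in range(size)]
    cur = [[ref[y + 2][x + 3] if y + 2 < size and x + 3 < size else 0
            for x in range(size)] for y in range(size)]
    # Hypothetical predictors, e.g. MVs of the left, top, and top-right MBs.
    print(predictor_centered_search(cur, ref, 4, 4, [(0, 0), (2, 2), (-1, 0)]))
    # → ((3, 2), 0)
```

Only three predictors plus nine refinement candidates are evaluated here, which is the source of the saving relative to an exhaustive search over the whole window.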
The overlapped pixels between these 4×4 search candidates are therefore fully reused, and the on-chip memory access is minimized.

B. Hybrid Open-Closed Loop Intra Prediction and Pixel-Forwarding Reconstruction

To improve the processing parallelism limited by the data dependency described in Section II-B, the proposed hybrid open-closed loop IP uses original pixels instead of reconstructed pixels as boundary pixels for intra predictors, as shown in Fig. 16(a). This is because original pixels are close to reconstructed pixels when encoding 720p and 1080p video with smaller quantization parameters [17]. With the VPMBI scheduling, the reconstructed MB boundary pixels of the previous


MB can be forwarded and adopted as intra predictors before the IP operation of the current MB starts. Take Intra_4×4 mode as an example: when a sub-block in Fig. 16(b) is being processed, it uses four original pixels (P, Q, R, and S) as its left boundary pixels and nine reconstructed pixels from the upper row as its upper boundary pixels. The proposed hybrid open-closed loop scheme causes only very slight quality degradation compared with closed-loop IP [17]. With the proposed scheme, the IP operation of each sub-block can be executed in parallel without waiting for the REC loop of neighboring blocks.

Fig. 12. Predictor-centered algorithm and illustrations of intra-frame and inter-frame predictors, respectively.

Fig. 13. Computation reduction of the proposed predictor-centered algorithm.

Fig. 14. Architecture of IMDE computation core.

The architectures of hybrid open-closed loop IP and pixel-forwarding REC are shown in Fig. 17. To be consistent with the throughput of the 8×8 DCT in Intra_8×8 prediction, the parallelism of our architecture is set to eight pixels. Since the Intra_8×8 prediction mode is similar to Intra_4×4 prediction, a reconfigurable intra luma predictor generator is proposed to generate eight predictors for Intra_8×8 mode, or eight predictors for two 4×4 sub-blocks for Intra_4×4 mode. Fig. 18 illustrates the architecture of the proposed intra luma predictor. In addition, the multi-transform can be configured as two 4×4 Hadamard/DCT/IDCT transforms or one 8×8 DCT/IDCT transform for cost estimation and reconstruction. By using these reconfigurable eight-pixel parallel PEs, the proposed hardware architecture unifies the throughput and improves the processing capability with excellent area efficiency. Unlike the previous prediction-reconstruction interleaved scheme [18]-[20], the schedule of the proposed architecture is divided into two MB pipeline stages, as shown in Fig. 19.
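The choice between original and reconstructed boundary pixels in the hybrid scheme can be stated as a small rule. The sketch below is a simplification under stated assumptions: it indexes the sixteen 4×4 luma sub-blocks of an MB by column and row, assumes all neighboring MBs are available, and ignores the upper-left and upper-right neighbors. The function name `boundary_sources` is hypothetical.

```python
def boundary_sources(sub_x, sub_y):
    """For the 4x4 luma sub-block at column `sub_x`, row `sub_y` (0..3) of a
    16x16 MB, report where its left and upper boundary pixels come from.

    Boundaries shared with a neighboring MB use forwarded reconstructed pixels
    (closed loop); boundaries inside the current MB use original pixels
    (open loop), so sub-blocks need not wait for each other's reconstruction.
    """
    if not (0 <= sub_x <= 3 and 0 <= sub_y <= 3):
        raise ValueError("sub-block index must be in 0..3")
    return {
        "left": "reconstructed" if sub_x == 0 else "original",
        "upper": "reconstructed" if sub_y == 0 else "original",
    }


if __name__ == "__main__":
    # A top-row sub-block that is not in the leftmost column, as in the
    # Intra_4x4 example in the text: original left pixels, reconstructed upper pixels.
    print(boundary_sources(1, 0))
    # A sub-block in the MB interior depends only on original pixels.
    print(boundary_sources(2, 3))
```

Because interior sub-blocks depend only on original pixels in this rule, their predictions can be computed in any order.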
In the IP stage, only the best mode for each sub-block and the total MB mode cost are stored. In the REC stage, only one mode is selected for reconstruction. If Intra_4×4 mode is selected, a luma 4×4 block and a chroma 4×4 block are reconstructed in parallel for higher hardware utilization, as shown in Fig. 19. This is because, in the H.264 decoding process, each 4×4 luma sub-block must be reconstructed in zig-zag scan order. Once Intra_8×8 mode or inter mode is chosen, it will process only


one 8×8 luma sub-block at a time. The chroma reconstruction is executed after the four 8×8 luma blocks are done. This architecture lets the eight-pixel-parallel PEs achieve almost 100% hardware utilization and saves operating cycles by avoiding useless reconstruction. It takes less than 272 cycles to process one MB. Fig. 20 shows the comparison of throughput and silicon gate count with previous IP and REC architectures [10], [12]. In [10], IP and REC are placed in one pipeline stage, and they process MBs in an interleaved manner. In [12], the REC engine is placed between the IP stage and the EC stage. The VPMBI scheduling is organized so that the reconstruction of intra predictors and the generation of MB residues can proceed without conflict or stall cycles. The proposed architecture can process 896 K MBs per second at 280 MHz, the highest operating frequency. The normalized throughput is also 1.8 and 2.7 times better than the previous work. This is achieved because separating IP and REC into two MB pipeline stages raises the system processing throughput. In addition, using DCT-based rate-distortion optimization (RDO) [17] rather than SAD-based RDO maintains the video quality. The logic gate count of the proposed architecture is similar to that of [12] because of the reconfigurable architecture of the intra predictor and REC cores.

Fig. 15. Adder tree datapath of sixteen reconfigurable 256-PE arrays.

Fig. 16. Illustration of hybrid open-closed loop IP. (a) Hybrid open-closed loop prediction of boundary pixels. (b) Proposed corresponding processing scheduling of the hybrid open-closed loop IP.

C. Frame-Parallel Pipeline-Doubled Dual (FPPDD) CABAC

EC is used to compress data based on their probability distribution.
To achieve the symbol rate for 4096×2160p videos (about 1000 Msymbols/s), the binary arithmetic coder in CABAC is cascaded into a two-symbol architecture modified from our previous work [21]. Fig. 21 shows the architecture of the two-symbol arithmetic coder. In the cascaded architecture, we cannot directly cascade two one-symbol State Stages because the contexts of the two symbols may be the same. For example, if context1 (ctx1) and context2 (ctx2) are the same, the state and the most probable symbol (MPS) of ctx2 should be replaced by the updated ones of ctx1. In addition, only the updated values of ctx2 should be written back to the Ctx State registers. Applying the two-symbol CABAC architecture can double the throughput. However, for some textured MBs, the two-symbol cascaded CABAC architecture still does not meet the throughput requirement. Therefore, frame-parallel pipeline-doubled dual (FPPDD) CABAC is proposed. Fig. 22 shows the MB pipeline scheduling of FPPDD CABAC. Dual CABAC computation cores are adopted, and each CABAC core has a doubled pipeline cycle budget of 700 cycles. The dual CABAC computation cores operate in an interleaved manner to be compatible with the VPMBI scheduling, so the MB scheduling is performed smoothly without being stalled by the pipeline-doubled CABAC stage. The throughput enhancement of the proposed architecture is shown in Fig. 23. The throughput (in mega symbols per second) of the FPPDD CABAC architecture is 3.88 and 2 times better than that of the direct implementation and the two-symbol cascaded architecture, respectively.
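The same-context case described above can be made concrete with a short model of the state stage. This is a minimal sketch: `update_state` is a toy stand-in and deliberately not the H.264/AVC state-transition table, and the register layout (a dictionary of `(state, mps)` pairs) and function names are assumptions for illustration. The point is only the forwarding and write-back behavior.

```python
def update_state(state, mps, bin_val):
    """Toy context update (NOT the H.264/AVC transition table): move up on an
    MPS, move down on an LPS, and flip the MPS when an LPS arrives at state 0."""
    if bin_val == mps:
        return min(state + 1, 62), mps
    if state == 0:
        return 0, 1 - mps
    return state - 1, mps


def one_symbol_stage(ctx_regs, ctx, bin_val):
    """Reference behavior: one symbol per cycle, read-modify-write."""
    ctx_regs[ctx] = update_state(*ctx_regs[ctx], bin_val)


def two_symbol_stage(ctx_regs, ctx1, bin1, ctx2, bin2):
    """Two symbols in one cycle. If both symbols use the same context, the
    second symbol must see the state already updated by the first (forwarding),
    and only the final value is written back."""
    s1 = update_state(*ctx_regs[ctx1], bin1)
    in2 = s1 if ctx2 == ctx1 else ctx_regs[ctx2]   # data forwarding
    s2 = update_state(*in2, bin2)
    if ctx1 != ctx2:
        ctx_regs[ctx1] = s1
    ctx_regs[ctx2] = s2


if __name__ == "__main__":
    sequential = {0: (5, 0), 1: (0, 1)}
    cascaded = dict(sequential)

    one_symbol_stage(sequential, 0, 1)
    one_symbol_stage(sequential, 0, 1)
    two_symbol_stage(cascaded, 0, 1, 0, 1)

    print(sequential == cascaded)  # both paths end with the same register contents
```

Without the forwarding line, the second symbol would read the stale state of the shared context and the two paths would diverge.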


Fig. 17. Architecture of IP and REC computation cores.

Fig. 18. Architecture of intra luma predictor.

Fig. 19. Schedule of IP and REC computation cores.

Fig. 20. Comparison of throughput and silicon area with the previous works.

IV. EXPERIMENTAL RESULTS

A. Experimental Condition

Table I shows the experimental condition and the coding tools adopted in the proposed architecture. The proposed predictor-centered IMDE algorithm is compared with the anchor, JMVM4.4. Although the quantization parameter is fixed during simulation and comparison, it can be configured for each MB in the proposed architecture to maintain flexibility. Supported coding tools for I-pictures include the Intra8×8 and Intra16×16 modes. For P- and B-pictures, variable block sizes from 16×16 to 4×4 are supported. In addition, up to four reference frames are supported. The


number of reference frames is chosen according to the coding structure and the number of views. In B-picture prediction, the current MB can choose the best-matching block in reference frame list0, list1, or the average of the two.

TABLE I. EXPERIMENTAL CONDITION

TABLE II. CHIP SPECIFICATIONS

TABLE III. SUMMARY OF PROPOSED TECHNIQUES AND THROUGHPUT ENHANCEMENT

Fig. 21. Architecture of the two-symbol arithmetic coder. The data dependency between the two symbols is solved by data forwarding.

Fig. 22. MB pipeline scheduling of FPPDD CABAC.

Fig. 23. Throughput comparison with direct implementation and single CABAC architectures.

Fig. 24. Chip micrograph of the MVC encoder.

The resolution of the test sequences ranges between 720p and 2160p, which are the target specifications of the proposed architecture. Because few 2160p test sequences are available, they are generated from upsampled or assembled 1080p frames. Moreover, the ME/DE


search range is set separately in the horizontal and vertical directions.

TABLE IV. COMPARISON WITH THE STATE-OF-THE-ART ENCODER CHIPS (table footnote: compared with full search block matching algorithm)

Fig. 25. Comparison of (a) power efficiency, (b) off-chip memory bandwidth, and (c) on-chip memory size with the state-of-the-art encoder chips.

B. Chip Implementation

A prototype chip for the proposed MVC encoder is fabricated by TSMC in a 90 nm 1P9M process. The detailed chip features and specifications are listed in Table II, and the chip micrograph and the distribution of the main modules are shown in Fig. 24. The core of the chip measures 3.95 mm × 2.90 mm and contains 1732 K gates. This chip supports both H.264/AVC Multiview High Profile and High Profile at level 5.1. In addition, view scalability, which depends on the frame resolution, is supported for one to seven views. The power consumption, which depends on the operating frequency, varies from 58 mW to 522 mW. This chip supports a maximum throughput of 212 Mpixels/s and 830 K MBs/s at 280 MHz for 4096×2160p videos. Table III summarizes the proposed techniques and the corresponding throughput enhancement.

C. Chip Comparison

Table IV summarizes the performance comparison between our work and the prior art [10]-[12]. With the VPMBI scheduling and the eight-stage MB pipelining, our work provides 3.4 to 7.7 times the throughput of the previous work and supports the maximum frame resolution. The search range of ME/DE is 4 to 64 times larger than that of the previous work to ensure the encoding quality. Nevertheless, only 20.1 KB of on-chip SRAM is used, at the cost of a small quality degradation of 0.1 dB. The comparison of power efficiency is shown in Fig. 25(a). The power efficiency is defined as megapixels per watt. Note that the technology is scaled from 0.18 µm and 0.13 µm processes to a 90 nm process.
The MVC encoder chip provides a power efficiency that is 10% to 153% better than that of the previous work. Fig. 25(b) and (c) show the evaluation of external memory bandwidth and on-chip SRAM size among these works. The external memory bandwidth and on-chip SRAM requirements of the full search and hierarchical search algorithms are also illustrated. In all three HD resolutions, the MVC chip requires the least external memory bandwidth and on-chip SRAM size, which makes the proposed predictor-centered ME/DE algorithm the most suitable for hardware implementation. Compared with [12], the proposed cache-based prediction core with the SW prefetching scheme reduces external memory bandwidth by 39%. Moreover, 83% to 94% of on-chip SRAM size is saved compared with the previous work scaled up to 4096×2160p resolution.
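The headline figures quoted in this section can be cross-checked with simple arithmetic from the stated target specification (4096×2160p, 24 fps, one view, 280 MHz). The sketch below is only a consistency check; it assumes 16×16-pixel macroblocks and assumes that the 522 mW upper power figure corresponds to operation at the maximum throughput.

```python
WIDTH, HEIGHT, FPS = 4096, 2160, 24   # target specification from the text
CLOCK_HZ = 280e6                      # highest operating frequency
MB_PIXELS = 16 * 16                   # pixels per macroblock
MAX_POWER_W = 0.522                   # assumed to apply at maximum throughput

pixels_per_s = WIDTH * HEIGHT * FPS
mbs_per_s = pixels_per_s / MB_PIXELS
cycles_per_mb = CLOCK_HZ / mbs_per_s
mpixels_per_watt = pixels_per_s / 1e6 / MAX_POWER_W

print(f"throughput       : {pixels_per_s / 1e6:.1f} Mpixels/s")
print(f"macroblock rate  : {mbs_per_s / 1e3:.0f} K MBs/s")
print(f"cycles per MB    : {cycles_per_mb:.1f}")
print(f"power efficiency : {mpixels_per_watt:.0f} Mpixels/W")
```

The results (about 212 Mpixels/s, 830 K MBs/s, and 407 Mpixels/W) agree with the figures reported in the paper, and the roughly 338 cycles available per macroblock is close to the 350-cycle stage budget quoted in the Introduction.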


V. CONCLUSION

The proposed MVC single-chip encoder supports view scalability for encoding one-view 4096×2160p, three-view 1080p, and seven-view 720p videos for future 3DTV and QFHD TV applications. The VPMBI scheduling is proposed to meet the high processing capability required for MVC and to handle various MVC prediction structures. As a result, a maximum throughput of 212 Mpixels/s, which is 3.4 to 7.7 times higher than the prior art, is achieved, and view scalability is supported. In addition, the cache-based prediction core with a SW prefetching scheme and a predictor-centered ME/DE algorithm are proposed to address the design challenges of large on-chip SRAM area and external memory bandwidth. As a result, 79% of the system memory bandwidth and 94% of the on-chip SRAM are saved.

The architecture and scheduling of each MB pipeline stage are also analyzed and designed. The cache-based temporal/inter-view prediction stage saves 95% of the computation with a quality loss of less than 0.1 dB in PSNR. The hybrid open-closed loop IP and pixel-forwarding REC stages overcome the design challenge of data dependency and raise the throughput to 1.8 to 2.7 times that of the conventional architectures. The FPPDD CABAC, operating together with the VPMBI scheduling, provides 2 to 3.88 times the throughput, and the symbol encoding rate required for 4096×2160p resolution is achieved.

Furthermore, the VPMBI scheduling can be regarded as a design methodology for ASIC-based HD video encoders. By allocating more view-cache SRAM in the design, the parallel-processing capability is enhanced. It also enables more efficient use of external system memory bandwidth for complex MVC prediction structures. From a system-on-chip point of view, a well-designed bus architecture and an arbitration strategy are necessary when other IPs, such as the processor and video I/O, compete for bus communication.
They are challenging research topics and are part of our future work.

REFERENCES

[1] F. Isgrò, E. Trucco, P. Kauff, and O. Schreer, "Three-dimensional image processing in the future of immersive media," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 3, Mar.
[2] A. Smolic and P. Kauff, "Interactive 3-D video representation and coding technologies," Proc. IEEE, vol. 93, no. 1, Jan.
[3] M. Tanimoto, "Free viewpoint television FTV," in Proc. Picture Coding Symp., Dec.
[4] T. Fujii and M. Tanimoto, "Free-viewpoint TV system based on ray-space representation," in Proc. SPIE, Mar. 2002, vol. 4864.
[5] A. Smolic, K. Mueller, P. Merkle, T. Rein, M. Kautmer, P. Eisert, and T. Wiegand, "Free viewpoint video extraction, representation, coding, and rendering," in Proc. IEEE Int. Conf. Image Processing, Oct. 2004, vol. 5.
[6] Advanced Video Coding for Generic Audiovisual Services, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, May.
[7] Joint Draft 7.0 on Multiview Video Coding, Joint Video Team of ISO/IEC MPEG and ITU-T VCEG, ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, Apr.
[8] P. Merkle, A. Smolic, K. Muller, and T. Wiegand, "Efficient prediction structures for multiview video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, Nov.
[9] L.-F. Ding, W.-Y. Chen, P.-K. Tsung, T.-D. Chuang, H.-K. Chiu, Y.-H. Chen, P.-H. Hsiao, S.-Y. Chien, T.-C. Chen, P.-C. Lin, C.-Y. Chang, W.-L. Chen, and L.-G. Chen, "A 212 MPixels/s 4096 × 2160p multiview video encoder chip for 3D/quad HDTV applications," in IEEE ISSCC Dig. Tech. Papers, Feb.
[10] Y.-W. Huang et al., "A 1.3 TOPS H.264/AVC single-chip encoder for HDTV applications," in IEEE ISSCC Dig. Tech. Papers, Feb. 2005.
[11] H.-C. Chang et al., "A 7 mW to 183 mW dynamic quality-scalable H.264 video encoder chip," in IEEE ISSCC Dig. Tech. Papers, Feb. 2007.
[12] Y.-K. Lin et al., "A 242 mW 10 mm² 1080p H.264/AVC high profile encoder chip," in IEEE ISSCC Dig. Tech. Papers, Feb. 2008.
[13] Z. Liu, Y. Song, M. Shao, S. Li, L. Li, S. Ishiwata, M. Nakagawa, S. Goto, and T. Ikenaga, "A 1.41 W H.264/AVC real-time encoder SOC for HDTV1080P," in VLSI Circuits Symp. Dig., Jun. 2007.
[14] T.-C. Chen et al., "2.8 to 67.2 mW low-power and power-aware H.264 encoder for mobile applications," in VLSI Circuits Symp. Dig., Jun. 2007.
[15] Y.-H. Chen et al., "An H.264/AVC scalable extension and high profile HDTV 1080p encoder chip," in VLSI Circuits Symp. Dig., Jun.
[16] P.-K. Tsung, W.-Y. Chen, L.-F. Ding, S.-Y. Chien, and L.-G. Chen, "Cache-based integer motion/disparity estimation for quad-HD H.264/AVC and HD multiview video coding," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2009.
[17] T.-D. Chuang, Y.-H. Chen, C.-H. Tsai, Y.-J. Chen, and L.-G. Chen, "Algorithm and architecture design for intra prediction in H.264/AVC high profile," in Proc. Picture Coding Symp.
[18] Y.-W. Huang et al., "Analysis, fast algorithm, and VLSI architecture design for H.264/AVC intra frame coder," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, Mar.
[19] C.-W. Ku et al., "A high-definition H.264/AVC intra-frame codec IP for digital video and still camera applications," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 8, Aug.
[20] K. Suh, S. Park, and H. Cho, "An efficient hardware architecture of intra prediction and TQ/IQIT module for H.264 encoder," ETRI J., vol. 27.
[21] Y.-J. Chen, C.-H. Tsai, and L.-G. Chen, "Architecture design of area-efficient SRAM-based multi-symbol arithmetic encoder in H.264/AVC," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), 2006.

Li-Fu Ding was born in Keelung, Taiwan. He received the B.S. degree in electrical engineering, and the M.S. and Ph.D. degrees in electronics engineering from National Taiwan University, Taipei, Taiwan, in 2003, 2005, and 2008, respectively. In 2009, he joined Taiwan Semiconductor Manufacturing Company as a principal engineer. His major research interests include stereo and multiview video coding, motion estimation algorithms, and associated VLSI architectures.

Wei-Yin Chen was born in Penghu, Taiwan. He received the B.S. degree in electrical engineering and the M.S. degree in electronics engineering from National Taiwan University, Taipei, Taiwan, in 2005 and 2008, respectively. In 2007, he was with MIT as a visiting graduate student. His major research interests include super high definition and multi-view video coding, associated VLSI architectures, high level synthesis, and computer architecture.

Pei-Kuei Tsung was born in Taipei, Taiwan. He received the B.S. degree in electrical engineering and the M.S. degree in electronics engineering from National Taiwan University, Taipei, Taiwan, in 2006 and 2008, respectively, where he is working toward the Ph.D. degree in electronics engineering. His major research interests include stereo and multi-view video coding, motion estimation algorithms, view synthesis algorithms, and associated VLSI architectures.

Tzu-Der Chuang was born in Taipei, Taiwan, in 1983. He received the B.S.E.E. degree from the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, in 2005. He is now working toward the Ph.D. degree in the Graduate Institute of Electronics Engineering, National Taiwan University. His major research interests include the algorithms and related VLSI architectures of H.264/AVC and scalable video coding.

Pai-Heng Hsiao was born in Taoyuan, Taiwan, in 1985. He received the B.S.E.E. degree from the Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan. He is now working toward the Master degree in the Graduate Institute of Electronics Engineering, National Taiwan University. His major research interests include the algorithms and architectures of video coding and neural signal processing.

Yu-Han Chen was born in Taipei, Taiwan. He received the B.S. degree from the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, in 2003. He is currently pursuing the Ph.D. degree at the Graduate Institute of Electronics Engineering, National Taiwan University. His research interests include image/video signal processing, motion estimation, algorithm and architecture design of H.264 video coders, and low-power and power-aware video coding systems.

Hsu-Kuang Chiu was born in Taipei, Taiwan, in 1983. He received the B.S. degree in electrical engineering in 2006 from National Taiwan University, Taipei, Taiwan. He is currently pursuing the M.S. degree in electrical engineering at Stanford University, Stanford, CA. His major research interests include video coding and view-synthesis algorithms.

Shao-Yi Chien (S'99–M'04) received the B.S. and Ph.D. degrees from the Department of Electrical Engineering, National Taiwan University (NTU), Taipei, Taiwan, in 1999 and 2003, respectively.
During 2003 to 2004, he was a research staff member at Quanta Research Institute, Tao Yuan County, Taiwan. In 2004, he joined the Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, as an Assistant Professor. Since 2008, he has been an Associate Professor. His research interests include video segmentation algorithms, intelligent video coding technology, perceptual coding technology, image processing for digital still cameras and display devices, computer graphics, and the associated VLSI and processor architectures. He has published more than 120 papers in these areas. Dr. Chien serves as an Associate Editor for IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and Springer Circuits, Systems and Signal Processing (CSSP), and has served as a Guest Editor for Springer Journal of Signal Processing Systems. He also serves on the technical program committees of several conferences, including ISCAS, A-SSCC, and VLSI-DAT.

Liang-Gee Chen (S'84–M'86–SM'94–F'01) was born in Yun-Lin, Taiwan. He received the B.S., M.S., and Ph.D. degrees in electrical engineering from National Cheng Kung University, Taiwan, in 1979, 1981, and 1986, respectively. He was an Instructor and an Associate Professor in the Department of Electrical Engineering, National Cheng Kung University. During his military service in 1987 and 1988, he was an Associate Professor in the Institute of Resource Management, Defense Management College. In 1988, he joined the Department of Electrical Engineering, National Taiwan University. During 1993 to 1994, he was a Visiting Consultant at the DSP Research Department, AT&T Bell Lab, Murray Hill. In 1997, he was a visiting scholar at the Department of Electrical Engineering, University of Washington, Seattle. Currently, he is a Professor at National Taiwan University. Since 2004, he has also been the Executive Vice President and the General Director of Electronics Research and Service Organization (ERSO) in the Industrial Technology Research Institute (ITRI). His current research interests are DSP architecture design, video processor design, and video coding systems.

Dr. Chen is a Fellow of IEEE and a member of the honor society Phi Tau Phi. He was the general chairman of the 7th VLSI Design CAD Symposium and of the 1999 IEEE Workshop on Signal Processing Systems: Design and Implementation. He has served as an Associate Editor of IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY since June 1996 and as an Associate Editor of IEEE TRANSACTIONS ON VLSI SYSTEMS. He has been an Associate Editor of the Journal of Circuits, Systems, and Signal Processing, and he served as a Guest Editor of the Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology. He is also an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING. Since 2002, he has also been an Associate Editor of the PROCEEDINGS OF THE IEEE.

Dr. Chen received the Best Paper Award from the R.O.C. Computer Society in 1990. From 1991 to 1999, he received Long-Term (Acer) Paper Awards annually. In 1992, he received the Best Paper Award of the 1992 Asia-Pacific Conference on Circuits and Systems in the VLSI design track. In 1993, he received the Annual Paper Award of the Chinese Engineer Society. In 1996, he received the Outstanding Research Award from NSC and the Dragon Excellence Award for Acer. He was elected an IEEE Circuits and Systems Distinguished Lecturer.

A COST-EFFICIENT RESIDUAL PREDICTION VLSI ARCHITECTURE FOR H.264/AVC SCALABLE EXTENSION Yi-Hau Chen, Tzu-Der Chuang, Chuan-Yung Tsai, Yu-Jen Chen, and Liang-Gee Chen DSP/IC Design Lab., Graduate Institute