Video Encoding with Multicore Processors
March 29, 2007

Video is Ubiquitous...
- Demand for any content, any time, anywhere
- Resolution ranges from 128x96 pixels for mobile to 1920x1080 pixels for full HD
- Frame rates range from 10 to 60 fps
- The only constant is that raw digital data always outstrips available capacity!

Raw data rates by format:

Format     Lines/frame  Pixels/line  Bits/pixel  Mbit/frame  fps  Mbps   2 hr video
Sub-QCIF   96           128          8           0.0983      10   0.983  885 MB
NTSC       480          720          16          5.53        30   166    149 GB
HDTV       1080         1920         24          49.77       60   2,986  2.69 TB

Channel capacity (Mbps) and storage capacity (GB) shown alongside each format:
- Sub-QCIF: channels GSM 0.0014, POTS 0.0056, broadband 0.3840
- NTSC: channels satellite 6, cable 20; storage DVD 17, iPod 80
- HDTV: channels T3 44.7, OC-3 155.5; storage HD-DVD 30, Blu-ray 50

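The arithmetic behind the raw data rates can be reproduced with a short calculation. The sketch below is illustrative only: the function name `raw_video_rate` and its parameters are assumptions made for this example, and it takes 1 Mbit = 10^6 bits and 1 MB = 10^6 bytes, which is consistent with the figures on the slide.

```python
def raw_video_rate(lines, pixels_per_line, bits_per_pixel, fps, hours=2.0):
    """Return (Mbit per frame, Mbps, storage in MB) for uncompressed video.

    Hypothetical helper for illustration; decimal units are assumed
    (1 Mbit = 1e6 bits, 1 MB = 1e6 bytes).
    """
    bits_per_frame = lines * pixels_per_line * bits_per_pixel
    mbit_per_frame = bits_per_frame / 1e6
    mbps = mbit_per_frame * fps
    # seconds of video * Mbit/s gives Mbit; divide by 8 for megabytes
    storage_mb = mbps * hours * 3600 / 8
    return mbit_per_frame, mbps, storage_mb


if __name__ == "__main__":
    formats = {
        "Sub-QCIF": (96, 128, 8, 10),
        "NTSC": (480, 720, 16, 30),
        "HDTV": (1080, 1920, 24, 60),
    }
    for name, params in formats.items():
        per_frame, mbps, mb = raw_video_rate(*params)
        print(f"{name}: {per_frame:.4g} Mbit/frame, {mbps:.4g} Mbps, {mb:.4g} MB for 2 hours")
```

Comparing the Mbps result for each format with the channel capacities listed on the slide shows the gap that compression has to close.
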
High Definition = High Quality
- 1080 lines per frame
- 1920 pixels per line
- 60 frames per second
- 16 x 9 aspect ratio
- 24 bits per pixel, RGB/YUV
- For comparison: La Grande Jatte (aka Sunday in the Park), 1886, Georges Seurat
  - 120 x 81 inches
  - About 3.5 million dots = 19 dpi, viewed from 18-25 feet (vs. about 2.1 million pixels on an HDTV screen)

Video Encoding Standards
- Defined by the ISO/IEC MPEG Group
- MPEG2; AVC/H.264 (also known as MPEG4 Part 10)
- Can achieve up to 100:1 reduction in HD video data

[Chart: relative file sizes for 1 hour of video, in GB]
- D1 uncompressed: 88
- DV: 13
- MPEG-2 (DVD): 3.5
- H.264/AVC: 0.86
- Chart annotations: intra-picture redundancy; inter-picture redundancy; 24 fps; advanced compression technology

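AVC/H.264 Encoder Overview
[Block diagram: encoder data flow. Labels recovered from the figure:]
- Frames: Fn (current); Fn-1 and F'n-1 (reference); F'n (reconstructed); buffer depth (n)
- Inter path (P and B pictures): motion estimation, motion compensation
- Intra path (I pictures): intra prediction, intra compensation
- Inter/intra switch
- Forward path: residual Dn, transform and quantize, reorder and encode, transmit (NAL)
- Reconstruction path: inverse quantize and transform, D'n, deblocking filter
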
AVC / H.264 Algorithm
Sub-blocks, with their features and processing characteristics:
- Motion estimation. Features: adaptive block sizes 16x16, 8x8, 4x8, 8x4, 4x4. Processing: greatest processing requirement (~50%); highly parallelizable.
- Transform. Features: 4x4, 8x8 (High Profile); simple integer. Processing: highly parallelizable.
- Motion compensation. Features: to 1/4 pel. Processing: highly parallelizable.
- Deblocking filter. Features: in loop. Processing: partly parallelizable.
- Intra prediction. Features: 13 modes (4x4: 9 modes; 16x16: 4 modes). Processing: partly parallelizable.
- Inter prediction modes. Features: numerous choices of block size, number/type of reference frames. Processing: partly parallelizable.
- Entropy encoding. Features: Context-based Adaptive VLC (CAVLC) and Context-based Adaptive Binary Arithmetic Coding (CABAC). Processing: bitwise operations; partly parallelizable.
- Quantization. Features: finer range of parameters. Processing: highly parallelizable.
- Next generation. Features: Fidelity Range Extensions (FRExt): 4:2:0 High; 4:2:2 8-10 bit; 4:4:4 10-12 bit. Processing: still higher processing requirements.

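To make the "simple integer" transform entry concrete, the sketch below applies the well-known 4x4 forward core transform matrix of H.264 to one residual block using only integer multiplies and adds. It is a plain reference illustration, not the vectorized code that would run on the hardware described here: the function names are assumptions for this example, and the scaling that the standard folds into quantization is deliberately left out.

```python
# 4x4 forward core transform matrix used by H.264 (integer entries only).
CF = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose4(m):
    """Transpose a 4x4 matrix."""
    return [list(row) for row in zip(*m)]


def forward_transform_4x4(block):
    """Compute Y = CF * X * CF^T for a 4x4 residual block X.

    Post-scaling (normally merged into the quantizer) is omitted in this sketch.
    """
    return matmul4(matmul4(CF, block), transpose4(CF))


if __name__ == "__main__":
    flat = [[10] * 4 for _ in range(4)]
    for row in forward_transform_4x4(flat):
        print(row)
```

Each of the 16 output coefficients is an independent dot product, which is one way to see why the transform is listed as highly parallelizable.
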
H.264 Improves Video Quality and Bit Rate
[Chart: bit rate vs. year, 1990 to 2010, with MPEG2 and H.264 curves; bit-rate axis marked at 20 Mbps, 10 Mbps, and 2 Mbps]
- Ideal application for a multiprocessor, multicore solution
  - Keeps up with performance requirements (5-6X MPEG2)
  - Requires programmability to keep pace with algorithm improvements

Telairity-1 Video Architecture
- Architecture designed for High Definition video
- 5 identical, loosely coupled vector/scalar processor cores in a single chip
- Integrated DRAM controller
- Integrated video controller
- Fully programmable
- 90nm process technology, up to 750 MHz operation (594 MHz today)
- Sustained chip performance of 49.5 GigaOPS (BOPS)

[Block diagram: processors P0 to P4 (TVP400 cores); bit packing unit; video controller with 20-bit parallel video I/O; 5 SPI channels; DMA & SDRAM controller, 4.8 GB/s]

TVP400 Core Block Diagram
[Block diagram. Labels recovered from the figure:]
- Vector unit: 4 vector pipes; 44 16-bit functional units; 2K vector registers (512 VR/pipe), 16-read, 8-write; 8-load, 4-store; 16-bit data paths
- Scalar unit: 1 scalar pipe; 6 32-bit functional units; 256B local registers, 3-read, 1-write; 8KB scratch memory; 32-bit data paths
- Instruction unit: single issue, 32-bit; 32 KB I cache
- Memory: 128 KB vector memory; DMA; 512 MB SDRAM controller; 4KB data; 64-bit and 32-bit paths, with a link to other cores

Multi-Pipe Vector Instructions
- 4 independent 16-bit vector pipes
  - 5 instructions per pipe, for a total of 20 in parallel
- Vector length of 32
  - From 1 to 32 vector elements processed sequentially
- Extremely efficient for several video operations
  - Motion estimation / compensation
  - 8x8, 4x4 transforms
- H.264/AVC 4x4 block intra-prediction algorithm
  - Uses the multi-pipe vector core
  - Calculates eight modes simultaneously
  - Eightfold speedup in intra-prediction

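The intra-prediction bullet can be illustrated with a small scalar sketch. It evaluates three of the nine H.264 4x4 intra modes (vertical, horizontal, DC) for one block and picks the one with the lowest sum of absolute differences (SAD). This is an assumption-laden illustration, not the TVP400 implementation: the function names are hypothetical, the modes are evaluated one after another rather than simultaneously in vector pipes, and SAD is just one possible cost measure.

```python
def predict_vertical(top, left):
    """Each column repeats the pixel directly above the block."""
    return [list(top) for _ in range(4)]


def predict_horizontal(top, left):
    """Each row repeats the pixel directly to the left of the block."""
    return [[left[r]] * 4 for r in range(4)]


def predict_dc(top, left):
    """Every pixel is the rounded mean of the 8 neighbouring pixels."""
    dc = (sum(top) + sum(left) + 4) >> 3
    return [[dc] * 4 for _ in range(4)]


def sad(block, pred):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(block[r][c] - pred[r][c]) for r in range(4) for c in range(4))


# Only three of the nine 4x4 modes are included in this sketch.
MODES = {
    "vertical": predict_vertical,
    "horizontal": predict_horizontal,
    "dc": predict_dc,
}


def best_intra_mode(block, top, left):
    """Return (best mode name, dict of SAD cost per mode)."""
    costs = {name: sad(block, fn(top, left)) for name, fn in MODES.items()}
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    block = [[10, 20, 30, 40] for _ in range(4)]
    print(best_intra_mode(block, top=[10, 20, 30, 40], left=[10, 10, 10, 10]))
```

Because each mode's cost depends only on the block and its neighbours, the per-mode evaluations are independent of one another, which is what makes computing several modes at once on parallel vector pipes attractive.
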
Multiprocessor Encoding Engine
- Partition video data by parallelizing each frame into multiple slices
- AND partition the algorithm between the top and bottom rows of processors
- High-bandwidth chip-to-chip communication to eliminate slice artifacts
- 396 sustained BOPS

[Diagram: frame split into Slice 0 to Slice 3; top row of processors P0 to P3 and bottom row P4 to P7, each 49.5 BOPS; labels "hand off" and "data out" for each slice]

Multicore Processing
- Partition video data further by parallelizing each slice into multiple macroblocks
  - Processed in parallel using 5 TVP400 cores
  - Tasks divided sequentially across two T1 processors
- Really, very doable:
  - 720p is 80 x 45 = 3,600 macroblocks per frame
  - x 60 fps = 216,000 macroblocks/sec
  - Divided by 4 slices = 54,000 macroblocks/sec per slice
  - Each slice is handled by 2 T1 processors of about 50 BOPS each

[Diagram: two T1 processors, each with cores P0 to P4, DMA & SDRAM controller, bit packing unit, and video controller; 49.5 BOPS each]

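The macroblock arithmetic above can be checked with a few lines of code. The sketch below is illustrative: the function name and parameters are assumptions for this example, and the per-macroblock operation budget it derives is simply the sustained rate divided by the macroblock rate, not a figure quoted in the presentation.

```python
def macroblock_budget(width, height, fps, slices,
                      processors_per_slice=2, bops_per_processor=49.5):
    """Return macroblock rates and a rough operation budget per macroblock.

    Hypothetical helper. Macroblocks are 16x16 pixels; dimensions that are
    not multiples of 16 are rounded up to whole macroblocks.
    """
    mb_wide = -(-width // 16)   # ceiling division
    mb_high = -(-height // 16)
    mb_per_frame = mb_wide * mb_high
    mb_per_sec = mb_per_frame * fps
    mb_per_sec_per_slice = mb_per_sec / slices
    ops_per_sec = processors_per_slice * bops_per_processor * 1e9
    ops_per_mb = ops_per_sec / mb_per_sec_per_slice
    return mb_per_frame, mb_per_sec, mb_per_sec_per_slice, ops_per_mb


if __name__ == "__main__":
    per_frame, per_sec, per_slice, ops = macroblock_budget(1280, 720, 60, 4)
    print(per_frame, per_sec, per_slice, round(ops))
```

Under these assumptions, two processors at the sustained 49.5 BOPS leave on the order of 1.8 million operations for each macroblock in a 720p60 slice.
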
BE8000 Peak Performance
- TVP400 processor core: 5 pipelines
  - 4 vector pipes (VP0 to VP3), each: 2 16-bit ops, 2 loads, 1 store
  - 1 scalar pipe (SP): 1 32-bit op
  - 5+5+5+5+1 = 21 ops/clk x 594 MHz = 12.474 BOPS per core
- TVP2000 processor: 5 cores (Core0 to Core4), plus I/O controller, bit packing unit, and memory controller
  - 5 x 12.474 = 62.37 BOPS per processor
- BE8000 encoder board: 8 processors (P0 to P7)
  - 8 x 62.37 = 498.96 BOPS per board

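The peak-performance chain is a straightforward product, sketched below so the clock rate or unit counts can be varied. The function name and parameter names are assumptions for this example; the default values are the ones given on the slide.

```python
def peak_bops(clock_mhz=594.0, vector_pipes=4, ops_per_vector_pipe=5,
              scalar_ops=1, cores_per_processor=5, processors_per_board=8):
    """Return peak billions of operations per second at each level.

    Hypothetical helper. Each vector pipe is counted as 2 16-bit ops +
    2 loads + 1 store = 5 ops per clock, and the scalar pipe as 1 op per clock.
    """
    ops_per_clock = vector_pipes * ops_per_vector_pipe + scalar_ops
    core = ops_per_clock * clock_mhz / 1000.0   # MHz * ops/clk gives Mops; /1000 gives BOPS
    processor = core * cores_per_processor
    board = processor * processors_per_board
    return {
        "ops_per_clock": ops_per_clock,
        "core": core,
        "processor": processor,
        "board": board,
    }


if __name__ == "__main__":
    for level, value in peak_bops().items():
        print(f"{level}: {value:.3f}")
```

Passing a different `clock_mhz` shows how the same chain scales with clock rate, for instance at the 750 MHz the process is stated to support.
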
AVClairity Compression Software
- Dedicated AVC software
  - All written in-house (C and intrinsics)
  - Runs directly on BE8000 hardware (no OS)
  - Flexible, scalable for different resolutions, standards
- Main Profile, High Profile of AVC standard
- Level 4.0
- 4:2:0, 8-bit compression
- Video formats: 720p 50/59.94/60 and 1080i 25/29.97/30
- Constant Bit Rate (CBR) or Variable Bit Rate (VBR)
- Bit rate: 2-20 Mbps
- Scene change detection
- Context Adaptive Entropy Coding: CABAC or CAVLC
- Spatial and motion-compensated temporal filtering

Software Issues and Challenges
- Load balancing between processors
  - One slice/frame may contain extremely dense data relative to other slices/frames
  - Managing idle time waiting for processors to finish
- Synchronization of data exchange between processors
- Minimization of latency
- Efficient and high-bandwidth communication between processors and cores
  - >300 MB/sec path between processors
    - Eliminates slice boundary artifacts
  - 5 cores share vector memory and DDR2 DRAM memory per processor
    - Minimizes inter-core communication requirements

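The presentation does not say how the load-balancing problem is solved. As one possible illustration, the sketch below moves slice boundaries so that each slice receives a roughly equal share of an estimated per-row cost instead of an equal number of macroblock rows. Everything here is an assumption made for the example: the function name, the idea of a per-row cost estimate (for instance, bits spent on that row in the previous frame), and the greedy rule itself.

```python
def balance_slices(row_costs, num_slices):
    """Split macroblock rows into contiguous slices of roughly equal cost.

    Hypothetical greedy sketch: row_costs[i] is an estimated encoding cost
    for macroblock row i. Returns a list of (start, end) row ranges with
    end exclusive. Every slice gets at least one row.
    """
    n = len(row_costs)
    if not 1 <= num_slices <= n:
        raise ValueError("need between 1 and len(row_costs) slices")
    total = sum(row_costs)
    bounds = [0]
    running = 0.0
    row = 0
    for k in range(1, num_slices):
        target = total * k / num_slices      # cumulative cost at boundary k
        last_allowed = n - (num_slices - k)  # keep one row for each later slice
        running += row_costs[row]            # each slice takes at least one row
        row += 1
        while row < last_allowed and running + row_costs[row] <= target:
            running += row_costs[row]
            row += 1
        bounds.append(row)
    bounds.append(n)
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    # A dense first row pulls the first boundary up so slice 0 stays small.
    print(balance_slices([4, 1, 1, 1, 1, 1, 1, 2], 4))
```

A scheme like this trades fixed slice geometry for less idle time; whether that trade is acceptable depends on how slice boundaries interact with the inter-processor data exchange described above.
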
BE8000 Video Encoding Platform
- HD AVC video encoding
- AAC audio encoding
- Highest performance; multiprocessor, multicore solution
  - Runs 40 concurrent threads
  - Each thread has 5 execution units available to it
- Achieves low latency and high video quality
- Flexible and software upgradeable for video and bit-rate improvements