Video Codecs National Chiao Tung University Chun-Jen Tsai 1/5/2015
Video Systems A complete end-to-end video system: A/D color conversion encoder decoder color conversion D/A bitstream YC B C R format RGB Question: how do we compress video data for common usage? MPEG-1 video data rate: 1 ~ 2 mbps MPEG-2 video data rate: 2 ~ 20 mbps MPEG-4 video data rate: Simple Profile, 64kbps ~ 1.5 mbps AVC/H.264, 32 kbps ~ 20 mbps 2/42
Video Frame Representation Video frame are represented in YC B C R 4:2:0 format RGB format is well-known, but not suitable for video coding YC B C R is used for video coding because: Color components can be subsampled easily Human eyes are less sensitive to color gradient Some people refer to YC B C R space as YUV color space 4:2:0 stands for color subsampling one luma sample two chroma samples An *.yuv file stores video data frame-by-frame. Each frame stores complete luma sample before chroma samples. Y: luma; C B, C R : chroma For more info., see Charles Poynton s website: http://www.poynton.com/ 3/42
Color Space Conversion RGB YC B C R : Y = α red Red + α green Green + α blue Blue C B = (Blue - Y) / (2 2 α blue ) C R = (Red - Y) / (2 2 α red ) YC B C R RGB: Red = C R (2 2 α red ) + Y Green = (α blue Blue - α red Red) / α green Blue = C B (2 2 α blue ) + Y Coefficients table: Recommendation α red α green α blue BT-601 0.2990 0.5870 0.1140 BT-709 0.2126 0.7152 0.0722 4/42
History of Video Standards : developed by ITU-T/VCEG : developed by ISO/MPEG : official joint work by ISO & ITU-T ITU-T Standards ITU-T H.261 (video telephony) H.263/H.263+/H.263++ (video telephony, MMS) Joint Standards MPEG 2 Part 2 (a.k.a. H.262) AVC: MPEG-4 Part 10 / H.264 (ed.1 ~ 3) HEVC: MPEG-H Part 2 / H.265 (ed. 1) ISO/IEC Standards MPEG 1 Part 2 MPEG 4 Part 2 (ver.1 ~ ver.3) 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2007 2008 2009... 2013 5/42
From Video Frame to Bitstream video sequence I P P I B P time I: Intra coding (spatially-compensated) P: Inter coding (motion-compensated) B: Bi-inter coding (motion-compensated) video frame slices macroblocks Each pixel is composed of a luma (Y) and two chroma (C B, C R ) components Simply put, each block is composed of 386 numbers from the range of [0, 255] Picture header Slice Data Slice Data Slice Data Compressed bitstream Video header Picture header... Slice Data Note: In the past, video experts tries to break a video frame into objects instead of square blocks before coding, but the idea didn t fly! 6/42
Components of MPEG Video Codecs Popular video codecs today are all block-based motion-compensated transform codecs, which composed of four modules: Predictive coder (loss-less) Transform coder (loss-less, theoretically) Quantizer (lossy) Entropy coder (loss-less) YC B C R Prediction T Q Entropy Packing bitstream encoder decoder Prediction 1 T 1 Q 1 Entropy 1 Unpacking * T transform; Q Quantization 7/42
Predictive Coder Predictive coders perform the Guess Work in a video codec Two types of predictions are available: Spatial prediction (a.k.a. intra-prediction) Temporal prediction (a.k.a. inter-prediction) 8/42
Spatial Prediction Pixels are predicted using their neighboring pixels Assume similar pixels extend in this direction Prediction model (predictor) The predicted model may not match the real pixels exactly, therefore, the errors between the model and the real pixels must be recorded 9/42
Temporal (Motion) Prediction The most important prediction coding technique for video is called motion-compensated prediction: motion vector R M m n blocks reference frame current frame The block M can be represented by the motion vector plus the differences between R and M 10/42
How Big Should a Block Be? It would be ideal if the spatio-temporal prediction is based on the natural object boundary Engineers try this idea and failed Today, most successful codecs use variable block sizes for prediction Example: in H.264, two-level quadtree partition is used Level one 16x16 16x8 8x16 8x8 0 0 1 0 0 1 1 2 3 Level two 0 0 1 0 1 8x8 8x4 4x8 0 1 2 3 4x4 11/42
Advantages of Small Prediction Blocks The smaller the block size is, the better your model fits video data 16 16 macroblocks reference frame current frame 12/42
½-Pixel Motion Compensation Since MV resolution is half-a-pixel, when the MV is, say, (-10.5, 4.5), we must make up the predictor block R by interpolation: A p B q r Integer pixel position C D Half pixel position p = (A + B + 1 δ)/2 q = (A + C + 1 δ)/2 r = (A + B + C + D + 2 δ)/4 Note: δ = 0 or 1, is a rounding control parameter. δ is data-dependent and determined by the encoder. 13/42
Motion Vector Coding Motion vectors are also predictively coded The predictor is the median of MV1, MV2, and MV3 14/42
Transform Coder (DCT) 2D Discrete Cosine Transform (DCT) is used: Forward transform (for encoder): Backward transform (for decoder): Due to rounding issues, DCT can not be computed precisely by a computer In MPEG-2 & MPEG-4 part 2, codec modifies the last coefficient of each block by ±1 to reduce the mismatch effect = = + + = 1 0 1 0 2 1) (2 cos 2 1) (2 )cos, ( ) ( ) ( ), ( N x N y N v y N u x y x f v C u C v u F π π = = + + = 1 0 1 0 2 1) (2 cos 2 1) (2 )cos, ( ) ( ) ( ), ( N u N v N v y N u x v u F v C u C y x f π π = = 0, 0, ) ( 2 1 t t t C N N Note: In both equations, N is the size of block, and 15/42
Quantization Module The transformed coefficients are then quantized using a staircase function (with stepsize i ): Output Levels y M y i i y i = x / i b i-1 b i Input Levels x How do we choose the right i? 16/42
MPEG-4 Part 2 DC Prediction Pick DC predictor based on gradients of the DC s: If ( DC A DC B < DC B DC C ) DC X = DC C else DC X = DC A B C D or or A X Y Block (8x8) Macroblock (16x16) 17/42
MPEG-4 Part 2 AC Prediction Coefficients are predicted from previous coded blocks. The best direction is chosen based on the DC prediction. B or C or D A X Y M macroblock 18/42
Rate-Distortion (R-D) Trade-off The R-D function depends on the codec model as well as the video content (segment) quality Good operating points! bit-budget range bit-budget range rate 19/42
Rate-Distortion Optimization Theoretical R-D function is a characteristics of the video content Operational R-D function is determined by the codec Distortion codec #1 codec #2 theoretical limit from information theory Rate 20/42
DCT Coefficients Entropy Coding The transform coefficients are coded using an entropy coder Today, many popular video codecs perform entropy coding in three steps: Convert 2-D data to 1-D array of coefficients (zigzag scan) Convert 1-D array of coefficients to 1-D array of symbols (run-length coding) Variable-length coding (VLC) of the run-length symbols 21/42
Zigzag Scan Zigzag scan is used to map the 2-D array of DCT coefficients to an 1-D array: 0 1 5 6 14 15 27 28 2 4 7 13 16 26 39 42 3 8 12 17 25 30 41 43 9 11 18 24 31 40 44 53 10 19 23 32 39 45 52 54 20 22 33 38 46 51 55 60 21 34 37 47 50 56 59 61 35 36 48 49 57 58 62 63 22/42
VLC-based Entropy Coding Take MPEG-1/2/4 for example, each nonzero coefficient in the 1-D array is converted to a (Last, Run, Level) symbol: Last: is this the last non-zero coefficient? Run: the number of zeros precede this coefficient Level: the value of this coefficient These symbols are coded using variable length codes (VLC) 23/42
Entropy Coding Example 78 0 0 0 0 0 0 0-9 0 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0-9 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 zig-zag scan 78, 0, -9, 0, 0, 0, 0, 22, 0, 0,, 0, 0, -9 convert to symbols (0, 0, 78), (0, 1, -9), (0, 4, 22), (0, 14, 8), (0, 0, 3), (1, 21, -9) VLC coding Convert symbols to VLC codes by table-lookup. Last Run Level Bits VLC Code 0 0 1 3 10s 0 0 2 5 1111s 0 0 3 7 0101 01s 1 0 1 5 0111s 1 0 2 10 0000 1100 1s 24/40
Packing the Compressed Data In previous slides, we covered the core technologies inside a video codec. However, the compressed data has to be arranged into a bitstream, alone with some system information The packing mechanism is critical to the application scenario (e.g. video-over-ip) 25/42
MPEG-4 Simple Profile Bitstreams VOL Header* Elementary Stream Short Video Header Mode (H.263 Baseline) Combined Mode Data-Partitioned (DP) Mode Video Packet (VP) Mode Non-video Packet Mode *Note: VOL header for short video header mode is an empty header 26/40
Combined Mode with VP Syntax TIMESTAMP, PTYPE, QUANT VOL header Picture header Slice Data Slice Data Picture header PROFILE, WIDTH, HEIGHT Slice header MB 0 MB 1 MB N MB#, MBTYPE, DQUANT, MV, CBP, Coeff. Data VLC is used to code these symbols 27/40
Video Packets (Slices) A video packet (slice) is a set of consecutive macroblocks in scan order: Resync Marker VP Info. MB Data MB# QP HEC (width, height, etc.) 28/42
Macroblock Syntax Each macroblock data contains the following information COD MCBPC CBPY DQUANT TCOEF INTRADC MVD 2-4 MVD * MCBPC combines MB Prediction Type and chroma coded block pattern (CBP) into one symbol; INTRADC is coded using 8-bit FLC, they are present for every blocks in an Intra MB 29/42
Generic Encoder Architecture YC B C R Frame mode, model (dir, mv, ref_id, etc) Partition slice info Pred Model current MB - + T/Q Pred Transform Scan Entropy slice info Q -1 /T -1 Mux Pred Spatial residual Mode Select mode, model Model Compensation + bitstream reference MB(s) Ref 1 Ref 2 In-Loop Deblocking Filter Ref 3 30/42
Brief History of H.264 In 2000, Real Network, Nokia, and HHI proposed to MPEG to adopt their new coding technologies MPEG issued a CfP in 2001 The Joint Video Team (JVT) of ISO and ITU-T were established afterwards H.26L was selected as the starting point, however, during the course of development, most modules in H.26L were replaced H.264 became an International Standard in May 2003, 2nd ed. is released in March 2004; 3rd ed. Is released in Apr. 2005 Official name: Advanced Video Coding (AVC), also referred to as MPEG-4 Part 10 or H.264 Documents and reference software: http://iphome.hhi.de/suehring/tml/ 31/42
H.264 Key Features H.264 is still a block-based motion-compensated transform codec; it is a technical evolution from MPEG-1/2/4, not a technical revolution Key features: Predictive coding Spatial prediction: 9 directional prediction patterns plus a gradient prediction pattern (i.e. plane mode) Multiple references and 4 4 to 16 16 variable-size blocks for inter predictions 16-bit integer combined transform/quantization Exact forward-inverse transform pair is used Transform block size is 4 4 or 8 8 Two different entropy coding methods Universal VLC plus Context Adaptive VLC Context Adaptive Binary Arithmetic Coding In-loop filter 32/42
Multiple Reference Frames Good prediction H.264 Motion Prediction Good prediction MPEG-1/2/4 Motion Prediction Bad prediction Bad prediction Bad prediction 33/42
DCT-like Integer Transform Starting with a 4x4 general transform: Y T 21 22 23 24 = AXA = a a a a x31 x32 x33 x 34 a c a b One can reach: T ( ) Y = CXC E = a a a a x11 x12 x13 x14 a b a c b c c b x x x x a c a b c b b c x41 x42 x43 x44 a b a c d 1 1 d x x x x 1 1 1 d cos( π / 8) cos(3π /8) 2 2 1 1 1 1 x11 x12 x13 x14 1 1 1 d a ab a ab 2 2 1 d d 1 x21 x22 x23 x 24 1 d 1 1 ab b ab b 1 1 1 1 x 2 2 31 x32 x33 x 34 1 d 1 1 a ab a ab 41 42 43 44 2 2 ab b ab b a = b = c = 1 2 1 2 1 2 d = 2 1 = 0.414213... approximated by d = ½ 34/42
Entropy Coding of H.264 Two types of entropy coders are supported in H.264 1. Variable Length Coder: Universal VLC (UVLC) for syntax elements Context Adaptive VLC (CAVLC) for transform coefficients 2. Arithmetic coder (not for Baseline): Context Adaptive Binary Arithmetic Coding (CABAC) 35/42
UVLC for Semantic Elements UVLC uses Exp-Golomb codewords: codewords 0 1 Bit patterns 1 ~ 2 0 1 x 0 3 ~ 6 0 0 1 x 1 x 0 7 ~ 14 0 0 0 1 x 2 x 1 x 0 15 ~ 30 0 0 0 0 1 x 3 x 2 x 1 x 0...... where each x n equals 0 or 1 In original design, UVLC is used for transform coefficients as well, but the performance was bad 36/42
CAVLC for Transform Coefficients CAVLC Process: Uses the number of non-zero coefficients of neighboring blocks to select different VLC tables For example, if scanned coefficients are: 0, 3, 0, 1, -1, -1, 0, 1 The coded symbols are: number of non-zeros = 5 number of trailing ones = 3 sings of trailing ones = +, -, - level symbols = 1, 3 number of zeros = 3 runs of zeros = 1, 0, 0, 1, 1 37/42
Context Adaptive Binary AC CABAC in AVC is composed of three parts: Binarization: convert syntax values to binary (0/1) strings Context modeling: estimate probability of 0/1 occurrence Arithmetic coding: only two subdivisions for each subinterval bin value for context model update non-binary valued syntax element syntax element Binarizer binary valued syntax element bin string loop over bins bins regular bypass Context modeler bin value, context model Regular Coding Engine Bypass Coding Engine regular bypass coded bits coded bits bitstream Binary Arithmetic Coder 38/42
Quality Comparison Ref.: T. Wiegand et. al, Rate-Constrained Coder Control and Comparison of Video Coding Standards, IEEE T-CSVT, July, 2003 39/42
High Efficiency Video Coding (HEVC) HEVC is a joint video standard developed by ITU-T and ISO by the team JCT-VC The effort starts around 2008 and version 1 is officially released on April 13, 2013. HEVC is designed for high resolution videos such as 4K and 8K videos. However, it can achieve over 50% coding efficiency gains for videos larger than SD (720 480) resolutions The architecture of HEVC is similar to that of AVC, with extensions to support superblocks (64 64 pixel blocks) Documents and reference software: http://hevc.hhi.fraunhofer.de/ 40/42
HEVC Major Features Intra prediction now consider 33 different directions The block-based prediction unit (a.k.a. coding tree unit) can be as large as 64 64; The transform coding unit size can be as large as 32 32 Improved motion vector prediction and interpolation filter (now 7 or 8 taps) The loop filters now includes a deblocking filter and a sample adaptive offset (SAO) filter that can remove banding and ringing artifacts Interlaced video supported at metadata layer, not video coding layer 41/42
Other Video Codec Standards Society of Motion Picture and Television Engineers (SMPTE) has standardized three codecs since 2005 SMPTE VC-1 (a.k.a. Microsoft Video Codec) SMPTE VC-2 (a.k.a. BBC Mezzanine Video Codec) SMPTE VC-3 (a.k.a. Avid DNxHDt Video Codec) Google has its own royalty free codec: VP9 VP9 was officially released on June, 2013 VP9 is supported by major web browsers and ffmpeg/libav VP9 has 64 64 superblock coding structure similar to HEVC VP9 will be the standard codec for YouTube 4K video 42/42