IEE 5037 Multimedia Communications Lecture 12: MPEG-4

Adapted from Prof. Hang's slides, Dept. of Electronics Engineering, National Chiao Tung University

MPEG-4 Video Coding Part 2: Object-oriented coding FGS Scalable coding To be dropped from the standard?

Overview of MPEG-4

MPEG-4
ISO/IEC JTC1/SC29/WG11: a standard for multimedia applications
History (Rao & Hwang, Chap. 12):
- Nov. 1992: MPEG started new work item
- Nov. 1994: Call for proposals; many submitted
- Nov. 1995: Subjective testing and tool evaluation
- Jan. 1996: Verification Model (VM1) defined (encoder)
- July 1996: SNHC proposals evaluated
- Nov. 1996: Working draft (WD)
- Apr. 1997: Video VM 7.0 (WD 3.0)
- Nov. 1997: Committee draft (CD): ISO/IEC 14496
- Apr. 1999: International Standard (IS)
- Now: working on newer versions with additional features

MPEG-4 Documents
MPEG-4 (1998): Coding of audio-visual objects
- Part 1: Systems
- Part 2: Visual
- Part 3: Audio
- Part 4: Conformance
- Part 5: Reference Software
- Part 6: DMIF (Delivery Multimedia Integration Framework)
- Part 7: Optimized Software
- Part 8: MPEG-4 on IP
- Part 9: Reference Hardware
- Part 10: Advanced Video Coding (AVC) (JVT, H.264)
- Part 11: Scene Description
- Part 12: ISO base media file format
- Part 13: IPMP extensions
- Part 14: MP4 file format
- Part 15: AVC file format
- Part 16: Multimedia Animation Framework
- Part 17: Streaming Text Format
- Part 18: Font Compression and Streaming

MPEG-4 Goals
- Content-based interactivity: content-based manipulation and editing
- Universal access: robustness in error-prone environments; content-based scalability
- Coding of natural and synthetic data: merging pixel-based video/audio with synthesized graphics/audio/speech in a highly flexible way
- High compression: improved coding efficiency, particularly for low-rate applications
- Flexible syntax and tools
- Texture coding based on H.263

MPEG-4 Audio
Three core coders and some additional tools:
- Parametric coder (PARA): 2 to 16 kbit/s
- CELP-based speech coder: 4 to 24 kbit/s
- Time/frequency mapping based coder: 16 to 64 kbit/s
- SNHC audio tools: text-to-speech, structured audio
Features:
- Improved coding efficiency
- Time-scale change, pitch change (karaoke)
- Scalability: bitrate, bandwidth
- Error resilience

MPEG-1 and MPEG-2
- 1992: MPEG-1 standard, CD-ROM (1.5 Mbit/s)
- 1994: MPEG-2 standard, digital television (SDTV/HDTV) (4 Mbit/s - 24 Mbit/s)
- Components: video compression, audio compression, systems (multiplex)

MPEG-4
- 1999/2000: MPEG-4 standard, flexible multimedia communications (5 kbit/s - 50 Mbit/s)
- Components: video object compression, audio object compression, synthetic audio/speech and video, systems (multiplex and flexible composition)

First Things First
- MPEG-4 attempts to become THE standard for streaming AV media on the Internet and via wireless networks
- Top-quality MPEG-4 audio and video coders for streaming conventional speech, audio and video:
  - excellent AV compression
  - excellent robustness against packet loss
  - scalability of bitrate vs. quality
- MPEG-4 File Format

But MPEG-4 Vision Goes MUCH Further
MPEG-4 attempts to provide a bridge between the WWW and conventional AV media

Interactivities in MPEG-4

Example: MPEG-4 Audio-Visual Scene
[Figure: an AV presentation composed of speech, a video object, a 2D background, and 3D furniture.]

MPEG-4 Systems: BIFS Composition of Scenes
[Figure: scene tree with nodes Scene, Person, 2D Background, Furniture, Audio-visual Presentation, Speech, Video, Globe, Table.]

Integration of Natural and Synthetic Content

Application: Augmented Reality

Application: Telepresence

MPEG-4 New Functionalities
- Streaming AV over mobile networks is of much interest
- More freedom to interact flexibly with what is within scenes
- Support for integration of natural and synthetic AV media ("Virtual Playground")
- Identification and protection of intellectual property and rights on content

MPEG-4: Coding of AV Objects
- AV scenes consist of objects
- Objects can be natural and/or synthetic (A&V, text & graphics, animated faces, arbitrarily shaped or rectangular)
- A compositor composes objects in a scene (A&V, 2D & 3D)
- Binary Format for Scene Description (BIFS)
- Independent of bitrate!

Object Manipulation
[Figure: original, decoded, and decoded-and-manipulated scenes.]

MPEG-4 Part II. Visual

MPEG-4 Video
[Figure: MPEG-4 video tool overview. Labels: Baseline, Extended, Compression, Error Resilience, Scalability, Conventional Coding, Content-based Coding, Still Texture Coding, Object Coding.]

MPEG-4 Video Standard
- MPEG-4 Video provides tools for a number of functionalities
- Many tools are not used
- Integrated approach (baseline and extensions)
- Based on DCT technology (except still texture coding, which is DWT based)

MPEG-4 Baseline and Extensions

Compatibility Issues of MPEG-4 Video Standard
- MPEG-4 Video is compatible with baseline H.263
- Almost compatible with MPEG-1
- Almost compatible with MPEG-2

Basic Structure for Video Standard
[Figure: hybrid coder block diagram. The input video signal is split into 16x16-pixel macroblocks; blocks shown are coder control, transform/scaling/quantization, scaling and inverse transform (embedded decoder), intra/inter motion compensation, motion estimation, and entropy coding of control data, quantized transform coefficients, and motion data; the embedded decoder produces the output video signal.]

Baseline: Rectangular VOP (Conventional Coding)
[Figure: the hybrid coder block diagram (input split into 16x16-pixel macroblocks, coder control, transform/scaling/quantization, scaling and inverse transform, intra/inter motion compensation, motion estimation, entropy coding), annotated as follows.]
- Intra DC/AC prediction (MPEG-4) from neighbouring blocks (B, C, D above A, X, Y)
- 8x8 DCT; transform accuracy problem (MPEG-2/4)
- Q: H.263 or MPEG-2 type
- MB types: 16x16 (block 0) and 8x8 (blocks 0, 1, 2, 3)
- Motion vector accuracy: 1/4 (6-tap filter) (MPEG-4)

DCT and Quantization
- Scan: alternate-horizontal, alternate-vertical, zig-zag
- Adaptive DC prediction; adaptive AC prediction
- Inverse quantizer:
  - Quantization method 1: similar to that of H.263
  - Quantization method 2: similar to that of MPEG-2
  - Optimized nonlinear quantization for the DC coefficient (can be used together with the previous two methods)
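
As an illustration of the zig-zag scan named above, the following is a minimal sketch that generates the conventional zig-zag visiting order for an n x n coefficient block. The function names `zigzag_order` and `zigzag_scan` are hypothetical, and the alternate-horizontal and alternate-vertical scan tables are not reproduced here.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag order."""
    def key(pos):
        r, c = pos
        s = r + c  # index of the anti-diagonal this position lies on
        # Odd anti-diagonals are walked with the row increasing,
        # even anti-diagonals with the row decreasing.
        return (s, r if s % 2 else -r)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def zigzag_scan(block):
    """Flatten a square block (list of rows) into a zig-zag ordered list."""
    return [block[r][c] for r, c in zigzag_order(len(block))]
```

Low-frequency coefficients come first in this order, so the trailing part of the scanned list tends to hold the zeros that run-level coding exploits.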

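Adaptive Intra-DC Prediction
Choose the best DC predictor based on gradients of the DC values (side information is not transmitted):
  if |QDC_A - QDC_B| < |QDC_B - QDC_C| then QDC_X = QDC_C, else QDC_X = QDC_A
[Figure: current block X (8x8) with neighbouring blocks A (left), B (above-left), and C (above); blocks B, C, D lie above A, X, Y in the 16x16 macroblock grid.]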
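
The gradient rule on this slide can be sketched in a few lines. This is a minimal sketch, assuming A is the block to the left of X, B the block above-left, and C the block above, as in the slide's figure; the function names and the "vertical"/"horizontal" labels are illustrative, not taken from the standard.

```python
def predict_dc(qdc_a, qdc_b, qdc_c):
    """Pick the DC predictor for block X from its neighbours.

    qdc_a: quantized DC of the left block (A)
    qdc_b: quantized DC of the above-left block (B)
    qdc_c: quantized DC of the above block (C)
    Returns (predictor, direction). The decoder can repeat the same test,
    so no side information is needed.
    """
    if abs(qdc_a - qdc_b) < abs(qdc_b - qdc_c):
        return qdc_c, "vertical"    # predict from the block above
    return qdc_a, "horizontal"      # predict from the block to the left


def dc_residual(qdc_x, qdc_a, qdc_b, qdc_c):
    """Differential DC value that would be entropy coded for block X."""
    predictor, _ = predict_dc(qdc_a, qdc_b, qdc_c)
    return qdc_x - predictor
```

For example, with A = 10, B = 10, C = 50 the gradient between A and B is the smaller one, so X is predicted from C.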

Adaptive Intra-AC Prediction
- Shaded coefficients are predicted from previously coded blocks.
- The best direction is chosen based on the DC prediction.
- Switched on/off on a macroblock basis; the flag is transmitted.
[Figure: current blocks X and Y in the macroblock with neighbouring blocks A, B, C, D.]

Functionality: Baseline
- Similar to MPEG-2/H.263 structure and algorithms: 8x8 DCT/Q/MC/ME/VLC
- 50% bit rate reduction compared to MPEG-2: intra DC/AC prediction, 8x8 ME, better VLC table
- Widely used in current consumer market: mobile phone, DV, DivX

Syntax

Inside the Bit Stream
[Figure: bitstream hierarchy, shown for Layer 1 and Layer 2: Video Session (VS1 ... VSN) > Video Object (VO1 ... VON) > Video Object Layer (VOL1 ... VOLN) > Group of VOPs (GOV1 ... GOVN) > Video Object Plane (VOP1 ... VOPk, VOPk+1 ... VOPN).]

Syntax
- Video-object Sequence (VS): delivers the complete MPEG-4 visual scene, which may contain 2-D or 3-D natural or synthetic objects.
- Video Object (VO): a particular object in the scene, which can be of arbitrary (non-rectangular) shape, corresponding to an object or the background of the scene.
- Video Object Layer (VOL): supports (multi-layered) scalable coding. A VO can have multiple VOLs under scalable coding, or a single VOL under non-scalable coding.
- Group of Video Object Planes (GOV): groups Video Object Planes together (optional level).
- Video Object Plane (VOP): a snapshot of a VO at a particular moment.

Syntax (1)

Syntax (2)
[Figure: bitstream syntax hierarchy.]
- Video Object Layer
  - video_object_layer_start_code (long header): Header, User Data, Group_of_VideoObjectPlane (optional), Video Object Plane, Video Object Plane, ...
  - video_plane_with_short_header (short header): short_video_start_marker, Header, Gob_layer, Gob_layer, ..., short_video_end_marker
- Gob_layer: Header (optional), Macroblock, Macroblock, ...
- Video Object Plane: vop_start_code, Header, Sprite Data, motion_shape_texture, video_packet_header, motion_shape_texture, ...
- motion_shape_texture: Macroblock, Macroblock, ...
- Macroblock: Header, Shape Data, Motion Vector, Block, Block, ...
- Block: Differential DC Coefficient, Run-Level VLC, Run-Level VLC, ..., End_of_block

Important Header Information (1)
- VOL: video_object_layer_shape, vol_width, vol_height, interlaced, vol_quant_type, not_8_bit, short header, quarter_sample
- VOP: vop_coding_type (vop_prediction_type), vop_coded, intra_dc_vlc_thr, vop_quant

Important Header Information (2)
Macroblock:
- not_coded
- mcbpc: VLC to derive the macroblock type and the coded block pattern for chrominance (Tables B-6 and B-7; also Tables B-1 and B-2)
- mcsel: for S-VOP
- ac_pred_flag: AC prediction
- cbpy: VLC for the pattern of non-transparent Y blocks (Tables B-8 to B-11)

Object based Video Coding

MPEG-4 Visual Standards
- Video object: 2-D representation of natural video (MPEG-1/2, H.263 + shape)
- Face object: 3-D representation of a human face; facial animation parameters; model-based coding
- Mesh object: 2-D deformable geometric shape (triangle)
- Still texture: wavelet-based still image coding using the zero-tree technique

MPEG-4 Visual Decoding Video object decoding

MPEG-4 Video
Based on Verification Model 9 (April 1997)
- Video Object Plane (VOP)
- Motion/texture coding derived from MPEG-1/2 and H.263
- Polygon matching for motion estimation
- Padding for motion estimation / texture coding
- Shape coding: binary and gray-scale
- Sprite coding: extended background scene

Video Object Coding

Video Object Plane (VOP) An arbitrarily shaped image region

VOP Codec Structure

VOP Decoder
[Figure: VOP decoder. The bitstream passes through a demultiplexer into shape decoding, motion decoding, and texture decoding; shape information and motion compensation (with VOP memory) yield the reconstructed VOP, which the compositor combines according to the compositing script to produce the video output.]
Conventional decoding + shape capability

VOP-based vs. Frame-based

VOP-based Coding
- MPEG-4 VOP-based coding also employs motion compensation:
  - An intra-frame coded VOP is called an I-VOP.
  - Inter-frame coded VOPs are called P-VOPs if only forward prediction is employed, or B-VOPs if bi-directional prediction is employed.
- The new difficulty for VOPs: they may have arbitrary shapes, so shape information must be coded in addition to the texture of the VOP.
- Note: texture here refers to the visual content, that is, the gray-level (or chroma) values of the pixels in the VOP.

VOP-based Coding
1. Motion compensation coding: MC + shape capability
   - A padding process converts non-rectangular MBs (boundary MBs) into rectangular MBs, after which conventional ME is applied
2. Texture coding
   - 8x8 DCT with zero padding, or shape-adaptive DCT, followed by Q and VLC
3. Shape coding
   - MC: binary ME or gray-scale ME
   - Context-adaptive arithmetic coding (CAE)

1. VOP-based Motion Compensation
- MC-based VOP coding in MPEG-4 again involves three steps: (a) motion estimation, (b) MC-based prediction, (c) coding of the prediction error.
- Only pixels within the current (target) VOP are considered for matching in MC.
- To facilitate MC, each VOP is divided into macroblocks (MBs). MBs are by default 16x16 in luminance images and 8x8 in chrominance images.
- Padding steps for MB processing: to help match every pixel in the target VOP and to meet the mandatory requirement of rectangular blocks in transform coding (e.g., DCT), a pre-processing step of padding is applied to the reference VOPs prior to motion estimation.

VOP Formulation
- Minimize the number of MBs to be retained
[Figure: a video object plane inside its bounding box, divided into shape blocks (binary alpha blocks).]

Padding

Motion Compensation Tools
Motion-compensated coding modes (I, B, P), similar to MPEG-1/2 and H.263
[Figure: I-VOP, B-VOP, and P-VOP along the time axis.]

Motion Computation
[Figure: a reference P-VOP or I-VOP and a current P-VOP or B-VOP, each inside its bounding box. Labels: modified block (polygon) matching; padded reference pixels for unrestricted block matching; conventional block matching; padded reference pixels for block matching; no matching; reference VOP pixels for block matching; advanced prediction mode (four 8x8 blocks).]
Only pixels within the current (target) VOP are considered for matching in MC.

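Motion Vector Coding
Let C(x+k, y+l) be pixels of the MB in the target VOP, and R(x+i+k, y+j+l) be pixels of the MB in the reference VOP. A Sum of Absolute Differences (SAD) for measuring the difference between the two MBs can be defined as
$$\mathrm{SAD}(i, j) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \left| C(x+k, y+l) - R(x+i+k, y+j+l) \right|$$
where N is the MB size, (i, j) is the candidate displacement, and the sum is taken only over pixels (x+k, y+l) that lie inside the target VOP.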
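
The SAD-based matching described above can be sketched as a small full search. This is a minimal sketch under stated assumptions: images are lists of rows indexed `[row][column]`, `mask` is a hypothetical binary alpha plane (1 = pixel inside the target VOP) used to restrict the sum to VOP pixels, and candidates that would read outside the reference image are simply skipped (the padded reference of MPEG-4 is not modelled).

```python
def sad(target, reference, mask, x, y, i, j, n=16):
    """SAD between the n x n target MB at (x, y) and the reference MB
    displaced by (i, j), counting only pixels inside the target VOP."""
    total = 0
    for l in range(n):          # vertical offset inside the MB
        for k in range(n):      # horizontal offset inside the MB
            if mask[y + l][x + k]:
                total += abs(target[y + l][x + k]
                             - reference[y + j + l][x + i + k])
    return total


def best_motion_vector(target, reference, mask, x, y, search=2, n=16):
    """Exhaustive search over displacements in [-search, search]^2.
    Returns ((i, j), sad_value) for the best candidate found."""
    height, width = len(reference), len(reference[0])
    best_mv, best_cost = None, None
    for j in range(-search, search + 1):
        for i in range(-search, search + 1):
            # Skip candidates that fall outside the reference image.
            if x + i < 0 or y + j < 0 or x + i + n > width or y + j + n > height:
                continue
            cost = sad(target, reference, mask, x, y, i, j, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (i, j), cost
    return best_mv, best_cost
```

The displacement with the smallest SAD is taken as the motion vector of the macroblock.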

2. Texture Coding Tools
- Macroblock entirely inside the VOP: coded by the conventional DCT scheme
- Macroblock partially outside the VOP: blocks partially outside the VOP are coded by DCT after padding
- Macroblock entirely outside the VOP: not coded
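
The three macroblock cases above can be told apart directly from the binary alpha plane. A minimal sketch, assuming `alpha` is a list of rows of 0/1 values whose dimensions are multiples of the macroblock size; the function name and the string labels are illustrative.

```python
def classify_macroblocks(alpha, n=16):
    """Label each n x n macroblock of a binary alpha plane as
    'inside', 'boundary', or 'outside' the VOP."""
    labels = []
    for top in range(0, len(alpha), n):
        row_labels = []
        for left in range(0, len(alpha[0]), n):
            inside = sum(alpha[r][c]
                         for r in range(top, top + n)
                         for c in range(left, left + n))
            if inside == 0:
                row_labels.append("outside")    # not coded
            elif inside == n * n:
                row_labels.append("inside")     # conventional DCT coding
            else:
                row_labels.append("boundary")   # padded before the DCT
        labels.append(row_labels)
    return labels
```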

Texture Coding Tools (2/2)
[Figure: texture decoding pipeline. Stages: coded data, variable length decoding, inverse scan, inverse AC/DC prediction, inverse quantization, inverse DCT, motion compensation (with VOP memory and decoded shape), giving the decoded pels and the reconstructed VOP. Signal labels: QFS[n], SQF[v][u], QF[v][u], F[v][u], f[y][x], d[y][x].]

Texture Coding
Boundary blocks (DCT based):
- Inter blocks: padded with zeros
- Intra blocks: lowpass extrapolation padding
  - Step 1: assign the mean value of the object pels (inside the MB) to the outside pels.
  - Step 2: for each outside pel, starting from the top-left corner, set f(i,j) = 1/4 [f(i,j-1) + f(i-1,j) + f(i,j+1) + f(i+1,j)]. If any of the four reference pels is outside the block, exclude it and adjust the 1/4 factor accordingly.
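
The two padding steps for intra boundary blocks can be sketched as follows. This is a minimal sketch, assuming `block` and `mask` are lists of rows (mask value 1 = object pel), that the filter of step 2 is applied in place in raster order so that already-filtered pels feed later ones, and that results stay as floating-point values (integer rounding is left out).

```python
def lpe_pad(block, mask):
    """Lowpass extrapolation padding of one boundary block."""
    rows, cols = len(block), len(block[0])
    inside = [block[r][c] for r in range(rows) for c in range(cols) if mask[r][c]]
    if not inside:
        raise ValueError("block contains no object pels")
    mean = sum(inside) / len(inside)

    # Step 1: fill every outside pel with the mean of the object pels.
    out = [[block[r][c] if mask[r][c] else mean for c in range(cols)]
           for r in range(rows)]

    # Step 2: smooth each outside pel with its available 4-neighbours,
    # scanning from the top-left corner.
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                continue
            neighbours = [out[rr][cc]
                          for rr, cc in ((r, c - 1), (r - 1, c), (r, c + 1), (r + 1, c))
                          if 0 <= rr < rows and 0 <= cc < cols]
            out[r][c] = sum(neighbours) / len(neighbours)
    return out
```

Object pels are left untouched; only the outside pels change, which keeps the padded block smooth across the object boundary before the 8x8 DCT.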

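Shape Adaptive DCT for Boundary MBs
Shape Adaptive DCT (SA-DCT) is another texture coding method for boundary MBs. Due to its effectiveness, SA-DCT has been adopted for coding boundary MBs in MPEG-4 Version 2. It uses the 1D DCT-N transform and its inverse, IDCT-N:
$$F(u) = \sqrt{\frac{2}{N}}\, C(u) \sum_{i=0}^{N-1} \cos\frac{(2i+1)u\pi}{2N}\, f(i), \qquad f(i) = \sum_{u=0}^{N-1} \sqrt{\frac{2}{N}}\, C(u) \cos\frac{(2i+1)u\pi}{2N}\, F(u)$$
where i, u = 0, 1, ..., N-1, and C(u) = \sqrt{2}/2 for u = 0 and C(u) = 1 otherwise.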
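
A variable-length 1D DCT is the building block of SA-DCT. The following is a minimal sketch, assuming the orthonormal DCT-N form; `sa_dct_columns` is a hypothetical helper that shows only the first (vertical) pass, in which the object pels of each column are gathered and transformed with a DCT whose length equals the number of object pels in that column. The horizontal pass and quantization are not shown.

```python
import math


def dct_n(f):
    """Orthonormal 1D DCT of a sequence of arbitrary length N."""
    n = len(f)
    out = []
    for u in range(n):
        cu = math.sqrt(0.5) if u == 0 else 1.0
        acc = sum(f[i] * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                  for i in range(n))
        out.append(math.sqrt(2.0 / n) * cu * acc)
    return out


def idct_n(coeffs):
    """Inverse of dct_n."""
    n = len(coeffs)
    out = []
    for i in range(n):
        acc = 0.0
        for u in range(n):
            cu = math.sqrt(0.5) if u == 0 else 1.0
            acc += (math.sqrt(2.0 / n) * cu * coeffs[u]
                    * math.cos((2 * i + 1) * u * math.pi / (2 * n)))
        out.append(acc)
    return out


def sa_dct_columns(block, mask):
    """Vertical SA-DCT pass: transform the object pels of each column."""
    result = []
    for c in range(len(block[0])):
        pels = [block[r][c] for r in range(len(block)) if mask[r][c]]
        result.append(dct_n(pels) if pels else [])
    return result
```

Because each column is transformed with its own length N, no coefficients are spent on pels outside the object.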

SA-DCT Flow

3. Shape Coding
The shape information is called alpha planes.
- Binary alpha plane
  - Code the boundaries using context-based arithmetic encoding (CAE)
- Gray-scale alpha plane
  - Consists of support and alpha values (texture)
  - Support is coded using CAE (as a binary alpha plane)
  - Alpha values (texture) are coded using motion-compensated DCT (similar to the texture coding)
- Motion compensation for shape is similar to that of texture, but simpler

Shape Coding
[Figure: shape coding illustration; surviving labels: binary, X, arbitrary.]

CAE
Context-based Arithmetic Encoding (CAE): predict the current pel value (1 or 0) from a conditional probability (table) selected by the context, that is, by the values of previously coded neighbouring pels.
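
To make the idea of a context concrete, the sketch below builds a context number from causal neighbours and estimates how many bits an ideal arithmetic coder would spend on a binary mask. Everything specific here is an assumption for illustration: the 4-pel `TEMPLATE` is hypothetical (it is not the template of the standard), and the probabilities are learned adaptively by counting, whereas the slide refers to a probability table.

```python
import math

# Hypothetical causal template as (row offset, column offset):
# left, above-left, above, above-right.
TEMPLATE = ((0, -1), (-1, -1), (-1, 0), (-1, 1))


def context_index(mask, r, c):
    """Pack the template pels around (r, c) into an integer context.
    Pels outside the mask are treated as 0."""
    ctx = 0
    for dr, dc in TEMPLATE:
        rr, cc = r + dr, c + dc
        in_range = 0 <= rr < len(mask) and 0 <= cc < len(mask[0])
        ctx = (ctx << 1) | (mask[rr][cc] if in_range else 0)
    return ctx


def estimated_bits(mask):
    """Ideal code length (in bits) of the mask under an adaptive,
    context-conditioned probability model."""
    counts = {}  # context -> (count of zeros, count of ones), both start at 1
    bits = 0.0
    for r in range(len(mask)):
        for c in range(len(mask[0])):
            ctx = context_index(mask, r, c)
            zeros, ones = counts.get(ctx, (1, 1))
            pel = mask[r][c]
            p = (ones if pel else zeros) / (zeros + ones)
            bits += -math.log2(p)
            counts[ctx] = (zeros + (1 - pel), ones + pel)
    return bits
```

When neighbouring pels predict the current pel well, as in smooth shapes, the conditional probabilities become skewed and the cost per pel drops well below one bit.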

Other parts

Others
- Scalability: object scalability, temporal scalability, spatial scalability
- Error resilience: H.263 marker, MPEG-4 marker
- Sprite coding
- SNHC visual: face and body, dynamic 2-D meshes
- Scalable still texture: wavelet with zero-tree

MPEG-4 Visual Decoding Video object decoding

Sprite Coding A sprite is a graphic image that can freely move around within a larger graphic image or a set of images. To separate the foreground object from the background, we introduce the notion of a sprite panorama: a still image that describes the static background over a sequence of video frames. The large sprite panoramic image can be encoded and sent to the decoder only once at the beginning of the video sequence. When the decoder receives separately coded foreground objects and parameters describing the camera movements thus far, it can reconstruct the scene in an efficient manner.
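
The reconstruction step described above can be sketched for the simplest camera model. This is a minimal sketch, assuming the camera motion is a pure translation given as an offset into the sprite panorama; the general case uses richer warping parameters, which are not modelled. All names are illustrative.

```python
def reconstruct_frame(sprite, offset, size, foreground, fg_mask, fg_pos):
    """Compose one frame from a background sprite and a foreground object.

    sprite:     background panorama, list of rows
    offset:     (x, y) of the current view inside the sprite
    size:       (width, height) of the output frame
    foreground: decoded foreground object pels, list of rows
    fg_mask:    binary shape of the foreground object (1 = object pel)
    fg_pos:     (x, y) of the object's top-left corner in the frame
    """
    ox, oy = offset
    width, height = size
    # Cut the visible background window out of the panorama.
    frame = [list(row[ox:ox + width]) for row in sprite[oy:oy + height]]
    fx, fy = fg_pos
    # Paste the foreground pels that belong to the object.
    for r, (pel_row, mask_row) in enumerate(zip(foreground, fg_mask)):
        for c, (pel, inside) in enumerate(zip(pel_row, mask_row)):
            if inside and 0 <= fy + r < height and 0 <= fx + c < width:
                frame[fy + r][fx + c] = pel
    return frame
```

Only the offset and the foreground object change from frame to frame, which is why the sprite needs to be sent just once.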

Sprite Coding
[Figure: sprite + foreground object = decoded frame.]

2-D Mesh Coding Objects are represented by 2-D polygons. Node positions and motion vectors are coded.

3-D Face Animation A 3-D face model is defined in terms of 68 Face Animation Parameters (FAPs)

FGS

Fine Granularity Scalability (FGS)
Amendment 2 (2001)
- Technique: base layer + enhancement layer
  - Enhancement layer: bit-plane coding
  - Tuned Huffman coding
- Applications:
  - Internet streaming
  - Broadcasting
  - Unicast with/without feedback control
  - Resource sharing
  - Wireless communications

Bandwidth Scalability
[Figure: a fine-granular scalable enhancement layer on top of an MPEG-4 base layer of I, P/B, P/B, P/B, P/B frames.]

Wireless Applications

FGS Advantages
[Figure: received quality (bad, moderate, good) versus channel bandwidth (low to high), comparing traditional source coding, the traditional distortion-rate curve, and the new objective.]

FGS Principles
- Base layer: MPEG-4 motion-compensated DCT coding
- Enhancement layer: DCT residuals (the quantization errors of the base layer) are bit-plane coded
- The enhancement-layer bitstream can be truncated to any number of bits per frame
- The decoder may ignore some enhancement bits
- Reconstructed video quality increases with the number of decoded bits
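
The effect of truncating the enhancement layer can be illustrated on a handful of DCT residuals. This is a minimal sketch, assuming integer residuals and that the signs are available separately (in a real bitstream the sign travels with the coefficient); entropy coding of the planes is not modelled and the function names are illustrative.

```python
def to_bitplanes(residuals):
    """Split the magnitudes of the residuals into bit-planes, MSB first."""
    mags = [abs(v) for v in residuals]
    n_planes = max(mags).bit_length() if any(mags) else 0
    return [[(m >> p) & 1 for m in mags] for p in range(n_planes - 1, -1, -1)]


def from_bitplanes(planes, signs, n_planes):
    """Rebuild residuals from the first len(planes) of n_planes bit-planes.
    Planes that were not received are treated as zero."""
    mags = [0] * len(signs)
    for idx, plane in enumerate(planes):
        shift = n_planes - 1 - idx
        for k, bit in enumerate(plane):
            mags[k] |= bit << shift
    return [s * m for s, m in zip(signs, mags)]
```

For the residuals [10, 0, -6, 3] there are four bit-planes. Decoding only the two most significant planes gives [8, 0, -4, 0]; each further plane that arrives halves the worst-case error, which is the fine-granular behaviour the slide describes.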

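FGS Encoder
[Figure: FGS encoder. Base layer: input video, motion estimation, motion compensation, DCT, Q, and VLC producing the base layer bitstream, with Q^-1, IDCT, clipping, and frame memory in the reconstruction loop. Enhancement layer encoding: DCT, bit-plane shift, find maximum, and bit-plane VLC producing the enhancement bitstream.]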

FGS Decoder
[Figure: FGS decoder. Enhancement layer decoding: enhancement bitstream, bit-plane VLD, bit-plane shift, IDCT, and clipping producing the enhancement video. Base layer: base layer bitstream, VLD, Q^-1, IDCT, motion compensation with frame memory, and clipping producing the base layer video (optional output).]

Bitplane Coding
[Figure: a block of 8x8 DCT coefficient differences, the same block after zigzag ordering (with runs of 12, 18, and 20 zeros marked), its sign and bit-plane representation from MSB to LSB, and the resulting (RUN, EOP) symbols. Symbols shown in the figure: (0, 1), (28, 1), (6, 0), (2, 0), (31, 1), (0, 0), (0, 0), (26, 1).]
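
The (RUN, EOP) symbols in the figure can be formed from one zigzag-ordered bit-plane as sketched below. This is a minimal sketch, assuming RUN counts the zeros preceding each 1 and EOP is 1 only for the last 1 of the plane; returning an empty list for an all-zero plane is a simplification made here, and the VLC tables that code the symbols are not modelled.

```python
def run_eop_symbols(plane):
    """Convert one bit-plane (list of 0/1 in zigzag order) into
    (RUN, EOP) symbols."""
    ones = [k for k, bit in enumerate(plane) if bit]
    symbols = []
    previous = -1
    for n, pos in enumerate(ones):
        run = pos - previous - 1              # zeros before this 1
        eop = 1 if n == len(ones) - 1 else 0  # last 1 of the plane?
        symbols.append((run, eop))
        previous = pos
    return symbols
```

A plane whose only 1 is the first coefficient yields the single symbol (0, 1), matching the MSB symbol shown in the figure. Zeros after the last 1 are never coded because EOP = 1 already ends the plane.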

Profiles in Version 1
- Simple Profile: basic tools of I/P-VOP, AC/DC prediction, and 4 MV unrestricted
- Core Profile: Simple + binary shape, quantization method 1/2, and B-VOP
- Main Profile: Core + grey shape, interlace, and sprite
- Simple Scalable Profile: Simple + spatial and temporal scalability and B-VOP
- N-Bit Profile: Core + N-bit
- Animated 2D Mesh: Core + scalable still texture, 2D dynamic mesh
- Basic Animated Texture: binary shape, scalable still texture, and 2D dynamic mesh
- Still Scalable Texture: scalable still texture
- Simple Face: face animation parameters

Profiles in Version 2
- Advanced Real Time Simple Profile: Simple + advanced error resilience + improved temporal scalability
- Core Scalable Profile: Simple Scalable + Core + SNR, spatial/temporal scalability for region or object of interest
- Advanced Coding Efficiency Profile: tools for improving coding efficiency for both rectangular and arbitrarily shaped objects
- Advanced Scalable Texture Profile: tools for decoding arbitrarily shaped texture and still images, including scalable shape coding
- Advanced Core Profile: Core Profile + tools for decoding arbitrarily shaped video objects and arbitrarily shaped scalable still images
- Simple Face and Body Animation Profile: simple face animation + body animation

Additional Profiles
- Advanced Simple Profile: Simple Profile + efficient coding tools (B-frames, 1/4-pel MC)
- Fine Granularity Scalable Profile
  - Advanced Simple Profile as base layer
  - Fine granularity scalability (FGS)
  - Fine granularity scalability, temporal (FGST)
- Simple Studio Profile
  - I-frames only
  - Arbitrary shape
  - Multiple alpha channels
  - Up to 2 Gbps
- Core Studio Profile: Simple Studio Profile + P-frames

MPEG-4 Video Profiles
- Profiles limit the set of tools in a decoding device
- Levels specify parameter ranges (limit complexity)
[Figure: profile chart grouped by standard stage (IS, AMD-1, AMD-2), by shape support (rectangular frame, arbitrary shape, additional tools), and by scalability (no scalability, spatial & temporal scalability, quality & temporal scalability, higher error resilience). Profiles shown: Simple, Simple Scalable, Core, Core Scalable, Main, Advanced Realtime Simple, Advanced Coding Efficiency, Advanced Simple, Fine Granularity Scalable, Simple Studio, Core Studio.]

Levels
Columns, in order: visual profile and level | typical visual session size (indicative) | maximum total number of objects | maximum number per type | maximum number of different quantization tables | max. total reference memory (MB units) | maximum number of MB/sec | cost function equivalent I-MB/sec | maximum vbv_buffer_size (units of 16384 bits) | max. video packet length (bits) | max. sprite size (MB units) | wavelet restrictions | max. bitrate | max. enhancement layers per object
Main L4 | 1920 x 1088 | 32 | 32 x Main or Core or Simple | 4 | 16320 | 489600 | 1290100 | 380 | 16384 | 65280 | 1 taps default integer filter | 38.4 Mbit/s | 1 temporal, 2 spatial
Main L3 | CCIR 601 | 32 | 32 x Main or Core or Simple | 4 | 3240 | 97200 | 256200 | 160 | 16384 | 6480 | 1 taps default integer filter | 15 Mbit/s | 1
Main L2 | CIF | 16 | 16 x Main or Core or Simple | 4 | 792 | 23760 | 62700 | 40 | 8192 | 1584 | 1 taps default integer filter | 2 Mbit/s | 1
Core L2 | CIF | 16 | 16 x Core or Simple | 4 | 792 | 23760 | 62700 | 40 | 8192 | N. A. | N. A. | 2 Mbit/s | 1
Core L1 | QCIF | 4 | 4 x Core or Simple | 4 | 198 | 5940 | 15700 | 8 | 4096 | N. A. | N. A. | 384 kbit/s | 1
Simple Scalable L2 | CIF | 4 | 4 x Simple or Simple Scalable | 1 | 792 | 23760 | N. A. | 20 | 4096 | N. A. | N. A. | 256 kbit/s | 1 spatial or temporal enhancement layer
Simple Scalable L1 | CIF | 4 | 4 x Simple or Simple Scalable | 1 | 495 | 7425 | N. A. | 20 | 2048 | N. A. | N. A. | 128 kbit/s | 1 spatial or temporal enhancement layer
Simple L3 | CIF | 4 | 4 x Simple | 1 | 396 | 11880 | N. A. | 20 | 8192 | N. A. | N. A. | 384 kbit/s | N. A.
Simple L2 | CIF | 4 | 4 x Simple | 1 | 396 | 5940 | N. A. | 20 | 4096 | N. A. | N. A. | 128 kbit/s | N. A.
Simple L1 | QCIF | 4 | 4 x Simple | 1 | 99 | 1485 | N. A. | 5 | 2048 | N. A. | N. A. | 64 kbit/s | N. A.

References
- F. Pereira and T. Ebrahimi, The MPEG-4 Book, Prentice-Hall, 2002.
- A. Puri and T. Chen, eds., Multimedia Systems, Standards, and Networks, Marcel Dekker, 2000.
- ISO/IEC JTC1/SC29/WG11/Doc. N1869: MPEG-4 Video Verification Model Version 9.0, Oct. 1997.
- Image Communication: Tutorial Issue on MPEG-4, Jan. 2000.
- Weiping Li, "Overview of fine granularity scalability in MPEG-4 video standard," IEEE Trans. on Circuits and Systems for Video Tech., pp. 301-317, March 2001.
- Weiping Li et al., "Fine granularity scalability in MPEG-4 for streaming video," IEEE ISCAS 2000, pp. 299-302.

- H.M. Radha et al., "MPEG-4 fine-grained scalable video coding method for multimedia streaming over IP," IEEE Trans. on Multimedia, pp. 53-68, March 2001.
- F. Wu et al., "A framework for efficient progressive fine granularity scalable video coding," IEEE Trans. on Circuits and Systems for Video Tech., vol. 11, no. 3, March 2001.
- ISO/IEC MPEG and ITU-T VCEG, Joint Committee Draft (CD), JVT-C167, May 2002.
- ISO/IEC MPEG and ITU-T VCEG, Low Complexity Transform and Quantization, JVT-B038, Feb. 2002.
- ITU-T VCEG, H.26L Test Model Long-Term Number 9 (TML-9) draft 0, VCEG-N83d1, Dec. 2001.
- ITU-T VCEG, New Intra Prediction Modes, VCEG-N54, Sept. 2001.