Audio and video compression

4.1 Introduction

Unlike text and images, both audio and most video signals are continuously varying analog signals. The compression algorithms associated with digitized audio and video are different from those associated with text and images.

4.2 Audio compression

Speech and non-speech signals are encoded using different approaches.

4.2.1 Speech coding

Differential pulse code modulation (DPCM) is a derivative of standard PCM (G.711). It exploits the fact that, for most audio signals, the range of the differences in amplitude between successive samples of the audio waveform is less than the range of the actual sample amplitudes, so the differences can be encoded with fewer bits than the samples themselves.

In adaptive differential PCM (ADPCM), fewer bits are used to encode smaller difference values than larger ones. (G.721, G.722 & G.726)

DPCM and ADPCM can also be used to encode non-speech signals.

In linear predictive coding (LPC), a speech signal is analyzed to extract its perceptual features, including pitch and formant frequencies, and these features are then encoded. (LPC-10, G.728, G.723 & G.729)

CYH/MMT/CmpAV/p.1
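The DPCM idea from section 4.2.1 can be illustrated with a minimal Python sketch. It assumes integer samples and leaves the differences unquantized, whereas a real coder quantizes them and ADPCM also adapts the step size. The names `dpcm_encode` and `dpcm_decode` are illustrative and are not taken from any standard.

```python
def dpcm_encode(samples):
    """Return the first sample followed by the successive differences."""
    if not samples:
        return []
    diffs = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        diffs.append(cur - prev)
    return diffs


def dpcm_decode(diffs):
    """Rebuild the samples by accumulating the differences."""
    samples = []
    total = 0
    for d in diffs:
        total += d
        samples.append(total)
    return samples


samples = [100, 104, 109, 111, 108, 102]
encoded = dpcm_encode(samples)
print(encoded)                          # [100, 4, 5, 2, -3, -6]
print(dpcm_decode(encoded) == samples)  # True
```

After the first value, the encoded numbers span a much smaller range (-6 to 5) than the samples themselves (100 to 111), which is what allows fewer bits per value.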
Summary of speech compression standards and their applications:

Standard   Compression technique       Compressed bit rate (kbps)   Quality     Example applications
G.711      PCM + companding            64                           Good        PSTN/ISDN telephony
G.721      ADPCM                       32                           Good        Telephony at reduced bit rates
                                       16                           Fair
G.722      ADPCM with subband coding   64                           Excellent   Audio conferencing
                                       56/48                        Good
G.726      ADPCM with subband coding   40/32                        Good        General telephony at reduced bit rates
                                       24/16                        Fair
LPC-10     LPC                         2.4/1.2                      Poor        Telephony in military networks
G.728      Code-excited LPC (CELP)     16                           Good        Low delay/low bit rate telephony
G.729      CELP                        8                            Good        Telephony in cellular networks
G.729(A)   CELP                        8                            Good        Simultaneous telephony and data (fax)
G.723.1    CELP                        6.3                          Good        Video and internet telephony
                                       5.3                          Fair

4.2.2 Perceptual coding

The audio signal is coded using a psychoacoustic model, which describes the limitations of the human ear. The ear is more sensitive to some signals than to others.

Frequency masking: A strong signal may reduce the sensitivity of the ear to other signals that are close to it in frequency.

Temporal masking: After the ear hears a loud sound, it takes a short but finite time before it can hear a quieter sound.
MPEG audio coders

An international standard based on this approach is defined in ISO Recommendation 11172-3.

Summary of MPEG layer 1, 2 and 3 perceptual encoders:

Layer   Application                                    Compressed bit rate   Quality                                   Example input-to-output delay
1       Digital audio cassette                         32-448 kbps           Hi-fi quality at 192 kbps per channel     20 ms
2       Digital audio and digital video broadcasting   32-192 kbps           Near CD-quality at 128 kbps per channel   40 ms
3       CD-quality audio over low bit rate channel     64 kbps               CD-quality at 64 kbps per channel         60 ms

A higher layer makes better use of the psychoacoustic model, and hence a higher compression ratio can be achieved.

The 3 layers require increasing levels of complexity (and hence cost) to achieve a particular perceived quality. The choice of layer and bit rate is therefore often a compromise between the desired perceived quality and the available bit rate.
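To put the per-channel bit rates of the three layers in perspective, the short calculation below compares them with uncompressed CD audio. The CD figures (44.1 kHz sampling, 16 bits per sample, giving 705.6 kbps per channel) are an assumption brought in for illustration and are not stated in these notes. The function name `compression_ratio` is likewise illustrative.

```python
CD_SAMPLE_RATE_HZ = 44_100   # assumption: CD sampling rate
CD_BITS_PER_SAMPLE = 16      # assumption: CD sample size


def compression_ratio(compressed_kbps):
    """Ratio of the uncompressed CD bit rate per channel to the compressed rate."""
    raw_kbps = CD_SAMPLE_RATE_HZ * CD_BITS_PER_SAMPLE / 1000  # 705.6 kbps
    return raw_kbps / compressed_kbps


# Quality reference points from the layer summary (kbps per channel).
for layer, kbps in [(1, 192), (2, 128), (3, 64)]:
    ratio = compression_ratio(kbps)
    print(f"Layer {layer}: {kbps} kbps per channel, about {ratio:.1f}:1")
```

Under these assumptions the ratios come out at roughly 3.7:1, 5.5:1 and 11:1, which matches the statement that a higher layer achieves more compression for a comparable perceived quality.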
Dolby audio coders

In AC-1, the bit allocation information of the quantized subband samples is directly encoded and embedded in the bitstream.

In AC-2, this information is indirectly encoded and has to be estimated at the decoder.

In AC-3, additional information is transmitted to compensate for the estimation error.

The acoustic quality of the MPEG and Dolby audio coders was found to be comparable.

Summary of compression standards for general audio:

Standard              Compressed bit rate   Quality                     Example applications
MPEG audio
  Layer 1             32-448 kbps           Hi-fi quality at 192 kbps   Digital audio cassettes
  Layer 2             32-192 kbps           Near CD at 128 kbps         Digital audio and digital video broadcasting
  Layer 3             64 kbps               CD quality                  CD-quality over low bit rate channels
Dolby audio coders
  AC-1                512 kbps              Hi-fi quality               Radio and television satellite relays
  AC-2                256 kbps              Hi-fi quality               PC sound cards
  AC-3                192 kbps              Near CD quality             Digital video broadcasting
4.3 Video compression

There is not just a single standard associated with video but rather a range of standards, each targeted at a particular application domain.

4.3.1 Video compression principles

Video is simply a sequence of digitized pictures; it is also referred to as moving pictures.

A video sequence can be encoded frame by frame with the JPEG algorithm. This approach is known as motion JPEG.

In addition to the spatial redundancy present in each frame, considerable redundancy is often present between successive frames.

Each frame is classified as one of 3 basic frame types (I-, P- and B-frames), and the types are encoded differently.
I-frames:

- I-frames are encoded independently using the JPEG algorithm.
- I-frames are inserted into the output stream relatively frequently.
- I-frames are used as access points for random access and FF/FR functionality in the bitstream.

P-frames:

- Frames are partitioned into blocks of size 16x16 (macroblocks).
- To encode a P-frame, the contents of each macroblock in the target frame are compared on a pixel-by-pixel basis with the contents of the reference frame to find the best-matched block of equal size.
- The reference frame can be a P- or I-frame.
- The (x,y) offset between the macroblock being encoded and the best-matched block is known as the motion vector.
- This motion-vector-searching process is known as motion estimation.
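The motion estimation step described for P-frames can be sketched as an exhaustive block search. This is a minimal illustration only: the matching criterion (sum of absolute differences), the search range and the function names `sad` and `find_motion_vector` are assumptions, since the notes do not say how the best match is measured. Frames are plain lists of rows of pixel values.

```python
def sad(target, reference, tx, ty, rx, ry, size):
    """Sum of absolute differences between two size x size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(target[ty + dy][tx + dx] - reference[ry + dy][rx + dx])
    return total


def find_motion_vector(target, reference, tx, ty, size=16, search=7):
    """Full search: return ((dx, dy), cost) of the best-matched block."""
    height, width = len(reference), len(reference[0])
    best_vector = (0, 0)
    best_cost = sad(target, reference, tx, ty, tx, ty, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = tx + dx, ty + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(target, reference, tx, ty, rx, ry, size)
            if cost < best_cost:
                best_vector, best_cost = (dx, dy), cost
    return best_vector, best_cost


# Toy frames: the target is the reference shifted right by 2 and down by 1.
W = H = 12
reference = [[7 * x + 13 * y for x in range(W)] for y in range(H)]
target = [[reference[y - 1][x - 2] if x >= 2 and y >= 1 else 0
           for x in range(W)] for y in range(H)]

print(find_motion_vector(target, reference, 4, 4, size=4, search=3))
# ((-2, -1), 0)
```

The vector (-2, -1) says that the block at (4, 4) in the target frame is found 2 pixels to the left and 1 pixel up in the reference frame. A small 4x4 block is used here only to keep the example short; the notes specify 16x16 macroblocks.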
- A prediction of the target frame is made from the reference frame, based on the motion vectors obtained.
- The difference between the predicted frame and the actual target frame is known as the prediction error.
- Motion compensation: Additional bits are required to encode the prediction error so as to compensate for the difference, if necessary.

B-frames:

- To encode a B-frame, any motion is estimated with reference to both the immediately preceding I- or P-frame and the immediately succeeding P- or I-frame.
- B-frames provide the highest level of compression.
- B-frames are not involved in the coding of other frames, and hence they do not propagate errors.
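The prediction and prediction error described for P-frames can be sketched for a single block as follows. This is an illustration under simplifying assumptions: the error is kept exactly (a real coder transforms and quantizes it), the motion vector is assumed to stay inside the reference frame, and the function names are hypothetical.

```python
def predict_block(reference, x, y, vector, size):
    """Copy the block that the motion vector points to in the reference frame."""
    dx, dy = vector
    rows = reference[y + dy:y + dy + size]
    return [row[x + dx:x + dx + size] for row in rows]


def prediction_error(actual, predicted):
    """Pixel-by-pixel difference between the actual and predicted blocks."""
    return [[a - p for a, p in zip(a_row, p_row)]
            for a_row, p_row in zip(actual, predicted)]


def reconstruct(predicted, error):
    """Decoder side: add the prediction error back onto the prediction."""
    return [[p + e for p, e in zip(p_row, e_row)]
            for p_row, e_row in zip(predicted, error)]


reference = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
]
actual = [[61, 70], [100, 108]]  # 2x2 block at (2, 1) in the target frame

predicted = predict_block(reference, 2, 1, (-1, 0), 2)
error = prediction_error(actual, predicted)
print(predicted)                               # [[60, 70], [100, 110]]
print(error)                                   # [[1, 0], [0, -2]]
print(reconstruct(predicted, error) == actual)  # True
```

The error values are small compared with the pixel values, so they can be coded with few bits. When the prediction is exact, no error needs to be sent at all.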
The number of frames between successive I-frames is known as a group of pictures (GOP).

The number of frames between a P-frame and the immediately preceding I- or P-frame is called the prediction span.

The order in which the frames are encoded and transmitted is changed to minimize the time required to decode the frames.

A fourth type of frame, known as a PB-frame, has also been defined. Two neighboring P- and B-frames are encoded as if they were a single frame.

A fifth type of frame, known as a D-frame, has been defined for use in movie/video-on-demand applications.
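The reordering of frames for transmission can be sketched as follows. The sketch assumes the usual arrangement in which each B-frame is sent after both of the reference frames it depends on; the function name `transmission_order` and the example frame sequence are illustrative.

```python
def transmission_order(display_order):
    """Reorder frames so each B-frame follows both of its reference frames."""
    output = []
    pending_b = []
    for index, frame_type in enumerate(display_order):
        if frame_type == "B":
            # Hold the B-frame until its succeeding reference has been sent.
            pending_b.append((index, frame_type))
        else:
            # I- or P-frame: send it first, then the B-frames waiting on it.
            output.append((index, frame_type))
            output.extend(pending_b)
            pending_b = []
    # Trailing B-frames have no succeeding reference in this sequence.
    output.extend(pending_b)
    return output


order = transmission_order("IBBPBBPBBI")
print(" ".join(f"{t}{i}" for i, t in order))
# I0 P3 B1 B2 P6 B4 B5 I9 B7 B8
```

With this order the decoder always holds both reference frames by the time a B-frame arrives, so it can decode each B-frame as soon as it is received.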
Basic bitstream format:

- Type: the type of frame (I, P or B)
- Address: identifies the location of the macroblock in the frame
- Quantization value: the threshold value used to quantize all DCT coefficients in the macroblock
- Motion vector: the encoded vector
- Block present: indicates which blocks in the macroblock are present

Typical figures of the compression ratios:

- I-frames: 10~20:1
- P-frames: 20~30:1
- B-frames: 30~50:1
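The per-frame figures above can be combined into an overall ratio for a whole GOP. The sketch below assumes that all uncompressed frames have the same size, so a frame compressed at r:1 contributes 1/r of a frame to the output. The 12-frame pattern and the function name `gop_compression_ratio` are hypothetical examples, not values from these notes.

```python
def gop_compression_ratio(pattern, ratios):
    """Overall ratio for a GOP, assuming equal-sized uncompressed frames."""
    compressed = sum(1.0 / ratios[frame_type] for frame_type in pattern)
    return len(pattern) / compressed


pattern = "IBBPBBPBBPBB"  # hypothetical GOP: 1 I-frame, 3 P-frames, 8 B-frames
low = gop_compression_ratio(pattern, {"I": 10, "P": 20, "B": 30})
high = gop_compression_ratio(pattern, {"I": 20, "P": 30, "B": 50})
print(f"{low:.1f}:1 to {high:.1f}:1")
# 23.2:1 to 38.7:1
```

The overall figure is dominated by the B-frames because they are the most numerous, which is why GOP structures with many B-frames compress well.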
4.3.2 H.261

H.261 has been defined by the ITU-T for the provision of video telephony and videoconferencing services over an ISDN.

It supports I- and P-frames only.

Encoding format:

- Type: indicates whether the macroblock is intracoded or intercoded
- Address: identifies the location of the macroblock in the frame
- Quantization value: the threshold value used to quantize all DCT coefficients in the macroblock
- Motion vector: the encoded vector
- Coded block pattern: indicates which blocks in the macroblock are present
- Picture start code: indicates the start of a new frame
- Temporal reference: a timestamp that lets the decoder synchronize the video information with the audio information
- Picture type: indicates whether the frame is encoded as an I- or P-frame
- GOB start code: a marker used for resynchronization in case of error

A group of (macro)blocks (GOB) is a structure consisting of 3x11 macroblocks.
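The GOB structure can be related to the frame size with a short calculation. The macroblock size and the 3x11 GOB layout come from the text; the CIF and QCIF luminance resolutions (352x288 and 176x144) are an assumption brought in for illustration, and the function name `gobs_per_frame` is hypothetical.

```python
MACROBLOCK_SIZE = 16      # pixels per side, from the text
GOB_MACROBLOCKS = 3 * 11  # macroblocks per GOB, from the text

# Assumption: luminance resolutions of the two digitization formats.
FORMATS = {"CIF": (352, 288), "QCIF": (176, 144)}


def gobs_per_frame(width, height):
    """Return (macroblocks per frame, GOBs per frame)."""
    macroblocks = (width // MACROBLOCK_SIZE) * (height // MACROBLOCK_SIZE)
    return macroblocks, macroblocks // GOB_MACROBLOCKS


for name, (width, height) in FORMATS.items():
    macroblocks, gobs = gobs_per_frame(width, height)
    print(f"{name}: {macroblocks} macroblocks, {gobs} GOBs")
# CIF: 396 macroblocks, 12 GOBs
# QCIF: 99 macroblocks, 3 GOBs
```

Each GOB begins with its own start code, so after an error the decoder loses at most the remainder of one GOB rather than the remainder of the frame.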
4.3.3 H.263

H.263 has been defined by the ITU-T for use in a range of real-time video applications over wireless networks and PSTNs. The applications include video telephony, videoconferencing, security surveillance, interactive game playing and so on.

The H.263 standard has a number of advanced coding options compared with H.261:

- Progressive scanning with a refresh rate of either 15 or 7.5 fps.
- Support for I-, P-, B- and PB-frames.
- Motion vectors, if necessary, are allowed to point outside the frame area.
- Schemes such as error tracking, independent segment decoding and reference picture selection are included in the standard; they aim to minimize the effects of errors on neighboring GOBs.
- An error concealment scheme is incorporated into the decoder to mask errors from the viewer.
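When a motion vector points outside the frame area, the decoder still needs pixel values for the part of the block that lies beyond the edge. One common convention, assumed here for illustration, is to substitute the nearest edge pixel. The function name `reference_pixel` is hypothetical.

```python
def reference_pixel(frame, x, y):
    """Return frame[y][x], clamping coordinates that fall outside the frame."""
    height, width = len(frame), len(frame[0])
    cx = min(max(x, 0), width - 1)   # clamp the column to 0..width-1
    cy = min(max(y, 0), height - 1)  # clamp the row to 0..height-1
    return frame[cy][cx]


frame = [
    [1, 2, 3],
    [4, 5, 6],
]
print(reference_pixel(frame, 1, 1))   # 5 (inside the frame)
print(reference_pixel(frame, -2, 0))  # 1 (left of the frame)
print(reference_pixel(frame, 5, 1))   # 6 (right of the frame)
print(reference_pixel(frame, 1, -3))  # 2 (above the frame)
```

This helps for objects that move into or out of the picture, where the best match lies partly beyond the frame boundary.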
4.3.4 MPEG

The Moving Picture Experts Group (MPEG) was formed by the ISO to formulate a set of standards relating to a range of multimedia applications that involve the use of video with sound.

Typical figures of the compression ratios:

- I-frames: 10:1
- P-frames: 20:1
- B-frames: 50:1

MPEG-1: ISO Recommendation 11172

- Uses a video compression technique similar to that of H.261.
- Progressive scanning with a refresh rate of 30 Hz (for NTSC) or 25 Hz (for PAL).
- Supports I-, P- and B-frames.
- I-frames must be used for the various random-access functions associated with VCRs.

Improvements with respect to H.261:

1. A new layer called a slice is added to the structure of the stream so that the decoder can resynchronize more quickly in case of error.
2. B-frames are supported.
3. The search window for motion vectors is larger, and the vectors are represented with a finer resolution.
Bitstream format:

- Sequence start code: indicates the start of a sequence
    Video parameters: specify the screen size and aspect ratio
    Bitstream parameters: indicate the bit rate and the size of the memory/frame buffers that are required
    Quantization parameters: contain the contents of the quantization tables that are to be used
- GOP start code: indicates the start of a GOP
    Time stamp: used for synchronization purposes
    Parameters: define the particular sequence of frame types used in each GOP (e.g. IPPBPP)
- Picture start code: indicates the start of a frame
    Type: indicates whether it is an I-, P- or B-frame
    Buffer parameters: indicate how full the buffer should be before the decoding operation starts
    Encode parameters: indicate the resolution of a motion vector
- Slice start code: indicates the start of a slice
    Vertical position: indicates the scan line in which the slice is located
    Quantization parameters: indicate the scaling factor that applies to this slice

MPEG-2: ISO Recommendation 13818

It supports four levels (low, main, high 1440 and high), each targeted at a particular application domain.

There are 5 profiles associated with each level: simple, main, spatial resolution, quantization accuracy and high.

The different combinations of levels and profiles form a framework for all standards activities associated with MPEG-2.

One of the most popular settings is MP@ML (main profile at main level), which is used for digital television broadcasting.

There are 3 standards associated with HDTV: advanced television (ATV) in North America, digital video broadcast (DVB) in Europe, and multiple sub-Nyquist sampling encoding (MUSE) in Japan.

                      ATV               DVB                    MUSE
Aspect ratio          16/9              4/3                    16/9
Resolution            1280x720          1440x1152              1920x1035
Compression (video)   MP@HL of MPEG2    SSP@H1440 of MPEG2     Similar to MP@HL
Compression (audio)   Dolby AC-3        MP2                    -
Summary of video compression standards:

Standard           Digitization format   Compressed bit rate      Example applications
H.261              CIF/QCIF              p x 64 kbps              Video telephony/conferencing over ISDN and LANs
H.263              S-QCIF/QCIF           <64 kbps                 Video telephony/conferencing and security surveillance over low bit rate channels
MPEG-1/ISO11172    SIF                   <1.5 Mbps                Storage of VHS-quality video on CD-ROMs
MPEG-2/ISO13818
  Low              SIF                   <4 Mbps                  Recording of VHS-quality video
  Main             4:2:0                 <15 Mbps                 Digital video broadcasting
                   4:2:2                 <20 Mbps
  High 1440        4:2:0                 <60 Mbps                 HDTV (4/3 aspect ratio)
                   4:2:2                 <80 Mbps
  High             4:2:0                 <80 Mbps                 HDTV (16/9 aspect ratio)
                   4:2:2                 <100 Mbps
MPEG-4             Various               5 kbps to tens of Mbps   Versatile multimedia coding standard