Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine, March 2003. Presented by Chen-hsiu Huang


Outline Background & Introduction Bit-rate Reduction Spatial Resolution Reduction Temporal Resolution Reduction Error-Resilience Transcoding Scalable Image Coding Scalable Coding Concluding Remarks

Background & Introduction As the number of networks, types of devices, and content formats increase, interoperability between different systems and networks is becoming more important.

We need to provide a seamless interaction between content creation and consumption. In general, a transcoder relays video signals from a transmitter in one system to a receiver in another. In earlier work, the majority of interest focused on reducing the bit rate to meet an available channel capacity, including conversions between CBR (constant bit-rate) and VBR (variable bit-rate) streams. Spatial and temporal resolution reduction is also needed for mobile devices with limited computing power and display capability.

Typical scenarios: scalable coding and bit-rate reduction for heterogeneous networks; spatial/temporal reduction for mobile devices; error-resilience transcoding for packet radio services over wireless networks.

Some of the common transcoding operations: format conversion, bit-rate reduction, and spatial and temporal resolution reduction.

Cascaded Pixel-domain Approach It is always possible to decode the original signal, perform the intermediate processing, and fully re-encode the processed signal subject to any new constraints, but it is very costly to do so. The quest for efficiency is the major driving force behind most transcoding activity. However, any gains in efficiency should have a minimal impact on the quality of the transcoded video.

Illustration of the cascaded pixel-domain approach: its complexity is the main issue; the decoded signal is re-quantized to satisfy the new output bit rate.

Bit-Rate Reduction The objective of bit-rate reduction is to reduce the bit rate while maintaining low complexity and achieving the highest quality possible. Applications: broadcasting, Internet streaming.

The most straightforward approach: decode the video bit stream and fully re-encode the reconstructed signal. The best performance can be achieved by calculating new motion vectors and mode decisions for every MB at the new rate [2]. By re-using information contained in the original bit stream and considering simplified architectures, significant complexity savings can be achieved while still maintaining acceptable quality [1]-[7]. Over the years, focus has centered on two specific aspects: complexity and the drift effect. (Reference: "Architectures for MPEG compressed bitstream scaling.")

Complexity & Drift Reduction Drift can be explained as the blurring or smoothing of successively predicted frames. It is caused by the loss of high-frequency data, which creates a mismatch between the actual reference frame used for prediction in the encoder and the degraded reference frame used in the decoder. To demonstrate the tradeoff between complexity and quality, we will consider two types of systems: a closed-loop system and an open-loop system.

Transcoding Architecture Simplified transcoding for bit-rate reduction: (a) open-loop, partial decoding to DCT coefficients followed by re-quantization; (b) closed-loop, with drift compensation for the re-quantized data.

In the open-loop case, a frame memory is not required and there is no need for an IDCT, but this will lead to a drift effect.

Open-loop System In the open-loop system, the bit stream is variable-length decoded (VLD) to extract the VLC words corresponding to the quantized DCT coefficients, as well as the MB data corresponding to the motion vectors and other MB-level information. The quantized coefficients are inverse quantized and then simply re-quantized to satisfy the new output bit rate. Finally, the re-quantized coefficients and the stored MB-level information are variable-length coded.
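The inverse-quantize/re-quantize step can be sketched in a few lines. This assumes a plain uniform quantizer with a single step size; MPEG's weighting matrices and dead zone are omitted for clarity, and the numbers are illustrative:

```python
import numpy as np

def requantize_block(levels, q_in, q_out):
    """Open-loop re-quantization of quantized DCT levels. Assumes a plain
    uniform quantizer with step size q; MPEG's weighting matrices and dead
    zone are omitted for clarity."""
    coeffs = levels * q_in                       # inverse quantization
    return np.round(coeffs / q_out).astype(int)  # coarser re-quantization

levels = np.array([[12, -4, 2],
                   [ 2,  0, 0],
                   [ 1,  0, 0]])
out = requantize_block(levels, q_in=4, q_out=8)  # → [[6, -2, 1], [1, 0, 0], [0, 0, 0]]
```

Note how the coarser output step zeroes the smallest level, which is exactly where the open-loop quality loss (and drift, discussed below) originates.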

An alternative is to directly cut the high-frequency data from each MB [2], which is less complex. To cut the high-frequency data without VLD, a bit profile for the AC coefficients is maintained. As MBs are processed, code words corresponding to high-frequency coefficients are eliminated to meet the target bit rate. Along similar lines, techniques to determine the optimal breakpoint of non-zero DCT coefficients (in zig-zag order) were presented in [3]. This is carried out for each MB so that distortion is minimized and the rate constraints are satisfied.
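The breakpoint idea can be sketched as follows, assuming the standard zig-zag scan and a breakpoint chosen elsewhere by the rate control (the helper names are illustrative, not from the paper):

```python
import numpy as np

def zigzag_indices(n=8):
    """(i, j) pairs of an n x n block in standard zig-zag scan order."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

def cut_high_freq(block, breakpoint):
    """Zero every coefficient past the breakpoint in zig-zag order."""
    out = np.zeros_like(block)
    for i, j in zigzag_indices(block.shape[0])[:breakpoint]:
        out[i, j] = block[i, j]
    return out

block = np.arange(16).reshape(4, 4)
out = cut_high_freq(block, breakpoint=3)   # keeps positions (0,0), (0,1), (1,0)
```

In [3] the breakpoint itself is optimized per MB against a rate-distortion criterion; here it is simply taken as given.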

The above two alternatives may also be used in the closed-loop system, but their impact on the overall complexity is less. Regardless of the technique used to achieve the reduced rate, open-loop systems are relatively simple since a frame memory is not required and there is no need for an IDCT. In terms of quality, better coding efficiency can be obtained by the re-quantization approach. However, open-loop architectures are subject to the drift effect.

Drift Effect Drift is caused by the loss of high-frequency information. Beginning with the I-frame, information is discarded by the transcoder to meet the new target bit rate. Incoming residual blocks are also subject to loss. When a decoder receives the transcoded bit stream, it will decode the I-frame with reduced quality. When it is time to decode the next P- and B-frames, the degraded I-frame is used as the predictive component and added to a degraded residual component.

Recall that the purpose of the residual is to accurately represent the difference between the original signal and the motion-compensated prediction. Since both the residual and the predictive components are now different from what was originally derived by the encoder, errors are introduced in the reconstructed frame. As time goes on, the mismatch progressively increases, resulting in reconstructed frames that become severely degraded (degraded prediction + degraded residual).
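The accumulation can be seen in a toy one-pixel example (illustrative numbers only, not a real codec): the encoder tracks the source exactly, but the open-loop transcoder's coarser re-quantization of each residual leaves a per-frame mismatch that piles up at the receiver:

```python
import numpy as np

def quant(x, q):
    """Uniform quantizer with step size q (illustrative, not a real codec)."""
    return np.round(x / q) * q

# Source: one pixel rising by 0.6 per frame. The encoder's fine step (0.1)
# represents each residual exactly, so the encoder itself has no drift.
q_fine, q_coarse = 0.1, 1.0
true_val, ref_enc, ref_dec = 0.0, 0.0, 0.0
drift = []
for _ in range(5):
    true_val += 0.6
    resid = quant(true_val - ref_enc, q_fine)  # residual the encoder produced
    ref_enc += resid                           # encoder reconstruction tracks the source
    ref_dec += quant(resid, q_coarse)          # open-loop transcoder coarsens it
    drift.append(abs(true_val - ref_dec))      # receiver mismatch accumulates
# drift grows by about 0.4 per frame: ~[0.4, 0.8, 1.2, 1.6, 2.0]
```

Each frame's re-quantization error is added on top of the previous mismatch, which is exactly the drift behavior described above.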

Closed-loop System The closed-loop system aims to eliminate the mismatch between the predictive and residual components by approximating the cascaded decoder-encoder architecture [4]. In the simplified structure, only one reconstruction loop is required, with one DCT and one IDCT. Some inaccuracy is introduced due to the non-linear manner in which the reconstruction loops are combined. However, it has been found that the approximation has little effect on quality [4]. (Reference: "Post-processing of MPEG-2 coded video for transmission at lower bit-rate.")

In contrast to the traditional cascaded pixel-domain approach, only one reconstruction loop is required, with one DCT and one IDCT (a combined reconstruction loop). With the exception of this slight inaccuracy, the architecture is mathematically equivalent to a cascaded decoder-encoder architecture.

In [8], additional causes of drift, e.g., those due to floating-point inaccuracies, have been further studied. Overall though, in comparison to the open-loop architecture discussed earlier, drift is eliminated since the mismatch between the predictive and residual components is compensated for (compensated prediction + compensated residual).
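The compensation idea can be illustrated with a toy one-pixel sequence (illustrative numbers only): before each coarse re-quantization, the transcoder re-derives the residual against its copy of the receiver's reference, folding the accumulated mismatch back in, so the error stays bounded by half the coarse step instead of accumulating:

```python
import numpy as np

def quant(x, q):
    """Uniform quantizer with step size q (illustrative, not a real codec)."""
    return np.round(x / q) * q

q_fine, q_coarse = 0.1, 1.0
true_val, ref_enc, ref_dec = 0.0, 0.0, 0.0
drift = []
for _ in range(5):
    true_val += 0.6
    resid = quant(true_val - ref_enc, q_fine)
    ref_enc += resid
    # Closed-loop step: re-derive the residual against the *receiver's*
    # reference, folding the accumulated mismatch in before re-quantizing.
    out = quant(ref_enc - ref_dec, q_coarse)
    ref_dec += out
    drift.append(abs(true_val - ref_dec))
# drift stays bounded by about q_coarse/2 instead of accumulating
```

This is the same feedback principle the drift-compensation loop in architecture (b) implements on DCT coefficients.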

Simulation Results The closed-loop system performs very close to the cascaded architecture, while the open-loop system suffers from severe drift.

Motion Compensation in the DCT Domain The closed-loop architecture described in the previous section provides an effective transcoding structure in which the MB reconstruction is performed in the DCT domain. However, since the memory stores spatial-domain pixels, the additional DCT/IDCT is still needed. This can be avoided by utilizing the compressed-domain methods for MC proposed by S.-F. Chang and Messerschmitt [9]. In this way, it is possible to reconstruct reference frames without decoding to the spatial domain. (Reference: "Manipulation and compositing of MC-DCT compressed video.")

Decoding completely in the compressed domain can yield quality equivalent to spatial-domain decoding [10]. However, this was achieved with floating-point matrix multiplication and proved to be quite costly. In [12], this computation was simplified by approximating the floating-point elements with power-of-two fractions so that shift operations could be used, and in [13], simplifications were achieved through matrix decomposition techniques. In [14], further simplifications of the DCT-based MC process were achieved by exploiting the fact that the stored DCT coefficients in the transcoder are mainly concentrated in the low-frequency area. Since only a few low-frequency coefficients are significant, an accurate approximation to the MC process that normally uses all coefficients can be made.
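The key fact behind [9] is that extracting a (possibly shifted) block is a linear operation, Y = H X G, so it can be carried out entirely on DCT coefficients by pre- and post-multiplying with the transformed window matrices. A small numerical check (H and G are arbitrary illustrative shift matrices, not taken from the paper):

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix T, so that X_dct = T @ X @ T.T."""
    k = np.arange(n)
    T = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    T *= np.sqrt(2.0 / n)
    T[0] *= np.sqrt(0.5)
    return T

T = dct_matrix()
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 8))      # reference block (spatial domain)
H = np.eye(8, k=3)               # illustrative vertical shift/window matrix
G = np.eye(8, k=-2)              # illustrative horizontal shift/window matrix
Y_spatial = H @ X @ G            # motion-compensated extraction on pixels
# The same result computed without ever leaving the DCT domain:
Y_dct = (T @ H @ T.T) @ (T @ X @ T.T) @ (T @ G @ T.T)
err = np.abs(T.T @ Y_dct @ T - Y_spatial).max()   # ~0, exact up to rounding
```

The fixed matrices T H T.T and T G T.T can be precomputed; the simplifications in [12]-[14] reduce their cost via power-of-two approximation, decomposition, or low-frequency truncation.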

Spatial Resolution Reduction With the emergence of mobile multimedia-capable devices and the desire of users to access video originally captured at high resolution, there is a strong need for efficient ways to reduce the spatial resolution of video for delivery.

Some researchers have focused on the efficient reuse of MB-level data in the context of the cascaded pixel-domain architecture, while others have explored new alternatives. In [6],[7], the problems associated with mapping motion vectors and MB-level data were addressed (including the performance of motion vector refinement). In [18], the authors propose to use DCT-domain down-scaling and MC for transcoding. In [19], a comprehensive study of transcoding to lower spatial/temporal resolutions and to different encoding formats was provided based on the reuse of motion parameters. (Reference: "CIF-to-QCIF video bitstream downconversion in the DCT domain.")

Motion Vector Mapping When down-sampling four MBs to one MB, the associated motion vectors have to be mapped, where the number of motion vectors that may be associated with an MB depends on the standard used and the coding tools available in a given profile or extension. Several methods to perform the particular mapping have been described in [6],[7],[17],[19],[24].

To map the four motion vectors to the new MB, a weighted average or a median filter can be applied. This is referred to as a 4:1 mapping. However, certain compression standards, such as MPEG-4 Visual and H.263, support advanced prediction modes in the syntax that allow one motion vector per 8x8 luminance block. This is referred to as a 1:1 mapping. While 1:1 mapping provides a more accurate representation of the motion, it is sometimes inefficient since more bits must be used to code four motion vectors. An optimal scheme would adaptively select the best mapping based on a rate-distortion criterion (good evaluations: [6],[7],[19]).
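A minimal sketch of the 4:1 mapping using a component-wise median; the choice of median versus weighted average, and the exact scaling, vary across [6],[7],[17],[19],[24], so this version is purely illustrative:

```python
import statistics

def map_mvs_4to1(mvs, scale=2):
    """4:1 motion vector mapping (illustrative): combine the four full-
    resolution MB vectors with a component-wise median, then divide by the
    spatial down-sampling factor."""
    mx = statistics.median(mv[0] for mv in mvs)
    my = statistics.median(mv[1] for mv in mvs)
    return (mx / scale, my / scale)

mv = map_mvs_4to1([(4, 2), (6, 2), (4, 4), (8, 0)])  # → (2.5, 1.0)
```

The median is attractive here because it rejects a single outlier vector (e.g., the (8, 0) above) better than a plain average would.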

Field-to-frame transcoding with spatial and temporal downsampling Because MPEG-2 supports interlaced video, we also need to consider field-based MV mapping. In [27], the top-field motion vector was simply used. An alternative scheme that averages the top- and bottom-field motion vectors under certain conditions was proposed in [21]. It is our opinion that the appropriate motion vector mapping depends on the down-conversion scheme used. We feel this is particularly important for interlaced data, where the target output may be a progressive frame. (Reference: "Drift compensation for reduced spatial resolution transcoding.")

DCT-domain Down-Conversion The most intuitive way to perform down-conversion in the DCT domain is to retain only the low-frequency coefficients of each block and recompose the new MB using the compositing techniques in [9]. For conversion by a factor of 2, only the 4x4 low-frequency DCT coefficients of each 8x8 block in an MB are retained. These low-frequency coefficients from each block are then used to form the output MB. A set of DCT-domain filters can be derived by cascading these two operations. More sophisticated filters that attempt to retain more of the high-frequency information are derived in [28],[29].
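A sketch of the factor-of-2 conversion on one block, assuming orthonormal DCTs; the 1/2 scale compensates for the shorter transform length, and the block-compositing step of [9] that merges the four reduced blocks into one output MB is omitted:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix T of size n, so X_dct = T @ X @ T.T."""
    k = np.arange(n)
    T = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n)) * np.sqrt(2.0 / n)
    T[0] *= np.sqrt(0.5)
    return T

def down_convert_block(block8):
    """2:1 down-conversion of one 8x8 spatial block by DCT truncation:
    transform, keep the low-frequency 4x4 corner (scaled by 1/2 for the
    shorter transform length), and inverse-transform at the new size."""
    T8, T4 = dct_matrix(8), dct_matrix(4)
    dct8 = T8 @ block8 @ T8.T
    return T4.T @ (0.5 * dct8[:4, :4]) @ T4

block = np.outer(np.arange(8.0), np.ones(8))  # smooth vertical ramp, mean 3.5
small = down_convert_block(block)             # 4x4 output; the mean is preserved
```

Since only the DC coefficient determines the block mean, the truncation preserves the average brightness exactly while discarding the high-frequency detail.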

These down-conversion filters can be applied in both the horizontal and vertical directions and to both frame-DCT and field-DCT blocks. Variations of this filtering approach to convert field-DCT blocks to frame-DCT blocks, and vice versa, have also been derived in [10].

Conversion of MB Type In transcoding video bit streams to a lower spatial resolution, a group of four MBs in the original corresponds to one MB in the target video. To ensure that the down-sampling process does not generate an output MB whose sub-blocks have different coding modes, the mapping of MB modes to the lower resolution must be considered ([6],[21]). Three strategies are discussed next: ZeroOut, Intra-Inter, and Inter-Intra.

ZeroOut In ZeroOut, the MB modes of mixed MBs are all modified to intermode. The MVs for the intra-MBs are reset to zero, and so are the corresponding DCT coefficients. In this way, the converted input MBs are replicated with data from the corresponding blocks in the reference frame.

Intra-Inter The Intra-Inter strategy maps all MBs to intermode, but the motion vectors for the intra-MBs are predicted. The prediction can be based on the data in neighboring blocks, which can include both texture and motion data. As an alternative, we can simply set the motion vector to zero, depending on which produces the smaller residual. In an encoder, the mean absolute difference of the residual blocks is typically used for the mode decision. Based on the predicted motion vector, a new residual for the modified MB must be calculated.

Inter-Intra In the Inter-Intra strategy, the MB modes are all modified to intramode. In this case, there is no motion information associated with the reduced-resolution MB; therefore, all associated motion vector data is reset to zero and intra-DCT coefficients are generated to replace the inter-DCT coefficients.

To implement the Intra-Inter and Inter-Intra methods, we need a decoding loop to reconstruct the full-resolution picture. The reconstructed data is used as a reference to convert the DCT coefficients from intra to inter or inter to intra. Where low complexity is the priority, the ZeroOut strategy can be used. The performance of Inter-Intra is a little better than Intra-Inter, because Inter-Intra can stop drift propagation by transforming inter-blocks into intra-blocks.

Intra-Refresh Architecture In reduced-resolution transcoding, drift error is caused by many factors, such as re-quantization, motion vector truncation, and down-sampling. By converting some percentage of inter-coded blocks to intra-coded blocks, drift propagation can be controlled. In the past, the concept of intra-refresh has successfully been applied to error-resilience coding schemes [30], and it has been found that the same principle is also very useful for reducing drift in a transcoder [21].

Intra-refresh architecture for reduced spatial resolution transcoding: output blocks that originate from the frame store are independent of other data and are hence coded as intra-blocks; there is no picture drift associated with these blocks.

The decision to code an intra-block from the frame store depends on the MB coding modes and picture statistics. Based on the coding mode, an output MB is converted if the possibility of a mixed block is detected. Based on the picture statistics, the motion vector and residual data are used to detect blocks that are likely to contribute to a larger drift error. Picture quality can be maintained by employing an intra-coded block in its place. Of course, the increase in the number of intra-blocks must be compensated for by the rate control, which adjusts the quantization parameters so that the target rate is met [21].

Motion Vector Refinement In all of the transcoding methods described here, significant complexity is avoided by assuming that the motion vectors computed at the original bit rate are simply reused in the reduced-rate bit stream. It has been shown that reusing the motion vectors in this way leads to non-optimal transcoding results due to the mismatch between the prediction and residual components [6],[7],[15]. To overcome this loss of quality without performing a full motion re-estimation, motion vector refinement schemes have been proposed.

Typically, the search window used for motion vector refinement is relatively small compared to the original search window, e.g., [-2,+2]. Such schemes can easily be used with most bit-rate reduction architectures for improved quality, as well as with the spatial and temporal resolution reduction architectures. Additional techniques and simulation results for motion vector refinement can be found in [6],[7],[15],[36], and [37].

Increasing the search window further allows more blocks to find their best match; however, the incremental gain diminishes.
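A refinement search of this kind can be sketched as a small SAD (sum of absolute differences) scan around the reused vector; the frame layout and block size here are illustrative:

```python
import numpy as np

def refine_mv(cur_block, ref_frame, x, y, mv, radius=2):
    """Refine an inherited motion vector with a small SAD search (a [-2,+2]
    window here) around the reused vector, instead of a full re-estimation."""
    h, w = cur_block.shape
    best_sad, best_mv = None, mv
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ry, rx = y + mv[1] + dy, x + mv[0] + dx
            if ry < 0 or rx < 0 or ry + h > ref_frame.shape[0] or rx + w > ref_frame.shape[1]:
                continue  # candidate falls outside the reference frame
            sad = np.abs(cur_block - ref_frame[ry:ry + h, rx:rx + w]).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (mv[0] + dx, mv[1] + dy)
    return best_mv

# Toy check: the block truly sits at offset (3, 1) in the reference, but the
# inherited vector says (2, 0); the small refinement search recovers the match.
ref = np.arange(100.0).reshape(10, 10)
cur = ref[1:5, 3:7].copy()
mv = refine_mv(cur, ref, x=0, y=0, mv=(2, 0))  # → (3, 1)
```

With radius 2 the search visits at most 25 candidates, versus hundreds for a full-window re-estimation, which is the complexity/quality tradeoff discussed above.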

Temporal Resolution Reduction Reducing the temporal resolution of a video bit stream is a technique that may be used to reduce the bit-rate requirements imposed by a network, to maintain a higher quality for the coded frames, or to satisfy processing (battery) limitations.

For temporal resolution reduction, it is necessary to estimate the motion vectors from the current frame to the previous non-skipped frame that will serve as a reference frame in the receiver. Solutions to this problem have been proposed in [15],[32],and [33].

The re-estimation of motion is all that needs to be done in the pixel domain, since new residuals will be calculated. However, if a DCT-domain transcoding architecture is used, a method of re-estimating the residuals in the DCT domain is needed [34]. In [35], the issue of motion vector and residual mapping was addressed in the context of a combined spatial-temporal reduction in the DCT domain based on the intra-refresh architecture.

Motion Vector Re-estimation This can be solved by tracing the motion vectors back to the desired reference frame. Since the predicted blocks in the current frame generally overlap multiple blocks, bilinear interpolation of the motion vectors may be used, where the weighting of each input motion vector is proportional to its amount of overlap with the predicted block. A dominant vector selection scheme, as proposed in [15] and [35], may also be used.
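The overlap-weighted extension across one skipped frame can be sketched as follows; the (area, MV) data layout and the numbers are illustrative, not from [15] or [35]:

```python
def trace_mv_over_skipped(mv_cur, overlaps):
    """Extend a motion vector across one skipped frame (hedged sketch): the
    current MV lands on a region of the dropped frame; add the overlap-area-
    weighted average of the MVs of the blocks that region covers."""
    total = sum(area for area, _ in overlaps)
    avg_x = sum(area * mv[0] for area, mv in overlaps) / total
    avg_y = sum(area * mv[1] for area, mv in overlaps) / total
    return (mv_cur[0] + avg_x, mv_cur[1] + avg_y)

# The landing region covers two blocks of the skipped frame: 48 pixels of a
# block with MV (4, 0) and 16 pixels of a block with MV (0, 4).
mv = trace_mv_over_skipped((2, 2), [(48, (4, 0)), (16, (0, 4))])  # → (5.0, 3.0)
```

Dominant vector selection [15],[35] would instead pick the MV of the single block with the largest overlap, trading accuracy for simplicity.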

Residual Re-estimation With pixel-domain architectures, the residual between the current frame and the new reference frame can be easily computed. For DCT-domain transcoding, the calculation should be done directly using DCT-domain MC techniques [9]. A novel architecture to compute this new residual in the DCT domain has been presented in [34] and [35]. In this work, the authors utilize direct addition of DCT coefficients for MBs without MC, as well as an error-compensating feedback loop for motion-compensated MBs.

Error-Resilience Transcoding Transmitting video over wireless channels requires taking into account the conditions in which the video will be transmitted, because wireless channels have lower bandwidth and higher error rates.

Error-resilience transcoding for video over wireless channels has been studied in [38] and [39]. In [38], the authors first use a transcoder that injects spatial and temporal resilience into an encoded bit stream, where the amount of resilience is tailored to the content of the video and the prevailing error conditions, as characterized by the bit error rate. Second, they derive analytical models that characterize how corruption propagates in a video that is compressed using motion-compensated encoding and subjected to bit errors. Third, they use rate-distortion theory to compute the optimal allocation of bit rate between spatial resilience, temporal resilience, and source rate. The transcoder then injects this optimal resilience into the bit stream.

In [39], the authors propose error-resilience video transcoding for inter-network communication via GPRS. Experiments showed superior transcoding performance over error-prone GPRS channels compared to non-resilient video. ([38] "Error-resilience transcoding for video over wireless channels"; [39] "Error-resilience video transcoding for robust inter-network communications using GPRS.")

Scalable Image Coding Although the primary focus of this article is on video, it is worthwhile to mention the current state of the art in scalable image coding, namely the JPEG-2000 standard.

The JPEG-2000 coder employs bit-plane coding techniques to achieve scalability, where both SNR and spatial scalability are supported. In contrast to existing scalable video coding schemes, which typically rely on a non-scalable base layer and are based on the DCT, this coder does not rely on separate base and enhancement layers and is based on the discrete wavelet transform (DWT). The coding scheme employed by JPEG-2000 is often referred to as an embedded coding scheme, since the bits corresponding to the various qualities and spatial resolutions can be organized in the bit-stream syntax in a manner that allows progressive reconstruction of images and arbitrary truncation at any point in the stream.

Scalable Coding The idea is to encode the video once; then, by simply truncating certain layers or bits from the original stream, lower qualities, spatial resolutions, and/or temporal resolutions can be obtained.

MPEG-2 Video The signal is encoded into a base layer and a few enhancement layers, in which the enhancement layers add spatial, temporal, and/or SNR quality to the reconstructed base layer. Specifically, the enhancement layer in SNR scalability adds refinement data for the DCT coefficients of the base layer. For spatial scalability, the first enhancement layer uses predictions from the base layer without the use of motion vectors. The enhancement layer in temporal scalability uses predictions from the base layer using motion vectors.

Fine Granular Scalability (FGS) FGS, adopted by the MPEG-4 Visual standard, allows for a much finer scaling of bits in the enhancement layer [42]. This is accomplished through a bit-plane coding method for the DCT coefficients in the enhancement layer, which allows the bit-planes to be truncated at any point. In this way, the quality is roughly proportional to the number of enhancement bits received. Another variation of the FGS scheme, known as FGS-temporal, combines the FGS techniques with temporal scalability [45].
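The bit-plane idea in miniature: sign coding and the actual run-length symbols of MPEG-4 FGS are omitted, and the four planes and the magnitude 13 are purely illustrative:

```python
def bitplanes(value, n_planes=4):
    """MSB-first bit-planes of a non-negative coefficient magnitude (sign
    coding and the run-length symbols of real FGS are omitted)."""
    return [(value >> p) & 1 for p in range(n_planes - 1, -1, -1)]

def reconstruct(planes, n_planes=4):
    """Rebuild the coefficient from however many bit-planes were received;
    the truncated low-order planes are simply taken as zero."""
    v = 0
    for bit in planes:
        v = (v << 1) | bit
    return v << (n_planes - len(planes))

coeff = 13                        # binary 1101
planes = bitplanes(coeff)         # [1, 1, 0, 1], most significant plane first
approx = reconstruct(planes[:2])  # stream truncated after 2 planes → 12
exact = reconstruct(planes)       # all planes received → 13
```

Because the most significant planes arrive first, every prefix of the enhancement stream yields a usable (if coarser) coefficient, which is what makes the truncation "fine-grained."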

Comparison Making a comparison between scalable coding and transcoding is rather complex, since they address the same problem from different points of view. Scalable coding specifies the data format at the encoding stage independently of the transmission requirements, while transcoding converts an existing data format to meet the current transmission requirements. Although scalable coding can provide low-cost flexibility to meet the target, traditional schemes sacrifice coding efficiency compared to single-layered coding.

Considering a cascaded transcoding architecture that fully decodes and re-encodes the video according to the new requirements, its coding performance will always be better than that of traditional scalable coding; this has been shown in at least one study [6]. Certainly, more studies are needed that account for the latest scalable coding schemes. Steps in this direction are now being made within the MPEG community [48],[49]. Recent advances in video coding are showing that an efficient, universally scalable coding scheme is within reach; e.g., [50],[51].

In addition to the issue of coding efficiency, which is likely to be solved soon, scalable coding will need to define the application space that it can occupy. For instance, content providers for high-quality mainstream applications have already adopted single-layered MPEG-2 video coding as the default, hence a large amount of MPEG-2 video content already exists. To access this existing MPEG-2 content from devices with varying terminal and network capabilities, transcoding is needed. For this reason, research on video transcoding of single-layer streams has flourished and is not likely to go away anytime soon.

The Future In the short term, scalable coding may satisfy a wide range of video applications outside this space, and in the long term, we should not dismiss the possibility that a scalable coding format could replace existing coding formats. For now, scalable coding and transcoding should not be viewed as opposing or competing technologies. Instead, they are technologies that meet different needs in a given application space, and it is likely that they can happily coexist.

Concluding Remarks

Other Transcoding Schemes There are additional video transcoding schemes, including object-based transcoding [48], transcoding between scalable and single-layered video [53]-[55], and various types of format conversion [56]-[60]. Joint transcoding of multiple streams has been presented in [61], and system filtering has been described in [62]. Transcoding techniques that facilitate trick-play modes, e.g., fast and reverse playback, have been discussed in [63] and [64].

Problems for the Future One open problem is finding an optimal transcoding strategy: given several transcoding operations that would satisfy given constraints, a means for deciding the best one in a dynamic way is needed [65]. Further study is also needed toward a complete algorithm that can measure and compare quality across spatial-temporal scales, possibly taking into account subjective factors and accounting for a wide range of potential constraints (e.g., terminal, network, and user characteristics).

Another topic is the transcoding of encrypted bit streams. Transcoding an encrypted bit stream by decrypting and re-encrypting within the network introduces breaches in security. These problems have been circumvented in [69] with a secure scalable streaming format that combines scalable coding techniques with a progressive encryption technique. However, handling this for non-scalable video and for streams encrypted with traditional encryption techniques is still an open issue.