MOTION ESTIMATION AT THE DECODER USING MAXIMUM LIKELIHOOD TECHNIQUES FOR DISTRIBUTED VIDEO CODING. Ivy H. Tseng and Antonio Ortega

MOTION ESTIMATION AT THE DECODER USING MAXIMUM LIKELIHOOD TECHNIQUES FOR DISTRIBUTED VIDEO CODING

Ivy H. Tseng and Antonio Ortega
Signal and Image Processing Institute, Department of Electrical Engineering - Systems, University of Southern California, Los Angeles, CA 90089-2564
E-mail: {hsinyits, ortega}@sipi.usc.edu

ABSTRACT

Distributed video coding techniques have been proposed to support relatively light video encoding systems, where some of the encoder complexity is transferred to the decoder. In some of these systems, motion estimation is performed at the decoder to improve compression performance: a block in a previous frame has to be found that provides the correct side information to decode information in the current frame. In this paper we compare various techniques for motion estimation at the decoder that have been proposed in the literature, and we propose a novel technique that exploits all the information available at the decoder using a maximum likelihood formulation. Our experiments show that likelihood techniques provide potential performance advantages when used in combination with some existing methods, in particular as they do not require additional rate overhead.

1. INTRODUCTION

With the increasing availability of mobile cameras and wireless video sensors, new computationally-demanding applications are being proposed for these mobile devices. For example, there is interest in allowing mobile device users to capture video clips and then share them with others by uploading them to a central server. While compression will be needed due to bandwidth limitations, conventional predictive video coding techniques may be too complex for some of these devices, because of the motion estimation to be performed at the encoder. Thus, a need has emerged for novel video compression techniques that can achieve good performance with a computationally light encoder, possibly shifting some of the complexity to the decoder. The Slepian-Wolf theorem provides a basic tool to achieve this goal. Let Y and X be two correlated sources. The theorem states that if the joint distribution of X and Y is known and Y is only available at the decoder, X can be encoded at the theoretically optimal rate H(X|Y) [1]. A corresponding theorem for lossy source coding due to Wyner and Ziv [2] has led to several proposals for practical Wyner-Ziv coding (WZC).

For video compression, we wish to encode pixel blocks in the current frame, which are likely to be correlated with blocks in the previous frame. As shown in [3], in a distributed coding setting this can be seen as a scenario where the decoder will have access to multiple candidate side information blocks, and will have to decide which side information is best for decoding. Each of these candidate side information blocks is a block to be found in the previous frame, so that the decoder is performing an operation similar to motion compensation. Since the information sent by the encoder cannot be decoded without side information, identifying the correct side information (i.e., the correct block) typically involves, for each candidate side information, (i) using the side information to decode what was transmitted, and (ii) determining whether the decoded data indicates that the side information was correct.¹ This second step can be achieved by letting the encoder send information that can be used to identify the correct side information. As an example, a hash function can be used for this purpose: the encoder will send the result of the hash function for a given block to the decoder, so that the decoder can determine if decoding is correct (i.e., if the hash value it generates matches that sent by the encoder). A more formal definition of the problem can be found in [3], which also provides a key insight: the number of bits needed (e.g., in the form of a hash) to identify the correct side information at the decoder is, under some simplifying assumptions, the same that would be needed if the encoder identified the best side information (e.g., via motion estimation) and sent the location of the corresponding block to the decoder.

Practical methods proposed to date to enable motion estimation at the decoder require some transmission rate overhead. Aaron, Zhong and Girod resort to feedback to deal with model uncertainty (see references in [4]): additional parity bits are requested from the encoder if the decoder decides the decoding is not reliable. However, this approach increases decoding delay and also requires a feedback channel. Aaron, Rane and Girod propose sending an additional hash function in the form of a coarsely quantized and sampled original frame [4]. Puri and Ramchandran use cyclic redundancy checks (CRCs) to validate the correctness of the decoded blocks [5]. In addition to increasing the overall rate, CRC-based approaches can only be applied reliably to a limited range of rates, as will be discussed later in this paper.

Our goal in this paper is to design rate-efficient techniques that will enable the correct reference to be estimated at the decoder. We propose a novel estimation method based on maximum likelihood (ML). Our proposed technique can be seen to complement existing methods. We provide an analysis of our method and existing methods in order to address the trade-off between performance and rate overhead, and we also discuss the ranges of operating rates that are most suitable for each method.

¹ Note that this formulation does not preclude the encoder sending some motion information to the decoder: if accurate motion information is transmitted, then there will be a single candidate side information, while if only rough motion information is sent, it will be used to reduce the number of candidate side information blocks.
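The decode-then-verify loop described above (steps (i) and (ii)) can be sketched as a toy example. The 3-bit quantizer with LSB cosets matches the setup later used in Section 4, while the 32-bit truncated SHA-256 "hash", the helper names, and all sample values are illustrative assumptions, not the paper's actual design.

```python
import hashlib

def short_hash(x, nbytes=4):
    # Illustrative 32-bit hash; the paper does not commit to a specific hash.
    return hashlib.sha256(bytes(x)).digest()[:nbytes]

def wz_encode(x):
    """Toy Wyner-Ziv encoder: send the LSB (coset index) of each 3-bit
    quantized sample, plus a short hash of the full quantized data."""
    return [v & 1 for v in x], short_hash(x)

def wz_decode(bits, side_info):
    """Within each transmitted coset, pick the level closest to the side info."""
    return [min((v for v in range(8) if (v & 1) == b), key=lambda v: abs(v - s))
            for b, s in zip(bits, side_info)]

def find_reference(bits, tx_hash, candidates):
    """Step (i): decode with each candidate side information block.
    Step (ii): accept the candidate whose decoded block matches the hash."""
    for j, cand in enumerate(candidates):
        dec = wz_decode(bits, cand)
        if short_hash(dec) == tx_hash:
            return j, dec
    return None, None

x = [3, 5, 2, 7, 4, 1, 6, 0]                  # quantized block in current frame
bits, h = wz_encode(x)
candidates = [[0] * 8,                                    # wrong candidate
              [3.2, 4.9, 2.1, 6.8, 3.9, 1.2, 5.8, 0.3],  # noisy version of x
              [7] * 8]                                    # wrong candidate
j, dec = find_reference(bits, h, candidates)
print(j, dec)  # → 1 [3, 5, 2, 7, 4, 1, 6, 0]
```

Only the candidate that is close to the original block lets the coset decoder reproduce the data exactly, so only its hash matches; this is the sense in which hash bits substitute for encoder-side motion information.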

The time delay estimation (TDE) problem serves as a motivation and proof of concept for the proposed technique, which in this paper is mostly proposed for decoder motion estimation. This paper is organized as follows. We first introduce the problem in the context of TDE in Section 2. In Section 3 we propose the ML method and also review other methods used for TDE. In Section 4 the experimental results for TDE are given. We extend the ML method to a video coding environment in Section 5 and provide experimental results and analysis of the different methods in Section 6.

2. REFERENCE FINDING AT THE DECODER IN THE CONTEXT OF TIME DELAY ESTIMATION (TDE)

Consider the scenario illustrated by Figure 1, where there are two sensor nodes, N1 and N2, that obtain readings S1 and S2, respectively. Here S1 and S2 are correlated in the sense that S2 is a delayed, noisy version of S1: S2(n) = S1(n - d) + N(n), where the delay d and the noise N are both unknown. N1 sends S1_INTRA, an encoded version of S1, to the central node. S1_INTRA is encoded independently and can be decoded without requiring any information from N2; it will be used as side information to decode the information sent by N2. N2 has knowledge of the noise statistics (this could be a design parameter, or could have been learned via information exchange between the nodes). S2 is encoded as S2' using distributed coding techniques based on this correlation and sent to the central decoder. Since the delay d is unknown at the decoder, and correct decoding of S2' can only be guaranteed when S1_INTRA delayed by d is used as side information, the central decoder will need to estimate the correct d.

[Fig. 1. Time Delay Estimation, with S2(n) = S1(n - d) + N(n): the motivation and proof of concept for using a maximum likelihood method to search for the correct side information; d' is the estimated time delay.]

Note that this problem can be seen as the 1-D counterpart of the problem of motion estimation at the decoder for distributed video coding. Motion vectors represent spatial displacements, while for now we consider temporal displacements in the TDE problem. Also, the residual obtained by subtracting a predictor block (in the previous frame) from the current block corresponds to the noise N in the TDE problem.

3. REFERENCE FINDING TECHNIQUES

3.1. Overhead-based Techniques

3.1.1. Correlation

Correlation-based techniques are commonly used for TDE. In our problem formulation, correlation cannot be computed directly at the decoder, since S2' can only be decoded correctly once d has been obtained. Thus, some of the information corresponding to reading S2 needs to be coded independently, so that it can be decoded without any side information from N1. This method is similar to Aaron, Rane and Girod's hash function [4] in the sense that a subset of the original information is sent to the decoder to help locate the correct reference. As in [6], we have a situation where the decoder uses quantized data to estimate the time delay (or, in this case, the motion displacement) between two data sequences. While in [6] both sources were coded independently with standard quantizers (and thus could be decoded independently), here we show that this estimation can be done reliably even if one of the streams is coded using WZC.

[Fig. 2. Correlation: a subset of S2 is intra encoded and correlated with S1 to locate the correct reference.]

For every L-sample data block from S2, we consider two simple approaches to convey intra-coded information. First, we can include m consecutive samples encoded independently as a preamble in each L-sample block; the remaining L - m samples are coded using WZC. We denote this method preamble sample correlation (PSC). The second approach embeds independently coded samples at regular intervals among the WZC-coded symbols; this will be denoted embedded sample correlation (ESC). PSC usually performs better than ESC in terms of estimation accuracy. However, PSC has the drawback that it cannot detect a delay longer than m samples. Let the independently encoded samples in S2 be denoted S2_INTRA(n); in both cases there are m such samples per block, consecutive for PSC and interspersed for ESC. Let S1_INTRA(n) be the intra encoded version of S1(n), n = 1 ... L. PSC and ESC pick as the estimated delay the d' which maximizes

R(d') = (1/m) Σ_n S2_INTRA(n) S1_INTRA(n - d'),

where the sum runs over the intra-coded sample positions.

3.1.2. Cyclic Redundancy Checks (CRCs)

Several proposed practical systems use CRCs, sent to the decoder along with the WZC-encoded data, in order to find the correct reference at the decoder [4][5]. As shown in Fig. 3, at the decoder each possible delay is tested sequentially: S2'(n) is first decoded with S1_INTRA(n - d') as the side information, where d' is one of the possible delays. The decoder then checks whether the decoded sequence S2_d'(n) passes the CRC test. If it does, d' is declared to be the estimated delay; if it fails, S2'(n) is decoded with respect to the next possible delay.
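The sequential CRC test just described can be sketched as follows. The toy coset decoder matches the Section 4 setup (8 levels, 2 cosets, LSB transmitted); `zlib.crc32` stands in for the CRC-12 used in the paper's experiments, and the signal values are illustrative.

```python
import zlib

def coset_decode(bits, side):
    """Toy WZ decoder: per sample, choose the 3-bit level in the received
    coset (LSB) closest to the side information."""
    return [min((v for v in range(8) if (v & 1) == b), key=lambda v: abs(v - s))
            for b, s in zip(bits, side)]

def crc_tde(bits, tx_crc, s1_intra, max_delay):
    """Test each candidate delay sequentially; return the first delay whose
    decoded block passes the CRC check (None, None if all fail)."""
    L = len(bits)
    for d in range(-max_delay, max_delay + 1):
        side = [s1_intra[n - d] for n in range(max_delay, max_delay + L)]
        dec = coset_decode(bits, side)
        if zlib.crc32(bytes(dec)) == tx_crc:
            return d, dec
    return None, None

# S2(n) = S1(n - d) with d = 3; noise is omitted here so that decoding with
# the correct delay is error-free and the example is exactly reproducible
s1 = [3, 5, 2, 7, 4, 1, 6, 0, 2, 5, 7, 1, 4, 6, 0, 3] * 2
d_true, L, max_delay = 3, 8, 5
s2 = [s1[n - d_true] for n in range(max_delay, max_delay + L)]
bits = [v & 1 for v in s2]          # transmitted coset indices
tx_crc = zlib.crc32(bytes(s2))      # CRC sent alongside the coset bits
print(crc_tde(bits, tx_crc, s1, max_delay))  # → (3, [2, 7, 4, 1, 6, 0, 2, 5])
```

With correlation noise present, even the correct delay can produce a few decoding errors, in which case the CRC rejects every candidate; this is exactly the failure mode analyzed next.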

[Fig. 3. Cyclic Redundancy Check: the validity of S2_d'(n) is checked with a CRC test sequentially, where S2_d'(n) is S2'(n) decoded with S1_INTRA(n - d') as the side information.]

The major drawback of the CRC method is that its performance degrades outside of a certain range of block lengths. This is because even very few bit flips (say, just one bit) in a block of data lead to incorrect CRC values at the decoder. WZC techniques can be designed to limit the probability of decoding errors, but this probability is nonzero. Thus, as the block length increases, so does the probability that at least one sample will be decoded in error, even if the correct delay, and thus the correct side information, is being used. Because of this, for longer blocks it becomes more likely that the CRC test will reject every possible candidate. By a similar argument, as the decoding error probability increases (possibly due to low SNR or limitations on the transmission rate), the probability that the CRC test will reject every possible candidate also increases. Note also that a CRC test only provides pass/fail information, with no other ordering of the blocks. Thus, when all candidate delays fail the CRC check, the CRC provides no information to indicate which of the blocks might be a more likely candidate.

To improve the CRC performance, one could partition one long block into n shorter ones, so that each shorter block is sent with its own CRC (using shorter blocks can decrease the probability that the correct delay is rejected by the CRC test). If the same length of CRC is used, the overhead will increase. Conversely, if a shorter CRC is used, the risk is that multiple different decoded blocks could all pass the CRC test. This again poses the problem of selecting one among the multiple candidates that meet the condition, which cannot be done using CRC-provided information alone. Also, when one long block is partitioned, we retrieve for each block a list of candidate delays, from which a single delay for all the blocks needs to be identified with some suitable rules. As an example, if two smaller blocks, P1 and P2, are used, we can declare that a correct delay has been identified if both blocks provide consistent information, e.g., if there is only one candidate delay that is valid for both P1 and P2 (we do not fully discuss all cases due to lack of space). Note that in video applications CRCs may be applied to small data units, e.g., 8x8 pixel blocks or macroblocks, and thus the problems associated with long block lengths may not arise. However, when multiple macroblocks share CRC information (in order to reduce the rate overhead), the above-mentioned problems may arise and additional tools may be needed to supplement the CRC information.

3.2. Maximum Likelihood Techniques

We now propose a novel technique to find the correct reference at the decoder using maximum likelihood (ML) estimation. This method is based on the intuition that if we decode S2'(n) based on the correct side information S1_INTRA(n - d'), with the correct alignment d', the joint statistics of S2_d'(n) and S1_INTRA(n - d') should be similar to the original joint statistics of S2(n) and S1(n - d'). Also, S2_d'(n) and S1_INTRA(n - d') should be similar. We define the likelihood that the delay is d' as

L(Delay = d') = Pr(S2_d'(n) | S1_INTRA(n - d')),

where we apply the same probability model that was selected for the original data, Pr(S2(n) | S1(n - d')), i.e., the conditional distribution of S2 given S1, which should be known at the encoder in order to enable efficient WZC. This likelihood model can be obtained in a training process, learned online, or given as an a priori design parameter. Our proposed ML approach involves first decoding S2'(n) with respect to all possible references (d'). Then, for each S2_d', the likelihood at every sample point is averaged. The decoded data with the highest average likelihood is chosen as the decoded result.

[Fig. 4. Maximum Likelihood: S2'(n) is decoded with all the possible side information candidates, d' = -d_max, ..., d_max. The joint probability model of S1 and S2, known at the encoder, is used to compute the likelihood Pr(S2_d'(n) | S1_INTRA(n - d')) of each decoded candidate. The candidate with the maximum likelihood is chosen as the result.]

Block length influences the accuracy of TDE for the proposed method. Longer blocks include more WZC-coded samples, and thus better estimation accuracy will be achieved. However, longer blocks also introduce longer decoding delay. Moreover, in the video case displacement information changes locally (i.e., different blocks have different motion), and thus it is in general not practical to group multiple blocks together in order to improve the likelihood estimation, as often the blocks will not share common motion.

4. EXPERIMENTAL TDE RESULTS

In our experiments a uniform 8-level scalar quantizer is used. We separate the 8 quantization bins into 2 cosets and transmit only the LSB of each symbol as the coset index. The maximum possible delay is ±15 samples. Each experimental result is obtained from at least 10,000 runs of a Monte Carlo simulation, with at least 10 errors occurring for each point. In Fig. 5 we compare the TDE error probability for the various mechanisms. The total rate is the total number of bits sent from both N1 and N2 for detection. First, we note that our proposed ML approach outperforms all other methods, while being the only method that does not require any overhead. CRC works reasonably well, but only in a relatively small range of block sizes; outside of this range of rates its performance can degrade significantly. All the correlation methods require a certain amount of overhead, and their performance degrades as the ratio of overhead samples to total samples decreases.
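A minimal sketch of the ML delay estimator under the setup above (8-level quantizer, 2 cosets, LSB transmitted). The Gaussian form of the noise model, its parameter, and the test signal are illustrative assumptions; the key point is that no overhead bits are sent, only the coset indices.

```python
import random

def coset_decode(bits, side):
    """Per sample, pick the 8-level value whose LSB matches the received
    coset bit and which is closest to the side information."""
    return [min((v for v in range(8) if (v & 1) == b), key=lambda v: abs(v - s))
            for b, s in zip(bits, side)]

def ml_tde(bits, s1_intra, max_delay, sigma):
    """Decode with every candidate delay, then keep the delay whose decoded
    samples have the highest average log-likelihood under the assumed
    noise model Pr(S2 | S1) ~ N(S1, sigma^2)."""
    L, best = len(bits), None
    for d in range(-max_delay, max_delay + 1):
        side = [s1_intra[n - d] for n in range(max_delay, max_delay + L)]
        dec = coset_decode(bits, side)
        loglik = sum(-(x - s) ** 2 / (2 * sigma ** 2)
                     for x, s in zip(dec, side)) / L
        if best is None or loglik > best[0]:
            best = (loglik, d, dec)
    return best[1], best[2]

random.seed(7)
L, max_delay, d_true = 64, 15, 4
s1 = [random.randrange(8) for _ in range(L + 2 * max_delay)]
# S2 is S1 delayed by d_true; correlation noise is omitted here so that the
# example is exactly reproducible (sigma only parameterizes the decoder model)
s2 = [s1[n - d_true] for n in range(max_delay, max_delay + L)]
bits = [v & 1 for v in s2]          # only the coset index (LSB) is transmitted
d_est, _ = ml_tde(bits, s1, max_delay, sigma=0.5)
print(d_est)  # → 4 (the true delay is recovered without any rate overhead)
```

Unlike the pass/fail CRC test, the average likelihood gives a full ordering of the candidates, so even when no candidate decodes perfectly, the most plausible one can still be selected.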

[Fig. 5. Error probability of TDE by various methods at 20 dB SNR. The number following PSC and ESC represents the ratio of independently encoded symbols to the block length. In the CRC approaches, CRC-12 is used; BLK2 refers to the 2-block case mentioned in Section 3.1.2.]

In summary, these results show that the ML technique is a rate-efficient and accurate method for TDE. This method relies, as does WZC in general, on some knowledge of the conditional statistics of S1 and S2, so that the performance of both TDE and WZC will degrade when there is a mismatch between the noise statistics assumed in the design and the actual noise affecting the measurements.

5. MAXIMUM LIKELIHOOD TECHNIQUES FOR MOTION ESTIMATION AT THE DECODER

We now extend our proposed method to the video case. For this we use a transform domain distributed video coding architecture, as shown in Fig. 6. This is a simplified version of the PRISM system [5].

[Fig. 6. The architecture of transform domain distributed video coding, a simplified version of PRISM [5]: the current block Ci goes through DCT, quantization and probability classification; the decoder uses the CRC/hash function to select the correct reference block among the candidates Bj, j = 1 ... M, in the reconstructed frame.]

Instead of aligning two signal streams, here we are trying to find the best predictor block from the previous frame that enables correct decoding of the current macroblock. Let Ci be the current macroblock, Ci' be the coded version of Ci, Bj, j = 1 ... M, be all the possible candidate blocks in the previous frame, and Ci_j be the decoded Ci when Bj is used as side information. The likelihood of Bj being the best predictor is L(Bj) = Pr(Ci_j | Bj). The same probability model Pr(Ci | BTi) as for the original data is applied, where BTi is the true predictor for Ci. This model can be generated through training. Here we assume that the DCT coefficients are independent, i.e.,

Pr(Ci | BTi) = Π_{k=1}^{D} Pr(Ci^k | BTi^k),

where Ci^k represents the k-th DCT coefficient in block Ci, and likewise for BTi^k; D is the number of DCT coefficients. Also, for transform domain coding the DCT coefficients are quantized before transmission, so here we consider the probability of the difference between the quantized DCT coefficients of the current block and the predictor block:

Pr(Ci | BTi) = Π_{k=1}^{D} Pr(Ci^k | BTi^k) = Π_{k=1}^{D} Pr_k(Q(Ci^k) - Q(BTi^k)),

where Pr_k is the pmf for the k-th DCT coefficient.

Although the TDE problem and the motion estimation problem are very closely related, the motion estimation problem differs from the TDE problem in two major aspects. First, in the TDE case, determining the delay between the two sources is what matters most (as in many cases TDE is used in the context of source localization). In video coding problems, instead, reconstruction quality is more important (and the accuracy of the estimated motion itself is not as important, as long as good quality can be achieved at the decoder). Thus, in our evaluation for the video case we use PSNR, rather than probability of error, as the performance metric. Second, unlike in Section 4, where we know the exact probability model, in the video case the joint statistics of Ci and Bj are usually hard to estimate and may change over time. Thus, while in Section 3.2 the ML method could perform reliable TDE based solely on the WZC information, here sending extra intra information is sometimes needed to help locate the best predictor and improve the reconstruction quality. The extra intra information we select to send in our experiments is the DC value of the macroblock. We choose the DC value because the DC coefficient influences PSNR the most among all frequency coefficients.

6. EXPERIMENTAL RESULTS

In our experiments, an 8x8 DCT is used. A different uniform scalar quantizer is used for each of the 64 DCT coefficients (we use the MPEG-1 intra mode quantization table). The WZC scheme is coset coding. Two macroblock sizes (8x8 and 16x16) are tested. Test video sequences include Foreman and Hall Monitor. Experiments are done on the 11th to 12th frames of the sequence. For the ML method, the probability model is trained on the 1st to 10th frames of the sequence, and the true predictor blocks are those with minimum SAD. For the CRC method, the length of the CRC code is 12. All the methods are compared with the optimal case, i.e., that where the motion vectors are computed at the encoder and transmitted to the decoder. The DC coefficients are sent as intra information. The experimental results are shown in Figs. 8 and 9.
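The per-coefficient likelihood above can be sketched as follows. The 4x4 DCT, flat quantizer step, Laplacian-shaped pmf over quantized coefficient differences, and random "frames" are all illustrative assumptions (the paper uses an 8x8 DCT with the MPEG-1 intra quantization table and a pmf trained on earlier frames).

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix (rows are basis vectors)."""
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n)) * np.sqrt(2 / n)
    M[0] /= np.sqrt(2)
    return M

N = 4                     # illustrative block size (the paper uses 8x8)
T = dct_matrix(N)
STEP = 16                 # flat quantizer step, an illustrative stand-in

def q_dct(block):
    """Quantized DCT coefficients Q(.) of a block."""
    return np.round(T @ block @ T.T / STEP).astype(int)

def log_pmf(diff):
    """Illustrative trained model: Laplacian-shaped pmf over quantized
    coefficient differences, Pr_k(Q(Ci^k) - Q(B^k)) ∝ exp(-|diff|)."""
    return -np.abs(diff) - np.log(1 + 2 / (np.e - 1))  # normalized over integers

def block_loglik(ci_q, b):
    """log L(B) = sum over coefficients of log Pr_k(Q(Ci^k) - Q(B^k))."""
    return log_pmf(ci_q - q_dct(b)).sum()

rng = np.random.default_rng(0)
prev = rng.integers(0, 256, (16, 16)).astype(float)     # "previous frame"
cur = prev[5:5 + N, 7:7 + N] + rng.normal(0, 2, (N, N)) # true predictor + noise
ci_q = q_dct(cur)                                       # what the decoder recovers

# Exhaustive candidate search over the previous frame, as the decoder would do
candidates = [(r, c) for r in range(16 - N + 1) for c in range(16 - N + 1)]
best = max(candidates,
           key=lambda p: block_loglik(ci_q, prev[p[0]:p[0] + N, p[1]:p[1] + N]))
print(best)  # → (5, 7), the position of the true predictor block
```

In the real system the decoded coefficients Ci_j depend on the candidate Bj used as side information, and the pmf differs per coefficient; the sketch keeps only the likelihood-ranking step, which is what replaces the CRC's pass/fail decision.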

[Fig. 7. Due to the lack of knowledge of the exact probability model in the video case, extra intra information may be needed to help locate the best predictor and improve the reconstruction quality.]

First, the experimental results verify the limitation of CRC discussed in Section 3.1.2. When the decoding error probability is high, which translates to a smaller number of cosets and a lower rate in our experimental setup, CRC fails to locate the side information. The poor performance of CRC, as seen in Fig. 8, is due to the fact that not all the blocks could find corresponding side information. When the CRC test rejects all candidate side information blocks, we cannot successfully decode the frame in its entirety, and the PSNR presented here is then computed assuming that every pixel of a block that cannot be decoded is set to 127. On the other hand, at a similar transmission rate, ML in general gives better results. In particular, when bigger macroblocks are used (16x16 pixels) and the DC is sent as intra information, ML provides performance close to the optimum level.

We also show, in Fig. 9, that when the decoding error probability is lower (i.e., the number of cosets is bigger and the rate is higher), both CRC and ML can perform well. When bigger macroblock sizes and/or DC intra information are used, ML outperforms CRC. Due to lack of space, we do not include the results for Hall Monitor, but they are similar to those for Foreman. When longer test sets are used, the performance degrades; this indicates that in order to use ML techniques in practice it would be necessary to use adaptive probabilistic models. Also, from Fig. 8 and Fig. 9 we can see that under the same parameter settings, bigger block sizes lead to better reconstruction for the ML method. This is consistent with the results of Section 4. In cases where CRC techniques fail to identify the correct side information, ML can be a reliable alternative, provided bigger block sizes can be used and the DC helper information is sent. In this case ML can perform reliable motion estimation at the decoder at a slightly lower rate than CRC.

[Fig. 8. Foreman: PSNR (dB) vs. rate (bit/frame) of the various methods at low rate. The number of cosets for the AC coefficients is 2. The number of cosets for the DC coefficients is 2 for CRC and 32 for ML (while the CRC method spends bits on sending the CRC code, ML spends bits on higher precision for the DC coefficients). The DC coefficients for MLDC are sent in intra mode (8 bits precision). EncMV represents the optimal case.]

[Fig. 9. Foreman: PSNR (dB) vs. rate (bit/frame) of the various methods at higher rate. The number of cosets for the AC coefficients is 32. The rest of the experimental settings are the same as those of Fig. 8.]

7. REFERENCES

[1] D. Slepian and J. K. Wolf, "Noiseless coding of correlated information sources," IEEE Transactions on Information Theory, vol. IT-19, no. 4, pp. 471-480, July 1973.
[2] A. D. Wyner and J. Ziv, "The rate-distortion function for source coding with side information at the decoder," IEEE Transactions on Information Theory, vol. IT-22, pp. 1-10, January 1976.
[3] P. Ishwar, V. M. Prabhakaran, and K. Ramchandran, "Towards a theory for video coding using distributed compression principles," in Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain, September 2003.
[4] B. Girod, A. M. Aaron, S. Rane, and D. Rebollo-Monedero, "Distributed video coding," Proceedings of the IEEE, vol. 93, pp. 71-83, January 2005.
[5] R. Puri and K. Ramchandran, "PRISM: A 'reversed' multimedia coding paradigm," in Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain, September 2003.
[6] L. Vasudevan, A. Ortega, and U. Mitra, "Application-specific compression for time delay estimation in sensor networks," in First ACM Conference on Embedded Networked Sensor Systems (SenSys), Los Angeles, CA, November 2003.