ROBUST LOW-LATENCY VOICE AND VIDEO COMMUNICATION OVER BEST-EFFORT NETWORKS

A dissertation submitted to the Department of Electrical Engineering and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Yi Liang
August 2003

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Bernd Girod (Principal Adviser)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Balaji Prabhakar

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
John Apostolopoulos

Approved for the University Committee on Graduate Studies.

Abstract

The quality-of-service limitations of today's best-effort networks pose a major challenge for low-latency multimedia communication. Excessive delay, packet loss, variations in throughput, and high delay jitter all impair communication performance. In this work, these challenges are addressed at the transport and application layers of real-time and on-demand streaming media systems.

On the client side, passive schemes including adaptive playout scheduling and low-latency loss concealment are shown to significantly improve the trade-off between buffering delay and packet loss for real-time voice communication. The playout schedule of the media packets is adjusted adaptively and highly dynamically, and the proposed underlying packet scaling technique, based on time-scale modification, preserves good sound quality. At the transport layer, communication performance is further improved by exploiting the diversity of multiple transmission channels: the source media are coded into multiple complementary streams that are sent over independent network paths. Experiments demonstrate further reductions in latency and distortion resulting from path diversity.

In order to combat network losses for real-time and on-demand video communication, which exhibits stronger dependency across packets, a network-adaptive coding scheme is employed to dynamically manage the packet dependency using optimal reference picture selection. The selection of the reference is achieved within a rate-distortion optimization framework and is adapted to the varying network conditions. For network-adaptive streaming of pre-encoded media, the potential mismatch error during bitstream assembly is avoided by using a layered coding structure. Based on an accurate loss-distortion model introduced in this work, a prescient scheme that optimizes the dependency of a group of packets is also proposed to achieve global optimality as well as improved rate-distortion performance. With the improved trade-off between compression efficiency and error resilience, the proposed system does not require retransmission of lost packets, which makes low-latency communication possible.

These solutions, provided at the receiving client, the transport layer, and the source coder, significantly improve the perceptual quality and reduce the latency of media communication over best-effort networks, without requiring any modification of the current or future network infrastructure.

Acknowledgments

I would like to thank Professor Bernd Girod for his valuable advice as well as his insightful and stimulating ideas during my doctoral program, and for providing me the opportunity to work in the Image, Video and Multimedia Systems research group on challenging and rewarding projects. I would also like to thank Dr. John Apostolopoulos, Professor Balaji Prabhakar, and Professor Fouad Tobagi for serving as my advisors, for their valuable input to my work, and for the time they have spent with me on this thesis.

Special thanks to the many faculty members in the Electrical Engineering Department, to the members, alumni, friends, and assistants of the Image, Video and Multimedia Systems group, and to the colleagues in the Information Systems Laboratory. I enjoyed the supportive atmosphere as much as my stay at Stanford.

Last but not least, I would like to thank my family members, especially my parents, whose constant support and encouragement have been indispensable in making this thesis a reality.

Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Motivation
  1.2 State-of-the-Art
    1.2.1 Client-side techniques
    1.2.2 Active error-resilience techniques
    1.2.3 Rate-scalable coding
    1.2.4 Diversity techniques
  1.3 Summary of Major Contributions
2 Adaptive Playout Scheduling for Real-Time Voice Communication
  2.1 Buffering and Playout Scheduling
  2.2 Adaptive Playout Scheduling
    2.2.1 Fixed vs. adaptive playout
    2.2.2 Setting the playout schedule
  2.3 Scaling of Voice Packets
    2.3.1 Single packet WSOLA
    2.3.2 Implementation issues
  2.4 Low-Latency Loss Concealment
  2.5 Performance Evaluation
    2.5.1 Comparison of different playout scheduling schemes
    2.5.2 Subjective listening tests of adaptive playout scheduling
  2.6 Chapter Summary
3 Transport Using Packet Path Diversity
  3.1 Adaptive Playout Scheduling of Multiple Voice Streams
    3.1.1 Coding of multiple streams
    3.1.2 Setting the playout schedules for multiple streams
    3.1.3 Estimation of the packet loss probability
  3.2 Internet Experiments
  3.3 Analysis of Path Diversity Gain
    3.3.1 Simulation setup
    3.3.2 Link loss reduction
    3.3.3 Delay reduction
    3.3.4 Shared link
  3.4 Chapter Summary
4 Network-Adaptive Coding for Low-Latency Video Communication
  4.1 Packet Dependency Management and Reference Picture Selection
  4.2 R-D Optimized Network-Adaptive Packet Dependency Management
  4.3 Performance of Network-Adaptive Coding
  4.4 Chapter Summary
5 Streaming of Pre-Encoded Media
  5.1 Layered Coding Structure
    5.1.1 Layered coding structure with coding restrictions
    5.1.2 Performance of network-adaptive coding with restrictions
  5.2 Loss-Distortion Modeling for Compressed Video
    5.2.1 Review of previous models
    5.2.2 Proposed loss-distortion modeling considering error correlation and error propagation
    5.2.3 Modeling performance
  5.3 Prescient Packet Dependency Management
    5.3.1 Prescient packet dependency management
    5.3.2 Rate and distortion estimation for generating R-D preambles
    5.3.3 The iterative descent algorithm to determine the optimal coding modes
    5.3.4 Performance of prescient packet dependency management
  5.4 Chapter Summary
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
Bibliography

List of Tables

2.1 Basic notation for adaptive playout scheduling.
2.2 Collected network delay traces.
2.3 Subjective test results of packet scaling.
2.4 Subjective test results of playout algorithms 2 and 3.
2.5 Subjective test results of modulated noise reference units (MNRU) and original speech.
3.1 Basic notation for multi-stream playout scheduling.
3.2 Delay and loss statistics of the two streams in the Internet experiments.
5.1 Averaged modeling error (dB) for burst losses of length two, given by the additive model and by the proposed model with local parameter estimation (LE) and global estimation (GE).

List of Figures

2.1 Different playout scheduling schemes. Algorithm 1: fixed playout scheduling (top); Algorithm 2: between talkspurt adjustment (middle); Algorithm 3: within talkspurt adjustment (bottom). Gaps in solid lines correspond to silence periods between talkspurts. The packet interval is 20 ms.
2.2 Fixed (a) and adaptive playout (b).
2.3 Extension (a) and compression (b) of single voice packets using time-scale modification.
2.4 Algorithm for adaptive playout time adjustment with packet scaling.
2.5 Loss concealment for (a) single loss, (b) interleaved loss, and (c) consecutive loss.
2.6 Algorithm for loss concealment.
2.7 Performance of playout scheduling schemes. Algorithm 1: fixed playout time. Algorithm 2: between talkspurt adjustment. Algorithm 3: within talkspurt adjustment.
3.1 Source encoding: a) MDC; b) single-stream with FEC.
3.2 Playout scheduling of multiple streams.
3.3 Experimental setup. Source and destination hosts are shown as white circles and relay servers as gray circles, all labeled with their IP addresses. Intermediate service providers are represented by boxes. The numbers in parentheses show the average time in milliseconds required for the packets to traverse the corresponding providers or other interconnected networks.
3.4 Loss-delay tradeoff, Experiment 1.
3.5 PESQ score vs. delay, Experiment 1.
3.6 Loss-delay tradeoff, Experiment 2.
3.7 PESQ score vs. delay, Experiment 2.
3.8 Multi-hop topologies for network simulations: a) independent paths; b) paths sharing a common link. Each of the intermediate nodes N1 through N6 has a number of TCP data sources attached.
3.9 Link loss reduction.
3.10 Average delay reduction vs. difference in propagation delay.
3.11 Average delay reduction vs. delay STD.
3.12 Link loss rate reduction with a shared link of different bandwidth.
4.1 A coding structure where each frame uses the third previous frame as a reference ($v = 3$). Each frame is correctly received at the decoder with probability $1 - p$. Frame 5 in the sequence depends on $\lceil 5/3 \rceil = 2$ previous frames, and the probability it will be affected by a previous loss is $p_e = 1 - (1 - p)^2$.
4.2 The probability of the 10th frame being affected by a prior loss (left axis) and the sequence-averaged rates (right axis) using different reference frames. Rates are obtained by encoding the first 230 frames of the Foreman sequence (30 frames/sec) using H.26L TML 8.5 at an average PSNR of approximately 33.4 dB. $p = 0.10$.
4.3 The binary tree structure for estimating error propagation and optimal reference selection. $v = 1$ represents using frame $n - 1$ as the reference for prediction. $v = 2$ represents using frame $n - 2$ as the reference for prediction.
4.4 R-D performance for Foreman sequence. V = 5, p = 10%.
4.5 R-D performance for Mother-Daughter sequence. V = 5, p = 10%.
4.6 R-D performance for Salesman sequence. V = 5, p = 10%.
4.7 R-D performance for Claire sequence. V = 5, p = 10%.
4.8 Distortion at different channel loss rates. Foreman sequence. V = 5.
4.9 R-D performance at different LTM lengths. Foreman sequence. p = 10%.
5.1 Layered coding structure with coding restrictions. $T_{GOP} = 25$, $V = 5$.
5.2 R-D performance for Foreman sequence. p = 10%.
5.3 R-D performance for Mother-Daughter sequence. p = 10%.
5.4 R-D performance for Salesman sequence. p = 10%.
5.5 R-D performance for Claire sequence. p = 10%.
5.6 Measured versus estimated total distortion as a function of burst loss length, normalized by total distortion for a single loss.
5.7 Total distortion and error correlation of two losses with a lag. First loss at frame 80, and second loss at frame 80 + lag.
5.8 Measured versus estimated total distortion for two losses separated by a lag, normalized by total distortion for a single loss.
5.9 Illustration of prediction modes.
5.10 The iterative descent algorithm to determine the optimal coding modes.
5.11 R-D performance for Foreman sequence. L = 10.
5.12 R-D performance for Mother-Daughter sequence. L = 10.


Chapter 1

Introduction

1.1 Motivation

Since the introduction of the first commercial products in 1995, Internet video streaming has experienced phenomenal growth [1]. However, despite the rapid expansion of the underlying infrastructure, technological challenges remain a major barrier to the wide adoption of online streaming media. The unreliable and stateless nature of today's Internet Protocol (IP) results in a best-effort service, i.e., packets may be delivered with arbitrary delay or may even be lost. This quality of service (QoS) limitation is a major challenge for rich media communication over best-effort networks. Media data transmitted over a best-effort network suffers from variability in throughput, delay, and loss, yet it has to be delivered by a deadline to be useful. Excessive delay severely impairs communication interactivity; packet loss results in non-fluent audio, and in poor picture quality and frozen frames in video. The heterogeneity of today's Internet also poses a major challenge for media delivery to users with various connection speeds, and therefore scalability in transmission data rate is highly desirable.

The challenges that the media streaming industry faces, in conjunction with the commercial promise of the technology, have attracted considerable efforts in research and product development. This thesis addresses the QoS challenges mentioned above, and provides solutions that reduce the end-to-end communication latency and increase the robustness of communication over lossy networks. The solutions are provided at the transport and application layers of the networked media streaming system, which sit on top of, and are independent of, the network layer and the layers below. Compared to upgrading to new network architectures such as DiffServ [2], which is not expected to become widely deployed anytime soon, the approaches in this thesis have the advantage that no modification of the current or future network infrastructure is necessary.

To address the problems of communication latency and packet loss, network delay variation (also known as jitter) has to be considered. Because of delay jitter, data loss results not only from packets being dropped by the network, but also from packets arriving late and thus missing their delivery deadline. Delay jitter is caused by network congestion or by the possible retransmission of lost packets. In this thesis, an adaptive playout scheduling scheme is proposed that adjusts the output schedule of the media units adaptively, so that the impact of delay jitter is reduced. This passive scheme is implemented at the client side and does not require any cooperation of the encoder or the transport layer, which makes it easy to implement. Adaptive scheduling applies to the playout of both audio and video streams.

Video communication over rate-limited and error-prone channels, such as packet networks and wireless links, requires both high compression and high error resilience. In the past, considerable effort has been spent on the development of the most efficient compression schemes. To achieve high compression, most modern codecs employ motion-compensated prediction between frames to reduce the temporal redundancy, followed by a spatial transform to reduce the spatial redundancy, and the resulting parameters are entropy-coded to produce the compressed bitstream.
These algorithms provide significant compression; however, the compressed signal is highly vulnerable to data losses. This vulnerability is a significant concern for communication over best-effort networks, which are unreliable. Video streaming differs from audio streaming in that the dependency across successive packets is much stronger, due to the interframe motion-compensated coding. In this thesis, a network-adaptive source coding scheme is proposed to dynamically manage the dependency across packets, so that an optimal trade-off between compression efficiency and error resilience is achieved in the rate-distortion (R-D) sense. The increased error resilience eliminates the need to retransmit lost packets, which enables low-latency streaming with less than one second of delay. The proposed scheme applies to both live and pre-encoded video streaming and is compatible with open standards such as ISO/IEC MPEG-4 [3] and ITU-T H.264 [4].

In recent years, content delivery networks (CDNs) have been developed as overlay networks on top of today's existing networks, to overcome the QoS limitations of today's Internet. For the delivery of rich media content over overlay networks, the source signal can be coded into separate streams using multiple description coding (MDC) and sent over more than one network route to take advantage of the diversity of multiple transmission paths. Path diversity techniques for low-latency media communication are proposed in this thesis as a transport-layer solution to mitigate network problems on any one particular transmission path and to increase the robustness and reliability of communication.

1.2 State-of-the-Art

To address the challenges of media streaming, research efforts in recent years have been directed particularly towards communication efficiency, error robustness, low latency, and scalability [5], [6], [7], [8].

1.2.1 Client-side techniques

In a networked multimedia system, the client typically employs error-detection and loss-concealment techniques to mitigate the effect of lost data. A survey of packet loss recovery techniques for streaming audio is presented in [9], where trade-offs among algorithmic delay, voice quality, and complexity are discussed. Most client-side schemes take advantage of the data received adjacent to the lost packet and interpolate the missing information by exploiting the redundancy in the audio signal. In particular, waveform repetition, initially proposed in [10] and [11], simply repeats the information contained in the packets prior to the lost one. A more advanced loss concealment technique using time-scale modification is described in [12] and [13]. Waveform repetition does not introduce any algorithmic delay, as time-scale modification typically does; however, it does not provide as good a sound quality [9]. One goal of this thesis is to develop loss-concealment techniques that provide good sound quality at very low latency and that work together with the packet playout scheduling scheme.

For video communication, postprocessing is also applied at the client side for error concealment and loss recovery. Techniques to recover the damaged areas based on characteristics of image and video signals have been reviewed in [14]. More specifically, spatial-domain interpolation is used in [15] to recover an impaired macroblock; transform-domain schemes are used to recover the damage from partial coefficients, as presented in [16], [17], [18] and [19], where a maximally smooth image measure can be employed. Temporal-domain schemes interpolate the missing information by exploiting temporal correlation in adjacent frames. Examples include motion-compensated interpolation [20], [21], state recovery [22], the temporal smoothness method [23], coding mode recovery [23], [19], and displaced frame difference (DFD) and motion vector (MV) recovery [24], [25], [26], [27]. These schemes can also be combined with layered coding, as presented in [28] and [29].

To compensate for the network delay jitter and smooth out the playout, buffering and playout scheduling for data units are also employed at the client side for both audio and video communication. For real-time speech communication, most previous work compensates for the jitter completely within talkspurts [30], [31], [32], [33], [34]. By completely compensating for the jitter, the output packets are played in the original, continuous, and periodic pattern.
However, the fixed schedule limits the achievable trade-off between buffering delay and packet loss for low-latency applications. Some recently proposed schemes [35], [36] allow a certain amount of playout jitter for audio and other types of multimedia. Unfortunately, in these methods the playout time adjustment is made without regard to the media signal, and it is not addressed how continuous playout of the media stream, especially audio, can actually be achieved. The development of a flexible and adaptive scheduling scheme that tolerates a certain amount of playout jitter but guarantees continuous output as well as good audio quality is one of the major contributions of this thesis. Recently, this idea of adaptive playout scheduling has been extended to video streaming to reduce the latency and the effective packet loss rate [37], [38], [39].

The techniques mentioned above, including error concealment and playout scheduling, are categorized as passive methods: they are implemented at the client side and neither require any cooperation of the sender nor increase the cost of transmission. In this thesis, we will demonstrate that passive techniques, including advanced loss concealment and packet playout scheduling, impose low overhead on the communication but are highly effective in enhancing the quality of the rendered media.

1.2.2 Active error-resilience techniques

A different category of error-resilience techniques requires the encoder to play a primary role. These techniques are able to provide even stronger robustness for media communication over best-effort networks. We refer to them as active to differentiate them from those employed only at the client side.

Error control for audio

For speech communication, one widely accepted way to reduce the effective packet loss observed by the receiver is to add redundancy to the voice stream at the sender. This is possible without imposing too much extra network load, since the data rate of voice traffic is very low compared with other types of multimedia and data traffic. A common method to add redundancy is forward error correction (FEC), which transmits redundant information across packets [40], [41], [42], so that loss recovery is performed at the cost of higher latency. The efficiency of FEC schemes is largely limited by the bursty nature of the channel losses [43]. In order to combat burst loss, redundant information has to be added into temporally distant packets, which introduces higher delay.
Another sender-based loss recovery technique, interleaving, does not increase the data rate of transmission but still introduces delay at both the encoder and decoder sides. The efficiency of loss recovery depends on the number of packets over which the source packet is interleaved and spread. Again, the wider the spread, the higher the introduced delay [44]. For these reasons, in this work we instead employ path diversity techniques for error robustness, combined with passive schemes including playout scheduling and loss concealment, for low-latency speech communication.

Error control for video

Video communication typically requires much higher data transmission rates than audio. A variety of active schemes have been proposed not only to increase the robustness of communication, but also to take the data rate efficiency into consideration [45], [46], [47], [48], [49], [50]. Many of the recent algorithms use R-D optimization techniques to improve the compression efficiency [51], [52], [53], as well as to increase the error-resilience performance over lossy networks [54], [55]. The goal of these optimization algorithms is to minimize the expected distortion due to both compression and channel losses, subject to a bit-rate constraint. One example of recent work in this area is Intra/Inter-mode switching [56], [57], [58], [59], where Intra-coded macroblocks are updated according to the network condition to mitigate temporal error propagation. In particular, an algorithm to estimate the overall distortion of decoder frame reconstruction due to quantization, error propagation, and error concealment has been proposed in [59], [60], [61] for optimal Intra/Inter-mode switching.

Another approach is to modify the temporal prediction dependency of motion-compensated video coding in order to mitigate or stop error propagation. Example implementations include reference picture selection (RPS) [62], [45], [63], [4] and NEWPRED in MPEG-4 [64], [3], where channel feedback is used to efficiently stop error propagation due to any transmission error. Another example is video redundancy coding (VRC), where the video sequence is coded into independent threads (streams) in a round-robin fashion [65], [45].
A Sync-frame is encoded by all threads at regular intervals to start a new thread series and stop error propagation. If one thread is damaged due to packet loss, the remaining threads can still be used to predict the Sync-frame. VRC provides improved error resilience, but at the cost of a much higher data rate. More recently, dynamic control of the prediction dependency has been presented using long-term memory (LTM) prediction to achieve improved R-D performance [53], [54], [66], [67].

To achieve network adaptiveness and bit-rate scalability for the streaming of pre-encoded media, one particular issue is the dynamic assembly of compressed bitstreams without mismatch error (also known as drift). To that end, S-frames [68] and SP-frames [69] have been proposed, but at the cost of a higher data rate. A model for estimating the end-to-end distortion for pre-encoded video is proposed in [70], [71] to aid R-D optimized streaming. In this work, we will propose R-D optimized adaptive streaming without mismatch error, at low data rate overhead, and for low-latency applications.

Typically, a channel coding module in a robust video communication system may involve FEC and automatic retransmission on request (ARQ). As in voice communication, when FEC is employed across packets, missing packets can be recovered at the receiver as long as a sufficient number of packets is received [72], [73], [74], [75], [76]. In particular, Reed-Solomon (RS) codes are suitable for this application due to their convenient features [77], [78]. FEC is widely used as an unequal error protection (UEP) scheme to protect prioritized transmissions. Recent works have addressed the problem of how much redundancy should be added and how it should be distributed across differently prioritized data partitions [79], [80], [81], [82], [83], [84], [78], [85]. In addition to FEC codes, data randomization and interleaving are also employed for enhanced protection [86], [23], [87], [88], [89].

ARQ techniques incorporate channel feedback and employ the retransmission of erroneous data [90], [91], [92], [93], [94]. ARQ systems use combinations of positive acknowledgments (ACK), negative acknowledgments (NACK), and time-outs to determine which packets should be retransmitted.
Unlike FEC schemes, ARQ intrinsically adapts to the varying channel conditions and tends to be more efficient in transmission. However, for real-time communication and low-latency streaming, the latency introduced by ARQ is a major concern. In addition, like all feedback-based error control schemes, ARQ is not appropriate for multicasting.
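As a minimal sketch of the cross-packet FEC idea discussed above, the following Python example protects a block of equal-length packets with a single XOR parity packet. This is a deliberately simpler stand-in for the Reed-Solomon codes cited in the text, not the scheme used in the cited works: it can repair at most one loss per block. The function names (`make_parity`, `recover`) are hypothetical and chosen for illustration only.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Parity packet: XOR of all equal-length media packets in the block."""
    return reduce(xor_bytes, packets)


def recover(received, parity):
    """Repair a block in which lost packets are marked as None.

    Returns the full block if at most one packet is missing,
    or None if more than one packet was lost (beyond this code's reach).
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if not missing:
        return list(received)
    if len(missing) > 1:
        return None
    present = [p for p in received if p is not None]
    repaired = list(received)
    # XOR of the parity with every surviving packet yields the lost one.
    repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired


block = [b"aaaa", b"bbbb", b"cccc"]
parity = make_parity(block)
print(recover([b"aaaa", None, b"cccc"], parity))
```

Note that the receiver can only repair a loss once the rest of the block and the parity packet have arrived, which illustrates why FEC across packets trades additional latency for loss recovery, and why a burst that removes two packets of the same block defeats this particular code.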

Rate-distortion optimized packet scheduling

Recently, the problem of scheduling packet transmissions and retransmissions has attracted considerable research effort. One of the earliest publications is [95], in which a Markov chain analysis is used to find the optimal policy for transmitting layered media with minimum end-to-end distortion under a constant rate constraint. Later, a low-complexity heuristic scheme for sender-driven scheduling of media packets over best-effort networks was developed in [96]. An intriguing framework for computing R-D optimized packet transmission/retransmission policies has been proposed in [97], [98], [76]. The system optimally allocates the time and bandwidth resources among packets and decides which packet to transmit or retransmit at each opportunity, so that a Lagrangian cost function of expected rate and distortion is minimized, given each packet's delivery deadline. This framework overcomes the complexity problem encountered in [95] by using an iterative optimization. The framework has been extended to receiver-driven streaming in [99], and to hybrid wireline-wireless channels in [100] and [101]. In [102], cost-distortion optimized streaming over DiffServ networks has been explored. In [103] and [104], the case of hybrid receiver/sender-driven streaming using a proxy server located at the edge of the network has been studied. R-D optimized packet scheduling enhanced with adaptive media playout has been presented in [105] and [106] to improve the trade-off between buffering delay and reconstruction quality. A scheme considering multiple deadlines and employing accelerated retroactive decoding of late packets has been presented in [107] to further improve the R-D performance. In [108], the authors present an optimized video streaming strategy based on frame reordering for networks with significant delay variations, so that the distortion resulting from late frames is minimized.
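The Lagrangian formulation mentioned above can be illustrated with a toy example. The sketch below is not the framework of [97], [98], [76]; it assumes a single packet, independent losses with a fixed probability on each transmission, no deadlines, and a hypothetical cost model in which the expected distortion drops by `delta_d` if at least one copy arrives. All names and numbers are illustrative assumptions.

```python
def expected_cost(n_sends, loss_prob, delta_d, d_missing, size, lam):
    """Lagrangian cost J = E[D] + lam * E[R] for sending one packet n_sends times.

    Assumed toy model: each copy is lost independently with probability
    loss_prob; the distortion is d_missing if no copy arrives and
    d_missing - delta_d otherwise; the rate is n_sends * size.
    """
    p_arrive = 1.0 - loss_prob ** n_sends
    distortion = d_missing - delta_d * p_arrive
    rate = n_sends * size
    return distortion + lam * rate


def best_num_sends(max_sends, loss_prob, delta_d, d_missing, size, lam):
    """Number of transmissions (0..max_sends) that minimizes the Lagrangian cost."""
    return min(
        range(max_sends + 1),
        key=lambda n: expected_cost(n, loss_prob, delta_d, d_missing, size, lam),
    )


# A larger multiplier lam penalizes rate more heavily, so fewer copies are sent.
for lam in (1.0, 10.0, 20.0, 100.0):
    print(lam, best_num_sends(3, 0.2, 100.0, 100.0, 1.0, lam))
```

In this toy setting the multiplier plays the role of a price on rate: sweeping it traces out different operating points between low distortion and low rate, which is the basic mechanism a Lagrangian cost function provides.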
The active error-resilience techniques discussed above are employed to improve the overall system performance at different data rate and latency costs. In this work, we develop error-resilient source coding techniques with a focus on R-D optimization for applications that pose very low latency requirements. In the following we continue to review the state-of-the-art in rate-scalable coding and diversity techniques, which can also be classified as error-resilience techniques.

We present them in separate subsections due to their unique characteristics.

1.2.3 Rate-scalable coding

Layered or scalable coding, combined with transmission prioritization, is another effective approach to provide error resilience [79], [109], [110], [80], [20], [111], [7]. In a layered scheme, the source signal is coded into more than one group or layer, with the base layer containing the most essential information for media reconstruction at an acceptable quality, while the enhancement layer(s) contain the information for a reconstruction at an even better quality. At high loss rates, the more important, more strongly protected layers can still be recovered, while the less important layers might not. Common layered techniques can be categorized into temporal scalability [112], spatial scalability [113], [114], [115], signal-to-noise ratio (SNR) scalability [79], data partitioning [116], [4], or any combination of these. Layered scalable coding has been widely employed for video streaming over best-effort networks including IP and wireless networks [117], [118], [119], [120], [111], [121]. Different layers can be transmitted with a built-in priority mechanism without network support, such as the UEP mentioned above, or using network architectures such as DiffServ that provide multiple grades of QoS [2], [102], [122], [123], [124], [125]. A scheme for optimal Intra/Inter-mode selection has recently been proposed for scalable coding to limit error propagation due to packet loss [126], and a scheme for drift management and adaptive bit rate allocation for scalable coding is presented in [127]. Layered scalable coding has become part of established video coding standards such as MPEG [116], [3] and H.263+ [63]. One special case of scalable video coding is the partially embedded scheme using fine-granular scalability (FGS) [128], [129], [130], which is supported by MPEG-4 [131], [132].
FGS is more elegant for rate scalability and adaptation, but is considered less efficient in compression than non-scalable coding, especially at low bit-rates. Recently, a novel scheme using leaky prediction in the enhancement layers has been shown to close much of the efficiency gap [133]. Layered video coding schemes can aid TCP-friendly streaming, as they support rate scalability and rate control to adapt to the varying network condition, so that network congestion can be mitigated [134], [135], [136], [50]. One particular case is receiver-driven layered multicasting, where video layers are transmitted in different multicast groups and rate control is realized by subscribing to the appropriate groups [137], [138], [139], [55].

1.2.4 Diversity techniques

Another category of error-resilience techniques is based on MDC, a source coding technique, which may be combined with packet path diversity, a transport technique. MDC encodes the source signal into more than one description of equal importance, which may be delivered over several parallel channels. Earlier work on MDC from the perspective of information theory can be found in [140], [141], [142], [143], [144], [145], [146], [147], [148]. In multimedia communication, MDC can be applied to both voice [149], [150], [151], [152], and image and video signals [153], [154], [155]. An MDC scheme constructed through the use of FEC codes is presented in [83] to achieve robustness and efficiency in Internet video streaming and multicast. A construction for layered multiple description codes has been explored in [156]. In this scheme, base layer descriptions can be transmitted to low-speed clients, while both base and enhancement layer descriptions can be transmitted to high-speed clients, which combines the advantages of both layered and multiple description codes. To maximize the benefits of diversity in media communication, multiple descriptions in different streams can be sent, in a distributed manner, over independent or largely uncorrelated network paths with diversified loss and delay characteristics [157], [158], [159], [160], [161], [162], [163], [164]. In this way, the probability of a negative disturbance, such as packet loss, impacting all channels at the same time will be small.
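As a rough illustration of why simultaneous disturbance becomes unlikely, consider the idealized case in which packet losses on the $K$ paths are statistically independent. The symbols $K$ and $p_k$ (the loss probability on path $k$) and the numbers below are introduced here only for illustration; real paths may share bottlenecks, in which case the losses are correlated and the gain is smaller.

```latex
P(\text{packet lost on all } K \text{ paths}) = \prod_{k=1}^{K} p_k ,
\qquad \text{e.g. } K = 2,\; p_1 = p_2 = 0.05
\;\Rightarrow\; 0.05 \times 0.05 = 0.0025 .
```

Under this assumption, two paths that each lose 5% of packets lose the same packet on both paths only 0.25% of the time.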
Path diversity also alleviates the problem that the default path determined by the routing algorithm is not optimal, which might often be the case according to [165]. Path diversity is also used with R-D optimized scheduling of packet transmission to achieve enhanced performance [166], [167]. Packet path diversity can be implemented by means of an overlay network that consists of relay nodes [157], [168], such as a CDN deployment [169], [170], [171] or a peer-to-peer network [172], [173]. Most of the past work on path diversity has focused on increasing the communication robustness over error-prone networks. In this thesis we take advantage of path diversity not only to improve the quality of media communication by reducing the effective packet loss rate, but also to reduce the latency for applications with very stringent delay requirements.

1.3 Summary of Major Contributions

The major contributions of this thesis are summarized as follows.

An adaptive media playout scheduling algorithm for real-time voice communication is proposed, which achieves both reduced latency and reduced loss rate. The packet playout schedule is dynamically adjusted using time-scale modification techniques to generate continuous media playout. The Waveform Similarity Overlap-Add (WSOLA) algorithm is employed to process each packet independently, without introducing any algorithmic delay. Comprehensive subjective listening tests are performed to report the quality of time-scale modified speech. A novel low-latency loss concealment scheme is proposed for this application.

Packet path diversity schemes are introduced for real-time voice communication. An adaptive multi-stream playout scheduling algorithm is proposed for real-time voice communication, which takes advantage of uncorrelated delay jitter across multiple network paths. Internet experiments using relay servers to form multiple paths are performed to validate the diversity gain. Experiments using a network simulator are also performed to analyze the diversity gain quantitatively.

A real-time video communication system is developed that provides high quality in a lossy environment without requiring packet retransmission. Since packet retransmission is not needed, VoIP-like low latency can be achieved for video communication. A rate-distortion (R-D) optimization framework is developed to determine the optimal packet dependency for adapting to the channel.

An accurate model is developed to quantify video distortion as a result of packet losses. The model explicitly considers the correlation between errors as well as error propagation, yielding higher accuracy than previous loss models.

A novel layered coding structure is introduced for network-adaptive streaming of pre-encoded video. The layered structure allows switching between streams at SYNC positions without mismatch error. The coding restrictions imposed do not compromise R-D performance over lossy channels compared with live encoding.

A prescient scheme is proposed that manages the dependency for a group of packets using a pre-computed R-D preamble, within an R-D optimization framework. A simplified loss-distortion model is used in estimating the distortion and generating the R-D preamble.

This thesis is organized as follows. Chapter 2 presents an adaptive playout scheduling scheme for low-latency real-time voice communication. Chapter 3 presents packet path diversity as a transport solution and uses real-time voice communication as an explanatory example. In Chapter 4, network-adaptive coding for packet dependency management and a low-latency video communication system that does not require packet retransmission are introduced. Chapter 5 discusses practical issues for streaming of pre-encoded media, including a layered coding structure that allows mismatch-free bitstream assembly, and a prescient scheme to achieve global optimality for packet dependency based on a loss-distortion model. Finally, conclusions and potential future work are discussed in Chapter 6.

Chapter 2

Adaptive Playout Scheduling for Real-Time Voice Communication

QoS limitation is a major challenge for real-time voice communication over IP networks (VoIP). Since excessive end-to-end delay impairs the interactivity of human conversation, active error control techniques such as retransmission cannot be applied. Therefore, any packet loss directly degrades the quality of the reconstructed speech. Furthermore, delay jitter obstructs the proper reconstruction of the voice packets in their original sequential and periodic manner. Considerable efforts have been made at different layers of current communication systems to reduce the delay, smooth the jitter, and recover from loss. One important functionality to be implemented at the receiver is the concealment of lost packets, i.e., the recovery of lost information based on the redundancy in neighboring packets. Another functionality, which is discussed in this chapter, is the playout scheduling of voice packets. In this chapter we propose a new playout scheduling scheme that exploits the increased flexibility of allowing more playout jitter. In this scheme, we adaptively adjust the playout schedule of each individual packet according to the varying network condition, even during talkspurts. The continuous output of high-quality audio is achieved by scaling the voice packets using a time-scale modification technique.

As a result, we are able to allow a higher amount of playout jitter compared to previous work, which allows us to improve the trade-off between buffering delay and late loss significantly [174], [175]. This improvement is based on the interesting finding that increased playout jitter is perceptually tolerable if the audio signal is appropriately processed, as demonstrated by subjective listening tests. The receiver-based, passive methods have the advantage that no cooperation of the sender is required. Furthermore, they can operate independently of the network infrastructure.

This chapter is organized as follows. Section 2.1 introduces the basic concept of buffering and playout scheduling. Section 2.2 describes the principles of adaptive playout and introduces the notation and performance measures used for evaluation. Section 2.3 describes how the voice packet can be scaled using time-scaling techniques. In Section 2.4 a loss concealment mechanism is proposed that works together with the adaptive playout scheme. Finally, a performance comparison and the subjective quality test results are presented in Section 2.5.

2.1 Buffering and Playout Scheduling

The common way to control the playout of packets is to employ a playout buffer at the receiver to absorb the delay jitter before the audio is output. When using this jitter absorption technique, packets are not played out immediately after reception but held in a buffer until their scheduled playout time (playout deadline) arrives. Though this introduces additional delay for packets arriving early, it allows packets that arrive with a larger delay to be played. Note that there is a trade-off between the average time that packets spend in the buffer (buffering delay) and the number of packets that have to be dropped because they arrive too late (late loss).
Scheduling a later deadline increases the possibility of playing out more packets and results in a lower loss rate, but at the cost of higher buffering delay. Buffering delay is a significant component in the tight end-to-end delay budget for applications with very stringent delay requirements, such as VoIP.¹ However, it is difficult to decrease the buffering delay without significantly increasing the loss rate. In fact, packet loss in delay-sensitive applications, such as VoIP, is a result of not only packets being dropped over the network, but also delay jitter. The influence of delay jitter and buffering schemes on the performance of applications with low-latency requirements is significant.

¹ ITU-T recommends a one-way delay below 150 ms for bi-directional real-time voice communication [176]. Usually an even lower latency is desired by the user.

Previous work mainly focused on improving the trade-off between delay and loss, while trying to compensate the jitter completely or almost completely within talkspurts [30]-[34]. By setting the same fixed playout deadline for all the packets in a talkspurt, the output packets are played in the original, continuous, and periodic pattern, e.g., every 20 ms. Therefore, even though there may be delay jitter on the network, the audio is reconstructed without any playout jitter. Other proposed schemes [35], [36] apply adaptive scheduling of audio and other types of multimedia, accepting a certain amount of playout jitter. However, in these methods, the playout time adjustment is made without regard to the audio signal, and how continuous playout of the audio stream can actually be achieved is not addressed. As a result, the playout jitter that can be tolerated has to be small in order to preserve reasonable audio quality.

2.2 Adaptive Playout Scheduling

2.2.1 Fixed vs. adaptive playout

Fig. 2.1 (a)-(c) illustrate the three basic scheduling schemes that are investigated in this work. The graphs show the delay of voice packets on the network as dots and the total delay as a solid line. When a later playout time is scheduled, the total delay increases. Packets arriving after the playout deadline, i.e., dots above the line, arrive late, are considered lost, and have to be concealed. The task of a playout scheduling scheme with respect to Fig. 2.1 is to lower the solid line (reduce the total delay) as much as possible while minimizing the number of dots above the line (minimize late loss).
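The trade-off between lowering the line and keeping dots below it can be reproduced numerically for the simplest case of a single fixed deadline. The following is a minimal sketch, not the evaluation code used in this work. The function name `evaluate_fixed_deadline`, the sample delay values, the choice to count a packet arriving exactly at the deadline as on time, and the choice to average the buffering delay over played packets only are all assumptions made for illustration.

```python
from typing import List, Tuple


def evaluate_fixed_deadline(network_delays_ms: List[float],
                            deadline_ms: float) -> Tuple[float, float]:
    """Return (late_loss_rate, average_buffering_delay_ms) for one fixed deadline.

    network_delays_ms holds the network delay of every packet that reached the
    receiver. A packet is played if its delay does not exceed the deadline;
    otherwise it is counted as late loss (assumption: a tie counts as on time).
    """
    played = [d for d in network_delays_ms if d <= deadline_ms]
    late = len(network_delays_ms) - len(played)
    late_loss_rate = late / len(network_delays_ms)
    # A played packet waits in the buffer from its arrival until the deadline.
    if played:
        avg_buffering = sum(deadline_ms - d for d in played) / len(played)
    else:
        avg_buffering = 0.0
    return late_loss_rate, avg_buffering


# Hypothetical delay trace (ms) containing a short delay spike.
trace = [120.0, 125.0, 130.0, 122.0, 160.0, 170.0, 128.0, 124.0]

for deadline in (130.0, 150.0, 175.0):
    loss, buffering = evaluate_fixed_deadline(trace, deadline)
    print(f"deadline {deadline:.0f} ms: late loss {loss:.2%}, "
          f"avg buffering delay {buffering:.1f} ms")
```

In this made-up trace, raising the deadline from 130 ms to 175 ms removes the late loss caused by the spike but raises the average buffering delay of every other packet, which is the tension described above.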
The simplest method, denoted as Algorithm 1, uses a fixed playout deadline for all the voice packets in a session, as depicted in Fig. 2.1 (a). It is not very effective in keeping both delay and loss rate low enough in practice, because the statistics of the network delay change over time and a fixed playout time cannot compensate for this variation.

Figure 2.1: Different playout scheduling schemes. Algorithm 1: fixed playout scheduling (top); Algorithm 2: between talkspurt adjustment (middle); Algorithm 3: within talkspurt adjustment (bottom). Each panel, (a) to (c), plots delay (ms) against packet sequence number and shows the network delay $d_n^i$ and the total end-to-end delay $d_t^i$. Gaps in solid lines correspond to silence periods between talkspurts. The packet interval is 20 ms.

With the improved playout algorithms proposed in [30]-[34], the network delay is monitored, and the playout time is adaptively adjusted during silence periods. This is based on the observation that, for a typical conversation, the audio stream can be grouped into talkspurts separated by silence periods. The playout time of a new talkspurt may be adjusted by extending or compressing the silence periods. This approach is denoted as Algorithm 2 and provides some advantage over Algorithm 1, as illustrated in Fig. 2.1 (b). However, its effectiveness is limited when talkspurts are long and network delay variation is high within talkspurts. For example, the silence-dependent method is not able to adapt to the spike of high delay within the third talkspurt at packets 113-115. As a result, several packets are lost in a burst, causing audible quality degradation.

In the new scheduling scheme proposed in this work, the playout is not only adjusted in silence periods but also within talkspurts. Each individual packet may have a different scheduled playout time, which is set according to the varying delay statistics. This method is denoted as Algorithm 3 and illustrated in Fig. 2.1 (c). For the same delay trace, the new algorithm is able to effectively mitigate loss by adapting the playout time in a more dynamic and reactive way. Note that Algorithm 3 requires the scaling of voice packets to maintain continuous playout and therefore introduces some amount of playout jitter. However, this flexibility allows the average buffering delay to be reduced while reducing late loss at the same time. Hence, the trade-off between buffering delay and late loss is improved.

In the following we define an objective performance measure for playout scheduling schemes.
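One way to see why per-packet playout times imply packet scaling is to note that gapless output requires each packet to fill the interval until the next packet starts playing. The sketch below works this out under simplifying assumptions: packets are sent every 20 ms, every packet is played, and silence periods are ignored. The function name `required_packet_lengths` and the delay values are hypothetical and are not taken from this work.

```python
from typing import List


def required_packet_lengths(total_delays_ms: List[float],
                            packet_interval_ms: float = 20.0) -> List[float]:
    """Output duration each packet must fill so that playout has no gaps or overlaps.

    Packet i is assumed to be sent at i * packet_interval_ms and played at its
    send time plus its total delay. It must then last until packet i + 1 starts
    playing, so the last packet has no entry in the result.
    """
    playout_times = [i * packet_interval_ms + delay
                     for i, delay in enumerate(total_delays_ms)]
    return [later - earlier
            for earlier, later in zip(playout_times, playout_times[1:])]


# Fixed deadline: every packet keeps its original 20 ms length.
print(required_packet_lengths([140.0, 140.0, 140.0, 140.0]))

# Per-packet deadlines: the total delay rises by 10 ms, then falls in 5 ms steps.
print(required_packet_lengths([140.0, 140.0, 150.0, 145.0, 140.0]))
```

With a constant total delay, no scaling is needed. In the second call, raising the total delay by 10 ms between two packets requires the earlier packet to be stretched to 30 ms, and each 5 ms reduction requires a packet to be compressed to 15 ms.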
We introduce the basic notation and define the average buffering delay and the late loss rate as the basic performance measures. For convenience, all variables are summarized in Table 2.1.

Table 2.1: Basic notation for adaptive playout scheduling.

Notation              Description
$t_s^i$               Time packet $i$ sent
$t_r^i$               Time packet $i$ received
$t_p^i$               Time packet $i$ played out
$d_n^i$               Network delay of packet $i$
$d_b^i$               Buffering delay of packet $i$
$\bar{d}_b$           Average buffering delay of a stream
$d_t^i$               Total delay of packet $i$
$d_{\max}$            Fixed playout deadline
$d_{\max}^i$          Playout deadline of packet $i$
$D^k$                 Sorted order statistics of $\{d_n^i\}$
$\varepsilon_n$       Link loss rate
$\varepsilon_l$       Late loss rate
$\hat{\varepsilon}_l$ User-defined late loss rate
$\varepsilon_b$       Burst loss rate
$\varepsilon$         Total loss rate
$R$                   Set of received packets
$P$                   Set of played packets
$B$                   Set of packets lost consecutively
$N$                   Number of packets in a stream
$L_0$                 Sender packetization time
$L^i$                 Actual length of scaled packet $i$
$\hat{L}^i$           Target length of packet $i$

As illustrated in Fig. 2.2, we denote the time when a packet is sent, received, and played out by $t_s^i$, $t_r^i$, and $t_p^i$, respectively. The index $i = 1, 2, \ldots, N$ denotes the packet sequence number, assuming $N$ packets are sent in the stream. For packet voice, speech is usually processed and packetized into fixed-size blocks, and outgoing packets are generated periodically at a constant packetization time $L_0$, i.e., $t_s^{i+1} - t_s^i = L_0 = \text{constant}$. The buffering delay of packet $i$ is then given by $d_b^i = t_p^i - t_r^i$, while the network delay $d_n^i$ is given by $d_n^i = t_r^i - t_s^i$. The total delay $d_t^i$ is the sum of the two quantities above, i.e., $d_t^i = d_n^i + d_b^i$. Note that this total delay does not include encoding and packetization time (a constant component of the end-to-end delay), since we are mainly interested in packet transmission and speech playout in this work. We use $d_n^i = \infty$ to indicate that packet $i$ is lost during transmission and never reaches the receiver. Hence, the set of received packets is given by $R = \{\, i \mid t_r^i < \infty \,\}$. The task of a particular scheduling scheme is to set the maximum allowable total

Figure 2.2: Fixed (a) and adaptive playout (b). Each panel shows the sender, receiver, and playout timelines against time for packets $i$ through $i+4$; panel (a) marks a late loss.