Rate-distortion optimized video peer-to-peer multicast streaming

Invited Paper

Eric Setton, Jeonghun Noh and Bernd Girod
Information Systems Laboratory, Department of Electrical Engineering
Stanford University, Stanford, CA 94305-9510, USA
{esetton, jhnoh, bgirod}@stanford.edu

ABSTRACT
In live peer-to-peer multicast streaming, a source distributes a multimedia file to a large population of hosts by making use of their forwarding capacity rather than relying on costly dedicated media servers. Due to the dynamic nature of the hosts, which may disconnect at any time, a robust control protocol is needed to maintain connectivity between the peers. To avoid playout interruptions, the robustness of the control plane needs to be reproduced at the application layer. This work analyzes the gains that video coding and prioritized packet scheduling at the application layer can bring to the overall streaming performance. A rate-distortion model which predicts end-to-end video quality in throughput-limited environments is used to determine the minimal amount of over-provisioning necessary to avoid self-inflicted congestion. The video stream transmitted by the source contains H.264 SP and SI frames, which are used to adaptively stop error propagation due to packet loss. Distortion-optimized retransmission requests are issued by receiving hosts to recover the most important missing packets while limiting the induced congestion. Experiments for several hundred hosts simulated in NS-2 illustrate the benefits of the different approaches.

Categories and Subject Descriptors
C.2 [Computer-communication networks]: Distributed systems

General Terms
Design, Performance

Keywords
Peer-to-peer, video streaming

1. INTRODUCTION
As IP multicast is not universally supported, distribution of media streams in the public Internet to a large audience (multicasting) is typically realized by a large number of unicast connections. If the maximum number of streams of an individual media server (typically between a few hundred and a few thousand) is exceeded, additional server capacity must be provided by a suitable content-delivery infrastructure, e.g., in the form of a network of replication servers. Peer-to-peer (P2P) multicasting is an elegant alternative in which each end-host may act as a potential server for other clients. This avoids dedicated replication servers altogether. The approach is self-scaling, as the number of peer servers and peer clients increases at the same rate; hence it avoids the bottleneck of a central server (or dedicated replication server). The approach, in principle, would allow highly dynamic support of changing multicast demand at very low cost.

The major challenge, however, is the complete lack of performance guarantees in the P2P network. Peer nodes might be turned off or disconnected at any time without prior notice, while other nodes join or re-join. Such a highly unreliable network fabric poses a major difficulty for media streaming. A recent study based on statistics collected over the Internet reveals that enough bandwidth is available for peer-to-peer streaming even on a large scale [11]. This is also reflected in analyses of real implementations of peer-to-peer streaming systems [3][12]. The stability of the system, which is necessary to provide a satisfactory user experience, largely depends on the design of the protocol. To gain robustness and possibly aggregate data rate, path diversity should be attained by distributing streams across a sufficiently large number of complementary multicast trees.
Individual nodes are important, near the top, in some multicast trees but less important, near the bottom, in others, thus avoiding single points of failure in the node dependency graph. In the aggregate, the uplink and the downlink data rates of the clients for P2P multicasting are the same. With carefully designed diverse multicast trees, this balance can be attained for each individual peer node as well.

Although the control protocol is essential to provide efficient means of building and maintaining multiple multicast trees, it needs to be combined with advanced multimedia (audio and video) coding and streaming solutions at the application layer. State-of-the-art compression, which achieves better rate-distortion performance, alleviates the bandwidth requirements, while error-resilient streaming techniques may improve the received media quality, as in [8], where path diversity is associated with multiple description coding (MDC). The purpose of this work is to show the benefits of video coding, adaptive streaming and optimized packet scheduling for a video peer-to-peer multicast streaming system. We believe the overall system performance can be greatly improved by using efficient single-description coding techniques and achieving the required robustness through distortion-optimized packet scheduling.

In the next section, we describe the control protocol which builds and maintains multiple multicast trees to broadcast a video stream from a source to a set of peers. In Sec. 3, we make use of a video distortion model, previously proposed in [10], to predict the received video quality when a video stream is sent from multiple throughput-limited senders to a receiver at different rates. The model is used to determine the amount of over-provisioning required to limit the congestion created between end-hosts. The single-description video encoding contains the new picture types SP and SI introduced in the latest video coding standard, H.264 [4]. These switching picture types can be sent adaptively to stop error propagation in the case of transmission errors. In Sec. 4, we characterize the bit rate saving achieved by using SP and SI frames compared to traditional video transmission based on I and P frames. In the last part of the paper, we focus on retransmission scheduling from a receiving host to its ancestors. In contrast to network-level multicasting, the incorporation of application-level retransmission requests into P2P multicasting is possible without feedback implosion, since the fan-out of each individual node is small.
We describe in Sec. 5 how to schedule retransmissions from the receiver to maximize its decoded video quality while limiting the incurred network congestion.

2. OVERLAY MULTICAST PROTOCOL
The control protocol enables a video source to distribute a video stream to a population of hosts via P2P multicasting. The video source peer and other peers are connected via multiple trees which are constructed dynamically by the protocol. The source is the root of all trees and the trees are built independently. The branches of each tree connect a host to its descendants. These links are virtual tunnels which hide the underlying physical network topology. The video stream is distributed evenly over the different trees. Hence, peers need to join each of the multiple trees to decode and play out the video successfully.

Our simulations are based on a moderate-size network of a few hundred nodes, which resembles a large private intranet or a campus network, an example of which is shown in Fig. 1. We make the following simplifying assumptions: the control and transmission protocol is implemented over the UDP/IP protocol stack, and we ignore any Network Address Translator or firewall issue which may limit the connectivity; the hosts have heterogeneous but fixed upload bandwidth which they have measured and know accurately. Although these problems need to be addressed for a real Internet implementation (see e.g. [3]), they are not directly relevant to the scope of this paper.

Figure 1: Example of network topology used for simulation.

Besides accepting new peers, the protocol maintains the trees when peers leave or are disconnected during the session. A peer may leave ungracefully due to network shutdown or system failure. To keep the descendants of the peer connected, hosts need to monitor the state of their ancestors and may decide to rejoin if they detect traffic interruptions.
Our control protocol is completely distributed, except for an approximate list of connected peers maintained at the source.

2.1 Joining
Each joining host discovers the address of the source of the video stream through a directory service such as a website, and contacts this peer to obtain a list of hosts randomly chosen among connected members. The joining peer contacts all the members of the list and waits for replies. The host determines a candidate parent for each tree. If the candidate parent has enough available bandwidth at the time of the request, it accepts the host and starts forwarding video packets to it. This 6-way handshaking join process is common in peer-to-peer multicast protocols. Parents are selected based on their available bandwidth and on the delay in the response, which gives a coarse round-trip-time estimate. To make use of path diversity and reduce multiple tree failure due to a common parent, different parents are selected for the different trees whenever possible.

Once a peer is connected, it informs its parents of its presence by transmitting periodic hello messages. These messages are also used to propagate topology information such as the subtree size of a peer. Reception of a hello message generates an immediate response. This reply is intended to confirm the parent's presence.

2.2 Node disconnection
Ungraceful leaves occur when a peer leaves the group without notice, which may cause disconnection of the peer's descendants from the group. When a host leaves, its children detect missing video packets and/or missing hello replies. Once the leave is detected, hosts try to use extra links, maintained in parallel with the multicast trees, to recover connectivity. If this fast recovery mechanism fails, the peer has

to contact the source to get a new list of candidate parents. While the host reconnects, retransmission requests are issued over the other multicast trees to recover missing video packets.

Extra links to potential parents are maintained by messages exchanged periodically. The list is obtained when the source peer replies to the initial join request. After a host joins the different multicast trees, it keeps the list of remaining available hosts to construct a pool of extra links. If the number of hosts in the pool falls below a certain threshold, the list is updated using a gossip algorithm. Extra links are not reserved resources but do indicate which hosts are available for potential reconnection.

Child leave detection has less influence on the overall performance, because late detection of a child leave only means a temporary waste of local network bandwidth. Since the penalty of false child leave detection is high, a longer time interval is allowed before a child leave is detected. When a child leaves, its parent stops forwarding video packets by removing it from its forwarding table.

2.3 Loop avoidance
After a host is disconnected, it may try to rejoin one of its descendants. This would create a loop which would eventually starve the whole subtree. To prevent such an undesirable case, each peer keeps the list of its direct ancestors, which are the peers on the path from the peer to the source. When one of the peers in the extra link pool receives a rejoin request, it first checks whether the requesting peer is one of its direct ancestors. Since a peer only needs to remember its direct ancestors, it does not need to know the whole tree structure. Thus, the additional memory and processing power required are negligible.

Figure 2: Percentage of control protocol overhead.

2.4 Protocol evaluation
The control protocol was evaluated over different network topologies for a varying number of multicast trees. In Fig. 2, the control traffic overhead is shown for different configurations.
As illustrated, even when 4 multiple trees are maintained, the overhead does not exceed 2% of the total traffic exchanged on the network, for topologies supporting up to 750 hosts. In terms of latency, the joining operation takes less than 0.5 s, parent leaves are detected in less than 2 seconds and it takes an additional second to rejoin, on average.

When video is distributed over a larger number of multiple trees, the effect of an ungraceful leave is less important, as children have several parents from whom they can request retransmissions. On the other hand, maintaining more trees incurs more control traffic overhead. In Fig. 3, this tradeoff is shown in terms of the average video quality for the hosts of the network, measured as the average Peak Signal to Noise Ratio (PSNR), expressed in dB, as a function of the number of multiple multicast trees maintained by the protocol. In this environment, the optimal tradeoff between robustness and congestion is obtained when 4 multiple trees are used.

Figure 3: Video quality as a function of the number of multiple trees.

3. VIDEO DISTORTION MODEL
For live video streaming applications, video packets are transmitted over the network and need to meet a playout deadline. Decoded video quality at the receiver is therefore affected by two factors: distortion introduced by the encoder compression, denoted by $D_{enc}$, and distortion due to packet loss or late arrivals, denoted by $D_{loss}$. Assuming an additive relation of these two independent factors, a video distortion model was derived in [10]. The decoded video distortion, $D_{dec}$, is given by:

\[ D_{dec} = D_{enc} + D_{loss}, \tag{1} \]
\[ D_{enc} = D_0 + \frac{\theta}{R - R_0}, \tag{2} \]
\[ D_{loss} = \kappa \left( P_r + (1 - P_r) \, e^{-(C - R) T / L} \right), \tag{3} \]
\[ C = \sum_{i=1}^{n} C_i, \tag{4} \]
\[ R = \sum_{i=1}^{n} R_i. \tag{5} \]

In (2), $R$ is the total rate of the video stream, and the parameters $D_0$, $\theta$ and $R_0$ are estimated from empirical rate-distortion curves via regression techniques.

Figure 4: Rate-distortion model and experimental results for video streaming over varying capacity paths. (Axes: PSNR (dB) versus rate (kbps); curves: experiment and model for 80 kbps, 260 kbps and 360 kbps.)

The second distortion term, $D_{loss}$, depends linearly on the packet loss rate. The scaling factor $\kappa$ indicates the sensitivity of the stream to losses, which depends on the encoding structure. The other factor reflects the combined rate of random losses and late arrivals. $P_r$ is the random packet loss rate and $T$ is the time within which each packet should reach the receiver (typically a few hundred milliseconds). The parameters $C$ and $L$ depend on the maximum allowable rate and on the average packet size. In (4)-(5), $C_i$ and $R_i$ represent the available throughput and the rate of the video exchanged between the receiver and its parent on the $i$-th multicast tree. Typically, each video substream will be equivalent: $R_i = R/n$.

In Fig. 4, the model is represented together with empirical measurements for three different network setups. In the first case the aggregate throughput to the receiver over the different multicast trees is 80 kbps, 260 kbps in the second case and 360 kbps in the third.
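As an illustration only, the model of Eqs. (1)-(5) can be evaluated numerically to locate the rate that minimizes decoded distortion for a given aggregate capacity. The sketch below is a minimal Python rendering under stated assumptions: the function names, the parameter values in the usage example and the clamping of the late-arrival term when the rate exceeds the capacity are not taken from the paper and are hypothetical.

```python
import math


def decoded_distortion(rate, capacity, d0, theta, r0, kappa, p_r, deadline, l_param):
    """Evaluate D_dec of Eq. (1) for a total video rate and an aggregate capacity.

    rate, capacity and r0 share one unit (e.g. kbps); deadline is T and
    l_param is L, in units that make (C - R) * T / L dimensionless.
    """
    if rate <= r0:
        raise ValueError("rate must exceed R_0 for Eq. (2) to be defined")
    d_enc = d0 + theta / (rate - r0)                      # Eq. (2)
    exponent = -(capacity - rate) * deadline / l_param
    # Assumption (not in the paper): treat every packet as late once R >= C.
    late = 1.0 if exponent >= 0.0 else math.exp(exponent)
    d_loss = kappa * (p_r + (1.0 - p_r) * late)           # Eq. (3)
    return d_enc + d_loss                                 # Eq. (1)


def best_rate(capacities, candidate_rates, **params):
    """Return the candidate total rate with the lowest modeled distortion."""
    capacity = sum(capacities)                            # Eq. (4)
    return min(candidate_rates,
               key=lambda r: decoded_distortion(r, capacity, **params))


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the formulas.
    params = dict(d0=1.0, theta=3000.0, r0=10.0, kappa=500.0,
                  p_r=0.01, deadline=0.3, l_param=4.0)
    rate = best_rate([65.0] * 4, range(20, 260), **params)
    print(rate, decoded_distortion(rate, 260.0, **params))
```

With parameters of this kind the minimum falls at an intermediate rate, below the aggregate capacity, because the encoder term decreases with the rate while the late-arrival term grows as the rate approaches the capacity.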
This model reflects the impact of the rate on video distortion. At lower rates, reconstructed video quality is limited by coarse quantization, whereas at high rates, more packets are delayed beyond their playout deadline due to network congestion. For live video streaming in a bandwidth-limited environment, we therefore expect to achieve maximum decoded quality at some intermediate rate. This is illustrated in Fig. 4 by the bell shape of the curves representing decoded video quality.

For peer-to-peer streaming, it is essential to limit the amount of self-inflicted congestion created by the media streams. Indeed, as there might be a large number of intermediate end-hosts separating a peer from the source, any increase in network congestion may be reflected multiplicatively in the total end-to-end delay. Combined with physical link latency, this delay may cause some packets to miss their playout deadline, resulting in decoding errors and a decrease in decoded quality. The second term of (3) reflects congestion. Hence, for fine tuning, congestion may be reduced by the following methods: increasing the amount of over-provisioning by reserving a larger capacity $C_i$ on each of the multicast trees between a sender peer and a receiver peer; or decreasing the encoding rate of the video, $R$. When video is pre-encoded at a constant rate, as is often the case, it is not always possible to reduce the streaming rate. Thus, it may be necessary to employ more over-provisioning, which in turn may decrease the number of hosts supported by the peer-to-peer system. The model can be used to determine the right amount of over-provisioning by bounding (3) with a fixed threshold.

Figure 5: SI frames share the instant refresh properties of I frames but are only sent after a frame is lost.

4. ERROR-RESILIENT STREAMING
In this section, we describe the structure used to encode the video file.
The video stream distributed over the peer-to-peer network includes periodic SP frames which enable adaptive error recovery.

4.1 SP and SI frames
SP/SI pictures are new types of predictively/intra coded pictures. They were proposed in 2001 by Karczewicz and Kurceren as a solution for error resilience, bitstream switching and random access [6, 7]. They are now part of the Extended Profile of H.264. The main advantage of this new picture type is that it can be reconstructed exactly by using different sets of predictors or no predictor at all. This makes it possible to refresh a prediction chain adaptively, as depicted in Fig. 5.

Figure 6: GOP structures used for streaming with SP and SI frames and for periodic I frame insertion.

In a video streaming application, when a packet arrives at the receiver after its playout deadline, it is discarded by the decoder as if it were lost. To avoid interruptions, the errors due to packet loss or to excessive delays are concealed by freezing the previous frame until the next decodable frame, and the playout continues at the cost of higher distortion. When SP frames are used, if a P frame or an SP frame is lost and cannot be retransmitted, an SI frame is sent at the beginning of the next group of pictures (GOP) to stop the decoding error from propagating further, as depicted in Fig. 5. This differs from traditional video streaming, where I frames are transmitted periodically whether decoding errors occur or not.

4.2 Compression efficiency
The video encoding structure shown in Fig. 6 was chosen for the streaming experiments presented in this paper. The number of frames in a GOP is 16, with one SP frame (and its corresponding SI frame) per GOP and 3 B frames between P frames. This ensures good error resilience properties and makes it easy to scale down the frame rate by 2 or even 4 if needed. The encoded video sequences used in the following experiments, as well as rate-distortion preambles characterizing the size and quality of the frames, are made publicly available [1] (see footnote 1).

The average compression efficiency of different frame types is shown in Fig. 7. As the size of SP frames is less than that of I frames, for environments with limited loss rates, the bit rate savings obtained by not transmitting unnecessary I frames may reach up to 25%. This is illustrated in Fig. 8, which shows the compression efficiency of GOPs containing SP or SI frames compared to GOPs with periodic I frames.
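The adaptive refresh rule of Sec. 4.1 can be illustrated with a short sketch. It assumes, for simplicity, that the switching frame sits at the first position of each 16-frame GOP and that the sender learns of every unrecovered loss of a P or SP frame before the next GOP starts; the names `ChildState`, `switching_frame_type` and `switching_schedule` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

GOP_SIZE = 16  # frames per GOP, as in Sec. 4.2


@dataclass
class ChildState:
    """Per-descendant state kept by a forwarding peer (hypothetical)."""
    unrepaired_loss: bool = False  # a P or SP frame was lost and not recovered


def switching_frame_type(child: ChildState) -> str:
    """Choose the version of the switching frame to forward to this child."""
    if child.unrepaired_loss:
        child.unrepaired_loss = False  # the SI frame stops error propagation
        return "SI"
    return "SP"


def switching_schedule(num_frames, lost_reference_frames):
    """List the frame type sent at each GOP start, given lost P/SP frame indices."""
    child = ChildState()
    sent = []
    for frame in range(num_frames):
        if frame % GOP_SIZE == 0:
            sent.append(switching_frame_type(child))
        if frame in lost_reference_frames:
            child.unrepaired_loss = True
    return sent


if __name__ == "__main__":
    # A loss in the first GOP triggers a single SI frame at the next GOP start.
    print(switching_schedule(48, {5}))
```

In this simplified form, the larger SI frame is sent only once per loss event, which is the source of the bit rate saving relative to periodic I frame insertion.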
4.3 SI frame regeneration
The details of the encoding/decoding process of SP and SI frames are beyond the scope of this paper; however, we note that if an SP frame is decoded correctly, the corresponding SI picture may be created by the decoder with very limited additional complexity (see footnote 2). As a consequence, in the peer-to-peer streaming scenario, each host which receives and decodes an SP frame correctly may regenerate the corresponding SI frame. Therefore, this host may transmit an SI frame rather than the SP frame it received if it is needed by one of its descendants. This technique allows the adaptive streaming to take place not only between the source and its direct descendants but also further down in the tree.

Footnote 1: The preambles also indicate the distortion values obtained by concealing a frame with any other frame of the stream, which makes it possible to simulate video streaming realistically without the overhead of encoding and decoding.
Footnote 2: However, the reverse is not true: the SP frame cannot be recovered from the SI frame without a complete encoder.

Figure 7: Frame type comparison for Mother & Daughter sequence. (Axes: PSNR (dB) versus rate (bits/pixel); curves: B, P, SP, I and SI frames.)

Figure 8: Rate-distortion performance with periodic I frame, SP frame or SI frame insertion, for Mother & Daughter. (Axes: PSNR (dB) versus rate (kbps).)

5. OPTIMIZED RETRANSMISSIONS
In this section, we describe how to determine an optimal retransmission schedule for missing packets of a video stream. This schedule indicates which packets of the stream will be requested to maximize the decoded video quality at the receiver while limiting the congestion created on the network.

The aim of the congestion-distortion optimized retransmission scheduler (CoDiO) is to determine a schedule minimizing the expected Lagrangian cost $D + \lambda \Delta$, where $D$ is the distortion of the received video stream and $\Delta$ is the end-to-end delay, which serves as the congestion metric. In [9], we analyze the benefits of using this metric rather than the traditional objective $D + \lambda R$ used in rate-distortion optimized scheduling [2]. In particular, end-to-end delay is inherently adaptive to time-varying network conditions. In addition, it better reflects the impact of a user operating on a bandwidth-limited network.

To minimize the Lagrangian cost, CoDiO selects the most important packets in terms of video distortion reduction, and requests them in an order which minimizes the congestion created on the network. For example, I frames are requested with priority whereas B frames might not be retransmitted at all. In addition, CoDiO avoids requesting packets in large bursts, as this has the worst effect on the queuing delay. In the following, we describe how to estimate, with low complexity, the distortion corresponding to a given retransmission schedule and how to limit congestion.

5.1 Determining the video distortion
The expected value of the distortion for the video stream decoded by the client is computed as in [5]. Namely, if copy error concealment is used, an undecodable frame is replaced with the nearest correctly decoded frame for display. Hence, to capture the effect of packet loss on the video quality, only a limited number of display outcomes need to be identified and associated with different distortions.
Let $D(s, f)$ denote the distortion resulting from substituting frame $s$ for frame $f$. The expected distortion when displaying frame $f$ is:

\[ D(f) = \sum_{s} D(s, f) \Pr\{s\} \tag{6} \]

In Eq. (6), $\Pr\{s\}$ represents the probability that frame $s$ is displayed instead of $f$. This probability may be computed, as described in [5], by combining the probabilities that different packets do not reach the client by their playout deadline. The difficulty resides in estimating the delay distribution function needed to derive these quantities. Different techniques, with varying complexity, can be used to model this distribution. In our simulations, we use a very simple estimation method to keep the simulation time low: we classify packets into two sets, the set of missing packets and the set of packets which have been or will be received before their playout deadline. This classification reduces the number of possible decoding states to one, so we know which frames will be displayed over the horizon considered. It is then easy to determine the sensitivity of the distortion to each missing packet, and requests are issued in the order of importance of the packets.

5.2 Limiting congestion
In the scenario considered, the available transmission rate between a sending peer and a receiving peer depends on several factors, such as the uplink throughput of the sending peer, the number of other hosts served by this peer and the rate of the video stream. Given these parameters, the available throughput $C_i$ may be computed and the average end-to-end delay over a certain time horizon can be estimated for a given retransmission schedule. However, the estimation also requires modelling the delay distribution of the path between sender and receiver. As a low-complexity alternative, we suggest limiting the number of unacknowledged retransmission requests from a peer to each of its parents.
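The expected distortion of Eq. (6) and the low-complexity request rule of Secs. 5.1 and 5.2 can be sketched together as follows. This is a simplified illustration, not the CoDiO scheduler itself: the data layout, the function names and the choice of the least-loaded candidate parent are assumptions introduced here.

```python
def expected_distortion(substitution_distortion, display_probability):
    """Eq. (6): sum over candidate frames s of D(s, f) * Pr{s}."""
    return sum(substitution_distortion[s] * p
               for s, p in display_probability.items())


def pick_requests(missing, outstanding, max_outstanding):
    """Select retransmission requests in decreasing order of importance.

    missing: list of (packet_id, distortion_reduction, candidate_parents).
    outstanding: dict mapping parent -> number of unacknowledged requests.
    max_outstanding: cap on unacknowledged requests per parent.
    """
    counts = dict(outstanding)
    requests = []
    # Most important packets first (largest distortion reduction).
    for packet_id, gain, parents in sorted(missing, key=lambda m: m[1],
                                           reverse=True):
        if gain <= 0 or not parents:
            continue  # nothing to gain, or nobody to ask
        # Assumption: ask the candidate parent with the fewest pending requests.
        parent = min(parents, key=lambda p: counts.get(p, 0))
        if counts.get(parent, 0) < max_outstanding:
            requests.append((packet_id, parent))
            counts[parent] = counts.get(parent, 0) + 1
    return requests


if __name__ == "__main__":
    missing = [("p1", 5.0, ["A", "B"]), ("p2", 9.0, ["A"]),
               ("p3", 1.0, ["A"]), ("p4", 0.0, ["B"])]
    print(pick_requests(missing, {"A": 1}, max_outstanding=2))
```

The cap per parent stands in for an explicit delay estimate: a smaller value slows down recovery but keeps fewer retransmitted packets queued on the bottleneck link.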
As each unacknowledged retransmission request represents a packet being transmitted or processed between the two peers, these packets contribute to end-to-end delay and hence to congestion. The tradeoff between the speed of the retransmissions and the amount of congestion created on the bottleneck links can be set by determining the optimal number of unacknowledged packets tolerated between a sender and a receiver. This optimization will be carried out experimentally.

6. CONCLUSIONS
Live peer-to-peer multicast streaming is constrained by the dynamic behavior of hosts and by limited uplink throughput. Dynamic control protocols are needed to support heterogeneous peers and react rapidly to node disconnections. We propose to use recently developed video encoding, streaming and scheduling techniques to further enhance the performance. A rate-distortion model analyzes the tradeoff between self-inflicted congestion and video quality and is used to determine the amount of over-provisioning necessary when low latency is required. H.264 SP and SI frames are incorporated into the video stream to provide adaptive error resilience and achieve bit rate savings of up to 25% compared to traditional video streams. Last, retransmissions of missing packets are requested in a congestion-optimized fashion which selects the most important packets in terms of video quality while limiting the induced congestion. The experimental results are collected over a simulated network in ns-2 for a large number of hosts which join and leave dynamically. Although the gains obtained in this paper are for our implementation of an overlay multicast control protocol, we believe the results are more general and could be applied to most implementations of video peer-to-peer streaming.

7. REFERENCES
[1] Encoded sequences with SP/SI frames. http://ivms.stanford.edu/ esetton/sequences.htm.
[2] P. Chou and Z. Miao. Rate-distortion optimized streaming of packetized media.
Microsoft Research Technical Report MSR-TR-2001-35, Feb. 2001.
[3] Y. Chu, A. Ganjam, T. Ng, S. Rao, K. Sripanidkulchai, J. Zhan, and H. Zhang. Early experience with an internet broadcast system based on overlay multicast. Proceedings of USENIX '04, pages 155-170, June 2004.
[4] ITU-T and ISO/IEC JTC 1. Advanced Video Coding for Generic Audiovisual services, ITU-T Recommendation H.264 - ISO/IEC 14496-10 (AVC), 2003.
[5] M. Kalman, P. Ramanathan, and B. Girod. Rate-distortion optimized streaming with multiple deadlines. Proc. International Conference on Image Processing, Barcelona, Spain, Sept. 2003.
[6] M. Karczewicz and R. Kurceren. A Proposal for SP-Frames. Video Coding Experts Group Meeting, Doc. VCEG-L-27, Eibsee, Germany, Jan. 2001.
[7] M. Karczewicz and R. Kurceren. The SP- and SI-frames design for H.264/AVC. IEEE Trans. CSVT, 13(7):637-644, July 2003.
[8] V. Padmanabhan, H. Wang, P. Chou, and K. Sripanidkulchai. Distributing streaming media content using cooperative networking. Proceedings NOSSDAV '02, Miami, USA, May 2002.
[9] E. Setton and B. Girod. Congestion-Distortion Optimized Scheduling of Video. Multimedia Signal Processing Workshop (MMSP), Siena, Italy, pages 99-102, Oct. 2004.
[10] E. Setton, X. Zhu, and B. Girod. Minimizing distortion for multipath video streaming over ad hoc networks. International Conference on Image Processing, Singapore, pages 1751-1754, Oct. 2004.
[11] K. Sripanidkulchai, A. Ganjam, B. Maggs, and H. Zhang. The feasibility of supporting large-scale live streaming applications with dynamic application endpoints. Proceedings SIGCOMM '04, Portland, USA, Aug. 2004.
[12] X. Zhang, J. Liu, B. Li, and T.-S. P. Yum. Donet/coolstreaming: A data-driven overlay network for live media streaming. Proceedings IEEE Infocom, Miami, USA, Feb. 2005.