Robust vehicle-to-infrastructure video transmission for road surveillance applications


Belyaev, E., A. Vinel, A. Surak, M. Gabbouj, M. Jonsson, and K. Egiazarian, "Robust vehicle-to-infrastructure video transmission for road surveillance applications," IEEE Transactions on Vehicular Technology, accepted 2014.

Robust vehicle-to-infrastructure video transmission for road surveillance applications

Evgeny Belyaev, Member, IEEE, Alexey Vinel, Senior Member, IEEE, Adam Surak, Moncef Gabbouj, Fellow, IEEE, Magnus Jonsson, Senior Member, IEEE, and Karen Egiazarian, Senior Member, IEEE

Abstract -- IEEE 802.11p vehicle-to-vehicle and vehicle-to-infrastructure communication technology is an emerging research topic in both industry and academia. The allocation of 10 MHz channels in the 5.9 GHz band in the USA and Europe makes live inter-vehicle video transmission feasible and enables a new class of safety and infotainment automotive applications such as road video surveillance. This paper is the first in which such a video transmission system is developed and experimentally validated. We propose low-complexity unequal packet loss protection and rate control algorithms for a scalable video codec based on the three-dimensional discrete wavelet transform. We show that, in comparison with the scalable extension of the H.264/AVC standard, the new codec is less sensitive to packet losses, has lower computational complexity and provides comparable performance when unequal packet loss protection is used. It is specially designed to cope with the severe channel fading typical of dynamic vehicular environments and has low complexity, making it a feasible solution for real-time automotive surveillance applications. Extensive measurements obtained in realistic city traffic scenarios demonstrate that good visual quality and continuous playback are possible when the moving vehicle is within a radius of 600 meters from the roadside unit.

Index Terms -- Video surveillance, vehicular communication, unequal loss protection, 3-D DWT, H.264/SVC, IEEE 802.11p.

I. INTRODUCTION

The coming years will witness the adoption of IEEE 802.11p (currently part of the IEEE 802.11 specification [1]) technology, which will enable broadband vehicle-to-vehicle and vehicle-to-infrastructure connectivity. The design and validation of this technology as well as prospective safety and infotainment applications in VANETs (Vehicular Ad-hoc NETworks) are currently areas of intensive research [2]. The availability of 10 MHz channels for DSRC (Dedicated Short Range Communication) in the 5.9 GHz band allows live video transmission between vehicles, which makes it possible to introduce new specialized automotive video-based applications (e.g. an overtaking assistant [3]).

Recently, a few papers have addressed the problem of video transmission in the VANET environment. A novel application-centric multi-hop routing protocol for video transmission is proposed in [4] to maximize the end users' peak signal-to-noise ratio. Multi-source streaming across a VANET overlay network is studied in [5], while video streaming from one source to all nodes in an urban scenario is analyzed in [6]. Dynamic service schemes, which maximize the total user satisfaction and achieve an appropriate amount of fairness in peer-to-peer-based VANETs, are proposed in [7]. The commonality of the works reviewed above is that they all present either analytic or simulation-based results.

In this paper a prototype of a video coding and transmission system for a new VANET-enabled application, namely video surveillance for public transport security and road traffic control, is introduced. Figure 1 demonstrates the operation principle of the considered road video surveillance system, where captured video is transmitted over IEEE 802.11p from a vehicle to a roadside unit. The latter can serve as a gateway to the backbone network used to deliver the live video to the management center. Depending on the area at which the vehicle-mounted camera is directed (inside or outside of the vehicle), the following two types of applications can be considered:

- In-vehicle surveillance, which allows real-time monitoring of public transportation by security services such as police, ambulance, etc., aiming at preventing terrorism, vandalism and other crimes.
- Traffic surveillance, which can be used to monitor the current situation at a given road segment, a particular intersection or even a given traffic lane, aiming at real-time reaction to traffic jams, accidents, etc.

The following requirements are to be met for the considered application:

- video capturing and compression, packet protection and transmission, as well as video reconstruction at the receiver side must satisfy the real-time constraint;
- the start-up buffering delay should not be more than 10 sec (as in regular surveillance [8]) and continuous video playback on the receiver side should be provided during the overall system operation;
- a high video frame resolution (not less than 640x480) and frame rate (25-30 frames/sec) are to be used;
- the received video must have acceptable visual quality for low and medium packet loss probabilities (0-20%) and be recognizable for high packet loss probabilities (20-40%);
- the reconstructed video should be authentic (new objects, which are not present in the original video, should not appear);
- the video bit rate must be easily variable at the transmitter side as well as at any intermediate node, for adaptation to the channel bandwidth of the current link and to user display size/power consumption restrictions.

E. Belyaev, K. Egiazarian and M. Gabbouj are with the Department of Signal Processing, Tampere University of Technology, Tampere, Finland. A. Vinel and M. Jonsson are with the School of Information Science, Computer and Electrical Engineering, Halmstad University, Halmstad, Sweden. A. Surak is with the Department of Electronics and Communication Engineering, Tampere University of Technology, Tampere, Finland.

[Fig. 1: Operation principle of the road video surveillance system: an internal (cabin view) or external (road view) camera feeds a video compressor and an IEEE 802.11p transceiver, which delivers the stream to a roadside unit and onward to the backbone network or to other vehicles.]

As shown by our preliminary real-world experiments [9], the above requirements are difficult to meet even for the simplest case of communication between a single vehicle and a roadside unit. This is, in particular, due to the following properties of the 5.9 GHz 802.11p communication channel [10], [11]:

- different forms of fading in a mobile urban environment, leading to high packet losses even when line-of-sight conditions between the transmitter and the receiver antennas are satisfied;
- occurrence of vision-obstructing obstacles (e.g. buses, trucks) between the transmitter and the receiver antennas, causing severe bursts of packet losses.

Scalable video coding (SVC) is a preferred compression method for video transmission over unreliable communication channels [12], [13]. Numerous works have reported that SVC combined with unequal loss protection (ULP) allows reliable video delivery even over severely erroneous channels [14], [15], [16]. Application-layer ULP can be based on inter-packet Reed-Solomon (RS) codes: the base video stream layer is protected using highly redundant RS codes and the remaining layers are encoded with a lower level of redundancy. The scalable extension of the H.264/AVC standard, denoted H.264/SVC, is highly efficient in terms of compression performance due to the use of inter-layer prediction and motion compensation. ULP techniques are mostly discussed in the context of H.264/SVC, which provides temporal, spatial and quality scalability (see [16]). However, the main limitation of H.264/SVC for real-time live surveillance is its high computational complexity due to the need for complex end-to-end distortion estimation, erasure-correction encoding and ULP optimization.

An alternative solution might be based on three-dimensional discrete wavelet transform (3-D DWT) scalable video coding [19]. As of today, schemes based on the 3-D DWT achieve rate-distortion performance comparable to that of H.264/SVC, owing to advances in Motion Compensated Temporal Filtering. Unfortunately, when combined with ULP, the resulting 3-D DWT video systems are also computationally intensive [17]. For instance, an encoding speed of 6 frames per second was reported on a Pentium M 2-GHz processor [18], which is insufficient for real-time video delivery. Thus, to the best of the authors' knowledge, an efficient low-complexity scalable video coder with unequal packet loss protection and rate control suitable for the considered surveillance application has not been presented in the literature so far.¹

The main contributions of this paper are the following:

1) We extend our low-complexity video coding framework based on the three-dimensional discrete wavelet transform (3-D DWT) presented in [20] by adding error-resilience coding, error concealment, ULP based on inter-packet Reed-Solomon codes, as well as rate control of the source and parity video data.²
2) We tune the proposed video coding system for the 5.9 GHz 802.11p communication channel, aiming at achieving the given visual quality requirement. The channel is characterized either by a maximal possible independent packet loss probability or by a maximal length of a burst of lost packets. The values of these parameters are obtained from a measurement campaign conducted in a city environment.
3) We simulate the performance of our video codec for video sequences captured in real-life road surveillance conditions and show that, in comparison with H.264/SVC, it is significantly less sensitive to packet losses and provides better authenticity of the reconstructed video. Our codec provides comparable performance when ULP is used but, at the same time, has significantly lower complexity.
4) We have fully implemented a video transmission system based on the 3-D DWT coder. Real-world experiments demonstrating the use of the proposed 3-D DWT codec together with IEEE 802.11p equipment for road video surveillance are carried out. Good visual quality is achievable when the distance between the transmitting vehicle and the roadside unit is less than 600 meters, while acceptable video quality can be provided at distances of 600-800 meters. Continuous video playback is possible at distances not exceeding 800 meters.

The rest of the paper is organized as follows. Section II reviews the basic video codec based on the 3-D DWT presented in [20]. Sections III and IV introduce the low-complexity error-resilience coding and error concealment as well as the ULP and rate control, respectively. The results are presented for both the 3-D DWT and H.264/SVC codecs. Section V presents the tuning of the video codecs for the urban vehicular environment. Real-world experimental results of the proposed codec for the considered surveillance application are presented in Section VI. Finally, conclusions are drawn in Section VII.

¹ To the best of the authors' knowledge, there are no reported results presenting a complexity evaluation of joint compression and protection.
² This result was presented by the authors in [21] for the simplistic case of a constant and known packet loss probability.

II. VIDEO CODING BASED ON 3-D DWT

In this paper we use the video codec based on the three-dimensional discrete wavelet transform (3-D DWT codec) proposed in [20]. First, a group of frames (GOF) of length N is accumulated in the input frame buffer. Afterwards, a one-dimensional multilevel DWT of length N is applied in the temporal direction. A higher value of N corresponds to better temporal decorrelation of the video data. On the other hand, N is limited by the memory restrictions of the transmitter and receiver devices and by the maximum allowable encoding latency caused by frame accumulation in the buffer. Therefore, typical values of N are 4, 8, 16 and 32. After the temporal transform, a 2-D spatial wavelet transform is applied to each frame. The 3-D DWT naturally provides temporal as well as spatial scalability of the video stream.

Figure 2 illustrates an example of the wavelet decomposition for a GOF of size N = 4 with a two-level temporal and a two-level spatial wavelet decomposition. Here four input frames are decomposed into 28 wavelet matrices (subbands). One can see that this decomposition provides three temporal scalable layers (full, half and quarter frame rate) and three spatial scalable layers (full, quarter and eighth frame resolution). Herewith, the base spatial layer corresponds to the eighth frame resolution, while the base temporal layer corresponds to the quarter frame rate.

After the 3-D DWT, all frames in a GOF are processed starting from the low-frequency frames and ending with the high-frequency frames (the L3, H2, H1, H0 processing order in Figure 2). For each frame, the spatial subbands are also processed from the low-frequency to the high-frequency subbands (see the processing order in Figure 2). Each wavelet subband is represented as a set of bit-planes (Figure 3) and is compressed independently of the other subbands by a bit-plane entropy encoder; the bit with number n of a wavelet coefficient belongs to bit-plane n. The entropy encoder consists of two cores: a Levenstein zero-run coder for low-entropy binary contexts and an adaptive binary range coder for the remaining binary contexts. Each core has its own output bit stream buffer.

First, the number of non-zero bit-planes in the subband, $t_{max}$, is calculated and encoded. Then all coefficients are processed from the highest bit-plane $t_{max}$ down to the lowest bit-plane, in raster-scan order. After the encoding of each bit-plane $t$, the bit stream size $r(t)$ and the distortion $d(t)$ of the subband are determined. In this way a progressive bit stream is generated, i.e. $r(t) > r(t+1)$, $d(t) \le d(t+1)$, and bit-plane $t$ can be decoded only if bit-plane $t+1$ has been successfully decoded. Then, if the Lagrangian sum $\psi(t) = d(t) + \lambda \cdot r(t)$ for bit-plane $t$ is greater than the sum $\psi(t+1) = d(t+1) + \lambda \cdot r(t+1)$ for bit-plane $t+1$, the bit stream is truncated at bit-plane $t+1$ and the encoding process is stopped.

[Fig. 4: Example of a parent-child skipping tree.]

In some cases, after compression of the highest bit-plane $t_{max}$ the following inequality may hold:

$$d(t_{max}) + \lambda \cdot r(t_{max}) > \sum_{(i,j)} w_{ij}^2, \qquad (1)$$

where $w_{ij}$ is the wavelet coefficient with coordinates $(i, j)$.
In such cases, even the highest bit-plane is not included in the output bit stream, i.e. the subband is skipped. Since the encoder still calculates the 2-D DWT and forms the subband bit stream, computational resources are wasted. One can avoid this situation by applying the parent-child tree based subband skip, which operates as follows. If the skipping criterion (1) holds for a subband, then none of its child subbands are processed. Therefore, the spatial transform calculation and entropy coding steps are skipped, and the corresponding transform coefficients are assumed to be zero at the decoder side. The approach is illustrated in Figure 4: if the spatial subband HH in frame H2 is skipped, then the corresponding child subbands in frame H2 and in frames H1 and H0 are omitted without processing. We showed in [20] that parent-child skipping allows the complexity of the encoder to be decreased significantly (up to 7 times) while keeping the compression performance. More details can be found in [20].

III. LOW-COMPLEXITY ERROR-RESILIENCE CODING AND ERROR CONCEALMENT

A. Video bit-stream packetization

Let us consider an example of coding and transmission of a wavelet subband with four non-zero bit-planes (see Figure 3). Assume that at the encoder side the subband was truncated at the truncation point t = 1. In this case the Levenstein coder as well as the range coder generates three subpackets, which contain the information about bit-planes 3, 2 and 1. If no packet losses happen, then r(1) bits are received, which corresponds to distortion d(1). Now assume that subpacket 1 from the Levenstein coder buffer and subpacket 2 from the range coder buffer are lost. Taking into account that the subpackets from both coders are needed for bit-plane decoding, only bit-plane 3 will be reconstructed by the decoder, which corresponds to distortion d(3). In this case the distortion for the subband corresponds to the case when the truncation point t = 3 is selected at the encoder side and there are no packet losses, i.e. packet losses in the 3-D DWT codec do not lead to error propagation in the bit stream and simply correspond to a coarser truncation (quantization) of the subband.
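As an illustration of the truncation rule of Section II that this example relies on, the following minimal C sketch greedily selects the truncation point from per-bit-plane distortion and rate values. The array-based interface and all names are our own assumptions for illustration, not the authors' implementation.

```c
/* Sketch of the bit-plane truncation rule from Section II, assuming d[t] and
 * r[t] hold the measured subband distortion and bit-stream size after encoding
 * down to bit-plane t, and that `energy` is the sum of squared wavelet
 * coefficients of the subband. */
int select_truncation_point(const double *d, const double *r,
                            int t_max, double lambda, double energy)
{
    /* Subband skip, criterion (1): even the highest bit-plane does not pay off. */
    if (d[t_max] + lambda * r[t_max] > energy)
        return t_max + 1;                        /* interpreted here as "skip"   */

    int t_star = t_max;
    for (int t = t_max - 1; t >= 0; t--) {
        double cost_prev = d[t + 1] + lambda * r[t + 1];
        double cost_curr = d[t]     + lambda * r[t];
        if (cost_curr > cost_prev)               /* adding bit-plane t increases */
            break;                               /* the Lagrangian cost: stop    */
        t_star = t;                              /* otherwise keep bit-plane t   */
    }
    return t_star;
}
```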

[Fig. 2: Three-dimensional discrete wavelet transform: a) original frames, b) frames after the 1-D temporal DWT, c) frames after the 1-D temporal DWT and the 2-D spatial DWT, with the base spatial layer, the base temporal layer and the main subband marked.]

[Fig. 3: Illustration of bit-plane entropy encoding, packetization, truncation, packet losses and the bit stream available at the decoder: the bit-planes of a wavelet matrix are fed through a bit-plane context modeler into the zero-run Levenstein coder (low-entropy contexts) and the adaptive binary range coder (high-entropy contexts), producing one subpacket per bit-plane per coder.]

Taking into account that in many cases the length of a subpacket generated by the Levenstein coder or the range coder can be significantly smaller than the network packet length, subpackets need to be packed into network packets of larger size. Let the packet loss probability be p. If two subpackets corresponding to the same bit-plane are placed into different packets, then the probability that not both of them are received is $\pi = 1 - (1-p)^2$, whereas $\pi = p$ if these subpackets are placed into the same packet. It is easy to see that in the second case the probability that the bit-plane is lost is always less than or equal to that of the first case. Therefore, we use the second type of packetization whenever the subpackets can be placed into the same packet, and the first type in the remaining cases.
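The packetization rule above amounts to a one-line decision; a small sketch follows, with a payload size that is an assumption of ours and not taken from the paper.

```c
/* Sketch of the packetization rule of Section III-A. Two subpackets of the
 * same bit-plane go into one network packet whenever they fit, since
 * p <= 1 - (1 - p)^2 for any p in [0, 1]. The payload size is an assumption. */
#include <stdbool.h>

#define NET_PACKET_PAYLOAD 800u   /* assumed payload size in bytes */

double pi_same_packet(double p)   { return p; }
double pi_split_packets(double p) { return 1.0 - (1.0 - p) * (1.0 - p); }

bool pack_together(unsigned levenstein_bytes, unsigned range_coder_bytes)
{
    return levenstein_bytes + range_coder_bytes <= NET_PACKET_PAYLOAD;
}
```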

[Fig. 5: Examples of artifacts caused by packet losses: (a) no packet losses, (b) temporal artifacts caused by the loss of the LL subband of the L3 frame, (c) temporal artifacts, (d) spatial artifacts.]

B. Error-resilience coding and error concealment

The video bit stream generated by the 3-D DWT codec can be divided into the following main components, listed in order of importance with respect to the influence of their loss on the reconstructed video:

1) The main subband in the base spatial layer (the LL subband of frame L3 in Figure 2) contains the brightness information for the whole GOF. Therefore, the loss of this subband corresponds to the loss of all frames in the GOF.
2) Due to the properties of the 1-D temporal wavelet transform, in the case of losses of the remaining subbands of the base spatial layer (the LL subbands of frames H2, H1 and H0 in Figure 2), the reconstructed video at the decoder side contains specific temporal artifacts, such as a ghosting effect (see Figure 5c).
3) Losing the remaining subbands leads to spatial artifacts, where frames in the GOF appear as if downsampled (see Figure 5d). In this case, however, the reconstructed video still has an acceptable visual quality.

Applying the above reasoning, at the decoder side we use the following error concealment. First, the bit stream of each subband is decoded in a progressive way until a loss is detected. In this case, any losses in a stream correspond to a coarser quantization at the encoder side and do not lead to error propagation. Second, if the main subband, which contains the brightness information for the whole GOF, is lost, then we copy the corresponding subband from the previous GOF. In this case error propagation can happen (see the example in Figure 5b). To minimize the probability of this event we use the following simple error-resilient coding. At the encoder side, for the main subband it is proposed to repeat the highest bit-planes that can be placed into two additional packets; we use two packets because this is the minimum data portion which represents the brightness information acceptably. For the remaining subbands of the base spatial layer it is proposed to repeat the highest bit-planes that can be placed into one additional packet; one packet is enough to avoid significant temporal artifacts caused by the inverse temporal wavelet transform. Notice that the proposed error-resilience coding requires only an insignificant number of operations for packet repetition at the encoder side, so it does not affect the overall encoder complexity.

C. Comparative results

The influence of packet losses on the reconstructed video was compared for the proposed 3-D DWT codec and H.264/SVC. For the 3-D DWT we used a GOF of size N = 16 with the Haar transform in the temporal direction and the 5/3 spatial wavelet transform with a three-level decomposition. For H.264/SVC we used the JSVM 9.8 reference software [29] with 2 spatial and 5 temporal scalable layers. The GOP size and the intra-frame period were set to 16. For error-resilient coding, flexible macroblock ordering with two slice groups and loss-aware rate-distortion optimized macroblock mode decision were used. At the decoder side, the frame-copy error concealment method was applied. In both cases the implementations were written in the C language. Simulation results were obtained for the video sequences Tampere and Tampere 4 (640x480, 30 Hz, 300 frames) [26], which are typical for road surveillance applications.³

[Fig. 6: Expected visual quality comparison for the 3-D DWT codec without protection and H.264/SVC without protection.]

[Fig. 7: Visual comparison for 5% packet loss: a) proposed 3-D DWT, b) H.264/SVC (JSVM 9.8).]

For the comparison we used the following approach. First, each video codec generated the video stream as a set of packets of fixed length. Then, packets were randomly removed from this set using the independent packet loss model with probability p. The resulting video stream was used as input to the video decoder for reconstruction of the received video data. Then, the mean square error was calculated as

$$D = \frac{1}{F\,W\,H}\sum_{f=1}^{F}\sum_{x=1}^{W}\sum_{y=1}^{H}\bigl(s[x,y,f] - s'[x,y,f]\bigr)^2, \qquad (2)$$

where $s[x,y,f]$ and $s'[x,y,f]$ are the luma values of the pixel with coordinates $(x,y)$ in frame $f$ of the original and the reconstructed video sequences, respectively, $W$ is the frame width, $H$ is the frame height, and $F$ is the number of frames in the test video sequence.

Note that losses of different packets lead to different distortion values $D$ in the reconstructed video. Therefore, we repeat the mean square error calculation described above for $K$ different packet loss realizations. Finally, the visual quality metric, the expected peak signal-to-noise ratio of the luma component (Y-PSNR), is estimated as

$$E[\text{Y-PSNR}] = 10\lg\frac{s_{max}^2}{\frac{1}{K}\sum_{i=1}^{K} D_i}, \qquad (3)$$

where $s_{max}$ is the maximum possible luma value.

Figure 6 shows the expected visual quality for different packet loss probabilities and channel rates for the proposed codec and for H.264/SVC. As can be noted, while H.264/SVC provides significantly better visual quality in the case of no packet losses, it is much more sensitive to packet losses. One can also see that the proposed codec provides better authenticity of the reconstructed video, while H.264/SVC produces frames with unrecognizable objects or new objects which do not exist in the original video, even for relatively low packet loss probabilities (see the visual comparisons in Figure 7).

³ The test video sequences as well as detailed visual comparisons with H.264/SVC are available at belyaev/v2v.htm.
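The evaluation metric in (2)-(3), in the form reconstructed above, can be sketched in a few lines of C. Planar 8-bit luma buffers and all names are assumptions made for illustration.

```c
/* Sketch of the evaluation metric in (2)-(3): MSE of the luma plane, averaged
 * over K loss realizations, converted to the expected Y-PSNR. */
#include <math.h>
#include <stddef.h>

/* Equation (2): mean square error between original and reconstructed luma. */
double mse_luma(const unsigned char *s, const unsigned char *s_rec,
                size_t W, size_t H, size_t F)
{
    double acc = 0.0;
    for (size_t i = 0; i < W * H * F; i++) {
        double diff = (double)s[i] - (double)s_rec[i];
        acc += diff * diff;
    }
    return acc / (double)(W * H * F);
}

/* Equation (3): expected Y-PSNR over K packet loss realizations D[0..K-1];
 * s_max is 255 for 8-bit video. */
double expected_y_psnr(const double *D, size_t K, double s_max)
{
    double mean_mse = 0.0;
    for (size_t i = 0; i < K; i++)
        mean_mse += D[i] / (double)K;
    return 10.0 * log10(s_max * s_max / mean_mse);
}
```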

IV. JOINT UNEQUAL PACKET LOSS PROTECTION AND RATE CONTROL

A. Loss protection by inter-packet error-correction codes

To achieve better visual quality we apply inter-packet protection based on systematic Reed-Solomon (n, k) codes over the finite field $GF(2^8)$. In this approach, $k$ source bytes with the same index are used to form the source polynomial $m(x)$, and the corresponding $r = n - k$ parity bytes are generated as

$$r(x) = x^{r} m(x) \bmod g(x). \qquad (4)$$

If the source byte with the current index does not exist (stuffing byte), a zero byte is used instead (see the example in Figure 8). If $k$ or more out of the total $n$ packets are successfully delivered, the Reed-Solomon decoder is able to recover all the lost source packets.

[Fig. 8: Inter-packet erasure protection example for the Reed-Solomon (7, 3) code: an inter-packet RS codeword is formed across k source packets (source packet header, stuffing bytes, source bytes) and n - k parity packets (parity packet header, parity bytes).]

Let $p$ be the packet loss probability; then the probability of a source packet loss after decoding is

$$p_{dec}(p, n, k) = \sum_{i=n-k+1}^{n} \frac{i}{n}\binom{n}{i}\, p^{i}(1-p)^{n-i}. \qquad (5)$$

The probability that the two subpackets belonging to the same bit-plane $t$ are not both received is computed as $\pi_t = p_{dec}(p, n, k)$ if the subpackets are encapsulated into one packet, or $\pi_t = 1 - (1 - p_{dec}(p, n, k))^2$ if they are encapsulated into different packets.

B. Expected distortion estimation at the encoder side

We denote the decision vector of a subband truncated at bit-plane $t$ with the RS code $(n, k)$ applied to each bit-plane as $\psi = \{t, n, k\}$. Let $d_i$ be the subband distortion in the case when bit-planes $t_{max}, \ldots, i$ have been received while bit-plane $i-1$ has not been received. The expected distortion after truncation of the subband at bit-plane $t$ can then be computed as

$$E[d(\psi)] = \sum_{i=t+1}^{t_{max}} \Bigl(\prod_{j=i}^{t_{max}}(1-\pi_j)\Bigr)\pi_{i-1}\, d_i \;+\; \pi_{t_{max}}\, d_{skip} \;+\; \Bigl(\prod_{i=t}^{t_{max}}(1-\pi_i)\Bigr) d_t, \qquad (6)$$

where $d_{skip}$ is the distortion corresponding to the case in which none of the bit-planes is received (all coefficients of the subband are assumed to be zero). Notice that the $d_i$ values are already calculated by the basic video encoder described in Section II, and the $\pi_i$ values can be pre-calculated once for a given packet loss probability $p$ and Reed-Solomon code. Therefore, the computation of $E[d(\psi)]$ does not contribute significantly to the overall encoder complexity.

Let $\Psi = \{\psi_i\}$ be the set of decision vectors, where $\psi_i$ is the decision vector of the $i$-th subband in a GOF. The overall expected GOF distortion is

$$E[D(\Psi)] = \sum_i E[d(\psi_i)], \qquad (7)$$

and the resulting GOF bit stream size is

$$R(\Psi) = \sum_i r(\psi_i), \qquad (8)$$

where $r(\psi_i)$ is the bit stream size of the $i$-th subband, including source and parity data, when the decision vector $\psi_i$ is applied.
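A compact C sketch of (5) and of the form of (6) reconstructed above follows. Names and indexing are ours, and the summation structure of (6) is our reconstruction of the garbled source, not the authors' code.

```c
/* Sketch of (5) and the reconstructed form of (6). pdec() is the residual
 * source-packet loss probability after RS(n, k) decoding. In
 * expected_distortion(), pi[i] is the probability that bit-plane i cannot be
 * decoded, d[i] is the distortion when bit-planes t_max..i are received,
 * d_skip corresponds to a completely lost subband, t is the truncation point. */
#include <math.h>

static double binom(int n, int i)        /* binomial coefficient C(n, i) */
{
    double c = 1.0;
    for (int j = 1; j <= i; j++)
        c = c * (n - i + j) / j;
    return c;
}

/* Equation (5): probability that a source packet is lost after RS decoding. */
double pdec(double p, int n, int k)
{
    double sum = 0.0;
    for (int i = n - k + 1; i <= n; i++)
        sum += ((double)i / n) * binom(n, i) * pow(p, i) * pow(1.0 - p, n - i);
    return sum;
}

/* Reconstructed equation (6): expected subband distortion for decision psi. */
double expected_distortion(const double *pi, const double *d, double d_skip,
                           int t, int t_max)
{
    double e = pi[t_max] * d_skip;        /* highest bit-plane lost            */
    double survive = 1.0 - pi[t_max];     /* all planes above current received */
    for (int i = t_max; i > t; i--) {     /* first loss occurs at plane i - 1  */
        e += survive * pi[i - 1] * d[i];
        survive *= 1.0 - pi[i - 1];
    }
    return e + survive * d[t];            /* every encoded bit-plane arrives   */
}
```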
C. Rate control by Lagrangian relaxation

The output video bit stream contains source packets generated by the video encoder and parity packets generated by the Reed-Solomon encoder. For a restricted channel rate, the expected visual quality depends on the portion of parity packets (the redundancy level) in the overall bit stream. Typically, for zero redundancy the expected visual quality is low due to severe packet losses. With an increase of the redundancy level, the visual quality increases, since more and more lost source packets containing video data are successfully reconstructed by the Reed-Solomon decoder. However, beyond some protection level, a further increase of the redundancy no longer improves the visual quality. This is because the wavelet subbands have to be compressed with a higher and higher compression ratio in order to keep the combined source and parity bit rate equal to the channel rate. In such a case the quality loss caused by compression (truncation) grows, while the distortion caused by packet losses is no longer significant because the video bit stream is overprotected. As a result, the visual quality decreases. Thereby, to maximize the expected visual quality, it is crucial to find the optimal redundancy level for the target video sequence and packet loss probability under the channel rate constraint.

Mathematically, this problem can be formulated as follows. For each GOF we need to identify the set of decision vectors $\Psi^*$ such that

$$\Psi^* = \arg\min_{\{\Psi\}} E[D(\Psi)] \quad \text{subject to} \quad R(\Psi^*) \le R_{max}, \qquad (9)$$

where $R_{max} = \frac{N}{f}C$ is the bit budget for each GOF, $N$ is the GOF size, $C$ is the channel rate, and $f$ is the video source frame rate. The rate control problem (9) can be solved using the Lagrangian relaxation method. It can be proven (see Appendix A) that, for a given $\lambda$, the set of decision vectors $\Psi^{\lambda}$ minimizing $E[D(\Psi)] + \lambda R(\Psi)$ is the solution of the rate control task (9) for $R_{max} = R(\Psi^{\lambda})$. From this statement it follows that, in order to solve (9), one should find $\lambda^*$ such that $R(\Psi^{\lambda^*}) = R_{max}$.

It can also be proven that $R(\Psi^{\lambda})$ is a non-increasing function of $\lambda$, i.e. $\lambda^*$ can be found by the bisection method. For our codec the rate-distortion function of subband $i$ is independent of the truncation and RS code parameters of the other subbands, which leads to

$$\min_{\Psi}\{E[D(\Psi)] + \lambda R(\Psi)\} = \min_{\Psi}\sum_i \bigl(E[d(\psi_i)] + \lambda r(\psi_i)\bigr) = \sum_i \min_{\psi_i} \bigl(E[d(\psi_i)] + \lambda r(\psi_i)\bigr). \qquad (10)$$

Consequently, a solution of problem (9) can be expressed as $\Psi^* = \{\psi_i^*\}$, where the value $\psi_i^* = \{t_i^*, n_i^*, k_i^*\}$ can be calculated independently for each subband $i$ as

$$\psi_i^* = \arg\min_{\{\psi_i\}} \bigl\{E[d(\psi_i)] + \lambda r(\psi_i)\bigr\}. \qquad (11)$$

D. Rate control by adaptive selection of the Lagrangian multiplier

To find the target $\lambda$ value by the bisection method, one needs to calculate the 2-D DWT and compress each subband losslessly in order to determine all the sets of truncation points. This requires significant computational resources and contradicts the wavelet subband skipping approach described in Section II. In this paper we propose a rate control of source and parity packets which uses the virtual buffer concept to estimate the $\lambda$ value, similarly to the rate control proposed in [20]. Let $b_{virt}$ be the number of bits in the virtual buffer and $B_{max}$ be the virtual buffer size, $B_{max} = L \cdot C$, where $L$ is the buffering latency at the decoder side required for continuous video playback and $C$ is the channel rate. Then, for each GOF, the target $\lambda$ value is calculated proportionally to the occupancy of the virtual buffer:

$$\lambda \leftarrow \lambda_{max}\Bigl(\frac{b_{virt}}{B_{max}}\Bigr)^{\gamma}, \qquad (12)$$

where $\gamma$ defines the degree of proportionality and $\lambda_{max}$ is the maximum possible $\lambda$ value. Once $\lambda$ has been calculated, for each subband in the GOF the minimization (11) is solved in the way described in Section II by replacing the distortion $d(t)$ caused by truncation with the expected distortion $E[d(\psi_i)]$, which is calculated via (6) and accounts for truncation, protection mode and packet losses. Then the number of bits in the virtual buffer is updated as

$$b_{virt} \leftarrow b_{virt} + R(\Psi^*) - R_{max}, \qquad (13)$$

where $R(\Psi^*)$ is the number of bits of source and parity data (including the repeated packets described in Section III-B) when the set of decision vectors $\Psi^*$ is used. Finally, a new $\lambda$ value is recalculated by (12) and used for the next GOF.

Analogously to the rate control proposed in [20], in practice, after some adaptation period, the value of $\lambda$ found by the proposed algorithm approaches the optimal one computed by the bisection method. At the same time, the proposed $\lambda$ calculation in (12) does not require the 2-D DWT calculation and lossless entropy encoding of all subbands in the GOF. Therefore, it does not contradict the subband skipping, which skips the 2-D DWT, the bit-plane entropy encoding as well as the packetization, Reed-Solomon encoding and protection mode selection. Herewith, the cost of the calculations in (12) and (13) can be neglected.
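The virtual-buffer update in (12)-(13) fits in a few lines. The sketch below uses the form of (12) reconstructed above and an assumed clipping of the buffer at zero; names and the struct layout are illustrative only.

```c
/* Sketch of the virtual-buffer rate control in (12)-(13). B_max = L*C bits and
 * R_max = (N/f)*C bits per GOF; lambda_max and gamma are tuning constants. */
#include <math.h>

typedef struct {
    double b_virt;      /* bits currently in the virtual buffer     */
    double B_max;       /* virtual buffer size, B_max = L * C       */
    double R_max;       /* bit budget per GOF, R_max = (N / f) * C  */
    double lambda_max;  /* maximum possible Lagrangian multiplier   */
    double gamma;       /* degree of proportionality                */
} rate_ctrl;

/* Equation (12): lambda grows with the virtual buffer occupancy. */
double next_lambda(const rate_ctrl *rc)
{
    double occ = rc->b_virt / rc->B_max;
    if (occ < 0.0) occ = 0.0;
    if (occ > 1.0) occ = 1.0;
    return rc->lambda_max * pow(occ, rc->gamma);
}

/* Equation (13): update after a GOF that produced R_gof source+parity bits. */
void update_buffer(rate_ctrl *rc, double R_gof)
{
    rc->b_virt += R_gof - rc->R_max;
    if (rc->b_virt < 0.0) rc->b_virt = 0.0;       /* assumed clipping at zero */
}
```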
V. TUNING OF THE PROPOSED 3-D DWT CODEC FOR VANET APPLICATIONS

A. Feedback channel elimination

In the traditional joint source-channel coding framework it is assumed that the packet loss probability is known or that the channel conditions change slowly enough to make it possible to use a feedback channel for adapting the packet protection parameters to the new channel state. However, our real-world experiments (reported in Section VI) demonstrate that packet losses occur mainly for the following reasons:

1) packet losses caused by different forms of fading (see distances of 600-800 meters in Figure 11a);
2) bursty packet losses caused by the appearance of short-term obstacles between the transmitter and the receiver antennas (see distances of 100-400 meters in Figure 11a).

Therefore, in the considered road surveillance application the above assumptions do not hold due to the fast-varying wireless channel conditions, which makes the feedback channel concept useless. We propose the following approach to combat this problem. The ULP algorithm continuously optimizes the video bit stream for a constant target packet loss probability $p_{target}$. Herewith, the value of $p_{target}$ should be selected to provide acceptable visual quality for the entire range of typical packet loss probabilities $p \le p_{max}$. In accordance with our experiments, we set $p_{max} = 40\%$. The minimum acceptable visual quality is set to 24 dB: our experiments show that when the quality is lower than this level, it becomes very difficult to recognize even the main objects in the video scene.

B. Comparative results

For the selection of the target packet loss probability $p_{target}$, the 3-D DWT codec with ULP and rate control was implemented in real-time mode as described in Sections II, III and IV. For H.264/SVC, the ULP and the rate control were implemented in an off-line mode through multiple encodings with different protection and compression parameters. Therefore, the results for H.264/SVC represent an upper bound for a low-complexity, real-time implementation (i.e. in practice H.264/SVC will not achieve a better performance than reported here). For both codecs we used the Reed-Solomon codes from Table I with the generator polynomial g(x) = x^4 + 3x^3 + 26x^2 + 23x + 66, which forms r = n - k = 4 parity packets for each k source packets.

[TABLE I: Packet protection modes, ordered by protection level from the strongest to the weakest: four-fold repetition, RS(6,2), RS(7,3), ..., RS(15,11), no protection.]

Figure 9 a), b), c) shows the visual quality for different values of $p_{target}$ at a fixed channel rate.

[Fig. 9: Expected visual quality comparison for the 3-D DWT codec and H.264/SVC with ULP: (a)-(c) different values of p_target, (d) best protection settings for p <= 40%.]

First, one can see that, combined with ULP, H.264/SVC provides better performance for low packet loss probabilities. However, this performance gain becomes smaller and smaller with increasing values of $p_{target}$. This behavior can be explained as follows. Since H.264/SVC is more sensitive to packet losses, it needs to add more redundancy through Reed-Solomon coding than the 3-D DWT codec, which has lower compression performance but requires less redundancy for protection (see Table II). Second, to provide a visual quality of more than 24 dB for both video sequences, the proposed codec needs roughly $p_{target} = 10\%$, while H.264/SVC needs $p_{target} = 20\%$. Herewith, the 3-D DWT codec provides better quality in the most common case, when the packet losses caused by signal fading are below 10% and there are no obstacles between the transmitter and the receiver.

To measure the computational complexity we define the encoding speed as the number of frames that can be encoded by the given CPU per second. It was computed without any use of assemblers, threads, or other program optimization techniques.

[TABLE II: Protection redundancy as a function of the packet loss probability p for H.264/SVC and the 3-D DWT codec, for the Tampere and Tampere 4 sequences.]

[TABLE III: Encoding speed in frames per second on an Intel Core 2 GHz CPU as a function of the video bit rate, for H.264/SVC (JSVM 9.8), x264, 3-D DWT, 3-D DWT+ULP (p=0.1) and 3-D DWT+ULP (p=0.2), for the Tampere and Tampere 4 sequences.]

Table III shows the encoding speed of the proposed codec and H.264/SVC with respect to the bit rate. Since the JSVM 9.8 reference software is not optimized for real-time operation, we also used a fast implementation of the H.264/AVC standard in single-layer mode (the x264 codec in the ultrafast profile) to estimate the maximum encoding speed of H.264/SVC after optimization. Moreover, we observe that the complexity of our codec decreases with an increase of the packet loss probability or with a decrease of the bit rate. The reason is as follows. As the packet loss probability increases, the portion of parity packets in the output bit stream also increases, and the video compression ratio has to be increased to keep the combined source and parity bit rate equal to the channel rate. As a result, condition (1) holds for an increasing number of subbands, which are then skipped together with their child subbands. In other words, the 2-D spatial DWT, the entropy coding as well as the packetization, Reed-Solomon encoding and protection mode selection are not executed for a significant fraction of the spatial subbands. Our estimation in Table III shows that the computational complexity of the proposed codec is at least 3-5 times lower than that of H.264/SVC. Taking into account that a transmission system based on H.264/SVC also requires end-to-end estimation of the distortion caused by packet losses, erasure-correction coding and ULP optimization, its full complexity is likely to be even higher than in our estimation.

C. Packet interleaving

The appearance of short-term obstacles between the transmitter and the receiver antennas leads to situations in which no packets at all are delivered to the receiver. We refer to this situation as a packet loss burst. Burst packet losses make inter-packet protection useless, since the source as well as the parity packets are lost together. To overcome this problem, interleaving of the transmitted packets can be applied. First, the encoder accumulates source and parity packets belonging to several GOFs in the interleaver buffer. Then all packets are transmitted from the buffer in random order. When the packet burst duration is smaller than the duration of packet accumulation in the interleaver buffer, the interleaving results in the emulation of independent random packet losses from the buffer. Therefore, when interleaving is enabled, our inter-packet protection designed for random packet losses is applicable to the case of bursty packet losses as well.

Taking into account that in real-life scenarios packet bursts can have different durations, as in the previous case it makes sense to define a maximum packet burst duration $T_{burst}^{max}$ and to formulate the requirement for the codec as follows: for all burst durations less than $T_{burst}^{max}$, the codec should provide the minimum acceptable visual quality (24 dB). A burst of duration $T_{burst}$ spread over the interleaver window of duration $T_{int}$ contributes a packet loss fraction of roughly $T_{burst}/T_{int}$, which must stay within $p_{max}$; accordingly, the duration of accumulation in the interleaver buffer should satisfy $T_{int} \ge T_{burst}^{max} / p_{max}$.
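The interleaving step itself is simple; the following C sketch conveys the idea. The packet type, the send() callback and the use of a Fisher-Yates shuffle are our own illustrative assumptions, standing in for whatever transmission order the real system uses.

```c
/* Sketch of the packet interleaver of Section V-C: source and parity packets
 * of several consecutive GOFs are accumulated and then sent in random order,
 * so that a loss burst shorter than the interleaver window looks like
 * independent random losses to the RS decoder. */
#include <stdlib.h>

typedef struct { unsigned char *data; unsigned len; } packet;

void interleave_and_send(packet *buf, int n_packets,
                         void (*send)(const packet *))
{
    for (int i = n_packets - 1; i > 0; i--) {     /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        packet tmp = buf[i];
        buf[i] = buf[j];
        buf[j] = tmp;
    }
    for (int i = 0; i < n_packets; i++)           /* transmit in shuffled */
        send(&buf[i]);                            /* order                */
}
```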
VI. REAL-WORLD EXPERIMENTAL RESULTS

We used Componentality FlexRoad equipment in our measurements. Componentality Oy is a Finnish company designing and manufacturing automotive multimedia and communication systems. The characteristics of the Componentality FlexRoad devices used in this paper are the following: an Atheros CPU with MIPS architecture; 64 MB of RAM; a 5.9 GHz IEEE 802.11p radio module⁴; 2 dBm and 4 dBm antennas on the transmitter and receiver sides, respectively.

The proposed and implemented video transmission system is presented in Figure 10. It is based on the DirectShow multimedia framework [27] and performs video capturing from a USB video camera, video encoding, decoding, protection and playback in real time. For RTP streaming the open-source JRTPLIB library [28] was used. The target packet loss probability $p_{target}$ was set to 0.1. The decoding start-up latency was set to 6 GOFs, or 3.2 sec, and the interleaver buffering latency was set to 4 GOFs, or 2.1 sec, which corresponds to a maximum burst duration of $T_{burst}^{max} = 0.85$ sec. Taking into account the input frame buffering, the end-to-end latency is less than 6 sec, which is acceptable for video surveillance applications.

The following experiment was carried out on a major street close to the university campus in Hervanta, a suburb of Tampere, Finland. The video data was transmitted while the vehicle was moving at a speed of 50 km/h, starting from a distance of approximately 1000 meters. Figure 11a shows the packet loss rate per GOF depending on the distance, while Figure 11b presents the corresponding visual quality. During the experiment the traffic was moderate, and some cars, buses and a pedestrian bridge occasionally obstructed the line of sight between the transmitting vehicle and the receiving roadside unit (see the labels in Figure 11a). In our experiment, continuous video playback was provided when the distance between the vehicle and the roadside unit was less than 800 meters, and good visual quality was achieved for distances up to 600 meters. Our results show that acceptable video transmission is also possible at distances of 600-800 meters, but in this case a better protection scheme should be applied (higher inter-packet coding redundancy and a longer interleaving buffering delay).

VII. CONCLUSION

IEEE 802.11p communication technology makes it possible to introduce new automotive applications which make use of broadband vehicle-to-vehicle and vehicle-to-roadside connectivity. We have developed and evaluated a new surveillance system aimed at improving public transport security and road traffic control. For the case of a single transmitter-receiver pair we conclude that good visual quality can be achieved with commercially available Componentality FlexRoad devices and our 3-D DWT video codec when the distance between the vehicle and the roadside unit is less than 600 meters, and acceptable visual quality at distances of 600-800 meters. For the above distances continuous playback is also guaranteed. Our future work is dedicated to the evaluation of the proposed system for multiple transmitting vehicles and a set of interconnected roadside units.

⁴ This information was provided to us by Componentality Oy. We have not performed a study of the compliance of the FlexRoad devices with the IEEE 802.11p standard.

[Fig. 10: Implemented video transmission system. Transmitter: input frame buffer, 1-D temporal DWT, 2-D spatial DWT (with skip flag), joint entropy encoding and protection mode selection, inter-packet RS encoder and RTP packetization, packet interleaver, all driven by the rate controller (channel rate, target packet loss probability). Receiver, connected through the IEEE 802.11p network: RTP depacketization and RS decoder, entropy decoder, 2-D spatial and 1-D temporal inverse DWT, playback buffer, visual quality measurement.]

[Fig. 11: Real-world experimental results for the proposed joint source-channel video coding algorithm, Tampere 4 (640x480, 30 Hz, 300 frames): (a) packet loss rate per GOF depending on the distance, (b) visual quality (Y-PSNR, dB) depending on the distance.]

ACKNOWLEDGMENT

The authors thank Componentality Oy for the provided DSRC FlexRoad platform. This work was partly supported by the project of the National Natural Science Foundation of China (NSFC) for International Young Scientists.

APPENDIX A
LAGRANGIAN RELAXATION METHOD

The optimization problem (9) can be solved by the Lagrangian relaxation method [25] when the two following propositions hold.

Proposition 1. For each $\lambda \ge 0$, the set of decision vectors $\Psi^{\lambda} \in \{\Psi\}$ which minimizes

$$E[D(\Psi)] + \lambda R(\Psi) \qquad (14)$$

is an optimal solution of the optimization problem (9) when $R_{max} = R(\Psi^{\lambda})$.

Proof: Let us assume that the statement does not hold and that a set of decision vectors $\Psi \in \{\Psi\}$ exists such that $E[D(\Psi)] < E[D(\Psi^{\lambda})]$ and $R(\Psi) \le R(\Psi^{\lambda})$. Then the inequality

$$E[D(\Psi)] + \lambda R(\Psi) < E[D(\Psi^{\lambda})] + \lambda R(\Psi^{\lambda})$$

means that the set of decision vectors $\Psi^{\lambda}$ does not minimize (14), which contradicts the statement being proved. From this proposition it follows that $\lambda^*$, such that $R(\Psi^{\lambda^*}) = R_{max}$, yields the solution of (9).

Proposition 2. Let us assume that for some $\lambda_1$ and $\lambda_2$ the values $\Psi^{\lambda_1}$ and $\Psi^{\lambda_2}$ which minimize (14) have been found. Then, if $R(\Psi^{\lambda_1}) > R(\Psi^{\lambda_2})$, the following inequality holds:

$$-\lambda_2 \le \frac{E[D(\Psi^{\lambda_1})] - E[D(\Psi^{\lambda_2})]}{R(\Psi^{\lambda_1}) - R(\Psi^{\lambda_2})} \le -\lambda_1. \qquad (15)$$

Proof: From Proposition 1 it follows that

$$E[D(\Psi^{\lambda_1})] + \lambda_1 R(\Psi^{\lambda_1}) \le E[D(\Psi^{\lambda_2})] + \lambda_1 R(\Psi^{\lambda_2}). \qquad (16)$$

From (16) and $R(\Psi^{\lambda_1}) > R(\Psi^{\lambda_2})$ it follows that

$$\frac{E[D(\Psi^{\lambda_1})] - E[D(\Psi^{\lambda_2})]}{R(\Psi^{\lambda_1}) - R(\Psi^{\lambda_2})} \le -\lambda_1, \qquad (17)$$

which proves the right-hand side of (15). Analogously, from Proposition 1 it follows that

$$E[D(\Psi^{\lambda_2})] + \lambda_2 R(\Psi^{\lambda_2}) \le E[D(\Psi^{\lambda_1})] + \lambda_2 R(\Psi^{\lambda_1}). \qquad (18)$$

From (18) and $R(\Psi^{\lambda_1}) > R(\Psi^{\lambda_2})$ it follows that

$$-\lambda_2 \le \frac{E[D(\Psi^{\lambda_1})] - E[D(\Psi^{\lambda_2})]}{R(\Psi^{\lambda_1}) - R(\Psi^{\lambda_2})}. \qquad (19)$$

From this proposition it follows that $R(\Psi^{\lambda})$ is a non-increasing function of $\lambda$, which means that the target $\lambda$ value can be found by the bisection method [25]. Taking into account that in practice $R(\Psi^{\lambda})$ is a discrete function, in many cases no $\Psi^{\lambda}$ exists such that $R(\Psi^{\lambda}) = R_{max}$. In that case the decision vector $\Psi^{\lambda}$ for which $R(\Psi^{\lambda})$ is the largest value not exceeding $R_{max}$ is taken as the solution of the problem.

REFERENCES

[1] IEEE Standard for Information technology -- Telecommunications and information exchange between systems -- Local and metropolitan area networks -- Specific requirements -- Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications.
[2] G. Karagiannis, O. Altintas, E. Ekici, G. J. Heijenk, B. Jarupan, K. Lin and T. Weil, "Vehicular networking: A survey and tutorial on requirements, architectures, challenges, standards and solutions," IEEE Communications Surveys & Tutorials, vol. 13, no. 4, 2011.
[3] A. Vinel, E. Belyaev, K. Egiazarian and Y. Koucheryavy, "An overtaking assistance system based on joint beaconing and real-time video transmission," IEEE Transactions on Vehicular Technology, vol. 61, no. 5, 2012.
[4] N. N. Qadri, M. Fleury, M. Altaf and M. Ghanbari, "Multi-source video streaming in a wireless vehicular ad hoc network," IET Communications, vol. 4, 2010.
[5] M. Asefi, J. W. Mark and Xuemin Shen, "An Application-Centric Inter-Vehicle Routing Protocol for Video Streaming over Multi-Hop Urban VANETs," IEEE International Conference on Communications (ICC 2011), Kyoto, June 2011.
[6] L. Zhou, Yan Zhang, K. Song, Weiping Jing and A. Vasilakos, "Distributed Media Services in P2P-Based Vehicular Networks," IEEE Transactions on Vehicular Technology, vol. 60, no. 2, 2011.
[7] F. Soldo, C. Casetti, C.-F. Chiasserini and P. A. Chaparro, "Video Streaming Distribution in VANETs," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 7, 2011.
[8] N. Baghaei and R. Hunt, "Review of quality of service performance in wireless LANs and 3G multimedia application services," Computer Communications, vol. 27, 2004.
[9] A. Vinel, E. Belyaev, O. Lamotte, M. Gabbouj, Y. Koucheryavy and K. Egiazarian, "Video transmission over IEEE 802.11p: real-world measurements," IEEE ICC 2013 (Workshop on Emerging Vehicular Networks).
[10] J. Gozalvez, M. Sepulcre and R. Bauza, "IEEE 802.11p vehicle to infrastructure communications in urban environments," IEEE Communications Magazine, vol. 50, no. 5, 2012.
[11] T. Abbas, L. Bernado, A. Thiel, C. F. Mecklenbrauker and F. Tufvesson, "Radio Channel Properties for Vehicular Communication: Merging Lanes Versus Urban Intersections," IEEE Vehicular Technology Magazine, vol. 8, no. 4, 2013.
[12] M. Gallant and F. Kossentini, "Rate-distortion optimized layered coding with unequal error protection for robust internet video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, 2001.
[13] M. van der Schaar, Y. Andreopoulos and Z. Hu, "Optimized scalable video streaming over IEEE 802.11a/e HCCA wireless networks under delay constraints," IEEE Transactions on Mobile Computing, vol. 5, no. 6, 2006.
[14] B. Girod, K. W. Stuhlmüller, M. Link and U. Horn, "Packet loss resilient internet video streaming," SPIE Visual Communications and Image Processing, 1999.
[15] M. van der Schaar and H. Radha, "Unequal packet loss resilience for fine-granular-scalability video," IEEE Transactions on Multimedia, vol. 3, no. 4, 2001.
[16] E. Maani and A. Katsaggelos, "Unequal Error Protection for Robust Streaming of Scalable Video Over Packet Lossy Networks," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 3, 2010.
[17] V. Stankovic, R. Hamzaoui and Z. Xiong, "Real-time error protection of embedded codes for packet erasure and fading channels," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 8, 2004.
[18] J. Fowler, M. Tagliasacchi and B. Pesquet-Popescu, "Video Coding with Wavelet-Domain Conditional Replenishment and Unequal Error Protection," IEEE International Conference on Image Processing, 2006.
[19] M. van der Schaar and D. Turaga, "Cross-Layer Packetization and Retransmission Strategies for Delay-Sensitive Wireless Multimedia Transmission," IEEE Transactions on Multimedia, vol. 9, no. 1, 2007.
[20] E. Belyaev, K. Egiazarian and M. Gabbouj, "A low-complexity bit-plane entropy coding and rate control for 3-D DWT based video coding," IEEE Transactions on Multimedia, vol. 15, no. 8, 2013.
[21] E. Belyaev, K. Egiazarian, M. Gabbouj and K. Liu, "A low-complexity Joint Source-Channel Video Coding for 3-D DWT Codec," International Symposium on Communication and Information Theory, Chengdu, China, December 2013 (published in Journal of Communications, vol. 8, no. 12, 2013).
[22] E. R. Berlekamp, Algebraic Coding Theory, McGraw-Hill, 1968.
[23] Jin Li, "The efficient implementation of Reed-Solomon high rate erasure resilient codes," IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.
[24] ITU-T and ISO/IEC JTC 1, JPEG 2000 Image Coding System: Core Coding System, ITU-T Recommendation T.800 and ISO/IEC JPEG 2000 Part 1, 2000.
[25] G. M. Schuster and A. K. Katsaggelos, Rate-Distortion Based Video Compression: Optimal Video Frame Compression and Object Boundary Encoding, Kluwer Academic Publishers, 1997.
[26] Test video sequences for VANET applications, belyaev/v2v.htm.
[27] DirectShow multimedia framework.
[28] JRTPLIB.
[29] JSVM 9.8 software package.

Evgeny Belyaev (M'12) received the Engineer degree in automated systems of information processing and control and the Ph.D. degree (candidate of science) in technical sciences from the State University of Aerospace Instrumentation, Saint Petersburg, Russia, in 2005 and 2009, respectively. He is currently a Researcher with the Institute of Signal Processing, Tampere University of Technology, Tampere, Finland. He was recognized as an Exemplary Reviewer of IEEE Communications Letters in 2014 and was a finalist of the Grand Video Compression Challenge organized by the 30th Picture Coding Symposium in 2013. His research interests include real-time video compression and transmission, video source rate control, scalable video coding, motion estimation, and arithmetic encoding.

Alexey Vinel (SM'12) received the Bachelor's (Hons.) and Master's (Hons.) degrees in information systems from the Saint Petersburg State University of Aerospace Instrumentation, St. Petersburg, Russia, in 2003 and 2005, respectively, and the Ph.D. degree (candidate of science) in technical sciences from the Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia, in 2007. He is currently a Guest Professor with the School of Information Science, Computer and Electrical Engineering, Halmstad University, Halmstad, Sweden (on leave from Tampere University of Technology, Tampere, Finland). He has been an Associate Editor for IEEE Communications Letters since 2012. His research interests include wireless networking protocols and intelligent transportation systems.

Adam Surak received his B.Sc. and M.Sc. degrees with distinction in information technologies from VSB - Technical University of Ostrava. He is currently a Research and Teaching Assistant at Tampere University of Technology, Finland. His research and teaching interests include computer networks, wireless networks and computer systems.

Moncef Gabbouj (F'11) received his B.S. degree in electrical engineering in 1985 from Oklahoma State University, and his M.S. and Ph.D. degrees in electrical engineering from Purdue University in 1986 and 1989, respectively. Dr. Gabbouj is an Academy of Finland Professor. He has held several visiting professorships at different universities, including Purdue University, the University of Southern California and the Hong Kong University of Science and Technology. He holds a permanent position of Professor at the Department of Signal Processing, Tampere University of Technology. His research interests include multimedia, machine learning, nonlinear signal processing, voice conversion, and video processing and coding. Dr. Gabbouj is a Fellow of the IEEE and a member of the Finnish Academy of Science and Letters. He is currently the Chairman of the IEEE CAS Technical Committee on DSP and a committee member of the IEEE Fourier Award for Signal Processing. He served as a Distinguished Lecturer for the IEEE CASS, as an associate editor of the IEEE Transactions on Image Processing, and as a guest editor of Multimedia Tools and Applications and of the European journal Applied Signal Processing.

Magnus Jonsson (SM'07) received his B.S. and M.S. degrees in computer engineering from Halmstad University, Sweden, in 1993 and 1994, respectively. He then obtained the Licentiate of Technology and Ph.D. degrees in computer engineering from Chalmers University of Technology, Gothenburg, Sweden, in 1997 and 1999, respectively. He has been a full Professor of Real-Time Computer Systems at Halmstad University since 2003, where he is also Director of Research at the School of IDE and vice-director of CERES, the Centre for Research on Embedded Systems. From 1998 to March 2003 he was Associate Professor of Data Communication at Halmstad University (acting between 1998 and 2000). He has published over 90 scientific papers and book chapters, most of them in the areas of real-time communication, wireless networking, real-time and embedded computer systems, optical networking and optical interconnection architectures.

Karen Egiazarian (SM'96) received the M.Sc. degree in mathematics from Yerevan State University in 1981, the Ph.D. degree in physics and mathematics from Moscow State University, Moscow, Russia, in 1986, and the D.Tech. degree from the Tampere University of Technology (TUT), Tampere, Finland, in 1994. He has been a Senior Researcher with the Department of Digital Signal Processing, Institute of Information Problems and Automation, National Academy of Sciences of Armenia. Since 1996 he has been an Assistant Professor with the Institute of Signal Processing, TUT, and since 1999 a Professor, leading the Computational Imaging Group. His research interests are in the areas of applied mathematics and image and video processing.


4G WIRELESS VIDEO COMMUNICATIONS

4G WIRELESS VIDEO COMMUNICATIONS 4G WIRELESS VIDEO COMMUNICATIONS Haohong Wang Marvell Semiconductors, USA Lisimachos P. Kondi University of Ioannina, Greece Ajay Luthra Motorola, USA Song Ci University of Nebraska-Lincoln, USA WILEY

More information

A 3-D Virtual SPIHT for Scalable Very Low Bit-Rate Embedded Video Compression

A 3-D Virtual SPIHT for Scalable Very Low Bit-Rate Embedded Video Compression A 3-D Virtual SPIHT for Scalable Very Low Bit-Rate Embedded Video Compression Habibollah Danyali and Alfred Mertins University of Wollongong School of Electrical, Computer and Telecommunications Engineering

More information

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California,

More information

Video-Aware Wireless Networks (VAWN) Final Meeting January 23, 2014

Video-Aware Wireless Networks (VAWN) Final Meeting January 23, 2014 Video-Aware Wireless Networks (VAWN) Final Meeting January 23, 2014 1/26 ! Real-time Video Transmission! Challenges and Opportunities! Lessons Learned for Real-time Video! Mitigating Losses in Scalable

More information

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

International Journal of Emerging Technology and Advanced Engineering Website:   (ISSN , Volume 2, Issue 4, April 2012) A Technical Analysis Towards Digital Video Compression Rutika Joshi 1, Rajesh Rai 2, Rajesh Nema 3 1 Student, Electronics and Communication Department, NIIST College, Bhopal, 2,3 Prof., Electronics and

More information

Video Compression An Introduction

Video Compression An Introduction Video Compression An Introduction The increasing demand to incorporate video data into telecommunications services, the corporate environment, the entertainment industry, and even at home has made digital

More information

Channel-Adaptive Error Protection for Scalable Audio Streaming over Wireless Internet

Channel-Adaptive Error Protection for Scalable Audio Streaming over Wireless Internet Channel-Adaptive Error Protection for Scalable Audio Streaming over Wireless Internet GuiJin Wang Qian Zhang Wenwu Zhu Jianping Zhou Department of Electronic Engineering, Tsinghua University, Beijing,

More information

MULTIMEDIA COMMUNICATION

MULTIMEDIA COMMUNICATION MULTIMEDIA COMMUNICATION Laboratory Session: JPEG Standard Fernando Pereira The objective of this lab session about the JPEG (Joint Photographic Experts Group) standard is to get the students familiar

More information

40 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2006

40 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2006 40 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2006 Rate-Distortion Optimized Hybrid Error Control for Real-Time Packetized Video Transmission Fan Zhai, Member, IEEE, Yiftach Eisenberg,

More information

Wavelet Transform (WT) & JPEG-2000

Wavelet Transform (WT) & JPEG-2000 Chapter 8 Wavelet Transform (WT) & JPEG-2000 8.1 A Review of WT 8.1.1 Wave vs. Wavelet [castleman] 1 0-1 -2-3 -4-5 -6-7 -8 0 100 200 300 400 500 600 Figure 8.1 Sinusoidal waves (top two) and wavelets (bottom

More information

OPTIMIZATION OF LOW DELAY WAVELET VIDEO CODECS

OPTIMIZATION OF LOW DELAY WAVELET VIDEO CODECS OPTIMIZATION OF LOW DELAY WAVELET VIDEO CODECS Andrzej Popławski, Marek Domański 2 Uniwersity of Zielona Góra, Institute of Computer Engineering and Electronics, Poland 2 Poznań University of Technology,

More information

Robust Wireless Delivery of Scalable Videos using Inter-layer Network Coding

Robust Wireless Delivery of Scalable Videos using Inter-layer Network Coding Robust Wireless Delivery of Scalable Videos using Inter-layer Network Coding Pouya Ostovari and Jie Wu Department of Computer & Information Sciences, Temple University, Philadelphia, PA 19122 Abstract

More information

JPEG 2000 vs. JPEG in MPEG Encoding

JPEG 2000 vs. JPEG in MPEG Encoding JPEG 2000 vs. JPEG in MPEG Encoding V.G. Ruiz, M.F. López, I. García and E.M.T. Hendrix Dept. Computer Architecture and Electronics University of Almería. 04120 Almería. Spain. E-mail: vruiz@ual.es, mflopez@ace.ual.es,

More information

Adaptive Quantization for Video Compression in Frequency Domain

Adaptive Quantization for Video Compression in Frequency Domain Adaptive Quantization for Video Compression in Frequency Domain *Aree A. Mohammed and **Alan A. Abdulla * Computer Science Department ** Mathematic Department University of Sulaimani P.O.Box: 334 Sulaimani

More information

Distributed Video Coding

Distributed Video Coding Distributed Video Coding Bernd Girod Anne Aaron Shantanu Rane David Rebollo-Monedero David Varodayan Information Systems Laboratory Stanford University Outline Lossless and lossy compression with receiver

More information

Compression of RADARSAT Data with Block Adaptive Wavelets Abstract: 1. Introduction

Compression of RADARSAT Data with Block Adaptive Wavelets Abstract: 1. Introduction Compression of RADARSAT Data with Block Adaptive Wavelets Ian Cumming and Jing Wang Department of Electrical and Computer Engineering The University of British Columbia 2356 Main Mall, Vancouver, BC, Canada

More information

Advances of MPEG Scalable Video Coding Standard

Advances of MPEG Scalable Video Coding Standard Advances of MPEG Scalable Video Coding Standard Wen-Hsiao Peng, Chia-Yang Tsai, Tihao Chiang, and Hsueh-Ming Hang National Chiao-Tung University 1001 Ta-Hsueh Rd., HsinChu 30010, Taiwan pawn@mail.si2lab.org,

More information

Unit-level Optimization for SVC Extractor

Unit-level Optimization for SVC Extractor Unit-level Optimization for SVC Extractor Chang-Ming Lee, Chia-Ying Lee, Bo-Yao Huang, and Kang-Chih Chang Department of Communications Engineering National Chung Cheng University Chiayi, Taiwan changminglee@ee.ccu.edu.tw,

More information

Enhanced Hybrid Compound Image Compression Algorithm Combining Block and Layer-based Segmentation

Enhanced Hybrid Compound Image Compression Algorithm Combining Block and Layer-based Segmentation Enhanced Hybrid Compound Image Compression Algorithm Combining Block and Layer-based Segmentation D. Maheswari 1, Dr. V.Radha 2 1 Department of Computer Science, Avinashilingam Deemed University for Women,

More information

DIGITAL IMAGE PROCESSING WRITTEN REPORT ADAPTIVE IMAGE COMPRESSION TECHNIQUES FOR WIRELESS MULTIMEDIA APPLICATIONS

DIGITAL IMAGE PROCESSING WRITTEN REPORT ADAPTIVE IMAGE COMPRESSION TECHNIQUES FOR WIRELESS MULTIMEDIA APPLICATIONS DIGITAL IMAGE PROCESSING WRITTEN REPORT ADAPTIVE IMAGE COMPRESSION TECHNIQUES FOR WIRELESS MULTIMEDIA APPLICATIONS SUBMITTED BY: NAVEEN MATHEW FRANCIS #105249595 INTRODUCTION The advent of new technologies

More information

One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain

One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain Author manuscript, published in "International Symposium on Broadband Multimedia Systems and Broadcasting, Bilbao : Spain (2009)" One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain

More information

Advanced Video Coding: The new H.264 video compression standard

Advanced Video Coding: The new H.264 video compression standard Advanced Video Coding: The new H.264 video compression standard August 2003 1. Introduction Video compression ( video coding ), the process of compressing moving images to save storage space and transmission

More information

ELEC 691X/498X Broadcast Signal Transmission Winter 2018

ELEC 691X/498X Broadcast Signal Transmission Winter 2018 ELEC 691X/498X Broadcast Signal Transmission Winter 2018 Instructor: DR. Reza Soleymani, Office: EV 5.125, Telephone: 848 2424 ext.: 4103. Office Hours: Wednesday, Thursday, 14:00 15:00 Slide 1 In this

More information

ESTIMATION OF THE UTILITIES OF THE NAL UNITS IN H.264/AVC SCALABLE VIDEO BITSTREAMS. Bin Zhang, Mathias Wien and Jens-Rainer Ohm

ESTIMATION OF THE UTILITIES OF THE NAL UNITS IN H.264/AVC SCALABLE VIDEO BITSTREAMS. Bin Zhang, Mathias Wien and Jens-Rainer Ohm 19th European Signal Processing Conference (EUSIPCO 2011) Barcelona, Spain, August 29 - September 2, 2011 ESTIMATION OF THE UTILITIES OF THE NAL UNITS IN H.264/AVC SCALABLE VIDEO BITSTREAMS Bin Zhang,

More information

CHAPTER 4 REVERSIBLE IMAGE WATERMARKING USING BIT PLANE CODING AND LIFTING WAVELET TRANSFORM

CHAPTER 4 REVERSIBLE IMAGE WATERMARKING USING BIT PLANE CODING AND LIFTING WAVELET TRANSFORM 74 CHAPTER 4 REVERSIBLE IMAGE WATERMARKING USING BIT PLANE CODING AND LIFTING WAVELET TRANSFORM Many data embedding methods use procedures that in which the original image is distorted by quite a small

More information

Scalable Video Coding

Scalable Video Coding 1 Scalable Video Coding Z. Shahid, M. Chaumont and W. Puech LIRMM / UMR 5506 CNRS / Universite Montpellier II France 1. Introduction With the evolution of Internet to heterogeneous networks both in terms

More information

Module 7 VIDEO CODING AND MOTION ESTIMATION

Module 7 VIDEO CODING AND MOTION ESTIMATION Module 7 VIDEO CODING AND MOTION ESTIMATION Lesson 20 Basic Building Blocks & Temporal Redundancy Instructional Objectives At the end of this lesson, the students should be able to: 1. Name at least five

More information

Bit-Plane Decomposition Steganography Using Wavelet Compressed Video

Bit-Plane Decomposition Steganography Using Wavelet Compressed Video Bit-Plane Decomposition Steganography Using Wavelet Compressed Video Tomonori Furuta, Hideki Noda, Michiharu Niimi, Eiji Kawaguchi Kyushu Institute of Technology, Dept. of Electrical, Electronic and Computer

More information

STACK ROBUST FINE GRANULARITY SCALABLE VIDEO CODING

STACK ROBUST FINE GRANULARITY SCALABLE VIDEO CODING Journal of the Chinese Institute of Engineers, Vol. 29, No. 7, pp. 1203-1214 (2006) 1203 STACK ROBUST FINE GRANULARITY SCALABLE VIDEO CODING Hsiang-Chun Huang and Tihao Chiang* ABSTRACT A novel scalable

More information

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework System Modeling and Implementation of MPEG-4 Encoder under Fine-Granular-Scalability Framework Literature Survey Embedded Software Systems Prof. B. L. Evans by Wei Li and Zhenxun Xiao March 25, 2002 Abstract

More information

ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS

ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS Ye-Kui Wang 1, Miska M. Hannuksela 2 and Moncef Gabbouj 3 1 Tampere International Center for Signal Processing (TICSP), Tampere,

More information

Homogeneous Transcoding of HEVC for bit rate reduction

Homogeneous Transcoding of HEVC for bit rate reduction Homogeneous of HEVC for bit rate reduction Ninad Gorey Dept. of Electrical Engineering University of Texas at Arlington Arlington 7619, United States ninad.gorey@mavs.uta.edu Dr. K. R. Rao Fellow, IEEE

More information

Performance Comparison between DWT-based and DCT-based Encoders

Performance Comparison between DWT-based and DCT-based Encoders , pp.83-87 http://dx.doi.org/10.14257/astl.2014.75.19 Performance Comparison between DWT-based and DCT-based Encoders Xin Lu 1 and Xuesong Jin 2 * 1 School of Electronics and Information Engineering, Harbin

More information

Adaptive Cross-Layer Protection Strategies for Robust Scalable Video Transmission Over WLANs

Adaptive Cross-Layer Protection Strategies for Robust Scalable Video Transmission Over WLANs 1752 IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 21, NO. 10, DECEMBER 2003 Adaptive Cross-Layer Protection Strategies for Robust Scalable Video Transmission Over 802.11 WLANs Mihaela van der

More information

Digital Video Processing

Digital Video Processing Video signal is basically any sequence of time varying images. In a digital video, the picture information is digitized both spatially and temporally and the resultant pixel intensities are quantized.

More information

Scalable Extension of HEVC 한종기

Scalable Extension of HEVC 한종기 Scalable Extension of HEVC 한종기 Contents 0. Overview for Scalable Extension of HEVC 1. Requirements and Test Points 2. Coding Gain/Efficiency 3. Complexity 4. System Level Considerations 5. Related Contributions

More information

Motion Estimation. Original. enhancement layers. Motion Compensation. Baselayer. Scan-Specific Entropy Coding. Prediction Error.

Motion Estimation. Original. enhancement layers. Motion Compensation. Baselayer. Scan-Specific Entropy Coding. Prediction Error. ON VIDEO SNR SCALABILITY Lisimachos P. Kondi, Faisal Ishtiaq and Aggelos K. Katsaggelos Northwestern University Dept. of Electrical and Computer Engineering 2145 Sheridan Road Evanston, IL 60208 E-Mail:

More information

Comparison of Shaping and Buffering for Video Transmission

Comparison of Shaping and Buffering for Video Transmission Comparison of Shaping and Buffering for Video Transmission György Dán and Viktória Fodor Royal Institute of Technology, Department of Microelectronics and Information Technology P.O.Box Electrum 229, SE-16440

More information

Compression of Stereo Images using a Huffman-Zip Scheme

Compression of Stereo Images using a Huffman-Zip Scheme Compression of Stereo Images using a Huffman-Zip Scheme John Hamann, Vickey Yeh Department of Electrical Engineering, Stanford University Stanford, CA 94304 jhamann@stanford.edu, vickey@stanford.edu Abstract

More information

Video Coding Using Spatially Varying Transform

Video Coding Using Spatially Varying Transform Video Coding Using Spatially Varying Transform Cixun Zhang 1, Kemal Ugur 2, Jani Lainema 2, and Moncef Gabbouj 1 1 Tampere University of Technology, Tampere, Finland {cixun.zhang,moncef.gabbouj}@tut.fi

More information

Embedded Rate Scalable Wavelet-Based Image Coding Algorithm with RPSWS

Embedded Rate Scalable Wavelet-Based Image Coding Algorithm with RPSWS Embedded Rate Scalable Wavelet-Based Image Coding Algorithm with RPSWS Farag I. Y. Elnagahy Telecommunications Faculty of Electrical Engineering Czech Technical University in Prague 16627, Praha 6, Czech

More information

ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING

ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING ABSTRACT Xiangyang Ji *1, Jizheng Xu 2, Debin Zhao 1, Feng Wu 2 1 Institute of Computing Technology, Chinese Academy

More information

An Efficient Saliency Based Lossless Video Compression Based On Block-By-Block Basis Method

An Efficient Saliency Based Lossless Video Compression Based On Block-By-Block Basis Method An Efficient Saliency Based Lossless Video Compression Based On Block-By-Block Basis Method Ms. P.MUTHUSELVI, M.E(CSE), V.P.M.M Engineering College for Women, Krishnankoil, Virudhungar(dt),Tamil Nadu Sukirthanagarajan@gmail.com

More information

Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform

Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform Torsten Palfner, Alexander Mali and Erika Müller Institute of Telecommunications and Information Technology, University of

More information

Multimedia Standards

Multimedia Standards Multimedia Standards SS 2017 Lecture 5 Prof. Dr.-Ing. Karlheinz Brandenburg Karlheinz.Brandenburg@tu-ilmenau.de Contact: Dipl.-Inf. Thomas Köllmer thomas.koellmer@tu-ilmenau.de 1 Organisational issues

More information

SINGLE PASS DEPENDENT BIT ALLOCATION FOR SPATIAL SCALABILITY CODING OF H.264/SVC

SINGLE PASS DEPENDENT BIT ALLOCATION FOR SPATIAL SCALABILITY CODING OF H.264/SVC SINGLE PASS DEPENDENT BIT ALLOCATION FOR SPATIAL SCALABILITY CODING OF H.264/SVC Randa Atta, Rehab F. Abdel-Kader, and Amera Abd-AlRahem Electrical Engineering Department, Faculty of Engineering, Port

More information

CODING METHOD FOR EMBEDDING AUDIO IN VIDEO STREAM. Harri Sorokin, Jari Koivusaari, Moncef Gabbouj, and Jarmo Takala

CODING METHOD FOR EMBEDDING AUDIO IN VIDEO STREAM. Harri Sorokin, Jari Koivusaari, Moncef Gabbouj, and Jarmo Takala CODING METHOD FOR EMBEDDING AUDIO IN VIDEO STREAM Harri Sorokin, Jari Koivusaari, Moncef Gabbouj, and Jarmo Takala Tampere University of Technology Korkeakoulunkatu 1, 720 Tampere, Finland ABSTRACT In

More information

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation Optimizing the Deblocking Algorithm for H.264 Decoder Implementation Ken Kin-Hung Lam Abstract In the emerging H.264 video coding standard, a deblocking/loop filter is required for improving the visual

More information

Optimal Estimation for Error Concealment in Scalable Video Coding

Optimal Estimation for Error Concealment in Scalable Video Coding Optimal Estimation for Error Concealment in Scalable Video Coding Rui Zhang, Shankar L. Regunathan and Kenneth Rose Department of Electrical and Computer Engineering University of California Santa Barbara,

More information

A SCALABLE SPIHT-BASED MULTISPECTRAL IMAGE COMPRESSION TECHNIQUE. Fouad Khelifi, Ahmed Bouridane, and Fatih Kurugollu

A SCALABLE SPIHT-BASED MULTISPECTRAL IMAGE COMPRESSION TECHNIQUE. Fouad Khelifi, Ahmed Bouridane, and Fatih Kurugollu A SCALABLE SPIHT-BASED MULTISPECTRAL IMAGE COMPRESSION TECHNIQUE Fouad Khelifi, Ahmed Bouridane, and Fatih Kurugollu School of Electronics, Electrical engineering and Computer Science Queen s University

More information

A LOW-COMPLEXITY MULTIPLE DESCRIPTION VIDEO CODER BASED ON 3D-TRANSFORMS

A LOW-COMPLEXITY MULTIPLE DESCRIPTION VIDEO CODER BASED ON 3D-TRANSFORMS A LOW-COMPLEXITY MULTIPLE DESCRIPTION VIDEO CODER BASED ON 3D-TRANSFORMS Andrey Norkin, Atanas Gotchev, Karen Egiazarian, Jaakko Astola Institute of Signal Processing, Tampere University of Technology

More information

IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC

IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC 17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC Damian Karwowski, Marek Domański Poznań University

More information

ISSN (ONLINE): , VOLUME-3, ISSUE-1,

ISSN (ONLINE): , VOLUME-3, ISSUE-1, PERFORMANCE ANALYSIS OF LOSSLESS COMPRESSION TECHNIQUES TO INVESTIGATE THE OPTIMUM IMAGE COMPRESSION TECHNIQUE Dr. S. Swapna Rani Associate Professor, ECE Department M.V.S.R Engineering College, Nadergul,

More information

1480 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 5, OCTOBER 2012

1480 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 5, OCTOBER 2012 1480 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 5, OCTOBER 2012 Wireless H.264 Video Quality Enhancement Through Optimal Prioritized Packet Fragmentation Kashyap K. R. Kambhatla, Student Member, IEEE,

More information

Scalable Coding of Image Collections with Embedded Descriptors

Scalable Coding of Image Collections with Embedded Descriptors Scalable Coding of Image Collections with Embedded Descriptors N. Adami, A. Boschetti, R. Leonardi, P. Migliorati Department of Electronic for Automation, University of Brescia Via Branze, 38, Brescia,

More information

Layered Self-Identifiable and Scalable Video Codec for Delivery to Heterogeneous Receivers

Layered Self-Identifiable and Scalable Video Codec for Delivery to Heterogeneous Receivers Layered Self-Identifiable and Scalable Video Codec for Delivery to Heterogeneous Receivers Wei Feng, Ashraf A. Kassim, Chen-Khong Tham Department of Electrical and Computer Engineering National University

More information

An Unequal Packet Loss Protection Scheme for H.264/AVC Video Transmission

An Unequal Packet Loss Protection Scheme for H.264/AVC Video Transmission An Unequal Packet Loss Protection Scheme for H.4/AVC Video Transmission Xingjun Zhang, Xiaohong Peng School of Engineering and Applied Science, Aston University Aston Triangle, Birmingham, B4 7ET, UK {x.zhang,

More information

In the name of Allah. the compassionate, the merciful

In the name of Allah. the compassionate, the merciful In the name of Allah the compassionate, the merciful Digital Video Systems S. Kasaei Room: CE 315 Department of Computer Engineering Sharif University of Technology E-Mail: skasaei@sharif.edu Webpage:

More information

DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS

DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS Television services in Europe currently broadcast video at a frame rate of 25 Hz. Each frame consists of two interlaced fields, giving a field rate of 50

More information

Image coding based on multiband wavelet and adaptive quad-tree partition

Image coding based on multiband wavelet and adaptive quad-tree partition Journal of Computational and Applied Mathematics 195 (2006) 2 7 www.elsevier.com/locate/cam Image coding based on multiband wavelet and adaptive quad-tree partition Bi Ning a,,1, Dai Qinyun a,b, Huang

More information

Cross-Layer Optimization for Efficient Delivery of Scalable Video over WiMAX Lung-Jen Wang 1, a *, Chiung-Yun Chang 2,b and Jen-Yi Huang 3,c

Cross-Layer Optimization for Efficient Delivery of Scalable Video over WiMAX Lung-Jen Wang 1, a *, Chiung-Yun Chang 2,b and Jen-Yi Huang 3,c Applied Mechanics and Materials Submitted: 2016-06-28 ISSN: 1662-7482, Vol. 855, pp 171-177 Revised: 2016-08-13 doi:10.4028/www.scientific.net/amm.855.171 Accepted: 2016-08-23 2017 Trans Tech Publications,

More information

Recommended Readings

Recommended Readings Lecture 11: Media Adaptation Scalable Coding, Dealing with Errors Some slides, images were from http://ip.hhi.de/imagecom_g1/savce/index.htm and John G. Apostolopoulos http://www.mit.edu/~6.344/spring2004

More information

Advances in Efficient Resource Allocation for Packet-Based Real-Time Video Transmission

Advances in Efficient Resource Allocation for Packet-Based Real-Time Video Transmission Advances in Efficient Resource Allocation for Packet-Based Real-Time Video Transmission AGGELOS K. KATSAGGELOS, FELLOW, IEEE, YIFTACH EISENBERG, MEMBER, IEEE, FAN ZHAI, MEMBER, IEEE, RANDALL BERRY, MEMBER,

More information

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri MPEG MPEG video is broken up into a hierarchy of layer From the top level, the first layer is known as the video sequence layer, and is any self contained bitstream, for example a coded movie. The second

More information

Fine grain scalable video coding using 3D wavelets and active meshes

Fine grain scalable video coding using 3D wavelets and active meshes Fine grain scalable video coding using 3D wavelets and active meshes Nathalie Cammas a,stéphane Pateux b a France Telecom RD,4 rue du Clos Courtel, Cesson-Sévigné, France b IRISA, Campus de Beaulieu, Rennes,

More information

Error Concealment Used for P-Frame on Video Stream over the Internet

Error Concealment Used for P-Frame on Video Stream over the Internet Error Concealment Used for P-Frame on Video Stream over the Internet MA RAN, ZHANG ZHAO-YANG, AN PING Key Laboratory of Advanced Displays and System Application, Ministry of Education School of Communication

More information

For layered video encoding, video sequence is encoded into a base layer bitstream and one (or more) enhancement layer bit-stream(s).

For layered video encoding, video sequence is encoded into a base layer bitstream and one (or more) enhancement layer bit-stream(s). 3rd International Conference on Multimedia Technology(ICMT 2013) Video Standard Compliant Layered P2P Streaming Man Yau Chiu 1, Kangheng Wu 1, Zhibin Lei 1 and Dah Ming Chiu 2 Abstract. Peer-to-peer (P2P)

More information

MULTI-BUFFER BASED CONGESTION CONTROL FOR MULTICAST STREAMING OF SCALABLE VIDEO

MULTI-BUFFER BASED CONGESTION CONTROL FOR MULTICAST STREAMING OF SCALABLE VIDEO MULTI-BUFFER BASED CONGESTION CONTROL FOR MULTICAST STREAMING OF SCALABLE VIDEO Chenghao Liu 1, Imed Bouazizi 2 and Moncef Gabbouj 1 1 Department of Signal Processing, Tampere University of Technology,

More information

An Efficient Bandwidth Estimation Schemes used in Wireless Mesh Networks

An Efficient Bandwidth Estimation Schemes used in Wireless Mesh Networks An Efficient Bandwidth Estimation Schemes used in Wireless Mesh Networks First Author A.Sandeep Kumar Narasaraopeta Engineering College, Andhra Pradesh, India. Second Author Dr S.N.Tirumala Rao (Ph.d)

More information

CSEP 521 Applied Algorithms Spring Lossy Image Compression

CSEP 521 Applied Algorithms Spring Lossy Image Compression CSEP 521 Applied Algorithms Spring 2005 Lossy Image Compression Lossy Image Compression Methods Scalar quantization (SQ). Vector quantization (VQ). DCT Compression JPEG Wavelet Compression SPIHT UWIC (University

More information

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation 2009 Third International Conference on Multimedia and Ubiquitous Engineering A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation Yuan Li, Ning Han, Chen Chen Department of Automation,

More information

Improving the quality of H.264 video transmission using the Intra-Frame FEC over IEEE e networks

Improving the quality of H.264 video transmission using the Intra-Frame FEC over IEEE e networks Improving the quality of H.264 video transmission using the Intra-Frame FEC over IEEE 802.11e networks Seung-Seok Kang 1,1, Yejin Sohn 1, and Eunji Moon 1 1Department of Computer Science, Seoul Women s

More information

Bit Allocation for Spatial Scalability in H.264/SVC

Bit Allocation for Spatial Scalability in H.264/SVC Bit Allocation for Spatial Scalability in H.264/SVC Jiaying Liu 1, Yongjin Cho 2, Zongming Guo 3, C.-C. Jay Kuo 4 Institute of Computer Science and Technology, Peking University, Beijing, P.R. China 100871

More information

A MULTIPOINT VIDEOCONFERENCE RECEIVER BASED ON MPEG-4 OBJECT VIDEO. Chih-Kai Chien, Chen-Yu Tsai, and David W. Lin

A MULTIPOINT VIDEOCONFERENCE RECEIVER BASED ON MPEG-4 OBJECT VIDEO. Chih-Kai Chien, Chen-Yu Tsai, and David W. Lin A MULTIPOINT VIDEOCONFERENCE RECEIVER BASED ON MPEG-4 OBJECT VIDEO Chih-Kai Chien, Chen-Yu Tsai, and David W. Lin Dept. of Electronics Engineering and Center for Telecommunications Research National Chiao

More information

Rate Distortion Optimization in Video Compression

Rate Distortion Optimization in Video Compression Rate Distortion Optimization in Video Compression Xue Tu Dept. of Electrical and Computer Engineering State University of New York at Stony Brook 1. Introduction From Shannon s classic rate distortion

More information

Rate-Distortion Optimized Layered Coding with Unequal Error Protection for Robust Internet Video

Rate-Distortion Optimized Layered Coding with Unequal Error Protection for Robust Internet Video IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 11, NO. 3, MARCH 2001 357 Rate-Distortion Optimized Layered Coding with Unequal Error Protection for Robust Internet Video Michael Gallant,

More information

Multi-path Forward Error Correction Control Scheme with Path Interleaving

Multi-path Forward Error Correction Control Scheme with Path Interleaving Multi-path Forward Error Correction Control Scheme with Path Interleaving Ming-Fong Tsai, Chun-Yi Kuo, Chun-Nan Kuo and Ce-Kuen Shieh Department of Electrical Engineering, National Cheng Kung University,

More information

Context based optimal shape coding

Context based optimal shape coding IEEE Signal Processing Society 1999 Workshop on Multimedia Signal Processing September 13-15, 1999, Copenhagen, Denmark Electronic Proceedings 1999 IEEE Context based optimal shape coding Gerry Melnikov,

More information

Error Resilient Image Transmission over Wireless Fading Channels

Error Resilient Image Transmission over Wireless Fading Channels Error Resilient Image Transmission over Wireless Fading Channels M Padmaja [1] M Kondaiah [2] K Sri Rama Krishna [3] [1] Assistant Professor, Dept of ECE, V R Siddhartha Engineering College, Vijayawada

More information

2014 Summer School on MPEG/VCEG Video. Video Coding Concept

2014 Summer School on MPEG/VCEG Video. Video Coding Concept 2014 Summer School on MPEG/VCEG Video 1 Video Coding Concept Outline 2 Introduction Capture and representation of digital video Fundamentals of video coding Summary Outline 3 Introduction Capture and representation

More information

SCALABLE HYBRID VIDEO CODERS WITH DOUBLE MOTION COMPENSATION

SCALABLE HYBRID VIDEO CODERS WITH DOUBLE MOTION COMPENSATION SCALABLE HYBRID VIDEO CODERS WITH DOUBLE MOTION COMPENSATION Marek Domański, Łukasz Błaszak, Sławomir Maćkowiak, Adam Łuczak Poznań University of Technology, Institute of Electronics and Telecommunications,

More information

Multi-path Transport of FGS Video

Multi-path Transport of FGS Video MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Multi-path Transport of FGS Video Jian Zhou, Huai-Rong Shao, Chia Shen and Ming-Ting Sun TR2003-10 March 2003 Abstract Fine-Granularity-Scalability

More information

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation Chapter 10 Basic Video Compression Techniques 10.1 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000 Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000 EE5359 Multimedia Processing Project Proposal Spring 2013 The University of Texas at Arlington Department of Electrical

More information