WCAM IST

DELIVERABLE D2.1
STATE-OF-THE-ART / MULTIMEDIA AND VIDEO SURVEILLANCE CONVERGENCE

Contractual Date of Delivery to the CEC: 1st of May 2004
Actual Date of Delivery to the CEC: 30th of April 2004
Author(s): J. Meessen, C. Parisot, J-F. Delaigle, D. Nicholson, F. Dufaux, X. Duthu, A. Nix, C. Serrao, C. Lebarz, D. Bull, F-O. Devaux, Y. Sadourny, A. Doufexi, D. Agrafiotis, C. Tricot.
Participant(s):
Workpackage: WP2 / System Specifications
Est. person months:
Security: PU
Nature: R
Version: 9
Total number of pages: 154

Abstract: This document aims at analysing the state-of-the-art in terms of video surveillance and multimedia content distribution. The possibilities in terms of technology and system convergence are studied. Issues other than technical ones are also taken into account, especially the legal aspects and privacy issues.

Keyword list: State-of-the-art, video coding, wireless, security, surveillance, multimedia distribution

Table of Contents

1 PURPOSE OF THIS DOCUMENT
2 MOTIVATION
3 VIDEO CODING
   3.1 MOTION JPEG 2000
      JPEG 2000 overview
      MotionJPEG 2000
   3.2 H.264 / MPEG-4 AVC
      Overview
      Intra Prediction
      Inter Prediction
      Transform & Quantisation
      De-blocking filter
      Entropy coding
      Profiles
      Performance
      Error Resilience
      ROI Coding
   3.3 TRANSCODING
      Introduction
      Bit-rate reduction
      Spatial resolution reduction
      Temporal resolution reduction
      Error-resilience transcoding
      Transcoding from one standard to another
      Transcoding of encrypted bit streams
   3.4 SCALABLE VIDEO CODING
      Scalability in Motion JPEG 2000
      Scalability in MPEG-2, MPEG-4 and H.264
      Scalability in MPEG-21, Scalable Video Coding
4 WIRELESS TRANSMISSION
   WIRELESS TRANSMISSION STANDARDS
   IDENTIFICATION OF CURRENT WIRELESS STANDARDS
      WLANs at 2.4 GHz - IEEE 802.11
      Technical discussion of IEEE 802.11 and 802.11b (Wi-Fi)
      Main parameters of 802.11b
      Throughput analysis
      COFDM WLANs: 802.11a, 802.11g, HIPERLAN/2, HISWANA
      Quality of Service (QoS) support
      5 GHz radio regulations
      IEEE 802.11 MAC
      IEEE 802.11a MAC PDU
      IEEE 802.11g
5 SECURITY
   5.1 SECURITY OF THE CONTENT
      Content encryption techniques
      Content integrity techniques
      DRM - Digital Rights Management
         Content identification
         Content metadata
         Rights expression languages
         Users and devices authentication
         Existing DRM solutions
         IPMP - Intellectual Property Management and Protection
      Specific Motion-JPEG 2000 protection techniques
      A specific MPEG / DCT-based encryption technique
   5.2 SECURITY OF DELIVERY
      Authentication techniques
      Secure-RTP
      IPSEC
      Wi-Fi security and its evolution
6 STATE-OF-THE-ART FOR VIDEO SURVEILLANCE
   6.1 VIDEO SURVEILLANCE SYSTEMS
      Introduction
      Analogue systems
      Digital systems
      CCTV digital parts
      Video surveillance over wireless network
      Security issues
      Last evolution: the video analysis
   6.2 SEGMENTATION AND TRACKING FOR SURVEILLANCE APPLICATIONS
      Extraction
      Detection
      Characterization
      Object tracking
   6.3 LEGAL ASPECTS AND PRIVACY
      European law related to video surveillance
      Forensic video decisions
7 STATE-OF-THE-ART FOR MULTIMEDIA CONTENT DISTRIBUTION
   7.1 WHAT IS STREAMING TECHNOLOGY?
      Video delivery via file download
      Video delivery via streaming
      Expressing video streaming as a sequence of constraints
   7.2 VIDEO COMPRESSION STANDARD IN MULTIMEDIA
   7.3 VIDEO FILE FORMAT STANDARDS
      MPEG formats
      RealPlayer
      QuickTime
      Microsoft's Windows media files
      DivX
      Streaming format vs file format
      Streaming MPEG4 vs MPEG
   7.4 STANDARD NETWORK PROTOCOLS FOR VIDEO DISTRIBUTION
      UDP
      TCP
      RTP: Real Time Protocol
      RTSP
   7.5 UNICAST & MULTICAST
      Unicast
      Multicast
   7.6 PROBLEMS OF VIDEO TRANSMISSION
      Key parameters to quality streaming video-over-IP
      Effects of jitter/delay and network inter-packet gap drift
      Effects of jitter/burst and network inter-packet gap drift
      Network packet loss
   7.7 CLIENT
      Viewing over a set-top box
      Viewing over PC, PDA and laptop
   7.8 PROTECTION AND ANTI-PIRACY SYSTEM
      The law: Internet
      Type of piracy with video
      Why video content protection?
      How to protect video content?
   7.9 TYPICAL APPLICATION AND TOOLS
8 SURVEILLANCE AND MULTIMEDIA DISTRIBUTION CONVERGENCE
   8.1 THE INTEREST FOR CONVERGENCE
   8.2 TRAFFIC MANAGEMENT ISSUES
9 CONCLUSIONS
TABLE OF FIGURES
TABLES LIST
REFERENCES

1 Purpose of this document

This document aims at analysing the state-of-the-art in terms of video surveillance and multimedia content distribution. The possibilities in terms of technology and system convergence are studied. Issues other than technical ones are also taken into account, especially the legal aspects and privacy issues.

The goal of this state-of-the-art deliverable is not to detail every issue and solution related to multimedia distribution and video surveillance, which would be a waste of time and paper. The reader will rather find all the necessary references to understand the issues tackled within the WCAM project.

The document is organized as follows. After motivating the interest of video surveillance in section 2, section 3 presents the state-of-the-art related to video coding technologies (i.e. Motion JPEG 2000, MPEG-4 AVC/H.264, transcoding and scalable video coding), and today's wireless transmission standards are overviewed in section 4. The security issues of both the content and the delivery are described in section 5. Sections 6 and 7 deal with the state-of-the-art for video surveillance and multimedia distribution respectively, including legal aspects, before their convergence is analysed in section 8 and the document is concluded in section 9. Relevant references are listed at the end of this document.

2 Motivation

In countries such as the UK, crime costs the economy some £50 billion a year and the security of individuals and their property remains a cause of major public concern. The installation of surveillance cameras in commercial premises and high streets, often carried out in partnership with local community schemes, local authorities, police forces and local business, has assisted in detecting crimes such as personal attacks, theft and drug dealing. They also provide benefits in related areas such as public safety, alarm verification and number plate recognition. Recent international terrorist events have also clearly demonstrated the need for video systems capable of maximising the use of intelligence from both real-time acquisition and post-event analysis.

The UK HMIC (Her Majesty's Inspectorate of Constabularies) report on the evaluation of Special Branches recently highlighted the need for a national network for high bandwidth information exchange with interfaces to both mobile and rapidly deployable units. In particular, the capture and secure transmission of high quality multimedia information from multiple sources has been acknowledged as being of paramount importance. The need for a seamless communications infrastructure to facilitate the capture and exchange of secure, reliable and high quality information, whether in public spaces, private premises or to police officers on the move or at the site of a deployed unit, is also key.

Surveillance systems are currently employed in a wide range of crime and safety-related applications. These include:
- Personnel screening, luggage, freight and vehicle scanning, explosives detection,
- Building security, access control, intruder identification by personal signature,
- Public space surveillance for crime detection,
- Person detection/biometrics, visual features, gait, crowd and behavioural analysis,
- Vehicle security, traffic analysis, identification and tracking,
- Anti-terrorism.

However, concerns still remain over the widespread adoption of this technology. In this context, the questions to be asked for any CCTV or surveillance system are:
- Will the individual image quality be adequate (transmission errors, coding artefacts)?
- What limits resolution: is it the camera, the coding or the recorder?
- Is the frame rate high enough to ensure all activities are adequately captured?
- How might the evidence be tampered with, and how can this be prevented and detected?
- Is the transmission secure?
- How does the system perform in low light conditions?
- How is the system affected by viewing variations between daylight and infrared illumination?
- Is the coverage adequate (e.g. camera positions, radio coverage)?
- What degree of back-up storage is available and what is the cost of storage?
- Is the capacity of the system limited: how many cameras can be connected?
- Is the communications infrastructure of sufficient bandwidth and reliability, wired or wireless?
- What image analysis features are available and desirable: tracking, activity detection, threat assessment?
- Is the control environment and user interface appropriate to the task?
- Is the coding and storage technology compatible with existing standards: is it usable by the police as evidence?

In the past decade we have seen the growth of the internet and of mobile communications. This success has been due to the rapid development and take-up of digital technology allowing low-cost deployment at reasonable cost to the user. Digital connectivity, coding and storage is rapidly overtaking more established analogue systems in the professional and consumer markets but has made slower penetration in the CCTV and surveillance markets. The benefits include:
- Cost: digital cameras and data networks are well established in professional and domestic applications. Increased investment and take-up will drive down costs,
- Reproducibility: digital information can be copied without distortion. Also, tape wear is generally less of an issue with digital storage due to the robustness of the encoding method coupled with advanced error protection techniques,
- Flexibility: the potential for adaptation to channel variations in terms of bandwidth (compression) and error performance,
- Post-processing of images and video is possible to facilitate alarm generation and to improve recognition and detection,
- Standards: the importance of compatibility for multi-sourcing and media interchange, in particular for evidential purposes.

On the negative side, the issues that have prevented rapid take-up in the CCTV market have included a lack of standards, access to ready-deployed digital communications infrastructure, and historical storage and capture costs. In addition, the issue of compatibility between digital storage technologies (e.g. download from integrated hard disk systems) can be a serious issue for police when acquiring evidence.

Electronic surveillance techniques have been classified as belonging to three generations [i]. First generation techniques ( ) employed basic analogue video cameras and transmission techniques, connecting to a control centre for viewing on an array of monitors. Problems arose because of the attention span of operators and the frequent occurrence of missed events, especially in large systems. From around 1980, improvements and cost reductions in sensor, computing and communications technology led to the emergence of more sophisticated processing for event detection, alarm generation, fault detection, illumination compensation and tracking of objects in scenes. Digital transmission techniques also emerged during this so-called second generation, and coding standards such as JPEG (still image) and H.261 (video) emerged to facilitate basic levels of digital storage and transmission using ISDN lines and solid state and tape storage. These techniques allowed increased amounts of data to be assimilated due to attention focussing techniques. From around 2000, a third generation of systems began to emerge, exploiting new coding standards, broadband communications and open protocols, and offering enhanced processing and intelligent sensors [ii]. Using internet and wireless access methods, monitoring is now possible from remote sites. It is also possible to fuse information from different types of sensor in order to offer improved performance. These issues are addressed further in the following sections.

Commercially available CCTV systems are clearly becoming increasingly sophisticated and methods are progressively migrating from the research laboratory into deployed systems [iii]. Many now offer specifications similar to those in Table 1.

Table 1: Typical specifications of existing CCTV systems
- Frame rate: 1-25 fps
- Sensor / Resolution: 1/3-1/2 in CCD; resolution given in lines
- Bandwidth / codec: 8 Kb/s-2 Mb/s; MJPEG, H.261, (H.263)
- Analysis: motion detection, motion trigger, time lapse, illumination compensation, peak white inversion, tracking and co-operative working, (object recognition (face, numberplate))
- Connectivity: analogue, ISDN, ethernet, ATM, IP cameras
- Capacity: limited
- Authentication: hashing

The challenge of third generation systems is therefore to provide high quality data acquisition, efficient and robust coding, high bandwidth and secure transmission, efficient storage with ease of access, and sophisticated processing to allow flexibility of control, adaptation and enhanced event analysis. In particular these systems should offer:
- Reduction of reaction time for alert generation, and information-assisted decision,
- Easy deployment of sensors without large infrastructures, taking advantage of wireless network technologies, with security features for source authentication and content access protection,
- Adaptation to the network conditions, which implies the possibility of scalability for QoS management,
- Extraction of metadata from images and video for reporting, indexing and search purposes.

The following sections highlight some existing and emerging technologies which will contribute to these goals.

3 Video coding

3.1 Motion JPEG 2000

JPEG 2000 is the new ISO image compression standard (JPEG 2000 Part 1 Core coding system, ISO/IEC 15444-1 / ITU-T T.800) [159][162][163]. Part 3 of JPEG 2000, namely MotionJPEG 2000 (MJP2), ISO/IEC 15444-3 / ITU-T T.802 [158], targets the intra-frame coding of video, in direct replacement of MJPEG, while providing all the scalability and Region of Interest features of JPEG 2000. MJP2 is less efficient than other video coding schemes like MPEG but nonetheless holds an interest for WCAM:
- To replace MJPEG and proprietary frame-based video coding schemes used in surveillance applications.
- Where low delay, high resolution (size and bit-depth), robustness to transmission errors, frame-to-frame independence, partial access to content and coding features such as Regions of Interest are demanded.
- The JPEG 2000 codec may also be used for capturing high resolution still images.
- JPEG 2000 provides Region of Interest coding capabilities, as well as fine grain scalability, which is very useful for adapting the video encoding to the network bandwidth capability.
- New JPEG 2000 parts can provide additional features to Part 1 that can be applied to Motion JPEG 2000: Part 8 (JPEG 2000 Security: JPSEC) provides hooks to security mechanisms such as encryption, Part 9 (JPEG 2000 Interactive Protocol: JPIP) allows interactive access in a client-server architecture to JPEG 2000 images, and Part 11 (Wireless JPEG 2000: JPWL) enhances the basic error resilience features of JPEG 2000 Part 1.

3.1.1 JPEG 2000 overview

The first part of the JPEG 2000 standard (Part 1) [159] includes the core technology, while Part 2 proposes additional tools, such as a powerful image format including a new Region of Interest definition, new wavelet transforms, and trellis coded quantization. Effort has been made to make this new standard suitable for current and future applications by providing features unavailable in previous standards. JPEG 2000 is the only standard whose single algorithm covers all needs in digital imagery, from lossy to lossless compression (the latter being mandatory in medical applications).

A key feature of JPEG 2000 is the flexible bit stream representation of the images, suited for transmission of a wide range of formats in heterogeneous environments. It also allows access to different representations of the images using its scalability features (resolution, quality, position and image component) and the Region of Interest feature, taken into account at the encoder or decoder level. This allows an application to manipulate or transmit only the essential information for any target device from any JPEG 2000 compressed source image. The data flow can then be adapted to the user terminal capability, offering a mechanism for interactive decoding. The requested part of the unique compressed file, corresponding to the user's terminal capability and/or Region of Interest selection, will then be filtered at the image server level and sent to the user.

The compression performance of JPEG 2000 exceeds that of (DCT-based) JPEG, as depicted in Figure 1, with the drawback of a much higher complexity.

Figure 1: JPEG 2000 (compression rate 64:1) compared to JPEG

The algorithm modules included in JPEG 2000 are the Discrete Wavelet Transform (the DWT is based upon two kernels, one being the well-known Daubechies 9/7 and the other one being an integer kernel (5/3) enabling reversible compression; the two kernels are implemented using the lifting method), the scalar quantization, and the entropy coding based upon the EBCOT algorithm (including a very efficient arithmetic encoder), which includes an efficient rate control mechanism.

Figure 2: JPEG 2000 coding algorithm

Image model

JPEG 2000 uses an image model allowing:
- From 1 to 38 bits per pixel per component, signed or unsigned
- Up to 16k components
- Relationships between components (size ratio and pixel position of components relative to the image size)
- Tiles of arbitrary size and location
- Image position in a reference grid (allowing image processing like cropping without the need of re-compression)

In addition to these features, component transforms are also defined, as well as colour spaces.

The image may be very large in comparison to the available processing resources in terms of memory. The encoding process allows dividing this large image into smaller blocks (rectangular regions) called tiles. The image partitioning is described through a reference grid, an offset from this grid (XOsiz and YOsiz) and a horizontal and vertical extent (Xsiz and Ysiz). The reference grid is shown in Figure 3.

Figure 3: The reference grid (the image data occupies an area of (Xsiz - XOsiz) x (Ysiz - YOsiz), offset by (XOsiz, YOsiz) from the grid origin (0,0))

The reference grid is partitioned into tiles. The tile size and offset are defined, on the reference grid, by the dimensional pairs (XTsiz, YTsiz) and (XTOsiz, YTOsiz). The top left corner of the first tile (the tiles being numbered in raster order) is offset from the top left corner of the reference grid by (XTOsiz, YTOsiz).
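As an illustration of this reference-grid arithmetic, the following short sketch (not part of the standard text; the function and variable names simply mirror the SIZ/tiling parameters above) lists the tiles a codestream would contain and their bounds on the reference grid:

```python
import math

def tile_grid(Xsiz, Ysiz, XOsiz, YOsiz, XTsiz, YTsiz, XTOsiz, YTOsiz):
    """List the tile bounds implied by the reference-grid parameters described above."""
    num_x = math.ceil((Xsiz - XTOsiz) / XTsiz)   # candidate tile columns
    num_y = math.ceil((Ysiz - YTOsiz) / YTsiz)   # candidate tile rows
    tiles = []
    for q in range(num_y):                       # raster order: left to right, top to bottom
        for p in range(num_x):
            # Tile bounds on the reference grid, clipped to the image area.
            tx0 = max(XTOsiz + p * XTsiz, XOsiz)
            ty0 = max(YTOsiz + q * YTsiz, YOsiz)
            tx1 = min(XTOsiz + (p + 1) * XTsiz, Xsiz)
            ty1 = min(YTOsiz + (q + 1) * YTsiz, Ysiz)
            if tx1 > tx0 and ty1 > ty0:          # keep only tiles that contain image data
                tiles.append((tx0, ty0, tx1, ty1))
    return tiles

# A 512x512 image anchored at the grid origin, tiled 256x256 -> 4 tiles of 256x256 samples.
print(tile_grid(512, 512, 0, 0, 256, 256, 0, 0))
```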

Figure 4: The reference grid partitioned into tiles (tiles of size XTsiz x YTsiz, numbered in raster order and anchored at (XTOsiz, YTOsiz))

A tile subdivision for each resolution level is possible through precincts, further described later in this document.

DC level shift

DC level shifting converts the unsigned values coming from the original image data to the proper range optimised for the encoder. If a sample of a component is represented through an unsigned number using N bits, it will be shifted down in order to have a dynamic range which is symmetrically distributed around zero. This is done by subtracting a constant value of 2^(N-1) from every sample.

Colour transforms

The component transformation allows, when necessary, changing the colour representation [167] of the image (RGB) into the JPEG 2000 colour representation (YCC). In fact there are two types of colour transformation, the reversible transform (RCT) and the irreversible transform (ICT), respectively for lossless and lossy coding.

The irreversible colour transform (ICT) is nothing else than the colour transform already used in JPEG, a classic RGB to YCC colour space transform. It is defined by:

Y  = 0.299 R + 0.587 G + 0.114 B
Cb = -0.16875 R - 0.33126 G + 0.5 B
Cr = 0.5 R - 0.41869 G - 0.08131 B

The RCT is a reversible integer-to-integer approximation of the ICT, which is used for reversible JPEG 2000 compression, given by:

Y  = \lfloor (R + 2G + B) / 4 \rfloor
Cb = B - G
Cr = R - G

Wavelet transform

The forward discrete wavelet transformation is applied to one tile component. To perform the forward discrete wavelet transformation (FDWT), JPEG 2000 uses a one-dimensional sub-band decomposition of a one-dimensional set of samples into low-pass coefficients, representing a downsampled low-resolution version of the original set, and high-pass coefficients, representing a downsampled residual version of the original set, needed to reconstruct the original set from the low-pass set.

Each tile component is transformed into a set of two-dimensional sub-bands, each sub-band representing the activity of the signal in various frequency bands, at various spatial resolutions. The number of levels of spatial resolution, N_L, is called the number of decomposition levels.

JPEG 2000 gives two possibilities for the discrete wavelet transformation, one reversible transformation and one irreversible transformation. The use of the reversible transform, the Le Gall 5/3 wavelet transform [164], is mandatory when the objective of the compression process is lossless compression. Nevertheless it is possible, using this transformation, to perform lossy compression based on power-of-two quantization. But when the objective is lossy compression, the irreversible transform gives better performance and should be used. This irreversible transform is based upon the well-known Daubechies 9/7 kernel [165].

The wavelet transform is first applied vertically and then horizontally, and is iteratively applied in this way to the resulting low-pass (LL) sub-band until the number of decomposition levels has been reached. This process is shown in Figure 5.

Figure 5: The wavelet transform application (successive decomposition into LL, HL, LH and HH sub-bands at each level)

Irreversible forward wavelet transform

When the irreversible transform is chosen, the one-dimensional wavelet filter may be applied in two ways which deliver the same results. On one hand, one can use a classical convolution operation to implement the filter; on the other hand, one can use the lifting implementation, which reduces the amount of arithmetic operations. These two implementations give the same results in terms of output values.

X(n) denotes the one-dimensional sequence of input samples and Y(n) denotes the one-dimensional sequence of interleaved sub-band samples, where the low-pass sub-band is identified with the even samples, Y(2n), while the high-pass sub-band is identified with the odd samples, Y(2n+1):

Y(2n)   = \sum_{k=-N}^{N} h(k) X(2n + k)
Y(2n+1) = \sum_{k=-N}^{N} g(k) X(2n + 1 + k)

where h(k) and g(k) denote the low-pass and high-pass kernels.

The lifting-based DWT implementation of the Daubechies 9/7 filter allows the complexity to be reduced. This implementation is based upon 4 lifting steps and 2 scaling steps:

Y(2n+1) = X(2n+1) + \alpha ( X(2n)   + X(2n+2) )
Y(2n)   = X(2n)   + \beta  ( Y(2n-1) + Y(2n+1) )
Y(2n+1) = Y(2n+1) + \gamma ( Y(2n)   + Y(2n+2) )
Y(2n)   = Y(2n)   + \delta ( Y(2n-1) + Y(2n+1) )
Y(2n+1) = (-K) Y(2n+1)
Y(2n)   = (1/K) Y(2n)

with \alpha = -1.586134342, \beta = -0.052980118, \gamma = 0.882911075, \delta = 0.443506852 and K = 1.230174105.

Reversible forward wavelet transform

When the reversible transform is chosen, the one-dimensional wavelet filter has to be implemented using the lifting algorithm. Reversible lifting-based filtering consists of a sequence of simple filtering operations for which, alternately, odd sample values of the signal are updated with a weighted sum of even sample values rounded to an integer value, and even sample values are updated with a weighted sum of odd sample values rounded to an integer value.

The odd coefficients of the output signal Y are computed first from the input signal X:

Y(2n+1) = X(2n+1) - \lfloor ( X(2n) + X(2n+2) ) / 2 \rfloor

The even coefficients of the output signal Y are computed from the even values of the input signal X and the odd coefficients of signal Y computed during the previous stage:

Y(2n) = X(2n) + \lfloor ( Y(2n-1) + Y(2n+1) + 2 ) / 4 \rfloor
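As a concrete illustration of the reversible 5/3 lifting steps above, the following sketch (an illustrative implementation, not the reference software) performs one decomposition level on a 1-D integer signal, using a simple whole-sample symmetric extension at the boundaries:

```python
def dwt53_forward_1d(x):
    """One level of the reversible Le Gall 5/3 lifting transform (illustrative sketch).

    Even output indices hold the low-pass sub-band, odd indices the high-pass
    sub-band, as in the interleaved signal Y above.
    """
    n = len(x)

    def ext(signal, i):            # whole-sample symmetric extension at the edges
        if i < 0:
            return signal[-i]
        if i >= n:
            return signal[2 * n - 2 - i]
        return signal[i]

    y = list(x)
    # Predict step: high-pass (odd) samples, with floor division as in the equation above.
    for i in range(1, n, 2):
        y[i] = x[i] - ((ext(x, i - 1) + ext(x, i + 1)) // 2)
    # Update step: low-pass (even) samples, using the freshly computed odd samples.
    for i in range(0, n, 2):
        y[i] = x[i] + ((ext(y, i - 1) + ext(y, i + 1) + 2) // 4)
    return y

# Example: a short ramp signal; the even samples track the local average,
# the odd samples are small integer residuals.
print(dwt53_forward_1d([10, 12, 14, 30, 31, 33, 40, 41]))
```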

Quantization

Quantization is the process by which the transform coefficients are reduced in precision. This allows transform coefficients to be represented with only the minimal precision required to obtain a certain level of image quality. This quantization is applied only in the case of the irreversible transform; otherwise, the quantization step is one.

Each of the transform coefficients Z_b(u,v) of the sub-band b is quantized to the value Q_b(u,v) according to the following equation:

Q_b(u,v) = \mathrm{sign}( Z_b(u,v) ) \lfloor |Z_b(u,v)| / \Delta_b \rfloor

The step size \Delta_b is represented with a total of two bytes, an 11-bit mantissa \mu_b and a 5-bit exponent \epsilon_b, as expressed in the following formula, where R_b is the number of bits representing the dynamic range of the sub-band b:

\Delta_b = 2^{R_b - \epsilon_b} ( 1 + \mu_b / 2^{11} )

Two modes of quantization are possible. The first one, the implicit mode (derived quantization), requires only the LL sub-band quantizer to be signalled, the remaining quantizers being derived from it, whereas the explicit mode (expounded quantization) requires the indication of the quantizer of each sub-band. In expounded quantization, exponent/mantissa pairs (\epsilon_b, \mu_b) for every sub-band are present in the JPEG 2000 codestream. In derived quantization, only the LL sub-band pair is transmitted and the other quantizers are calculated using the following equation, where N_L is the total number of decomposition levels and i is the number of wavelet decompositions needed to reach the sub-band b:

(\epsilon_b, \mu_b) = (\epsilon_0 - N_L + i, \mu_0)

The scalar quantization also allows the compression level to be controlled, sub-band per sub-band, in the same manner as the quality factor of JPEG. It allows a compression controlled by a quality factor to be implemented.

Entropy coding

The entropy coding [166] is made of two parts, applied after having divided the sub-bands into groups of data called code-blocks: the coefficient bit modelling and the arithmetic coding.

Division of sub-bands into code-blocks

Each sub-band is partitioned into code-blocks, each of these blocks being coded by the entropy coder. The code-blocks in a particular sub-band must have the same size. Their width and height are encoding parameters, which can be fixed by the user. The horizontal and vertical dimensions of the code-block must be a power of 2, with a minimum size of 4, and their product must not exceed 4096. Normally a block size of 64x64 is used as default.

Coefficient bit modelling

Each code-block must be coded by bit-planes, starting from the most significant one. Each coefficient bit in a bit-plane will be coded in one of the following three coding passes: significance, magnitude refinement or clean-up. For each bit coded in a given pass, a context will be generated. The context represents the probability estimation and depends on the coding pass, the neighbours' state information and the already encoded coefficient bits.

Figure 6: Code-block and bit-planes

Symbols and contexts generated by each coding pass are arithmetically encoded. In normal operation all the coding passes will be encoded to obtain a good coding efficiency. However, for applications where coding efficiency is less important than computational complexity, a lazy mode (bypass mode) is available where, for all but the first four bit-planes, only the clean-up passes are arithmetically encoded.

Figure 7: Code-block stripe scan order

Each code-block bit-plane is scanned in stripes of four rows, column by column within each stripe. Each bit-plane is scanned three times, corresponding to the three coding passes. Each coefficient of the bit-plane which is not yet significant but has a neighbour that is already significant, or that becomes significant in this bit-plane, is coded in the significance propagation pass. Each coefficient which was already significant in a previous bit-plane is coded in the refinement pass, whereas all other cases are coded in the clean-up pass. In each of the three passes, a local context is computed from the significance state and sign of the neighbouring bits. It is to be noted that in the clean-up pass a run-length stripe coding is used when all significance states of the stripe are zero. Context and bit are sent as input to the arithmetic encoder.
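To make the pass-selection rule concrete, the sketch below is a simplified illustration only: it classifies the bits of one bit-plane into the three passes and deliberately ignores the stripe scan order, sign coding, context formation and the run-length mode described above.

```python
def classify_passes(magnitudes, plane, significant):
    """Assign each coefficient of one bit-plane to a coding pass (simplified sketch).

    magnitudes : 2-D list of integer coefficient magnitudes
    plane      : bit-plane index being coded (0 = least significant bit)
    significant: 2-D list of booleans, True where the coefficient became
                 significant in a previous (more significant) bit-plane;
                 updated in place as coefficients become significant.
    """
    h, w = len(magnitudes), len(magnitudes[0])

    def has_significant_neighbour(y, x):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dy or dx) and 0 <= y + dy < h and 0 <= x + dx < w:
                    if significant[y + dy][x + dx]:
                        return True
        return False

    passes = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            bit = (magnitudes[y][x] >> plane) & 1
            if significant[y][x]:
                passes[y][x] = "refinement"        # already significant earlier
            elif has_significant_neighbour(y, x):
                passes[y][x] = "significance"      # significance propagation pass
                if bit:
                    significant[y][x] = True       # becomes significant in this plane
            else:
                passes[y][x] = "cleanup"           # everything else
                if bit:
                    significant[y][x] = True
    return passes

# Tiny 2x2 example, coding bit-plane 1 of the magnitudes [[2, 0], [1, 3]].
sig = [[False, False], [False, False]]
print(classify_passes([[2, 0], [1, 3]], 1, sig))
```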

For each code-block, the number of MSB planes for which all bits are 0 is signalled as the insignificant MSBs (IMSB). Since the significance state of all the coefficients in the first bit-plane to be coded is zero, only the clean-up pass is used for that bit-plane.

Arithmetic coding

Each time a coefficient bit is modelled, a binary value (decision) and a context are generated. If the corresponding pass must be arithmetically encoded, the decision and context pairs are processed together to produce compressed data. Depending on the termination mode of the arithmetic encoder, it will produce the compressed bit stream at the end of each coding pass or only once the entire code-block has been encoded. The arithmetic coder used in JPEG 2000 is called the MQ coder, which comes from the family of the IBM Q coder. One member of this family (the QM coder) was already used in JPEG in addition to Huffman coding. In an arithmetic coder, data are not coded as symbols, but as probability intervals.

Rate/quality allocation

The size of the compressed image may be controlled to obtain a precise amount of bits. This process is called rate control and may be managed by a rate-distortion allocation mechanism. This process tries to minimise the compression error while reaching the desired compression rate. Rate control can be achieved either by adjusting the quantization steps or by selecting the coding passes to be included in the codestream. The first method has the advantage of reducing the complexity of the entropy coding by minimising the number of bit-planes to be coded, whereas the second method requires the maximum number of bit-planes to be coded and therefore implies the maximum entropy coding complexity.

A simple way is to have default quantizers, modified by a scale factor similar to the quality factor of JPEG. This is a very simple way of managing the compression ratio, but may lead to results far from the targeted rate. Work has been done in [160] to link the quantization to an entropy model. This allows, for a targeted rate or a targeted quality expressed in MSE, the optimal quantizer to be computed. The drawback of allocation through quantization is that it allows only one quality layer to be generated in a single codestream.

In its informative sections, JPEG 2000 gives an example of a post-entropy-coding allocation algorithm. This algorithm builds a quality-scalable codestream in the MSE sense. The final codestream will be composed of quality layers where each layer is optimal for a given rate. The target rates for each layer are encoding parameters.

Syntax and data ordering

The codestream contains all the information concerning the original image (size, bit depth, number of components, etc.), all the information about the compression (number of resolutions, wavelet filter, quantization, etc.) and all the compressed data. The structure of a JPEG 2000 codestream is shown in Figure 8.

Figure 8: JPEG 2000 codestream syntax (Main Header, followed by Tile Part Headers and their packets, terminated by the EOC marker; each packet carries a packet header and the entropy-coded data of one quality layer)

A JPEG 2000 codestream always starts with the Main Header: the Start of Codestream (SOC) marker segment, followed by the SIZ marker segment, which indicates the image size and reference grid coordinates. The Main Header is a collection of marker segments. After the Main Header, one will find at least one Tile Part Header. At least one Tile Part Header per tile is present, and the tile information can be separated into different Tile Parts. Like the Main Header, the Tile Part Header is a collection of marker segments. Each Tile Part Header is followed by the compressed data belonging to this Tile Part, preceded by a Start Of Data (SOD) marker. The compressed data appears in packets, each preceded by a packet header (which can be relocated into the Main or Tile Part Header), in the order defined by the selected progression order. It is to be noted that the progression order can change along the codestream, and this change is indicated by the Progression Order Change (POC) marker segment.

Table 2 summarises the information found in the different codestream marker segments.

Table 2: Codestream markers

Information | Marker segment
Image area size or reference grid size (height and width) | SIZ
Tile size (height and width) | SIZ
Number of components | SIZ
Component precision | SIZ
Component mapping to the reference grid | SIZ
Tile index | SOT, TLM
Tile-part data length | SOT, TLM
Progression order | COD
Number of layers | COD
Multiple component transformation used | COD
Coding style | COD
Number of decomposition levels | COD, COC
Code-block size | COD, COC
Code-block style | COD, COC
Wavelet transformation | COD, COC
Precinct size | COD, COC
Region of interest shift | RGN
No quantization | QCD, QCC
Quantization derived | QCD, QCC
Quantization expounded | QCD, QCC
Progression starting point | POC
Progression ending point | POC
Progression order default | POC
Error resilience | SOP
End of packet header | EPH
Packet headers | PPM, PPT
Packet lengths | PLM, PLT
Component registration | CRG
Optional information | COM

Data progression order

JPEG 2000 allows the compressed data packets to be organised along four axes. Packets are always ordered from the lowest resolution level to the highest, but the four possible axes in the progression order of the packets allow the data to be made progressive in quality (layers), component, position (precincts) and, of course, resolution.

1. Precincts

Each resolution level of a tile of a component is partitioned into rectangular regions called precincts, which allow direct access to a particular region of the tile and component. The precinct size can vary from resolution to resolution, but is restricted to be a power of two, and precinct boundaries coincide with code-block boundaries. Precincts, like code-blocks, are anchored at the image reference (X0, Y0). Each of the precincts of the same layer, resolution and component will be included in separate packets.

Figure 9: Precincts (partitioning of each resolution level into precincts P0, P1, P2, ...; shown for the level-2 sub-bands LL2, HL2, LH2, HH2 and the level-1 sub-bands HL1, LH1, HH1)

2. Layers

For each code-block, a number of consecutive coding passes is grouped into a layer, which in fact represents a quality increment. The number of coding passes included in a specific layer can vary from one code-block to another and is left to the rate allocation algorithm. Each of the layers of the same precinct, resolution and component will be included in separate packets.

3. Resolution level

A resolution level corresponds to one step in the image resolution reconstruction. The lowest resolution level (the LL sub-band) includes JPEG 2000 data packets only for this sub-band, whereas the other resolution levels include JPEG 2000 data packets of the three high-frequency sub-bands (HL, LH and HH). Each of the resolution levels of the same precinct, layer and component will be included in separate packets.

4. Component

The component axis corresponds to the JPEG 2000 data packets of the different image components. Each of the components of the same precinct, layer and resolution will be included in separate packets.

Five progression orders are possible in JPEG 2000 (a short sketch of how these orders translate into packet sequences is given at the end of this subsection):
- LRCP: Layer, Resolution level, Component, Position
- RLCP: Resolution level, Layer, Component, Position
- RPCL: Resolution level, Position, Component, Layer
- PCRL: Position, Component, Resolution level, Layer
- CPRL: Component, Position, Resolution level, Layer

In addition to this packet progression order, let us note that packets belonging to the same tile appear in the corresponding tile part, but it is possible to separate them into several tile parts. Tile parts can appear in any order (except the first tile part of a tile, which remains the first). This mechanism is called out-of-order tile parts. It allows the progression order to be propagated through the image and not only inside the tiles.
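As an illustration of the progression orders listed above, the following sketch (purely illustrative; real codestreams additionally interleave precincts and tile parts as described in the text) enumerates packet identifiers for a given progression string:

```python
from itertools import product

def packet_order(progression, n_layers, n_resolutions, n_components, n_precincts):
    """Enumerate (layer, resolution, component, precinct) tuples in the order
    implied by a progression string such as 'LRCP' (illustrative sketch only)."""
    ranges = {
        "L": range(n_layers),
        "R": range(n_resolutions),
        "C": range(n_components),
        "P": range(n_precincts),
    }
    for combo in product(*(ranges[axis] for axis in progression)):
        # Re-label the tuple so it always reads (L, R, C, P) regardless of the order.
        packet = dict(zip(progression, combo))
        yield (packet["L"], packet["R"], packet["C"], packet["P"])

# Two layers, two resolution levels, one component, one precinct:
print(list(packet_order("LRCP", 2, 2, 1, 1)))   # all resolutions of layer 0, then layer 1
print(list(packet_order("RLCP", 2, 2, 1, 1)))   # all layers of resolution 0, then resolution 1
```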

Region of Interest

JPEG 2000 allows the coding of regions of interest [168][175], which means coding, in a single codestream, part of an image (called the foreground) with a better quality than the remaining part (called the background). The method used by JPEG 2000 is called the Max-shift method. No Region of Interest shape is encoded, and no restriction is applied to the number of regions. After quantization, the wavelet coefficients which belong to the foreground are simply shifted into a dynamic range where they are above the maximum dynamic range of the coefficients belonging to the background. During the rate allocation process, because of their increased dynamic range, they will be considered as more fundamental for the image quality than the background and therefore prioritised. The decoder is made aware of the presence of Region(s) of Interest through the RGN marker segment, which also includes the value of the shift which has been applied. Every coefficient detected to be above the maximum dynamic range of the background is declared to be part of the Region of Interest and is therefore shifted back down.

Figure 10: Max-shift method (coefficients included in the ROI are shifted above the dynamic range of the background)

Error resilience tools

Some error resilience tools have been defined in JPEG 2000 Part 1 which allow the decoding process to be resynchronised in case of packet loss. Other tools (segmentation symbols and predictable terminations) allow the presence of errors to be detected at the entropy decoding level. Depending on the location of the missing information, the impact on image quality can range from very small to very annoying, up to no image decoding at all, the most error-sensitive parts of the codestream being the headers (Main Header, Tile Part Headers and packet headers).

JPEG 2000 Part 1 file format

In addition to the compressed data, called the codestream, JPEG 2000 has defined a file format. In fact, in JPEG 2000 Part 2 a more complex file format (JPX) is defined as an extension to the basic file format of Part 1, called JP2. Another part of the JPEG 2000 standard (Part 6, JPM) defines an extended file format which allows different types of content, like graphics and text, to be mixed together with JPEG 2000. All of the file formats are members of this family (including the MotionJPEG 2000 file format) and share the same architecture, which was derived from Apple's QuickTime file format.

The major aim of the JP2 file format, in addition to conveying the JPEG 2000 codestream, is to provide information about the colour space [167] used and to include metadata. These metadata can of course be of different types, from camera-related information to copyright information. They are compliant with the DIG35 specifications. The JP2 file format is used as the representation of a JPEG 2000 compressed image in the MotionJPEG 2000 file format.

JPEG 2000 Parts 4 and 5: Compliance testing and Reference software

JPEG 2000 has defined compliance tests and reference software for JPEG 2000 Part 1. These tools are available through the JPEG web site. JPEG 2000 Part 1 defines Profiles (3 profiles) whereas Part 4 defines Compliance Classes (3 classes). The combination of the two is called a compliance point. Profiles correspond to a restriction in the use of JPEG 2000 Part 1 encoding tools, whereas Classes correspond to a limitation in the obligation of decoding a codestream, in terms of quality and parts of the codestream to be decoded. Compliance codestreams, including file formats, are available with the associated metrics (peak error and MSE) the decoder has to comply with. For more information please refer to ISO/IEC 15444-4 / ITU-T T.803 and ISO/IEC 15444-5 / ITU-T T.804.

JPEG 2000 Part 8: JPEG 2000 Security JPSEC

This part of JPEG 2000 provides hooks to security tools, such as encryption [173], integrity and authentication. The definition of these security tools is out of the scope of JPSEC, but the JPSEC syntax enables their use, describing in the JPEG 2000 headers which tools are used and their zone of application in the compressed data. More details will be given later in this document. At the time of the writing of this document, JPSEC has reached the Committee Draft stage.

JPEG 2000 Part 9: Interactivity tools, API and Protocols "JPIP"

Part 9 of the JPEG 2000 standard refers to the JPEG 2000 Interactive Protocol, "JPIP", and interactive applications using JPIP [169][171][113][111]. JPIP specifies a protocol consisting of a structured series of interactions between a client and a server by means of which image file metadata, structure and partial or whole image codestreams may be exchanged in a communication-efficient manner. This International Standard includes definitions of the semantics and values to be exchanged, and suggests how these may be passed using a variety of existing network transports. With JPIP, the following tasks may be accomplished in varying, compatible ways:
- The exchange and negotiation of capability information.
- The request and transfer of data from a variety of containers, such as JPEG 2000 family files, JPEG 2000 codestreams and other container files.
- Selective data segments.
- Selective and defined structures.
- Parts of an image or its related metadata.
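To give a feel for the kind of client-server exchange JPIP standardises, the sketch below issues an HTTP request for a reduced-resolution window of an image. It is purely illustrative: the server URL is hypothetical, and the query field names (target, fsiz, roff, rsiz, layers) are quoted from memory of ISO/IEC 15444-9 and should be checked against the standard before use.

```python
import urllib.request
from urllib.parse import urlencode

# Hypothetical JPIP-over-HTTP request: ask a server for the codestream data needed
# to display a 256x256 window of 'camera1.jp2' at reduced resolution, first 3 layers.
params = {
    "target": "camera1.jp2",   # image on the (hypothetical) server
    "fsiz": "640,480",         # requested frame size, i.e. desired resolution
    "roff": "128,96",          # offset of the window of interest
    "rsiz": "256,256",         # size of the window of interest
    "layers": "3",             # number of quality layers wanted
}
# Field names as recalled from ISO/IEC 15444-9; verify against the standard.
url = "http://surveillance-server.example/jpip?" + urlencode(params)

# The response would be a stream of JPIP data-bin messages that the client feeds
# into its JPEG 2000 decoder; here we only fetch the raw bytes.
with urllib.request.urlopen(url) as resp:
    data = resp.read()
print(len(data), "bytes received")
```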

Further, JPIP provides the capability for 'fallback', such that the protocol can deliver similar results using differing levels of awareness of JPEG 2000 file and codestream structures at the client and the server. JPIP can be used over a variety of networks and communication media having different characteristics and quality of service. It can use a number of methods to communicate between client and server, based on existing protocols and network transports, which this International Standard extends to provide additional JPEG 2000 related functionality. Information which is user or session related can also be exchanged. Its use can be tailored via the various extensions to the JPEG 2000 file format, as defined in ISO/IEC 15444; however these are not mandated to achieve a simple level of interactivity that allows portions of a single JPEG 2000 file or codestream to be transferred.

Although the terms 'client' and 'server' are used in this International Standard to refer to the image receiving and delivering applications respectively, it is intended that JPIP can be used within both hierarchical and peer-to-peer networks, for data delivery in either direction, and for machine-to-machine as well as user-to-machine or user-to-user applications. It is also intended for use as an adjunct to an alternative, more comprehensive protocol for image delivery, and for the delivery of non-JPEG 2000 coded information. While some features of JPIP may be applied to codestreams that are not JPEG 2000 compliant, such use is not mandated or required by JPIP systems conforming to this International Standard.

The use of JPIP in an Internet or intranet environment is addressed. JPIP is used on top of the TCP/IP and HTTP protocols; only an informative section of the JPIP standard addresses the UDP protocol. The JPIP standard only considers one client interacting with its server, so the collaborative aspect of JPIP is restricted to the server's ability to deal with multiple clients. JPIP also considers the possibility for a user to upload modified image tiles from the client to the server. When using the TCP protocol, packets are re-sent until acknowledgement of their reception by the client. When the transmission is done in an error-prone environment, the server may send the same packet many times until it is received without error, which can significantly decrease the available bandwidth. When using UDP, packets can be definitively lost, and it is up to the application to handle errors.

At the time of the writing of this document, JPIP has reached the Final Draft International Standard stage.

JPEG 2000 Part 11: JPEG 2000 Wireless JPWL

Wireless JPEG 2000 (JPWL), i.e. JPEG 2000 Part 11, provides tools for enhancing the robustness of JPEG 2000 to errors. It extends the basic JPEG 2000 Part 1 error resilience tools by protecting headers [172] and offering tools for enabling Unequal Error Protection techniques [161][170]. Like JPSEC, JPWL provides an extension of the JPEG 2000 Part 1 syntax in order to describe the tools which are in use for protecting the JPEG 2000 codestream, while including a basic error protection technique. These tools are the following:
- Error Protection Capacity descriptor (EPC)
- Error Protection Block (EPB)
- Error Sensitivity Descriptor (ESD)
- Residual Error Descriptor (RED)

It has to be noted that the ESD marker segment, including quality metrics for different parts of the codestream, can be used for other purposes than JPEG 2000 Part 11, for example for helping codestream rate/quality transcoding.

At the time of the writing of this document, JPWL is planned to reach the Committee Draft stage in July.

3.1.2 MotionJPEG 2000

In fact, two parts of the JPEG 2000 standard are related to MotionJPEG 2000 [158]: first of all Part 3, namely MotionJPEG 2000, ISO/IEC 15444-3 / ITU-T T.802, and Part 12, the ISO Media File Format, which is a common text with MPEG (ISO/IEC 15444-12 and ISO/IEC 14496-12). To implement Motion JPEG 2000, both the ISO/IEC 15444-3 and ISO/IEC 15444-12 documents will be needed.

The standards describe the file format to be used by MotionJPEG 2000 for storing JPEG 2000 Part 1 codestreams together with audio. The Motion JPEG 2000 specification uses only audio and video media tracks. Motion JPEG 2000 video is stored in video tracks, as documented in the ISO Base Media File Format. The structure of an audio track is like that of a video track. Support for uncompressed (raw) audio is defined.

MotionJPEG 2000 has a current Amendment on compliance, which defines four compliance points from small size video (CIF) to Digital Cinema video size [174]. For each of the points, several MotionJPEG 2000 compliance files have to be decoded, and metrics in the decoding of individual JPEG 2000 codestreams have to be respected.

Unlike JPEG 2000 Part 1, no reference software exists for MotionJPEG 2000. For the development of the standard, the MotionJPEG 2000 AHG has used a Verification Model (latest version v2.1), still present in the document registry of the JPEG web site. This verification model allows raw .yuv video files to be encoded and generates a stream of concatenated JPEG 2000 codestreams; it does not consider the MotionJPEG 2000 file format. Another software package has been used in the group, which corresponds to the Apple implementation of the ISO Media File Format, available for JPEG/MPEG members on Apple's ftp site. Nevertheless, MotionJPEG 2000 file generation is implemented in the Kakadu software, which also allows concatenated JPEG 2000 files to be generated.

3.2 H.264 / MPEG-4 AVC

After finalising the H.263 standard in 1995, the ITU-T Video Coding Experts Group (VCEG) started working on two further development areas: a short-term effort to add extra features to H.263 (resulting in H.263+ and H.263++) and a long-term effort to develop a new standard for low bitrate visual communications. The long-term effort led to the draft H.26L standard. In 2001 the Joint Video Team (JVT) was formed, including experts from both MPEG and VCEG, in order to develop H.26L into a full international standard (currently the final committee draft is pending approval).

The outcome will be two identical standards: ISO/IEC MPEG-4 Part 10 and ITU-T H.264. The official title of the new standard is Advanced Video Coding (AVC); however it is also known as H.26L and H.264 [56].

The coding scheme defined by H.264 is very similar to that employed in prior video coding standards. It is a hybrid codec that makes use of translational block-based motion compensation followed by block transformation of the displaced frame difference (DFD), scalar quantization of the transform coefficients with an adjustable step size for bit rate control, zigzag scanning and finally run-length VLC coding of the quantized transform coefficients. However, H.264 modifies and enhances almost all of the above operational blocks, thus achieving significant performance improvements [57].

3.2.1 Overview

A general block diagram of the encoder is shown in Figure 4.2.1 [56][58][59]. The motion compensation model used in H.264 supports segmentation of macroblocks down to 4x4 sub-blocks, use of multiple reference pictures for prediction and quarter-pixel motion vector accuracy. The de-blocking filter specified in the standard is applied within the motion compensation loop, thus improving both quality and prediction. H.264 uses a 4x4 integer spatial transform instead of the floating-point 8x8 DCT specified in all previous standards. The small size reduces blocking and ringing artifacts while the integer specification eliminates any mismatch between the encoder and decoder in the inverse transform. Coding of intra macroblocks now includes a spatial prediction within the frame for de-correlation purposes. Finally there is a choice of two entropy coding methods (depending on the profile): Context Adaptive Variable Length Coding (CAVLC), which uses a single infinite-extent codeword set for all syntax elements, and the more complex Context-Adaptive Binary Arithmetic Coding (CABAC), which provides better coding efficiency.

Figure 4.2.1: AVC encoder (input video split into 16x16 macroblocks, intra-frame prediction or motion-compensated inter prediction, transform/scaling/quantisation, entropy coding of the quantised transform coefficients together with control and motion data, and an in-loop de-blocking filter)

3.2.2 Intra Prediction

When a block or macroblock is coded in intra mode, a prediction block is formed based on previously encoded and reconstructed blocks in the same frame. For the luminance (luma) samples, the intra prediction can be formed for each 4x4 sub-block or for a 16x16 macroblock. There are a total of 9 optional prediction modes for each 4x4 luma block, 4 optional modes for a 16x16 luma block and one mode that is always applied to each 4x4 chroma block. If 8x8 chroma prediction is chosen then there are 4 prediction modes (very similar to the luma 16x16 prediction modes). The encoder can select the prediction mode for each block that minimizes the residual between the prediction and the block to be encoded.

Figure 4.2.2: 4x4 luma intra prediction modes

3.2.3 Inter Prediction

As with previous standards, H.264 uses block-based motion compensation. However, the prediction is not restricted to the previous frame but can come from more than one previously encoded video frame (up to 16) [60]. Additionally, H.264 uses tree-structured motion compensation, which incorporates motion-compensated sub-blocks of varying size resulting from partitioning of macroblocks. The partitioning is done on the basis of the residual error as well as the motion vector and partition type information that is generated. More specifically, the luminance component of each macroblock (16x16 samples) may be split up in 4 ways, as shown in Figure 4.2.3a, into macroblock partitions of size 16x16, 16x8, 8x16 or 8x8. If the 8x8 mode is chosen, each of the four 8x8 macroblock partitions within the macroblock may be split in a further 4 ways, as shown in Figure 4.2.3b, into sub-partitions of size 8x8, 8x4, 4x8 or 4x4. A separate motion vector is required for each partition or sub-partition. It is the task of the encoder to find the optimum choice between large partition sizes (e.g. 16x16, 16x8, 8x16), requiring fewer bits for coding the motion vector(s), and small partition sizes (e.g. 8x4, 4x4, etc.), giving lower-energy DFD signals at the cost of a larger number of bits for the motion vectors. Chroma blocks are partitioned in the same way as the luma component, except that the partition sizes have exactly half the horizontal and vertical resolution.

Figure 4.2.3: Macroblock partitions (a) and sub-partitions (b) for motion compensation

3.2.4 Transform & Quantisation

The transform used in H.264 is an approximation of the 4x4 DCT (instead of an 8x8 DCT). The output of the employed transform, although not identical to the 4x4 DCT, has almost identical compression performance with the added advantage of employing integer arithmetic. This removes the risk of mismatch between encoder and decoder (DCT vs. IDCT precision), making drift problems less likely. All residual 4x4 blocks (luma or chroma, inter or intra) are transformed similarly, except for the 4x4 residual blocks belonging to intra-coded luma macroblocks that use a 16x16 prediction mode, for which the standard 4x4 transformation is followed by an integer Hadamard transformation of the 4x4 group of DC coefficients of all 4x4 transformed sub-blocks in the macroblock. The DC coefficients of the 4x4 chroma blocks of each macroblock also undergo a further integer transformation. In both cases the extra transformation takes place to further de-correlate the DC coefficients.

Quantisation in H.264 uses a scalar quantiser with a total of 52 quantiser step values (indexed by a quantisation parameter, QP). The quantiser step doubles in size for every increment of 6 in QP, while it increases by approximately 12.5% for each increment of 1.

3.2.5 De-blocking filter

A filter is applied to every decoded macroblock in order to reduce blocking distortion [61]. The decoded, filtered frame is stored in the buffer at both encoder and decoder (in-loop filter). The use of such a loop filter has two benefits: (1) block edges are smoothed, improving the appearance of decoded images (especially at higher compression ratios), and (2) the filtered macroblock is used for motion-compensated prediction of further frames in the encoder, resulting in a smaller residual after prediction. Filtering is applied to vertical or horizontal edges of 4x4 blocks in a macroblock, with the strength of the filter depending on the current quantiser, the coding modes of neighbouring blocks and the gradient of image samples across the boundary. Generally the filter is stronger at places where there is likely to be significant blocking distortion.

3.2.6 Entropy coding

H.264 specifies two types of entropy coding: Context-based Adaptive Binary Arithmetic Coding (CABAC) and Context Adaptive Variable-Length Coding (CAVLC). Above the slice layer, syntax elements are encoded as fixed or variable-length binary codes (Exp-Golomb codes are used). At the slice layer and below, elements are coded using either CAVLC or CABAC depending on the entropy coding mode.

CAVLC uses run-level coding to compactly represent strings of zeros typically encountered after transform and quantisation. It also signals the number of high-frequency +/-1 coefficients ("trailing 1s") in a compact way. Adaptation is used in two ways: for coding the number of non-zero coefficients, a look-up table is chosen based on the number of non-zero coefficients in neighbouring blocks; for coding the level (magnitude) of non-zero coefficients, the VLC look-up table for the level parameter is chosen based on recently coded level magnitudes.

CABAC can provide better coding performance at the cost of higher complexity. It is based on three elements: selection of probability models for each syntax element according to the element's context; adaptation of probability estimates based on local statistics; and use of arithmetic coding. At the beginning of each coded slice the context models are initialised depending on the initial value of the quantization parameter QP, since this has a significant effect on the probability of occurrence of the various data symbols.

3.2.7 Profiles

Three profiles have been specified for H.264 so far [60]. These are the following:
- Baseline profile. This includes all the basic coding tools plus some enhanced error resilience (ER) tools. Its target is mainly two-way video communication.
- Main profile. The main profile targets mainly broadcasting and DVD types of applications. It includes support for B pictures, fields and CABAC. None of the enhanced ER tools are supported, however.
- Extended profile. Otherwise known as the streaming profile. This is a superset of the baseline profile which, apart from support of B pictures and fields, also provides extra ER tools (data partitioning) and switching pictures.
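As an aside on the entropy coding tools mentioned in section 3.2.6 above, the following sketch implements the order-0 Exp-Golomb code used for fixed and variable-length syntax elements. It is an illustrative implementation only: the mapping from signed syntax elements to the non-negative code number, which H.264 also defines, is omitted.

```python
def exp_golomb_encode(code_num):
    """Order-0 unsigned Exp-Golomb codeword for a non-negative integer, as a bit string."""
    value = code_num + 1
    bits = bin(value)[2:]              # binary representation of code_num + 1
    prefix = "0" * (len(bits) - 1)     # leading zeros signal the length of the suffix
    return prefix + bits

def exp_golomb_decode(bitstring):
    """Decode a single order-0 Exp-Golomb codeword back to its code number."""
    zeros = 0
    while bitstring[zeros] == "0":
        zeros += 1
    value = int(bitstring[zeros:2 * zeros + 1], 2)
    return value - 1

# code_num 0..5 -> '1', '010', '011', '00100', '00101', '00110'
print([exp_golomb_encode(n) for n in range(6)])
print(exp_golomb_decode("00110"))      # -> 5
```

The codeword length grows only logarithmically with the value, which is why a single infinite-extent codeword set can serve all syntax elements.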

3.2.8 Performance

The performance of H.264 is superior to all previous video coding standards [57][62]. Compared to MPEG-2, H.264 can provide up to 70% improvement in compression performance (same quality at a lower bit rate). Compared to more recent coding standards like MPEG-4 and H.263, savings of at least 25% can be observed. It is thus clear that H.264 represents the state-of-the-art in video coding.

3.2.9 Error Resilience

Error resilience methods for video transmission can be categorised into three groups based on where in the system (transmitter/encoder or receiver/decoder) the error control operations take place [63][64]. At the transmitting side, error resilience mechanisms are introduced at the source and/or channel encoder in order to make the transmitted data more resilient to potential errors and/or to facilitate (better) error concealment at the decoder. At the receiving side, error concealment mechanisms are employed by the decoder after error detection in order to conceal the effect of the encountered errors as much as possible. Finally, the encoder can adapt or introduce error control mechanisms based on feedback from the decoder regarding any detected losses.

Error concealment at the decoder

Decoder error concealment is the process of recovering or estimating information lost due to transmission errors. To accomplish this recovery, concealment methods employ the correlation that exists between a damaged macroblock (MB) and its adjacent macroblocks in the same or previous frame(s). Three types of information may need to be estimated [63]: texture information (pixel or transform coefficients), motion information (motion vector(s), MVs, of P and B macroblocks) and coding mode (intra or inter). Motion vector recovery (temporal concealment) is normally used in P and B frames and can lead to successful concealment via motion-compensated temporal replacement of the missing pixels. Texture recovery (spatial concealment) relies on intra-frame information only and is usually applied to intra MBs. Choosing between the two may require estimation of a coding mode for the damaged MBs.

Coding mode recovery

The coding mode can be recovered by collecting statistics about the coding modes of adjacent MBs, thus finding the most likely mode for the damaged one [65]. Alternatively, the coding mode (and hence the concealment strategy) can be chosen adaptively based on the success or otherwise of temporal concealment (MV estimation), which is normally the mode that will give the best results. In [66] the criterion used for this choice is gradient boundary matching (boundary matching along detected edge directions), GBM. In the case of I frames, or generally when MVs for neighbouring MBs are not available, re-estimation of motion information has to take place for these MBs. Adaptive mode selection can lead to visually more pleasing results, especially for I frames, since intra-coding of frames/slices doesn't necessarily mean lack of a good temporal match; it can for example be the result of intra-refresh coding for error-protection purposes.
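A minimal sketch of the statistics-based mode recovery idea summarised above from [65] (purely illustrative; a real decoder would combine this with the adaptive criteria discussed in the text):

```python
from collections import Counter

def recover_coding_mode(neighbour_modes, default="inter"):
    """Pick the most frequent coding mode among the available neighbouring macroblocks.

    neighbour_modes: list of 'intra' / 'inter' labels for the correctly received
    neighbours of the damaged macroblock (may be empty).
    """
    if not neighbour_modes:
        return default                      # nothing to vote on: fall back to a default
    counts = Counter(neighbour_modes)
    return counts.most_common(1)[0][0]      # majority vote

# Damaged MB surrounded by three inter MBs and one intra MB -> concealed as inter.
print(recover_coding_mode(["inter", "intra", "inter", "inter"]))
```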

Temporal Concealment

The method used for the recovery of the missing pixels in a damaged macroblock greatly depends on the availability or not of the motion vector(s) belonging to this MB. If these are available (possible when some form of data partitioning is used in the encoder), motion compensated temporal prediction can be used, whereby a damaged MB is replaced with the corresponding motion compensated MB in a previously decoded frame, as indicated by the received MV(s). Concealment of damaged MBs in such a way leads to very good results since it is only the displaced frame difference (DFD) signal that is missing. When motion information is also damaged, estimation of the missing motion vector(s) can be attempted based on the availability of correctly received analogous information in surrounding macroblocks. Again, this implies that error resilient methods have been used at the encoder in the form of slice structuring and/or interleaving, which increase the probability that errors only affect disjoint macroblocks and not a whole frame. For MV estimation a number of possible methods can be found in the literature, including the following [64]: setting the MVs to zero (suitable for sequences with small motion), using the MVs of the corresponding block in the previous frame, using the MV of a neighbouring MB, using the average of the MVs of surrounding macroblocks, or using the median of the MVs of surrounding macroblocks. In [67] the selection of one of the above cases is done adaptively based on a smoothness criterion. The criterion used is the least boundary matching (BM) error (otherwise known as the side match distortion, SMD), which is defined as the sum of variations along the one-pixel boundary between the recovered MB and the surrounding ones. The MV that minimises the boundary matching error is the estimated solution for recovering the lost MB. This BM approach reportedly yields better results than any of the previous candidate methods on its own, and together with the median vector approach can lead to visually pleasing concealment [63].

The BM algorithm (BMA) is widely mentioned in the relevant literature and in many cases it serves as a measure for comparisons with other methods. In fact a form of it has been implemented in the H.264 joint model (JM) decoder (discussed in 2.2.3) [68]. A number of other methods build on the BM algorithm and try to improve it by means of different matching error measures, different MV candidate lists, multiple MV recovery, overlapping block motion compensation (OBMC) and even re-estimation/refinement of selected MVs. Such methods are presented below.

The method presented in [69] is named the refined boundary matching algorithm (RBMA). The refinement comes from the estimation of 4 MVs instead of 1 for each damaged MB (one for each 8x8 block), with a separate boundary for each sub-block involving 3 adjacent MBs each time (corner pixels are also considered). The error measure used to quantify the quality of the match is the overlapping boundary match error, which measures the distortion between the external one-pixel-wide boundary of the blocks adjacent to the damaged block and the same external boundary of the blocks adjacent to the candidate replacement reference block. Further refinement of the selected MV is done by means of re-estimation, i.e. a search based on selected candidate MVs which are shifted within some limits.
To avoid the extra computational cost of the re-estimation step, a temporal activity measure for the surrounding MBs is used which triggers the RBMA process when a certain threshold is exceeded (otherwise normal BMA is used). The same measure also determines the extent of the refinement search window. A measure for judging the reliability of the candidate MVs used as a starting point in the refinement process is also introduced, based on the variation of motion vectors in the neighbourhood of a candidate MV. Results presented in [69] show an average improvement of 1.5 dB over BMA. Re-estimation of missing motion vectors is also proposed in [70]. Motion estimation takes place for a band of pixels (4-8 pixels wide) directly above and to the left of the missing MB. The motion vector found to minimise the sum of absolute differences (SAD) for this band of pixels becomes the recovered MV of the damaged MB. This method is reported to outperform methods using the median or average of adjacent MVs.

However, the complexity of this approach can be higher, especially when more than one reference frame is used. Such is the case for the method described in [71], where motion estimation in 2 to 5 reference frames takes place for a 2-pixel-wide boundary above and below the damaged MB. The final concealment is done by (weighted) averaging of the prediction signals. The results presented in [71] suggest that concealment from two previously decoded frames gives better performance than concealment from the previous frame alone.

The concept of overlapped motion compensation is exploited by the methods described in [72][73]. Overlapped motion compensation for concealment implies the use of more than one prediction signal (and hence more than one estimated MV) for the recovery of damaged pixels. In [72] BMA is initially used to estimate one MV for a damaged macroblock, using as candidates the MVs of surrounding correctly received MBs. This is then followed by subdivision of the damaged MB into 4 sub-blocks (8x8), for each of which 3 prediction signals are formed using the previously calculated MV (via BMA), the MV of the horizontally neighbouring MB and the MV of the vertically neighbouring MB. The overlapping signals are blended (weighted averaging) according to 3 weighting matrices, one for each vector, with a higher weight being applied to the BMA motion vector (the matrices used for each 8x8 block are those suggested in H.263 for OBMC). In the results presented in [72] the suggested method seems to outperform a typical BMA by about 1-2 dB for block loss rates ranging from 1% to 10%.

In [73] a similar approach is followed, i.e. BMA coupled with OBMC, but with a number of alterations which lead to improved concealment results compared to the method of [72]. The alterations include a more complete list of candidate MVs used for BMA, usage of a weighted extended boundary match (WEBM) as the match function for selecting one MV from the list of candidates, 4 overlapped prediction signals (4 estimated MVs) for each 8x8 block of the damaged MB instead of 3, and a weighting matrix which assigns weights to the overlapped MVs more correctly than in [72] (boundary pixels are influenced more by adjacent blocks while central pixels are influenced more by the BMA vector). A raised cosine is used as the weighting matrix for an extended MB (32x32) in which the damaged MB occupies the central area. The overlapped boundary matching measure uses a wider than before pixel boundary, and the calculation of the error is done using weights from the raised cosine matrix. The presented results suggest a ~1 dB improvement over the OBMC method of [72].

The concept of BMA and OBMC is also followed in [74], coupled with motion field interpolation (MFI) [75]. MFI estimates motion parameters on a pixel (or block / sub-block) basis from the MVs of surrounding blocks. As described in [74], MFI linearly interpolates one MV per pixel from spatially adjacent blocks, with the weights being adjusted according to the spatial location of the pixel (e.g. a pixel close to the left border is influenced more by the left neighbouring MB). The MFI method on its own outperforms the BM approach. However, better performance is achieved when the MFI technique is combined with BMA in an OBMC fashion, as described again in [74]. Two MVs (hence two prediction signals) are estimated for each pixel, one with the BM method (MB basis) and one with the MFI method (pixel basis).
A linear combination of the two prediction signals, with the weights reflecting the spatial location of the pixel (the BM method is favoured in locations close to the borders while the MFI method is preferred for pixels close to the centre of the recovered MB), assigns a replacement pixel for each missing one in the currently recovered MB. Gains of around 0.5 dB are reported with the BM-MFI method compared to simple BM. An extension of BM-MFI to the multiple reference frame case is discussed in [76].

Spatial Concealment

Spatial concealment uses spatially adjacent MBs that have been received correctly, or have already been concealed, in order to interpolate the missing pixel values.

Because pixels in neighbouring MBs can be far away from the missing pixels, and hence the correlation might be small, it is common to use only the border pixels of these MBs in the interpolation process [77]. In [78] this interpolation is done with the aim of minimising the variation inside the concealed MB and across the borders with its neighbours, thus producing a smooth signal. The measure used to quantify smoothness is the sum of squared differences between spatially adjacent pixels (including the border pixels of surrounding MBs). In [70] interpolation is done recursively for bands of missing pixels, using border pixels of surrounding MBs and already concealed pixels of the recovered MB. The choice of border pixels used for the interpolation is based on a cost estimate which favours pixels that preserve the existing (in the adjacent MBs) edge information across the missing MB. Edge continuity is also the criterion used in [79] for selecting the interpolation direction, based on a previously applied edge classifier. Concealment of damaged blocks in one MB can also be done by estimating the DC value (mean value) or partial DC value of each block from surrounding blocks or bands of pixels and replacing the missing block pixels by this estimated mean value [80]. Statistical methods for spatial concealment have also been proposed [66][81]. All of the above methods can greatly benefit from the availability of correctly received surrounding MBs, which is more likely when some form of interleaving is employed at the encoding stage.

The H.264 joint model (JM) error concealment feature

Although not normative, the reference software decoder (JM 7.4) implements both spatial and temporal error concealment for missing intra and inter coded macroblocks [68][82]. The spatial concealment employed for damaged intra MBs is based on the method described in [81], which replaces missing pixels with weighted averages of boundary pixels in surrounding MBs. The weights used are inversely proportional to the distance between source and destination pixels, with source pixels coming either from correctly received neighbouring MBs only (if more than 2 such MBs exist) or from both correctly received and previously concealed MBs. Inter concealment (motion vector prediction + motion compensated temporal concealment) is implemented as specified in [67]. This is the boundary matching (BM) method described previously for predicting one missing macroblock MV, with the prediction candidates coming from surrounding MBs or 8x8 blocks of these MBs. The average and median of the surrounding blocks are no longer used as candidates. Instead, only the actual MVs of these blocks together with a zero MV are tested, and the one that results in the smallest side match distortion (SMD = boundary matching error) becomes the recovered MV, which is then used for motion compensated replacement of the missing pixels from a previously decoded frame. Certain rules regarding which MVs to use in the case of B-frames are also mentioned in [68] (only one MV is used as a candidate in any case). Additionally, if the motion activity in the current frame, as recorded by the average MV of all correctly received MBs, is below a certain threshold (1/4 pixel), then simple temporal copying is used (the recovered MV equals zero).

Figure 11: Spatial (a) and temporal (b) concealment of the H.264 joint model.
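The boundary matching selection used for inter concealment, both in [67] and in the JM as summarised above, can be sketched roughly as follows: each candidate MV (the zero vector plus the MVs of the surrounding blocks) is scored by the side match distortion between the one-pixel boundary of the available surrounding MBs and the boundary of the candidate replacement block, and the candidate with the smallest score is used for motion compensated replacement. This is an illustrative numpy-based sketch for a single 16x16 MB and one reference frame, not the exact JM implementation; boundary handling at picture edges and candidate clipping are simplified, and candidates are assumed to point inside the reference frame.

    import numpy as np

    MB = 16  # macroblock size in pixels

    def side_match_distortion(ref, cur, x, y, mv):
        # Sum of absolute differences between the one-pixel boundary of the
        # neighbouring MBs in the current frame and the corresponding boundary
        # of the candidate replacement block taken from the reference frame,
        # displaced by mv = (dx, dy).
        dx, dy = mv
        cand = ref[y + dy:y + dy + MB, x + dx:x + dx + MB].astype(float)
        d = 0.0
        if y > 0:
            d += np.abs(cur[y - 1, x:x + MB].astype(float) - cand[0, :]).sum()
        if y + MB < cur.shape[0]:
            d += np.abs(cur[y + MB, x:x + MB].astype(float) - cand[-1, :]).sum()
        if x > 0:
            d += np.abs(cur[y:y + MB, x - 1].astype(float) - cand[:, 0]).sum()
        if x + MB < cur.shape[1]:
            d += np.abs(cur[y:y + MB, x + MB].astype(float) - cand[:, -1]).sum()
        return d

    def conceal_inter_mb(ref, cur, x, y, neighbour_mvs):
        # Test the zero MV plus the neighbouring MVs, keep the one with the
        # smallest side match distortion and copy the displaced block from the
        # previously decoded frame (motion compensated temporal replacement).
        candidates = [(0, 0)] + list(neighbour_mvs)
        best = min(candidates,
                   key=lambda mv: side_match_distortion(ref, cur, x, y, mv))
        dx, dy = best
        cur[y:y + MB, x:x + MB] = ref[y + dy:y + dy + MB, x + dx:x + dx + MB]
        return best

In the JM the candidate list is built from the MVs of the surrounding MBs (or their 8x8 blocks), and simple temporal copying (the zero MV) is forced when the average motion in the frame falls below the 1/4-pixel threshold mentioned above.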

Error concealment in the JM 7.4 decoder proceeds in a specific order, based on the assumption that centrally placed MBs are more difficult to conceal than boundary MBs. Thus, after the whole frame has been decoded and missing MBs have been identified based on an MB status map maintained at the decoder, concealment starts at the boundaries and proceeds column-wise towards the centre of the frame. For each column, concealment proceeds from the top and bottom towards the centre of the column. The performance of the concealment strategy used by the JM decoder was found to be better than that of simple temporal copying [68].

Figure 12: Reconstructed frames of the foreman sequence (coded using slices of fixed size) with one missing slice at frame 12: a) without error concealment, b) with JM concealment.

Whole frame concealment - special case

All the concealment strategies mentioned above are inapplicable when a whole frame is missing at the decoder. Temporal concealment through estimation of missing MVs from neighbouring MBs is impossible, since no such neighbouring MBs exist (assuming no data partitioning was employed at the encoder). The same applies to spatial concealment, again due to the lack of adjacent macroblocks. The easiest approach to concealing a missing frame is to copy the previous frame in the reference buffer (picture freeze). A more sophisticated approach that makes use of the multi-picture buffers employed in the latest standards (e.g. H.264) has recently been presented in [83]. The method employs up to 5 previous frames (although the use of 2 frames seems adequate) to generate a motion vector history for each pixel in the previous frame (before the current missing one). This history is then employed to estimate forward motion vectors for each pixel in the previous frame that will effectively project the previous frame onto the current missing one (the average of the MV history is used). The estimated motion field is filtered with a median filter to avoid discontinuities in the MVs of pixels that were neighbours in the previous frame. Pixel locations in the current frame that have been left empty by this process are interpolated from surrounding pixels, again via a median filter. More details can be found in [83]. The performance gains of this method compared to simple copying of the previous frame were reported to be very significant, especially in the case of increased motion (up to 8 dB), at the cost of increased complexity.

Error resilience tools at the encoder

All three of the latest video coding standards, H.263, MPEG-4 and H.264, provide error resilience support in the form of optional coding modes. These error resilience tools aim at either preventing error propagation or enabling the decoder to perform better concealment. In both cases, use of these tools typically decreases the efficiency of the encoder due to the addition of extra redundancy bits. Hence the trade-off that has to be made with error resilient coding is that of gain in error robustness versus the amount of redundancy [63]. We are mostly interested in the error resilience support of H.264 [84]. However, some of the error resilience tools present in H.264 are common among the three coding standards, with only minor differences.

These tools are discussed below. Tools that are specific to H.264 are described in the subsection on H.264-specific error resilience tools, while other error resilient techniques (with or without application to a specific standard) are presented in the subsection on other error resilient coding methods.

Common error resilience tools

Error resilience tools that are present in most video coding standards (including older ones like MPEG-1, MPEG-2 and H.261) include different forms of picture segmentation - like slices and groups of blocks (GOBs) - and intra refresh coding at the macroblock, slice or picture level. The three most recent coding standards add some further options like reference picture selection and data partitioning. These tools are discussed below.

Intra refresh coding

Intra placement or intra refresh serves the purpose of preventing or reducing drifting errors caused by error propagation due to the predictive nature of the codec (temporal and spatial). In inter mode the loss of information in a past frame has a significant impact on the quality of the current and future frames, and although the impairment introduced will decay with time (due to the leakage of the prediction loop), it is only with intra coding that full recovery can actually take place [85]. Intra placement can be applied at the picture level, slice level or MB level, with the latter being the preferred choice when bit-rate and latency requirements are tight. However, compared to previous standards, intra placement for error resilience in an H.264 encoder differs in a number of ways. At the picture level, intra coding does not necessarily mean resynchronisation and hence cannot guarantee an end to temporal error propagation. A clearing of the multi-picture buffer is needed to stop any drifting effects, and this takes place with IDR pictures (Instantaneous Decoder Refresh), which combine intra picture coding with flushing of the picture buffer (IDR slices can only be part of an IDR picture) [60]. At the macroblock level, intra coding, unless otherwise specified at the sequence parameter set (SPS), might entail spatial prediction from neighbouring inter coded MBs, which can result in propagation of errors even to intra coded MBs. Hence for this tool to be effective the ConstrainedIntraPrediction flag at the SPS has to be raised. Moreover, when slice or MB intra refresh coding is combined with multiple reference frames, a reference frame restriction should be used. This can prevent the reappearance of error propagation due to referencing of MBs at the same spatial location but in frames prior to the one in which the intra refresh took place [86]. Effectively, the set of possible combinations of inter macroblock modes and reference frames has to be restricted after motion estimation, based on whether a specific motion vector points to an area which has been intra refreshed in a more recent frame of the buffer.

The use of intra refresh coding on a slice/macroblock basis is an established method for improving the quality of the decoded video in the presence of errors. However, intra-coded information in general requires more bit rate and therefore a careful selection of intra updates is necessary. A number of approaches to intra refresh coding can be found in the literature, including random intra refresh, regular intra refresh, content based intra refresh [87] and loss aware rate distortion optimised intra refresh [88][85].
The latter makes use of estimates of the expected distortion caused by transmission errors and losses when a specific error concealment method is used at the decoder, in order to choose the optimal coding mode. The complexity of this method, however, is significant, and for practical implementations regular or random intra refresh coding can be used, with the former being superior.

Slices

A slice is a collection of macroblocks in raster scan order which can range from one MB to all MBs in one picture. Slices interrupt the in-picture coding mechanisms, thus limiting any spatial error propagation to the affected slice only. Additionally, headers included in each slice serve as spatial synchronisation markers.

As a result, slices can be independently decoded without requiring any other information. Slices are a prerequisite for error concealment methods since they can prevent the loss of entire pictures at the decoder. Slices are also useful for adapting the payload of packets and for interleaving purposes [84]. Relatively small slices, and hence relatively short packets, reduce the packet loss probability and the amount of lost information. At the same time, however, the in-picture prediction restrictions (intra and MV coding) and the increased overhead associated with small slices can harm the efficiency of the codec considerably [89]. In H.263 and MPEG-4 the concept of GOBs is also supported, which is essentially a restricted version of slices (one row of MBs constitutes a GOB).

Data Partitioning

Data partitioning allows the categorisation of syntax elements into different groups according to their importance for the decoded picture quality in the presence of transmission errors. Three partitions are allowed in H.264 [60], which are described below in order of importance. Importance is judged based on the error concealment influence that each partition can have in case of errors. Data partition A includes header information (slice), quantisation parameters and motion vectors, all necessary for decoding either of the two other partitions. Data partition B collects data related to intra coding (intra coded block patterns and intra coefficients) while data partition C contains similar inter coding information. Loss of either or both of partitions B and C will not have catastrophic results if partition A is received correctly, since knowledge of coding modes and motion vectors can allow for very good error concealment. Data partitioning is usually accompanied by some form of unequal error protection, with data partition A being strongly protected either through the use of channel coding or through retransmissions/repetitions [84].

Reference Picture Selection

The multi-picture reference buffer supported by H.264 (and H.263++) allows the encoder to select the reference picture used in inter prediction. This can be exploited for error resilience purposes. In a feedback-based system (see the discussion of feedback based error resilience below), reference picture selection on a slice or picture basis can stop error drifting. When no feedback is available, periodic referencing (every n-th frame) of a specific past frame (the n-th previous frame) can be employed, with the periodic frames being coded more robustly than other frames (e.g. using FEC codes and data partitioning) [90].

H.264-specific error resilience tools

Flexible Macroblock Ordering - FMO

FMO permits the assignment of MBs to slice groups in orders other than the normal raster scan order, based on a macroblock allocation map. The available map types include, among others, dispersed macroblock allocation, interleaved slices, explicit assignment of a slice group to each macroblock location in raster scan order, and one or more foreground slice groups plus a leftover slice group. FMO is an error resilience tool specific to H.264 and can lead to very good results when combined with concealment (e.g. the dispersed type preserves many neighbouring MBs, which can lead to better concealment results) [84].
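To illustrate the dispersed allocation mentioned above (see also Figure 13 below), the sketch below builds a macroblock-to-slice-group map that scatters consecutive MBs over the slice groups in a checkerboard-like pattern, so that the neighbours of any lost MB tend to belong to other slice groups and remain available for concealment. The formula follows the general shape of the dispersed map type but is written from the description here, not copied from the standard, so the exact expression should be checked against the H.264 specification.

    def dispersed_mb_to_slice_group_map(pic_width_in_mbs, pic_height_in_mbs,
                                        num_slice_groups):
        # One slice-group index per macroblock, in raster scan order.
        n = pic_width_in_mbs * pic_height_in_mbs
        return [((i % pic_width_in_mbs) +
                 (((i // pic_width_in_mbs) * num_slice_groups) // 2))
                % num_slice_groups
                for i in range(n)]

    # Example: a QCIF picture (11 x 9 MBs) split over two slice groups yields a
    # checkerboard, so losing one slice group still leaves every damaged MB
    # surrounded by MBs of the other group.
    mb_map = dispersed_mb_to_slice_group_map(11, 9, 2)
    for r in range(9):
        print(mb_map[r * 11:(r + 1) * 11])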

Figure 13: Dispersed and foreground/background types of FMO.

Redundant Slices

Redundant slices play a similar role in error resilience at the application layer as packet repetition does at the transport layer, with the main difference being that redundant slices can be coded with different parameters, including a different QP. As a result, a coarser version of a slice can be sent in addition to the primary one, and can be used for reconstruction in case the primary slice is lost due to transmission errors.

Parameter Sets

Although parameter sets are not an error resilience tool as such, their introduction and usage in H.264 enhances error resilience. Parameter sets can be transmitted only once for the duration of a transmission if the coding settings do not change, or even not transmitted at all if they are hard coded at the decoder. In any case, their reliable arrival at the decoder can be ensured by means of out-of-band transmission, or in-band transmission coupled with increased error robustness (e.g. packet repetition, multiple copies and/or a high number of retransmissions).

Other error resilient coding methods

Multiple description coding

The term multiple description coding (MDC) describes a set of encoding techniques which provide robustness by transmitting multiple different descriptions of the same data. MDC relies on the fact that if any one description (or a subset of descriptions) is corrupted or lost, the decoder will still be able to decode the remaining descriptions in order to recover the desired frame (at a reduced quality). When all descriptions are recovered, the decoded quality is optimal. Due to the contrasting requirements posed by the need for better compression performance (the data should be uncorrelated) and the need for better error resilience (enough correlation should remain), MDC systems always incur a loss of compression efficiency.

EREC

The basic operation of the Error Resilient Entropy Coding (EREC) method [91] is to rearrange the N variable-length blocks of data into a fixed-length slotted structure (with specific slot lengths), in such a way that the decoder can independently find the start of each block and start decoding it. The encoder first chooses a total data size which is sufficient for coding all the data and which needs to be coded as (protected) header information (thus introducing a small amount of overhead). Then the coded data for each image block are placed into the designated slot for the block, either partially or fully. A pre-defined offset sequence is then used to search for empty slots in which to place any remaining bits of blocks that are bigger than the slot size. This is repeated until all bits have been packed into one of the slots. EREC ensures that the decoder can regain synchronisation at the start of each block, and that data at the beginning of each block are more immune to error propagation than data at the end, which mainly represent high frequency coefficients.
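A minimal sketch of the slot-filling idea behind EREC as described above: each variable-length block first fills its own fixed-size slot, and any remaining bits are then placed into spare space found by stepping through the slots with a pre-defined offset sequence. Only the packing is shown (in bit counts rather than actual bits); the protected header carrying the total data size, and the decoding side, are omitted, and the simple offset sequence used here is just one possible choice.

    def erec_pack(block_lengths, slot_size):
        # Distribute variable-length blocks (lengths in bits) over equal slots.
        # Returns, for every block, a list of (slot_index, bits_placed) fragments.
        n = len(block_lengths)
        free = [slot_size] * n              # spare bits left in each slot
        remaining = list(block_lengths)     # bits of each block still to place
        placement = [[] for _ in range(n)]

        # Stage 0: every block starts in its own slot.
        for k in range(n):
            put = min(remaining[k], free[k])
            if put:
                placement[k].append((k, put))
            remaining[k] -= put
            free[k] -= put

        # Later stages: search other slots using the offset sequence 1, 2, ...
        for offset in range(1, n):
            for k in range(n):
                if remaining[k] == 0:
                    continue
                s = (k + offset) % n
                put = min(remaining[k], free[s])
                if put:
                    placement[k].append((s, put))
                    remaining[k] -= put
                    free[s] -= put
        return placement

    # Four blocks of 3, 9, 2 and 6 bits packed into four 5-bit slots.
    print(erec_pack([3, 9, 2, 6], slot_size=5))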

Feedback based error resilience

All the error resilience methods described so far treat the encoder and decoder as two independent entities when it comes to combating transmission errors. However, in some cases a feedback channel exists between the decoder and encoder which can be exploited for error control purposes. A typical situation in WLAN transmissions is the use of ARQ (Automatic Repeat request) for retransmitting packets that have not been received by the receiver (or have been received but are in error even after forward error correction - FEC). Although one retransmission alone will result in the packet error rate being squared, delays will be introduced which can be unacceptable for delay-critical applications like video transmission [63]. When that is the case, the decoder's feedback can be used to reduce the impact of the encountered errors, i.e. to limit their propagation in time and space, by adjusting the encoder's operations. Typically, the distortion remaining after error concealment of corrupted image regions may stay visible in the image sequence for several seconds, unless intra refresh coding (on an MB, slice or frame basis) is employed [92]. The aim of feedback error control is similar to that of intra refresh coding, but the use of a feedback based approach can lead to solutions that suffer much less in terms of coding gain loss (intra coding has a high coding gain penalty).

Two types of feedback messages are used in order to indicate correct or incorrect reception of specific packets. ACK messages indicate positive acknowledgement of the reception of a packet and, when used, have to be sent continuously, unless of course an error is encountered. NACK messages (negative acknowledgements) indicate erroneous reception (or no reception at all) and, compared to ACK messages, require an even lower bit-rate since they only have to be sent when errors occur. In general, the bit-rate requirements for feedback messages are minimal compared to the video bit-rate. The round trip delay, however, can be considerable, with a worst case scenario involving several retransmission attempts requiring 300 ms [92]. The larger the round trip delay, the later error recovery will start. Three types of feedback based error control approaches can be identified: error tracking, error confinement and reference picture selection. These are described below.

Error tracking

Error tracking approaches rely on NACK feedback messages in order to modify the coding control of macroblocks that use for prediction past MBs which have been identified as corrupted, based on reconstruction of the inter-frame error propagation that takes place at the decoder. Assuming a multi-picture buffer, error tracking approaches store information for each MB with regard to the spatial overlap that takes place between MBs of successive frames due to motion compensation. In [93] this information is used, upon receipt of a NACK message, to mark MBs of past frames in the buffer as corrupted or not. Those MBs marked as corrupted are no longer used in the prediction process, in order to prevent further error propagation. The method described in [92][94] extends the previous strategy by additionally calculating the severity of the impairment that is introduced, based on the use of a specific error concealment technique at the decoder. The calculation is done through the use of an inter-frame error propagation model that estimates the error energy introduced at each MB after concealment.
A more complex approach would apply the exact same concealment at the encoder, effectively requiring duplication of the decoder's operations for a number of frames [63][64]. Based on this information, [92][94] uses intra coding only for those MBs that address severely corrupted past MBs. The advantage of this approach is that the bit-consuming intra mode is selected less often than with the method of [93]. The disadvantage, of course, is higher complexity and the need for knowledge of the decoder's error concealment strategy.
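The bookkeeping behind the simpler error-tracking strategy of [93], as described above, can be sketched as follows: when a NACK reports lost MBs, the corruption is propagated forward through the stored motion-compensation dependencies, and every MB that (directly or transitively) used a corrupted area for prediction is marked so that the encoder avoids it as a reference. The frame/MB indexing and the dependency structure below are hypothetical simplifications of the spatial-overlap information an encoder would actually keep.

    def track_corruption(nacked_mbs, dependencies):
        # nacked_mbs   -- set of (frame, mb) pairs reported lost by the decoder.
        # dependencies -- dict mapping (frame, mb) -> list of (frame, mb) pairs
        #                 in earlier frames whose pixels it used for prediction
        #                 (derived from the MV-overlap bookkeeping at the encoder).
        # Returns the full set of (frame, mb) pairs to treat as corrupted.
        corrupted = set(nacked_mbs)
        for key in sorted(dependencies):    # process frames in temporal order
            if any(ref in corrupted for ref in dependencies[key]):
                corrupted.add(key)
        return corrupted

    # Toy example: MB 5 of frame 10 is lost; MB 5 of frame 11 was predicted from
    # it, and MB 6 of frame 12 from MB 5 of frame 11, so both are marked.
    deps = {(11, 5): [(10, 5)], (12, 6): [(11, 5)], (12, 7): [(11, 8)]}
    print(track_corruption({(10, 5)}, deps))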

Error confinement

Error confinement approaches rely on restricted encoding operations in order to ensure that error propagation is limited to specific spatio-temporal locations. Confinement of errors to specific locations in space and time means that error tracking is no longer necessary in order to assess the extent of error propagation when a NACK message is received. Two of the latest video coding standards, H.263 and MPEG-4, can facilitate confinement of errors in case of transmission failures, each in a different way. In MPEG-4, the video object plane (VOP) approach, in which sub-videos are coded independently and then reassembled at the decoder, ensures that any errors affecting a specific video object will not spread to other video objects [92]. Annex R of H.263 (H.263+) specifies the independent segment decoding (ISD) mode, where each slice (or GOB) is coded independently of the others, i.e. macroblocks outside the current slice are not used for prediction [95]. In both cases error propagation is combated by feedback based intra updates or by the reference picture selection approach discussed below. Independent coding introduces losses in coding performance due to less efficient motion compensation; this loss is roughly inversely proportional to the picture size, which makes CIF sized pictures the limit for beneficial application of this approach [63]. The MPEG-4 VOP approach adds overhead due to the presence of shape information.

Reference picture selection

All of the above methods can be combined with reference picture selection (RPS) as the preferred reaction to NACK feedback messages. For example, with error confinement, instead of coding the current VOP/segment in intra mode to stop further inter-frame error propagation, motion compensated prediction from the last frame available without errors could be used, which should result in better coding efficiency since fewer bits are required for coding the prediction error. A similar approach could be taken with error tracking. Typically, error concealment at the decoder would be employed to minimise the effect of error propagation during the round trip time (i.e. until an error free frame arrives). The method described in [96] tries to aid concealment during that period by dispersing in time the motion vectors of coded neighbouring macroblocks (i.e. the MVs of adjacent MBs should point to different past frames if possible). To do that, the Lagrangian optimisation that is normally used with long term motion compensated prediction is enhanced with an additional robustness cost which measures the time dispersion of MVs in the neighbourhood of an MB. The improvement in performance in the presence of errors with the above method comes at the cost of degraded quality when no errors occur. RPS can also be used in conjunction with ACK messages, where only acknowledged slices/frames are used for prediction of the current slice/frame. This approach would ensure that error propagation is entirely avoided, but at the same time it would incur a higher coding penalty, since the encoder would have to use older reference pictures for prediction (due to the round trip delay) even when no errors occur [92]. Combinations of the two are also possible. In [97] a method for adaptive switching between the two types of feedback messaging in response to error conditions is described.
More precisely, the proposed system switches from one type of messaging to the other (i.e. from NACK to ACK and vice versa) based on the number of occurrences of an error (NACK to ACK mode) and the number of consecutive absences of an error (ACK to NACK mode). Typical threshold values suggested are 1 and 5 respectively. With this adaptive approach it is possible to optimise the coding efficiency by combining the benefits of both ACK (no error propagation) and NACK (better coding efficiency) types of feedback. The NEWPRED mode included in MPEG-4 version 2 [98] allows the decoder to inform the encoder which segments (slices, frames or VOPs) have been received correctly and which have not, which again represents a combination of the two types of feedback messages.
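The adaptive switching rule of [97], as summarised above, can be sketched as a small state machine: the encoder stays in NACK mode until errors are reported, switches to ACK mode, and returns to NACK mode after a run of error-free reports. The thresholds (1 and 5) are the typical values quoted in the text; everything else is an illustrative simplification.

    class FeedbackModeSwitch:
        # Switch between NACK-based and ACK-based reference picture selection.
        def __init__(self, to_ack_after_errors=1, to_nack_after_clean=5):
            self.mode = "NACK"
            self.errors = 0   # consecutive reports containing an error
            self.clean = 0    # consecutive error-free reports
            self.to_ack = to_ack_after_errors
            self.to_nack = to_nack_after_clean

        def report(self, error_occurred):
            if error_occurred:
                self.errors += 1
                self.clean = 0
                if self.mode == "NACK" and self.errors >= self.to_ack:
                    self.mode = "ACK"
            else:
                self.clean += 1
                self.errors = 0
                if self.mode == "ACK" and self.clean >= self.to_nack:
                    self.mode = "NACK"
            return self.mode

    sw = FeedbackModeSwitch()
    # One error forces ACK mode; five clean reports in a row restore NACK mode.
    print([sw.report(e) for e in (False, True, False, False, False, False, False)])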

3.2.10 ROI Coding

Standard frame-based video coding assigns the same quantiser to all macroblocks in one frame, which effectively means that all macroblocks / regions are treated as if they have the same importance/priority. Under limited constant rate conditions, the available bit budget should be distributed among macroblocks in a way that reflects their importance in the task at hand. Equivalently, when higher compression efficiency is required, the introduced degradation in quality should follow the subjective importance of each MB in order for the reconstructed video to have better subjective quality. When transmission is involved, the resulting quality also depends on the errors encountered. Error resilience methods generally introduce redundancy, thus affecting the efficiency of the codec. Consequently, any bits destined for error resilience usage should be subject to the same priority strategy as the MB quantisers, with the subjectively important regions receiving better error protection than others. Both approaches fall under the category of region of interest (ROI) coding.

Variable Macroblock Quantisation

The number of bits required for coding each macroblock in a picture (video frame) depends on the level of activity in the macroblock, the effectiveness of the prediction, and the quantisation step size. The latter is controlled by the quantisation parameter (QP). The value of QP controls the amount of compression and the corresponding fidelity reduction for each macroblock. Unless flexible macroblock ordering (FMO) and/or multiple slices are used, the value of QP is the same for all macroblocks in a picture. In [99] a perceptually optimised approach for coding sign language image sequences was proposed which employs variable quantisation based on macroblock priority maps for each frame. The maps are the result of a pre-processing module which applies a form of segmentation and makes use of knowledge about the subjective importance of regions in the scene. Effectively, the maps define a number of priority regions (from high priority to low priority) with a gradual degradation of quality based on increasing values of QP. Similar approaches have been proposed elsewhere [100] for general video sequences. Such ROI coding approaches lead to bit savings without significantly affecting the subjective quality of the coded video, assuming that the critical regions of the picture have been identified.

Figure 14: A macroblock priority map with 8 priority regions (0: high, 7: low priority).

Unequal error protection

Unequal error protection refers to methods that protect parts of a transmitted bit-stream better than the rest in the transport system [101]. This unequal protection can take the form of application layer selective retransmission, transport level forward error control, message repetition, etc. The method proposed in [101] combines ROI coding with unequal error protection in what is referred to as sub-picture coding. Each frame is split into a foreground rectangular region and a background one, and is then further partitioned into slices.

Apart from better quality (lower QP), slices (packets) belonging to the foreground region receive better protection during transmission, with an extra parity packet as well as an independent segment type of coding. In the presence of errors, such an approach leads to better subjective quality.

3.3 Transcoding

3.3.1 Introduction

Video transcoding is used to convert one coded video signal into another. It is mostly used to reduce a video's bit-rate so that it can be sent through a lower capacity channel. It is also used to change the video's spatial and temporal resolution, and to increase error resilience when the signal is sent over wireless networks. Finally, transcoding is used to connect systems using different technologies, such as different video compression standards. A good overview of transcoding techniques has been published in [114].

The easiest way to produce a transcoded signal is to decompress the initial signal and recompress it under the new constraints, as in the figure below. The cost of this method is of course very high, and more efficient methods have been developed. A compromise has to be found between the quality of the output video and the complexity of the algorithm.

Figure: cascaded transcoding - coded bit stream -> decoder -> component video -> recoder -> new coded bit stream.

Most of the existing transcoding techniques are based on video compression using the DCT, as in MPEG for example. In WCAM, transcoding from Motion JPEG 2000 to H.264 will also be implemented. To facilitate the understanding of the following points, a generalised video coder is represented below. Motion compensation uses the fact that temporally close images are very similar. Motion vectors are used to estimate the movement of image blocks. An estimate of the next frame is calculated by applying these motion vectors to the current frame. The signal representing the difference between the estimated frame and the real frame is called an inter frame, in contrast to intra frames, which are the video's original frames. Inter and intra frames are coded alternately. They are first transformed using the DCT (Discrete Cosine Transform) and quantised to reduce the data size and to achieve a specific output bit-rate. Finally, the redundancy still present in the signal is reduced.

To do this, MPEG-2 uses the VLC (variable length coding) technique, but more recent MPEG compression standards use other methods.

3.3.2 Bit-rate reduction

Most of the research activity in the transcoding domain has been focused on bit-rate reduction. We can classify these bit-rate transcoders into three categories, as shown in the table below.

Table 3: Transcoder categories
             Q, VLD/VLC   MCP, DCT/IDCT   Motion estimation
Category 1       X
Category 2       X              X
Category 3       X              X                  X

In each category, variable length decoding is the first step. After that, first-category transcoders only change the quantisation step. Variable length coding is then necessary to re-encode the bit stream correctly. Second-category transcoders recalculate the difference between the estimated and the real frames. The third category recalculates the motion vectors, and thus recomposes the signal in the pixel domain using the inverse DCT. This enables transcoders to transform intra frames into inter frames. The three categories are detailed below.

Use of previous decisions

As explained in [115], the previous encoding decisions (motion vectors, GOP structure, etc.) need to be taken into account during transcoding. If this is not done, newly estimated motion vectors will introduce additional distortion into the decoded picture. This is the case even when the output rate is not changed.

Open loop transcoders

First-category transcoders are called open loop transcoders, because the VLD, quantisation and VLC steps are performed successively. The complexity of open loop transcoders is not very high. Such transcoders have been discussed in [116] and [117]. Their biggest drawback comes from the fact that a double distortion is generated. The first distortion is due to the increase of the quantisation step on the intra frames. Because the inter frames were calculated during the first encoding of the video, they represent the difference with the original intra frames, and not with the new, distorted ones. This leads to a second distortion, called drifting. As time goes on, the mismatch between the inter frame prediction and the desired prediction strongly degrades the image.

Closed-loop transcoders

To solve the drifting problem present in the fast open loop transcoders, a re-estimation of the inter frames has to be performed. As the figure below shows, the re-quantised intra frames are used to recalculate the inter frames.

The info-bus contains all the decisions made during the first coding. Because of the cascading of the IDCT and the DCT used to correct the inter frames, there is a slight loss of quality, as explained in [118].

Reconstruction in the compressed domain

To avoid the IDCT and DCT blocks used to re-estimate the inter frames, the reconstruction of the intra frames can be performed in the DCT domain. [119] and [120] have proposed such transcoders. To achieve this, complex matrix operations are used, and over the years many simplifications have been made in order to speed up the transcoding.

Suppression of high frequencies

Lin and Lee in [121] use the fact that the DCT reorganises the image's energy by ordering the coefficients by frequency. By considering only the coefficients corresponding to the low frequencies, the speed of the transcoding process can be greatly improved, at the cost of a decrease in quality.

3.3.3 Spatial resolution reduction

Transcoding techniques allow the spatial reduction of high quality videos, enabling their visualisation on small devices for example, without completely decoding and re-encoding the frames. To perform the spatial conversion, it is not necessary to undo the DCT. Retaining only some of the low-frequency DCT coefficients achieves the same result. To this end, down-sampling filters have been derived, for example in [124].

Motion vector mapping

To reduce the frame size, we can go from four macroblocks (groups of pixels) to one MB of the same size, or to four smaller MBs. The estimation of the new motion vectors is called motion vector mapping. The motion vector mapping problem is illustrated in the figure below.

In the first case, weighted average or median filters can be applied. The second case provides a better estimation of the motion, but increases the amount of motion vector information. Two papers ([122] and [123]) propose a good evaluation of the resulting quality using several motion vector mapping methods.

Motion vector refinement

To obtain a higher quality transcoded video, special attention has to be paid to the motion vectors. After re-quantisation, spatial down-sampling and other image transformations, the reuse of the original motion vectors is not always appropriate. A motion vector refinement in small search windows (to keep the complexity low) can significantly improve quality. Such techniques are described in [125].

3.3.4 Temporal resolution reduction

Some devices with limited resources need to receive video signals having not only a limited spatial resolution, but also a limited temporal resolution. A trade-off can be found between these two constraints in order to get a satisfactory image at the output. To reduce the temporal resolution, some frames have to be skipped. This is illustrated in the figure below. In the original video, frame A was estimated using frame B, which was estimated using the data of frame C. If we skip frame B, we have to deal with two problems: the motion vector of A has to be modified, and the estimation of frame A has to be based on frame C, not on frame B, which has been skipped.

Figure: prediction chain C -> B -> A, with frame B skipped.

As we can see from the figure above, the re-estimation of the motion vector can be determined by tracing the motion vectors back to the desired frame (directly from A to C). Since the predicted blocks in the current frame generally overlap multiple blocks, bilinear interpolation of the motion vectors in the skipped frame can be used. The weighting of each input motion vector is proportional to the amount of overlap with the predicted block. Such techniques are developed in [126] and [127]. The second problem, the residual re-estimation, can easily be solved in the pixel domain, but this increases the complexity by leaving the DCT domain. To avoid this, the techniques presented in [128] and [129] enable the computation of the new residual in the DCT domain.
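A simplified sketch of the vector re-estimation described above for a skipped frame: the motion vector of a block in frame A is extended by a vector taken from the skipped frame B at the position the first vector points to, so that the composed vector refers directly to frame C. A real transcoder would use the overlap-weighted bilinear interpolation of the covered blocks mentioned in the text; here, for brevity, only the vector of the single block containing the pointed-to position is used, and the block sizes and frame dimensions (defaults corresponding to CIF) are hypothetical.

    def compose_mv(mv_a, block_pos, mvs_b, block_size=16,
                   width_in_blocks=22, height_in_blocks=18):
        # mv_a      -- (dx, dy) of the block in frame A, pointing into skipped frame B.
        # block_pos -- (bx, by) block coordinates of that block in frame A.
        # mvs_b     -- dict mapping (bx, by) in frame B to its (dx, dy) towards frame C.
        # Returns the composed (dx, dy) from frame A directly to frame C.
        bx, by = block_pos
        dx, dy = mv_a
        # Pixel position the vector points to inside frame B, mapped to a block.
        px = bx * block_size + dx
        py = by * block_size + dy
        tbx = max(0, min(width_in_blocks - 1, px // block_size))
        tby = max(0, min(height_in_blocks - 1, py // block_size))
        dx2, dy2 = mvs_b.get((tbx, tby), (0, 0))  # zero MV if unavailable
        return (dx + dx2, dy + dy2)

    # Block (3, 2) of frame A moved (-6, 4) into frame B; that area of B moved
    # (-2, 1) towards frame C, so the composed vector is (-8, 5).
    print(compose_mv((-6, 4), (3, 2), {(2, 2): (-2, 1)}))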

3.3.5 Error-resilience Transcoding

In channels with a high error rate, like wireless channels, it can be necessary to increase the video resilience by transcoding the signal. The resilience is improved in two directions: spatially and temporally. The spatial resilience is increased by reducing the number of blocks per slice, and the temporal resilience is increased by sending more intra frames. Such a technique is explained in [130].

3.3.6 Transcoding from one standard to another

A lot of research is done on transcoding between standards. A few examples are given here. Transcoders from MPEG-2 to H.263 and MPEG-4 are described respectively in [131] and [132]. Paper [133] gives a general overview of transcoding between the different MPEG standards. Sand and Moonlight, two commercial companies, offer MPEG-2 to H.264 transcoders. An MPEG-2 to MPEG-4 transcoder is described in [136] and an H.263 to MPEG-4 transcoder in [137].

3.3.7 Transcoding of encrypted bit streams

For security reasons, an encrypted bit stream should not be decrypted during the transcoding process. JPEG 2000 bit streams, in which the data is organised in packets and only the packet bodies are encrypted, offer a solution to this problem. By reading the packet headers (and not the bodies), and by taking advantage of JPEG 2000's scalability, it is possible to determine which packets are needed to reconstruct the bit stream, enabling secure transcoding. Such a technique is described in [134].

3.4 Scalable video coding

Traditionally, the main objective of video coding has been to optimise video quality at a given bit rate. With the widespread usage of very different networks and types of terminals, the interoperability between different systems and networks is becoming more important. Therefore, video servers should provide a seamless interaction between stored content and delivery. More specifically, the video transmission should efficiently adapt to the varying channel bandwidth and terminal capabilities. Scalable video coding and transcoding are two technologies to achieve this goal. Both address the same problem with two different methods. Basically, transcoding converts the existing data format in order to meet the current transmission constraints (see Sec. 3.3). In contrast, scalable video coding defines the compressed bitstream at the encoding stage, independently of the transmission environment.

Scalability is a very important feature, especially in heterogeneous environments. Through its syntax and coding representation, a scalable video coding scheme allows access to the content at multiple resolutions (spatial scalability), frame rates (temporal scalability), qualities (quality or SNR scalability) and image regions. On the one hand, scalability is critical when terminals have differing capabilities in terms of processing power, memory and display resolution.

In this case, the terminal will only decode the relevant part of the bitstream, according to its capabilities. On the other hand, scalability is also needed when the available network bandwidth fluctuates. In this case, scalable coding allows for an efficient use of the network bandwidth by adjusting the video bit rate throughput. Scalable video coding can also be useful for transmitting video over error-prone networks. More specifically, the layered nature of a scalable video bit stream enables the efficient use of unequal error protection techniques. Namely, it is straightforward to protect the most important layers more strongly, and the less important layers less.

In coding schemes based on the motion compensated Discrete Cosine Transform (DCT), such as MPEG-2, MPEG-4 and H.263, scalability performance is unsatisfactory. Indeed, the functionality is rather limited, and it induces a significant drop in coding efficiency along with a large increase in complexity. JPEG 2000 has made a significant step forward by developing a wavelet-based embedded coding which supports flexible and efficient scalability. As a consequence, Motion JPEG 2000, the extension of JPEG 2000 for encoding video sequences, efficiently provides the full set of scalability features. However, Motion JPEG 2000 simply encodes each video frame independently (i.e. intra-frame coding). Hence, it fails to fully exploit the temporal redundancy in the sequence, resulting in lower coding efficiency. Therefore, the problem of scalable video coding remains. For this reason, a new work item on Scalable Video Coding has recently been launched in MPEG-21 to address this issue. The goal is to provide very efficient scalability functionality while achieving coding efficiency close to the best available non-scalable video compression schemes (i.e. H.264/AVC). Scalability in video coding is discussed in more detail hereafter.

3.4.1 Scalability in Motion JPEG 2000

Motion JPEG 2000 (the video standard encapsulating JPEG 2000 frames) is the most appealing existing standard from a scalability point of view. Spatial, temporal and quality scalability are efficiently supported by taking advantage of the bit stream structure. Several spatial resolutions of the video can be extracted from the bit stream thanks to the wavelet transform used in JPEG 2000. Because there is no inter frame coding, temporal scalability is straightforward. As JPEG 2000 is based on an embedded bit stream structure (bit-plane coding), quality scalability is very efficiently supported. In addition, due to the spatial structure of a JPEG 2000 image, it is possible to easily extract specific regions of interest.

3.4.2 Scalability in MPEG-2, MPEG-4 and H.263

The MPEG-2 standard has limited support for spatial, temporal and quality scalability. A layered approach is used. Namely, a video sequence is coded into a base layer and one or more enhancement layers. Temporal scalability can be efficiently achieved by the use of B-frames in the enhancement layer. Spatial and quality scalability can be obtained by introducing multiple motion compensation loops. However, this generally leads to a significant decrease in compression performance, combined with a significant increase in complexity. For this reason, scalability has had limited success in MPEG-2. In addition to the same types of scalability as in MPEG-2, MPEG-4 also supports a finer form of scalability known as fine granularity scalability (FGS). This scalability is possible thanks to the bit-plane coding of DCT coefficients in the enhancement layers.

An overview of this technique is presented in [101]. The advantage of FGS over previous base/enhancement layered scalable coding is that it provides many more truncation points, hence leading to a more flexible and adaptable bit stream. In [103] an H.263+ codec is presented which creates a temporally scalable and error-resilient bit stream. A unified, efficient and universal scalable video coding framework that supports different scalabilities, such as fine granularity quality, temporal, spatial and complexity scalability, is presented in [104]. It is based on the studies on FGS.

3.4.3 Scalability in MPEG-21, Scalable Video Coding

The fundamental problem in motion compensated predictive coding, such as MPEG-2, MPEG-4 or H.263, is that the encoder includes a model of the decoder. Therefore, in order to support scalability (i.e. multiple truncation points in the decoder), multiple motion compensation loops have to be used, leading to inefficiency as discussed above. In order to circumvent this problem, the feedback loop has to be eliminated. However, motion redundancy should still be exploited in order to achieve high coding performance. It was soon recognised that motion compensated 3D wavelet coding offers a very compelling solution. In [104], a warping to align each frame is applied prior to a 3D subband transform. A similar technique was proposed in [105] using more complex motion models. Another approach was proposed in [106] and later enhanced in [107]. Instead of a frame warping, the displacement is compensated on a block basis. A significant contribution was proposed in [108][109][110], introducing motion compensated temporal lifting as the temporal wavelet transform. The motion compensated temporal lifting can be seen as equivalent to applying a temporal filtering along motion trajectories. These schemes are combined with an embedded entropy coder. Consequently, the resulting video coding schemes efficiently support spatial scalability (thanks to the spatial wavelet transform), temporal scalability (thanks to the motion compensated temporal lifting wavelet transform) and quality scalability (thanks to the embedded entropy coder). This approach is largely considered the most promising. Remaining challenges include the use of complex motion models and an embedded representation of the motion information.

Status of Scalable Video Coding (SVC) in MPEG-21

In view of the above considerations, MPEG issued a call for proposals in December 2003. 21 submissions were received and evaluated in March 2004. Of those, 14 were full proposals and 7 were stand-alone tools. The proposals broadly fall into two categories: those based on motion compensated wavelets (2D+t or t+2D) and those based on an H.264/AVC compliant base layer. Extensive subjective tests showed that at higher rates and resolutions, SVC performance is very close to the non-scalable single layer H.264/AVC anchor. In some cases, SVC even outperforms the anchor. Conversely, at lower rates and resolutions, the performance gap between SVC and the non-scalable single layer H.264/AVC increases. MPEG has started a number of core experiments in order to further understand the technical solutions, to evaluate the technical proposals bringing further improvements, to further refine the testing conditions and to further improve the applications and requirements documents. This work is on-going.
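Combining the scalability discussed in this section with the encrypted-domain transcoding idea of Sec. 3.3.7, adaptation can be performed simply by dropping the packets whose headers indicate layers beyond the target operating point, without touching (or decrypting) the packet bodies. The packet attributes used below (resolution level, quality layer, frame index) are an illustrative abstraction of a JPEG 2000-style layered bitstream, not an actual file-format API.

    from dataclasses import dataclass

    @dataclass
    class Packet:
        frame: int        # frame index in the sequence
        resolution: int   # 0 = lowest spatial resolution level
        layer: int        # 0 = most important quality layer
        body: bytes       # possibly encrypted payload, never inspected

    def extract(packets, max_resolution, max_layer, frame_step=1):
        # Keep only the packets needed for the requested spatial resolution,
        # quality and frame rate; temporal scaling simply drops whole frames.
        return [p for p in packets
                if p.resolution <= max_resolution
                and p.layer <= max_layer
                and p.frame % frame_step == 0]

    stream = [Packet(f, r, l, b"...")
              for f in range(4) for r in range(3) for l in range(2)]
    # Half the frame rate, one resolution level less, best quality layer only.
    print(len(extract(stream, max_resolution=1, max_layer=0, frame_step=2)))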

4 Wireless transmission

4.1 Wireless Transmission Standards

This part of the report discusses a range of currently available (or emerging) Wireless LAN standards. Each standard is considered as a potential off-the-shelf solution for WCAM's data and video transport applications. Broadly speaking, the standards can be split by:
- Geographic location (European or North American);
- Operating frequency band (2.4 GHz or 5.2 GHz);
- Intended coverage area (Personal Area Network (PAN) or Wireless Local Area Network (WLAN)).

Table 4: Summary of relevant wireless transmission standards
2.4 GHz, European: 802.11b [11 Mbit/s, 50m]; 802.11g [54 Mbit/s, 50m]
2.4 GHz, North American: 802.11b [11 Mbit/s, 50m]; 802.11g [54 Mbit/s, 50m]
5.2 GHz, European: Hiperlan/2 [54 Mbit/s, 50m]; 802.11h (802.11a/e with DFS and TPC) [54 Mbit/s, 50m]
5.2 GHz, North American: 802.11a [54 Mbit/s, 50m]; 802.11e [54 Mbit/s, 50m]

WLAN technology is designed to provide wireless connectivity, even in less favourable non-line-of-sight conditions. WLANs allow the sharing of information between PCs, laptops and other equipment in corporate, public and home environments. The WLAN market is expanding rapidly and growth is driven by lower costs, ease of installation, flexibility and mobility. Current standards include IEEE 802.11b, Hiperlan/2 and IEEE 802.11a/e/g. The major WLAN standards are summarised in Table 4, together with the maximum bit-rate and typical operating ranges for indoor use. The 802.11b standard offers a maximum 11 Mbit/s capability in the 2.4 GHz band. In the quest for ever higher bit-rates, the IEEE has produced the 802.11a/g standards while ETSI (the European Telecommunications Standards Institute) has produced the Hiperlan/2 standard. Each offers transmission rates of up to 54 Mbit/s. 802.11g operates in the 2.4 GHz band whereas 802.11a operates in the 5.2 GHz band. Hiperlan/2 also operates at 5.2 GHz and offers improved Quality of Service (QoS) via the use of a centralised MAC protocol. The IEEE has recently introduced 802.11e for applications that require high QoS. Interference in the 2.4 GHz band represents a considerable future risk. While coverage at 2.4 GHz is more favourable than at 5.2 GHz, in the presence of strong interference this benefit will not be realisable in practice. In the 2.4 GHz band, 802.11b and 802.11g appear worthy of further study. At 5 GHz, the Hiperlan/2, 802.11a and 802.11e standards should be carefully considered.

4.2 Identification of current wireless standards

Globally, up to 70 MHz of radio spectrum has been allocated at 2.4 GHz to what is commonly known as the ISM (Industrial, Scientific and Medical) band. This band also includes the operating frequency of microwave ovens, which represent a considerable source of radio interference.

WLAN technology at 2.4 GHz is now starting to sell in high volumes to the computer industry. The 2.4 GHz WLAN market is dominated by the 802.11 family of wireless standards [233], which provides an internationally accepted format for WLAN connections. The original standard, now referred to simply as 802.11, provides data rates of 1-2 Mbit/s. A higher rate extension, known as 802.11b [234] or more commonly as Wi-Fi, now achieves 1, 2, 5.5 or 11 Mbit/s where environmental conditions allow. In North America, the FCC has allocated 300 MHz of spectrum at 5 GHz in the U-NII (Unlicensed National Information Infrastructure) band. Meanwhile, the IEEE has developed another extension to the Physical (PHY) layer known as 802.11a [235] and a high rate extension to 802.11b in the 2.4 GHz ISM band known as 802.11g [236]. In addition, an enhancement to the current MAC layer has been considered by IEEE Task Group E to provide strong Quality of Service support [237]. In Europe, at 5 GHz the ERC has designated a total of 455 MHz of spectrum for WLAN use and ETSI has developed the Hiperlan/2 standard [238]. In Japan, spectrum has also been allocated at 5 GHz and the HiSWANa standard [239] has been developed by ARIB. The physical layers of these new standards support multiple transmission modes, providing raw data-rates of up to 54 Mbit/s where channel conditions permit [241]. However, the actual throughput achieved is also highly dependent upon the Medium Access Control (MAC) and network layers. Close cooperation between ETSI BRAN, ARIB MMAC and the IEEE has ensured that the physical layers of the various 5 GHz wireless LAN standards are broadly harmonised. The large scale worldwide markets and the harmonisation of the physical layers will facilitate low cost production of devices conforming to all three standards.

4.2.1 WLANs at 2.4 GHz - IEEE 802.11

The original IEEE 802.11 [233] standard offered features that were only marginally more attractive than Bluetooth (with network speeds of 1-2 Mbit/s). A clear need for higher data-rates forced the formation of the TGb task group, which eventually initiated the roll-out of the IEEE 802.11b standard in 1999. The 802.11b standard increased the data-rate to a maximum of 11 Mbit/s where environmental conditions allow (i.e. high power signals and low interference) and also introduced the concept of roaming, whereby a connection is not lost when a user moves between access points. This standard, now known as Wi-Fi, was the first to offer the possibility of digital video transmission; however, the standard was developed with data transfer rates in mind and no provision was made for the allocation of fixed bandwidths. This makes the transmission of video extremely challenging due to large latencies and extreme timing jitter. The 802.11b standard operates in the 2.45 GHz band and uses Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) to enable fair multiple access to the radio channel. CSMA/CA is similar to the scheme used for wired Ethernet, which uses Collision Detection (CSMA/CD) instead. Work is currently in progress to increase the data-rate to 22 Mbit/s without sacrificing backward compatibility with the existing 802.11b standard. The 802.11g physical layer standard specifies a link-adaptive Coded Orthogonal Frequency Division Multiplexing (COFDM) scheme (as for 802.11a) for operation in the 2.4 GHz ISM band. In native mode (i.e. sacrificing backward compatibility), 802.11g can offer similar throughputs to 802.11a [251].

4.2.2 Technical discussion of IEEE 802.11 and 802.11b (Wi-Fi)

The IEEE 802.11 standard places specifications on the parameters of both the physical (PHY) and medium access control (MAC) layers of the network. The PHY layer, which actually handles the transmission of data between nodes, can use either: Direct Sequence Spread Spectrum (DSSS); Frequency-Hopping Spread Spectrum (FHSS); or Infrared (IR) pulse position modulation. IEEE 802.11 makes provisions for data-rates of either 1 Mbit/s or 2 Mbit/s and operates in the unlicensed 2.4 GHz ISM frequency band (in the case of spread-spectrum transmission) or at optical frequencies for IR transmission. IR is generally considered to be more secure; however, IR requires absolute line-of-sight links, as opposed to radio frequency transmissions, which have the potential to penetrate walls. IR transmissions can also be adversely affected by sunlight.

The MAC layer represents a set of protocols that are responsible for maintaining order in the use of a shared medium. The 802.11 standard specifies the use of a Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) protocol. In this protocol, when a node has a packet to transmit, it first listens to ensure that no other node is transmitting. If the channel is clear, it then transmits the packet. Otherwise, it chooses a random back-off factor which determines the amount of time the node must wait until it is allowed to transmit its packet. During periods in which the channel is clear, the transmitting node decrements its back-off counter. When the back-off counter reaches zero, the node transmits the packet. Since the probability that two nodes will choose the same back-off factor is small, collisions between packets are minimised. Collision detection, as employed in wired Ethernet, cannot be used in a simplex radio system since a transmitting node cannot listen to any other nodes. In IEEE 802.11, whenever a packet is ready for transmission, the transmitting node first sends out a short ready-to-send (RTS) packet containing information on the length of the packet. If the receiving node hears the RTS, it responds with a short clear-to-send (CTS) packet. After this exchange, the transmitting node sends its packet. When the packet is received successfully, as determined by a cyclic redundancy check (CRC), the receiving node transmits an acknowledgment (ACK) packet. This back-and-forth exchange is necessary to avoid the problem known as the hidden node problem.

The IEEE 802.11b standard (also known as IEEE 802.11 High Rate or Wi-Fi) was developed as a higher speed physical layer extension, in the 2.4 GHz band, to the original IEEE 802.11 mentioned above. The standard is backward compatible with IEEE 802.11. The following sections provide a technical summary of the Wi-Fi standard.

4.2.2.1 Main Parameters of 802.11b

The general parameters for the 802.11b standard are given in Table 5 below.

Table 5: Key parameters for IEEE 802.11b [262][263]

Key Parameter | Description
Operating Frequency | 2.4 GHz ISM band (depending on local regulations)
Access Scheme | Infrared PHY: provides 1 Mbit/s with optional 2 Mbit/s; the 1 Mbit/s version uses Pulse Position Modulation with 16 positions (16-PPM) and the 2 Mbit/s

version uses 4-PPM. FHSS at lower speeds (1 Mbit/s using 2-level GFSK and 2 Mbit/s using 4-level GFSK). DSSS at lower speeds (1 and 2 Mbit/s using DBPSK and DQPSK respectively). DSSS at higher speeds (5.5 Mbit/s and 11 Mbit/s).
Modulation | DBPSK and DQPSK with an 11-chip Barker sequence for 1 and 2 Mbit/s. Complementary Code Keying (CCK) and QPSK for 5.5 Mbit/s and 11 Mbit/s. Optional: PBCC (Packet Binary Convolutional Code), which uses a 64-state binary convolutional code, for 5.5 Mbit/s and 11 Mbit/s.
Data Rate | 1, 2, 5.5 and 11 Mbit/s
Transmission Format | Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA). There are two types of packets. Each packet consists of a PLCP preamble and a PLCP header; the PSDU (payload) is appended to the PLCPs to produce a PPDU. Two packet formats have been defined: Long PLCP PPDU format, with a PLCP Preamble of 144 bits and a PLCP Header of 48 bits modulated at 1 Mbit/s using DBPSK, and a PSDU at 1 Mbit/s using DBPSK, 2 Mbit/s using DQPSK, or 5.5 or 11 Mbit/s using CCK; Short PLCP PPDU format, with a PLCP Preamble of 72 bits modulated at 1 Mbit/s and a PLCP Header of 48 bits modulated at 2 Mbit/s using DQPSK, and a PSDU variable at 2, 5.5 or 11 Mbit/s.

DSSS uses much wider channels, approximately 22 MHz in width; the transmission is spread over the entire 22 MHz range. In Europe there are 13 operating channels and in North America there are 11 operating channels, with 5 MHz separation between channels in both regions. The channels overlap each other and hence, at any particular time, only 3 channels can be used simultaneously. The chosen sets of channels are orthogonal to each other to ensure minimal interference. Regulations in Japan allow the use of just a single channel.

4.2.2.2 Throughput analysis

Although the data rates specified for 802.11b look promising, in practice the high data-rates required for digital TV and video applications can be difficult to achieve. It is common for the physical layer overhead to consume 30-50% of the available bandwidth. An 802.11b system running at the full 11 Mbit/s rate will provide an application throughput in the order of 5 Mbit/s in ideal conditions, which is sufficient for a single video channel. If there are many transmission errors, the actual throughput will drop dramatically as the receiving station must

advise the transmitting station of the frames in error and then wait for retransmissions. In difficult channel conditions, 802.11b will fall back to a lower transmission mode (either 5.5 Mbit/s or 2 Mbit/s).

4.3 COFDM WLANs: 802.11a, 802.11g, Hiperlan/2, HiSWANa

The remainder of this section will now focus on the newer COFDM based WLAN standards. With the exception of 802.11g, these systems all operate in the 5 GHz band. While these technologies will be more expensive than their 2.4 GHz counterparts, the far lower levels of expected interference make these bands desirable. However, radio wave propagation at 5 GHz suffers higher losses, and this results in lower received powers for a given transmit power.

The physical layers of the 802.11a, 802.11g, Hiperlan/2 and HiSWANa standards are very similar and are all based on the use of Coded Orthogonal Frequency Division Multiplexing. COFDM is used to combat frequency-selective fading and to randomize the burst errors caused by a wideband fading channel. The physical layer modes (defined in Table 6), with different coding and modulation schemes, are selected by a link adaptation scheme [241]. The exact mechanism of this process is not specified in the standards and its implementation can in principle be specified in WCAM. The HiSWANa physical layer is very similar to Hiperlan/2. Figure 15 shows the reference configuration of a COFDM based transmitter.

Figure 15: OFDM-based WLAN transmitter (PDU train from the DLC passes through scrambling, 1/2-rate convolutional coding, puncturing, interleaving and mapping to produce OFDM PHY bursts)

Data for transmission is supplied to the physical layer from the MAC in the form of an input packet. This data is derived from the application stream, and the mapping of video data to PHY packets is a process requiring careful design. The data for transmission is input to a scrambler that prevents long runs of 1s and 0s in the input data from being sent over the radio channel (resulting in spectral mask problems). Although both 802.11a/g and Hiperlan/2 scramble the data with a length-127 pseudo random sequence, the initialization of the scrambler is different. The scrambled data is input to a convolutional encoder. The encoder consists of a 1/2-rate mother code and subsequent puncturing. The puncturing schemes facilitate the use of code rates 1/2, 3/4, 9/16 (Hiperlan/2 only) and 2/3 (802.11a/g only). In the case of 16-QAM, Hiperlan/2 uses rate 9/16 instead of rate 1/2 in order to ensure an integer number of OFDM symbols per PHY layer packet. The rate 2/3 is used only for the case of 64-QAM in 802.11a. Note that there is no equivalent mode in Hiperlan/2, which also uses additional puncturing in order to keep an integer number of OFDM symbols with its 54-byte packets.
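As an illustration of the scrambling stage described above, the sketch below implements a length-127 scrambler of the kind used by these physical layers, assuming the generator polynomial x^7 + x^4 + 1; the seed value shown is arbitrary, since the standards differ precisely in how the initial state is chosen.

```python
def scramble(bits, seed=0b1011101):
    """Length-127 scrambler with generator x^7 + x^4 + 1 (sketch).

    'bits' is an iterable of 0/1 data bits, 'seed' a non-zero 7-bit initial
    state.  Applying the same function again with the same seed descrambles.
    """
    # state[0] is x1 ... state[6] is x7 of the shift register
    state = [(seed >> i) & 1 for i in range(7)]
    out = []
    for b in bits:
        feedback = state[3] ^ state[6]        # taps at x4 and x7
        out.append(b ^ feedback)              # scrambled (or descrambled) bit
        state = [feedback] + state[:6]        # shift the register
    return out

data = [1, 0, 1, 1, 0, 0, 1, 0] * 4
scrambled = scramble(data)
assert scramble(scrambled) == data            # same seed recovers the data
```

Because the scrambling sequence depends only on the seed, the receiver only needs to recover the initial state to descramble, which is why the two standards can reuse the same generator while initialising it differently.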

Figure 16: OFDM sub-carriers (20 MHz channel)

The coded data is interleaved in order to prevent error bursts from being input to the convolutional decoding process in the receiver. The interleaved data is subsequently mapped to data symbols according to either a BPSK, QPSK, 16-QAM or 64-QAM constellation. The OFDM modulation is implemented by means of an inverse FFT. 48 data symbols and 4 pilots are transmitted in parallel in the form of one OFDM symbol. This is illustrated in Figure 16. Numerical values for the OFDM parameters are given in Table 7. In order to prevent Inter-Symbol Interference (ISI) and Inter-Carrier Interference (ICI) due to delay spread, a guard interval is implemented by means of a cyclic extension (see the figure below). Thus, each OFDM symbol is preceded by a periodic extension of the symbol itself. The total OFDM symbol duration is T_total = T_g + T, where T_g represents the guard interval and T represents the useful OFDM symbol duration. When the guard interval is longer than the excess delay of the radio channel, ISI is eliminated.

Fig. - OFDM symbol with guard interval (the guard interval GI of duration T_g is a copy of the end of the data portion of duration T)

The OFDM receiver basically performs the reverse operations of the transmitter. However, the receiver is also required to undertake AGC, time and frequency synchronization and channel estimation. Training sequences are provided in the preamble for the specific purpose of supporting these functions. Two OFDM symbols are provided in the preamble in order to support the channel estimation process. A priori knowledge of the transmitted preamble signal facilitates the generation of a vector defining the channel estimate, commonly referred to as the Channel State Information (CSI). The channel estimation preamble is formed such that the two symbols effectively provide a single guard interval of length 1.6 µs. This format makes it particularly robust to ISI. By averaging over two OFDM symbols, the distorting effects of noise on the channel estimation process can also be reduced.
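The guard-interval mechanism described above is easy to express in code. The sketch below is illustrative only; it assumes a 64-point IFFT and a 16-sample guard interval (0.8 µs at 20 Msample/s) and omits pilot insertion and the exact sub-carrier ordering used by the standards.

```python
import numpy as np

def ofdm_symbol_with_gi(data_symbols, n_fft=64, n_guard=16):
    """Build one OFDM symbol and prepend its cyclic-prefix guard interval.

    'data_symbols' are the complex constellation points mapped onto the
    sub-carriers (naively zero-padded here for simplicity).
    """
    carriers = np.zeros(n_fft, dtype=complex)
    carriers[:len(data_symbols)] = data_symbols       # simplistic sub-carrier mapping
    time_signal = np.fft.ifft(carriers)               # OFDM modulation via IFFT
    guard = time_signal[-n_guard:]                    # copy of the symbol's tail
    return np.concatenate([guard, time_signal])       # T_total = T_g + T

qpsk = (2 * np.random.randint(0, 2, 48) - 1
        + 1j * (2 * np.random.randint(0, 2, 48) - 1)) / np.sqrt(2)
symbol = ofdm_symbol_with_gi(qpsk)
print(len(symbol))   # 80 samples: 16 guard + 64 useful (4.0 us at 20 Msample/s)
```

Because the prefix is a cyclic copy of the symbol itself, any multipath echo shorter than the guard interval only rotates the sub-carrier phases and does not cause inter-symbol interference.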

Table 6: Transmission modes

Mode | Modulation | Coding rate R | Nominal bit-rate (Mbit/s) | Coded bits per OFDM symbol | Data bits per OFDM symbol
1 | BPSK | 1/2 | 6 | 48 | 24
2 | BPSK | 3/4 | 9 | 48 | 36
3 | QPSK | 1/2 | 12 | 96 | 48
4 | QPSK | 3/4 | 18 | 96 | 72
5 | 16-QAM (H/2 only) | 9/16 | 27 | 192 | 108
6 | 16-QAM (IEEE only) | 1/2 | 24 | 192 | 96
7 | 16-QAM | 3/4 | 36 | 192 | 144
8 | 64-QAM | 3/4 | 54 | 288 | 216
9 | 64-QAM (IEEE only) | 2/3 | 48 | 288 | 192

Table 7: OFDM parameters for WLANs

Parameter | Value
Available Bandwidth (W) | 20 MHz
Useful Symbol Duration (T) | 3.2 µs
Guard Interval Duration (T_g) | 0.8 µs
Total Symbol Duration (T_symbol) | 4.0 µs
Number of data sub-carriers (N_D) | 48
Number of pilot sub-carriers (N_P) | 4
FFT Size | 64
Sub-carrier spacing (Δf) | W/64 = 312.5 kHz
Total Useful Bandwidth (B_u) | 54/64 x W = 16.875 MHz

Hiperlan/2 and 802.11a/g use different training sequences in the preamble. The training symbols used for channel estimation are the same, but those provided for time and frequency synchronization are different. Decoding of the convolutional code will typically be implemented by means of a soft decision Viterbi decoder, which makes use of the CSI.
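The entries in Table 6 follow directly from the OFDM parameters in Table 7: each 4 µs symbol carries 48 data sub-carriers, so the nominal bit rate is (48 x bits per constellation point x code rate) / 4 µs. A small sketch of this calculation, assuming those standard parameter values:

```python
# Nominal bit rate of a COFDM WLAN mode from its modulation and code rate.
BITS_PER_POINT = {"BPSK": 1, "QPSK": 2, "16-QAM": 4, "64-QAM": 6}

N_DATA_CARRIERS = 48          # data sub-carriers per OFDM symbol
SYMBOL_DURATION_US = 4.0      # 3.2 us useful + 0.8 us guard interval

def nominal_bitrate_mbps(modulation, code_rate):
    coded_bits = N_DATA_CARRIERS * BITS_PER_POINT[modulation]   # per OFDM symbol
    data_bits = coded_bits * code_rate
    return data_bits / SYMBOL_DURATION_US                       # Mbit/s

print(nominal_bitrate_mbps("BPSK", 1/2))     # 6.0  -> lowest common mode
print(nominal_bitrate_mbps("16-QAM", 9/16))  # 27.0 -> Hiperlan/2 only
print(nominal_bitrate_mbps("64-QAM", 3/4))   # 54.0 -> highest common mode
```

The calculation also makes clear why Hiperlan/2 chose rate 9/16 for 16-QAM: it yields 108 data bits per symbol, which divides its fixed 54-byte packets into an integer number of OFDM symbols.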

4.3.1 Quality of Service (QoS) support

The centralised MAC in Hiperlan/2 is capable of offering strong time-bounded QoS support, something that is difficult to achieve with the contention-based Ethernet style 802.11 MAC. The connection-oriented nature of Hiperlan/2 makes it straightforward to implement support for QoS. Each connection can be assigned a specific QoS, for instance in terms of bandwidth, delay, jitter or bit-error rate. It is also possible to use a simpler approach, where each connection is assigned a priority level relative to other connections. This QoS support, in combination with the high transmission rate, facilitates the simultaneous transmission of many different types of data streams, e.g. video, voice and data. From this viewpoint, the Hiperlan/2 MAC is strongly recommended for video applications. Unfortunately, within the market place, computer-based WLAN products currently dominate and the 802.11a MAC is seen as better for supporting high volumes of non-time-sensitive data communications. This has led to a lack of products conforming to the Hiperlan/2 standard, which is better suited to multimedia traffic.

Table 8: Characteristics of 802.11, 802.11b, 802.11g, 802.11a and Hiperlan/2

Characteristic | 802.11 | 802.11b | 802.11g | 802.11a | Hiperlan/2
Spectrum | 2.4 GHz | 2.4 GHz | 2.4 GHz | 5 GHz | 5 GHz
Max physical rate | 2 Mbit/s | 11 Mbit/s | 54 Mbit/s | 54 Mbit/s | 54 Mbit/s
Medium access control | Carrier sense CSMA/CA | Carrier sense CSMA/CA | Carrier sense CSMA/CA | Carrier sense CSMA/CA | Central resource control / TDMA/TDD
Connectivity | Conn-less | Conn-less | Conn-less | Conn-less | Conn-orientated
Multicast | Yes | Yes | Yes | Yes | Yes
QoS support | PCF | PCF | PCF | PCF | ATM/802.1p/RSVP/DiffServ (full control)
Frequency selection | FH or DSSS | DSSS | Single carrier | Single carrier | Single carrier/DFS
Authentication | No | No | No | No | NAI/IEEE address/X.509
Encryption | 40-bit RC4 | 40-bit RC4 | 40-bit RC4 | 40-bit RC4 | DES, 3DES
Fixed network support | Ethernet | Ethernet | Ethernet | Ethernet | Ethernet, IP, UMTS

In 802.11a the Point Coordination Function (PCF) is optionally defined to allow certain time slots to be allocated for real-time critical traffic. Table 8 above summarises the characteristics of 802.11, 802.11b, 802.11g, 802.11a and Hiperlan/2. The IEEE is considering an extension to 802.11a, known as 802.11e [237]. This will improve QoS support for real-time applications at the expense of data throughput.

4.3.2 5 GHz radio regulations

Table 9 gives the power spectrum requirements of WLANs for different geographic regions. The Hiperlan/2 standard supports the use of Dynamic Frequency Selection (DFS) in order to minimise interference when multiple APs are employed (such as in a dense grouping of dwellings). The listen-before-talk nature of the MAC protocol employed in 802.11 networks is also expected to offer advantages in the presence of interference. However, for 802.11a devices to work in Europe, DFS is mandatory.

The Hiperlan/2 radio network is defined in such a way that there are core independent PHY and Data Link Control (DLC) layers as well as a set of convergence layers (CL) for interworking with IP, Ethernet and UMTS. IEEE 802.11a defines similarly independent PHY and MAC layers (with the MAC common to multiple PHYs within the 802.11 standard) and a similar approach to network protocol convergence is expected. Currently, IEEE 802.11a supports interworking with Ethernet.

Table 9: WLAN spectrum overview

USA (U-NII band, FCC Part 15 sub E): band A, power limit 50 mW (200 mW EIRP), indoor only; band B, power limit 250 mW (1 W EIRP), indoor/outdoor; unlicensed; no coexistence/etiquette requirements.
Europe (ERC Decision (99)23): bands A and B, 200 mW EIRP, Hiperlan, indoor only; band C, 1 W EIRP, Hiperlan, indoor/outdoor; licence exempt; DFS and TPC mandatory.
Japan: 200 mW, partly unlicensed and partly licensed (FWA); DFS and TPC mandatory in some bands, with other allocations under consideration or not allowed.

Essentially, the role of the Convergence Layer is twofold: to map the service requirements of the fixed core network to the services offered by the DLC layer, and to convert packets received from the core network to the format expected at the lower Hiperlan/2 layers. This

interworking structure will enable the interaction of Hiperlan/2 with evolving 3G mobile networks to be defined in a flexible and future-proof manner.

4.4 IEEE 802.11 MAC

To access the medium, IEEE 802.11 provides two types of service: asynchronous and contention free [241]. The asynchronous type implements a CSMA/CA MAC protocol with back-off, known as the Distributed Coordination Function (DCF). DCF defines a basic access method and an optional four-way handshaking technique known as request-to-send/clear-to-send (RTS/CTS). The contention free service is provided by the Point Coordination Function (PCF) in order to support time-bounded services. The PCF is optional and is briefly discussed later in this section. It should be noted that we believe the PCF is not implemented in current products, and as such emphasis at this stage in the project has been given to determining the suitability or otherwise of the DCF mechanism.

A mobile terminal must sense the medium for a specific time interval (called a DIFS, Distributed Inter Frame Space) before transmitting a packet. If more than one station were waiting for the channel to become idle before transmitting, a collision would certainly occur. Therefore, after a DIFS a station waits a random back-off time before trying to transmit; this is used for Collision Avoidance (CA). Once the back-off time has expired, the terminal can access the medium once again. The back-off time is calculated by the station and is an integer number of timeslots taken in a range called the Contention Window (CW). A back-off timer decrements until the medium becomes busy again or the timer reaches zero. If the medium becomes busy before the timer reaches zero, the station must again wait a time interval (DIFS) before the timer starts to decrement. When the timer reaches zero, the station starts to transmit. If the timers of more than one station reach zero at the same time then there will be a collision. Stations involved in a collision generate new back-off times drawn from a CW which is twice as large as the previous one. Because a collision in a wireless environment is undetectable, a positive acknowledgement is used to notify that a frame has been successfully received. If this acknowledgement is not received, the terminal retransmits the packet. The basic DCF access mechanism can be seen in Figure 17. The transmission cycle consists of the following phases: DIFS, back-off, data packet transmission, SIFS (Short Inter Frame Space) and acknowledgement (ACK).

Since a station is unable to listen and detect a collision while transmitting, the station sometimes continues transmitting even if a collision occurs and (especially for long packets) considerable time is wasted. There is also a problem with hidden nodes: if two stations are communicating and a third, hidden station (unaware of the transmission of one of the two) starts transmitting, a collision can occur. A solution that reduces the impact of these two problems is handshaking. As before, the station waits a time interval (DIFS) and a back-off period, but instead of transmitting the data the station transmits an RTS. If the receiver of the RTS agrees, a CTS is sent in response. The CTS is sent after an idle period equal to one SIFS. After receiving the CTS and when a SIFS time interval has elapsed, the initiating station starts transmitting the data; the rest of the procedure is then the same as that mentioned previously but without the handshaking.

Since the control signals are short in comparison to a data frame, the time lost due to a collision is decreased. Stations able to receive the control signals (whether sent by the source or destination station) use the duration field in these control signals to prevent collisions. The duration field contains information about the remaining time to complete a transmission, including control signals. The signal flow is illustrated in Figure 18. The contention free mode uses a Point Coordination Function (PCF) to access the medium. A Point Coordinator enables stations to transmit without contending for the medium by polling

them for traffic. This mode of operation, which provides a contention free service, is optional (so there is no guarantee it can be supported by all nodes) and is also required to coexist with the DCF. In contrast to the DCF, which offers a best effort service, the PCF is supposed to reduce delays for some portion of the traffic. The problem with the PCF, according to [237], is that low delays require a short time between the polling intervals, and this causes a lot of overhead, thereby degrading the throughput of the system.

In the DCF there is a possibility of giving a session priority over others by defining a smaller initial CW and DIFS. This is not implemented in the standard as yet, but has been proposed as an Enhanced DCF (EDCF) to better support Quality of Service (QoS). Since time-bounded QoS is of growing importance, the IEEE is now considering an extension to 802.11a known as 802.11e [265]. This will employ the EDCF principle in order to provide better Quality of Service. Stations that operate under 802.11e are called enhanced stations and may optionally work as centralized controllers (which is advisable for a home wireless gateway). The QoS support is realised with the introduction of Traffic Categories (TC). The back-off required for each MAC Service Data Unit (MSDU) is parameterised with TC-specific parameters. The contention window is now a parameter dependent on the TC. An Arbitration Interframe Space (AIFS) is introduced, which is at least as long as the DIFS and can be enlarged individually for each TC. Priority over legacy stations can be achieved by setting CW<15 and AIFS=DIFS. Figure 19 [237] illustrates the multiple back-off of MSDU streams with different priorities. Even with the above enhancements, the 802.11e MAC will result in some degree of variable bandwidth and jitter, which can result in poor service quality for video based applications.

Figure 17: DCF Access Mechanism: basic DCF mechanism (a)
Figure 18: DCF Access Mechanism: RTS/CTS method (b)
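The effect of the TC-specific EDCF parameters described above can be illustrated with a toy back-off draw; the AIFS and CWmin values below are placeholders chosen for the example, not the values defined in 802.11e.

```python
import random

# Illustrative (not normative) per-traffic-category EDCF parameters.
TRAFFIC_CATEGORIES = {
    "voice": {"aifs_slots": 2, "cw_min": 7},    # high priority: short AIFS, small CW
    "video": {"aifs_slots": 2, "cw_min": 15},
    "data":  {"aifs_slots": 4, "cw_min": 31},   # legacy-like DIFS and CW
}

def channel_access_delay_slots(tc, retries=0):
    """Slots a station defers before transmitting one MSDU of category 'tc'.

    The contention window doubles with every retry (binary exponential
    back-off), starting from the category-specific CWmin.
    """
    params = TRAFFIC_CATEGORIES[tc]
    cw = (params["cw_min"] + 1) * (2 ** retries) - 1
    backoff = random.randint(0, cw)
    return params["aifs_slots"] + backoff

for tc in TRAFFIC_CATEGORIES:
    average = sum(channel_access_delay_slots(tc) for _ in range(10000)) / 10000
    print(f"{tc:5s}: {average:.1f} slots on average")
```

Running the loop shows that categories with a shorter AIFS and smaller CWmin win the channel sooner on average, which is exactly how EDCF gives delay-sensitive streams priority over best-effort data.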

Figure 19: Multiple back-off of MSDU streams with different priorities

4.4.1 IEEE 802.11a MAC PDU

The Physical Layer Convergence Procedure (PLCP) maps a MAC PDU into a frame format. Figure 20 shows the format of a complete packet (PPDU) in 802.11a, including the preamble, header and Physical Layer Service Data Unit (PSDU, or payload). The header contains information regarding the length of the payload and the transmission rate (since variable modes are supported), a parity bit and six zero tail bits. The header is always transmitted using the strongest (and therefore the slowest) transmission mode in order to ensure robust reception. Hence, it is mapped onto a single BPSK modulated OFDM symbol. The rate field conveys information about the type of modulation and the coding rate used in the rest of the packet. The length field takes a value between 1 and 4095 and specifies the number of bytes in the PSDU. The parity bit provides positive (even) parity over the first seventeen bits of the header. The six tail bits are used to reset the convolutional encoder and to terminate the code trellis in the decoder. The first 7 bits of the service field are set to zero and are used to initialize the descrambler; the remaining nine bits are reserved for future use. The pad bits are used to ensure that the number of bits in the PPDU maps to an integer number of OFDM symbols.

Figure 20: PPDU frame format (PREAMBLE: 12 symbols; SIGNAL: one BPSK, 1/2-rate OFDM symbol carrying RATE (4 bits), Reserved (1 bit), LENGTH (12 bits), Parity (1 bit) and Tail (6 bits); DATA: a variable number of OFDM symbols, at the mode indicated by RATE, carrying SERVICE (16 bits), the PSDU, Tail (6 bits) and Pad bits)
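To make the header layout concrete, the sketch below packs a hypothetical SIGNAL field from the pieces just described (RATE, reserved bit, LENGTH, even parity over the first 17 bits, tail bits); the 4-bit RATE pattern is passed in rather than taken from the standard's encoding table, so the example value shown is a placeholder.

```python
def build_signal_field(rate_bits, length_bytes):
    """Assemble the 24-bit SIGNAL field of an 802.11a PPDU (sketch).

    'rate_bits' is the 4-bit RATE pattern for the chosen mode and
    'length_bytes' the PSDU length in bytes (1-4095).
    """
    assert len(rate_bits) == 4 and 1 <= length_bytes <= 4095
    reserved = [0]
    length = [(length_bytes >> i) & 1 for i in range(12)]   # 12-bit LENGTH field
    first_17 = rate_bits + reserved + length
    parity = [sum(first_17) % 2]                            # even parity over the first 17 bits
    tail = [0] * 6                                          # flushes the convolutional coder
    return first_17 + parity + tail                         # 24 bits -> one BPSK OFDM symbol

signal = build_signal_field([1, 1, 0, 1], length_bytes=1500)  # placeholder RATE bits
print(len(signal), signal)
```

Keeping the whole header inside one robust BPSK symbol is what allows the receiver to learn the rate and length of the packet before attempting to demodulate the higher-order constellation used for the DATA portion.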

4.4.2 IEEE 802.11g

802.11g represents a high-rate extension to 802.11b in the 2.4 GHz ISM band [236]. 802.11g uses the same link-adaptive COFDM modulation scheme as 802.11a but also supports mandatory backwards compatibility with 802.11b. Optional modes based on a CCK-COFDM hybrid and a Packet Binary Convolutional Code (PBCC) scheme are also offered in the standard. Clearly, there are considerable similarities between the baseband modulation techniques specified in 802.11a and 802.11g, with the main distinctions between these two standards being the frequency band specified for operation and the additional optional modes of 802.11g.

Both 802.11a and 802.11g are PHY layer specifications. A system based on either of these PHYs will employ the common 802.11 MAC. However, some differences in the values of the MAC parameters affect the throughput performance. The RTS/CTS mechanism described earlier was provided in order to minimise collisions between terminals in the network due to the hidden node problem. It is interesting to note that in 802.11g, the OFDM modulated signals cannot be demodulated by legacy 802.11b devices, potentially resulting in collisions. The use of the RTS/CTS protocol is a potential solution to this problem as well as to the hidden node problem. The overheads introduced by the MAC are mode dependent and also differ between 802.11a and 802.11g. As described, overheads are primarily due to the requirement to implement DIFS and SIFS between data packet transmissions as well as ARQ signalling. The difference in overhead between 802.11a and 802.11g is due to the fact that different lengths are specified for the DIFS and SIFS in the two standards. In order to inter-operate effectively with legacy 802.11b devices, 802.11g devices are required to implement the DIFS, SIFS and ARQ in a manner common with 802.11b. If backward compatibility with 802.11b devices were neglected, 802.11g devices could operate with the same MAC overhead as 802.11a devices.

5 Security

5.1 Security of the content

There are many reasons why we may want to protect data: with the coming of the information age, securing the transmission of information is becoming a great priority. In any type of communication there are four main goals for cryptography. The first and most obvious is secrecy, which is of course keeping information out of the hands of unauthorised users. The second is authentication, i.e. confirming the identity of the party with whom you are communicating before sending data. The third is non-repudiation: a signature by the sender allows the receiver to prove that the sender actually sent the message. Finally, integrity control is the last goal: being sure that the message received is the one that was sent and that it was not tampered with along the way.

Content Encryption Techniques

Introduction to Encryption Techniques

Symmetric Key Encryption

Symmetric key, or secret-key, encryption algorithms use a single secret key to encrypt and decrypt data. The key must be secured from access by unauthorised agents because any party that has the key can use it to decrypt data. Secret-key encryption is also referred to as symmetric encryption because the same key is used for encryption and decryption. Secret-key encryption algorithms are extremely fast (compared to public-key algorithms) and are well suited to performing cryptographic transformations on large streams of data.

Figure 21: Generic Encryption Process

Typically, secret-key algorithms called block ciphers are used to encrypt one block of data at a time. Block ciphers (like RC2, DES, TripleDES and Rijndael/AES) cryptographically transform an input block of n bytes into an output block of encrypted bytes. If one wants to encrypt or decrypt a sequence of bytes, one has to do it block by block. Because n is small (n = 8 bytes for RC2, DES and TripleDES; n = 16 for Rijndael/AES), values larger than n have to be encrypted one block at a time. Another way to encrypt large chunks of data is to use modes of operation [140], which involve additional data to perform the encryption, mostly Initialisation Vectors (IVs). The disadvantage of secret-key encryption is that it presumes two parties have agreed on a key and communicated its value; also, the key must be kept secret from unauthorised users. Because of these problems, secret-key encryption is often used in conjunction with public-key encryption to privately communicate the values of the key.
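As a concrete illustration of block-cipher encryption with an IV-based mode, the sketch below uses the PyCryptodome library (an assumption; any AES implementation would serve) to encrypt and decrypt a byte string with AES in CBC mode, padding the plaintext to a whole number of 16-byte blocks.

```python
from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad, unpad

key = get_random_bytes(16)            # 128-bit secret key shared by both parties
iv = get_random_bytes(16)             # initialisation vector, need not be secret

plaintext = b"example video frame payload"

# Encrypt: pad to a whole number of 16-byte blocks, then apply AES-CBC.
cipher = AES.new(key, AES.MODE_CBC, iv)
ciphertext = cipher.encrypt(pad(plaintext, AES.block_size))

# Decrypt with the same key and IV, then remove the padding.
decipher = AES.new(key, AES.MODE_CBC, iv)
recovered = unpad(decipher.decrypt(ciphertext), AES.block_size)
assert recovered == plaintext
```

The example also shows the practical cost noted above: before any of this can run, the key (and usually the IV) must somehow be shared privately, which is where public-key techniques come in.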

Public-key encryption

Public-key encryption uses a private key that must be kept secret from unauthorised users and a public key that can be made available to anyone. The public key and the private key are mathematically linked: data encrypted with the public key can be decrypted only with the private key, and data signed with the private key can be verified only with the public key. The public key can be made available to anyone; it is used for encrypting data to be sent to the keeper of the private key. Both keys are unique to the communication session. Public-key cryptographic algorithms are also known as asymmetric algorithms because one key is required to encrypt data while another is required to decrypt it. Public-key encryption has a much larger key-space (range of possible values for the key) and is therefore less susceptible to exhaustive attacks that try every possible key. A public key is easy to distribute because it does not have to be secured. Public-key algorithms can be used to create digital signatures to verify the identity of the sender of data. However, public-key algorithms are extremely slow (compared to secret-key algorithms) and are not designed to encrypt large amounts of data; they are useful only for transferring very small amounts of data. Typically, public-key encryption is used to encrypt a key and IV to be used by a secret-key algorithm. After the key and IV are transferred, secret-key encryption is used for the remainder of the session.

The almost universal public/private key algorithm is named RSA and was developed by a group that is known by the initials of the three developers: Rivest, Shamir, Adleman. The algorithm follows the outline below:
- Choose two large primes, p and q.
- Compute n = p x q and z = (p-1) x (q-1).
- Choose a number relatively prime to z and call it d.
- Find e such that e x d = 1 (mod z).
Encryption can take place now by dividing the plaintext message into blocks so that each plaintext block P falls into the interval 0 <= P < n. To encrypt P, compute C = P^e (mod n). To decrypt C, compute P = C^d (mod n). Here e (together with n) represents the public key, while d is kept as the private key.

Block ciphers

We do not intend here to give an exhaustive review of all known and secure block ciphers. We will just briefly describe two common, standard algorithms which have been standardised by NIST and which can be used to protect video content in the WCAM project.

Data Encryption Standard (DES)

DES is the Data Encryption Standard, an encryption block cipher defined and endorsed by the U.S. government in 1977 as an official standard; the details can be found in the latest official FIPS (Federal Information Processing Standards) publication concerning DES [153]. It was originally developed at IBM. DES has been extensively studied since its publication and is the most well-known and widely used cryptosystem in the world. DES is a symmetric cryptosystem: when used for communication, both sender and receiver must know the same secret key, which is used both to encrypt and decrypt the message. DES can also be used for single-user encryption, such as to store files on a hard disk in encrypted form. In a multi-user environment, secure key distribution may be difficult; public-key cryptography provides an ideal solution to this problem. DES has a 64-bit block size and uses a 56-bit key during encryption. NIST has re-certified DES as an official U.S. government encryption standard every five years; DES was last re-certified in 1993, by default. NIST has indicated, however, that it may not re-certify DES again, due to the standardisation of a new algorithm in 2001: AES.

No easy attack on DES has been discovered, despite the efforts of many researchers over many years. The obvious method of attack is a brute-force exhaustive search of the key space; this takes 2^55 steps on average. The first attack on DES that is better than exhaustive search in terms of computational requirements was announced by Biham and Shamir using a new technique known as differential cryptanalysis [152]. This attack requires the encryption of 2^47 chosen plaintexts, that is, plaintexts chosen by the attacker. More recently, Matsui has developed another attack, known as linear cryptanalysis [156]. However, the consensus is that DES, when used properly, is still secure and that triple encryption DES is far more secure than DES. Both single and triple encryption DES are used extensively in a wide variety of cryptographic systems.

Advanced Encryption Standard (AES)

AES was standardised in 2001 by NIST to replace DES [154]. The chosen algorithm was Rijndael, a block cipher specifically designed by Joan Daemen and Vincent Rijmen as a candidate algorithm for the AES. The cipher has a variable block length and key length. The keys can have a length of 128, 192 or 256 bits to encrypt blocks with a length of 128, 192 or 256 bits (all nine combinations of key length and block length are possible). Both block length and key length can be extended very easily to multiples of 32 bits. The number of rounds to be performed during the execution of the algorithm depends on the key size: 10 rounds when the key is 128 bits long, 12 when the key is 192 bits long, and 14 when the key is 256 bits long. For both its Cipher and Inverse Cipher, the AES algorithm uses a round function that is composed of four different byte-oriented transformations:
- byte substitution using a substitution table (S-box),
- shifting rows of the State array by different offsets,
- mixing the data within each column of the State array,
- adding a Round Key to the State.

Modes of operation

Once a symmetric key block cipher algorithm has been chosen as the underlying algorithm, and a secret, random key, denoted K, has been established among all of the parties to the communication, we still have to choose the mode of operation which will be used to apply the protection algorithm. A confidentiality mode of operation of the block cipher algorithm consists of two processes that are inverses of each other: encryption and decryption. Encryption is the transformation of a usable message, called the plaintext, into an unreadable form, called the ciphertext; decryption is the transformation that recovers the plaintext from the ciphertext. The modes of operation described in this document have been recommended by NIST in [140].

Electronic Codebook (ECB)

The ECB mode is a confidentiality mode that features, for a given key, the assignment of a fixed cipher block to each plaintext block. It is analogous to the assignment of code words in a codebook. The plaintext size must be a multiple of the cipher block size (128 bits in the case of AES); otherwise, padding is required to increase its size.

Cipher Block Chaining (CBC)

The CBC mode is a confidentiality mode whose encryption process features the combining ("chaining") of the plaintext blocks with the previous ciphertext blocks. It requires an Initialisation Vector (IV) to combine with the first plaintext block. The IV need not be secret,

but it must be unpredictable. The plaintext size must be a multiple of the cipher block size.

Cipher Feedback Block (CFB)

The CFB mode is a confidentiality mode that features the feedback of successive ciphertext segments into the input blocks of the forward cipher to generate output blocks that are exclusive-ORed with the plaintext to produce ciphertext, and vice versa. A segment is defined as a block of size s, where s < cipher block size. The CFB mode also requires an IV as the initial input block, which need not be secret. The plaintext must be a multiple of the segment size. A typical choice is therefore to use one-byte segments, which makes it possible to keep the original plaintext size (measured in bytes) unchanged.

Counter (CTR)

The CTR mode is a confidentiality mode that features the application of the forward cipher to a set of input blocks, called counters, to produce a sequence of output blocks that are exclusive-ORed with the plaintext to produce the ciphertext, and vice versa. The sequence of counters must have the property that each block in the sequence is different from every other block. This condition is not restricted to a single message: across all of the messages that are encrypted under the given key, all of the counters must be distinct. A typical way to produce a set of counters is to choose the initial block and to compute the following ones with an adequate incrementing function. This mode allows one to keep the original plaintext size (measured in bits) unchanged.

Content Integrity Techniques

Message Digests

A message digest is a digital fingerprint of a message, derived by applying a mathematical algorithm to a variable-length message. There are a number of suitable algorithms, called hash functions, each having the following special properties:
- the original message (the input) is of variable length, while the message digest (the output) is of a fixed length;
- it is practically impossible to determine the original message (the input) from just the message digest (the output); this is known as being a one-way hash function;
- it is practically impossible to find two different messages (the inputs) that derive the same message digest (the output); this is known as being a collision-free hash function;
- the algorithm is relatively simple, so when computerised it is not CPU-intensive;
- the calculated digest is (often considerably) smaller than the item it represents.

Message digests are used to guarantee that no one has tampered with a message during its transit over a network: any amendment to the message will mean that the message and digest no longer correlate. Message digests can also be used to supply proof that an item of information, such as a password, is known, without actually sending the password or information in the clear. The most common message digest algorithms are MD4, MD5 and the Secure Hash Algorithm (SHA, and its upgraded version SHA-1), which offer, in that order, increasing levels of security and therefore CPU usage.

Common hash functions

MD2, MD4 and MD5 are message-digest algorithms developed by Rivest. They are meant for digital signature applications where a large message has to be compressed in a secure manner before being signed with the private key. All three algorithms take a message of arbitrary length and produce a 128-bit message digest.
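Message digests are directly available in most programming environments; for instance, Python's standard hashlib module computes MD5 and SHA-1 digests as below. The example is shown purely to illustrate the fixed-length output and the sensitivity of the digest to any change in the input.

```python
import hashlib

message = b"The quick brown fox jumps over the lazy dog"

md5_digest = hashlib.md5(message).hexdigest()     # 128-bit digest, 32 hex characters
sha1_digest = hashlib.sha1(message).hexdigest()   # 160-bit digest, 40 hex characters

print(len(md5_digest) * 4, md5_digest)            # 128 ...
print(len(sha1_digest) * 4, sha1_digest)          # 160 ...

# Any change to the message gives a completely different digest.
print(hashlib.sha1(b"The quick brown fox jumps over the lazy dog.").hexdigest())
```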

In particular, MD5 was developed by Rivest in 1991 [151]. It is basically an improved, more secure MD4. The algorithm consists of four distinct rounds, which have a slightly different design from that of MD4. Message-digest size, as well as padding requirements, remains the same. Den Boer and Bosselaers have found pseudo-collisions for MD5, but there are no other known cryptanalytic results.

The Secure Hash Algorithm (SHA), the algorithm specified in the Secure Hash Standard (SHS), was developed by NIST and published as a federal information processing standard [155]. SHA-1 was a revision to SHA published in 1995; the revision corrected an unpublished flaw in SHA. Its design is very similar to the MD4 family of hash functions developed by Rivest. The algorithm takes a message of less than 2^64 bits in length and produces a 160-bit message digest. The algorithm is slightly slower than MD5, but the larger message digest makes it more secure against brute-force collision and inversion attacks.

Digital Signatures

An additional use of RSA (and, more generally, of public key algorithms) is in digital signatures, which involves swapping the roles of the private and public keys. If a sender encrypts a message using its private key, everyone can decrypt the message using the sender's public key. A successful decryption implies that the sender, who is the only person in possession of the private key, must have sent the message. This also prevents repudiation, that is, the sender cannot claim that they did not actually send the message. A piece of data encrypted with a private key is called a digital signature. Common practice is to use a message digest (see above) as the item of data to be encrypted. Note that a signed document (whose integrity and origin are assured) is not thereby encrypted, which means that anyone could look at the original document included with the signed digest. This does not imply that the document cannot be encrypted as well, however.
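A toy illustration of this sign-and-verify round trip, combining a message digest with the RSA operations outlined earlier, is sketched below. The numbers are deliberately tiny and no padding scheme is used, so this is only a demonstration of the principle, not a secure implementation.

```python
import hashlib

# Toy RSA key pair (insecure textbook parameters, for illustration only).
p, q = 61, 53
n = p * q                      # 3233
z = (p - 1) * (q - 1)          # 3120
e, d = 17, 2753                # e*d = 1 (mod z); (e, n) is public, d is private

def sign(message: bytes) -> int:
    digest = int.from_bytes(hashlib.sha1(message).digest(), "big") % n
    return pow(digest, d, n)               # "encrypt" the digest with the private key

def verify(message: bytes, signature: int) -> bool:
    digest = int.from_bytes(hashlib.sha1(message).digest(), "big") % n
    return pow(signature, e, n) == digest  # recover with the public key and compare

msg = b"signed surveillance metadata"
sig = sign(msg)
print(verify(msg, sig))                    # True
print(verify(b"tampered metadata", sig))   # False (with overwhelming probability for this toy modulus)
```

In a real system the digest would be padded and the modulus would be at least 1024 bits, but the structure (hash, then private-key operation, then public-key verification) is the same.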

DRM - Digital Rights Management

Digital Rights Management (DRM) involves the description, layering, analysis, valuation, trading and monitoring of the rights over an individual's or organization's assets, both in physical and digital form, and of tangible and intangible value [192]. DRM covers the digital management of rights, be they rights in a physical manifestation of a work (e.g. a book) or rights in a digital manifestation of a work (e.g. an e-book). Current methods of managing, trading and protecting such assets are inefficient, proprietary, or else often require the information to be wrapped or embedded in a physical format [185].

The copyright environment consists of three main aspects: rights (what can be protected by copyright) and exceptions (e.g. copies for private use or for public libraries); enforcement of rights (sanctions for making illegal copies and for trading in circumvention devices); and management of rights (exploiting the rights). In the online world, management of rights may be facilitated by the use of technical systems called Digital Rights Management (DRM) systems [196][185].

DRM consists broadly of two elements: the identification of intellectual property and the enforcement of usage restrictions. The identification consists in the attribution of a (standard) identifier (such as the ISBN numbers for books) and the marking of the property with a sign (such as a watermark). The enforcement works via encryption, i.e. by ensuring that the digital content is only used for purposes agreed by the rights holder. DRM is the chain of hardware and software services and technologies governing the authorized use of digital content and managing any consequences of that use throughout the entire life cycle of the content [196].

Functional DRM architecture

Any DRM solution is composed of a standardized set of different building blocks. Two different views of DRM can be presented: an architectural view and a functional view. From an architectural view, three major components can be identified: the content server, the license server and the client [196]. The content server is a server component of the DRM architecture that holds the actual content, information about the products (services) that the content provider wants to distribute, and the functionality to prepare content for DRM-based distribution. The license server is responsible for managing licensing information. Licenses contain information about the identity of the user or device that wants to exercise rights to content, identification of the content to which the rights apply, and specifications of those rights. The client resides on the user's side and supplies the following functionalities: the DRM controller, the rendering application and the user's authentication mechanism [187].

Figure 22: DRM architectural components (adapted from [187])

From a functional point of view, Figure 23 summarises the most important functions of any DRM architecture: content creation and capture, content management, and content use.

Figure 23: DRM functional architecture (content creation and capture, content management, content use)

Content creation and capture: managing the creation of content to facilitate trading, including asserting rights when content is first created (or reused and extended, with appropriate rights to do so) by various content creators or providers. This module supports:
o Rights validation, to ensure that content being created from existing content includes the rights to do so.
o Rights creation, to assign rights to new content, such as specifying the rights owners and allowable use (permissions).
o Rights workflow, to process content for review and/or approval of rights.

Content management: managing and enabling the trade of content, including accepting content from creators into an asset management system. This module supports:
o Repository functions, to access content and the metadata that describes the content and the rights specifications (see Information Architecture).
o Trading functions, to assign licenses to parties who have done deals for rights over content, including, for example, royalty payments.

Content use: managing the use of content once it has been traded. This module supports:
o Permissions management, to enforce the rights associated with the content. For example, if the user has only the right to view the document, then printing will be prohibited.
o Tracking management, to monitor the use of content where such tracking is a requirement of the user's agreement. This module may need to interoperate with the trading functions to track use or to record transactions for per-use payments [196].
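A minimal sketch of the permissions-management idea is given below; the class and field names are entirely hypothetical and are not taken from any particular DRM product. A licence lists the permissions granted for a content item, and the rendering application consults it before honouring a requested action.

```python
from dataclasses import dataclass, field

@dataclass
class Licence:
    content_id: str
    user_id: str
    permissions: set = field(default_factory=set)   # e.g. {"view"} or {"view", "print"}

    def allows(self, action: str) -> bool:
        return action in self.permissions

def render(licence: Licence, action: str) -> str:
    # Permissions management: enforce the rights associated with the content.
    if not licence.allows(action):
        return f"{action} prohibited by licence for {licence.content_id}"
    return f"{action} granted for {licence.content_id}"

lic = Licence(content_id="urn:example:doc-42", user_id="alice", permissions={"view"})
print(render(lic, "view"))    # view granted ...
print(render(lic, "print"))   # print prohibited ...
```

In a real DRM client this check sits inside a trusted DRM controller and the licence itself is cryptographically protected, but the enforcement logic reduces to the same permission lookup.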

According to the architectural and functional points of view presented above, there are several standards/efforts currently available which are of interest for DRM: Content Identification (DOI, MPEG-21 DII); Content Metadata (INDECS); Rights Expression Languages (ODRL, MPEG-21 REL); Users and Devices Identification (LAP, XNS, PASSPORT). In the next sections, some of the most important of these mechanisms are presented and described in more detail. However, a DRM solution cannot exist just by the existence and application of such standards: additional technologies need to be applied in order to ensure content protection, rights enforcement, payment and others.

Content Identification

Content identification is the process of linking a unique identifier to a specific piece of content. Content identification is not something completely new to digital content: in the real world, dealing with physical content, the same concept exists (ISBN, etc.). The following table lists some of the most important content identifiers [190]:

Table 10: Content identification schemas [190]

Identifier | Standard, scope and agency (or, for draft standards, working group)
Book Item and Component Identifier (BICI) | ANSI/NISO Draft Standard for Trial Use; books
Digital Object Identifier (DOI) | ANSI/NISO Z39.84 (syntax only); any IP entity; International DOI Foundation (IDF)
International Standard Audiovisual Number (ISAN) | ISO/DIS 15706; audiovisual abstractions; ISO/TC 46/SC 9 Working Group 1
International Standard Book Number (ISBN) | ISO 2108:1992; books, software, mixed-media, etc.; The International ISBN Agency
International Standard Music Number (ISMN) | ISO 10957:1993; printed music; The International ISMN Agency
International Standard Musical Work Code (ISWC) | ISO 15707:2001; compositions; The International ISWC Agency, International Confederation of Societies of Authors and Composers (Confédération Internationale des Sociétés d'Auteurs et de Compositeurs; CISAC)
International Standard Recording Code (ISRC) | ISO 3901:1986; audio and video recordings; International Federation of the Phonographic Industry (IFPI)
International Standard Serial Number (ISSN) | ISO 3297:1998; serial publications; ISSN International Centre
International Standard Textual Work Code (ISTC) | ISO Working Draft; textual abstractions; ISO/TC 46/SC 9 Working Group 3
Publisher Item Identifier (PII) | Elsevier Science Ltd.; textual abstractions
Serial Item and Contribution Identifier (SICI) | ANSI/NISO Z39.56 (Version 2); components of serials
Unique Material Identifier (UMID), also known as Universal Media Identifier | Society of Motion Picture and Television Engineers (SMPTE) 330M-2000; digital content; SMPTE Registration Authority, LLC
MPEG-21 DIID | MPEG-21 defines a URI of the form urn:mpeg:mpeg21:diid:sss:nnn, where sss denotes the identification system and nnn denotes a unique identifier within that identification system. Valid values for sss are shown in Table 11. Further identification systems will be added through the process of a Registration Authority for sss; MPEG-21 also defines all the requirements for a Registration Authority of Identification Systems. (Internet: mpeg.nist.gov/)

DOI - Digital Object Identifier

One of the key challenges in the move from physical to electronic distribution of content is the rapid evolution of a set of common technologies and procedures to identify, or name, pieces of digital content. The International DOI Foundation (IDF) was established in 1998 to address this challenge, assuming a leadership role in the development of a framework of infrastructure, policies and procedures to support the identification needs of providers of intellectual property in the multinational, multi-community environment of the network [189]. Major components of the IDF mission involve stimulating interest in and understanding of this framework, encouraging alliances and collaborative activities to explore in depth the complex issues to be addressed, and influencing the development of standards that will ensure the appropriate level of added value and quality control across the spectrum of participation. The main activity of the IDF is to encourage the widespread implementation and use of a standard digital identifier: the Digital Object Identifier (DOI), an actionable identifier for intellectual property on the Internet [189]. The DOI has three main characteristics:
- Persistency: a DOI can be used to identify the various physical objects that are manifestations of intellectual property and can also be used to identify less tangible manifestations, the digital files that are the common form of intellectual property in the network environment.
- Actionable: a DOI identifier itself is one element of a more complex system in which a user can use a DOI to perform an action, for instance to locate the entity that it identifies.
- Interoperable: the DOI was designed to be interoperable with past, present and future identification technologies.
A DOI identifier is composed of two parts: a prefix and a suffix, separated by a forward slash (/). The prefix identifies the publisher that registered it, while the suffix is a character string that the publisher chooses [187], as in a prefix followed by /xpto.

MPEG-21 DIID - Digital Item Identification and Description

The MPEG-21 DII specification does not specify new identification systems for content elements for which identification and description schemes already exist and are in use [184]. The scope of the Digital Item Identification (DII) specification includes:
- how to uniquely identify Digital Items and parts thereof (including resources);
- how to uniquely identify IP related to the Digital Items (and parts thereof), for example abstractions;
- how to uniquely identify Description Schemes;
- how to use identifiers to link Digital Items with related information such as descriptive metadata;
- how to identify different types of Digital Items [198].

Table 11: List of identifiers compatible with MPEG-21

Identification System | Identifier (sss)
Content ID Forum (CIDF) | cid
Digital Object Identifier (DOI) | doi
EAN/UCC System | ean or ucc
International Standard Audiovisual Number (ISAN) | isan
International Standard Book Number (ISBN) | isbn
International Standard Recording Code (ISRC) | isrc
International Standard Serial Number (ISSN) | issn

68 Identification System Identifier (sss) Format of Identifier (nnn) International Standard Textual Work Code ISTC International Standard Work Code ISWC Music Industry Integrated Identifier Project Version identifier for ISAN V-ISAN Page 68/154 istc iswc mi3p visan Content Metadata Metadata refers to data that is related to describe other data. In the present context, metadata is data that is used to describe a specific content. Metadata formats that are currently used are mostly tied to the type of content that is being described. For instance, there are specific metadata schemas which are adequate for digital still images that cannot be applied to digital music [191]. The following table lists some of the main initiatives in the metadata field. Initiative Table 12 : Content metadata schemas (adapted from [190]) Interoperability of Data in ecommerce Systems (INDECS; usually written <indecs>) (Internet: Online Information exchange (ONIX) International (Internet: EDItEUR EPICS Product Information Communication Standards (Internet: Description The <indecs> project was established at the end of 1998 with support from the European Commission s Info 2000 Programme. The project developed an analysis of the requirements for metadata for e-commerce in IP in the network environment. This analysis has received widespread support. The Commission evaluated the project as having been very successful when it came to an end in March <indecs> Framework Ltd. is a not-for-profit company, limited by guarantee, established by the partners in the <indecs> project to fulfil the sole task of owning the valuable IP rights created during the project and continues to encourage the development of new <indecs>based projects, including ONIX International and the DOI system. The <indecs> metadata framework Version 2.0 was published in June 2000 (Internet: ONIX is an international standard for representing an communicating book-industry product information in electronic form, incorporating the core content which has been specified in national initiatives, such as Book Industry Communication (BIC) Basic and Association of American Publishers (AAP) ONIX Version 1. ONIX International is developed and maintained by EDItEUR jointly with BIC and the Book Industry Study Group (BISG). ONIX uses XML and a subset of the EPICS data dictionary. The primary purpose of the EPICS data dictionary is to define the content of a set of data elements which can be used potentially in a variety of carrier formats for the communication of book trade product information between computer systems. Subsequent drafts have been developed with the benefit of significant input from the <indecs> project Rights Expression Languages Rights Expression Languages are mechanisms that allow the expression of rights in a language that is commonly understandable in the digital world. Most of these languages use XML as base syntax and the most currently relevant ones are presented bellow. Initiative Table 13 : Rights expression languages (adapted from [190]) Open Digital Rights Language (ODRL) (Internet: odrl.net/) Extensible Rights Markup Language (XrML) (Internet: OMA (Internet: MPEG-21 REL (Internet: mpeg.nist.gov/) Description ODRL provides the semantics for a DRM expression language and data dictionary pertaining to all forms of digital content. The ODRL is a vocabulary for the expression of terms and conditions over digital content, including permissions, constraints, obligations, and agreements with rights holders. 
- Extensible Rights Markup Language (XrML): XrML is an XML language for describing specifications of rights, fees, and conditions for using digital content, together with message integrity and entity authentication. XrML is an extension of the Digital Property Rights Language (DPRL), a rights language developed by Xerox Palo Alto Research Center (PARC). ContentGuard, a Xerox PARC spin-off, has developed and contributed XrML as an open specification licensed on a royalty-free basis to unify the DRM industry and encourage interoperability.
- OMA: The mission of the Open Mobile Alliance is to facilitate global user adoption of mobile data services by specifying market-driven mobile service enablers that ensure service interoperability across devices, geographies, service providers, operators, and networks, while allowing businesses to compete through innovation and differentiation. OMA uses an ODRL profile as its rights expression language.
- MPEG-21 REL (Internet: mpeg.nist.gov/): MPEG REL adopts a simple and extensible data model for many of its key concepts and elements. The MPEG REL data model for a rights expression consists of four basic entities and the relationship among those entities. This basic relationship is defined by the MPEG REL assertion "grant". Structurally, an MPEG REL grant consists of the following: the principal to whom the grant is issued, the right that the grant specifies, the resource to which the right in the grant applies, and the condition that must be met before the right can be exercised.

ODRL - Open Digital Rights Language

The Open Digital Rights Language (ODRL) is a language proposed to the Digital Rights Management (DRM) community for the standardisation of expressing rights information over content [186]. ODRL is intended to provide flexible and interoperable mechanisms to support transparent and innovative use of digital resources in publishing, distributing and consuming electronic publications, digital images, audio and movies, learning objects, computer software and other creations in digital form. ODRL has no license requirements and is available in the spirit of "open source" software; it has been submitted to the W3C, and the latest version is 1.1. ODRL is focused on the semantics of expressing rights languages and on the definitions of elements in the data dictionary. ODRL can be used within trusted or untrusted systems for both digital and physical assets. ODRL does not determine the capabilities or requirements of any trusted service (e.g. for content protection, digital/physical delivery, or payment negotiation) that utilises its language. ODRL benefits transactions over digital assets, as these can be captured and managed as a single rights transaction; in the physical world, ODRL expressions would need an accompanying system handling the distribution of the physical asset [194]. ODRL is based on an extensible model for rights expressions which involves a number of core entities and their relationships (the ODRL Foundation Model).

Figure 24: ODRL Foundation Model

The ODRL model is based on three core elements: Assets, Rights and Parties.

XrML - Extensible Rights Markup Language

XrML is a language to specify rights. XrML is an XML-based usage grammar for specifying rights and conditions to control access to digital content and services [195]. With XrML, anyone owning or distributing digital resources (such as content, services, or software applications) can identify the parties allowed to use those resources, the rights available to those parties, and the terms and conditions under which those rights may be exercised. These four elements form the core of the language and determine the full context of the rights that are specified. In other words, it is not sufficient to specify that the right to view certain content has been granted; who can view it and under what conditions must also be specified. The basic relationship is defined by the XrML assertion "grant". Structurally, an XrML grant consists of the following [195]:

- The principal to whom the grant is issued
- The right that the grant specifies
- The resource that is the direct object of the right verb
- The condition that must be met for the right to be exercised

Figure 25: XrML core model

MPEG-21 REL - Rights Expression Language

The MPEG-21 REL is intended to provide flexible, interoperable mechanisms to support transparent and augmented use of digital resources in publishing, distributing, and consuming digital movies, digital music, electronic books, broadcasting, interactive games, computer software and other creations in digital form, in a way that protects digital content and honours the rights, conditions, and fees specified for digital contents. It is also intended to support the specification of access and use controls for digital content in cases where financial exchange is not part of the terms of use, and to support the exchange of sensitive or private digital content [184]. The Rights Expression Language is also intended to provide a flexible, interoperable mechanism to ensure that personal data is processed in accordance with individual rights, and to meet the requirement for Users to be able to express their rights and interests in a way that addresses issues of privacy and use of personal data [199]. A standard Rights Expression Language should be able to support guaranteed end-to-end interoperability, consistency and reliability between different systems and services. To do so, it must offer richness and extensibility in declaring rights, conditions and obligations, ease and persistence in identifying and associating these with digital contents, and flexibility in supporting multiple usage/business models [199]. MPEG REL adopts a simple and extensible data model for many of its key concepts and elements. The MPEG REL data model for a rights expression consists of four basic entities and the relationship among those entities. This basic relationship is defined by the MPEG REL assertion "grant". Structurally, an MPEG REL grant consists of the following [199]:
- The principal to whom the grant is issued
- The right that the grant specifies
- The resource to which the right in the grant applies
- The condition that must be met before the right can be exercised
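To make this shared grant model concrete, the following minimal Python sketch models a grant as the four entities listed above and evaluates its condition before allowing a right to be exercised. The class and function names are our own illustration and are not part of any REL specification; a real REL engine would of course parse and evaluate the standard XML expressions instead.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable

@dataclass
class Grant:
    """Toy model of an REL-style grant: principal, right, resource, condition."""
    principal: str                 # e.g. a user identifier
    right: str                     # e.g. "play", "print"
    resource: str                  # e.g. a content identifier
    condition: Callable[[], bool]  # must evaluate to True before exercise

def may_exercise(grant: Grant, principal: str, right: str, resource: str) -> bool:
    """Check that the request matches the grant and that its condition holds."""
    return (grant.principal == principal
            and grant.right == right
            and grant.resource == resource
            and grant.condition())

# Example: a user may play a given video until a fixed expiry date.
expiry = datetime(2004, 12, 31)
grant = Grant("alice", "play", "urn:example:video:42", lambda: datetime.now() < expiry)
print(may_exercise(grant, "alice", "play", "urn:example:video:42"))
```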

Figure 26: The REL data model

In essence, MPEG-21 REL is quite similar to XrML.

Users and Devices Authentication

Another important component of any DRM architecture is User or Device Authentication. This in fact covers two different things. First, user identification is used to create a link between the user and the content, and between the user's identity and what he is entitled to do with the content. Second, device authentication is more concerned with the assignment of unique identifiers to content-consuming devices, which may be used to decide whether a specific piece of content can or cannot be rendered on a given device. The following table provides an overview of the most well known user identification schemes.

Table 14: User identification schemas (adapted from [190][196])
- Liberty Alliance Project (LAP): LAP is a multi-vendor organisation formed to create an open, federated, single sign-on identity solution for the digital economy via any device connected to the Internet. One of its three main objectives is to provide an open standard for network identity spanning all network-connected devices.
- Microsoft .NET Passport: Microsoft .NET Passport is an online service that makes it possible for a user to use an e-mail address and a single password to sign in to any Passport-enabled Web site or service. Because it is based on an e-mail address, however, it does not provide a persistent identifier.
- XNS (eXtensible Name Service): XNS is a new open protocol and open-source platform for universal addressing, automated data exchange, and privacy control. An XNS universal address is a single super-address that consolidates all other addressing and profile data into a single XML digital container. An XNS address persists for the lifetime of a person, product, service, or company, no matter how often any other contact data changes. XNS provides a number of other identity-based services. The XNS Public Trust Organization (XNSORG) is responsible for governance of the XNS global trust community.

As far as device identification is concerned, several alternatives can be used: on Internet-connected devices the MAC or IP address can be used, and on some particular CPU architectures (Intel) the CPU ID. On mobile devices, the SIM card can be used for user identification while the IMEI number can be used for device identification. More advanced initiatives related to device identification will appear in the near future. Two of the most important initiatives in this field are Palladium and the TCPA - Trusted Computing Platform Alliance. Palladium is a system from Microsoft that combines software and hardware controls to create a "trusted" computing platform. TCPA is an open alliance that was formed to work on the creation of a new computing platform that will provide for improved trust in the PC platform. TCPA is now known as TCG - Trusted Computing Group.
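As a minimal illustration of the simplest device identifiers mentioned above, the following Python sketch reads the network interface's MAC address via the standard library and derives a stable device identifier from it. The hashing step is our own addition, not part of any standard, and a MAC address is of course spoofable, which is precisely why stronger initiatives such as TCPA exist.

```python
import hashlib
import uuid

def device_identifier() -> str:
    """Derive an illustrative device identifier from the MAC address.

    uuid.getnode() returns the hardware (MAC) address as a 48-bit integer;
    hashing it avoids exposing the raw address directly.
    """
    mac = uuid.getnode()  # 48-bit MAC address as an integer
    mac_str = ":".join(f"{(mac >> shift) & 0xff:02x}" for shift in range(40, -8, -8))
    return hashlib.sha1(mac_str.encode()).hexdigest()[:16]

print(device_identifier())
```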

LAP - Liberty Alliance Project

The objective of the Liberty Alliance Project (started in 2001) is to serve as the first open standards organisation for federated identity and identity-based services. The Alliance is delivering specifications and guidelines to enable a complete network identity infrastructure that will resolve many of the technology and business issues hindering the deployment of identity-based web services [187]. The main goal of LAP is to create open technical specifications that (a) enable simplified sign-on through federated network identification using current and emerging network access devices, and (b) support and promote permission-based attribute sharing to give users choice and control over the use and disclosure of their personal identification.

Microsoft .NET Passport

Microsoft .NET Passport was introduced in 1999 by Microsoft and has become one of the largest online authentication systems in the world, with more than 200 million accounts performing more than 3.5 billion authentications each month (this is primarily due to the fact that Microsoft Passport is the single sign-on technology of choice of Microsoft and its partners). Passport participating sites include Nasdaq, McAfee, Expedia.com, eBay, Canon, Groove, Starbucks, MSN Hotmail, MSN Messenger, and others. The Passport single sign-in service allows users to create a single set of credentials that can be used to access any site that supports the Passport service. The objective of the Passport single sign-in service is to increase customer satisfaction by allowing website visitors easy access without the frustration of repetitive registrations and forgotten passwords.

XNS - Extensible Name Service

XNS provides a flexible, interoperable method for establishing and maintaining persistent digital identities and relationships between these identities. The protocol provides services for registering and resolving identity addresses, defining and managing XML identity documents, conducting and protecting identity transactions, and linking and synchronising identity attributes. XNS uses XML schemas to define and manage any type of identity attribute, and XML-based Web services for platform independence and extensibility. XNS also uses emerging XML security standards such as XML Signatures and SAML to assert identity credentials and verify identity transactions. XNS is a peer-to-peer protocol that can be used to create and connect any number of identity networks into a global Identity Web [193].

5.1.8 Existing DRM solutions

The following table lists some of the commercial DRM solutions existing on the market.

Table 15: Companies working on DRM systems (adapted from [223][229])
- Adobe Systems Inc.: Content Server supports Adobe's earlier PDF Merchant and the standard EBX DRM schemes. Content Server packages, protects, distributes, and sells Adobe PDF eBooks directly from an organization's Web site. Consumers need only the free Acrobat eBook Reader to purchase and use the content.
- Alchemedia Technologies, Inc.: Clever Content is aimed primarily at B2B applications to ensure that proprietary information in JPEG, text, and PDF formats remains safe by disabling copying, saving, printing, and screen capturing. It has been used to protect documents such as production process documentation, unreleased product information, sensitive financial data, and satellite imagery. End users need to download the Clever Content Viewer.
- ContentGuard Holdings Inc.: ContentGuard is a spin-off from Xerox Corp. with strategic alliances and investments from Microsoft Corp. and Xerox Corp. ContentGuard comprises four components: a protection toolkit that lets organizations set encryption and access parameters; a distribution toolkit that lets organizations create storefronts and other means of issuing content to consumers; a consumer toolkit that verifies that access terms have been met before issuing content; and a back-office component that tracks use and license generation.
- Elisar Software Corp.: MediaRights is a digital rights enforcement technology that protects the rights associated with multimedia content that is displayed and distributed over the Internet. The software operates in conjunction with industry-standard viewers. The MediaRights client is an operating system extension, which gives it the flexibility to work seamlessly across applications and, hence, to work with any file format.
- InterTrust Technologies Corp.: Rights System is a lightweight, flexible, multi-platform system that enables consumers to access any type of content with any of a broad range of devices, including PCs, set-top boxes and video recorders, mobile communicators, and consumer electronics components such as game stations and portable devices. The Rights System consists of three components: packagers, rights servers, and DRM clients. Rights System does not need a separate DRM application download to the consumer's computer. InterTrust licenses its technology and patents in the form of software or hardware and tools to partners. These partners provide digital commerce services and applications that together form the MetaTrust Utility, an interoperable global commerce system. InterTrust serves as the commercial administrator of the MetaTrust Utility.
- MediaDNA: eMediator requires end users to download a plug-in before they can view protected content. Once the plug-in is downloaded, the organization can set various levels of access to protected media, e.g. it can let users view but not print photographs, view content for a set amount of time, or print a set number of times. eMediator also prevents end users from sharing their purchased content with others.
- Microsoft Corp.: Windows Media Rights Manager is an end-to-end DRM system that lets content providers deliver digital media content over the Internet in a protected, encrypted file format. A packaged media file contains a version of a media file that has been encrypted and locked with a key. This packaged file is also bundled with additional information from the content provider. The result is a file that can be played only on a media player that supports Windows Media Rights Manager, by a person who has obtained a license.
- NetActive, Inc.: NetActive Reach gives content providers control over digital content distribution and revenue by activating authorized customers via a live connection over the Internet. Organizations can offer consumers digital content on trial-, rental-, purchase-, or subscription-based license options.
- Reciprocal: Reciprocal's Digital Clearing Service enables e-commerce for all forms of digital content, including audio, text, graphics, software, images, and video via the Internet or other networks. Its offerings cover packaging, selling, and distributing digital content. Reciprocal is a core partner of InterTrust's MetaTrust Utility and licenses DRM platform technology from InterTrust.

IPMP - Intellectual Property Management and Protection

IPMP systems are used to enforce IPR on the client side. This can range from simple XML parsing tools to more complex encryption or watermarking algorithms.

OPIMA IPMP

OPIMA was established with the purpose of enabling a framework where content and service providers would have the ability to extend the reach of their prospective customers, and consumers would have the ability to access a wide variety of content and service providers in a context of multiple content protection systems [200].

The OPIMA specification presented an architecture and a description of the functions required to implement an OPIMA-compliant system. It further presented security protocols and a description of Application Programming Interfaces (APIs) and functional behaviours that enable interoperability. The OPIMA specification did not specify how the diverse components of the architecture would work. The OPIMA specification is device- and content-independent: content includes all multimedia types and executables, and the specification is independent of all digital content processing devices and content types. The term "Rules" refers to information that stipulates how content may be used on a given device; specifically, Rules determine how business models are established. An IPMP (Intellectual Property Management and Protection) system controls access to and use of the content by enforcing the rules associated with it. Conditional access systems are particular examples of IPMP systems [200]. For OPIMA, protected content consists of:
- a content set, which may consist of multiple media types;
- an IPMP system set, which may consist of multiple IPMP systems;
- a rules set that applies under the given IPMP system.

Figure 27: The OPIMA architecture (applications access the Application Services API, and IPMP systems #1 to #n plug into the OPIMA Virtual Machine through the IPMP Services API, on top of the native OS and hardware)

OPIMA finished its work in June 2000, releasing version 1.1 of its specification. OPIMA was implemented for the first time by the IST OCCAMM project (IST 11443).

MPEG-4 IPMP

MPEG IPMP work started by defining a set of "hooks" inside the MPEG-4 terminal. After some criticism of the limitations of this model, work started on a more flexible approach called IPMP extensions (IPMP-X).

MPEG IPMP "Hooks"

MPEG-4 IPMP (version 1) is a simple hooks-based DRM architecture standardized in ISO/IEC 14496-1:1999. The MPEG-4 IPMP extensions have more recently been standardized as an extension to the traditional MPEG-4 IPMP hooks. IPMP hooks allow several IPMP systems to co-exist on the same terminal. According to the chosen protected content, the IPMP System specified by the Content Owner at authoring time will be instantiated. The IPMP System itself is proprietary [183][188]. However, the following issues were left open by IPMP hooks:
- There is no standard way to specify how an IPMP System can be hooked into an MPEG-4 player without prior agreement between MPEG-4 player manufacturers and IPMP System providers;
- There is no standard mechanism to allow IPMP Systems to authenticate each other;
- It does not provide easy replacement of a broken IPMP system.

IPMP-X was designed to answer the above open questions and to provide a more complete DRM architecture within MPEG, and to do so in a secure manner [188].

Figure 28: MPEG-4 IPMP "hooks" architecture

MPEG-4 IPMP extensions do not break or otherwise negatively impact existing implementations based on the original "hooks" specification. An identifier is defined in order to specify which IPMP solution is being used. MPEG-4 IPMP hooks protected content may be accessed by an MPEG-4 IPMP-X terminal. MPEG-4 IPMP-X protected content will be seen by an IPMP hooks terminal as being protected by an unknown IPMP system [183][188].

MPEG IPMP Extensions

IPMP eXtensions come in two flavours: MPEG-2 IPMP-X and MPEG-4 IPMP-X. MPEG-2 IPMP-X is designed to be applied to MPEG-2 based systems; MPEG-4 IPMP-X is designed to be applied to MPEG-4 based systems [197]. MPEG-4 IPMP-X can be used to host any type of media protection at the level of granularity and complexity required by the specific DRM system employed to protect a given content within MPEG-4 Systems. MPEG-4 IPMP-X may protect any kind of media content included in an MPEG-4 stream, for example video, audio, computer graphics, text, interactive content, etc. MPEG-2 IPMP-X provides MPEG-2 Systems with the same functionality and support as MPEG-4 IPMP-X. MPEG-2 IPMP-X may protect any kind of content that can be inserted in an MPEG-2 transport stream, for example video, audio, text, private streams, etc. IPMP-X can also easily be integrated in other, non-MPEG-based systems [183][197]. The table below compares OPIMA and MPEG IPMP-X:

Feature | OPIMA | MPEG IPMP-X
Ability to download | Rules/Licenses | Rules/Licenses and protection algorithms (encryption, watermark, etc.)
Secure environment based on | Tamper resistance / OPIMA compartments | Mutual authentication, which allows tools to be linked together
Targeted applications | STBs and mobile devices | Military, government, D-Cinema to STBs and mobile devices

The main features of MPEG IPMP-X are:
- Interoperability: by using a standardized set of IPMP messages and industry-defined APIs, different IPMP tools can easily be plugged into a consumer device and interact with each other, providing users with seamless operation on digital media content.
- Security: IPMP-X provides methods to perform mutual authentication and then to use the resulting secure authenticated channel to support secure communications between components requiring it. Additionally, mutual authentication can be used purely to verify existing trust relationships as they may exist or be required in a given DRM solution.
- Renewability: mutual authentication can be used to verify the validity of certificates and credentials. Additionally, means are given to support the replacement of a tool in case of a security breach, enabling content owners to safely deploy their assets.
- Flexibility: one can choose whichever IPMP tool is to perform decryption, watermarking, user authentication or data integrity checking, in order to enable system manufacturers to maintain the security of their solutions over time.
- Dynamic operation: the kind of protection intended for the content (i.e. the IPMP tools required to protect certain content) can be specified at authoring time, enabling a variety of businesses to flourish based on IPMP-X solutions.
- Compatibility: forward and backward compatibility with existing Conditional Access systems.

Figure 29: MPEG-4 IPMP Extensions

MPEG-4 IPMP-X was implemented for the first time by the IST MOSES project (IST 34144). This was in fact the MPEG-4 reference implementation of IPMP-X and is included in all MPEG-4 reference codecs.

MPEG-21 IPMP

MPEG-21 aims at defining a normative open framework for multimedia delivery and consumption for use by all the players in the delivery and consumption chain. This open framework will try to provide content creators, producers, distributors and service providers with equal opportunities in the MPEG-21 enabled open market.

This will also be to the benefit of content consumers, providing them access to a large variety of content in an interoperable manner [184]. MPEG-21 is based on two essential concepts: the definition of a fundamental unit of distribution and transaction (the Digital Item) and the concept of Users interacting with Digital Items. The Digital Items can be considered the "what" of the Multimedia Framework (e.g. a video collection, a music album) and the Users can be considered the "who" of the Multimedia Framework [198]. The latest MPEG project is the MPEG-21 Multimedia Framework, which was started with the goal of making this vision possible: to enable transparent and augmented use of multimedia resources across a wide range of networks and devices. The basic elements of the framework are [198]:
- Digital Items: structured digital objects with a standard representation, identification and metadata within the MPEG-21 framework.
  Digital Item is a very broad concept. Consider how it applies to a music compilation: this Digital Item may be composed of music files or streams, associated photos, videos, animation graphics, lyrics, scores and MIDI files, but it could also contain interviews with the singers, news related to the song, statements by opinion makers, ratings of agencies, and positions in the hit lists. Most importantly, it could contain navigational information driven by user preferences and, possibly, bargains related to each of these elements.
- Users: all entities that interact in the MPEG-21 environment or make use of MPEG-21 Digital Items.
  The meaning of User in MPEG-21 is very broad and is by no means restricted to the end user. An MPEG-21 User can therefore be anyone who creates content, provides content, archives content, rates content, enhances or delivers content, aggregates content, syndicates content, sells content to end users, consumes content, subscribes to content, regulates content, or facilitates or regulates transactions that occur from any of the above.
The work carried out so far has identified seven technologies that are needed to achieve the MPEG-21 goals. They are:
- Digital Item Declaration: a uniform and flexible abstraction and interoperable schema for declaring Digital Items;
- Content Representation: how the data is represented as different media;
- Digital Item Identification and Description: a framework for identification and description of any entity regardless of its nature, type or granularity;
- Content Management and Usage: the provision of interfaces and protocols that enable creation, manipulation, search, access, storage, delivery, and (re)use of content across the content distribution and consumption value chain;
- Intellectual Property Management and Protection: the means to enable content to be persistently and reliably managed and protected across a wide range of networks and devices;
- Terminals and Networks: the ability to provide interoperable and transparent access to content across networks and terminal installations;
- Event Reporting: the metrics and interfaces that enable Users to understand precisely the performance of all reportable events within the framework.
MPEG-21 Part 4 will define an interoperable framework for Intellectual Property Management and Protection (IPMP). MPEG decided to start a new project on more interoperable IPMP systems and tools. The project includes standardized ways of retrieving IPMP tools from remote locations and exchanging messages between IPMP tools and between these tools and the terminal.

It also addresses authentication of IPMP tools, and has provisions for integrating Rights Expressions according to the Rights Data Dictionary and the Rights Expression Language. The MPEG-21 subgroup is still at the stage of defining the requirements for its MPEG-21 IPMP.

Specific Motion-JPEG2000 Protection Techniques

JPSEC Standard

Recognising the need to integrate security tools in JPEG 2000, JPEG has started an activity referred to as Secure JPEG 2000, or JPSEC, as Part 8 of the JPEG 2000 specifications. The goal of JPSEC is to extend the baseline specifications in order to efficiently integrate and support the tools needed to secure digital images. This section gives an overview of the current status of JPSEC. JPSEC is currently at the Committee Draft stage. While the JPSEC standard is now relatively stable, it is important to point out that, as of this writing, JPSEC is still under development and what follows is subject to change. The features supported by JPSEC include the following:
- Encryption: confidentiality mechanisms to allow for encryption of the code stream.
- Source authentication: verification of the authenticity of the source. A typical technique is to register a digital signature of the original image.
- Data integrity: verification of the integrity of the image content, including fragile or semi-fragile integrity verification. Typical techniques include digital signatures and watermarking.
- Conditional access: control for conditional access to parts of the image content. For instance, a low-resolution preview of the image is freely available but the higher resolutions cannot be viewed.
- Secure scalable streaming and transcoding: the ability of a network node or proxy to perform streaming and transcoding of a protected JPEG 2000 code stream in a way that preserves end-to-end security without requiring decryption of the content.
The specific tools used to secure an image are out of the scope of JPSEC. Instead, JPSEC normalises the syntax used to signal the use of security tools. In other words, JPSEC defines an open framework for secure imaging. The JPSEC syntax gives overall information about the security tools which have been applied to secure the image, along with some parameters referring to the technique used. Among other things, these parameters indicate which parts of the code stream have been secured. Such an approach not only ensures an open and flexible framework for JPSEC, but also provides a straightforward path for future extensions.

JPEG-2000 Selective Encryption Example

The principle of selective encryption is to encrypt only part of the image content. The resulting encrypted images remain viewable with standard-compliant viewers, but the encrypted parts appear scrambled on the screen. The protection originally presented in [142] is based on standard encryption techniques, which are applied to selected parts of the codestream. The basic encrypted units are JPEG 2000 packet bodies (packets in the sense defined in Part 1 of this ISO standard). Each packet holds data related to a given set of resolution level, quality layer, component and precinct, which allows excellent flexibility when choosing the parts of the image to be protected.
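The Python sketch below illustrates the principle only, not the actual JPSEC syntax or a real JPEG 2000 parser: given an already-parsed list of packets, only the bodies of packets above a chosen resolution level are encrypted, so headers and the overall codestream structure are untouched. The Packet class and the SHA-256-based keystream are our own stand-ins for a genuine codestream representation and a real cipher such as AES.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Packet:
    """Illustrative stand-in for a parsed JPEG 2000 packet."""
    resolution_level: int
    header: bytes          # left in clear so the codestream stays compliant
    body: bytes            # encrypted when the packet is selected

def _keystream(key: bytes, index: int, length: int) -> bytes:
    """Toy counter-mode keystream built from SHA-256 (stand-in for a real cipher)."""
    out, counter = bytearray(), 0
    while len(out) < length:
        out += hashlib.sha256(key + index.to_bytes(4, "big")
                              + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:length])

def protect(packets, key: bytes, min_resolution: int):
    """Encrypt only the bodies of packets at or above min_resolution."""
    protected = []
    for i, p in enumerate(packets):
        if p.resolution_level >= min_resolution:
            ks = _keystream(key, i, len(p.body))
            body = bytes(a ^ b for a, b in zip(p.body, ks))
        else:
            body = p.body
        protected.append(Packet(p.resolution_level, p.header, body))
    return protected

# Low resolutions stay in clear (free preview); higher ones are scrambled.
packets = [Packet(r, b"\xff\x91", bytes([r] * 8)) for r in range(4)]
locked = protect(packets, key=b"secret", min_resolution=2)
```

Because the encryption is a simple XOR with a keystream, running protect again with the same key restores the original bodies; a compliant decoder without the key still parses the structure and shows a degraded image, as described above.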

Figure 30: Packet-based encryption principle (the wavelet transform, scalar quantisation, entropy coding and rate allocation stages are unchanged; encryption is applied packet by packet to the resulting codestream)

The interest of such a selective encryption is that the codestream structure is kept intact: the ciphered, compressed data is still ordered as it was in the original, unprotected image data. Any decoder (implementations similar to the reference software) will simply read the protected codestream as if it were the true output of a standard encoder, and it will display a blurred image. Keeping the original structure intact has many more advantages. Among them, it allows a protected image to be transcoded, dropping higher-resolution or better-quality layer packets without bothering about their content, that is to say, without having to decrypt, re-encode, and re-encrypt the data. Therefore the secret key which protects the content does not have to be given to anybody other than the end user, which is a more flexible and more secure way of handling JPEG-2000 images than basic encryption.

Content Scrambling

This example describes a technique for conditional access control. The method was initially presented in [143]. Basically, it adds pseudo-random noise to the image. Authorised users know the pseudo-random sequence and can therefore remove this noise. In contrast, unauthorised users only have access to severely distorted images. In order to fully exploit and retain the properties of JPEG 2000, the scrambling is selectively applied to the code-blocks composing the code stream. Consequently, the distortion level introduced in specific parts of the image can be controlled. In other words, this enables access control by resolution, quality or regions of interest in an image. The system is composed of three main components, scrambling, pseudo-random number generation and encryption, which are discussed in more detail hereafter. Two approaches are proposed for scrambling: it is performed either on the wavelet coefficients or directly on the code stream. In the first case, the signs of the wavelet coefficients in selected code-blocks are pseudo-randomly inverted. In the second case, the bits in selected portions of the code stream are pseudo-randomly inverted. For example, the SHA1PRNG algorithm ([144]) with a 64-bit seed is used as the pseudo-random number generator (PRNG). Note that other PRNG algorithms could be used as well.
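A minimal Python sketch of the wavelet-domain variant follows: the signs of the coefficients in a selected code-block are flipped according to a seeded pseudo-random sequence, and applying the same operation with the same seed inverts it, which is what an authorised decoder would do. Python's standard random generator stands in for SHA1PRNG, and code-block selection and the JPSEC signalling are omitted.

```python
import random

def scramble_signs(coefficients, seed: int):
    """Pseudo-randomly invert the signs of quantised wavelet coefficients.

    Running the function again with the same seed undoes the scrambling.
    """
    rng = random.Random(seed)  # stand-in for SHA1PRNG with a 64-bit seed
    return [-c if rng.random() < 0.5 else c for c in coefficients]

code_block = [12, -3, 0, 7, -25, 4, 1, -9]
scrambled = scramble_signs(code_block, seed=0xC0FFEE)
restored = scramble_signs(scrambled, seed=0xC0FFEE)
assert restored == code_block
```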

To communicate the seed values to authorised users, they are encrypted and inserted in the bitstream. In this example, the RSA algorithm is used for encryption [145]. Note that other encryption algorithms could be used as well. The length of the key can be selected at the time the image is protected. Two block diagrams are given below, corresponding to the two cases of wavelet-domain and bitstream-domain scrambling.

Figure 31: Block diagram for wavelet-domain scrambling (selective scrambling of the quantised wavelet coefficients driven by the PRNG, with the encrypted seed carried in the JPSEC syntax of the codestream)

Figure 32: Block diagram for bitstream-domain scrambling (selective scrambling applied directly to the codestream, again with the encrypted seed carried in the JPSEC syntax)

In order to improve the security of the system, the seed can be changed from one code-block to another. Also, several levels of access can be defined, using different encryption keys. The JPSEC syntax is very flexible and supports the usage of multiple seeds and multiple keys. The different components of the technique, as well as the associated JPSEC syntax, are discussed in more detail in [142] and [146].

A Specific MPEG / DCT-based Encryption Technique

RVEA was developed as a selective encryption algorithm for MPEG-1 and MPEG-2 streams. It operates only on the sign bits of the DCT coefficients and/or motion vectors of a compressed video. RVEA can use any secret-key cryptography algorithm (such as DES, AES or IDEA) to encrypt those selected sign bits. Shi and Bhargava originally proposed it in [147] and [148]. In [147], the authors present a light version of the algorithm, which is far less secure than the second version they proposed in [148]. The main difference between the two algorithms is that in the first one, encryption was performed by a simple XOR operation, whereas in the second one it can be performed by strong algorithms such as DES or AES.
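The following Python sketch captures the RVEA idea at a high level only: the sign bits of the non-zero DCT coefficients are gathered, encrypted with a secret-key primitive (a keyed SHA-256 keystream stands in here for DES/AES/IDEA), and written back, so only the signs change while the rest of the compressed data is untouched. Coefficient selection and bitstream parsing are simplified assumptions, not the published algorithm.

```python
import hashlib

def _keystream_bits(key: bytes, n: int):
    """Toy keystream of n bits derived from SHA-256 (stand-in for DES/AES/IDEA)."""
    bits, counter = [], 0
    while len(bits) < n:
        digest = hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        bits.extend((byte >> i) & 1 for byte in digest for i in range(8))
        counter += 1
    return bits[:n]

def rvea_like(coefficients, key: bytes):
    """Encrypt only the sign bits of non-zero DCT coefficients (RVEA principle).

    Applying the function twice with the same key restores the original signs.
    """
    signs = [1 if c < 0 else 0 for c in coefficients]   # gather the sign bits
    ks = _keystream_bits(key, len(signs))
    out = []
    for c, s, k in zip(coefficients, signs, ks):
        new_sign = s ^ k if c != 0 else s               # leave zero coefficients alone
        out.append(-abs(c) if new_sign else abs(c))
    return out

block = [32, -7, 0, 5, -1, 0, 2, -4]
protected = rvea_like(block, b"secret-key")
assert rvea_like(protected, b"secret-key") == block
```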

The idea of RVEA can be extended and applied to any DCT-based compressed video. In particular, it could be used to protect H.264 content.

5.2 Security of delivery

Authentication Techniques

Authentication is the technique by which a process verifies that its communication partner is who it is supposed to be and not an impostor. The general model used by all authentication protocols is the following: someone initiates a request to establish a secure connection with a second user. The initiator sends a message to the receiver or to a trusted key distribution centre, and several message exchanges follow. When all is said and done, the receiver is sure that the sender is valid, and a session key has also been established for use in the upcoming conversation. Public key cryptography is widely used for the authentication process and for setting up the session key. A typical way to distribute a public key is through a digital certificate.

Digital Certificates

The most widely accepted format for digital certificates is the X.509 standard, and it is relevant to both clients and servers [157]. A typical digital certificate is an item of information that binds a distinguished name to a public key. It contains a data section and a signature section, as illustrated in the figure below. The data section contains the version number, a serial number assigned by the CA, a distinguished name (a series of name-value pairs such as user ID, e-mail address, the user's common name, organisation and country), and information about the public key, including the algorithm used and the public key itself. RSA is the most common algorithm used with SSL, but others can be used. Certificate Authorities (CAs) are responsible for issuing digital certificates and prevent anybody from creating false certificates and pretending to be somebody else. A CA is a commonly known trusted third party, responsible for verifying both the contents and ownership of a certificate. Users can apply the appropriate level of trust to each CA they encounter. Also, different classes of certificates are available, which reflect the level of assurance given by the CA. If two entities trust the same CA, they can swap digital certificates to obtain access to each other's public key, and from then onwards they can undertake secure transmissions between themselves. Digital certificates include the CA's digital signature (i.e. information encrypted with the CA's private key). This means that no one can create a false certificate. The public keys of trusted CAs are stored for use by applications. Digital certificates automate the process of distributing public keys and exchanging secure information. When someone installs a digital certificate on his computer or web server, that computer or web site now has its own private key. Its matching public key is freely available as part of the digital certificate, posted on the computer or web site. When another computer wants to exchange information with this computer, it accesses the digital certificate, which contains the public key. Here is the detailed process for a mutual authentication between two devices:
1. First, the device checks the validity period of the other's certificate. If the current date and time are outside of that range, the authentication process will not go any further. If the current date and time are within the certificate's validity period, the device goes on to step 2.

[Figure: mutual authentication using certificates - the authenticating device checks the remote device's certificate (server's public key, serial number, validity period, server's DN, issuer's DN, issuer's digital signature) against its list of trusted issuing CA certificates, asking: (1) is today's date within the validity period? (2) is the issuing CA a trusted CA? (3) does the issuing CA's public key validate the issuer's digital signature? (4) does the domain name in the server's DN match the server's actual domain name?]

2. Each device maintains a list of trusted Certificate Authorities' (CAs') certificates; this list determines which other devices' certificates it will accept. If the distinguished name (DN) of the issuing CA matches the DN of a CA on the device's list of trusted CAs, the answer to this question is yes, and the device goes on to step 3. If the issuing CA is not on the list, the other device will not be identified unless the authenticating device can verify a certificate chain ending in a CA that is on the list.
3. The authenticating device then uses the public key from the CA's certificate (found in its list of trusted CAs in step 2) to validate the CA's digital signature on the certificate being presented. If the information in this certificate has changed since it was signed by the CA, or if the CA certificate's public key does not correspond to the private key used by the CA to sign the destination device's certificate, the device will not authenticate the other's identity. If the CA's digital signature can be validated, the device treats the other's certificate as a valid letter of introduction from that CA and proceeds. At this point the authenticating device has determined that the other's certificate is valid.
4. The last step confirms that the device to be trusted is actually located at the domain specified by the domain name in the server certificate. This step is not really part of the original SSL protocol, but it is often useful as it provides the only protection against a form of security attack known as a man-in-the-middle attack. Devices must perform this step and must refuse to authenticate or establish a connection with a device whose domain name does not match. If the device's actual domain name matches the domain name in its certificate, it can be definitely authenticated.

SSL/TLS Protocol

The SSL (Secure Sockets Layer) protocol was originally developed by Netscape and went to the Internet Draft stage for IETF standardisation [149]. The protocol that was finally standardised is TLS 1.0, which is based on SSL [150]. The SSL security protocol provides data encryption, server authentication, message integrity, and optional client authentication for a TCP/IP connection. SSL comes in two strengths, 40-bit and 128-bit, which refer to the length of the "session key" generated by every encrypted transaction. The longer the key, the more difficult it is to break the encryption code.
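As a concrete illustration of certificate-based server authentication and the checks described above, the following Python sketch uses the standard library's ssl module to open a TLS connection, verify the server certificate against the locally trusted CAs and match the host name. The host name is an arbitrary example, and a real deployment would of course also handle errors and client authentication where required.

```python
import socket
import ssl

def connect_tls(host: str, port: int = 443):
    """Open a TLS connection, verifying the server certificate and host name."""
    context = ssl.create_default_context()   # loads the trusted CA certificates
    context.verify_mode = ssl.CERT_REQUIRED  # reject servers that cannot be verified
    context.check_hostname = True            # step 4: the domain name must match
    with socket.create_connection((host, port)) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()          # already validated during the handshake
            print("negotiated:", tls.version())
            print("issuer:", dict(pair[0] for pair in cert["issuer"]))
            print("valid until:", cert["notAfter"])

connect_tls("www.example.org")
```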

These protocols use digital certificates to perform authentication and to agree on a session key which is used to encrypt and decrypt the multimedia data delivered through the network.

Secure-RTP

SRTP [139] is a profile of RTP (Real-time Transport Protocol [136]) and an extension to the RTP Audio/Video Profile [138]. It was standardised as an RFC very recently (March 2004). SRTP intercepts RTP packets and forwards an equivalent SRTP packet on the sending side, and intercepts SRTP packets and passes an equivalent RTP packet up the stack on the receiving side. SRTP provides a framework for encryption and message authentication of RTP and RTCP streams. SRTP defines a set of default cryptographic transforms, and it allows new transforms to be introduced in the future. The security goals of SRTP are to ensure:
- the confidentiality of the RTP and RTCP payloads;
- the integrity of the entire RTP and RTCP packets, together with protection against replayed packets.
These security services are optional and independent from each other, except that SRTCP integrity protection is mandatory (malicious or erroneous alteration of RTCP messages could otherwise disrupt the processing of the RTP stream). Among the functional advantages of this protocol, we can list:
- SRTP defines a framework that permits upgrading with new cryptographic transforms;
- SRTP is independent of the underlying transport, network, and physical layers used by RTP; in particular it has high tolerance to packet loss and re-ordering.
These properties ensure that SRTP is a suitable protection scheme for RTP/RTCP in both wired and wireless scenarios. Besides the above-mentioned characteristics, SRTP provides some additional features, which have been introduced in the standard to lighten the burden of key management and to further increase security. They include a single "master key" which can provide keying material for confidentiality and integrity protection, both for the SRTP stream and the corresponding SRTCP stream. This is achieved with a key derivation function providing "session keys" for the respective security primitives, securely derived from the master key.

IPSEC

IPsec is designed to provide interoperable, high-quality, cryptographically-based security for IPv4 and IPv6. The set of security services offered includes access control, connectionless integrity, data origin authentication, protection against replays (a form of partial sequence integrity), confidentiality (encryption), and limited traffic flow confidentiality [141]. These objectives are met through the use of two traffic security protocols, the Authentication Header (AH) and the Encapsulating Security Payload (ESP), and through the use of cryptographic key management procedures and protocols. The set of IPsec protocols employed in any context, and the ways in which they are employed, are determined by the security and system requirements of users, applications, and/or sites/organisations. The mechanisms of IPsec are designed to be algorithm-independent. This modularity permits the selection of different sets of algorithms without affecting the other parts of the implementation. For example, different user communities may select different sets of algorithms if required. There are two modes of operation defined in IPsec, which are available for both the AH and ESP protocols:
- Transport mode: the payload is encrypted but the headers are left intact. This mode ensures privacy of content but does not protect against traffic analysis attacks.

- Tunnel mode: the entire original IP datagram is encrypted and becomes the payload of a new IP packet. This mode adds extra overhead in the extra header but can provide traffic flow confidentiality.
As a means to provide end-to-end or gateway-to-gateway security (or any variant thereof), IPsec can be a valuable tool. However, the security given by IPsec depends on the operating environment in which it is deployed. If this environment is breached or keys are exposed, the security provided by IPsec can be severely degraded.

Wi-Fi security and its evolution

The security protocol designed for wireless local area networks (defined in the 802.11b standard) aims to ensure the same level of security as for wired LANs.

Wired Equivalent Privacy

The first algorithm used to protect wireless communication is Wired Equivalent Privacy (WEP). A first functionality of WEP described in 802.11b is protection against eavesdropping. A second one, considered a WEP feature but not described explicitly in 802.11b, is the prevention of unauthorized access. WEP is based on a static secret key that is shared between a mobile station and an access point. The secret key is used to encrypt packets before transmission, and a simple integrity check is meant to guarantee data integrity. WEP employed 40-bit keys with the RC4 encryption scheme. Key management and the size of the key contribute to the weakness of WEP security. The WEP flaws lead to several attacks, such as passive and active attacks to decrypt traffic, active attacks to inject new traffic, and dictionary attacks to recover the encryption key. Sniffing tools such as AirSnort or WEPcrack have highlighted the weakness of the key management. With this type of network software, an unauthorized person can monitor a network and decode the encrypted messages. Wireless LANs, which carry information over radio waves, do not have the same physical structure and are therefore more vulnerable than wired LANs. WEP security can be improved by combining it with other security technologies such as Virtual Private Networks or 802.1X authentication with dynamic WEP keys. IEEE 802.1X provides an effective framework for authentication and user traffic control, providing different encryption keys. 802.1X uses the Extensible Authentication Protocol (EAP) and the Remote Authentication Dial-In User Service (RADIUS) to authenticate clients and distribute keys. 802.1X and EAP ensure that new encryption keys are generated through dynamic key distribution. The IEEE Task Group is still on the path to a fully ratified 802.11i standard.

802.11i

The 802.11i standard is designed to combine the authentication scheme of 802.1X and EAP with additional security features, including a new encryption scheme and dynamic key distribution. 802.11i will use the Temporal Key Integrity Protocol (TKIP). TKIP dynamically updates the static WEP key. The encryption scheme is still based on RC4, and a 128-bit RC4 key changed every 10,000 packets remains breakable. The IEEE committee therefore decided to replace RC4 with the Advanced Encryption Standard (AES). AES uses a mathematical ciphering algorithm with variable key sizes of 128, 192 or 256 bits. 802.11i is designed to work with 802.1X. The 802.11i standard is expected to be released by the end of the year.
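To make the weakness of a static RC4 key concrete, the short Python sketch below implements plain RC4 and shows that encrypting two packets with the same keystream (which WEP's small IV space makes frequent in practice) lets an eavesdropper obtain the XOR of the two plaintexts without knowing the key. This illustrates the principle only; it is not an implementation of the full WEP protocol, and the key and messages are arbitrary examples.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Plain RC4: key scheduling (KSA) followed by keystream generation (PRGA)."""
    s = list(range(256))
    j = 0
    for i in range(256):                       # key scheduling
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    out, i, j = bytearray(), 0, 0
    for byte in data:                          # keystream generation + XOR
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)

key = b"static WEP key"
p1, p2 = b"GET /camera1 HTTP/1.0", b"PIN code is 4711?????"
c1, c2 = rc4(key, p1), rc4(key, p2)
# With a reused keystream, c1 XOR c2 equals p1 XOR p2 -- no key is needed.
xor_ciphertexts = bytes(a ^ b for a, b in zip(c1, c2))
xor_plaintexts = bytes(a ^ b for a, b in zip(p1, p2))
assert xor_ciphertexts == xor_plaintexts
```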

Wi-Fi Protected Access

Wi-Fi users would like to obtain strong, interoperable and immediate security. In response to this demand, the non-profit Wi-Fi Alliance offers an immediate and strong security solution: Wi-Fi Protected Access (WPA). Wi-Fi Protected Access includes 802.1X and TKIP technology. This new solution is a subset of the IEEE 802.11i draft standard security specification. The goal of Wi-Fi Protected Access is to be a strong, interoperable security replacement for WEP, to be software-upgradeable on existing Wi-Fi certified products, to be applicable to both home and large enterprise users, and to be available immediately. The Wi-Fi Alliance security roadmap is well defined: WPA certification began in February 2003. The ratification of 802.11i is expected during the third quarter of this year, and 802.11i product availability is also expected during the last quarter of this year. Interoperability testing of 802.11i is starting in the first quarter of 2004.

6 State-of-the-art for video surveillance

6.1 Video surveillance systems

Introduction

In the past decade, we have seen the growth of the Internet and of mobile communications. This success is due to the evolution of hardware components, which has allowed the deployment of these technologies at a reasonable cost for the user. The focus has been on the physical layer and, thanks to hardware component evolution, modems, WLAN boards and mobile phones are available at ever more reasonable cost. In parallel, audio and video coders have leveraged the available CPU power to offer better compression and quality, enabling the success of the digital audio and video market. The same evolution exists in the CCTV world, but since security is at stake, the requirements are stricter and new technologies are adopted more slowly. Many systems are still fully analogue, but digital storage and video over IP are taking more and more place. The trend is towards fully digital CCTV systems.

Analogue systems

Analogue video surveillance systems usually include cameras, one or several switching matrices and specific keyboards for controlling the cameras. Video switching in analogue CCTV systems is done using an analogue video matrix, which is a centralised piece of video equipment. This means that all video connections must be gathered together at the place where the video matrix device is located. It can then involve optical or radio transmission for long distances and, in any case, the deployment of cables around the whole site. Larger systems need more sophisticated operator interfaces. These man-machine interfaces usually use Windows graphics and show the location of the cameras on the site map. In an environment with multiple operating stations, it is worthwhile to install video server software to ensure the linkage among all operator consoles, on the one hand, and all field equipment (cameras, matrices, digital recorders, video-based motion detectors, etc.) on the other. Finally, in the case of very large-scale sites that must manage several matrices, it also becomes easy to design a partitioning into independent surveillance areas. Inter-matrix links then permit analogue video flows to move from one area to another, so that any camera can be picked up on any monitor screen on the site. The display is made either on analogue monitors or on a PC-based GUI application with a video acquisition board. An example of such a system is shown below:

Figure 33: Example of an analogue CCTV system (CCTV servers, video matrices and a video recorder interconnected, with GUIs attached via Ethernet)

Digital systems

The first digital devices in CCTV systems were digital video recorders. These devices were able to store video sequences on hard disks and to replay them on PC Windows GUIs. Video encoding and compression then made such progress that it became possible to distribute live video over an Ethernet network with reasonable bandwidth and good quality. Fully digital CCTV systems then became a reality. This is the reason why there are two approaches in terms of products on the market: the first is based on recording solutions, while the second is based on transmission solutions. Today, both types of solutions provide similar functionality (i.e. mainly live video display and recording), but the packaging of the devices is different. The recording-based approach is rather centralised compared to the transmission-based one. Moreover, the integrated IP cameras available today are manufactured by companies more focused on transmission capabilities (IndigoVision for example). Video switching in networked CCTV systems is done by the network itself; there is no need here for an analogue video switching matrix. This means that all composite video signals delivered by analogue cameras are encoded and compressed in order to be distributed over an Ethernet network. The display of live video is made either on PC-based GUI applications with software video decoding, or on analogue monitors after the video has been decompressed and decoded by a specific decoder. An example of such a system is shown below:

Figure 34: Example of a digital surveillance system (encoders, decoders, CCTV server, storage/archive and GUI applications connected through a hub, switch or router, with data streams delivered on demand)

CCTV digital parts

Video Encoder-Decoder

The encoder is the device which converts the analogue video signal delivered by the camera into a compressed stream which is then distributed over the network. The decoder does the reverse operation. Each encoder or decoder is connected to the network, and the bandwidth can be adjusted in real time. Most of these devices offer audio encoding, RS232 or RS485 serial ports for a data channel (used to remotely control cameras or domes), and I/Os for alarms and other equipment control (lighting, for example). The offer is very large today and covers different technologies. It is really important to notice that even if the base technology is the same for two different companies, the software development kits used to integrate these systems will always be different and proprietary. Here is a non-exhaustive list of what can be found on the market:
INDIGOVISION provides two series of encoders and decoders:
- The 6000 Series comprises H.261 products for IP-video applications. Full frame rate, full colour: H.261 up to 30 fps at CIF resolution; M-JPEG compression up to 12.5 fps. Bit rate up to 2 Mbps, variable bit rate. Audio compression: G.711 and G.728. Serial port RS232/422/485 for PTZ remote camera control functions and maintenance. 8 binary inputs/outputs.
- A second series comprises MPEG-4 products for IP-video applications. Resolutions: CIF, 2CIF, 4CIF. Bit rates up to 4 Mbps, variable bit rate. Motion detection. 2 serial ports RS232/422/485 for PTZ remote camera control functions and maintenance. 8 binary inputs/outputs.
VCS provides 5 series of devices:

- VIDEOJET 10: MPEG-4 / M-JPEG encoder/decoder. Resolutions: QCIF, CIF, 2CIF. Low-latency mode < 150 ms. Video data rate up to 4 Mbit/s. Audio compression: G.711. Wireless Ethernet: 802.11b RF interface via CF card interface. 2 serial ports RS232/422/485 for PTZ remote camera control functions and maintenance. 1 input / 1 relay output.
- VIDEOJET 100/400: MPEG-4 / H.263 encoder/decoder. Resolutions: QCIF, CIF, 2CIF. Video data rate up to 1 Mbit/s. Audio compression: G.711. 1 serial port RS232/422/485 for PTZ remote camera control functions and maintenance. 1 input / 1 relay output.
- VIDEOJET 1000: MPEG-2/4 encoder/decoder. Resolutions: MPEG-2: 720 x 576 full D1, CIF; MPEG-4: 4CIF, CIF, QCIF. Low-latency mode: 150 ms (MPEG-4), 198/170 ms (PAL/NTSC MPEG-2). Video data rate: MPEG-2: 1 Mbit/s to 8 Mbit/s; MPEG-4: 9.6 kbit/s to 1.5 Mbit/s, constant and variable. Audio compression: MPEG-1 and G.711. 1 serial port RS232/422/485 for PTZ remote camera control functions and maintenance. 1 input / 1 relay output.
- VIDEOJET 8000: MPEG-2/4 encoder/decoder (preliminary information). Same characteristics as the VIDEOJET 1000 series. 8 video channels / 1 video output. Supports a 1000Base-T interface. Integrated video scene analysis. USB interface.
- VIDEOJET XPRO: MPEG-2/4 encoder/decoder (preliminary information). Same characteristics as the VIDEOJET 8000 series. The VideoJet XPro is a cartridge-based video sender/receiver; each cartridge functions like a VideoJet. Up to 160 video channels.
VISIOWAVE offers a comprehensive range of digital video networking products which form the building blocks of its complete and integrated security and media solutions, using VisioWave Dynamic Coding, a proprietary 3D wavelet compression technology:
- VisioBox: PAL (50 fps), NTSC (60 fps). Audio MPEG-1/2 Layer 3 (32 to 320 kbit/s). Optional Ethernet module for wireless transmission. 1 serial port RS232/422/485 for PTZ remote camera control functions and maintenance. 3 inputs / 3 relay outputs.
- CPCI-4 II video codec board: PAL/NTSC full field and frame resolution. 4 totally independent video codecs. Visually lossless compression at 2 Mbps. Constant bit rate or constant quality video. Audio MPEG-1/2 Layer 3 (32 to 320 kbit/s).
- PCI-2 video codec board: ultra-short size 32-bit PCI local bus. 2 totally independent video codecs.
NICEVISION:
- Harmony System: 16 to 64 camera inputs per recorder. Flexible frame rate at 4 to 30 frames per second NTSC (25 PAL); frame rates can change upon alarm notification. H.263+, M-JPEG. Bit rate up to 4 Mbps.
- Pro System: up to 96 camera inputs per unit. Up to 64 IP-video inputs per unit. Selectable rate per camera up to 4 Mbps. Up to 30 frames/sec NTSC (25 PAL). MPEG-4 main profile, H.263+, M-JPEG compression. Video authentication.

IP Cameras

Some integrated IP CCTV cameras are beginning to be available. It is important here to separate professional CCTV cameras (H.261, MPEG-4) from multimedia cameras for general users (USB, DV, HTTP, FTP). At the moment, the transmission capabilities, the quality of the built-in video control systems, the lenses, the packaging, the environment, etc. are not the same, but it seems that the technologies used will converge in the years to come. Here are some IP cameras available today:

INDIGOVISION provides an IP camera: the IVC100 is a state-of-the-art, CCD-sensor-based IP video camera, built around the 6000-series codec (H.261 encoding, cf. above).

BAXALL provides an IP camera based on IndigoVision chips. Note that Baxall is a camera and CCTV systems manufacturer.

Network equipment
The equipment used to build the network is standard; today there are no devices specific to CCTV systems. The critical points are bandwidth availability and multicast management. At the moment, installed CCTV networks are entirely dedicated to security applications and are deployed specifically for them.

CCTV Server
The CCTV server is a software application intended to manage the CCTV devices (remote control of cameras, alarm management, video recording control...), the interface with the users' front end, and the users' rights. CCTV servers also connect the video encoders of the network to video decoders in order to display live video on the users' monitors.

User Interface
User interfaces today are either simple CCTV keyboards or graphical user interfaces (PC-based software). A GUI allows intuitive use of the system: the operator no longer needs the list of all cameras with their associated numbers, as is the case with CCTV keyboards. The user navigates through graphical maps on which the cameras are displayed; one click and the video is shown. Alarms are also presented graphically, with instructions displayed, acknowledgement, and playback of the associated alarm video sequence. Figure 35 gives an example of such a GUI.

Figure 35: Example of surveillance GUI

Video Storage / SDKs
Video storage equipment today can be:

- part of the networked digital solution, i.e. the encoders have local hard disks on which video sequences are stored;
- PC-based software applications, using tools and SDKs to retrieve the video streams and store them;
- racks with several analogue video inputs, encoders, a Windows or Linux application and hard disks.

Note that storage is a key point when dimensioning the network bandwidth.

Video surveillance over wireless network
Wireless transmission is not used in CCTV today, mainly because of the lack of security. When implementing wireless video transmission in a security system, it is obvious that both the content and the transmission of the data need to be protected. So, despite strong demand for wireless video transmission in CCTV installations, nothing available today really answers this need.

Security issues
In digital CCTV, the protection of the data is a very significant point. The main data is the video, whether live or stored. In all cases, the system shall provide a way of protecting:
- the transmission: it must be impossible for an unauthorised party to decode the stream. This point is even more significant for wireless technology;
- the content: it must be impossible to modify the content of a video stream, and the system must ensure that the video encoded at the camera side is the same as the one that will be displayed or stored.

Indeed, content in digital form can easily be accessed, manipulated, copied and distributed at negligible cost. In the case of surveillance applications, securing the content ensures the integrity of the system and the privacy of the users. At present, no real solution for content protection is available. The existing offers are proprietary, meaning that a proprietary codec is needed to decode the video stream; but this is not strong protection and has nothing to do with encryption, digital signatures or watermarking, which are the only techniques really providing high protection.

Last evolution: The Video Analysis
The latest addition to CCTV systems and product offerings is video analysis, which has many applications, for example:
- suspect behaviour detection
- crowd detection
- wrong-way movement detection
- speed detection
- object removal
- people counting
- smoke detection
- ...

When these algorithms detect critical situations, they are able to send events to the CCTV server, which will then warn the operators, display the video stream, record the video sequence, and so on.

6.2 Segmentation and tracking for surveillance applications

Extraction, Detection, Characterization

Introduction
When looking at an image or watching a video sequence, humans see the various objects/entities present in that image or video sequence. Unlike us, when looking at an image, a computer sees a collection of pixels. The goal of any image or sequence content analysis application is to make the computer see the objects/entities. The difficulty of such a task resides in the large variability in the appearance of these entities. For example, the image of the same object can be represented by various collections of pixels depending on a number of factors (e.g. illumination conditions, distance to the object, parameters of the photo equipment, etc.). Furthermore, different objects can belong to the same entity. Take, for example, a tall blond man and a small dark-haired woman: both are humans, easily recognisable for us, yet from a computer's perspective they are completely different things because of their representation in terms of pixels. Making a computer identify humans in images is an example of a content analysis application. Any content analysis task takes image pixels as input and outputs object-related information (e.g. how many objects are in an image, whether object X is present, which objects are in an image), using a priori knowledge of the observed scenes and a model of the world. To achieve this, current methods [1][2] employ a number of steps:
- feature extraction
- clustering/segmentation
- object descriptor extraction
- object recognition
These steps are described briefly in the following sections.

Point feature extraction
In the feature extraction step, a set of values, called a feature vector in the literature [4][5], is computed for each image pixel. The purpose of this step is to assign (a) the same feature vector to all pixels belonging to one object and (b) different feature vectors to pixels belonging to different objects. The values in a feature vector are derived from the colour/grey values at the considered image point and/or in its neighbourhood. Methods (also called operators and/or filters in the literature) to derive features can be based on:
- filtering in the spatial domain; specific examples are the Laplacian operator [6], the Sobel operator, and Gabor filters [6][8];
- filtering in the Fourier domain, where the amplitude and the phase of the Fourier spectrum represent relevant image features [9][10];
- local statistics computed around the pixel of interest, with averaging filters [6] and co-occurrence matrices [11] as the most representative examples;
- model fitting, e.g. Markov random fields [12] and Gibbs distributions [13].
The choice of the feature extraction method depends on the appearance of the objects one wants to identify in an image. For example, if one is looking for yellow flowers one should use a method that gives information about the colour at a point. When one is looking for a

zebra, a method that gives information about periodic patterns like those on a zebra is more appropriate. The feature extraction step maps points in the image space to points in a feature space. In the ideal case, all points belonging to an object/entity are mapped to one point in the feature space; in practice, object points are mapped onto regions of the feature space. This would not be a problem as long as different objects were mapped onto different regions, but in practice extensive overlap appears, requiring further processing.

Segmentation
Segmentation algorithms differ a lot depending on the application. One may distinguish segmentation in an image from segmentation in a video sequence, even if some integrated systems, such as SAKHBOT [14] or the Merl system, use a combination of these methods.

Image Segmentation
In image segmentation an image is subdivided into its constituent regions or objects. Image segmentation algorithms are generally based on one of two basic properties of intensity values: discontinuity and similarity. In the first category, the approach is to partition an image based on abrupt changes in intensity, such as edges. The principal approaches in the second category partition an image into regions that are similar according to a set of predefined criteria; thresholding, region growing, and region splitting and merging are examples of methods in this category. The features calculated in the previous step can be used either to detect abrupt changes or to group similar regions. Some classes of segmentation techniques are:
- thresholding: adaptive and hysteresis thresholding
- region-based segmentation: growing, splitting and merging
- morphological watersheds
- model fitting

Video Segmentation
In video sequence segmentation, most methods use successive frame differencing and/or background modelling to distinguish between the pixels of moving objects (the foreground) and the part of the image that does not change (the background). Several methods exist in the state-of-the-art, for instance background subtraction [15][16], which removes non-foreground points from the image; the modelling of the background to be removed is generally referred to as background estimation. Alternatively, one may employ difference images, Gaussian mixture models [17], conversion to the HSV colour space, hidden Markov models, and so on. Once the foreground and background pixels have been segmented, some filtering operations have to be performed, one of them being the removal of small moving areas that could not correspond to a real moving object. From the remaining areas, one can build groups of moving pixels, or blobs [18], for which several techniques are known (connected components, run-based encoding, ...). Other problems encountered are the handling of shadows, changes in lighting conditions, clutter, occlusion, etc. These problems lead to the implementation of shadow removal algorithms, adaptive segmentation, etc.
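As a concrete illustration of the background-subtraction and blob-grouping steps described above, the following minimal Python sketch assumes the OpenCV library and a hypothetical input clip surveillance.avi; the model parameters, thresholds and minimum blob area are arbitrary example values, not values used in WCAM. It builds a per-pixel Gaussian-mixture background model, cleans the foreground mask morphologically, and groups the remaining pixels into blobs with connected components.

# Minimal background-subtraction sketch (illustrative only).
import cv2

cap = cv2.VideoCapture("surveillance.avi")        # hypothetical input clip

# Per-pixel Gaussian-mixture background model (one common choice).
bg_model = cv2.createBackgroundSubtractorMOG2(history=500,
                                              varThreshold=16,
                                              detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break

    fg_mask = bg_model.apply(frame)               # 255 = foreground, 127 = shadow
    fg_mask = cv2.threshold(fg_mask, 200, 255, cv2.THRESH_BINARY)[1]

    # Morphological opening removes small spurious foreground specks.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)

    # Group the remaining foreground pixels into blobs (connected components).
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(fg_mask)
    for i in range(1, n):                         # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] < 100:      # discard tiny blobs
            continue
        x = stats[i, cv2.CC_STAT_LEFT]
        y = stats[i, cv2.CC_STAT_TOP]
        w = stats[i, cv2.CC_STAT_WIDTH]
        h = stats[i, cv2.CC_STAT_HEIGHT]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imshow("blobs", frame)
    if cv2.waitKey(1) == 27:                      # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()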

In video surveillance applications, the quality of the detection/tracking of moving objects depends mainly on the results of the segmentation algorithm used to distinguish foreground objects from the background. When the video quality is high and the scene is static, most segmentation methods perform well, but problems occur when there are variable lighting conditions, reflections, acquisition/transmission noise, or periodic or non-periodic movements in the background. Many segmentation methods cannot deal with some of these situations, and high-level processing is needed to determine that some of the segmented objects are part of the background. Frame differencing is a very cheap method but it cannot deal with a variable background. Therefore, our initial segmentation method for WCAM will probably be a statistical one. The idea is to have a background model of the luminance of each pixel. For example, Gaussians can be used to model the different states of a pixel: for each pixel, the most frequent Gaussians correspond to the background. The possibility of taking several background states into account for each pixel makes it possible to incorporate periodic states (e.g. traffic lights) or moving objects in the background (e.g. trees) automatically. With statistical segmentation, moving trees, blinking lights and acquisition/compression noise are no longer segmented as foreground, which reduces the complexity of the interpretation pass.

Content Segmentation

Content segmentation based on visual cues
One of the first approaches using DCT coefficients was proposed by Arman et al. [19] for both JPEG and MPEG streams. For MPEG streams only I-frames are analysed. This implementation employs a two-step approach: video frames are compared based on their representation by a vector of subsets of DCT coefficients, then the normalised inner product is subtracted from one and compared to a threshold. If a potential editing cut is detected, the images can be decompressed for further processing. Another approach using I-frames only is described by Patel and Sethi [20]: three histograms (the global, row and column histograms) of two successive I-frames are compared using the chi-square test to determine whether the two sets of histograms arise from the same source or not. A multi-pass approach has been used by Zhang et al., whose technique also analyses the B- and P-frames of an MPEG stream [21][22]. The first two passes compare the images based on DCT coefficients with different skip factors on I-frames; in another pass the number of motion vectors is compared to a threshold, and if there are fewer motion vectors than the predefined threshold, a visual transition is declared. Kobla et al. also reported on video segmentation using DCT coefficients [23]. Their method is similar to Zhang's in that it counts the motion vectors of the predicted blocks if the input is an MPEG stream; if they determine that it is a Motion JPEG stream, they switch to DCT comparison and sum the squared differences of the DC coefficients between successive I-frames. Yeo et al. have investigated using only the DC values of the DCT coefficients for frame comparison in the compressed domain [24]. They sum the DC differences between successive frames; if the difference is the maximum within a temporally sliding window and is n times larger than the next largest peak in the same window of frames, then it is indexed as a visual transition.
They also detect gradual transitions (e.g. dissolves, fade-ins and fade-outs) by comparing each frame to the k-th following frame over some time interval; the value of k should be larger than the duration of the gradual transition.
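To make the sliding-window criterion of Yeo et al. concrete, the following Python sketch is a simplified reading of the method rather than the authors' implementation; NumPy is assumed, the input is a list of DC images (e.g. heavily downsampled luminance frames), and the window size m and peak ratio n are illustrative values. A cut is flagged when the frame-to-frame DC difference is both the maximum of its window and n times larger than the second-largest peak in that window.

# Sketch of a sliding-window cut detector on DC images (illustrative only).
import numpy as np

def detect_cuts(dc_frames, m=10, n=3.0):
    """dc_frames: sequence of equally sized 2-D arrays, one per frame."""
    dc = np.asarray(dc_frames, dtype=np.float32)
    # Frame-to-frame difference: sum of absolute DC differences.
    diff = np.abs(dc[1:] - dc[:-1]).sum(axis=(1, 2))

    cuts = []
    half = m // 2
    for i in range(len(diff)):
        lo, hi = max(0, i - half), min(len(diff), i + half + 1)
        window = diff[lo:hi]
        if diff[i] < window.max():
            continue                              # not the maximum of its window
        rest = np.delete(window, np.argmax(window))
        second = rest.max() if rest.size else 0.0
        if diff[i] >= n * max(second, 1e-6):      # dominant peak: declare a cut
            cuts.append(i + 1)                    # the cut precedes frame i+1
    return cuts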

Zabih et al. [25] have developed a method for detecting hard visual transitions (editing cuts) by checking the spatial distribution of entering and exiting edge pixels. Another edge-based approach is that of Shen et al. [26], in which the edges are extracted directly from the compressed image. Hausdorff-distance histograms are then obtained for each region by comparing edge points extracted from successive I-frames, and the histogram of the whole frame is obtained by merging the histograms of the sub-regions in multiple passes. The merging algorithm is designed to increase the SNR of true motion during each pass while suppressing the mismatch information introduced by noise. Hampapur and his colleagues [27] presented a model-driven approach to digital video segmentation. Their paper deals with extracting features that correspond to editing cuts, spatial edits, and chromatic edits, and the authors present an extensive formal treatment of shot boundary identification based on models of video edit effects. All of the above techniques report good results for visual transition detection (editing cuts). Boreczky and Rowe [28] performed a comparison of algorithms for detecting visual boundaries (editing instances); they selected and implemented some of the above-mentioned algorithms. Their results showed that the DCT-based algorithms had the lowest precision for a given recall, a result that was expected due to the large number of false positives generated by random noise in the black frames between commercials.

Content segmentation based on audio cues
A/V analysis based on audio cues can be beneficial for scene classification, as the audio differs depending on what is being broadcast; for instance, audio in sport is different from audio in news. Little research has been carried out on the subject, and it took place mainly four or five years ago. It is in fact difficult to find research on video segmentation using only audio cues; it is much more common to see audio information used for indexing as a complement to the video information. In [28] audio features such as volume distribution and pitch contour (temporal domain) and frequency centroid and bandwidth (frequency domain) are used, followed by some statistical treatment of these features and the later use of a neural network. Work in the compressed domain is done in [29], where audio features are extracted from the compressed audio information and used to identify silences, dialogue and non-dialogue (non-silence) segments. Both high-level and low-level audio features are used in [30] to distinguish between different kinds of TV programmes: the behaviour of the features in different TV programmes is modelled so that they can be used in an HMM, and once the model is built it is the HMM that retrieves and recognises the different patterns of behaviour, and hence the different TV programmes. With this approach, simulations show an accuracy of 84.7% in classifying advertisements, basketball games, football games, news and weather forecasts, but the authors also note that video information should be used for better results. In [31] a distinction is made between music, speech and audio effects, although only music and speech are used in the experiment; music and speech are identified by their different frequency responses, and the time boundaries of music or speech segments are used for browsing a video.
The same line of work, distinguishing music from speech to classify the soundtrack, is followed in [32], but in that case the algorithm is mainly based on ZCR characterisation, reaching 90% accuracy, and with the help of some additional features it reaches 98%.
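As an illustration of the ZCR feature used in [32], the sketch below computes the per-frame zero-crossing rate of an audio signal in Python with NumPy; the frame length and hop size are illustrative values for 16 kHz audio, and the thresholding hint at the end is a simplification rather than the published algorithm.

# Minimal zero-crossing-rate (ZCR) sketch: speech tends to show a more
# variable, often higher ZCR than music.
import numpy as np

def zero_crossing_rate(signal, frame_len=400, hop=160):
    """Return the per-frame ZCR (fraction of sign changes) of a 1-D signal."""
    signal = np.asarray(signal, dtype=np.float64)
    zcr = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        signs = np.sign(frame)
        signs[signs == 0] = 1                     # treat exact zeros as positive
        crossings = np.count_nonzero(np.diff(signs))
        zcr.append(crossings / (frame_len - 1))
    return np.array(zcr)

# A crude music/speech indicator could then threshold the variance of the ZCR
# sequence: high variance suggests speech, low variance suggests music.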

Content segmentation based on audio-visual cues for A/V scene segmentation
Combining audio and video can play an important role in content segmentation. The algorithm proposed by Hari and Shih-Fu [33] segments the content into audio scenes by looking for changes in the behaviour of 10 different features, and independently segments the content into video scenes using the idea of recall (a distance between histograms weighted by the length of the shots and their distance in time). At the end, audio and video scenes are merged using a nearest-neighbour algorithm. A different approach comes from Shu-Ching Chen [30]: first the video shot boundaries are detected using pixel comparison methods, a segmentation method and, if necessary, an object tracking method; the audio data are then split according to the shots and 9 different low-level audio features are calculated for each shot; finally, scenes are extracted by applying a threshold to a function of all these audio features. This approach shows good results in terms of precision and recall, and it also appears to be robust to cases in which the audio change does not coincide with a shot transition.

3D Segmentation
The use of depth information in the segmentation of video images and still images is becoming more and more common. Different classifications of 3D image segmentation techniques are possible; they may be divided into structural, stochastic and hybrid techniques [31]. The structural techniques try to find structural properties of the region to be segmented: 3D edge-detection techniques, morphological techniques, graph searching algorithms, deformable models, and isosurfaces and level sets belong to this group. Stochastic techniques perform segmentation by statistical analysis: thresholding approaches, classification techniques, clustering algorithms and Markov random fields are the main stochastic techniques. Region growing, split-and-merge and artificial neural networks can be classified as hybrid techniques. The advantages of adding stereo information are remarkable not only for segmentation but also for locating objects in 3D and handling occlusion events, which makes tracking easier too. In [34] a system to detect and track people based on a 3D silhouette is developed; the 3D silhouette is constructed from a 2D silhouette using a combination of colour- and disparity-based background subtraction followed by a volume-of-interest filter. In [35] a method for counting people using stereo is explained: reprojection and Gaussian mixtures are used to reduce the amount of data while maintaining the essential spatial characteristics of the tracked objects. In [36] methods for occlusion recovery (occlusion-preserving surfaces and occlusion-breaking surfaces) using depth maps are developed. Volume segmentation is also an important part of computer-based medical applications for the diagnosis and analysis of anatomical data, as it facilitates automatic or semi-automatic extraction of the anatomical organ or region of interest [31].

Object feature (descriptor) extraction
The result of the previous processing steps is a number of regions corresponding to objects in the analysed images. The next step is to find properties of these regions, i.e. object features that can help in their identification. These features can refer to the boundary of a region or to its

content. Commonly used boundary features belong to one of the following categories [4][6][9]:
- bounding box;
- extremity points (top-left, bottom-right, ...);
- chain codes;
- simple geometric shape descriptors such as length or curvature;
- Fourier descriptors of the boundary;
- features derived from the B-spline representation of a contour;
- fractal dimension;
- shape invariants such as the number of corners/vertices and the shape number.

Examples of classes of content (regional) features are [4][6][9]:
- simple scalar region descriptors such as area, compactness, elongatedness;
- moments;
- convex hull;
- graph representations based on the region skeleton.

Furthermore, there are descriptors that refer to the relation between regions: one region is inside another, or one is above another. Examples of techniques to derive such descriptors are:
- region decomposition;
- region neighbourhood graphs.

Various descriptors are sensitive to certain properties of an object while being invariant to others: a shape descriptor will discriminate between a square and a circle but not between a red circle and a blue one. The choice of the object descriptors used in an application depends on the goals of that application and on the type of objects one is looking for; this choice is the difficult point of the object descriptor extraction step. Other object descriptors may also be built from the information extracted by the tracking operation and are described below.

Object classification
In the last stage of a content analysis application, objects are classified based on their descriptors. The assumption of this step is that the system has a description of the objects of interest (reference objects), i.e. it has their descriptors. Having this information, the system compares the descriptors of the objects found in an image with the descriptors of the reference objects and, based on the result of the comparison, decides whether the found object is of a certain type. In the ideal case, the system has descriptions of all the objects that can be encountered in the analysed images and can consequently identify all objects in an image; in practice, a system has the descriptors of a certain set of objects of interest and is able to identify whether any of those objects are present in an image. As said earlier, an object is classified by comparing its descriptors with a reference set. The choice of the comparison method is the big challenge of the object recognition step, and it depends strongly on the type of object descriptors computed in the previous step. When quantitative descriptors such as boundary or regional descriptors are used, object recognition can be done using decision-theoretic methods (also called statistical pattern recognition methods [37][38]) such as:

- parametric density estimation methods, with the Bayesian approaches as the best known;
- nonparametric density estimation methods such as the k-nearest-neighbour method and kernel methods [39];
- linear discriminant analysis, e.g. the Fisher criterion [39] and the support vector machine classifier [40][41];
- nonlinear discriminant analysis, e.g. the maximum likelihood optimisation criterion and nonlinear support vector machine classifiers.

When relational descriptors are used, structural pattern recognition methods (also called syntactic pattern recognition techniques [38]), such as rule-based methods, grammatical methods and decision trees, are the best alternative. All of the above steps require some prior knowledge or make some assumptions about the type of task they are going to perform, which makes the design of a content analysis system application dependent. However, the same types of algorithms can be used in various stages of different content analysis applications, provided that the algorithms are finely tuned to the requirements of each application. Depending on the application, some of these steps may be reduced to a minimal version or may be performed at a later stage, after tracking, which is described in the next sections.

Object Tracking

Introduction
Tracking is the problem of generating an inference about the motion of an object given a sequence of images. Good solutions to this problem have a variety of applications, such as motion capture, recognition from motion (i.e. determining the identity of an object from its motion), surveillance and targeting. One may distinguish two phases in the object tracking operation. Given consecutive frames, the first step of efficient object tracking is to match the objects, i.e. the blobs, across the different images of the sequence; the second step is the estimation of the motion of the objects. Tracking an object creates new descriptors, such as speed or trajectory, and can eventually lead to a better segmentation process. One may also distinguish between feature point tracking and object tracking.

Objects Matching
The matching operation associates blobs of consecutive frames to determine the relation between them. Two kinds of approaches can be used. In the feature tracking approach, moving objects are represented by some feature points, as mentioned above, detected prior to or during tracking; the traditional statement of the feature point tracking problem treats the points as indistinguishable, and kinematic constraints alone are used to establish the correspondences (for example the IPAN tracker [43] or the KLT tracker [44]). The other approach computes simple distances between positions in the two frames, using the estimated position of the objects; this estimated position comes from the motion estimation.
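A minimal sketch of this second, distance-based matching approach is given below (Python with NumPy; the gating distance and the greedy assignment are illustrative simplifications of what a real system would do, which might instead use positions predicted by the motion estimator and an optimal assignment method such as the Hungarian algorithm).

# Illustrative centroid-based blob matching between two consecutive frames.
import numpy as np

def match_blobs(prev_centroids, curr_centroids, max_dist=50.0):
    """prev_centroids, curr_centroids: lists of (x, y) tuples.
    Returns a list of (i_prev, j_curr) index pairs, greedily matched."""
    if not prev_centroids or not curr_centroids:
        return []
    prev = np.asarray(prev_centroids, dtype=np.float32)
    curr = np.asarray(curr_centroids, dtype=np.float32)
    # Pairwise Euclidean distance matrix (n_prev x n_curr).
    dist = np.linalg.norm(prev[:, None, :] - curr[None, :, :], axis=2)

    matches, used_prev, used_curr = [], set(), set()
    # Greedy assignment: repeatedly take the globally smallest distance.
    for flat in np.argsort(dist, axis=None):
        i, j = np.unravel_index(flat, dist.shape)
        if i in used_prev or j in used_curr or dist[i, j] > max_dist:
            continue
        matches.append((int(i), int(j)))
        used_prev.add(i)
        used_curr.add(j)
    return matches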

Motion Estimation
Given two consecutive positions of an object, one can infer its next position (actually that of one of its representative points, such as the centre of gravity). This operation is global motion estimation and can be performed by several algorithms; a sample of the algorithms currently in use is given below.

Kalman Filtering
The Kalman filter [45] is a set of mathematical equations that provides an efficient, recursive computational solution of the least-squares method. The filter is very powerful in several respects: it supports estimation of past, present, and even future states, and it can do so even when the precise nature of the modelled system is unknown. Difficulties may arise when the estimated states have non-linear behaviour; a Kalman filter that linearises about the current mean and covariance is referred to as an extended Kalman filter (EKF).

Particle Filtering
The Kalman filter is limited in the range of probability distributions it can represent. Particle filtering is an approach that can handle more general distributions, for example distributions with multiple peaks, as well as high-dimensional state vectors. A particular type of particle filtering is the Condensation algorithm (Conditional Density Propagation) [46]; by using the statistical technique of importance sampling it is possible to build a Condensation tracker that runs in real time.

Mean-shift tracking
A more recent approach, developed by D. Comaniciu [47], is based on an efficient minimisation of the Bhattacharyya similarity measure [48] in a multi-dimensional parameter space. It is extremely robust to changes in appearance caused by camera motion, partial occlusion, clutter, and target scale variation. Its disadvantage is that a track must be initiated by an external agent.

Multi-hypothesis tracking (MHT)
Orthogonally to the previous considerations, a great improvement in the robustness of tracking comes from handling multiple hypotheses. I. J. Cox [49] developed an implementation of such multi-hypothesis tracking; hypotheses about an object may be made at the initiation of a track, at its termination and during the matching process. Recently, [50] and [51] proposed methods to fuse MHT schemes and particle filtering techniques; this research mainly focuses on optimisation by factorising the numerous hypotheses that can be drawn at each frame.

Example of application - vehicle tracking:
Sullivan et al. construct a set of regions of interest (ROIs) in each frame [52]. Their system then watches for characteristic edge signatures in the ROIs that indicate the presence of a vehicle.
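Since both the motion-estimation step described above and the feature-tracking example described next rely on Kalman filtering, here is a minimal constant-velocity Kalman tracker for a blob centroid, as a sketch only (Python with NumPy; the noise covariances and the initial uncertainty are illustrative assumptions, not tuned values).

# Constant-velocity Kalman filter for a 2-D centroid (illustrative sketch).
import numpy as np

class CentroidKalman:
    def __init__(self, x0, y0, dt=1.0):
        # State vector: [x, y, vx, vy]
        self.x = np.array([x0, y0, 0.0, 0.0], dtype=float)
        self.P = np.eye(4) * 100.0                       # initial uncertainty
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)  # constant-velocity model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # we observe (x, y) only
        self.Q = np.eye(4) * 0.01                        # process noise
        self.R = np.eye(2) * 4.0                         # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                                # predicted position

    def update(self, zx, zy):
        z = np.array([zx, zy], dtype=float)
        innovation = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                                # corrected position

The predicted position returned by predict() is exactly the kind of estimate the matching step above can use for gating candidate blobs.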

An alternative method for initiating car tracks is to track individual features and then group those tracks into possible cars. Beymer et al. use this strategy rather successfully [53]: their system tracks corner points, identified using a second-moment matrix, with a Kalman filter. Because the road is planar and the camera is fixed, the homography connecting the road plane and the image plane can be used to determine the distance between points, and points can lie together on a car only if this distance does not change with time.

Current problems in Object Tracking
Object tracking is not a solved problem. The main issue comes from the occlusion of objects behind a static part of the scene, or from two objects merging into one because they are too close for the camera to separate them. These kinds of problems lead to wrong matching, then wrong estimation, and so on. Several methods, such as tracking through occlusion, improve the matching and estimation process.

6.3 Legal aspects and privacy

European law related to video surveillance
In European law, Art. 8 of the European Human Rights Convention protects the right to privacy. The right to the protection of personal data has been confirmed by Art. 8 of the Draft Charter of Fundamental Rights of the European Community. A specification of these general provisions can be found in the European Privacy Protection Directive; this Directive, binding for all EU member states, is analysed in [54]. It is worth mentioning that the Directive is not applicable in matters of public security and if the data are not processed in files. So, first, surveillance by the police cannot be judged by the Directive; on the other hand, technical surveillance by private bodies is completely regulated by the Directive, even if the company is working in security. Second, a simple conventional camera-monitor system might not be a matter for the Directive, but the storage of digital pictures is. A few articles of the Directive may be stressed:

Art. 10 regulates the "notice": the affected person must be given information about
- the identity of the person in charge of the processing and of the processing body,
- the purpose of the processing,
- further recipients of the data, and
- the rights of the affected person.
In addition, Art. 12 guarantees detailed information on the storage and the logical structure of the automatic processing.

There may be practical problems in exercising the right to object (Art. 14) in video surveillance, because the data collection happens automatically, without any possibility for the affected person to intervene in the process. According to Art. 15, nobody shall be subject to a decision that considerably affects him and is made exclusively on the basis of automated data processing. This regulation is relevant if biometric methods of identification are used; the use of automated face recognition systems in public areas, which can have an immense impact on the affected person, conflicts with this regulation. Finally, Art. 20 and 21 of the Directive are worth mentioning: video surveillance undoubtedly carries specific risks for rights and liberties, so this method has to be subject to prior checking. Moreover, the controller must make available (on demand) to everyone information about the person in charge, the purpose of the processing, a description of the categories of those affected, the data recipients, and a general description of the measures taken to guarantee data security.

Forensic video decisions
Document [55] provides a good summary of the advantages and disadvantages of processing digital video data in surveillance applications, in particular when using the data in court. Besides the possible loss of quality due to compression or resolution limitations, other concerns are discussed. One important issue is compression or storage that exploits temporal redundancy for areas of the video frames where no changes occur. When the video sequence is reconstructed for interpretation, some pictures are in fact montages of events that took place over a period of time, even though they purport to have occurred at the same instant. So, when an image is printed, parts of the picture often contain activity that occurred at a different time from activity in another part of the same image. This could violate the obligation of reliability of the data. Pure intra coding is therefore to be preferred over methods using temporal compression.

7 State-of-the-art for multimedia content distribution

7.1 What is streaming technology?

Video delivery via file download
Probably the most straightforward approach for video delivery over the Internet is downloading; we refer to it as video download to keep in mind that what is downloaded is a video and not a generic file. Video download works like an ordinary file download, except that the file is typically very large. This approach allows the use of established delivery mechanisms, for example TCP at the transport layer or FTP and HTTP at the higher layers. However, it has a number of disadvantages. Since videos generally correspond to very large files, the download approach usually requires long download times and large storage space, which are important practical constraints. In addition, the entire video must be downloaded before viewing can begin. This requires patience on the viewer's part and also reduces flexibility in certain circumstances: if viewers are unsure whether they want to watch the video, they must still download it entirely before viewing it and making a decision.

Video delivery via streaming
Video delivery by streaming attempts to overcome the problems associated with file download, and also provides a significant amount of additional capability. The basic idea of video streaming is to split the video into parts, transmit these parts successively, and enable the receiver to decode and play back the video as these parts are received, without having to wait for the entire video to be delivered [204]. Video streaming can conceptually be thought of as consisting of the following steps:
1. Partition the compressed video into packets
2. Start delivery of these packets
3. Begin decoding and playback at the receiver while the video is still being delivered
Video streaming enables simultaneous delivery and playback of the video, in contrast to file download where the entire video must be delivered before playback can begin. In video streaming there is usually a short delay (on the order of 1-15 seconds) between the start of delivery and the beginning of playback at the client. This delay, referred to as the pre-roll delay, provides a buffer that absorbs variations in network behaviour. Video streaming thus provides a number of benefits, including a low delay before viewing starts and low storage requirements, since only a small portion of the video is stored at the client at any point in time. The length of the delay is given by the time duration of the pre-roll buffer, and the required storage is approximately given by the amount of data in the pre-roll buffer.

Expressing video streaming as a sequence of constraints
A significant amount of insight can be obtained by expressing the problem of video streaming as a sequence of constraints. Let the time interval between displayed frames be denoted by Δ: Δ is 33 ms for 30 frames/s video and 100 ms for 10 frames/s video. Each frame must be delivered and decoded by its playback time; therefore the sequence of frames has an associated sequence of delivery/decode/display deadlines (a small numerical sketch follows this list):
- Frame N must be delivered and decoded by time T_N
- Frame N+1 must be delivered and decoded by time T_N + Δ
- Frame N+2 must be delivered and decoded by time T_N + 2Δ
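The numerical sketch referred to above simply spells out this deadline sequence in Python; the 5-second pre-roll delay and the 30 frames/s rate are made-up example values.

# Deadlines T_N = t_start + preroll + N * delta (illustrative only).

def frame_deadline(n, t_start=0.0, preroll=5.0, fps=30.0):
    """Absolute deliver/decode deadline (in seconds) of frame n."""
    delta = 1.0 / fps
    return t_start + preroll + n * delta

# e.g. at 30 frames/s with a 5 s pre-roll buffer:
# frame 0 -> 5.000 s, frame 1 -> about 5.033 s, frame 300 -> 15.000 s
print(frame_deadline(0), frame_deadline(1), frame_deadline(300))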

Any data that is lost in transmission cannot be used at the receiver. Furthermore, any data that arrives too late is also useless: specifically, any data that arrives after its decoding and display deadline is too late to be displayed. (Note that such data may still be useful even if it arrives after its display time, for example if subsequent data depends on it.) Therefore, an important goal of video streaming is to perform the streaming in such a manner that this sequence of constraints is met. The most common example of streaming video-over-IP is a service provided by some cable or telecom companies called Video-on-Demand (VoD). In a VoD system a customer can order a movie which is served from a remote location over an IP network: the video is produced (served from a disk server), wrapped in UDP/IP packets, transported to the customer's location and consumed (watched).

7.2 Video Compression standard in Multimedia
The technologies for video compression are based on both the H.261/3/4 and the MPEG-1/2/4 standards. The most popular proprietary solutions (e.g. RealNetworks [209] and Microsoft Windows Media [210]) are based on the same basic principles and practices, and therefore by understanding these one can gain a basic understanding of both standard and proprietary video streaming systems. Video compression standards [204] provide a number of benefits, foremost of which is ensuring interoperability, i.e. communication between encoders and decoders made by different people or different companies. In this way, standards lower the risk for both the consumer and the manufacturer, which can lead to quicker acceptance and widespread use. In addition, these standards are designed for a large variety of applications, and the resulting economies of scale lead to reduced cost and further widespread use. Currently there are two families of video compression standards, developed under the auspices of the International Telecommunications Union - Telecommunications sector (ITU-T, formerly the International Telegraph and Telephone Consultative Committee, CCITT) and the International Organization for Standardization (ISO). The first video compression standard to gain widespread acceptance was ITU H.261 [211], which was designed for videoconferencing over the integrated services digital network (ISDN). H.261 was adopted as a standard and was designed to operate at multiples p = 1, 2, ..., 30 of the baseline ISDN data rate, i.e. at p x 64 kb/s. In 1993, the ITU-T initiated a standardisation effort with the primary goal of videotelephony over the public switched telephone network (PSTN) (conventional analogue telephone lines), where the total available data rate is only about 33.6 kb/s. The video compression portion of that standard is H.263, whose first phase was adopted in 1996 [205]. An enhanced H.263, H.263 Version 2 (V2), was finalised in 1997, and a completely new algorithm, originally referred to as H.26L, is currently being finalised as H.264/AVC. The Moving Pictures Expert Group (MPEG) was established by the ISO in 1988 to develop a standard for compressing moving pictures (video) and the associated audio on digital storage media (CD-ROM). The resulting standard, commonly known as MPEG-1, was finalised in 1991 and achieves approximately VHS-quality video and audio at about 1.5 Mb/s [206]. A second phase of this work, commonly known as MPEG-2, was an extension of MPEG-1 developed for application to digital television and for higher bit rates [207].
A third standard, to be called MPEG-3, was originally envisioned for higher bit rate applications such as HDTV, but it was recognized that those applications could also be addressed within the context of MPEG-2; hence those goals were wrapped into MPEG-2 (consequently, there is no

MPEG-3 standard). Currently, the video portion of the digital television (DTV) and high-definition television (HDTV) standards for large portions of North America, Europe, and Asia is based on MPEG-2. A third phase of work, known as MPEG-4, was designed to provide improved compression efficiency and error resilience features, as well as increased functionality, including object-based processing, integration of both natural and synthetic (computer-generated) content, and content-based interactivity [208]. The following table identifies the current and emerging video compression standards in multimedia applications.

Table 16: Current and emerging video compression standards

Video Coding Standard        | Primary Intended Applications                                           | Bit Rate
H.261                        | Video telephony and teleconferencing over ISDN                          | p x 64 kb/s
MPEG-1                       | Video on digital storage media (CD-ROM)                                 | 1.5 Mb/s
MPEG-2                       | Digital television                                                      | 2-20 Mb/s
H.263                        | Video telephony over PSTN                                               | 33.6 kb/s and up
MPEG-4 Part 2                | Object-based coding, synthetic content, interactivity, video streaming  | Variable
H.264 / MPEG-4 Part 10 (AVC) | Improved video compression                                              | 10s to 100s of kb/s

Currently, the video compression standards primarily used for video communication and video streaming are H.263 V2 and MPEG-4; the emerging H.264/MPEG-4 Part 10 (AVC) will probably gain wide acceptance as well.

7.3 Video file format standards
Media formats are the form and technology used to communicate information. Multimedia presentations, for example, combine sound, pictures, and videos, all of which are different types of media.

MPEG formats
The MPEG family of file formats includes some well-known file types with the following file name extensions: .mpg, .mpeg, .mp3, .mpa and .mpe. The Moving Picture Experts Group (MPEG) standards are an ever-evolving set of standards for video and audio coding and compression, developed by the Moving Picture Experts Group. The best-known standards are MPEG1, MPEG2, MPEG Audio Layer 3 (MP3) and the new MPEG4. The following is a short description of each of these formats.

MPEG1: This standard was designed to allow coding of progressive video at a transmission rate of about 1.5 Mb/s. The file format was originally designed specifically for Video-CD (VCD) and CD-i media. The most common implementations of the MPEG1 standard provide a video resolution of 352x240 (NTSC) / 352x288 (PAL) at 30 (NTSC) / 25 (PAL) frames per second (fps), although other resolutions and frame rates are possible. With this standard, the resulting video quality is slightly below that of conventional VHS VCR video.

MPEG2: MPEG2 is an enhanced form of MPEG1; it even includes MPEG1 headers in the data stream. Major improvements include additional prediction modes and increased precision. The result is higher-quality video, though at the expense of additional encoding/decoding power. Video encoded with MPEG2 commonly uses a higher resolution than MPEG1, but this is not an

absolute rule. DVD video, as well as Super Video-CD (SVCD), is coded with MPEG2. The screen resolution of DVD is 720x480 (NTSC) / 720x576 (PAL), while SVCD uses 480x480 (NTSC) / 480x576 (PAL) [214].

MPEG Audio Layer-3 (MP3): This standard has also evolved from the early MPEG work. It is an audio compression technology that is part of the MPEG1 and MPEG2 specifications. MP3 was developed in 1991 by the Fraunhofer Institute in Germany, and it uses perceptual audio coding to compress near-CD-quality sound by a factor of 12, while providing almost the same fidelity. Perceptual audio coding eliminates audio frequencies that are inaudible to the human ear. It is noteworthy that quite a few audio coding schemes exist that are more efficient and produce the same or better sound quality than MP3, but because of its great success among PC users MP3 has become a de facto standard for storing music on computers.

MPEG4: MPEG4 is the result of another international effort involving hundreds of researchers and engineers from all over the world in the Moving Picture Experts Group. MPEG4, whose formal designation is ISO/IEC 14496, was finalized in October 1998 and became an International Standard a few months later. The backward-compatible extensions under the title of MPEG4 Version 2 were frozen at the end of 1999 and acquired formal International Standard status shortly afterwards. Some work on extensions in specific domains is still in progress. MPEG4 builds on the proven success of three fields:
- digital television;
- interactive graphics applications (synthetic content);
- interactive multimedia (World Wide Web, distribution of and access to content).
MPEG4 provides the standardized technological elements enabling the integration of the production, distribution and content access paradigms of these three fields. Microsoft created the first implementation of this standard in the United States in Windows Media Technologies with the release of the Microsoft MPEG4 version 3 video codec. The standard was developed for encoding multimedia content efficiently at a variety of bit rates, from low Internet rates up to rates that reproduce a full-frame, television-quality presentation. The Microsoft MPEG4 video codec intrinsically supports streaming multimedia by allowing multiple streams to exist in one encoded data stream. It also has an advanced motion estimation algorithm, which allows for greater compression. Other versions of MPEG4 have since been developed by other vendors; both QuickTime and DivX support their own versions of MPEG4. The goal over time is to make the different MPEG4 versions interoperable, so that any player can play clips authored by any other vendor.

RealPlayer
Media files with the file extensions .ra, .rm and .ram are known as RealPlayer files. RealPlayer content is media created with software developed by RealNetworks. This software can stream live or pre-recorded audio or video to a client computer, either to a RealPlayer client program or to a Web browser with the RealPlayer plug-in, decompressing it dynamically so that it can be played back in real time. RealNetworks is one of the industry leaders in Internet audio and video streaming technologies; their main competitors are Apple Computer's QuickTime and Microsoft's Windows Media formats. Their core technologies, RealVideo and RealAudio, form the basis of many content distribution systems on the Internet.
Unlike many other solutions, their streaming technologies use UDP and RTP, and require a special streaming server. Their main consumer-end software product is RealPlayer (now known as RealOne Player), the front end to the aforementioned technologies.

RealAudio and RealVideo content is delivered by first downloading a .ram file to the computer, which gives directions on how to retrieve the stream; RealPlayer then connects and retrieves the .rm (RealMedia/video) or .ra (RealAudio) file. RealPlayer supports the Synchronized Multimedia Integration Language (SMIL), a language for delivering multimedia presentations.

QuickTime
QuickTime is more than what people often believe it is. In addition to the common understanding that it is a video file format using the file extensions .mov and .qt, it is also a programming library and API in C and Java made by Apple. The QuickTime concept includes a browser plug-in and a file format for the display, playback, editing and creation of all kinds of multimedia, e.g. audio, video, animation, graphics, 3D graphics and VR. It is probably Apple's most important technology after Mac OS. Official versions are available for Mac and Windows only, but there are several free software projects to offer support on Linux as well. In May 1991, Apple announced the first version of QuickTime, available on Macintosh only. It was not until the World Wide Web became more widely used, and particularly when Apple released a version for Windows in 1994, that QuickTime came into its own: the QuickTime plug-in enabled web users to view content such as movies and sound that were starting to become available. Despite heavy competition from RealOne Player and Windows Media Player, QuickTime is very popular. Over 100 million copies of QuickTime 4 were downloaded, and QuickTime 5 is on track to exceed that within its first year of release. A large proportion of this popularity is attributable to the fact that QuickTime is by far the most popular format for the delivery of movie trailers on the web: trailers such as those for Star Wars: Episode I and The Lord of the Rings were primarily available in the QuickTime .mov format, and millions of people installed QuickTime in order to be able to view them. It should be pointed out that the QuickTime File Format (.mov) is not a codec itself, but a container format for a large number of other codecs. The format is based on the Macintosh resource fork and is represented by a tree-like structure. Data and metadata are stored in atoms, which are simply containers: branch atoms contain several related leaf atoms, which hold the data itself. The actual media data is stored in tracks, so, for example, a movie clip may contain a video track, an audio track and perhaps several text tracks for subtitles/closed captions. This format is very flexible and openly documented, meaning that third parties such as the QuickTime for Linux project can create software that reads and writes .mov files without QuickTime having to be installed. The format also forms the basis of the MPEG-4 file format. QuickTime supports a large number of video compressors or codecs, but the most important codecs for QuickTime are probably the Sorenson codecs. These are licensed exclusively to Apple and constitute a key factor in many decisions to choose QuickTime over other platforms for video. The Sorenson codec gives very good quality and relatively small file sizes, and is probably the thing that Linux and other non-Windows or non-Mac users miss most through not having official QuickTime; it is reputedly a great codec, and Apple guards it jealously, with good reason. The full list of codecs supported for input and output in QuickTime can be found in Appendix C.
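To make the atom structure described above concrete, the following Python sketch walks the top-level atoms of a .mov file using only the documented size/type layout (a 4-byte big-endian size followed by a 4-byte type, a size of 1 signalling a 64-bit extended size and a size of 0 meaning "to end of file"); the file name example.mov is hypothetical.

# Sketch of listing the top-level atoms of a QuickTime file.
import struct

def list_top_level_atoms(path):
    atoms = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            size, atom_type = struct.unpack(">I4s", header)
            header_len = 8
            if size == 1:                          # 64-bit extended size follows
                size = struct.unpack(">Q", f.read(8))[0]
                header_len = 16
            elif size == 0:                        # atom extends to end of file
                atoms.append((atom_type.decode("latin-1"), None))
                break
            atoms.append((atom_type.decode("latin-1"), size))
            f.seek(size - header_len, 1)           # skip the atom payload
    return atoms

# Typical output would include atoms such as 'ftyp', 'moov' and 'mdat'.
print(list_top_level_atoms("example.mov"))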
Developers can also create plug-ins to enable further formats, which are loaded automatically when a file requiring them is encountered.

Microsoft's Windows media files
The following file formats are standards for the Microsoft Windows operating systems, but can also be used with Windows Media Player on other operating systems, such as Linux and Mac OS.

Audio Video Interleave (.avi): Audio Video Interleave (AVI) is a special case of the Resource Interchange File Format (RIFF). AVI is another format that has been defined by Microsoft. The

.avi file format is perhaps the most common format for audio and video data on computers, and it is a good example of a de facto standard. Many different codecs can be used to form an .avi file [215].

Advanced Streaming Format (.asf): This file format stores both audio and video information and is specifically designed to run on networks like the Internet. It is a flexible and compressed format that can contain streaming audio and/or video, slide shows, and synchronized events. When .asf files are used, the content is delivered to the application as streamed data, whether it is streamed from the Internet or not. An Audio Video Interleave (.avi) file can be compressed and converted to an .asf file, with the result that the file can begin playing over networks after only a few seconds. Files can be of unlimited length and can run over Internet bandwidths.

Windows Media Audio (.wma): This file type uses the Windows Media Audio codec created by Microsoft. The codec is designed to handle all types of audio content. Such files are very resistant to signal degradation caused by loss of data, and this loss tolerance makes the file type useful for streaming content. In addition, when the improved encoding algorithm is used, this codec processes audio quickly. According to Microsoft, the improved compression algorithm also creates smaller audio files than most other codecs that compress the same content; the smaller file size means that content created using the Windows Media Audio codec can be downloaded faster. The quality may suffer as a consequence of fast and efficient compression, though. According to Microsoft, Windows Media Audio sounds better, delivering the same quality as MP3 at half the size [215].

Windows Media file with Audio and/or Video (.wmv): A .wmv file can be used either to download and play files or to stream content. The .wmv file format is similar to the Advanced Streaming Format; see the section on the .asf file type above for more information about the properties of these files.

Audio for Windows (.wav): Microsoft Windows uses the Wave Form Audio (WAV) file format to store sounds as waveforms.

DivX
DivX is a video codec which originally started as an alternative version of the Microsoft MPEG4 version 3 video codec, but has gradually evolved into its own format. It is not a particularly standardised format, as it has its origins in the open source environment and is regularly updated; currently DivX version 5 is the latest, with more versions sure to follow. DivX movie files often employ an MP3 audio codec for the sound, but other sound codecs can also be used. This video file format has become very popular over the last couple of years, as it makes it possible to compress the video data of a DVD disc holding up to 8 GB down to the size of a CD, which holds around 700 MB. Of course this means sacrificing picture and sound quality, but the result is more than acceptable in many cases.

Streaming format vs file format

Recording vs. transmission
As described in [213], for a given picture quality the data rate of compressed video varies with picture content, so a variable bit-rate channel gives the best results. In transmission, however, most practical channels are fixed and the overall bit rate is kept constant by the use of stuffing (meaningless data).
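As a rough, hypothetical illustration of the stuffing idea, and anticipating the 188-byte transport-stream packets discussed later in this section, the sketch below computes how many null (stuffing) packets per second a multiplexer would insert to hold a variable programme rate at a constant channel rate; all figures are made-up examples.

# Illustrative stuffing calculation for a constant-bit-rate multiplex.
TS_PACKET_BYTES = 188

def null_packets_per_second(channel_rate_bps, programme_rate_bps):
    """Number of 188-byte stuffing (null) packets needed each second."""
    spare_bps = channel_rate_bps - programme_rate_bps
    if spare_bps < 0:
        raise ValueError("programme data exceeds the channel rate")
    return spare_bps // (TS_PACKET_BYTES * 8)

# e.g. an 8 Mbit/s channel carrying 6.5 Mbit/s of programme data
# needs roughly 997 null packets per second of stuffing:
print(null_packets_per_second(8_000_000, 6_500_000))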

In a DVD, the use of stuffing would be a waste of storage capacity. However, a storage medium can be slowed down or speeded up, either physically or, in the case of a disk drive, by changing the rate of data transfer requests. This approach allows a variable-rate channel to be obtained without any capacity penalty. When a medium is replayed, the speed can be adjusted to keep a data buffer approximately half full, irrespective of the actual bit rate, which can change dynamically. If the decoder reads from the buffer at an increased rate, it will tend to empty the buffer, and the drive system will simply increase the access rate to restore the balance. This technique only works if the audio and video were encoded from the same clock; otherwise, they will slip over the length of the recording. To satisfy these conflicting requirements, Program Streams and Transport Streams have been devised as alternatives: a Program Stream works well for a single program with variable bit rate in a recording environment, whereas a Transport Stream works well for multiple programs in a fixed-bit-rate transmission environment. The problem of genlocking to the source does not occur in a DVD player: the player determines the time base of the video with a local SPG (internal or external) and simply obtains data from the disk in order to supply pictures on that time base. In transmission, the decoder has to recreate the time base of the encoder or it will suffer buffer overflow or underflow. Thus, a Transport Stream uses a Program Clock Reference (PCR), whereas a Program Stream has no need for a program clock.

Introduction to program streams
A Program Stream is a PES packet multiplex that carries several elementary streams that were encoded using the same master clock or system time clock (STC). This stream might be a video stream and its associated audio streams, or a multi-channel audio-only program. The elementary video stream is divided into Access Units (AUs), each of which contains the compressed data describing one picture. These pictures are identified as I, P, or B, and each carries an Access Unit number that indicates the correct display sequence. One video Access Unit becomes one program-stream packet; in video, these packets vary in size, an I-picture packet being much larger than a B-picture packet. Digital audio Access Units are generally of the same size, and several are assembled into one program-stream packet. These packets should not be confused with transport-stream packets, which are smaller and of fixed size. Video and audio Access Unit boundaries rarely coincide on the time axis, but this lack of coincidence is not a problem because each boundary has its own time stamp structure. Program Streams are one way of combining several PES packet streams and are advantageous for recording applications such as DVD.

Transport stream [213]
A transport stream is more than a multiplex of many PES packets. In Program Streams, time stamps are sufficient to recreate the time axis because the audio and video are locked to a common clock. For transmission over distance through a data network, there is the additional requirement of recreating the clock for each program at the decoder, which requires an additional layer of syntax to provide program clock reference (PCR) signals. The transport stream carries many different programs, and each may use a different compression factor and a bit rate that can change dynamically even though the overall bit rate stays constant.
This behaviour is called statistical multiplexing and it allows a program that is handling difficult material to borrow bandwidth from a program handling easy material. Each video PES can have a different number of audio and data PESs associated with it. Despite this flexibility, a Page 108/154

109 decoder must be able to change from one program to the next and correctly select the appropriate audio and data channels. Some of the programs can be protected so that they can only be viewed by those who have paid a subscription or fee. The transport stream must contain conditional access (CA) information to administer this protection. The transport stream contains program specific information (PSI) to handle these tasks. The transport layer converts the PES data into small packets of constant size that are self contained. When these packets arrive at the decoder, there may be jitter in the timing. The use of time division multiplexing also causes delay, but this factor is not fixed because the proportion of the bit stream allocated to each program is not fixed. Time stamps are part of the solution, but they only work if a stable clock is available. The transport stream must contain further data allowing the recreation of a stable clock. The operation of digital video production equipment is heavily dependent on the distribution of a stable system clock for synchronization. In video production, genlocking is used, but over long distances, the distribution of a separate clock is not practical. In a transport stream, the different programs may have originated in different places that are not necessarily synchronized. As a result, the transport stream has to provide a separate means of synchronizing for each program. This additional synchronization method is called a Program Clock Reference (PCR) and it recreates a stable reference clock that can be divided down to create a time line at the decoder, so that the time stamps for the elementary streams in each program stream become useful. Consequently, one definition of a program is a set of elementary streams sharing the same timing reference. In a Single Program Transport Stream (SPTS), there will be one PCR channel that recreates one program clock for both audio and video. The SPTS is often used as the communication between an audio/video coder and a multiplexer. Figure 36: Transport Stream Packet Structure Figure 36 shows the structure of a transport stream packet. The size is a constant of 188 bytes and it is always divided into a header and a payload. a) shows the minimum header of 4 bytes. In this header, the most important information is: 1. The sync byte. This byte is recognized by the decoder so that the header and the payload can be deserialized. Page 109/154

2. The transport error indicator. This indicator is set if the error correction layer above the transport layer is experiencing a raw bit error rate (BER) that is too high to be correctable. It indicates that the packet may contain errors. See Section 8 for details of the error correction layer.

3. The Packet Identification (PID). This thirteen-bit code is used to distinguish between different types of packets. More will be said about PIDs later.

4. The continuity counter. This four-bit value is incremented by the encoder as each new packet having the same PID is sent. It is used to determine if any packets are lost, repeated, or out of sequence.

In some cases, more header information is needed, and if this is the case, the adaptation field control bits are set to indicate that the header is larger than normal. Part b) shows that when this happens the extra header length is described by the adaptation field length code. Where the header is extended, the payload becomes smaller to maintain constant packet length.

Program Clock Reference (PCR)

As described in [213], the encoder used for a particular program will have a 27 MHz program clock. In the case of an SDI (Serial Digital Interface) input, the bit clock can be divided by 10 to produce the encoder program clock. Where several programs originate in the same production facility, it is possible that they will all have the same clock. In the case of an analog video input, the H sync period will need to be multiplied by a constant in a phase-locked loop to produce 27 MHz. The adaptation field in the packet header is periodically used to include the PCR (program clock reference) code that allows generation of a locked clock at the decoder. If the encoder or a remultiplexer has to switch sources, the PCR may have a discontinuity. The continuity count can also be disturbed. This event is handled by the discontinuity indicator, which tells the decoder to expect a disturbance. Otherwise, a discontinuity is an error condition.

Figure 37: Use of the PCR by the decoder

Figure 37 shows how the PCR is used by the decoder to recreate a remote version of the 27 MHz clock for each program. MPEG requires that PCRs are sent at a rate of at least 10 PCRs per second, whereas DVB specifies a minimum of 25 PCRs per second.

Packet Identification (PID)

As described in [213], a 13-bit field in the transport packet header contains the Packet Identification code (PID). The PID is used by the demultiplexer to distinguish between packets containing different types of information. The transport-stream bit rate must be constant, even though the sum of the rates of all of the different streams it contains can vary. This requirement is handled by the use of null packets that contain all zeros in the payload. If the real payload rate falls, more null packets are inserted. Null packets always have the same PID, which is 8191, i.e. thirteen 1's.

In a given transport stream, all packets belonging to a given elementary stream will have the same PID. Packets in another elementary stream will have another PID. The demultiplexer can easily select all data for a given elementary stream simply by accepting only packets with the right PID. Data for an entire program can be selected using the PIDs for video, audio, and teletext data. The demultiplexer can correctly select packets only if it can correctly associate them with the elementary stream to which they belong, and it can do this only if it knows what the right PIDs are. This is the function of the Program Specific Information (PSI).

Program Specific Information (PSI)

PSI is carried in packets having unique PIDs, some of which are standardized and some of which are specified by the Program Association Table (PAT) and the Conditional Access Table (CAT). These packets must be included periodically in every transport stream. The PAT always has a PID of 0, and the CAT always has a PID of 1. These values and the null packet PID of 8191 are the only fixed PIDs in the whole MPEG system. The demultiplexer must determine all of the remaining PIDs by accessing the appropriate table. However, in ATSC and DVB, PMTs may require specific PIDs. In this respect (and in some others), MPEG and DVB/ATSC are not fully interchangeable.

The program streams that exist in the transport stream are listed in the Program Association Table (PAT) packets (PID = 0), which specify the PIDs of all Program Map Table (PMT) packets. The first entry in the PAT, program 0, is always reserved for network data and contains the PID of the Network Information Table (NIT) packets. The PIDs for Entitlement Control Messages (ECM) and Entitlement Management Messages (EMM) are listed in the Conditional Access Table (CAT) packets (PID = 1).

Figure 38: PIDs of video, audio and data elementary streams belonging to the same program stream

As Figure 38 shows, the PIDs of the video, audio and data elementary streams that belong to the same program stream are listed in the Program Map Table (PMT) packets. Each PMT packet has its own PID. Upon first receiving a transport stream, the demultiplexer must look for PIDs 0 and 1 in the packet headers. All PID 0 packets contain the Program Association Table (PAT). All PID 1 packets contain Conditional Access Table (CAT) data.
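As a concrete illustration of the packet header and PID mechanics described above, the following sketch parses the 4-byte header of a 188-byte transport stream packet and, when the adaptation field carries a PCR, converts it to seconds. Field positions follow ISO/IEC 13818-1; the packet built in main() is a hand-made null packet used only to exercise the parser, not data from the WCAM system.

```cpp
#include <cstdint>
#include <cstdio>

// Minimal TS header parser. Assumes the buffer is aligned on the 0x47 sync byte.
struct TsHeader {
    bool     transport_error;
    bool     payload_unit_start;
    uint16_t pid;
    uint8_t  adaptation_field_control;
    uint8_t  continuity_counter;
};

bool parse_ts_packet(const uint8_t pkt[188], TsHeader& h, double* pcr_seconds) {
    if (pkt[0] != 0x47) return false;                          // sync byte missing
    h.transport_error          = (pkt[1] & 0x80) != 0;
    h.payload_unit_start       = (pkt[1] & 0x40) != 0;
    h.pid                      = ((pkt[1] & 0x1F) << 8) | pkt[2];   // 13-bit PID
    h.adaptation_field_control = (pkt[3] >> 4) & 0x03;
    h.continuity_counter       = pkt[3] & 0x0F;

    // Adaptation field present (control value 2 or 3), non-empty, and PCR flag set?
    if ((h.adaptation_field_control & 0x02) && pkt[4] > 0 && (pkt[5] & 0x10)) {
        // PCR = 33-bit base (90 kHz units) * 300 + 9-bit extension => 27 MHz units
        uint64_t base = (uint64_t(pkt[6]) << 25) | (uint64_t(pkt[7]) << 17) |
                        (uint64_t(pkt[8]) << 9)  | (uint64_t(pkt[9]) << 1)  |
                        (pkt[10] >> 7);
        uint64_t ext  = ((pkt[10] & 0x01) << 8) | pkt[11];
        if (pcr_seconds) *pcr_seconds = double(base * 300 + ext) / 27000000.0;
    } else if (pcr_seconds) {
        *pcr_seconds = -1.0;                                   // no PCR in this packet
    }
    return true;
}

int main() {
    uint8_t pkt[188] = {0x47, 0x1F, 0xFF, 0x10};               // minimal null packet (PID 8191)
    TsHeader h; double pcr;
    if (parse_ts_packet(pkt, h, &pcr)) {
        const char* kind = (h.pid == 0) ? "PAT" : (h.pid == 1) ? "CAT"
                         : (h.pid == 0x1FFF) ? "null/stuffing" : "elementary/PSI";
        std::printf("PID %u (%s), CC %u\n", unsigned(h.pid), kind,
                    unsigned(h.continuity_counter));
    }
    return 0;
}
```

A PID filter in a demultiplexer is essentially this parse followed by a comparison of the extracted PID against the set of PIDs selected for the program being decoded.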

By reading the PAT, the demux can find the PIDs of the Network Information Table (NIT) and of each Program Map Table (PMT). By finding the PMTs, the demux can find the PIDs of each elementary stream. Consequently, if the decoding of a particular program is required, reference to the PAT and then the PMT is all that is needed to find the PIDs of all of the elementary streams in the program. If the program is encrypted, then access to the CAT will also be necessary. As demultiplexing is impossible without a PAT, the lock-up speed is a function of how often the PAT packets are sent. MPEG specifies a maximum of 0.5 seconds between the PAT packets and the PMT packets that are referred to in those PAT packets. In DVB and ATSC, the NIT may reside in packets that have a specific PID.

Figure 39 shows the structure of a file containing an MPEG2-TS transport stream. The file is already organised as detailed above and allows optimisation in the streaming pump while sending the content.

Figure 39: MPEG2-TS file structure

Streaming MPEG4 vs MPEG2

While MPEG-2 can be used for streaming over Ethernet, its high bit rate precludes streaming high-quality video over other types of networks. MPEG-4, using about one half the bit rate of MPEG-2 for similar video quality, offers the ability to stream high-quality video over almost any broadband connection.

Transport-stream format

To take advantage of existing infrastructures, it is also possible to use MPEG-4 over MPEG-2 transport. With this technique, which is also applicable to digital cable, satellite, and terrestrial broadcasts, MPEG-2 transport streams contain MPEG-4 content rather than MPEG-2 content.
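Returning to the PAT lookup described at the start of this subsection, the sketch below extracts the program_number to PMT-PID mapping from a single PAT section, with the field layout taken from ISO/IEC 13818-1. It is a simplified illustration: it assumes the section is complete and fits in one TS packet, that the caller has already skipped the pointer_field, and it omits CRC checking and version handling.

```cpp
#include <cstdint>
#include <map>

// 'section' points at the table_id byte of a PAT section (PID 0 packet,
// payload_unit_start set, pointer_field already skipped).
std::map<uint16_t, uint16_t> parse_pat(const uint8_t* section) {
    std::map<uint16_t, uint16_t> programs;            // program_number -> PID
    if (section[0] != 0x00) return programs;          // table_id 0x00 identifies a PAT
    uint16_t section_length = ((section[1] & 0x0F) << 8) | section[2];

    // Bytes following the section_length field: 5 header bytes + N*4 entries + 4 CRC
    int entries = (int(section_length) - 5 - 4) / 4;
    if (entries < 0) entries = 0;

    const uint8_t* p = section + 8;                   // first program entry
    for (int i = 0; i < entries; ++i, p += 4) {
        uint16_t program_number = (p[0] << 8) | p[1];
        uint16_t pid = ((p[2] & 0x1F) << 8) | p[3];
        // program_number 0 carries the NIT PID; all others point to PMT PIDs.
        programs[program_number] = pid;
    }
    return programs;
}
```

Once the PMT PID for the desired program is known, the same kind of loop over the PMT's elementary-stream entries yields the video, audio and data PIDs to hand to the PID filter.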

7.4 Standard Network protocols for video distribution

The Transport Layer protocols that are covered are:
1. UDP - User Datagram Protocol
2. TCP - Transmission Control Protocol

The higher-layer protocols that are covered are:
1. RTP - Real-time Transport Protocol (Section 7.4.3)
2. RTSP - Real-Time Streaming Protocol (Section 7.4.4)

7.4.1 UDP

UDP is an unreliable, connectionless transport protocol. There is no ordering of packets, no retransmission of lost or damaged packets, and no splitting of data into packets. It assumes the use of IP as the underlying network protocol. Applications can use it to send encapsulated raw IP datagrams [203].

Advantages for video transmission

Low header overhead: The header is only 8 bytes long (compared with 20+ bytes for TCP), which means that little of the precious bandwidth is lost to it per datagram.

No retransmission: If there is congestion along the route, UDP will drop packets and not bother with retransmission. Therefore, there is no waiting overhead which would be induced by requesting and receiving a dropped packet. (The assumption here is that missing data is not significantly noticeable in moving pictures.)

Disadvantages for video transmission

Data loss: In some cases, especially when dealing with low-bandwidth video, losing a few packets could equate to losing a whole second or more (when the frame rate is low and the number of packets per frame is low). This would mean data loss even in cases where bandwidth is available for retransmission.

No congestion control: If the data is not released to UDP in a timed fashion (i.e. controlled by a higher protocol or application), UDP will saturate the network by releasing as much data as possible in the shortest amount of time. This could cause severe network congestion.

Synthesis

The most basic requirement of real-time media is on-time delivery; reliability is a secondary issue. In a networked environment one often has to trade off between these two features. To deliver the best quality real-time media to the client, reliability has to be sacrificed for on-time delivery. UDP takes this concept to the extreme by not guaranteeing any level of reliability. It is the ultimate protocol for real-time media transmission in most circumstances, but especially when a dynamically adaptive protocol is implemented on top of it (with capabilities to dynamically provide a relevant level of reliability).

7.4.2 TCP

TCP is a reliable, connection-oriented, end-to-end protocol. It breaks up data into chunks that never exceed 64 KB (usually about 1500 bytes) and sends each as a separate IP datagram.

TCP times out transmitted data and retransmits it as needed. It also takes care of reordering datagrams into the originally sent order. All TCP connections are full-duplex (traffic can go in both directions at the same time) and point-to-point (each connection has exactly two end points; multicast and broadcast are not supported by TCP). A TCP connection is a byte stream as opposed to a message stream, which means that it does not preserve message boundaries end-to-end. TCP decides whether data is sent immediately or buffered until a substantial amount has been gathered.

Advantages for video transmission

No data loss: TCP is a reliable protocol, which means that every byte presented to the TCP layer at the sender side is received by the appropriate protocol above TCP on the receiver side (in the correct order), or the connection is broken. This is especially important for the audio component of real-time media, where the dropping of particular segments (such as the word "not") can totally change the meaning of the message. This feature is most beneficial in the rare situations where the available bandwidth exceeds the data rate of the media transmitted and the media presented to the client side is not affected by the delay caused by media retransmission.

Disadvantages for video transmission

High delay: Because TCP has to deliver all the bytes to the higher-level protocols at the receiver side in the exact order that it received them from the higher-level protocol on the sender side, it has to ensure that every byte reaches the TCP layer at the receiver end. Mandatory retransmission of unacknowledged bytes ensures this, but it also causes high delays in the presentation of these, and subsequent, bytes to the higher-level protocols. This shows up as undesired media break-up and pausing in the audio and video. The congestion control component of TCP can also cause delays. When dealing with video content that has a constant delivery-rate requirement, the congestion control algorithm can directly undermine this requirement by dropping the instantaneous delivery rate in response to network congestion. Although no delays due to ACK timer time-outs would be associated with this delay, the delay could still be significant if the network is perceived to be severely congested.

High bandwidth overhead: The fixed part of the TCP header is 20 bytes. There are additional, optional header fields which can extend the header length to as much as 60 bytes. One of these headers is added per segment, which means that a large percentage of the bandwidth consumed by TCP transmissions goes to headers as opposed to data. This is especially true for acknowledgment segments, which are usually only one byte (of data) long.

7.4.3 RTP: Real Time Protocol

RTP [203] provides end-to-end network transport functions suitable for networked transmission of real-time data, such as video. Currently, it is most widely used in IP environments, running on top of UDP, but it is not limited to this environment as it was developed to function independently of the underlying transport and network layers. It does not provide any mechanisms to ensure timely delivery or provide quality-of-service guarantees. It does not guarantee delivery or prevent out-of-order delivery like TCP does. Even though it has mostly been implemented in applications rather than the OS kernel, RTP is still a transport protocol.
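For orientation, the sketch below serializes the 12-byte RTP fixed header defined in RFC 3550 (version, padding, extension, CSRC count, marker bit, payload type, sequence number, timestamp and SSRC). The helper function and the values a caller would pass are purely illustrative; they are not taken from a WCAM component.

```cpp
#include <cstdint>

// Writes the RFC 3550 fixed header in network byte order into a 12-byte buffer.
void write_rtp_header(uint8_t out[12], uint8_t payload_type, uint16_t seq,
                      uint32_t timestamp, uint32_t ssrc, bool marker) {
    out[0]  = 0x80;                                     // V=2, P=0, X=0, CC=0
    out[1]  = (marker ? 0x80 : 0x00) | (payload_type & 0x7F);
    out[2]  = seq >> 8;               out[3]  = seq & 0xFF;
    out[4]  = timestamp >> 24;        out[5]  = (timestamp >> 16) & 0xFF;
    out[6]  = (timestamp >> 8) & 0xFF; out[7] = timestamp & 0xFF;
    out[8]  = ssrc >> 24;             out[9]  = (ssrc >> 16) & 0xFF;
    out[10] = (ssrc >> 8) & 0xFF;     out[11] = ssrc & 0xFF;
}
```

When MPEG-2 transport streams are carried over RTP, the static payload type 33 (MP2T, defined in RFC 3551) is commonly used, with an integral number of 188-byte TS packets placed after this fixed header.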

Because multimedia applications vary in their requirements, RTP is by design a protocol framework rather than a simple protocol. This is why it was designed to be tightly coupled to the application, unlike TCP which is implemented as a distinct layer. RTP services include payload type identification, sequence numbering, and time stamping. A closely integrated control protocol called RTCP (Real-time Transport Control Protocol) monitors the delivery of data and provides feedback for dynamic adjustments.

Advantages

RTP was designed from the ground up for the purpose of real-time media transmission over the network. The advantages that it provides to video transmission are many, but from the network perspective, the real-time control feedback provided by RTCP is key. This feedback provides congestion control, which UDP does not have, and, in so doing, when used in conjunction with UDP, minimizes the losses through dropped frames. It also provides information which the streaming application can use as a basis for deciding on the packet structure that it should transmit and the time to wait for acknowledgments (when used in conjunction with TCP).

The true benefits of RTP are realized when it is used in conjunction with UDP. It provides the basic features needed (but not provided by UDP) for video streaming. It exploits the advantages of UDP and adds a thin layer of features, at a slight performance cost, to make it more suitable for real-time media transmission [203]. On the other hand, when used on top of TCP, RTP could enhance the transmission of video by providing the synchronization feature, but, in most cases, the overhead introduced by the replication of some of TCP's functionality (e.g. packet sequencing) more than overshadows any performance gains.

Disadvantages

RTP outlines a framework for real-time manipulation of data between retrieval from storage and transmission over the network. This has the potential to introduce two types of overhead [203]:

1. Processing overhead, which, in extreme cases, could cause under-utilization of the available network bandwidth. In effect, a bad implementation could shift the bottleneck from the network to the hardware/software internal to the video server, which should not happen considering the processing and transfer bandwidths of today's servers. The processing overhead must be monitored carefully in RTP implementations to ensure that such a bottleneck shift does not occur.

2. Communications overhead, which comprises both control and header overhead. The control overhead is limited by RTCP to 5% of the total transmission bandwidth, but that is only for the control traffic. The RTP header overhead could be higher; that depends on how the messages handed down to the lower network layers are sized. The smaller these messages are, the bigger the header overhead (one header per message) becomes. As mentioned earlier, the minimum (fixed) header size of an RTP packet is 12 bytes.

Note: One should notice that when using the Transport Stream format, it is redundant to use the RTP protocol; streaming engines generally stream Transport Streams directly over UDP.

7.4.4 RTSP

RTSP defines an extensible framework for control and delivery of real-time media across the network. It defines the connection between streaming media client and server software and provides a standard way for client and server software from multiple vendors to stream multimedia content. It is designed to work on top of the well-established RTP protocol, but it is not tied to it. It does not rely on the reliability of the underlying protocols and can run on top of either UDP or TCP.

RTSP is in many ways similar to HTTP: the syntax and operation are intentionally similar. It does, however, differ from HTTP in a number of crucial ways:
- it introduces a number of new methods;
- it needs to maintain state by default in almost all cases;
- the data is carried out-of-band, by a different protocol;
- both a client and a server can issue requests.

Advantages

As it is really a set of libraries that allow for a standard way to access features of lower-layer protocols, RTSP is good for video streaming from a marketing perspective. It provides software developers with a layer of abstraction from the complicated world of networking, and therefore makes video server and client application development much easier. From a performance point of view, there are no obvious advantages to using RTSP.

Disadvantages

RTSP is an application-layer collection of commands that has to be implemented as part of all video streaming software packages. As such, unless the implementation is non-optimal, it cannot be a disadvantage to the overall video streaming application.

7.5 Unicast & Multicast

A UDP packet is encapsulated in an IP packet, which can be a unicast or a multicast packet.

7.5.1 Unicast

In the unicast mode, a UDP packet is sent from one machine to another. The sender specifies the destination IP address (and the destination port) in the IP header.

7.5.2 Multicast

The range of IP addresses from 224.0.0.0 through 239.255.255.255 is reserved exclusively for IP multicast. In the multicast mode, a packet is sent to any machine that wishes to receive it. The sender specifies a multicast address as the destination address in the IP header. Machines that would like to receive this IP multicast simply join the multicast group. IP multicast operates at the IP level and so is best-effort by design. IP multicast is built on top of hardware multicasting: for instance, an IP multicast address will be mapped onto a corresponding Ethernet multicast address.

Advantage of Multicast

The advantage of using multicast lies in the fact that the sender need send only one packet to an IP multicast group and all participating machines will receive the packet. This is efficient compared to unicast, where a sending machine would have to send out multiple copies of the same packet, one to each of the receivers. Multicast results in lower network bandwidth usage as well as lower CPU load on the sender.
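To illustrate the "join" step described above, here is a minimal sketch of a receiver joining an IPv4 multicast group with POSIX sockets. The group address 239.1.1.1 and port 5004 are arbitrary example values, not addresses used by the WCAM system.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    // Allow several receivers on the same host to bind the same port.
    int reuse = 1;
    setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse));

    sockaddr_in local{};
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(5004);                              // example port
    bind(s, reinterpret_cast<sockaddr*>(&local), sizeof(local));

    // Ask the kernel (and, via IGMP, the local network) to deliver the group's traffic.
    ip_mreq mreq{};
    mreq.imr_multiaddr.s_addr = inet_addr("239.1.1.1");        // example group
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    char buf[1500];
    ssize_t n = recv(s, buf, sizeof(buf), 0);                  // first multicast datagram
    std::printf("received %zd bytes\n", n);

    close(s);
    return 0;
}
```

The sender side needs no special group management; it simply addresses its UDP datagrams to the multicast group address.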

Disadvantages of Multicast

Because almost all Ethernet devices built in recent years support the multicast feature, IP multicast is virtually guaranteed to work between machines on the same network segment, and nowadays most switches also support IP multicast. Nevertheless, multicast has several drawbacks:

1. Multicast does not work over the global Internet as of today. For multicast to work across networks and over the Internet, all routers would have to forward multicast traffic as well as propagate multicast routes. This is not a problem for unicast.

2. There is no feedback channel from the stream receiver to the sender, so a receiver cannot tell a sender about end-to-end network performance, such as packet losses or latency.

3. The lack of a feedback channel also means that users will not have the ability to fast forward, rewind or seek through multimedia files. These functions need the receiver to talk to the sender.

4. There are security considerations with multicasting. When a network segment receives a multicast stream, by design, every machine on that network can receive the stream.

7.6 Problems of video transmission

Video-over-IP is a new and emerging technology that combines switched packet networking with streaming video [201]. There are few standards for Video-over-IP today. The integration of these two technologies has led to several questions as to the measurements that determine the quality of a Video-over-IP stream. Unlike data transfers over IP, streaming video quality is measured live and at the end-point, or more to the point, at the TV. Quality end-point video is not solely a function of network bandwidth, nor is it solely a function of MPEG-2. In fact, many of the issues that surround quality end-point video are a combination of both the MPEG-2 quality and the level of deterministic IP packet delivery of the network.

Unlike data traffic, which measures quality by the speed of reliable throughput with little attention to the nature of the payload, video (like voice) demands more from network transport. Networks designed to carry streaming video must account for the payload they carry. Furthermore, the type of MPEG-2 stream being transported affects the minimum and maximum boundary characteristics that the network packet delivery and the overall system must conform to for quality streaming video at the end-point. Measuring and monitoring these streams will involve measuring packet arrival times at layer 3 (the IP layer), the average and instantaneous behaviours of these arrival times, and finally the boundaries of the system by decoding part of layer 7, the MPEG-2 content. The MPEG-2 content combined with the system buffering limits will impose the boundaries on the transport [201].

7.6.1 Key parameters to quality streaming Video-over-IP

Basically, from the VoD server, MPEG-2 is wrapped in a packet and shipped out at a constant rate consistent with the rate of the MPEG-2 TS. For example, movie 1 was MPEG-2 encoded at 3.75 Mb/s, meaning the video decoder must see 3.75 Mb every second, within a tight MPEG-2 packet jitter tolerance (specified in nanoseconds). So the VoD server groups 7 MPEG-2 TS packets into each Ethernet packet and (theoretically) sends that packet out the Ethernet port at an even and constant rate so as to deliver 3.75 Mb/s at the end point.
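As a quick sanity check on the figures above, the sketch below computes the payload size and the nominal inter-packet spacing for 7 TS packets per Ethernet frame at 3.75 Mb/s; the rates are the example values from the text, not measured values.

```cpp
#include <cstdio>

int main() {
    const double ts_packet_bytes   = 188.0;
    const double packets_per_frame = 7.0;        // 7 x 188 = 1316 bytes of TS payload
    const double stream_rate_bps   = 3.75e6;     // example MPEG-2 TS rate from the text

    double payload_bits      = ts_packet_bytes * packets_per_frame * 8.0;    // 10528 bits
    double frame_interval_ms = payload_bits / stream_rate_bps * 1000.0;      // ~2.81 ms
    double frames_per_second = stream_rate_bps / payload_bits;               // ~356 frames/s

    std::printf("send one 1316-byte payload every %.3f ms (%.0f frames/s)\n",
                frame_interval_ms, frames_per_second);
    return 0;
}
```

This ~2.8 ms spacing is the nominal inter-packet gap that the switched network is expected to preserve; the jitter, burst and drift effects discussed next are deviations from it.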

Figure 40: Format of a typical MPEG-2 TS over IP packet

Because there are multiple clock domains in this system, buffering is used to help smooth out clocking and speed variations. Figure 41 shows the basic flow diagram for quality streaming Video-over-IP. As Ethernet packets come from the VoD server and from the switched network, the MPEG-2 TS packets get buffered and streamed to the decoder at a smooth 3.75 Mb/s rate. The MPEG-2 is then decoded and displayed on the TV.

Figure 41: Flow diagram for quality streaming Video-over-IP

It is suggested that there are five properties that must be measured and monitored to ensure quality transport of Video-over-IP:

1) Inter-packet arrival jitter causing delay
2) Inter-packet arrival jitter causing bursts
3) Ethernet packet loss
4) Ethernet inter-packet arrival average drift/deviation from the MPEG-2 data transport rate
5) MPEG-2 quality degradation due to packet corruption on the network, MPEG-2 encoding errors, or MPEG-2 packet loss

7.6.2 Effects of Jitter/Delay and Network Inter-Packet Gap Drift

Network packet jitter that causes large delays can cause the end-point buffer to run dry [201], producing segments of time in which the decoder has nothing to decode. This leads to degradation of the video quality observed on the TV. In many cases the TV will show macro-blocking video or simply go blank. Jitter delays can be caused by several things, including switch QoS settings, switch aggregation, and/or server problems. Long-term variations in the network packet rate can also cause the buffer to run dry in the same way.
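One common way to quantify the inter-packet arrival jitter listed above is the interarrival jitter estimator defined in RFC 3550, which smooths the difference between arrival-time spacing and media-timestamp spacing with a gain of 1/16. A minimal sketch, assuming both clocks are expressed in the same units:

```cpp
#include <cmath>
#include <cstdint>

// RFC 3550-style interarrival jitter: D is how much more (or less) the arrival
// time advanced than the media timestamp did; J is a running smoothed |D|.
struct JitterEstimator {
    bool    have_prev = false;
    int64_t prev_arrival = 0;
    int64_t prev_timestamp = 0;
    double  jitter = 0.0;

    void on_packet(int64_t arrival, int64_t timestamp) {
        if (have_prev) {
            double d = double((arrival - prev_arrival) - (timestamp - prev_timestamp));
            jitter += (std::fabs(d) - jitter) / 16.0;
        }
        prev_arrival = arrival;
        prev_timestamp = timestamp;
        have_prev = true;
    }
};
```

A growing jitter estimate, or a steady drift of the average inter-arrival gap away from the nominal packet spacing, is an early warning that the end-point buffer will eventually underflow or overflow.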

Figure 42: Effects of Jitter/Burst and Network Inter-packet Gap Drift

7.6.3 Effects of Jitter/Burst and Network Inter-Packet Gap Drift

Similar to the prior case, network jitter [201] that causes bursts of packets can cause buffer overflows. This is a much more difficult case to monitor, because data loss can occur at several points on the network. Faster delivery of network packets overflows the buffer in the equipment and network packets are dropped in the network, so measuring quality at the TV cannot show the entire story. When the network drops packets due to packet burst-induced overflow, the MPEG decoder may actually underflow as a result. This happens because the MPEG decoder's buffer continues to drain while some of the MPEG packets simply did not make it to the decoder's buffer. Thus, there may actually be both overflow and underflow conditions existing simultaneously on the path between the encoder/server and the decoder.

Figure 43: Effects of Jitter/Burst and Ethernet Inter-packet Gap Drift
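The interaction between a constantly draining decoder buffer and jittery or bursty arrivals can be reproduced with a toy model. The buffer size, stream rate and arrival pattern below are illustrative values chosen to trigger both failure modes, not measurements from the text.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const double drain_rate_bps = 3.75e6;        // nominal MPEG-2 TS rate
    const double payload_bits   = 1316 * 8.0;    // 7 TS packets per Ethernet frame
    const double capacity_bits  = 2e6;           // roughly half a second of buffering

    std::vector<double> arrivals;                // arrival times in seconds
    for (int i = 0; i < 400; ++i) arrivals.push_back(i * 0.0028);          // steady pacing
    for (int i = 0; i < 400; ++i) arrivals.push_back(1.45 + i * 0.0005);   // gap, then a fast burst

    double fill = capacity_bits / 2.0;           // start half full
    double last_t = 0.0;
    bool reported_under = false, reported_over = false;

    for (double t : arrivals) {
        fill -= (t - last_t) * drain_rate_bps;   // the decoder keeps draining
        if (fill < 0.0 && !reported_under) {
            std::printf("underflow (buffer ran dry) at %.3f s\n", t);
            reported_under = true;
        }
        if (fill < 0.0) fill = 0.0;
        fill += payload_bits;                    // frame delivered to the buffer
        if (fill > capacity_bits && !reported_over) {
            std::printf("overflow (frames will be dropped) at %.3f s\n", t);
            reported_over = true;
        }
        if (fill > capacity_bits) fill = capacity_bits;
        last_t = t;
    }
    return 0;
}
```

Running the model shows the sequence described in the text: the long gap empties the buffer (underflow), and the subsequent burst then overfills it, so both conditions appear on the same path within a fraction of a second.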

7.6.4 Network Packet Loss

Packets are dropped on the network [201]. This is a simple case in which to see the effect: if the data does not arrive, the result is poor quality.

Figure 44: Ethernet packet loss

In all these cases there is an underlying issue of the quality of the MPEG itself. If the MPEG is encoded poorly, or the MPEG payload is corrupted anywhere along the way, including data corruption right from the RAID, video quality is compromised. Another effect to consider is the dynamic nature of the network and its influence on other streams. Because it is a shared network, the more streams there are, the greater the chance that network switch elements have to buffer and (re)order traffic, thus creating jitter, delay, bursts and packet loss.

7.7 Client

7.7.1 Viewing over a set-top box

What is a set-top box?

As described in [219], set-top boxes are often associated with the following major categories:

1. Broadcast TV Set-top Boxes (a.k.a. Thin Boxes) - The more elementary level of set-top box, with no return channel (back-end). They might however come with some memory, interface ports and some processing power.

2. Enhanced TV Set-top Boxes (may be known as: Smart TV set-top box, Thick Boxes) - These have a return channel, usually through a phone line, and are the mainstay of today's set-top boxes. They are capable of Video on Demand, Near Video on Demand, e-commerce, Internet browsing, communications and chat. They are giving way to the next category.

3. Advanced Set-top Boxes (a.k.a. Advanced digital set-top boxes, Smart TV set-top box, Thick Boxes) - These are in many ways like PCs. They have good processors, memory and optional large hard drives. They are more often used with high-speed connections. The Explorer 6000 and 8000 set-top boxes from Scientific Atlanta are in this category. Advanced set-top boxes are more likely to be integrated with DVRs and high-definition TV oriented functionality.

4. All-in-one Set-top Boxes (a.k.a. Integrated set-top box, Super Box; may be known as Advanced set-top box) - A fully integrated set-top box. Features could include everything from high-speed Internet access to digital video recording to games and e-mail capacity. The opposite of this is when two or more set-top boxes (sidecars) are used in tandem by the subscriber's TV.

5. Hybrid Digital Cable Box - A specialized and often more expensive cable TV set-top box with high-end functions. Motorola Broadband's DCP501 home theater system, which includes a DVD player, is an example.

Set-top boxes (STB) act as a gateway between your television, PC or PC-TV and your telephone, satellite or cable feed (incoming signal). In terms of ITV, the STB receives encoded and/or compressed digital signals from the signal source (satellite, TV station, cable network, etc.) and decodes (and/or decompresses) those signals, converting them into analog signals displayable on your television. The STB also accepts commands from the user (often by use of a remote control, keypad or keyboard) and transmits these commands back to the network, often through a back channel (which may be a separate phone line). Interactive television STBs can have many functions, such as television receiver, modem, game console, Web browser, a way of sending e-mail, Electronic Program Guide (EPG), even CD-ROM, DVD player, video-conferencing, cable telephony, etc. Many STBs are able to communicate in real time with devices such as camcorders, DVD players, CD players and music keyboards.

Set-top boxes are usually computers that process digital information. They typically have on-screen user interfaces that can be seen on the TV screen and interacted with through a hand-held interactive keypad, which is little more than an advanced remote control. (These are also known as Control Devices.) STBs also have facilities for upgrading software such as browsers and Electronic Program Guides (EPGs). Some have large hard drives and smart card slots to hold your smart card for purchases and for identifying yourself to your cable or satellite TV provider.

To provide interactive services, the set-top box, from the standpoint of its hardware, needs four important components: a network interface, a decoder, a buffer and synchronization hardware.

(1) The network interface: Allows the user to receive data from the server and send data back to the server, in a manner that can be understood by the server.

(2) The decoder: In order to save storage space, disk bandwidth, and network bandwidth, movies are usually encoded (compressed) before they are sent over the network.
Thus, the end-users need a decoder to decode (uncompress, among other things) the incoming stream data before it is viewable. This is part of what a modem does. The decoding process is sometimes known as Demodulation or Heavy Lifting.

(3) The buffer: Due to delay jitter in the network, the arrival time of a video stream cannot be determined exactly. In order to guarantee continuous, consistent playback for the viewer (end-user/subscriber), the stream is often received one or even a few seconds before it is actually seen by the end-user. This way, if there are fluctuations (even those measured in milliseconds) in the transport time of the video stream to that receiver, the viewer will not notice the difference, as the buffer has a bit of time to spare.

(4) Synchronization hardware: A movie (or whatever one watches via a set-top box) consists of both video and audio streams. They must be synchronized with each other before being viewed.

Platform (sometimes also known as "ITV client")

(1) The underlying system and standards that make up the built-in and/or set-top box. The platform enables interactivity (among other things). Platforms can include ITV-related software, middleware and/or hardware. ITV platforms, however, are often associated with the middleware provider. Liberate, OpenTV, PowerTV, Worldgate and Microsoft TV (MSTV) are middleware platforms and/or platform providers.

(2) The operating system (e.g. Windows 98, Windows NT, etc.) used by the computer that a visitor to your Web site is using.

(3) It can be used to refer to the major communication channels such as digital terrestrial (MMDS), cable, satellite, and the Internet; these would be referred to as "cable platform", "satellite platform", etc.

Some popular ITV platforms are: ImagicTV, Liberate, MEDIAHIGHWAY, MHEG-5, Microsoft TV Foundation Edition, MSNTV, Myrio, OpenTV, PowerTV, WebTV, Worldgate.

Middleware (a.k.a. system software or platform software) - A general term for any programming that serves to "glue together" or mediate between two separate and usually already existing programs. It includes an application manager, the virtual machine (such as a Java Virtual Machine), the interactive engine, the libraries and databases. In Interactive TV terms, this is often software that provides services between the server and the end-user, including software that connects two separate applications together. This is particularly necessary because there are a number of different programs, platforms and pieces of software in use that are all oriented towards the same goal of providing interactive TV. If the set-top box has a Resident Application, it is often considered to be in the middleware category. In this case, a Resident Application is a program or programs that are built into the memory of the set-top box. These are updated, often automatically, by the service provider via the data stream (signal) that the set-top box receives from the service provider.

Software (set-top box software) - This is software that adds features to the set-top box that it often does not need in order to operate, or at least to operate minimally. For instance, if the set-top box were voice-enabled, so that it obeys commands spoken by the subscriber, that would largely be thanks to the voice recognition software in the set-top box.

Technology survey of set-top boxes

The following table summarizes the various evolutions in terms of technology, services and providers.

Generation 1 (price: $700-$1,000)
- Technology: RTOS (OS9, VxWorks); Stellar introduces middleware based on Windows NT Embedded; MPEG-1 and MPEG-2; Sigma introduces an off-the-shelf MPEG-2 video chip
- Network: ATM or IP debate
- Services: Video on Demand, Internet access; Stellar One's Netris 3000 NTe, ATM or IP
- Manufacturers: Motorola, Acorn (now Pace Technologies), NLC, DiviCom, Mitsubishi, NCI, NEC, Scientific Atlanta, Thomson, Tatung, Zenith, WebTV

Generation 2 (price: $500-$800)
- Technology: Real-Time Operating Systems (RTOS): Linux, VxWorks; Windows NTe and XPe; MPEG-1 and MPEG-2, multiple video chip options exist; migration to IP complete
- Network: Ethernet, ADSL
- Services: VOD, Internet to TV, Internet to PC; Stellar One's ConnectTV 3000 NTe, IP or ATM; Fujitsu Siemens Activy running ConnectTV 2.1, IP only
- Manufacturers: Pace Technologies, NLC, NEC, Thomson, Motorola, RCA

Generation 3 (price: $150-$300 in volume)
- Technology: Real-Time Operating Systems (RTOS): Linux, VxWorks; Windows CE.NET; MPEG-1, MPEG-2, MPEG-4, WM9, RealNetworks; software and hardware decode available; STB-on-a-chip solutions
- Network: IP options expanded: Ethernet, ADSL, VDSL, Etherloop, LRE
- Services: VOD, Internet to TV, Internet to PC, Music on Demand, e-mail, Personalization
- Manufacturers: Fujitsu Siemens Activy, Costron, Eagle, I3, Pace Technologies, NLC, Samsung, Kreatel, Thomson, RCA, Amino Communication

Generation 4 (price: less than $100 in volume)
- Technology: Linux or Windows CE; MPEG-1/2/4 and Windows Media 9, H.264; software and hardware decode available; STB-on-a-chip solutions
- Network: IP options expanded: Ethernet, ADSL, VDSL
- Services: VOD, Internet to TV, Internet to PC, Music on Demand, e-mail, Personalization, PVR, ...

Note: Windows CE will address the set-top box market according to the pricing approach defined by Bill Gates and in white papers from Microsoft [222]. Bill Gates, at WinHEC 2003, on CE.NET license pricing: "We have a new pricing capability where we've brought the standard unit, small number of unit price down for the core pieces to $3. We have volume discounts, of course, that go even beyond that."

The following diagram (Figure 45) shows the various tendencies of the market: either low cost with minimum services (VOD, Internet) or high-end with high quality and recording capabilities.

Figure 45: STB and Middleware Changes

7.7.2 Viewing over PC, PDA and laptop

We will hereafter present some media player applications, predominantly ones that are available for the Pocket PC operating system [213]. Two of the most well-known media players, Windows Media Player and RealPlayer (now known as RealOne Player), are available for both laptop and PDA platforms. Naturally, the laptop and PC versions are more advanced and include many additional features, as well as being able to play more media files.

For PDAs there are some media player programs available that are specific to the PDA format. Packetvideo and PocketTV are media player applications only available for the Pocket PC OS, and Fireviewer and gmovie Player are only available for PalmOS. A feature that separates media player applications into two classes is the ability or inability to play streaming media. More specifically, a player can have no streaming abilities, pseudo-streaming abilities or real streaming abilities [216].

It also has to be stated that the media playing applications for devices like Pocket PCs, which are less powerful than regular PCs, although looking much like, and seemingly being, the same applications as for a full PC operating system, are quite downsized in comparison and lack many of the features the full versions possess. For instance, both RealOne Player and Windows Media Player are unable to accept anything other than media files, media redirection files or Internet links which lead directly to the media file in question.

Windows Media Player

Windows Media Player is the standard media playing tool that comes with the PC, laptop and Pocket PC operating systems, and the Pocket PC version is reminiscent of the more advanced Windows Media Player for the full Windows operating systems. Windows Media Player supports the organization and playback of Windows Media content, MP3 audio files, Windows Media Audio, Windows Media Video and streamed content in Windows Media format using the Windows Media protocols (such as mms://). This application is therefore streaming-capable, although in a quite restricted manner: streaming of MP3 files is not supported, so the only streamed content playable with Windows Media Player is Windows Media formatted content. It allows full-screen display of video. In the portable device format, this application only exists for Pocket PC.

RealOne Player

RealOne Player from RealNetworks is another popular media player application, perhaps due to the fact that it was one of the first players to offer true streaming to the public. RealNetworks are showing their commitment to mobile media with the incorporation of RealOne Player in the Nokia 9210i (for the EPOC/Symbian OS), and also through an agreement with HP Compaq that has led to HP Compaq shipping all new iPAQs with RealOne Player installed. The RealOne Player for Pocket PC devices is a lightweight version optimized for resource-constrained devices like the ones utilizing the Pocket PC operating system. RealAudio and RealVideo programs can be streamed in real time over a wireless data connection, with support for network types like IEEE 802.11b, GPRS, HSCSD, CSD, CDPD and 1xRTT. RealAudio and RealVideo files can also be downloaded and played back locally. The RealOne Player for the Pocket PC can be used on most Pocket PC devices [209].

QuickTime

At present QuickTime has no player application for handheld devices; only QuickTime applications for full PC operating systems like Windows, Linux and MacOS exist. The question is how long QuickTime can wait before releasing an application for mobile devices as well. Apparently, PVPlayer from Packetvideo is able, or will be able, to play QuickTime files.

DivX and Pocket DivX

The DivX and Pocket DivX Player is a free Open Source multifunction video and audio player for the Pocket PC platform that can play DivX, OpenDivX, MPEG4 and MPEG1 videos as well as MP3 audio [217]. This application is available in different versions for different Pocket PC devices.

DirectShow structure

A few years ago, Microsoft introduced a media streaming layer on top of DirectX, called DirectShow, that was meant to handle pretty much any type of media you could throw at it. It is included in DirectX Media, which also includes DirectAnimation (mostly web page material), DirectX Transform (more web page material), and the old Direct3D Retained Mode that used to be part of the standard DirectX collection.

This section shows how to play back standard media types in a game or application using DirectShow from C++. Microsoft does not offer the DirectX Media SDK for download, so you either have to order a CD or look in the latest Platform SDK. As with DirectX, the DirectShow API is accessed through COM interfaces, so a basic knowledge of COM is assumed, in particular how to obtain pointers to the interfaces you need and how to release references correctly. DirectShow is built around the idea of a number of "filters" joined together to create a "graph".

Figure 46: DirectShow graph

Here is a visual representation of the graph and its filters using a utility that comes with the DirectX Media SDK, called GraphEdit:

Each box represents a filter. Arrows connecting boxes represent the output of one filter being passed to the input of another filter; they also show the flow of data in the graph. GraphEdit is convenient for those just getting started with DirectShow, as it gives a visual equivalent of what you will be doing in software. GraphEdit also lets you drag filters around, connect them to other filters, and run your final complete graph.

Each graph that is built follows certain guidelines. First of all, there must be a source filter. This is the initial source of your data, be it a file, a URL for streaming media, or some device such as a FireWire card hooked to a video camera. The output of the source filter is then run through any number of transform filters. Transform filters are any intermediate filters that take a certain type of input data, modify the data coming in, then pass the modified data to their output. The final piece of a graph is a renderer filter. Renderer filters are the final destination of any data handled in a filter graph. Renderers can represent things such as a video window for displaying video on the screen, a sound card for outputting sound, or a file writer for storing data to disk.

The way that filters are connected in a graph is through their "pins". Every filter, no matter what type, must have at least one pin to connect to other filters. When attempting to connect two filters, the pins on both filters pass information back and forth to determine whether the downstream filter (the one accepting data) can handle the data passed in by the upstream filter (the one sending data). If the pins successfully negotiate a data type they both know, a successful connection has been made between the two filters. As can be seen in the image above, a filter is not restricted in its number of inputs or outputs, and many times a filter will require more than one input or output for the data being handled. For example, the MPEG-1 Stream Splitter filter needs to send the audio and video portions of MPEG-1 data to separate decoder filters.

DirectShow is distributed with a number of filters provided by Microsoft, including source filters, transform filters, and renderer filters. They provide a useful "File Source Filter" that can be used for reading in any type of file, transform filters capable of handling MPEG-1 video, AVI video, WAV audio and other formats, and finally renderer filters for outputting sound and video. For standard formats, the filters provided by Microsoft may be all you need for playback.
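Alongside the GraphEdit workflow, the graph-building described above can be done programmatically. The minimal C++ sketch below lets the Filter Graph Manager pick the source, splitter, decoder and renderer filters via IGraphBuilder::RenderFile; the file name "clip.mpg" is a placeholder and error handling is reduced to the essentials.

```cpp
#include <dshow.h>
#pragma comment(lib, "strmiids")

int main() {
    CoInitialize(NULL);

    IGraphBuilder* graph   = NULL;
    IMediaControl* control = NULL;
    IMediaEvent*   events  = NULL;

    // The Filter Graph Manager builds and runs the chain of filters for us.
    HRESULT hr = CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                                  IID_IGraphBuilder, (void**)&graph);
    if (SUCCEEDED(hr)) {
        graph->QueryInterface(IID_IMediaControl, (void**)&control);
        graph->QueryInterface(IID_IMediaEvent, (void**)&events);

        // RenderFile inserts a source filter for the file and then negotiates
        // the splitter, decoder and renderer filters automatically.
        hr = graph->RenderFile(L"clip.mpg", NULL);
        if (SUCCEEDED(hr) && control && events) {
            control->Run();
            long ev_code = 0;
            events->WaitForCompletion(INFINITE, &ev_code);   // block until playback ends
        }
        if (control) control->Release();
        if (events)  events->Release();
        graph->Release();
    }

    CoUninitialize();
    return 0;
}
```

The resulting graph is exactly what GraphEdit would display for the same file, which is why testing a graph interactively first, as described next, is a useful habit.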

Building And Testing A Graph In GraphEdit

GraphEdit is a great tool that can be used to create and test a filter graph to play a media type that you plan on using in your program. We will not go into the details of how to use GraphEdit, but it is important to know and use it before you end up wasting time coding, only to realize that two particular filters do not agree on a data type. The first thing to do is to select "Render Media File" from the menu and select the media file that you wish to play back.

The most popular and well-supported codecs with free or cheap tools available are [218]:

DivX: This codec from DivX Networks is extremely popular online, and has even found its way into several consumer electronics devices. The Pro version enables a few more encoding options and is available either as a free ad-supported download or as a paid purchase. We paired it up with the popular free video conversion software VirtualDub to perform our transcoding tests. The latest version takes the encoding improvements from version 5.1 and speeds everything up a great deal, making for the best DivX codec yet.

Windows Media Video 9: Microsoft's latest video codec makes vast improvements over that of series 8 and, more importantly, the decode stream parameters are now fixed: future iterations of Windows Media Video will make improvements in the encoder, but those streams will still play back on devices that are WMV9 compatible. This has enabled the codec to find support in several new and a great many upcoming DVD players, digital media adapters, and portable video players.

QuickTime 6.5/Sorenson3: Though not popular in consumer electronics devices, QuickTime is everywhere on the Web. It is especially common for movie trailers, which may be the most popular form of downloaded video on the Web. The majority of online QuickTime trailers and videos are encoded with the commercial Sorenson3 codec that ships with QuickTime 6.5, though typically at fairly high bit rates.


10.2 Video Compression with Motion Compensation 10.4 H H.263 Chapter 10 Basic Video Compression Techniques 10.11 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation Chapter 10 Basic Video Compression Techniques 10.1 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

4G WIRELESS VIDEO COMMUNICATIONS

4G WIRELESS VIDEO COMMUNICATIONS 4G WIRELESS VIDEO COMMUNICATIONS Haohong Wang Marvell Semiconductors, USA Lisimachos P. Kondi University of Ioannina, Greece Ajay Luthra Motorola, USA Song Ci University of Nebraska-Lincoln, USA WILEY

More information

Coding of Still Pictures

Coding of Still Pictures ISO/IEC JTC 1/SC 29/WG 1 N 2412 Date: 2002-12-25 ISO/IEC JTC 1/SC 29/WG 1 (ITU-T SG 16) Coding of Still Pictures JBIG Joint Bi-level Image Experts Group JPEG Joint Photographic Experts Group TITLE: SOURCE:

More information

Low-complexity video compression based on 3-D DWT and fast entropy coding

Low-complexity video compression based on 3-D DWT and fast entropy coding Low-complexity video compression based on 3-D DWT and fast entropy coding Evgeny Belyaev Tampere University of Technology Department of Signal Processing, Computational Imaging Group April 8, Evgeny Belyaev

More information

MPEG-4: Simple Profile (SP)

MPEG-4: Simple Profile (SP) MPEG-4: Simple Profile (SP) I-VOP (Intra-coded rectangular VOP, progressive video format) P-VOP (Inter-coded rectangular VOP, progressive video format) Short Header mode (compatibility with H.263 codec)

More information

White paper: Video Coding A Timeline

White paper: Video Coding A Timeline White paper: Video Coding A Timeline Abharana Bhat and Iain Richardson June 2014 Iain Richardson / Vcodex.com 2007-2014 About Vcodex Vcodex are world experts in video compression. We provide essential

More information

VIDEO COMPRESSION STANDARDS

VIDEO COMPRESSION STANDARDS VIDEO COMPRESSION STANDARDS Family of standards: the evolution of the coding model state of the art (and implementation technology support): H.261: videoconference x64 (1988) MPEG-1: CD storage (up to

More information

Performance Comparison between DWT-based and DCT-based Encoders

Performance Comparison between DWT-based and DCT-based Encoders , pp.83-87 http://dx.doi.org/10.14257/astl.2014.75.19 Performance Comparison between DWT-based and DCT-based Encoders Xin Lu 1 and Xuesong Jin 2 * 1 School of Electronics and Information Engineering, Harbin

More information

Scalable Compression and Transmission of Large, Three- Dimensional Materials Microstructures

Scalable Compression and Transmission of Large, Three- Dimensional Materials Microstructures Scalable Compression and Transmission of Large, Three- Dimensional Materials Microstructures William A. Pearlman Center for Image Processing Research Rensselaer Polytechnic Institute pearlw@ecse.rpi.edu

More information

Implementation of JPEG-2000 Standard for the Next Generation Image Compression

Implementation of JPEG-2000 Standard for the Next Generation Image Compression University of Southern Queensland Faculty of Engineering & Surveying Implementation of JPEG-2000 Standard for the Next Generation Image Compression A dissertation submitted by LOH, Chew Ping in fulfilment

More information

CHAPTER 4 REVERSIBLE IMAGE WATERMARKING USING BIT PLANE CODING AND LIFTING WAVELET TRANSFORM

CHAPTER 4 REVERSIBLE IMAGE WATERMARKING USING BIT PLANE CODING AND LIFTING WAVELET TRANSFORM 74 CHAPTER 4 REVERSIBLE IMAGE WATERMARKING USING BIT PLANE CODING AND LIFTING WAVELET TRANSFORM Many data embedding methods use procedures that in which the original image is distorted by quite a small

More information

Information technology JPEG 2000 image coding system Part 11: Wireless

Information technology JPEG 2000 image coding system Part 11: Wireless ISO/IEC JTC 1/SC 29 Date: 2005-03-21 ISO/IEC FCD 15444-11 ISO/IEC JTC 1/SC 29/WG 1 Secretariat: Information technology JPEG 2000 image coding system Part 11: Wireless Élément introductif Élément central

More information

Introduction to Video Compression

Introduction to Video Compression Insight, Analysis, and Advice on Signal Processing Technology Introduction to Video Compression Jeff Bier Berkeley Design Technology, Inc. info@bdti.com http://www.bdti.com Outline Motivation and scope

More information

DIGITAL IMAGE PROCESSING WRITTEN REPORT ADAPTIVE IMAGE COMPRESSION TECHNIQUES FOR WIRELESS MULTIMEDIA APPLICATIONS

DIGITAL IMAGE PROCESSING WRITTEN REPORT ADAPTIVE IMAGE COMPRESSION TECHNIQUES FOR WIRELESS MULTIMEDIA APPLICATIONS DIGITAL IMAGE PROCESSING WRITTEN REPORT ADAPTIVE IMAGE COMPRESSION TECHNIQUES FOR WIRELESS MULTIMEDIA APPLICATIONS SUBMITTED BY: NAVEEN MATHEW FRANCIS #105249595 INTRODUCTION The advent of new technologies

More information

Megapixel Networking 101. Why Megapixel?

Megapixel Networking 101. Why Megapixel? Megapixel Networking 101 Ted Brahms Director Field Applications, Arecont Vision Why Megapixel? Most new surveillance projects are IP Megapixel cameras are IP Megapixel provides incentive driving the leap

More information

JPEG 2000 Implementation Guide

JPEG 2000 Implementation Guide JPEG 2000 Implementation Guide James Kasner NSES Kodak james.kasner@kodak.com +1 703 383 0383 x225 Why Have an Implementation Guide? With all of the details in the JPEG 2000 standard (ISO/IEC 15444-1),

More information

CS 335 Graphics and Multimedia. Image Compression

CS 335 Graphics and Multimedia. Image Compression CS 335 Graphics and Multimedia Image Compression CCITT Image Storage and Compression Group 3: Huffman-type encoding for binary (bilevel) data: FAX Group 4: Entropy encoding without error checks of group

More information

Video Compression MPEG-4. Market s requirements for Video compression standard

Video Compression MPEG-4. Market s requirements for Video compression standard Video Compression MPEG-4 Catania 10/04/2008 Arcangelo Bruna Market s requirements for Video compression standard Application s dependent Set Top Boxes (High bit rate) Digital Still Cameras (High / mid

More information

A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm

A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm International Journal of Engineering Research and General Science Volume 3, Issue 4, July-August, 15 ISSN 91-2730 A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm

More information

Module 1B: JPEG2000 Part 1. Standardization issues, Requirements, Comparisons. JPEG: Summary (1) Motivation new still image st dard (2)

Module 1B: JPEG2000 Part 1. Standardization issues, Requirements, Comparisons. JPEG: Summary (1) Motivation new still image st dard (2) 1 2 Advanced Topics Multimedia Video (5LSH0), Module 01 B Introduction to JPEG2000: the next generation still image coding system Module 1B: JPEG2000 Part 1 Standardization issues, Requirements, Comparisons

More information

Compression of Stereo Images using a Huffman-Zip Scheme

Compression of Stereo Images using a Huffman-Zip Scheme Compression of Stereo Images using a Huffman-Zip Scheme John Hamann, Vickey Yeh Department of Electrical Engineering, Stanford University Stanford, CA 94304 jhamann@stanford.edu, vickey@stanford.edu Abstract

More information

Module 6 STILL IMAGE COMPRESSION STANDARDS

Module 6 STILL IMAGE COMPRESSION STANDARDS Module 6 STILL IMAGE COMPRESSION STANDARDS Lesson 19 JPEG-2000 Error Resiliency Instructional Objectives At the end of this lesson, the students should be able to: 1. Name two different types of lossy

More information

Introduction to Video Coding

Introduction to Video Coding Introduction to Video Coding o Motivation & Fundamentals o Principles of Video Coding o Coding Standards Special Thanks to Hans L. Cycon from FHTW Berlin for providing first-hand knowledge and much of

More information

Video Quality Analysis for H.264 Based on Human Visual System

Video Quality Analysis for H.264 Based on Human Visual System IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021 ISSN (p): 2278-8719 Vol. 04 Issue 08 (August. 2014) V4 PP 01-07 www.iosrjen.org Subrahmanyam.Ch 1 Dr.D.Venkata Rao 2 Dr.N.Usha Rani 3 1 (Research

More information

ISSN (ONLINE): , VOLUME-3, ISSUE-1,

ISSN (ONLINE): , VOLUME-3, ISSUE-1, PERFORMANCE ANALYSIS OF LOSSLESS COMPRESSION TECHNIQUES TO INVESTIGATE THE OPTIMUM IMAGE COMPRESSION TECHNIQUE Dr. S. Swapna Rani Associate Professor, ECE Department M.V.S.R Engineering College, Nadergul,

More information

LIST OF TABLES. Table 5.1 Specification of mapping of idx to cij for zig-zag scan 46. Table 5.2 Macroblock types 46

LIST OF TABLES. Table 5.1 Specification of mapping of idx to cij for zig-zag scan 46. Table 5.2 Macroblock types 46 LIST OF TABLES TABLE Table 5.1 Specification of mapping of idx to cij for zig-zag scan 46 Table 5.2 Macroblock types 46 Table 5.3 Inverse Scaling Matrix values 48 Table 5.4 Specification of QPC as function

More information

Part 1 of 4. MARCH

Part 1 of 4. MARCH Presented by Brought to You by Part 1 of 4 MARCH 2004 www.securitysales.com A1 Part1of 4 Essentials of DIGITAL VIDEO COMPRESSION By Bob Wimmer Video Security Consultants cctvbob@aol.com AT A GLANCE Compression

More information

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami to MPEG Prof. Pratikgiri Goswami Electronics & Communication Department, Shree Swami Atmanand Saraswati Institute of Technology, Surat. Outline of Topics 1 2 Coding 3 Video Object Representation Outline

More information

Information technology JPEG 2000 image coding system: Core coding system

Information technology JPEG 2000 image coding system: Core coding system (ISO/IEC 15444-1:004, IDT) (ISO/IEC 15444-1-004, IDT) Information technology JPEG 000 image coding system: Core coding system PDF disclaimer This PDF file may contain embedded typefaces. In accordance

More information

Professor Laurence S. Dooley. School of Computing and Communications Milton Keynes, UK

Professor Laurence S. Dooley. School of Computing and Communications Milton Keynes, UK Professor Laurence S. Dooley School of Computing and Communications Milton Keynes, UK How many bits required? 2.4Mbytes 84Kbytes 9.8Kbytes 50Kbytes Data Information Data and information are NOT the same!

More information

Using animation to motivate motion

Using animation to motivate motion Using animation to motivate motion In computer generated animation, we take an object and mathematically render where it will be in the different frames Courtesy: Wikipedia Given the rendered frames (or

More information

FPGA IMPLEMENTATION OF BIT PLANE ENTROPY ENCODER FOR 3 D DWT BASED VIDEO COMPRESSION

FPGA IMPLEMENTATION OF BIT PLANE ENTROPY ENCODER FOR 3 D DWT BASED VIDEO COMPRESSION FPGA IMPLEMENTATION OF BIT PLANE ENTROPY ENCODER FOR 3 D DWT BASED VIDEO COMPRESSION 1 GOPIKA G NAIR, 2 SABI S. 1 M. Tech. Scholar (Embedded Systems), ECE department, SBCE, Pattoor, Kerala, India, Email:

More information

Scalable Video Coding

Scalable Video Coding 1 Scalable Video Coding Z. Shahid, M. Chaumont and W. Puech LIRMM / UMR 5506 CNRS / Universite Montpellier II France 1. Introduction With the evolution of Internet to heterogeneous networks both in terms

More information

THE H.264 ADVANCED VIDEO COMPRESSION STANDARD

THE H.264 ADVANCED VIDEO COMPRESSION STANDARD THE H.264 ADVANCED VIDEO COMPRESSION STANDARD Second Edition Iain E. Richardson Vcodex Limited, UK WILEY A John Wiley and Sons, Ltd., Publication About the Author Preface Glossary List of Figures List

More information

The Scope of Picture and Video Coding Standardization

The Scope of Picture and Video Coding Standardization H.120 H.261 Video Coding Standards MPEG-1 and MPEG-2/H.262 H.263 MPEG-4 H.264 / MPEG-4 AVC Thomas Wiegand: Digital Image Communication Video Coding Standards 1 The Scope of Picture and Video Coding Standardization

More information

IMAGE COMPRESSION. Image Compression. Why? Reducing transportation times Reducing file size. A two way event - compression and decompression

IMAGE COMPRESSION. Image Compression. Why? Reducing transportation times Reducing file size. A two way event - compression and decompression IMAGE COMPRESSION Image Compression Why? Reducing transportation times Reducing file size A two way event - compression and decompression 1 Compression categories Compression = Image coding Still-image

More information

CSEP 521 Applied Algorithms Spring Lossy Image Compression

CSEP 521 Applied Algorithms Spring Lossy Image Compression CSEP 521 Applied Algorithms Spring 2005 Lossy Image Compression Lossy Image Compression Methods Scalar quantization (SQ). Vector quantization (VQ). DCT Compression JPEG Wavelet Compression SPIHT UWIC (University

More information

HYBRID TRANSFORMATION TECHNIQUE FOR IMAGE COMPRESSION

HYBRID TRANSFORMATION TECHNIQUE FOR IMAGE COMPRESSION 31 st July 01. Vol. 41 No. 005-01 JATIT & LLS. All rights reserved. ISSN: 199-8645 www.jatit.org E-ISSN: 1817-3195 HYBRID TRANSFORMATION TECHNIQUE FOR IMAGE COMPRESSION 1 SRIRAM.B, THIYAGARAJAN.S 1, Student,

More information

ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS

ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS ERROR-ROBUST INTER/INTRA MACROBLOCK MODE SELECTION USING ISOLATED REGIONS Ye-Kui Wang 1, Miska M. Hannuksela 2 and Moncef Gabbouj 3 1 Tampere International Center for Signal Processing (TICSP), Tampere,

More information

Lecture 13 Video Coding H.264 / MPEG4 AVC

Lecture 13 Video Coding H.264 / MPEG4 AVC Lecture 13 Video Coding H.264 / MPEG4 AVC Last time we saw the macro block partition of H.264, the integer DCT transform, and the cascade using the DC coefficients with the WHT. H.264 has more interesting

More information

JPEG 2000: The Next Generation Still Image Compression Standard

JPEG 2000: The Next Generation Still Image Compression Standard JPEG 2000: The Next Generation Still Image Compression Standard Michael D. Adams y, Student Member, IEEE, Hong Man z, Member, IEEE, Faouzi Kossentini y, Senior Member, IEEE, Touradj Ebrahimi, Member, IEEE

More information

Performance Analysis of DIRAC PRO with H.264 Intra frame coding

Performance Analysis of DIRAC PRO with H.264 Intra frame coding Performance Analysis of DIRAC PRO with H.264 Intra frame coding Presented by Poonam Kharwandikar Guided by Prof. K. R. Rao What is Dirac? Hybrid motion-compensated video codec developed by BBC. Uses modern

More information

06/12/2017. Image compression. Image compression. Image compression. Image compression. Coding redundancy: image 1 has four gray levels

06/12/2017. Image compression. Image compression. Image compression. Image compression. Coding redundancy: image 1 has four gray levels Theoretical size of a file representing a 5k x 4k colour photograph: 5000 x 4000 x 3 = 60 MB 1 min of UHD tv movie: 3840 x 2160 x 3 x 24 x 60 = 36 GB 1. Exploit coding redundancy 2. Exploit spatial and

More information

VHDL Implementation of H.264 Video Coding Standard

VHDL Implementation of H.264 Video Coding Standard International Journal of Reconfigurable and Embedded Systems (IJRES) Vol. 1, No. 3, November 2012, pp. 95~102 ISSN: 2089-4864 95 VHDL Implementation of H.264 Video Coding Standard Jignesh Patel*, Haresh

More information

Image and Video Compression Fundamentals

Image and Video Compression Fundamentals Video Codec Design Iain E. G. Richardson Copyright q 2002 John Wiley & Sons, Ltd ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic) Image and Video Compression Fundamentals 3.1 INTRODUCTION Representing

More information

ISO/IEC INTERNATIONAL STANDARD. Information technology JPEG 2000 image coding system: Motion JPEG 2000

ISO/IEC INTERNATIONAL STANDARD. Information technology JPEG 2000 image coding system: Motion JPEG 2000 INTERNATIONAL STANDARD ISO/IEC 15444-3 Second edition 2007-05-01 Information technology JPEG 2000 image coding system: Motion JPEG 2000 Technologies de l'information Système de codage d'image JPEG 2000:

More information

CMPT 365 Multimedia Systems. Media Compression - Video

CMPT 365 Multimedia Systems. Media Compression - Video CMPT 365 Multimedia Systems Media Compression - Video Spring 2017 Edited from slides by Dr. Jiangchuan Liu CMPT365 Multimedia Systems 1 Introduction What s video? a time-ordered sequence of frames, i.e.,

More information

Advanced Encoding Features of the Sencore TXS Transcoder

Advanced Encoding Features of the Sencore TXS Transcoder Advanced Encoding Features of the Sencore TXS Transcoder White Paper November 2011 Page 1 (11) www.sencore.com 1.605.978.4600 Revision 1.0 Document Revision History Date Version Description Author 11/7/2011

More information

JPEG2000: The New Still Picture Compression Standard

JPEG2000: The New Still Picture Compression Standard JPEG2000: The New Still Picture Compression Standard C. A. Christopoulos I, T. Ebrahimi 2 and A. N. Skodras 3 1Media Lab, Ericsson Research, Ericsson Radio Systems AB, S-16480 Stockholm, Sweden Email:

More information

Digital Video Processing

Digital Video Processing Video signal is basically any sequence of time varying images. In a digital video, the picture information is digitized both spatially and temporally and the resultant pixel intensities are quantized.

More information

Advances of MPEG Scalable Video Coding Standard

Advances of MPEG Scalable Video Coding Standard Advances of MPEG Scalable Video Coding Standard Wen-Hsiao Peng, Chia-Yang Tsai, Tihao Chiang, and Hsueh-Ming Hang National Chiao-Tung University 1001 Ta-Hsueh Rd., HsinChu 30010, Taiwan pawn@mail.si2lab.org,

More information

MPEG-4 Part 10 AVC (H.264) Video Encoding

MPEG-4 Part 10 AVC (H.264) Video Encoding June 2005 MPEG-4 Part 10 AVC (H.264) Video Encoding Abstract H.264 has the potential to revolutionize the industry as it eases the bandwidth burden of service delivery and opens the service provider market

More information

ECE 417 Guest Lecture Video Compression in MPEG-1/2/4. Min-Hsuan Tsai Apr 02, 2013

ECE 417 Guest Lecture Video Compression in MPEG-1/2/4. Min-Hsuan Tsai Apr 02, 2013 ECE 417 Guest Lecture Video Compression in MPEG-1/2/4 Min-Hsuan Tsai Apr 2, 213 What is MPEG and its standards MPEG stands for Moving Picture Expert Group Develop standards for video/audio compression

More information

Lecture 5: Video Compression Standards (Part2) Tutorial 3 : Introduction to Histogram

Lecture 5: Video Compression Standards (Part2) Tutorial 3 : Introduction to Histogram Lecture 5: Video Compression Standards (Part) Tutorial 3 : Dr. Jian Zhang Conjoint Associate Professor NICTA & CSE UNSW COMP9519 Multimedia Systems S 006 jzhang@cse.unsw.edu.au Introduction to Histogram

More information

The Best-Performance Digital Video Recorder JPEG2000 DVR V.S M-PEG & MPEG4(H.264)

The Best-Performance Digital Video Recorder JPEG2000 DVR V.S M-PEG & MPEG4(H.264) The Best-Performance Digital Video Recorder JPEG2000 DVR V.S M-PEG & MPEG4(H.264) Many DVRs in the market But it takes brains to make the best product JPEG2000 The best picture quality in playback. Brief

More information

Selected coding methods in H.265/HEVC

Selected coding methods in H.265/HEVC Selected coding methods in H.265/HEVC Andreas Unterweger Salzburg University of Applied Sciences May 29, 2017 Andreas Unterweger (Salzburg UAS) Selected coding methods in H.265/HEVC May 29, 2017 1 / 22

More information

HEVC The Next Generation Video Coding. 1 ELEG5502 Video Coding Technology

HEVC The Next Generation Video Coding. 1 ELEG5502 Video Coding Technology HEVC The Next Generation Video Coding 1 ELEG5502 Video Coding Technology ELEG5502 Video Coding Technology Outline Introduction Technical Details Coding structures Intra prediction Inter prediction Transform

More information

Digital video coding systems MPEG-1/2 Video

Digital video coding systems MPEG-1/2 Video Digital video coding systems MPEG-1/2 Video Introduction What is MPEG? Moving Picture Experts Group Standard body for delivery of video and audio. Part of ISO/IEC/JTC1/SC29/WG11 150 companies & research

More information

INF5063: Programming heterogeneous multi-core processors. September 17, 2010

INF5063: Programming heterogeneous multi-core processors. September 17, 2010 INF5063: Programming heterogeneous multi-core processors September 17, 2010 High data volumes: Need for compression PAL video sequence 25 images per second 3 bytes per pixel RGB (red-green-blue values)

More information

Abstract of the Book

Abstract of the Book Book Keywords IEEE 802.16, IEEE 802.16m, mobile WiMAX, 4G, IMT-Advanced, 3GPP LTE, 3GPP LTE-Advanced, Broadband Wireless, Wireless Communications, Cellular Systems, Network Architecture Abstract of the

More information

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Video Transcoding Architectures and Techniques: An Overview IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Outline Background & Introduction Bit-rate Reduction Spatial Resolution

More information

Features. Sequential encoding. Progressive encoding. Hierarchical encoding. Lossless encoding using a different strategy

Features. Sequential encoding. Progressive encoding. Hierarchical encoding. Lossless encoding using a different strategy JPEG JPEG Joint Photographic Expert Group Voted as international standard in 1992 Works with color and grayscale images, e.g., satellite, medical,... Motivation: The compression ratio of lossless methods

More information

The JPEG2000 Still-Image Compression Standard

The JPEG2000 Still-Image Compression Standard The JPEG2000 Still-Image Compression Standard Majid Rabbani Eastman Kodak Research Laboratories Majid.Rabbani@kodak.com Diego Santa Cruz Swiss Federal Institute of Technology, Lausanne (EPFL) Diego.SantaCruz@epfl.ch

More information

Wavelet Transform (WT) & JPEG-2000

Wavelet Transform (WT) & JPEG-2000 Chapter 8 Wavelet Transform (WT) & JPEG-2000 8.1 A Review of WT 8.1.1 Wave vs. Wavelet [castleman] 1 0-1 -2-3 -4-5 -6-7 -8 0 100 200 300 400 500 600 Figure 8.1 Sinusoidal waves (top two) and wavelets (bottom

More information