Lecture 14, Video Coding

Stereo Video Coding

A further application of the tools we saw (particularly motion compensation and prediction) is stereo video coding. Stereo video is used to create a spatial impression: each eye sees its own picture or video, and the pictures for the two eyes have slight disparities between them, which contain the depth information. The closer an object is to the eyes, the larger the disparity between the pictures for the two eyes. The brain uses this disparity to estimate the depth of an object.

The simplest approach would be to transmit 2 separate video streams, one for each eye, with the problem that we obtain twice the data rate. This problem of increased data rate becomes even worse if we have more than 2 views, also called multi-view video. This is used for instance in auto-stereoscopic displays (which work without glasses). These displays produce different views at different angles from the display, such that each eye sees a different view, and such that if the head is moved, slightly different views are seen, which increases the 3-D effect. The number of views is usually up to about a dozen.

Another common approach for stereo displays
is the use of shutter glasses. They exploit the fact that modern monitors or displays can show 100 pictures per second or more, so that we can display an image for the left eye, followed by an image for the right eye, separated by the shutter glasses, which open each glass only while the image for the corresponding eye is displayed. This approach, again, needs only 2 views.

The multi-view approach makes the bit rate problem even worse, multiplying the bit rate needed for a single video stream by the number of streams. An interesting way to reduce the bit rate can be seen in auto-stereoscopic displays, where there are approaches to generate the multiple views from just a pair of views, such that multi-view video is reduced to stereo video. This is also an important approach because often only 2 views are available as a video source.

Our goal is now to use the redundancies between the stereo or multi-view videos to reduce the bit rate, or to generate new views. The problem, which becomes especially apparent with auto-stereoscopic displays, is that the newly generated views might not have sufficient quality. Often, visually important information cannot be interpolated, for instance if new angles with
new patterns appear in the image. Solving this problem would be important because, for instance, auto-stereoscopic display development would then be independent of the transmission of multi-view video; it could all be based on stereo video content. Even stereo video content contains many redundancies between the 2 views, which can be used to reduce the data rate needed for transmission.

The approach taken is to use motion estimation and compensation not only in the temporal direction, but also between the left and right stereo videos, or between the multiple video streams of a multi-view source. This principle can be seen in the following image,
(From: "Joint Prediction Algorithm and Architecture for Stereo Video Hybrid Coding Systems", Li-Fu Ding, Shao-Yi Chien, and Liang-Gee Chen, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16, No. 11, November 2006.) Applying the principle of motion compensation to the disparity between views is called "disparity compensation". This principle was also defined within MPEG for multi-view video sequences. A possible coder structure using this principle can be seen in the following image,
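The disparity search underlying such a coder works like ordinary block-based motion estimation, except that it matches blocks between the two views at the same time instant, and for rectified stereo cameras the search can be restricted to horizontal shifts. The following Python/NumPy sketch is purely illustrative; the function name, block size, and search range are assumptions, not taken from the cited paper or any standard.

```python
import numpy as np

def estimate_disparity(left, right, block=8, max_disp=16):
    """Block-matching disparity estimation between two rectified views.

    For each block of the left view, search the right view at the same
    vertical position over horizontal shifts only, and keep the shift
    with the smallest sum of absolute differences (SAD).
    Returns one horizontal disparity value per block.
    """
    h, w = left.shape
    disp = np.zeros((h // block, w // block), dtype=int)
    for by in range(h // block):
        for bx in range(w // block):
            y, x = by * block, bx * block
            target = left[y:y + block, x:x + block].astype(int)
            best_sad, best_d = None, 0
            for d in range(-max_disp, max_disp + 1):
                xs = x + d
                if xs < 0 or xs + block > w:
                    continue  # candidate block would leave the image
                cand = right[y:y + block, xs:xs + block].astype(int)
                sad = int(np.abs(target - cand).sum())
                if best_sad is None or sad < best_sad:
                    best_sad, best_d = sad, d
            disp[by, bx] = best_d
    return disp
```

A real coder would additionally trade the prediction error against the cost of coding the disparity vectors, as in rate-distortion optimized motion estimation.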
The principle of joint motion and disparity compensation is illustrated in the following image (both from "Joint Prediction Algorithm and Architecture for Stereo Video Hybrid Coding Systems", Li-Fu Ding, Shao-Yi Chien, and Liang-Gee Chen, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16, No. 11, November 2006),
(SW: Search Window, ME: Motion Estimation, DE: Disparity Estimation). It is interesting to see how much the disparity estimation is actually used in a typical video sequence, (DE: Disparity Estimation, ME: Motion Estimation, Q: Quality factor; higher means better quality and a higher bit rate). Here it can be seen that plain motion estimation is still used most often. The following image shows that disparity estimation is beneficial particularly for moving objects. For instance, if a moving object uncovers a part of the background which is already visible in the other view, that part can be predicted from the other view's picture, but not from the past picture, (DV: Disparity Vector). The performance of the proposed system from the above source can be seen in the following image,
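The mode decision between the two predictors can be sketched as follows: for each block of the current frame, find the best temporal match (ME, in the previous frame of the same view) and the best inter-view match (DE, in the current frame of the other view), and keep whichever gives the smaller matching error. This is a deliberately simplified, hypothetical sketch of such a joint scheme, not the algorithm of the cited paper; all names are illustrative.

```python
import numpy as np

def best_predictor(block, prev_frame, other_view, y, x, search=4):
    """Choose between temporal (ME) and inter-view (DE) prediction
    for one block at position (y, x) of the current frame, by full
    search over a small window in each reference and comparing the
    best SAD values. Returns the winning mode and its SAD.
    """
    b = block.shape[0]

    def best_sad(ref):
        h, w = ref.shape
        best = None
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                ys, xs = y + dy, x + dx
                if ys < 0 or xs < 0 or ys + b > h or xs + b > w:
                    continue  # candidate block outside the reference
                cand = ref[ys:ys + b, xs:xs + b].astype(int)
                sad = int(np.abs(block.astype(int) - cand).sum())
                if best is None or sad < best:
                    best = sad
        return best

    sad_me = best_sad(prev_frame)   # temporal candidate (ME)
    sad_de = best_sad(other_view)   # inter-view candidate (DE)
    return ("ME", sad_me) if sad_me <= sad_de else ("DE", sad_de)
```

The disocclusion case described above is exactly where DE wins: a background region hidden in all past frames of one view, but visible in the other view, matches perfectly only in the inter-view reference.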
(SP: Simple Profile, TSP: Temporal Scalability Profile, JPA: Joint Prediction Algorithm). Here it can be seen that disparity compensation allows considerable bit rate savings.

Depth Map Coding

Another simple possibility for 3-D video coding is the use of depth maps. Here, only 1 view is encoded, and the other views are generated using the depth map. The depth map can have a relatively low spatial resolution, because the exact object boundaries can be obtained from the encoded main view, and it can hence be encoded with a relatively low bit rate. The disadvantage is that the additional views are only approximations, because no genuinely new information can be generated. Hence
this leads to 3-D views which usually do not have as high a quality as transmitting the separate views. The following image shows an example of a depth map (from: "Compression and Transmission of Depth Maps for Image-Based Rendering", Ravi Krishnamurthy, Bing-Bing Chai, Hai Tao, and Sriram Sethuraman, ICIP 2001). Here, pictures (a) and (b) are the left and right views, and (c) is the depth map generated from the 2 views. In the depth map, bright means near and dark means far away. The depth map can be seen as a low resolution video in itself. As a result, for the transmission we need just one high resolution video, accompanied by
the low resolution depth map video, which is then used to generate an artificial 3-D video view. This approach is also being treated in MPEG for standardization.

Blu-ray 3D: Here we usually have a standard video stream, but with the left and right views appearing as the left and right halves of the video image, or as its lower and upper halves. This means we have a reduced spatial resolution for 3-D video, either horizontally or vertically, and no explicit use is made of the redundancies between the 2 views.
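The view generation from one texture view plus a depth map, as described above, can be sketched as a simple depth-image-based rendering step: each pixel is shifted horizontally by a disparity derived from its depth value, with nearer (brighter) pixels shifted further, and disoccluded holes are filled crudely from the background. The depth-to-disparity mapping, the hole filling, and all names below are illustrative assumptions; real renderers use calibrated camera parameters and better inpainting.

```python
import numpy as np

def render_view(texture, depth, max_disp=8):
    """Synthesize a second view from one texture image and its depth map.

    Each pixel is shifted right by a disparity proportional to its depth
    value (0..255, bright = near). When two pixels land on the same target
    position, the nearer one wins (a z-buffer). Holes left by disocclusion
    are filled by repeating the last pixel seen from the left.
    """
    h, w = texture.shape
    view = np.full((h, w), -1, dtype=int)   # -1 marks disoccluded holes
    zbuf = np.full((h, w), -1, dtype=int)   # nearest depth written so far
    for y in range(h):
        for x in range(w):
            d = int(depth[y, x]) * max_disp // 255  # near = large shift
            xs = x + d
            if 0 <= xs < w and int(depth[y, x]) > zbuf[y, xs]:
                view[y, xs] = texture[y, x]  # nearer pixel overwrites
                zbuf[y, xs] = int(depth[y, x])
        last = 0
        for x in range(w):  # crude hole filling, left to right
            if view[y, x] < 0:
                view[y, x] = last
            else:
                last = view[y, x]
    return view.astype(np.uint8)
```

The quality limitation discussed above is visible directly in this sketch: the disoccluded regions contain no transmitted information at all, so the renderer can only guess them, which is why synthesized views usually fall short of separately transmitted ones.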