
3D omnistereo panorama generation from a monocular video sequence

Ning, Zhiyu U

A report for the COMP6703 eScience Project at the Department of Computer Science, Australian National University

Acknowledgement

I would like to thank my client and supervisor, Dr Hongdong Li, for giving me the chance to work on a computer vision project and for his support and inspiration. I would also like to thank my supervisor, Dr Alistair Rendell, and my lecturers, Dr Henry Gardner and Dr Pascal Vuylsteker, for their academic support and useful advice. Finally, thanks to my family and friends for all their support and help.

ABSTRACT

An interesting problem in computer vision is generating 3D video from traditional 2D video. To achieve this, we first need to understand how to use a single monocular camera to create a 3D scene sensation; the result is called a stereo panorama. In paper [1], Shmuel Peleg and his colleagues proposed an approach for generating a stereo panorama using only a single video camera rotating about an axis behind its lens, based on the X-slit camera principle of paper [2]. Specifically, the stereo panorama images are obtained by pasting together strips taken from each frame of the video sequence. However, they did not specify how to choose the strip width or the location of each strip, nor how these parameters affect the 3D sensation, and they gave no benchmark for judging the quality of a stereo image pair. The chosen strip width is clearly related to the camera's rotation speed, and the location of each strip is related to disparity. In this report, I use a virtual speed rather than the actual speed of the camera to determine the strip width, and I present the relationship between the virtual speed and the actual speed. I also analyse the location of each strip with respect to the actual 3D sensation of the stereo panorama images, and I give my own benchmark for distinguishing the quality of stereo image pairs.

Keywords: computer vision, stereo panorama, X-slit camera, parameters, disparity

CONTENTS

Acknowledgement
ABSTRACT
1 INTRODUCTION
  1.1 General introduction
  1.2 Stereo Panorama images
  1.3 Camera classifications
  1.4 Sharp RD3D
  1.5 Structure
2 BACKGROUNDS
  2.1 X-slit camera
  2.2 Camera calibration
  2.3 Single viewpoint projections
  2.4 Multiple viewpoint projections
  2.5 Benchmark on stereo panorama image quality
    2.5.1 Disparity
    2.5.2 Algorithm built
  2.6 Sharp RD3D API
3 CLIENT REQUIREMENT SPECIFICATIONS
  3.1 Theory analysis requirement
  3.2 Implementation requirement
4 PROJECT PLAN
  4.1 Milestone
  4.2 Future work
5 THEORY ANALYSES AND MODELING
  5.1 Methodology
  5.2 Parameter analysis
  5.3 Image points matching
  5.4 Algorithm building
  5.5 Modeling
6 IMPLEMENTATIONS
  6.1 Initial implementation methods
  6.2 Software introduction
  6.3 Input specification
  6.4 Output results
  6.5 Testing on Sharp RD3D API
7 RESULTS AND ANALYSIS
  7.1 Testing effect
  7.2 Result analysis and improvement
8 CONCLUSIONS
APPENDIX
Bibliography

1 INTRODUCTION

1.1 General introduction

A normal panoramic image covers 360 degrees of a natural scene. Traditionally, it is captured either with a camera that has a special lens or with multiple cameras. Nowadays, however, software can automatically generate a 360-degree panoramic image from a set of images, covering the whole panoramic scene, taken by a video camera. A problem arises when we try to use this kind of software to generate a stereo panoramic image, which consists of a pair of images, one for left-eye viewing and the other for right-eye viewing. This is because an ordinary panoramic image is generated from approximately a single viewpoint, whereas stereo panorama images are generated from two different viewpoints simulating the locations of the human eyes. One proposed approach therefore uses two cameras to simulate the two eyes when capturing the images. However, this approach is more complicated and hard for ordinary users to implement.

Shmuel Peleg proposed a new approach [1] to generate stereo panorama images using multiple viewpoint projections, which he called circular projections. It uses a single video camera rotating about a vertical axis behind its lens to capture a sequence of images with different viewpoints. This approach can generate a stereo panorama in accordance with the X-slit camera principle. An X-slit camera [2] has two slits rather than the single pinhole of a traditional camera; the details are presented in chapter 2.1. However, the parameters of the method and their effects on the 3D sensation are not discussed in [1].

To understand why a single camera can simulate a two-camera system, you need the basic principle of the X-slit camera and of multiple viewpoint projections, both of which are introduced in chapter 2. The section on camera calibration explains how a real camera works and what its critical parameters are.

Moreover, to implement the approach in [1], we need to know something about how to benchmark stereo panorama image quality and about the SHARP RD3D, the world's first laptop that presents 3D images without special glasses, which is used to display the test results. These are also presented in chapter 2. In the rest of this chapter, short introductions are given to stereo panorama images, camera classifications and the SHARP RD3D, followed by the structure of the remaining chapters.

1.2 Stereo Panorama images

A stereo panorama consists of a pair of images, one for left-eye viewing and the other for right-eye viewing. Viewed together they produce a stereo sensation of the panoramic scene. Figure 1.1 illustrates the principle of 3D imaging.

FIGURE 1.1

1.3 Camera classifications

In theory, we can define only one kind of camera: the X-slit camera [2, 3], which has two slits inside. The details are introduced in chapter 2.1. The familiar pinhole camera is just a special kind of X-slit camera whose two slits intersect and are orthogonal to each other. The parameters of the camera directly affect the critical parameters in this project, so they are quite important. The process of estimating the intrinsic and extrinsic parameters of a camera is called camera calibration, which is introduced in chapter 2.2.

1.4 Sharp RD3D

The Sharp RD3D is the world's first auto-stereo 3D notebook computer. Sharp Corporation's TFT 3D LCD technology makes it possible to view eye-popping 3D images without special glasses [6]. We will test our results on the Sharp RD3D laptop, and in chapter 2.6 we will discuss the 3D image generation API of the Sharp RD3D notebook computer.

1.5 Structure

This report is divided into eight chapters. Chapter 1 provides an introduction to the project. Chapter 2 provides background information relevant to the project. Chapter 3 addresses the client's requirement specification. Chapter 4 compares my initial project plan with the actual timetable. Chapter 5 describes the approach proposed by Shmuel Peleg that is used to achieve the target, together with my extension studies, and then presents my algorithm for achieving the target. Chapter 6 describes the experimental setup for the algorithms proposed in chapter 5. Chapter 7 evaluates the results and suggests future work. Finally, chapter 8 gives the conclusions of the report.

2 BACKGROUNDS

This chapter addresses the background knowledge relevant to the project. Chapter 2.1 introduces the X-slit camera and how a single camera can simulate the effect of an X-slit camera. Chapter 2.2 introduces the parameters of a regular camera and how to define them mathematically. Chapter 2.3 discusses the traditional projection used when capturing panoramic images, the single viewpoint projection. Chapter 2.4 discusses several kinds of multiple viewpoint projection and the one applied in this project. Chapter 2.5 gives a benchmark for judging the quality of a stereo panorama image, and finally chapter 2.6 explains the Sharp RD3D 3D imaging API.

2.1 X-slit camera

The first physical X-slit camera was designed by one of the pioneers of colour photography, Ducos du Hauron [3], in the 19th century. The X-slit camera model is shown in Figure 2.1. The general X-slit camera is designed with two arbitrary slits l1 and l2: the projection ray of a 3D point P first intersects slit l2, then intersects slit l1, and finally reaches the image plane.

FIGURE 2.1

In particular, if the two slits lie in the same plane and are orthogonal to each other, a pinhole camera is produced. I refer the reader to [8] for a detailed description of X-slit camera rendering and to [2, 7] for how a translating pinhole camera can simulate the effects produced by X-slit cameras with different parameters. Here I only introduce the basic knowledge of the X-slit camera that is applied in this project. According to [2, 7, 8], novel views in X-slit camera rendering are created by sampling column strips from the input images, as shown in figure 2.2.

FIGURE 2.2

The width of the column strips is related to the range of pinhole camera positions and to the width of each pinhole image. The width of the column strips can therefore vary, as shown in figure 2.3; it is determined by the speed of the translating camera. In particular, if the pinhole camera translates at constant speed, the column strips all share the same width.

FIGURE 2.3

The location of the column strips is related to the location of the virtual camera's vertical slit, as shown in figure 2.4. Therefore, if the orientation and location of the vertical slit vary, the novel views created will differ from each other.

FIGURE 2.4

As mentioned above, both the width and the location of the column strips are related to camera parameters such as the focal length of the aperture or vertical slit, the horizontal viewing angle of the camera and the resolution of the camera. We can therefore conclude that if we need to define the width and location of the column strips accurately, we should define the camera parameters first; that is camera calibration. A minimal sketch of the strip-sampling idea is given below.
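To make the strip-sampling idea concrete, the MATLAB fragment below assembles a mosaic by taking one column strip from each image of a translating (or rotating) pinhole camera. It is only a sketch of the principle described above: the variable frames is assumed to hold the input images, and stripWidth and stripCenter are placeholder values rather than the parameters derived later in chapter 5.

    % Minimal sketch: build a mosaic by pasting one column strip per frame.
    % 'frames' is assumed to be a cell array of same-sized RGB images read
    % from a video beforehand; stripWidth and stripCenter are illustrative.
    stripWidth  = 8;                 % strip width in pixels (placeholder)
    stripCenter = 320;               % column about which each strip is cut (placeholder)

    mosaic = [];
    for k = 1:numel(frames)
        img = frames{k};
        c1  = stripCenter - floor(stripWidth / 2);
        c2  = c1 + stripWidth - 1;
        mosaic = [mosaic, img(:, c1:c2, :)];   % append this frame's strip
    end
    imshow(mosaic);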

2.2 Camera Calibration

In this section, I introduce the mathematical definition [5] of the geometric camera model as well as the mathematical method for estimating the camera parameters. I then discuss the parameters that need to be defined in my project and propose a future research extension of the project.

2.2.1 Geometric camera models

Mathematically, two kinds of camera model can be defined [5]: the camera with perspective projection and the camera with affine projection. In my opinion, the latter is just a special case of the former, a reasonable approximation of perspective projection when the observed objects lie at an approximately constant distance from the camera. Reference [5] also divides the camera parameters into intrinsic parameters, which relate the camera's coordinate system to an idealised image coordinate system, and extrinsic parameters, which relate the camera's coordinate system to a fixed world coordinate system and specify its position and orientation in space. From these definitions, [5] obtains an equation that expresses the image projection of a 3D point P given in the camera coordinate system using only the camera's intrinsic parameters, as follows.
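Following the standard formulation in [5] (written here as a sketch consistent with the symbols described below, rather than as a verbatim copy of the original equation):

\[
\mathbf{p} \;=\; \frac{1}{z}\,\mathcal{M}\,\mathbf{P},
\qquad
\mathcal{M} \;=\; \bigl(\;\mathcal{K}\;\;\mathbf{0}\;\bigr),
\qquad
\mathcal{K} \;=\;
\begin{pmatrix}
\alpha & -\alpha\cot\theta & u_0\\[2pt]
0 & \dfrac{\beta}{\sin\theta} & v_0\\[2pt]
0 & 0 & 1
\end{pmatrix},
\]

where p = (u, v, 1)^T is the homogeneous image point, P is the homogeneous 3D point in the camera frame and z is its depth.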

Here P = (x, y, z, 1)^T denotes the homogeneous coordinate vector of P in the camera coordinate system, and M denotes the 3x4 matrix built from K. K depends on the camera's intrinsic parameters α, θ, u0, β and v0. α and β are the magnification values that convert metres to pixel units, with α = kf and β = lf, where f is the camera's focal length and k, l are scale parameters. u0 and v0 are the offsets that account for the difference between the centre of the CCD array and the principal point C0, which is the point where the camera's optical axis pierces the physical retina, as shown in figure 2.5.

FIGURE 2.5, taken from [5]

θ is a further offset that accounts for manufacturing error in the camera coordinate system; it is the angle between the two image axes when they are not exactly 90 degrees apart. Reference [5] then obtains an equation that expresses the projection of a 3D point P using both the intrinsic and the extrinsic parameters, as follows.
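Again sketching the standard form of [5] with the notation described in the following paragraph, the projection of a point P given in world coordinates is

\[
\mathbf{p} \;=\; \frac{1}{z}\,\mathcal{M}\,\mathbf{P},
\qquad
\mathcal{M} \;=\; \mathcal{K}
\begin{pmatrix}
\mathbf{r}_1^{T} & t_x\\
\mathbf{r}_2^{T} & t_y\\
\mathbf{r}_3^{T} & t_z
\end{pmatrix}
\;=\; \mathcal{K}\,\bigl(\,\mathcal{R}\;\;\mathbf{t}\,\bigr),
\]

where z is now the depth of P in the camera frame.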

The notation R is a rotation matrix that defines the rotation of the camera with respect to the world coordinate system, and t is a translation vector. r1^T, r2^T and r3^T denote the three rows of R, and tx, ty and tz are the coordinates of t in the frame attached to the camera.

2.2.2 Geometric camera calibration

Geometric camera calibration is the process of estimating the intrinsic and extrinsic parameters of a camera. In my project it is essential to know some of the camera parameters. Generally speaking, I need to know the camera's horizontal viewing angle and its focal length when capturing the input video sequence, as shown in figure 2.6, and I also want to know the input image size measured in pixels. These all belong to the camera's intrinsic parameters.

FIGURE 2.6, taken from [5].

In figure 2.6, 2φ is the camera's viewing angle, f is the focal length when capturing the current images, and d is the diameter of the film. I captured the input video sequence with my own camera, so I know the camera parameters exactly and there is no need to estimate them. In the future, however, if the study is extended, it will be essential to understand how to estimate the parameters needed in the project; only then can a good-quality stereo panorama video (quality is addressed in chapter 2.5) be generated from any input video sequence.

Reference [5] defines a least-squares parameter estimation method for estimating the intrinsic and extrinsic parameters; the details can be found in chapter 3 of [5]. It assumes that the camera observes several geometric features whose locations are known in a fixed world coordinate system. It then computes the perspective projection matrix M (chapter 2.2.1) associated with the camera and recovers the intrinsic and extrinsic parameters from this matrix. This is called a linear approach to camera calibration [5].
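As a sketch of that linear approach (my own summary of the standard formulation rather than a quotation from [5]): each known 3D point P_i with observed image coordinates (u_i, v_i) constrains the rows m_1^T, m_2^T, m_3^T of M through

\[
u_i = \frac{\mathbf{m}_1^{T}\mathbf{P}_i}{\mathbf{m}_3^{T}\mathbf{P}_i},
\qquad
v_i = \frac{\mathbf{m}_2^{T}\mathbf{P}_i}{\mathbf{m}_3^{T}\mathbf{P}_i}
\;\;\Longrightarrow\;\;
\begin{cases}
\mathbf{m}_1^{T}\mathbf{P}_i - u_i\,\mathbf{m}_3^{T}\mathbf{P}_i = 0,\\[2pt]
\mathbf{m}_2^{T}\mathbf{P}_i - v_i\,\mathbf{m}_3^{T}\mathbf{P}_i = 0.
\end{cases}
\]

Stacking these two linear equations for every correspondence gives a homogeneous system in the twelve entries of M, which is solved in the least-squares sense (for example with the singular value decomposition), and the intrinsic and extrinsic parameters are then factored out of the estimated M.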

2.3 Single viewpoint projections

A single viewpoint projection is, literally, a view from a single fixed point in the world coordinate system, as shown in figure 2.7: all the projection rays intersect at one fixed centre point, the pinhole of the camera.

FIGURE 2.7

Currently, software can create a panoramic image using the single viewpoint projection principle. One example is Canon's free software PHOTOSTITCH 3.1, an easy-to-use program that reads in a set of images and stitches them together to form a full 360-degree panoramic image. The final panorama looks like a picture taken by a special camera fixed at a centre point of the panoramic scene, similar to the pinhole viewpoint in figure 2.7. However, such software can only generate a single panoramic image rather than a pair, because it is based on the single viewpoint projection principle. To generate a stereo pair of panoramic images, we need software based on the multiple viewpoint principle, in particular on a two-viewpoint view.

2.4 Multiple viewpoint projections

Before discussing multiple viewpoint projections, consider nature. Humans have two eyes, and most animals have at least two eyes and move their heads when looking for food or anything else. The reason is that they need to analyse the environment, or, in other words, to perceive the depth of the scene, that is, the location of each object in it. Translating this to image processing, we cannot determine the depth of a scene point along its projection ray from a single image; we need at least two pictures, and then depth can be measured through triangulation [5]. To generate 3D video, depth information is very important. An object at effectively infinite depth in the scene is perceived by our eyes with zero disparity, while an object closer to us is perceived with greater disparity. This is why you see two forefingers when you hold one forefinger in front of you at a small distance. Therefore, to generate stereo panorama images, we should first understand multiple viewpoint projection.

Multiple viewpoint projection literally refers to viewing a scene from different points at the same time. Figure 2.8 illustrates two-viewpoint projection: the 3D point P is projected onto the image planes of a viewing camera with optical centre O and a second viewing camera with optical centre O′.

FIGURE 2.8, taken from [5].

With this simple two-viewpoint projection model we are able to perceive a real 3D object, because the two cameras simulate our two eyes viewing the object, giving us the information needed to judge the depth of every point on it; we can then actually see the object in 3D. In my project I use a camera rotating about an axis behind its lens, which is also a kind of multiple viewpoint projection, as shown in figure 2.9.

FIGURE 2.9

Since the camera rotates around the axis in figure 2.9, every camera location is different. However, two locations that are close to each other share part of the scene (the shaded part in figure 2.9). It is therefore possible to create a 3D view of the scene, because all the information needed exists in the image sequence. However, I do not use the shared (shaded) part as the information for creating the 3D scene; I use another approach, shown in figure 2.10.

FIGURE 2.10, taken from [10].

Recalling the X-slit camera principle introduced in chapter 2.1, when I rotate the camera around the central axis I can imagine two slit cameras capturing the scene at the same time, as shown in figure 2.10: the right-eye projection ray and the left-eye projection ray. After rotating through a full circle they cover the full 360-degree scene, and these two slit cameras play the role of our two eyes capturing the scene. We therefore obtain two sets of image strips, one from the right-eye projection rays and the other from the left-eye projection rays, and from these we can build a stereo panorama image pair. But how can we tell whether the stereo panorama has good quality? I propose an approach that is introduced in the next section.

2.5 Benchmark on stereo panorama image quality

In this section I introduce the approach I use to define panorama image quality. Using this benchmark, I can tell which panorama images have good quality and which do not.

2.5.1 Disparity

Disparity refers to the difference between the images seen by the left and right eyes, which the brain uses as a binocular cue to determine the depth or distance of an object [11], as shown in figure 2.11.

FIGURE 2.11, taken from [12].

According to the definition above, when we perceive the ball in figure 2.11 we rely on the difference between the images that our left and right eyes actually see. The same principle underlies 3D displays: the ball is displayed on the screen by extending the lines from the eyes to the ball and intersecting them with the screen. In my project the system behaves much like this human example: two virtual slit cameras take pictures of the same scene but are displaced by a certain distance, exactly like our eyes (figure 2.10), and the computer then computes depth from the two sets of image strips. As mentioned above, disparity plays an important role in forming the 3D objects that we see, so disparity is essential when defining the quality of a stereo image pair.

2.5.2 Algorithm built

My approach to defining a stereo image pair's quality is based on the algorithm below.

1. Input a stereo image pair.
2. Display both images at the same time with pixel information turned on, i.e. an axis that marks the width of each image in pixels, as shown in figure 2.12.
3. Find the same point on the object closest to the cameras (refer to chapter 5.3) in both images and compute the pixel difference a, that is OCI1 - OCI2 = a, where OCI1 is the pixel coordinate on the axis of the first image for the object close to the camera, and OCI2 is the corresponding value in the second image.
4. Find the same point on the object farthest from the cameras (refer to chapter 5.3) in both images and compute the pixel difference b, that is OFI1 - OFI2 = b, where OFI1 is the pixel coordinate on the axis of the first image for the object far from the camera, and OFI2 is the corresponding value in the second image.
5. Compare the value a with Δ1 and the value b with Δ2, where Δ1 and Δ2 are values that need to be obtained from experiment; the details are addressed in the discussion that follows.

6. If a ≈ Δ1, or a is only a little bigger than Δ1, and b ≈ Δ2, or b is only a little smaller than Δ2, the stereo image pair is judged to be of good quality. In particular, if the far objects lie at an almost infinite distance from the camera, then Δ2 ≈ 0 and b ≈ 0, which is what we call zero disparity.

FIGURE 2.12, taken from my home.

In my benchmark, then, I need to define a reasonable Δ1 and a reasonable Δ2. Recall that we see an object very close to us with great disparity and an object at infinite distance with zero disparity; as an object's distance grows, its disparity becomes smaller and smaller. As shown in figure 2.13, by the similar-triangle principle D/b ≈ f/d, where D is the radial distance, b is the distance between the two cameras that simulate the human eyes, f is the camera's focal length and d is the disparity. In addition, O and O′ mark the object the cameras are looking at, and D is the distance between the object and the cameras.

We can therefore relate an object's distance D to its disparity: the total disparity is d ≈ bf/D, so the offset seen in each single image is about d/2 = bf/(2D), measured in metres.

FIGURE 2.13, modified from [19].

From this I obtain the expression for my thresholds, Δ = λbf/(2D), where λ is a ratio measured in pixels per metre. The value of λ is determined by the camera's intrinsic parameters, as shown in figures 2.6 and 5.6, and the resulting relationship between distance and disparity is shown in figure 2.14.

FIGURE 2.14

Therefore Δ1 and Δ2 can be obtained from the expression above once the depth of the object is known (the depth can be obtained from two images with about 2/3 overlap, as proposed in [20]). Note that this algorithm cannot be applied to points lying in the plane exactly midway between the two cameras, shown in figure 2.13 as the plane perpendicular to b that contains the line OO′, because all points on that line are seen by the two cameras with no disparity. This plane is therefore called the zero-disparity plane. A small sketch of the benchmark computation is given below.
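The following MATLAB fragment sketches the benchmark check described above. All the numerical values are placeholders: lambda, the baseline, the focal length, the object depths and the tolerance are assumptions chosen only to illustrate how Δ1 and Δ2 are computed and compared with the measured pixel differences a and b.

    % Sketch of the stereo-quality benchmark (illustrative values only).
    lambda   = 150000;  % pixels per metre on the image plane (assumed)
    baseline = 0.065;   % distance b between the two virtual cameras, metres
    focal    = 0.006;   % focal length f, metres (assumed)
    Dnear    = 1.5;     % depth of the nearest object, metres (assumed)
    Dfar     = 50;      % depth of the farthest object, metres (assumed)

    % Thresholds from Delta = lambda*b*f/(2*D)
    delta1 = lambda * baseline * focal / (2 * Dnear);   % near-object threshold, pixels
    delta2 = lambda * baseline * focal / (2 * Dfar);    % far-object threshold, pixels

    % Measured pixel differences of matched points (placeholders)
    a    = 20;          % near-object difference
    bFar = 0.5;         % far-object difference

    tol = 0.2;          % allowed relative deviation (assumed)
    goodNear = a >= delta1 * (1 - tol) && a <= delta1 * (1 + tol);
    goodFar  = bFar <= delta2;          % far difference should not exceed its threshold
    if goodNear && goodFar
        disp('Stereo pair passes the disparity benchmark.');
    else
        disp('Stereo pair fails the disparity benchmark.');
    end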

2.6 Sharp RD3D API

In this section I introduce the environment used to test my results, which is based on the Sharp RD3D laptop. As introduced in chapter 1, the Sharp RD3D, the world's first 3D-display laptop, lets people perceive a 3D effect without glasses thanks to a special layer consisting of a series of specially designed vertical slits called a parallax barrier, as shown in figure 2.15.

FIGURE 2.15, taken from [21].

With the parallax barrier turned off, both eyes receive the same light from the screen; with the parallax barrier turned on, the two eyes receive separate light from the screen because of the barrier, and we therefore perceive a 3D sensation without special glasses. The Sharp RD3D API is an application platform that can display the pair of stereo panorama images. To make the 3D sensation work, the two images must be mosaicked in accordance with Sharp's special interleaving technology.

3 CLIENT REQUIREMENT SPECIFICATIONS

Client requirements are the most important guidance for the project objectives and also define the compulsory content of the project. My client requires not only a practical implementation of Peleg's theory but also an extension of it; to finish the project successfully I need to fulfil both.

3.1 Theory analysis requirement

1. Understand the basic principle of the X-slit camera and its projections.
2. Understand how X-slit camera projections can be achieved with a single camera.
3. Understand how the X-slit camera principle can be applied to stereo panorama image generation, and learn the approach proposed in [1].
4. Do extension research and provide a detailed mathematical derivation of how to choose the parameters proposed in [1], such as the strip width and the distance between the two strips.
5. Give a theoretical analysis of how the parameters affect the stereo sensation.
6. Synthesise the theory and provide a detailed algorithm for generating a stereo panorama with a single camera.

3.2 Implementation requirement

1. Understand the Matlab toolboxes for image processing and video processing.
2. Capture a video sequence with a single monocular camera as the input video.
3. Use the algorithm built to write Matlab code that processes the input video into a stereo panorama image pair.
4. Study the 3D display API for the Sharp RD3D laptop.
5. Use the stereo panorama images to generate a 3D video or virtual 3D scene on the Sharp RD3D laptop.
6. Improve the stereo sensation.

4 PROJECT PLAN

This chapter describes the initial project plan and the alternative plan actually followed during the project's development, and then addresses future work extending the current project.

4.1 MILESTONE

INITIAL PLAN:

Understanding the papers needed for this project: August 18th, 2006
Modeling and building the algorithm: September 8th, 2006
Implementing and debugging: October 8th, 2006
Testing and analyzing: October 22nd, 2006
Finalizing report: October 29th, 2006

ACTUAL PLAN:

Understanding the papers needed for this project: August 8th, 2006
Modeling and building the algorithm: September 18th, 2006
Implementing and debugging: October 10th, 2006
Testing and analyzing: October 20th, 2006
Finalizing report

4.2 FUTURE WORK

Because of the time limit for this project, there is a lot that could be extended from what I have done this semester. This section addresses part of it.

For the software part: firstly, the Matlab code should be converted to C++ code, because Matlab is convenient for researchers rather than for general users, and in my project a Matlab program runs much more slowly than a C++ program would. Secondly, an extra function could be built into the software to measure the quality of the stereo image pair automatically. Thirdly, another function could apply the camera calibration principle to estimate the camera's intrinsic and extrinsic parameters; with it, the software could process input from general users rather than input from one specific camera model.

For the research part: firstly, try to extend 3D image creation from a constrained camera track (rotation about an axis behind the lens) to free camera movement. Secondly, try to handle dynamic scenes rather than only the static scenes in my project. Thirdly, try to build an algorithm for general 2D-to-3D video conversion.

5 THEORY ANALYSES AND MODELING

This chapter first introduces the method used in this project, then focuses on the steps for choosing the parameters of the method and the techniques used when specifying those parameters. Finally, it gives the algorithm for achieving the goal and models the prototype based on that algorithm.

5.1 Methodology

Several approaches can be used to generate a stereo panorama image. A common one is proposed in [10], which uses two cameras simulating the two human eyes and rotating around an axis, as shown in figures 5.1 and 5.2.

FIGURE 5.1, a two-camera device, taken from [10].

FIGURE 5.2, the methodology of using two cameras to generate a stereo panorama image, taken from [10].

As shown in figures 5.2 and 2.9, the two cameras share some common information about the same object and each has some unique information about it. The system can therefore compute the depth of the object from this information and generate a 3D image pair. However, it is not easy for everyone to obtain a device like the one in figure 5.1, so it is difficult for a common user to generate an image with 3D sensation. In this project I use the technique proposed in [1, 13], based on the principle of [8], to generate a stereo panorama image pair using only one camera. Recalling the knowledge introduced in chapters 2.1 and 2.4, we can set up a prototype in which two virtual slit cameras behind the physical camera capture images while the physical camera records the scene as it rotates around an axis, as shown in figure 5.3.

FIGURE 5.3, taken from [13].

The letter O stands for the physical camera's optical centre, and VL and VR stand for the virtual slit cameras, where VL plays the role of the human left eye and VR the right eye. 2d is the distance between the two virtual cameras, and r is the physical camera's rotation radius. With the prototype of figure 5.3, rotating the physical camera about the axis captures a full 360-degree video sequence, giving us a set of images; in each image, one part is regarded as captured by each of the slit cameras. Recalling chapter 2.1, the slit cameras actually capture vertical strips of a single frame. In figure 5.3 we join the projection rays from VL and VR to the physical camera's optical centre and extend them to the image plane, obtaining the two corresponding vertical strips. Hence, over a full 360-degree rotation of the physical camera we obtain two sets of strips, one captured by the left-eye slit camera and the other by the right-eye slit camera. After stitching each set of strips together, we obtain two panorama images, one for left-eye viewing and the other for right-eye viewing. This approach is simple and direct; even common users can produce their own stereo pair with it. How to choose the strip locations in the image plane, the width of the strips and the camera's rotation radius therefore becomes an important issue, and whether the final stereo panorama generates a strong 3D sensation becomes an interesting topic. These questions are addressed in the next section.

5.2 Parameter analysis

In this section I analyse, in six steps, the parameters needed to create a stereo panorama image pair based on figure 5.3, as shown in figure 5.4.

FIGURE 5.4, derived from figure 5.3.

Firstly, to simulate the function of the human eyes closely, the distance between the two virtual slit cameras should be roughly the same as the human inter-eye distance, so 2d should be in the range of about 6.5 to 7 cm.

Secondly, we need to know the camera's rotation radius r. In this project I set r myself, so I know its exact value; in the future, software could be built to estimate the value of r + f, where f is the physical camera's focal length, by comparing several images containing the same objects, as addressed in chapter 4.2.

Thirdly, I use a Canon IXUS 4.0 as my physical camera (figure 5.5), so I know its focal length when recording video and the value of f is therefore known.

In the future, however, the camera calibration principle could be applied to estimate the camera's intrinsic parameters.

FIGURE 5.5

Fourthly, from the similar-triangle principle we get 2v = 2d·f / r, in millimetres, so the metric distance between the two vertical strips is computable. However, because I use a digital camera I can only read pixel values from the image, so I need to convert the millimetre value to pixels. Based on figure 2.6, I build a simple geometric model, shown in figure 5.6.

FIGURE 5.6, simple camera model.

The angle 2θ is the physical camera's horizontal viewing angle.

From this simple model we easily get the image plane width w = 2f·tan(θ). Dividing the 2v obtained in step four by w gives the fraction of the image width occupied by the separation between the two vertical strips; since the width of the image in pixels is known, multiplying that width by this fraction gives the distance between the two vertical strips in pixels.

Fifthly, to obtain the width of each vertical strip, I define a virtual speed of the camera rather than using its actual speed. The virtual speed is measured in pixels per second instead of millimetres per second. This is convenient because the value can be used directly to determine the width of the vertical strip in pixels, with no millimetre-to-pixel conversion, and the virtual speed is also easier to detect than the actual speed. When we rotate the camera clockwise, the images we obtain appear to move to the left, as shown in figure 5.7.

FIGURE 5.7

The point P's original position is occupied by another point P′, and P moves to the left-hand side of P′; the distance between P and P′ is the distance P has moved in one frame time. Assuming the camera's actual speed is constant, we obtain the virtual camera speed, in pixels per second, by multiplying this distance by the frame rate (24 frames per second for film, 25 to 30 for TV, and 20 frames per second in my project). A short sketch of these computations is given after this paragraph.
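The following MATLAB fragment sketches steps one to five for one pair of frames. The camera values (focal length, rotation radius, viewing angle, image width, frame rate) and the measured shift are illustrative assumptions, not the exact values of the Canon IXUS used in the project.

    % Sketch of the parameter computations in section 5.2 (assumed values).
    eyeDist    = 0.065;            % 2d: virtual inter-camera distance, metres
    r          = 0.30;             % rotation radius of the physical camera, metres
    f          = 0.006;            % focal length while recording, metres
    theta      = deg2rad(25);      % half of the horizontal viewing angle
    imgWidthPx = 640;              % image width in pixels
    fps        = 20;               % frame rate used in this project

    % Step 4: separation of the two strips, first in metres then in pixels.
    sep_m  = eyeDist * f / r;                 % 2v = 2d*f/r on the image plane
    w_m    = 2 * f * tan(theta);              % physical image plane width
    sep_px = round(imgWidthPx * sep_m / w_m); % strip separation in pixels

    % Step 5: strip width from the virtual speed between two adjacent frames.
    shift_px   = 6;                           % measured horizontal shift of P (placeholder)
    vSpeed     = shift_px * fps;              % virtual speed in pixels per second
    stripWidth = shift_px;                    % strip width equals the per-frame shift

    fprintf('strip separation = %d px, strip width = %d px, virtual speed = %d px/s\n', ...
            sep_px, stripWidth, vSpeed);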

However, the camera's actual speed is not always the same, so we need to detect the virtual speed at every point of the camera's motion. We can do this using image point matching, which is addressed in the next section. The distance between P and P′ is thus the value we need in order to determine the width of the vertical strips; we can also obtain a vertical virtual speed if the camera vibrates while rotating.

Finally, we need to discuss whether the values obtained above can actually generate a strong 3D sensation. Before that, we should understand the basic principle of how a 3D movie is played. The 3D movie process has two major parts. One is the generating part already discussed, using several cameras or the approach addressed in chapter 5.1 (the left part of figure 5.8). The other is the playing process, that is, the audience physically watching the movie (the right part of figure 5.8). To control the effect exactly, all the parameters of the left part and the right part of figure 5.8 would have to be exactly the same.

FIGURE 5.8

That is, camwide = eyewide, D = D′ (where D is the distance between the cameras and the object, and D′ is the perceived distance between the object and us), and a = B (where a is the camera's viewing angle and B is the human eye's viewing angle). If we can meet every one of these conditions, we can control exactly the distance by which object A jumps out of the screen; it is the same as the distance between A and the image plane.

In practice, however, we cannot meet all the conditions. The condition camwide = eyewide is easy, as defined in the first step of chapter 5.2. But a = B is almost impossible, because everybody's eyesight differs, and D = D′ is also very hard to achieve, because people cannot be forced to sit at a fixed location. Therefore, to use the parameters discussed in the six steps to generate a reasonable 3D sensation, we need an alternative to simply meeting the three conditions above. Here, a reasonable 3D sensation means that the audience can feel a reasonable amount of pop-out: the object appears to jump out of the screen by a certain amount, and the audience would like to reach out and touch it, as in figure 5.9.

FIGURE 5.9, taken from [6].

My alternative way to generate a reasonable 3D sensation is to adjust the parameters discussed in the six steps so as to compensate for the differences between D and D′ and between a and B. First, consider figure 5.10.

FIGURE 5.10

From figure 5.10 we can conclude that if D′ < D or B < a, the pop-out effect of object A is reduced, which may cause the 3D sensation to fail. To avoid this we could simply increase the size of the screen or ask the audience to sit closer to it. In my project, however, I can control neither of these; what I can control is the parameters. I can therefore increase the distance between the two virtual slit cameras to improve the 3D effect, as shown in figure 5.11.

FIGURE 5.11

The effect of increasing the distance between the left slit camera and the right slit camera is that we increase the perceived distance D′ to make it greater than D, adjusting the pop-out effect.

5.3 Image points matching

In this section I introduce the method used to detect the virtual speed discussed in the previous section, namely image point matching. There are currently two popular methods for image point matching: one is the L.K. (Lucas-Kanade) feature tracker and the other is the Harris corner detector. A newer method [14], proposed in 2006, is in my opinion the best of all: it detects invariant features of objects to match the same points, so it can be used for advanced searches that automatically look for the same points in an unordered set of images. In my project, however, the order of the input images is known and only a simple matching principle is needed, so I choose the Harris corner approach, which is easy to understand and to use. The details of the Harris corner detector are given in [15]. With this approach we can find feature points in an image, such as edges or intersection points. We can then generate putative matches between feature points detected in two images by looking for points that are maximally correlated with each other within windows surrounding each point, and only the points that correlate most strongly with each other in both directions are returned [20], as shown in figure 5.12.

FIGURE 5.12

In figure 5.12, A, B, C and D are feature points detected by the Harris corner detector; only when the correlation relationship is satisfied in both directions within a specific window around the points can we say that A is the same point as B. In computer vision the matching is generally carried out with a correlation (convolution-based) computation. As mentioned above, there are only two steps to determine the virtual speed in this project. The first step is to use the Harris corner approach to detect each frame's feature points and record their row and column positions. The second step is to compare the feature points of two adjacent frames, compute the difference in the horizontal and vertical coordinates of the matched points, and then multiply it by the frame rate (equivalently, divide by the time between frames) to obtain the virtual speed in pixels per second.

5.4 Algorithm building

In this section I bring together the analysis above to provide the algorithm for generating a stereo panorama image pair that is used in the implementation section. There are a few steps, listed below, followed by a short sketch of the pipeline.

Firstly, input a video sequence recorded using the setup of figures 5.3 and 5.4.

Secondly, extract each frame from the input video and save the frames as an image cube, as shown in figure 5.13.

FIGURE 5.13, taken from [10].

Thirdly, find the matching image points using the Harris corner approach and the correlation relationship, and compute the virtual speed.

Fourthly, compute the width of each vertical strip from the virtual speed, and compute the locations of the vertical strips. Specifically, the width of the vertical strip in the first frame is chosen from the virtual speed derived from the first and second frames, and is the same as the width used in the second frame.

Fifthly, stitch the strips together to form a pair of panorama images.
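The fragment below is a minimal MATLAB sketch of this five-step pipeline, not the actual project code. It assumes an uncompressed AVI input and a MATLAB version with VideoReader and the Image Processing Toolbox, and it estimates the per-frame shift by normalised cross-correlation of a central block rather than by full Harris corner matching; the file name, block size and strip separation are placeholders.

    % Minimal sketch of the strip-stitching pipeline (assumed file name and values).
    vid    = VideoReader('rotation.avi');         % step 1: input video (placeholder name)
    frames = {};
    while hasFrame(vid)                           % step 2: build the image cube
        frames{end+1} = readFrame(vid);           %#ok<AGROW>
    end

    sepPx    = 150;                               % strip separation in pixels (section 5.2)
    leftPan  = []; rightPan = [];
    prevGray = rgb2gray(frames{1});
    center   = round(size(prevGray, 2) / 2);

    for k = 2:numel(frames)
        gray = rgb2gray(frames{k});

        % Step 3: estimate the horizontal shift (virtual speed per frame) by
        % correlating a small central block of the previous frame with the current one.
        block  = prevGray(101:200, center-40:center+40);
        cc     = normxcorr2(block, gray);
        [~, i] = max(cc(:));
        [~, c] = ind2sub(size(cc), i);
        shift  = abs((center + 40) - c);          % per-frame shift in pixels
        shift  = max(shift, 1);

        % Steps 4 and 5: cut left and right strips of width 'shift' at +/- sepPx/2
        % from the image centre and append them to the two panoramas.
        lc = center - round(sepPx/2);
        rc = center + round(sepPx/2);
        leftPan  = [leftPan,  frames{k}(:, lc:lc+shift-1, :)];
        rightPan = [rightPan, frames{k}(:, rc:rc+shift-1, :)];

        prevGray = gray;
    end

    imwrite(leftPan,  'left_panorama.bmp');
    imwrite(rightPan, 'right_panorama.bmp');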

5.5 Modeling

The project is only an initial step, so the software is not as important as the theory. I therefore use the Matlab language to write code that tests whether the predicted effect of the theory agrees with practice, and the modeling process is accordingly quite simple.

Data format modeling. Firstly, I need to identify the input data format. Matlab supports only AVI video files, and if the AVI is compressed the decompression codec must be known to Matlab, so my test data are AVI files with a known codec. Secondly, I need to identify the image format. It can be any image format supported by Matlab, such as bmp, jpg, tiff or gif; I choose bmp.

Component modeling. As mentioned in chapter 5.4, I need a special unit to detect the virtual speed of the camera, so the overall structure is divided into two parts, as shown in figure 5.14.

FIGURE 5.14

As shown in figure 5.14, the motion estimation and compute-relative-shift units are used to detect the virtual speed, and the results are used to decide the width of each vertical strip. The panorama-building unit then processes the results passed from the compute-relative-shift unit to select the width of each strip, decides the locations of the left and right strips according to the camera's parameters, and finally outputs a stereo image pair.

6 IMPLEMENTATIONS

This chapter introduces the way I implemented the theory analysed above. It includes the methods I tried at the very beginning, based on my own understanding of the problem, compares those original approaches with the one finally used in my project, and explains why I did not keep the initial approaches.

6.1 Initial implementation methods

At the beginning of the project I used my own ways to generate a stereo panorama image pair. Firstly, I used a single camera to simulate the two-camera device shown in figure 5.1; figure 6.1 illustrates how I implemented it.

FIGURE 6.1

As shown in figure 6.1, I first fixed the camera at camera spot A to capture a video sequence, and then fixed it at camera spot B to capture another video sequence. In particular, if the actions in the two processes are exactly the same (the same vertical or horizontal shaking and the same speed at each frame), each pair of frames forms a stereo image pair, as shown in figure 6.2.

FIGURE 6.2

After mosaicking them, a stereo image is generated, as shown in figure 6.3.

FIGURE 6.3

Therefore, we can generate a stereo panorama image by stitching these unit pairs. This approach has several advantages. The first is that it is quite easy to understand for people without any computer vision knowledge. The second, and most important, is that we can generate the view of a virtual camera, as shown in figure 6.4.

FIGURE 6.4

In figure 6.4, the view of the virtual camera A′ in the virtual system O can be simulated, because all of A′'s projection rays can be computed from A's projection rays once A has rotated through 360 degrees. However, the mathematics involved in achieving this is complicated. The disadvantage of this approach is that it only gives good results when the camera actions at the left spot and at the right spot are exactly the same.

The second approach I tried for generating a stereo image pair is a little different from the one above, as shown in figure 6.5.

FIGURE 6.5

This approach uses several camera pairs to capture unit stereo pairs, as indicated by the circles in figure 6.5. In each circle I used the camera to capture a single image at camera spot A, then shifted the camera to camera spot B, located about 6.5 cm (the human inter-eye distance) away from spot A, to capture another image; these two images then comprise a stereo image pair, and repeating this gives a set of stereo image pairs. The location of each circle is chosen so that the image taken at the corresponding spot of the current circle (for example, spot A of the second circle, going clockwise) overlaps at least 1/3 of the previous image (the spot A marked in figure 6.5). I can then stitch the unit pairs together to generate the panorama images, using the SIFT algorithm to mosaic the images. In particular, if the radius from spot A to the centre axis is greater than a certain value, the arc from A to B is almost equal to the straight-line distance from A to B, so we can consider spot A and spot B to lie on the same circle. We then do not need a whole video sequence to generate a stereo panorama image pair; instead we need only some pairs of unit stereo images, as shown in figures 6.6 and 6.7.

FIGURE 6.6, unit pair generated by initial approach two.

FIGURE 6.7, 3D image generated from figure 6.6.

The advantage of this second approach is that it is easier and quicker to implement than both the first approach and the one finally used in my project, owing to the large reduction in image cube size. Its disadvantage is that it cannot be used to generate a virtual camera's view. In the following sections I introduce the way I implemented the theory actually used in my project.

6.2 Software introduction

In my project I use Matlab as my programming language, because the project is an initial research project and I need an easy-to-use language to implement the theory obtained. MATLAB is a high-level language and interactive environment that lets you develop computationally intensive applications faster than with traditional programming languages such as C, C++ and Fortran, and it has many features built especially for engineering, including computer vision. Video and images can be read directly into Matlab with simple built-in functions (comparable to classes in Java) and manipulated directly as matrices. I used MATLAB version R2006b as the main language for the stereo panorama image generation, and I also wrote a JavaScript file and a simple HTML page to display my stereo images on the SHARP RD3D computer.

6.3 Input specification

To generate a good 3D panorama image, the quality of the input video is very important. To characterise it, I classify the input video into two categories. One is the ideal situation, with a constant rotation speed and no vertical vibration.

The other is the normal situation, which allows varying rotation speed and some vertical vibration. In the ideal situation I do not need to apply the virtual speed detection algorithm to every frame of the video to correct vertical and horizontal vibration, so the program runs quickly. In the normal situation the virtual speed detection algorithm has to be applied to every frame, which is quite slow to execute in Matlab. In my project I used a Canon IXUS 4.0 to capture the video sequence and used my elbow as the rotation radius, so my input falls into the normal situation. Moreover, the scene content of the video greatly affects the 3D sensation; I discuss this later in the test section.

6.4 Output results

My program outputs a pair of stereo images, and I used the SHARP company's open-source SDK to generate a combined stereo image for viewing on the SHARP RD3D notebook computer. Figures 6.8 and 6.9 illustrate the output image pairs, and figures 6.10 and 6.11 illustrate the corresponding synthesised 3D images that can be displayed on the SHARP RD3D computer.

FIGURE 6.8

FIGURE 6.9

FIGURE 6.10

FIGURE 6.11

The advantages of this approach are obvious. Firstly, it is easy to capture a suitable video sequence with any home-use camera, so people without specialist knowledge can do it. Secondly, because of the continuous camera positions while capturing video, we can easily calculate a virtual 3D view at virtual points, as shown in figure 6.12.

FIGURE 6.12

There is always a projection ray that intersects the physical camera's moving circle at a point where a specific camera position can be found. For example, to find the camera's viewing rays at position A, we need to find the projection rays from the physical camera's locations B and B′.

Therefore, for a virtual position A, all the projection rays of a virtual camera located at A can be found among the set of rays generated by the physical camera's locations from B to B′ on the moving circle. Hence, in theory, we can generate a real-time 3D sensation anywhere within this circle.

6.5 Testing on Sharp RD3D API

The SHARP RD3D comes with software for viewing stereo images, and I can test my results with it. However, this software limits the image size of the stereo photo: for a panorama image the width is much greater than the height, and the software automatically fits the image to the screen size, which I cannot adjust, so the small image produces little 3D sensation. I therefore wrote an HTML page and a JavaScript file to display my image, which lets the user move forwards or backwards to explore the whole panorama scene.

7 RESULTS AND ANALYSIS

In this chapter I first introduce the way I tested my project and then discuss the results obtained from a set of test feedback questionnaires.

7.1 Testing effect

I used a questionnaire to test the 3D sensation of my stereo images among a group of 10 people, for two main purposes. In the first stage, I measured each tester's inter-eye distance with a ruler and asked the testers to look at a set of 5 stereo images of the same scene generated with different parameter choices, namely different distances between the two strips, calculated from different distances between the two virtual slit cameras ranging from 6 cm to 8 cm in 0.5 cm increments. I then asked them to pick the one they found best, because I wanted to know how the human inter-eye distance affects the 3D sensation. In the second stage, I used the stereo image voted best in the first stage together with 2 other stereo panorama images, generated with identical parameters but captured in different environments, as the test data, and asked the testers to decide which was best. I then discuss the results and suggest improvements to the 3D effect where applicable. Finally, I asked the testers for their own suggestions on how to improve the 3D sensation, based on their knowledge and imagination.


More information

Final Review CMSC 733 Fall 2014

Final Review CMSC 733 Fall 2014 Final Review CMSC 733 Fall 2014 We have covered a lot of material in this course. One way to organize this material is around a set of key equations and algorithms. You should be familiar with all of these,

More information

5LSH0 Advanced Topics Video & Analysis

5LSH0 Advanced Topics Video & Analysis 1 Multiview 3D video / Outline 2 Advanced Topics Multimedia Video (5LSH0), Module 02 3D Geometry, 3D Multiview Video Coding & Rendering Peter H.N. de With, Sveta Zinger & Y. Morvan ( p.h.n.de.with@tue.nl

More information

Camera model and multiple view geometry

Camera model and multiple view geometry Chapter Camera model and multiple view geometry Before discussing how D information can be obtained from images it is important to know how images are formed First the camera model is introduced and then

More information

CHAPTER 3. Single-view Geometry. 1. Consequences of Projection

CHAPTER 3. Single-view Geometry. 1. Consequences of Projection CHAPTER 3 Single-view Geometry When we open an eye or take a photograph, we see only a flattened, two-dimensional projection of the physical underlying scene. The consequences are numerous and startling.

More information

Outline. ETN-FPI Training School on Plenoptic Sensing

Outline. ETN-FPI Training School on Plenoptic Sensing Outline Introduction Part I: Basics of Mathematical Optimization Linear Least Squares Nonlinear Optimization Part II: Basics of Computer Vision Camera Model Multi-Camera Model Multi-Camera Calibration

More information

521466S Machine Vision Exercise #1 Camera models

521466S Machine Vision Exercise #1 Camera models 52466S Machine Vision Exercise # Camera models. Pinhole camera. The perspective projection equations or a pinhole camera are x n = x c, = y c, where x n = [x n, ] are the normalized image coordinates,

More information

CS4670: Computer Vision

CS4670: Computer Vision CS467: Computer Vision Noah Snavely Lecture 13: Projection, Part 2 Perspective study of a vase by Paolo Uccello Szeliski 2.1.3-2.1.6 Reading Announcements Project 2a due Friday, 8:59pm Project 2b out Friday

More information

Movie: Geri s Game. Announcements. Ray Casting 2. Programming 2 Recap. Programming 3 Info Test data for part 1 (Lines) is available

Movie: Geri s Game. Announcements. Ray Casting 2. Programming 2 Recap. Programming 3 Info Test data for part 1 (Lines) is available Now Playing: Movie: Geri s Game Pixar, 1997 Academny Award Winner, Best Short Film Quicksand Under Carpet New Radiant Storm King from Leftover Blues: 1991-003 Released 004 Ray Casting Rick Skarbez, Instructor

More information

Camera Models and Image Formation. Srikumar Ramalingam School of Computing University of Utah

Camera Models and Image Formation. Srikumar Ramalingam School of Computing University of Utah Camera Models and Image Formation Srikumar Ramalingam School of Computing University of Utah srikumar@cs.utah.edu Reference Most slides are adapted from the following notes: Some lecture notes on geometric

More information

Assignment 2 : Projection and Homography

Assignment 2 : Projection and Homography TECHNISCHE UNIVERSITÄT DRESDEN EINFÜHRUNGSPRAKTIKUM COMPUTER VISION Assignment 2 : Projection and Homography Hassan Abu Alhaija November 7,204 INTRODUCTION In this exercise session we will get a hands-on

More information

Camera Model and Calibration

Camera Model and Calibration Camera Model and Calibration Lecture-10 Camera Calibration Determine extrinsic and intrinsic parameters of camera Extrinsic 3D location and orientation of camera Intrinsic Focal length The size of the

More information

CSE 252B: Computer Vision II

CSE 252B: Computer Vision II CSE 252B: Computer Vision II Lecturer: Serge Belongie Scribe: Sameer Agarwal LECTURE 1 Image Formation 1.1. The geometry of image formation We begin by considering the process of image formation when a

More information

Image Formation. Antonino Furnari. Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania

Image Formation. Antonino Furnari. Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania Image Formation Antonino Furnari Image Processing Lab Dipartimento di Matematica e Informatica Università degli Studi di Catania furnari@dmi.unict.it 18/03/2014 Outline Introduction; Geometric Primitives

More information

Single-view 3D Reconstruction

Single-view 3D Reconstruction Single-view 3D Reconstruction 10/12/17 Computational Photography Derek Hoiem, University of Illinois Some slides from Alyosha Efros, Steve Seitz Notes about Project 4 (Image-based Lighting) You can work

More information

Chapter 23. Geometrical Optics (lecture 1: mirrors) Dr. Armen Kocharian

Chapter 23. Geometrical Optics (lecture 1: mirrors) Dr. Armen Kocharian Chapter 23 Geometrical Optics (lecture 1: mirrors) Dr. Armen Kocharian Reflection and Refraction at a Plane Surface The light radiate from a point object in all directions The light reflected from a plane

More information

3D Sensing. 3D Shape from X. Perspective Geometry. Camera Model. Camera Calibration. General Stereo Triangulation.

3D Sensing. 3D Shape from X. Perspective Geometry. Camera Model. Camera Calibration. General Stereo Triangulation. 3D Sensing 3D Shape from X Perspective Geometry Camera Model Camera Calibration General Stereo Triangulation 3D Reconstruction 3D Shape from X shading silhouette texture stereo light striping motion mainly

More information

TEAMS National Competition Middle School Version Photometry Solution Manual 25 Questions

TEAMS National Competition Middle School Version Photometry Solution Manual 25 Questions TEAMS National Competition Middle School Version Photometry Solution Manual 25 Questions Page 1 of 14 Photometry Questions 1. When an upright object is placed between the focal point of a lens and a converging

More information

Computer Vision I - Appearance-based Matching and Projective Geometry

Computer Vision I - Appearance-based Matching and Projective Geometry Computer Vision I - Appearance-based Matching and Projective Geometry Carsten Rother 01/11/2016 Computer Vision I: Image Formation Process Roadmap for next four lectures Computer Vision I: Image Formation

More information

Stereo Vision. MAN-522 Computer Vision

Stereo Vision. MAN-522 Computer Vision Stereo Vision MAN-522 Computer Vision What is the goal of stereo vision? The recovery of the 3D structure of a scene using two or more images of the 3D scene, each acquired from a different viewpoint in

More information

TEAMS National Competition High School Version Photometry Solution Manual 25 Questions

TEAMS National Competition High School Version Photometry Solution Manual 25 Questions TEAMS National Competition High School Version Photometry Solution Manual 25 Questions Page 1 of 15 Photometry Questions 1. When an upright object is placed between the focal point of a lens and a converging

More information

Computer Vision Projective Geometry and Calibration. Pinhole cameras

Computer Vision Projective Geometry and Calibration. Pinhole cameras Computer Vision Projective Geometry and Calibration Professor Hager http://www.cs.jhu.edu/~hager Jason Corso http://www.cs.jhu.edu/~jcorso. Pinhole cameras Abstract camera model - box with a small hole

More information

Machine vision. Summary # 11: Stereo vision and epipolar geometry. u l = λx. v l = λy

Machine vision. Summary # 11: Stereo vision and epipolar geometry. u l = λx. v l = λy 1 Machine vision Summary # 11: Stereo vision and epipolar geometry STEREO VISION The goal of stereo vision is to use two cameras to capture 3D scenes. There are two important problems in stereo vision:

More information

CS 664 Slides #9 Multi-Camera Geometry. Prof. Dan Huttenlocher Fall 2003

CS 664 Slides #9 Multi-Camera Geometry. Prof. Dan Huttenlocher Fall 2003 CS 664 Slides #9 Multi-Camera Geometry Prof. Dan Huttenlocher Fall 2003 Pinhole Camera Geometric model of camera projection Image plane I, which rays intersect Camera center C, through which all rays pass

More information

Natural Viewing 3D Display

Natural Viewing 3D Display We will introduce a new category of Collaboration Projects, which will highlight DoCoMo s joint research activities with universities and other companies. DoCoMo carries out R&D to build up mobile communication,

More information

3D Image Sensor based on Opto-Mechanical Filtering

3D Image Sensor based on Opto-Mechanical Filtering 3D Image Sensor based on Opto-Mechanical Filtering Barna Reskó 1,2, Dávid Herbay 3, Péter Korondi 3, Péter Baranyi 2 1 Budapest Tech 2 Computer and Automation Research Institute of the Hungarian Academy

More information

Calibrating an Overhead Video Camera

Calibrating an Overhead Video Camera Calibrating an Overhead Video Camera Raul Rojas Freie Universität Berlin, Takustraße 9, 495 Berlin, Germany http://www.fu-fighters.de Abstract. In this section we discuss how to calibrate an overhead video

More information

Computer Vision I Name : CSE 252A, Fall 2012 Student ID : David Kriegman Assignment #1. (Due date: 10/23/2012) x P. = z

Computer Vision I Name : CSE 252A, Fall 2012 Student ID : David Kriegman   Assignment #1. (Due date: 10/23/2012) x P. = z Computer Vision I Name : CSE 252A, Fall 202 Student ID : David Kriegman E-Mail : Assignment (Due date: 0/23/202). Perspective Projection [2pts] Consider a perspective projection where a point = z y x P

More information

Refraction at a single curved spherical surface

Refraction at a single curved spherical surface Refraction at a single curved spherical surface This is the beginning of a sequence of classes which will introduce simple and complex lens systems We will start with some terminology which will become

More information

(Refer Slide Time: 00:01:26)

(Refer Slide Time: 00:01:26) Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 9 Three Dimensional Graphics Welcome back everybody to the lecture on computer

More information

Robotics - Projective Geometry and Camera model. Marcello Restelli

Robotics - Projective Geometry and Camera model. Marcello Restelli Robotics - Projective Geometr and Camera model Marcello Restelli marcello.restelli@polimi.it Dipartimento di Elettronica, Informazione e Bioingegneria Politecnico di Milano Ma 2013 Inspired from Matteo

More information

Autonomous Vehicle Navigation Using Stereoscopic Imaging

Autonomous Vehicle Navigation Using Stereoscopic Imaging Autonomous Vehicle Navigation Using Stereoscopic Imaging Project Proposal By: Beach Wlaznik Advisors: Dr. Huggins Dr. Stewart December 7, 2006 I. Introduction The objective of the Autonomous Vehicle Navigation

More information

Camera Models and Image Formation. Srikumar Ramalingam School of Computing University of Utah

Camera Models and Image Formation. Srikumar Ramalingam School of Computing University of Utah Camera Models and Image Formation Srikumar Ramalingam School of Computing University of Utah srikumar@cs.utah.edu VisualFunHouse.com 3D Street Art Image courtesy: Julian Beaver (VisualFunHouse.com) 3D

More information

Computer Vision I - Appearance-based Matching and Projective Geometry

Computer Vision I - Appearance-based Matching and Projective Geometry Computer Vision I - Appearance-based Matching and Projective Geometry Carsten Rother 05/11/2015 Computer Vision I: Image Formation Process Roadmap for next four lectures Computer Vision I: Image Formation

More information

3D Vision Real Objects, Real Cameras. Chapter 11 (parts of), 12 (parts of) Computerized Image Analysis MN2 Anders Brun,

3D Vision Real Objects, Real Cameras. Chapter 11 (parts of), 12 (parts of) Computerized Image Analysis MN2 Anders Brun, 3D Vision Real Objects, Real Cameras Chapter 11 (parts of), 12 (parts of) Computerized Image Analysis MN2 Anders Brun, anders@cb.uu.se 3D Vision! Philisophy! Image formation " The pinhole camera " Projective

More information

Practice Exam Sample Solutions

Practice Exam Sample Solutions CS 675 Computer Vision Instructor: Marc Pomplun Practice Exam Sample Solutions Note that in the actual exam, no calculators, no books, and no notes allowed. Question 1: out of points Question 2: out of

More information

3D Rendering and Ray Casting

3D Rendering and Ray Casting 3D Rendering and Ray Casting Michael Kazhdan (601.457/657) HB Ch. 13.7, 14.6 FvDFH 15.5, 15.10 Rendering Generate an image from geometric primitives Rendering Geometric Primitives (3D) Raster Image (2D)

More information

3D Rendering and Ray Casting

3D Rendering and Ray Casting 3D Rendering and Ray Casting Michael Kazhdan (601.457/657) HB Ch. 13.7, 14.6 FvDFH 15.5, 15.10 Rendering Generate an image from geometric primitives Rendering Geometric Primitives (3D) Raster Image (2D)

More information

Computer Vision cmput 428/615

Computer Vision cmput 428/615 Computer Vision cmput 428/615 Basic 2D and 3D geometry and Camera models Martin Jagersand The equation of projection Intuitively: How do we develop a consistent mathematical framework for projection calculations?

More information

COMP 175 COMPUTER GRAPHICS. Ray Casting. COMP 175: Computer Graphics April 26, Erik Anderson 09 Ray Casting

COMP 175 COMPUTER GRAPHICS. Ray Casting. COMP 175: Computer Graphics April 26, Erik Anderson 09 Ray Casting Ray Casting COMP 175: Computer Graphics April 26, 2018 1/41 Admin } Assignment 4 posted } Picking new partners today for rest of the assignments } Demo in the works } Mac demo may require a new dylib I

More information

CIS 580, Machine Perception, Spring 2016 Homework 2 Due: :59AM

CIS 580, Machine Perception, Spring 2016 Homework 2 Due: :59AM CIS 580, Machine Perception, Spring 2016 Homework 2 Due: 2015.02.24. 11:59AM Instructions. Submit your answers in PDF form to Canvas. This is an individual assignment. 1 Recover camera orientation By observing

More information

(Refer Slide Time: 00:04:20)

(Refer Slide Time: 00:04:20) Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture 8 Three Dimensional Graphics Welcome back all of you to the lectures in Computer

More information

CV: 3D to 2D mathematics. Perspective transformation; camera calibration; stereo computation; and more

CV: 3D to 2D mathematics. Perspective transformation; camera calibration; stereo computation; and more CV: 3D to 2D mathematics Perspective transformation; camera calibration; stereo computation; and more Roadmap of topics n Review perspective transformation n Camera calibration n Stereo methods n Structured

More information

Chapters 1 7: Overview

Chapters 1 7: Overview Chapters 1 7: Overview Chapter 1: Introduction Chapters 2 4: Data acquisition Chapters 5 7: Data manipulation Chapter 5: Vertical imagery Chapter 6: Image coordinate measurements and refinements Chapter

More information

Depth Measurement and 3-D Reconstruction of Multilayered Surfaces by Binocular Stereo Vision with Parallel Axis Symmetry Using Fuzzy

Depth Measurement and 3-D Reconstruction of Multilayered Surfaces by Binocular Stereo Vision with Parallel Axis Symmetry Using Fuzzy Depth Measurement and 3-D Reconstruction of Multilayered Surfaces by Binocular Stereo Vision with Parallel Axis Symmetry Using Fuzzy Sharjeel Anwar, Dr. Shoaib, Taosif Iqbal, Mohammad Saqib Mansoor, Zubair

More information

Introduction to 3D Machine Vision

Introduction to 3D Machine Vision Introduction to 3D Machine Vision 1 Many methods for 3D machine vision Use Triangulation (Geometry) to Determine the Depth of an Object By Different Methods: Single Line Laser Scan Stereo Triangulation

More information

Homogeneous Coordinates. Lecture18: Camera Models. Representation of Line and Point in 2D. Cross Product. Overall scaling is NOT important.

Homogeneous Coordinates. Lecture18: Camera Models. Representation of Line and Point in 2D. Cross Product. Overall scaling is NOT important. Homogeneous Coordinates Overall scaling is NOT important. CSED44:Introduction to Computer Vision (207F) Lecture8: Camera Models Bohyung Han CSE, POSTECH bhhan@postech.ac.kr (",, ) ()", ), )) ) 0 It is

More information

A Review of Image- based Rendering Techniques Nisha 1, Vijaya Goel 2 1 Department of computer science, University of Delhi, Delhi, India

A Review of Image- based Rendering Techniques Nisha 1, Vijaya Goel 2 1 Department of computer science, University of Delhi, Delhi, India A Review of Image- based Rendering Techniques Nisha 1, Vijaya Goel 2 1 Department of computer science, University of Delhi, Delhi, India Keshav Mahavidyalaya, University of Delhi, Delhi, India Abstract

More information

DEPTH PERCEPTION. Learning Objectives: 7/31/2018. Intro & Overview of DEPTH PERCEPTION** Speaker: Michael Patrick Coleman, COT, ABOC, & former CPOT

DEPTH PERCEPTION. Learning Objectives: 7/31/2018. Intro & Overview of DEPTH PERCEPTION** Speaker: Michael Patrick Coleman, COT, ABOC, & former CPOT DEPTH PERCEPTION Speaker: Michael Patrick Coleman, COT, ABOC, & former CPOT Learning Objectives: Attendees will be able to 1. Explain what the primary cue to depth perception is (vs. monocular cues) 2.

More information

CS 4204 Computer Graphics

CS 4204 Computer Graphics CS 4204 Computer Graphics 3D Viewing and Projection Yong Cao Virginia Tech Objective We will develop methods to camera through scenes. We will develop mathematical tools to handle perspective projection.

More information

3D Geometry and Camera Calibration

3D Geometry and Camera Calibration 3D Geometry and Camera Calibration 3D Coordinate Systems Right-handed vs. left-handed x x y z z y 2D Coordinate Systems 3D Geometry Basics y axis up vs. y axis down Origin at center vs. corner Will often

More information

ECE Digital Image Processing and Introduction to Computer Vision. Outline

ECE Digital Image Processing and Introduction to Computer Vision. Outline ECE592-064 Digital Image Processing and Introduction to Computer Vision Depart. of ECE, NC State University Instructor: Tianfu (Matt) Wu Spring 2017 1. Recap Outline 2. Modeling Projection and Projection

More information

Topics and things to know about them:

Topics and things to know about them: Practice Final CMSC 427 Distributed Tuesday, December 11, 2007 Review Session, Monday, December 17, 5:00pm, 4424 AV Williams Final: 10:30 AM Wednesday, December 19, 2007 General Guidelines: The final will

More information

Computer Vision Lecture 17

Computer Vision Lecture 17 Computer Vision Lecture 17 Epipolar Geometry & Stereo Basics 13.01.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Seminar in the summer semester

More information

Structure from motion

Structure from motion Structure from motion Structure from motion Given a set of corresponding points in two or more images, compute the camera parameters and the 3D point coordinates?? R 1,t 1 R 2,t R 2 3,t 3 Camera 1 Camera

More information

A Calibration Algorithm for POX-Slits Camera

A Calibration Algorithm for POX-Slits Camera A Calibration Algorithm for POX-Slits Camera N. Martins 1 and H. Araújo 2 1 DEIS, ISEC, Polytechnic Institute of Coimbra, Portugal 2 ISR/DEEC, University of Coimbra, Portugal Abstract Recent developments

More information

Skybox. Ruoqi He & Chia-Man Hung. February 26, 2016

Skybox. Ruoqi He & Chia-Man Hung. February 26, 2016 Skybox Ruoqi He & Chia-Man Hung February 26, 206 Introduction In this project, we present a method to construct a skybox from a series of photos we took ourselves. It is a graphical procedure of creating

More information

DD2423 Image Analysis and Computer Vision IMAGE FORMATION. Computational Vision and Active Perception School of Computer Science and Communication

DD2423 Image Analysis and Computer Vision IMAGE FORMATION. Computational Vision and Active Perception School of Computer Science and Communication DD2423 Image Analysis and Computer Vision IMAGE FORMATION Mårten Björkman Computational Vision and Active Perception School of Computer Science and Communication November 8, 2013 1 Image formation Goal:

More information

Computer Vision Lecture 17

Computer Vision Lecture 17 Announcements Computer Vision Lecture 17 Epipolar Geometry & Stereo Basics Seminar in the summer semester Current Topics in Computer Vision and Machine Learning Block seminar, presentations in 1 st week

More information

Projective Geometry and Camera Models

Projective Geometry and Camera Models /2/ Projective Geometry and Camera Models Computer Vision CS 543 / ECE 549 University of Illinois Derek Hoiem Note about HW Out before next Tues Prob: covered today, Tues Prob2: covered next Thurs Prob3:

More information

Computer Vision, Laboratory session 1

Computer Vision, Laboratory session 1 Centre for Mathematical Sciences, january 2007 Computer Vision, Laboratory session 1 Overview In this laboratory session you are going to use matlab to look at images, study projective geometry representations

More information

Laser sensors. Transmitter. Receiver. Basilio Bona ROBOTICA 03CFIOR

Laser sensors. Transmitter. Receiver. Basilio Bona ROBOTICA 03CFIOR Mobile & Service Robotics Sensors for Robotics 3 Laser sensors Rays are transmitted and received coaxially The target is illuminated by collimated rays The receiver measures the time of flight (back and

More information

Image Transformations & Camera Calibration. Mašinska vizija, 2018.

Image Transformations & Camera Calibration. Mašinska vizija, 2018. Image Transformations & Camera Calibration Mašinska vizija, 2018. Image transformations What ve we learnt so far? Example 1 resize and rotate Open warp_affine_template.cpp Perform simple resize

More information

Homework #1. Displays, Image Processing, Affine Transformations, Hierarchical Modeling

Homework #1. Displays, Image Processing, Affine Transformations, Hierarchical Modeling Computer Graphics Instructor: Brian Curless CSE 457 Spring 215 Homework #1 Displays, Image Processing, Affine Transformations, Hierarchical Modeling Assigned: Thursday, April 9 th Due: Thursday, April

More information

An Algorithm for Seamless Image Stitching and Its Application

An Algorithm for Seamless Image Stitching and Its Application An Algorithm for Seamless Image Stitching and Its Application Jing Xing, Zhenjiang Miao, and Jing Chen Institute of Information Science, Beijing JiaoTong University, Beijing 100044, P.R. China Abstract.

More information

MERGING POINT CLOUDS FROM MULTIPLE KINECTS. Nishant Rai 13th July, 2016 CARIS Lab University of British Columbia

MERGING POINT CLOUDS FROM MULTIPLE KINECTS. Nishant Rai 13th July, 2016 CARIS Lab University of British Columbia MERGING POINT CLOUDS FROM MULTIPLE KINECTS Nishant Rai 13th July, 2016 CARIS Lab University of British Columbia Introduction What do we want to do? : Use information (point clouds) from multiple (2+) Kinects

More information

Project 4 Results. Representation. Data. Learning. Zachary, Hung-I, Paul, Emanuel. SIFT and HoG are popular and successful.

Project 4 Results. Representation. Data. Learning. Zachary, Hung-I, Paul, Emanuel. SIFT and HoG are popular and successful. Project 4 Results Representation SIFT and HoG are popular and successful. Data Hugely varying results from hard mining. Learning Non-linear classifier usually better. Zachary, Hung-I, Paul, Emanuel Project

More information

Short on camera geometry and camera calibration

Short on camera geometry and camera calibration Short on camera geometry and camera calibration Maria Magnusson, maria.magnusson@liu.se Computer Vision Laboratory, Department of Electrical Engineering, Linköping University, Sweden Report No: LiTH-ISY-R-3070

More information

Perspective Projection [2 pts]

Perspective Projection [2 pts] Instructions: CSE252a Computer Vision Assignment 1 Instructor: Ben Ochoa Due: Thursday, October 23, 11:59 PM Submit your assignment electronically by email to iskwak+252a@cs.ucsd.edu with the subject line

More information

Camera Model and Calibration. Lecture-12

Camera Model and Calibration. Lecture-12 Camera Model and Calibration Lecture-12 Camera Calibration Determine extrinsic and intrinsic parameters of camera Extrinsic 3D location and orientation of camera Intrinsic Focal length The size of the

More information

Computer Vision, Laboratory session 1

Computer Vision, Laboratory session 1 Centre for Mathematical Sciences, january 200 Computer Vision, Laboratory session Overview In this laboratory session you are going to use matlab to look at images, study the representations of points,

More information

Agenda. Rotations. Camera calibration. Homography. Ransac

Agenda. Rotations. Camera calibration. Homography. Ransac Agenda Rotations Camera calibration Homography Ransac Geometric Transformations y x Transformation Matrix # DoF Preserves Icon translation rigid (Euclidean) similarity affine projective h I t h R t h sr

More information

Cameras and Stereo CSE 455. Linda Shapiro

Cameras and Stereo CSE 455. Linda Shapiro Cameras and Stereo CSE 455 Linda Shapiro 1 Müller-Lyer Illusion http://www.michaelbach.de/ot/sze_muelue/index.html What do you know about perspective projection? Vertical lines? Other lines? 2 Image formation

More information